30 Jan, 2010

1 commit

  • After memory pressure has forced it to dip into the reserves, 2.6.32's
    5f8dcc21211a3d4e3a7a5ca366b469fb88117f61 "page-allocator: split per-cpu
    list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
    pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.

    Fix that in the most straightforward way (which, considering the overheads
    of alternative approaches, is Mel's preference): the right migratetype is
    already in page_private(page), but free_pcppages_bulk() wasn't using it.

    How did this bug show up? As a 20% slowdown in my tmpfs loop kbuild
    swapping tests, on PowerMac G5 with SLUB allocator. Bisecting to that
    commit was easy, but explaining the magnitude of the slowdown not easy.

    The same effect appears, but much less markedly, with SLAB, and even
    less markedly on other machines (the PowerMac divides into fewer zones
    than x86, I think that may be a factor). We guess that lumpy reclaim
    of short-lived high-order pages is implicated in some way, and probably
    this bug has been tickling a poor decision somewhere in page reclaim.

    But instrumentation hasn't told me much, I've run out of time and
    imagination to determine exactly what's going on, and shouldn't hold up
    the fix any longer: it's valid, and might even fix other misbehaviours.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
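
    A minimal sketch of the idea behind the fix (not the exact upstream diff;
    identifiers follow the 2.6.32-era page allocator): when draining the
    per-cpu lists, free each page to the buddy list of the migratetype saved
    in page_private(page), rather than the migratetype of whichever pcp list
    it was drained from.

        /* Inside free_pcppages_bulk(), roughly: */
        do {
                page = list_entry(list->prev, struct page, lru);
                /* must delete as __free_one_page() list-manipulates */
                list_del(&page->lru);
                /*
                 * page_private() holds the migratetype recorded when the
                 * page entered the pcp list; using it here keeps
                 * MIGRATE_RESERVE pages off the MIGRATE_MOVABLE free list.
                 */
                __free_one_page(page, zone, 0, page_private(page));
        } while (--count && !list_empty(list));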
     

17 Jan, 2010

2 commits

  • commit f2260e6b (page allocator: update NR_FREE_PAGES only as necessary)
    introduced a minor regression: if __rmqueue() failed, the NR_FREE_PAGES
    statistic went wrong. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reported-by: Huang Shijie
    Reviewed-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
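
    A rough sketch of the fix (assumed shape, based on the description
    above): only account the NR_FREE_PAGES decrement once __rmqueue() has
    actually returned a page.

        /* In buffered_rmqueue(), for the non-pcp path, roughly: */
        spin_lock_irqsave(&zone->lock, flags);
        page = __rmqueue(zone, order, migratetype);
        spin_unlock(&zone->lock);       /* irqs restored later */
        if (!page)
                goto failed;
        /* Only now is it correct to count the pages as no longer free. */
        __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));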
     
  • The current check for 'backward merging' within add_active_range() does
    not seem correct. start_pfn must be compared against
    early_node_map[i].start_pfn (and NOT against .end_pfn) to find out whether
    the new region is backward-mergeable with the existing range.

    Signed-off-by: Kazuhisa Ichikawa
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kazuhisa Ichikawa
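
    A sketch of the corrected check (variable names assumed from the
    description above): a new region [start_pfn, end_pfn) is backward-
    mergeable only when it starts before the existing entry's start_pfn and
    reaches it.

        /* In add_active_range(), roughly -- merge backward if suitable: */
        if (start_pfn < early_node_map[i].start_pfn &&
            end_pfn >= early_node_map[i].start_pfn) {
                early_node_map[i].start_pfn = start_pfn;
                return;
        }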
     

24 Dec, 2009

1 commit


23 Dec, 2009

1 commit

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (36 commits)
    powerpc/gc/wii: Remove get_irq_desc()
    powerpc/gc/wii: hlwd-pic: convert irq_desc.lock to raw_spinlock
    powerpc/gamecube/wii: Fix off-by-one error in ugecon/usbgecko_udbg
    powerpc/mpic: Fix problem that affinity is not updated
    powerpc/mm: Fix stupid bug in subpge protection handling
    powerpc/iseries: use DECLARE_COMPLETION_ONSTACK for non-constant completion
    powerpc: Fix MSI support on U4 bridge PCIe slot
    powerpc: Handle VSX alignment faults correctly in little-endian mode
    powerpc/mm: Fix typo of cpumask_clear_cpu()
    powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled.
    powerpc: Convert BUG() to use unreachable()
    powerpc/pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible.
    powerpc/pseries: Don't panic when H_PROD fails during cpu-online.
    powerpc/mm: Fix a WARN_ON() with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_VM
    powerpc/defconfigs: Set HZ=100 on pseries and ppc64 defconfigs
    powerpc/defconfigs: Disable token ring in powerpc defconfigs
    powerpc/defconfigs: Reduce 64bit vmlinux by making acenic and cramfs modules
    powerpc/pseries: Select XICS and PCI_MSI PSERIES
    powerpc/85xx: Wrong variable returned on error
    powerpc/iseries: Convert to proc_fops
    ...

    Linus Torvalds
     

20 Dec, 2009

1 commit

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system
    Makefile: Unexport LC_ALL instead of clearing it
    x86: Fix objdump version check in arch/x86/tools/chkobjdump.awk
    x86: Reenable TSC sync check at boot, even with NONSTOP_TSC
    x86: Don't use POSIX character classes in gen-insn-attr-x86.awk
    Makefile: set LC_CTYPE, LC_COLLATE, LC_NUMERIC to C
    x86: Increase MAX_EARLY_RES; insufficient on 32-bit NUMA
    x86: Fix checking of SRAT when node 0 ram is not from 0
    x86, cpuid: Add "volatile" to asm in native_cpuid()
    x86, msr: msrs_alloc/free for CONFIG_SMP=n
    x86, amd: Get multi-node CPU info from NodeId MSR instead of PCI config space
    x86: Add IA32_TSC_AUX MSR and use it
    x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers
    initramfs: add missing decompressor error check
    bzip2: Add missing checks for malloc returning NULL
    bzip2/lzma/gzip: pre-boot malloc doesn't return NULL on failure

    Linus Torvalds
     

18 Dec, 2009

1 commit

  • Memory balloon drivers can allocate a large amount of memory which is not
    movable but could be freed to accommodate memory hotplug remove.

    Prior to calling the memory hotplug notifier chain the memory in the
    pageblock is isolated. Currently, if the migrate type is not
    MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
    for that page range to fail.

    Rather than failing pageblock isolation if the migratetype is not
    MIGRATE_MOVABLE, this patch uses a notifier chain to check whether all of
    the pages in the pageblock that are not on the LRU are owned by a
    registered balloon driver (or other entity). If all of the non-movable
    pages are owned by a balloon, they can be freed later through the memory
    notifier chain and the range can still be isolated in
    set_migratetype_isolate().

    Signed-off-by: Robert Jennings
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: Brian King
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin Herrenschmidt

    Robert Jennings
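
    A hypothetical sketch of how a balloon driver might participate; the
    notifier event, structure and helper names here are illustrative
    assumptions, not necessarily the exact interface added by this patch.
    The isolation path asks registered drivers how many of the non-LRU pages
    in the range they own and could free.

        /* Illustrative only -- names are assumptions for this sketch. */
        static int balloon_isolate_notify(struct notifier_block *nb,
                                          unsigned long action, void *arg)
        {
                struct memory_isolate_notify *mem = arg;

                if (action == MEM_ISOLATE_COUNT)
                        /* Report pages in [start_pfn, start_pfn + nr_pages)
                         * that belong to the balloon and can be freed. */
                        mem->pages_found +=
                                balloon_count_pages(mem->start_pfn,
                                                    mem->nr_pages);
                return NOTIFY_OK;
        }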
     

17 Dec, 2009

2 commits

  • Found one system that boots from socket1 instead of socket0, and its SRAT
    gets rejected:

    [ 0.000000] SRAT: Node 1 PXM 0 0-a0000
    [ 0.000000] SRAT: Node 1 PXM 0 100000-80000000
    [ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
    [ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
    [ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
    [ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
    [ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
    [ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
    [ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
    [ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
    ...
    [ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040
    [ 0.000000] NUMA: Using 20 for the hash shift.
    [ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
    [ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
    [ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
    [ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
    [ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
    [ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
    [ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
    [ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
    [ 0.000000] SRAT: SRAT not used.

    The early_node_map is not sorted, because node0 with a non-zero start
    comes first.

    So sort it right away after all regions are registered.

    This also fixes a regression introduced by 8716273c (x86: Export srat
    physical topology).

    -v2: make it more robust, handling cross-node cases like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
    -v3: update comments.

    Reported-and-tested-by: Jens Axboe
    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
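
    A sketch of the sorting step (roughly the shape of the kernel's
    sort_node_map(); treat details as illustrative): order early_node_map by
    start_pfn using lib/sort.

        /* Compare two active regions by start_pfn. */
        static int __init cmp_node_active_region(const void *a, const void *b)
        {
                const struct node_active_region *ra = a, *rb = b;

                if (ra->start_pfn > rb->start_pfn)
                        return 1;
                if (ra->start_pfn < rb->start_pfn)
                        return -1;
                return 0;
        }

        /* Called once all regions have been registered. */
        sort(early_node_map, nr_nodemap_entries,
             sizeof(struct node_active_region),
             cmp_node_active_region, NULL);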
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
    HWPOISON: Remove stray phrase in a comment
    HWPOISON: Try to allocate migration page on the same node
    HWPOISON: Don't do early filtering if filter is disabled
    HWPOISON: Add a madvise() injector for soft page offlining
    HWPOISON: Add soft page offline support
    HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
    HWPOISON: Use new shake_page in memory_failure
    HWPOISON: Use correct name for MADV_HWPOISON in documentation
    HWPOISON: mention HWPoison in Kconfig entry
    HWPOISON: Use get_user_page_fast in hwpoison madvise
    HWPOISON: add an interface to switch off/on all the page filters
    HWPOISON: add memory cgroup filter
    memcg: add accessor to mem_cgroup.css
    memcg: rename and export try_get_mem_cgroup_from_page()
    HWPOISON: add page flags filter
    mm: export stable page flags
    HWPOISON: limit hwpoison injector to known page types
    HWPOISON: add fs/device filters
    HWPOISON: return 0 to indicate success reliably
    HWPOISON: make semantics of IGNORED/DELAYED clear
    ...

    Linus Torvalds
     

16 Dec, 2009

3 commits

  • Fix node-oriented allocation handling in oom-kill.c. I myself think of
    this as a bugfix, not as an enhancement.

    These days, things have changed:
    - alloc_pages() takes a nodemask as its argument, via
    __alloc_pages_nodemask().
    - mempolicy doesn't maintain its own private zonelists.
    (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)

    So the current oom-killer's check function is wrong.

    This patch
    - checks the nodemask: if a nodemask is passed and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
    - scans all zonelists under the nodemask; if it hits cpuset's wall,
    this failure is from the cpuset.
    And
    - modifies the caller of out_of_memory() not to call oom if __GFP_THISNODE.
    This doesn't change "current" behavior. If callers use __GFP_THISNODE
    they should handle "page allocation failure" by themselves.

    - handles the __GFP_NOFAIL+__GFP_THISNODE path.
    This is something like a FIXME but this gfpmask is not used now.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
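
    A sketch of the resulting constraint check (approximate shape of the
    updated constrained_alloc(), based on the description above):

        /* __GFP_THISNODE: the caller handles the failure itself. */
        if (gfp_mask & __GFP_THISNODE)
                return CONSTRAINT_NONE;

        /* mempolicy is the only user of the nodemask here. */
        if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
                return CONSTRAINT_MEMORY_POLICY;

        /* Otherwise, see whether the failure comes from a cpuset wall. */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        high_zoneidx, nodemask)
                if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
                        return CONSTRAINT_CPUSET;

        return CONSTRAINT_NONE;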
     
  • Most free pages in the buddy system have no PG_buddy set.
    Introduce is_free_buddy_page() for detecting them reliably.

    CC: Nick Piggin
    CC: Mel Gorman
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
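
    A sketch of such a check (approximate): only the head page of a free
    buddy block has PG_buddy set, so walk up the orders and see whether any
    enclosing block's head page is a free buddy of at least that order.

        /* Approximate shape of is_free_buddy_page(). */
        struct zone *zone = page_zone(page);
        unsigned long pfn = page_to_pfn(page);
        unsigned long flags;
        int order;

        spin_lock_irqsave(&zone->lock, flags);
        for (order = 0; order < MAX_ORDER; order++) {
                struct page *head = page - (pfn & ((1 << order) - 1));

                if (PageBuddy(head) && page_order(head) >= order)
                        break;
        }
        spin_unlock_irqrestore(&zone->lock, flags);

        return order < MAX_ORDER;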
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Nov, 2009

2 commits

  • Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 ("page allocator:
    calculate the alloc_flags for allocation only once") altered watermark
    logic slightly by allowing rt_tasks that are handling an interrupt to set
    ALLOC_HARDER. This patch brings the watermark logic more in line with
    2.6.30.

    This change results in a reduction in the number of high-order GFP_ATOMIC
    allocation failures reported. See
    http://www.gossamer-threads.com/lists/linux/kernel/1144153

    [rientjes@google.com: Spotted the problem]
    Signed-off-by: Mel Gorman
    Reviewed-by: Pekka Enberg
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
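
    A sketch of the tightened check (assumed shape): an rt_task only gets
    ALLOC_HARDER when it is not handling an interrupt, as in 2.6.30.

        /* When computing alloc_flags; p is the allocating task (current). */
        if (unlikely(rt_task(p)) && !in_interrupt())
                alloc_flags |= ALLOC_HARDER;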
     
  • If a direct reclaim makes no forward progress, it considers whether it
    should go OOM or not. Whether OOM is triggered or not, it may retry the
    allocation afterwards. In times past, this would always wake kswapd as
    well but currently, kswapd is not woken up after direct reclaim fails.
    For order-0 allocations, this makes little difference but if there is a
    heavy mix of higher-order allocations that direct reclaim is failing for,
    it might mean that kswapd is not rewoken for higher orders as much as it
    did previously.

    This patch wakes up kswapd when an allocation is being retried after a
    direct reclaim failure. It would be expected that kswapd is already
    awake, but this has the effect of telling kswapd to reclaim at the higher
    order as well.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
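
    A rough sketch of the idea (not the exact diff; structure simplified):
    make the retry path in __alloc_pages_slowpath() pass through the kswapd
    wakeup again, so a retried request re-tells kswapd the order it needs.

        restart:
                wake_all_kswapd(order, zonelist, high_zoneidx);

                /* ... try the freelists, direct reclaim, maybe OOM ... */

                if (should_alloc_retry(gfp_mask, order, pages_reclaimed))
                        goto restart;   /* the retry re-wakes kswapd */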
     

29 Oct, 2009

1 commit

  • Revert

    commit 71de1ccbe1fb40203edd3beb473f8580d917d2ca
    Author: KOSAKI Motohiro
    AuthorDate: Mon Sep 21 17:01:31 2009 -0700
    Commit: Linus Torvalds
    CommitDate: Tue Sep 22 07:17:27 2009 -0700

    mm: oom analysis: add buffer cache information to show_free_areas()

    show_free_areas() is called during page allocation failures, and page
    allocation failures can occur in any calling context.

    But nr_blockdev_pages() takes VFS locks which should not be taken from
    hard IRQ context (at least). The result is lockdep warnings (and
    deadlockability) during page allocation failures.

    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

24 Sep, 2009

2 commits

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

23 commits

  • Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
    __read_mostly in memory.c: to help them share a cacheline, since they're
    very often tested together in vm_normal_page().

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When round-robin freeing pages from the PCP lists, empty lists may be
    encountered. In the event one of the lists has more pages than another,
    there may be numerous checks for list_empty() which is undesirable. This
    patch maintains a count of pages to free which is incremented when empty
    lists are encountered. The intention is that more pages will then be
    freed from fuller lists than the empty ones reducing the number of empty
    list checks in the free path.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Reviewed-by: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
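
    A sketch of the batching idea (approximate): count how many empty lists
    were skipped while rotating through the migratetypes, and free that many
    extra pages from the next non-empty list.

        /* In free_pcppages_bulk(), roughly; migratetype rotates across
         * iterations of the outer loop. */
        while (count) {
                struct list_head *list;
                int batch_free = 0;

                /* Every empty list skipped bumps batch_free, so a fuller
                 * list gives up correspondingly more pages. */
                do {
                        batch_free++;
                        if (++migratetype == MIGRATE_PCPTYPES)
                                migratetype = 0;
                        list = &pcp->lists[migratetype];
                } while (list_empty(list));

                do {
                        page = list_entry(list->prev, struct page, lru);
                        list_del(&page->lru);
                        __free_one_page(page, zone, 0, migratetype);
                } while (--count && --batch_free && !list_empty(list));
        }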
     
  • The following two patches remove searching in the page allocator fast-path
    by maintaining multiple free-lists in the per-cpu structure. At the time
    the search was introduced, increasing the per-cpu structures would waste a
    lot of memory as per-cpu structures were statically allocated at
    compile-time. This is no longer the case.

    The patches are as follows. They are based on mmotm-2009-08-27.

    Patch 1 adds multiple lists to struct per_cpu_pages, one per
    migratetype that can be stored on the PCP lists.

    Patch 2 notes that the pcpu drain path checks empty lists multiple times.
    The patch reduces the number of checks by maintaining a count of free
    lists encountered. Lists containing pages will then free multiple
    pages in batch.

    The patches were tested with kernbench, netperf udp/tcp, hackbench and
    sysbench. The netperf tests were not bound to any CPU in particular and
    were run such that there is 99% confidence that the reported results are
    within 1% of the estimated mean. sysbench was run with a postgres
    background and read-only tests. Similar to netperf, it was run multiple
    times so that its 99% confidence results are within 1%. The patches were
    tested on x86, x86-64 and ppc64 as follows:

    x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.34% to 2.28% gain
    netperf-tcp - 0.45% to 1.22% gain
    hackbench - Small variances, very close to noise
    sysbench - Very small gains

    x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.83% to 10.42% gains
    netperf-tcp - Not conclusive until buffer >= PAGE_SIZE
    4096 +15.83%
    8192 + 0.34% (not significant)
    16384 + 1%
    hackbench - Small gains, very close to noise
    sysbench - 0.79% to 1.6% gain

    ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 2-3% gain for almost all buffer sizes tested
    netperf-tcp - losses on small buffers, gains on larger buffers
    possibly indicates some bad caching effect.
    hackbench - No significant difference
    sysbench - 2-4% gain

    This patch:

    Currently the per-cpu page allocator searches the PCP list for pages of
    the correct migrate-type to reduce the possibility of pages being
    inappropriately placed from a fragmentation perspective. This search is
    potentially expensive in a fast-path and undesirable. Splitting the
    per-cpu list into multiple lists increases the size of a per-cpu structure
    and this was potentially a major problem at the time the search was
    introduced. This problem has been mitigated as now only the necessary
    number of structures is allocated for the running system.

    This patch replaces a list search in the per-cpu allocator with one list
    per migrate type. The potential snag with this approach is when bulk
    freeing pages. We round-robin free pages based on migrate type which has
    little bearing on the cache hotness of the page and potentially checks
    empty lists repeatedly in the event the majority of PCP pages are of one
    type.

    Signed-off-by: Mel Gorman
    Acked-by: Nick Piggin
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
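
    A sketch of the resulting per-cpu structure (close to the 2.6.32 layout;
    treat it as illustrative): one free list per migratetype that can live on
    the pcp lists.

        struct per_cpu_pages {
                int count;      /* number of pages in the lists */
                int high;       /* high watermark, emptying needed */
                int batch;      /* chunk size for buddy add/remove */

                /* Lists of pages, one per migrate type stored on the list */
                struct list_head lists[MIGRATE_PCPTYPES];
        };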
     
  • For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
    which case shrink_list() _still_ calls isolate_pages() with the much
    larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
    rate by up to 32 times.

    For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
    So when shrink_zone() expects to scan 4 pages in the active/inactive list,
    4 pages of the active list will be scanned, while the inactive list will
    in effect be (over) scanned by SWAP_CLUSTER_MAX=32 pages. And that could
    break the balance between the two lists.

    It can further impact the scan of anon active list, due to the anon
    active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():

    inactive anon list over scanned => inactive_anon_is_low() == TRUE
    => shrink_active_list()
    => active anon list over scanned

    So the end result may be

    - anon inactive => over scanned
    - anon active => over scanned (maybe not as much)
    - file inactive => over scanned
    - file active => under scanned (relatively)

    The accesses to nr_saved_scan are not lock protected and so not 100%
    accurate, however we can tolerate small errors and the resulting small
    imbalance in scan rates between zones.

    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
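
    A sketch of the batching helper this relies on (approximate shape of
    vmscan's nr_scan_try_batch()): small per-priority scan counts accumulate
    in nr_saved_scan until they reach SWAP_CLUSTER_MAX, instead of being
    rounded up to 32 pages on every call.

        static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
                                               unsigned long *nr_saved_scan)
        {
                unsigned long nr;

                *nr_saved_scan += nr_to_scan;
                nr = *nr_saved_scan;

                if (nr >= SWAP_CLUSTER_MAX)
                        *nr_saved_scan = 0;     /* scan the batch now */
                else
                        nr = 0;                 /* keep accumulating */

                return nr;
        }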
     
  • This is being done by allowing boot time allocations to specify that they
    may want a sub-page sized amount of memory.

    Overall this seems more consistent with the other hash table allocations,
    and allows making two supposedly mm-only variables really mm-only
    (nr_{kernel,all}_pages).

    Signed-off-by: Jan Beulich
    Cc: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • After anti-fragmentation was merged, a bug was reported whereby devices
    that depended on high-order atomic allocations were failing. The solution
    was to preserve a property in the buddy allocator which tended to keep the
    minimum number of free pages in the zone at the lower physical addresses
    and contiguous. To preserve this property, MIGRATE_RESERVE was introduced
    and a number of pageblocks at the start of a zone would be marked
    "reserve", the number of which depended on min_free_kbytes.

    Anti-fragmentation works by avoiding the mixing of page migratetypes
    within the same pageblock. One way of helping this is to increase
    min_free_kbytes, because it then becomes less likely that pages of a
    different migratetype will have to be placed in the same pageblock.
    However, as the number of MIGRATE_RESERVE blocks is unbounded, the free
    memory is kept there in large contiguous blocks instead of helping
    anti-fragmentation as much as it should. With the page-allocator
    tracepoint patches applied, it was found during anti-fragmentation tests
    that the number of fragmentation-related events was far higher than
    expected even with min_free_kbytes at higher values.

    This patch limits the number of MIGRATE_RESERVE blocks that exist per zone
    to two. For example, with a sufficient min_free_kbytes, 4MB of memory
    will be kept aside on an x86-64 and remain more or less free and
    contiguous for the system's uptime. This should be sufficient for devices
    depending on high-order atomic allocations while helping fragmentation
    control when min_free_kbytes is tuned appropriately. A side-effect of
    this patch is that the reserve variable is converted to int, as unsigned
    long was the wrong type to use when ensuring that only the required number
    of reserve blocks are created.

    With the patches applied, fragmentation-related events as measured by the
    page allocator tracepoints were significantly reduced when running some
    fragmentation stress-tests on systems with min_free_kbytes tuned to a
    value appropriate for hugepage allocations at runtime. On x86, the events
    recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by
    99.83%.

    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
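
    A sketch of the limiting step (approximate): cap the number of reserve
    pageblocks computed from the watermark at two per zone.

        /* In setup_zone_migrate_reserve(), roughly: */
        reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
                                                        pageblock_order;
        /* Only two pageblocks per zone are ever kept as MIGRATE_RESERVE. */
        reserve = min(2, reserve);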
     
  • The page allocation trace event reports that a page was successfully
    allocated but it does not specify where it came from. When analysing
    performance, it can be important to distinguish between pages coming from
    the per-cpu allocator and pages coming from the buddy lists, as the latter
    requires the zone lock to be taken and more data structures to be
    examined.

    This patch adds a trace event for __rmqueue reporting when a page is being
    allocated from the buddy lists. It distinguishes between being called to
    refill the per-cpu lists or whether it is a high-order allocation.
    Similarly, this patch adds an event to catch when the PCP lists are being
    drained a little and pages are going back to the buddy lists.

    This is trickier to draw conclusions from but high activity on those
    events could explain why there were a large number of cache misses on a
    page-allocator-intensive workload. The coalescing and splitting of
    buddies involves a lot of writing of page metadata and cache line bounces
    not to mention the acquisition of an interrupt-safe lock necessary to
    enter this path.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Fragmentation avoidance depends on being able to use free pages from lists
    of the appropriate migrate type. In the event this is not possible,
    __rmqueue_fallback() selects a different list and, in some circumstances,
    changes the migratetype of the pageblock. Simplistically, the more times
    this event occurs, the more likely it is that fragmentation will be a
    problem later, for hugepage allocation at least, but there are other
    considerations such as the order of the page being split to satisfy the
    allocation.

    This patch adds a trace event for __rmqueue_fallback() that reports what
    page is being used for the fallback, the orders of relevant pages, the
    desired migratetype and the migratetype of the lists being used, whether
    the pageblock changed type and whether this event is important with
    respect to fragmentation avoidance or not. This information can be used
    to help analyse fragmentation avoidance and help decide whether
    min_free_kbytes should be increased or not.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds trace events for the allocation and freeing of pages,
    including the freeing of pagevecs. Using the events, it will be known
    what struct page and pfns are being allocated and freed and what the call
    site was in many cases.

    The page alloc tracepoints can be used as an indicator of whether the
    workload was heavily dependent on the page allocator or not. You can make
    a guess based on vmstat but you can't get a per-process breakdown.
    Depending on the call path, the call_site for page allocation may be
    __get_free_pages() instead of a useful callsite. Instead of passing down
    a return address similar to slab debugging, the user should enable the
    stacktrace and seg-addr options to get a proper stack trace.

    The pagevec free tracepoint has a different usecase. It can be used to
    get an idea of how many pages are being dumped off the LRU and whether it
    is kswapd doing the work or a process doing direct reclaim.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The function free_cold_page() has no callers so delete it.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …uring __rmqueue_fallback

    When there are no pages of a target migratetype free, the page allocator
    selects a high-order block of another migratetype to allocate from. When
    the order of the page taken is greater than pageblock_order, all
    pageblocks within that high-order page should change migratetype so that
    pages are later freed to the correct free-lists.

    The current behaviour is that pageblocks change migratetype if the order
    being split matches the pageblock_order. When pageblock_order <
    MAX_ORDER-1, ownership is not changed correctly, pages are later
    freed to the incorrect list and this impacts fragmentation avoidance.

    This patch changes all pageblocks within the high-order page being split
    to the correct migratetype. Without the patch, allocation success rates
    for hugepages under stress were about 59% of physical memory on x86-64.
    With the patch applied, this goes up to 65%.

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
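
    A sketch of the helper this introduces (approximate): walk every
    pageblock inside the high-order page being split and set its migratetype.

        static void change_pageblock_range(struct page *pageblock_page,
                                           int start_order, int migratetype)
        {
                int nr_pageblocks = 1 << (start_order - pageblock_order);

                while (nr_pageblocks--) {
                        set_pageblock_migratetype(pageblock_page, migratetype);
                        pageblock_page += pageblock_nr_pages;
                }
        }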
     
  • By the time PG_mlocked is cleared in the page freeing path, nobody else is
    looking at our page->flags anymore.

    It is thus safe to make the test-and-clear non-atomic and thereby removing
    an unnecessary and expensive operation from a hotpath.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __get_free_pages() with __GFP_HIGHMEM is not safe because the return
    address cannot represent a highmem page. get_zeroed_page() already has
    such a debug checking.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
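
    A sketch of the added check (approximate): a highmem page has no
    permanent kernel virtual address, so the returned address could not
    represent it.

        unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
        {
                struct page *page;

                /* __get_free_pages() cannot return a highmem page. */
                VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0);

                page = alloc_pages(gfp_mask, order);
                if (!page)
                        return 0;
                return (unsigned long) page_address(page);
        }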
     
  • If the system is running a heavy load of processes then concurrent reclaim
    can isolate a large number of pages from the LRU. /proc/vmstat and the
    output generated for an OOM do not show how many pages were isolated.

    This has been observed during process fork bomb testing (mstctl11 in LTP).

    This patch shows the information about isolated pages.

    Reproduced via:

    -----------------------
    % ./hackbench 140 process 1000
    => OOM occur

    active_anon:146 inactive_anon:0 isolated_anon:49245
    active_file:79 inactive_file:18 isolated_file:113
    unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
    free:370 slab_reclaimable:309 slab_unreclaimable:5492
    mapped:53 shmem:15 pagetables:28140 bounce:0

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Acked-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • It is possible for the oom killer to select current as the task to kill.
    When this happens, alloc_flags needs to be updated accordingly to set
    ALLOC_NO_WATERMARKS so the subsequent allocation attempt may use memory
    reserves as the result of its thread having TIF_MEMDIE set if the
    allocation is not __GFP_NOMEMALLOC.

    Acked-by: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
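
    A sketch of the relevant alloc_flags update (approximate): a thread
    marked TIF_MEMDIE may dip into the reserves unless the allocation is
    __GFP_NOMEMALLOC.

        /* When recomputing alloc_flags; p is the allocating task (current). */
        if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
                if (!in_interrupt() &&
                    ((p->flags & PF_MEMALLOC) ||
                     unlikely(test_thread_flag(TIF_MEMDIE))))
                        alloc_flags |= ALLOC_NO_WATERMARKS;
        }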
     
  • Recently we encountered OOM problems due to memory use of the GEM cache.
    Generally a large amount of Shmem/Tmpfs pages tend to create a memory
    shortage problem.

    We often use the following calculation to determine the amount of shmem
    pages:

    shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

    however the expression does not consider isolated and mlocked pages.

    This patch adds explicit accounting for pages used by shmem and tmpfs.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The amount of memory allocated to kernel stacks can become significant and
    cause OOM conditions. However, we do not display the amount of memory
    consumed by stacks.

    Add code to display the amount of memory used for stacks in /proc/meminfo.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • It is often useful to know the statistics for all pages that are handled
    like page cache pages when looking at OOM log output.

    Therefore show_free_areas() should also display buffer cache statistics.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Wu Fengguang
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • show_free_areas() displays only a limited amount of zone counters. This
    patch includes additional counters in the display to allow easier
    debugging. This may be especially useful if an OOM is due to running out
    of DMA memory.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If an OOM happens, we really want to know the number of remaining
    reclaimable pages. So the reclaimable slab and unreclaimable slab fields
    should not be combined for display.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Unmark the function as having kernel-doc notation, fixing the kernel-doc
    warning.

    Warning(mm/page_alloc.c:4519): No description found for parameter 'zone'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Pages in the movable zone have two types, MIGRATE_MOVABLE and
    MIGRATE_RESERVE, and both of them are movable, because only movable memory
    allocations can get pages from the movable zone. This makes pages in the
    movable zone always able to migrate.

    Signed-off-by: Shaohua Li
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Yakui Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Pages marked as isolated should not be allocated again. If such pages
    reside in the pcp list, they can still be allocated, so there is a
    ping-pong: memory offline frees some pages to the pcp list, the pages get
    allocated, then memory offline frees them again, and this loop happens
    again and again.

    This should have no impact on the normal code path, because in the normal
    code path pages in the pcp list aren't isolated, and the loop below will
    break on the first entry.

    Signed-off-by: Shaohua Li
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Yakui Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li