14 Jan, 2011

13 commits

  • It's old-fashioned and unneeded.

    akpm:/usr/src/25> size mm/page_alloc.o
       text     data   bss     dec    hex filename
      39884  1241317 18808 1300009 13d629 mm/page_alloc.o (before)
      39838  1241317 18808 1299963 13d5fb mm/page_alloc.o (after)

    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The previous approach for calculating the combined index was

    page_idx & ~(1 << order)

    but we get the same result with

    page_idx & buddy_idx

    This reduces instructions slightly as well as enhancing readability.
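
    The equivalence holds because buddy_idx differs from page_idx only in
    bit 'order'. A small illustrative sketch (names follow the buddy
    allocator, but the snippet is not the literal patch):

    unsigned long buddy_idx = page_idx ^ (1UL << order); /* flip the order bit */
    /* page_idx and buddy_idx agree on every other bit, so ANDing them
     * clears bit 'order' and keeps the rest: */
    unsigned long combined_idx = page_idx & buddy_idx; /* == page_idx & ~(1UL << order) */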

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix used-uninitialised warning]
    Signed-off-by: KyongHo Cho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KyongHo Cho
     
  • PG_buddy can be converted to _mapcount == -2. The PG_compound_lock can
    then be added to page->flags without overflowing (because of the sparse
    section bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This
    also requires moving the memory hotplug code from _mapcount to lru.next
    to avoid any risk of clashes. We can't use lru.next for the PG_buddy
    removal, but memory hotplug can use lru.next even more easily than
    _mapcount.
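
    A minimal sketch of the encoding this describes (abridged from the
    helpers the change introduces):

    /* _mapcount is -1 for a page with no mappings, so -2 is free to mean
     * "this page is in the buddy allocator". */
    static inline int PageBuddy(struct page *page)
    {
        return atomic_read(&page->_mapcount) == -2;
    }

    static inline void __SetPageBuddy(struct page *page)
    {
        VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
        atomic_set(&page->_mapcount, -2);
    }

    static inline void __ClearPageBuddy(struct page *page)
    {
        VM_BUG_ON(!PageBuddy(page));
        atomic_set(&page->_mapcount, -1);
    }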

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It's not worth throwing away the precious reserved free memory pool for
    allocations that can fail gracefully (either through mempool or because
    they're transhuge allocations that later fall back to 4k allocations).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Transparent hugepage allocations must be allowed not to invoke kswapd or
    any other kind of indirect reclaim (especially when the defrag sysfs
    control is disabled). It's unacceptable to swap out anonymous pages
    (potentially anonymous transparent hugepages) in order to create new
    transparent hugepages. This is true for the MADV_HUGEPAGE areas too: it
    makes no sense to swap out a KVM virtual machine, making it suffer an
    unbearable slowdown, just so another one with guest physical memory
    marked MADV_HUGEPAGE can run 30% faster on memory-intensive workloads.
    If a transparent hugepage allocation fails, the slowdown is minor and
    there is total fallback, so kswapd should never be asked to swap out
    memory to allow the high-order allocation to succeed.
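
    A simplified sketch of the mechanism (assuming a __GFP_NO_KSWAPD gfp
    flag and an abridged slow-path call; signatures are from memory):

    /* Allocator slow path: only wake kswapd when the caller allows it;
     * transparent hugepage allocations pass __GFP_NO_KSWAPD. */
    if (!(gfp_mask & __GFP_NO_KSWAPD))
        wake_all_kswapd(order, zonelist, high_zoneidx);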

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Warn in destroy_compound_page() that __split_huge_page_refcount() is
    heavily dependent on its internal behavior.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Clear the compound mapping for anonymous compound pages, as already
    happens for regular anonymous pages. But crash if a mapping is set for
    any tail page; the PageAnon check is meaningless for tail pages anyway.
    The check only makes sense for the head page; for tail pages it can only
    hide bugs, and we definitely don't want to hide bugs.
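
    A hedged sketch of the free path this describes (simplified;
    free_pages_check() stands in for the per-page sanity check):

    /* Clear the head's anon mapping, as for regular anon pages; a tail
     * page with a non-NULL mapping should trip the check, not be hidden. */
    if (PageAnon(page))
        page->mapping = NULL;
    for (i = 0; i < (1 << order); i++)
        bad += free_pages_check(page + i);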

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • page_count shows the count of the head page, but the actual check is done
    on the tail page, so show what is really being checked.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When the numa_zonelist_order parameter is set to "node" or "zone" on the
    command line, it still shows up as "default" in sysctl. That's because
    the early_param parsing function changes only the user_zonelist_order
    variable. Fix this by copying the user-provided string to
    numa_zonelist_order if it was successfully parsed.
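
    A sketch of the fixed handler (close to the eventual code, though the
    exact names and buffer length are from memory):

    static int __init setup_numa_zonelist_order(char *s)
    {
        int ret;

        if (!s)
            return 0;

        ret = __parse_numa_zonelist_order(s);
        if (ret == 0)
            /* keep the sysctl-visible string in sync with what was parsed */
            strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);

        return ret;
    }
    early_param("numa_zonelist_order", setup_numa_zonelist_order);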

    Signed-off-by: Volodymyr G Lukiianyk
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Volodymyr G. Lukiianyk
     
  • Simon Kirby reported the following problem

    We're seeing cases on a number of servers where cache never fully
    grows to use all available memory. Sometimes we see servers with 4 GB
    of memory that never seem to have less than 1.5 GB free, even with a
    constantly-active VM. In some cases, these servers also swap out while
    this happens, even though they are constantly reading the working set
    into memory. We have been seeing this happening for a long time; I
    don't think it's anything recent, and it still happens on 2.6.36.

    After some debugging work by Simon, Dave Hansen and others, the prevailing
    theory became that kswapd is reclaiming order-3 pages requested by SLUB
    and is too aggressive about it.

    There are two apparent problems here. On the target machine, there is a
    small Normal zone in comparison to DMA32. As kswapd tries to balance all
    zones, it would continually try reclaiming for Normal even though DMA32
    was balanced enough for callers. The second problem is that
    sleeping_prematurely() does not use the same logic as balance_pgdat() when
    deciding whether to sleep or not. This keeps kswapd artificially awake.

    A number of tests were run and the figures from previous postings will
    look very different for a few reasons. One, the old figures were forcing
    my network card to use GFP_ATOMIC in an attempt to replicate Simon's
    problem. Second, I previously specified slub_min_order=3, again in an
    attempt to reproduce Simon's problem. In this posting, I'm depending on
    Simon to say whether his problem is fixed or not, and these figures are
    to show the impact on the ordinary cases. Finally, the "vmscan" figures
    are taken from /proc/vmstat instead of the tracepoints. There is less
    information but recording is less disruptive.

    The first test of relevance was postmark with a process running in the
    background reading a large amount of anonymous memory in blocks. The
    objective was to vaguely simulate what was happening on Simon's machine
    and it's memory intensive enough to have kswapd awake.

    POSTMARK
                                             traceonly          kanyzone
    Transactions per second:           156.00 ( 0.00%)   153.00 (-1.96%)
    Data megabytes read per second:     21.51 ( 0.00%)    21.52 ( 0.05%)
    Data megabytes written per second:  29.28 ( 0.00%)    29.11 (-0.58%)
    Files created alone per second:    250.00 ( 0.00%)   416.00 (39.90%)
    Files create/transact per second:   79.00 ( 0.00%)    76.00 (-3.95%)
    Files deleted alone per second:    520.00 ( 0.00%)   420.00 (-23.81%)
    Files delete/transact per second:   79.00 ( 0.00%)    76.00 (-3.95%)

    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)     16.58     17.4
    Total Elapsed Time (seconds)            218.48   222.47

    VMstat Reclaim Statistics: vmscan
    Direct reclaims                          0        4
    Direct reclaim pages scanned             0      203
    Direct reclaim pages reclaimed           0      184
    Kswapd pages scanned                326631   322018
    Kswapd pages reclaimed              312632   309784
    Kswapd low wmark quickly                 1        4
    Kswapd high wmark quickly              122      475
    Kswapd skip congestion_wait              1        0
    Pages activated                     700040   705317
    Pages deactivated                   212113   203922
    Pages written                         9875     6363

    Total pages scanned                 326631   322221
    Total pages reclaimed               312632   309968
    %age total pages scanned/reclaimed  95.71%   96.20%
    %age total pages scanned/written     3.02%    1.97%

    proc vmstat: Faults
    Major Faults                           300      254
    Minor Faults                        645183   660284
    Page ins                            493588   486704
    Page outs                          4960088  4986704
    Swap ins                              1230      661
    Swap outs                             9869     6355

    Performance is mildly affected because kswapd is no longer doing as much
    work and the background memory consumer process is getting in the way.
    Note that kswapd scanned and reclaimed fewer pages, as it's less
    aggressive, and overall fewer pages were scanned and reclaimed. Swap
    in/out is particularly reduced, again reflecting kswapd throwing out
    fewer pages.

    The slight performance impact is unfortunate here but it looks like a
    direct result of kswapd being less aggressive. As the bug report is about
    too many pages being freed by kswapd, it may have to be accepted for now.

    The second test is a streaming IO benchmark that was previously used by
    Johannes to show regressions in page reclaim.

    MICRO
                                          traceonly  kanyzone
    User/Sys Time Running Test (seconds)      29.29     28.87
    Total Elapsed Time (seconds)             492.18    488.79

    VMstat Reclaim Statistics: vmscan
    Direct reclaims                            2128      1460
    Direct reclaim pages scanned            2284822   1496067
    Direct reclaim pages reclaimed           148919    110937
    Kswapd pages scanned                   15450014  16202876
    Kswapd pages reclaimed                  8503697   8537897
    Kswapd low wmark quickly                   3100      3397
    Kswapd high wmark quickly                  1860      7243
    Kswapd skip congestion_wait                 708       801
    Pages activated                            9635      9573
    Pages deactivated                          1432      1271
    Pages written                               223      1130

    Total pages scanned                    17734836  17698943
    Total pages reclaimed                   8652616   8648834
    %age total pages scanned/reclaimed       48.79%    48.87%
    %age total pages scanned/written          0.00%     0.01%

    proc vmstat: Faults
    Major Faults                                165       221
    Minor Faults                            9655785   9656506
    Page ins                                   3880      7228
    Page outs                              37692940  37480076
    Swap ins                                      0        69
    Swap outs                                    19        15

    Again fewer pages are scanned and reclaimed as expected and this time the
    test completed faster. Note that kswapd is hitting its watermarks faster
    (low and high wmark quickly) which I expect is due to kswapd reclaiming
    fewer pages.

    I also ran fs-mark, iozone and sysbench but there is nothing interesting
    to report in the figures. Performance is not significantly changed and
    the reclaim statistics look reasonable.

    This patch:

    When the allocator enters its slow path, kswapd is woken up to balance the
    node. It continues working until all zones within the node are balanced.
    For order-0 allocations, this makes perfect sense but for higher orders it
    can have unintended side-effects. If the zone sizes are imbalanced,
    kswapd may reclaim heavily within a smaller zone discarding an excessive
    number of pages. The user-visible behaviour is that kswapd is awake and
    reclaiming even though plenty of pages are free from a suitable zone.

    This patch alters the "balance" logic for high-order reclaim allowing
    kswapd to stop if any suitable zone becomes balanced to reduce the number
    of pages it reclaims from other zones. kswapd still tries to ensure that
    order-0 watermarks for all zones are met before sleeping.
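
    A rough sketch of the altered check in balance_pgdat() (illustrative
    only; the real logic tracks more per-zone state):

    /* For high-order reclaim, one suitable balanced zone is enough:
     * drop back to order-0 balancing instead of pounding small zones. */
    if (order && zone_watermark_ok(zone, order,
                                   high_wmark_pages(zone), 0, 0))
        order = 0;      /* finish the pass as an order-0 balance */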

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …ompaction in the faster path

    Migration synchronously waits for writeback if the initial passes fail.
    Callers of memory compaction do not necessarily want this behaviour if the
    caller is latency sensitive or expects that synchronous migration is not
    going to have a significantly better success rate.

    This patch adds a sync parameter to migrate_pages() allowing the caller to
    indicate if wait_on_page_writeback() is allowed within migration or not.
    For reclaim/compaction, try_to_compact_pages() is first called
    asynchronously, direct reclaim runs and then try_to_compact_pages() is
    called synchronously as there is a greater expectation that it'll succeed.
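
    A hedged sketch of how the flag is threaded through (signatures and
    call sites abridged):

    /* migrate_pages() grows a sync parameter: */
    int migrate_pages(struct list_head *from, new_page_t get_new_page,
                      unsigned long private, bool offlining, bool sync);

    /* The allocator slow path then compacts twice: */
    bool sync_migration = false;

    page = __alloc_pages_direct_compact(/* other args elided */ sync_migration);
    if (!page) {
        page = __alloc_pages_direct_reclaim(/* args elided */);
        sync_migration = true;
        if (!page)      /* second, more patient attempt */
            page = __alloc_pages_direct_compact(/* other args elided */ sync_migration);
    }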

    [akpm@linux-foundation.org: build/merge fix]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Lumpy reclaim is disruptive. It reclaims a large number of pages and
    ignores the age of the pages it reclaims. This can incur significant
    stalls and potentially increase the number of major faults.

    Compaction has reached the point where it is considered reasonably stable
    (meaning it has passed a lot of testing) and is a potential candidate for
    displacing lumpy reclaim. This patch introduces an alternative to lumpy
    reclaim, used when compaction is available, called reclaim/compaction.
    The basic operation is very simple: instead of selecting a contiguous
    range of pages to reclaim, a number of order-0 pages are reclaimed and
    compaction is applied later, by either kswapd (compact_zone_order()) or
    direct compaction (__alloc_pages_direct_compact()).
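
    A hedged sketch of the decision this implies in the reclaim loop (the
    names and the margin are illustrative, not the literal patch):

    /* Keep reclaiming order-0 pages until compaction has enough slack
     * to assemble a page of the requested order. */
    unsigned long pages_for_compaction = 2UL << sc->order;

    if (sc->nr_reclaimed < pages_for_compaction)
        return true;            /* continue order-0 reclaim */

    /* Enough free pages: let compaction form the high-order page. */
    compact_zone_order(zone, sc->order, sc->gfp_mask);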

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: use conventional task_struct naming]
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
    is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
    avoid synchronization overhead, these counters are maintained on a per-cpu
    basis and drained both periodically and when the delta is above a
    threshold. On large CPU systems, the difference between the estimate and
    real value of NR_FREE_PAGES can be very high. The system can get into a
    case where pages are allocated far below the min watermark potentially
    causing livelock issues. The commit solved the problem by taking a better
    reading of NR_FREE_PAGES when memory was low.

    Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
    large amount of CPU time on systems with many sockets due to cache line
    bouncing. This patch takes a different approach. For large machines
    where counter drift might be unsafe and while kswapd is awake, the per-cpu
    thresholds for the target pgdat are reduced to limit the level of drift to
    what should be a safe level. This incurs a performance penalty in heavy
    memory pressure by a factor that depends on the workload and the machine
    but the machine should function correctly without accidentally exhausting
    all memory on a node. There is an additional cost when kswapd wakes and
    sleeps but the event is not expected to be frequent - in Shaohua's test
    case, there was one recorded sleep and wake event at least.

    To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
    introduced that takes a more accurate reading of NR_FREE_PAGES when called
    from wakeup_kswapd, when deciding whether it is really safe to go back to
    sleep in sleeping_prematurely() and when deciding if a zone is really
    balanced or not in balance_pgdat(). We are still using an expensive
    function but limiting how often it is called.
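
    A sketch of the safe variant (close to the helper described above,
    reproduced from memory):

    bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
                                int classzone_idx, int alloc_flags)
    {
        long free_pages = zone_page_state(z, NR_FREE_PAGES);

        /* Only pay for the accurate per-cpu summing read when the cheap
         * estimate is close enough to the mark to be unsafe. */
        if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
            free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

        return __zone_watermark_ok(z, order, mark, classzone_idx,
                                   alloc_flags, free_pages);
    }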

    When the test case is reproduced, the time spent in the watermark
    functions is reduced. The following report is on the percentage of time
    cumulatively spent in the functions zone_nr_free_pages(),
    zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
    zone_page_state_snapshot() and zone_page_state():

    vanilla 11.6615%
    disable-threshold 0.2584%

    David said:

    : We had to pull aa454840 "mm: page allocator: calculate a better estimate
    : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
    : internally because tests showed that it would cause the machine to stall
    : as the result of heavy kswapd activity. I merged it back with this fix as
    : it is pending in the -mm tree and it solves the issue we were seeing, so I
    : definitely think this should be pushed to -stable (and I would seriously
    : consider it for 2.6.37 inclusion even at this late date).

    Signed-off-by: Mel Gorman
    Reported-by: Shaohua Li
    Reviewed-by: Christoph Lameter
    Tested-by: Nicolas Bareil
    Cc: David Rientjes
    Cc: Kyle McMartin
    Cc: [2.6.37.1, 2.6.36.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Dec, 2010

1 commit

  • There is a problem that swap pages allocated before the creation of
    a hibernation image can be released and used for storing the contents
    of different memory pages while the image is being saved. Since the
    kernel stored in the image doesn't know of that, it causes memory
    corruption to occur after resume from hibernation, especially on
    systems with relatively small RAM that need to swap often.

    This issue can be addressed by keeping the GFP_IOFS bits clear
    in gfp_allowed_mask during the entire hibernation, including the
    saving of the image, until the system is finally turned off or
    the hibernation is aborted. Unfortunately, for this purpose
    it's necessary to rework the way in which the hibernate and
    suspend code manipulates gfp_allowed_mask.

    This change is based on an earlier patch from Hugh Dickins.
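
    The reworked manipulation looks roughly like this (a sketch of the
    helpers this change introduces):

    static gfp_t saved_gfp_mask;

    void pm_restrict_gfp_mask(void)
    {
        WARN_ON(saved_gfp_mask);
        saved_gfp_mask = gfp_allowed_mask;
        gfp_allowed_mask &= ~GFP_IOFS;  /* no IO/FS allocations from here on */
    }

    void pm_restore_gfp_mask(void)
    {
        if (saved_gfp_mask) {
            gfp_allowed_mask = saved_gfp_mask;
            saved_gfp_mask = 0;
        }
    }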

    Signed-off-by: Rafael J. Wysocki
    Reported-by: Ondrej Zary
    Acked-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org

    Rafael J. Wysocki
     

29 Nov, 2010

1 commit

  • These warnings are spewed during a build of an 'allnoconfig' kernel
    (especially the ones from u64_stats_sync.h show up a lot) when building
    with -Wextra (which I often do). They are

    a) annoying
    b) easy to get rid of.

    This patch kills them off.

    include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
    kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
    kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
    kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
    mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration
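
    The warning and its fix in miniature (illustrative names):

    /* warns with gcc -Wextra: 'inline' is not at beginning of declaration */
    static void inline bad_example(void) { }

    /* fixed: put 'inline' before the return type */
    static inline void good_example(void) { }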

    Signed-off-by: Jesper Juhl
    Signed-off-by: Jiri Kosina

    Jesper Juhl
     

25 Nov, 2010

1 commit

  • … under stop_machine_run()

    During memory hotplug, build_all_zonelists() may be called under
    stop_machine_run(). In this function, setup_zone_pageset() is called.
    But it's a bug because it will do page allocation under
    stop_machine_run().

    Here is a report from Alok Kataria.

    BUG: sleeping function called from invalid context at kernel/mutex.c:94
    in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
    Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
    Call Trace:
    [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
    [<ffffffff81468245>] mutex_lock+0x24/0x50
    [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
    [<ffffffff81048888>] ? load_balance+0xbe/0x60e
    [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
    [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
    [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
    [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
    [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
    [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
    [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
    [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
    [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
    [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
    [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
    [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
    [<ffffffff81065f29>] kthread+0x7f/0x87
    [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
    [<ffffffff81065eaa>] ? kthread+0x0/0x87
    [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
    Built 5 zonelists in Node order, mobility grouping on. Total pages: 289456
    Policy zone: Normal

    This patch tries to fix the issue by moving setup_zone_pageset() out from
    under stop_machine_run(). It's obviously not necessary for it to be
    called under stop_machine_run().
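
    Roughly, the fix has this shape (a sketch; details abridged):

    /* Runs in a sleepable context during memory hotplug. */
    void build_all_zonelists(void *data)
    {
        /* Allocate the per-cpu pagesets here, where sleeping is fine... */
        if (data)
            setup_zone_pageset((struct zone *)data);

        /* ...and only then rebuild the zonelists atomically. */
        stop_machine(__build_all_zonelists, NULL, NULL);
    }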

    [akpm@linux-foundation.org: remove unneeded local]
    Reported-by: Alok Kataria <akataria@vmware.com>
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Petr Vandrovec <petr@vmware.com>
    Cc: Pekka Enberg <penberg@cs.helsinki.fi>
    Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     

27 Oct, 2010

5 commits

  • This removes the following warning from sparse:

    mm/page_alloc.c:1934:9: warning: restricted gfp_t degrades to integer

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • …r if significant congestion is not being encountered in the current zone

    If congestion_wait() is called with no BDI congested, the caller will
    sleep for the full timeout, and this may be an unnecessary sleep. This
    patch adds a wait_iff_congested() that checks congestion and only sleeps
    if a BDI is congested; otherwise it calls cond_resched() to ensure the
    caller is not hogging the CPU longer than its quota, but it does not
    sleep.

    This is aimed at reducing some of the major desktop stalls reported during
    IO. For example, while kswapd is operating, it calls congestion_wait()
    but it could just have been reclaiming clean page cache pages with no
    congestion. Without this patch, it would sleep for a full timeout but
    after this patch, it'll just call schedule() if it has been on the CPU too
    long. Similar logic applies to direct reclaimers that are not making
    enough progress.
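
    In outline (a simplified sketch of the new helper):

    long wait_iff_congested(struct zone *zone, int sync, long timeout)
    {
        /* No backing device is congested: don't sleep, just be polite. */
        if (atomic_read(&nr_bdi_congested[sync]) == 0) {
            cond_resched();
            return 0;
        }

        /* Otherwise behave like congestion_wait(). */
        return congestion_wait(sync, timeout);
    }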

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Now, the sysfs interface of memory hotplug shows whether a section is
    removable or not, but it checks only the migratetype of the pages and
    doesn't check the details of the cluster of pages.

    Next, memory hotplug's set_migratetype_isolate() has the same kind of
    check, too.

    This patch adds the function __count_unmovable_pages() and makes the
    above two checks use the same logic. Then, is_removable and the hotremove
    code use the same logic. No changes in the hotremove logic itself.

    TODO: we need to find a way to check RECLAIMABLE. But, considering it a
    bit, calling shrink_slab() against a range before starting memory
    hotremove sounds better. If so, this patch's logic doesn't need to be
    changed.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Michal Hocko
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Even if the notifier cannot find any pages, it doesn't mean that no pages
    are available... And if there are no notifiers registered, this condition
    will always be true and memory hotplug will show -EBUSY.

    This is a bug, but not a critical one.

    In most cases, a pageblock which will be offlined is MIGRATE_MOVABLE.
    This "notifier" is called only when the pageblock is _not_
    MIGRATE_MOVABLE. But if it's not MIGRATE_MOVABLE, it's a common case that
    memory hotplug will fail. So, no one noticed this bug.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation
    in buddy allocator by adding buddies that are merging to the tail of the
    free lists") that means a buddy at order MAX_ORDER is checked for merging.
    A page of this order never exists so at times, an effectively random
    piece of memory is being checked.

    Alan Curry has reported that this is causing memory corruption in
    userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32).
    It is not clear why this is happening. It could be a cache coherency
    problem where pages mapped in both user and kernel space are getting
    different cache lines due to the bad read from kernel space
    (http://lkml.org/lkml/2010/10/13/179). It could also be that there are
    some special registers being io-remapped at the end of the memmap array
    and that a read has special meaning on them. Compiler bugs have been
    ruled out because the assembly before and after the patch looks relatively
    harmless.

    This patch fixes the problem by ensuring we are not reading a possibly
    invalid location of memory. It's not clear why the read causes corruption
    but one way or the other it is a buggy read.
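
    The shape of the fix (a sketch of the guard in __free_one_page(); the
    surrounding code is abridged):

    /* Only peek at the next-higher buddy when such an order can exist and
     * the buddy's pfn is valid to read. */
    if ((order < MAX_ORDER - 2) && pfn_valid_within(page_to_pfn(buddy))) {
        /* check page_is_buddy(higher_page, higher_buddy, order + 1) and,
         * if it holds, add the freed page to the tail of the free list */
    }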

    Signed-off-by: Mel Gorman
    Cc: Corrado Zoccolo
    Reported-by: Alan Curry
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Sep, 2010

3 commits

  • When under significant memory pressure, a process enters direct reclaim
    and immediately afterwards tries to allocate a page. If it fails and no
    further progress is made, it's possible the system will go OOM. However,
    on systems with large amounts of memory, it's possible that a significant
    number of pages are on per-cpu lists and inaccessible to the calling
    process. This leads to a process entering direct reclaim more often than
    it should, increasing the pressure on the system and compounding the
    problem.

    This patch notes that if direct reclaim is making progress but allocations
    are still failing, the system is already under heavy pressure. In
    this case, it drains the per-cpu lists and tries the allocation a second
    time before continuing.
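
    In sketch form (simplified from the direct-reclaim slow path; the real
    call carries more arguments):

    /* Direct reclaim made progress but the allocation still failed: */
    page = get_page_from_freelist(/* gfp, order, zonelist: args elided */);
    if (!page && !drained) {
        drain_all_pages();   /* flush per-cpu free lists back to the buddy lists */
        drained = true;
        page = get_page_from_freelist(/* args elided */);  /* one retry */
    }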

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Dave Chinner
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
    number of really free pages in the buddy lists, the VM can allocate pages
    below the min watermark, at worst reducing the real number of free pages
    to zero. Even if the OOM killer kills some victim to free memory, it may
    not free memory if the exit path requires a new page, resulting in
    livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.
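
    A sketch of the snapshot function (close to the helper described above):

    static inline unsigned long
    zone_page_state_snapshot(struct zone *zone, enum zone_stat_item item)
    {
        long x = atomic_long_read(&zone->vm_stat[item]);
    #ifdef CONFIG_SMP
        int cpu;

        /* Fold in the not-yet-drained per-cpu deltas for a closer estimate. */
        for_each_online_cpu(cpu)
            x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

        if (x < 0)
            x = 0;
    #endif
        return x;
    }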

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
     
  • When allocating a page, the system uses NR_FREE_PAGES counters to
    determine if watermarks would remain intact after the allocation was made.
    This check is made without interrupts disabled or the zone lock held and
    so is race-prone by nature. Unfortunately, when pages are being freed in
    batch, the counters are updated before the pages are added on the list.
    During this window, the counters are misleading as the pages do not exist
    yet. When under significant pressure on systems with large numbers of
    CPUs, it's possible for processes to make progress even though they should
    have been stalled. This is particularly problematic if a number of the
    processes are using GFP_ATOMIC as the min watermark can be accidentally
    breached and in extreme cases, the system can livelock.

    This patch updates the counters after the pages have been added to the
    list. This makes the allocator more cautious with respect to preserving
    the watermarks and mitigates livelock possibilities.
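
    Schematically, the change is an ordering fix in the bulk free path (a
    sketch, not the literal diff):

    /* Before: __mod_zone_page_state(zone, NR_FREE_PAGES, count) ran here,
     * so the racy watermark check could see pages that were not yet on
     * any free list. */
    while (to_free--) {
        /* pick the next page off the per-cpu list ... */
        __free_one_page(page, zone, 0, migratetype);
    }
    /* After: account the pages only once they really exist on the lists. */
    __mod_zone_page_state(zone, NR_FREE_PAGES, count);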

    [akpm@linux-foundation.org: avoid modifying incoming args]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

28 Aug, 2010

2 commits

  • 1. Replace find_e820_area with memblock_find_in_range.
    2. Replace reserve_early with memblock_x86_reserve_range.
    3. Replace free_early with memblock_x86_free_range.
    4. NO_BOOTMEM will switch to use memblock too.
    5. Use _e820/_early wrappers in this patch; a following patch will
       replace them all.
    6. Because memblock_x86_free_range supports partial free, we can remove
       some special-casing.
    7. Make sure that memblock_find_in_range() is called after
       memblock_x86_fill(), so adjust some calls later in
       setup.c::setup_arch() -- corruption_check and mptable_update.

    -v2: Move reserve_brk() early, before fill_memblock_area(), to avoid an
         overlap between brk and memblock_find_in_range(). That could happen
         when we have more than 128 RAM entries in the E820 table and
         memblock_x86_fill() uses memblock_find_in_range() to find a new
         place for the memblock.memory.region array. We also don't need to
         use extend_brk() after fill_memblock_area(), so move reserve_brk()
         early, before fill_memblock_area().
    -v3: Move find_smp_config early, to make sure memblock_find_in_range
         doesn't find the wrong place if the BIOS doesn't put the mptable in
         the right place.
    -v4: Treat RESERVED_KERN as RAM in memblock.memory; those ranges are
         already in memblock.reserved.
         Use __NOT_KEEP_MEMBLOCK to make sure memblock-related code can be
         freed later.
    -v5: The generic version of __memblock_find_in_range() goes from high to
         low, and active_region for 32bit does include high pages, so we need
         to replace the limit with memblock.default_alloc_limit, aka
         get_max_mapped().
    -v6: Use current_limit instead.
    -v7: Check with MEMBLOCK_ERROR instead of -1ULL or -1L.
    -v8: Set memblock_can_resize early to handle EFI with more RAM entries.
    -v9: Update after the kmemleak changes in mainline.

    Suggested-by: David S. Miller
    Suggested-by: Benjamin Herrenschmidt
    Suggested-by: Thomas Gleixner
    Signed-off-by: Yinghai Lu
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     
  • Use the node ranges in early_node_map[] with __memblock_find_in_range()
    to find a free range.

    Will be used by memblock_x86_find_in_range_node().

    memblock_x86_find_in_range_node() will be used to find the right buffer
    for NODE_DATA.

    Signed-off-by: Yinghai Lu
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

10 Aug, 2010

3 commits

  • Since 2.6.28 zone->prev_priority is unused, so it can be removed safely.
    This reduces stack usage slightly.

    Now I have to say that I'm sorry. Two years ago I thought prev_priority
    could be integrated again and would be useful, but four (or more)
    attempts haven't produced good performance numbers. Thus I give up on
    that approach.

    The rest of this changelog consists of notes on prev_priority: why it
    existed in the first place and why it might not be necessary any more.
    This information is based heavily on discussions between Andrew Morton,
    Rik van Riel and Kosaki Motohiro, who is heavily quoted below.

    Historically prev_priority was important because it determined when the VM
    would start unmapping PTE pages. i.e. there are no balances of note within
    the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
    is a potential risk of unnecessarily increasing minor faults as a large
    amount of read activity of use-once pages could push mapped pages to the
    end of the LRU and get unmapped.

    There is no proof this is still a problem, but currently it is not
    considered to be one. Active files are not deactivated if the active file
    list is smaller than the inactive list, reducing the likelihood that
    file-mapped pages are being pushed off the LRU, and referenced executable
    pages are kept on the active list to avoid them getting pushed out by
    read activity.

    Even if it is a problem, prev_priority wouldn't work nowadays. First of
    all, current vmscan still has a lot of UP-centric code; it exposes some
    weaknesses on machines with dozens of CPUs. I think we need more and more
    improvement.

    The problem is that current vmscan mixes up per-system pressure, per-zone
    pressure and per-task pressure a bit. For example, prev_priority tries to
    boost the priority of other concurrent reclaimers, but if another task
    has a mempolicy restriction this is unnecessary, and it also causes
    wrongly large latencies and excessive reclaim. Per-task-based priority
    plus prev_priority adjustment emulates per-system pressure, but it has
    two issues: 1) the emulation is too rough and brutal, and 2) we need
    per-zone pressure, not per-system.

    Another example: currently DEF_PRIORITY is 12, which means the LRU is
    rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + ... + 1) before
    invoking the OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim
    at the same time, the system has higher memory pressure than priority==0
    (1/4096 * 10,000 > 2). prev_priority can't solve such multithreaded
    workload issues. In other words, the prev_priority concept assumes the
    system doesn't have lots of threads.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • We have been using the names try_set_zone_oom and clear_zonelist_oom.
    The role of these functions is to lock the zonelist to prevent parallel
    OOM handling. So clear_zonelist_oom makes sense, but try_set_zone_oom is
    rather awkward and doesn't match clear_zonelist_oom.

    Let's rename it to try_set_zonelist_oom.

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • If memory has been depleted in lowmem zones even with the protection
    afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
    killing current users will help. The memory is either reclaimable (or
    migratable) already, in which case we should not invoke the oom killer at
    all, or it is pinned by an application for I/O. Killing such an
    application may leave the hardware in an unspecified state and there is no
    guarantee that it will be able to make a timely exit.

    Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
    not used so that the task can perhaps recover or try again later.

    Previously, the heuristic provided some protection for those tasks with
    CAP_SYS_RAWIO, but this is no longer necessary since we will not be
    killing tasks for the purposes of ISA allocations.

    high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
    default for all allocations that are not __GFP_DMA, __GFP_DMA32,
    __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
    flags. Testing for high_zoneidx being less than ZONE_NORMAL will only
    return true for allocations that have either __GFP_DMA or __GFP_DMA32.
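
    The resulting check is tiny (a sketch of the bail-out in the OOM path):

    /* The OOM killer does not needlessly kill tasks for lowmem: only
     * __GFP_DMA/__GFP_DMA32 allocations arrive here with high_zoneidx
     * below ZONE_NORMAL, and killing tasks will not help them. */
    if (high_zoneidx < ZONE_NORMAL)
        goto out;       /* fail the allocation instead */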

    Acked-by: KOSAKI Motohiro
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 Jul, 2010

1 commit

  • Borislav Petkov reported that his 32-bit NUMA system has a problem:

    [ 0.000000] Reserving total of 4c00 pages for numa KVA remap
    [ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
    [ 0.000000] max_pfn = 238000
    [ 0.000000] 8202MB HIGHMEM available.
    [ 0.000000] 885MB LOWMEM available.
    [ 0.000000] mapped low ram: 0 - 375fe000
    [ 0.000000] low ram: 0 - 375fe000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
    [ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
    [ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
    [ 0.000000] BUG: unable to handle kernel paging request at 40000000
    [ 0.000000] IP: [] __alloc_memory_core_early+0x147/0x1d6
    [ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
    ...
    [ 0.000000] Call Trace:
    [ 0.000000] [] ? __alloc_bootmem_node+0x216/0x22f
    [ 0.000000] [] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
    [ 0.000000] [] ? sparse_init+0x1dc/0x499
    [ 0.000000] [] ? paging_init+0x168/0x1df
    [ 0.000000] [] ? native_pagetable_setup_start+0xef/0x1bb

    It looks like it allocates too high an address for bootmem.

    Try to cut the limit with get_max_mapped().

    Reported-by: Borislav Petkov
    Tested-by: Conny Seidel
    Signed-off-by: Yinghai Lu
    Cc: [2.6.34.x]
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

19 Jul, 2010

1 commit

  • With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
    friends use the early_res functions for memory management when
    NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
    corresponding code paths for bootmem allocations.

    Signed-off-by: Catalin Marinas
    Acked-by: Pekka Enberg
    Acked-by: Yinghai Lu
    Cc: H. Peter Anvin
    Cc: stable@kernel.org

    Catalin Marinas
     

28 May, 2010

2 commits

  • Introduce numa_mem_id(), based on the generic percpu variable
    infrastructure, to track the "nearest node with memory" for archs that
    support memoryless nodes.

    Define the API when CONFIG_HAVE_MEMORYLESS_NODES is defined; otherwise
    provide stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when
    they support memoryless nodes.

    Archs can override definitions of:

    numa_mem_id() - returns node number of "local memory" node
    set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue

    Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
    This will initialize the boot cpu at boot time, and all cpus on change of
    numa_zonelist_order, or when node or memory hot-plug requires zonelist
    rebuild. Archs that support memoryless nodes will need to initialize
    'numa_mem' for secondary cpus as they're brought on-line.
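
    The API boils down to something like this (a sketch; the exact guards
    and percpu accessors may differ):

    #ifdef CONFIG_HAVE_MEMORYLESS_NODES
    DECLARE_PER_CPU(int, _numa_mem_);          /* nearest node with memory */

    static inline int numa_mem_id(void)
    {
        return __this_cpu_read(_numa_mem_);
    }

    static inline void set_numa_mem(int node)
    {
        __this_cpu_write(_numa_mem_, node);
    }
    #else
    /* Without memoryless nodes, local memory is simply the local node. */
    static inline int numa_mem_id(void)
    {
        return numa_node_id();
    }
    #endif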

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Rework the generic version of the numa_node_id() function to use the new
    generic percpu variable infrastructure.

    Guard the new implementation with a new config option:

    CONFIG_USE_PERCPU_NUMA_NODE_ID.

    Archs which support this new implementation will default this option to 'y'
    when NUMA is configured. This config option could be removed if/when all
    archs switch over to the generic percpu implementation of numa_node_id().
    Arch support involves:

    1) converting any existing per cpu variable implementations to use
    this implementation. x86_64 is an instance of such an arch.
    2) archs that don't use a per cpu variable for numa_node_id() will
    need to initialize the new per cpu variable "numa_node" as cpus
    are brought on-line. ia64 is an example.
    3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
    when NUMA is configured. This is required because I have
    retained the old implementation by default to allow archs to
    be modified incrementally, as desired.

    Subsequent patches will convert x86_64 and ia64 to use this implementation.

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

25 May, 2010

2 commits

  • Add global mutex zonelists_mutex to fix the possible race:

    CPU0                               CPU1                   CPU2
    (1) zone->present_pages += online_pages;
    (2)                                build_all_zonelists();
    (3)                                                       alloc_page();
    (4)                                                       free_page();
    (5) build_all_zonelists();
    (6)   __build_all_zonelists();
    (7)     zone->pageset = alloc_percpu();

    In steps (3, 4), zone->pageset still points to boot_pageset, so bad
    things may happen if 2+ nodes are in this state. Even if only 1 node
    is accessing the boot_pageset, (3) may still consume too much memory,
    causing the memory allocation in step (7) to fail.

    Besides, the atomic operation ensures alloc_percpu() in step (7) will
    never fail, since there is a fresh new memory block added in step (6).

    [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • For each newly populated zone of a hotadded node, we need to update its
    pagesets with dynamically allocated per_cpu_pageset structs for all
    possible CPUs:

    1) Detach zone->pageset from the shared boot_pageset
       at the end of __build_all_zonelists().

    2) Use a mutex to protect zone->pageset when it's still
       shared in onlined_pages().

    Otherwise, multiple zones of different nodes would share the same
    bootstrapping boot_pageset for the same CPU, which will finally cause
    the kernel panic below:

    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:1239!
    invalid opcode: 0000 [#1] SMP
    ...
    Call Trace:
    [] __alloc_pages_nodemask+0x131/0x7b0
    [] alloc_pages_current+0x87/0xd0
    [] __page_cache_alloc+0x67/0x70
    [] __do_page_cache_readahead+0x120/0x260
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x166/0x2c0
    [] page_cache_async_readahead+0x80/0xa0
    [] generic_file_aio_read+0x364/0x670
    [] nfs_file_read+0xca/0x130
    [] do_sync_read+0xfa/0x140
    [] vfs_read+0xb5/0x1a0
    [] sys_read+0x51/0x80
    [] system_call_fastpath+0x16/0x1b
    RIP [] get_page_from_freelist+0x883/0x900
    RSP
    ---[ end trace 4bda28328b9990db ]

    [akpm@linux-foundation.org: merge fix]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li