14 Feb, 2012

2 commits

  • commit dc9086004b3d5db75997a645b3fe08d9138b7ad0 upstream.

    When isolating pages for migration, migration starts at the start of a
    zone while the free scanner starts at the end of the zone. Migration
    avoids entering a new zone by never scanning beyond the free scanner.

    Unfortunately, in very rare cases nodes can overlap. When this happens,
    migration isolates pages without the LRU lock held, corrupting lists
    which will trigger errors in reclaim or during page free such as in the
    following oops

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] free_pcppages_bulk+0xcc/0x450
    PGD 1dda554067 PUD 1e1cb58067 PMD 0
    Oops: 0000 [#1] SMP
    CPU 37
    Pid: 17088, comm: memcg_process_s Tainted: G X
    RIP: free_pcppages_bulk+0xcc/0x450
    Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
    Call Trace:
    free_hot_cold_page+0x17e/0x1f0
    __pagevec_free+0x90/0xb0
    release_pages+0x22a/0x260
    pagevec_lru_move_fn+0xf3/0x110
    putback_lru_page+0x66/0xe0
    unmap_and_move+0x156/0x180
    migrate_pages+0x9e/0x1b0
    compact_zone+0x1f3/0x2f0
    compact_zone_order+0xa2/0xe0
    try_to_compact_pages+0xdf/0x110
    __alloc_pages_direct_compact+0xee/0x1c0
    __alloc_pages_slowpath+0x370/0x830
    __alloc_pages_nodemask+0x1b1/0x1c0
    alloc_pages_vma+0x9b/0x160
    do_huge_pmd_anonymous_page+0x160/0x270
    do_page_fault+0x207/0x4c0
    page_fault+0x25/0x30

    The "X" in the taint flag means that external modules were loaded, but
    that is unrelated to the bug being triggered. The real problem is that
    the PFN layout looks like this:

    Zone PFN ranges:
    DMA 0x00000010 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x01e80000
    Movable zone start PFN for each node
    early_node_map[14] active PFN ranges
    0: 0x00000010 -> 0x0000009b
    0: 0x00000100 -> 0x0007a1ec
    0: 0x0007a354 -> 0x0007a379
    0: 0x0007f7ff -> 0x0007f800
    0: 0x00100000 -> 0x00680000
    1: 0x00680000 -> 0x00e80000
    0: 0x00e80000 -> 0x01080000
    1: 0x01080000 -> 0x01280000
    0: 0x01280000 -> 0x01480000
    1: 0x01480000 -> 0x01680000
    0: 0x01680000 -> 0x01880000
    1: 0x01880000 -> 0x01a80000
    0: 0x01a80000 -> 0x01c80000
    1: 0x01c80000 -> 0x01e80000

    The fix is straightforward. isolate_migratepages() has to make a check
    similar to the one in isolate_freepages() to ensure that it never
    isolates pages from a zone it does not hold the LRU lock for (a minimal
    sketch of such a check follows this entry).

    This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
    and current mainline.

    Signed-off-by: Mel Gorman
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
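
    Conceptually, the added check just refuses to isolate a page whose zone
    is not the one the LRU lock was taken for. A small userspace sketch of
    that idea, using a made-up pfn-to-node table modelled on the layout
    quoted above rather than the kernel's real structures:

    /*
     * Userspace sketch: before isolating a page for migration, verify it
     * belongs to the node/zone whose LRU lock is held, because with
     * overlapping nodes a linear PFN scan can wander into another node.
     * Table and helpers are simplified stand-ins, not kernel code.
     */
    #include <stdio.h>

    struct range { unsigned long start, end; int nid; };

    /* A cut-down version of the interleaved layout shown in the oops above. */
    static const struct range node_map[] = {
        { 0x00680000, 0x00e80000, 1 },
        { 0x00e80000, 0x01080000, 0 },
        { 0x01080000, 0x01280000, 1 },
    };

    static int pfn_to_nid(unsigned long pfn)
    {
        for (unsigned i = 0; i < sizeof(node_map) / sizeof(node_map[0]); i++)
            if (pfn >= node_map[i].start && pfn < node_map[i].end)
                return node_map[i].nid;
        return -1;                          /* memory hole */
    }

    int main(void)
    {
        int locked_nid = 1;                 /* LRU lock held for node 1's zone */
        unsigned long pfn;

        /* Scan across a node boundary, as the migration scanner might. */
        for (pfn = 0x00e7fffd; pfn < 0x00e80003; pfn++) {
            if (pfn_to_nid(pfn) != locked_nid) {
                printf("pfn %#lx: different node, skip (lock not held)\n", pfn);
                continue;
            }
            printf("pfn %#lx: node %d, safe to isolate\n", pfn, locked_nid);
        }
        return 0;
    }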
     
  • …ing isolation for migration

    commit 0bf380bc70ecba68cb4d74dc656cc2fa8c4d801a upstream.

    When isolating for migration, migration starts at the start of a zone
    which is not necessarily pageblock aligned. Further, it stops isolating
    when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
    not aligned. This allows isolate_migratepages() to call pfn_to_page() on
    an invalid PFN which can result in a crash. This was originally reported
    against a 3.0-based kernel with the following trace in a crash dump.

    PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s"
    #0 [d72d3ad0] crash_kexec at c028cfdb
    #1 [d72d3b24] oops_end at c05c5322
    #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
    #3 [d72d3bec] bad_area at c0227fb6
    #4 [d72d3c00] do_page_fault at c05c72ec
    #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000
    DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50
    CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002
    #6 [d72d3cb4] isolate_migratepages at c030b15a
    #7 [d72d3d14] zone_watermark_ok at c02d26cb
    #8 [d72d3d2c] compact_zone at c030b8de
    #9 [d72d3d68] compact_zone_order at c030bba1
    #10 [d72d3db4] try_to_compact_pages at c030bc84
    #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
    #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
    #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
    #14 [d72d3eb8] alloc_pages_vma at c030a845
    #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
    #16 [d72d3f00] handle_mm_fault at c02f36c6
    #17 [d72d3f30] do_page_fault at c05c70ed
    #18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431
    DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788
    SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50
    CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202

    It was also reported by Herbert van den Bergh against a 3.1-based kernel
    with the following snippet from the console log:

    BUG: unable to handle kernel paging request at 01c00008
    IP: [<c0522399>] isolate_migratepages+0x119/0x390
    *pdpt = 000000002f7ce001 *pde = 0000000000000000

    It is expected that it also affects 3.2.x and current mainline.

    The problem is that pfn_valid() is only called on the first PFN being
    checked and that PFN is not necessarily aligned. Let's say we have a
    case like this:

    H = MAX_ORDER_NR_PAGES boundary
    | = pageblock boundary
    m = cc->migrate_pfn
    f = cc->free_pfn
    o = memory hole

    H------|------H------|----m-Hoooooo|ooooooH-f----|------H

    The migrate_pfn is just below a memory hole and the free scanner is beyond
    the hole. When isolate_migratepages starts, it scans from migrate_pfn to
    migrate_pfn + pageblock_nr_pages, which is now in a memory hole. It checks
    pfn_valid() on the first PFN but then scans into the hole, where there are
    not necessarily valid struct pages.

    This patch ensures that isolate_migratepages() calls pfn_valid() when
    necessary (see the sketch after this entry).

    Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Michal Nazarewicz <mina86@mina86.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Mel Gorman
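
    The underlying rule is that the memmap is only guaranteed valid per
    MAX_ORDER_NR_PAGES-aligned block, so a scan starting from an unaligned
    PFN must re-check pfn_valid() at each such boundary. A small userspace
    sketch of that rule; the constant and the fake pfn_valid() are
    illustrative only, not kernel code:

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_ORDER_NR_PAGES 1024UL       /* illustrative block size */

    /* Pretend PFNs in [4096, 5120) are a memory hole with no struct pages. */
    static bool pfn_valid(unsigned long pfn)
    {
        return pfn < 4096 || pfn >= 5120;
    }

    int main(void)
    {
        unsigned long pfn = 3500;           /* unaligned start, like migrate_pfn */
        unsigned long end = 6000;

        while (pfn < end) {
            /* Re-validate whenever a MAX_ORDER-sized block boundary is crossed. */
            if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 && !pfn_valid(pfn)) {
                printf("pfn %lu: hole, skipping whole block\n", pfn);
                pfn += MAX_ORDER_NR_PAGES;
                continue;
            }
            /* ...here the real scanner would inspect pfn_to_page(pfn)... */
            pfn++;
        }
        return 0;
    }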
     

01 Nov, 2011

4 commits

  • There's no compact_zone_order() user outside file scope, so make it static.

    Signed-off-by: Kyungmin Park
    Acked-by: David Rientjes
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kyungmin Park
     
    In async mode, compaction doesn't migrate dirty or writeback pages, so
    it is pointless to isolate such a page only to re-add it to the LRU
    list.

    Of course, a page might be dirty or under writeback when compaction
    isolates it and yet be clean by the time migration is attempted, in
    which case it could be migrated, but that is very unlikely as the
    isolate-and-migrate cycle is much faster than writeout.

    So, this patch reduces CPU overhead and prevents unnecessary LRU
    churning.

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
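
    A minimal sketch of the isolation-time filter this describes, with
    simplified stand-in types instead of struct page and the real isolation
    code:

    #include <stdbool.h>
    #include <stdio.h>

    struct page_flags {
        bool dirty;
        bool writeback;
    };

    static bool worth_isolating(const struct page_flags *p, bool sync)
    {
        /* Async migration will not wait for writeout, so skip such pages. */
        if (!sync && (p->dirty || p->writeback))
            return false;
        return true;
    }

    int main(void)
    {
        struct page_flags clean = { false, false };
        struct page_flags dirty = { true,  false };

        printf("async, clean page: %s\n", worth_isolating(&clean, false) ? "isolate" : "skip");
        printf("async, dirty page: %s\n", worth_isolating(&dirty, false) ? "isolate" : "skip");
        printf("sync,  dirty page: %s\n", worth_isolating(&dirty, true)  ? "isolate" : "skip");
        return 0;
    }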
     
    Replace the ISOLATE_XXX macros with a bitwise isolate_mode_t type.
    Macros are not recommended here as they are type-unsafe and make
    debugging harder, since the symbols cannot be passed through to the
    debugger.

    Quote from Johannes:
    " Hmm, it would probably be cleaner to fully convert the isolation mode
    into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch moves the isolate mode definitions from swap.h to mmzone.h,
    as needed by memcontrol.h.

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
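
    A small standalone sketch of what a bitwise isolate_mode_t buys over the
    old tri-state; the flag names are illustrative, loosely modelled on the
    ones discussed rather than copied from the kernel header:

    #include <stdio.h>

    typedef unsigned int isolate_mode_t;

    enum {
        ISOLATE_INACTIVE = 1 << 0,   /* isolate from the inactive list */
        ISOLATE_ACTIVE   = 1 << 1,   /* isolate from the active list */
        ISOLATE_CLEAN    = 1 << 2,   /* only isolate clean pages */
    };

    static void describe(isolate_mode_t mode)
    {
        printf("mode %#x:%s%s%s\n", mode,
               mode & ISOLATE_INACTIVE ? " inactive" : "",
               mode & ISOLATE_ACTIVE   ? " active"   : "",
               mode & ISOLATE_CLEAN    ? " clean-only" : "");
    }

    int main(void)
    {
        /* "BOTH" is no longer a third state, just the two flags OR'd together. */
        describe(ISOLATE_INACTIVE);
        describe(ISOLATE_INACTIVE | ISOLATE_ACTIVE);
        describe(ISOLATE_INACTIVE | ISOLATE_ACTIVE | ISOLATE_CLEAN);
        return 0;
    }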
     
    acct_isolated() in compaction uses page_lru_base_type(), which returns
    only the base type of an LRU list, so it never returns LRU_ACTIVE_ANON
    or LRU_ACTIVE_FILE. In addition, cc->nr_anon and cc->nr_file are used
    only in acct_isolated(), so they do not need to be fields of
    compact_control.

    This patch removes those fields from compact_control and makes clear
    that acct_isolated() simply counts the number of anon and file pages
    isolated.

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

16 Jun, 2011

4 commits

    Asynchronous compaction is used when promoting to huge pages. This is
    all very nice but if a number of processes are compacting memory, a
    large number of pages can be isolated. An "asynchronous" process can
    stall for long periods of time as a result, with one user reporting that
    firefox could stall for tens of seconds. This patch aborts asynchronous
    compaction if too many pages are isolated, as it is better to fail a
    hugepage promotion than to stall a process.

    [minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
    Reported-and-tested-by: Ury Stankevich
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
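
    A toy sketch of the abort path described above; the threshold and
    function names only mirror the description, they are not the kernel's
    implementation:

    #include <stdio.h>
    #include <stdbool.h>

    enum compact_result { COMPACT_CONTINUE, COMPACT_PARTIAL };

    static bool too_many_isolated(unsigned long isolated, unsigned long inactive)
    {
        return isolated > inactive / 2;     /* illustrative threshold */
    }

    static enum compact_result migrate_step(bool sync,
                                            unsigned long isolated,
                                            unsigned long inactive)
    {
        if (too_many_isolated(isolated, inactive)) {
            if (!sync)
                return COMPACT_PARTIAL;     /* async: abort rather than stall */
            /* sync callers would instead wait for congestion to clear and retry */
        }
        return COMPACT_CONTINUE;
    }

    int main(void)
    {
        printf("async, heavy isolation: %s\n",
               migrate_step(false, 900, 1000) == COMPACT_PARTIAL ? "abort" : "continue");
        printf("async, light isolation: %s\n",
               migrate_step(false, 10, 1000) == COMPACT_PARTIAL ? "abort" : "continue");
        return 0;
    }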
     
    Compaction works with two scanners, a migration scanner and a free
    scanner. When the scanners cross, migration within the zone is
    complete. The location of each scanner is recorded on each cycle to
    avoid excessive scanning.

    When a zone is small and mostly reserved, it's very easy for the
    migration scanner to be close to the end of the zone. Then the
    following situation can occur:

    o migration scanner isolates some pages near the end of the zone
    o free scanner starts at the end of the zone but finds that the
    migration scanner is already there
    o free scanner gets reinitialised for the next cycle as
    cc->migrate_pfn + pageblock_nr_pages
    moving the free scanner into the next zone
    o migration scanner moves into the next zone

    When this happens, NR_ISOLATED accounting goes haywire because some of
    the accounting happens against the wrong zone. One zone's counter
    remains positive while the other goes negative, even though the overall
    global count is accurate. This was reported on X86-32 with !SMP because
    !SMP allows the negative counters to be visible; the bug should
    theoretically be possible on other configurations as well.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
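
    The essence of the fix is that neither scanner may be reinitialised
    outside the zone being compacted. A tiny sketch of that clamping, using
    simplified arithmetic rather than the kernel code:

    #include <stdio.h>

    static unsigned long clamp_pfn(unsigned long pfn,
                                   unsigned long zone_start, unsigned long zone_end)
    {
        if (pfn < zone_start)
            return zone_start;
        if (pfn > zone_end)
            return zone_end;
        return pfn;
    }

    int main(void)
    {
        unsigned long zone_start = 0x1000, zone_end = 0x1400;   /* small zone */
        unsigned long pageblock = 0x200;
        unsigned long migrate_pfn = 0x1380;                     /* near zone end */

        /* A naive reset would push the free scanner into the next zone... */
        unsigned long free_pfn = migrate_pfn + pageblock;
        printf("naive reset:   free_pfn = %#lx (beyond zone end %#lx)\n",
               free_pfn, zone_end);

        /* ...so clamp it to the zone before using it. */
        free_pfn = clamp_pfn(free_pfn, zone_start, zone_end);
        printf("clamped reset: free_pfn = %#lx\n", free_pfn);
        return 0;
    }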
     
    fragmentation_index() returns -1000 when the allocation might succeed,
    but this doesn't match the comment and code in compaction_suitable().
    I thought compaction_suitable() should return COMPACT_PARTIAL in the
    -1000 case, because in that case the allocation could succeed depending
    on watermarks.

    The impact of this is that compaction starts and compact_finished() is
    called, which rechecks the watermarks and the free lists. It should
    reach the same end result, in that compaction does not proceed, but it
    is more expensive.

    Acked-by: Mel Gorman
    Signed-off-by: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
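
    A simplified sketch of the decision being argued for: treat a
    fragmentation index of -1000 as "the allocation may already succeed"
    and return COMPACT_PARTIAL up front (the kernel additionally confirms
    the watermarks; this stand-in omits that):

    #include <stdio.h>

    enum compact_result { COMPACT_SKIPPED, COMPACT_PARTIAL, COMPACT_CONTINUE };

    static enum compact_result suitability(int fragindex, int extfrag_threshold)
    {
        if (fragindex >= 0 && fragindex <= extfrag_threshold)
            return COMPACT_SKIPPED;         /* failure is due to low memory */
        if (fragindex == -1000)
            return COMPACT_PARTIAL;         /* allocation could already succeed */
        return COMPACT_CONTINUE;            /* fragmentation: compact the zone */
    }

    int main(void)
    {
        printf("fragindex -1000 -> %d (expect COMPACT_PARTIAL=1)\n",
               suitability(-1000, 500));
        printf("fragindex   200 -> %d (expect COMPACT_SKIPPED=0)\n",
               suitability(200, 500));
        printf("fragindex   800 -> %d (expect COMPACT_CONTINUE=2)\n",
               suitability(800, 500));
        return 0;
    }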
     
  • Commit 56de7263fcf3 ("mm: compaction: direct compact when a high-order
    allocation fails") introduced a check for cc->order == -1 in
    compact_finished. We should continue compacting in that case because
    the request came from userspace and there is no particular order to
    compact for. A similar check was added by 82478fb7 ("mm: compaction:
    prevent division-by-zero during user-requested compaction") for
    compaction_suitable().

    The check is, however, done after zone_watermark_ok(), which uses order
    as the right-hand argument of a shift. Not only is the watermark check
    pointless if we can break out without it, it also evaluates 1 << -1,
    which is not well defined (at least by the C standard). Let's move the
    -1 check above zone_watermark_ok().

    [minchan.kim@gmail.com: caught compaction_suitable]
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
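
    A minimal stand-in for the reordering: handle order == -1 before any
    expression that shifts by order. The watermark formula below is only a
    placeholder to show where the shift would occur:

    #include <stdio.h>

    enum compact_result { COMPACT_CONTINUE, COMPACT_PARTIAL };

    static enum compact_result finished_check(int order, unsigned long free_pages,
                                              unsigned long watermark)
    {
        /* Explicit user-requested compaction: no target order, keep going. */
        if (order == -1)
            return COMPACT_CONTINUE;

        /* Only now is it safe to use order in shifts such as 1UL << order. */
        if (free_pages >= watermark + (1UL << order))
            return COMPACT_PARTIAL;

        return COMPACT_CONTINUE;
    }

    int main(void)
    {
        printf("order -1 -> %d (expect COMPACT_CONTINUE=0)\n",
               finished_check(-1, 1000, 100));
        printf("order  4 -> %d (expect COMPACT_PARTIAL=1)\n",
               finished_check(4, 1000, 100));
        return 0;
    }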
     

23 Mar, 2011

4 commits

    Compaction isolates pages for migration in isolate_migratepages(). While
    it is scanning, IRQs are disabled on the mistaken assumption that the
    scanning should be short. Tests show this to be true for the most part,
    but contention times on the LRU lock can be increased. Before this
    patch, the IRQ-disabled times for a simple test looked like:

    Total sampled time IRQs off (not real total time): 5493
    Event shrink_inactive_list..shrink_zone 1596 us count 1
    Event shrink_inactive_list..shrink_zone 1530 us count 1
    Event shrink_inactive_list..shrink_zone 956 us count 1
    Event shrink_inactive_list..shrink_zone 541 us count 1
    Event shrink_inactive_list..shrink_zone 531 us count 1
    Event split_huge_page..add_to_swap 232 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __wake_up..__wake_up 1 us count 1

    This patch reduces the worst-case IRQs-disabled latencies by releasing
    the lock every SWAP_CLUSTER_MAX pages that are scanned and releasing the
    CPU if necessary. The cost of this is that the process performing
    compaction will be slower, but IRQs being disabled for too long has
    worse consequences, as the following report shows:

    Total sampled time IRQs off (not real total time): 4367
    Event shrink_inactive_list..shrink_zone 881 us count 1
    Event shrink_inactive_list..shrink_zone 875 us count 1
    Event shrink_inactive_list..shrink_zone 868 us count 1
    Event shrink_inactive_list..shrink_zone 555 us count 1
    Event split_huge_page..add_to_swap 495 us count 1
    Event compact_zone..compact_zone_order 269 us count 1
    Event split_huge_page..add_to_swap 266 us count 1
    Event shrink_inactive_list..shrink_zone 85 us count 1
    Event save_args..call_softirq 36 us count 2
    Event __wake_up..__wake_up 1 us count 1

    [akpm@linux-foundation.org: simplify with s/unlocked/locked/]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
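
    The pattern is simply "drop the lock every SWAP_CLUSTER_MAX pages, give
    others a chance to run, re-take it". A userspace sketch with a pthread
    mutex standing in for the zone LRU spinlock:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define SWAP_CLUSTER_MAX 32UL           /* matches the kernel's usual value */

    static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

    static unsigned long scan_range(unsigned long start, unsigned long end)
    {
        unsigned long pfn, isolated = 0;

        pthread_mutex_lock(&lru_lock);
        for (pfn = start; pfn < end; pfn++) {
            /* Periodically release the lock and yield the CPU. */
            if (pfn % SWAP_CLUSTER_MAX == 0) {
                pthread_mutex_unlock(&lru_lock);
                sched_yield();              /* stand-in for cond_resched() */
                pthread_mutex_lock(&lru_lock);
            }
            isolated++;                     /* pretend we isolated this page */
        }
        pthread_mutex_unlock(&lru_lock);
        return isolated;
    }

    int main(void)
    {
        printf("isolated %lu pages with periodic lock release\n",
               scan_range(1000, 1200));
        return 0;
    }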
     
    compaction_alloc() isolates free pages to be used as migration targets.
    While it is scanning, IRQs are disabled on the mistaken assumption that
    the scanning should be short. Analysis showed that IRQs were in fact
    being disabled for substantial time. A simple test was run using large
    anonymous mappings with transparent hugepage support enabled to trigger
    frequent compactions. A monitor sampled the worst IRQ-off latencies and
    a post-processing tool found the following:

    Total sampled time IRQs off (not real total time): 22355
    Event compaction_alloc..compaction_alloc 8409 us count 1
    Event compaction_alloc..compaction_alloc 7341 us count 1
    Event compaction_alloc..compaction_alloc 2463 us count 1
    Event compaction_alloc..compaction_alloc 2054 us count 1
    Event shrink_inactive_list..shrink_zone 1864 us count 1
    Event shrink_inactive_list..shrink_zone 88 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __make_request..__blk_run_queue 24 us count 1
    Event __alloc_pages_nodemask..__alloc_pages_nodemask 6 us count 1

    i.e. compaction disabled IRQs for a prolonged period of time, over 8ms
    in one instance. The full report generated by the tool can be found at

    http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-vanilla-micro.report

    This patch reduces the time IRQs are disabled by simply disabling IRQs
    at the last possible moment. An updated IRQs-off summary report then
    looks like:

    Total sampled time IRQs off (not real total time): 5493
    Event shrink_inactive_list..shrink_zone 1596 us count 1
    Event shrink_inactive_list..shrink_zone 1530 us count 1
    Event shrink_inactive_list..shrink_zone 956 us count 1
    Event shrink_inactive_list..shrink_zone 541 us count 1
    Event shrink_inactive_list..shrink_zone 531 us count 1
    Event split_huge_page..add_to_swap 232 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __wake_up..__wake_up 1 us count 1

    A full report is again available at

    http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-minimiseirq-free-v1r4-micro.report

    As should be obvious, IRQ-disabled latencies due to compaction are
    almost eliminated for this particular test.

    [aarcange@redhat.com: Fix initialisation of isolated]
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Acked-by: Andrea Arcangeli
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Since cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED counting"),
    many callers of migrate_pages() check its return value instead of
    list_empty(). This patch makes compaction's use of migrate_pages()
    consistent with the others; it should not change the old behaviour.

    Signed-off-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
    order > 0") due to reports stating that kswapd CPU usage was higher and
    IRQs were being disabled more frequently. This was reported at
    http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

    Without this patch applied, CPU usage by kswapd hovers around the 20% mark
    according to the tester (Arthur Marsh:
    http://www.spinics.net/linux/fedora/alsa-user/msg09899.html). With this
    patch applied, it's around 2%.

    The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
    triggered by high-order allocations hitting the low watermark for their
    order and waking kswapd on kernels with CONFIG_COMPACTION set. The most
    common trigger for this is network cards configured for jumbo frames but
    it's also possible it'll be triggered by fork-heavy workloads (order-1)
    and some wireless cards which depend on order-1 allocations.

    The symptoms for the user will be high CPU usage by kswapd in low-memory
    situations which could be confused with another writeback problem. While
    a patch like 5a03b051 may be reintroduced in the future, this patch plays
    it safe for now and reverts it.

    [mel@csn.ul.ie: Beefed up the changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reported-by: Arthur Marsh
    Tested-by: Arthur Marsh
    Cc: [2.6.38.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

21 Jan, 2011

1 commit

    Up until 3e7d344 ("mm: vmscan: reclaim order-0 and use compaction instead
    of lumpy reclaim"), compaction skipped calculating the fragmentation index
    of a zone when compaction was explicitly requested through the procfs
    knob.

    However, when compaction_suitable was introduced, it did not come with an
    extra check for order == -1, set on explicit compaction requests, and
    passed this order on to the fragmentation index calculation, where it
    overshifts the number of requested pages, leading to a division by zero.

    This patch makes sure that order == -1 is recognized as the flag it is
    rather than passing it along as valid order parameter.

    [akpm@linux-foundation.org: add comment, per Mel]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

14 Jan, 2011

8 commits

    It makes no sense not to enable compaction for small-order allocations,
    as we don't want to end up with bad order-2 allocations while order-9
    allocations are good and graceful.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    This takes advantage of memory compaction to properly generate pages of
    order > 0 if regular page reclaim fails, the priority level becomes more
    severe, and the proper watermarks are still not reached.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It's not worth migrating transparent hugepages during compaction. Those
    hugepages don't create fragmentation.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    try_to_compact_pages() is initially called to migrate pages only
    asynchronously and kswapd always compacts asynchronously. Both are
    being optimistic, so it is important to complete the work as quickly as
    possible to minimise stalls.

    This patch alters the scanner when asynchronous to only consider
    MIGRATE_MOVABLE pageblocks as migration candidates. This reduces stalls
    when allocating huge pages while not impairing allocation success rates as
    a full scan will be performed if necessary after direct reclaim.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
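
    A toy version of the async-only filter described above; the enum and
    helper are illustrative stand-ins, not the kernel's code:

    #include <stdio.h>
    #include <stdbool.h>

    enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE };

    static bool suitable_migration_block(enum migratetype mt, bool sync)
    {
        if (!sync && mt != MIGRATE_MOVABLE)
            return false;                   /* async pass: skip non-movable blocks */
        return true;
    }

    int main(void)
    {
        printf("async, unmovable block: %s\n",
               suitable_migration_block(MIGRATE_UNMOVABLE, false) ? "scan" : "skip");
        printf("async, movable block:   %s\n",
               suitable_migration_block(MIGRATE_MOVABLE, false) ? "scan" : "skip");
        printf("sync,  unmovable block: %s\n",
               suitable_migration_block(MIGRATE_UNMOVABLE, true) ? "scan" : "skip");
        return 0;
    }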
     
  • With the introduction of the boolean sync parameter, the API looks a
    little inconsistent as offlining is still an int. Convert offlining to a
    bool for the sake of being tidy.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …ompaction in the faster path

    Migration synchronously waits for writeback if the initial pass fails.
    Callers of memory compaction do not necessarily want this behaviour if
    the caller is latency sensitive or expects that synchronous migration is
    not going to have a significantly better success rate.

    This patch adds a sync parameter to migrate_pages() allowing the caller to
    indicate if wait_on_page_writeback() is allowed within migration or not.
    For reclaim/compaction, try_to_compact_pages() is first called
    asynchronously, direct reclaim runs and then try_to_compact_pages() is
    called synchronously as there is a greater expectation that it'll succeed.

    [akpm@linux-foundation.org: build/merge fix]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Lumpy reclaim is disruptive. It reclaims a large number of pages and
    ignores the age of the pages it reclaims. This can incur significant
    stalls and potentially increase the number of major faults.

    Compaction has reached the point where it is considered reasonably stable
    (meaning it has passed a lot of testing) and is a potential candidate for
    displacing lumpy reclaim. This patch introduces an alternative to lumpy
    reclaim, available when compaction is enabled, called reclaim/compaction.
    The basic operation is very simple: instead of selecting a contiguous
    range of pages to reclaim, a number of order-0 pages are reclaimed and
    compaction is then run later by either kswapd (compact_zone_order()) or
    direct compaction (__alloc_pages_direct_compact()).

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: use conventional task_struct naming]
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    In preparation for patches promoting the use of memory compaction over
    lumpy reclaim, this patch adds trace points for memory compaction
    activity. Using them, we can monitor the scanning activity of the
    migration and free page scanners as well as the number and success rates
    of pages passed to page migration.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

23 Dec, 2010

1 commit

    del_page_from_lru_list() already calls mem_cgroup_del_lru(), so we must
    not call it again; doing so adds unnecessary overhead.

    It was not a runtime bug because the TestClearPageCgroupAcctLRU() early
    in mem_cgroup_del_lru_list() prevents any double-deletion and the like.

    Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

10 Sep, 2010

1 commit

  • Iram reported that compaction's too_many_isolated() loops forever.
    (http://www.spinics.net/lists/linux-mm/msg08123.html)

    The meminfo when the situation happened showed that inactive anon was
    zero. That's because the system had no memory pressure until then, so
    all anon pages were still on the active LRU. Compaction can select from
    the active LRU as well as the inactive LRU, which is different from
    vmscan's isolation and is why there are two separate too_many_isolated()
    checks.

    While compaction can isolate pages from both the active and inactive
    lists, the current implementation of too_many_isolated() only considers
    the inactive list, which caused Iram's problem.

    This patch handles active and inactive fairly, because we cannot predict
    from where, or how many, pages compaction will isolate.

    This patch changes the check from (nr_isolated > nr_inactive) to
    (nr_isolated > (nr_active + nr_inactive) / 2).

    Signed-off-by: Minchan Kim
    Reported-by: Iram Shahzad
    Acked-by: Mel Gorman
    Acked-by: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
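
    The changed throttle in sketch form; the numbers and helper are
    stand-ins, but the comparison mirrors the one quoted above:

    #include <stdio.h>
    #include <stdbool.h>

    static bool too_many_isolated(unsigned long active, unsigned long inactive,
                                  unsigned long isolated)
    {
        /* old check was: isolated > inactive, which spins forever when inactive == 0 */
        return isolated > (active + inactive) / 2;
    }

    int main(void)
    {
        /* Iram's case: no inactive anon pages yet, a handful of pages isolated. */
        printf("inactive=0, isolated=32:    too many? %s\n",
               too_many_isolated(10000, 0, 32) ? "yes" : "no (old check would spin)");
        printf("active=100, inactive=100, isolated=150: too many? %s\n",
               too_many_isolated(100, 100, 150) ? "yes" : "no");
        return 0;
    }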
     

25 May, 2010

5 commits

  • …hen it should be reclaimed

    The kernel applies some heuristics when deciding if memory should be
    compacted or reclaimed to satisfy a high-order allocation. One of these
    is based on the fragmentation. If the index is below 500, memory will not
    be compacted. This choice is arbitrary and not based on data. To help
    optimise the system and set a sensible default for this value, this patch
    adds a sysctl extfrag_threshold. The kernel will only compact memory if
    the fragmentation index is above the extfrag_threshold.

    [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
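
    In sketch form, the threshold is a straight comparison against the
    fragmentation index; the tunable lives at /proc/sys/vm/extfrag_threshold
    and the stand-in below uses the 500 default mentioned above:

    #include <stdio.h>
    #include <stdbool.h>

    static int extfrag_threshold = 500;     /* tunable via /proc/sys/vm/extfrag_threshold */

    static bool should_compact(int fragindex)
    {
        if (fragindex < 0)
            return false;                   /* index not meaningful here */
        return fragindex > extfrag_threshold;
    }

    int main(void)
    {
        printf("fragindex 300 -> %s\n", should_compact(300) ? "compact" : "reclaim");
        printf("fragindex 700 -> %s\n", should_compact(700) ? "compact" : "reclaim");
        return 0;
    }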
     
  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, it is checked if a suitable page has been freed and if so, it
    returns.

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Add a per-node sysfs file called compact. When the file is written to,
    each zone in that node is compacted. The intention is that this would
    be used by something like a job scheduler in a batch system before a job
    starts, so that the job can allocate the maximum number of hugepages
    without significant start-up cost.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
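
    In practice, this means a single node can be compacted from userspace
    with something like "echo 1 > /sys/devices/system/node/node0/compact"
    (assuming node0 exists and compaction support is built in).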
     
  • Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is
    written to the file, all zones are compacted. The expected user of such a
    trigger is a job scheduler that prepares the system before the target
    application runs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
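
    For example, "echo 1 > /proc/sys/vm/compact_memory" compacts all zones
    on all nodes; per the description above, any written value works, and
    the file exists only when compaction support is built in.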
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas, consuming the free pages within them and making them
    available to the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
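
    Putting the two scanners together, the overall cycle looks roughly like
    the skeleton below; the function names echo the description above and
    the bodies are placeholders, not the kernel implementation:

    #include <stdio.h>
    #include <stdbool.h>

    struct compact_control {
        unsigned long migrate_pfn;  /* migration scanner: bottom of zone, moving up */
        unsigned long free_pfn;     /* free scanner: top of zone, moving down */
        unsigned long pageblock;    /* pages handled per step */
    };

    static bool compact_finished(const struct compact_control *cc)
    {
        return cc->migrate_pfn >= cc->free_pfn;     /* scanners have met */
    }

    static void compact_zone(struct compact_control *cc)
    {
        int cycles = 0;

        while (!compact_finished(cc)) {
            /* isolate_migratepages(): movable pages go onto a private list */
            cc->migrate_pfn += cc->pageblock;
            /* isolate_freepages() + migrate_pages(): move them to free pages */
            cc->free_pfn -= cc->pageblock;
            cycles++;
        }
        printf("compaction finished after %d cycles\n", cycles);
    }

    int main(void)
    {
        struct compact_control cc = { .migrate_pfn = 0, .free_pfn = 4096,
                                      .pageblock = 512 };
        compact_zone(&cc);
        return 0;
    }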