22 Aug, 2012

3 commits

  • Jim Schutt reported a problem that pointed at compaction contending
    heavily on locks. The workload is straightforward and, in his own words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

    Early in the test everything looks fine

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
    27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
    28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
    6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
    22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

    and then it goes to pot

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
    207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
    123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
    123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
    622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
    223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

    Note that system CPU usage is very high and the number of blocks being
    written out has dropped by 42%. He analysed this with perf and found:

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5
    34.63% [k] _raw_spin_lock_irqsave
    |
    |--97.30%-- isolate_freepages
    | compaction_alloc
    | unmap_and_move
    | migrate_pages
    | compact_zone
    | compact_zone_order
    | try_to_compact_pages
    | __alloc_pages_direct_compact
    | __alloc_pages_slowpath
    | __alloc_pages_nodemask
    | alloc_pages_vma
    | do_huge_pmd_anonymous_page
    | handle_mm_fault
    | do_page_fault
    | page_fault
    | |
    | |--87.39%-- skb_copy_datagram_iovec
    | | tcp_recvmsg
    | | inet_recvmsg
    | | sock_recvmsg
    | | sys_recvfrom
    | | system_call
    | | __recv
    | | |
    | | --100.00%-- (nil)
    | |
    | --12.61%-- memcpy
    --2.70%-- [...]

    There was other data but primarily it all shows that compaction contends
    heavily on the zone->lock and the zone->lru_lock.

    commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration] noted that it was possible for
    migration to hold the lru_lock for an excessive amount of time. Very
    broadly speaking this patch expands the concept.

    This patch introduces compact_checklock_irqsave() to check if a lock
    is contended or the process needs to be scheduled. If either condition
    is true then async compaction is aborted and the caller is informed.
    The page allocator will fail a THP allocation if compaction failed due
    to contention. This patch also introduces compact_trylock_irqsave()
    which will acquire the lock only if it is not contended and the process
    does not need to schedule.
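
    As an illustration only, a minimal sketch of that check (based on the
    description above; field names such as cc->contended are assumptions and
    this is not necessarily the exact mainline code) could look like:

    static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
                                          bool locked, struct compact_control *cc)
    {
        /* Back out if the lock is contended or we need to reschedule */
        if (need_resched() || spin_is_contended(lock)) {
            if (locked) {
                spin_unlock_irqrestore(lock, *flags);
                locked = false;
            }

            /* Async compaction aborts and informs the caller */
            if (!cc->sync) {
                cc->contended = true;   /* assumed field used to inform the caller */
                return false;
            }

            cond_resched();
        }

        if (!locked)
            spin_lock_irqsave(lock, *flags);
        return true;    /* the lock is held on return */
    }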

    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
    left") introduced a caching mechanism to reduce the amount work the free
    page scanner does in compaction. However, it has a problem. Consider
    two process simultaneously scanning free pages

                                                C
    Process A           M     S     F
                    |---------------------------------------|
    Process B           M                       FS

    C is zone->compact_cached_free_pfn
    S is cc->start_free_pfn
    M is cc->migrate_pfn
    F is cc->free_pfn

    In this diagram, Process A has just reached its migrate scanner, wrapped
    around and updated compact_cached_free_pfn accordingly.

    Simultaneously, Process B finishes isolating in a block and updates
    compact_cached_free_pfn again to the location of its free scanner.

    Process A moves to "end_of_zone - one_pageblock" and runs this check

    if (cc->order > 0 && (!cc->wrapped ||
                          zone->compact_cached_free_pfn >
                          cc->start_free_pfn))
            pfn = min(pfn, zone->compact_cached_free_pfn);

    compact_cached_free_pfn is above where it started so the free scanner
    skips almost the entire space it should have scanned. When there are
    multiple processes compacting it can end up in a situation where the entire
    zone is not being scanned at all. Further, it is possible for two
    processes to ping-pong updates to compact_cached_free_pfn, which is
    effectively random.

    Overall, the end result wrecks allocation success rates.

    There is not an obvious way around this problem without introducing new
    locking and state so this patch takes a different approach.

    First, it gets rid of the skip logic because it's not clear that it
    matters if two free scanners happen to be in the same block but with
    racing updates it's too easy for it to skip over blocks it should not.

    Second, it updates compact_cached_free_pfn in a more limited set of
    circumstances.

    If a scanner has wrapped, it updates compact_cached_free_pfn to the end
    of the zone. When a wrapped scanner isolates a page, it updates
    compact_cached_free_pfn to point to the highest pageblock it
    can isolate pages from.

    If a scanner has not wrapped when it has finished isolating pages, it
    checks if compact_cached_free_pfn is pointing to the end of the
    zone. If so, the value is updated to point to the highest
    pageblock that pages were isolated from. This value will not
    be updated again until a free page scanner wraps and resets
    compact_cached_free_pfn.

    This is not optimal and it can still race but the compact_cached_free_pfn
    will be pointing to or very near a pageblock with free pages.
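
    As a sketch of that policy (illustrative only; the helper name is invented
    and the exact mainline code may differ, but cc->wrapped and
    zone->compact_cached_free_pfn follow the description above):

    /*
     * Called after a batch of free pages has been isolated. The reset of
     * compact_cached_free_pfn to the zone end when a scanner wraps is done
     * elsewhere and is not shown here.
     */
    static void record_highest_free_pageblock(struct zone *zone,
                                              struct compact_control *cc,
                                              unsigned long highest_isolated_pfn)
    {
        unsigned long zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;

        if (cc->wrapped) {
            /* A wrapped scanner records the highest pageblock it isolated from */
            zone->compact_cached_free_pfn = highest_isolated_pfn;
        } else if (zone->compact_cached_free_pfn == zone_end_pfn) {
            /*
             * A scanner that has not wrapped only replaces the initial
             * "end of zone" value; later updates wait until a free
             * scanner wraps and resets compact_cached_free_pfn.
             */
            zone->compact_cached_free_pfn = highest_isolated_pfn;
        }
    }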

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit aff622495c9a ("vmscan: only defer compaction for failed order and
    higher") fixed bad deferring policy but made mistake about checking
    compact_order_failed in __compact_pgdat(). So it can't update
    compact_order_failed with the new order. This ends up preventing
    correct operation of policy deferral. This patch fixes it.

    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

01 Aug, 2012

1 commit

  • Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.

    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    The obvious solution is to have isolate_freepages remember where it left
    off last time, and continue at that point the next time it gets invoked
    for an order > 0 compaction. This could cause compaction to fail if
    cc->free_pfn and cc->migrate_pfn are close together initially; in that
    case we restart from the end of the zone and try once more.

    Forced full (order == -1) compactions are left alone.
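
    As a rough sketch of the idea (illustrative only; the helper names are
    invented and the exact mainline code differs), the free scanner's
    starting point becomes something like:

    /* Start PFN of the last pageblock in a zone: the traditional starting point */
    static unsigned long zone_last_pageblock_pfn(struct zone *zone)
    {
        unsigned long pfn = zone->zone_start_pfn + zone->spanned_pages;

        return pfn & ~(pageblock_nr_pages - 1);
    }

    /* Where should the free page scanner begin this compaction cycle? */
    static unsigned long free_scanner_start_pfn(struct zone *zone,
                                                struct compact_control *cc)
    {
        /* Forced full compaction (order == -1) always starts at the zone end */
        if (cc->order < 0)
            return zone_last_pageblock_pfn(zone);

        /* Otherwise resume from wherever the previous cycle left off */
        return min(zone_last_pageblock_pfn(zone),
                   zone->compact_cached_free_pfn);
    }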

    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: s/laste/last/, use 80 cols]
    Signed-off-by: Rik van Riel
    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Cc: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

12 Jul, 2012

1 commit

  • If page migration cannot charge the temporary page to the memcg,
    migrate_pages() will return -ENOMEM. This isn't considered in memory
    compaction however, and the loop continues to iterate over all
    pageblocks trying to isolate and migrate pages. If a small number of
    very large memcgs happen to be OOM, these attempts will mostly be
    futile, leading to an enormous amount of CPU consumption due to the
    page migration failures.

    This patch will short circuit and fail memory compaction if
    migrate_pages() returns -ENOMEM. COMPACT_PARTIAL is returned in case
    some migrations were successful so that the page allocator will retry.
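
    A sketch of that short-circuit, as it would sit in the migration loop of
    compact_zone() (simplified fragment; argument details and the surrounding
    code are assumptions rather than the exact mainline implementation):

    err = migrate_pages(&cc->migratepages, compaction_alloc,
                        (unsigned long)cc, false,
                        cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
    if (err) {
        /* Put back any pages that could not be migrated */
        putback_lru_pages(&cc->migratepages);
        cc->nr_migratepages = 0;
        if (err == -ENOMEM) {
            /* memcg could not charge the target page: stop compacting */
            ret = COMPACT_PARTIAL; /* lets the page allocator retry */
            goto out;
        }
    }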

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Jun, 2012

1 commit

  • This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199.

    That commit seems to be the cause of the mm compaction list corruption
    issues that Dave Jones reported. The locking (or rather, the absence
    thereof) is dubious, as is the use of the 'page' variable once it has
    been found to be outside the pageblock range.

    So revert it for now, we can re-visit this for 3.6. If we even need to:
    as Minchan Kim says, "The patch wasn't a bug fix and even test workload
    was very theoretical".

    Reported-and-tested-by: Dave Jones
    Acked-by: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 May, 2012

3 commits

  • Take lruvec further: pass it instead of zone to add_page_to_lru_list() and
    del_page_from_lru_list(); and have pagevec_lru_move_fn() pass lruvec down
    to its target functions.

    This cleanup eliminates a swathe of cruft in memcontrol.c, including
    mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
    mem_cgroup_lru_move_lists() - which never actually touched the lists.

    In their place, mem_cgroup_page_lruvec() decides the lruvec (previously a
    side-effect of the add), and mem_cgroup_update_lru_size() maintains the
    lru_size stats.
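
    The reshaped helpers, as implied by the description above (declaration
    sketch only; the exact argument order is an assumption):

    /* Decide the lruvec for a page (previously a side-effect of the add) */
    struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone);

    /* Maintain the per-lruvec lru_size statistics */
    void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                    int nr_pages);

    /* The LRU list helpers now take the lruvec rather than the zone */
    void add_page_to_lru_list(struct page *page, struct lruvec *lruvec,
                              enum lru_list lru);
    void del_page_from_lru_list(struct page *page, struct lruvec *lruvec,
                                enum lru_list lru);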

    Whilst these are simplifications in their own right, the goal is to bring
    the evaluation of lruvec next to the spin_locking of the lrus, in
    preparation for a future patch.

    Signed-off-by: Hugh Dickins
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
    completely remove the anon/file and active/inactive LRU type filters from
    __isolate_lru_page(), because isolation for order-0 reclaim always
    isolates pages from the right LRU list. Page isolation for lumpy
    shrink_inactive_list() or memory compaction is in any case allowed to
    isolate pages from all evictable LRU lists.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • When MIGRATE_UNMOVABLE pages are freed from a MIGRATE_UNMOVABLE type
    pageblock (and some MIGRATE_MOVABLE pages are left in it), waiting until an
    allocation takes ownership of the block may take too long. The type of
    the pageblock remains unchanged so the pageblock cannot be used as a
    migration target during compaction.

    Fix it by:

    * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE] and
    COMPACT_SYNC) and then converting the sync field in struct compact_control
    to use it (see the sketch after this list).

    * Adding nr_pageblocks_skipped field to struct compact_control and
    tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
    If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
    try_to_compact_pages() (COMPACT_COMPLETE), it implies that there is no
    suitable page for allocation. In this case, check whether there were
    enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
    COMPACT_ASYNC_UNMOVABLE mode.

    * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
    COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
    finding PageBuddy pages, page_count(page) == 0 or PageLRU pages. If all
    pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
    sets change the whole pageblock type to MIGRATE_MOVABLE.
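
    A sketch of the new modes named in the first item above (derived purely
    from this description; note that this commit was later reverted, see the
    04 Jun 2012 entry):

    enum compact_mode {
        COMPACT_ASYNC_MOVABLE,      /* async; only MIGRATE_MOVABLE pageblocks as targets */
        COMPACT_ASYNC_UNMOVABLE,    /* async second pass; also scan MIGRATE_UNMOVABLE pageblocks */
        COMPACT_SYNC,               /* full synchronous compaction */
    };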

    My particular test case (on an ARM EXYNOS4 device with 512 MiB, which means
    131072 standard 4KiB pages in 'Normal' zone) is to:

    - allocate 120000 pages for kernel's usage
    - free every second page (60000 pages) of memory just allocated
    - allocate and use 60000 pages from user space
    - free remaining 60000 pages of kernel memory
    (now we have fragmented memory occupied mostly by user space pages)
    - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage

    The results:
    - with compaction disabled I get 11 successful allocations
    - with compaction enabled - 14 successful allocations
    - with this patch I'm able to get all 100 successful allocations

    NOTE: If we can make kswapd aware of order-0 requests during compaction, we
    can enhance kswapd by changing its mode to COMPACT_ASYNC_FULL
    (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the
    following thread:

    http://marc.info/?l=linux-mm&m=133552069417068&w=2

    [minchan@kernel.org: minor cleanups]
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Marek Szyprowski
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

21 May, 2012

5 commits

  • The MIGRATE_CMA migration type has two main characteristics:
    (i) only movable pages can be allocated from MIGRATE_CMA
    pageblocks and (ii) page allocator will never change migration
    type of MIGRATE_CMA pageblocks.

    This guarantees (to some degree) that a page in a MIGRATE_CMA
    pageblock can always be migrated somewhere else (unless there's no
    memory left in the system).

    It is designed to be used for allocating big chunks (e.g. 10 MiB)
    of physically contiguous memory. Once a driver requests
    contiguous memory, pages from MIGRATE_CMA pageblocks may be
    migrated away to create a contiguous block.

    To minimise the number of migrations, the MIGRATE_CMA migration type
    is the last type tried when the page allocator falls back to other
    migration types.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • This commit exports some of the functions from the compaction.c file
    outside of it, adding their declarations to the internal.h header
    file so that other mm related code can use them.

    This forces compaction.c to always be compiled (as opposed to being
    compiled only if CONFIG_COMPACTION is defined) but, to avoid
    introducing code that the user did not ask for, part of compaction.c
    is now wrapped in an #ifdef.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • This commit introduces isolate_freepages_range() function which
    generalises isolate_freepages_block() so that it can be used on
    arbitrary PFN ranges.

    isolate_freepages_block() is left with only minor changes.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • This commit creates a map_pages() function which maps pages freed
    using split_free_page(). This merely moves some code from
    isolate_freepages() so that it can be reused in other places.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • This commit introduces isolate_migratepages_range() function which
    extracts functionality from isolate_migratepages() so that it can be
    used on arbitrary PFN ranges.

    isolate_migratepages() function is implemented as a simple wrapper
    around isolate_migratepages_range().

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     

22 Mar, 2012

4 commits

  • "order" is -1 when compacting via /proc/sys/vm/compact_memory. Making
    it unsigned causes a bug in __compact_pgdat() when we test:

    if (cc->order < 0 || !compaction_deferred(zone, cc->order))
            compact_zone(zone, cc);

    [akpm@linux-foundation.org: make __compact_pgdat()'s comparison match other code sites]
    Signed-off-by: Dan Carpenter
    Cc: Mel Gorman
    Cc: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • I get this lockdep warning from swapping load on linux-next, due to
    "vmscan: kswapd carefully call compaction".

    =================================
    [ INFO: inconsistent lock state ]
    3.3.0-rc2-next-20120201 #5 Not tainted
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (pcpu_alloc_mutex){+.+.?.}, at: [] pcpu_alloc+0x67/0x325
    {RECLAIM_FS-ON-W} state was registered at:
    [] mark_held_locks+0xd7/0x103
    [] lockdep_trace_alloc+0x85/0x9e
    [] __kmalloc+0x6c/0x14b
    [] pcpu_mem_zalloc+0x59/0x62
    [] pcpu_extend_area_map+0x26/0xb1
    [] pcpu_alloc+0x182/0x325
    [] __alloc_percpu+0xb/0xd
    [] snmp_mib_init+0x1e/0x2e
    [] ipv4_mib_init_net+0x7a/0x184
    [] ops_init.clone.0+0x6b/0x73
    [] register_pernet_operations+0x61/0xa0
    [] register_pernet_subsys+0x29/0x42
    [] inet_init+0x1ad/0x252
    [] do_one_initcall+0x7a/0x12f
    [] kernel_init+0x9d/0x11e
    [] kernel_thread_helper+0x4/0x10
    irq event stamp: 656613
    hardirqs last enabled at (656613): [] __mutex_unlock_slowpath+0x104/0x128
    hardirqs last disabled at (656612): [] __mutex_unlock_slowpath+0x5c/0x128
    softirqs last enabled at (655568): [] __do_softirq+0x120/0x136
    softirqs last disabled at (654757): [] call_softirq+0x1c/0x30

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(pcpu_alloc_mutex);

    lock(pcpu_alloc_mutex);

    *** DEADLOCK ***

    no locks held by kswapd0/28.

    stack backtrace:
    Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5
    Call Trace:
    [] print_usage_bug+0x1bf/0x1d0
    [] ? print_irq_inversion_bug+0x1d9/0x1d9
    [] mark_lock_irq+0xbb/0x22e
    [] ? free_hot_cold_page+0x13d/0x14f
    [] mark_lock+0x251/0x331
    [] mark_irqflags+0x12f/0x141
    [] __lock_acquire+0x58d/0x753
    [] ? pcpu_alloc+0x67/0x325
    [] lock_acquire+0x54/0x6a
    [] ? pcpu_alloc+0x67/0x325
    [] ? add_preempt_count+0xa9/0xae
    [] mutex_lock_nested+0x5e/0x315
    [] ? pcpu_alloc+0x67/0x325
    [] ? __lock_acquire+0x6dc/0x753
    [] ? __pagevec_release+0x2c/0x2c
    [] pcpu_alloc+0x67/0x325
    [] ? __pagevec_release+0x2c/0x2c
    [] __alloc_percpu+0xb/0xd
    [] schedule_on_each_cpu+0x23/0x110
    [] lru_add_drain_all+0x10/0x12
    [] __compact_pgdat+0x20/0x182
    [] compact_pgdat+0x27/0x29
    [] ? zone_watermark_ok+0x1a/0x1c
    [] balance_pgdat+0x732/0x751
    [] kswapd+0x15f/0x178
    [] ? balance_pgdat+0x751/0x751
    [] kthread+0x84/0x8c
    [] kernel_thread_helper+0x4/0x10
    [] ? finish_task_switch+0x85/0xea
    [] ? retint_restore_args+0xe/0xe
    [] ? __init_kthread_worker+0x56/0x56
    [] ? gs_change+0xb/0xb

    The RECLAIM_FS notations indicate that it's doing the GFP_FS checking that
    Nick hacked into lockdep a while back: I think we're intended to read that
    "" in the DEADLOCK scenario as "".

    I'm hazy, I have not reached any conclusion as to whether it's right to
    complain or not; but I believe it's uneasy about kswapd now doing the
    mutex_lock(&pcpu_alloc_mutex) which lru_add_drain_all() entails. Nor have
    I reached any conclusion as to whether it's important for kswapd to do
    that draining or not.

    But so as not to get blocked on this, with lockdep disabled from giving
    further reports, here's a patch which removes the lru_add_drain_all() from
    kswapd's callpath (and calls it only once from compact_nodes(), instead of
    once per node).

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently a failed order-9 (transparent hugepage) compaction can lead to
    memory compaction being temporarily disabled for a memory zone, even if
    we only need compaction for an order-2 allocation, e.g. for jumbo frame
    networking.

    The fix is relatively straightforward: keep track of the highest order at
    which compaction is succeeding, and only defer compaction for orders at
    which compaction is failing.
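
    A minimal sketch of the idea (assuming the compact_order_failed field
    referenced elsewhere in this log plus existing
    compact_considered/compact_defer_shift deferral counters; simplified, not
    the exact mainline code):

    /* Returns true if compaction should be skipped (deferred) this time */
    bool compaction_deferred(struct zone *zone, int order)
    {
        unsigned long defer_limit = 1UL << zone->compact_defer_shift;

        /* Orders below the highest failing order are never deferred */
        if (order < zone->compact_order_failed)
            return false;

        if (++zone->compact_considered >= defer_limit)
            return false;

        return true;
    }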

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous
    free pages, even when it is woken for a higher order request.

    This could be bad for e.g. jumbo frame network allocations, which are done
    from interrupt context and cannot compact memory themselves. Higher
    allocation failure rates than before have been observed in the network
    receive path in kernels with compaction enabled.

    Teach kswapd to defragment the memory zones in a node, but only if
    required and compaction is not deferred in a zone.

    [akpm@linux-foundation.org: reduce scope of zones_need_compaction]
    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

09 Feb, 2012

1 commit

  • When isolating pages for migration, migration starts at the start of a
    zone while the free scanner starts at the end of the zone. Migration
    avoids entering a new zone by never going beyond the free scanner.

    Unfortunately, in very rare cases nodes can overlap. When this happens,
    migration isolates pages without the LRU lock held, corrupting lists
    which will trigger errors in reclaim or during page free such as in the
    following oops

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] free_pcppages_bulk+0xcc/0x450
    PGD 1dda554067 PUD 1e1cb58067 PMD 0
    Oops: 0000 [#1] SMP
    CPU 37
    Pid: 17088, comm: memcg_process_s Tainted: G X
    RIP: free_pcppages_bulk+0xcc/0x450
    Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
    Call Trace:
    free_hot_cold_page+0x17e/0x1f0
    __pagevec_free+0x90/0xb0
    release_pages+0x22a/0x260
    pagevec_lru_move_fn+0xf3/0x110
    putback_lru_page+0x66/0xe0
    unmap_and_move+0x156/0x180
    migrate_pages+0x9e/0x1b0
    compact_zone+0x1f3/0x2f0
    compact_zone_order+0xa2/0xe0
    try_to_compact_pages+0xdf/0x110
    __alloc_pages_direct_compact+0xee/0x1c0
    __alloc_pages_slowpath+0x370/0x830
    __alloc_pages_nodemask+0x1b1/0x1c0
    alloc_pages_vma+0x9b/0x160
    do_huge_pmd_anonymous_page+0x160/0x270
    do_page_fault+0x207/0x4c0
    page_fault+0x25/0x30

    The "X" in the taint flag means that external modules were loaded but but
    is unrelated to the bug triggering. The real problem was because the PFN
    layout looks like this

    Zone PFN ranges:
    DMA 0x00000010 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x01e80000
    Movable zone start PFN for each node
    early_node_map[14] active PFN ranges
    0: 0x00000010 -> 0x0000009b
    0: 0x00000100 -> 0x0007a1ec
    0: 0x0007a354 -> 0x0007a379
    0: 0x0007f7ff -> 0x0007f800
    0: 0x00100000 -> 0x00680000
    1: 0x00680000 -> 0x00e80000
    0: 0x00e80000 -> 0x01080000
    1: 0x01080000 -> 0x01280000
    0: 0x01280000 -> 0x01480000
    1: 0x01480000 -> 0x01680000
    0: 0x01680000 -> 0x01880000
    1: 0x01880000 -> 0x01a80000
    0: 0x01a80000 -> 0x01c80000
    1: 0x01c80000 -> 0x01e80000

    The fix is straightforward. isolate_migratepages() has to make a
    similar check to isolate_freepages() to ensure that it never isolates pages
    from a zone it does not hold the LRU lock for.
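
    A sketch of the kind of check involved, inside the migrate scanner
    (simplified; not necessarily the exact mainline code):

    page = pfn_to_page(low_pfn);

    /*
     * With overlapping nodes, a PFN within this zone's span can still
     * belong to a different zone; skip it, because only this zone's
     * LRU lock is held.
     */
    if (page_zone(page) != zone)
        continue;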

    This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
    and current mainline.

    Signed-off-by: Mel Gorman
    Acked-by: Michal Nazarewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Feb, 2012

1 commit

  • …ing isolation for migration

    When isolating for migration, migration starts at the start of a zone
    which is not necessarily pageblock aligned. Further, it stops isolating
    when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
    not aligned. This allows isolate_migratepages() to call pfn_to_page() on
    an invalid PFN which can result in a crash. This was originally reported
    against a 3.0-based kernel with the following trace in a crash dump.

    PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s"
    #0 [d72d3ad0] crash_kexec at c028cfdb
    #1 [d72d3b24] oops_end at c05c5322
    #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
    #3 [d72d3bec] bad_area at c0227fb6
    #4 [d72d3c00] do_page_fault at c05c72ec
    #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000
    DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50
    CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002
    #6 [d72d3cb4] isolate_migratepages at c030b15a
    #7 [d72d3d14] zone_watermark_ok at c02d26cb
    #8 [d72d3d2c] compact_zone at c030b8de
    #9 [d72d3d68] compact_zone_order at c030bba1
    #10 [d72d3db4] try_to_compact_pages at c030bc84
    #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
    #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
    #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
    #14 [d72d3eb8] alloc_pages_vma at c030a845
    #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
    #16 [d72d3f00] handle_mm_fault at c02f36c6
    #17 [d72d3f30] do_page_fault at c05c70ed
    #18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431
    DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788
    SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50
    CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202

    It was also reported by Herbert van den Bergh against a 3.1-based kernel
    with the following snippet from the console log.

    BUG: unable to handle kernel paging request at 01c00008
    IP: [<c0522399>] isolate_migratepages+0x119/0x390
    *pdpt = 000000002f7ce001 *pde = 0000000000000000

    It is expected that it also affects 3.2.x and current mainline.

    The problem is that pfn_valid is only called on the first PFN being
    checked and that PFN is not necessarily aligned. Let's say we have a case
    like this

    H = MAX_ORDER_NR_PAGES boundary
    | = pageblock boundary
    m = cc->migrate_pfn
    f = cc->free_pfn
    o = memory hole

    H------|------H------|----m-Hoooooo|ooooooH-f----|------H

    The migrate_pfn is just below a memory hole and the free scanner is beyond
    the hole. When isolate_migratepages starts, it scans from migrate_pfn to
    migrate_pfn + pageblock_nr_pages, which is now in a memory hole. It checks
    pfn_valid() on the first PFN but then scans into the hole where there are
    not necessarily valid struct pages.

    This patch ensures that isolate_migratepages calls pfn_valid when
    necessary.
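
    A sketch of the check described (simplified from the migrate scanner loop;
    not necessarily the exact mainline code):

    for (; low_pfn < end_pfn; low_pfn++) {
        struct page *page;

        /*
         * Only call pfn_valid() when crossing into a new
         * MAX_ORDER_NR_PAGES block; pfn_valid_within() covers holes
         * inside a block on the configurations that can have them.
         */
        if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 &&
            !pfn_valid(low_pfn)) {
            low_pfn += MAX_ORDER_NR_PAGES - 1;
            continue;
        }

        if (!pfn_valid_within(low_pfn))
            continue;

        page = pfn_to_page(low_pfn);
        /* ... isolate the page for migration ... */
    }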

    Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Michal Nazarewicz <mina86@mina86.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

13 Jan, 2012

4 commits

  • This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
    mode that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.
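
    The three modes, as described above (sketch; the comments paraphrase this
    entry, and in mainline the enum lives in include/linux/migrate_mode.h):

    enum migrate_mode {
        MIGRATE_ASYNC,          /* never block */
        MIGRATE_SYNC_LIGHT,     /* allow blocking, but do not write pages back */
        MIGRATE_SYNC,           /* allow blocking, including ->writepage */
    };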

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list. This
    had to be partially reverted because some dirty pages can be migrated by
    compaction without blocking.

    This patch updates "mm: compaction: make isolate_lru_page" by skipping
    over pages that migration has no possibility of migrating to minimise LRU
    disruption.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When asynchronous compaction was introduced, the
    /proc/sys/vm/compact_memory handler should have been updated to always use
    synchronous compaction. This did not happen so this patch addresses it.

    The assumption is if a user writes to /proc/sys/vm/compact_memory, they
    are willing for that process to stall.

    Signed-off-by: Mel Gorman
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Short summary: There are severe stalls when a USB stick using VFAT is
    used with THP enabled that are reduced by this series. If you are
    experiencing this problem, please test and report back; considering I
    have seen complaints from openSUSE and Fedora users on this as well as a
    few private mails, I'm guessing it's a widespread issue. This is a new
    type of USB-related stall because it is due to synchronous compaction
    writing, whereas in the past the big problem was dirty pages reaching
    the end of the LRU and being written by reclaim.

    Am cc'ing Andrew this time and this series would replace
    mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
    I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
    for wider testing and ideally it would be reverted and replaced by this
    series.

    That said, the later patches could really do with some review. If this
    series is not the answer then a new direction needs to be discussed
    because as it is, the stalls are unacceptable as the results in this
    leader show.

    For testers that try backporting this to 3.1, it won't work because
    there is a non-obvious dependency on not writing back pages in direct
    reclaim so you need those patches too.

    Changelog since V5
    o Rebase to 3.2-rc5
    o Tidy up the changelogs a bit

    Changelog since V4
    o Added reviewed-bys, credited Andrea properly for sync-light
    o Allow dirty pages without mappings to be considered for migration
    o Bound the number of pages freed for compaction
    o Isolate PageReclaim pages on their own LRU list

    This is against 3.2-rc5 and follows on from discussions on "mm: Do
    not stall in synchronous compaction for THP allocations" and "[RFC
    PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
    patch eliminated stalls due to compaction which sometimes resulted in
    user-visible interactivity problems on browsers by simply never using
    sync compaction. The downside was that THP success allocation rates
    were lower because dirty pages were not being migrated as reported by
    Andrea. His approach at fixing this was nacked on the grounds that
    it reverted fixes from Rik merged that reduced the amount of pages
    reclaimed as it severely impacted his workloads' performance.

    This series attempts to reconcile the requirements of maximising THP
    usage, without stalling in a user-visible fashion due to compaction
    or cheating by reclaiming an excessive number of pages.

    Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
    dirty pages. This is because migration can move some dirty
    pages without blocking.

    Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
    synchronous compaction when it should be. This is unrelated
    to the reported stalls but is worth fixing.

    Patch 3 checks if we isolated a compound page during lumpy scan and
    accounts for it properly. For the most part, this affects
    tracing so it's unrelated to the stalls but worth fixing.

    Patch 4 notes that it is possible to abort reclaim early for compaction
    and return 0 to the page allocator potentially entering the
    "may oom" path. This has not been observed in practice but
    the rest of the series potentially makes it easier to happen.

    Patch 5 adds a sync parameter to the migratepage callback and gives
    the callback responsibility for migrating the page without
    blocking if sync==false. For example, fallback_migrate_page
    will not call writepage if sync==false. This increases the
    number of pages that can be handled by asynchronous compaction
    thereby reducing stalls.

    Patch 6 restores filter-awareness to isolate_lru_page for migration.
    In practice, it means that pages under writeback and pages
    without a ->migratepage callback will not be isolated
    for migration.

    Patch 7 avoids calling direct reclaim if compaction is deferred but
    makes sure that compaction is only deferred if sync
    compaction was used.

    Patch 8 introduces a sync-light migration mechanism that sync compaction
    uses. The objective is to allow some stalls but to not call
    ->writepage which can lead to significant user-visible stalls.

    Patch 9 notes that while we want to abort reclaim ASAP to allow
    compaction to go ahead, we leave a very small window of
    opportunity for compaction to run. This patch allows more pages
    to be freed by reclaim but bounds the number to a reasonable
    level based on the high watermark on each zone.

    Patch 10 allows slabs to be shrunk even after compaction_ready() is
    true for one zone. This is to avoid a problem whereby a single
    small zone can abort reclaim even though no pages have been
    reclaimed and no suitably large zone is in a usable state.

    Patch 11 fixes a problem with the rate of page scanning. As reclaim is
    rarely stalling on pages under writeback it means that scan
    rates are very high. This is particularly true for direct
    reclaim which is not calling writepage. The vmstat figures
    implied that much of this was busy work with PageReclaim pages
    marked for immediate reclaim. This patch is a prototype that
    moves these pages to their own LRU list.

    This has been tested and other than 2 USB keys getting trashed,
    nothing horrible fell out. That said, I am a bit unhappy with the
    rescue logic in patch 11 but did not find a better way around it. It
    does significantly reduce scan rates and System CPU time indicating
    it is the right direction to take.

    What is of critical importance is that stalls due to compaction
    are massively reduced even though sync compaction was still
    allowed. Testing from people complaining about stalls copying to USBs
    with THP enabled are particularly welcome.

    The following tests all involve THP usage and USB keys in some
    way. Each test follows this type of pattern

    1. Read from some fast storage, be it raw device or file. Each time
    the copy finishes, start again until the test ends
    2. Write a large file to a filesystem on a USB stick. Each time the copy
    finishes, start again until the test ends
    3. When memory is low, start an alloc process that creates a mapping
    the size of physical memory to stress THP allocation. This is the
    "real" part of the test and the part that is meant to trigger
    stalls when THP is enabled. Copying continues in the background.
    4. Record the CPU usage and time to execute of the alloc process
    5. Record the number of THP allocs and fallbacks as well as the number of THP
    pages in use at the end of the test just before alloc exited
    6. Run the test 5 times to get an idea of variability
    7. Between each run, sync is run and caches dropped and the test
    waits until nr_dirty is a small number to avoid interference
    or caching between iterations that would skew the figures.

    The individual tests were then

    writebackCPDeviceBasevfat
    Disable THP, read from a raw device (sda), vfat on USB stick
    writebackCPDeviceBaseext4
    Disable THP, read from a raw device (sda), ext4 on USB stick
    writebackCPDevicevfat
    THP enabled, read from a raw device (sda), vfat on USB stick
    writebackCPDeviceext4
    THP enabled, read from a raw device (sda), ext4 on USB stick
    writebackCPFilevfat
    THP enabled, read from a file on fast storage and USB, both vfat
    writebackCPFileext4
    THP enabled, read from a file on fast storage and USB, both ext4

    The kernels tested were

    3.1          3.1
    vanilla      3.2-rc5
    freemore     Patches 1-10
    immediate    Patches 1-11
    andrea       The 8 patches Andrea posted as a basis of comparison

    The results are very long unfortunately. I'll start with the case
    where we are not using THP at all

    writebackCPDeviceBasevfat
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%)
    +/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%)
    User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%)
    +/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%)
    Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%)
    +/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%)
    THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)

    The THP figures are obviously all 0 because THP was disabled. The
    main thing to watch is the elapsed times and how they compare to
    times when THP is enabled later. It's also important to note that
    elapsed time is improved by this series as System CPU time is much
    reduced.

    writebackCPDevicevfat

    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%)
    +/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60 (-10818.53%)
    User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%)
    +/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%)
    Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 ( 95.48%)
    +/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%)
    THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%)
    +/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%)
    Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%)
    +/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%)
    Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%)
    +/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03
    Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28

    The first thing to note is the "Elapsed Time" for the vanilla kernels
    of 2249 seconds versus 56 with THP disabled which might explain the
    reports of USB stalls with THP enabled. Applying the patches brings
    performance in line with THP-disabled performance while isolating
    pages for immediate reclaim from the LRU cuts down System CPU time.

    The "Fault Alloc" success rate figures are also improved. The vanilla
    kernel only managed to allocate 76.6 pages on average over the course
    of 5 iterations whereas applying the series allocated 181.20 on
    average, albeit well within variance. It's worth noting that
    applying the series at least decreases the amount of variance, which
    implies an improvement.

    Andrea's series had a higher success rate for THP allocations but
    at a severe cost to elapsed time which is still better than vanilla
    but still much worse than disabling THP altogether. One can bring my
    series close to Andrea's by removing this check

    /*
     * If compaction is deferred for high-order allocations, it is because
     * sync compaction recently failed. If this is the case and the caller
     * has requested the system not be heavily disrupted, fail the
     * allocation now instead of entering direct reclaim
     */
    if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
            goto nopage;

    I didn't include a patch that removed the above check because hurting
    overall performance to improve the THP figure is not what the average
    user wants. It's something to consider though if someone really wants
    to maximise THP usage no matter what it does to the workload initially.

    This is summary of vmstat figures from the same test.

    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    Page Ins 3257266139 1111844061 17263623 10901575 161423219
    Page Outs 81054922 30364312 3626530 3657687 8753730
    Swap Ins 3294 2851 6560 4964 4592
    Swap Outs 390073 528094 620197 790912 698285
    Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831
    Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966
    Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726
    Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360
    Kswapd efficiency 83% 69% 58% 57% 79%
    Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490
    Direct efficiency 74% 9% 0% 1% 0%
    Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435
    Percentage direct scans 96% 99% 99% 98% 99%
    Page writes by reclaim 722646 529174 620319 791018 699198
    Page writes file 332573 1080 122 106 913
    Page writes anon 390073 528094 620197 790912 698285
    Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032
    Page rescued immediate 0 0 0 87848 0
    Slabs scanned 23552 23552 9216 8192 9216
    Direct inode steals 231 0 0 0 0
    Kswapd inode steals 0 0 0 0 0
    Kswapd skipped wait 28076 786 0 61 6
    THP fault alloc 609 383 753 906 1433
    THP collapse alloc 12 6 0 0 6
    THP splits 536 211 456 593 1136
    THP fault fallback 4406 4633 4263 4110 3583
    THP collapse fail 120 127 0 0 4
    Compaction stalls 1810 728 623 779 3200
    Compaction success 196 53 60 80 123
    Compaction failures 1614 675 563 699 3077
    Compaction pages moved 193158 53545 243185 333457 226688
    Compaction move failure 9952 9396 16424 23676 45070

    The main things to look at are

    1. Page In/out figures are much reduced by the series.

    2. Direct page scanning is incredibly high (264745.137 pages scanned
    per second on the vanilla kernel) but isolating PageReclaim pages
    on their own list reduces the number of pages scanned significantly.

    3. The fact that "Page rescued immediate" is a positive number implies
    that we sometimes race removing pages from the LRU_IMMEDIATE list
    that need to be put back on a normal LRU but it happens only for
    0.07% of the pages marked for immediate reclaim.

    writebackCPDeviceext4
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
    +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
    User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
    +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
    Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
    +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
    THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
    +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
    Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
    +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
    Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
    +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
    Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26

    Similar test but the USB stick is using ext4 instead of vfat. As
    ext4 does not use writepage for migration, the large stalls due to
    compaction when THP is enabled are not observed. Still, isolating
    PageReclaim pages on their own list helped completion time largely
    by reducing the number of pages scanned by direct reclaim although
    time spent in congestion_wait could also be a factor.

    Again, Andrea's series had far higher success rates for THP allocation
    at the cost of elapsed time. I didn't look too closely but a quick
    look at the vmstat figures tells me kswapd reclaimed 8 times more pages
    than the patch series and direct reclaim reclaimed roughly three times
    as many pages. It follows that if memory is aggressively reclaimed,
    there will be more available for THP.

    writebackCPFilevfat
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%)
    +/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%)
    User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%)
    +/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%)
    Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%)
    +/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%)
    THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%)
    +/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%)
    Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%)
    +/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%)
    Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%)
    +/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76
    Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28

    In this case, the test is reading/writing only from filesystems but as
    it's vfat, it's slow due to calling writepage during compaction. Little
    to observe really - the time to complete the test goes way down
    with the series applied and THP allocation success rates go up in
    comparison to 3.2-rc5. The success rates are lower than 3.1.0 but
    the elapsed time for that kernel is abysmal so it is not really a
    sensible comparison.

    As before, Andrea's series allocates more THPs at the cost of overall
    performance.

    writebackCPFileext4
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
    +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
    User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
    +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
    Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
    +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
    THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
    +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
    Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
    +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
    Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
    +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
    Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26

    Same type of story - elapsed times go down. In this case, allocation
    success rates are roughly the same. As before, Andrea's has higher
    success rates but takes a lot longer.

    Overall the series does reduce latencies and while the tests are
    inherently racy as alloc competes with the cp processes, the variability
    was included. The THP allocation rates are not as high as they could
    be but that is because we would have to be more aggressive about
    reclaim and compaction impacting overall performance.

    This patch:

    Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list.

    What was missed during review is that asynchronous migration moves dirty
    pages if their ->migratepage callback is migrate_page() because these can
    be moved without blocking. This potentially impacted hugepage allocation
    success rates by a factor depending on how many dirty pages are in the
    system.

    This patch partially reverts 39deaf85 to allow migration to isolate dirty
    pages again. This increases how much compaction disrupts the LRU but that
    is addressed later in the series.

    Signed-off-by: Mel Gorman
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Jan, 2012

1 commit


22 Dec, 2011

1 commit

  • This moves the 'memory sysdev_class' over to a regular 'memory' subsystem
    and converts the devices to regular devices. The sysdev drivers are
    implemented as subsystem interfaces now.

    After all sysdev classes are ported to regular driver core entities, the
    sysdev implementation will be entirely removed from the kernel.

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

01 Nov, 2011

4 commits

  • There's no compact_zone_order() user outside file scope, so make it static.

    Signed-off-by: Kyungmin Park
    Acked-by: David Rientjes
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kyungmin Park
     
  • In async mode, compaction doesn't migrate dirty or writeback pages. So,
    it's meaningless to pick such a page and re-add it to the LRU list.

    Of course, when we isolate the page during compaction, the page might be
    dirty or under writeback, but by the time we try to migrate it the page
    may no longer be dirty or under writeback, so it could be migrated. That
    is very unlikely, though, as the isolate and migration cycle is much
    faster than writeout.

    So, this patch reduces CPU overhead and prevents unnecessary LRU churning.

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Replace the ISOLATE_XXX macros with a bitwise isolate_mode_t type. Normally,
    macros aren't recommended as they are type-unsafe and make debugging harder,
    since the symbol cannot be passed through to the debugger.

    Quote from Johannes
    " Hmm, it would probably be cleaner to fully convert the isolation mode
    into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • acct_isolated() in compaction uses page_lru_base_type(), which returns
    only the base type of the LRU list, so it never returns LRU_ACTIVE_ANON or
    LRU_ACTIVE_FILE. In addition, cc->nr_[anon|file] is used only in
    acct_isolated(), so those fields are not needed in compact_control.

    This patch removes the fields from compact_control and makes the role of
    acct_isolated() clear: it counts the number of anon|file pages isolated.
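
    The resulting helper is roughly the following (a sketch that counts
    directly off the migratepages list; not a verbatim copy of the patch):

        static void acct_isolated(struct zone *zone, struct compact_control *cc)
        {
                struct page *page;
                unsigned int count[2] = { 0, };

                /* count anon (0) vs file (1) pages on the isolated list */
                list_for_each_entry(page, &cc->migratepages, lru)
                        count[!!page_is_file_cache(page)]++;

                __mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
                __mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
        }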

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

16 Jun, 2011

4 commits

  • Asynchronous compaction is used when promoting to huge pages. This is all
    very nice but if there are a number of processes compacting memory, a
    large number of pages can be isolated. An "asynchronous" process can
    stall for long periods of time as a result, with one user reporting that
    firefox can stall for 10s of seconds. This patch aborts asynchronous
    compaction if too many pages are isolated, as it's better to fail a
    hugepage promotion than to stall a process.
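
    The abort condition is along the lines of the sketch below (a
    too_many_isolated()-style check; the exact threshold used by the patch
    may differ):

        static bool too_many_isolated(struct zone *zone)
        {
                unsigned long active, inactive, isolated;

                inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
                                zone_page_state(zone, NR_INACTIVE_ANON);
                active = zone_page_state(zone, NR_ACTIVE_FILE) +
                                zone_page_state(zone, NR_ACTIVE_ANON);
                isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
                                zone_page_state(zone, NR_ISOLATED_ANON);

                /* isolated pages dominating the LRU means over-isolation */
                return isolated > (inactive + active) / 2;
        }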

    [minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
    Reported-and-tested-by: Ury Stankevich
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Compaction works with two scanners, a migration scanner and a free
    scanner. When the scanners cross over, migration within the zone is
    complete. The location of each scanner is recorded on each cycle to
    avoid excessive scanning.

    When a zone is small and mostly reserved, it's very easy for the migration
    scanner to be close to the end of the zone. Then the following situation
    can occur:

    o migration scanner isolates some pages near the end of the zone
    o free scanner starts at the end of the zone but finds that the
      migration scanner is already there
    o free scanner gets reinitialised for the next cycle as
      cc->migrate_pfn + pageblock_nr_pages, moving the free scanner
      into the next zone
    o migration scanner moves into the next zone

    When this happens, NR_ISOLATED accounting goes haywire because some of the
    accounting happens against the wrong zone. One zone's counter remains
    positive while the other goes negative, even though the overall global
    count is accurate. This was reported on X86-32 with !SMP because !SMP
    allows the negative counters to be visible, but the bug should
    theoretically be possible on other configurations as well.
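
    The fix is essentially a clamp when the free scanner is initialised for a
    cycle (a sketch of the isolate_freepages() setup, using field names of
    that era):

        /* where free pages were last scanned from (or the end of the zone) */
        pfn = cc->free_pfn;

        /* the migration scanner's current pageblock is the low limit */
        low_pfn = cc->migrate_pfn + pageblock_nr_pages;

        /*
         * If the migration scanner is already at the end of the zone, make
         * sure the free scanner cannot start in the next zone on the next
         * cycle.
         */
        high_pfn = min(low_pfn, pfn);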

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • fragmentation_index() returns -1000 when the allocation might succeed.
    This doesn't match the comment and code in compaction_suitable(). I
    thought compaction_suitable() should return COMPACT_PARTIAL in the -1000
    case, because in that case the allocation could succeed depending on
    watermarks.

    The impact of this is that compaction starts and compact_finished() is
    called, which rechecks the watermarks and the free lists. It should have
    the same end result, in that compaction should not start, but it is more
    expensive.
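
    A sketch of the adjusted check in compaction_suitable() (assuming the
    sysctl_extfrag_threshold gating used at the time):

        fragindex = fragmentation_index(zone, order);
        if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
                return COMPACT_SKIPPED;

        /*
         * -1000 means the allocation might already succeed, so let the
         * allocator retry rather than starting a compaction pass.
         */
        if (fragindex == -1000 && zone_watermark_ok(zone, order, watermark, 0, 0))
                return COMPACT_PARTIAL;

        return COMPACT_CONTINUE;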

    Acked-by: Mel Gorman
    Signed-off-by: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Commit 56de7263fcf3 ("mm: compaction: direct compact when a high-order
    allocation fails") introduced a check for cc->order == -1 in
    compact_finished(). We should continue compacting in that case because
    the request came from userspace and there is no particular order to
    compact for. A similar check was added by 82478fb7 ("mm: compaction:
    prevent division-by-zero during user-requested compaction") for
    compaction_suitable().

    The check is, however, done after zone_watermark_ok(), which uses order as
    the right-hand argument for shifts. Not only is the watermark check
    pointless if we can break out without it, it also uses 1 << -1, which is
    not well defined (at least by the C standard). Let's move the -1 check
    above zone_watermark_ok().
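
    After the move, compact_finished() reads roughly as follows (a sketch,
    not the literal diff):

        /*
         * order == -1 is expected when compacting via
         * /proc/sys/vm/compact_memory, so bail out before any order-based
         * arithmetic such as 1 << cc->order.
         */
        if (cc->order == -1)
                return COMPACT_CONTINUE;

        /* compaction run is not finished if the watermark is not met */
        watermark = low_wmark_pages(zone) + (1 << cc->order);
        if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
                return COMPACT_CONTINUE;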

    [minchan.kim@gmail.com: caught compaction_suitable]
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Mar, 2011

4 commits

  • compaction_alloc() isolates pages for migration in isolate_migratepages.
    While it's scanning, IRQs are disabled on the mistaken assumption the
    scanning should be short. Tests show this to be true for the most part
    but contention times on the LRU lock can be increased. Before this patch,
    the IRQ disabled times for a simple test looked like

    Total sampled time IRQs off (not real total time): 5493
    Event shrink_inactive_list..shrink_zone 1596 us count 1
    Event shrink_inactive_list..shrink_zone 1530 us count 1
    Event shrink_inactive_list..shrink_zone 956 us count 1
    Event shrink_inactive_list..shrink_zone 541 us count 1
    Event shrink_inactive_list..shrink_zone 531 us count 1
    Event split_huge_page..add_to_swap 232 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __wake_up..__wake_up 1 us count 1

    This patch reduces the worst-case IRQs-disabled latencies by releasing the
    lock every SWAP_CLUSTER_MAX pages that are scanned and releasing the CPU if
    necessary. The cost of this is that the process performing compaction will
    be slower, but IRQs being disabled for too long a time has worse
    consequences, as the following report shows (a sketch of the locking change
    follows the report);

    Total sampled time IRQs off (not real total time): 4367
    Event shrink_inactive_list..shrink_zone 881 us count 1
    Event shrink_inactive_list..shrink_zone 875 us count 1
    Event shrink_inactive_list..shrink_zone 868 us count 1
    Event shrink_inactive_list..shrink_zone 555 us count 1
    Event split_huge_page..add_to_swap 495 us count 1
    Event compact_zone..compact_zone_order 269 us count 1
    Event split_huge_page..add_to_swap 266 us count 1
    Event shrink_inactive_list..shrink_zone 85 us count 1
    Event save_args..call_softirq 36 us count 2
    Event __wake_up..__wake_up 1 us count 1
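
    The locking change itself is roughly of the following shape (a minimal
    sketch of the periodic release in the migration scanner, assuming a
    'locked' flag as in the code of that era):

        /*
         * Periodically drop the lru_lock (re-enabling IRQs) so interrupts
         * and other lock waiters get a chance, and yield the CPU if needed.
         */
        if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
                spin_unlock_irq(&zone->lru_lock);
                locked = false;
                cond_resched();
                spin_lock_irq(&zone->lru_lock);
                locked = true;
        }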

    [akpm@linux-foundation.org: simplify with s/unlocked/locked/]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • compaction_alloc() isolates free pages to be used as migration targets.
    While it's scanning, IRQs are disabled on the mistaken assumption that the
    scanning should be short. Analysis showed that IRQs were in fact being
    disabled for substantial time. A simple test was run using large
    anonymous mappings with transparent hugepage support enabled to trigger
    frequent compactions. A monitor sampled what the worst IRQ-off latencies
    were and a post-processing tool found the following;

    Total sampled time IRQs off (not real total time): 22355
    Event compaction_alloc..compaction_alloc 8409 us count 1
    Event compaction_alloc..compaction_alloc 7341 us count 1
    Event compaction_alloc..compaction_alloc 2463 us count 1
    Event compaction_alloc..compaction_alloc 2054 us count 1
    Event shrink_inactive_list..shrink_zone 1864 us count 1
    Event shrink_inactive_list..shrink_zone 88 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __make_request..__blk_run_queue 24 us count 1
    Event __alloc_pages_nodemask..__alloc_pages_nodemask 6 us count 1

    i.e. compaction was disabling IRQs for a prolonged period of time - 8ms in
    one instance. The full report generated by the tool can be found at

    http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-vanilla-micro.report

    This patch reduces the time IRQs are disabled by simply disabling IRQs at
    the last possible minute. An updated IRQs-off summary report then looks
    like;

    Total sampled time IRQs off (not real total time): 5493
    Event shrink_inactive_list..shrink_zone 1596 us count 1
    Event shrink_inactive_list..shrink_zone 1530 us count 1
    Event shrink_inactive_list..shrink_zone 956 us count 1
    Event shrink_inactive_list..shrink_zone 541 us count 1
    Event shrink_inactive_list..shrink_zone 531 us count 1
    Event split_huge_page..add_to_swap 232 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __wake_up..__wake_up 1 us count 1

    A full report is again available at

    http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-minimiseirq-free-v1r4-micro.report

    As should be obvious, IRQ-disabled latencies due to compaction are
    almost eliminated for this particular test.
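
    The structural change amounts to something like the sketch below in
    isolate_freepages(); an isolate_freepages_block() helper is assumed, and
    only the placement of the locking matters here:

        /* scan candidate pageblocks with IRQs enabled... */
        if (!suitable_migration_target(page))
                continue;

        /*
         * ...and take zone->lock with IRQs disabled only around the actual
         * isolation of free pages from the block.
         */
        spin_lock_irqsave(&zone->lock, flags);
        isolated = isolate_freepages_block(zone, pfn, freelist);
        spin_unlock_irqrestore(&zone->lock, flags);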

    [aarcange@redhat.com: Fix initialisation of isolated]
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Acked-by: Andrea Arcangeli
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Many of migrate_pages()'s callers check the return value instead of
    list_empty() since cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED
    counting"). This patch makes compaction's use of migrate_pages()
    consistent with the others. It should not change the old behaviour.

    Signed-off-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
    order > 0") due to reports stating that kswapd CPU usage was higher and
    IRQs were being disabled more frequently. This was reported at
    http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

    Without this patch applied, CPU usage by kswapd hovers around the 20% mark
    according to the tester (Arthur Marsh:
    http://www.spinics.net/linux/fedora/alsa-user/msg09899.html). With this
    patch applied, it's around 2%.

    The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
    triggered by high-order allocations hitting the low watermark for their
    order and waking kswapd on kernels with CONFIG_COMPACTION set. The most
    common trigger for this is network cards configured for jumbo frames but
    it's also possible it'll be triggered by fork-heavy workloads (order-1)
    and some wireless cards which depend on order-1 allocations.

    The symptoms for the user will be high CPU usage by kswapd in low-memory
    situations which could be confused with another writeback problem. While
    a patch like 5a03b051 may be reintroduced in the future, this patch plays
    it safe for now and reverts it.

    [mel@csn.ul.ie: Beefed up the changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reported-by: Arthur Marsh
    Tested-by: Arthur Marsh
    Cc: [2.6.38.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

21 Jan, 2011

1 commit

  • Up until 3e7d344 ("mm: vmscan: reclaim order-0 and use compaction instead
    of lumpy reclaim"), compaction skipped calculating the fragmentation index
    of a zone when compaction was explicitly requested through the procfs
    knob.

    However, when compaction_suitable was introduced, it did not come with an
    extra check for order == -1, set on explicit compaction requests, and
    passed this order on to the fragmentation index calculation, where it
    overshifts the number of requested pages, leading to a division by zero.

    This patch makes sure that order == -1 is recognized as the flag it is
    rather than being passed along as a valid order parameter.
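
    The fix itself is a short guard in compaction_suitable(), sketched below
    with the comment added per Mel:

        /*
         * order == -1 is expected when compacting via
         * /proc/sys/vm/compact_memory, so don't feed it into the
         * fragmentation index calculation.
         */
        if (order == -1)
                return COMPACT_CONTINUE;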

    [akpm@linux-foundation.org: add comment, per Mel]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

14 Jan, 2011

1 commit