27 Jan, 2015

1 commit

  • for_each_zone_zonelist_nodemask wants an enum zone_type argument, but is
    passed gfp_t:

    mm/vmscan.c:2658:9: warning: incorrect type in argument 2 (different base types)
    mm/vmscan.c:2658:9:    expected int enum zone_type [signed] highest_zoneidx
    mm/vmscan.c:2658:9:    got restricted gfp_t [usertype] gfp_mask

    Convert the argument to the correct type.
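
    A minimal sketch of the fix pattern (illustrative, not necessarily the
    verbatim diff): gfp_zone() maps a gfp mask to the highest usable zone
    index, which is what argument 2 of the macro expects.

    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                    gfp_zone(sc.gfp_mask), sc.nodemask) {
            /* walk only zones usable for this allocation */
    }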

    Signed-off-by: Michael S. Tsirkin
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Suleiman Souhlal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

09 Jan, 2015

1 commit

  • Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
    stuck in a busy loop with nothing left to balance, but
    kswapd_try_to_sleep() failing to sleep. Their analysis found the cause
    to be a combination of several factors:

    1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

    2. The process has been killed (by OOM in this case), but has not yet been
    scheduled to remove itself from the waitqueue and die.

    3. kswapd checks for throttled processes in prepare_kswapd_sleep():

    if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
            wake_up(&pgdat->pfmemalloc_wait);
            return false; // kswapd will not go to sleep
    }

    However, for a process that was already killed, wake_up() does not remove
    the process from the waitqueue, since try_to_wake_up() checks its state
    first and returns false when the process is no longer waiting.

    4. kswapd is running on the same CPU as the only CPU that the process is
    allowed to run on (through cpus_allowed, or possibly single-cpu system).

    5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
    encounters no voluntary preemption points and repeatedly fails
    prepare_kswapd_sleep(), blocking the process from running and removing
    itself from the waitqueue, which would let kswapd sleep.

    So, the source of the problem is that we prevent kswapd from going to
    sleep while there are processes waiting on the pfmemalloc_wait queue,
    and a process waiting on a queue is guaranteed to be removed from the
    queue only when it gets scheduled. This was done to make sure that no
    process is left sleeping on pfmemalloc_wait when kswapd itself goes to
    sleep.

    However, it isn't necessary to postpone kswapd sleep until the
    pfmemalloc_wait queue actually empties. To prevent processes from being
    left sleeping, it's actually enough to guarantee that all processes
    waiting on pfmemalloc_wait queue have been woken up by the time we put
    kswapd to sleep.

    This patch therefore fixes this issue by substituting 'wake_up' with
    'wake_up_all' and removing 'return false' in the code snippet from
    prepare_kswapd_sleep() above. Note that if any process puts itself in
    the queue after this waitqueue_active() check, or after the wake up
    itself, it means that the process will also wake up kswapd - and since
    we are under prepare_to_wait(), the wake up won't be missed. Also, the
    comment in prepare_kswapd_sleep() is updated to more clearly describe
    the races it is preventing.
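
    For reference, a sketch of what the check above looks like after the
    change, reconstructed from this description rather than the verbatim
    diff:

    if (waitqueue_active(&pgdat->pfmemalloc_wait))
            wake_up_all(&pgdat->pfmemalloc_wait);
    /* no "return false" here - kswapd may still go to sleep */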

    Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: [3.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

14 Dec, 2014

1 commit

  • The slab shrinkers are currently invoked from the zonelist walkers in
    kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
    eligible LRU pages and assemble a nodemask to pass to NUMA-aware
    shrinkers, which then again have to walk over the nodemask. This is
    redundant code, extra runtime work, and fairly inaccurate when it comes to
    the estimation of actually scannable LRU pages. The code duplication will
    only get worse when making the shrinkers cgroup-aware and requiring them
    to have out-of-band cgroup hierarchy walks as well.

    Instead, invoke the shrinkers from shrink_zone(), which is where all
    reclaimers end up, to avoid this duplication.

    Take the count for eligible LRU pages out of get_scan_count(), which
    considers many more factors than just the availability of swap space, like
    zone_reclaimable_pages() currently does. Accumulate the number over all
    visited lruvecs to get the per-zone value.

    Some nodes have multiple zones due to memory addressing restrictions. To
    avoid putting too much pressure on the shrinkers, only invoke them once
    for each such node, using the class zone of the allocation as the pivot
    zone.
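
    A hedged sketch of that pivot-zone idea; the helper name and argument
    list are illustrative, not the exact upstream code:

    /* run the shrinkers from shrink_zone(), but only once per node */
    bool is_classzone = (zone_idx(zone) == classzone_idx);

    if (global_reclaim(sc) && is_classzone)
            shrink_node_slabs(sc, zone_to_nid(zone)); /* illustrative helper */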

    For now, this integrates the slab shrinking better into the reclaim logic
    and gets rid of duplicative invocations from kswapd, direct reclaim, and
    zone reclaim. It also prepares for cgroup-awareness, allowing
    memcg-capable shrinkers to be added at the lruvec level without much
    duplication of both code and runtime work.

    This changes kswapd behavior, which used to invoke the shrinkers for each
    zone, but with scan ratios gathered from the entire node, resulting in
    meaningless pressure quantities on multi-zone nodes.

    Zone reclaim behavior also changes. It used to shrink slabs until the
    same amount of pages were shrunk as were reclaimed from the LRUs. Now it
    merely invokes the shrinkers once with the zone's scan ratio, which makes
    the shrinkers go easier on caches that implement aging and would prefer
    feeding back pressure from recently used slab objects to unused LRU pages.

    [vdavydov@parallels.com: assure class zone is populated]
    Signed-off-by: Johannes Weiner
    Cc: Dave Chinner
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

3 commits

    Compaction relies on zone watermark checks for decisions such as whether
    it's worth starting compaction in compaction_suitable() or whether
    compaction should stop in compact_finished(). The watermark checks take
    classzone_idx and alloc_flags parameters, which are related to the memory
    allocation request. But from the context of compaction they are currently
    passed as 0, including the direct compaction which is invoked to satisfy
    the allocation request, and could therefore know the proper values.

    The lack of proper values can lead to mismatch between decisions taken
    during compaction and decisions related to the allocation request. Lack
    of proper classzone_idx value means that lowmem_reserve is not taken into
    account. This has manifested (during recent changes to deferred
    compaction) when DMA zone was used as fallback for preferred Normal zone.
    compaction_suitable() without proper classzone_idx would think that the
    watermarks are already satisfied, but watermark check in
    get_page_from_freelist() would fail. Because of this problem, deferring
    compaction has extra complexity that can be removed in the following
    patch.

    The issue (not confirmed in practice) with missing alloc_flags is opposite
    in nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or
    ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
    CMA-enabled systems) the watermark checking in compaction with 0 passed
    will be stricter than in get_page_from_freelist(). In these cases
    compaction might be running for a longer time than is really needed.

    Another issue with compaction_suitable() is that the check for "does the
    zone need compaction at all?" comes only after the check "does the zone
    have enough free pages to succeed compaction". The latter considers extra
    pages for migration and can therefore in some situations fail and return
    COMPACT_SKIPPED, although the high-order allocation would succeed and we
    should return COMPACT_PARTIAL.

    This patch fixes these problems by adding alloc_flags and classzone_idx to
    struct compact_control and related functions involved in direct compaction
    and watermark checking. Where possible, all other callers of
    compaction_suitable() pass proper values where those are known. This is
    currently limited to classzone_idx, which is sometimes known in kswapd
    context. However, the direct reclaim callers should_continue_reclaim()
    and compaction_ready() do not currently know the proper values, so the
    coordination between reclaim and compaction may still not be as accurate
    as it could. This can be fixed later, if it's shown to be an issue.

    Additionally, the checks in compaction_suitable() are reordered to
    address the second issue described above.
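
    A sketch of the parameterized check, assuming the usual
    zone_watermark_ok() interface (illustrative, not the verbatim patch):

    /* compaction_suitable() now sees the allocation's real context */
    unsigned long watermark = low_wmark_pages(zone);

    /*
     * reordered: if the high-order watermark is already met, there is
     * no need to compact at all
     */
    if (zone_watermark_ok(zone, order, watermark, classzone_idx, alloc_flags))
            return COMPACT_PARTIAL;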

    The effect of this patch should be slightly better high-order allocation
    success rates and/or less compaction overhead, depending on the type of
    allocations and presence of CMA. It allows simplifying deferred
    compaction code in a followup patch.

    When testing with stress-highalloc, there was some slight improvement
    (which might be just due to variance) in success rates of non-THP-like
    allocations.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • shrink_page_list() counts all pages with a mapping, including clean pages,
    toward nr_congested if they're on a write-congested BDI.
    shrink_inactive_list() then sets ZONE_CONGESTED if nr_dirty ==
    nr_congested. Fix this apples-to-oranges comparison by only counting
    pages for nr_congested if they count for nr_dirty.
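
    A sketch of the corrected accounting in shrink_page_list(), assuming the
    local dirty/writeback flags of that era's code (not the verbatim diff):

    if (dirty || writeback)
            nr_dirty++;

    /* only pages that counted for nr_dirty may count for nr_congested */
    if ((dirty || writeback) && mapping &&
        bdi_write_congested(mapping->backing_dev_info))
            nr_congested++;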

    Signed-off-by: Jamie Liu
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jamie Liu
     
    This patch replaces printk(KERN_ERR ...) with pr_err() under
    shrink_slab(). It also saves one line thanks to the shorter format.

    Signed-off-by: Pintu Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pintu Kumar
     

27 Oct, 2014

1 commit

    The current cpuset API for checking whether a zone/node is allowed to
    allocate from looks rather awkward. We have hardwall and softwall versions of
    cpuset_node_allowed with the softwall version doing literally the same
    as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
    If it isn't, the softwall version may check the given node against the
    enclosing hardwall cpuset, which it needs to take the callback lock to
    do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.
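
    A hypothetical call site after the simplification (the exact helper used
    on the allocator's zonelist path may differ):

    /* one check; __GFP_HARDWALL in gfp_mask selects the stricter hardwall mode */
    if (!cpuset_zone_allowed(zone, gfp_mask))
            continue;       /* zone is outside the allowed cpuset */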

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

10 Oct, 2014

4 commits

  • In a memcg with even just moderate cache pressure, success rates for
    transparent huge page allocations drop to zero, wasting a lot of effort
    that the allocator puts into assembling these pages.

    The reason for this is that the memcg reclaim code was never designed for
    higher-order charges. It reclaims in small batches until there is room
    for at least one page. Huge page charges only succeed when these batches
    add up over a series of huge faults, which is unlikely under any
    significant load involving order-0 allocations in the group.

    Remove that loop on the memcg side in favor of passing the actual reclaim
    goal to direct reclaim, which is already set up and optimized to meet
    higher-order goals efficiently.
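
    A hedged sketch of the resulting charge path - the argument list is
    illustrative, not the exact upstream signature:

    /* ask reclaim for the full charge size instead of looping in small batches */
    nr_reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages,
                                                gfp_mask, may_swap);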

    This brings memcg's THP policy in line with the system policy: if the
    allocator painstakingly assembles a hugepage, memcg will at least make an
    honest effort to charge it. As a result, transparent hugepage allocation
    rates amid cache activity are drastically improved:

    vanilla patched
    pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%)
    pgfault 491370.60 ( +0.00%) 225477.40 ( -54.11%)
    pgmajfault 2.00 ( +0.00%) 1.80 ( -6.67%)
    thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%)
    thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%)

    [ Note: this may in turn increase memory consumption from internal
    fragmentation, which is an inherent risk of transparent hugepages.
    Some setups may have to adjust the memcg limits accordingly to
    accommodate this - or, if the machine is already packed to capacity,
    disable the transparent huge page feature. ]

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
    sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
    reader through layers of indirection just to track down a simple bit.

    Remove all zone flag wrappers and just use bitops against zone->flags
    directly. It's just as readable and the lines are barely any longer.

    Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
    remove the zone_flags_t typedef.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    The deprecation warnings for the scan_unevictable interface are triggered
    by scripts doing `sysctl -a | grep something else'. This is annoying and not
    helpful.

    The interface has been defunct since 264e56d8247e ("mm: disable user
    interface to manually rescue unevictable pages"), which was in 2011, and
    there haven't been any reports of usecases for it, only reports that the
    deprecation warnings are annoying. It's unlikely that anybody is using
    this interface specifically at this point, so remove it.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When direct sync compaction is often unsuccessful, it may become deferred
    for some time to avoid further useless attempts, both sync and async.
    Successful high-order allocations un-defer compaction, while further
    unsuccessful compaction attempts prolong the compaction deferred period.

    Currently the checking and setting deferred status is performed only on
    the preferred zone of the allocation that invoked direct compaction. But
    compaction itself is attempted on all eligible zones in the zonelist, so
    the behavior is suboptimal and may lead to scenarios where 1) compaction
    is attempted uselessly, or 2) it's not attempted despite good chances of
    succeeding, as shown in the examples below:

    1) A direct compaction with Normal preferred zone failed and set
    deferred compaction for the Normal zone. Another unrelated direct
    compaction with DMA32 as preferred zone will attempt to compact DMA32
    zone even though the first compaction attempt also included DMA32 zone.

    In another scenario, compaction with Normal preferred zone failed to
    compact Normal zone, but succeeded in the DMA32 zone, so it will not
    defer compaction. In the next attempt, it will try Normal zone which
    will fail again, instead of skipping Normal zone and trying DMA32
    directly.

    2) Kswapd will balance DMA32 zone and reset defer status based on
    watermarks looking good. A direct compaction with preferred Normal
    zone will skip compaction of all zones including DMA32 because Normal
    was still deferred. The allocation might have succeeded in DMA32, but
    won't.

    This patch makes compaction deferring work on an individual zone basis
    instead of the preferred zone only. For each zone, it checks compaction_deferred()
    to decide if the zone should be skipped. If watermarks fail after
    compacting the zone, defer_compaction() is called. The zone where
    watermarks passed can still be deferred when the allocation attempt is
    unsuccessful. When allocation is successful, compaction_defer_reset() is
    called for the zone containing the allocated page. This approach should
    approximate calling defer_compaction() only on zones where compaction was
    attempted and did not yield allocated page. There might be corner cases
    but that is inevitable as long as the decision to stop compacting does not
    guarantee that a page will be allocated.

    Due to a new COMPACT_DEFERRED return value, some functions relying
    implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
    more accurate. The did_some_progress output parameter of
    __alloc_pages_direct_compact() is removed completely, as the caller
    actually does not use it after compaction sets it - it is only considered
    when direct reclaim sets it.

    During testing on a two-node machine with a single very small Normal zone
    on node 1, this patch has improved success rates in stress-highalloc
    mmtests benchmark. The success rates here were previously made worse by commit
    3a025760fc15 ("mm: page_alloc: spill to remote nodes before waking
    kswapd") as kswapd was no longer resetting often enough the deferred
    compaction for the Normal zone, and DMA32 zones on both nodes were thus
    not considered for compaction. On a different machine, success rates were
    improved with __GFP_NO_KSWAPD allocations.

    [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
    Signed-off-by: Vlastimil Babka
    Acked-by: Minchan Kim
    Reviewed-by: Zhang Yanfei
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

09 Aug, 2014

2 commits

  • Pages are now uncharged at release time, and all sources of batched
    uncharges operate on lists of pages. Directly use those lists, and
    get rid of the per-task batching state.

    This also batches statistics accounting, in addition to the res
    counter charges, to reduce IRQ-disabling and re-enabling.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Naoya Horiguchi
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

9 commits

    When memory cgroups are enabled, the code that decides to force scanning
    of anonymous pages in get_scan_count() compares global values (free,
    high_watermark) to a value that is restricted to a memory cgroup (file).
    This makes the code over-eager to force an anon scan.

    For instance, it will force an anon scan when scanning a memcg that is
    mainly populated by anonymous pages, even when there are plenty of file
    pages to get rid of in other memcgs, and even when swappiness == 0. It
    breaks the user's expectations about swappiness and hurts performance.

    This patch makes sure that a forced anon scan only happens when there are
    not enough file pages for the whole zone, not just in one random memcg.
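
    A sketch of the zone-wide comparison described above (close to, but not
    necessarily, the exact patch):

    if (global_reclaim(sc)) {
            unsigned long zonefree = zone_page_state(zone, NR_FREE_PAGES);
            unsigned long zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
                                     zone_page_state(zone, NR_INACTIVE_FILE);

            /* force anon scan only if the whole zone is short on file pages */
            if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
                    scan_balance = SCAN_ANON;
                    goto out;
            }
    }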

    [hannes@cmpxchg.org: cleanups]
    Signed-off-by: Jerome Marchand
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
    Quite a while ago, get_scan_ratio() was renamed get_scan_count();
    however, a comment in shrink_active_list() still mentions it. This patch
    fixes the outdated comment.

    Signed-off-by: Jerome Marchand
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • zone->pages_scanned is a write-intensive cache line during page reclaim
    and it's also updated during page free. Move the counter into vmstat to
    take advantage of the per-cpu updates and do not update it in the free
    paths unless necessary.

    On a small UMA machine running tiobench the difference is marginal. On
    a 4-node machine the overhead is more noticeable. Note that automatic
    NUMA balancing was disabled for this test as otherwise the system CPU
    overhead is unpredictable.

                   3.16.0-rc3    3.16.0-rc3    3.16.0-rc3
                      vanilla  rearrange-v5     vmstat-v5
    User               746.94        759.78        774.56
    System           65336.22      58350.98      32847.27
    Elapsed          27553.52      27282.02      27415.04

    Note that the overhead reduction will vary depending on where exactly
    pages are allocated and freed.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • vm_total_pages is calculated by nr_free_pagecache_pages(), which counts
    the number of pages which are beyond the high watermark within all
    zones. So vm_total_pages is not equal to the total number of pages which
    the VM controls.

    Signed-off-by: Wang Sheng-Hui
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • Reorder the members by input and output, then turn the individual
    integers for may_writepage, may_unmap, may_swap, compaction_ready,
    hibernation_mode into bit fields to save stack space:

    +72/-296 -224
    kswapd 104 176 +72
    try_to_free_pages 80 56 -24
    try_to_free_mem_cgroup_pages 80 56 -24
    shrink_all_memory 88 64 -24
    reclaim_clean_pages_from_list 168 144 -24
    mem_cgroup_shrink_node_zone 104 80 -24
    __zone_reclaim 176 152 -24
    balance_pgdat 152 - -152
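
    A sketch of the reordered structure per the description above (other
    members omitted; the exact grouping is assumed):

    struct scan_control {
            /* ... incoming reclaim parameters first ... */

            /* inputs: what reclaim is allowed to do */
            unsigned int may_writepage:1;
            unsigned int may_unmap:1;
            unsigned int may_swap:1;
            unsigned int hibernation_mode:1;

            /* output: one zone is ready for compaction */
            unsigned int compaction_ready:1;

            /* ... scan/reclaim counters last ... */
    };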

    Signed-off-by: Johannes Weiner
    Suggested-by: Mel Gorman
    Acked-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Swappiness is determined for each scanned memcg individually in
    shrink_zone() and is not a parameter that applies throughout the reclaim
    scan. Move it out of struct scan_control to prevent accidental use of a
    stale value.
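
    A minimal sketch of the idea (argument order illustrative):

    /* look up swappiness per scanned memcg, right where the lruvec is shrunk */
    int swappiness = mem_cgroup_swappiness(memcg);

    shrink_lruvec(lruvec, swappiness, sc);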

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Direct reclaim currently calls shrink_zones() to reclaim all members of
    a zonelist, and if that wasn't successful it does another pass through
    the same zonelist to check overall reclaimability.

    Just check reclaimability in shrink_zones() directly and propagate the
    result through the return value. Then remove all_unreclaimable().

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Page reclaim for a higher-order page runs until compaction is ready,
    then aborts and signals this situation through the return value of
    shrink_zones(). This is an oddly specific signal to encode in the
    return value of shrink_zones(), though, and can be quite confusing.

    Introduce sc->compaction_ready and signal the compactability of the
    zones out-of-band to free up the return value of shrink_zones() for
    actual zone reclaimability.

    Signed-off-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shrink_zones() has a special branch to skip the all_unreclaimable()
    check during hibernation, because a frozen kswapd can't mark a zone
    unreclaimable.

    But ever since commit 6e543d5780e3 ("mm: vmscan: fix
    do_try_to_free_pages() livelock"), determining a zone to be
    unreclaimable is done by directly looking at its scan history and no
    longer relies on kswapd setting the per-zone flag.

    Remove this branch and let shrink_zones() check the reclaimability of
    the target zones regardless of hibernation state.

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

09 Jun, 2014

1 commit

  • shrink_inactive_list() used to wait 0.1s to avoid congestion when all
    the pages that were isolated from the inactive list were dirty but not
    under active writeback. That makes no real sense, and apparently causes
    major interactivity issues under some loads since 3.11.

    The ostensible reason for it was to wait for kswapd to start writing
    pages, but that seems questionable as well, since the congestion wait
    code seems to trigger for kswapd itself as well. Also, the logic behind
    delaying anything when we haven't actually started writeback is not
    clear - it only delays actually starting that writeback.

    We'll still trigger the congestion waiting if

    (a) the process is kswapd, and we hit pages flagged for immediate
    reclaim

    (b) the process is not kswapd, and the zone backing dev writeback is
    actually congested.

    This probably needs to be revisited, but as it is this fixes a reported
    regression.

    Reported-by: Felipe Contreras
    Pinpointed-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Jun, 2014

3 commits

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     
  • Memory reclaim always uses swappiness of the reclaim target memcg
    (origin of the memory pressure) or vm_swappiness for global memory
    reclaim. This behavior was consistent (except for difference between
    global and hard limit reclaim) because swappiness was enforced to be
    consistent within each memcg hierarchy.

    After "mm: memcontrol: remove hierarchy restrictions for swappiness and
    oom_control" each memcg can have its own swappiness independent of
    hierarchical parents, though, so the consistency guarantee is gone.
    This can lead to an unexpected behavior. Say that a group is explicitly
    configured to not swap out via memory.swappiness=0, but its memory gets
    swapped out anyway when the memory pressure comes from its parent with a
    higher swappiness setting. It is also unexpected that the knob is
    meaningless without setting the hard limit which would trigger the
    reclaim and enforce the swappiness.
    There are setups where the hard limit is configured higher in the
    hierarchy by an administrator and children groups are under control of
    somebody else who is interested in the swapout behavior but not
    necessarily about the memory limit.

    From a semantic point of view, swappiness is an attribute defining anon
    vs. file proportional scanning of the LRU, which is memcg specific
    (unlike charges, which are propagated up the hierarchy), so it should be
    applied to the particular memcg's LRU regardless of where the memory
    pressure comes from.

    This patch removes vmscan_swappiness() and stores the swappiness into
    the scan_control structure. mem_cgroup_swappiness is then used to
    provide the correct value before shrink_lruvec is called. The global
    vm_swappiness is used for the root memcg.

    [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When kswapd exits, it can end up taking locks that were previously held
    by allocating tasks while they waited for reclaim. Lockdep currently
    warns about this:

    On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
    > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
    > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
    > {RECLAIM_FS-ON-W} state was registered at:
    > mark_held_locks+0xb9/0x140
    > lockdep_trace_alloc+0x7a/0xe0
    > kmem_cache_alloc_trace+0x37/0x240
    > flex_array_alloc+0x99/0x1a0
    > cgroup_attach_task+0x63/0x430
    > attach_task_by_pid+0x210/0x280
    > cgroup_procs_write+0x16/0x20
    > cgroup_file_write+0x120/0x2c0
    > vfs_write+0xc0/0x1f0
    > SyS_write+0x4c/0xa0
    > tracesys+0xdd/0xe2
    > irq event stamp: 49
    > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
    > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
    > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
    > softirqs last disabled at (0): (null)
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(&sig->group_rwsem);
    >
    > lock(&sig->group_rwsem);
    >
    > *** DEADLOCK ***
    >
    > no locks held by kswapd2/1151.
    >
    > stack backtrace:
    > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
    > Call Trace:
    > dump_stack+0x19/0x1b
    > print_usage_bug+0x1f7/0x208
    > mark_lock+0x21d/0x2a0
    > __lock_acquire+0x52a/0xb60
    > lock_acquire+0xa2/0x140
    > down_read+0x51/0xa0
    > exit_signals+0x24/0x130
    > do_exit+0xb5/0xa50
    > kthread+0xdb/0x100
    > ret_from_fork+0x7c/0xb0

    This is because the kswapd thread is still marked as a reclaimer at the
    time of exit. But because it is exiting, nobody is actually waiting on
    it to make reclaim progress anymore, and it's nothing but a regular
    thread at this point. Be tidy and strip it of all its powers
    (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
    before returning from the thread function.

    Signed-off-by: Johannes Weiner
    Reported-by: Gu Zheng
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Jun, 2014

9 commits

  • Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
    ensured that file/anon lists were scanned proportionally for reclaim from
    kswapd but ignored it for direct reclaim. The intent was to minimise
    direct reclaim latency, but Yuanhan Liu pointed out that it substitutes one
    long stall for many small stalls and distorts aging for normal workloads
    like streaming readers/writers. Hugh Dickins pointed out that a
    side-effect of the same commit was that when one LRU list dropped to zero
    that the entirety of the other list was shrunk leading to excessive
    reclaim in memcgs. This patch scans the file/anon lists proportionally
    for direct reclaim to similarly age pages whether reclaimed by kswapd or
    direct reclaim but takes care to abort reclaim if one LRU drops to zero
    after reclaiming the requested number of pages.

    Based on ext4 and using the Intel VM scalability test

    3.15.0-rc5 3.15.0-rc5
    shrinker proportion
    Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
    Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
    Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
    Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
    Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
    Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)

    The test cases are running multiple dd instances reading sparse files. The results are within
    the noise for the small test machine. The impact of the patch is more noticeable in the vmstats

    3.15.0-rc5 3.15.0-rc5
    shrinker proportion
    Minor Faults 35154 36784
    Major Faults 611 1305
    Swap Ins 394 1651
    Swap Outs 4394 5891
    Allocation stalls 118616 44781
    Direct pages scanned 4935171 4602313
    Kswapd pages scanned 15921292 16258483
    Kswapd pages reclaimed 15913301 16248305
    Direct pages reclaimed 4933368 4601133
    Kswapd efficiency 99% 99%
    Kswapd velocity 670088.047 682555.961
    Direct efficiency 99% 99%
    Direct velocity 207709.217 193212.133
    Percentage direct scans 23% 22%
    Page writes by reclaim 4858.000 6232.000
    Page writes file 464 341
    Page writes anon 4394 5891

    Note that there are fewer allocation stalls even though the amount
    of direct reclaim scanning is very approximately the same.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tim Chen
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
    / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to
    use the DIV_ROUND_UP macro for neater code and clearer meaning.

    Besides, the gap value is calculated against the per-zone "managed pages",
    not "present pages". This patch also corrects the comment and does some
    rephrasing.
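
    A before/after sketch of the substitution (the balance_gap variable name
    is taken from the surrounding kswapd code; this is not the verbatim
    diff):

    /* before: open-coded round-up */
    balance_gap = (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                    KSWAPD_ZONE_BALANCE_GAP_RATIO;

    /* after: same arithmetic, clearer intent */
    balance_gap = DIV_ROUND_UP(zone->managed_pages,
                               KSWAPD_ZONE_BALANCE_GAP_RATIO);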

    Signed-off-by: Jianyu Zhan
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • cold is a bool, make it one. Make the likely case the "if" part of the
    block instead of the else as according to the optimisation manual this is
    preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now that we are doing NUMA-aware shrinking, and can have shrinkers
    running in parallel, or working on individual nodes, it seems like we
    should also be sticking the node in the output.

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I was looking at a trace of the slab shrinkers (attachment in this comment):

    https://bugs.freedesktop.org/show_bug.cgi?id=72742#c67

    and noticed that "total_scan" can go negative in some cases. We
    used to dump out the "total_scan" variable directly, but some of
    the shrinker modifications along the way changed that.

    This patch just dumps it out directly, again. It doesn't make
    any sense to derive it from new_nr and nr any more since there
    are now other shrinkers that can be running in parallel and
    mucking with those values.

    Here's an example of the negative numbers in the output:

    > kswapd0-840 [000] 160.869398: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
    > kswapd0-840 [000] 160.869618: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
    > kswapd0-840 [000] 160.870031: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
    > kswapd0-840 [000] 160.870464: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
    > kswapd0-840 [000] 163.384144: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
    > kswapd0-840 [000] 163.384297: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
    > kswapd0-840 [000] 163.384414: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
    > kswapd0-840 [000] 163.384657: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
    > kswapd0-840 [000] 163.384880: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
    > kswapd0-840 [000] 163.385256: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
    > kswapd0-840 [000] 163.385598: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When a loopback NFS mount is active and the backing device for the NFS
    mount becomes congested, that can impose throttling delays on the nfsd
    threads.

    These delays significantly reduce throughput and so the NFS mount remains
    congested.

    This results in a livelock and the reduced throughput persists.

    This livelock has been found in testing with the 'wait_iff_congested'
    call, and could possibly be caused by the 'congestion_wait' call.

    This livelock is similar to the deadlock which justified the introduction
    of PF_LESS_THROTTLE, and the same flag can be used to remove this
    livelock.

    To minimise the impact of the change, we still throttle nfsd when the
    filesystem it is writing to is congested, but not when some separate
    filesystem (e.g. the NFS filesystem) is congested.

    Signed-off-by: NeilBrown
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • throttle_direct_reclaim() is meant to trigger during swap-over-network
    during which the min watermark is treated as a pfmemalloc reserve. It
    throttles on the first node in the zonelist but this is flawed.

    The user-visible impact is that a process running on CPU whose local
    memory node has no ZONE_NORMAL will stall for prolonged periods of time,
    possibly indefinitely. This is due to throttle_direct_reclaim thinking the
    pfmemalloc reserves are depleted when in fact they don't exist on that
    node.

    On a NUMA machine running a 32-bit kernel (I know) allocation requests
    from CPUs on node 1 would detect no pfmemalloc reserves and the process
    gets throttled. This patch adjusts throttling of direct reclaim to
    throttle based on the first node in the zonelist that has a usable
    ZONE_NORMAL or lower zone.
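
    A sketch of the adjusted throttling walk (close to the described fix, but
    not the verbatim diff):

    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                    gfp_zone(gfp_mask), nodemask) {
            /* skip highmem-only entries; they carry no pfmemalloc reserves */
            if (zone_idx(zone) > ZONE_NORMAL)
                    continue;

            /* throttle against the first node with a usable lower zone */
            pgdat = zone->zone_pgdat;
            if (pfmemalloc_watermark_ok(pgdat))
                    goto out;
            break;
    }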

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of race as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves for the purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations to locking order, which are not
    desirable, and can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except they block not cpu, but memory hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of online nodes
    mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules on it, which narrows down the set of its
    usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing a code
    inside a get/put_online_mems section will guarantee a stable value of
    online nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Prior to this change, we would decide whether to force scan a LRU during
    reclaim if that LRU itself was too small for the current priority.
    However, this can lead to the file LRU getting force scanned even if
    there are a lot of anonymous pages we can reclaim, leading to hot file
    pages getting needlessly reclaimed.

    To address this, we instead only force scan when none of the reclaimable
    LRUs are big enough.

    Gives huge improvements with zswap. For example, when doing -j20 kernel
    build in a 500MB container with zswap enabled, runtime (in seconds) is
    greatly reduced:

    x without this change
    + with this change
    N Min Max Median Avg Stddev
    x 5 700.997 790.076 763.928 754.05 39.59493
    + 5 141.634 197.899 155.706 161.9 21.270224
    Difference at 95.0% confidence
    -592.15 +/- 46.3521
    -78.5293% +/- 6.14709%
    (Student's t, pooled s = 31.7819)

    Should also give some improvements in regular (non-zswap) swap cases.

    Yes, hughd found significant speedup using regular swap, with several
    memcgs under pressure; and it should also be effective in the non-memcg
    case, whenever one or another zone LRU is forced too small.
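
    An abstract sketch of the decision rule, using an illustrative lru_size()
    helper rather than the kernel's actual accessors:

    bool all_lrus_too_small = true;
    enum lru_list lru;

    for_each_evictable_lru(lru) {
            /* "big enough" means the LRU still yields pages at this priority */
            if (lru_size(lruvec, lru) >> sc->priority) {
                    all_lrus_too_small = false;
                    break;
            }
    }

    /* fall back to force-scanning tiny LRUs only when nothing else is left */
    force_scan = all_lrus_too_small;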

    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Cc: Yuanhan Liu
    Cc: Seth Jennings
    Cc: Bob Liu
    Cc: Minchan Kim
    Cc: Luigi Semenzato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suleiman Souhlal
     

07 May, 2014

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages
    just because free+file is low") because it introduced a regression in
    mostly-anonymous workloads, where reclaim would become ineffective and
    trap every allocating task in direct reclaim.

    The problem is that there is a runaway feedback loop in the scan balance
    between file and anon, where the balance tips heavily towards a tiny
    thrashing file LRU and anonymous pages are no longer being looked at.
    The commit in question removed the safeguard that would detect such
    situations and respond with forced anonymous reclaim.

    This commit was part of a series to fix premature swapping in loads with
    relatively little cache, and while it made a small difference, the cure
    is obviously worse than the disease. Revert it.

    Signed-off-by: Johannes Weiner
    Reported-by: Christian Borntraeger
    Acked-by: Christian Borntraeger
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

19 Apr, 2014

1 commit