13 Feb, 2019

1 commit

  • This reverts commit 172b06c32b9497 ("mm: slowly shrink slabs with a
    relatively small number of objects").

    This change alters the aggressiveness of shrinker reclaim, causing
    small-cache and low-priority reclaim to greatly increase scanning
    pressure on small caches. As a result, light memory pressure has a
    disproportionate effect on small caches, and causes large caches to be
    reclaimed much faster than previously.

    As a result, it greatly perturbs the delicate balance of the VFS caches
    (dentry/inode vs file page cache) such that the inode/dentry caches are
    reclaimed much, much faster than the page cache and this drives us into
    several other caching imbalance related problems.

    As such, this is a bad change and needs to be reverted.

    [ Needs some massaging to retain the later seekless shrinker
    modifications.]

    Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
    Fixes: 172b06c32b9497 ("mm: slowly shrink slabs with a relatively small number of objects")
    Signed-off-by: Dave Chinner
    Cc: Wolfgang Walter
    Cc: Roman Gushchin
    Cc: Spock
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

29 Dec, 2018

2 commits

  • Waiting on a page migration entry has used wait_on_page_locked() all along
    since 2006: but you cannot safely wait_on_page_locked() without holding a
    reference to the page, and that extra reference is enough to make
    migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
    on the entry before migrate_page_move_mapping() gets there.

    And that failure is retried nine times, amplifying the pain when trying to
    migrate a popular page. With a single persistent faulter, migration
    sometimes succeeds; with two or three concurrent faulters, success becomes
    much less likely (and the more the page was mapped, the worse the overhead
    of unmapping and remapping it on each try).

    This is especially a problem for memory offlining, where the outer level
    retries forever (or until terminated from userspace), because a heavy
    refault workload can trigger an endless loop of migration failures.
    wait_on_page_locked() is the wrong tool for the job.

    David Herrmann (but was he the first?) noticed this issue in 2014:
    https://marc.info/?l=linux-mm&m=140110465608116&w=2

    Tim Chen started a thread in August 2017 which appears relevant:
    https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
    on to implicate __migration_entry_wait():
    https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
    up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
    list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
    wake_up_page_bit")

    Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
    https://marc.info/?l=linux-mm&m=154217936431300&w=2

    We have all assumed that it is essential to hold a page reference while
    waiting on a page lock: partly to guarantee that there is still a struct
    page when MEMORY_HOTREMOVE is configured, but also to protect against
    reuse of the struct page going to someone who then holds the page locked
    indefinitely, when the waiter can reasonably expect timely unlocking.

    But in fact, so long as wait_on_page_bit_common() does the put_page(), and
    is careful not to rely on struct page contents thereafter, there is no
    need to hold a reference to the page while waiting on it. That does mean
    that this case cannot go back through the loop: but that's fine for the
    page migration case, and even if used more widely, is limited by the "Stop
    walking if it's locked" optimization in wake_page_function().

    Add interface put_and_wait_on_page_locked() to do this, using "behavior"
    enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
    No interruptible or killable variant needed yet, but they might follow: I
    have a vague notion that reporting -EINTR should take precedence over
    return from wait_on_page_bit_common() without knowing the page state, so
    arrange it accordingly - but that may be nothing but pedantic.
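    The shape of the new helper can be sketched in a userspace model. The
    toy struct and names below are illustrative only (the real helper lives
    in mm/filemap.c and must be far more careful about memory ordering);
    the DROP case models the new "behavior" enum value:

```c
#include <assert.h>

/* Userspace model of the put-before-wait idea: drop the caller's
 * reference before blocking, and never touch the struct again once the
 * wait returns.  Types and names here are illustrative, not the
 * kernel's. */
struct toy_page {
	int refcount;
	int locked;
};

enum behavior { BEHAVIOR_SHARED, BEHAVIOR_EXCLUSIVE, BEHAVIOR_DROP };

static void toy_put_page(struct toy_page *p)
{
	p->refcount--;
}

/* Models put_and_wait_on_page_locked(): the reference is dropped before
 * sleeping, so a migrator that freezes the refcount no longer sees our
 * extra reference. */
static void toy_put_and_wait_on_page_locked(struct toy_page *p)
{
	toy_put_page(p);
	while (p->locked)
		;	/* kernel: sleep in wait_on_page_bit_common() */
	/* must not rely on the struct's contents from here on */
}
```

    With the reference dropped before sleeping, migrate_page_move_mapping()'s
    expected-refcount check can succeed even while waiters exist.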

    __migration_entry_wait() still has to take a brief reference to the page,
    prior to calling put_and_wait_on_page_locked(): but now that it is dropped
    before waiting, the chance of impeding page migration is very much
    reduced. Should we perhaps disable preemption across this?

    shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
    survived a lot of testing before that showed up. PageWaiters may have
    been set by wait_on_page_bit_common(), and the reference dropped, just
    before shrink_page_list() succeeds in freezing its last page reference: in
    such a case, unlock_page() must be used. Follow the suggestion from
    Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
    that optimization predates PageWaiters, and won't buy much these days; but
    we can reinstate it for the !PageWaiters case if anyone notices.

    It does raise the question: should vmscan.c's is_page_cache_freeable() and
    __remove_mapping() now treat a PageWaiters page as if an extra reference
    were held? Perhaps, but I don't think it matters much, since
    shrink_page_list() already had to win its trylock_page(), so waiters are
    not very common there: I noticed no difference when trying the bigger
    change, and it's surely not needed while put_and_wait_on_page_locked() is
    only used for page migration.

    [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
    Signed-off-by: Hugh Dickins
    Reported-by: Baoquan He
    Tested-by: Baoquan He
    Reviewed-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Acked-by: Linus Torvalds
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Baoquan He
    Cc: David Hildenbrand
    Cc: Mel Gorman
    Cc: David Herrmann
    Cc: Tim Chen
    Cc: Kan Liang
    Cc: Andi Kleen
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

    The kernel reduces the probability of such events by increasing the
    watermark sizes by calling set_recommended_min_free_kbytes early in the
    lifetime of the system. This works reasonably well in general but if
    there are enough sparsely populated pageblocks then the problem can still
    occur as enough memory is free overall and kswapd stays asleep.

    This patch introduces a watermark_boost_factor sysctl that allows a zone
    watermark to be temporarily boosted when an external fragmentation-causing
    event occurs. The boosting will stall allocations that would decrease
    free memory below the boosted low watermark, and kswapd is woken, if the
    calling context allows it, to reclaim an amount of memory relative to the
    size of the high watermark and the watermark_boost_factor until the boost
    is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
    to clean some of the pageblocks that may have been affected by the
    fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
    from reclaim context during this operation to avoid excessive system
    disruption in the name of fragmentation avoidance. Care is taken so that
    kswapd will do normal reclaim work if the system is really low on memory.
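    The boost arithmetic can be sketched as follows. The factor's units
    (1/10000 of the high watermark, so the default 15000 allows boosting up
    to 150% of it) match the sysctl's documentation, but the helper below is
    an illustrative reconstruction, not the allocator's exact code:

```c
#include <assert.h>

/* Sketch of boosting a zone watermark after a fragmentation event.
 * watermark_boost_factor is in units of 1/10000 of the high watermark;
 * the boost is assumed to grow one pageblock at a time, capped at
 * max_boost. */
static unsigned long boost_watermark(unsigned long high_wmark,
				     unsigned long cur_boost,
				     unsigned long boost_factor,
				     unsigned long pageblock_nr_pages)
{
	unsigned long max_boost = high_wmark * boost_factor / 10000;

	cur_boost += pageblock_nr_pages;
	return cur_boost < max_boost ? cur_boost : max_boost;
}
```

    Setting the factor to 0 disables boosting entirely, since max_boost
    collapses to zero.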

    This was evaluated using the same workloads as "mm, page_alloc: Spread
    allocations across zones before introducing fragmentation".

    1-socket Skylake machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 1 THP allocating thread
    --------------------------------------

    4.20-rc3 extfrag events < order 9: 804694
    4.20-rc3+patch: 408912 (49% reduction)
    4.20-rc3+patch1-4: 18421 (98% reduction)

                                    4.20.0-rc3           4.20.0-rc3
                                  lowzone-v5r8           boost-v5r8
    Amean     fault-base-1      653.58 (  0.00%)     652.71 (  0.13%)
    Amean     fault-huge-1        0.00 (  0.00%)     178.93 *-99.00%*

                                    4.20.0-rc3           4.20.0-rc3
                                  lowzone-v5r8           boost-v5r8
    Percentage huge-1               0.00 (  0.00%)       5.12 (100.00%)

    Note that external fragmentation-causing events are massively reduced by
    this patch, whether in comparison to the previous kernel or the vanilla
    kernel. The fault latency for huge pages appears to be increased, but that
    is only because THP allocations were successful with the patch applied.

    1-socket Skylake machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 291392
    4.20-rc3+patch: 191187 (34% reduction)
    4.20-rc3+patch1-4: 13464 (95% reduction)

    thpfioscale Fault Latencies
                                    4.20.0-rc3           4.20.0-rc3
                                  lowzone-v5r8           boost-v5r8
    Min       fault-base-1      912.00 (  0.00%)     905.00 (  0.77%)
    Min       fault-huge-1      127.00 (  0.00%)     135.00 ( -6.30%)
    Amean     fault-base-1     1467.55 (  0.00%)    1481.67 ( -0.96%)
    Amean     fault-huge-1     1127.11 (  0.00%)    1063.88 *  5.61%*

                                    4.20.0-rc3           4.20.0-rc3
                                  lowzone-v5r8           boost-v5r8
    Percentage huge-1              77.64 (  0.00%)      83.46 (  7.49%)

    As before, massive reduction in external fragmentation events, some jitter
    on latencies and an increase in THP allocation success rates.

    2-socket Haswell machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 5 THP allocating threads
    ----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 215698
    4.20-rc3+patch: 200210 (7% reduction)
    4.20-rc3+patch1-4: 14263 (93% reduction)

                                    4.20.0-rc3           4.20.0-rc3
                                  lowzone-v5r8           boost-v5r8
    Amean     fault-base-5     1346.45 (  0.00%)    1306.87 (  2.94%)
    Amean     fault-huge-5     3418.60 (  0.00%)    1348.94 ( 60.54%)

                                    4.20.0-rc3           4.20.0-rc3
                                  lowzone-v5r8           boost-v5r8
    Percentage huge-5               0.78 (  0.00%)       7.91 (910.64%)

    There is a 93% reduction in fragmentation causing events, there is a big
    reduction in the huge page fault latency and allocation success rate is
    higher.

    2-socket Haswell machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 166352
    4.20-rc3+patch: 147463 (11% reduction)
    4.20-rc3+patch1-4: 11095 (93% reduction)

    thpfioscale Fault Latencies
                                    4.20.0-rc3           4.20.0-rc3
                                  lowzone-v5r8           boost-v5r8
    Amean     fault-base-5     6217.43 (  0.00%)    7419.67 *-19.34%*
    Amean     fault-huge-5     3163.33 (  0.00%)    3263.80 ( -3.18%)

                                    4.20.0-rc3           4.20.0-rc3
                                  lowzone-v5r8           boost-v5r8
    Percentage huge-5              95.14 (  0.00%)      87.98 ( -7.53%)

    There is a large reduction in fragmentation events with some jitter around
    the latencies and success rates. As before, the high THP allocation
    success rate does mean the system is under a lot of pressure. However, as
    the fragmentation events are reduced, it would be expected that the
    long-term allocation success rate would be higher.

    Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Nov, 2018

1 commit


07 Nov, 2018

1 commit

  • The i915 driver uses shmemfs to allocate backing storage for gem
    objects. These shmemfs pages can be pinned (increased ref count) by
    shmem_read_mapping_page_gfp(). When a lot of pages are pinned, vmscan
    wastes a lot of time scanning these pinned pages. In some extreme case,
    all pages in the inactive anon lru are pinned, and only the inactive
    anon lru is scanned due to inactive_ratio, the system cannot swap and
    invokes the oom-killer. Mark these pinned pages as unevictable to speed
    up vmscan.

    Export pagevec API check_move_unevictable_pages().

    This patch was inspired by Chris Wilson's change [1].

    [1]: https://patchwork.kernel.org/patch/9768741/

    Cc: Chris Wilson
    Cc: Joonas Lahtinen
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Dave Hansen
    Signed-off-by: Kuo-Hsin Yang
    Acked-by: Michal Hocko # mm part
    Reviewed-by: Chris Wilson
    Acked-by: Dave Hansen
    Acked-by: Andrew Morton
    Link: https://patchwork.freedesktop.org/patch/msgid/20181106132324.17390-1-chris@chris-wilson.co.uk
    Signed-off-by: Chris Wilson

    Kuo-Hsin Yang
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

27 Oct, 2018

4 commits

  • The page cache and most shrinkable slab caches hold data that has been
    read from disk, but there are some caches that only cache CPU work, such
    as the dentry and inode caches of procfs and sysfs, as well as the subset
    of radix tree nodes that track non-resident page cache.

    Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
    the shrinker's seeks setting tells the reclaim algorithm that for every
    two page cache pages scanned it should scan one slab object.

    This is a bogus setting. A virtual inode that required no IO to create is
    not twice as valuable as a page cache page; shadow cache entries with
    eviction distances beyond the size of memory aren't either.

    In most cases, the behavior in practice is still fine. Such virtual
    caches don't tend to grow and assert themselves aggressively, and usually
    get picked up before they cause problems. But there are scenarios where
    that's not true.

    Our database workloads suffer from two of those. For one, their file
    workingset is several times bigger than available memory, which has the
    kernel aggressively create shadow page cache entries for the non-resident
    parts of it. The workingset code does tell the VM that most of these are
    expendable, but the VM ends up balancing them 2:1 to cache pages as per
    the seeks setting. This is a huge waste of memory.

    These workloads also deal with tens of thousands of open files and use
    /proc for introspection, which ends up growing the proc_inode_cache to
    absurdly large sizes - again at the cost of valuable cache space, which
    isn't a reasonable trade-off, given that proc inodes can be re-created
    without involving the disk.

    This patch implements a "zero-seek" setting for shrinkers that results in
    a target ratio of 0:1 between their objects and IO-backed caches. This
    allows such virtual caches to grow when memory is available (they do
    cache/avoid CPU work after all), but effectively disables them as soon as
    IO-backed objects are under pressure.

    It then switches the shrinkers for procfs and sysfs metadata, as well as
    excess page cache shadow nodes, to the new zero-seek setting.
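    Concretely, the seeks setting enters the scan-target calculation roughly
    like this (a sketch of the vmscan logic; the kernel's version uses
    do_div() and clamps the result further):

```c
#include <assert.h>

#define DEFAULT_SEEKS 2

/* Scan target for a shrinker with 'freeable' objects at a given reclaim
 * priority.  seeks == 0 models the new zero-seek setting: such a cache
 * is trimmed aggressively, regardless of priority, as soon as it is
 * scanned at all. */
static unsigned long shrink_delta(unsigned long freeable, int priority,
				  int seeks)
{
	unsigned long delta;

	if (seeks) {
		delta = freeable >> priority;
		delta *= 4;
		delta /= seeks;
	} else {
		delta = freeable / 2;
	}
	return delta;
}
```

    With DEFAULT_SEEKS the 2:1 page-to-object ratio described above falls
    out of the `*4 / 2` arithmetic; with seeks == 0 half the freeable
    objects become a scan target at once.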

    Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Domas Mituzas
    Reviewed-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When systems are overcommitted and resources become contended, it's hard
    to tell exactly the impact this has on workload productivity, or how close
    the system is to lockups and OOM kills. In particular, when machines work
    multiple jobs concurrently, the impact of overcommit in terms of latency
    and throughput on the individual job can be enormous.

    In order to maximize hardware utilization without sacrificing individual
    job health or risk complete machine lockups, this patch implements a way
    to quantify resource pressure in the system.

    A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
    expose the percentage of time the system is stalled on CPU, memory, or IO,
    respectively. Stall states are aggregate versions of the per-task delay
    accounting delays:

    cpu: some tasks are runnable but not executing on a CPU
    memory: tasks are reclaiming, or waiting for swapin or thrashing cache
    io: tasks are waiting for io completions

    These percentages of walltime can be thought of as pressure percentages,
    and they give a general sense of system health and productivity loss
    incurred by resource overcommit. They can also indicate when the system
    is approaching lockup scenarios and OOMs.

    To do this, psi keeps track of the task states associated with each CPU
    and samples the time they spend in stall states. Every 2 seconds, the
    samples are averaged across CPUs - weighted by the CPUs' non-idle time to
    eliminate artifacts from unused CPUs - and translated into percentages of
    walltime. A running average of those percentages is maintained over 10s,
    1m, and 5m periods (similar to the loadaverage).
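    The averaging step can be illustrated with a simplified floating-point
    sketch (the kernel uses fixed-point arithmetic and per-window exponents,
    so the decay constant here is purely illustrative):

```c
#include <assert.h>

/* Blend a new stall percentage into a running average; 'decay' plays
 * the role of the per-window exponent (the 10s window decays faster
 * than the 1m and 5m ones, loadavg-style). */
static double psi_avg(double old_avg, double sample_pct, double decay)
{
	return old_avg * decay + sample_pct * (1.0 - decay);
}
```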

    [hannes@cmpxchg.org: doc fixlet, per Randy]
    Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
    [hannes@cmpxchg.org: code optimization]
    Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
    [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
    Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
    [hannes@cmpxchg.org: fix build]
    Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Refaults happen during transitions between workingsets as well as in-place
    thrashing. Knowing the difference between the two has a range of
    applications, including measuring the impact of memory shortage on the
    system performance, as well as the ability to smarter balance pressure
    between the filesystem cache and the swap-backed workingset.

    During workingset transitions, inactive cache refaults and pushes out
    established active cache. When that active cache isn't stale, however,
    and also ends up refaulting, that's bonafide thrashing.

    Introduce a new page flag that tells on eviction whether the page has been
    active or not in its lifetime. This bit is then stored in the shadow
    entry, to classify refaults as transitioning or thrashing.

    How many page->flags does this leave us with on 32-bit?

    20 bits are always page flags

    21 if you have an MMU

    23 with the zone bits for DMA, Normal, HighMem, Movable

    29 with the sparsemem section bits

    30 if PAE is enabled

    31 with this patch.

    So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
    that's not enough, the system can switch to discontigmem and re-gain the 6
    or 7 sparsemem section bits.

    Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    I've noticed that dying memory cgroups are often pinned in memory by a
    single pagecache page. Even under moderate memory pressure they sometimes
    stayed in that state for a long time. That looked strange.

    My investigation showed that the problem is caused by applying the LRU
    pressure balancing math:

    scan = div64_u64(scan * fraction[lru], denominator),

    where

    denominator = fraction[anon] + fraction[file] + 1.

    Because fraction[lru] is always less than denominator, if the initial scan
    size is 1, the result is always 0.

    This means the last page is never scanned and has no chance of being
    reclaimed.

    Fix this by rounding up the result of the division.
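    The fix amounts to a round-up division in the spirit of the kernel's
    DIV64_U64_ROUND_UP(); a minimal sketch of before and after (helper names
    are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* The pre-fix calculation: plain truncating division.  With an initial
 * scan of 1, fraction[lru] < denominator makes the result 0. */
static uint64_t scan_truncated(uint64_t scan, uint64_t fraction,
			       uint64_t denominator)
{
	return scan * fraction / denominator;
}

/* Round-up division: a nonzero scan target can no longer collapse to
 * zero, so the last charged page of a dying cgroup still gets scanned. */
static uint64_t scan_round_up(uint64_t scan, uint64_t fraction,
			      uint64_t denominator)
{
	return (scan * fraction + denominator - 1) / denominator;
}
```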

    In practice this change significantly improves the speed of dying cgroups
    reclaim.

    [guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments]
    Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle
    Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

21 Oct, 2018

2 commits


06 Oct, 2018

1 commit

    do_shrink_slab() returns an unsigned long value, and placing it into an
    int variable cuts the high bytes off. Then we compare ret with
    0xfffffffe (since SHRINK_EMPTY is converted to ret's type).

    Thus a large number of objects returned by do_shrink_slab() may be
    interpreted as SHRINK_EMPTY, if the low bytes of its value are equal to
    0xfffffffe. Fix that by declaring ret as unsigned long in these
    functions.

    Link: http://lkml.kernel.org/r/153813407177.17544.14888305435570723973.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reported-by: Cyrill Gorcunov
    Acked-by: Cyrill Gorcunov
    Reviewed-by: Josef Bacik
    Cc: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kirill Tkhai
     

21 Sep, 2018

1 commit

  • 9092c71bb724 ("mm: use sc->priority for slab shrink targets") changed the
    way that the target slab pressure is calculated and made it
    priority-based:

    delta = freeable >> priority;
    delta *= 4;
    do_div(delta, shrinker->seeks);

    The problem is that on the default priority (which is 12) no pressure is
    applied at all if the number of potentially reclaimable objects is less
    than 4096 (1<<12).
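    The arithmetic can be checked with a quick sketch of the quoted formula
    (seeks assumed to be DEFAULT_SEEKS == 2):

```c
#include <assert.h>

/* The quoted formula: at the default priority of 12, any cache with
 * fewer than 4096 (1 << 12) freeable objects yields delta == 0, i.e.
 * no pressure at all. */
static unsigned long slab_pressure(unsigned long freeable, int priority)
{
	unsigned long delta = freeable >> priority;

	delta *= 4;
	return delta / 2;	/* do_div(delta, DEFAULT_SEEKS) */
}
```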
    Acked-by: Rik van Riel
    Cc: Josef Bacik
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

23 Aug, 2018

2 commits

    page_freeze_refs/page_unfreeze_refs have already been replaced by
    page_ref_freeze/page_ref_unfreeze, but the comments were not updated
    to match.

    Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
    Signed-off-by: Jiang Biao
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Biao
     
    There is a sad BUG introduced in the patch adding SHRINKER_REGISTERING.
    The shrinker_idr business is only for memcg-aware shrinkers. Only that
    type of shrinker has an id, and only such shrinkers must finally be
    installed via idr_replace() in this function. For !memcg-aware shrinkers
    we never initialize the shrinker->id field.

    But shrinkers of all types are passed to idr_replace(), so every
    !memcg-aware shrinker with a random ID (most probably, its id is 0)
    replaces the memcg-aware shrinker pointed to by that ID in the IDR.

    This patch fixes the problem.
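    The fix can be sketched as gating the IDR update on the memcg-aware flag
    (a toy model with a small array standing in for the IDR; the names mirror
    the kernel's, but the structure is illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define SHRINKER_MEMCG_AWARE 0x1

struct shrinker {
	unsigned int flags;
	int id;		/* only valid when SHRINKER_MEMCG_AWARE is set */
};

/* A tiny array standing in for shrinker_idr. */
static struct shrinker *shrinker_idr[4];

/* Finalize registration: only memcg-aware shrinkers own an IDR slot,
 * so only they may replace themselves into it.  Without the flag check,
 * a !memcg-aware shrinker with garbage in ->id would clobber a
 * memcg-aware one. */
static void register_shrinker_prepared(struct shrinker *s)
{
	if (s->flags & SHRINKER_MEMCG_AWARE)
		shrinker_idr[s->id] = s;	/* models idr_replace() */
}
```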

    Link: http://lkml.kernel.org/r/8ff8a793-8211-713a-4ed9-d6e52390c2fc@virtuozzo.com
    Fixes: 7e010df53c80 ("mm: use special value SHRINKER_REGISTERING instead of list_empty() check")
    Signed-off-by: Kirill Tkhai
    Reported-by:
    Cc: Andrey Ryabinin
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Shakeel Butt
    Cc:
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

18 Aug, 2018

9 commits

    The patch introduces a special value, SHRINKER_REGISTERING, used instead
    of a list_empty() check to distinguish a registering shrinker from an
    unregistered one. Why do we need that at all?

    Shrinker registration is split in two parts. The first one is
    prealloc_shrinker(), which allocates shrinker memory and reserves ID in
    shrinker_idr. This function can fail. The second is
    register_shrinker_prepared(), and it finalizes the registration. This
    function actually makes shrinker available to be used from
    shrink_slab(), and it can't fail.

    One shrinker may be based on more than one LRU list. So we never
    clear the bit in the memcg shrinker maps when (one of) the
    corresponding LRU lists becomes empty, since the other LRU lists may
    be non-empty. See the superblock shrinker for example: it is based on
    two LRU lists, s_inode_lru and s_dentry_lru. We do not want to clear
    the shrinker bit when there are no inodes in s_inode_lru, as
    s_dentry_lru may contain dentries.

    Instead, we use a special algorithm to detect shrinkers that have no
    elements on any of their LRU lists; this is done in shrink_slab_memcg().
    See the comment in that function for the details.

    Also, in shrink_slab_memcg() we clear the shrinker bit in the map when
    we meet an unregistered shrinker (the bit is set, while there is no
    shrinker in the IDR). Otherwise, we would have to do that at the
    moment of shrinker unregistration for all memcgs (which looks worse,
    since iterating over all memcgs may take much time). It would also
    have imposed restrictions on shrinker unregistration order for its
    users: they would have had to guarantee there are no new elements
    after unregister_shrinker() (otherwise, a newly added element would
    have set a bit).

    So, if we meet a set bit in map and no shrinker in IDR when we're
    iterating over the map in shrink_slab_memcg(), this means the
    corresponding shrinker is unregistered, and we must clear the bit.

    Another case is shrinker registration. We want two things there:

    1) do_shrink_slab() can be called only for completely registered
    shrinkers;

    2) shrinker internal lists may be populated in any order relative to
    register_shrinker_prepared() (let's use the sb example). Both of:

    a) list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru);  [cpu0]
       memcg_set_shrinker_bit();                                [cpu0]
       ...
       register_shrinker_prepared();                            [cpu1]

    and

    b) register_shrinker_prepared();                            [cpu0]
       ...
       list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru);  [cpu1]
       memcg_set_shrinker_bit();                                [cpu1]

    are legitimate. We don't want to impose a restriction here and force
    people to use only variant (b). Nor do we want to force people to
    ensure there are no elements on the LRU lists before the shrinker is
    completely registered. Internal users of LRU lists and the shrinker
    code are two different subsystems, and they should be self-contained
    with respect to each other.

    In case (a) the bit is set before the shrinker is completely
    registered. We don't want do_shrink_slab() to be called at that
    moment, so we have to detect such registering shrinkers.

    Before this patch, a list_empty() check (shrinker not linked to the
    list) was used for that. So in (a) the bit could be set, but we don't
    call do_shrink_slab() unless the shrinker is linked to the list. It's
    just an indicator; I simply overloaded linking to the list.

    This was not the best solution, since it's better not to touch the
    shrinker memory from shrink_slab_memcg() before it's completely
    registered (this will also be useful in the future for making
    shrink_slab() completely lockless).

    So, this patch introduces a better way to detect a registering
    shrinker, one which does not require dereferencing the shrinker
    memory. It's simply a ~0UL value, which we insert into the IDR during
    ID allocation. After the shrinker is ready to be used, we insert the
    actual shrinker pointer into the IDR, and it becomes available to
    shrink_slab_memcg().

    We can't use NULL instead of this new value for this purpose:
    shrink_slab_memcg() already uses NULL to detect unregistered
    shrinkers, and we don't want the function to see NULL and clear the
    bit, otherwise case (a) won't work.
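    The three slot states can be modelled with a sentinel pointer (a sketch;
    in the kernel the sentinel is a ~0UL value cast to a shrinker pointer,
    and the classification happens inline in shrink_slab_memcg()):

```c
#include <assert.h>
#include <stddef.h>

struct shrinker { int dummy; };

/* Sentinel stored in the IDR slot between ID allocation and the end of
 * registration. */
#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)

enum slot_state { SLOT_REGISTERING, SLOT_UNREGISTERED, SLOT_READY };

/* What shrink_slab_memcg() must do per slot state:
 *  - sentinel: still registering, skip and keep the map bit set
 *  - NULL:     unregistered, clear the stale map bit
 *  - real ptr: fully registered, safe to call do_shrink_slab() */
static enum slot_state classify_slot(struct shrinker *s)
{
	if (s == SHRINKER_REGISTERING)
		return SLOT_REGISTERING;
	if (s == NULL)
		return SLOT_UNREGISTERED;
	return SLOT_READY;
}
```

    Note that the sentinel is never dereferenced; classification is by
    pointer comparison only, which is what makes the scheme safe before
    registration completes.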

    That is all this patch does: it provides a better way to detect a
    registering shrinker. Nothing else.

    This also gives better assembly, but that's a minor side of the patch:

    Before:
    callq
    mov %rax,%r15
    test %rax,%rax
    je
    mov 0x20(%rax),%rax
    lea 0x20(%r15),%rdx
    cmp %rax,%rdx
    je
    mov 0x8(%rsp),%edx
    mov %r15,%rsi
    lea 0x10(%rsp),%rdi
    callq

    After:
    callq
    mov %rax,%r15
    lea -0x1(%rax),%rax
    cmp $0xfffffffffffffffd,%rax
    ja
    mov 0x8(%rsp),%edx
    mov %r15,%rsi
    lea 0x10(%rsp),%rdi
    callq ffffffff810cefd0

    [ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
    Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
    Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Andrew Morton
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: "Huang, Ying"
    Cc: Tetsuo Handa
    Cc: Matthew Wilcox
    Cc: Shakeel Butt
    Cc: Josef Bacik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • In case of shrink_slab_memcg(), we do not zero nid when the shrinker
    is not NUMA-aware. This is not a real problem, since currently all
    memcg-aware shrinkers are NUMA-aware too (we have two: the
    super_block shrinker and the workingset shrinker), but something may
    change in the future.

    Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Andrew Morton
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: "Huang, Ying"
    Cc: Tetsuo Handa
    Cc: Matthew Wilcox
    Cc: Shakeel Butt
    Cc: Josef Bacik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • To avoid further unneeded calls of do_shrink_slab() for shrinkers
    which no longer have any charged objects in a memcg, their bits have
    to be cleared.

    This patch introduces a lockless mechanism to do that without racing
    with a parallel list_lru add. After do_shrink_slab() returns
    SHRINK_EMPTY the first time, we clear the bit and call it once again.
    Then we restore the bit if the new return value is different.

    Note that the single smp_mb__after_atomic() in shrink_slab_memcg()
    covers two situations:

    1) list_lru_add()          shrink_slab_memcg()
         list_add_tail()         for_each_set_bit() <--- read bit
                                   do_shrink_slab() <--- missed list update (no barrier)
         <MB>                      <MB>
         set_bit()                 do_shrink_slab() <--- seen list update

    This situation, when the first do_shrink_slab() sees the set bit but
    misses the list update, is rare. So we do not add a <MB> before the
    first call of do_shrink_slab(), to avoid slowing down the generic
    case. Also, the second call is needed, as seen below in (2).

    2) list_lru_add()          shrink_slab_memcg()
         list_add_tail()         ...
         set_bit()               ...
       ...                       for_each_set_bit()
       do_shrink_slab()            do_shrink_slab()
         clear_bit()               ...
       ...                         ...
       list_lru_add()              ...
         list_add_tail()           clear_bit()
         <MB>                      <MB>
         set_bit()                 do_shrink_slab()

    The barriers guarantee that the second do_shrink_slab() in the
    right-side task sees the list update if it really cleared the bit.
    This case is drawn in the code comment.
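
    The clear-and-recheck protocol can be modeled with a minimal,
    single-threaded C sketch (the barriers themselves are elided, and the
    names are hypothetical stand-ins, not kernel APIs):

    ```c
    #include <assert.h>

    /* Single-threaded model; barriers and atomics are elided. */
    #define SHRINK_EMPTY ((unsigned long)-2)

    struct memcg_shrinker {
        unsigned long nr_objects;   /* stand-in for the list_lru contents */
        unsigned long bit;          /* stand-in for the per-memcg map bit */
    };

    /* A do_shrink_slab() stand-in: reclaim everything or report empty. */
    static unsigned long do_shrink(struct memcg_shrinker *s)
    {
        unsigned long nr = s->nr_objects;

        if (!nr)
            return SHRINK_EMPTY;
        s->nr_objects = 0;
        return nr;
    }

    /* One shrink_slab_memcg() step for a set bit. */
    static unsigned long shrink_one(struct memcg_shrinker *s)
    {
        unsigned long ret = do_shrink(s);

        if (ret == SHRINK_EMPTY) {
            /*
             * Clear the bit and call once again; in the kernel an
             * smp_mb__after_atomic() sits between the two, pairing
             * with the barrier before set_bit() in list_lru_add().
             */
            s->bit = 0;
            ret = do_shrink(s);
            if (ret != SHRINK_EMPTY)
                s->bit = 1;   /* objects appeared concurrently: restore */
            else
                ret = 0;
        }
        return ret;
    }
    ```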

    [Results/performance of the patchset]

    With the whole patchset applied, the test below shows a significant
    increase in performance:

    $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
    $mkdir /sys/fs/cgroup/memory/ct
    $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
    $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
    mkdir -p s/$i; mount -t tmpfs $i s/$i;
    touch s/$i/file; done

    Then, 5 sequential calls of drop caches:

    $time echo 3 > /proc/sys/vm/drop_caches

    1) Before:
    0.00user 13.78system 0:13.78elapsed 99%CPU
    0.00user 5.59system 0:05.60elapsed 99%CPU
    0.00user 5.48system 0:05.48elapsed 99%CPU
    0.00user 8.35system 0:08.35elapsed 99%CPU
    0.00user 8.34system 0:08.35elapsed 99%CPU

    2) After:
    0.00user 1.10system 0:01.10elapsed 99%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU

    The results show that performance increases by at least 548 times.

    Shakeel Butt tested this patchset with fork-bomb on his configuration:

    > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
    > file containing few KiBs on corresponding mount. Then in a separate
    > memcg of 200 MiB limit ran a fork-bomb.
    >
    > I ran the "perf record -ag -- sleep 60" and below are the results:
    >
    > Without the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
    > + 36.40% fb.sh [kernel.kallsyms] [k] shrink_slab
    > + 18.97% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    > + 6.75% fb.sh [kernel.kallsyms] [k] super_cache_count
    > + 0.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 0.44% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 0.27% fb.sh [kernel.kallsyms] [k] up_read
    > + 0.21% fb.sh [kernel.kallsyms] [k] osq_lock
    > + 0.13% fb.sh [kernel.kallsyms] [k] shmem_unused_huge_count
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node
    >
    > With the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
    > + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 30.72% fb.sh [kernel.kallsyms] [k] up_read
    > + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
    > + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    > + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    > + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
    > + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
    > + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
    > + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • We need to distinguish the situations when a shrinker has a very
    small number of objects (see vfs_pressure_ratio() called from
    super_cache_count()) and when it has no objects at all. Currently,
    in both of these cases, shrinker::count_objects() returns 0.

    The patch introduces a new SHRINK_EMPTY return value, which will be
    used for the "no objects at all" case. It's mostly a refactoring, as
    SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in
    this patch; all the magic will happen in further patches.

    Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • The patch makes shrink_slab() be called for root_mem_cgroup in the
    same way as it's called for the rest of the cgroups. This simplifies
    the logic and improves the readability.

    [ktkhai@virtuozzo.com: wrote changelog]
    Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomain
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Kirill Tkhai
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Using the preparations made in the previous patches, in case of
    memcg shrink we may skip shrinkers whose bits are not set in the
    memcg's shrinker bitmap. To do that, we separate iterations over
    memcg-aware and !memcg-aware shrinkers, and memcg-aware shrinkers are
    chosen via for_each_set_bit() from the bitmap. In case of big nodes
    with many isolated environments, this gives significant performance
    growth. See the next patches for the details.

    Note that the patch does not yet handle empty memcg shrinkers, since
    we never clear a bitmap bit after it is set once. Their shrinkers
    will be called again, with no shrunk objects as a result. That
    functionality is provided by the next patches.
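
    The selection loop can be illustrated with a small userspace sketch.
    for_each_set_bit_sketch() here is a hypothetical single-word
    re-implementation, not the kernel macro:

    ```c
    #include <assert.h>

    /* Find the next set bit in a single word at or after 'from'. */
    static int next_set_bit(unsigned long map, int from, int nbits)
    {
        for (int i = from; i < nbits; i++)
            if (map & (1UL << i))
                return i;
        return nbits;
    }

    /* Single-word re-implementation of the kernel's iteration idiom. */
    #define for_each_set_bit_sketch(i, map, nbits)            \
        for ((i) = next_set_bit((map), 0, (nbits));           \
             (i) < (nbits);                                   \
             (i) = next_set_bit((map), (i) + 1, (nbits)))

    /* Count how many shrinkers a memcg pass would actually invoke. */
    static int count_calls(unsigned long memcg_map, int nr_ids)
    {
        int i, calls = 0;

        for_each_set_bit_sketch(i, memcg_map, nr_ids)
            calls++;    /* kernel: do_shrink_slab() for shrinker id i */
        return calls;
    }
    ```

    Shrinkers with clear bits are never visited, which is the whole
    performance win for memcgs that charged only a few of them.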

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Imagine a big node with many cpus, memory cgroups and containers.
    Suppose we have 200 containers; every container has 10 mounts and 10
    cgroups. Container tasks don't touch other containers' mounts. If
    there is intensive page writing and global reclaim happens, a writing
    task has to iterate over all memcgs to shrink slab before it's able
    to get to shrink_page_list().

    Iteration over all the memcg slabs is very expensive: the task has to
    visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
    2000 memcgs, the total number of calls is 2000 * 2000 = 4,000,000.

    So, the shrinker makes 4 million do_shrink_slab() calls just to try
    to isolate SWAP_CLUSTER_MAX pages in one of the actively writing
    memcgs via shrink_page_list(). I've observed a node spending almost
    100% of its time in the kernel, making useless iterations over
    already-shrunk slabs.

    This patch adds a bitmap of memcg-aware shrinkers to the memcg. The
    size of the bitmap depends on bitmap_nr_ids, and during a memcg's
    lifetime it is kept large enough to fit bitmap_nr_ids shrinkers.
    Every bit in the map corresponds to a shrinker id.

    Next patches will keep a bit set only for memcgs that are really
    charged. This will allow shrink_slab() to increase its performance
    significantly. See the last patch for the numbers.
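
    A minimal userspace model of such a growable per-memcg map might look
    like this (the names and the byte-granular layout are assumptions,
    not the kernel implementation):

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical byte-granular map grown to fit newly allocated ids. */
    struct shrinker_map {
        int nbits;
        unsigned char *bits;
    };

    /* Grow the map; already-set bits must survive the reallocation. */
    static int map_expand(struct shrinker_map *m, int new_nbits)
    {
        int old_bytes = (m->nbits + 7) / 8;
        int new_bytes = (new_nbits + 7) / 8;
        unsigned char *p = realloc(m->bits, new_bytes);

        if (!p)
            return -1;
        memset(p + old_bytes, 0, new_bytes - old_bytes); /* new ids cleared */
        m->bits = p;
        m->nbits = new_nbits;
        return 0;
    }

    static void map_set(struct shrinker_map *m, int id)
    {
        m->bits[id / 8] |= 1u << (id % 8);
    }

    static int map_test(const struct shrinker_map *m, int id)
    {
        return (m->bits[id / 8] >> (id % 8)) & 1;
    }
    ```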

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
    [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
    Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
    Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Introduce a shrinker::id number, which is used to enumerate
    memcg-aware shrinkers. The numbers start from 0, and the code tries
    to keep them as small as possible.

    This will be used to represent memcg-aware shrinkers in the memcg
    shrinkers map.

    Since all memcg-aware shrinkers are based on list_lru, which is
    per-memcg only in case of CONFIG_MEMCG_KMEM, the new functionality
    will be under this config option.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Use smaller scan_control fields for order, priority, and reclaim_idx.
    Convert fields from int => s8. All easily fit within a byte:

    - allocation order range: 0..MAX_ORDER(64?)
    - priority range: 0..12(DEF_PRIORITY)
    - reclaim_idx range: 0..6(__MAX_NR_ZONES)

    Since 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64 stack
    overflows are not an issue. But it's inefficient to use ints.

    Use s8 (signed byte) rather than u8 to allow for loops like:

        do {
                ...
        } while (--sc.priority >= 0);

    Add BUILD_BUG_ON to verify that s8 is capable of storing max values.

    This reduces sizeof(struct scan_control):
    - 96 => 80 bytes (x86_64)
    - 68 => 56 bytes (i386)

    scan_control structure field order is changed to utilize padding. After
    this patch there is 1 bit of scan_control padding.

    akpm: makes my vmscan.o's .text 572 bytes smaller as well.
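
    The idea can be checked in a freestanding userspace sketch; the
    _Static_assert lines play the role of the BUILD_BUG_ON checks, and
    the struct and constants below are illustrative stand-ins taken from
    the ranges quoted above, not the kernel's scan_control:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Illustrative constants from the ranges quoted above. */
    #define MAX_ORDER_SKETCH 64
    #define DEF_PRIORITY     12
    #define MAX_NR_ZONES      6

    struct scan_control_sketch {
        int8_t order;        /* was int */
        int8_t priority;     /* was int */
        int8_t reclaim_idx;  /* was int */
    };

    /* Compile-time checks playing the role of the BUILD_BUG_ONs. */
    _Static_assert(MAX_ORDER_SKETCH <= INT8_MAX, "order must fit in s8");
    _Static_assert(DEF_PRIORITY <= INT8_MAX, "priority must fit in s8");
    _Static_assert(MAX_NR_ZONES <= INT8_MAX, "reclaim_idx must fit in s8");

    /* Signedness is what lets the countdown loop terminate at -1. */
    static int count_priority_rounds(void)
    {
        struct scan_control_sketch sc = { .priority = DEF_PRIORITY };
        int rounds = 0;

        do {
            rounds++;
        } while (--sc.priority >= 0);
        return rounds;
    }
    ```

    An unsigned u8 would wrap to 255 instead of going negative, and the
    loop would never terminate; that is why s8 is chosen.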

    Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Suggested-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

08 Jun, 2018

2 commits

  • Memory controller implements the memory.low best-effort memory
    protection mechanism, which works perfectly in many cases and allows
    protecting working sets of important workloads from sudden reclaim.

    But its semantics have a significant limitation: it works only as
    long as there is a supply of reclaimable memory. This makes it
    pretty useless against any sort of slow memory leak or memory usage
    increase. This is especially true for swapless systems. If swap is
    enabled, memory soft protection effectively postpones the problem,
    allowing a leaking application to fill the whole swap area, which
    makes no sense. The only effective way to guarantee memory
    protection in this case is to invoke the OOM killer.

    It's possible to handle this case in userspace by reacting on MEMCG_LOW
    events; but there is still a place for a fail-safe in-kernel mechanism
    to provide stronger guarantees.

    This patch introduces the memory.min interface for cgroup v2 memory
    controller. It works very similarly to memory.low (sharing the same
    hierarchical behavior), except that it's not disabled if there is no
    more reclaimable memory in the system.

    If cgroup is not populated, its memory.min is ignored, because otherwise
    even the OOM killer wouldn't be able to reclaim the protected memory,
    and the system can stall.
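
    The hierarchical capping can be illustrated with a deliberately
    simplified sketch: a cgroup's effective protection is capped by every
    ancestor's memory.min on the path. Note that the real kernel logic
    also distributes a parent's protection proportionally among competing
    children; that part is omitted here:

    ```c
    #include <assert.h>

    /*
     * Deliberately simplified model: min_along_path[0] is the root-most
     * ancestor's memory.min, the last element is the cgroup's own value.
     * The effective protection is capped by every ancestor on the path.
     */
    static unsigned long effective_min(const unsigned long *min_along_path,
                                       int depth)
    {
        unsigned long eff = min_along_path[depth - 1]; /* own memory.min */

        for (int i = 0; i < depth - 1; i++)
            if (min_along_path[i] < eff)
                eff = min_along_path[i];
        return eff;
    }
    ```

    So a child can never claim more protection than any of its ancestors
    grants, matching the hierarchical behavior shared with memory.low.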

    [guro@fb.com: s/low/min/ in docs]
    Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
    Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Randy Dunlap
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • While revisiting my Btrfs swapfile series [1], I introduced a situation
    in which reclaim would lock i_rwsem, and even though the swapon() path
    clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
    complaints from lockdep. It turns out that the rework of the fs_reclaim
    annotation was broken: if the current task has PF_MEMALLOC set, we don't
    acquire the dummy fs_reclaim lock, but when reclaiming we always check
    this _after_ we've just set the PF_MEMALLOC flag. In most cases, we
    can fix this by moving the fs_reclaim_{acquire,release}() outside of
    the memalloc_noreclaim_{save,restore}(), although kswapd is slightly
    different. After applying this, I got the expected lockdep splats.

    1: https://lwn.net/Articles/625412/

    Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
    Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Tetsuo Handa
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     

03 Jun, 2018

1 commit

  • George Boole would have noticed a slight error in 4.16 commit
    69d763fc6d3a ("mm: pin address_space before dereferencing it while
    isolating an LRU page"). Fix it, to match both the comment above it,
    and the original behaviour.

    Although anonymous pages are not marked PageDirty at first, we have an
    old habit of calling SetPageDirty when a page is removed from swap
    cache: so there's a category of ex-swap pages that are easily
    migratable, but were inadvertently excluded from compaction's async
    migration in 4.16.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils
    Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page")
    Signed-off-by: Hugh Dickins
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Reported-by: Ivan Kalvachev
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Apr, 2018

1 commit

  • syzbot is catching so many bugs triggered by commit 9ee332d99e4d5a97
    ("sget(): handle failures of register_shrinker()"). That commit expected
    that calling kill_sb() from deactivate_locked_super() without successful
    fill_super() is safe, but the reality was different; some callers assign
    attributes which are needed for kill_sb() after sget() succeeds.

    For example, [1] is a report where sb->s_mode (which seems to be either
    FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not
    assigned unless sget() succeeds. But it is not worth complicating
    sget() so that the register_shrinker() failure path can safely call
    kill_block_super() via kill_sb(). Making alloc_super() fail if memory
    allocation for register_shrinker() failed is much simpler. Let's avoid
    calling deactivate_locked_super() from sget_userns() by preallocating
    memory for the shrinker and making register_shrinker() in sget_userns()
    never fail.

    [1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Cc: Al Viro
    Cc: Michal Hocko
    Signed-off-by: Al Viro

    Tetsuo Handa
     

12 Apr, 2018

7 commits

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") added per-cpu drift to all memory cgroup stats
    and events shown in memory.stat and memory.events.

    For memory.stat this is acceptable. But memory.events issues file
    notifications, and somebody polling the file for changes will be
    confused when the counters in it are unchanged after a wakeup.

    Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
    MEMCG_OOM - are sufficiently rare and high-level that we don't need
    per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
    frequent, but they're counting invocations of reclaim, which is a
    complex operation that touches many shared cachelines.

    This splits memory.events from the generic VM events and tracks them in
    their own, unbuffered atomic counters. That's also cleaner, as it
    eliminates the ugly enum nesting of VM and cgroup events.

    [hannes@cmpxchg.org: "array subscript is above array bounds"]
    Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
    Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
    parameters! Seven of them are from the reclaim_stat structure. This
    structure is currently local to mm/vmscan.c. By moving it to the global
    vmstat.h header, we can also reference it from the vmscan tracepoints.
    In moving it, it brings down the overhead of passing so many arguments
    to the trace event. In the future, we may limit the number of arguments
    that a trace event may pass (ideally just 6, but more realistically it
    may be 8).

    Before this patch, the code to call the trace event is this:

    0f 83 aa fe ff ff jae ffffffff811e6261
    48 8b 45 a0 mov -0x60(%rbp),%rax
    45 8b 64 24 20 mov 0x20(%r12),%r12d
    44 8b 6d d4 mov -0x2c(%rbp),%r13d
    8b 4d d0 mov -0x30(%rbp),%ecx
    44 8b 75 cc mov -0x34(%rbp),%r14d
    44 8b 7d c8 mov -0x38(%rbp),%r15d
    48 89 45 90 mov %rax,-0x70(%rbp)
    8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
    8b 55 c0 mov -0x40(%rbp),%edx
    8b 7d c4 mov -0x3c(%rbp),%edi
    8b 75 b8 mov -0x48(%rbp),%esi
    89 45 80 mov %eax,-0x80(%rbp)
    65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0
    48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68
    48 85 c0 test %rax,%rax
    74 72 je ffffffff811e646a
    48 89 c3 mov %rax,%rbx
    4c 8b 10 mov (%rax),%r10
    89 f8 mov %edi,%eax
    48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
    89 f0 mov %esi,%eax
    48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
    89 c8 mov %ecx,%eax
    48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
    89 d0 mov %edx,%eax
    48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
    8b 45 8c mov -0x74(%rbp),%eax
    48 8b 7b 08 mov 0x8(%rbx),%rdi
    48 83 c3 18 add $0x18,%rbx
    50 push %rax
    41 54 push %r12
    41 55 push %r13
    ff b5 78 ff ff ff pushq -0x88(%rbp)
    41 56 push %r14
    41 57 push %r15
    ff b5 70 ff ff ff pushq -0x90(%rbp)
    4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
    4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
    48 8b 4d 98 mov -0x68(%rbp),%rcx
    48 8b 55 90 mov -0x70(%rbp),%rdx
    8b 75 80 mov -0x80(%rbp),%esi
    41 ff d2 callq *%r10

    After the patch:

    0f 83 a8 fe ff ff jae ffffffff811e626d
    8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
    45 8b 64 24 20 mov 0x20(%r12),%r12d
    4c 8b 6d a0 mov -0x60(%rbp),%r13
    65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0
    4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68
    4d 85 f6 test %r14,%r14
    74 2a je ffffffff811e6411
    49 8b 06 mov (%r14),%rax
    8b 4d 8c mov -0x74(%rbp),%ecx
    49 8b 7e 08 mov 0x8(%r14),%rdi
    49 83 c6 18 add $0x18,%r14
    4c 89 ea mov %r13,%rdx
    45 89 e1 mov %r12d,%r9d
    4c 8d 45 b8 lea -0x48(%rbp),%r8
    89 de mov %ebx,%esi
    51 push %rcx
    48 8b 4d 98 mov -0x68(%rbp),%rcx
    ff d0 callq *%rax

    Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
    Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
    Signed-off-by: Steven Rostedt (VMware)
    Reported-by: Alexei Starovoitov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Andrey Ryabinin
    Cc: Alexei Starovoitov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • memcg reclaim may alter pgdat->flags based on the state of LRU lists
    in a cgroup and its children. PGDAT_WRITEBACK may force kswapd to
    sleep in congestion_wait(), PGDAT_DIRTY may force kswapd to write
    back filesystem pages. But the worst here is PGDAT_CONGESTED, since
    it may force all direct reclaims to stall in wait_iff_congested().
    Note that only kswapd has the power to clear any of these bits. This
    might just never happen if cgroup limits are configured that way. So
    all direct reclaims will stall as long as we have some congested bdi
    in the system.

    Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
    pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
    it's reasonable to leave all decisions about node state to kswapd.

    Why only kswapd? Why not allow to global direct reclaim change these
    flags? It is because currently only kswapd can clear these flags. I'm
    less worried about the case when PGDAT_CONGESTED falsely not set, and
    more worried about the case when it falsely set. If direct reclaimer
    sets PGDAT_CONGESTED, do we have guarantee that after the congestion
    problem is sorted out, kswapd will be woken up and clear the flag? It
    seems like there is no such guarantee. E.g. direct reclaimers may
    eventually balance pgdat and kswapd simply won't wake up (see
    wakeup_kswapd()).

    Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
    now loses its congestion throttling mechanism. Add per-cgroup
    congestion state and throttle cgroup2 reclaimers if the memcg is in a
    congested state.

    Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
    bits since they alter only kswapd behavior.

    The problem could be easily demonstrated by creating heavy congestion in
    one cgroup:

    echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
    mkdir -p /sys/fs/cgroup/congester
    echo 512M > /sys/fs/cgroup/congester/memory.max
    echo $$ > /sys/fs/cgroup/congester/cgroup.procs
    /* generate a lot of dirty data on slow HDD */
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
    ....
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &

    and some job in another cgroup:

    mkdir /sys/fs/cgroup/victim
    echo 128M > /sys/fs/cgroup/victim/memory.max

    # time cat /dev/sda > /dev/null
    real 10m15.054s
    user 0m0.487s
    sys 1m8.505s

    According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
    of the time sleeping there.

    With the patch, cat doesn't waste time anymore:

    # time cat /dev/sda > /dev/null
    real 5m32.911s
    user 0m0.411s
    sys 0m56.664s

    [aryabinin@virtuozzo.com: congestion state should be per-node]
    Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
    [aryabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup]
    Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Michal Hocko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • We have a separate LRU list for each memory cgroup. Memory reclaim
    iterates over the cgroups and calls shrink_inactive_list() for every
    inactive LRU list. Based on the state of a single LRU,
    shrink_inactive_list() may flag the whole node as dirty, congested or
    under writeback. This is obviously wrong and hurtful. It's
    especially hurtful when we have a possibly small congested cgroup in
    the system. Then *all* direct reclaims waste time by sleeping in
    wait_iff_congested(). And the more memcgs we have in the system, the
    longer the memory allocation stall is, because wait_iff_congested()
    is called on each LRU-list scan.

    Sum the reclaim stats across all visited LRUs on the node and flag
    the node as dirty, congested or under writeback based on that sum.
    Also call congestion_wait() and wait_iff_congested() once per pgdat
    scan, instead of once per LRU-list scan.
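
    The summing approach can be sketched as follows; the structure and
    the congestion predicate are simplified assumptions, not the kernel's
    exact heuristics:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Simplified per-LRU stats; only two counters are modeled. */
    struct reclaim_stat_sketch {
        unsigned long nr_taken;
        unsigned long nr_congested;
    };

    /* Accumulate one LRU's stats into the node-wide sum. */
    static void acc(struct reclaim_stat_sketch *sum,
                    const struct reclaim_stat_sketch *one)
    {
        sum->nr_taken += one->nr_taken;
        sum->nr_congested += one->nr_congested;
    }

    /* Judge the node once, from the totals, not from a single LRU. */
    static bool node_congested(const struct reclaim_stat_sketch *sum)
    {
        return sum->nr_taken && sum->nr_congested == sum->nr_taken;
    }
    ```

    One small congested cgroup no longer dominates the verdict: it is
    diluted by the stats of every other LRU visited on the node.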

    This only fixes the problem for the global reclaim case. Per-cgroup
    reclaim may alter global pgdat flags too, which is wrong. But that
    is a separate issue and will be addressed in the next patch.

    This change will not have any effect on systems with all workload
    concentrated in a single cgroup.

    [aryabinin@virtuozzo.com: check nr_writeback against all nr_taken, not just file]
    Link: http://lkml.kernel.org/r/20180406180254.8970-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20180323152029.11084-4-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Shakeel Butt
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Only kswapd can have non-zero nr_immediate, and current_may_throttle()
    is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
    enough to check stat.nr_immediate only.

    Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Cc: Shakeel Butt
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Update some comments that became stale since the transition from
    per-zone to per-node reclaim.

    Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Cc: Shakeel Butt
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

06 Apr, 2018

3 commits

  • Kswapd will not wake up if per-zone watermarks are not failing or if
    too many previous attempts at background reclaim have failed.

    This can be true if there is a lot of free memory available. For high-
    order allocations, kswapd is responsible for waking up kcompactd for
    background compaction. If the zone is not below its watermarks or
    reclaim has recently failed (lots of free memory, nothing left to
    reclaim), kcompactd does not get woken up.

    When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
    woken up even if kswapd will not reclaim. This allows high-order
    allocations, such as thp, to still trigger background compaction even
    when the zone has an abundance of free memory.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since we no longer use the return value of shrink_slab() for normal
    reclaim, the comment is no longer true. If some do_shrink_slab() call
    takes unexpectedly long (the root cause of the stall is currently
    unknown) while a register_shrinker()/unregister_shrinker() call is
    pending, trying to drop caches via /proc/sys/vm/drop_caches can turn
    into an infinite cond_resched() loop if many mem_cgroups are defined.
    For safety, let's not pretend forward progress.

    Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Dave Chinner
    Cc: Glauber Costa
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • When page_mapping() is called and the mapping is dereferenced in
    page_evictable() through shrink_active_list(), it is possible for the
    inode to be truncated and the embedded address_space to be freed at the
    same time. This may lead to the following race.

    CPU1                                                CPU2

    truncate(inode)                                     shrink_active_list()
      ...                                                 page_evictable(page)
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                         mapping = page_mapping(page);
                page->mapping = NULL;
              ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) <- false
    - inode now has no pages and can be freed, including the embedded
      address_space

                                                        mapping_unevictable(mapping)
                                                          test_bit(AS_UNEVICTABLE, &mapping->flags);
    - we've dereferenced mapping, which is potentially already free.

    A similar race exists between swap cache freeing and page_evictable().

    The address_space embedded in the inode and in the swap cache is freed
    only after an RCU grace period, so the races are fixed by enclosing the
    page_mapping() call and the address_space usage in
    rcu_read_lock()/rcu_read_unlock(). Comments are added in the code to
    make clear what is protected by the RCU read lock.

    Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

23 Mar, 2018

1 commit

  • Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
    pages on the LRU") added flusher invocation to shrink_inactive_list()
    when many dirty pages on the LRU are encountered.

    However, shrink_inactive_list() doesn't wake up flushers for legacy
    cgroup reclaim, so the subsequent commit bbef938429f5 ("mm: vmscan:
    remove old flusher wakeup from direct reclaim path") removed the only
    remaining source of flusher wakeup in the legacy memcg reclaim path.

    This leads to premature OOM if there are too many dirty pages in the
    cgroup:
    # mkdir /sys/fs/cgroup/memory/test
    # echo $$ > /sys/fs/cgroup/memory/test/tasks
    # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    # dd if=/dev/zero of=tmp_file bs=1M count=100
    Killed

    dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0

    Call Trace:
    dump_stack+0x46/0x65
    dump_header+0x6b/0x2ac
    oom_kill_process+0x21c/0x4a0
    out_of_memory+0x2a5/0x4b0
    mem_cgroup_out_of_memory+0x3b/0x60
    mem_cgroup_oom_synchronize+0x2ed/0x330
    pagefault_out_of_memory+0x24/0x54
    __do_page_fault+0x521/0x540
    page_fault+0x45/0x50

    Task in /test killed as a result of limit of /test
    memory: usage 51200kB, limit 51200kB, failcnt 73
    memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
    mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
    Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
    Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
    oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Wake up flushers in legacy cgroup reclaim too.

    Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
    Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
    Signed-off-by: Andrey Ryabinin
    Tested-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin