14 Aug, 2019

1 commit

  • Memcg counters for shadow nodes are broken because the memcg pointer is
    obtained incorrectly. The current code uses:
    virt_to_page(xa_node)->mem_cgroup

    Since commit 4d96ba353075 ("mm: memcg/slab: stop setting
    page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
    set for slab pages, so memcg_from_slab_page() should be used instead.

    Also I doubt that it ever worked correctly: virt_to_head_page() should
    be used instead of virt_to_page(). Otherwise objects residing on tail
    pages are not accounted, because only the head page contains a valid
    mem_cgroup pointer. That has been the case since the introduction of
    these counters by commit 68d48e6a2df5 ("mm: workingset: add vmstat counter
    for shadow nodes").
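
    A minimal sketch of the corrected lookup, assuming only the helpers
    named above (virt_to_head_page() and memcg_from_slab_page()); the
    actual fix may differ in detail:

    /* Sketch: resolve the memcg a shadow node belongs to.  Shadow nodes
     * are slab objects and may sit on tail pages, so take the head page
     * first; slab pages no longer carry page->mem_cgroup, so go through
     * the slab helper instead of dereferencing the page directly. */
    static struct mem_cgroup *shadow_node_memcg(void *xa_node)
    {
            struct page *page = virt_to_head_page(xa_node);

            return memcg_from_slab_page(page);
    }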

    Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
    Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

15 May, 2019

2 commits

  • Patch series "mm: memcontrol: memory.stat cost & correctness".

    The cgroup memory.stat file holds recursive statistics for the entire
    subtree. The current implementation does this tree walk on-demand
    whenever the file is read. This is giving us problems in production.

    1. The cost of aggregating the statistics on-demand is high. A lot of
    system service cgroups are mostly idle and their stats don't change
    between reads, yet we always have to check them. There are also always
    some lazily-dying cgroups sitting around that are pinned by a handful
    of remaining page cache; the same applies to them.

    In an application that periodically monitors memory.stat in our
    fleet, we have seen the aggregation consume up to 5% CPU time.

    2. When cgroups die and disappear from the cgroup tree, so do their
    accumulated vm events. The result is that the event counters at
    higher-level cgroups can go backwards and confuse some of our
    automation, let alone people looking at the graphs over time.

    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.

    The upward spilling is batched using the existing per-cpu cache. In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).

    This patch (of 4):

    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.

    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.

    By default, cgroups are considered recursive entities and generally we
    expect more users of the recursive counters, with the local counts being
    special cases. To reflect that in the name, add a _local suffix to the
    current implementations.

    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.
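
    To illustrate the mechanism (this is not the kernel code), a hedged
    sketch of batched upward spilling: deltas accumulate in a per-cpu-style
    cache and are only propagated up the parent chain once they exceed a
    threshold, so the recursive count stays readable in O(1).

    #define STAT_BATCH 64

    struct cgroup_stat {
            struct cgroup_stat *parent;
            long count;     /* recursive (subtree) value, O(1) to read */
            long batch;     /* stand-in for the per-cpu delta cache    */
    };

    static void mod_stat(struct cgroup_stat *c, long delta)
    {
            struct cgroup_stat *p;
            long x = c->batch + delta;

            if (x > -STAT_BATCH && x < STAT_BATCH) {
                    c->batch = x;           /* still within the batch */
                    return;
            }

            /* Flush: spill the accumulated delta to this group and every
             * ancestor, so readers never have to walk the subtree. */
            for (p = c; p; p = p->parent)
                    p->count += x;
            c->batch = 0;
    }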

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around
    lruvec_page_state() that takes bitmasks of lru indexes and aggregates the
    counts for those.

    Replace callsites where the bitmask is simple enough with direct
    lruvec_page_state() calls.

    This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so
    make that function private again, too.
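
    For a single-bit mask the replacement is mechanical; a hypothetical
    call site for illustration, where lruvec is the one for the same memcg
    and node:

    /* Before: sum one LRU list through the bitmask wrapper. */
    nr = mem_cgroup_node_nr_lru_pages(memcg, nid, BIT(LRU_ACTIVE_FILE));

    /* After: read the per-node, per-memcg counter directly. */
    nr = lruvec_page_state(lruvec, NR_ACTIVE_FILE);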

    Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

06 Mar, 2019

1 commit

  • workingset_eviction() doesn't use and never did use the @mapping
    argument. Remove it.
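
    The resulting signature change, sketched (modulo the exact return type
    in that kernel version):

    /* Before */
    void *workingset_eviction(struct address_space *mapping, struct page *page);

    /* After */
    void *workingset_eviction(struct page *page);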

    Link: http://lkml.kernel.org/r/20190228083329.31892-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

29 Dec, 2018

1 commit

  • totalram_pages and totalhigh_pages are converted into static inline
    functions.

    The main motivation was that the managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 so it seemed
    better to remove the lock and convert the variables to atomics,
    preventing potential store-to-read tearing as a bonus.
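
    A sketch of the shape of the conversion (symbol names assumed from the
    linked discussion): the raw counter becomes an atomic and the old name
    becomes an accessor.

    #include <linux/atomic.h>

    extern atomic_long_t _totalram_pages;   /* assumed internal name */

    static inline unsigned long totalram_pages(void)
    {
            return (unsigned long)atomic_long_read(&_totalram_pages);
    }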

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

27 Oct, 2018

5 commits

  • The page cache and most shrinkable slab caches hold data that has been
    read from disk, but there are some caches that only cache CPU work, such
    as the dentry and inode caches of procfs and sysfs, as well as the subset
    of radix tree nodes that track non-resident page cache.

    Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
    the shrinker's seeks setting tells the reclaim algorithm that for every
    two page cache pages scanned it should scan one slab object.

    This is a bogus setting. A virtual inode that required no IO to create is
    not twice as valuable as a page cache page; shadow cache entries with
    eviction distances beyond the size of memory aren't either.

    In most cases, the behavior in practice is still fine. Such virtual
    caches don't tend to grow and assert themselves aggressively, and usually
    get picked up before they cause problems. But there are scenarios where
    that's not true.

    Our database workloads suffer from two of those. For one, their file
    workingset is several times bigger than available memory, which has the
    kernel aggressively create shadow page cache entries for the non-resident
    parts of it. The workingset code does tell the VM that most of these are
    expendable, but the VM ends up balancing them 2:1 to cache pages as per
    the seeks setting. This is a huge waste of memory.

    These workloads also deal with tens of thousands of open files and use
    /proc for introspection, which ends up growing the proc_inode_cache to
    absurdly large sizes - again at the cost of valuable cache space, which
    isn't a reasonable trade-off, given that proc inodes can be re-created
    without involving the disk.

    This patch implements a "zero-seek" setting for shrinkers that results in
    a target ratio of 0:1 between their objects and IO-backed caches. This
    allows such virtual caches to grow when memory is available (they do
    cache/avoid CPU work after all), but effectively disables them as soon as
    IO-backed objects are under pressure.

    It then switches the shrinkers for procfs and sysfs metadata, as well as
    excess page cache shadow nodes, to the new zero-seek setting.
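
    A hypothetical registration illustrating the new setting (the callback
    names are placeholders):

    #include <linux/shrinker.h>

    static unsigned long example_count(struct shrinker *s,
                                       struct shrink_control *sc);
    static unsigned long example_scan(struct shrinker *s,
                                      struct shrink_control *sc);

    static struct shrinker example_shrinker = {
            .count_objects  = example_count,
            .scan_objects   = example_scan,
            .seeks          = 0,    /* zero-seek, vs. DEFAULT_SEEKS (2) */
    };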

    Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Domas Mituzas
    Reviewed-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Make it easier to catch bugs in the shadow node shrinker by adding a
    counter for the shadow nodes in circulation.

    [akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()]
    [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes]
    Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • No need to use the preemption-safe lruvec state function inside the
    reclaim region that has irqs disabled.

    Link: http://lkml.kernel.org/r/20181009184732.762-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Refaults happen during transitions between workingsets as well as during
    in-place thrashing. Knowing the difference between the two has a range of
    applications, including measuring the impact of memory shortage on the
    system performance, as well as the ability to smarter balance pressure
    between the filesystem cache and the swap-backed workingset.

    During workingset transitions, inactive cache refaults and pushes out
    established active cache. When that active cache isn't stale, however,
    and also ends up refaulting, that's bona fide thrashing.

    Introduce a new page flag that tells on eviction whether the page has been
    active or not in its lifetime. This bit is then stored in the shadow
    entry, to classify refaults as transitioning or thrashing.

    How many page->flags does this leave us with on 32-bit?

    20 bits are always page flags

    21 if you have an MMU

    23 with the zone bits for DMA, Normal, HighMem, Movable

    29 with the sparsemem section bits

    30 if PAE is enabled

    31 with this patch.

    So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
    that's not enough, the system can switch to discontigmem and re-gain the 6
    or 7 sparsemem section bits.
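
    A purely illustrative sketch of carrying one extra bit through the
    shadow entry; real shadow entries also encode the memcg id, the node
    and the value-entry tag, so the layout below is hypothetical:

    #include <linux/types.h>

    static unsigned long pack_shadow(unsigned long eviction, bool workingset)
    {
            return (eviction << 1) | (workingset ? 1UL : 0UL);
    }

    static void unpack_shadow(unsigned long shadow,
                              unsigned long *eviction, bool *workingset)
    {
            *workingset = shadow & 1;
            *eviction   = shadow >> 1;
    }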

    Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "psi: pressure stall information for CPU, memory, and IO", v4.

    Overview

    PSI reports the overall wallclock time in which the tasks in a system (or
    cgroup) wait for (contended) hardware resources.

    This helps users understand the resource pressure their workloads are
    under, which allows them to root-cause and fix throughput and latency
    problems caused by overcommitting, underprovisioning, or suboptimal job
    placement in a grid, as well as to anticipate major disruptions like OOM.

    Real-world applications

    We're using the data collected by PSI (and its previous incarnation,
    memdelay) quite extensively at Facebook, and with several success stories.

    One usecase is avoiding OOM hangs/livelocks. These happen because the
    OOM killer is triggered by reclaim not being able to free
    pages, but with fast flash devices there is *always* some clean and
    uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
    spend 90% of the time thrashing the cache pages of their own executables.
    There is no situation where this ever makes sense in practice. We wrote a

    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Rik van Riel
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Cc: Christopher Lameter
    Cc: Peter Enderborg
    Cc: Shakeel Butt
    Cc: Mike Galbraith
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Oct, 2018

2 commits

  • We construct an XA_STATE and use it to delete the node with
    xas_store() rather than adding a special function for this unique
    use case. Includes a test that simulates this usage for the
    test suite.
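
    A sketch of the general XA_STATE/xas_store() pattern referred to above
    (simplified; the workingset code operates on whole nodes and holds
    additional locks):

    #include <linux/fs.h>
    #include <linux/xarray.h>

    static void erase_entry(struct address_space *mapping, pgoff_t index)
    {
            XA_STATE(xas, &mapping->i_pages, index);

            xas_lock_irq(&xas);
            xas_store(&xas, NULL);  /* erasing may free or collapse nodes */
            xas_unlock_irq(&xas);
    }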

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • This is a direct replacement for struct radix_tree_node. A couple of
    struct members have changed name, so convert those. Use a #define so
    that radix tree users continue to work without change.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.
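
    A short example of the value-entry helpers (standard API, shown for
    illustration):

    #include <linux/kernel.h>
    #include <linux/xarray.h>

    static void value_entry_example(struct xarray *xa)
    {
            void *entry = xa_mk_value(42);  /* integer stored in the slot */

            xa_store(xa, 0, entry, GFP_KERNEL);

            entry = xa_load(xa, 0);
            if (xa_is_value(entry))
                    pr_info("stored value: %lu\n", xa_to_value(entry));
    }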

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

18 Aug, 2018

6 commits

  • Provide list_lru_shrink_walk_irq() and let it behave like
    list_lru_walk_one() except that it locks the spinlock with
    spin_lock_irq(). This is used by scan_shadow_nodes() because its lock
    nests within the i_pages lock, which is acquired with interrupts
    disabled. This change allows the use of proper locking primitives
    instead of a hand-crafted local_irq_disable() plus spin_lock().

    There is no EXPORT_SYMBOL provided because the current user is in-kernel
    only.

    Add list_lru_shrink_walk_irq() which acquires the spinlock with the
    proper locking primitives.
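
    The expected signature, sketched after its non-irq counterpart
    list_lru_shrink_walk():

    unsigned long list_lru_shrink_walk_irq(struct list_lru *lru,
                                           struct shrink_control *sc,
                                           list_lru_walk_cb isolate,
                                           void *cb_arg);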

    Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • We need to distinguish the situation when a shrinker has a very small
    number of objects (see vfs_pressure_ratio() called from
    super_cache_count()) from the situation when it has no objects at all.
    Currently, in both of these cases, shrinker::count_objects() returns 0.

    The patch introduces a new SHRINK_EMPTY return value, which will be used
    for the "no objects at all" case. It is mostly a refactoring: in this
    patch all callers of do_shrink_slab() simply translate SHRINK_EMPTY back
    to 0, and the real magic happens in further patches.
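
    A hypothetical count_objects() callback showing how the new value is
    meant to be used (the lru name is a placeholder):

    static unsigned long example_count(struct shrinker *shrink,
                                       struct shrink_control *sc)
    {
            unsigned long nr = list_lru_shrink_count(&example_lru, sc);

            if (!nr)
                    return SHRINK_EMPTY;    /* "no objects at all" */

            return nr;      /* small-but-nonzero stays a real count */
    }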

    Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Add a list_lru::shrinker_id field and populate it with the registered
    shrinker's id.

    This will be used by the lru code in later patches to set the correct
    bit in the memcg shrinkers map, once the first memcg-related element
    appears in the list_lru.

    Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Use prealloc_shrinker()/register_shrinker_prepared() instead of
    register_shrinker(). This will be used in the next patch.
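
    The two-step registration pattern this refers to, sketched (the
    shrinker name is a placeholder):

    static struct shrinker example_shrinker;

    int example_init(void)
    {
            int ret = prealloc_shrinker(&example_shrinker);

            if (ret)
                    return ret;

            /* ... set up the structures the shrinker will scan ... */

            register_shrinker_prepared(&example_shrinker);
            return 0;
    }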

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • shadow_lru_isolate() disables interrupts and acquires a lock. It could
    use spin_lock_irq() instead. It also uses local_irq_enable() while it
    could use spin_unlock_irq()/xa_unlock_irq().

    Use proper suffix for lock/unlock in order to enable/disable interrupts
    during release/acquire of a lock.
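
    The transformation in general form (lock name is a placeholder):

    /* Before: hand-rolled */
    local_irq_disable();
    spin_lock(&lru->lock);
    /* ... critical section ... */
    spin_unlock(&lru->lock);
    local_irq_enable();

    /* After: proper suffixed primitives */
    spin_lock_irq(&lru->lock);
    /* ... critical section ... */
    spin_unlock_irq(&lru->lock);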

    Link: http://lkml.kernel.org/r/20180622151221.28167-3-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Andrew Morton
    Cc: Vladimir Davydov
    Cc: Kirill Tkhai
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Patch series "mm: use irq locking suffix instead local_irq_disable()".

    A small series which avoids using local_irq_disable()/local_irq_enable()
    and instead does spin_lock_irq()/spin_unlock_irq(), so the interrupt
    handling stays within the context of the lock it belongs to. Patch #1 is
    a cleanup where local_irq_.*() remained after the lock was removed.

    This patch (of 2):

    In 0c7c1bed7e13 ("mm: make counting of list_lru_one::nr_items lockless")
    the

    spin_lock(&nlru->lock);

    statement was replaced with

    rcu_read_lock();

    in __list_lru_count_one(). The comment in count_shadow_nodes() says
    that the local_irq_disable() is required because the lock must be
    acquired with interrupts disabled, and spin_lock() alone does not do so.
    Since the lock has been replaced with rcu_read_lock(), the local_irq_disable()
    is no longer needed. The code path is

    list_lru_shrink_count()
    -> list_lru_count_one()
    -> __list_lru_count_one()
    -> rcu_read_lock()
    -> list_lru_from_memcg_idx()
    -> rcu_read_unlock()

    Remove the local_irq_disable() statement.

    Link: http://lkml.kernel.org/r/20180622151221.28167-2-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Andrew Morton
    Reviewed-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
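
    The new lock is taken through the xarray accessors on the renamed
    field, e.g. (sketch):

    static void with_pages_locked(struct address_space *mapping)
    {
            xa_lock_irq(&mapping->i_pages);
            /* ... modify the page cache tree ... */
            xa_unlock_irq(&mapping->i_pages);
    }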

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

16 Nov, 2017

1 commit

  • During truncation, the mapping has already been checked for shmem and
    dax so it's known that workingset_update_node is required.

    This patch avoids the checks on mapping for each page being truncated.
    In all other cases, a lookup helper is used to determine if
    workingset_update_node() needs to be called. The one danger is that the
    API is slightly harder to use as calling workingset_update_node directly
    without checking for dax or shmem mappings could lead to surprises.
    However, the API rarely needs to be used and hopefully the comment is
    enough to give people the hint.

    sparsetruncate (tiny)
                                     4.14.0-rc4             4.14.0-rc4
                                    oneirq-v1r1        pickhelper-v1r1
    Min           Time      141.00 (  0.00%)      140.00 (  0.71%)
    1st-qrtle     Time      142.00 (  0.00%)      141.00 (  0.70%)
    2nd-qrtle     Time      142.00 (  0.00%)      142.00 (  0.00%)
    3rd-qrtle     Time      143.00 (  0.00%)      143.00 (  0.00%)
    Max-90%       Time      144.00 (  0.00%)      144.00 (  0.00%)
    Max-95%       Time      147.00 (  0.00%)      145.00 (  1.36%)
    Max-99%       Time      195.00 (  0.00%)      191.00 (  2.05%)
    Max           Time      230.00 (  0.00%)      205.00 ( 10.87%)
    Amean         Time      144.37 (  0.00%)      143.82 (  0.38%)
    Stddev        Time       10.44 (  0.00%)        9.00 ( 13.74%)
    Coeff         Time        7.23 (  0.00%)        6.26 ( 13.41%)
    Best99%Amean  Time      143.72 (  0.00%)      143.34 (  0.26%)
    Best95%Amean  Time      142.37 (  0.00%)      142.00 (  0.26%)
    Best90%Amean  Time      142.19 (  0.00%)      141.85 (  0.24%)
    Best75%Amean  Time      141.92 (  0.00%)      141.58 (  0.24%)
    Best50%Amean  Time      141.69 (  0.00%)      141.31 (  0.27%)
    Best25%Amean  Time      141.38 (  0.00%)      140.97 (  0.29%)

    As you'd expect, the gain is marginal but it can be detected. The
    differences in bonnie are all within the noise which is not surprising
    given the impact on the microbenchmark.

    radix_tree_update_node_t is a callback for some radix operations that
    optionally passes in a private field. The only user of the callback is
    workingset_update_node and as it no longer requires a mapping, the
    private field is removed.

    Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Jul, 2017

1 commit

  • lruvecs are at the intersection of the NUMA node and memcg, which is the
    scope for most paging activity.

    Introduce a convenient accounting infrastructure that maintains
    statistics per node, per memcg, and the lruvec itself.

    Then convert over accounting sites for statistics that are already
    tracked in both nodes and memcgs and can be easily switched.
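
    A hedged sketch of what one update does under the new scheme: the same
    delta is applied at the node level as before, and conceptually also at
    the memcg and lruvec levels (only mod_node_page_state() and
    lruvec_pgdat() below are real helpers):

    static void mod_lruvec_state_sketch(struct lruvec *lruvec,
                                        enum node_stat_item idx, int val)
    {
            /* node-wide counter, exactly as before */
            mod_node_page_state(lruvec_pgdat(lruvec), idx, val);

            /*
             * ...plus, conceptually, the same delta applied to the owning
             * memcg's counter and to the lruvec's own counter, so the
             * statistic can be read per node, per memcg, or per lruvec.
             */
    }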

    [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
    Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
    [hannes@cmpxchg.org: don't track uncharged pages at all
    Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
    [hannes@cmpxchg.org: add missing free_percpu()]
    Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
    [linux@roeck-us.net: hexagon: fix build error caused by include file order]
    Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
    Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Guenter Roeck
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 May, 2017

3 commits

  • The memory controllers stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.

    The workloads that were measurably affected for us were hit pretty bad
    by it, with refault/majfault rates doubling and tripling during cache
    transitions, and the machines sustaining half-hour periods of 100% IO
    utilization, where they'd previously have sub-minute peaks at 60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Apr, 2017

1 commit

  • Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
    aware") enabled cgroup-awareness in the shadow node shrinker, but forgot
    to also enable cgroup-awareness in the list_lru the shadow nodes sit on.

    Consequently, all shadow nodes are sitting on a global (per-NUMA node)
    list, while the shrinker applies the limits according to the amount of
    cache in the cgroup it is shrinking. The result is excessive pressure on
    the shadow nodes from cgroups that have very little cache.

    Enable memcg-mode on the shadow node LRUs, such that per-cgroup limits
    are applied to per-cgroup lists.
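
    A hedged sketch of the kind of change involved (the exact list_lru
    init API differs across kernel versions):

    /* Before: global (per-NUMA-node) list_lru for shadow nodes. */
    ret = list_lru_init_key(&shadow_nodes, &shadow_nodes_key);

    /* After: memcg-aware, so per-cgroup limits act on per-cgroup lists. */
    ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key);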

    Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
    Link: http://lkml.kernel.org/r/20170322005320.8165-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Mar, 2017

1 commit

  • Pull IDR rewrite from Matthew Wilcox:
    "The most significant part of the following is the patch to rewrite the
    IDR & IDA to be clients of the radix tree. But there's much more,
    including an enhancement of the IDA to be significantly more space
    efficient, an IDR & IDA test suite, some improvements to the IDR API
    (and driver changes to take advantage of those improvements), several
    improvements to the radix tree test suite and RCU annotations.

    The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree
    for most of the last cycle. Coupled with the IDR test suite, I feel
    pretty confident that any remaining bugs are quite hard to hit. 0-day
    did a great job of watching my git tree and pointing out problems; as
    it hit them, I added new test-cases to be sure not to be caught the
    same way twice"

    Willy goes on to expand a bit on the IDR rewrite rationale:
    "The radix tree and the IDR use very similar data structures.

    Merging the two codebases lets us share the memory allocation pools,
    and results in a net deletion of 500 lines of code. It also opens up
    the possibility of exposing more of the features of the radix tree to
    users of the IDR (and I have some interesting patches along those
    lines waiting for 4.12)

    It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
    will shrink a fair few data structures that embed an IDR"

    * 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
    radix tree test suite: Add config option for map shift
    idr: Add missing __rcu annotations
    radix-tree: Fix __rcu annotations
    radix-tree: Add rcu_dereference and rcu_assign_pointer calls
    radix tree test suite: Run iteration tests for longer
    radix tree test suite: Fix split/join memory leaks
    radix tree test suite: Fix leaks in regression2.c
    radix tree test suite: Fix leaky tests
    radix tree test suite: Enable address sanitizer
    radix_tree_iter_resume: Fix out of bounds error
    radix-tree: Store a pointer to the root in each node
    radix-tree: Chain preallocated nodes through ->parent
    radix tree test suite: Dial down verbosity with -v
    radix tree test suite: Introduce kmalloc_verbose
    idr: Return the deleted entry from idr_remove
    radix tree test suite: Build separate binaries for some tests
    ida: Use exceptional entries for small IDAs
    ida: Move ida_bitmap to a percpu variable
    Reimplement IDR and IDA using the radix tree
    radix-tree: Add radix_tree_iter_delete
    ...

    Linus Torvalds
     

25 Feb, 2017

1 commit

  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more C files now need to #include it.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Feb, 2017

1 commit

  • lruvec_lru_size returns the full size of the LRU list while we sometimes
    need a value reduced only to eligible zones (e.g. for lowmem requests).
    inactive_list_is_low is one such user. Later patches will add more of
    them. Add a new parameter to lruvec_lru_size and allow it to filter out
    zones which are not eligible for the given context.
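
    The resulting interface, roughly (the parameter name may differ):

    unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                  int zone_idx);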

    Link: http://lkml.kernel.org/r/20170117103702.28542-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

14 Feb, 2017

1 commit


08 Jan, 2017

1 commit

  • Several people report seeing warnings about inconsistent radix tree
    nodes followed by crashes in the workingset code, which all looked like
    use-after-free access from the shadow node shrinker.

    Dave Jones managed to reproduce the issue with a debug patch applied,
    which confirmed that the radix tree shrinking indeed frees shadow nodes
    while they are still linked to the shadow LRU:

    WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
    CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
    Call Trace:
    delete_node+0x1e4/0x200
    __radix_tree_delete_node+0xd/0x10
    shadow_lru_isolate+0xe6/0x220
    __list_lru_walk_one.isra.4+0x9b/0x190
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x2e/0x40
    shrink_slab.part.44+0x23d/0x5d0
    shrink_node+0x22c/0x330
    kswapd+0x392/0x8f0

    This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the
    inlined radix_tree_shrink().

    The problem is with 14b468791fa9 ("mm: workingset: move shadow entry
    tracking to radix tree exceptional tracking"), which passes an update
    callback into the radix tree to link and unlink shadow leaf nodes when
    tree entries change, but forgot to pass the callback when reclaiming a
    shadow node.

    While the reclaimed shadow node itself is unlinked by the shrinker, its
    deletion from the tree can cause the left-most leaf node in the tree to
    be shrunk. If that happens to be a shadow node as well, we don't unlink
    it from the LRU as we should.

    Consider this tree, where the s are shadow entries:

       root->rnode
            |
       [0       n]
        |       |
     [s    ] [sssss]

    Now the shadow node shrinker reclaims the rightmost leaf node through
    the shadow node LRU:

       root->rnode
            |
       [0        ]
        |
     [s     ]

    Because the parent of the deleted node is the first level below the
    root and has only one child in the left-most slot, the intermediate
    level is shrunk and the node containing the single shadow is put in
    its place:

       root->rnode
            |
       [s        ]

    The shrinker again sees a single left-most slot in a first level node
    and thus decides to store the shadow in root->rnode directly and free
    the node - which is a leaf node on the shadow node LRU.

       root->rnode
            |
            s

    Without the update callback, the freed node remains on the shadow LRU,
    where it causes later shrinker runs to crash.

    Pass the node updater callback into __radix_tree_delete_node() in case
    the deletion causes the left-most branch in the tree to collapse too.

    Also add warnings when linked nodes are freed right away, rather than
    wait for the use-after-free when the list is scanned much later.

    Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
    Reported-by: Dave Chinner
    Reported-by: Hugh Dickins
    Reported-by: Andrea Arcangeli
    Reported-and-tested-by: Dave Jones
    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: Chris Leech
    Cc: Lee Duncan
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Dec, 2016

3 commits

  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") the size of the active file list is no longer limited to half of
    memory. Increase the shadow node limit accordingly to avoid throwing
    out shadow entries that might still result in eligible refaults.

    The exact size of the active list now depends on the overall size of the
    page cache, but converges toward taking up most of the space:

    In mm/vmscan.c::inactive_list_is_low(),

     *   total     target    max
     *   memory    ratio     inactive
     *  -------------------------------------
     *    10MB       1         5MB
     *   100MB       1        50MB
     *     1GB       3       250MB
     *    10GB      10       0.9GB
     *   100GB      31         3GB
     *     1TB     101        10GB
     *    10TB     320        32GB

    It would be possible to apply the same precise ratios when determining
    the limit for radix tree nodes containing shadow entries, but since it
    is merely an approximation of the oldest refault distances in the wild
    and the code also makes assumptions about the node population density,
    keep it simple and always target the full cache size.

    While at it, clarify the comment and the formula for memory footprint.

    Link: http://lkml.kernel.org/r/20161117214701.29000-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently, we track the shadow entries in the page cache in the upper
    bits of the radix_tree_node->count, behind the back of the radix tree
    implementation. Because the radix tree code has no awareness of them,
    we rely on random subtleties throughout the implementation (such as the
    node->count != 1 check in the shrinking code, which is meant to exclude
    multi-entry nodes but also happens to skip nodes with only one shadow
    entry, as that's accounted in the upper bits). This is error prone and
    has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't
    plant shadow entries without radix tree node").

    To remove these subtleties, this patch moves shadow entry tracking from
    the upper bits of node->count to the existing counter for exceptional
    entries. node->count goes back to being a simple counter of valid
    entries in the tree node and can be shrunk to a single byte.

    This vastly simplifies the page cache code. All accounting happens
    natively inside the radix tree implementation, and maintaining the LRU
    linkage of shadow nodes is consolidated into a single function in the
    workingset code that is called for leaf nodes affected by a change in
    the page cache tree.

    This also removes the last user of the __radix_delete_node() return
    value. Eliminate it.

    Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When the shadow page shrinker tries to reclaim a radix tree node but
    finds it in an unexpected state - it should contain no pages, and
    non-zero shadow entries - there is no need to kill the executing task or
    even the entire system. Warn about the invalid state, then leave that
    tree node be. Simply don't put it back on the shadow LRU for future
    reclaim and move on.

    Link: http://lkml.kernel.org/r/20161117191138.22769-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Dec, 2016

1 commit

  • Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
    aware") has made the workingset shadow nodes shrinker memcg aware. The
    implementation is not correct, though, because memcg_kmem_enabled() can
    be true while we are doing global reclaim, when sc->memcg is NULL. That
    is exactly what Marek has seen:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
    IP: [] mem_cgroup_node_nr_lru_pages+0x20/0x40
    PGD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 60 Comm: kswapd0 Tainted: G O 4.8.10-12.pvops.qubes.x86_64 #1
    task: ffff880011863b00 task.stack: ffff880011868000
    RIP: mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP: e02b:ffff88001186bc70 EFLAGS: 00010293
    RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002
    RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
    R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6
    R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000
    CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660
    Call Trace:
    count_shadow_nodes+0x9a/0xa0
    shrink_slab.part.42+0x119/0x3e0
    shrink_node+0x22c/0x320
    kswapd+0x32c/0x700
    kthread+0xd8/0xf0
    ret_from_fork+0x1f/0x40
    Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1
    RIP mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP
    CR2: 0000000000000400
    ---[ end trace 100494b9edbdfc4d ]---

    This patch fixes the issue by checking sc->memcg rather than
    memcg_kmem_enabled() which is sufficient because shrink_slab makes sure
    that only memcg aware shrinkers will get non-NULL memcgs and only if
    memcg_kmem_enabled is true.

    Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
    Link: http://lkml.kernel.org/r/20161201132156.21450-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Marek Marczykowski-Górecki
    Tested-by: Marek Marczykowski-Górecki
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Balbir Singh
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Oct, 2016

1 commit

  • Antonio reports the following crash when using fuse under memory pressure:

    kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: all of them
    CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
    Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
    task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
    RIP: shadow_lru_isolate+0x181/0x190
    Call Trace:
    __list_lru_walk_one.isra.3+0x8f/0x130
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x34/0x50
    shrink_slab.part.40+0x1ed/0x3d0
    shrink_zone+0x2ca/0x2e0
    kswapd+0x51e/0x990
    kthread+0xd8/0xf0
    ret_from_fork+0x3f/0x70

    which corresponds to the following sanity check in the shadow node
    tracking:

    BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

    The workingset code tracks radix tree nodes that exclusively contain
    shadow entries of evicted pages in them, and this (somewhat obscure)
    line checks whether there are real pages left that would interfere with
    reclaim of the radix tree node under memory pressure.

    While discussing ways how fuse might sneak pages into the radix tree
    past the workingset code, Miklos pointed to replace_page_cache_page(),
    and indeed there is a problem there: it properly accounts for the old
    page being removed - __delete_from_page_cache() does that - but then
    does a raw radix_tree_insert(), not accounting for the replacement
    page. Eventually the page count bits in node->count underflow while
    leaving the node incorrectly linked to the shadow node LRU.

    To address this, make sure replace_page_cache_page() uses the tracked
    page insertion code, page_cache_tree_insert(). This fixes the page
    accounting and makes sure page-containing nodes are properly unlinked
    from the shadow node LRU again.

    Also, make the sanity checks a bit less obscure by using the helpers for
    checking the number of pages and shadows in a radix tree node.

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Antonio SJ Musumeci
    Debugged-by: Miklos Szeredi
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

29 Jul, 2016

2 commits

  • Working set and refault detection is still zone-based, fix it.

    Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node. Technically, all the
    variable names should also change, but people are already familiar with
    the meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman