18 Aug, 2018

6 commits

  • Provide list_lru_shrink_walk_irq() and let it behave like
    list_lru_walk_one() except that it locks the spinlock with
    spin_lock_irq(). This is used by scan_shadow_nodes() because its lock
    nests within the i_pages lock, which is acquired with interrupts
    disabled. This change allows the use of proper locking primitives
    instead of a hand-crafted local_irq_disable() plus spin_lock().

    There is no EXPORT_SYMBOL provided because the current user is in-kernel
    only.

    Add list_lru_shrink_walk_irq() which acquires the spinlock with the
    proper locking primitives.
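
    For illustration, a minimal sketch of such an _irq walk variant,
    assuming a hypothetical walk_one_locked() helper for the body; this is
    a sketch, not a quote of the mainline implementation:

      /*
       * Hedged sketch only: an _irq flavour of the shrink walk that takes
       * the per-node lock with spin_lock_irq() instead of relying on the
       * caller doing local_irq_disable() followed by spin_lock().
       */
      unsigned long list_lru_shrink_walk_irq(struct list_lru *lru,
                                             struct shrink_control *sc,
                                             list_lru_walk_cb isolate,
                                             void *cb_arg)
      {
              struct list_lru_node *nlru = &lru->node[sc->nid];
              unsigned long isolated;

              spin_lock_irq(&nlru->lock);    /* lock and IRQ-off in one primitive */
              isolated = walk_one_locked(nlru, sc, isolate, cb_arg); /* hypothetical helper */
              spin_unlock_irq(&nlru->lock);  /* unlock and IRQ-on in one primitive */

              return isolated;
      }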

    Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • We need to distinguish between the situation where a shrinker has a
    very small number of objects (see vfs_pressure_ratio() called from
    super_cache_count()) and the situation where it has no objects at all.
    Currently, shrinker::count_objects() returns 0 in both cases.

    The patch introduces a new SHRINK_EMPTY return value, which will be
    used for the "no objects at all" case. This is mostly a refactoring:
    in this patch every caller of do_shrink_slab() converts SHRINK_EMPTY
    back to 0, and the real behavioural change comes in later patches.
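
    A hedged sketch of the idea; the shrinker callback and its
    demo_nr_cached() helper below are hypothetical, only SHRINK_EMPTY and
    the "map it back to 0 for now" behaviour come from the description
    above:

      /* hypothetical count_objects callback distinguishing "empty" from "few" */
      static unsigned long demo_count_objects(struct shrinker *shrink,
                                              struct shrink_control *sc)
      {
              unsigned long freeable = demo_nr_cached(sc->nid);  /* hypothetical */

              if (!freeable)
                      return SHRINK_EMPTY;   /* no objects at all */
              return freeable;               /* small counts may still be scaled by
                                              * vfs_pressure_ratio() in callers such
                                              * as super_cache_count() */
      }

      /* in this patch, callers of do_shrink_slab() simply map it back to 0: */
      ret = do_shrink_slab(&sc, shrinker, priority);
      if (ret == SHRINK_EMPTY)
              ret = 0;       /* no behavioural change yet */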

    Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Add a list_lru::shrinker_id field and populate it with the id of the
    registered shrinker.

    This will be used by the lru code in later patches to set the correct
    bit in the memcg shrinkers map once the first memcg-related element
    appears in the list_lru.
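
    Roughly, the idea looks like this (field placement and the init-time
    assignment are an approximation, not the exact mainline layout):

      /* sketch: a list_lru remembers which shrinker it feeds */
      struct list_lru {
              struct list_lru_node    *node;
      #ifdef CONFIG_MEMCG_KMEM
              struct list_head        list;
              int                     shrinker_id;  /* id assigned at shrinker registration */
      #endif
      };

      /* populated at init time from the owning, already registered shrinker: */
      lru->shrinker_id = shrinker ? shrinker->id : -1;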

    Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Use prealloc_shrinker()/register_shrinker_prepared() instead of
    register_shrinker(). This will be used in the next patch.
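
    The two-phase registration pattern this switches to looks roughly as
    follows, using a superblock shrinker as the example:

      /* phase 1: allocate shrinker state early, where failure is easy to unwind */
      err = prealloc_shrinker(&s->s_shrink);
      if (err)
              return err;

      /* ... finish constructing the object; once nothing can fail any more ... */

      /* phase 2: make the shrinker visible to reclaim */
      register_shrinker_prepared(&s->s_shrink);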

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • shadow_lru_isolate() disables interrupts and acquires a lock. It could
    use spin_lock_irq() instead. It also uses local_irq_enable() while it
    could use spin_unlock_irq()/xa_unlock_irq().

    Use the proper _irq suffix for lock/unlock so that interrupts are
    disabled and enabled as part of acquiring and releasing the lock.
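
    Schematically the change has this shape (a generic before/after sketch
    with a placeholder lock name, not the literal diff):

      /* before: interrupt state managed by hand around plain lock calls */
      local_irq_disable();
      spin_lock(&lru_lock);
      /* ... isolate the node ... */
      spin_unlock(&lru_lock);
      local_irq_enable();

      /* after: the _irq suffix ties the interrupt state to the lock itself */
      spin_lock_irq(&lru_lock);
      /* ... isolate the node ... */
      spin_unlock_irq(&lru_lock);    /* or xa_unlock_irq(&mapping->i_pages) */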

    Link: http://lkml.kernel.org/r/20180622151221.28167-3-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Andrew Morton
    Cc: Vladimir Davydov
    Cc: Kirill Tkhai
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Patch series "mm: use irq locking suffix instead local_irq_disable()".

    A small series which avoids using local_irq_disable()/local_irq_enable()
    and instead does spin_lock_irq()/spin_unlock_irq(), so the interrupt
    handling is tied to the lock it belongs to. Patch #1 is a cleanup where
    local_irq_.*() remained after the lock was removed.

    This patch (of 2):

    In 0c7c1bed7e13 ("mm: make counting of list_lru_one::nr_items lockless")
    the

        spin_lock(&nlru->lock);

    statement was replaced with

        rcu_read_lock();

    in __list_lru_count_one(). The comment in count_shadow_nodes() says
    that the local_irq_disable() is required because the lock must be
    acquired with interrupts disabled, which spin_lock() alone does not
    guarantee. Since the lock has been replaced with rcu_read_lock(), the
    local_irq_disable() is no longer needed. The code path is

      list_lru_shrink_count()
        -> list_lru_count_one()
          -> __list_lru_count_one()
            -> rcu_read_lock()
            -> list_lru_from_memcg_idx()
            -> rcu_read_unlock()

    Remove the local_irq_disable() statement.

    Link: http://lkml.kernel.org/r/20180622151221.28167-2-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Andrew Morton
    Reviewed-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
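
    In terms of the locking calls, the conversion is roughly (sketch):

      /* old spelling: a dedicated ->tree_lock guarding ->page_tree */
      spin_lock_irq(&mapping->tree_lock);
      radix_tree_insert(&mapping->page_tree, index, page);
      spin_unlock_irq(&mapping->tree_lock);

      /* new spelling: the xa_lock embedded in ->i_pages */
      xa_lock_irq(&mapping->i_pages);
      radix_tree_insert(&mapping->i_pages, index, page);
      xa_unlock_irq(&mapping->i_pages);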

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

16 Nov, 2017

1 commit

  • During truncation, the mapping has already been checked for shmem and
    dax so it's known that workingset_update_node is required.

    This patch avoids the checks on mapping for each page being truncated.
    In all other cases, a lookup helper is used to determine if
    workingset_update_node() needs to be called. The one danger is that the
    API is slightly harder to use as calling workingset_update_node directly
    without checking for dax or shmem mappings could lead to surprises.
    However, the API rarely needs to be used and hopefully the comment is
    enough to give people the hint.
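
    In spirit, the lookup helper is something like the sketch below; the
    helper name is illustrative rather than the exact one used in the tree:

      /*
       * Shadow entries are only maintained for regular page cache; shmem
       * and DAX mappings never need workingset_update_node(), so callers
       * outside truncation look the callback up rather than calling it
       * unconditionally.
       */
      static radix_tree_update_node_t
      lookup_workingset_update(struct address_space *mapping)
      {
              if (dax_mapping(mapping) || shmem_mapping(mapping))
                      return NULL;
              return workingset_update_node;
      }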

    sparsetruncate (tiny)
                                 4.14.0-rc4             4.14.0-rc4
                                oneirq-v1r1        pickhelper-v1r1
    Min           Time      141.00 (  0.00%)      140.00 (  0.71%)
    1st-qrtle     Time      142.00 (  0.00%)      141.00 (  0.70%)
    2nd-qrtle     Time      142.00 (  0.00%)      142.00 (  0.00%)
    3rd-qrtle     Time      143.00 (  0.00%)      143.00 (  0.00%)
    Max-90%       Time      144.00 (  0.00%)      144.00 (  0.00%)
    Max-95%       Time      147.00 (  0.00%)      145.00 (  1.36%)
    Max-99%       Time      195.00 (  0.00%)      191.00 (  2.05%)
    Max           Time      230.00 (  0.00%)      205.00 ( 10.87%)
    Amean         Time      144.37 (  0.00%)      143.82 (  0.38%)
    Stddev        Time       10.44 (  0.00%)        9.00 ( 13.74%)
    Coeff         Time        7.23 (  0.00%)        6.26 ( 13.41%)
    Best99%Amean  Time      143.72 (  0.00%)      143.34 (  0.26%)
    Best95%Amean  Time      142.37 (  0.00%)      142.00 (  0.26%)
    Best90%Amean  Time      142.19 (  0.00%)      141.85 (  0.24%)
    Best75%Amean  Time      141.92 (  0.00%)      141.58 (  0.24%)
    Best50%Amean  Time      141.69 (  0.00%)      141.31 (  0.27%)
    Best25%Amean  Time      141.38 (  0.00%)      140.97 (  0.29%)

    As you'd expect, the gain is marginal but it can be detected. The
    differences in bonnie are all within the noise which is not surprising
    given the impact on the microbenchmark.

    radix_tree_update_node_t is a callback for some radix operations that
    optionally passes in a private field. The only user of the callback is
    workingset_update_node and as it no longer requires a mapping, the
    private field is removed.

    Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
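
    In practice the change is a single machine-readable line at the top of
    each file, for example:

      /* SPDX-License-Identifier: GPL-2.0 */
      /* ... rest of the file is unchanged ... */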

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Jul, 2017

1 commit

  • lruvecs are at the intersection of the NUMA node and memcg, which is the
    scope for most paging activity.

    Introduce a convenient accounting infrastructure that maintains
    statistics per node, per memcg, and the lruvec itself.

    Then convert over accounting sites for statistics that are already
    tracked in both nodes and memcgs and can be easily switched.
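
    Usage then looks roughly like the following; the stat item is only an
    example of a counter tracked at both node and memcg level:

      /* one call keeps the node, memcg and lruvec counters in sync */
      mod_lruvec_page_state(page, NR_FILE_PAGES, 1);

      /* or, when the lruvec is already at hand: */
      mod_lruvec_state(lruvec, NR_FILE_PAGES, -1);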

    [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
    Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
    [hannes@cmpxchg.org: don't track uncharged pages at all]
    Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
    [hannes@cmpxchg.org: add missing free_percpu()]
    Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
    [linux@roeck-us.net: hexagon: fix build error caused by include file order]
    Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
    Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Guenter Roeck
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 May, 2017

3 commits

  • The memory controller's stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]
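
    A call site then reads roughly as follows; the counter indexes are
    illustrative:

      mod_memcg_state(memcg, NR_FILE_MAPPED, nr_pages);  /* was mem_cgroup_update_stat() */
      inc_memcg_page_state(page, NR_FILE_MAPPED);        /* was mem_cgroup_inc_page_stat() */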

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.

    The workloads that were measurably affected for us were hit pretty
    badly by it, with refault/majfault rates doubling and tripling during
    cache transitions, and the machines sustaining half-hour periods of
    100% IO utilization, where they'd previously have sub-minute peaks at
    60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Apr, 2017

1 commit

  • Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
    aware") enabled cgroup-awareness in the shadow node shrinker, but forgot
    to also enable cgroup-awareness in the list_lru the shadow nodes sit on.

    Consequently, all shadow nodes are sitting on a global (per-NUMA node)
    list, while the shrinker applies the limits according to the amount of
    cache in the cgroup it is shrinking. The result is excessive pressure
    on the shadow nodes from cgroups that have very little cache.

    Enable memcg-mode on the shadow node LRUs, such that per-cgroup limits
    are applied to per-cgroup lists.

    Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
    Link: http://lkml.kernel.org/r/20170322005320.8165-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Mar, 2017

1 commit

  • Pull IDR rewrite from Matthew Wilcox:
    "The most significant part of the following is the patch to rewrite the
    IDR & IDA to be clients of the radix tree. But there's much more,
    including an enhancement of the IDA to be significantly more space
    efficient, an IDR & IDA test suite, some improvements to the IDR API
    (and driver changes to take advantage of those improvements), several
    improvements to the radix tree test suite and RCU annotations.

    The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree
    for most of the last cycle. Coupled with the IDR test suite, I feel
    pretty confident that any remaining bugs are quite hard to hit. 0-day
    did a great job of watching my git tree and pointing out problems; as
    it hit them, I added new test-cases to be sure not to be caught the
    same way twice"

    Willy goes on to expand a bit on the IDR rewrite rationale:
    "The radix tree and the IDR use very similar data structures.

    Merging the two codebases lets us share the memory allocation pools,
    and results in a net deletion of 500 lines of code. It also opens up
    the possibility of exposing more of the features of the radix tree to
    users of the IDR (and I have some interesting patches along those
    lines waiting for 4.12)

    It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
    will shrink a fair few data structures that embed an IDR"

    * 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
    radix tree test suite: Add config option for map shift
    idr: Add missing __rcu annotations
    radix-tree: Fix __rcu annotations
    radix-tree: Add rcu_dereference and rcu_assign_pointer calls
    radix tree test suite: Run iteration tests for longer
    radix tree test suite: Fix split/join memory leaks
    radix tree test suite: Fix leaks in regression2.c
    radix tree test suite: Fix leaky tests
    radix tree test suite: Enable address sanitizer
    radix_tree_iter_resume: Fix out of bounds error
    radix-tree: Store a pointer to the root in each node
    radix-tree: Chain preallocated nodes through ->parent
    radix tree test suite: Dial down verbosity with -v
    radix tree test suite: Introduce kmalloc_verbose
    idr: Return the deleted entry from idr_remove
    radix tree test suite: Build separate binaries for some tests
    ida: Use exceptional entries for small IDAs
    ida: Move ida_bitmap to a percpu variable
    Reimplement IDR and IDA using the radix tree
    radix-tree: Add radix_tree_iter_delete
    ...

    Linus Torvalds
     

25 Feb, 2017

1 commit

  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more C files now need to #include it.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Feb, 2017

1 commit

  • lruvec_lru_size returns the full size of the LRU list while we sometimes
    need a value reduced only to eligible zones (e.g. for lowmem requests).
    inactive_list_is_low is one such user. Later patches will add more of
    them. Add a new parameter to lruvec_lru_size and allow it to filter
    out zones which are not eligible for the given context.
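
    With the new parameter, a caller such as inactive_list_is_low() can ask
    only about eligible zones, roughly (the argument names here are
    assumptions):

      /* only count pages from zones eligible for this reclaim context */
      inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx);
      active   = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE,   sc->reclaim_idx);

      /* passing MAX_NR_ZONES keeps the old "whole LRU" behaviour */
      total = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);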

    Link: http://lkml.kernel.org/r/20170117103702.28542-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

14 Feb, 2017

1 commit


08 Jan, 2017

1 commit

  • Several people report seeing warnings about inconsistent radix tree
    nodes followed by crashes in the workingset code, which all looked like
    use-after-free access from the shadow node shrinker.

    Dave Jones managed to reproduce the issue with a debug patch applied,
    which confirmed that the radix tree shrinking indeed frees shadow nodes
    while they are still linked to the shadow LRU:

    WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
    CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
    Call Trace:
    delete_node+0x1e4/0x200
    __radix_tree_delete_node+0xd/0x10
    shadow_lru_isolate+0xe6/0x220
    __list_lru_walk_one.isra.4+0x9b/0x190
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x2e/0x40
    shrink_slab.part.44+0x23d/0x5d0
    shrink_node+0x22c/0x330
    kswapd+0x392/0x8f0

    This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the
    inlined radix_tree_shrink().

    The problem is with 14b468791fa9 ("mm: workingset: move shadow entry
    tracking to radix tree exceptional tracking"), which passes an update
    callback into the radix tree to link and unlink shadow leaf nodes when
    tree entries change, but forgot to pass the callback when reclaiming a
    shadow node.

    While the reclaimed shadow node itself is unlinked by the shrinker, its
    deletion from the tree can cause the left-most leaf node in the tree to
    be shrunk. If that happens to be a shadow node as well, we don't unlink
    it from the LRU as we should.

    Consider this tree, where the s are shadow entries:

        root->rnode
             |
        [0       n]
         |        |
      [s    ] [sssss]

    Now the shadow node shrinker reclaims the rightmost leaf node through
    the shadow node LRU:

        root->rnode
             |
        [0        ]
         |
      [s    ]

    Because the parent of the deleted node is the first level below the
    root and has only one child in the left-most slot, the intermediate
    level is shrunk and the node containing the single shadow is put in
    its place:

        root->rnode
             |
        [s        ]

    The shrinker again sees a single left-most slot in a first level node
    and thus decides to store the shadow in root->rnode directly and free
    the node - which is a leaf node on the shadow node LRU.

        root->rnode
             |
             s

    Without the update callback, the freed node remains on the shadow LRU,
    where it causes later shrinker runs to crash.

    Pass the node updater callback into __radix_tree_delete_node() in case
    the deletion causes the left-most branch in the tree to collapse too.

    Also add warnings when linked nodes are freed right away, rather than
    wait for the use-after-free when the list is scanned much later.

    Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
    Reported-by: Dave Chinner
    Reported-by: Hugh Dickins
    Reported-by: Andrea Arcangeli
    Reported-and-tested-by: Dave Jones
    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: Chris Leech
    Cc: Lee Duncan
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Dec, 2016

3 commits

  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") the size of the active file list is no longer limited to half of
    memory. Increase the shadow node limit accordingly to avoid throwing
    out shadow entries that might still result in eligible refaults.

    The exact size of the active list now depends on the overall size of the
    page cache, but converges toward taking up most of the space:

    In mm/vmscan.c::inactive_list_is_low(),

    *   total     target    max
    *   memory    ratio     inactive
    *   -------------------------------------
    *     10MB       1         5MB
    *    100MB       1        50MB
    *      1GB       3       250MB
    *     10GB      10       0.9GB
    *    100GB      31         3GB
    *      1TB     101        10GB
    *     10TB     320        32GB

    It would be possible to apply the same precise ratios when determining
    the limit for radix tree nodes containing shadow entries, but since it
    is merely an approximation of the oldest refault distances in the wild
    and the code also makes assumptions about the node population density,
    keep it simple and always target the full cache size.

    While at it, clarify the comment and the formula for memory footprint.
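
    A rough sketch of the resulting sizing, under the assumptions spelled
    out in that comment (up to 64 entries per radix tree node, nodes about
    one eighth populated with shadow entries); the exact mainline code may
    differ:

      cache = nr_active_file + nr_inactive_file;    /* full cache size on this node */
      /* ~8 shadow entries per node on average => about cache / 8 nodes */
      max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);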

    Link: http://lkml.kernel.org/r/20161117214701.29000-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently, we track the shadow entries in the page cache in the upper
    bits of the radix_tree_node->count, behind the back of the radix tree
    implementation. Because the radix tree code has no awareness of them,
    we rely on random subtleties throughout the implementation (such as the
    node->count != 1 check in the shrinking code, which is meant to exclude
    multi-entry nodes but also happens to skip nodes with only one shadow
    entry, as that's accounted in the upper bits). This is error prone and
    has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't
    plant shadow entries without radix tree node").

    To remove these subtleties, this patch moves shadow entry tracking from
    the upper bits of node->count to the existing counter for exceptional
    entries. node->count goes back to being a simple counter of valid
    entries in the tree node and can be shrunk to a single byte.

    This vastly simplifies the page cache code. All accounting happens
    natively inside the radix tree implementation, and maintaining the LRU
    linkage of shadow nodes is consolidated into a single function in the
    workingset code that is called for leaf nodes affected by a change in
    the page cache tree.

    This also removes the last user of the __radix_delete_node() return
    value. Eliminate it.

    Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When the shadow page shrinker tries to reclaim a radix tree node but
    finds it in an unexpected state - it should contain no pages and a
    non-zero number of shadow entries - there is no need to kill the
    executing task or even the entire system. Warn about the invalid
    state, then leave that tree node be. Simply don't put it back on the
    shadow LRU for future reclaim and move on.

    Link: http://lkml.kernel.org/r/20161117191138.22769-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Dec, 2016

1 commit

  • Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
    aware") has made the workingset shadow nodes shrinker memcg aware. The
    implementation is not correct, though, because memcg_kmem_enabled()
    might become true while we are doing a global reclaim, in which case
    sc->memcg might be NULL - which is exactly what Marek has seen:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
    IP: [] mem_cgroup_node_nr_lru_pages+0x20/0x40
    PGD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 60 Comm: kswapd0 Tainted: G O 4.8.10-12.pvops.qubes.x86_64 #1
    task: ffff880011863b00 task.stack: ffff880011868000
    RIP: mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP: e02b:ffff88001186bc70 EFLAGS: 00010293
    RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002
    RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
    R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6
    R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000
    CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660
    Call Trace:
    count_shadow_nodes+0x9a/0xa0
    shrink_slab.part.42+0x119/0x3e0
    shrink_node+0x22c/0x320
    kswapd+0x32c/0x700
    kthread+0xd8/0xf0
    ret_from_fork+0x1f/0x40
    Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1
    RIP mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP
    CR2: 0000000000000400
    ---[ end trace 100494b9edbdfc4d ]---

    This patch fixes the issue by checking sc->memcg rather than
    memcg_kmem_enabled() which is sufficient because shrink_slab makes sure
    that only memcg aware shrinkers will get non-NULL memcgs and only if
    memcg_kmem_enabled is true.

    Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
    Link: http://lkml.kernel.org/r/20161201132156.21450-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Marek Marczykowski-Górecki
    Tested-by: Marek Marczykowski-Górecki
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Balbir Singh
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Oct, 2016

1 commit

  • Antonio reports the following crash when using fuse under memory pressure:

    kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: all of them
    CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
    Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
    task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
    RIP: shadow_lru_isolate+0x181/0x190
    Call Trace:
    __list_lru_walk_one.isra.3+0x8f/0x130
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x34/0x50
    shrink_slab.part.40+0x1ed/0x3d0
    shrink_zone+0x2ca/0x2e0
    kswapd+0x51e/0x990
    kthread+0xd8/0xf0
    ret_from_fork+0x3f/0x70

    which corresponds to the following sanity check in the shadow node
    tracking:

    BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

    The workingset code tracks radix tree nodes that exclusively contain
    shadow entries of evicted pages in them, and this (somewhat obscure)
    line checks whether there are real pages left that would interfere with
    reclaim of the radix tree node under memory pressure.

    While discussing ways that fuse might sneak pages into the radix tree
    past the workingset code, Miklos pointed to replace_page_cache_page(),
    and indeed there is a problem there: it properly accounts for the old
    page being removed - __delete_from_page_cache() does that - but then
    does a raw radix_tree_insert(), not accounting for the replacement
    page. Eventually the page count bits in node->count underflow while
    leaving the node incorrectly linked to the shadow node LRU.

    To address this, make sure replace_page_cache_page() uses the tracked
    page insertion code, page_cache_tree_insert(). This fixes the page
    accounting and makes sure page-containing nodes are properly unlinked
    from the shadow node LRU again.

    Also, make the sanity checks a bit less obscure by using the helpers for
    checking the number of pages and shadows in a radix tree node.

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Antonio SJ Musumeci
    Debugged-by: Miklos Szeredi
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

29 Jul, 2016

6 commits

  • Working set and refault detection are still zone-based; make them
    node-aware like the rest of the series.

    Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node. Technically, all the
    variable names should also change but people are already familiar with
    the meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate; the potential corner case is that
    highmem pages would have to be scanned and reclaimed to free lowmem
    slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patchset: "Move LRU page reclaim from zones to nodes v9"

    This series moves LRUs from the zones to the node. While this is a
    current rebase, the test results were based on mmotm as of June 23rd.
    Conceptually, this series is simple but there are a lot of details.
    Some of the broad motivations for this are;

    1. The residency of a page partially depends on what zone the page was
    allocated from. This is partially combatted by the fair zone allocation
    policy but that is a partial solution that introduces overhead in the
    page allocator paths.

    2. Currently, reclaim on node 0 behaves slightly different to node 1. For
    example, direct reclaim scans in zonelist order and reclaims even if
    the zone is over the high watermark regardless of the age of pages
    in that LRU. Kswapd on the other hand starts reclaim on the highest
    unbalanced zone. A difference in the distribution of file/anon pages,
    due to when they were allocated, can result in a difference in aging.
    While the fair zone allocation policy mitigates some of the problems
    here, the page reclaim results on a multi-zone node will always be
    different to a single-zone node.

    3. kswapd and the page allocator scan zones in the opposite order to
    avoid interfering with each other but it's sensitive to timing. In the
    ideal case this mitigates the page allocator using pages that were
    allocated very recently. When kswapd is allocating from lower zones
    this works well, but during the rebalancing of the highest zone the
    page allocator and kswapd interfere with each other. It's worse if the
    highest zone is small and difficult to balance.

    4. slab shrinkers are node-based which makes it harder to identify the exact
    relationship between slab reclaim and LRU reclaim.

    The reason we have zone-based reclaim is that we used to have
    large highmem zones in common configurations and it was necessary
    to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
    less of a concern as machines with lots of memory will (or should) use
    64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
    rare. Machines that do use highmem should have relatively lower
    highmem:lowmem ratios than the ones we worried about in the past.

    Conceptually, moving to node LRUs should be easier to understand. The
    page allocator plays fewer tricks to game reclaim and reclaim behaves
    similarly on all nodes.

    The series has been tested on a 16 core UMA machine and a 2-socket 48
    core NUMA machine. The UMA results are presented in most cases as the NUMA
    machine behaved similarly.

    pagealloc
    ---------

    This is a microbenchmark that shows the benefit of removing the fair zone
    allocation policy. It was tested up to order-4 but only orders 0 and 1
    are shown as the other orders were comparable.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%)
    Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%)
    Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%)
    Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%)
    Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%)
    Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%)
    Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%)
    Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%)
    Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%)
    Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%)
    Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%)
    Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
    Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%)
    Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%)
    Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%)
    Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%)
    Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%)
    Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%)
    Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%)
    Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%)
    Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%)
    Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%)
    Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%)
    Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%)
    Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%)
    Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%)
    Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%)

    This shows a steady improvement throughout. The primary benefit is from
    reduced system CPU usage which is obvious from the overall times;

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    User 189.19 191.80
    System 2604.45 2533.56
    Elapsed 2855.30 2786.39

    The vmstats also showed that the fair zone allocation policy was definitely
    removed as can be seen here;

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v8
    DMA32 allocs 28794729769 0
    Normal allocs 48432501431 77227309877
    Movable allocs 0 0

    tiobench on ext4
    ----------------

    tiobench is a benchmark that artificially benefits if old pages remain
    resident while new pages get reclaimed. The fair zone allocation policy
    mitigates this problem so pages age fairly. While the benchmark has
    problems, it is important that tiobench performance remains constant
    as it implies that the page aging problems that the fair zone
    allocation policy fixes are not re-introduced.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%)
    Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%)
    Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%)
    Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%)
    Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%)
    Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%)
    Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%)
    Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%)
    Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%)
    Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%)
    Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%)
    Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%)
    Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%)
    Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%)
    Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%)
    Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%)
    Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%)
    Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%)
    Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%)
    Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%)
    Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 approx-v9
    User 645.72 525.90
    System 403.85 331.75
    Elapsed 6795.36 6783.67

    This shows that the series has little or no impact on tiobench, which
    is desirable, along with a reduction in system CPU usage. It indicates
    that the fair zone allocation policy was removed in a manner that
    didn't reintroduce one class of page aging bug. There were only minor
    differences in overall reclaim activity

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Minor Faults 645838 647465
    Major Faults 573 640
    Swap Ins 0 0
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 46041453 44190646
    Normal allocs 78053072 79887245
    Movable allocs 0 0
    Allocation stalls 24 67
    Stall zone DMA 0 0
    Stall zone DMA32 0 0
    Stall zone Normal 0 2
    Stall zone HighMem 0 0
    Stall zone Movable 0 65
    Direct pages scanned 10969 30609
    Kswapd pages scanned 93375144 93492094
    Kswapd pages reclaimed 93372243 93489370
    Direct pages reclaimed 10969 30609
    Kswapd efficiency 99% 99%
    Kswapd velocity 13741.015 13781.934
    Direct efficiency 100% 100%
    Direct velocity 1.614 4.512
    Percentage direct scans 0% 0%

    kswapd activity was roughly comparable. There were differences in direct
    reclaim activity but negligible in the context of the overall workload
    (velocity of 4 pages per second with the patches applied, 1.6 pages per
    second in the baseline kernel).

    pgbench read-only large configuration on ext4
    ---------------------------------------------

    pgbench is a database benchmark that can be sensitive to page reclaim
    decisions. This also checks if removing the fair zone allocation policy
    is safe

    pgbench Transactions
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%)
    Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%)
    Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%)
    Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%)
    Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%)
    Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%)

    Negligible differences again. As with tiobench, overall reclaim activity
    was comparable.

    bonnie++ on ext4
    ----------------

    No interesting performance difference, negligible differences on reclaim
    stats.

    paralleldd on ext4
    ------------------

    This workload uses varying numbers of dd instances to read large amounts of
    data from disk.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Amean Elapsd-1 186.04 ( 0.00%) 189.41 ( -1.82%)
    Amean Elapsd-3 192.27 ( 0.00%) 191.38 ( 0.46%)
    Amean Elapsd-5 185.21 ( 0.00%) 182.75 ( 1.33%)
    Amean Elapsd-7 183.71 ( 0.00%) 182.11 ( 0.87%)
    Amean Elapsd-12 180.96 ( 0.00%) 181.58 ( -0.35%)
    Amean Elapsd-16 181.36 ( 0.00%) 183.72 ( -1.30%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    User 1548.01 1552.44
    System 8609.71 8515.08
    Elapsed 3587.10 3594.54

    There is little or no change in performance but some drop in system CPU usage.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Minor Faults 362662 367360
    Major Faults 1204 1143
    Swap Ins 22 0
    Swap Outs 2855 1029
    DMA allocs 0 0
    DMA32 allocs 31409797 28837521
    Normal allocs 46611853 49231282
    Movable allocs 0 0
    Direct pages scanned 0 0
    Kswapd pages scanned 40845270 40869088
    Kswapd pages reclaimed 40830976 40855294
    Direct pages reclaimed 0 0
    Kswapd efficiency 99% 99%
    Kswapd velocity 11386.711 11369.769
    Direct efficiency 100% 100%
    Direct velocity 0.000 0.000
    Percentage direct scans 0% 0%
    Page writes by reclaim 2855 1029
    Page writes file 0 0
    Page writes anon 2855 1029
    Page reclaim immediate 771 1628
    Sector Reads 293312636 293536360
    Sector Writes 18213568 18186480
    Page rescued immediate 0 0
    Slabs scanned 128257 132747
    Direct inode steals 181 56
    Kswapd inode steals 59 1131

    It basically shows that kswapd was active at roughly the same rate in
    both kernels. There was also comparable slab scanning activity and direct
    reclaim was avoided in both cases. There appears to be a large
    difference in the number of inodes reclaimed, but the workload has few
    active inodes so the difference is likely a timing artifact.

    stutter
    -------

    stutter simulates a simple workload. One part uses a lot of anonymous
    memory, a second measures mmap latency and a third copies a large file.
    The primary metric is checking for mmap latency.

    stutter
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Min mmap 16.6283 ( 0.00%) 13.4258 ( 19.26%)
    1st-qrtle mmap 54.7570 ( 0.00%) 34.9121 ( 36.24%)
    2nd-qrtle mmap 57.3163 ( 0.00%) 46.1147 ( 19.54%)
    3rd-qrtle mmap 58.9976 ( 0.00%) 47.1882 ( 20.02%)
    Max-90% mmap 59.7433 ( 0.00%) 47.4453 ( 20.58%)
    Max-93% mmap 60.1298 ( 0.00%) 47.6037 ( 20.83%)
    Max-95% mmap 73.4112 ( 0.00%) 82.8719 (-12.89%)
    Max-99% mmap 92.8542 ( 0.00%) 88.8870 ( 4.27%)
    Max mmap 1440.6569 ( 0.00%) 121.4201 ( 91.57%)
    Mean mmap 59.3493 ( 0.00%) 42.2991 ( 28.73%)
    Best99%Mean mmap 57.2121 ( 0.00%) 41.8207 ( 26.90%)
    Best95%Mean mmap 55.9113 ( 0.00%) 39.9620 ( 28.53%)
    Best90%Mean mmap 55.6199 ( 0.00%) 39.3124 ( 29.32%)
    Best50%Mean mmap 53.2183 ( 0.00%) 33.1307 ( 37.75%)
    Best10%Mean mmap 45.9842 ( 0.00%) 20.4040 ( 55.63%)
    Best5%Mean mmap 43.2256 ( 0.00%) 17.9654 ( 58.44%)
    Best1%Mean mmap 32.9388 ( 0.00%) 16.6875 ( 49.34%)

    This shows a number of improvements with the worst-case outlier greatly
    improved.

    Some of the vmstats are interesting

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Swap Ins 163 502
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 618719206 1381662383
    Normal allocs 891235743 564138421
    Movable allocs 0 0
    Allocation stalls 2603 1
    Direct pages scanned 216787 2
    Kswapd pages scanned 50719775 41778378
    Kswapd pages reclaimed 41541765 41777639
    Direct pages reclaimed 209159 0
    Kswapd efficiency 81% 99%
    Kswapd velocity 16859.554 14329.059
    Direct efficiency 96% 0%
    Direct velocity 72.061 0.001
    Percentage direct scans 0% 0%
    Page writes by reclaim 6215049 0
    Page writes file 6215049 0
    Page writes anon 0 0
    Page reclaim immediate 70673 90
    Sector Reads 81940800 81680456
    Sector Writes 100158984 98816036
    Page rescued immediate 0 0
    Slabs scanned 1366954 22683

    While this is not guaranteed in all cases, this particular test showed
    a large reduction in direct reclaim activity. It's also worth noting
    that no page writes were issued from reclaim context.

    This series is not without its hazards. There are at least three areas
    that I'm concerned with even though I could not reproduce any problems in
    that area.

    1. Reclaim/compaction is going to be affected because the amount of reclaim is
    no longer targeted at a specific zone. Compaction works on a per-zone basis
    so there is no guarantee that reclaiming a few THPs' worth of pages will
    have a positive impact on compaction success rates.

    2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
    are called is now different. This may or may not be a problem but if it
    is, it'll be because shrinkers are not called enough and some balancing
    is required.

    3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
    distributed between zones and the fair zone allocation policy used to do
    something very similar for anon. The distribution is now different but not
    necessarily in any way that matters but it's still worth bearing in mind.

    VM statistic counters for reclaim decisions are zone-based. If the kernel
    is to reclaim on a per-node basis then we need to track per-node
    statistics but there is no infrastructure for that. The most notable
    change is that the old node_page_state is renamed to
    sum_zone_node_page_state. The new node_page_state takes a pglist_data and
    uses per-node stats but none exist yet. There is some renaming such as
    vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
    of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical
    patch with no functional change. There is a lot of similarity between the
    node and zone helpers which is unfortunate but there was no obvious way of
    reusing the code and maintaining type safety.

    Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
    detection") added a page->mem_cgroup lookup to the cache eviction,
    refault, and activation paths, as well as locking to the activation
    path, and the vm-scalability tests showed a regression of -23%.

    While the test in question is an artificial worst-case scenario that
    doesn't occur in real workloads - reading two sparse files in parallel
    at full CPU speed just to hammer the LRU paths - there are still some
    optimizations that can be done in those paths.

    Inline the lookup functions to eliminate calls. Also, page->mem_cgroup
    doesn't need to be stabilized when counting an activation; we merely
    need to hold the RCU lock to prevent the memcg from being freed.

    This cuts down on overhead quite a bit:

    23047a96d7cfcfca           063f6715e77a7be5770d6081fe
    ----------------           --------------------------
             %stddev      %change          %stddev
                 \            |                \
    21621405 +- 0%        +11.3%       24069657 +- 2%   vm-scalability.throughput
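
    Schematically, the activation path then only needs RCU around the
    page->mem_cgroup access; the lruvec lookup helper named below is a
    stand-in for illustration, not the exact call:

      rcu_read_lock();
      memcg = READ_ONCE(page->mem_cgroup);
      if (!mem_cgroup_disabled() && !memcg)
              goto out;
      lruvec = memcg_page_lruvec(page, memcg);  /* stand-in for the real lookup */
      atomic_long_inc(&lruvec->inactive_age);   /* no page lock, no memcg lock needed */
      out:
      rcu_read_unlock();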

    [linux@roeck-us.net: drop unnecessary include file]
    [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
    Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
    Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
    Reported-by: Ye Xiaolong
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jul, 2016

1 commit

  • Commit 612e44939c3c ("mm: workingset: eviction buckets for bigmem/lowbit
    machines") added a printk without a log level. Quieten it by using
    pr_info().

    Link: http://lkml.kernel.org/r/1466982072-29836-2-git-send-email-anton@ozlabs.org
    Signed-off-by: Anton Blanchard
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

18 Mar, 2016

2 commits

  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • A page is activated on refault if the refault distance stored in the
    corresponding shadow entry is less than the number of active file pages.
    Since active file pages can't occupy more than half memory, we assume
    that the maximal effective refault distance can't be greater than half
    the number of present pages and size the shadow nodes lru list
    appropriately. Generally speaking, this assumption is correct, but it
    can result in wasting a considerable chunk of memory on stale shadow
    nodes in case the portion of file pages is small, e.g. if a workload
    mostly uses anonymous memory.

    To sort this out, we need to compute the size of the shadow nodes lru
    based not on the maximal possible size but on the current size of the
    file cache. We
    could take the size of active file lru for the maximal refault distance,
    but active lru is pretty unstable - it can shrink dramatically at
    runtime possibly disrupting workingset detection logic.

    Instead, we assume that the maximal refault distance equals half the
    total number of file cache pages. This protects us against active file
    lru size fluctuations while still being correct, because the size of
    the active lru is normally kept below the size of the inactive lru.
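
    A sketch of the resulting sizing logic in the shrinker's count
    callback (simplified; the helper names and the pages/8 ratio are
    illustrative assumptions, not the exact upstream code):

    static unsigned long count_shadow_nodes(struct shrinker *shrinker,
                                            struct shrink_control *sc)
    {
            unsigned long shadow_nodes, max_nodes, pages;

            shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);

            /* base the limit on the current file cache, not on total RAM */
            if (sc->memcg)
                    pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
                                                         LRU_ALL_FILE);
            else
                    pages = node_page_state(sc->nid, NR_ACTIVE_FILE) +
                            node_page_state(sc->nid, NR_INACTIVE_FILE);

            /*
             * Cap the number of shadow nodes in proportion to the file
             * cache; the exact ratio here is illustrative.
             */
            max_nodes = pages >> 3;

            return shadow_nodes > max_nodes ? shadow_nodes - max_nodes : 0;
    }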

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Mar, 2016

5 commits

  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.
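
    As a usage sketch (an illustrative caller; the helper and stat names
    are assumed from this era's memcg API rather than copied from the
    patch), the calling convention changes from

    memcg = lock_page_memcg(page);
    mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED);
    unlock_page_memcg(memcg);

    to

    lock_page_memcg(page);
    mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
    unlock_page_memcg(page);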

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Cache thrash detection (see a528910e12ec "mm: thrash detection-based
    file cache sizing" for details) currently only works on the system
    level, not inside cgroups. Worse, as refaults are compared to the
    global number of active cache pages, cgroups might wrongfully get all
    their refaults activated when their pages are hotter than those of
    others.

    Move the refault machinery from the zone to the lruvec, and then tag
    eviction entries with the memcg ID. This makes the thrash detection
    work correctly inside cgroups.
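
    Conceptually, the eviction and refault sides now look roughly like
    this (a sketch; pack_shadow(), mem_cgroup_id()/mem_cgroup_from_id()
    and the lruvec lookup reflect the memcg API of the time and are
    simplified):

    /* eviction: remember which memcg the page was charged to */
    memcgid = mem_cgroup_id(page_memcg(page));
    shadow  = pack_shadow(memcgid, zone, eviction);

    /* refault: resolve the ID again under RCU and compare the refault
     * distance against that memcg's own lruvec, not the global LRU */
    rcu_read_lock();
    memcg = mem_cgroup_from_id(memcgid);
    if (memcg) {
            lruvec  = mem_cgroup_zone_lruvec(zone, memcg);
            refault = atomic_long_read(&lruvec->inactive_age);
    }
    rcu_read_unlock();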

    [sergey.senozhatsky@gmail.com: do not return from workingset_activation() with locked rcu and page]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • For per-cgroup thrash detection, we need to store the memcg ID inside
    the radix tree cookie as well. However, on 32 bit that doesn't leave
    enough bits for the eviction timestamp to cover the necessary range of
    recently evicted pages. The radix tree entry would look like this:

    [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]

    12 bits cover 4096 pages, which is 16M worth of recently evicted pages.
    But refaults are actionable up to distances covering half of memory. To
    not miss refaults, we have to stretch out the range at the cost of how
    precisely we can tell when a page was evicted. This way we can shave
    off lower bits from the eviction timestamp until the necessary range is
    covered. E.g. grouping evictions into 1M buckets (256 pages) will
    stretch the longest representable refault distance to 4G.

    This patch implements eviction buckets that are automatically sized
    according to the available bits and the necessary refault range, in
    preparation for per-cgroup thrash detection.

    The maximum actionable distance is currently half of memory, but to
    support memory hotplug of up to 200% of boot-time memory, we size the
    buckets to cover double the distance. Beyond that, thrashing won't be
    detectable anymore.

    During boot, the kernel will print out the exact parameters, like so:

    [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6

    In this example, there are 12 radix entry bits available for the
    eviction timestamp, to cover a maximum distance of 2^18 pages (this is a
    1G machine). Consequently, evictions must be grouped into buckets of
    2^6 pages, or 256K.
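
    The sizing happens once at boot; a condensed sketch (fls_long() and
    totalram_pages are standard kernel symbols, the rest is simplified):

    static unsigned int bucket_order __read_mostly;

    static int __init workingset_init(void)
    {
            unsigned int timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
            unsigned int max_order = fls_long(totalram_pages - 1);

            /*
             * Cover twice the maximum actionable refault distance (half
             * of memory) to leave headroom for memory hotplug, shaving
             * low bits off the eviction timestamp as needed.
             */
            if (max_order > timestamp_bits)
                    bucket_order = max_order - timestamp_bits;

            /* printed without a log level; switched to pr_info() by the
             * 15 Jul 2016 commit above */
            printk("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
                   timestamp_bits, max_order, bucket_order);
            return 0;
    }
    module_init(workingset_init);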

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-cgroup thrash detection will need to derive a live memcg from the
    eviction cookie, and doing that inside unpack_shadow() will get nasty
    with the reference handling spread over two functions.

    In preparation, make unpack_shadow() clearly about extracting static
    data, and let workingset_refault() do all the higher-level handling.
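
    After the split, the shape is roughly the following (sketch of the
    pre-memcg code at this point in the series; stat bumping omitted):

    /* only decodes the packed fields - no lookups, no policy */
    static void unpack_shadow(void *shadow, struct zone **zonep,
                              unsigned long *evictionp);

    bool workingset_refault(void *shadow)
    {
            unsigned long refault_distance, eviction;
            struct zone *zone;

            unpack_shadow(shadow, &zone, &eviction);
            /* all higher-level handling lives here now */
            refault_distance = (atomic_long_read(&zone->inactive_age) -
                                eviction) & EVICTION_MASK;
            return refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE);
    }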

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This is a compile-time constant, no need to calculate it on refault.
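
    Concretely, the mask becomes a pair of #defines along these lines (the
    exact composition of EVICTION_SHIFT is an assumption based on how the
    shadow entry is packed at this point):

    /* low bits of the radix tree entry that do not hold the timestamp */
    #define EVICTION_SHIFT  (RADIX_TREE_EXCEPTIONAL_SHIFT + \
                             NODES_SHIFT + ZONES_SHIFT)
    #define EVICTION_MASK   (~0UL >> EVICTION_SHIFT)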

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Jan, 2016

1 commit

  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.
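
    The exclusivity hinges on a simple inode check; a sketch of a
    dax_mapping()-style helper with a purely illustrative call site (both
    the helper placement and the call site are assumptions, not a quote of
    the patch):

    static inline bool dax_mapping(struct address_space *mapping)
    {
            return mapping->host && IS_DAX(mapping->host);
    }

    /* illustrative guard: never record shadow entries for DAX mappings */
    if (dax_mapping(mapping))
            return;
    shadow = workingset_eviction(mapping, page);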

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

13 Feb, 2015

1 commit

    Currently, the isolate callback passed to the list_lru_walk family of
    functions is supposed to just delete an item from the list upon
    returning LRU_REMOVED or LRU_REMOVED_RETRY, while the nr_items counter
    is fixed up by __list_lru_walk_one after the callback returns. Since
    the callback is allowed to drop the lock after removing an item (it
    then has to return LRU_REMOVED_RETRY), nr_items can get out of sync
    with the actual number of elements on the list even if we check them
    under the lock. This makes it difficult to move items from one
    list_lru_one to another, which is required for per-memcg list_lru
    reparenting - we can't just splice the lists, we have to move entries
    one by one.

    This patch therefore introduces helpers that must be used by callback
    functions to isolate items instead of raw list_del/list_move. These are
    list_lru_isolate and list_lru_isolate_move. They not only remove the
    entry from the list, but also fix the nr_items counter, making sure
    nr_items always reflects the actual number of elements on the list if
    checked under the appropriate lock.
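
    The helpers boil down to doing the list operation and the counter
    update together, under the lru lock (sketch matching the described
    semantics):

    void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
    {
            /* delete the entry and fix nr_items in one step */
            list_del_init(item);
            list->nr_items--;
    }

    void list_lru_isolate_move(struct list_lru_one *list,
                               struct list_head *item, struct list_head *head)
    {
            /* same, but move the entry onto a caller-provided list */
            list_move(item, head);
            list->nr_items--;
    }

    A walk callback then calls one of these instead of a raw list_del() or
    list_move() before returning LRU_REMOVED or LRU_REMOVED_RETRY.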

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov