20 Feb, 2019

1 commit

  • commit a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96 upstream.

    This reverts commit 172b06c32b9497 ("mm: slowly shrink slabs with a
    relatively small number of objects").

    This change alters the aggressiveness of shrinker reclaim, causing small
    cache and low priority reclaim to greatly increase scanning pressure on
    small caches. As a result, light memory pressure has a disproportionate
    effect on small caches, and causes large caches to be reclaimed much
    faster than previously.

    As a result, it greatly perturbs the delicate balance of the VFS caches
    (dentry/inode vs file page cache) such that the inode/dentry caches are
    reclaimed much, much faster than the page cache and this drives us into
    several other caching imbalance related problems.

    As such, this is a bad change and needs to be reverted.

    [ Needs some massaging to retain the later seekless shrinker
    modifications.]

    Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
    Fixes: 172b06c32b9497 ("mm: slowly shrink slabs with a relatively small number of objects")
    Signed-off-by: Dave Chinner
    Cc: Wolfgang Walter
    Cc: Roman Gushchin
    Cc: Spock
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     

29 Dec, 2018

1 commit

  • commit 68600f623d69da428c6163275f97ca126e1a8ec5 upstream.

    I've noticed that dying memory cgroups are often pinned in memory by a
    single pagecache page. Even under moderate memory pressure they sometimes
    stayed in that state for a long time. That looked strange.

    My investigation showed that the problem is caused by applying the LRU
    pressure balancing math:

    scan = div64_u64(scan * fraction[lru], denominator),

    where

    denominator = fraction[anon] + fraction[file] + 1.

    Because fraction[lru] is always less than denominator, if the initial scan
    size is 1, the result is always 0.

    This means the last page is not scanned and has no chance of being
    reclaimed.

    Fix this by rounding up the result of the division.

    In practice this change significantly improves the speed of dying cgroups
    reclaim.

    [guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments]
    Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle
    Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

06 Oct, 2018

1 commit

  • do_shrink_slab() returns an unsigned long value, and placing it into an
    int variable cuts the high bytes off. Then we compare ret with 0xfffffffe
    (since SHRINK_EMPTY is converted to ret's type).

    Thus a large number of objects returned by do_shrink_slab() may be
    interpreted as SHRINK_EMPTY, if the low bytes of its value are equal to
    0xfffffffe. Fix that by declaring ret as unsigned long in these
    functions.

    Link: http://lkml.kernel.org/r/153813407177.17544.14888305435570723973.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reported-by: Cyrill Gorcunov
    Acked-by: Cyrill Gorcunov
    Reviewed-by: Josef Bacik
    Cc: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kirill Tkhai
     

21 Sep, 2018

1 commit

  • 9092c71bb724 ("mm: use sc->priority for slab shrink targets") changed the
    way that the target slab pressure is calculated and made it
    priority-based:

    delta = freeable >> priority;
    delta *= 4;
    do_div(delta, shrinker->seeks);

    The problem is that at the default priority (which is 12) no pressure is
    applied at all if the number of potentially reclaimable objects is less
    than 4096 (1 << 12).
    Acked-by: Rik van Riel
    Cc: Josef Bacik
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

23 Aug, 2018

2 commits

  • page_freeze_refs/page_unfreeze_refs have already been replaced by
    page_ref_freeze/page_ref_unfreeze, but the comments still refer to the
    old names.

    Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
    Signed-off-by: Jiang Biao
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Biao
     
  • There is a sad BUG introduced in the patch adding SHRINKER_REGISTERING.
    The shrinker_idr business is only for memcg-aware shrinkers. Only that
    type of shrinker has an id, and only they must be finally installed via
    idr_replace() in this function. For !memcg-aware shrinkers we never
    initialize the shrinker->id field.

    But shrinkers of all types are passed to idr_replace(), and every
    !memcg-aware shrinker with a random ID (most probably, its id is 0)
    replaces the memcg-aware shrinker pointed to by that ID in the IDR.

    This patch fixes the problem.

    Link: http://lkml.kernel.org/r/8ff8a793-8211-713a-4ed9-d6e52390c2fc@virtuozzo.com
    Fixes: 7e010df53c80 "mm: use special value SHRINKER_REGISTERING instead of list_empty() check"
    Signed-off-by: Kirill Tkhai
    Reported-by:
    Cc: Andrey Ryabinin
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Shakeel Butt
    Cc:
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

18 Aug, 2018

9 commits

  • The patch introduces a special value, SHRINKER_REGISTERING, to use
    instead of list_empty() to distinguish a registering shrinker from an
    unregistered one. Why do we need that at all?

    Shrinker registration is split in two parts. The first one is
    prealloc_shrinker(), which allocates shrinker memory and reserves ID in
    shrinker_idr. This function can fail. The second is
    register_shrinker_prepared(), and it finalizes the registration. This
    function actually makes shrinker available to be used from
    shrink_slab(), and it can't fail.

    One shrinker may be based on more than one LRU list. So we never clear
    the bit in the memcg shrinker maps when (one of) the corresponding LRU
    lists becomes empty, since the other LRU lists may be non-empty. See
    the superblock shrinker for example: it is based on two LRU lists,
    s_inode_lru and s_dentry_lru. We do not want to clear the shrinker bit
    when there are no inodes in s_inode_lru, as s_dentry_lru may contain
    dentries.

    Instead, we use a special algorithm to detect shrinkers that have no
    elements on any of their LRU lists; this is done in shrink_slab_memcg().
    See the comment in that function for the details.

    Also, in shrink_slab_memcg() we clear the shrinker bit in the map when
    we meet an unregistered shrinker (the bit is set, while there is no
    shrinker in the IDR). Otherwise, we would have to do that at the moment
    of shrinker unregistration for all memcgs (and this looks worse, since
    iterating over all memcgs may take much time). It would also have
    imposed restrictions on shrinker unregistration order for its users:
    they would have had to guarantee there are no new elements after
    unregister_shrinker() (otherwise, a newly added element would have set
    a bit).

    So, if we meet a set bit in map and no shrinker in IDR when we're
    iterating over the map in shrink_slab_memcg(), this means the
    corresponding shrinker is unregistered, and we must clear the bit.

    Another case is shrinker registration. We want two things there:

    1) do_shrink_slab() can be called only for completely registered
    shrinkers;

    2) shrinker internal lists may be populated in any order relative to
    register_shrinker_prepared() (let's use the sb example). Both of:

    a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0]
    memcg_set_shrinker_bit(); [cpu0]
    ...
    register_shrinker_prepared(); [cpu1]

    and

    b)register_shrinker_prepared(); [cpu0]
    ...
    list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1]
    memcg_set_shrinker_bit(); [cpu1]

    are legitimate. We don't want to impose a restriction here and force
    people to use only variant (b). We don't want to force people to care
    whether there are elements on the LRU lists before the shrinker is
    completely registered. Internal users of LRU lists and the shrinker
    code are two different subsystems, and they should remain independent
    of each other.

    In case (a) we have the bit set before the shrinker is completely
    registered. We don't want do_shrink_slab() to be called at this moment,
    so we have to detect such registering shrinkers.

    Before this patch, a list_empty() check (shrinker is not linked to the
    list) was used for that. So, in (a) there could be a bit set, but we
    don't call do_shrink_slab() unless the shrinker is linked to the list.
    It's just an indicator; I simply overloaded linking to the list.

    This was not the best solution, since it's better not to touch the
    shrinker memory from shrink_slab_memcg() before it's completely
    registered (this also will be useful in the future to make shrink_slab()
    completely lockless).

    So, this patch introduces a better way to detect a registering shrinker,
    one which avoids dereferencing the shrinker's memory. It's just a ~0UL
    value, which we insert into the IDR during ID allocation. After the
    shrinker is ready to be used, we insert the actual shrinker pointer into
    the IDR, and it becomes available to shrink_slab_memcg().

    We can't use NULL instead of this new value for this purpose:
    shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
    and we don't want the function to see NULL and clear the bit, otherwise
    case (a) won't work.

    This is the only thing the patch does: a better way to detect a
    registering shrinker. Nothing else.

    It also produces slightly better assembly, but that is a minor side
    effect of the patch:

    Before:
    callq
    mov %rax,%r15
    test %rax,%rax
    je
    mov 0x20(%rax),%rax
    lea 0x20(%r15),%rdx
    cmp %rax,%rdx
    je
    mov 0x8(%rsp),%edx
    mov %r15,%rsi
    lea 0x10(%rsp),%rdi
    callq

    After:
    callq
    mov %rax,%r15
    lea -0x1(%rax),%rax
    cmp $0xfffffffffffffffd,%rax
    ja
    mov 0x8(%rsp),%edx
    mov %r15,%rsi
    lea 0x10(%rsp),%rdi
    callq ffffffff810cefd0

    [ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
    Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
    Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Andrew Morton
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: "Huang, Ying"
    Cc: Tetsuo Handa
    Cc: Matthew Wilcox
    Cc: Shakeel Butt
    Cc: Josef Bacik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • In the case of shrink_slab_memcg() we do not zero nid when the shrinker
    is not NUMA-aware. This is not a real problem, since currently all
    memcg-aware shrinkers are NUMA-aware too (we have two: the super_block
    shrinker and the workingset shrinker), but something may change in the
    future.

    Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Andrew Morton
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: "Huang, Ying"
    Cc: Tetsuo Handa
    Cc: Matthew Wilcox
    Cc: Shakeel Butt
    Cc: Josef Bacik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • To avoid further unneeded calls of do_shrink_slab() for shrinkers which
    no longer have any charged objects in a memcg, their bits have to be
    cleared.

    This patch introduces a lockless mechanism to do that without races
    with parallel list_lru adds. After do_shrink_slab() returns
    SHRINK_EMPTY the first time, we clear the bit and call it once again.
    Then we restore the bit, if the new return value is different.

    Note that the single smp_mb__after_atomic() in shrink_slab_memcg()
    covers two situations:

    1)list_lru_add()     shrink_slab_memcg()
        list_add_tail()    for_each_set_bit() <--- read bit
                           do_shrink_slab()   <--- missed list update (no barrier)
        <MB>               <MB>
        set_bit()          do_shrink_slab()   <--- seen list update

    We do not add a barrier before the first call of do_shrink_slab();
    instead we tolerate the rare missed update, so as not to slow down the
    generic case. Also, the second call is needed anyway, as shown in (2)
    below.

    2)list_lru_add()      shrink_slab_memcg()
        list_add_tail()     ...
        set_bit()           ...
      ...                   for_each_set_bit()
      do_shrink_slab()      do_shrink_slab()
        clear_bit()           ...
      ...                     ...
      list_lru_add()          ...
        list_add_tail()       clear_bit()
        <MB>                  <MB>
        set_bit()             do_shrink_slab()

    The barriers guarantee that the second do_shrink_slab() in the
    right-side task sees the list update if it really cleared the bit.
    This case is drawn in the code comment.

    [Results/performance of the patchset]

    After the whole patchset is applied, the test below shows a significant
    increase in performance:

    $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
    $mkdir /sys/fs/cgroup/memory/ct
    $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
    $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
    mkdir -p s/$i; mount -t tmpfs $i s/$i;
    touch s/$i/file; done

    Then, 5 sequential calls of drop caches:

    $time echo 3 > /proc/sys/vm/drop_caches

    1)Before:
    0.00user 13.78system 0:13.78elapsed 99%CPU
    0.00user 5.59system 0:05.60elapsed 99%CPU
    0.00user 5.48system 0:05.48elapsed 99%CPU
    0.00user 8.35system 0:08.35elapsed 99%CPU
    0.00user 8.34system 0:08.35elapsed 99%CPU

    2)After
    0.00user 1.10system 0:01.10elapsed 99%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU

    The results show a performance increase of at least 548 times.

    Shakeel Butt tested this patchset with fork-bomb on his configuration:

    > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
    > file containing few KiBs on corresponding mount. Then in a separate
    > memcg of 200 MiB limit ran a fork-bomb.
    >
    > I ran the "perf record -ag -- sleep 60" and below are the results:
    >
    > Without the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
    > + 36.40% fb.sh [kernel.kallsyms] [k] shrink_slab
    > + 18.97% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    > + 6.75% fb.sh [kernel.kallsyms] [k] super_cache_count
    > + 0.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 0.44% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 0.27% fb.sh [kernel.kallsyms] [k] up_read
    > + 0.21% fb.sh [kernel.kallsyms] [k] osq_lock
    > + 0.13% fb.sh [kernel.kallsyms] [k] shmem_unused_huge_count
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node
    >
    > With the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
    > + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 30.72% fb.sh [kernel.kallsyms] [k] up_read
    > + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
    > + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    > + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    > + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
    > + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
    > + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
    > + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • We need to distinguish the situations when a shrinker has a very small
    number of objects (see vfs_pressure_ratio() called from
    super_cache_count()) and when it has no objects at all. Currently, in
    both of these cases, shrinker::count_objects() returns 0.

    The patch introduces a new SHRINK_EMPTY return value, which will be used
    for the "no objects at all" case. It is mostly a refactoring, as
    SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
    patch, and all the magic will happen in further patches.

    Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • The patch makes shrink_slab() be called for root_mem_cgroup in the same
    way as it is called for the rest of the cgroups. This simplifies the
    logic and improves readability.

    [ktkhai@virtuozzo.com: wrote changelog]
    Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomain
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Kirill Tkhai
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Using the preparations made in the previous patches, in the case of a
    memcg shrink we may skip shrinkers which are not set in the memcg's
    shrinkers bitmap. To do that, we separate the iterations over
    memcg-aware and !memcg-aware shrinkers, and memcg-aware shrinkers are
    chosen via for_each_set_bit() from the bitmap. On big nodes with many
    isolated environments, this gives significant performance growth. See
    the next patches for the details.

    Note that the patch does not handle empty memcg shrinkers, since we
    never clear a bitmap bit once it is set. Their shrinkers will be called
    again, with no objects shrunk as a result. This functionality is
    provided by the next patches.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Imagine a big node with many CPUs, memory cgroups and containers. Say
    we have 200 containers, and every container has 10 mounts and 10
    cgroups. All container tasks don't touch foreign containers' mounts.
    If there is intensive page writing and global reclaim happens, a
    writing task has to iterate over all memcgs to shrink slab before it's
    able to go to shrink_page_list().

    Iterating over all the memcg slabs is very expensive: the task has to
    visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
    2000 memcgs, the total number of calls is 2000 * 2000 = 4000000.

    So, the shrinker makes 4 million do_shrink_slab() calls just to try to
    isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs
    via shrink_page_list(). I've observed a node spending almost 100% of
    its time in the kernel, making useless iterations over already shrunk
    slabs.

    This patch adds a bitmap of memcg-aware shrinkers to the memcg. The
    size of the bitmap depends on bitmap_nr_ids, and during the memcg's
    lifetime it is maintained to be large enough to fit bitmap_nr_ids
    shrinkers. Every bit in the map corresponds to a shrinker id.

    Next patches will keep a bit set only for shrinkers with objects really
    charged to the memcg. This will allow shrink_slab() to improve its
    performance significantly. See the last patch for the numbers.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
    [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
    Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
    Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Introduce a shrinker::id number, which is used to enumerate memcg-aware
    shrinkers. The numbering starts from 0, and the code tries to keep it
    as small as possible.

    This will be used to represent memcg-aware shrinkers in the memcg
    shrinkers map.

    Since all memcg-aware shrinkers are based on list_lru, which is
    per-memcg only in the case of CONFIG_MEMCG_KMEM, the new functionality
    is under this config option.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Use smaller scan_control fields for order, priority, and reclaim_idx.
    Convert fields from int => s8. All easily fit within a byte:

    - allocation order range: 0..MAX_ORDER(64?)
    - priority range: 0..12(DEF_PRIORITY)
    - reclaim_idx range: 0..6(__MAX_NR_ZONES)

    Since 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64 stack
    overflows are not an issue. But it's inefficient to use ints.

    Use s8 (signed byte) rather than u8 to allow for loops like:

        do {
            ...
        } while (--sc.priority >= 0);

    Add BUILD_BUG_ON to verify that s8 is capable of storing max values.

    This reduces sizeof(struct scan_control):
    - 96 => 80 bytes (x86_64)
    - 68 => 56 bytes (i386)

    scan_control structure field order is changed to utilize padding. After
    this patch there is 1 bit of scan_control padding.

    akpm: makes my vmscan.o's .text 572 bytes smaller as well.

    Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Suggested-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

08 Jun, 2018

2 commits

  • Memory controller implements the memory.low best-effort memory
    protection mechanism, which works perfectly in many cases and allows
    protecting working sets of important workloads from sudden reclaim.

    But its semantics have a significant limitation: it works only as long
    as there is a supply of reclaimable memory. This makes it pretty
    useless against any sort of slow memory leak or memory usage increase.
    This is especially true for swapless systems. If swap is enabled,
    memory soft protection effectively postpones the problem, allowing a
    leaking application to fill the entire swap area, which makes no sense.
    The only effective way to guarantee memory protection in this case is
    to invoke the OOM killer.

    It's possible to handle this case in userspace by reacting on MEMCG_LOW
    events; but there is still a place for a fail-safe in-kernel mechanism
    to provide stronger guarantees.

    This patch introduces the memory.min interface for cgroup v2 memory
    controller. It works very similarly to memory.low (sharing the same
    hierarchical behavior), except that it's not disabled if there is no
    more reclaimable memory in the system.

    If a cgroup is not populated, its memory.min is ignored, because
    otherwise even the OOM killer wouldn't be able to reclaim the protected
    memory, and the system could stall.
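    A minimal usage sketch, assuming cgroup v2 is mounted at
    /sys/fs/cgroup and run as root (the group name is made up):

```shell
# Protect a workload with memory.min (cgroup v2).
mkdir /sys/fs/cgroup/protected
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/protected/memory.min  # 512M floor
echo $$ > /sys/fs/cgroup/protected/cgroup.procs
# Unlike memory.low, the floor holds even with no reclaimable memory
# left in the system; it is ignored while the cgroup is unpopulated.
```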

    [guro@fb.com: s/low/min/ in docs]
    Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
    Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Randy Dunlap
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • While revisiting my Btrfs swapfile series [1], I introduced a situation
    in which reclaim would lock i_rwsem, and even though the swapon() path
    clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
    complaints from lockdep. It turns out that the rework of the fs_reclaim
    annotation was broken: if the current task has PF_MEMALLOC set, we don't
    acquire the dummy fs_reclaim lock, but when reclaiming we always check
    this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can
    fix this by moving the fs_reclaim_{acquire,release}() outside of the
    memalloc_noreclaim_{save,restore}(), although kswapd is slightly
    different. After applying this, I got the expected lockdep splats.

    1: https://lwn.net/Articles/625412/

    Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
    Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Tetsuo Handa
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     

03 Jun, 2018

1 commit

  • George Boole would have noticed a slight error in 4.16 commit
    69d763fc6d3a ("mm: pin address_space before dereferencing it while
    isolating an LRU page"). Fix it, to match both the comment above it,
    and the original behaviour.

    Although anonymous pages are not marked PageDirty at first, we have an
    old habit of calling SetPageDirty when a page is removed from swap
    cache: so there's a category of ex-swap pages that are easily
    migratable, but were inadvertently excluded from compaction's async
    migration in 4.16.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils
    Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page")
    Signed-off-by: Hugh Dickins
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Reported-by: Ivan Kalvachev
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Apr, 2018

1 commit

  • syzbot is catching so many bugs triggered by commit 9ee332d99e4d5a97
    ("sget(): handle failures of register_shrinker()"). That commit expected
    that calling kill_sb() from deactivate_locked_super() without successful
    fill_super() is safe, but the reality was different; some callers assign
    attributes which are needed for kill_sb() after sget() succeeds.

    For example, [1] is a report where sb->s_mode (which seems to be either
    FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not
    assigned unless sget() succeeds. But it is not worth complicating sget()
    so that the register_shrinker() failure path can safely call
    kill_block_super() via kill_sb(). Making alloc_super() fail if memory
    allocation for register_shrinker() failed is much simpler. Let's avoid
    calling deactivate_locked_super() from sget_userns() by preallocating
    memory for the shrinker and making register_shrinker() in sget_userns()
    never fail.

    [1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Cc: Al Viro
    Cc: Michal Hocko
    Signed-off-by: Al Viro

    Tetsuo Handa
     

12 Apr, 2018

7 commits

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") added per-cpu drift to all memory cgroup stats
    and events shown in memory.stat and memory.events.

    For memory.stat this is acceptable. But memory.events issues file
    notifications, and somebody polling the file for changes will be
    confused when the counters in it are unchanged after a wakeup.

    Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
    MEMCG_OOM - are sufficiently rare and high-level that we don't need
    per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
    frequent, but they're counting invocations of reclaim, which is a
    complex operation that touches many shared cachelines.

    This splits memory.events from the generic VM events and tracks them in
    their own, unbuffered atomic counters. That's also cleaner, as it
    eliminates the ugly enum nesting of VM and cgroup events.
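    The unbuffered-counter idea can be modeled in user-space C with C11 atomics.
    The event names follow the kernel's enum; the counter array and accessors are
    simplified stand-ins for the per-memcg fields.

```c
#include <stdatomic.h>

/* The four cgroup memory events, as in the kernel's enum. */
enum memcg_memory_event { MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX, MEMCG_OOM,
                          MEMCG_NR_MEMORY_EVENTS };

/* Unbuffered atomic counters: an increment is immediately visible to a
 * reader, so a file-notification wakeup never observes a stale count the
 * way a per-cpu-buffered stat could. */
static atomic_long memory_events[MEMCG_NR_MEMORY_EVENTS];

static void memcg_memory_event(enum memcg_memory_event ev)
{
    atomic_fetch_add_explicit(&memory_events[ev], 1, memory_order_relaxed);
}

static long memcg_read_event(enum memcg_memory_event ev)
{
    return atomic_load_explicit(&memory_events[ev], memory_order_relaxed);
}
```

    Since these events are rare (each one is an invocation of reclaim or an OOM),
    the shared-cacheline cost of the atomic increment is negligible.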

    [hannes@cmpxchg.org: "array subscript is above array bounds"]
    Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
    Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
    parameters! Seven of them are from the reclaim_stat structure. This
    structure is currently local to mm/vmscan.c. By moving it to the global
    vmstat.h header, we can also reference it from the vmscan tracepoints.
    Moving it also brings down the overhead of passing so many arguments
    to the trace event. In the future, we may limit the number of arguments
    that a trace event may pass (ideally just 6, but more realistically it
    may be 8).

    Before this patch, the code to call the trace event is this:

    0f 83 aa fe ff ff jae ffffffff811e6261
    48 8b 45 a0 mov -0x60(%rbp),%rax
    45 8b 64 24 20 mov 0x20(%r12),%r12d
    44 8b 6d d4 mov -0x2c(%rbp),%r13d
    8b 4d d0 mov -0x30(%rbp),%ecx
    44 8b 75 cc mov -0x34(%rbp),%r14d
    44 8b 7d c8 mov -0x38(%rbp),%r15d
    48 89 45 90 mov %rax,-0x70(%rbp)
    8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
    8b 55 c0 mov -0x40(%rbp),%edx
    8b 7d c4 mov -0x3c(%rbp),%edi
    8b 75 b8 mov -0x48(%rbp),%esi
    89 45 80 mov %eax,-0x80(%rbp)
    65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0
    48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68
    48 85 c0 test %rax,%rax
    74 72 je ffffffff811e646a
    48 89 c3 mov %rax,%rbx
    4c 8b 10 mov (%rax),%r10
    89 f8 mov %edi,%eax
    48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
    89 f0 mov %esi,%eax
    48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
    89 c8 mov %ecx,%eax
    48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
    89 d0 mov %edx,%eax
    48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
    8b 45 8c mov -0x74(%rbp),%eax
    48 8b 7b 08 mov 0x8(%rbx),%rdi
    48 83 c3 18 add $0x18,%rbx
    50 push %rax
    41 54 push %r12
    41 55 push %r13
    ff b5 78 ff ff ff pushq -0x88(%rbp)
    41 56 push %r14
    41 57 push %r15
    ff b5 70 ff ff ff pushq -0x90(%rbp)
    4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
    4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
    48 8b 4d 98 mov -0x68(%rbp),%rcx
    48 8b 55 90 mov -0x70(%rbp),%rdx
    8b 75 80 mov -0x80(%rbp),%esi
    41 ff d2 callq *%r10

    After the patch:

    0f 83 a8 fe ff ff jae ffffffff811e626d
    8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
    45 8b 64 24 20 mov 0x20(%r12),%r12d
    4c 8b 6d a0 mov -0x60(%rbp),%r13
    65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0
    4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68
    4d 85 f6 test %r14,%r14
    74 2a je ffffffff811e6411
    49 8b 06 mov (%r14),%rax
    8b 4d 8c mov -0x74(%rbp),%ecx
    49 8b 7e 08 mov 0x8(%r14),%rdi
    49 83 c6 18 add $0x18,%r14
    4c 89 ea mov %r13,%rdx
    45 89 e1 mov %r12d,%r9d
    4c 8d 45 b8 lea -0x48(%rbp),%r8
    89 de mov %ebx,%esi
    51 push %rcx
    48 8b 4d 98 mov -0x68(%rbp),%rcx
    ff d0 callq *%rax

    Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
    Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
    Signed-off-by: Steven Rostedt (VMware)
    Reported-by: Alexei Starovoitov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Andrey Ryabinin
    Cc: Alexei Starovoitov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • memcg reclaim may alter pgdat->flags based on the state of LRU lists in
    the cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep in
    congestion_wait(), and PGDAT_DIRTY may force kswapd to write back
    filesystem pages. But the worst is PGDAT_CONGESTED, since it may force all
    direct reclaims to stall in wait_iff_congested(). Note that only kswapd
    has the power to clear any of these bits. This might just never happen if
    cgroup limits are configured that way. So all direct reclaims will stall
    as long as we have some congested bdi in the system.

    Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
    pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
    it's reasonable to leave all decisions about node state to kswapd.

    Why only kswapd? Why not allow global direct reclaim to change these
    flags? Because currently only kswapd can clear these flags. I'm
    less worried about the case where PGDAT_CONGESTED is falsely not set, and
    more worried about the case where it is falsely set. If a direct reclaimer
    sets PGDAT_CONGESTED, do we have a guarantee that after the congestion
    problem is sorted out, kswapd will be woken up and clear the flag? It
    seems there is no such guarantee. E.g. direct reclaimers may
    eventually balance the pgdat and kswapd simply won't wake up (see
    wakeup_kswapd()).

    Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
    now loses its congestion throttling mechanism. Add per-cgroup
    congestion state and throttle cgroup2 reclaimers if memcg is in
    congestion state.

    Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
    bits since they alter only kswapd behavior.

    The problem could be easily demonstrated by creating heavy congestion in
    one cgroup:

    echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
    mkdir -p /sys/fs/cgroup/congester
    echo 512M > /sys/fs/cgroup/congester/memory.max
    echo $$ > /sys/fs/cgroup/congester/cgroup.procs
    /* generate a lot of dirty data on slow HDD */
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
    ....
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &

    and some job in another cgroup:

    mkdir /sys/fs/cgroup/victim
    echo 128M > /sys/fs/cgroup/victim/memory.max

    # time cat /dev/sda > /dev/null
    real 10m15.054s
    user 0m0.487s
    sys 1m8.505s

    According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
    of the time sleeping there.

    With the patch, 'cat' doesn't waste time anymore:

    # time cat /dev/sda > /dev/null
    real 5m32.911s
    user 0m0.411s
    sys 0m56.664s

    [aryabinin@virtuozzo.com: congestion state should be per-node]
    Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
    [aryabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup]
    Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Michal Hocko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • We have a separate LRU list for each memory cgroup. Memory reclaim
    iterates over cgroups and calls shrink_inactive_list() on every inactive
    LRU list. Based on the state of a single LRU, shrink_inactive_list() may
    flag the whole node as dirty, congested or under writeback. This is
    obviously wrong and hurtful. It's especially hurtful when we have a
    possibly small congested cgroup in the system. Then *all* direct reclaims
    waste time by sleeping in wait_iff_congested(). And the more memcgs the
    system has, the longer the memory allocation stall is, because
    wait_iff_congested() is called on each lru-list scan.

    Sum reclaim stats across all visited LRUs on node and flag node as
    dirty, congested or under writeback based on that sum. Also call
    congestion_wait(), wait_iff_congested() once per pgdat scan, instead of
    once per lru-list scan.

    This only fixes the problem for global reclaim case. Per-cgroup reclaim
    may alter global pgdat flags too, which is wrong. But that is separate
    issue and will be addressed in the next patch.

    This change will not have any effect on systems with the whole workload
    concentrated in a single cgroup.

    [aryabinin@virtuozzo.com: check nr_writeback against all nr_taken, not just file]
    Link: http://lkml.kernel.org/r/20180406180254.8970-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20180323152029.11084-4-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Shakeel Butt
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Only kswapd can have non-zero nr_immediate, and current_may_throttle()
    is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
    enough to check stat.nr_immediate only.

    Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Cc: Shakeel Butt
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Update some comments that became stale since the transition from
    per-zone to per-node reclaim.

    Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Cc: Shakeel Butt
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

06 Apr, 2018

3 commits

  • Kswapd will not wake up if per-zone watermarks are not failing or if too
    many previous attempts at background reclaim have failed.

    This can be true if there is a lot of free memory available. For high-
    order allocations, kswapd is responsible for waking up kcompactd for
    background compaction. If the zone is not below its watermarks or
    reclaim has recently failed (lots of free memory, nothing left to
    reclaim), kcompactd does not get woken up.

    When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
    woken up even if kswapd will not reclaim. This allows high-order
    allocations, such as thp, to still trigger background compaction even
    when the zone has an abundance of free memory.
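    The wakeup decision above can be reduced to a few booleans. This is a
    minimal model with stand-in predicate names, not the kernel's exact helpers.

```c
#include <stdbool.h>

/* Inputs to the wakeup path, reduced to booleans. */
struct wake_ctx {
    bool below_watermark;        /* zone is failing its watermarks */
    bool reclaim_keeps_failing;  /* many failed background reclaims */
    bool direct_reclaim_allowed; /* __GFP_DIRECT_RECLAIM in gfp_mask */
    int  order;                  /* allocation order */
};

/* kswapd reclaim only runs when watermarks fail and background reclaim
 * has not repeatedly failed already. */
static bool should_reclaim(const struct wake_ctx *c)
{
    return c->below_watermark && !c->reclaim_keeps_failing;
}

/* After the patch: for high-order allocations that cannot reclaim
 * directly, kcompactd is still woken for background compaction even
 * when kswapd itself will not reclaim (e.g. plenty of free memory). */
static bool should_wake_kcompactd(const struct wake_ctx *c)
{
    return c->order > 0 && (should_reclaim(c) || !c->direct_reclaim_allowed);
}
```

    A THP fault with abundant free memory is exactly the case the patch targets:
    reclaim is skipped, compaction is not.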

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since we no longer use the return value of shrink_slab() for normal
    reclaim, the comment is no longer true. If some do_shrink_slab() call
    takes unexpectedly long (the root cause of the stall is currently
    unknown) while register_shrinker()/unregister_shrinker() is pending,
    trying to drop caches via /proc/sys/vm/drop_caches could become an
    infinite cond_resched() loop if many mem_cgroups are defined. For
    safety, let's not pretend forward progress.

    Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Dave Chinner
    Cc: Glauber Costa
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • When page_mapping() is called and the mapping is dereferenced in
    page_evictable() through shrink_active_list(), it is possible for the
    inode to be truncated and the embedded address space to be freed at the
    same time. This may lead to the following race.

    CPU1 CPU2

    truncate(inode) shrink_active_list()
    ... page_evictable(page)
    truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
    spin_lock_irqsave(&mapping->tree_lock, flags);
    __delete_from_page_cache(page, NULL)
    page_cache_tree_delete(..)
    ... mapping = page_mapping(page);
    page->mapping = NULL;
    ...
    spin_unlock_irqrestore(&mapping->tree_lock, flags);
    page_cache_free_page(mapping, page)
    put_page(page)
    if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space

    mapping_unevictable(mapping)
    test_bit(AS_UNEVICTABLE, &mapping->flags);
    - we've dereferenced mapping which is potentially already free.

    A similar race exists between swap cache freeing and page_evictable(),
    too.

    The address_space in the inode and swap cache will be freed after an RCU
    grace period. So the races are fixed by enclosing the page_mapping()
    and address_space usage in rcu_read_lock/unlock(). Some comments are
    added in the code to make it clear what is protected by the RCU read
    lock.

    Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

23 Mar, 2018

1 commit

  • Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
    pages on the LRU") added flusher invocation to shrink_inactive_list()
    when many dirty pages on the LRU are encountered.

    However, shrink_inactive_list() doesn't wake up flushers for legacy
    cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old
    flusher wakeup from direct reclaim path") removed the only source of
    flusher's wake up in legacy mem cgroup reclaim path.

    This leads to premature OOM if there are too many dirty pages in the cgroup:
    # mkdir /sys/fs/cgroup/memory/test
    # echo $$ > /sys/fs/cgroup/memory/test/tasks
    # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    # dd if=/dev/zero of=tmp_file bs=1M count=100
    Killed

    dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0

    Call Trace:
    dump_stack+0x46/0x65
    dump_header+0x6b/0x2ac
    oom_kill_process+0x21c/0x4a0
    out_of_memory+0x2a5/0x4b0
    mem_cgroup_out_of_memory+0x3b/0x60
    mem_cgroup_oom_synchronize+0x2ed/0x330
    pagefault_out_of_memory+0x24/0x54
    __do_page_fault+0x521/0x540
    page_fault+0x45/0x50

    Task in /test killed as a result of limit of /test
    memory: usage 51200kB, limit 51200kB, failcnt 73
    memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
    mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
    Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
    Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
    oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Wake up flushers in legacy cgroup reclaim too.

    Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
    Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
    Signed-off-by: Andrey Ryabinin
    Tested-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

22 Feb, 2018

1 commit

  • When a thread mlocks an address space backed either by file pages which
    are currently not present in memory or by swapped-out anon pages (not in
    swapcache), a new page is allocated and added to the local pagevec
    (lru_add_pvec); I/O is triggered and the thread then sleeps on the page.
    On I/O completion, the thread can wake on a different CPU; the mlock
    syscall will then set the PageMlocked() bit of the page but will not be
    able to put that page on the unevictable LRU, as the page is on the
    pagevec of a different CPU. Even on drain, that page will go to an
    evictable LRU because the PageMlocked() bit is not checked on pagevec
    drain.

    The page will eventually go to the right LRU on reclaim, but the LRU
    stats will remain skewed for a long time.

    This patch puts all the pages, even unevictable ones, on the pagevecs
    and, on drain, the pages will be added to their LRUs correctly by
    checking their evictability. This resolves the issue of mlocked pages on
    the pagevecs of other CPUs, because when those pagevecs are drained, the
    mlocked file pages will go to the unevictable LRU. It also makes the race
    with munlock easier to resolve, because the pagevec drains happen under
    the LRU lock.

    However there is still one place which makes a page evictable and does a
    PageLRU check on that page without the LRU lock, and it needs special
    attention: TestClearPageMlocked() and isolate_lru_page() in
    clear_page_mlock().

    #0: __pagevec_lru_add_fn #1: clear_page_mlock

    SetPageLRU() if (!TestClearPageMlocked())
    return
    smp_mb() //
    Acked-by: Vlastimil Babka
    Cc: Jérôme Glisse
    Cc: Huang Ying
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Jan Kara
    Cc: Nicholas Piggin
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

07 Feb, 2018

1 commit


01 Feb, 2018

4 commits

  • Minchan Kim asked the following question -- what lock protects the
    address_space from being destroyed when a race happens between inode
    truncation and __isolate_lru_page()? Jan Kara clarified by describing
    the race as follows:

    CPU1 CPU2

    truncate(inode) __isolate_lru_page()
    ...
    truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
    spin_lock_irqsave(&mapping->tree_lock, flags);
    __delete_from_page_cache(page, NULL)
    page_cache_tree_delete(..)
    ... mapping = page_mapping(page);
    page->mapping = NULL;
    ...
    spin_unlock_irqrestore(&mapping->tree_lock, flags);
    page_cache_free_page(mapping, page)
    put_page(page)
    if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space

    if (mapping && !mapping->a_ops->migratepage)
    - we've dereferenced mapping which is potentially already free.

    The race is theoretically possible but unlikely. Before the
    delete_from_page_cache, truncate_cleanup_page is called, so the page is
    likely to be !PageDirty or PageWriteback, which gets skipped by the only
    caller that checks the mapping in __isolate_lru_page. Even if the race
    occurs, a substantial amount of work has to happen during a tiny window
    with no preemption, but it could potentially be done using a virtual
    machine to artificially slow one CPU or halt it during the critical
    window.

    This patch should eliminate the race with truncation by try-locking the
    page before dereferencing the mapping and aborting if the lock was not
    acquired. There was a suggestion from Huang Ying to use RCU as a
    side-effect to prevent the mapping being freed. However, I do not like
    that solution, as it's an unconventional means of preserving a mapping
    and it's not a context where rcu_read_lock is obviously protecting RCU
    data.

    Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
    Fixes: c82449352854 ("mm: compaction: make isolate_lru_page() filter-aware again")
    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove the unused function pgdat_reclaimable_pages(), and
    node_page_state_snapshot(), which becomes unused as well.

    Link: http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Shakeel Butt reported he has observed in production systems that the job
    loader gets stuck for 10s of seconds while doing a mount operation. It
    turns out that it was stuck in register_shrinker() because some
    unrelated job was under memory pressure and was spending time in
    shrink_slab(). Machines have a lot of shrinkers registered and jobs
    under memory pressure have to traverse all of those memcg-aware
    shrinkers and affect unrelated jobs which want to register their own
    shrinkers.

    To solve the issue, this patch simply bails out slab shrinking if it is
    found that someone wants to register a shrinker in parallel. A downside
    is that it could cause unfair shrinking between shrinkers. However, it
    should be rare, and we can add more complicated logic if we find it's
    not enough.

    [akpm@linux-foundation.org: tweak code comment]
    Link: http://lkml.kernel.org/r/20171115005602.GB23810@bbox
    Link: http://lkml.kernel.org/r/1511481899-20335-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Signed-off-by: Shakeel Butt
    Reported-by: Shakeel Butt
    Tested-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Previously we were using the ratio of the number of lru pages scanned to
    the number of eligible lru pages to determine the number of slab objects
    to scan. The problem with this is that these two things have nothing to
    do with each other, so in slab-heavy workloads where there is little to
    no page cache, we can end up with the number of pages scanned being very
    low. This means that we reclaim next to no slab pages and waste a
    lot of time reclaiming small amounts of space.

    Consider the following scenario, where we have the following values and
    the rest of the memory usage is in slab

    Active: 58840 kB
    Inactive: 46860 kB

    Every time we do a get_scan_count() we do this

    scan = size >> sc->priority

    where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
    through reclaim would result in a scan target of 2 pages to 11715 total
    inactive pages, and 3 pages to 14710 total active pages. This is a
    really, really small target for a system that is entirely slab pages.
    And this is super optimistic; it assumes we even get to scan these
    pages. We don't increment sc->nr_scanned unless we 1) isolate the page,
    which assumes it's not in use, and 2) can lock the page. Under pressure
    these numbers could probably go down, I'm sure there's some random pages
    from daemons that aren't actually in use, so the targets get even
    smaller.

    Instead, use sc->priority in the same way we use it to determine scan
    amounts for the LRUs. This generally equates to pages. Consider the
    following

    slab_pages = (nr_objects * object_size) / PAGE_SIZE

    What we would like to do is

    scan = slab_pages >> sc->priority

    but we don't know the number of slab pages each shrinker controls, only
    the objects. However say that theoretically we knew how many pages a
    shrinker controlled, we'd still have to convert this to objects, which
    would look like the following

    scan = shrinker_pages >> sc->priority
    scan_objects = (PAGE_SIZE / object_size) * scan

    or written another way

    scan_objects = (shrinker_pages >> sc->priority) *
    (PAGE_SIZE / object_size)

    which can thus be written

    scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
    sc->priority

    which is just

    scan_objects = nr_objects >> sc->priority

    We don't need to know exactly how many pages each shrinker represents;
    its object count is all the information we need. Making this change
    allows us to place an appropriate amount of pressure on the shrinker
    pools for their relative size.
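    The final formula is a one-liner; a quick numeric check (DEF_PRIORITY is 12,
    as in the kernel):

```c
#define DEF_PRIORITY 12

/* Scan target for a shrinker: its object count scaled by reclaim
 * priority, exactly as derived above. */
static unsigned long scan_objects(unsigned long nr_objects, int priority)
{
    return nr_objects >> priority;
}
```

    A shrinker holding a million objects gets a first-pass target of 256 objects
    instead of the near-zero target the old page-cache-based ratio produced on a
    slab-heavy system.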

    Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Dave Chinner
    Acked-by: Andrey Ryabinin
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

19 Dec, 2017

1 commit

  • Syzbot caught an oops at unregister_shrinker() because the combination
    of commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
    injection made register_shrinker() fail and the caller of
    register_shrinker() did not check for failure.

    ----------
    [ 554.881422] FAULT_INJECTION: forcing a failure.
    [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0
    [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.881445] Call Trace:
    [ 554.881459] dump_stack+0x194/0x257
    [ 554.881474] ? arch_local_irq_restore+0x53/0x53
    [ 554.881486] ? find_held_lock+0x35/0x1d0
    [ 554.881507] should_fail+0x8c0/0xa40
    [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0
    [ 554.881537] ? check_noncircular+0x20/0x20
    [ 554.881546] ? find_next_zero_bit+0x2c/0x40
    [ 554.881560] ? ida_get_new_above+0x421/0x9d0
    [ 554.881577] ? find_held_lock+0x35/0x1d0
    [ 554.881594] ? __lock_is_held+0xb6/0x140
    [ 554.881628] ? check_same_owner+0x320/0x320
    [ 554.881634] ? lock_downgrade+0x990/0x990
    [ 554.881649] ? find_held_lock+0x35/0x1d0
    [ 554.881672] should_failslab+0xec/0x120
    [ 554.881684] __kmalloc+0x63/0x760
    [ 554.881692] ? lock_downgrade+0x990/0x990
    [ 554.881712] ? register_shrinker+0x10e/0x2d0
    [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320
    [ 554.881737] register_shrinker+0x10e/0x2d0
    [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0
    [ 554.881755] ? _down_write_nest_lock+0x120/0x120
    [ 554.881765] ? memcpy+0x45/0x50
    [ 554.881785] sget_userns+0xbcd/0xe20
    (...snipped...)
    [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled
    [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 554.898732] general protection fault: 0000 [#1] SMP KASAN
    [ 554.898737] Dumping ftrace buffer:
    [ 554.898741] (ftrace buffer empty)
    [ 554.898743] Modules linked in:
    [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000
    [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150
    [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246
    [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0
    [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004
    [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000
    [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98
    [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0
    [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
    [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    [ 554.898818] Call Trace:
    [ 554.898828] unregister_shrinker+0x79/0x300
    [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750
    [ 554.898844] ? down_write+0x87/0x120
    [ 554.898851] ? deactivate_super+0x139/0x1b0
    [ 554.898857] ? down_read+0x150/0x150
    [ 554.898864] ? check_same_owner+0x320/0x320
    [ 554.898875] deactivate_locked_super+0x64/0xd0
    [ 554.898883] deactivate_super+0x141/0x1b0
    ----------

    Since allowing register_shrinker() callers to call unregister_shrinker()
    when register_shrinker() failed can simplify the error recovery path,
    this patch makes unregister_shrinker() a no-op when register_shrinker()
    failed. Also, reset shrinker->nr_deferred in case unregister_shrinker()
    is called twice by mistake.
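    The guard can be sketched in user-space C: nr_deferred doubles as the
    "registration succeeded" marker, and clearing it makes a stray second call
    harmless. This is a simplified stand-in; the kernel also removes the shrinker
    from shrinker_list under shrinker_rwsem.

```c
#include <stdlib.h>

struct shrinker {
    long *nr_deferred;   /* NULL ==> register_shrinker() never succeeded */
};

void unregister_shrinker(struct shrinker *s)
{
    if (!s->nr_deferred)   /* registration failed, or already
                            * unregistered: safe no-op */
        return;
    /* in the kernel: list_del(&s->list) under shrinker_rwsem */
    free(s->nr_deferred);
    s->nr_deferred = NULL; /* make a double call harmless */
}
```

    With this, callers can unconditionally unregister on their error paths
    without tracking whether registration succeeded.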

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Aliaksei Karaliou
    Reported-by: syzbot
    Cc: Glauber Costa
    Cc: Al Viro
    Signed-off-by: Al Viro

    Tetsuo Handa
     

16 Nov, 2017

2 commits

  • Most callers of free_hot_cold_page claim the pages being released
    are cache hot. The exception is the page reclaim paths where it is
    likely that enough pages will be freed in the near future that the
    per-cpu lists are going to be recycled and the cache hotness information
    is lost. As no one really cares about the hotness of pages being
    released to the allocator, just ditch the parameter.

    The APIs are renamed to indicate that it's no longer about hot/cold
    pages. It should also be less confusing as there are subtle differences
    between them. __free_pages drops a reference and frees a page when the
    refcount reaches zero. free_hot_cold_page handled pages whose refcount
    was already zero which is non-obvious from the name. free_unref_page
    should be more obvious.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.

    [mgorman@techsingularity.net: add pages to head, not tail]
    Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
    Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") 'pgdat->inactive_ratio' is not used, except for printing
    "node_inactive_ratio: 0" in /proc/zoneinfo output.

    Remove it.

    Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enables us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds