07 Dec, 2020

1 commit

  • When investigating a slab cache bloat problem, a significant amount of
    negative dentry cache was seen. Confusingly, the dentries were neither
    shrunk by the reclaimer (the host was under very tight memory pressure)
    nor by dropping caches. The vmcore shows over 14M negative dentry
    objects on the lru, but tracing shows they were not even scanned.

    Further investigation shows that the memcg's vfs shrinker_map bit is
    not set, so the reclaimer and drop_caches simply skip calling the vfs
    shrinker. The only way to get the memory back was to reboot the hosts.

    I didn't manage to come up with a reproducer in a test environment, and
    the problem can't be reproduced after rebooting. But code inspection
    suggests a race between clearing the shrinker map bit and reparenting.
    The hypothesis is elaborated below.

    The memcg hierarchy on our production environment looks like:

         root
        /    \
    system   user

    The main workloads run under the user slice's children, and memcgs are
    created and removed there frequently. So reparenting happens very often
    under the user slice, but no task is under the user slice directly.

    With the frequent reparenting and tight memory pressure, the below
    hypothetical race condition may happen:

    CPU A                                   CPU B
    reparent
        dst->nr_items == 0
                                            shrinker:
                                                total_objects == 0
        add src->nr_items to dst
        set_bit
                                                return SHRINK_EMPTY
                                                clear_bit
    child memcg offline
        replace child's kmemcg_id with
        parent's (in memcg_offline_kmem())
                                            list_lru_del() between shrinker runs
                                                see parent's kmemcg_id
                                                dec dst->nr_items
    reparent again
        dst->nr_items may go negative
        due to concurrent list_lru_del()
                                            The second run of shrinker:
                                                read nr_items without any
                                                synchronization, so it may
                                                see intermediate negative
                                                nr_items then total_objects
                                                may return 0 coincidently

                                                keep the bit cleared
        dst->nr_items != 0
        skip set_bit
        add src->nr_items to dst

    After this point dst->nr_items may never go back to zero, so
    reparenting will never set the shrinker_map bit again. And since there
    is no task under the user slice directly, no new object will be added
    to its lru to set the shrinker map bit either. The bit stays cleared
    forever.

    How does list_lru_del() race with reparenting? Reparenting replaces a
    child's kmemcg_id with the parent's without holding nlru->lock, so
    list_lru_del() may see the parent's kmemcg_id while it is actually
    deleting items from the child's lru, and it then decrements the
    parent's nr_items. Hence the parent's nr_items may go negative, as
    commit 2788cf0c401c ("memcg: reparent list_lrus and free kmemcg_id on
    css offline") says.

    Since it is impossible for dst->nr_items to go negative while
    src->nr_items is zero, it seems we could set the shrinker map bit iff
    src->nr_items != 0. We could synchronize list_lru_count_one() and
    reparenting with nlru->lock, but checking src->nr_items during
    reparenting is the simplest fix and avoids lock contention.
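
    A minimal sketch of how the reparenting path could apply this check.
    Function and field names (memcg_drain_list_lru_node(),
    list_lru_from_memcg_idx(), memcg_set_shrinker_bit(), lru_shrinker_id())
    are assumptions based on kernels of that era, not a quote of the fix:

    static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
                                          int src_idx, struct mem_cgroup *dst_memcg)
    {
            struct list_lru_node *nlru = &lru->node[nid];
            int dst_idx = dst_memcg->kmemcg_id;
            struct list_lru_one *src, *dst;

            spin_lock_irq(&nlru->lock);

            src = list_lru_from_memcg_idx(nlru, src_idx);
            dst = list_lru_from_memcg_idx(nlru, dst_idx);

            list_splice_init(&src->list, &dst->list);

            /* Only touch dst->nr_items and the shrinker map bit when the
             * source list actually held items. */
            if (src->nr_items) {
                    dst->nr_items += src->nr_items;
                    memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
                    src->nr_items = 0;
            }

            spin_unlock_irq(&nlru->lock);
    }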

    Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
    Suggested-by: Roman Gushchin
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Kirill Tkhai
    Cc: Vladimir Davydov
    Cc: [4.19]
    Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     

15 Aug, 2020

1 commit

  • struct list_lru_one l.nr_items could be accessed concurrently as noticed
    by KCSAN,

    BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

    write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
    list_lru_isolate_move+0xf9/0x130
    list_lru_isolate_move at mm/list_lru.c:180
    inode_lru_isolate+0x12b/0x2a0
    __list_lru_walk_one+0x122/0x3d0
    list_lru_walk_one+0x75/0xa0
    prune_icache_sb+0x8b/0xc0
    super_cache_scan+0x1b8/0x250
    do_shrink_slab+0x256/0x6d0
    shrink_slab+0x41b/0x4a0
    shrink_node+0x35c/0xd80
    balance_pgdat+0x652/0xd90
    kswapd+0x396/0x8d0
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
    list_lru_count_one+0x116/0x2f0
    list_lru_count_one at mm/list_lru.c:193
    super_cache_count+0xe8/0x170
    do_shrink_slab+0x95/0x6d0
    shrink_slab+0x41b/0x4a0
    shrink_node+0x35c/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 56 PID: 6345 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #4
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    A torn load of l.nr_items could affect the shrinker behaviour due to
    this data race. Fix it by adding READ_ONCE() for the read. Since the
    writes are aligned and up to word-size, assume they are safe from data
    races, to avoid the readability cost of writing WRITE_ONCE(var, var + val).
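
    A rough sketch of the read side of the fix; the surrounding lookup
    (list_lru_from_memcg_idx(), memcg_cache_id(), the RCU protection)
    follows the list_lru code of that period and is not quoted verbatim:

    unsigned long list_lru_count_one(struct list_lru *lru,
                                     int nid, struct mem_cgroup *memcg)
    {
            struct list_lru_node *nlru = &lru->node[nid];
            struct list_lru_one *l;
            unsigned long count;

            rcu_read_lock();
            l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
            /* READ_ONCE() avoids a torn load racing with updates done
             * under nlru->lock on other CPUs. */
            count = READ_ONCE(l->nr_items);
            rcu_read_unlock();

            return count;
    }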

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     

30 Jun, 2020

1 commit

  • Rename the local kvfree_rcu() function to kvfree_rcu_local().
    The purpose is to prevent a conflict between two identical function
    declarations: kvfree_rcu() is about to become globally visible,
    which would otherwise lead to a build error. No functional change.
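
    A sketch of what the renamed local helper plausibly looks like in
    mm/list_lru.c; the struct and field names are assumptions, not quoted
    from the patch:

    /* Local RCU callback, renamed so it no longer collides with the
     * soon-to-be-global kvfree_rcu(). */
    static void kvfree_rcu_local(struct rcu_head *head)
    {
            struct list_lru_memcg *mlru;

            mlru = container_of(head, struct list_lru_memcg, rcu);
            kvfree(mlru);
    }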

    Cc: linux-mm@kvack.org
    Cc: rcu@vger.kernel.org
    Cc: Andrew Morton
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney

    Uladzislau Rezki (Sony)
     

05 Jun, 2020

1 commit


08 Apr, 2020

1 commit

  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/
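
    For illustration, the pattern being converted looks roughly like this
    (a generic switch, not necessarily the exact hunk in this file):

    switch (ret) {
    case LRU_REMOVED_RETRY:
            assert_spin_locked(&nlru->lock);
            fallthrough;            /* was: a bare "fall through" comment */
    case LRU_REMOVED:
            isolated++;
            break;
    default:
            break;
    }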

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     

03 Apr, 2020

1 commit

  • Sometimes we need to get a memcg pointer from a charged kernel object.
    The right way to get it depends on whether it's a proper slab object or
    whether it's backed by raw pages (e.g. it's a vmalloc allocation). In the
    first case the kmem_cache->memcg_params.memcg indirection should be used;
    in other cases it's just page->mem_cgroup.

    To simplify this task and hide the implementation details let's use the
    mem_cgroup_from_obj() helper, which takes a pointer to any kernel object
    and returns a valid memcg pointer or NULL.

    Passing a kernel address rather than a pointer to a page will make it
    possible to use this helper for per-object (rather than per-page) tracked
    objects in the future.

    The caller is still responsible for ensuring that the returned memcg isn't
    going away underneath it: take the rcu read lock, the cgroup mutex, etc.,
    depending on the context.

    mem_cgroup_from_kmem() defined in mm/list_lru.c is now obsolete and can be
    removed.
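
    A sketch of such a helper under the dispatch described above;
    memcg_from_slab_page() is an assumed name for the slab-side lookup and
    the body is written from memory, not quoted from the patch:

    struct mem_cgroup *mem_cgroup_from_obj(void *p)
    {
            struct page *page;

            if (mem_cgroup_disabled())
                    return NULL;

            page = virt_to_head_page(p);

            /* Slab object: go through kmem_cache->memcg_params.memcg. */
            if (PageSlab(page))
                    return memcg_from_slab_page(page);

            /* All other pages use page->mem_cgroup (may be NULL). */
            return page->mem_cgroup;
    }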

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Yafang Shao
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200117203609.3146239-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

13 Jul, 2019

1 commit

  • Every slab page charged to a non-root memory cgroup has a pointer to the
    memory cgroup and holds a reference to it, which protects a non-empty
    memory cgroup from being released. At the same time the page has a
    pointer to the corresponding kmem_cache and also holds a reference to
    that kmem_cache. And the kmem_cache by itself holds a reference to the
    cgroup.

    So there is clearly some redundancy, which allows us to stop setting the
    page->mem_cgroup pointer and to rely on getting the memcg pointer
    indirectly via the kmem_cache. Further, it will make it easier to change
    this pointer later, without a need to go over all charged pages.

    So let's stop setting page->mem_cgroup pointer for slab pages, and stop
    using the css refcounter directly for protecting the memory cgroup from
    going away. Instead rely on kmem_cache as an intermediate object.

    Make sure that vmstats and shrinker lists are working as previously, as
    well as /proc/kpagecgroup interface.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

14 Jun, 2019

1 commit

  • Syzbot reported following memory leak:

    ffffffffda RBX: 0000000000000003 RCX: 0000000000441f79
    BUG: memory leak
    unreferenced object 0xffff888114f26040 (size 32):
    comm "syz-executor626", pid 7056, jiffies 4294948701 (age 39.410s)
    hex dump (first 32 bytes):
    40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff @`......@`......
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    slab_post_alloc_hook mm/slab.h:439 [inline]
    slab_alloc mm/slab.c:3326 [inline]
    kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
    kmalloc include/linux/slab.h:547 [inline]
    __memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352
    memcg_init_list_lru_node mm/list_lru.c:375 [inline]
    memcg_init_list_lru mm/list_lru.c:459 [inline]
    __list_lru_init+0x193/0x2a0 mm/list_lru.c:626
    alloc_super+0x2e0/0x310 fs/super.c:269
    sget_userns+0x94/0x2a0 fs/super.c:609
    sget+0x8d/0xb0 fs/super.c:660
    mount_nodev+0x31/0xb0 fs/super.c:1387
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236
    legacy_get_tree+0x27/0x80 fs/fs_context.c:661
    vfs_get_tree+0x2e/0x120 fs/super.c:1476
    do_new_mount fs/namespace.c:2790 [inline]
    do_mount+0x932/0xc50 fs/namespace.c:3110
    ksys_mount+0xab/0x120 fs/namespace.c:3319
    __do_sys_mount fs/namespace.c:3333 [inline]
    __se_sys_mount fs/namespace.c:3330 [inline]
    __x64_sys_mount+0x26/0x30 fs/namespace.c:3330
    do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This is a simple off by one bug on the error path.
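
    A hedged sketch of the kind of error-path off-by-one this describes;
    the function shape mirrors __memcg_init_list_lru_node() from the
    backtrace, but init_one_lru() and the exact cleanup call are assumptions:

    static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,
                                          int begin, int end)
    {
            int i;

            for (i = begin; i < end; i++) {
                    struct list_lru_one *l = kmalloc(sizeof(*l), GFP_KERNEL);

                    if (!l)
                            goto fail;
                    init_one_lru(l);
                    memcg_lrus->lru[i] = l;
            }
            return 0;
    fail:
            /* Entries begin .. i-1 were allocated. Unwinding up to
             * "i - 1" with an exclusive end leaked the last one, so
             * pass "i" here. */
            __memcg_destroy_list_lru_node(memcg_lrus, begin, i);
            return -ENOMEM;
    }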

    Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com
    Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists")
    Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Reviewed-by: Kirill Tkhai
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

02 Jun, 2019

1 commit

  • We have a single node system with node 0 disabled:
    Scanning NUMA topology in Northbridge 24
    Number of physical nodes 2
    Skipping disabled node 0
    Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
    NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]

    This causes crashes in memcg when system boots:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    #PF error: [normal kernel read fault]
    ...
    RIP: 0010:list_lru_add+0x94/0x170
    ...
    Call Trace:
    d_lru_add+0x44/0x50
    dput.part.34+0xfc/0x110
    __fput+0x108/0x230
    task_work_run+0x9f/0xc0
    exit_to_usermode_loop+0xf5/0x100

    It is reproducible as far back as 4.12; I did not try older kernels. You
    have to have a new enough systemd, e.g. 241 (the reason is unknown -- it
    was not investigated). It cannot be reproduced with systemd 234.

    The system crashes because the size of lru array is never updated in
    memcg_update_all_list_lrus and the reads are past the zero-sized array,
    causing dereferences of random memory.

    The root cause are list_lru_memcg_aware checks in the list_lru code. The
    test in list_lru_memcg_aware is broken: it assumes node 0 is always
    present, but it is not true on some systems as can be seen above.

    So fix this by avoiding checks on node 0. Remember the memcg-awareness by
    a bool flag in struct list_lru.
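
    A sketch of the described fix: record memcg-awareness explicitly
    instead of deriving it from node 0 (field placement and the CONFIG
    guard are assumptions):

    struct list_lru {
            struct list_lru_node    *node;
    #ifdef CONFIG_MEMCG_KMEM
            struct list_head        list;
            int                     shrinker_id;
            bool                    memcg_aware;    /* new flag */
    #endif
    };

    static inline bool list_lru_memcg_aware(struct list_lru *lru)
    {
            /* Was derived from lru->node[0].memcg_lrus, which breaks
             * when node 0 is not present. */
            return lru->memcg_aware;
    }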

    Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz
    Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists")
    Signed-off-by: Jiri Slaby
    Acked-by: Michal Hocko
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Raghavendra K T
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
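
    In a C source file this is a single comment line at the top, e.g.:

    // SPDX-License-Identifier: GPL-2.0-only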

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Mar, 2019

1 commit

  • Number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64:

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function old new delta
    hv_synic_alloc.cold 88 110 +22
    prealloc_shrinker 260 262 +2
    bootstrap 249 251 +2
    sched_init_numa 1566 1567 +1
    show_slab_objects 778 777 -1
    s_show 1201 1200 -1
    kmem_cache_init 346 345 -1
    __alloc_workqueue_key 1146 1145 -1
    mem_cgroup_css_alloc 1614 1612 -2
    __do_sys_swapon 4702 4699 -3
    __list_lru_init 655 651 -4
    nic_probe 2379 2374 -5
    store_user_store 118 111 -7
    red_zone_store 106 99 -7
    poison_store 106 99 -7
    wq_numa_init 348 338 -10
    __kmem_cache_empty 75 65 -10
    task_numa_free 186 173 -13
    merge_across_nodes_store 351 336 -15
    irq_create_affinity_masks 1261 1246 -15
    do_numa_crng_init 343 321 -22
    task_numa_fault 4760 4737 -23
    swapfile_init 179 156 -23
    hv_synic_alloc 536 492 -44
    apply_wqattrs_prepare 746 695 -51

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

18 Aug, 2018

12 commits

  • Provide list_lru_shrink_walk_irq() and let it behave like
    list_lru_walk_one() except that it locks the spinlock with
    spin_lock_irq(). This is used by scan_shadow_nodes() because its lock
    nests within the i_pages lock, which is acquired with IRQs disabled.
    This change allows proper locking primitives to be used instead of a
    hand-crafted local_irq_disable() plus spin_lock().

    There is no EXPORT_SYMBOL provided because the current user is in-kernel
    only.

    Add list_lru_shrink_walk_irq() which acquires the spinlock with the
    proper locking primitives.
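
    A sketch of the IRQ-disabling walk variant; apart from spin_lock_irq()
    it mirrors the non-IRQ walk, and the helper names are assumed from the
    surrounding list_lru code rather than quoted from the patch:

    static inline unsigned long
    list_lru_shrink_walk_irq(struct list_lru *lru, struct shrink_control *sc,
                             list_lru_walk_cb isolate, void *cb_arg)
    {
            return list_lru_walk_one_irq(lru, sc->nid, sc->memcg, isolate,
                                         cb_arg, &sc->nr_to_scan);
    }

    unsigned long
    list_lru_walk_one_irq(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
                          list_lru_walk_cb isolate, void *cb_arg,
                          unsigned long *nr_to_walk)
    {
            struct list_lru_node *nlru = &lru->node[nid];
            unsigned long ret;

            spin_lock_irq(&nlru->lock);     /* IRQ-safe variant of the lock */
            ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg),
                                      isolate, cb_arg, nr_to_walk);
            spin_unlock_irq(&nlru->lock);
            return ret;
    }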

    Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • __list_lru_walk_one() is invoked with struct list_lru *lru, int nid as
    the first two arguments. Those two are only used to retrieve the struct
    list_lru_node. Since the caller already looks it up for the locking, we
    can pass the struct list_lru_node * directly and avoid the dance around
    it.

    Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Move the locking inside __list_lru_walk_one() to its caller. This is a
    preparation step in order to introduce list_lru_walk_one_irq() which
    does spin_lock_irq() instead of spin_lock() for the locking.

    Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user".

    This series removes the local_irq_disable() around
    list_lru_shrink_walk() (as used by mm/workingset) by adding
    list_lru_shrink_walk_irq().

    Vladimir Davydov preferred this over `irq' argument which I added to
    struct list_lru.

    The initial post (of this series) received a Reviewed-by tag by Vladimir
    Davydov which I added to each patch of the series. The series applies
    on top of akpm's tree which has Kirill's shrink_slab series and does not
    clash with it (akpm asked me to wait a week or so and repost it then).

    I tested the code paths by triggering the OOM-killer via memory
    overcommit, and lockdep did not complain (nor did I see any warnings).

    This patch (of 4):

    list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the
    memcg_idx parameter. The same can be achieved by list_lru_walk_one(),
    passing NULL as the memcg argument, which then gets converted into -1.
    This is a preparation step for lifting the spin_lock() to the caller
    of __list_lru_walk_one(). Invoke list_lru_walk_one() instead of
    __list_lru_walk_one() when possible.

    Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Introduce a set_shrinker_bit() function to set the shrinker-related bit
    in the memcg shrinker bitmap, and set the bit after the first item is
    added and when a destroyed memcg's items are reparented.

    This will allow the next patch to call shrinkers only when they have
    charged objects at the moment, and so to improve shrink_slab()
    performance.
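
    A sketch of how the lru add path uses this, setting the bit when the
    first element appears on a per-memcg list. Helper names such as
    list_lru_from_kmem() and lru_shrinker_id() follow the series but are
    written from memory:

    bool list_lru_add(struct list_lru *lru, struct list_head *item)
    {
            int nid = page_to_nid(virt_to_page(item));
            struct list_lru_node *nlru = &lru->node[nid];
            struct mem_cgroup *memcg;
            struct list_lru_one *l;

            spin_lock(&nlru->lock);
            if (list_empty(item)) {
                    l = list_lru_from_kmem(nlru, item, &memcg);
                    list_add_tail(item, &l->list);
                    /* The list was empty: tell the memcg's shrinker bitmap
                     * that this shrinker now has something to do. */
                    if (!l->nr_items++)
                            memcg_set_shrinker_bit(memcg, nid,
                                                   lru_shrinker_id(lru));
                    nlru->nr_items++;
                    spin_unlock(&nlru->lock);
                    return true;
            }
            spin_unlock(&nlru->lock);
            return false;
    }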

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • This is just refactoring to allow next patches to have lru pointer in
    memcg_drain_list_lru_node().

    Link: http://lkml.kernel.org/r/153063063164.1818.55009531386089350.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • This is just refactoring to allow the next patches to have dst_memcg
    pointer in memcg_drain_list_lru_node().

    Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • This is just refactoring to allow the next patches to have memcg pointer
    in list_lru_from_kmem().

    Link: http://lkml.kernel.org/r/153063060664.1818.9541345386733498582.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Add a list_lru::shrinker_id field and populate it with the registered
    shrinker id.

    This will be used by the lru code in the next patches to set the correct
    bit in the memcg shrinkers map once the first memcg-related element
    appears in the list_lru.

    Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Introduce a new config option to replace the repeated
    CONFIG_MEMCG && !CONFIG_SLOB pattern. The next patches add a little more
    memcg+kmem related code, so let's keep the ifdefs cleaner.
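
    Presumably this is the CONFIG_MEMCG_KMEM symbol; a Kconfig sketch of an
    internal option that simply encodes the old compound condition:

    config MEMCG_KMEM
            bool
            depends on MEMCG && !SLOB
            default y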

    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Patch series "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))", v8.

    This patchset solves the problem of slow shrink_slab() occurring on
    machines with many shrinkers and memory cgroups (i.e., with many
    containers). The problem is that the complexity of shrink_slab() is
    O(n^2), and it grows too fast with the number of containers.

    Let us have 200 containers, and every container has 10 mounts and 10
    cgroups. All container tasks are isolated, and they don't touch foreign
    containers mounts.

    In case of global reclaim, a task has to iterate over all the memcgs and
    call all the memcg-aware shrinkers for each of them. This means the task
    has to visit 200 * 10 = 2000 shrinkers for every memcg, and since there
    are 2000 memcgs, the total number of do_shrink_slab() calls is 2000 *
    2000 = 4000000.

    4 million calls are not the kind of operation that takes one cpu cycle
    each. E.g., super_cache_count() accesses at least two lists and does
    arithmetic. Even if there are no charged objects, we do these
    calculations and displace cpu caches with memory reads. I observed
    nodes spending almost 100% of their time in the kernel under intensive
    writing and global reclaim. The writer consumes pages fast, but it has
    to go through shrink_slab() before the reclaimer reaches the
    page-shrinking code (and frees SWAP_CLUSTER_MAX pages). Even if there is
    no writing, the iterations just waste time and slow reclaim down.

    Let's see the small test below:

    $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
    $mkdir /sys/fs/cgroup/memory/ct
    $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
    $for i in `seq 0 4000`;
    do mkdir /sys/fs/cgroup/memory/ct/$i;
    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
    mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
    done

    Then, let's see drop caches time (5 sequential calls):

    $time echo 3 > /proc/sys/vm/drop_caches

    0.00user 13.78system 0:13.78elapsed 99%CPU
    0.00user 5.59system 0:05.60elapsed 99%CPU
    0.00user 5.48system 0:05.48elapsed 99%CPU
    0.00user 8.35system 0:08.35elapsed 99%CPU
    0.00user 8.34system 0:08.35elapsed 99%CPU

    The last four calls don't actually shrink anything. So, the iterations
    over slab shrinkers take 5.48 seconds. Not so good for scalability.

    The patchset solves the problem by making shrink_slab() of O(n)
    complexity. There are following functional actions:

    1) Assign an id to every registered memcg-aware shrinker.

    2) Maintain a per-memcg bitmap of memcg-aware shrinkers, and set a
    shrinker-related bit after the first element is added to the lru list
    (also when a removed child memcg's elements are reparented).

    3) Split memcg-aware and !memcg-aware shrinkers, and call a shrinker
    only if its bit is set in the memcg's shrinker bitmap (there is also
    functionality to clear the bit after the last element is shrunk); see
    the sketch after this list.
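
    A simplified sketch of the bitmap walk from point 3. Locking, the
    SHRINK_EMPTY re-check and other details of the real shrink_slab_memcg()
    are omitted, and shrinker_idr / shrinker_nr_max are names assumed from
    the series:

    static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
                                           struct mem_cgroup *memcg, int priority)
    {
            struct memcg_shrinker_map *map;
            unsigned long freed = 0;
            int i;

            map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
            if (unlikely(!map))
                    return 0;

            /* Visit only shrinkers that may have charged objects. */
            for_each_set_bit(i, map->map, shrinker_nr_max) {
                    struct shrink_control sc = {
                            .gfp_mask = gfp_mask,
                            .nid = nid,
                            .memcg = memcg,
                    };
                    struct shrinker *shrinker = idr_find(&shrinker_idr, i);
                    unsigned long ret;

                    if (!shrinker)
                            continue;

                    ret = do_shrink_slab(&sc, shrinker, priority);
                    if (ret == SHRINK_EMPTY)
                            clear_bit(i, map->map); /* nothing left to shrink */
                    else
                            freed += ret;
            }
            return freed;
    }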

    This gives significant performance increase. The result after patchset
    is applied:

    $time echo 3 > /proc/sys/vm/drop_caches

    0.00user 1.10system 0:01.10elapsed 99%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU

    The results show a performance increase of at least 548 times.

    So, the patchset reduces the complexity of shrink_slab() and improves
    performance for the kind of load described above. It will also help the
    !global (per-memcg) reclaim case, since there will be fewer
    do_shrink_slab() calls there as well.

    This patch (of 17):

    These two pairs of blocks of code are under the same #ifdef #else
    #endif.

    Link: http://lkml.kernel.org/r/153063052519.1818.9393587113056959488.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Cc: Philippe Ombredanne
    Cc: Sahitya Tummala
    Cc: Greg Kroah-Hartman
    Cc: Stephen Rothwell
    Cc: Roman Gushchin
    Cc: Matthias Kaehlcke
    Cc: Tetsuo Handa
    Cc: Chris Wilson
    Cc: Waiman Long
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Mel Gorman
    Cc: Josef Bacik
    Cc: Guenter Roeck
    Cc: Matthew Wilcox
    Cc: Li RongQing
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • __list_lru_count_one() has a single callsite.

    Acked-by: Vladimir Davydov
    Cc: Sebastian Andrzej Siewior
    Cc: Kirill Tkhai
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

06 Apr, 2018

1 commit

  • When reclaiming the slab of a memcg, shrink_slab iterates over all
    registered shrinkers in the system and tries to count and consume
    objects related to the cgroup. Under memory pressure this behaves badly:
    I observe high system time, much of it spent in list_lru_count_one(),
    for many processes on a RHEL7 kernel.

    This patch makes list_lru_node::memcg_lrus RCU protected, which allows
    us to skip taking the spinlock in list_lru_count_one().
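
    A sketch of the lookup once memcg_lrus is RCU protected; the
    rcu_dereference_check()/lockdep annotation is an assumption about how
    the protection is expressed:

    static inline struct list_lru_one *
    list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
    {
            struct list_lru_memcg *memcg_lrus;

            /* Either nlru->lock or RCU protects the array from being
             * reallocated by memcg_update_list_lru_node(). */
            memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
                                               lockdep_is_held(&nlru->lock));
            if (memcg_lrus && idx >= 0)
                    return memcg_lrus->lru[idx];

            return &nlru->lru;
    }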

    With the patch, Shakeel Butt observes a significant change in the perf
    profile. He says:

    ========================================================================
    Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
    VM and recording the trace with 'perf record -g -a'.

    The trace without the patch:

    + 34.19% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    + 30.77% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    + 3.53% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    + 2.26% fb.sh [kernel.kallsyms] [k] super_cache_count
    + 1.68% fb.sh [kernel.kallsyms] [k] shrink_slab
    + 0.59% fb.sh [kernel.kallsyms] [k] down_read_trylock
    + 0.48% fb.sh [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
    + 0.38% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    + 0.32% fb.sh [kernel.kallsyms] [k] queue_work_on
    + 0.26% fb.sh [kernel.kallsyms] [k] count_shadow_nodes

    With the patch:

    + 0.16% swapper [kernel.kallsyms] [k] default_idle
    + 0.13% oom_reaper [kernel.kallsyms] [k] mutex_spin_on_owner
    + 0.05% perf [kernel.kallsyms] [k] copy_user_generic_string
    + 0.05% init.real [kernel.kallsyms] [k] wait_consider_task
    + 0.05% kworker/0:0 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/2:1 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/3:1 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/1:0 [kernel.kallsyms] [k] finish_task_switch
    + 0.03% binary [kernel.kallsyms] [k] copy_page
    ========================================================================

    Thanks Shakeel for the testing.

    [ktkhai@virtuozzo.com: v2]
    Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Tested-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Cc: Andrey Ryabinin
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

16 Nov, 2017

1 commit


04 Oct, 2017

1 commit

  • For quick per-memcg indexing, slab caches and list_lru structures
    maintain linear arrays of descriptors. As the number of concurrent
    memory cgroups in the system goes up, this requires large contiguous
    allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
    existing slab cache and list_lru, which can easily fail on loaded
    systems. E.g.:

    mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
    CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Call Trace:
    ? __alloc_pages_direct_compact+0x4c/0x110
    __alloc_pages_nodemask+0xf50/0x1430
    alloc_pages_current+0x60/0xc0
    kmalloc_order_trace+0x29/0x1b0
    __kmalloc+0x1f4/0x320
    memcg_update_all_list_lrus+0xca/0x2e0
    mem_cgroup_css_alloc+0x612/0x670
    cgroup_apply_control_enable+0x19e/0x360
    cgroup_mkdir+0x322/0x490
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xd0/0x120
    SyS_mkdirat+0x6c/0xe0
    SyS_mkdir+0x14/0x20
    entry_SYSCALL_64_fastpath+0x18/0xad
    Mem-Info:
    active_anon:2965 inactive_anon:19 isolated_anon:0
    active_file:100270 inactive_file:98846 isolated_file:0
    unevictable:0 dirty:0 writeback:0 unstable:0
    slab_reclaimable:7328 slab_unreclaimable:16402
    mapped:771 shmem:52 pagetables:278 bounce:0
    free:13718 free_pcp:0 free_cma:0

    This output is from an artificial reproducer, but we have repeatedly
    observed order-7 failures in production in the Facebook fleet. These
    systems become useless as they cannot run more jobs, even though there
    is plenty of memory to allocate 128 individual pages.

    Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
    prove too large to be allocated physically contiguously.
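
    A sketch of the shape of the change in the list_lru array-resize path;
    the function is modelled on memcg_update_list_lru_node(), and the sizes
    and copy details are illustrative rather than quoted:

    static int memcg_update_list_lru_node(struct list_lru_node *nlru,
                                          int old_size, int new_size)
    {
            struct list_lru_memcg *old, *new;

            /* kvmalloc falls back to vmalloc when a large physically
             * contiguous allocation cannot be satisfied. */
            new = kvmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
            if (!new)
                    return -ENOMEM;

            old = nlru->memcg_lrus;
            memcpy(&new->lru, &old->lru, old_size * sizeof(void *));
            memset(&new->lru[old_size], 0,
                   (new_size - old_size) * sizeof(void *));

            nlru->memcg_lrus = new;
            kvfree(old);
            return 0;
    }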

    Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Josef Bacik
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Jul, 2017

1 commit

  • list_lru_count_node() iterates over all memcgs to get the total number of
    entries on the node, but it can race with memcg_drain_all_list_lrus(),
    which migrates the entries from a dead cgroup to another one. As a
    result, list_lru_count_node() can return an incorrect number of entries.

    Fix this by keeping track of entries per node and simply return it in
    list_lru_count_node().
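
    A sketch of the described approach: keep a per-node counter next to the
    per-memcg lists and return it directly (field placement is assumed):

    struct list_lru_node {
            spinlock_t              lock;   /* protects all lists on the node */
            struct list_lru_one     lru;    /* global (non-memcg) list */
            struct list_lru_memcg   *memcg_lrus;
            long                    nr_items;       /* total for this node */
    } ____cacheline_aligned_in_smp;

    unsigned long list_lru_count_node(struct list_lru *lru, int nid)
    {
            struct list_lru_node *nlru = &lru->node[nid];

            /* nr_items is updated under nlru->lock by add/del/drain, so
             * there is no per-memcg iteration to race with draining. */
            return nlru->nr_items;
    }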

    Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org
    Signed-off-by: Sahitya Tummala
    Acked-by: Vladimir Davydov
    Cc: Jan Kara
    Cc: Alexander Polakov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sahitya Tummala
     

28 Oct, 2016

1 commit

  • As described in https://bugzilla.kernel.org/show_bug.cgi?id=177821:

    After some analysis it seems to be that the problem is in alloc_super().
    In case list_lru_init_memcg() fails it goes into destroy_super(), which
    calls list_lru_destroy().

    And in list_lru_init() we see that in case memcg_init_list_lru() fails,
    lru->node is freed but not set to NULL, which leads list_lru_destroy()
    to believe the lru is initialized and to call memcg_destroy_list_lru().
    memcg_destroy_list_lru() in turn can access lru->node[i].memcg_lrus,
    which is NULL.
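
    A sketch of the kind of fix this implies in the __list_lru_init() error
    path (the exact hunk may differ):

    err = memcg_init_list_lru(lru, memcg_aware);
    if (err) {
            kfree(lru->node);
            /* Ensure a later list_lru_destroy() sees the lru as
             * uninitialized instead of touching freed memory. */
            lru->node = NULL;
            goto out;
    }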

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Alexander Polakov
    Acked-by: Vladimir Davydov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Polakov
     

21 Jan, 2016

1 commit


06 Nov, 2015

2 commits

  • Before the previous patch ("memcg: unify slab and other kmem pages
    charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
    pages and pages allocated with alloc_kmem_pages - because only the latter
    kept the memcg in the page struct. Now both do, so the two cases can be
    unified. Since the function then becomes tiny, we can fold it into
    mem_cgroup_from_kmem.

    [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The functions used in the patch are in the slowpath, which gets called
    whenever alloc_super is called during mounts.

    Though this should not make a difference for architectures with
    sequential numa node ids, for powerpc, which can have sparse node ids
    (e.g. a 4-node system with numa ids 0, 1, 16, 17 is common), this patch
    saves some unnecessary allocations for non-existent numa nodes.

    Even without that saving, the patch perhaps makes the code more readable.
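
    A sketch of the pattern change this implies; init_one_node() is a
    placeholder for the per-node setup, and whether the patch touched
    exactly these loops is not shown here:

    /* Before: walks every possible node id, including holes in a sparse
     * numbering such as 0, 1, 16, 17. */
    for (i = 0; i < nr_node_ids; i++)
            if (!init_one_node(lru, i))
                    goto fail;

    /* After: only nodes that actually exist are visited. */
    for_each_node(i)
            if (!init_one_node(lru, i))
                    goto fail;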

    [vdavydov@parallels.com: take memcg_aware check outside for_each loop]
    Signed-off-by: Raghavendra K T
    Reviewed-by: Vladimir Davydov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Nishanth Aravamudan
    Cc: Greg Kurz
    Cc: Grant Likely
    Cc: Nikunj A Dadhania
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra K T
     

09 Sep, 2015

1 commit


13 Feb, 2015

5 commits

  • Now, the only reason to keep kmemcg_id till css free is list_lru, which
    uses it to distribute elements between per-memcg lists. However, it can
    be easily sorted out - we only need to change kmemcg_id of an offline
    cgroup to its parent's id, making further list_lru_add()'s add elements to
    the parent's list, and then move all elements from the offline cgroup's
    list to the one of its parent. It will work, because a racing
    list_lru_del() does not need to know the list it is deleting the element
    from. It can decrement the wrong nr_items counter, but the ongoing
    reparenting will fix it. After list_lru reparenting is done we are free
    to release kmemcg_id saving a valuable slot in a per-memcg array for new
    cgroups.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, the isolate callback passed to the list_lru_walk family of
    functions is supposed to just delete an item from the list upon returning
    LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
    __list_lru_walk_one after the callback returns. Since the callback is
    allowed to drop the lock after removing an item (it has to return
    LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
    of elements on the list even if we check them under the lock. This makes
    it difficult to move items from one list_lru_one to another, which is
    required for per-memcg list_lru reparenting - we can't just splice the
    lists, we have to move entries one by one.

    This patch therefore introduces helpers that must be used by callback
    functions to isolate items instead of raw list_del/list_move. These are
    list_lru_isolate and list_lru_isolate_move. They not only remove the
    entry from the list, but also fix the nr_items counter, making sure
    nr_items always reflects the actual number of elements on the list if
    checked under the appropriate lock.
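
    The helpers are small; a sketch of what they plausibly look like:

    /* Remove an item on behalf of an isolate callback, keeping nr_items
     * in sync with the list under the appropriate lock. */
    void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
    {
            list_del_init(item);
            list->nr_items--;
    }

    void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
                               struct list_head *head)
    {
            list_move(item, head);
            list->nr_items--;
    }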

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There are several FS shrinkers, including super_block::s_shrink, that
    keep reclaimable objects in the list_lru structure. Hence to turn them
    to memcg-aware shrinkers, it is enough to make list_lru per-memcg.

    This patch does the trick. It adds an array of lru lists to the
    list_lru_node structure (per-node part of the list_lru), one for each
    kmem-active memcg, and dispatches every item addition or removal to the
    list corresponding to the memcg which the item is accounted to. So now
    the list_lru structure is not just per node, but per node and per memcg.
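
    A sketch of the resulting data layout, written from the description
    above; exact field names and config guards in the original patch may
    differ:

    struct list_lru_one {
            struct list_head        list;
            long                    nr_items;
    };

    struct list_lru_memcg {
            /* One list per kmem-active memcg, indexed by memcg_cache_id. */
            struct list_lru_one     *lru[0];
    };

    struct list_lru_node {
            spinlock_t              lock;   /* protects all lists on the node */
            struct list_lru_one     lru;    /* global (root) list */
    #ifdef CONFIG_MEMCG_KMEM
            struct list_lru_memcg   *memcg_lrus;
    #endif
    };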

    Not all list_lrus need this feature, so this patch also adds a new
    method, list_lru_init_memcg, which initializes a list_lru as memcg
    aware. Otherwise (i.e. if initialized with old list_lru_init), the
    list_lru won't have per memcg lists.

    Just like per memcg caches arrays, the arrays of per-memcg lists are
    indexed by memcg_cache_id, so we must grow them whenever
    memcg_nr_cache_ids is increased. So we introduce a callback,
    memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id
    space is full.

    The locking is implemented in a manner similar to lruvecs, i.e. we have
    one lock per node that protects all lists (both global and per cgroup) on
    the node.

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • To make list_lru memcg aware, we need all list_lrus to be kept on a list
    protected by a mutex, so that we could sleep while walking over the
    list.

    Therefore, after this change list_lru_destroy may sleep. Fortunately,
    there is only one user that calls it from an atomic context - put_super -
    and we can easily fix it by calling list_lru_destroy before put_super in
    destroy_locked_super; we no longer need the lrus by that time anyway.

    Another point that should be noted is that list_lru_destroy is allowed
    to be called on an uninitialized zeroed-out object, in which case it is
    a no-op. Before this patch this was guaranteed by kfree, but now we
    need an explicit check there.
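
    A sketch of list_lru_destroy() with the global registration and the
    explicit no-op check described above; list_lrus_mutex and the exact
    teardown order are assumptions:

    void list_lru_destroy(struct list_lru *lru)
    {
            /* No-op on an uninitialized (zeroed) or already destroyed lru. */
            if (!lru->node)
                    return;

            mutex_lock(&list_lrus_mutex);   /* may sleep */
            list_del(&lru->list);
            mutex_unlock(&list_lrus_mutex);

            memcg_destroy_list_lru(lru);
            kfree(lru->node);
            lru->node = NULL;
    }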

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The active_nodes mask allows us to skip empty nodes when walking over
    list_lru items from all nodes in list_lru_count/walk. However, these
    functions are never called from hot paths, so it doesn't seem we need
    such kind of optimization there. OTOH, removing the mask will make it
    easier to make list_lru per-memcg.

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

04 Apr, 2014

1 commit

  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list:

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

31 Oct, 2013

1 commit

  • I've seen a fair number of issues with kswapd and other processes
    appearing to get stuck in v3.12-rc. Using sysrq-p many times seems to
    indicate that it gets stuck somewhere in list_lru_walk_node(), called
    from prune_icache_sb() and super_cache_scan().

    I never seem to be able to trigger a calltrace for functions above that
    point.

    So I decided to add the following to super_cache_scan():

    @@ -81,10 +81,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
    inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
    dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
    total_objects = dentries + inodes + fs_objects + 1;
    +printk("%s:%u: %s: dentries %lu inodes %lu total %lu\n", current->comm, current->pid, __func__, dentries, inodes, total_objects);

    /* proportion the scan between the caches */
    dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
    inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
    +printk("%s:%u: %s: dentries %lu inodes %lu\n", current->comm, current->pid, __func__, dentries, inodes);
    +BUG_ON(dentries == 0);
    +BUG_ON(inodes == 0);

    /*
    * prune the dcache first as the icache is pinned by it, then
    @@ -99,7 +103,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
    freed += sb->s_op->free_cached_objects(sb, fs_objects,
    sc->nid);
    }
    -
    +printk("%s:%u: %s: dentries %lu inodes %lu freed %lu\n", current->comm, current->pid, __func__, dentries, inodes, freed);
    drop_super(sb);
    return freed;
    }

    and shortly thereafter, having applied some pressure, I got this:

    update-apt-xapi:1616: super_cache_scan: dentries 25632 inodes 2 total 25635
    update-apt-xapi:1616: super_cache_scan: dentries 1023 inodes 0
    ------------[ cut here ]------------
    Kernel BUG at c0101994 [verbose debug info unavailable]
    Internal error: Oops - BUG: 0 [#3] SMP ARM
    Modules linked in: fuse rfcomm bnep bluetooth hid_cypress
    CPU: 0 PID: 1616 Comm: update-apt-xapi Tainted: G D 3.12.0-rc7+ #154
    task: daea1200 ti: c3bf8000 task.ti: c3bf8000
    PC is at super_cache_scan+0x1c0/0x278
    LR is at trace_hardirqs_on+0x14/0x18
    Process update-apt-xapi (pid: 1616, stack limit = 0xc3bf8240)
    ...
    Backtrace:
    (super_cache_scan) from [] (shrink_slab+0x254/0x4c8)
    (shrink_slab) from [] (try_to_free_pages+0x3a0/0x5e0)
    (try_to_free_pages) from [] (__alloc_pages_nodemask+0x5)
    (__alloc_pages_nodemask) from [] (__pte_alloc+0x2c/0x13)
    (__pte_alloc) from [] (handle_mm_fault+0x84c/0x914)
    (handle_mm_fault) from [] (do_page_fault+0x1f0/0x3bc)
    (do_page_fault) from [] (do_translation_fault+0xac/0xb8)
    (do_translation_fault) from [] (do_DataAbort+0x38/0xa0)
    (do_DataAbort) from [] (__dabt_usr+0x38/0x40)

    Notice that we had a very low number of inodes, which was reduced to
    zero by mult_frac().

    Now, prune_icache_sb() calls list_lru_walk_node() passing that number of
    inodes (0) into that as the number of objects to scan:

    long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
                         int nid)
    {
            LIST_HEAD(freeable);
            long freed;

            freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
                                       &freeable, &nr_to_scan);

    which does:

    unsigned long
    list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
                       void *cb_arg, unsigned long *nr_to_walk)
    {

            struct list_lru_node *nlru = &lru->node[nid];
            struct list_head *item, *n;
            unsigned long isolated = 0;

            spin_lock(&nlru->lock);
    restart:
            list_for_each_safe(item, n, &nlru->list) {
                    enum lru_status ret;

                    /*
                     * decrement nr_to_walk first so that we don't livelock if we
                     * get stuck on large numbesr of LRU_RETRY items
                     */
                    if (--(*nr_to_walk) == 0)
                            break;

    So, if *nr_to_walk was zero when this function was entered, that means
    we're wanting to operate on (~0UL)+1 objects - which might as well be
    infinite.

    Clearly this is not correct behaviour. If we think about the behaviour
    of this function when *nr_to_walk is 1, then clearly it's wrong - we
    decrement first and then test for zero - which results in us doing
    nothing at all. A post-decrement would give the desired behaviour -
    we'd try to walk one object and one object only if *nr_to_walk were one.

    It also gives the correct behaviour for zero - we exit at this point.
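
    A sketch of the check-before-decrement form this implies (the mainline
    fix, per the bracketed note in the sign-off area, also makes sure the
    count never underflows):

    /*
     * Check before decrementing: walk exactly *nr_to_walk items and do
     * nothing at all when the caller asks for zero, instead of wrapping
     * around to (~0UL) + 1.
     */
    if (!*nr_to_walk)
            break;
    --*nr_to_walk;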

    Fixes: 5cedf721a7cd ("list_lru: fix broken LRU_RETRY behaviour")
    Signed-off-by: Russell King
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andrew Morton
    [ Modified to make sure we never underflow the count: this function gets
    called in a loop, so the 0 -> ~0ul transition is dangerous - Linus ]
    Signed-off-by: Linus Torvalds

    Russell King
     

11 Sep, 2013

1 commit

  • We currently use a compile-time constant to size the node array for the
    list_lru structure. Due to this, we don't need to allocate any memory at
    initialization time. But as a consequence, the structures that contain
    embedded list_lru lists can become way too big (the superblock for
    instance contains two of them).

    This patch aims at ameliorating this situation by dynamically allocating
    the node arrays with the firmware provided nr_node_ids.
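
    A sketch of the dynamic initialization this describes; error handling
    and the later memcg hooks are omitted:

    int list_lru_init(struct list_lru *lru)
    {
            int i;
            size_t size = sizeof(*lru->node) * nr_node_ids;

            /* Previously a fixed MAX_NUMNODES-sized array embedded in
             * struct list_lru; now sized by what the platform reports. */
            lru->node = kzalloc(size, GFP_KERNEL);
            if (!lru->node)
                    return -ENOMEM;

            for (i = 0; i < nr_node_ids; i++) {
                    spin_lock_init(&lru->node[i].lock);
                    INIT_LIST_HEAD(&lru->node[i].list);
                    lru->node[i].nr_items = 0;
            }
            return 0;
    }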

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa