03 Aug, 2016

1 commit

  • We've had a report about soft lockups caused by lock bouncing in the
    soft reclaim path:

    BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
    RIP: 0010:[] [] _raw_spin_lock+0x18/0x20
    Call Trace:
    mem_cgroup_soft_limit_reclaim+0x25a/0x280
    shrink_zones+0xed/0x200
    do_try_to_free_pages+0x74/0x320
    try_to_free_pages+0x112/0x180
    __alloc_pages_slowpath+0x3ff/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_wp_page+0x19f/0x840
    handle_pte_fault+0x1cd/0x230
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30

    There are no memcgs created so there cannot be any in the soft limit
    excess obviously:

    [...]
    memory 0 1 1

    so all this just seems to be mem_cgroup_largest_soft_limit_node trying
    to get spin_lock_irq(&mctz->lock) just to find out that the soft limit
    excess tree is empty. This is a pointless waste of cycles and cache
    line bouncing during heavy parallel reclaim on large machines. The
    particular machine wasn't very healthy and was most probably suffering
    from a memory leak which caused the memory reclaim to thrash heavily.
    But bouncing on the lock certainly didn't help...

    Fix this by an optimistic lockless check and bail out early if the tree
    is empty (sketched at the end of this entry). This is theoretically
    racy but that shouldn't matter much: soft limit is a best-effort
    feature, it is slowly getting deprecated, and its usage should be rare.
    Bouncing on a lock without a good reason is surely a much bigger
    problem, especially on large CPU machines.

    Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
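
    A minimal sketch of the check, assuming the 4.7-era
    mem_cgroup_soft_limit_reclaim() entry point; placement and naming are
    approximate rather than the exact hunk:

        /*
         * Peek at the rb-tree without taking mctz->lock and bail out of
         * soft limit reclaim early when nobody is in excess.  The race is
         * benign because soft limit reclaim is best effort anyway.
         */
        mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
        if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
                return 0;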
     

29 Jul, 2016

8 commits

  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and move the accounting into account_kernel_stack() (see
    the sketch at the end of this entry).

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
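
    Conceptually the accounting ends up looking something like the sketch
    below (simplified; the cgroup2 memory.stat side is omitted and details
    may differ from the actual patch):

        static void account_kernel_stack(unsigned long *stack, int account)
        {
                struct zone *zone = page_zone(virt_to_page(stack));

                /* Account in KiB so THREAD_SIZE < PAGE_SIZE works too. */
                mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
                                    THREAD_SIZE / 1024 * account);
        }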
     
  • Minchan Kim reported hitting the following warning on a 32-bit system,
    although it can affect 64-bit systems too.

    WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
    mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
    Modules linked in:
    CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x76/0xaf
    __warn+0xea/0x110
    ? mem_cgroup_update_lru_size+0x103/0x110
    warn_slowpath_fmt+0x3b/0x40
    mem_cgroup_update_lru_size+0x103/0x110
    isolate_lru_pages.isra.61+0x2e2/0x360
    shrink_active_list+0xac/0x2a0
    ? __delay+0xe/0x10
    shrink_node_memcg+0x53c/0x7a0
    shrink_node+0xab/0x2a0
    do_try_to_free_pages+0xc6/0x390
    try_to_free_pages+0x245/0x590

    LRU list contents and counts are updated separately. Counts are updated
    before pages are added to the LRU and after pages are removed. The
    warning above is from a check in mem_cgroup_update_lru_size that
    ensures that list sizes of zero are empty.

    The problem is that node-lru needs to account for highmem pages if
    CONFIG_HIGHMEM is set. One impact of the implementation is that the
    sizes are updated in multiple passes when pages from multiple zones are
    isolated. This happens whether HIGHMEM is set or not. When multiple
    zones are isolated, it's possible for a debugging check in memcg to be
    tripped.

    This patch forces all the zone counts to be updated before the memcg
    function is called (a sketch of the idea follows at the end of this
    entry).

    Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Tested-by: Minchan Kim
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
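
    A hypothetical sketch of the idea (the helper name is illustrative and
    the real patch restructures isolate_lru_pages() differently): flush all
    per-zone counters first, then make a single memcg-visible update, so
    the debug check never sees a half-updated state.

        static void flush_isolated_counts(struct lruvec *lruvec, enum lru_list lru,
                                          const unsigned long *nr_zone_taken,
                                          unsigned long nr_taken)
        {
                int zid;

                /* per-zone vmstat updates for every zone we isolated from */
                for (zid = 0; zid < MAX_NR_ZONES; zid++)
                        if (nr_zone_taken[zid])
                                __mod_zone_page_state(
                                        &lruvec_pgdat(lruvec)->node_zones[zid],
                                        NR_ZONE_LRU_BASE + lru,
                                        -(long)nr_zone_taken[zid]);

                /* only now let memcg see the (fully consistent) total */
                mem_cgroup_update_lru_size(lruvec, lru, -(long)nr_taken);
        }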
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node (see the sketch at the
    end of this entry). Technically, all the variable names should also
    change, but people are already familiar with the meaning of "mz" even
    if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
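
    After the change the lookup keys off the node id alone; roughly (names
    approximate the post-patch code):

        static struct mem_cgroup_tree_per_node *soft_limit_tree_node(int nid)
        {
                return soft_limit_tree.rb_tree_per_node[nid];
        }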
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to keep the LRU
    counters on a per-zone basis, but that is a heavier calculation across
    multiple cache lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch, but
    note that it introduces a number of anomalies. For example, the scans
    are per-zone but use per-node counters. We also mark a node as
    congested when a zone is congested. This causes weird problems that
    are fixed later, but it makes the series easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case being
    that highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review; a sketch of the resulting helper is at
    the end of this entry. It is a mechanical change, but note that this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
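
    The mechanical part boils down to redirecting zone callers to a lock
    that now lives in pglist_data, along the lines of:

        static inline spinlock_t *zone_lru_lock(struct zone *zone)
        {
                return &zone->zone_pgdat->lru_lock;
        }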
     
  • Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
    detection") added a page->mem_cgroup lookup to the cache eviction,
    refault, and activation paths, as well as locking to the activation
    path, and the vm-scalability tests showed a regression of -23%.

    While the test in question is an artificial worst-case scenario that
    doesn't occur in real workloads - reading two sparse files in parallel
    at full CPU speed just to hammer the LRU paths - there are still some
    optimizations that can be done in those paths.

    Inline the lookup functions to eliminate calls (roughly as sketched at
    the end of this entry). Also, page->mem_cgroup doesn't need to be
    stabilized when counting an activation; we merely need to hold the RCU
    lock to prevent the memcg from being freed.

    This cuts down on overhead quite a bit:

    23047a96d7cfcfca                063f6715e77a7be5770d6081fe
    ----------------                --------------------------
         %stddev                         %change        %stddev
    21621405 +- 0%                       +11.3%    24069657 +- 2%  vm-scalability.throughput

    [linux@roeck-us.net: drop unnecessary include file]
    [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
    Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
    Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
    Reported-by: Ye Xiaolong
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
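
    The inlined lookups are roughly of the following shape (the
    WARN_ON_ONCE corresponds to the note in the tags below; treat the exact
    definitions as approximate):

        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
                return page->mem_cgroup;
        }

        /*
         * Only valid under rcu_read_lock(): the memcg may be offlined but
         * cannot be freed while the RCU read-side section is held, which
         * is all the activation counter needs.
         */
        static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
        {
                WARN_ON_ONCE(!rcu_read_lock_held());
                return READ_ONCE(page->mem_cgroup);
        }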
     
  • task_will_free_mem is rather weak. It doesn't really tell whether the
    task has a chance to drop its mm. 98748bd72200 ("oom: consider
    multi-threaded tasks in task_will_free_mem") made a first step toward
    making it more robust for multi-threaded applications, so now we know
    that the whole process is going down and will probably drop the mm.

    This patch builds on top of that for more complex scenarios where the
    mm is shared between different processes - CLONE_VM without
    CLONE_SIGHAND, or in-kernel use_mm().

    Make sure that all processes sharing the mm are killed or exiting (a
    sketch of the rule is at the end of this entry). This will allow us to
    replace try_oom_reaper with wake_oom_reaper, because task_will_free_mem
    now implies the task is reapable. Therefore all paths which bypass the
    oom killer are reapable and so they shouldn't lock up the oom killer.

    Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
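
    A hypothetical sketch of the rule (the helper is illustrative, not the
    exact function from the patch): an mm can be shared via CLONE_VM
    without CLONE_SIGHAND or via use_mm(), so every process using it must
    already be exiting or killed.

        static bool mm_users_will_free(struct task_struct *task,
                                       struct mm_struct *mm)
        {
                struct task_struct *p;
                bool ret = true;

                rcu_read_lock();
                for_each_process(p) {
                        if (!process_shares_mm(p, mm))
                                continue;
                        if (same_thread_group(task, p))
                                continue;       /* already handled above */
                        if (fatal_signal_pending(p) || (p->flags & PF_EXITING))
                                continue;
                        ret = false;
                        break;
                }
                rcu_read_unlock();
                return ret;
        }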
     

27 Jul, 2016

7 commits

  • Commit f627c2f53786 ("memcg: adjust to support new THP refcounting")
    added a compound parameter to several functions, and switched
    mem_cgroup_move_account over to it as well, but it did not update the
    comments.

    Link: http://lkml.kernel.org/r/1465368216-9393-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • When calling uncharge_list, if a page is transparent huge we don't
    need to BUG_ON about it not being transparent huge, since nobody
    should be able to see the page at this stage and it cannot be raced
    against by a THP split.

    This check became unneeded after 0a31bc97c80c ("mm: memcontrol: rewrite
    uncharge API").

    [mhocko@suse.com: changelog enhancements]
    Link: http://lkml.kernel.org/r/1465369248-13865-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • oom_scan_process_thread() does not use its totalpages argument;
    oom_badness() does.

    Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Page table pages are batched-freed in release_pages on most
    architectures. If we want to charge them to kmemcg (this is what is
    done later in this series), we need to teach mem_cgroup_uncharge_list to
    handle kmem pages.

    Link: http://lkml.kernel.org/r/18d5c09e97f80074ed25b97a7d0f32b95d875717.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • - Move the memcg_kmem_enabled check out to the caller. This reduces
    the number of function definitions, making the code easier to follow.
    At the same time it doesn't result in code bloat, because all of these
    functions are used only in one or two places.

    - Move the __GFP_ACCOUNT check to the caller as well so that one
    wouldn't have to dive deep into the memcg implementation to see which
    allocations are charged and which are not (the resulting caller-side
    pattern is sketched at the end of this entry).

    - Refresh comments.

    Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
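
    The resulting caller-side pattern looks roughly like this (modelled on
    the page allocator path; exact call sites differ):

        if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
            unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
                __free_pages(page, order);
                page = NULL;
        }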
     
  • It's part of the oom context, just like the allocation order and
    nodemask, so let's move it to oom_control instead of passing it in the
    argument list.

    Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It seems like this parameter has never been used since being introduced
    by 90254a65833b ("memcg: clean up move charge"). Not a big deal because
    I assume the function would get inlined into the caller anyway but why
    not get rid of it.

    [mhocko@suse.com: wrote changelog]
    Link: http://lkml.kernel.org/r/20160525151831.GJ20132@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1464145026-26693-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     

23 Jul, 2016

1 commit

  • The memory controller has quite a bit of state that usually outlives
    the cgroup and pins its CSS until said state disappears. At the same
    time it imposes a 16-bit limit on the CSS ID space to economically
    store IDs in the wild. Consequently, when we use cgroups to contain
    frequent but small and short-lived jobs that leave behind some page
    cache, we quickly run into the 64k limit on outstanding CSSs. Creating
    a new cgroup fails with -ENOSPC while there are only a few, or even no,
    user-visible cgroups in existence.

    Although pinning CSSs past cgroup removal is common, there are only two
    instances that actually need an ID after a cgroup is deleted: cache
    shadow entries and swapout records.

    Cache shadow entries reference the ID weakly and can deal with the CSS
    having disappeared when it's looked up later. They pose no hurdle.

    Swap-out records do need to pin the css to hierarchically attribute
    swapins after the cgroup has been deleted; however, the only pages that
    remain swapped out after offlining are tmpfs/shmem pages, and those
    references are under the user's control, so they are manageable.

    This patch introduces a private 16-bit memcg ID and switches swap and
    cache shadow entries over to using it (a rough sketch is at the end of
    this entry). This ID can then be recycled after offlining, when the
    CSS remains pinned only by objects that don't specifically need it.

    This script demonstrates the problem by faulting one cache page in a new
    cgroup and deleting it again:

    set -e
    mkdir -p pages
    for x in `seq 128000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /cgroup/foo
        echo $$ >/cgroup/foo/cgroup.procs
        echo trex >pages/$x
        echo $$ >/cgroup/cgroup.procs
        rmdir /cgroup/foo
    done

    When run on an unpatched kernel, we eventually run out of possible IDs
    even though there are no visible cgroups:

    [root@ham ~]# ./cssidstress.sh
    [...]
    65000
    mkdir: cannot create directory '/cgroup/foo': No space left on device

    After this patch, the IDs get released upon cgroup destruction and the
    cache and css objects get released once memory reclaim kicks in.

    [hannes@cmpxchg.org: init the IDR]
    Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
    Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
    Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: John Garcia
    Reviewed-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Cc: Nikolay Borisov
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
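
    A rough sketch of the scheme (simplified; reference counting of the
    private ID and the swap/shadow-entry users are omitted):

        static DEFINE_IDR(mem_cgroup_idr);

        /* at css allocation: a 16-bit private ID, independent of the CSS ID */
        memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL,
                                 1, MEM_CGROUP_ID_MAX, GFP_KERNEL);

        /* once the last swap/shadow reference is gone (at the latest on
         * offline), the ID returns to the pool even though the CSS itself
         * may stay pinned for longer */
        idr_remove(&mem_cgroup_idr, memcg->id.id);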
     

25 Jun, 2016

2 commits

  • mem_cgroup_css_alloc() was returning NULL on failure while cgroup core
    expected it to return an ERR_PTR value, leading to the following NULL
    deref after a css allocation failure. Fix it by returning
    ERR_PTR(-ENOMEM) instead (see the sketch at the end of this entry).
    I'll also update cgroup core so that it can handle NULL returns.

    mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
    CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
    ...
    Call Trace:
    dump_stack+0x68/0xa1
    warn_alloc_failed+0xd6/0x130
    __alloc_pages_nodemask+0x4c6/0xf20
    alloc_pages_current+0x66/0xe0
    alloc_kmem_pages+0x14/0x80
    kmalloc_order_trace+0x2a/0x1a0
    __kmalloc+0x291/0x310
    memcg_update_all_caches+0x6c/0x130
    mem_cgroup_css_alloc+0x590/0x610
    cgroup_apply_control_enable+0x18b/0x370
    cgroup_mkdir+0x1de/0x2e0
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xb9/0x150
    SyS_mkdir+0x66/0xd0
    do_syscall_64+0x53/0x120
    entry_SYSCALL64_slow_path+0x25/0x25
    ...
    BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
    IP: init_and_link_css+0x37/0x220
    PGD 34b1e067 PUD 3a109067 PMD 0
    Oops: 0002 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
    task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000
    RIP: 0010:[] [] init_and_link_css+0x37/0x220
    RSP: 0018:ffff8800666d7d90 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008
    RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400
    R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010
    FS: 00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    cgroup_apply_control_enable+0x1ac/0x370
    cgroup_mkdir+0x1de/0x2e0
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xb9/0x150
    SyS_mkdir+0x66/0xd0
    do_syscall_64+0x53/0x120
    entry_SYSCALL64_slow_path+0x25/0x25
    Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
    RIP init_and_link_css+0x37/0x220
    RSP
    CR2: 00000000000000d0
    ---[ end trace a2d8836ae1e852d1 ]---

    Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.org
    Signed-off-by: Tejun Heo
    Reported-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
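
    The fix itself is a one-liner in the failure path of
    mem_cgroup_css_alloc(); roughly (label and cleanup shown for context,
    details assumed):

        fail:
                mem_cgroup_free(memcg);
                return ERR_PTR(-ENOMEM);        /* was: return NULL */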
     
  • mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
    with irqs disabled from migrate_page_copy(). This ends up enabling
    irqs while holding an irq-context lock, triggering the following
    lockdep warning. Fix it by using irq_save/restore instead (sketched at
    the end of this entry).

    =================================
    [ INFO: inconsistent lock state ]
    4.7.0-rc1+ #52 Tainted: G W
    ---------------------------------
    inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
    kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [] aio_migratepage+0x156/0x1e8
    {IN-SOFTIRQ-W} state was registered at:
    __lock_acquire+0x5b6/0x1930
    lock_acquire+0xee/0x270
    _raw_spin_lock_irqsave+0x66/0xb0
    aio_complete+0x98/0x328
    dio_complete+0xe4/0x1e0
    blk_update_request+0xd4/0x450
    scsi_end_request+0x48/0x1c8
    scsi_io_completion+0x272/0x698
    blk_done_softirq+0xca/0xe8
    __do_softirq+0xc8/0x518
    irq_exit+0xee/0x110
    do_IRQ+0x6a/0x88
    io_int_handler+0x11a/0x25c
    __mutex_unlock_slowpath+0x144/0x1d8
    __mutex_unlock_slowpath+0x140/0x1d8
    kernfs_iop_permission+0x64/0x80
    __inode_permission+0x9e/0xf0
    link_path_walk+0x6e/0x510
    path_lookupat+0xc4/0x1a8
    filename_lookup+0x9c/0x160
    user_path_at_empty+0x5c/0x70
    SyS_readlinkat+0x68/0x140
    system_call+0xd6/0x270
    irq event stamp: 971410
    hardirqs last enabled at (971409): migrate_page_move_mapping+0x3ea/0x588
    hardirqs last disabled at (971410): _raw_spin_lock_irqsave+0x3c/0xb0
    softirqs last enabled at (970526): __do_softirq+0x460/0x518
    softirqs last disabled at (970519): irq_exit+0xee/0x110

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&ctx->completion_lock)->rlock);

    lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

    3 locks held by kcompactd0/151:
    #0: (&(&mapping->private_lock)->rlock){+.+.-.}, at: aio_migratepage+0x42/0x1e8
    #1: (&ctx->ring_lock){+.+.+.}, at: aio_migratepage+0x5a/0x1e8
    #2: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: aio_migratepage+0x156/0x1e8

    stack backtrace:
    CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G W 4.7.0-rc1+ #52
    Call Trace:
    show_trace+0xea/0xf0
    show_stack+0x72/0xf0
    dump_stack+0x9a/0xd8
    print_usage_bug.part.27+0x2d4/0x2e8
    mark_lock+0x17e/0x758
    mark_held_locks+0xa2/0xd0
    trace_hardirqs_on_caller+0x140/0x1c0
    mem_cgroup_migrate+0x266/0x370
    aio_migratepage+0x16a/0x1e8
    move_to_new_page+0xb0/0x260
    migrate_pages+0x8f4/0x9f0
    compact_zone+0x4dc/0xdc8
    kcompactd_do_work+0x1aa/0x358
    kcompactd+0xba/0x2c8
    kthread+0x10a/0x110
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc
    INFO: lockdep is turned off.

    Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
    Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
    Fixes: 74485cf2bc85 ("mm: migrate: consolidate mem_cgroup_migrate() calls")
    Signed-off-by: Tejun Heo
    Reported-by: Christian Borntraeger
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
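
    The fix is roughly the following pattern inside mem_cgroup_migrate()
    (statistics calls abbreviated):

        unsigned long flags;

        /*
         * The caller may already have IRQs disabled (e.g. aio holding
         * ctx->completion_lock via migrate_page_copy()), so save and
         * restore the flags instead of blindly re-enabling interrupts.
         */
        local_irq_save(flags);
        mem_cgroup_charge_statistics(memcg, newpage, compound, nr_pages);
        memcg_check_events(memcg, newpage);
        local_irq_restore(flags);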
     

04 Jun, 2016

1 commit

  • memcg_offline_kmem() may be called from memcg_free_kmem() after a css
    init failure. memcg_free_kmem() is a ->css_free callback which is
    called without cgroup_mutex, and memcg_offline_kmem() ends up using
    css_for_each_descendant_pre() without any locking. Fix it by adding
    RCU read locking around the iteration (sketched at the end of this
    entry).

    mkdir: cannot create directory `65530': No space left on device
    ===============================
    [ INFO: suspicious RCU usage. ]
    4.6.0-work+ #321 Not tainted
    -------------------------------
    kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
    [ 527.243970] other info that might help us debug this:
    [ 527.244715]
    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by kworker/0:5/1664:
    #0: ("cgroup_destroy"){.+.+..}, at: [] process_one_work+0x165/0x4a0
    #1: ((&css->destroy_work)#3){+.+...}, at: [] process_one_work+0x165/0x4a0
    [ 527.248098] stack backtrace:
    CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    Workqueue: cgroup_destroy css_free_work_fn
    Call Trace:
    dump_stack+0x68/0xa1
    lockdep_rcu_suspicious+0xd7/0x110
    css_next_descendant_pre+0x7d/0xb0
    memcg_offline_kmem.part.44+0x4a/0xc0
    mem_cgroup_css_free+0x1ec/0x200
    css_free_work_fn+0x49/0x5e0
    process_one_work+0x1c5/0x4a0
    worker_thread+0x49/0x490
    kthread+0xea/0x100
    ret_from_fork+0x1f/0x40

    Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
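
    Sketch of the fix inside memcg_offline_kmem() (loop body elided): the
    descendant iteration requires cgroup_mutex or an RCU read lock, and
    the ->css_free path holds neither.

        rcu_read_lock();
        css_for_each_descendant_pre(css, &memcg->css) {
                /* ... per-child kmem teardown via mem_cgroup_from_css(css) ... */
        }
        rcu_read_unlock();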
     

27 May, 2016

1 commit

  • mem_cgroup_out_of_memory() returns "true" if it finds a TIF_MEMDIE
    task after an eligible task has been found, and "false" if it finds a
    TIF_MEMDIE task before an eligible task is found.

    This difference confuses memory_max_write(), which checks the return
    value of mem_cgroup_out_of_memory(). Since memory_max_write() wants to
    continue looping, mem_cgroup_out_of_memory() should return "true" in
    this case.

    This patch sets a dummy pointer in order to return "true".

    Link: http://lkml.kernel.org/r/1463753327-5170-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

24 May, 2016

1 commit

  • mem_cgroup_oom may be invoked multiple times while a process is handling
    a page fault, in which case current->memcg_in_oom will be overwritten
    leaking the previously taken css reference.

    Link: http://lkml.kernel.org/r/1464019330-7579-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

21 May, 2016

1 commit

  • Commit f61c42a7d911 ("memcg: remove tasks/children test from
    mem_cgroup_force_empty()") removed memory reparenting from the function.

    Fix the function's comment.

    Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com
    Signed-off-by: Greg Thelen
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

20 May, 2016

5 commits

  • If the current task is already killed or PF_EXITING, or a selected
    task is PF_EXITING, then the oom killer is suppressed and so is the
    oom reaper. This patch adds try_oom_reaper, which checks the given
    task and queues it for the oom reaper if that is safe to do, meaning
    that the task doesn't share the mm with an alive process.

    This might help to release the memory pressure while the task tries to
    exit.

    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Michal Hocko
    Cc: Raushaniya Maksudova
    Cc: Michael S. Tsirkin
    Cc: Paul E. McKenney
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Daniel Vetter
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
    reclaim was removed) that lru_size can be updated by -nr_taken once per
    call to isolate_lru_pages(), instead of page by page.

    Update it inside isolate_lru_pages(), or at its two callsites? I chose
    to update it at the callsites, rearranging and grouping the updates by
    nr_taken and nr_scanned together in both.

    With one exception, mem_cgroup_update_lru_size(,lru,) is then used
    where __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be
    adding some more calls in a future commit. Make the code a little
    smaller and simpler by incorporating the stat update into the lru_size
    update.

    The exception was move_active_pages_to_lru(), which aggregated the
    pgmoved stat update separately from the individual lru_size updates;
    but I still think this is a simplification worth making.

    However, __mod_zone_page_state is not peculiar to mem_cgroups, so
    better to use the name update_lru_size, which calls
    mem_cgroup_update_lru_size when CONFIG_MEMCG is enabled (a sketch of
    the wrapper is at the end of this entry).

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
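
    A sketch of the wrapper described above (pre-node-lru era, so the zone
    vmstat update is what gets folded in; the exact in-tree form may
    differ):

        static __always_inline void update_lru_size(struct lruvec *lruvec,
                                                    enum lru_list lru,
                                                    int nr_pages)
        {
        #ifdef CONFIG_MEMCG
                /* also folds in the __mod_zone_page_state() update */
                mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
        #else
                __mod_zone_page_state(lruvec_zone(lruvec),
                                      NR_LRU_BASE + lru, nr_pages);
        #endif
        }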
     
  • Though debug kernels have a VM_BUG_ON to help protect from misaccounting
    lru_size, non-debug kernels are liable to wrap it around: and then the
    vast unsigned long size draws page reclaim into a loop of repeatedly
    doing nothing on an empty list, without even a cond_resched().

    That soft lockup looks confusingly like an over-busy reclaim scenario,
    with lots of contention on the lru_lock in shrink_inactive_list(): yet
    has a totally different origin.

    Help differentiate with a custom warning in
    mem_cgroup_update_lru_size(), even in non-debug kernels; and reset the
    size to avoid the lockup. But the particular bug which suggested this
    change was mine alone, and since fixed.

    Make it a WARN_ONCE: the first occurrence is the most informative, a
    flurry may follow, yet even when rate-limited little more is learnt.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • > The comment seems to have not much to do with the code?

    I guess the comment tries to say that the code path is triggered when
    we charge the page, which happens _before_ it is added to the LRU list,
    and so last_scanned_node might contain stale data.

    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lots of code does

        node = next_node(node, XXX);
        if (node == MAX_NUMNODES)
                node = first_node(XXX);

    so create next_node_in() to do this and use it in various places (a
    sketch of the helper is at the end of this entry).

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
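
    A sketch of the helper (the in-tree version is a next_node_in() macro
    wrapping an out-of-line __next_node_in(); this just captures the
    semantics):

        int __next_node_in(int node, const nodemask_t *srcp)
        {
                int ret = __next_node(node, srcp);

                if (ret == MAX_NUMNODES)
                        ret = __first_node(srcp);
                return ret;
        }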
     

26 Apr, 2016

1 commit

  • Hello,

    So, this ended up a lot simpler than I originally expected. I tested
    it lightly and it seems to work fine. Petr, can you please test these
    two patches w/o the lru drain drop patch and see whether the problem
    is gone?

    Thanks.
    ------ 8< ------
    If charge moving is used, memcg performs relabeling of the affected
    pages from its ->attach callback, which is called under
    cgroup_threadgroup_rwsem and thus can't create new kthreads. This is
    fragile, as various operations may depend on workqueues making forward
    progress, which relies on the ability to create new kthreads.

    There's no reason to perform charge moving from ->attach which is deep
    in the task migration path. Move it to ->post_attach which is called
    after the actual migration is finished and cgroup_threadgroup_rwsem is
    dropped.

    * move_charge_struct->mm is added and ->can_attach is now responsible
    for pinning and recording the target mm. mem_cgroup_clear_mc() is
    updated accordingly. This also simplifies mem_cgroup_move_task().

    * mem_cgroup_move_task() is now called from ->post_attach instead of
    ->attach.

    Signed-off-by: Tejun Heo
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Debugged-and-tested-by: Petr Mladek
    Reported-by: Cyril Hrubis
    Reported-by: Johannes Weiner
    Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
    Cc: # 4.4+

    Tejun Heo
     

18 Mar, 2016

8 commits

  • mem_cgroup_print_oom_info is always called under oom_lock, so
    oom_info_lock is redundant.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • uncharge_list() does an unusual list walk because the function can take
    regular lists with dedicated list_heads as well as singleton lists where
    a single page is passed via the page->lru list node.

    This can sometimes lead to confusion as well as suggestions to replace
    the loop with a list_for_each_entry(), which wouldn't work.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Setting the original memory.limit_in_bytes hardlimit is subject to a
    race condition when the desired value is below the current usage. The
    code tries a few times to first reclaim and then see if the usage has
    dropped to where we would like it to be, but there is no locking, and
    the workload is free to continue making new charges up to the old limit.
    Thus, attempting to shrink a workload relies on pure luck and hope that
    the workload happens to cooperate.

    To fix this in the cgroup2 memory.max knob, do it the other way round:
    set the limit first, then try enforcement (a simplified sketch is at
    the end of this entry). And if reclaim is not able to succeed, trigger
    OOM kills in the group. Keep going until the new limit is met, we run
    out of OOM victims and there's only unreclaimable memory left, or the
    task writing to memory.max is killed. This allows users to shrink
    groups reliably, and the behavior is consistent with what happens when
    new charges are attempted in excess of memory.max.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
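
    A simplified sketch of the new write-side loop (stock draining, bounded
    reclaim retries and error handling omitted):

        xchg(&memcg->memory.limit, max);

        for (;;) {
                unsigned long nr_pages = page_counter_read(&memcg->memory);

                if (nr_pages <= max)
                        break;
                if (signal_pending(current))
                        break;
                if (try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
                                                 GFP_KERNEL, true))
                        continue;
                /* reclaim made no progress: start OOM killing in the group */
                if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
                        break;
        }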
     
  • When setting memory.high below usage, nothing happens until the next
    charge comes along, and then it will only reclaim its own charge and not
    the now potentially huge excess of the new memory.high. This can cause
    groups to stay in excess of their memory.high indefinitely.

    To fix that, when shrinking memory.high, kick off a reclaim cycle that
    goes after the delta (sketched at the end of this entry).

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
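
    A sketch of the reclaim kick (simplified from the memory.high write
    handler):

        memcg->high = high;

        nr_pages = page_counter_read(&memcg->memory);
        if (nr_pages > high)
                try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                             GFP_KERNEL, true);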
     
  • - Do not take memcg_limit_mutex for resetting limits - the cgroup
    cannot be altered from userspace anymore, so there is no need to
    protect them.

    - Use plain page_counter_limit() for resetting ->memory and ->memsw
    limits instead of the mem_cgroup_resize_* helpers - we enlarge the
    limits, so no special handling is needed.

    - Reset ->swap and ->tcpmem limits as well.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • As kmem accounting is now either enabled for all cgroups or disabled
    system-wide, there's no point in having memcg_kmem_online() helper -
    instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
    shrink_slab() now does.

    There are only two places left where this helper is used -
    __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can
    only be called if memcg_kmem_enabled() returned true. Since the cgroup
    it operates on is online, mem_cgroup_is_root() check will be enough.

    memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
    of memcg_kmem_online(), because it relies on the fact that in
    memcg_offline_kmem() memcg->kmem_state is changed before
    memcg_deactivate_kmem_caches() is called, but there we can just
    open-code the check.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.

    The actual work is done in patch 6 of the series. Patches 1 and 2
    prepare memcg/shrinker infrastructure for the change. Patch 3 is just a
    collateral cleanup. Patch 4 makes radix_tree_node accounted, which is
    necessary for making shadow node shrinker memcg aware. Patch 5 reduces
    shadow nodes overhead in case workload mostly uses anonymous pages.

    This patch:

    Currently, in the legacy hierarchy kmem accounting is off for all
    cgroups by default and must be enabled explicitly by writing something
    to memory.kmem.limit_in_bytes. Since we don't support reclaim on
    hitting kmem limit, nor do we have any plans to implement it, this is
    likely to be -1, just to enable kmem accounting and limit kernel memory
    consumption by the memory.limit_in_bytes along with user memory.

    This user API was introduced when the implementation of kmem accounting
    lacked slab shrinker support and hence was useless in practice. Things
    have changed since then - slab shrinkers were made memcg aware, the
    accounting overhead seems to be negligible, and a failure to charge a
    kmem allocation should not have critical consequences, because we only
    account those kernel objects that should be safe to fail. That's why
    kmem accounting is enabled by default for all cgroups in the default
    hierarchy, which will eventually replace the legacy one.

    The ability to enable kmem accounting for some cgroups while keeping it
    disabled for others is getting difficult to maintain. E.g. to make
    shadow node shrinker memcg aware (see mm/workingset.c), we need to know
    the relationship between the number of shadow nodes allocated for a
    cgroup and the size of its lru list. If kmem accounting is enabled for
    all cgroups there is no problem, but what should we do if kmem
    accounting is enabled only for half of the cgroups? We have no choice
    but to use global lru stats while scanning the root cgroup's shadow
    nodes, but that would be wrong if kmem accounting were enabled for all
    cgroups (which is the case if the unified hierarchy is used), in which
    case we should use the lru stats of the root cgroup's lruvec.

    That being said, let's enable kmem accounting for all memory cgroups by
    default. If one finds it unstable or too costly, it can always be
    disabled system-wide by passing cgroup.memory=nokmem to the kernel at
    boot time.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov