01 Feb, 2017

1 commit

  • commit 3674534b775354516e5c148ea48f51d4d1909a78 upstream.

    When memory.move_charge_at_immigrate is enabled and precharges are
    depleted during move, mem_cgroup_move_charge_pte_range() will attempt to
    increase the size of the precharge.

    Prevent precharges from ever looping by setting __GFP_NORETRY. This was
    probably the intention of the GFP_KERNEL & ~__GFP_NORETRY, which is
    pointless as written.
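
    As an aside, the reason the original expression is a no-op is plain bit
    arithmetic: GFP_KERNEL does not contain __GFP_NORETRY, so masking the
    bit out changes nothing. A minimal userspace sketch with invented flag
    values (the real ones live in the kernel's gfp headers):

    #include <assert.h>
    #include <stdio.h>

    #define GFP_KERNEL    0x00c0u  /* hypothetical value for illustration */
    #define __GFP_NORETRY 0x1000u  /* hypothetical value for illustration */

    int main(void)
    {
        unsigned int as_written = GFP_KERNEL & ~__GFP_NORETRY; /* still GFP_KERNEL */
        unsigned int intended   = GFP_KERNEL |  __GFP_NORETRY; /* what the fix uses */

        assert(as_written == GFP_KERNEL);
        printf("as_written=%#x intended=%#x\n", as_written, intended);
        return 0;
    }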

    Fixes: 0029e19ebf84 ("mm: memcontrol: remove explicit OOM parameter in charge path")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701130208510.69402@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

20 Jan, 2017

1 commit

  • commit b4536f0c829c8586544c94735c343f9b5070bd01 upstream.

    Nils Holland and Klaus Ethgen have reported unexpected OOM killer
    invocations on 32-bit kernels starting with 4.8:

    kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
    kworker/u4:5 cpuset=/ mems_allowed=0
    CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
    [...]
    Mem-Info:
    active_anon:58685 inactive_anon:90 isolated_anon:0
    active_file:274324 inactive_file:281962 isolated_file:0
    unevictable:0 dirty:649 writeback:0 unstable:0
    slab_reclaimable:40662 slab_unreclaimable:17754
    mapped:7382 shmem:202 pagetables:351 bounce:0
    free:206736 free_pcp:332 free_cma:0
    Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
    DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 813 3474 3474
    Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
    lowmem_reserve[]: 0 0 21292 21292
    HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

    The OOM killer invocation is clearly premature because there is still a
    lot of page cache in zone Normal which should satisfy this lowmem
    request. Further debugging has shown that reclaim cannot make any
    forward progress because the page cache is hidden in the active list,
    which doesn't get rotated because inactive_list_is_low is not memcg
    aware.

    The code simply subtracts per-zone highmem counters from the respective
    memcg's LRU sizes, which doesn't make any sense: we can end up always
    seeing the resulting active and inactive counts as 0 and returning
    false. The issue is not limited to 32-bit kernels, but in practice the
    effect on systems without CONFIG_HIGHMEM is much harder to notice
    because we do not invoke the OOM killer for allocation requests
    targeting < ZONE_NORMAL.

    Fix the issue by tracking per-zone LRU page counts in mem_cgroup_per_node
    and subtracting per-memcg highmem counts when memcg is enabled. Introduce
    the helper lruvec_zone_lru_size, which redirects to either the zone
    counters or mem_cgroup_get_zone_lru_size as appropriate.

    We lose the "empty LRU but non-zero lru_size" detection introduced by
    ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
    of the inherent zone vs. node discrepancy.
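
    A rough userspace sketch of the dispatch described above; the structures
    and names are simplified stand-ins for the kernel's lruvec and
    mem_cgroup_per_node machinery, not the actual implementation:

    #include <stdio.h>

    #define MAX_ZONES 4
    #define NR_LRU    2

    /* Simplified stand-ins for the kernel structures. */
    struct zone_counters  { unsigned long lru[MAX_ZONES][NR_LRU]; };
    struct memcg_per_node { unsigned long lru_zone_size[MAX_ZONES][NR_LRU]; };
    struct lruvec { struct memcg_per_node *memcg; struct zone_counters *zone; };

    /* Per-zone LRU size: read the memcg's own per-zone counters when the
     * lruvec belongs to a memcg, otherwise fall back to the zone counters. */
    static unsigned long lruvec_zone_lru_size(struct lruvec *lv, int lru, int zid)
    {
        if (lv->memcg)
            return lv->memcg->lru_zone_size[zid][lru];
        return lv->zone->lru[zid][lru];
    }

    int main(void)
    {
        struct zone_counters zc = { .lru = { [1] = { 100, 200 } } };
        struct memcg_per_node mpn = { .lru_zone_size = { [1] = { 10, 20 } } };
        struct lruvec global = { .memcg = NULL, .zone = &zc };
        struct lruvec cgroup = { .memcg = &mpn, .zone = &zc };

        printf("global zone1 lru0=%lu, memcg zone1 lru0=%lu\n",
               lruvec_zone_lru_size(&global, 0, 1),
               lruvec_zone_lru_size(&cgroup, 0, 1));
        return 0;
    }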

    Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
    Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Nils Holland
    Tested-by: Nils Holland
    Reported-by: Klaus Ethgen
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

28 Oct, 2016

1 commit

  • On 4.0, we saw a stack corruption from a page fault entering direct
    memory cgroup reclaim, calling into btrfs_releasepage(), which then
    tried to allocate an extent and recursed back into a kmem charge ad
    nauseam:

    [...]
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    memcg_charge_kmem+0x40/0x80
    new_slab+0x2d9/0x5a0
    __slab_alloc+0x2fd/0x44f
    kmem_cache_alloc+0x193/0x1e0
    alloc_extent_state+0x21/0xc0
    __clear_extent_bit+0x2b5/0x400
    try_release_extent_mapping+0x1a3/0x220
    __btrfs_releasepage+0x31/0x70
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    mem_cgroup_try_charge+0x65/0x1c0
    handle_mm_fault+0x117f/0x1510
    __do_page_fault+0x177/0x420
    do_page_fault+0xc/0x10
    page_fault+0x22/0x30

    On later kernels, kmem charging is opt-in rather than opt-out, and that
    particular kmem allocation in btrfs_releasepage() is no longer being
    charged and won't recurse and overrun the stack anymore.

    But it's not impossible for an accounted allocation to happen from the
    memcg direct reclaim context, and we needed to reproduce this crash many
    times before we even got a useful stack trace out of it.

    Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
    avoid recursing into any other form of direct reclaim. Then let
    recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
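
    A minimal stub-based sketch of the two-part idea (mark the reclaimer
    with PF_MEMALLOC, let charges from such contexts bypass the limit); the
    flag value and helpers are invented for illustration:

    #include <stdbool.h>
    #include <stdio.h>

    #define PF_MEMALLOC 0x1u                 /* stand-in for the real task flag */

    static unsigned int current_flags;       /* stand-in for current->flags     */

    /* Charge path: a task already in reclaim (PF_MEMALLOC) must not recurse
     * into reclaim again, so let its charge bypass the limit instead. */
    static bool try_charge(unsigned long *usage, unsigned long limit, unsigned long nr)
    {
        if (current_flags & PF_MEMALLOC) {
            *usage += nr;                    /* force the charge, no reclaim    */
            return true;
        }
        if (*usage + nr > limit)
            return false;                    /* real code would reclaim here    */
        *usage += nr;
        return true;
    }

    /* Reclaim entry point: mark the task like other direct reclaimers do. */
    static void memcg_reclaim(unsigned long *usage, unsigned long limit)
    {
        current_flags |= PF_MEMALLOC;
        /* ... shrink pages; a nested accounted allocation now bypasses ... */
        try_charge(usage, limit, 1);
        current_flags &= ~PF_MEMALLOC;
    }

    int main(void)
    {
        unsigned long usage = 100, limit = 100;

        memcg_reclaim(&usage, limit);
        printf("usage after nested charge: %lu (limit %lu)\n", usage, limit);
        return 0;
    }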

    Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Oct, 2016

5 commits

  • The cgroup core and the memory controller need to track socket ownership
    for different purposes, but the tracking sites being entirely different
    is kind of ugly.

    Be a better citizen and rename the memory controller callbacks to match
    the cgroup core callbacks, then move them to the same place.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    This patch improves the performance of swap cache operations when the
    type of the swap device is not 0. Originally, the whole swap entry
    value is used as the key of the swap cache, even though there is one
    radix tree per swap device. If the type of the swap device is not 0,
    the height of the radix tree of the swap cache is increased
    unnecessarily, especially on 64-bit architectures. For example, for a
    1GB swap device on the x86_64 architecture, the height of the radix
    tree of the swap cache is 11, but if the offset of the swap entry is
    used as the key instead, the height is only 4. The increased height
    causes unnecessary radix tree descents and an increased cache footprint.

    This patch reduces the height of the radix tree of the swap cache by
    using the offset of the swap entry instead of the whole swap entry
    value as the key. In a 32-process sequential swap-out test case on a
    Xeon E5 v3 system with a RAM disk as swap, the lock contention on the
    swap cache spinlock is reduced from 20.15% to 12.19% when the type of
    the swap device is 1.

    Use the whole swap entry as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

    Use the swap offset as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,
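
    A hedged sketch of the key change follows. The bit layout and radix
    fan-out below are assumptions for illustration (the real SWP_TYPE_SHIFT
    and RADIX_TREE_MAP_SHIFT are defined in the kernel headers), so the
    printed heights need not match the changelog's exact figures, but they
    show why keying by the full entry value inflates the key space when the
    type is not 0:

    #include <stdio.h>

    /* Illustrative layout: type in the high bits, offset in the low bits. */
    #define SWP_TYPE_SHIFT  58                /* assumed, roughly x86_64-like    */
    #define RADIX_MAP_SHIFT 6                 /* assumed radix-tree fan-out bits */

    static unsigned long swp_entry_val(unsigned long type, unsigned long offset)
    {
        return (type << SWP_TYPE_SHIFT) | offset;
    }

    static int radix_height(unsigned long max_key)
    {
        int h = 1;
        while (max_key >> (h * RADIX_MAP_SHIFT))
            h++;
        return h;
    }

    int main(void)
    {
        unsigned long max_offset = (1UL << 30) / 4096;   /* 1GB swap, 4K pages */

        printf("key = whole entry (type 1): height %d\n",
               radix_height(swp_entry_val(1, max_offset)));
        printf("key = offset only:          height %d\n",
               radix_height(max_offset));
        return 0;
    }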

    Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    mem_cgroup_count_precharge() and mem_cgroup_move_charge() both call
    walk_page_range() on the range 0 to ~0UL, and neither provides a
    pte_hole callback, which causes the current implementation to skip
    non-VMA regions. This is all fine, but follow-up changes would like to
    make walk_page_range more generic, so it is better to be explicit about
    which range to traverse. Use highest_vm_end to explicitly traverse
    only user-mmapped memory.

    [mhocko@kernel.org: rewrote changelog]
    Link: http://lkml.kernel.org/r/1472655897-22532-1-git-send-email-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Naoya Horiguchi
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     
  • Link: http://lkml.kernel.org/r/1c5ddb1c171dbdfc3262252769d6138a29b35b70.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When selecting an oom victim, we use the same heuristic for both memory
    cgroup and global oom. The only difference is the scope of tasks to
    select the victim from. So we could just export an iterator over all
    memcg tasks and keep all oom related logic in oom_kill.c, but instead we
    duplicate pieces of it in memcontrol.c reusing some initially private
    functions of oom_kill.c in order to not duplicate all of it. That looks
    ugly and error prone, because any modification of select_bad_process
    should also be propagated to mem_cgroup_out_of_memory.

    Let's rework this as follows: keep all oom heuristic related code private
    to oom_kill.c and make oom_kill.c use exported memcg functions when it's
    really necessary (like in case of iterating over memcg tasks).

    Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

20 Sep, 2016

1 commit

  • During cgroup2 rollout into production, we started encountering css
    refcount underflows and css access crashes in the memory controller.
    Splitting the heavily shared css reference counter into logical users
    narrowed the imbalance down to the cgroup2 socket memory accounting.

    The problem turns out to be the per-cpu charge cache. Cgroup1 had a
    separate socket counter, but the new cgroup2 socket accounting goes
    through the common charge path that uses a shared per-cpu cache for all
    memory that is being tracked. Those caches are safe against scheduling
    preemption, but not against interrupts - such as the newly added packet
    receive path. When cache draining is interrupted by network RX taking
    pages out of the cache, the resuming drain operation will put references
    of in-use pages, thus causing the imbalance.

    Disable IRQs during all per-cpu charge cache operations.
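
    A stub-based sketch of the fix's shape; the per-cpu stock structure and
    IRQ helpers here are stand-ins, not the kernel code:

    #include <stdio.h>

    /* No-op stand-ins for the kernel's IRQ-state helpers. */
    #define local_irq_save(flags)    ((flags) = 0)
    #define local_irq_restore(flags) ((void)(flags))

    /* Stand-in for the per-cpu charge cache ("stock"). */
    struct stock { unsigned long nr_pages; };
    static struct stock cpu_stock = { 32 };

    /*
     * Before the fix the drain ran with IRQs enabled, so the packet-receive
     * path could interrupt it half-way, take pages out of the same cache,
     * and the resumed drain would then uncharge pages already in use.
     * Disabling IRQs for the whole operation closes that window.
     */
    static void drain_stock(void)
    {
        unsigned long flags;

        local_irq_save(flags);
        if (cpu_stock.nr_pages) {
            /* return the cached charge to the page counters */
            cpu_stock.nr_pages = 0;
        }
        local_irq_restore(flags);
    }

    int main(void)
    {
        drain_stock();
        printf("stock after drain: %lu pages\n", cpu_stock.nr_pages);
        return 0;
    }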

    Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
    Link: http://lkml.kernel.org/r/20160914194846.11153-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Aug, 2016

1 commit

  • A bugfix in v4.8-rc2 introduced a harmless warning when
    CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:

    mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
    static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)

    Move the function inside the #ifdef block that hides the calling
    function, to avoid the warning.
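
    A compilable miniature of the fix, with hypothetical config macros and
    demo names; the point is simply that a static helper has to live inside
    the same #ifdef as its caller, or -Wunused-function (promoted by
    -Werror) fires when that config option is off:

    #include <stdio.h>

    #define CONFIG_MEMCG 1
    /* #define CONFIG_MEMCG_SWAP 1 */  /* disabled, as in the reported config */

    #ifdef CONFIG_MEMCG_SWAP
    /* With the helper inside this block, it only exists when its caller does. */
    static int mem_cgroup_id_get_online_demo(int id) { return id; }

    static void swapout_path_demo(void)
    {
        printf("%d\n", mem_cgroup_id_get_online_demo(1));
    }
    #endif

    int main(void)
    {
    #ifdef CONFIG_MEMCG_SWAP
        swapout_path_demo();
    #else
        puts("CONFIG_MEMCG_SWAP disabled: helper not built, no unused warning");
    #endif
        return 0;
    }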

    Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
    Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

12 Aug, 2016

2 commits

  • Since commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure
    after many small jobs") swap entries do not pin memcg->css.refcnt
    directly. Instead, they pin memcg->id.ref. So we should adjust the
    reference counters accordingly when moving swap charges between cgroups.

    Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Link: http://lkml.kernel.org/r/9ce297c64954a42dc90b543bc76106c4a94f07e8.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    An offline memory cgroup might have anonymous memory or shmem left
    charged to it and no swap. Since only swap entries pin the id of an
    offline cgroup, such a cgroup will have no id, and so an attempt to
    swap out its anon/shmem will not store memory cgroup info in the swap
    cgroup map. As a result, memcg->swap or memcg->memsw will never get
    uncharged from it or any of its ancestors.

    Fix this by always charging swapout to the first ancestor cgroup that
    hasn't released its id yet.
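
    A stub sketch of the ancestor walk described above; the structure and
    helper are invented for illustration, while the real kernel does this
    with mem_cgroup_id_get_online() and proper reference counting:

    #include <stdio.h>

    struct memcg_demo {
        const char *name;
        int id;                       /* 0 means the id has already been released */
        struct memcg_demo *parent;
    };

    /* Walk up until we find a cgroup whose id is still live; the root always
     * qualifies. Swapout is then charged to that ancestor so it can be
     * uncharged later. */
    static struct memcg_demo *id_get_online_demo(struct memcg_demo *memcg)
    {
        while (memcg->parent && memcg->id == 0)
            memcg = memcg->parent;
        return memcg;
    }

    int main(void)
    {
        struct memcg_demo root   = { "root",   1, NULL };
        struct memcg_demo parent = { "parent", 7, &root };
        struct memcg_demo dead   = { "dead",   0, &parent };  /* offline, id gone */

        printf("swapout charged to: %s\n", id_get_online_demo(&dead)->name);
        return 0;
    }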

    [hannes@cmpxchg.org: add comment to mem_cgroup_swapout]
    [vdavydov@virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()]
    Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza
    Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Link: http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

10 Aug, 2016

1 commit

  • To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
    which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
    in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
    with __GFP_ACCOUNT, including those that aren't actually charged to any
    cgroup, i.e. allocated from the root cgroup context. To avoid overhead
    in case cgroups are not used, we only do that if memcg_kmem_enabled() is
    true. The latter is set iff there are kmem-enabled memory cgroups
    (online or offline). The root cgroup is not considered kmem-enabled.

    As a result, if a page is allocated with __GFP_ACCOUNT for the root
    cgroup while kmem-enabled memory cgroups exist and is freed after all
    kmem-enabled memory cgroups have been removed, e.g.

    # no memory cgroups has been created yet, create one
    mkdir /sys/fs/cgroup/memory/test
    # run something allocating pages with __GFP_ACCOUNT, e.g.
    # a program using pipe
    dmesg | tail
    # remove the memory cgroup
    rmdir /sys/fs/cgroup/memory/test

    we'll get a bad page state bug complaining about page->_mapcount != -1:

    BUG: Bad page state in process swapper/0 pfn:1fd945c
    page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
    flags: 0x1000000000000000()

    To avoid that, let's mark with PageKmemcg only those pages that are
    actually charged to and hence pin a non-root memory cgroup.
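
    A hedged, stub-based sketch of the ordering change: set the flag only
    when a charge was actually taken, so the free path never sees the flag
    on an uncharged page. The helpers are stand-ins, not the kernel
    functions:

    #include <stdbool.h>
    #include <stdio.h>

    struct page_demo { bool kmemcg; };

    /* Stand-in: returns true only if a non-root memcg actually took a charge.
     * Here we pretend the allocation happens in the root cgroup. */
    static bool memcg_kmem_charge_demo(struct page_demo *page)
    {
        (void)page;
        return false;
    }

    /* Allocation side: set the flag only when a charge was really taken
     * (the old code set it for every __GFP_ACCOUNT allocation). */
    static void alloc_accounted_page(struct page_demo *page)
    {
        if (memcg_kmem_charge_demo(page))
            page->kmemcg = true;
    }

    /* Free side: uncharge and clear only when the flag says a charge exists. */
    static void free_accounted_page(struct page_demo *page)
    {
        if (page->kmemcg) {
            /* a memcg_kmem_uncharge() analogue would go here */
            page->kmemcg = false;
        }
    }

    int main(void)
    {
        struct page_demo page = { false };

        alloc_accounted_page(&page);
        free_accounted_page(&page);
        printf("kmemcg flag at free: %d\n", page.kmemcg);
        return 0;
    }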

    Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

03 Aug, 2016

1 commit

  • We've had a report about soft lockups caused by lock bouncing in the
    soft reclaim path:

    BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
    RIP: 0010:[] [] _raw_spin_lock+0x18/0x20
    Call Trace:
    mem_cgroup_soft_limit_reclaim+0x25a/0x280
    shrink_zones+0xed/0x200
    do_try_to_free_pages+0x74/0x320
    try_to_free_pages+0x112/0x180
    __alloc_pages_slowpath+0x3ff/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_wp_page+0x19f/0x840
    handle_pte_fault+0x1cd/0x230
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30

    There are no memcgs created so there cannot be any in the soft limit
    excess obviously:

    [...]
    memory 0 1 1

    so all this just seems to be mem_cgroup_largest_soft_limit_node trying
    to take spin_lock_irq(&mctz->lock) only to find out that the soft limit
    excess tree is empty. This is a pointless waste of cycles and cache
    line bouncing during heavy parallel reclaim on large machines. The
    particular machine wasn't very healthy and was most probably suffering
    from a memory leak, which caused the memory reclaim to thrash heavily.
    But bouncing on the lock certainly didn't help...

    Fix this with an optimistic lockless check and bail out early if the
    tree is empty. This is theoretically racy, but that shouldn't matter
    much: the soft limit is a best-effort feature, it is slowly being
    deprecated, and its usage should be scarce. Bouncing on a lock without
    a good reason is surely a much bigger problem, especially on machines
    with many CPUs.
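
    A small sketch of the optimistic check, using a plain counter and a
    pthread mutex in place of the kernel's rb-tree and spinlock; as the
    changelog says, the check is only an optimization, so a racy peek is
    acceptable:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t soft_limit_lock = PTHREAD_MUTEX_INITIALIZER;
    static int tree_entries;            /* stand-in for the soft-limit rb-tree */

    /* Racy-but-safe fast path: if the tree looks empty, skip the lock
     * entirely. A concurrent insertion is tolerable because soft-limit
     * reclaim is best effort; the next caller will see the entry. */
    static int largest_soft_limit_node_demo(void)
    {
        int found = 0;

        if (tree_entries == 0)          /* lockless peek */
            return 0;

        pthread_mutex_lock(&soft_limit_lock);
        if (tree_entries > 0)
            found = 1;                  /* real code would pick the largest node */
        pthread_mutex_unlock(&soft_limit_lock);
        return found;
    }

    int main(void)
    {
        printf("empty tree, found=%d (no lock taken)\n", largest_soft_limit_node_demo());
        tree_entries = 1;
        printf("non-empty tree, found=%d\n", largest_soft_limit_node_demo());
        return 0;
    }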

    Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

29 Jul, 2016

8 commits

    We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the
    accounting units to kilobytes and move the accounting into
    account_kernel_stack().
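
    The arithmetic behind the unit change, with assumed sizes (real values
    depend on the architecture and config): per-page accounting rounds a
    sub-page stack down to zero, while kilobyte units stay exact:

    #include <stdio.h>

    int main(void)
    {
        /* Assumed sizes for illustration, e.g. 8K stacks with 64K pages. */
        unsigned long thread_size = 8 * 1024;
        unsigned long page_size   = 64 * 1024;

        printf("stack in pages: %lu (useless when THREAD_SIZE < PAGE_SIZE)\n",
               thread_size / page_size);
        printf("stack in KB:    %lu\n", thread_size / 1024);
        return 0;
    }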

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
    Minchan Kim reported hitting the following warning on a 32-bit system,
    although it can affect 64-bit systems as well.

    WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
    mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
    Modules linked in:
    CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x76/0xaf
    __warn+0xea/0x110
    ? mem_cgroup_update_lru_size+0x103/0x110
    warn_slowpath_fmt+0x3b/0x40
    mem_cgroup_update_lru_size+0x103/0x110
    isolate_lru_pages.isra.61+0x2e2/0x360
    shrink_active_list+0xac/0x2a0
    ? __delay+0xe/0x10
    shrink_node_memcg+0x53c/0x7a0
    shrink_node+0xab/0x2a0
    do_try_to_free_pages+0xc6/0x390
    try_to_free_pages+0x245/0x590

    LRU list contents and counts are updated separately. Counts are updated
    before pages are added to the LRU and after pages are removed. The
    warning above is from a check in mem_cgroup_update_lru_size that
    ensures that a list reported as having zero size is actually empty.

    The problem is that node-lru needs to account for highmem pages if
    CONFIG_HIGHMEM is set. One impact of the implementation is that the
    sizes are updated in multiple passes when pages from multiple zones are
    isolated. This happens whether HIGHMEM is set or not. When multiple
    zones are isolated, it's possible for a debugging check in memcg to be
    tripped.

    This patch forces all the zone counts to be updated before the memcg
    function is called.

    Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Tested-by: Minchan Kim
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node. Technically, all the
    variable names should also change, but people are already familiar with
    the meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to the reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to keep the LRU
    counters on a per-zone basis, but that would be a heavier calculation
    across multiple cache lines, performed much more frequently than the
    retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
    detection") added a page->mem_cgroup lookup to the cache eviction,
    refault, and activation paths, as well as locking to the activation
    path, and the vm-scalability tests showed a regression of -23%.

    While the test in question is an artificial worst-case scenario that
    doesn't occur in real workloads - reading two sparse files in parallel
    at full CPU speed just to hammer the LRU paths - there are still some
    optimizations that can be done in those paths.

    Inline the lookup functions to eliminate calls. Also, page->mem_cgroup
    doesn't need to be stabilized when counting an activation; we merely
    need to hold the RCU lock to prevent the memcg from being freed.

    This cuts down on overhead quite a bit:

    23047a96d7cfcfca            063f6715e77a7be5770d6081fe
    ----------------            --------------------------
         %stddev                    %change        %stddev
    21621405 +- 0%              +11.3%  24069657 +- 2%    vm-scalability.throughput

    [linux@roeck-us.net: drop unnecessary include file]
    [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
    Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
    Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
    Reported-by: Ye Xiaolong
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    task_will_free_mem is rather weak. It doesn't really tell whether the
    task has a chance to drop its mm. 98748bd72200 ("oom: consider
    multi-threaded tasks in task_will_free_mem") made a first step toward
    making it more robust for multi-threaded applications, so now we know
    that the whole process is going down and will probably drop the mm.

    This patch builds on top of that for more complex scenarios where the
    mm is shared between different processes - CLONE_VM without
    CLONE_SIGHAND, or in-kernel use_mm().

    Make sure that all processes sharing the mm are killed or exiting. This
    will allow us to replace try_oom_reaper by wake_oom_reaper because
    task_will_free_mem implies the task is reapable now. Therefore all paths
    which bypass the oom killer are now reapable and so they shouldn't lock up
    the oom killer.

    Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 Jul, 2016

7 commits

    Commit f627c2f53786 ("memcg: adjust to support new THP refcounting")
    added a compound parameter to several functions and changed the
    corresponding parameter of mem_cgroup_move_account to compound, but it
    did not update the comments.

    Link: http://lkml.kernel.org/r/1465368216-9393-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
    When calling uncharge_list, if a page is transparent huge there is no
    need to BUG_ON about it not being transparent huge, since nobody should
    be able to see the page at this stage and the page cannot race with a
    THP split.

    This check became unneeded after 0a31bc97c80c ("mm: memcontrol: rewrite
    uncharge API").

    [mhocko@suse.com: changelog enhancements]
    Link: http://lkml.kernel.org/r/1465369248-13865-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
    oom_scan_process_thread() does not use its totalpages argument;
    oom_badness() does.

    Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Page table pages are batched-freed in release_pages on most
    architectures. If we want to charge them to kmemcg (this is what is
    done later in this series), we need to teach mem_cgroup_uncharge_list to
    handle kmem pages.

    Link: http://lkml.kernel.org/r/18d5c09e97f80074ed25b97a7d0f32b95d875717.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • - Hand the memcg_kmem_enabled() check out to the caller. This reduces
    the number of function definitions, making the code easier to follow.
    At the same time it doesn't result in code bloat, because all of these
    functions are used in only one or two places.

    - Move the __GFP_ACCOUNT check to the caller as well, so that one
    doesn't have to dive deep into the memcg implementation to see which
    allocations are charged and which are not (see the sketch after this
    list).

    - Refresh comments.
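
    A rough sketch of the caller-side shape after the cleanup, with demo
    names standing in for the kernel functions: the allocator performs both
    checks itself before calling into the charge helper:

    #include <stdbool.h>
    #include <stdio.h>

    #define __GFP_ACCOUNT_DEMO 0x1u      /* invented flag value for illustration */

    static bool memcg_kmem_enabled_demo(void) { return true; }

    /* The charge helper itself no longer re-checks whether accounting applies. */
    static int memcg_kmem_charge_demo(unsigned int order)
    {
        printf("charging 2^%u pages to the current memcg\n", order);
        return 0;
    }

    /* Caller (think of the page allocator): both checks are visible right
     * where the allocation happens. */
    static void alloc_pages_demo(unsigned int gfp, unsigned int order)
    {
        if (memcg_kmem_enabled_demo() && (gfp & __GFP_ACCOUNT_DEMO))
            memcg_kmem_charge_demo(order);
    }

    int main(void)
    {
        alloc_pages_demo(__GFP_ACCOUNT_DEMO, 0);  /* charged   */
        alloc_pages_demo(0, 0);                   /* uncharged */
        return 0;
    }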

    Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It's a part of oom context just like allocation order and nodemask, so
    let's move it to oom_control instead of passing it in the argument list.

    Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It seems like this parameter has never been used since being introduced
    by 90254a65833b ("memcg: clean up move charge"). Not a big deal because
    I assume the function would get inlined into the caller anyway but why
    not get rid of it.

    [mhocko@suse.com: wrote changelog]
    Link: http://lkml.kernel.org/r/20160525151831.GJ20132@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1464145026-26693-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     

23 Jul, 2016

1 commit

  • The memory controller has quite a bit of state that usually outlives the
    cgroup and pins its CSS until said state disappears. At the same time
    it imposes a 16-bit limit on the CSS ID space to economically store IDs
    in the wild. Consequently, when we use cgroups to contain frequent but
    small and short-lived jobs that leave behind some page cache, we quickly
    run into the 64k limitations of outstanding CSSs. Creating a new cgroup
    fails with -ENOSPC while there are only a few, or even no user-visible
    cgroups in existence.

    Although pinning CSSs past cgroup removal is common, there are only two
    instances that actually need an ID after a cgroup is deleted: cache
    shadow entries and swapout records.

    Cache shadow entries reference the ID weakly and can deal with the CSS
    having disappeared when it's looked up later. They pose no hurdle.

    Swap-out records do need to pin the css to hierarchically attribute
    swapins after the cgroup has been deleted; though the only pages that
    remain swapped out after offlining are tmpfs/shmem pages. And those
    references are under the user's control, so they are manageable.

    This patch introduces a private 16-bit memcg ID and switches swap and
    cache shadow entries over to using that. This ID can then be recycled
    after offlining when the CSS remains pinned only by objects that don't
    specifically need it.

    This script demonstrates the problem by faulting one cache page in a new
    cgroup and deleting it again:

    set -e
    mkdir -p pages
    for x in `seq 128000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /cgroup/foo
        echo $$ >/cgroup/foo/cgroup.procs
        echo trex >pages/$x
        echo $$ >/cgroup/cgroup.procs
        rmdir /cgroup/foo
    done

    When run on an unpatched kernel, we eventually run out of possible IDs
    even though there are no visible cgroups:

    [root@ham ~]# ./cssidstress.sh
    [...]
    65000
    mkdir: cannot create directory '/cgroup/foo': No space left on device

    After this patch, the IDs get released upon cgroup destruction and the
    cache and css objects get released once memory reclaim kicks in.

    [hannes@cmpxchg.org: init the IDR]
    Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
    Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
    Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: John Garcia
    Reviewed-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Cc: Nikolay Borisov
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

25 Jun, 2016

2 commits

    mem_cgroup_css_alloc() was returning NULL on failure while the cgroup
    core expected it to return an ERR_PTR value, leading to the following
    NULL deref after a css allocation failure. Fix it by returning
    ERR_PTR(-ENOMEM) instead. I'll also update the cgroup core so that it
    can handle NULL returns.
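
    A self-contained sketch of the convention the cgroup core expects, using
    userspace copies of the usual negative-errno-in-pointer encoding (the
    real ERR_PTR/IS_ERR live in the kernel's err.h):

    #include <errno.h>
    #include <stdio.h>

    #define MAX_ERRNO 4095
    static inline void *ERR_PTR(long error)      { return (void *)error; }
    static inline long  PTR_ERR(const void *ptr) { return (long)ptr; }
    static inline int   IS_ERR(const void *ptr)
    {
        return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
    }

    /* css_alloc-style callback: on failure, return an encoded errno rather
     * than NULL. NULL is not caught by IS_ERR(), so the caller would treat
     * it as success and dereference it, which is exactly the oops above. */
    static void *css_alloc_demo(int fail)
    {
        if (fail)
            return ERR_PTR(-ENOMEM);
        static int css;                /* stand-in object */
        return &css;
    }

    int main(void)
    {
        void *css = css_alloc_demo(1);

        if (IS_ERR(css))
            printf("allocation failed: %ld\n", PTR_ERR(css));
        return 0;
    }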

    mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
    CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
    ...
    Call Trace:
    dump_stack+0x68/0xa1
    warn_alloc_failed+0xd6/0x130
    __alloc_pages_nodemask+0x4c6/0xf20
    alloc_pages_current+0x66/0xe0
    alloc_kmem_pages+0x14/0x80
    kmalloc_order_trace+0x2a/0x1a0
    __kmalloc+0x291/0x310
    memcg_update_all_caches+0x6c/0x130
    mem_cgroup_css_alloc+0x590/0x610
    cgroup_apply_control_enable+0x18b/0x370
    cgroup_mkdir+0x1de/0x2e0
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xb9/0x150
    SyS_mkdir+0x66/0xd0
    do_syscall_64+0x53/0x120
    entry_SYSCALL64_slow_path+0x25/0x25
    ...
    BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
    IP: init_and_link_css+0x37/0x220
    PGD 34b1e067 PUD 3a109067 PMD 0
    Oops: 0002 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
    task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000
    RIP: 0010:[] [] init_and_link_css+0x37/0x220
    RSP: 0018:ffff8800666d7d90 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008
    RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400
    R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010
    FS: 00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    cgroup_apply_control_enable+0x1ac/0x370
    cgroup_mkdir+0x1de/0x2e0
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xb9/0x150
    SyS_mkdir+0x66/0xd0
    do_syscall_64+0x53/0x120
    entry_SYSCALL64_slow_path+0x25/0x25
    Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
    RIP init_and_link_css+0x37/0x220
    RSP
    CR2: 00000000000000d0
    ---[ end trace a2d8836ae1e852d1 ]---

    Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.org
    Signed-off-by: Tejun Heo
    Reported-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
    mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
    with IRQs disabled from migrate_page_copy(). This ends up enabling IRQs
    while holding an IRQ-context lock, triggering the following lockdep
    warning. Fix it by using local_irq_save/restore() instead.
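
    A miniature of the difference, with stand-in IRQ helpers backed by a
    plain flag: disable/enable unconditionally re-enables interrupts on the
    way out, while save/restore puts back whatever state the caller had:

    #include <stdbool.h>
    #include <stdio.h>

    static bool irqs_enabled = true;            /* pretend CPU IRQ state */

    #define local_irq_disable()      (irqs_enabled = false)
    #define local_irq_enable()       (irqs_enabled = true)
    #define local_irq_save(flags)    ((flags) = irqs_enabled, irqs_enabled = false)
    #define local_irq_restore(flags) (irqs_enabled = (flags))

    /* Buggy shape: re-enables IRQs even if the caller (e.g. aio_migratepage()
     * holding completion_lock) had them disabled. */
    static void migrate_disable_enable(void)
    {
        local_irq_disable();
        local_irq_enable();
    }

    /* Fixed shape: preserves the caller's IRQ state. */
    static void migrate_save_restore(void)
    {
        bool flags;

        local_irq_save(flags);
        local_irq_restore(flags);
    }

    int main(void)
    {
        irqs_enabled = false;                   /* caller already disabled IRQs */
        migrate_disable_enable();
        printf("after disable/enable: irqs %s (wrong)\n", irqs_enabled ? "on" : "off");

        irqs_enabled = false;
        migrate_save_restore();
        printf("after save/restore:   irqs %s\n", irqs_enabled ? "on" : "off");
        return 0;
    }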

    =================================
    [ INFO: inconsistent lock state ]
    4.7.0-rc1+ #52 Tainted: G W
    ---------------------------------
    inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
    kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [] aio_migratepage+0x156/0x1e8
    {IN-SOFTIRQ-W} state was registered at:
    __lock_acquire+0x5b6/0x1930
    lock_acquire+0xee/0x270
    _raw_spin_lock_irqsave+0x66/0xb0
    aio_complete+0x98/0x328
    dio_complete+0xe4/0x1e0
    blk_update_request+0xd4/0x450
    scsi_end_request+0x48/0x1c8
    scsi_io_completion+0x272/0x698
    blk_done_softirq+0xca/0xe8
    __do_softirq+0xc8/0x518
    irq_exit+0xee/0x110
    do_IRQ+0x6a/0x88
    io_int_handler+0x11a/0x25c
    __mutex_unlock_slowpath+0x144/0x1d8
    __mutex_unlock_slowpath+0x140/0x1d8
    kernfs_iop_permission+0x64/0x80
    __inode_permission+0x9e/0xf0
    link_path_walk+0x6e/0x510
    path_lookupat+0xc4/0x1a8
    filename_lookup+0x9c/0x160
    user_path_at_empty+0x5c/0x70
    SyS_readlinkat+0x68/0x140
    system_call+0xd6/0x270
    irq event stamp: 971410
    hardirqs last enabled at (971409): migrate_page_move_mapping+0x3ea/0x588
    hardirqs last disabled at (971410): _raw_spin_lock_irqsave+0x3c/0xb0
    softirqs last enabled at (970526): __do_softirq+0x460/0x518
    softirqs last disabled at (970519): irq_exit+0xee/0x110

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&ctx->completion_lock)->rlock);

    lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

    3 locks held by kcompactd0/151:
    #0: (&(&mapping->private_lock)->rlock){+.+.-.}, at: aio_migratepage+0x42/0x1e8
    #1: (&ctx->ring_lock){+.+.+.}, at: aio_migratepage+0x5a/0x1e8
    #2: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: aio_migratepage+0x156/0x1e8

    stack backtrace:
    CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G W 4.7.0-rc1+ #52
    Call Trace:
    show_trace+0xea/0xf0
    show_stack+0x72/0xf0
    dump_stack+0x9a/0xd8
    print_usage_bug.part.27+0x2d4/0x2e8
    mark_lock+0x17e/0x758
    mark_held_locks+0xa2/0xd0
    trace_hardirqs_on_caller+0x140/0x1c0
    mem_cgroup_migrate+0x266/0x370
    aio_migratepage+0x16a/0x1e8
    move_to_new_page+0xb0/0x260
    migrate_pages+0x8f4/0x9f0
    compact_zone+0x4dc/0xdc8
    kcompactd_do_work+0x1aa/0x358
    kcompactd+0xba/0x2c8
    kthread+0x10a/0x110
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc
    INFO: lockdep is turned off.

    Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
    Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
    Fixes: 74485cf2bc85 ("mm: migrate: consolidate mem_cgroup_migrate() calls")
    Signed-off-by: Tejun Heo
    Reported-by: Christian Borntraeger
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

04 Jun, 2016

1 commit

  • memcg_offline_kmem() may be called from memcg_free_kmem() after a css
    init failure. memcg_free_kmem() is a ->css_free callback which is
    called without cgroup_mutex and memcg_offline_kmem() ends up using
    css_for_each_descendant_pre() without any locking. Fix it by adding rcu
    read locking around it.
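
    A hedged sketch of the locking shape, with demo stand-ins for the RCU
    primitives and the descendant walk; only the structure of the fix is
    shown, not the real kernel code:

    #include <stdio.h>

    /* Stand-ins: in the kernel these are the real RCU read-side primitives. */
    static int rcu_depth;
    static void rcu_read_lock_demo(void)   { rcu_depth++; }
    static void rcu_read_unlock_demo(void) { rcu_depth--; }

    static void css_for_each_descendant_pre_demo(void)
    {
        /* The kernel walk asserts cgroup_mutex *or* the RCU read lock;
         * on the css_free path only the latter is available. */
        printf("walking descendants under %s\n",
               rcu_depth ? "rcu_read_lock" : "NOTHING (lockdep splat)");
    }

    static void memcg_offline_kmem_demo(void)
    {
        rcu_read_lock_demo();
        css_for_each_descendant_pre_demo();
        rcu_read_unlock_demo();
    }

    int main(void)
    {
        memcg_offline_kmem_demo();
        return 0;
    }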

    mkdir: cannot create directory `65530': No space left on device
    ===============================
    [ INFO: suspicious RCU usage. ]
    4.6.0-work+ #321 Not tainted
    -------------------------------
    kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
    [ 527.243970] other info that might help us debug this:
    [ 527.244715]
    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by kworker/0:5/1664:
    #0: ("cgroup_destroy"){.+.+..}, at: [] process_one_work+0x165/0x4a0
    #1: ((&css->destroy_work)#3){+.+...}, at: [] process_one_work+0x165/0x4a0
    [ 527.248098] stack backtrace:
    CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    Workqueue: cgroup_destroy css_free_work_fn
    Call Trace:
    dump_stack+0x68/0xa1
    lockdep_rcu_suspicious+0xd7/0x110
    css_next_descendant_pre+0x7d/0xb0
    memcg_offline_kmem.part.44+0x4a/0xc0
    mem_cgroup_css_free+0x1ec/0x200
    css_free_work_fn+0x49/0x5e0
    process_one_work+0x1c5/0x4a0
    worker_thread+0x49/0x490
    kthread+0xea/0x100
    ret_from_fork+0x1f/0x40

    Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

27 May, 2016

1 commit

    mem_cgroup_out_of_memory() returns "true" if it finds a TIF_MEMDIE task
    after an eligible task has been found, and "false" if it finds a
    TIF_MEMDIE task before an eligible task is found.

    This difference confuses memory_max_write() which checks the return
    value of mem_cgroup_out_of_memory(). Since memory_max_write() wants to
    continue looping, mem_cgroup_out_of_memory() should return "true" in
    this case.

    This patch sets a dummy pointer in order to return "true".

    Link: http://lkml.kernel.org/r/1463753327-5170-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

24 May, 2016

1 commit

    mem_cgroup_oom may be invoked multiple times while a process is handling
    a page fault, in which case current->memcg_in_oom will be overwritten,
    leaking the previously taken css reference.

    Link: http://lkml.kernel.org/r/1464019330-7579-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

21 May, 2016

1 commit

  • Commit f61c42a7d911 ("memcg: remove tasks/children test from
    mem_cgroup_force_empty()") removed memory reparenting from the function.

    Fix the function's comment.

    Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com
    Signed-off-by: Greg Thelen
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

20 May, 2016

1 commit

    If the current task is already killed or PF_EXITING, or a selected task
    is PF_EXITING, then the OOM killer is suppressed, and so is the OOM
    reaper. This patch adds try_oom_reaper, which checks the given task and
    queues it for the OOM reaper if that is safe to do, meaning that the
    task doesn't share its mm with a live process.

    This might help to release the memory pressure while the task tries to
    exit.

    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Michal Hocko
    Cc: Raushaniya Maksudova
    Cc: Michael S. Tsirkin
    Cc: Paul E. McKenney
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Daniel Vetter
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko