13 May, 2017

1 commit

  • Laurent Dufour has noticed that hwpoisoned pages are kept charged. In
    his particular case he has hit a bad_page("page still charged to
    cgroup") when onlining a hwpoison page. While this looks like something
    that shouldn't happen in the first place, because onlining hwpoison
    pages and returning them to the page allocator makes little sense, it
    shows a real problem.

    hwpoison pages usually do not get freed, so we do not uncharge them (at
    least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge
    API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol:
    take a css reference for each charged page")) as well and so the
    mem_cgroup and the associated state will never go away. Fix this leak
    by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
    We also have to tweak uncharge_list because it cannot rely on zero ref
    count for these pages.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Laurent Dufour
    Tested-by: Laurent Dufour
    Reviewed-by: Balbir Singh
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

6 commits

  • The memory controller's stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames them to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]
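
    As a purely illustrative, hypothetical call-site conversion (not taken
    from the patch itself), the rename would look like:

    /* before */
    mem_cgroup_update_page_stat(page, NR_FILE_MAPPED, -nr);

    /* after */
    mod_memcg_page_state(page, NR_FILE_MAPPED, -nr);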

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items.

    This increases the size of the event array, but we'll eventually want
    most of the VM events tracked on a per-cgroup basis anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We only ever count single events, so drop the @nr parameter. Rename
    the function accordingly. Remove low-information kerneldoc.

    Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.

    The workloads that were measurably affected for us were hit pretty
    badly by it, with refault/majfault rates doubling and tripling during
    cache transitions, and the machines sustaining half-hour periods of
    100% IO utilization, where they'd previously have sub-minute peaks at
    60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.
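
    A simplified sketch of the heuristic (not the exact mainline code; the
    counter and snapshot names should be treated as illustrative): compare
    the current refault counter against a snapshot taken on the previous
    reclaim cycle, and drop active list protection while they differ.

    /* in the inactive_list_is_low() decision, roughly: */
    refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
    if (file && lruvec->refaults != refaults)
            return true;    /* refaulting: deactivate active pages */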

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Cgroups currently don't report how much shmem they use, which can be
    useful data to have, in particular since shmem is included in the
    cache/file item while being reclaimed like anonymous memory.

    Add a counter to track shmem pages during charging and uncharging.
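
    A sketch of the bookkeeping (stat index name as merged, but treat the
    exact placement as illustrative): shmem is page cache that is
    swap-backed, so it can be distinguished in the charge statistics.

    if (PageAnon(page))
            __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_pages);
    else {
            __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE], nr_pages);
            if (PageSwapBacked(page))       /* shmem/tmpfs */
                    __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SHMEM],
                                   nr_pages);
    }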

    Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Chris Down
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

10 Mar, 2017

2 commits

  • mem_cgroup_free() indirectly calls wb_domain_exit() which is not
    prepared to deal with a struct wb_domain object that hasn't executed
    wb_domain_init(). For instance, the following warning message is
    printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc():

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x67/0x99
    register_lock_class+0x36d/0x540
    __lock_acquire+0x7f/0x1a30
    lock_acquire+0xcc/0x200
    del_timer_sync+0x3c/0xc0
    wb_domain_exit+0x14/0x20
    mem_cgroup_free+0x14/0x40
    mem_cgroup_css_alloc+0x3f9/0x620
    cgroup_apply_control_enable+0x190/0x390
    cgroup_mkdir+0x290/0x3d0
    kernfs_iop_mkdir+0x58/0x80
    vfs_mkdir+0x10e/0x1a0
    SyS_mkdirat+0xa8/0xd0
    SyS_mkdir+0x14/0x20
    entry_SYSCALL_64_fastpath+0x18/0xad

    Add __mem_cgroup_free() which skips wb_domain_exit(). This is used by
    both mem_cgroup_free() and the mem_cgroup_alloc() cleanup path.
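
    The fix is essentially of this shape (a sketch, assuming
    free_mem_cgroup_per_node_info() as the existing per-node teardown
    helper):

    static void __mem_cgroup_free(struct mem_cgroup *memcg)
    {
            int node;

            for_each_node(node)
                    free_mem_cgroup_per_node_info(memcg, node);
            free_percpu(memcg->stat);
            kfree(memcg);
    }

    static void mem_cgroup_free(struct mem_cgroup *memcg)
    {
            memcg_wb_domain_exit(memcg);    /* only safe after init */
            __mem_cgroup_free(memcg);
    }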

    Fixes: 0b8f73e104285 ("mm: memcontrol: clean up alloc, online, offline, free functions")
    Link: http://lkml.kernel.org/r/20170306192122.24262-1-tahsin@google.com
    Signed-off-by: Tahsin Erdogan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tahsin Erdogan
     
  • The system may panic during initialisation when almost all of the
    memory is assigned to huge pages using the kernel command line
    parameter hugepages=xxxx. The panic may look like this:

    Unable to handle kernel paging request for data at address 0x00000000
    Faulting instruction address: 0xc000000000302b88
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 [ 0.082424] NUMA
    pSeries
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
    task: c00000021ed01600 task.stack: c00000010d108000
    NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
    REGS: c00000010d10b2c0 TRAP: 0300 Not tainted (4.9.0-15-generic)
    MSR: 8000000002009033 [ 0.082770] CR: 28424422 XER: 00000000
    CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
    GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
    GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
    GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
    GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
    GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
    GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
    GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
    GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
    NIP mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
    LR do_try_to_free_pages+0x1b4/0x450
    Call Trace:
    do_try_to_free_pages+0x1b4/0x450
    try_to_free_pages+0xf8/0x270
    __alloc_pages_nodemask+0x7a8/0xff0
    new_slab+0x104/0x8e0
    ___slab_alloc+0x620/0x700
    __slab_alloc+0x34/0x60
    kmem_cache_alloc_node_trace+0xdc/0x310
    mem_cgroup_init+0x158/0x1c8
    do_one_initcall+0x68/0x1d0
    kernel_init_freeable+0x278/0x360
    kernel_init+0x24/0x170
    ret_from_kernel_thread+0x5c/0x74
    Instruction dump:
    eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
    3929acd8 794a1f24 7d295214 eac90100 2fa90000 419eff74 3b200000
    ---[ end trace 342f5208b00d01b6 ]---

    This is a chicken and egg issue: the kernel tries to free memory when
    allocating per-node data in mem_cgroup_init(), but in that path
    mem_cgroup_soft_limit_reclaim() is called, which assumes that this data
    has already been allocated.

    As mem_cgroup_soft_limit_reclaim() is best effort, it should return
    early when this data is not yet allocated.

    This patch also fixes potential null pointer access in
    mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
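
    The early bail-out amounts to a NULL check on the per-node tree before
    using it, along these lines (sketch):

    mctz = soft_limit_tree_node(pgdat->node_id);
    if (!mctz)
            return 0;       /* per-node data not allocated yet */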

    Link: http://lkml.kernel.org/r/1487856999-16581-2-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Balbir Singh
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     

02 Mar, 2017

1 commit

  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.
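
    The initial placeholder is just a wrapper around the old header,
    something like (sketch):

    /* include/linux/sched/mm.h */
    #ifndef _LINUX_SCHED_MM_H
    #define _LINUX_SCHED_MM_H

    #include <linux/sched.h>

    #endif /* _LINUX_SCHED_MM_H */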

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

25 Feb, 2017

1 commit

  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more .c files now need to #include
    it.
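
    The stub in question is trivial; in linux/shmem_fs.h it amounts to
    something like (sketch):

    #ifdef CONFIG_SHMEM
    extern bool shmem_mapping(struct address_space *mapping);
    #else
    static inline bool shmem_mapping(struct address_space *mapping)
    {
            return false;
    }
    #endif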

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Feb, 2017

2 commits

  • If there's contention on slab_mutex, queueing the per-cache destruction
    work item on the system_wq can unnecessarily create and tie up a lot of
    kworkers.

    Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq, make it
    global, and use that workqueue for the destruction work items too.
    While at it, convert the workqueue from an unbound workqueue to a
    per-cpu one with concurrency limited to 1. It's generally preferable
    to use per-cpu workqueues, and a concurrency limit of 1 is safe enough.

    This is suggested by Joonsoo Kim.
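
    The conversion boils down to the workqueue allocation (a sketch: a
    per-cpu workqueue with max_active of 1 instead of an ordered unbound
    one):

    memcg_kmem_cache_wq = alloc_workqueue("memcg_kmem_cache", 0, 1);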

    Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    While a memcg kmem_cache is listed on its root cache's ->children list,
    there is no direct way to iterate all kmem_caches which are associated
    with a memory cgroup. The only way to iterate them is walking all
    caches while filtering out caches which don't match, which would be
    most of them.

    This makes memcg destruction operations O(N^2) where N is the total
    number of slab caches which can be huge. This combined with the
    synchronous RCU operations can tie up a CPU and affect the whole machine
    for many hours when memory reclaim triggers offlining and destruction of
    the stale memcgs.

    This patch adds mem_cgroup->kmem_caches list which goes through
    memcg_cache_params->kmem_caches_node of all kmem_caches which are
    associated with the memcg. All memcg specific iterations, including
    stat file access, are updated to use the new list instead.
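
    With the new list, a per-memcg walk becomes direct instead of filtering
    the global cache list (sketch, using the field names described above,
    under slab_mutex):

    struct kmem_cache *s;

    mutex_lock(&slab_mutex);
    list_for_each_entry(s, &memcg->kmem_caches,
                        memcg_params.kmem_caches_node) {
            /* operate on this memcg's caches only */
    }
    mutex_unlock(&slab_mutex);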

    Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

25 Jan, 2017

1 commit

  • When memory.move_charge_at_immigrate is enabled and precharges are
    depleted during move, mem_cgroup_move_charge_pte_range() will attempt to
    increase the size of the precharge.

    Prevent precharges from ever looping by setting __GFP_NORETRY. This was
    probably the intention of the GFP_KERNEL & ~__GFP_NORETRY, which is
    pointless as written.
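
    The charging loop is then essentially of this shape (sketch):

    /* in mem_cgroup_do_precharge(): charge one page at a time, but
     * never let try_charge() retry-loop on its own */
    while (count--) {
            ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1);
            if (ret)
                    return ret;
            mc.precharge++;
    }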

    Fixes: 0029e19ebf84 ("mm: memcontrol: remove explicit OOM parameter in charge path")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701130208510.69402@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

11 Jan, 2017

1 commit

  • Nils Holland and Klaus Ethgen have reported unexpected OOM killer
    invocations on 32-bit kernels starting with 4.8:

    kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
    kworker/u4:5 cpuset=/ mems_allowed=0
    CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
    [...]
    Mem-Info:
    active_anon:58685 inactive_anon:90 isolated_anon:0
    active_file:274324 inactive_file:281962 isolated_file:0
    unevictable:0 dirty:649 writeback:0 unstable:0
    slab_reclaimable:40662 slab_unreclaimable:17754
    mapped:7382 shmem:202 pagetables:351 bounce:0
    free:206736 free_pcp:332 free_cma:0
    Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
    DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 813 3474 3474
    Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
    lowmem_reserve[]: 0 0 21292 21292
    HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

    The OOM killer is clearly premature because there is still a lot of
    page cache in zone Normal which should satisfy this lowmem request.
    Further debugging has shown that the reclaim cannot make any forward
    progress because the page cache is hidden in the active list, which
    doesn't get rotated because inactive_list_is_low is not memcg aware.

    The code simply subtracts per-zone highmem counters from the respective
    memcg's lru sizes which doesn't make any sense. We can simply end up
    always seeing the resulting active and inactive counts 0 and return
    false. This issue is not limited to 32-bit kernels but in practice the
    effect on systems without CONFIG_HIGHMEM would be much harder to notice
    because we do not invoke the OOM killer for allocation requests
    targeting < ZONE_NORMAL.

    Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
    and subtract per-memcg highmem counts when memcg is enabled. Introduce
    helper lruvec_zone_lru_size which redirects to either zone counters or
    mem_cgroup_get_zone_lru_size when appropriate.

    We are losing empty LRU but non-zero lru size detection introduced by
    ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
    of the inherent zone vs. node discrepancy.
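
    The memcg side of the helper reads the new per-zone counts (a sketch
    close to the merged code):

    static unsigned long
    mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                 int zone_idx)
    {
            struct mem_cgroup_per_node *mz;

            mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
            return mz->lru_zone_size[zone_idx][lru];
    }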

    Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
    Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Nils Holland
    Tested-by: Nils Holland
    Reported-by: Klaus Ethgen
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Dec, 2016

1 commit


13 Dec, 2016

2 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - most of MM (quite a lot of MM material is awaiting the merge of
    linux-next dependencies)

    - kasan

    - printk updates

    - procfs updates

    - MAINTAINERS

    - /lib updates

    - checkpatch updates

    * emailed patches from Andrew Morton: (123 commits)
    init: reduce rootwait polling interval time to 5ms
    binfmt_elf: use vmalloc() for allocation of vma_filesz
    checkpatch: don't emit unified-diff error for rename-only patches
    checkpatch: don't check c99 types like uint8_t under tools
    checkpatch: avoid multiple line dereferences
    checkpatch: don't check .pl files, improve absolute path commit log test
    scripts/checkpatch.pl: fix spelling
    checkpatch: don't try to get maintained status when --no-tree is given
    lib/ida: document locking requirements a bit better
    lib/rbtree.c: fix typo in comment of ____rb_erase_color
    lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
    MAINTAINERS: add drm and drm/i915 irc channels
    MAINTAINERS: add "C:" for URI for chat where developers hang out
    MAINTAINERS: add drm and drm/i915 bug filing info
    MAINTAINERS: add "B:" for URI where to file bugs
    get_maintainer: look for arbitrary letter prefixes in sections
    printk: add Kconfig option to set default console loglevel
    printk/sound: handle more message headers
    printk/btrfs: handle more message headers
    printk/kdb: handle more message headers
    ...

    Linus Torvalds
     
  • Creating a lot of cgroups at the same time might stall all worker
    threads with kmem cache creation works, because kmem cache creation is
    done with the slab_mutex held. The problem was amplified by commits
    801faf0db894 ("mm/slab: lockless decision to grow cache") in case of
    SLAB and 81ae6d03952c ("mm/slub.c: replace kick_all_cpus_sync() with
    synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which
    increased the maximal time the slab_mutex can be held.

    To prevent that from happening, let's use a special ordered
    single-threaded workqueue for kmem cache creation. This shouldn't
    introduce any functional changes regarding how kmem caches are created,
    as the work function holds the global slab_mutex during its whole
    runtime anyway, making it impossible to run more than one work at a
    time. By using a single-threaded workqueue, we just avoid creating a
    thread per work item. Ordering is required to avoid a situation where
    a cgroup's work is put off indefinitely because there are other cgroups
    to serve, in other words to guarantee fairness.
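
    The workqueue itself is a one-liner (sketch):

    memcg_kmem_cache_create_wq =
            alloc_ordered_workqueue("memcg_kmem_cache_create", 0);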

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
    Link: http://lkml.kernel.org/r/20161004131417.GC1862@esperanza
    Signed-off-by: Vladimir Davydov
    Reported-by: Doug Smythies
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

10 Nov, 2016

1 commit

  • Install the callbacks via the state machine.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Michal Hocko
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Link: http://lkml.kernel.org/r/20161103145021.28528-4-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

28 Oct, 2016

1 commit

  • On 4.0, we saw a stack corruption from a page fault entering direct
    memory cgroup reclaim, calling into btrfs_releasepage(), which then
    tried to allocate an extent and recursed back into a kmem charge ad
    nauseam:

    [...]
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    memcg_charge_kmem+0x40/0x80
    new_slab+0x2d9/0x5a0
    __slab_alloc+0x2fd/0x44f
    kmem_cache_alloc+0x193/0x1e0
    alloc_extent_state+0x21/0xc0
    __clear_extent_bit+0x2b5/0x400
    try_release_extent_mapping+0x1a3/0x220
    __btrfs_releasepage+0x31/0x70
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    mem_cgroup_try_charge+0x65/0x1c0
    handle_mm_fault+0x117f/0x1510
    __do_page_fault+0x177/0x420
    do_page_fault+0xc/0x10
    page_fault+0x22/0x30

    On later kernels, kmem charging is opt-in rather than opt-out, and that
    particular kmem allocation in btrfs_releasepage() is no longer being
    charged and won't recurse and overrun the stack anymore.

    But it's not impossible for an accounted allocation to happen from the
    memcg direct reclaim context, and we needed to reproduce this crash many
    times before we even got a useful stack trace out of it.

    Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
    avoid recursing into any other form of direct reclaim. Then let
    recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
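
    The two halves of the fix are small (sketch): mark the reclaiming task,
    and let recursive charges from such a context bypass the limit.

    /* around the memcg reclaim entry point: */
    current->flags |= PF_MEMALLOC;
    nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
    current->flags &= ~PF_MEMALLOC;

    /* and in try_charge(): */
    if (unlikely(current->flags & PF_MEMALLOC))
            goto force;     /* bypass the limit, never recurse */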

    Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Oct, 2016

5 commits

  • The cgroup core and the memory controller need to track socket ownership
    for different purposes, but the tracking sites being entirely different
    is kind of ugly.

    Be a better citizen and rename the memory controller callbacks to match
    the cgroup core callbacks, then move them to the same place.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This patch is to improve the performance of swap cache operations when
    the type of the swap device is not 0. Originally, the whole swap entry
    value is used as the key of the swap cache, even though there is one
    radix tree for each swap device. If the type of the swap device is not
    0, the height of the radix tree of the swap cache will be increased
    unnecessarily, especially on 64-bit architectures. For example, for a
    1GB
    swap device on the x86_64 architecture, the height of the radix tree of
    the swap cache is 11. But if the offset of the swap entry is used as
    the key of the swap cache, the height of the radix tree of the swap
    cache is 4. The increased height causes unnecessary radix tree
    descending and increased cache footprint.

    This patch reduces the height of the radix tree of the swap cache via
    using the offset of the swap entry instead of the whole swap entry value
    as the key of the swap cache. In 32 processes sequential swap out test
    case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
    for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
    when the type of the swap device is 1.

    Use the whole swap entry as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

    Use the swap offset as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,
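
    The key change is to index the per-device radix tree by swp_offset()
    rather than by the full entry value (sketch):

    /* in __add_to_swap_cache(), roughly: */
    error = radix_tree_insert(&address_space->page_tree,
                              swp_offset(entry), page);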

    Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • mem_cgroup_count_precharge() and mem_cgroup_move_charge() both call
    walk_page_range() on the range 0 to ~0UL, neither provide a pte_hole
    callback, which causes the current implementation to skip non-vma
    regions. This is all fine, but follow-up changes would like to make
    walk_page_range more generic, so it is better to be explicit about
    which range to traverse. Let's use highest_vm_end to explicitly
    traverse only user mmapped memory.
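
    The change itself is a one-liner per call site (sketch; the walk
    variable name is illustrative):

    /* was: walk_page_range(0, ~0UL, &walk); */
    walk_page_range(0, mm->highest_vm_end, &walk);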

    [mhocko@kernel.org: rewrote changelog]
    Link: http://lkml.kernel.org/r/1472655897-22532-1-git-send-email-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Naoya Horiguchi
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     
  • Link: http://lkml.kernel.org/r/1c5ddb1c171dbdfc3262252769d6138a29b35b70.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When selecting an oom victim, we use the same heuristic for both memory
    cgroup and global oom. The only difference is the scope of tasks to
    select the victim from. So we could just export an iterator over all
    memcg tasks and keep all oom related logic in oom_kill.c, but instead we
    duplicate pieces of it in memcontrol.c reusing some initially private
    functions of oom_kill.c in order to not duplicate all of it. That looks
    ugly and error prone, because any modification of select_bad_process
    should also be propagated to mem_cgroup_out_of_memory.

    Let's rework this as follows: keep all oom heuristic related code private
    to oom_kill.c and make oom_kill.c use exported memcg functions when it's
    really necessary (like in case of iterating over memcg tasks).
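
    The exported iterator is essentially of this shape (a sketch close to
    the merged interface):

    int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
                              int (*fn)(struct task_struct *, void *),
                              void *arg)
    {
            struct mem_cgroup *iter;
            int ret = 0;

            for_each_mem_cgroup_tree(iter, memcg) {
                    struct css_task_iter it;
                    struct task_struct *task;

                    css_task_iter_start(&iter->css, &it);
                    while (!ret && (task = css_task_iter_next(&it)))
                            ret = fn(task, arg);
                    css_task_iter_end(&it);
                    if (ret) {
                            mem_cgroup_iter_break(memcg, iter);
                            break;
                    }
            }
            return ret;
    }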

    Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

20 Sep, 2016

1 commit

  • During cgroup2 rollout into production, we started encountering css
    refcount underflows and css access crashes in the memory controller.
    Splitting the heavily shared css reference counter into logical users
    narrowed the imbalance down to the cgroup2 socket memory accounting.

    The problem turns out to be the per-cpu charge cache. Cgroup1 had a
    separate socket counter, but the new cgroup2 socket accounting goes
    through the common charge path that uses a shared per-cpu cache for all
    memory that is being tracked. Those caches are safe against scheduling
    preemption, but not against interrupts - such as the newly added packet
    receive path. When cache draining is interrupted by network RX taking
    pages out of the cache, the resuming drain operation will put references
    of in-use pages, thus causing the imbalance.

    Disable IRQs during all per-cpu charge cache operations.
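
    The fix pattern, applied to all stock operations (sketch):

    static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
    {
            struct memcg_stock_pcp *stock;
            unsigned long flags;

            local_irq_save(flags);  /* exclude e.g. the network RX path */

            stock = this_cpu_ptr(&memcg_stock);
            /* ... cache the charge ... */

            local_irq_restore(flags);
    }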

    Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
    Link: http://lkml.kernel.org/r/20160914194846.11153-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Aug, 2016

1 commit

  • A bugfix in v4.8-rc2 introduced a harmless warning when
    CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:

    mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
    static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)

    This moves the function inside of the #ifdef block that hides the
    calling function, to avoid the warning.

    Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
    Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

12 Aug, 2016

2 commits

  • Since commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure
    after many small jobs") swap entries do not pin memcg->css.refcnt
    directly. Instead, they pin memcg->id.ref. So we should adjust the
    reference counters accordingly when moving swap charges between cgroups.

    Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Link: http://lkml.kernel.org/r/9ce297c64954a42dc90b543bc76106c4a94f07e8.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • An offline memory cgroup might have anonymous memory or shmem left
    charged to it and no swap. Since only swap entries pin the id of an
    offline cgroup, such a cgroup will have no id and so an attempt to
    swap out its anon/shmem will not store memory cgroup info in the swap
    cgroup map. As a result, memcg->swap or memcg->memsw will never get
    uncharged from it or any of its ancestors.

    Fix this by always charging swapout to the first ancestor cgroup that
    hasn't released its id yet.
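
    The ancestor walk is roughly (a sketch close to the merged helper):

    static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
    {
            while (!atomic_inc_not_zero(&memcg->id.ref)) {
                    /* the root cgroup can't be offlined; its ref is always >= 1 */
                    if (WARN_ON_ONCE(memcg == root_mem_cgroup))
                            break;
                    memcg = parent_mem_cgroup(memcg);
                    if (!memcg)
                            memcg = root_mem_cgroup;
            }
            return memcg;
    }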

    [hannes@cmpxchg.org: add comment to mem_cgroup_swapout]
    [vdavydov@virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()]
    Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza
    Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Link: http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

10 Aug, 2016

1 commit

  • To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
    which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
    in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
    with __GFP_ACCOUNT, including those that aren't actually charged to any
    cgroup, i.e. allocated from the root cgroup context. To avoid overhead
    in case cgroups are not used, we only do that if memcg_kmem_enabled() is
    true. The latter is set iff there are kmem-enabled memory cgroups
    (online or offline). The root cgroup is not considered kmem-enabled.

    As a result, if a page is allocated with __GFP_ACCOUNT for the root
    cgroup when there are kmem-enabled memory cgroups and is freed after all
    kmem-enabled memory cgroups were removed, e.g.

    # no memory cgroups has been created yet, create one
    mkdir /sys/fs/cgroup/memory/test
    # run something allocating pages with __GFP_ACCOUNT, e.g.
    # a program using pipe
    dmesg | tail
    # remove the memory cgroup
    rmdir /sys/fs/cgroup/memory/test

    we'll get bad page state bug complaining about page->_mapcount != -1:

    BUG: Bad page state in process swapper/0 pfn:1fd945c
    page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
    flags: 0x1000000000000000()

    To avoid that, let's mark with PageKmemcg only those pages that are
    actually charged to and hence pin a non-root memory cgroup.
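
    After the fix, the page is only marked once a non-root charge actually
    succeeded (sketch; treat the helper names as approximate):

    /* in memcg_kmem_charge(), roughly: */
    memcg = get_mem_cgroup_from_mm(current->mm);
    if (!mem_cgroup_is_root(memcg)) {
            ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
            if (!ret)
                    __SetPageKmemcg(page);
    }
    css_put(&memcg->css);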

    Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

03 Aug, 2016

1 commit

  • We've had a report about soft lockups caused by lock bouncing in the
    soft reclaim path:

    BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
    RIP: 0010:[] [] _raw_spin_lock+0x18/0x20
    Call Trace:
    mem_cgroup_soft_limit_reclaim+0x25a/0x280
    shrink_zones+0xed/0x200
    do_try_to_free_pages+0x74/0x320
    try_to_free_pages+0x112/0x180
    __alloc_pages_slowpath+0x3ff/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_wp_page+0x19f/0x840
    handle_pte_fault+0x1cd/0x230
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30

    There are no memcgs created so there cannot be any in the soft limit
    excess obviously:

    [...]
    memory 0 1 1

    So all this just seems to be mem_cgroup_largest_soft_limit_node trying
    to get spin_lock_irq(&mctz->lock) just to find out that the soft limit
    excess tree is empty. This is a pointless waste of cycles and cache
    line bouncing during heavy parallel reclaim on large machines. The
    particular machine wasn't very healthy and was most probably suffering
    from a memory leak which caused the memory reclaim to thrash heavily.
    But bouncing on the lock certainly didn't help...

    Fix this with an optimistic lockless check and bail out early if the
    tree is empty. This is theoretically racy but that shouldn't matter
    all that much. First of all, the soft limit is a best effort feature,
    it is slowly getting deprecated, and its usage should be really scarce.
    Bouncing on a lock without a good reason is surely a much bigger
    problem, especially on large CPU machines.
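
    The bail-out is a single lockless check before taking the lock
    (sketch):

    if (RB_EMPTY_ROOT(&mctz->rb_root))
            return 0;       /* racy, but soft limit is best effort */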

    Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

29 Jul, 2016

8 commits

  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and move the accounting into account_kernel_stack().
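
    The accounting then works in KiB regardless of how THREAD_SIZE compares
    to PAGE_SIZE (a sketch close to the merged code):

    static void account_kernel_stack(unsigned long *stack, int account)
    {
            /* all stack pages are in one zone and belong to one memcg */
            struct page *first_page = virt_to_page(stack);

            mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
                                THREAD_SIZE / 1024 * account);
            memcg_kmem_update_page_stat(first_page, MEMCG_KERNEL_STACK_KB,
                                        account * (THREAD_SIZE / 1024));
    }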

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Minchan Kim reported seeing the following warning on a 32-bit system,
    although it can affect 64-bit systems too.

    WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
    mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
    Modules linked in:
    CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x76/0xaf
    __warn+0xea/0x110
    ? mem_cgroup_update_lru_size+0x103/0x110
    warn_slowpath_fmt+0x3b/0x40
    mem_cgroup_update_lru_size+0x103/0x110
    isolate_lru_pages.isra.61+0x2e2/0x360
    shrink_active_list+0xac/0x2a0
    ? __delay+0xe/0x10
    shrink_node_memcg+0x53c/0x7a0
    shrink_node+0xab/0x2a0
    do_try_to_free_pages+0xc6/0x390
    try_to_free_pages+0x245/0x590

    LRU list contents and counts are updated separately. Counts are updated
    before pages are added to the LRU and updated after pages are removed.
    The warning above is from a check in mem_cgroup_update_lru_size that
    ensures a list size of zero corresponds to an empty list.

    The problem is that node-lru needs to account for highmem pages if
    CONFIG_HIGHMEM is set. One impact of the implementation is that the
    sizes are updated in multiple passes when pages from multiple zones were
    isolated. This happens whether HIGHMEM is set or not. When multiple
    zones are isolated, it's possible for a debugging check in memcg to be
    tripped.

    This patch forces all the zone counts to be updated before the memcg
    function is called.

    Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Tested-by: Minchan Kim
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node. Technically, all the
    variable names should also change, but people are already familiar with
    the meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    node level. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being per-node, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.
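
    After the move, a zone's LRU lock simply redirects to its node (this
    accessor matches the shape of the change; treat details as a sketch):

    static inline spinlock_t *zone_lru_lock(struct zone *zone)
    {
            return &zone->zone_pgdat->lru_lock;
    }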

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
    detection") added a page->mem_cgroup lookup to the cache eviction,
    refault, and activation paths, as well as locking to the activation
    path, and the vm-scalability tests showed a regression of -23%.

    While the test in question is an artificial worst-case scenario that
    doesn't occur in real workloads - reading two sparse files in parallel
    at full CPU speed just to hammer the LRU paths - there are still some
    optimizations that can be done in those paths.

    Inline the lookup functions to eliminate calls. Also, page->mem_cgroup
    doesn't need to be stabilized when counting an activation; we merely
    need to hold the RCU lock to prevent the memcg from being freed.

    This cuts down on overhead quite a bit:

    23047a96d7cfcfca                 063f6715e77a7be5770d6081fe
    ----------------                 --------------------------
         %stddev         %change          %stddev
    21621405 +- 0%       +11.3%      24069657 +- 2%  vm-scalability.throughput

    [linux@roeck-us.net: drop unnecessary include file]
    [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
    Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
    Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
    Reported-by: Ye Xiaolong
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • task_will_free_mem is rather weak. It doesn't really tell whether the
    task has a chance to drop its mm. 98748bd72200 ("oom: consider
    multi-threaded tasks in task_will_free_mem") made a first step towards
    making it more robust for multi-threaded applications, so now we know
    that the whole process is going down and will probably drop the mm.

    This patch builds on top for more complex scenarios where mm is shared
    between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel
    use_mm().

    Make sure that all processes sharing the mm are killed or exiting. This
    will allow us to replace try_oom_reaper by wake_oom_reaper because
    task_will_free_mem implies the task is reapable now. Therefore all paths
    which bypass the oom killer are now reapable and so they shouldn't lock up
    the oom killer.
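
    The core of the strengthened check walks every process sharing the mm
    (a sketch; __task_will_free_mem() is the per-task helper):

    rcu_read_lock();
    for_each_process(p) {
            if (!process_shares_mm(p, mm))
                    continue;
            if (same_thread_group(task, p))
                    continue;
            ret = __task_will_free_mem(p);
            if (!ret)
                    break;
    }
    rcu_read_unlock();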

    Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 Jul, 2016

1 commit

  • Commit f627c2f53786 ("memcg: adjust to support new THP refcounting")
    added a compound parameter to several functions, including
    mem_cgroup_move_account(), but it did not update the corresponding
    comments.

    Link: http://lkml.kernel.org/r/1465368216-9393-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing