29 Feb, 2020

1 commit

  • commit 75866af62b439859d5146b7093ceb6b482852683 upstream.

    for_each_mem_cgroup() increases the css reference counter for each memory
    cgroup and requires the use of mem_cgroup_iter_break() if the walk is
    cancelled.
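
    The requirement can be illustrated with a small userspace model (a sketch;
    walk_next()/walk_break() below are invented stand-ins for
    mem_cgroup_iter()/mem_cgroup_iter_break(), not the kernel code): an
    iterator that pins each element it returns needs a matching break helper
    that drops the pin when the walk stops early.

    #include <stdio.h>

    struct node { int refcnt; };

    static struct node nodes[3];

    /* returns the next node with a reference taken, dropping the
     * reference held on the previous one */
    static struct node *walk_next(struct node *prev)
    {
        struct node *next = NULL;

        if (!prev)
            next = &nodes[0];
        else if (prev < &nodes[2])
            next = prev + 1;

        if (next)
            next->refcnt++;
        if (prev)
            prev->refcnt--;
        return next;
    }

    /* must be used when the walk is cancelled early */
    static void walk_break(struct node *last)
    {
        if (last)
            last->refcnt--;
    }

    int main(void)
    {
        struct node *n = NULL;
        int stop_at = 1, i = 0;

        while ((n = walk_next(n))) {
            if (i++ == stop_at) {
                walk_break(n);    /* without this, nodes[1] keeps a reference */
                break;
            }
        }

        for (i = 0; i < 3; i++)
            printf("node %d refcnt %d\n", i, nodes[i].refcnt);  /* all 0 again */
        return 0;
    }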

    Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
    Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
    Signed-off-by: Vasily Averin
    Acked-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     

11 Feb, 2020

1 commit

  • commit fac0516b5534897bf4c4a88daa06a8cfa5611b23 upstream.

    If compound is true, the page is a PMD-mapped THP, which implies it is
    not linked to any defer list. So the first code chunk will not be
    executed.

    For the same reason, it would not be proper to add this page to a defer
    list, so the second code chunk is not correct either.

    Based on this, we should remove the defer list related code.

    [yang.shi@linux.alibaba.com: better patch title]
    Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com
    Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
    Signed-off-by: Wei Yang
    Suggested-by: Kirill A. Shutemov
    Acked-by: Yang Shi
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: [5.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     

23 Jan, 2020

1 commit

  • commit 4a87e2a25dc27131c3cce5e94421622193305638 upstream.

    Currently slab percpu vmstats are flushed twice: during the memcg
    offlining and just before freeing the memcg structure. Each time percpu
    counters are summed, added to the atomic counterparts and propagated up
    the cgroup tree.

    The second flushing is required due to how recursive vmstats are
    implemented: counters are batched in percpu variables on a local level,
    and once a percpu value is crossing some predefined threshold, it spills
    over to atomic values on the local and each ascendant levels. It means
    that without flushing some numbers cached in percpu variables will be
    dropped on the floor each time a cgroup is destroyed. And with uptime the
    error on upper levels might become noticeable.

    The first flushing aims to make counters on ancestor levels more
    precise. Dying cgroups may remain in the dying state for a long time.
    After kmem_cache reparenting, which is performed during the offlining,
    slab counters of the dying cgroup don't have any chance to be updated,
    because any slab operations will be performed on the parent level. It
    means that the inaccuracy caused by percpu batching will not decrease up
    to the final destruction of the cgroup. By the original idea, flushing
    slab counters during the offlining should minimize the visible
    inaccuracy of slab counters on the parent level.

    The problem is that percpu counters are not zeroed after the first
    flushing, so every cached percpu value is summed twice. It creates a
    small error (up to 32 pages per cpu, but usually less) which accumulates
    on the parent cgroup level. After creating and destroying thousands of
    child cgroups, the slab counter on the parent level can be way off the
    real value.
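
    The double-counting arithmetic can be reproduced with a toy userspace
    calculation (a sketch; the per-cpu values and cpu count are invented, only
    the idea of summing never-zeroed caches twice comes from the text above):

    #include <stdio.h>

    #define NR_CPUS 4

    int main(void)
    {
        /* values still sitting in the percpu caches, each below the
         * 32-page batch threshold */
        long percpu[NR_CPUS] = { 7, 13, 2, 31 };
        long parent = 0, true_total = 0;
        int cpu, flush;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            true_total += percpu[cpu];

        /* first flush at offlining, second flush at final release;
         * the caches are never zeroed in between */
        for (flush = 0; flush < 2; flush++)
            for (cpu = 0; cpu < NR_CPUS; cpu++)
                parent += percpu[cpu];

        printf("true %ld, parent sees %ld, error %ld\n",
               true_total, parent, parent - true_total);
        return 0;
    }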

    For now, let's just stop flushing slab counters on memcg offlining. It
    can't be done correctly without scheduling a work on each cpu: reading
    and zeroing it during css offlining can race with an asynchronous
    update, which doesn't expect values to be changed underneath.

    With this change, slab counters on parent level will become eventually
    consistent. Once all dying children are gone, values are correct. And
    if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.

    It's not perfect, as slabs are reparented, so any updates after the
    reparenting will happen on the parent level. It means that if a slab
    page was allocated, a counter on the child level was bumped, then the page
    was reparented and freed, the annihilation of positive and negative
    counter values will not happen until the child cgroup is released. It
    makes slab counters different from the others, and it might push us to
    implement flushing in a correct form again. But it's also a question of
    performance: scheduling a work on each cpu isn't free, and it's an open
    question whether the benefit of having more accurate counters is worth it.

    We might also consider flushing all counters on offlining, not only slab
    counters.

    So let's fix the main problem now: make the slab counters eventually
    consistent, so at least the error won't grow with uptime (or more
    precisely the number of created and destroyed cgroups). And think about
    the accuracy of counters separately.

    Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
    Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

16 Nov, 2019

1 commit

  • We've encountered an RCU stall in get_mem_cgroup_from_mm():

    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
    (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33

    RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90

    __memcg_kmem_charge+0x55/0x140
    __alloc_pages_nodemask+0x267/0x320
    pipe_write+0x1ad/0x400
    new_sync_write+0x127/0x1c0
    __kernel_write+0x4f/0xf0
    dump_emit+0x91/0xc0
    writenote+0xa0/0xc0
    elf_core_dump+0x11af/0x1430
    do_coredump+0xc65/0xee0
    get_signal+0x132/0x7c0
    do_signal+0x36/0x640
    exit_to_usermode_loop+0x61/0xd0
    do_syscall_64+0xd4/0x100
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The problem is caused by an exiting task which is associated with an
    offline memcg. We're iterating over and over in the do {} while
    (!css_tryget_online()) loop, but obviously the memcg won't become online
    and the exiting task won't be migrated to a live memcg.

    Let's fix it by switching from css_tryget_online() to css_tryget().

    As css_tryget_online() cannot guarantee that the memcg won't go offline,
    the check is usually useless, except for some rare cases when, for
    example, it determines if something should be presented to a user.

    A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
    css_tryget() instead of css_tryget_online() in task_get_css()").

    Johannes:

    : The bug aside, it doesn't matter whether the cgroup is online for the
    : callers. It used to matter when offlining needed to evacuate all charges
    : from the memcg, and so needed to prevent new ones from showing up, but we
    : don't care now.
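
    The difference can be modelled in plain C (a sketch; the struct and helpers
    are simplified stand-ins, not the kernel implementations): with an offline
    but still referenced css, a tryget-online retry loop never terminates,
    while a plain tryget succeeds.

    #include <stdio.h>
    #include <stdbool.h>

    struct css { int refcnt; bool online; };

    static bool css_tryget_model(struct css *c)
    {
        if (c->refcnt <= 0)
            return false;
        c->refcnt++;
        return true;
    }

    static bool css_tryget_online_model(struct css *c)
    {
        return c->online && css_tryget_model(c);
    }

    int main(void)
    {
        /* an exiting task's memcg: still referenced, already offlined */
        struct css memcg = { .refcnt = 1, .online = false };
        int spins = 0;

        /* the old do {} while (!css_tryget_online()) loop would spin
         * forever here, so cap the model at a few iterations */
        while (!css_tryget_online_model(&memcg) && ++spins < 5)
            ;
        printf("tryget_online: still failing after %d spins\n", spins);

        printf("tryget: %s\n", css_tryget_model(&memcg) ? "succeeds" : "fails");
        return 0;
    }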

    Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Tejun Heo
    Reviewed-by: Shakeel Butt
    Cc: Michal Hocko
    Cc: Michal Koutný
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

07 Nov, 2019

3 commits

  • While upgrading from 4.16 to 5.2, we noticed these allocation errors in
    the log of the new kernel:

    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
    node 0: slabs: 5, objs: 170, free: 0

    slab_out_of_memory+1
    ___slab_alloc+969
    __slab_alloc+14
    kmem_cache_alloc+346
    inet_twsk_alloc+60
    tcp_time_wait+46
    tcp_fin+206
    tcp_data_queue+2034
    tcp_rcv_state_process+784
    tcp_v6_do_rcv+405
    __release_sock+118
    tcp_close+385
    inet_release+46
    __sock_release+55
    sock_close+17
    __fput+170
    task_work_run+127
    exit_to_usermode_loop+191
    do_syscall_64+212
    entry_SYSCALL_64_after_hwframe+68

    accompanied by an increase in machines going completely radio silent
    under memory pressure.

    One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
    sock objects to kmemcg"), which made these slab caches subject to cgroup
    memory accounting and control.

    The problem with that is that cgroups, unlike the page allocator, do not
    maintain dedicated atomic reserves. As a cgroup's usage hovers at its
    limit, atomic allocations - such as done during network rx - can fail
    consistently for extended periods of time. The kernel is not able to
    operate under these conditions.

    We don't want to revert the culprit patch, because it indeed tracks a
    potentially substantial amount of memory used by a cgroup.

    We also don't want to implement dedicated atomic reserves for cgroups.
    There is no point in keeping a fixed margin of unused bytes in the
    cgroup's memory budget to accommodate a consumer that is impossible to
    predict - we'd be wasting memory and get into configuration headaches,
    not unlike what we have going with min_free_kbytes. We do this for
    physical mem because we have to, but cgroups are an accounting game.

    Instead, account these privileged allocations to the cgroup, but let
    them bypass the configured limit if they have to. This way, we get the
    benefits of accounting the consumed memory and have it exert pressure on
    the rest of the cgroup, but like with the page allocator, we shift the
    burden of reclaiming on behalf of atomic allocations onto the regular
    allocations that can block.
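
    The idea can be sketched with a toy charge function in userspace C (an
    illustration of the approach only, not the kernel's try_charge()):
    blocking callers fail and would reclaim, while callers that cannot block
    are charged past the limit so the overage is still accounted.

    #include <stdio.h>
    #include <stdbool.h>

    struct counter { long usage, limit; };

    static bool try_charge(struct counter *c, long nr_pages, bool can_block)
    {
        if (c->usage + nr_pages <= c->limit) {
            c->usage += nr_pages;
            return true;
        }
        if (can_block)
            return false;    /* a blocking caller would reclaim and retry */

        /* privileged/atomic path: account it anyway and let regular
         * allocations pay the reclaim debt later */
        c->usage += nr_pages;
        return true;
    }

    int main(void)
    {
        struct counter memcg = { .usage = 100, .limit = 100 };
        bool ok;

        ok = try_charge(&memcg, 4, true);
        printf("blocking charge: %s\n", ok ? "ok" : "fails, would reclaim");

        ok = try_charge(&memcg, 4, false);
        printf("atomic charge: %s, usage now %ld (limit %ld)\n",
               ok ? "ok" : "fails", memcg.usage, memcg.limit);
        return 0;
    }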

    Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
    Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg")
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Cc: Suleiman Souhlal
    Cc: Michal Hocko
    Cc: [4.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
    slab pages, because it depends on PgHead AND PgSlab flags to be set to
    determine the memory cgroup from the kmem_cache. It's correct for
    compound pages, but not for generic small pages. Those don't have PgHead
    set, so it ends up returning zero.

    Fix this by replacing the condition with PageSlab() && !PageTail().

    Before this patch:
    [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
    0x0000000000000080 38 0 _______S___________________________________ slab

    After this patch:
    [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
    0x0000000000000080 147 0 _______S___________________________________ slab
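
    A toy model of the two checks (a sketch with simplified stand-in flags,
    not the real page flags) shows why order-0 slab pages were missed:

    #include <stdio.h>
    #include <stdbool.h>

    struct page { bool head, tail, slab; };

    /* old check: only compound (head) slab pages qualified */
    static bool old_check(struct page *p) { return p->head && p->slab; }

    /* new check: any slab page except tail pages */
    static bool new_check(struct page *p) { return p->slab && !p->tail; }

    int main(void)
    {
        struct page order0_slab = { .slab = true };   /* no PgHead set */
        struct page compound_head = { .head = true, .slab = true };

        printf("order-0 slab page:  old=%d new=%d\n",
               old_check(&order0_slab), new_check(&order0_slab));
        printf("compound slab head: old=%d new=%d\n",
               old_check(&compound_head), new_check(&compound_head));
        return 0;
    }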

    Also, hwpoison_filter_task() uses output of page_cgroup_ino() in order
    to filter error injection events based on memcg. So if
    page_cgroup_ino() fails to return memcg pointer, we just fail to inject
    memory error. Considering that hwpoison filter is for testing, affected
    users are limited and the impact should be marginal.

    [n-horiguchi@ah.jp.nec.com: changelog additions]
    Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
    Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: David Rientjes
    Cc: Vladimir Davydov
    Cc: Daniel Jordan
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • __mem_cgroup_free() can be called on the failure path in
    mem_cgroup_alloc(). However memcg_flush_percpu_vmstats() and
    memcg_flush_percpu_vmevents() which are called from __mem_cgroup_free()
    access the fields of memcg which can potentially be null if called from
    the failure path of mem_cgroup_alloc(). Indeed syzbot has reported the
    following crash:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
    Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
    RSP: 0018:ffff888095c27980 EFLAGS: 00010206
    RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
    RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
    RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
    R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
    R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
    FS: 00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
    mem_cgroup_free mm/memcontrol.c:5033 [inline]
    mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
    css_create kernel/cgroup/cgroup.c:5156 [inline]
    cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
    cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
    kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
    vfs_mkdir+0x42e/0x670 fs/namei.c:3807
    do_mkdirat+0x234/0x2a0 fs/namei.c:3830
    __do_sys_mkdir fs/namei.c:3846 [inline]
    __se_sys_mkdir fs/namei.c:3844 [inline]
    __x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
    do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fix this by moving the flush to mem_cgroup_free(), as there is no need
    to flush anything if we see a failure in mem_cgroup_alloc().

    Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com
    Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg")
    Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg")
    Signed-off-by: Shakeel Butt
    Reported-by: syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

19 Oct, 2019

1 commit

    Mapped, dirty and writeback pages are also counted in per-lruvec stats.
    These counters need updating when a page is moved between cgroups.

    Currently nobody is *consuming* the lruvec versions of these counters, so
    there is no user-visible effect.

    Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz
    Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure")
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Oct, 2019

1 commit

  • cgroup v2 introduces two memory protection thresholds: memory.low
    (best-effort) and memory.min (hard protection). While they generally do
    what they say on the tin, there is a limitation in their implementation
    that makes them difficult to use effectively: that cliff behaviour often
    manifests when they become eligible for reclaim. This patch implements
    more intuitive and usable behaviour, where we gradually mount more
    reclaim pressure as cgroups further and further exceed their protection
    thresholds.

    This cliff edge behaviour happens because we only choose whether or not
    to reclaim based on whether the memcg is within its protection limits
    (see the use of mem_cgroup_protected in shrink_node), but we don't vary
    our reclaim behaviour based on this information. Imagine the following
    timeline, with the numbers being the lruvec size in this zone:

    1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
    2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
    3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
    scanned. (?!)

    * Of course, we won't usually scan all available pages in the zone even
    without this patch because of scan control priority, over-reclaim
    protection, etc. However, as shown by the tests at the end, these
    techniques don't sufficiently throttle such an extreme change in input,
    so cliff-like behaviour isn't really averted by their existence alone.
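
    A rough userspace sketch of the proportional idea (illustrative only; the
    numbers echo the timeline above and the exact kernel formula differs):
    scan pressure grows with how far usage exceeds the protection instead of
    jumping from zero to everything.

    #include <stdio.h>

    static long old_scan(long lruvec, long usage, long prot)
    {
        return usage <= prot ? 0 : lruvec;            /* binary cliff */
    }

    static long new_scan(long lruvec, long usage, long prot)
    {
        if (usage <= prot)
            return 0;
        /* roughly lruvec * (usage - prot) / usage */
        return lruvec - lruvec * prot / usage;
    }

    int main(void)
    {
        long prot = 1000000, lruvec = 1000000;
        long usages[] = { 999999, 1000000, 1000001, 1100000, 2000000 };
        int i;

        for (i = 0; i < 5; i++)
            printf("usage %7ld: cliff scans %7ld, proportional scans %7ld\n",
                   usages[i],
                   old_scan(lruvec, usages[i], prot),
                   new_scan(lruvec, usages[i], prot));
        return 0;
    }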

    Here's an example of how this plays out in practice. At Facebook, we are
    trying to protect various workloads from "system" software, like
    configuration management tools, metric collectors, etc (see this[0] case
    study). In order to find a suitable memory.low value, we start by
    determining the expected memory range within which the workload will be
    comfortable operating. This isn't an exact science -- memory usage deemed
    "comfortable" will vary over time due to user behaviour, differences in
    composition of work, etc, etc. As such we need to ballpark memory.low,
    but doing this is currently problematic:

    1. If we end up setting it too low for the workload, it won't have
    *any* effect (see discussion above). The group will receive the full
    weight of reclaim and won't have any priority while competing with the
    less important system software, as if we had no memory.low configured
    at all.

    2. Because of this behaviour, we end up erring on the side of setting
    it too high, such that the comfort range is reliably covered. However,
    protected memory is completely unavailable to the rest of the system,
    so we might cause undue memory and IO pressure there when we *know* we
    have some elasticity in the workload.

    3. Even if we get the value totally right, smack in the middle of the
    comfort zone, we get extreme jumps between no pressure and full
    pressure that cause unpredictable pressure spikes in the workload due
    to the current binary reclaim behaviour.

    With this patch, we can set it to our ballpark estimation without too much
    worry. Any undesirable behaviour, such as too much or too little reclaim
    pressure on the workload or system will be proportional to how far our
    estimation is off. This means we can set memory.low much more
    conservatively and thus waste less resources *without* the risk of the
    workload falling off a cliff if we overshoot.

    As a more abstract technical description, this unintuitive behaviour
    results in having to give high-priority workloads a large protection
    buffer on top of their expected usage to function reliably, as otherwise
    we have abrupt periods of dramatically increased memory pressure which
    hamper performance. Having to set these thresholds so high wastes
    resources and generally works against the principle of work conservation.
    In addition, having proportional memory reclaim behaviour has other
    benefits. Most notably, before this patch it's basically mandatory to set
    memory.low to a higher than desirable value because otherwise as soon as
    you exceed memory.low, all protection is lost, and all pages are eligible
    to scan again. By contrast, having a gradual ramp in reclaim pressure
    means that you now still get some protection when thresholds are exceeded,
    which means that one can now be more comfortable setting memory.low to
    lower values without worrying that all protection will be lost. This is
    important because workingset size is really hard to know exactly,
    especially with variable workloads, so at least getting *some* protection
    if your workingset size grows larger than you expect increases user
    confidence in setting memory.low without a huge buffer on top being
    needed.

    Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
    assistance in thinking about how to make this work better.

    In testing these changes, I intended to verify that:

    1. Changes in page scanning become gradual and proportional instead of
    binary.

    To test this, I experimented stepping further and further down
    memory.low protection on a workload that floats around 19G workingset
    when under memory.low protection, watching page scan rates for the
    workload cgroup:

    +------------+-----------------+--------------------+--------------+
    | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
    +------------+-----------------+--------------------+--------------+
    |        21G |               0 |                  0 |          N/A |
    |        17G |             867 |               3799 |          23% |
    |        12G |            1203 |               3543 |          34% |
    |         8G |            2534 |               3979 |          64% |
    |         4G |            3980 |               4147 |          96% |
    |          0 |            3799 |               3980 |          95% |
    +------------+-----------------+--------------------+--------------+

    As you can see, the test kernel (containing this patch) ramps up page
    scanning significantly more gradually than the control kernel (without
    this patch).

    2. More gradual ramp up in reclaim aggression doesn't result in
    premature OOMs.

    To test this, I wrote a script that slowly increments the number of
    pages held by stress(1)'s --vm-keep mode until a production system
    entered severe overall memory contention. This script runs in a highly
    protected slice taking up the majority of available system memory.
    Watching vmstat revealed that page scanning continued essentially
    nominally between test and control, without causing forward reclaim
    progress to become arrested.

    [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

    [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
    [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
    Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
    Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     

26 Sep, 2019

1 commit

  • Thomas has noticed the following NULL ptr dereference when using cgroup
    v1 kmem limit:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    PGD 0
    P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
    Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
    RIP: 0010:create_empty_buffers+0x24/0x100
    Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
    RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
    RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
    RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
    R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
    R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
    FS: 00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
    Call Trace:
    create_page_buffers+0x4d/0x60
    __block_write_begin_int+0x8e/0x5a0
    ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
    ? jbd2__journal_start+0xd7/0x1f0
    ext4_da_write_begin+0x112/0x3d0
    generic_perform_write+0xf1/0x1b0
    ? file_update_time+0x70/0x140
    __generic_file_write_iter+0x141/0x1a0
    ext4_file_write_iter+0xef/0x3b0
    __vfs_write+0x17e/0x1e0
    vfs_write+0xa5/0x1a0
    ksys_write+0x57/0xd0
    do_syscall_64+0x55/0x160
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Tetsuo then noticed that this is because __memcg_kmem_charge_memcg()
    fails the __GFP_NOFAIL charge when the kmem limit is reached. This is
    wrong behavior because nofail allocations are not allowed to fail. The
    normal charge path simply forces the charge even if that means crossing
    the limit. Kmem accounting should be doing the same.

    Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Thomas Lindroth
    Debugged-by: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Andrey Ryabinin
    Cc: Thomas Lindroth
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Sep, 2019

7 commits

  • Currently the THP deferred split shrinker is not memcg aware, which may
    cause premature OOM with some configurations. For example the below test
    would run into premature OOM easily:

    $ cgcreate -g memory:thp
    $ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
    $ cgexec -g memory:thp transhuge-stress 4000

    transhuge-stress comes from kernel selftest.

    It is easy to hit OOM, but there are still a lot of THPs on the deferred
    split queue; memcg direct reclaim can't touch them since the deferred
    split shrinker is not memcg aware.

    Convert the deferred split shrinker to be memcg aware by introducing a
    per-memcg deferred split queue. The THP should be on either the per-node
    or the per-memcg deferred split queue, depending on whether it belongs to
    a memcg. When the page is migrated to another memcg, it will be moved to
    the target memcg's deferred split queue too.

    Reuse the second tail page's deferred_list for the per-memcg list, since
    the same THP can't be on multiple deferred split queues.

    [yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
    Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Currently the shrinker is just allocated and can work when memcg kmem is
    enabled. But the THP deferred split shrinker is not a slab shrinker, so it
    doesn't make too much sense to have such a shrinker depend on memcg kmem.
    It should be able to reclaim THPs even when memcg kmem is disabled.

    Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinkers.
    When memcg kmem is disabled, only such shrinkers can be called when
    shrinking memcg slabs.

    [yang.shi@linux.alibaba.com: add comment]
    Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • The cgroup v1 memcg controller has exposed a dedicated kmem limit to users,
    which turned out to be a really bad idea because there are paths which
    cannot shrink the kernel memory usage enough to get below the limit (e.g.
    because the accounted memory is not reclaimable). There are cases when the
    failure is not even allowed (e.g. __GFP_NOFAIL). This means that the kmem
    limit is in excess of the hard limit without any way to shrink, and is thus
    completely useless. The OOM killer cannot be invoked to handle the
    situation because that would lead to premature OOM killing.

    As a result many places might see ENOMEM returned from kmalloc and run
    into unexpected errors. E.g. a global OOM kill when there is a lot of
    free memory, because ENOMEM is translated into VM_FAULT_OOM in the #PF
    path and therefore pagefault_out_of_memory() would invoke the OOM killer.

    Please note that the kernel memory is still accounted to the overall limit
    along with the user memory, so removing the kmem-specific limit should
    still allow containing kernel memory consumption. Unlike the kmem one,
    though, the overall limit invokes memory reclaim and targeted memcg OOM
    killing if necessary.

    Start the deprecation process by crying into the kernel log. Let's see
    whether there are relevant usecases, and simply return EINVAL in the
    second stage if nobody complains within a few releases.

    [akpm@linux-foundation.org: tweak documentation text]
    Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Andrey Ryabinin
    Cc: Thomas Lindroth
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mem_cgroup_id_get() was introduced in commit 73f576c04b94 ("mm: memcontrol:
    fix cgroup creation failure after many small jobs").

    Later, it no longer had any users after the commits,

    1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
    58fa2a5512d9 ("mm: memcontrol: add sanity checks for memcg->id.ref on get/put")

    so it is safe to remove it.

    Link: http://lkml.kernel.org/r/1568648453-5482-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • Commit 72f0184c8a00 ("mm, memcg: remove hotplug locking from try_charge")
    introduced css_tryget()/css_put() calls in drain_all_stock(), which are
    supposed to protect the target memory cgroup from being released during
    the mem_cgroup_is_descendant() call.

    However, it's not completely safe. In theory, memcg can go away between
    reading stock->cached pointer and calling css_tryget().

    This can happen if drain_all_stock() races with drain_local_stock()
    performed on the remote cpu as a result of a work, scheduled by the
    previous invocation of drain_all_stock().

    The race is a bit theoretical and there are few chances to trigger it, but
    the current code looks a bit confusing, so it makes sense to fix it
    anyway. The code looks as if css_tryget() and css_put() are used to
    protect the stock draining. That's not necessary, because stocked pages
    are holding references to the cached cgroup. And it obviously won't work
    for work items scheduled on other cpus.

    So, let's read the stock->cached pointer and evaluate the memory cgroup
    inside an RCU read section, and get rid of the css_tryget()/css_put() calls.

    Link: http://lkml.kernel.org/r/20190802192241.3253165-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • We're trying to use memory.high to limit workloads, but have found that
    containment can frequently fail completely and cause OOM situations
    outside of the cgroup. This happens especially with swap space -- either
    when none is configured, or swap is full. These failures often also don't
    have enough warning to allow one to react, whether for a human or for a
    daemon monitoring PSI.

    Here is output from a simple program showing how long it takes in usec
    (column 2) to allocate a megabyte of anonymous memory (column 1) when a
    cgroup is already beyond its memory high setting, and no swap is
    available:

    [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
    > --wait -t timeout 300 /root/mdf
    [...]
    95 1035
    96 1038
    97 1000
    98 1036
    99 1048
    100 1590
    101 1968
    102 1776
    103 1863
    104 1757
    105 1921
    106 1893
    107 1760
    108 1748
    109 1843
    110 1716
    111 1924
    112 1776
    113 1831
    114 1766
    115 1836
    116 1588
    117 1912
    118 1802
    119 1857
    120 1731
    [...]
    [System OOM in 2-3 seconds]

    The delay does go up extremely marginally past the 100MB memory.high
    threshold, as now we spend time scanning before returning to usermode, but
    it's nowhere near enough to contain growth. It also doesn't get worse the
    more pages you have, since it only considers nr_pages.

    The current situation goes against both the expectations of users of
    memory.high, and our intentions as cgroup v2 developers. In
    cgroup-v2.txt, we claim that we will throttle and only under "extreme
    conditions" will memory.high protection be breached. Likewise, cgroup v2
    users generally also expect that memory.high should throttle workloads as
    they exceed their high threshold. However, as seen above, this isn't
    always how it works in practice -- even on banal setups like those with no
    swap, or where swap has become exhausted, we can end up with memory.high
    being breached and us having no weapons left in our arsenal to combat
    runaway growth with, since reclaim is futile.

    It's also hard for system monitoring software or users to tell how bad the
    situation is, as "high" events for the memcg may in some cases be benign,
    and in others be catastrophic. The current status quo is that we fail
    containment in a way that doesn't provide any advance warning that things
    are about to go horribly wrong (for example, we are about to invoke the
    kernel OOM killer).

    This patch introduces explicit throttling when reclaim is failing to keep
    memcg size contained at the memory.high setting. It does so by applying
    an exponential delay curve derived from the memcg's overage compared to
    memory.high. In the normal case where the memcg is either below or only
    marginally over its memory.high setting, no throttling will be performed.

    This composes well with system health monitoring and remediation, as these
    allocator delays are factored into PSI's memory pressure calculations.
    This both creates a mechanism for system administrators or applications
    consuming the PSI interface to trivially see that the memcg in question is
    struggling and use that to make more reasonable decisions, and permits
    them enough time to act. Either of these can act with significantly more
    nuance than we can provide using the system OOM killer.

    This is a similar idea to memory.oom_control in cgroup v1 which would put
    the cgroup to sleep if the threshold was violated, but it's also
    significantly improved as it results in visible memory pressure, and also
    doesn't schedule indefinitely, which previously made tracing and other
    introspection difficult (ie. it's clamped at 2*HZ per allocation through
    MEMCG_MAX_HIGH_DELAY_JIFFIES).

    Contrast the previous results with a kernel with this patch:

    [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
    > --wait -t timeout 300 /root/mdf
    [...]
    95 1002
    96 1000
    97 1002
    98 1003
    99 1000
    100 1043
    101 84724
    102 330628
    103 610511
    104 1016265
    105 1503969
    106 2391692
    107 2872061
    108 3248003
    109 4791904
    110 5759832
    111 6912509
    112 8127818
    113 9472203
    114 12287622
    115 12480079
    116 14144008
    117 15808029
    118 16384500
    119 16383242
    120 16384979
    [...]

    As you can see, in the normal case, memory allocation takes around 1000
    usec. However, as we exceed our memory.high, things start to increase
    exponentially, but fairly leniently at first. Our first megabyte over
    memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
    next is almost an entire second. This gets worse until we reach our
    eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
    However, this is still making forward progress, so permits tracing or
    further analysis with programs like GDB.

    We use an exponential curve for our delay penalty for a few reasons:

    1. We run mem_cgroup_handle_over_high to potentially do reclaim after
    we've already performed allocations, which means that temporarily
    going over memory.high by a small amount may be perfectly legitimate,
    even for compliant workloads. We don't want to unduly penalise such
    cases.
    2. An exponential curve (as opposed to a static or linear delay) allows
    ramping up memory pressure stats more gradually, which can be useful
    to work out that you have set memory.high too low, without destroying
    application performance entirely.
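
    A toy calculation of such a clamped, overage-squared delay curve (the
    scaling constants below are invented for the illustration and are not the
    kernel's):

    #include <stdio.h>

    #define HZ 100                     /* illustrative tick rate */
    #define MAX_DELAY (2 * HZ)         /* the 2*HZ clamp mentioned above */

    static long long penalty_jiffies(long long usage, long long high)
    {
        long long overage, penalty;

        if (usage <= high)
            return 0;
        overage = usage - high;
        /* grows with the square of the overage, then clamped */
        penalty = overage * overage * MAX_DELAY / (high * high / 64);
        return penalty > MAX_DELAY ? MAX_DELAY : penalty;
    }

    int main(void)
    {
        long long high = 25600;        /* 100M worth of 4k pages */
        long long usages[] = { 25000, 25856, 26112, 26624, 28160, 51200 };
        int i;

        for (i = 0; i < 6; i++)
            printf("usage %6lld pages -> sleep %3lld jiffies\n",
                   usages[i], penalty_jiffies(usages[i], high));
        return 0;
    }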

    This patch expands on earlier work by Johannes Weiner. Thanks!

    [akpm@linux-foundation.org: fix max() warning]
    [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
    [akpm@linux-foundation.org: fix it even more]
    [chris@chrisdown.name: fix 64-bit divide even more]
    Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Michal Hocko
    Cc: Nathan Chancellor
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
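
    For illustration (a sketch using a stand-in struct, not the real struct
    page), the helper is just a clearer spelling of the same expression:

    #include <stdio.h>

    struct page { unsigned int order; };     /* stand-in for struct page */

    static unsigned int compound_order(struct page *page) { return page->order; }

    /* number of base pages in the compound page */
    static unsigned long compound_nr(struct page *page)
    {
        return 1UL << compound_order(page);
    }

    int main(void)
    {
        struct page thp = { .order = 9 };    /* a 2MB THP on x86-64 */

        printf("%lu == %lu\n", 1UL << compound_order(&thp), compound_nr(&thp));
        return 0;
    }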

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

22 Sep, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is more cleanup and consolidation of the hmm APIs and the very
    strongly related mmu_notifier interfaces. Many places across the tree
    using these interfaces are touched in the process. Beyond that a
    cleanup to the page walker API and a few memremap related changes
    round out the series:

    - General improvement of hmm_range_fault() and related APIs, more
    documentation, bug fixes from testing, API simplification &
    consolidation, and unused API removal

    - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
    and make them internal kconfig selects

    - Hoist a lot of code related to mmu notifier attachment out of
    drivers by using a refcount get/put attachment idiom and remove the
    convoluted mmu_notifier_unregister_no_release() and related APIs.

    - General API improvement for the migrate_vma API and revision of its
    only user in nouveau

    - Annotate mmu_notifiers with lockdep and sleeping region debugging

    Two series unrelated to HMM or mmu_notifiers came along due to
    dependencies:

    - Allow pagemap's memremap_pages family of APIs to work without
    providing a struct device

    - Make walk_page_range() and related use a constant structure for
    function pointers"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
    libnvdimm: Enable unit test infrastructure compile checks
    mm, notifier: Catch sleeping/blocking for !blockable
    kernel.h: Add non_block_start/end()
    drm/radeon: guard against calling an unpaired radeon_mn_unregister()
    csky: add missing brackets in a macro for tlb.h
    pagewalk: use lockdep_assert_held for locking validation
    pagewalk: separate function pointers from iterator data
    mm: split out a new pagewalk.h header from mm.h
    mm/mmu_notifiers: annotate with might_sleep()
    mm/mmu_notifiers: prime lockdep
    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
    mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
    mm/hmm: hmm_range_fault() infinite loop
    mm/hmm: hmm_range_fault() NULL pointer bug
    mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
    mm/mmu_notifiers: remove unregister_no_release
    RDMA/odp: remove ib_ucontext from ib_umem
    RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    ...

    Linus Torvalds
     

18 Sep, 2019

1 commit

  • Pull block updates from Jens Axboe:

    - Two NVMe pull requests:
    - ana log parse fix from Anton
    - nvme quirks support for Apple devices from Ben
    - fix missing bio completion tracing for multipath stack devices
    from Hannes and Mikhail
    - IP TOS settings for nvme rdma and tcp transports from Israel
    - rq_dma_dir cleanups from Israel
    - tracing for Get LBA Status command from Minwoo
    - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
    - Some consolidation between the fabrics transports for handling
    the CAP register
    - reset race with ns scanning fix for fabrics (move fabrics
    commands to a dedicated request queue with a different lifetime
    from the admin request queue)."
    - controller reset and namespace scan races fixes
    - nvme discovery log change uevent support
    - naming improvements from Keith
    - multiple discovery controllers reject fix from James
    - some regular cleanups from various people

    - Series fixing (and re-fixing) null_blk debug printing and nr_devices
    checks (André)

    - A few pull requests from Song, with fixes from Andy, Guoqing,
    Guilherme, Neil, Nigel, and Yufen.

    - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

    - Bio merge handling unification (Christoph)

    - Pick default elevator correctly for devices with special needs
    (Damien)

    - Block stats fixes (Hou)

    - Timeout and support devices nbd fixes (Mike)

    - Series fixing races around elevator switching and device add/remove
    (Ming)

    - sed-opal cleanups (Revanth)

    - Per device weight support for BFQ (Fam)

    - Support for blk-iocost, a new model that can properly account cost of
    IO workloads. (Tejun)

    - blk-cgroup writeback fixes (Tejun)

    - paride queue init fixes (zhengbin)

    - blk_set_runtime_active() cleanup (Stanley)

    - Block segment mapping optimizations (Bart)

    - lightnvm fixes (Hans/Minwoo/YueHaibing)

    - Various little fixes and cleanups

    * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
    null_blk: format pr_* logs with pr_fmt
    null_blk: match the type of parameter nr_devices
    null_blk: do not fail the module load with zero devices
    block: also check RQF_STATS in blk_mq_need_time_stamp()
    block: make rq sector size accessible for block stats
    bfq: Fix bfq linkage error
    raid5: use bio_end_sector in r5_next_bio
    raid5: remove STRIPE_OPS_REQ_PENDING
    md: add feature flag MD_FEATURE_RAID0_LAYOUT
    md/raid0: avoid RAID0 data corruption due to layout confusion.
    raid5: don't set STRIPE_HANDLE to stripe which is in batch list
    raid5: don't increment read_errors on EILSEQ return
    nvmet: fix a wrong error status returned in error log page
    nvme: send discovery log page change events to userspace
    nvme: add uevent variables for controller devices
    nvme: enable aen regardless of the presence of I/O queues
    nvme-fabrics: allow discovery subsystems accept a kato
    nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
    nvme: Remove redundant assignment of cq vector
    nvme: Assign subsys instance from first ctrl
    ...

    Linus Torvalds
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixes data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Add a new header for the two handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

31 Aug, 2019

3 commits

  • Instead of using raw_cpu_read(), use per_cpu() to read the actual data of
    the corresponding cpu; otherwise we will be reading the data of the
    current cpu for the number of online CPUs.
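
    A toy model of the difference between the two accessors (a sketch with
    simplified stand-ins, not the real percpu machinery):

    #include <stdio.h>

    #define NR_CPUS 4

    static long stat[NR_CPUS] = { 10, 20, 30, 40 };
    static int current_cpu;                  /* the summing loop runs on cpu 0 */

    /* raw_cpu_read() model: always reads the current cpu's copy */
    static long raw_cpu_read_model(void) { return stat[current_cpu]; }

    /* per_cpu(var, cpu) model: reads the given cpu's copy */
    static long per_cpu_model(int cpu) { return stat[cpu]; }

    int main(void)
    {
        long wrong = 0, right = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
            wrong += raw_cpu_read_model();   /* cpu 0's value, four times */
            right += per_cpu_model(cpu);
        }
        printf("buggy sum %ld, correct sum %ld\n", wrong, right);
        return 0;
    }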

    Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com
    Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg")
    Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg")
    Signed-off-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • …h the hierarchical ones"

    Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
    with the hierarchical ones") effectively decreased the precision of
    per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.

    That's good for displaying in memory.stat, but brings a serious
    regression into the reclaim process.

    One issue I've discovered and debugged is the following:
    lruvec_lru_size() can return 0 instead of the actual number of pages in
    the lru list, preventing the kernel from reclaiming the last remaining
    pages. The result is yet another flood of dying memory cgroups. The
    opposite is also happening: scanning an empty lru list is a waste of cpu
    time.

    Also, inactive_list_is_low() can return incorrect values, preventing the
    active lru from being scanned and freed. It can fail both because the
    sizes of the active and inactive lists are inaccurate, and because the
    number of workingset refaults isn't precise. In other words, the result
    is pretty random.

    I'm not sure if using the approximate number of slab pages in
    count_shadow_number() is acceptable, but the issues described above are
    enough to partially revert the patch.

    Let's keep per-memcg vmstat_local batched (they are only used for
    displaying stats to userspace), but keep lruvec stats precise. This
    change fixes the dead memcg flooding on my setup.

    Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com
    Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Acked-by: Yafang Shao <laoar.shao@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Roman Gushchin
     
  • I've noticed that the "slab" value in memory.stat is sometimes 0, even
    if some child memory cgroups have a non-zero "slab" value. The
    following investigation showed that this is the result of the kmem_cache
    reparenting in combination with the per-cpu batching of slab vmstats.

    At offlining, some vmstat values may remain in the percpu cache, never
    being propagated upwards by the cgroup hierarchy. It means that stats
    on ancestor levels are lower than the actual values. Later, when slab
    pages are released, the precise number of pages is subtracted on the
    parent level, making the value negative. We don't show negative values;
    0 is printed instead.

    To fix this issue, let's flush percpu slab memcg and lruvec stats on
    memcg offlining. This guarantees that numbers on all ancestor levels
    are accurate and match the actual number of outstanding slab pages.

    Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com
    Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
    Signed-off-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

30 Aug, 2019

1 commit


27 Aug, 2019

1 commit

  • There's an inherent mismatch between memcg and writeback. The former
    tracks ownership per-page while the latter per-inode. This was a
    deliberate design decision because honoring per-page ownership in the
    writeback path is complicated, may lead to higher CPU and IO overheads
    and deemed unnecessary given that write-sharing an inode across
    different cgroups isn't a common use-case.

    Combined with inode majority-writer ownership switching, this works
    well enough in most cases but there are some pathological cases. For
    example, let's say there are two cgroups A and B which keep writing to
    different but confined parts of the same inode. B owns the inode and
    A's memory is limited far below B's. A's dirty ratio can rise enough
    to trigger balance_dirty_pages() sleeps but B's can be low enough to
    avoid triggering background writeback. A will be slowed down without
    a way to make writeback of the dirty pages happen.

    This patch implements foreign dirty recording and foreign mechanism so
    that when a memcg encounters a condition as above it can trigger
    flushes on bdi_writebacks which can clean its pages. Please see the
    comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
    details.

    A reproducer follows.

    write-range.c::

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>

    static const char *usage = "write-range FILE START SIZE\n";

    int main(int argc, char **argv)
    {
        int fd;
        unsigned long start, size, end, pos;
        char *endp;
        char buf[4096];

        if (argc < 4) {
            fprintf(stderr, usage);
            return 1;
        }

        fd = open(argv[1], O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        start = strtoul(argv[2], &endp, 0);
        if (*endp != '\0') {
            fprintf(stderr, usage);
            return 1;
        }

        size = strtoul(argv[3], &endp, 0);
        if (*endp != '\0') {
            fprintf(stderr, usage);
            return 1;
        }

        end = start + size;

        while (1) {
            for (pos = start; pos < end; ) {
                long bread, bwritten = 0;

                if (lseek(fd, pos, SEEK_SET) < 0) {
                    perror("lseek");
                    return 1;
                }

                bread = read(0, buf, sizeof(buf) < end - pos ?
                             sizeof(buf) : end - pos);
                if (bread < 0) {
                    perror("read");
                    return 1;
                }
                if (bread == 0)
                    return 0;

                while (bwritten < bread) {
                    long this;

                    this = write(fd, buf + bwritten,
                                 bread - bwritten);
                    if (this < 0) {
                        perror("write");
                        return 1;
                    }

                    bwritten += this;
                    pos += bwritten;
                }
            }
        }
    }

    repro.sh::

    #!/bin/bash

    set -e
    set -x

    sysctl -w vm.dirty_expire_centisecs=300000
    sysctl -w vm.dirty_writeback_centisecs=300000
    sysctl -w vm.dirtytime_expire_seconds=300000
    echo 3 > /proc/sys/vm/drop_caches

    TEST=/sys/fs/cgroup/test
    A=$TEST/A
    B=$TEST/B

    mkdir -p $A $B
    echo "+memory +io" > $TEST/cgroup.subtree_control
    echo $((1<<30)) > $A/memory.high
    echo $((32<<30)) > $B/memory.high

    rm -f testfile
    touch testfile
    fallocate -l 4G testfile

    echo "Starting B"

    (echo $BASHPID > $B/cgroup.procs
     pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &

    [...]

    echo "Starting A"

    (echo $BASHPID > $A/cgroup.procs
     pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Aug, 2019

2 commits

  • Similar to vmstats, percpu caching of local vmevents leads to an
    accumulation of errors on non-leaf levels. This happens because some
    leftovers may remain in percpu caches, so that they are never propagated
    up the cgroup tree and simply disappear into nonexistence when the
    memory cgroup is released.

    To fix this issue let's accumulate and propagate percpu vmevents values
    before releasing the memory cgroup similar to what we're doing with
    vmstats.

    Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
    only over online cpus.

    Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
    Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Percpu caching of local vmstats with the conditional propagation by the
    cgroup tree leads to an accumulation of errors on non-leaf levels.

    Let's imagine two nested memory cgroups A and A/B. Say, a process
    belonging to A/B allocates 100 pagecache pages on CPU 0. The percpu
    cache will spill 3 times, so that 32*3=96 pages will be accounted to the
    A/B and A atomic vmstat counters, and 4 pages will remain in the percpu
    cache.

    Imagine A/B is near its memory.max, so that every following allocation
    triggers a direct reclaim on the local CPU. Say, each such attempt will
    free 16 pages on a new cpu. That means every percpu cache will have -16
    pages, except the first one, which will have 4 - 16 = -12. A/B and A
    atomic counters will not be touched at all.

    Now a user removes A/B. All percpu caches are freed and corresponding
    vmstat numbers are forgotten. A has 96 pages more than expected.

    As memory cgroups are created and destroyed, errors do accumulate. Even
    1-2 page differences can accumulate into large numbers.
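
    The arithmetic of the example above can be reproduced in a few lines of
    userspace C (a sketch; the 32-page batch and the 100/16-page numbers come
    from the text, the rest is a simplified model):

    #include <stdio.h>

    #define BATCH 32                   /* spill threshold */
    #define NR_CPUS 8

    int main(void)
    {
        long cache[NR_CPUS] = { 0 };
        long parent_atomic = 0;        /* A's atomic counter */
        int i, cpu;

        /* A/B allocates 100 pagecache pages, all on cpu 0: 3 spills of 32 */
        for (i = 0; i < 100; i++) {
            if (++cache[0] >= BATCH) {
                parent_atomic += cache[0];
                cache[0] = 0;
            }
        }

        /* the pages are freed again in small batches on other cpus;
         * -16 never reaches the -32 threshold, so nothing spills */
        for (cpu = 1; cpu <= 6; cpu++)
            cache[cpu] -= 16;
        cache[0] -= 4;

        /* A/B is destroyed: all of its percpu caches are forgotten */
        printf("pages really charged now: 0\n");
        printf("parent A still thinks:    %ld\n", parent_atomic);
        return 0;
    }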

    To fix this issue let's accumulate and propagate percpu vmstat values
    before releasing the memory cgroup. At this point these numbers are
    stable and cannot be changed.

    Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
    only over online cpus.

    Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
    Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

14 Aug, 2019

2 commits

  • Memcg counters for shadow nodes are broken because the memcg pointer is
    obtained in a wrong way. The following approach is used:
    virt_to_page(xa_node)->mem_cgroup

    Since commit 4d96ba353075 ("mm: memcg/slab: stop setting
    page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
    set for slab pages, so memcg_from_slab_page() should be used instead.

    Also I doubt that it ever worked correctly: virt_to_head_page() should
    be used instead of virt_to_page(). Otherwise objects residing on tail
    pages are not accounted, because only the head page contains a valid
    mem_cgroup pointer. That has been the case since the introduction of
    these counters by commit 68d48e6a2df5 ("mm: workingset: add vmstat
    counter for shadow nodes").

    Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
    Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
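
    A minimal sketch of the corrected lookup, wrapped in a hypothetical
    helper (shadow_node_memcg() is illustrative only; memcg_from_slab_page()
    is the helper named above, and callers are expected to keep the memcg
    alive, e.g. via rcu_read_lock()):

    static struct mem_cgroup *shadow_node_memcg(struct xa_node *node)
    {
        /* tail pages don't carry the slab/memcg linkage, use the head page */
        struct page *page = virt_to_head_page(node);

        /* page->mem_cgroup is no longer set for slab pages */
        return memcg_from_slab_page(page);
    }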
     
  • This patch is sent to report a use-after-free in mem_cgroup_iter()
    observed after merging commit be2657752e9e ("mm: memcg: fix use after
    free in mem_cgroup_iter()").

    I work with the android kernel trees (4.9 & 4.14), and commit
    be2657752e9e ("mm: memcg: fix use after free in mem_cgroup_iter()") has
    been merged into them. However, I can still observe the use-after-free
    issues addressed by that commit (on low-end devices, a few times this
    month).

    backtrace:
    css_tryget <- mem_cgroup_iter <- shrink_node <- shrink_zones <-
    do_try_to_free_pages <- try_to_free_pages

    To debug, the memcg was poisoned before being freed:

    static void __mem_cgroup_free(struct mem_cgroup *memcg)
    {
    ...
    free_percpu(memcg->stat);
    + /* poison memcg before freeing it */
    + memset(memcg, 0x78, sizeof(struct mem_cgroup));
    kfree(memcg);
    }

    The coredump shows that position=0xdbbc2a00 is freed:

    (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
    $13 = {position = 0xdbbc2a00, generation = 0x2efd}

    0xdbbc2a00: 0xdbbc2e00 0x00000000 0xdbbc2800 0x00000100
    0xdbbc2a10: 0x00000200 0x78787878 0x00026218 0x00000000
    0xdbbc2a20: 0xdcad6000 0x00000001 0x78787800 0x00000000
    0xdbbc2a30: 0x78780000 0x00000000 0x0068fb84 0x78787878
    0xdbbc2a40: 0x78787878 0x78787878 0x78787878 0xe3fa5cc0
    0xdbbc2a50: 0x78787878 0x78787878 0x00000000 0x00000000
    0xdbbc2a60: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2a70: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2a80: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2a90: 0x00000001 0x00000000 0x00000000 0x00100000
    0xdbbc2aa0: 0x00000001 0xdbbc2ac8 0x00000000 0x00000000
    0xdbbc2ab0: 0x00000000 0x00000000 0x00000000 0x00000000
    0xdbbc2ac0: 0x00000000 0x00000000 0xe5b02618 0x00001000
    0xdbbc2ad0: 0x00000000 0x78787878 0x78787878 0x78787878
    0xdbbc2ae0: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2af0: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b00: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b10: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b20: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b30: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b40: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b50: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b60: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b70: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2b80: 0x78787878 0x78787878 0x00000000 0x78787878
    0xdbbc2b90: 0x78787878 0x78787878 0x78787878 0x78787878
    0xdbbc2ba0: 0x78787878 0x78787878 0x78787878 0x78787878

    In the reclaim path, try_to_free_pages() does not set up
    sc.target_mem_cgroup, and sc is passed to do_try_to_free_pages(), ...,
    shrink_node().

    In mem_cgroup_iter(), root is set to root_mem_cgroup because
    sc->target_mem_cgroup is NULL. It is possible to assign a memcg to
    root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().

    try_to_free_pages
    struct scan_control sc = {...}, target_mem_cgroup is 0x0;
    do_try_to_free_pages
    shrink_zones
    shrink_node
    mem_cgroup *root = sc->target_mem_cgroup;
    memcg = mem_cgroup_iter(root, NULL, &reclaim);
    mem_cgroup_iter()
    if (!root)
    root = root_mem_cgroup;
    ...

    css = css_next_descendant_pre(css, &root->css);
    memcg = mem_cgroup_from_css(css);
    cmpxchg(&iter->position, pos, memcg);

    My device uses memcg in non-hierarchical mode. When we release a memcg,
    invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
    In non-hierarchical mode, that walk never reaches root_mem_cgroup (a
    sketch of one possible fix follows this entry).

    static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
    {
    struct mem_cgroup *memcg = dead_memcg;

    for (; memcg; memcg = parent_mem_cgroup(memcg))
    ...
    }

    So the use-after-free scenario looks like this:

    CPU1:
    try_to_free_pages
      do_try_to_free_pages
        shrink_zones
          shrink_node
            mem_cgroup_iter()
              if (!root)
                  root = root_mem_cgroup;
              ...
              css = css_next_descendant_pre(css, &root->css);
              memcg = mem_cgroup_from_css(css);
              cmpxchg(&iter->position, pos, memcg);

    CPU2:
    invalidate_reclaim_iterators(memcg);
    ...
    __mem_cgroup_free()
      kfree(memcg);

    CPU1:
    try_to_free_pages
      do_try_to_free_pages
        shrink_zones
          shrink_node
            mem_cgroup_iter()
              if (!root)
                  root = root_mem_cgroup;
              ...
              mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
              iter = &mz->iter[reclaim->priority];
              pos = READ_ONCE(iter->position);
              css_tryget(&pos->css)
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
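
    One possible shape of a fix, sketched under the assumption of a helper
    __invalidate_reclaim_iterators() that clears the cached iterators of a
    single memcg (the helper name is illustrative):

    static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
    {
        struct mem_cgroup *memcg = dead_memcg;
        struct mem_cgroup *last;

        do {
            __invalidate_reclaim_iterators(memcg, dead_memcg);
            last = memcg;
        } while ((memcg = parent_mem_cgroup(memcg)));

        /*
         * In non-hierarchical mode the parent walk stops short of
         * root_mem_cgroup, whose iterators may still point at dead_memcg,
         * so clear them explicitly.
         */
        if (last != root_mem_cgroup)
            __invalidate_reclaim_iterators(root_mem_cgroup, dead_memcg);
    }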
     

17 Jul, 2019

1 commit

  • After commit 815744d75152 ("mm: memcontrol: don't batch updates of local
    VM stats and events"), the local VM counters are not in sync with the
    hierarchical ones.

    Below is one example in a leaf memcg on my server (with 8 CPUs):

    inactive_file 3567570944
    total_inactive_file 3568029696

    The deviation looks this large because 'val' in __mod_memcg_state() is
    in pages while the value reported by memcg_stat_show() is in bytes.

    So the maximum deviation between the local VM stats and the total VM
    stats is (32 * number_of_cpus * PAGE_SIZE), which may be unacceptably
    large.

    We should keep the local VM stats in sync with the total stats, as
    sketched after this entry. In order to keep this behavior consistent
    across counters, this patch updates __mod_lruvec_state() and
    __count_memcg_events() as well.

    Link: http://lkml.kernel.org/r/1562851979-10610-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Yafang Shao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
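
    A sketch of the fix in __mod_memcg_state(), assuming the
    vmstats_local/vmstats_percpu layout of this series: the local counter is
    updated with the same batched amount, at the same time as the
    hierarchical ones.

    void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
    {
        long x;

        if (mem_cgroup_disabled())
            return;

        x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
        if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
            struct mem_cgroup *mi;

            /*
             * Batch the local counter so it stays in sync with the
             * hierarchical ones updated below.
             */
            __this_cpu_add(memcg->vmstats_local->stat[idx], x);
            for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
                atomic_long_add(x, &mi->vmstats[idx]);
            x = 0;
        }
        __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
    }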
     

15 Jul, 2019

1 commit

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau to be using this API.

    - Remove old or transitional hmm APIs. These are holdovers from the
    past with no users, or APIs that existed only to manage cross-tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cutoff.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

13 Jul, 2019

8 commits

  • oom_unkillable_task() can be called from three different contexts, i.e.
    global OOM, memcg OOM and the oom_score procfs interface. At the moment
    oom_unkillable_task() does a task_in_mem_cgroup() check on the given
    process. Since there is no reason to perform the task_in_mem_cgroup()
    check for global OOM and the oom_score procfs interface, those contexts
    provide a NULL memcg and skip the task_in_mem_cgroup() check. However,
    in the memcg OOM context, oom_unkillable_task() is always called from
    mem_cgroup_scan_tasks(), so the task_in_mem_cgroup() check is redundant
    and effectively dead code. So, just remove the task_in_mem_cgroup()
    check altogether.

    Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Signed-off-by: Tetsuo Handa
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Paul Jackson
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Since commit c03cd7738a83 ("cgroup: Include dying leaders with live
    threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
    mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
    only one thread from each thread group, as sketched after this entry.

    [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
    Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
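
    A minimal sketch of the iteration pattern this enables; the surrounding
    mem_cgroup_scan_tasks() loop and the fn/arg scan callback are
    abbreviated here:

    struct css_task_iter it;
    struct task_struct *task;
    int ret = 0;

    /* CSS_TASK_ITER_PROCS walks one task per thread group */
    css_task_iter_start(&memcg->css, CSS_TASK_ITER_PROCS, &it);
    while (!ret && (task = css_task_iter_next(&it)))
        ret = fn(task, arg);    /* e.g. the OOM evaluation callback */
    css_task_iter_end(&it);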
     
  • Let's reparent non-root kmem_caches on memcg offlining. This allows us to
    release the memory cgroup without waiting for the last outstanding kernel
    object (e.g. dentry used by another application).

    Since the parent cgroup is already charged, all we need to do is to
    splice the list of kmem_caches onto the parent's kmem_caches list, swap
    the memcg pointer, drop the css refcounter for each kmem_cache and
    adjust the parent's css refcounter (see the sketch after this entry).

    Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer
    anymore. It's safe to read it under rcu_read_lock(), cgroup_mutex held,
    or any other way that protects the memory cgroup from being released.

    We can race with the slab allocation and deallocation paths. It's not a
    big problem: parent's charge and slab global stats are always correct, and
    we don't care anymore about the child usage and global stats. The child
    cgroup is already offline, so we don't use or show it anywhere.

    Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
    used anywhere except count_shadow_nodes(). But even there it won't break
    anything: after reparenting "nodes" will be 0 on child level (because
    we're already reparenting shrinker lists), and on parent level page stats
    always were 0, and this patch won't change anything.

    [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
    Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
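
    A sketch of that step; the function name is illustrative and the field
    and list names (kmem_caches, memcg_params.memcg,
    memcg_params.kmem_caches_node) are assumptions taken from this series,
    with slab_mutex locking implied by the caller:

    static void memcg_reparent_kmem_caches(struct mem_cgroup *memcg,
                                           struct mem_cgroup *parent)
    {
        struct kmem_cache *s;
        unsigned int nr = 0;

        list_for_each_entry(s, &memcg->kmem_caches,
                            memcg_params.kmem_caches_node) {
            /* swap the memcg pointer; readers must hold rcu or cgroup_mutex */
            WRITE_ONCE(s->memcg_params.memcg, parent);
            nr++;
        }

        /* splice all caches onto the parent's list in one go */
        list_splice_init(&memcg->kmem_caches, &parent->kmem_caches);

        /* move the css references from the child to the parent */
        css_get_many(&parent->css, nr);
        css_put_many(&memcg->css, nr);
    }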
     
  • Every slab page charged to a non-root memory cgroup has a pointer to the
    memory cgroup and holds a reference to it, which protects a non-empty
    memory cgroup from being released. At the same time the page has a
    pointer to the corresponding kmem_cache and also holds a reference to
    the kmem_cache. And the kmem_cache itself holds a reference to the
    cgroup.

    So there is clearly some redundancy, which allows us to stop setting the
    page->mem_cgroup pointer and instead get the memcg pointer indirectly
    via the kmem_cache. Further, it will make changing this pointer easier,
    without the need to go over all charged pages.

    So let's stop setting the page->mem_cgroup pointer for slab pages, and
    stop using the css refcounter directly for protecting the memory cgroup
    from going away. Instead, rely on the kmem_cache as an intermediate
    object, as sketched after this entry.

    Make sure that vmstats and shrinker lists are working as previously, as
    well as /proc/kpagecgroup interface.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
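
    A sketch of the indirection, in the shape of the memcg_from_slab_page()
    helper mentioned earlier in this log (field names are assumptions from
    this series; callers are expected to hold rcu_read_lock() or otherwise
    keep the memcg alive):

    static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
    {
        struct kmem_cache *s;

        /* slab pages keep a stable pointer to their kmem_cache */
        s = READ_ONCE(page->slab_cache);
        if (s && !is_root_cache(s))
            return s->memcg_params.memcg;

        return NULL;
    }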
     
  • Currently each charged slab page holds a reference to the cgroup to
    which it's charged. Kmem_caches are held by the memcg and are released
    together with the memory cgroup. It means that none of the kmem_caches
    is released unless at least one reference to the memcg exists, which is
    very far from optimal.

    Let's rework it in a way that allows releasing individual kmem_caches as
    soon as the cgroup is offline, the kmem_cache is empty and there are no
    pending allocations.

    To make it possible, let's introduce a new percpu refcounter for non-root
    kmem caches. The counter is initialized to the percpu mode, and is
    switched to the atomic mode during kmem_cache deactivation. The counter
    is bumped for every charged page and also for every running allocation.
    So the kmem_cache can't be released unless all allocations complete.

    To shut down inactive empty kmem_caches, let's reuse the work queue
    previously used for kmem_cache deactivation. Once the reference counter
    reaches 0, let's schedule an asynchronous kmem_cache release (see the
    sketch after this entry).

    * I used the following simple approach to test the performance
    (stolen from another patchset by T. Harding):

    time find / -name fname-no-exist
    echo 2 > /proc/sys/vm/drop_caches
    repeat 10 times

    Results:

    orig patched

    real 0m1.455s real 0m1.355s
    user 0m0.206s user 0m0.219s
    sys 0m0.855s sys 0m0.807s

    real 0m1.487s real 0m1.699s
    user 0m0.221s user 0m0.256s
    sys 0m0.806s sys 0m0.948s

    real 0m1.515s real 0m1.505s
    user 0m0.183s user 0m0.215s
    sys 0m0.876s sys 0m0.858s

    real 0m1.291s real 0m1.380s
    user 0m0.193s user 0m0.198s
    sys 0m0.843s sys 0m0.786s

    real 0m1.364s real 0m1.374s
    user 0m0.180s user 0m0.182s
    sys 0m0.868s sys 0m0.806s

    real 0m1.352s real 0m1.312s
    user 0m0.201s user 0m0.212s
    sys 0m0.820s sys 0m0.761s

    real 0m1.302s real 0m1.349s
    user 0m0.205s user 0m0.203s
    sys 0m0.803s sys 0m0.792s

    real 0m1.334s real 0m1.301s
    user 0m0.194s user 0m0.201s
    sys 0m0.806s sys 0m0.779s

    real 0m1.426s real 0m1.434s
    user 0m0.216s user 0m0.181s
    sys 0m0.824s sys 0m0.864s

    real 0m1.350s real 0m1.295s
    user 0m0.200s user 0m0.190s
    sys 0m0.842s sys 0m0.811s

    So it looks like the difference is not noticeable in this test.

    [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
    Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com
    Signed-off-by: Roman Gushchin
    Signed-off-by: Qian Cai
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
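
    A sketch of the refcount lifecycle; the refcnt and work fields in
    memcg_params, the memcg_kmem_cache_wq workqueue and the helper names are
    assumptions taken from the series context:

    static void kmemcg_cache_shutdown(struct percpu_ref *refcnt)
    {
        struct kmem_cache *s = container_of(refcnt, struct kmem_cache,
                                            memcg_params.refcnt);

        /* the last reference is gone: release the cache asynchronously */
        queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
    }

    static int memcg_cache_refcnt_init(struct kmem_cache *s)
    {
        /* starts in the cheap percpu mode */
        return percpu_ref_init(&s->memcg_params.refcnt,
                               kmemcg_cache_shutdown, 0, GFP_KERNEL);
    }

    static void memcg_cache_refcnt_deactivate(struct kmem_cache *s)
    {
        /*
         * Switch to atomic mode and drop the initial reference; charged
         * pages and in-flight allocations each hold percpu_ref_get()/
         * percpu_ref_put() references of their own.
         */
        percpu_ref_kill(&s->memcg_params.refcnt);
    }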
     
  • Let's separate the page counter modification code out of
    __memcg_kmem_uncharge(), in a way similar to how __memcg_kmem_charge()
    and __memcg_kmem_charge_memcg() work.

    This will allow reusing this code later via a new
    memcg_kmem_uncharge_memcg() wrapper, which calls
    __memcg_kmem_uncharge_memcg() if the memcg_kmem_enabled() check passes
    (see the sketch after this entry).

    Link: http://lkml.kernel.org/r/20190611231813.3148843-5-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
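
    A sketch of what such a wrapper could look like; the exact signature is
    an assumption:

    static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
                                                 struct mem_cgroup *memcg)
    {
        /* the static-key check stays in the inline wrapper */
        if (memcg_kmem_enabled())
            __memcg_kmem_uncharge_memcg(memcg, 1 << order);
    }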
     
  • The current cgroup OOM memory info dump doesn't include all the memory
    we are tracking, nor does it give insight into what the VM tried to do
    leading up to the OOM. All that useful info is in memory.stat.

    Furthermore, the recursive printing for every child cgroup can
    generate absurd amounts of data on the console for larger cgroup
    trees, and it's not like we provide a per-cgroup breakdown during
    global OOM kills.

    When an OOM kill is triggered, print one set of recursive memory.stat
    items at the level whose limit triggered the OOM condition.

    Example output:

    stress invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
    CPU: 2 PID: 210 Comm: stress Not tainted 5.2.0-rc2-mm1-00247-g47d49835983c #135
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
    Call Trace:
    dump_stack+0x46/0x60
    dump_header+0x4c/0x2d0
    oom_kill_process.cold.10+0xb/0x10
    out_of_memory+0x200/0x270
    ? try_to_free_mem_cgroup_pages+0xdf/0x130
    mem_cgroup_out_of_memory+0xb7/0xc0
    try_charge+0x680/0x6f0
    mem_cgroup_try_charge+0xb5/0x160
    __add_to_page_cache_locked+0xc6/0x300
    ? list_lru_destroy+0x80/0x80
    add_to_page_cache_lru+0x45/0xc0
    pagecache_get_page+0x11b/0x290
    filemap_fault+0x458/0x6d0
    ext4_filemap_fault+0x27/0x36
    __do_fault+0x2f/0xb0
    __handle_mm_fault+0x9c5/0x1140
    ? apic_timer_interrupt+0xa/0x20
    handle_mm_fault+0xc5/0x180
    __do_page_fault+0x1ab/0x440
    ? page_fault+0x8/0x30
    page_fault+0x1e/0x30
    RIP: 0033:0x55c32167fc10
    Code: Bad RIP value.
    RSP: 002b:00007fff1d031c50 EFLAGS: 00010206
    RAX: 000000000dc00000 RBX: 00007fd2db000010 RCX: 00007fd2db000010
    RDX: 0000000000000000 RSI: 0000000010001000 RDI: 0000000000000000
    RBP: 000055c321680a54 R08: 00000000ffffffff R09: 0000000000000000
    R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff
    R13: 0000000000000002 R14: 0000000000001000 R15: 0000000010000000
    memory: usage 1024kB, limit 1024kB, failcnt 75131
    swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /foo:
    anon 0
    file 0
    kernel_stack 36864
    slab 274432
    sock 0
    shmem 0
    file_mapped 0
    file_dirty 0
    file_writeback 0
    anon_thp 0
    inactive_anon 126976
    active_anon 0
    inactive_file 0
    active_file 0
    unevictable 0
    slab_reclaimable 0
    slab_unreclaimable 274432
    pgfault 59466
    pgmajfault 1617
    workingset_refault 2145
    workingset_activate 0
    workingset_nodereclaim 0
    pgrefill 98952
    pgscan 200060
    pgsteal 59340
    pgactivate 40095
    pgdeactivate 96787
    pglazyfree 0
    pglazyfreed 0
    thp_fault_alloc 0
    thp_collapse_alloc 0
    Tasks state (memory values in pages):
    [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    [ 200] 0 200 1121 884 53248 29 0 bash
    [ 209] 0 209 905 246 45056 19 0 stress
    [ 210] 0 210 66442 56 499712 56349 0 stress
    oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),oom_memcg=/foo,task_memcg=/foo,task=stress,pid=210,uid=0
    Memory cgroup out of memory: Killed process 210 (stress) total-vm:265768kB, anon-rss:0kB, file-rss:224kB, shmem-rss:0kB
    oom_reaper: reaped process 210 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    [hannes@cmpxchg.org: s/kvmalloc/kmalloc/ per Michal]
    Link: http://lkml.kernel.org/r/20190605161133.GA12453@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190604210509.9744-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memory controller in cgroup v2 exposes memory.events file for each
    memcg which shows the number of times events like low, high, max, oom
    and oom_kill have happened for the whole tree rooted at that memcg.
    Users can also poll or register notification to monitor the changes in
    that file. Any event at any level of the tree rooted at memcg will
    notify all the listeners along the path till root_mem_cgroup. There are
    existing users which depend on this behavior.

    However there are users which are only interested in the events
    happening at a specific level of the memcg tree and not in the events in
    the underlying tree rooted at that memcg. One such use-case is a
    centralized resource monitor which can dynamically adjust the limits of
    the jobs running on a system. The jobs can create their sub-hierarchy
    for their own sub-tasks. The centralized monitor is only interested in
    the events at the top level memcgs of the jobs as it can then act and
    adjust the limits of the jobs. Using the current memory.events for such
    centralized monitor is very inconvenient. The monitor will keep
    receiving events which it is not interested and to find if the received
    event is interesting, it has to read memory.event files of the next
    level and compare it with the top level one. So, let's introduce
    memory.events.local to the memcg which shows and notify for the events
    at the memcg level.

    Now, do memory.stat and memory.pressure need local versions? IMHO no,
    due to the no-internal-process constraint of cgroup v2. The memory.stat
    file of the top level memcg of a job shows the stats and vmevents of the
    whole tree. The local stats or vmevents of the top level memcg would
    only change if there were a process running in that memcg, but v2 does
    not allow that. Similarly, for memory.pressure there will not be any
    process in the internal nodes and thus no chance of local pressure.

    Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Chris Down
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
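
    A sketch of how the event bookkeeping could look with the local counters
    added; memory_events_local[] and events_local_file are assumed field
    names, and the hierarchical walk reflects the existing behavior
    described above:

    static inline void memcg_memory_event(struct mem_cgroup *memcg,
                                          enum memcg_memory_event event)
    {
        /* count and notify only at the memcg where the event happened */
        atomic_long_inc(&memcg->memory_events_local[event]);
        cgroup_file_notify(&memcg->events_local_file);

        /* existing behavior: count and notify every level up to the root */
        do {
            atomic_long_inc(&memcg->memory_events[event]);
            cgroup_file_notify(&memcg->events_file);
        } while ((memcg = parent_mem_cgroup(memcg)) &&
                 !mem_cgroup_is_root(memcg));
    }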