17 Apr, 2019

1 commit

  • commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.

    Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") memcg dirty and writeback counters are managed
    as:

    1) per-memcg per-cpu values in range of [-32..32]

    2) per-memcg atomic counter

    When a per-cpu counter cannot fit in [-32..32] it's flushed to the
    atomic. Stat readers only check the atomic. Thus readers such as
    balance_dirty_pages() may see a nontrivial error margin: 32 pages per
    cpu.

    Assuming 100 cpus:
    4k x86 page_size: 13 MiB error per memcg
    64k ppc page_size: 200 MiB error per memcg

    Considering that dirty+writeback are used together for some decisions the
    errors double.
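
    For reference, those numbers follow directly from the per-cpu batch
    size (a worked calculation, assuming the 100 cpus above):

    	100 cpus * 32 pages *  4 KiB/page = 12800 KiB  ~= 13 MiB
    	100 cpus * 32 pages * 64 KiB/page = 204800 KiB  = 200 MiB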

    This inaccuracy can lead to undeserved oom kills. One nasty case is
    when all per-cpu counters hold positive values offsetting an atomic
    negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
    balance_dirty_pages() only consults the atomic and does not consider
    throttling the next n_cpu*32 dirty pages. If the file_lru is in the
    13..200 MiB range then there's absolutely no dirty throttling, which
    burdens vmscan with only dirty+writeback pages thus resorting to oom
    kill.

    It could be argued that tiny containers are not supported, but it's more
    subtle. It's the amount of space available for the file lru that matters.
    If a container has memory.max-200MiB of non-reclaimable memory, then it
    will also suffer such oom kills on a 100 cpu machine.

    The following test reliably ooms without this patch. This patch avoids
    oom kills.

    $ cat test
    mount -t cgroup2 none /dev/cgroup
    cd /dev/cgroup
    echo +io +memory > cgroup.subtree_control
    mkdir test
    cd test
    echo 10M > memory.max
    (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
    (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

    $ cat memcg-writeback-stress.c
    /*
     * Dirty pages from all but one cpu.
     * Clean pages from the non dirtying cpu.
     * This is to stress per cpu counter imbalance.
     * On a 100 cpu machine:
     * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
     * - per memcg atomic is -99*32 pages
     * - thus the complete dirty limit: sum of all counters 0
     * - balance_dirty_pages() only sees atomic count -99*32 pages, which
     *   it max()s to 0.
     * - So a workload can dirty -99*32 pages before balance_dirty_pages()
     *   cares.
     */
    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/sysinfo.h>
    #include <sys/types.h>
    #include <unistd.h>

    static char *buf;
    static int bufSize;

    static void set_affinity(int cpu)
    {
    	cpu_set_t affinity;

    	CPU_ZERO(&affinity);
    	CPU_SET(cpu, &affinity);
    	if (sched_setaffinity(0, sizeof(affinity), &affinity))
    		err(1, "sched_setaffinity");
    }

    static void dirty_on(int output_fd, int cpu)
    {
    	int i, wrote;

    	set_affinity(cpu);
    	for (i = 0; i < 32; i++) {
    		for (wrote = 0; wrote < bufSize; ) {
    			int ret = write(output_fd, buf+wrote, bufSize-wrote);
    			if (ret == -1)
    				err(1, "write");
    			wrote += ret;
    		}
    	}
    }

    int main(int argc, char **argv)
    {
    	int cpu, flush_cpu = 1, output_fd;
    	const char *output;

    	if (argc != 2)
    		errx(1, "usage: output_file");

    	output = argv[1];
    	bufSize = getpagesize();
    	buf = malloc(getpagesize());
    	if (buf == NULL)
    		errx(1, "malloc failed");

    	output_fd = open(output, O_CREAT|O_RDWR);
    	if (output_fd == -1)
    		err(1, "open(%s)", output);

    	for (cpu = 0; cpu < get_nprocs(); cpu++) {
    		if (cpu != flush_cpu)
    			dirty_on(output_fd, cpu);
    	}

    	set_affinity(flush_cpu);
    	if (fsync(output_fd))
    		err(1, "fsync(%s)", output);
    	if (close(output_fd))
    		err(1, "close(%s)", output);
    	free(buf);
    }

    Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
    collect exact per memcg counters. This avoids the aforementioned oom
    kills.

    This does not affect the overhead of memory.stat, which still reads the
    single atomic counter.

    Why not use percpu_counter? memcg already handles cpus going offline, so
    no need for that overhead from percpu_counter. And the percpu_counter
    spinlocks are more heavyweight than is required.
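
    To make the error margin concrete, here is a minimal userspace model
    of the two-level counter described above (a sketch only; all names and
    structure are made up for illustration, this is not the kernel patch):

    /* Model: per-cpu deltas in [-32..32] are only flushed to a shared
     * atomic. An "exact" reader folds the unflushed deltas back in
     * instead of trusting the atomic alone. */
    #include <stdatomic.h>
    #include <stdio.h>

    #define NCPU  100
    #define BATCH 32

    static atomic_long total;          /* flushed, shared counter */
    static long pcpu_delta[NCPU];      /* unflushed per-cpu deltas */

    static long approx_read(void)      /* what stat readers saw before */
    {
    	return atomic_load(&total);
    }

    static long exact_read(void)       /* what balance_dirty_pages() wants */
    {
    	long val = atomic_load(&total);
    	int cpu;

    	for (cpu = 0; cpu < NCPU; cpu++)
    		val += pcpu_delta[cpu];
    	return val < 0 ? 0 : val;      /* transient negatives clamp to 0 */
    }

    int main(void)
    {
    	int cpu;

    	/* Every cpu holds +32 dirty pages that have not been flushed. */
    	for (cpu = 0; cpu < NCPU; cpu++)
    		pcpu_delta[cpu] = BATCH;

    	/* Prints approx=0 exact=3200: the nr_cpus * 32 page margin. */
    	printf("approx=%ld exact=%ld\n", approx_read(), exact_read());
    	return 0;
    }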

    It probably also makes sense to use exact dirty and writeback counters
    in memcg oom reports. But that is saved for later.

    Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Reviewed-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: [4.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Greg Thelen
     

06 Apr, 2019

1 commit

  • [ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]

    If a memory cgroup contains a single process with many threads
    (including different process groups sharing the mm) then it is possible
    to trigger a race where the oom killer complains into the log that
    there are no oom eligible tasks, which is both annoying and confusing
    because there is no actual problem. The race looks as follows:

    P1                         oom_reaper                P2
    try_charge                                           try_charge
      mem_cgroup_out_of_memory
        mutex_lock(oom_lock)
          out_of_memory
            oom_kill_process(P1,P2)
              wake_oom_reaper
        mutex_unlock(oom_lock)
                               oom_reap_task
                                                         mutex_lock(oom_lock)
                                                           select_bad_process # no victim

    The problem is more visible with many threads.

    Fix this by checking for fatal_signal_pending from
    mem_cgroup_out_of_memory when the oom_lock is already held.

    The oom bypass is safe because we do the same early in the try_charge
    path already; the situation might have changed in the meantime. It
    should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
    but for better code readability abstract the current charge bypass
    condition into should_force_charge and reuse it from that path.
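
    The abstracted bypass condition has roughly this shape (a sketch based
    on the description above, not the literal patch):

    	static bool should_force_charge(void)
    	{
    		/* The task is already dying or was already selected as an
    		 * OOM victim: let the charge proceed instead of invoking
    		 * the OOM killer again. */
    		return tsk_is_oom_victim(current) ||
    			fatal_signal_pending(current);
    	}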

    Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Tetsuo Handa
     

13 Jan, 2019

1 commit

  • commit 7056d3a37d2c6aaaab10c13e8e69adc67ec1fc65 upstream.

    Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
    eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
    out_of_memory back to the charge path") has moved the oom handling back to
    the charge path. While doing so the notification was left behind in
    mem_cgroup_oom_synchronize.

    Fix the issue by replicating the oom hierarchy locking and the
    notification.

    Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
    Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
    Signed-off-by: Michal Hocko
    Reported-by: Burt Holzman
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

05 Sep, 2018

1 commit

  • When the memcg OOM killer runs out of killable tasks, it currently
    prints a WARN with no further OOM context. This has caused some user
    confusion.

    Warnings indicate a kernel problem. In a reported case, however, the
    situation was triggered by a nonsensical memcg configuration (hard limit
    set to 0). But without any VM context this wasn't obvious from the
    report, and it took some back and forth on the mailing list to identify
    what is actually a trivial issue.

    Handle this OOM condition like we handle it in the global OOM killer:
    dump the full OOM context and tell the user we ran out of tasks.

    This way the user can identify misconfigurations easily by themselves
    and rectify the problem - without having to go through the hassle of
    running into an obscure but unsettling warning, finding the appropriate
    kernel mailing list and waiting for a kernel developer to remote-analyze
    that the memcg configuration caused this.

    If users cannot make sense of why the OOM killer was triggered or why it
    failed, they will still report it to the mailing list, we know that from
    experience. So in case there is an actual kernel bug causing this,
    kernel developers will very likely hear about it.

    Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Aug, 2018

2 commits

  • For some workloads an intervention from the OOM killer can be painful.
    Killing a random task can bring the workload into an inconsistent state.

    Historically, there are two common solutions for this
    problem:
    1) enabling panic_on_oom,
    2) using a userspace daemon to monitor OOMs and kill
    all outstanding processes.

    Both approaches have their downsides: rebooting on each OOM is an obvious
    waste of capacity, and handling all in userspace is tricky and requires a
    userspace agent, which will monitor all cgroups for OOMs.

    In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
    the necessity of enabling panic_on_oom. Also, it can simplify the cgroup
    management for userspace applications.

    This commit introduces a new knob for cgroup v2 memory controller:
    memory.oom.group. The knob determines whether the cgroup should be
    treated as an indivisible workload by the OOM killer. If set, all tasks
    belonging to the cgroup or to its descendants (if the memory cgroup is not
    a leaf cgroup) are killed together or not at all.

    To determine which cgroup has to be killed, we traverse the cgroup
    hierarchy from the victim task's cgroup up to the OOMing cgroup (or
    root), looking for the highest-level cgroup with memory.oom.group set.
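
    A sketch of that walk (simplified; the variable names are illustrative
    only):

    	struct mem_cgroup *oom_group = NULL, *memcg;

    	for (memcg = victim_memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
    		/* Remember the highest level with memory.oom.group set. */
    		if (memcg->oom_group)
    			oom_group = memcg;
    		/* Stop at the OOMing memcg (or the root). */
    		if (memcg == oom_domain)
    			break;
    	}
    	/* If oom_group is set, kill every task in that subtree together. */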

    Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
    an exception and are never killed.

    This patch doesn't change the OOM victim selection algorithm.

    Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
    to collect the stats while cgroup-v2's memory_stat_show traverses the
    memcg tree thrice. On a large machine, a couple thousand memcgs is very
    normal and if the churn is high and memcgs stick around due to several
    reasons, tens of thousands of nodes can exist in the memcg tree. This
    patch refactors and shares the stat collection code between cgroup-v1
    and cgroup-v2 and reduces the tree traversal to just one.

    I ran a simple benchmark which reads the root_mem_cgroup's stat file
    1000 times in the presence of 2500 memcgs on cgroup-v1. The results are:

    Without the patch:
    $ time ./read-root-stat-1000-times

    real 0m1.663s
    user 0m0.000s
    sys 0m1.660s

    With the patch:
    $ time ./read-root-stat-1000-times

    real 0m0.468s
    user 0m0.000s
    sys 0m0.467s

    Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Cc: Bruce Merry
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

18 Aug, 2018

10 commits

    To avoid further unneeded calls of do_shrink_slab() for shrinkers which
    already do not have any charged objects in a memcg, their bits have to
    be cleared.

    This patch introduces a lockless mechanism to do that without races
    with parallel list_lru adds. After do_shrink_slab() returns
    SHRINK_EMPTY the first time, we clear the bit and call it once again.
    Then we restore the bit, if the new return value is different.
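
    In code, the clear-and-recheck step looks roughly like this (a
    simplified sketch of the logic described above, not the exact patch):

    	ret = do_shrink_slab(&sc, shrinker, priority);
    	if (ret == SHRINK_EMPTY) {
    		/* Nothing charged: clear our bit, then probe once more in
    		 * case an object showed up in between. */
    		clear_bit(i, map->map);
    		smp_mb__after_atomic();	/* pairs with the set_bit() side */
    		ret = do_shrink_slab(&sc, shrinker, priority);
    		if (ret == SHRINK_EMPTY)
    			ret = 0;
    		else
    			set_shrinker_bit(memcg, nid, i); /* restore the bit */
    	}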

    Note that the single smp_mb__after_atomic() in shrink_slab_memcg()
    covers two situations:

    1) list_lru_add()            shrink_slab_memcg()
         list_add_tail()           for_each_set_bit()
                                     do_shrink_slab()
         <MB>                        <MB>
         set_bit()                   do_shrink_slab()

       Here the first do_shrink_slab() can see the set bit but miss the
       list update (the race with the first element being queued). This
       case is rare, so we do not add a barrier before the first call of
       do_shrink_slab(), to not slow down the generic case; instead the
       second call is needed, as seen below in (2).

    2) list_lru_add()            shrink_slab_memcg()
         list_add_tail()           ...
         set_bit()                 ...
       ...                         for_each_set_bit()
       do_shrink_slab()              do_shrink_slab()
         clear_bit()                 ...
       ...                           ...
       list_lru_add()                ...
         list_add_tail()             clear_bit()
         <MB>                        <MB>
         set_bit()                   do_shrink_slab()

    The barriers guarantee that the second do_shrink_slab() in the
    right-side task sees the list update if it really cleared the bit.
    This case is drawn in the code comment.

    [Results/performance of the patchset]

    After the whole patchset is applied, the test below shows a significant
    increase in performance:

    $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
    $mkdir /sys/fs/cgroup/memory/ct
    $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
    $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
    mkdir -p s/$i; mount -t tmpfs $i s/$i;
    touch s/$i/file; done

    Then, 5 sequential calls of drop caches:

    $time echo 3 > /proc/sys/vm/drop_caches

    1)Before:
    0.00user 13.78system 0:13.78elapsed 99%CPU
    0.00user 5.59system 0:05.60elapsed 99%CPU
    0.00user 5.48system 0:05.48elapsed 99%CPU
    0.00user 8.35system 0:08.35elapsed 99%CPU
    0.00user 8.34system 0:08.35elapsed 99%CPU

    2)After
    0.00user 1.10system 0:01.10elapsed 99%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU

    The results show that performance increases by at least 548 times.

    Shakeel Butt tested this patchset with fork-bomb on his configuration:

    > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
    > file containing few KiBs on corresponding mount. Then in a separate
    > memcg of 200 MiB limit ran a fork-bomb.
    >
    > I ran the "perf record -ag -- sleep 60" and below are the results:
    >
    > Without the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
    > + 36.40% fb.sh [kernel.kallsyms] [k] shrink_slab
    > + 18.97% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    > + 6.75% fb.sh [kernel.kallsyms] [k] super_cache_count
    > + 0.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 0.44% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 0.27% fb.sh [kernel.kallsyms] [k] up_read
    > + 0.21% fb.sh [kernel.kallsyms] [k] osq_lock
    > + 0.13% fb.sh [kernel.kallsyms] [k] shmem_unused_huge_count
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node
    >
    > With the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
    > + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 30.72% fb.sh [kernel.kallsyms] [k] up_read
    > + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
    > + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    > + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    > + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
    > + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
    > + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
    > + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    Introduce the set_shrinker_bit() function to set a shrinker-related bit
    in the memcg shrinker bitmap; the bit is set after the first item is
    added and when reparenting a destroyed memcg's items.

    This will allow the next patch to call shrinkers only when they have
    charged objects at the moment, and to improve shrink_slab()
    performance.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    This will be used in the next patch.

    Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    This is just refactoring to allow the next patches to have a dst_memcg
    pointer in memcg_drain_list_lru_node().

    Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    Imagine a big node with many cpus, memory cgroups and containers. Say
    we have 200 containers, every container has 10 mounts and 10 cgroups.
    Container tasks don't touch other containers' mounts. If there is
    intensive page writing and global reclaim happens, a writing task has
    to iterate over all memcgs to shrink slab before it's able to go to
    shrink_page_list().

    Iteration over all the memcg slabs is very expensive: the task has to
    visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
    2000 memcgs, the total number of calls is 2000 * 2000 = 4000000.

    So, the shrinker makes 4 million do_shrink_slab() calls just to try to
    isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs
    via shrink_page_list(). I've observed a node spending almost 100% of
    its time in the kernel, making useless iterations over already shrunk
    slabs.

    This patch adds a bitmap of memcg-aware shrinkers to the memcg. The
    size of the bitmap depends on bitmap_nr_ids, and during the memcg's
    lifetime it is maintained to be large enough to fit bitmap_nr_ids
    shrinkers. Every bit in the map corresponds to a shrinker id.

    The next patches will keep a bit set only for memcgs that really have
    charged objects. This will allow shrink_slab() to improve its
    performance significantly. See the last patch for the numbers.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
    [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
    Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
    Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    The next patch requires these defines to be above their current
    position, so they are moved up to the declarations.

    Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    Introduce a new config option to replace the repeated CONFIG_MEMCG &&
    !CONFIG_SLOB pattern. The next patches add a little more memcg+kmem
    related code, so let's keep the related defines tidy.

    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") has changed the ENOMEM semantics of memcg charges.
    Rather than invoking the oom killer from the charging context it delays
    the oom killer to the page fault path (pagefault_out_of_memory). This
    in turn means that many users (e.g. slab or g-u-p) will get ENOMEM when
    the corresponding memcg hits the hard limit and the memcg is OOM.
    This behavior is inconsistent with the !memcg case where the oom killer
    is invoked from the allocation context and the allocator keeps retrying
    until it succeeds.

    The difference in the behavior is user visible. mmap(MAP_POPULATE)
    might result in not fully populated ranges while the mmap return code
    doesn't tell that to the userspace. Random syscalls might fail with
    ENOMEM etc.

    The primary motivation of the different memcg oom semantic was the
    deadlock avoidance. Things have changed since then, though. We have an
    async oom teardown by the oom reaper now and so we do not have to rely
    on the victim to tear down its memory anymore. Therefore we can return
    to the original semantic as long as the memcg oom killer is not handed
    over to user space.

    There is still one thing to be careful about here though. If the oom
    killer is not able to make any forward progress - e.g. because there is
    no eligible task to kill - then we have to bail out of the charge path
    to prevent the same class of deadlocks. We have basically two options
    here. Either we fail the charge with ENOMEM or force the charge and
    allow overcharge. The first option has been considered more harmful
    than useful because rare inconsistencies in the ENOMEM behavior are hard
    to test for and error prone. Basically the same reason why the page
    allocator doesn't fail allocations under such conditions. The latter
    might allow runaways but those should be really unlikely unless somebody
    misconfigures the system. E.g. allowing to migrate tasks away from the
    memcg to a different unlimited memcg with move_charge_at_immigrate
    disabled.

    Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The buffer_head can consume a significant amount of system memory and is
    directly related to the amount of page cache. In our production
    environment we have observed that a lot of machines are spending a
    significant amount of memory as buffer_heads, and this cannot simply be
    written off as system memory overhead.

    Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
    allocation. The buffer_heads can be allocated in a memcg different from
    the memcg of the page for which buffer_heads are being allocated. One
    concrete example is memory reclaim. The reclaim can trigger I/O of
    pages of any memcg on the system. So, the right way to charge
    buffer_head is to extract the memcg from the page for which buffer_heads
    are being allocated and then use targeted memcg charging API.

    [shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
    Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jan Kara
    Cc: Amir Goldstein
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Patch series "Directed kmem charging", v8.

    The Linux kernel's memory cgroup allows limiting the memory usage of the
    jobs running on the system to provide isolation between the jobs. All
    the kernel memory allocated in the context of the job and marked with
    __GFP_ACCOUNT will also be included in the memory usage and be limited
    by the job's limit.

    The kernel memory can only be charged to the memcg of the process in
    whose context the kernel memory was allocated. However there are cases
    where the allocated kernel memory should be charged to a memcg different
    from the current process's memcg. This patch series contains two such
    concrete use-cases i.e. fsnotify and buffer_head.

    The fsnotify event objects can consume a lot of system memory for large
    or unlimited queues if there is either no or slow listener. The events
    are allocated in the context of the event producer. However they should
    be charged to the event consumer. Similarly the buffer_head objects can
    be allocated in a memcg different from the memcg of the page for which
    buffer_head objects are being allocated.

    To solve this issue, this patch series introduces mechanism to charge
    kernel memory to a given memcg. In case of fsnotify events, the memcg
    of the consumer can be used for charging and for buffer_head, the memcg
    of the page can be charged. For directed charging, the caller can use
    the scope API memalloc_[un]use_memcg() to specify the memcg to charge
    for all the __GFP_ACCOUNT allocations within the scope.
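
    Roughly, a producer then brackets its allocation with the scope API (a
    usage sketch; the cache name and memcg field are illustrative):

    	memalloc_use_memcg(group->memcg);	/* listener's memcg */
    	event = kmem_cache_alloc(event_cachep, GFP_KERNEL_ACCOUNT);
    	memalloc_unuse_memcg();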

    This patch (of 2):

    A lot of memory can be consumed by the events generated for the huge or
    unlimited queues if there is either no or slow listener. This can cause
    system level memory pressure or OOMs. So, it's better to account the
    fsnotify kmem caches to the memcg of the listener.

    However the listener can be in a different memcg than the memcg of the
    producer and these allocations happen in the context of the event
    producer. This patch introduces remote memcg charging API which the
    producer can use to charge the allocations to the memcg of the listener.

    There are seven fsnotify kmem caches and among them allocations from
    dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
    inotify_inode_mark_cachep happens in the context of syscall from the
    listener. So, SLAB_ACCOUNT is enough for these caches.

    The objects from fsnotify_mark_connector_cachep are not accounted as
    they are small compared to the notification mark or events and it is
    unclear whom to account connector to since it is shared by all events
    attached to the inode.

    The allocations from the event caches happen in the context of the event
    producer. For such caches we will need to remote charge the allocations
    to the listener's memcg. Thus we save the memcg reference in the
    fsnotify_group structure of the listener.

    This patch also rearranges the members of fsnotify_group, filling holes
    so that the structure size stays the same (at least for 64-bit builds)
    even with the additional member.

    [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
    Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jan Kara
    Cc: Amir Goldstein
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

15 Aug, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     

06 Aug, 2018

1 commit


03 Aug, 2018

1 commit

    In case of memcg_online_kmem() failure, memcg_cgroup::id remains hashed
    in mem_cgroup_idr even after the memcg memory is freed. This leads to a
    leak of the ID in mem_cgroup_idr.

    This patch adds removal into mem_cgroup_css_alloc(), which fixes the
    problem. For better readability, it adds a generic helper which is used
    in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.

    Link: http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
    Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Signed-off-by: Kirill Tkhai
    Acked-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

22 Jul, 2018

1 commit

  • It was reported that a kernel crash happened in mem_cgroup_iter(), which
    can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.

    Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
    ......
    Call trace:
    mem_cgroup_iter+0x2e0/0x6d4
    shrink_zone+0x8c/0x324
    balance_pgdat+0x450/0x640
    kswapd+0x130/0x4b8
    kthread+0xe8/0xfc
    ret_from_fork+0x10/0x20

    mem_cgroup_iter():
        ...
        if (css_tryget(css))    <-- crash here
            break;
        ...

    The crash happens because mem_cgroup_iter() uses the memcg object whose
    pointer is stored in iter->position, which has been freed before and
    filled with POISON_FREE(0x6b).

    And the root cause of the use-after-free issue is that
    invalidate_reclaim_iterators() fails to reset the value of
    iter->position to NULL when the css of the memcg is released in
    non-hierarchical mode.

    Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
    Fixes: 6df38689e0e9 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
    Signed-off-by: Jing Xia
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jing Xia
     

09 Jul, 2018

1 commit

  • Memory allocations can induce swapping via kswapd or direct reclaim. If
    we are having IO done for us by kswapd and don't actually go into direct
    reclaim we may never get scheduled for throttling. So instead check to
    see if our cgroup is congested, and if so schedule the throttling.
    Before we return to user space the throttling stuff will only throttle
    if we actually required it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jun, 2018

2 commits

  • Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate
    when waking pollers") converted most of memcg event counters to
    per-memcg atomics, which made them less confusing for a user. The
    "oom_kill" counter remained untouched, so now it behaves differently
    than other counters (including "oom"). This adds nothing but confusion.

    Let's fix this by adding the MEMCG_OOM_KILL event, and follow the
    MEMCG_OOM approach.

    This also removes a hack from count_memcg_event_mm(), introduced earlier
    specially for the OOM_KILL counter.

    [akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch]
    Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Shakeel reported a crash in mem_cgroup_protected(), which can be triggered
    by memcg reclaim if the legacy cgroup v1 use_hierarchy=0 mode is used:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000120
    PGD 8000001ff55da067 P4D 8000001ff55da067 PUD 1fdc7df067 PMD 0
    Oops: 0000 [#4] SMP PTI
    CPU: 0 PID: 15581 Comm: bash Tainted: G D 4.17.0-smp-clean #5
    Hardware name: ...
    RIP: 0010:mem_cgroup_protected+0x54/0x130
    Code: 4c 8b 8e 00 01 00 00 4c 8b 86 08 01 00 00 48 8d 8a 08 ff ff ff 48 85 d2 ba 00 00 00 00 48 0f 44 ca 48 39 c8 0f 84 cf 00 00 00 8b 81 20 01 00 00 4d 89 ca 4c 39 c8 4c 0f 46 d0 4d 85 d2 74 05
    RSP: 0000:ffffabe64dfafa58 EFLAGS: 00010286
    RAX: ffff9fb6ff03d000 RBX: ffff9fb6f5b1b000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff9fb6f5b1b000 RDI: ffff9fb6f5b1b000
    RBP: ffffabe64dfafb08 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000c800 R12: ffffabe64dfafb88
    R13: ffff9fb6f5b1b000 R14: ffffabe64dfafb88 R15: ffff9fb77fffe000
    FS: 00007fed1f8ac700(0000) GS:ffff9fb6ff400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000120 CR3: 0000001fdcf86003 CR4: 00000000001606f0
    Call Trace:
    ? shrink_node+0x194/0x510
    do_try_to_free_pages+0xfd/0x390
    try_to_free_mem_cgroup_pages+0x123/0x210
    try_charge+0x19e/0x700
    mem_cgroup_try_charge+0x10b/0x1a0
    wp_page_copy+0x134/0x5b0
    do_wp_page+0x90/0x460
    __handle_mm_fault+0x8e3/0xf30
    handle_mm_fault+0xfe/0x220
    __do_page_fault+0x262/0x500
    do_page_fault+0x28/0xd0
    ? page_fault+0x8/0x30
    page_fault+0x1e/0x30
    RIP: 0033:0x485b72

    The problem happens because parent_mem_cgroup() returns a NULL pointer,
    which is dereferenced later without a check.

    As cgroup v1 has no memory guarantee support, let's make
    mem_cgroup_protected() immediately return MEMCG_PROT_NONE, if the given
    cgroup has no parent (non-hierarchical mode is used).
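
    In code the fix is essentially an early bail-out at the top of
    mem_cgroup_protected() (sketched from the description above):

    	struct mem_cgroup *parent = parent_mem_cgroup(memcg);

    	/* No parent: root, or cgroup v1 non-hierarchical mode. */
    	if (!parent)
    		return MEMCG_PROT_NONE;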

    Link: http://lkml.kernel.org/r/20180611175418.7007-2-guro@fb.com
    Fixes: bf8d5d52ffe8 ("memcg: introduce memory.min")
    Signed-off-by: Roman Gushchin
    Reported-by: Shakeel Butt
    Tested-by: Shakeel Butt
    Tested-by: John Stultz
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

08 Jun, 2018

11 commits

  • Currently an attempt to set swap.max into a value lower than the actual
    swap usage fails, which causes configuration problems as there's no way
    of lowering the configuration below the current usage short of turning
    off swap entirely. This makes swap.max difficult to use and allows
    delegatees to lock the delegator out of reducing swap allocation.

    This patch updates swap_max_write() so that the limit can be lowered
    below the current usage. It doesn't implement active reclaiming of swap
    entries for the following reasons.

    * mem_cgroup_swap_full() already tells the swap machinery to
    aggressively reclaim swap entries if the usage is above 50% of the
    limit, so simply lowering the limit automatically triggers gradual
    reclaim.

    * Forcing back swapped out pages is likely to heavily impact the
    workload and mess up the working set. Given that swap usually is a
    lot less valuable and less scarce, letting the existing usage
    dissipate over time through the above gradual reclaim and as they're
    faulted back in is likely the better behavior.

    Link: http://lkml.kernel.org/r/20180523185041.GR1718769@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Acked-by: Roman Gushchin
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Memory controller implements the memory.low best-effort memory
    protection mechanism, which works perfectly in many cases and allows
    protecting working sets of important workloads from sudden reclaim.

    But its semantics has a significant limitation: it works only as long as
    there is a supply of reclaimable memory. This makes it pretty useless
    against any sort of slow memory leaks or memory usage increases. This
    is especially true for swapless systems. If swap is enabled, memory
    soft protection effectively postpones problems, allowing a leaking
    application to fill all swap area, which makes no sense. The only
    effective way to guarantee the memory protection in this case is to
    invoke the OOM killer.

    It's possible to handle this case in userspace by reacting on MEMCG_LOW
    events; but there is still a place for a fail-safe in-kernel mechanism
    to provide stronger guarantees.

    This patch introduces the memory.min interface for cgroup v2 memory
    controller. It works very similarly to memory.low (sharing the same
    hierarchical behavior), except that it's not disabled if there is no
    more reclaimable memory in the system.

    If cgroup is not populated, its memory.min is ignored, because otherwise
    even the OOM killer wouldn't be able to reclaim the protected memory,
    and the system can stall.

    [guro@fb.com: s/low/min/ in docs]
    Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
    Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Randy Dunlap
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    The per-cpu memcg stock can retain a charge of up to 32 pages. On a
    machine with a large number of cpus, this can amount to a decent amount
    of memory. Additionally, the force_empty interface might be triggering
    unneeded memcg reclaims.

    Link: http://lkml.kernel.org/r/20180507201651.165879-1-shakeelb@google.com
    Signed-off-by: Junaid Shahid
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junaid Shahid
     
  • Resizing the memcg limit for cgroup-v2 drains the stocks before
    triggering the memcg reclaim. Do the same for cgroup-v1 to make the
    behavior consistent.

    Link: http://lkml.kernel.org/r/20180504205548.110696-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Mark memcg1_events static: it's only used by memcontrol.c. And mark it
    const: it's not modified.

    Link: http://lkml.kernel.org/r/20180503192940.94971-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • mem_cgroup_cgwb_list is a very simple wrapper and it will never be used
    outside of code under CONFIG_CGROUP_WRITEBACK. so use memcg->cgwb_list
    directly.

    Link: http://lkml.kernel.org/r/1524406173-212182-1-git-send-email-wanglong19@meituan.com
    Signed-off-by: Wang Long
    Reviewed-by: Jan Kara
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     
  • If memcg's usage is equal to the memory.low value, avoid reclaiming from
    this cgroup while there is a surplus of reclaimable memory.

    This sounds more logical and also matches memory.high and memory.max
    behavior: both are inclusive.

    Empty cgroups are not considered protected, so MEMCG_LOW events are not
    emitted for empty cgroups, if there is no more reclaimable memory in the
    system.

    Link: http://lkml.kernel.org/r/20180406122132.GA7185@castle
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This patch aims to address an issue in current memory.low semantics,
    which makes it hard to use it in a hierarchy, where some leaf memory
    cgroups are more valuable than others.

    For example, there are memcgs A, A/B, A/C, A/D and A/E:

          A      A/memory.low = 2G, A/memory.current = 6G
         //\\
        BC  DE   B/memory.low = 3G  B/memory.current = 2G
                 C/memory.low = 1G  C/memory.current = 2G
                 D/memory.low = 0   D/memory.current = 2G
                 E/memory.low = 10G E/memory.current = 0

    If we apply memory pressure, B, C and D are reclaimed at the same pace
    while A's usage exceeds 2G. This is obviously wrong, as B's usage is
    fully below B's memory.low, and C has 1G of protection as well. Also, A
    is pushed to a size which is less than A's 2G memory.low, which is also
    wrong.

    A simple bash script (provided below) can be used to reproduce
    the problem. Current results are:
    A: 1430097920
    A/B: 711929856
    A/C: 717426688
    A/D: 741376
    A/E: 0

    To address the issue a concept of effective memory.low is introduced.
    Effective memory.low is always equal to or less than the original
    memory.low. In the case when there is no memory.low overcommitment (and
    also for top-level cgroups), these two values are equal.

    Otherwise it's a part of the parent's effective memory.low, calculated
    as the cgroup's memory.low usage divided by the sum of the siblings'
    memory.low usages (by memory.low usage I mean the size of actually
    protected memory: memory.current if memory.current < memory.low, 0
    otherwise). It's necessary to track the actual usage, because otherwise
    an empty cgroup with memory.low set (A/E in my example) will affect the
    actual memory distribution, which makes no sense. To avoid traversing
    the cgroup tree twice, page_counters code is reused.
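
    Plugging the numbers from the example above into that rule gives
    roughly (protected usage = memory.current clamped to memory.low):

    	low usage:  B = 2G, C = 1G, D = 0, E = 0   (E is empty)
    	A's effective low = 2G (top level)
    	B's effective low = 2G * 2G / (2G + 1G) ~= 1.33G
    	C's effective low = 2G * 1G / (2G + 1G) ~= 0.67G
    	D's and E's effective low = 0

    which matches the post-patch memory.current numbers below.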

    Calculating effective memory.low can be done in the reclaim path, as we
    conveniently traversing the cgroup tree from top to bottom and check
    memory.low on each level. So, it's a perfect place to calculate
    effective memory low and save it to use it for children cgroups.

    This also eliminates a need to traverse the cgroup tree from bottom to
    top each time to check if parent's guarantee is not exceeded.

    Setting/resetting effective memory.low is intentionally racy, but it's
    fine and shouldn't lead to any significant differences in actual memory
    distribution.

    With this patch applied results are matching the expectations:
    A: 2147930112
    A/B: 1428721664
    A/C: 718393344
    A/D: 815104
    A/E: 0

    Test script:
    #!/bin/bash

    CGPATH="/sys/fs/cgroup"

    truncate /file1 --size 2G
    truncate /file2 --size 2G
    truncate /file3 --size 2G
    truncate /file4 --size 50G

    mkdir "${CGPATH}/A"
    echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
    mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"

    echo 2G > "${CGPATH}/A/memory.low"
    echo 3G > "${CGPATH}/A/B/memory.low"
    echo 1G > "${CGPATH}/A/C/memory.low"
    echo 0 > "${CGPATH}/A/D/memory.low"
    echo 10G > "${CGPATH}/A/E/memory.low"

    echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
    echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
    echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
    echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4

    echo "A: " `cat "${CGPATH}/A/memory.current"`
    echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
    echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
    echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
    echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`

    rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
    rmdir "${CGPATH}/A"
    rm /file1 /file2 /file3 /file4

    Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This patch renames struct page_counter fields:
    count -> usage
    limit -> max

    and the corresponding functions:
    page_counter_limit() -> page_counter_set_max()
    mem_cgroup_get_limit() -> mem_cgroup_get_max()
    mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
    memcg_update_kmem_limit() -> memcg_update_kmem_max()
    memcg_update_tcp_limit() -> memcg_update_tcp_max()

    The idea behind this renaming is to have the direct matching
    between memory cgroup knobs (low, high, max) and page_counters API.

    This is pure renaming, this patch doesn't bring any functional change.

    Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Add swap max and fail events so that userland can monitor and respond to
    running out of swap.

    I'm not too sure about the fail event. Right now, it's a bit confusing
    which stats / events are recursive and which aren't, and also which ones
    reflect events that originate from a given cgroup and which ones target
    the cgroup. No idea what the right long term solution is and it could
    just be that growing them organically is actually the only right thing
    to do.

    Link: http://lkml.kernel.org/r/20180416231151.GI1911913@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Patch series "mm, memcontrol: Implement memory.swap.events", v2.

    This patchset implements memory.swap.events which contains max and fail
    events so that userland can monitor and respond to swap running out.

    This patch (of 2):

    get_swap_page() is always followed by mem_cgroup_try_charge_swap().
    This patch moves mem_cgroup_try_charge_swap() into get_swap_page() and
    makes get_swap_page() call the function even after swap allocation
    failure.

    This simplifies the callers and consolidates memcg related logic and
    will ease adding swap related memcg events.

    Link: http://lkml.kernel.org/r/20180416230934.GH1911913@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

26 May, 2018

1 commit


21 Apr, 2018

1 commit

    If there is heavy memory pressure, page allocation with __GFP_NOWAIT
    fails easily although it's an order-0 request. I got the below warning
    9 times during a normal boot.

    : page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
    .. snip ..
    Call trace:
    dump_backtrace+0x0/0x4
    dump_stack+0xa4/0xc0
    warn_alloc+0xd4/0x15c
    __alloc_pages_nodemask+0xf88/0x10fc
    alloc_slab_page+0x40/0x18c
    new_slab+0x2b8/0x2e0
    ___slab_alloc+0x25c/0x464
    __kmalloc+0x394/0x498
    memcg_kmem_get_cache+0x114/0x2b8
    kmem_cache_alloc+0x98/0x3e8
    mmap_region+0x3bc/0x8c0
    do_mmap+0x40c/0x43c
    vm_mmap_pgoff+0x15c/0x1e4
    sys_mmap+0xb0/0xc8
    el0_svc_naked+0x24/0x28
    Mem-Info:
    active_anon:17124 inactive_anon:193 isolated_anon:0
    active_file:7898 inactive_file:712955 isolated_file:55
    unevictable:0 dirty:27 writeback:18 unstable:0
    slab_reclaimable:12250 slab_unreclaimable:23334
    mapped:19310 shmem:212 pagetables:816 bounce:0
    free:36561 free_pcp:1205 free_cma:35615
    Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
    lowmem_reserve[]: 0 1842 1842
    Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
    lowmem_reserve[]: 0 0 0
    DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
    Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
    721350 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    945512 pages RAM
    0 pages HighMem/MovableOnly
    63408 pages reserved
    51200 pages cma reserved

    __memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
    and the worker allocation failure is not really critical because we will
    retry on the next kmem charge. We might miss some charges but that
    shouldn't be critical. The excessive allocation failure report is not
    very helpful.
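
    The cure is simply to mark that best-effort allocation as not worth a
    warning, roughly like this (a sketch of the idea only; the surrounding
    worker-setup code is omitted):

    	/* Best effort: on failure we just retry on the next kmem charge,
    	 * so suppress the allocation-failure splat. */
    	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
    	if (!cw)
    		return;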

    [mhocko@kernel.org: changelog update]
    Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

12 Apr, 2018

4 commits

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • syzbot has triggered a NULL ptr dereference when allocation fault
    injection enforces a failure and alloc_mem_cgroup_per_node_info
    initializes memcg->nodeinfo only half way through.

    But __mem_cgroup_free still tries to free all per-node data and
    dereferences pn->lruvec_stat_cpu unconditionally even if the specific
    per-node data hasn't been initialized.

    The bug is quite unlikely to hit because small allocations do not fail
    and we would need quite some numa nodes to make struct
    mem_cgroup_per_node large enough to cross the costly order.

    Link: http://lkml.kernel.org/r/20180406100906.17790-1-mhocko@kernel.org
    Reported-by: syzbot+8a5de3cce7cdc70e9ebe@syzkaller.appspotmail.com
    Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure")
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrey Ryabinin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") added per-cpu drift to all memory cgroup stats
    and events shown in memory.stat and memory.events.

    For memory.stat this is acceptable. But memory.events issues file
    notifications, and somebody polling the file for changes will be
    confused when the counters in it are unchanged after a wakeup.

    Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
    MEMCG_OOM - are sufficiently rare and high-level that we don't need
    per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
    frequent, but they're counting invocations of reclaim, which is a
    complex operation that touches many shared cachelines.

    This splits memory.events from the generic VM events and tracks them in
    their own, unbuffered atomic counters. That's also cleaner, as it
    eliminates the ugly enum nesting of VM and cgroup events.

    [hannes@cmpxchg.org: "array subscript is above array bounds"]
    Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
    Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    A THP memcg charge can trigger the oom killer since 2516035499b9 ("mm,
    thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
    We have used an explicit __GFP_NORETRY previously which ruled out the
    OOM killer automagically.

    The memcg charge path should be semantically compliant with the
    allocation path, and that means that if we do not trigger the OOM
    killer for costly orders there, we should do the same in the memcg
    charge path as well. Otherwise we are forcing callers to distinguish
    the two and use different gfp masks which is both non-intuitive and bug
    prone. As soon as we get a costly high order kmalloc user we even do
    not have any means to tell the memcg specific gfp mask to prevent from
    OOM because the charging is deep within guts of the slab allocator.

    The unexpected memcg OOM on THP has already been fixed upstream by
    9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp") but this is a
    one-off fix rather than a generic solution. Teach mem_cgroup_oom to
    bail out on costly order requests to fix the THP issue as well as any
    other costly OOM eligible allocations to be added in future.
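
    The bail-out itself is a small check in mem_cgroup_oom() (a sketch;
    PAGE_ALLOC_COSTLY_ORDER is the same threshold the page allocator uses):

    	/* Costly orders are not worth an OOM kill; fail the charge instead. */
    	if (order > PAGE_ALLOC_COSTLY_ORDER)
    		return;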

    Also revert 9d3c3354bb85 because special gfp for THP is no longer
    needed.

    Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko