14 Jun, 2019

1 commit

  • The kernel test robot noticed a 26% will-it-scale pagefault regression
    from commit 42a300353577 ("mm: memcontrol: fix recursive statistics
    correctness & scalabilty"). This appears to be caused by bouncing the
    additional cachelines from the new hierarchical statistics counters.

    We can fix this by getting rid of the batched local counters instead.

    Originally, there were *only* group-local counters, and they were fully
    maintained per cpu. A reader of a stats file high up in the cgroup tree
    would have to walk the entire subtree and collect each level's per-cpu
    counters to get the recursive view. This was prohibitively expensive,
    and so we switched to per-cpu batched updates of the local counters
    during a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting"), reducing the complexity from nr_subgroups *
    nr_cpus to nr_subgroups.

    With growing machines and cgroup trees, the tree walk itself became too
    expensive for monitoring top-level groups, and this is when the culprit
    patch added hierarchy counters on each cgroup level. When the per-cpu
    batch size would be reached, both the local and the hierarchy counters
    would get batch-updated from the per-cpu delta simultaneously.

    This makes local and hierarchical counter reads blazingly fast, but it
    unfortunately makes the write side too cache-line intensive.

    Since local counter reads were never a problem - we only centralized
    them to accelerate the hierarchy walk - and use of the local counters
    is becoming rarer due to replacement with hierarchical views (ongoing
    rework in the page reclaim and workingset code), we can make those local
    counters unbatched per-cpu counters again.

    The scheme will then be as such:

    when a memcg statistic changes, the writer will:
    - update the local counter (per-cpu)
    - update the batch counter (per-cpu). If the batch is full:
      - spill the batch into the group's atomic_t
      - spill the batch into all ancestors' atomic_ts
      - empty out the batch counter (per-cpu)

    when a local memcg counter is read, the reader will:
    - collect the local counter from all cpus

    when a hierarchy memcg counter is read, the reader will:
    - read the atomic_t

    We might be able to simplify this further and make the recursive
    counters unbatched per-cpu counters as well (batch upward propagation,
    but leave per-cpu collection to the readers), but that will require a
    more in-depth analysis and testing of all the callsites. Deal with the
    immediate regression for now.
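
    As a rough userspace sketch of the scheme above (this is not the kernel
    code; the struct layout, the mod_memcg_state()/memcg_state() names and
    the MEMCG_BATCH threshold are all illustrative stand-ins), the write and
    read paths could look like this:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NR_CPUS      4
    #define MEMCG_BATCH 32                  /* hypothetical spill threshold */

    struct memcg {
            struct memcg *parent;
            long local[NR_CPUS];            /* unbatched per-cpu local counter */
            long batch[NR_CPUS];            /* per-cpu delta pending for the hierarchy */
            atomic_long vmstat;             /* recursive counter for this subtree */
    };

    /* Write side: cheap per-cpu updates, occasional spill up the tree. */
    static void mod_memcg_state(struct memcg *memcg, int cpu, long delta)
    {
            long x;

            memcg->local[cpu] += delta;     /* local view, never batched */

            x = memcg->batch[cpu] + delta;
            if (labs(x) > MEMCG_BATCH) {    /* batch full: spill into all levels */
                    struct memcg *mi;

                    for (mi = memcg; mi; mi = mi->parent)
                            atomic_fetch_add(&mi->vmstat, x);
                    x = 0;
            }
            memcg->batch[cpu] = x;
    }

    /* Local read: collect this group's per-cpu counters. */
    static long memcg_state_local(struct memcg *memcg)
    {
            long sum = 0;

            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    sum += memcg->local[cpu];
            return sum;
    }

    /* Recursive read: a single atomic load. */
    static long memcg_state(struct memcg *memcg)
    {
            return atomic_load(&memcg->vmstat);
    }

    int main(void)
    {
            struct memcg root = { 0 };
            struct memcg child = { .parent = &root };

            for (int i = 0; i < 1000; i++)
                    mod_memcg_state(&child, i % NR_CPUS, 1);

            printf("child: local=%ld recursive~=%ld, root: recursive~=%ld\n",
                   memcg_state_local(&child), memcg_state(&child),
                   memcg_state(&root));
            return 0;
    }

    In the real kernel the arrays are genuine per-cpu variables updated with
    preemption disabled; the model above only illustrates the data flow.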

    Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
    Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
    Signed-off-by: Johannes Weiner
    Reported-by: kernel test robot
    Tested-by: kernel test robot
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

31 May, 2019

1 commit

  • Based on 3 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version this program is distributed in the
    hope that it will be useful but without any warranty without even
    the implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [kishon] [vijay] [abraham]
    [i] [kishon]@[ti] [com] this program is distributed in the hope that
    it will be useful but without any warranty without even the implied
    warranty of merchantability or fitness for a particular purpose see
    the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [graeme] [gregory]
    [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
    [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
    [hk] [hemahk]@[ti] [com] this program is distributed in the hope
    that it will be useful but without any warranty without even the
    implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 1105 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Richard Fontana
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

10 commits

  • When a cgroup is reclaimed on behalf of a configured limit, reclaim
    needs to round-robin through all NUMA nodes that hold pages of the memcg
    in question. However, when assembling the mask of candidate NUMA nodes,
    the code only consults the *local* cgroup LRU counters, not the
    recursive counters for the entire subtree. Cgroup limits are frequently
    configured against intermediate cgroups that do not have memory on their
    own LRUs. In this case, the node mask will always come up empty and
    reclaim falls back to scanning only the current node.

    If a cgroup subtree has some memory on one node but the processes are
    bound to another node afterwards, the limit reclaim will never age or
    reclaim that memory anymore.

    To fix this, use the recursive LRU counts for a cgroup subtree to
    determine which nodes hold memory of that cgroup.

    The code has been broken like this forever, so it doesn't seem to be a
    problem in practice. I just noticed it while reviewing the way the LRU
    counters are used in general.
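
    A toy illustration of the nodemask decision (plain C with invented
    arrays and helper names such as node_lru_pages() and
    build_scan_nodemask(); the real reclaim code works on lruvec counters
    and a nodemask_t):

    #include <stdio.h>

    #define MAX_NUMNODES 4
    #define NR_LRU_LISTS 5

    /* Toy per-node LRU counts for one cgroup: its own pages vs. its subtree. */
    static long lru_local[MAX_NUMNODES][NR_LRU_LISTS];
    static long lru_recursive[MAX_NUMNODES][NR_LRU_LISTS];

    static long node_lru_pages(long counts[][NR_LRU_LISTS], int nid)
    {
            long sum = 0;

            for (int lru = 0; lru < NR_LRU_LISTS; lru++)
                    sum += counts[nid][lru];
            return sum;
    }

    /* Build the candidate node mask for limit reclaim. */
    static unsigned int build_scan_nodemask(int use_recursive)
    {
            long (*counts)[NR_LRU_LISTS] = use_recursive ? lru_recursive : lru_local;
            unsigned int mask = 0;

            for (int nid = 0; nid < MAX_NUMNODES; nid++)
                    if (node_lru_pages(counts, nid) > 0)
                            mask |= 1u << nid;
            return mask;
    }

    int main(void)
    {
            /* An intermediate cgroup: nothing on its own LRUs, but a child
             * holds pages on node 2. */
            lru_recursive[2][0] = 1024;

            printf("local mask:     0x%x (empty -> only the current node is scanned)\n",
                   build_scan_nodemask(0));
            printf("recursive mask: 0x%x (node 2 gets scanned and aged)\n",
                   build_scan_nodemask(1));
            return 0;
    }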

    Link: http://lkml.kernel.org/r/20190412151507.2769-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Right now, when somebody needs to know the recursive memory statistics
    and events of a cgroup subtree, they need to walk the entire subtree and
    sum up the counters manually.

    There are two issues with this:

    1. When a cgroup gets deleted, its stats are lost. The state counters
    should all be 0 at that point, of course, but the events are not.
    When this happens, the event counters, which are supposed to be
    monotonic, can go backwards in the parent cgroups.

    2. During regular operation, we always have a certain number of lazily
    freed cgroups sitting around that have been deleted, have no tasks,
    but have a few cache pages remaining. These groups' statistics do not
    change until we eventually hit memory pressure, but somebody
    watching, say, memory.stat on an ancestor has to iterate those every
    time.

    This patch addresses both issues by introducing recursive counters at
    each level that are propagated from the write side when stats change.

    Upward propagation happens when the per-cpu caches spill over into the
    local atomic counter. This is the same thing we do during charge and
    uncharge, except that the latter uses atomic RMWs, which are more
    expensive; stat changes happen at around the same rate. In a sparse
    file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
    nesting levels, perf shows __mod_memcg_page_state at ~1%.

    Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These are getting too big to be inlined in every callsite. They were
    stolen from vmstat.c, which already out-of-lines them, and they have
    only been growing since. The callsites aren't that hot, either.

    Move __mod_memcg_state(), __mod_lruvec_state() and __count_memcg_events()
    out of line and add kerneldoc comments.

    Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: memcontrol: memory.stat cost & correctness".

    The cgroup memory.stat file holds recursive statistics for the entire
    subtree. The current implementation does this tree walk on-demand
    whenever the file is read. This is giving us problems in production.

    1. The cost of aggregating the statistics on-demand is high. A lot of
    system service cgroups are mostly idle and their stats don't change
    between reads, yet we always have to check them. There are also always
    some lazily-dying cgroups sitting around that are pinned by a handful
    of remaining page cache; the same applies to them.

    In an application that periodically monitors memory.stat in our
    fleet, we have seen the aggregation consume up to 5% CPU time.

    2. When cgroups die and disappear from the cgroup tree, so do their
    accumulated vm events. The result is that the event counters at
    higher-level cgroups can go backwards and confuse some of our
    automation, let alone people looking at the graphs over time.

    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.

    The upward spilling is batched using the existing per-cpu cache. In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).

    This patch (of 4):

    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.

    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.

    By default, cgroups are considered recursive entities and generally we
    expect more users of the recursive counters, with the local counts being
    special cases. To reflect that in the name, add a _local suffix to the
    current implementations.

    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • I spent literally an hour trying to work out why an earlier version of
    my memory.events aggregation code doesn't work properly, only to find
    out I was calling memcg->events instead of memcg->memory_events, which
    is fairly confusing.

    This naming seems in need of reworking, so make it harder to do the
    wrong thing by using vmevents instead of events, which makes it more
    clear that these are vm counters rather than memcg-specific counters.

    There are also a few other inconsistent names in both the percpu and
    aggregated structs, so these are all cleaned up to be more coherent and
    easy to understand.

    This commit contains code cleanup only: there are no logic changes.

    [akpm@linux-foundation.org: fix it for preceding changes]
    Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Dennis Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Only memcg_numa_stat_show() uses those wrappers and the lru bitmasks,
    so group them together.

    Link: http://lkml.kernel.org/r/20190228163020.24100-7-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_nr_lru_pages() is just a convenience wrapper around
    memcg_page_state() that takes bitmasks of lru indexes and aggregates the
    counts for those.

    Replace callsites where the bitmask is simple enough with direct
    memcg_page_state() call(s).

    Link: http://lkml.kernel.org/r/20190228163020.24100-6-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around
    lruvec_page_state() that takes bitmasks of lru indexes and aggregates the
    counts for those.

    Replace callsites where the bitmask is simple enough with direct
    lruvec_page_state() calls.

    This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so
    make that function private again, too.

    Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of adding up the node counters, use memcg_page_state() to get the
    memcg state directly. This is a bit cheaper and more stream-lined.

    Link: http://lkml.kernel.org/r/20190228163020.24100-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of adding up the zone counters, use lruvec_page_state() to get the
    node state directly. This is a bit cheaper and more stream-lined.

    Link: http://lkml.kernel.org/r/20190228163020.24100-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

06 Apr, 2019

1 commit

  • Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") memcg dirty and writeback counters are managed
    as:

    1) per-memcg per-cpu values in range of [-32..32]

    2) per-memcg atomic counter

    When a per-cpu counter cannot fit in [-32..32] it's flushed to the
    atomic. Stat readers only check the atomic. Thus readers such as
    balance_dirty_pages() may see a nontrivial error margin: 32 pages per
    cpu.

    Assuming 100 cpus:
    4k x86 page_size: 13 MiB error per memcg
    64k ppc page_size: 200 MiB error per memcg

    Considering that dirty+writeback are used together for some decisions the
    errors double.

    This inaccuracy can lead to undeserved oom kills. One nasty case is
    when all per-cpu counters hold positive values offsetting an atomic
    negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
    balance_dirty_pages() only consults the atomic and does not consider
    throttling the next n_cpu*32 dirty pages. If the file_lru is in the
    13..200 MiB range then there's absolutely no dirty throttling, which
    burdens vmscan with only dirty+writeback pages thus resorting to oom
    kill.

    It could be argued that tiny containers are not supported, but it's more
    subtle. It's the amount of space available for the file lru that matters.
    If a container has memory.max-200MiB of non reclaimable memory, then it
    will also suffer such oom kills on a 100 cpu machine.

    The following test reliably ooms without this patch. This patch avoids
    oom kills.

    $ cat test
    mount -t cgroup2 none /dev/cgroup
    cd /dev/cgroup
    echo +io +memory > cgroup.subtree_control
    mkdir test
    cd test
    echo 10M > memory.max
    (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
    (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

    $ cat memcg-writeback-stress.c
    /*
     * Dirty pages from all but one cpu.
     * Clean pages from the non dirtying cpu.
     * This is to stress per cpu counter imbalance.
     * On a 100 cpu machine:
     * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
     * - per memcg atomic is -99*32 pages
     * - thus the complete dirty limit: sum of all counters 0
     * - balance_dirty_pages() only sees atomic count -99*32 pages, which
     *   it max()s to 0.
     * - So a workload can dirty -99*32 pages before balance_dirty_pages()
     *   cares.
     */
    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/sysinfo.h>
    #include <sys/types.h>
    #include <unistd.h>

    static char *buf;
    static int bufSize;

    static void set_affinity(int cpu)
    {
            cpu_set_t affinity;

            CPU_ZERO(&affinity);
            CPU_SET(cpu, &affinity);
            if (sched_setaffinity(0, sizeof(affinity), &affinity))
                    err(1, "sched_setaffinity");
    }

    static void dirty_on(int output_fd, int cpu)
    {
            int i, wrote;

            set_affinity(cpu);
            for (i = 0; i < 32; i++) {
                    for (wrote = 0; wrote < bufSize; ) {
                            int ret = write(output_fd, buf+wrote, bufSize-wrote);
                            if (ret == -1)
                                    err(1, "write");
                            wrote += ret;
                    }
            }
    }

    int main(int argc, char **argv)
    {
            int cpu, flush_cpu = 1, output_fd;
            const char *output;

            if (argc != 2)
                    errx(1, "usage: output_file");

            output = argv[1];
            bufSize = getpagesize();
            buf = malloc(getpagesize());
            if (buf == NULL)
                    errx(1, "malloc failed");

            output_fd = open(output, O_CREAT|O_RDWR);
            if (output_fd == -1)
                    err(1, "open(%s)", output);

            for (cpu = 0; cpu < get_nprocs(); cpu++) {
                    if (cpu != flush_cpu)
                            dirty_on(output_fd, cpu);
            }

            set_affinity(flush_cpu);
            if (fsync(output_fd))
                    err(1, "fsync(%s)", output);
            if (close(output_fd))
                    err(1, "close(%s)", output);
            free(buf);
    }

    Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
    collect exact per memcg counters. This avoids the aforementioned oom
    kills.

    This does not affect the overhead of memory.stat, which still reads the
    single atomic counter.

    Why not use percpu_counter? memcg already handles cpus going offline, so
    no need for that overhead from percpu_counter. And the percpu_counter
    spinlocks are more heavyweight than is required.

    It probably also makes sense to use exact dirty and writeback counters
    in memcg oom reports. But that is saved for later.
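
    A toy model of the accuracy trade-off described above (userspace C; the
    names and the batch size of 32 come from the text, not from the kernel
    sources):

    #include <stdio.h>
    #include <stdlib.h>

    #define NR_CPUS 100
    #define BATCH    32     /* per-cpu error bound from the text above */

    /* Toy dirty-page counter: per-cpu deltas plus one shared total. */
    static long percpu_delta[NR_CPUS];
    static long atomic_total;

    static void mod_dirty(int cpu, long delta)
    {
            percpu_delta[cpu] += delta;
            if (labs(percpu_delta[cpu]) > BATCH) {          /* flush to the atomic */
                    atomic_total += percpu_delta[cpu];
                    percpu_delta[cpu] = 0;
            }
    }

    /* Cheap read: can be off by up to NR_CPUS * BATCH pages. */
    static long read_approx(void)
    {
            return atomic_total;
    }

    /* Exact read: also fold in every cpu's pending delta. */
    static long read_exact(void)
    {
            long v = atomic_total;

            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    v += percpu_delta[cpu];
            return v;
    }

    int main(void)
    {
            /* The nasty case from above: 99 cpus each hold +32 pending pages,
             * offsetting a negative shared total. */
            for (int cpu = 1; cpu < NR_CPUS; cpu++)
                    percpu_delta[cpu] = 32;
            atomic_total = -99 * 32;

            mod_dirty(0, 5);        /* a few more dirty pages stay per-cpu */

            printf("approximate read: %ld (max()'d to 0 -> no throttling)\n",
                   read_approx());
            printf("exact read:       %ld\n", read_exact());
            return 0;
    }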

    Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Reviewed-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: [4.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

06 Mar, 2019

9 commits

  • Commit 230671533d64 ("mm: memory.low hierarchical behavior") missed an
    asterisk in one of the comments.

    mm/memcontrol.c:5774: warning: bad line: | 0, otherwise.

    Link: http://lkml.kernel.org/r/20190301143734.94393-1-cai@lca.pw
    Acked-by: Souptick Joarder
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • We have a common pattern to access the lru_lock from a page pointer:

        zone_lru_lock(page_zone(page))

    Which is silly, because it unfolds to this:

        &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)].zone_pgdat->lru_lock

    while we can simply do:

        &NODE_DATA(page_to_nid(page))->lru_lock

    Remove the zone_lru_lock() function, since it only complicates things.
    Use the 'page_pgdat(page)->lru_lock' pattern instead.
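
    The equivalence is easy to see with toy structs (these are not the real
    mm definitions, just enough to show that the detour through the zone
    ends up at the same per-node lock):

    #include <assert.h>
    #include <stdio.h>

    #define MAX_NR_ZONES 3

    struct pglist_data;

    struct zone {
            struct pglist_data *zone_pgdat;         /* back-pointer to the node */
    };

    struct pglist_data {
            struct zone node_zones[MAX_NR_ZONES];
            int lru_lock;                           /* stand-in for the spinlock */
    };

    int main(void)
    {
            static struct pglist_data node;
            int zoneidx = 1;                        /* whichever zone the page is in */

            for (int i = 0; i < MAX_NR_ZONES; i++)
                    node.node_zones[i].zone_pgdat = &node;

            /* The long chain and the direct access name the same lock. */
            assert(&node.node_zones[zoneidx].zone_pgdat->lru_lock == &node.lru_lock);
            printf("same lock either way: %p\n", (void *)&node.lru_lock);
            return 0;
    }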

    [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
    Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64:

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function                        old     new   delta
    hv_synic_alloc.cold              88     110     +22
    prealloc_shrinker               260     262      +2
    bootstrap                       249     251      +2
    sched_init_numa                1566    1567      +1
    show_slab_objects               778     777      -1
    s_show                         1201    1200      -1
    kmem_cache_init                 346     345      -1
    __alloc_workqueue_key          1146    1145      -1
    mem_cgroup_css_alloc           1614    1612      -2
    __do_sys_swapon                4702    4699      -3
    __list_lru_init                 655     651      -4
    nic_probe                      2379    2374      -5
    store_user_store                118     111      -7
    red_zone_store                  106      99      -7
    poison_store                    106      99      -7
    wq_numa_init                    348     338     -10
    __kmem_cache_empty               75      65     -10
    task_numa_free                  186     173     -13
    merge_across_nodes_store        351     336     -15
    irq_create_affinity_masks      1261    1246     -15
    do_numa_crng_init               343     321     -22
    task_numa_fault                4760    4737     -23
    swapfile_init                   179     156     -23
    hv_synic_alloc                  536     492     -44
    apply_wqattrs_prepare           746     695     -51

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Currently THP allocation events data is fairly opaque, since you can
    only get it system-wide. This patch makes it easier to reason about
    transparent hugepage behaviour on a per-memcg basis.

    For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1,
    which is used for v1's rss_huge [sic]. This is reused here as it's
    fairly involved to untangle NR_ANON_THPS right now to make it per-memcg,
    since right now some of this is delegated to rmap before we have any
    memcg actually assigned to the page. It's a good idea to rework that,
    but let's leave untangling THP allocation for a future patch.

    [akpm@linux-foundation.org: fix build]
    [chris@chrisdown.name: fix memcontrol build when THP is disabled]
    Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name
    Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • If a memory cgroup contains a single process with many threads
    (including different process groups sharing the mm) then it is possible
    to trigger a race when the oom killer complains that there are no
    oom-eligible tasks and dumps that complaint into the log, which is both
    annoying and confusing because there is no actual problem. The race
    looks as follows:

    P1                         oom_reaper                P2
    try_charge                                           try_charge
      mem_cgroup_out_of_memory
        mutex_lock(oom_lock)
          out_of_memory
            oom_kill_process(P1,P2)
              wake_oom_reaper
        mutex_unlock(oom_lock)
                               oom_reap_task
                                                           mutex_lock(oom_lock)
                                                             select_bad_process # no victim

    The problem is more visible with many threads.

    Fix this by checking for fatal_signal_pending from
    mem_cgroup_out_of_memory when the oom_lock is already held.

    The oom bypass is safe because we do the same early in the try_charge
    path already; the situation might have changed in the meantime. It
    should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
    but for better code readability abstract the current charge bypass
    condition into should_force_charge and reuse it from that path.

    Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • memcg has a significant number of files exposed to kernfs whose value
    is either exposed directly or is "max" in the case of PAGE_COUNTER_MAX.

    This patch makes this generic by providing a single function to do this
    work. In combination with the previous patch adding
    mem_cgroup_from_seq, this makes all of the seq_show feeder functions
    significantly simpler.
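
    The idea boils down to a one-line branch; a minimal standalone sketch
    (the print_limit() helper and the PAGE_COUNTER_MAX sentinel value here
    are only models, not the kernel definitions):

    #include <limits.h>
    #include <stdio.h>

    #define PAGE_COUNTER_MAX LONG_MAX       /* stand-in for the "no limit" sentinel */

    /* One shared printer replaces a hand-rolled seq_show body per file. */
    static void print_limit(const char *name, long pages, long page_size)
    {
            if (pages == PAGE_COUNTER_MAX)
                    printf("%s: max\n", name);
            else
                    printf("%s: %ld\n", name, pages * page_size);
    }

    int main(void)
    {
            print_limit("memory.max", PAGE_COUNTER_MAX, 4096);      /* -> "max"  */
            print_limit("memory.high", 25600, 4096);                /* -> bytes  */
            return 0;
    }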

    Link: http://lkml.kernel.org/r/20190124194100.GA31425@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • This is the start of a series of patches similar to my earlier
    DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm).

    There are a bunch of places we go from seq_file to mem_cgroup, which
    currently requires manually getting the css, then getting the mem_cgroup
    from the css. It's in enough places now that having mem_cgroup_from_seq
    makes sense (and also makes the next patch a bit nicer).

    Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

    struct foo {
            int stuff;
            void *entry[];
    };

    instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

    instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

    This code was detected with the help of Coccinelle.
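
    For reference, a userspace approximation of what a struct_size()-style
    helper buys over the open-coded arithmetic (the real kernel macro
    differs; this sketch only shows the overflow-checked size computation,
    using the GCC/Clang overflow builtins):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct foo {
            int stuff;
            void *entry[];                  /* flexible array member */
    };

    /* Header plus 'count' elements, with the multiply/add checked for
     * overflow so a huge count yields SIZE_MAX and the allocation fails
     * cleanly instead of being silently undersized. */
    static size_t foo_size(size_t count)
    {
            size_t bytes;

            if (__builtin_mul_overflow(count, sizeof(void *), &bytes) ||
                __builtin_add_overflow(bytes, sizeof(struct foo), &bytes))
                    return SIZE_MAX;
            return bytes;
    }

    int main(void)
    {
            struct foo *instance = malloc(foo_size(8));

            if (instance) {
                    instance->entry[7] = instance;  /* last element is addressable */
                    printf("allocated %zu bytes\n", foo_size(8));
                    free(instance);
            }
            printf("overflowing count maps to %zu\n", foo_size(SIZE_MAX));
            return 0;
    }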

    Link: http://lkml.kernel.org/r/20190104183726.GA6374@embeddedor
    Signed-off-by: Gustavo A. R. Silva
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     
  • Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
    functions, so, the users don't have to explicitly check that condition.

    This is purely a code cleanup patch without any functional change. Only
    the order of checks in memcg_charge_slab() can potentially be changed,
    but functionally it will be the same. This should not matter as
    memcg_charge_slab() is not in the hot path.

    Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

29 Dec, 2018

2 commits

  • Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
    eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
    out_of_memory back to the charge path") has moved the oom handling back to
    the charge path. While doing so the notification was left behind in
    mem_cgroup_oom_synchronize.

    Fix the issue by replicating the oom hierarchy locking and the
    notification.

    Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
    Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
    Signed-off-by: Michal Hocko
    Reported-by: Burt Holzman
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The current oom report doesn't display victim's memcg context during the
    global OOM situation. While this information is not strictly needed, it
    can be really helpful for containerized environments to locate which
    container has lost a process. Now that we have a single line for the oom
    context, we can trivially add both the oom memcg (this can be either
    global_oom or a specific memcg which hits its hard limits) and task_memcg
    which is the victim's memcg.

    Below is the single line output in the oom report after this patch.

    - global oom context information:

    oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,global_oom,task_memcg=,task=,pid=,uid=

    - memcg oom context information:

    oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,oom_memcg=,task_memcg=,task=,pid=,uid=

    [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()]
    Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp
    Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com
    Signed-off-by: yuzhoujian
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Tetsuo Handa
    Cc: Roman Gushchin
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yuzhoujian
     

04 Nov, 2018

1 commit

  • Mike Galbraith reported a regression caused by the commit 9b6f7e163cd0
    ("mm: rework memcg kernel stack accounting") on a system with
    "cgroup_disable=memory" boot option: the system panics with the following
    stack trace:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
    PGD 0 P4D 0
    Oops: 0002 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 1 Comm: systemd Not tainted 4.19.0-preempt+ #410
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fed4
    RIP: 0010:page_counter_try_charge+0x22/0xc0
    Code: 41 5d c3 c3 0f 1f 40 00 0f 1f 44 00 00 48 85 ff 0f 84 a7 00 00 00 41 56 48 89 f8 49 89 fe 49
    Call Trace:
    try_charge+0xcb/0x780
    memcg_kmem_charge_memcg+0x28/0x80
    memcg_kmem_charge+0x8b/0x1d0
    copy_process.part.41+0x1ca/0x2070
    _do_fork+0xd7/0x3d0
    do_syscall_64+0x5a/0x180
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The problem occurs because get_mem_cgroup_from_current() returns the NULL
    pointer if memory controller is disabled. Let's check if this is a case
    at the beginning of memcg_kmem_charge() and just return 0 if
    mem_cgroup_disabled() returns true. This is how we handle this case in
    many other places in the memory controller code.

    Link: http://lkml.kernel.org/r/20181029215123.17830-1-guro@fb.com
    Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    Signed-off-by: Roman Gushchin
    Reported-by: Mike Galbraith
    Acked-by: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

27 Oct, 2018

5 commits

  • It was reported that on some of our machines containers were restarted
    with OOM symptoms without an obvious reason. Despite there being almost
    no memory pressure and plenty of page cache, the MEMCG_OOM event was
    raised occasionally, causing the container management software to think
    that an OOM had happened. However, no tasks had been killed.

    The following investigation showed that the problem is caused by a failing
    attempt to charge a high-order page. In such case, the OOM killer is
    never invoked. As shown below, it can happen under conditions, which are
    very far from a real OOM: e.g. there is plenty of clean page cache and no
    memory pressure.

    There is no sense in raising an OOM event in this case, as it might
    confuse a user and lead to wrong and excessive actions (e.g. restart the
    workload, as in my case).

    Let's look at the charging path in try_charge(). If the memory usage is
    about memory.max, which is absolutely natural for most memory cgroups, we
    try to reclaim some pages. Even if we were able to reclaim enough memory
    for the allocation, the following check can fail due to a race with
    another concurrent allocation:

        if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
                goto retry;

    For regular pages the following condition will save us from triggering
    the OOM:

        if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
                goto retry;

    But for a high-order allocation this condition will intentionally fail.
    The reason behind this is that we'll likely fall back to regular pages
    anyway, so it's ok and even preferred to return ENOMEM.

    In this case the idea of raising MEMCG_OOM looks dubious.

    Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
    order check, so that the event won't be raised for high order allocations.
    This change doesn't affect regular pages allocation and charging.

    Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This will allow using the generic refcount_t interfaces to check for
    counter overflow instead of the currently existing VM_BUG_ON(). The
    only difference after the patch is that VM_BUG_ON() may cause BUG(),
    while refcount_t fires with WARN(). But this seems not to be
    significant here, since such problems are usually caught by syzbot with
    panic-on-warn enabled.

    Link: http://lkml.kernel.org/r/153910718919.7006.13400779039257185427.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Andrea Parri
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • The flag memcg_kmem_skip_account was added during the era of opt-out kmem
    accounting. There is no need for such a flag in the opt-in world as there
    aren't any __GFP_ACCOUNT allocations within memcg_create_cache_enqueue().

    Link: http://lkml.kernel.org/r/20180919004501.178023-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • The refault stats go better with the page fault stats, and are of
    higher interest than the stats on LRU operations. In fact they used to
    be grouped together; when the LRU operation stats were added later on,
    they were wedged in between.

    Move them back together. Documentation/admin-guide/cgroup-v2.rst
    already lists them in the right order.

    Link: http://lkml.kernel.org/r/20181010140239.GA2527@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg charge is batched using per-cpu stocks, so an offline memcg can
    be pinned by a cached charge up to the moment when a process belonging
    to some other cgroup charges some memory on the same cpu. In other
    words, cached charges can prevent a memory cgroup from being reclaimed
    for some time, without any clear need.

    Let's optimize it by explicit draining of all stocks on css offlining. As
    draining is performed asynchronously, and is skipped if any parallel
    draining is happening, it's cheap.

    Link: http://lkml.kernel.org/r/20180827162621.30187-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.
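
    The extra bit comes from the fact that kernel pointers are at least
    word-aligned, so bit 0 can tag "this slot holds an integer value". A
    minimal model of that tagging scheme (along the lines of what
    xa_mk_value()/xa_to_value() do, but not the real implementation):

    #include <assert.h>
    #include <stdio.h>

    /* Integers are shifted up and tagged with bit 0, which a word-aligned
     * pointer never has set; that leaves BITS_PER_LONG - 1 bits of payload. */
    static void *mk_value(unsigned long v)
    {
            return (void *)((v << 1) | 1);
    }

    static int is_value(const void *entry)
    {
            return ((unsigned long)entry & 1) != 0;
    }

    static unsigned long to_value(const void *entry)
    {
            return (unsigned long)entry >> 1;
    }

    int main(void)
    {
            int obj = 42;
            void *slots[2] = { mk_value(12345), &obj };     /* value and pointer coexist */

            for (int i = 0; i < 2; i++) {
                    if (is_value(slots[i]))
                            printf("slot %d: value %lu\n", i, to_value(slots[i]));
                    else
                            printf("slot %d: pointer %p\n", i, slots[i]);
            }
            assert(to_value(mk_value(7)) == 7);
            return 0;
    }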

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

05 Sep, 2018

1 commit

  • When the memcg OOM killer runs out of killable tasks, it currently
    prints a WARN with no further OOM context. This has caused some user
    confusion.

    Warnings indicate a kernel problem. In a reported case, however, the
    situation was triggered by a nonsensical memcg configuration (hard limit
    set to 0). But without any VM context this wasn't obvious from the
    report, and it took some back and forth on the mailing list to identify
    what is actually a trivial issue.

    Handle this OOM condition like we handle it in the global OOM killer:
    dump the full OOM context and tell the user we ran out of tasks.

    This way the user can identify misconfigurations easily by themselves
    and rectify the problem - without having to go through the hassle of
    running into an obscure but unsettling warning, finding the appropriate
    kernel mailing list and waiting for a kernel developer to remote-analyze
    that the memcg configuration caused this.

    If users cannot make sense of why the OOM killer was triggered or why it
    failed, they will still report it to the mailing list, we know that from
    experience. So in case there is an actual kernel bug causing this,
    kernel developers will very likely hear about it.

    Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Aug, 2018

2 commits

  • For some workloads an intervention from the OOM killer can be painful.
    Killing a random task can bring the workload into an inconsistent state.

    Historically, there are two common solutions for this
    problem:
    1) enabling panic_on_oom,
    2) using a userspace daemon to monitor OOMs and kill
    all outstanding processes.

    Both approaches have their downsides: rebooting on each OOM is an obvious
    waste of capacity, and handling all in userspace is tricky and requires a
    userspace agent, which will monitor all cgroups for OOMs.

    In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
    the necessity of enabling panic_on_oom. Also, it can simplify the cgroup
    management for userspace applications.

    This commit introduces a new knob for cgroup v2 memory controller:
    memory.oom.group. The knob determines whether the cgroup should be
    treated as an indivisible workload by the OOM killer. If set, all tasks
    belonging to the cgroup or to its descendants (if the memory cgroup is not
    a leaf cgroup) are killed together or not at all.

    To determine which cgroup has to be killed, we traverse the cgroup
    hierarchy from the victim task's cgroup up to the OOMing cgroup (or
    root), looking for the highest-level cgroup with memory.oom.group set.

    Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
    an exception and are never killed.

    This patch doesn't change the OOM victim selection algorithm.
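
    A sketch of the upward walk described above, as standalone C with
    invented struct fields (the in-kernel version has more to handle, e.g.
    the oom_score_adj protection mentioned above):

    #include <stdbool.h>
    #include <stdio.h>

    struct cgroup {
            const char *name;
            struct cgroup *parent;
            bool oom_group;                 /* the memory.oom.group knob */
    };

    /* Walk from the victim's cgroup up to the OOMing cgroup (or root) and
     * remember the highest-level cgroup that has oom_group set. */
    static struct cgroup *find_oom_group(struct cgroup *victim,
                                         struct cgroup *oom_domain)
    {
            struct cgroup *group = NULL;

            for (struct cgroup *c = victim; c; c = c->parent) {
                    if (c->oom_group)
                            group = c;      /* keep the highest one seen so far */
                    if (c == oom_domain)
                            break;
            }
            return group;
    }

    int main(void)
    {
            struct cgroup root   = { "root",   NULL,  false };
            struct cgroup job    = { "job",    &root, true  };  /* indivisible workload */
            struct cgroup worker = { "worker", &job,  false };

            struct cgroup *kill = find_oom_group(&worker, &root);

            printf("kill whole group: %s\n", kill ? kill->name : "(just the victim)");
            return 0;
    }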

    Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
    to collect the stats while cgroup-v2's memory_stat_show traverses the
    memcg tree thrice. On a large machine, a couple thousand memcgs is very
    normal, and if the churn is high and memcgs stick around due to several
    reasons, tens of thousands of nodes can exist in the memcg tree. This patch
    has refactored and shared the stat collection code between cgroup-v1 and
    cgroup-v2 and has reduced the tree traversal to just one.
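
    The shape of the change, as a toy model (hypothetical stat layout; the
    point is a single tree walk that fills every counter at once, instead
    of one walk per counter):

    #include <stdio.h>

    #define NR_STATS 4

    struct memcg {
            long stat[NR_STATS];
            struct memcg *children[2];
    };

    /* Old shape: one full tree traversal per statistic (~17 of them for v1). */
    static long collect_one(const struct memcg *memcg, int idx)
    {
            long sum = memcg->stat[idx];

            for (int i = 0; i < 2; i++)
                    if (memcg->children[i])
                            sum += collect_one(memcg->children[i], idx);
            return sum;
    }

    /* New shape: a single traversal accumulates all statistics at once. */
    static void collect_all(const struct memcg *memcg, long out[NR_STATS])
    {
            for (int s = 0; s < NR_STATS; s++)
                    out[s] += memcg->stat[s];
            for (int i = 0; i < 2; i++)
                    if (memcg->children[i])
                            collect_all(memcg->children[i], out);
    }

    int main(void)
    {
            struct memcg child = { { 1, 2, 3, 4 }, { NULL, NULL } };
            struct memcg root  = { { 10, 20, 30, 40 }, { &child, NULL } };
            long acc[NR_STATS] = { 0 };

            collect_all(&root, acc);
            for (int s = 0; s < NR_STATS; s++)
                    printf("stat %d: one pass %ld, per-stat walk %ld\n",
                           s, acc[s], collect_one(&root, s));
            return 0;
    }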

    I ran a simple benchmark which reads the root_mem_cgroup's stat file
    1000 times in the presence of 2500 memcgs on cgroup-v1. The results are:

    Without the patch:
    $ time ./read-root-stat-1000-times

    real 0m1.663s
    user 0m0.000s
    sys 0m1.660s

    With the patch:
    $ time ./read-root-stat-1000-times

    real 0m0.468s
    user 0m0.000s
    sys 0m0.467s

    Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Cc: Bruce Merry
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

18 Aug, 2018

5 commits

  • To avoid further unneeded calls of do_shrink_slab() for shrinkers which
    already do not have any charged objects in a memcg, their bits have to
    be cleared.

    This patch introduces a lockless mechanism to do that without races
    with parallel list_lru_add(). After do_shrink_slab() returns
    SHRINK_EMPTY the first time, we clear the bit and call it once again.
    Then we restore the bit if the new return value is different.

    Note, that single smp_mb__after_atomic() in shrink_slab_memcg() covers
    two situations:

    1) list_lru_add()          shrink_slab_memcg
         list_add_tail()         for_each_set_bit()
         set_bit()               do_shrink_slab()

    ... before the first call of do_shrink_slab() instead of this, to not
    slow down the generic case. Also, it's needed for the second call, as
    seen below in (2).

    2) list_lru_add()          shrink_slab_memcg()
         list_add_tail()         ...
         set_bit()               ...
       ...                       for_each_set_bit()
       do_shrink_slab()          do_shrink_slab()
         clear_bit()             ...
       ...                       ...
       list_lru_add()            ...
         list_add_tail()         clear_bit()

       set_bit()                 do_shrink_slab()

    The barriers guarantee that the second do_shrink_slab() in the right
    side task sees list update if really cleared the bit. This case is
    drawn in the code comment.
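
    A single-process sketch of the clear-and-recheck protocol (C11 atomics
    stand in for the bitmap ops and the smp_mb__after_atomic() pairing; the
    SHRINK_EMPTY value and the sizes are stand-ins, not the kernel
    definitions):

    #include <stdatomic.h>
    #include <stdio.h>

    #define NR_IDS        8
    #define SHRINK_EMPTY  (-2L)             /* stand-in for the kernel's value */

    static atomic_ulong shrinker_map;       /* one bit per shrinker id */
    static int nr_objects[NR_IDS];          /* toy per-shrinker LRU population */

    static void list_lru_add(int id)
    {
            nr_objects[id]++;
            /* pairs with the fence after clear_bit() on the shrink side */
            atomic_thread_fence(memory_order_seq_cst);
            atomic_fetch_or(&shrinker_map, 1UL << id);              /* set_bit() */
    }

    static long do_shrink_slab(int id)
    {
            if (!nr_objects[id])
                    return SHRINK_EMPTY;
            long freed = nr_objects[id];
            nr_objects[id] = 0;
            return freed;
    }

    /* Shrink only shrinkers whose bit is set; on SHRINK_EMPTY clear the bit,
     * recheck once, and restore the bit if objects appeared in the meantime. */
    static void shrink_slab_memcg(void)
    {
            unsigned long map = atomic_load(&shrinker_map);

            for (int id = 0; id < NR_IDS; id++) {
                    if (!(map & (1UL << id)))
                            continue;
                    long ret = do_shrink_slab(id);
                    if (ret == SHRINK_EMPTY) {
                            atomic_fetch_and(&shrinker_map, ~(1UL << id)); /* clear_bit() */
                            atomic_thread_fence(memory_order_seq_cst);     /* <MB> */
                            ret = do_shrink_slab(id);
                            if (ret != SHRINK_EMPTY)
                                    atomic_fetch_or(&shrinker_map, 1UL << id);
                    }
                    printf("shrinker %d: ret %ld, bit now %s\n", id, ret,
                           (atomic_load(&shrinker_map) >> id) & 1 ? "set" : "clear");
            }
    }

    int main(void)
    {
            list_lru_add(3);        /* populate one shrinker and set its bit */
            shrink_slab_memcg();    /* frees the object; bit stays set */
            shrink_slab_memcg();    /* sees SHRINK_EMPTY and clears the bit */
            shrink_slab_memcg();    /* bit clear: shrinker 3 is skipped entirely */
            return 0;
    }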

    [Results/performance of the patchset]

    After the whole patchset is applied, the test below shows a significant
    increase in performance:

    $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
    $mkdir /sys/fs/cgroup/memory/ct
    $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
    $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
    mkdir -p s/$i; mount -t tmpfs $i s/$i;
    touch s/$i/file; done

    Then, 5 sequential calls of drop caches:

    $time echo 3 > /proc/sys/vm/drop_caches

    1)Before:
    0.00user 13.78system 0:13.78elapsed 99%CPU
    0.00user 5.59system 0:05.60elapsed 99%CPU
    0.00user 5.48system 0:05.48elapsed 99%CPU
    0.00user 8.35system 0:08.35elapsed 99%CPU
    0.00user 8.34system 0:08.35elapsed 99%CPU

    2)After
    0.00user 1.10system 0:01.10elapsed 99%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU

    The results show a performance increase of at least 548 times.

    Shakeel Butt tested this patchset with fork-bomb on his configuration:

    > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
    > file containing few KiBs on corresponding mount. Then in a separate
    > memcg of 200 MiB limit ran a fork-bomb.
    >
    > I ran the "perf record -ag -- sleep 60" and below are the results:
    >
    > Without the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
    > + 36.40% fb.sh [kernel.kallsyms] [k] shrink_slab
    > + 18.97% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    > + 6.75% fb.sh [kernel.kallsyms] [k] super_cache_count
    > + 0.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 0.44% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 0.27% fb.sh [kernel.kallsyms] [k] up_read
    > + 0.21% fb.sh [kernel.kallsyms] [k] osq_lock
    > + 0.13% fb.sh [kernel.kallsyms] [k] shmem_unused_huge_count
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node
    >
    > With the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
    > + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 30.72% fb.sh [kernel.kallsyms] [k] up_read
    > + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
    > + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    > + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    > + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
    > + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
    > + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
    > + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Introduce set_shrinker_bit() function to set shrinker-related bit in
    memcg shrinker bitmap, and set the bit after the first item is added and
    in case of reparenting destroyed memcg's items.

    This will allow the next patch to call shrinkers only when they have
    charged objects at the moment, and to improve shrink_slab() performance.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • This will be used in next patch.

    Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • This is just refactoring to allow the next patches to have dst_memcg
    pointer in memcg_drain_list_lru_node().

    Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Imagine a big node with many cpus, memory cgroups and containers. Say
    we have 200 containers, and every container has 10 mounts and 10
    cgroups. Container tasks don't touch foreign containers' mounts. If
    there is intensive page writing and global reclaim happens, a writing
    task has to iterate over all memcgs to shrink slab before it's able to
    go to shrink_page_list().

    Iteration over all the memcg slabs is very expensive: the task has to
    visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
    2000 memcgs, the total calls are 2000 * 2000 = 4000000.

    So, the shrinker makes 4 million do_shrink_slab() calls just to try to
    isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs via
    shrink_page_list(). I've observed a node spending almost 100% of its
    time in the kernel, making useless iterations over already-shrunk slabs.

    This patch adds a bitmap of memcg-aware shrinkers to the memcg. The size
    of the bitmap depends on bitmap_nr_ids, and during the memcg's lifetime
    it is kept large enough to fit bitmap_nr_ids shrinkers. Every bit in the
    map corresponds to a shrinker id.

    The next patches will keep a bit set only for memcgs that really have
    charged objects. This will allow shrink_slab() to improve its
    performance significantly.
    See the last patch for the numbers.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
    [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
    Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
    Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai