08 Apr, 2014

13 commits

  • Currently we destroy child caches at the very beginning of
    kmem_cache_destroy(). This is wrong, because the root cache will not
    necessarily be destroyed in the end - if it has aliases (refcount > 0),
    kmem_cache_destroy() will simply decrement its refcount and return. In
    this case, at best we will get a bunch of warnings in dmesg, like this
    one:

    kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
    CPU: 1 PID: 7139 Comm: modprobe Tainted: G B W 3.13.0+ #117
    Call Trace:
    dump_stack+0x49/0x5b
    kmem_cache_destroy+0xdf/0xf0
    kmem_cache_destroy_memcg_children+0x97/0xc0
    kmem_cache_destroy+0xf/0xf0
    xfs_mru_cache_uninit+0x21/0x30 [xfs]
    exit_xfs_fs+0x2e/0xc44 [xfs]
    SyS_delete_module+0x198/0x1f0
    system_call_fastpath+0x16/0x1b

    At worst, if kmem_cache_destroy() races with an allocation from a
    memcg cache, the kernel will panic.

    This patch fixes this by moving child cache destruction to after the
    check for aliases. Plus, it forbids destroying a root cache if it
    still has child caches, because each child cache keeps a reference to
    its parent.
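
    Roughly, the reordered flow looks like this (a minimal sketch; the exact
    helper names and return conventions in the real patch may differ):

    void kmem_cache_destroy(struct kmem_cache *s)
    {
            mutex_lock(&slab_mutex);

            s->refcount--;
            if (s->refcount)
                    goto out_unlock;        /* cache still has aliases */

            /* only now tear down the per-memcg children */
            if (kmem_cache_destroy_memcg_children(s) != 0)
                    goto out_unlock;        /* child caches remain, keep the root */

            /* ... proceed with the actual shutdown of the root cache ... */
    out_unlock:
            mutex_unlock(&slab_mutex);
    }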

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, memcg_unregister_cache(), which deletes the cache being
    destroyed from the memcg_slab_caches list, is called after
    __kmem_cache_shutdown() (see kmem_cache_destroy()), which starts to
    destroy the cache.

    As a result, one can access a partially destroyed cache while traversing
    a memcg_slab_caches list, which can have deadly consequences (for
    instance, cache_show() called for each cache on a memcg_slab_caches list
    from mem_cgroup_slabinfo_read() will dereference pointers to already
    freed data).

    To fix this, let's move memcg_unregister_cache() to before the cache
    destruction begins, and call memcg_register_cache() again on failure.
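
    A sketch of the resulting order inside kmem_cache_destroy() (simplified;
    locking and the remaining teardown steps are omitted):

    list_del(&s->list);
    memcg_unregister_cache(s);      /* hide the cache from memcg_slab_caches first */

    if (__kmem_cache_shutdown(s) != 0) {
            /* destruction failed: make the cache visible again */
            list_add(&s->list, &slab_caches);
            memcg_register_cache(s);
            pr_err("kmem_cache_destroy %s: Slab cache still has objects\n",
                   s->name);
    }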

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
    memcg-only and except-for-memcg calls. To clean this up, let's move the
    code responsible for memcg cache creation to a separate function.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patch cleans up the memcg cache creation path as follows:

    - Move memcg cache name creation to a separate function to be called
    from kmem_cache_create_memcg(). This allows us to get rid of the mutex
    protecting the temporary buffer used for the name formatting, because
    the whole cache creation path is protected by the slab_mutex.

    - Get rid of memcg_create_kmem_cache(). This function serves as a proxy
    to kmem_cache_create_memcg(). After separating the cache name creation
    path, it would be reduced to a function call, so let's inline it.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • mem_cgroup_newpage_charge is used only for charging anonymous memory so
    it is better to rename it to mem_cgroup_charge_anon.

    mem_cgroup_cache_charge is used for file backed memory so rename it to
    mem_cgroup_charge_file.

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Some callsites pass a memcg directly, some callsites pass an mm that
    then has to be translated to a memcg. This makes for a terrible
    function interface.

    Just push the mm-to-memcg translation into the respective callsites and
    always pass a memcg to mem_cgroup_try_charge().
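
    The bracketed note below refers to a small convenience wrapper for callers
    that only have an mm; a sketch of what such a helper could look like (the
    name and exact signature here are illustrative):

    static int mem_cgroup_try_charge_mm(struct mm_struct *mm, gfp_t gfp_mask,
                                        unsigned int nr_pages, bool oom)
    {
            struct mem_cgroup *memcg;
            int ret;

            /* translate mm to memcg once, in the helper, not in the core */
            memcg = get_mem_cgroup_from_mm(mm);
            ret = mem_cgroup_try_charge(memcg, gfp_mask, nr_pages, oom);
            css_put(&memcg->css);
            return ret;
    }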

    [mhocko@suse.cz: add charge mm helper]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __mem_cgroup_try_charge duplicates get_mem_cgroup_from_mm for charges
    which came without a memcg. The only reason seems to be a tiny
    optimization when css_tryget is not called if the charge can be consumed
    from the stock. Nevertheless css_tryget is very cheap since it has been
    reworked to use per-cpu counting so this optimization doesn't give us
    anything these days.

    So let's drop the code duplication so that the code is more readable.

    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm
    owner is exiting, just return root_mem_cgroup. This makes sense for all
    callsites and gets rid of some of them having to fallback manually.
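
    A sketch of the idea (simplified; handling of a NULL mm is left out):

    static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
    {
            struct mem_cgroup *memcg;

            rcu_read_lock();
            do {
                    memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
                    if (unlikely(!memcg))
                            /* mm owner is exiting: fall back to the root group */
                            memcg = root_mem_cgroup;
            } while (!css_tryget(&memcg->css));
            rcu_read_unlock();
            return memcg;
    }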

    [fengguang.wu@intel.com: fix warnings]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Fengguang Wu
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Users pass either a mm that has been established under task lock, or use
    a verified current->mm, which means the task can't be exiting.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only page cache charges can happen without an mm context, so push this
    special case out of the inner core and into the cache charge function.

    An ancient comment explains that the mm can also be NULL in case the
    task is currently being migrated, but that no longer applies to the
    current code, so just remove it.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_charge_common() is used by both cache and anon pages, but
    most of its body only applies to anon pages and the remainder is not
    worth having in a separate function.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It used to disable preemption and run sanity checks but now it's only
    taking a number out of one percpu counter and putting it into another.
    Do this directly in the callsite and save the indirection.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • lock_page_cgroup() disables preemption, remove explicit preemption
    disabling for code paths holding this lock.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h: cgroup.h
    indirectly got included into a lot of files, which brought in
    xattr.h, which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

19 Mar, 2014

1 commit

  • cftype->write_string() just passes on the writeable buffer from kernfs
    and there's no reason to add const restriction on the buffer. The
    only thing const achieves is unnecessarily complicating parsing of the
    buffer. Drop const from @buffer.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

04 Mar, 2014

2 commits

  • Sometimes the cleanup after memcg hierarchy testing gets stuck in
    mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

    There may turn out to be several causes, but a major cause is this: the
    workitem to offline parent can get run before workitem to offline child;
    parent's mem_cgroup_reparent_charges() circles around waiting for the
    child's pages to be reparented to its lrus, but it's holding
    cgroup_mutex which prevents the child from reaching its
    mem_cgroup_reparent_charges().

    Further testing showed that an ordered workqueue for cgroup_destroy_wq
    is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
    stage on the way can mess up the order before reaching the workqueue.

    Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
    all its children (and grandchildren, in the correct order) to have their
    charges reparented first.
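
    In other words, the offline path walks its own subtree instead of relying
    on workqueue ordering; a rough sketch using the generic post-order
    descendant iterator (children are visited before their parents):

    static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
    {
            struct cgroup_subsys_state *iter;

            /* reparent the charges of every descendant, bottom-up */
            css_for_each_descendant_post(iter, css)
                    mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
    }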

    Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
    Signed-off-by: Filipe Brandenburger
    Signed-off-by: Hugh Dickins
    Reviewed-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: [v3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Filipe Brandenburger
     
  • Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in
    mem_cgroup_iter") got the interaction with the commit a few before it
    d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
    initialized") slightly wrong, and we didn't notice at the time.

    It's elusive, and harder to hit than the original, but for a couple of
    days before rc1, I several times saw an endless loop similar to the one
    supposedly being fixed.

    This time it was a tighter loop in __mem_cgroup_iter_next(): because we
    can get here when our root has already been offlined, and the ordering
    of conditions was such that we then just cycled around forever.

    Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter").
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

26 Feb, 2014

1 commit

  • Kirill has reported the following:

    Task in /test killed as a result of limit of /test
    memory: usage 10240kB, limit 10240kB, failcnt 51
    memory+swap: usage 10240kB, limit 10240kB, failcnt 0
    kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
    Memory cgroup stats for /test:

    BUG: sleeping function called from invalid context at kernel/cpu.c:68
    in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
    2 locks held by memcg_test/66:
    #0: (memcg_oom_lock#2){+.+...}, at: [] pagefault_out_of_memory+0x14/0x90
    #1: (oom_info_lock){+.+...}, at: [] mem_cgroup_print_oom_info+0x2a/0x390
    CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
    Call Trace:
    __might_sleep+0x16a/0x210
    get_online_cpus+0x1c/0x60
    mem_cgroup_read_stat+0x27/0xb0
    mem_cgroup_print_oom_info+0x260/0x390
    dump_header+0x88/0x251
    ? trace_hardirqs_on+0xd/0x10
    oom_kill_process+0x258/0x3d0
    mem_cgroup_oom_synchronize+0x656/0x6c0
    ? mem_cgroup_charge_common+0xd0/0xd0
    pagefault_out_of_memory+0x14/0x90
    mm_fault_error+0x91/0x189
    __do_page_fault+0x48e/0x580
    do_page_fault+0xe/0x10
    page_fault+0x22/0x30

    which complains that mem_cgroup_read_stat cannot be called from an atomic
    context but mem_cgroup_print_oom_info takes a spinlock. Change
    oom_info_lock to a mutex.

    This was introduced by 947b3dd1a84b ("memcg, oom: lock
    mem_cgroup_print_oom_info").

    Signed-off-by: Michal Hocko
    Reported-by: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

13 Feb, 2014

1 commit

  • cgroup_task_count() read-locks css_set_lock and walks all tasks to
    count them and then returns the result. The only thing all the users
    want is determining whether the cgroup is empty or not. This patch
    implements cgroup_has_tasks() which tests whether cgroup->cset_links
    is empty, replaces all cgroup_task_count() usages and unexports it.

    Note that the test isn't synchronized. This is the same as before.
    The test has always been racy.

    This will help planned css_set locking update.
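
    The replacement test boils down to a list-emptiness check; a minimal
    sketch:

    static inline bool cgroup_has_tasks(struct cgroup *cgrp)
    {
            /* racy by design, same as the old cgroup_task_count() != 0 test */
            return !list_empty(&cgrp->cset_links);
    }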

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

12 Feb, 2014

2 commits

  • cgroup->name handling became quite complicated over time involving
    dedicated struct cgroup_name for RCU protection. Now that cgroup is
    on kernfs, we can drop all of it and simply use kernfs_name/path() and
    friends. Replace cgroup->name and all related code with kernfs
    name/path constructs.

    * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
    of kernfs counterparts, which involves semantic changes.
    pr_cont_cgroup_name() and pr_cont_cgroup_path() added.

    * cgroup->name handling dropped from cgroup_rename().

    * All users of cgroup_name/path() updated to the new semantics. Users
    which were formatting the string just to printk them are converted
    to use pr_cont_cgroup_name/path() instead, which simplifies things
    quite a bit. As cgroup_name() no longer requires RCU read lock
    around it, RCU lockings which were protecting only cgroup_name() are
    removed.

    v2: Comment above oom_info_lock updated as suggested by Michal.

    v3: dummy_top doesn't have a kn associated, and
    pr_cont_cgroup_name/path() ended up calling the matching kernfs
    functions with a NULL kn, leading to an oops. Test for NULL kn and
    print "/" if so. This issue was reported by Fengguang Wu.

    v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Fengguang Wu
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     
  • css_from_dir() returns the matching css (cgroup_subsys_state) given a
    dentry and subsystem. The function doesn't pin the css before
    returning and requires the caller to be holding RCU read lock or
    cgroup_mutex and handling pinning on the caller side.

    Given that users of the function are likely to want to pin the
    returned css (both existing users do) and that getting and putting
    css's are very cheap, there's no reason for the interface to be tricky
    like this.

    Rename css_from_dir() to css_tryget_from_dir() and make it try to pin
    the found css and return it only if pinning succeeded. The callers
    are updated so that they no longer do RCU locking and pinning around
    the function and just use the returned css.

    This will also ease converting cgroup to kernfs.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

08 Feb, 2014

1 commit

  • cgroup_subsys is a bit messier than it needs to be.

    * The name of a subsys can be different from its internal identifier
    defined in cgroup_subsys.h. Most subsystems use the matching name
    but three - cpu, memory and perf_event - use different ones.

    * cgroup_subsys_id enums are postfixed with _subsys_id and each
    cgroup_subsys is postfixed with _subsys. cgroup.h is widely
    included throughout various subsystems; it doesn't and shouldn't
    have a claim on such generic names, which don't have any qualifier
    indicating that they belong to cgroup.

    * cgroup_subsys->subsys_id should always equal the matching
    cgroup_subsys_id enum; however, we require each controller to
    initialize it and then BUG if they don't match, which is a bit
    silly.

    This patch cleans up cgroup_subsys names and initialization by doing
    the following:

    * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
    cgroup_subsys with _cgrp_subsys.

    * With the above, renaming subsys identifiers to match the userland
    visible names doesn't cause any naming conflicts. All non-matching
    identifiers are renamed to match the official names.

    cpu_cgroup -> cpu
    mem_cgroup -> memory
    perf -> perf_event

    * controllers no longer need to initialize ->subsys_id and ->name.
    They're generated in cgroup core and set automatically during boot.

    * Redundant cgroup_subsys declarations removed.

    * While updating BUG_ON()s in cgroup_init_early(), convert them to
    WARN()s. BUGging that early during boot is stupid - the kernel
    can't print anything, even through serial console and the trap
    handler doesn't even link stack frame properly for back-tracing.

    This patch doesn't introduce any behavior changes.

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Acked-by: Aristeu Rozanski
    Acked-by: Ingo Molnar
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Serge E. Hallyn
    Cc: Vivek Goyal
    Cc: Thomas Graf

    Tejun Heo
     

31 Jan, 2014

1 commit

  • Commit 842e2873697e ("memcg: get rid of kmem_cache_dup()") introduced a
    mutex for memcg_create_kmem_cache() to protect the tmp_name buffer that
    holds the memcg name. It failed to unlock the mutex if this buffer
    could not be allocated.

    This patch fixes the issue by appropriately unlocking the mutex if the
    allocation fails.
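
    The shape of the fix, roughly (the mutex and buffer names here are only
    illustrative of the description above):

    mutex_lock(&memcg_name_mutex);          /* protects the shared tmp_name buffer */
    tmp_name = kmalloc(PATH_MAX, GFP_KERNEL);
    if (!tmp_name) {
            mutex_unlock(&memcg_name_mutex); /* previously missing on this path */
            return NULL;
    }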

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Glauber Costa
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

24 Jan, 2014

17 commits

  • Signed-off-by: Vladimir Davydov
    Reviewed-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Commit 19f39402864e ("memcg: simplify mem_cgroup_iter") has reorganized
    mem_cgroup_iter code in order to simplify it. A part of that change was
    dropping an optimization which didn't call css_tryget on the root of the
    walked tree. The patch however didn't change the css_put part in
    mem_cgroup_iter which excludes root.

    This wasn't an issue at the time because __mem_cgroup_iter_next bailed
    out for root early without taking a reference as cgroup iterators
    (css_next_descendant_pre) didn't visit root themselves.

    Nevertheless, cgroup iterators have been reworked to visit root by commit
    bd8815a6d802 ("cgroup: make css_for_each_descendant() and friends
    include the origin css in the iteration"), when the root bypass was
    dropped in __mem_cgroup_iter_next. This means that css_put is not
    called for root, and so the css, along with the mem_cgroup and other
    cgroup-internal objects tied to the css lifetime, is never freed.

    Fix the issue by reintroducing root check in __mem_cgroup_iter_next and
    do not take css reference for it.

    This reference counting magic protects us also from another issue, an
    endless loop reported by Hugh Dickins when reclaim races with root
    removal and css_tryget called by iterator internally would fail. There
    would be no other nodes to visit so __mem_cgroup_iter_next would return
    NULL and mem_cgroup_iter would interpret it as "start looping from root
    again" and so mem_cgroup_iter would loop forever internally.

    Signed-off-by: Michal Hocko
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Hugh has reported an endless loop when the hardlimit reclaim sees the
    same group all the time. This might happen when the reclaim races with
    the memcg removal.

    shrink_zone
                                             [rmdir root]
      mem_cgroup_iter(root, NULL, reclaim)
        // prev = NULL
        rcu_read_lock()
        mem_cgroup_iter_load
          last_visited = iter->last_visited   // gets root || NULL
          css_tryget(last_visited)            // failed
          last_visited = NULL                 [1]
        memcg = root = __mem_cgroup_iter_next(root, NULL)
        mem_cgroup_iter_update
          iter->last_visited = root;
          reclaim->generation = iter->generation

      mem_cgroup_iter(root, root, reclaim)
        // prev = root
        rcu_read_lock
        mem_cgroup_iter_load
          last_visited = iter->last_visited   // gets root
          css_tryget(last_visited)            // failed
        [1]

    The issue seemed to be introduced by commit 5f5781619718 ("memcg: relax
    memcg iter caching") which has replaced unconditional css_get/css_put by
    css_tryget/css_put for the cached iterator.

    This patch fixes the issue by skipping css_tryget on the root of the
    tree walk in mem_cgroup_iter_load and symmetrically doesn't release it
    in mem_cgroup_iter_update.
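
    A sketch of the load side after the fix (the root of the walk is used
    without taking a reference, matching the update side, which no longer
    drops one for it):

    /* in mem_cgroup_iter_load() */
    if (last_visited && last_visited != root &&
        !css_tryget(&last_visited->css))
            last_visited = NULL;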

    Signed-off-by: Michal Hocko
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When two threads have the same badness score, it's preferable to kill
    the thread group leader so that the actual process name is printed to
    the kernel log rather than the thread group name which may be shared
    amongst several processes.

    This was the behavior when select_bad_process() used to do
    for_each_process(), but it now iterates threads instead and leads to
    ambiguity.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It is surprising that the mem_cgroup iterator can return memcgs which
    have not yet been fully initialized. By accident (or trial and error?)
    this appears not to present an actual problem; but it may be better to
    prevent such surprises, by skipping memcgs not yet online.

    Signed-off-by: Hugh Dickins
    Cc: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Shorten mem_cgroup_reclaim_iter.last_dead_count from unsigned long to
    int: it's assigned from an int and compared with an int, and adjacent to
    an unsigned int: so there's no point to it being unsigned long, which
    wasted 104 bytes in every mem_cgroup_per_zone.

    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently we take both the memcg_create_mutex and the set_limit_mutex
    when we enable kmem accounting for a memory cgroup, which makes kmem
    activation events serialize with both memcg creations and other memcg
    limit updates (memory.limit, memory.memsw.limit). However, there is no
    point in such strict synchronization rules there.

    First, the set_limit_mutex was introduced to keep the memory.limit and
    memory.memsw.limit values in sync. Since memory.kmem.limit can be set
    independently of them, it is better to introduce a separate mutex to
    synchronize against concurrent kmem limit updates.

    Second, we take the memcg_create_mutex in order to make sure all
    children of this memcg will be kmem-active as well. For achieving that,
    it is enough to hold this mutex only while checking if
    memcg_has_children() though. This guarantees that if a child is added
    after we checked that the memcg has no children, the newly added cgroup
    will see its parent kmem-active (of course if the latter succeeded), and
    call kmem activation for itself.

    This patch simplifies the locking rules of memcg_update_kmem_limit()
    according to these considerations.
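
    A rough sketch of the resulting scheme (the mutex and helper names here
    are illustrative of the description above, not necessarily the ones used
    in the patch):

    static DEFINE_MUTEX(memcg_kmem_limit_mutex);    /* kmem limit updates only */

    static int memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
    {
            int ret;

            mutex_lock(&memcg_kmem_limit_mutex);
            if (!memcg_kmem_is_active(memcg))
                    /* memcg_create_mutex is taken only around the children check */
                    ret = memcg_activate_kmem(memcg, val);
            else
                    ret = res_counter_set_limit(&memcg->kmem, val);
            mutex_unlock(&memcg_kmem_limit_mutex);
            return ret;
    }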

    [vdavydov@parallels.com: fix unintialized var warning]
    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently we have two state bits in mem_cgroup::kmem_account_flags
    regarding kmem accounting activation, ACTIVATED and ACTIVE. We start
    kmem accounting only if both flags are set (memcg_can_account_kmem()),
    plus throughout the code there are several places where we check only
    the ACTIVE flag, but we never check the ACTIVATED flag alone. These
    flags are both set from memcg_update_kmem_limit() under the
    set_limit_mutex, the ACTIVE flag always being set after ACTIVATED, and
    they never get cleared. That said, checking if both flags are set is
    equivalent to checking only for the ACTIVE flag, and since there are no
    standalone ACTIVATED flag checks, we can safely remove the ACTIVATED
    flag, and nothing will change.

    Let's try to understand what was the reason for introducing these flags.
    The purpose of the ACTIVE flag is clear - it states that kmem should be
    accounting to the cgroup. The only requirement for it is that it should
    be set after we have fully initialized kmem accounting bits for the
    cgroup and patched all static branches relating to kmem accounting.
    Since we always check if static branch is enabled before actually
    considering if we should account (otherwise we wouldn't benefit from
    static branching), this guarantees us that we won't skip a commit or
    uncharge after a charge due to an unpatched static branch.

    Now let's move on to the ACTIVATED bit. As I proved in the beginning of
    this message, it is absolutely useless, and removing it will change
    nothing. So what was the reason for introducing it?

    The ACTIVATED flag was introduced by commit a8964b9b84f9 ("memcg: use
    static branches when code not in use") in order to guarantee that
    static_key_slow_inc(&memcg_kmem_enabled_key) would be called only once
    for each memory cgroup when its kmem accounting was activated. The
    point was that at that time the memcg_update_kmem_limit() function's
    work-flow looked like this:

    bool must_inc_static_branch = false;

    cgroup_lock();
    mutex_lock(&set_limit_mutex);
    if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
            /* The kmem limit is set for the first time */
            ret = res_counter_set_limit(&memcg->kmem, val);

            memcg_kmem_set_activated(memcg);
            must_inc_static_branch = true;
    } else
            ret = res_counter_set_limit(&memcg->kmem, val);
    mutex_unlock(&set_limit_mutex);
    cgroup_unlock();

    if (must_inc_static_branch) {
            /* We can't do this under cgroup_lock */
            static_key_slow_inc(&memcg_kmem_enabled_key);
            memcg_kmem_set_active(memcg);
    }

    So without the ACTIVATED flag we could race with other threads trying
    to set the limit and increment the static branching ref-counter more
    than once. Today we call the whole memcg_update_kmem_limit() function
    under the set_limit_mutex, so this race is impossible.

    As now we understand why the ACTIVATED bit was introduced and why we
    don't need it now, and know that removing it will change nothing anyway,
    let's get rid of it.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We relocate root cache's memcg_params whenever we need to grow the
    memcg_caches array to accommodate all kmem-active memory cgroups.
    Currently on relocation we free the old version immediately, which can
    lead to use-after-free, because the memcg_caches array is accessed
    lock-free (see cache_from_memcg_idx()). This patch fixes this by making
    memcg_params RCU-protected for root caches.
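
    A sketch of the two sides (assuming memcg_cache_params gains an rcu_head
    so the old copy can be freed after a grace period):

    /* reader: lock-free lookup, e.g. in cache_from_memcg_idx() */
    rcu_read_lock();
    params = rcu_dereference(cachep->memcg_params);
    memcg_cachep = params ? params->memcg_caches[idx] : NULL;
    rcu_read_unlock();

    /* updater: relocating the array (runs under slab_mutex) */
    rcu_assign_pointer(cachep->memcg_params, new_params);
    kfree_rcu(old_params, rcu_head);        /* old copy stays valid for readers in flight */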

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • kmem_cache_dup() is only called from memcg_create_kmem_cache(). The
    latter, in fact, does nothing besides this, so let's fold
    kmem_cache_dup() into memcg_create_kmem_cache().

    This patch also makes the memcg_cache_mutex private to
    memcg_create_kmem_cache(), because it is not used anywhere else.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We obtain a per-memcg cache from a root kmem_cache by dereferencing an
    entry of the root cache's memcg_params::memcg_caches array. If we find
    no cache for a memcg there on allocation, we initiate the memcg cache
    creation (see memcg_kmem_get_cache()). The cache creation proceeds
    asynchronously in memcg_create_kmem_cache() in order to avoid lock
    clashes, so there can be several threads trying to create the same
    kmem_cache concurrently, but only one of them may succeed. However, due
    to a race in the code, it is not always true. The point is that the
    memcg_caches array can be relocated when we activate kmem accounting for
    a memcg (see memcg_update_all_caches(), memcg_update_cache_size()). If
    memcg_update_cache_size() and memcg_create_kmem_cache() proceed
    concurrently as described below, we can leak a kmem_cache.

    Assume two threads schedule creation of the same kmem_cache. One of them
    successfully creates it. The other one should then fail, but if
    memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
    follows, it won't:

    memcg_create_kmem_cache()                  memcg_update_cache_size()
    (called w/o mutexes held)                  (called with slab_mutex,
                                                set_limit_mutex held)
    -------------------------                  -------------------------

    mutex_lock(&memcg_cache_mutex)

                                               s->memcg_params=kzalloc(...)

    new_cachep=cache_from_memcg_idx(cachep,idx)
    // new_cachep==NULL => proceed to creation

                                               s->memcg_params->memcg_caches[i]
                                                   =cur_params->memcg_caches[i]

    // kmem_cache_create_memcg takes slab_mutex
    // so we will hang around until
    // memcg_update_cache_size finishes, but
    // nothing will prevent it from succeeding so
    // memcg_caches[idx] will be overwritten in
    // memcg_register_cache!

    new_cachep = kmem_cache_create_memcg(...)
    mutex_unlock(&memcg_cache_mutex)

    Let's fix this by moving the check for the existence of the memcg cache
    into kmem_cache_create_memcg(), where it is done under the slab_mutex,
    and making it return NULL if the cache already exists.

    A similar race is possible when destroying a memcg cache (see
    kmem_cache_destroy()). Since memcg_unregister_cache(), which clears the
    pointer in the memcg_caches array, is called w/o protection, we can race
    with memcg_update_cache_size() and omit clearing the pointer. Therefore
    memcg_unregister_cache() should be moved before we release the
    slab_mutex.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • All caches of the same memory cgroup are linked in the memcg_slab_caches
    list via kmem_cache::memcg_params::list. This list is traversed, for
    example, when we read memory.kmem.slabinfo.

    Since the list actually consists of memcg_cache_params objects, we have
    to convert an element of the list to a kmem_cache object using
    memcg_params_to_cache(), which obtains the pointer to the cache from the
    memcg_params::memcg_caches array of the corresponding root cache. That
    said the pointer to a kmem_cache in its parent's memcg_params must be
    initialized before adding the cache to the list, and cleared only after
    it has been unlinked. Currently it is vice-versa, which can result in a
    NULL ptr dereference while traversing the memcg_slab_caches list. This
    patch restores the correct order.
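
    A sketch of the corrected ordering in the two helpers (locking omitted;
    the field names follow the description above):

    /* register: make the cache reachable from its parent before linking it */
    root->memcg_params->memcg_caches[id] = s;
    list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);

    /* unregister: unlink first, only then clear the parent's pointer */
    list_del(&s->memcg_params->list);
    root->memcg_params->memcg_caches[id] = NULL;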

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Each root kmem_cache has pointers to per-memcg caches stored in its
    memcg_params::memcg_caches array. Whenever we want to allocate a slab
    for a memcg, we access this array to get per-memcg cache to allocate
    from (see memcg_kmem_get_cache()). The access must be lock-free for
    performance reasons, so we should use barriers to assert the kmem_cache
    is up-to-date.

    First, we should place a write barrier immediately before setting the
    pointer to it in the memcg_caches array in order to make sure nobody
    will see a partially initialized object. Second, we should issue a read
    barrier before dereferencing the pointer to conform to the write
    barrier.

    However, currently the barrier usage looks rather strange. We have a
    write barrier *after* setting the pointer and a read barrier *before*
    reading the pointer, which is incorrect. This patch fixes this.
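
    A sketch of the corrected pairing (the writer publishes a fully
    initialized cache, the reader consumes it):

    /* writer, e.g. memcg_register_cache() */
    smp_wmb();                              /* order cache initialization before the store */
    root->memcg_params->memcg_caches[id] = s;

    /* reader, e.g. cache_from_memcg_idx() */
    cachep = root->memcg_params->memcg_caches[idx];
    smp_read_barrier_depends();             /* pairs with the writer's smp_wmb() */
    return cachep;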

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we have rather a messy function set relating to per-memcg
    kmem cache initialization/destruction.

    Per-memcg caches are created in memcg_create_kmem_cache(). This
    function calls kmem_cache_create_memcg() to allocate and initialize a
    kmem cache and then "registers" the new cache in the
    memcg_params::memcg_caches array of the parent cache.

    During its work-flow, kmem_cache_create_memcg() executes the following
    memcg-related functions:

    - memcg_alloc_cache_params(), to initialize memcg_params of the newly
    created cache;
    - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
    list.

    On the other hand, kmem_cache_destroy() called on a cache destruction
    only calls memcg_release_cache(), which does all the work: it cleans the
    reference to the cache in its parent's memcg_params::memcg_caches,
    removes the cache from the memcg_slab_caches list, and frees
    memcg_params.

    Such an inconsistency between destruction and initialization paths makes
    the code difficult to read, so let's clean this up a bit.

    This patch moves all the code relating to registration of per-memcg
    caches (adding to memcg list, setting the pointer to a cache from its
    parent) to the newly created memcg_register_cache() and
    memcg_unregister_cache() functions making the initialization and
    destruction paths look symmetrical.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We do not free the cache's memcg_params if __kmem_cache_create fails.
    Fix this.

    Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
    because it actually does not register the cache anywhere, but simply
    initializes kmem_cache::memcg_params.
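
    The error path then becomes symmetrical with the allocation; roughly (a
    freeing counterpart, called memcg_free_cache_params() here, is assumed):

    err = memcg_alloc_cache_params(memcg, s, parent_cache);
    if (err)
            goto out_free_cache;

    err = __kmem_cache_create(s, flags);
    if (err)
            goto out_free_cache;            /* previously leaked memcg_params */

    /* ... */

    out_free_cache:
            memcg_free_cache_params(s);
            kmem_cache_free(kmem_cache, s);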

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
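
    A minimal sketch of such a macro (the exact signature of the page dump
    helper may differ):

    #define VM_BUG_ON_PAGE(cond, page)                              \
    do {                                                            \
            if (unlikely(cond)) {                                   \
                    dump_page(page, #cond);  /* dump before dying */\
                    BUG();                                          \
            }                                                       \
    } while (0)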

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
    The vmalloc was introduced by 33327948782b ("memcgroup: use vmalloc for
    mem_cgroup allocation"), because at that time MAX_NUMNODES was used for
    defining the per-node array in the mem_cgroup structure, so the
    structure could be huge even if the system had only one NUMA node.

    The situation was significantly improved by commit 45cf7ebd5a03 ("memcg:
    reduce the size of struct memcg 244-fold"), which made the size of the
    mem_cgroup structure calculated dynamically depending on the real number
    of NUMA nodes installed on the system (nr_node_ids), so now there is no
    point in using vmalloc here: the structure is allocated rarely and on
    most systems its size is about 1K.
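
    A sketch of the allocation after the change (sized by nr_node_ids as
    described above):

    static struct mem_cgroup *mem_cgroup_alloc(void)
    {
            size_t size = sizeof(struct mem_cgroup);

            size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);

            /* about 1K on most systems, so plain kzalloc() is fine */
            return kzalloc(size, GFP_KERNEL);
    }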

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov