15 Nov, 2014

1 commit

  • commit 2f7dd7a4100ad4affcb141605bef178ab98ccb18 upstream.

    The cgroup iterators yield css objects that have not yet gone through
    css_online(), but they are not complete memcgs at this point and so the
    memcg iterators should not return them. Commit d8ad30559715 ("mm/memcg:
    iteration skip memcgs not yet fully initialized") set out to implement
    exactly this, but it uses CSS_ONLINE, a cgroup-internal flag that does
    not meet the ordering requirements for memcg, and so the iterator may
    skip over initialized groups, or return partially initialized memcgs.

    The cgroup core can not reasonably provide a clear answer on whether the
    object around the css has been fully initialized, as that depends on
    controller-specific locking and lifetime rules. Thus, introduce a
    memcg-specific flag that is set after the memcg has been initialized in
    css_online(), and read before mem_cgroup_iter() callers access the memcg
    members.
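
    In code terms, a minimal sketch of that pattern (simplified; the
    iterator-side helper below is illustrative, not the exact upstream diff)
    is to publish the flag with release semantics once css_online() has
    finished, and to read it with acquire semantics before the iterator hands
    the memcg out:

    static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_css(css);

            /* ... set up per-zone info, res counters, etc. ... */

            /* Make all of the above visible before the flag itself. */
            smp_store_release(&memcg->initialized, 1);
            return 0;
    }

    /* Iterator side: only hand out memcgs whose initialization is
     * guaranteed to be visible to this CPU. */
    static bool mem_cgroup_iter_visible(struct mem_cgroup *memcg)
    {
            return smp_load_acquire(&memcg->initialized);
    }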

    Signed-off-by: Johannes Weiner
    Cc: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     

08 Aug, 2014

1 commit

  • commit 2bcf2e92c3918ce62ab4e934256e47e9a16d19c3 upstream.

    Paul Furtado has reported the following GPF:

    general protection fault: 0000 [#1] SMP
    Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront
    CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1
    task: ffff8801cfe8f170 ti: ffff8801d2ec4000 task.ti: ffff8801d2ec4000
    RIP: e030:mem_cgroup_oom_synchronize+0x140/0x240
    RSP: e02b:ffff8801d2ec7d48 EFLAGS: 00010283
    RAX: 0000000000000001 RBX: ffff88009d633800 RCX: 000000000000000e
    RDX: fffffffffffffffe RSI: ffff88009d630200 RDI: ffff88009d630200
    RBP: ffff8801d2ec7da8 R08: 0000000000000012 R09: 00000000fffffffe
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff88009d633800
    R13: ffff8801d2ec7d48 R14: dead000000100100 R15: ffff88009d633a30
    FS: 00007f1748bb4700(0000) GS:ffff8801def80000(0000) knlGS:0000000000000000
    CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f4110300308 CR3: 00000000c05f7000 CR4: 0000000000002660
    Call Trace:
    pagefault_out_of_memory+0x18/0x90
    mm_fault_error+0xa9/0x1a0
    __do_page_fault+0x478/0x4c0
    do_page_fault+0x2c/0x40
    page_fault+0x28/0x30
    Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
    RIP mem_cgroup_oom_synchronize+0x140/0x240

    Commit fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and
    wakeup") has moved mem_cgroup_oom_notify outside of memcg_oom_lock
    assuming it is protected by the hierarchical OOM-lock.

    Although this is true for the notification part, the protection does not
    cover unregistration of events, which can now happen in parallel, so
    mem_cgroup_oom_notify can see an already unlinked and/or freed
    mem_cgroup_eventfd_list.

    Fix this by using memcg_oom_lock also in mem_cgroup_oom_notify.
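
    A minimal sketch of that change (simplified; mem_cgroup_oom_notify_cb()
    signals the eventfds registered for a single memcg):

    static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
    {
            struct mem_cgroup *iter;

            /*
             * Unregistration takes the same lock, so the walk below can no
             * longer observe a half-unlinked or freed entry.
             */
            spin_lock(&memcg_oom_lock);
            for_each_mem_cgroup_tree(iter, memcg)
                    mem_cgroup_oom_notify_cb(iter);
            spin_unlock(&memcg_oom_lock);
    }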

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=80881

    Fixes: fb2a6fc56be6 (mm: memcg: rework and document OOM waiting and wakeup)
    Signed-off-by: Michal Hocko
    Reported-by: Paul Furtado
    Tested-by: Paul Furtado
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

04 Mar, 2014

2 commits

  • Sometimes the cleanup after memcg hierarchy testing gets stuck in
    mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

    There may turn out to be several causes, but a major cause is this: the
    workitem to offline the parent can get run before the workitem to offline
    the child;
    parent's mem_cgroup_reparent_charges() circles around waiting for the
    child's pages to be reparented to its lrus, but it's holding
    cgroup_mutex which prevents the child from reaching its
    mem_cgroup_reparent_charges().

    Further testing showed that an ordered workqueue for cgroup_destroy_wq
    is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
    stage on the way can mess up the order before reaching the workqueue.

    Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
    all its children (and grandchildren, in the correct order) to have their
    charges reparented first.
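
    A sketch of that ordering (simplified; the rest of the offline work is
    elided):

    static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
    {
            struct cgroup_subsys_state *iter;

            /*
             * css_for_each_descendant_post() walks children before their
             * parents and visits @css itself last, so every child's pages
             * are reparented before its parent starts waiting for them.
             */
            css_for_each_descendant_post(iter, css)
                    mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));

            /* ... remaining offline work ... */
    }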

    Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
    Signed-off-by: Filipe Brandenburger
    Signed-off-by: Hugh Dickins
    Reviewed-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: [v3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Filipe Brandenburger
     
  • Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in
    mem_cgroup_iter") got the interaction with the commit a few before it
    d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully
    initialized") slightly wrong, and we didn't notice at the time.

    It's elusive, and harder to hit than the original, but for a couple of
    days before rc1 I several times saw an endless loop similar to the one
    supposedly being fixed.

    This time it was a tighter loop in __mem_cgroup_iter_next(): we can get
    here when our root has already been offlined, and the ordering of
    conditions was such that we then just cycled around forever.

    Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter").
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

26 Feb, 2014

1 commit

  • Kirill has reported the following:

    Task in /test killed as a result of limit of /test
    memory: usage 10240kB, limit 10240kB, failcnt 51
    memory+swap: usage 10240kB, limit 10240kB, failcnt 0
    kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
    Memory cgroup stats for /test:

    BUG: sleeping function called from invalid context at kernel/cpu.c:68
    in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
    2 locks held by memcg_test/66:
    #0: (memcg_oom_lock#2){+.+...}, at: [] pagefault_out_of_memory+0x14/0x90
    #1: (oom_info_lock){+.+...}, at: [] mem_cgroup_print_oom_info+0x2a/0x390
    CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
    Call Trace:
    __might_sleep+0x16a/0x210
    get_online_cpus+0x1c/0x60
    mem_cgroup_read_stat+0x27/0xb0
    mem_cgroup_print_oom_info+0x260/0x390
    dump_header+0x88/0x251
    ? trace_hardirqs_on+0xd/0x10
    oom_kill_process+0x258/0x3d0
    mem_cgroup_oom_synchronize+0x656/0x6c0
    ? mem_cgroup_charge_common+0xd0/0xd0
    pagefault_out_of_memory+0x14/0x90
    mm_fault_error+0x91/0x189
    __do_page_fault+0x48e/0x580
    do_page_fault+0xe/0x10
    page_fault+0x22/0x30

    which complains that mem_cgroup_read_stat cannot be called from an atomic
    context but mem_cgroup_print_oom_info takes a spinlock. Change
    oom_info_lock to a mutex.

    This was introduced by 947b3dd1a84b ("memcg, oom: lock
    mem_cgroup_print_oom_info").

    Signed-off-by: Michal Hocko
    Reported-by: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

31 Jan, 2014

1 commit

  • Commit 842e2873697e ("memcg: get rid of kmem_cache_dup()") introduced a
    mutex for memcg_create_kmem_cache() to protect the tmp_name buffer that
    holds the memcg name. It failed to unlock the mutex if this buffer
    could not be allocated.

    This patch fixes the issue by appropriately unlocking the mutex if the
    allocation fails.
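
    A sketch of the corrected error path (simplified; declaring the mutex and
    tmp_name buffer as local statics here is illustrative):

    static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
                                                      struct kmem_cache *cachep)
    {
            struct kmem_cache *new_cachep = NULL;
            static char *tmp_name;
            static DEFINE_MUTEX(mutex);     /* protects tmp_name */

            mutex_lock(&mutex);
            if (!tmp_name) {
                    tmp_name = kmalloc(PATH_MAX, GFP_KERNEL);
                    if (!tmp_name)
                            goto out;       /* previously returned with the mutex held */
            }

            /* ... build the name and create the per-memcg cache ... */

    out:
            mutex_unlock(&mutex);
            return new_cachep;
    }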

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Glauber Costa
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

24 Jan, 2014

17 commits

  • Signed-off-by: Vladimir Davydov
    Reviewed-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Commit 19f39402864e ("memcg: simplify mem_cgroup_iter") has reorganized
    mem_cgroup_iter code in order to simplify it. A part of that change was
    dropping an optimization which didn't call css_tryget on the root of the
    walked tree. The patch however didn't change the css_put part in
    mem_cgroup_iter which excludes root.

    This wasn't an issue at the time because __mem_cgroup_iter_next bailed
    out for root early without taking a reference as cgroup iterators
    (css_next_descendant_pre) didn't visit root themselves.

    Nevertheless, cgroup iterators were reworked to visit the root by commit
    bd8815a6d802 ("cgroup: make css_for_each_descendant() and friends
    include the origin css in the iteration") after the root bypass had been
    dropped from __mem_cgroup_iter_next. This means that css_put is not
    called for the root, so the css, along with the mem_cgroup and other
    cgroup-internal objects tied to the css lifetime, is never freed.

    Fix the issue by reintroducing the root check in __mem_cgroup_iter_next
    and not taking a css reference for it.

    This reference-counting change also protects us from another issue, an
    endless loop reported by Hugh Dickins: when reclaim races with root
    removal, the css_tryget called internally by the iterator would fail,
    there would be no other nodes to visit, so __mem_cgroup_iter_next would
    return NULL, and mem_cgroup_iter would interpret that as "start looping
    from root again", looping forever internally.

    Signed-off-by: Michal Hocko
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Hugh has reported an endless loop when the hardlimit reclaim sees the
    same group all the time. This might happen when the reclaim races with
    the memcg removal.

    shrink_zone
                                             [rmdir root]
      mem_cgroup_iter(root, NULL, reclaim)
        // prev = NULL
        rcu_read_lock()
        mem_cgroup_iter_load
          last_visited = iter->last_visited   // gets root || NULL
          css_tryget(last_visited)            // failed
          last_visited = NULL             [1]
        memcg = root = __mem_cgroup_iter_next(root, NULL)
        mem_cgroup_iter_update
          iter->last_visited = root;
          reclaim->generation = iter->generation

      mem_cgroup_iter(root, root, reclaim)
        // prev = root
        rcu_read_lock
        mem_cgroup_iter_load
          last_visited = iter->last_visited   // gets root
          css_tryget(last_visited)            // failed
        [1]

    The issue seemed to be introduced by commit 5f5781619718 ("memcg: relax
    memcg iter caching") which has replaced unconditional css_get/css_put by
    css_tryget/css_put for the cached iterator.

    This patch fixes the issue by skipping css_tryget on the root of the
    tree walk in mem_cgroup_iter_load and, symmetrically, by not releasing
    it in mem_cgroup_iter_update.
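
    A sketch of the load side after the fix (simplified; the last_dead_count
    bookkeeping is shown only as far as needed):

    static struct mem_cgroup *
    mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter,
                         struct mem_cgroup *root, int *sequence)
    {
            struct mem_cgroup *last_visited = NULL;

            *sequence = atomic_read(&root->dead_count);
            if (iter->last_dead_count == *sequence) {
                    smp_rmb();
                    last_visited = iter->last_visited;

                    /* The root is pinned by the caller; only tryget others. */
                    if (last_visited && last_visited != root &&
                        !css_tryget(&last_visited->css))
                            last_visited = NULL;
            }
            return last_visited;
    }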

    Signed-off-by: Michal Hocko
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When two threads have the same badness score, it's preferable to kill
    the thread group leader so that the actual process name is printed to
    the kernel log rather than the thread group name which may be shared
    amongst several processes.

    This was the behavior when select_bad_process() used to do
    for_each_process(), but it now iterates threads instead and leads to
    ambiguity.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It is surprising that the mem_cgroup iterator can return memcgs which
    have not yet been fully initialized. By accident (or trial and error?)
    this appears not to present an actual problem; but it may be better to
    prevent such surprises, by skipping memcgs not yet online.

    Signed-off-by: Hugh Dickins
    Cc: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Shorten mem_cgroup_reclaim_iter.last_dead_count from unsigned long to
    int: it's assigned from an int and compared with an int, and adjacent to
    an unsigned int: so there's no point to it being unsigned long, which
    wasted 104 bytes in every mem_cgroup_per_zone.

    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently we take both the memcg_create_mutex and the set_limit_mutex
    when we enable kmem accounting for a memory cgroup, which makes kmem
    activation events serialize with both memcg creations and other memcg
    limit updates (memory.limit, memory.memsw.limit). However, there is no
    point in such strict synchronization rules there.

    First, the set_limit_mutex was introduced to keep the memory.limit and
    memory.memsw.limit values in sync. Since memory.kmem.limit can be set
    independently of them, it is better to introduce a separate mutex to
    synchronize against concurrent kmem limit updates.

    Second, we take the memcg_create_mutex in order to make sure all
    children of this memcg will be kmem-active as well. To achieve that, it
    is enough to hold this mutex only around the memcg_has_children() check.
    This guarantees that if a child is added after we checked that the memcg
    has no children, the newly added cgroup will see its parent as
    kmem-active (provided the activation succeeded) and will activate kmem
    accounting for itself.

    This patch simplifies the locking rules of memcg_update_kmem_limit()
    according to these considerations.
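
    A sketch of the resulting flow (mutex and helper names are partly
    illustrative; error handling is simplified):

    static DEFINE_MUTEX(memcg_kmem_limit_mutex);

    static int memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
    {
            int ret;

            mutex_lock(&memcg_kmem_limit_mutex);
            if (!memcg_kmem_is_active(memcg)) {
                    /*
                     * Activation must not race with the addition of
                     * children; hold memcg_create_mutex just long enough
                     * to check for them.
                     */
                    mutex_lock(&memcg_create_mutex);
                    ret = memcg_has_children(memcg) ? -EBUSY :
                            memcg_activate_kmem(memcg, val);
                    mutex_unlock(&memcg_create_mutex);
            } else {
                    ret = res_counter_set_limit(&memcg->kmem, val);
            }
            mutex_unlock(&memcg_kmem_limit_mutex);
            return ret;
    }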

    [vdavydov@parallels.com: fix uninitialized var warning]
    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently we have two state bits in mem_cgroup::kmem_account_flags
    regarding kmem accounting activation, ACTIVATED and ACTIVE. We start
    kmem accounting only if both flags are set (memcg_can_account_kmem()),
    plus throughout the code there are several places where we check only
    the ACTIVE flag, but we never check the ACTIVATED flag alone. These
    flags are both set from memcg_update_kmem_limit() under the
    set_limit_mutex, the ACTIVE flag always being set after ACTIVATED, and
    they never get cleared. That said, checking whether both flags are set
    is equivalent to checking only the ACTIVE flag, and since there are no
    checks of the ACTIVATED flag on its own, we can safely remove it and
    nothing will change.

    Let's try to understand what was the reason for introducing these flags.
    The purpose of the ACTIVE flag is clear - it states that kmem should be
    accounting to the cgroup. The only requirement for it is that it should
    be set after we have fully initialized kmem accounting bits for the
    cgroup and patched all static branches relating to kmem accounting.
    Since we always check if static branch is enabled before actually
    considering if we should account (otherwise we wouldn't benefit from
    static branching), this guarantees us that we won't skip a commit or
    uncharge after a charge due to an unpatched static branch.

    Now let's move on to the ACTIVATED bit. As I proved in the beginning of
    this message, it is absolutely useless, and removing it will change
    nothing. So what was the reason for introducing it?

    The ACTIVATED flag was introduced by commit a8964b9b84f9 ("memcg: use
    static branches when code not in use") in order to guarantee that
    static_key_slow_inc(&memcg_kmem_enabled_key) would be called only once
    for each memory cgroup when its kmem accounting was activated. The
    point was that at that time the memcg_update_kmem_limit() function's
    work-flow looked like this:

    bool must_inc_static_branch = false;

    cgroup_lock();
    mutex_lock(&set_limit_mutex);
    if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
            /* The kmem limit is set for the first time */
            ret = res_counter_set_limit(&memcg->kmem, val);

            memcg_kmem_set_activated(memcg);
            must_inc_static_branch = true;
    } else
            ret = res_counter_set_limit(&memcg->kmem, val);
    mutex_unlock(&set_limit_mutex);
    cgroup_unlock();

    if (must_inc_static_branch) {
            /* We can't do this under cgroup_lock */
            static_key_slow_inc(&memcg_kmem_enabled_key);
            memcg_kmem_set_active(memcg);
    }

    So without the ACTIVATED flag we could race with other threads trying to
    set the limit and increment the static-branching ref-counter more than
    once. Today we call the whole memcg_update_kmem_limit() function under
    the set_limit_mutex, so this race is impossible.

    Now that we understand why the ACTIVATED bit was introduced, why we no
    longer need it, and that removing it will change nothing anyway, let's
    get rid of it.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We relocate root cache's memcg_params whenever we need to grow the
    memcg_caches array to accommodate all kmem-active memory cgroups.
    Currently on relocation we free the old version immediately, which can
    lead to use-after-free, because the memcg_caches array is accessed
    lock-free (see cache_from_memcg_idx()). This patch fixes this by making
    memcg_params RCU-protected for root caches.
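
    A sketch of the pattern (simplified; whether the old copy is freed via
    synchronize_rcu() or a deferred callback is an implementation detail):

    struct kmem_cache *cache_from_memcg_idx(struct kmem_cache *s, int idx)
    {
            struct memcg_cache_params *params;
            struct kmem_cache *cachep = NULL;

            rcu_read_lock();
            params = rcu_dereference(s->memcg_params);
            if (params)
                    cachep = params->memcg_caches[idx];
            rcu_read_unlock();
            return cachep;
    }

    /* Grow side: publish the new array first, free the old copy only after
     * every pre-existing reader is done with it. */
    static void memcg_params_replace(struct kmem_cache *s,
                                     struct memcg_cache_params *new_params,
                                     struct memcg_cache_params *old_params)
    {
            rcu_assign_pointer(s->memcg_params, new_params);
            synchronize_rcu();
            kfree(old_params);
    }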

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • kmem_cache_dup() is only called from memcg_create_kmem_cache(). The
    latter, in fact, does nothing besides this, so let's fold
    kmem_cache_dup() into memcg_create_kmem_cache().

    This patch also makes the memcg_cache_mutex private to
    memcg_create_kmem_cache(), because it is not used anywhere else.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We obtain a per-memcg cache from a root kmem_cache by dereferencing an
    entry of the root cache's memcg_params::memcg_caches array. If we find
    no cache for a memcg there on allocation, we initiate the memcg cache
    creation (see memcg_kmem_get_cache()). The cache creation proceeds
    asynchronously in memcg_create_kmem_cache() in order to avoid lock
    clashes, so there can be several threads trying to create the same
    kmem_cache concurrently, but only one of them may succeed. However, due
    to a race in the code, it is not always true. The point is that the
    memcg_caches array can be relocated when we activate kmem accounting for
    a memcg (see memcg_update_all_caches(), memcg_update_cache_size()). If
    memcg_update_cache_size() and memcg_create_kmem_cache() proceed
    concurrently as described below, we can leak a kmem_cache.

    Assume two threads schedule creation of the same kmem_cache. One of them
    successfully creates it. The other should then fail, but if
    memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
    follows, it won't:

    memcg_create_kmem_cache()                  memcg_update_cache_size()
    (called w/o mutexes held)                  (called with slab_mutex,
                                                set_limit_mutex held)
    -------------------------                  -------------------------

    mutex_lock(&memcg_cache_mutex)

                                               s->memcg_params=kzalloc(...)

    new_cachep=cache_from_memcg_idx(cachep,idx)
    // new_cachep==NULL => proceed to creation

                                               s->memcg_params->memcg_caches[i]
                                                   =cur_params->memcg_caches[i]

    // kmem_cache_create_memcg takes slab_mutex
    // so we will hang around until
    // memcg_update_cache_size finishes, but
    // nothing will prevent it from succeeding so
    // memcg_caches[idx] will be overwritten in
    // memcg_register_cache!

    new_cachep = kmem_cache_create_memcg(...)
    mutex_unlock(&memcg_cache_mutex)

    Let's fix this by moving the check for the existence of the memcg cache
    into kmem_cache_create_memcg(), where it is performed under the
    slab_mutex, and making it return NULL if the cache already exists.

    A similar race is possible when destroying a memcg cache (see
    kmem_cache_destroy()). Since memcg_unregister_cache(), which clears the
    pointer in the memcg_caches array, is called w/o protection, we can race
    with memcg_update_cache_size() and omit clearing the pointer. Therefore
    memcg_unregister_cache() should be moved before we release the
    slab_mutex.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • All caches of the same memory cgroup are linked in the memcg_slab_caches
    list via kmem_cache::memcg_params::list. This list is traversed, for
    example, when we read memory.kmem.slabinfo.

    Since the list actually consists of memcg_cache_params objects, we have
    to convert an element of the list to a kmem_cache object using
    memcg_params_to_cache(), which obtains the pointer to the cache from the
    memcg_params::memcg_caches array of the corresponding root cache. That
    means the pointer to a kmem_cache in its parent's memcg_params must be
    initialized before adding the cache to the list, and cleared only after
    it has been unlinked. Currently the order is the reverse, which can
    result in a NULL pointer dereference while traversing the
    memcg_slab_caches list. This patch restores the correct order.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Each root kmem_cache has pointers to per-memcg caches stored in its
    memcg_params::memcg_caches array. Whenever we want to allocate a slab
    for a memcg, we access this array to get per-memcg cache to allocate
    from (see memcg_kmem_get_cache()). The access must be lock-free for
    performance reasons, so we should use barriers to assert the kmem_cache
    is up-to-date.

    First, we should place a write barrier immediately before setting the
    pointer to it in the memcg_caches array in order to make sure nobody
    will see a partially initialized object. Second, we should issue a read
    barrier before dereferencing the pointer to conform to the write
    barrier.

    However, currently the barrier usage looks rather strange. We have a
    write barrier *after* setting the pointer and a read barrier *before*
    reading the pointer, which is incorrect. This patch fixes this.
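
    A sketch of the corrected pairing (function names are illustrative):

    static void memcg_publish_cache(struct kmem_cache *root,
                                    struct kmem_cache *cachep, int idx)
    {
            /* Make the fully initialized cache visible before the pointer. */
            smp_wmb();
            root->memcg_params->memcg_caches[idx] = cachep;
    }

    static struct kmem_cache *memcg_lookup_cache(struct kmem_cache *root, int idx)
    {
            struct kmem_cache *cachep;

            cachep = root->memcg_params->memcg_caches[idx];
            /* Pairs with the smp_wmb() above; order this load before any
             * dereference of *cachep by the caller. */
            smp_rmb();
            return cachep;
    }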

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we have rather a messy function set relating to per-memcg
    kmem cache initialization/destruction.

    Per-memcg caches are created in memcg_create_kmem_cache(). This
    function calls kmem_cache_create_memcg() to allocate and initialize a
    kmem cache and then "registers" the new cache in the
    memcg_params::memcg_caches array of the parent cache.

    During its work-flow, kmem_cache_create_memcg() executes the following
    memcg-related functions:

    - memcg_alloc_cache_params(), to initialize memcg_params of the newly
    created cache;
    - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
    list.

    On the other hand, kmem_cache_destroy() called on a cache destruction
    only calls memcg_release_cache(), which does all the work: it cleans the
    reference to the cache in its parent's memcg_params::memcg_caches,
    removes the cache from the memcg_slab_caches list, and frees
    memcg_params.

    Such an inconsistency between the destruction and initialization paths
    makes the code difficult to read, so let's clean this up a bit.

    This patch moves all the code relating to registration of per-memcg
    caches (adding to memcg list, setting the pointer to a cache from its
    parent) to the newly created memcg_register_cache() and
    memcg_unregister_cache() functions making the initialization and
    destruction paths look symmetrical.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We do not free the cache's memcg_params if __kmem_cache_create fails.
    Fix this.

    Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
    because it actually does not register the cache anywhere, but simply
    initializes kmem_cache::memcg_params.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
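
    The macro boils down to something like the following sketch (the
    upstream definition lives in include/linux/mmdebug.h and dump_page()'s
    exact arguments may differ):

    #ifdef CONFIG_DEBUG_VM
    #define VM_BUG_ON_PAGE(cond, page)                                      \
            do {                                                            \
                    if (unlikely(cond)) {                                   \
                            dump_page(page);   /* dump the struct page */   \
                            BUG();                                          \
                    }                                                       \
            } while (0)
    #else
    #define VM_BUG_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
    #endif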

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • The vmalloc was introduced by 33327948782b ("memcgroup: use vmalloc for
    mem_cgroup allocation"), because at that time MAX_NUMNODES was used for
    defining the per-node array in the mem_cgroup structure so that the
    structure could be huge even if the system had only one NUMA node.

    The situation was significantly improved by commit 45cf7ebd5a03 ("memcg:
    reduce the size of struct memcg 244-fold"), which made the size of the
    mem_cgroup structure calculated dynamically depending on the real number
    of NUMA nodes installed on the system (nr_node_ids), so now there is no
    point in using vmalloc here: the structure is allocated rarely and on
    most systems its size is about 1K.
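
    A sketch of what the allocation then looks like (helper names are
    illustrative):

    static struct mem_cgroup *mem_cgroup_alloc(void)
    {
            /* struct mem_cgroup ends in nr_node_ids per-node pointers */
            size_t size = sizeof(struct mem_cgroup) +
                          nr_node_ids * sizeof(struct mem_cgroup_per_node *);

            return kzalloc(size, GFP_KERNEL);       /* ~1K on most systems */
    }

    static void mem_cgroup_free(struct mem_cgroup *memcg)
    {
            kfree(memcg);
    }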

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

22 Jan, 2014

5 commits

  • Merge first patch-bomb from Andrew Morton:

    - a couple of misc things

    - inotify/fsnotify work from Jan

    - ocfs2 updates (partial)

    - about half of MM

    * emailed patches from Andrew Morton : (117 commits)
    mm/migrate: remove unused function, fail_migrate_page()
    mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
    mm/migrate: correct failure handling if !hugepage_migration_support()
    mm/migrate: add comment about permanent failure path
    mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
    mm: compaction: reset scanner positions immediately when they meet
    mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
    mm: compaction: detect when scanners meet in isolate_freepages
    mm: compaction: reset cached scanner pfn's before reading them
    mm: compaction: encapsulate defer reset logic
    mm: compaction: trace compaction begin and end
    memcg, oom: lock mem_cgroup_print_oom_info
    sched: add tracepoints related to NUMA task migration
    mm: numa: do not automatically migrate KSM pages
    mm: numa: trace tasks that fail migration due to rate limiting
    mm: numa: limit scope of lock for NUMA migrate rate limiting
    mm: numa: make NUMA-migrate related functions static
    lib/show_mem.c: show num_poisoned_pages when oom
    mm/hwpoison: add '#' to hwpoison_inject
    mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "The bulk of changes are cleanups and preparations for the upcoming
    kernfs conversion.

    - cgroup_event mechanism which is and will be used only by memcg is
    moved to memcg.

    - pidlist handling is updated so that it can be served by seq_file.

    Also, the list is not sorted if sane_behavior. cgroup
    documentation explicitly states that the file is not sorted but it
    has been for quite some time.

    - All cgroup file handling now happens on top of seq_file. This is
    to prepare for kernfs conversion. In addition, all operations are
    restructured so that they map 1-1 to kernfs operations.

    - Other cleanups and low-pri fixes"

    * 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits)
    cgroup: trivial style updates
    cgroup: remove stray references to css_id
    doc: cgroups: Fix typo in doc/cgroups
    cgroup: fix fail path in cgroup_load_subsys()
    cgroup: fix missing unlock on error in cgroup_load_subsys()
    cgroup: remove for_each_root_subsys()
    cgroup: implement for_each_css()
    cgroup: factor out cgroup_subsys_state creation into create_css()
    cgroup: combine css handling loops in cgroup_create()
    cgroup: reorder operations in cgroup_create()
    cgroup: make for_each_subsys() useable under cgroup_root_mutex
    cgroup: css iterations and css_from_dir() are safe under cgroup_mutex
    cgroup: unify pidlist and other file handling
    cgroup: replace cftype->read_seq_string() with cftype->seq_show()
    cgroup: attach cgroup_open_file to all cgroup files
    cgroup: generalize cgroup_pidlist_open_file
    cgroup: unify read path so that seq_file is always used
    cgroup: unify cgroup_write_X64() and cgroup_write_string()
    cgroup: remove cftype->read(), ->read_map() and ->write()
    hugetlb_cgroup: convert away from cftype->read()
    ...

    Linus Torvalds
     
  • mem_cgroup_print_oom_info uses a static buffer (memcg_name) to store the
    name of the cgroup. This is not safe as pointed out by David Rientjes
    because memcg oom is locked only for its hierarchy and nothing prevents
    another parallel hierarchy from triggering oom as well and overwriting
    the already in-use buffer.

    This patch introduces oom_info_lock hidden inside
    mem_cgroup_print_oom_info which is held throughout the function. It
    makes access to memcg_name safe and, as a bonus, also prevents parallel
    memcg ooms from interleaving their statistics, which would otherwise
    make the printed data hard to analyze.

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This function is not used outside of memcontrol.c so make it static.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We should start kmem accounting for a memory cgroup only after both its
    kmem limit is set (KMEM_ACCOUNTED_ACTIVE) and related call sites are
    patched (KMEM_ACCOUNTED_ACTIVATED). Currently memcg_can_account_kmem()
    allows kmem accounting even if only one of the conditions is true. Fix
    it.

    This means that a page might get charged by memcg_kmem_newpage_charge
    which would see its static key patched already but
    memcg_kmem_commit_charge would still see it unpatched and so the charge
    won't be committed. The result would be charge inconsistency
    (page_cgroup not marked as PageCgroupUsed) and the charge would leak
    because __memcg_kmem_uncharge_pages would ignore it.

    [mhocko@suse.cz: augment changelog]
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

03 Jan, 2014

1 commit

  • The mem_cgroup structure contains nr_node_ids pointers to
    mem_cgroup_per_node objects, not the objects themselves.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Dec, 2013

3 commits

  • Commit 4942642080ea ("mm: memcg: handle non-error OOM situations more
    gracefully") allowed tasks that already entered a memcg OOM condition to
    bypass the memcg limit on subsequent allocation attempts hoping this
    would expedite finishing the page fault and executing the kill.

    David Rientjes is worried that this breaks memcg isolation guarantees,
    and since there is no evidence that the bypass actually speeds up fault
    processing, just change it so that these subsequent charge attempts fail
    outright. The notable exception is __GFP_NOFAIL charges, which are
    required to bypass the limit regardless.
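
    A sketch of the resulting behaviour at the end of the charge slowpath
    (helper name, labels and return conventions are illustrative):

    static int mem_cgroup_charge_slowpath_tail(gfp_t gfp_mask,
                                               struct mem_cgroup **memcgp,
                                               bool charge_failed)
    {
            if (!charge_failed)
                    return 0;

            /*
             * A task that already entered memcg OOM now fails outright
             * here; only __GFP_NOFAIL charges still bypass the limit.
             */
            if (!(gfp_mask & __GFP_NOFAIL)) {
                    *memcgp = NULL;
                    return -ENOMEM;
            }
            *memcgp = root_mem_cgroup;      /* charge lands in the root */
            return -EINTR;                  /* treated by callers as "bypassed" */
    }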

    Signed-off-by: Johannes Weiner
    Reported-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is a race condition between a memcg being torn down and a swapin
    triggered from a different memcg of a page that was recorded to belong
    to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension). The
    result is unreclaimable pages pointing to dead memcgs, which can lead to
    anything from endless loops in later memcg teardown (the page is charged
    to all hierarchical parents but is not on any LRU list) to crashes from
    following the dangling memcg pointer.

    Memcgs with tasks in them can not be torn down and usually charges don't
    show up in memcgs without tasks. Swapin with the CONFIG_MEMCG_SWAP
    extension is the notable exception because it charges the cgroup that
    was recorded as owner during swapout, which may be empty and in the
    process of being torn down when a task in another memcg triggers the
    swapin:

    teardown:                                 swapin:

                                              lookup_swap_cgroup_id()
                                              rcu_read_lock()
                                              mem_cgroup_lookup()
                                              css_tryget()
                                              rcu_read_unlock()
    disable css_tryget()
    call_rcu()
      offline_css()
        reparent_charges()
                                              res_counter_charge() (hierarchical!)
                                              css_put()
      css_free()
                                              pc->mem_cgroup = dead memcg
                                              add page to dead lru

    Add a final reparenting step into css_free() to make sure any such raced
    charges are moved out of the memcg before it's finally freed.
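
    A sketch of that final step (simplified; the rest of the free-path
    teardown is elided):

    static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_css(css);

            /*
             * A swapin racing with the offline-time reparenting may have
             * charged pages to this memcg afterwards; move them out one
             * last time before the structures go away.
             */
            mem_cgroup_reparent_charges(memcg);

            __mem_cgroup_free(memcg);
    }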

    In the longer term it would be cleaner to have the css_tryget() and the
    res_counter charge under the same RCU lock section so that the charge
    reparenting is deferred until the last charge whose tryget succeeded is
    visible. But this will require more invasive changes that will be
    harder to evaluate and backport into stable, so better defer them to a
    separate change set.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the
    allocator") started recognizing __GFP_NOFAIL in memory cgroups but
    forgot to disable the OOM killer.

    Any task that does not fail allocation will also not enter the OOM
    completion path. So don't declare an OOM state in this case or it'll be
    leaked and the task will be able to bypass the limit until the next
    userspace-triggered page fault cleans up the OOM state.
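
    A sketch of the idea (helper name illustrative): such charges simply
    never raise the memcg OOM machinery in the first place.

    static bool mem_cgroup_charge_may_oom(gfp_t gfp_mask)
    {
            /* A __GFP_NOFAIL charge bypasses the limit instead of OOMing. */
            return !(gfp_mask & __GFP_NOFAIL);
    }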

    Reported-by: William Dauchy
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: [3.12.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

06 Dec, 2013

2 commits

  • In preparation of conversion to kernfs, cgroup file handling is
    updated so that it can be easily mapped to kernfs. This patch
    replaces cftype->read_seq_string() with cftype->seq_show() which is
    not limited to single_open() operation and will map directly to the
    kernfs seq_file interface.

    The conversions are mechanical. As ->seq_show() doesn't have @css and
    @cft, the functions which make use of them are converted to use
    seq_css() and seq_cft() respectively. On several occasions, e.g. if
    it has seq_string in its name, the function name is updated to fit the
    new method better.

    This patch does not introduce any behavior changes.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Michal Hocko
    Acked-by: Daniel Wagner
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Neil Horman

    Tejun Heo
     
  • In preparation of conversion to kernfs, cgroup file handling is being
    consolidated so that it can be easily mapped to the seq_file based
    interface of kernfs.

    cftype->read_map() doesn't add any value and is being replaced with
    ->read_seq_string(), and all users of cftype->read() can be easily
    served, usually better, by seq_file and other methods.

    Update mem_cgroup_read() to return u64 instead of printing itself and
    rename it to mem_cgroup_read_u64(), and update
    mem_cgroup_oom_control_read() to use ->read_seq_string() instead of
    ->read_map().

    This patch doesn't make any visible behavior changes.
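
    For orientation, a sketch of the converted shape (simplified; the cftype
    entry shown is illustrative):

    static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
                                   struct cftype *cft)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_css(css);

            /* ... pick usage/limit/failcnt based on cft->private ... */
            return res_counter_read_u64(&memcg->res, RES_USAGE);
    }

    static struct cftype mem_cgroup_files_sketch[] = {
            {
                    .name = "usage_in_bytes",
                    .read_u64 = mem_cgroup_read_u64, /* core prints the value */
            },
            { },    /* terminate */
    };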

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

23 Nov, 2013

6 commits

  • Merge v3.12 based patch series to move cgroup_event implementation to
    memcg into for-3.14. The following two commits cause a conflict in
    kernel/cgroup.c

    2ff2a7d03bbe4 ("cgroup: kill css_id")
    79bd9814e5ec9 ("cgroup, memcg: move cgroup_event implementation to memcg")

    Each patch removes a struct definition from kernel/cgroup.c. As the
    two are adjacent, they cause a context conflict. Easily resolved by
    removing both structs.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • cgroup_event is only available in memcg now. Let's brand it that way.
    While at it, add a comment encouraging deprecation of the feature and
    remove the respective section from cgroup documentation.

    This patch is cosmetic.

    v3: Typo update as per Li Zefan.

    v2: Index in cgroups.txt updated accordingly as suggested by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko

    Tejun Heo
     
  • cgroup_event is now memcg specific. Replace cgroup_event->css with
    ->memcg and convert [un]register_event() callbacks to take mem_cgroup
    pointer instead of cgroup_subsys_state one. This simplifies the code
    slightly and makes css_to_vmpressure() unnecessary which is removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko

    Tejun Heo
     
  • The only use of cgroup_event->cft is distinguishing "usage_in_bytes"
    and "memsw.usgae_in_bytes" for mem_cgroup_usage_[un]register_event(),
    which can be done by adding an explicit argument to the function and
    implementing two wrappers so that the two cases can be distinguished
    from the function alone.

    Remove cgroup_event->cft and the related code including
    [un]register_events() methods.
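
    A sketch of the two thin wrappers described above (signatures follow the
    register_event() callback shape introduced earlier in the series,
    simplified):

    static int mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
                                               struct eventfd_ctx *eventfd,
                                               const char *args)
    {
            return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEM);
    }

    static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg,
                                                 struct eventfd_ctx *eventfd,
                                                 const char *args)
    {
            return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP);
    }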

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko

    Tejun Heo
     
  • cgroup_event is being moved from cgroup core to memcg and the
    implementation is already moved by the previous patch. This patch
    moves the data fields and callbacks.

    * cgroup->event_list[_lock] are moved to mem_cgroup.

    * cftype->[un]register_event() are moved to cgroup_event. This makes
    it impossible for individual cftype definitions to specify their
    event callbacks. This is worked around by simply hard-coding
    filename to event callback mapping in cgroup_write_event_control().
    This is awkward and inflexible, which is actually desirable given
    that we don't want to grow more usages of this feature.

    * eventfd_ctx declaration is removed from cgroup.h, which makes
    vmpressure.h miss eventfd_ctx declaration. Include eventfd.h from
    vmpressure.h.

    v2: Use file name from dentry instead of cftype. This will allow
    removing all cftype handling in the function.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • @css for cgroup_write_event_control() is now always for memcg and the
    target file should be a memcg file too. Drop code which assumes @css
    is dummy_css and the target file may belong to different subsystems.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov

    Tejun Heo