08 Apr, 2014

5 commits

  • Currently we destroy children caches at the very beginning of
    kmem_cache_destroy(). This is wrong, because the root cache will not
    necessarily be destroyed in the end - if it has aliases (refcount > 0),
    kmem_cache_destroy() will simply decrement its refcount and return. In
    this case, at best we will get a bunch of warnings in dmesg, like this
    one:

    kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
    CPU: 1 PID: 7139 Comm: modprobe Tainted: G B W 3.13.0+ #117
    Call Trace:
    dump_stack+0x49/0x5b
    kmem_cache_destroy+0xdf/0xf0
    kmem_cache_destroy_memcg_children+0x97/0xc0
    kmem_cache_destroy+0xf/0xf0
    xfs_mru_cache_uninit+0x21/0x30 [xfs]
    exit_xfs_fs+0x2e/0xc44 [xfs]
    SyS_delete_module+0x198/0x1f0
    system_call_fastpath+0x16/0x1b

    At worst, if kmem_cache_destroy() races with an allocation from a memcg
    cache, the kernel will panic.

    This patch fixes this by moving the destruction of children caches to
    after the check whether the cache has aliases. In addition, it forbids
    destroying a root cache while it still has children caches, because each
    child cache keeps a reference to its parent.
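
    A minimal sketch of the reordered flow (an illustrative userspace-style
    model; the struct fields and helper names are stand-ins, not the kernel
    code):

    struct cache {
            int refcount;           /* still > 0 after decrement: aliases remain */
            int nr_children;        /* live per-memcg copies of this root cache  */
    };

    static void destroy_children(struct cache *c);  /* placeholder */
    static void shutdown_cache(struct cache *c);    /* placeholder */

    static int destroy_cache(struct cache *c)
    {
            if (--c->refcount > 0)
                    return 0;               /* aliased: only drop a reference */

            destroy_children(c);            /* safe now that we know we proceed */
            if (c->nr_children > 0) {
                    c->refcount++;          /* a child survived and pins the parent */
                    return -1;
            }

            shutdown_cache(c);              /* the root cache really goes away */
            return 0;
    }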

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, memcg_unregister_cache(), which deletes the cache being
    destroyed from the memcg_slab_caches list, is called after
    __kmem_cache_shutdown() (see kmem_cache_destroy()), which starts to
    destroy the cache.

    As a result, one can access a partially destroyed cache while traversing
    a memcg_slab_caches list, which can have deadly consequences (for
    instance, cache_show() called for each cache on a memcg_slab_caches list
    from mem_cgroup_slabinfo_read() will dereference pointers to already
    freed data).

    To fix this, let's move memcg_unregister_cache() to before the start of
    the cache destruction process, and call memcg_register_cache() again on
    failure.
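
    The reordering can be sketched like this (placeholder helpers standing in
    for the kernel functions named above):

    struct cache;                                   /* opaque stand-in */

    static void unregister_cache(struct cache *c);  /* unlink from memcg list */
    static void register_cache(struct cache *c);    /* link back on failure   */
    static int shutdown_cache(struct cache *c);     /* may fail, e.g. objects left */

    static int destroy_memcg_cache(struct cache *c)
    {
            /* Unlink first, so memcg_slab_caches walkers can never see a
             * half-destroyed cache. */
            unregister_cache(c);

            if (shutdown_cache(c) != 0) {
                    register_cache(c);      /* destruction failed: re-expose it */
                    return -1;
            }
            return 0;
    }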

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
    memcg-only and except-for-memcg calls. To clean this up, let's move the
    code responsible for memcg cache creation to a separate function.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patch cleans up the memcg cache creation path as follows:

    - Move memcg cache name creation to a separate function to be called
    from kmem_cache_create_memcg(). This allows us to get rid of the mutex
    protecting the temporary buffer used for the name formatting, because
    the whole cache creation path is protected by the slab_mutex.

    - Get rid of memcg_create_kmem_cache(). This function serves as a proxy
    to kmem_cache_create_memcg(). After separating the cache name creation
    path, it would be reduced to a function call, so let's inline it.
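
    For the first point, the name-creation step then reduces to something
    like the following (illustrative sketch; the buffer size and format
    string are assumptions):

    #include <stdio.h>

    /* Every caller already holds slab_mutex, so this shared buffer needs
     * no lock of its own. */
    static char memcg_name_buf[128];

    static const char *memcg_cache_name(const char *root, int memcg_id,
                                        const char *cgroup)
    {
            snprintf(memcg_name_buf, sizeof(memcg_name_buf),
                     "%s(%d:%s)", root, memcg_id, cgroup);
            return memcg_name_buf;
    }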

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When a kmem cache is created (kmem_cache_create_memcg()), we first try to
    find a compatible cache that already exists and can handle requests from
    the new cache, i.e. has the same object size, alignment, ctor, etc. If
    there is such a cache, we do not create any new caches, instead we simply
    increment the refcount of the cache found and return it.

    Currently we do this procedure not only when creating root caches, but
    also for memcg caches. However, there is no point in that: every memcg
    cache has exactly the same parameters as its parent, and cache merging
    cannot be turned off at runtime (only at boot, by passing "slub_nomerge"),
    so the root caches of any two potentially mergeable memcg caches must
    already have been merged, i.e. they must share the same root cache, and
    we could therefore never even get to the memcg cache creation, because
    the memcg cache already exists.

    The only exception is boot caches - they are explicitly forbidden from
    being merged by setting their refcount to -1. There are currently only
    two of them - kmem_cache and kmem_cache_node, which are used in slab
    internals (I do not count kmalloc caches, as their refcount is set to 1
    immediately after creation). Since these caches are deliberately
    prevented from merging, we should avoid merging their children too.

    So let's remove the useless code responsible for merging memcg caches.
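
    The remaining merge lookup is then root-only, along the lines of this
    sketch (all names are illustrative):

    struct cache;                           /* opaque stand-in */

    /* Illustrative stand-in for the allocator's "find a compatible cache"
     * helper used by cache creation. */
    static struct cache *find_mergeable_cache(const char *name, unsigned int size);

    static struct cache *get_alias_or_null(const char *name, unsigned int size,
                                           int creating_memcg_cache)
    {
            /* A memcg cache copies every parameter from its root cache, so
             * two mergeable memcg caches would already share one root and
             * we would never get here for the second one. */
            if (creating_memcg_cache)
                    return NULL;

            return find_mergeable_cache(name, size);
    }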

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

30 Jan, 2014

1 commit

  • On kmem_cache_create_memcg() error path we set 'err', but leave 's' (the
    new cache ptr) undefined. The latter can be NULL if we could not
    allocate the cache, or pointing to a freed area if we failed somewhere
    later while trying to initialize it. Initially we checked 'err'
    immediately before exiting the function and returned NULL if it was set,
    ignoring the value of 's':

    out_unlock:
            ...
            if (err) {
                    /* report error */
                    return NULL;
            }
            return s;

    Recently this check was, in fact, broken by commit f717eb3abb5e ("slab:
    do not panic if we fail to create memcg cache"), which turned it to:

    out_unlock:
            ...
            if (err && !memcg) {
                    /* report error */
                    return NULL;
            }
            return s;

    As a result, if creating a cache for a memcg fails, we skip the check and
    return an 's' that can contain crap. Obviously, commit f717eb3abb5e did
    not intend to return crap on failure to allocate a memcg cache, but only
    to suppress the error reporting in that case, so the check should look
    like this:

    out_unlock:
            ...
            if (err) {
                    if (!memcg) {
                            /* report error */
                    }
                    return NULL;
            }
            return s;

    [rientjes@google.com: despaghettification]
    [vdavydov@parallels.com: patch monkeying]
    Signed-off-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Dave Jones
    Reported-by: Dave Jones
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

24 Jan, 2014

5 commits

  • There is no point in flooding logs with warnings or especially crashing
    the system if we fail to create a cache for a memcg. In this case we
    will be accounting the memcg allocation to the root cgroup until we
    succeed in creating its own cache, but it isn't that critical.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We obtain a per-memcg cache from a root kmem_cache by dereferencing an
    entry of the root cache's memcg_params::memcg_caches array. If we find
    no cache for a memcg there on allocation, we initiate the memcg cache
    creation (see memcg_kmem_get_cache()). The cache creation proceeds
    asynchronously in memcg_create_kmem_cache() in order to avoid lock
    clashes, so there can be several threads trying to create the same
    kmem_cache concurrently, but only one of them may succeed. However, due
    to a race in the code, this is not always the case. The point is that the
    memcg_caches array can be relocated when we activate kmem accounting for
    a memcg (see memcg_update_all_caches(), memcg_update_cache_size()). If
    memcg_update_cache_size() and memcg_create_kmem_cache() proceed
    concurrently as described below, we can leak a kmem_cache.

    Assume two threads schedule creation of the same kmem_cache. One of them
    successfully creates it. Another one should fail then, but if
    memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
    follows, it won't:

    memcg_create_kmem_cache()                   memcg_update_cache_size()
    (called w/o mutexes held)                   (called with slab_mutex,
                                                 set_limit_mutex held)
    -------------------------                   -------------------------

    mutex_lock(&memcg_cache_mutex)

                                                s->memcg_params=kzalloc(...)

    new_cachep=cache_from_memcg_idx(cachep,idx)
    // new_cachep==NULL => proceed to creation

                                                s->memcg_params->memcg_caches[i]
                                                    =cur_params->memcg_caches[i]

    // kmem_cache_create_memcg takes slab_mutex
    // so we will hang around until
    // memcg_update_cache_size finishes, but
    // nothing will prevent it from succeeding so
    // memcg_caches[idx] will be overwritten in
    // memcg_register_cache!

    new_cachep = kmem_cache_create_memcg(...)
    mutex_unlock(&memcg_cache_mutex)

    Let's fix this by moving the check for existence of the memcg cache to
    kmem_cache_create_memcg() to be called under the slab_mutex and make it
    return NULL if so.
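
    The creation-side fix amounts to re-checking under slab_mutex, roughly
    like this (a sketch with placeholder helpers, not the literal patch):

    struct cache;

    static void lock_slab_mutex(void);
    static void unlock_slab_mutex(void);
    static struct cache *lookup_memcg_cache(struct cache *root, int memcg_id);
    static struct cache *do_create_memcg_cache(struct cache *root, int memcg_id);

    static struct cache *create_memcg_cache(struct cache *root, int memcg_id)
    {
            struct cache *s = NULL;

            lock_slab_mutex();
            /* Re-check under slab_mutex: a concurrent creator, or a
             * memcg_caches relocation, may already have installed a cache
             * for this memcg. */
            if (!lookup_memcg_cache(root, memcg_id))
                    s = do_create_memcg_cache(root, memcg_id);
            unlock_slab_mutex();

            return s;       /* NULL: already existed, or creation failed */
    }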

    A similar race is possible when destroying a memcg cache (see
    kmem_cache_destroy()). Since memcg_unregister_cache(), which clears the
    pointer in the memcg_caches array, is called w/o protection, we can race
    with memcg_update_cache_size() and omit clearing the pointer. Therefore
    memcg_unregister_cache() should be moved before we release the
    slab_mutex.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we have rather a messy function set relating to per-memcg
    kmem cache initialization/destruction.

    Per-memcg caches are created in memcg_create_kmem_cache(). This
    function calls kmem_cache_create_memcg() to allocate and initialize a
    kmem cache and then "registers" the new cache in the
    memcg_params::memcg_caches array of the parent cache.

    During its work-flow, kmem_cache_create_memcg() executes the following
    memcg-related functions:

    - memcg_alloc_cache_params(), to initialize memcg_params of the newly
    created cache;
    - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
    list.

    On the other hand, kmem_cache_destroy() called on a cache destruction
    only calls memcg_release_cache(), which does all the work: it cleans the
    reference to the cache in its parent's memcg_params::memcg_caches,
    removes the cache from the memcg_slab_caches list, and frees
    memcg_params.

    Such an inconsistency between the destruction and initialization paths
    makes the code difficult to read, so let's clean this up a bit.

    This patch moves all the code relating to registration of per-memcg
    caches (adding to memcg list, setting the pointer to a cache from its
    parent) to the newly created memcg_register_cache() and
    memcg_unregister_cache() functions, making the initialization and
    destruction paths look symmetrical.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We do not free the cache's memcg_params if __kmem_cache_create fails.
    Fix this.

    Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
    because it does not actually register the cache anywhere, but simply
    initializes kmem_cache::memcg_params.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    Currently kmem_cache_create_memcg() backs off on failure inside
    conditionals, without using gotos. This duplicates the rollback code and
    makes the function look cumbersome, even though on error we only need to
    free the allocated cache. Since the next patch will add yet another
    rollback function call to the error path, let's use labels instead of
    conditionals to undo any changes on failure and keep things clean.
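
    In other words, the error handling converges on the usual goto-unwind
    pattern, sketched here with placeholder helpers:

    struct cache;

    static struct cache *alloc_cache(void);
    static int setup_memcg_params(struct cache *s);
    static int do_create(struct cache *s);
    static void free_memcg_params(struct cache *s);
    static void free_cache(struct cache *s);

    static struct cache *create_cache(void)
    {
            struct cache *s;

            s = alloc_cache();
            if (!s)
                    goto out;
            if (setup_memcg_params(s))
                    goto out_free_cache;
            if (do_create(s))
                    goto out_free_params;
            return s;

    out_free_params:
            free_memcg_params(s);
    out_free_cache:
            free_cache(s);
    out:
            return NULL;
    }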

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Pekka Enberg
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Nov, 2013

1 commit


28 Sep, 2013

1 commit

    SLUB can alias multiple kmem_cache_create() requests to one slab cache to
    save memory and increase cache hotness. As a result the name of the slab
    can be stale. Only check the name for duplicates if we are in debug mode,
    where we do not merge multiple caches.

    This fixes the following problem reported by Jonathan Brassow:

    The problem with kmem_cache* is this:

    *) Assume CONFIG_SLUB is set
    1) kmem_cache_create(name="foo-a")
    - creates new kmem_cache structure
    2) kmem_cache_create(name="foo-b")
    - If identical cache characteristics, it will be merged with the previously
    created cache associated with "foo-a". The cache's refcount will be
    incremented and an alias will be created via sysfs_slab_alias().
    3) kmem_cache_destroy()
    - Attempting to destroy cache associated with "foo-a", but instead the
    refcount is simply decremented. I don't even think the sysfs aliases are
    ever removed...
    4) kmem_cache_create(name="foo-a")
    - This FAILS because kmem_cache_sanity_check collides with the existing
    name ("foo-a") associated with the non-removed cache.

    This is a problem for RAID (specifically dm-raid) because the name used
    for the kmem_cache_create is ("raid%d-%p", level, mddev). If the cache
    persists for long enough, the memory address of an old mddev will be
    reused for a new mddev - causing an identical formulation of the cache
    name. Even though kmem_cache_destroy had long ago been used to delete
    the old cache, the merging of caches has caused the name and cache of
    that old instance to be preserved and causes a collision (and thus
    failure) in kmem_cache_create(). I see this regularly in my testing.
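
    Conceptually, the duplicate-name check becomes debug-only (a sketch; the
    way debug mode is detected here is an assumption):

    /* Walks the cache list looking for an existing cache with this name. */
    static int name_collides(const char *name);

    static int sanity_check_name(const char *name, int debug_no_merge)
    {
            /* With merging enabled, a stale name is expected and harmless;
             * only when caches are never merged is a duplicate a real bug. */
            if (debug_no_merge && name_collides(name))
                    return -1;
            return 0;
    }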

    Reported-by: Jonathan Brassow
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

05 Sep, 2013

1 commit

    The kmalloc* functions of all slab allocators are similar now, so let's
    move them into slab.h. This requires some function naming changes in
    slob.

    As a result of this patch there is a common set of functions for all
    allocators. It also means that kmalloc_large() is now available in
    general to perform large-order allocations that go directly via the page
    allocator. kmalloc_large() can be substituted if kmalloc() throws
    warnings because of too-large allocations.

    kmalloc_large() has exactly the same semantics as kmalloc() but can only
    be used for allocations larger than PAGE_SIZE.
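
    A hedged usage sketch (kernel-style; assumes a sleepable context where
    GFP_KERNEL is appropriate):

    #include <linux/slab.h>

    static void *grab_buffer(size_t size)
    {
            /* kmalloc_large() goes straight to the page allocator and, as
             * noted above, may only be used for sizes above PAGE_SIZE. */
            if (size > PAGE_SIZE)
                    return kmalloc_large(size, GFP_KERNEL);

            return kmalloc(size, GFP_KERNEL);
    }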

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

14 Aug, 2013

1 commit


15 Jul, 2013

1 commit

  • Pull slab update from Pekka Enberg:
    "Highlights:

    - Fix for boot-time problems on some architectures due to
    init_lock_keys() not respecting kmalloc_caches boundaries
    (Christoph Lameter)

    - CONFIG_SLUB_CPU_PARTIAL requested by RT folks (Joonsoo Kim)

    - Fix for excessive slab freelist draining (Wanpeng Li)

    - SLUB and SLOB cleanups and fixes (various people)"

    I ended up editing the branch, and this avoids two commits at the end
    that were immediately reverted, and I instead just applied the oneliner
    fix in between myself.

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
    slub: Check for page NULL before doing the node_match check
    mm/slab: Give s_next and s_stop slab-specific names
    slob: Check for NULL pointer before calling ctor()
    slub: Make cpu partial slab support configurable
    slab: add kmalloc() to kernel API documentation
    slab: fix init_lock_keys
    slob: use DIV_ROUND_UP where possible
    slub: do not put a slab to cpu partial list when cpu_partial is 0
    mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo
    mm/slub: Drop unnecessary nr_partials
    mm/slab: Fix /proc/slabinfo unwriteable for slab
    mm/slab: Sharing s_next and s_stop between slab and slub
    mm/slab: Fix drain freelist excessively
    slob: Rework #ifdeffery in slab.h
    mm, slab: moved kmem_cache_alloc_node comment to correct place

    Linus Torvalds
     

08 Jul, 2013

1 commit


07 Jul, 2013

2 commits


13 Jun, 2013

1 commit

    Sasha Levin noticed that the warning introduced by commit 6286ae9
    ("slab: Return NULL for oversized allocations") is being triggered:

    WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0()
    can: request_module (can-proto-4) failed.
    mpoa: proc_mpc_write: could not parse ''
    Modules linked in:
    CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W 3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2
    0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000
    ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0
    0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xb0
    [] warn_slowpath_null+0x15/0x20
    [] kmalloc_slab+0x2f/0xb0
    [] __kmalloc+0x24/0x4b0
    [] ? security_capable+0x13/0x20
    [] ? pipe_fcntl+0x107/0x210
    [] pipe_fcntl+0x107/0x210
    [] ? fget_raw_light+0x130/0x3f0
    [] SyS_fcntl+0x60b/0x6a0
    [] tracesys+0xe1/0xe6

    Andrew Morton writes:

    __GFP_NOWARN is frequently used by kernel code to probe for "how big
    an allocation can I get". That's a bit lame, but it's used on slow
    paths and is pretty simple.

    However, SLAB should still spew a warning when a big allocation happens
    and the __GFP_NOWARN flag is _not_ set, so that kernel bugs are still
    exposed.
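
    The resulting pattern in the size check is roughly (a sketch of the idea,
    not necessarily the exact hunk):

    /* Warn about an oversized request only when the caller did not ask to
     * be quiet; __GFP_NOWARN callers are merely probing for the limit. */
    if (unlikely(size > KMALLOC_MAX_SIZE)) {
            WARN_ON_ONCE(!(flags & __GFP_NOWARN));
            return NULL;
    }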

    Signed-off-by: Sasha Levin
    [ penberg@kernel.org: improve changelog ]
    Signed-off-by: Pekka Enberg

    Sasha Levin
     

09 May, 2013

1 commit

  • Commit 8a965b3baa89 ("mm, slab_common: Fix bootstrap creation of kmalloc
    caches") introduced a regression that caused us to crash early during
    boot. The commit introduced an ordering of slab creation, making sure the
    two odd-sized slabs were created after specific power-of-two sizes.

    But if any of the power-of-two slabs were created earlier during boot,
    the slabs at index 1 or 2 might not get created at all. This patch makes
    sure none of the slabs get skipped.

    Tony Lindgren bisected this down to the offending commit, which really
    helped because bisect kept bringing me to almost but not quite this one.

    Signed-off-by: Chris Mason
    Acked-by: Christoph Lameter
    Acked-by: Tony Lindgren
    Acked-by: Soren Brinkmann
    Tested-by: Tetsuo Handa
    Tested-by: Konrad Rzeszutek Wilk
    Signed-off-by: Linus Torvalds

    Chris Mason
     

07 May, 2013

1 commit

    For SLAB the kmalloc caches must be created in ascending size order for
    the OFF_SLAB sub-slab cache to work properly.

    Create the non-power-of-two caches immediately after the prior
    power-of-two kmalloc cache. Do not create the non-power-of-two caches
    before all other caches.

    Reported-and-tested-by: Tetsuo Handa
    Signed-off-by: Christoph Lameter
    Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

06 May, 2013

1 commit

  • The inline path seems to have changed the SLAB behavior for very large
    kmalloc allocations with commit e3366016 ("slab: Use common
    kmalloc_index/kmalloc_size functions"). This patch restores the old
    behavior but also adds diagnostics so that we can figure where in the
    code these large allocations occur.

    Reported-and-tested-by: Tetsuo Handa
    Signed-off-by: Christoph Lameter
    Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp
    [ penberg@kernel.org: use WARN_ON_ONCE ]
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

07 Feb, 2013

1 commit

  • commit "slab: Common Kmalloc cache determination" made mistake
    in kmalloc_slab(). SLAB_CACHE_DMA is for kmem_cache creation,
    not for allocation. For allocation, we should use GFP_XXX to identify
    type of allocation. So, change SLAB_CACHE_DMA to GFP_DMA.
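
    That is, the lookup keys off the caller's gfp flags, roughly (the array
    names here reflect that era's slab_common.c and are otherwise an
    assumption):

    /* DMA-suitability is a property of this allocation, not of the cache
     * creation flags, so test GFP_DMA from the caller. */
    if (unlikely(flags & GFP_DMA))
            return kmalloc_dma_caches[index];

    return kmalloc_caches[index];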

    Acked-by: Christoph Lameter
    Reported-by: Fengguang Wu
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
     

01 Feb, 2013

4 commits


19 Dec, 2012

5 commits

  • SLAB allows us to tune a particular cache behavior with tunables. When
    creating a new memcg cache copy, we'd like to preserve any tunables the
    parent cache already had.

    This could be done by an explicit call to do_tune_cpucache() after the
    cache is created. But this is not very convenient now that the caches are
    created from common code, since this function is SLAB-specific.

    Another method of doing that is taking advantage of the fact that
    do_tune_cpucache() is always called from enable_cpucache(), which is
    called at cache initialization. We can just preset the values, and then
    things work as expected.

    It can also happen that a root cache has its tunables updated during
    normal system operation. In this case, we will propagate the change to
    all caches that are already active.

    This change requires us to move the assignment of root_cache in
    memcg_params a bit earlier. We need it to be already set - which
    memcg_kmem_register_cache() will do - by the time we reach
    __kmem_cache_create().

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • When we create caches in memcgs, we need to display their usage
    information somewhere. We'll adopt a scheme similar to /proc/meminfo,
    with aggregate totals shown in the global file, and per-group information
    stored in the group itself.

    For the time being, only reads are allowed in the per-group cache.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • This enables us to remove all the children of a kmem_cache being
    destroyed, if for example the kernel module it's being used in gets
    unloaded. Otherwise, the children will still point to the destroyed
    parent.

    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Every cache that is considered a root cache (basically the "original"
    caches, tied to the root memcg/no-memcg) will have an array that should be
    large enough to store a cache pointer per each memcg in the system.

    Theoretically, this is as high as 1 << sizeof(css_id), which is currently
    in the 64k pointers range. Most of the time, we won't be using that much.

    What goes in this patch, is a simple scheme to dynamically allocate such
    an array, in order to minimize memory usage for memcg caches. Because we
    would also like to avoid allocations all the time, at least for now, the
    array will only grow. It will tend to be big enough to hold the maximum
    number of kmem-limited memcgs ever achieved.

    We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have
    more than that, we'll start doubling the size of this array every time the
    limit is reached.

    Because we are only considering kmem limited memcgs, a natural point for
    this to happen is when we write to the limit. At that point, we already
    have set_limit_mutex held, so that will become our natural synchronization
    mechanism.
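
    The growth policy can be modelled in a few lines (a userspace sketch of
    the scheme described above; the constant name is illustrative):

    #define MEMCG_CACHES_MIN_SIZE 64

    /* Called with the limit-setting mutex held, which is all the
     * synchronization this needs. Returns the array size to use once
     * 'nr_ids' kmem-limited memcgs exist; the size only ever grows. */
    static int memcg_caches_array_size(int nr_ids, int cur_size)
    {
            int size = cur_size > MEMCG_CACHES_MIN_SIZE ?
                       cur_size : MEMCG_CACHES_MIN_SIZE;

            while (size < nr_ids)
                    size *= 2;

            return size;
    }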

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
    Allow a memcg parameter to be passed during cache creation. When the slub
    allocator is being used, it will only merge caches that belong to the
    same memcg. We'll do this by scanning the global list, and then
    translating the cache to a memcg-specific cache.

    A default function is created as a wrapper, passing NULL to the memcg
    version. We only merge caches that belong to the same memcg.

    A helper, memcg_css_id, is provided because slub needs a unique cache
    name for sysfs. Since this is visible, but not the canonical location for
    slab data, the actual cache name is not used; the css_id should suffice.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     

11 Dec, 2012

2 commits


31 Oct, 2012

1 commit

  • Some flags are used internally by the allocators for management
    purposes. One example of that is the CFLGS_OFF_SLAB flag that slab uses
    to mark that the metadata for that cache is stored outside of the slab.

    No cache should ever pass those as a creation flags. We can just ignore
    this bit if it happens to be passed (such as when duplicating a cache in
    the kmem memcg patches).

    Because such flags can vary from allocator to allocator, we allow them
    to make their own decisions on that, defining SLAB_AVAILABLE_FLAGS with
    all flags that are valid at creation time. Allocators that don't have
    any specific flag requirements should define it to mean all flags.

    Common code will mask out all flags not belonging to that set.
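
    On the common-code side this boils down to a mask applied at creation
    time, e.g.:

    /* Keep only the flags this allocator declares valid for
     * kmem_cache_create(); internal bits such as CFLGS_OFF_SLAB are
     * silently dropped. */
    flags &= SLAB_AVAILABLE_FLAGS;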

    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Glauber Costa
    Signed-off-by: Pekka Enberg

    Glauber Costa
     

24 Oct, 2012

3 commits

  • With all the infrastructure in place, we can now have slabinfo_show
    done from slab_common.c. A cache-specific function is called to grab
    information about the cache itself, since that is still heavily
    dependent on the implementation. But with the values produced by it, all
    the printing and handling is done from common code.
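
    The split can be pictured as a small callback contract (a sketch of the
    shape only; field and function names are illustrative):

    struct cache;                           /* opaque stand-in */

    /* Filled in by the allocator-specific side. */
    struct slabinfo_stats {
            unsigned long active_objs;
            unsigned long num_objs;
            unsigned long active_slabs;
            unsigned long num_slabs;
            unsigned int  objects_per_slab;
    };

    /* Implemented separately by SLAB and SLUB; common code formats the
     * /proc/slabinfo lines from these numbers. */
    void get_cache_stats(struct cache *s, struct slabinfo_stats *st);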

    Signed-off-by: Glauber Costa
    CC: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     
    The header format is highly similar between slab and slub. The main
    difference lies in the fact that slab may optionally have statistics
    added here in the CONFIG_DEBUG_SLAB case, while slub sticks them
    somewhere else.

    By making sure that information conditionally lives inside a
    globally-visible CONFIG_DEBUG_SLAB switch, we can move the header
    printing to a common location.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     
    This patch moves all the common slabinfo-processing machinery to
    slab_common.c. We could do better by noticing that the output is largely
    common and having the allocators just provide the finished per-cache
    information, but that is easier to do after this first step.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     

10 Oct, 2012

1 commit

  • Commit 1331e7a1bbe1 ("rcu: Remove _rcu_barrier() dependency on
    __stop_machine()") introduced slab_mutex -> cpu_hotplug.lock dependency
    through kmem_cache_destroy() -> rcu_barrier() -> _rcu_barrier() ->
    get_online_cpus().

    Lockdep thinks that this might actually result in ABBA deadlock,
    and reports it as below:

    === [ cut here ] ===
    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc5-00004-g0d8ee37 #143 Not tainted
    -------------------------------------------------------
    kworker/u:2/40 is trying to acquire lock:
    (rcu_sched_state.barrier_mutex){+.+...}, at: [] _rcu_barrier+0x26/0x1e0

    but task is already holding lock:
    (slab_mutex){+.+.+.}, at: [] kmem_cache_destroy+0x45/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (slab_mutex){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] cpuup_callback+0x2f/0xbe
    [] notifier_call_chain+0x93/0x140
    [] __raw_notifier_call_chain+0x9/0x10
    [] _cpu_up+0xba/0x14e
    [] cpu_up+0xbc/0x117
    [] smp_init+0x6b/0x9f
    [] kernel_init+0x147/0x1dc
    [] kernel_thread_helper+0x4/0x10

    -> #1 (cpu_hotplug.lock){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] get_online_cpus+0x37/0x50
    [] _rcu_barrier+0xbb/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] deactivate_locked_super+0x49/0x90
    [] deactivate_super+0x61/0x70
    [] mntput_no_expire+0x127/0x180
    [] sys_umount+0x6e/0xd0
    [] system_call_fastpath+0x16/0x1b

    -> #0 (rcu_sched_state.barrier_mutex){+.+...}:
    [] check_prev_add+0x3de/0x440
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] _rcu_barrier+0x26/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] kmem_cache_destroy+0xd1/0xe0
    [] nf_conntrack_cleanup_net+0xe4/0x110 [nf_conntrack]
    [] nf_conntrack_cleanup+0x2a/0x70 [nf_conntrack]
    [] nf_conntrack_net_exit+0x5e/0x80 [nf_conntrack]
    [] ops_exit_list+0x39/0x60
    [] cleanup_net+0xfb/0x1b0
    [] process_one_work+0x26b/0x4c0
    [] worker_thread+0x12e/0x320
    [] kthread+0x9e/0xb0
    [] kernel_thread_helper+0x4/0x10

    other info that might help us debug this:

    Chain exists of:
    rcu_sched_state.barrier_mutex --> cpu_hotplug.lock --> slab_mutex

    Possible unsafe locking scenario:

    CPU0                                    CPU1
    ----                                    ----
    lock(slab_mutex);
                                            lock(cpu_hotplug.lock);
                                            lock(slab_mutex);
    lock(rcu_sched_state.barrier_mutex);

    *** DEADLOCK ***
    === [ cut here ] ===

    This is actually a false positive. Lockdep has no way of knowing the fact
    that the ABBA can actually never happen, because of special semantics of
    cpu_hotplug.refcount and its handling in cpu_hotplug_begin(); the mutual
    exclusion there is not achieved through mutex, but through
    cpu_hotplug.refcount.

    The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin()
    until everyone who called get_online_cpus() will call put_online_cpus()"
    semantics is totally invisible to lockdep.

    This patch therefore moves the unlock of slab_mutex so that rcu_barrier()
    is being called with it unlocked. It has two advantages:

    - it slightly reduces the hold time of slab_mutex; as it's used to protect
    the cachep list, it's not necessary to hold it over the kmem_cache_free()
    call any more
    - it silences the lockdep false-positive warning, as it avoids lockdep ever
    learning about the slab_mutex -> cpu_hotplug.lock dependency
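
    Schematically, the destruction path becomes (placeholder helpers only):

    struct cache;

    static void lock_slab_mutex(void);
    static void unlock_slab_mutex(void);
    static void unlink_cache(struct cache *s);       /* remove from cachep list */
    static void rcu_barrier_stub(void);              /* stands in for rcu_barrier() */
    static void release_cache(struct cache *s);

    static void destroy_cache(struct cache *s)
    {
            lock_slab_mutex();
            unlink_cache(s);
            unlock_slab_mutex();    /* drop slab_mutex before the barrier ... */

            rcu_barrier_stub();     /* ... so the barrier (which ends up taking
                                     * cpu_hotplug.lock) never nests inside
                                     * slab_mutex */
            release_cache(s);
    }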

    Reviewed-by: Paul E. McKenney
    Reviewed-by: Srivatsa S. Bhat
    Acked-by: David Rientjes
    Signed-off-by: Jiri Kosina
    Signed-off-by: Pekka Enberg

    Jiri Kosina