13 Feb, 2015

3 commits

  • To speed up further allocations, SLUB may store empty slabs on per-cpu/
    per-node partial lists instead of freeing them immediately. This prevents
    per-memcg cache destruction, because kmem caches created for a memory
    cgroup are only destroyed after the last page charged to the cgroup is
    freed.

    To fix this issue, this patch resurrects the approach first proposed in
    [1]. It forbids SLUB to cache empty slabs after the memory cgroup that
    the cache belongs to has been destroyed. This is achieved by setting the
    kmem_cache's cpu_partial and min_partial constants to 0 and tuning
    put_cpu_partial() so that it drops frozen empty slabs immediately when
    cpu_partial == 0 (see the sketch below).

    The runtime overhead is minimal. Of all the hot functions, we only touch
    the relatively cold put_cpu_partial(): we make it call
    unfreeze_partials() after freezing a slab that belongs to an offline
    memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
    partial list on free/alloc, and there can't be allocations from dead
    caches, it shouldn't cause any overhead. We do have to disable preemption
    in put_cpu_partial() to achieve that, though.
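
    A minimal sketch of the idea, assuming the rest of put_cpu_partial() in
    mm/slub.c stays as it is (the cmpxchg loop that actually links the frozen
    slab is elided into a comment):

        static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
        {
                unsigned long flags;

                preempt_disable();
                /* ... freeze the slab and chain it onto the per-cpu partial list ... */

                if (unlikely(!s->cpu_partial)) {
                        /* Dead memcg cache: drop frozen empty slabs right away. */
                        local_irq_save(flags);
                        unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
                        local_irq_restore(flags);
                }
                preempt_enable();
        }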

    The original patch was accepted well and even merged to the mm tree.
    However, I decided to withdraw it due to changes happening to the memcg
    core at that time. I had an idea of introducing per-memcg shrinkers for
    kmem caches, but now, as memcg has finally settled down, I do not see it
    as an option, because SLUB shrinker would be too costly to call since SLUB
    does not keep free slabs on a separate list. Besides, we currently do not
    even call per-memcg shrinkers for offline memcgs. Overall, it would
    introduce much more complexity to both SLUB and memcg than this small
    patch.

    As for SLAB, there is no problem with it, because it shrinks its
    per-cpu/per-node caches periodically. Thanks to list_lru reparenting, we
    no longer keep entries for offline cgroups in per-memcg arrays (such as
    memcg_cache_params->memcg_caches), so we do not have to worry if a
    per-memcg cache is shrunk a bit later than it could be.

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Sometimes, we need to iterate over all memcg copies of a particular root
    kmem cache. Currently, we use memcg_cache_params->memcg_caches array for
    that, because it contains all existing memcg caches.

    However, it is bad practice to keep all caches, including those that
    belong to offline cgroups, in this array, because it will then grow
    without bound. I'm going to wipe dead caches away from it to save space.
    To still be able to iterate over all memcg caches of the same kind, let
    us link them into a list (sketched below).
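
    A minimal sketch of what such a linkage could look like (the field name
    children and the macro below are illustrative, not necessarily the exact
    ones the patch uses):

        struct memcg_cache_params {
                bool is_root_cache;
                /* root cache: head of the list of its per-memcg copies;
                 * child cache: node on its root's list */
                struct list_head children;
                /* ... the rest of the struct as shown in the next entry ... */
        };

        /* Iterate over every memcg copy of a root cache. */
        #define for_each_memcg_cache(iter, root) \
                list_for_each_entry(iter, &(root)->memcg_params.children, \
                                    memcg_params.children)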

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, kmem_cache stores a pointer to struct memcg_cache_params
    instead of embedding it. The rationale is to save memory when kmem
    accounting is disabled. However, the memcg_cache_params has shrivelled
    drastically since it was first introduced:

    * Initially:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct kmem_cache *memcg_caches[0];
                    struct {
                            struct mem_cgroup *memcg;
                            struct list_head list;
                            struct kmem_cache *root_cache;
                            bool dead;
                            atomic_t nr_pages;
                            struct work_struct destroy;
                    };
            };
    };

    * Now:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct {
                            struct rcu_head rcu_head;
                            struct kmem_cache *memcg_caches[0];
                    };
                    struct {
                            struct mem_cgroup *memcg;
                            struct kmem_cache *root_cache;
                    };
            };
    };

    So the memory saving does not seem to be a clear win anymore.

    OTOH, keeping a pointer to the memcg_cache_params struct instead of
    embedding it means touching one more cache line on the kmem alloc/free
    hot paths. Besides, it makes linking kmem caches into a list chained
    through a field of struct memcg_cache_params really painful due to the
    extra level of indirection, and I want to link them that way in the
    following patch. That said, let us embed it (see the sketch below).
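
    An illustrative before/after of what the embedding buys on an accessor
    (simplified; not necessarily the exact code):

        /* Before: struct kmem_cache carries a pointer, so every check
         * dereferences a separately allocated memcg_cache_params. */
        static inline bool is_root_cache(struct kmem_cache *s)
        {
                return !s->memcg_params || s->memcg_params->is_root_cache;
        }

        /* After: the params live inside struct kmem_cache itself -
         * one less pointer chase on the hot paths. */
        static inline bool is_root_cache(struct kmem_cache *s)
        {
                return s->memcg_params.is_root_cache;
        }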

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

11 Feb, 2015

1 commit


11 Dec, 2014

3 commits

  • Let's use the generic slab_start/next/stop iterators for showing memcg
    cache info. In contrast to the current implementation, this will work
    even if the info for all memcg caches does not fit into a single seq
    buffer (a page), and it simply looks neater (see the sketch below).

    Actually, the main reason I do this isn't mere cleanup. I'm going to zap
    the memcg_slab_caches list, because I find it useless provided we have the
    slab_caches list, and this patch is a step in this direction.

    It should be noted that before this patch an attempt to read
    memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted
    in -EIO, while after this patch it will silently show nothing except the
    header, but I don't think it will frustrate anyone.
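
    A minimal sketch of the wiring, assuming a show callback along the lines
    of memcg_slab_show() (the name is illustrative):

        static const struct seq_operations memcg_slabinfo_op = {
                .start = slab_start,      /* generic iterators in mm/slab_common.c */
                .next  = slab_next,
                .stop  = slab_stop,
                .show  = memcg_slab_show, /* prints only caches of this memcg */
        };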

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Recently lockless_dereference() was added, which can be used in place of
    hard-coding smp_read_barrier_depends(). This patch makes that change, as
    sketched below.
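
    A sketch of the before/after on a dependent load (the pointer used here
    is only an illustration):

        /* Before: open-coded ordering for the dependent load. */
        params = s->memcg_params;
        smp_read_barrier_depends();
        root = params->root_cache;

        /* After: lockless_dereference() bundles the load with the barrier. */
        params = lockless_dereference(s->memcg_params);
        root = params->root_cache;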

    Signed-off-by: Pranith Kumar
    Cc: "Paul E. McKenney"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pranith Kumar
     
  • Currently we print the slabinfo header in the seq start method, which
    makes it unusable for showing leaks, so we have leaks_show, which does
    practically the same as s_show except that it doesn't print the header.

    However, we can print the header in the seq show method instead - we only
    need to check whether the current element is the first on the list. This
    will allow us to use the same set of seq iterators for both leaks and
    slabinfo reporting, which is nice (see the sketch below).
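
    A minimal sketch of the approach (names approximate):

        static int slab_show(struct seq_file *m, void *p)
        {
                struct kmem_cache *s = list_entry(p, struct kmem_cache, list);

                if (p == slab_caches.next)      /* first entry: emit the header once */
                        print_slabinfo_header(m);
                cache_show(s, m);               /* then the per-cache line */
                return 0;
        }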

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

10 Oct, 2014

5 commits

  • Because of a chicken-and-egg problem, initialization of SLAB is really
    complicated. We need to allocate the cpu cache through SLAB to make
    kmem_cache work, but before kmem_cache is initialized, allocation through
    SLAB is impossible.

    SLUB, on the other hand, initializes in a simpler way: it uses the percpu
    allocator to allocate its cpu caches, so there is no chicken-and-egg
    problem.

    So this patch uses the percpu allocator in SLAB as well (see the sketch
    below). This simplifies the initialization step in SLAB so that we can
    maintain the SLAB code more easily.

    In my testing there is no performance difference.

    This implementation relies on the percpu allocator. Because the percpu
    allocator uses vmalloc address space, that address space could be
    exhausted by this change on a many-cpu system with a *32-bit* kernel.
    By the following calculation, this implementation can cover 1024 cpus
    even in the worst case.

    Worst:  1024 cpus * 4 bytes per pointer * 300 kmem_caches *
            120 objects per cpu_cache = 140 MB
    Normal: 1024 cpus * 4 bytes per pointer * 150 kmem_caches (slab merge) *
            80 objects per cpu_cache = 46 MB
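
    A sketch of the allocation, assuming a per-cpu init helper along the
    lines of init_arraycache() (simplified):

        static struct array_cache __percpu *alloc_kmem_cache_cpus(
                        struct kmem_cache *cachep, int entries, int batchcount)
        {
                int cpu;
                size_t size = sizeof(void *) * entries + sizeof(struct array_cache);
                struct array_cache __percpu *cpu_cache;

                cpu_cache = __alloc_percpu(size, sizeof(void *));
                if (!cpu_cache)
                        return NULL;

                for_each_possible_cpu(cpu)
                        init_arraycache(per_cpu_ptr(cpu_cache, cpu),
                                        entries, batchcount);

                return cpu_cache;
        }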

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Jeremiah Mahler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Slab merge is a good feature for reducing fragmentation. If a newly
    created cache has a similar size and properties to an existing one, this
    feature reuses the existing cache rather than creating a new one. As a
    result, objects are packed into fewer slabs and fragmentation is reduced.

    Below is the result of my testing ("after boot, sleep 20;
    cat /proc/meminfo | grep Slab"):

    Before: Slab: 25136 kB
    After:  Slab: 24364 kB

    We can save about 3% of the memory used by slab.

    To support this feature in SLAB, we need to implement SLAB-specific
    kmem_cache_flags() and __kmem_cache_alias(), because SLUB performs some
    SLUB-specific processing related to debug flags and object size changes
    in these functions.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Slab merge is a good feature for reducing fragmentation. It is currently
    only applied to SLUB, but it would be good to apply it to SLAB as well.
    This patch is a preparation step: it moves the slab merge logic into
    common code so that SLAB can use it too.

    Signed-off-by: Joonsoo Kim
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Fix a bug (discovered with kmemcheck) in for_each_kmem_cache_node(). The
    for loop reads the "node" array before verifying that the index is within
    range, which results in a kmemcheck warning. A sketch of the fix follows.
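
    Roughly, the fix is to test the index before touching the array:

        /* Before: get_node() indexes s->node[] even when __node is already
         * out of range, which kmemcheck flags. */
        #define for_each_kmem_cache_node(__s, __node, __n) \
                for (__node = 0; \
                     __n = get_node(__s, __node), __node < nr_node_ids; \
                     __node++) \
                        if (__n)

        /* After: check the bound first, then read the array entry. */
        #define for_each_kmem_cache_node(__s, __node, __n) \
                for (__node = 0; __node < nr_node_ids; __node++) \
                        if ((__n = get_node(__s, __node)))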

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • We don't need to keep the kmem_cache definition in include/linux/slab.h
    if we don't need to inline kmem_cache_size(). According to my code
    inspection, this function is only called from lc_create() in
    lib/lru_cache.c, which may be called during the initialization phase of
    something, so we don't need to inline it. Therefore, move it to
    slab_common.c and move the kmem_cache definition to the internal header
    (see the sketch below).

    After this change, we can modify the kmem_cache definition without a full
    kernel rebuild. For instance, we can turn CONFIG_SLUB_STATS on/off
    without rebuilding the whole kernel.
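
    The out-of-line version is tiny; roughly:

        /* mm/slab_common.c - only needs the internal definition in mm/slab.h */
        size_t kmem_cache_size(struct kmem_cache *s)
        {
                return s->object_size;
        }
        EXPORT_SYMBOL(kmem_cache_size);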

    [akpm@linux-foundation.org: export kmem_cache_size() to modules]
    [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors]
    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Aug, 2014

4 commits

  • Just about all of these have been converted to __func__, so convert the
    last use.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Currently, we use array_cache for the alien cache. Although the two are
    mostly similar, there is one difference: the need for a spinlock. We
    don't need a spinlock for array_cache itself, but to use array_cache as
    the alien cache, the array_cache structure has to carry one. This is
    needless overhead, so it would be better to remove it. This patch
    prepares for that by introducing a dedicated alien_cache structure
    (sketched below) and using it. In a following patch, we remove the
    spinlock from array_cache.
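
    The new wrapper is essentially:

        struct alien_cache {
                spinlock_t lock;        /* the lock lives here now ... */
                struct array_cache ac;  /* ... so array_cache itself can lose it */
        };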

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Guarding section:
    #ifndef MM_SLAB_H
    #define MM_SLAB_H
    ...
    #endif
    currently doesn't cover the whole mm/slab.h. It seems like it was
    done unintentionally.

    Wrap the whole file by moving the closing #endif to the end of it.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Reviewed-by: Vladimir Davydov
    Cc: Pekka Enberg
    Cc: Joonsoo Kim

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The patchset provides two new functions in mm/slab.h and modifies SLAB
    and SLUB to use these. The kmem_cache_node structure is shared between
    both allocators and the use of common accessors will allow us to move
    more code into slab_common.c in the future.

    This patch (of 3):

    These functions make it possible to eliminate repeated code in both SLAB
    and SLUB, and also allow for the insertion of debugging code that may be
    needed during development (see the sketch below).
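
    A sketch of the shared accessor added to mm/slab.h:

        static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
        {
                return s->node[node];
        }

    A for_each_kmem_cache_node() iterator is built on top of it; its exact
    form was later adjusted (see the 10 Oct, 2014 fix above).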

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Acked-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

05 Jun, 2014

4 commits

  • Currently we have two pairs of kmemcg-related functions that are called
    on slab alloc/free. The first is memcg_{bind,release}_pages, which counts
    the total number of pages allocated for a kmem cache. The second is
    memcg_{un}charge_slab, which {un}charges slab pages to the kmemcg
    resource counter. Let's just merge them to keep the code clean.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset is a part of preparations for kmemcg re-parenting. It aims
    at simplifying kmemcg workflows and synchronization.

    First, it removes async per-memcg cache destruction (see patches 1, 2).
    Now caches are only destroyed on memcg offline. That means caches that
    are not empty on memcg offline will be leaked. However, they are already
    leaked, because memcg_cache_params::nr_pages normally never drops to 0,
    so the destruction work is never scheduled unless kmem_cache_shrink is
    called explicitly. In the future I'm planning to reap such dead caches on
    vmpressure or periodically.

    Second, it replaces the per-memcg slab_caches_mutexes with the global
    memcg_slab_mutex, which should be taken for the whole per-memcg cache
    creation/destruction path, before the slab_mutex (see patch 3). This
    greatly simplifies synchronization among the various per-memcg cache
    creation/destruction paths.

    I'm still not quite sure about the end picture, in particular I don't know
    whether we should reap dead memcgs' kmem caches periodically or try to
    merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for
    more details), but whichever way we choose, this set looks like a
    reasonable change to me, because it greatly simplifies kmemcg work-flows
    and eases further development.

    This patch (of 3):

    After a memcg is offlined, we mark its kmem caches that cannot be deleted
    right now due to pending objects as dead by setting the
    memcg_cache_params::dead flag, so that memcg_release_pages will schedule
    cache destruction (memcg_cache_params::destroy) as soon as the last slab
    of the cache is freed (memcg_cache_params::nr_pages drops to zero).

    I guess the idea was to destroy the caches as soon as possible, i.e.
    immediately after freeing the last object. However, it just doesn't work
    that way, because kmem caches always preserve some pages for the sake of
    performance, so that nr_pages never gets to zero unless the cache is
    shrunk explicitly using kmem_cache_shrink. Of course, we could account
    the total number of objects in the cache, or check on kmem_cache_free
    whether all the slabs allocated for the cache are empty, and schedule
    destruction if so, but that would be too costly.

    Thus we have a piece of code that works only when we explicitly call
    kmem_cache_shrink, but complicates the whole picture a lot. Moreover,
    it's racy in fact. For instance, kmem_cache_shrink may free the last slab
    and thus schedule cache destruction before it finishes checking that the
    cache is empty, which can lead to use-after-free.

    So I propose to remove this async cache destruction from
    memcg_release_pages, and instead check explicitly whether the cache is
    empty after calling kmem_cache_shrink (sketched below). This will
    simplify things a lot without introducing any functional changes.
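
    A sketch of the check that replaces the async destruction, using the
    nr_pages counter from the memcg_cache_params struct shown earlier
    (illustrative, not the exact call site):

        /* For each cache left over from the dead memcg: */
        kmem_cache_shrink(cachep);
        if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
                kmem_cache_destroy(cachep);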

    And regarding dead memcg caches (i.e. those that are left hanging around
    after memcg offline because they still have objects), I suppose we should
    reap them either periodically or on vmpressure, as Glauber suggested
    initially. I'm going to implement this later.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When we create a sl[au]b cache, we allocate kmem_cache_node structures
    for each online NUMA node. To handle nodes being taken online/offline, we
    register a memory hotplug notifier and, for each kmem cache,
    allocate/free the kmem_cache_node corresponding to the node that changes
    its state.

    To synchronize between the two paths we hold the slab_mutex both during
    the cache creation/destruction path and while tuning the per-node parts
    of kmem caches in the memory hotplug handler. But that's not quite right,
    because it does not guarantee that a newly created cache will have all of
    its kmem_cache_nodes initialized if it races with memory hotplug. For
    instance, in the case of slub:

    CPU0                                    CPU1
    ----                                    ----
    kmem_cache_create:                      online_pages:
     __kmem_cache_create:                    slab_memory_callback:
                                              slab_mem_going_online_callback:
                                               lock slab_mutex
                                               for each slab_caches list entry
                                                   allocate kmem_cache node
                                               unlock slab_mutex
      lock slab_mutex
      init_kmem_cache_nodes:
       for_each_node_state(node, N_NORMAL_MEMORY)
        allocate kmem_cache node
      add kmem_cache to slab_caches list
      unlock slab_mutex
                                            online_pages (continued):
                                             node_states_set_node

    As a result we can end up with a kmem cache that does not have all of its
    kmem_cache_nodes allocated.

    To avoid issues like that, we should hold get/put_online_mems() across
    the whole kmem cache creation/destruction/shrink paths, just as we deal
    with cpu hotplug. This patch does the trick (see the sketch below).

    Note that once it's applied, there is no need to take the slab_mutex for
    kmem_cache_shrink any more, so it is removed from there.
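
    A sketch of the resulting locking order in kmem_cache_create() and
    friends:

        get_online_cpus();
        get_online_mems();
        mutex_lock(&slab_mutex);

        /* ... create/destroy the cache; the set of online nodes is stable
         *     here, so every kmem_cache_node gets initialized ... */

        mutex_unlock(&slab_mutex);
        put_online_mems();
        put_online_cpus();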

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem, so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to explicitly charge kmem there. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from the memcg caches' allocflags. Instead, it makes the slab allocation
    path call memcg_charge_kmem directly, getting the memcg to charge from
    the cache's memcg params (see the sketch below).

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment to memcg_kmem_get_cache.
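
    A sketch of the helper the slab allocation path could call; the exact
    name, signature and charge size here are assumptions:

        static __always_inline int memcg_charge_slab(struct kmem_cache *s,
                                                     gfp_t gfp, int order)
        {
                if (!memcg_kmem_enabled())
                        return 0;
                if (is_root_cache(s))
                        return 0;
                /* Charge against the memcg the cache itself belongs to. */
                return memcg_charge_kmem(s->memcg_params->memcg, gfp,
                                         PAGE_SIZE << order);
        }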

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

07 May, 2014

1 commit

  • debugobjects warning during netfilter exit:

    ------------[ cut here ]------------
    WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984
    Workqueue: netns cleanup_net
    Call Trace:
    dump_stack+0x52/0x87
    warn_slowpath_common+0x8c/0xc0
    warn_slowpath_fmt+0x46/0x50
    debug_print_object+0x8d/0xb0
    __debug_check_no_obj_freed+0xa5/0x220
    debug_check_no_obj_freed+0x15/0x20
    kmem_cache_free+0x197/0x340
    kmem_cache_destroy+0x86/0xe0
    nf_conntrack_cleanup_net_list+0x131/0x170
    nf_conntrack_pernet_exit+0x5d/0x70
    ops_exit_list+0x5e/0x70
    cleanup_net+0xfb/0x1c0
    process_one_work+0x338/0x550
    worker_thread+0x215/0x350
    kthread+0xe7/0xf0
    ret_from_fork+0x7c/0xb0

    Also during dcookie cleanup:

    WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    warn_slowpath_common (kernel/panic.c:430)
    warn_slowpath_fmt (kernel/panic.c:445)
    debug_print_object (lib/debugobjects.c:262)
    __debug_check_no_obj_freed (lib/debugobjects.c:697)
    debug_check_no_obj_freed (lib/debugobjects.c:726)
    kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717)
    kmem_cache_destroy (mm/slab_common.c:363)
    dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343)
    event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153)
    __fput (fs/file_table.c:217)
    ____fput (fs/file_table.c:253)
    task_work_run (kernel/task_work.c:125 (discriminator 1))
    do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
    int_signal (arch/x86/kernel/entry_64.S:807)

    Sysfs has a release mechanism. Use that to release the kmem_cache
    structure if CONFIG_SYSFS is enabled.
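
    A sketch of how that looks in mm/slub.c (helper names approximate):

        static void kmem_cache_release(struct kobject *k)
        {
                slab_kmem_cache_release(to_slab(k));    /* frees the kmem_cache */
        }

        static struct kobj_type slab_ktype = {
                .sysfs_ops = &slab_sysfs_ops,
                .release   = kmem_cache_release,
        };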

    Only slub is changed - slab currently only supports /proc/slabinfo and
    not /sys/kernel/slab/*. We talked about adding that and someone was
    working on it.

    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more]
    Signed-off-by: Christoph Lameter
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Acked-by: Greg KH
    Cc: Thomas Gleixner
    Cc: Pekka Enberg
    Cc: Russell King
    Cc: Bart Van Assche
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 Apr, 2014

1 commit

  • When a kmem cache is created (kmem_cache_create_memcg()), we first try to
    find a compatible cache that already exists and can handle requests from
    the new cache, i.e. has the same object size, alignment, ctor, etc. If
    there is such a cache, we do not create any new caches; instead, we
    simply increment the refcount of the cache found and return it.

    Currently we follow this procedure not only when creating root caches,
    but also for memcg caches. However, there is no point in that: since
    every memcg cache has exactly the same parameters as its parent, and
    cache merging cannot be turned off at runtime (only at boot, by passing
    "slub_nomerge"), the root caches of any two potentially mergeable memcg
    caches must already have been merged, i.e. it must be the same root
    cache, and therefore we couldn't even get to the memcg cache creation,
    because it already exists.

    The only exception is boot caches - they are explicitly forbidden to be
    merged by setting their refcount to -1. There are currently only two of
    them - kmem_cache and kmem_cache_node, which are used in slab internals
    (I do not count kmalloc caches, as their refcount is set to 1 immediately
    after creation). Since they are preemptively prevented from merging, I
    guess we should avoid merging their children too.

    So let's remove the useless code responsible for merging memcg caches.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

24 Jan, 2014

2 commits

  • We relocate root cache's memcg_params whenever we need to grow the
    memcg_caches array to accommodate all kmem-active memory cgroups.
    Currently on relocation we free the old version immediately, which can
    lead to use-after-free, because the memcg_caches array is accessed
    lock-free (see cache_from_memcg_idx()). This patch fixes this by making
    memcg_params RCU-protected for root caches.
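
    A sketch of the RCU-protected relocation (simplified):

        /* Grow path: publish the new array, free the old one only after a
         * grace period so lock-free readers never see freed memory. */
        old = s->memcg_params;
        /* ... allocate `new` and copy the old entries over ... */
        rcu_assign_pointer(s->memcg_params, new);
        if (old)
                kfree_rcu(old, rcu_head);

        /* Read path (cache_from_memcg_idx-like), under rcu_read_lock(). */
        params = rcu_dereference(s->memcg_params);
        cachep = params->memcg_caches[idx];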

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Each root kmem_cache has pointers to per-memcg caches stored in its
    memcg_params::memcg_caches array. Whenever we want to allocate a slab
    for a memcg, we access this array to get per-memcg cache to allocate
    from (see memcg_kmem_get_cache()). The access must be lock-free for
    performance reasons, so we should use barriers to ensure that the
    kmem_cache we dereference is fully up-to-date.

    First, we should place a write barrier immediately before setting the
    pointer to it in the memcg_caches array in order to make sure nobody
    will see a partially initialized object. Second, we should issue a read
    barrier before dereferencing the pointer to conform to the write
    barrier.

    However, currently the barrier usage looks rather strange. We have a
    write barrier *after* setting the pointer and a read barrier *before*
    reading the pointer, which is incorrect. This patch fixes this.
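
    A sketch of the intended pairing (simplified):

        /* Writer (memcg cache creation): initialize first, then publish. */
        /* ... fully initialize cachep ... */
        smp_wmb();                              /* order init before the store */
        root->memcg_params->memcg_caches[idx] = cachep;

        /* Reader (cache_from_memcg_idx): load, then order dependent accesses. */
        cachep = root->memcg_params->memcg_caches[idx];
        smp_read_barrier_depends();             /* pairs with the smp_wmb() above */
        /* ... only now is it safe to dereference cachep ... */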

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Nov, 2013

1 commit


29 Aug, 2013

1 commit

  • If the system had a few memory cgroups and all of them were destroyed,
    memcg_limited_groups_array_size has a non-zero value, but all new caches
    are created without memcg_params, because memcg_kmem_enabled() returns
    false.

    We try to enumerate child caches in a few places, and all of them are
    potentially dangerous.

    For example, my kernel is compiled with CONFIG_SLAB and it crashed when I
    tried to mount an NFS share after a few experiments with kmemcg.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] do_tune_cpucache+0x8a/0xd0
    PGD b942a067 PUD b999f067 PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in: fscache(+) ip6table_filter ip6_tables iptable_filter ip_tables i2c_piix4 pcspkr virtio_net virtio_balloon i2c_core floppy
    CPU: 0 PID: 357 Comm: modprobe Not tainted 3.11.0-rc7+ #59
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800b9f98240 ti: ffff8800ba32e000 task.ti: ffff8800ba32e000
    RIP: 0010:[] [] do_tune_cpucache+0x8a/0xd0
    RSP: 0018:ffff8800ba32fb70 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: ffff8800b9f98910 RDI: 0000000000000246
    RBP: ffff8800ba32fba0 R08: 0000000000000002 R09: 0000000000000004
    R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
    R13: 0000000000000008 R14: 00000000000000d0 R15: ffff8800375d0200
    FS: 00007f55f1378740(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f24feba57a0 CR3: 0000000037b51000 CR4: 00000000000006f0
    Call Trace:
    enable_cpucache+0x49/0x100
    setup_cpu_cache+0x215/0x280
    __kmem_cache_create+0x2fa/0x450
    kmem_cache_create_memcg+0x214/0x350
    kmem_cache_create+0x2b/0x30
    fscache_init+0x19b/0x230 [fscache]
    do_one_initcall+0xfa/0x1b0
    load_module+0x1c41/0x26d0
    SyS_finit_module+0x86/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Andrey Vagin
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Glauber Costa
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
     

08 Jul, 2013

1 commit


07 Jul, 2013

1 commit


01 Feb, 2013

4 commits


19 Dec, 2012

6 commits

  • SLAB allows us to tune a particular cache's behavior with tunables. When
    creating a new memcg cache copy, we'd like to preserve any tunables the
    parent cache already had.

    This could be done by an explicit call to do_tune_cpucache() after the
    cache is created. But this is not very convenient now that the caches are
    created from common code, since this function is SLAB-specific.

    Another method of doing that is taking advantage of the fact that
    do_tune_cpucache() is always called from enable_cpucache(), which is
    called at cache initialization. We can just preset the values, and then
    things work as expected.

    It can also happen that a root cache has its tunables updated during
    normal system operation. In this case, we will propagate the change to
    all caches that are already active.

    This change requires us to move the assignment of root_cache in
    memcg_params a bit earlier. We need it to be already set - which
    memcg_kmem_register_cache will do - when we reach __kmem_cache_create()
    (see the sketch below).
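
    A sketch of the preset in enable_cpucache(); the root-cache accessor name
    is an assumption:

        if (!is_root_cache(cachep)) {
                struct kmem_cache *root = memcg_root_cache(cachep);

                /* Inherit the parent's current tunables instead of the defaults. */
                limit      = root->limit;
                shared     = root->shared;
                batchcount = root->batchcount;
        }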

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • When we create caches in memcgs, we need to display their usage
    information somewhere. We'll adopt a scheme similar to /proc/meminfo,
    with aggregate totals shown in the global file, and per-group information
    stored in the group itself.

    For the time being, only reads are allowed in the per-group cache.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Implement destruction of memcg caches. Right now, only caches where our
    reference counter is the last remaining are deleted. If there are any
    other reference counters around, we just leave the caches lying around
    until they go away.

    When that happens, a destruction function is called from the cache code.
    Caches are only destroyed in process context, so we queue them up for
    later processing in the general case.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • struct page already has this information. If we start chaining caches,
    this information will always be more trustworthy than whatever is passed
    into the function.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Allow a memcg parameter to be passed during cache creation. When the slub
    allocator is used, it will only merge caches that belong to the same
    memcg. We do this by scanning the global list and then translating the
    cache to a memcg-specific cache.

    A default function is created as a wrapper, passing NULL to the memcg
    version. We only merge caches that belong to the same memcg.

    A helper, memcg_css_id, is provided because slub needs a unique cache
    name for sysfs. Since this is visible, but not the canonical location for
    slab data, the cache name is not used; the css_id should suffice.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • For the kmem slab controller, we need to record some extra information in
    the kmem_cache structure.

    Signed-off-by: Glauber Costa
    Signed-off-by: Suleiman Souhlal
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     

11 Dec, 2012

2 commits