14 Nov, 2014

1 commit

  • Unlike SLUB, SLAB does not always place an object at the beginning of the
    slab. This causes a misalignment problem after slab merging was
    introduced by commit 12220dea07f1 ("mm/slab: support slab merge").

    Following is the report from Markos of a boot failure on Malta with EVA.

    Calibrating delay loop... 19.86 BogoMIPS (lpj=99328)
    pid_max: default: 32768 minimum: 301
    Mount-cache hash table entries: 4096 (order: 0, 16384 bytes)
    Mountpoint-cache hash table entries: 4096 (order: 0, 16384 bytes)
    Kernel bug detected[#1]:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.0-05639-g12220dea07f1 #1631
    task: 1f04f5d8 ti: 1f050000 task.ti: 1f050000
    epc : 80141190 alloc_unbound_pwq+0x234/0x304
    Not tainted
    ra : 80141184 alloc_unbound_pwq+0x228/0x304
    Process swapper/0 (pid: 1, threadinfo=1f050000, task=1f04f5d8, tls=00000000)
    Call Trace:
    alloc_unbound_pwq+0x234/0x304
    apply_workqueue_attrs+0x11c/0x294
    __alloc_workqueue_key+0x23c/0x470
    init_workqueues+0x320/0x400
    do_one_initcall+0xe8/0x23c
    kernel_init_freeable+0x9c/0x224
    kernel_init+0x10/0x100
    ret_from_kernel_thread+0x14/0x1c
    [ end trace cb88537fdc8fa200 ]
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

    alloc_unbound_pwq() allocates a slab object from the pool_workqueue
    cache. This kmem_cache requires 256-byte alignment, but the current
    merging code doesn't honor that and merges it with kmalloc-256.
    kmalloc-256 requires only cacheline-size alignment, so the above failure
    occurs. On x86, however, kmalloc-256 happens to be aligned to 256 bytes,
    so the problem did not show up there.

    To fix this, this patch introduces an alignment mismatch check in
    find_mergeable().
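
    A sketch of the kind of check this adds to the merge loop in
    find_mergeable() (illustrative only, the exact hunk may differ):

        list_for_each_entry_reverse(s, &slab_caches, list) {
                ...
                /* skip candidates whose alignment is too weak for us */
                if (IS_ENABLED(CONFIG_SLAB) && align &&
                    (align > s->align || s->align % align))
                        continue;
                ...
                return s;
        }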

    Signed-off-by: Joonsoo Kim
    Reported-by: Markos Chandras
    Tested-by: Markos Chandras
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

30 Oct, 2014

1 commit

  • The SLUB cache merges caches with the same size and alignment, and
    there was a long-standing bug with this behavior:

    - create the cache named "foo"
    - create the cache named "bar" (which is merged with "foo")
    - delete the cache named "foo" (but it stays allocated because "bar"
    uses it)
    - create the cache named "foo" again - it fails because the name "foo"
    is already used

    That bug was fixed in commit 694617474e33 ("slab_common: fix the check
    for duplicate slab names") by not warning on duplicate cache names when
    the SLUB subsystem is used.

    Recently, cache merging was implemented for the SLAB subsystem too, in
    commit 12220dea07f1 ("mm/slab: support slab merge"). Therefore we need
    to stop checking for duplicate names even for the SLAB subsystem.

    This patch fixes the bug by removing the check.
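
    A minimal sketch of the scenario above (sizes and flags are
    illustrative only):

        struct kmem_cache *foo, *bar;

        foo = kmem_cache_create("foo", 64, 0, 0, NULL);
        bar = kmem_cache_create("bar", 64, 0, 0, NULL); /* merged with "foo" */
        kmem_cache_destroy(foo);        /* stays alive, "bar" still uses it */
        foo = kmem_cache_create("foo", 64, 0, 0, NULL); /* used to fail */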

    Signed-off-by: Mikulas Patocka
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

10 Oct, 2014

5 commits

  • While growing per-memcg cache arrays, we jump between memcontrol.c and
    slab_common.c in a weird way:

    memcg_alloc_cache_id - memcontrol.c
    memcg_update_all_caches - slab_common.c
    memcg_update_cache_size - memcontrol.c

    There's absolutely no reason why memcg_update_cache_size can't live on the
    slab's side though. So let's move it there and settle it comfortably amid
    per-memcg cache allocation functions.

    Besides, this patch cleans this function up a bit, removing all the
    useless comments from it, and renames it to memcg_update_cache_params to
    conform to memcg_alloc/free_cache_params, which we already have in
    slab_common.c.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Glauber Costa
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The only reason why they live in memcontrol.c is that we get/put css
    reference to the owner memory cgroup in them. However, we can do that in
    memcg_{un,}register_cache. OTOH, there are several reasons to move them
    to slab_common.c.

    First, I think that the less public interface functions we have in
    memcontrol.h the better. Since the functions I move don't depend on
    memcontrol, I think it's worth making them private to slab, especially
    taking into account that the arrays are defined on the slab's side too.

    Second, the way how per-memcg arrays are updated looks rather awkward: it
    proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
    (memcg_update_all_caches) and back to memcontrol.c again
    (memcg_update_array_size). In the following patches I move the function
    relocating the arrays (memcg_update_array_size) to slab_common.c and
    therefore get rid of this circular call path. I think we should have the
    cache allocation stuff in the same place where we have relocation, because
    it's easier to follow the code then. So I move arrays alloc/free
    functions to slab_common.c too.

    The third point isn't obvious. I'm going to make the list_lru structure
    per-memcg to allow targeted kmem reclaim. That means we will have
    per-memcg arrays in list_lrus too. It turns out that it's much easier to
    update these arrays in list_lru.c rather than in memcontrol.c, because all
    the stuff we need is defined there. This patch makes the memcg cache
    arrays allocation path conform to that of the upcoming list_lru.

    So let's move these functions to slab_common.c and make them static.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Glauber Costa
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Slab merge is a good feature for reducing fragmentation. Currently it is
    only applied to SLUB, but it would be good to apply it to SLAB as well.
    This patch is a preparation step for that, commonizing the slab merge
    logic.

    Signed-off-by: Joonsoo Kim
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We don't need to keep kmem_cache definition in include/linux/slab.h if we
    don't need to inline kmem_cache_size(). According to my code inspection,
    this function is only called at lc_create() in lib/lru_cache.c which may
    be called at initialization phase of something, so we don't need to inline
    it. Therefore, move it to slab_common.c and move the kmem_cache
    definition to an internal header.

    After this change, we can change kmem_cache definition easily without full
    kernel build. For instance, we can turn on/off CONFIG_SLUB_STATS without
    full kernel build.
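
    The uninlined function is tiny; roughly (a sketch, not necessarily the
    exact patch):

        /* mm/slab_common.c */
        unsigned int kmem_cache_size(struct kmem_cache *s)
        {
                return s->object_size;
        }
        EXPORT_SYMBOL(kmem_cache_size);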

    [akpm@linux-foundation.org: export kmem_cache_size() to modules]
    [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors]
    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • False positive:

    mm/slab_common.c: In function 'kmem_cache_create':
    mm/slab_common.c:204: warning: 's' may be used uninitialized in this function
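
    One common way to silence such a false positive (a sketch; the actual
    change may differ) is to initialize the variable at its declaration:

        struct kmem_cache *s = NULL;    /* keep gcc quiet on the error path */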

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

07 Aug, 2014

1 commit

  • The functions krealloc(), __krealloc() and kzfree() belong to the slab
    API, so they should be placed in slab_common.c.

    Also move the slab allocators' tracepoint definitions to slab_common.c.
    No functional changes here.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

23 Jul, 2014

1 commit


05 Jun, 2014

8 commits

  • Currently, in kmem_cache_destroy we delete the cache from the slab_list
    before __kmem_cache_shutdown, inserting it back into the list on failure.
    Initially this was done because we could release the slab_mutex in
    __kmem_cache_shutdown to delete the sysfs slub entry, but since commit
    41a212859a4d ("slub: use sysfs'es release mechanism for kmem_cache") the
    sysfs entry is removed later in kmem_cache_destroy, after dropping the
    slab_mutex, so no implementation of __kmem_cache_shutdown ever releases
    the lock. Therefore we can simplify the code a bit by moving list_del
    after __kmem_cache_shutdown.
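
    The simplified ordering then looks roughly like this (sketch):

        mutex_lock(&slab_mutex);
        ...
        err = __kmem_cache_shutdown(s);
        if (!err)
                list_del(&s->list);     /* unlink only after shutdown succeeded */
        ...
        mutex_unlock(&slab_mutex);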

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Current names are rather inconsistent. Let's try to improve them.

    Brief change log:

    ** old name **                        ** new name **

    kmem_cache_create_memcg               memcg_create_kmem_cache
    memcg_kmem_create_cache               memcg_register_cache
    memcg_kmem_destroy_cache              memcg_unregister_cache

    kmem_cache_destroy_memcg_children     memcg_cleanup_cache_params
    mem_cgroup_destroy_all_caches         memcg_unregister_all_caches

    create_work                           memcg_register_cache_work
    memcg_create_cache_work_func          memcg_register_cache_func
    memcg_create_cache_enqueue            memcg_schedule_register_cache

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Instead of calling back to memcontrol.c from kmem_cache_create_memcg in
    order to just create the name of a per memcg cache, let's allocate it in
    place. We only need to pass the memcg name to kmem_cache_create_memcg for
    that - everything else can be done in slab_common.c.
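
    A sketch of how the per-memcg cache name can be formed on the slab side
    (the format string is shown for illustration):

        cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name,
                               memcg_cache_id(memcg), memcg_name);
        if (!cache_name)
                goto out_unlock;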

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • At present, we have the following mutexes protecting data related to per
    memcg kmem caches:

    - slab_mutex. This one is held during the whole kmem cache creation
    and destruction paths. We also take it when updating per root cache
    memcg_caches arrays (see memcg_update_all_caches). As a result, taking
    it guarantees there will be no changes to any kmem cache (including per
    memcg). Why do we need something else then? The point is that it is
    private to the slab implementation and has some internal dependencies on
    other mutexes (get_online_cpus). So we just don't want to rely upon it
    and prefer to introduce additional mutexes instead.

    - activate_kmem_mutex. Initially it was added to synchronize
    initializing kmem limit (memcg_activate_kmem). However, since we can
    grow per root cache memcg_caches arrays only on kmem limit
    initialization (see memcg_update_all_caches), we also employ it to
    protect against memcg_caches arrays relocation (e.g. see
    __kmem_cache_destroy_memcg_children).

    - We have a convention not to take slab_mutex in memcontrol.c, but we
    want to walk over per memcg memcg_slab_caches lists there (e.g. for
    destroying all memcg caches on offline). So we have per memcg
    slab_caches_mutex's protecting those lists.

    The mutexes are taken in the following order:

    activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex

    Such a synchronization scheme has a number of flaws, for instance:

    - We can't call kmem_cache_{destroy,shrink} while walking over a
    memcg::memcg_slab_caches list due to locking order. As a result, in
    mem_cgroup_destroy_all_caches we schedule the
    memcg_cache_params::destroy work shrinking and destroying the cache.

    - We don't have a mutex to synchronize per memcg caches destruction
    between memcg offline (mem_cgroup_destroy_all_caches) and root cache
    destruction (__kmem_cache_destroy_memcg_children). Currently we just
    don't bother about it.

    This patch simplifies it by substituting per memcg slab_caches_mutex's
    with the global memcg_slab_mutex. It will be held whenever a new per
    memcg cache is created or destroyed, so it protects per root cache
    memcg_caches arrays and per memcg memcg_slab_caches lists. The locking
    order is following:

    activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex

    This allows us to call kmem_cache_{create,shrink,destroy} under the
    memcg_slab_mutex. As a result, we don't need memcg_cache_params::destroy
    work any more - we can simply destroy caches while iterating over a per
    memcg slab caches list.

    Also using the global mutex simplifies synchronization between concurrent
    per memcg caches creation/destruction, e.g. mem_cgroup_destroy_all_caches
    vs __kmem_cache_destroy_memcg_children.

    The downside of this is that we substitute per-memcg slab_caches_mutex's
    with a hammer-like global mutex, but since we already take either the
    slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
    shouldn't hurt concurrency a lot.
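
    For illustration, a memcg's caches can then be destroyed synchronously
    while walking its list (a simplified sketch):

        mutex_lock(&memcg_slab_mutex);
        list_for_each_entry_safe(params, tmp,
                                 &memcg->memcg_slab_caches, list) {
                cachep = memcg_params_to_cache(params);
                kmem_cache_shrink(cachep);
                if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
                        memcg_unregister_cache(cachep);
        }
        mutex_unlock(&memcg_slab_mutex);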

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When we create a sl[au]b cache, we allocate kmem_cache_node structures
    for each online NUMA node. To handle nodes taken online/offline, we
    register memory hotplug notifier and allocate/free kmem_cache_node
    corresponding to the node that changes its state for each kmem cache.

    To synchronize between the two paths we hold the slab_mutex during both
    the cache creation/destruction path and while tuning per-node parts of
    kmem caches in the memory hotplug handler, but that's not quite right,
    because it does not guarantee that a newly created cache will have all
    kmem_cache_nodes initialized in case it races with memory hotplug. For
    instance, in case of slub:

    CPU0                                    CPU1
    ----                                    ----
    kmem_cache_create:                      online_pages:
     __kmem_cache_create:                    slab_memory_callback:
                                              slab_mem_going_online_callback:
                                               lock slab_mutex
                                               for each slab_caches list entry
                                                   allocate kmem_cache node
                                               unlock slab_mutex
      lock slab_mutex
      init_kmem_cache_nodes:
       for_each_node_state(node, N_NORMAL_MEMORY)
        allocate kmem_cache node
      add kmem_cache to slab_caches list
      unlock slab_mutex
                                            online_pages (continued):
                                             node_states_set_node

    As a result we'll get a kmem cache with not all kmem_cache_nodes
    allocated.

    To avoid issues like that we should bracket the whole kmem cache
    creation/destruction/shrink paths with get/put_online_mems(), just as we
    do for cpu hotplug. This patch does the trick.

    Note that after it is applied, there is no need to take the slab_mutex
    for kmem_cache_shrink any more, so it is removed from there.
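
    Roughly, the creation path becomes (sketch):

        get_online_cpus();
        get_online_mems();
        mutex_lock(&slab_mutex);
        /* ... allocate and set up the cache ... */
        mutex_unlock(&slab_mutex);
        put_online_mems();
        put_online_cpus();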

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently to allocate a page that should be charged to kmemcg (e.g.
    threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
    allocated is then to be freed by free_memcg_kmem_pages. Apart from
    looking asymmetrical, this also requires intrusion into the general
    allocation path. So let's introduce separate functions that will
    alloc/free pages charged to kmemcg.

    The new functions are called alloc_kmem_pages and free_kmem_pages. They
    should be used when the caller would actually like to use kmalloc, but
    has to fall back to the page allocator because the allocation is large.
    They only differ from alloc_pages and free_pages in that besides
    allocating or freeing pages they also charge them to the kmem resource
    counter of the current memory cgroup.
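
    A hedged usage sketch for a large, kmemcg-charged allocation:

        struct page *page;
        void *buf;

        page = alloc_kmem_pages(GFP_KERNEL, order); /* charged to current memcg */
        if (!page)
                return NULL;
        buf = page_address(page);
        /* ... use buf ... */
        free_kmem_pages((unsigned long)buf, order); /* uncharges and frees */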

    [sfr@canb.auug.org.au: export kmalloc_order() to modules]
    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem, so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to explicitly charge kmem there. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from memcg caches' allocflags. Instead it makes slab allocation path
    call memcg_charge_kmem directly getting memcg to charge from the cache's
    memcg params.

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment above memcg_kmem_get_cache.
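
    The idea, as a sketch (the helper placement and exact signature may
    differ from the final code):

        /* in the slab page allocation path */
        if (!is_root_cache(s) &&
            memcg_charge_kmem(s->memcg_params->memcg, flags,
                              PAGE_SIZE << order))
                return NULL;    /* charging the cache's own memcg failed */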

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

24 May, 2014

1 commit

  • The patch 3e374919b314f20e2a04f641ebc1093d758f66a4 is supposed to fix the
    problem where kmem_cache_create incorrectly reports a duplicate cache
    name and fails. The problem is described in the header of that patch.

    However, the patch doesn't really fix the problem because of these
    reasons:

    * the logic to test for debugging is reversed. It was intended to perform
    the check only if slub debugging is enabled (which implies that caches
    with the same parameters are not merged). Therefore, there should be:
    #if !defined(CONFIG_SLUB) || defined(CONFIG_SLUB_DEBUG_ON)
    The current code has the condition reversed and performs the test if
    debugging is disabled.

    * slub debugging may be enabled or disabled based on kernel command line,
    CONFIG_SLUB_DEBUG_ON is just the default settings. Therefore the test
    based on definition of CONFIG_SLUB_DEBUG_ON is unreliable.

    This patch fixes the problem by removing the test
    "!defined(CONFIG_SLUB_DEBUG_ON)". Therefore, duplicate names are never
    checked if the SLUB allocator is used.
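
    After the change, the guard around the duplicate-name check in
    kmem_cache_sanity_check() looks roughly like this (sketch):

        #if !defined(CONFIG_SLUB)
                if (!strcmp(s->name, name)) {
                        pr_err("%s (%s): Cache name already exists.\n",
                               __func__, name);
                        dump_stack();
                        s = NULL;
                        return -EINVAL;
                }
        #endif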

    Note to stable kernel maintainers: when backporting this patch, please
    backport also the patch 3e374919b314f20e2a04f641ebc1093d758f66a4.

    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org # 3.6+
    Signed-off-by: Pekka Enberg

    Mikulas Patocka
     

07 May, 2014

1 commit

  • debugobjects warning during netfilter exit:

    ------------[ cut here ]------------
    WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984
    Workqueue: netns cleanup_net
    Call Trace:
    dump_stack+0x52/0x87
    warn_slowpath_common+0x8c/0xc0
    warn_slowpath_fmt+0x46/0x50
    debug_print_object+0x8d/0xb0
    __debug_check_no_obj_freed+0xa5/0x220
    debug_check_no_obj_freed+0x15/0x20
    kmem_cache_free+0x197/0x340
    kmem_cache_destroy+0x86/0xe0
    nf_conntrack_cleanup_net_list+0x131/0x170
    nf_conntrack_pernet_exit+0x5d/0x70
    ops_exit_list+0x5e/0x70
    cleanup_net+0xfb/0x1c0
    process_one_work+0x338/0x550
    worker_thread+0x215/0x350
    kthread+0xe7/0xf0
    ret_from_fork+0x7c/0xb0

    Also during dcookie cleanup:

    WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    warn_slowpath_common (kernel/panic.c:430)
    warn_slowpath_fmt (kernel/panic.c:445)
    debug_print_object (lib/debugobjects.c:262)
    __debug_check_no_obj_freed (lib/debugobjects.c:697)
    debug_check_no_obj_freed (lib/debugobjects.c:726)
    kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717)
    kmem_cache_destroy (mm/slab_common.c:363)
    dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343)
    event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153)
    __fput (fs/file_table.c:217)
    ____fput (fs/file_table.c:253)
    task_work_run (kernel/task_work.c:125 (discriminator 1))
    do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
    int_signal (arch/x86/kernel/entry_64.S:807)

    Sysfs has a release mechanism. Use that to release the kmem_cache
    structure if CONFIG_SYSFS is enabled.

    Only slub is changed - slab currently only supports /proc/slabinfo and
    not /sys/kernel/slab/*. We talked about adding that and someone was
    working on it.
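
    The pattern on the slub side, roughly (sketch):

        static void kmem_cache_release(struct kobject *k)
        {
                slab_kmem_cache_release(to_slab(k));
        }

        static struct kobj_type slab_ktype = {
                .sysfs_ops = &slab_sysfs_ops,
                .release = kmem_cache_release,
        };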

    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more]
    Signed-off-by: Christoph Lameter
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Acked-by: Greg KH
    Cc: Thomas Gleixner
    Cc: Pekka Enberg
    Cc: Russell King
    Cc: Bart Van Assche
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 Apr, 2014

5 commits

  • Currently we destroy children caches at the very beginning of
    kmem_cache_destroy(). This is wrong, because the root cache will not
    necessarily be destroyed in the end - if it has aliases (refcount > 0),
    kmem_cache_destroy() will simply decrement its refcount and return. In
    this case, at best we will get a bunch of warnings in dmesg, like this
    one:

    kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
    CPU: 1 PID: 7139 Comm: modprobe Tainted: G B W 3.13.0+ #117
    Call Trace:
    dump_stack+0x49/0x5b
    kmem_cache_destroy+0xdf/0xf0
    kmem_cache_destroy_memcg_children+0x97/0xc0
    kmem_cache_destroy+0xf/0xf0
    xfs_mru_cache_uninit+0x21/0x30 [xfs]
    exit_xfs_fs+0x2e/0xc44 [xfs]
    SyS_delete_module+0x198/0x1f0
    system_call_fastpath+0x16/0x1b

    At worst, if kmem_cache_destroy() races with an allocation from a
    memcg cache, the kernel will panic.

    This patch fixes this by moving the destruction of children caches after
    the check for whether the cache has aliases. Plus, it forbids destroying
    a root cache if it still has child caches, because each child cache
    keeps a reference to its parent.
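
    The resulting ordering in kmem_cache_destroy() is roughly (sketch):

        s->refcount--;
        if (s->refcount)
                goto out_unlock;        /* cache still has aliases */

        if (kmem_cache_destroy_memcg_children(s) != 0)
                goto out_unlock;        /* children remain, keep the root */

        list_del(&s->list);
        /* ... proceed with the actual destruction ... */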

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, memcg_unregister_cache(), which deletes the cache being
    destroyed from the memcg_slab_caches list, is called after
    __kmem_cache_shutdown() (see kmem_cache_destroy()), which starts to
    destroy the cache.

    As a result, one can access a partially destroyed cache while traversing
    a memcg_slab_caches list, which can have deadly consequences (for
    instance, cache_show() called for each cache on a memcg_slab_caches list
    from mem_cgroup_slabinfo_read() will dereference pointers to already
    freed data).

    To fix this, let's move memcg_unregister_cache() before the beginning of
    the cache destruction process, issuing memcg_register_cache() again on
    failure.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
    memcg-only and except-for-memcg calls. To clean this up, let's move the
    code responsible for memcg cache creation to a separate function.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patch cleans up the memcg cache creation path as follows:

    - Move memcg cache name creation to a separate function to be called
    from kmem_cache_create_memcg(). This allows us to get rid of the mutex
    protecting the temporary buffer used for the name formatting, because
    the whole cache creation path is protected by the slab_mutex.

    - Get rid of memcg_create_kmem_cache(). This function serves as a proxy
    to kmem_cache_create_memcg(). After separating the cache name creation
    path, it would be reduced to a function call, so let's inline it.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When a kmem cache is created (kmem_cache_create_memcg()), we first try to
    find a compatible cache that already exists and can handle requests from
    the new cache, i.e. has the same object size, alignment, ctor, etc. If
    there is such a cache, we do not create any new caches, instead we simply
    increment the refcount of the cache found and return it.

    Currently we do this procedure not only when creating root caches, but
    also for memcg caches. However, there is no point in that, because, as
    every memcg cache has exactly the same parameters as its parent and cache
    merging cannot be turned off in runtime (only on boot by passing
    "slub_nomerge"), the root caches of any two potentially mergeable memcg
    caches should be merged already, i.e. it must be the same root cache, and
    therefore we couldn't even get to the memcg cache creation, because it
    already exists.

    The only exception is boot caches - they are explicitly forbidden to be
    merged by setting their refcount to -1. There are currently only two of
    them - kmem_cache and kmem_cache_node, which are used in slab internals (I
    do not count kmalloc caches as their refcount is set to 1 immediately
    after creation). Since they are prevented from merging in the first
    place, I guess we should avoid merging their children too.

    So let's remove the useless code responsible for merging memcg caches.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

30 Jan, 2014

1 commit

  • On the kmem_cache_create_memcg() error path we set 'err' but leave 's'
    (the new cache ptr) undefined. The latter can be NULL if we could not
    allocate the cache, or can point to a freed area if we failed somewhere
    later while trying to initialize it. Initially we checked 'err'
    immediately before exiting the function and returned NULL if it was set,
    ignoring the value of 's':

    out_unlock:
            ...
            if (err) {
                    /* report error */
                    return NULL;
            }
            return s;

    Recently this check was, in fact, broken by commit f717eb3abb5e ("slab:
    do not panic if we fail to create memcg cache"), which turned it to:

    out_unlock:
            ...
            if (err && !memcg) {
                    /* report error */
                    return NULL;
            }
            return s;

    As a result, if we fail creating a cache for a memcg, we will skip the
    check and return 's' that can contain crap. Obviously, commit
    f717eb3abb5e intended not to return crap on error allocating a cache for
    a memcg, but only to remove the error reporting in this case, so the
    check should look like this:

    out_unlock:
            ...
            if (err) {
                    if (!memcg) {
                            /* report error */
                    }
                    return NULL;
            }
            return s;

    [rientjes@google.com: despaghettification]
    [vdavydov@parallels.com: patch monkeying]
    Signed-off-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Dave Jones
    Reported-by: Dave Jones
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

24 Jan, 2014

5 commits

  • There is no point in flooding logs with warnings or especially crashing
    the system if we fail to create a cache for a memcg. In this case we
    will be accounting the memcg allocation to the root cgroup until we
    succeed in creating its own cache, but it isn't that critical.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We obtain a per-memcg cache from a root kmem_cache by dereferencing an
    entry of the root cache's memcg_params::memcg_caches array. If we find
    no cache for a memcg there on allocation, we initiate the memcg cache
    creation (see memcg_kmem_get_cache()). The cache creation proceeds
    asynchronously in memcg_create_kmem_cache() in order to avoid lock
    clashes, so there can be several threads trying to create the same
    kmem_cache concurrently, but only one of them may succeed. However, due
    to a race in the code, it is not always true. The point is that the
    memcg_caches array can be relocated when we activate kmem accounting for
    a memcg (see memcg_update_all_caches(), memcg_update_cache_size()). If
    memcg_update_cache_size() and memcg_create_kmem_cache() proceed
    concurrently as described below, we can leak a kmem_cache.

    Assume two threads schedule creation of the same kmem_cache. One of them
    successfully creates it. Another one should fail then, but if
    memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
    follows, it won't:

    memcg_create_kmem_cache()                   memcg_update_cache_size()
    (called w/o mutexes held)                   (called with slab_mutex,
                                                 set_limit_mutex held)
    -------------------------                   -------------------------

    mutex_lock(&memcg_cache_mutex)

                                                s->memcg_params=kzalloc(...)

    new_cachep=cache_from_memcg_idx(cachep,idx)
    // new_cachep==NULL => proceed to creation

                                                s->memcg_params->memcg_caches[i]=
                                                    cur_params->memcg_caches[i]

    // kmem_cache_create_memcg takes slab_mutex
    // so we will hang around until
    // memcg_update_cache_size finishes, but
    // nothing will prevent it from succeeding so
    // memcg_caches[idx] will be overwritten in
    // memcg_register_cache!

    new_cachep = kmem_cache_create_memcg(...)
    mutex_unlock(&memcg_cache_mutex)

    Let's fix this by moving the check for existence of the memcg cache to
    kmem_cache_create_memcg() to be called under the slab_mutex and make it
    return NULL if so.

    A similar race is possible when destroying a memcg cache (see
    kmem_cache_destroy()). Since memcg_unregister_cache(), which clears the
    pointer in the memcg_caches array, is called w/o protection, we can race
    with memcg_update_cache_size() and omit clearing the pointer. Therefore
    memcg_unregister_cache() should be moved before we release the
    slab_mutex.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we have rather a messy function set relating to per-memcg
    kmem cache initialization/destruction.

    Per-memcg caches are created in memcg_create_kmem_cache(). This
    function calls kmem_cache_create_memcg() to allocate and initialize a
    kmem cache and then "registers" the new cache in the
    memcg_params::memcg_caches array of the parent cache.

    During its work-flow, kmem_cache_create_memcg() executes the following
    memcg-related functions:

    - memcg_alloc_cache_params(), to initialize memcg_params of the newly
    created cache;
    - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
    list.

    On the other hand, kmem_cache_destroy() called on a cache destruction
    only calls memcg_release_cache(), which does all the work: it cleans the
    reference to the cache in its parent's memcg_params::memcg_caches,
    removes the cache from the memcg_slab_caches list, and frees
    memcg_params.

    Such an inconsistency between the destruction and initialization paths
    makes the code difficult to read, so let's clean this up a bit.

    This patch moves all the code relating to registration of per-memcg
    caches (adding to memcg list, setting the pointer to a cache from its
    parent) to the newly created memcg_register_cache() and
    memcg_unregister_cache() functions making the initialization and
    destruction paths look symmetrical.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We do not free the cache's memcg_params if __kmem_cache_create fails.
    Fix this.

    Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
    because it actually does not register the cache anywhere, but simply
    initializes kmem_cache::memcg_params.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently kmem_cache_create_memcg() backs off on failure inside
    conditionals, without using gotos. This results in duplicated rollback
    code, which makes the function look cumbersome even though on error
    error we should only free the allocated cache. Since in the next patch
    I am going to add yet another rollback function call on error path
    there, let's employ labels instead of conditionals for undoing any
    changes on failure to keep things clean.
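
    The error handling then follows the usual label-based unwind pattern
    (a sketch with hypothetical step names):

        err = setup_step_one(s);                /* hypothetical helper */
        if (err)
                goto out_free_cache;

        err = setup_step_two(s);                /* hypothetical helper */
        if (err)
                goto out_free_cache;

        list_add(&s->list, &slab_caches);
        goto out_unlock;

    out_free_cache:
        kmem_cache_free(kmem_cache, s);
        s = NULL;
        goto out_unlock;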

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Pekka Enberg
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Nov, 2013

1 commit


28 Sep, 2013

1 commit

  • SLUB can alias multiple kmem_cache_create() requests to one slab cache
    to save memory and increase cache hotness. As a result the name of the
    slab can be stale. Only check the name for duplicates if we are in debug
    mode, where we do not merge multiple caches.

    This fixes the following problem reported by Jonathan Brassow:

    The problem with kmem_cache* is this:

    *) Assume CONFIG_SLUB is set
    1) kmem_cache_create(name="foo-a")
    - creates new kmem_cache structure
    2) kmem_cache_create(name="foo-b")
    - If identical cache characteristics, it will be merged with the previously
    created cache associated with "foo-a". The cache's refcount will be
    incremented and an alias will be created via sysfs_slab_alias().
    3) kmem_cache_destroy()
    - Attempting to destroy cache associated with "foo-a", but instead the
    refcount is simply decremented. I don't even think the sysfs aliases are
    ever removed...
    4) kmem_cache_create(name="foo-a")
    - This FAILS because kmem_cache_sanity_check collides with the existing
    name ("foo-a") associated with the non-removed cache.

    This is a problem for RAID (specifically dm-raid) because the name used
    for the kmem_cache_create is ("raid%d-%p", level, mddev). If the cache
    persists for long enough, the memory address of an old mddev will be
    reused for a new mddev - causing an identical formulation of the cache
    name. Even though kmem_cache_destroy had long ago been used to delete
    the old cache, the merging of caches has caused the name and cache of
    that old instance to be preserved, causing a collision (and thus
    failure) in kmem_cache_create(). I see this regularly in my testing.

    Reported-by: Jonathan Brassow
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

05 Sep, 2013

1 commit

  • The kmalloc* functions of all slab allocators are similar now, so
    let's move them into slab.h. This requires some function naming changes
    in slob.

    As a result of this patch there is a common set of functions for
    all allocators. It also means that kmalloc_large() is now available
    in general to perform large order allocations that go directly
    via the page allocator. kmalloc_large() can be substituted if
    kmalloc() throws warnings because of too large allocations.

    kmalloc_large() has exactly the same semantics as kmalloc() but
    can only be used for allocations > PAGE_SIZE.
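
    A hedged usage sketch:

        void *buf;

        /* known to exceed PAGE_SIZE, so go straight to the page allocator */
        buf = kmalloc_large(8 * PAGE_SIZE, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;
        /* ... use buf ... */
        kfree(buf);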

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

14 Aug, 2013

1 commit


15 Jul, 2013

1 commit

  • Pull slab update from Pekka Enberg:
    "Highlights:

    - Fix for boot-time problems on some architectures due to
    init_lock_keys() not respecting kmalloc_caches boundaries
    (Christoph Lameter)

    - CONFIG_SLUB_CPU_PARTIAL requested by RT folks (Joonsoo Kim)

    - Fix for excessive slab freelist draining (Wanpeng Li)

    - SLUB and SLOB cleanups and fixes (various people)"

    I ended up editing the branch, and this avoids two commits at the end
    that were immediately reverted, and I instead just applied the oneliner
    fix in between myself.

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
    slub: Check for page NULL before doing the node_match check
    mm/slab: Give s_next and s_stop slab-specific names
    slob: Check for NULL pointer before calling ctor()
    slub: Make cpu partial slab support configurable
    slab: add kmalloc() to kernel API documentation
    slab: fix init_lock_keys
    slob: use DIV_ROUND_UP where possible
    slub: do not put a slab to cpu partial list when cpu_partial is 0
    mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo
    mm/slub: Drop unnecessary nr_partials
    mm/slab: Fix /proc/slabinfo unwriteable for slab
    mm/slab: Sharing s_next and s_stop between slab and slub
    mm/slab: Fix drain freelist excessively
    slob: Rework #ifdeffery in slab.h
    mm, slab: moved kmem_cache_alloc_node comment to correct place

    Linus Torvalds
     

08 Jul, 2013

1 commit


07 Jul, 2013

2 commits


13 Jun, 2013

1 commit

  • Sasha Levin noticed that the warning introduced by commit 6286ae9
    ("slab: Return NULL for oversized allocations") is being triggered:

    WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0()
    can: request_module (can-proto-4) failed.
    mpoa: proc_mpc_write: could not parse ''
    Modules linked in:
    CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W 3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2
    0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000
    ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0
    0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xb0
    [] warn_slowpath_null+0x15/0x20
    [] kmalloc_slab+0x2f/0xb0
    [] __kmalloc+0x24/0x4b0
    [] ? security_capable+0x13/0x20
    [] ? pipe_fcntl+0x107/0x210
    [] pipe_fcntl+0x107/0x210
    [] ? fget_raw_light+0x130/0x3f0
    [] SyS_fcntl+0x60b/0x6a0
    [] tracesys+0xe1/0xe6

    Andrew Morton writes:

    __GFP_NOWARN is frequently used by kernel code to probe for "how big
    an allocation can I get". That's a bit lame, but it's used on slow
    paths and is pretty simple.

    However, SLAB would still spew a warning when a big allocation happens
    if the __GFP_NOWARN flag is _not_ set, in order to expose kernel bugs.
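
    The fix is essentially to respect __GFP_NOWARN in the oversize check in
    kmalloc_slab() (sketch):

        if (unlikely(size > KMALLOC_MAX_SIZE)) {
                WARN_ON_ONCE(!(flags & __GFP_NOWARN));
                return NULL;
        }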

    Signed-off-by: Sasha Levin
    [ penberg@kernel.org: improve changelog ]
    Signed-off-by: Pekka Enberg

    Sasha Levin
     

09 May, 2013

1 commit

  • Commit 8a965b3baa89 ("mm, slab_common: Fix bootstrap creation of kmalloc
    caches") introduced a regression that caused us to crash early during
    boot. The commit introduced an ordering of slab creation, making sure
    two odd-sized slabs were created after specific power-of-two sizes.

    But, if any of the power of two slabs were created earlier during boot,
    slabs at index 1 or 2 might not get created at all. This patch makes
    sure none of the slabs get skipped.

    Tony Lindgren bisected this down to the offending commit, which really
    helped because bisect kept bringing me to almost but not quite this one.

    Signed-off-by: Chris Mason
    Acked-by: Christoph Lameter
    Acked-by: Tony Lindgren
    Acked-by: Soren Brinkmann
    Tested-by: Tetsuo Handa
    Tested-by: Konrad Rzeszutek Wilk
    Signed-off-by: Linus Torvalds

    Chris Mason