01 Dec, 2018

1 commit

  • commit 61448479a9f2c954cde0cfe778cb6bec5d0a748d upstream.

    Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE,
    instead it falls back to kmalloc_large().

    For slab, KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE, so it calls
    kmalloc_slab() for all allocations and relies on a NULL return value
    for over-sized allocations.

    This inconsistency leads to unwanted warnings from kmalloc_slab() for
    over-sized allocations for slab. Returning NULL for failed allocations is
    the expected behavior.

    Make slub and slab code consistent by checking size >
    KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().

    While we are here also fix the check in kmalloc_slab(). We should check
    against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. This happened
    to work because for slab the constants are the same, and slub always
    checks the size against KMALLOC_MAX_CACHE_SIZE before calling
    kmalloc_slab(). But if we ever get there with size > KMALLOC_MAX_CACHE_SIZE
    anyway, for example because of a newly introduced bug in slub code, bad
    things will happen.

    Also move the check in kmalloc_slab() from function entry to the size >
    192 case. This partially compensates for the additional check in slab
    code and makes slub code a bit faster (at least theoretically).

    Also drop __GFP_NOWARN in the warning check. This warning means a bug in
    slab code itself; user-passed flags have nothing to do with it.

    Nothing of this affects slob.
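
    A condensed sketch of the resulting check placement (illustrative; DMA
    handling and the exact cache array layout are elided):

    struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
    {
            unsigned int index;

            if (size <= 192) {
                    if (!size)
                            return ZERO_SIZE_PTR;
                    index = size_index[size_index_elem(size)];
            } else {
                    /* check here, not at function entry, against
                     * KMALLOC_MAX_CACHE_SIZE, and without __GFP_NOWARN */
                    if (WARN_ON_ONCE(size > KMALLOC_MAX_CACHE_SIZE))
                            return NULL;
                    index = fls(size - 1);
            }

            return kmalloc_caches[index];
    }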

    Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
    Signed-off-by: Dmitry Vyukov
    Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
    Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
    Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
    Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
    Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Vyukov
     

03 Jul, 2018

1 commit

  • commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0 upstream.

    In kernel 4.17 I removed some code from dm-bufio that did slab cache
    merging (commit 21bb13276768: "dm bufio: remove code that merges slab
    caches") - both slab and slub support merging caches with identical
    attributes, so dm-bufio now just calls kmem_cache_create and relies on
    implicit merging.

    This uncovered a bug in the slub subsystem - if we delete a cache and
    immediately create another cache with the same attributes, it fails
    because of duplicate filename in /sys/kernel/slab/. The slub subsystem
    offloads freeing the cache to a workqueue - and if we create the new
    cache before the workqueue runs, it complains because of duplicate
    filename in sysfs.

    This patch fixes the bug by moving the call of kobject_del from
    sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
    while we hold slab_mutex - so that the sysfs entry is deleted before a
    cache with the same attributes could be created.
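
    A simplified sketch of the resulting ordering (the work-item field name
    is illustrative):

    static int shutdown_cache(struct kmem_cache *s)
    {
            /* slab_mutex is held by the caller */
            if (__kmem_cache_shutdown(s) != 0)
                    return -EBUSY;

            list_del(&s->list);

            /*
             * Delete the /sys/kernel/slab/ entry now, before slab_mutex is
             * released, so a later kmem_cache_create() with the same
             * attributes cannot race with the deferred release and trip
             * over a duplicate sysfs filename.
             */
            kobject_del(&s->kobj);

            /* the final kobject_put() etc. still happen from a work item */
            schedule_work(&s->release_work);
            return 0;
    }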

    Running device-mapper-test-suite with:

    dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

    triggered:

    Buffer I/O error on dev dm-0, logical block 1572848, async page read
    device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
    device-mapper: thin: 253:1: aborting current metadata transaction
    sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
    CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
    Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
    Workqueue: dm-thin do_worker [dm_thin_pool]
    Call Trace:
    dump_stack+0x5a/0x73
    sysfs_warn_dup+0x58/0x70
    sysfs_create_dir_ns+0x77/0x80
    kobject_add_internal+0xba/0x2e0
    kobject_init_and_add+0x70/0xb0
    sysfs_slab_add+0xb1/0x250
    __kmem_cache_create+0x116/0x150
    create_cache+0xd9/0x1f0
    kmem_cache_create_usercopy+0x1c1/0x250
    kmem_cache_create+0x18/0x20
    dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
    dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
    __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
    dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
    metadata_operation_failed+0x59/0x100 [dm_thin_pool]
    alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
    process_cell+0x2a3/0x550 [dm_thin_pool]
    do_worker+0x28d/0x8f0 [dm_thin_pool]
    process_one_work+0x171/0x370
    worker_thread+0x49/0x3f0
    kthread+0xf8/0x130
    ret_from_fork+0x35/0x40
    kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
    kmem_cache_create(dm_bufio_buffer-16) failed with error -17

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Reported-by: Mike Snitzer
    Tested-by: Mike Snitzer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

22 Feb, 2018

1 commit

  • commit 75f296d93bcebcfe375884ddac79e30263a31766 upstream.

    Convert all allocations that used a NOTRACK flag to stop using it.

    Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Oct, 2017

1 commit

  • For quick per-memcg indexing, slab caches and list_lru structures
    maintain linear arrays of descriptors. As the number of concurrent
    memory cgroups in the system goes up, this requires large contiguous
    allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
    existing slab cache and list_lru, which can easily fail on loaded
    systems. E.g.:

    mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
    CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Call Trace:
    ? __alloc_pages_direct_compact+0x4c/0x110
    __alloc_pages_nodemask+0xf50/0x1430
    alloc_pages_current+0x60/0xc0
    kmalloc_order_trace+0x29/0x1b0
    __kmalloc+0x1f4/0x320
    memcg_update_all_list_lrus+0xca/0x2e0
    mem_cgroup_css_alloc+0x612/0x670
    cgroup_apply_control_enable+0x19e/0x360
    cgroup_mkdir+0x322/0x490
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xd0/0x120
    SyS_mkdirat+0x6c/0xe0
    SyS_mkdir+0x14/0x20
    entry_SYSCALL_64_fastpath+0x18/0xad
    Mem-Info:
    active_anon:2965 inactive_anon:19 isolated_anon:0
    active_file:100270 inactive_file:98846 isolated_file:0
    unevictable:0 dirty:0 writeback:0 unstable:0
    slab_reclaimable:7328 slab_unreclaimable:16402
    mapped:771 shmem:52 pagetables:278 bounce:0
    free:13718 free_pcp:0 free_cma:0

    This output is from an artificial reproducer, but we have repeatedly
    observed order-7 failures in production in the Facebook fleet. These
    systems become useless as they cannot run more jobs, even though there
    is plenty of memory to allocate 128 individual pages.

    Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
    prove too large to be allocated physically contiguously.
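
    A minimal usage sketch of the fallback pattern (array and size names
    are illustrative):

    void **entries;

    /* small arrays stay physically contiguous, large ones fall back to vmalloc */
    entries = kvzalloc(nr_entries * sizeof(*entries), GFP_KERNEL);
    if (!entries)
            return -ENOMEM;

    /* ... index by memcg ID as before ... */

    kvfree(entries);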

    Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Josef Bacik
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Jul, 2017

1 commit

  • Some hardened environments want to build kernels with slab_nomerge
    already set (so that they do not depend on remembering to set the kernel
    command line option). This is desired to reduce the risk of kernel heap
    overflows being able to overwrite objects from merged caches and changes
    the requirements for cache layout control, increasing the difficulty of
    these attacks. By keeping caches unmerged, these kinds of exploits can
    usually only damage objects in the same cache (though the risk to
    metadata exploitation is unchanged).
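
    A sketch of how such a build-time default can be wired to the existing
    boot parameter, assuming a Kconfig symbol named SLAB_MERGE_DEFAULT (the
    symbol name is an assumption here):

    /* merging stays enabled by default unless the config, or the
     * "slab_nomerge" boot parameter, turns it off */
    static bool slab_nomerge = !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT);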

    Link: http://lkml.kernel.org/r/20170620230911.GA25238@beast
    Signed-off-by: Kees Cook
    Cc: Daniel Micay
    Cc: David Windsor
    Cc: Eric Biggers
    Cc: Christoph Lameter
    Cc: Jonathan Corbet
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: "Rafael J. Wysocki"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Mauro Carvalho Chehab
    Cc: "Paul E. McKenney"
    Cc: Arnd Bergmann
    Cc: Andy Lutomirski
    Cc: Nicolas Pitre
    Cc: Tejun Heo
    Cc: Daniel Mack
    Cc: Sebastian Andrzej Siewior
    Cc: Sergey Senozhatsky
    Cc: Helge Deller
    Cc: Rik van Riel
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.
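
    An illustrative reader pattern for such a cache (not taken from the
    commit): the memory is guaranteed to remain a struct of the right type
    for the whole read-side critical section, but it may be freed and
    reused for another object of that type, so its identity must be
    rechecked under the object lock.

    struct foo {
            spinlock_t lock;
            int key;
    };

    static struct foo *foo_lookup(struct foo __rcu **slot, int key)
    {
            struct foo *obj;

            rcu_read_lock();
            obj = rcu_dereference(*slot);
            if (obj) {
                    spin_lock(&obj->lock);
                    if (obj->key != key) {
                            /* object was freed and recycled meanwhile */
                            spin_unlock(&obj->lock);
                            obj = NULL;
                    }
            }
            rcu_read_unlock();

            return obj;     /* returned with ->lock held if non-NULL */
    }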

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
     

25 Feb, 2017

1 commit

  • Per memcg slab accounting and kasan have a problem with kmem_cache
    destruction.
    - kmem_cache_create() allocates a kmem_cache, which is used for
    allocations from processes running in root (top) memcg.
    - Processes running in non root memcg and allocating with either
    __GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg
    kmem_cache.
    - Kasan catches use-after-free by having kfree() and kmem_cache_free()
    defer freeing of objects. Objects are placed in a quarantine.
    - kmem_cache_destroy() destroys root and non root kmem_caches. It takes
    care to drain the quarantine of objects from the root memcg's
    kmem_cache, but ignores objects associated with non root memcg. This
    causes leaks because quarantined per memcg objects refer to per memcg
    kmem cache being destroyed.

    To see the problem:

    1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,)
    2) from non root memcg, allocate and free a few objects from cache
    3) dispose of the cache with kmem_cache_destroy()

    kmem_cache_destroy() will trigger a "Slab cache still has objects"
    warning indicating that the per memcg kmem_cache structure was leaked.

    Fix the leak by draining kasan quarantined objects allocated from non
    root memcg.

    Racing memcg deletion is tricky, but handled. kmem_cache_destroy() =>
    shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache()
    flushes per memcg quarantined objects, even if that memcg has been
    rmdir'd and gone through memcg_deactivate_kmem_caches().

    This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is
    enabled. So I don't think it's worth patching stable kernels.

    Link: http://lkml.kernel.org/r/1482257462-36948-1-git-send-email-gthelen@google.com
    Signed-off-by: Greg Thelen
    Reviewed-by: Vladimir Davydov
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

23 Feb, 2017

11 commits

  • If there's contention on slab_mutex, queueing the per-cache destruction
    work item on the system_wq can unnecessarily create and tie up a lot of
    kworkers.

    Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
    global and use that workqueue for the destruction work items too. While
    at it, convert the workqueue from an unbound workqueue to a per-cpu one
    with concurrency limited to 1. It's generally preferable to use per-cpu
    workqueues and concurrency limit of 1 is safe enough.

    This is suggested by Joonsoo Kim.
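
    A sketch of the resulting workqueue setup (the name string is
    illustrative); omitting WQ_UNBOUND makes it per-cpu, and a max_active
    of 1 caps concurrency:

    struct workqueue_struct *memcg_kmem_cache_wq;

    /* per-cpu, with at most one work item in flight at a time */
    memcg_kmem_cache_wq = alloc_workqueue("memcg_kmem_cache", 0, 1);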

    Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    slub uses synchronize_sched() to deactivate a memcg cache.
    synchronize_sched() is an expensive and slow operation and doesn't scale
    when a huge number of caches are destroyed back-to-back. While there
    used to be a simple batching mechanism, the batching was too restricted
    to be helpful.

    This patch implements slab_deactivate_memcg_cache_rcu_sched(), which slub
    can use to schedule a sched RCU callback instead of performing
    synchronize_sched() synchronously while holding cgroup_mutex. While
    this adds online cpus, mems and slab_mutex operations, operating on
    these locks back-to-back from the same kworker, which is what will
    happen when there are many caches to deactivate, isn't expensive at
    all, and this gets rid of the scalability problem completely.
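
    A condensed sketch of the deferral (field and callback names are
    illustrative): the grace-period wait becomes an asynchronous sched-RCU
    callback, which hands off to a workqueue for anything that needs
    sleeping locks.

    static void kmemcg_deactivate_rcufn(struct rcu_head *head)
    {
            struct memcg_cache_params *p =
                    container_of(head, struct memcg_cache_params, deact_rcu_head);

            /* can't take slab_mutex here; finish in process context */
            schedule_work(&p->deact_work);
    }

    void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
                            void (*deact_fn)(struct kmem_cache *))
    {
            s->memcg_params.deact_fn = deact_fn;
            call_rcu_sched(&s->memcg_params.deact_rcu_head,
                           kmemcg_deactivate_rcufn);
    }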

    Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • __kmem_cache_shrink() is called with %true @deactivate only for memcg
    caches. Remove @deactivate from __kmem_cache_shrink() and introduce
    __kmemcg_cache_deactivate() instead. Each memcg-supporting allocator
    should implement it and it should deactivate and drain the cache.

    This is to allow memcg cache deactivation behavior to further deviate
    from simple shrinking without messing up __kmem_cache_shrink().

    This is pure reorganization and doesn't introduce any observable
    behavior changes.

    v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.

    Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    slab_caches currently lists all caches including root and memcg ones.
    This is the only data structure which lists the root caches and
    iterating root caches can only be done by walking the list while
    skipping over memcg caches. As there can be a huge number of memcg
    caches, this can become very expensive.

    This also can make /proc/slabinfo behave very badly. seq_file processes
    reads in 4k chunks and seeks to the previous Nth position on slab_caches
    list to resume after each chunk. With a lot of memcg cache churns on
    the list, reading /proc/slabinfo can become very slow and its content
    often ends up with duplicate and/or missing entries.

    This patch adds a new list slab_root_caches which lists only the root
    caches. When memcg is not enabled, it becomes just an alias of
    slab_caches. memcg specific list operations are collected into
    memcg_[un]link_cache().
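
    A sketch of the resulting shape (member names condensed, config symbol
    simplified):

    #ifdef CONFIG_MEMCG
    LIST_HEAD(slab_root_caches);            /* root caches only */
    #define root_caches_node        memcg_params.__root_caches_node
    #else
    /* with memcg disabled every cache is a root cache, so just alias */
    #define slab_root_caches        slab_caches
    #define root_caches_node        list
    #endif

    /* e.g. /proc/slabinfo now only has to walk this list */
    list_for_each_entry(s, &slab_root_caches, root_caches_node)
            cache_show(s, m);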

    Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    While a memcg kmem_cache is listed on its root cache's ->children list,
    there is no direct way to iterate all kmem_caches which are associated
    with a memory cgroup. The only way to iterate them is walking all
    caches while filtering out caches which don't match, which would be most
    of them.

    This makes memcg destruction operations O(N^2) where N is the total
    number of slab caches which can be huge. This combined with the
    synchronous RCU operations can tie up a CPU and affect the whole machine
    for many hours when memory reclaim triggers offlining and destruction of
    the stale memcgs.

    This patch adds mem_cgroup->kmem_caches list which goes through
    memcg_cache_params->kmem_caches_node of all kmem_caches which are
    associated with the memcg. All memcg specific iterations, including
    stat file access, are updated to use the new list instead.

    Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • We're going to change how memcg caches are iterated. In preparation,
    clean up and reorganize memcg_cache_params.

    * The shared ->list is replaced by ->children in root and
    ->children_node in children.

    * ->is_root_cache is removed. Instead ->root_cache is moved out of
    the child union and is now used by both root and children. NULL
    indicates a root cache; non-NULL, a memcg one.

    This patch doesn't cause any observable behavior changes.
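
    A condensed sketch of the reorganized structure (fields unrelated to
    this patch are omitted):

    struct memcg_cache_params {
            struct kmem_cache *root_cache;          /* NULL for a root cache */
            union {
                    struct {                        /* root caches */
                            struct memcg_cache_array __rcu *memcg_caches;
                            struct list_head children;
                    };
                    struct {                        /* memcg (child) caches */
                            struct mem_cgroup *memcg;
                            struct list_head children_node;
                    };
            };
    };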

    Link: http://lkml.kernel.org/r/20170117235411.9408-5-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    SLAB_DESTROY_BY_RCU caches need to flush all RCU operations before
    destruction because slab pages are freed through RCU and they need to be
    able to dereference the associated kmem_cache. Currently, it's done
    synchronously with rcu_barrier(). As rcu_barrier() is expensive
    time-wise, slab implements a batching mechanism so that rcu_barrier()
    can be done for multiple caches at the same time.

    Unfortunately, the rcu_barrier() is in a synchronous path which is called
    while holding cgroup_mutex, and the batching is too limited to be
    actually helpful.

    This patch updates the cache release path so that the batching is
    asynchronous and global. All SLAB_DESTROY_BY_RCU caches are queued
    globally and a work item consumes the list. The work item calls
    rcu_barrier() only once for all caches that are currently queued.

    * release_caches() is removed and shutdown_cache() now either directly
    release the cache or schedules a RCU callback to do that. This
    makes the cache inaccessible once shutdown_cache() is called and
    makes it impossible for shutdown_memcg_caches() to do memcg-specific
    cleanups afterwards. Move memcg-specific part into a helper,
    unlink_memcg_cache(), and make shutdown_cache() call it directly.
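
    A condensed sketch of the batching (list and function names are
    illustrative): queued caches accumulate on a global list and a single
    work item pays the rcu_barrier() cost once for the whole batch.

    static LIST_HEAD(slab_caches_to_rcu_destroy);

    static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
    {
            struct kmem_cache *s, *s2;
            LIST_HEAD(to_destroy);

            mutex_lock(&slab_mutex);
            list_splice_init(&slab_caches_to_rcu_destroy, &to_destroy);
            mutex_unlock(&slab_mutex);

            if (list_empty(&to_destroy))
                    return;

            rcu_barrier();          /* one barrier covers every queued cache */

            list_for_each_entry_safe(s, s2, &to_destroy, list) {
                    /* ... remove sysfs entry and free the cache ... */
            }
    }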

    Link: http://lkml.kernel.org/r/20170117235411.9408-4-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Separate out slub sysfs removal and release, and call the former earlier
    from __kmem_cache_shutdown(). There's no reason to defer sysfs removal
    through RCU and this will later allow us to remove sysfs files way
    earlier during memory cgroup offline instead of release.

    Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Patch series "slab: make memcg slab destruction scalable", v3.

    With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code.

    I've seen machines which end up with hundreds of thousands of caches and
    many millions of kernfs_nodes. The current code is O(N^2) on the total
    number of caches and has synchronous rcu_barrier() and
    synchronize_sched() in cgroup offline / release path which is executed
    while holding cgroup_mutex. Combined, this leads to very expensive and
    slow cache destruction operations which can easily keep running for half
    a day.

    This also messes up /proc/slabinfo along with other cache iterating
    operations. seq_file operates on 4k chunks and on each 4k boundary
    tries to seek to the last position in the list. With a huge number of
    caches on the list, this becomes very slow and very prone to the list
    content changing underneath it leading to a lot of missing and/or
    duplicate entries.

    This patchset addresses the scalability problem.

    * Add root and per-memcg lists. Update each user to use the
    appropriate list.

    * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
    and asynchronous.

    * For dying empty slub caches, remove the sysfs files after
    deactivation so that we don't end up with millions of sysfs files
    without any useful information on them.

    This patchset contains the following ten patches.

    0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
    0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
    0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
    0004-slab-reorganize-memcg_cache_params.patch
    0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
    0006-slab-implement-slab_root_caches-list.patch
    0007-slab-introduce-__kmemcg_cache_deactivate.patch
    0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
    0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
    0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch

    0001 reverts an existing optimization to prepare for the following
    changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release
    path batched and asynchronous. 0004-0006 separate out the lists.
    0007-0008 replace synchronize_sched() in slub destruction path with
    call_rcu_sched(). 0009 removes sysfs files early for empty dying
    caches. 0010 makes destruction work items use a workqueue with limited
    concurrency.

    This patch (of 10):

    Revert 89e364db71fb5e ("slub: move synchronize_sched out of slab_mutex on
    shrink").

    With kmem cgroup support enabled, kmem_caches can be created and destroyed
    frequently and a great number of near empty kmem_caches can accumulate if
    there are a lot of transient cgroups and the system is not under memory
    pressure. When memory reclaim starts under such conditions, it can lead
    to consecutive deactivation and destruction of many kmem_caches, easily
    hundreds of thousands on moderately large systems, exposing scalability
    issues in the current slab management code. This is one of the patches to
    address the issue.

    Moving synchronize_sched() out of slab_mutex isn't enough as it's still
    inside cgroup_mutex. The whole deactivation / release path will be
    updated to avoid all synchronous RCU operations. Revert this insufficient
    optimization in preparation to ease future changes.

    Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Cc: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit
    the kmem_cache_node management structure, and puts it into the generic
    kmalloc cache array (e.g. for 128b objects). The name of this cache is
    "kmalloc-node", which is confusing for readers of /proc/slabinfo as the
    cache is used for generic allocations (and not just the kmem_cache_node
    struct) and it appears as the kmalloc-128 cache is missing.

    An easy solution is to use the kmalloc-<size> name when pre-creating the
    cache, which we can get from the kmalloc_info array.
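
    A sketch of the idea (treated as illustrative rather than the exact
    diff): take the name from kmalloc_info[] instead of hard-coding
    "kmalloc-node".

    /* during kmem_cache_init(): the cache that can hold kmem_cache_node */
    kmalloc_caches[INDEX_NODE] = create_kmalloc_cache(
                    kmalloc_info[INDEX_NODE].name,  /* "kmalloc-<size>" */
                    kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS);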

    Example /proc/slabinfo before the patch:

    ...
    kmalloc-256 1647 1984 256 16 1 : tunables 120 60 8 : slabdata 124 124 828
    kmalloc-192 1974 1974 192 21 1 : tunables 120 60 8 : slabdata 94 94 133
    kmalloc-96 1332 1344 128 32 1 : tunables 120 60 8 : slabdata 42 42 219
    kmalloc-64 2505 5952 64 64 1 : tunables 120 60 8 : slabdata 93 93 715
    kmalloc-32 4278 4464 32 124 1 : tunables 120 60 8 : slabdata 36 36 346
    kmalloc-node 1352 1376 128 32 1 : tunables 120 60 8 : slabdata 43 43 53
    kmem_cache 132 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0

    After the patch:

    ...
    kmalloc-256 1672 2160 256 16 1 : tunables 120 60 8 : slabdata 135 135 807
    kmalloc-192 1992 2016 192 21 1 : tunables 120 60 8 : slabdata 96 96 203
    kmalloc-96 1159 1184 128 32 1 : tunables 120 60 8 : slabdata 37 37 116
    kmalloc-64 2561 4864 64 64 1 : tunables 120 60 8 : slabdata 76 76 785
    kmalloc-32 4253 4340 32 124 1 : tunables 120 60 8 : slabdata 35 35 270
    kmalloc-128 1256 1280 128 32 1 : tunables 120 60 8 : slabdata 40 40 39
    kmem_cache 125 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0

    [vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter]
    Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz
    Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Matthew Wilcox
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In case CONFIG_SLUB_DEBUG_ON=n, find_mergeable() gets debug features from
    commandline but never checks if there are features from the
    SLAB_NEVER_MERGE set.

    As a result, caches selected by slub_debug are always mergeable if they
    have been created without a custom constructor set or without one of the
    SLAB_* debug features on.

    This moves the SLAB_NEVER_MERGE check below the flags update from the
    command line, to make sure the slab cache won't be merged if one of the
    debug features is on.
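
    A condensed sketch of the corrected ordering inside find_mergeable():

    /* fold in any slub_debug command-line flags first ... */
    flags = kmem_cache_flags(size, flags, name, NULL);

    /* ... and only then decide whether this cache may ever be merged */
    if (flags & SLAB_NEVER_MERGE)
            return NULL;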

    Link: http://lkml.kernel.org/r/20170101124451.GA4740@lp-laptop-d
    Signed-off-by: Grygorii Maistrenko
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grygorii Maistrenko
     

13 Dec, 2016

2 commits

    Verify that kmem_cache_create() flags are not allocator specific. It is
    done before removing flags that are not available with the current
    configuration.

    The current kmem_cache_create() removes incorrect flags but does not
    validate that callers are using them correctly. This change ensures that
    callers are not trying to create caches with flags that will be ignored
    because they are allocator specific.
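
    A sketch of the check added to kmem_cache_create() (the mask name is an
    assumption, standing for the union of flags every configuration
    accepts):

    /* reject flags that no allocator in this configuration understands */
    if (flags & ~SLAB_FLAGS_PERMITTED) {
            err = -EINVAL;
            goto out_unlock;
    }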

    Link: http://lkml.kernel.org/r/1478553075-120242-2-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     
  • synchronize_sched() is a heavy operation and calling it per each cache
    owned by a memory cgroup being destroyed may take quite some time. What
    is worse, it's currently called under the slab_mutex, stalling all work
    items doing cache creation/destruction.

    Actually, there isn't much point in calling synchronize_sched() for each
    cache - it's enough to call it just once - after setting cpu_partial for
    all caches and before shrinking them. This way, we can also move it out
    of the slab_mutex, which we have to hold for iterating over the slab
    cache list.
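
    A condensed sketch of the resulting flow in memcg cache deactivation
    (loop bodies elided):

    struct kmem_cache *s;

    mutex_lock(&slab_mutex);
    list_for_each_entry(s, &slab_caches, list) {
            /* disable per-cpu partial lists for this memcg's caches */
    }
    mutex_unlock(&slab_mutex);

    /* one grace period, issued outside slab_mutex, covers all caches above */
    synchronize_sched();

    mutex_lock(&slab_mutex);
    list_for_each_entry(s, &slab_caches, list) {
            /* now shrink the memcg's caches */
    }
    mutex_unlock(&slab_mutex);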

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
    Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
    Signed-off-by: Vladimir Davydov
    Reported-by: Doug Smythies
    Acked-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Nov, 2016

1 commit

  • While testing OBJFREELIST_SLAB integration with pagealloc, we found a
    bug where kmem_cache(sys) would be created with both CFLGS_OFF_SLAB &
    CFLGS_OBJFREELIST_SLAB. When it happened, critical allocations needed
    for loading drivers or creating new caches would fail.

    The original kmem_cache is created early making OFF_SLAB not possible.
    When kmem_cache(sys) is created, OFF_SLAB is possible and if pagealloc
    is enabled it will try to enable it first under certain conditions.
    Given kmem_cache(sys) reuses the original flag, you can have both flags
    at the same time resulting in allocation failures and odd behaviors.

    This fix discards allocator specific flags from memcg before calling
    create_cache.
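
    A sketch of the fix's shape (the mask name is an assumption, standing
    for "flags common to all allocators"): strip allocator-internal bits
    from the root cache's flags before the memcg clone is created with
    them.

    /* in memcg_create_kmem_cache(), before cloning the root cache */
    flags = root_cache->flags & CACHE_CREATE_MASK;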

    The bug exists since 4.6-rc1 and affects testing debug pagealloc
    configurations.

    Fixes: b03a017bebc4 ("mm/slab: introduce new slab management type, OBJFREELIST_SLAB")
    Link: http://lkml.kernel.org/r/1478553075-120242-1-git-send-email-thgarnie@google.com
    Signed-off-by: Greg Thelen
    Signed-off-by: Thomas Garnier
    Tested-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

27 Jul, 2016

2 commits

  • Currently, to charge a non-slab allocation to kmemcg one has to use
    alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
    this helper should finally be freed using free_kmem_pages, otherwise it
    won't be uncharged.

    This API suits its current users fine, but it turns out to be impossible
    to use along with page reference counting, i.e. when an allocation is
    supposed to be freed with put_page, as it is the case with pipe or unix
    socket buffers.

    To overcome this limitation, this patch moves charging/uncharging to
    generic page allocator paths, i.e. to __alloc_pages_nodemask and
    free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
    one can use any of the available page allocation functions to get the
    allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
    just like in case of kmalloc and friends. A charged page will be
    automatically uncharged on free.

    To make it possible, we need to mark pages charged to kmemcg somehow.
    To avoid introducing a new page flag, we make use of page->_mapcount for
    marking such pages. Since pages charged to kmemcg are not supposed to
    be mapped to userspace, it should work just fine. There are other
    (ab)users of page->_mapcount - buddy and balloon pages - but we don't
    conflict with them.

    In case kmemcg is compiled out or not used at runtime, this patch
    introduces no overhead to generic page allocator paths. If kmemcg is
    used, it will be plus one gfp flags check on alloc and plus one
    page->_mapcount check on free, which shouldn't hurt performance, because
    the data accessed are hot.
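
    A usage sketch of the resulting API: any page allocator call can charge
    to kmemcg, and the uncharge happens automatically on free, even when
    the last reference is dropped with put_page().

    struct page *page;

    page = alloc_pages(GFP_KERNEL | __GFP_ACCOUNT, 0);
    if (!page)
            return -ENOMEM;

    /* ... hand the page to something that drops the last reference ... */
    put_page(page);         /* uncharged from kmemcg here automatically */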

    Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    The kernel heap allocators use a sequential freelist, making their
    allocations predictable. This predictability makes kernel heap overflows
    easier to exploit. An attacker can carefully prepare the kernel heap to
    control which object follows the one being overflowed.

    For example these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

    ***Problems that needed solving:
    - Randomize the freelist (singly linked) used in the SLUB allocator.
    - Ensure good performance to encourage usage.
    - Get the best entropy in the early boot stage.

    ***Parts:
    - 01/02 Reorganize the SLAB freelist randomization to share elements
    with the SLUB implementation.
    - 02/02 The SLUB freelist randomization implementation. A similar
    approach to the SLAB one, but tailored to the single freelist used in
    SLUB.

    ***Performance data:

    slab_test impact is between 3% and 4% on average for 100000 attempts
    without smp. It is a very focused test; kernbench shows the overall
    impact on the system is much lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    This patch (of 2):

    This commit reorganizes the previous SLAB freelist randomization to
    prepare for the SLUB implementation. It moves functions that will be
    shared to slab_common.

    The entropy functions are changed to align with the SLUB implementation,
    now using get_random_(int|long) functions. These functions were chosen
    because they provide a bit more entropy early on boot and better
    performance when specific arch instructions are not available.

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     

23 Jul, 2016

1 commit

  • The memory controller has quite a bit of state that usually outlives the
    cgroup and pins its CSS until said state disappears. At the same time
    it imposes a 16-bit limit on the CSS ID space to economically store IDs
    in the wild. Consequently, when we use cgroups to contain frequent but
    small and short-lived jobs that leave behind some page cache, we quickly
    run into the 64k limitations of outstanding CSSs. Creating a new cgroup
    fails with -ENOSPC while there are only a few, or even no user-visible
    cgroups in existence.

    Although pinning CSSs past cgroup removal is common, there are only two
    instances that actually need an ID after a cgroup is deleted: cache
    shadow entries and swapout records.

    Cache shadow entries reference the ID weakly and can deal with the CSS
    having disappeared when it's looked up later. They pose no hurdle.

    Swap-out records do need to pin the css to hierarchically attribute
    swapins after the cgroup has been deleted; though the only pages that
    remain swapped out after offlining are tmpfs/shmem pages. And those
    references are under the user's control, so they are manageable.

    This patch introduces a private 16-bit memcg ID and switches swap and
    cache shadow entries over to using that. This ID can then be recycled
    after offlining when the CSS remains pinned only by objects that don't
    specifically need it.

    This script demonstrates the problem by faulting one cache page in a new
    cgroup and deleting it again:

    set -e
    mkdir -p pages
    for x in `seq 128000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /cgroup/foo
        echo $$ >/cgroup/foo/cgroup.procs
        echo trex >pages/$x
        echo $$ >/cgroup/cgroup.procs
        rmdir /cgroup/foo
    done

    When run on an unpatched kernel, we eventually run out of possible IDs
    even though there are no visible cgroups:

    [root@ham ~]# ./cssidstress.sh
    [...]
    65000
    mkdir: cannot create directory '/cgroup/foo': No space left on device

    After this patch, the IDs get released upon cgroup destruction and the
    cache and css objects get released once memory reclaim kicks in.

    [hannes@cmpxchg.org: init the IDR]
    Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
    Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
    Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: John Garcia
    Reviewed-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Cc: Nikolay Borisov
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 May, 2016

1 commit

  • Quarantine isolates freed objects in a separate queue. The objects are
    returned to the allocator later, which helps to detect use-after-free
    errors.

    When the object is freed, its state changes from KASAN_STATE_ALLOC to
    KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
    instead of being returned to the allocator, therefore every subsequent
    access to that object triggers a KASAN error, and the error handler is
    able to say where the object has been allocated and deallocated.

    When it's time for the object to leave quarantine, its state becomes
    KASAN_STATE_FREE and it's returned to the allocator. From now on the
    allocator may reuse it for another allocation. Before that happens,
    it's still possible to detect a use-after free on that object (it
    retains the allocation/deallocation stacks).

    When the allocator reuses this object, the shadow is unpoisoned and old
    allocation/deallocation stacks are wiped. Therefore a use of this
    object, even an incorrect one, won't trigger ASan warning.

    Without the quarantine, it's not guaranteed that the objects aren't
    reused immediately, that's why the probability of catching a
    use-after-free is lower than with quarantine in place.

    Freed objects are first added to per-cpu quarantine queues. When a
    cache is destroyed or memory shrinking is requested, the objects are
    moved into the global quarantine queue. Whenever a kmalloc call allows
    memory reclaiming, the oldest objects are popped out of the global queue
    until the total size of objects in quarantine is less than 3/4 of the
    maximum quarantine size (which is a fraction of installed physical
    memory).

    As long as an object remains in the quarantine, KASAN is able to report
    accesses to it, so the chance of reporting a use-after-free is
    increased. Once the object leaves quarantine, the allocator may reuse
    it, in which case the object is unpoisoned and KASAN can't detect
    incorrect accesses to it.

    Right now quarantine support is only enabled in SLAB allocator.
    Unification of KASAN features in SLAB and SLUB will be done later.

    This patch is based on the "mm: kasan: quarantine" patch originally
    prepared by Dmitry Chernenkov. A number of improvements have been
    suggested by Andrey Ryabinin.

    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

26 Mar, 2016

2 commits

  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Add KASAN hooks to SLAB allocator.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

18 Mar, 2016

3 commits

    Most of the mm subsystem uses pr_<level>, so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • As kmem accounting is now either enabled for all cgroups or disabled
    system-wide, there's no point in having memcg_kmem_online() helper -
    instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
    shrink_slab() now does.

    There are only two places left where this helper is used -
    __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can
    only be called if memcg_kmem_enabled() returned true. Since the cgroup
    it operates on is online, mem_cgroup_is_root() check will be enough.

    memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
    of memcg_kmem_online(), because it relies on the fact that in
    memcg_offline_kmem() memcg->kmem_state is changed before
    memcg_deactivate_kmem_caches() is called, but there we can just
    open-code the check.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Mar, 2016

1 commit

    This patch introduces a new API call, kfree_bulk(), for bulk freeing of
    memory objects not bound to a single kmem_cache.

    Christoph pointed out that it is possible to implement freeing of
    objects without knowing the kmem_cache pointer, as that information is
    available from the object's page->slab_cache, and proposed removing the
    kmem_cache argument from the bulk free API.

    Jesper demonstrated that these extra steps per object come at a
    performance cost. It is only in the case where CONFIG_MEMCG_KMEM is
    compiled in and activated at runtime that these steps are done anyhow.
    The extra cost is most visible for the SLAB allocator, because the SLUB
    allocator does the page lookup (virt_to_head_page()) anyhow.

    Thus, the conclusion was to keep the kmem_cache free bulk API with a
    kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
    easily, simply by handling the case where kmem_cache_free_bulk() gets
    called with a NULL kmem_cache pointer.
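
    A usage sketch of the new call (assuming the allocations succeed):

    void *objs[3];

    objs[0] = kmalloc(32, GFP_KERNEL);
    objs[1] = kmalloc(256, GFP_KERNEL);
    objs[2] = kmalloc(1024, GFP_KERNEL);

    /* one call frees objects that came from different kmalloc caches */
    kfree_bulk(ARRAY_SIZE(objs), objs);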

    This does increase the code size a bit, but implementing a separate
    kfree_bulk() call would likely increase code size even more.

    Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
    @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.

    Code size increase for SLAB:

    add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
    function old new delta
    kmem_cache_free_bulk 660 734 +74

    SLAB fastpath: 87 cycles(tsc) 21.814
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 103 cycles 25.878 ns - 41 cycles 10.498 ns - 81 cycles 20.312 ns
    2 - 94 cycles 23.673 ns - 26 cycles 6.682 ns - 42 cycles 10.649 ns
    3 - 92 cycles 23.181 ns - 21 cycles 5.325 ns - 39 cycles 9.950 ns
    4 - 90 cycles 22.727 ns - 18 cycles 4.673 ns - 26 cycles 6.693 ns
    8 - 89 cycles 22.270 ns - 14 cycles 3.664 ns - 23 cycles 5.835 ns
    16 - 88 cycles 22.038 ns - 14 cycles 3.503 ns - 22 cycles 5.543 ns
    30 - 89 cycles 22.284 ns - 13 cycles 3.310 ns - 20 cycles 5.197 ns
    32 - 88 cycles 22.249 ns - 13 cycles 3.420 ns - 20 cycles 5.166 ns
    34 - 88 cycles 22.224 ns - 14 cycles 3.643 ns - 20 cycles 5.170 ns
    48 - 88 cycles 22.088 ns - 14 cycles 3.507 ns - 20 cycles 5.203 ns
    64 - 88 cycles 22.063 ns - 13 cycles 3.428 ns - 20 cycles 5.152 ns
    128 - 89 cycles 22.483 ns - 15 cycles 3.891 ns - 23 cycles 5.885 ns
    158 - 89 cycles 22.381 ns - 15 cycles 3.779 ns - 22 cycles 5.548 ns
    250 - 91 cycles 22.798 ns - 16 cycles 4.152 ns - 23 cycles 5.967 ns

    SLAB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
    1 - 148 cycles 37.220 ns - 66 cycles 16.622 ns - 66 cycles 16.583 ns
    2 - 141 cycles 35.510 ns - 51 cycles 12.820 ns - 58 cycles 14.625 ns
    3 - 140 cycles 35.017 ns - 37 cycles 9.326 ns - 33 cycles 8.474 ns
    4 - 137 cycles 34.507 ns - 31 cycles 7.888 ns - 33 cycles 8.300 ns
    8 - 140 cycles 35.069 ns - 25 cycles 6.461 ns - 25 cycles 6.436 ns
    16 - 138 cycles 34.542 ns - 23 cycles 5.945 ns - 22 cycles 5.670 ns
    30 - 136 cycles 34.227 ns - 22 cycles 5.502 ns - 22 cycles 5.587 ns
    32 - 136 cycles 34.253 ns - 21 cycles 5.475 ns - 21 cycles 5.324 ns
    34 - 136 cycles 34.254 ns - 21 cycles 5.448 ns - 20 cycles 5.194 ns
    48 - 136 cycles 34.075 ns - 21 cycles 5.458 ns - 21 cycles 5.367 ns
    64 - 135 cycles 33.994 ns - 21 cycles 5.350 ns - 21 cycles 5.259 ns
    128 - 137 cycles 34.446 ns - 23 cycles 5.816 ns - 22 cycles 5.688 ns
    158 - 137 cycles 34.379 ns - 22 cycles 5.727 ns - 22 cycles 5.602 ns
    250 - 138 cycles 34.755 ns - 24 cycles 6.093 ns - 23 cycles 5.986 ns

    Code size increase for SLUB:
    function old new delta
    kmem_cache_free_bulk 717 799 +82

    SLUB benchmark:
    SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 61 cycles 15.486 ns - 53 cycles 13.364 ns - 57 cycles 14.464 ns
    2 - 54 cycles 13.703 ns - 32 cycles 8.110 ns - 33 cycles 8.482 ns
    3 - 53 cycles 13.272 ns - 25 cycles 6.362 ns - 27 cycles 6.947 ns
    4 - 51 cycles 12.994 ns - 24 cycles 6.087 ns - 24 cycles 6.078 ns
    8 - 50 cycles 12.576 ns - 21 cycles 5.354 ns - 22 cycles 5.513 ns
    16 - 49 cycles 12.368 ns - 20 cycles 5.054 ns - 20 cycles 5.042 ns
    30 - 49 cycles 12.273 ns - 18 cycles 4.748 ns - 19 cycles 4.758 ns
    32 - 49 cycles 12.401 ns - 19 cycles 4.821 ns - 19 cycles 4.810 ns
    34 - 98 cycles 24.519 ns - 24 cycles 6.154 ns - 24 cycles 6.157 ns
    48 - 83 cycles 20.833 ns - 21 cycles 5.446 ns - 21 cycles 5.429 ns
    64 - 75 cycles 18.891 ns - 20 cycles 5.247 ns - 20 cycles 5.238 ns
    128 - 93 cycles 23.271 ns - 27 cycles 6.856 ns - 27 cycles 6.823 ns
    158 - 102 cycles 25.581 ns - 30 cycles 7.714 ns - 30 cycles 7.695 ns
    250 - 107 cycles 26.917 ns - 38 cycles 9.514 ns - 38 cycles 9.506 ns

    SLUB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
    1 - 85 cycles 21.484 ns - 78 cycles 19.569 ns - 75 cycles 18.938 ns
    2 - 81 cycles 20.363 ns - 45 cycles 11.258 ns - 44 cycles 11.076 ns
    3 - 78 cycles 19.709 ns - 33 cycles 8.354 ns - 32 cycles 8.044 ns
    4 - 77 cycles 19.430 ns - 28 cycles 7.216 ns - 28 cycles 7.003 ns
    8 - 101 cycles 25.288 ns - 23 cycles 5.849 ns - 23 cycles 5.787 ns
    16 - 76 cycles 19.148 ns - 20 cycles 5.162 ns - 20 cycles 5.081 ns
    30 - 76 cycles 19.067 ns - 19 cycles 4.868 ns - 19 cycles 4.821 ns
    32 - 76 cycles 19.052 ns - 19 cycles 4.857 ns - 19 cycles 4.815 ns
    34 - 121 cycles 30.291 ns - 25 cycles 6.333 ns - 25 cycles 6.268 ns
    48 - 108 cycles 27.111 ns - 21 cycles 5.498 ns - 21 cycles 5.458 ns
    64 - 100 cycles 25.164 ns - 20 cycles 5.242 ns - 20 cycles 5.229 ns
    128 - 155 cycles 38.976 ns - 27 cycles 6.886 ns - 27 cycles 6.892 ns
    158 - 132 cycles 33.034 ns - 30 cycles 7.711 ns - 30 cycles 7.728 ns
    250 - 130 cycles 32.612 ns - 38 cycles 9.560 ns - 38 cycles 9.549 ns

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

19 Feb, 2016

1 commit

  • When slub_debug's alloc_calls_show is enabled we will try to track the
    location and user of each slab object on every online node, so the
    kmem_cache_node structures and cpu_cache/cpu_slab must not be freed
    until the last reference to the sysfs file is dropped.

    This fixes the following panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: list_locations+0x169/0x4e0
    PGD 257304067 PUD 438456067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
    Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
    task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
    RIP: list_locations+0x169/0x4e0
    Call Trace:
    alloc_calls_show+0x1d/0x30
    slab_attr_show+0x1b/0x30
    sysfs_read_file+0x9a/0x1a0
    vfs_read+0x9c/0x170
    SyS_read+0x58/0xb0
    system_call_fastpath+0x16/0x1b
    Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
    CR2: 0000000000000020

    Separate __kmem_cache_release from __kmem_cache_shutdown; the former is
    now called from slab_kmem_cache_release (after the last reference to
    the sysfs file object has been dropped).

    Reintroduce locking in free_partial, as the sysfs file might access the
    cache's partial list after shutdown - a partial revert of commit
    69cb8e6b7c29 ("slub: free slabs without holding locks"). Zap
    __remove_partial and use remove_partial (without underscores), as
    free_partial now takes list_lock; this is a partial revert of commit
    1e4dd9461fab ("slub: do not assert not having lock in removing freed
    partial").

    Signed-off-by: Dmitry Safonov
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     

21 Jan, 2016

2 commits

  • The cgroup2 memory controller will account important in-kernel memory
    consumers by default. Move all necessary components to CONFIG_MEMCG.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • On any given memcg, the kmem accounting feature has three separate
    states: not initialized, structures allocated, and actively accounting
    slab memory. These are represented through a combination of the
    kmem_acct_activated and kmem_acct_active flags, which is confusing.

    Convert to a kmem_state enum with the states NONE, ALLOCATED, and
    ONLINE. Then rename the functions to modify the state accordingly.
    This follows the nomenclature of css object states more closely.
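
    A sketch of the resulting state machine (the identifiers follow the
    description above; treat the exact spellings as illustrative):

    /* Three-state kmem accounting lifecycle, replacing the pair of
     * kmem_acct_activated/kmem_acct_active flags described above. */
    enum memcg_kmem_state {
            KMEM_NONE,        /* not initialized */
            KMEM_ALLOCATED,   /* per-memcg structures allocated */
            KMEM_ONLINE,      /* actively accounting slab memory */
    };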

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jan, 2016

1 commit

  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces the SLAB_ACCOUNT flag which, if
    passed to kmem_cache_create, forces accounting for every allocation from
    this cache even if __GFP_ACCOUNT is not passed.

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note that a cache with SLAB_ACCOUNT cannot be merged with a cache
    without SLAB_ACCOUNT, because merged caches share the same kmem_cache
    struct and hence cannot have different sets of SLAB_* flags. Using this
    flag will therefore probably reduce the number of merged slabs even if
    kmem accounting is not used (only compiled in).
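
    For example (a minimal sketch; the cache name and object type below are
    made up), opting a whole cache into kmem accounting looks like:

    #include <linux/errno.h>
    #include <linux/slab.h>

    struct my_record {                      /* hypothetical object type */
            unsigned long id;
            char name[32];
    };

    static struct kmem_cache *my_record_cache;

    static int my_record_cache_init(void)
    {
            /* SLAB_ACCOUNT: every allocation from this cache is charged to
             * the allocating task's memcg, without passing __GFP_ACCOUNT
             * at each call site. */
            my_record_cache = kmem_cache_create("my_record",
                                                sizeof(struct my_record),
                                                0, SLAB_ACCOUNT, NULL);
            return my_record_cache ? 0 : -ENOMEM;
    }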

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Nov, 2015

1 commit

  • Adjust kmem_cache_alloc_bulk API before we have any real users.

    Adjust the API to return type 'int' instead of the previous 'bool'. This
    is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary, when
    specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to run
    bulk alloc without local IRQs disabled, using a cmpxchg to "steal" the
    entire c->freelist or page->freelist. To avoid overshooting, we would
    stop processing at a slab-page boundary; otherwise we always end up
    returning some objects at the cost of another cmpxchg.

    To stay compatible with future users of this API who link against an
    older kernel while using the new flag, we need to return the number of
    allocated objects as part of this API change.
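
    A hedged sketch of a caller under the adjusted convention (the helper
    name is made up; the return value is the number of objects actually
    allocated rather than a bool):

    #include <linux/errno.h>
    #include <linux/slab.h>

    static int fill_batch(struct kmem_cache *cachep, void **objs, size_t want)
    {
            int got = kmem_cache_alloc_bulk(cachep, GFP_KERNEL, want, objs);

            if (!got)
                    return -ENOMEM; /* current implementations are all-or-nothing */

            /* With a future "stop at slab-page boundary" flag, got < want
             * would be a legitimate partial result, as described above. */
            return got;
    }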

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

06 Nov, 2015

3 commits

  • The assignment to NULL within the error condition was written in a 2014
    patch to suppress a compiler warning. However, it is cleaner to simply
    initialize the kmem_cache pointer to NULL and return it in case of an
    error.
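
    The pattern in schematic form (a self-contained illustration, not the
    actual slab_common.c code; the function is hypothetical):

    #include <linux/slab.h>

    static struct kmem_cache *create_named_cache(const char *name, size_t size)
    {
            struct kmem_cache *s = NULL;    /* initialized up front */

            if (!name || !size)
                    return s;               /* error path: s is already NULL */

            s = kmem_cache_create(name, size, 0, 0, NULL);
            return s;                       /* NULL on failure, cache otherwise */
    }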

    Signed-off-by: Alexandru Moise
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandru Moise
     
  • Currently, when kmem_cache_destroy() is called for a global cache, we
    print a warning for each per memcg cache attached to it that has active
    objects (see shutdown_cache). This is redundant, because it gives no new
    information and only clutters the log. If a cache being destroyed has
    active objects, there must be a memory leak in the module that created the
    cache, and it does not matter if the cache was used by users in memory
    cgroups or not.

    This patch moves the warning from shutdown_cache(), which is called for
    shutting down both global and per memcg caches, to kmem_cache_destroy(),
    so that the warning is only printed once if there are objects left in the
    cache being destroyed.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we do not clear pointers to per memcg caches in the
    memcg_params.memcg_caches array when a global cache is destroyed with
    kmem_cache_destroy.

    This is fine if the global cache does get destroyed. However, a cache can
    be left on the list if it still has active objects when kmem_cache_destroy
    is called (due to a memory leak). If this happens, the entries in the
    array will point to already freed areas, which is likely to result in data
    corruption when the cache is reused (via slab merging).
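
    A generic illustration of the hazard (the registry type and helper are
    made up, not the memcg_params internals): if an array keeps raw pointers
    to caches, the slot has to be cleared when the cache goes away, or later
    lookups follow freed memory.

    #include <linux/slab.h>

    struct cache_registry {
            struct kmem_cache *slots[64];
    };

    static void registry_drop(struct cache_registry *reg, int idx)
    {
            kmem_cache_destroy(reg->slots[idx]);
            reg->slots[idx] = NULL;   /* prevent a dangling pointer */
    }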

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov