28 Oct, 2016

1 commit

  • On large systems, when some slab caches grow to millions of objects (and
    many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
    seconds. During this time, interrupts are disabled while walking the
    slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
    and this sometimes causes timeouts in other drivers (for instance,
    Infiniband).

    This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for the
    total number of allocated slabs per node, per cache. This counter is
    updated when a slab is created or destroyed. This enables us to skip
    traversing the slabs_full list while gathering slabinfo statistics, and
    since slabs_full tends to be the biggest list when the cache is large,
    it results in a dramatic performance improvement. Getting slabinfo
    statistics now only requires walking the slabs_free and slabs_partial
    lists, and those lists are usually much smaller than slabs_full.

    We tested this after growing the dentry cache to 70GB, and the
    performance improved from 2s to 5ms.
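
    A minimal sketch of the bookkeeping this describes, with illustrative
    field and helper names rather than the exact ones from the patch:

        struct kmem_cache_node {
                /* existing lists: slabs_full, slabs_partial, slabs_free */
                unsigned long total_slabs;      /* new: all slabs on this node */
        };

        /* Updated under n->list_lock when a slab page is created/destroyed. */
        n->total_slabs++;       /* slab-creation path */
        n->total_slabs--;       /* slab-destruction path */

        /*
         * Gathering slabinfo statistics then walks only slabs_partial and
         * slabs_free and derives the number of full slabs instead:
         */
        nr_slabs_full = n->total_slabs - nr_slabs_partial - nr_slabs_free;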

    Link: http://lkml.kernel.org/r/1472517876-26814-1-git-send-email-aruna.ramakrishna@oracle.com
    Signed-off-by: Aruna Ramakrishna
    Acked-by: David Rientjes
    Cc: Mike Kravetz
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aruna Ramakrishna
     

29 Jul, 2016

1 commit

  • For KASAN builds:
    - switch the SLUB allocator to using stackdepot instead of storing the
      allocation/deallocation stacks in the objects (sketched below);
    - change the freelist hook so that parts of the freelist can be put
      into the quarantine.
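
    A rough sketch of the first change: the per-object alloc/free metadata can
    shrink from an inline stack trace to a small handle into the shared,
    deduplicated stack depot (struct layout shown from memory, illustrative):

        #include <linux/stackdepot.h>

        /* Compact per-object tracking: a handle instead of a full trace. */
        struct kasan_track {
                u32 pid;
                depot_stack_handle_t stack;     /* index into the stack depot */
        };

        /* On alloc/free, the current stack is saved once into the depot:
         *      track->stack = depot_save_stack(&trace, flags);
         * and identical stacks from different objects share one entry. */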

    [aryabinin@virtuozzo.com: fixes]
    Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt (Red Hat)
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Andrey Ryabinin
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

27 Jul, 2016

2 commits

  • - Move the memcg_kmem_enabled check out to the caller. This reduces the
      number of function definitions, making the code easier to follow. At
      the same time it doesn't result in code bloat, because all of these
      functions are used in only one or two places.

    - Move the __GFP_ACCOUNT check to the caller as well, so that one doesn't
      have to dive deep into the memcg implementation to see which allocations
      are charged and which are not (see the call-site sketch below).

    - Refresh comments.
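
    A hedged sketch of the resulting call-site pattern (charge_new_page() is a
    made-up wrapper used here for illustration, not a function from the patch):

        /* The caller, not the memcg helper, decides whether to account. */
        static struct page *charge_new_page(struct page *page, gfp_t gfp, int order)
        {
                if (memcg_kmem_enabled() && (gfp & __GFP_ACCOUNT) &&
                    memcg_kmem_charge(page, gfp, order)) {
                        __free_pages(page, order);
                        return NULL;
                }
                return page;
        }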

    Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The kernel heap allocators use a sequential freelist, which makes their
    allocations predictable. This predictability makes kernel heap overflows
    easier to exploit: an attacker can carefully prepare the kernel heap so
    that the chunk following the one being overflowed is under their control.

    For example, these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

    ***Problems that needed solving:
    - Randomize the freelist (singly linked) used in the SLUB allocator.
    - Ensure good performance to encourage usage.
    - Get the best possible entropy in the early boot stage.

    ***Parts:
    - 01/02 Reorganize the SLAB freelist randomization to share elements
      with the SLUB implementation.
    - 02/02 The SLUB freelist randomization implementation. A similar approach
      to the SLAB one, but tailored to the singly linked freelist used in SLUB.

    ***Performance data:

    The slab_test impact is between 3% and 4% on average for 100000 attempts
    without SMP. slab_test is a very focused microbenchmark; kernbench shows
    that the overall impact on the system is much lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    This patch (of 2):

    This commit reorganizes the previous SLAB freelist randomization to
    prepare for the SLUB implementation. It moves functions that will be
    shared to slab_common.

    The entropy functions are changed to align with the SLUB implementation,
    now using the get_random_(int|long) functions. These were chosen because
    they provide a bit more entropy early in boot and better performance when
    specific arch instructions are not available.
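
    A minimal sketch of the shared shuffling step these helpers feed into,
    assuming a precomputed array of object indices and a Fisher-Yates pass
    (the helper name is illustrative):

        /* Shuffle the precomputed index list used to lay out a new slab. */
        static void shuffle_freelist_indices(unsigned int *list,
                                             unsigned int count)
        {
                struct rnd_state state;
                unsigned int i, pos, tmp;

                /* Seeded per cache; get_random_long() works early in boot. */
                prandom_seed_state(&state, get_random_long());

                for (i = count - 1; i > 0; i--) {
                        pos = prandom_u32_state(&state) % (i + 1);
                        tmp = list[i];
                        list[i] = list[pos];
                        list[pos] = tmp;
                }
        }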

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     

21 May, 2016

1 commit

  • Quarantine isolates freed objects in a separate queue. The objects are
    returned to the allocator later, which helps to detect use-after-free
    errors.

    When the object is freed, its state changes from KASAN_STATE_ALLOC to
    KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
    instead of being returned to the allocator, therefore every subsequent
    access to that object triggers a KASAN error, and the error handler is
    able to say where the object has been allocated and deallocated.

    When it's time for the object to leave quarantine, its state becomes
    KASAN_STATE_FREE and it's returned to the allocator. From now on the
    allocator may reuse it for another allocation. Before that happens,
    it's still possible to detect a use-after free on that object (it
    retains the allocation/deallocation stacks).

    When the allocator reuses this object, the shadow is unpoisoned and the
    old allocation/deallocation stacks are wiped. Therefore a use of this
    object, even an incorrect one, won't trigger an ASan warning.

    Without the quarantine, there is no guarantee that objects aren't reused
    immediately, which is why the probability of catching a use-after-free is
    lower than with the quarantine in place.

    Freed objects are first added to per-cpu quarantine queues. When a
    cache is destroyed or memory shrinking is requested, the objects are
    moved into the global quarantine queue. Whenever a kmalloc call allows
    memory reclaiming, the oldest objects are popped out of the global queue
    until the total size of objects in quarantine is less than 3/4 of the
    maximum quarantine size (which is a fraction of installed physical
    memory).

    As long as an object remains in the quarantine, KASAN is able to report
    accesses to it, so the chance of reporting a use-after-free is
    increased. Once the object leaves quarantine, the allocator may reuse
    it, in which case the object is unpoisoned and KASAN can't detect
    incorrect accesses to it.

    Right now quarantine support is only enabled in the SLAB allocator.
    Unification of KASAN features in SLAB and SLUB will be done later.
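
    A conceptual sketch of the queue structure described above (names are
    approximate; locking and per-object linkage details are omitted):

        struct qlist_node { struct qlist_node *next; };

        struct qlist_head {
                struct qlist_node *head, *tail;
                size_t bytes;                   /* total size of queued objects */
        };

        /* Freed objects are poisoned and parked here first, per CPU. */
        static DEFINE_PER_CPU(struct qlist_head, cpu_quarantine);

        /* Cache shrink/destroy splices the per-cpu queues into this one. */
        static struct qlist_head global_quarantine;

        /* A reclaim-capable kmalloc() pops the oldest entries and returns them
         * to the allocator until global_quarantine.bytes drops below 3/4 of
         * the quarantine limit (a fraction of installed physical memory). */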

    This patch is based on the "mm: kasan: quarantine" patch originally
    prepared by Dmitry Chernenkov. A number of improvements have been
    suggested by Andrey Ryabinin.

    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

26 Mar, 2016

1 commit

  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

18 Mar, 2016

1 commit


16 Mar, 2016

4 commits

  • SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
    on or off. Expand its use to be able to turn off all consistency
    checks. This gives a nice speedup if you only want features such as
    poisoning or tracing.

    Credit to Mathias Krause for the original work which inspired this
    series.

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • Fix up trivial spelling errors, noticed while reading the code.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Remove the SLAB-specific function slab_should_failslab() by moving the
    fault-injection check for the bootstrap slab into the shared function
    should_failslab() (used by both SLAB and SLUB).

    This is a step towards sharing the alloc hooks between SLUB and SLAB.

    The bootstrap slab "kmem_cache" is used for allocating struct
    kmem_cache objects for the allocator itself.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • First step towards sharing the alloc hooks between the SLUB and SLAB
    allocators. Move the SLUB allocator's *_alloc_hook functions to the
    common mm/slab.h header for internal slab definitions.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

19 Feb, 2016

1 commit

  • When slub_debug's alloc_calls_show is enabled we try to track the
    location and user of each slab object on every online node, so the
    kmem_cache_node structures and cpu_cache/cpu_slub must not be freed until
    the last reference to the sysfs file has been dropped.

    This fixes the following panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: list_locations+0x169/0x4e0
    PGD 257304067 PUD 438456067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
    Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
    task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
    RIP: list_locations+0x169/0x4e0
    Call Trace:
    alloc_calls_show+0x1d/0x30
    slab_attr_show+0x1b/0x30
    sysfs_read_file+0x9a/0x1a0
    vfs_read+0x9c/0x170
    SyS_read+0x58/0xb0
    system_call_fastpath+0x16/0x1b
    Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
    CR2: 0000000000000020

    Separate __kmem_cache_release from __kmem_cache_shutdown; the former is
    now called from slab_kmem_cache_release (after the last reference to the
    sysfs file object has been dropped).

    Reintroduce locking in free_partial, as the sysfs file might access the
    cache's partial list after shutdown - a partial revert of commit
    69cb8e6b7c29 ("slub: free slabs without holding locks"). Zap
    __remove_partial and use remove_partial (w/o underscores), as
    free_partial now takes list_lock; this is a partial revert of commit
    1e4dd9461fab ("slub: do not assert not having lock in removing freed
    partial").

    Signed-off-by: Dmitry Safonov
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     

21 Jan, 2016

1 commit


15 Jan, 2016

1 commit

  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces the SLAB_ACCOUNT flag which, if
    passed to kmem_cache_create, will force accounting for every allocation
    from this cache even if __GFP_ACCOUNT is not passed.

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).
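
    For example, a cache owner can opt in once at creation time instead of
    passing __GFP_ACCOUNT at every call site (cache and struct names below
    are made up for illustration):

        struct demo_obj { int payload; };               /* made-up object */
        static struct kmem_cache *demo_cachep;

        /* Every allocation from this cache is charged to the caller's memcg. */
        demo_cachep = kmem_cache_create("demo_cache", sizeof(struct demo_obj),
                                        0, SLAB_ACCOUNT, NULL);

        /* ...so plain GFP_KERNEL is enough at the call sites: */
        struct demo_obj *obj = kmem_cache_alloc(demo_cachep, GFP_KERNEL);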

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Nov, 2015

1 commit

  • Adjust the kmem_cache_alloc_bulk API before we have any real users.

    Adjust the API to return type 'int' instead of the previous 'bool'. This
    is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary, when
    specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to have
    bulk alloc run without local IRQs disabled, using a cmpxchg to "steal" the
    entire c->freelist or page->freelist. To avoid overshooting we would stop
    processing at a slab-page boundary; otherwise we always end up returning
    some objects at the cost of another cmpxchg.

    To stay compatible with future users of this API that link against an
    older kernel while using the new flag, we need to return the number of
    allocated objects with this API change.
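
    With the int return, callers can check how many objects were actually
    provided; a sketch (error handling simplified, cachep assumed to exist):

        void *objs[16];
        int got;

        got = kmem_cache_alloc_bulk(cachep, GFP_KERNEL, ARRAY_SIZE(objs), objs);
        if (got < ARRAY_SIZE(objs)) {
                /* Today this means failure (0); with a future "stop at a slab
                 * boundary" flag it could be a legitimate partial result. */
                kmem_cache_free_bulk(cachep, got, objs);
                return -ENOMEM;
        }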

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

06 Nov, 2015

2 commits

  • We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
    uncharging kmem pages to memcg, but currently they are not used for
    charging slab pages (i.e. they are only used for charging pages allocated
    with alloc_kmem_pages). The only reason why the slab subsystem uses
    special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
    needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
    to the memcg that the current task belongs to.

    To remove this diversity, this patch adds an extra argument to
    __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
    not NULL, the function tries to charge to the memcg it points to,
    otherwise it charges to the current context. Next, it makes the slab
    subsystem use this function to charge slab pages.

    Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
    in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
    __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
    don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
    Besides, one can now detect which memcg a slab page belongs to by reading
    /proc/kpagecgroup.

    Note, this patch switches slab to a charge-after-alloc design. Since this
    design is already used for all other memcg charges, it should not make any
    difference.

    [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we do not clear pointers to per memcg caches in the
    memcg_params.memcg_caches array when a global cache is destroyed with
    kmem_cache_destroy.

    This is fine if the global cache does get destroyed. However, a cache can
    be left on the list if it still has active objects when kmem_cache_destroy
    is called (due to a memory leak). If this happens, the entries in the
    array will point to already freed areas, which is likely to result in data
    corruption when the cache is reused (via slab merging).

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

05 Sep, 2015

2 commits

  • While debugging a networking issue, I hit a condition that triggered an
    object to be freed into the wrong kmem cache, and thus triggered the
    warning in cache_from_obj().

    The arguments in the error message are in the wrong order: the location
    of the object's kmem cache is in cachep, not s.

    Signed-off-by: Daniel Borkmann
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Borkmann
     
  • Add the basic infrastructure for alloc/free operations on pointer arrays.
    It includes a generic function in the common slab code that is used in
    this infrastructure patch to create the unoptimized functionality for slab
    bulk operations.

    Allocators can then provide optimized allocation functions for situations
    in which large numbers of objects are needed. These optimizations may
    avoid taking locks repeatedly and bypass metadata creation if all objects
    in slab pages can be used to provide the objects required.

    Allocators can extend the skeletons provided and add their own code to the
    bulk alloc and free functions. They can keep the generic allocation and
    freeing and just fall back to those if optimizations would not work (like
    for example when debugging is on).
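
    The generic, unoptimized fallback is essentially a loop over the existing
    single-object entry points; a sketch under that assumption (the function
    name is illustrative, and the bool/int return convention changed slightly
    in later releases):

        static int generic_kmem_cache_alloc_bulk(struct kmem_cache *s,
                                                 gfp_t flags, size_t nr, void **p)
        {
                size_t i;

                for (i = 0; i < nr; i++) {
                        p[i] = kmem_cache_alloc(s, flags);
                        if (!p[i])
                                goto error;
                }
                return i;       /* all-or-nothing: nr objects provided */

        error:
                /* Undo the partial allocation so the caller sees a clean failure. */
                while (i--)
                        kmem_cache_free(s, p[i]);
                return 0;
        }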

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

25 Jun, 2015

1 commit

  • This patch moves the initialization of the size_index table slightly
    earlier so that the first few kmem_cache_node's can be safely allocated
    when KMALLOC_MIN_SIZE is large.

    There are currently two ways to generate indices into kmalloc_caches (via
    kmalloc_index() and via the size_index table in slab_common.c) and on some
    arches (possibly only MIPS) they potentially disagree with each other
    until create_kmalloc_caches() has been called. It seems that the
    intention is that the size_index table is a fast equivalent to
    kmalloc_index() and that create_kmalloc_caches() patches the table to
    return the correct value for the cases where kmalloc_index()'s
    if-statements apply.

    The failing sequence was:
    * kmalloc_caches contains NULL elements
    * kmem_cache_init initialises the element that 'struct
    kmem_cache_node' will be allocated to. For 32-bit MIPS, this is a
    56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
    * init_list is called which calls kmalloc_node to allocate a 'struct
    kmem_cache_node'.
    * kmalloc_slab selects the kmem_caches element using
    size_index[size_index_elem(size)]. For MIPS, size is 56, and the
    expression returns 6.
    * This element of kmalloc_caches is NULL and allocation fails.
    * If it had not already failed, it would have called
    create_kmalloc_caches() at this point which would have changed
    size_index[size_index_elem(size)] to 7.

    I don't believe the bug to be LLVM specific but GCC doesn't normally
    encounter the problem. I haven't been able to identify exactly what GCC
    is doing better (probably inlining) but it seems that GCC is managing to
    optimize to the point that it eliminates the problematic allocations.
    This theory is supported by the fact that GCC can be made to fail in the
    same way by changing inline, __inline, __inline__, and __always_inline in
    include/linux/compiler-gcc.h such that they don't actually inline things.

    Signed-off-by: Daniel Sanders
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Sanders
     

13 Feb, 2015

3 commits

  • To speed up further allocations SLUB may store empty slabs in per cpu/node
    partial lists instead of freeing them immediately. This prevents per
    memcg caches destruction, because kmem caches created for a memory cgroup
    are only destroyed after the last page charged to the cgroup is freed.

    To fix this issue, this patch resurrects approach first proposed in [1].
    It forbids SLUB from caching empty slabs after the memory cgroup that the
    cache belongs to has been destroyed. This is achieved by setting the
    kmem_cache's cpu_partial and min_partial constants to 0 and tuning
    put_cpu_partial() so that it drops frozen empty slabs immediately when
    cpu_partial = 0.

    The runtime overhead is minimal. From all the hot functions, we only
    touch relatively cold put_cpu_partial(): we make it call
    unfreeze_partials() after freezing a slab that belongs to an offline
    memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
    partial list on free/alloc, and there can't be allocations from dead
    caches, it shouldn't cause any overhead. We do have to disable preemption
    for put_cpu_partial() to achieve that though.

    The original patch was accepted well and even merged to the mm tree.
    However, I decided to withdraw it due to changes happening to the memcg
    core at that time. I had an idea of introducing per-memcg shrinkers for
    kmem caches, but now, as memcg has finally settled down, I do not see it
    as an option, because SLUB shrinker would be too costly to call since SLUB
    does not keep free slabs on a separate list. Besides, we currently do not
    even call per-memcg shrinkers for offline memcgs. Overall, it would
    introduce much more complexity to both SLUB and memcg than this small
    patch.

    As for SLAB, there is no problem with it, because it shrinks its
    per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
    longer keep entries for offline cgroups in per-memcg arrays (such as
    memcg_cache_params->memcg_caches), so we do not have to bother if a
    per-memcg cache will be shrunk a bit later than it could be.

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Sometimes, we need to iterate over all memcg copies of a particular root
    kmem cache. Currently, we use memcg_cache_params->memcg_caches array for
    that, because it contains all existing memcg caches.

    However, it's a bad practice to keep all caches, including those that
    belong to offline cgroups, in this array, because it will be growing
    beyond any bounds then. I'm going to wipe away dead caches from it to
    save space. To still be able to perform iterations over all memcg caches
    of the same kind, let us link them into a list.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, kmem_cache stores a pointer to struct memcg_cache_params
    instead of embedding it. The rationale is to save memory when kmem
    accounting is disabled. However, the memcg_cache_params has shrivelled
    drastically since it was first introduced:

    * Initially:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct kmem_cache *memcg_caches[0];
                    struct {
                            struct mem_cgroup *memcg;
                            struct list_head list;
                            struct kmem_cache *root_cache;
                            bool dead;
                            atomic_t nr_pages;
                            struct work_struct destroy;
                    };
            };
    };

    * Now:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct {
                            struct rcu_head rcu_head;
                            struct kmem_cache *memcg_caches[0];
                    };
                    struct {
                            struct mem_cgroup *memcg;
                            struct kmem_cache *root_cache;
                    };
            };
    };

    So the memory saving does not seem to be a clear win anymore.

    OTOH, keeping a pointer to the memcg_cache_params struct instead of
    embedding it results in touching one more cache line on kmem alloc/free
    hot paths. Besides, it makes linking kmem caches into a list chained by a
    field of struct memcg_cache_params really painful due to the extra level
    of indirection, while I want to make them linked in the following patch.
    So let us embed it.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

11 Feb, 2015

1 commit


11 Dec, 2014

3 commits

  • Let's use generic slab_start/next/stop for showing memcg caches info. In
    contrast to the current implementation, this will work even if all memcg
    caches' info doesn't fit into a seq buffer (a page), plus it simply looks
    neater.

    Actually, the main reason I do this isn't mere cleanup. I'm going to zap
    the memcg_slab_caches list, because I find it useless provided we have the
    slab_caches list, and this patch is a step in this direction.

    It should be noted that before this patch an attempt to read
    memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted
    in -EIO, while after this patch it will silently show nothing except the
    header, but I don't think it will frustrate anyone.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Recently lockless_dereference() was added, which can be used in place of
    hard-coding smp_read_barrier_depends(). This patch makes that change.
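
    Roughly, the pattern being replaced (variables shown for illustration):

        /* Before: load the published pointer, then an explicit
         * data-dependency barrier before dereferencing it. */
        cachep = params->memcg_caches[idx];
        smp_read_barrier_depends();

        /* After: the load and the barrier are bundled in one primitive. */
        cachep = lockless_dereference(params->memcg_caches[idx]);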

    Signed-off-by: Pranith Kumar
    Cc: "Paul E. McKenney"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pranith Kumar
     
  • Currently we print the slabinfo header in the seq start method, which
    makes it unusable for showing leaks, so we have leaks_show, which does
    practically the same as s_show except it doesn't show the header.

    However, we can print the header in the seq show method - we only need
    to check if the current element is the first on the list. This will
    allow us to use the same set of seq iterators for both leaks and
    slabinfo reporting, which is nice.
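
    The check can be as simple as comparing the current element with the head
    of the cache list; a sketch of the resulting show method (assuming the
    usual slab_caches list and the existing header/show helpers):

        static int slab_show(struct seq_file *m, void *p)
        {
                struct kmem_cache *s = list_entry(p, struct kmem_cache, list);

                /* Print the header only for the first cache on slab_caches. */
                if (p == slab_caches.next)
                        print_slabinfo_header(m);
                cache_show(s, m);
                return 0;
        }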

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

10 Oct, 2014

5 commits

  • Because of a chicken-and-egg problem, initialization of SLAB is really
    complicated. We need to allocate a cpu cache through SLAB to make the
    kmem_cache work, but before initialization of kmem_cache, allocation
    through SLAB is impossible.

    On the other hand, SLUB does initialization in a simpler way. It uses
    the percpu allocator to allocate its cpu cache, so there is no chicken and
    egg problem.

    So, this patch tries to use the percpu allocator in SLAB. This simplifies
    the initialization step in SLAB so that we can maintain the SLAB code more
    easily.

    In my testing there is no performance difference.

    This implementation relies on the percpu allocator. Because the percpu
    allocator uses vmalloc address space, vmalloc address space could be
    exhausted by this change on a many-CPU system with a *32-bit* kernel. This
    implementation can cover 1024 cpus in the worst case, per the following
    calculation.

    Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches *
    120 objects per cpu_cache = 140 MB
    Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) *
    80 objects per cpu_cache = 46 MB
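
    A sketch of the cpu cache allocation under this scheme (simplified from
    the shape of the real helper; treat names and details as approximate):

        static struct array_cache __percpu *alloc_kmem_cache_cpus(
                        struct kmem_cache *cachep, int entries, int batchcount)
        {
                int cpu;
                size_t size = sizeof(void *) * entries + sizeof(struct array_cache);
                struct array_cache __percpu *cpu_cache;

                /* The percpu allocator, not SLAB itself, backs the cpu caches,
                 * so there is no bootstrap ordering problem. */
                cpu_cache = __alloc_percpu(size, sizeof(void *));
                if (!cpu_cache)
                        return NULL;

                for_each_possible_cpu(cpu)
                        init_arraycache(per_cpu_ptr(cpu_cache, cpu),
                                        entries, batchcount);

                return cpu_cache;
        }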

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Jeremiah Mahler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Slab merging is a good feature for reducing fragmentation. If a newly
    created slab cache has a similar size and properties to an existing one,
    this feature reuses the existing cache rather than creating a new one. As
    a result, objects are packed into fewer slabs, so fragmentation is
    reduced.

    Below is the result of my testing.

    * After boot, sleep 20; cat /proc/meminfo | grep Slab

    Slab: 25136 kB

    Slab: 24364 kB

    We can save 3% of the memory used by slab.

    For supporting this feature in SLAB, we need to implement the
    SLAB-specific kmem_cache_flag() and __kmem_cache_alias(), because SLUB
    implements some SLUB-specific processing related to the debug flag and
    object size change in these functions.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Slab merging is a good feature for reducing fragmentation. Currently it is
    only applied to SLUB, but it would be good to apply it to SLAB as well.
    This patch is a preparation step for applying slab merging to SLAB by
    making the slab merge logic common.

    Signed-off-by: Joonsoo Kim
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Fix a bug (discovered with kmemcheck) in for_each_kmem_cache_node(). The
    for loop reads the "node" array before verifying that the index is within
    range. This results in a kmemcheck warning.
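
    The fix is to test the node index before loading the array element;
    roughly, as a before/after sketch:

        /* Before (buggy): get_node() reads s->node[node] before the bound
         * check:
         *
         *   for (node = 0; n = get_node(s, node), node < nr_node_ids; node++)
         *           if (n) { ... }
         */

        /* After: check the index first, then load the entry. */
        #define for_each_kmem_cache_node(__s, __node, __n)              \
                for (__node = 0; __node < nr_node_ids; __node++)        \
                        if ((__n = get_node(__s, __node)))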

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • We don't need to keep kmem_cache definition in include/linux/slab.h if we
    don't need to inline kmem_cache_size(). According to my code inspection,
    this function is only called at lc_create() in lib/lru_cache.c which may
    be called at initialization phase of something, so we don't need to inline
    it. Therefore, move it to slab_common.c and move the kmem_cache definition
    to an internal header.

    After this change, we can change kmem_cache definition easily without full
    kernel build. For instance, we can turn on/off CONFIG_SLUB_STATS without
    full kernel build.

    [akpm@linux-foundation.org: export kmem_cache_size() to modules]
    [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors]
    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Aug, 2014

4 commits

  • Just about all of these have been converted to __func__, so convert the
    last use.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Currently, we use array_cache for the alien cache. Although the two are
    mostly similar, there is one difference: the need for a spinlock. We
    don't need a spinlock for array_cache itself, but to use array_cache for
    the alien cache, the array_cache structure has to carry a spinlock. This
    is needless overhead, so removing it would be better. This patch prepares
    for that by introducing a separate alien_cache structure and using it. In
    the following patch, we remove the spinlock from array_cache.
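
    The new wrapper simply pairs an array_cache with the lock that only the
    alien path needs (layout shown from memory, as a sketch):

        struct alien_cache {
                spinlock_t lock;        /* taken only on the alien (remote-node) path */
                struct array_cache ac;  /* the plain array_cache can stay lock-free */
        };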

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Guarding section:
    #ifndef MM_SLAB_H
    #define MM_SLAB_H
    ...
    #endif
    currently doesn't cover the whole mm/slab.h. It seems like it was
    done unintentionally.

    Wrap the whole file by moving closing #endif to the end of it.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Reviewed-by: Vladimir Davydov
    Cc: Pekka Enberg
    Cc: Joonsoo Kim

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The patchset provides two new functions in mm/slab.h and modifies SLAB
    and SLUB to use these. The kmem_cache_node structure is shared between
    both allocators and the use of common accessors will allow us to move
    more code into slab_common.c in the future.

    This patch (of 3):

    These functions allow us to eliminate code repeated in both SLAB and
    SLUB and also allow for the insertion of debugging code that may be
    needed during development.
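
    The per-node accessor is a small inline helper in mm/slab.h, roughly:

        /* Return the per-node state of a cache for the given NUMA node. */
        static inline struct kmem_cache_node *get_node(struct kmem_cache *s,
                                                       int node)
        {
                return s->node[node];
        }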

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Acked-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

05 Jun, 2014

4 commits

  • Currently we have two pairs of kmemcg-related functions that are called on
    slab alloc/free. The first is memcg_{bind,release}_pages that count the
    total number of pages allocated on a kmem cache. The second is
    memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
    counter. Let's just merge them to keep the code clean.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset is a part of preparations for kmemcg re-parenting. It
    targets at simplifying kmemcg work-flows and synchronization.

    First, it removes async per memcg cache destruction (see patches 1, 2).
    Now caches are only destroyed on memcg offline. That means the caches
    that are not empty on memcg offline will be leaked. However, they are
    already leaked, because memcg_cache_params::nr_pages normally never drops
    to 0, so the destruction work is never scheduled unless kmem_cache_shrink
    is called explicitly. In the future I'm planning on reaping such dead
    caches on vmpressure or periodically.

    Second, it substitutes per memcg slab_caches_mutex's with the global
    memcg_slab_mutex, which should be taken during the whole per memcg cache
    creation/destruction path before the slab_mutex (see patch 3). This
    greatly simplifies synchronization among various per memcg cache
    creation/destruction paths.

    I'm still not quite sure about the end picture, in particular I don't know
    whether we should reap dead memcgs' kmem caches periodically or try to
    merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for
    more details), but whichever way we choose, this set looks like a
    reasonable change to me, because it greatly simplifies kmemcg work-flows
    and eases further development.

    This patch (of 3):

    After a memcg is offlined, we mark its kmem caches that cannot be deleted
    right now due to pending objects as dead by setting the
    memcg_cache_params::dead flag, so that memcg_release_pages will schedule
    cache destruction (memcg_cache_params::destroy) as soon as the last slab
    of the cache is freed (memcg_cache_params::nr_pages drops to zero).

    I guess the idea was to destroy the caches as soon as possible, i.e.
    immediately after freeing the last object. However, it just doesn't work
    that way, because kmem caches always preserve some pages for the sake of
    performance, so that nr_pages never gets to zero unless the cache is
    shrunk explicitly using kmem_cache_shrink. Of course, we could account
    the total number of objects on the cache or check if all the slabs
    allocated for the cache are empty on kmem_cache_free and schedule
    destruction if so, but that would be too costly.

    Thus we have a piece of code that works only when we explicitly call
    kmem_cache_shrink, but complicates the whole picture a lot. Moreover,
    it's racy in fact. For instance, kmem_cache_shrink may free the last slab
    and thus schedule cache destruction before it finishes checking that the
    cache is empty, which can lead to use-after-free.

    So I propose to remove this async cache destruction from
    memcg_release_pages, and check if the cache is empty explicitly after
    calling kmem_cache_shrink instead. This will simplify things a lot w/o
    introducing any functional changes.

    And regarding dead memcg caches (i.e. those that are left hanging around
    after memcg offline for they have objects), I suppose we should reap them
    either periodically or on vmpressure as Glauber suggested initially. I'm
    going to implement this later.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When we create a sl[au]b cache, we allocate kmem_cache_node structures
    for each online NUMA node. To handle nodes taken online/offline, we
    register memory hotplug notifier and allocate/free kmem_cache_node
    corresponding to the node that changes its state for each kmem cache.

    To synchronize between the two paths we hold the slab_mutex during both
    the cache creation/destruction path and while tuning per-node parts of
    kmem caches in the memory hotplug handler, but that's not quite right,
    because it does not guarantee that a newly created cache will have all
    kmem_cache_nodes initialized in case it races with memory hotplug. For
    instance, in the case of slub:

    CPU0                                CPU1
    ----                                ----
    kmem_cache_create:                  online_pages:
     __kmem_cache_create:                slab_memory_callback:
                                          slab_mem_going_online_callback:
                                           lock slab_mutex
                                           for each slab_caches list entry
                                               allocate kmem_cache node
                                           unlock slab_mutex
      lock slab_mutex
      init_kmem_cache_nodes:
       for_each_node_state(node, N_NORMAL_MEMORY)
        allocate kmem_cache node
      add kmem_cache to slab_caches list
      unlock slab_mutex
                                        online_pages (continued):
                                         node_states_set_node

    As a result we'll get a kmem cache with not all kmem_cache_nodes
    allocated.

    To avoid issues like that we should hold get/put_online_mems() during
    the whole kmem cache creation/destruction/shrink paths, just like we
    deal with cpu hotplug. This patch does the trick.
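
    A sketch of the resulting locking pattern around the cache destruction
    path (simplified; the actual body is elided):

        void kmem_cache_destroy(struct kmem_cache *s)
        {
                get_online_cpus();
                get_online_mems();      /* block memory hotplug for the whole path */

                mutex_lock(&slab_mutex);
                /* ... per-memcg cleanup and __kmem_cache_shutdown() ... */
                mutex_unlock(&slab_mutex);

                put_online_mems();
                put_online_cpus();
        }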

    Note, that after it's applied, there is no need in taking the slab_mutex
    for kmem_cache_shrink any more, so it is removed from there.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem, so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to explicitly charge kmem there. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from memcg caches' allocflags. Instead it makes slab allocation path
    call memcg_charge_kmem directly getting memcg to charge from the cache's
    memcg params.

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment to memcg_kmem_get_cache.

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov