12 Nov, 2016

1 commit

  • While testing OBJFREELIST_SLAB integration with pagealloc, we found a
    bug where kmem_cache(sys) would be created with both CFLGS_OFF_SLAB &
    CFLGS_OBJFREELIST_SLAB set. When that happened, critical allocations
    needed for loading drivers or creating new caches would fail.

    The original kmem_cache is created early, which makes OFF_SLAB
    impossible. When kmem_cache(sys) is created, OFF_SLAB is possible, and
    if pagealloc is enabled the allocator will try to enable it first under
    certain conditions. Because kmem_cache(sys) reuses the original flags,
    both flags can end up set at the same time, resulting in allocation
    failures and odd behavior.

    This fix discards allocator-specific flags from memcg before calling
    create_cache.

    The bug has existed since 4.6-rc1 and affects debug pagealloc testing
    configurations.
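
    A minimal sketch of the flag-masking idea described above (the CFLGS_*
    names come from the message; the exact mask applied by the fix is not
    shown here and may differ):

        /* Drop allocator-internal placement flags inherited from the root
         * cache so create_cache() can recompute them for the memcg copy. */
        flags &= ~(CFLGS_OFF_SLAB | CFLGS_OBJFREELIST_SLAB);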

    Fixes: b03a017bebc4 ("mm/slab: introduce new slab management type, OBJFREELIST_SLAB")
    Link: http://lkml.kernel.org/r/1478553075-120242-1-git-send-email-thgarnie@google.com
    Signed-off-by: Greg Thelen
    Signed-off-by: Thomas Garnier
    Tested-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

27 Jul, 2016

2 commits

  • Currently, to charge a non-slab allocation to kmemcg one has to use
    alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
    this helper should finally be freed using free_kmem_pages, otherwise it
    won't be uncharged.

    This API suits its current users fine, but it turns out to be impossible
    to use along with page reference counting, i.e. when an allocation is
    supposed to be freed with put_page, as is the case with pipe or unix
    socket buffers.

    To overcome this limitation, this patch moves charging/uncharging to
    generic page allocator paths, i.e. to __alloc_pages_nodemask and
    free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
    one can use any of the available page allocation functions to get the
    allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
    just like in case of kmalloc and friends. A charged page will be
    automatically uncharged on free.

    To make it possible, we need to mark pages charged to kmemcg somehow.
    To avoid introducing a new page flag, we make use of page->_mapcount for
    marking such pages. Since pages charged to kmemcg are not supposed to
    be mapped to userspace, it should work just fine. There are other
    (ab)users of page->_mapcount - buddy and balloon pages - but we don't
    conflict with them.

    In case kmemcg is compiled out or not used at runtime, this patch
    introduces no overhead to generic page allocator paths. If kmemcg is
    used, it adds one gfp flags check on alloc and one page->_mapcount
    check on free, which shouldn't hurt performance because the data
    accessed are hot.
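
    A minimal usage sketch of the behavior described above (illustrative
    only; error handling trimmed):

        /* Any page allocator entry point works once charging lives in
         * __alloc_pages_nodemask(); __GFP_ACCOUNT requests the charge. */
        struct page *page = alloc_page(GFP_KERNEL | __GFP_ACCOUNT);

        /* ... use the page, possibly handing out references ... */

        /* The charge is dropped automatically in free_pages_prepare(),
         * so a plain put_page() is enough. */
        put_page(page);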

    Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The kernel heap allocators use a sequential freelist, making their
    allocations predictable. This predictability makes kernel heap
    overflows easier to exploit: an attacker can carefully prepare the
    kernel heap so as to control the chunk that follows the one being
    overflowed.

    For example these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

    ***Problems that needed solving:
    - Randomize the freelist (singly linked) used in the SLUB allocator.
    - Ensure good performance to encourage usage.
    - Get the best entropy in the early boot stage.

    ***Parts:
    - 01/02 Reorganize the SLAB freelist randomization to share elements
    with the SLUB implementation.
    - 02/02 The SLUB freelist randomization implementation. A similar
    approach to the SLAB one, but tailored to the singly linked freelist
    used in SLUB.

    ***Performance data:

    The slab_test impact is between 3% and 4% on average for 100000
    attempts without SMP. It is a very focused test; kernbench shows that
    the overall impact on the system is much lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    This patch (of 2):

    This commit reorganizes the previous SLAB freelist randomization to
    prepare for the SLUB implementation. It moves functions that will be
    shared to slab_common.

    The entropy functions are changed to align with the SLUB implementation,
    now using the get_random_(int|long) functions. These functions were
    chosen because they provide a bit more entropy early in boot and better
    performance when specific arch instructions are not available.
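
    A minimal sketch of the shared shuffling step this reorganization
    enables (hypothetical helper name; the in-tree code in slab_common.c
    differs in details such as how the random state is seeded):

        /* Fisher-Yates shuffle of a precomputed freelist index array,
         * using get_random_int() as the entropy source described above. */
        static void shuffle_freelist_indices(unsigned int *list,
                                             unsigned int count)
        {
                unsigned int i, swap, tmp;

                for (i = count - 1; i > 0; i--) {
                        swap = get_random_int() % (i + 1);
                        tmp = list[i];
                        list[i] = list[swap];
                        list[swap] = tmp;
                }
        }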

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     

23 Jul, 2016

1 commit

  • The memory controller has quite a bit of state that usually outlives the
    cgroup and pins its CSS until said state disappears. At the same time
    it imposes a 16-bit limit on the CSS ID space to economically store IDs
    in the wild. Consequently, when we use cgroups to contain frequent but
    small and short-lived jobs that leave behind some page cache, we quickly
    run into the 64k limitations of outstanding CSSs. Creating a new cgroup
    fails with -ENOSPC while there are only a few, or even no user-visible
    cgroups in existence.

    Although pinning CSSs past cgroup removal is common, there are only two
    instances that actually need an ID after a cgroup is deleted: cache
    shadow entries and swapout records.

    Cache shadow entries reference the ID weakly and can deal with the CSS
    having disappeared when it's looked up later. They pose no hurdle.

    Swap-out records do need to pin the css to hierarchically attribute
    swapins after the cgroup has been deleted; though the only pages that
    remain swapped out after offlining are tmpfs/shmem pages. And those
    references are under the user's control, so they are manageable.

    This patch introduces a private 16-bit memcg ID and switches swap and
    cache shadow entries over to using that. This ID can then be recycled
    after offlining when the CSS remains pinned only by objects that don't
    specifically need it.

    This script demonstrates the problem by faulting one cache page in a new
    cgroup and deleting it again:

    set -e
    mkdir -p pages
    for x in `seq 128000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /cgroup/foo
        echo $$ >/cgroup/foo/cgroup.procs
        echo trex >pages/$x
        echo $$ >/cgroup/cgroup.procs
        rmdir /cgroup/foo
    done

    When run on an unpatched kernel, we eventually run out of possible IDs
    even though there are no visible cgroups:

    [root@ham ~]# ./cssidstress.sh
    [...]
    65000
    mkdir: cannot create directory '/cgroup/foo': No space left on device

    After this patch, the IDs get released upon cgroup destruction and the
    cache and css objects get released once memory reclaim kicks in.

    [hannes@cmpxchg.org: init the IDR]
    Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
    Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
    Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: John Garcia
    Reviewed-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Cc: Nikolay Borisov
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 May, 2016

1 commit

  • Quarantine isolates freed objects in a separate queue. The objects are
    returned to the allocator later, which helps to detect use-after-free
    errors.

    When the object is freed, its state changes from KASAN_STATE_ALLOC to
    KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
    instead of being returned to the allocator, therefore every subsequent
    access to that object triggers a KASAN error, and the error handler is
    able to say where the object has been allocated and deallocated.

    When it's time for the object to leave quarantine, its state becomes
    KASAN_STATE_FREE and it's returned to the allocator. From now on the
    allocator may reuse it for another allocation. Before that happens,
    it's still possible to detect a use-after free on that object (it
    retains the allocation/deallocation stacks).

    When the allocator reuses this object, the shadow is unpoisoned and the
    old allocation/deallocation stacks are wiped. Therefore a use of this
    object, even an incorrect one, won't trigger a KASAN warning.

    Without the quarantine, it's not guaranteed that objects aren't reused
    immediately; that's why the probability of catching a use-after-free is
    lower than with the quarantine in place.

    Freed objects are first added to per-cpu quarantine queues. When a
    cache is destroyed or memory shrinking is requested, the objects are
    moved into the global quarantine queue. Whenever a kmalloc call allows
    memory reclaiming, the oldest objects are popped out of the global queue
    until the total size of objects in quarantine is less than 3/4 of the
    maximum quarantine size (which is a fraction of installed physical
    memory).
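
    A conceptual sketch of that drain policy (all names here are
    hypothetical; the actual code lives in mm/kasan/quarantine.c and is
    more involved):

        /* Pop the oldest quarantined objects until the quarantine is back
         * under 3/4 of its maximum size, returning them to the allocator. */
        while (quarantine_size > quarantine_max_size * 3 / 4) {
                struct qlist_object *obj = dequeue_oldest(&global_quarantine);

                free_to_allocator(obj);
        }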

    As long as an object remains in the quarantine, KASAN is able to report
    accesses to it, so the chance of reporting a use-after-free is
    increased. Once the object leaves quarantine, the allocator may reuse
    it, in which case the object is unpoisoned and KASAN can't detect
    incorrect accesses to it.

    Right now quarantine support is only enabled in SLAB allocator.
    Unification of KASAN features in SLAB and SLUB will be done later.

    This patch is based on the "mm: kasan: quarantine" patch originally
    prepared by Dmitry Chernenkov. A number of improvements have been
    suggested by Andrey Ryabinin.

    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

26 Mar, 2016

2 commits

  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Add KASAN hooks to SLAB allocator.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

18 Mar, 2016

3 commits

  • Most of the mm subsystem uses pr_<level>, so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • As kmem accounting is now either enabled for all cgroups or disabled
    system-wide, there's no point in having the memcg_kmem_online() helper -
    instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
    shrink_slab() now does.

    There are only two places left where this helper is used -
    __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can
    only be called if memcg_kmem_enabled() returned true, and since the
    cgroup it operates on is online, a mem_cgroup_is_root() check will be
    enough.

    memcg_create_kmem_cache() can't use the mem_cgroup_online() helper
    instead of memcg_kmem_online(), because it relies on the fact that in
    memcg_offline_kmem() memcg->kmem_state is changed before
    memcg_deactivate_kmem_caches() is called, but there we can just
    open-code the check.
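
    A minimal sketch of the replacement check described above (memcg is an
    illustrative pointer; this mirrors what shrink_slab() does):

        /* Instead of memcg_kmem_online(memcg): */
        if (memcg_kmem_enabled() && mem_cgroup_online(memcg)) {
                /* ... walk the memcg-aware shrinkers/caches ... */
        }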

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Mar, 2016

1 commit

  • This patch introduces a new API call, kfree_bulk(), for bulk freeing of
    memory objects not bound to a single kmem_cache.

    Christoph pointed out that it is possible to implement freeing of
    objects without knowing the kmem_cache pointer, as that information is
    available from the object's page->slab_cache, and proposed removing the
    kmem_cache argument from the bulk free API.

    Jesper demonstrated that these extra steps per object come at a
    performance cost. It is only when CONFIG_MEMCG_KMEM is compiled in and
    activated at runtime that these steps are done anyway. The extra cost
    is most visible for the SLAB allocator, because the SLUB allocator does
    the page lookup (virt_to_head_page()) anyway.

    Thus, the conclusion was to keep the kmem_cache free bulk API with a
    kmem_cache pointer, but we can still implement a kfree_bulk() API
    fairly easily, simply by handling the case where kmem_cache_free_bulk()
    gets called with a NULL kmem_cache pointer.

    This does increase the code size a bit, but implementing a separate
    kfree_bulk() call would likely increase code size even more.
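
    A minimal usage sketch of the new call (illustrative only; allocation
    error handling trimmed):

        void *objs[16];
        size_t i;

        for (i = 0; i < ARRAY_SIZE(objs); i++)
                objs[i] = kmalloc(256, GFP_KERNEL);

        /* Objects may come from different kmem_caches; internally this is
         * kmem_cache_free_bulk(NULL, size, p). */
        kfree_bulk(ARRAY_SIZE(objs), objs);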

    The benchmarks below measure the cost of alloc+free (obj size 256
    bytes) on a CPU i7-4790K @ 4.00GHz, with no PREEMPT and
    CONFIG_MEMCG_KMEM=y.

    Code size increase for SLAB:

    add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
    function old new delta
    kmem_cache_free_bulk 660 734 +74

    SLAB fastpath: 87 cycles(tsc) 21.814
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 103 cycles 25.878 ns - 41 cycles 10.498 ns - 81 cycles 20.312 ns
    2 - 94 cycles 23.673 ns - 26 cycles 6.682 ns - 42 cycles 10.649 ns
    3 - 92 cycles 23.181 ns - 21 cycles 5.325 ns - 39 cycles 9.950 ns
    4 - 90 cycles 22.727 ns - 18 cycles 4.673 ns - 26 cycles 6.693 ns
    8 - 89 cycles 22.270 ns - 14 cycles 3.664 ns - 23 cycles 5.835 ns
    16 - 88 cycles 22.038 ns - 14 cycles 3.503 ns - 22 cycles 5.543 ns
    30 - 89 cycles 22.284 ns - 13 cycles 3.310 ns - 20 cycles 5.197 ns
    32 - 88 cycles 22.249 ns - 13 cycles 3.420 ns - 20 cycles 5.166 ns
    34 - 88 cycles 22.224 ns - 14 cycles 3.643 ns - 20 cycles 5.170 ns
    48 - 88 cycles 22.088 ns - 14 cycles 3.507 ns - 20 cycles 5.203 ns
    64 - 88 cycles 22.063 ns - 13 cycles 3.428 ns - 20 cycles 5.152 ns
    128 - 89 cycles 22.483 ns - 15 cycles 3.891 ns - 23 cycles 5.885 ns
    158 - 89 cycles 22.381 ns - 15 cycles 3.779 ns - 22 cycles 5.548 ns
    250 - 91 cycles 22.798 ns - 16 cycles 4.152 ns - 23 cycles 5.967 ns

    SLAB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
    1 - 148 cycles 37.220 ns - 66 cycles 16.622 ns - 66 cycles 16.583 ns
    2 - 141 cycles 35.510 ns - 51 cycles 12.820 ns - 58 cycles 14.625 ns
    3 - 140 cycles 35.017 ns - 37 cycles 9.326 ns - 33 cycles 8.474 ns
    4 - 137 cycles 34.507 ns - 31 cycles 7.888 ns - 33 cycles 8.300 ns
    8 - 140 cycles 35.069 ns - 25 cycles 6.461 ns - 25 cycles 6.436 ns
    16 - 138 cycles 34.542 ns - 23 cycles 5.945 ns - 22 cycles 5.670 ns
    30 - 136 cycles 34.227 ns - 22 cycles 5.502 ns - 22 cycles 5.587 ns
    32 - 136 cycles 34.253 ns - 21 cycles 5.475 ns - 21 cycles 5.324 ns
    34 - 136 cycles 34.254 ns - 21 cycles 5.448 ns - 20 cycles 5.194 ns
    48 - 136 cycles 34.075 ns - 21 cycles 5.458 ns - 21 cycles 5.367 ns
    64 - 135 cycles 33.994 ns - 21 cycles 5.350 ns - 21 cycles 5.259 ns
    128 - 137 cycles 34.446 ns - 23 cycles 5.816 ns - 22 cycles 5.688 ns
    158 - 137 cycles 34.379 ns - 22 cycles 5.727 ns - 22 cycles 5.602 ns
    250 - 138 cycles 34.755 ns - 24 cycles 6.093 ns - 23 cycles 5.986 ns

    Code size increase for SLUB:
    function old new delta
    kmem_cache_free_bulk 717 799 +82

    SLUB benchmark:
    SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 61 cycles 15.486 ns - 53 cycles 13.364 ns - 57 cycles 14.464 ns
    2 - 54 cycles 13.703 ns - 32 cycles 8.110 ns - 33 cycles 8.482 ns
    3 - 53 cycles 13.272 ns - 25 cycles 6.362 ns - 27 cycles 6.947 ns
    4 - 51 cycles 12.994 ns - 24 cycles 6.087 ns - 24 cycles 6.078 ns
    8 - 50 cycles 12.576 ns - 21 cycles 5.354 ns - 22 cycles 5.513 ns
    16 - 49 cycles 12.368 ns - 20 cycles 5.054 ns - 20 cycles 5.042 ns
    30 - 49 cycles 12.273 ns - 18 cycles 4.748 ns - 19 cycles 4.758 ns
    32 - 49 cycles 12.401 ns - 19 cycles 4.821 ns - 19 cycles 4.810 ns
    34 - 98 cycles 24.519 ns - 24 cycles 6.154 ns - 24 cycles 6.157 ns
    48 - 83 cycles 20.833 ns - 21 cycles 5.446 ns - 21 cycles 5.429 ns
    64 - 75 cycles 18.891 ns - 20 cycles 5.247 ns - 20 cycles 5.238 ns
    128 - 93 cycles 23.271 ns - 27 cycles 6.856 ns - 27 cycles 6.823 ns
    158 - 102 cycles 25.581 ns - 30 cycles 7.714 ns - 30 cycles 7.695 ns
    250 - 107 cycles 26.917 ns - 38 cycles 9.514 ns - 38 cycles 9.506 ns

    SLUB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
    1 - 85 cycles 21.484 ns - 78 cycles 19.569 ns - 75 cycles 18.938 ns
    2 - 81 cycles 20.363 ns - 45 cycles 11.258 ns - 44 cycles 11.076 ns
    3 - 78 cycles 19.709 ns - 33 cycles 8.354 ns - 32 cycles 8.044 ns
    4 - 77 cycles 19.430 ns - 28 cycles 7.216 ns - 28 cycles 7.003 ns
    8 - 101 cycles 25.288 ns - 23 cycles 5.849 ns - 23 cycles 5.787 ns
    16 - 76 cycles 19.148 ns - 20 cycles 5.162 ns - 20 cycles 5.081 ns
    30 - 76 cycles 19.067 ns - 19 cycles 4.868 ns - 19 cycles 4.821 ns
    32 - 76 cycles 19.052 ns - 19 cycles 4.857 ns - 19 cycles 4.815 ns
    34 - 121 cycles 30.291 ns - 25 cycles 6.333 ns - 25 cycles 6.268 ns
    48 - 108 cycles 27.111 ns - 21 cycles 5.498 ns - 21 cycles 5.458 ns
    64 - 100 cycles 25.164 ns - 20 cycles 5.242 ns - 20 cycles 5.229 ns
    128 - 155 cycles 38.976 ns - 27 cycles 6.886 ns - 27 cycles 6.892 ns
    158 - 132 cycles 33.034 ns - 30 cycles 7.711 ns - 30 cycles 7.728 ns
    250 - 130 cycles 32.612 ns - 38 cycles 9.560 ns - 38 cycles 9.549 ns

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

19 Feb, 2016

1 commit

  • When slub_debug alloc_calls_show is enabled, we will try to track the
    location and user of slab objects on each online node; the
    kmem_cache_node structure and cpu_cache/cpu_slub shouldn't be freed
    until the last reference to the sysfs file is dropped.

    This fixes the following panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: list_locations+0x169/0x4e0
    PGD 257304067 PUD 438456067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
    Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
    task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
    RIP: list_locations+0x169/0x4e0
    Call Trace:
    alloc_calls_show+0x1d/0x30
    slab_attr_show+0x1b/0x30
    sysfs_read_file+0x9a/0x1a0
    vfs_read+0x9c/0x170
    SyS_read+0x58/0xb0
    system_call_fastpath+0x16/0x1b
    Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
    CR2: 0000000000000020

    Separate __kmem_cache_release from __kmem_cache_shutdown; the former is
    now called from slab_kmem_cache_release (after the last reference to
    the sysfs file object has been dropped).

    Reintroduce locking in free_partial, as the sysfs file might access the
    cache's partial list after shutdown - a partial revert of commit
    69cb8e6b7c29 ("slub: free slabs without holding locks"). Zap
    __remove_partial and use remove_partial (w/o underscores), as
    free_partial now takes list_lock; this is a partial revert of commit
    1e4dd9461fab ("slub: do not assert not having lock in removing freed
    partial").

    Signed-off-by: Dmitry Safonov
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     

21 Jan, 2016

2 commits

  • The cgroup2 memory controller will account important in-kernel memory
    consumers by default. Move all necessary components to CONFIG_MEMCG.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • On any given memcg, the kmem accounting feature has three separate
    states: not initialized, structures allocated, and actively accounting
    slab memory. These are represented through a combination of the
    kmem_acct_activated and kmem_acct_active flags, which is confusing.

    Convert to a kmem_state enum with the states NONE, ALLOCATED, and
    ONLINE. Then rename the functions to modify the state accordingly.
    This follows the nomenclature of css object states more closely.
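
    A sketch of the resulting enum (one plausible spelling of the NONE,
    ALLOCATED and ONLINE states named above):

        /* Tracked per memcg instead of the two kmem_acct_* flags. */
        enum memcg_kmem_state {
                KMEM_NONE,
                KMEM_ALLOCATED,
                KMEM_ONLINE,
        };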

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jan, 2016

1 commit

  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces the SLAB_ACCOUNT flag which, if
    passed to kmem_cache_create, will force accounting for every allocation
    from this cache even if __GFP_ACCOUNT is not passed.

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).
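
    A minimal usage sketch of the flag described above (cache and struct
    names are illustrative):

        /* Every allocation from this cache is charged to the allocating
         * task's memcg, without __GFP_ACCOUNT at each call site. */
        cache = kmem_cache_create("my_objs", sizeof(struct my_obj), 0,
                                  SLAB_ACCOUNT, NULL);
        obj = kmem_cache_alloc(cache, GFP_KERNEL);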

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Nov, 2015

1 commit

  • Adjust the kmem_cache_alloc_bulk API before we have any real users.

    Adjust the API to return type 'int' instead of the previous type
    'bool'. This is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary,
    when specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to run
    bulk alloc without local IRQs disabled, with a cmpxchg "stealing" the
    entire c->freelist or page->freelist. To avoid overshooting we would
    stop processing at a slab-page boundary; otherwise we always end up
    returning some objects at the cost of another cmpxchg.

    To stay compatible with future users of this API who link against an
    older kernel while using the new flag, we need to return the number of
    allocated objects with this API change.
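
    A minimal caller sketch under the adjusted API (illustrative only; with
    the semantics above, the return value is the number of objects actually
    allocated):

        void *objs[8];
        int allocated;

        allocated = kmem_cache_alloc_bulk(cache, GFP_KERNEL,
                                          ARRAY_SIZE(objs), objs);
        if (allocated < ARRAY_SIZE(objs)) {
                /* handle a partial or failed bulk allocation */
        }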

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

06 Nov, 2015

5 commits

  • The assignment to NULL within the error condition was written in a 2014
    patch to suppress a compiler warning. However, it would be cleaner to
    initialize the kmem_cache pointer to NULL up front and simply return it
    in the error case.

    Signed-off-by: Alexandru Moise
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandru Moise
     
  • Currently, when kmem_cache_destroy() is called for a global cache, we
    print a warning for each per memcg cache attached to it that has active
    objects (see shutdown_cache). This is redundant, because it gives no new
    information and only clutters the log. If a cache being destroyed has
    active objects, there must be a memory leak in the module that created the
    cache, and it does not matter if the cache was used by users in memory
    cgroups or not.

    This patch moves the warning from shutdown_cache(), which is called for
    shutting down both global and per memcg caches, to kmem_cache_destroy(),
    so that the warning is only printed once if there are objects left in the
    cache being destroyed.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we do not clear pointers to per memcg caches in the
    memcg_params.memcg_caches array when a global cache is destroyed with
    kmem_cache_destroy.

    This is fine if the global cache does get destroyed. However, a cache can
    be left on the list if it still has active objects when kmem_cache_destroy
    is called (due to a memory leak). If this happens, the entries in the
    array will point to already freed areas, which is likely to result in data
    corruption when the cache is reused (via slab merging).

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • do_kmem_cache_create(), do_kmem_cache_shutdown(), and
    do_kmem_cache_release() sound awkward for static helper functions that are
    not supposed to be used outside slab_common.c. Rename them to
    create_cache(), shutdown_cache(), and release_caches(), respectively.
    This patch is a pure cleanup and does not introduce any functional
    changes.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • A good candidate to return a boolean result.

    Signed-off-by: Denis Kirjanov
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Kirjanov
     

09 Sep, 2015

2 commits

  • The mem_cgroup structure is currently defined in mm/memcontrol.c, which
    means that code outside of this file has to use the external API even
    for trivial access stuff.

    This patch exports struct mem_cgroup with its dependencies and makes
    some of the exported functions inlines. This even helps to reduce the
    code size a bit (make defconfig + CONFIG_MEMCG=y):

    text data bss dec hex filename
    12355346 1823792 1089536 15268674 e8fb42 vmlinux.before
    12354970 1823792 1089536 15268298 e8f9ca vmlinux.after

    This is not much (370B) but better than nothing.

    We also save a function call in some hot paths like callers of
    mem_cgroup_count_vm_event which is used for accounting.

    The patch doesn't introduce any functional changes.

    [vdavykov@parallels.com: inline memcg_kmem_is_active]
    [vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG]
    [akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx]
    [akpm@linux-foundation.org: export mem_cgroup_from_task() to modules]
    Signed-off-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Suggested-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • kmem_cache_destroy() does not tolerate a NULL kmem_cache pointer argument
    and performs a NULL-pointer dereference. This requires additional
    attention and effort from developers/reviewers and forces all
    kmem_cache_destroy() callers (200+ as of 4.1) to do a NULL check

    if (cache)
            kmem_cache_destroy(cache);

    or otherwise be invalid kmem_cache_destroy() users.

    Tweak kmem_cache_destroy() and NULL-check the pointer there.
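
    A sketch of the tweak described above (simplified; the rest of the
    function is elided):

        void kmem_cache_destroy(struct kmem_cache *s)
        {
                /* Tolerate NULL so callers can drop their own checks. */
                if (unlikely(!s))
                        return;
                /* ... */
        }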

    Proposed by Andrew Morton.

    Link: https://lkml.org/lkml/2015/6/8/583
    Signed-off-by: Sergey Senozhatsky
    Acked-by: David Rientjes
    Cc: Julia Lawall
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

05 Sep, 2015

1 commit

  • Add the basic infrastructure for alloc/free operations on pointer arrays.
    It includes a generic function in the common slab code that is used in
    this infrastructure patch to create the unoptimized functionality for slab
    bulk operations.

    Allocators can then provide optimized allocation functions for situations
    in which large numbers of objects are needed. These optimizations may
    avoid taking locks repeatedly and bypass metadata creation if all
    objects in slab pages can be used to provide the objects required.

    Allocators can extend the skeletons provided and add their own code to the
    bulk alloc and free functions. They can keep the generic allocation and
    freeing and just fall back to those if optimizations would not work (like
    for example when debugging is on).
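
    A conceptual sketch of the unoptimized fallback described above
    (illustrative only; the generic helpers live in mm/slab_common.c):

        /* Fall back to one regular allocation per object; an allocator can
         * override this with a batched fast path. */
        for (i = 0; i < nr; i++) {
                p[i] = kmem_cache_alloc(s, flags);
                if (!p[i]) {
                        /* free what was allocated so far and fail */
                        kmem_cache_free_bulk(s, i, p);
                        break;
                }
        }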

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

07 Aug, 2015

1 commit

  • This patch fixes the creation of new kmem caches after enabling
    sanity_checks for existing mergeable kmem caches at runtime: before
    this patch, creation failed because the unique name in sysfs was
    already taken by the existing kmem cache.

    Unlike other debug options, this one doesn't change the object layout
    and can be enabled and disabled at any time.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

02 Jul, 2015

1 commit

  • Avoid the warning:

    WARNING: mm/built-in.o(.text.unlikely+0xc22): Section mismatch in reference from the function .new_kmalloc_cache() to the variable .init.rodata:kmalloc_info
    The function .new_kmalloc_cache() references
    the variable __initconst kmalloc_info.

    Signed-off-by: Christoph Lameter
    Reported-by: Stephen Rothwell
    Tested-by: Geert Uytterhoeven
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

30 Jun, 2015

1 commit

  • This patch restores the slab creation sequence that was broken by commit
    4066c33d0308f8 and also reverts the portions that introduced the
    KMALLOC_LOOP_XXX macros. Those can never really work since the slab creation
    is much more complex than just going from a minimum to a maximum number.

    The latest upstream kernel boots cleanly on my machine with a 64 bit x86
    configuration under KVM using either SLAB or SLUB.

    Fixes: 4066c33d0308f8 ("support the slub_debug boot option")
    Reported-by: Theodore Ts'o
    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

25 Jun, 2015

2 commits

  • This patch moves the initialization of the size_index table slightly
    earlier so that the first few kmem_cache_node's can be safely allocated
    when KMALLOC_MIN_SIZE is large.

    There are currently two ways to generate indices into kmalloc_caches (via
    kmalloc_index() and via the size_index table in slab_common.c) and on some
    arches (possibly only MIPS) they potentially disagree with each other
    until create_kmalloc_caches() has been called. It seems that the
    intention is that the size_index table is a fast equivalent to
    kmalloc_index() and that create_kmalloc_caches() patches the table to
    return the correct value for the cases where kmalloc_index()'s
    if-statements apply.

    The failing sequence was:
    * kmalloc_caches contains NULL elements
    * kmem_cache_init initialises the element that 'struct
      kmem_cache_node' will be allocated to. For 32-bit Mips, this is a
      56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
    * init_list is called which calls kmalloc_node to allocate a 'struct
      kmem_cache_node'.
    * kmalloc_slab selects the kmem_caches element using
      size_index[size_index_elem(size)]. For MIPS, size is 56, and the
      expression returns 6.
    * This element of kmalloc_caches is NULL and allocation fails.
    * If it had not already failed, it would have called
      create_kmalloc_caches() at this point which would have changed
      size_index[size_index_elem(size)] to 7.

    I don't believe the bug to be LLVM specific but GCC doesn't normally
    encounter the problem. I haven't been able to identify exactly what GCC
    is doing better (probably inlining) but it seems that GCC is managing to
    optimize to the point that it eliminates the problematic allocations.
    This theory is supported by the fact that GCC can be made to fail in the
    same way by changing inline, __inline, __inline__, and __always_inline in
    include/linux/compiler-gcc.h such that they don't actually inline things.

    Signed-off-by: Daniel Sanders
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Sanders
     
  • The slub_debug=PU,kmalloc-xx option cannot work because in
    create_kmalloc_caches() the s->name is created after
    create_kmalloc_cache() is called. The name is NULL in
    create_kmalloc_cache(), so kmem_cache_flags() would not set the
    slub_debug flags in s->flags. The fix here sets up a kmalloc_names
    string array for initialization purposes and deletes the dynamic name
    creation of kmalloc_caches.

    [akpm@linux-foundation.org: s/kmalloc_names/kmalloc_info/, tweak comment text]
    Signed-off-by: Gavin Guo
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Guo
     

14 Feb, 2015

2 commits

  • With this patch kasan will be able to catch bugs in memory allocated by
    slub. Initially, all objects in a newly allocated slab page are marked
    as redzone. Later, when a slub object is allocated, the number of bytes
    requested by the caller is marked as accessible, and the rest of the
    object (including slub's metadata) is marked as redzone (inaccessible).

    We also mark an object as accessible if ksize was called for it. There
    are some places in the kernel where the ksize function is called to
    inquire the size of the actually allocated area. Such callers may
    validly access the whole allocated memory, so it should be marked as
    accessible.

    Code in the slub.c and slab_common.c files may validly access an
    object's metadata, so instrumentation for these files is disabled.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The slab code frequently duplicates strings located in the read-only
    memory section. Replacing kstrdup with kstrdup_const avoids such
    copies.
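
    A minimal usage sketch of the substitution described above (field name
    illustrative; kstrdup_const() returns the original pointer when the
    string lives in .rodata, and the result must be released with
    kfree_const()):

        cachep->name = kstrdup_const(name, GFP_KERNEL);
        /* ... */
        kfree_const(cachep->name);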

    [akpm@linux-foundation.org: make the handling of kmem_cache.name const-correct]
    Signed-off-by: Andrzej Hajda
    Cc: Marek Szyprowski
    Cc: Kyungmin Park
    Cc: Mike Turquette
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc: Greg KH
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrzej Hajda
     

13 Feb, 2015

7 commits

  • To speed up further allocations SLUB may store empty slabs in per cpu/node
    partial lists instead of freeing them immediately. This prevents per
    memcg caches destruction, because kmem caches created for a memory cgroup
    are only destroyed after the last page charged to the cgroup is freed.

    To fix this issue, this patch resurrects approach first proposed in [1].
    It forbids SLUB to cache empty slabs after the memory cgroup that the
    cache belongs to was destroyed. It is achieved by setting kmem_cache's
    cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
    that it would drop frozen empty slabs immediately if cpu_partial = 0.

    The runtime overhead is minimal. From all the hot functions, we only
    touch relatively cold put_cpu_partial(): we make it call
    unfreeze_partials() after freezing a slab that belongs to an offline
    memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
    partial list on free/alloc, and there can't be allocations from dead
    caches, it shouldn't cause any overhead. We do have to disable preemption
    for put_cpu_partial() to achieve that though.

    The original patch was accepted well and even merged to the mm tree.
    However, I decided to withdraw it due to changes happening to the memcg
    core at that time. I had an idea of introducing per-memcg shrinkers for
    kmem caches, but now, as memcg has finally settled down, I do not see it
    as an option, because SLUB shrinker would be too costly to call since SLUB
    does not keep free slabs on a separate list. Besides, we currently do not
    even call per-memcg shrinkers for offline memcgs. Overall, it would
    introduce much more complexity to both SLUB and memcg than this small
    patch.

    As for SLAB, there's no problem with it, because it shrinks
    per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
    longer keep entries for offline cgroups in per-memcg arrays (such as
    memcg_cache_params->memcg_caches), so we do not have to bother if a
    per-memcg cache will be shrunk a bit later than it could be.

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only
    on allocations, so there is no need to have the array entries set until
    css free - we can clear them on css offline. This will allow us to reuse
    array entries more efficiently and avoid costly array relocations.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we use mem_cgroup->kmemcg_id to guarantee kmem_cache->name
    uniqueness. This is correct, because kmemcg_id is only released on css
    free after destroying all per memcg caches.

    However, I am going to change that and release kmemcg_id on css offline,
    because it is not wise to keep it for so long, wasting valuable entries of
    memcg_cache_params->memcg_caches arrays. Therefore, to preserve cache
    name uniqueness, let us switch to css->id.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Sometimes, we need to iterate over all memcg copies of a particular root
    kmem cache. Currently, we use memcg_cache_params->memcg_caches array for
    that, because it contains all existing memcg caches.

    However, it's a bad practice to keep all caches, including those that
    belong to offline cgroups, in this array, because it will be growing
    beyond any bounds then. I'm going to wipe away dead caches from it to
    save space. To still be able to perform iterations over all memcg caches
    of the same kind, let us link them into a list.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, kmem_cache stores a pointer to struct memcg_cache_params
    instead of embedding it. The rationale is to save memory when kmem
    accounting is disabled. However, the memcg_cache_params has shrivelled
    drastically since it was first introduced:

    * Initially:

    struct memcg_cache_params {
        bool is_root_cache;
        union {
            struct kmem_cache *memcg_caches[0];
            struct {
                struct mem_cgroup *memcg;
                struct list_head list;
                struct kmem_cache *root_cache;
                bool dead;
                atomic_t nr_pages;
                struct work_struct destroy;
            };
        };
    };

    * Now:

    struct memcg_cache_params {
        bool is_root_cache;
        union {
            struct {
                struct rcu_head rcu_head;
                struct kmem_cache *memcg_caches[0];
            };
            struct {
                struct mem_cgroup *memcg;
                struct kmem_cache *root_cache;
            };
        };
    };

    So the memory saving does not seem to be a clear win anymore.

    OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
    it results in touching one more cache line on kmem alloc/free hot paths.
    Besides, it makes linking kmem caches in a list chained by a field of
    struct memcg_cache_params really painful due to a level of indirection,
    while I want to make them linked in the following patch. That said, let
    us embed it.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We need a stable value of memcg_nr_cache_ids in kmem_cache_create()
    (memcg_alloc_cache_params() wants it for root caches), where we only
    hold the slab_mutex and no memcg-related locks. As a result, we have to
    update memcg_nr_cache_ids under the slab_mutex, which we can only take
    on the slab's side (see memcg_update_array_size). This looks awkward
    and will become even worse when per-memcg list_lru is introduced, which
    also wants stable access to memcg_nr_cache_ids.

    To get rid of this dependency between the memcg_nr_cache_ids and the
    slab_mutex, this patch introduces a special rwsem. The rwsem is held
    for writing during memcg_caches arrays relocation and memcg_nr_cache_ids
    updates. Therefore one can take it for reading to get a stable access
    to memcg_caches arrays and/or memcg_nr_cache_ids.

    Currently the semaphore is taken for reading only from
    kmem_cache_create, right before taking the slab_mutex, so right now
    there's not much point in using a rwsem instead of a mutex. However,
    once list_lru is made per-memcg it will allow list_lru initializations
    to proceed concurrently.
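
    A minimal sketch of the read-side usage described above (helper names
    as introduced by this patch; treat the snippet as illustrative):

        memcg_get_cache_ids();  /* down_read of the new rwsem */
        /* memcg_nr_cache_ids and the memcg_caches arrays are stable here */
        memcg_put_cache_ids();  /* up_read */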

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • memcg_limited_groups_array_size, which defines the size of the
    memcg_caches arrays, sounds rather cumbersome. Also, it doesn't
    indicate in any way that it's related to kmem/caches stuff. So let's
    rename it to memcg_nr_cache_ids. It's concise and points us directly
    to memcg_cache_id.

    Also, rename kmem_limited_groups to memcg_cache_ida.

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

11 Feb, 2015

2 commits

  • mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
    the given cgroup. Currently, it is only used on css free in order to
    destroy all caches corresponding to the memory cgroup being freed. The
    list is protected by memcg_slab_mutex. The mutex is also used to protect
    kmem_cache->memcg_params->memcg_caches arrays and synchronizes
    kmem_cache_destroy vs memcg_unregister_all_caches.

    However, we can perfectly get on without these two. To destroy all caches
    corresponding to a memory cgroup, we can walk over the global list of kmem
    caches, slab_caches, and we can do all the synchronization stuff using the
    slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
    of the memcg_slab_caches and memcg_slab_mutex.

    Apart from this nice cleanup, it also:

    - assures that rcu_barrier() is called once at max when a root cache is
    destroyed or a memory cgroup is freed, no matter how many caches have
    SLAB_DESTROY_BY_RCU flag set;

    - fixes the race between kmem_cache_destroy and kmem_cache_create that
    exists, because memcg_cleanup_cache_params, which is called from
    kmem_cache_destroy after checking that kmem_cache->refcount=0,
    releases the slab_mutex, which gives kmem_cache_create a chance to
    make an alias to a cache doomed to be destroyed.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Instead of passing the name of the memory cgroup which the cache is
    created for in the memcg_name argument, let's obtain it directly in
    memcg_create_kmem_cache.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov