07 Sep, 2016

1 commit

  • Install the callbacks via the state machine.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Link: http://lkml.kernel.org/r/20160823125319.abeapfjapf2kfezp@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page

    Linus Torvalds
     

27 Jul, 2016

2 commits

  • When both arguments to kmalloc_array() or kcalloc() are known at compile
    time, their product is also known at compile time, yet the search for the
    matching kmalloc cache still happens at runtime rather than at compile time.
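
    As a hedged illustration of the idea (a simplified sketch, not the exact
    patch): when both arguments are compile-time constants, the product can be
    folded at compile time so the kmalloc cache lookup is resolved at compile
    time as well.

      static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
      {
              /* overflow check: reject n * size that would wrap */
              if (size != 0 && n > SIZE_MAX / size)
                      return NULL;
              /* constant product: kmalloc() can pick the cache at compile time */
              if (__builtin_constant_p(n) && __builtin_constant_p(size))
                      return kmalloc(n * size, flags);
              return __kmalloc(n * size, flags);
      }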

    Link: http://lkml.kernel.org/r/20160627213454.GA2440@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • This is the start of porting PAX_USERCOPY into the mainline kernel. This
    is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
    work is based on code by PaX Team and Brad Spengler, and an earlier port
    from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.

    This patch contains the logic for validating several conditions when
    performing copy_to_user() and copy_from_user() on the kernel object
    being copied to/from:
    - address range doesn't wrap around
    - address range isn't NULL or zero-allocated (with a non-zero copy size)
    - if on the slab allocator:
        - copy size must be less than or equal to the object size (when the
          check is implemented in the allocator, which appears in subsequent
          patches)
    - otherwise, object must not span page allocations (excepting Reserved
      and CMA ranges)
    - if on the stack:
        - object must not extend before/after the current process stack
        - object must be contained by a valid stack frame (when there is
          arch/build support for identifying stack frames)
    - object must not overlap with kernel text
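
    A hedged sketch of how an architecture's uaccess path might invoke the new
    check before copying (check_object_size() is the interface named by this
    series; the surrounding wrapper is simplified):

      static inline unsigned long
      copy_from_user(void *to, const void __user *from, unsigned long n)
      {
              /* validate the destination kernel object before copying */
              check_object_size(to, n, false);
              return __copy_from_user(to, from, n);
      }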

    Signed-off-by: Kees Cook
    Tested-by: Valdis Kletnieks
    Tested-by: Michael Ellerman

    Kees Cook
     

20 May, 2016

1 commit

  • Attach the malloc attribute to a few allocation functions. This helps
    gcc generate better code by telling it that the return value doesn't
    alias any existing pointers (which is even more valuable given the
    pessimizations implied by -fno-strict-aliasing).

    A simple example of what this allows gcc to do can be seen by looking at
    the last part of drm_atomic_helper_plane_reset:

    plane->state = kzalloc(sizeof(*plane->state), GFP_KERNEL);

    if (plane->state) {
            plane->state->plane = plane;
            plane->state->rotation = BIT(DRM_ROTATE_0);
    }

    which compiles to

    e8 99 bf d6 ff callq ffffffff8116d540
    48 85 c0 test %rax,%rax
    48 89 83 40 02 00 00 mov %rax,0x240(%rbx)
    74 11 je ffffffff814015c4
    48 89 18 mov %rbx,(%rax)
    48 8b 83 40 02 00 00 mov 0x240(%rbx),%rax [*]
    c7 40 40 01 00 00 00 movl $0x1,0x40(%rax)

    With this patch applied, the instruction at [*] is elided, since the
    store to plane->state->plane is known to not alter the value of
    plane->state.
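
    A minimal sketch of the annotation being discussed, assuming the usual gcc
    spelling of the attribute (the actual patch wraps it in a kernel macro):

      #define __malloc __attribute__((__malloc__))

      void *kmalloc(size_t size, gfp_t flags) __malloc;
      void *kzalloc(size_t size, gfp_t flags) __malloc;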

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Rasmus Villemoes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

26 Mar, 2016

2 commits

  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Add KASAN hooks to SLAB allocator.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

16 Mar, 2016

3 commits

  • SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
    on or off. Expand its use to be able to turn off all consistency
    checks. This gives a nice speed up if you only want features such as
    poisoning or tracing.

    Credit to Mathias Krause for the original work which inspired this
    series.
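
    As a usage illustration (a hedged example; flag letters as documented for
    the slub_debug boot option, where 'F' selects the consistency/sanity
    checks):

      slub_debug=P,kmalloc-64    # poisoning only, no consistency checks
      slub_debug=FZP             # sanity checks, red zoning and poisoning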

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • Fix up trivial spelling errors, noticed while reading the code.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • This patch introduces a new API call, kfree_bulk(), for bulk freeing of
    memory objects not bound to a single kmem_cache.

    Christoph pointed out that it is possible to implement freeing of objects
    without knowing the kmem_cache pointer, as that information is available
    from the object's page->slab_cache, and proposed removing the kmem_cache
    argument from the bulk free API.

    Jesper demonstrated that these extra steps per object come at a
    performance cost. They are only done anyway when CONFIG_MEMCG_KMEM is
    compiled in and activated at runtime. The extra cost is most visible for
    the SLAB allocator, because the SLUB allocator does the page lookup
    (virt_to_head_page()) anyhow.

    Thus, the conclusion was to keep the kmem_cache bulk free API with a
    kmem_cache pointer; a kfree_bulk() API can still be implemented fairly
    easily, simply by handling the case where kmem_cache_free_bulk() gets
    called with a NULL kmem_cache pointer.

    This does increase the code size a bit, but implementing a separate
    kfree_bulk() call would likely increase code size even more.
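
    A hedged usage sketch of the new call (kfree_bulk(size, p) frees an array
    of objects that need not come from one cache; per the text above it maps
    onto kmem_cache_free_bulk() with a NULL kmem_cache):

      void drop_objects(void **objs, size_t nr)
      {
              kfree_bulk(nr, objs);
              /* equivalent: kmem_cache_free_bulk(NULL, nr, objs); */
      }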

    Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
    @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.

    Code size increase for SLAB:

    add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
    function old new delta
    kmem_cache_free_bulk 660 734 +74

    SLAB fastpath: 87 cycles(tsc) 21.814 ns
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 103 cycles 25.878 ns - 41 cycles 10.498 ns - 81 cycles 20.312 ns
    2 - 94 cycles 23.673 ns - 26 cycles 6.682 ns - 42 cycles 10.649 ns
    3 - 92 cycles 23.181 ns - 21 cycles 5.325 ns - 39 cycles 9.950 ns
    4 - 90 cycles 22.727 ns - 18 cycles 4.673 ns - 26 cycles 6.693 ns
    8 - 89 cycles 22.270 ns - 14 cycles 3.664 ns - 23 cycles 5.835 ns
    16 - 88 cycles 22.038 ns - 14 cycles 3.503 ns - 22 cycles 5.543 ns
    30 - 89 cycles 22.284 ns - 13 cycles 3.310 ns - 20 cycles 5.197 ns
    32 - 88 cycles 22.249 ns - 13 cycles 3.420 ns - 20 cycles 5.166 ns
    34 - 88 cycles 22.224 ns - 14 cycles 3.643 ns - 20 cycles 5.170 ns
    48 - 88 cycles 22.088 ns - 14 cycles 3.507 ns - 20 cycles 5.203 ns
    64 - 88 cycles 22.063 ns - 13 cycles 3.428 ns - 20 cycles 5.152 ns
    128 - 89 cycles 22.483 ns - 15 cycles 3.891 ns - 23 cycles 5.885 ns
    158 - 89 cycles 22.381 ns - 15 cycles 3.779 ns - 22 cycles 5.548 ns
    250 - 91 cycles 22.798 ns - 16 cycles 4.152 ns - 23 cycles 5.967 ns

    SLAB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
    1 - 148 cycles 37.220 ns - 66 cycles 16.622 ns - 66 cycles 16.583 ns
    2 - 141 cycles 35.510 ns - 51 cycles 12.820 ns - 58 cycles 14.625 ns
    3 - 140 cycles 35.017 ns - 37 cycles 9.326 ns - 33 cycles 8.474 ns
    4 - 137 cycles 34.507 ns - 31 cycles 7.888 ns - 33 cycles 8.300 ns
    8 - 140 cycles 35.069 ns - 25 cycles 6.461 ns - 25 cycles 6.436 ns
    16 - 138 cycles 34.542 ns - 23 cycles 5.945 ns - 22 cycles 5.670 ns
    30 - 136 cycles 34.227 ns - 22 cycles 5.502 ns - 22 cycles 5.587 ns
    32 - 136 cycles 34.253 ns - 21 cycles 5.475 ns - 21 cycles 5.324 ns
    34 - 136 cycles 34.254 ns - 21 cycles 5.448 ns - 20 cycles 5.194 ns
    48 - 136 cycles 34.075 ns - 21 cycles 5.458 ns - 21 cycles 5.367 ns
    64 - 135 cycles 33.994 ns - 21 cycles 5.350 ns - 21 cycles 5.259 ns
    128 - 137 cycles 34.446 ns - 23 cycles 5.816 ns - 22 cycles 5.688 ns
    158 - 137 cycles 34.379 ns - 22 cycles 5.727 ns - 22 cycles 5.602 ns
    250 - 138 cycles 34.755 ns - 24 cycles 6.093 ns - 23 cycles 5.986 ns

    Code size increase for SLUB:
    function old new delta
    kmem_cache_free_bulk 717 799 +82

    SLUB benchmark:
    SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 61 cycles 15.486 ns - 53 cycles 13.364 ns - 57 cycles 14.464 ns
    2 - 54 cycles 13.703 ns - 32 cycles 8.110 ns - 33 cycles 8.482 ns
    3 - 53 cycles 13.272 ns - 25 cycles 6.362 ns - 27 cycles 6.947 ns
    4 - 51 cycles 12.994 ns - 24 cycles 6.087 ns - 24 cycles 6.078 ns
    8 - 50 cycles 12.576 ns - 21 cycles 5.354 ns - 22 cycles 5.513 ns
    16 - 49 cycles 12.368 ns - 20 cycles 5.054 ns - 20 cycles 5.042 ns
    30 - 49 cycles 12.273 ns - 18 cycles 4.748 ns - 19 cycles 4.758 ns
    32 - 49 cycles 12.401 ns - 19 cycles 4.821 ns - 19 cycles 4.810 ns
    34 - 98 cycles 24.519 ns - 24 cycles 6.154 ns - 24 cycles 6.157 ns
    48 - 83 cycles 20.833 ns - 21 cycles 5.446 ns - 21 cycles 5.429 ns
    64 - 75 cycles 18.891 ns - 20 cycles 5.247 ns - 20 cycles 5.238 ns
    128 - 93 cycles 23.271 ns - 27 cycles 6.856 ns - 27 cycles 6.823 ns
    158 - 102 cycles 25.581 ns - 30 cycles 7.714 ns - 30 cycles 7.695 ns
    250 - 107 cycles 26.917 ns - 38 cycles 9.514 ns - 38 cycles 9.506 ns

    SLUB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
    1 - 85 cycles 21.484 ns - 78 cycles 19.569 ns - 75 cycles 18.938 ns
    2 - 81 cycles 20.363 ns - 45 cycles 11.258 ns - 44 cycles 11.076 ns
    3 - 78 cycles 19.709 ns - 33 cycles 8.354 ns - 32 cycles 8.044 ns
    4 - 77 cycles 19.430 ns - 28 cycles 7.216 ns - 28 cycles 7.003 ns
    8 - 101 cycles 25.288 ns - 23 cycles 5.849 ns - 23 cycles 5.787 ns
    16 - 76 cycles 19.148 ns - 20 cycles 5.162 ns - 20 cycles 5.081 ns
    30 - 76 cycles 19.067 ns - 19 cycles 4.868 ns - 19 cycles 4.821 ns
    32 - 76 cycles 19.052 ns - 19 cycles 4.857 ns - 19 cycles 4.815 ns
    34 - 121 cycles 30.291 ns - 25 cycles 6.333 ns - 25 cycles 6.268 ns
    48 - 108 cycles 27.111 ns - 21 cycles 5.498 ns - 21 cycles 5.458 ns
    64 - 100 cycles 25.164 ns - 20 cycles 5.242 ns - 20 cycles 5.229 ns
    128 - 155 cycles 38.976 ns - 27 cycles 6.886 ns - 27 cycles 6.892 ns
    158 - 132 cycles 33.034 ns - 30 cycles 7.711 ns - 30 cycles 7.728 ns
    250 - 130 cycles 32.612 ns - 38 cycles 9.560 ns - 38 cycles 9.549 ns

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

15 Jan, 2016

1 commit

  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces the SLAB_ACCOUNT flag which, if
    passed to kmem_cache_create, will force accounting for every allocation
    from this cache even if __GFP_ACCOUNT is not passed.

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).
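
    A hedged usage sketch ("my_obj" and "my_cache" are illustrative names, not
    part of the patch):

      struct my_obj { int payload; };        /* illustrative */
      struct kmem_cache *my_cache;
      void *obj;

      my_cache = kmem_cache_create("my_obj", sizeof(struct my_obj), 0,
                                   SLAB_ACCOUNT, NULL);
      /* every allocation from this cache is charged to the allocating
       * task's memory cgroup; no __GFP_ACCOUNT needed at call sites */
      obj = kmem_cache_alloc(my_cache, GFP_KERNEL);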

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Nov, 2015

1 commit

  • Adjust the kmem_cache_alloc_bulk API before we have any real users.

    Adjust the API to return type 'int' instead of the previous 'bool'. This
    is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary, when
    specified by a flag, and then return the number of objects allocated.

    The advantage of this approach is that it would make it easier to run bulk
    alloc without local IRQs disabled, using an approach of cmpxchg "stealing"
    the entire c->freelist or page->freelist. To avoid overshooting we would
    stop processing at a slab-page boundary; otherwise we always end up
    returning some objects at the cost of another cmpxchg.

    To stay compatible with future users of this API linking against an older
    kernel when using the new flag, we need to return the number of allocated
    objects with this API change.
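
    A hedged caller-side sketch after the change ("cache" is assumed to be an
    existing kmem_cache; the return value is now the number of objects
    allocated rather than a bool):

      void *objs[16];
      int allocated;

      allocated = kmem_cache_alloc_bulk(cache, GFP_KERNEL, 16, objs);
      if (allocated < 16) {
              /* with a future "stop at page boundary" flag this could be a
               * partial success; today it is all-or-nothing */
              kmem_cache_free_bulk(cache, allocated, objs);
      }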

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

21 Nov, 2015

1 commit

  • The various allocators return aligned memory. Telling the compiler that
    allows it to generate better code in many cases, for example when the
    return value is immediately passed to memset().

    Some code does become larger, but at least we win twice as much as we lose:

    $ scripts/bloat-o-meter /tmp/vmlinux vmlinux
    add/remove: 0/0 grow/shrink: 13/52 up/down: 995/-2140 (-1145)

    An example of the different (and smaller) code can be seen in mm_alloc(). Before:

    : 48 8d 78 08 lea 0x8(%rax),%rdi
    : 48 89 c1 mov %rax,%rcx
    : 48 89 c2 mov %rax,%rdx
    : 48 c7 00 00 00 00 00 movq $0x0,(%rax)
    : 48 c7 80 48 03 00 00 movq $0x0,0x348(%rax)
    : 00 00 00 00
    : 31 c0 xor %eax,%eax
    : 48 83 e7 f8 and $0xfffffffffffffff8,%rdi
    : 48 29 f9 sub %rdi,%rcx
    : 81 c1 50 03 00 00 add $0x350,%ecx
    : c1 e9 03 shr $0x3,%ecx
    : f3 48 ab rep stos %rax,%es:(%rdi)

    After:

    : 48 89 c2 mov %rax,%rdx
    : b9 6a 00 00 00 mov $0x6a,%ecx
    : 31 c0 xor %eax,%eax
    : 48 89 d7 mov %rdx,%rdi
    : f3 48 ab rep stos %rax,%es:(%rdi)

    So gcc's strategy is to do two possibly (but not really, of course)
    unaligned stores to the first and last word, then do an aligned rep stos
    covering the middle part with a little overlap. Maybe arches which do not
    allow unaligned stores gain even more.

    I don't know if gcc can actually make use of alignments greater than 8 for
    anything, so one could probably drop the __assume_xyz_alignment macros and
    just use __assume_aligned(8).

    The increases in code size are mostly caused by gcc deciding to
    opencode strlen() using the check-four-bytes-at-a-time trick when it
    knows the buffer is sufficiently aligned (one function grew by 200
    bytes). Now it turns out that many of these strlen() calls showing up
    were in fact redundant, and they're gone from -next. Applying the two
    patches to next-20151001 bloat-o-meter instead says

    add/remove: 0/0 grow/shrink: 6/52 up/down: 244/-2140 (-1896)
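
    A hedged sketch of the annotations discussed above, assuming gcc's
    assume_aligned attribute (macro names per the patch description;
    simplified):

      #define __assume_aligned(a, ...) \
              __attribute__((__assume_aligned__(a, ## __VA_ARGS__)))
      #define __assume_kmalloc_alignment __assume_aligned(ARCH_KMALLOC_MINALIGN)
      #define __assume_slab_alignment __assume_aligned(ARCH_SLAB_MINALIGN)

      void *kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment;
      void *kmem_cache_alloc(struct kmem_cache *s, gfp_t flags)
              __assume_slab_alignment;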

    Signed-off-by: Rasmus Villemoes
    Acked-by: Christoph Lameter
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

05 Sep, 2015

1 commit

  • Add the basic infrastructure for alloc/free operations on pointer arrays.
    It includes a generic function in the common slab code that is used in
    this infrastructure patch to create the unoptimized functionality for slab
    bulk operations.

    Allocators can then provide optimized allocation functions for situations
    in which large numbers of objects are needed. These optimizations may
    avoid taking locks repeatedly and bypass metadata creation if all objects
    in slab pages can be used to provide the objects required.

    Allocators can extend the skeletons provided and add their own code to the
    bulk alloc and free functions. They can keep the generic allocation and
    freeing and just fall back to those if optimizations would not work (like
    for example when debugging is on).
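
    A hedged sketch of the generic fallback this infrastructure provides, in
    its original bool-returning form (allocate one object at a time and unwind
    on failure; allocators may override it with batched versions):

      bool __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
                                   size_t nr, void **p)
      {
              size_t i;

              for (i = 0; i < nr; i++) {
                      void *x = kmem_cache_alloc(s, flags);

                      if (!x) {
                              /* roll back everything allocated so far */
                              __kmem_cache_free_bulk(s, i, p);
                              return false;
                      }
                      p[i] = x;
              }
              return true;
      }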

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

30 Jun, 2015

1 commit

  • This patch restores the slab creation sequence that was broken by commit
    4066c33d0308f8 and also reverts the portions that introduced the
    KMALLOC_LOOP_XXX macros. Those can never really work since the slab creation
    is much more complex than just going from a minimum to a maximum number.

    The latest upstream kernel boots cleanly on my machine with a 64 bit x86
    configuration under KVM using either SLAB or SLUB.

    Fixes: 4066c33d0308f8 ("support the slub_debug boot option")
    Reported-by: Theodore Ts'o
    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

25 Jun, 2015

2 commits

  • The first is a keyboard-off-by-one, the other two the ordinary mathy kind.

    Signed-off-by: Rasmus Villemoes
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The slub_debug=PU,kmalloc-xx option cannot work because in
    create_kmalloc_caches() the s->name is created after
    create_kmalloc_cache() is called. The name is NULL in
    create_kmalloc_cache(), so kmem_cache_flags() does not set the
    slub_debug flags in s->flags. The fix here sets up a kmalloc_names
    string array for initialization purposes and deletes the dynamic name
    creation of kmalloc_caches.
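
    A hedged sketch of the statically initialized name table that replaces the
    dynamic name creation (kmalloc_info after the rename noted below; sizes
    abbreviated):

      static const struct {
              const char *name;
              unsigned long size;
      } kmalloc_info[] __initconst = {
              {NULL,           0},    {"kmalloc-96",   96},
              {"kmalloc-192", 192},   {"kmalloc-8",     8},
              {"kmalloc-16",   16},   {"kmalloc-32",   32},
              {"kmalloc-64",   64},   {"kmalloc-128", 128},
              {"kmalloc-256", 256},   /* ... up to KMALLOC_SHIFT_HIGH */
      };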

    [akpm@linux-foundation.org: s/kmalloc_names/kmalloc_info/, tweak comment text]
    Signed-off-by: Gavin Guo
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Guo
     

14 Feb, 2015

1 commit

  • With this patch kasan will be able to catch bugs in memory allocated by
    slub. Initially, all objects in a newly allocated slab page are marked as
    redzone. Later, when a slub object is allocated, the number of bytes
    requested by the caller is marked as accessible, and the rest of the
    object (including slub's metadata) is marked as redzone (inaccessible).

    We also mark an object as accessible if ksize was called for this object.
    There are some places in the kernel where the ksize function is called to
    inquire the size of the actually allocated area. Such callers could
    validly access the whole allocated memory, so it should be marked as
    accessible.

    Code in the slub.c and slab_common.c files may validly access the
    object's metadata, so instrumentation for these files is disabled.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

13 Feb, 2015

3 commits

  • We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only
    on allocations, so there is no need to have the array entries set until
    css free - we can clear them on css offline. This will allow us to reuse
    array entries more efficiently and avoid costly array relocations.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Sometimes, we need to iterate over all memcg copies of a particular root
    kmem cache. Currently, we use memcg_cache_params->memcg_caches array for
    that, because it contains all existing memcg caches.

    However, it's a bad practice to keep all caches, including those that
    belong to offline cgroups, in this array, because it will be growing
    beyond any bounds then. I'm going to wipe away dead caches from it to
    save space. To still be able to perform iterations over all memcg caches
    of the same kind, let us link them into a list.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, kmem_cache stores a pointer to struct memcg_cache_params
    instead of embedding it. The rationale is to save memory when kmem
    accounting is disabled. However, the memcg_cache_params has shrivelled
    drastically since it was first introduced:

    * Initially:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct kmem_cache *memcg_caches[0];
                    struct {
                            struct mem_cgroup *memcg;
                            struct list_head list;
                            struct kmem_cache *root_cache;
                            bool dead;
                            atomic_t nr_pages;
                            struct work_struct destroy;
                    };
            };
    };

    * Now:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    struct {
                            struct rcu_head rcu_head;
                            struct kmem_cache *memcg_caches[0];
                    };
                    struct {
                            struct mem_cgroup *memcg;
                            struct kmem_cache *root_cache;
                    };
            };
    };

    So the memory saving does not seem to be a clear win anymore.

    OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
    it results in touching one more cache line on kmem alloc/free hot paths.
    Besides, it makes linking kmem caches in a list chained by a field of
    struct memcg_cache_params really painful due to a level of indirection,
    while I want to make them linked in the following patch. That said, let
    us embed it.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

11 Feb, 2015

2 commits

  • mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
    the given cgroup. Currently, it is only used on css free in order to
    destroy all caches corresponding to the memory cgroup being freed. The
    list is protected by memcg_slab_mutex. The mutex is also used to protect
    kmem_cache->memcg_params->memcg_caches arrays and synchronizes
    kmem_cache_destroy vs memcg_unregister_all_caches.

    However, we can perfectly get on without these two. To destroy all caches
    corresponding to a memory cgroup, we can walk over the global list of kmem
    caches, slab_caches, and we can do all the synchronization stuff using the
    slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
    of the memcg_slab_caches and memcg_slab_mutex.

    Apart from this nice cleanup, it also:

    - assures that rcu_barrier() is called once at max when a root cache is
    destroyed or a memory cgroup is freed, no matter how many caches have
    SLAB_DESTROY_BY_RCU flag set;

    - fixes the race between kmem_cache_destroy and kmem_cache_create that
    exists, because memcg_cleanup_cache_params, which is called from
    kmem_cache_destroy after checking that kmem_cache->refcount=0,
    releases the slab_mutex, which gives kmem_cache_create a chance to
    make an alias to a cache doomed to be destroyed.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Instead of passing the name of the memory cgroup which the cache is
    created for in the memcg_name_argument, let's obtain it immediately in
    memcg_create_kmem_cache.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 Dec, 2014

1 commit

  • Suppose task @t that belongs to a memory cgroup @memcg is going to
    allocate an object from a kmem cache @c. The copy of @c corresponding to
    @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory
    cgroup destruction we can access the memory cgroup's copy of the cache
    after it was destroyed:

    CPU0                                        CPU1
    ----                                        ----
    [ current=@t
      @mc->memcg_params->nr_pages=0 ]

    kmem_cache_alloc(@c):
      call memcg_kmem_get_cache(@c);
      proceed to allocation from @mc:
        alloc a page for @mc:
          ...
                                                move @t from @memcg
                                                destroy @memcg:
                                                  mem_cgroup_css_offline(@memcg):
                                                    memcg_unregister_all_caches(@memcg):
                                                      kmem_cache_destroy(@mc)

          add page to @mc

    We could fix this issue by taking a reference to a per-memcg cache, but
    that would require adding a per-cpu reference counter to per-memcg caches,
    which would look cumbersome.

    Instead, let's take a reference to a memory cgroup, which already has a
    per-cpu reference counter, in the beginning of kmem_cache_alloc to be
    dropped in the end, and move per memcg caches destruction from css offline
    to css free. As a side effect, per-memcg caches will be destroyed not one
    by one, but all at once when the last page accounted to the memory cgroup
    is freed. This doesn't sound like too high a price for code readability though.

    Note, this patch does add some overhead to the kmem_cache_alloc hot path,
    but it is pretty negligible - it's just a function call plus a per cpu
    counter decrement, which is comparable to what we already have in
    memcg_kmem_get_cache. Besides, it's only relevant if there are memory
    cgroups with kmem accounting enabled. I don't think we can find a way to
    handle this race w/o it, because alloc_page called from kmem_cache_alloc
    may sleep so we can't flush all pending kmallocs w/o reference counting.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

11 Dec, 2014

1 commit

  • Let's use generic slab_start/next/stop for showing memcg caches info. In
    contrast to the current implementation, this will work even if all memcg
    caches' info doesn't fit into a seq buffer (a page), plus it simply looks
    neater.

    Actually, the main reason I do this isn't mere cleanup. I'm going to zap
    the memcg_slab_caches list, because I find it useless provided we have the
    slab_caches list, and this patch is a step in this direction.

    It should be noted that before this patch an attempt to read
    memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted
    in -EIO, while after this patch it will silently show nothing except the
    header, but I don't think it will frustrate anyone.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

10 Oct, 2014

2 commits

  • Now, we track the caller if tracing or slab debugging is enabled. If they
    are disabled, we could save the overhead of passing one argument by
    calling __kmalloc(_node)(), but I think that it would be marginal.
    Furthermore, the default slab allocator, SLUB, doesn't use this technique,
    so I think that it's okay to change this situation.

    After this change, we can turn CONFIG_DEBUG_SLAB on/off without a full
    kernel build and remove some complicated '#if' definitions. It looks more
    beneficial to me.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We don't need to keep the kmem_cache definition in include/linux/slab.h if
    we don't need to inline kmem_cache_size(). According to my code
    inspection, this function is only called at lc_create() in
    lib/lru_cache.c, which may be called during the initialization phase of
    something, so we don't need to inline it. Therefore, move it to
    slab_common.c and move the kmem_cache definition to an internal header.

    After this change, we can change the kmem_cache definition easily without
    a full kernel build. For instance, we can turn CONFIG_SLUB_STATS on/off
    without a full kernel build.
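
    A hedged sketch of the now out-of-line helper as it would look in
    mm/slab_common.c (simplified):

      unsigned int kmem_cache_size(struct kmem_cache *s)
      {
              return s->object_size;
      }
      EXPORT_SYMBOL(kmem_cache_size);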

    [akpm@linux-foundation.org: export kmem_cache_size() to modules]
    [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors]
    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

5 commits

  • Current names are rather inconsistent. Let's try to improve them.

    Brief change log:

    ** old name **                       ** new name **

    kmem_cache_create_memcg              memcg_create_kmem_cache
    memcg_kmem_create_cache              memcg_register_cache
    memcg_kmem_destroy_cache             memcg_unregister_cache

    kmem_cache_destroy_memcg_children    memcg_cleanup_cache_params
    mem_cgroup_destroy_all_caches        memcg_unregister_all_caches

    create_work                          memcg_register_cache_work
    memcg_create_cache_work_func         memcg_register_cache_func
    memcg_create_cache_enqueue           memcg_schedule_register_cache

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Instead of calling back to memcontrol.c from kmem_cache_create_memcg in
    order to just create the name of a per memcg cache, let's allocate it in
    place. We only need to pass the memcg name to kmem_cache_create_memcg for
    that - everything else can be done in slab_common.c.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • At present, we have the following mutexes protecting data related to per
    memcg kmem caches:

    - slab_mutex. This one is held during the whole kmem cache creation
    and destruction paths. We also take it when updating per root cache
    memcg_caches arrays (see memcg_update_all_caches). As a result, taking
    it guarantees there will be no changes to any kmem cache (including per
    memcg). Why do we need something else then? The point is it is
    private to slab implementation and has some internal dependencies with
    other mutexes (get_online_cpus). So we just don't want to rely upon it
    and prefer to introduce additional mutexes instead.

    - activate_kmem_mutex. Initially it was added to synchronize
    initializing kmem limit (memcg_activate_kmem). However, since we can
    grow per root cache memcg_caches arrays only on kmem limit
    initialization (see memcg_update_all_caches), we also employ it to
    protect against memcg_caches arrays relocation (e.g. see
    __kmem_cache_destroy_memcg_children).

    - We have a convention not to take slab_mutex in memcontrol.c, but we
    want to walk over per memcg memcg_slab_caches lists there (e.g. for
    destroying all memcg caches on offline). So we have per memcg
    slab_caches_mutex's protecting those lists.

    The mutexes are taken in the following order:

    activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex

    Such a synchronization scheme has a number of flaws, for instance:

    - We can't call kmem_cache_{destroy,shrink} while walking over a
    memcg::memcg_slab_caches list due to locking order. As a result, in
    mem_cgroup_destroy_all_caches we schedule the
    memcg_cache_params::destroy work shrinking and destroying the cache.

    - We don't have a mutex to synchronize per memcg caches destruction
    between memcg offline (mem_cgroup_destroy_all_caches) and root cache
    destruction (__kmem_cache_destroy_memcg_children). Currently we just
    don't bother about it.

    This patch simplifies it by substituting per memcg slab_caches_mutex's
    with the global memcg_slab_mutex. It will be held whenever a new per
    memcg cache is created or destroyed, so it protects per root cache
    memcg_caches arrays and per memcg memcg_slab_caches lists. The locking
    order is following:

    activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex

    This allows us to call kmem_cache_{create,shrink,destroy} under the
    memcg_slab_mutex. As a result, we don't need memcg_cache_params::destroy
    work any more - we can simply destroy caches while iterating over a per
    memcg slab caches list.

    Also using the global mutex simplifies synchronization between concurrent
    per memcg caches creation/destruction, e.g. mem_cgroup_destroy_all_caches
    vs __kmem_cache_destroy_memcg_children.

    The downside of this is that we substitute per-memcg slab_caches_mutex's
    with a hammer-like global mutex, but since we already take either the
    slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
    shouldn't hurt concurrency a lot.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset is a part of preparations for kmemcg re-parenting. It aims
    at simplifying kmemcg work-flows and synchronization.

    First, it removes async per memcg cache destruction (see patches 1, 2).
    Now caches are only destroyed on memcg offline. That means the caches
    that are not empty on memcg offline will be leaked. However, they are
    already leaked, because memcg_cache_params::nr_pages normally never drops
    to 0 so the destruction work is never scheduled except kmem_cache_shrink
    is called explicitly. In the future I'm planning reaping such dead caches
    on vmpressure or periodically.

    Second, it substitutes per memcg slab_caches_mutex's with the global
    memcg_slab_mutex, which should be taken during the whole per memcg cache
    creation/destruction path before the slab_mutex (see patch 3). This
    greatly simplifies synchronization among various per memcg cache
    creation/destruction paths.

    I'm still not quite sure about the end picture, in particular I don't know
    whether we should reap dead memcgs' kmem caches periodically or try to
    merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for
    more details), but whichever way we choose, this set looks like a
    reasonable change to me, because it greatly simplifies kmemcg work-flows
    and eases further development.

    This patch (of 3):

    After a memcg is offlined, we mark its kmem caches that cannot be deleted
    right now due to pending objects as dead by setting the
    memcg_cache_params::dead flag, so that memcg_release_pages will schedule
    cache destruction (memcg_cache_params::destroy) as soon as the last slab
    of the cache is freed (memcg_cache_params::nr_pages drops to zero).

    I guess the idea was to destroy the caches as soon as possible, i.e.
    immediately after freeing the last object. However, it just doesn't work
    that way, because kmem caches always preserve some pages for the sake of
    performance, so that nr_pages never gets to zero unless the cache is
    shrunk explicitly using kmem_cache_shrink. Of course, we could account
    the total number of objects on the cache or check if all the slabs
    allocated for the cache are empty on kmem_cache_free and schedule
    destruction if so, but that would be too costly.

    Thus we have a piece of code that works only when we explicitly call
    kmem_cache_shrink, but complicates the whole picture a lot. Moreover,
    it's racy in fact. For instance, kmem_cache_shrink may free the last slab
    and thus schedule cache destruction before it finishes checking that the
    cache is empty, which can lead to use-after-free.

    So I propose to remove this async cache destruction from
    memcg_release_pages, and check if the cache is empty explicitly after
    calling kmem_cache_shrink instead. This will simplify things a lot w/o
    introducing any functional changes.

    And regarding dead memcg caches (i.e. those that are left hanging around
    after memcg offline for they have objects), I suppose we should reap them
    either periodically or on vmpressure as Glauber suggested initially. I'm
    going to implement this later.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently to allocate a page that should be charged to kmemcg (e.g.
    threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
    allocated is then to be freed by free_memcg_kmem_pages. Apart from
    looking asymmetrical, this also requires intrusion to the general
    allocation path. So let's introduce separate functions that will
    alloc/free pages charged to kmemcg.

    The new functions are called alloc_kmem_pages and free_kmem_pages. They
    should be used when the caller actually would like to use kmalloc, but
    has to fall back to the page allocator because the allocation is large.
    They only differ from alloc_pages and free_pages in that besides
    allocating or freeing pages they also charge them to the kmem resource
    counter of the current memory cgroup.
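
    A hedged usage sketch of the new helpers for a large, kmemcg-charged
    allocation ("big_alloc", "ptr" and "size" are illustrative; error handling
    trimmed):

      static void *big_alloc(size_t size, gfp_t gfp)
      {
              unsigned int order = get_order(size);
              struct page *page = alloc_kmem_pages(gfp, order);

              return page ? page_address(page) : NULL;
      }

      /* later, to release it: */
      free_kmem_pages((unsigned long)ptr, get_order(size));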

    [sfr@canb.auug.org.au: export kmalloc_order() to modules]
    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 Apr, 2014

1 commit

  • Pull slab changes from Pekka Enberg:
    "The biggest change is byte-sized freelist indices which reduces slab
    freelist memory usage:

    https://lkml.org/lkml/2013/12/2/64"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: slab/slub: use page->list consistently instead of page->lru
    mm/slab.c: cleanup outdated comments and unify variables naming
    slab: fix wrongly used macro
    slub: fix high order page allocation problem with __GFP_NOFAIL
    slab: Make allocations with GFP_ZERO slightly more efficient
    slab: make more slab management structure off the slab
    slab: introduce byte sized index for the freelist of a slab
    slab: restrict the number of objects in a slab
    slab: introduce helper functions to get/set free object
    slab: factor out calculate nr objects in cache_estimate

    Linus Torvalds
     

08 Apr, 2014

1 commit

  • Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
    memcg-only and except-for-memcg calls. To clean this up, let's move the
    code responsible for memcg cache creation to a separate function.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

01 Apr, 2014

1 commit

  • The commit 'slab: restrict the number of objects in a slab' uses
    __builtin_constant_p() in an #if macro. This is wrong usage of the
    builtin function, but it compiled on x86 without any problem, so I
    couldn't catch it before the 0-day build system found it.

    This commit fixes the situation by using KMALLOC_MIN_SIZE instead of
    KMALLOC_SHIFT_LOW. KMALLOC_SHIFT_LOW expands to ilog2() on some
    architectures, and this ilog2() uses __builtin_constant_p() and results
    in the problem. The problem disappears by using KMALLOC_MIN_SIZE, since
    it is just a constant.

    Tested-by: David Rientjes
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
     

11 Mar, 2014

1 commit

  • GFP_THISNODE is for callers that implement their own clever fallback to
    remote nodes. It restricts the allocation to the specified node and
    does not invoke reclaim, assuming that the caller will take care of it
    when the fallback fails, e.g. through a subsequent allocation request
    without GFP_THISNODE set.

    However, many current GFP_THISNODE users only want the node exclusive
    aspect of the flag, without actually implementing their own fallback or
    triggering reclaim if necessary. This results in things like page
    migration failing prematurely even when there is easily reclaimable
    memory available, unless kswapd happens to be running already or a
    concurrent allocation attempt triggers the necessary reclaim.

    Convert all callsites that don't implement their own fallback strategy
    to __GFP_THISNODE. This restricts the allocation to a single node too,
    but at the same time allows the allocator to enter the slowpath, wake
    kswapd, and invoke direct reclaim if necessary, to make the allocation
    happen when memory is full.
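
    A hedged before/after sketch of the conversion described above ("nid",
    "order" and "page" are illustrative):

      /* before: node-exclusive, but never wakes kswapd or reclaims */
      page = alloc_pages_node(nid, GFP_KERNEL | GFP_THISNODE, order);

      /* after: still node-exclusive, but may enter the slowpath and reclaim */
      page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, order);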

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Jan Stancek
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Feb, 2014

1 commit

  • To prepare for implementing a byte sized index for managing the freelist
    of a slab, we should restrict the number of objects in a slab to be less
    than or equal to 256, since a byte can only represent 256 different
    values. Setting the object size to a value equal to or greater than the
    newly introduced SLAB_OBJ_MIN_SIZE ensures that the number of objects in
    a slab is less than or equal to 256 for a slab with 1 page.

    If the page size is larger than 4096, the above assumption would be
    wrong. In this case, we would fall back on a 2 byte sized index.

    If the minimum size of kmalloc is less than 16, we use it as the minimum
    object size and give up this optimization.
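
    A hedged, approximate sketch of the constraint being described (for a
    4096-byte page, 4096 / 256 = 16 bytes is the smallest object size that
    keeps the object count within what one byte can index):

      /* objects per slab must fit in one byte: at most 256 of them */
      #define SLAB_OBJ_MIN_SIZE   (KMALLOC_MIN_SIZE < 16 ? \
                                   (KMALLOC_MIN_SIZE) : 16)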

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim