07 Sep, 2016

1 commit

  • Install the callbacks via the state machine.
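
    A minimal sketch of what installing the callbacks via the CPU hotplug
    state machine looks like; the state constant, the state name string
    and the callback names below are assumptions for illustration, not
    necessarily what this commit uses:

    #include <linux/cpuhotplug.h>

    static int slab_example_prepare_cpu(unsigned int cpu)
    {
            /* set up per-cpu data before @cpu comes online */
            return 0;
    }

    static int slab_example_dead_cpu(unsigned int cpu)
    {
            /* tear down per-cpu data after @cpu has gone offline */
            return 0;
    }

    static int __init slab_example_init(void)
    {
            /* CPUHP_SLAB_PREPARE and the "slab:prepare" name are assumed */
            return cpuhp_setup_state(CPUHP_SLAB_PREPARE, "slab:prepare",
                                     slab_example_prepare_cpu,
                                     slab_example_dead_cpu);
    }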

    Signed-off-by: Richard Weinberger
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Link: http://lkml.kernel.org/r/20160823125319.abeapfjapf2kfezp@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page
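
    As a rough sketch of the hook these patches add (simplified, not any
    particular architecture's exact code): each architecture's
    copy_to_user/copy_from_user path calls check_object_size() before
    copying, which for heap pointers lands in the new SLAB/SLUB
    object-bounds checks.

    #include <linux/uaccess.h>
    #include <linux/thread_info.h>

    static inline unsigned long
    example_copy_from_user(void *to, const void __user *from, unsigned long n)
    {
            /* verify the kernel buffer: reject copies that would span
             * slab objects, escape the stack frame, hit kernel text, ... */
            check_object_size(to, n, false);
            return __copy_from_user(to, from, n);
    }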

    Linus Torvalds
     

03 Aug, 2016

2 commits

  • There was only one use of __initdata_refok and __exit_refok

    __init_refok was used 46 times against 82 for __ref.

    Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
    section reference annotations tags: __ref, __refdata, __refconst")

    This patch removes the following compatibility definitions and replaces
    them treewide.

    /* compatibility defines */
    #define __init_refok __ref
    #define __initdata_refok __refdata
    #define __exit_refok __ref
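
    A typical replacement hunk then looks like the following (the function
    name is hypothetical):

    /* before */
    static int __init_refok foo_setup(void)
    {
            return 0;
    }

    /* after the treewide replacement */
    static int __ref foo_setup(void)
    {
            return 0;
    }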

    I can also provide separate patches if necessary (one patch per tree,
    then a check in a month or two to remove the old definitions).

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Ingo Molnar
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • The state of an object is currently tracked in two places: the shadow
    memory and the ->state field in struct kasan_alloc_meta. We can get
    rid of the latter, which will save us a little bit of memory. It also
    allows us to move the free stack into struct kasan_alloc_meta without
    increasing memory consumption. So now we will always know when the
    object was last freed, which may be useful for long-delayed
    use-after-free bugs.

    As a side effect this fixes following UBSAN warning:
    UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
    member access within misaligned address ffff88000d1efebc for type 'struct qlist_node'
    which requires 8 byte alignment
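
    A rough sketch of the metadata layout being described (the field and
    type names below are approximations for illustration, not the exact
    kernel definitions): the per-object allocation metadata keeps the
    allocation and free stack handles, while the alloc/quarantine/free
    state is derived from the shadow memory rather than a ->state field.

    struct kasan_track {
            u32 pid;                        /* task that touched the object */
            depot_stack_handle_t stack;     /* stack depot handle */
    };

    struct kasan_alloc_meta {
            struct kasan_track alloc_track; /* where it was allocated */
            struct kasan_track free_track;  /* where it was last freed */
            /* no ->state member: the state lives in shadow memory */
    };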

    Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com
    Reported-by: kernel test robot
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

27 Jul, 2016

5 commits

  • Use list_move() instead of list_del() + list_add() to avoid needlessly
    poisoning the next and prev values.
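
    For reference, list_move() is functionally the open-coded pair below,
    but skips writing the list poison values that list_del() would store
    and list_add() would immediately overwrite (the list and member names
    here are illustrative):

    /* before: list_del() poisons ->next/->prev for nothing */
    list_del(&page->lru);
    list_add(&page->lru, &n->slabs_free);

    /* after */
    list_move(&page->lru, &n->slabs_free);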

    Link: http://lkml.kernel.org/r/1468929772-9174-1-git-send-email-weiyj_lk@163.com
    Signed-off-by: Wei Yongjun
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yongjun
     
  • Both SLAB and SLUB BUG() when a caller provides an invalid gfp_mask.
    This is a rather harsh way to announce a non-critical issue. The
    allocator is free to ignore invalid flags. Let's simply replace the
    BUG() with dump_stack() to tell the offender, and fix up the mask so
    the allocation request can proceed.
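
    Roughly, the check becomes something like the following sketch (based
    on the warning text in the example below, not necessarily the exact
    final code):

    if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
            gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;

            flags &= ~GFP_SLAB_BUG_MASK;
            pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
                    invalid_mask, &invalid_mask, flags, &flags);
            dump_stack();
    }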

    This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test
    module:

    Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code!
    CPU: 0 PID: 2916 Comm: insmod Tainted: G O 4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Call Trace:
    dump_stack+0x67/0x90
    cache_alloc_refill+0x201/0x617
    kmem_cache_alloc_trace+0xa7/0x24a
    ? 0xffffffffa0005000
    mymodule_init+0x20/0x1000 [test_slab]
    do_one_initcall+0xe7/0x16c
    ? rcu_read_lock_sched_held+0x61/0x69
    ? kmem_cache_alloc_trace+0x197/0x24a
    do_init_module+0x5f/0x1d9
    load_module+0x1a3d/0x1f21
    ? retint_kernel+0x2d/0x2d
    SyS_init_module+0xe8/0x10e
    ? SyS_init_module+0xe8/0x10e
    do_syscall_64+0x68/0x13f
    entry_SYSCALL64_slow_path+0x25/0x25

    Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • printk has offered %pGg for quite some time, so let's use it to get a
    human-readable list of the invalid flags.

    The original output would be
    [ 429.191962] gfp: 2

    after the change
    [ 429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM)

    Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The kernel heap allocators use a sequential freelist, making their
    allocations predictable. This predictability makes kernel heap
    overflows easier to exploit: an attacker can carefully prepare the
    kernel heap to control which chunk follows the one being overflowed.

    For example these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

    ***Problems that needed solving:
    - Randomize the freelist (singly linked) used in the SLUB allocator.
    - Ensure good performance to encourage usage.
    - Get the best possible entropy in the early boot stage.

    ***Parts:
    - 01/02 Reorganize the SLAB freelist randomization to share elements
    with the SLUB implementation.
    - 02/02 The SLUB freelist randomization implementation. A similar
    approach to the SLAB one, but tailored to the singly linked freelist
    used in SLUB.

    ***Performance data:

    The slab_test impact is between 3% and 4% on average for 100000
    attempts without SMP. It is a very focused test; kernbench shows that
    the overall impact on the system is much lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    This patch (of 2):

    This commit reorganizes the previous SLAB freelist randomization to
    prepare for the SLUB implementation. It moves functions that will be
    shared to slab_common.

    The entropy functions are changed to align with the SLUB
    implementation, now using the get_random_(int|long) functions. These
    functions were chosen because they provide a bit more entropy early
    in boot and better performance when specific arch instructions are
    not available.

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     
  • Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
    SLAB allocator to catch any copies that may span objects.

    Based on code from PaX and grsecurity.

    Signed-off-by: Kees Cook
    Tested-by: Valdis Kletnieks

    Kees Cook
     

21 May, 2016

2 commits

  • Instead of calling kasan_krealloc(), which replaces the memory
    allocation stack ID (if stack depot is used), just unpoison the whole
    memory chunk.

    Signed-off-by: Alexander Potapenko
    Acked-by: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Quarantine isolates freed objects in a separate queue. The objects are
    returned to the allocator later, which helps to detect use-after-free
    errors.

    When the object is freed, its state changes from KASAN_STATE_ALLOC to
    KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
    instead of being returned to the allocator, therefore every subsequent
    access to that object triggers a KASAN error, and the error handler is
    able to say where the object has been allocated and deallocated.

    When it's time for the object to leave quarantine, its state becomes
    KASAN_STATE_FREE and it's returned to the allocator. From now on the
    allocator may reuse it for another allocation. Before that happens,
    it's still possible to detect a use-after-free on that object (it
    retains the allocation/deallocation stacks).

    When the allocator reuses this object, the shadow is unpoisoned and old
    allocation/deallocation stacks are wiped. Therefore a use of this
    object, even an incorrect one, won't trigger an ASan warning.

    Without the quarantine, it's not guaranteed that the objects aren't
    reused immediately; that's why the probability of catching a
    use-after-free is lower than with the quarantine in place.

    Freed objects are first added to per-cpu quarantine queues. When a
    cache is destroyed or memory shrinking is requested, the objects are
    moved into the global quarantine queue. Whenever a kmalloc call allows
    memory reclaiming, the oldest objects are popped out of the global queue
    until the total size of objects in quarantine is less than 3/4 of the
    maximum quarantine size (which is a fraction of installed physical
    memory).

    As long as an object remains in the quarantine, KASAN is able to report
    accesses to it, so the chance of reporting a use-after-free is
    increased. Once the object leaves quarantine, the allocator may reuse
    it, in which case the object is unpoisoned and KASAN can't detect
    incorrect accesses to it.

    Right now quarantine support is only enabled for the SLAB allocator.
    Unification of KASAN features in SLAB and SLUB will be done later.

    This patch is based on the "mm: kasan: quarantine" patch originally
    prepared by Dmitry Chernenkov. A number of improvements have been
    suggested by Andrey Ryabinin.
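
    To make the mechanism concrete, here is a small self-contained
    user-space sketch of the idea (an illustration of the scheme, not the
    kernel implementation): frees go into a FIFO, and the oldest entries
    are only really released once the quarantine grows past its limit,
    draining back down to 3/4 of it.

    #include <stdlib.h>
    #include <stddef.h>

    struct qnode { struct qnode *next; size_t size; };

    static struct qnode *q_head, *q_tail;      /* FIFO of quarantined objects */
    static size_t q_bytes, q_limit = 1 << 20;  /* quarantine limit, e.g. 1 MiB */

    static void quarantine_drain(size_t target)
    {
            while (q_head && q_bytes > target) {
                    struct qnode *oldest = q_head;

                    q_head = oldest->next;
                    if (!q_head)
                            q_tail = NULL;
                    q_bytes -= oldest->size;
                    free(oldest);           /* object finally returned */
            }
    }

    /* "Free" a malloc()ed object: keep it quarantined instead of releasing
     * it.  Assumes size >= sizeof(struct qnode) so the object's own memory
     * can carry the FIFO link. */
    static void quarantine_put(void *obj, size_t size)
    {
            struct qnode *n = obj;

            n->next = NULL;
            n->size = size;
            if (q_tail)
                    q_tail->next = n;
            else
                    q_head = n;
            q_tail = n;
            q_bytes += size;
            if (q_bytes > q_limit)
                    quarantine_drain(q_limit / 4 * 3);
    }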

    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

20 May, 2016

14 commits

  • Lots of code does

    node = next_node(node, XXX);
    if (node == MAX_NUMNODES)
            node = first_node(XXX);

    so create next_node_in() to do this and use it in various places.
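
    A sketch of the resulting helper, close to what the nodemask code ends
    up providing (the in-tree version is an out-of-line __next_node_in()
    wrapped by a next_node_in() macro; details may differ):

    /*
     * Return the next set node after @node in @srcp, wrapping back to
     * the first set node when the end of the mask is reached.
     */
    int __next_node_in(int node, const nodemask_t *srcp)
    {
            int ret = __next_node(node, srcp);

            if (ret == MAX_NUMNODES)
                    ret = __first_node(srcp);
            return ret;
    }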

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Now that we have the IS_ENABLED() helper to check whether a Kconfig
    option is enabled, ZONE_DMA_FLAG no longer seems useful.

    Moreover, the use of ZONE_DMA_FLAG in slab looks pointless according
    to the comment [1] from Johannes Weiner, so remove it; ORing the
    passed-in flags with the cache gfp flags is already done in
    kmem_getpages().

    [1] https://lkml.org/lkml/2014/9/25/553
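
    For reference, IS_ENABLED(CONFIG_FOO) evaluates to 1 when the option
    is 'y' or 'm', so an ordinary C condition can replace a helper define
    like ZONE_DMA_FLAG (the function below is hypothetical):

    /* The branch is compiled out entirely when CONFIG_ZONE_DMA is off. */
    if (IS_ENABLED(CONFIG_ZONE_DMA) && (flags & GFP_DMA))
            handle_dma_request();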

    Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
    the SLAB freelist. The list is randomized during initialization of a
    new set of pages. The order for the different freelist sizes is
    pre-computed at boot for performance. Each kmem_cache has its own
    randomized freelist. Before the pre-computed lists are available,
    freelists are generated dynamically. This security feature reduces the
    predictability of the kernel SLAB allocator against heap overflows,
    making attacks much less reliable.

    For example this attack against SLUB (also applicable against SLAB)
    would be affected:

    https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/

    Also, since v4.6 the freelist has been moved to the end of the slab.
    This means a controllable heap is open to new attacks not yet publicly
    discussed: a kernel heap overflow can be transformed into multiple
    use-after-frees. This feature makes that type of attack harder too.

    To generate entropy, we use get_random_bytes_arch because 0 bits of
    entropy are available at this boot stage. In the worst case this
    function will fall back to the get_random_bytes sub-API. We also
    generate a random shift to offset the pre-computed freelist for each
    new set of pages.

    The config option name is not specific to SLAB, as this approach will
    be extended to other allocators such as SLUB.
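
    A small self-contained sketch of the core idea (user-space C for
    illustration, not the kernel code): precompute a shuffled array of
    object indexes with a Fisher-Yates pass, which is what makes the
    order of the freelist unpredictable.

    #include <stdio.h>
    #include <stdlib.h>

    static unsigned int prng(void)
    {
            /* stand-in entropy source; the kernel uses the get_random_* APIs */
            return (unsigned int)rand();
    }

    /* Precompute a randomized order of object indexes (Fisher-Yates). */
    static void shuffle_freelist(unsigned int *list, unsigned int count)
    {
            unsigned int i, j, tmp;

            for (i = 0; i < count; i++)
                    list[i] = i;
            for (i = count - 1; i > 0; i--) {
                    j = prng() % (i + 1);
                    tmp = list[i];
                    list[i] = list[j];
                    list[j] = tmp;
            }
    }

    int main(void)
    {
            unsigned int list[16], i;

            srand(1);
            shuffle_freelist(list, 16);
            for (i = 0; i < 16; i++)
                    printf("%u ", list[i]);
            printf("\n");
            return 0;
    }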

    Performance results highlighted no major changes:

    Hackbench (running 90 10 times):

    Before average: 0.0698
    After average: 0.0663 (-5.01%)

    slab_test: 1 run on boot. A difference is only seen on the 2048-size
    test, which is the worst case scenario covered by freelist
    randomization: new slab pages are constantly being created over the
    10000 allocations. Variance should be mainly due to getting new pages
    every few allocations.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
    10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
    10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
    10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
    10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
    10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
    10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
    10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
    10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
    10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
    10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
    10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 121 cycles
    10000 times kmalloc(64)/kfree -> 121 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 121 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
    10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
    10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
    10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
    10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
    10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
    10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
    10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
    10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
    10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
    10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
    10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 123 cycles
    10000 times kmalloc(64)/kfree -> 142 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 119 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
    Signed-off-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Greg Thelen
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     
  • To check precisely whether free objects exist or not, we need to grab
    a lock. But accuracy isn't that important here because the race
    window is small, and if there are too many free objects the cache
    reaper will reap them. So, this patch makes the check for free object
    existence not hold a lock. This will reduce lock contention in the
    heavy allocation case.

    Note that until now n->shared could be freed during the processing by
    writing to slabinfo, but, with a trick in this patch, we can access
    it freely within an interrupt-disabled period.

    Below is the result of concurrent allocation/free in a slab allocation
    benchmark made by Christoph a long time ago. I made the output
    simpler. The numbers show the cycle counts during alloc/free
    respectively, so less is better.

    * Before
    Kmalloc N*alloc N*free(32): Average=248/966
    Kmalloc N*alloc N*free(64): Average=261/949
    Kmalloc N*alloc N*free(128): Average=314/1016
    Kmalloc N*alloc N*free(256): Average=741/1061
    Kmalloc N*alloc N*free(512): Average=1246/1152
    Kmalloc N*alloc N*free(1024): Average=2437/1259
    Kmalloc N*alloc N*free(2048): Average=4980/1800
    Kmalloc N*alloc N*free(4096): Average=9000/2078

    * After
    Kmalloc N*alloc N*free(32): Average=344/792
    Kmalloc N*alloc N*free(64): Average=347/882
    Kmalloc N*alloc N*free(128): Average=390/959
    Kmalloc N*alloc N*free(256): Average=393/1067
    Kmalloc N*alloc N*free(512): Average=683/1229
    Kmalloc N*alloc N*free(1024): Average=1295/1325
    Kmalloc N*alloc N*free(2048): Average=2513/1664
    Kmalloc N*alloc N*free(4096): Average=4742/2172

    It shows that allocation performance decreases for object sizes up to
    128 bytes, which may be due to the extra checks in
    cache_alloc_refill(). But, considering the improvement in free
    performance, the net result looks about the same. Results for the
    other size classes look very promising: roughly a 50% performance
    improvement.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Until now, growing a cache put a free slab on the node's slab list,
    from which we could then allocate free objects. This necessarily
    requires holding a node lock, which is heavily contended. If we refill
    the cpu cache before attaching the slab to the node's slab list, we
    can avoid holding the node lock as much as possible, because the newly
    allocated slab is only visible to the current task. This will reduce
    lock contention.

    Below is the result of concurrent allocation/free in a slab allocation
    benchmark made by Christoph a long time ago. I made the output
    simpler. The numbers show the cycle counts during alloc/free
    respectively, so less is better.

    * Before
    Kmalloc N*alloc N*free(32): Average=355/750
    Kmalloc N*alloc N*free(64): Average=452/812
    Kmalloc N*alloc N*free(128): Average=559/1070
    Kmalloc N*alloc N*free(256): Average=1176/980
    Kmalloc N*alloc N*free(512): Average=1939/1189
    Kmalloc N*alloc N*free(1024): Average=3521/1278
    Kmalloc N*alloc N*free(2048): Average=7152/1838
    Kmalloc N*alloc N*free(4096): Average=13438/2013

    * After
    Kmalloc N*alloc N*free(32): Average=248/966
    Kmalloc N*alloc N*free(64): Average=261/949
    Kmalloc N*alloc N*free(128): Average=314/1016
    Kmalloc N*alloc N*free(256): Average=741/1061
    Kmalloc N*alloc N*free(512): Average=1246/1152
    Kmalloc N*alloc N*free(1024): Average=2437/1259
    Kmalloc N*alloc N*free(2048): Average=4980/1800
    Kmalloc N*alloc N*free(4096): Average=9000/2078

    It shows that contention is reduced for all the object sizes and
    performance increases by 30 ~ 40%.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is a preparation step to implement a lockless allocation path for
    when there are no free objects in the kmem_cache.

    What we'd like to do here is to refill the cpu cache without holding
    the node lock. To accomplish this, the refill should be done after the
    new slab allocation but before attaching the slab to the management
    list. So, this patch separates cache_grow() into two parts, allocation
    and attaching to the list, in order to add some code in between them
    in the following patch.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, cache_grow() assumes that the allocated page's nodeid is
    the same as the nodeid parameter used for the allocation request. If
    we discard this assumption, we can handle the fallback_alloc() case
    gracefully. So, this patch makes cache_grow() handle a page allocated
    on an arbitrary node and cleans up the relevant code.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The slab color doesn't need to be changed in a strictly accurate way.
    Since locking to change the slab color could cause more lock
    contention, this patch implements racy access and modification of the
    slab color. This is a preparation step toward a lockless allocation
    path for when there are no free objects in the kmem_cache.

    Below is the result of concurrent allocation/free in a slab allocation
    benchmark made by Christoph a long time ago. I made the output
    simpler. The numbers show the cycle counts during alloc/free
    respectively, so less is better.

    * Before
    Kmalloc N*alloc N*free(32): Average=365/806
    Kmalloc N*alloc N*free(64): Average=452/690
    Kmalloc N*alloc N*free(128): Average=736/886
    Kmalloc N*alloc N*free(256): Average=1167/985
    Kmalloc N*alloc N*free(512): Average=2088/1125
    Kmalloc N*alloc N*free(1024): Average=4115/1184
    Kmalloc N*alloc N*free(2048): Average=8451/1748
    Kmalloc N*alloc N*free(4096): Average=16024/2048

    * After
    Kmalloc N*alloc N*free(32): Average=355/750
    Kmalloc N*alloc N*free(64): Average=452/812
    Kmalloc N*alloc N*free(128): Average=559/1070
    Kmalloc N*alloc N*free(256): Average=1176/980
    Kmalloc N*alloc N*free(512): Average=1939/1189
    Kmalloc N*alloc N*free(1024): Average=3521/1278
    Kmalloc N*alloc N*free(2048): Average=7152/1838
    Kmalloc N*alloc N*free(4096): Average=13438/2013

    It shows that contention is reduced for object size >= 1024 and
    performance increases by roughly 15%.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, the decision to free a slab is made whenever each freed
    object is put back into the slab. This has the following problem.

    Assume free_limit = 10 and nr_free = 9.

    Frees happen in the following sequence, and nr_free changes as
    follows:

    first free (its slab becomes a free slab): nr_free 9 -> 10
    second free (its slab does not become a free slab): nr_free 10 -> 11

    If we check whether we can free the current slab on each object free,
    we can't free any slab in this situation, because the current slab
    isn't a free slab when nr_free exceeds free_limit (at the second
    free), even though a free slab exists.

    However, if we check at the end, we can free one free slab.

    This problem can cause the slab subsystem to keep too much memory.
    This patch tries to fix it by checking the number of free objects
    after all the free work is done. If there is a free slab at that time,
    we free as many slabs as possible, keeping the number of free slabs
    minimal.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is mostly the same code for setting up a kmem_cache_node in
    cpuup_prepare() and in alloc_kmem_cache_node(). Factor it out and
    clean it up.

    Signed-off-by: Joonsoo Kim
    Tested-by: Nishanth Menon
    Tested-by: Jon Hunter
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It can be reused in other places, so factor it out. A following patch
    will use it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • slabs_tofree() implies freeing all free slabs. We can do that by just
    passing INT_MAX.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The initial attempt to remove BAD_ALIEN_MAGIC was reverted by commit
    edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"") because it
    caused a problem on m68k, which has many nodes but !CONFIG_NUMA. In
    that case, although the alien cache isn't used at all, a garbage value
    is needed to cope with some initialization paths, and that value is
    BAD_ALIEN_MAGIC. Now, this patch sets use_alien_caches to 0 when
    !CONFIG_NUMA; there is no initialization path problem, so we don't
    need BAD_ALIEN_MAGIC at all. So remove it.

    Signed-off-by: Joonsoo Kim
    Tested-by: Geert Uytterhoeven
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • While processing concurrent allocations, SLAB can be contended a lot
    because it does a lot of work while holding a lock. This patchset
    tries to reduce the number of critical sections to reduce lock
    contention. The major changes are a lockless decision to allocate
    more slabs and a lockless cpu cache refill from the newly allocated
    slab.

    Below is the result of concurrent allocation/free in a slab allocation
    benchmark made by Christoph a long time ago. I made the output
    simpler. The numbers show the cycle counts during alloc/free
    respectively, so less is better.

    * Before
    Kmalloc N*alloc N*free(32): Average=365/806
    Kmalloc N*alloc N*free(64): Average=452/690
    Kmalloc N*alloc N*free(128): Average=736/886
    Kmalloc N*alloc N*free(256): Average=1167/985
    Kmalloc N*alloc N*free(512): Average=2088/1125
    Kmalloc N*alloc N*free(1024): Average=4115/1184
    Kmalloc N*alloc N*free(2048): Average=8451/1748
    Kmalloc N*alloc N*free(4096): Average=16024/2048

    * After
    Kmalloc N*alloc N*free(32): Average=344/792
    Kmalloc N*alloc N*free(64): Average=347/882
    Kmalloc N*alloc N*free(128): Average=390/959
    Kmalloc N*alloc N*free(256): Average=393/1067
    Kmalloc N*alloc N*free(512): Average=683/1229
    Kmalloc N*alloc N*free(1024): Average=1295/1325
    Kmalloc N*alloc N*free(2048): Average=2513/1664
    Kmalloc N*alloc N*free(4096): Average=4742/2172

    It shows that performance improves greatly (roughly more than 50%)
    for object classes whose size is more than 128 bytes.

    This patch (of 11):

    If we hold neither the slab_mutex nor the node lock, the node's shared
    array cache could be freed and re-populated. If __kmem_cache_shrink()
    is called at the same time, it will call drain_array() with n->shared
    without holding the node lock, so a problem can happen. This patch
    fixes the situation by holding the node lock before trying to drain
    the shared array.

    In addition, add a debug check to confirm that no n->shared access
    race exists.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

26 Mar, 2016

2 commits

  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Add KASAN hooks to SLAB allocator.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

18 Mar, 2016

4 commits

  • Most of the mm subsystem uses pr_<level>, so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • THP defrag is enabled by default to direct reclaim/compact but not wake
    kswapd in the event of a THP allocation failure. The problem is that
    THP allocation requests potentially enter reclaim/compaction. This
    potentially incurs a severe stall that is not guaranteed to be offset by
    reduced TLB misses. While there has been considerable effort to reduce
    the impact of reclaim/compaction, it is still a high cost and workloads
    that should fit in memory fail to do so. Specifically, a simple
    anon/file streaming workload will enter direct reclaim on NUMA at least
    even though the working set size is 80% of RAM. It's been years and
    it's time to throw in the towel.

    First, this patch defines THP defrag as follows;

    madvise: A failed allocation will direct reclaim/compact if the application requests it
    never: Neither reclaim/compact nor wake kswapd
    defer: A failed allocation will wake kswapd/kcompactd
    always: A failed allocation will direct reclaim/compact (historical behaviour)
    khugepaged defrag will enter direct reclaim but not wake kswapd.

    Next it sets the default defrag option to be "madvise" to only enter
    direct reclaim/compaction for applications that specifically requested
    it.

    Lastly, it removes a check from the page allocator slowpath that is
    related to __GFP_THISNODE to allow "defer" to work. The callers that
    really care are slub/slab, and they are updated accordingly. The slab
    one may be surprising because it also corrects a comment, as kswapd
    was never woken up by that path.

    This means that by default a THP fault will no longer stall for most
    applications, which is the ideal for most users: they get THP if it
    is immediately available. There are still options for users that
    prefer a stall at startup of a new application, by either restoring
    the historical behaviour with "always" or picking a half-way point
    with "defer", where kswapd does some of the work in the background
    and wakes kcompactd if necessary. THP defrag for khugepaged remains
    enabled and will enter direct reclaim but not wake kswapd or
    kcompactd.

    After this patch a THP allocation failure will quickly fall back and
    rely on khugepaged to recover the situation at some time in the
    future. In some cases this will reduce THP usage, but the benefit of
    THP is hard to measure and not a universal win, whereas a stall for
    reclaim/compaction is definitely measurable and can be painful.

    The first test for this is using "usemem" to read a large file and write
    a large anonymous mapping (to avoid the zero page) multiple times. The
    total size of the mappings is 80% of RAM and the benchmark simply
    measures how long it takes to complete. It uses multiple threads to see
    if that is a factor. On UMA, the performance is almost identical so is
    not reported but on NUMA, we see this

    usemem
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
    Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
    Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
    Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
    Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
    Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
    Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
    Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
    Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
    Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
    Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
    Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
    Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
    Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)

    For a single thread, the benchmark completes 43.23% faster with this
    patch applied, with smaller benefits as the thread count increases.
    Similarly, notice the large reduction in system CPU usage in most
    cases. The overall CPU time is

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 10357.65 10438.33
    System 3988.88 3543.94
    Elapsed 2203.01 1634.41

    Which is substantial. Now, the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 128458477 278352931
    Major Faults 2174976 225
    Swap Ins 16904701 0
    Swap Outs 17359627 0
    Allocation stalls 43611 0
    DMA allocs 0 0
    DMA32 allocs 19832646 19448017
    Normal allocs 614488453 580941839
    Movable allocs 0 0
    Direct pages scanned 24163800 0
    Kswapd pages scanned 0 0
    Kswapd pages reclaimed 0 0
    Direct pages reclaimed 20691346 0
    Compaction stalls 42263 0
    Compaction success 938 0
    Compaction failures 41325 0

    This patch eliminates almost all swapping and direct reclaim activity.
    There is still overhead but it's from NUMA balancing which does not
    identify that it's pointless trying to do anything with this workload.

    I also tried the thpscale benchmark which forces a corner case where
    compaction can be used heavily and measures the latency of whether base
    or huge pages were used

    thpscale Fault Latencies
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
    Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
    Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
    Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
    Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
    Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
    Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
    Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
    Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
    Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
    Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
    Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
    Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
    Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
    Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
    Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
    Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
    Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)

    The average time to fault pages is substantially reduced in the
    majority of cases, but with the obvious caveat that fewer THPs are
    actually used in this adverse workload

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
    Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
    Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
    Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
    Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
    Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
    Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
    Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
    Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 37429143 47564000
    Major Faults 1916 1558
    Swap Ins 1466 1079
    Swap Outs 2936863 149626
    Allocation stalls 62510 3
    DMA allocs 0 0
    DMA32 allocs 6566458 6401314
    Normal allocs 216361697 216538171
    Movable allocs 0 0
    Direct pages scanned 25977580 17998
    Kswapd pages scanned 0 3638931
    Kswapd pages reclaimed 0 207236
    Direct pages reclaimed 8833714 88
    Compaction stalls 103349 5
    Compaction success 270 4
    Compaction failures 103079 1

    Note again that while this does swap as it's an aggressive workload,
    the direct reclaim activity and allocation stalls are substantially
    reduced.
    There is some kswapd activity but ftrace showed that the kswapd activity
    was due to normal wakeups from 4K pages being allocated.
    Compaction-related stalls and activity are almost eliminated.

    I also tried the stutter benchmark. For this, I do not have figures for
    NUMA but it's something that does impact UMA so I'll report what is
    available

    stutter
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
    1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
    2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
    3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
    Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
    Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
    Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
    Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
    Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
    Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)

    This benchmark is trying to fault an anonymous mapping while there is a
    heavy IO load -- a scenario that desktop users used to complain about
    frequently. This shows a mix because the ideal case of mapping with THP
    is not hit as often. However, note that 99% of the mappings complete
    13.79% faster. The CPU usage here is particularly interesting

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 67.50 0.99
    System 1327.88 91.30
    Elapsed 2079.00 2128.98

    And once again we look at the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 335241922 1314582827
    Major Faults 715 819
    Swap Ins 0 0
    Swap Outs 0 0
    Allocation stalls 532723 0
    DMA allocs 0 0
    DMA32 allocs 1822364341 1177950222
    Normal allocs 1815640808 1517844854
    Movable allocs 0 0
    Direct pages scanned 21892772 0
    Kswapd pages scanned 20015890 41879484
    Kswapd pages reclaimed 19961986 41822072
    Direct pages reclaimed 21892741 0
    Compaction stalls 1065755 0
    Compaction success 514 0
    Compaction failures 1065241 0

    Allocation stalls and all direct reclaim activity is eliminated as well
    as compaction-related stalls.

    THP gives impressive gains in some cases but only if they are quickly
    available. We're not going to reach the point where they are
    completely free, so let's take the costs out of the fast paths
    finally and defer the cost to kswapd, kcompactd and khugepaged, where
    it belongs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Show how much memory is used for storing reclaimable and unreclaimable
    in-kernel data structures allocated from slab caches.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Mar, 2016

9 commits

  • We can now print gfp_flags in a more human-readable form. Make use of
    this in slab_out_of_memory() for SLUB and SLAB. Also convert the SLAB
    variant to pr_warn() along the way.

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The current implementation of pfmemalloc handling in SLAB has some
    problems.

    1) pfmemalloc_active is set to true when there are one or more
    pfmemalloc slabs in the system, but it is cleared when there is no
    pfmemalloc slab in one arbitrary kmem_cache. So, pfmemalloc_active
    could be wrongly cleared.

    2) The partial and free lists aren't searched when no non-pfmemalloc
    object is found in the cpu cache. Instead, a new slab is allocated,
    which is not optimal.

    3) Even after sk_memalloc_socks() is disabled, the cpu cache would
    keep pfmemalloc objects tagged with SLAB_OBJ_PFMEMALLOC. The tag isn't
    cleared when sk_memalloc_socks() is disabled, so it could cause
    problems.

    4) If the cpu cache is filled with pfmemalloc objects, it would slow
    down non-pfmemalloc allocations.

    To me, the current pointer tagging approach looks complex and fragile,
    so this patch re-implements the whole thing instead of fixing the
    problems one by one.

    The design principles for the new implementation are:

    1) Don't disrupt non-pfmemalloc allocations in the fast path even if
    sk_memalloc_socks() is enabled; that is the more likely case than a
    pfmemalloc allocation.

    2) Ensure that a pfmemalloc slab is used only for pfmemalloc
    allocations.

    3) Don't worry about the performance of pfmemalloc allocations in a
    memory-deficient state.

    As a result, all pfmemalloc alloc/free in a memory-tight state will be
    handled in the slow path. If there is a non-pfmemalloc free object, it
    will be returned first, even to a pfmemalloc user, in the fast path,
    so that the performance of pfmemalloc users isn't affected in the
    normal case and pfmemalloc objects are kept as long as possible.

    Signed-off-by: Joonsoo Kim
    Tested-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Returning values by reference is bad practice. Instead, just use the
    function return value.

    Signed-off-by: Joonsoo Kim
    Suggested-by: Christoph Lameter
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • SLAB needs an array to manage the freed objects in a slab. It is only
    used if some objects are freed, so we can use a free object itself as
    this array. This requires an additional branch in a somewhat critical
    lock path to check whether it is the first freed object or not, but
    that's all we need. The benefit is that we save extra memory and
    avoid some of the computational overhead of allocating a management
    array when a new slab is created.

    The code change is rather more complex than one might expect from the
    idea, in order to handle the debugging feature efficiently. If you
    want to see the core idea only, please remove the '#if DEBUG' block
    in the patch.

    Although this idea could apply to all caches whose size is larger
    than the management array size, it isn't applied to caches which have
    a constructor. If such a cache's object were used as the management
    array, the constructor would have to be called on it before that
    object is returned to the user. I guess that the overhead would
    overwhelm the benefit in that case, so this idea isn't applied to
    them, at least for now.

    In summary, from now on, the slab management type is determined by the
    following logic (see the sketch after this list):

    1) if the management array size is smaller than the object size and
    there is no ctor, it becomes OBJFREELIST_SLAB.

    2) if the management array size is smaller than the leftover, it
    becomes NORMAL_SLAB, which uses the leftover as the array.

    3) if OFF_SLAB saves more memory than way 4), it becomes OFF_SLAB.
    It allocates the management array from another cache, so some memory
    waste happens.

    4) otherwise it becomes NORMAL_SLAB. It uses dedicated internal
    memory in a slab as the management array, so it causes memory waste.
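
    The decision order above, expressed as a compact sketch (illustrative
    only, not the kernel's actual code):

    #include <stddef.h>

    enum slab_mgmt { OBJFREELIST_SLAB, NORMAL_LEFTOVER, OFF_SLAB, NORMAL_SLAB };

    static enum slab_mgmt pick_mgmt_type(size_t freelist_size, size_t obj_size,
                                         size_t leftover, int has_ctor,
                                         int off_slab_saves_memory)
    {
            if (freelist_size < obj_size && !has_ctor)
                    return OBJFREELIST_SLAB;   /* reuse a freed object */
            if (freelist_size <= leftover)
                    return NORMAL_LEFTOVER;    /* reuse the slab leftover */
            if (off_slab_saves_memory)
                    return OFF_SLAB;           /* array from another cache */
            return NORMAL_SLAB;                /* dedicated in-slab memory */
    }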

    On my system, without CONFIG_DEBUG_SLAB enabled, almost all caches
    become OBJFREELIST_SLAB or NORMAL_SLAB (using leftover), which doesn't
    waste memory. Following is the number of caches with each slab
    management type.

    TOTAL = OBJFREELIST + NORMAL(leftover) + NORMAL + OFF

    /Before/
    126 = 0 + 60 + 25 + 41

    /After/
    126 = 97 + 12 + 15 + 2

    The result shows that the number of caches that don't waste memory
    increases from 60 to 109.

    I did some benchmarking and it looks like the benefits outweigh the
    losses.

    Kmalloc: Repeatedly allocate then free test

    /Before/
    [ 0.286809] 1. Kmalloc: Repeatedly allocate then free test
    [ 1.143674] 100000 times kmalloc(32) -> 116 cycles kfree -> 78 cycles
    [ 1.441726] 100000 times kmalloc(64) -> 121 cycles kfree -> 80 cycles
    [ 1.815734] 100000 times kmalloc(128) -> 168 cycles kfree -> 85 cycles
    [ 2.380709] 100000 times kmalloc(256) -> 287 cycles kfree -> 95 cycles
    [ 3.101153] 100000 times kmalloc(512) -> 370 cycles kfree -> 117 cycles
    [ 3.942432] 100000 times kmalloc(1024) -> 413 cycles kfree -> 156 cycles
    [ 5.227396] 100000 times kmalloc(2048) -> 622 cycles kfree -> 248 cycles
    [ 7.519793] 100000 times kmalloc(4096) -> 1102 cycles kfree -> 452 cycles

    /After/
    [ 1.205313] 100000 times kmalloc(32) -> 117 cycles kfree -> 78 cycles
    [ 1.510526] 100000 times kmalloc(64) -> 124 cycles kfree -> 81 cycles
    [ 1.827382] 100000 times kmalloc(128) -> 130 cycles kfree -> 84 cycles
    [ 2.226073] 100000 times kmalloc(256) -> 177 cycles kfree -> 92 cycles
    [ 2.814747] 100000 times kmalloc(512) -> 286 cycles kfree -> 112 cycles
    [ 3.532952] 100000 times kmalloc(1024) -> 344 cycles kfree -> 141 cycles
    [ 4.608777] 100000 times kmalloc(2048) -> 519 cycles kfree -> 210 cycles
    [ 6.350105] 100000 times kmalloc(4096) -> 789 cycles kfree -> 391 cycles

    In fact, I tested another idea: implementing OBJFREELIST_SLAB with an
    extendable linked array threaded through another freed object. It can
    remove the memory waste completely, but it causes more computational
    overhead in the critical lock path and it seems that the overhead
    outweighs the benefit. So, this patch doesn't include it.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • cache_init_objs() will be changed in a following patch and its current
    form doesn't fit that change well. So, before doing that, this patch
    separates out the debugging initialization. This causes two loop
    iterations when debugging is enabled, but this overhead seems light
    compared to the debug feature itself, so the effect may not be
    visible. This patch will greatly simplify the changes to
    cache_init_objs() in the following patch.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The slab list should be fixed up after an object is detached from the
    slab, and this happens in two places. They do exactly the same thing
    and will be changed in the following patch, so, to reduce code
    duplication, this patch factors them out into a common function.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • To become an off-slab cache, there are some constraints to avoid
    bootstrapping problems and recursive calls. These can be avoided
    differently by simply checking that the corresponding kmalloc cache
    is ready and is not itself an off-slab cache. This is more robust
    because static size checking can be affected by cache size changes or
    the architecture type, while dynamic checking isn't.

    One check, 'freelist_cache->size > cachep->size / 2', is added to
    verify the benefit of choosing off-slab, because there is now no size
    constraint which ensures enough advantage when selecting off-slab.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We can fail to set up an off-slab cache in some conditions. Even in
    that case, debug pagealloc increases the cache size to PAGE_SIZE in
    advance, which is wasteful because debug pagealloc cannot work for the
    cache when it isn't off-slab. To improve this situation, this patch
    first checks whether the cache with the increased size is suitable for
    off-slab. It only actually increases the cache size when it is
    suitable for off-slab, so the possible waste is removed.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The current cache type determination code is open-coded and hard to
    understand. A following patch will introduce one more cache type,
    which would make the code more complex. So, before that happens, this
    patch abstracts these code paths.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim