15 Feb, 2017

1 commit

  • commit a810007afe239d59c1115fcaa06eb5b480f876e9 upstream.

    Commit 210e7a43fa90 ("mm: SLUB freelist randomization") broke USB hub
    initialisation as described in

    https://bugzilla.kernel.org/show_bug.cgi?id=177551.

    Bail out early from init_cache_random_seq if s->random_seq is already
    initialised. This prevents destroying the previously computed
    random_seq offsets later in the function.

    If the offsets are destroyed, then shuffle_freelist will truncate
    page->freelist to just the first object (orphaning the rest).

    Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization")
    Link: http://lkml.kernel.org/r/20170207140707.20824-1-sean@erifax.org
    Signed-off-by: Sean Rees
    Reported-by:
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Thomas Garnier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sean Rees
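
    A minimal sketch of the guard described above, assuming the upstream
    function and field names rather than quoting the patch verbatim:

    static int init_cache_random_seq(struct kmem_cache *s)
    {
        /* Bail out if the offsets were already computed (e.g. the cache is
         * being reused); otherwise the code below would recompute and
         * rescale random_seq, corrupting the existing offsets. */
        if (s->random_seq)
            return 0;

        /* ... existing sequence creation and page-offset scaling ... */
        return 0;
    }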
     

07 Sep, 2016

1 commit

  • Install the callbacks via the state machine.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20160818125731.27256-5-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
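
    The conversion described above replaces an old-style CPU notifier with a
    state-machine registration. A rough sketch, assuming the CPUHP_SLUB_DEAD
    state and the callback naming used by the series:

    /* Teardown callback invoked for each CPU that goes away. */
    static int slub_cpu_dead(unsigned int cpu)
    {
        /* flush the dead CPU's per-cpu slabs back to the shared lists */
        return 0;
    }

    static int __init example_register(void)
    {
        /* One call installs the callback; no notifier block needed. */
        return cpuhp_setup_state_nocalls(CPUHP_SLUB_DEAD, "slub:dead",
                                         NULL, slub_cpu_dead);
    }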
     

11 Aug, 2016

1 commit

  • With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
    kmem_cache_node is destroyed the call_rcu() may trigger a slab
    allocation to fill the debug object pool (__debug_object_init:fill_pool).

    Everywhere but during kmem_cache_destroy(), discard_slab() is performed
    outside of the kmem_cache_node->list_lock and avoids a lockdep warning
    about potential recursion:

    =============================================
    [ INFO: possible recursive locking detected ]
    4.8.0-rc1-gfxbench+ #1 Tainted: G U
    ---------------------------------------------
    rmmod/8895 is trying to acquire lock:
    (&(&n->list_lock)->rlock){-.-...}, at: [] get_partial_node.isra.63+0x47/0x430

    but task is already holding lock:
    (&(&n->list_lock)->rlock){-.-...}, at: [] __kmem_cache_shutdown+0x54/0x320

    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(&(&n->list_lock)->rlock);
    lock(&(&n->list_lock)->rlock);

    *** DEADLOCK ***
    May be due to missing lock nesting notation
    5 locks held by rmmod/8895:
    #0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0
    #1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0
    #2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
    #3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
    #4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320

    stack backtrace:
    CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1
    Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
    Call Trace:
    __lock_acquire+0x1646/0x1ad0
    lock_acquire+0xb2/0x200
    _raw_spin_lock+0x36/0x50
    get_partial_node.isra.63+0x47/0x430
    ___slab_alloc.constprop.67+0x1a7/0x3b0
    __slab_alloc.isra.64.constprop.66+0x43/0x80
    kmem_cache_alloc+0x236/0x2d0
    __debug_object_init+0x2de/0x400
    debug_object_activate+0x109/0x1e0
    __call_rcu.constprop.63+0x32/0x2f0
    call_rcu+0x12/0x20
    discard_slab+0x3d/0x40
    __kmem_cache_shutdown+0xdb/0x320
    shutdown_cache+0x19/0x60
    kmem_cache_destroy+0x1ae/0x220
    i915_gem_load_cleanup+0x14/0x40 [i915]
    i915_driver_unload+0x151/0x180 [i915]
    i915_pci_remove+0x14/0x20 [i915]
    pci_device_remove+0x34/0xb0
    __device_release_driver+0x95/0x140
    driver_detach+0xb6/0xc0
    bus_remove_driver+0x53/0xd0
    driver_unregister+0x27/0x50
    pci_unregister_driver+0x25/0x70
    i915_exit+0x1a/0x1e2 [i915]
    SyS_delete_module+0x193/0x1f0
    entry_SYSCALL_64_fastpath+0x1c/0xac

    Fixes: 52b4b950b507 ("mm: slab: free kmem_cache_node after destroy sysfs file")
    Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
    Reported-by: Dave Gordon
    Signed-off-by: Chris Wilson
    Reviewed-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dmitry Safonov
    Cc: Daniel Vetter
    Cc: Dave Gordon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
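
    The shape of the fix is to collect empty slabs onto a local list while
    n->list_lock is held and only call discard_slab(), and hence call_rcu(),
    after the lock is dropped. A sketch under that assumption, not a
    verbatim copy of the patch:

    static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
    {
        LIST_HEAD(discard);
        struct page *page, *h;

        spin_lock_irq(&n->list_lock);
        list_for_each_entry_safe(page, h, &n->partial, lru) {
            if (!page->inuse) {
                remove_partial(n, page);
                list_add(&page->lru, &discard);  /* defer the actual free */
            } else {
                list_slab_objects(s, page, "Objects remaining on shutdown");
            }
        }
        spin_unlock_irq(&n->list_lock);

        /* discard_slab() may call_rcu() and refill the debug object pool;
         * do it without holding the list_lock to avoid the recursion above. */
        list_for_each_entry_safe(page, h, &discard, lru)
            discard_slab(s, page);
    }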
     

09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page

    Linus Torvalds
     

05 Aug, 2016

1 commit

  • With m68k-linux-gnu-gcc-4.1:

    include/linux/slub_def.h:126: warning: `fixup_red_left' declared inline after being called
    include/linux/slub_def.h:126: warning: previous declaration of `fixup_red_left' was here

    Commit c146a2b98eb5 ("mm, kasan: account for object redzone in SLUB's
    nearest_obj()") made fixup_red_left() global, but forgot to remove the
    inline keyword.

    Fixes: c146a2b98eb5898e ("mm, kasan: account for object redzone in SLUB's nearest_obj()")
    Link: http://lkml.kernel.org/r/1470256262-1586-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Cc: Alexander Potapenko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
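
    The situation is roughly the one sketched below: the header provides a
    plain declaration that inline helpers already call, so keeping an inline
    qualifier on the out-of-line definition trips gcc 4.1 (bodies shown
    schematically):

    /* include/linux/slub_def.h: plain declaration, used by static inline
     * helpers further down in the header. */
    void *fixup_red_left(struct kmem_cache *s, void *p);

    /* mm/slub.c before the fix: gcc 4.1 warns because the function becomes
     * inline only after calls against the plain declaration were seen. */
    inline void *fixup_red_left(struct kmem_cache *s, void *p) { /* ... */ }

    /* mm/slub.c after the fix: the leftover qualifier is dropped. */
    void *fixup_red_left(struct kmem_cache *s, void *p) { /* ... */ }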
     

03 Aug, 2016

1 commit

    The state of an object is currently tracked in two places - shadow memory,
    and the ->state field in struct kasan_alloc_meta. We can get rid of the
    latter. This will save us a little bit of memory. It also allows us
    to move the free stack into struct kasan_alloc_meta without increasing
    memory consumption, so we should now always know when the object was
    last freed. This may be useful for long-delayed use-after-free bugs.

    As a side effect this fixes following UBSAN warning:
    UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
    member access within misaligned address ffff88000d1efebc for type 'struct qlist_node'
    which requires 8 byte alignment

    Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com
    Reported-by: kernel test robot
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

29 Jul, 2016

2 commits

  • For KASAN builds:
    - switch SLUB allocator to using stackdepot instead of storing the
    allocation/deallocation stacks in the objects;
    - change the freelist hook so that parts of the freelist can be put
    into the quarantine.

    [aryabinin@virtuozzo.com: fixes]
    Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt (Red Hat)
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Andrey Ryabinin
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • When looking up the nearest SLUB object for a given address, correctly
    calculate its offset if SLAB_RED_ZONE is enabled for that cache.

    Previously, when KASAN had detected an error on an object from a cache
    with SLAB_RED_ZONE set, the actual start address of the object was
    miscalculated, which led to random stacks having been reported.

    Fixes: 7ed2f9e663854db ("mm, kasan: SLAB support")
    Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt (Red Hat)
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Andrey Ryabinin
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

27 Jul, 2016

5 commits

  • Currently, to charge a non-slab allocation to kmemcg one has to use
    alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
    this helper should finally be freed using free_kmem_pages, otherwise it
    won't be uncharged.

    This API suits its current users fine, but it turns out to be impossible
    to use along with page reference counting, i.e. when an allocation is
    supposed to be freed with put_page, as it is the case with pipe or unix
    socket buffers.

    To overcome this limitation, this patch moves charging/uncharging to
    generic page allocator paths, i.e. to __alloc_pages_nodemask and
    free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
    one can use any of the available page allocation functions to get the
    allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
    just like in case of kmalloc and friends. A charged page will be
    automatically uncharged on free.

    To make it possible, we need to mark pages charged to kmemcg somehow.
    To avoid introducing a new page flag, we make use of page->_mapcount for
    marking such pages. Since pages charged to kmemcg are not supposed to
    be mapped to userspace, it should work just fine. There are other
    (ab)users of page->_mapcount - buddy and balloon pages - but we don't
    conflict with them.

    In case kmemcg is compiled out or not used at runtime, this patch
    introduces no overhead to generic page allocator paths. If kmemcg is
    used, it will be plus one gfp flags check on alloc and plus one
    page->_mapcount check on free, which shouldn't hurt performance, because
    the data accessed are hot.

    Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
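
    After this change, charging a raw page to kmemcg is just a matter of
    passing __GFP_ACCOUNT to the regular page allocator and freeing it as
    usual. A minimal usage sketch (hypothetical caller, not from the patch):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    static struct page *buf_page;

    static int example_alloc(void)
    {
        /* Charged to the current memcg; automatically uncharged on free. */
        buf_page = alloc_page(GFP_KERNEL | __GFP_ACCOUNT);
        if (!buf_page)
            return -ENOMEM;
        return 0;
    }

    static void example_free(void)
    {
        /* put_page() now works for accounted pages, too. */
        put_page(buf_page);
    }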
     
    Both SLAB and SLUB BUG() when a caller provides an invalid gfp_mask.
    This is a rather harsh way to announce a non-critical issue. The
    allocator is free to ignore invalid flags. Let's simply replace BUG()
    with dump_stack() to tell the offender, and fix up the mask to move on
    with the allocation request.

    This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test
    module:

    Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code!
    CPU: 0 PID: 2916 Comm: insmod Tainted: G O 4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Call Trace:
    dump_stack+0x67/0x90
    cache_alloc_refill+0x201/0x617
    kmem_cache_alloc_trace+0xa7/0x24a
    ? 0xffffffffa0005000
    mymodule_init+0x20/0x1000 [test_slab]
    do_one_initcall+0xe7/0x16c
    ? rcu_read_lock_sched_held+0x61/0x69
    ? kmem_cache_alloc_trace+0x197/0x24a
    do_init_module+0x5f/0x1d9
    load_module+0x1a3d/0x1f21
    ? retint_kernel+0x2d/0x2d
    SyS_init_module+0xe8/0x10e
    ? SyS_init_module+0xe8/0x10e
    do_syscall_64+0x68/0x13f
    entry_SYSCALL64_slow_path+0x25/0x25

    Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
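
    The replacement check has roughly this shape in both allocators (a
    sketch based on the description above; GFP_SLAB_BUG_MASK is the existing
    mask of flags the slab allocators reject):

    if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
        gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;

        /* Warn instead of BUG(), then strip the offending bits and
         * carry on with the allocation. */
        flags &= ~GFP_SLAB_BUG_MASK;
        pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
                invalid_mask, &invalid_mask, flags, &flags);
        dump_stack();
    }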
     
    printk has offered %pGg for quite some time, so let's use it to get a
    human-readable list of the invalid flags.

    The original output would be
    [ 429.191962] gfp: 2

    after the change
    [ 429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM)

    Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Implements freelist randomization for the SLUB allocator. It was
    previously implemented for the SLAB allocator. Both use the same
    configuration option (CONFIG_SLAB_FREELIST_RANDOM).

    The list is randomized during initialization of a new set of pages. The
    order on different freelist sizes is pre-computed at boot for
    performance. Each kmem_cache has its own randomized freelist.

    This security feature reduces the predictability of the kernel SLUB
    allocator against heap overflows rendering attacks much less stable.

    For example these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

    Performance results:

    slab_test impact is between 3% and 4% on average for 100000 attempts
    without SMP. It is a very focused test; kernbench shows that the overall
    impact on the system is far lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
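
    The pre-computation amounts to filling an index array 0..count-1 once
    per cache and Fisher-Yates shuffling it with a per-cache pseudo-random
    state; shuffle_freelist() then links a new page's objects in that order.
    A sketch of the shuffle step, assuming the helper naming used by the
    series:

    /* Pre-compute a random order of object indexes for one slab size. */
    static void freelist_randomize(struct rnd_state *state, unsigned int *list,
                                   unsigned int count)
    {
        unsigned int rand, i;

        for (i = 0; i < count; i++)
            list[i] = i;

        /* Fisher-Yates shuffle */
        for (i = count - 1; i > 0; i--) {
            rand = prandom_u32_state(state) % (i + 1);
            swap(list[i], list[rand]);
        }
    }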
     
  • Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
    SLUB allocator to catch any copies that may span objects. Includes a
    redzone handling fix discovered by Michael Ellerman.

    Based on code from PaX and grsecurity.

    Signed-off-by: Kees Cook
    Tested-by: Michael Ellerman
    Reviewed-by: Laura Abbott

    Kees Cook
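
    Conceptually, the SLUB hook maps the address back to its slab object and
    rejects copies that would spill past the usable object size, skipping any
    left redzone first. A simplified sketch of the __check_heap_object()
    hook that HARDENED_USERCOPY calls into (not the exact upstream code):

    /* Return NULL if [ptr, ptr + n) stays inside one object, otherwise
     * return the cache name so the caller can report the offender. */
    const char *__check_heap_object(const void *ptr, unsigned long n,
                                    struct page *page)
    {
        struct kmem_cache *s = page->slab_cache;
        unsigned long offset;

        /* Offset of ptr within its object; objects are s->size apart. */
        offset = (ptr - page_address(page)) % s->size;

        /* Skip the left redzone when red zoning pads the object (the fix
         * credited to Michael Ellerman above). */
        if (offset < s->red_left_pad)
            return s->name;
        offset -= s->red_left_pad;

        if (offset <= s->object_size && n <= s->object_size - offset)
            return NULL;

        return s->name;
    }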
     

21 May, 2016

1 commit

  • Instead of calling kasan_krealloc(), which replaces the memory
    allocation stack ID (if stack depot is used), just unpoison the whole
    memory chunk.

    Signed-off-by: Alexander Potapenko
    Acked-by: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

20 May, 2016

3 commits

    Many developers already know that the reference count field of struct
    page is _count and is an atomic type. They may try to handle it
    directly, and this could break the purpose of the page reference count
    tracepoint. To prevent direct _count modification, this patch renames it
    to _refcount and adds a warning message in the code. After that,
    developers who need to handle the reference count will find that the
    field should not be accessed directly.

    [akpm@linux-foundation.org: fix comments, per Vlastimil]
    [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
    [sfr@canb.auug.org.au: sync ethernet driver changes]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Stephen Rothwell
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Cc: Manish Chopra
    Cc: Yuval Mintz
    Cc: Tariq Toukan
    Cc: Saeed Mahameed
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
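
    With the field renamed, code that needs the reference count is expected
    to go through the existing helpers rather than poking the atomic
    directly; a hypothetical driver snippet for illustration:

    static void example_take_ref(struct page *page)
    {
        /* Don't: atomic_inc(&page->_refcount); use the helpers instead. */
        get_page(page);                               /* take a reference  */
        pr_debug("refs=%d\n", page_ref_count(page));  /* read via helper   */
        put_page(page);                               /* drop it again     */
    }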
     
  • /sys/kernel/slab/xx/defrag_ratio should be remote_node_defrag_ratio.

    Link: http://lkml.kernel.org/r/1463449242-5366-1-git-send-email-lip@dtdream.com
    Signed-off-by: Li Peng
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Peng
     
  • When we call __kmem_cache_shrink on memory cgroup removal, we need to
    synchronize kmem_cache->cpu_partial update with put_cpu_partial that
    might be running on other cpus. Currently, we achieve that by using
    kick_all_cpus_sync, which works as a system wide memory barrier. Though
    fast it is, this method has a flaw - it issues a lot of IPIs, which
    might hurt high performance or real-time workloads.

    To fix this, let's replace kick_all_cpus_sync with synchronize_sched.
    Although the latter one may take much longer to finish, it shouldn't be
    a problem in this particular case, because memory cgroups are destroyed
    asynchronously from a workqueue so that no user visible effects should
    be introduced. OTOH, it will save us from excessive IPIs when someone
    removes a cgroup.

    Anyway, even if using synchronize_sched turns out to take too long, we
    can always introduce a kind of __kmem_cache_shrink batching so that this
    method would only be called once per one cgroup destruction (not per
    each per memcg kmem cache as it is now).

    Signed-off-by: Vladimir Davydov
    Reported-by: Peter Zijlstra
    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
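
    The change itself is roughly a one-line swap on the memcg deactivation
    path of __kmem_cache_shrink() (sketch, not the full function):

    if (deactivate) {
        /* Disable empty-slab caching for this dying memcg cache. */
        s->cpu_partial = 0;
        s->min_partial = 0;

        /*
         * Was: kick_all_cpus_sync(), an IPI to every CPU.
         * put_cpu_partial() runs with IRQs disabled, i.e. inside an
         * RCU-sched read-side section, so synchronize_sched() gives the
         * same guarantee without the IPI storm.
         */
        synchronize_sched();
    }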
     

26 Mar, 2016

1 commit

  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

18 Mar, 2016

4 commits

  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • THP defrag is enabled by default to direct reclaim/compact but not wake
    kswapd in the event of a THP allocation failure. The problem is that
    THP allocation requests potentially enter reclaim/compaction. This
    potentially incurs a severe stall that is not guaranteed to be offset by
    reduced TLB misses. While there has been considerable effort to reduce
    the impact of reclaim/compaction, it is still a high cost and workloads
    that should fit in memory fail to do so. Specifically, a simple
    anon/file streaming workload will enter direct reclaim on NUMA at least
    even though the working set size is 80% of RAM. It's been years and
    it's time to throw in the towel.

    First, this patch defines THP defrag as follows;

    madvise: A failed allocation will direct reclaim/compact if the application requests it
    never: Neither reclaim/compact nor wake kswapd
    defer: A failed allocation will wake kswapd/kcompactd
    always: A failed allocation will direct reclaim/compact (historical behaviour)

    Khugepaged defrag will enter direct reclaim/compaction but will not wake kswapd.

    Next it sets the default defrag option to be "madvise" to only enter
    direct reclaim/compaction for applications that specifically requested
    it.

    Lastly, it removes a check from the page allocator slowpath that is
    related to __GFP_THISNODE to allow "defer" to work. The callers that
    really care are slub/slab, and they are updated accordingly. The slab
    one may be surprising because it also corrects a comment, as kswapd was
    never woken up by that path.

    This means that a THP fault will no longer stall for most applications
    by default, which is the ideal for most users: they get THP if it is
    immediately available. There are still options for users that prefer a
    stall at startup of a new application, by either restoring historical
    behaviour with "always" or picking a half-way point with "defer" where
    kswapd does some of the work in the background and wakes kcompactd if
    necessary. THP defrag for khugepaged remains enabled and will enter
    direct reclaim/compaction but will not wake kswapd or kcompactd.

    After this patch a THP allocation failure will quickly fallback and rely
    on khugepaged to recover the situation at some time in the future. In
    some cases, this will reduce THP usage but the benefit of THP is hard to
    measure and not a universal win where as a stall to reclaim/compaction
    is definitely measurable and can be painful.

    The first test for this is using "usemem" to read a large file and write
    a large anonymous mapping (to avoid the zero page) multiple times. The
    total size of the mappings is 80% of RAM and the benchmark simply
    measures how long it takes to complete. It uses multiple threads to see
    if that is a factor. On UMA, the performance is almost identical so is
    not reported but on NUMA, we see this

    usemem
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
    Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
    Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
    Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
    Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
    Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
    Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
    Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
    Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
    Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
    Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
    Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
    Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
    Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)

    For a single thread, the benchmark completes 43.23% faster with this
    patch applied, with smaller benefits as the thread count increases.
    Similarly, notice the large reduction in system CPU usage in most cases.
    The overall CPU time is

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 10357.65 10438.33
    System 3988.88 3543.94
    Elapsed 2203.01 1634.41

    Which is substantial. Now, the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 128458477 278352931
    Major Faults 2174976 225
    Swap Ins 16904701 0
    Swap Outs 17359627 0
    Allocation stalls 43611 0
    DMA allocs 0 0
    DMA32 allocs 19832646 19448017
    Normal allocs 614488453 580941839
    Movable allocs 0 0
    Direct pages scanned 24163800 0
    Kswapd pages scanned 0 0
    Kswapd pages reclaimed 0 0
    Direct pages reclaimed 20691346 0
    Compaction stalls 42263 0
    Compaction success 938 0
    Compaction failures 41325 0

    This patch eliminates almost all swapping and direct reclaim activity.
    There is still overhead but it's from NUMA balancing which does not
    identify that it's pointless trying to do anything with this workload.

    I also tried the thpscale benchmark which forces a corner case where
    compaction can be used heavily and measures the latency of whether base
    or huge pages were used

    thpscale Fault Latencies
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
    Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
    Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
    Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
    Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
    Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
    Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
    Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
    Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
    Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
    Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
    Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
    Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
    Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
    Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
    Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
    Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
    Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)

    The average time to fault pages is substantially reduced in the majority
    of cases, but with the obvious caveat that fewer THPs are actually used
    in this adverse workload

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
    Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
    Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
    Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
    Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
    Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
    Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
    Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
    Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 37429143 47564000
    Major Faults 1916 1558
    Swap Ins 1466 1079
    Swap Outs 2936863 149626
    Allocation stalls 62510 3
    DMA allocs 0 0
    DMA32 allocs 6566458 6401314
    Normal allocs 216361697 216538171
    Movable allocs 0 0
    Direct pages scanned 25977580 17998
    Kswapd pages scanned 0 3638931
    Kswapd pages reclaimed 0 207236
    Direct pages reclaimed 8833714 88
    Compaction stalls 103349 5
    Compaction success 270 4
    Compaction failures 103079 1

    Note again that while this does swap as it's an aggressive workload, the
    direct reclaim activity and allocation stalls are substantially reduced.
    There is some kswapd activity but ftrace showed that the kswapd activity
    was due to normal wakeups from 4K pages being allocated.
    Compaction-related stalls and activity are almost eliminated.

    I also tried the stutter benchmark. For this, I do not have figures for
    NUMA but it's something that does impact UMA so I'll report what is
    available

    stutter
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
    1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
    2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
    3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
    Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
    Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
    Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
    Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
    Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
    Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)

    This benchmark is trying to fault an anonymous mapping while there is a
    heavy IO load -- a scenario that desktop users used to complain about
    frequently. This shows a mix because the ideal case of mapping with THP
    is not hit as often. However, note that 99% of the mappings complete
    13.79% faster. The CPU usage here is particularly interesting

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 67.50 0.99
    System 1327.88 91.30
    Elapsed 2079.00 2128.98

    And once again we look at the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 335241922 1314582827
    Major Faults 715 819
    Swap Ins 0 0
    Swap Outs 0 0
    Allocation stalls 532723 0
    DMA allocs 0 0
    DMA32 allocs 1822364341 1177950222
    Normal allocs 1815640808 1517844854
    Movable allocs 0 0
    Direct pages scanned 21892772 0
    Kswapd pages scanned 20015890 41879484
    Kswapd pages reclaimed 19961986 41822072
    Direct pages reclaimed 21892741 0
    Compaction stalls 1065755 0
    Compaction success 514 0
    Compaction failures 1065241 0

    Allocation stalls and all direct reclaim activity are eliminated, as
    well as compaction-related stalls.

    THP gives impressive gains in some cases but only if they are quickly
    available. We're not going to reach the point where they are completely
    free, so let's take the costs out of the fast paths finally and defer
    the cost to kswapd, kcompactd and khugepaged, where it belongs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • We can disable debug_pagealloc processing even if the code is compiled
    with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
    whether it is enabled or not in runtime.

    [akpm@linux-foundation.org: clean up code, per Christian]
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Christian Borntraeger
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
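
    In practice that means guarding the debug-only page mapping/poisoning
    work with the runtime predicate instead of the config option alone; a
    generic illustration of the pattern (not the exact hunks of the patch):

    /* Before: decided purely at compile time. */
    #ifdef CONFIG_DEBUG_PAGEALLOC
        kernel_map_pages(page, 1 << order, 0);
    #endif

    /* After: also honour the debug_pagealloc= boot parameter at runtime. */
    if (debug_pagealloc_enabled())
        kernel_map_pages(page, 1 << order, 0);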
     
  • Show how much memory is used for storing reclaimable and unreclaimable
    in-kernel data structures allocated from slab caches.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Mar, 2016

9 commits

    We can now print gfp_flags in a more human-readable way. Make use of
    this in slab_out_of_memory() for SLUB and SLAB. Also convert the SLAB
    variant to pr_warn() along the way.

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    SLUB already has a redzone debugging feature. But it is only positioned
    at the end of the object (aka the right redzone), so it cannot catch
    left OOB. Although the current object's right redzone acts as the left
    redzone of the next object, the first object in a slab cannot take
    advantage of this effect. This patch explicitly adds a left red zone to
    each object to detect left OOB more precisely.

    Background:

    Someone complained to me that left OOB is not caught even if KASAN is
    enabled, which does page allocation debugging. That page is out of our
    control, so it would be allocated when left OOB happens and, in this
    case, we can't find the OOB. Moreover, the SLUB debugging feature can be
    enabled without page allocator debugging and, in this case, we will miss
    that OOB.

    Before trying to implement this, I expected that the changes would be
    too complex, but it doesn't look that complex to me now. Almost all
    changes are applied to debug-specific functions, so I feel okay.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
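
    With the change each object carries a red_left_pad of redzone bytes in
    front of it, in addition to the existing trailing redzone, so an
    underflow from an object lands in that object's own left redzone rather
    than, at best, the previous object's right one. The per-object layout
    becomes roughly:

    /*
     * One SLUB object with SLAB_RED_ZONE (sketch):
     *
     *   red_left_pad       object_size         right redzone, debug data
     * +----------------+----------------------+---------------------------+
     * |  left redzone  |     object data      |  right redzone, tracking  |
     * +----------------+----------------------+---------------------------+
     *                  ^
     *                  pointer returned to the caller
     *
     * A write below the returned pointer now hits this object's own left
     * redzone, which the consistency checks can attribute precisely.
     */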
     
    When debug options are enabled, cmpxchg on the page is disabled. This
    is because the page must be locked to ensure there are no false
    positives when performing consistency checks. Some debug options such
    as poisoning and red zoning only act on the object itself, so there is
    no need to protect other CPUs from modification when only the object is
    affected. Allow cmpxchg to happen when poisoning and red zoning are the
    only debug options set on a slab.

    Credit to Mathias Krause for the original work which inspired this
    series

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
    on or off. Expand its use to be able to turn off all consistency
    checks. This gives a nice speed up if you only want features such as
    poisoning or tracing.

    Credit to Mathias Krause for the original work which inspired this
    series

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • Since commit 19c7ff9ecd89 ("slub: Take node lock during object free
    checks") check_object has been incorrectly returning success as it
    follows the out label which just returns the node.

    Thanks to refactoring, the out and fail paths are now basically the
    same. Combine the two into one and just use a single label.

    Credit to Mathias Krause for the original work which inspired this
    series

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • This series takes the suggestion of Christoph Lameter and only focuses
    on optimizing the slow path where the debug processing runs. The two
    main optimizations in this series are letting the consistency checks be
    skipped and relaxing the cmpxchg restrictions when we are not doing
    consistency checks. With hackbench -g 20 -l 1000 averaged over 100
    runs:

    Before slub_debug=P
    mean 15.607
    variance .086
    stdev .294

    After slub_debug=P
    mean 10.836
    variance .155
    stdev .394

    This still isn't as fast as what is in grsecurity unfortunately so there's
    still work to be done. Profiling ___slab_alloc shows that 25-50% of time
    is spent in deactivate_slab. I haven't looked too closely to see if this
    is something that can be optimized. My plan for now is to focus on
    getting all of this merged (if appropriate) before digging in to another
    task.

    This patch (of 4):

    Currently, free_debug_processing has a comment "Keep node_lock to preserve
    integrity until the object is actually freed". In actuality, the lock is
    dropped immediately in __slab_free. Rather than wait until __slab_free
    and potentially throw off the unlikely marking, just drop the lock at the
    end of free_debug_processing. This also lets free_debug_processing take
    its own copy of the spinlock flags rather than trying to share the ones
    from __slab_free. Since there is no use for the node afterwards, change
    the return type of free_debug_processing to return an int like
    alloc_debug_processing.

    Credit to Mathias Krause for the original work which inspired this series

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • This patch introduce a new API call kfree_bulk() for bulk freeing memory
    objects not bound to a single kmem_cache.

    Christoph pointed out that it is possible to implement freeing of
    objects, without knowing the kmem_cache pointer as that information is
    available from the object's page->slab_cache. Proposing to remove the
    kmem_cache argument from the bulk free API.

    Jesper demonstrated that these extra steps per object come at a
    performance cost. It is only in the case where CONFIG_MEMCG_KMEM is
    compiled in and activated at runtime that these steps are done anyhow.
    The extra cost is most visible for the SLAB allocator, because the SLUB
    allocator does the page lookup (virt_to_head_page()) anyhow.

    Thus, the conclusion was to keep the kmem_cache free bulk API with a
    kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
    easily. Simply by handling if kmem_cache_free_bulk() gets called with a
    kmem_cache NULL pointer.

    This does increase the code size a bit, but implementing a separate
    kfree_bulk() call would likely increase code size even more.

    Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
    @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.

    Code size increase for SLAB:

    add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
    function old new delta
    kmem_cache_free_bulk 660 734 +74

    SLAB fastpath: 87 cycles(tsc) 21.814
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 103 cycles 25.878 ns - 41 cycles 10.498 ns - 81 cycles 20.312 ns
    2 - 94 cycles 23.673 ns - 26 cycles 6.682 ns - 42 cycles 10.649 ns
    3 - 92 cycles 23.181 ns - 21 cycles 5.325 ns - 39 cycles 9.950 ns
    4 - 90 cycles 22.727 ns - 18 cycles 4.673 ns - 26 cycles 6.693 ns
    8 - 89 cycles 22.270 ns - 14 cycles 3.664 ns - 23 cycles 5.835 ns
    16 - 88 cycles 22.038 ns - 14 cycles 3.503 ns - 22 cycles 5.543 ns
    30 - 89 cycles 22.284 ns - 13 cycles 3.310 ns - 20 cycles 5.197 ns
    32 - 88 cycles 22.249 ns - 13 cycles 3.420 ns - 20 cycles 5.166 ns
    34 - 88 cycles 22.224 ns - 14 cycles 3.643 ns - 20 cycles 5.170 ns
    48 - 88 cycles 22.088 ns - 14 cycles 3.507 ns - 20 cycles 5.203 ns
    64 - 88 cycles 22.063 ns - 13 cycles 3.428 ns - 20 cycles 5.152 ns
    128 - 89 cycles 22.483 ns - 15 cycles 3.891 ns - 23 cycles 5.885 ns
    158 - 89 cycles 22.381 ns - 15 cycles 3.779 ns - 22 cycles 5.548 ns
    250 - 91 cycles 22.798 ns - 16 cycles 4.152 ns - 23 cycles 5.967 ns

    SLAB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
    1 - 148 cycles 37.220 ns - 66 cycles 16.622 ns - 66 cycles 16.583 ns
    2 - 141 cycles 35.510 ns - 51 cycles 12.820 ns - 58 cycles 14.625 ns
    3 - 140 cycles 35.017 ns - 37 cycles 9.326 ns - 33 cycles 8.474 ns
    4 - 137 cycles 34.507 ns - 31 cycles 7.888 ns - 33 cycles 8.300 ns
    8 - 140 cycles 35.069 ns - 25 cycles 6.461 ns - 25 cycles 6.436 ns
    16 - 138 cycles 34.542 ns - 23 cycles 5.945 ns - 22 cycles 5.670 ns
    30 - 136 cycles 34.227 ns - 22 cycles 5.502 ns - 22 cycles 5.587 ns
    32 - 136 cycles 34.253 ns - 21 cycles 5.475 ns - 21 cycles 5.324 ns
    34 - 136 cycles 34.254 ns - 21 cycles 5.448 ns - 20 cycles 5.194 ns
    48 - 136 cycles 34.075 ns - 21 cycles 5.458 ns - 21 cycles 5.367 ns
    64 - 135 cycles 33.994 ns - 21 cycles 5.350 ns - 21 cycles 5.259 ns
    128 - 137 cycles 34.446 ns - 23 cycles 5.816 ns - 22 cycles 5.688 ns
    158 - 137 cycles 34.379 ns - 22 cycles 5.727 ns - 22 cycles 5.602 ns
    250 - 138 cycles 34.755 ns - 24 cycles 6.093 ns - 23 cycles 5.986 ns

    Code size increase for SLUB:
    function old new delta
    kmem_cache_free_bulk 717 799 +82

    SLUB benchmark:
    SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 61 cycles 15.486 ns - 53 cycles 13.364 ns - 57 cycles 14.464 ns
    2 - 54 cycles 13.703 ns - 32 cycles 8.110 ns - 33 cycles 8.482 ns
    3 - 53 cycles 13.272 ns - 25 cycles 6.362 ns - 27 cycles 6.947 ns
    4 - 51 cycles 12.994 ns - 24 cycles 6.087 ns - 24 cycles 6.078 ns
    8 - 50 cycles 12.576 ns - 21 cycles 5.354 ns - 22 cycles 5.513 ns
    16 - 49 cycles 12.368 ns - 20 cycles 5.054 ns - 20 cycles 5.042 ns
    30 - 49 cycles 12.273 ns - 18 cycles 4.748 ns - 19 cycles 4.758 ns
    32 - 49 cycles 12.401 ns - 19 cycles 4.821 ns - 19 cycles 4.810 ns
    34 - 98 cycles 24.519 ns - 24 cycles 6.154 ns - 24 cycles 6.157 ns
    48 - 83 cycles 20.833 ns - 21 cycles 5.446 ns - 21 cycles 5.429 ns
    64 - 75 cycles 18.891 ns - 20 cycles 5.247 ns - 20 cycles 5.238 ns
    128 - 93 cycles 23.271 ns - 27 cycles 6.856 ns - 27 cycles 6.823 ns
    158 - 102 cycles 25.581 ns - 30 cycles 7.714 ns - 30 cycles 7.695 ns
    250 - 107 cycles 26.917 ns - 38 cycles 9.514 ns - 38 cycles 9.506 ns

    SLUB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
    1 - 85 cycles 21.484 ns - 78 cycles 19.569 ns - 75 cycles 18.938 ns
    2 - 81 cycles 20.363 ns - 45 cycles 11.258 ns - 44 cycles 11.076 ns
    3 - 78 cycles 19.709 ns - 33 cycles 8.354 ns - 32 cycles 8.044 ns
    4 - 77 cycles 19.430 ns - 28 cycles 7.216 ns - 28 cycles 7.003 ns
    8 - 101 cycles 25.288 ns - 23 cycles 5.849 ns - 23 cycles 5.787 ns
    16 - 76 cycles 19.148 ns - 20 cycles 5.162 ns - 20 cycles 5.081 ns
    30 - 76 cycles 19.067 ns - 19 cycles 4.868 ns - 19 cycles 4.821 ns
    32 - 76 cycles 19.052 ns - 19 cycles 4.857 ns - 19 cycles 4.815 ns
    34 - 121 cycles 30.291 ns - 25 cycles 6.333 ns - 25 cycles 6.268 ns
    48 - 108 cycles 27.111 ns - 21 cycles 5.498 ns - 21 cycles 5.458 ns
    64 - 100 cycles 25.164 ns - 20 cycles 5.242 ns - 20 cycles 5.229 ns
    128 - 155 cycles 38.976 ns - 27 cycles 6.886 ns - 27 cycles 6.892 ns
    158 - 132 cycles 33.034 ns - 30 cycles 7.711 ns - 30 cycles 7.728 ns
    250 - 130 cycles 32.612 ns - 38 cycles 9.560 ns - 38 cycles 9.549 ns

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
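
    Callers that batch kmalloc'ed objects, possibly from different caches,
    can now hand the whole array back in one call. A minimal usage sketch
    (hypothetical caller):

    #include <linux/slab.h>

    static int example_bulk(void)
    {
        void *objs[16];
        size_t i;

        for (i = 0; i < ARRAY_SIZE(objs); i++) {
            objs[i] = kmalloc(256, GFP_KERNEL);
            if (!objs[i]) {
                kfree_bulk(i, objs);    /* free what we got so far */
                return -ENOMEM;
            }
        }

        /* ... use the objects ... */

        /* One call; internally kmem_cache_free_bulk(NULL, size, p). */
        kfree_bulk(ARRAY_SIZE(objs), objs);
        return 0;
    }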
     
  • First step towards sharing alloc_hook's between SLUB and SLAB
    allocators. Move the SLUB allocators *_alloc_hook to the common
    mm/slab.h for internal slab definitions.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • This change is primarily an attempt to make it easier to realize the
    optimizations the compiler performs in-case CONFIG_MEMCG_KMEM is not
    enabled.

    Performance wise, even when CONFIG_MEMCG_KMEM is compiled in, the
    overhead is zero. This is because, as long as no process has enabled
    kmem cgroup accounting, the assignment is replaced by asm NOP
    operations. This is possible because memcg_kmem_enabled() uses a
    static_key_false() construct.

    It also helps readability as it avoids accessing the p[] array like
    p[size - 1], which "exposes" that the array is processed backwards
    inside the helper function build_detached_freelist().

    Lastly this also makes the code more robust in error cases, like passing
    NULL pointers in the array, which were previously handled before commit
    033745189b1b ("slub: add missing kmem cgroup support to
    kmem_cache_free_bulk").

    Fixes: 033745189b1b ("slub: add missing kmem cgroup support to kmem_cache_free_bulk")
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

19 Feb, 2016

1 commit

    When slub_debug alloc_calls_show is enabled we will try to track the
    location and user of each slab object on each online node, so the
    kmem_cache_node structure and cpu_cache/cpu_slub shouldn't be freed
    until the last reference to the sysfs file is dropped.

    This fixes the following panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: list_locations+0x169/0x4e0
    PGD 257304067 PUD 438456067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
    Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
    task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
    RIP: list_locations+0x169/0x4e0
    Call Trace:
    alloc_calls_show+0x1d/0x30
    slab_attr_show+0x1b/0x30
    sysfs_read_file+0x9a/0x1a0
    vfs_read+0x9c/0x170
    SyS_read+0x58/0xb0
    system_call_fastpath+0x16/0x1b
    Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
    CR2: 0000000000000020

    Separated __kmem_cache_release from __kmem_cache_shutdown;
    __kmem_cache_release is now called from slab_kmem_cache_release (after
    the last reference to the sysfs file object has been dropped).

    Reintroduced locking in free_partial, as the sysfs file might access the
    cache's partial list after shutdown - a partial revert of commit
    69cb8e6b7c29 ("slub: free slabs without holding locks"). Zapped
    __remove_partial and used remove_partial (w/o underscores), as
    free_partial now takes the list_lock, which is a partial revert of
    commit 1e4dd9461fab ("slub: do not assert not having lock in removing
    freed partial")

    Signed-off-by: Dmitry Safonov
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     

21 Jan, 2016

1 commit


16 Jan, 2016

1 commit

    lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.

    This patch also gets rid of the custom helper functions --
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing tail pages
    to these helpers would trigger VM_BUG_ON().

    SLUB uses PG_locked as a bit spin lock. IIUC, tail pages should never
    appear there. VM_BUG_ON() is added to make sure that this assumption is
    correct.

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
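
    After the change the locking primitives fold the tail-to-head
    translation into themselves, so callers may pass any sub-page of a
    compound page; schematically:

    static inline int trylock_page(struct page *page)
    {
        /* Always operate on the head page's PG_locked bit. */
        page = compound_head(page);
        return likely(!test_and_set_bit_lock(PG_locked, &page->flags));
    }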
     

15 Jan, 2016

1 commit

  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed
    to kmem_cache_create will force accounting for every allocation from
    this cache even if __GFP_ACCOUNT is not passed.

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
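
    Usage is a single extra flag at cache creation time. A minimal sketch
    with hypothetical cache and struct names:

    struct my_obj {
        int payload;
    };

    static struct kmem_cache *my_cache;

    static int __init example_init(void)
    {
        /* Every allocation from this cache is charged to kmemcg, with no
         * need to pass __GFP_ACCOUNT at each call site. */
        my_cache = kmem_cache_create("my_cache", sizeof(struct my_obj),
                                     0, SLAB_ACCOUNT, NULL);
        if (!my_cache)
            return -ENOMEM;
        return 0;
    }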
     

23 Nov, 2015

5 commits

  • Adjust kmem_cache_alloc_bulk API before we have any real users.

    Adjust the API to return type 'int' instead of the previous type 'bool'.
    This is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary,
    when specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to make
    bulk alloc run without local IRQs disabled, with an approach of cmpxchg
    "stealing" the entire c->freelist or page->freelist. To avoid
    overshooting we would stop processing at a slab-page boundary; else we
    always end up returning some objects at the cost of another cmpxchg.

    To stay compatible with future users of this API linking against an
    older kernel when using the new flag, we need to return the number of
    allocated objects with this API change.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
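
    With the int return, callers should treat a zero return as failure (the
    current implementation is all-or-nothing) while staying ready for
    partial fills behind a future flag. A hedged usage sketch (the function
    name is illustrative):

    static int example_bulk_alloc(struct kmem_cache *cache)
    {
        void *objs[8];
        int got;

        got = kmem_cache_alloc_bulk(cache, GFP_KERNEL, ARRAY_SIZE(objs), objs);
        if (!got)
            return -ENOMEM;  /* currently all-or-nothing: 0 on failure */

        /* Today got == ARRAY_SIZE(objs); a future flag may allow less. */
        kmem_cache_free_bulk(cache, got, objs);
        return 0;
    }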
     
    The initial implementation missed kmem cgroup support in the
    kmem_cache_free_bulk() call; add this.

    If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough
    to not add any asm code.

    Incoming bulk free objects can belong to different kmem cgroups, and the
    object free call can happen at a later point outside the memcg context.
    Thus, we need to keep the orig kmem_cache, to correctly verify that a
    memcg object matches against its "root_cache"
    (s->memcg_params.root_cache).

    Signed-off-by: Jesper Dangaard Brouer
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
    The call slab_pre_alloc_hook() interacts with kmemcg and is not allowed
    to be called several times inside the bulk alloc for-loop, due to the
    call to memcg_kmem_get_cache().

    This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache.

    As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able
    to handle an array of objects.

    A subtle detail is that the loop iterator "i" in slab_post_alloc_hook()
    must have the same type (size_t) as the size argument. This helps the
    compiler more easily realize that it can remove the loop when all debug
    statements inside the loop evaluate to nothing. Note, this is only an
    issue because the kernel is compiled with the GCC option:
    -fno-strict-overflow

    In slab_alloc_node() the compiler inlines and optimizes the invocation of
    slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and access
    object directly.

    Signed-off-by: Jesper Dangaard Brouer
    Reported-by: Vladimir Davydov
    Suggested-by: Vladimir Davydov
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
    This change focuses on improving the speed of object freeing in the
    "slowpath" of kmem_cache_free_bulk.

    The calls slab_free (fastpath) and __slab_free (slowpath) have been
    extended with support for bulk free, which amortize the overhead of
    the (locked) cmpxchg_double.

    To use the new bulking feature, we build what I call a detached
    freelist. The detached freelist takes advantage of three properties:

    1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

    2) many freelists can co-exist side-by-side in the same slab-page,
    each with a separate head pointer.

    3) it is the visibility of the head pointer that needs synchronization.

    Given these properties, the brilliant part is that the detached
    freelist can be constructed without any need for synchronization. The
    freelist is constructed directly in the page objects, without any
    synchronization needed. The detached freelist is allocated on the
    stack of the function call kmem_cache_free_bulk. Thus, the freelist
    head pointer is not visible to other CPUs.

    All objects in a SLUB freelist must belong to the same slab-page.
    Thus, constructing the detached freelist is about matching objects
    that belong to the same slab-page. The bulk free array is scanned in
    a progressive manner with a limited look-ahead facility.

    Kmem debug support is handled in the call to slab_free().

    Notice kmem_cache_free_bulk no longer needs to disable IRQs. This
    only slowed down single-object bulk free by approx 3 cycles.

    Performance data:
    Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz

    SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns

    To get stable and comparable numbers, the kernel has been booted with
    "slab_nomerge" (this also improves performance for larger bulk sizes).

    Performance data, compared against fallback bulking:

    bulk - fallback bulk - improvement with this patch
    1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
    2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
    3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
    4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
    8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
    16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
    30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
    32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
    34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
    48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
    64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
    128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
    158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
    250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%

    Performance data, compared current in-kernel bulking:

    bulk - curr in-kernel - improvement with this patch
    1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
    2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
    3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
    4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
    8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
    16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
    30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
    48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
    64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
    128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
    158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
    250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%

    Performance with normal SLUB merging is significantly slower for
    larger bulking. This is believed to (primarily) be an effect of not
    having to share the per-CPU data-structures, as tuning per-CPU size
    can achieve similar performance.

    bulk - slab_nomerge - normal SLUB merge
    1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
    2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
    3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
    4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
    8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
    16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
    30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
    32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
    34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
    48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
    64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
    128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
    158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
    250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19

    Joint work with Alexander Duyck.

    [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

    [akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
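
    The on-stack bookkeeping is small; roughly the structure the series
    introduces (field names as described above, shown here as a sketch):

    /*
     * A detached freelist: a chain of free objects that all live in the
     * same slab-page, built privately on the caller's stack and only made
     * visible to other CPUs by a single cmpxchg_double when it is spliced
     * back into the page (or per-cpu) freelist.
     */
    struct detached_freelist {
        struct page *page;   /* slab-page all objects belong to  */
        void *tail;          /* last object linked so far        */
        void *freelist;      /* head of the detached list        */
        int cnt;             /* number of objects on the list    */
    };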
     
    Make it possible to free a freelist with several objects by adjusting
    the API of slab_free() and __slab_free() to take a head, a tail and an
    objects counter (cnt).

    Tail being NULL indicates a single-object free of the head object. This
    allows compiler inline constant propagation in slab_free() and
    slab_free_freelist_hook() to avoid adding any overhead in the case of a
    single object free.

    This allows a freelist with several objects (all within the same
    slab-page) to be free'ed using a single locked cmpxchg_double in
    __slab_free() and with an unlocked cmpxchg_double in slab_free().

    Object debugging on the free path is also extended to handle these
    freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if
    objects don't belong to the same slab-page.

    These changes are needed for the next patch to bulk free the detached
    freelists it introduces and constructs.

    Micro benchmarking showed no performance reduction due to this change,
    when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer