20 May, 2016

40 commits

  • Lots of code does

    node = next_node(node, XXX);
    if (node == MAX_NUMNODES)
            node = first_node(XXX);

    so create next_node_in() to do this and use it in various places.
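
    A minimal sketch of what such a helper could look like, following the
    open-coded pattern above (illustrative only; the in-tree definition may
    differ in detail):

    /* Return the node after @node in @mask, wrapping around to the first
     * node once the end of the mask is reached. */
    static inline int next_node_in(int node, const nodemask_t mask)
    {
            int ret = next_node(node, mask);

            if (ret == MAX_NUMNODES)
                    ret = first_node(mask);
            return ret;
    }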

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Attach the malloc attribute to a few allocation functions. This helps
    gcc generate better code by telling it that the return value doesn't
    alias any existing pointers (which is even more valuable given the
    pessimizations implied by -fno-strict-aliasing).

    A simple example of what this allows gcc to do can be seen by looking at
    the last part of drm_atomic_helper_plane_reset:

    plane->state = kzalloc(sizeof(*plane->state), GFP_KERNEL);

    if (plane->state) {
            plane->state->plane = plane;
            plane->state->rotation = BIT(DRM_ROTATE_0);
    }

    which compiles to

    e8 99 bf d6 ff callq ffffffff8116d540
    48 85 c0 test %rax,%rax
    48 89 83 40 02 00 00 mov %rax,0x240(%rbx)
    74 11 je ffffffff814015c4
    48 89 18 mov %rbx,(%rax)
    48 8b 83 40 02 00 00 mov 0x240(%rbx),%rax [*]
    c7 40 40 01 00 00 00 movl $0x1,0x40(%rax)

    With this patch applied, the instruction at [*] is elided, since the
    store to plane->state->plane is known to not alter the value of
    plane->state.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Rasmus Villemoes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • gcc as far back as at least 3.04 documents the function attribute
    __malloc__. Add a shorthand for attaching that to a function
    declaration. This was also suggested by Andi Kleen way back in 2002
    [1], but didn't get applied, perhaps because gcc at that time generated
    the exact same code with and without this attribute.

    This attribute tells the compiler that the return value (if non-NULL)
    can be assumed not to alias any other valid pointers at the time of the
    call.

    Please note that the documentation for a range of gcc versions (starting
    from around 4.7) contained a somewhat confusing and self-contradicting
    text:

    The malloc attribute is used to tell the compiler that a function may
    be treated as if any non-NULL pointer it returns cannot alias any other
    pointer valid when the function returns and *that the memory has
    undefined content*. [...] Standard functions with this property include
    malloc and *calloc*.

    (emphasis mine). The intended meaning has later been clarified [2]:

    This tells the compiler that a function is malloc-like, i.e., that the
    pointer P returned by the function cannot alias any other pointer valid
    when the function returns, and moreover no pointers to valid objects
    occur in any storage addressed by P.

    What this means is that we can apply the attribute to kmalloc and
    friends, and it is ok for the returned memory to have well-defined
    contents (__GFP_ZERO). But it is not ok to apply it to kmemdup(), nor
    to other functions which both allocate and possibly initialize the
    memory with existing pointers. So unless someone is doing something
    pretty perverted kstrdup() should also be a fine candidate.

    [1] http://thread.gmane.org/gmane.linux.kernel/57172
    [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56955
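
    As a rough sketch, the shorthand and its use on an allocator declaration
    could look like this (the exact kernel spelling of the declaration may
    differ):

    /* Shorthand for the gcc attribute described above. */
    #define __malloc __attribute__((__malloc__))

    /* The returned pointer (if non-NULL) aliases no other valid pointer. */
    void *kmalloc(size_t size, gfp_t flags) __malloc;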

    Signed-off-by: Rasmus Villemoes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Many developers already know that the reference count field of struct
    page is _count and that it is an atomic type. They may try to modify it
    directly, which would break the purpose of the page reference count
    tracepoints. To prevent direct modification of _count, this patch
    renames it to _refcount and adds a warning comment to the code. After
    that, developers who need to handle the reference count will see that
    the field should not be accessed directly.

    [akpm@linux-foundation.org: fix comments, per Vlastimil]
    [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
    [sfr@canb.auug.org.au: sync ethernet driver changes]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Stephen Rothwell
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Cc: Manish Chopra
    Cc: Yuval Mintz
    Cc: Tariq Toukan
    Cc: Saeed Mahameed
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • page_reference manipulation functions are introduced to track reference
    count changes of a page. Use them instead of modifying _count directly.
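
    For illustration, the accessors are roughly of this shape (a simplified
    sketch, shown with the renamed _refcount field from the previous entry;
    the real helpers also emit the tracepoints mentioned above):

    static inline int page_ref_count(struct page *page)
    {
            return atomic_read(&page->_refcount);
    }

    static inline void page_ref_inc(struct page *page)
    {
            atomic_inc(&page->_refcount);
    }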

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • /sys/kernel/slab/xx/defrag_ratio should be remote_node_defrag_ratio.

    Link: http://lkml.kernel.org/r/1463449242-5366-1-git-send-email-lip@dtdream.com
    Signed-off-by: Li Peng
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Peng
     
  • Now that we have the IS_ENABLED helper to check whether a Kconfig option
    is enabled, ZONE_DMA_FLAG is no longer useful.

    Moreover, the use of ZONE_DMA_FLAG in slab looks pointless according to
    the comment [1] from Johannes Weiner, so remove it; ORing the passed-in
    flags with the cache gfp flags is already done in kmem_getpages().

    [1] https://lkml.org/lkml/2014/9/25/553
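
    Roughly, the open-coded test replaces the flag along these lines (a
    sketch with a hypothetical helper name, not the exact hunk):

    static inline bool slab_wants_dma(gfp_t flags)
    {
            /* Test the Kconfig option directly instead of ZONE_DMA_FLAG. */
            return IS_ENABLED(CONFIG_ZONE_DMA) && (flags & GFP_DMA);
    }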

    Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Provides an optional config option (CONFIG_SLAB_FREELIST_RANDOM) to
    randomize the SLAB freelist. The list is randomized during
    initialization of a new set of pages. The order for the different
    freelist sizes is pre-computed at boot for performance. Each kmem_cache
    has its own randomized freelist. Before the pre-computed lists are
    available, freelists are generated dynamically. This security feature
    reduces the predictability of the kernel SLAB allocator against heap
    overflows, rendering attacks much less stable.

    For example this attack against SLUB (also applicable against SLAB)
    would be affected:

    https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/

    Also, since v4.6 the freelist has been moved to the end of the SLAB.
    This means a controllable heap is open to new attacks not yet publicly
    discussed. A kernel heap overflow can be transformed into multiple
    use-after-frees. This feature makes this type of attack harder too.

    To generate entropy, we use get_random_bytes_arch because no entropy is
    available yet at that stage of boot. In the worst case this function
    falls back to the get_random_bytes sub-API. We also generate a random
    shift and apply it to the pre-computed freelist for each new set of
    pages.
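
    A sketch of the pre-computation step (a plain Fisher-Yates shuffle; the
    helper name and the exact entropy plumbing are illustrative):

    /* Fill @list with 0..count-1 in a random order for one freelist size. */
    static void freelist_randomize(struct rnd_state *state, unsigned int *list,
                                   unsigned int count)
    {
            unsigned int rand, i;

            for (i = 0; i < count; i++)
                    list[i] = i;

            /* Fisher-Yates shuffle */
            for (i = count - 1; i > 0; i--) {
                    rand = prandom_u32_state(state) % (i + 1);
                    swap(list[i], list[rand]);
            }
    }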

    The config option name is not specific to the SLAB as this approach will
    be extended to other allocators like SLUB.

    Performance results highlighted no major changes:

    Hackbench (running 90 10 times):

    Before average: 0.0698
    After average: 0.0663 (-5.01%)

    slab_test, 1 run on boot. A difference is only seen on the 2048-byte
    size test, which is the worst-case scenario covered by freelist
    randomization: new slab pages are constantly being created over the
    10000 allocations. Variance should be mainly due to getting new pages
    every few allocations.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
    10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
    10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
    10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
    10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
    10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
    10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
    10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
    10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
    10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
    10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
    10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 121 cycles
    10000 times kmalloc(64)/kfree -> 121 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 121 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
    10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
    10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
    10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
    10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
    10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
    10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
    10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
    10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
    10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
    10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
    10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 123 cycles
    10000 times kmalloc(64)/kfree -> 142 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 119 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
    Signed-off-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Greg Thelen
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     
  • When we call __kmem_cache_shrink on memory cgroup removal, we need to
    synchronize kmem_cache->cpu_partial update with put_cpu_partial that
    might be running on other cpus. Currently, we achieve that by using
    kick_all_cpus_sync, which works as a system-wide memory barrier. Fast
    though it is, this method has a flaw: it issues a lot of IPIs, which
    might hurt high-performance or real-time workloads.

    To fix this, let's replace kick_all_cpus_sync with synchronize_sched.
    Although the latter may take much longer to finish, it shouldn't be a
    problem in this particular case, because memory cgroups are destroyed
    asynchronously from a workqueue, so no user-visible effects should be
    introduced. OTOH, it will save us from excessive IPIs when someone
    removes a cgroup.

    Anyway, even if using synchronize_sched turns out to take too long, we
    can always introduce some kind of __kmem_cache_shrink batching so that
    this method is only called once per cgroup destruction (not once per
    memcg kmem cache as it is now).
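
    Schematically, the change is confined to the shrink/deactivate path (a
    sketch with a hypothetical wrapper name, not the exact hunk):

    static void kmem_cache_deactivate(struct kmem_cache *s)
    {
            /* Stop per-cpu partial slab usage for this dying cache. */
            s->cpu_partial = 0;
            s->min_partial = 0;

            /*
             * Was kick_all_cpus_sync(), i.e. one IPI per CPU.
             * synchronize_sched() instead waits for every in-flight
             * preempt-disabled section (including put_cpu_partial())
             * to finish without interrupting anyone.
             */
            synchronize_sched();
    }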

    Signed-off-by: Vladimir Davydov
    Reported-by: Peter Zijlstra
    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Checking precisely whether free objects exist requires taking a lock.
    But accuracy isn't that important here because the race window is small
    and, if there are too many free objects, the cache reaper will reap
    them. So this patch makes the check for free-object existence lockless.
    This reduces lock contention in allocation-heavy workloads.

    Note that, until now, n->shared could be freed while being processed by
    a write to slabinfo, but with a trick in this patch we can access it
    safely within an interrupt-disabled period.

    Below is the result of concurrent allocation/free in slab allocation
    benchmark made by Christoph a long time ago. I make the output simpler.
    The number shows cycle count during alloc/free respectively so less is
    better.

    * Before
    Kmalloc N*alloc N*free(32): Average=248/966
    Kmalloc N*alloc N*free(64): Average=261/949
    Kmalloc N*alloc N*free(128): Average=314/1016
    Kmalloc N*alloc N*free(256): Average=741/1061
    Kmalloc N*alloc N*free(512): Average=1246/1152
    Kmalloc N*alloc N*free(1024): Average=2437/1259
    Kmalloc N*alloc N*free(2048): Average=4980/1800
    Kmalloc N*alloc N*free(4096): Average=9000/2078

    * After
    Kmalloc N*alloc N*free(32): Average=344/792
    Kmalloc N*alloc N*free(64): Average=347/882
    Kmalloc N*alloc N*free(128): Average=390/959
    Kmalloc N*alloc N*free(256): Average=393/1067
    Kmalloc N*alloc N*free(512): Average=683/1229
    Kmalloc N*alloc N*free(1024): Average=1295/1325
    Kmalloc N*alloc N*free(2048): Average=2513/1664
    Kmalloc N*alloc N*free(4096): Average=4742/2172

    It shows that allocation performance decreases for object sizes up to
    128 bytes, which may be due to the extra checks in cache_alloc_refill().
    But, taking the improvement in free performance into account, the net
    result looks about the same. Results for the other size classes look
    very promising: roughly a 50% performance improvement.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Until now, cache growing puts a free slab on the node's slab list and
    then we allocate free objects from it. This necessarily requires holding
    the node lock, which is heavily contended. If we refill the cpu cache
    before attaching the slab to the node's slab list, we can avoid holding
    the node lock as much as possible, because the newly allocated slab is
    visible only to the current task. This reduces lock contention.

    Below is the result of concurrent allocation/free in slab allocation
    benchmark made by Christoph a long time ago. I make the output simpler.
    The number shows cycle count during alloc/free respectively so less is
    better.

    * Before
    Kmalloc N*alloc N*free(32): Average=355/750
    Kmalloc N*alloc N*free(64): Average=452/812
    Kmalloc N*alloc N*free(128): Average=559/1070
    Kmalloc N*alloc N*free(256): Average=1176/980
    Kmalloc N*alloc N*free(512): Average=1939/1189
    Kmalloc N*alloc N*free(1024): Average=3521/1278
    Kmalloc N*alloc N*free(2048): Average=7152/1838
    Kmalloc N*alloc N*free(4096): Average=13438/2013

    * After
    Kmalloc N*alloc N*free(32): Average=248/966
    Kmalloc N*alloc N*free(64): Average=261/949
    Kmalloc N*alloc N*free(128): Average=314/1016
    Kmalloc N*alloc N*free(256): Average=741/1061
    Kmalloc N*alloc N*free(512): Average=1246/1152
    Kmalloc N*alloc N*free(1024): Average=2437/1259
    Kmalloc N*alloc N*free(2048): Average=4980/1800
    Kmalloc N*alloc N*free(4096): Average=9000/2078

    It shows that contention is reduced for all the object sizes and
    performance increases by 30 ~ 40%.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is a preparation step for implementing a lockless allocation path
    when there are no free objects in the kmem_cache.

    What we'd like to do here is refill the cpu cache without holding the
    node lock. To accomplish this, the refill should be done after the new
    slab is allocated but before it is attached to the management list. So
    this patch separates cache_grow() into two parts, allocation and
    attaching to the list, in order to add some code in between them in the
    following patch.
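
    Conceptually the split looks like this (declaration sketch; the actual
    function names in the patch may differ):

    /* Allocate and set up a new slab, but do not publish it yet. */
    static struct page *cache_grow_begin(struct kmem_cache *cachep,
                                         gfp_t flags, int nodeid);

    /* Attach the now-initialized slab to the node's slab list. */
    static void cache_grow_end(struct kmem_cache *cachep, struct page *page);

    The following patch can then refill the cpu cache between the two calls,
    while the slab is still private to the current task.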

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, cache_grow() assumes that the allocated page's nodeid is the
    same as the nodeid parameter used for the allocation request. If we
    drop this assumption, we can handle the fallback_alloc() case
    gracefully. So this patch makes cache_grow() handle a page allocated on
    an arbitrary node and cleans up the relevant code.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The slab color doesn't need to be updated strictly. Since locking just
    to change the slab color would cause more lock contention, this patch
    makes access to and modification of the slab color intentionally racy.
    This is a preparation step for implementing a lockless allocation path
    when there are no free objects in the kmem_cache.

    Below is the result of concurrent allocation/free in slab allocation
    benchmark made by Christoph a long time ago. I make the output simpler.
    The number shows cycle count during alloc/free respectively so less is
    better.

    * Before
    Kmalloc N*alloc N*free(32): Average=365/806
    Kmalloc N*alloc N*free(64): Average=452/690
    Kmalloc N*alloc N*free(128): Average=736/886
    Kmalloc N*alloc N*free(256): Average=1167/985
    Kmalloc N*alloc N*free(512): Average=2088/1125
    Kmalloc N*alloc N*free(1024): Average=4115/1184
    Kmalloc N*alloc N*free(2048): Average=8451/1748
    Kmalloc N*alloc N*free(4096): Average=16024/2048

    * After
    Kmalloc N*alloc N*free(32): Average=355/750
    Kmalloc N*alloc N*free(64): Average=452/812
    Kmalloc N*alloc N*free(128): Average=559/1070
    Kmalloc N*alloc N*free(256): Average=1176/980
    Kmalloc N*alloc N*free(512): Average=1939/1189
    Kmalloc N*alloc N*free(1024): Average=3521/1278
    Kmalloc N*alloc N*free(2048): Average=7152/1838
    Kmalloc N*alloc N*free(4096): Average=13438/2013

    It shows that contention is reduced for object size >= 1024 and
    performance increases by roughly 15%.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, the decision to free a slab is made every time a freed
    object is put back into the slab. This has the following problem.

    Assume free_limit = 10 and nr_free = 9.

    Frees happen in the following sequence and nr_free changes as shown:

    free (slab becomes a free slab)
    free (slab does not become a free slab)

    nr_free: 9 -> 10 (at first free) -> 11 (at second free)

    If we check on each object free whether we can free the current slab,
    we can't free any slab in this situation, because the current slab
    isn't a free slab at the point where nr_free exceeds free_limit (at the
    second free), even though a free slab exists.

    However, if we do the check last, we can free one free slab.

    This problem can cause too much memory to be kept in the slab
    subsystem. This patch tries to fix it by checking the number of free
    objects after all the free work is done. If there is a free slab at
    that time, we can free as many slabs as possible, keeping the number of
    free slabs minimal.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The code for setting up a kmem_cache_node is mostly the same in
    cpuup_prepare() and alloc_kmem_cache_node(). Factor it out and clean it
    up.

    Signed-off-by: Joonsoo Kim
    Tested-by: Nishanth Menon
    Tested-by: Jon Hunter
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It can be reused elsewhere, so factor it out. The following patch will
    use it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • slabs_tofree() simply means freeing all free slabs. We can do that by
    just passing INT_MAX, as sketched below.
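
    That is, callers can simply do something like (sketch):

    /* Free every free slab on this node; no exact count needed. */
    drain_freelist(cachep, n, INT_MAX);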

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The initial attempt to remove BAD_ALIEN_MAGIC was reverted by commit
    edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"") because it
    caused a problem on m68k, which has many nodes but !CONFIG_NUMA. In
    that case the alien cache isn't used at all, but a garbage value,
    BAD_ALIEN_MAGIC, was needed to cope with some initialization paths.
    Now that this patch sets use_alien_caches to 0 when !CONFIG_NUMA, there
    is no initialization-path problem, so we don't need BAD_ALIEN_MAGIC at
    all. Remove it.

    Signed-off-by: Joonsoo Kim
    Tested-by: Geert Uytterhoeven
    Acked-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • While processing concurrent allocations, SLAB could be contended a lot
    because it did a lot of work while holding a lock. This patchset tries
    to reduce the number of critical sections in order to reduce lock
    contention. The major changes are a lockless decision to allocate more
    slabs and a lockless cpu cache refill from the newly allocated slab.

    Below is the result of concurrent allocation/free in slab allocation
    benchmark made by Christoph a long time ago. I make the output simpler.
    The number shows cycle count during alloc/free respectively so less is
    better.

    * Before
    Kmalloc N*alloc N*free(32): Average=365/806
    Kmalloc N*alloc N*free(64): Average=452/690
    Kmalloc N*alloc N*free(128): Average=736/886
    Kmalloc N*alloc N*free(256): Average=1167/985
    Kmalloc N*alloc N*free(512): Average=2088/1125
    Kmalloc N*alloc N*free(1024): Average=4115/1184
    Kmalloc N*alloc N*free(2048): Average=8451/1748
    Kmalloc N*alloc N*free(4096): Average=16024/2048

    * After
    Kmalloc N*alloc N*free(32): Average=344/792
    Kmalloc N*alloc N*free(64): Average=347/882
    Kmalloc N*alloc N*free(128): Average=390/959
    Kmalloc N*alloc N*free(256): Average=393/1067
    Kmalloc N*alloc N*free(512): Average=683/1229
    Kmalloc N*alloc N*free(1024): Average=1295/1325
    Kmalloc N*alloc N*free(2048): Average=2513/1664
    Kmalloc N*alloc N*free(4096): Average=4742/2172

    It shows that performance improves greatly (by roughly more than 50%)
    for object classes whose size is larger than 128 bytes.

    This patch (of 11):

    If we hold neither the slab_mutex nor the node lock, the node's shared
    array cache can be freed and re-populated. If __kmem_cache_shrink() is
    called at the same time, it will call drain_array() with n->shared
    without holding the node lock, so a problem can occur. This patch fixes
    the situation by holding the node lock before trying to drain the
    shared array.

    In addition, add a debug check to confirm that no n->shared access race
    exists.

    Signed-off-by: Joonsoo Kim
    Cc: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • A recent cleanup removed some exported functions that were not used
    anywhere, which in turn exposed the fact that some other functions in
    the same file are only used in some configurations.

    We now get a warning about them when CONFIG_HOTPLUG_CPU is disabled:

    kernel/padata.c:670:12: error: '__padata_remove_cpu' defined but not used [-Werror=unused-function]
    static int __padata_remove_cpu(struct padata_instance *pinst, int cpu)
    ^~~~~~~~~~~~~~~~~~~
    kernel/padata.c:650:12: error: '__padata_add_cpu' defined but not used [-Werror=unused-function]
    static int __padata_add_cpu(struct padata_instance *pinst, int cpu)

    This rearranges the code so the __padata_remove_cpu/__padata_add_cpu
    functions are within the #ifdef that protects the code that calls them.
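
    Schematically the new layout is (a sketch of the structure, not the
    actual diff):

    #ifdef CONFIG_HOTPLUG_CPU
    static int __padata_add_cpu(struct padata_instance *pinst, int cpu)
    {
            /* ... */
            return 0;
    }

    static int __padata_remove_cpu(struct padata_instance *pinst, int cpu)
    {
            /* ... */
            return 0;
    }

    /* ... the CPU hotplug notifier that calls them follows here ... */
    #endif /* CONFIG_HOTPLUG_CPU */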

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 4ba6d78c671e ("kernel/padata.c: removed unused code")
    Signed-off-by: Arnd Bergmann
    Cc: Richard Cochran
    Cc: Steffen Klassert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • By accident I stumbled across code that has never been used. This
    driver has EXPORT_SYMBOL functions, and the only user of the code is
    pcrypt.c, but it only uses a subset of the exported symbols.

    According to 'git log -G', the functions padata_set_cpumasks,
    padata_add_cpu, and padata_remove_cpu have never been used since they
    were first introduced. This patch removes the unused code.

    On one 64 bit build, with CRYPTO_PCRYPT built in, the text is more than
    4k smaller.

    kbuild_hp> size $KBUILD_OUTPUT/vmlinux
    text data bss dec hex filename
    10566658 4678360 1122304 16367322 f9beda vmlinux
    10561984 4678360 1122304 16362648 f9ac98 vmlinux

    On another config, 32 bit, the saving is about 0.5k bytes.

    kbuild_hp-x86> size $KBUILD_OUTPUT/vmlinux
    6012005 2409513 2785280 11206798 ab008e vmlinux
    6011491 2409513 2785280 11206284 aafe8c vmlinux

    Signed-off-by: Richard Cochran
    Cc: Steffen Klassert
    Cc: Herbert Xu
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Cochran
     
  • The goto is not useful in ocfs2_put_slot(), so delete it.

    Signed-off-by: Guozhonghua
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guozhonghua
     
  • Clean up unused parameter 'count' in o2hb_read_block_input().

    Signed-off-by: Jun Piao
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jun Piao
     
  • Clean up an unused variable 'wants_rotate' in ocfs2_truncate_rec.

    Signed-off-by: Jun Piao
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    piaojun
     
  • The comment in ocfs2_extended_slot has the offset wrong.

    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guozhonghua
     
  • When activating a static object we need to make sure that the object is
    tracked in the object tracker. If it is a non-static object, the
    activation is illegal.

    In the previous implementation, each subsystem had to take care of this
    in its fixup callbacks. Actually we can put it into the debugobjects
    core. That way we avoid duplicated code and have *pure* fixup
    callbacks.

    To achieve this, a new callback "is_static_object" is introduced to let
    the type-specific code decide whether an object is static or not. If
    it is, we take it into the object tracker; otherwise we warn and invoke
    the fixup callback.

    This change has passed the debugobjects selftest, and I also did some
    testing with all debugobjects support enabled.

    Finally, I have a concern about the fixups: can they modify an object
    that is in an incorrect state during fixup? Since 'addr' may not point
    to any valid object if a non-static object is not tracked, changing
    such an object could overwrite someone else's memory and cause
    unexpected behaviour. For example, timer_fixup_activate binds the timer
    to the function stub_timer.
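
    For example, a type-specific callback could look roughly like the timer
    case (a sketch; the field checks are illustrative):

    static bool timer_is_static_object(void *addr)
    {
            struct timer_list *timer = addr;

            /* A statically initialized timer has never been enqueued. */
            return timer->entry.pprev == NULL &&
                   timer->entry.next == TIMER_ENTRY_STATIC;
    }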

    Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
    [changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
    Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the documentation corresponding to the change (debugobjects:
    make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to the
    change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to the
    change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to the
    change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to the
    change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • debug_object_fixup() returns non-zero when the problem has been fixed.
    But the code got it backwards: it takes 0 to mean that the fixup
    succeeded. Fix it.

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • I am going to introduce the debugobjects infrastructure to the USB
    subsystem, but before that I found that the debugobjects code could be
    improved. This patchset makes the fixup functions return bool instead
    of int, because a fixup only needs to report success or not; boolean is
    the 'real' type.

    This patch (of 7):

    The object debugging infrastructure core provides some fixup callbacks
    for the subsystems that use it. These callbacks are called from the
    debug code whenever a problem in debug_object_init is detected, and the
    debugobjects core expects them to return 1 when the fixup was
    successful, otherwise 0. So the return type is effectively boolean.

    A bad thing is that debug_object_fixup uses the return value in an
    arithmetic operation, which left me confused about what the real return
    type is.

    Reading over the whole code, I found some places that use the return
    value incorrectly (see the next patch). So why not use bool instead?
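
    In other words, the callback prototypes change along these lines (a
    sketch using one of the workqueue fixups as an example):

    /* before: success was signalled with 1, failure with 0 */
    static int work_fixup_init(void *addr, enum debug_obj_state state);

    /* after: the boolean intent is explicit */
    static bool work_fixup_init(void *addr, enum debug_obj_state state);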

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • This adds an additional line of output (to reduce the chances of
    breaking any existing output parsers) which prints the total size before
    and after and the relative difference.

    add/remove: 39/0 grow/shrink: 12408/55 up/down: 362227/-1430 (360797)
    function old new delta
    ext4_fill_super 10556 12590 +2034
    _fpadd_parts - 1186 +1186
    ntfs_fill_super 5340 6164 +824
    ...
    ...
    __divdf3 752 386 -366
    unlzma 3682 3274 -408
    Total: Before=5023101, After=5383898, chg 7.000000%
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Link: http://lkml.kernel.org/r/1463124110-30314-1-git-send-email-vgupta@synopsys.com
    Signed-off-by: Vineet Gupta
    Cc: Josh Triplett
    Cc: Michal Marek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
  • A few instances of "fimware" instead of "firmware" were found. Fix
    these and add the misspelling to the spelling.txt file.

    Signed-off-by: Kees Cook
    Reported-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • scripts/decode_stacktrace.sh presently displays module symbols as

    func+0x0ff/0x5153 [module]

    Add a third argument: the pathname of a directory where the script
    should look for the file module.ko so that the output appears as

    func (foo/bar.c:123) module

    Without the argument, or if the module file isn't found, the script
    prints such symbols as-is, without decoding.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • All references to timespec_add_safe() now use timespec64_add_safe().

    The plan is to replace struct timespec references with struct timespec64
    throughout the kernel as timespec is not y2038 safe.

    Drop timespec_add_safe() and use timespec64_add_safe() for all
    architectures.

    Link: http://lkml.kernel.org/r/1461947989-21926-4-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Acked-by: John Stultz
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     
  • struct timespec is not y2038 safe. Even though timespec might be
    sufficient to represent timeouts, use struct timespec64 here as the
    plan is to get rid of all timespec references in the kernel.

    The patch transitions the common functions: poll_select_set_timeout()
    and select_estimate_accuracy() to use timespec64. And, all the syscalls
    that use these functions are transitioned in the same patch.

    The restart block parameters for poll use monotonic time. Use
    timespec64 here as well to assign the timeout value. The parameter in
    the restart block need not change, because it only holds the monotonic
    timestamp at which the timeout should occur, and an unsigned long
    should be big enough for that timestamp.

    The system call interfaces will be handled in a separate series.

    Compat interfaces need not change as timespec64 is an alias to struct
    timespec on a 64 bit system.

    Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Acked-by: John Stultz
    Acked-by: David S. Miller
    Cc: Alexander Viro
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     
  • timespec64_add_safe() has been defined in time64.h for 64-bit systems,
    but 32-bit systems only have an extern function prototype. Provide a
    definition of the function for them as well.

    The function will be necessary as part of y2038 changes. struct
    timespec is not y2038 safe. All references to timespec will be replaced
    by struct timespec64. The function is meant to be a replacement for
    timespec_add_safe().

    The implementation is similar to timespec_add_safe().
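
    A sketch of such a definition, mirroring timespec_add_safe() (the actual
    patch may differ in detail):

    /*
     * Add two timespec64 values, saturating at TIME64_MAX on overflow;
     * callers use the result for timeouts, so "never" is the safe answer.
     */
    struct timespec64 timespec64_add_safe(const struct timespec64 lhs,
                                          const struct timespec64 rhs)
    {
            struct timespec64 res;

            set_normalized_timespec64(&res, lhs.tv_sec + rhs.tv_sec,
                                      lhs.tv_nsec + rhs.tv_nsec);

            if (unlikely(res.tv_sec < lhs.tv_sec || res.tv_sec < rhs.tv_sec))
                    res.tv_sec = TIME64_MAX;

            return res;
    }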

    Link: http://lkml.kernel.org/r/1461947989-21926-2-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Acked-by: John Stultz
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani