16 Jan, 2016

1 commit

  • lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.

    This patch also gets rid of the custom helper functions --
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing a tail page
    to these helpers will trigger VM_BUG_ON().

    SLUB uses PG_locked as a bit spin lock. IIUC, tail pages should never
    appear there. VM_BUG_ON() is added to make sure that this assumption is
    correct.
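
    A minimal sketch of the tail-to-head redirection the patch describes,
    assuming the stock compound_head() and test_and_set_bit_lock() helpers
    (the exact upstream code may differ):

    static inline int trylock_page(struct page *page)
    {
            /* Tail pages redirect to the head page's PG_locked bit. */
            page = compound_head(page);
            return likely(!test_and_set_bit_lock(PG_locked, &page->flags));
    }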

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces the SLAB_ACCOUNT flag which, if
    passed to kmem_cache_create, forces accounting for every allocation from
    this cache even if __GFP_ACCOUNT is not passed.

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).
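
    A hedged usage sketch (the "foo" cache and struct are made up for
    illustration; SLAB_ACCOUNT and kmem_cache_create are the real API):

    struct kmem_cache *foo_cache;

    /* Every object from this cache is charged to the allocating memcg,
     * without passing __GFP_ACCOUNT at each call site.
     */
    foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
                                  SLAB_ACCOUNT, NULL);
    obj = kmem_cache_alloc(foo_cache, GFP_KERNEL);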

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Nov, 2015

5 commits

  • Adjust kmem_cache_alloc_bulk API before we have any real users.

    Adjust the API to return type 'int' instead of the previous type 'bool'.
    This is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary, when
    specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to run bulk
    alloc without local IRQs disabled, using a cmpxchg to "steal" the entire
    c->freelist or page->freelist. To avoid overshooting we would stop
    processing at a slab-page boundary; else we always end up returning some
    objects at the cost of another cmpxchg.

    To stay compatible with future users of this API linking against an older
    kernel when using the new flag, we need to return the number of allocated
    objects with this API change.
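
    A sketch of how a caller is expected to treat the int return value ('s'
    is assumed to be an existing kmem_cache; error handling is illustrative):

    void *objs[16];
    int allocated;

    /* Returns the number of objects allocated; 0 signals failure. */
    allocated = kmem_cache_alloc_bulk(s, GFP_KERNEL, ARRAY_SIZE(objs), objs);
    if (!allocated)
            return -ENOMEM;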

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • The initial implementation missed kmem cgroup support in the
    kmem_cache_free_bulk() call; add it.

    If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough
    to not add any asm code.

    Incoming bulk free objects can belong to different kmem cgroups, and the
    object free call can happen at a later point outside the memcg context.
    Thus, we need to keep the original kmem_cache to correctly verify whether
    a memcg object matches against its "root_cache" (s->memcg_params.root_cache).

    Signed-off-by: Jesper Dangaard Brouer
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • The call slab_pre_alloc_hook() interacts with kmemcg and is not allowed to
    be called several times inside the bulk alloc for loop, due to the call to
    memcg_kmem_get_cache().

    This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache.

    As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able
    to handle an array of objects, as sketched below.

    A subtle detail: the loop iterator "i" in slab_post_alloc_hook() must have
    the same type (size_t) as the size argument. This helps the compiler more
    easily realize that it can remove the loop when all debug statements inside
    the loop evaluate to nothing. Note, this is only an issue because the
    kernel is compiled with the GCC option -fno-strict-overflow.

    In slab_alloc_node() the compiler inlines and optimizes the invocation of
    slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and
    accessing the object directly.
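
    A simplified sketch of the array-handling hook with the size_t iterator
    discussed above (hook internals abbreviated; helper names as in mm/slub.c
    of that era):

    static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
                                            size_t size, void **p)
    {
            size_t i;       /* same type as 'size' so an empty loop is removable */

            for (i = 0; i < size; i++) {
                    kmemleak_alloc_recursive(p[i], s->object_size, 1,
                                             s->flags, flags);
                    kasan_slab_alloc(s, p[i]);
            }
            memcg_kmem_put_cache(s);
    }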

    Signed-off-by: Jesper Dangaard Brouer
    Reported-by: Vladimir Davydov
    Suggested-by: Vladimir Davydov
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • This change focuses on improving the speed of object freeing in the
    "slowpath" of kmem_cache_free_bulk.

    The calls slab_free (fastpath) and __slab_free (slowpath) have been
    extended with support for bulk free, which amortizes the overhead of
    the (locked) cmpxchg_double.

    To use the new bulking feature, we build what I call a detached
    freelist. The detached freelist takes advantage of three properties:

    1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

    2) many freelists can co-exist side-by-side in the same slab-page
    each with a separate head pointer.

    3) it is the visibility of the head pointer that needs synchronization.

    Given these properties, the brilliant part is that the detached freelist
    can be constructed without any need for synchronization: it is built
    directly in the page objects and allocated on the stack of the
    kmem_cache_free_bulk call. Thus, the freelist head pointer is not visible
    to other CPUs.

    All objects in a SLUB freelist must belong to the same slab-page.
    Thus, constructing the detached freelist is about matching objects
    that belong to the same slab-page. The bulk free array is scanned in
    a progressive manner with a limited look-ahead facility.

    Kmem debug support is handled in the call to slab_free().

    Notice kmem_cache_free_bulk no longer needs to disable IRQs. This
    only slowed down single-object bulk free by approx 3 cycles.

    Performance data:
    Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz

    SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns

    To get stable and comparable numbers, the kernel has been booted with
    "slab_nomerge" (this also improves performance for larger bulk sizes).

    Performance data, compared against fallback bulking:

    bulk - fallback bulk - improvement with this patch
    1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
    2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
    3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
    4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
    8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
    16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
    30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
    32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
    34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
    48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
    64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
    128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
    158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
    250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%

    Performance data, compared current in-kernel bulking:

    bulk - curr in-kernel - improvement with this patch
    1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
    2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
    3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
    4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
    8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
    16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
    30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
    48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
    64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
    128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
    158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
    250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%

    Performance with normal SLUB merging is significantly slower for
    larger bulking. This is believed to (primarily) be an effect of not
    having to share the per-CPU data-structures, as tuning per-CPU size
    can achieve similar performance.

    bulk - slab_nomerge - normal SLUB merge
    1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
    2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
    3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
    4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
    8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
    16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
    30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
    32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
    34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
    48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
    64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
    128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
    158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
    250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19

    Joint work with Alexander Duyck.

    [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
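
    A sketch of the detached-freelist bookkeeping described above (field
    layout is illustrative, not necessarily the exact upstream struct):

    /* Built on the caller's stack, so the head pointer is never visible
     * to other CPUs and needs no synchronization while it is constructed.
     */
    struct detached_freelist {
            struct page *page;      /* slab-page all linked objects belong to */
            void *freelist;         /* head of the freelist being built */
            void *tail;             /* last object linked onto the list */
            int cnt;                /* number of objects on the list */
    };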

    [akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Make it possible to free a freelist with several objects by adjusting the
    API of slab_free() and __slab_free() to take head, tail and an objects
    counter (cnt).

    Tail being NULL indicates a single object free of the head object. This
    allows compiler inline constant propagation in slab_free() and
    slab_free_freelist_hook() to avoid adding any overhead in case of single
    object free.

    This allows a freelist with several objects (all within the same
    slab-page) to be freed using a single locked cmpxchg_double in
    __slab_free() and with an unlocked cmpxchg_double in slab_free().

    Object debugging on the free path is also extended to handle these
    freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if
    objects don't belong to the same slab-page.

    These changes are needed for the next patch to bulk free the detached
    freelists it introduces and constructs.

    Micro benchmarking showed no performance reduction due to this change,
    when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
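
    A hedged sketch of the adjusted call sites (variables assumed in scope;
    'df' is a detached freelist as in the patch described above):

    /* Single object free: tail == NULL, cnt == 1, so constant propagation
     * strips the freelist handling and keeps the old fastpath cost.
     */
    slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_);

    /* Bulk free of a whole detached freelist within one slab-page. */
    slab_free(s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_);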

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

07 Nov, 2015

2 commits

  • We have properly typed page->rcu_head, no need to cast page->lru.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
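
    For reference, the recommended helper reduces to a __GFP_DIRECT_RECLAIM
    test, roughly:

    static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
    {
            return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
    }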

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

06 Nov, 2015

5 commits

  • It's recommended to have slub's user tracking enabled with CONFIG_KASAN,
    because:

    a) User tracking disables slab merging which improves
    detecting out-of-bounds accesses.
    b) User tracking metadata acts as a redzone which also improves
    detecting out-of-bounds accesses.
    c) User tracking provides additional information about the object.
    This information helps to understand bugs.

    Currently it is not enabled by default. Besides recompiling the kernel
    with KASAN and reinstalling it, the user also has to change the boot
    cmdline, which is not very handy.

    Enable slub user tracking by default with KASAN=y, since there is no good
    reason not to do this.

    [akpm@linux-foundation.org: little fixes, per David]
    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
    uncharging kmem pages to memcg, but currently they are not used for
    charging slab pages (i.e. they are only used for charging pages allocated
    with alloc_kmem_pages). The only reason why the slab subsystem uses
    special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
    needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
    to the memcg that the current task belongs to.

    To remove this diversity, this patch adds an extra argument to
    __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
    not NULL, the function tries to charge to the memcg it points to,
    otherwise it charges to the current context. Next, it makes the slab
    subsystem use this function to charge slab pages.

    Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
    in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
    __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
    don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
    Besides, one can now detect which memcg a slab page belongs to by reading
    /proc/kpagecgroup.

    Note, this patch switches slab to charge-after-alloc design. Since this
    design is already used for all other memcg charges, it should not make any
    difference.

    [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • In slub_order(), the order starts from max(min_order,
    get_order(min_objects * size)). When (min_objects * size) has different
    order from (min_objects * size + reserved), it will skip this order via a
    check in the loop.

    This patch optimizes this a little by calculating the start order with
    `reserved' in consideration and removing the check in the loop.
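
    A sketch of the adjusted starting point inside slub's order calculation
    (names match the slub_order() arguments mentioned above; shown in
    isolation):

    /* Account for 'reserved' up front so the first candidate order is
     * already big enough, making the in-loop skip unnecessary.
     */
    order = max(min_order, get_order(min_objects * size + reserved));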

    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • get_order() is easier to understand.

    This patch just replaces it.

    Signed-off-by: Wei Yang
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • In calculate_order(), it tries to calculate the best order by adjusting
    the fraction and min_objects. On each iteration of min_objects, fraction
    iterates over 16, 8, 4, which means the acceptable waste increases from
    1/16 to 1/8 to 1/4.

    This patch corrects the comment according to the code.

    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

09 Sep, 2015

1 commit

  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    It has been also considered to really provide a convenience function for
    allocations restricted to a node, but the major opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would be previously
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.
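
    A usage sketch of the distinction after the rename ('nid', 'order' and
    'page' assumed in scope):

    /* Optimized variant: caller must pass a valid node id, never NUMA_NO_NODE. */
    page = __alloc_pages_node(nid, GFP_KERNEL, order);

    /* General variant: accepts NUMA_NO_NODE and falls back to the current
     * node; the node is still only preferred unless __GFP_THISNODE is added.
     */
    page = alloc_pages_node(nid, GFP_KERNEL, order);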

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

05 Sep, 2015

9 commits

  • Description is almost copied from commit fb05e7a89f50 ("net: don't wait
    for order-3 page allocation").

    I saw excessive direct memory reclaim/compaction triggered by slub. This
    causes performance issues and adds latency. Slub uses high-order
    allocation to reduce internal fragmentation and management overhead. But
    direct memory reclaim/compaction has high overhead and the benefit of
    high-order allocation can't compensate for the overhead of both.

    This patch makes the auxiliary high-order allocation atomic. If there is
    no memory pressure and memory isn't fragmented, the allocation will still
    succeed, so we don't sacrifice high-order allocation's benefit here. If
    the atomic allocation fails, direct memory reclaim/compaction will not be
    triggered, allocation falls back to low-order immediately, hence the
    direct memory reclaim/compaction overhead is avoided. In the allocation
    failure case, kswapd is woken up and tries to make high-order freepages,
    so the allocation could succeed next time.

    Following is the test to measure effect of this patch.

    System: QEMU, CPU 8, 512 MB
    Mem: 25% memory is allocated at random position to make fragmentation.
    Memory-hogger occupies 150 MB memory.
    Workload: hackbench -g 20 -l 1000

    Average result of 10 runs (Base vs Patched)

    elapsed_time(s): 4.3468 vs 2.9838
    compact_stall: 461.7 vs 73.6
    pgmigrate_success: 28315.9 vs 7256.1
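
    A hedged sketch of the allocate_slab() idea (written with the __GFP_WAIT
    spelling of that era; the exact mask may differ from the patch):

    alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
    /* For the speculative high-order attempt, skip direct reclaim and
     * compaction entirely; on failure we fall back to the minimum order.
     */
    if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
            alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;

    page = alloc_slab_page(s, alloc_gfp, node, oo);
    if (unlikely(!page)) {
            oo = s->min;
            /* the low-order fallback keeps the caller's reclaim policy */
            page = alloc_slab_page(s, flags, node, oo);
    }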

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Shaohua Li
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • sysfs_slab_add() shouldn't call kobject_put at the error path: this puts
    the last reference of the kmem-cache kobject and frees it. The kmem cache
    would then be freed a second time at the error path in kmem_cache_create().

    For example this happens when slub debug is enabled at runtime and
    somebody creates a new kmem cache:

    # echo 1 | tee /sys/kernel/slab/*/sanity_checks
    # modprobe configfs

    "configfs_dir_cache" cannot be merged because the existing slabs have
    debug enabled, and a new slab cannot be created because the unique name
    ":t-0000096" is already taken.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Initializing a new slab can introduce rather large latencies because most
    of the initialization always runs with interrupts disabled.

    There is no point in doing so. The newly allocated slab is not visible
    yet, so there is no reason to protect it against concurrent alloc/free.

    Move the expensive parts of the initialization into allocate_slab(), so
    for all allocations with GFP_WAIT set, interrupts are enabled.
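
    A sketch of the resulting pattern inside allocate_slab() (using the
    __GFP_WAIT flag of that era; surrounding code elided):

    /* The new slab is not visible to anyone yet, so the expensive setup
     * can safely run with interrupts enabled for sleepable allocations.
     */
    if (flags & __GFP_WAIT)
            local_irq_enable();

    page = alloc_slab_page(s, alloc_gfp, node, oo);
    /* ... initialize the freelist, poison/setup objects ... */

    if (flags & __GFP_WAIT)
            local_irq_disable();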

    Signed-off-by: Thomas Gleixner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sebastian Andrzej Siewior
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Per request of Joonsoo Kim, add kmem debug support.

    I've tested that when debugging is disabled, there is almost no
    performance impact as this code basically gets removed by the compiler.

    Need some guidance in enabling and testing this.

    bulk- PREVIOUS - THIS-PATCH
    1 - 43 cycles(tsc) 10.811 ns - 44 cycles(tsc) 11.236 ns improved -2.3%
    2 - 27 cycles(tsc) 6.867 ns - 28 cycles(tsc) 7.019 ns improved -3.7%
    3 - 21 cycles(tsc) 5.496 ns - 22 cycles(tsc) 5.526 ns improved -4.8%
    4 - 24 cycles(tsc) 6.038 ns - 19 cycles(tsc) 4.786 ns improved 20.8%
    8 - 17 cycles(tsc) 4.280 ns - 18 cycles(tsc) 4.572 ns improved -5.9%
    16 - 17 cycles(tsc) 4.483 ns - 18 cycles(tsc) 4.658 ns improved -5.9%
    30 - 18 cycles(tsc) 4.531 ns - 18 cycles(tsc) 4.568 ns improved 0.0%
    32 - 58 cycles(tsc) 14.586 ns - 65 cycles(tsc) 16.454 ns improved -12.1%
    34 - 53 cycles(tsc) 13.391 ns - 63 cycles(tsc) 15.932 ns improved -18.9%
    48 - 65 cycles(tsc) 16.268 ns - 50 cycles(tsc) 12.506 ns improved 23.1%
    64 - 53 cycles(tsc) 13.440 ns - 63 cycles(tsc) 15.929 ns improved -18.9%
    128 - 79 cycles(tsc) 19.899 ns - 86 cycles(tsc) 21.583 ns improved -8.9%
    158 - 90 cycles(tsc) 22.732 ns - 90 cycles(tsc) 22.552 ns improved 0.0%
    250 - 95 cycles(tsc) 23.916 ns - 98 cycles(tsc) 24.589 ns improved -3.2%

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • This implements the SLUB specific kmem_cache_free_bulk(). The SLUB
    allocator now has both bulk alloc and free implemented.

    Choose to reenable local IRQs while calling the slowpath __slab_free().
    In the worst case, where all objects hit the slowpath call, the
    performance should still be faster than the fallback function
    __kmem_cache_free_bulk(), because local_irq_{disable+enable} is very fast
    (7 cycles), while the fallback invokes this_cpu_cmpxchg() which is
    slightly slower (9 cycles). Nitpicking, this should be faster for N>=4,
    due to the entry cost of local_irq_{disable+enable}.

    Do notice that the save+restore variant is very expensive; this is key to
    why this optimization works.

    CPU: i7-4790K CPU @ 4.00GHz
    * local_irq_{disable,enable}: 7 cycles(tsc) - 1.821 ns
    * local_irq_{save,restore} : 37 cycles(tsc) - 9.443 ns

    Measurements on CPU i7-4790K @ 4.00GHz
    Baseline normal fastpath (alloc+free cost): 43 cycles(tsc) 10.834 ns

    Bulk- fallback - this-patch
    1 - 58 cycles(tsc) 14.542 ns - 43 cycles(tsc) 10.811 ns improved 25.9%
    2 - 50 cycles(tsc) 12.659 ns - 27 cycles(tsc) 6.867 ns improved 46.0%
    3 - 48 cycles(tsc) 12.168 ns - 21 cycles(tsc) 5.496 ns improved 56.2%
    4 - 47 cycles(tsc) 11.987 ns - 24 cycles(tsc) 6.038 ns improved 48.9%
    8 - 46 cycles(tsc) 11.518 ns - 17 cycles(tsc) 4.280 ns improved 63.0%
    16 - 45 cycles(tsc) 11.366 ns - 17 cycles(tsc) 4.483 ns improved 62.2%
    30 - 45 cycles(tsc) 11.433 ns - 18 cycles(tsc) 4.531 ns improved 60.0%
    32 - 75 cycles(tsc) 18.983 ns - 58 cycles(tsc) 14.586 ns improved 22.7%
    34 - 71 cycles(tsc) 17.940 ns - 53 cycles(tsc) 13.391 ns improved 25.4%
    48 - 80 cycles(tsc) 20.077 ns - 65 cycles(tsc) 16.268 ns improved 18.8%
    64 - 71 cycles(tsc) 17.799 ns - 53 cycles(tsc) 13.440 ns improved 25.4%
    128 - 91 cycles(tsc) 22.980 ns - 79 cycles(tsc) 19.899 ns improved 13.2%
    158 - 100 cycles(tsc) 25.241 ns - 90 cycles(tsc) 22.732 ns improved 10.0%
    250 - 102 cycles(tsc) 25.583 ns - 95 cycles(tsc) 23.916 ns improved 6.9%

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Call the slowpath __slab_alloc() from within the bulk loop, as the
    side-effect of this call likely repopulates c->freelist.

    Choose to reenable local IRQs while calling the slowpath.

    Saving some optimizations for later. E.g. it is possible to extract
    parts of __slab_alloc() and avoid the unnecessary and expensive (37
    cycles) local_irq_{save,restore}. For now, be happy calling
    __slab_alloc(); this lowers the icache impact of this function and I
    don't have to worry about correctness.

    Measurements on CPU i7-4790K @ 4.00GHz
    Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns

    Bulk- fallback - this-patch
    1 - 58 cycles(tsc) 14.516 ns - 49 cycles(tsc) 12.459 ns improved 15.5%
    2 - 51 cycles(tsc) 12.930 ns - 38 cycles(tsc) 9.605 ns improved 25.5%
    3 - 49 cycles(tsc) 12.274 ns - 34 cycles(tsc) 8.525 ns improved 30.6%
    4 - 48 cycles(tsc) 12.058 ns - 32 cycles(tsc) 8.036 ns improved 33.3%
    8 - 46 cycles(tsc) 11.609 ns - 31 cycles(tsc) 7.756 ns improved 32.6%
    16 - 45 cycles(tsc) 11.451 ns - 32 cycles(tsc) 8.148 ns improved 28.9%
    30 - 79 cycles(tsc) 19.865 ns - 68 cycles(tsc) 17.164 ns improved 13.9%
    32 - 76 cycles(tsc) 19.212 ns - 66 cycles(tsc) 16.584 ns improved 13.2%
    34 - 74 cycles(tsc) 18.600 ns - 63 cycles(tsc) 15.954 ns improved 14.9%
    48 - 88 cycles(tsc) 22.092 ns - 77 cycles(tsc) 19.373 ns improved 12.5%
    64 - 80 cycles(tsc) 20.043 ns - 68 cycles(tsc) 17.188 ns improved 15.0%
    128 - 99 cycles(tsc) 24.818 ns - 89 cycles(tsc) 22.404 ns improved 10.1%
    158 - 99 cycles(tsc) 24.977 ns - 92 cycles(tsc) 23.089 ns improved 7.1%
    250 - 106 cycles(tsc) 26.552 ns - 99 cycles(tsc) 24.785 ns improved 6.6%

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • First piece: acceleration of retrieval of per cpu objects

    If we are allocating lots of objects then it is advantageous to disable
    interrupts and avoid the this_cpu_cmpxchg() operation to get these objects
    faster.

    Note that we cannot do the fast operation if debugging is enabled, because
    we would have to add extra code to do all the debugging checks. And it
    would not be fast anyway.

    Note also that the requirement of having interrupts disabled avoids having
    to do processor flag operations.

    Allocate as many objects as possible in the fast way and then fall back to
    the generic implementation for the rest of the objects.
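
    A simplified sketch of the fast loop (fallback and error handling
    trimmed; the local variables mirror kmem_cache_alloc_bulk()'s arguments
    and the helpers are the ones already used by the SLUB fastpath):

    /* IRQs are off, so no this_cpu_cmpxchg() is needed per object. */
    local_irq_disable();
    c = this_cpu_ptr(s->cpu_slab);
    for (i = 0; i < size; i++) {
            void *object = c->freelist;

            if (unlikely(!object))
                    break;          /* hand the rest to the generic path */

            c->freelist = get_freepointer(s, object);
            p[i] = object;
    }
    c->tid = next_tid(c->tid);
    local_irq_enable();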

    Measurements on CPU i7-4790K @ 4.00GHz
    Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.554 ns

    Bulk- fallback - this-patch
    1 - 57 cycles(tsc) 14.432 ns - 48 cycles(tsc) 12.155 ns improved 15.8%
    2 - 50 cycles(tsc) 12.746 ns - 37 cycles(tsc) 9.390 ns improved 26.0%
    3 - 48 cycles(tsc) 12.180 ns - 33 cycles(tsc) 8.417 ns improved 31.2%
    4 - 48 cycles(tsc) 12.015 ns - 32 cycles(tsc) 8.045 ns improved 33.3%
    8 - 46 cycles(tsc) 11.526 ns - 30 cycles(tsc) 7.699 ns improved 34.8%
    16 - 45 cycles(tsc) 11.418 ns - 32 cycles(tsc) 8.205 ns improved 28.9%
    30 - 80 cycles(tsc) 20.246 ns - 73 cycles(tsc) 18.328 ns improved 8.8%
    32 - 79 cycles(tsc) 19.946 ns - 72 cycles(tsc) 18.208 ns improved 8.9%
    34 - 78 cycles(tsc) 19.659 ns - 71 cycles(tsc) 17.987 ns improved 9.0%
    48 - 86 cycles(tsc) 21.516 ns - 82 cycles(tsc) 20.566 ns improved 4.7%
    64 - 93 cycles(tsc) 23.423 ns - 89 cycles(tsc) 22.480 ns improved 4.3%
    128 - 100 cycles(tsc) 25.170 ns - 99 cycles(tsc) 24.871 ns improved 1.0%
    158 - 102 cycles(tsc) 25.549 ns - 101 cycles(tsc) 25.375 ns improved 1.0%
    250 - 101 cycles(tsc) 25.344 ns - 100 cycles(tsc) 25.182 ns improved 1.0%

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Add the basic infrastructure for alloc/free operations on pointer arrays.
    It includes a generic function in the common slab code that is used in
    this infrastructure patch to create the unoptimized functionality for slab
    bulk operations.

    Allocators can then provide optimized allocation functions for situations
    in which large numbers of objects are needed. These optimizations may
    avoid taking locks repeatedly and bypass metadata creation if all objects
    in slab pages can be used to provide the objects required.

    Allocators can extend the skeletons provided and add their own code to the
    bulk alloc and free functions. They can keep the generic allocation and
    freeing and just fall back to those if optimizations would not work (like
    for example when debugging is on).
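
    The generic fallback is essentially a loop over the existing
    single-object entry points; roughly (shown with the int return convention
    adopted later in the series):

    int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
                                size_t nr, void **p)
    {
            size_t i;

            for (i = 0; i < nr; i++) {
                    void *x = kmem_cache_alloc(s, flags);

                    if (!x) {
                            /* undo the partial allocation and report failure */
                            __kmem_cache_free_bulk(s, i, p);
                            return 0;
                    }
                    p[i] = x;
            }
            return i;
    }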

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • With this patchset the SLUB allocator now has both bulk alloc and free
    implemented.

    This patchset mostly optimizes the "fastpath" where objects are available
    on the per CPU fastpath page. This mostly amortizes the less-heavy
    non-locked cmpxchg_double used on the fastpath.

    The "fallback" bulking (e.g. __kmem_cache_free_bulk) provides a good basis
    for comparison. Measurements[1] of the fallback functions
    __kmem_cache_{free,alloc}_bulk have been copied from slab_common.c and
    forced "noinline" to force a function call like slab_common.c.

    Measurements on CPU i7-4790K @ 4.00GHz
    Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns

    Measurements last-patch with disabled debugging:

    Bulk- fallback - this-patch
    1 - 57 cycles(tsc) 14.448 ns - 44 cycles(tsc) 11.236 ns improved 22.8%
    2 - 51 cycles(tsc) 12.768 ns - 28 cycles(tsc) 7.019 ns improved 45.1%
    3 - 48 cycles(tsc) 12.232 ns - 22 cycles(tsc) 5.526 ns improved 54.2%
    4 - 48 cycles(tsc) 12.025 ns - 19 cycles(tsc) 4.786 ns improved 60.4%
    8 - 46 cycles(tsc) 11.558 ns - 18 cycles(tsc) 4.572 ns improved 60.9%
    16 - 45 cycles(tsc) 11.458 ns - 18 cycles(tsc) 4.658 ns improved 60.0%
    30 - 45 cycles(tsc) 11.499 ns - 18 cycles(tsc) 4.568 ns improved 60.0%
    32 - 79 cycles(tsc) 19.917 ns - 65 cycles(tsc) 16.454 ns improved 17.7%
    34 - 78 cycles(tsc) 19.655 ns - 63 cycles(tsc) 15.932 ns improved 19.2%
    48 - 68 cycles(tsc) 17.049 ns - 50 cycles(tsc) 12.506 ns improved 26.5%
    64 - 80 cycles(tsc) 20.009 ns - 63 cycles(tsc) 15.929 ns improved 21.3%
    128 - 94 cycles(tsc) 23.749 ns - 86 cycles(tsc) 21.583 ns improved 8.5%
    158 - 97 cycles(tsc) 24.299 ns - 90 cycles(tsc) 22.552 ns improved 7.2%
    250 - 102 cycles(tsc) 25.681 ns - 98 cycles(tsc) 24.589 ns improved 3.9%

    Benchmarking shows impressive improvements in the "fastpath" with a small
    number of objects in the working set. Once the working set increases,
    resulting in activating the "slowpath" (that contains the heavier locked
    cmpxchg_double) the improvement decreases.

    I'm currently working on also optimizing the "slowpath" (as the network
    stack use-case hits this), but this patchset should provide a good
    foundation for further improvements. The rest of my patch queue in this
    area needs some more work, but preliminary results are good. I'm
    attending Netfilter Workshop[2] next week, and I'll hopefully return to
    working on further improvements in this area.

    This patch (of 6):

    s/succedd/succeed/

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

22 Aug, 2015

1 commit

  • Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
    checks for page->pfmemalloc to __skb_fill_page_desc():

    if (page->pfmemalloc && !page->mapping)
            skb->pfmemalloc = true;

    It assumes page->mapping == NULL implies that page->pfmemalloc can be
    trusted. However, __delete_from_page_cache() can set page->mapping
    to NULL and leave the page->index value alone. Due to being in a union, a
    non-zero page->index will be interpreted as a true page->pfmemalloc.

    So the assumption is invalid if the networking code can see such a page.
    And it seems it can. We have encountered this with an NFS over loopback
    setup when such a page is attached to a new skbuff. There is no copying
    going on in this case so the page confuses __skb_fill_page_desc, which
    interprets the index as the pfmemalloc flag, and the network stack drops
    packets that have been allocated using the reserves unless they are to be
    queued on sockets handling the swapping, which is the case here, and that
    leads to hangs when the nfs client waits for a response from the server
    which has been dropped and thus never arrives.

    The struct page is already heavily packed so rather than finding another
    hole to put it in, let's do a trick instead. We can reuse the index
    again but define it to an impossible value (-1UL). This is the page
    index so it should never see a value that large. Replace all direct
    users of page->pfmemalloc by page_is_pfmemalloc which will hide this
    nastiness from unspoiled eyes.

    The information will get lost if somebody wants to use page->index
    obviously but that was the case before and the original code expected
    that the information should be persisted somewhere else if that is
    really needed (e.g. what SLAB and SLUB do).
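
    A sketch of the resulting helpers (roughly as introduced by the patch):

    static inline bool page_is_pfmemalloc(struct page *page)
    {
            /* -1UL is an impossible page index, so it can safely encode
             * "this page was allocated from the pfmemalloc reserves".
             */
            return page->index == -1UL;
    }

    static inline void set_page_pfmemalloc(struct page *page)
    {
            page->index = -1UL;
    }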

    [akpm@linux-foundation.org: fix blooper in slub]
    Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Debugged-by: Jiri Bohac
    Cc: Eric Dumazet
    Cc: David Miller
    Acked-by: Mel Gorman
    Cc: [3.6+]
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Jun, 2015

1 commit

  • This patch moves the initialization of the size_index table slightly
    earlier so that the first few kmem_cache_node's can be safely allocated
    when KMALLOC_MIN_SIZE is large.

    There are currently two ways to generate indices into kmalloc_caches (via
    kmalloc_index() and via the size_index table in slab_common.c) and on some
    arches (possibly only MIPS) they potentially disagree with each other
    until create_kmalloc_caches() has been called. It seems that the
    intention is that the size_index table is a fast equivalent to
    kmalloc_index() and that create_kmalloc_caches() patches the table to
    return the correct value for the cases where kmalloc_index()'s
    if-statements apply.

    The failing sequence was:
    * kmalloc_caches contains NULL elements
    * kmem_cache_init initialises the element that 'struct
    kmem_cache_node' will be allocated to. For 32-bit Mips, this is a
    56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
    * init_list is called which calls kmalloc_node to allocate a 'struct
    kmem_cache_node'.
    * kmalloc_slab selects the kmem_caches element using
    size_index[size_index_elem(size)]. For MIPS, size is 56, and the
    expression returns 6.
    * This element of kmalloc_caches is NULL and allocation fails.
    * If it had not already failed, it would have called
    create_kmalloc_caches() at this point which would have changed
    size_index[size_index_elem(size)] to 7.

    I don't believe the bug to be LLVM specific but GCC doesn't normally
    encounter the problem. I haven't been able to identify exactly what GCC
    is doing better (probably inlining) but it seems that GCC is managing to
    optimize to the point that it eliminates the problematic allocations.
    This theory is supported by the fact that GCC can be made to fail in the
    same way by changing inline, __inline, __inline__, and __always_inline in
    include/linux/compiler-gcc.h such that they don't actually inline things.

    Signed-off-by: Daniel Sanders
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Sanders
     

16 Apr, 2015

1 commit

  • We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree since it doesn't work reliably on non-scalar types.

    This patch removes the rest of the usages of ACCESS_ONCE, and uses the new
    READ_ONCE API for the read accesses. This makes things cleaner, instead
    of using separate/multiple sets of APIs.
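
    The conversion pattern, for reference (the field shown is illustrative):

    /* before */
    object = ACCESS_ONCE(c->freelist);

    /* after */
    object = READ_ONCE(c->freelist);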

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

15 Apr, 2015

2 commits

  • Use the normal return values for bool functions

    Signed-off-by: Joe Perches
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • By moving the O option detection into the switch statement, we allow this
    parameter to be combined with other options correctly. Previously options
    like slub_debug=OFZ would only detect the 'o' and use DEBUG_DEFAULT_FLAGS
    to fill in the rest of the flags.

    Signed-off-by: Chris J Arges
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris J Arges
     

26 Mar, 2015

1 commit

  • Commit 9aabf810a67c ("mm/slub: optimize alloc/free fastpath by removing
    preemption on/off") introduced an occasional hang for kernels built with
    CONFIG_PREEMPT && !CONFIG_SMP.

    The problem is the following loop the patch introduced to
    slab_alloc_node and slab_free:

    do {
            tid = this_cpu_read(s->cpu_slab->tid);
            c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

    GCC 4.9 has been observed to hoist the load of c and c->tid above the
    loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
    constant and does not force a reload). On arm64 the generated assembly
    looks like:

            ldr x4, [x0,#8]
    loop:
            ldr x1, [x0,#8]
            cmp x1, x4
            b.ne loop

    If the thread is preempted between the load of c->tid (into x1) and tid
    (into x4), and an allocation or free occurs in another thread (bumping
    the cpu_slab's tid), the thread will be stuck in the loop until
    s->cpu_slab->tid wraps, which may be forever in the absence of
    allocations/frees on the same CPU.

    This patch changes the loop condition to access c->tid with READ_ONCE.
    This ensures that the value is reloaded even when the compiler would
    otherwise assume it could cache the value, and also ensures that the
    load will not be torn.
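
    After the fix the loop reads roughly as follows; READ_ONCE forces c->tid
    to be reloaded on every iteration:

    do {
            tid = this_cpu_read(s->cpu_slab->tid);
            c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPT) &&
             unlikely(tid != READ_ONCE(c->tid)));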

    Signed-off-by: Mark Rutland
    Cc: Catalin Marinas
    Acked-by: Christoph Lameter
    Cc: David Rientjes
    Cc: Jesper Dangaard Brouer
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Steve Capper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     

14 Feb, 2015

4 commits

  • With this patch kasan will be able to catch bugs in memory allocated by
    slub. Initially all objects in a newly allocated slab page are marked as
    redzone. Later, when allocation of a slub object happens, the number of
    bytes requested by the caller is marked as accessible, and the rest of the
    object (including slub's metadata) is marked as redzone (inaccessible).

    We also mark the object as accessible if ksize was called for this object.
    There are some places in the kernel where the ksize function is called to
    inquire the size of the really allocated area. Such callers could validly
    access the whole allocated memory, so it should be marked as accessible.

    Code in the slub.c and slab_common.c files could validly access the
    object's metadata, so instrumentation for these files is disabled.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • It's ok for slub to access memory that is marked by kasan as inaccessible
    (object's metadata). Kasan shouldn't print a report in that case because
    these accesses are valid. Disabling instrumentation of the slub.c code is
    not enough to achieve this because slub passes pointers to object's
    metadata into external functions like memchr_inv().

    We don't want to disable instrumentation for memchr_inv() because this is
    quite a generic function, and we don't want to miss bugs.

    metadata_access_enable/metadata_access_disable are used to tell KASan
    where accesses to metadata start/end, so we can temporarily disable KASan
    reports.
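
    A sketch of the two helpers; they just toggle KASan's per-task report
    suppression:

    static inline void metadata_access_enable(void)
    {
            /* slub is about to legitimately touch poisoned metadata */
            kasan_disable_current();
    }

    static inline void metadata_access_disable(void)
    {
            kasan_enable_current();
    }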

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Remove static and add function declarations to linux/slub_def.h so they
    can be used by the kernel address sanitizer.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • printk and friends can now format bitmaps using '%*pb[l]'. cpumask
    and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
    respectively which can be used to generate the two printf arguments
    necessary to format the specified cpu/nodemask.

    * This is an equivalent conversion but the whole function should be
    converted to use the scnprintf family of functions rather than
    performing custom output length predictions in multiple places.
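
    A usage sketch of the new specifiers ('mask' is assumed to be a
    struct cpumask pointer):

    /* list form, e.g. "0-3,8-11" */
    pr_info("cpus: %*pbl\n", cpumask_pr_args(mask));

    /* plain bitmap form prints the mask as hex words */
    pr_info("nodes: %*pb\n", nodemask_pr_args(&node_states[N_ONLINE]));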

    Signed-off-by: Tejun Heo
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

13 Feb, 2015

2 commits

  • To speed up further allocations SLUB may store empty slabs in per cpu/node
    partial lists instead of freeing them immediately. This prevents per
    memcg caches destruction, because kmem caches created for a memory cgroup
    are only destroyed after the last page charged to the cgroup is freed.

    To fix this issue, this patch resurrects approach first proposed in [1].
    It forbids SLUB to cache empty slabs after the memory cgroup that the
    cache belongs to was destroyed. It is achieved by setting kmem_cache's
    cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
    that it would drop frozen empty slabs immediately if cpu_partial = 0.

    The runtime overhead is minimal. From all the hot functions, we only
    touch relatively cold put_cpu_partial(): we make it call
    unfreeze_partials() after freezing a slab that belongs to an offline
    memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
    partial list on free/alloc, and there can't be allocations from dead
    caches, it shouldn't cause any overhead. We do have to disable preemption
    for put_cpu_partial() to achieve that though.

    The original patch was accepted well and even merged to the mm tree.
    However, I decided to withdraw it due to changes happening to the memcg
    core at that time. I had an idea of introducing per-memcg shrinkers for
    kmem caches, but now, as memcg has finally settled down, I do not see it
    as an option, because SLUB shrinker would be too costly to call since SLUB
    does not keep free slabs on a separate list. Besides, we currently do not
    even call per-memcg shrinkers for offline memcgs. Overall, it would
    introduce much more complexity to both SLUB and memcg than this small
    patch.

    Regarding SLAB, there's no problem with it, because it shrinks
    per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
    longer keep entries for offline cgroups in per-memcg arrays (such as
    memcg_cache_params->memcg_caches), so we do not have to bother if a
    per-memcg cache will be shrunk a bit later than it could be.

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It is supposed to return 0 if the cache has no remaining objects and 1
    otherwise, while currently it always returns 0. Fix it.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov