26 Mar, 2016

1 commit

  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

18 Mar, 2016

4 commits

  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • THP defrag is enabled by default: on THP allocation failure it will
    direct reclaim/compact but not wake kswapd. The problem is that THP
    allocation requests potentially enter reclaim/compaction, which can
    incur a severe stall that is not guaranteed to be offset by reduced
    TLB misses. While there has been considerable effort to reduce the
    impact of reclaim/compaction, it is still a high cost and workloads
    that should fit in memory fail to do so. Specifically, a simple
    anon/file streaming workload will enter direct reclaim, at least on
    NUMA, even though the working set size is 80% of RAM. It's been years
    and it's time to throw in the towel.

    First, this patch defines THP defrag as follows:

    madvise: A failed allocation will direct reclaim/compact if the application requests it
    never: Neither reclaim/compact nor wake kswapd
    defer: A failed allocation will wake kswapd/kcompactd
    always: A failed allocation will direct reclaim/compact (historical behaviour)
    khugepaged defrag will enter direct reclaim/compact but not wake kswapd.

    Next it sets the default defrag option to be "madvise" to only enter
    direct reclaim/compaction for applications that specifically requested
    it.

    Lastly, it removes a check from the page allocator slowpath that is
    related to __GFP_THISNODE to allow "defer" to work. The callers that
    really care are slub/slab and they are updated accordingly. The slab
    one may be surprising because it also corrects a comment, as kswapd was
    never woken up by that path.

    This means that by default a THP fault will no longer stall for most
    applications, which is the ideal for most users: they get THP if it is
    immediately available. There are still options for users that prefer a
    stall at startup of a new application, either by restoring historical
    behaviour with "always" or picking a half-way point with "defer" where
    kswapd does some of the work in the background and wakes kcompactd if
    necessary. THP defrag for khugepaged remains enabled and will enter
    direct reclaim/compact but not wake kswapd or kcompactd.

    After this patch a THP allocation failure will quickly fall back and
    rely on khugepaged to recover the situation at some time in the future.
    In some cases this will reduce THP usage, but the benefit of THP is
    hard to measure and not a universal win, whereas a stall for
    reclaim/compaction is definitely measurable and can be painful.

    The first test for this is using "usemem" to read a large file and write
    a large anonymous mapping (to avoid the zero page) multiple times. The
    total size of the mappings is 80% of RAM and the benchmark simply
    measures how long it takes to complete. It uses multiple threads to see
    if that is a factor. On UMA, the performance is almost identical, so
    it is not reported, but on NUMA we see this:

    usemem
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
    Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
    Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
    Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
    Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
    Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
    Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
    Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
    Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
    Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
    Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
    Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
    Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
    Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)

    For a single thread, the benchmark completes 43.23% faster with this
    patch applied, with smaller benefits as the thread count increases.
    Similarly, notice the large reduction in system CPU usage in most
    cases. The overall CPU time is

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 10357.65 10438.33
    System 3988.88 3543.94
    Elapsed 2203.01 1634.41

    The difference is substantial. Now, the reclaim figures:

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 128458477 278352931
    Major Faults 2174976 225
    Swap Ins 16904701 0
    Swap Outs 17359627 0
    Allocation stalls 43611 0
    DMA allocs 0 0
    DMA32 allocs 19832646 19448017
    Normal allocs 614488453 580941839
    Movable allocs 0 0
    Direct pages scanned 24163800 0
    Kswapd pages scanned 0 0
    Kswapd pages reclaimed 0 0
    Direct pages reclaimed 20691346 0
    Compaction stalls 42263 0
    Compaction success 938 0
    Compaction failures 41325 0

    This patch eliminates almost all swapping and direct reclaim activity.
    There is still overhead but it's from NUMA balancing which does not
    identify that it's pointless trying to do anything with this workload.

    I also tried the thpscale benchmark which forces a corner case where
    compaction can be used heavily, and measures the fault latency
    depending on whether base or huge pages were used

    thpscale Fault Latencies
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
    Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
    Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
    Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
    Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
    Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
    Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
    Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
    Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
    Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
    Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
    Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
    Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
    Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
    Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
    Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
    Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
    Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)

    The average time to fault pages is substantially reduced in the
    majority of cases, but with the obvious caveat that fewer THPs are
    actually used in this adverse workload

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
    Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
    Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
    Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
    Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
    Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
    Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
    Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
    Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 37429143 47564000
    Major Faults 1916 1558
    Swap Ins 1466 1079
    Swap Outs 2936863 149626
    Allocation stalls 62510 3
    DMA allocs 0 0
    DMA32 allocs 6566458 6401314
    Normal allocs 216361697 216538171
    Movable allocs 0 0
    Direct pages scanned 25977580 17998
    Kswapd pages scanned 0 3638931
    Kswapd pages reclaimed 0 207236
    Direct pages reclaimed 8833714 88
    Compaction stalls 103349 5
    Compaction success 270 4
    Compaction failures 103079 1

    Note again that while this does swap as it's an aggressive workload,
    the direct reclaim activity and allocation stalls are substantially reduced.
    There is some kswapd activity but ftrace showed that the kswapd activity
    was due to normal wakeups from 4K pages being allocated.
    Compaction-related stalls and activity are almost eliminated.

    I also tried the stutter benchmark. For this, I do not have figures for
    NUMA but it's something that does impact UMA so I'll report what is
    available

    stutter
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
    1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
    2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
    3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
    Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
    Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
    Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
    Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
    Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
    Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)

    This benchmark is trying to fault an anonymous mapping while there is a
    heavy IO load -- a scenario that desktop users used to complain about
    frequently. This shows a mix because the ideal case of mapping with THP
    is not hit as often. However, note that 99% of the mappings complete
    13.79% faster. The CPU usage here is particularly interesting

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 67.50 0.99
    System 1327.88 91.30
    Elapsed 2079.00 2128.98

    And once again we look at the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 335241922 1314582827
    Major Faults 715 819
    Swap Ins 0 0
    Swap Outs 0 0
    Allocation stalls 532723 0
    DMA allocs 0 0
    DMA32 allocs 1822364341 1177950222
    Normal allocs 1815640808 1517844854
    Movable allocs 0 0
    Direct pages scanned 21892772 0
    Kswapd pages scanned 20015890 41879484
    Kswapd pages reclaimed 19961986 41822072
    Direct pages reclaimed 21892741 0
    Compaction stalls 1065755 0
    Compaction success 514 0
    Compaction failures 1065241 0

    Allocation stalls and all direct reclaim activity are eliminated, as
    are compaction-related stalls.

    THP gives impressive gains in some cases but only if they are quickly
    available. We're not going to reach the point where they are completely
    free, so let's finally take the costs out of the fast paths and defer
    the cost to kswapd, kcompactd and khugepaged, where it belongs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • We can disable debug_pagealloc processing even if the code is compiled
    with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
    whether it is enabled or not at runtime.
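
    A minimal sketch of the pattern (illustrative; do_debug_work() is a
    placeholder for the actual processing):

      /* before: decided at build time */
      #ifdef CONFIG_DEBUG_PAGEALLOC
              do_debug_work();
      #endif

      /* after: compiled in, but only executed when debug_pagealloc was
       * actually enabled (e.g. via the debug_pagealloc=on boot option) */
      if (debug_pagealloc_enabled())
              do_debug_work();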

    [akpm@linux-foundation.org: clean up code, per Christian]
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Christian Borntraeger
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Show how much memory is used for storing reclaimable and unreclaimable
    in-kernel data structures allocated from slab caches.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Mar, 2016

9 commits

  • We can now print gfp_flags in a more human-readable way. Make use of
    this in slab_out_of_memory() for SLUB and SLAB. Also convert the SLAB
    variant to pr_warn() along the way.

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • SLUB already has a redzone debugging feature. But it is only positioned
    at the end of the object (aka the right redzone), so it cannot catch
    left OOB. Although the current object's right redzone acts as the left
    redzone of the next object, the first object in a slab cannot take
    advantage of this effect. This patch explicitly adds a left red zone to
    each object to detect left OOB more precisely.
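
    A rough sketch of the resulting per-object layout with SLAB_RED_ZONE
    enabled (illustrative; the real code tracks the extra space in a field
    such as red_left_pad):

      /*
       *  ... | left redzone | object payload | right redzone | metadata | ...
       *      ^              ^
       *      red_left_pad   object address returned to the caller
       *
       * A write just before the returned address now lands in the left
       * redzone and is reported, instead of silently corrupting the
       * previous object (or, for the first object in a slab, memory that
       * is outside the allocator's control).
       */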

    Background:

    Someone complained to me that left OOB is not caught even if KASAN is
    enabled, which does page allocation debugging. That page is out of our
    control, so it would be allocated when left OOB happens and, in this
    case, we can't find the OOB. Moreover, the SLUB debugging feature can
    be enabled without page allocator debugging and, in that case, we will
    miss that OOB.

    Before trying to implement this, I expected that the changes would be
    too complex, but it doesn't look that complex to me now. Almost all
    changes are applied to debug-specific functions, so I feel okay.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When debug options are enabled, cmpxchg on the page is disabled. This
    is because the page must be locked to ensure there are no false
    positives when performing consistency checks. Some debug options such
    as poisoning and red zoning only act on the object itself, so there is
    no need to protect other CPUs from modification of only the object.
    Allow cmpxchg to happen when poisoning and red zoning are set on a
    slab.

    Credit to Mathias Krause for the original work which inspired this
    series

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
    on or off. Expand its use to be able to turn off all consistency
    checks. This gives a nice speed up if you only want features such as
    poisoning or tracing.

    Credit to Mathias Krause for the original work which inspired this
    series

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • Since commit 19c7ff9ecd89 ("slub: Take node lock during object free
    checks") check_object has been incorrectly returning success as it
    follows the out label which just returns the node.

    Thanks to refactoring, the out and fail paths are now basically the
    same. Combine the two into one and just use a single label.

    Credit to Mathias Krause for the original work which inspired this
    series

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • This series takes the suggestion of Christoph Lameter and only focuses
    on optimizing the slow path where the debug processing runs. The two
    main optimizations in this series are letting the consistency checks be
    skipped and relaxing the cmpxchg restrictions when we are not doing
    consistency checks. With hackbench -g 20 -l 1000 averaged over 100
    runs:

    Before slub_debug=P
    mean 15.607
    variance .086
    stdev .294

    After slub_debug=P
    mean 10.836
    variance .155
    stdev .394

    This still isn't as fast as what is in grsecurity unfortunately so there's
    still work to be done. Profiling ___slab_alloc shows that 25-50% of time
    is spent in deactivate_slab. I haven't looked too closely to see if this
    is something that can be optimized. My plan for now is to focus on
    getting all of this merged (if appropriate) before digging in to another
    task.

    This patch (of 4):

    Currently, free_debug_processing has a comment "Keep node_lock to preserve
    integrity until the object is actually freed". In actuality, the lock is
    dropped immediately in __slab_free. Rather than wait until __slab_free
    and potentially throw off the unlikely marking, just drop the lock in
    __slab_free. This also lets free_debug_processing take its own copy of
    the spinlock flags rather than trying to share the ones from __slab_free.
    Since there is no use for the node afterwards, change the return type of
    free_debug_processing to return an int like alloc_debug_processing.

    Credit to Mathias Krause for the original work which inspired this series

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • This patch introduces a new API call, kfree_bulk(), for bulk freeing of
    memory objects not bound to a single kmem_cache.

    Christoph pointed out that it is possible to implement freeing of
    objects without knowing the kmem_cache pointer, as that information is
    available from the object's page->slab_cache, proposing to remove the
    kmem_cache argument from the bulk free API.

    Jesper demonstrated that these extra steps per object come at a
    performance cost. It is only in the case where CONFIG_MEMCG_KMEM is
    compiled in and activated at runtime that these steps are done anyhow.
    The extra cost is most visible for the SLAB allocator, because the SLUB
    allocator does the page lookup (virt_to_head_page()) anyhow.

    Thus, the conclusion was to keep the kmem_cache free bulk API with a
    kmem_cache pointer, but we can still implement a kfree_bulk() API
    fairly easily, simply by handling the case where kmem_cache_free_bulk()
    gets called with a NULL kmem_cache pointer.
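
    In other words, the new call can be little more than a thin wrapper; a
    sketch along these lines:

      /* sketch: bulk-free objects from mixed caches; the cache is looked
       * up internally from each object's page when 's' is NULL */
      static __always_inline void kfree_bulk(size_t size, void **p)
      {
              kmem_cache_free_bulk(NULL, size, p);
      }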

    This does increase the code size a bit, but implementing a separate
    kfree_bulk() call would likely increase code size even more.

    The benchmarks below measure the cost of alloc+free (obj size 256
    bytes) on a CPU i7-4790K @ 4.00GHz, with no PREEMPT and
    CONFIG_MEMCG_KMEM=y.

    Code size increase for SLAB:

    add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
    function old new delta
    kmem_cache_free_bulk 660 734 +74

    SLAB fastpath: 87 cycles(tsc) 21.814
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 103 cycles 25.878 ns - 41 cycles 10.498 ns - 81 cycles 20.312 ns
    2 - 94 cycles 23.673 ns - 26 cycles 6.682 ns - 42 cycles 10.649 ns
    3 - 92 cycles 23.181 ns - 21 cycles 5.325 ns - 39 cycles 9.950 ns
    4 - 90 cycles 22.727 ns - 18 cycles 4.673 ns - 26 cycles 6.693 ns
    8 - 89 cycles 22.270 ns - 14 cycles 3.664 ns - 23 cycles 5.835 ns
    16 - 88 cycles 22.038 ns - 14 cycles 3.503 ns - 22 cycles 5.543 ns
    30 - 89 cycles 22.284 ns - 13 cycles 3.310 ns - 20 cycles 5.197 ns
    32 - 88 cycles 22.249 ns - 13 cycles 3.420 ns - 20 cycles 5.166 ns
    34 - 88 cycles 22.224 ns - 14 cycles 3.643 ns - 20 cycles 5.170 ns
    48 - 88 cycles 22.088 ns - 14 cycles 3.507 ns - 20 cycles 5.203 ns
    64 - 88 cycles 22.063 ns - 13 cycles 3.428 ns - 20 cycles 5.152 ns
    128 - 89 cycles 22.483 ns - 15 cycles 3.891 ns - 23 cycles 5.885 ns
    158 - 89 cycles 22.381 ns - 15 cycles 3.779 ns - 22 cycles 5.548 ns
    250 - 91 cycles 22.798 ns - 16 cycles 4.152 ns - 23 cycles 5.967 ns

    SLAB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
    1 - 148 cycles 37.220 ns - 66 cycles 16.622 ns - 66 cycles 16.583 ns
    2 - 141 cycles 35.510 ns - 51 cycles 12.820 ns - 58 cycles 14.625 ns
    3 - 140 cycles 35.017 ns - 37 cycles 9.326 ns - 33 cycles 8.474 ns
    4 - 137 cycles 34.507 ns - 31 cycles 7.888 ns - 33 cycles 8.300 ns
    8 - 140 cycles 35.069 ns - 25 cycles 6.461 ns - 25 cycles 6.436 ns
    16 - 138 cycles 34.542 ns - 23 cycles 5.945 ns - 22 cycles 5.670 ns
    30 - 136 cycles 34.227 ns - 22 cycles 5.502 ns - 22 cycles 5.587 ns
    32 - 136 cycles 34.253 ns - 21 cycles 5.475 ns - 21 cycles 5.324 ns
    34 - 136 cycles 34.254 ns - 21 cycles 5.448 ns - 20 cycles 5.194 ns
    48 - 136 cycles 34.075 ns - 21 cycles 5.458 ns - 21 cycles 5.367 ns
    64 - 135 cycles 33.994 ns - 21 cycles 5.350 ns - 21 cycles 5.259 ns
    128 - 137 cycles 34.446 ns - 23 cycles 5.816 ns - 22 cycles 5.688 ns
    158 - 137 cycles 34.379 ns - 22 cycles 5.727 ns - 22 cycles 5.602 ns
    250 - 138 cycles 34.755 ns - 24 cycles 6.093 ns - 23 cycles 5.986 ns

    Code size increase for SLUB:
    function old new delta
    kmem_cache_free_bulk 717 799 +82

    SLUB benchmark:
    SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
    sz - fallback - kmem_cache_free_bulk - kfree_bulk
    1 - 61 cycles 15.486 ns - 53 cycles 13.364 ns - 57 cycles 14.464 ns
    2 - 54 cycles 13.703 ns - 32 cycles 8.110 ns - 33 cycles 8.482 ns
    3 - 53 cycles 13.272 ns - 25 cycles 6.362 ns - 27 cycles 6.947 ns
    4 - 51 cycles 12.994 ns - 24 cycles 6.087 ns - 24 cycles 6.078 ns
    8 - 50 cycles 12.576 ns - 21 cycles 5.354 ns - 22 cycles 5.513 ns
    16 - 49 cycles 12.368 ns - 20 cycles 5.054 ns - 20 cycles 5.042 ns
    30 - 49 cycles 12.273 ns - 18 cycles 4.748 ns - 19 cycles 4.758 ns
    32 - 49 cycles 12.401 ns - 19 cycles 4.821 ns - 19 cycles 4.810 ns
    34 - 98 cycles 24.519 ns - 24 cycles 6.154 ns - 24 cycles 6.157 ns
    48 - 83 cycles 20.833 ns - 21 cycles 5.446 ns - 21 cycles 5.429 ns
    64 - 75 cycles 18.891 ns - 20 cycles 5.247 ns - 20 cycles 5.238 ns
    128 - 93 cycles 23.271 ns - 27 cycles 6.856 ns - 27 cycles 6.823 ns
    158 - 102 cycles 25.581 ns - 30 cycles 7.714 ns - 30 cycles 7.695 ns
    250 - 107 cycles 26.917 ns - 38 cycles 9.514 ns - 38 cycles 9.506 ns

    SLUB when enabling MEMCG_KMEM runtime:
    - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
    1 - 85 cycles 21.484 ns - 78 cycles 19.569 ns - 75 cycles 18.938 ns
    2 - 81 cycles 20.363 ns - 45 cycles 11.258 ns - 44 cycles 11.076 ns
    3 - 78 cycles 19.709 ns - 33 cycles 8.354 ns - 32 cycles 8.044 ns
    4 - 77 cycles 19.430 ns - 28 cycles 7.216 ns - 28 cycles 7.003 ns
    8 - 101 cycles 25.288 ns - 23 cycles 5.849 ns - 23 cycles 5.787 ns
    16 - 76 cycles 19.148 ns - 20 cycles 5.162 ns - 20 cycles 5.081 ns
    30 - 76 cycles 19.067 ns - 19 cycles 4.868 ns - 19 cycles 4.821 ns
    32 - 76 cycles 19.052 ns - 19 cycles 4.857 ns - 19 cycles 4.815 ns
    34 - 121 cycles 30.291 ns - 25 cycles 6.333 ns - 25 cycles 6.268 ns
    48 - 108 cycles 27.111 ns - 21 cycles 5.498 ns - 21 cycles 5.458 ns
    64 - 100 cycles 25.164 ns - 20 cycles 5.242 ns - 20 cycles 5.229 ns
    128 - 155 cycles 38.976 ns - 27 cycles 6.886 ns - 27 cycles 6.892 ns
    158 - 132 cycles 33.034 ns - 30 cycles 7.711 ns - 30 cycles 7.728 ns
    250 - 130 cycles 32.612 ns - 38 cycles 9.560 ns - 38 cycles 9.549 ns

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • First step towards sharing alloc_hooks between the SLUB and SLAB
    allocators. Move the SLUB allocator's *_alloc_hook to the common
    mm/slab.h for internal slab definitions.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • This change is primarily an attempt to make it easier to realize the
    optimizations the compiler performs in case CONFIG_MEMCG_KMEM is not
    enabled.

    Performance-wise, even when CONFIG_MEMCG_KMEM is compiled in, the
    overhead is zero. This is because, as long as no process has enabled
    kmem cgroup accounting, the assignment is replaced by asm NOP
    operations. This is possible because memcg_kmem_enabled() uses a
    static_key_false() construct.

    It also helps readability as it avoids accessing the p[] array like
    p[size - 1], which "exposes" that the array is processed backwards
    inside the helper function build_detached_freelist().

    Lastly, this also makes the code more robust in error cases like
    passing NULL pointers in the array, which were previously handled
    before commit 033745189b1b ("slub: add missing kmem cgroup support to
    kmem_cache_free_bulk").

    Fixes: 033745189b1b ("slub: add missing kmem cgroup support to kmem_cache_free_bulk")
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

19 Feb, 2016

1 commit

  • When slub_debug alloc_calls_show is enabled, we will try to track the
    location and user of slab objects on each online node; the
    kmem_cache_node structure and cpu_cache/cpu_slub shouldn't be freed
    till the last reference to the sysfs file is dropped.

    This fixes the following panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: list_locations+0x169/0x4e0
    PGD 257304067 PUD 438456067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
    Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
    task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
    RIP: list_locations+0x169/0x4e0
    Call Trace:
    alloc_calls_show+0x1d/0x30
    slab_attr_show+0x1b/0x30
    sysfs_read_file+0x9a/0x1a0
    vfs_read+0x9c/0x170
    SyS_read+0x58/0xb0
    system_call_fastpath+0x16/0x1b
    Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
    CR2: 0000000000000020

    Separated __kmem_cache_release from __kmem_cache_shutdown; the former
    is now called from slab_kmem_cache_release (after the last reference to
    the sysfs file object has been dropped).

    Reintroduced locking in free_partial as the sysfs file might access the
    cache's partial list after shutdown - a partial revert of commit
    69cb8e6b7c29 ("slub: free slabs without holding locks"). Zapped
    __remove_partial and used remove_partial (w/o underscores) as
    free_partial now takes list_lock, which is a partial revert of commit
    1e4dd9461fab ("slub: do not assert not having lock in removing freed
    partial").

    Signed-off-by: Dmitry Safonov
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     

16 Jan, 2016

1 commit

  • lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.
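
    The idea, sketched (illustrative, not the actual implementation):

      static void example_lock(struct page *page)
      {
              page = compound_head(page);     /* tail -> head, no-op otherwise */
              lock_page(page);                /* PG_locked lives on the head page */
      }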

    This patch also gets rid of the custom helper functions
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing a tail page
    to these helpers would trigger VM_BUG_ON().

    SLUB uses PG_locked as a bit spin lock. IIUC, tail pages should never
    appear there. A VM_BUG_ON() is added to make sure that this assumption
    is correct.

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • Currently, if we want to account all objects of a particular kmem
    cache, we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call,
    which is inconvenient. This patch introduces the SLAB_ACCOUNT flag
    which, if passed to kmem_cache_create, will force accounting for every
    allocation from this cache even if __GFP_ACCOUNT is not passed.
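
    A minimal usage sketch (the cache and struct names are made up for
    illustration):

      struct foo { int bar; };
      static struct kmem_cache *foo_cache;

      static int __init foo_init(void)
      {
              foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
                                            SLAB_ACCOUNT, NULL);
              if (!foo_cache)
                      return -ENOMEM;
              /* every allocation from foo_cache is charged to the
               * allocating task's memcg, no __GFP_ACCOUNT needed */
              return 0;
      }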

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Nov, 2015

5 commits

  • Adjust kmem_cache_alloc_bulk API before we have any real users.

    Adjust the API to return type 'int' instead of the previous type
    'bool'. This is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary, when
    specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to have
    bulk alloc run without local IRQs disabled, with an approach of cmpxchg
    "stealing" the entire c->freelist or page->freelist. To avoid
    overshooting we would stop processing at a slab-page boundary;
    otherwise we always end up returning some objects at the cost of
    another cmpxchg.

    To stay compatible with future users of this API linking against an
    older kernel when using the new flag, we need to return the number of
    allocated objects with this API change.
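
    A hypothetical caller written against the new return convention might
    look like this (sketch; fill_pool is an invented example):

      static int fill_pool(struct kmem_cache *cachep)
      {
              void *objs[16];
              int allocated;

              /* the return value is the number of objects placed in objs[];
               * currently all-or-nothing, but using the count keeps callers
               * correct if a future flag allows partial allocations */
              allocated = kmem_cache_alloc_bulk(cachep, GFP_KERNEL,
                                                ARRAY_SIZE(objs), objs);
              if (!allocated)
                      return -ENOMEM;

              /* ... use the objects, then release them ... */
              kmem_cache_free_bulk(cachep, allocated, objs);
              return 0;
      }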

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • The initial implementation missed kmem cgroup support in the
    kmem_cache_free_bulk() call; add this.

    If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough
    to not add any asm code.

    Incoming bulk free objects can belong to different kmem cgroups, and
    the object free call can happen at a later point outside the memcg
    context. Thus, we need to keep the original kmem_cache to correctly
    verify whether a memcg object matches against its "root_cache"
    (s->memcg_params.root_cache).

    Signed-off-by: Jesper Dangaard Brouer
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • The call slab_pre_alloc_hook() interacts with kmemcg and is not allowed
    to be called several times inside the bulk alloc for-loop, due to the
    call to memcg_kmem_get_cache().

    This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache.

    As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able
    to handle an array of objects.

    A subtle detail is that the loop iterator "i" in slab_post_alloc_hook()
    must have the same type (size_t) as the size argument. This helps the
    compiler more easily realize that it can remove the loop when all debug
    statements inside the loop evaluate to nothing. Note, this is only an
    issue because the kernel is compiled with the GCC option
    -fno-strict-overflow.

    In slab_alloc_node() the compiler inlines and optimizes the invocation
    of slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and
    accessing the object directly.
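
    The reworked hook then has roughly this shape (simplified sketch; the
    actual per-object debug calls are elided):

      static __always_inline void slab_post_alloc_hook(struct kmem_cache *s,
                                                       gfp_t flags, size_t size,
                                                       void **p)
      {
              size_t i;       /* same type as 'size', so the compiler can
                               * prove the loop away when the body is empty */

              for (i = 0; i < size; i++) {
                      /* per-object debug hooks (kmemcheck, kmemleak,
                       * kasan) operate on p[i] here */
              }
      }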

    Signed-off-by: Jesper Dangaard Brouer
    Reported-by: Vladimir Davydov
    Suggested-by: Vladimir Davydov
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • This change focuses on improving the speed of object freeing in the
    "slowpath" of kmem_cache_free_bulk.

    The calls slab_free (fastpath) and __slab_free (slowpath) have been
    extended with support for bulk free, which amortizes the overhead of
    the (locked) cmpxchg_double.

    To use the new bulking feature, we build what I call a detached
    freelist. The detached freelist takes advantage of three properties:

    1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

    2) many freelists can co-exist side by side in the same slab-page,
    each with a separate head pointer.

    3) it is the visibility of the head pointer that needs synchronization.

    Given these properties, the brilliant part is that the detached
    freelist can be constructed without any need for synchronization: it is
    built directly in the page's objects. The detached freelist is
    allocated on the stack of the function call kmem_cache_free_bulk, so
    the freelist head pointer is not visible to other CPUs.

    All objects in a SLUB freelist must belong to the same slab-page.
    Thus, constructing the detached freelist is about matching objects that
    belong to the same slab-page. The bulk free array is scanned in a
    progressive manner with a limited look-ahead facility.
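
    The on-stack descriptor looks roughly like this (sketch, simplified):

      struct detached_freelist {
              struct page *page;      /* slab-page all linked objects share */
              void *tail;             /* first object added; end of the list */
              void *freelist;         /* head of the detached freelist */
              int cnt;                /* number of objects linked so far */
      };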

    Kmem debug support is handled in the call to slab_free().

    Notice kmem_cache_free_bulk no longer needs to disable IRQs; this only
    slowed down single-object bulk free by approx 3 cycles.

    Performance data:
    Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz

    SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns

    To get stable and comparable numbers, the kernel has been booted with
    "slab_nomerge" (this also improves performance for larger bulk sizes).

    Performance data, compared against fallback bulking:

    bulk - fallback bulk - improvement with this patch
    1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
    2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
    3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
    4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
    8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
    16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
    30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
    32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
    34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
    48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
    64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
    128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
    158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
    250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%

    Performance data, compared current in-kernel bulking:

    bulk - curr in-kernel - improvement with this patch
    1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
    2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
    3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
    4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
    8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
    16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
    30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
    48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
    64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
    128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
    158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
    250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%

    Performance with normal SLUB merging is significantly slower for
    larger bulking. This is believed to (primarily) be an effect of not
    having to share the per-CPU data-structures, as tuning per-CPU size
    can achieve similar performance.

    bulk - slab_nomerge - normal SLUB merge
    1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
    2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
    3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
    4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
    8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
    16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
    30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
    32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
    34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
    48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
    64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
    128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
    158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
    250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19

    Joint work with Alexander Duyck.

    [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

    [akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Make it possible to free a freelist with several objects by adjusting
    the API of slab_free() and __slab_free() to take a head, a tail and an
    object counter (cnt).

    Tail being NULL indicates a single-object free of the head object.
    This allows compiler inlining and constant propagation in slab_free()
    and slab_free_freelist_hook() to avoid adding any overhead in the case
    of a single-object free.

    This allows a freelist with several objects (all within the same
    slab-page) to be freed using a single locked cmpxchg_double in
    __slab_free() and an unlocked cmpxchg_double in slab_free().
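
    Sketched, the adjusted internal entry point looks like:

      /*
       * head/tail delimit a freelist of cnt objects, all belonging to the
       * same slab-page. tail == NULL (and cnt == 1) means an ordinary
       * single-object free, and constant propagation removes the extra
       * freelist handling for that case.
       */
      static __always_inline void slab_free(struct kmem_cache *s,
                                            struct page *page, void *head,
                                            void *tail, int cnt,
                                            unsigned long addr);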

    Object debugging on the free path is also extended to handle these
    freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if
    objects don't belong to the same slab-page.

    These changes are needed for the next patch to bulk free the detached
    freelists it introduces and constructs.

    Micro benchmarking showed no performance reduction due to this change,
    when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

07 Nov, 2015

2 commits

  • We have properly typed page->rcu_head, no need to cast page->lru.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that
    hold spinlocks or are in interrupts. They are expected to be high
    priority and have access to one of two watermarks lower than "min",
    which can be referred to as the "atomic reserve". __GFP_HIGH users get
    access to the first lower watermark and can be called the "high
    priority reserve".

    Over time, callers had a requirement to not block when fallback
    options were available. Some have abused __GFP_WAIT, leading to a
    situation where an optimistic allocation with a fallback option can
    access atomic reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly
    atomic, cannot sleep and have no alternative. High priority users
    continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers
    that can sleep and are willing to enter direct reclaim.
    __GFP_KSWAPD_RECLAIM identifies callers that want to wake kswapd for
    background reclaim. __GFP_WAIT is redefined as a caller that is
    willing to enter direct reclaim and wake kswapd for background
    reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible (see the sketch after
    this list). This is because checking for __GFP_WAIT as was done
    historically can now trigger false positives. Some exceptions like
    dm-crypt.c exist where the code intent is clearer if
    __GFP_DIRECT_RECLAIM is used instead of the helper due to flag
    manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.
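
    The helper and the flag relationship, roughly (simplified sketch; the
    real definitions carry __force casts):

      /* GFP_KERNEL and friends allow both forms of reclaim */
      #define __GFP_RECLAIM   (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

      /* a caller may block iff it is willing to enter direct reclaim */
      static inline bool gfpflags_allow_blocking(gfp_t gfp_flags)
      {
              return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
      }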

    The first key hazard to watch out for is callers that removed
    __GFP_WAIT and were depending on access to atomic reserves for
    inconspicuous reasons. In some cases it may be appropriate for them to
    use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

06 Nov, 2015

5 commits

  • It's recommended to have slub's user tracking enabled with CONFIG_KASAN,
    because:

    a) User tracking disables slab merging, which improves
    detecting out-of-bounds accesses.
    b) User tracking metadata acts as a redzone, which also improves
    detecting out-of-bounds accesses.
    c) User tracking provides additional information about the object.
    This information helps to understand bugs.

    Currently it is not enabled by default. Besides recompiling the kernel
    with KASAN and reinstalling it, the user also has to change the boot
    cmdline, which is not very handy.

    Enable slub user tracking by default with KASAN=y, since there is no
    good reason not to do this.

    [akpm@linux-foundation.org: little fixes, per David]
    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
    uncharging kmem pages to memcg, but currently they are not used for
    charging slab pages (i.e. they are only used for charging pages allocated
    with alloc_kmem_pages). The only reason why the slab subsystem uses
    special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
    needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
    to the memcg that the current task belongs to.

    To remove this diversity, this patch adds an extra argument to
    __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
    not NULL, the function tries to charge to the memcg it points to,
    otherwise it charges to the current context. Next, it makes the slab
    subsystem use this function to charge slab pages.

    Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
    in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
    __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
    don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
    Besides, one can now detect which memcg a slab page belongs to by reading
    /proc/kpagecgroup.

    Note, this patch switches slab to charge-after-alloc design. Since this
    design is already used for all other memcg charges, it should not make any
    difference.

    [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • In slub_order(), the order starts from max(min_order,
    get_order(min_objects * size)). When (min_objects * size) has different
    order from (min_objects * size + reserved), it will skip this order via a
    check in the loop.

    This patch optimizes this a little by calculating the start order with
    `reserved' in consideration and removing the check in loop.

    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • get_order() is easier to understand.

    This patch just switches the existing calculation to use it.

    Signed-off-by: Wei Yang
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • In calculate_order(), it tries to calculate the best order by adjusting
    the fraction and min_objects. On each iteration of min_objects,
    fraction iterates over 16, 8, 4, which means the acceptable waste
    increases to 1/16, 1/8, 1/4.

    This patch corrects the comment according to the code.

    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

09 Sep, 2015

1 commit

  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.
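
    A sketch of the intended division of labour (example_alloc is a
    hypothetical caller, not part of the patch):

      static struct page *example_alloc(int nid, gfp_t gfp, unsigned int order)
      {
              if (nid == NUMA_NO_NODE)
                      /* picks a node close to the caller */
                      return alloc_pages_node(nid, gfp, order);

              /* caller guarantees nid is a valid, online node; the node is
               * still only preferred unless __GFP_THISNODE is in gfp */
              return __alloc_pages_node(nid, gfp, order);
      }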

    It has been also considered to really provide a convenience function for
    allocations restricted to a node, but the major opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would be previously
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

05 Sep, 2015

6 commits

  • Description is almost copied from commit fb05e7a89f50 ("net: don't wait
    for order-3 page allocation").

    I saw excessive direct memory reclaim/compaction triggered by slub.
    This causes performance issues and adds latency. Slub uses high-order
    allocation to reduce internal fragmentation and management overhead.
    But direct memory reclaim/compaction has a high overhead and the
    benefit of high-order allocation can't compensate for the overhead of
    both.

    This patch makes the auxiliary high-order allocation atomic. If there
    is no memory pressure and memory isn't fragmented, the allocation will
    still succeed, so we don't sacrifice the benefit of high-order
    allocation here. If the atomic allocation fails, direct memory
    reclaim/compaction will not be triggered, and the allocation falls back
    to low-order immediately, hence the direct memory reclaim/compaction
    overhead is avoided. In the allocation failure case, kswapd is woken up
    and tries to make high-order free pages, so the allocation could
    succeed next time.
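
    The core of the change in allocate_slab() looks roughly like this
    (sketch, not verbatim; at the time the flag cleared was __GFP_WAIT,
    later renamed to __GFP_DIRECT_RECLAIM):

      alloc_gfp = flags | __GFP_NOWARN | __GFP_NORETRY;
      if (oo_order(oo) > oo_order(s->min))
              /* the speculative high-order attempt must not reclaim/compact */
              alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;

      page = alloc_slab_page(s, alloc_gfp, node, oo);
      if (unlikely(!page)) {
              /* fall back to the minimum order with the original flags */
              oo = s->min;
              page = alloc_slab_page(s, flags, node, oo);
      }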

    Following is the test to measure effect of this patch.

    System: QEMU, CPU 8, 512 MB
    Mem: 25% memory is allocated at random position to make fragmentation.
    Memory-hogger occupies 150 MB memory.
    Workload: hackbench -g 20 -l 1000

    Average result of 10 runs (Base vs Patched)

    elapsed_time(s): 4.3468 vs 2.9838
    compact_stall: 461.7 vs 73.6
    pgmigrate_success: 28315.9 vs 7256.1

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Shaohua Li
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • sysfs_slab_add() shouldn't call kobject_put at the error path: this
    puts the last reference of the kmem-cache kobject and frees it. The
    kmem cache would then be freed a second time at the error path in
    kmem_cache_create().

    For example this happens when slub debug is enabled at runtime and
    somebody creates a new kmem cache:

    # echo 1 | tee /sys/kernel/slab/*/sanity_checks
    # modprobe configfs

    "configfs_dir_cache" cannot be merged because existing slab have debug and
    cannot create new slab because unique name ":t-0000096" already taken.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Initializing a new slab can introduce rather large latencies because
    most of the initialization always runs with interrupts disabled.

    There is no point in doing so. The newly allocated slab is not visible
    yet, so there is no reason to protect it against concurrent alloc/free.

    Move the expensive parts of the initialization into allocate_slab(), so
    for all allocations with GFP_WAIT set, interrupts are enabled.

    Signed-off-by: Thomas Gleixner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sebastian Andrzej Siewior
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Per request of Joonsoo Kim, add kmem debug support.

    I've tested that when debugging is disabled, there is almost no
    performance impact as this code basically gets removed by the
    compiler.

    Need some guidance in enabling and testing this.

    bulk- PREVIOUS - THIS-PATCH
    1 - 43 cycles(tsc) 10.811 ns - 44 cycles(tsc) 11.236 ns improved -2.3%
    2 - 27 cycles(tsc) 6.867 ns - 28 cycles(tsc) 7.019 ns improved -3.7%
    3 - 21 cycles(tsc) 5.496 ns - 22 cycles(tsc) 5.526 ns improved -4.8%
    4 - 24 cycles(tsc) 6.038 ns - 19 cycles(tsc) 4.786 ns improved 20.8%
    8 - 17 cycles(tsc) 4.280 ns - 18 cycles(tsc) 4.572 ns improved -5.9%
    16 - 17 cycles(tsc) 4.483 ns - 18 cycles(tsc) 4.658 ns improved -5.9%
    30 - 18 cycles(tsc) 4.531 ns - 18 cycles(tsc) 4.568 ns improved 0.0%
    32 - 58 cycles(tsc) 14.586 ns - 65 cycles(tsc) 16.454 ns improved -12.1%
    34 - 53 cycles(tsc) 13.391 ns - 63 cycles(tsc) 15.932 ns improved -18.9%
    48 - 65 cycles(tsc) 16.268 ns - 50 cycles(tsc) 12.506 ns improved 23.1%
    64 - 53 cycles(tsc) 13.440 ns - 63 cycles(tsc) 15.929 ns improved -18.9%
    128 - 79 cycles(tsc) 19.899 ns - 86 cycles(tsc) 21.583 ns improved -8.9%
    158 - 90 cycles(tsc) 22.732 ns - 90 cycles(tsc) 22.552 ns improved 0.0%
    250 - 95 cycles(tsc) 23.916 ns - 98 cycles(tsc) 24.589 ns improved -3.2%

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • This implements the SLUB-specific kmem_cache_free_bulk(). The SLUB
    allocator now has both bulk alloc and free implemented.

    Choose to re-enable local IRQs while calling the slowpath
    __slab_free(). In the worst case, where all objects hit the slowpath
    call, the performance should still be faster than the fallback function
    __kmem_cache_free_bulk(), because local_irq_{disable+enable} is very
    fast (7 cycles), while the fallback invokes this_cpu_cmpxchg() which is
    slightly slower (9 cycles). Nitpicking, this should be faster for
    N>=4, due to the entry cost of local_irq_{disable+enable}.

    Do notice that the save+restore variant is very expensive; this is key
    to why this optimization works.

    CPU: i7-4790K CPU @ 4.00GHz
    * local_irq_{disable,enable}: 7 cycles(tsc) - 1.821 ns
    * local_irq_{save,restore} : 37 cycles(tsc) - 9.443 ns

    Measurements on CPU CPU i7-4790K @ 4.00GHz
    Baseline normal fastpath (alloc+free cost): 43 cycles(tsc) 10.834 ns

    Bulk- fallback - this-patch
    1 - 58 cycles(tsc) 14.542 ns - 43 cycles(tsc) 10.811 ns improved 25.9%
    2 - 50 cycles(tsc) 12.659 ns - 27 cycles(tsc) 6.867 ns improved 46.0%
    3 - 48 cycles(tsc) 12.168 ns - 21 cycles(tsc) 5.496 ns improved 56.2%
    4 - 47 cycles(tsc) 11.987 ns - 24 cycles(tsc) 6.038 ns improved 48.9%
    8 - 46 cycles(tsc) 11.518 ns - 17 cycles(tsc) 4.280 ns improved 63.0%
    16 - 45 cycles(tsc) 11.366 ns - 17 cycles(tsc) 4.483 ns improved 62.2%
    30 - 45 cycles(tsc) 11.433 ns - 18 cycles(tsc) 4.531 ns improved 60.0%
    32 - 75 cycles(tsc) 18.983 ns - 58 cycles(tsc) 14.586 ns improved 22.7%
    34 - 71 cycles(tsc) 17.940 ns - 53 cycles(tsc) 13.391 ns improved 25.4%
    48 - 80 cycles(tsc) 20.077 ns - 65 cycles(tsc) 16.268 ns improved 18.8%
    64 - 71 cycles(tsc) 17.799 ns - 53 cycles(tsc) 13.440 ns improved 25.4%
    128 - 91 cycles(tsc) 22.980 ns - 79 cycles(tsc) 19.899 ns improved 13.2%
    158 - 100 cycles(tsc) 25.241 ns - 90 cycles(tsc) 22.732 ns improved 10.0%
    250 - 102 cycles(tsc) 25.583 ns - 95 cycles(tsc) 23.916 ns improved 6.9%

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Call the slowpath __slab_alloc() from within the bulk loop, as the
    side effect of this call likely repopulates c->freelist.

    Choose to reenable local IRQs while calling slowpath.

    Saving some optimizations for later. E.g. it is possible to extract
    parts of __slab_alloc() and avoid the unnecessary and expensive
    (37 cycles) local_irq_{save,restore}. For now, be happy calling
    __slab_alloc(); this keeps the icache impact of this function low and I
    don't have to worry about correctness.

    Measurements on CPU CPU i7-4790K @ 4.00GHz
    Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns

    Bulk- fallback - this-patch
    1 - 58 cycles(tsc) 14.516 ns - 49 cycles(tsc) 12.459 ns improved 15.5%
    2 - 51 cycles(tsc) 12.930 ns - 38 cycles(tsc) 9.605 ns improved 25.5%
    3 - 49 cycles(tsc) 12.274 ns - 34 cycles(tsc) 8.525 ns improved 30.6%
    4 - 48 cycles(tsc) 12.058 ns - 32 cycles(tsc) 8.036 ns improved 33.3%
    8 - 46 cycles(tsc) 11.609 ns - 31 cycles(tsc) 7.756 ns improved 32.6%
    16 - 45 cycles(tsc) 11.451 ns - 32 cycles(tsc) 8.148 ns improved 28.9%
    30 - 79 cycles(tsc) 19.865 ns - 68 cycles(tsc) 17.164 ns improved 13.9%
    32 - 76 cycles(tsc) 19.212 ns - 66 cycles(tsc) 16.584 ns improved 13.2%
    34 - 74 cycles(tsc) 18.600 ns - 63 cycles(tsc) 15.954 ns improved 14.9%
    48 - 88 cycles(tsc) 22.092 ns - 77 cycles(tsc) 19.373 ns improved 12.5%
    64 - 80 cycles(tsc) 20.043 ns - 68 cycles(tsc) 17.188 ns improved 15.0%
    128 - 99 cycles(tsc) 24.818 ns - 89 cycles(tsc) 22.404 ns improved 10.1%
    158 - 99 cycles(tsc) 24.977 ns - 92 cycles(tsc) 23.089 ns improved 7.1%
    250 - 106 cycles(tsc) 26.552 ns - 99 cycles(tsc) 24.785 ns improved 6.6%

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer