28 Sep, 2011

1 commit


20 Aug, 2011

1 commit

  • Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
    partial pages. The partial page list is used in slab_free() to avoid
    per node lock taking.

    In __slab_alloc() we can then take multiple partial pages off the per
    node partial list in one go reducing node lock pressure.

    We can also use the per cpu partial list in slab_alloc() to avoid scanning
    partial lists for pages with free objects.

    The main effect of a per cpu partial list is that the per node list_lock
    is taken for batches of partial pages instead of individual ones.
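    As a rough sketch of the idea (field and type names below are illustrative,
    not the actual kernel definitions), the spare space in kmem_cache_cpu holds a
    chain of partial pages that can be consumed without touching the per node
    list_lock:

    /* Illustrative sketch only -- not the real struct page / kmem_cache_cpu. */
    struct page_stub {                  /* stand-in for struct page */
            void *freelist;             /* first free object in this page */
            struct page_stub *next;     /* next partial page in the chain */
    };

    struct kmem_cache_cpu_stub {
            void *freelist;             /* freelist of the active slab page */
            unsigned long tid;          /* transaction id */
            struct page_stub *page;     /* currently active slab page */
            struct page_stub *partial;  /* per cpu chain of partial pages */
    };

    /* Refill without the node lock: just unlink the next per cpu partial page.
     * Only when this chain is empty does __slab_alloc() take the list_lock and
     * pull a whole batch of pages from the per node partial list. */
    static struct page_stub *take_cpu_partial(struct kmem_cache_cpu_stub *c)
    {
            struct page_stub *p = c->partial;

            if (p)
                    c->partial = p->next;
            return p;
    }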

    Potential future enhancements:

    1. With some work, the pickup from the partial list could perhaps be done
    without disabling interrupts. The free path already puts the page into the
    per cpu partial list without disabling interrupts.

    2. __slab_free() may have some code paths that could use optimization.

    Performance:

    Command                           Before       After
    ./hackbench 100 process 200000  1953.047    1564.614
    ./hackbench 100 process 20000    207.176     156.940
    ./hackbench 100 process 20000    204.468     156.940
    ./hackbench 100 process 20000    204.879     158.772
    ./hackbench 10 process 20000      20.153      15.853
    ./hackbench 10 process 20000      20.153      15.986
    ./hackbench 10 process 20000      19.363      16.111
    ./hackbench 1 process 20000        2.518       2.307
    ./hackbench 1 process 20000        2.258       2.339
    ./hackbench 1 process 20000        2.864       2.163

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

31 Jul, 2011

1 commit

  • * 'slub/lockless' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: (21 commits)
    slub: When allocating a new slab also prep the first object
    slub: disable interrupts in cmpxchg_double_slab when falling back to pagelock
    Avoid duplicate _count variables in page_struct
    Revert "SLUB: Fix build breakage in linux/mm_types.h"
    SLUB: Fix build breakage in linux/mm_types.h
    slub: slabinfo update for cmpxchg handling
    slub: Not necessary to check for empty slab on load_freelist
    slub: fast release on full slab
    slub: Add statistics for the case that the current slab does not match the node
    slub: Get rid of the another_slab label
    slub: Avoid disabling interrupts in free slowpath
    slub: Disable interrupts in free_debug processing
    slub: Invert locking and avoid slab lock
    slub: Rework allocator fastpaths
    slub: Pass kmem_cache struct to lock and freeze slab
    slub: explicit list_lock taking
    slub: Add cmpxchg_double_slab()
    mm: Rearrange struct page
    slub: Move page->frozen handling near where the page->freelist handling occurs
    slub: Do not use frozen page flag but a bit in the page counters
    ...

    Linus Torvalds
     

08 Jul, 2011

1 commit


02 Jul, 2011

3 commits


17 Jun, 2011

1 commit

  • Every slab allocator has its own alignment definition in include/linux/sl?b_def.h.
    Extract those and define a common set in include/linux/slab.h.

    SLOB: As noted, sometimes we need double word alignment on 32 bit. This gives all
    structures allocated by SLOB an unsigned long long alignment like the others.

    SLAB: If ARCH_SLAB_MINALIGN is not set, SLAB would set ARCH_SLAB_MINALIGN to
    zero, meaning no alignment at all. Give it the default unsigned long long alignment.
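    The common defaults presumably end up looking something like this sketch in
    include/linux/slab.h (exact macro set assumed, not quoted from the patch):

    /* Sketch: default minimum alignments, overridable per architecture. */
    #ifndef ARCH_KMALLOC_MINALIGN
    #define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
    #endif

    #ifndef ARCH_SLAB_MINALIGN
    #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
    #endif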

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

21 May, 2011

1 commit


08 May, 2011

1 commit

  • Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used
    everywhere.

    There may be performance implications since:

    A. We now have to manage a transaction ID for all arches

    B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced
    to a very short irqoff section.

    This change does not introduce multiple irqoff/irqon sequences; even in the
    fallback case we only have to do one disable and one enable, as before.
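    As a hypothetical illustration of the fallback (stand-in helpers, not the
    real percpu implementation), the double-word compare-and-exchange is emulated
    inside one short irq-off window:

    #include <stdbool.h>

    /* Stand-ins for local_irq_save()/local_irq_restore(); illustrative only. */
    static unsigned long irq_save(void) { return 0; }
    static void irq_restore(unsigned long flags) { (void)flags; }

    /* Compare two adjacent words and, if both still match, replace them --
     * all inside a single short interrupts-off section. */
    static bool cmpxchg_double_fallback(void **pfirst, unsigned long *psecond,
                                        void *old1, unsigned long old2,
                                        void *new1, unsigned long new2)
    {
            unsigned long flags = irq_save();
            bool ok = (*pfirst == old1 && *psecond == old2);

            if (ok) {
                    *pfirst = new1;
                    *psecond = new2;
            }
            irq_restore(flags);
            return ok;
    }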

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

23 Mar, 2011

1 commit


21 Mar, 2011

1 commit


12 Mar, 2011

1 commit

  • There is no dedicated "struct" for slub's slab; it shares struct page.
    But struct page is very small, so it is insufficient when we need
    to add metadata for a slab.

    So we add a field "reserved" to struct kmem_cache: when a slab
    is allocated, kmem_cache->reserved bytes are automatically reserved
    at the end of the slab for the slab's metadata.
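    The effect can be sketched as follows (helper name and signature are
    illustrative, not the actual kernel code): the reserved tail simply shrinks
    the space available for objects in each slab.

    /* Sketch: objects fit in the slab size minus the reserved tail. */
    static inline unsigned long objects_per_slab(unsigned long slab_bytes,
                                                 unsigned long reserved,
                                                 unsigned long object_size)
    {
            return (slab_bytes - reserved) / object_size;
    }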

    Changed from v1:
    Export the reserved field via sysfs

    Acked-by: Christoph Lameter
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Pekka Enberg

    Lai Jiangshan
     

11 Mar, 2011

2 commits

  • Use the this_cpu_cmpxchg_double functionality to implement a lockless
    allocation algorithm on arches that support fast this_cpu_ops.

    Each of the per cpu pointers is paired with a transaction id that ensures
    that updates of the per cpu information can only occur in sequence on
    a certain cpu.

    A transaction id is a "long" integer composed of an event number and the
    cpu number. The event number is incremented for every change to the
    per cpu state. This lets the cmpxchg instruction verify, for an update,
    that nothing interfered, that we are updating the per cpu structure
    for the processor where we picked up the information, and that we are
    still on that processor when we update the information.

    This results in a significant decrease of the overhead in the fastpaths. It
    also makes it easy to adopt the fast path for realtime kernels since this
    is lockless and does not require the use of the current per cpu area
    over the critical section. It is only important that the per cpu area is
    current at the beginning of the critical section and at the end.

    So there is no need even to disable preemption.
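    A sketch of the scheme (constants and helper names below are illustrative,
    not the exact kernel code): the low bits of the tid identify the cpu and the
    bits above them count events on that cpu, and the fastpath commits freelist
    and tid together with one double-word cmpxchg.

    /* Illustrative: TID_STEP is a power of two >= the number of cpus. */
    #define NR_CPUS_EXAMPLE 64UL            /* assumed configuration value */
    #define TID_STEP        NR_CPUS_EXAMPLE

    static inline unsigned long init_tid(int cpu)
    {
            return (unsigned long)cpu;      /* event count 0 on this cpu */
    }

    static inline unsigned long next_tid(unsigned long tid)
    {
            return tid + TID_STEP;          /* bump event count, keep cpu bits */
    }

    static inline int tid_to_cpu(unsigned long tid)
    {
            return (int)(tid % TID_STEP);
    }

    /* Conceptually the allocation fastpath then does:
     *
     *     object = c->freelist; tid = c->tid;
     *     if (!this_cpu_cmpxchg_double(c->freelist, c->tid,
     *                                  object, tid,
     *                                  next_object, next_tid(tid)))
     *             retry;
     *
     * Any interference -- migration to another cpu, an interrupt that
     * allocated or freed -- changes the tid and forces a retry. */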

    Test results show that the fastpath cycle count is reduced by up to ~ 40%
    (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
    adds a few cycles.

    Sadly this does nothing for the slowpath, which is where the main performance
    issues in slub are, but the best case performance rises significantly.
    (For that see the more complex slub patches that require cmpxchg_double.)

    Kmalloc: alloc/free test

    Before:

    10000 times kmalloc(8)/kfree -> 134 cycles
    10000 times kmalloc(16)/kfree -> 152 cycles
    10000 times kmalloc(32)/kfree -> 144 cycles
    10000 times kmalloc(64)/kfree -> 142 cycles
    10000 times kmalloc(128)/kfree -> 142 cycles
    10000 times kmalloc(256)/kfree -> 132 cycles
    10000 times kmalloc(512)/kfree -> 132 cycles
    10000 times kmalloc(1024)/kfree -> 135 cycles
    10000 times kmalloc(2048)/kfree -> 135 cycles
    10000 times kmalloc(4096)/kfree -> 135 cycles
    10000 times kmalloc(8192)/kfree -> 144 cycles
    10000 times kmalloc(16384)/kfree -> 754 cycles

    After:

    10000 times kmalloc(8)/kfree -> 78 cycles
    10000 times kmalloc(16)/kfree -> 78 cycles
    10000 times kmalloc(32)/kfree -> 82 cycles
    10000 times kmalloc(64)/kfree -> 88 cycles
    10000 times kmalloc(128)/kfree -> 79 cycles
    10000 times kmalloc(256)/kfree -> 79 cycles
    10000 times kmalloc(512)/kfree -> 85 cycles
    10000 times kmalloc(1024)/kfree -> 82 cycles
    10000 times kmalloc(2048)/kfree -> 82 cycles
    10000 times kmalloc(4096)/kfree -> 85 cycles
    10000 times kmalloc(8192)/kfree -> 82 cycles
    10000 times kmalloc(16384)/kfree -> 706 cycles

    Kmalloc: Repeatedly allocate then free test

    Before:

    10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
    10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
    10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
    10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
    10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
    10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
    10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
    10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
    10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
    10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
    10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
    10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles

    After:

    10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
    10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
    10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
    10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
    10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
    10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
    10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
    10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
    10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
    10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
    10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
    10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • It is used in unfreeze_slab(), which is a performance-critical
    function.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

06 Nov, 2010

1 commit

  • Having the trace calls defined in the always inlined kmalloc functions
    in include/linux/slub_def.h causes a lot of code duplication as the
    trace functions get instantiated for each kmalloc call site. This can
    simply be removed by pushing the trace calls down into the functions in
    slub.c (see the sketch after the change list below).

    On my x86_64 build this patch shrinks the code size of the kernel by
    approx 36K and also shrinks the code size of many modules -- too many to
    list here ;)

    size vmlinux (2.6.36) reports:

        text     data     bss       dec      hex  filename
     5410611   743172  828928   6982711   6a8c37  vmlinux
     5373738   744244  828928   6946910   6a005e  vmlinux + patch

    The resulting kernel has had some testing & kmalloc trace still seems to
    work.

    This patch
    - moves trace_kmalloc out of the inlined kmalloc() and pushes it down
    into kmem_cache_alloc_trace() so that it only gets instantiated once.

    - renames kmem_cache_alloc_notrace() to kmem_cache_alloc_trace() to
    indicate that it now does have tracing. (Maybe this would be better
    called something like kmalloc_kmem_cache?)

    - adds a new function kmalloc_order() to handle allocation and tracing
    of large allocations of page order.

    - removes tracing from the inlined kmalloc_large(), replacing it with a
    call to kmalloc_order();

    - moves tracing out of the inlined kmalloc_node() and pushes it down into
    kmem_cache_alloc_node_trace().

    - renames kmem_cache_alloc_node_notrace() to
    kmem_cache_alloc_node_trace().

    - removes the include of trace/events/kmem.h from slub_def.h.

    v2
    - keep kmalloc_order_trace inline when !CONFIG_TRACE
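    As a simplified sketch of the resulting shape (function bodies abbreviated;
    the real declarations and DMA/large-size handling are omitted), the inline
    wrapper no longer expands a trace call at every site, it just calls an
    out-of-line _trace variant that is instantiated once in slub.c:

    /* Out of line, in slub.c: instantiated once, carries the tracepoint. */
    void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
    {
            void *ret = kmem_cache_alloc(s, flags);

            trace_kmalloc(_RET_IP_, ret, size, s->size, flags);
            return ret;
    }

    /* Always inlined, in slub_def.h: no tracing code duplicated per call site. */
    static __always_inline void *kmalloc(size_t size, gfp_t flags)
    {
            if (__builtin_constant_p(size)) {
                    struct kmem_cache *s = kmalloc_slab(size);

                    if (!s)
                            return ZERO_SIZE_PTR;
                    return kmem_cache_alloc_trace(s, flags, size);
            }
            return __kmalloc(size, flags);
    }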

    Signed-off-by: Richard Kennedy
    Signed-off-by: Pekka Enberg

    Richard Kennedy
     

06 Oct, 2010

1 commit


02 Oct, 2010

2 commits

  • Reduce the #ifdefs and simplify bootstrap by making SMP and NUMA as much alike
    as possible. This means that there will be an additional indirection to get to
    the kmem_cache_node field under SMP.

    Acked-by: David Rientjes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • kmalloc caches are statically defined and may take up a lot of space just
    because the node array has to be dimensioned for the largest
    node count supported.

    This patch makes the size of the kmem_cache structure dynamic throughout by
    creating a kmem_cache slab cache for the kmem_cache objects. The bootstrap
    occurs by allocating the initial one or two kmem_cache objects from the
    page allocator.
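    Conceptually the bootstrap works around a chicken-and-egg problem: the cache
    of kmem_cache objects cannot allocate its own first descriptors, so those
    come straight from the page allocator. A loose, stand-alone analogy (names
    and flow invented for illustration, not the kernel code):

    #include <stdlib.h>

    struct cache_desc {                     /* stand-in for struct kmem_cache */
            const char *name;
            unsigned long object_size;
    };

    static struct cache_desc *kmem_cache_cache;   /* the cache describing itself */

    static void bootstrap(void)
    {
            /* Step 1: hand-build the descriptor for the descriptor cache from
             * raw (page-allocator-like) memory; no slab cache exists yet. */
            struct cache_desc *boot = calloc(1, sizeof(*boot));

            boot->name = "kmem_cache";
            boot->object_size = sizeof(struct cache_desc);

            /* Step 2: from now on every further descriptor -- and a relocated
             * copy of this one -- can be allocated from the cache itself. */
            kmem_cache_cache = boot;
    }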

    C2->C3
    - Fix various issues indicated by David
    - Make create_kmalloc_cache() return a kmem_cache * pointer.

    Acked-by: David Rientjes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

23 Aug, 2010

1 commit


11 Aug, 2010

1 commit

  • Now each architecture has its own dma_get_cache_alignment implementation.

    dma_get_cache_alignment returns the minimum DMA alignment. Architectures
    define it as ARCH_KMALLOC_MINALIGN (it's used to make sure that a kmalloc'ed
    buffer is DMA-safe; the buffer doesn't share a cache line with others). So
    we can unify the dma_get_cache_alignment implementations.

    This patch:

    dma_get_cache_alignment() needs to know whether an architecture defines
    ARCH_KMALLOC_MINALIGN or not (i.e., whether the architecture has a DMA
    alignment restriction). However, slab.h defines ARCH_KMALLOC_MINALIGN if
    an architecture doesn't define it.

    Let's rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN.
    ARCH_KMALLOC_MINALIGN is used only in the internals of slab/slob/slub
    (except for crypto).
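    After the rename, the unified helper can presumably take roughly this shape
    (a sketch, not the exact patch):

    /* Sketch: one generic implementation instead of one per architecture. */
    static inline int dma_get_cache_alignment(void)
    {
    #ifdef ARCH_DMA_MINALIGN
            return ARCH_DMA_MINALIGN;
    #endif
            return 1;       /* no DMA alignment restriction on this arch */
    }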

    Signed-off-by: FUJITA Tomonori
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    FUJITA Tomonori
     

09 Aug, 2010

1 commit


10 Jun, 2010

1 commit


09 Jun, 2010

1 commit

  • We have been resisting new ftrace plugins and removing existing
    ones, and kmemtrace has been superseded by kmem trace events
    and perf-kmem, so we remove it.

    Signed-off-by: Li Zefan
    Acked-by: Pekka Enberg
    Acked-by: Eduard - Gabriel Munteanu
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    [ remove kmemtrace from the makefile, handle slob too ]
    Signed-off-by: Frederic Weisbecker

    Li Zefan
     

30 May, 2010

1 commit

  • Commit 756dee75872a2a764b478e18076360b8a4ec9045 ("SLUB: Get rid of dynamic DMA
    kmalloc cache allocation") makes S390 run out of kmalloc caches. Increase the
    number of kmalloc caches to a safe size.

    Cc: [ .33 and .34 ]
    Reported-by: Heiko Carstens
    Tested-by: Heiko Carstens
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

25 May, 2010

1 commit

  • This patch is meant to improve the performance of SLUB by moving the local
    kmem_cache_node lock into its own cacheline, separate from kmem_cache.
    This is accomplished by simply removing the local_node when NUMA is enabled.

    On my system with 2 nodes I saw around a 5% performance increase w/
    hackbench times dropping from 6.2 seconds to 5.9 seconds on average. I
    suspect the performance gain would increase as the number of nodes
    increases, but I do not currently have the data to back that up.

    Bugzilla-Reference: http://bugzilla.kernel.org/show_bug.cgi?id=15713
    Cc:
    Reported-by: Alex Shi
    Tested-by: Alex Shi
    Acked-by: Yanmin Zhang
    Acked-by: Christoph Lameter
    Signed-off-by: Alexander Duyck
    Signed-off-by: Pekka Enberg

    Alexander Duyck
     

20 May, 2010

1 commit


20 Dec, 2009

3 commits

  • Remove the fields in struct kmem_cache_cpu that were used to cache data from
    struct kmem_cache when they were in different cachelines. The cacheline that
    holds the per cpu array pointer now also holds these values. We can cut down
    the struct kmem_cache_cpu size to almost half.

    The get_freepointer() and set_freepointer() functions that used to be only
    intended for the slow path now are also useful for the hot path since access
    to the size field does not require accessing an additional cacheline anymore.
    This results in consistent use of functions for setting the freepointer of
    objects throughout SLUB.

    Also we initialize all possible kmem_cache_cpu structures when a slab is
    created. No need to initialize them when a processor or node comes online.
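    For reference, the two helpers work roughly like this (a sketch of the idea:
    the free pointer lives inside the free object itself, at the cache's
    configured offset):

    /* Sketch: read/write the next-free pointer stored inside the object. */
    static inline void *get_freepointer(struct kmem_cache *s, void *object)
    {
            return *(void **)((char *)object + s->offset);
    }

    static inline void set_freepointer(struct kmem_cache *s, void *object,
                                       void *fp)
    {
            *(void **)((char *)object + s->offset) = fp;
    }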

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Dynamic DMA kmalloc cache allocation is troublesome since the
    new percpu allocator does not support allocations in atomic contexts.
    Reserve some statically allocated kmalloc_cpu structures instead.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Using per cpu allocations removes the need for the per cpu arrays in the
    kmem_cache struct. These could get quite big if we have to support systems
    with thousands of cpus. The use of this_cpu_xx operations results in:

    1. The size of kmem_cache for SMP configurations shrinks since we will only
    need 1 pointer instead of NR_CPUS (see the sketch below). The same pointer
    can be used by all processors. Reduces cache footprint of the allocator.

    2. We can dynamically size kmem_cache according to the actual nodes in the
    system meaning less memory overhead for configurations that may potentially
    support up to 1k NUMA nodes / 4k cpus.

    3. We no longer need to fiddle with allocating and releasing
    kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu
    alloc logic will do it all for us. Removes some portions of the cpu hotplug
    functionality.

    4. Fastpath performance increases since per cpu pointer lookups and
    address calculations are avoided.

    V7-V8
    - Convert missed get_cpu_slab() under CONFIG_SLUB_STATS
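    The structural change can be pictured roughly as follows (a sketch with
    stand-in types; the real kernel uses the __percpu annotation and percpu
    accessors):

    #define NR_CPUS 4096    /* example: dimensioned for the largest config */

    struct kmem_cache_cpu;  /* per cpu slab state, details omitted */

    struct kmem_cache_before {
            /* One pointer slot per possible cpu, statically dimensioned. */
            struct kmem_cache_cpu *cpu_slab[NR_CPUS];
            /* ... */
    };

    struct kmem_cache_after {
            /* A single per cpu pointer; this_cpu_xx() ops resolve it per cpu. */
            struct kmem_cache_cpu *cpu_slab;
            /* ... */
    };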

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

11 Dec, 2009

1 commit

  • Define kmem_trace_alloc_{,node}_notrace() if CONFIG_TRACING is
    enabled; otherwise perf-kmem will show wrong stats when
    CONFIG_KMEM_TRACE is not set, because a kmalloc() memory allocation may
    be traced by both trace_kmalloc() and trace_kmem_cache_alloc().

    Signed-off-by: Li Zefan
    Reviewed-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: linux-mm@kvack.org
    Cc: Eduard - Gabriel Munteanu
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     

15 Sep, 2009

1 commit


30 Aug, 2009

1 commit

  • If the minalign is 64 bytes, then the 96 byte cache should not be created
    because it would conflict with the 128 byte cache.

    If the minalign is 256 bytes, patching the size_index table should not
    result in a buffer overrun.

    The calculation "(i - 1) / 8" used to access size_index[] is moved to
    a separate function as suggested by Christoph Lameter.
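    The extracted helper is presumably along these lines (a sketch matching the
    quoted calculation, not the patch verbatim):

    /* Sketch: map a kmalloc request size in bytes to its size_index[] slot. */
    static inline int size_index_elem(size_t bytes)
    {
            return (bytes - 1) / 8;
    }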

    Acked-by: Christoph Lameter
    Signed-off-by: Aaro Koskinen
    Signed-off-by: Pekka Enberg

    Aaro Koskinen
     

06 Aug, 2009

1 commit


08 Jul, 2009

1 commit


12 Jun, 2009

1 commit

  • As explained by Benjamin Herrenschmidt:

    Oh and btw, your patch alone doesn't fix powerpc, because it's missing
    a whole bunch of GFP_KERNEL's in the arch code... You would have to
    grep the entire kernel for things that check slab_is_available() and
    even then you'll be missing some.

    For example, slab_is_available() didn't always exist, and so in the
    early days on powerpc, we used a mem_init_done global that is set from
    mem_init() (not perfect but works in practice). And we still have code
    using that to do the test.

    Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators
    in early boot code to avoid enabling interrupts.
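    Conceptually the masking looks like the following sketch (flag names taken
    from the text above; the exact hook point in each allocator is assumed):

    /* Sketch: while it is too early to sleep or do IO, strip the flags that
     * would let the allocator enable interrupts and block. */
    static inline gfp_t boot_restrict_gfp(gfp_t flags, int slab_fully_up)
    {
            if (!slab_fully_up)
                    flags &= ~(__GFP_WAIT | __GFP_IO | __GFP_FS);
            return flags;
    }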

    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

12 Apr, 2009

1 commit

  • Impact: refactor code for future changes

    Currently, kmemtrace.h is used both as the header file of kmemtrace and as
    the definition of kmem's tracepoints.

    A tracepoint definition file may be used by other code, and should only
    contain the tracepoint definitions.

    We can separate include/trace/kmemtrace.h into 2 files:

    include/linux/kmemtrace.h: header file for kmemtrace
    include/trace/kmem.h: definition of kmem tracepoints

    Signed-off-by: Zhao Lei
    Acked-by: Eduard - Gabriel Munteanu
    Acked-by: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Tom Zanussi
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Zhaolei
     

03 Apr, 2009

1 commit

  • kmemtrace now uses tracepoints instead of markers. We no longer need to
    use format specifiers to pass arguments.

    Signed-off-by: Eduard - Gabriel Munteanu
    [ folded: Use the new TP_PROTO and TP_ARGS to fix the build. ]
    [ folded: fix build when CONFIG_KMEMTRACE is disabled. ]
    [ folded: define tracepoints when CONFIG_TRACEPOINTS is enabled. ]
    Signed-off-by: Pekka Enberg
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Eduard - Gabriel Munteanu
     

02 Apr, 2009

1 commit


24 Mar, 2009

1 commit