14 Apr, 2008

6 commits

  • The per node counters are used mainly for showing data through the sysfs API.
    If that API is not compiled in then there is no point in keeping track of this
    data. Disable counters for the number of slabs and the number of total slabs
    if !SLUB_DEBUG. Incrementing the per node counters also touches a
    potentially contended cacheline, so disabling them could actually be a
    performance benefit for embedded systems.

    SLABINFO support is also affected. It must now depend on SLUB_DEBUG (which
    is on by default).

    Patch also avoids a check for a NULL kmem_cache_node pointer in new_slab()
    if the system is not compiled with NUMA support.

    [penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
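
    A minimal sketch of the idea, assuming the counter in question is the per
    node nr_slabs field mentioned above (the helper name and exact placement
    are illustrative, not taken from the patch):

    #ifdef CONFIG_SLUB_DEBUG
    static inline void inc_slabs_node(struct kmem_cache_node *n)
    {
            /* Touches a potentially contended per node cacheline. */
            atomic_long_inc(&n->nr_slabs);
    }
    #else
    /* Without SLUB_DEBUG (and thus without the sysfs API) the counter
     * maintenance compiles away entirely. */
    static inline void inc_slabs_node(struct kmem_cache_node *n) {}
    #endif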
     
  • __free_slab does some diagnostics. The resetting of mapcount etc
    in discard_slab() can interfere with debug processing. So move
    the reset immediately before the page is freed.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Only output per cpu stats if the kernel is built for SMP.

    Use a capital "C" as a leading character for the processor number
    (same as the numa statistics that also use a capital letter "N").

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
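
    A rough sketch of what the per cpu part of the sysfs show routine looks
    like with this change (the get_cpu_slab() accessor and the stat array are
    assumed from the statistics patch below; this is not a quote of the code):

    #ifdef CONFIG_SMP
            for_each_online_cpu(cpu) {
                    unsigned x = get_cpu_slab(s, cpu)->stat[si];

                    if (x)
                            len += sprintf(buf + len, " C%d=%u", cpu, x);
            }
    #endif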
     
  • count_partial() is used by both slabinfo and the sysfs proc support. Move
    the function directly before the beginning of the sysfs code so that it can
    be easily found. Rework the preprocessor conditional to take into account
    that slub sysfs support depends on CONFIG_SYSFS *and* CONFIG_SLUB_DEBUG.

    Make CONFIG_SLUB_STATS depend on CONFIG_SLUB_DEBUG and CONFIG_SYSFS. There
    is no point in keeping statistics if no one can retrieve them.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
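
    For reference, count_partial() is a short helper along these lines (a
    sketch written from memory, not a verbatim quote of the patch):

    static unsigned long count_partial(struct kmem_cache_node *n)
    {
            unsigned long flags;
            unsigned long x = 0;
            struct page *page;

            spin_lock_irqsave(&n->list_lock, flags);
            list_for_each_entry(page, &n->partial, lru)
                    x += page->inuse;
            spin_unlock_irqrestore(&n->list_lock, flags);
            return x;
    }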
     
  • Move the definition of kmalloc_caches_dma() into a later #ifdef CONFIG_ZONE_DMA.
    This saves one #ifdef and leaves us with a total of two #ifdefs for dma slab support.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • As spotted by kmemcheck, we need to initialize the per-CPU ->stat array before
    using it.

    [kmem_cache_cpu structures are usually allocated from arrays defined via
    DEFINE_PER_CPU that are zeroed so we have not noticed this so far --cl].

    Reported-by: Vegard Nossum
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Pekka Enberg
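
    The fix boils down to zeroing the array wherever the per cpu structure is
    set up, roughly as follows (placement inside init_kmem_cache_cpu() is an
    assumption based on the description):

    static void init_kmem_cache_cpu(struct kmem_cache *s, struct kmem_cache_cpu *c)
    {
            /* ... existing initialization of page, freelist, offset, objsize ... */
    #ifdef CONFIG_SLUB_STATS
            memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
    #endif
    }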
     

02 Apr, 2008

1 commit

  • Small typo in the patch recently merged to avoid the unused symbol
    message for count_partial(). Discussion thread with confirmation of fix at
    http://marc.info/?t=120696854400001&r=1&w=2

    The typo is in the check for whether the count_partial() function is needed,
    which was introduced by 53625b4204753b904addd40ca96d9ba802e6977d

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

18 Mar, 2008

1 commit

  • The fallback path needs to enable interrupts like done for
    the other page allocator calls. This was not necessary with
    the alternate fast path since we handled irq enable/disable in
    the slow path. The regular fastpath handles irq enable/disable
    around calls to the slow path so we need to restore the proper
    status before calling the page allocator from the slowpath.

    Signed-off-by: Christoph Lameter

    Christoph Lameter
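
    Schematically, the fallback now brackets the page allocator call the same
    way the other slowpath allocations do (a sketch of the pattern, not the
    exact diff):

            /* Interrupts were disabled by the fastpath before we got here. */
            if (gfpflags & __GFP_WAIT)
                    local_irq_enable();

            page = new_slab(s, gfpflags, node);

            if (gfpflags & __GFP_WAIT)
                    local_irq_disable();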
     

20 Feb, 2008

1 commit

  • This reverts commit 1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c, which is
    suspected to be the reason for some very occasional and hard-to-trigger
    crashes that usually look related to memory allocation (mostly reported
    in networking, but since that's generally the most common source of
    shortlived allocations - and allocations in interrupt contexts - that in
    itself is not a big clue).

    See for example
    http://bugzilla.kernel.org/show_bug.cgi?id=9973
    http://lkml.org/lkml/2008/2/19/278
    etc.

    One promising suspicion for what the root cause of bug is (which also
    explains why it's so hard to trigger in practice) came from Eric
    Dumazet:

    "I wonder how SLUB_FASTPATH is supposed to work, since it is affected
    by a classical ABA problem of lockless algo.

    cmpxchg_local(&c->freelist, object, object[c->offset]) can succeed,
    while an interrupt came (on this cpu), and several allocations were
    done, and one free was performed at the end of this interruption, so
    'object' was recycled.

    c->freelist can then contain the previous value (object), but
    object[c->offset] was changed by IRQ.

    We then put back in freelist an already allocated object."

    but another reason for the revert is simply that everybody agrees that
    this code was the main suspect just by virtue of the pattern of oopses.

    Cc: Torsten Kaiser
    Cc: Christoph Lameter
    Cc: Mathieu Desnoyers
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Eric Dumazet
    Signed-off-by: Linus Torvalds

    Linus Torvalds
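
    The window Eric describes can be pictured like this (an illustrative
    comment, not code from the tree):

    /*
     * Lockless fastpath, interrupts NOT disabled:
     *
     *   object = c->freelist;          reads pointer A
     *   next   = object[c->offset];    reads A's next pointer, B
     *
     *   --- interrupt on this cpu ---
     *   several allocations consume A and B; a later free pushes A back
     *   onto c->freelist, but A's next pointer is now some other object
     *   --- interrupt returns ---
     *
     *   cmpxchg_local(&c->freelist, A, B) succeeds because the head is A
     *   again, yet B may currently be an allocated object: a classic ABA
     *   problem, and the freelist is corrupted.
     */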
     

15 Feb, 2008

5 commits

  • Currently we hand off PAGE_SIZEd kmallocs to the page allocator in the
    mistaken belief that the page allocator can handle these allocations
    effectively. However, measurements indicate a minimum slowdown by a
    factor of 8 (and that is only SMP; NUMA is much worse) vs the slub fastpath,
    which causes regressions in tbench.

    Increase the number of kmalloc caches by one so that we again handle 4k
    kmallocs directly from slub. 4k page buffering for the page allocator
    will be performed by slub, as is already done by slab.

    At some point the page allocator fastpath should be fixed. A lot of the kernel
    would benefit from a faster ability to allocate a single page. If that is
    done then the 4k allocs may again be forwarded to the page allocator and this
    patch could be reverted.

    Reviewed-by: Pekka Enberg
    Acked-by: Mel Gorman
    Signed-off-by: Christoph Lameter

    Christoph Lameter
     
  • Slub already has two ways of allocating an object. One is via its own
    logic and the other is via the call to kmalloc_large to hand off object
    allocation to the page allocator. kmalloc_large is typically used
    for objects >= PAGE_SIZE.

    We can use that handoff to avoid failing if a higher order kmalloc slab
    allocation cannot be satisfied by the page allocator. If we reach the
    out of memory path then simply try a kmalloc_large(). kfree() can
    already handle the case of an object that was allocated via the page
    allocator and so this will work just fine (apart from object
    accounting...).

    For any kmalloc slab that already requires higher order allocs (which
    makes it impossible to use the page allocator fastpath!)
    we just use PAGE_ALLOC_COSTLY_ORDER to get the largest number of
    objects in one go from the page allocator slowpath.

    On a 4k platform this patch will lead to the following use of higher
    order pages for the following kmalloc slabs:

    8 ... 1024 order 0
    2048 .. 4096 order 3 (4k slab only after the next patch)

    We may waste some space if fallback occurs on a 2k slab but we
    are always able to fallback to an order 0 alloc.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Christoph Lameter

    Christoph Lameter
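
    Conceptually the out of memory path gains a branch like the following
    (is_kmalloc_cache() is a hypothetical helper used here only to express
    "this is one of the kmalloc caches"; the real check in the patch differs):

            new = new_slab(s, gfpflags, node);
            if (unlikely(!new)) {
                    if (is_kmalloc_cache(s))
                            /* Hand the request to the page allocator instead
                             * of failing; kfree() already copes with such
                             * objects. */
                            return kmalloc_large(s->objsize, gfpflags);
                    return NULL;
            }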
     
  • Currently we determine the gfp flags to pass to the page allocator
    each time a slab is being allocated.

    Determine the bits to be set at the time the slab is created. Store
    in a new allocflags field and add the flags in allocate_slab().

    Acked-by: Mel Gorman
    Reviewed-by: Pekka Enberg
    Signed-off-by: Christoph Lameter

    Christoph Lameter
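
    A sketch of the precomputation at cache creation time (flag names as used
    by slub in this era; the exact placement is an assumption):

            s->allocflags = 0;
            if (order)
                    s->allocflags |= __GFP_COMP;    /* compound page for order > 0 */
            if (s->flags & SLAB_CACHE_DMA)
                    s->allocflags |= SLUB_DMA;      /* __GFP_DMA when CONFIG_ZONE_DMA */
            if (s->flags & SLAB_RECLAIM_ACCOUNT)
                    s->allocflags |= __GFP_RECLAIMABLE;

            /* allocate_slab() then only needs: flags |= s->allocflags; */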
     
  • slab_address() can become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Christoph Lameter

    Adrian Bunk
     
  • This adds a proper function for kmalloc page allocator pass-through. While it
    considerably simplifies any code that does slab tracing, I think it's a
    worthwhile cleanup in itself.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Christoph Lameter

    Pekka Enberg
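
    The pass-through helper is essentially a thin wrapper around the page
    allocator; a sketch consistent with the description above:

    static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
    {
            return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size));
    }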
     

08 Feb, 2008

6 commits

  • fix checkpatch --file mm/slub.c errors and warnings.

    $ q-code-quality-compare
                          errors   lines of code   errors/KLOC
    mm/slub.c [before]        22            4204           5.2
    mm/slub.c [after]          0            4210           0

    no code changed:

     text   data  bss    dec   hex  filename
    22195   8634  136  30965  78f5  slub.o.before
    22195   8634  136  30965  78f5  slub.o.after

    md5:
    93cdfbec2d6450622163c590e1064358 slub.o.before.asm
    93cdfbec2d6450622163c590e1064358 slub.o.after.asm

    [clameter: rediffed against Pekka's cleanup patch, omitted
    moves of the name of a function to the start of line]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Christoph Lameter

    Ingo Molnar
     
  • Slub can use the non-atomic version to unlock because other flags will not
    get modified with the lock held.

    Signed-off-by: Nick Piggin
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton

    Nick Piggin
     
  • The statistics provided here allow the monitoring of allocator behavior, but
    at the cost of some (minimal) loss of performance. Counters are placed in
    SLUB's per cpu data structure. The statistics may grow the per cpu structure
    beyond one cacheline, which will increase the cache footprint of SLUB.

    There is a compile option to enable/disable the inclusion of the runtime
    statistics, and it is off by default.

    The slabinfo tool is enhanced to support these statistics via two options:

    -D Switches the line of information displayed for a slab from size
    mode to activity mode.

    -A Sorts the slabs displayed by activity. This allows the display of
    the slabs most important to the performance of a certain load.

    -r Reports detailed statistics for individual slabs (implied when a cache
    name is given on the command line).

    Example (tbench load):

    slabinfo -AD ->Shows the most active slabs

    Name                 Objects      Alloc       Free  %Fast
    skbuff_fclone_cache       33  111953835  111953835  99 99
    :0000192                2666    5283688    5281047  99 99
    :0001024                 849    5247230    5246389  83 83
    vm_area_struct          1349     119642     118355  91 22
    :0004096                  15      66753      66751  98 98
    :0000064                2067      25297      23383  98 78
    dentry                 10259      28635      18464  91 45
    :0000080               11004      18950       8089  98 98
    :0000096                1703      12358      10784  99 98
    :0000128                 762      10582       9875  94 18
    :0000512                 184       9807       9647  95 81
    :0002048                 479       9669       9195  83 65
    anon_vma                 777       9461       9002  99 71
    kmalloc-8               6492       9981       5624  99 97
    :0000768                 258       7174       6931  58 15

    So the skbuff_fclone_cache is of highest importance for the tbench load.
    Pretty high load on the 192 sized slab. Look for the aliases

    slabinfo -a | grep 000192
    :0000192 -r option implied if cache name is mentioned

    .... Usual output ...

    Slab Perf Counter         Alloc       Free  %Al  %Fr
    --------------------------------------------------
    Fastpath              111953360  111946981   99   99
    Slowpath                   1044       7423    0    0
    Page Alloc                  272        264    0    0
    Add partial                  25        325    0    0
    Remove partial               86        264    0    0
    RemoteObj/SlabFrozen        350       4832    0    0
    Total                 111954404  111954404

    Flushes 49 Refill 0
    Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%)

    Looks good because the fastpath is overwhelmingly taken.

    skbuff_head_cache:

    Slab Perf Counter         Alloc       Free  %Al  %Fr
    --------------------------------------------------
    Fastpath                5297262    5259882   99   99
    Slowpath                   4477      39586    0    0
    Page Alloc                  937        824    0    0
    Add partial                   0       2515    0    0
    Remove partial             1691        824    0    0
    RemoteObj/SlabFrozen       2621       9684    0    0
    Total                   5301739    5299468

    Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)

    Descriptions of the output:

    Total: The total number of allocations and frees that occurred for a
    slab

    Fastpath: The number of allocations/frees that used the fastpath.

    Slowpath: Other allocations

    Page Alloc: Number of calls to the page allocator as a result of slowpath
    processing

    Add Partial: Number of slabs added to the partial list through free or
    alloc (occurs during cpuslab flushes)

    Remove Partial: Number of slabs removed from the partial list as a result of
    allocations retrieving a partial slab or by a free freeing
    the last object of a slab.

    RemoteObj/Froz: How many times remotely freed objects were encountered when a
    slab was about to be deactivated. Frozen: How many times a
    free was able to skip list processing because the slab was in use
    as the cpuslab of another processor.

    Flushes: Number of times the cpuslab was flushed on request
    (kmem_cache_shrink, may result from races in __slab_alloc)

    Refill: Number of times we were able to refill the cpuslab from
    remotely freed objects for the same slab.

    Deactivate: Statistics on how slabs were deactivated and how they were
    put onto the partial list.

    In general, a dominant fastpath is very good. A slowpath without partial list
    processing is also desirable. Any touching of the partial list uses node
    specific locks, which may potentially cause list lock contention.

    Signed-off-by: Christoph Lameter

    Christoph Lameter
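
    Internally the counters are a per cpu array indexed by an event enum and
    bumped through a helper that compiles away when the option is off; a
    sketch (only a few of the event names are shown):

    enum stat_item {
            ALLOC_FASTPATH, ALLOC_SLOWPATH,
            FREE_FASTPATH, FREE_SLOWPATH,
            /* ... page allocs, partial list operations, deactivation ... */
            NR_SLUB_STAT_ITEMS
    };

    static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
    {
    #ifdef CONFIG_SLUB_STATS
            c->stat[si]++;
    #endif
    }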
     
  • Provide an alternate implementation of the SLUB fast paths for alloc
    and free using cmpxchg_local. The cmpxchg_local fast path is selected
    for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only
    set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an
    interrupt enable/disable sequence. This is known to be true for both
    x86 platforms so set FAST_CMPXCHG_LOCAL for both arches.

    Currently another requirement for the fastpath is that the kernel is
    compiled without preemption. The restriction will go away with the
    introduction of a new per cpu allocator and new per cpu operations.

    The advantages of a cmpxchg_local based fast path are:

    1. Potentially lower cycle count (30%-60% faster)

    2. There is no need to disable and enable interrupts on the fast path.
    Currently interrupts have to be disabled and enabled on every
    slab operation. This is likely avoiding a significant percentage
    of interrupt off / on sequences in the kernel.

    3. The disposal of freed slabs can occur with interrupts enabled.

    The alternate path is realized using #ifdef's. Several attempts to do the
    same with macros and inline functions resulted in a mess (in particular due
    to the strange way that local_irq_save() handles its argument and due
    to the need to define macros/functions that sometimes disable interrupts
    and sometimes do something else).

    [clameter: Stripped preempt bits and disabled fastpath if preempt is enabled]
    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton

    Christoph Lameter
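
    The allocation side of the alternate fastpath looks roughly like this
    (reconstructed from the description; the end-of-freelist test depends on
    the following patch, and the statistics calls are omitted):

    #ifdef SLUB_FASTPATH
            void **object;
            struct kmem_cache_cpu *c = get_cpu_slab(s, raw_smp_processor_id());

            do {
                    object = c->freelist;
                    if (unlikely(is_end(object) || !node_match(c, node)))
                            return __slab_alloc(s, gfpflags, node, addr, c);
                    /* Retry if an interrupt changed the freelist under us. */
            } while (cmpxchg_local(&c->freelist, object,
                                   object[c->offset]) != object);

            return object;
    #endif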
     
  • We use a NULL pointer on freelists to signal that there are no more objects.
    However, the NULL end marker is the same for every slab, in contrast to the
    pointers to real objects, which lie in different ranges for different slab pages.

    Change the end pointer to be a pointer to the first object and set bit 0.
    Every slab will then have a different end pointer. This is necessary to ensure
    that end markers can be matched to the source slab during cmpxchg_local.

    Bring back the use of the mapping field by SLUB since we would otherwise have
    to call a relatively expensive function page_address() in __slab_alloc(). Use
    of the mapping field allows avoiding a call to page_address() in various other
    functions as well.

    There is no need to change the page_mapping() function since bit 0 is set on
    the mapping, as is also the case for anonymous pages. page_mapping(slab_page) will
    therefore still return NULL although the mapping field is overloaded.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton

    Christoph Lameter
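
    The end marker idea, illustratively (macro names here are made up; the
    helpers in the patch may be named differently):

    #define SLAB_END(first_object) \
            ((void *)((unsigned long)(first_object) | 1))   /* bit 0 set */
    #define IS_SLAB_END(p)  ((unsigned long)(p) & 1)

    /* Because the marker is derived from the slab's own first object, a
     * cmpxchg_local against a freelist that was meanwhile replaced by a
     * different slab cannot accidentally match the end marker. */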
     
  • gcc 4.2 spits out an annoying warning if one casts a const void *
    pointer to a void * pointer. No warning is generated if the
    conversion is done through an assignment.

    Signed-off-by: Christoph Lameter

    Christoph Lameter
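
    A standalone illustration of the workaround pattern (not the actual hunk):

    static void consume(void *p);

    void example(const void *x)
    {
            void *p = (void *)x;    /* conversion via an assignment: per the
                                     * description above, gcc 4.2 stays quiet */

            consume(p);
            /* consume((void *)x);     the cast at the call site is what made
             *                         gcc 4.2 complain */
    }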
     

05 Feb, 2008

5 commits

  • inconsistent {softirq-on-W} -> {in-softirq-W} usage.
    swapper/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
    (&n->list_lock){-+..}, at: [] add_partial+0x31/0xa0
    {softirq-on-W} state was registered at:
    [] __lock_acquire+0x3e8/0x1140
    [] debug_check_no_locks_freed+0x188/0x1a0
    [] lock_acquire+0x55/0x70
    [] add_partial+0x31/0xa0
    [] _spin_lock+0x1e/0x30
    [] add_partial+0x31/0xa0
    [] kmem_cache_open+0x1cc/0x330
    [] _spin_unlock_irq+0x24/0x30
    [] create_kmalloc_cache+0x64/0xf0
    [] init_alloc_cpu_cpu+0x70/0x90
    [] kmem_cache_init+0x65/0x1d0
    [] start_kernel+0x23e/0x350
    [] _sinittext+0x12d/0x140
    [] 0xffffffffffffffff

    This change isn't really necessary for correctness, but it prevents lockdep
    from getting upset and then disabling itself.

    Signed-off-by: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Christoph Lameter

    root
     
  • This fixes most of the obvious coding style violations in mm/slub.c as
    reported by checkpatch.

    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Christoph Lameter

    Pekka Enberg
     
  • Add a parameter to add_partial instead of having separate functions. The
    parameter allows more detailed control over where a slab page is placed in
    the partial queues.

    If we put slabs back at the front then they are likely immediately used for
    allocations. If they are put at the end then we can maximize the time that
    the partial slabs spend without being subject to allocations.

    When deactivating a slab we can put slabs that had objects freed to them
    remotely (visible because objects were put on the regular, lock-protected
    freelist) at the end of the list so that the cachelines of remote processors
    can cool down. Slabs that had objects from the local cpu freed to them
    (objects exist in the lockless freelist) are put at the front of the list to
    be reused ASAP in order to exploit the cache hot state of the local cpu.

    Patch seems to slightly improve tbench speed (1-2%).

    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton

    Christoph Lameter
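
    A sketch of the parameterised helper (the signature follows the
    description; the lock handling shown is an assumption):

    static void add_partial(struct kmem_cache_node *n, struct page *page, int tail)
    {
            spin_lock(&n->list_lock);
            n->nr_partial++;
            if (tail)
                    /* let remote cpus' cachelines cool down */
                    list_add_tail(&page->lru, &n->partial);
            else
                    /* cache hot locally: make it the next slab to be picked */
                    list_add(&page->lru, &n->partial);
            spin_unlock(&n->list_lock);
    }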
     
  • The NUMA defrag works by allocating objects from partial slabs on remote
    nodes. Rename it to

    remote_node_defrag_ratio

    to be clear about this.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • Move the counting function for objects in partial slabs so that it is placed
    before kmem_cache_shrink.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton

    Christoph Lameter