27 Jul, 2008

1 commit

  • The kmem cache passed to the constructor is only needed for constructors that are
    themselves multiplexers. Nobody uses this "feature", nor does anybody use the
    passed kmem cache in a non-trivial way, so pass only a pointer to the object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is a flag day, yes (a sketch of the new constructor signature follows this
    entry).

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
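
    A rough sketch of what the change means for a cache user follows; the struct and
    cache names here are made up for illustration and are not from the patch. Before
    this change the constructor also received the struct kmem_cache pointer; afterwards
    it receives only the object:

      /* Hypothetical example object and cache, not from the patch. */
      struct my_obj {
              spinlock_t lock;
              struct list_head list;
      };

      /* Old style (roughly): void my_ctor(struct kmem_cache *cachep, void *obj);  */
      /* New style: only the object pointer is passed.                             */
      static void my_ctor(void *obj)
      {
              struct my_obj *o = obj;

              spin_lock_init(&o->lock);
              INIT_LIST_HEAD(&o->list);
      }

      static struct kmem_cache *my_cache;

      static int __init my_init(void)
      {
              my_cache = kmem_cache_create("my_obj", sizeof(struct my_obj), 0,
                                           SLAB_HWCACHE_ALIGN, my_ctor);
              return my_cache ? 0 : -ENOMEM;
      }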
     

25 Jul, 2008

1 commit

  • SLUB reuses two page bits for internal purposes; it overlays PG_active and
    PG_error. This is hidden away in slub.c. Document these overlays explicitly in
    the main page-flags enum along with all the others (see the sketch after this
    entry).

    Signed-off-by: Andy Whitcroft
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Matt Mackall
    Cc: Nick Piggin
    Tested-by: KOSAKI Motohiro
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
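
    A sketch of how such an overlay can be expressed in the page-flags enum follows;
    the enum is abbreviated and the alias names are illustrative rather than quoted
    from the patch:

      enum pageflags {
              PG_locked,
              PG_error,
              PG_referenced,
              PG_uptodate,
              PG_dirty,
              PG_lru,
              PG_active,
              __NR_PAGEFLAGS,

              /* SLUB overlays: aliases of bits that have no other meaning on slab pages */
              PG_slub_frozen = PG_active,     /* slab is frozen to a cpu slab          */
              PG_slub_debug  = PG_error,      /* debugging is enabled for this slab    */
      };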
     

19 Jul, 2008

1 commit

  • The limit of 128 bytes is too small when debugging slab corruption of the skb
    cache, for example. So increase the limit to PAGE_SIZE to make debugging
    corruptions easier.

    Acked-by: Ingo Molnar
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

11 Jul, 2008

1 commit

  • Vegard Nossum reported a crash in kmem_cache_alloc():

    BUG: unable to handle kernel paging request at da87d000
    IP: [] kmem_cache_alloc+0xc7/0xe0
    *pde = 28180163 *pte = 1a87d160
    Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
    EIP: 0060:[] EFLAGS: 00210203 CPU: 0
    EIP is at kmem_cache_alloc+0xc7/0xe0
    EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
    ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

    and analyzed it:

    "The register %ecx looks innocent but is very important here. The disassembly:

    mov %edx,%ecx
    shr $0x2,%ecx
    rep stos %eax,%es:(%edi)

    (%edx held 0x6b6b6b6b, the slab use-after-free poison, and 0x6b6b6b6b >> 2 ==
    0x1adadada.)

    %ecx is the counter for the memset, from here:

    memset(object, 0, c->objsize);

    i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
    Where did "c" come from? Uh-oh...

    c = get_cpu_slab(s, smp_processor_id());

    This looks like it has very much to do with CPU hotplug/unplug. Is
    there a race between SLUB/hotplug since the CPU slab is used after it
    has been freed?"

    Good analysis.

    Yeah, it's possible that a caller of kmem_cache_alloc() -> slab_alloc() can be
    migrated to another CPU right after local_irq_restore() and before memset(). The
    initial CPU can become offline in the meantime (or the migration is a consequence
    of the CPU going offline), so its 'kmem_cache_cpu' structure gets freed (see
    slab_cpuup_callback()).

    At some point in time the caller continues on another CPU holding an obsolete
    pointer (see the sketch after this entry).

    Signed-off-by: Dmitry Adamushko
    Reported-by: Vegard Nossum
    Acked-by: Ingo Molnar
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Dmitry Adamushko
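
    The window described above, paraphrased as a sketch (this is not the exact
    mm/slub.c code):

      local_irq_save(flags);
      c = get_cpu_slab(s, smp_processor_id());    /* per-cpu kmem_cache_cpu        */
      object = c->freelist;                       /* take object (simplified)      */
      /* ... advance the freelist ... */
      local_irq_restore(flags);
      /*
       * Preemption and migration are possible again here.  If the original CPU
       * goes offline, slab_cpuup_callback() frees its kmem_cache_cpu, so the
       * dereference below can hit freed (and poisoned) memory:
       */
      if (gfpflags & __GFP_ZERO)
              memset(object, 0, c->objsize);

    One way to close the window, presumably what the fix amounts to, is to read
    c->objsize into a local variable while interrupts are still disabled and use
    that size in the memset().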
     

05 Jul, 2008

1 commit

  • Remove all clameter@sgi.com addresses from the kernel tree since they will
    become invalid on June 27th. Change my maintainer email address for the
    slab allocators to cl@linux-foundation.org (which will be the new email
    address for the future).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Stephen Rothwell
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

04 Jul, 2008

1 commit

  • The 192 byte cache is not necessary if we have a basic alignment of 128 bytes.
    If it were used, then 192 would be aligned up to the next 128 byte boundary,
    resulting in another 256 byte cache. Two 256 byte kmalloc caches cause sysfs to
    complain about a duplicate entry.

    MIPS needs 128 byte aligned kmalloc caches and spits out warnings on boot without
    this patch (a small arithmetic check follows this entry).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
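
    A standalone arithmetic check of the rounding described above (ordinary userspace
    C, not kernel code):

      #include <stdio.h>

      #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

      int main(void)
      {
              /* With a 128 byte minimum alignment a 192 byte cache rounds up to 256, */
              /* colliding with the existing 256 byte kmalloc cache.                  */
              printf("ALIGN(192, 128) = %d\n", ALIGN(192, 128));      /* prints 256 */
              return 0;
      }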
     

23 May, 2008

1 commit

  • Add a WARN_ON for pages that have neither PageSlab nor PageCompound set, to catch
    the worst abusers of ksize() in the kernel (see the sketch after this entry).

    Acked-by: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Pekka Enberg

    Pekka Enberg
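
    A paraphrased sketch of the check (not the literal patch; the surrounding ksize()
    logic is elided):

      page = virt_to_head_page(object);
      if (unlikely(!PageSlab(page))) {
              /* kmalloc_large() hands out compound pages, so anything that is      */
              /* neither PageSlab nor PageCompound was never a ksize()-able object. */
              WARN_ON(!PageCompound(page));
              return PAGE_SIZE << compound_order(page);
      }
      /* ... normal slab object size lookup continues here ... */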
     

01 May, 2008

1 commit

  • x86 is currently the only arch that provides an optimized implementation of
    div_long_long_rem, and it has the downside that one has to be very careful that
    the divide doesn't overflow.

    The API is a little awkward, as the arguments for the unsigned divide are signed.
    The signed version also doesn't handle a negative divisor and produces worse code
    on 64-bit archs.

    There is little incentive to keep this API alive, so this converts the few users
    to the new API (see the usage sketch after this entry).

    Signed-off-by: Roman Zippel
    Cc: Ralf Baechle
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: john stultz
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
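
    The replacement API being referred to is, as far as I can tell, the div64 helper
    family in <linux/math64.h>; a minimal usage sketch, inside code that already has a
    nanosecond count ns and the usual time headers:

      #include <linux/math64.h>

      u32 rem;
      u64 sec = div_u64_rem(ns, NSEC_PER_SEC, &rem);   /* quotient; remainder in rem */

      /* Previously this would have been something like
       *     div_long_long_rem(ns, NSEC_PER_SEC, &rem);
       * with signed arguments and the overflow caveat described above.
       */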
     

30 Apr, 2008

1 commit

  • We can see an ever repeating problem pattern with objects of any kind in the
    kernel:

    1) freeing of active objects
    2) reinitialization of active objects

    Both problems can be hard to debug because the crash happens at a point where we
    have no chance to decode the root cause anymore. One problem spot is kernel
    timers, where the detection of the problem often happens in interrupt context and
    usually causes the machine to panic.

    While working on a timer related bug report I had to hack specialized code
    into the timer subsystem to get a reasonable hint for the root cause. This
    debug hack was fine for temporary use, but far from a mergeable solution due
    to the intrusiveness into the timer code.

    The code further lacked the ability to detect and report the root cause
    instantly and keep the system operational.

    Keeping the system operational is important to get hold of the debug
    information without special debugging aids like serial consoles and special
    knowledge of the bug reporter.

    The problems described above are not restricted to timers, but timers tend to
    expose it usually in a full system crash. Other objects are less explosive,
    but the symptoms caused by such mistakes can be even harder to debug.

    Instead of creating specialized debugging code for the timer subsystem a
    generic infrastructure is created which allows developers to verify their code
    and provides an easy to enable debug facility for users in case of trouble.

    The debugobjects core code keeps track of operations on static and dynamic
    objects by inserting them into a hashed list and sanity checking them on
    object operations and provides additional checks whenever kernel memory is
    freed.

    The tracked object operations are:
    - initializing an object
    - adding an object to a subsystem list
    - deleting an object from a subsystem list

    Each operation is sanity checked before it is executed, and the subsystem-specific
    code can provide a fixup function which makes it possible to prevent damage from
    the operation. When the sanity check triggers, a warning message and a stack
    trace are printed.

    The list of operations can be extended if the need arises. For now it's
    limited to the requirements of the first user (timers).

    The core code enqueues the objects into hash buckets. The hash index is generated
    from the address of the object to simplify the lookup for the check on
    kfree/vfree. Each bucket has its own spinlock to avoid contention on a global
    lock (see the sketch after this entry).

    The debug code can be compiled in without being active. The runtime overhead
    is minimal and could be optimized by asm alternatives. A kernel command line
    option enables the debugging code.

    Thanks to Ingo Molnar for review, suggestions and cleanup patches.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
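
    A sketch of the bucket lookup described above (constants and names are meant to be
    illustrative of lib/debugobjects.c rather than exact quotes):

      #define ODEBUG_HASH_BITS        14
      #define ODEBUG_HASH_SIZE        (1 << ODEBUG_HASH_BITS)

      struct debug_bucket {
              struct hlist_head       list;
              spinlock_t              lock;   /* per-bucket lock, no global contention */
      };

      static struct debug_bucket      obj_hash[ODEBUG_HASH_SIZE];

      static struct debug_bucket *get_bucket(unsigned long addr)
      {
              /* The index is derived from the object's address, so the checks on
               * kfree()/vfree() can cheaply find any tracked objects in that range. */
              return &obj_hash[hash_long(addr, ODEBUG_HASH_BITS)];
      }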
     

29 Apr, 2008

2 commits

  • This is a trivial patch that defines the priority of slab_memory_callback in the
    callback chain as a constant. This is to prepare for the next patch in the
    series.

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: pack objects denser
    slub: Calculate min_objects based on number of processors.
    slub: Drop DEFAULT_MAX_ORDER / DEFAULT_MIN_OBJECTS
    slub: Simplify any_slab_object checks
    slub: Make the order configurable for each slab cache
    slub: Drop fallback to page allocator method
    slub: Fallback to minimal order during slab page allocation
    slub: Update statistics handling for variable order slabs
    slub: Add kmem_cache_order_objects struct
    slub: for_each_object must be passed the number of objects in a slab
    slub: Store max number of objects in the page struct.
    slub: Dump list of objects not freed on kmem_cache_close()
    slub: free_list() cleanup
    slub: improve kmem_cache_destroy() error message
    slob: fix bug - when slob allocates "struct kmem_cache", it does not force alignment.

    Linus Torvalds
     

28 Apr, 2008

4 commits

  • Not all architectures define cache_line_size(), so as suggested by Andrew, move
    the private implementations in mm/slab.c and mm/slob.c to a common header.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Reviewed-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Filtering zonelists requires very frequent use of zone_idx(). This is costly, as
    it involves a lookup of another structure and a subtraction operation. As the
    zone_idx is often required, it should be quickly accessible. The node idx could
    also be stored here if it were found that accessing zone->node is significant,
    which may be the case on workloads where nodemasks are heavily used.

    This patch introduces a struct zoneref to store a zone pointer and a zone index.
    The zonelist then consists of an array of these struct zonerefs which are looked
    up as necessary. Helpers are given for accessing the zone index as well as the
    node index (see the sketch after this entry).

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
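
    The struct described above amounts to caching the zone index next to the zone
    pointer; a sketch (comments mine):

      struct zoneref {
              struct zone *zone;      /* the zone itself                              */
              int zone_idx;           /* cached zone_idx(zone), avoids the recompute  */
      };

      static inline int zonelist_zone_idx(struct zoneref *zoneref)
      {
              return zoneref->zone_idx;       /* no pointer chase, no subtraction */
      }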
     
  • Currently a node has two sets of zonelists, one for each zone type in the
    system and a second set for GFP_THISNODE allocations. Based on the zones
    allowed by a gfp mask, one of these zonelists is selected. All of these
    zonelists consume memory and occupy cache lines.

    This patch replaces the multiple zonelists per-node with two zonelists. The
    first contains all populated zones in the system, ordered by distance, for
    fallback allocations when the target/preferred node has no free pages. The
    second contains all populated zones in the node suitable for GFP_THISNODE
    allocations.

    An iterator macro called for_each_zone_zonelist() is introduced that iterates
    through each zone allowed by the GFP flags in the selected zonelist (a usage
    sketch follows this entry).

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
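
    A hedged usage sketch of the iterator (argument order as I understand it; check
    include/linux/mmzone.h for the authoritative definition):

      struct zonelist *zonelist = node_zonelist(numa_node_id(), gfp_mask);
      struct zoneref *z;
      struct zone *zone;

      for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
              /* 'zone' visits every populated zone allowed by gfp_mask, in the
               * fallback order stored in the zonelist.                          */
      }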
     
  • Introduce a node_zonelist() helper function. It is used to look up the
    appropriate zonelist given a node and a GFP mask. The patch on its own is a
    cleanup, but it helps clarify parts of the two-zonelist-per-node patchset. If
    necessary, it can be merged with the next patch in this set without problems
    (see the sketch after this entry).

    Reviewed-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
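
    Roughly what such a helper looks like (sketch; the exact index expression is an
    assumption on my part):

      static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
      {
              /* Pick the zonelist of node 'nid' that matches this GFP mask. */
              return NODE_DATA(nid)->node_zonelists + gfp_zone(flags);
      }

    Callers then simply write zonelist = node_zonelist(nid, GFP_KERNEL) instead of
    open-coding the lookup.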
     

27 Apr, 2008

14 commits

  • Since we now have more orders available, use a denser packing: increase the slab
    order if more than 1/16th of a slab would otherwise be wasted (a standalone
    illustration follows this entry).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
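
    A standalone illustration of the 1/16 rule (plain userspace C, not the mm/slub.c
    order calculation itself):

      #include <stdio.h>

      int main(void)
      {
              unsigned long object_size = 700, page_size = 4096;

              for (int order = 0; order < 4; order++) {
                      unsigned long slab_size = page_size << order;
                      unsigned long waste = slab_size % object_size;

                      printf("order %d: %lu objects, %lu bytes wasted (%s)\n",
                             order, slab_size / object_size, waste,
                             waste * 16 > slab_size ? "> 1/16, try higher order" : "ok");
              }
              return 0;
      }

    For a 700 byte object this rejects order 0 (596 of 4096 bytes wasted) but accepts
    order 1 (492 of 8192 bytes wasted).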
     
  • The minimum number of objects per slab is calculated based on the number of
    processors that may come online:

    Processors   min_objects
    -------------------------
             1             8
             2            12
             4            16
             8            20
            16            24
            32            28
            64            32
          1024            48
          4096            56

    The higher the number of processors, the larger the order sizes used for various
    slab caches will become. This has been shown to address the performance issues in
    hackbench on 16p etc.

    The calculation is only performed if slub_min_objects is zero (default). If one
    specifies a slub_min_objects on boot then that setting is taken.

    As suggested by Zhang Yanmin's performance tests on 16-core Tigerton, use the
    formula '4 * (fls(nr_cpu_ids) + 1)' (a standalone check of the formula against
    the table above follows this entry):

    ./hackbench 100 process 2000:

    1) 2.6.25-rc6slab: 23.5 seconds
    2) 2.6.25-rc7SLUB+slub_min_objects=20: 31 seconds
    3) 2.6.25-rc7SLUB+slub_min_objects=24: 23.5 seconds

    Signed-off-by: Christoph Lameter
    Signed-off-by: Zhang Yanmin
    Signed-off-by: Pekka Enberg

    Christoph Lameter
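
    A standalone check of the formula against the table above (fls() is open-coded
    here with the kernel's 1-based semantics):

      #include <stdio.h>

      static int fls(unsigned int x)
      {
              int r = 0;

              while (x) {
                      x >>= 1;
                      r++;
              }
              return r;
      }

      int main(void)
      {
              unsigned int cpus[] = { 1, 2, 4, 8, 16, 32, 64, 1024, 4096 };

              for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
                      printf("%5u cpus -> min_objects = %d\n",
                             cpus[i], 4 * (fls(cpus[i]) + 1));
              return 0;
      }

    The output reproduces the table: 8 for one processor up to 56 for 4096.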
     
  • We can now fallback to order 0 slabs. So set the slub_max_order to
    PAGE_CACHE_ORDER_COSTLY but keep the slub_min_objects at 4. This
    will mostly preserve the orders used in 2.6.25. F.e. The 2k kmalloc slab
    will use order 1 allocs and the 4k kmalloc slab order 2.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Since we now have a total_objects counter per node, use that to check for the
    presence of any objects. The loop over all cpu slabs is not that useful, since
    any cpu slab would require an object allocation first. So drop that.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Makes /sys/kernel/slab/<cache>/order writable. The allocation order of a slab
    cache can then be changed dynamically at runtime. This can be used to override
    the objects-per-slab value established by the slub_min_objects setting that was
    manually specified or calculated on bootup.

    A change of the slab order can occur while allocate_slab() runs. allocate_slab()
    needs both the order and the number of slab objects, and both change when the
    order changes. They are therefore put into a single word
    (struct kmem_cache_order_objects) so that they can be atomically updated and
    retrieved.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • There is now a generic method of falling back to a slab page of minimal
    order. No need anymore for the fallback to kmalloc_large().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • If any higher order allocation fails, then fall back to the smallest order
    necessary to contain at least one object. This enables fallback to order 0 pages
    for all allocations. The fallback will waste more memory (objects will not fit as
    neatly) and the fallback slabs will not be as efficient as larger slabs since
    they contain fewer objects.

    Note that SLAB also depends on order 1 allocations for some slabs that waste too
    much memory if forced into PAGE_SIZE'd pages. SLUB can now deal with failing
    order 1 allocs, which SLAB cannot do.

    Add a new field, min, that will contain the number of objects for the smallest
    possible order of a slab cache.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Change the statistics to consider that slabs of the same slabcache can have
    different numbers of objects in them since they may be of different order.

    Provide a new sysfs field

    total_objects

    which shows the total objects that the allocated slabs of a slabcache
    could hold.

    Add a max field that holds the largest slab order that was ever used
    for a slab cache.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Pack the order and the number of objects into a single word. This saves some
    memory in the kmem_cache structure and, more importantly, allows us to fetch both
    values atomically.

    Later the slab orders become runtime configurable and we need to fetch these two
    items together in order to properly allocate a slab and initialize its objects.

    Fix the race by fetching the order and the number of objects in one word (see the
    sketch after this entry).

    [penberg@cs.helsinki.fi: fix memset() page order in new_slab()]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
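
    A sketch of the packing (the bit split shown here is an assumption; the real
    helpers live in mm/slub.c):

      struct kmem_cache_order_objects {
              unsigned long x;        /* order in the high bits, object count below */
      };

      #define OO_SHIFT        16
      #define OO_MASK         ((1UL << OO_SHIFT) - 1)

      static inline struct kmem_cache_order_objects oo_make(int order, int objects)
      {
              struct kmem_cache_order_objects x = { (order << OO_SHIFT) + objects };

              return x;
      }

      static inline int oo_order(struct kmem_cache_order_objects x)
      {
              return x.x >> OO_SHIFT;
      }

      static inline int oo_objects(struct kmem_cache_order_objects x)
      {
              return x.x & OO_MASK;
      }

    Because both values live in one word, a reader always sees a consistent
    (order, objects) pair even while the order is being changed through sysfs.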
     
  • Pass the number of objects to the for_each_object macro. Most of these are
    debug related.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Split the inuse field up to be able to store the number of objects in this
    page in the page struct as well. Necessary if we want to have pages of
    various orders for a slab. Also avoids touching struct kmem_cache cachelines in
    __slab_alloc().

    Update diagnostic code to check the number of objects and make sure that
    the number of objects always stays within the bounds of a 16 bit unsigned
    integer.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Dump a list of unfreed objects if a slab cache is closed but
    objects still remain.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • free_list looked a bit screwy so here is an attempt to clean it up.

    free_list is only used for freeing partial lists. We do not need to return a
    parameter if we decrement nr_partial within the function, which allows a
    simplification of the whole thing.

    The current version modifies nr_partial outside of the list_lock which is
    technically not correct. It was only ok because we should be the only user of
    this slab cache at this point.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • As pointed out by Ingo, the SLUB warning of calling kmem_cache_destroy() with a
    cache that still has objects triggers in practice. So turn this WARN_ON() into a
    nice SLUB-specific error message to avoid people confusing it with a SLUB bug.

    Cc: Ingo Molnar
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

14 Apr, 2008

1 commit

  • The per-node counters are used mainly for showing data through the sysfs API. If
    that API is not compiled in, then there is no point in keeping track of this
    data. Disable the counters for the number of slabs and the number of total slabs
    if !SLUB_DEBUG. Incrementing the per-node counters also accesses a potentially
    contended cacheline, so this could actually be a performance benefit for embedded
    systems.

    SLABINFO support is also affected. It now must depend on SLUB_DEBUG (which is on
    by default).

    The patch also avoids a check for a NULL kmem_cache_node pointer in new_slab() if
    the system is not compiled with NUMA support (see the sketch after this entry).

    [penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
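
    A paraphrased sketch of the resulting pattern (helper name and body are my
    wording, not the literal patch):

      #ifdef CONFIG_SLUB_DEBUG
      static inline void inc_slabs_node(struct kmem_cache *s, int node)
      {
              struct kmem_cache_node *n = get_node(s, node);

              /* n can still be NULL very early during bootstrap of a NUMA kernel,
               * while the kmem_cache_node cache itself is being set up.            */
              if (n)
                      atomic_long_inc(&n->nr_slabs);
      }
      #else
      static inline void inc_slabs_node(struct kmem_cache *s, int node) {}
      #endif

    With !CONFIG_SLUB_DEBUG the call compiles away entirely, so the potentially
    contended per-node cacheline is never touched.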