08 May, 2007

12 commits

  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code that manipulates
    the object.

    Also, the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there were code in a constructor
    handling SLAB_DEBUG_INITIAL, it would have to be conditional on SLAB
    debugging; otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLAB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand, and there are easier ways to accomplish
    the same effect (i.e. add debug code before kfree).
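
    For illustration, a minimal sketch of that manual alternative; the object
    type, its state field and the FOO_IDLE value are made up, not taken from
    any kernel code:

    #include <linux/slab.h>
    #include <linux/bug.h>

    enum foo_state { FOO_IDLE, FOO_ACTIVE };

    struct foo {
            enum foo_state state;   /* FOO_IDLE is the constructor state */
    };

    static void foo_free(struct foo *f)
    {
            /* Verify the constructor state right before the free instead
             * of relying on a SLAB_DEBUG_INITIAL callback. */
            BUG_ON(f->state != FOO_IDLE);
            kfree(f);
    }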

    There is a related flag, SLAB_CTOR_VERIFY, that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without the removal of SLAB_DEBUG_INITIAL) from the fs
    constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently failslab injects failures into ____cache_alloc(), but with
    CONFIG_NUMA enabled that is not enough to make the actual slab allocator
    entry points (kmalloc, kmem_cache_alloc, ...) return NULL.

    This patch moves the fault injection hook into __cache_alloc() and
    __cache_alloc_node(). These sit closer to the allocator entry points than
    ____cache_alloc(), so failures can be injected into the slab allocators
    even with CONFIG_NUMA.
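
    A sketch of where the hook now sits, written as if it lived inside
    mm/slab.c; my_cache_alloc() stands in for the real entry points and the
    helper names are only assumed to match the patch:

    static void *my_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
    {
            /* Inject the failure at the level every kmalloc() and
             * kmem_cache_alloc() call passes through, so the NULL is
             * visible to callers on NUMA builds as well. */
            if (should_failslab(cachep, flags))
                    return NULL;

            return ____cache_alloc(cachep, flags);
    }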

    Acked-by: Pekka Enberg
    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patch was recently posted to lkml and acked by Pekka.

    The flag SLAB_MUST_HWCACHE_ALIGN is

    1. Never checked by SLAB at all.

    2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB

    3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.

    The only remaining uses are in sparc64 and ppc64, and they reflect some
    earlier role that the slab flag may once have had. Wherever it is
    specified, SLAB_HWCACHE_ALIGN is also specified.

    The flag is confusing, inconsistent and has no purpose.

    Remove it.

    Acked-by: Pekka Enberg
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Matthias Kaehlcke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    matze
     
  • Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If we add a new flag so that we can distinguish between the first page and
    the tail pages, then we can avoid using page->private in the first page.
    For the first page, page->private == page, so there is no real information
    in there.

    Freeing up page->private makes the use of compound pages more transparent:
    they behave more like ordinary pages. Right now we have to be careful, e.g.,
    if we are going beyond PAGE_SIZE allocations in the slab on i386, because we
    can then no longer use the private field. This is one of the issues that
    keeps us from supporting debugging for page-size slabs in SLAB.

    Having page->private available for SLUB would allow more meta information in
    the page struct. I can probably avoid the 16 bit ints that I have in there
    right now.

    Also if page->private is available then a compound page may be equipped with
    buffer heads. This may free up the way for filesystems to support larger
    blocks than page size.

    We add PageTail as an alias of PageReclaim. Compound pages cannot currently
    be reclaimed. Because of the alias one needs to check PageCompound first.

    The RFC for this approach was discussed at
    http://marc.info/?t=117574302800001&r=1&w=2
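
    A minimal sketch of the check order this implies; my_page_is_tail() is a
    hypothetical helper:

    #include <linux/mm.h>

    static inline int my_page_is_tail(struct page *page)
    {
            /* PageTail aliases PageReclaim, so the bit only means "tail"
             * once we know the page is part of a compound page. */
            return PageCompound(page) && PageTail(page);
    }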

    [nacc@us.ibm.com: fix hugetlbfs]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It is only ever used prior to free_initmem().

    (It will cause a warning when we run the section checking, but that is a
    false positive, and it simply changes the source of an existing warning,
    which is also a false positive.)

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer
    possible nodes. This patch dynamically sizes the 'struct kmem_cache' to
    allocate only needed space.

    I moved the nodelists[] field to the end of struct kmem_cache and use the
    following computation in kmem_cache_init():

    cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +
    nr_node_ids * sizeof(struct kmem_list3 *);

    On my two-node x86_64 machine, kmem_cache.obj_size is now 192 instead of 704
    (this is because on x86_64, MAX_NUMNODES is 64).

    On bigger NUMA setups, this might reduce the gfporder of "cache_cache"
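
    The same trick in generic, standalone C; the struct and node count are
    illustrative, not the kernel's kmem_cache:

    #include <stddef.h>
    #include <stdio.h>

    #define MAX_NODES 64                    /* compile-time worst case */

    struct cache {
            unsigned int obj_size;
            /* must stay last: only nr_node_ids entries are allocated */
            void *nodelists[MAX_NODES];
    };

    int main(void)
    {
            int nr_node_ids = 2;            /* nodes present at boot */
            size_t sz = offsetof(struct cache, nodelists) +
                        nr_node_ids * sizeof(void *);

            printf("full: %zu bytes, trimmed: %zu bytes\n",
                   sizeof(struct cache), sz);
            return 0;
    }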

    Signed-off-by: Eric Dumazet
    Cc: Pekka Enberg
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • We can avoid allocating empty shared caches and avoid an unnecessary check
    of cache->limit. We save some memory and avoid bringing unnecessary cache
    lines into the CPU cache.

    All accesses to l3->shared are already checking NULL pointers so this patch is
    safe.

    Signed-off-by: Eric Dumazet
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • The existing comment in mm/slab.c is *perfect*, so I reproduce it:

    /*
    * CPU bound tasks (e.g. network routing) can exhibit cpu bound
    * allocation behaviour: Most allocs on one cpu, most free operations
    * on another cpu. For these cases, an efficient object passing between
    * cpus is necessary. This is provided by a shared array. The array
    * replaces Bonwick's magazine layer.
    * On uniprocessor, it's functionally equivalent (but less efficient)
    * to a larger limit. Thus disabled by default.
    */

    As most shipped Linux kernels are now compiled with CONFIG_SMP, a
    preprocessor #if cannot detect whether the machine is UP or SMP. Better to
    use num_possible_cpus().

    This means that on UP we allocate a 'size=0' shared array, to be more
    efficient.

    Another patch can later avoid the allocations of 'empty shared arrays', to
    save some memory.
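
    A sketch of the runtime decision that replaces the #ifdef; the sizing
    number is illustrative, not the exact slab heuristic:

    #include <linux/cpumask.h>

    static int shared_array_size(void)
    {
            /* An SMP-built kernel booted on a UP machine now gets a
             * size-0 shared array instead of a compile-time guess. */
            if (num_possible_cpus() == 1)
                    return 0;
            return 8;                       /* illustrative SMP default */
    }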

    Signed-off-by: Eric Dumazet
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • If slab->inuse is corrupted, cache_alloc_refill() can enter an infinite
    loop, as detailed by Michael Richardson in a mailing list post. This adds
    a BUG_ON to catch those cases.

    Cc: Michael Richardson
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • This introduces krealloc(), which reallocates memory while keeping the
    contents unchanged. The allocator avoids reallocation if the new size fits
    within the currently used cache. I also added a simple non-optimized version
    to mm/slob.c for compatibility.
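
    A usage sketch; struct buf and its fields are hypothetical:

    #include <linux/slab.h>

    struct buf {
            char *data;
            size_t len;
    };

    static int buf_resize(struct buf *b, size_t new_len)
    {
            /* Contents are preserved; if new_len still fits the cache
             * currently backing b->data, no copy is performed. */
            char *p = krealloc(b->data, new_len, GFP_KERNEL);

            if (!p)
                    return -ENOMEM;         /* b->data is left untouched */
            b->data = p;
            b->len = new_len;
            return 0;
    }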

    [akpm@linux-foundation.org: fix warnings]
    Acked-by: Josef Sipek
    Acked-by: Matt Mackall
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

03 May, 2007

1 commit

  • Set use_alien_caches to 0 on non-NUMA platforms and avoid calling
    cache_free_alien() when use_alien_caches is not set. This avoids the
    cache miss that happens while dereferencing slabp to get the nodeid.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Andi Kleen
    Cc: Andi Kleen
    Cc: Eric Dumazet
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton

    Siddha, Suresh B
     

04 Apr, 2007

1 commit


02 Mar, 2007

1 commit


21 Feb, 2007

1 commit

  • The alien cache is a per-cpu, per-node array allocated for every slab on
    the system. Currently we size this array for all nodes that the kernel
    could support. For IA64 this is 1024 nodes, so we allocate an array with
    1024 objects even if we only boot a system with 4 nodes.

    This patch uses "nr_node_ids" to determine the number of possible nodes
    supported by a hardware configuration and only allocates an alien cache
    sized for possible nodes.

    The initialization of nr_node_ids occurred too late relative to the
    bootstrap of the slab allocator, so I moved setup_nr_node_ids() into
    free_area_init_nodes().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Feb, 2007

6 commits

  • A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
    source files, including:

    * make multi-line initial descriptions single line
    * denote some function names, constants and structs as such
    * change erroneous opening '/*' to '/**' in a few places
    * reword some text for clarity

    Signed-off-by: Robert P. J. Day
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • kmem_cache_free() was missing the check for freeing held locks.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions following the example
    for ZONE_DMA32 and ZONE_HIGHMEM.

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Use the pointer passed to cache_reap to determine the work pointer and
    consolidate exit paths.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Clean up __cache_alloc and __cache_alloc_node functions a bit. We no
    longer need to do NUMA_BUILD tricks and the UMA allocation path is much
    simpler. No functional changes in this patch.

    Note: saves a few kernel text bytes on an x86 NUMA build due to using gotos
    in __cache_alloc_node() and moving the __GFP_THISNODE check into
    fallback_alloc().

    Cc: Andy Whitcroft
    Cc: Christoph Hellwig
    Cc: Manfred Spraul
    Acked-by: Christoph Lameter
    Cc: Paul Jackson
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • The PageSlab debug check in kfree_debugcheck() is broken for compound
    pages. It is also redundant as we already do BUG_ON for non-slab pages in
    page_get_cache() and page_get_slab() which are always called before we free
    any actual objects.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

06 Jan, 2007

1 commit

  • pdflush hit the BUG_ON(!PageSlab(page)) in kmem_freepages called from
    fallback_alloc: cache_grow already freed those pages when alloc_slabmgmt
    failed. But it wouldn't have freed them if __GFP_NO_GROW, so make sure
    fallback_alloc doesn't waste its time on that case.

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Acked-by: Pekka J Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Dec, 2006

2 commits


14 Dec, 2006

4 commits

  • When some objects are allocated by one CPU but freed by another CPU, we can
    consume a lot of cycles doing divides in obj_to_index().

    (Typical load on a dual processor machine where network interrupts are
    handled by one particular CPU (allocating skbufs), and the other CPU is
    running the application (consuming and freeing skbufs))

    Here, on one production server (dual-core AMD Opteron 285), I noticed this
    divide took 1.20% of CPU_CLK_UNHALTED events in the kernel. But Opterons are
    quite modern CPUs and the divide is much more expensive on older
    architectures:

    On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1
    cycle for a multiply.

    Doing some math, we can use a reciprocal multiplication instead of a divide.

    If we want to compute V = (A / B) (A and B being u32 quantities)
    we can instead use :

    V = ((u64)A * RECIPROCAL(B)) >> 32 ;

    where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

    Note:

    I wrote pure C code for clarity. The gcc output for i386 is not optimal but
    acceptable:

    mull 0x14(%ebx)
    mov %edx,%eax // part of the >> 32
    xor %edx,%edx // useless
    mov %eax,(%esp) // could be avoided
    mov %edx,0x4(%esp) // useless
    mov (%esp),%ebx
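
    For reference, a standalone userspace rendering of the same formula (not
    the kernel helpers themselves):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t reciprocal_value(uint32_t b)
    {
            /* RECIPROCAL(B) = ((1LL << 32) + (B - 1)) / B */
            return (uint32_t)((((uint64_t)1 << 32) + b - 1) / b);
    }

    static uint32_t reciprocal_divide(uint32_t a, uint32_t r)
    {
            /* V = ((u64)A * RECIPROCAL(B)) >> 32 */
            return (uint32_t)(((uint64_t)a * r) >> 32);
    }

    int main(void)
    {
            uint32_t r = reciprocal_value(192);     /* e.g. object size */

            /* 4096 / 192 == 21, computed with a multiply and a shift */
            printf("%u\n", reciprocal_divide(4096, r));
            return 0;
    }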

    [akpm@osdl.org: small cleanups]
    Signed-off-by: Eric Dumazet
    Cc: Christoph Lameter
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Elaborate the API for calling cpuset_zone_allowed(), so that users have to
    explicitly choose between the two variants:

    cpuset_zone_allowed_hardwall()
    cpuset_zone_allowed_softwall()

    Until now, whether or not you got the hardwall flavor depended solely on
    whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
    argument.

    If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
    version.

    Unfortunately, this meant that users would end up with the softwall version
    without thinking about it. Since only the softwall version might sleep, this
    led to bugs with possible sleeping in interrupt context on more than one
    occasion.

    The hardwall version requires that the current task's mems_allowed allows
    the node of the specified zone (or that you're in interrupt, that
    __GFP_THISNODE is set, or that you're on a one-cpuset system).

    The softwall version, depending on the gfp_mask, might allow a node if it
    was allowed in the nearest enclosing cpuset marked mem_exclusive (which
    requires taking the cpuset lock 'callback_mutex' to evaluate).

    This patch removes the cpuset_zone_allowed() call, and forces the caller to
    explicitly choose between the hardwall and the softwall case.

    If the caller wants the gfp_mask to determine this choice, they should (1)
    be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
    cpuset_zone_allowed_softwall() routine.

    This adds another 100 or 200 bytes to the kernel text space, due to the few
    lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
    routines. It should save a few instructions executed for the calls that
    turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
    set (before the call) then check (within the call) the __GFP_HARDWALL flag.

    For the most critical call, from get_page_from_freelist(), the same
    instructions are executed as before -- the old cpuset_zone_allowed()
    routine it used to call is the same code as the
    cpuset_zone_allowed_softwall() routine that it calls now.

    Not a perfect win, but it seems worth it, to reduce the chance of hitting a
    'sleeping with irqs off' complaint again.
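
    A sketch of how a caller now states its intent explicitly; the wrapper
    names are illustrative:

    #include <linux/cpuset.h>
    #include <linux/mmzone.h>
    #include <linux/gfp.h>

    static int zone_ok_atomic(struct zone *z, gfp_t gfp)
    {
            /* Caller cannot sleep: only the hardwall check, which looks
             * at current->mems_allowed, is safe here. */
            return cpuset_zone_allowed_hardwall(z, gfp);
    }

    static int zone_ok_may_sleep(struct zone *z, gfp_t gfp)
    {
            /* Caller may sleep (or passes __GFP_HARDWALL itself): the
             * softwall check may take callback_mutex. */
            return cpuset_zone_allowed_softwall(z, gfp);
    }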

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • More cleanups for slab.h

    1. Remove tabs from weird locations as suggested by Pekka

    2. Drop the check for NUMA and SLAB_DEBUG from the fallback section
    as suggested by Pekka.

    3. Use static inline for the fallback defs, as also suggested by Pekka.

    4. Make kmem_ptr_valid take a const * argument.

    5. Separate the NUMA fallback definitions from the kmalloc_track fallback
    definitions.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • fallback_alloc() does not do the check for __GFP_WAIT that cache_grow()
    does. Thus interrupts remain disabled when we call kmem_getpages(), which
    results in the failure.

    Duplicate the handling of GFP_WAIT in cache_grow().
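
    A simplified sketch of the duplicated handling, written as if it lived
    inside mm/slab.c next to fallback_alloc(); grow_any_node() is a made-up
    wrapper, not the literal diff:

    static void *grow_any_node(struct kmem_cache *cachep, gfp_t flags)
    {
            void *obj;

            /* fallback_alloc() runs with interrupts disabled; re-enable
             * them while the allocation may sleep, exactly as
             * cache_grow() does around its kmem_getpages() call. */
            if (flags & __GFP_WAIT)
                    local_irq_enable();
            obj = kmem_getpages(cachep, flags, -1);
            if (flags & __GFP_WAIT)
                    local_irq_disable();

            return obj;
    }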

    Signed-off-by: Christoph Lameter
    Cc: Jay Cliburn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

11 Dec, 2006

1 commit

  • This patch introduces users of the round_jiffies() function in the slab code.

    The slab code has a few "run every second" timers for background work; these
    are obviously not timing critical as long as they happen roughly at the right
    frequency.
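
    A sketch of the pattern; the work item and the 2*HZ period are
    illustrative:

    #include <linux/workqueue.h>
    #include <linux/timer.h>

    static void arm_reap_timer(struct delayed_work *reap_work)
    {
            /* Background reaping only needs to run roughly every couple
             * of seconds, so let the timer coalesce with other
             * second-aligned timers instead of firing at a random offset. */
            schedule_delayed_work(reap_work, round_jiffies_relative(2 * HZ));
    }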

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

09 Dec, 2006

3 commits

  • Assign defaults most likely to please a new user:

    1) generate some logging output (verbose=2)

    2) avoid injecting failures likely to lock up the UI (ignore_gfp_wait=1,
    ignore_gfp_highmem=1)

    Signed-off-by: Don Mullis
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Don Mullis
     
  • This patch provides fault-injection capability for kmalloc.

    Boot option:

    failslab=<interval>,<probability>,<space>,<times>

    <interval> -- specifies the interval of failures.

    <probability> -- specifies how often it should fail in percent.

    <space> -- specifies the size of free space where memory can be
    allocated safely in bytes.

    <times> -- specifies how many times failures may happen at most.

    Debugfs:

    /debug/failslab/interval
    /debug/failslab/probability
    /debug/failslab/space
    /debug/failslab/times
    /debug/failslab/ignore-gfp-highmem
    /debug/failslab/ignore-gfp-wait

    Example:

    failslab=10,100,0,-1

    slab allocation (kmalloc(), kmem_cache_alloc(),..) fails once per 10 times.

    Cc: Pekka Enberg
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • fallback_alloc() could end up calling cpuset_zone_allowed() with interrupts
    disabled (by code in kmem_cache_alloc_node()), but without __GFP_HARDWALL
    set, leading to a possible call of a sleeping function with interrupts
    disabled.

    This results in the BUG report:

    BUG: sleeping function called from invalid context at kernel/cpuset.c:1520
    in_atomic():0, irqs_disabled():1

    Thanks to Paul Menage for catching this one.

    Signed-off-by: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

08 Dec, 2006

7 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Currently we simply attempt to allocate from all allowed nodes using
    GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it won't do any at
    all if the recent GFP_THISNODE patch is accepted). If we truly run out of
    memory in the whole system, then fallback_alloc may return NULL although
    memory may still be available if we performed more thorough reclaim.

    This patch changes fallback_alloc() so that we first only inspect all the
    per node queues for available slabs. If we find any then we allocate from
    those. This avoids slab fragmentation by first getting rid of all partially
    allocated slabs on every node before allocating new memory.

    If we cannot satisfy the allocation from any per node queue then we extend
    a slab. We now call into the page allocator without specifying
    GFP_THISNODE. The page allocator will then implement its own fallback (in
    the given cpuset context), perform necessary reclaim (again considering not
    a single node but the whole set of allowed nodes) and then return pages for
    a new slab.

    We identify from which node the pages were allocated and then insert the
    pages into the corresponding per node structure. In order to do so we need
    to modify cache_grow() to take a parameter that specifies the new slab.
    kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
    able to use kmem_getpages() to allocate from an arbitrary node. GFP_THISNODE
    needs to be specified when calling cache_grow().

    One key advantage is that the decision from which node to allocate new
    memory is removed from slab fallback processing. The patch allows us to go
    back to using the page allocator's fallback/reclaim logic.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This addresses two issues:

    1. kmalloc_node() may intermittently return NULL if we are allocating
    from the current node and are unable to obtain memory for the current
    node from the page allocator. This is because we call ___cache_alloc()
    if nodeid == numa_node_id() and ____cache_alloc is not able to fall back
    to other nodes.

    This was introduced in the 2.6.19 development cycle.
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_DMA is an alias of GFP_DMA. This is the last such alias, so we
    remove the leftover comment too.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_LEVEL_MASK is only used internally to the slab allocator and is an
    alias of GFP_LEVEL_MASK.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter