08 Dec, 2006

11 commits

  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Currently we simply attempt to allocate from all allowed nodes using
    GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it won't do any at
    all if the recent GFP_THISNODE patch is accepted). If we truly run out of
    memory in the whole system then fallback_alloc may return NULL although
    memory may still be available if we performed more thorough reclaim.

    This patch changes fallback_alloc() so that we first only inspect all the
    per node queues for available slabs. If we find any then we allocate from
    those. This avoids slab fragmentation by first getting rid of all partially
    allocated slabs on every node before allocating new memory.

    If we cannot satisfy the allocation from any per node queue then we extend
    a slab. We now call into the page allocator without specifying
    GFP_THISNODE. The page allocator will then implement its own fallback (in
    the given cpuset context), perform necessary reclaim (again considering not
    a single node but the whole set of allowed nodes) and then return pages for
    a new slab.

    We identify from which node the pages were allocated and then insert the
    pages into the corresponding per node structure. In order to do so we need
    to modify cache_grow() to take a parameter that specifies the new slab.
    kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
    able to use kmem_getpages() to allocate from an arbitrary node. GFP_THISNODE
    needs to be specified when calling cache_grow().

    One key advantage is that the decision from which node to allocate new
    memory is removed from slab fallback processing. The patch allows us to go
    back to using the page allocator's fallback/reclaim logic.
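
    A rough sketch of the resulting two-pass logic (illustrative only, not the
    actual mm/slab.c code; for_each_allowed_node(), node_has_queued_slabs(),
    get_object_from_node() and grow_new_slab() are placeholder helpers):

        static void *fallback_alloc_sketch(struct kmem_cache *cachep, gfp_t flags)
        {
                void *obj;
                int nid;

                /* Pass 1: use up partially allocated slabs on every allowed node. */
                for_each_allowed_node(nid, flags) {
                        if (node_has_queued_slabs(cachep, nid)) {
                                obj = get_object_from_node(cachep,
                                                           flags | GFP_THISNODE, nid);
                                if (obj)
                                        return obj;
                        }
                }

                /*
                 * Pass 2: call the page allocator *without* GFP_THISNODE so it
                 * applies its own fallback and reclaim (within the cpuset), then
                 * grow a slab on whichever node the pages actually came from.
                 */
                return grow_new_slab(cachep, flags);
        }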

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This addresses two issues:

    1. kmalloc_node() may intermittently return NULL if we are allocating
    from the current node and are unable to obtain memory for the current
    node from the page allocator. This is because we call ___cache_alloc()
    if nodeid == numa_node_id() and ____cache_alloc is not able to fall back
    to other nodes.

    This was introduced in the 2.6.19 development cycle.
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_DMA is an alias of GFP_DMA. This is the last such alias, so we
    remove the leftover comment too.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.
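
    In practice the conversion is a mechanical substitution at call sites,
    e.g. (hypothetical caller):

        /* before: legacy slab alias */
        ptr = kmem_cache_alloc(cachep, SLAB_KERNEL);

        /* after: pass the gfp flag directly */
        ptr = kmem_cache_alloc(cachep, GFP_KERNEL);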

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_LEVEL_MASK is only used internally to the slab allocator and is
    an alias of GFP_LEVEL_MASK.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It is only used internally in the slab allocator.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
    the caller. This is used for subsystem-specific allocators like skb_alloc.

    To make skb_alloc node-aware we need similar routines for the node-aware slab
    allocator, which this patch adds.

    Note that the code is rather ugly, but it mirrors the non-node-aware code 1:1.
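
    The new node-aware variants mirror the existing kmalloc_track_caller()
    pattern, roughly as follows (treat the exact signature as illustrative):

        /* Record the *caller's* return address instead of the allocator's,
         * so leak reports point at the subsystem-specific wrapper. */
        #define kmalloc_node_track_caller(size, flags, node) \
                __kmalloc_node_track_caller(size, flags, node, \
                                            __builtin_return_address(0))

        extern void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
                                                 int node, void *caller);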

    [akpm@osdl.org: add module export]
    Signed-off-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • When using numa=fake on non-NUMA hardware there is no benefit to having the
    alien caches, and they consume a lot of memory.

    Add a kernel boot option to disable them.

    Christoph sayeth "This is good to have even on large NUMA. The problem is
    that the alien caches grow by the square of the size of the system in terms of
    nodes."

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Here's an attempt towards doing away with lock_cpu_hotplug in the slab
    subsystem. This approach also fixes a bug which shows up when cpus are
    being offlined/onlined and slab caches are being tuned simultaneously.

    http://marc.theaimsgroup.com/?l=linux-kernel&m=116098888100481&w=2

    The patch has been stress tested overnight on a 2 socket 4 core AMD box with
    repeated cpu online and offline, while dbench and kernbench processes ran and
    slab caches were being tuned at the same time.
    There were no lockdep warnings either. (This test was on 2.6.18, as 2.6.19-rc
    crashes at __drain_pages:
    http://marc.theaimsgroup.com/?l=linux-kernel&m=116172164217678&w=2 )

    The approach here is to hold cache_chain_mutex from CPU_UP_PREPARE until
    CPU_ONLINE (similar in approach to workqueue_mutex). Slab code sensitive
    to cpu_online_map (kmem_cache_create, kmem_cache_destroy, slabinfo_write,
    __cache_shrink) is already serialized with cache_chain_mutex. (This patch
    lengthens the cache_chain_mutex hold time at kmem_cache_destroy to cover this.)
    This patch also takes cache_chain_mutex at kmem_cache_shrink to protect the
    sanity of cpu_online_map at __cache_shrink, as viewed by slab.
    (kmem_cache_shrink->__cache_shrink->drain_cpu_caches). But, really,
    kmem_cache_shrink is used at just one place in the acpi subsystem! Do we
    really need to keep kmem_cache_shrink at all?

    Another note. It looks like a cpu hotplug event can send CPU_UP_CANCELED to
    a registered subsystem even if the subsystem did not receive CPU_UP_PREPARE.
    This can happen when a subsystem registered for notification earlier than
    the current one bails out with NOTIFY_BAD. Badness can then occur within
    the CPU_UP_CANCELED code path at slab (the same would apply to workqueue.c
    as well). To overcome this, we might have to use either
    a) a per-subsystem flag and avoid handling CPU_UP_CANCELED, or
    b) special notifier events like LOCK_ACQUIRE/RELEASE, as Gautham was
    using in his experiments, or
    c) not send CPU_UP_CANCELED to a subsystem which did not receive
    CPU_UP_PREPARE.

    I would prefer c).
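
    A schematic of the locking described above (illustrative; the real slab
    notifier also allocates and frees per-cpu and alien caches):

        static int cpuup_callback(struct notifier_block *nfb,
                                  unsigned long action, void *hcpu)
        {
                switch (action) {
                case CPU_UP_PREPARE:
                        mutex_lock(&cache_chain_mutex);
                        /* ... set up per-cpu structures for the incoming cpu ... */
                        break;
                case CPU_ONLINE:                /* bring-up finished         */
                case CPU_UP_CANCELED:           /* ... or aborted part-way   */
                        mutex_unlock(&cache_chain_mutex);
                        break;
                }
                return NOTIFY_OK;
        }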

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • When CONFIG_SLAB_DEBUG is used in combination with ARCH_SLAB_MINALIGN, some
    debug flags should be disabled which depend on BYTES_PER_WORD alignment.

    The disabling of these debug flags is not properly handled when
    BYTES_PER_WORD < ARCH_SLAB_MINALIGN < cache_line_size().

    This patch fixes that and also adds an alignment check to
    cache_alloc_debugcheck_after() when ARCH_SLAB_MINALIGN is used.
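
    A sketch of the kind of check added (illustrative, not the exact hunk):

        #if ARCH_SLAB_MINALIGN
                /* Complain if a debug-built cache hands out an object that
                 * violates the architecture's minimum alignment. */
                if ((unsigned long)objp & (ARCH_SLAB_MINALIGN - 1))
                        printk(KERN_ERR "object %p not aligned to %d bytes\n",
                               objp, ARCH_SLAB_MINALIGN);
        #endif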

    Signed-off-by: Kevin Hilman
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kevin Hilman
     

22 Nov, 2006

2 commits

  • Pass the work_struct pointer to the work function rather than context data.
    The work function can use container_of() to work out the data.

    For the cases where the container of the work_struct may go away the moment the
    pending bit is cleared, it is made possible to defer the release of the
    structure by deferring the clearing of the pending bit.

    To make this work, an extra flag is introduced into the management side of the
    work_struct. This governs auto-release of the structure upon execution.

    Ordinarily, the work queue executor would release the work_struct for further
    scheduling or deallocation by clearing the pending bit prior to jumping to the
    work function. This means that, unless the driver makes some guarantee itself
    that the work_struct won't go away, the work function may not access anything
    else in the work_struct or its container lest they be deallocated. This is a
    problem if the auxiliary data is taken away (as done by the last patch).

    However, if the pending bit is *not* cleared before jumping to the work
    function, then the work function *may* access the work_struct and its container
    with no problems. But then the work function must itself release the
    work_struct by calling work_release().

    In most cases, automatic release is fine, so this is the default. Special
    initiators exist for the non-auto-release case (ending in _NAR).
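
    A sketch of the new calling convention (struct my_device and
    my_work_handler() are illustrative names):

        struct my_device {
                int                event_count;
                struct work_struct work;        /* embedded work item */
        };

        static void my_work_handler(struct work_struct *work)
        {
                /* Recover the containing object from the work_struct pointer. */
                struct my_device *dev = container_of(work, struct my_device, work);

                dev->event_count++;
        }

        /* setup:  INIT_WORK(&dev->work, my_work_handler);  */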

    Signed-Off-By: David Howells

    David Howells
     
  • Separate delayable work items from non-delayable work items by splitting them
    into a separate structure (delayed_work), which incorporates a work_struct and
    the timer_list removed from work_struct.

    The work_struct struct is huge, and this limits its usefulness. On a 64-bit
    architecture it's nearly 100 bytes in size. This reduces that by half for the
    non-delayable type of event.
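
    A sketch of the split-out type in use (illustrative; the handler signature
    follows the companion change above that passes the work_struct pointer):

        struct my_poller {
                struct delayed_work dwork;      /* work_struct + timer, only when needed */
        };

        static void my_poll(struct work_struct *work)
        {
                struct my_poller *p = container_of(work, struct my_poller, dwork.work);

                /* ... do the periodic work, then re-arm for one second later ... */
                schedule_delayed_work(&p->dwork, HZ);
        }

        /* setup:  INIT_DELAYED_WORK(&poller->dwork, my_poll);
         *         schedule_delayed_work(&poller->dwork, HZ);   */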

    Signed-Off-By: David Howells

    David Howells
     

04 Nov, 2006

1 commit

  • It looks like there is a bug in init_reap_node() in slab.c that can cause
    multiple oopses on certain ES7000 configurations. The variable reap_node
    is defined per cpu, but only initialized on a single CPU. This causes an
    oops in next_reap_node() when __get_cpu_var(reap_node) returns the wrong
    value. Fix is below.

    Signed-off-by: Dan Yeisley
    Cc: Andi Kleen
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Yeisley
     

22 Oct, 2006

1 commit

  • The zonelist may contain zones of nodes that have not been bootstrapped and
    we will oops if we try to allocate from those zones. So check if the node
    information for the slab and the node have been setup before attempting an
    allocation. If it has not been setup then skip that zone.

    Usually we will not encounter this situation since the slab bootstrap code
    avoids falling back before we have set up the respective nodes, but we seem
    to have special needs for powerpc.
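
    A sketch of the added guard (illustrative helper, not the actual slab.c
    hunk):

        /* Only fall back to a zone whose node already has its slab
         * bookkeeping (cachep->nodelists[nid]) set up; otherwise skip it. */
        static inline int node_ready_for_slab(struct kmem_cache *cachep, int nid)
        {
                return cachep->nodelists[nid] != NULL;
        }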

    Signed-off-by: Christoph Lameter
    Acked-by: Andy Whitcroft
    Cc: Paul Mackerras
    Cc: Mike Kravetz
    Cc: Benjamin Herrenschmidt
    Acked-by: Mel Gorman
    Acked-by: Will Schmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 Oct, 2006

1 commit

  • init_list() is called with a list parameter that is not equal to the
    cachep->nodelists entry under NUMA if more than one node exists. This is
    fully legitimate. One may want to populate the list fields before
    switching nodelist pointers.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Oct, 2006

1 commit

  • Reduce the NUMA text size of mm/slab.o a little on x86 by using a local
    variable to store the result of numa_node_id().

     text  data  bss    dec   hex  filename
    16858  2584   16  19458  4c02  mm/slab.o (before)
    16804  2584   16  19404  4bcc  mm/slab.o (after)

    [akpm@osdl.org: use better names]
    [pbadari@us.ibm.com: fix that]
    Cc: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

05 Oct, 2006

1 commit


04 Oct, 2006

2 commits

  • - rename ____kmalloc to kmalloc_track_caller so that people have a chance
    to guess what it does just from its name. Add a comment describing it
    for those who don't. Also move it after kmalloc in slab.h so people get
    less confused when they are just looking for kmalloc.

    - move things around in slab.c a little to reduce the ifdef mess.

    [penberg@cs.helsinki.fi: Fix up reversed #ifdef]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • kbuild explicitly includes this at build time.

    Signed-off-by: Dave Jones

    Dave Jones
     

30 Sep, 2006

1 commit

  • In cases where we detect a single bit has been flipped, we spew the usual
    slab corruption message, which users instantly think is a kernel bug. In a
    lot of cases, single bit errors are down to bad memory, or other hardware
    failure.

    This patch adds an extra line to the slab debug messages in those cases, in
    the hope that users will try memtest before they report a bug.

    000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
    Single bit error detected. Possibly bad RAM. Run memtest86.
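
    A sketch of how a single-bit error can be told apart from an ordinary
    overwrite (illustrative, not the exact mm/slab.c hunk):

        #include <linux/bitops.h>

        /* XOR against the expected poison byte leaves only the differing bits;
         * exactly one set bit suggests flipped RAM rather than a software
         * use-after-free scribble. */
        static int looks_like_single_bit_error(unsigned char bad,
                                               unsigned char expected)
        {
                return hweight8(bad ^ expected) == 1;
        }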

    [akpm@osdl.org: cleanups]
    Signed-off-by: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

27 Sep, 2006

3 commits

  • This patch ensures that the slab node lists in the NUMA case only contain
    slabs that belong to that specific node. All slab allocations use
    GFP_THISNODE when calling into the page allocator. If an allocation fails
    then we fall back in the slab allocator according to the zonelists appropriate
    for a certain context.

    This allows a replication of the behavior of alloc_pages() and
    alloc_pages_node() in the slab layer.

    Currently allocations requested from the page allocator may be redirected via
    cpusets to other nodes. This results in remote pages on nodelists and that in
    turn results in interrupt latency issues during cache draining. Plus the slab
    is handing out memory as local when it is really remote.

    Fallback for slab memory allocations will occur within the slab allocator and
    not in the page allocator. This is necessary in order to be able to use the
    existing pools of objects on the nodes that we fall back to before adding more
    pages to a slab.

    The fallback function ensures that the nodes we fall back to obey cpuset
    restrictions of the current context. We do not allocate objects from outside
    of the current cpuset context like before.

    Note that the implementation of locality constraints within the slab allocator
    requires importing logic from the page allocator. This is a mishmash that is
    not that great. Other allocators (uncached allocator, vmalloc, huge pages)
    face similar problems and have similar minimal reimplementations of the basic
    fallback logic of the page allocator. There is another way of implementing a
    slab by avoiding per-node lists (see modular slab) but this won't work within
    the existing slab.

    V1->V2:
    - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
    - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
    #ifdef

    [akpm@osdl.org: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • kmalloc_node() falls back to ___cache_alloc() under certain conditions and
    at that point memory policies may be applied redirecting the allocation
    away from the current node. Therefore kmalloc_node(...,numa_node_id()) or
    kmalloc_node(...,-1) may not return memory from the local node.

    Fix this by doing the policy check in __cache_alloc() instead of
    ____cache_alloc().

    This version here is a cleanup of Kiran's patch.

    - Tested on ia64.
    - Extra material removed.
    - Consolidate the exit path if alternate_node_alloc() returned an object.

    [akpm@osdl.org: warning fix]
    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • un-, de-, -free, -destroy, -exit, etc. functions should in general return
    void.

    Also, there is very little that, say, filesystem driver code can do upon a
    failed kmem_cache_destroy(). If it is decided to BUG in this case, the BUG
    should be put in generic code instead.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

26 Sep, 2006

11 commits

  • Remove the atomic counter for slab_reclaim_pages and replace the counter
    and NR_SLAB with two ZVC counters that account for unreclaimable and
    reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

    Change the check in vmscan.c to refer to NR_SLAB_RECLAIMABLE. The
    intent seems to be to check for slab pages that could be freed.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The allocpercpu functions __alloc_percpu and __free_percpu() are heavily
    using the slab allocator. However, they are conceptually slab. This also
    simplifies SLOB (at this point slob may be broken in mm. This should fix
    it).

    Signed-off-by: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • On high-end systems (1024 or so CPUs) this can potentially cause stack
    overflow. Fix the stack usage.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     
  • Place the alien array cache locks of on-slab kmalloc slab caches in a
    separate lockdep class. This avoids false positives from lockdep.

    [akpm@osdl.org: build fix]
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Thomas Gleixner
    Acked-by: Arjan van de Ven
    Cc: Ingo Molnar
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • It is fairly easy to get a system to oops by simply sizing a cache via
    /proc in such a way that one of the caches (shared is easiest) becomes
    bigger than the maximum allowed slab allocation size. This occurs because
    enable_cpucache() fails if it cannot reallocate some caches.

    However, enable_cpucache() is used for multiple purposes: resizing caches,
    cache creation and bootstrap.

    If the slab is already up then we already have working caches. The resize
    can fail without a problem. We just need to return the proper error code.
    F.e. after this patch:

    # echo "size-64 10000 50 1000" >/proc/slabinfo
    -bash: echo: write error: Cannot allocate memory

    notice no OOPS.

    If we are doing a kmem_cache_create() then we also should not panic but
    return -ENOMEM.

    If on the other hand we do not have a fully bootstrapped slab allocator yet
    then we should indeed panic since we are unable to bring up the slab to its
    full functionality.
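
    A sketch of the resulting policy (illustrative, not the exact slab.c code;
    g_cpucache_up/FULL are the bootstrap-state names used at the time):

        if (enable_cpucache(cachep)) {
                if (g_cpucache_up == FULL)
                        return -ENOMEM;         /* resize or kmem_cache_create(): fail gracefully */
                panic("kmem_cache_create: failed to enable cpucache for %s",
                      cachep->name);            /* still bootstrapping: cannot continue */
        }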

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The ability to free memory allocated to a slab cache is also useful if an
    error occurs during setup of a slab. So extract the function.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • [akpm@osdl.org: export fix]
    Signed-off-by: Christoph Hellwig
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Also, check that we get a valid slabp_cache for off-slab slab descriptors.
    We should always get this. If we don't, then we will have to disable
    off-slab descriptors for this cache and redo the calculations.
    This is a rare case, so add a BUG_ON for now, just in case.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • As explained by Heiko, on s390 (32-bit) ARCH_KMALLOC_MINALIGN is set to
    eight because their common I/O layer allocates data structures that need to
    have an eight-byte alignment. This does not work when CONFIG_SLAB_DEBUG is
    enabled because kmem_cache_create() will override the alignment to
    BYTES_PER_WORD, which is four.

    So change kmem_cache_create to ensure cache alignment is always at minimum
    what the architecture or caller mandates even if slab debugging is enabled.
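
    A sketch of the resulting rule inside kmem_cache_create() (illustrative,
    not the exact code; ralign is the debug-path alignment variable):

        ralign = BYTES_PER_WORD;                /* debug default              */
        if (ralign < ARCH_KMALLOC_MINALIGN)
                ralign = ARCH_KMALLOC_MINALIGN; /* architecture mandate       */
        if (ralign < align)
                ralign = align;                 /* caller-requested alignment */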

    Cc: Heiko Carstens
    Cc: Christoph Lameter
    Signed-off-by: Manfred Spraul
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • This patch splits alloc_percpu() up into two phases. Likewise for
    free_percpu(). This allows clients to limit initial allocations to online
    CPUs, and to populate or depopulate per-cpu data at run time as needed:

    struct my_struct *obj;

    /* initial allocation for online cpu's */
    obj = percpu_alloc(sizeof(struct my_struct), GFP_KERNEL);

    ...

    /* populate per-cpu data for cpu coming online */
    ptr = percpu_populate(obj, sizeof(struct my_struct), GFP_KERNEL, cpu);

    ...

    /* access per-cpu object */
    ptr = percpu_ptr(obj, smp_processor_id());

    ...

    /* depopulate per-cpu data for cpu going offline */
    percpu_depopulate(obj, cpu);

    ...

    /* final removal */
    percpu_free(obj);

    Signed-off-by: Martin Peschke
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Peschke
     
  • This patch makes the following needlessly global functions static:
    - slab.c: kmem_find_general_cachep()
    - swap.c: __page_cache_release()
    - vmalloc.c: __vmalloc_node()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

01 Aug, 2006

2 commits


14 Jul, 2006

3 commits

  • Chandra Seetharaman reported SLAB crashes caused by the slab.c lock
    annotation patch. There is only one chunk of that patch that has a
    material effect on the slab logic - this patch undoes that chunk.

    This was confirmed to fix the slab problem by Chandra.

    Signed-off-by: Ingo Molnar
    Tested-by: Chandra Seetharaman
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • mm/slab.c uses nested locking when dealing with 'off-slab'
    caches: in that case it allocates the slab header from the
    (on-slab) kmalloc caches. Teach the lock validator about
    this by putting all on-slab caches into a separate class.

    This patch has no effect on non-lockdep kernels.
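
    A sketch of the annotation style (illustrative, not the exact patch):

        /* All on-slab caches share one dedicated lock class, so taking an
         * on-slab cache's list_lock while holding an off-slab cache's lock
         * no longer looks recursive to lockdep. */
        static struct lock_class_key on_slab_l3_key;

        static void set_slab_lock_class(struct kmem_cache *cachep, int node)
        {
                lockdep_set_class(&cachep->nodelists[node]->list_lock,
                                  &on_slab_l3_key);
        }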

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Undo the existing mm/slab.c lock-validator annotations, in preparation
    for a new, less intrusive annotation patch.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Ingo Molnar