08 Oct, 2006

1 commit

  • Init list is called with a list parameter that is not equal to the
    cachep->nodelists entry under NUMA if more than one node exists. This is
    fully legitimate. One may want to populate the list fields before
    switching nodelist pointers.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Oct, 2006

1 commit

  • Reduce the NUMA text size of mm/slab.o a little on x86 by using a local
    variable to store the result of numa_node_id().

    text    data    bss     dec     hex     filename
    16858   2584    16      19458   4c02    mm/slab.o (before)
    16804   2584    16      19404   4bcc    mm/slab.o (after)
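
    A sketch of the pattern, not the actual diff; check_node() and grow_node()
    below are made-up stand-ins for the call sites that need the node id:

    int node;

    /* Before: each use re-evaluated numa_node_id(). */
    check_node(cachep, numa_node_id());
    grow_node(cachep, numa_node_id());

    /* After: evaluate once and reuse; this is what shrinks the text. */
    node = numa_node_id();
    check_node(cachep, node);
    grow_node(cachep, node);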

    [akpm@osdl.org: use better names]
    [pbadari@us.ibm.com: fix that]
    Cc: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

05 Oct, 2006

1 commit


04 Oct, 2006

2 commits

  • - rename ____kmalloc to kmalloc_track_caller so that people have a chance
    to guess what it does just from its name. Add a comment describing it
    for those who don't. Also move it after kmalloc in slab.h so people get
    less confused when they are just looking for kmalloc.
    - move things around in slab.c a little to reduce the ifdef mess.
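
    A sketch of how a wrapper might use the renamed helper; my_strdup() is a
    hypothetical kstrdup-style function, not code from this patch:

    /* With kmalloc_track_caller(), the allocation is attributed to whoever
     * called my_strdup(), not to my_strdup() itself. */
    char *my_strdup(const char *s, gfp_t gfp)
    {
            size_t len = strlen(s) + 1;
            char *buf = kmalloc_track_caller(len, gfp);

            if (buf)
                    memcpy(buf, s, len);
            return buf;
    }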

    [penberg@cs.helsinki.fi: Fix up reversed #ifdef]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • kbuild explicitly includes this at build time.

    Signed-off-by: Dave Jones

    Dave Jones
     

30 Sep, 2006

1 commit

  • In cases where we detect a single bit has been flipped, we spew the usual
    slab corruption message, which users instantly think is a kernel bug. In a
    lot of cases, single bit errors are down to bad memory, or other hardware
    failure.

    This patch adds an extra line to the slab debug messages in those cases, in
    the hope that users will try memtest before they report a bug.

    000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
    Single bit error detected. Possibly bad RAM. Run memtest86.
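
    A minimal sketch of the idea, assuming the expected poison byte is known
    and using hweight8() to count differing bits (not the exact mm/slab.c code):

    /* Report a likely RAM problem if the byte differs from the expected
     * poison value in exactly one bit. */
    static void check_poison_byte(unsigned char actual, unsigned char expected)
    {
            if (hweight8(actual ^ expected) == 1)
                    printk(KERN_ERR "Single bit error detected. "
                           "Possibly bad RAM. Run memtest86.\n");
    }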

    [akpm@osdl.org: cleanups]
    Signed-off-by: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

27 Sep, 2006

3 commits

  • This patch ensures that the slab node lists in the NUMA case only contain
    slabs that belong to that specific node. All slab allocations use
    GFP_THISNODE when calling into the page allocator. If an allocation fails
    then we fall back in the slab allocator according to the zonelists appropriate
    for a certain context.

    This allows the slab layer to replicate the behavior of alloc_pages and
    alloc_pages_node.

    Currently allocations requested from the page allocator may be redirected via
    cpusets to other nodes. This results in remote pages on nodelists and that in
    turn results in interrupt latency issues during cache draining. Plus the slab
    is handing out memory as local when it is really remote.

    Fallback for slab memory allocations will occur within the slab allocator and
    not in the page allocator. This is necessary in order to be able to use the
    existing pools of objects on the nodes that we fall back to before adding more
    pages to a slab.

    The fallback function ensures that the nodes we fall back to obey cpuset
    restrictions of the current context. We do not allocate objects from outside
    of the current cpuset context like before.

    Note that the implementation of locality constraints within the slab allocator
    requires importing logic from the page allocator. This mishmash is not ideal.
    Other allocators (uncached allocator, vmalloc, huge pages) face similar
    problems and have similar minimal reimplementations of the basic fallback
    logic of the page allocator. There is another way of implementing a slab by
    avoiding per-node lists (see modular slab), but this won't work within the
    existing slab.

    V1->V2:
    - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
    - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
    #ifdef
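
    A rough sketch of the allocation order described above; slab_fallback_alloc()
    is a made-up name for the fallback path, and the real code is more involved:

    static struct page *alloc_slab_pages(struct kmem_cache *cachep,
                                         gfp_t flags, int nodeid, int order)
    {
            struct page *page;

            /* GFP_THISNODE: no cpuset/mempolicy redirection by the page
             * allocator, so the page really is local if we get one. */
            page = alloc_pages_node(nodeid, flags | GFP_THISNODE, order);
            if (!page)
                    /* Walk the zonelist for this context, skip zones the
                     * current cpuset forbids, and reuse free objects on
                     * the fallback nodes before adding new pages. */
                    page = slab_fallback_alloc(cachep, flags, order);
            return page;
    }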

    [akpm@osdl.org: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • kmalloc_node() falls back to ___cache_alloc() under certain conditions and
    at that point memory policies may be applied redirecting the allocation
    away from the current node. Therefore kmalloc_node(...,numa_node_id()) or
    kmalloc_node(...,-1) may not return memory from the local node.

    Fix this by doing the policy check in __cache_alloc() instead of
    ____cache_alloc().

    This version here is a cleanup of Kiran's patch.

    - Tested on ia64.
    - Extra material removed.
    - Consolidate the exit path if alternate_node_alloc() returned an object.

    [akpm@osdl.org: warning fix]
    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • un-, de-, -free, -destroy, -exit, etc. functions should in general return
    void. Also, there is very little that, say, filesystem driver code can do
    upon a failed kmem_cache_destroy(). If it is decided to BUG in this case,
    the BUG should be put in generic code instead.
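
    For illustration, the change in calling convention looks roughly like this
    (foo_cachep is a made-up cache):

    /* Before: callers were tempted to check a return value they could not
     * do much about. */
    if (kmem_cache_destroy(foo_cachep))
            printk(KERN_ERR "foo: could not destroy cache\n");

    /* After: kmem_cache_destroy() returns void. */
    kmem_cache_destroy(foo_cachep);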

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

26 Sep, 2006

11 commits

  • Remove the atomic counter for slab_reclaim_pages and replace the counter
    and NR_SLAB with two ZVC counters that account for unreclaimable and
    reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

    Change the check in vmscan.c to refer to NR_SLAB_RECLAIMABLE. The
    intent seems to be to check for slab pages that could be freed.
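
    A sketch of the accounting, using the ZVC helper mod_zone_page_state();
    the wrapper function itself is illustrative:

    /* Account slab pages per zone as reclaimable or unreclaimable. */
    static void account_slab_pages(struct kmem_cache *cachep,
                                   struct page *page, int nr_pages)
    {
            enum zone_stat_item item =
                    (cachep->flags & SLAB_RECLAIM_ACCOUNT) ?
                            NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;

            mod_zone_page_state(page_zone(page), item, nr_pages);
    }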

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The allocpercpu functions __alloc_percpu() and __free_percpu() make heavy
    use of the slab allocator. However, they are conceptually separate from
    the slab allocator. This also simplifies SLOB (at this point slob may be
    broken in mm; this should fix it).

    Signed-off-by: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • On high-end systems (1024 or so cpus) this can potentially cause stack
    overflow. Fix the stack usage.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     
  • Place the alien array cache locks of on-slab malloc slab caches in a
    separate lockdep class. This avoids false positives from lockdep.
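
    A sketch of the lockdep idiom used; the key name is illustrative:

    /* Give the alien array-cache locks their own lockdep class so nesting
     * against the on-slab cache locks is not flagged as recursion. */
    static struct lock_class_key on_slab_alc_key;

    static void set_alien_lock_class(struct array_cache *alc)
    {
            lockdep_set_class(&alc->lock, &on_slab_alc_key);
    }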

    [akpm@osdl.org: build fix]
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Thomas Gleixner
    Acked-by: Arjan van de Ven
    Cc: Ingo Molnar
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • It is fairly easy to get a system to oops by simply sizing a cache via
    /proc in such a way that one of the caches (shared is easiest) becomes
    bigger than the maximum allowed slab allocation size. This occurs because
    enable_cpucache() fails if it cannot reallocate some caches.

    However, enable_cpucache() is used for multiple purposes: resizing caches,
    cache creation and bootstrap.

    If the slab is already up then we already have working caches. The resize
    can fail without a problem. We just need to return the proper error code.
    F.e. after this patch:

    # echo "size-64 10000 50 1000" >/proc/slabinfo
    -bash: echo: write error: Cannot allocate memory

    notice no OOPS.

    If we are doing a kmem_cache_create() then we also should not panic but
    return -ENOMEM.

    If on the other hand we do not have a fully bootstrapped slab allocator yet
    then we should indeed panic since we are unable to bring up the slab to its
    full functionality.
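
    The resulting policy has roughly this shape; slab_fully_up is a made-up
    flag standing in for the bootstrap state tracked in mm/slab.c:

    err = enable_cpucache(cachep);
    if (err) {
            if (!slab_fully_up)     /* still bootstrapping: cannot recover */
                    panic("enable_cpucache failed for %s\n", cachep->name);
            /* resize via /proc or a late kmem_cache_create(): fail softly */
            return -ENOMEM;
    }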

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The ability to free memory allocated to a slab cache is also useful if an
    error occurs during setup of a slab. So extract the function.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • [akpm@osdl.org: export fix]
    Signed-off-by: Christoph Hellwig
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Also, check that we get a valid slabp_cache for off-slab slab descriptors.
    We should always get one. If we don't, we will have to disable off-slab
    descriptors for this cache and redo the calculations. This is a rare
    case, so add a BUG_ON, for now, just in case.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • As explained by Heiko, on s390 (32-bit) ARCH_KMALLOC_MINALIGN is set to
    eight because their common I/O layer allocates data structures that need to
    have an eight byte alignment. This does not work when CONFIG_SLAB_DEBUG is
    enabled because kmem_cache_create will override alignment to BYTES_PER_WORD
    which is four.

    So change kmem_cache_create to ensure cache alignment is always at minimum
    what the architecture or caller mandates even if slab debugging is enabled.
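
    The fix amounts to taking the maximum of the debug word alignment and the
    architecture/caller minimum, roughly (a simplified view of the
    kmem_cache_create() alignment logic):

    /* Never let debugging lower the required alignment. */
    ralign = BYTES_PER_WORD;
    if (ralign < ARCH_KMALLOC_MINALIGN)
            ralign = ARCH_KMALLOC_MINALIGN;
    if (ralign < align)             /* caller-supplied alignment */
            ralign = align;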

    Cc: Heiko Carstens
    Cc: Christoph Lameter
    Signed-off-by: Manfred Spraul
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • This patch splits alloc_percpu() up into two phases. Likewise for
    free_percpu(). This allows clients to limit initial allocations to online
    cpu's, and to populate or depopulate per-cpu data at run time as needed:

    struct my_struct *obj;

    /* initial allocation for online cpu's */
    obj = percpu_alloc(sizeof(struct my_struct), GFP_KERNEL);

    ...

    /* populate per-cpu data for cpu coming online */
    ptr = percpu_populate(obj, sizeof(struct my_struct), GFP_KERNEL, cpu);

    ...

    /* access per-cpu object */
    ptr = percpu_ptr(obj, smp_processor_id());

    ...

    /* depopulate per-cpu data for cpu going offline */
    percpu_depopulate(obj, cpu);

    ...

    /* final removal */
    percpu_free(obj);

    Signed-off-by: Martin Peschke
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Peschke
     
  • This patch makes the following needlessly global functions static:
    - slab.c: kmem_find_general_cachep()
    - swap.c: __page_cache_release()
    - vmalloc.c: __vmalloc_node()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

01 Aug, 2006

2 commits


14 Jul, 2006

3 commits

  • Chandra Seetharaman reported SLAB crashes caused by the slab.c lock
    annotation patch. There is only one chunk of that patch that has a
    material effect on the slab logic - this patch undoes that chunk.

    This was confirmed to fix the slab problem by Chandra.

    Signed-off-by: Ingo Molnar
    Tested-by: Chandra Seetharaman
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • mm/slab.c uses nested locking when dealing with 'off-slab'
    caches, in that case it allocates the slab header from the
    (on-slab) kmalloc caches. Teach the lock validator about
    this by putting all on-slab caches into a separate class.

    This patch has no effect on non-lockdep kernels.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Undo the existing mm/slab.c lock-validator annotations, in preparation
    for a new, less intrusive annotation patch.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

04 Jul, 2006

1 commit

  • Teach special (recursive) locking code to the lock validator. Has no effect
    on non-lockdep kernels.

    Fix initialize-locks-via-memcpy assumptions.

    Effects on non-lockdep kernels: the subclass nesting parameter is passed into
    cache_free_alien() and __cache_free(), and turns one internal
    kmem_cache_free() call into an open-coded __cache_free() call.
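
    A sketch of the annotation idiom involved; the surrounding function is
    illustrative, the list_lock field is the one in mm/slab.c:

    /* Tell the validator that taking the remote node's list_lock while
     * another list_lock is held is intentional, bounded nesting. */
    static void drain_alien_sketch(struct kmem_list3 *remote_l3)
    {
            spin_lock_nested(&remote_l3->list_lock, SINGLE_DEPTH_NESTING);
            /* ... return the alien objects to their home node ... */
            spin_unlock(&remote_l3->list_lock);
    }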

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jul, 2006

3 commits

  • Post and discussion:
    http://marc.theaimsgroup.com/?t=115074342800003&r=1&w=2

    Code in __node_shrink() duplicates code in cache_reap().

    Add a new function drain_freelist that removes slabs with objects that are
    already free and use that in various places.

    This eliminates the __node_shrink() function and provides the interrupt
    holdoff reduction from slab_free to code that used to call __node_shrink.
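
    A condensed sketch of what a drain_freelist()-style helper does (locking
    and bookkeeping simplified; not the exact code):

    static int drain_freelist_sketch(struct kmem_cache *cache,
                                     struct kmem_list3 *l3, int tofree)
    {
            int nr_freed = 0;

            /* Free completely free slabs, re-taking the list lock each
             * iteration to keep interrupt holdoff short. */
            while (nr_freed < tofree && !list_empty(&l3->slabs_free)) {
                    struct slab *slabp;

                    spin_lock_irq(&l3->list_lock);
                    if (list_empty(&l3->slabs_free)) {
                            spin_unlock_irq(&l3->list_lock);
                            break;
                    }
                    slabp = list_entry(l3->slabs_free.prev, struct slab, list);
                    list_del(&slabp->list);
                    l3->free_objects -= cache->num;
                    spin_unlock_irq(&l3->list_lock);

                    slab_destroy(cache, slabp);
                    nr_freed++;
            }
            return nr_freed;
    }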

    [akpm@osdl.org: build fixes]
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • - Allows reclaim to access counter without looping over processor counts.

    - Allows accurate statistics on how many pages are used in a zone by
    the slab. This may become useful to balance slab allocations over
    various zones.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Per zone counter infrastructure

    The counters that we currently have for the VM are split per processor. The
    processor however has not much to do with the zone these pages belong to. We
    cannot tell f.e. how many ZONE_DMA pages are dirty.

    So we are blind to potential imbalances in the usage of memory in various
    zones. F.e. in a NUMA system we cannot tell how many pages are dirty on a
    particular node. If we knew then we could put measures into the VM to balance
    the use of memory between different zones and different nodes in a NUMA
    system. For example it would be possible to limit the dirty pages per node so
    that fast local memory is kept available even if a process is dirtying huge
    amounts of pages.

    Another example is zone reclaim. We do not know how many unmapped pages exist
    per zone. So we just have to try to reclaim. If it is not working then we
    pause and try again later. It would be better if we knew when it makes sense
    to reclaim unmapped pages from a zone. This patchset allows the determination
    of the number of unmapped pages per zone. We can remove the zone reclaim
    interval with the counters introduced here.

    Furthermore, the ability to have various usage statistics available will allow
    the development of new NUMA balancing algorithms that may be able to improve
    the decision making in the scheduler of when to move a process to another node
    and hopefully will also enable automatic page migration through a user space
    program that can analyse the memory load distribution and then rebalance
    memory use in order to increase performance.

    The counter framework here implements differential counters for each processor
    in struct zone. The differential counters are consolidated when a threshold
    is exceeded (as is done in the current implementation for nr_pagecache), when
    slab reaping occurs or when a consolidation function is called.

    Consolidation uses atomic operations and accumulates counters per zone in the
    zone structure and also globally in the vm_stat array. VM functions can
    access the counts by simply indexing a global or zone specific array.

    The arrangement of counters in an array also simplifies processing when output
    has to be generated for /proc/*.

    Counters can be updated by calling inc/dec_zone_page_state or
    _inc/dec_zone_page_state analogous to *_page_state. The second group of
    functions can be called if it is known that interrupts are disabled.

    Special optimized increment and decrement functions are provided. These can
    avoid certain checks and use increment or decrement instructions that an
    architecture may provide.

    We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
    do DMA to all memory and therefore ZONE_NORMAL will not be populated. This is
    only currently set for IA64 SGI SN2 and currently only affects
    node_page_state(). In the best case node_page_state can be reduced to
    retrieving a single counter for the one zone on the node.
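
    The core differential-counter idea, sketched as a simplification of
    __mod_zone_page_state() (per-cpu access and threshold shown schematically):

    #define STAT_THRESHOLD 32       /* example fold threshold */

    static void mod_zone_state_sketch(struct zone *zone,
                                      enum zone_stat_item item, int delta)
    {
            s8 *diff = &zone_pcp(zone, smp_processor_id())->vm_stat_diff[item];

            /* Cheap per-cpu update; fold into the zone and global atomic
             * counters only when the local delta gets large. */
            *diff += delta;
            if (*diff > STAT_THRESHOLD || *diff < -STAT_THRESHOLD) {
                    atomic_long_add(*diff, &zone->vm_stat[item]);
                    atomic_long_add(*diff, &vm_stat[item]);
                    *diff = 0;
            }
    }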

    [akpm@osdl.org: cleanups]
    [akpm@osdl.org: export vm_stat[] for filesystems]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

28 Jun, 2006

5 commits

  • Runtime debugging functionality for rt-mutexes.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Add debug_check_no_locks_freed(), as a central inline to add
    bad-lock-free-debugging functionality to.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Make notifier_blocks associated with cpu_notifier as __cpuinitdata.

    __cpuinitdata makes sure that the data is init time only unless
    CONFIG_HOTPLUG_CPU is defined.
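
    For illustration, the annotation is applied like this; the notifier and
    callback names are hypothetical:

    /* Discarded after init unless CONFIG_HOTPLUG_CPU keeps it for runtime
     * CPU hotplug events. */
    static int __cpuinit foo_cpu_callback(struct notifier_block *nb,
                                          unsigned long action, void *hcpu)
    {
            return NOTIFY_OK;
    }

    static struct notifier_block __cpuinitdata foo_cpu_notifier = {
            .notifier_call = foo_cpu_callback,
    };

    /* register_cpu_notifier(&foo_cpu_notifier) is then called during init. */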

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
    band-aid solution to solve that problem. In the process, I undid all the
    changes you both were making to ensure that these notifiers were available
    only at init time (unless CONFIG_HOTPLUG_CPU is defined).

    We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
    XFS problem cleanly and makes the cpu notifiers available only at init time
    (unless CONFIG_HOTPLUG_CPU is defined).

    If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
    time.

    This patch reverts the notifier_call changes made in 2.6.17.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • Localize poison values into one header file for better documentation and
    easier/quicker debugging and so that the same values won't be used for
    multiple purposes.

    Use these constants in core arch., mm, driver, and fs code.
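
    For example, the slab poison bytes collected there look like this:

    /* One place for the slab magic values instead of scattered literals. */
    #define POISON_INUSE    0x5a    /* catches use of uninitialised memory */
    #define POISON_FREE     0x6b    /* catches use-after-free */
    #define POISON_END      0xa5    /* end marker of the poisoned area */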

    Signed-off-by: Randy Dunlap
    Acked-by: Matt Mackall
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: "David S. Miller"
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

23 Jun, 2006

6 commits

  • - Move comments for kmalloc to the right place; currently they are near
    __do_kmalloc

    - Comments for kzalloc

    - More detailed comments for kmalloc

    - Appearance of "kmalloc" and "kzalloc" man pages after "make mandocs"

    [rdunlap@xenotime.net: simplification]
    Signed-off-by: Paul Drynoff
    Acked-by: Randy Dunlap
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Drynoff
     
  • The SLAB bootstrap code assumes that the first two kmalloc caches created
    (the INDEX_AC and INDEX_L3 kmalloc caches) won't be off-slab. But due to
    the AC and L3 structure size increases under lockdep, one of them ended up
    being off-slab, and subsequently crashed with:

    Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
    [] kmem_cache_alloc+0x26/0x7d

    The fix is to introduce a bootstrap flag and to use it to prevent off-slab
    caches being created so early during bootup.

    (The calculation for off-slab caches is quite complex, so I didn't want to
    complicate things by introducing yet another INDEX_ calculation; the flag
    approach is simpler and smaller.)
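
    The flag approach amounts to something like this (simplified; the real
    condition in kmem_cache_create() has more terms):

    static int slab_early_init = 1; /* cleared at the end of kmem_cache_init() */

    /* in kmem_cache_create(): */
    if (size >= (PAGE_SIZE >> 3) && !slab_early_init)
            flags |= CFLGS_OFF_SLAB;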

    Signed-off-by: Ingo Molnar
    Cc: Manfred Spraul
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Passing an invalid pointer to kfree() and kmem_cache_free() is likely to
    cause bad memory corruption or even take down the whole system because the
    bad pointer is likely reused immediately due to the per-CPU caches. Until
    now, we haven't done any verification for this if CONFIG_DEBUG_SLAB is
    disabled.

    As suggested by Linus, add PageSlab check to page_to_cache() and
    page_to_slab() to verify pointers passed to kfree(). Also, move the
    stronger check from cache_free_debugcheck() to kmem_cache_free() to ensure
    the passed pointer actually belongs to the cache we're about to free the
    object from.

    For page_to_cache() and page_to_slab(), the assertions should have
    virtually no extra cost (two instructions, no data cache pressure) and for
    kmem_cache_free() the overhead should be minimal.

    Signed-off-by: Pekka Enberg
    Cc: Manfred Spraul
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • At present our slab debugging tells us that it detected a double-free or
    corruption - it does not distinguish between them. Sometimes it's useful
    to be able to differentiate between these two types of information.

    Add double-free detection to redzone verification when freeing an object.
    As explained by Manfred, when we are freeing an object, both redzones
    should be RED_ACTIVE. However, if both are RED_INACTIVE, we are trying to
    free an object that was already freed.
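
    Schematically, the check at free time becomes (a sketch of the redzone
    verification; the reporting is simplified):

    static void verify_redzone_free_sketch(struct kmem_cache *cache,
                                           unsigned long redzone1,
                                           unsigned long redzone2)
    {
            if (redzone1 == RED_ACTIVE && redzone2 == RED_ACTIVE)
                    return;         /* object looks fine */

            if (redzone1 == RED_INACTIVE && redzone2 == RED_INACTIVE)
                    slab_error(cache, "double free detected");
            else
                    slab_error(cache, "memory outside object was overwritten");
    }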

    Signed-off-by: Manfred Spraul
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Use the _entry variant everywhere to clean the code up a tiny bit.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The last ifdef addition hit the ugliness threshold on this function, so:

    - rename the variable i to nr_pages so it's somewhat descriptive
    - remove the addr variable and do the page_address call at the very end
    - instead of ifdef'ing the whole alloc_pages_node call just make the
    __GFP_COMP addition to flags conditional
    - rewrite the __GFP_COMP comment to make sense

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig