14 Jul, 2006

3 commits

  • Chandra Seetharaman reported SLAB crashes caused by the slab.c lock
    annotation patch. There is only one chunk of that patch that has a
    material effect on the slab logic - this patch undoes that chunk.

    This was confirmed to fix the slab problem by Chandra.

    Signed-off-by: Ingo Molnar
    Tested-by: Chandra Seetharaman
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • mm/slab.c uses nested locking when dealing with 'off-slab' caches; in
    that case it allocates the slab header from the (on-slab) kmalloc caches.
    Teach the lock validator about this by putting all on-slab caches into a
    separate class (see the sketch after this entry).

    This patch has no effect on non-lockdep kernels.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
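    A minimal sketch of the annotation idea, not the committed patch; the
    on_slab_key name and the init helper are assumptions, while OFF_SLAB(),
    struct kmem_list3 and lockdep_set_class() are existing kernel names:

    /* Give every on-slab cache's list lock one dedicated lockdep class, so
     * that taking an off-slab cache's lock while holding an on-slab
     * (kmalloc) cache's lock is not reported as lock recursion. */
    static struct lock_class_key on_slab_key;

    static void set_slab_lock_class(struct kmem_cache *cachep,
                                    struct kmem_list3 *l3)
    {
            if (!OFF_SLAB(cachep))
                    lockdep_set_class(&l3->list_lock, &on_slab_key);
    }
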
  • undo existing mm/slab.c lock-validator annotations, in preparation
    of a new, less intrusive annotation patch.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

04 Jul, 2006

1 commit

  • Teach special (recursive) locking code to the lock validator. Also fix
    initialize-locks-via-memcpy assumptions.

    The only effects on non-lockdep kernels: the subclass nesting parameter
    is passed into cache_free_alien() and __cache_free(), and one internal
    kmem_cache_free() call is turned into an open-coded __cache_free() call
    (see the sketch after this entry).

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
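    A hedged sketch of what the subclass nesting looks like at a lock site;
    spin_lock_nested() is the real lockdep primitive, while the surrounding
    function shape and the _sketch name are illustrative:

    static void cache_free_alien_sketch(struct kmem_cache *cachep,
                                        struct kmem_list3 *l3, int nesting)
    {
            /* Tell lockdep this list_lock acquisition nests inside another
             * lock of the same class (the off-slab header free path), so it
             * is not flagged as a self-deadlock. */
            spin_lock_nested(&l3->list_lock, nesting);
            /* ... free the object back to the remote node's lists ... */
            spin_unlock(&l3->list_lock);
    }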

01 Jul, 2006

3 commits

  • Post and discussion:
    http://marc.theaimsgroup.com/?t=115074342800003&r=1&w=2

    Code in __node_shrink() duplicates code in cache_reap().

    Add a new function, drain_freelist(), that removes slabs whose objects
    are already free, and use it in various places (see the sketch after
    this entry).

    This eliminates the __node_shrink() function and brings the interrupt
    holdoff reduction from slab_free to the code that used to call
    __node_shrink().

    [akpm@osdl.org: build fixes]
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
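    A compressed sketch of drain_freelist(), reconstructed from the
    description above rather than quoted; the slab-internal names
    (slabs_free, free_objects, slab_destroy) are mm/slab.c conventions:

    /* Remove up to 'tofree' completely free slabs, dropping the list lock
     * between slabs to keep the interrupt holdoff short. */
    static int drain_freelist(struct kmem_cache *cache,
                              struct kmem_list3 *l3, int tofree)
    {
            struct slab *slabp;
            int nr_freed = 0;

            while (nr_freed < tofree && !list_empty(&l3->slabs_free)) {
                    spin_lock_irq(&l3->list_lock);
                    if (list_empty(&l3->slabs_free)) {
                            spin_unlock_irq(&l3->list_lock);
                            break;
                    }
                    slabp = list_entry(l3->slabs_free.prev, struct slab, list);
                    BUG_ON(slabp->inuse);       /* must be fully free */
                    list_del(&slabp->list);
                    l3->free_objects -= cache->num;
                    spin_unlock_irq(&l3->list_lock);
                    slab_destroy(cache, slabp);
                    nr_freed++;
            }
            return nr_freed;
    }
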
  • - Allows reclaim to access counter without looping over processor counts.

    - Allows accurate statistics on how many pages are used in a zone by
    the slab. This may become useful to balance slab allocations over
    various zones.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Per zone counter infrastructure

    The counters that we currently have for the VM are split per processor. The
    processor however has not much to do with the zone these pages belong to. We
    cannot tell f.e. how many ZONE_DMA pages are dirty.

    So we are blind to potential imbalances in the usage of memory in various
    zones. F.e. in a NUMA system we cannot tell how many pages are dirty on a
    particular node. If we knew then we could put measures into the VM to balance
    the use of memory between different zones and different nodes in a NUMA
    system. For example it would be possible to limit the dirty pages per node so
    that fast local memory is kept available even if a process is dirtying huge
    amounts of pages.

    Another example is zone reclaim. We do not know how many unmapped pages exist
    per zone. So we just have to try to reclaim. If it is not working then we
    pause and try again later. It would be better if we knew when it makes sense
    to reclaim unmapped pages from a zone. This patchset allows the determination
    of the number of unmapped pages per zone. We can remove the zone reclaim
    interval with the counters introduced here.

    Furthermore, the ability to have various usage statistics available will allow
    the development of new NUMA balancing algorithms that may be able to improve
    the decision making in the scheduler of when to move a process to another node
    and hopefully will also enable automatic page migration through a user space
    program that can analyse the memory load distribution and then rebalance
    memory use in order to increase performance.

    The counter framework here implements differential counters for each
    processor in struct zone. The differential counters are consolidated when
    a threshold is exceeded (as is done in the current implementation for
    nr_pagecache), when slab reaping occurs, or when a consolidation function
    is called.

    Consolidation uses atomic operations and accumulates counters per zone in the
    zone structure and also globally in the vm_stat array. VM functions can
    access the counts by simply indexing a global or zone specific array.

    The arrangement of counters in an array also simplifies processing when output
    has to be generated for /proc/*.

    Counters can be updated by calling inc/dec_zone_page_state or
    __inc/__dec_zone_page_state, analogous to *_page_state. The second group
    of functions can be called if it is known that interrupts are disabled
    (see the sketch after this entry).

    Special optimized increment and decrement functions are provided. These can
    avoid certain checks and use increment or decrement instructions that an
    architecture may provide.

    We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
    do DMA to all memory and therefore ZONE_NORMAL will not be populated. This is
    only currently set for IA64 SGI SN2 and currently only affects
    node_page_state(). In the best case node_page_state can be reduced to
    retrieving a single counter for the one zone on the node.

    [akpm@osdl.org: cleanups]
    [akpm@osdl.org: export vm_stat[] for filesystems]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
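    A minimal sketch of the differential-counter update path described above;
    the threshold value and the vm_stat_diff/zone_pcp names are assumptions
    used for illustration only:

    #define STAT_THRESHOLD 32       /* assumed per-cpu fold-over threshold */

    static void __mod_zone_page_state(struct zone *zone,
                                      enum zone_stat_item item, int delta)
    {
            /* per-cpu differential for this zone/item */
            s8 *p = &zone_pcp(zone, smp_processor_id())->vm_stat_diff[item];
            long x = delta + *p;

            if (unlikely(x > STAT_THRESHOLD || x < -STAT_THRESHOLD)) {
                    /* fold into the zone counter and the global vm_stat[] */
                    atomic_long_add(x, &zone->vm_stat[item]);
                    atomic_long_add(x, &vm_stat[item]);
                    x = 0;
            }
            *p = x;
    }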

28 Jun, 2006

5 commits

  • Runtime debugging functionality for rt-mutexes.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Add debug_check_no_locks_freed(), as a central inline to add
    bad-lock-free-debugging functionality to.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Make notifier_blocks associated with cpu_notifier __cpuinitdata (see the
    sketch after this entry).

    __cpuinitdata makes sure that the data is init-time only unless
    CONFIG_HOTPLUG_CPU is defined.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
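    An illustrative example of the convention (the foo_* names are
    hypothetical):

    static int __cpuinit foo_cpu_callback(struct notifier_block *nb,
                                          unsigned long action, void *hcpu)
    {
            /* react to CPU_UP_PREPARE, CPU_ONLINE, ... as needed */
            return NOTIFY_OK;
    }

    /* Discarded after init unless CONFIG_HOTPLUG_CPU is set. */
    static struct notifier_block foo_cpu_notifier __cpuinitdata = {
            .notifier_call = foo_cpu_callback,
    };

    static int __init foo_init(void)
    {
            register_cpu_notifier(&foo_cpu_notifier);
            return 0;
    }
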
  • In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
    band-aid solution to solve that problem. In the process, I undid all the
    changes you both were making to ensure that these notifiers were available
    only at init time (unless CONFIG_HOTPLUG_CPU is defined).

    We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
    XFS problem cleanly and makes the cpu notifiers available only at init time
    (unless CONFIG_HOTPLUG_CPU is defined).

    If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
    time.

    This patch reverts the notifier_call changes made in 2.6.17.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • Localize poison values into one header file for better documentation and
    easier/quicker debugging, and so that the same values won't be used for
    multiple purposes (see the sketch after this entry).

    Use these constants in core arch, mm, driver, and fs code.

    Signed-off-by: Randy Dunlap
    Acked-by: Matt Mackall
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: "David S. Miller"
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
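    An excerpt-style sketch of what such a header looks like; the values
    shown are the familiar slab and list-debugging poison constants, but
    treat the listing as illustrative rather than a quote of the new file:

    /* linux/poison.h (illustrative excerpt) */
    #ifndef _LINUX_POISON_H
    #define _LINUX_POISON_H

    /* ...slab poisoning... */
    #define POISON_INUSE    0x5a    /* for use-uninitialised poisoning */
    #define POISON_FREE     0x6b    /* for use-after-free poisoning */
    #define POISON_END      0xa5    /* end-byte of poisoning */

    /* ...list debugging... */
    #define LIST_POISON1    ((void *) 0x00100100)
    #define LIST_POISON2    ((void *) 0x00200200)

    #endif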

23 Jun, 2006

8 commits

  • - Move the comments for kmalloc to the right place; currently they are
    near __do_kmalloc

    - Comments for kzalloc

    - More detailed comments for kmalloc

    - Appearance of "kmalloc" and "kzalloc" man pages after "make mandocs"

    [rdunlap@xenotime.net: simplification]
    Signed-off-by: Paul Drynoff
    Acked-by: Randy Dunlap
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Drynoff
     
  • The SLAB bootstrap code assumes that the first two kmalloc caches created
    (the INDEX_AC and INDEX_L3 kmalloc caches) won't be off-slab. But due to
    the AC and L3 structure size increase under lockdep, one of them ended up
    being off-slab, and subsequently crashing with:

    Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
    [] kmem_cache_alloc+0x26/0x7d

    The fix is to introduce a bootstrap flag and to use it to prevent off-slab
    caches from being created so early during bootup (see the sketch after
    this entry).

    (The calculation for off-slab caches is quite complex, so I didn't want to
    complicate things by introducing yet another INDEX_ calculation; the flag
    approach is simpler and smaller.)

    Signed-off-by: Ingo Molnar
    Cc: Manfred Spraul
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
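    A sketch of the flag approach, reconstructed rather than quoted; the
    slab_early_init name and the size threshold are assumptions:

    static int slab_early_init = 1; /* cleared once the kmalloc caches exist */

    /* inside kmem_cache_create(): */
    if (size >= (PAGE_SIZE >> 3) && !slab_early_init &&
        !(flags & SLAB_CACHE_DMA))
            /* Size is large; place the slab management object off-slab
             * (allows better packing), but never during early bootstrap. */
            flags |= CFLGS_OFF_SLAB;
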
  • Passing an invalid pointer to kfree() and kmem_cache_free() is likely to
    cause bad memory corruption or even take down the whole system, because
    the bad pointer is likely reused immediately due to the per-CPU caches.
    Until now, we did not do any verification for this if CONFIG_DEBUG_SLAB
    was disabled.

    As suggested by Linus, add a PageSlab check to page_to_cache() and
    page_to_slab() to verify pointers passed to kfree(). Also, move the
    stronger check from cache_free_debugcheck() to kmem_cache_free() to
    ensure that the passed pointer actually belongs to the cache we're about
    to free the object to (see the sketch after this entry).

    For page_to_cache() and page_to_slab(), the assertions should have
    virtually no extra cost (two instructions, no data cache pressure), and
    for kmem_cache_free() the overhead should be minimal.

    Signed-off-by: Pekka Enberg
    Cc: Manfred Spraul
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
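    A sketch of where the checks land; the helpers below mirror the
    page_to_cache()/page_to_slab() idea (cache and slab pointers stored in
    page->lru), with illustrative _sketch names:

    static inline struct kmem_cache *virt_to_cache_sketch(const void *obj)
    {
            struct page *page = virt_to_page(obj);

            if (unlikely(PageCompound(page)))
                    page = (struct page *)page_private(page);
            BUG_ON(!PageSlab(page));        /* catches kfree() of a bogus pointer */
            return (struct kmem_cache *)page->lru.next;
    }

    void kmem_cache_free_sketch(struct kmem_cache *cachep, void *objp)
    {
            /* stronger check: the object must belong to this very cache */
            BUG_ON(virt_to_cache_sketch(objp) != cachep);
            /* ... actual free ... */
    }
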
  • At present our slab debugging tells us that it detected a double-free or
    corruption - it does not distinguish between them. Sometimes it's useful
    to be able to differentiate between these two types of information.

    Add double-free detection to the redzone verification when freeing an
    object (see the sketch after this entry). As explained by Manfred, when
    we are freeing an object, both redzones should be RED_ACTIVE. However,
    if both are RED_INACTIVE, we are trying to free an object that was
    already freed.

    Signed-off-by: Manfred Spraul
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
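    A sketch of the redzone check described above; dbg_redzone1/2(),
    slab_error() and the RED_* constants are mm/slab.c debug names, and the
    body is a reconstruction rather than the committed hunk:

    static void verify_redzone_free(struct kmem_cache *cache, void *obj)
    {
            unsigned long redzone1 = *dbg_redzone1(cache, obj);
            unsigned long redzone2 = *dbg_redzone2(cache, obj);

            if (redzone1 == RED_ACTIVE && redzone2 == RED_ACTIVE)
                    return;                 /* normal free */

            if (redzone1 == RED_INACTIVE && redzone2 == RED_INACTIVE)
                    slab_error(cache, "double free detected");
            else
                    slab_error(cache, "memory outside object was overwritten");
    }
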
  • Use the _entry variant everywhere to clean the code up a tiny bit.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The last ifdef addition hit the ugliness threshold on this function (see
    the sketch after this entry for the resulting shape), so:

    - rename the variable i to nr_pages so it's somewhat descriptive
    - remove the addr variable and do the page_address call at the very end
    - instead of ifdef'ing the whole alloc_pages_node call, just make the
    __GFP_COMP addition to flags conditional
    - rewrite the __GFP_COMP comment to make sense

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
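    A condensed sketch of the resulting shape of kmem_getpages(), with
    statistics and error handling omitted; treat it as an outline, not the
    committed code:

    static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
    {
            struct page *page;
            int nr_pages, i;

    #ifndef CONFIG_MMU
            /* Nommu uses slabs for process anonymous memory allocations and
             * thus requires __GFP_COMP to properly refcount higher-order
             * allocations. */
            flags |= __GFP_COMP;
    #endif
            flags |= cachep->gfpflags;

            page = alloc_pages_node(nodeid, flags, cachep->gfporder);
            if (!page)
                    return NULL;

            nr_pages = 1 << cachep->gfporder;
            for (i = 0; i < nr_pages; i++)
                    SetPageSlab(page + i);

            return page_address(page);      /* single page_address() call */
    }
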
  • Clean up the slab allocator page mapping a bit. The memory allocated for
    a slab is physically contiguous, so it is okay to assume the struct pages
    are too; kill the long-standing comment. Furthermore, rename
    set_slab_attr to slab_map_pages and add a comment explaining why it's
    needed.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Move alien object freeing to cache_free_alien() to reduce #ifdef clutter in
    __cache_free().

    Signed-off-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

03 Jun, 2006

1 commit

  • mm/slab.c's offslab_limit logic is totally broken.

    Firstly, "offslab_limit" is a global variable while it should either be
    calculated in situ or should be passed in as a parameter.

    Secondly, the more serious problem with it is that the condition for
    calculating it:

    if (!(OFF_SLAB(sizes->cs_cachep))) {
            offslab_limit = sizes->cs_size - sizeof(struct slab);
            offslab_limit /= sizeof(kmem_bufctl_t);

    is in total disconnect with the condition that makes use of it:

    /* More than offslab_limit objects will cause problems */
    if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit)
            break;

    but due to offslab_limit being a global variable this breakage was
    hidden - up until lockdep came along and perturbed the slab sizes
    sufficiently that the first off-slab cache would see a (non-calculated)
    zero value for offslab_limit and would panic with:

    kmem_cache_create: couldn't create cache size-512.

    Call Trace:
    [] show_trace+0x96/0x1c8
    [] dump_stack+0x13/0x15
    [] panic+0x39/0x21a
    [] kmem_cache_create+0x5a0/0x5d0
    [] kmem_cache_init+0x193/0x379
    [] start_kernel+0x17f/0x218
    [] _sinittext+0x263/0x26a

    Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-512'

    Paolo Ornati's config on x86_64 managed to trigger it.

    The fix is to move the calculation to the place that makes use of it (see
    the sketch after this entry). This also makes slab.o 54 bytes smaller.

    Btw., the check itself is quite silly. Its intention is to test whether
    the number of objects per slab would be higher than the number of slab
    control pointers possible. In theory it could be triggered: if someone
    tried to create a cache of 4-byte objects and explicitly requested
    CFLGS_OFF_SLAB. So I kept the check.

    Out of historic interest I checked how old this bug was, and it's
    ancient: 10 years old! It is the oldest hidden-and-then-truly-triggering
    bug I have ever seen fixed in the kernel!

    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
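    A sketch of the relocated calculation, i.e. computing offslab_limit right
    where it is used inside the per-gfporder estimation loop (a
    reconstruction, not the exact hunk):

    /* inside calculate_slab_order(), for each candidate gfporder: */
    if (flags & CFLGS_OFF_SLAB) {
            /* Maximum number of objects per slab for caches with off-slab
             * slab management: there must be room for one kmem_bufctl_t per
             * object in the separately allocated struct slab. */
            size_t offslab_limit = size - sizeof(struct slab);
            offslab_limit /= sizeof(kmem_bufctl_t);

            if (num > offslab_limit)
                    break;
    }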

16 May, 2006

2 commits

  • With CONFIG_NUMA set, kmem_cache_destroy() may fail and say "Can't
    free all objects." The problem is caused by sequences such as the
    following (suppose we are on a NUMA machine with two nodes, 0 and 1):

    * Allocate an object from cache on node 0.
    * Free the object on node 1. The object is put into node 1's alien
    array_cache for node 0.
    * Call kmem_cache_destroy(), which ultimately ends up in __cache_shrink().
    * __cache_shrink() does drain_cpu_caches(), which loops through all nodes.
    For each node it drains the shared array_cache and then handles the
    alien array_cache for the other node.

    However this means that node 0's shared array_cache will be drained,
    and then node 1 will move the contents of its alien[0] array_cache
    into that same shared array_cache. node 0's shared array_cache is
    never looked at again, so the objects left there will appear to be in
    use when __cache_shrink() calls __node_shrink() for node 0. So
    __node_shrink() will return 1 and kmem_cache_destroy() will fail.

    This patch fixes this by having drain_cpu_caches() do
    drain_alien_cache() on every node before it does drain_array() on the
    nodes' shared array_caches (see the sketch after this entry).

    The problem was originally reported by Or Gerlitz.

    Signed-off-by: Roland Dreier
    Acked-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Roland Dreier
     
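    A sketch of the reordered draining (all alien caches first, then the
    shared arrays); the helper names follow mm/slab.c but the body is a
    reconstruction:

    static void drain_cpu_caches(struct kmem_cache *cachep)
    {
            struct kmem_list3 *l3;
            int node;

            on_each_cpu(do_drain, cachep, 1, 1);    /* flush per-cpu arrays */

            /* Pass 1: push alien objects back to their home nodes... */
            for_each_online_node(node) {
                    l3 = cachep->nodelists[node];
                    if (l3 && l3->alien)
                            drain_alien_cache(cachep, l3->alien);
            }

            /* Pass 2: ...and only then drain each node's shared array. */
            for_each_online_node(node) {
                    l3 = cachep->nodelists[node];
                    if (l3)
                            drain_array(cachep, l3, l3->shared, 1, node);
            }
    }
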
  • slab_is_available() indicates that slab-based allocators are available
    for use (see the sketch after this entry). The SPARSEMEM code needs to
    know this, as it can be called at various times during the boot process.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
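    A sketch of the helper; g_cpucache_up and its FULL state are mm/slab.c
    bootstrap internals, named here from memory:

    /* Returns non-zero once the slab allocator is fully initialized. */
    int slab_is_available(void)
    {
            return g_cpucache_up == FULL;
    }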

11 Apr, 2006

3 commits

  • The earlier patch to consolidate mmu and nommu page allocation and
    refcounting by using compound pages for nommu allocations had a bug:
    kmalloc slabs whose pages were initially allocated by a non-__GFP_COMP
    allocator could be passed into mm/nommu.c kmalloc allocations which
    really wanted __GFP_COMP underlying pages. Fix that by having nommu pass
    __GFP_COMP to all higher-order slab allocations.

    Signed-off-by: Luke Yang
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luke Yang
     
  • Add a statistics counter which is incremented every time the alien cache
    overflows (see the sketch after this entry). The alien_cache limit is
    hardcoded to 12 right now. We can use these statistics to tune the alien
    cache if needed in the future.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
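    A sketch of where such a counter is bumped; STATS_INC_ACOVERFLOW and the
    array_cache fields follow mm/slab.c conventions and are shown for
    illustration:

    /* in cache_free_alien(), with the alien array_cache lock held: */
    if (unlikely(alien->avail == alien->limit)) {
            STATS_INC_ACOVERFLOW(cachep);   /* count the overflow */
            __drain_alien_cache(cachep, alien, nodeid);
    }
    alien->entry[alien->avail++] = objp;
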
  • Allocate off-slab slab descriptors from node-local memory (see the
    sketch after this entry).

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
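    An illustration of the change: allocate the off-slab slab management
    structure with a node-aware allocation (a sketch, not the exact hunk):

    /* in alloc_slabmgmt(), for CFLGS_OFF_SLAB caches: */
    slabp = kmem_cache_alloc_node(cachep->slabp_cache, local_flags, nodeid);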

26 Mar, 2006

7 commits

  • We have had this memory leak for a while now. The situation is complicated
    by the use of alloc_kmemlist() as a function to resize various caches by
    do_tune_cpucache().

    What we do here is first of all make sure that we deallocate properly in
    the loop over all the nodes.

    If we are just resizing caches then we can simply return with -ENOMEM if an
    allocation fails.

    If the cache is new then we need to rollback and remove all earlier
    allocations.

    We detect that a cache is new by checking if the link to the global cache
    chain has been set up. This is a bit hackish...

    (Also fix up the overly long lines that I added in the last patch...)

    Signed-off-by: Christoph Lameter
    Cc: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Inspired by Jesper Juhl's patch from today

    1. Get rid of err
    We never set it to anything but zero.

    2. Drop the CONFIG_NUMA stuff.
    There are definitions of alloc_alien_cache() and free_alien_cache()
    that do the right thing for the non-NUMA case.

    3. Better naming of variables.

    4. Remove redundant cachep->nodelists[node] expressions.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • __drain_alien_cache() currently drains objects by freeing them to the
    (remote) freelists of the original node. However, each node also has a
    shared list containing objects to be used on any processor of that node.
    We can avoid a number of remote node accesses by copying the pointers to
    the free objects directly into the remote shared array (see the sketch
    after this entry).

    And while we are at it: skip alien draining if the alien cache spinlock
    is already taken.

    Kiran reported that this is a performance benefit.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
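    A sketch combining both points above: hand objects to the remote node's
    shared array when possible, and skip the periodic drain if the alien lock
    is contended. Names follow mm/slab.c, bodies are reconstructions:

    static void __drain_alien_cache(struct kmem_cache *cachep,
                                    struct array_cache *ac, int node)
    {
            struct kmem_list3 *rl3 = cachep->nodelists[node];

            if (ac->avail) {
                    spin_lock(&rl3->list_lock);
                    /* Prefer copying into the remote shared array... */
                    if (rl3->shared)
                            transfer_objects(rl3->shared, ac, ac->limit);
                    /* ...whatever is left goes back to the remote freelists. */
                    free_block(cachep, ac->entry, ac->avail, node);
                    ac->avail = 0;
                    spin_unlock(&rl3->list_lock);
            }
    }

    static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3,
                           int node)
    {
            struct array_cache *ac = l3->alien ? l3->alien[node] : NULL;

            /* Don't wait on a contended lock during periodic reaping. */
            if (ac && ac->avail && spin_trylock_irq(&ac->lock)) {
                    __drain_alien_cache(cachep, ac, node);
                    spin_unlock_irq(&ac->lock);
            }
    }
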
  • transfer_objects() can be used to transfer objects between the various
    object caches of the slab allocator (see the sketch after this entry).
    It is currently only used during __cache_alloc() to retrieve elements
    from the shared array. We will be using it soon to transfer elements
    from the alien caches to the remote shared array.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
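    A sketch of the helper's core: move as many object pointers as both
    arrays allow in a single memcpy. The array_cache fields (avail, limit,
    touched, entry[]) are the usual mm/slab.c members:

    static int transfer_objects(struct array_cache *to,
                                struct array_cache *from, unsigned int max)
    {
            /* how many pointers 'from' can give and 'to' can take */
            int nr = min(min(from->avail, max), to->limit - to->avail);

            if (!nr)
                    return 0;

            memcpy(to->entry + to->avail, from->entry + from->avail - nr,
                   sizeof(void *) * nr);
            from->avail -= nr;
            to->avail += nr;
            to->touched = 1;
            return nr;
    }
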
  • Convert mm/ to use the new kmem_cache_zalloc allocator.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Introduce a memory-zeroing variant of kmem_cache_alloc (see the sketch
    after this entry). The allocator already exists in XFS and there are
    potential users for it, so this patch makes the allocator available to
    the general public.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
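    A sketch of the obvious implementation, assuming a slab-internal
    object-size accessor (called obj_size() here):

    void *kmem_cache_zalloc(struct kmem_cache *cachep, gfp_t flags)
    {
            void *objp = kmem_cache_alloc(cachep, flags);

            if (objp)
                    memset(objp, 0, obj_size(cachep));  /* zero the object */
            return objp;
    }
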
  • Implement /proc/slab_allocators. It produces output like:

    idr_layer_cache: 80 idr_pre_get+0x33/0x4e
    buffer_head: 2555 alloc_buffer_head+0x20/0x75
    mm_struct: 9 mm_alloc+0x1e/0x42
    mm_struct: 20 dup_mm+0x36/0x370
    vm_area_struct: 384 dup_mm+0x18f/0x370
    vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3
    vm_area_struct: 1 split_vma+0x5a/0x10e
    vm_area_struct: 11 do_brk+0x206/0x2e2
    vm_area_struct: 2 copy_vma+0xda/0x142
    vm_area_struct: 9 setup_arg_pages+0x99/0x214
    fs_cache: 8 copy_fs_struct+0x21/0x133
    fs_cache: 29 copy_process+0xf38/0x10e3
    files_cache: 30 alloc_files+0x1b/0xcf
    signal_cache: 81 copy_process+0xbaa/0x10e3
    sighand_cache: 77 copy_process+0xe65/0x10e3
    sighand_cache: 1 de_thread+0x4d/0x5f8
    anon_vma: 241 anon_vma_prepare+0xd9/0xf3
    size-2048: 1 add_sect_attrs+0x5f/0x145
    size-2048: 2 journal_init_revoke+0x99/0x302
    size-2048: 2 journal_init_revoke+0x137/0x302
    size-2048: 2 journal_init_inode+0xf9/0x1c4

    Cc: Manfred Spraul
    Cc: Alexander Nyberg
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Ravikiran Thirumalai
    Signed-off-by: Al Viro
    DESC
    slab-leaks3-locking-fix
    EDESC
    From: Andrew Morton

    Update for slab-remove-cachep-spinlock.patch

    Cc: Al Viro
    Cc: Manfred Spraul
    Cc: Alexander Nyberg
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Ravikiran Thirumalai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     

24 Mar, 2006

1 commit

  • The hook in the slab cache allocation path that handles cpuset memory
    spreading for tasks in cpusets with 'memory_spread_slab' enabled has a
    modest performance bug. The hook calls into the memory spreading handler
    alternate_node_alloc() if either of 'memory_spread_slab' or
    'memory_spread_page' is enabled, even though the handler does nothing
    (albeit harmlessly) for the page case.

    Fix: drop PF_SPREAD_PAGE from the set of flag bits that are used to
    trigger a call to alternate_node_alloc() (see the sketch after this
    entry).

    The page case is handled by separate hooks -- see the calls conditioned
    on cpuset_do_page_mem_spread() in mm/filemap.c.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
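    A sketch of the changed trigger in the slab allocation hot path, showing
    the flag set before and after the fix; surrounding details are omitted:

    /* before: also fired for PF_SPREAD_PAGE, which the handler ignores */
    if (unlikely(current->flags & (PF_SPREAD_PAGE | PF_SPREAD_SLAB | PF_MEMPOLICY)))
            objp = alternate_node_alloc(cachep, flags);

    /* after: only slab spreading (and mempolicy) reach alternate_node_alloc() */
    if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
            objp = alternate_node_alloc(cachep, flags);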