08 May, 2007

40 commits

  • Address spaces contain an allocation flag that specifies restrictions on the
    zone for pages placed in the mapping. For example, some devices may require
    pages to be allocated from a DMA zone, and block devices may not be able to
    use pages from HIGHMEM.

    Memory policies and the common use of page migration work only on the
    highest zone. If the address space does not allow allocation from the
    highest zone then the pages in the address space are not migratable simply
    because we can only allocate memory for a specified node if we allow
    allocation for the highest zone on each node.

    Acked-by: Hugh Dickins
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLOB doesn't calculate the correct page order when the page size is not 4KB.
    This patch fixes it by using get_order() instead of find_order(), which is
    SLOB's own version of get_order().

    Signed-off-by: Akinobu Mita
    Acked-by: Matt Mackall
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
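
The page-order calculation discussed in the SLOB fix can be pictured with a small userspace sketch. This is only an illustration, not the kernel's get_order(): the name order_for_size and the explicit page_shift parameter are assumptions, introduced so that the effect of a page size other than 4KB is visible.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: smallest order such that
 * ((size_t)1 << page_shift) << order >= size.
 * page_shift is log2 of the page size (12 for 4KB, 16 for 64KB).
 */
static int order_for_size(size_t size, unsigned int page_shift)
{
	int order = 0;

	assert(size > 0);
	/* Index of the last page the allocation would touch. */
	size = (size - 1) >> page_shift;
	/* Count how many doublings are needed to cover that index. */
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}
```

With a 4KB page (page_shift 12) an 8KB request needs order 1, while with a 64KB page (page_shift 16) the same request fits in order 0, which is the kind of difference a hard-coded 4KB assumption would get wrong.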
     
  • If hugetlbfs module_init() fails, hugetlbfs_vfsmount is not initialized and
    shmget() with SHM_HUGETLB flag will cause NULL pointer dereference.

    Signed-off-by: Akinobu Mita
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • There is no user remaining and I have never seen any use of that flag.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_CTOR_ATOMIC is never used, which is no surprise since I cannot imagine
    that one would want to do something serious in a constructor or destructor,
    in particular given that the slab allocators run with interrupts disabled.
    Actions in constructors and destructors are by their nature very limited
    and usually do not go beyond initializing variables and list operations.

    (The i386 pgd ctor and dtors do take a spinlock in constructor and
    destructor..... I think that is the furthest we go at this point.)

    There is no flag passed to the destructor so removing SLAB_CTOR_ATOMIC also
    establishes a certain symmetry.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code that manipulates
    the object.

    Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there were code in a constructor
    handling SLAB_DEBUG_INITIAL then it would have to be conditional on
    SLAB_DEBUG, otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLAB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand and there are easier ways to accomplish the
    same effect (i.e. add debug code before kfree).

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without removal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the hugetlbfs specific hacks in toplevel get_unmapped_area() now that
    all archs and hugetlbfs itself do the right thing for both cases.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: William Irwin
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: David Howells
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Grant Grundler
    Cc: Matthew Wilcox
    Cc: "David S. Miller"
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • The generic arch_get_unmapped_area() now handles MAP_FIXED. Now that all
    implementations have been fixed, change the toplevel get_unmapped_area() to
    call into arch or drivers for the MAP_FIXED case.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: David Howells
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Grant Grundler
    Cc: Matthew Wilcox
    Cc: "David S. Miller"
    Cc: William Irwin
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Generic hugetlb_get_unmapped_area() now handles MAP_FIXED by just calling
    prepare_hugepage_range()

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: William Irwin
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: David Howells
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Grant Grundler
    Cc: Matthew Wilcox
    Cc: "David S. Miller"
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
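
For a MAP_FIXED request on a huge page mapping, the range check that these hugetlb patches delegate to prepare_hugepage_range() can be pictured as an alignment test. The sketch below is an assumption about the general shape of such a check, not the kernel function: toy_prepare_hugepage_range, the hpage_size parameter and TOY_EINVAL are hypothetical names chosen for illustration.

```c
#include <assert.h>

#define TOY_EINVAL 22   /* hypothetical error number for this sketch */

/*
 * Hypothetical check: a fixed-address huge page mapping is accepted only
 * if both its start address and its length are multiples of the huge
 * page size. Returns 0 on success and -TOY_EINVAL otherwise.
 */
static int toy_prepare_hugepage_range(unsigned long addr, unsigned long len,
				      unsigned long hpage_size)
{
	unsigned long offset_mask;

	/* The mask trick below only works for power-of-two sizes. */
	assert(hpage_size != 0 && (hpage_size & (hpage_size - 1)) == 0);
	offset_mask = hpage_size - 1;

	if (len & offset_mask)
		return -TOY_EINVAL;
	if (addr & offset_mask)
		return -TOY_EINVAL;
	return 0;
}
```

A caller handling MAP_FIXED would run this check and, if it passes, return the address exactly as requested instead of searching for a free area.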
     
  • Handle MAP_FIXED in x86_64 arch_get_unmapped_area(), simple case, just return
    the address as passed in

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Handle MAP_FIXED in hugetlb_get_unmapped_area on sparc64 by just using
    prepare_hugepage_range()

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: William Irwin
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Handle MAP_FIXED in parisc arch_get_unmapped_area(), just return the address.
    We might want to also check for possible cache aliasing issues now that we get
    called in that case (like ARM or MIPS), leave a comment for the maintainers to
    pick up.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Kyle McMartin
    Cc: Grant Grundler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Handle MAP_FIXED in ia64 arch_get_unmapped_area() and
    hugetlb_get_unmapped_area(): just call prepare_hugepage_range() in the latter
    and is_hugepage_only_range() in the former.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: William Irwin
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Handle MAP_FIXED in i386 hugetlb_get_unmapped_area(), just call
    prepare_hugepage_range.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: William Irwin
    Cc: Andi Kleen
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Handle MAP_FIXED in arch_get_unmapped_area on frv. Trivial case, just return
    the address.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • ARM already had a case for MAP_FIXED in arch_get_unmapped_area() though it was
    not called before. Fix the comment to reflect that it will now be called.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Handle MAP_FIXED in alpha's arch_get_unmapped_area(), simple case, just return
    the address as passed in

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • The current get_unmapped_area code calls the f_ops->get_unmapped_area or the
    arch one (via the mm) only when MAP_FIXED is not passed. That makes it
    impossible for archs to impose proper constraints on regions of the virtual
    address space. To work around that, get_unmapped_area() then calls some
    hugetlbfs specific hacks.

    This causes several problems, among others:

    - It makes it impossible for a driver or filesystem to do the same thing
    that hugetlbfs does (for example, to allow a driver to use larger page sizes
    to map external hardware) if that requires applying a constraint on the
    addresses (constraining that mapping in certain regions and other mappings
    out of those regions).

    - Some archs like arm, mips, sparc, sparc64, sh and sh64 already want
    MAP_FIXED to be passed down in order to deal with aliasing issues. The code
    is there to handle it... but is never called.

    This series of patches moves the logic to handle MAP_FIXED down to the various
    arch/driver get_unmapped_area() implementations, and then changes the generic
    code to always call them. The hugetlbfs hacks then disappear from the generic
    code.

    Since I need to do some special 64K pages mappings for SPEs on cell, I need to
    work around the first problem at least. I have further patches thus
    implementing a "slices" layer that handles multiple page sizes through slices
    of the address space for use by hugetlbfs, the SPE code, and possibly others,
    but it requires this series of patches first.

    There is still a potential (but not practical) issue due to the fact that
    filesystems/drivers implementing g_u_a will effectively bypass all arch checks.
    This is not an issue in practice as the only filesystems/drivers using that
    hook are doing so for arch specific purposes in the first place.

    There is also a problem with mremap that will completely bypass all arch
    checks. I'll try to address that separately, I'm not 100% certain yet how,
    possibly by making it not work when the vma has a file whose f_ops has a
    get_unmapped_area callback, and by making it use is_hugepage_only_range()
    before expanding into a new area.

    Also, I want to turn is_hugepage_only_range() into a more generic
    is_normal_page_range() as that's really what it will end up meaning when used
    in stack grow, brk grow and mremap.

    None of the above "issues" however are introduced by this patch, they are
    already there, so I think the patch can go in for 2.6.22.

    This patch:

    Handle MAP_FIXED in powerpc's arch_get_unmapped_area() in all 3
    implementations of it.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: William Irwin
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: David Howells
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Grant Grundler
    Cc: Matthew Wilcox
    Cc: "David S. Miller"
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
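
The control flow described in this series (always call the file's or the arch's hook, including for MAP_FIXED, then apply generic checks) can be sketched in userspace C. Everything here is hypothetical and simplified: the toy_ names, the flag and error values, the address-space limit and the driver window are assumptions for illustration and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAP_FIXED  0x10UL                  /* hypothetical flag value */
#define TOY_PAGE_SIZE  0x1000UL
#define TOY_TASK_SIZE  0x80000000UL            /* hypothetical address limit */
#define TOY_ENOMEM     ((unsigned long)-12L)
#define TOY_EINVAL     ((unsigned long)-22L)
#define TOY_IS_ERR(x)  ((x) >= (unsigned long)-4095L)

#define TOY_WIN_START  0x60000000UL            /* hypothetical driver window */
#define TOY_WIN_END    0x70000000UL

typedef unsigned long (*toy_gua_fn)(unsigned long addr, unsigned long len,
				    unsigned long flags);

/* Arch hook, "simple case": a fixed request returns the address as passed in. */
static unsigned long toy_arch_gua(unsigned long addr, unsigned long len,
				  unsigned long flags)
{
	assert(len != 0);	/* the top level rejects empty requests */
	if (flags & TOY_MAP_FIXED)
		return addr;
	return addr ? addr : 0x40000000UL;	/* stand-in for a real search */
}

/* Driver hook that only accepts fixed mappings inside its own window. */
static unsigned long toy_driver_gua(unsigned long addr, unsigned long len,
				    unsigned long flags)
{
	if (!(flags & TOY_MAP_FIXED))
		return TOY_WIN_START;
	if (addr < TOY_WIN_START || addr >= TOY_WIN_END ||
	    len > TOY_WIN_END - addr)
		return TOY_EINVAL;
	return addr;
}

/* Top level: the hook is called for MAP_FIXED too, then generic checks run. */
static unsigned long toy_get_unmapped_area(toy_gua_fn file_hook,
					   unsigned long addr,
					   unsigned long len,
					   unsigned long flags)
{
	toy_gua_fn fn = file_hook ? file_hook : toy_arch_gua;
	unsigned long ret;

	if (len == 0 || len > TOY_TASK_SIZE)
		return TOY_ENOMEM;
	ret = fn(addr, len, flags);
	if (TOY_IS_ERR(ret))
		return ret;
	if (ret > TOY_TASK_SIZE - len)
		return TOY_ENOMEM;
	if (ret & (TOY_PAGE_SIZE - 1))
		return TOY_EINVAL;
	return ret;
}
```

Because the hook now sees fixed requests, the hypothetical driver can refuse an address outside its window, which was not possible while the generic code short-circuited MAP_FIXED.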
     
  • Fixes a deadlock in the OOM killer for allocations that are not
    __GFP_HARDWALL.

    Before the OOM killer checks for the allocation constraint, it takes
    callback_mutex.

    constrained_alloc() iterates through each zone in the allocation zonelist
    and calls cpuset_zone_allowed_softwall() to determine whether an allocation
    for gfp_mask is possible. If a zone's node is not in the OOM-triggering
    task's mems_allowed, it is not exiting, and we did not fail on a
    __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take
    callback_mutex to check the nearest exclusive ancestor of current's cpuset.
    This results in deadlock.

    We now take callback_mutex after iterating through the zonelist since we
    don't need it yet.

    Cc: Andi Kleen
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Martin J. Bligh
    Signed-off-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
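
The lock-ordering problem in the OOM killer fix can be illustrated with a toy non-recursive lock. This is a sketch under stated assumptions, not the cpuset or OOM code: toy_mutex, toy_zone_allowed_softwall, toy_oom_old and toy_oom_new are hypothetical stand-ins, and a second acquisition by the same caller is reported as -1 where a real mutex would simply hang.

```c
#include <assert.h>

struct toy_mutex {
	int held;
};

/* Returns 0 on success, -1 where a real non-recursive mutex would deadlock. */
static int toy_mutex_lock(struct toy_mutex *m)
{
	if (m->held)
		return -1;
	m->held = 1;
	return 0;
}

static void toy_mutex_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = 0;
}

/* Stand-in for the softwall check, which may need the lock itself. */
static int toy_zone_allowed_softwall(struct toy_mutex *cb)
{
	if (toy_mutex_lock(cb))
		return -1;
	/* ... inspect the nearest exclusive ancestor here ... */
	toy_mutex_unlock(cb);
	return 1;
}

/* Old order: take the lock first, then scan the zonelist. */
static int toy_oom_old(struct toy_mutex *cb)
{
	int allowed;

	if (toy_mutex_lock(cb))
		return -1;
	allowed = toy_zone_allowed_softwall(cb);	/* re-locks: deadlock */
	toy_mutex_unlock(cb);
	return allowed;
}

/* New order: scan the zonelist first, take the lock afterwards. */
static int toy_oom_new(struct toy_mutex *cb)
{
	int allowed = toy_zone_allowed_softwall(cb);

	if (allowed < 0)
		return allowed;
	if (toy_mutex_lock(cb))
		return -1;
	/* ... pick a victim while holding the lock ... */
	toy_mutex_unlock(cb);
	return allowed;
}
```

In the old ordering the inner check fails because the lock is already held by the caller; in the new ordering both acquisitions succeed because they no longer nest.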
     
  • The current panic_on_oom may not work if there is a process using
    cpusets/mempolicy, because other nodes' memory may remain. But some people
    want failover by panic ASAP even if they are used. This patch adds a new
    setting for that request.

    This is tested on my ia64 box which has 3 nodes.

    Signed-off-by: Yasunori Goto
    Signed-off-by: Benjamin LaHaise
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: Ethan Solomita
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • Currently failslab injects failures into ____cache_alloc(). But with
    CONFIG_NUMA enabled that is not enough to make the actual slab allocator
    functions (kmalloc, kmem_cache_alloc, ...) return NULL.

    This patch moves the fault injection hook inside of __cache_alloc() and
    __cache_alloc_node(). These are on a lower call path than ____cache_alloc()
    and make it possible to inject failures into slab allocators with CONFIG_NUMA.

    Acked-by: Pekka Enberg
    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
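
Why the placement of the injection hook matters can be shown with a toy allocator that has two entry points. This is a minimal sketch, not the failslab implementation: the toy_ names and the "fail every Nth call" policy are assumptions. The point is only that a hook placed in one inner function misses callers that never reach it, while a hook in each entry point covers both.

```c
#include <assert.h>
#include <stdlib.h>

static unsigned int toy_fail_interval;	/* 0 disables injection */
static unsigned int toy_alloc_count;

/* Hypothetical policy: fail every toy_fail_interval-th allocation. */
static int toy_should_fail(void)
{
	if (!toy_fail_interval)
		return 0;
	return ++toy_alloc_count % toy_fail_interval == 0;
}

/* Inner fast path; the node-aware entry point below never reaches it. */
static void *toy_inner_alloc(size_t size)
{
	return malloc(size);
}

/* Entry point 1: the hook sits here rather than in toy_inner_alloc(). */
static void *toy_alloc(size_t size)
{
	if (toy_should_fail())
		return NULL;
	return toy_inner_alloc(size);
}

/* Entry point 2: node-aware path, covered by its own copy of the hook. */
static void *toy_alloc_node(size_t size, int node)
{
	assert(node >= 0);
	if (toy_should_fail())
		return NULL;
	return malloc(size);	/* a real allocator would pick node-local memory */
}
```

With toy_fail_interval set to 2, every second call fails regardless of which entry point is used.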
     
  • It is not necessary to tell the slab allocators to align to a cacheline
    if an explicit alignment was already specified. It is rather confusing
    to specify multiple alignments.

    Make sure that the call sites only use one form of alignment.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch provides a new macro

    KMEM_CACHE(<struct>, <flags>)

    to simplify slab creation. KMEM_CACHE creates a slab with the name of the
    struct, with the size of the struct and with the alignment of the struct.
    Additional slab flags may be specified if necessary.

    Example

    struct test_slab {
            int a, b, c;
            struct list_head list;
    } __cacheline_aligned_in_smp;

    test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC);

    will create a new slab named "test_slab" of the size sizeof(struct
    test_slab) and aligned to the alignment of struct test_slab. If it fails
    then we panic.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
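
How a macro can derive name, size and alignment from a struct tag can be tried out in userspace. The following is a sketch by analogy, not the kernel macro: TOY_KMEM_CACHE, toy_cache_create, struct toy_cache and TOY_SLAB_PANIC are hypothetical, and C11 _Alignof is used for the alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of what a cache was created with. */
struct toy_cache {
	const char *name;
	size_t size;
	size_t align;
	unsigned long flags;
};

static struct toy_cache toy_cache_create(const char *name, size_t size,
					 size_t align, unsigned long flags)
{
	struct toy_cache c = { name, size, align, flags };

	/* Alignments are expected to be powers of two. */
	assert(align != 0 && (align & (align - 1)) == 0);
	return c;
}

/* Stringify the tag for the name; take size and alignment from the type. */
#define TOY_KMEM_CACHE(struct_tag, cache_flags)			\
	toy_cache_create(#struct_tag, sizeof(struct struct_tag),	\
			 _Alignof(struct struct_tag), (cache_flags))

#define TOY_SLAB_PANIC 0x1UL	/* hypothetical flag value */

struct test_slab {
	int a, b, c;
	void *next;
};
```

A call such as TOY_KMEM_CACHE(test_slab, TOY_SLAB_PANIC) then yields a record named "test_slab" without the caller repeating the name, size or alignment by hand.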
     
  • This patch was recently posted to lkml and acked by Pekka.

    The flag SLAB_MUST_HWCACHE_ALIGN is

    1. Never checked by SLAB at all.

    2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB

    3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.

    The only remaining use is in sparc64 and ppc64 and their use there
    reflects some earlier role that the slab flag once may have had. If
    it is specified then SLAB_HWCACHE_ALIGN is also specified.

    The flag is confusing, inconsistent and has no purpose.

    Remove it.

    Acked-by: Pekka Enberg
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • invalidate_bdev() is superfluous when truncate_inode_pages() is also
    called. Do call invalidate_bh_lrus() though, to avoid stale pointers.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Remove duplicate work in kill_bdev().

    It currently invalidates and then truncates the bdev's mapping.
    invalidate_mapping_pages() will opportunistically remove pages from the
    mapping. And truncate_inode_pages() will forcefully remove all pages.

    The only thing truncate doesn't do is flush the bh lrus. So do that
    explicitly. This avoids very unlikely but possible invalid lookup
    results if the same bdev is quickly re-issued.

    It also will prevent extreme kernel latencies which are observed when
    blockdevs which have a large amount of pagecache are unmounted, by avoiding
    invalidate_mapping_pages() on that path. invalidate_mapping_pages() has no
    cond_resched (it can be called under spinlock), whereas truncate_inode_pages()
    has one.

    [akpm@linux-foundation.org: restore nrpages==0 optimisation]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
    been used in 6 years (so akpm says).

    find * -name \*.[ch] | xargs grep -l invalidate_bdev |
    while read file; do
    quilt add $file;
    sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
    done

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Avoid down_write of the mmap_sem in madvise when we can help it.

    Acked-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Matthias Kaehlcke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    matze
     
  • kmem_cache_create() for slob doesn't handle SLAB_PANIC.

    Signed-off-by: Matt Mackall
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • I ported this to sparc64 as per the patch below, tested on UP SunBlade1500 and
    24 cpu Niagara T1000.

    Signed-off-by: David S. Miller
    Signed-off-by: Christoph Lameter
    Cc: "David S. Miller"
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • On x86_64 this cuts allocation overhead for page table pages down to a
    fraction (kernel compile / editing load; TSC-based measurement of time spent
    in each function):

    no quicklist:

    pte_alloc   1569048   4.3s(401ns/2.7us/179.7us)
    pmd_alloc    780988   2.1s(337ns/2.7us/86.1us)
    pud_alloc    780072   2.2s(424ns/2.8us/300.6us)
    pgd_alloc    260022   1s(920ns/4us/263.1us)

    quicklist:

    pte_alloc    452436   573.4ms(8ns/1.3us/121.1us)
    pmd_alloc    196204   174.5ms(7ns/889ns/46.1us)
    pud_alloc    195688   172.4ms(7ns/881ns/151.3us)
    pgd_alloc     65228   9.8ms(8ns/150ns/6.1us)

    pgd allocations are the most complex and there we see the most dramatic
    improvement (maybe we can cut down the amount of pgds cached somewhat?). But
    even the pte allocations still see a doubling of performance.

    1. Proven code from the IA64 arch.

    The method used here has been fine tuned for years and
    is NUMA aware. It is based on the knowledge that accesses
    to page table pages are sparse in nature. Taking a page
    off the freelists instead of allocating a zeroed page
    allows a reduction of the number of cachelines touched
    in addition to getting rid of the slab overhead. So
    performance improves. This is particularly useful if pgds
    contain standard mappings. We can save on the teardown
    and setup of such a page if we have some on the quicklists.
    This includes avoiding list operations that are otherwise
    necessary on alloc and free to track pgds.

    2. Lightweight alternative to using slab to manage page-sized pages

    Slab overhead is significant and even page allocator use
    is pretty heavy weight. The use of a per cpu quicklist
    means that we touch only two cachelines for an allocation.
    There is no need to access the struct page (unless arch code
    needs to fiddle around with it). So the fast path just
    means bringing in one cacheline at the beginning of the
    page. That same cacheline may then be used to store the
    page table entry. Or a second cacheline may be used
    if the page table entry is not in the first cacheline of
    the page. The current code will zero the page which means
    touching 32 cachelines (assuming 128 byte). We get down
    from 32 to 2 cachelines in the fast path.

    3. x86_64 gets lightweight page table page management.

    This will allow x86_64 arch code to faster repopulate pgds
    and other page table entries. The list operations for pgds
    are reduced in the same way as for i386 to the point where
    a pgd is allocated from the page allocator and when it is
    freed back to the page allocator. A pgd can pass through
    the quicklists without having to be reinitialized.

    4. Consolidation of code from multiple arches

    So far arches have their own implementation of quicklist
    management. This patch moves that feature into the core allowing
    an easier maintenance and consistent management of quicklists.

    Page table pages have the characteristics that they are typically zero or in a
    known state when they are freed. This is usually exactly the same state as
    needed after allocation. So it makes sense to build a list of freed page
    table pages and then consume the pages already in use first. Those pages have
    already been initialized correctly (thus no need to zero them) and are likely
    already cached in such a way that the MMU can use them most effectively. Page
    table pages are used in a sparse way so zeroing them on allocation is not too
    useful.

    Such an implementation already exists for ia64. However, that implementation
    did not support constructors and destructors as needed by i386 / x86_64. It
    also only supported a single quicklist. The implementation here has
    constructor and destructor support as well as the ability for an arch to
    specify how many quicklists are needed.

    Quicklists are defined by an arch defining CONFIG_QUICKLIST. If more than one
    quicklist is necessary then we can define NR_QUICK for additional lists. F.e.
    i386 needs two and thus has

    config NR_QUICK
    int
    default 2

    If an arch has requested quicklist support then pages can be allocated
    from the quicklist (or from the page allocator if the quicklist is
    empty) via:

    quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)

    Page table pages can be freed using:

    quicklist_free(<quicklist-nr>, <destructor>, <page>)

    Pages must have a definite state after allocation and before
    they are freed. If no constructor is specified then pages
    will be zeroed on allocation and must be zeroed before they are
    freed.

    If a constructor is used then the constructor will establish
    a definite page state. F.e. the i386 and x86_64 pgd constructors
    establish certain mappings.

    Constructors and destructors can also be used to track the pages.
    i386 and x86_64 use a list of pgds in order to be able to dynamically
    update standard mappings.

    Signed-off-by: Christoph Lameter
    Cc: "David S. Miller"
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
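
The quicklist idea (keep freed, already-initialized pages on a simple list and hand them out again before asking the page allocator) can be modelled in userspace. This is a sketch under assumptions, not the kernel's quicklist code: the toy_ names are hypothetical, calloc() stands in for a zeroing page allocator, there is a single list rather than per-cpu lists, and it is assumed that the constructor runs only for fresh pages and the destructor only when a page leaves the list for good.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096

struct toy_quicklist {
	void *head;	/* first free page; its first word links to the next */
	int nr_pages;
};

/* Pop a page from the list, or fall back to a freshly zeroed page. */
static void *toy_quicklist_alloc(struct toy_quicklist *q, void (*ctor)(void *))
{
	void **p = q->head;

	if (p) {
		q->head = p[0];
		p[0] = NULL;	/* restore the word that was used as the link */
		q->nr_pages--;
		return p;
	}
	p = calloc(1, TOY_PAGE_SIZE);
	if (p && ctor)
		ctor(p);	/* only fresh pages need construction */
	return p;
}

/* Push a page that the caller has already returned to its initial state. */
static void toy_quicklist_free(struct toy_quicklist *q, void *page)
{
	void **p = page;

	p[0] = q->head;
	q->head = p;
	q->nr_pages++;
}

/* Give every cached page back to the underlying allocator. */
static void toy_quicklist_trim(struct toy_quicklist *q, void (*dtor)(void *))
{
	while (q->head) {
		void **p = q->head;

		q->head = p[0];
		p[0] = NULL;
		q->nr_pages--;
		if (dtor)
			dtor(p);
		free(p);
	}
	assert(q->nr_pages == 0);
}
```

A page that is freed and then allocated again comes back from the list without being zeroed or constructed a second time, which is where the saving described in the commit message comes from.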
     
  • Add the tool which gets reports about slabs to the VM documentation directory.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make sure that the check functions really only check things and do not
    perform activities. Extract the tracing and object seeding out of the two
    check functions and place them into slab_alloc and slab_free.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • At kmem_cache_shrink, check whether we have any empty slabs on the partial
    list and, if so, remove them.

    Also--as an anti-fragmentation measure--sort the partial slabs so that
    the most fully allocated ones come first and the least allocated last.

    The next allocations may fill up the nearly full slabs. Having the
    least allocated slabs last gives them the maximum chance that their
    remaining objects may be freed. Thus we can hopefully minimize the
    partial slabs.

    I think this is the best one can do in terms of antifragmentation
    measures. Real defragmentation (meaning moving objects out of slabs with
    the least free objects to those that are almost full) can be implemented
    by reverse scanning through the list produced here but that would mean
    that we need to provide a callback at slab cache creation that allows
    the deletion or moving of an object. This will involve slab API
    changes, so defer for now.

    Cc: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
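
The shrink step described in this commit (drop empty slabs, then order the partial list from fullest to emptiest) can be sketched on a plain array. This is an illustration, not the SLUB implementation: struct toy_slab and toy_shrink_partial are hypothetical, and an array sorted with qsort() stands in for the kernel's partial list.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_slab {
	int inuse;	/* objects currently allocated from this slab */
	int objects;	/* total objects the slab can hold */
};

/* Order slabs so that the most fully allocated ones come first. */
static int toy_cmp_fullest_first(const void *a, const void *b)
{
	const struct toy_slab *x = a;
	const struct toy_slab *y = b;

	return (y->inuse > x->inuse) - (y->inuse < x->inuse);
}

/*
 * Remove empty slabs from the partial array and sort the rest.
 * Returns the number of slabs kept.
 */
static size_t toy_shrink_partial(struct toy_slab *partial, size_t n)
{
	size_t i, kept = 0;

	assert(partial != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (partial[i].inuse > 0)
			partial[kept++] = partial[i];
	if (kept > 1)
		qsort(partial, kept, sizeof(*partial), toy_cmp_fullest_first);
	return kept;
}
```

Allocations that take slabs from the front of this order fill the nearly full slabs first, leaving the sparsely used ones at the end the best chance of becoming empty.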
     
  • This patch enables listing the callers who allocated or freed objects in a
    cache.

    For example to list the allocators for kmalloc-128 do

    cat /sys/slab/kmalloc-128/alloc_calls
    7 sn_io_slot_fixup+0x40/0x700
    7 sn_io_slot_fixup+0x80/0x700
    9 sn_bus_fixup+0xe0/0x380
    6 param_sysfs_setup+0xf0/0x280
    276 percpu_populate+0xf0/0x1a0
    19 __register_chrdev_region+0x30/0x360
    8 expand_files+0x2e0/0x6e0
    1 sys_epoll_create+0x60/0x200
    1 __mounts_open+0x140/0x2c0
    65 kmem_alloc+0x110/0x280
    3 alloc_disk_node+0xe0/0x200
    33 as_get_io_context+0x90/0x280
    74 kobject_kset_add_dir+0x40/0x140
    12 pci_create_bus+0x2a0/0x5c0
    1 acpi_ev_create_gpe_block+0x120/0x9e0
    41 con_insert_unipair+0x100/0x1c0
    1 uart_open+0x1c0/0xba0
    1 dma_pool_create+0xe0/0x340
    2 neigh_table_init_no_netlink+0x260/0x4c0
    6 neigh_parms_alloc+0x30/0x200
    1 netlink_kernel_create+0x130/0x320
    5 fz_hash_alloc+0x50/0xe0
    2 sn_common_hubdev_init+0xd0/0x6e0
    28 kernel_param_sysfs_setup+0x30/0x180
    72 process_zones+0x70/0x2e0

    cat /sys/slab/kmalloc-128/free_calls
    558
    3 sn_io_slot_fixup+0x600/0x700
    84 free_fdtable_rcu+0x120/0x260
    2 seq_release+0x40/0x60
    6 kmem_free+0x70/0xc0
    24 free_as_io_context+0x20/0x200
    1 acpi_get_object_info+0x3a0/0x3e0
    1 acpi_add_single_object+0xcf0/0x1e40
    2 con_release_unimap+0x80/0x140
    1 free+0x20/0x40

    SLAB_STORE_USER must be enabled for a slab cache by either booting with
    "slub_debug" or enabling user tracking specifically for the slab of interest.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We leave a minimum of partial slabs on nodes when we search for
    partial slabs on other nodes. Define a constant for that value.

    Then modify slub to keep MIN_PARTIAL slabs around.

    This avoids bad situations where a function frees the last object
    in a slab (which results in the page being returned to the page
    allocator) only to then allocate one again (which requires getting
    a page back from the page allocator if the partial list was empty).
    Keeping a couple of slabs on the partial list reduces overhead.

    Empty slabs are added to the end of the partial list to ensure that
    partially allocated slabs are consumed first (defragmentation).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
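
The effect of keeping a minimum number of slabs on the partial list can be shown with simple counters. This is a toy model, not SLUB: struct toy_node and the two functions are hypothetical, the value 2 for TOY_MIN_PARTIAL is an assumption, and the model treats an emptied slab as not yet counted on the partial list.

```c
#include <assert.h>

#define TOY_MIN_PARTIAL 2	/* assumed threshold for this sketch */

struct toy_node {
	int nr_partial;		/* slabs currently kept on the partial list */
	int pages_returned;	/* slabs handed back to the page allocator */
};

/*
 * Called when the last object of a slab has been freed.
 * Returns 1 if the empty slab is kept (at the tail of the partial list),
 * 0 if it is returned to the page allocator.
 */
static int toy_slab_emptied(struct toy_node *n)
{
	assert(n->nr_partial >= 0);
	if (n->nr_partial < TOY_MIN_PARTIAL) {
		n->nr_partial++;
		return 1;
	}
	n->pages_returned++;
	return 0;
}

/*
 * Called when an allocation needs a slab.
 * Returns 1 if a partial slab could be reused, 0 if a fresh page was needed.
 */
static int toy_take_slab(struct toy_node *n, int *fresh_pages)
{
	if (n->nr_partial > 0) {
		n->nr_partial--;
		return 1;
	}
	(*fresh_pages)++;
	return 0;
}
```

A free immediately followed by an allocation is then served from the retained slab instead of bouncing a page through the page allocator.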
     
  • This enables validation of slab. Validation means that all objects are
    checked to see if there are redzone violations, if padding has been
    overwritten or any pointers have been corrupted. Also checks the consistency
    of slab counters.

    Validation enables the detection of metadata corruption without the kernel
    having to execute code that actually uses (allocs/frees) an object. It
    allows one to make sure that the slab metainformation and the guard values
    around an object have not been compromised.

    A single slabcache can be checked by writing a 1 to the "validate" file.

    i.e.

    echo 1 >/sys/slab/kmalloc-128/validate

    or use the slabinfo tool to check all slabs

    slabinfo -v

    Error messages will show up in the syslog.

    Note that validation can only reach slabs that are on a list. This means that
    we are usually restricted to partial slabs and active slabs unless
    SLAB_STORE_USER is active which will build a full slab list and allows
    validation of slabs that are fully in use. Booting with "slub_debug" set will
    enable SLAB_STORE_USER and then full diagnostics are available.

    Note that we attempt to push cpu slabs back to the lists when we start the
    check. If the cpu slab is reactivated before we get to it (another processor
    grabs it before we get to it) then it cannot be checked.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
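
The redzone part of such a validation pass can be sketched as a scan over guard bytes placed after each object. This is a minimal illustration, not the SLUB validator: struct toy_object, the sizes and the 0xbb guard pattern are assumptions chosen for the example.

```c
#include <assert.h>
#include <string.h>

#define TOY_OBJ_SIZE     16
#define TOY_REDZONE_SIZE 8
#define TOY_RED_BYTE     0xbb	/* assumed guard pattern */

struct toy_object {
	unsigned char data[TOY_OBJ_SIZE];
	unsigned char redzone[TOY_REDZONE_SIZE];	/* must stay untouched */
};

/* Put the object into its initial state and seed the guard bytes. */
static void toy_object_init(struct toy_object *o)
{
	memset(o->data, 0, sizeof(o->data));
	memset(o->redzone, TOY_RED_BYTE, sizeof(o->redzone));
}

/* Returns 1 if the guard bytes are intact, 0 if something overwrote them. */
static int toy_validate(const struct toy_object *o)
{
	size_t i;

	assert(o != NULL);
	for (i = 0; i < sizeof(o->redzone); i++)
		if (o->redzone[i] != TOY_RED_BYTE)
			return 0;
	return 1;
}
```

The check only reads memory, so it can be run over every object on a list without allocating or freeing anything, which mirrors the property the commit message highlights.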
     
  • If slab tracking is on then build a list of full slabs so that we can verify
    the integrity of all slabs and are also able to build a list of alloc/free
    callers.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter