29 Jul, 2008

1 commit

  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In the GRU case there's no
    actual secondary pte and there's only a secondary tlb, because the GRU
    secondary MMU has no knowledge of sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case for KVM, where a secondary tlb miss will
    walk sptes in hardware and will refill the secondary tlb transparently
    to software if the corresponding spte is present). Just as
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped while they're mapped by any spte,
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to the page being freed when the pin is
    released (thus requiring the same complex and relatively slow tlb_gather
    smp-safe logic we have in zap_page_range, which can be avoided completely
    if the spte unmap event doesn't require an unpin of the page previously
    mapped in the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU,
    so that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it spares the code that tears down
    the secondary MMU mappings from implementing tlb_gather-like logic as in
    zap_page_range, which would require many IPIs to flush the other cpus'
    tlbs for each fixed number of sptes unmapped.

    To give an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect), the secondary MMU mappings will
    be invalidated, and the next secondary-mmu page fault will call
    get_user_pages; if it calls get_user_pages with write=1 that triggers a
    do_wp_page, and an updated spte or secondary-tlb mapping is re-established
    on the copied page. Or it will set up a readonly spte or readonly tlb
    mapping if it's a guest read and it calls get_user_pages with write=0.
    This is just an example.
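
    To make the fault path above more concrete, here is a hedged sketch of
    how a GRU-style secondary MMU might resolve a miss; the helper name and
    error handling are made up for illustration and are not part of this
    patch, while get_user_pages, mmap_sem and put_page are the real kernel
    interfaces of this era:

    /*
     * Illustrative only: resolve a secondary tlb miss by asking the
     * primary MMU for the page.  write=1 breaks COW via do_wp_page() so
     * the new spte/secondary-tlb entry maps the copied page; write=0
     * yields a readonly mapping.
     */
    static struct page *secondary_mmu_fault(struct mm_struct *mm,
                                            unsigned long address, int write)
    {
            struct page *page;
            int ret;

            down_read(&mm->mmap_sem);
            ret = get_user_pages(current, mm, address, 1, write, 0,
                                 &page, NULL);
            up_read(&mm->mmap_sem);
            if (ret != 1)
                    return NULL;
            /* caller installs the spte/secondary-tlb entry, then put_page() */
            return page;
    }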

    This allows any page pointed to by any pte (and in turn visible in the
    primary CPU MMU) to be mapped into a secondary MMU (be it a pure tlb like
    GRU, or a full MMU with both sptes and a secondary tlb like kvm's
    shadow-pagetable layer), or handled by a software remote DMA scheme like
    XPMEM (hence the need to schedule in XPMEM code to send the invalidate to
    the remote node, while there's no need to schedule in kvm/gru as it's an
    immediate event, like invalidating a primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track of whether the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter increased in range_begin and
    decreased in range_end (a sketch of this counting scheme follows this
    list). No secondary MMU page fault is allowed to map any spte or
    secondary tlb reference while the VM is in the middle of range_begin/end,
    as any page returned by get_user_pages in that critical section could
    later be freed immediately without any further ->invalidate_page
    notification (invalidate_range_begin/end works on ranges and
    ->invalidate_page isn't called immediately before freeing the page). To
    stop all page freeing and pagetable overwrites, the mmap_sem must be
    taken in write mode and all other anon_vma/i_mmap locks must be taken
    too.

    2) It'd be a waste to add branches in the VM if nobody could possibly run
    KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu
    notifiers, but this already allows a KVM external module to be compiled
    against a kernel with mmu notifiers enabled, and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM, GRU and XPMEM are
    all =n.
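
    As referenced in 1) above, here is a hedged sketch of that counting
    scheme as a secondary MMU driver might implement it; the struct, field
    and function names are illustrative and not taken from this patch, and
    real code needs appropriate locking and memory barriers:

    /* Illustrative only: block spte instantiation during invalidations. */
    struct secondary_mmu {
            atomic_t invalidate_count;    /* ++ in range_begin, -- in range_end */
            unsigned long invalidate_seq; /* bumped when an invalidation ends */
    };

    static void my_invalidate_range_begin(struct secondary_mmu *smmu)
    {
            atomic_inc(&smmu->invalidate_count);
    }

    static void my_invalidate_range_end(struct secondary_mmu *smmu)
    {
            smmu->invalidate_seq++;
            atomic_dec(&smmu->invalidate_count);
    }

    /*
     * Secondary MMU fault path: "seq" was sampled before calling
     * get_user_pages.  Refuse to map the page (and retry the fault) if an
     * invalidation is in flight or completed in the meantime, because the
     * page returned by get_user_pages may be about to be freed.
     */
    static int may_map_spte(struct secondary_mmu *smmu, unsigned long seq)
    {
            if (atomic_read(&smmu->invalidate_count))
                    return 0;
            if (smmu->invalidate_seq != seq)
                    return 0;
            return 1;
    }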

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used when a driver starts up, a failure can be handled gracefully.
    Here is an example of the change applied to kvm to register the mmu
    notifiers. Usually when a driver starts up, other allocations are
    required anyway and -ENOMEM failure paths already exist.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    the kernel from compiling after these changes on non-x86 archs (x86
    didn't need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

26 Jul, 2008

5 commits

  • Now that we have the core_state->dumper list we can use it to wake up the
    sub-threads waiting for the coredump completion.

    This uglifies the code and .text grows by 47 bytes, but on the other hand
    mm_struct shrinks by sizeof(struct completion). Also, with this change we
    can decouple exit_mm() from the coredumping code.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • binfmt->core_dump() has to iterate over all threads in the system in
    order to find the coredumping threads and construct the list using
    GFP_ATOMIC allocations.

    With this patch each thread allocates the list node on exit_mm()'s stack and
    adds itself to the list.
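
    For orientation, a minimal sketch of the structures involved (this is
    the shape the series converges on in mm_types.h; details may differ
    slightly from this exact commit):

    /* one node per dumping thread, living on that thread's kernel stack */
    struct core_thread {
            struct task_struct *task;
            struct core_thread *next;
    };

    struct core_state {
            atomic_t nr_threads;            /* threads still to check in */
            struct core_thread dumper;      /* head: the coredumping thread */
            struct completion startup;
    };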

    This allows us to do further changes:

    - simplify ->core_dump()

    - change exit_mm() to clear ->mm first, then wait for ->core_done;
    this makes the coredumping process visible to oom_kill

    - kill mm->core_done

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Turn core_state->nr_threads into atomic_t and kill now unneeded
    down_write(&mm->mmap_sem) in exit_mm().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move mm->core_waiters into "struct core_state" allocated on stack. This
    shrinks mm_struct a little bit and allows further changes.

    This patch mostly does s/core_waiters/core_state. The only essential
    change is that coredump_wait() must clear mm->core_state before return.

    coredump_wait()'s path is uglified and .text grows by 30 bytes; this is
    fixed by the next patch.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • mm->core_startup_done points to "struct completion startup_done" allocated
    on the coredump_wait()'s stack. Introduce the new structure, core_state,
    which holds this "struct completion". This way we can add more info
    visible to the threads participating in coredump without enlarging
    mm_struct.

    No changes in affected .o files.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

13 May, 2008

1 commit

  • When mm destruction happens, we should pass mm_update_next_owner() the
    old mm. But unfortunately the new mm is passed in exec_mmap().

    Thus, a kernel panic is possible when a multi-threaded process calls
    exec().
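
    Roughly, the fix amounts to handing exec_mmap()'s saved old_mm to the
    owner-update helper instead of the incoming mm (a sketch, not the
    literal hunk):

    	/* in exec_mmap(), after current has switched over to the new mm: */
    -       mm_update_next_owner(mm);       /* wrong: the freshly exec'd mm */
    +       mm_update_next_owner(old_mm);   /* the mm being torn down */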

    Also, the comment describing the owner member is wrong: mm->owner does
    not necessarily point to the thread group leader.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Cc: "Paul Menage"
    Cc: "KAMEZAWA Hiroyuki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

29 Apr, 2008

3 commits

  • The kernel implements readlink of /proc/pid/exe by getting the file from
    the first executable VMA. Then the path to the file is reconstructed and
    reported as the result.

    Because of the VMA walk the code is slightly different on nommu systems.
    This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    walking the VMAs to find the first executable file-backed VMA we store a
    reference to the exec'd file in the mm_struct.

    That reference would prevent the filesystem holding the executable file
    from being unmounted even after unmapping the VMAs. So we track the number
    of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    unmapped. This avoids pinning the mounted filesystem.
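
    A hedged sketch of that counting idea (my reading of the change; helper
    and field names may not match the merged code exactly, and callers are
    assumed to hold mmap_sem for writing):

    /* a VM_EXECUTABLE vma was added to the mm */
    void added_exe_file_vma(struct mm_struct *mm)
    {
            mm->num_exe_file_vmas++;
    }

    /* a VM_EXECUTABLE vma went away; drop the file when the last one goes */
    void removed_exe_file_vma(struct mm_struct *mm)
    {
            mm->num_exe_file_vmas--;
            if (mm->num_exe_file_vmas == 0 && mm->exe_file) {
                    fput(mm->exe_file);
                    mm->exe_file = NULL;
            }
    }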

    [akpm@linux-foundation.org: improve comments]
    [yamamoto@valinux.co.jp: fix dup_mmap]
    Signed-off-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc:"Eric W. Biederman"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Hugh Dickins
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Remove the mem_cgroup member from mm_struct and instead add an owner.

    This approach was suggested by Paul Menage. The advantage of this
    approach is that, once the mm->owner is known, the cgroup can be
    determined using the subsystem id. It also allows several control groups
    that are virtually grouped by mm_struct to exist independently of the
    memory controller, i.e., without adding a mem_cgroup-style pointer for
    each controller to mm_struct.
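
    Concretely, a hedged sketch of the lookup this enables (names are
    approximate and the mem_cgroup internals are omitted; callers must hold
    rcu_read_lock()):

    static struct mem_cgroup *mem_cgroup_from_mm(struct mm_struct *mm)
    {
            struct task_struct *owner = rcu_dereference(mm->owner);

            return container_of(task_subsys_state(owner, mem_cgroup_subsys_id),
                                struct mem_cgroup, css);
    }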

    A new config option CONFIG_MM_OWNER is added and the memory resource
    controller selects this config option.

    This patch also adds cgroup callbacks to notify subsystems when mm->owner
    changes. The mm_cgroup_changed callback is invoked with the task_lock()
    of the new task held, just prior to changing the mm->owner.

    I am indebted to Paul Menage for the several reviews of this patchset and
    helping me make it lighter and simpler.

    This patch was tested on a powerpc box; it was compiled with the
    MM_OWNER config both turned on and off.

    After the thread group leader exits, it's moved to init_css_set by
    cgroup_exit(), thus all future charges from running threads would be
    redirected to the init_css_set's subsystem.

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Hirokazu Takahashi
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Pekka Enberg
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: pack objects denser
    slub: Calculate min_objects based on number of processors.
    slub: Drop DEFAULT_MAX_ORDER / DEFAULT_MIN_OBJECTS
    slub: Simplify any_slab_object checks
    slub: Make the order configurable for each slab cache
    slub: Drop fallback to page allocator method
    slub: Fallback to minimal order during slab page allocation
    slub: Update statistics handling for variable order slabs
    slub: Add kmem_cache_order_objects struct
    slub: for_each_object must be passed the number of objects in a slab
    slub: Store max number of objects in the page struct.
    slub: Dump list of objects not freed on kmem_cache_close()
    slub: free_list() cleanup
    slub: improve kmem_cache_destroy() error message
    slob: fix bug - when slob allocates "struct kmem_cache", it does not force alignment.

    Linus Torvalds
     

27 Apr, 2008

1 commit

  • Split the inuse field up so that the number of objects in the page can be
    stored in the page struct as well. This is necessary if we want to have
    pages of various orders for a slab. It also avoids touching struct
    kmem_cache cachelines in __slab_alloc().
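
    A rough standalone sketch of the layout this describes (not the kernel's
    actual struct page, which shares this space with several other unions):

    struct slab_page_fields {
            union {
                    atomic_t _mapcount;     /* non-slab pages */
                    struct {                /* SLUB */
                            u16 inuse;      /* objects currently allocated */
                            u16 objects;    /* total objects in this slab page */
                    };
            };
    };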

    Update diagnostic code to check the number of objects and make sure that
    the number of objects always stays within the bounds of a 16 bit unsigned
    integer.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

05 Mar, 2008

1 commit

  • Rename Memory Controller to Memory Resource Controller. Reflect the same
    changes in the CONFIG definition for the Memory Resource Controller. Group
    together the config options for Resource Counters and Memory Resource
    Controller.

    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

04 Mar, 2008

1 commit

  • This only made sense for the alternate fastpath which was reverted last week.

    Mathieu is working on a new version that addresses the fastpath issues but that
    new code first needs to go through mm and it is not clear if we need the
    unique end pointers with his new scheme.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Christoph Lameter

    Christoph Lameter
     

08 Feb, 2008

2 commits

  • We use a NULL pointer on freelists to signal that there are no more
    objects. However, the NULL pointers of all slabs are identical, in
    contrast to the pointers to the real objects, which lie in different
    ranges for different slab pages.

    Change the end pointer to be a pointer to the first object and set bit 0.
    Every slab will then have a different end pointer. This is necessary to ensure
    that end markers can be matched to the source slab during cmpxchg_local.
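
    A hedged sketch of the end-marker idea (helper names are illustrative,
    not the commit's): the marker is the slab's first-object address with
    bit 0 set, so every slab page gets a distinct end pointer that
    cmpxchg_local can tell apart from other slabs' markers.

    static inline void *make_end(void *first_object)
    {
            return (void *)((unsigned long)first_object | 1UL);
    }

    static inline int is_end(const void *freelist_entry)
    {
            return (unsigned long)freelist_entry & 1UL;
    }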

    Bring back the use of the mapping field by SLUB, since we would otherwise
    have to call the relatively expensive function page_address() in
    __slab_alloc(). Use of the mapping field allows avoiding a call to
    page_address() in various other functions as well.

    There is no need to change the page_mapping() function since bit 0 is set
    on the mapping, as it is for anonymous pages. page_mapping(slab_page)
    will therefore still return NULL although the mapping field is
    overloaded.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • Basic setup routines: the mm_struct has a pointer to the cgroup that it
    belongs to, and the page has a page_cgroup associated with it.
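
    A rough sketch of the association being introduced (field names are
    approximate, not the patch's exact layout):

    struct page_cgroup {
            struct list_head lru;           /* per-cgroup LRU linkage */
            struct page *page;              /* the page this accounting covers */
            struct mem_cgroup *mem_cgroup;  /* the charging cgroup */
    };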

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     

17 Oct, 2007

4 commits

  • include/asm-powerpc/elf.h has 6 entries in ARCH_DLINFO. fs/binfmt_elf.c
    has 14 unconditional NEW_AUX_ENT entries and 2 conditional NEW_AUX_ENT
    entries. So in the worst case, saved_auxv does not get an AT_NULL entry at
    the end.

    The saved_auxv array must be terminated with an AT_NULL entry. Make the
    size of mm_struct->saved_auxv arch-dependent, based on the number of
    ARCH_DLINFO entries.
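
    A hedged sketch of the sizing approach (macro shape only; the base count
    below is derived from the NEW_AUX_ENT numbers quoted above, not copied
    from the header):

    #ifndef AT_VECTOR_SIZE_ARCH
    #define AT_VECTOR_SIZE_ARCH 0   /* arches override this via ARCH_DLINFO */
    #endif
    #define AT_VECTOR_SIZE_BASE (14 + 2)    /* NEW_AUX_ENT entries in binfmt_elf.c */
    /* two words per aux entry, plus room for the terminating AT_NULL pair */
    #define AT_VECTOR_SIZE (2 * (AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))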

    Signed-off-by: Olaf Hering
    Cc: Roland McGrath
    Cc: Jakub Jelinek
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olaf Hering
     
  • We need the offset from the page struct during slab_alloc and slab_free. In
    both cases we also reference the cacheline of the kmem_cache_cpu structure.
    We can therefore move the offset field into the kmem_cache_cpu structure
    freeing up 16 bits in the page struct.

    Moving the offset allows an allocation from slab_alloc() without touching the
    page struct in the hot path.

    The only thing left in slab_free() that touches the page struct cacheline for
    per cpu freeing is the checking of SlabDebug(page). The next patch deals with
    that.

    Use the available 16 bits to broaden page->inuse. More than 64k objects per
    slab become possible and we can get rid of the checks for that limitation.

    No need anymore to shrink the order of slabs if we boot with 2M sized slabs
    (slub_min_order=9).

    No need anymore to switch off the offset calculation for very large slabs
    since the field in the kmem_cache_cpu structure is 32 bits and so the offset
    field can now handle slab sizes of up to 8GB.
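
    For reference, a hedged sketch of roughly what kmem_cache_cpu looks like
    after this change (field order and comments approximate):

    struct kmem_cache_cpu {
            void **freelist;        /* pointer to the next free per-cpu object */
            struct page *page;      /* the slab we are allocating from */
            int node;               /* node of that page (-1 for debug) */
            unsigned int offset;    /* free-pointer offset, in word units */
            unsigned int objsize;   /* object size, copied from kmem_cache */
    };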

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After moving the lockless_freelist to kmem_cache_cpu we no longer need
    page->lockless_freelist. Restructure the use of the struct page fields in
    such a way that we never touch the mapping field.

    This in turn allows us to remove the special casing of SLUB when
    determining the mapping of a page (needed for corner cases on machines
    with virtual caches that need to flush the caches of processors mapping
    a page).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move the definitions of struct mm_struct and struct vm_area_struct to
    include/mm_types.h. This allows more functions in asm/pgtable.h and
    friends to be defined with inline assemblies instead of macros. Compile
    tested on i386, powerpc, powerpc64, s390-32, s390-64 and x86_64.

    [aurelien@aurel32.net: build fix]
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Aurelien Jarno
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     

11 May, 2007

1 commit

  • Avoid atomic overhead in slab_alloc and slab_free

    SLUB needs to use the slab_lock for the per cpu slabs to synchronize with
    potential kfree operations. This patch avoids that need by moving all free
    objects onto a lockless_freelist. The regular freelist continues to exist
    and will be used to free objects. So while we consume the
    lockless_freelist the regular freelist may build up objects.

    If we are out of objects on the lockless_freelist then we may check the
    regular freelist. If it has objects then we move those over to the
    lockless_freelist and do this again. There are significant savings in
    terms of atomic operations that have to be performed.

    We can even free directly to the lockless_freelist if we know that we are
    running on the same processor. So this speeds up short lived objects.
    They may be allocated and freed without taking the slab_lock. This is
    particularly good for netperf.
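
    A hedged sketch of the allocation fastpath described above (simplified,
    with an illustrative helper; the real code lives in slab_alloc() and
    __slab_alloc()):

    /* the next-free pointer is stored inside each free object */
    static inline void *next_free(struct kmem_cache *s, void *object)
    {
            return *(void **)((char *)object + s->offset);
    }

    static void *fastpath_alloc(struct kmem_cache *s, struct page *page)
    {
            void **object = page->lockless_freelist;

            if (unlikely(!object)) {
                    /* slow path: take slab_lock and refill the
                     * lockless_freelist from the regular freelist */
                    return NULL;
            }

            page->lockless_freelist = next_free(s, object);
            return object;
    }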

    In order to maximize the effect of the new faster hotpath we extract the
    hottest performance pieces into inlined functions. These are then inlined
    into kmem_cache_alloc and kmem_cache_free. So hotpath allocation and
    freeing no longer requires a subroutine call within SLUB.

    [I am not sure that it is worth doing this because it changes the easy to
    read structure of slub just to reduce atomic ops. However, there is
    someone out there with a benchmark on 4 way and 8 way processor systems
    that seems to show a 5% regression vs. SLAB. It seems that the regression
    is due to SLUB's increased use of atomic operations vs. SLAB. I wonder if
    this is applicable or discernible at all in a real workload?]

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 May, 2007

1 commit

  • This is a new slab allocator which was motivated by the complexity of the
    existing code in mm/slab.c. It attempts to address a variety of concerns
    with the existing implementation.

    A. Management of object queues

    A particular concern was the complex management of the numerous object
    queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
    each allocating CPU and use objects from a slab directly instead of
    queueing them up.

    B. Storage overhead of object queues

    SLAB object queues exist per node, per CPU. The alien cache queue even
    has a queue array that contains a queue for each processor on each
    node. For very large systems the number of queues and the number of
    objects that may be caught in those queues grows exponentially. On our
    systems with 1k nodes / processors we have several gigabytes just tied up
    for storing references to objects for those queues. This does not include
    the objects that could be on those queues. One fears that the whole
    memory of the machine could one day be consumed by those queues.

    C. SLAB meta data overhead

    SLAB has overhead at the beginning of each slab. This means that data
    cannot be naturally aligned at the beginning of a slab block. SLUB keeps
    all meta data in the corresponding page_struct. Objects can be naturally
    aligned in the slab. For example, a 128 byte object will be aligned at
    128 byte boundaries and can fit tightly into a 4k page with no bytes left
    over. SLAB cannot do this.

    D. SLAB has a complex cache reaper

    SLUB does not need a cache reaper for UP systems. On SMP systems
    the per CPU slab may be pushed back onto the partial list, but that
    operation is simple and does not require an iteration over a list
    of objects. SLAB expires per CPU, shared and alien object queues
    during cache reaping, which may cause strange holdoffs.

    E. SLAB has complex NUMA policy layer support

    SLUB pushes NUMA policy handling into the page allocator. This means that
    allocation is coarser (SLUB does interleave on a page level) but that
    situation was also present before 2.6.13. SLAB's application of
    policies to individual slab objects is certainly a performance concern
    due to the frequent references to memory policies, which may lead a
    sequence of objects to come from one node after another. SLUB will get a
    slab full of objects from one node and then will switch to the next.

    F. Reduction of the size of partial slab lists

    SLAB has per node partial lists. This means that over time a large
    number of partial slabs may accumulate on those lists. These can
    only be reused if allocations occur on specific nodes. SLUB has a global
    pool of partial slabs and will consume slabs from that pool to
    decrease fragmentation.

    G. Tunables

    SLAB has sophisticated tuning abilities for each slab cache. One can
    manipulate the queue sizes in detail. However, filling the queues still
    requires the use of the spin lock to check out slabs. SLUB has a global
    parameter (slub_min_order) for tuning. Increasing the minimum slab
    order can decrease the locking overhead. The bigger the slab order, the
    fewer movements of pages between per CPU and partial lists occur and the
    better SLUB will scale.

    H. Slab merging

    We often have slab caches with similar parameters. SLUB detects those
    on boot up and merges them into the corresponding general caches. This
    leads to more effective memory use. About 50% of all caches can
    be eliminated through slab merging. This will also decrease
    slab fragmentation because partially allocated slabs can be filled
    up again. Slab merging can be switched off by specifying
    slub_nomerge on boot up.

    Note that merging can expose heretofore unknown bugs in the kernel
    because corrupted objects may now be placed differently and corrupt
    differing neighboring objects. Enable sanity checks to find those.

    I. Diagnostics

    The current slab diagnostics are difficult to use and require a
    recompilation of the kernel. SLUB contains debugging code that
    is always available (but is kept out of the hot code paths).
    SLUB diagnostics can be enabled via the "slub_debug" option.
    Parameters can be specified to select a single slab cache or a group of
    slab caches for diagnostics. This means that the system is running
    with the usual performance and it is much more likely that
    race conditions can be reproduced.

    J. Resiliency

    If basic sanity checks are on then SLUB is capable of detecting
    common error conditions and recovering as best as possible to allow the
    system to continue.

    K. Tracing

    Tracing can be enabled via the slub_debug=T,<slab name> option
    during boot. SLUB will then log all actions on that slab cache
    and dump the object contents on free.

    L. On demand DMA cache creation.

    Generally DMA caches are not needed. If a kmalloc is used with
    __GFP_DMA then only the single slab cache that is needed is created.
    For systems that have no ZONE_DMA requirement the support is
    completely eliminated.

    M. Performance increase

    Some benchmarks have shown speed improvements on kernbench in the
    range of 5-10%. The locking overhead of slub is based on the
    underlying base allocation size. If we can reliably allocate
    larger order pages then it is possible to increase slub
    performance much further. The anti-fragmentation patches may
    enable further performance increases.

    Tested on:
    i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator

    SLUB Boot options

    slub_nomerge               Disable merging of slabs
    slub_min_order=x           Require a minimum order for slab caches. This
                               increases the managed chunk size and therefore
                               reduces meta data and locking overhead.
    slub_min_objects=x         Minimum objects per slab. Default is 8.
    slub_max_order=x           Avoid generating slabs larger than the order
                               specified.
    slub_debug                 Enable all diagnostics for all caches
    slub_debug=<options>       Enable selective options for all caches
    slub_debug=<options>,<slab name>
                               Enable selective options for a certain set of
                               caches

    Available Debug options
    F Double Free checking, sanity and resiliency
    R Red zoning
    P Object / padding poisoning
    U Track last free / alloc
    T Trace all allocs / frees (only use for individual slabs).
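
    For example (purely illustrative, combining entries from the tables
    above), booting with

    slub_debug=FP,dentry

    would enable double-free checking and poisoning for just the cache named
    dentry, assuming a cache with that name exists on the running kernel.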

    To use SLUB: Apply this patch and then select SLUB as the default slab
    allocator.

    [hugh@veritas.com: fix an oops-causing locking error]
    [akpm@linux-foundation.org: various stupid cleanups and small fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

27 Sep, 2006

1 commit

  • This moves the definition of struct page from mm.h to its own header file
    page-struct.h. This is a prereq to fix SetPageUptodate which is broken on
    s390:

    #define SetPageUptodate(_page)                                          \
            do {                                                            \
                    struct page *__page = (_page);                          \
                    if (!test_and_set_bit(PG_uptodate, &__page->flags))     \
                            page_test_and_clear_dirty(_page);               \
            } while (0)

    _page gets used twice in this macro, which can cause subtle bugs. Using
    __page for the page_test_and_clear_dirty call doesn't work either, since
    it causes yet another problem with the page_test_and_clear_dirty macro.

    In order to avoid all these problems caused by macros it seems to be a
    good idea to get rid of them and convert them to static inline functions.
    Because of header file include order it's necessary to have a separate
    header file for the struct page definition.
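
    For illustration, a hedged sketch of the static-inline form the text
    argues for (not the actual converted code):

    static inline void SetPageUptodate(struct page *page)
    {
            /* the argument is evaluated once, so the double-evaluation
             * problem of the macro above disappears */
            if (!test_and_set_bit(PG_uptodate, &page->flags))
                    page_test_and_clear_dirty(page);
    }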

    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens