08 Feb, 2008

6 commits

  • fix checkpatch --file mm/slub.c errors and warnings.

    $ q-code-quality-compare
                          errors   lines of code   errors/KLOC
    mm/slub.c [before]        22            4204           5.2
    mm/slub.c [after]          0            4210             0

    no code changed:

     text    data    bss     dec    hex  filename
    22195    8634    136   30965   78f5  slub.o.before
    22195    8634    136   30965   78f5  slub.o.after

    md5:
    93cdfbec2d6450622163c590e1064358 slub.o.before.asm
    93cdfbec2d6450622163c590e1064358 slub.o.after.asm

    [clameter: rediffed against Pekka's cleanup patch, omitted
    moves of the name of a function to the start of line]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Christoph Lameter

    Ingo Molnar
     
  • Slub can use the non-atomic version to unlock because other flags will not
    get modified with the lock held.
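
    A minimal sketch of the kind of change described, assuming the slab lock is
    a bit spinlock on PG_locked in page->flags (the helper names here are
    illustrative):

    #include <linux/bit_spinlock.h>
    #include <linux/page-flags.h>

    static __always_inline void slab_lock(struct page *page)
    {
            bit_spin_lock(PG_locked, &page->flags);
    }

    static __always_inline void slab_unlock(struct page *page)
    {
            /*
             * The non-atomic variant is sufficient here: no other page flag
             * is modified while the slab lock is held, so there is no
             * concurrent read-modify-write on page->flags to protect against.
             */
            __bit_spin_unlock(PG_locked, &page->flags);
    }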

    Signed-off-by: Nick Piggin
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton

    Nick Piggin
     
  • The statistics provided here allow the monitoring of allocator behavior but
    at the cost of some (minimal) loss of performance. Counters are placed in
    SLUB's per cpu data structure. The per cpu structure may be extended by the
    statistics to grow larger than one cacheline which will increase the cache
    footprint of SLUB.

    There is a compile option to enable/disable the inclusion of the runtime
    statistics; it is off by default.

    The slabinfo tool is enhanced to support these statistics via the
    following options:

    -D  Switches the line of information displayed for a slab from size
        mode to activity mode.

    -A  Sorts the slabs displayed by activity. This allows the display of
        the slabs most important to the performance of a certain load.

    -r  Reports detailed statistics on a slab (implied if a cache name is
        mentioned).

    Example (tbench load):

    slabinfo -AD ->Shows the most active slabs

    Name                  Objects      Alloc       Free  %Fast
    skbuff_fclone_cache        33  111953835  111953835  99 99
    :0000192                 2666    5283688    5281047  99 99
    :0001024                  849    5247230    5246389  83 83
    vm_area_struct           1349     119642     118355  91 22
    :0004096                   15      66753      66751  98 98
    :0000064                 2067      25297      23383  98 78
    dentry                  10259      28635      18464  91 45
    :0000080                11004      18950       8089  98 98
    :0000096                 1703      12358      10784  99 98
    :0000128                  762      10582       9875  94 18
    :0000512                  184       9807       9647  95 81
    :0002048                  479       9669       9195  83 65
    anon_vma                  777       9461       9002  99 71
    kmalloc-8                6492       9981       5624  99 97
    :0000768                  258       7174       6931  58 15

    So the skbuff_fclone_cache is of highest importance for the tbench load.
    There is also pretty high load on the 192-byte slab. Look for its aliases:

    slabinfo -a | grep 000192
    slabinfo :0000192       (-r option implied if a cache name is mentioned)

    .... Usual output ...

    Slab Perf Counter           Alloc       Free  %Al  %Fr
    ------------------------------------------------------
    Fastpath                111953360  111946981   99   99
    Slowpath                     1044       7423    0    0
    Page Alloc                    272        264    0    0
    Add partial                    25        325    0    0
    Remove partial                 86        264    0    0
    RemoteObj/SlabFrozen          350       4832    0    0
    Total                   111954404  111954404

    Flushes 49   Refill 0
    Deactivate Full=325(92%)  Empty=0(0%)  ToHead=24(6%)  ToTail=1(0%)

    Looks good because the fastpath is overwhelmingly taken.

    skbuff_head_cache:

    Slab Perf Counter           Alloc       Free  %Al  %Fr
    ------------------------------------------------------
    Fastpath                  5297262    5259882   99   99
    Slowpath                     4477      39586    0    0
    Page Alloc                    937        824    0    0
    Add partial                     0       2515    0    0
    Remove partial               1691        824    0    0
    RemoteObj/SlabFrozen         2621       9684    0    0
    Total                     5301739    5299468

    Deactivate Full=2620(100%)  Empty=0(0%)  ToHead=0(0%)  ToTail=0(0%)

    Descriptions of the output:

    Total: The total number of allocations and frees that occurred for a
    slab.

    Fastpath: The number of allocations/frees that used the fastpath.

    Slowpath: The number of allocations/frees that had to take the slowpath.

    Page Alloc: Number of calls to the page allocator as a result of slowpath
    processing

    Add Partial: Number of slabs added to the partial list through free or
    alloc (occurs during cpuslab flushes)

    Remove Partial: Number of slabs removed from the partial list as a result of
    allocations retrieving a partial slab or by a free freeing
    the last object of a slab.

    RemoteObj/Froz: RemoteObj: how many times a remotely freed object was
    encountered when a slab was about to be deactivated. Frozen: how many
    times free was able to skip list processing because the slab was in
    use as the cpuslab of another processor.

    Flushes: Number of times the cpuslab was flushed on request
    (kmem_cache_shrink, may result from races in __slab_alloc)

    Refill: Number of times we were able to refill the cpuslab from
    remotely freed objects for the same slab.

    Deactivate: Statistics on how slabs were deactivated and how they were
    put onto the partial list.

    In general, a dominant fastpath is very good. Slowpath operation without
    partial list processing is also acceptable. Any touching of the partial
    list uses node-specific locks, which may potentially cause list_lock
    contention.
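
    As a rough sketch of how such per cpu counters can be wired up (the item
    names follow the report above; the config symbol and the stat() helper are
    assumptions for illustration, not quoted from the patch):

    enum stat_item {
            ALLOC_FASTPATH,         /* Allocation from the cpu slab */
            ALLOC_SLOWPATH,         /* Allocation needed __slab_alloc */
            FREE_FASTPATH,          /* Free to the cpu slab */
            FREE_SLOWPATH,          /* Free needed __slab_free */
            NR_SLUB_STAT_ITEMS
    };

    struct kmem_cache_cpu {
            void **freelist;
            struct page *page;
    #ifdef CONFIG_SLUB_STATS
            unsigned stat[NR_SLUB_STAT_ITEMS];
    #endif
    };

    /* Compiles away completely when the statistics option is off. */
    static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
    {
    #ifdef CONFIG_SLUB_STATS
            c->stat[si]++;
    #endif
    }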

    Signed-off-by: Christoph Lameter

    Christoph Lameter
     
  • Provide an alternate implementation of the SLUB fast paths for alloc
    and free using cmpxchg_local. The cmpxchg_local fast path is selected
    for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only
    set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an
    interrupt enable/disable sequence. This is known to be true for both
    x86 platforms so set FAST_CMPXCHG_LOCAL for both arches.

    Currently another requirement for the fastpath is that the kernel is
    compiled without preemption. The restriction will go away with the
    introduction of a new per cpu allocator and new per cpu operations.

    The advantages of a cmpxchg_local based fast path are:

    1. Potentially lower cycle count (30%-60% faster)

    2. There is no need to disable and enable interrupts on the fast path.
    Currently interrupts have to be disabled and enabled on every
    slab operation. This is likely avoiding a significant percentage
    of interrupt off / on sequences in the kernel.

    3. The disposal of freed slabs can occur with interrupts enabled.

    The alternate path is realized using #ifdef's. Several attempts to do the
    same with macros and inline functions resulted in a mess (in particular due
    to the strange way that local_irq_save() handles its argument and due
    to the need to define macros/functions that sometimes disable interrupts
    and sometimes do something else).
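
    A hedged sketch of the spirit of the cmpxchg_local allocation fastpath
    (is_end(), node_match() and get_cpu_slab() follow the surrounding SLUB
    code of this period and are not spelled out here):

    #ifdef SLUB_FASTPATH
    static __always_inline void *slab_alloc(struct kmem_cache *s,
                    gfp_t gfpflags, int node, void *addr)
    {
            void **object;
            struct kmem_cache_cpu *c = get_cpu_slab(s, raw_smp_processor_id());

            /* Note: no local_irq_save()/local_irq_restore() pair needed. */
            do {
                    object = c->freelist;
                    if (unlikely(is_end(object) || !node_match(c, node))) {
                            object = __slab_alloc(s, gfpflags, node, addr, c);
                            break;
                    }
                    /* Retry if an interrupt changed c->freelist under us. */
            } while (cmpxchg_local(&c->freelist, object,
                                   object[c->offset]) != object);

            if (unlikely((gfpflags & __GFP_ZERO) && object))
                    memset(object, 0, c->objsize);
            return object;
    }
    #endif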

    [clameter: Stripped preempt bits and disabled fastpath if preempt is enabled]
    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • We use a NULL pointer on freelists to signal that there are no more objects.
    However the NULL pointers of all slabs match in contrast to the pointers to
    the real objects which are in different ranges for different slab pages.

    Change the end pointer to be a pointer to the first object and set bit 0.
    Every slab will then have a different end pointer. This is necessary to ensure
    that end markers can be matched to the source slab during cmpxchg_local.

    Bring back the use of the mapping field by SLUB since we would otherwise have
    to call a relatively expensive function page_address() in __slab_alloc(). Use
    of the mapping field allows avoiding a call to page_address() in various other
    functions as well.

    There is no need to change the page_mapping() function since bit 0 is set
    on the mapping, just as it is for anonymous pages. page_mapping(slab_page)
    will therefore still return NULL even though the mapping field is
    overloaded.
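
    A minimal sketch of the tagged end marker described above (the helper
    names are illustrative, not the exact mm/slub.c functions):

    /*
     * The end marker of a slab's freelist is the address of the slab's first
     * object with bit 0 set.  Objects are at least word aligned, so bit 0 can
     * never be set on a real object pointer, and every slab page gets a
     * unique end marker.
     */
    static inline void *make_end(void *first_object)
    {
            return (void *)((unsigned long)first_object | 1);
    }

    static inline int is_end(const void *freepointer)
    {
            return (unsigned long)freepointer & 1;
    }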

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • gcc 4.2 spits out an annoying warning if one casts a const void *
    pointer to a void * pointer. No warning is generated if the
    conversion is done through an assignment.
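
    Illustratively (example_free() and free_object() are placeholders, not the
    call site that was actually changed), the workaround moves the cast into
    an assignment:

    extern void free_object(void *p);       /* placeholder callee */

    void example_free(const void *x)
    {
            void *object;

            /*
             * Casting directly at the call site is what reportedly triggers
             * the warning:
             *         free_object((void *)x);
             */

            /* Doing the conversion through an assignment avoids it: */
            object = (void *)x;
            free_object(object);
    }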

    Signed-off-by: Christoph Lameter

    Christoph Lameter
     

05 Feb, 2008

7 commits

  • inconsistent {softirq-on-W} -> {in-softirq-W} usage.
    swapper/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
    (&n->list_lock){-+..}, at: [] add_partial+0x31/0xa0
    {softirq-on-W} state was registered at:
    [] __lock_acquire+0x3e8/0x1140
    [] debug_check_no_locks_freed+0x188/0x1a0
    [] lock_acquire+0x55/0x70
    [] add_partial+0x31/0xa0
    [] _spin_lock+0x1e/0x30
    [] add_partial+0x31/0xa0
    [] kmem_cache_open+0x1cc/0x330
    [] _spin_unlock_irq+0x24/0x30
    [] create_kmalloc_cache+0x64/0xf0
    [] init_alloc_cpu_cpu+0x70/0x90
    [] kmem_cache_init+0x65/0x1d0
    [] start_kernel+0x23e/0x350
    [] _sinittext+0x12d/0x140
    [] 0xffffffffffffffff

    This change isn't really necessary for correctness, but it prevents lockdep
    from getting upset and then disabling itself.

    Signed-off-by: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Christoph Lameter

    root
     
  • This fixes most of the obvious coding style violations in mm/slub.c as
    reported by checkpatch.

    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Christoph Lameter

    Pekka Enberg
     
  • Add a parameter to add_partial instead of having separate functions. The
    parameter allows more detailed control over where the slab page is placed
    in the partial queues.

    If we put slabs back to the front then they are likely immediately used for
    allocations. If they are put at the end then we can maximize the time that
    the partial slabs spent without being subject to allocations.

    When deactivating a slab we can put slabs that had remote objects freed to
    them (visible because objects were put on the freelist that requires locks)
    at the end of the list so that the cachelines of remote processors can cool
    down. Slabs that had objects freed to them by the local cpu (the objects
    exist in the lockless freelist) are put at the front of the list to be
    reused ASAP in order to exploit the cache-hot state of the local cpu.

    Patch seems to slightly improve tbench speed (1-2%).
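
    A sketch of what the parameterized helper can look like, assuming the
    partial list is protected by n->list_lock as elsewhere in SLUB:

    static void add_partial(struct kmem_cache_node *n,
                            struct page *page, int tail)
    {
            spin_lock(&n->list_lock);
            n->nr_partial++;
            if (tail)
                    /* Remotely freed slab: let its cachelines cool down. */
                    list_add_tail(&page->lru, &n->partial);
            else
                    /* Locally freed slab: reuse it while it is cache hot. */
                    list_add(&page->lru, &n->partial);
            spin_unlock(&n->list_lock);
    }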

    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • The NUMA defrag works by allocating objects from partial slabs on remote
    nodes. Rename it to

    remote_node_defrag_ratio

    to be clear about this.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • Move the counting function for objects in partial slabs so that it is placed
    before kmem_cache_shrink.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • If CONFIG_SYSFS is set then free the kmem_cache structure when
    sysfs tells us it's okay.

    Otherwise there is the danger (as pointed out by
    Al Viro) that sysfs thinks the kobject still exists after
    kmem_cache_destroy() removed it.

    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka J Enberg

    Christoph Lameter
     
  • Introduce 'len' at outer level:
    mm/slub.c:3406:26: warning: symbol 'n' shadows an earlier one
    mm/slub.c:3393:6: originally declared here

    No need to declare new node:
    mm/slub.c:3501:7: warning: symbol 'node' shadows an earlier one
    mm/slub.c:3491:6: originally declared here

    No need to declare new x:
    mm/slub.c:3513:9: warning: symbol 'x' shadows an earlier one
    mm/slub.c:3492:6: originally declared here

    Signed-off-by: Harvey Harrison
    Signed-off-by: Christoph Lameter

    Harvey Harrison
     

25 Jan, 2008

5 commits


03 Jan, 2008

1 commit

  • Both SLUB and SLAB really did almost exactly the same thing for
    /proc/slabinfo setup, using duplicate code and per-allocator #ifdef's.

    This just creates a common CONFIG_SLABINFO that is enabled by both SLUB
    and SLAB, and shares all the setup code. Maybe SLOB will want this some
    day too.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

02 Jan, 2008

1 commit

  • This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work.

    [ mingo@elte.hu: build fix. ]

    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Pekka J Enberg
     

22 Dec, 2007

1 commit

  • Increase the minimum number of partial slabs to keep around and put
    partial slabs to the end of the partial queue so that they can add
    more objects.

    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

18 Dec, 2007

1 commit

  • Remove a recently added useless masking of GFP_ZERO. GFP_ZERO is already
    masked out in new_slab() (See how it calls allocate_slab). No need to do
    it twice.

    This reverts the SLUB parts of 7fd272550bd43cc1d7289ef0ab2fa50de137e767.

    Cc: Matt Mackall
    Reviewed-by: Pekka Enberg
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

10 Dec, 2007

1 commit

  • Both slob and slub react to __GFP_ZERO by clearing the allocation, which
    means that passing the GFP_ZERO bit down to the page allocator is just
    wasteful and pointless.

    Acked-by: Matt Mackall
    Reviewed-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Dec, 2007

1 commit

  • I can't pass memory allocated by kmalloc() to ksize() if it is allocated by
    SLUB allocator and size is larger than (I guess) PAGE_SIZE / 2.

    The error of ksize() seems to be that it does not check if the allocation
    was made by SLUB or the page allocator.
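
    The description suggests a check along these lines in ksize(); this is a
    hedged sketch (the slab-backed return value is simplified), not
    necessarily the exact fix that was merged:

    size_t ksize(const void *object)
    {
            struct page *page;

            if (unlikely(object == ZERO_SIZE_PTR))
                    return 0;

            page = virt_to_head_page(object);

            /*
             * Large kmalloc() requests are passed straight to the page
             * allocator, so the page is not a slab page; report the size
             * of the compound page instead of dereferencing slab metadata
             * that is not there.
             */
            if (unlikely(!PageSlab(page)))
                    return PAGE_SIZE << compound_order(page);

            /* Slab-backed allocation: use the cache's object size. */
            return page->slab->size;
    }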

    Reviewed-by: Pekka Enberg
    Tested-by: Tetsuo Handa
    Cc: Christoph Lameter , Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

13 Nov, 2007

1 commit


06 Nov, 2007

1 commit

  • Fix the memory leak that may occur when we attempt to reuse a cpu_slab
    that was allocated while we reenabled interrupts in order to be able to
    grow a slab cache.

    The per cpu freelist may contain objects and in that situation we may
    overwrite the per cpu freelist pointer, losing objects. This only
    occurs if we find that the concurrently allocated slab fits our
    allocation needs.

    If we simply always deactivate the slab then the freelist will be
    properly reintegrated and the memory leak will go away.
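
    A hedged sketch of that part of the grow-slab path (install_new_cpu_slab()
    is a hypothetical wrapper; flush_slab() is the existing SLUB helper that
    deactivates a cpu slab):

    static void install_new_cpu_slab(struct kmem_cache *s,
                                     struct kmem_cache_cpu *c,
                                     struct page *new)
    {
            if (c->page)
                    /*
                     * A cpu slab was installed concurrently while interrupts
                     * were enabled.  Always deactivate it so its freelist is
                     * reintegrated instead of being overwritten.
                     */
                    flush_slab(s, c);
            slab_lock(new);
            c->page = new;
    }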

    Signed-off-by: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

29 Oct, 2007

1 commit


22 Oct, 2007

1 commit

  • Fix a panic due to accessing a NULL kmem_cache_node pointer in
    discard_slab() after memory online.

    When memory online is called, kmem_cache_node structures are created for
    all SLUB caches for the new node whose memory is available.

    slab_mem_going_online_callback() is called to create the kmem_cache_node
    structures in the callback of the memory online event. If it (or another
    callback) fails, then slab_mem_offline_callback() is called for rollback.

    In memory offline, slab_mem_going_offline_callback() is called to shrink
    all SLUB caches, then slab_mem_offline_callback() is called later.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: locking fix]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     

17 Oct, 2007

12 commits

  • Slab constructors currently have a flags parameter that is never used. And
    the order of the arguments is opposite to other slab functions. The object
    pointer is placed before the kmem_cache pointer.

    Convert

    ctor(void *object, struct kmem_cache *s, unsigned long flags)

    to

    ctor(struct kmem_cache *s, void *object)

    throughout the kernel.

    [akpm@linux-foundation.org: coupla fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move irq handling out of new slab into __slab_alloc. That is useful for
    Mathieu's cmpxchg_local patchset and also allows us to remove the crude
    local_irq_off in early_kmem_cache_alloc().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It's a short-lived allocation.

    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • We touch a cacheline in the kmem_cache structure when zeroing in order to
    get the object size. However, the hot paths in slab_alloc and slab_free do
    not reference any other fields in kmem_cache, so we may end up bringing in
    that cacheline just for this one access.

    Add a new field to kmem_cache_cpu that contains the object size. That
    cacheline must already be used in the hotpaths. So we save one cacheline
    on every slab_alloc if we zero.

    We need to update the kmem_cache_cpu object size if an aliasing operation
    changes the objsize of a non-debug slab.
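
    A hedged sketch of the idea: keep a copy of the object size in the per cpu
    structure and use it for __GFP_ZERO, so slab_alloc never has to read
    kmem_cache for it (the field layout and the maybe_zero() helper are
    illustrative):

    struct kmem_cache_cpu {
            void **freelist;        /* Per cpu lockless freelist */
            struct page *page;      /* Current cpu slab */
            int node;
            unsigned int offset;    /* Free pointer offset (word units) */
            unsigned int objsize;   /* Copy of kmem_cache->objsize */
    };

    /* At the end of slab_alloc(): zero using the per cpu copy. */
    static inline void maybe_zero(void *object, gfp_t gfpflags,
                                  struct kmem_cache_cpu *c)
    {
            if (unlikely((gfpflags & __GFP_ZERO) && object))
                    memset(object, 0, c->objsize);
    }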

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The kmem_cache_cpu structures introduced are currently an array placed in
    the kmem_cache struct. This means the kmem_cache_cpu structures are
    overwhelmingly on the wrong node for systems with a larger number of
    nodes. These are performance-critical structures since the per cpu
    information has to be touched for every alloc and free in a slab.

    In order to place the kmem_cache_cpu structure optimally we put an array
    of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB).

    However, the kmem_cache_cpu structures can now be allocated in a more
    intelligent way.

    We would like to put per cpu structures for the same cpu but different
    slab caches into cachelines together to save space and decrease the cache
    footprint. However, the slab allocator itself controls only allocations
    per node. We set up a simple per cpu array with 100 per cpu structures
    for every processor, which is usually enough to get them all set up right.
    If we run out then we fall back to kmalloc_node. This also solves the
    bootstrap problem since we do not have to use slab allocator functions
    early in boot to get memory for the small per cpu structures.
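
    A hedged sketch of the reservation scheme described above (the bookkeeping
    is simplified here to a per cpu counter; the real code keeps a per cpu
    free list of unused entries):

    #define NR_KMEM_CACHE_CPU 100

    /* Statically reserved and NUMA-correct by construction, since per cpu
     * memory is placed on each cpu's own node. */
    static DEFINE_PER_CPU(struct kmem_cache_cpu,
                          kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
    static DEFINE_PER_CPU(int, kmem_cache_cpu_used);

    static struct kmem_cache_cpu *alloc_kmem_cache_cpu(int cpu, gfp_t flags)
    {
            int n = per_cpu(kmem_cache_cpu_used, cpu);

            if (n < NR_KMEM_CACHE_CPU) {
                    per_cpu(kmem_cache_cpu_used, cpu) = n + 1;
                    return &per_cpu(kmem_cache_cpu, cpu)[n];
            }
            /* Reserve exhausted: fall back to a node-local allocation. */
            return kmalloc_node(ALIGN(sizeof(struct kmem_cache_cpu),
                                      cache_line_size()),
                                flags, cpu_to_node(cpu));
    }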

    Pro:
    - NUMA aware placement improves memory performance
    - All global structures in struct kmem_cache become readonly
    - Dense packing of per cpu structures reduces cacheline
    footprint in SMP and NUMA.
    - Potential avoidance of exclusive cacheline fetches
    on the free and alloc hotpath since multiple kmem_cache_cpu
    structures are in one cacheline. This is particularly important
    for the kmalloc array.

    Cons:
    - Additional reference to one read only cacheline (per cpu
    array of pointers to kmem_cache_cpu) in both slab_alloc()
    and slab_free().

    [akinobu.mita@gmail.com: fix cpu hotplug offline/online path]
    Signed-off-by: Christoph Lameter
    Cc: "Pekka Enberg"
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Set c->node to -1 if we allocate from a debug slab instead of checking
    SlabDebug(page), which requires accessing the page struct cacheline.
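
    A hedged sketch of how the free fastpath can then avoid SlabDebug(page):
    a negative c->node marks a debug cpu slab and forces the slow path
    (do_slab_free() is a hypothetical wrapper around the body of slab_free()):

    static void do_slab_free(struct kmem_cache *s, struct kmem_cache_cpu *c,
                             struct page *page, void **object, void *addr)
    {
            if (likely(page == c->page && c->node >= 0)) {
                    /* Fast path: the page struct cacheline is never read. */
                    object[c->offset] = c->freelist;
                    c->freelist = object;
            } else
                    /* Debug slabs (c->node == -1) and remote frees. */
                    __slab_free(s, page, object, addr, c->offset);
    }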

    Signed-off-by: Christoph Lameter
    Tested-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We need the offset from the page struct during slab_alloc and slab_free. In
    both cases we also reference the cacheline of the kmem_cache_cpu structure.
    We can therefore move the offset field into the kmem_cache_cpu structure
    freeing up 16 bits in the page struct.

    Moving the offset allows an allocation from slab_alloc() without touching the
    page struct in the hot path.

    The only thing left in slab_free() that touches the page struct cacheline for
    per cpu freeing is the checking of SlabDebug(page). The next patch deals with
    that.

    Use the available 16 bits to broaden page->inuse. More than 64k objects per
    slab become possible and we can get rid of the checks for that limitation.

    No need anymore to shrink the order of slabs if we boot with 2M sized slabs
    (slub_min_order=9).

    No need anymore to switch off the offset calculation for very large slabs
    since the field in the kmem_cache_cpu structure is 32 bits and so the offset
    field can now handle slab sizes of up to 8GB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After moving the lockless_freelist to kmem_cache_cpu we no longer need
    page->lockless_freelist. Restructure the use of the struct page fields in
    such a way that we never touch the mapping field.

    This in turn allows us to remove the special casing of SLUB when determining
    the mapping of a page (needed for corner cases on machines with virtual
    caches that need to flush the caches of processors mapping a page).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • A remote free may access the same page struct that also contains the lockless
    freelist for the cpu slab. If objects have a short lifetime and are freed by
    a different processor then remote frees back to the slab from which we are
    currently allocating are frequent. The cacheline with the page struct needs
    to be repeatedly acquired in exclusive mode by both the allocating thread and
    the freeing thread. If this is frequent enough then performance will suffer
    because of cacheline bouncing.

    This patchset puts the lockless_freelist pointer in its own cacheline. In
    order to make that happen we introduce a per cpu structure called
    kmem_cache_cpu.

    Instead of keeping an array of pointers to page structs we now keep an array
    to a per cpu structure that--among other things--contains the pointer to the
    lockless freelist. The freeing thread can then keep possession of exclusive
    access to the page struct cacheline while the allocating thread keeps its
    exclusive access to the cacheline containing the per cpu structure.
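
    A hedged sketch of the per cpu structure this introduces (only the fields
    implied by the description; later patches add more):

    struct kmem_cache_cpu {
            void **freelist;        /* Lockless per cpu freelist */
            struct page *page;      /* Slab we are allocating from */
            int node;               /* Node of that slab */
    } ____cacheline_aligned_in_smp; /* Own cacheline, so it never bounces
                                       against the cpu slab's page struct. */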

    This works as long as the allocating cpu is able to service its request
    from the lockless freelist. If the lockless freelist runs empty then the
    allocating thread needs to acquire exclusive access to the cacheline of
    the page struct and lock the slab.

    The allocating thread will then check if new objects were freed to the per
    cpu slab. If so it will keep the slab as the cpu slab and continue with the
    recently remote freed objects. So the allocating thread can take a series
    of just freed remote pages and dish them out again. Ideally allocations
    could be just recycling objects in the same slab this way which will lead
    to an ideal allocation / remote free pattern.

    The number of objects that can be handled in this way is limited by the
    capacity of one slab. Increasing slab size via slub_min_objects/
    slub_max_order may increase the number of objects and therefore performance.

    If the allocating thread runs out of objects and finds that no objects were
    put back by the remote processor then it will retrieve a new slab (from the
    partial lists or from the page allocator) and start with a whole
    new set of objects while the remote thread may still be freeing objects to
    the old cpu slab. This may then repeat until the new slab is also exhausted.
    If remote freeing has freed objects in the earlier slab then that earlier
    slab will now be on the partial freelist and the allocating thread will
    pick that slab next for allocation. So the loop is extended. However,
    both threads need to take the list_lock to make the swizzling via
    the partial list happen.

    It is likely that this kind of scheme will keep the objects being passed
    around to a small set that can be kept in the cpu caches leading to increased
    performance.

    More code cleanups become possible:

    - Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
    Allows reducing the number of parameters to various functions.
    - Can define a new node_match() function for NUMA to encapsulate locality
    checks.

    Effect on allocations:

    Cachelines touched before this patch:

    Write: page cache struct and first cacheline of object

    Cachelines touched after this patch:

    Write: kmem_cache_cpu cacheline and first cacheline of object
    Read: page cache struct (but see later patch that avoids touching
    that cacheline)

    The handling when the lockless alloc list runs empty gets to be a bit more
    complicated since another cacheline now has to be written to. But that is
    halfway out of the hot path.

    Effect on freeing:

    Cachelines touched before this patch:

    Write: page_struct and first cacheline of object

    Cachelines touched after this patch depending on how we free:

    Write(to cpu_slab): kmem_cache_cpu struct and first cacheline of object
    Write(to other): page struct and first cacheline of object

    Read(to cpu_slab): page struct to id slab etc. (but see later patch that
    avoids touching the page struct on free)
    Read(to other): cpu local kmem_cache_cpu struct to verify its not
    the cpu slab.

    Summary:

    Pro:
    - Distinct cachelines so that concurrent remote frees and local
    allocs on a cpuslab can occur without cacheline bouncing.
    - Avoids potential bouncing cachelines because of neighboring
    per cpu pointer updates in kmem_cache's cpu_slab structure since
    it now grows to a cacheline (Therefore remove the comment
    that talks about that concern).

    Cons:
    - Freeing objects now requires the reading of one additional
    cacheline. That can be mitigated for some cases by the following
    patches but it's not possible to completely eliminate these
    references.

    - Memory usage grows slightly.

    The size of each per cpu object is blown up from one word
    (pointing to the page_struct) to one cacheline with various data.
    So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say
    NR_SLABS is 100 and the cache line size is 128; then we have just
    increased SLAB metadata requirements by 12.8k per cpu.
    (Another later patch reduces these requirements)

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch marks a number of allocations that are either short-lived such as
    network buffers or are reclaimable such as inode allocations. When something
    like updatedb is called, long-lived and unmovable kernel allocations tend to
    be spread throughout the address space which increases fragmentation.

    This patch groups these allocations together as much as possible by adding a
    new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
    reclaimed on demand, but not moved. i.e. they can be migrated by deleting
    them and re-reading the information from elsewhere.

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up
    the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
    flags:

    GFP_RECLAIM_MASK Flags used to control page allocator reclaim behavior.

    GFP_CONSTRAINT_MASK Flags used to limit where allocations can occur.

    GFP_SLAB_BUG_MASK Flags that the slab allocator BUG()s on.

    These replace the uses of GFP_LEVEL_MASK in the slab allocators and in
    vmalloc.c.

    The use of the flags not included in these sets may occur as a result of a
    slab allocation standing in for a page allocation when constructing scatter
    gather lists. Extraneous flags are cleared and not passed through to the
    page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
    now be ignored if passed to a slab allocator.
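
    A hedged sketch of that filtering; the function and its parameters are
    illustrative, and only the three mask names come from the text above:

    static struct page *example_allocate_slab(gfp_t flags, int order, int node)
    {
            /* Flags the slab layer can never honor trigger a BUG(). */
            BUG_ON(flags & GFP_SLAB_BUG_MASK);

            /*
             * Forward only reclaim-control and placement-constraint flags;
             * extraneous flags such as __GFP_COMP or __GFP_COLD are
             * silently dropped.
             */
            flags &= GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK;

            return alloc_pages_node(node, flags, order);
    }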

    Change the allocation of allocator meta data in SLAB and vmalloc to not
    pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the
    __GFP_THISNODE flag for such allocations. Generalize that to also cover
    vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.

    The impact of allocator metadata placement on access latency to the
    cachelines of the object itself is minimal since metadata is only
    referenced on alloc and free. The attempt is still made to place the meta
    data optimally but we consistently allow fallback both in SLAB and vmalloc
    (SLUB does not need to allocate metadata like that).

    Allocator metadata may serve multiple in kernel users and thus should not
    be subject to the limitations arising from a single allocation context.

    [akpm@linux-foundation.org: fix fallback_alloc()]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Simply switch all for_each_online_node() to
    for_each_node_state(N_NORMAL_MEMORY). That way SLUB only operates on nodes
    with regular memory. Any allocation attempt on a memoryless node or a node
    with just highmem will fail, whereupon SLUB will fetch memory from a nearby
    node (depending on how memory policies and cpusets describe fallback).
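
    For illustration, a node-iterating helper under this scheme looks like the
    following (the function itself is hypothetical; get_node() and nr_partial
    are the existing SLUB per node accessors):

    static unsigned long count_all_partial(struct kmem_cache *s)
    {
            int node;
            unsigned long total = 0;

            /* Visit only nodes that actually have regular memory. */
            for_each_node_state(node, N_NORMAL_MEMORY)
                    total += get_node(s, node)->nr_partial;

            return total;
    }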

    Signed-off-by: Christoph Lameter
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter