08 Feb, 2008

6 commits

  • fix checkpatch --file mm/slub.c errors and warnings.

    $ q-code-quality-compare
                          errors   lines of code   errors/KLOC
    mm/slub.c [before]        22            4204           5.2
    mm/slub.c [after]          0            4210             0

    no code changed:

     text    data    bss     dec    hex  filename
    22195    8634    136   30965   78f5  slub.o.before
    22195    8634    136   30965   78f5  slub.o.after

    md5:
    93cdfbec2d6450622163c590e1064358 slub.o.before.asm
    93cdfbec2d6450622163c590e1064358 slub.o.after.asm

    [clameter: rediffed against Pekka's cleanup patch, omitted
    moves of the name of a function to the start of line]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Christoph Lameter

    Ingo Molnar
     
  • Slub can use the non-atomic version to unlock because other flags will not
    get modified with the lock held.
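
    A minimal sketch of the kind of change described, assuming the slab lock is
    a bit spinlock on PG_locked in page->flags (the helper names here are
    illustrative):

    #include <linux/bit_spinlock.h>
    #include <linux/page-flags.h>

    static __always_inline void slab_lock(struct page *page)
    {
            bit_spin_lock(PG_locked, &page->flags);
    }

    static __always_inline void slab_unlock(struct page *page)
    {
            /*
             * The non-atomic variant is sufficient here: no other page flag
             * is modified while the slab lock is held, so there is no
             * concurrent read-modify-write on page->flags to protect against.
             */
            __bit_spin_unlock(PG_locked, &page->flags);
    }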

    Signed-off-by: Nick Piggin
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton

    Nick Piggin
     
  • The statistics provided here allow the monitoring of allocator behavior but
    at the cost of some (minimal) loss of performance. Counters are placed in
    SLUB's per cpu data structure. The per cpu structure may be extended by the
    statistics to grow larger than one cacheline which will increase the cache
    footprint of SLUB.

    There is a compile option to enable/disable the inclusion of the runtime
    statistics; it is off by default.

    The slabinfo tool is enhanced to support these statistics via the
    following options:

    -D  Switches the line of information displayed for a slab from size
        mode to activity mode.

    -A  Sorts the slabs displayed by activity. This allows the display of
        the slabs most important to the performance of a certain load.

    -r  Reports detailed statistics on a slab (implied if a cache name is
        mentioned).

    Example (tbench load):

    slabinfo -AD ->Shows the most active slabs

    Name                  Objects      Alloc       Free  %Fast
    skbuff_fclone_cache        33  111953835  111953835  99 99
    :0000192                 2666    5283688    5281047  99 99
    :0001024                  849    5247230    5246389  83 83
    vm_area_struct           1349     119642     118355  91 22
    :0004096                   15      66753      66751  98 98
    :0000064                 2067      25297      23383  98 78
    dentry                  10259      28635      18464  91 45
    :0000080                11004      18950       8089  98 98
    :0000096                 1703      12358      10784  99 98
    :0000128                  762      10582       9875  94 18
    :0000512                  184       9807       9647  95 81
    :0002048                  479       9669       9195  83 65
    anon_vma                  777       9461       9002  99 71
    kmalloc-8                6492       9981       5624  99 97
    :0000768                  258       7174       6931  58 15

    So the skbuff_fclone_cache is of highest importance for the tbench load.
    There is also pretty high load on the 192-byte slab. Look for its aliases:

    slabinfo -a | grep 000192
    slabinfo :0000192       (-r option implied if a cache name is mentioned)

    .... Usual output ...

    Slab Perf Counter           Alloc       Free  %Al  %Fr
    ------------------------------------------------------
    Fastpath                111953360  111946981   99   99
    Slowpath                     1044       7423    0    0
    Page Alloc                    272        264    0    0
    Add partial                    25        325    0    0
    Remove partial                 86        264    0    0
    RemoteObj/SlabFrozen          350       4832    0    0
    Total                   111954404  111954404

    Flushes 49   Refill 0
    Deactivate Full=325(92%)  Empty=0(0%)  ToHead=24(6%)  ToTail=1(0%)

    Looks good because the fastpath is overwhelmingly taken.

    skbuff_head_cache:

    Slab Perf Counter           Alloc       Free  %Al  %Fr
    ------------------------------------------------------
    Fastpath                  5297262    5259882   99   99
    Slowpath                     4477      39586    0    0
    Page Alloc                    937        824    0    0
    Add partial                     0       2515    0    0
    Remove partial               1691        824    0    0
    RemoteObj/SlabFrozen         2621       9684    0    0
    Total                     5301739    5299468

    Deactivate Full=2620(100%)  Empty=0(0%)  ToHead=0(0%)  ToTail=0(0%)

    Descriptions of the output:

    Total: The total number of allocations and frees that occurred for a
    slab.

    Fastpath: The number of allocations/frees that used the fastpath.

    Slowpath: The number of allocations/frees that had to take the slowpath.

    Page Alloc: Number of calls to the page allocator as a result of slowpath
    processing

    Add Partial: Number of slabs added to the partial list through free or
    alloc (occurs during cpuslab flushes)

    Remove Partial: Number of slabs removed from the partial list as a result of
    allocations retrieving a partial slab or by a free freeing
    the last object of a slab.

    RemoteObj/Froz: RemoteObj: how many times a remotely freed object was
    encountered when a slab was about to be deactivated. Frozen: how many
    times free was able to skip list processing because the slab was in
    use as the cpuslab of another processor.

    Flushes: Number of times the cpuslab was flushed on request
    (kmem_cache_shrink, may result from races in __slab_alloc)

    Refill: Number of times we were able to refill the cpuslab from
    remotely freed objects for the same slab.

    Deactivate: Statistics on how slabs were deactivated and how they were
    put onto the partial list.

    In general, a dominant fastpath is very good. Slowpath operation without
    partial list processing is also acceptable. Any touching of the partial
    list uses node-specific locks, which may potentially cause list_lock
    contention.
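
    As a rough sketch of how such per cpu counters can be wired up (the item
    names follow the report above; the config symbol and the stat() helper are
    assumptions for illustration, not quoted from the patch):

    enum stat_item {
            ALLOC_FASTPATH,         /* Allocation from the cpu slab */
            ALLOC_SLOWPATH,         /* Allocation needed __slab_alloc */
            FREE_FASTPATH,          /* Free to the cpu slab */
            FREE_SLOWPATH,          /* Free needed __slab_free */
            NR_SLUB_STAT_ITEMS
    };

    struct kmem_cache_cpu {
            void **freelist;
            struct page *page;
    #ifdef CONFIG_SLUB_STATS
            unsigned stat[NR_SLUB_STAT_ITEMS];
    #endif
    };

    /* Compiles away completely when the statistics option is off. */
    static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
    {
    #ifdef CONFIG_SLUB_STATS
            c->stat[si]++;
    #endif
    }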

    Signed-off-by: Christoph Lameter

    Christoph Lameter
     
  • Provide an alternate implementation of the SLUB fast paths for alloc
    and free using cmpxchg_local. The cmpxchg_local fast path is selected
    for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only
    set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an
    interrupt enable/disable sequence. This is known to be true for both
    x86 platforms so set FAST_CMPXCHG_LOCAL for both arches.

    Currently another requirement for the fastpath is that the kernel is
    compiled without preemption. The restriction will go away with the
    introduction of a new per cpu allocator and new per cpu operations.

    The advantages of a cmpxchg_local based fast path are:

    1. Potentially lower cycle count (30%-60% faster)

    2. There is no need to disable and enable interrupts on the fast path.
    Currently interrupts have to be disabled and enabled on every
    slab operation. This is likely avoiding a significant percentage
    of interrupt off / on sequences in the kernel.

    3. The disposal of freed slabs can occur with interrupts enabled.

    The alternate path is realized using #ifdef's. Several attempts to do the
    same with macros and inline functions resulted in a mess (in particular due
    to the strange way that local_irq_save() handles its argument and due
    to the need to define macros/functions that sometimes disable interrupts
    and sometimes do something else).
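
    A hedged sketch of the spirit of the cmpxchg_local allocation fastpath
    (is_end(), node_match() and get_cpu_slab() follow the surrounding SLUB
    code of this period and are not spelled out here):

    #ifdef SLUB_FASTPATH
    static __always_inline void *slab_alloc(struct kmem_cache *s,
                    gfp_t gfpflags, int node, void *addr)
    {
            void **object;
            struct kmem_cache_cpu *c = get_cpu_slab(s, raw_smp_processor_id());

            /* Note: no local_irq_save()/local_irq_restore() pair needed. */
            do {
                    object = c->freelist;
                    if (unlikely(is_end(object) || !node_match(c, node))) {
                            object = __slab_alloc(s, gfpflags, node, addr, c);
                            break;
                    }
                    /* Retry if an interrupt changed c->freelist under us. */
            } while (cmpxchg_local(&c->freelist, object,
                                   object[c->offset]) != object);

            if (unlikely((gfpflags & __GFP_ZERO) && object))
                    memset(object, 0, c->objsize);
            return object;
    }
    #endif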

    [clameter: Stripped preempt bits and disabled fastpath if preempt is enabled]
    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • We use a NULL pointer on freelists to signal that there are no more objects.
    However the NULL pointers of all slabs match in contrast to the pointers to
    the real objects which are in different ranges for different slab pages.

    Change the end pointer to be a pointer to the first object and set bit 0.
    Every slab will then have a different end pointer. This is necessary to ensure
    that end markers can be matched to the source slab during cmpxchg_local.

    Bring back the use of the mapping field by SLUB since we would otherwise have
    to call a relatively expensive function page_address() in __slab_alloc(). Use
    of the mapping field allows avoiding a call to page_address() in various other
    functions as well.

    There is no need to change the page_mapping() function since bit 0 is set
    on the mapping, just as it is for anonymous pages. page_mapping(slab_page)
    will therefore still return NULL even though the mapping field is
    overloaded.
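
    A minimal sketch of the tagged end marker described above (the helper
    names are illustrative, not the exact mm/slub.c functions):

    /*
     * The end marker of a slab's freelist is the address of the slab's first
     * object with bit 0 set.  Objects are at least word aligned, so bit 0 can
     * never be set on a real object pointer, and every slab page gets a
     * unique end marker.
     */
    static inline void *make_end(void *first_object)
    {
            return (void *)((unsigned long)first_object | 1);
    }

    static inline int is_end(const void *freepointer)
    {
            return (unsigned long)freepointer & 1;
    }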

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • gcc 4.2 spits out an annoying warning if one casts a const void *
    pointer to a void * pointer. No warning is generated if the
    conversion is done through an assignment.
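
    Illustratively (example_free() and free_object() are placeholders, not the
    call site that was actually changed), the workaround moves the cast into
    an assignment:

    extern void free_object(void *p);       /* placeholder callee */

    void example_free(const void *x)
    {
            void *object;

            /*
             * Casting directly at the call site is what reportedly triggers
             * the warning:
             *         free_object((void *)x);
             */

            /* Doing the conversion through an assignment avoids it: */
            object = (void *)x;
            free_object(object);
    }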

    Signed-off-by: Christoph Lameter

    Christoph Lameter
     

05 Feb, 2008

7 commits

  • inconsistent {softirq-on-W} -> {in-softirq-W} usage.
    swapper/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
    (&n->list_lock){-+..}, at: [] add_partial+0x31/0xa0
    {softirq-on-W} state was registered at:
    [] __lock_acquire+0x3e8/0x1140
    [] debug_check_no_locks_freed+0x188/0x1a0
    [] lock_acquire+0x55/0x70
    [] add_partial+0x31/0xa0
    [] _spin_lock+0x1e/0x30
    [] add_partial+0x31/0xa0
    [] kmem_cache_open+0x1cc/0x330
    [] _spin_unlock_irq+0x24/0x30
    [] create_kmalloc_cache+0x64/0xf0
    [] init_alloc_cpu_cpu+0x70/0x90
    [] kmem_cache_init+0x65/0x1d0
    [] start_kernel+0x23e/0x350
    [] _sinittext+0x12d/0x140
    [] 0xffffffffffffffff

    This change isn't really necessary for correctness, but it prevents lockdep
    from getting upset and then disabling itself.

    Signed-off-by: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Christoph Lameter

    root
     
  • This fixes most of the obvious coding style violations in mm/slub.c as
    reported by checkpatch.

    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Christoph Lameter

    Pekka Enberg
     
  • Add a parameter to add_partial instead of having separate functions. The
    parameter allows more detailed control over where the slab page is placed
    in the partial queues.

    If we put slabs back to the front then they are likely immediately used for
    allocations. If they are put at the end then we can maximize the time that
    the partial slabs spent without being subject to allocations.

    When deactivating a slab we can put slabs that had remote objects freed to
    them (visible because objects were put on the freelist that requires locks)
    at the end of the list so that the cachelines of remote processors can cool
    down. Slabs that had objects freed to them by the local cpu (the objects
    exist in the lockless freelist) are put at the front of the list to be
    reused ASAP in order to exploit the cache-hot state of the local cpu.

    Patch seems to slightly improve tbench speed (1-2%).
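
    A sketch of what the parameterized helper can look like, assuming the
    partial list is protected by n->list_lock as elsewhere in SLUB:

    static void add_partial(struct kmem_cache_node *n,
                            struct page *page, int tail)
    {
            spin_lock(&n->list_lock);
            n->nr_partial++;
            if (tail)
                    /* Remotely freed slab: let its cachelines cool down. */
                    list_add_tail(&page->lru, &n->partial);
            else
                    /* Locally freed slab: reuse it while it is cache hot. */
                    list_add(&page->lru, &n->partial);
            spin_unlock(&n->list_lock);
    }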

    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • The NUMA defrag works by allocating objects from partial slabs on remote
    nodes. Rename it to

    remote_node_defrag_ratio

    to be clear about this.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • Move the counting function for objects in partial slabs so that it is placed
    before kmem_cache_shrink.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton

    Christoph Lameter
     
  • If CONFIG_SYSFS is set then free the kmem_cache structure when
    sysfs tells us it's okay.

    Otherwise there is the danger (as pointed out by
    Al Viro) that sysfs thinks the kobject still exists after
    kmem_cache_destroy() removed it.

    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka J Enberg

    Christoph Lameter
     
  • Introduce 'len' at outer level:
    mm/slub.c:3406:26: warning: symbol 'n' shadows an earlier one
    mm/slub.c:3393:6: originally declared here

    No need to declare new node:
    mm/slub.c:3501:7: warning: symbol 'node' shadows an earlier one
    mm/slub.c:3491:6: originally declared here

    No need to declare new x:
    mm/slub.c:3513:9: warning: symbol 'x' shadows an earlier one
    mm/slub.c:3492:6: originally declared here

    Signed-off-by: Harvey Harrison
    Signed-off-by: Christoph Lameter

    Harvey Harrison
     

25 Jan, 2008

5 commits


03 Jan, 2008

1 commit

  • Both SLUB and SLAB really did almost exactly the same thing for
    /proc/slabinfo setup, using duplicate code and per-allocator #ifdef's.

    This just creates a common CONFIG_SLABINFO that is enabled by both SLUB
    and SLAB, and shares all the setup code. Maybe SLOB will want this some
    day too.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

02 Jan, 2008

1 commit

  • This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work.

    [ mingo@elte.hu: build fix. ]

    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Pekka J Enberg
     

22 Dec, 2007

1 commit

  • Increase the minimum number of partial slabs to keep around and put
    partial slabs to the end of the partial queue so that they can add
    more objects.

    Signed-off-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

18 Dec, 2007

1 commit

  • Remove a recently added useless masking of GFP_ZERO. GFP_ZERO is already
    masked out in new_slab() (See how it calls allocate_slab). No need to do
    it twice.

    This reverts the SLUB parts of 7fd272550bd43cc1d7289ef0ab2fa50de137e767.

    Cc: Matt Mackall
    Reviewed-by: Pekka Enberg
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

10 Dec, 2007

1 commit

  • Both slob and slub react to __GFP_ZERO by clearing the allocation, which
    means that passing the GFP_ZERO bit down to the page allocator is just
    wasteful and pointless.

    Acked-by: Matt Mackall
    Reviewed-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Dec, 2007

1 commit

  • I can't pass memory allocated by kmalloc() to ksize() if it is allocated by
    SLUB allocator and size is larger than (I guess) PAGE_SIZE / 2.

    The error of ksize() seems to be that it does not check if the allocation
    was made by SLUB or the page allocator.
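
    The description suggests a check along these lines in ksize(); this is a
    hedged sketch (the slab-backed return value is simplified), not
    necessarily the exact fix that was merged:

    size_t ksize(const void *object)
    {
            struct page *page;

            if (unlikely(object == ZERO_SIZE_PTR))
                    return 0;

            page = virt_to_head_page(object);

            /*
             * Large kmalloc() requests are passed straight to the page
             * allocator, so the page is not a slab page; report the size
             * of the compound page instead of dereferencing slab metadata
             * that is not there.
             */
            if (unlikely(!PageSlab(page)))
                    return PAGE_SIZE << compound_order(page);

            /* Slab-backed allocation: use the cache's object size. */
            return page->slab->size;
    }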

    Reviewed-by: Pekka Enberg
    Tested-by: Tetsuo Handa
    Cc: Christoph Lameter , Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

13 Nov, 2007

1 commit


06 Nov, 2007

1 commit

  • Fix the memory leak that may occur when we attempt to reuse a cpu_slab
    that was allocated while we reenabled interrupts in order to be able to
    grow a slab cache.

    The per cpu freelist may contain objects and in that situation we may
    overwrite the per cpu freelist pointer, losing objects. This only
    occurs if we find that the concurrently allocated slab fits our
    allocation needs.

    If we simply always deactivate the slab then the freelist will be
    properly reintegrated and the memory leak will go away.
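
    A hedged sketch of that part of the grow-slab path (install_new_cpu_slab()
    is a hypothetical wrapper; flush_slab() is the existing SLUB helper that
    deactivates a cpu slab):

    static void install_new_cpu_slab(struct kmem_cache *s,
                                     struct kmem_cache_cpu *c,
                                     struct page *new)
    {
            if (c->page)
                    /*
                     * A cpu slab was installed concurrently while interrupts
                     * were enabled.  Always deactivate it so its freelist is
                     * reintegrated instead of being overwritten.
                     */
                    flush_slab(s, c);
            slab_lock(new);
            c->page = new;
    }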

    Signed-off-by: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

29 Oct, 2007

1 commit


22 Oct, 2007

1 commit

  • Fix a panic due to accessing a NULL kmem_cache_node pointer in
    discard_slab() after memory online.

    When memory online is called, kmem_cache_node structures are created for
    all SLUB caches for the new node whose memory is available.

    slab_mem_going_online_callback() is called to create the kmem_cache_node
    structures in the callback of the memory online event. If it (or another
    callback) fails, then slab_mem_offline_callback() is called for rollback.

    In memory offline, slab_mem_going_offline_callback() is called to shrink
    all SLUB caches, then slab_mem_offline_callback() is called later.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: locking fix]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     

17 Oct, 2007

12 commits

  • Slab constructors currently have a flags parameter that is never used. And
    the order of the arguments is opposite to other slab functions. The object
    pointer is placed before the kmem_cache pointer.

    Convert

    ctor(void *object, struct kmem_cache *s, unsigned long flags)

    to

    ctor(struct kmem_cache *s, void *object)

    throughout the kernel.

    [akpm@linux-foundation.org: coupla fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move irq handling out of new slab into __slab_alloc. That is useful for
    Mathieu's cmpxchg_local patchset and also allows us to remove the crude
    local_irq_off in early_kmem_cache_alloc().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It's a short-lived allocation.

    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • We touch a cacheline in the kmem_cache structure when zeroing in order to
    get the object size. However, the hot paths in slab_alloc and slab_free do
    not reference any other fields in kmem_cache, so we may end up bringing in
    that cacheline just for this one access.

    Add a new field to kmem_cache_cpu that contains the object size. That
    cacheline must already be used in the hotpaths. So we save one cacheline
    on every slab_alloc if we zero.

    We need to update the kmem_cache_cpu object size if an aliasing operation
    changes the objsize of a non-debug slab.
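
    A hedged sketch of the idea: keep a copy of the object size in the per cpu
    structure and use it for __GFP_ZERO, so slab_alloc never has to read
    kmem_cache for it (the field layout and the maybe_zero() helper are
    illustrative):

    struct kmem_cache_cpu {
            void **freelist;        /* Per cpu lockless freelist */
            struct page *page;      /* Current cpu slab */
            int node;
            unsigned int offset;    /* Free pointer offset (word units) */
            unsigned int objsize;   /* Copy of kmem_cache->objsize */
    };

    /* At the end of slab_alloc(): zero using the per cpu copy. */
    static inline void maybe_zero(void *object, gfp_t gfpflags,
                                  struct kmem_cache_cpu *c)
    {
            if (unlikely((gfpflags & __GFP_ZERO) && object))
                    memset(object, 0, c->objsize);
    }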

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The kmem_cache_cpu structures introduced are currently an array placed in
    the kmem_cache struct. This means the kmem_cache_cpu structures are
    overwhelmingly on the wrong node for systems with a larger number of
    nodes. These are performance-critical structures since the per cpu
    information has to be touched for every alloc and free in a slab.

    In order to place the kmem_cache_cpu structure optimally we put an array
    of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB).

    However, the kmem_cache_cpu structures can now be allocated in a more
    intelligent way.

    We would like to put per cpu structures for the same cpu but different
    slab caches into cachelines together to save space and decrease the cache
    footprint. However, the slab allocator itself controls only allocations
    per node. We set up a simple per cpu array with 100 per cpu structures
    for every processor, which is usually enough to get them all set up right.
    If we run out then we fall back to kmalloc_node. This also solves the
    bootstrap problem since we do not have to use slab allocator functions
    early in boot to get memory for the small per cpu structures.
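
    A hedged sketch of the reservation scheme described above (the bookkeeping
    is simplified here to a per cpu counter; the real code keeps a per cpu
    free list of unused entries):

    #define NR_KMEM_CACHE_CPU 100

    /* Statically reserved and NUMA-correct by construction, since per cpu
     * memory is placed on each cpu's own node. */
    static DEFINE_PER_CPU(struct kmem_cache_cpu,
                          kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
    static DEFINE_PER_CPU(int, kmem_cache_cpu_used);

    static struct kmem_cache_cpu *alloc_kmem_cache_cpu(int cpu, gfp_t flags)
    {
            int n = per_cpu(kmem_cache_cpu_used, cpu);

            if (n < NR_KMEM_CACHE_CPU) {
                    per_cpu(kmem_cache_cpu_used, cpu) = n + 1;
                    return &per_cpu(kmem_cache_cpu, cpu)[n];
            }
            /* Reserve exhausted: fall back to a node-local allocation. */
            return kmalloc_node(ALIGN(sizeof(struct kmem_cache_cpu),
                                      cache_line_size()),
                                flags, cpu_to_node(cpu));
    }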

    Pro:
    - NUMA aware placement improves memory performance
    - All global structures in struct kmem_cache become readonly
    - Dense packing of per cpu structures reduces cacheline
    footprint in SMP and NUMA.
    - Potential avoidance of exclusive cacheline fetches
    on the free and alloc hotpath since multiple kmem_cache_cpu
    structures are in one cacheline. This is particularly important
    for the kmalloc array.

    Cons:
    - Additional reference to one read only cacheline (per cpu
    array of pointers to kmem_cache_cpu) in both slab_alloc()
    and slab_free().

    [akinobu.mita@gmail.com: fix cpu hotplug offline/online path]
    Signed-off-by: Christoph Lameter
    Cc: "Pekka Enberg"
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Set c->node to -1 if we allocate from a debug slab instead of checking
    SlabDebug(page), which requires accessing the page struct cacheline.
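
    A hedged sketch of how the free fastpath can then avoid SlabDebug(page):
    a negative c->node marks a debug cpu slab and forces the slow path
    (do_slab_free() is a hypothetical wrapper around the body of slab_free()):

    static void do_slab_free(struct kmem_cache *s, struct kmem_cache_cpu *c,
                             struct page *page, void **object, void *addr)
    {
            if (likely(page == c->page && c->node >= 0)) {
                    /* Fast path: the page struct cacheline is never read. */
                    object[c->offset] = c->freelist;
                    c->freelist = object;
            } else
                    /* Debug slabs (c->node == -1) and remote frees. */
                    __slab_free(s, page, object, addr, c->offset);
    }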

    Signed-off-by: Christoph Lameter
    Tested-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We need the offset from the page struct during slab_alloc and slab_free. In
    both cases we also reference the cacheline of the kmem_cache_cpu structure.
    We can therefore move the offset field into the kmem_cache_cpu structure
    freeing up 16 bits in the page struct.

    Moving the offset allows an allocation from slab_alloc() without touching the
    page struct in the hot path.

    The only thing left in slab_free() that touches the page struct cacheline for
    per cpu freeing is the checking of SlabDebug(page). The next patch deals with
    that.

    Use the available 16 bits to broaden page->inuse. More than 64k objects per
    slab become possible and we can get rid of the checks for that limitation.

    No need anymore to shrink the order of slabs if we boot with 2M sized slabs
    (slub_min_order=9).

    No need anymore to switch off the offset calculation for very large slabs
    since the field in the kmem_cache_cpu structure is 32 bits and so the offset
    field can now handle slab sizes of up to 8GB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After moving the lockless_freelist to kmem_cache_cpu we no longer need
    page->lockless_freelist. Restructure the use of the struct page fields in
    such a way that we never touch the mapping field.

    This in turn allows us to remove the special casing of SLUB when determining
    the mapping of a page (needed for corner cases on machines with virtual
    caches that need to flush the caches of processors mapping a page).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • A remote free may access the same page struct that also contains the lockless
    freelist for the cpu slab. If objects have a short lifetime and are freed by
    a different processor then remote frees back to the slab from which we are
    currently allocating are frequent. The cacheline with the page struct needs
    to be repeatedly acquired in exclusive mode by both the allocating thread and
    the freeing thread. If this is frequent enough then performance will suffer
    because of cacheline bouncing.

    This patchset puts the lockless_freelist pointer in its own cacheline. In
    order to make that happen we introduce a per cpu structure called
    kmem_cache_cpu.

    Instead of keeping an array of pointers to page structs we now keep an array
    to a per cpu structure that--among other things--contains the pointer to the
    lockless freelist. The freeing thread can then keep possession of exclusive
    access to the page struct cacheline while the allocating thread keeps its
    exclusive access to the cacheline containing the per cpu structure.
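
    A hedged sketch of the per cpu structure this introduces (only the fields
    implied by the description; later patches add more):

    struct kmem_cache_cpu {
            void **freelist;        /* Lockless per cpu freelist */
            struct page *page;      /* Slab we are allocating from */
            int node;               /* Node of that slab */
    } ____cacheline_aligned_in_smp; /* Own cacheline, so it never bounces
                                       against the cpu slab's page struct. */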

    This works as long as the allocating cpu is able to service its request
    from the lockless freelist. If the lockless freelist runs empty then the
    allocating thread needs to acquire exclusive access to the cacheline of
    the page struct and lock the slab.

    The allocating thread will then check if new objects were freed to the per
    cpu slab. If so it will keep the slab as the cpu slab and continue with the
    recently remote freed objects. So the allocating thread can take a series
    of just freed remote pages and dish them out again. Ideally allocations
    could be just recycling objects in the same slab this way which will lead
    to an ideal allocation / remote free pattern.

    The number of objects that can be handled in this way is limited by the
    capacity of one slab. Increasing slab size via slub_min_objects/
    slub_max_order may increase the number of objects and therefore performance.

    If the allocating thread runs out of objects and finds that no objects were
    put back by the remote processor then it will retrieve a new slab (from the
    partial lists or from the page allocator) and start with a whole
    new set of objects while the remote thread may still be freeing objects to
    the old cpu slab. This may then repeat until the new slab is also exhausted.
    If remote freeing has freed objects in the earlier slab then that earlier
    slab will now be on the partial freelist and the allocating thread will
    pick that slab next for allocation. So the loop is extended. However,
    both threads need to take the list_lock to make the swizzling via
    the partial list happen.

    It is likely that this kind of scheme will keep the objects being passed
    around to a small set that can be kept in the cpu caches leading to increased
    performance.

    More code cleanups become possible:

    - Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
    Allows reducing the number of parameters to various functions.
    - Can define a new node_match() function for NUMA to encapsulate locality
    checks.

    Effect on allocations:

    Cachelines touched before this patch:

    Write: page cache struct and first cacheline of object

    Cachelines touched after this patch:

    Write: kmem_cache_cpu cacheline and first cacheline of object
    Read: page cache struct (but see later patch that avoids touching
    that cacheline)

    The handling when the lockless alloc list runs empty gets to be a bit more
    complicated since another cacheline now has to be written to. But that is
    halfway out of the hot path.

    Effect on freeing:

    Cachelines touched before this patch:

    Write: page_struct and first cacheline of object

    Cachelines touched after this patch depending on how we free:

    Write(to cpu_slab): kmem_cache_cpu struct and first cacheline of object
    Write(to other): page struct and first cacheline of object

    Read(to cpu_slab): page struct to id slab etc. (but see later patch that
    avoids touching the page struct on free)
    Read(to other): cpu local kmem_cache_cpu struct to verify its not
    the cpu slab.

    Summary:

    Pro:
    - Distinct cachelines so that concurrent remote frees and local
    allocs on a cpuslab can occur without cacheline bouncing.
    - Avoids potential bouncing cachelines because of neighboring
    per cpu pointer updates in kmem_cache's cpu_slab structure since
    it now grows to a cacheline (Therefore remove the comment
    that talks about that concern).

    Cons:
    - Freeing objects now requires the reading of one additional
    cacheline. That can be mitigated for some cases by the following
    patches but it's not possible to completely eliminate these
    references.

    - Memory usage grows slightly.

    The size of each per cpu object is blown up from one word
    (pointing to the page_struct) to one cacheline with various data.
    So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say
    NR_SLABS is 100 and the cache line size is 128; then we have just
    increased SLAB metadata requirements by 12.8k per cpu.
    (Another later patch reduces these requirements)

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch marks a number of allocations that are either short-lived such as
    network buffers or are reclaimable such as inode allocations. When something
    like updatedb is called, long-lived and unmovable kernel allocations tend to
    be spread throughout the address space which increases fragmentation.

    This patch groups these allocations together as much as possible by adding a
    new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
    reclaimed on demand, but not moved. i.e. they can be migrated by deleting
    them and re-reading the information from elsewhere.

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up
    the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
    flags:

    GFP_RECLAIM_MASK Flags used to control page allocator reclaim behavior.

    GFP_CONSTRAINT_MASK Flags used to limit where allocations can occur.

    GFP_SLAB_BUG_MASK Flags that the slab allocator BUG()s on.

    These replace the uses of GFP_LEVEL_MASK in the slab allocators and in
    vmalloc.c.

    The use of the flags not included in these sets may occur as a result of a
    slab allocation standing in for a page allocation when constructing scatter
    gather lists. Extraneous flags are cleared and not passed through to the
    page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
    now be ignored if passed to a slab allocator.
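
    A hedged sketch of that filtering; the function and its parameters are
    illustrative, and only the three mask names come from the text above:

    static struct page *example_allocate_slab(gfp_t flags, int order, int node)
    {
            /* Flags the slab layer can never honor trigger a BUG(). */
            BUG_ON(flags & GFP_SLAB_BUG_MASK);

            /*
             * Forward only reclaim-control and placement-constraint flags;
             * extraneous flags such as __GFP_COMP or __GFP_COLD are
             * silently dropped.
             */
            flags &= GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK;

            return alloc_pages_node(node, flags, order);
    }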

    Change the allocation of allocator meta data in SLAB and vmalloc to not
    pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the
    __GFP_THISNODE flag for such allocations. Generalize that to also cover
    vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.

    The impact of allocator metadata placement on access latency to the
    cachelines of the object itself is minimal since metadata is only
    referenced on alloc and free. The attempt is still made to place the meta
    data optimally but we consistently allow fallback both in SLAB and vmalloc
    (SLUB does not need to allocate metadata like that).

    Allocator metadata may serve multiple in kernel users and thus should not
    be subject to the limitations arising from a single allocation context.

    [akpm@linux-foundation.org: fix fallback_alloc()]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Simply switch all for_each_online_node() to
    for_each_node_state(N_NORMAL_MEMORY). That way SLUB only operates on nodes
    with regular memory. Any allocation attempt on a memoryless node or a node
    with just highmem will fail, whereupon SLUB will fetch memory from a nearby
    node (depending on how memory policies and cpusets describe fallback).
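
    For illustration, a node-iterating helper under this scheme looks like the
    following (the function itself is hypothetical; get_node() and nr_partial
    are the existing SLUB per node accessors):

    static unsigned long count_all_partial(struct kmem_cache *s)
    {
            int node;
            unsigned long total = 0;

            /* Visit only nodes that actually have regular memory. */
            for_each_node_state(node, N_NORMAL_MEMORY)
                    total += get_node(s, node)->nr_partial;

            return total;
    }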

    Signed-off-by: Christoph Lameter
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter