09 Dec, 2006

4 commits

  • This facility provides three entry points:

    ilog2()      Log base 2 of unsigned long
    ilog2_u32()  Log base 2 of u32
    ilog2_u64()  Log base 2 of u64

    These facilities can either be used inside functions on dynamic data:

    int do_something(long x)
    {
            ...;
            y = ilog2(x);
            ...;
    }

    Or can be used to statically initialise global variables with constant values:

    unsigned n = ilog2(27);

    When performing static initialisation, the compiler will report "error:
    initializer element is not constant" if asked to take a log of zero or of
    something not reducible to a constant. They treat negative numbers as
    unsigned.

    When not dealing with a constant, they fall back to using fls() which permits
    them to use arch-specific log calculation instructions - such as BSR on
    x86/x86_64 or SCAN on FRV - if available.
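
    To illustrate how the constant case can satisfy a static initialiser,
    here is a minimal sketch (hypothetical macro, truncated to 8 bits; the
    kernel version covers the full 64 and falls back to fls() for
    non-constant arguments):

        #define const_ilog2_8(n)                \
                ((n) & (1u << 7) ? 7 :          \
                 (n) & (1u << 6) ? 6 :          \
                 (n) & (1u << 5) ? 5 :          \
                 (n) & (1u << 4) ? 4 :          \
                 (n) & (1u << 3) ? 3 :          \
                 (n) & (1u << 2) ? 2 :          \
                 (n) & (1u << 1) ? 1 : 0)

        /* legal as a static initialiser: folds to 4 at compile time */
        static unsigned n = const_ilog2_8(27);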

    [akpm@osdl.org: MMC fix]
    Signed-off-by: David Howells
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Herbert Xu
    Cc: David Howells
    Cc: Wojtek Kaniewski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Signed-off-by: Josef Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Sipek
     
  • Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/mm/.
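
    The conversion is mechanical; an illustrative (hypothetical) call site:

        struct dentry *dentry;

        dentry = file->f_dentry;        /* before */
        dentry = file->f_path.dentry;   /* after  */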

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     
  • fallback_alloc() could end up calling cpuset_zone_allowed() with interrupts
    disabled (by code in kmem_cache_alloc_node()), but without __GFP_HARDWALL
    set, leading to a possible call of a sleeping function with interrupts
    disabled.

    This results in the BUG report:

    BUG: sleeping function called from invalid context at kernel/cpuset.c:1520
    in_atomic():0, irqs_disabled():1
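
    In outline, the failing path has this shape (a sketch of the bug, not
    the exact code):

        /* kmem_cache_alloc_node() has already disabled interrupts here */
        local_irq_save(flags);
        ...;
        obj = fallback_alloc(cachep, flags);
                /* -> cpuset_zone_allowed(zone, flags) without
                 *    __GFP_HARDWALL may take a mutex and sleep,
                 *    triggering the BUG above */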

    Thanks to Paul Menage for catching this one.

    Signed-off-by: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

08 Dec, 2006

36 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well
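
    A minimal userspace illustration of the effect (compile with gcc and
    inspect the symbol with objdump -t; the names are made up):

        struct ops {
                int (*show)(void);
        };

        static int example_show(void) { return 0; }

        /* const + static storage: placed in .rodata, so the table
         * cannot be overwritten at run time */
        static const struct ops example_ops = { .show = example_show };

        int main(void) { return example_ops.show(); }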

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • In time for 2.6.20, we can get rid of this junk.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Burman Yan
     
  • Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
    prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
    generating compiler warnings of unused symbols, hence forcing people to add
    #ifdefs.

    The compiler can skip truly unused functions just fine, as the identical
    before/after image sizes show:

       text    data     bss     dec    hex filename
    1624412  728710 3674856 6027978 5bfaca vmlinux.before
    1624412  728710 3674856 6027978 5bfaca vmlinux.after
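
    The fix keeps a reference to 'fn' in the !HOTPLUG_CPU stub so the
    symbol counts as used, while the optimizer still discards the dead
    function (a sketch of the two variants):

        #ifdef CONFIG_HOTPLUG_CPU
        #define hotcpu_notifier(fn, pri) {                              \
                static struct notifier_block fn##_nb =                  \
                        { .notifier_call = fn, .priority = pri };       \
                register_cpu_notifier(&fn##_nb);                        \
        }
        #else
        /* 'fn' is referenced, so no "defined but not used" warning,
         * yet the compiler can still eliminate it as dead code */
        #define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0)
        #endif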

    [akpm@osdl.org: topology.c fix]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • It has no users and it's doubtful that we'll need it again.

    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Use put_pages_list() instead of opencoding it.
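
    For reference, the opencoded pattern the helper replaces looks roughly
    like this (illustrative):

        struct page *victim;

        while (!list_empty(pages)) {
                victim = list_entry(pages->prev, struct page, lru);
                list_del(&victim->lru);
                page_cache_release(victim);
        }

        /* becomes simply */
        put_pages_list(pages);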

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Move process freezing functions from include/linux/sched.h to freezer.h, so
    that modifications to the freezer or the kernel configuration don't require
    recompiling just about everything.

    [akpm@osdl.org: fix ueagle driver]
    Signed-off-by: Nigel Cunningham
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nigel Cunningham
     
  • Currently swsusp saves the contents of highmem pages by copying them to the
    normal zone which is quite inefficient (eg. it requires two normal pages
    to be used for saving one highmem page). This may be improved by using
    highmem for saving the contents of saveable highmem pages.

    Namely, during the suspend phase of the suspend-resume cycle we try to
    allocate as many free highmem pages as there are saveable highmem pages.
    If there are not enough highmem image pages to store the contents of all of
    the saveable highmem pages, some of them will be stored in the "normal"
    memory. Next, we allocate as many free "normal" pages as needed to store
    the (remaining) image data. We use a memory bitmap to mark the allocated
    free pages (ie. highmem as well as "normal" image pages).

    Now, we use another memory bitmap to mark all of the saveable pages
    (highmem as well as "normal") and the contents of the saveable pages are
    copied into the image pages. Then, the second bitmap is used to save the
    pfns corresponding to the saveable pages and the first one is used to save
    their data.

    During the resume phase the pfns of the pages that were saveable during
    the suspend are loaded from the image and used to mark the "unsafe" page
    frames. Next, we try to allocate as many free highmem page frames as are
    needed to load all of the image data that had been in highmem before the
    suspend, and we allocate enough free "normal" page frames that the total
    number of allocated free pages (highmem and "normal") is equal to the
    size of the image. While doing this we have to make sure that there will
    be some extra free "normal" and "safe" page frames for the two lists of
    PBEs constructed later.

    Now, the image data are loaded, if possible, into their "original" page
    frames. The image data that cannot be written into their "original" page
    frames are loaded into "safe" page frames and their "original" kernel
    virtual addresses, as well as the addresses of the "safe" pages containing
    their copies, are stored in one of two lists of PBEs.

    One list of PBEs is for the copies of "normal" suspend pages (ie. "normal"
    pages that were saveable during the suspend) and it is used in the same way
    as previously (ie. by the architecture-dependent parts of swsusp). The
    other list of PBEs is for the copies of highmem suspend pages. The pages
    in this list are restored (in a reversible way) right before the
    arch-dependent code is called.
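
    Condensed into pseudocode, the suspend-phase allocation described above
    looks roughly like this (helper names are hypothetical, mirroring the
    text rather than the actual swsusp functions):

        nr_highmem = count_saveable_highmem_pages();
        to_alloc   = min(nr_highmem, count_free_highmem_pages());

        alloc_highmem_image_pages(&alloc_bm, to_alloc);     /* bitmap 1 */
        alloc_normal_image_pages(&alloc_bm, image_size - to_alloc);

        mark_saveable_pages(&save_bm);                      /* bitmap 2 */
        copy_data_pages(&alloc_bm, &save_bm);  /* contents into image */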

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make swsusp use block device offsets instead of swap offsets to identify swap
    locations and make it use the same code paths for writing as well as for
    reading data.

    This allows us to use the same code for handling swap files and swap
    partitions and to simplify the code, eg. by dropping rw_swap_page_sync().

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The Linux kernel handles swap files almost in the same way as it handles swap
    partitions and there are only two differences between these two types of swap
    areas:

    (1) swap files need not be contiguous,

    (2) the header of a swap file is not in the first block of the partition
    that holds it. From the swsusp's point of view (1) is not a problem,
    because it is already taken care of by the swap-handling code, but (2) has
    to be taken into consideration.

    In principle the location of a swap file's header may be determined with
    the help of the appropriate filesystem driver. Unfortunately, however,
    this requires the filesystem holding the swap file to be mounted, and if
    that filesystem is journaled, it cannot be mounted during a resume from
    disk. For this reason we need some other means by which swap areas can
    be identified.

    For example, to identify a swap area we can use the partition that holds the
    area and the offset from the beginning of this partition at which the swap
    header is located.

    The following patch allows swsusp to identify swap areas this way. It changes
    swap_type_of() so that it takes an additional argument representing an offset
    of the swap header within the partition represented by its first argument.
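
    A sketch of the interface as described (the exact types are an
    assumption):

        /* was: int swap_type_of(dev_t device); */
        int swap_type_of(dev_t device, sector_t offset);

        /* offset == 0 keeps the old behaviour for swap partitions;
         * a non-zero offset identifies a swap file's header within
         * the partition that holds it */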

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Make radix tree lookups safe to be performed without locks. Readers are
    protected against nodes being deleted by using RCU based freeing. Readers
    are protected against new node insertion by using memory barriers to ensure
    the node itself will be properly written before it is visible in the radix
    tree.

    Each radix tree node keeps a record of its height (above the leaf
    nodes). This height does not change after insertion -- when the radix
    tree is extended, higher nodes are only inserted at the top. So a lookup
    can take the pointer to what is *now* the root node and traverse down it
    even if the tree is concurrently extended and this node becomes a
    subtree of a new root.

    "Direct" pointers (tree height of 0, where root->rnode points directly to
    the data item) are handled by using the low bit of the pointer to signal
    whether rnode is a direct pointer or a pointer to a radix tree node.

    When a reader wants to traverse the next branch, they will take a copy of
    the pointer. This pointer will be either NULL (and the branch is empty) or
    non-NULL (and will point to a valid node).
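
    A simplified reader-side descent, to make the ordering concrete (a
    sketch: tag handling and the direct-pointer bit are omitted, and this
    is not the actual lookup code):

        rcu_read_lock();
        node = rcu_dereference(root->rnode);    /* possibly a stale root */
        height = node->height;
        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;

        while (height > 0) {
                int offset = (index >> shift) & RADIX_TREE_MAP_MASK;

                node = rcu_dereference(node->slots[offset]);
                if (node == NULL)               /* empty branch */
                        break;
                shift -= RADIX_TREE_MAP_SHIFT;
                height--;
        }
        rcu_read_unlock();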

    [akpm@osdl.org: cleanups]
    [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
    [clameter@sgi.com: build fix]
    Signed-off-by: Nick Piggin
    Cc: "Paul E. McKenney"
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Currently we use the lru head link of the second page of a compound page
    to hold its destructor. This was ok when it was purely an internal
    implementation detail. However, hugetlbfs overrides this destructor,
    violating the layering. Abstract this out as explicit calls, and also
    introduce a type for the callback function, allowing it to be type
    checked. For each callback we pre-declare the function, causing a type
    error on definition rather than on use elsewhere.
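
    The abstraction amounts to a typed setter/getter pair around the
    existing storage slot (a sketch modeled on the description):

        typedef void (*compound_page_dtor)(struct page *);

        static inline void set_compound_page_dtor(struct page *page,
                                                  compound_page_dtor dtor)
        {
                /* still stashed in the second page's lru head link */
                page[1].lru.next = (void *)dtor;
        }

        static inline compound_page_dtor get_compound_page_dtor(struct page *page)
        {
                return (compound_page_dtor)page[1].lru.next;
        }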

    [akpm@osdl.org: cleanups]
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Currently we simply attempt to allocate from all allowed nodes using
    GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it won't do any
    at all if the recent GFP_THISNODE patch is accepted). If we truly run
    out of memory in the whole system then fallback_alloc may return NULL
    although memory may still be available if we performed more thorough
    reclaim.

    This patch changes fallback_alloc() so that we first only inspect all the
    per node queues for available slabs. If we find any then we allocate from
    those. This avoids slab fragmentation by first getting rid of all partial
    allocated slabs on every node before allocating new memory.

    If we cannot satisfy the allocation from any per node queue then we extend
    a slab. We now call into the page allocator without specifying
    GFP_THISNODE. The page allocator will then implement its own fallback (in
    the given cpuset context), perform necessary reclaim (again considering not
    a single node but the whole set of allowed nodes) and then return pages for
    a new slab.

    We identify from which node the pages were allocated and then insert the
    pages into the corresponding per node structure. In order to do so we need
    to modify cache_grow() to take a parameter that specifies the new slab.
    kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
    able to use kmem_getpage to allocate from an arbitrary node. GFP_THISNODE
    needs to be specified when calling cache_grow().

    One key advantage is that the decision from which node to allocate new
    memory is removed from slab fallback processing. The patch allows us to
    go back to using the page allocator's fallback/reclaim logic.
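
    In outline, the reworked fallback (a condensed sketch; the
    node-iteration macro is hypothetical):

        /* stage 1: only inspect the per-node queues of all allowed
         * nodes, without handing out new pages */
        for_each_allowed_node(nid) {
                obj = ____cache_alloc_node(cachep,
                                flags | GFP_THISNODE | __GFP_NOWARN, nid);
                if (obj)
                        return obj;
        }

        /* stage 2: let the page allocator pick a node (with fallback
         * and reclaim over the whole allowed set), then grow the cache
         * on whatever node the pages came from */
        page = alloc_pages(flags & ~GFP_THISNODE, cachep->gfporder);
        nid  = page_to_nid(page);
        cache_grow(cachep, flags | GFP_THISNODE, nid, page);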

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The intent of GFP_THISNODE is to make sure that an allocation occurs on a
    particular node. If this is not possible then NULL needs to be returned so
    that the caller can choose what to do next on its own (the slab allocator
    depends on that).

    However, GFP_THISNODE currently triggers reclaim before returning a
    failure (GFP_THISNODE means GFP_NORETRY is set). If we have
    over-allocated a node then we will currently do some reclaim before
    returning NULL. The caller may want memory from other nodes before
    reclaim should be triggered. (If the caller wants reclaim then it can
    directly use __GFP_THISNODE instead.)

    There is no flag to avoid reclaim in the page allocator and adding yet
    another GFP_xx flag would be difficult given that we are out of available
    flags.

    So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
    __GFP_NORETRY and __GFP_NOWARN) are set. If so then we return NULL before
    waking up kswapd.
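
    The test is a plain bit comparison early in the allocator (sketch
    matching the description):

        /* GFP_THISNODE == __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN */
        if ((gfp_mask & GFP_THISNODE) == GFP_THISNODE)
                goto nopage;    /* fail fast: no kswapd wakeup, no reclaim */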

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This addresses two issues:

    1. Kmalloc_node() may intermittently return NULL if we are allocating
    from the current node and are unable to obtain memory for the current
    node from the page allocator. This is because we call ___cache_alloc()
    if nodeid == numa_node_id() and ____cache_alloc() is not able to fall
    back to other nodes.

    This was introduced in the 2.6.19 development cycle.
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_DMA is an alias of GFP_DMA. This is the last one so we
    remove the leftover comment too.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.
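
    The substitution is mechanical (illustrative call site):

        obj = kmem_cache_alloc(cachep, SLAB_KERNEL);    /* before */
        obj = kmem_cache_alloc(cachep, GFP_KERNEL);     /* after  */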

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_LEVEL_MASK is only used internally to the slab and is an alias of
    GFP_LEVEL_MASK.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It is only used internally in the slab.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • David Binderman and his Intel C compiler rightly observe that
    install_file_pte no longer has any use for its pte_val.

    Signed-off-by: Hugh Dickins
    Cc: d binderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • These patches introduced new switch statements which are indented
    contrary to the consensus in mm/*.c. Fix them up to match that
    consensus.

    [PATCH] node local per-cpu-pages
    [PATCH] ZVC: Scale thresholds depending on the size of the system
    commit e7c8d5c9955a4d2e88e36b640563f5d6d5aba48a
    commit df9ecaba3f152d1ea79f2a5e0b87505e03f47590

    Signed-off-by: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The fsfuzzer found this; with a corrupt small swapfile that claims to have
    many pages:

    [root]# file swap.741.img
    swap.741.img: Linux/i386 swap file (new style) 1 (4K pages) size 1040191487 pages
    [root]# ls -l swap.741.img
    -rw-r--r-- 1 root root 16777216 Nov 22 05:18 swap.741.img

    sys_swapon() will try to vmalloc all those pages, and -then- check to see if
    the file is actually that large:

        if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {

        if (swapfilesize && maxpages > swapfilesize) {
                printk(KERN_WARNING
                       "Swap area shorter than signature indicates\n");

    It seems to me that it would make more sense to move this test up before
    the vmalloc, with the other checks, to avoid the OOM-killer in this
    situation...
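
    That reordering would look roughly like this (sketch):

        /* validate the claimed size before committing to the big vmalloc */
        if (swapfilesize && maxpages > swapfilesize) {
                printk(KERN_WARNING
                       "Swap area shorter than signature indicates\n");
                goto bad_swap;
        }

        p->swap_map = vmalloc(maxpages * sizeof(short));
        if (!p->swap_map)
                goto bad_swap;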

    Signed-off-by: Eric Sandeen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • NUMA node ids are passed as either int or unsigned int almost
    exclusively, yet page_to_nid and zone_to_nid both return unsigned long.
    This is a throwback to when page_to_nid was a #define and thus exposed
    the real type of the page flags field.

    In addition to fixing up the definitions of page_to_nid and zone_to_nid
    (as sketched after the list below), I audited the users of these
    functions, identifying the following incorrect uses:

    1) mm/page_alloc.c show_node() -- printk dumping the node id,
    2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
    against numa_node_id() which returns an int from cpu_to_node(), and
    3) mm/mempolicy.c check_pte_range -- used as an index in node_isset
    which uses bit_set which in generic code takes an int.
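
    After the fix the accessor simply returns int (sketch):

        static inline int page_to_nid(struct page *page)
        {
                return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
        }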

    Signed-off-by: Andy Whitcroft
    Cc: Christoph Lameter
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • drain_node_pages() currently drains the complete pageset of all pages. If
    there are a large number of pages in the queues then we may hold off
    interrupts for too long.

    Duplicate the method used in free_hot_cold_page. Only drain pcp->batch
    pages at one time.
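
    The batched variant, in outline (a sketch along the lines of
    free_hot_cold_page(); the locals are illustrative):

        while (pcp->count) {
                int to_drain = min(pcp->count, pcp->batch);

                local_irq_save(flags);
                free_pages_bulk(zone, to_drain, &pcp->list, 0);
                pcp->count -= to_drain;
                local_irq_restore(flags);   /* irqs re-enabled per batch */
        }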

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch makes the needlessly global "global_faults" static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • When booting a NUMA system with nodes that have no memory (eg by
    limiting memory), alloc_bootmem_core tried to find pages in an
    uninitialized bootmem_map, causing a null pointer access. This fix adds
    a check so that NULL is returned. That will enable the caller
    (__alloc_bootmem_nopanic) to allocate memory on other nodes without a
    panic.
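
    The check amounts to bailing out before touching the map (sketch):

        /* node has no memory: the bootmem map was never set up */
        if (!bdata->node_bootmem_map)
                return NULL;    /* caller falls back to another node */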

    Signed-off-by: Christian Krafft
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Krafft
     
  • The patch (as824b) makes percpu_free() ignore NULL arguments, as one would
    expect for a deallocation routine. (Note that free_percpu is #defined as
    percpu_free in include/linux/percpu.h.) A few callers are updated to remove
    now-unneeded tests for NULL. A few other callers already seem to assume
    that passing a NULL pointer to percpu_free() is okay!

    The patch also removes an unnecessary NULL check in percpu_depopulate().
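
    After the patch the entry point short-circuits on NULL, just like
    kfree() (a sketch of the shape of the change):

        void percpu_free(void *__pdata)
        {
                if (unlikely(!__pdata))
                        return;         /* NULL is now a silent no-op */
                __percpu_depopulate_mask(__pdata, &cpu_possible_map);
                kfree(__percpu_disguise(__pdata));
        }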

    Signed-off-by: Alan Stern
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Stern
     
  • We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
    the caller. This is used for subsystem-specific allocators like skb_alloc.

    To make skb_alloc node-aware we need similar routines for the node-aware slab
    allocator, which this patch adds.

    Note that the code is rather ugly, but it mirrors the non-node-aware
    code 1:1.
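
    A sketch of the new node-aware entry point, mirroring the existing
    kmalloc_track_caller:

        extern void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
                                                 int node, void *caller);

        #define kmalloc_node_track_caller(size, flags, node)            \
                __kmalloc_node_track_caller(size, flags, node,          \
                                            __builtin_return_address(0))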

    [akpm@osdl.org: add module export]
    Signed-off-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • It would be possible for /proc/swaps to not always print out the header:

    swapon /dev/hdc2
    swapon /dev/hde2
    swapoff /dev/hdc2

    At this point /proc/swaps would not have a header.
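
    The usual seq_file remedy is to print the header for the start token
    instead of tying it to the first entry (sketch):

        static int swap_show(struct seq_file *swap, void *v)
        {
                if (v == SEQ_START_TOKEN) {
                        seq_puts(swap, "Filename\t\t\tType\t\tSize\tUsed\tPriority\n");
                        return 0;
                }
                /* ... print one swap entry as before ... */
                return 0;
        }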

    Signed-off-by: Suleiman Souhlal
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suleiman Souhlal
     
  • OOM can panic due to processes stuck in __alloc_pages() doing an
    infinite rebalance loop while no memory can be reclaimed. The OOM
    killer tries to kill some processes, but unfortunately the rebalance
    label was moved by someone below the TIF_MEMDIE check, so the buddy
    allocator doesn't see that the process has been OOM-killed and can
    simply fail the allocation :/

    Observed in reality on RHEL4(2.6.9)+OpenVZ kernel when a user doing some
    memory allocation tricks triggered OOM panic.

    Signed-off-by: Denis Lunev
    Signed-off-by: Kirill Korotaev
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • mm is defined as vma->vm_mm, so use that.

    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik Bobbaers
     
  • When using numa=fake on non-NUMA hardware there is no benefit to having the
    alien caches, and they consume much memory.

    Add a kernel boot option to disable them.

    Christoph sayeth "This is good to have even on large NUMA. The problem is
    that the alien caches grow by the square of the size of the system in terms of
    nodes."
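
    The plumbing for such an option is small (a sketch; the option added
    here is "noaliencache"):

        static int use_alien_caches __read_mostly = 1;

        static int __init noaliencache_setup(char *s)
        {
                use_alien_caches = 0;
                return 1;
        }
        __setup("noaliencache", noaliencache_setup);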

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Here's an attempt towards doing away with lock_cpu_hotplug in the slab
    subsystem. This approach also fixes a bug which shows up when cpus are
    being offlined/onlined and slab caches are being tuned simultaneously.

    http://marc.theaimsgroup.com/?l=linux-kernel&m=116098888100481&w=2

    The patch has been stress tested overnight on a 2 socket 4 core AMD box
    with repeated cpu online and offline, while dbench and kernbench
    processes were running and slab caches were being tuned at the same
    time. There were no lockdep warnings either. (This test was on 2.6.18,
    as 2.6.19-rc crashes at __drain_pages
    http://marc.theaimsgroup.com/?l=linux-kernel&m=116172164217678&w=2 )

    The approach here is to hold cache_chain_mutex from CPU_UP_PREPARE until
    CPU_ONLINE (similar in approach to workqueue_mutex). Slab code sensitive
    to cpu_online_map (kmem_cache_create, kmem_cache_destroy, slabinfo_write,
    __cache_shrink) is already serialized with cache_chain_mutex. (This patch
    lengthens the cache_chain_mutex hold time at kmem_cache_destroy to cover
    this.) This patch also takes cache_chain_mutex at kmem_cache_shrink to
    protect the sanity of cpu_online_map at __cache_shrink, as viewed by slab
    (kmem_cache_shrink->__cache_shrink->drain_cpu_caches). But, really,
    kmem_cache_shrink is used in just one place in the acpi subsystem! Do we
    really need to keep kmem_cache_shrink at all?

    Another note: it looks like a cpu hotplug event can send CPU_UP_CANCELED
    to a registered subsystem even if the subsystem did not receive
    CPU_UP_PREPARE. This could be due to a subsystem registered for
    notification earlier than the current subsystem crapping out with
    NOTIFY_BAD. Badness can occur within the CPU_UP_CANCELED code path at
    slab if this happens (the same would apply to workqueue.c as well). To
    overcome this, we might have to either
    a) use a per-subsystem flag and avoid handling CPU_UP_CANCELED,
    b) use special notifier events like LOCK_ACQUIRE/RELEASE as Gautham was
    using in his experiments, or
    c) not send CPU_UP_CANCELED to a subsystem which did not receive
    CPU_UP_PREPARE.

    I would prefer c).
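
    The locking window described above, as notifier pseudocode (sketch):

        case CPU_UP_PREPARE:
                mutex_lock(&cache_chain_mutex);
                /* set up per-cpu queues for the incoming cpu ... */
                break;
        case CPU_ONLINE:
        case CPU_UP_CANCELED:
                /* tear down on cancel; release the mutex either way */
                mutex_unlock(&cache_chain_mutex);
                break;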

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • When CONFIG_SLAB_DEBUG is used in combination with ARCH_SLAB_MINALIGN,
    some debug flags should be disabled which depend on BYTES_PER_WORD
    alignment.

    The disabling of these debug flags is not properly handled when

        BYTES_PER_WORD < ARCH_SLAB_MINALIGN < cache_line_size()

    This patch fixes that and also adds an alignment check to
    cache_alloc_debugcheck_after() when ARCH_SLAB_MINALIGN is used.
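
    The guard amounts to stripping the word-alignment-dependent debug flags
    whenever the effective alignment exceeds BYTES_PER_WORD (sketch):

        if (ralign > BYTES_PER_WORD)    /* eg. set by ARCH_SLAB_MINALIGN */
                flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);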

    Signed-off-by: Kevin Hilman
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kevin Hilman