05 Dec, 2011
1 commit
-
Commit 30765b92 ("slab, lockdep: Annotate the locks before using
them") moves the init_lock_keys() call from after g_cpucache_up =
FULL, to before it. And overlooks the fact that init_node_lock_keys()
tests for it and ignores everything !FULL.

Introduce a LATE stage and change the lockdep test to be
Cc: Pekka Enberg
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar
28 Sep, 2011
1 commit
-
Historically /proc/slabinfo and files under /sys/kernel/slab/* have
world read permissions and are accessible to the world. slabinfo
contains rather private information related both to the kernel and
userspace tasks. Depending on the situation, it might reveal either
private information per se or information useful to make another
targeted attack. Some examples of what can be learned by
reading/watching for /proc/slabinfo entries:

1) dentry (and different *inode*) number might reveal other processes fs
activity. The number of dentry "active objects" doesn't strictly show
file count opened/touched by a process, however, there is a good
correlation between them. The patch "proc: force dcache drop on
unauthorized access" relies on the privacy of dentry count.

2) different inode entries might reveal the same information as (1), but
these are more fine-grained counters. If a filesystem is mounted in a
private mount point (or even a private namespace) and fs type differs from
other mounted fs types, fs activity in this mount point/namespace is
revealed. If there is a single ecryptfs mount point, the whole fs
activity of a single user is revealed. Number of files in ecryptfs
mount point is private information per se.

3) fuse_* reveals number of files / fs activity of a user in a user
private mount point. It is approx. the same severity as ecryptfs
infoleak in (2).

4) sysfs_dir_cache similar to (2) reveals devices' addition/removal,
which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo
the precise number of sysfs files is known to the world.

5) buffer_head might reveal some kernel activity. With other
information leaks an attacker might identify what specific kernel
routines generate buffer_head activity.

6) *kmalloc* infoleaks are very situational. Attacker should watch for
the specific kmalloc size entry and filter the noise related to the unrelated
kernel activity. If an attacker has relatively silent victim system, he
might get rather precise counters.

Additional information sources might significantly increase the slabinfo
infoleak benefits. E.g. if an attacker knows that the processes
activity on the system is very low (only core daemons like syslog and
cron), he may run setxid binaries / trigger local daemon activity /
trigger network services activity / await sporadic cron jobs activity
/ etc. and get rather precise counters for fs and network activity of
these privileged tasks, which is unknown otherwise.

Also hiding slabinfo and /sys/kernel/slab/* is one step to complicate
exploitation of kernel heap overflows (and possibly, other bugs). The
related discussion:

http://thread.gmane.org/gmane.linux.kernel/1108378

To keep compatibility with old permission model where non-root
monitoring daemon could watch for kernel memleaks through slabinfo one
should do:

    groupadd slabinfo
    usermod -a -G slabinfo $MONITOR_USER

And add the following commands to init scripts (to mountall.conf in
Ubuntu's upstart case):

    chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
    chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*

Signed-off-by: Vasiliy Kulikov
Reviewed-by: Kees Cook
Reviewed-by: Dave Hansen
Acked-by: Christoph Lameter
Acked-by: David Rientjes
CC: Valdis.Kletnieks@vt.edu
CC: Linus Torvalds
CC: Alan Cox
Signed-off-by: Pekka Enberg
19 Sep, 2011
1 commit
04 Aug, 2011
2 commits
-
Fernando found we hit the regular OFF_SLAB 'recursion' before we
annotate the locks, cure this.

The relevant portion of the stack-trace:
> [ 0.000000] [] rt_spin_lock+0x50/0x56
> [ 0.000000] [] __cache_free+0x43/0xc3
> [ 0.000000] [] kmem_cache_free+0x6c/0xdc
> [ 0.000000] [] slab_destroy+0x4f/0x53
> [ 0.000000] [] free_block+0x94/0xc1
> [ 0.000000] [] do_tune_cpucache+0x10b/0x2bb
> [ 0.000000] [] enable_cpucache+0x7b/0xa7
> [ 0.000000] [] kmem_cache_init_late+0x1f/0x61
> [ 0.000000] [] start_kernel+0x24c/0x363
> [ 0.000000] [] i386_start_kernel+0xa9/0xaf

Reported-by: Fernando Lopez-Lezcano
Acked-by: Pekka Enberg
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
Signed-off-by: Ingo Molnar
-
Lockdep thinks there's lock recursion through:
kmem_cache_free()
cache_flusharray()
spin_lock(&l3->list_lock) list_lock) --'

Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no
actual possibility of recursing. Luckily debug objects marks its slab
with SLAB_DEBUG_OBJECTS so we can identify the thing.

Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special
lockdep key so that lockdep sees it's a different cachep.

Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU |
SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble.

Reported-and-tested-by: Sebastian Siewior
[ fixes to the initial patch ]
Reported-by: Thomas Gleixner
Acked-by: Pekka Enberg
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twins
Signed-off-by: Ingo Molnar
01 Aug, 2011
1 commit
-
Less code and the advantage of ascii dump.
before:
| Slab corruption: names_cache start=c5788000, len=4096
| 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00
| 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff
| 030: ff ff ff ff e2 b4 17 18 c7 e4 08 06 00 01 08 00
| 040: 06 04 00 01 e2 b4 17 18 c7 e4 0a 00 00 01 00 00
| 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b
after:
| Slab corruption: size-4096 start=c38a9000, len=4096
| 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00 kk....V...$...*.
| 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
| 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ................
| 030: ff ff ff ff d2 56 5f aa db 9c 08 06 00 01 08 00 .....V_.........
| 040: 06 04 00 01 d2 56 5f aa db 9c 0a 00 00 01 00 00 .....V_.........
| 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b ........kkkkkkkk

Acked-by: Christoph Lameter
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Pekka Enberg
31 Jul, 2011
1 commit
-
Use the nice enumerated constant.
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Pekka Enberg
28 Jul, 2011
1 commit
-
Reduce high order allocations in do_tune_cpucache() for some setups.
(NR_CPUS=4096 -> we need 64KB)

Signed-off-by: Eric Dumazet
Acked-by: Christoph Lameter
Signed-off-by: Pekka Enberg
23 Jul, 2011
1 commit
-
* 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slab: fix DEBUG_SLAB warning
slab: shrink sizeof(struct kmem_cache)
slab: fix DEBUG_SLAB build
SLUB: Fix missing include
slub: reduce overhead of slub_debug
slub: Add method to verify memory is not freed
slub: Enable backtrace for create/delete points
slab allocators: Provide generic description of alignment defines
slab, slub, slob: Unify alignment definition
slob/lockdep: Fix gfp flags passed to lockdep
22 Jul, 2011
1 commit
-
In commit c225150b "slab: fix DEBUG_SLAB build",
"if ((unsigned long)objp & (ARCH_SLAB_MINALIGN-1))" is always true if
ARCH_SLAB_MINALIGN == 0. Do not print warning if ARCH_SLAB_MINALIGN == 0.

Signed-off-by: Tetsuo Handa
Signed-off-by: Pekka Enberg
21 Jul, 2011
1 commit
-
Reduce high order allocations for some setups.
(NR_CPUS=4096 -> we need 64KB per kmem_cache struct)

We now allocate exact needed size (using nr_cpu_ids and nr_node_ids)
This also makes code a bit smaller on x86_64, since some field offsets
are less than the 127 limit :

Before patch :
# size mm/slab.o
   text    data     bss     dec     hex filename
  22605  361665      32  384302   5dd2e mm/slab.o

After patch :
# size mm/slab.o
   text    data     bss     dec     hex filename
  22349  353473    8224  384046   5dc2e mm/slab.o

CC: Andrew Morton
Reported-by: Konstantin Khlebnikov
Signed-off-by: Eric Dumazet
Acked-by: Christoph Lameter
Signed-off-by: Pekka Enberg
18 Jul, 2011
1 commit
-
Fix CONFIG_SLAB=y CONFIG_DEBUG_SLAB=y build error and warnings.
Now that ARCH_SLAB_MINALIGN defaults to __alignof__(unsigned long long),
it is always defined (when slab.h included), but cannot be used in #if:
mm/slab.c: In function `cache_alloc_debugcheck_after':
mm/slab.c:3156:5: warning: "__alignof__" is not defined
mm/slab.c:3156:5: error: missing binary operator before token "("
make[1]: *** [mm/slab.o] Error 1

So just remove the #if and #endif lines, but then 64-bit build warns:
mm/slab.c: In function `cache_alloc_debugcheck_after':
mm/slab.c:3156:6: warning: cast from pointer to integer of different size
mm/slab.c:3158:10: warning: format `%d' expects type `int', but argument
3 has type `long unsigned int'

Fix those with casts, whatever the actual type of ARCH_SLAB_MINALIGN.

Acked-by: Christoph Lameter
Signed-off-by: Hugh Dickins
Signed-off-by: Pekka Enberg
04 Jun, 2011
1 commit
-
Currently, when using CONFIG_DEBUG_SLAB, we put in kfree() or
kmem_cache_free() as the last user of free objects, which is not
very useful, so change it to the caller of those functions instead.

Acked-by: David Rientjes
Acked-by: Christoph Lameter
Signed-off-by: Suleiman Souhlal
Signed-off-by: Pekka Enberg
21 May, 2011
1 commit
-
Commit e66eed651fd1 ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.

So this fixes things up a bit, using

   grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
   grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

to guide us in finding files that either need linux/prefetch.h
inclusion, or have it despite not needing it.

There are more of them around (mostly network drivers), but this gets
many core ones.

Reported-by: Stephen Rothwell
Signed-off-by: Linus Torvalds
31 Mar, 2011
1 commit
-
Fixes generated by 'codespell' and manually reviewed.
Signed-off-by: Lucas De Marchi
23 Mar, 2011
1 commit
-
While looking at some other notifier callbacks I noticed this code could
use a simple cleanup.

notifier_from_errno() no longer needs the if (ret)/else conditional. That
same conditional is now done in notifier_from_errno().

Signed-off-by: Prarit Bhargava
Cc: Paul Menage
Cc: Li Zefan
Acked-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Mar, 2011
3 commits
-
Conflicts:
mm/slub.c
-
The size of struct rcu_head may be changed. When it becomes larger,
it may pollute the data after struct slab.

Acked-by: Christoph Lameter
Signed-off-by: Lai Jiangshan
Signed-off-by: Pekka Enberg
14 Feb, 2011
1 commit
-
This reverts commit 5c5e3b33b7cb959a401f823707bee006caadd76e.
The commit breaks ARM thusly:
| Mount-cache hash table entries: 512
| slab error in verify_redzone_free(): cache `idr_layer_cache': memory outside object was overwritten
| Backtrace:
| [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
| [] (dump_stack+0x0/0x1c) from [] (__slab_error+0x28/0x30)
| [] (__slab_error+0x0/0x30) from [] (cache_free_debugcheck+0x1c0/0x2b8)
| [] (cache_free_debugcheck+0x0/0x2b8) from [] (kmem_cache_free+0x3c/0xc0)
| [] (kmem_cache_free+0x0/0xc0) from [] (ida_get_new_above+0x19c/0x1c0)
| [] (ida_get_new_above+0x0/0x1c0) from [] (alloc_vfsmnt+0x54/0x144)
| [] (alloc_vfsmnt+0x0/0x144) from [] (vfs_kern_mount+0x30/0xec)
| [] (vfs_kern_mount+0x0/0xec) from [] (kern_mount_data+0x1c/0x20)
| [] (kern_mount_data+0x0/0x20) from [] (sysfs_init+0x68/0xc8)
| [] (sysfs_init+0x0/0xc8) from [] (mnt_init+0x90/0x1b0)
| [] (mnt_init+0x0/0x1b0) from [] (vfs_caches_init+0x100/0x140)
| [] (vfs_caches_init+0x0/0x140) from [] (start_kernel+0x2e8/0x368)
| [] (start_kernel+0x0/0x368) from [] (__enable_mmu+0x0/0x2c)
| c0113268: redzone 1:0xd84156c5c032b3ac, redzone 2:0xd84156c5635688c0.
| slab error in cache_alloc_debugcheck_after(): cache `idr_layer_cache': double free, or memory outside object was overwritten
| ...
| c011307c: redzone 1:0x9f91102ffffffff, redzone 2:0x9f911029d74e35b
| slab: Internal list corruption detected in cache 'idr_layer_cache'(24), slabp c0113000(16). Hexdump:
|
| 000: 20 4f 10 c0 20 4f 10 c0 7c 00 00 00 7c 30 11 c0
| 010: 10 00 00 00 10 00 00 00 00 00 c9 17 fe ff ff ff
| 020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
| 030: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
| 040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
| 050: fe ff ff ff fe ff ff ff fe ff ff ff 11 00 00 00
| 060: 12 00 00 00 13 00 00 00 14 00 00 00 15 00 00 00
| 070: 16 00 00 00 17 00 00 00 c0 88 56 63
| kernel BUG at /home/rmk/git/linux-2.6-rmk/mm/slab.c:2928!

Reference: https://lkml.org/lkml/2011/2/7/238
Cc: # 2.6.35.y and later
Reported-and-analyzed-by: Russell King
Signed-off-by: Pekka Enberg
24 Jan, 2011
1 commit
-
The last user was ext4 and Eric Sandeen removed the call in a recent patch. See
the following URL for the discussion:

http://marc.info/?l=linux-ext4&m=129546975702198&w=2
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg
15 Jan, 2011
1 commit
-
Local symbols should be static.
Signed-off-by: H Hartley Sweeten
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Matt Mackall
Signed-off-by: Pekka Enberg
11 Jan, 2011
1 commit
-
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: Fix a crash during slabinfo -v
tracing/slab: Move kmalloc tracepoint out of inline code
slub: Fix slub_lock down/up imbalance
slub: Fix build breakage in Documentation/vm
slub tracing: move trace calls out of always inlined functions to reduce kernel code size
slub: move slabinfo.c to tools/slub/slabinfo.c
08 Jan, 2011
2 commits
-
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
gameport: use this_cpu_read instead of lookup
x86: udelay: Use this_cpu_read to avoid address calculation
x86: Use this_cpu_inc_return for nmi counter
x86: Replace uses of current_cpu_data with this_cpu ops
x86: Use this_cpu_ops to optimize code
vmstat: User per cpu atomics to avoid interrupt disable / enable
irq_work: Use per cpu atomics instead of regular atomics
cpuops: Use cmpxchg for xchg to avoid lock semantics
x86: this_cpu_cmpxchg and this_cpu_xchg operations
percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
percpu,x86: relocate this_cpu_add_return() and friends
connector: Use this_cpu operations
xen: Use this_cpu_inc_return
taskstats: Use this_cpu_ops
random: Use this_cpu_inc_return
fs: Use this_cpu_inc_return in buffer.c
highmem: Use this_cpu_xx_return() operations
vmstat: Use this_cpu_inc_return for vm statistics
x86: Support for this_cpu_add, sub, dec, inc_return
percpu: Generic support for this_cpu_add, sub, dec, inc_return
...

Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
as per Tejun.
-
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
usb: don't use flush_scheduled_work()
speedtch: don't abuse struct delayed_work
media/video: don't use flush_scheduled_work()
media/video: explicitly flush request_module work
ioc4: use static work_struct for ioc4_load_modules()
init: don't call flush_scheduled_work() from do_initcalls()
s390: don't use flush_scheduled_work()
rtc: don't use flush_scheduled_work()
mmc: update workqueue usages
mfd: update workqueue usages
dvb: don't use flush_scheduled_work()
leds-wm8350: don't use flush_scheduled_work()
mISDN: don't use flush_scheduled_work()
macintosh/ams: don't use flush_scheduled_work()
vmwgfx: don't use flush_scheduled_work()
tpm: don't use flush_scheduled_work()
sonypi: don't use flush_scheduled_work()
hvsi: don't use flush_scheduled_work()
xen: don't use flush_scheduled_work()
gdrom: don't use flush_scheduled_work()
...

Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
as per Tejun.
07 Jan, 2011
1 commit
-
This is a nasty and error prone API. It is no longer used, remove it.
Signed-off-by: Nick Piggin
17 Dec, 2010
1 commit
-
__get_cpu_var() can be replaced with this_cpu_read and will then use a
single read instruction with implied address calculation to access the
correct per cpu instance.

However, the address of a per cpu variable passed to __this_cpu_read()
cannot be determined (since it's an implied address conversion through
segment prefixes). Therefore apply this only to uses of __get_cpu_var
where the address of the variable is not used.

Cc: Pekka Enberg
Cc: Hugh Dickins
Cc: Thomas Gleixner
Acked-by: H. Peter Anvin
Signed-off-by: Christoph Lameter
Signed-off-by: Tejun Heo
15 Dec, 2010
1 commit
-
cancel_rearming_delayed_work[queue]() has been superseded by
cancel_delayed_work_sync() quite some time ago. Convert all the
in-kernel users. The conversions are completely equivalent and
trivial.

Signed-off-by: Tejun Heo
Acked-by: "David S. Miller"
Acked-by: Greg Kroah-Hartman
Acked-by: Evgeniy Polyakov
Cc: Jeff Garzik
Cc: Benjamin Herrenschmidt
Cc: Mauro Carvalho Chehab
Cc: netdev@vger.kernel.org
Cc: Anton Vorontsov
Cc: David Woodhouse
Cc: "J. Bruce Fields"
Cc: Neil Brown
Cc: Alex Elder
Cc: xfs-masters@oss.sgi.com
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Andrew Morton
Cc: netfilter-devel@vger.kernel.org
Cc: Trond Myklebust
Cc: linux-nfs@vger.kernel.org
29 Nov, 2010
1 commit
-
The tracepoint for kmalloc is in the slab inlined code which causes
every instance of kmalloc to have the tracepoint.

This patch moves the tracepoint out of the inline code to the
slab C file, which removes a large number of inlined trace
points.

objdump -dr vmlinux.slab| grep 'jmpq.*
Signed-off-by: Pekka Enberg
27 Oct, 2010
1 commit
-
Use the new {max,min}3 macros to save some cycles and bytes on the stack.
This patch substitutes trivial nested macros with their counterpart.

Signed-off-by: Hagen Paul Pfeifer
Cc: Joe Perches
Cc: Ingo Molnar
Cc: Hartley Sweeten
Cc: Russell King
Cc: Benjamin Herrenschmidt
Cc: Thomas Gleixner
Cc: Herbert Xu
Cc: Roland Dreier
Cc: Sean Hefty
Cc: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Aug, 2010
1 commit
-
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slab: fix object alignment
slub: add missing __percpu markup in mm/slub_def.h
10 Aug, 2010
1 commit
-
No real bugs, just some dead code and some fixups.
Signed-off-by: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Aug, 2010
1 commit
-
This patch fixes alignment of slab objects in case CONFIG_DEBUG_PAGEALLOC is
active.
Before this spot in kmem_cache_create, we have this situation:
- align contains the required alignment of the object
- cachep->obj_offset is 0 or equals align in case of CONFIG_DEBUG_SLAB
- size equals the size of the object, or object plus trailing redzone in case
of CONFIG_DEBUG_SLAB

This spot tries to fill one page per object if the object is in certain size
limits, however setting obj_offset to PAGE_SIZE - size does break the object
alignment since size may not be aligned with the required alignment.
This patch simply adds an ALIGN(size, align) to the equation and fixes the
object size detection accordingly.

This code in drivers/s390/cio/qdio_setup_init has led to incorrectly aligned
slab objects (sizeof(struct qdio_q) equals 1792):
    qdio_q_cache = kmem_cache_create("qdio_q", sizeof(struct qdio_q),
                                     256, 0, NULL);

Acked-by: Christoph Lameter
Signed-off-by: Carsten Otte
Signed-off-by: Pekka Enberg
07 Aug, 2010
1 commit
-
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: Allow removal of slab caches during boot
Revert "slub: Allow removal of slab caches during boot"
slub numa: Fix rare allocation from unexpected node
slab: use deferable timers for its periodic housekeeping
slub: Use kmem_cache flags to detect if slab is in debugging mode.
slub: Allow removal of slab caches during boot
slub: Check kasprintf results in kmem_cache_init()
SLUB: Constants need UL
slub: Use a constant for a unspecified node.
SLOB: Free objects to their own list
slab: fix caller tracking on !CONFIG_DEBUG_SLAB && CONFIG_TRACING
20 Jul, 2010
1 commit
-
slab has a "once every 2 second" timer for its housekeeping.
As the number of logical processors is growing, it's more and more
common that this 2 second timer becomes the primary wakeup source.

This patch turns this housekeeping timer into a deferrable timer,
which means that the timer does not interrupt idle, but just runs
at the next event that wakes the cpu up.

The impact is that the timer likely runs a bit later, but during the
delay no code is running so there's not all that much reason for
a difference in housekeeping to occur because of this delay.

Signed-off-by: Arjan van de Ven
Signed-off-by: Pekka Enberg
09 Jun, 2010
1 commit
-
We have been resisting new ftrace plugins and removing existing
ones, and kmemtrace has been superseded by kmem trace events
and perf-kmem, so we remove it.

Signed-off-by: Li Zefan
Acked-by: Pekka Enberg
Acked-by: Eduard - Gabriel Munteanu
Cc: Ingo Molnar
Cc: Steven Rostedt
[ remove kmemtrace from the makefile, handle slob too ]
Signed-off-by: Frederic Weisbecker
28 May, 2010
3 commits
-
Example usage of generic "numa_mem_id()":
The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes
well. Specifically, the "fast path"--____cache_alloc()--will never
succeed as slab doesn't cache offnode object on the per cpu queues, and
for memoryless nodes, all memory will be "off node" relative to
numa_node_id(). This adds significant overhead to all kmem cache
allocations, incurring a significant regression relative to earlier
kernels [from before slab.c was reorganized].

This patch uses the generic topology function "numa_mem_id()" to return
the "effective local memory node" for the calling context. This is the
first node in the local node's generic fallback zonelist-- the same node
that "local" mempolicy-based allocations would use. This lets slab cache
these "local" allocations and avoid fallback/refill on every allocation.

N.B.: Slab will need to handle node and memory hotplug events that could
change the value returned by numa_mem_id() for any given node if recent
changes to address memory hotplug don't already address this. E.g., flush
all per cpu slab queues before rebuilding the zonelists while the
"machine" is held in the stopped state.

Performance impact on "hackbench 400 process 200"

2.6.34-rc3-mmotm-100405-1609              no-patch   this-patch
ia64 no memoryless nodes [avg of 10]:       11.713       11.637  ~0.65 diff
ia64 cpus all on memless nodes [10]:       228.259       26.484  ~8.6x speedup

The slowdown of the patched kernel from ~12 sec to ~28 seconds when
configured with memoryless nodes is the result of all cpus allocating from
a single node's mm pagepool. The cache lines of the single node are
distributed/interleaved over the memory of the real physical nodes, but
the zone lock, list heads, ... of the single node with memory still each
live in a single cache line that is accessed from all processors.

x86_64 [8x6 AMD] [avg of 40]:                2.883        2.845
Signed-off-by: Lee Schermerhorn
Cc: Tejun Heo
Cc: Mel Gorman
Cc: Christoph Lameter
Cc: Nick Piggin
Cc: David Rientjes
Cc: Eric Whitney
Cc: KAMEZAWA Hiroyuki
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: "Luck, Tony"
Cc: Pekka Enberg
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
By the previous modification, the cpu notifier can return an encapsulated
errno value. This converts the cpu notifiers for slab.

Signed-off-by: Akinobu Mita
Cc: Christoph Lameter
Acked-by: Pekka Enberg
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We have observed several workloads running on multi-node systems where
memory is assigned unevenly across the nodes in the system. There are
numerous reasons for this but one is the round-robin rotor in
cpuset_mem_spread_node().

For example, a simple test that writes a multi-page file will allocate
pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
allocates on odd nodes & skips even nodes).

An example is shown below. The program "lfile" writes a file consisting
of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
MPOL_F_NODE) to determine the nodes where the file pages were allocated.
The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

There is a single rotor that is used for allocating both file pages & slab
pages. Writing the file allocates both a data page & a slab page
(buffer_head). This advances the RR rotor 2 nodes for each page
allocated.

A quick test seems to confirm this is the cause of the uneven
allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

This patch introduces a second rotor that is used for slab allocations.
Signed-off-by: Jack Steiner
Acked-by: Christoph Lameter
Cc: Pekka Enberg
Cc: Paul Menage
Cc: Jack Steiner
Cc: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 May, 2010
1 commit
-
Before applying this patch, cpuset updates task->mems_allowed and
mempolicy by setting all new bits in the nodemask first, and clearing all
old unallowed bits later. But along the way, the allocator may find that
there is no node to alloc memory.

The reason is that cpuset rebinds the task's mempolicy, it cleans the
nodes which the allocator can alloc pages on, for example:

(mpol: mempolicy)
task1                   task1's mpol    task2
alloc page              1
  alloc on node0? NO    1
                        1               change mems from 1 to 0
                        1               rebind task1's mpol
                        0-1               set new bits
                        0                 clear disallowed bits
  alloc on node1? NO    0
  ...
can't alloc page
  goto oom

This patch fixes this problem by expanding the nodes range first (set newly
allowed bits) and shrinking it lazily (clear newly disallowed bits). So we
use a variable to tell the write-side task that read-side task is reading
nodemask, and the write-side task clears newly disallowed nodes after
read-side task ends the current memory allocation.

[akpm@linux-foundation.org: fix spello]
Signed-off-by: Miao Xie
Cc: David Rientjes
Cc: Nick Piggin
Cc: Paul Menage
Cc: Lee Schermerhorn
Cc: Hugh Dickins
Cc: Ravikiran Thirumalai
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds