05 Dec, 2011

1 commit

  • Commit 30765b92 ("slab, lockdep: Annotate the locks before using
    them") moves the init_lock_keys() call from after g_cpucache_up =
    FULL to before it, but overlooks the fact that init_node_lock_keys()
    tests for that state and ignores everything !FULL.

    Introduce a LATE stage and change the lockdep test to be < LATE.
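
    A minimal sketch of the resulting gating, assuming the slab bootstrap
    enum and init_node_lock_keys() described above (enum members other
    than LATE/FULL are illustrative):

        /* sketch: add a LATE state that is reached before FULL */
        static enum { NONE, PARTIAL, EARLY, LATE, FULL } g_cpucache_up;

        static void init_node_lock_keys(int node)
        {
                /* was: if (g_cpucache_up != FULL) return; */
                if (g_cpucache_up < LATE)
                        return;
                /* ... annotate the per-node list_lock classes ... */
        }
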
    Cc: Pekka Enberg
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 Sep, 2011

1 commit

  • Historically /proc/slabinfo and the files under /sys/kernel/slab/*
    have been world-readable. slabinfo contains rather private
    information related both to the kernel and to userspace tasks.
    Depending on the situation, it might reveal either private
    information per se or information useful for mounting another
    targeted attack. Some examples of what can be learned by
    reading/watching /proc/slabinfo entries:

    1) dentry (and the various *inode*) counts might reveal other processes'
    fs activity. The number of dentry "active objects" doesn't strictly show
    the number of files opened/touched by a process; however, there is a good
    correlation between them. The patch "proc: force dcache drop on
    unauthorized access" relies on the privacy of the dentry count.

    2) the various inode entries might reveal the same information as (1), but
    these are more finely grained counters. If a filesystem is mounted in a
    private mount point (or even a private namespace) and its fs type differs
    from other mounted fs types, fs activity in this mount point/namespace is
    revealed. If there is a single ecryptfs mount point, the whole fs
    activity of a single user is revealed. The number of files in an ecryptfs
    mount point is private information per se.

    3) fuse_* reveals the number of files / fs activity of a user in a
    user-private mount point. It is of approximately the same severity as the
    ecryptfs infoleak in (2).

    4) sysfs_dir_cache, similar to (2), reveals device addition/removal,
    which could otherwise be hidden by "chmod 0700 /sys/". With a 0444
    slabinfo the precise number of sysfs files is known to the world.

    5) buffer_head might reveal some kernel activity. With other
    information leaks an attacker might identify what specific kernel
    routines generate buffer_head activity.

    6) *kmalloc* infoleaks are very situational. An attacker should watch
    the specific kmalloc size entry and filter out the noise from unrelated
    kernel activity. If the victim system is relatively quiet, the attacker
    might get rather precise counters.

    Additional information sources might significantly increase the slabinfo
    infoleak benefits. E.g. if an attacker knows that the processes
    activity on the system is very low (only core daemons like syslog and
    cron), he may run setxid binaries / trigger local daemon activity /
    trigger network services activity / await sporadic cron jobs activity
    / etc. and get rather precise counters for fs and network activity of
    these privileged tasks, which is unknown otherwise.

    Also, hiding slabinfo and /sys/kernel/slab/* is one step toward
    complicating exploitation of kernel heap overflows (and possibly other
    bugs). The related discussion:

    http://thread.gmane.org/gmane.linux.kernel/1108378

    To keep compatibility with the old permission model, where a non-root
    monitoring daemon could watch for kernel memory leaks through slabinfo,
    one should do:

    groupadd slabinfo
    usermod -a -G slabinfo $MONITOR_USER

    And add the following commands to init scripts (to mountall.conf in
    Ubuntu's upstart case):

    chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
    chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*

    Signed-off-by: Vasiliy Kulikov
    Reviewed-by: Kees Cook
    Reviewed-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    CC: Valdis.Kletnieks@vt.edu
    CC: Linus Torvalds
    CC: Alan Cox
    Signed-off-by: Pekka Enberg

    Vasiliy Kulikov
     

19 Sep, 2011

1 commit


04 Aug, 2011

2 commits

  • Fernando found we hit the regular OFF_SLAB 'recursion' before we
    annotate the locks, cure this.

    The relevant portion of the stack-trace:

    > [ 0.000000] [] rt_spin_lock+0x50/0x56
    > [ 0.000000] [] __cache_free+0x43/0xc3
    > [ 0.000000] [] kmem_cache_free+0x6c/0xdc
    > [ 0.000000] [] slab_destroy+0x4f/0x53
    > [ 0.000000] [] free_block+0x94/0xc1
    > [ 0.000000] [] do_tune_cpucache+0x10b/0x2bb
    > [ 0.000000] [] enable_cpucache+0x7b/0xa7
    > [ 0.000000] [] kmem_cache_init_late+0x1f/0x61
    > [ 0.000000] [] start_kernel+0x24c/0x363
    > [ 0.000000] [] i386_start_kernel+0xa9/0xaf

    Reported-by: Fernando Lopez-Lezcano
    Acked-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Lockdep thinks there's lock recursion through:

    kmem_cache_free()
      cache_flusharray()
        spin_lock(&l3->list_lock)        <--.
          ...                               |
            spin_lock(&l3->list_lock)     --'

    Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no
    actual possibility of recursing. Luckily debug objects marks its slab
    with SLAB_DEBUG_OBJECTS so we can identify the thing.

    Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special
    lockdep key so that lockdep sees it's a different cachep.

    Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU |
    SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble.
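
    A rough sketch of the two changes (the key and variable names here are
    hypothetical, not the exact patch):

        /* a dedicated lockdep class for the debug-objects cache */
        static struct lock_class_key debugobj_l3_key;   /* hypothetical name */

        if (cachep->flags & SLAB_DEBUG_OBJECTS)
                lockdep_set_class(&l3->list_lock, &debugobj_l3_key);

        /* warn if anyone ever creates the combination that could recurse */
        WARN_ON_ONCE((flags & SLAB_DESTROY_BY_RCU) &&
                     (flags & SLAB_DEBUG_OBJECTS));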

    Reported-and-tested-by: Sebastian Siewior
    [ fixes to the initial patch ]
    Reported-by: Thomas Gleixner
    Acked-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Aug, 2011

1 commit

  • Less code and the advantage of ascii dump.

    before:
    | Slab corruption: names_cache start=c5788000, len=4096
    | 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00
    | 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    | 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff
    | 030: ff ff ff ff e2 b4 17 18 c7 e4 08 06 00 01 08 00
    | 040: 06 04 00 01 e2 b4 17 18 c7 e4 0a 00 00 01 00 00
    | 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b

    after:
    | Slab corruption: size-4096 start=c38a9000, len=4096
    | 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00 kk....V...$...*.
    | 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    | 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ................
    | 030: ff ff ff ff d2 56 5f aa db 9c 08 06 00 01 08 00 .....V_.........
    | 040: 06 04 00 01 d2 56 5f aa db 9c 0a 00 00 01 00 00 .....V_.........
    | 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b ........kkkkkkkk

    Acked-by: Christoph Lameter
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Pekka Enberg

    Sebastian Andrzej Siewior
     

31 Jul, 2011

1 commit


28 Jul, 2011

1 commit


23 Jul, 2011

1 commit

  • * 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slab: fix DEBUG_SLAB warning
    slab: shrink sizeof(struct kmem_cache)
    slab: fix DEBUG_SLAB build
    SLUB: Fix missing include
    slub: reduce overhead of slub_debug
    slub: Add method to verify memory is not freed
    slub: Enable backtrace for create/delete points
    slab allocators: Provide generic description of alignment defines
    slab, slub, slob: Unify alignment definition
    slob/lockdep: Fix gfp flags passed to lockdep

    Linus Torvalds
     

22 Jul, 2011

1 commit

  • In commit c225150b "slab: fix DEBUG_SLAB build",
    "if ((unsigned long)objp & (ARCH_SLAB_MINALIGN-1))" is always true if
    ARCH_SLAB_MINALIGN == 0. Do not print warning if ARCH_SLAB_MINALIGN == 0.
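
    The resulting check looks roughly like this (sketch; the message text is
    illustrative):

        /* skip the alignment warning entirely when ARCH_SLAB_MINALIGN is 0 */
        if (ARCH_SLAB_MINALIGN &&
            ((unsigned long)objp & (ARCH_SLAB_MINALIGN - 1)))
                printk(KERN_ERR "0x%p: not aligned to ARCH_SLAB_MINALIGN=%d\n",
                       objp, (int)ARCH_SLAB_MINALIGN);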

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Pekka Enberg

    Tetsuo Handa
     

21 Jul, 2011

1 commit

  • Reduce high order allocations for some setups.
    (NR_CPUS=4096 -> we need 64KB per kmem_cache struct)

    We now allocate the exact size needed (using nr_cpu_ids and nr_node_ids).

    This also makes the code a bit smaller on x86_64, since some field offsets
    now fit below the 127-byte (8-bit displacement) limit:

    Before patch :
    # size mm/slab.o
    text data bss dec hex filename
    22605 361665 32 384302 5dd2e mm/slab.o

    After patch :
    # size mm/slab.o
    text data bss dec hex filename
    22349 353473 8224 384046 5dc2e mm/slab.o
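
    A hedged sketch of the sizing idea (the layout shown is illustrative, not
    the exact struct): the descriptor is sized by what this machine actually
    has, rather than by compile-time maxima:

        /* sketch: size by nr_cpu_ids/nr_node_ids, not NR_CPUS/MAX_NUMNODES */
        size_t sz = sizeof(struct kmem_cache)
                  + nr_cpu_ids  * sizeof(struct array_cache *)
                  + nr_node_ids * sizeof(struct kmem_list3 *);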

    CC: Andrew Morton
    Reported-by: Konstantin Khlebnikov
    Signed-off-by: Eric Dumazet
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Eric Dumazet
     

18 Jul, 2011

1 commit

  • Fix CONFIG_SLAB=y CONFIG_DEBUG_SLAB=y build error and warnings.

    Now that ARCH_SLAB_MINALIGN defaults to __alignof__(unsigned long long),
    it is always defined (when slab.h is included), but cannot be used in #if:
    mm/slab.c: In function `cache_alloc_debugcheck_after':
    mm/slab.c:3156:5: warning: "__alignof__" is not defined
    mm/slab.c:3156:5: error: missing binary operator before token "("
    make[1]: *** [mm/slab.o] Error 1

    So just remove the #if and #endif lines, but then 64-bit build warns:
    mm/slab.c: In function `cache_alloc_debugcheck_after':
    mm/slab.c:3156:6: warning: cast from pointer to integer of different size
    mm/slab.c:3158:10: warning: format `%d' expects type `int', but argument
    3 has type `long unsigned int'
    Fix those with casts, whatever the actual type of ARCH_SLAB_MINALIGN.

    Acked-by: Christoph Lameter
    Signed-off-by: Hugh Dickins
    Signed-off-by: Pekka Enberg

    Hugh Dickins
     

04 Jun, 2011

1 commit

  • Currently, when using CONFIG_DEBUG_SLAB, we put in kfree() or
    kmem_cache_free() as the last user of free objects, which is not
    very useful, so change it to the caller of those functions instead.

    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Pekka Enberg

    Suleiman Souhlal
     

21 May, 2011

1 commit

  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need <linux/prefetch.h>
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Mar, 2011

1 commit


23 Mar, 2011

1 commit

  • While looking at some other notifier callbacks I noticed this code could
    use a simple cleanup.

    The caller of notifier_from_errno() no longer needs the if (ret)/else
    conditional; that same conditional is now done inside notifier_from_errno()
    itself.
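
    In other words, the cleanup is of this shape (sketch):

        /* before */
        if (err)
                ret = notifier_from_errno(err);
        else
                ret = NOTIFY_OK;

        /* after: notifier_from_errno(0) already returns NOTIFY_OK */
        ret = notifier_from_errno(err);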

    Signed-off-by: Prarit Bhargava
    Cc: Paul Menage
    Cc: Li Zefan
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prarit Bhargava
     

12 Mar, 2011

3 commits


14 Feb, 2011

1 commit

  • This reverts commit 5c5e3b33b7cb959a401f823707bee006caadd76e.

    The commit breaks ARM thusly:

    | Mount-cache hash table entries: 512
    | slab error in verify_redzone_free(): cache `idr_layer_cache': memory outside object was overwritten
    | Backtrace:
    | [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
    | [] (dump_stack+0x0/0x1c) from [] (__slab_error+0x28/0x30)
    | [] (__slab_error+0x0/0x30) from [] (cache_free_debugcheck+0x1c0/0x2b8)
    | [] (cache_free_debugcheck+0x0/0x2b8) from [] (kmem_cache_free+0x3c/0xc0)
    | [] (kmem_cache_free+0x0/0xc0) from [] (ida_get_new_above+0x19c/0x1c0)
    | [] (ida_get_new_above+0x0/0x1c0) from [] (alloc_vfsmnt+0x54/0x144)
    | [] (alloc_vfsmnt+0x0/0x144) from [] (vfs_kern_mount+0x30/0xec)
    | [] (vfs_kern_mount+0x0/0xec) from [] (kern_mount_data+0x1c/0x20)
    | [] (kern_mount_data+0x0/0x20) from [] (sysfs_init+0x68/0xc8)
    | [] (sysfs_init+0x0/0xc8) from [] (mnt_init+0x90/0x1b0)
    | [] (mnt_init+0x0/0x1b0) from [] (vfs_caches_init+0x100/0x140)
    | [] (vfs_caches_init+0x0/0x140) from [] (start_kernel+0x2e8/0x368)
    | [] (start_kernel+0x0/0x368) from [] (__enable_mmu+0x0/0x2c)
    | c0113268: redzone 1:0xd84156c5c032b3ac, redzone 2:0xd84156c5635688c0.
    | slab error in cache_alloc_debugcheck_after(): cache `idr_layer_cache': double free, or memory outside object was overwritten
    | ...
    | c011307c: redzone 1:0x9f91102ffffffff, redzone 2:0x9f911029d74e35b
    | slab: Internal list corruption detected in cache 'idr_layer_cache'(24), slabp c0113000(16). Hexdump:
    |
    | 000: 20 4f 10 c0 20 4f 10 c0 7c 00 00 00 7c 30 11 c0
    | 010: 10 00 00 00 10 00 00 00 00 00 c9 17 fe ff ff ff
    | 020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
    | 030: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
    | 040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
    | 050: fe ff ff ff fe ff ff ff fe ff ff ff 11 00 00 00
    | 060: 12 00 00 00 13 00 00 00 14 00 00 00 15 00 00 00
    | 070: 16 00 00 00 17 00 00 00 c0 88 56 63
    | kernel BUG at /home/rmk/git/linux-2.6-rmk/mm/slab.c:2928!

    Reference: https://lkml.org/lkml/2011/2/7/238
    Cc: stable@kernel.org # 2.6.35.y and later
    Reported-and-analyzed-by: Russell King
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

24 Jan, 2011

1 commit


15 Jan, 2011

1 commit


11 Jan, 2011

1 commit


08 Jan, 2011

2 commits

  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
    gameport: use this_cpu_read instead of lookup
    x86: udelay: Use this_cpu_read to avoid address calculation
    x86: Use this_cpu_inc_return for nmi counter
    x86: Replace uses of current_cpu_data with this_cpu ops
    x86: Use this_cpu_ops to optimize code
    vmstat: User per cpu atomics to avoid interrupt disable / enable
    irq_work: Use per cpu atomics instead of regular atomics
    cpuops: Use cmpxchg for xchg to avoid lock semantics
    x86: this_cpu_cmpxchg and this_cpu_xchg operations
    percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
    percpu,x86: relocate this_cpu_add_return() and friends
    connector: Use this_cpu operations
    xen: Use this_cpu_inc_return
    taskstats: Use this_cpu_ops
    random: Use this_cpu_inc_return
    fs: Use this_cpu_inc_return in buffer.c
    highmem: Use this_cpu_xx_return() operations
    vmstat: Use this_cpu_inc_return for vm statistics
    x86: Support for this_cpu_add, sub, dec, inc_return
    percpu: Generic support for this_cpu_add, sub, dec, inc_return
    ...

    Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
    as per Tejun.

    Linus Torvalds
     
  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
    usb: don't use flush_scheduled_work()
    speedtch: don't abuse struct delayed_work
    media/video: don't use flush_scheduled_work()
    media/video: explicitly flush request_module work
    ioc4: use static work_struct for ioc4_load_modules()
    init: don't call flush_scheduled_work() from do_initcalls()
    s390: don't use flush_scheduled_work()
    rtc: don't use flush_scheduled_work()
    mmc: update workqueue usages
    mfd: update workqueue usages
    dvb: don't use flush_scheduled_work()
    leds-wm8350: don't use flush_scheduled_work()
    mISDN: don't use flush_scheduled_work()
    macintosh/ams: don't use flush_scheduled_work()
    vmwgfx: don't use flush_scheduled_work()
    tpm: don't use flush_scheduled_work()
    sonypi: don't use flush_scheduled_work()
    hvsi: don't use flush_scheduled_work()
    xen: don't use flush_scheduled_work()
    gdrom: don't use flush_scheduled_work()
    ...

    Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
    as per Tejun.

    Linus Torvalds
     

07 Jan, 2011

1 commit


17 Dec, 2010

1 commit

  • __get_cpu_var() can be replaced with this_cpu_read and will then use a
    single read instruction with implied address calculation to access the
    correct per cpu instance.

    However, the address of a per cpu variable passed to __this_cpu_read()
    cannot be determined (since it's an implied address conversion through
    segment prefixes). Therefore apply this only to uses of __get_cpu_var
    where the address of the variable is not used.
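
    For example (the per-cpu variable here is purely illustrative):

        DEFINE_PER_CPU(int, slab_counter);      /* hypothetical variable */

        /* before: computes the per-cpu address, then loads through it */
        int v = __get_cpu_var(slab_counter);

        /* after: a single segment-prefixed load on x86, no address math */
        int w = __this_cpu_read(slab_counter);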

    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

15 Dec, 2010

1 commit

  • cancel_rearming_delayed_work[queue]() was superseded by
    cancel_delayed_work_sync() quite some time ago. Convert all the
    in-kernel users. The conversions are completely equivalent and
    trivial.
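
    The conversion is mechanical, e.g. (sketch with a hypothetical work item):

        static struct delayed_work my_work;     /* hypothetical */

        /* before */
        cancel_rearming_delayed_work(&my_work);

        /* after: equivalent, and the preferred interface */
        cancel_delayed_work_sync(&my_work);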

    Signed-off-by: Tejun Heo
    Acked-by: "David S. Miller"
    Acked-by: Greg Kroah-Hartman
    Acked-by: Evgeniy Polyakov
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Mauro Carvalho Chehab
    Cc: netdev@vger.kernel.org
    Cc: Anton Vorontsov
    Cc: David Woodhouse
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Cc: Alex Elder
    Cc: xfs-masters@oss.sgi.com
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Andrew Morton
    Cc: netfilter-devel@vger.kernel.org
    Cc: Trond Myklebust
    Cc: linux-nfs@vger.kernel.org

    Tejun Heo
     

29 Nov, 2010

1 commit


27 Oct, 2010

1 commit

  • Use the new {max,min}3 macros to save some cycles and bytes on the stack.
    This patch substitutes trivial nested max()/min() macros with their
    three-argument counterparts.
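
    I.e., substitutions of this shape (sketch):

        /* before: nested two-argument macros */
        x = max(a, max(b, c));
        y = min(a, min(b, c));

        /* after: the new three-argument forms */
        x = max3(a, b, c);
        y = min3(a, b, c);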

    Signed-off-by: Hagen Paul Pfeifer
    Cc: Joe Perches
    Cc: Ingo Molnar
    Cc: Hartley Sweeten
    Cc: Russell King
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Herbert Xu
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hagen Paul Pfeifer
     

23 Aug, 2010

1 commit


10 Aug, 2010

1 commit


09 Aug, 2010

1 commit

  • This patch fixes alignment of slab objects in case CONFIG_DEBUG_PAGEALLOC is
    active.
    Before this spot in kmem_cache_create, we have this situation:
    - align contains the required alignment of the object
    - cachep->obj_offset is 0 or equals align in case of CONFIG_DEBUG_SLAB
    - size equals the size of the object, or object plus trailing redzone in case
    of CONFIG_DEBUG_SLAB

    This spot tries to fill one page per object if the object is within
    certain size limits; however, setting obj_offset to PAGE_SIZE - size
    breaks the object alignment, since size may not be a multiple of the
    required alignment. This patch simply adds an ALIGN(size, align) to the
    equation and fixes the object size detection accordingly.
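
    The essence of the change is (sketch of the affected line):

        /* before: may leave the object start misaligned when size is not
         * a multiple of the requested alignment */
        cachep->obj_offset += PAGE_SIZE - size;

        /* after: round size up so the end-of-page placement stays aligned */
        cachep->obj_offset += PAGE_SIZE - ALIGN(size, align);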

    This code in drivers/s390/cio/qdio_setup_init has led to incorrectly
    aligned slab objects (sizeof(struct qdio_q) equals 1792):
    qdio_q_cache = kmem_cache_create("qdio_q", sizeof(struct qdio_q),
    256, 0, NULL);

    Acked-by: Christoph Lameter
    Signed-off-by: Carsten Otte
    Signed-off-by: Pekka Enberg

    Carsten Otte
     

07 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Allow removal of slab caches during boot
    Revert "slub: Allow removal of slab caches during boot"
    slub numa: Fix rare allocation from unexpected node
    slab: use deferable timers for its periodic housekeeping
    slub: Use kmem_cache flags to detect if slab is in debugging mode.
    slub: Allow removal of slab caches during boot
    slub: Check kasprintf results in kmem_cache_init()
    SLUB: Constants need UL
    slub: Use a constant for a unspecified node.
    SLOB: Free objects to their own list
    slab: fix caller tracking on !CONFIG_DEBUG_SLAB && CONFIG_TRACING

    Linus Torvalds
     

20 Jul, 2010

1 commit

  • slab has a "once every 2 seconds" timer for its housekeeping.
    As the number of logical processors grows, it's more and more
    common that this 2 second timer becomes the primary wakeup source.

    This patch turns this housekeeping timer into a deferrable timer,
    which means that the timer does not interrupt idle, but just runs
    at the next event that wakes the cpu up.

    The impact is that the timer likely runs a bit later, but during the
    delay no code is running so there's not all that much reason for
    a difference in housekeeping to occur because of this delay.
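
    With the workqueue API of that era, the change amounts to roughly this
    (sketch; the work item name is an assumption):

        /* before: a regular delayed work whose timer can wake an idle cpu */
        INIT_DELAYED_WORK(&reap_work, cache_reap);

        /* after: deferrable, so the timer waits for the next real wakeup */
        INIT_DELAYED_WORK_DEFERRABLE(&reap_work, cache_reap);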

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Pekka Enberg

    Arjan van de Ven
     

09 Jun, 2010

1 commit

  • We have been resisting new ftrace plugins and removing existing
    ones, and kmemtrace has been superseded by kmem trace events
    and perf-kmem, so we remove it.

    Signed-off-by: Li Zefan
    Acked-by: Pekka Enberg
    Acked-by: Eduard - Gabriel Munteanu
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    [ remove kmemtrace from the makefile, handle slob too ]
    Signed-off-by: Frederic Weisbecker

    Li Zefan
     

28 May, 2010

3 commits

  • Example usage of generic "numa_mem_id()":

    The mainline slab code, since ~2.6.19, does not handle memoryless nodes
    well. Specifically, the "fast path", ____cache_alloc(), will never
    succeed, as slab doesn't cache off-node objects on the per-cpu queues, and
    for memoryless nodes, all memory will be "off node" relative to
    numa_node_id(). This adds significant overhead to all kmem cache
    allocations, incurring a significant regression relative to earlier
    kernels [from before slab.c was reorganized].

    This patch uses the generic topology function "numa_mem_id()" to return
    the "effective local memory node" for the calling context. This is the
    first node in the local node's generic fallback zonelist-- the same node
    that "local" mempolicy-based allocations would use. This lets slab cache
    these "local" allocations and avoid fallback/refill on every allocation.
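
    Conceptually, the change in the allocation path is (sketch):

        /* before: on a memoryless node this never names a node with memory */
        nid = numa_node_id();

        /* after: the nearest node in the fallback list that has memory */
        nid = numa_mem_id();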

    N.B.: Slab will need to handle node and memory hotplug events that could
    change the value returned by numa_mem_id() for any given node if recent
    changes to address memory hotplug don't already address this. E.g., flush
    all per cpu slab queues before rebuilding the zonelists while the
    "machine" is held in the stopped state.

    Performance impact on "hackbench 400 process 200"

    2.6.34-rc3-mmotm-100405-1609            no-patch   this-patch
    ia64 no memoryless nodes [avg of 10]:     11.713       11.637   ~0.65 diff
    ia64 cpus all on memless nodes  [10]:    228.259       26.484   ~8.6x speedup

    The slowdown of the patched kernel from ~12 sec to ~28 seconds when
    configured with memoryless nodes is the result of all cpus allocating from
    a single node's mm pagepool. The cache lines of the single node are
    distributed/interleaved over the memory of the real physical nodes, but
    the zone lock, list heads, ... of the single node with memory still each
    live in a single cache line that is accessed from all processors.

    x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • With the previous modification, the cpu notifier can return an
    encapsulated errno value. This converts the cpu notifiers for slab
    accordingly.

    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick test seems to confirm this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.
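
    Sketch of the effect on slab's node choice (the helper name shown is an
    assumption based on this description):

        /* before: slab shared the page-cache spread rotor */
        nid = cpuset_mem_spread_node();

        /* after: slab advances its own rotor (assumed helper name) */
        nid = cpuset_slab_spread_node();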

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     

25 May, 2010

1 commit

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But along the way, the allocator may find that
    there is no node from which to allocate memory.

    The reason is that when cpuset rebinds the task's mempolicy, it clears the
    nodes on which the allocator can allocate pages, for example:

    (mpol: mempolicy)
    task1                         task1's mpol    task2
    alloc page                    1
      alloc on node0? NO          1
                                  1               change mems from 1 to 0
                                  1               rebind task1's mpol
                                  0-1               set new bits
                                  0                 clear disallowed bits
      alloc on node1? NO          0
      ...
      can't alloc page
        goto oom

    This patch fixes this problem by expanding the nodes range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). We use a variable to tell the write-side task that a
    read-side task is reading the nodemask, and the write-side task clears the
    newly disallowed nodes only after the read-side task ends its current
    memory allocation.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie