21 May, 2011

1 commit

  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need linux/prefetch.h
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.
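
    For illustration, the typical fixup in an affected file is just the
    explicit include (file and identifier names below are hypothetical):

    /* hypothetical-driver.c: previously relied on linux/list.h
     * pulling in linux/prefetch.h indirectly. */
    #include <linux/list.h>
    #include <linux/prefetch.h>    /* now required explicitly */

    static void warm_next(struct list_head *pos)
    {
            prefetch(pos->next);   /* breaks the build without the include */
    }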

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Mar, 2011

1 commit


23 Mar, 2011

1 commit

  • While looking at some other notifier callbacks I noticed this code could
    use a simple cleanup.

    The if (ret)/else conditional around the notifier_from_errno() call is
    no longer needed; that same check is now done inside
    notifier_from_errno() itself.
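
    A sketch of the resulting call-site cleanup (simplified; upstream's
    notifier_from_errno() returns NOTIFY_OK when passed 0):

    /* Before: the caller open-codes the zero check. */
    if (ret)
            return notifier_from_errno(ret);
    return NOTIFY_OK;

    /* After: notifier_from_errno() handles ret == 0 itself. */
    return notifier_from_errno(ret);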

    Signed-off-by: Prarit Bhargava
    Cc: Paul Menage
    Cc: Li Zefan
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prarit Bhargava
     

12 Mar, 2011

3 commits


14 Feb, 2011

1 commit

  • This reverts commit 5c5e3b33b7cb959a401f823707bee006caadd76e.

    The commit breaks ARM thusly:

    | Mount-cache hash table entries: 512
    | slab error in verify_redzone_free(): cache `idr_layer_cache': memory outside object was overwritten
    | Backtrace:
    | [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
    | [] (dump_stack+0x0/0x1c) from [] (__slab_error+0x28/0x30)
    | [] (__slab_error+0x0/0x30) from [] (cache_free_debugcheck+0x1c0/0x2b8)
    | [] (cache_free_debugcheck+0x0/0x2b8) from [] (kmem_cache_free+0x3c/0xc0)
    | [] (kmem_cache_free+0x0/0xc0) from [] (ida_get_new_above+0x19c/0x1c0)
    | [] (ida_get_new_above+0x0/0x1c0) from [] (alloc_vfsmnt+0x54/0x144)
    | [] (alloc_vfsmnt+0x0/0x144) from [] (vfs_kern_mount+0x30/0xec)
    | [] (vfs_kern_mount+0x0/0xec) from [] (kern_mount_data+0x1c/0x20)
    | [] (kern_mount_data+0x0/0x20) from [] (sysfs_init+0x68/0xc8)
    | [] (sysfs_init+0x0/0xc8) from [] (mnt_init+0x90/0x1b0)
    | [] (mnt_init+0x0/0x1b0) from [] (vfs_caches_init+0x100/0x140)
    | [] (vfs_caches_init+0x0/0x140) from [] (start_kernel+0x2e8/0x368)
    | [] (start_kernel+0x0/0x368) from [] (__enable_mmu+0x0/0x2c)
    | c0113268: redzone 1:0xd84156c5c032b3ac, redzone 2:0xd84156c5635688c0.
    | slab error in cache_alloc_debugcheck_after(): cache `idr_layer_cache': double free, or memory outside object was overwritten
    | ...
    | c011307c: redzone 1:0x9f91102ffffffff, redzone 2:0x9f911029d74e35b
    | slab: Internal list corruption detected in cache 'idr_layer_cache'(24), slabp c0113000(16). Hexdump:
    |
    | 000: 20 4f 10 c0 20 4f 10 c0 7c 00 00 00 7c 30 11 c0
    | 010: 10 00 00 00 10 00 00 00 00 00 c9 17 fe ff ff ff
    | 020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
    | 030: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
    | 040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
    | 050: fe ff ff ff fe ff ff ff fe ff ff ff 11 00 00 00
    | 060: 12 00 00 00 13 00 00 00 14 00 00 00 15 00 00 00
    | 070: 16 00 00 00 17 00 00 00 c0 88 56 63
    | kernel BUG at /home/rmk/git/linux-2.6-rmk/mm/slab.c:2928!

    Reference: https://lkml.org/lkml/2011/2/7/238
    Cc: stable@kernel.org # 2.6.35.y and later
    Reported-and-analyzed-by: Russell King
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

24 Jan, 2011

1 commit


15 Jan, 2011

1 commit


11 Jan, 2011

1 commit


08 Jan, 2011

2 commits

  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
    gameport: use this_cpu_read instead of lookup
    x86: udelay: Use this_cpu_read to avoid address calculation
    x86: Use this_cpu_inc_return for nmi counter
    x86: Replace uses of current_cpu_data with this_cpu ops
    x86: Use this_cpu_ops to optimize code
    vmstat: Use per cpu atomics to avoid interrupt disable / enable
    irq_work: Use per cpu atomics instead of regular atomics
    cpuops: Use cmpxchg for xchg to avoid lock semantics
    x86: this_cpu_cmpxchg and this_cpu_xchg operations
    percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
    percpu,x86: relocate this_cpu_add_return() and friends
    connector: Use this_cpu operations
    xen: Use this_cpu_inc_return
    taskstats: Use this_cpu_ops
    random: Use this_cpu_inc_return
    fs: Use this_cpu_inc_return in buffer.c
    highmem: Use this_cpu_xx_return() operations
    vmstat: Use this_cpu_inc_return for vm statistics
    x86: Support for this_cpu_add, sub, dec, inc_return
    percpu: Generic support for this_cpu_add, sub, dec, inc_return
    ...

    Fixed up conflicts in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
    as per Tejun.

    Linus Torvalds
     
  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
    usb: don't use flush_scheduled_work()
    speedtch: don't abuse struct delayed_work
    media/video: don't use flush_scheduled_work()
    media/video: explicitly flush request_module work
    ioc4: use static work_struct for ioc4_load_modules()
    init: don't call flush_scheduled_work() from do_initcalls()
    s390: don't use flush_scheduled_work()
    rtc: don't use flush_scheduled_work()
    mmc: update workqueue usages
    mfd: update workqueue usages
    dvb: don't use flush_scheduled_work()
    leds-wm8350: don't use flush_scheduled_work()
    mISDN: don't use flush_scheduled_work()
    macintosh/ams: don't use flush_scheduled_work()
    vmwgfx: don't use flush_scheduled_work()
    tpm: don't use flush_scheduled_work()
    sonypi: don't use flush_scheduled_work()
    hvsi: don't use flush_scheduled_work()
    xen: don't use flush_scheduled_work()
    gdrom: don't use flush_scheduled_work()
    ...

    Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
    as per Tejun.
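
    The recurring pattern in these conversions (a sketch, not any single
    driver's diff; the work items named here are hypothetical) is to flush
    or cancel the specific work the driver owns rather than the whole
    shared workqueue:

    /* Before: waits for *every* work item on the shared workqueue. */
    flush_scheduled_work();

    /* After: touch only this driver's own work items. */
    flush_work(&dev->irq_work);                /* wait for completion */
    cancel_delayed_work_sync(&dev->poll_work); /* or cancel outright */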

    Linus Torvalds
     

07 Jan, 2011

1 commit


17 Dec, 2010

1 commit

  • __get_cpu_var() can be replaced with this_cpu_read and will then use a
    single read instruction with implied address calculation to access the
    correct per cpu instance.

    However, the address of a per cpu variable passed to __this_cpu_read()
    cannot be determined (since it's an implied address conversion through
    segment prefixes). Therefore apply this only to uses of __get_cpu_var
    where the address of the variable is not used.
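
    A minimal sketch of the conversion (the per-cpu variable is
    hypothetical):

    DEFINE_PER_CPU(int, hits);

    /* Before: compute this CPU's address, then dereference it. */
    count = __get_cpu_var(hits);

    /* After: a single segment-prefixed read; no address materialized. */
    count = __this_cpu_read(hits);

    /* Not convertible: here the address itself is what's wanted. */
    ptr = &__get_cpu_var(hits);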

    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

15 Dec, 2010

1 commit

    cancel_rearming_delayed_work[queue]() has been superseded by
    cancel_delayed_work_sync() quite some time ago. Convert all the
    in-kernel users. The conversions are completely equivalent and
    trivial.
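
    The substitution is one-for-one, e.g. (work item names hypothetical):

    /* Before */
    cancel_rearming_delayed_work(&ctx->work);
    cancel_rearming_delayed_workqueue(wq, &ctx->work);

    /* After -- equivalent in both cases */
    cancel_delayed_work_sync(&ctx->work);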

    Signed-off-by: Tejun Heo
    Acked-by: "David S. Miller"
    Acked-by: Greg Kroah-Hartman
    Acked-by: Evgeniy Polyakov
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Mauro Carvalho Chehab
    Cc: netdev@vger.kernel.org
    Cc: Anton Vorontsov
    Cc: David Woodhouse
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Cc: Alex Elder
    Cc: xfs-masters@oss.sgi.com
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Andrew Morton
    Cc: netfilter-devel@vger.kernel.org
    Cc: Trond Myklebust
    Cc: linux-nfs@vger.kernel.org

    Tejun Heo
     

29 Nov, 2010

1 commit


27 Oct, 2010

1 commit

    Use the new {max,min}3 macros to save some cycles and bytes on the stack.
    This patch substitutes trivial nested macro calls with their
    three-argument counterparts.
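
    For example (variables hypothetical):

    /* Before: the nested call costs an extra temporary on the stack. */
    biggest = max(a, max(b, c));

    /* After */
    biggest = max3(a, b, c);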

    Signed-off-by: Hagen Paul Pfeifer
    Cc: Joe Perches
    Cc: Ingo Molnar
    Cc: Hartley Sweeten
    Cc: Russell King
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Herbert Xu
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hagen Paul Pfeifer
     

23 Aug, 2010

1 commit


10 Aug, 2010

1 commit


09 Aug, 2010

1 commit

  • This patch fixes alignment of slab objects in case CONFIG_DEBUG_PAGEALLOC is
    active.
    Before this spot in kmem_cache_create, we have this situation:

    - align contains the required alignment of the object
    - cachep->obj_offset is 0 or equals align in case of CONFIG_DEBUG_SLAB
    - size equals the size of the object, or object plus trailing redzone
      in case of CONFIG_DEBUG_SLAB

    This spot tries to fill one page per object if the object is within
    certain size limits; however, setting obj_offset to PAGE_SIZE - size
    breaks the object alignment, since size may not be a multiple of the
    required alignment. This patch simply adds an ALIGN(size, align) to
    the equation and fixes the object size detection accordingly.

    This code in drivers/s390/cio/qdio_setup_init has led to incorrectly
    aligned slab objects (sizeof(struct qdio_q) equals 1792):

    qdio_q_cache = kmem_cache_create("qdio_q", sizeof(struct qdio_q),
                                     256, 0, NULL);
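
    A sketch of the fix described above (simplified, not the verbatim
    diff):

    /* Before: breaks alignment whenever size % align != 0. */
    cachep->obj_offset += PAGE_SIZE - size;

    /* After: round up to the required alignment first. */
    cachep->obj_offset += PAGE_SIZE - ALIGN(size, align);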

    Acked-by: Christoph Lameter
    Signed-off-by: Carsten Otte
    Signed-off-by: Pekka Enberg

    Carsten Otte
     

07 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Allow removal of slab caches during boot
    Revert "slub: Allow removal of slab caches during boot"
    slub numa: Fix rare allocation from unexpected node
    slab: use deferrable timers for its periodic housekeeping
    slub: Use kmem_cache flags to detect if slab is in debugging mode.
    slub: Allow removal of slab caches during boot
    slub: Check kasprintf results in kmem_cache_init()
    SLUB: Constants need UL
    slub: Use a constant for an unspecified node.
    SLOB: Free objects to their own list
    slab: fix caller tracking on !CONFIG_DEBUG_SLAB && CONFIG_TRACING

    Linus Torvalds
     

20 Jul, 2010

1 commit

    Slab has a "once every 2 seconds" timer for its housekeeping.
    As the number of logical processors grows, it's more and more
    common that this 2 second timer becomes the primary wakeup source.

    This patch turns the housekeeping timer into a deferrable timer,
    which means that the timer does not interrupt idle, but just runs
    at the next event that wakes the cpu up.

    The impact is that the timer likely runs a bit later, but during the
    delay no code is running so there's not all that much reason for
    a difference in housekeeping to occur because of this delay.
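
    A sketch of the change (macro name per the workqueue API of this era;
    slab's reaper is a per-cpu delayed work item):

    /* Before: the reap work's timer fires even on an idle CPU. */
    INIT_DELAYED_WORK(reap_work, cache_reap);

    /* After: deferrable -- runs at the next wakeup that happens anyway. */
    INIT_DELAYED_WORK_DEFERRABLE(reap_work, cache_reap);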

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Pekka Enberg

    Arjan van de Ven
     

09 Jun, 2010

1 commit

  • We have been resisting new ftrace plugins and removing existing
    ones, and kmemtrace has been superseded by kmem trace events
    and perf-kmem, so we remove it.

    Signed-off-by: Li Zefan
    Acked-by: Pekka Enberg
    Acked-by: Eduard - Gabriel Munteanu
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    [ remove kmemtrace from the makefile, handle slob too ]
    Signed-off-by: Frederic Weisbecker

    Li Zefan
     

28 May, 2010

3 commits

  • Example usage of generic "numa_mem_id()":

    The mainline slab code, since ~2.6.19, does not handle memoryless nodes
    well. Specifically, the "fast path" -- ____cache_alloc() -- will never
    succeed, as slab doesn't cache off-node objects on the per-cpu queues,
    and for memoryless nodes, all memory will be "off node" relative to
    numa_node_id(). This adds significant overhead to all kmem cache
    allocations, incurring a significant regression relative to earlier
    kernels [from before slab.c was reorganized].

    This patch uses the generic topology function "numa_mem_id()" to return
    the "effective local memory node" for the calling context. This is the
    first node in the local node's generic fallback zonelist-- the same node
    that "local" mempolicy-based allocations would use. This lets slab cache
    these "local" allocations and avoid fallback/refill on every allocation.

    N.B.: Slab will need to handle node and memory hotplug events that could
    change the value returned by numa_mem_id() for any given node if recent
    changes to address memory hotplug don't already address this. E.g., flush
    all per cpu slab queues before rebuilding the zonelists while the
    "machine" is held in the stopped state.

    Performance impact on "hackbench 400 process 200"

    2.6.34-rc3-mmotm-100405-1609              no-patch    this-patch
    ia64, no memoryless nodes [avg of 10]:      11.713      11.637    (~0.65% diff)
    ia64, cpus all on memless nodes [10]:      228.259      26.484    (~8.6x speedup)

    The slowdown of the patched kernel from ~12 sec to ~28 seconds when
    configured with memoryless nodes is the result of all cpus allocating from
    a single node's mm pagepool. The cache lines of the single node are
    distributed/interleaved over the memory of the real physical nodes, but
    the zone lock, list heads, ... of the single node with memory still each
    live in a single cache line that is accessed from all processors.

    x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845
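
    In slab's allocation path the conversion is essentially (sketch):

    /* Before: on a memoryless node, every object is "off node". */
    int node = numa_node_id();

    /* After: the effective local memory node, i.e. the first node with
     * memory in the local node's generic fallback zonelist. */
    int node = numa_mem_id();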

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
    With the previous modification, a cpu notifier can return an
    encapsulated errno value. This converts the cpu notifiers for slab
    accordingly.

    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick test seems to confirm that this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.
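
    A sketch of the split (helper and field names follow the description
    above, simplified):

    /* Before: one rotor shared by page cache and slab spreading, so a
     * data page plus its buffer_head advance it twice per page. */
    node = cpuset_mem_spread_node();   /* uses cpuset_mem_spread_rotor */

    /* After: slab allocations get their own rotor. */
    node = cpuset_slab_spread_node();  /* uses cpuset_slab_spread_rotor */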

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     

25 May, 2010

1 commit

    Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing
    all old disallowed bits later. But along the way, the allocator may
    find that there is no node from which to allocate memory.

    The reason is that, when cpuset rebinds the task's mempolicy, it
    clears the nodes which the allocator can allocate pages on; for
    example:

    (mpol: mempolicy)

    task1                     task1's mpol    task2
    alloc page                1
      alloc on node0? NO      1
                              1               change mems from 1 to 0
                              1               rebind task1's mpol
                              0-1               set new bits
                              0                 clear disallowed bits
      alloc on node1? NO      0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). A variable tells the write-side task that a
    read-side task is reading the nodemask, and the write-side task clears
    the newly disallowed nodes only after the read-side task finishes its
    current memory allocation.
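
    A sketch of the ordering (simplified; the synchronization with
    read-side tasks is omitted):

    /* Step 1: grow -- old | new, so readers always see an allowed node. */
    nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);

    /* Step 2: wait until no read-side task is mid-allocation. */

    /* Step 3: shrink -- now drop the newly disallowed bits. */
    tsk->mems_allowed = *newmems;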

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

22 May, 2010

1 commit


20 May, 2010

1 commit


15 Apr, 2010

1 commit

    Even with SLAB_RED_ZONE and SLAB_STORE_USER enabled, the kernel would
    NOT store the redzone and last-user data around the allocated memory
    space if "arch cache line > sizeof(unsigned long long)". As a result,
    the last-user information is unexpectedly missing when dumping the
    slab corruption log.

    This fix makes sure that the redzone and last-user tags get stored
    unless the required alignment would break the redzone.

    Signed-off-by: Shiyong Li
    Signed-off-by: Pekka Enberg

    Shiyong Li
     

10 Apr, 2010

1 commit

    As suggested by Linus, introduce a kern_ptr_validate() helper that does
    some sanity checks to make sure a pointer is a valid kernel pointer.
    This is a preparatory step for fixing SLUB's kmem_ptr_validate().
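
    The checks are roughly these (a simplified sketch of the helper):

    int kern_ptr_validate(const void *ptr, unsigned long size)
    {
            unsigned long addr = (unsigned long)ptr;

            if (unlikely(addr < PAGE_OFFSET))
                    return 0;   /* below the kernel mapping */
            if (unlikely(addr > (unsigned long)high_memory - size))
                    return 0;   /* runs past the end of low memory */
            if (unlikely(addr & (sizeof(void *) - 1)))
                    return 0;   /* not pointer-aligned */
            if (unlikely(!kern_addr_valid(addr)) ||
                unlikely(!kern_addr_valid(addr + size - 1)))
                    return 0;   /* first or last byte not mapped */
            return 1;
    }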

    Cc: Andrew Morton
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Ingo Molnar
    Cc: Matt Mackall
    Cc: Nick Piggin
    Signed-off-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

08 Apr, 2010

1 commit

  • Slab lacks any memory hotplug support for nodes that are hotplugged
    without cpus being hotplugged. This is possible at least on x86
    CONFIG_MEMORY_HOTPLUG_SPARSE kernels where SRAT entries are marked
    ACPI_SRAT_MEM_HOT_PLUGGABLE and the regions of RAM represent a separate
    node. It can also be done manually by writing the start address to
    /sys/devices/system/memory/probe for kernels that have
    CONFIG_ARCH_MEMORY_PROBE set, which is how this patch was tested, and
    then onlining the new memory region.

    When a node is hotadded, a nodelist for that node is allocated and
    initialized for each slab cache. If this isn't completed due to a lack
    of memory, the hotadd is aborted: we have a reasonable expectation that
    kmalloc_node(nid) will work for all caches if nid is online and memory is
    available.

    Since nodelists must be allocated and initialized prior to the new node's
    memory actually being online, the struct kmem_list3 is allocated off-node
    due to kmalloc_node()'s fallback.

    When an entire node would be offlined, its nodelists are subsequently
    drained. If slab objects still exist and cannot be freed, the offline
    is aborted. Objects may still be allocated between this drain and page
    isolation, however, so the offline may nevertheless fail.

    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Pekka Enberg

    David Rientjes
     

29 Mar, 2010

1 commit



27 Feb, 2010

1 commit

    This patch allows injecting faults only for specific slabs.
    In order to preserve the default behavior, the cache filter is off by
    default (all caches are faulty).

    One may define specific set of slabs like this:
    # mark skbuff_head_cache as faulty
    echo 1 > /sys/kernel/slab/skbuff_head_cache/failslab
    # Turn on cache filter (off by default)
    echo 1 > /sys/kernel/debug/failslab/cache-filter
    # Turn on fault injection
    echo 1 > /sys/kernel/debug/failslab/times
    echo 1 > /sys/kernel/debug/failslab/probability

    Acked-by: David Rientjes
    Acked-by: Akinobu Mita
    Acked-by: Christoph Lameter
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Pekka Enberg

    Dmitry Monakhov
     

30 Jan, 2010

1 commit

  • When factoring common code into transfer_objects in commit 3ded175 ("slab: add
    transfer_objects() function"), the 'touched' logic got a bit broken. When
    refilling from the shared array (taking objects from the shared array), we are
    making use of the shared array so it should be marked as touched.

    Subsequently pulling an element from the cpu array and allocating it should
    also touch the cpu array, but that is taken care of after the alloc_done label.
    (So yes, the cpu array was getting touched = 1 twice).

    So revert this logic to how it worked in earlier kernels.

    This also affects the behaviour in __drain_alien_cache, which would previously
    'touch' the shared array and now does not. I think it is more logical not to
    touch there, because we are pushing objects into the shared array rather than
    pulling them off. So there is no good reason to postpone reaping them -- if the
    shared array is getting utilized, then it will get 'touched' in the alloc path
    (where this patch now restores the touch).
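
    A sketch of the restored refill path (simplified from
    cache_alloc_refill()):

    /* Refill from the shared array: we are using it, so mark it
     * touched to keep the reaper from draining it. */
    if (shared_array->avail &&
        transfer_objects(ac, shared_array, batchcount)) {
            shared_array->touched = 1;
            goto alloc_done;
    }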

    Acked-by: Christoph Lameter
    Signed-off-by: Nick Piggin
    Signed-off-by: Pekka Enberg

    Nick Piggin
     

12 Jan, 2010

1 commit


29 Dec, 2009

1 commit

  • Commit ce79ddc8e2376a9a93c7d42daf89bfcbb9187e62 ("SLAB: Fix lockdep annotations
    for CPU hotplug") broke init_node_lock_keys() off-slab logic which causes
    lockdep false positives.

    Fix that up by reverting the logic back to original while keeping CPU hotplug
    fixes intact.

    Reported-and-tested-by: Heiko Carstens
    Reported-and-tested-by: Andi Kleen
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

18 Dec, 2009

2 commits

  • * 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    cpumask: rename tsk_cpumask to tsk_cpus_allowed
    cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt
    cpumask: avoid dereferencing struct cpumask
    cpumask: convert drivers/idle/i7300_idle.c to cpumask_var_t
    cpumask: use modern cpumask style in drivers/scsi/fcoe/fcoe.c
    cpumask: avoid deprecated function in mm/slab.c
    cpumask: use cpu_online in kernel/perf_event.c

    Linus Torvalds
     
  • * 'kmemleak' of git://linux-arm.org/linux-2.6:
    kmemleak: fix kconfig for crc32 build error
    kmemleak: Reduce the false positives by checking for modified objects
    kmemleak: Show the age of an unreferenced object
    kmemleak: Release the object lock before calling put_object()
    kmemleak: Scan the _ftrace_events section in modules
    kmemleak: Simplify the kmemleak_scan_area() function prototype
    kmemleak: Do not use off-slab management with SLAB_NOLEAKTRACE

    Linus Torvalds