07 Oct, 2012

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "New and noteworthy:

    * More SLAB allocator unification patches from Christoph Lameter and
    others. This paves the way for slab memcg patches that hopefully
    will land in v3.8.

    * SLAB tracing improvements from Ezequiel Garcia.

    * Kernel tainting upon SLAB corruption from Dave Jones.

    * Miscellaneous SLAB allocator bug fixes and improvements from various
    people."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (43 commits)
    slab: Fix build failure in __kmem_cache_create()
    slub: init_kmem_cache_cpus() and put_cpu_partial() can be static
    mm/slab: Fix kmem_cache_alloc_node_trace() declaration
    Revert "mm/slab: Fix kmem_cache_alloc_node_trace() declaration"
    mm, slob: fix build breakage in __kmalloc_node_track_caller
    mm/slab: Fix kmem_cache_alloc_node_trace() declaration
    mm/slab: Fix typo _RET_IP -> _RET_IP_
    mm, slub: Rename slab_alloc() -> slab_alloc_node() to match SLAB
    mm, slab: Rename __cache_alloc() -> slab_alloc()
    mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototype
    mm, slab: Replace 'caller' type, void* -> unsigned long
    mm, slob: Add support for kmalloc_track_caller()
    mm, slab: Remove silly function slab_buffer_size()
    mm, slob: Use NUMA_NO_NODE instead of -1
    mm, sl[au]b: Taint kernel when we detect a corrupted slab
    slab: Only define slab_error for DEBUG
    slab: fix the DEADLOCK issue on l3 alien lock
    slub: Zero initial memory segment for kmem_cache and kmem_cache_node
    Revert "mm/sl[aou]b: Move sysfs_slab_add to common"
    mm/sl[aou]b: Move kmem_cache refcounting to common code
    ...

    Linus Torvalds
     

03 Oct, 2012

5 commits

  • Pekka Enberg
     
  • Fix up a trivial conflict with NUMA_NO_NODE cleanups.

    Conflicts:
    mm/slob.c

    Signed-off-by: Pekka Enberg

    Pekka Enberg
     
  • Pekka Enberg
     
  • Fix build failure with CONFIG_DEBUG_SLAB=y && CONFIG_DEBUG_PAGEALLOC=y caused
    by commit 8a13a4cc "mm/sl[aou]b: Shrink __kmem_cache_create() parameter lists".

    mm/slab.c: In function '__kmem_cache_create':
    mm/slab.c:2474: error: 'align' undeclared (first use in this function)
    mm/slab.c:2474: error: (Each undeclared identifier is reported only once
    mm/slab.c:2474: error: for each function it appears in.)
    make[1]: *** [mm/slab.o] Error 1
    make: *** [mm] Error 2

    Acked-by: Christoph Lameter
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Pekka Enberg

    Tetsuo Handa
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky, leading to a confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use the new IRQ-safe timer, and cancelation now works as
    expected.

    * Another deficiency of delayed_work was the lack of a counterpart to
    mod_timer(), which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behavior like a timer that is executed in process context.
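
    A minimal sketch of the new interface (illustrative; the handler, the
    delayed_work instance and the delay below are hypothetical, not part of
    this pull request):

    #include <linux/workqueue.h>

    static void my_timeout_fn(struct work_struct *work)
    {
            /* hypothetical handler body */
    }

    static DECLARE_DELAYED_WORK(my_dwork, my_timeout_fn);

    static void rearm_timeout(unsigned long delay_jiffies)
    {
            /* Previously this was open-coded as cancel_delayed_work() +
             * queue_delayed_work(); mod_delayed_work() does the equivalent
             * in one call, mirroring mod_timer(). */
            mod_delayed_work(system_wq, &my_dwork, delay_jiffies);
    }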

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs, and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter, and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.
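
    A short sketch of what the stronger guarantee means in practice
    (illustrative; the work item is assumed to be set up elsewhere with
    INIT_WORK()):

    #include <linux/workqueue.h>

    static struct work_struct my_work;      /* initialized with INIT_WORK() */

    static void stop_and_wait(void)
    {
            /* With every workqueue non-reentrant, flush_work() is as strong
             * as flush_work_sync() used to be: on return, any previously
             * queued execution of my_work has finished, on whichever CPU
             * it ran. */
            flush_work(&my_work);           /* was: flush_work_sync(&my_work) */
    }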

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

19 Sep, 2012

2 commits

  • It doesn't seem worth adding a new taint flag for this, so just re-use
    the one from 'bad page'.
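
    A rough sketch of the idea, assuming it hooks into the allocators'
    existing corruption reporters (the reporting function below is
    hypothetical; only add_taint() and TAINT_BAD_PAGE come from the kernel):

    #include <linux/kernel.h>       /* add_taint(), TAINT_BAD_PAGE, pr_err() */

    static void report_slab_corruption(const char *reason)
    {
            pr_err("slab corruption: %s\n", reason);
            /* Reuse the 'bad page' taint rather than defining a new flag,
             * so subsequent oops reports show the memory-corruption taint. */
            add_taint(TAINT_BAD_PAGE);
    }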

    Acked-by: Christoph Lameter # SLUB
    Acked-by: David Rientjes
    Signed-off-by: Dave Jones
    Signed-off-by: Pekka Enberg

    Dave Jones
     
  • On Tue, 11 Sep 2012, Stephen Rothwell wrote:
    > After merging the final tree, today's linux-next build (sparc64 defconfig)
    > produced this warning:
    >
    > mm/slab.c:808:13: warning: '__slab_error' defined but not used [-Wunused-function]
    >
    > Introduced by commit 945cf2b6199b ("mm/sl[aou]b: Extract a common
    > function for kmem_cache_destroy"). All uses of slab_error() are now
    > guarded by DEBUG.

    There is no use case left for slab builds without DEBUG.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

18 Sep, 2012

2 commits

  • In the array cache, there is an object at index 0; check it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Right now, we call ClearSlabPfmemalloc() for the first page of a slab when
    we clear the SlabPfmemalloc flag. This is fine for most swap-over-network
    use cases as it is expected that order-0 pages are in use. Unfortunately
    it is possible that __ac_put_obj() checks SlabPfmemalloc on a tail
    page, and while this is harmless, it is sloppy. This patch ensures that
    the head page is always used.

    This problem was originally identified by Joonsoo Kim.
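
    A minimal sketch of the resulting shape, assuming the check lives in
    __ac_put_obj() and using the helper names from the PFMEMALLOC series
    (treat the exact bodies as illustrative):

    static void __ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
                             void *objp)
    {
            /* Resolve the head page explicitly: for a high-order slab, objp
             * may point into a tail page, but SlabPfmemalloc is only tracked
             * on the head page. */
            struct page *page = virt_to_head_page(objp);

            if (unlikely(PageSlabPfmemalloc(page)))
                    set_obj_pfmemalloc(&objp);

            ac->entry[ac->avail++] = objp;
    }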

    [js1304@gmail.com: Original implementation and problem identification]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Sep, 2012

1 commit

  • A DEADLOCK will be reported while running a kernel with NUMA and LOCKDEP
    enabled; the sequence behind this false report is:

    kmem_cache_free() //free obj in cachep
    -> cache_free_alien() //acquire cachep's l3 alien lock
    -> __drain_alien_cache()
    -> free_block()
    -> slab_destroy()
    -> kmem_cache_free() //free slab in cachep->slabp_cache
    -> cache_free_alien() //acquire cachep->slabp_cache's l3 alien lock

    Since cachep's and cachep->slabp_cache's l3 alien locks are in the same
    lock class, a false report is generated.

    This should not happen, since we already have init_lock_keys(), which
    reassigns the lock class for both the l3 list and l3 alien locks.

    However, init_lock_keys() was invoked at the wrong position: before we
    invoke enable_cpucache() on each cache.

    Until slab_state is set to FULL, we do not invoke enable_cpucache() on
    caches to build their l3 alien structures while creating them. So
    although init_lock_keys() was invoked, it could not change the l3 alien
    lock class: those locks do not exist until enable_cpucache() is invoked
    later.

    This patch invokes init_lock_keys() after enable_cpucache() is done,
    instead of before, to avoid the false DEADLOCK report.
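
    Roughly, the reordering in kmem_cache_init_late() looks like this (a
    sketch of the idea, not the literal diff; error handling and the cpu
    notifier setup are omitted):

    void __init kmem_cache_init_late(void)
    {
            struct kmem_cache *cachep;

            slab_state = UP;

            /* Resize the head arrays to their final sizes; this is also
             * where enable_cpucache() builds the l3 alien caches. */
            mutex_lock(&slab_mutex);
            list_for_each_entry(cachep, &slab_caches, list)
                    if (enable_cpucache(cachep, GFP_NOWAIT))
                            BUG();
            mutex_unlock(&slab_mutex);

            /* Annotate the locks for lockdep only now, after the alien
             * caches exist, so they actually get reclassed. */
            init_lock_keys();

            /* Done! */
            slab_state = FULL;
    }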

    Michael traced the problem back to a commit in release 3.0.0:

    commit 30765b92ada267c5395fc788623cb15233276f5c
    Author: Peter Zijlstra
    Date: Thu Jul 28 23:22:56 2011 +0200

    slab, lockdep: Annotate the locks before using them

    Fernando found we hit the regular OFF_SLAB 'recursion' before we
    annotate the locks, cure this.

    The relevant portion of the stack-trace:

    > [ 0.000000] [] rt_spin_lock+0x50/0x56
    > [ 0.000000] [] __cache_free+0x43/0xc3
    > [ 0.000000] [] kmem_cache_free+0x6c/0xdc
    > [ 0.000000] [] slab_destroy+0x4f/0x53
    > [ 0.000000] [] free_block+0x94/0xc1
    > [ 0.000000] [] do_tune_cpucache+0x10b/0x2bb
    > [ 0.000000] [] enable_cpucache+0x7b/0xa7
    > [ 0.000000] [] kmem_cache_init_late+0x1f/0x61
    > [ 0.000000] [] start_kernel+0x24c/0x363
    > [ 0.000000] [] i386_start_kernel+0xa9/0xaf

    Reported-by: Fernando Lopez-Lezcano
    Acked-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
    Signed-off-by: Ingo Molnar

    The commit moved init_lock_keys() to before we build up the alien caches,
    so we failed to reclass them.

    Cc: # 3.0+
    Acked-by: Christoph Lameter
    Tested-by: Paul E. McKenney
    Signed-off-by: Michael Wang
    Signed-off-by: Pekka Enberg

    Michael Wang
     

30 Aug, 2012

1 commit

  • cache_grow() can re-enable irqs, so the cpu (and node) can change; ensure
    that we take list_lock on the correct nodelist.

    This fixes an issue with commit 072bb0aa5e06 ("mm: sl[au]b: add
    knowledge of PFMEMALLOC reserve pages") where list_lock for the wrong
    node was taken after growing the cache.
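
    In cache_alloc_refill()-style code the shape of the fix is roughly the
    following fragment (illustrative, not the exact diff; the surrounding
    variables come from that function):

            x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);

            /* cache_grow() may have re-enabled interrupts, so the per-cpu
             * array cache and the local node may both have changed:
             * re-read them before looking up a nodelist and taking its
             * list_lock. */
            ac = cpu_cache_get(cachep);
            node = numa_mem_id();
            l3 = cachep->nodelists[node];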

    Reported-and-tested-by: Haggai Eran
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

16 Aug, 2012

1 commit

  • page_get_cache() does not need to call compound_head(), as its unique
    caller virt_to_slab() already makes sure to return a head page.

    Additionally, removing the compound_head() call makes page_get_cache()
    consistent with page_get_slab().
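
    The resulting helper is tiny; roughly (the slab_cache field name reflects
    that era's struct page and is shown for illustration):

    static inline struct kmem_cache *page_get_cache(struct page *page)
    {
            /* No compound_head() here: virt_to_slab(), the only caller,
             * already passes the head page, matching page_get_slab(). */
            BUG_ON(!PageSlab(page));
            return page->slab_cache;
    }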

    Signed-off-by: Michel Lespinasse
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Pekka Enberg

    Michel Lespinasse
     

01 Aug, 2012

3 commits

  • Getting and putting objects in SLAB currently requires a function call but
    the bulk of the work is related to PFMEMALLOC reserves which are only
    consumed when network-backed storage is critical. Use an inline function
    to determine if the function call is required.
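
    A sketch of the pattern (the wrapper and helper names follow the
    PFMEMALLOC series; treat the exact bodies as illustrative):

    static inline void *ac_get_obj(struct kmem_cache *cachep,
                                   struct array_cache *ac, gfp_t flags,
                                   bool force_refill)
    {
            /* Fast path: with no memalloc sockets active there is nothing
             * PFMEMALLOC-related to check, so skip the out-of-line helper. */
            if (unlikely(sk_memalloc_socks()))
                    return __ac_get_obj(cachep, ac, flags, force_refill);

            return ac->entry[--ac->avail];
    }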

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
    like PF_MEMALLOC. It allows one to pass along the memalloc state in
    object related allocation flags as opposed to task related flags, such as
    sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC as callers
    using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now
    enough to identify allocations related to page reclaim.
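
    A minimal usage sketch (the receive-refill context is hypothetical; the
    flags themselves are real):

    #include <linux/gfp.h>

    /* A driver refilling RX buffers on behalf of a SOCK_MEMALLOC socket can
     * mark the allocation itself instead of relying on task state: */
    static struct page *rx_alloc_page(void)
    {
            return alloc_page(GFP_ATOMIC | __GFP_MEMALLOC);
    }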

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When a user or administrator requires swap for their application, they
    create a swap partition and file, format it with mkswap and activate it
    with swapon. Swap over the network is considered an option in diskless
    systems. The two likely scenarios are when blade servers are used as part
    of a cluster where the form factor or maintenance costs do not allow the
    use of disks, and when thin clients are used.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap according to the manual at
    https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
    There are also documentation and tutorials on how to set up swap over NBD
    at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
    The nbd-client documentation also covers the use of NBD as swap. Despite
    this, the fact
    is that a machine using NBD for swap can deadlock within minutes if swap
    is used intensively. This patch series addresses the problem.

    The core issue is that network block devices do not use mempools like
    normal block devices do. As the host cannot control where they receive
    packets from, they cannot reliably work out in advance how much memory
    they might need. Some years ago, Peter Zijlstra developed a series of
    patches that supported swap over an NFS that at least one distribution is
    carrying within their kernels. This patch series borrows very heavily
    from Peter's work to support swapping over NBD as a pre-requisite to
    supporting swap-over-NFS. The bulk of the complexity is concerned with
    preserving memory that is allocated from the PFMEMALLOC reserves for use
    by the network layer which is needed for both NBD and NFS.

    Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
    preserve access to pages allocated under low memory situations
    to callers that are freeing memory.

    Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

    Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
    reserves without setting PFMEMALLOC.

    Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
    for later use by network packet processing.

    Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

    Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

    Patches 7-12 allows network processing to use PFMEMALLOC reserves when
    the socket has been marked as being used by the VM to clean pages. If
    packets are received and stored in pages that were allocated under
    low-memory situations and are unrelated to the VM, the packets
    are dropped.

    Patch 11 reintroduces __skb_alloc_page, which the networking
    folk may object to but is needed in some cases to propagate
    pfmemalloc from a newly allocated page to an skb. If there is a
    strong objection, this patch can be dropped, with the impact being
    that swap-over-network will be slower in some cases but it should
    not fail.

    Patch 13 is a micro-optimisation to avoid a function call in the
    common case.

    Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
    PFMEMALLOC if necessary.

    Patch 15 notes that it is still possible for the PFMEMALLOC reserve
    to be depleted. To prevent this, direct reclaimers get throttled on
    a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
    expected that kswapd and the direct reclaimers already running
    will clean enough pages for the low watermark to be reached and
    the throttled processes are woken up.

    Patch 16 adds a statistic to track how often processes get throttled

    Some basic performance testing was run using kernel builds, netperf on
    loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
    sysbench. Each of them was expected to use the sl*b allocators
    reasonably heavily, but there did not appear to be significant performance
    variances.

    For testing swap-over-NBD, a machine was booted with 2G of RAM with a
    swapfile backed by NBD. 8*NUM_CPU processes were started that create
    anonymous memory mappings and read them linearly in a loop. The total
    size of the mappings was 4*PHYSICAL_MEMORY, to use swap heavily under
    memory pressure.

    Without the patches and using SLUB, the machine locks up within minutes;
    with them applied, it runs to completion. With SLAB, the story is
    different, as an unpatched kernel runs to completion. However, the
    patched kernel completed the test 45% faster.

    MICRO
                                              3.5.0-rc2    3.5.0-rc2
                                                vanilla      swapnbd
    Unrecognised test vmscan-anon-mmap-write
    MMTests Statistics: duration
    Sys Time Running Test (seconds)              197.80       173.07
    User+Sys Time Running Test (seconds)         206.96       182.03
    Total Elapsed Time (seconds)                3240.70      1762.09

    This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

    Allocations of pages below the min watermark run a risk of the machine
    hanging due to a lack of memory. To prevent this, only callers who have
    PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
    allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
    a slab though, nothing prevents other callers consuming free objects
    within those slabs. This patch limits access to slab pages that were
    alloced from the PFMEMALLOC reserves.

    When this patch is applied, pages allocated from below the low watermark
    are returned with page->pfmemalloc set and it is up to the caller to
    determine how the page should be protected. SLAB restricts access to any
    page with page->pfmemalloc set to callers which are known to be able to
    access the PFMEMALLOC reserve. If one is not available, an attempt is
    made to allocate a new page rather than use a reserve. SLUB is a bit more
    relaxed in that it only records if the current per-CPU page was allocated
    from PFMEMALLOC reserve and uses another partial slab if the caller does
    not have the necessary GFP or process flags. This was found to be
    sufficient in tests to avoid hangs due to SLUB generally maintaining
    smaller lists than SLAB.

    In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
    a slab allocation even though free objects are available because they are
    being preserved for callers that are freeing pages.
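
    The gatekeeping idea can be sketched as follows (close to the SLUB-side
    helper in this series; treat it as illustrative rather than the exact
    code):

    static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
    {
            /* An object from a pfmemalloc slab page is only handed out if
             * the caller itself is entitled to the reserves; everyone else
             * must use (or allocate) a non-pfmemalloc slab instead. */
            if (unlikely(PageSlabPfmemalloc(page)))
                    return gfp_pfmemalloc_allowed(gfpflags);

            return true;
    }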

    [a.p.zijlstra@chello.nl: Original implementation]
    [sebastian@breakpoint.cc: Correct order of page flag clearing]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Jul, 2012

4 commits

  • Move the mutex handling into the common kmem_cache_create()
    function.

    Then we can also move more checks out of SLAB's kmem_cache_create()
    into the common code.
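
    After the move, the common entry point has roughly this shape (a sketch
    under the pre-8a13a4cc __kmem_cache_create() signature; error paths are
    omitted):

    struct kmem_cache *kmem_cache_create(const char *name, size_t size,
                                         size_t align, unsigned long flags,
                                         void (*ctor)(void *))
    {
            struct kmem_cache *s;

            get_online_cpus();
            mutex_lock(&slab_mutex);

            /* The allocator-specific part (SLAB/SLUB/SLOB) now runs under
             * the common slab_mutex instead of taking its own lock. */
            s = __kmem_cache_create(name, size, align, flags, ctor);

            mutex_unlock(&slab_mutex);
            put_online_cpus();

            return s;
    }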

    Reviewed-by: Glauber Costa
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Use the mutex definition from SLAB and make it the common way to take a sleeping lock.

    This has the effect of using a mutex instead of a rw semaphore for SLUB.

    SLOB gains the use of a mutex for kmem_cache_create() serialization.
    This is not needed now, but SLOB may acquire some more features later
    (like slabinfo / sysfs support) through the expansion of the common code
    that will need this.

    Reviewed-by: Glauber Costa
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • All allocators have some sort of support for the bootstrap status.

    Setup a common definition for the boot states and make all slab
    allocators use that definition.
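
    The shared definition amounts to an enum along these lines (as recalled
    from mm/slab.h of that period; treat the intermediate SLAB/SLUB-specific
    states as approximate):

    /* Common bootstrap states for the sl[aou]b allocators */
    enum slab_state {
            DOWN,                   /* No slab functionality yet */
            PARTIAL,                /* SLUB: kmem_cache_node works */
            PARTIAL_ARRAYCACHE,     /* SLAB: kmalloc size for array cache works */
            PARTIAL_L3,             /* SLAB: kmalloc size for l3 struct works */
            UP,                     /* Slab caches usable, but not everything */
            FULL                    /* Everything is working */
    };

    extern enum slab_state slab_state;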

    Reviewed-by: Glauber Costa
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • kmem_cache_create() does a variety of sanity checks, but those vary
    depending on the allocator. Use the strictest tests and put them into
    a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM.

    This patch has the effect of adding sanity checks for SLUB and SLOB
    under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM.
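
    The common checks amount to something like the following sketch (the
    helper name, message and limits are from memory and shown for
    illustration):

    #ifdef CONFIG_DEBUG_VM
    static int kmem_cache_sanity_check(const char *name, size_t size)
    {
            /* The strictest checks from the individual allocators, now
             * applied to SLAB, SLUB and SLOB alike under CONFIG_DEBUG_VM. */
            if (!name || in_interrupt() || size < sizeof(void *) ||
                size > KMALLOC_MAX_SIZE) {
                    pr_err("kmem_cache_create(%s) integrity check failed\n",
                           name);
                    return -EINVAL;
            }
            return 0;
    }
    #endif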

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

02 Jul, 2012

3 commits

  • During kmem_cache_init_late(), we transition to the LATE state,
    and after some more work, to the FULL state, its last state.

    This is quite different from slub, which will only transition to
    its last state (previously SYSFS) in a (late)initcall, after a lot
    more of the kernel is ready.

    This means that in slab, we have no way of taking actions dependent
    on the initialization of other pieces of the kernel that are supposed
    to start well after kmem_cache_init_late(), such as cgroups
    initialization.

    To achieve more consistency in this behavior, this patch only
    transitions to the UP state in kmem_cache_init_late(). In my analysis,
    setup_cpu_cache() should be happy to test for >= UP instead of
    == FULL. It has also passed some tests I've made.

    We then only mark the FULL state after the reap timers are in place,
    meaning that no further setup is expected.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     
  • Commit 8c138b, which currently sits only in Pekka's tree and linux-next,
    tries to replace obj_size(cachep) with cachep->object_size, but has a typo
    in kmem_cache_free(), using "size" instead of "object_size", which causes
    some regressions.

    Reported-and-tested-by: Fengguang Wu
    Signed-off-by: Feng Tang
    Cc: Christoph Lameter
    Acked-by: Glauber Costa
    Signed-off-by: Pekka Enberg

    Feng Tang
     
  • Commit 3b0efdf ("mm, sl[aou]b: Extract common fields from struct
    kmem_cache") renamed the kmem_cache structure's "next" field to "list"
    but forgot to update one instance in leaks_show().

    Signed-off-by: Thierry Reding
    Signed-off-by: Pekka Enberg

    Thierry Reding