07 May, 2013

2 commits

  • Pull slab changes from Pekka Enberg:
    "The bulk of the changes are more slab unification from Christoph.

    There's also a few fixes from Aaron, Glauber, and Joonsoo thrown into
    the mix."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (24 commits)
    mm, slab_common: Fix bootstrap creation of kmalloc caches
    slab: Return NULL for oversized allocations
    mm: slab: Verify the nodeid passed to ____cache_alloc_node
    slub: tid must be retrieved from the percpu area of the current processor
    slub: Do not dereference NULL pointer in node_match
    slub: add 'likely' macro to inc_slabs_node()
    slub: correct to calculate num of acquired objects in get_partial_node()
    slub: correctly bootstrap boot caches
    mm/sl[au]b: correct allocation type check in kmalloc_slab()
    slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sections
    slab: Handle ARCH_DMA_MINALIGN correctly
    slab: Common definition for kmem_cache_node
    slab: Rename list3/l3 to node
    slab: Common Kmalloc cache determination
    stat: Use size_t for sizes instead of unsigned
    slab: Common function to create the kmalloc array
    slab: Common definition for the array of kmalloc caches
    slab: Common constants for kmalloc boundaries
    slab: Rename nodelists to node
    slab: Common name for the per node structures
    ...

    Linus Torvalds
     
  • Pekka Enberg
     

05 Apr, 2013

2 commits

  • As Steven Rostedt has pointed out: rescheduling could occur on a
    different processor after the determination of the per cpu pointer and
    before the tid is retrieved. This could result in allocation from the
    wrong node in slab_alloc().

    The effect is much more severe in slab_free() where we could free to the
    freelist of the wrong page.

    The window for something like that occurring is pretty small but it is
    possible.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
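    Below is a minimal user-space sketch of the pattern the fix relies on,
    using C11 atomics as a stand-in for the kernel's per-cpu cmpxchg (struct
    cpu_slab and alloc_fastpath are invented names, not the kernel code): the
    transaction id and the freelist snapshot must come from the same CPU, and
    the compare-and-exchange on the tid is what detects that the task was
    migrated or preempted between the two reads and forces a retry.

        #include <stdatomic.h>
        #include <stdio.h>

        struct cpu_slab {                        /* stand-in for kmem_cache_cpu */
            _Atomic unsigned long tid;           /* per-cpu transaction id      */
            void *freelist;                      /* per-cpu free object list    */
        };

        /* fastpath sketch: retry until tid and freelist were read consistently */
        static void *alloc_fastpath(struct cpu_slab *c)
        {
            unsigned long tid;
            void *object;

            do {
                tid = atomic_load(&c->tid);      /* must be read on the same cpu  */
                object = c->freelist;            /* ... as this freelist snapshot */
            } while (!atomic_compare_exchange_weak(&c->tid, &tid, tid + 1));

            return object;
        }

        int main(void)
        {
            struct cpu_slab c = { .tid = 0, .freelist = NULL };
            void *obj = alloc_fastpath(&c);

            printf("object=%p tid=%lu\n", obj, atomic_load(&c.tid));
            return 0;
        }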
     
  • The variables accessed in slab_alloc are volatile and therefore
    the page pointer passed to node_match can be NULL. The processing
    of data in slab_alloc is tentative until either the cmpxchg
    succeeds or the __slab_alloc slowpath is invoked. Both are
    able to perform the same allocation from the freelist.

    Check for the NULL pointer in node_match.

    A false positive will lead to a retry of the loop in __slab_alloc.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
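    A minimal user-space sketch of the defensive check (the struct and helper
    below are simplified stand-ins, not the kernel definitions): a NULL page is
    treated as "no match", so the fastpath falls back to the retry/slowpath
    instead of dereferencing the speculatively read pointer.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>

        #define NUMA_NO_NODE (-1)

        struct page { int nid; };                /* stand-in for struct page */

        static bool node_match(const struct page *page, int node)
        {
            if (page == NULL)                    /* speculative read may be NULL */
                return false;                    /* false positive -> retry      */
            return node == NUMA_NO_NODE || page->nid == node;
        }

        int main(void)
        {
            struct page p = { .nid = 0 };

            printf("%d %d %d\n",
                   node_match(NULL, 0),          /* 0: forces a retry */
                   node_match(&p, 0),            /* 1: node matches   */
                   node_match(&p, 1));           /* 0: wrong node     */
            return 0;
        }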
     

02 Apr, 2013

2 commits

  • After the boot phase, 'n' always exists,
    so add the 'likely' macro to help the compiler.

    Acked-by: Christoph Lameter
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
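    For reference, a standalone sketch of how such a branch hint is usually
    spelled on GCC and Clang (the macro definitions below are illustrative,
    not copied from the kernel headers, and inc_counter is a made-up stand-in
    for inc_slabs_node()):

        #include <stdio.h>

        #define likely(x)   __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

        /* after the boot phase the node structure is expected to exist */
        static long inc_counter(long *n, long objects)
        {
            if (likely(n != NULL))               /* hint: almost always taken */
                *n += objects;
            return n ? *n : 0;
        }

        int main(void)
        {
            long node_total = 0;

            printf("%ld\n", inc_counter(&node_total, 1));  /* common case    */
            printf("%ld\n", inc_counter(NULL, 1));         /* boot-time case */
            return 0;
        }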
     
  • There is a subtle bug when calculating the number of acquired objects.

    Currently, we calculate "available = page->objects - page->inuse"
    after acquire_slab() is called in get_partial_node().

    In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
    So,

        acquire_slab(s, n, page, object == NULL);

        if (!object) {
                c->page = page;
                stat(s, ALLOC_FROM_PARTIAL);
                object = t;
                available = page->objects - page->inuse;

                !!! available is always 0 !!!
        ...

    Therefore, "available > s->cpu_partial / 2" is always false and
    we always go to the second iteration.
    This patch corrects the problem.

    After that, we no longer need the return value of put_cpu_partial(),
    so remove it.

    Reviewed-by: Wanpeng Li
    Acked-by: Christoph Lameter
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
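    A tiny model of the ordering bug and its fix (all types and names below
    are stand-ins chosen for the sketch, not the kernel structures): the count
    of free objects has to be taken from the page before the acquire step
    marks every object as in use, and reported back to the caller.

        #include <stdio.h>

        struct page_model { int objects; int inuse; };

        /* models acquire_slab() with mode = 1: claims the whole page and
         * reports how many objects were actually available to the caller */
        static int acquire_all(struct page_model *page, int *acquired)
        {
            *acquired = page->objects - page->inuse;  /* snapshot BEFORE update      */
            page->inuse = page->objects;              /* everything is now "in use"  */
            return *acquired > 0;
        }

        int main(void)
        {
            struct page_model page = { .objects = 16, .inuse = 3 };
            int available = 0;

            acquire_all(&page, &available);
            /* the buggy order recomputed objects - inuse here and always got 0 */
            printf("available = %d (buggy recompute = %d)\n",
                   available, page.objects - page.inuse);
            return 0;
        }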
     

28 Feb, 2013

1 commit

  • After we create a boot cache, we may allocate from it until it is bootstrapped.
    This will move the page from the partial list to the cpu slab list. If this
    happens, the loop:

        list_for_each_entry(p, &n->partial, lru)

    that we use to scan for all partial pages will yield nothing, and the pages
    will keep pointing to the boot cpu cache, which is of course invalid. To fix
    that, we should flush the cache to make sure that the cpu slab is back on the
    partial list.

    Signed-off-by: Glauber Costa
    Reported-by: Steffen Michalke
    Tested-by: KAMEZAWA Hiroyuki
    Acked-by: Christoph Lameter
    Cc: Andrew Morton
    Cc: Tejun Heo
    Signed-off-by: Pekka Enberg

    Glauber Costa
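    A small user-space model of that fixup (the types and function names are
    invented for the sketch and only mirror the description above): the page
    parked as the per-cpu slab is pushed back onto the partial list before the
    bootstrap loop repoints every partial page at the final cache.

        #include <stdio.h>
        #include <stddef.h>

        struct kc;                                    /* stand-in kmem_cache */
        struct pg { struct pg *next; struct kc *cache; };
        struct kc { struct pg *partial; struct pg *cpu_slab; };

        static void flush_cpu_slab(struct kc *c)      /* cpu slab -> partial list */
        {
            if (c->cpu_slab) {
                c->cpu_slab->next = c->partial;
                c->partial = c->cpu_slab;
                c->cpu_slab = NULL;
            }
        }

        static void bootstrap_fixup(struct kc *boot, struct kc *final)
        {
            flush_cpu_slab(boot);                     /* without this, the page below
                                                         is never visited and keeps
                                                         pointing at the boot cache */
            for (struct pg *p = boot->partial; p; p = p->next)
                p->cache = final;
        }

        int main(void)
        {
            struct kc boot = { 0 }, final = { 0 };
            struct pg page = { .next = NULL, .cache = &boot };

            boot.cpu_slab = &page;                    /* an early allocation moved it */
            bootstrap_fixup(&boot, &final);
            printf("page now belongs to the final cache: %s\n",
                   page.cache == &final ? "yes" : "no");
            return 0;
        }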
     

26 Feb, 2013

1 commit

  • Pull module update from Rusty Russell:
    "The sweeping change is to make add_taint() explicitly indicate whether
    to disable lockdep, but it's a mechanical change."

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    MODSIGN: Add option to not sign modules during modules_install
    MODSIGN: Add -s option to sign-file
    MODSIGN: Specify the hash algorithm on sign-file command line
    MODSIGN: Simplify Makefile with a Kconfig helper
    module: clean up load_module a little more.
    modpost: Ignore ARC specific non-alloc sections
    module: constify within_module_*
    taint: add explicit flag to show whether lock dep is still OK.
    module: printk message when module signature fail taints kernel.

    Linus Torvalds
     

24 Feb, 2013

1 commit

  • The function names page_xchg_last_nid(), page_last_nid() and
    reset_page_last_nid() were judged to be inconsistent so rename them to a
    struct_field_op style pattern. As it looked jarring to have
    reset_page_mapcount() and page_nid_reset_last() beside each other in
    memmap_init_zone(), this patch also renames reset_page_mapcount() to
    page_mapcount_reset(). There are others like init_page_count() but as
    it is used throughout the arch code a rename would likely cause more
    conflicts than it is worth.

    [akpm@linux-foundation.org: fix zcache]
    Signed-off-by: Mel Gorman
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     


19 Dec, 2012

8 commits

  • Sasha Levin recently reported a lockdep problem resulting from the new
    attribute propagation introduced by the kmemcg series. In short, slab_mutex
    ends up being taken from within the sysfs attribute store function. This
    creates a dependency that is later held backwards when a cache is
    destroyed, since destruction occurs with slab_mutex held and then calls
    into the sysfs directory removal function.

    In this patch, I propose to adopt a strategy close to what
    __kmem_cache_create does before calling sysfs_slab_add, and release the
    lock before the call to sysfs_slab_remove. This is pretty much the last
    operation in the kmem_cache_shutdown() path, so we could do better by
    splitting this and moving this call alone to later on. This will fit
    nicely when sysfs handling is consistent between all caches, but will look
    weird now.

    Lockdep info:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
    -------------------------------------------------------
    trinity-child13/6961 is trying to acquire lock:
    (s_active#43){++++.+}, at: sysfs_addrm_finish+0x31/0x60

    but task is already holding lock:
    (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:
    -> #1 (slab_mutex){+.+.+.}:
    lock_acquire+0x1aa/0x240
    __mutex_lock_common+0x59/0x5a0
    mutex_lock_nested+0x3f/0x50
    slab_attr_store+0xde/0x110
    sysfs_write_file+0xfa/0x150
    vfs_write+0xb0/0x180
    sys_pwrite64+0x60/0xb0
    tracesys+0xe1/0xe6
    -> #0 (s_active#43){++++.+}:
    __lock_acquire+0x14df/0x1ca0
    lock_acquire+0x1aa/0x240
    sysfs_deactivate+0x122/0x1a0
    sysfs_addrm_finish+0x31/0x60
    sysfs_remove_dir+0x89/0xd0
    kobject_del+0x16/0x40
    __kmem_cache_shutdown+0x40/0x60
    kmem_cache_destroy+0x40/0xe0
    mon_text_release+0x78/0xe0
    __fput+0x122/0x2d0
    ____fput+0x9/0x10
    task_work_run+0xbe/0x100
    do_exit+0x432/0xbd0
    do_group_exit+0x84/0xd0
    get_signal_to_deliver+0x81d/0x930
    do_signal+0x3a/0x950
    do_notify_resume+0x3e/0x90
    int_signal+0x12/0x17

    other info that might help us debug this:

    Possible unsafe locking scenario:

           CPU0                        CPU1
           ----                        ----
      lock(slab_mutex);
                                  lock(s_active#43);
                                  lock(slab_mutex);
      lock(s_active#43);

    *** DEADLOCK ***

    2 locks held by trinity-child13/6961:
    #0: (mon_lock){+.+.+.}, at: mon_text_release+0x25/0xe0
    #1: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0

    stack backtrace:
    Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
    Call Trace:
    print_circular_bug+0x1fb/0x20c
    __lock_acquire+0x14df/0x1ca0
    lock_acquire+0x1aa/0x240
    sysfs_deactivate+0x122/0x1a0
    sysfs_addrm_finish+0x31/0x60
    sysfs_remove_dir+0x89/0xd0
    kobject_del+0x16/0x40
    __kmem_cache_shutdown+0x40/0x60
    kmem_cache_destroy+0x40/0xe0
    mon_text_release+0x78/0xe0
    __fput+0x122/0x2d0
    ____fput+0x9/0x10
    task_work_run+0xbe/0x100
    do_exit+0x432/0xbd0
    do_group_exit+0x84/0xd0
    get_signal_to_deliver+0x81d/0x930
    do_signal+0x3a/0x950
    do_notify_resume+0x3e/0x90
    int_signal+0x12/0x17

    Signed-off-by: Glauber Costa
    Reported-by: Sasha Levin
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
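    A condensed user-space sketch of the ordering change (a pthread mutex
    stands in for slab_mutex, and both function bodies are placeholders): the
    sysfs removal, which can wait on attribute writers that themselves take
    slab_mutex, is called only after the mutex has been dropped, so the two
    lock classes are no longer nested on this path.

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t slab_mutex = PTHREAD_MUTEX_INITIALIZER;

        static void sysfs_slab_remove(const char *name)
        {
            /* may block until concurrent attribute stores (which take
             * slab_mutex themselves) finish; must not hold slab_mutex here */
            printf("removing sysfs dir for %s\n", name);
        }

        static void kmem_cache_destroy_model(const char *name)
        {
            pthread_mutex_lock(&slab_mutex);
            printf("unlinking cache %s\n", name);   /* shutdown work under the lock */
            pthread_mutex_unlock(&slab_mutex);      /* drop the lock first ...      */

            sysfs_slab_remove(name);                /* ... then touch sysfs         */
        }

        int main(void)
        {
            kmem_cache_destroy_model("dentry");
            return 0;
        }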
     
  • This patch clarifies two aspects of cache attribute propagation.

    First, the expected context for the for_each_memcg_cache macro in
    memcontrol.h. The usages already in the codebase are safe. In mm/slub.c,
    it is trivially safe because the lock is acquired right before the loop.
    In mm/slab.c, it is less so: the lock is acquired by an outer function a
    few steps back in the stack, so a VM_BUG_ON() is added to make sure it is
    indeed safe.

    A comment is also added to detail why we are returning the value of the
    parent cache and ignoring the children's when we propagate the attributes.

    Signed-off-by: Glauber Costa
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • SLUB allows us to tune a particular cache behavior with sysfs-based
    tunables. When creating a new memcg cache copy, we'd like to preserve any
    tunables the parent cache already had.

    This can be done by tapping into the store attribute function provided by
    the allocator. We of course don't need to mess with read-only fields.
    Since the attributes can have multiple types and are stored internally by
    sysfs, the best strategy is to issue a ->show() in the root cache, and
    then ->store() in the memcg cache.

    The drawback of that is that sysfs can allocate up to a page in buffering
    for show(), which we are likely not to need, but also can't guarantee. To
    avoid always allocating a page for that, we can update the caches at store
    time with the maximum attribute size ever stored to the root cache. We
    will then get a buffer big enough to hold it. The corollary to this is
    that if no stores happened, nothing will be propagated.

    It can also happen that a root cache has its tunables updated during
    normal system operation. In this case, we will propagate the change to
    all caches that are already active.

    [akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
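    A compact sketch of the propagation scheme (the attribute table, the
    buffer size and every name below are invented for illustration): each
    tunable is read from the root cache through its ->show() and replayed into
    the new child cache through its ->store(), using a buffer sized for the
    largest value that was ever stored.

        #include <stdio.h>

        struct cache { int cpu_partial; };

        struct attr {
            const char *name;
            int  (*show)(struct cache *c, char *buf, size_t len);
            void (*store)(struct cache *c, const char *buf);
        };

        static int cpu_partial_show(struct cache *c, char *buf, size_t len)
        {
            return snprintf(buf, len, "%d", c->cpu_partial);
        }

        static void cpu_partial_store(struct cache *c, const char *buf)
        {
            sscanf(buf, "%d", &c->cpu_partial);
        }

        static struct attr attrs[] = {
            { "cpu_partial", cpu_partial_show, cpu_partial_store },
        };

        /* copy every tunable of the root cache into a newly created child */
        static void propagate(struct cache *root, struct cache *child)
        {
            char buf[64];                        /* stands in for the max-size buffer */

            for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
                attrs[i].show(root, buf, sizeof(buf));
                attrs[i].store(child, buf);
            }
        }

        int main(void)
        {
            struct cache root = { .cpu_partial = 30 }, child = { 0 };

            propagate(&root, &child);
            printf("child cpu_partial = %d\n", child.cpu_partial);
            return 0;
        }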
     
  • Implement destruction of memcg caches. Right now, only caches where our
    reference counter is the last remaining are deleted. If there are any
    other reference counters around, we just leave the caches lying around
    until they go away.

    When that happens, a destruction function is called from the cache code.
    Caches are only destroyed in process context, so we queue them up for
    later processing in the general case.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
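    A stripped-down model of the deferral (names invented; the kernel would
    use deferred work, modelled here as a simple pending list drained by a
    later call): a cache that cannot be destroyed directly in the current
    context is queued up and destroyed later from process context.

        #include <stdio.h>
        #include <stddef.h>

        struct cache { const char *name; struct cache *next_pending; };

        static struct cache *pending;            /* caches queued for destruction */

        static void schedule_destroy(struct cache *c)
        {
            c->next_pending = pending;           /* queue it ...                  */
            pending = c;
        }

        static void process_pending(void)        /* ... and destroy it later,     */
        {                                        /* from process context          */
            while (pending) {
                struct cache *c = pending;
                pending = c->next_pending;
                printf("destroying cache %s\n", c->name);
            }
        }

        int main(void)
        {
            struct cache c = { .name = "memcg-dentry", .next_pending = NULL };

            schedule_destroy(&c);
            process_pending();
            return 0;
        }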
     
  • We are able to match a cache allocation to a particular memcg. If the
    task doesn't change groups during the allocation itself (a rare event),
    this will give us a good picture of which group is the first to touch a
    cache page.

    This patch uses the now available infrastructure by calling
    memcg_kmem_get_cache() before all the cache allocations.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
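    In sketch form (a user-space model; memcg_kmem_get_cache() is the function
    named above, everything else is invented): before each allocation, the
    requested cache is translated into the per-memcg copy for the current
    task's group, so the accounting lands on the group that first touches the
    page.

        #include <stdio.h>

        struct cache { const char *name; struct cache *memcg_copy; };

        /* stand-in for memcg_kmem_get_cache(): pick the per-memcg copy if any */
        static struct cache *memcg_kmem_get_cache(struct cache *c, int in_memcg)
        {
            if (in_memcg && c->memcg_copy)
                return c->memcg_copy;
            return c;
        }

        static void *slab_alloc_model(struct cache *c, int in_memcg)
        {
            c = memcg_kmem_get_cache(c, in_memcg);   /* hook runs before every alloc */
            printf("allocating from %s\n", c->name);
            return NULL;                             /* allocation itself not modelled */
        }

        int main(void)
        {
            struct cache copy = { "dentry(memcg:1)", NULL };
            struct cache root = { "dentry", &copy };

            slab_alloc_model(&root, 0);   /* root group -> dentry            */
            slab_alloc_model(&root, 1);   /* memcg task -> dentry(memcg:1)   */
            return 0;
        }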
     
  • struct page already has this information. If we start chaining caches,
    this information will always be more trustworthy than whatever is passed
    into the function.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
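    A small model of the idea (names invented for the sketch; the kernel
    derives the page from the object, which is modelled here as an explicit
    page argument): the cache recorded in the page is trusted over the one
    passed in by the caller, and a mismatch is reported.

        #include <stdio.h>

        struct kmem_cache_model { const char *name; };
        struct page_model { struct kmem_cache_model *slab_cache; };

        static struct kmem_cache_model *
        cache_from_page(struct page_model *page, struct kmem_cache_model *claimed)
        {
            if (page->slab_cache != claimed)
                fprintf(stderr, "object freed to the wrong cache: %s != %s\n",
                        claimed->name, page->slab_cache->name);
            return page->slab_cache;             /* always believe the page */
        }

        int main(void)
        {
            struct kmem_cache_model a = { "cache-a" }, b = { "cache-b" };
            struct page_model page = { .slab_cache = &a };

            printf("resolved: %s\n", cache_from_page(&page, &b)->name);
            return 0;
        }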
     
  • Allow a memcg parameter to be passed during cache creation. When the slub
    allocator is being used, it will only merge caches that belong to the same
    memcg. We'll do this by scanning the global list, and then translating
    the cache to a memcg-specific cache.

    A default function is created as a wrapper, passing NULL to the memcg
    version. We only merge caches that belong to the same memcg.

    A helper, memcg_css_id, is provided because slub needs a unique cache name
    for sysfs. Since this is visible, but not the canonical location for slab
    data, the cache name is not used; the css_id should suffice.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Pull SLAB changes from Pekka Enberg:
    "This contains preparational work from Christoph Lameter and Glauber
    Costa for SLAB memcg and cleanups and improvements from Ezequiel
    Garcia and Joonsoo Kim.

    Please note that the SLOB cleanup commit from Arnd Bergmann already
    appears in your tree but I had also merged it myself which is why it
    shows up in the shortlog."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm/sl[aou]b: Common alignment code
    slab: Use the new create_boot_cache function to simplify bootstrap
    slub: Use statically allocated kmem_cache boot structure for bootstrap
    mm, sl[au]b: create common functions for boot slab creation
    slab: Simplify bootstrap
    slub: Use correct cpu_slab on dead cpu
    mm: fix slab.c kernel-doc warnings
    mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN
    slab: Ignore internal flags in cache creation
    mm/slob: Use free_page instead of put_page for page-size kmalloc allocations
    mm/sl[aou]b: Move common kmem_cache_size() to slab.h
    mm/slob: Use object_size field in kmem_cache_size()
    mm/slob: Drop usage of page->private for storing page-sized allocations
    slub: Commonize slab_cache field in struct page
    sl[au]b: Process slabinfo_show in common code
    mm/sl[au]b: Move print_slabinfo_header to slab_common.c
    mm/sl[au]b: Move slabinfo processing to slab_common.c
    slub: remove one code path and reduce lock contention in __slab_free()

    Linus Torvalds
     

12 Dec, 2012

1 commit

  • SLUB only focuses on the nodes which have normal memory, and it ignores
    hot-adding and hot-removing on the other nodes.

    That is: if some memory of a node which has no onlined memory is onlined,
    but this newly onlined memory is not normal memory (for example, highmem),
    we should not allocate a kmem_cache_node for SLUB.

    And if the last normal memory is offlined, but the node still has memory,
    we should remove the kmem_cache_node for that node. (The current code
    delays this until all of the node's memory is offlined.)

    So we only do something when marg->status_change_nid_normal > 0.
    marg->status_change_nid is not suitable here.

    The same problem doesn't exist in SLAB, because SLAB allocates a kmem_list3
    for every node even if the node doesn't have normal memory; SLAB tolerates
    kmem_list3 on alien nodes. SLUB only focuses on the nodes which have
    normal memory and doesn't tolerate alien kmem_cache_node. The patch makes
    SLUB self-consistent and avoids WARNs and BUGs in rare conditions.

    Signed-off-by: Lai Jiangshan
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Yasuaki Ishimatsu
    Cc: Rob Landley
    Cc: Andrew Morton
    Cc: Jiang Liu
    Cc: Kay Sievers
    Cc: Greg Kroah-Hartman
    Cc: Mel Gorman
    Cc: Wen Congyang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
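    A reduced user-space model of the check (struct memory_notify is cut down
    and NID_NONE is a made-up sentinel; the real field semantics may differ in
    detail): the callback keys off status_change_nid_normal, so highmem-only
    onlining and offlining is ignored.

        #include <stdio.h>

        #define NID_NONE (-1)

        struct memory_notify {              /* reduced stand-in for the kernel struct  */
            int status_change_nid;          /* set for any memory change on the node   */
            int status_change_nid_normal;   /* set only when normal memory is affected */
        };

        static void slab_memory_callback(const struct memory_notify *marg, int online)
        {
            int nid = marg->status_change_nid_normal;

            if (nid == NID_NONE)            /* highmem-only change: nothing to do */
                return;

            if (online)
                printf("allocate kmem_cache_node for node %d\n", nid);
            else
                printf("free kmem_cache_node for node %d\n", nid);
        }

        int main(void)
        {
            struct memory_notify highmem_only = { .status_change_nid = 1,
                                                  .status_change_nid_normal = NID_NONE };
            struct memory_notify normal      = { .status_change_nid = 1,
                                                  .status_change_nid_normal = 1 };

            slab_memory_callback(&highmem_only, 1);   /* ignored   */
            slab_memory_callback(&normal, 1);         /* allocates */
            return 0;
        }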
     


31 Oct, 2012

2 commits

  • Some flags are used internally by the allocators for management
    purposes. One example of that is the CFLGS_OFF_SLAB flag that slab uses
    to mark that the metadata for that cache is stored outside of the slab.

    No cache should ever pass those as creation flags. We can just ignore
    this bit if it happens to be passed (such as when duplicating a cache in
    the kmem memcg patches).

    Because such flags can vary from allocator to allocator, we allow each
    allocator to make its own decision here, defining SLAB_AVAILABLE_FLAGS
    with all flags that are valid at creation time. Allocators that don't
    have any specific flag requirements should define it to mean all flags.

    Common code will mask out all flags not belonging to that set.

    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Glauber Costa
    Signed-off-by: Pekka Enberg

    Glauber Costa
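    The mechanism boils down to a single mask applied in common code; a tiny
    sketch (the flag values are made up, and only CFLGS_OFF_SLAB and
    SLAB_AVAILABLE_FLAGS are names taken from the text above):

        #include <stdio.h>

        #define SLAB_HWCACHE_ALIGN   0x0001u   /* example creation-time flag */
        #define SLAB_PANIC           0x0002u   /* example creation-time flag */
        #define CFLGS_OFF_SLAB       0x8000u   /* internal management flag   */

        /* what an allocator would expose as its set of valid creation flags */
        #define SLAB_AVAILABLE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_PANIC)

        int main(void)
        {
            unsigned int requested = SLAB_PANIC | CFLGS_OFF_SLAB;
            unsigned int accepted  = requested & SLAB_AVAILABLE_FLAGS; /* drop internals */

            printf("requested %#x, accepted %#x\n", requested, accepted);
            return 0;
        }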
     
  • This function is identically defined in all three allocators,
    so it's trivial to move it to slab.h.

    Since it is now a static, inline, header-defined function,
    this patch also drops the EXPORT_SYMBOL tag.

    Cc: Pekka Enberg
    Cc: Matt Mackall
    Acked-by: Christoph Lameter
    Signed-off-by: Ezequiel Garcia
    Signed-off-by: Pekka Enberg

    Ezequiel Garcia
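    In practice the function shrinks to a one-line static inline in the shared
    header; a hedged approximation of its shape (struct layout simplified for
    the sketch):

        #include <stdio.h>

        struct kmem_cache { unsigned int object_size; };

        /* header-defined and inline, so no EXPORT_SYMBOL is needed anymore */
        static inline unsigned int kmem_cache_size(struct kmem_cache *s)
        {
            return s->object_size;
        }

        int main(void)
        {
            struct kmem_cache s = { .object_size = 192 };

            printf("%u\n", kmem_cache_size(&s));
            return 0;
        }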
     

24 Oct, 2012

5 commits

  • Right now, slab and slub have fields in struct page to derive which
    cache a page belongs to, but they do it slightly differently.

    slab uses a field called slab_cache, which lives in the third double
    word. slub uses a field called "slab", living outside of the
    doubleword area.

    Ideally, we could use the same field for this. Since slub heavily makes
    use of the doubleword region, there isn't really much room to move
    slub's slab_cache field around. Since slab does not have such strict
    placement restrictions, we can move it outside the doubleword area.

    The naming used by slab, "slab_cache", is less confusing, and it is
    preferred over slub's generic "slab".

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     
  • Pekka Enberg
     
  • With all the infrastructure in place, we can now have slabinfo_show
    done from slab_common.c. A cache-specific function is called to grab
    information about the cache itself, since that is still heavily
    dependent on the implementation. But with the values produced by it, all
    the printing and handling is done from common code.

    Signed-off-by: Glauber Costa
    CC: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
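    In outline (a user-space sketch with invented names; the split mirrors the
    description above, not the exact kernel interfaces): common code owns the
    printing, and each allocator only fills in a small info structure for its
    cache.

        #include <stdio.h>

        struct slabinfo { unsigned long active_objs, num_objs; unsigned int objsize; };

        /* the allocator-specific part: fill in the numbers for one cache */
        static void get_slabinfo(const char *name, struct slabinfo *si)
        {
            (void)name;
            si->active_objs = 100;
            si->num_objs = 128;
            si->objsize = 192;
        }

        /* the common part: formatting and output live in one place */
        static void slabinfo_show(const char *name)
        {
            struct slabinfo si;

            get_slabinfo(name, &si);
            printf("%-17s %6lu %6lu %6u\n",
                   name, si.active_objs, si.num_objs, si.objsize);
        }

        int main(void)
        {
            slabinfo_show("kmalloc-192");
            return 0;
        }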
     
  • The header format is highly similar between slab and slub. The main
    difference lies in the fact that slab may optionally have statistics
    added here in the case of CONFIG_SLAB_DEBUG, while slub sticks them
    somewhere else.

    By making sure that information conditionally lives inside a
    globally-visible CONFIG_DEBUG_SLAB switch, we can move the header
    printing to a common location.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     
  • This patch moves all the common machinery for slabinfo processing
    to slab_common.c. We can do better by noticing that the output is
    largely common and having the allocators just provide the finished
    information. But that is easier to do after this first step.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     

19 Oct, 2012

1 commit

  • When we try to free an object, there are some cases where we need
    to take the node lock. This is a necessary step for preventing a race.
    After taking the lock, we try cmpxchg_double_slab().
    But there is a possible scenario in which cmpxchg_double_slab() fails
    even though we took the lock. The following example explains it.

        CPU A               CPU B
        need lock
        ...                 need lock
        ...                 lock!!
        lock..but spin      free success
        spin...             unlock
        lock!!
        free fail

    In this case, CPU A ends up retrying while holding the lock.
    I think that in this case, for CPU A,
    "release the lock first, and re-take it if necessary" is the preferable way.

    There are two reasons for this.

    First, this makes __slab_free()'s logic somewhat simpler.
    With this patch, 'was_frozen = 1' is "always" handled without taking a lock,
    so we can remove one code path.

    Second, it may reduce lock contention.
    When we retry, the status of the slab has already changed,
    so we don't need the lock anymore in almost every case.
    The "release the lock first, and re-take it if necessary" policy
    helps with this.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
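    A user-space sketch of the "release the lock first, re-take it only if
    necessary" policy (a pthread mutex stands in for the node's list_lock, and
    the failing cmpxchg is reduced to a canned retry):

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

        /* models cmpxchg_double_slab(): can fail even after the lock was taken */
        static bool try_free(int attempt)
        {
            return attempt > 0;                  /* first try fails, retry succeeds */
        }

        static void slab_free_model(void)
        {
            bool need_list_lock = true;

            for (int attempt = 0; ; attempt++) {
                if (need_list_lock)
                    pthread_mutex_lock(&list_lock);

                bool ok = try_free(attempt);

                if (need_list_lock)
                    pthread_mutex_unlock(&list_lock);   /* release first ... */

                if (ok)
                    break;
                /* ... and re-take only if needed: after a failed cmpxchg the
                 * slab state has usually changed, so most retries can run
                 * without the lock at all */
                need_list_lock = false;
            }
            printf("freed; the lock was not re-taken for the retry\n");
        }

        int main(void)
        {
            slab_free_model();
            return 0;
        }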
     
