10 Oct, 2014

3 commits

  • Slab merge is a good feature for reducing fragmentation. Currently it is
    only applied to SLUB, but it would be good to apply it to SLAB as well.
    This patch is a preparation step for applying slab merge to SLAB by
    commonizing the slab merge logic.
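
    For illustration, the kind of compatibility check that slab merging relies
    on looks roughly like this (a simplified sketch, not the kernel's actual
    find_mergeable() code; the helper name is made up):

        /* Sketch: two caches may share backing slabs only if their
         * layout-relevant properties match. */
        static bool caches_mergeable(const struct kmem_cache *a,
                                     const struct kmem_cache *b)
        {
                if (a->ctor || b->ctor)                 /* ctors forbid merging */
                        return false;
                if (a->object_size != b->object_size)   /* same object size */
                        return false;
                return a->align == b->align;            /* and same alignment */
        }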

    Signed-off-by: Joonsoo Kim
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Update the SLUB code to search for partial slabs on the nearest node with
    memory in the presence of memoryless nodes. Additionally, do not consider
    it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a
    memoryless-node specified allocation goes off-node.
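
    A minimal sketch of the node-selection idea described above (illustrative
    only; the helper name is made up, and the real patch touches SLUB's
    get_partial()/ALLOC_NODE_MISMATCH handling):

        /* Sketch: pick a node that actually has memory before searching
         * its partial list. */
        static int partial_search_node(int node)
        {
                if (node == NUMA_NO_NODE)
                        return numa_mem_id();           /* nearest node with memory */
                if (!node_present_pages(node))
                        return node_to_mem_node(node);  /* requested node is memoryless */
                return node;
        }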

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Han Pingtian
    Cc: Pekka Enberg
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Christoph Lameter
    Cc: Wanpeng Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Tracing of mergeable slabs, as well as use of failslab on them, is
    confusing since the objects of multiple slab caches will be affected.
    Moreover, this creates a situation where a mergeable slab becomes
    unmergeable.

    If tracing or failslab testing is desired then it may be best to switch
    merging off for starters.
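
    One plausible shape of the resulting guard (a hedged sketch, not the
    literal patch; the sysfs store handler is simplified):

        /* Sketch: a merged cache (refcount > 1) backs several logical caches,
         * so flipping SLAB_TRACE on it would affect all of them. */
        static ssize_t trace_store(struct kmem_cache *s, const char *buf,
                                   size_t length)
        {
                if (s->refcount > 1)
                        return -EINVAL;

                if (buf[0] == '1')
                        s->flags |= SLAB_TRACE;
                else
                        s->flags &= ~SLAB_TRACE;
                return length;
        }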

    Signed-off-by: Christoph Lameter
    Tested-by: WANG Chao
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

07 Aug, 2014

8 commits

  • This function is never called for memcg caches, because they are
    unmergeable, so remove the dead code.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We mark some slab caches (e.g. kmem_cache_node) as unmergeable by
    setting their refcount to -1; their alias count should therefore be 0,
    not refcount-1, so correct it here.
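
    The described correction, sketched against SLUB's sysfs alias attribute
    (treat the handler as approximate, not the exact diff):

        /* Sketch: an unmergeable cache (refcount < 0) has no aliases. */
        static ssize_t aliases_show(struct kmem_cache *s, char *buf)
        {
                return sprintf(buf, "%d\n",
                               s->refcount < 0 ? 0 : s->refcount - 1);
        }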

    Signed-off-by: Gu Zheng
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     
  • The return statement goes with the cmpxchg_double() condition so it needs
    to be indented another tab.

    Also these days the fashion is to line function parameters up, and it
    looks nicer that way because then the "freelist_new" is not at the same
    indent level as the "return 1;".
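
    For illustration, the layout being described looks like this (a sketch in
    the style of the surrounding SLUB code, not the exact diff):

        if (cmpxchg_double(&page->freelist, &page->counters,
                           freelist_old, counters_old,
                           freelist_new, counters_new))
                return 1;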

    Signed-off-by: Dan Carpenter
    Signed-off-by: Pekka Enberg
    Signed-off-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • When a kmem_cache is created with a ctor, each object in the kmem_cache
    is initialized before it is ready to use. In the SLUB implementation,
    however, the first object is initialized twice.

    This patch removes the duplicated initialization of the first
    object.

    Fixes commit 7656c72b ("SLUB: add macros for scanning objects in a slab").

    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • There are two versions of alloc/free hooks now - one for
    CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.

    I see no reason why calls to other debugging subsystems (LOCKDEP,
    DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
    All these features should work regardless of the SLUB_DEBUG config, as
    all of them already have their own Kconfig options.

    This also fixes failslab for the CONFIG_SLUB_DEBUG=n configuration. It
    simply did not work before, because the should_failslab() call was in a
    hook hidden under "#ifdef CONFIG_SLUB_DEBUG #else".

    Note: There is one concealed change in allocation path for SLUB_DEBUG=n
    and all other debugging features disabled. The might_sleep_if() call
    can generate some code even if DEBUG_ATOMIC_SLEEP=n. For
    PREEMPT_VOLUNTARY=y, might_sleep() inserts a _cond_resched() call, but I
    think it should be ok.
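
    A hedged sketch of the unified allocation hook this describes (helper
    names and the era's should_failslab() signature are assumed):

        static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
        {
                flags &= gfp_allowed_mask;
                lockdep_trace_alloc(flags);             /* LOCKDEP */
                might_sleep_if(flags & __GFP_WAIT);     /* DEBUG_ATOMIC_SLEEP */

                /* FAILSLAB works even with CONFIG_SLUB_DEBUG=n */
                return should_failslab(s->object_size, flags, s->flags);
        }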

    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • resiliency_test() is only called for bootstrap, so it may be moved to
    init.text and freed after boot.
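
    In other words (a trivial sketch): annotating the function with __init
    places it in init text, which the kernel discards once boot completes.

        static void __init resiliency_test(void)
        {
                /* self-test body unchanged */
        }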

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Make use of the new node functions in mm/slab.h to reduce code size and
    simplify.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Lameter
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The patchset provides two new functions in mm/slab.h and modifies SLAB
    and SLUB to use these. The kmem_cache_node structure is shared between
    both allocators and the use of common accessors will allow us to move
    more code into slab_common.c in the future.

    This patch (of 3):

    These functions make it possible to eliminate repeatedly used code in
    both SLAB and SLUB and also allow for the insertion of debugging code
    that may be needed in the development process.
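
    The accessors in question look roughly like this (reconstructed from the
    description; close to, but not guaranteed to match, the mm/slab.h code):

        static inline struct kmem_cache_node *get_node(struct kmem_cache *s,
                                                       int node)
        {
                return s->node[node];
        }

        /* Iterate over every node that has a kmem_cache_node instance. */
        #define for_each_kmem_cache_node(__s, __node, __n)              \
                for (__node = 0; __node < nr_node_ids; __node++)        \
                        if ((__n = get_node(__s, __node)))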

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Acked-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

04 Jul, 2014

1 commit

  • min_partial is the minimum number of slabs cached on a node's partial
    list. So, if nr_partial is less than it, we keep a newly empty slab on the
    node partial list rather than freeing it. But if nr_partial is equal to or
    greater than it, we have enough partial slabs and should free the newly
    empty slab. The current implementation missed the equal case, so if
    min_partial is set to 0, at least one slab can still remain cached. This
    is a critical problem for the kmemcg destruction logic, which does not
    work properly if any slabs are cached. This patch fixes the problem.

    Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs
    immediately").

    Signed-off-by: Joonsoo Kim
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Jun, 2014

1 commit

  • Currently, if the allocation constraint is NUMA_NO_NODE, we search for a
    partial slab on the numa_node_id() node. This doesn't work properly on a
    system with memoryless nodes, since such a node has no memory and
    therefore cannot have any partial slabs.

    On such a node, page allocation always falls back to numa_mem_id() first.
    So searching for a partial slab on numa_mem_id() in that case is the
    proper solution for the memoryless node case.
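
    The change described amounts to a one-line adjustment in the partial-slab
    search; roughly (the exact surrounding context is assumed):

        /* before */ int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
        /* after  */ int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;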

    Signed-off-by: Joonsoo Kim
    Acked-by: Nishanth Aravamudan
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Wanpeng Li
    Cc: Han Pingtian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

10 commits

  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().
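
    A generic example of the conversion (with a hypothetical per-cpu variable;
    not tied to any particular call site in the patch):

        struct my_stats { unsigned long events; };
        static DEFINE_PER_CPU(struct my_stats, my_stats);

        /* Caller is assumed to have preemption disabled. */
        static void bump_event_count(void)
        {
                struct my_stats *p;

                /* old style: p = &__get_cpu_var(my_stats); */
                p = this_cpu_ptr(&my_stats);    /* new style */
                p->events++;
        }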

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently we have two pairs of kmemcg-related functions that are called on
    slab alloc/free. The first is memcg_{bind,release}_pages that count the
    total number of pages allocated on a kmem cache. The second is
    memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
    counter. Let's just merge them to keep the code clean.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When we create a sl[au]b cache, we allocate kmem_cache_node structures
    for each online NUMA node. To handle nodes taken online/offline, we
    register memory hotplug notifier and allocate/free kmem_cache_node
    corresponding to the node that changes its state for each kmem cache.

    To synchronize between the two paths we hold the slab_mutex during both
    the cache creation/destruction path and while tuning per-node parts of
    kmem caches in the memory hotplug handler, but that's not quite right,
    because it does not guarantee that a newly created cache will have all
    kmem_cache_nodes initialized in case it races with memory hotplug. For
    instance, in the case of slub:

    CPU0                                  CPU1
    ----                                  ----
    kmem_cache_create:                    online_pages:
     __kmem_cache_create:                  slab_memory_callback:
                                            slab_mem_going_online_callback:
                                             lock slab_mutex
                                             for each slab_caches list entry
                                                 allocate kmem_cache node
                                             unlock slab_mutex
      lock slab_mutex
      init_kmem_cache_nodes:
       for_each_node_state(node, N_NORMAL_MEMORY)
        allocate kmem_cache node
      add kmem_cache to slab_caches list
      unlock slab_mutex
                                          online_pages (continued):
                                           node_states_set_node

    As a result we'll get a kmem cache with not all kmem_cache_nodes
    allocated.

    To avoid issues like that we should hold get/put_online_mems() during
    the whole kmem cache creation/destruction/shrink paths, just like we
    deal with cpu hotplug. This patch does the trick.

    Note that after this is applied, there is no need to take the slab_mutex
    for kmem_cache_shrink any more, so it is removed from there.
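
    A sketch of the resulting pattern for one of these paths, per the
    description above (close to, but not guaranteed to match, the final
    slab_common.c code):

        int kmem_cache_shrink(struct kmem_cache *cachep)
        {
                int ret;

                get_online_cpus();
                get_online_mems();      /* stable cpu/node state, no slab_mutex */
                ret = __kmem_cache_shrink(cachep);
                put_online_mems();
                put_online_cpus();
                return ret;
        }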

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of race as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves for the purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations to locking order, which are not
    desirable, and can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except they block not cpu, but memory hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of online nodes
    mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules on it, which narrows down the set of its
    usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing a code
    inside a get/put_online_mems section will guarantee a stable value of
    online nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently to allocate a page that should be charged to kmemcg (e.g.
    threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
    allocated is then to be freed by free_memcg_kmem_pages. Apart from
    looking asymmetrical, this also requires intrusion into the general
    allocation path. So let's introduce separate functions that will
    alloc/free pages charged to kmemcg.

    The new functions are called alloc_kmem_pages and free_kmem_pages. They
    should be used when the caller actually would like to use kmalloc, but
    has to fall back to the page allocator because the allocation is large.
    They only differ from alloc_pages and free_pages in that besides
    allocating or freeing pages they also charge them to the kmem resource
    counter of the current memory cgroup.
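
    A usage sketch based on the description (the signatures are assumed to
    mirror alloc_pages()/free_pages(); the wrapper names here are made up):

        /* Large kmalloc-style allocation charged to the current memcg. */
        static void *big_kmalloc_sketch(gfp_t flags, unsigned int order)
        {
                struct page *page = alloc_kmem_pages(flags, order);

                return page ? page_address(page) : NULL;
        }

        static void big_kfree_sketch(void *ptr, unsigned int order)
        {
                free_kmem_pages((unsigned long)ptr, order);
        }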

    [sfr@canb.auug.org.au: export kmalloc_order() to modules]
    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to explicitly charge kmem there. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from memcg caches' allocflags. Instead it makes the slab allocation path
    call memcg_charge_kmem directly, getting the memcg to charge from the
    cache's memcg params.

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment to memcg_kmem_get_cache.

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
    got bumped in that exit path. Now there are two, and a bunch of gotos.
    ALLOC_SLOWPATH can now get set more than once during a single call to
    __slab_alloc() which is pretty bogus. Here's the sequence:

    1. Enter __slab_alloc(), fall through all the way to the
    stat(s, ALLOC_SLOWPATH);
    2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
    new_slab (goto #1)
    3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
    (goto #2)
    4. Fall through in the same path we did before all the way to
    stat(s, ALLOC_SLOWPATH)
    5. bump ALLOC_REFILL stat, then return

    Doing this is obviously bogus. It keeps us from being able to
    accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
    that the total number of allocs always exceeds the total number of
    frees.

    This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same
    place that __slab_alloc() is. This makes it much less likely that
    ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
    __slab_alloc().
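
    Sketch of where the counter ends up, inside slab_alloc_node() at the
    single __slab_alloc() call site (an excerpt-style sketch; names are
    approximate):

        if (unlikely(!object || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
                stat(s, ALLOC_SLOWPATH);        /* counted exactly once */
        } else {
                /* lockless per-cpu fastpath ... */
                stat(s, ALLOC_FASTPATH);
        }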

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When the slab or slub allocators cannot allocate additional slab pages,
    they emit diagnostic information to the kernel log such as current
    number of slabs, number of objects, active objects, etc. This is always
    coupled with a page allocation failure warning since it is controlled by
    !__GFP_NOWARN.

    Suppress this out of memory warning if the allocator is configured
    without debug support. The page allocation failure warning will
    indicate it is a failed slab allocation, the order, and the gfp mask, so
    this is only useful to diagnose allocator issues.

    Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
    allocator, there is no functional change with this patch. If debug is
    disabled, however, the warnings are now suppressed.
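
    The effect is roughly this (a sketch; the real function prints
    considerably more detail):

        static void slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags,
                                       int nid)
        {
        #ifdef CONFIG_SLUB_DEBUG
                pr_warn("SLUB: Unable to allocate memory on node %d (gfp=0x%x)\n",
                        nid, gfpflags);
                pr_warn("  cache: %s, object size: %d, buffer size: %d\n",
                        s->name, s->object_size, s->size);
        #endif
        }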

    Signed-off-by: David Rientjes
    Cc: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Inspired by Joe Perches suggestion in ntfs logging clean-up.

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • All printk(KERN_foo converted to pr_foo()

    Default printk converted to pr_warn()

    Coalesce format fragments
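
    For example (a generic illustration of the conversion, not a specific
    hunk from the patch):

        /* before */
        printk(KERN_ERR "SLUB %s: %s\n", s->name, msg);
        /* after */
        pr_err("SLUB %s: %s\n", s->name, msg);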

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

07 May, 2014

2 commits

  • debugobjects warning during netfilter exit:

    ------------[ cut here ]------------
    WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984
    Workqueue: netns cleanup_net
    Call Trace:
    dump_stack+0x52/0x87
    warn_slowpath_common+0x8c/0xc0
    warn_slowpath_fmt+0x46/0x50
    debug_print_object+0x8d/0xb0
    __debug_check_no_obj_freed+0xa5/0x220
    debug_check_no_obj_freed+0x15/0x20
    kmem_cache_free+0x197/0x340
    kmem_cache_destroy+0x86/0xe0
    nf_conntrack_cleanup_net_list+0x131/0x170
    nf_conntrack_pernet_exit+0x5d/0x70
    ops_exit_list+0x5e/0x70
    cleanup_net+0xfb/0x1c0
    process_one_work+0x338/0x550
    worker_thread+0x215/0x350
    kthread+0xe7/0xf0
    ret_from_fork+0x7c/0xb0

    Also during dcookie cleanup:

    WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    warn_slowpath_common (kernel/panic.c:430)
    warn_slowpath_fmt (kernel/panic.c:445)
    debug_print_object (lib/debugobjects.c:262)
    __debug_check_no_obj_freed (lib/debugobjects.c:697)
    debug_check_no_obj_freed (lib/debugobjects.c:726)
    kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717)
    kmem_cache_destroy (mm/slab_common.c:363)
    dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343)
    event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153)
    __fput (fs/file_table.c:217)
    ____fput (fs/file_table.c:253)
    task_work_run (kernel/task_work.c:125 (discriminator 1))
    do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
    int_signal (arch/x86/kernel/entry_64.S:807)

    Sysfs has a release mechanism. Use that to release the kmem_cache
    structure if CONFIG_SYSFS is enabled.

    Only slub is changed - slab currently only supports /proc/slabinfo and
    not /sys/kernel/slab/*. We talked about adding that and someone was
    working on it.
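
    The shape of the sysfs-driven release, reconstructed from the description
    (names follow SLUB's sysfs code but are not guaranteed verbatim):

        static void kmem_cache_release(struct kobject *k)
        {
                /* Runs when the last sysfs/kobject reference is dropped. */
                slab_kmem_cache_release(to_slab(k));
        }

        static struct kobj_type slab_ktype = {
                .sysfs_ops = &slab_sysfs_ops,
                .release   = kmem_cache_release,
        };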

    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more]
    Signed-off-by: Christoph Lameter
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Acked-by: Greg KH
    Cc: Thomas Gleixner
    Cc: Pekka Enberg
    Cc: Russell King
    Cc: Bart Van Assche
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After creating a cache for a memcg we should initialize its sysfs attrs
    with the values from its parent. That's what memcg_propagate_slab_attrs
    is for. Currently it's broken - we clearly muddled root-vs-memcg caches
    there. Let's fix it up.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 Apr, 2014

1 commit

  • Pull slab changes from Pekka Enberg:
    "The biggest change is byte-sized freelist indices which reduces slab
    freelist memory usage:

    https://lkml.org/lkml/2013/12/2/64"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: slab/slub: use page->list consistently instead of page->lru
    mm/slab.c: cleanup outdated comments and unify variables naming
    slab: fix wrongly used macro
    slub: fix high order page allocation problem with __GFP_NOFAIL
    slab: Make allocations with GFP_ZERO slightly more efficient
    slab: make more slab management structure off the slab
    slab: introduce byte sized index for the freelist of a slab
    slab: restrict the number of objects in a slab
    slab: introduce helper functions to get/set free object
    slab: factor out calculate nr objects in cache_estimate

    Linus Torvalds
     

08 Apr, 2014

6 commits

  • Statistics are not critical to the operation of the allocator, but they
    should also not cause too much overhead.

    When __this_cpu_inc is altered to check if preemption is disabled this
    triggers. Use raw_cpu_inc to avoid the checks. Using this_cpu_ops may
    cause interrupt disable/enable sequences on various arches which may
    significantly impact allocator performance.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Christoph Lameter
    Cc: Fengguang Wu
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The failure paths of sysfs_slab_add don't release the allocation of
    'name' made by create_unique_id() a few lines above the context of the
    diff below. Create a common exit path to make it more obvious what
    needs freeing.

    [vdavydov@parallels.com: free the name only if !unmergeable]
    Signed-off-by: Dave Jones
    Signed-off-by: Vladimir Davydov
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Currently, we try to arrange sysfs entries for memcg caches in the same
    manner as for global caches. Apart from turning /sys/kernel/slab into a
    mess when there are a lot of kmem-active memcgs created, it actually
    does not work properly - we won't create more than one link to a memcg
    cache in case its parent is merged with another cache. For instance, if
    A is a root cache merged with another root cache B, we will have the
    following sysfs setup:

    X
    A -> X
    B -> X

    where X is some unique id (see create_unique_id()). Now if memcgs M and
    N start to allocate from cache A (or B, which is the same), we will get:

    X
    X:M
    X:N
    A -> X
    B -> X
    A:M -> X:M
    A:N -> X:N

    Since B is an alias for A, we won't get entries B:M and B:N, which is
    confusing.

    It is more logical to have entries for memcg caches under the
    corresponding root cache's sysfs directory. This would allow us to keep
    sysfs layout clean, and avoid such inconsistencies like one described
    above.

    This patch does the trick. It creates a "cgroup" kset in each root
    cache kobject to keep its children caches there.
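
    A sketch of the mechanism described (the memcg_kset field is an assumed
    name; kset_create_and_add() is the stock kobject API):

        /* In sysfs_slab_add() for a root cache: create a "cgroup" kset under
         * the cache's kobject to hold its memcg children. */
        s->memcg_kset = kset_create_and_add("cgroup", NULL, &s->kobj);

        /* When a memcg cache is later added, parent it to that kset. */
        child->kobj.kset = root_cache->memcg_kset;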

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Otherwise, kzalloc() called from a memcg won't clear the whole object.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When a kmem cache is created (kmem_cache_create_memcg()), we first try to
    find a compatible cache that already exists and can handle requests from
    the new cache, i.e. has the same object size, alignment, ctor, etc. If
    there is such a cache, we do not create any new caches, instead we simply
    increment the refcount of the cache found and return it.

    Currently we do this procedure not only when creating root caches, but
    also for memcg caches. However, there is no point in that, because, as
    every memcg cache has exactly the same parameters as its parent and cache
    merging cannot be turned off in runtime (only on boot by passing
    "slub_nomerge"), the root caches of any two potentially mergeable memcg
    caches should be merged already, i.e. it must be the same root cache, and
    therefore we couldn't even get to the memcg cache creation, because it
    already exists.

    The only exception is boot caches - they are explicitly forbidden to be
    merged by setting their refcount to -1. There are currently only two of
    them - kmem_cache and kmem_cache_node, which are used in slab internals (I
    do not count kmalloc caches as their refcount is set to 1 immediately
    after creation). Since they are prevented from merging in the first place,
    I guess we should avoid merging their children too.

    So let's remove the useless code responsible for merging memcg caches.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • slab_node() is actually a mempolicy function, so rename it to
    mempolicy_slab_node() to make it clearer that it is used for processes
    with mempolicies.

    At the same time, cleanup its code by saving numa_mem_id() in a local
    variable (since we require a node with memory, not just any node) and
    remove an obsolete comment that assumes the mempolicy is actually passed
    into the function.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Apr, 2014

2 commits

  • We release the slab_mutex while calling sysfs_slab_add from
    __kmem_cache_create since commit 66c4c35c6bc5 ("slub: Do not hold
    slub_lock when calling sysfs_slab_add()"), because kobject_uevent called
    by sysfs_slab_add might block waiting for the usermode helper to exec,
    which would result in a deadlock if we took the slab_mutex while
    executing it.

    However, apart from complicating synchronization rules, releasing the
    slab_mutex on kmem cache creation can result in a kmemcg-related race.
    The point is that we check if the memcg cache exists before going to
    __kmem_cache_create, but register the new cache in memcg subsys after
    it. Since we can drop the mutex there, several threads can see that the
    memcg cache does not exist and proceed to creating it, which is wrong.

    Fortunately, recently kobject_uevent was patched to call the usermode
    helper with the UMH_NO_WAIT flag, making the deadlock impossible.
    Therefore there is no point in releasing the slab_mutex while calling
    sysfs_slab_add, so let's simplify kmem_cache_create synchronization and
    fix the kmemcg-race mentioned above by holding the slab_mutex during the
    whole cache creation path.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Greg KH
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Since put_mems_allowed() is strictly optional (it's a seqcount retry), we
    don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb, some loads and comparisons on some
    relatively fast paths.

    Since the naming of get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
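
    The resulting usage pattern looks like this (a sketch with a made-up
    wrapper name; the retry helper returns true when mems_allowed changed):

        static struct page *alloc_with_stable_mems(gfp_t gfp, unsigned int order)
        {
                struct page *page;
                unsigned int cookie;

                do {
                        cookie = read_mems_allowed_begin();
                        page = alloc_pages(gfp, order);
                } while (!page && read_mems_allowed_retry(cookie));

                return page;
        }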

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Mar, 2014

1 commit

  • SLUB already tries to allocate high order pages with __GFP_NOFAIL cleared.
    But when allocating a shadow page for kmemcheck, it missed clearing the
    flag. This triggers the WARN_ON_ONCE() reported by Christian Casteyde.

    https://bugzilla.kernel.org/show_bug.cgi?id=65991
    https://lkml.org/lkml/2013/12/3/764

    This patch fixes the situation by using the same allocation flags as the
    original allocation.

    Reported-by: Christian Casteyde
    Acked-by: David Rientjes
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
     

11 Feb, 2014

2 commits

  • Vladimir reported the following issue:

    Commit c65c1877bd68 ("slub: use lockdep_assert_held") requires
    remove_partial() to be called with n->list_lock held, but free_partial()
    called from kmem_cache_close() on cache destruction does not follow this
    rule, leading to a warning:

    WARNING: CPU: 0 PID: 2787 at mm/slub.c:1536 __kmem_cache_shutdown+0x1b2/0x1f0()
    Modules linked in:
    CPU: 0 PID: 2787 Comm: modprobe Tainted: G W 3.14.0-rc1-mm1+ #1
    Hardware name:
    0000000000000600 ffff88003ae1dde8 ffffffff816d9583 0000000000000600
    0000000000000000 ffff88003ae1de28 ffffffff8107c107 0000000000000000
    ffff880037ab2b00 ffff88007c240d30 ffffea0001ee5280 ffffea0001ee52a0
    Call Trace:
    __kmem_cache_shutdown+0x1b2/0x1f0
    kmem_cache_destroy+0x43/0xf0
    xfs_destroy_zones+0x103/0x110 [xfs]
    exit_xfs_fs+0x38/0x4e4 [xfs]
    SyS_delete_module+0x19a/0x1f0
    system_call_fastpath+0x16/0x1b

    His solution was to add a spinlock in order to quiet lockdep. Although
    there would be no contention in adding the lock, that lock also requires
    disabling interrupts, which will have a larger impact on the system.

    Instead of adding a spinlock to a location where it is not needed for
    lockdep, make a __remove_partial() function that does not test if the
    list_lock is held, as no one should have it due to it being freed.

    Also added a __add_partial() function that does not do the lock
    validation either, as it is not needed for the creation of the cache.
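
    A sketch following that description: a lockless helper for paths where
    the cache is already dead, plus the lockdep-checked wrapper (close to,
    but not guaranteed to match, the actual patch):

        static inline void __remove_partial(struct kmem_cache_node *n,
                                            struct page *page)
        {
                list_del(&page->lru);
                n->nr_partial--;
        }

        static inline void remove_partial(struct kmem_cache_node *n,
                                          struct page *page)
        {
                lockdep_assert_held(&n->list_lock);
                __remove_partial(n, page);
        }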

    Signed-off-by: Steven Rostedt
    Reported-by: Vladimir Davydov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Acked-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Commit c65c1877bd68 ("slub: use lockdep_assert_held") incorrectly
    required that add_full() and remove_full() hold n->list_lock. The lock
    is only taken when kmem_cache_debug(s), since that's the only time it
    actually does anything.

    Require that the lock only be taken under such a condition.

    Reported-by: Larry Finger
    Tested-by: Larry Finger
    Tested-by: Paul E. McKenney
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 Feb, 2014

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "Random bug fixes that have accumulated in my inbox over the past few
    months"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: Fix warning on make htmldocs caused by slab.c
    mm: slub: work around unneeded lockdep warning
    mm: sl[uo]b: fix misleading comments
    slub: Fix possible format string bug.
    slub: use lockdep_assert_held
    slub: Fix calculation of cpu slabs
    slab.h: remove duplicate kmalloc declaration and fix kernel-doc warnings

    Linus Torvalds
     

31 Jan, 2014

2 commits

  • The slub code does some setup during early boot in
    early_kmem_cache_node_alloc() with some local data. There is no
    possible way that another CPU can see this data, so the slub code
    doesn't unnecessarily lock it. However, some new lockdep asserts
    check to make sure that add_partial() _always_ has the list_lock
    held.

    Just add the locking, even though it is technically unnecessary.

    Cc: Peter Zijlstra
    Cc: Russell King
    Acked-by: David Rientjes
    Signed-off-by: Dave Hansen
    Signed-off-by: Pekka Enberg

    Dave Hansen
     
  • Commit abca7c496584 ("mm: fix slab->page _count corruption when using
    slub") notes that we can not _set_ a page->counters directly, except
    when using a real double-cmpxchg. Doing so can lose updates to
    ->_count.

    That is an absolute rule:

    You may not *set* page->counters except via a cmpxchg.

    Commit abca7c496584 fixed this for the folks who have the slub
    cmpxchg_double code turned off at compile time, but it left the bad case
    alone. It can still be reached, and the same bug triggered in two
    cases:

    1. Turning on slub debugging at runtime, which is available on
    the distro kernels that I looked at.
    2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64
    cpus, evidently)

    There are at least 3 ways we could fix this:

    1. Take all of the existing calls to cmpxchg_double_slab() and
    __cmpxchg_double_slab() and convert them to take an old, new
    and target 'struct page'.
    2. Do (1), but with the newly-introduced 'slub_data'.
    3. Do some magic inside the two cmpxchg...slab() functions to
    pull the counters out of new_counters and only set those
    fields in page->{inuse,frozen,objects}.

    I've done (2) as well, but it's a bunch more code. This patch is an
    attempt at (3). This was the most straightforward and foolproof way
    that I could think to do this.

    This would also technically allow us to get rid of the ugly

    #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)

    in 'struct page', but leaving it alone has the added benefit that
    'counters' stays 'unsigned' instead of 'unsigned long', so all the
    copies that the slub code does stay a bit smaller.
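
    One way option (3) can look (a sketch that assumes the era's 'struct
    page' union layout, where inuse/objects/frozen overlay 'counters'):

        static inline void set_page_slub_counters(struct page *page,
                                                  unsigned long counters_new)
        {
                struct page tmp;

                tmp.counters = counters_new;
                /* Copy only the slub fields that overlay 'counters'; never
                 * assign page->counters directly, which could clobber _count. */
                page->inuse   = tmp.inuse;
                page->objects = tmp.objects;
                page->frozen  = tmp.frozen;
        }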

    Signed-off-by: Dave Hansen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Pravin B Shelar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen