14 Dec, 2014

2 commits

  • If we fail to allocate from the current node's stock, we look for free
    objects on other nodes before calling the page allocator (see
    get_any_partial). While checking other nodes we respect cpuset
    constraints by calling cpuset_zone_allowed, enforcing the hardwall
    check. As a result, we fall back to the page allocator even if there are
    free objects cached on other nodes, whenever those nodes are not set in
    the current cpuset. However, the page allocator uses the softwall check
    for kernel allocations, so it may allocate from one of those other nodes
    in this case.

    Therefore we should use the softwall cpuset check in get_any_partial to
    conform with the cpuset check in the page allocator.
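
    The difference boils down to which gfp mask reaches the cpuset check in
    get_any_partial() (sketch only, not the exact upstream diff):

        /* hardwall check: rejects nodes outside the current cpuset */
        cpuset_zone_allowed(zone, flags | __GFP_HARDWALL)

        /* softwall check: the same test the page allocator applies to
         * kernel allocations */
        cpuset_zone_allowed(zone, flags)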

    Signed-off-by: Vladimir Davydov
    Acked-by: Zefan Li
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Suppose task @t, which belongs to a memory cgroup @memcg, is going to
    allocate an object from kmem cache @c. The copy of @c corresponding to
    @memcg, @mc, is empty. Then, if kmem_cache_alloc races with the memory
    cgroup's destruction, we can end up accessing the memory cgroup's copy
    of the cache after it has been destroyed:

    CPU0                                    CPU1
    ----                                    ----
    [ current=@t
      @mc->memcg_params->nr_pages=0 ]

    kmem_cache_alloc(@c):
      call memcg_kmem_get_cache(@c);
      proceed to allocation from @mc:
        alloc a page for @mc:
          ...
                                            move @t from @memcg
                                            destroy @memcg:
                                              mem_cgroup_css_offline(@memcg):
                                                memcg_unregister_all_caches(@memcg):
                                                  kmem_cache_destroy(@mc)

          add page to @mc

    We could fix this issue by taking a reference to a per-memcg cache, but
    that would require adding a per-cpu reference counter to per-memcg caches,
    which would look cumbersome.

    Instead, let's take a reference to the memory cgroup, which already has a
    per-cpu reference counter, at the beginning of kmem_cache_alloc and drop
    it at the end, and move per-memcg cache destruction from css offline to
    css free. As a side effect, per-memcg caches will be destroyed not one by
    one, but all at once when the last page accounted to the memory cgroup is
    freed. That doesn't sound like a high price to pay for code readability,
    though.

    Note that this patch does add some overhead to the kmem_cache_alloc hot
    path, but it is pretty negligible - just a function call plus a per-cpu
    counter decrement, comparable to what we already have in
    memcg_kmem_get_cache. Besides, it is only relevant if there are memory
    cgroups with kmem accounting enabled. I don't think we can handle this
    race without it, because alloc_page called from kmem_cache_alloc may
    sleep, so we can't flush all pending kmallocs without reference counting.
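
    In outline, the scheme described above looks like this (sketch; the
    put-side helper name is illustrative, not necessarily the exact upstream
    API):

        /* allocation entry: pick the per-memcg cache and, as part of that,
         * take a reference to the memcg's css (per-cpu, hence cheap) */
        s = memcg_kmem_get_cache(s, gfpflags);

        /* ... allocate the object from s ... */

        /* allocation exit: drop the reference taken above, so the memcg
         * (and, at css-free time, its caches) can finally go away */
        memcg_kmem_put_cache(s);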

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

3 commits

  • The code goes BUG, but doesn't tell us which bits were unexpectedly set.
    Print that out.

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Adding __printf(3, 4) to slab_err exposed following:

    mm/slub.c: In function `check_slab':
    mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
    s->name, page->objects, maxobj);
    ^
    mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
    mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
    s->name, page->inuse, page->objects);
    ^
    mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]

    mm/slub.c: In function `on_freelist':
    mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
    "should be %d", page->objects, max_objects);

    Fix the first two warnings by removing the redundant s->name argument.
    Fix the last one by changing the type of max_objects from unsigned long
    to int.
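
    For example, the first fix is simply dropping the extra argument so that
    the arguments line up with the %u conversions (sketch; the format string
    is approximate):

        /* before: s->name has no matching conversion in the format string */
        slab_err(s, page, "objects %u > max %u", s->name, page->objects, maxobj);

        /* after */
        slab_err(s, page, "objects %u > max %u", page->objects, maxobj);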

    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Some code in mm/slab.c and mm/slub.c uses spaces for indentation.
    Clean it up.

    Signed-off-by: LQYMGT
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    LQYMGT
     

27 Oct, 2014

1 commit

  • The current cpuset API for checking whether a zone/node is allowed to
    allocate from looks rather awkward. We have hardwall and softwall
    versions of cpuset_node_allowed, with the softwall version doing
    literally the same as the hardwall version if __GFP_HARDWALL is passed
    to it in the gfp flags. If it isn't, the softwall version may check the
    given node against the enclosing hardwall cpuset, which it needs to take
    the callback lock to do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.
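
    Roughly, the API change described (declarations only, shown as a sketch
    of that era's interface):

        /* before: two variants of each check */
        int cpuset_node_allowed_softwall(int node, gfp_t gfp_mask);
        int cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask);

        /* after: a single check; passing __GFP_HARDWALL in gfp_mask selects
         * the stricter hardwall behaviour */
        int cpuset_node_allowed(int node, gfp_t gfp_mask);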

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

10 Oct, 2014

3 commits

  • Slab merging is a good feature for reducing fragmentation. Currently it
    is only applied to SLUB, but it would be good to apply it to SLAB as
    well. This patch is a preparation step for applying slab merging to SLAB
    by commonizing the slab merge logic.

    Signed-off-by: Joonsoo Kim
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Update the SLUB code to search for partial slabs on the nearest node with
    memory in the presence of memoryless nodes. Additionally, do not consider
    it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a
    memoryless-node specified allocation goes off-node.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Han Pingtian
    Cc: Pekka Enberg
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Christoph Lameter
    Cc: Wanpeng Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Tracing of mergeable slabs as well as uses of failslab are confusing since
    the objects of multiple slab caches will be affected. Moreover this
    creates a situation where a mergeable slab will become unmergeable.

    If tracing or failslab testing is desired then it may be best to switch
    merging off for starters.

    Signed-off-by: Christoph Lameter
    Tested-by: WANG Chao
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

07 Aug, 2014

8 commits

  • This function is never called for memcg caches, because they are
    unmergeable, so remove the dead code.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We mark some slab caches (e.g. kmem_cache_node) as unmergeable by
    setting their refcount to -1; their alias count should then be 0, not
    refcount - 1, so correct it here.

    Signed-off-by: Gu Zheng
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     
  • The return statement goes with the cmpxchg_double() condition so it needs
    to be indented another tab.

    Also these days the fashion is to line function parameters up, and it
    looks nicer that way because then the "freelist_new" is not at the same
    indent level as the "return 1;".
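
    That is, the shape being described is roughly (illustrative):

        if (cmpxchg_double(&page->freelist, &page->counters,
                           freelist_old, counters_old,
                           freelist_new, counters_new))
                return 1;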

    Signed-off-by: Dan Carpenter
    Signed-off-by: Pekka Enberg
    Signed-off-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • When a kmem_cache is created with a ctor, each object in the kmem_cache
    is initialized before it is ready to use. In the slub implementation,
    however, the first object is initialized twice.

    This patch removes the duplicate initialization of the first object.

    Fixes commit 7656c72b ("SLUB: add macros for scanning objects in a
    slab").

    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • There are two versions of alloc/free hooks now - one for
    CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.

    I see no reason why calls to other debugging subsystems (LOCKDEP,
    DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
    All of these features should work regardless of the SLUB_DEBUG config,
    as they already have their own Kconfig options.

    This also fixes failslab for the CONFIG_SLUB_DEBUG=n configuration. It
    simply did not work before, because the should_failslab() call was in a
    hook hidden under "#ifdef CONFIG_SLUB_DEBUG #else".

    Note: there is one concealed change in the allocation path for
    SLUB_DEBUG=n with all other debugging features disabled. The
    might_sleep_if() call can generate some code even if
    DEBUG_ATOMIC_SLEEP=n. With PREEMPT_VOLUNTARY=y, might_sleep() inserts a
    _cond_resched() call, but I think that should be OK.
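
    The unified pre-allocation hook then looks roughly like this, with the
    debug calls no longer hidden under CONFIG_SLUB_DEBUG (simplified
    sketch):

        static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
        {
                flags &= gfp_allowed_mask;
                lockdep_trace_alloc(flags);
                might_sleep_if(flags & __GFP_WAIT);

                /* now effective even with CONFIG_SLUB_DEBUG=n */
                return should_failslab(s->object_size, flags, s->flags);
        }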

    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • resiliency_test() is only called for bootstrap, so it may be moved to
    init.text and freed after boot.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Make use of the new node functions in mm/slab.h to reduce code size and
    simplify.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Lameter
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The patchset provides two new functions in mm/slab.h and modifies SLAB
    and SLUB to use these. The kmem_cache_node structure is shared between
    both allocators and the use of common accessors will allow us to move
    more code into slab_common.c in the future.

    This patch (of 3):

    These functions allow us to eliminate repeated code in both SLAB and
    SLUB, and also allow for the insertion of debugging code that may be
    needed during development.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Acked-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

04 Jul, 2014

1 commit

  • min_partial is the minimum number of slabs cached on a node's partial
    list. If nr_partial is less than it, we keep a newly emptied slab on the
    node partial list rather than freeing it. But if nr_partial is equal to
    or greater than it, we already have enough partial slabs, so the newly
    emptied slab should be freed. The current implementation misses the
    equal case, so if min_partial is set to 0, at least one slab can still
    be cached. This is a critical problem for the kmemcg destruction logic,
    which doesn't work properly if any slabs are cached. This patch fixes
    the problem.
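
    The fix boils down to the comparison used when deciding whether to
    discard a newly emptied slab (sketch of the described change):

        /* was:  n->nr_partial >  s->min_partial  - keeps one slab cached
         *       even with min_partial == 0
         * now:  n->nr_partial >= s->min_partial  - free it once we already
         *       have enough partial slabs */
        if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
                goto slab_empty;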

    Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs
    immediately").

    Signed-off-by: Joonsoo Kim
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Jun, 2014

1 commit

  • Currently, if the allocation is constrained to NUMA_NO_NODE, we search
    for a partial slab on the numa_node_id() node. This doesn't work
    properly on a system with memoryless nodes, since such a node has no
    memory and therefore can have no partial slabs.

    On such a node, page allocation always falls back to numa_mem_id()
    first. So searching for a partial slab on numa_mem_id() in that case is
    the proper solution for the memoryless node case.
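
    In other words, the node used for the partial-slab search becomes
    (sketch of the described behaviour):

        /* NUMA_NO_NODE: search the nearest node that actually has memory,
         * which is where the page allocator would land anyway */
        int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;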

    Signed-off-by: Joonsoo Kim
    Acked-by: Nishanth Aravamudan
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Wanpeng Li
    Cc: Han Pingtian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

10 commits

  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().
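
    The conversion pattern is mechanical (the per-cpu variable below is
    hypothetical, for illustration only):

        DEFINE_PER_CPU(struct kmem_cache_cpu, example_cpu_slab);

        /* before: address calculation via __get_cpu_var */
        struct kmem_cache_cpu *c = &__get_cpu_var(example_cpu_slab);

        /* after */
        struct kmem_cache_cpu *c = this_cpu_ptr(&example_cpu_slab);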

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently we have two pairs of kmemcg-related functions that are called
    on slab alloc/free. The first is memcg_{bind,release}_pages, which
    counts the total number of pages allocated by a kmem cache. The second
    is memcg_{un}charge_slab, which {un}charges slab pages to the kmemcg
    resource counter. Let's just merge them to keep the code clean.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When we create a sl[au]b cache, we allocate kmem_cache_node structures
    for each online NUMA node. To handle nodes being taken online/offline,
    we register a memory hotplug notifier and allocate/free the
    kmem_cache_node corresponding to the node that changes its state for
    each kmem cache.

    To synchronize between the two paths, we hold the slab_mutex during both
    the cache creation/destruction path and while tuning the per-node parts
    of kmem caches in the memory hotplug handler. But that's not quite
    right, because it does not guarantee that a newly created cache will
    have all of its kmem_cache_nodes initialized if it races with memory
    hotplug. For instance, in the case of slub:

    CPU0                                    CPU1
    ----                                    ----
    kmem_cache_create:                      online_pages:
      __kmem_cache_create:                    slab_memory_callback:
                                                slab_mem_going_online_callback:
                                                  lock slab_mutex
                                                  for each slab_caches list entry
                                                    allocate kmem_cache node
                                                  unlock slab_mutex
        lock slab_mutex
        init_kmem_cache_nodes:
          for_each_node_state(node, N_NORMAL_MEMORY)
            allocate kmem_cache node
        add kmem_cache to slab_caches list
        unlock slab_mutex
                                            online_pages (continued):
                                              node_states_set_node

    As a result we'll end up with a kmem cache that has some of its
    kmem_cache_nodes unallocated.

    To avoid issues like that, we should hold get/put_online_mems() across
    the whole kmem cache creation/destruction/shrink paths, just as we do
    for cpu hotplug. This patch does the trick.

    Note that after it's applied, there is no need to take the slab_mutex
    for kmem_cache_shrink any more, so it is removed from there.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of the race described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves the purpose, but it's
    a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations on locking order, which are not
    desirable, and it can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except that they block memory, not cpu,
    hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
    myself, because it used an rw semaphore for get/put_online_mems, making
    them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of the online
    nodes mask can't proceed concurrently. Also, it imposes some strong
    locking-order rules, which narrows down the set of its usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing code inside
    a get/put_online_mems section guarantees a stable value of the online
    nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
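
    Usage mirrors the cpu-hotplug helpers (sketch):

        get_online_mems();
        /*
         * The set of online nodes and their present pages is stable here,
         * so e.g. walking N_MEMORY nodes to set up per-node cache parts is
         * safe against concurrent memory hotplug.
         */
        put_online_mems();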

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, to allocate a page that should be charged to kmemcg (e.g.
    threadinfo), we pass the __GFP_KMEMCG flag to the page allocator. The
    page allocated is then to be freed by free_memcg_kmem_pages. Apart from
    looking asymmetrical, this also requires intrusion into the general
    allocation path. So let's introduce separate functions that will
    alloc/free pages charged to kmemcg.

    The new functions are called alloc_kmem_pages and free_kmem_pages. They
    should be used when the caller would actually like to use kmalloc, but
    has to fall back to the page allocator because the allocation is large.
    They only differ from alloc_pages and free_pages in that, besides
    allocating or freeing pages, they also charge them to the kmem resource
    counter of the current memory cgroup.
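
    The declarations look roughly like this (reproduced from memory of that
    kernel era; treat as a sketch):

        struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order);
        struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
                                           unsigned int order);
        void free_kmem_pages(unsigned long addr, unsigned int order);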

    [sfr@canb.auug.org.au: export kmalloc_order() to modules]
    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem, so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to charge kmem there explicitly. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from memcg caches' allocflags. Instead, it makes the slab allocation
    path call memcg_charge_kmem directly, getting the memcg to charge from
    the cache's memcg params.

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment from memcg_kmem_get_cache.

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
    got bumped in that exit path. Now there are two, and a bunch of gotos.
    ALLOC_SLOWPATH can now get set more than once during a single call to
    __slab_alloc() which is pretty bogus. Here's the sequence:

    1. Enter __slab_alloc(), fall through all the way to the
       stat(s, ALLOC_SLOWPATH);
    2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
       new_slab (goto #1)
    3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
       (goto #2)
    4. Fall through in the same path we did before all the way to
       stat(s, ALLOC_SLOWPATH)
    5. bump ALLOC_REFILL stat, then return

    Doing this is obviously bogus. It keeps us from being able to
    accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
    that the total number of allocs always exceeds the total number of
    frees.

    This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same
    place that __slab_alloc() is. This makes it much less likely that
    ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
    __slab_alloc().
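
    The move lands the stat at __slab_alloc()'s single call site, so it is
    counted exactly once per slow-path allocation (sketch):

        if (unlikely(!object || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
                stat(s, ALLOC_SLOWPATH);        /* bumped here, once */
        }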

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When the slab or slub allocators cannot allocate additional slab pages,
    they emit diagnostic information to the kernel log such as current
    number of slabs, number of objects, active objects, etc. This is always
    coupled with a page allocation failure warning since it is controlled by
    !__GFP_NOWARN.

    Suppress this out-of-memory warning if the allocator is configured
    without debug support. The page allocation failure warning will still
    indicate that it is a failed slab allocation, along with the order and
    the gfp mask, so the extra slab diagnostics are only useful for
    diagnosing allocator issues.

    Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
    allocator, there is no functional change with this patch. If debug is
    disabled, however, the warnings are now suppressed.

    Signed-off-by: David Rientjes
    Cc: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Inspired by Joe Perches suggestion in ntfs logging clean-up.

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • All printk(KERN_foo ...) calls converted to pr_foo().

    Default printk converted to pr_warn().

    Coalesce format fragments.
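
    The conversion pattern, with illustrative messages:

        /* before */
        printk(KERN_ERR "SLUB: example message for %s\n", s->name);
        printk("SLUB: message with no explicit level\n");

        /* after */
        pr_err("SLUB: example message for %s\n", s->name);
        pr_warn("SLUB: message with no explicit level\n");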

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

07 May, 2014

2 commits

  • debugobjects warning during netfilter exit:

    ------------[ cut here ]------------
    WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984
    Workqueue: netns cleanup_net
    Call Trace:
    dump_stack+0x52/0x87
    warn_slowpath_common+0x8c/0xc0
    warn_slowpath_fmt+0x46/0x50
    debug_print_object+0x8d/0xb0
    __debug_check_no_obj_freed+0xa5/0x220
    debug_check_no_obj_freed+0x15/0x20
    kmem_cache_free+0x197/0x340
    kmem_cache_destroy+0x86/0xe0
    nf_conntrack_cleanup_net_list+0x131/0x170
    nf_conntrack_pernet_exit+0x5d/0x70
    ops_exit_list+0x5e/0x70
    cleanup_net+0xfb/0x1c0
    process_one_work+0x338/0x550
    worker_thread+0x215/0x350
    kthread+0xe7/0xf0
    ret_from_fork+0x7c/0xb0

    Also during dcookie cleanup:

    WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    warn_slowpath_common (kernel/panic.c:430)
    warn_slowpath_fmt (kernel/panic.c:445)
    debug_print_object (lib/debugobjects.c:262)
    __debug_check_no_obj_freed (lib/debugobjects.c:697)
    debug_check_no_obj_freed (lib/debugobjects.c:726)
    kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717)
    kmem_cache_destroy (mm/slab_common.c:363)
    dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343)
    event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153)
    __fput (fs/file_table.c:217)
    ____fput (fs/file_table.c:253)
    task_work_run (kernel/task_work.c:125 (discriminator 1))
    do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
    int_signal (arch/x86/kernel/entry_64.S:807)

    Sysfs has a release mechanism. Use that to release the kmem_cache
    structure if CONFIG_SYSFS is enabled.

    Only slub is changed - slab currently only supports /proc/slabinfo and
    not /sys/kernel/slab/*. We talked about adding that and someone was
    working on it.

    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more]
    Signed-off-by: Christoph Lameter
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Acked-by: Greg KH
    Cc: Thomas Gleixner
    Cc: Pekka Enberg
    Cc: Russell King
    Cc: Bart Van Assche
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After creating a cache for a memcg we should initialize its sysfs attrs
    with the values from its parent. That's what memcg_propagate_slab_attrs
    is for. Currently it's broken - we clearly muddled root-vs-memcg caches
    there. Let's fix it up.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 Apr, 2014

1 commit

  • Pull slab changes from Pekka Enberg:
    "The biggest change is byte-sized freelist indices which reduces slab
    freelist memory usage:

    https://lkml.org/lkml/2013/12/2/64"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: slab/slub: use page->list consistently instead of page->lru
    mm/slab.c: cleanup outdated comments and unify variables naming
    slab: fix wrongly used macro
    slub: fix high order page allocation problem with __GFP_NOFAIL
    slab: Make allocations with GFP_ZERO slightly more efficient
    slab: make more slab management structure off the slab
    slab: introduce byte sized index for the freelist of a slab
    slab: restrict the number of objects in a slab
    slab: introduce helper functions to get/set free object
    slab: factor out calculate nr objects in cache_estimate

    Linus Torvalds
     

08 Apr, 2014

6 commits

  • Statistics are not critical to the operation of the allocator, but they
    should also not cause too much overhead.

    When __this_cpu_inc is altered to check whether preemption is disabled,
    this code triggers the check. Use raw_cpu_inc to avoid the checks.
    Using this_cpu ops may cause interrupt disable/enable sequences on
    various arches, which may significantly impact allocator performance.
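
    The resulting stat helper looks roughly like this (sketch):

        static inline void stat(const struct kmem_cache *s, enum stat_item si)
        {
        #ifdef CONFIG_SLUB_STATS
                /*
                 * The read-modify-write is racy on a preemptible kernel, but
                 * occasionally misattributing an event to the wrong cpu is
                 * acceptable and far cheaper than disabling preemption or
                 * interrupts around every counter bump.
                 */
                raw_cpu_inc(s->cpu_slab->stat[si]);
        #endif
        }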

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Christoph Lameter
    Cc: Fengguang Wu
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The failure paths of sysfs_slab_add don't release the allocation of
    'name' made by create_unique_id() a few lines earlier. Create a common
    exit path to make it more obvious what needs freeing.

    [vdavydov@parallels.com: free the name only if !unmergeable]
    Signed-off-by: Dave Jones
    Signed-off-by: Vladimir Davydov
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Currently, we try to arrange sysfs entries for memcg caches in the same
    manner as for global caches. Apart from turning /sys/kernel/slab into a
    mess when there are a lot of kmem-active memcgs created, it actually
    does not work properly - we won't create more than one link to a memcg
    cache in case its parent is merged with another cache. For instance, if
    A is a root cache merged with another root cache B, we will have the
    following sysfs setup:

    X
    A -> X
    B -> X

    where X is some unique id (see create_unique_id()). Now if memcgs M and
    N start to allocate from cache A (or B, which is the same), we will get:

    X
    X:M
    X:N
    A -> X
    B -> X
    A:M -> X:M
    A:N -> X:N

    Since B is an alias for A, we won't get entries B:M and B:N, which is
    confusing.

    It is more logical to have entries for memcg caches under the
    corresponding root cache's sysfs directory. This would allow us to keep
    the sysfs layout clean and avoid inconsistencies like the one described
    above.

    This patch does the trick. It creates a "cgroup" kset in each root
    cache kobject to keep its children caches there.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Otherwise, kzalloc() called from a memcg won't clear the whole object.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When a kmem cache is created (kmem_cache_create_memcg()), we first try to
    find a compatible cache that already exists and can handle requests from
    the new cache, i.e. has the same object size, alignment, ctor, etc. If
    there is such a cache, we do not create any new caches, instead we simply
    increment the refcount of the cache found and return it.

    Currently we do this procedure not only when creating root caches, but
    also for memcg caches. However, there is no point in that, because, as
    every memcg cache has exactly the same parameters as its parent and cache
    merging cannot be turned off in runtime (only on boot by passing
    "slub_nomerge"), the root caches of any two potentially mergeable memcg
    caches should be merged already, i.e. it must be the same root cache, and
    therefore we couldn't even get to the memcg cache creation, because it
    already exists.

    The only exception is boot caches - they are explicitly forbidden to be
    merged by setting their refcount to -1. There are currently only two of
    them - kmem_cache and kmem_cache_node, which are used in slab internals
    (I do not count kmalloc caches, as their refcount is set to 1
    immediately after creation). Since they are prevented from merging in
    the first place, I guess we should avoid merging their children too.

    So let's remove the useless code responsible for merging memcg caches.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • slab_node() is actually a mempolicy function, so rename it to
    mempolicy_slab_node() to make it clearer that it is used for processes
    with mempolicies.

    At the same time, cleanup its code by saving numa_mem_id() in a local
    variable (since we require a node with memory, not just any node) and
    remove an obsolete comment that assumes the mempolicy is actually passed
    into the function.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Apr, 2014

1 commit

  • We release the slab_mutex while calling sysfs_slab_add from
    __kmem_cache_create since commit 66c4c35c6bc5 ("slub: Do not hold
    slub_lock when calling sysfs_slab_add()"), because kobject_uevent called
    by sysfs_slab_add might block waiting for the usermode helper to exec,
    which would result in a deadlock if we took the slab_mutex while
    executing it.

    However, apart from complicating synchronization rules, releasing the
    slab_mutex on kmem cache creation can result in a kmemcg-related race.
    The point is that we check whether the memcg cache exists before going
    to __kmem_cache_create, but register the new cache in the memcg
    subsystem after it. Since we can drop the mutex there, several threads
    can see that the memcg cache does not exist and proceed to creating it,
    which is wrong.

    Fortunately, recently kobject_uevent was patched to call the usermode
    helper with the UMH_NO_WAIT flag, making the deadlock impossible.
    Therefore there is no point in releasing the slab_mutex while calling
    sysfs_slab_add, so let's simplify kmem_cache_create synchronization and
    fix the kmemcg-race mentioned above by holding the slab_mutex during the
    whole cache creation path.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Greg KH
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov