08 Apr, 2014

5 commits

  • Currently we destroy children caches at the very beginning of
    kmem_cache_destroy(). This is wrong, because the root cache will not
    necessarily be destroyed in the end - if it has aliases (refcount > 0),
    kmem_cache_destroy() will simply decrement its refcount and return. In
    this case, at best we will get a bunch of warnings in dmesg, like this
    one:

    kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
    CPU: 1 PID: 7139 Comm: modprobe Tainted: G B W 3.13.0+ #117
    Call Trace:
    dump_stack+0x49/0x5b
    kmem_cache_destroy+0xdf/0xf0
    kmem_cache_destroy_memcg_children+0x97/0xc0
    kmem_cache_destroy+0xf/0xf0
    xfs_mru_cache_uninit+0x21/0x30 [xfs]
    exit_xfs_fs+0x2e/0xc44 [xfs]
    SyS_delete_module+0x198/0x1f0
    system_call_fastpath+0x16/0x1b

    At worst, if kmem_cache_destroy() races with an allocation from a memcg
    cache, the kernel will panic.

    This patch fixes this by moving the destruction of children caches to
    after the check whether the cache has aliases. In addition, it forbids
    destroying a root cache while it still has children caches, because each
    child cache keeps a reference to its parent.
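
    A minimal sketch of the reordered flow (an illustrative userspace-style
    model; the struct fields and helper names are stand-ins, not the kernel
    code):

    struct cache {
            int refcount;           /* still > 0 after decrement: aliases remain */
            int nr_children;        /* live per-memcg copies of this root cache  */
    };

    static void destroy_children(struct cache *c);  /* placeholder */
    static void shutdown_cache(struct cache *c);    /* placeholder */

    static int destroy_cache(struct cache *c)
    {
            if (--c->refcount > 0)
                    return 0;               /* aliased: only drop a reference */

            destroy_children(c);            /* safe now that we know we proceed */
            if (c->nr_children > 0) {
                    c->refcount++;          /* a child survived and pins the parent */
                    return -1;
            }

            shutdown_cache(c);              /* the root cache really goes away */
            return 0;
    }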

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, memcg_unregister_cache(), which deletes the cache being
    destroyed from the memcg_slab_caches list, is called after
    __kmem_cache_shutdown() (see kmem_cache_destroy()), which starts to
    destroy the cache.

    As a result, one can access a partially destroyed cache while traversing
    a memcg_slab_caches list, which can have deadly consequences (for
    instance, cache_show() called for each cache on a memcg_slab_caches list
    from mem_cgroup_slabinfo_read() will dereference pointers to already
    freed data).

    To fix this, let's move memcg_unregister_cache() to before the start of
    the cache destruction process, and call memcg_register_cache() again on
    failure.
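
    The reordering can be sketched like this (placeholder helpers standing in
    for the kernel functions named above):

    struct cache;                                   /* opaque stand-in */

    static void unregister_cache(struct cache *c);  /* unlink from memcg list */
    static void register_cache(struct cache *c);    /* link back on failure   */
    static int shutdown_cache(struct cache *c);     /* may fail, e.g. objects left */

    static int destroy_memcg_cache(struct cache *c)
    {
            /* Unlink first, so memcg_slab_caches walkers can never see a
             * half-destroyed cache. */
            unregister_cache(c);

            if (shutdown_cache(c) != 0) {
                    register_cache(c);      /* destruction failed: re-expose it */
                    return -1;
            }
            return 0;
    }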

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
    memcg-only and except-for-memcg calls. To clean this up, let's move the
    code responsible for memcg cache creation to a separate function.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patch cleans up the memcg cache creation path as follows:

    - Move memcg cache name creation to a separate function to be called
    from kmem_cache_create_memcg(). This allows us to get rid of the mutex
    protecting the temporary buffer used for the name formatting, because
    the whole cache creation path is protected by the slab_mutex.

    - Get rid of memcg_create_kmem_cache(). This function serves as a proxy
    to kmem_cache_create_memcg(). After separating the cache name creation
    path, it would be reduced to a function call, so let's inline it.
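
    For the first point, the name-creation step then reduces to something
    like the following (illustrative sketch; the buffer size and format
    string are assumptions):

    #include <stdio.h>

    /* Every caller already holds slab_mutex, so this shared buffer needs
     * no lock of its own. */
    static char memcg_name_buf[128];

    static const char *memcg_cache_name(const char *root, int memcg_id,
                                        const char *cgroup)
    {
            snprintf(memcg_name_buf, sizeof(memcg_name_buf),
                     "%s(%d:%s)", root, memcg_id, cgroup);
            return memcg_name_buf;
    }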

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When a kmem cache is created (kmem_cache_create_memcg()), we first try to
    find a compatible cache that already exists and can handle requests from
    the new cache, i.e. has the same object size, alignment, ctor, etc. If
    there is such a cache, we do not create any new caches, instead we simply
    increment the refcount of the cache found and return it.

    Currently we do this procedure not only when creating root caches, but
    also for memcg caches. However, there is no point in that: every memcg
    cache has exactly the same parameters as its parent, and cache merging
    cannot be turned off at runtime (only at boot, by passing "slub_nomerge"),
    so the root caches of any two potentially mergeable memcg caches must
    already have been merged, i.e. they must share the same root cache, and
    we could therefore never even get to the memcg cache creation, because
    the memcg cache already exists.

    The only exception is boot caches - they are explicitly forbidden from
    being merged by setting their refcount to -1. There are currently only
    two of them - kmem_cache and kmem_cache_node, which are used in slab
    internals (I do not count kmalloc caches, as their refcount is set to 1
    immediately after creation). Since these caches are deliberately
    prevented from merging, we should avoid merging their children too.

    So let's remove the useless code responsible for merging memcg caches.
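
    The remaining merge lookup is then root-only, along the lines of this
    sketch (all names are illustrative):

    struct cache;                           /* opaque stand-in */

    /* Illustrative stand-in for the allocator's "find a compatible cache"
     * helper used by cache creation. */
    static struct cache *find_mergeable_cache(const char *name, unsigned int size);

    static struct cache *get_alias_or_null(const char *name, unsigned int size,
                                           int creating_memcg_cache)
    {
            /* A memcg cache copies every parameter from its root cache, so
             * two mergeable memcg caches would already share one root and
             * we would never get here for the second one. */
            if (creating_memcg_cache)
                    return NULL;

            return find_mergeable_cache(name, size);
    }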

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

30 Jan, 2014

1 commit

  • On kmem_cache_create_memcg() error path we set 'err', but leave 's' (the
    new cache ptr) undefined. The latter can be NULL if we could not
    allocate the cache, or pointing to a freed area if we failed somewhere
    later while trying to initialize it. Initially we checked 'err'
    immediately before exiting the function and returned NULL if it was set,
    ignoring the value of 's':

    out_unlock:
            ...
            if (err) {
                    /* report error */
                    return NULL;
            }
            return s;

    Recently this check was, in fact, broken by commit f717eb3abb5e ("slab:
    do not panic if we fail to create memcg cache"), which turned it to:

    out_unlock:
            ...
            if (err && !memcg) {
                    /* report error */
                    return NULL;
            }
            return s;

    As a result, if creating a cache for a memcg fails, we skip the check and
    return an 's' that can contain crap. Obviously, commit f717eb3abb5e did
    not intend to return crap on failure to allocate a memcg cache, but only
    to suppress the error reporting in that case, so the check should look
    like this:

    out_unlock:
            ...
            if (err) {
                    if (!memcg) {
                            /* report error */
                    }
                    return NULL;
            }
            return s;

    [rientjes@google.com: despaghettification]
    [vdavydov@parallels.com: patch monkeying]
    Signed-off-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Dave Jones
    Reported-by: Dave Jones
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

24 Jan, 2014

5 commits

  • There is no point in flooding logs with warnings or especially crashing
    the system if we fail to create a cache for a memcg. In this case we
    will be accounting the memcg allocation to the root cgroup until we
    succeed in creating its own cache, but it isn't that critical.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We obtain a per-memcg cache from a root kmem_cache by dereferencing an
    entry of the root cache's memcg_params::memcg_caches array. If we find
    no cache for a memcg there on allocation, we initiate the memcg cache
    creation (see memcg_kmem_get_cache()). The cache creation proceeds
    asynchronously in memcg_create_kmem_cache() in order to avoid lock
    clashes, so there can be several threads trying to create the same
    kmem_cache concurrently, but only one of them may succeed. However, due
    to a race in the code, this is not always the case. The point is that the
    memcg_caches array can be relocated when we activate kmem accounting for
    a memcg (see memcg_update_all_caches(), memcg_update_cache_size()). If
    memcg_update_cache_size() and memcg_create_kmem_cache() proceed
    concurrently as described below, we can leak a kmem_cache.

    Assume two threads schedule creation of the same kmem_cache. One of them
    successfully creates it. Another one should fail then, but if
    memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
    follows, it won't:

    memcg_create_kmem_cache()                   memcg_update_cache_size()
    (called w/o mutexes held)                   (called with slab_mutex,
                                                 set_limit_mutex held)
    -------------------------                   -------------------------

    mutex_lock(&memcg_cache_mutex)

                                                s->memcg_params=kzalloc(...)

    new_cachep=cache_from_memcg_idx(cachep,idx)
    // new_cachep==NULL => proceed to creation

                                                s->memcg_params->memcg_caches[i]
                                                    =cur_params->memcg_caches[i]

    // kmem_cache_create_memcg takes slab_mutex
    // so we will hang around until
    // memcg_update_cache_size finishes, but
    // nothing will prevent it from succeeding so
    // memcg_caches[idx] will be overwritten in
    // memcg_register_cache!

    new_cachep = kmem_cache_create_memcg(...)
    mutex_unlock(&memcg_cache_mutex)

    Let's fix this by moving the check for existence of the memcg cache to
    kmem_cache_create_memcg() to be called under the slab_mutex and make it
    return NULL if so.
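
    The creation-side fix amounts to re-checking under slab_mutex, roughly
    like this (a sketch with placeholder helpers, not the literal patch):

    struct cache;

    static void lock_slab_mutex(void);
    static void unlock_slab_mutex(void);
    static struct cache *lookup_memcg_cache(struct cache *root, int memcg_id);
    static struct cache *do_create_memcg_cache(struct cache *root, int memcg_id);

    static struct cache *create_memcg_cache(struct cache *root, int memcg_id)
    {
            struct cache *s = NULL;

            lock_slab_mutex();
            /* Re-check under slab_mutex: a concurrent creator, or a
             * memcg_caches relocation, may already have installed a cache
             * for this memcg. */
            if (!lookup_memcg_cache(root, memcg_id))
                    s = do_create_memcg_cache(root, memcg_id);
            unlock_slab_mutex();

            return s;       /* NULL: already existed, or creation failed */
    }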

    A similar race is possible when destroying a memcg cache (see
    kmem_cache_destroy()). Since memcg_unregister_cache(), which clears the
    pointer in the memcg_caches array, is called w/o protection, we can race
    with memcg_update_cache_size() and omit clearing the pointer. Therefore
    memcg_unregister_cache() should be moved before we release the
    slab_mutex.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we have rather a messy function set relating to per-memcg
    kmem cache initialization/destruction.

    Per-memcg caches are created in memcg_create_kmem_cache(). This
    function calls kmem_cache_create_memcg() to allocate and initialize a
    kmem cache and then "registers" the new cache in the
    memcg_params::memcg_caches array of the parent cache.

    During its work-flow, kmem_cache_create_memcg() executes the following
    memcg-related functions:

    - memcg_alloc_cache_params(), to initialize memcg_params of the newly
    created cache;
    - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
    list.

    On the other hand, kmem_cache_destroy() called on a cache destruction
    only calls memcg_release_cache(), which does all the work: it cleans the
    reference to the cache in its parent's memcg_params::memcg_caches,
    removes the cache from the memcg_slab_caches list, and frees
    memcg_params.

    Such an inconsistency between the destruction and initialization paths
    makes the code difficult to read, so let's clean this up a bit.

    This patch moves all the code relating to registration of per-memcg
    caches (adding to memcg list, setting the pointer to a cache from its
    parent) to the newly created memcg_register_cache() and
    memcg_unregister_cache() functions, making the initialization and
    destruction paths look symmetrical.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We do not free the cache's memcg_params if __kmem_cache_create fails.
    Fix this.

    Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
    because it does not actually register the cache anywhere, but simply
    initializes kmem_cache::memcg_params.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    Currently kmem_cache_create_memcg() backs off on failure inside
    conditionals, without using gotos. This duplicates the rollback code and
    makes the function look cumbersome, even though on error we only need to
    free the allocated cache. Since the next patch will add yet another
    rollback function call to the error path, let's use labels instead of
    conditionals to undo any changes on failure and keep things clean.
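
    In other words, the error handling converges on the usual goto-unwind
    pattern, sketched here with placeholder helpers:

    struct cache;

    static struct cache *alloc_cache(void);
    static int setup_memcg_params(struct cache *s);
    static int do_create(struct cache *s);
    static void free_memcg_params(struct cache *s);
    static void free_cache(struct cache *s);

    static struct cache *create_cache(void)
    {
            struct cache *s;

            s = alloc_cache();
            if (!s)
                    goto out;
            if (setup_memcg_params(s))
                    goto out_free_cache;
            if (do_create(s))
                    goto out_free_params;
            return s;

    out_free_params:
            free_memcg_params(s);
    out_free_cache:
            free_cache(s);
    out:
            return NULL;
    }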

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Pekka Enberg
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Nov, 2013

1 commit


28 Sep, 2013

1 commit

    SLUB can alias multiple kmem_cache_create() requests to one slab cache to
    save memory and increase cache hotness. As a result the name of the slab
    can be stale. Only check the name for duplicates if we are in debug mode,
    where we do not merge multiple caches.

    This fixes the following problem reported by Jonathan Brassow:

    The problem with kmem_cache* is this:

    *) Assume CONFIG_SLUB is set
    1) kmem_cache_create(name="foo-a")
    - creates new kmem_cache structure
    2) kmem_cache_create(name="foo-b")
    - If identical cache characteristics, it will be merged with the previously
    created cache associated with "foo-a". The cache's refcount will be
    incremented and an alias will be created via sysfs_slab_alias().
    3) kmem_cache_destroy()
    - Attempting to destroy cache associated with "foo-a", but instead the
    refcount is simply decremented. I don't even think the sysfs aliases are
    ever removed...
    4) kmem_cache_create(name="foo-a")
    - This FAILS because kmem_cache_sanity_check collides with the existing
    name ("foo-a") associated with the non-removed cache.

    This is a problem for RAID (specifically dm-raid) because the name used
    for the kmem_cache_create is ("raid%d-%p", level, mddev). If the cache
    persists for long enough, the memory address of an old mddev will be
    reused for a new mddev - causing an identical formulation of the cache
    name. Even though kmem_cache_destroy had long ago been used to delete
    the old cache, the merging of caches has caused the name and cache of
    that old instance to be preserved and causes a collision (and thus
    failure) in kmem_cache_create(). I see this regularly in my testing.
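
    Conceptually, the duplicate-name check becomes debug-only (a sketch; the
    way debug mode is detected here is an assumption):

    /* Walks the cache list looking for an existing cache with this name. */
    static int name_collides(const char *name);

    static int sanity_check_name(const char *name, int debug_no_merge)
    {
            /* With merging enabled, a stale name is expected and harmless;
             * only when caches are never merged is a duplicate a real bug. */
            if (debug_no_merge && name_collides(name))
                    return -1;
            return 0;
    }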

    Reported-by: Jonathan Brassow
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

05 Sep, 2013

1 commit

    The kmalloc* functions of all slab allocators are similar now, so let's
    move them into slab.h. This requires some function naming changes in
    slob.

    As a result of this patch there is a common set of functions for all
    allocators. It also means that kmalloc_large() is now available in
    general to perform large-order allocations that go directly via the page
    allocator. kmalloc_large() can be substituted if kmalloc() throws
    warnings because of too-large allocations.

    kmalloc_large() has exactly the same semantics as kmalloc() but can only
    be used for allocations larger than PAGE_SIZE.
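
    A hedged usage sketch (kernel-style; assumes a sleepable context where
    GFP_KERNEL is appropriate):

    #include <linux/slab.h>

    static void *grab_buffer(size_t size)
    {
            /* kmalloc_large() goes straight to the page allocator and, as
             * noted above, may only be used for sizes above PAGE_SIZE. */
            if (size > PAGE_SIZE)
                    return kmalloc_large(size, GFP_KERNEL);

            return kmalloc(size, GFP_KERNEL);
    }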

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

14 Aug, 2013

1 commit


15 Jul, 2013

1 commit

  • Pull slab update from Pekka Enberg:
    "Highlights:

    - Fix for boot-time problems on some architectures due to
    init_lock_keys() not respecting kmalloc_caches boundaries
    (Christoph Lameter)

    - CONFIG_SLUB_CPU_PARTIAL requested by RT folks (Joonsoo Kim)

    - Fix for excessive slab freelist draining (Wanpeng Li)

    - SLUB and SLOB cleanups and fixes (various people)"

    I ended up editing the branch, and this avoids two commits at the end
    that were immediately reverted, and I instead just applied the oneliner
    fix in between myself.

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
    slub: Check for page NULL before doing the node_match check
    mm/slab: Give s_next and s_stop slab-specific names
    slob: Check for NULL pointer before calling ctor()
    slub: Make cpu partial slab support configurable
    slab: add kmalloc() to kernel API documentation
    slab: fix init_lock_keys
    slob: use DIV_ROUND_UP where possible
    slub: do not put a slab to cpu partial list when cpu_partial is 0
    mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo
    mm/slub: Drop unnecessary nr_partials
    mm/slab: Fix /proc/slabinfo unwriteable for slab
    mm/slab: Sharing s_next and s_stop between slab and slub
    mm/slab: Fix drain freelist excessively
    slob: Rework #ifdeffery in slab.h
    mm, slab: moved kmem_cache_alloc_node comment to correct place

    Linus Torvalds
     

08 Jul, 2013

1 commit


07 Jul, 2013

2 commits


13 Jun, 2013

1 commit

    Sasha Levin noticed that the warning introduced by commit 6286ae9
    ("slab: Return NULL for oversized allocations") is being triggered:

    WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0()
    can: request_module (can-proto-4) failed.
    mpoa: proc_mpc_write: could not parse ''
    Modules linked in:
    CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W 3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2
    0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000
    ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0
    0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xb0
    [] warn_slowpath_null+0x15/0x20
    [] kmalloc_slab+0x2f/0xb0
    [] __kmalloc+0x24/0x4b0
    [] ? security_capable+0x13/0x20
    [] ? pipe_fcntl+0x107/0x210
    [] pipe_fcntl+0x107/0x210
    [] ? fget_raw_light+0x130/0x3f0
    [] SyS_fcntl+0x60b/0x6a0
    [] tracesys+0xe1/0xe6

    Andrew Morton writes:

    __GFP_NOWARN is frequently used by kernel code to probe for "how big
    an allocation can I get". That's a bit lame, but it's used on slow
    paths and is pretty simple.

    However, SLAB should still spew a warning when a big allocation happens
    and the __GFP_NOWARN flag is _not_ set, so that kernel bugs are still
    exposed.
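
    The resulting pattern in the size check is roughly (a sketch of the idea,
    not necessarily the exact hunk):

    /* Warn about an oversized request only when the caller did not ask to
     * be quiet; __GFP_NOWARN callers are merely probing for the limit. */
    if (unlikely(size > KMALLOC_MAX_SIZE)) {
            WARN_ON_ONCE(!(flags & __GFP_NOWARN));
            return NULL;
    }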

    Signed-off-by: Sasha Levin
    [ penberg@kernel.org: improve changelog ]
    Signed-off-by: Pekka Enberg

    Sasha Levin
     

09 May, 2013

1 commit

  • Commit 8a965b3baa89 ("mm, slab_common: Fix bootstrap creation of kmalloc
    caches") introduced a regression that caused us to crash early during
    boot. The commit introduced an ordering of slab creation, making sure the
    two odd-sized slabs were created after specific power-of-two sizes.

    But if any of the power-of-two slabs were created earlier during boot,
    the slabs at index 1 or 2 might not get created at all. This patch makes
    sure none of the slabs get skipped.

    Tony Lindgren bisected this down to the offending commit, which really
    helped because bisect kept bringing me to almost but not quite this one.

    Signed-off-by: Chris Mason
    Acked-by: Christoph Lameter
    Acked-by: Tony Lindgren
    Acked-by: Soren Brinkmann
    Tested-by: Tetsuo Handa
    Tested-by: Konrad Rzeszutek Wilk
    Signed-off-by: Linus Torvalds

    Chris Mason
     

07 May, 2013

1 commit

    For SLAB the kmalloc caches must be created in ascending size order for
    the OFF_SLAB sub-slab cache to work properly.

    Create the non-power-of-two caches immediately after the prior
    power-of-two kmalloc cache. Do not create the non-power-of-two caches
    before all other caches.

    Reported-and-tested-by: Tetsuo Handa
    Signed-off-by: Christoph Lameter
    Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

06 May, 2013

1 commit

  • The inline path seems to have changed the SLAB behavior for very large
    kmalloc allocations with commit e3366016 ("slab: Use common
    kmalloc_index/kmalloc_size functions"). This patch restores the old
    behavior but also adds diagnostics so that we can figure where in the
    code these large allocations occur.

    Reported-and-tested-by: Tetsuo Handa
    Signed-off-by: Christoph Lameter
    Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp
    [ penberg@kernel.org: use WARN_ON_ONCE ]
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

07 Feb, 2013

1 commit

  • commit "slab: Common Kmalloc cache determination" made mistake
    in kmalloc_slab(). SLAB_CACHE_DMA is for kmem_cache creation,
    not for allocation. For allocation, we should use GFP_XXX to identify
    type of allocation. So, change SLAB_CACHE_DMA to GFP_DMA.
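
    That is, the lookup keys off the caller's gfp flags, roughly (the array
    names here reflect that era's slab_common.c and are otherwise an
    assumption):

    /* DMA-suitability is a property of this allocation, not of the cache
     * creation flags, so test GFP_DMA from the caller. */
    if (unlikely(flags & GFP_DMA))
            return kmalloc_dma_caches[index];

    return kmalloc_caches[index];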

    Acked-by: Christoph Lameter
    Reported-by: Fengguang Wu
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
     

01 Feb, 2013

4 commits


19 Dec, 2012

5 commits

  • SLAB allows us to tune a particular cache behavior with tunables. When
    creating a new memcg cache copy, we'd like to preserve any tunables the
    parent cache already had.

    This could be done by an explicit call to do_tune_cpucache() after the
    cache is created. But this is not very convenient now that the caches are
    created from common code, since this function is SLAB-specific.

    Another method of doing that is taking advantage of the fact that
    do_tune_cpucache() is always called from enable_cpucache(), which is
    called at cache initialization. We can just preset the values, and then
    things work as expected.

    It can also happen that a root cache has its tunables updated during
    normal system operation. In this case, we will propagate the change to
    all caches that are already active.

    This change requires us to move the assignment of root_cache in
    memcg_params a bit earlier. We need it to be already set - which
    memcg_kmem_register_cache() will do - by the time we reach
    __kmem_cache_create().

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • When we create caches in memcgs, we need to display their usage
    information somewhere. We'll adopt a scheme similar to /proc/meminfo,
    with aggregate totals shown in the global file, and per-group information
    stored in the group itself.

    For the time being, only reads are allowed in the per-group cache.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • This enables us to remove all the children of a kmem_cache being
    destroyed, if for example the kernel module it's being used in gets
    unloaded. Otherwise, the children will still point to the destroyed
    parent.

    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Every cache that is considered a root cache (basically the "original"
    caches, tied to the root memcg/no-memcg) will have an array that should be
    large enough to store a cache pointer per each memcg in the system.

    Theoretically, this is as high as 1 << sizeof(css_id), which is currently
    in the 64k pointers range. Most of the time, we won't be using that much.

    What goes in this patch, is a simple scheme to dynamically allocate such
    an array, in order to minimize memory usage for memcg caches. Because we
    would also like to avoid allocations all the time, at least for now, the
    array will only grow. It will tend to be big enough to hold the maximum
    number of kmem-limited memcgs ever achieved.

    We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have
    more than that, we'll start doubling the size of this array every time the
    limit is reached.

    Because we are only considering kmem limited memcgs, a natural point for
    this to happen is when we write to the limit. At that point, we already
    have set_limit_mutex held, so that will become our natural synchronization
    mechanism.
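
    The growth policy can be modelled in a few lines (a userspace sketch of
    the scheme described above; the constant name is illustrative):

    #define MEMCG_CACHES_MIN_SIZE 64

    /* Called with the limit-setting mutex held, which is all the
     * synchronization this needs. Returns the array size to use once
     * 'nr_ids' kmem-limited memcgs exist; the size only ever grows. */
    static int memcg_caches_array_size(int nr_ids, int cur_size)
    {
            int size = cur_size > MEMCG_CACHES_MIN_SIZE ?
                       cur_size : MEMCG_CACHES_MIN_SIZE;

            while (size < nr_ids)
                    size *= 2;

            return size;
    }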

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
    Allow a memcg parameter to be passed during cache creation. When the slub
    allocator is being used, it will only merge caches that belong to the
    same memcg. We'll do this by scanning the global list, and then
    translating the cache to a memcg-specific cache.

    A default function is created as a wrapper, passing NULL to the memcg
    version. We only merge caches that belong to the same memcg.

    A helper, memcg_css_id, is provided because slub needs a unique cache
    name for sysfs. Since this is visible, but not the canonical location for
    slab data, the actual cache name is not used; the css_id should suffice.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     

11 Dec, 2012

2 commits


31 Oct, 2012

1 commit

  • Some flags are used internally by the allocators for management
    purposes. One example of that is the CFLGS_OFF_SLAB flag that slab uses
    to mark that the metadata for that cache is stored outside of the slab.

    No cache should ever pass those as a creation flags. We can just ignore
    this bit if it happens to be passed (such as when duplicating a cache in
    the kmem memcg patches).

    Because such flags can vary from allocator to allocator, we allow them
    to make their own decisions on that, defining SLAB_AVAILABLE_FLAGS with
    all flags that are valid at creation time. Allocators that don't have
    any specific flag requirements should define it to mean all flags.

    Common code will mask out all flags not belonging to that set.
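
    On the common-code side this boils down to a mask applied at creation
    time, e.g.:

    /* Keep only the flags this allocator declares valid for
     * kmem_cache_create(); internal bits such as CFLGS_OFF_SLAB are
     * silently dropped. */
    flags &= SLAB_AVAILABLE_FLAGS;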

    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Glauber Costa
    Signed-off-by: Pekka Enberg

    Glauber Costa
     

24 Oct, 2012

3 commits

  • With all the infrastructure in place, we can now have slabinfo_show
    done from slab_common.c. A cache-specific function is called to grab
    information about the cache itself, since that is still heavily
    dependent on the implementation. But with the values produced by it, all
    the printing and handling is done from common code.
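
    The split can be pictured as a small callback contract (a sketch of the
    shape only; field and function names are illustrative):

    struct cache;                           /* opaque stand-in */

    /* Filled in by the allocator-specific side. */
    struct slabinfo_stats {
            unsigned long active_objs;
            unsigned long num_objs;
            unsigned long active_slabs;
            unsigned long num_slabs;
            unsigned int  objects_per_slab;
    };

    /* Implemented separately by SLAB and SLUB; common code formats the
     * /proc/slabinfo lines from these numbers. */
    void get_cache_stats(struct cache *s, struct slabinfo_stats *st);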

    Signed-off-by: Glauber Costa
    CC: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     
    The header format is highly similar between slab and slub. The main
    difference lies in the fact that slab may optionally have statistics
    added here in the CONFIG_DEBUG_SLAB case, while slub sticks them
    somewhere else.

    By making sure that information conditionally lives inside a
    globally-visible CONFIG_DEBUG_SLAB switch, we can move the header
    printing to a common location.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     
    This patch moves all the common slabinfo-processing machinery to
    slab_common.c. We could do better by noticing that the output is largely
    common and having the allocators just provide the finished per-cache
    information, but that is easier to do after this first step.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    CC: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     

10 Oct, 2012

1 commit

  • Commit 1331e7a1bbe1 ("rcu: Remove _rcu_barrier() dependency on
    __stop_machine()") introduced slab_mutex -> cpu_hotplug.lock dependency
    through kmem_cache_destroy() -> rcu_barrier() -> _rcu_barrier() ->
    get_online_cpus().

    Lockdep thinks that this might actually result in ABBA deadlock,
    and reports it as below:

    === [ cut here ] ===
    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc5-00004-g0d8ee37 #143 Not tainted
    -------------------------------------------------------
    kworker/u:2/40 is trying to acquire lock:
    (rcu_sched_state.barrier_mutex){+.+...}, at: [] _rcu_barrier+0x26/0x1e0

    but task is already holding lock:
    (slab_mutex){+.+.+.}, at: [] kmem_cache_destroy+0x45/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (slab_mutex){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] cpuup_callback+0x2f/0xbe
    [] notifier_call_chain+0x93/0x140
    [] __raw_notifier_call_chain+0x9/0x10
    [] _cpu_up+0xba/0x14e
    [] cpu_up+0xbc/0x117
    [] smp_init+0x6b/0x9f
    [] kernel_init+0x147/0x1dc
    [] kernel_thread_helper+0x4/0x10

    -> #1 (cpu_hotplug.lock){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] get_online_cpus+0x37/0x50
    [] _rcu_barrier+0xbb/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] deactivate_locked_super+0x49/0x90
    [] deactivate_super+0x61/0x70
    [] mntput_no_expire+0x127/0x180
    [] sys_umount+0x6e/0xd0
    [] system_call_fastpath+0x16/0x1b

    -> #0 (rcu_sched_state.barrier_mutex){+.+...}:
    [] check_prev_add+0x3de/0x440
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] _rcu_barrier+0x26/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] kmem_cache_destroy+0xd1/0xe0
    [] nf_conntrack_cleanup_net+0xe4/0x110 [nf_conntrack]
    [] nf_conntrack_cleanup+0x2a/0x70 [nf_conntrack]
    [] nf_conntrack_net_exit+0x5e/0x80 [nf_conntrack]
    [] ops_exit_list+0x39/0x60
    [] cleanup_net+0xfb/0x1b0
    [] process_one_work+0x26b/0x4c0
    [] worker_thread+0x12e/0x320
    [] kthread+0x9e/0xb0
    [] kernel_thread_helper+0x4/0x10

    other info that might help us debug this:

    Chain exists of:
    rcu_sched_state.barrier_mutex --> cpu_hotplug.lock --> slab_mutex

    Possible unsafe locking scenario:

    CPU0                                    CPU1
    ----                                    ----
    lock(slab_mutex);
                                            lock(cpu_hotplug.lock);
                                            lock(slab_mutex);
    lock(rcu_sched_state.barrier_mutex);

    *** DEADLOCK ***
    === [ cut here ] ===

    This is actually a false positive. Lockdep has no way of knowing the fact
    that the ABBA can actually never happen, because of special semantics of
    cpu_hotplug.refcount and its handling in cpu_hotplug_begin(); the mutual
    exclusion there is not achieved through mutex, but through
    cpu_hotplug.refcount.

    The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin()
    until everyone who called get_online_cpus() will call put_online_cpus()"
    semantics is totally invisible to lockdep.

    This patch therefore moves the unlock of slab_mutex so that rcu_barrier()
    is being called with it unlocked. It has two advantages:

    - it slightly reduces the hold time of slab_mutex; as it's used to protect
    the cachep list, it's not necessary to hold it over the kmem_cache_free()
    call any more
    - it silences the lockdep false-positive warning, as it avoids lockdep ever
    learning about the slab_mutex -> cpu_hotplug.lock dependency
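
    Schematically, the destruction path becomes (placeholder helpers only):

    struct cache;

    static void lock_slab_mutex(void);
    static void unlock_slab_mutex(void);
    static void unlink_cache(struct cache *s);       /* remove from cachep list */
    static void rcu_barrier_stub(void);              /* stands in for rcu_barrier() */
    static void release_cache(struct cache *s);

    static void destroy_cache(struct cache *s)
    {
            lock_slab_mutex();
            unlink_cache(s);
            unlock_slab_mutex();    /* drop slab_mutex before the barrier ... */

            rcu_barrier_stub();     /* ... so the barrier (which ends up taking
                                     * cpu_hotplug.lock) never nests inside
                                     * slab_mutex */
            release_cache(s);
    }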

    Reviewed-by: Paul E. McKenney
    Reviewed-by: Srivatsa S. Bhat
    Acked-by: David Rientjes
    Signed-off-by: Jiri Kosina
    Signed-off-by: Pekka Enberg

    Jiri Kosina