14 Dec, 2014

2 commits

  • If we fail to allocate from the current node's stock, we look for free
    objects on other nodes before calling the page allocator (see
    get_any_partial). While checking other nodes we respect cpuset
    constraints by calling cpuset_zone_allowed, enforcing the hardwall
    check. As a result, we fall back to the page allocator even if there are
    free objects cached on other nodes, whenever those nodes are not set in
    the current cpuset. However, the page allocator uses the softwall check
    for kernel allocations, so it may allocate from one of those other nodes
    in this case.

    Therefore we should use the softwall cpuset check in get_any_partial to
    conform with the cpuset check in the page allocator.
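
    The difference boils down to which gfp mask reaches the cpuset check in
    get_any_partial() (sketch only, not the exact upstream diff):

        /* hardwall check: rejects nodes outside the current cpuset */
        cpuset_zone_allowed(zone, flags | __GFP_HARDWALL)

        /* softwall check: the same test the page allocator applies to
         * kernel allocations */
        cpuset_zone_allowed(zone, flags)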

    Signed-off-by: Vladimir Davydov
    Acked-by: Zefan Li
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Suppose task @t, which belongs to a memory cgroup @memcg, is going to
    allocate an object from kmem cache @c. The copy of @c corresponding to
    @memcg, @mc, is empty. Then, if kmem_cache_alloc races with the memory
    cgroup's destruction, we can end up accessing the memory cgroup's copy
    of the cache after it has been destroyed:

    CPU0                                    CPU1
    ----                                    ----
    [ current=@t
      @mc->memcg_params->nr_pages=0 ]

    kmem_cache_alloc(@c):
      call memcg_kmem_get_cache(@c);
      proceed to allocation from @mc:
        alloc a page for @mc:
          ...
                                            move @t from @memcg
                                            destroy @memcg:
                                              mem_cgroup_css_offline(@memcg):
                                                memcg_unregister_all_caches(@memcg):
                                                  kmem_cache_destroy(@mc)

          add page to @mc

    We could fix this issue by taking a reference to a per-memcg cache, but
    that would require adding a per-cpu reference counter to per-memcg caches,
    which would look cumbersome.

    Instead, let's take a reference to the memory cgroup, which already has a
    per-cpu reference counter, at the beginning of kmem_cache_alloc and drop
    it at the end, and move per-memcg cache destruction from css offline to
    css free. As a side effect, per-memcg caches will be destroyed not one by
    one, but all at once when the last page accounted to the memory cgroup is
    freed. That doesn't sound like a high price to pay for code readability,
    though.

    Note that this patch does add some overhead to the kmem_cache_alloc hot
    path, but it is pretty negligible - just a function call plus a per-cpu
    counter decrement, comparable to what we already have in
    memcg_kmem_get_cache. Besides, it is only relevant if there are memory
    cgroups with kmem accounting enabled. I don't think we can handle this
    race without it, because alloc_page called from kmem_cache_alloc may
    sleep, so we can't flush all pending kmallocs without reference counting.
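
    In outline, the scheme described above looks like this (sketch; the
    put-side helper name is illustrative, not necessarily the exact upstream
    API):

        /* allocation entry: pick the per-memcg cache and, as part of that,
         * take a reference to the memcg's css (per-cpu, hence cheap) */
        s = memcg_kmem_get_cache(s, gfpflags);

        /* ... allocate the object from s ... */

        /* allocation exit: drop the reference taken above, so the memcg
         * (and, at css-free time, its caches) can finally go away */
        memcg_kmem_put_cache(s);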

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

3 commits

  • The code goes BUG, but doesn't tell us which bits were unexpectedly set.
    Print that out.

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Adding __printf(3, 4) to slab_err exposed following:

    mm/slub.c: In function `check_slab':
    mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
    s->name, page->objects, maxobj);
    ^
    mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
    mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
    s->name, page->inuse, page->objects);
    ^
    mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]

    mm/slub.c: In function `on_freelist':
    mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
    "should be %d", page->objects, max_objects);

    Fix the first two warnings by removing the redundant s->name argument.
    Fix the last one by changing the type of max_objects from unsigned long
    to int.
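
    For example, the first fix is simply dropping the extra argument so that
    the arguments line up with the %u conversions (sketch; the format string
    is approximate):

        /* before: s->name has no matching conversion in the format string */
        slab_err(s, page, "objects %u > max %u", s->name, page->objects, maxobj);

        /* after */
        slab_err(s, page, "objects %u > max %u", page->objects, maxobj);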

    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Some code in mm/slab.c and mm/slub.c uses spaces for indentation.
    Clean it up.

    Signed-off-by: LQYMGT
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    LQYMGT
     

27 Oct, 2014

1 commit

  • The current cpuset API for checking whether a zone/node is allowed to
    allocate from looks rather awkward. We have hardwall and softwall
    versions of cpuset_node_allowed, with the softwall version doing
    literally the same as the hardwall version if __GFP_HARDWALL is passed
    to it in the gfp flags. If it isn't, the softwall version may check the
    given node against the enclosing hardwall cpuset, which it needs to take
    the callback lock to do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.
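
    Roughly, the API change described (declarations only, shown as a sketch
    of that era's interface):

        /* before: two variants of each check */
        int cpuset_node_allowed_softwall(int node, gfp_t gfp_mask);
        int cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask);

        /* after: a single check; passing __GFP_HARDWALL in gfp_mask selects
         * the stricter hardwall behaviour */
        int cpuset_node_allowed(int node, gfp_t gfp_mask);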

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

10 Oct, 2014

3 commits

  • Slab merging is a good feature for reducing fragmentation. Currently it
    is only applied to SLUB, but it would be good to apply it to SLAB as
    well. This patch is a preparation step for applying slab merging to SLAB
    by commonizing the slab merge logic.

    Signed-off-by: Joonsoo Kim
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Update the SLUB code to search for partial slabs on the nearest node with
    memory in the presence of memoryless nodes. Additionally, do not consider
    it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a
    memoryless-node specified allocation goes off-node.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Han Pingtian
    Cc: Pekka Enberg
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Christoph Lameter
    Cc: Wanpeng Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Tracing of mergeable slabs as well as uses of failslab are confusing since
    the objects of multiple slab caches will be affected. Moreover this
    creates a situation where a mergeable slab will become unmergeable.

    If tracing or failslab testing is desired then it may be best to switch
    merging off for starters.

    Signed-off-by: Christoph Lameter
    Tested-by: WANG Chao
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

07 Aug, 2014

8 commits

  • This function is never called for memcg caches, because they are
    unmergeable, so remove the dead code.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We mark some slab caches (e.g. kmem_cache_node) as unmergeable by
    setting their refcount to -1; their alias count should then be 0, not
    refcount - 1, so correct it here.

    Signed-off-by: Gu Zheng
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     
  • The return statement goes with the cmpxchg_double() condition so it needs
    to be indented another tab.

    Also these days the fashion is to line function parameters up, and it
    looks nicer that way because then the "freelist_new" is not at the same
    indent level as the "return 1;".
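
    That is, the shape being described is roughly (illustrative):

        if (cmpxchg_double(&page->freelist, &page->counters,
                           freelist_old, counters_old,
                           freelist_new, counters_new))
                return 1;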

    Signed-off-by: Dan Carpenter
    Signed-off-by: Pekka Enberg
    Signed-off-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • When a kmem_cache is created with a ctor, each object in the kmem_cache
    is initialized before it is ready to use. In the slub implementation,
    however, the first object is initialized twice.

    This patch removes the duplicate initialization of the first object.

    Fixes commit 7656c72b ("SLUB: add macros for scanning objects in a
    slab").

    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • There are two versions of alloc/free hooks now - one for
    CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.

    I see no reason why calls to other debugging subsystems (LOCKDEP,
    DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
    All of these features should work regardless of the SLUB_DEBUG config,
    as they already have their own Kconfig options.

    This also fixes failslab for the CONFIG_SLUB_DEBUG=n configuration. It
    simply did not work before, because the should_failslab() call was in a
    hook hidden under "#ifdef CONFIG_SLUB_DEBUG #else".

    Note: there is one concealed change in the allocation path for
    SLUB_DEBUG=n with all other debugging features disabled. The
    might_sleep_if() call can generate some code even if
    DEBUG_ATOMIC_SLEEP=n. With PREEMPT_VOLUNTARY=y, might_sleep() inserts a
    _cond_resched() call, but I think that should be OK.
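
    The unified pre-allocation hook then looks roughly like this, with the
    debug calls no longer hidden under CONFIG_SLUB_DEBUG (simplified
    sketch):

        static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
        {
                flags &= gfp_allowed_mask;
                lockdep_trace_alloc(flags);
                might_sleep_if(flags & __GFP_WAIT);

                /* now effective even with CONFIG_SLUB_DEBUG=n */
                return should_failslab(s->object_size, flags, s->flags);
        }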

    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • resiliency_test() is only called for bootstrap, so it may be moved to
    init.text and freed after boot.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Make use of the new node functions in mm/slab.h to reduce code size and
    simplify.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Lameter
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The patchset provides two new functions in mm/slab.h and modifies SLAB
    and SLUB to use these. The kmem_cache_node structure is shared between
    both allocators and the use of common accessors will allow us to move
    more code into slab_common.c in the future.

    This patch (of 3):

    These functions allow us to eliminate repeated code in both SLAB and
    SLUB, and also allow for the insertion of debugging code that may be
    needed during development.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Acked-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

04 Jul, 2014

1 commit

  • min_partial is the minimum number of slabs cached on a node's partial
    list. If nr_partial is less than it, we keep a newly emptied slab on the
    node partial list rather than freeing it. But if nr_partial is equal to
    or greater than it, we already have enough partial slabs, so the newly
    emptied slab should be freed. The current implementation misses the
    equal case, so if min_partial is set to 0, at least one slab can still
    be cached. This is a critical problem for the kmemcg destruction logic,
    which doesn't work properly if any slabs are cached. This patch fixes
    the problem.
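
    The fix boils down to the comparison used when deciding whether to
    discard a newly emptied slab (sketch of the described change):

        /* was:  n->nr_partial >  s->min_partial  - keeps one slab cached
         *       even with min_partial == 0
         * now:  n->nr_partial >= s->min_partial  - free it once we already
         *       have enough partial slabs */
        if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
                goto slab_empty;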

    Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs
    immediately").

    Signed-off-by: Joonsoo Kim
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Jun, 2014

1 commit

  • Currently, if the allocation is constrained to NUMA_NO_NODE, we search
    for a partial slab on the numa_node_id() node. This doesn't work
    properly on a system with memoryless nodes, since such a node has no
    memory and therefore can have no partial slabs.

    On such a node, page allocation always falls back to numa_mem_id()
    first. So searching for a partial slab on numa_mem_id() in that case is
    the proper solution for the memoryless node case.
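
    In other words, the node used for the partial-slab search becomes
    (sketch of the described behaviour):

        /* NUMA_NO_NODE: search the nearest node that actually has memory,
         * which is where the page allocator would land anyway */
        int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;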

    Signed-off-by: Joonsoo Kim
    Acked-by: Nishanth Aravamudan
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Wanpeng Li
    Cc: Han Pingtian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

10 commits

  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().
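
    The conversion pattern is mechanical (the per-cpu variable below is
    hypothetical, for illustration only):

        DEFINE_PER_CPU(struct kmem_cache_cpu, example_cpu_slab);

        /* before: address calculation via __get_cpu_var */
        struct kmem_cache_cpu *c = &__get_cpu_var(example_cpu_slab);

        /* after */
        struct kmem_cache_cpu *c = this_cpu_ptr(&example_cpu_slab);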

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently we have two pairs of kmemcg-related functions that are called
    on slab alloc/free. The first is memcg_{bind,release}_pages, which
    counts the total number of pages allocated by a kmem cache. The second
    is memcg_{un}charge_slab, which {un}charges slab pages to the kmemcg
    resource counter. Let's just merge them to keep the code clean.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When we create a sl[au]b cache, we allocate kmem_cache_node structures
    for each online NUMA node. To handle nodes being taken online/offline,
    we register a memory hotplug notifier and allocate/free the
    kmem_cache_node corresponding to the node that changes its state for
    each kmem cache.

    To synchronize between the two paths, we hold the slab_mutex during both
    the cache creation/destruction path and while tuning the per-node parts
    of kmem caches in the memory hotplug handler. But that's not quite
    right, because it does not guarantee that a newly created cache will
    have all of its kmem_cache_nodes initialized if it races with memory
    hotplug. For instance, in the case of slub:

    CPU0                                    CPU1
    ----                                    ----
    kmem_cache_create:                      online_pages:
      __kmem_cache_create:                    slab_memory_callback:
                                                slab_mem_going_online_callback:
                                                  lock slab_mutex
                                                  for each slab_caches list entry
                                                    allocate kmem_cache node
                                                  unlock slab_mutex
        lock slab_mutex
        init_kmem_cache_nodes:
          for_each_node_state(node, N_NORMAL_MEMORY)
            allocate kmem_cache node
        add kmem_cache to slab_caches list
        unlock slab_mutex
                                            online_pages (continued):
                                              node_states_set_node

    As a result we'll end up with a kmem cache that has some of its
    kmem_cache_nodes unallocated.

    To avoid issues like that, we should hold get/put_online_mems() across
    the whole kmem cache creation/destruction/shrink paths, just as we do
    for cpu hotplug. This patch does the trick.

    Note that after it's applied, there is no need to take the slab_mutex
    for kmem_cache_shrink any more, so it is removed from there.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of the race described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves the purpose, but it's
    a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations on locking order, which are not
    desirable, and it can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except that they block memory, not cpu,
    hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
    myself, because it used an rw semaphore for get/put_online_mems, making
    them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of the online
    nodes mask can't proceed concurrently. Also, it imposes some strong
    locking-order rules, which narrows down the set of its usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing code inside
    a get/put_online_mems section guarantees a stable value of the online
    nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
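
    Usage mirrors the cpu-hotplug helpers (sketch):

        get_online_mems();
        /*
         * The set of online nodes and their present pages is stable here,
         * so e.g. walking N_MEMORY nodes to set up per-node cache parts is
         * safe against concurrent memory hotplug.
         */
        put_online_mems();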

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, to allocate a page that should be charged to kmemcg (e.g.
    threadinfo), we pass the __GFP_KMEMCG flag to the page allocator. The
    page allocated is then to be freed by free_memcg_kmem_pages. Apart from
    looking asymmetrical, this also requires intrusion into the general
    allocation path. So let's introduce separate functions that will
    alloc/free pages charged to kmemcg.

    The new functions are called alloc_kmem_pages and free_kmem_pages. They
    should be used when the caller would actually like to use kmalloc, but
    has to fall back to the page allocator because the allocation is large.
    They only differ from alloc_pages and free_pages in that, besides
    allocating or freeing pages, they also charge them to the kmem resource
    counter of the current memory cgroup.
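
    The declarations look roughly like this (reproduced from memory of that
    kernel era; treat as a sketch):

        struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order);
        struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
                                           unsigned int order);
        void free_kmem_pages(unsigned long addr, unsigned int order);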

    [sfr@canb.auug.org.au: export kmalloc_order() to modules]
    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem, so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to charge kmem there explicitly. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from memcg caches' allocflags. Instead, it makes the slab allocation
    path call memcg_charge_kmem directly, getting the memcg to charge from
    the cache's memcg params.

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment from memcg_kmem_get_cache.

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
    got bumped in that exit path. Now there are two, and a bunch of gotos.
    ALLOC_SLOWPATH can now get set more than once during a single call to
    __slab_alloc() which is pretty bogus. Here's the sequence:

    1. Enter __slab_alloc(), fall through all the way to the
       stat(s, ALLOC_SLOWPATH);
    2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
       new_slab (goto #1)
    3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
       (goto #2)
    4. Fall through in the same path we did before all the way to
       stat(s, ALLOC_SLOWPATH)
    5. bump ALLOC_REFILL stat, then return

    Doing this is obviously bogus. It keeps us from being able to
    accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
    that the total number of allocs always exceeds the total number of
    frees.

    This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same
    place that __slab_alloc() is. This makes it much less likely that
    ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
    __slab_alloc().
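
    The move lands the stat at __slab_alloc()'s single call site, so it is
    counted exactly once per slow-path allocation (sketch):

        if (unlikely(!object || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
                stat(s, ALLOC_SLOWPATH);        /* bumped here, once */
        }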

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When the slab or slub allocators cannot allocate additional slab pages,
    they emit diagnostic information to the kernel log such as current
    number of slabs, number of objects, active objects, etc. This is always
    coupled with a page allocation failure warning since it is controlled by
    !__GFP_NOWARN.

    Suppress this out-of-memory warning if the allocator is configured
    without debug support. The page allocation failure warning will still
    indicate that it is a failed slab allocation, along with the order and
    the gfp mask, so the extra slab diagnostics are only useful for
    diagnosing allocator issues.

    Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
    allocator, there is no functional change with this patch. If debug is
    disabled, however, the warnings are now suppressed.

    Signed-off-by: David Rientjes
    Cc: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Inspired by Joe Perches suggestion in ntfs logging clean-up.

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • All printk(KERN_foo ...) calls converted to pr_foo().

    Default printk converted to pr_warn().

    Coalesce format fragments.
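
    The conversion pattern, with illustrative messages:

        /* before */
        printk(KERN_ERR "SLUB: example message for %s\n", s->name);
        printk("SLUB: message with no explicit level\n");

        /* after */
        pr_err("SLUB: example message for %s\n", s->name);
        pr_warn("SLUB: message with no explicit level\n");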

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

07 May, 2014

2 commits

  • debugobjects warning during netfilter exit:

    ------------[ cut here ]------------
    WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984
    Workqueue: netns cleanup_net
    Call Trace:
    dump_stack+0x52/0x87
    warn_slowpath_common+0x8c/0xc0
    warn_slowpath_fmt+0x46/0x50
    debug_print_object+0x8d/0xb0
    __debug_check_no_obj_freed+0xa5/0x220
    debug_check_no_obj_freed+0x15/0x20
    kmem_cache_free+0x197/0x340
    kmem_cache_destroy+0x86/0xe0
    nf_conntrack_cleanup_net_list+0x131/0x170
    nf_conntrack_pernet_exit+0x5d/0x70
    ops_exit_list+0x5e/0x70
    cleanup_net+0xfb/0x1c0
    process_one_work+0x338/0x550
    worker_thread+0x215/0x350
    kthread+0xe7/0xf0
    ret_from_fork+0x7c/0xb0

    Also during dcookie cleanup:

    WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    warn_slowpath_common (kernel/panic.c:430)
    warn_slowpath_fmt (kernel/panic.c:445)
    debug_print_object (lib/debugobjects.c:262)
    __debug_check_no_obj_freed (lib/debugobjects.c:697)
    debug_check_no_obj_freed (lib/debugobjects.c:726)
    kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717)
    kmem_cache_destroy (mm/slab_common.c:363)
    dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343)
    event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153)
    __fput (fs/file_table.c:217)
    ____fput (fs/file_table.c:253)
    task_work_run (kernel/task_work.c:125 (discriminator 1))
    do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
    int_signal (arch/x86/kernel/entry_64.S:807)

    Sysfs has a release mechanism. Use that to release the kmem_cache
    structure if CONFIG_SYSFS is enabled.

    Only slub is changed - slab currently only supports /proc/slabinfo and
    not /sys/kernel/slab/*. We talked about adding that and someone was
    working on it.

    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more]
    Signed-off-by: Christoph Lameter
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Acked-by: Greg KH
    Cc: Thomas Gleixner
    Cc: Pekka Enberg
    Cc: Russell King
    Cc: Bart Van Assche
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After creating a cache for a memcg we should initialize its sysfs attrs
    with the values from its parent. That's what memcg_propagate_slab_attrs
    is for. Currently it's broken - we clearly muddled root-vs-memcg caches
    there. Let's fix it up.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 Apr, 2014

1 commit

  • Pull slab changes from Pekka Enberg:
    "The biggest change is byte-sized freelist indices which reduces slab
    freelist memory usage:

    https://lkml.org/lkml/2013/12/2/64"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: slab/slub: use page->list consistently instead of page->lru
    mm/slab.c: cleanup outdated comments and unify variables naming
    slab: fix wrongly used macro
    slub: fix high order page allocation problem with __GFP_NOFAIL
    slab: Make allocations with GFP_ZERO slightly more efficient
    slab: make more slab management structure off the slab
    slab: introduce byte sized index for the freelist of a slab
    slab: restrict the number of objects in a slab
    slab: introduce helper functions to get/set free object
    slab: factor out calculate nr objects in cache_estimate

    Linus Torvalds
     

08 Apr, 2014

6 commits

  • Statistics are not critical to the operation of the allocator, but they
    should also not cause too much overhead.

    When __this_cpu_inc is altered to check whether preemption is disabled,
    this code triggers the check. Use raw_cpu_inc to avoid the checks.
    Using this_cpu ops may cause interrupt disable/enable sequences on
    various arches, which may significantly impact allocator performance.
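
    The resulting stat helper looks roughly like this (sketch):

        static inline void stat(const struct kmem_cache *s, enum stat_item si)
        {
        #ifdef CONFIG_SLUB_STATS
                /*
                 * The read-modify-write is racy on a preemptible kernel, but
                 * occasionally misattributing an event to the wrong cpu is
                 * acceptable and far cheaper than disabling preemption or
                 * interrupts around every counter bump.
                 */
                raw_cpu_inc(s->cpu_slab->stat[si]);
        #endif
        }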

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Christoph Lameter
    Cc: Fengguang Wu
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The failure paths of sysfs_slab_add don't release the allocation of
    'name' made by create_unique_id() a few lines earlier. Create a common
    exit path to make it more obvious what needs freeing.

    [vdavydov@parallels.com: free the name only if !unmergeable]
    Signed-off-by: Dave Jones
    Signed-off-by: Vladimir Davydov
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Currently, we try to arrange sysfs entries for memcg caches in the same
    manner as for global caches. Apart from turning /sys/kernel/slab into a
    mess when there are a lot of kmem-active memcgs created, it actually
    does not work properly - we won't create more than one link to a memcg
    cache in case its parent is merged with another cache. For instance, if
    A is a root cache merged with another root cache B, we will have the
    following sysfs setup:

    X
    A -> X
    B -> X

    where X is some unique id (see create_unique_id()). Now if memcgs M and
    N start to allocate from cache A (or B, which is the same), we will get:

    X
    X:M
    X:N
    A -> X
    B -> X
    A:M -> X:M
    A:N -> X:N

    Since B is an alias for A, we won't get entries B:M and B:N, which is
    confusing.

    It is more logical to have entries for memcg caches under the
    corresponding root cache's sysfs directory. This would allow us to keep
    the sysfs layout clean and avoid inconsistencies like the one described
    above.

    This patch does the trick. It creates a "cgroup" kset in each root
    cache kobject to keep its children caches there.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Otherwise, kzalloc() called from a memcg won't clear the whole object.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When a kmem cache is created (kmem_cache_create_memcg()), we first try to
    find a compatible cache that already exists and can handle requests from
    the new cache, i.e. has the same object size, alignment, ctor, etc. If
    there is such a cache, we do not create any new caches, instead we simply
    increment the refcount of the cache found and return it.

    Currently we do this procedure not only when creating root caches, but
    also for memcg caches. However, there is no point in that, because, as
    every memcg cache has exactly the same parameters as its parent and cache
    merging cannot be turned off in runtime (only on boot by passing
    "slub_nomerge"), the root caches of any two potentially mergeable memcg
    caches should be merged already, i.e. it must be the same root cache, and
    therefore we couldn't even get to the memcg cache creation, because it
    already exists.

    The only exception is boot caches - they are explicitly forbidden to be
    merged by setting their refcount to -1. There are currently only two of
    them - kmem_cache and kmem_cache_node, which are used in slab internals
    (I do not count kmalloc caches, as their refcount is set to 1
    immediately after creation). Since they are prevented from merging in
    the first place, I guess we should avoid merging their children too.

    So let's remove the useless code responsible for merging memcg caches.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • slab_node() is actually a mempolicy function, so rename it to
    mempolicy_slab_node() to make it clearer that it is used for processes
    with mempolicies.

    At the same time, cleanup its code by saving numa_mem_id() in a local
    variable (since we require a node with memory, not just any node) and
    remove an obsolete comment that assumes the mempolicy is actually passed
    into the function.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Apr, 2014

1 commit

  • We release the slab_mutex while calling sysfs_slab_add from
    __kmem_cache_create since commit 66c4c35c6bc5 ("slub: Do not hold
    slub_lock when calling sysfs_slab_add()"), because kobject_uevent called
    by sysfs_slab_add might block waiting for the usermode helper to exec,
    which would result in a deadlock if we took the slab_mutex while
    executing it.

    However, apart from complicating synchronization rules, releasing the
    slab_mutex on kmem cache creation can result in a kmemcg-related race.
    The point is that we check whether the memcg cache exists before going
    to __kmem_cache_create, but register the new cache in the memcg
    subsystem after it. Since we can drop the mutex there, several threads
    can see that the memcg cache does not exist and proceed to creating it,
    which is wrong.

    Fortunately, recently kobject_uevent was patched to call the usermode
    helper with the UMH_NO_WAIT flag, making the deadlock impossible.
    Therefore there is no point in releasing the slab_mutex while calling
    sysfs_slab_add, so let's simplify kmem_cache_create synchronization and
    fix the kmemcg-race mentioned above by holding the slab_mutex during the
    whole cache creation path.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Greg KH
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov