14 Oct, 2014

1 commit

  • Commit bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache")
    changed the allocation method for the cpu cache array from the slab
    allocator to the percpu allocator. The percpu allocator must be given an
    alignment to return suitably aligned memory, but that commit mistakenly
    passed an alignment of 0, so the percpu allocator returns unaligned
    addresses. This causes no problem on x86, which permits unaligned access,
    but it breaks sparc64, which requires strict alignment.

    The following bug report came from David Miller.

    I'm getting tons of the following on sparc64:

    [603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
    [603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
    ...
    [603970.554394] log_unaligned: 333 callbacks suppressed
    ...

    This patch passes a proper alignment when allocating the cpu cache,
    fixing the unaligned memory accesses on sparc64.
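
    As a rough sketch of the fix (the call site is the cpu cache allocation
    helper in mm/slab.c; treat the surrounding context as illustrative), the
    point is simply to request pointer alignment instead of 0:

        size_t size;
        struct array_cache __percpu *cpu_cache;

        size = sizeof(void *) * entries + sizeof(struct array_cache);
        /* was __alloc_percpu(size, 0): an alignment of 0 lets the percpu
         * allocator hand back addresses that are not even pointer-aligned */
        cpu_cache = __alloc_percpu(size, sizeof(void *));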

    Reported-by: David Miller
    Tested-by: David Miller
    Tested-by: Meelis Roos
    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Oct, 2014

7 commits

  • Using __seq_open_private() removes boilerplate code from slabstats_open()

    The resultant code is shorter and easier to follow.

    This patch does not change any functionality.
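
    For reference, the simplified open routine looks roughly like this
    (slabstats_op is the existing seq_operations for /proc/slab_allocators;
    the private buffer size follows the original code):

        static int slabstats_open(struct inode *inode, struct file *file)
        {
                unsigned long *n;

                /* allocates, zeroes and attaches the private area in one call */
                n = __seq_open_private(file, &slabstats_op, PAGE_SIZE);
                if (!n)
                        return -ENOMEM;

                *n = PAGE_SIZE;
                return 0;
        }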

    Signed-off-by: Rob Jones
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Jones
     
  • Because of a chicken-and-egg problem, SLAB initialization is quite
    complicated: the cpu cache must be allocated through SLAB to make a
    kmem_cache work, but before the kmem_cache is initialized, allocating
    through SLAB is impossible.

    SLUB, on the other hand, initializes in a simpler way: it uses the percpu
    allocator for its cpu caches, so there is no chicken-and-egg problem.

    So this patch uses the percpu allocator in SLAB as well. This simplifies
    the initialization steps so that the SLAB code becomes easier to
    maintain.

    In my testing there is no performance difference.

    This implementation relies on the percpu allocator, which uses vmalloc
    address space, so on a *32 bit* kernel with many cpus this change could
    exhaust the vmalloc address space. Even in the worst case it can cover
    1024 cpus, by the following calculation:

    Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches *
    120 objects per cpu_cache = 140 MB
    Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) *
    80 objects per cpu_cache = 46 MB
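
    A minimal sketch of the resulting allocation helper (this mirrors the
    shape of alloc_kmem_cache_cpus() in mm/slab.c; error handling is trimmed,
    and the alignment argument is the one the 14 Oct fix above corrects):

        static struct array_cache __percpu *alloc_kmem_cache_cpus(
                        struct kmem_cache *cachep, int entries, int batchcount)
        {
                int cpu;
                size_t size;
                struct array_cache __percpu *cpu_cache;

                /* one array_cache header plus 'entries' object pointers per cpu */
                size = sizeof(void *) * entries + sizeof(struct array_cache);
                cpu_cache = __alloc_percpu(size, sizeof(void *));
                if (!cpu_cache)
                        return NULL;

                for_each_possible_cpu(cpu)
                        init_arraycache(per_cpu_ptr(cpu_cache, cpu),
                                        entries, batchcount);

                return cpu_cache;
        }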

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Jeremiah Mahler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Slab merge is a good feature for reducing fragmentation. If a newly
    created cache has a similar size and properties to an existing one, this
    feature reuses the existing cache rather than creating a new one. As a
    result, objects are packed into fewer slabs and fragmentation is reduced.

    Below are my test results.

    * After boot, sleep 20; cat /proc/meminfo | grep Slab

    Before: Slab: 25136 kB

    After:  Slab: 24364 kB

    This saves about 3% of the memory used by slab.

    To support this feature in SLAB, we need SLAB-specific implementations of
    kmem_cache_flags() and __kmem_cache_alias(), because SLUB performs some
    SLUB-specific processing related to debug flags and object size changes
    in these functions.
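
    For illustration, a hedged sketch of the SLAB-side __kmem_cache_alias():
    it leans on the common find_mergeable() helper and only bumps the
    reference count and object size of the cache it reuses (treat the body as
    an approximation of the actual mm/slab.c code):

        struct kmem_cache *
        __kmem_cache_alias(const char *name, size_t size, size_t align,
                           unsigned long flags, void (*ctor)(void *))
        {
                struct kmem_cache *cachep;

                cachep = find_mergeable(size, align, flags, name, ctor);
                if (cachep) {
                        cachep->refcount++;
                        /* clear the whole object on kzalloc of the merged cache */
                        cachep->object_size = max_t(int, cachep->object_size, size);
                }
                return cachep;
        }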

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • cache_free_alien() is rarely needed; it only does real work when the node
    of the object being freed does not match the local node. But it is
    defined with the inline attribute, so it is inlined into __cache_free(),
    the core free function of the slab allocator, and needlessly bloats
    kmem_cache_free()/kfree(). All we really need to inline is the node-match
    check, so this patch factors the rest of cache_free_alien() out of line
    to reduce the code size of kmem_cache_free()/kfree().

    Before:
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    After:
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    0000000000001110 00000000000001b5 T kfree
    0000000000000750 0000000000000181 T kmem_cache_free

    The text size is slightly reduced: 0x228->0x1b5 for kfree and
    0x216->0x181 for kmem_cache_free.
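
    A sketch of the resulting split (the shape follows mm/slab.c; the body of
    the out-of-line helper is elided here):

        static noinline int __cache_free_alien(struct kmem_cache *cachep,
                                               void *objp, int node,
                                               int page_node)
        {
                /* rare cross-node free: push into the alien cache (out of line) */
                return 1;
        }

        static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
        {
                int page_node = page_to_nid(virt_to_page(objp));
                int node = numa_mem_id();

                /* the only part worth inlining: is this a local-node free? */
                if (likely(node == page_node))
                        return 0;

                return __cache_free_alien(cachep, objp, node, page_node);
        }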

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The intention of __ac_put_obj() is that it should have no effect when
    sk_memalloc_socks() is disabled. But because __ac_put_obj() is so small,
    the compiler inlines it into ac_put_obj() and inflates the code size of
    the free path. This patch adds the noinline keyword to __ac_put_obj() so
    that it does not disturb the normal free path at all.

    nm -S slab-orig.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e80 00000000000002f5 t cache_alloc_refill
    0000000000001230 0000000000000258 T kfree
    0000000000000690 000000000000024c T kmem_cache_free

    nm -S slab-patched.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e00 00000000000002e5 t cache_alloc_refill
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    cache_alloc_refill: 0x2f5->0x2e5
    kfree: 0x258->0x228
    kmem_cache_free: 0x24c->0x216

    The code size of each function is reduced slightly.
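
    Roughly, the pattern looks like this (simplified: the real helpers also
    deal with encoding the pfmemalloc state into the object pointer):

        /* kept out of line so the rare pfmemalloc case doesn't bloat callers */
        static noinline void *__ac_put_obj(struct kmem_cache *cachep,
                                           struct array_cache *ac, void *objp)
        {
                /* pfmemalloc bookkeeping for swap-over-network, rarely taken */
                if (unlikely(PageSlabPfmemalloc(virt_to_head_page(objp))))
                        set_obj_pfmemalloc(&objp);
                return objp;
        }

        static inline void ac_put_obj(struct kmem_cache *cachep,
                                      struct array_cache *ac, void *objp)
        {
                if (unlikely(sk_memalloc_socks()))
                        objp = __ac_put_obj(cachep, ac, objp);

                ac->entry[ac->avail++] = objp;
        }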

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Due to a likely keyword, the compiled code of cache_flusharray() currently
    ends up in the unlikely text section. Although flushing is an uncommon
    case compared to freeing into the cpu cache, it is more common than
    free_block(), and yet free_block() is in the normal text section. This
    patch removes the likely keyword to fix this odd situation.
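
    The call site in question looks roughly like this (a sketch of the tail
    of the free fast path in mm/slab.c):

        if (ac->avail < ac->limit) {            /* was: likely(ac->avail < ac->limit) */
                STATS_INC_FREEHIT(cachep);
        } else {
                STATS_INC_FREEMISS(cachep);
                cache_flusharray(cachep, ac);   /* no longer pushed into unlikely text */
        }

        ac_put_obj(cachep, ac, objp);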

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we only track the caller when tracing or slab debugging is
    enabled. When they are disabled, we save the overhead of passing one
    extra argument by calling __kmalloc(_node)() instead, but that saving is
    marginal. Furthermore, the default slab allocator, SLUB, doesn't use this
    technique, so it seems fine to change this behaviour.

    After this change, we can turn CONFIG_DEBUG_SLAB on/off without a full
    kernel rebuild and remove some complicated '#if' definitions. That looks
    more beneficial to me.
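
    A hedged sketch of the resulting header-side definition (the shape of
    include/linux/slab.h after the change; previously the #define fell back
    to plain __kmalloc() unless debugging or tracing was configured):

        extern void *__kmalloc_track_caller(size_t size, gfp_t flags,
                                            unsigned long caller);
        #define kmalloc_track_caller(size, flags) \
                __kmalloc_track_caller(size, flags, _RET_IP_)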

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

28 Sep, 2014

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "This is quite late but these need to be backported anyway.

    This is the fix for a long-standing cpuset bug which has existed since
    2009. cpuset makes use of the PF_SPREAD_{PAGE|SLAB} flags to modify the
    task's memory allocation behavior according to the settings of the
    cpuset it belongs to; unfortunately, when those flags have to be
    changed, cpuset did so directly even while the target task is running,
    which is obviously racy as task->flags may be modified by the task
    itself at any time. This obscure bug manifested as a corrupt
    PF_USED_MATH flag leading to a weird crash.

    The bug is fixed by moving the flags to task->atomic_flags. The first
    two patches are preparatory ones to help define atomic_flags accessors
    and the third one is the actual fix"

    * 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
    sched: add macros to define bitops for task atomic flags
    sched: fix confusing PFA_NO_NEW_PRIVS constant

    Linus Torvalds
     

26 Sep, 2014

1 commit

  • Since commit 4590685546a3 ("mm/sl[aou]b: Common alignment code"), the
    "ralign" automatic variable in __kmem_cache_create() may be used
    uninitialized.

    The proper alignment defaults to BYTES_PER_WORD and can be overridden by
    SLAB_RED_ZONE or the alignment specified by the caller.

    This fixes https://bugzilla.kernel.org/show_bug.cgi?id=85031
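
    A minimal sketch of the intended behaviour described above (illustrative,
    not the exact diff): ralign starts from the word-size default and is only
    ever raised, never left indeterminate:

        size_t ralign = BYTES_PER_WORD;         /* default object alignment  */

        if (flags & SLAB_RED_ZONE)
                ralign = REDZONE_ALIGN;         /* red zoning needs more     */

        if (ralign < cachep->align)
                ralign = cachep->align;         /* caller-mandated alignment */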

    Signed-off-by: David Rientjes
    Reported-by: Andrei Elovikov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Sep, 2014

1 commit

  • When we change cpuset.memory_spread_{page,slab}, cpuset will flip the
    PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
    This should be done using atomic bitops, but currently it isn't, which
    is broken.

    Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
    when one thread tried to clear PF_USED_MATH while at the same time another
    thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
    the same task.

    Here's the full report:
    https://lkml.org/lkml/2014/9/19/230

    To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.

    v4:
    - updated mm/slab.c. (Fengguang Wu)
    - updated Documentation.
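
    For reference, the accessors added by the series have roughly this shape
    (test/set/clear wrappers generated over the new task->atomic_flags word;
    treat the exact macro bodies as an approximation):

        #define TASK_PFA_TEST(name, func)                                   \
                static inline bool task_##func(struct task_struct *p)       \
                { return test_bit(PFA_##name, &p->atomic_flags); }

        #define TASK_PFA_SET(name, func)                                    \
                static inline void task_set_##func(struct task_struct *p)   \
                { set_bit(PFA_##name, &p->atomic_flags); }

        #define TASK_PFA_CLEAR(name, func)                                  \
                static inline void task_clear_##func(struct task_struct *p) \
                { clear_bit(PFA_##name, &p->atomic_flags); }

        TASK_PFA_TEST(SPREAD_PAGE, spread_page)
        TASK_PFA_SET(SPREAD_PAGE, spread_page)
        TASK_PFA_CLEAR(SPREAD_PAGE, spread_page)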

    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Miao Xie
    Cc: Kees Cook
    Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
    Cc: # 2.6.31+
    Reported-by: Tetsuo Handa
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

09 Aug, 2014

1 commit

  • This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC").

    Commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumed that a
    system with !CONFIG_NUMA has only one memory node. But a report from
    Geert shows that this is false: his m68k system has many memory nodes
    and is configured with !CONFIG_NUMA, so it could not boot with the above
    change.

    His failure report follows.

    With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM:

    enable_cpucache failed for radix_tree_node, error 12.
    kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522!
    *** TRAP #7 *** FORMAT=0
    Current process id is 0
    BAD KERNEL TRAP: 00000000
    Modules linked in:
    PC: [] kmem_cache_init_late+0x70/0x8c
    SR: 2200 SP: 00345f90 a2: 0034c2e8
    d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942
    d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682
    Process swapper (pid: 0, task=0034c2e8)
    Frame format=0
    Stack from 00345fc4:
    002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000
    00000000 00000000 00000000 00000000 00000000 003ac942 00000000
    003912d6
    Call Trace: [] parse_args+0x0/0x2ca
    [] strlen+0x0/0x1a
    [] start_kernel+0x23c/0x428
    [] _sinittext+0x2d6/0x95e

    Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 0035 4b1c 61ff
    fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c
    Disabling lock debugging due to kernel taint
    Kernel panic - not syncing: Attempted to kill the idle task!

    Although there is an alternative way to fix this issue, such as disabling
    use of the alien cache on !CONFIG_NUMA, reverting the offending commit
    seems better for now.

    Signed-off-by: Joonsoo Kim
    Reported-by: Geert Uytterhoeven
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Aug, 2014

13 commits

  • The current struct kmem_cache has no 'lock' field; slab pages are managed
    by struct kmem_cache_node, which has a 'list_lock' field.

    Clean up the related comment.

    Signed-off-by: Wang Sheng-Hui
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • It is better to represent an allocation size as size_t rather than int,
    so change it.

    Signed-off-by: Joonsoo Kim
    Suggested-by: Andrew Morton
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • BAD_ALIEN_MAGIC value isn't used anymore. So remove it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now there is no code that holds two of these locks simultaneously, since
    we no longer call slab_destroy() while holding any lock. The lockdep
    annotation is therefore useless; remove it.

    v2: don't remove BAD_ALIEN_MAGIC in this patch. It will be removed
    in the following patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • I haven't heard that this alien cache lock is contended, but reducing the
    chance of contention is generally better. With this change we can also
    simplify the complex lockdep annotation in the slab code; that is done in
    the following patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now that we have a separate alien_cache structure, it is better to hold
    the alien_cache lock while manipulating the alien_cache. After that, the
    lock on array_cache is no longer needed, so remove it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we use array_cache for the alien cache. Although the two are
    mostly similar, there is one difference: the need for a spinlock. We
    don't need a spinlock for array_cache itself, but as long as array_cache
    is reused for the alien cache, the array_cache structure has to carry
    one. That is needless overhead, so removing it would be better. This
    patch prepares for that by introducing an alien_cache structure and using
    it; the spinlock is removed from array_cache in the following patch.
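
    The new wrapper is small; something like this (matching the shape of the
    structure added to mm/slab.c):

        struct alien_cache {
                spinlock_t lock;        /* protects the embedded array_cache */
                struct array_cache ac;  /* the lock-free per-node cache      */
        };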

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Factor out the initialization of the array cache so it can be used in the
    following patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In free_block(), if freeing an object produces a newly free slab and the
    number of free_objects exceeds free_limit, we start destroying this new
    free slab while holding the kmem_cache node lock. Holding the lock there
    is unnecessary and, generally, holding a lock for as short a time as
    possible is a good thing. I never measured the performance effect of
    this, but it is better not to hold the lock longer than needed.

    Commented by Christoph:
    This is also good because kmem_cache_free is no longer called while
    holding the node lock. So we avoid one case of recursion.
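
    A sketch of the resulting call pattern (free_block() now collects empty
    slabs on a caller-provided list and slabs_destroy() runs after the node
    lock is dropped; argument lists are abbreviated):

        LIST_HEAD(list);

        spin_lock(&n->list_lock);
        free_block(cachep, ac->entry, ac->avail, node, &list);
        spin_unlock(&n->list_lock);

        /* tear down the collected empty slabs without the node lock held */
        slabs_destroy(cachep, &list);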

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The node isn't changed, so we don't need to retrieve this structure every
    time we move an object. The compiler may already do this optimization,
    but making it explicit is better.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This patchset does some cleanup and removes the lockdep annotation.

    Patches 1~2 are just really minor improvements.
    Patches 3~9 are for clean-up and for removing the lockdep annotation.

    There are two cases where lockdep annotation is needed in SLAB:
    1) holding two node locks
    2) holding two array cache (alien cache) locks

    I looked at the code and found that we can avoid these cases without any
    negative effect.

    1) occurs when freeing an object produces a new free slab and we decide
    to destroy it. Although we don't need to hold the lock while destroying
    a slab, the current code does. Destroying a slab without holding the
    lock helps reduce lock contention. To do this, I change the
    implementation so that a new free slab is destroyed after releasing the
    lock.

    2) occurs in a similar situation. When we free an object from a
    non-local node, we put the object into the alien cache while holding the
    alien cache lock. If the alien cache is full, we flush it to the proper
    node cache, and at this point a new free slab can be produced.
    Destroying it may mean freeing a metadata object that comes from yet
    another node, and freeing that object requires the other node's alien
    cache lock. This forces us to hold two alien cache locks, which needs
    lockdep annotation even though the locks are always different and
    deadlock is impossible. To prevent this situation, I use the same
    approach as in 1).

    In this way, we can avoid cases 1) and 2) and then remove the lockdep
    annotation. As the short stat shows, this makes the SLAB code much
    simpler.

    This patch (of 9):

    slab_should_failslab() is called on every allocation, so it is reasonable
    to optimize it. We normally don't allocate from kmem_cache itself; that
    only happens when a new kmem_cache is created, which is a very rare case.
    Therefore, add the unlikely macro to help the compiler optimize.
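
    The change amounts to roughly this (a sketch of the helper in mm/slab.c;
    the should_failslab() argument list is as of that era):

        static bool slab_should_failslab(struct kmem_cache *cachep, gfp_t flags)
        {
                /* allocating the kmem_cache cache itself is a very rare event */
                if (unlikely(cachep == kmem_cache))
                        return false;

                return should_failslab(cachep->object_size, flags, cachep->flags);
        }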

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Use the two helper functions to simplify the code, avoiding numerous
    explicit checks for whether a certain node is online.

    Get rid of various repeated lookups of kmem_cache_node structures.
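
    Assuming the helpers in question are get_node() and
    for_each_kmem_cache_node() from mm/slab.h, usage looks roughly like:

        struct kmem_cache_node *n;
        int node;
        unsigned long free_objects = 0;

        /* old style: n = cachep->node[node], with explicit online checks */
        n = get_node(cachep, numa_mem_id());

        /* iterate only over nodes that actually have a kmem_cache_node */
        for_each_kmem_cache_node(cachep, node, n)
                free_objects += n->free_objects;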

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Christoph Lameter
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • init_lock_keys is only called by __init kmem_cache_init_late

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

24 Jun, 2014

1 commit

  • Commit b1cb0982bdd6 ("change the management method of free objects of
    the slab") introduced a bug in the slab leak detector
    ('/proc/slab_allocators'). The detector works as follows:

    1. traverse all objects on all the slabs.
    2. determine whether each one is active or not.
    3. if active, print who allocated this object.

    But that commit changed how free objects are managed, so the logic that
    determines whether an object is active also changed. Before, we regarded
    objects in cpu caches as inactive; with that commit, we mistakenly
    regard objects in cpu caches as active.

    This introduces a kernel oops if DEBUG_PAGEALLOC is enabled. With
    DEBUG_PAGEALLOC, kernel_map_pages() is used to detect who corrupts free
    memory in the slab: it removes the page table mapping when an object is
    free and restores it when the object is active. When the slab leak
    detector checks an object in a cpu cache, it mistakenly thinks the
    object is active and tries to access the object's memory to retrieve the
    caller of the allocation. At that point no page table mapping for the
    object exists, so an oops occurs.

    The following oops was reported by Dave:

    It blew up when something tried to read /proc/slab_allocators
    (Just cat it, and you should see the oops below)

    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    [snip...]
    CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
    task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
    RIP: 0010:[] [] handle_slab+0x8a/0x180
    RSP: 0018:ffff880076925de0 EFLAGS: 00010002
    RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
    RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
    RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
    R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
    FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
    DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
    Call Trace:
    leaks_show+0xce/0x240
    seq_read+0x28e/0x490
    proc_reg_read+0x3d/0x80
    vfs_read+0x9b/0x160
    SyS_read+0x58/0xb0
    tracesys+0xd4/0xd9
    Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
    RIP handle_slab+0x8a/0x180

    To fix the problem, I introduce an object status buffer on each slab.
    With this, we can track object status precisely, so the slab leak
    detector no longer touches objects that are actually free and no kernel
    oops occurs. The memory overhead of this fix applies only to
    CONFIG_DEBUG_SLAB_LEAK, which is mainly used for debugging, so it isn't
    a big problem.
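
    The status buffer sits right after the freelist inside the slab's
    management area; a hedged sketch of the accessor (close to, but not
    necessarily identical with, the mm/slab.c code under
    CONFIG_DEBUG_SLAB_LEAK):

        #define OBJECT_FREE     (0)
        #define OBJECT_ACTIVE   (1)

        static void set_obj_status(struct page *page, int idx, int val)
        {
                struct kmem_cache *cachep = page->slab_cache;
                int freelist_size = cachep->num * sizeof(freelist_idx_t);
                char *status = (char *)page->freelist + freelist_size;

                status[idx] = val;      /* one status byte per object */
        }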

    Signed-off-by: Joonsoo Kim
    Reported-by: Dave Jones
    Reported-by: Tetsuo Handa
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

4 commits

  • Currently we have two pairs of kmemcg-related functions that are called
    on slab alloc/free. The first is memcg_{bind,release}_pages, which counts
    the total number of pages allocated for a kmem cache. The second is
    memcg_{un}charge_slab, which {un}charges slab pages against the kmemcg
    resource counter. Let's just merge them to keep the code clean.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When we create a sl[au]b cache, we allocate kmem_cache_node structures
    for each online NUMA node. To handle nodes going online/offline, we
    register a memory hotplug notifier and, for each kmem cache,
    allocate/free the kmem_cache_node corresponding to the node that changes
    state.

    To synchronize the two paths we hold the slab_mutex both during the
    cache creation/destruction path and while tuning the per-node parts of
    kmem caches in the memory hotplug handler. But that's not quite right,
    because it does not guarantee that a newly created cache will have all
    of its kmem_cache_nodes initialized if it races with memory hotplug.
    For instance, in the case of slub:

         CPU0                           CPU1
         ----                           ----
    kmem_cache_create:                  online_pages:
     __kmem_cache_create:                slab_memory_callback:
                                          slab_mem_going_online_callback:
                                           lock slab_mutex
                                           for each slab_caches list entry
                                               allocate kmem_cache node
                                           unlock slab_mutex
      lock slab_mutex
      init_kmem_cache_nodes:
       for_each_node_state(node, N_NORMAL_MEMORY)
        allocate kmem_cache node
      add kmem_cache to slab_caches list
      unlock slab_mutex
                                        online_pages (continued):
                                         node_states_set_node

    As a result we'll get a kmem cache with not all kmem_cache_nodes
    allocated.

    To avoid issues like that we should hold get/put_online_mems() during
    the whole kmem cache creation/destruction/shrink paths, just like we
    deal with cpu hotplug. This patch does the trick.

    Note that after it is applied, there is no need to take the slab_mutex
    for kmem_cache_shrink any more, so it is removed from there.
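
    A sketch of the resulting pattern at cache creation (mirroring how the
    cpu hotplug case is already handled; the mems helpers are
    get_online_mems()/put_online_mems()):

        get_online_cpus();
        get_online_mems();

        mutex_lock(&slab_mutex);
        /* create (or destroy/shrink) the cache while the set of online
         * memory nodes cannot change under us */
        mutex_unlock(&slab_mutex);

        put_online_mems();
        put_online_cpus();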

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem, so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to explicitly charge kmem there. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from memcg caches' allocflags. Instead it makes slab allocation path
    call memcg_charge_kmem directly getting memcg to charge from the cache's
    memcg params.

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment to memcg_kmem_get_cache.
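
    A rough sketch of the idea only (the helper name here is illustrative,
    not the exact one added by the patch): at slab page allocation time the
    cache charges its own memcg, taken from its memcg params, instead of
    relying on a __GFP_KMEMCG-tagged gfp mask:

        static int charge_slab_page(struct kmem_cache *s, gfp_t gfp, int order)
        {
                if (is_root_cache(s))   /* root caches are not accounted */
                        return 0;

                return memcg_charge_kmem(s->memcg_params->memcg, gfp,
                                         PAGE_SIZE << order);
        }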

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When the slab or slub allocators cannot allocate additional slab pages,
    they emit diagnostic information to the kernel log such as current
    number of slabs, number of objects, active objects, etc. This is always
    coupled with a page allocation failure warning since it is controlled by
    !__GFP_NOWARN.

    Suppress this out of memory warning if the allocator is configured
    without debug support. The page allocation failure warning will still
    indicate that it is a failed slab allocation, along with the order and
    the gfp mask, so the extra output is only useful for diagnosing
    allocator issues.

    Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
    allocator, there is no functional change with this patch. If debug is
    disabled, however, the warnings are now suppressed.

    Signed-off-by: David Rientjes
    Cc: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

06 May, 2014

2 commits

  • If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and
    likewise if freelist_idx_t is a short, then it should be 65535 not
    65536.

    This was leading to all kinds of random crashes on sparc64 where
    PAGE_SIZE is 8192. One problem shown was that if spinlock debugging was
    enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the
    same cpu already holding a lock it shouldn't hold, or the lock belonging
    to a completely unrelated process.
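
    In code terms, the fix amounts to the off-by-one in the limit macro
    (sketch; the surrounding typedef selection in mm/slab.c is elided):

        typedef unsigned char freelist_idx_t;   /* or unsigned short on big pages */

        /* the largest representable index, not the number of representable values */
        #define SLAB_OBJ_MAX_NUM \
                ((1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) - 1)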

    Fixes: a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab")
    Signed-off-by: David S. Miller
    Signed-off-by: Linus Torvalds

    David Miller
     
  • Commit a41adfaa23df ("slab: introduce byte sized index for the freelist
    of a slab") changed the size of the freelist index and also changed the
    prototypes of the freelist-index accessor functions, and it contained a
    mistake.

    The mistake is that, although it changed the size of the freelist index
    itself correctly, it changed the type of the position used to index into
    the freelist incorrectly. With that patch, a freelist index can be 1 or
    2 bytes, which means the number of objects on a slab can exceed 255, so
    more than 1 byte is needed for the position used to find a free object
    on the freelist. But the patch made that position a 1-byte type, so a
    slab with more than 255 objects cannot work properly and, as a
    consequence, the system cannot boot.

    This issue was reported by Steven King on m68knommu, which would use a
    2-byte freelist index:

    https://lkml.org/lkml/2014/4/16/433

    The fix is easy: changing the type of the index parameter in the
    accessor functions is enough. Although 2 bytes would suffice, I use 4
    bytes since it has no bad effect and keeps things simpler. This fix was
    suggested and tested by Steven in his original report.
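
    The corrected accessors look roughly like this (the position parameter is
    a plain unsigned int, while the stored value keeps the narrow
    freelist_idx_t type):

        static inline freelist_idx_t get_free_obj(struct page *page,
                                                  unsigned int idx)
        {
                return ((freelist_idx_t *)page->freelist)[idx];
        }

        static inline void set_free_obj(struct page *page,
                                        unsigned int idx, freelist_idx_t val)
        {
                ((freelist_idx_t *)page->freelist)[idx] = val;
        }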

    Signed-off-by: Joonsoo Kim
    Reported-and-acked-by: Steven King
    Acked-by: Christoph Lameter
    Tested-by: James Hogan
    Tested-by: David Miller
    Cc: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

14 Apr, 2014

1 commit

  • Pull slab changes from Pekka Enberg:
    "The biggest change is byte-sized freelist indices which reduces slab
    freelist memory usage:

    https://lkml.org/lkml/2013/12/2/64"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: slab/slub: use page->list consistently instead of page->lru
    mm/slab.c: cleanup outdated comments and unify variables naming
    slab: fix wrongly used macro
    slub: fix high order page allocation problem with __GFP_NOFAIL
    slab: Make allocations with GFP_ZERO slightly more efficient
    slab: make more slab management structure off the slab
    slab: introduce byte sized index for the freelist of a slab
    slab: restrict the number of objects in a slab
    slab: introduce helper functions to get/set free object
    slab: factor out calculate nr objects in cache_estimate

    Linus Torvalds
     

11 Apr, 2014

1 commit

  • 'struct page' has two list_head fields: 'lru' and 'list'. Conveniently,
    they are unioned together. This means that code can use them
    interchangeably, which gets horribly confusing, as with this nugget from
    slab.c:

    > list_del(&page->lru);
    > if (page->active == cachep->num)
    > list_add(&page->list, &n->slabs_full);

    This patch makes the slab and slub code use page->lru universally instead
    of mixing ->list and ->lru.

    So, the new rule is: page->lru is what you use if you want to keep your
    page on a list. Don't like the fact that it's not called ->list? Too
    bad.
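
    For comparison, the same snippet after the conversion (only the list_head
    name changes):

        list_del(&page->lru);
        if (page->active == cachep->num)
                list_add(&page->lru, &n->slabs_full);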

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Pekka Enberg

    Dave Hansen
     

08 Apr, 2014

2 commits

  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().

    Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
    64GB of memory and without a mempolicy:

    threads before after
    16 1249409 1244487
    32 1281786 1246783
    48 1239175 1239138
    64 1244642 1241841
    80 1244346 1248918
    96 1266436 1254316
    112 1307398 1312135
    128 1327607 1326502

    Per-process flags are a scarce resource so we should free them up whenever
    possible and make them available. We'll be using it shortly for memcg oom
    reserves.
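
    The check in the SLAB allocation path then becomes, roughly (a hedged
    sketch of the condition change; the surrounding code is simplified):

        /* was: if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) */
        if (current->mempolicy || unlikely(current->flags & PF_SPREAD_SLAB))
                objp = alternate_node_alloc(cachep, flags);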

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • slab_node() is actually a mempolicy function, so rename it to
    mempolicy_slab_node() to make it clearer that it is used for processes
    with mempolicies.

    At the same time, clean up its code by saving numa_mem_id() in a local
    variable (since we require a node with memory, not just any node) and
    remove an obsolete comment that assumes the mempolicy is actually passed
    into the function.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Apr, 2014

1 commit

  • Since put_mems_allowed() is strictly optional (it is a seqcount retry),
    we don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads, and comparisons on some
    relatively hot fast paths.

    Since the naming get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
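
    Typical usage after the rename looks roughly like this (a sketch of the
    retry-loop pattern in an allocation slow path):

        struct page *page;
        unsigned int cpuset_mems_cookie;

        do {
                cpuset_mems_cookie = read_mems_allowed_begin();
                page = alloc_pages(gfp, order);
                /* only re-check the cookie if the allocation failed */
        } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));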

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Apr, 2014

1 commit


08 Feb, 2014

2 commits

  • Use the likely() mechanism already present around the valid-pointer tests
    to better choose when to memset allocations to 0 for __GFP_ZERO.
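
    The resulting shape is roughly (a sketch of the allocation tail in
    mm/slab.c; the __GFP_ZERO memset moves under the existing likely(objp)
    test instead of re-checking the pointer):

        if (likely(objp)) {
                kmemcheck_slab_alloc(cachep, flags, objp, cachep->object_size);
                if (unlikely(flags & __GFP_ZERO))
                        memset(objp, 0, cachep->object_size);
        }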

    Acked-by: Christoph Lameter
    Signed-off-by: Joe Perches
    Signed-off-by: Pekka Enberg

    Joe Perches
     
  • Now that the size of the freelist used for slab management has shrunk,
    the on-slab management structure can waste a lot of space when the
    slab's objects are large.

    Consider a cache with 128 byte objects (on a 4 KB page). If on-slab
    management is used, 31 objects fit in the slab. The freelist in this
    case takes 31 bytes, so 97 bytes, that is, more than 75% of an object's
    size, are wasted.

    With 64 byte objects, no space is wasted if we use on-slab management.
    So set the off-slab determining constraint to 128 bytes.

    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim