07 Oct, 2012

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "New and noteworthy:

    * More SLAB allocator unification patches from Christoph Lameter and
    others. This paves the way for slab memcg patches that hopefully
    will land in v3.8.

    * SLAB tracing improvements from Ezequiel Garcia.

    * Kernel tainting upon SLAB corruption from Dave Jones.

    * Miscellaneous SLAB allocator bug fixes and improvements from various
    people."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (43 commits)
    slab: Fix build failure in __kmem_cache_create()
    slub: init_kmem_cache_cpus() and put_cpu_partial() can be static
    mm/slab: Fix kmem_cache_alloc_node_trace() declaration
    Revert "mm/slab: Fix kmem_cache_alloc_node_trace() declaration"
    mm, slob: fix build breakage in __kmalloc_node_track_caller
    mm/slab: Fix kmem_cache_alloc_node_trace() declaration
    mm/slab: Fix typo _RET_IP -> _RET_IP_
    mm, slub: Rename slab_alloc() -> slab_alloc_node() to match SLAB
    mm, slab: Rename __cache_alloc() -> slab_alloc()
    mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototype
    mm, slab: Replace 'caller' type, void* -> unsigned long
    mm, slob: Add support for kmalloc_track_caller()
    mm, slab: Remove silly function slab_buffer_size()
    mm, slob: Use NUMA_NO_NODE instead of -1
    mm, sl[au]b: Taint kernel when we detect a corrupted slab
    slab: Only define slab_error for DEBUG
    slab: fix the DEADLOCK issue on l3 alien lock
    slub: Zero initial memory segment for kmem_cache and kmem_cache_node
    Revert "mm/sl[aou]b: Move sysfs_slab_add to common"
    mm/sl[aou]b: Move kmem_cache refcounting to common code
    ...

    Linus Torvalds
     

03 Oct, 2012

5 commits

  • Pekka Enberg
     
  • Fix up a trivial conflict with NUMA_NO_NODE cleanups.

    Conflicts:
    mm/slob.c

    Signed-off-by: Pekka Enberg

    Pekka Enberg
     
  • Pekka Enberg
     
  • Fix build failure with CONFIG_DEBUG_SLAB=y && CONFIG_DEBUG_PAGEALLOC=y caused
    by commit 8a13a4cc "mm/sl[aou]b: Shrink __kmem_cache_create() parameter lists".

    mm/slab.c: In function '__kmem_cache_create':
    mm/slab.c:2474: error: 'align' undeclared (first use in this function)
    mm/slab.c:2474: error: (Each undeclared identifier is reported only once
    mm/slab.c:2474: error: for each function it appears in.)
    make[1]: *** [mm/slab.o] Error 1
    make: *** [mm] Error 2

    Acked-by: Christoph Lameter
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Pekka Enberg

    Tetsuo Handa
     
  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky, leading to a confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use the new IRQ-safe timer, and cancelation now works as
    expected.

    * Another deficiency of delayed_work was the lack of a counterpart to
    mod_timer(), which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behavior like a timer that is executed in process context.
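
    A minimal sketch of the new interface (illustrative; the handler, the
    delayed_work instance and the delay below are hypothetical, not part of
    this pull request):

    #include <linux/workqueue.h>

    static void my_timeout_fn(struct work_struct *work)
    {
            /* hypothetical handler body */
    }

    static DECLARE_DELAYED_WORK(my_dwork, my_timeout_fn);

    static void rearm_timeout(unsigned long delay_jiffies)
    {
            /* Previously this was open-coded as cancel_delayed_work() +
             * queue_delayed_work(); mod_delayed_work() does the equivalent
             * in one call, mirroring mod_timer(). */
            mod_delayed_work(system_wq, &my_dwork, delay_jiffies);
    }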

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs, and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter, and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.
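
    A short sketch of what the stronger guarantee means in practice
    (illustrative; the work item is assumed to be set up elsewhere with
    INIT_WORK()):

    #include <linux/workqueue.h>

    static struct work_struct my_work;      /* initialized with INIT_WORK() */

    static void stop_and_wait(void)
    {
            /* With every workqueue non-reentrant, flush_work() is as strong
             * as flush_work_sync() used to be: on return, any previously
             * queued execution of my_work has finished, on whichever CPU
             * it ran. */
            flush_work(&my_work);           /* was: flush_work_sync(&my_work) */
    }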

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

19 Sep, 2012

2 commits

  • It doesn't seem worth adding a new taint flag for this, so just re-use
    the one from 'bad page'.
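
    A rough sketch of the idea, assuming it hooks into the allocators'
    existing corruption reporters (the reporting function below is
    hypothetical; only add_taint() and TAINT_BAD_PAGE come from the kernel):

    #include <linux/kernel.h>       /* add_taint(), TAINT_BAD_PAGE, pr_err() */

    static void report_slab_corruption(const char *reason)
    {
            pr_err("slab corruption: %s\n", reason);
            /* Reuse the 'bad page' taint rather than defining a new flag,
             * so subsequent oops reports show the memory-corruption taint. */
            add_taint(TAINT_BAD_PAGE);
    }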

    Acked-by: Christoph Lameter # SLUB
    Acked-by: David Rientjes
    Signed-off-by: Dave Jones
    Signed-off-by: Pekka Enberg

    Dave Jones
     
  • On Tue, 11 Sep 2012, Stephen Rothwell wrote:
    > After merging the final tree, today's linux-next build (sparc64 defconfig)
    > produced this warning:
    >
    > mm/slab.c:808:13: warning: '__slab_error' defined but not used [-Wunused-function]
    >
    > Introduced by commit 945cf2b6199b ("mm/sl[aou]b: Extract a common
    > function for kmem_cache_destroy"). All uses of slab_error() are now
    > guarded by DEBUG.

    There is no use case left for slab builds without DEBUG.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

18 Sep, 2012

2 commits

  • In the array cache, there is an object at index 0; check it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Right now, we call ClearSlabPfmemalloc() for the first page of a slab when
    we clear the SlabPfmemalloc flag. This is fine for most swap-over-network
    use cases as it is expected that order-0 pages are in use. Unfortunately
    it is possible that __ac_put_obj() checks SlabPfmemalloc on a tail
    page, and while this is harmless, it is sloppy. This patch ensures that
    the head page is always used.

    This problem was originally identified by Joonsoo Kim.
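
    A minimal sketch of the resulting shape, assuming the check lives in
    __ac_put_obj() and using the helper names from the PFMEMALLOC series
    (treat the exact bodies as illustrative):

    static void __ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
                             void *objp)
    {
            /* Resolve the head page explicitly: for a high-order slab, objp
             * may point into a tail page, but SlabPfmemalloc is only tracked
             * on the head page. */
            struct page *page = virt_to_head_page(objp);

            if (unlikely(PageSlabPfmemalloc(page)))
                    set_obj_pfmemalloc(&objp);

            ac->entry[ac->avail++] = objp;
    }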

    [js1304@gmail.com: Original implementation and problem identification]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Chuck Lever
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Sep, 2012

1 commit

  • A DEADLOCK will be reported while running a kernel with NUMA and LOCKDEP
    enabled; the sequence behind this false report is:

    kmem_cache_free() //free obj in cachep
    -> cache_free_alien() //acquire cachep's l3 alien lock
    -> __drain_alien_cache()
    -> free_block()
    -> slab_destroy()
    -> kmem_cache_free() //free slab in cachep->slabp_cache
    -> cache_free_alien() //acquire cachep->slabp_cache's l3 alien lock

    Since cachep's and cachep->slabp_cache's l3 alien locks are in the same
    lock class, a false report is generated.

    This should not happen, since we already have init_lock_keys(), which
    reassigns the lock class for both the l3 list and l3 alien locks.

    However, init_lock_keys() was invoked at the wrong position: before we
    invoke enable_cpucache() on each cache.

    Until slab_state is set to FULL, we do not invoke enable_cpucache() on
    caches to build their l3 alien structures while creating them. So
    although init_lock_keys() was invoked, it could not change the l3 alien
    lock class: those locks do not exist until enable_cpucache() is invoked
    later.

    This patch invokes init_lock_keys() after enable_cpucache() is done,
    instead of before, to avoid the false DEADLOCK report.
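
    Roughly, the reordering in kmem_cache_init_late() looks like this (a
    sketch of the idea, not the literal diff; error handling and the cpu
    notifier setup are omitted):

    void __init kmem_cache_init_late(void)
    {
            struct kmem_cache *cachep;

            slab_state = UP;

            /* Resize the head arrays to their final sizes; this is also
             * where enable_cpucache() builds the l3 alien caches. */
            mutex_lock(&slab_mutex);
            list_for_each_entry(cachep, &slab_caches, list)
                    if (enable_cpucache(cachep, GFP_NOWAIT))
                            BUG();
            mutex_unlock(&slab_mutex);

            /* Annotate the locks for lockdep only now, after the alien
             * caches exist, so they actually get reclassed. */
            init_lock_keys();

            /* Done! */
            slab_state = FULL;
    }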

    Michael traced the problem back to a commit in release 3.0.0:

    commit 30765b92ada267c5395fc788623cb15233276f5c
    Author: Peter Zijlstra
    Date: Thu Jul 28 23:22:56 2011 +0200

    slab, lockdep: Annotate the locks before using them

    Fernando found we hit the regular OFF_SLAB 'recursion' before we
    annotate the locks, cure this.

    The relevant portion of the stack-trace:

    > [ 0.000000] [] rt_spin_lock+0x50/0x56
    > [ 0.000000] [] __cache_free+0x43/0xc3
    > [ 0.000000] [] kmem_cache_free+0x6c/0xdc
    > [ 0.000000] [] slab_destroy+0x4f/0x53
    > [ 0.000000] [] free_block+0x94/0xc1
    > [ 0.000000] [] do_tune_cpucache+0x10b/0x2bb
    > [ 0.000000] [] enable_cpucache+0x7b/0xa7
    > [ 0.000000] [] kmem_cache_init_late+0x1f/0x61
    > [ 0.000000] [] start_kernel+0x24c/0x363
    > [ 0.000000] [] i386_start_kernel+0xa9/0xaf

    Reported-by: Fernando Lopez-Lezcano
    Acked-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
    Signed-off-by: Ingo Molnar

    The commit moved init_lock_keys() to before we build up the alien caches,
    so we failed to reclass them.

    Cc: # 3.0+
    Acked-by: Christoph Lameter
    Tested-by: Paul E. McKenney
    Signed-off-by: Michael Wang
    Signed-off-by: Pekka Enberg

    Michael Wang
     

30 Aug, 2012

1 commit

  • cache_grow() can re-enable irqs, so the cpu (and node) can change; ensure
    that we take list_lock on the correct nodelist.

    This fixes an issue with commit 072bb0aa5e06 ("mm: sl[au]b: add
    knowledge of PFMEMALLOC reserve pages") where list_lock for the wrong
    node was taken after growing the cache.
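
    In cache_alloc_refill()-style code the shape of the fix is roughly the
    following fragment (illustrative, not the exact diff; the surrounding
    variables come from that function):

            x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);

            /* cache_grow() may have re-enabled interrupts, so the per-cpu
             * array cache and the local node may both have changed:
             * re-read them before looking up a nodelist and taking its
             * list_lock. */
            ac = cpu_cache_get(cachep);
            node = numa_mem_id();
            l3 = cachep->nodelists[node];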

    Reported-and-tested-by: Haggai Eran
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

16 Aug, 2012

1 commit

  • page_get_cache() does not need to call compound_head(), as its unique
    caller virt_to_slab() already makes sure to return a head page.

    Additionally, removing the compound_head() call makes page_get_cache()
    consistent with page_get_slab().
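
    The resulting helper is tiny; roughly (the slab_cache field name reflects
    that era's struct page and is shown for illustration):

    static inline struct kmem_cache *page_get_cache(struct page *page)
    {
            /* No compound_head() here: virt_to_slab(), the only caller,
             * already passes the head page, matching page_get_slab(). */
            BUG_ON(!PageSlab(page));
            return page->slab_cache;
    }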

    Signed-off-by: Michel Lespinasse
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Pekka Enberg

    Michel Lespinasse
     

01 Aug, 2012

3 commits

  • Getting and putting objects in SLAB currently requires a function call but
    the bulk of the work is related to PFMEMALLOC reserves which are only
    consumed when network-backed storage is critical. Use an inline function
    to determine if the function call is required.
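
    A sketch of the pattern (the wrapper and helper names follow the
    PFMEMALLOC series; treat the exact bodies as illustrative):

    static inline void *ac_get_obj(struct kmem_cache *cachep,
                                   struct array_cache *ac, gfp_t flags,
                                   bool force_refill)
    {
            /* Fast path: with no memalloc sockets active there is nothing
             * PFMEMALLOC-related to check, so skip the out-of-line helper. */
            if (unlikely(sk_memalloc_socks()))
                    return __ac_get_obj(cachep, ac, flags, force_refill);

            return ac->entry[--ac->avail];
    }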

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
    like PF_MEMALLOC. It allows one to pass along the memalloc state in
    object related allocation flags as opposed to task related flags, such as
    sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC as callers
    using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now
    enough to identify allocations related to page reclaim.
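
    A minimal usage sketch (the receive-refill context is hypothetical; the
    flags themselves are real):

    #include <linux/gfp.h>

    /* A driver refilling RX buffers on behalf of a SOCK_MEMALLOC socket can
     * mark the allocation itself instead of relying on task state: */
    static struct page *rx_alloc_page(void)
    {
            return alloc_page(GFP_ATOMIC | __GFP_MEMALLOC);
    }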

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When a user or administrator requires swap for their application, they
    create a swap partition and file, format it with mkswap and activate it
    with swapon. Swap over the network is considered an option in diskless
    systems. The two likely scenarios are when blade servers are used as part
    of a cluster where the form factor or maintenance costs do not allow the
    use of disks, and when thin clients are used.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap according to the manual at
    https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
    There are also documentation and tutorials on how to set up swap over NBD
    at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
    The nbd-client documentation also covers the use of NBD as swap. Despite
    this, the fact
    is that a machine using NBD for swap can deadlock within minutes if swap
    is used intensively. This patch series addresses the problem.

    The core issue is that network block devices do not use mempools like
    normal block devices do. As the host cannot control where they receive
    packets from, they cannot reliably work out in advance how much memory
    they might need. Some years ago, Peter Zijlstra developed a series of
    patches that supported swap over an NFS that at least one distribution is
    carrying within their kernels. This patch series borrows very heavily
    from Peter's work to support swapping over NBD as a pre-requisite to
    supporting swap-over-NFS. The bulk of the complexity is concerned with
    preserving memory that is allocated from the PFMEMALLOC reserves for use
    by the network layer which is needed for both NBD and NFS.

    Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
    preserve access to pages allocated under low memory situations
    to callers that are freeing memory.

    Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

    Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
    reserves without setting PFMEMALLOC.

    Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
    for later use by network packet processing.

    Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

    Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

    Patches 7-12 allows network processing to use PFMEMALLOC reserves when
    the socket has been marked as being used by the VM to clean pages. If
    packets are received and stored in pages that were allocated under
    low-memory situations and are unrelated to the VM, the packets
    are dropped.

    Patch 11 reintroduces __skb_alloc_page, which the networking
    folk may object to but is needed in some cases to propagate
    pfmemalloc from a newly allocated page to an skb. If there is a
    strong objection, this patch can be dropped, with the impact being
    that swap-over-network will be slower in some cases but it should
    not fail.

    Patch 13 is a micro-optimisation to avoid a function call in the
    common case.

    Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
    PFMEMALLOC if necessary.

    Patch 15 notes that it is still possible for the PFMEMALLOC reserve
    to be depleted. To prevent this, direct reclaimers get throttled on
    a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
    expected that kswapd and the direct reclaimers already running
    will clean enough pages for the low watermark to be reached and
    the throttled processes are woken up.

    Patch 16 adds a statistic to track how often processes get throttled

    Some basic performance testing was run using kernel builds, netperf on
    loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
    sysbench. Each of them was expected to use the sl*b allocators
    reasonably heavily, but there did not appear to be significant performance
    variances.

    For testing swap-over-NBD, a machine was booted with 2G of RAM with a
    swapfile backed by NBD. 8*NUM_CPU processes were started that create
    anonymous memory mappings and read them linearly in a loop. The total
    size of the mappings was 4*PHYSICAL_MEMORY, to use swap heavily under
    memory pressure.

    Without the patches and using SLUB, the machine locks up within minutes;
    with them applied, it runs to completion. With SLAB, the story is
    different, as an unpatched kernel runs to completion. However, the
    patched kernel completed the test 45% faster.

    MICRO
                                              3.5.0-rc2    3.5.0-rc2
                                                vanilla      swapnbd
    Unrecognised test vmscan-anon-mmap-write
    MMTests Statistics: duration
    Sys Time Running Test (seconds)              197.80       173.07
    User+Sys Time Running Test (seconds)         206.96       182.03
    Total Elapsed Time (seconds)                3240.70      1762.09

    This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

    Allocations of pages below the min watermark run a risk of the machine
    hanging due to a lack of memory. To prevent this, only callers who have
    PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
    allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
    a slab though, nothing prevents other callers consuming free objects
    within those slabs. This patch limits access to slab pages that were
    alloced from the PFMEMALLOC reserves.

    When this patch is applied, pages allocated from below the low watermark
    are returned with page->pfmemalloc set and it is up to the caller to
    determine how the page should be protected. SLAB restricts access to any
    page with page->pfmemalloc set to callers which are known to be able to
    access the PFMEMALLOC reserve. If one is not available, an attempt is
    made to allocate a new page rather than use a reserve. SLUB is a bit more
    relaxed in that it only records if the current per-CPU page was allocated
    from PFMEMALLOC reserve and uses another partial slab if the caller does
    not have the necessary GFP or process flags. This was found to be
    sufficient in tests to avoid hangs due to SLUB generally maintaining
    smaller lists than SLAB.

    In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
    a slab allocation even though free objects are available because they are
    being preserved for callers that are freeing pages.
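
    The gatekeeping idea can be sketched as follows (close to the SLUB-side
    helper in this series; treat it as illustrative rather than the exact
    code):

    static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
    {
            /* An object from a pfmemalloc slab page is only handed out if
             * the caller itself is entitled to the reserves; everyone else
             * must use (or allocate) a non-pfmemalloc slab instead. */
            if (unlikely(PageSlabPfmemalloc(page)))
                    return gfp_pfmemalloc_allowed(gfpflags);

            return true;
    }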

    [a.p.zijlstra@chello.nl: Original implementation]
    [sebastian@breakpoint.cc: Correct order of page flag clearing]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Jul, 2012

4 commits

  • Move the mutex handling into the common kmem_cache_create()
    function.

    Then we can also move more checks out of SLAB's kmem_cache_create()
    into the common code.
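
    After the move, the common entry point has roughly this shape (a sketch
    under the pre-8a13a4cc __kmem_cache_create() signature; error paths are
    omitted):

    struct kmem_cache *kmem_cache_create(const char *name, size_t size,
                                         size_t align, unsigned long flags,
                                         void (*ctor)(void *))
    {
            struct kmem_cache *s;

            get_online_cpus();
            mutex_lock(&slab_mutex);

            /* The allocator-specific part (SLAB/SLUB/SLOB) now runs under
             * the common slab_mutex instead of taking its own lock. */
            s = __kmem_cache_create(name, size, align, flags, ctor);

            mutex_unlock(&slab_mutex);
            put_online_cpus();

            return s;
    }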

    Reviewed-by: Glauber Costa
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Use the mutex definition from SLAB and make it the common way to take a sleeping lock.

    This has the effect of using a mutex instead of a rw semaphore for SLUB.

    SLOB gains the use of a mutex for kmem_cache_create() serialization.
    This is not needed now, but SLOB may acquire some more features later
    (like slabinfo / sysfs support) through the expansion of the common code
    that will need this.

    Reviewed-by: Glauber Costa
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • All allocators have some sort of support for the bootstrap status.

    Setup a common definition for the boot states and make all slab
    allocators use that definition.
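
    The shared definition amounts to an enum along these lines (as recalled
    from mm/slab.h of that period; treat the intermediate SLAB/SLUB-specific
    states as approximate):

    /* Common bootstrap states for the sl[aou]b allocators */
    enum slab_state {
            DOWN,                   /* No slab functionality yet */
            PARTIAL,                /* SLUB: kmem_cache_node works */
            PARTIAL_ARRAYCACHE,     /* SLAB: kmalloc size for array cache works */
            PARTIAL_L3,             /* SLAB: kmalloc size for l3 struct works */
            UP,                     /* Slab caches usable, but not everything */
            FULL                    /* Everything is working */
    };

    extern enum slab_state slab_state;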

    Reviewed-by: Glauber Costa
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • kmem_cache_create() does a variety of sanity checks, but those vary
    depending on the allocator. Use the strictest tests and put them into
    a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM.

    This patch has the effect of adding sanity checks for SLUB and SLOB
    under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM.
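
    The common checks amount to something like the following sketch (the
    helper name, message and limits are from memory and shown for
    illustration):

    #ifdef CONFIG_DEBUG_VM
    static int kmem_cache_sanity_check(const char *name, size_t size)
    {
            /* The strictest checks from the individual allocators, now
             * applied to SLAB, SLUB and SLOB alike under CONFIG_DEBUG_VM. */
            if (!name || in_interrupt() || size < sizeof(void *) ||
                size > KMALLOC_MAX_SIZE) {
                    pr_err("kmem_cache_create(%s) integrity check failed\n",
                           name);
                    return -EINVAL;
            }
            return 0;
    }
    #endif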

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

02 Jul, 2012

3 commits

  • During kmem_cache_init_late(), we transition to the LATE state,
    and after some more work, to the FULL state, its last state.

    This is quite different from slub, which will only transition to
    its last state (previously SYSFS) in a (late)initcall, after a lot
    more of the kernel is ready.

    This means that in slab, we have no way of taking actions dependent
    on the initialization of other pieces of the kernel that are supposed
    to start well after kmem_cache_init_late(), such as cgroups
    initialization.

    To achieve more consistency in this behavior, this patch only
    transitions to the UP state in kmem_cache_init_late(). In my analysis,
    setup_cpu_cache() should be happy to test for >= UP instead of
    == FULL. It has also passed some tests I've made.

    We then only mark the FULL state after the reap timers are in place,
    meaning that no further setup is expected.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
     
  • Commit 8c138b, which currently sits only in Pekka's tree and linux-next,
    tries to replace obj_size(cachep) with cachep->object_size, but has a typo
    in kmem_cache_free(), using "size" instead of "object_size", which causes
    some regressions.

    Reported-and-tested-by: Fengguang Wu
    Signed-off-by: Feng Tang
    Cc: Christoph Lameter
    Acked-by: Glauber Costa
    Signed-off-by: Pekka Enberg

    Feng Tang
     
  • Commit 3b0efdf ("mm, sl[aou]b: Extract common fields from struct
    kmem_cache") renamed the kmem_cache structure's "next" field to "list"
    but forgot to update one instance in leaks_show().

    Signed-off-by: Thierry Reding
    Signed-off-by: Pekka Enberg

    Thierry Reding