22 Feb, 2018

2 commits

  • commit 75f296d93bcebcfe375884ddac79e30263a31766 upstream.

    Convert all allocations that used a NOTRACK flag to stop using it.

    Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     
  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitations of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

25 Dec, 2017

1 commit

  • commit 3382290ed2d5e275429cef510ab21889d3ccd164 upstream.

    [ Note, this is a Git cherry-pick of the following commit:

    506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

    ... for easier x86 PTI code testing and back-porting. ]

    READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
    can be used instead of lockless_dereference() without any change in
    semantics.
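
    As an illustration of the conversion (the structure and field names below
    are made up, not code from the patch), the change is typically a one-line
    substitution:

    #include <linux/compiler.h>
    #include <linux/rcupdate.h>

    struct foo {
            int value;
    };

    static struct foo *global_foo;

    /* Before: lockless_dereference() supplied the dependency barrier. */
    static int read_value_old(void)
    {
            struct foo *f = lockless_dereference(global_foo);

            return f ? f->value : 0;
    }

    /* After: READ_ONCE() now implies smp_read_barrier_depends(), so the
     * semantics are unchanged. */
    static int read_value_new(void)
    {
            struct foo *f = READ_ONCE(global_foo);

            return f ? f->value : 0;
    }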

    Signed-off-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
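
    For reference, the added identifier is a single comment on the first line
    of each file; for a C source file under the kernel's default license it is:

    // SPDX-License-Identifier: GPL-2.0

    Headers and other file types use the comment style appropriate to them,
    e.g. /* SPDX-License-Identifier: GPL-2.0 */.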

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

10 Aug, 2017

1 commit

  • A while ago someone, and I cannot find the email just now, asked if we
    could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
    like we use for other things like workqueues etc. I think this should
    be possible which allows reducing the 'irq' states and will reduce the
    amount of __bfs() lookups we do.

    Removing the 1 IRQ state results in 4 fewer __bfs() walks per
    dependency, improving lockdep performance. And by moving this
    annotation out of the lockdep code it becomes easier for the mm people
    to extend.
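
    With lockdep enabled, such a 'fake' lock is just a lockdep_map that is
    acquired and released around the interesting region; a rough sketch
    (names loosely follow the eventual fs_reclaim annotation, which adds GFP
    checks and recursion protection on top of this, so treat it as an
    illustration only):

    #include <linux/lockdep.h>

    /* A lockdep-only pseudo-lock: no real locking, just dependency tracking. */
    static struct lockdep_map __fs_reclaim_map =
            STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);

    /* Taken around any context that may enter FS/IO reclaim. */
    static void fs_reclaim_acquire_sketch(void)
    {
            lock_map_acquire(&__fs_reclaim_map);
    }

    static void fs_reclaim_release_sketch(void)
    {
            lock_map_release(&__fs_reclaim_map);
    }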

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: boqun.feng@gmail.com
    Cc: iamjoonsoo.kim@lge.com
    Cc: kernel-team@lge.com
    Cc: kirill@shutemov.name
    Cc: npiggin@gmail.com
    Cc: walken@google.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Jul, 2017

3 commits

  • Josef's redesign of the balancing between slab caches and the page cache
    requires slab cache statistics at the lruvec level.

    Link: http://lkml.kernel.org/r/20170530181724.27197-7-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The kmem-specific functions do the same thing. Switch and drop.

    Link: http://lkml.kernel.org/r/20170530181724.27197-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that the slab counters are moved from the zone to the node level we
    can drop the private memcg node stats and use the official ones.

    Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.
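
    The usage pattern the new name is meant to suggest looks roughly like
    this (hypothetical object type; the point is the identity re-check after
    taking the object's lock):

    #include <linux/spinlock.h>

    /* Hypothetical object type whose cache uses SLAB_TYPESAFE_BY_RCU. */
    struct conn {
            spinlock_t lock;
            int key;
    };

    /*
     * Called with rcu_read_lock() held; 'c' was found via an RCU lookup.
     * The cache flag only guarantees the memory stays an object of this
     * type for the duration of the read-side critical section; it may have
     * been recycled into a *new* object, so re-validate it under the lock.
     */
    static bool conn_get_stable(struct conn *c, int key)
    {
            spin_lock(&c->lock);
            if (c->key != key) {            /* object was recycled: not ours */
                    spin_unlock(&c->lock);
                    return false;
            }
            return true;                    /* caller now holds c->lock */
    }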

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
     

23 Feb, 2017

7 commits

  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    slub uses synchronize_sched() to deactivate a memcg cache.
    synchronize_sched() is an expensive and slow operation and doesn't scale
    when a huge number of caches are destroyed back-to-back. While there
    used to be a simple batching mechanism, the batching was too restricted
    to be helpful.

    This patch implements slab_deactivate_memcg_cache_rcu_sched(), which slub
    can use to schedule a sched RCU callback instead of performing
    synchronize_sched() synchronously while holding cgroup_mutex. While
    this adds online cpus, mems and slab_mutex operations, operating on
    these locks back-to-back from the same kworker, which is what is going to
    happen when there are many caches to deactivate, isn't expensive at all,
    and this gets rid of the scalability problem completely.
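
    A minimal sketch of the deferral pattern, assuming a separate helper
    allocation (the real patch instead embeds the rcu_head in the cache's
    memcg parameters):

    #include <linux/errno.h>
    #include <linux/kernel.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct deferred_deactivation {          /* illustrative helper struct */
            struct rcu_head rcu;
            struct kmem_cache *cache;
    };

    static void deactivate_cache_cb(struct rcu_head *head)
    {
            struct deferred_deactivation *d =
                    container_of(head, struct deferred_deactivation, rcu);

            /* ... finish deactivating d->cache here ... */
            kfree(d);
    }

    static int schedule_cache_deactivation(struct kmem_cache *cache)
    {
            struct deferred_deactivation *d = kmalloc(sizeof(*d), GFP_KERNEL);

            if (!d)
                    return -ENOMEM;
            d->cache = cache;
            /* Returns immediately; the callback runs after a sched-RCU grace
             * period, so nothing blocks while cgroup_mutex is held. */
            call_rcu_sched(&d->rcu, deactivate_cache_cb);
            return 0;
    }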

    Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • __kmem_cache_shrink() is called with %true @deactivate only for memcg
    caches. Remove @deactivate from __kmem_cache_shrink() and introduce
    __kmemcg_cache_deactivate() instead. Each memcg-supporting allocator
    should implement it and it should deactivate and drain the cache.

    This is to allow memcg cache deactivation behavior to further deviate
    from simple shrinking without messing up __kmem_cache_shrink().

    This is pure reorganization and doesn't introduce any observable
    behavior changes.

    v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.

    Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    slab_caches currently lists all caches including root and memcg ones.
    This is the only data structure which lists the root caches and
    iterating root caches can only be done by walking the list while
    skipping over memcg caches. As there can be a huge number of memcg
    caches, this can become very expensive.

    This also can make /proc/slabinfo behave very badly. seq_file processes
    reads in 4k chunks and seeks to the previous Nth position on slab_caches
    list to resume after each chunk. With a lot of memcg cache churns on
    the list, reading /proc/slabinfo can become very slow and its content
    often ends up with duplicate and/or missing entries.

    This patch adds a new list slab_root_caches which lists only the root
    caches. When memcg is not enabled, it becomes just an alias of
    slab_caches. memcg specific list operations are collected into
    memcg_[un]link_cache().
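
    A sketch of the resulting structure, using a stand-in struct (field and
    list names may differ from the final patch):

    #include <linux/list.h>

    /* Stand-in for struct kmem_cache, showing only the relevant links. */
    struct cache {
            struct list_head list;                  /* on slab_caches (all caches) */
            struct list_head root_caches_node;      /* on slab_root_caches (roots) */
    };

    static LIST_HEAD(slab_root_caches);

    /* Root-cache walkers no longer scan slab_caches and skip memcg children. */
    static void for_each_root_cache(void (*fn)(struct cache *))
    {
            struct cache *s;

            list_for_each_entry(s, &slab_root_caches, root_caches_node)
                    fn(s);
    }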

    Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    While a memcg kmem_cache is listed on its root cache's ->children list,
    there is no direct way to iterate all kmem_caches which are associated
    with a memory cgroup. The only way to iterate them is walking all
    caches while filtering out caches which don't match, which would be most
    of them.

    This makes memcg destruction operations O(N^2) where N is the total
    number of slab caches which can be huge. This combined with the
    synchronous RCU operations can tie up a CPU and affect the whole machine
    for many hours when memory reclaim triggers offlining and destruction of
    the stale memcgs.

    This patch adds mem_cgroup->kmem_caches list which goes through
    memcg_cache_params->kmem_caches_node of all kmem_caches which are
    associated with the memcg. All memcg specific iterations, including
    stat file access, are updated to use the new list instead.

    Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • We're going to change how memcg caches are iterated. In preparation,
    clean up and reorganize memcg_cache_params.

    * The shared ->list is replaced by ->children in root and
    ->children_node in children.

    * ->is_root_cache is removed. Instead ->root_cache is moved out of
    the child union and is now used by both root and children. NULL
    indicates a root cache; non-NULL, a memcg one (see the sketch below).

    This patch doesn't cause any observable behavior changes.
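
    A simplified sketch of the reorganized layout described above (other
    fields omitted; exact member names may differ from the patch):

    #include <linux/list.h>

    struct kmem_cache;
    struct mem_cgroup;

    struct memcg_cache_params {
            struct kmem_cache *root_cache;  /* NULL for a root cache */
            union {
                    struct {                /* used by root caches */
                            struct list_head children;
                    };
                    struct {                /* used by memcg child caches */
                            struct mem_cgroup *memcg;
                            struct list_head children_node;
                    };
            };
    };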

    Link: http://lkml.kernel.org/r/20170117235411.9408-5-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Patch series "slab: make memcg slab destruction scalable", v3.

    With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code.

    I've seen machines which end up with hundreds of thousands of caches and
    many millions of kernfs_nodes. The current code is O(N^2) on the total
    number of caches and has synchronous rcu_barrier() and
    synchronize_sched() in the cgroup offline / release path, which is
    executed while holding cgroup_mutex. Combined, this leads to very
    expensive and slow cache destruction operations which can easily keep
    running for half a day.

    This also messes up /proc/slabinfo along with other cache iterating
    operations. seq_file operates on 4k chunks and on each 4k boundary
    tries to seek to the last position in the list. With a huge number of
    caches on the list, this becomes very slow and very prone to the list
    content changing underneath it leading to a lot of missing and/or
    duplicate entries.

    This patchset addresses the scalability problem.

    * Add root and per-memcg lists. Update each user to use the
    appropriate list.

    * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
    and asynchronous.

    * For dying empty slub caches, remove the sysfs files after
    deactivation so that we don't end up with millions of sysfs files
    without any useful information on them.

    This patchset contains the following ten patches.

    0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
    0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
    0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
    0004-slab-reorganize-memcg_cache_params.patch
    0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
    0006-slab-implement-slab_root_caches-list.patch
    0007-slab-introduce-__kmemcg_cache_deactivate.patch
    0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
    0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
    0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch

    0001 reverts an existing optimization to prepare for the following
    changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release
    path batched and asynchronous. 0004-0006 separate out the lists.
    0007-0008 replace synchronize_sched() in slub destruction path with
    call_rcu_sched(). 0009 removes sysfs files early for empty dying
    caches. 0010 makes destruction work items use a workqueue with limited
    concurrency.

    This patch (of 10):

    Revert 89e364db71fb5e ("slub: move synchronize_sched out of slab_mutex on
    shrink").

    With kmem cgroup support enabled, kmem_caches can be created and destroyed
    frequently and a great number of near empty kmem_caches can accumulate if
    there are a lot of transient cgroups and the system is not under memory
    pressure. When memory reclaim starts under such conditions, it can lead
    to consecutive deactivation and destruction of many kmem_caches, easily
    hundreds of thousands on moderately large systems, exposing scalability
    issues in the current slab management code. This is one of the patches to
    address the issue.

    Moving synchronize_sched() out of slab_mutex isn't enough as it's still
    inside cgroup_mutex. The whole deactivation / release path will be
    updated to avoid all synchronous RCU operations. Revert this insufficient
    optimization in preparation to ease future changes.

    Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Cc: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit
    the kmem_cache_node management structure, and puts it into the generic
    kmalloc cache array (e.g. for 128b objects). The name of this cache is
    "kmalloc-node", which is confusing for readers of /proc/slabinfo as the
    cache is used for generic allocations (and not just the kmem_cache_node
    struct) and it appears as if the kmalloc-128 cache is missing.

    An easy solution is to use the kmalloc-<size> name when pre-creating the
    cache, which we can get from the kmalloc_info array.

    Example /proc/slabinfo before the patch:

    ...
    kmalloc-256 1647 1984 256 16 1 : tunables 120 60 8 : slabdata 124 124 828
    kmalloc-192 1974 1974 192 21 1 : tunables 120 60 8 : slabdata 94 94 133
    kmalloc-96 1332 1344 128 32 1 : tunables 120 60 8 : slabdata 42 42 219
    kmalloc-64 2505 5952 64 64 1 : tunables 120 60 8 : slabdata 93 93 715
    kmalloc-32 4278 4464 32 124 1 : tunables 120 60 8 : slabdata 36 36 346
    kmalloc-node 1352 1376 128 32 1 : tunables 120 60 8 : slabdata 43 43 53
    kmem_cache 132 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0

    After the patch:

    ...
    kmalloc-256 1672 2160 256 16 1 : tunables 120 60 8 : slabdata 135 135 807
    kmalloc-192 1992 2016 192 21 1 : tunables 120 60 8 : slabdata 96 96 203
    kmalloc-96 1159 1184 128 32 1 : tunables 120 60 8 : slabdata 37 37 116
    kmalloc-64 2561 4864 64 64 1 : tunables 120 60 8 : slabdata 76 76 785
    kmalloc-32 4253 4340 32 124 1 : tunables 120 60 8 : slabdata 35 35 270
    kmalloc-128 1256 1280 128 32 1 : tunables 120 60 8 : slabdata 40 40 39
    kmem_cache 125 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0

    [vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter]
    Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz
    Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Matthew Wilcox
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

13 Dec, 2016

4 commits

  • Rather than tracking the number of active slabs for each node, track the
    total number of slabs. This is a minor improvement that avoids active
    slab tracking when a slab goes from free to partial or partial to free.

    For slab debugging, this also removes an explicit free count since it
    can easily be inferred from the difference between the number of total
    objects and the number of active objects.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612042020110.115755@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Suggested-by: Joonsoo Kim
    Cc: Greg Thelen
    Cc: Aruna Ramakrishna
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Reading /proc/slabinfo or monitoring slabtop(1) can become very
    expensive if there are many slab caches and if there are very lengthy
    per-node partial and/or free lists.

    Commit 07a63c41fa1f ("mm/slab: improve performance of gathering slabinfo
    stats") addressed the per-node full lists which showed a significant
    improvement when no objects were freed. This patch has the same
    motivation and optimizes the remainder of the usecases where there are
    very lengthy partial and free lists.

    This patch maintains per-node active_slabs (full and partial) and
    free_slabs rather than iterating the lists at runtime when reading
    /proc/slabinfo.
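
    Conceptually, the bookkeeping replaces list walks with counter updates at
    the points where a slab changes state; a sketch with illustrative names
    (not the exact fields added by the patch):

    /* Per-node counters, updated under the node's list_lock. */
    struct node_slab_counts {
            unsigned long active_slabs;     /* slabs on full + partial lists */
            unsigned long free_slabs;       /* slabs on the free list        */
    };

    /* Example transition: a partial slab becomes completely free. */
    static void note_slab_now_free(struct node_slab_counts *n)
    {
            n->active_slabs--;
            n->free_slabs++;
    }

    /*
     * get_slabinfo() can then report active_slabs + free_slabs directly
     * instead of walking slabs_partial and slabs_free with interrupts off.
     */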

    When allocating 100GB of slab from a test cache where every slab page is
    on the partial list, reading /proc/slabinfo (includes all other slab
    caches on the system) takes ~247ms on average with 48 samples.

    As a result of this patch, the same read takes ~0.856ms on average.

    [rientjes@google.com: changelog]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1611081505240.13403@chino.kir.corp.google.com
    Signed-off-by: Greg Thelen
    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Verify that kmem_cache_create() flags are not allocator specific. This
    is done before removing flags that are not available with the current
    configuration.

    The current kmem_cache_create() removes incorrect flags but does not
    validate that callers are using them correctly. This change ensures that
    callers are not trying to create caches with flags that will be ignored
    because they are allocator specific.
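
    The check amounts to rejecting any flag outside a mask of flags the
    kernel knows about, roughly like this (the mask name and its contents
    here are illustrative; the real mask covers every supported SLAB_* flag):

    #include <linux/bug.h>
    #include <linux/errno.h>
    #include <linux/slab.h>

    #define CACHE_CREATE_MASK (SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT)

    static int check_cache_flags(unsigned long flags)
    {
            if (flags & ~CACHE_CREATE_MASK) {
                    /* Caller passed a flag no allocator configuration accepts. */
                    WARN_ON_ONCE(1);
                    return -EINVAL;
            }
            return 0;
    }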

    Link: http://lkml.kernel.org/r/1478553075-120242-2-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     
  • synchronize_sched() is a heavy operation and calling it for each cache
    owned by a memory cgroup being destroyed may take quite some time. What
    is worse, it's currently called under the slab_mutex, stalling all work
    items doing cache creation/destruction.

    Actually, there isn't much point in calling synchronize_sched() for each
    cache - it's enough to call it just once - after setting cpu_partial for
    all caches and before shrinking them. This way, we can also move it out
    of the slab_mutex, which we have to hold for iterating over the slab
    cache list.
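
    Conceptually the change batches the grace period; a sketch with stand-in
    helper names and list (not the real code):

    #include <linux/list.h>
    #include <linux/rcupdate.h>

    struct cache {                          /* stand-in for struct kmem_cache */
            struct list_head list;
    };

    static LIST_HEAD(memcg_caches);

    static void disable_cpu_partial(struct cache *c) { /* ... */ }
    static void shrink_cache(struct cache *c)        { /* ... */ }

    static void deactivate_all_caches(void)
    {
            struct cache *c;

            /* First pass: switch every cache off cpu_partial ... */
            list_for_each_entry(c, &memcg_caches, list)
                    disable_cpu_partial(c);

            /* ... then one grace period for all of them, outside slab_mutex ... */
            synchronize_sched();

            /* ... and only then shrink each cache. */
            list_for_each_entry(c, &memcg_caches, list)
                    shrink_cache(c);
    }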

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
    Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
    Signed-off-by: Vladimir Davydov
    Reported-by: Doug Smythies
    Acked-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

28 Oct, 2016

1 commit

  • On large systems, when some slab caches grow to millions of objects (and
    many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
    seconds. During this time, interrupts are disabled while walking the
    slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
    and this sometimes causes timeouts in other drivers (for instance,
    Infiniband).

    This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
    total number of allocated slabs per node, per cache. This counter is
    updated when a slab is created or destroyed. This enables us to skip
    traversing the slabs_full list while gathering slabinfo statistics, and
    since slabs_full tends to be the biggest list when the cache is large,
    it results in a dramatic performance improvement. Getting slabinfo
    statistics now only requires walking the slabs_free and slabs_partial
    lists, and those lists are usually much smaller than slabs_full.

    We tested this after growing the dentry cache to 70GB, and the
    performance improved from 2s to 5ms.

    Link: http://lkml.kernel.org/r/1472517876-26814-1-git-send-email-aruna.ramakrishna@oracle.com
    Signed-off-by: Aruna Ramakrishna
    Acked-by: David Rientjes
    Cc: Mike Kravetz
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aruna Ramakrishna
     

29 Jul, 2016

1 commit

  • For KASAN builds:
    - switch SLUB allocator to using stackdepot instead of storing the
    allocation/deallocation stacks in the objects;
    - change the freelist hook so that parts of the freelist can be put
    into the quarantine.

    [aryabinin@virtuozzo.com: fixes]
    Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt (Red Hat)
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Andrey Ryabinin
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

27 Jul, 2016

2 commits

  • - Move the memcg_kmem_enabled() check out to the caller. This reduces
    the number of function definitions, making the code easier to follow. At
    the same time it doesn't result in code bloat, because all of these
    functions are used only in one or two places.

    - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't
    have to dive deep into memcg implementation to see which allocations
    are charged and which are not.

    - Refresh comments.

    Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The kernel heap allocators use a sequential freelist, making their
    allocations predictable. This predictability makes kernel heap overflows
    easier to exploit: an attacker can carefully prepare the kernel heap to
    control which chunk follows the one being overflowed.

    For example these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

    ***Problems that needed solving:
    - Randomize the freelist (singly linked) used in the SLUB allocator.
    - Ensure good performance to encourage usage.
    - Get the best entropy possible in the early boot stage.

    ***Parts:
    - 01/02 Reorganize the SLAB freelist randomization to share elements
    with the SLUB implementation.
    - 02/02 The SLUB freelist randomization implementation. A similar
    approach to the SLAB one, but tailored to the singly linked freelist
    used in SLUB.

    ***Performance data:

    The slab_test impact is between 3% and 4% on average for 100000 attempts
    without smp. It is a very focused test; kernbench shows that the overall
    impact on the system is much lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    This patch (of 2):

    This commit reorganizes the previous SLAB freelist randomization to
    prepare for the SLUB implementation. It moves functions that will be
    shared to slab_common.

    The entropy functions are changed to align with the SLUB implementation,
    now using the get_random_(int|long) functions. These functions were
    chosen because they provide a bit more entropy early in boot and better
    performance when specific arch instructions are not available.
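
    A sketch of the shared randomization step, assuming a precomputed array
    of freelist indices (the in-tree helper differs in details; this only
    illustrates the get_random_int()-driven shuffle):

    #include <linux/random.h>

    /* Fill 'list' with 0..count-1 and shuffle it (Fisher-Yates). */
    static void shuffle_freelist_indices(unsigned int *list, unsigned int count)
    {
            unsigned int i, swap, tmp;

            for (i = 0; i < count; i++)
                    list[i] = i;

            if (count < 2)
                    return;

            for (i = count - 1; i > 0; i--) {
                    swap = get_random_int() % (i + 1);
                    tmp = list[i];
                    list[i] = list[swap];
                    list[swap] = tmp;
            }
    }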

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     

21 May, 2016

1 commit

  • Quarantine isolates freed objects in a separate queue. The objects are
    returned to the allocator later, which helps to detect use-after-free
    errors.

    When the object is freed, its state changes from KASAN_STATE_ALLOC to
    KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
    instead of being returned to the allocator, therefore every subsequent
    access to that object triggers a KASAN error, and the error handler is
    able to say where the object has been allocated and deallocated.

    When it's time for the object to leave quarantine, its state becomes
    KASAN_STATE_FREE and it's returned to the allocator. From now on the
    allocator may reuse it for another allocation. Before that happens,
    it's still possible to detect a use-after-free on that object (it
    retains the allocation/deallocation stacks).

    When the allocator reuses this object, the shadow is unpoisoned and the
    old allocation/deallocation stacks are wiped. Therefore a use of this
    object, even an incorrect one, won't trigger a KASAN warning.

    Without the quarantine, it's not guaranteed that objects aren't reused
    immediately; that's why the probability of catching a use-after-free is
    lower than with the quarantine in place.

    Freed objects are first added to per-cpu quarantine queues. When a
    cache is destroyed or memory shrinking is requested, the objects are
    moved into the global quarantine queue. Whenever a kmalloc call allows
    memory reclaiming, the oldest objects are popped out of the global queue
    until the total size of objects in quarantine is less than 3/4 of the
    maximum quarantine size (which is a fraction of installed physical
    memory).

    As long as an object remains in the quarantine, KASAN is able to report
    accesses to it, so the chance of reporting a use-after-free is
    increased. Once the object leaves quarantine, the allocator may reuse
    it, in which case the object is unpoisoned and KASAN can't detect
    incorrect accesses to it.
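
    The reclaim step described above amounts to popping the oldest entries
    until the quarantine shrinks below the watermark; a simplified sketch
    (the 3/4 watermark is from the description, everything else is
    illustrative, not the real data structures):

    #include <linux/list.h>
    #include <linux/types.h>

    struct quarantined_object {             /* illustrative bookkeeping entry */
            struct list_head node;
            size_t size;
    };

    static void release_to_allocator(struct quarantined_object *q) { /* ... */ }

    /* Pop the oldest objects until total size drops below 3/4 of the limit. */
    static void quarantine_reduce_sketch(struct list_head *global_q,
                                         size_t *q_size, size_t q_max)
    {
            while (*q_size > q_max / 4 * 3 && !list_empty(global_q)) {
                    struct quarantined_object *oldest =
                            list_first_entry(global_q, struct quarantined_object, node);

                    list_del(&oldest->node);
                    *q_size -= oldest->size;
                    release_to_allocator(oldest);
            }
    }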

    Right now quarantine support is only enabled for the SLAB allocator.
    Unification of KASAN features in SLAB and SLUB will be done later.

    This patch is based on the "mm: kasan: quarantine" patch originally
    prepared by Dmitry Chernenkov. A number of improvements have been
    suggested by Andrey Ryabinin.

    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

26 Mar, 2016

1 commit

  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

18 Mar, 2016

1 commit


16 Mar, 2016

4 commits

  • SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
    on or off. Expand its use to be able to turn off all consistency
    checks. This gives a nice speed up if you only want features such as
    poisoning or tracing.

    Credit to Mathias Krause for the original work which inspired this
    series.

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • Fix up trivial spelling errors, noticed while reading the code.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Remove the SLAB specific function slab_should_failslab(), by moving the
    check against fault-injection for the bootstrap slab, into the shared
    function should_failslab() (used by both SLAB and SLUB).

    This is a step towards sharing alloc_hook's between SLUB and SLAB.

    This bootstrap slab "kmem_cache" is used for allocating struct
    kmem_cache objects to the allocator itself.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • First step towards sharing alloc_hook's between SLUB and SLAB
    allocators. Move the SLUB allocators *_alloc_hook to the common
    mm/slab.h for internal slab definitions.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

19 Feb, 2016

1 commit

  • When slub_debug's alloc_calls_show is enabled, we try to track the
    location and user of each slab object on every online node, so the
    kmem_cache_node structures and cpu_cache/cpu_slab must not be freed
    until the last reference to the sysfs file is dropped.

    This fixes the following panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: list_locations+0x169/0x4e0
    PGD 257304067 PUD 438456067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
    Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
    task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
    RIP: list_locations+0x169/0x4e0
    Call Trace:
    alloc_calls_show+0x1d/0x30
    slab_attr_show+0x1b/0x30
    sysfs_read_file+0x9a/0x1a0
    vfs_read+0x9c/0x170
    SyS_read+0x58/0xb0
    system_call_fastpath+0x16/0x1b
    Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
    CR2: 0000000000000020

    Separated __kmem_cache_release from __kmem_cache_shutdown, which is now
    called from slab_kmem_cache_release (after the last reference to the
    sysfs file object has been dropped).

    Reintroduced locking in free_partial, as the sysfs file might access the
    cache's partial list after shutdown; this is a partial revert of commit
    69cb8e6b7c29 ("slub: free slabs without holding locks"). Zap
    __remove_partial and use remove_partial (w/o underscores), as
    free_partial now takes list_lock; this is a partial revert of commit
    1e4dd9461fab ("slub: do not assert not having lock in removing freed
    partial").

    Signed-off-by: Dmitry Safonov
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     

21 Jan, 2016

1 commit


15 Jan, 2016

1 commit

  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed
    to kmem_cache_create will force accounting for every allocation from
    this cache even if __GFP_ACCOUNT is not passed.
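
    A usage sketch (cache name and object type are hypothetical):

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/slab.h>

    struct my_request { int id; };          /* hypothetical object type */

    static struct kmem_cache *req_cachep;

    static int __init my_cache_init(void)
    {
            /* Accounting is requested once, at cache creation ... */
            req_cachep = kmem_cache_create("my_requests", sizeof(struct my_request),
                                           0, SLAB_ACCOUNT, NULL);
            return req_cachep ? 0 : -ENOMEM;
    }

    static struct my_request *alloc_request(void)
    {
            /* ... so call sites no longer need to pass __GFP_ACCOUNT. */
            return kmem_cache_alloc(req_cachep, GFP_KERNEL);
    }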

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Nov, 2015

1 commit

  • Adjust kmem_cache_alloc_bulk API before we have any real users.

    Adjust API to return type 'int' instead of previously type 'bool'. This
    is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary, when
    specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to make
    bulk alloc run without local IRQs disabled, using a cmpxchg approach
    that "steals" the entire c->freelist or page->freelist. To avoid
    overshooting we would stop processing at a slab-page boundary; else we
    would always end up returning some objects at the cost of another
    cmpxchg.

    To keep compatible with future users of this API linking against an older
    kernel when using the new flag, we need to return the number of allocated
    objects with this API change.
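
    A caller-side sketch under the adjusted API (the helper name and usage
    are made up):

    #include <linux/errno.h>
    #include <linux/slab.h>

    /* Allocate up to 'want' objects in one call and report how many we got. */
    static int grab_objects(struct kmem_cache *cachep, void **objs, size_t want)
    {
            int got;

            /* Now returns the number of objects allocated (0 on failure)
             * instead of the old bool. */
            got = kmem_cache_alloc_bulk(cachep, GFP_KERNEL, want, objs);
            if (!got)
                    return -ENOMEM;

            /* ... objs[0..got-1] are valid here ... */

            kmem_cache_free_bulk(cachep, got, objs);
            return got;
    }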

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

06 Nov, 2015

2 commits

  • We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
    uncharging kmem pages to memcg, but currently they are not used for
    charging slab pages (i.e. they are only used for charging pages allocated
    with alloc_kmem_pages). The only reason why the slab subsystem uses
    special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
    needs to charge to the memcg of the kmem cache, while memcg_charge_kmem
    charges to the memcg that the current task belongs to.

    To remove this diversity, this patch adds an extra argument to
    __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
    not NULL, the function tries to charge to the memcg it points to,
    otherwise it charges to the current context. Next, it makes the slab
    subsystem use this function to charge slab pages.

    Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
    in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
    __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
    don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
    Besides, one can now detect which memcg a slab page belongs to by reading
    /proc/kpagecgroup.

    Note, this patch switches slab to charge-after-alloc design. Since this
    design is already used for all other memcg charges, it should not make any
    difference.

    [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we do not clear pointers to per memcg caches in the
    memcg_params.memcg_caches array when a global cache is destroyed with
    kmem_cache_destroy.

    This is fine if the global cache does get destroyed. However, a cache can
    be left on the list if it still has active objects when kmem_cache_destroy
    is called (due to a memory leak). If this happens, the entries in the
    array will point to already freed areas, which is likely to result in data
    corruption when the cache is reused (via slab merging).

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

05 Sep, 2015

2 commits

  • While debugging a networking issue, I hit a condition that triggered an
    object to be freed into the wrong kmem cache, and thus triggered the
    warning in cache_from_obj().

    The arguments in the error message are in the wrong order: the location
    of the object's kmem cache is in cachep, not s.

    Signed-off-by: Daniel Borkmann
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Borkmann
     
  • Add the basic infrastructure for alloc/free operations on pointer arrays.
    It includes a generic function in the common slab code that is used in
    this infrastructure patch to create the unoptimized functionality for slab
    bulk operations.

    Allocators can then provide optimized allocation functions for situations
    in which large numbers of objects are needed. These optimizations may
    avoid taking locks repeatedly and bypass metadata creation if all objects
    in slab pages can be used to provide the objects required.

    Allocators can extend the skeletons provided and add their own code to the
    bulk alloc and free functions. They can keep the generic allocation and
    freeing and just fall back to those if optimizations would not work (like
    for example when debugging is on).
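
    The unoptimized fallback is essentially a loop over the existing
    single-object entry points; roughly (simplified from the description
    above, not the exact in-tree code):

    #include <linux/slab.h>

    /* Generic bulk free: just free each object individually. */
    static void generic_free_bulk(struct kmem_cache *s, size_t nr, void **p)
    {
            size_t i;

            for (i = 0; i < nr; i++)
                    kmem_cache_free(s, p[i]);
    }

    /* Generic bulk alloc: allocate one by one, undoing everything on failure. */
    static bool generic_alloc_bulk(struct kmem_cache *s, gfp_t flags,
                                   size_t nr, void **p)
    {
            size_t i;

            for (i = 0; i < nr; i++) {
                    p[i] = kmem_cache_alloc(s, flags);
                    if (!p[i]) {
                            generic_free_bulk(s, i, p);
                            return false;
                    }
            }
            return true;
    }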

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

25 Jun, 2015

1 commit

  • This patch moves the initialization of the size_index table slightly
    earlier so that the first few kmem_cache_node's can be safely allocated
    when KMALLOC_MIN_SIZE is large.

    There are currently two ways to generate indices into kmalloc_caches (via
    kmalloc_index() and via the size_index table in slab_common.c) and on some
    arches (possibly only MIPS) they potentially disagree with each other
    until create_kmalloc_caches() has been called. It seems that the
    intention is that the size_index table is a fast equivalent to
    kmalloc_index() and that create_kmalloc_caches() patches the table to
    return the correct value for the cases where kmalloc_index()'s
    if-statements apply.

    The failing sequence was:
    * kmalloc_caches contains NULL elements
    * kmem_cache_init initialises the element that 'struct
    kmem_cache_node' will be allocated to. For 32-bit Mips, this is a
    56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
    * init_list is called which calls kmalloc_node to allocate a 'struct
    kmem_cache_node'.
    * kmalloc_slab selects the kmem_caches element using
    size_index[size_index_elem(size)]. For MIPS, size is 56, and the
    expression returns 6.
    * This element of kmalloc_caches is NULL and allocation fails.
    * If it had not already failed, it would have called
    create_kmalloc_caches() at this point which would have changed
    size_index[size_index_elem(size)] to 7.

    I don't believe the bug to be LLVM specific but GCC doesn't normally
    encounter the problem. I haven't been able to identify exactly what GCC
    is doing better (probably inlining) but it seems that GCC is managing to
    optimize to the point that it eliminates the problematic allocations.
    This theory is supported by the fact that GCC can be made to fail in the
    same way by changing inline, __inline, __inline__, and __always_inline in
    include/linux/compiler-gcc.h such that they don't actually inline things.

    Signed-off-by: Daniel Sanders
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Sanders