27 Apr, 2022

1 commit

  • commit 2dfe63e61cc31ee59ce951672b0850b5229cd5b0 upstream.

    Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
    producing garbage data due to the object not actually being maintained
    by SLAB or SLUB.

    Fix this by implementing __kfence_obj_info() that copies relevant
    information to struct kmem_obj_info when the object was allocated by
    KFENCE; this is called by a common kmem_obj_info(), which also calls the
    slab/slub/slob specific variant now called __kmem_obj_info().

    For completeness, kmem_dump_obj() now displays if the object was
    allocated by KFENCE.
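
    A minimal sketch of the resulting dispatch (function names are from the
    commit; the exact parameter types vary between kernel versions, so treat
    this as an illustration rather than the verbatim upstream code):

    /* Common entry point: let KFENCE fill in the info for its own objects. */
    void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
    {
            if (__kfence_obj_info(kpp, object, slab))
                    return;                         /* KFENCE object: info filled in */
            __kmem_obj_info(kpp, object, slab);     /* slab/slub/slob specific variant */
    }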

    Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
    Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
    Fixes: b89fb5ef0ce6 ("mm, kfence: insert KFENCE hooks for SLUB")
    Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB")
    Signed-off-by: Marco Elver
    Reviewed-by: Hyeonggon Yoo
    Reported-by: kernel test robot
    Acked-by: Vlastimil Babka [slab]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Marco Elver
     

25 Nov, 2021

1 commit

  • commit 34dbc3aaf5d9e89ba6cc5e24add9458c21ab1950 upstream.

    When kmemleak is enabled for SLOB, the system does not boot and does not
    print anything to the console. At a very early stage in the boot
    process we hit infinite recursion from kmemleak_init() and eventually the
    kernel crashes.

    kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but
    kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not
    valid for SLOB.

    Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB.
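
    The fix amounts to letting the SLOB branch of the cache-flag mask in
    mm/slab.h keep SLAB_NOLEAKTRACE, so that CACHE_CREATE_MASK no longer
    strips it. A sketch of the idea (macro names are from mm/slab.h, but this
    is simplified, not the verbatim diff):

    #ifdef CONFIG_SLOB
    #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE)     /* previously effectively empty */
    #endif

    /* Common flags available with the current configuration */
    #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS)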

    Link: https://lkml.kernel.org/r/20211115020850.3154366-1-rkovhaev@gmail.com
    Fixes: d8843922fba4 ("slab: Ignore internal flags in cache creation")
    Signed-off-by: Rustam Kovhaev
    Acked-by: Vlastimil Babka
    Reviewed-by: Muchun Song
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Catalin Marinas
    Cc: Greg Kroah-Hartman
    Cc: Glauber Costa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rustam Kovhaev
     

31 Jul, 2021

1 commit

  • When I use kfree_rcu() to free a large block of memory allocated by
    kmalloc_node(), the following dump occurs.

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    [...]
    Oops: 0000 [#1] SMP
    [...]
    Workqueue: events kfree_rcu_work
    RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline]
    RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline]
    RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363
    [...]
    Call Trace:
    kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293
    kfree_bulk include/linux/slab.h:413 [inline]
    kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300
    process_one_work+0x207/0x530 kernel/workqueue.c:2276
    worker_thread+0x320/0x610 kernel/workqueue.c:2422
    kthread+0x13d/0x160 kernel/kthread.c:313
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

    When kmalloc_node() allocates a large amount of memory, a page is
    allocated rather than a slab, so when that memory is freed via
    kfree_rcu(), it should not be passed to memcg_slab_free_hook(), because
    memcg_slab_free_hook() is only meant for slab objects.

    Use page_objcgs_check() instead of page_objcgs() in memcg_slab_free_hook()
    to fix this bug.
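
    A sketch of the fixed check (simplified; the surrounding mm/slab.h code
    differs between kernel versions):

    static inline void memcg_slab_free_hook(struct kmem_cache *s, void **p, int objects)
    {
            struct obj_cgroup **objcgs;
            struct page *page;
            int i;

            for (i = 0; i < objects; i++) {
                    page = virt_to_head_page(p[i]);
                    /* returns NULL for non-slab pages, e.g. large kmalloc allocations */
                    objcgs = page_objcgs_check(page);
                    if (!objcgs)
                            continue;
                    /* ... look up the object's obj_cgroup and uncharge it ... */
            }
    }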

    Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com
    Fixes: 270c6a71460e ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data")
    Signed-off-by: Wang Hai
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Reviewed-by: Kefeng Wang
    Reviewed-by: Muchun Song
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Alexei Starovoitov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Hai
     

16 Jul, 2021

1 commit

  • Move the helper to check slub_debug_enabled, so that we can confine the
    use of #ifdef outside slub.c as well.
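
    A sketch of the helper as it might sit in mm/slab.h after the move (the
    __slub_debug_enabled() name is from this series; the static key is
    declared elsewhere in the header):

    #ifdef CONFIG_SLUB_DEBUG
    static inline bool __slub_debug_enabled(void)
    {
            return static_branch_unlikely(&slub_debug_enabled);
    }
    #else
    static inline bool __slub_debug_enabled(void)
    {
            return false;
    }
    #endif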

    Link: https://lkml.kernel.org/r/20210705103229.8505-2-yee.lee@mediatek.com
    Signed-off-by: Marco Elver
    Signed-off-by: Yee Lee
    Suggested-by: Matthew Wilcox
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Chinwen Chang
    Cc: Dmitry Vyukov
    Cc: Kuan-Ying Lee
    Cc: Nicholas Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     

05 Jul, 2021

1 commit

  • …git/paulmck/linux-rcu

    Pull RCU updates from Paul McKenney:

    - Bitmap parsing support for "all" as an alias for all bits

    - Documentation updates

    - Miscellaneous fixes, including some that overlap into mm and lockdep

    - kvfree_rcu() updates

    - mem_dump_obj() updates, with acks from one of the slab-allocator
    maintainers

    - RCU NOCB CPU updates, including limited deoffloading

    - SRCU updates

    - Tasks-RCU updates

    - Torture-test updates

    * 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
    tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
    rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
    rcu: Add missing __releases() annotation
    rcu: Remove obsolete rcu_read_unlock() deadlock commentary
    rcu: Improve comments describing RCU read-side critical sections
    rcu: Create an unrcu_pointer() to remove __rcu from a pointer
    srcu: Early test SRCU polling start
    rcu: Fix various typos in comments
    rcu/nocb: Unify timers
    rcu/nocb: Prepare for fine-grained deferred wakeup
    rcu/nocb: Only cancel nocb timer if not polling
    rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
    rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
    rcu/nocb: Allow de-offloading rdp leader
    rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
    rcu: Don't penalize priority boosting when there is nothing to boost
    rcu: Point to documentation of ordering guarantees
    rcu: Make rcu_gp_cleanup() be noinline for tracing
    rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
    rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
    ...

    Linus Torvalds
     

30 Jun, 2021

4 commits

  • Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4.

    Since the merging of the new slab memory controller in v5.9, the page
    structure stores a pointer to objcg pointer array for slab pages. When
    the slab has no used objects, it can be freed in free_slab() which will
    call kfree() to free the objcg pointer array in
    memcg_alloc_page_obj_cgroups(). If it happens that the objcg pointer
    array is the last used object in its slab, that slab may then be freed
    which may cause kfree() to be called again.

    With the right workload, the slab cache may be set up in a way that allows
    the recursive kfree() calling loop to nest deep enough to cause a kernel
    stack overflow and panic the system. In fact, we have a reproducer that
    can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256
    and kmalloc-rcl-128 slabs with the following kfree() loop recursively
    called 74 times:

    [ 285.520739]  [] kfree+0x4bc/0x560
    [ 285.520740]  [] __free_slab+0xc6/0x228
    [ 285.520741]  [] __slab_free+0x3c2/0x3e0
    [ 285.520742]  [] kfree+0x4bc/0x560
     :

    While investigating this issue, I also found an issue on the allocation
    side. If the objcg pointer array happens to come from the same slab, or a
    circular dependency linkage is formed with multiple slabs, those affected
    slabs can never be freed again.

    This patch series addresses these two issues by introducing a new set of
    kmalloc-cg-<n> caches split from the kmalloc-<n> caches. The new set will
    only contain non-reclaimable and non-dma objects that are accounted in
    memory cgroups, whereas the old set is now for unaccounted objects only.
    By making this split, all the objcg pointer arrays will come from the
    kmalloc-<n> caches, but those caches will never hold any objcg pointer
    array. As a result, the deeply nested kfree() calls and the unfreeable
    slab problems are now gone.

    This patch (of 4):

    Since the merging of the new slab memory controller in v5.9, the page
    structure may store a pointer to obj_cgroup pointer array for slab pages.
    Currently, only the __GFP_ACCOUNT bit is masked off. However, the array
    is not readily reclaimable and doesn't need to come from the DMA buffer.
    So those GFP bits should be masked off as well.

    Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
    that it is consistently applied no matter where it is called.
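
    A sketch of the idea (the mask and helper names below follow mainline
    mm/memcontrol.c, but the body is simplified):

    /* GFP bits that make no sense for the objcg pointer array itself */
    #define OBJCGS_CLEAR_MASK   (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)

    int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
                                     gfp_t gfp, bool new_page)
    {
            unsigned int objects = objs_per_slab_page(s, page);
            void *vec;

            gfp &= ~OBJCGS_CLEAR_MASK;      /* done here so every caller gets it */
            vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
                               page_to_nid(page));
            if (!vec)
                    return -ENOMEM;
            /* ... install vec into page->memcg_data ... */
            return 0;
    }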

    Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com
    Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com
    Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
    Signed-off-by: Waiman Long
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Reviewed-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6.

    With the recent introduction of the new slab memory controller, we
    eliminate the need for having separate kmemcaches for each memory cgroup
    and reduce overall kernel memory usage. However, we also add additional
    memory accounting overhead to each call of kmem_cache_alloc() and
    kmem_cache_free().

    For workloads that require a lot of kmemcache allocations and
    de-allocations, they may experience performance regression as illustrated
    in [1] and [2].

    A simple kernel module that performs a repeated loop of 100,000,000
    kmem_cache_alloc() and kmem_cache_free() calls on either a small 32-byte
    object or a big 4k object at module init time, with a batch size of 4 (4
    kmalloc's followed by 4 kfree's), is used for benchmarking. The
    benchmarking tool was run on a kernel based on linux-next-20210419. The
    test was run on a CascadeLake server with turbo-boosting disabled to
    reduce run-to-run variation.
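
    A hypothetical sketch of such a benchmark module (not the actual tool
    used for the numbers below; object size, loop count and batch size follow
    the description above):

    #include <linux/module.h>
    #include <linux/slab.h>

    #define NR_LOOPS        100000000UL
    #define BATCH           4

    static int __init slab_bench_init(void)
    {
            void *objs[BATCH];
            unsigned long i;
            int j;

            for (i = 0; i < NR_LOOPS; i += BATCH) {
                    for (j = 0; j < BATCH; j++)
                            objs[j] = kmalloc(32, GFP_KERNEL | __GFP_ACCOUNT);
                    for (j = 0; j < BATCH; j++)
                            kfree(objs[j]);
            }
            return 0;
    }

    static void __exit slab_bench_exit(void) { }

    module_init(slab_bench_init);
    module_exit(slab_bench_exit);
    MODULE_LICENSE("GPL");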

    The small object test exercises mainly the object stock charging and
    vmstat update code paths. The large object test also exercises the
    refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code
    paths.

    With memory accounting disabled, the run time was 3.130s for both the
    small object and the big object tests.

    With memory accounting enabled, both cgroup v1 and v2 showed similar
    results in the small object test. The performance results of the large
    object test, however, differed between cgroup v1 and v2.

    The execution times with the application of various patches in the
    patchset were:

    Applied patches   Run time   Accounting overhead   %age 1   %age 2
    ---------------   --------   -------------------   ------   ------

    Small 32-byte object:
    None              11.634s     8.504s               100.0%   271.7%
    1-2                9.425s     6.295s                74.0%   201.1%
    1-3                9.708s     6.578s                77.4%   210.2%
    1-4                8.062s     4.932s                58.0%   157.6%

    Large 4k object (v2):
    None              22.107s    18.977s               100.0%   606.3%
    1-2               20.960s    17.830s                94.0%   569.6%
    1-3               14.238s    11.108s                58.5%   354.9%
    1-4               11.329s     8.199s                43.2%   261.9%

    Large 4k object (v1):
    None              36.807s    33.677s               100.0%  1075.9%
    1-2               36.648s    33.518s                99.5%  1070.9%
    1-3               22.345s    19.215s                57.1%   613.9%
    1-4               18.662s    15.532s                46.1%   496.2%

    N.B. %age 1 = overhead/unpatched overhead
         %age 2 = overhead/accounting-disabled time

    Patch 2 (vmstat data stock caching) helps in both the small object test
    and the large v2 object test. It doesn't help much in the v1 big object
    test.

    Patch 3 (refill_obj_stock improvement) doesn't help much in the small
    object test, but offers a significant performance improvement for the
    large object test (both v1 and v2).

    Patch 4 (eliminating irq disable/enable) helps in all test cases.

    To test for the extreme case, a multi-threaded kmalloc/kfree
    microbenchmark was run on the 2-socket 48-core 96-thread system with
    96 testing threads in the same memcg doing kmalloc+kfree of a 4k object
    with accounting enabled for 10s. The total number of kmalloc+kfree done
    in kilo operations per second (kops/s) were as follows:

    Applied patches   v1 kops/s   v1 change   v2 kops/s   v2 change
    ---------------   ---------   ---------   ---------   ---------
    None                  3,520       1.00X       6,242       1.00X
    1-2                   4,304       1.22X       8,478       1.36X
    1-3                   4,731       1.34X     418,142      66.99X
    1-4                   4,587       1.30X     438,838      70.30X

    With memory accounting disabled, the kmalloc/kfree rate was 1,481,291
    kops/s. This test shows how significant the memory accounting overhead
    can be in some extreme situations.

    For this multithreaded test, the improvement from patch 2 mainly
    comes from the conditional atomic xchg of objcg->nr_charged_bytes in
    mod_objcg_state(). By using an unconditional xchg, the operation rates
    were similar to the unpatched kernel.

    Patch 3 eliminates the single highly contended cacheline of
    objcg->nr_charged_bytes for cgroup v2, leading to a huge performance
    improvement. Cgroup v1, however, still has another highly contended
    cacheline in the shared page counter &memcg->kmem, so the improvement
    there is only modest.

    Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as
    eliminating the irq_disable/irq_enable overhead seems to aggravate the
    cacheline contention.

    [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
    [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

    This patch (of 4):

    mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that
    further optimization can be done to it in later patches without exposing
    unnecessary details to other mm components.

    Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com
    Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Cc: Alex Shi
    Cc: Chris Down
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Masayoshi Mizuma
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Muchun Song
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Xing Zhengjun
    Cc: Yafang Shao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • The alloc_calls and free_calls implementation in sysfs has two issues:
    one is the PAGE_SIZE limitation of sysfs, and the other is that it does
    not adhere to the "one value per file" rule.

    To overcome these issues, move the alloc_calls and free_calls
    implementation to debugfs.

    The debugfs directory for a cache is created if the SLAB_STORE_USER flag
    is set.

    Rename alloc_calls/free_calls to alloc_traces/free_traces, to be in line
    with what they do.

    [faiyazm@codeaurora.org: fix the leak of alloc/free traces debugfs interface]
    Link: https://lkml.kernel.org/r/1624248060-30286-1-git-send-email-faiyazm@codeaurora.org

    Link: https://lkml.kernel.org/r/1623438200-19361-1-git-send-email-faiyazm@codeaurora.org
    Signed-off-by: Faiyaz Mohammed
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Greg Kroah-Hartman
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Faiyaz Mohammed
     
  • SLUB has a resiliency_test() function, which is hidden behind an #ifdef
    SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody runs it.
    KUnit should be a proper replacement for it.

    Try changing a byte in the redzone after allocation, and changing the
    pointer to the next free node, the first byte, the 50th byte and a
    redzone byte. Check whether validation finds the errors.

    There are several differences from the original resiliency test: the
    tests create their own caches with a known state instead of corrupting
    the shared kmalloc caches.

    The freepointer corruption uses the correct offset; the original
    resiliency test was broken by the freepointer changes.

    The "change a random byte" test is dropped, because it has no meaning in
    this form, where we need deterministic results.

    Add a new option, CONFIG_SLUB_KUNIT_TEST, in Kconfig. The next_pointer,
    first_word and clobber_50th_byte tests do not run with the KASAN option
    on, because the tests deliberately modify non-allocated objects.

    Use kunit_resource to count errors in the cache and silence bug reports.
    An error is counted whenever slab_bug() or slab_fix() is called, or when
    the count of pages is wrong.
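
    A sketch of one such test case, modelled on the description above (the
    validate_slab_cache() call and the slab_errors counter are internal
    test/SLUB plumbing and are assumed here):

    static void test_clobber_redzone(struct kunit *test)
    {
            struct kmem_cache *s = kmem_cache_create("TestSlub_RZ", 64, 0,
                                                     SLAB_RED_ZONE, NULL);
            u8 *p = kmem_cache_alloc(s, GFP_KERNEL);

            kasan_disable_current();        /* the test touches out-of-object bytes */
            p[64] = 0x12;                   /* corrupt the first redzone byte */

            validate_slab_cache(s);
            KUNIT_EXPECT_EQ(test, 2, slab_errors);  /* validation must report it */

            kasan_enable_current();
            kmem_cache_free(s, p);
            kmem_cache_destroy(s);
    }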

    [glittao@gmail.com: remove unused function test_exit(), from SLUB KUnit test]
    Link: https://lkml.kernel.org/r/20210512140656.12083-1-glittao@gmail.com
    [akpm@linux-foundation.org: export kasan_enable/disable_current to modules]

    Link: https://lkml.kernel.org/r/20210511150734.3492-2-glittao@gmail.com
    Signed-off-by: Oliver Glitta
    Reviewed-by: Vlastimil Babka
    Acked-by: Daniel Latypov
    Acked-by: Marco Elver
    Cc: Brendan Higgins
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver Glitta
     

11 May, 2021

1 commit

  • This commit enables a stack dump for the last free of an object:

    slab kmalloc-64 start c8ab0140 data offset 64 pointer offset 0 size 64 allocated at meminfo_proc_show+0x40/0x4fc
    [ 20.192078] meminfo_proc_show+0x40/0x4fc
    [ 20.192263] seq_read_iter+0x18c/0x4c4
    [ 20.192430] proc_reg_read_iter+0x84/0xac
    [ 20.192617] generic_file_splice_read+0xe8/0x17c
    [ 20.192816] splice_direct_to_actor+0xb8/0x290
    [ 20.193008] do_splice_direct+0xa0/0xe0
    [ 20.193185] do_sendfile+0x2d0/0x438
    [ 20.193345] sys_sendfile64+0x12c/0x140
    [ 20.193523] ret_fast_syscall+0x0/0x58
    [ 20.193695] 0xbeeacde4
    [ 20.193822] Free path:
    [ 20.193935] meminfo_proc_show+0x5c/0x4fc
    [ 20.194115] seq_read_iter+0x18c/0x4c4
    [ 20.194285] proc_reg_read_iter+0x84/0xac
    [ 20.194475] generic_file_splice_read+0xe8/0x17c
    [ 20.194685] splice_direct_to_actor+0xb8/0x290
    [ 20.194870] do_splice_direct+0xa0/0xe0
    [ 20.195014] do_sendfile+0x2d0/0x438
    [ 20.195174] sys_sendfile64+0x12c/0x140
    [ 20.195336] ret_fast_syscall+0x0/0x58
    [ 20.195491] 0xbeeacde4

    Acked-by: Vlastimil Babka
    Co-developed-by: Vaneet Narang
    Signed-off-by: Vaneet Narang
    Signed-off-by: Maninder Singh
    Signed-off-by: Paul E. McKenney

    Maninder Singh
     

01 May, 2021

1 commit

  • This change uses the previously added memory initialization feature of
    HW_TAGS KASAN routines for slab memory when init_on_alloc is enabled.

    With this change, memory initialization memset() is no longer called when
    both HW_TAGS KASAN and init_on_alloc are enabled. Instead, memory is
    initialized in KASAN runtime.

    The memory initialization memset() is moved into slab_post_alloc_hook()
    that currently directly follows the initialization loop. A new argument
    is added to slab_post_alloc_hook() that indicates whether to initialize
    the memory or not.

    To avoid discrepancies with which memory gets initialized that can be
    caused by future changes, both KASAN hook and initialization memset() are
    put together and a warning comment is added.

    Combining setting allocation tags with memory initialization improves
    HW_TAGS KASAN performance when init_on_alloc is enabled.
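
    A sketch of the resulting hook (simplified from mm/slab.h of this era;
    kmemleak and memcg details omitted, and kasan_has_integrated_init() is
    the helper added earlier in the same series):

    static inline void slab_post_alloc_hook(struct kmem_cache *s,
                                            struct obj_cgroup *objcg, gfp_t flags,
                                            size_t size, void **p, bool init)
    {
            size_t i;

            /*
             * As memory initialization might be integrated into KASAN,
             * kasan_slab_alloc() and the initialization memset() must be
             * kept together to avoid discrepancies in which memory gets
             * initialized.
             */
            for (i = 0; i < size; i++) {
                    p[i] = kasan_slab_alloc(s, p[i], flags, init);
                    if (p[i] && init && !kasan_has_integrated_init())
                            memset(p[i], 0, s->object_size);
            }
            /* ... kmemleak and memcg post-alloc hooks follow ... */
    }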

    Link: https://lkml.kernel.org/r/c1292aeb5d519da221ec74a0684a949b027d7720.1615296150.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Marco Elver
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Branislav Rankov
    Cc: Catalin Marinas
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgenii Stepanov
    Cc: Joonsoo Kim
    Cc: Kevin Brodsky
    Cc: Pekka Enberg
    Cc: Peter Collingbourne
    Cc: Vincenzo Frascino
    Cc: Vlastimil Babka
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

29 Apr, 2021

1 commit

  • Pull RCU updates from Ingo Molnar:

    - Support for "N" as alias for last bit in bitmap parsing library (eg
    using syntax like "nohz_full=2-N")

    - kvfree_rcu updates

    - mem_dump_obj() updates. (One of these is to mm, but was suggested by
    Andrew Morton.)

    - RCU callback offloading update

    - Polling RCU grace-period interfaces

    - Realtime-related RCU updates

    - Tasks-RCU updates

    - Torture-test updates

    - Torture-test scripting updates

    - Miscellaneous fixes

    * tag 'core-rcu-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
    rcutorture: Test start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
    rcu: Provide polling interfaces for Tiny RCU grace periods
    torture: Fix kvm.sh --datestamp regex check
    torture: Consolidate qemu-cmd duration editing into kvm-transform.sh
    torture: Print proper vmlinux path for kvm-again.sh runs
    torture: Make TORTURE_TRUST_MAKE available in kvm-again.sh environment
    torture: Make kvm-transform.sh update jitter commands
    torture: Add --duration argument to kvm-again.sh
    torture: Add kvm-again.sh to rerun a previous torture-test
    torture: Create a "batches" file for build reuse
    torture: De-capitalize TORTURE_SUITE
    torture: Make upper-case-only no-dot no-slash scenario names official
    torture: Rename SRCU-t and SRCU-u to avoid lowercase characters
    torture: Remove no-mpstat error message
    torture: Record kvm-test-1-run.sh and kvm-test-1-run-qemu.sh PIDs
    torture: Record jitter start/stop commands
    torture: Extract kvm-test-1-run-qemu.sh from kvm-test-1-run.sh
    torture: Record TORTURE_KCONFIG_GDB_ARG in qemu-cmd
    torture: Abstract jitter.sh start/stop into scripts
    rcu: Provide polling interfaces for Tree RCU grace periods
    ...

    Linus Torvalds
     

08 Apr, 2021

1 commit

  • The state of CONFIG_INIT_ON_ALLOC_DEFAULT_ON (and ...ON_FREE...) did not
    change the assembly ordering of the static branches: they were always out
    of line. Use the new jump_label macros to check the CONFIG settings to
    default to the "expected" state, which slightly optimizes the resulting
    assembly code.
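
    A sketch of the pattern (want_init_on_alloc() and the *_MAYBE jump-label
    macros exist in mainline; shown simplified):

    /* include/linux/mm.h (sketch) */
    DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);

    static inline bool want_init_on_alloc(gfp_t flags)
    {
            if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
                                    &init_on_alloc))
                    return true;
            return flags & __GFP_ZERO;
    }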

    Signed-off-by: Kees Cook
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexander Potapenko
    Acked-by: Vlastimil Babka
    Link: https://lore.kernel.org/r/20210401232347.2791257-3-keescook@chromium.org

    Kees Cook
     

09 Mar, 2021

1 commit

  • The mem_dump_obj() functionality adds a few hundred bytes, which is a
    small price to pay. Except on kernels built with CONFIG_PRINTK=n, in
    which mem_dump_obj() messages will be suppressed. This commit therefore
    makes mem_dump_obj() be a static inline empty function on kernels built
    with CONFIG_PRINTK=n and excludes all of its support functions as well.
    This avoids kernel bloat on systems that cannot use mem_dump_obj().
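
    A sketch of the resulting arrangement (simplified):

    /* include/linux/mm.h (sketch) */
    #ifdef CONFIG_PRINTK
    void mem_dump_obj(void *object);
    #else
    static inline void mem_dump_obj(void *object) {}
    #endif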

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Suggested-by: Andrew Morton
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

25 Feb, 2021

2 commits

  • In general it's unknown in advance if a slab page will contain accounted
    objects or not. In order to avoid memory waste, an obj_cgroup vector is
    allocated dynamically when the need to account a new object arises. Such
    an approach is memory efficient, but requires an expensive cmpxchg() to
    set up the memcg/objcgs pointer, because an allocation can race with a
    different allocation on another cpu.

    But in some common cases it's known for sure that a slab page will contain
    accounted objects: if the page belongs to a slab cache with a SLAB_ACCOUNT
    flag set. This includes such popular objects as vm_area_struct, anon_vma,
    task_struct, etc.

    In such cases we can pre-allocate the objcgs vector and simply assign it
    to the page without any atomic operations, because at this early stage
    the page is not visible to anyone else.

    A very simplistic benchmark (allocating 10000000 64-byte objects in a
    row) shows a ~15% win. In real life it seems that most workloads are
    not very sensitive to the speed of (accounted) slab allocations.
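
    A sketch of the resulting allocation-side fast path (simplified; based on
    the description above, not the exact upstream code):

    static __always_inline void account_slab_page(struct page *page, int order,
                                                  struct kmem_cache *s, gfp_t gfp)
    {
            /*
             * SLAB_ACCOUNT caches will definitely need the objcg vector, so
             * allocate it up front while the page is still private to this
             * CPU -- no cmpxchg() needed.
             */
            if (memcg_kmem_enabled() && (s->flags & SLAB_ACCOUNT))
                    memcg_alloc_page_obj_cgroups(page, s, gfp, true);

            mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
                                PAGE_SIZE << order);
    }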

    [guro@fb.com: open-code set_page_objcgs() and add some comments, by Johannes]
    Link: https://lkml.kernel.org/r/20201113001926.GA2934489@carbon.dhcp.thefacebook.com
    [akpm@linux-foundation.org: fix it for mm-slub-call-account_slab_page-after-slab-page-initialization-fix.patch]

    Link: https://lkml.kernel.org/r/20201110195753.530157-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This argument hasn't been used since e153362a50a3 ("slub: Remove objsize
    check in kmem_cache_flags()") so simply remove it.

    Link: https://lkml.kernel.org/r/20210126095733.974665-1-nborisov@suse.com
    Signed-off-by: Nikolay Borisov
    Reviewed-by: Miaohe Lin
    Reviewed-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     

23 Jan, 2021

1 commit

  • There are kernel facilities such as per-CPU reference counts that give
    error messages in generic handlers or callbacks, whose messages are
    unenlightening. In the case of per-CPU reference-count underflow, this
    is not a problem when creating a new use of this facility because in that
    case the bug is almost certainly in the code implementing that new use.
    However, trouble arises when deploying across many systems, which might
    exercise corner cases that were not seen during development and testing.
    Here, it would be really nice to get some kind of hint as to which of
    several uses the underflow was caused by.

    This commit therefore exposes a mem_dump_obj() function that takes
    a pointer to memory (which must still be allocated if it has been
    dynamically allocated) and prints available information on where that
    memory came from. This pointer can reference the middle of the block as
    well as the beginning of the block, as needed by things like RCU callback
    functions and timer handlers that might not know where the beginning of
    the memory block is. These functions and handlers can use mem_dump_obj()
    to print out better hints as to where the problem might lie.

    The information printed can depend on kernel configuration. For example,
    the allocation return address can be printed only for slab and slub,
    and even then only when the necessary debug has been enabled. For slab,
    build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
    to the next power of two or pass SLAB_STORE_USER when creating the
    kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
    boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
    if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
    to enable printing of the allocation-time stack trace.
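
    A sketch of the intended use from a callback that only has the pointer
    (report_bad_ref() is a hypothetical helper; mem_dump_obj() is the API
    added by this commit):

    #include <linux/mm.h>

    static void report_bad_ref(void *obj)
    {
            pr_err("unexpected refcount underflow for object %px\n", obj);
            mem_dump_obj(obj);      /* prints slab/vmalloc/page provenance, if known */
    }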

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Reported-by: Andrii Nakryiko
    [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
    [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
    [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
    [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
    [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
    [ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
    Acked-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Tested-by: Naresh Kamboju
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

16 Dec, 2020

3 commits

  • Pull networking updates from Jakub Kicinski:
    "Core:

    - support "prefer busy polling" NAPI operation mode, where we defer
    softirq for some time expecting applications to periodically busy
    poll

    - AF_XDP: improve efficiency by more batching and hindering the
    adjacency cache prefetcher

    - af_packet: make packet_fanout.arr size configurable up to 64K

    - tcp: optimize TCP zero copy receive in presence of partial or
    unaligned reads making zero copy a performance win for much smaller
    messages

    - XDP: add bulk APIs for returning / freeing frames

    - sched: support fragmenting IP packets as they come out of conntrack

    - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

    BPF:

    - BPF switch from crude rlimit-based to memcg-based memory accounting

    - BPF type format information for kernel modules and related tracing
    enhancements

    - BPF implement task local storage for BPF LSM

    - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
    bpf_sk_storage

    Protocols:

    - mptcp: improve multiple xmit streams support, memory accounting and
    many smaller improvements

    - TLS: support CHACHA20-POLY1305 cipher

    - seg6: add support for SRv6 End.DT4/DT6 behavior

    - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

    - ppp_generic: add ability to bridge channels directly

    - bridge: Connectivity Fault Management (CFM) support as is defined
    in IEEE 802.1Q section 12.14.

    Drivers:

    - mlx5: make use of the new auxiliary bus to organize the driver
    internals

    - mlx5: more accurate port TX timestamping support

    - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
    the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging

    - rtw88: major bluetooth co-existence improvements

    - iwlwifi: support new 6 GHz frequency band

    - ath11k: Fast Initial Link Setup (FILS)

    - mt7915: dual band concurrent (DBDC) support

    - net: ipa: add basic support for IPA v4.5

    Refactor:

    - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
    Siewior

    - phy: add support for shared interrupts; get rid of multiple driver
    APIs and have the drivers write a full IRQ handler, slight growth
    of driver code should be compensated by the simpler API which also
    allows shared IRQs

    - add common code for handling netdev per-cpu counters

    - move TX packet re-allocation from Ethernet switch tag drivers to a
    central place

    - improve efficiency and rename nla_strlcpy

    - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot

    Old code removal:

    - wan: delete the DLCI / SDLA drivers

    - wimax: move to staging

    - wifi: remove old WDS wifi bridging support"

    * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
    net: hns3: fix expression that is currently always true
    net: fix proc_fs init handling in af_packet and tls
    nfc: pn533: convert comma to semicolon
    af_vsock: Assign the vsock transport considering the vsock address flags
    af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
    vsock_addr: Check for supported flag values
    vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
    vm_sockets: Add flags field in the vsock address data structure
    net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
    tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
    net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
    nfc: s3fwrn5: Release the nfc firmware
    net: vxget: clean up sparse warnings
    mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
    mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
    mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
    mlxsw: reg: Add Router LPM Cache Enable Register
    mlxsw: reg: Add Router LPM Cache ML Delete Register
    mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
    mlxsw: reg: Add XM Router M Table Register
    ...

    Linus Torvalds
     
  • Extracted from slab.h, which seems to have the most complete version
    including the correct might_sleep() check. Roll it out to slob.c.

    Motivated by a discussion with Paul about possibly changing call_rcu
    behaviour to allocate memory, but only roughly every 500th call.

    There are a lot fewer places in the kernel that care about whether
    allocating memory is allowed or not (due to deadlocks with reclaim code)
    than places that care whether sleeping is allowed. But debugging these
    also tends to be a lot harder, so nice descriptive checks could come in
    handy. I might have some use eventually for annotations in drivers/gpu.

    Note that unlike fs_reclaim_acquire/release gfpflags_allow_blocking does
    not consult the PF_MEMALLOC flags. But there is no flag equivalent for
    GFP_NOWAIT, hence this check can't go wrong due to
    memalloc_no*_save/restore contexts. Willy is working on a patch series
    which might change this:

    https://lore.kernel.org/linux-mm/20200625113122.7540-7-willy@infradead.org/

    I think best would be if that updates gfpflags_allow_blocking(), since
    there's a ton of callers all over the place for that already.
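
    The extracted helper itself is tiny; roughly (might_alloc() in mainline's
    <linux/sched/mm.h>):

    static inline void might_alloc(gfp_t gfp_mask)
    {
            fs_reclaim_acquire(gfp_mask);
            fs_reclaim_release(gfp_mask);

            might_sleep_if(gfpflags_allow_blocking(gfp_mask));
    }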

    Link: https://lkml.kernel.org/r/20201125162532.1299794-3-daniel.vetter@ffwll.ch
    Signed-off-by: Daniel Vetter
    Acked-by: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Reviewed-by: Jason Gunthorpe
    Cc: Randy Dunlap
    Cc: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Vlastimil Babka
    Cc: Mathieu Desnoyers
    Cc: Sebastian Andrzej Siewior
    Cc: Michel Lespinasse
    Cc: Daniel Vetter
    Cc: Waiman Long
    Cc: Thomas Gleixner
    Cc: Randy Dunlap
    Cc: Dave Chinner
    Cc: Qian Cai
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Christian König
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Maarten Lankhorst
    Cc: Thomas Hellström (Intel)
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Vetter
     
  • Since commit 991e7673859e ("mm: memcontrol: account kernel stack per
    node") there is no user of the mod_memcg_obj_state(). So just remove
    it.

    Also rework type of the idx parameter of the mod_objcg_state() from int
    to enum node_stat_item.

    Link: https://lkml.kernel.org/r/20201013153504.92602-1-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Acked-by: David Rientjes
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Christopher Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Yafang Shao
    Cc: Chris Down
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     

12 Dec, 2020

1 commit


07 Dec, 2020

1 commit

  • Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
    for all allocations") introduced a regression into the handling of the
    obj_cgroup_charge() return value. If a non-zero value is returned
    (indicating that one of the memory.max limits has been exceeded), the
    allocation should fail instead of falling back to non-accounted mode.

    To make the code more readable, move memcg_slab_pre_alloc_hook() and
    memcg_slab_post_alloc_hook() calling conditions into bodies of these
    hooks.
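
    A sketch of the fixed pre-alloc hook (simplified from mm/slab.h; the
    point is that a failed charge now propagates as an allocation failure
    instead of silently falling back to unaccounted allocation):

    static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
                                                 struct obj_cgroup **objcgp,
                                                 size_t objects, gfp_t flags)
    {
            struct obj_cgroup *objcg;

            if (!memcg_kmem_enabled())
                    return true;
            if (!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))
                    return true;

            objcg = get_obj_cgroup_from_current();
            if (!objcg)
                    return true;

            if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
                    obj_cgroup_put(objcg);
                    return false;   /* charge failed: fail the allocation */
            }

            *objcgp = objcg;
            return true;
    }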

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

03 Dec, 2020

2 commits

  • To gather all direct accesses to struct page's memcg_data field in one
    place, let's introduce 3 new helpers to use in the slab accounting code:

    struct obj_cgroup **page_objcgs(struct page *page);
    struct obj_cgroup **page_objcgs_check(struct page *page);
    bool set_page_objcgs(struct page *page, struct obj_cgroup **objcgs);

    They are similar to the corresponding API for generic pages, except that
    the setter can return false, indicating that the value has already been
    set from a different thread.
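
    A sketch of the reader side (simplified; MEMCG_DATA_OBJCGS and the flags
    mask are the low bits carried in page->memcg_data):

    static inline struct obj_cgroup **page_objcgs(struct page *page)
    {
            unsigned long memcg_data = READ_ONCE(page->memcg_data);

            /* the caller must know this is a slab page carrying an objcg vector */
            VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS), page);

            return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
    }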

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Link: https://lkml.kernel.org/r/20201027001657.3398190-3-guro@fb.com
    Link: https://lore.kernel.org/bpf/20201201215900.3569844-3-guro@fb.com

    Roman Gushchin
     
  • Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

    Currently a non-slab kernel page which has been charged to a memory cgroup
    can't be mapped to userspace. The underlying reason is simple: PageKmemcg
    flag is defined as a page type (like buddy, offline, etc), so it takes a
    bit from a page->mapped counter. Pages with a type set can't be mapped to
    userspace.

    But in general the kmemcg flag has nothing to do with mapping to
    userspace. It only means that the page has been accounted by the page
    allocator, so it has to be properly uncharged on release.

    Some bpf maps are mapping the vmalloc-based memory to userspace, and their
    memory can't be accounted because of this implementation detail.

    This patchset removes this limitation by moving the PageKmemcg flag into
    one of the free bits of the page->mem_cgroup pointer. Also it formalizes
    accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
    adds several checks and removes a couple of obsolete functions. As the
    result the code became more robust with fewer open-coded bit tricks.

    This patch (of 4):

    Currently there are many open-coded reads of the page->mem_cgroup pointer,
    as well as a couple of read helpers, which are barely used.

    It creates an obstacle on a way to reuse some bits of the pointer for
    storing additional bits of information. In fact, we already do this for
    slab pages, where the last bit indicates that a pointer has an attached
    vector of objcg pointers instead of a regular memcg pointer.

    This commit uses 2 existing helpers and introduces a new helper,
    converting all read sides to calls of these helpers:
    struct mem_cgroup *page_memcg(struct page *page);
    struct mem_cgroup *page_memcg_rcu(struct page *page);
    struct mem_cgroup *page_memcg_check(struct page *page);

    page_memcg_check() is intended to be used in cases when the page can be a
    slab page and have a memcg pointer pointing at objcg vector. It does
    check the lowest bit, and if set, returns NULL. page_memcg() contains a
    VM_BUG_ON_PAGE() check for the page not being a slab page.

    To make sure nobody uses a direct access, struct page's
    mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
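
    A sketch of the lowest-bit check described above (simplified from
    include/linux/memcontrol.h):

    static inline struct mem_cgroup *page_memcg_check(struct page *page)
    {
            unsigned long memcg_data = READ_ONCE(page->memcg_data);

            /* lowest bit set: this is an objcg vector, not a memcg pointer */
            if (memcg_data & MEMCG_DATA_OBJCGS)
                    return NULL;

            return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
    }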

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
    Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
    Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com

    Roman Gushchin
     

19 Oct, 2020

1 commit

  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations were left unaccounted, mostly because
    charging the memory cgroup of the current process wasn't an option.
    Performance was likely a reason too.

    The remote charging API allows to temporarily overwrite the currently
    active memory cgroup, so that all memory allocations are accounted towards
    some specified memory cgroup instead of the memory cgroup of the current
    process.

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory accounting
    will likely be the first user of it: a typical example is a bpf program
    parsing an incoming network packet, which allocates an entry in hashmap
    map to store some information.

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

17 Oct, 2020

1 commit

  • Remove duplicate header which is included twice.

    Signed-off-by: YueHaibing
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200818114323.58156-1-yuehaibing@huawei.com
    Signed-off-by: Linus Torvalds

    YueHaibing
     

14 Oct, 2020

1 commit

  • Object cgroup charging is done for all the objects during allocation, but
    during freeing, uncharging ends up happening for only one object in the
    case of bulk allocation/freeing.

    Fix this by having a separate call to uncharge all the objects from
    kmem_cache_free_bulk() and by modifying memcg_slab_free_hook() to take
    care of bulk uncharging.

    Fixes: 964d4bd370d5 ("mm: memcg/slab: save obj_cgroup for non-root slab objects")
    Signed-off-by: Bharata B Rao
    Signed-off-by: Andrew Morton
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc:
    Link: https://lkml.kernel.org/r/20201009060423.390479-1-bharata@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Bharata B Rao
     

08 Aug, 2020

13 commits

  • charge_slab_page() and uncharge_slab_page() are not related anymore to
    memcg charging and uncharging. In order to make their names less
    confusing, let's rename them to account_slab_page() and
    unaccount_slab_page() respectively.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • charge_slab_page() is not using the gfp argument anymore,
    remove it.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Instead of having two sets of kmem_caches: one for system-wide and
    non-accounted allocations and the second one shared by all accounted
    allocations, we can use just one.

    The idea is simple: space for obj_cgroup metadata can be allocated on
    demand and filled only for accounted allocations.

    It allows to remove a bunch of code which is required to handle kmem_cache
    clones for accounted allocations. There is no more need to create them,
    accumulate statistics, propagate attributes, etc. It's a quite
    significant simplification.

    Also, because the total number of slab_caches is reduced almost twice (not
    all kmem_caches have a memcg clone), some additional memory savings are
    expected. On my devvm it additionally saves about 3.5% of slab memory.

    [guro@fb.com: fix build on MIPS]
    Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com

    Suggested-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently there are two lists of kmem_caches:
    1) slab_caches, which contains all kmem_caches,
    2) slab_root_caches, which contains only root kmem_caches.

    And there is some preprocessor magic to have a single list if
    CONFIG_MEMCG_KMEM isn't enabled.

    It was required earlier because the number of non-root kmem_caches was
    proportional to the number of memory cgroups and could reach really big
    values. Now, when it cannot exceed the number of root kmem_caches, there
    is really no reason to maintain two lists.

    We never iterate over the slab_root_caches list on any hot paths, so it's
    perfectly fine to iterate over slab_caches and filter out non-root
    kmem_caches.

    It allows to remove a lot of config-dependent code and two pointers from
    the kmem_cache structure.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The memcg_kmem_get_cache() function became really trivial, so let's just
    inline it into the single call point: memcg_slab_pre_alloc_hook().

    It will make the code less bulky and can also help the compiler
    generate better code.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-15-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Because the number of non-root kmem_caches doesn't depend on the number of
    memory cgroups anymore and is generally not very big, there is no more
    need for a dedicated workqueue.

    Also, as there is no more need to pass any arguments to the
    memcg_create_kmem_cache() except the root kmem_cache, it's possible to
    just embed the work structure into the kmem_cache and avoid the dynamic
    allocation of the work structure.

    This will also simplify the synchronization: for each root kmem_cache
    there is only one work. So there will be no more concurrent attempts to
    create a non-root kmem_cache for a root kmem_cache: the second and all
    following attempts to queue the work will fail.

    On the kmem_cache destruction path there is no more need to call the
    expensive flush_workqueue() and wait for all pending works to be finished.
    Instead, cancel_work_sync() can be used to cancel/wait for only one work.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-14-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This is fairly big but mostly red patch, which makes all accounted slab
    allocations use a single set of kmem_caches instead of creating a separate
    set for each memory cgroup.

    Because the number of non-root kmem_caches is now capped by the number of
    root kmem_caches, there is no need to shrink or destroy them prematurely.
    They can be perfectly destroyed together with their root counterparts.
    This allows to dramatically simplify the management of non-root
    kmem_caches and delete a ton of code.

    This patch performs the following changes:
    1) introduces memcg_params.memcg_cache pointer to represent the
    kmem_cache which will be used for all non-root allocations
    2) reuses the existing memcg kmem_cache creation mechanism
    to create memcg kmem_cache on the first allocation attempt
    3) memcg kmem_caches are named "<cache-name>-memcg",
    e.g. dentry-memcg
    4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
    or schedule its creation and return the root cache
    5) removes almost all non-root kmem_cache management code
    (separate refcounter, reparenting, shrinking, etc)
    6) makes slab debugfs display the root_mem_cgroup css id and never
    show the :dead and :deact flags in the memcg_slabinfo attribute.

    Following patches in the series will simplify the kmem_cache creation.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Switch to per-object accounting of non-root slab objects.

    Charging is performed using the obj_cgroup API in the pre_alloc hook.
    The obj_cgroup is charged with the size of the object plus the size of
    the per-object metadata (for now, the size of an obj_cgroup pointer). If
    the amount of memory has been charged successfully, the actual allocation
    code is executed. Otherwise, -ENOMEM is returned.

    In the post_alloc hook if the actual allocation succeeded, corresponding
    vmstats are bumped and the obj_cgroup pointer is saved. Otherwise, the
    charge is canceled.

    On the free path obj_cgroup pointer is obtained and used to uncharge the
    size of the releasing object.

    Memcg and lruvec counters are now representing only memory used by active
    slab objects and do not include the free space. The free space is shared
    and doesn't belong to any specific cgroup.

    Global per-node slab vmstats are still modified from
    (un)charge_slab_page() functions. The idea is to keep all slab pages
    accounted as slab pages on system level.
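
    A sketch of the per-object charge size implied above (obj_full_size() is
    the mm/slab.h helper; simplified):

    /* account the object itself plus its obj_cgroup pointer slot */
    static inline size_t obj_full_size(struct kmem_cache *s)
    {
            return s->size + sizeof(struct obj_cgroup *);
    }

    The pre_alloc hook charges objects * obj_full_size(s) up front, and the
    free path uncharges obj_full_size(s) per released object.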

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-10-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Store the obj_cgroup pointer in the corresponding place of
    page->obj_cgroups for each allocated non-root slab object. Make sure that
    each allocated object holds a reference to obj_cgroup.

    Objcg pointer is obtained from the memcg->objcg dereferencing in
    memcg_kmem_get_cache() and passed from pre_alloc_hook to post_alloc_hook.
    Then in case of successful allocation(s) it's getting stored in the
    page->obj_cgroups vector.

    The objcg obtaining part looks a bit bulky now, but it will be simplified
    by the next commits in the series.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Allocate and release memory to store obj_cgroup pointers for each non-root
    slab page. Reuse page->mem_cgroup pointer to store a pointer to the
    allocated space.

    This commit temporarily increases the memory footprint of the kernel memory
    accounting. To store obj_cgroup pointers we'll need a place for an
    objcg_pointer for each allocated object. However, the following patches
    in the series will enable sharing of slab pages between memory cgroups,
    which will dramatically increase the total slab utilization. And the final
    memory footprint will be significantly smaller than before.

    To distinguish between obj_cgroups and memcg pointers in case when it's
    not obvious which one is used (as in page_cgroup_ino()), let's always set
    the lowest bit in the obj_cgroup case. The original obj_cgroups
    pointer is marked to be ignored by kmemleak, which otherwise would
    report a memory leak for each allocated vector.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-8-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The reference counting of a memcg is currently coupled directly to how
    many 4k pages are charged to it. This doesn't work well with Roman's new
    slab controller, which maintains pools of objects and doesn't want to keep
    an extra balance sheet for the pages backing those objects.

    This unusual refcounting design (reference counts usually track pointers
    to an object) is only for historical reasons: memcg used to not take any
    css references and simply stalled offlining until all charges had been
    reparented and the page counters had dropped to zero. When we got rid of
    the reparenting requirement, the simple mechanical translation was to take
    a reference for every charge.

    More historical context can be found in commit e8ea14cc6ead ("mm:
    memcontrol: take a css reference for each charged page"), commit
    64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and
    commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined
    groups").

    The new slab controller exposes the limitations in this scheme, so let's
    switch it to a more idiomatic reference counting model based on actual
    kernel pointers to the memcg:

    - The per-cpu stock holds a reference to the memcg it's caching

    - User pages hold a reference for their page->mem_cgroup. Transparent
    huge pages will no longer acquire tail references in advance, we'll
    get them if needed during the split.

    - Kernel pages hold a reference for their page->mem_cgroup

    - Pages allocated in the root cgroup will acquire and release css
    references for simplicity. css_get() and css_put() optimize that.

    - The current memcg_charge_slab() already hacked around the per-charge
    references; this change gets rid of that as well.

    - tcp accounting will handle reference in mem_cgroup_sk_{alloc,free}

    Roman:
    1) Rebased on top of the current mm tree: added css_get() in
    mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
    2) I've reformatted commit references in the commit log to make
    checkpatch.pl happy.

    [hughd@google.com: remove css_put_many() from __mem_cgroup_clear_mc()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007302011450.2347@eggly.anvils

    Signed-off-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200623174037.3951353-6-guro@fb.com
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In order to prepare for per-object slab memory accounting, convert
    NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

    To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
    NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

    Internally global and per-node counters are stored in pages, however memcg
    and lruvec counters are stored in bytes. This scheme may look weird, but
    only for now. As soon as slab pages will be shared between multiple
    cgroups, global and node counters will reflect the total number of slab
    pages. However memcg and lruvec counters will be used for per-memcg slab
    memory tracking, which will take separate kernel objects into account.
    Keeping global and node counters in pages helps to avoid additional
    overhead.

    The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
    will fit into atomic_long_t we use for vmstats.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b: always get
    the cache from its page in kmem_cache_free()") to support kmemcg, where
    per-memcg cache can be different from the root one, so we can't use the
    kmem_cache pointer given to kmem_cache_free().

    Prior to that commit, SLUB already had debugging check+warning that could
    be enabled to compare the given kmem_cache pointer to one referenced by
    the slab page where the object-to-be-freed resides. This check was moved
    to cache_from_obj(). Later the check was also enabled for
    SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate
    cache membership under freelist hardening").

    These checks and warnings can be useful especially for the debugging,
    which can be improved. Commit 598a0717a816 changed the pr_err() with
    WARN_ON_ONCE() to WARN_ONCE() so only the first hit is now reported,
    others are silent. This patch changes it to WARN() so that all errors are
    reported.

    It's also useful to print SLUB allocation/free tracking info for the
    offending object, if tracking is enabled. Thus, export the SLUB
    print_tracking() function and provide an empty one for SLAB.

    For SLUB we can also benefit from the static key check in
    kmem_cache_debug_flags(), but we need to move this function to slab.h and
    declare the static key there.
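
    A sketch of the resulting check (close to the mm/slab.h version of this
    era, but simplified):

    static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
    {
            struct kmem_cache *cachep;

            if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
                !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
                    return s;

            cachep = virt_to_cache(x);
            if (WARN(cachep && cachep != s,
                     "%s: Wrong slab cache. %s but object is from %s\n",
                     __func__, s->name, cachep->name))
                    print_tracking(cachep, x);      /* alloc/free traces, if enabled */
            return cachep;
    }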

    [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com

    [vbabka@suse.cz: avoid bogus WARN()]
    Link: https://lore.kernel.org/r/20200623090213.GW5535@shao2-debian
    Link: http://lkml.kernel.org/r/b33e0fa7-cd28-4788-9e54-5927846329ef@suse.cz

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Kees Cook
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Garrett
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Cc: Vinayak Menon
    Link: http://lkml.kernel.org/r/afeda7ac-748b-33d8-a905-56b708148ad5@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka