12 Jan, 2021
1 commit
-
Export pcpu_nr_pages symbol which is used as part
of meminfo collection from minidump module.Bug: 177027175
Change-Id: I08262ec95a3f1be8322b9b8d2d9c4098518fc408
Signed-off-by: Vijayanand Jitta
24 Dec, 2020
1 commit
-
per_cpu_ptr_to_phys symbols is needed for vendor loadable modules
to find the per cpu physical address of symbols.Bug: 176125613
Change-Id: Ifc23a8a9cae8eb11c94107eb9b9237a838f830bc
Signed-off-by: Prasad Sodagudi
31 Oct, 2020
1 commit
-
Use the safer macro as sparked by the long discussion in [1].
[1] https://lore.kernel.org/lkml/20200917204514.GA2880159@google.com/
Reviewed-by: Gustavo A. R. Silva
Signed-off-by: Dennis Zhou
19 Oct, 2020
1 commit
-
Patch series "mm: kmem: kernel memory accounting in an interrupt context".
This patchset implements memcg-based memory accounting of allocations made
from an interrupt context.Historically, such allocations were passed unaccounted mostly because
charging the memory cgroup of the current process wasn't an option. Also
performance reasons were likely a reason too.The remote charging API allows to temporarily overwrite the currently
active memory cgroup, so that all memory allocations are accounted towards
some specified memory cgroup instead of the memory cgroup of the current
process.This patchset extends the remote charging API so that it can be used from
an interrupt context. Then it removes the fence that prevented the
accounting of allocations made from an interrupt context. It also
contains a couple of optimizations/code refactorings.This patchset doesn't directly enable accounting for any specific
allocations, but prepares the code base for it. The bpf memory accounting
will likely be the first user of it: a typical example is a bpf program
parsing an incoming network packet, which allocates an entry in hashmap
map to store some information.This patch (of 4):
Currently memcg_kmem_bypass() is called before obtaining the current
memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
number of call sites and allows further code simplifications.Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Cc: Johannes Weiner
Cc: Michal Hocko
Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
Signed-off-by: Linus Torvalds
18 Sep, 2020
1 commit
-
Variable populated, which is a member of struct pcpu_chunk, is used as a
unit of size of unsigned long.
However, size of populated is miscounted. So, I fix this minor part.Fixes: 8ab16c43ea79 ("percpu: change the number of pages marked in the first_chunk pop bitmap")
Cc: # 4.14+
Signed-off-by: Sunghyun Jin
Signed-off-by: Dennis Zhou
13 Aug, 2020
3 commits
-
Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs. Let's track
percpu memory usage for each memcg and display it in memory.stat.A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page. So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B. Byte-sized vmstat infra
created for slabs can be perfectly reused for percpu case.[guro@fb.com: v3]
Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.comSigned-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Acked-by: Dennis Zhou
Acked-by: Johannes Weiner
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Pekka Enberg
Cc: Tejun Heo
Cc: Tobin C. Harding
Cc: Vlastimil Babka
Cc: Waiman Long
Cc: Bixuan Cui
Cc: Michal Koutný
Cc: Stephen Rothwell
Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
Signed-off-by: Linus Torvalds -
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.comSigned-off-by: Roman Gushchin
Signed-off-by: Bixuan Cui
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Acked-by: Dennis Zhou
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Pekka Enberg
Cc: Tejun Heo
Cc: Tobin C. Harding
Cc: Vlastimil Babka
Cc: Waiman Long
Cc: Bixuan Cui
Cc: Michal Koutný
Cc: Stephen Rothwell
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds -
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis. So the per-object tracking
introduced by the new slab controller is reused.The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object. On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged. Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.This patch (of 5):
To implement accounting of percpu memory we need the information about the
size of freed object. Return it from pcpu_free_area().Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Acked-by: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Pekka Enberg
Cc: Tobin C. Harding
Cc: Vlastimil Babka
Cc: Waiman Long
cC: Michal Koutnýutny@suse.com>
Cc: Bixuan Cui
Cc: Michal Koutný
Cc: Stephen Rothwell
Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
Signed-off-by: Linus Torvalds
17 Jul, 2020
1 commit
-
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe # IB
Acked-by: Kalle Valo # wireless drivers
Reviewed-by: Chao Yu # erofs
Signed-off-by: Kees Cook
03 Jun, 2020
1 commit
-
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Reviewed-by: Michael Kelley [hyperv]
Acked-by: Gao Xiang [erofs]
Acked-by: Peter Zijlstra (Intel)
Acked-by: Wei Liu
Cc: Christian Borntraeger
Cc: Christophe Leroy
Cc: Daniel Vetter
Cc: David Airlie
Cc: Greg Kroah-Hartman
Cc: Haiyang Zhang
Cc: Johannes Weiner
Cc: "K. Y. Srinivasan"
Cc: Laura Abbott
Cc: Mark Rutland
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Robin Murphy
Cc: Sakari Ailus
Cc: Stephen Hemminger
Cc: Sumit Semwal
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Heiko Carstens
Cc: Paul Mackerras
Cc: Vasily Gorbik
Cc: Will Deacon
Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
Signed-off-by: Linus Torvalds
08 May, 2020
1 commit
-
Since 5.7-rc1, on btrfs we have a percpu counter initialization for
which we always pass a GFP_KERNEL gfp_t argument (this happens since
commit 2992df73268f78 ("btrfs: Implement DREW lock")).That is safe in some contextes but not on others where allowing fs
reclaim could lead to a deadlock because we are either holding some
btrfs lock needed for a transaction commit or holding a btrfs
transaction handle open. Because of that we surround the call to the
function that initializes the percpu counter with a NOFS context using
memalloc_nofs_save() (this is done at btrfs_init_fs_root()).However it turns out that this is not enough to prevent a possible
deadlock because percpu_alloc() determines if it is in an atomic context
by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
case) and it is not aware that a NOFS context is set.Because percpu_alloc() thinks it is in a non atomic context it locks the
pcpu_alloc_mutex. This can result in a btrfs deadlock when
pcpu_balance_workfn() is running, has acquired that mutex and is waiting
for reclaim, while the btrfs task that called percpu_counter_init() (and
therefore percpu_alloc()) is holding either the btrfs commit_root
semaphore or a transaction handle (done fs/btrfs/backref.c:
iterate_extent_inodes()), which prevents reclaim from finishing as an
attempt to commit the current btrfs transaction will deadlock.Lockdep reports this issue with the following trace:
======================================================
WARNING: possible circular locking dependency detected
5.6.0-rc7-btrfs-next-77 #1 Not tainted
------------------------------------------------------
kswapd0/91 is trying to acquire lock:
ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]but task is already holding lock:
ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (fs_reclaim){+.+.}:
fs_reclaim_acquire.part.0+0x25/0x30
__kmalloc+0x5f/0x3a0
pcpu_create_chunk+0x19/0x230
pcpu_balance_workfn+0x56a/0x680
process_one_work+0x235/0x5f0
worker_thread+0x50/0x3b0
kthread+0x120/0x140
ret_from_fork+0x3a/0x50-> #3 (pcpu_alloc_mutex){+.+.}:
__mutex_lock+0xa9/0xaf0
pcpu_alloc+0x480/0x7c0
__percpu_counter_init+0x50/0xd0
btrfs_drew_lock_init+0x22/0x70 [btrfs]
btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
resolve_indirect_refs+0x120/0xa30 [btrfs]
find_parent_nodes+0x50b/0xf30 [btrfs]
btrfs_find_all_leafs+0x60/0xb0 [btrfs]
iterate_extent_inodes+0x139/0x2f0 [btrfs]
iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
btrfs_ioctl+0x165a/0x3130 [btrfs]
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5c/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe-> #2 (&fs_info->commit_root_sem){++++}:
down_write+0x38/0x70
btrfs_cache_block_group+0x2ec/0x500 [btrfs]
find_free_extent+0xc6a/0x1600 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x5a0 [btrfs]
btrfs_cow_block+0x106/0x240 [btrfs]
commit_cowonly_roots+0x55/0x310 [btrfs]
btrfs_commit_transaction+0x509/0xb20 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x93/0xc0
exit_to_usermode_loop+0xf9/0x100
do_syscall_64+0x20d/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe-> #1 (&space_info->groups_sem){++++}:
down_read+0x3c/0x140
find_free_extent+0xef6/0x1600 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x5a0 [btrfs]
btrfs_cow_block+0x106/0x240 [btrfs]
btrfs_search_slot+0x50c/0xd60 [btrfs]
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
__btrfs_update_delayed_inode+0x90/0x280 [btrfs]
__btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
__btrfs_run_delayed_items+0x8e/0x180 [btrfs]
btrfs_commit_transaction+0x31b/0xb20 [btrfs]
iterate_supers+0x87/0xf0
ksys_sync+0x60/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x5c/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe-> #0 (&delayed_node->mutex){+.+.}:
__lock_acquire+0xef0/0x1c80
lock_acquire+0xa2/0x1d0
__mutex_lock+0xa9/0xaf0
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x40d/0x560 [btrfs]
evict+0xd9/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x124/0x1a0
do_shrink_slab+0x176/0x440
shrink_slab+0x23a/0x2c0
shrink_node+0x188/0x6e0
balance_pgdat+0x31d/0x7f0
kswapd+0x238/0x550
kthread+0x120/0x140
ret_from_fork+0x3a/0x50other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaimPossible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(pcpu_alloc_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);*** DEADLOCK ***
3 locks held by kswapd0/91:
#0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
#1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
#2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50stack backtrace:
CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8f/0xd0
check_noncircular+0x170/0x190
__lock_acquire+0xef0/0x1c80
lock_acquire+0xa2/0x1d0
__mutex_lock+0xa9/0xaf0
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x40d/0x560 [btrfs]
evict+0xd9/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x124/0x1a0
do_shrink_slab+0x176/0x440
shrink_slab+0x23a/0x2c0
shrink_node+0x188/0x6e0
balance_pgdat+0x31d/0x7f0
kswapd+0x238/0x550
kthread+0x120/0x140
ret_from_fork+0x3a/0x50This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
to percpu_counter_init() in contextes where it is not reclaim safe,
however that type of approach is discouraged since
memalloc_[nofs|noio]_save() were introduced. Therefore this change
makes pcpu_alloc() look up into an existing nofs/noio context before
deciding whether it is in an atomic context or not.Signed-off-by: Filipe Manana
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Acked-by: Tejun Heo
Acked-by: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org
Signed-off-by: Linus Torvalds
02 Apr, 2020
1 commit
-
Currently there are 3 emails tied to me in the kernel tree, I'd rather
dennis@kernel.org be the only one.Signed-off-by: Dennis Zhou
20 Jan, 2020
1 commit
-
Bitmaps are fairly popular for their space efficiency, but we don't have
generic iterators available. Make percpu's bitmap region iterators
available to everyone.Reviewed-by: Josef Bacik
Signed-off-by: Dennis Zhou
Reviewed-by: David Sterba
Signed-off-by: David Sterba
05 Sep, 2019
1 commit
-
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:struct pcpu_alloc_info {
...
struct pcpu_group_info groups[];
};Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.So, replace the following form:
sizeof(*ai) + nr_groups * sizeof(ai->groups[0])
with:
struct_size(ai, groups, nr_groups)
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva
Signed-off-by: Dennis Zhou
24 Jul, 2019
1 commit
-
s/perpcu/percpu/
Signed-off-by: Christophe JAILLET
Signed-off-by: Dennis Zhou
[Dennis: updated title]
04 Jul, 2019
1 commit
-
pcpu_setup_first_chunk() will panic or BUG_ON if the are some
error and doesn't return any error, hence it can be defined to
return void.Reported-by: kbuild test robot
Signed-off-by: Kefeng Wang
Signed-off-by: Dennis Zhou
[Dennis: fixed kbuild warning for pcpu_page_first_chunk()]
05 Jun, 2019
1 commit
-
Based on 1 normalized pattern(s):
this file is released under the gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 68 file(s).
Signed-off-by: Thomas Gleixner
Reviewed-by: Armijn Hemel
Reviewed-by: Allison Randal
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman
14 May, 2019
1 commit
-
Pull percpu updates from Dennis Zhou:
- scan hint update which helps address performance issues with heavily
fragmented blocks- lockdep fix when freeing an allocation causes balance work to be
scheduled* 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: remove spurious lock dependency between percpu and sched
percpu: use chunk scan_hint to skip some scanning
percpu: convert chunk hints to be based on pcpu_block_md
percpu: make pcpu_block_md generic
percpu: use block scan_hint to only scan forward
percpu: remember largest area skipped during allocation
percpu: add block level scan_hint
percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
percpu: relegate chunks unusable when failing small allocations
percpu: manage chunks based on contig_bits instead of free_bytes
percpu: introduce helper to determine if two regions overlap
percpu: do not search past bitmap when allocating an area
percpu: update free path with correct new free region
09 May, 2019
1 commit
-
In free_percpu() we sometimes call pcpu_schedule_balance_work() to
queue a work item (which does a wakeup) while holding pcpu_lock.
This creates an unnecessary lock dependency between pcpu_lock and
the scheduler's pi_lock. There are other places where we call
pcpu_schedule_balance_work() without hold pcpu_lock, and this case
doesn't need to be different.Moving the call outside the lock prevents the following lockdep splat
when running tools/testing/selftests/bpf/{test_maps,test_progs} in
sequence with lockdep enabled:======================================================
WARNING: possible circular locking dependency detected
5.1.0-dbg-DEV #1 Not tainted
------------------------------------------------------
kworker/23:255/18872 is trying to acquire lock:
000000000bc79290 (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0xb2/0x520but task is already holding lock:
00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (pcpu_lock){..-.}:
lock_acquire+0x9e/0x180
_raw_spin_lock_irqsave+0x3a/0x50
pcpu_alloc+0xfa/0x780
__alloc_percpu_gfp+0x12/0x20
alloc_htab_elem+0x184/0x2b0
__htab_percpu_map_update_elem+0x252/0x290
bpf_percpu_hash_update+0x7c/0x130
__do_sys_bpf+0x1912/0x1be0
__x64_sys_bpf+0x1a/0x20
do_syscall_64+0x59/0x400
entry_SYSCALL_64_after_hwframe+0x49/0xbe-> #3 (&htab->buckets[i].lock){....}:
lock_acquire+0x9e/0x180
_raw_spin_lock_irqsave+0x3a/0x50
htab_map_update_elem+0x1af/0x3a0-> #2 (&rq->lock){-.-.}:
lock_acquire+0x9e/0x180
_raw_spin_lock+0x2f/0x40
task_fork_fair+0x37/0x160
sched_fork+0x211/0x310
copy_process.part.43+0x7b1/0x2160
_do_fork+0xda/0x6b0
kernel_thread+0x29/0x30
rest_init+0x22/0x260
arch_call_rest_init+0xe/0x10
start_kernel+0x4fd/0x520
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0x6f/0x72
secondary_startup_64+0xa4/0xb0-> #1 (&p->pi_lock){-.-.}:
lock_acquire+0x9e/0x180
_raw_spin_lock_irqsave+0x3a/0x50
try_to_wake_up+0x41/0x600
wake_up_process+0x15/0x20
create_worker+0x16b/0x1e0
workqueue_init+0x279/0x2ee
kernel_init_freeable+0xf7/0x288
kernel_init+0xf/0x180
ret_from_fork+0x24/0x30-> #0 (&(&pool->lock)->rlock){-.-.}:
__lock_acquire+0x101f/0x12a0
lock_acquire+0x9e/0x180
_raw_spin_lock+0x2f/0x40
__queue_work+0xb2/0x520
queue_work_on+0x38/0x80
free_percpu+0x221/0x260
pcpu_freelist_destroy+0x11/0x20
stack_map_free+0x2a/0x40
bpf_map_free_deferred+0x3c/0x50
process_one_work+0x1f7/0x580
worker_thread+0x54/0x410
kthread+0x10f/0x150
ret_from_fork+0x24/0x30other info that might help us debug this:
Chain exists of:
&(&pool->lock)->rlock --> &htab->buckets[i].lock --> pcpu_lockPossible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(pcpu_lock);
lock(&htab->buckets[i].lock);
lock(pcpu_lock);
lock(&(&pool->lock)->rlock);*** DEADLOCK ***
3 locks held by kworker/23:255/18872:
#0: 00000000b36a6e16 ((wq_completion)events){+.+.},
at: process_one_work+0x17a/0x580
#1: 00000000dfd966f0 ((work_completion)(&map->work)){+.+.},
at: process_one_work+0x17a/0x580
#2: 00000000e3e7a6aa (pcpu_lock){..-.},
at: free_percpu+0x36/0x260stack backtrace:
CPU: 23 PID: 18872 Comm: kworker/23:255 Not tainted 5.1.0-dbg-DEV #1
Hardware name: ...
Workqueue: events bpf_map_free_deferred
Call Trace:
dump_stack+0x67/0x95
print_circular_bug.isra.38+0x1c6/0x220
check_prev_add.constprop.50+0x9f6/0xd20
__lock_acquire+0x101f/0x12a0
lock_acquire+0x9e/0x180
_raw_spin_lock+0x2f/0x40
__queue_work+0xb2/0x520
queue_work_on+0x38/0x80
free_percpu+0x221/0x260
pcpu_freelist_destroy+0x11/0x20
stack_map_free+0x2a/0x40
bpf_map_free_deferred+0x3c/0x50
process_one_work+0x1f7/0x580
worker_thread+0x54/0x410
kthread+0x10f/0x150
ret_from_fork+0x24/0x30Signed-off-by: John Sperbeck
Signed-off-by: Dennis Zhou
19 Mar, 2019
1 commit
-
Since commit ad67b74d2469d9b8 ("printk: hash addresses printed with %p"),
at boot "____ptrval____" is printed instead of actual addresses:percpu: Embedded 38 pages/cpu @(____ptrval____) s124376 r0 d31272 u524288
Instead of changing the print to "%px", and leaking kernel addresses,
just remove the print completely, cfr. e.g. commit 071929dbdd865f77
("arm64: Stop printing the virtual memory layout").Signed-off-by: Matteo Croce
Signed-off-by: Dennis Zhou
14 Mar, 2019
12 commits
-
Just like blocks, chunks now maintain a scan_hint. This can be used to
skip some scanning by promoting the scan_hint to be the contig_hint.
The chunk's scan_hint is primarily updated on the backside and relies on
full scanning when a block becomes free or the free region spans across
blocks.Signed-off-by: Dennis Zhou
Reviewed-by: Peng Fan -
As mentioned in the last patch, a chunk's hints are no different than a
block just responsible for more bits. This converts chunk level hints to
use a pcpu_block_md to maintain them. This lets us reuse the same hint
helper functions as a block. The left_free and right_free are unused by
the chunk's pcpu_block_md.Signed-off-by: Dennis Zhou
Reviewed-by: Peng Fan -
In reality, a chunk is just a block covering a larger number of bits.
The hints themselves are one in the same. Rather than maintaining the
hints separately, first introduce nr_bits to genericize
pcpu_block_update() to correctly maintain block->right_free. The next
patch will convert chunk hints to be managed as a pcpu_block_md.Signed-off-by: Dennis Zhou
Reviewed-by: Peng Fan -
Blocks now remember the latest scan_hint. This can be used on the
allocation path as when a contig_hint is broken, we can promote the
scan_hint to the contig_hint and scan forward from there. This works
because pcpu_block_refresh_hint() is only called on the allocation path
while block free regions are updated manually in
pcpu_block_update_hint_free().Signed-off-by: Dennis Zhou
-
Percpu allocations attempt to do first fit by scanning forward from the
first_free of a block. However, fragmentation from allocation requests
can cause holes not seen by block hint update functions. To address
this, create a local version of bitmap_find_next_zero_area_off() that
remembers the largest area skipped over. The caveat is that it only sees
regions skipped over due to not fitting, not regions skipped due to
alignment.Prior to updating the scan_hint, a scan backwards is done to try and
recover free bits skipped due to alignment. While this can cause
scanning to miss earlier possible free areas, smaller allocations will
eventually fill those holes due to first fit.Signed-off-by: Dennis Zhou
-
Fragmentation can cause both blocks and chunks to have an early
first_firee bit available, but only able to satisfy allocations much
later on. This patch introduces a scan_hint to help mitigate some
unnecessary scanning.The scan_hint remembers the largest area prior to the contig_hint. If
the contig_hint == scan_hint, then scan_hint_start > contig_hint_start.
This is necessary for scan_hint discovery when refreshing a block.Signed-off-by: Dennis Zhou
Reviewed-by: Peng Fan -
Previously, block size was flexible based on the constraint that the
GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the
overhead that keeping a floating number of populated free pages required
scanning over the free regions of a chunk.Setting the block size to be fixed at PAGE_SIZE lets us know when an
empty page becomes used as we will break a full contig_hint of a block.
This means we no longer have to scan the whole chunk upon breaking a
contig_hint which empty page management piggybacked off. A later patch
takes advantage of this to optimize the allocation path by only scanning
forward using the scan_hint introduced later too.Signed-off-by: Dennis Zhou
Reviewed-by: Peng Fan -
In certain cases, requestors of percpu memory may want specific
alignments. However, it is possible to end up in situations where the
contig_hint matches, but the alignment does not. This causes excess
scanning of chunks that will fail. To prevent this, if a small
allocation fails (< 32B), the chunk is moved to the empty list. Once an
allocation is freed from that chunk, it is placed back into rotation.Signed-off-by: Dennis Zhou
Reviewed-by: Peng Fan -
When a chunk becomes fragmented, it can end up having a large number of
small allocation areas free. The free_bytes sorting of chunks leads to
unnecessary checking of chunks that cannot satisfy the allocation.
Switch to contig_bits sorting to prevent scanning chunks that may not be
able to service the allocation request.Signed-off-by: Dennis Zhou
Reviewed-by: Peng Fan -
While block hints were always accurate, it's possible when spanning
across blocks that we miss updating the chunk's contig_hint. Rather than
rely on correctness of the boundaries of hints, do a full overlap
comparison.A future patch introduces the scan_hint which makes the contig_hint
slightly fuzzy as they can at times be smaller than the actual hint.Signed-off-by: Dennis Zhou
-
pcpu_find_block_fit() guarantees that a fit is found within
PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as
it compares against the block's contig_hint. This can lead to
incorrectly scanning past the end of the bitmap. The behavior was okay
given the check after for bit_off >= end and the correctness of the
hints from pcpu_find_block_fit().This patch fixes this by bounding the end offset by the number of bits
in a chunk.Signed-off-by: Dennis Zhou
Reviewed-by: Peng Fan -
When updating the chunk's contig_hint on the free path of a hint that
does not touch the page boundaries, it was incorrectly using the
starting offset of the free region and the block's contig_hint. This
could lead to incorrect assumptions about fit given a size and better
alignment of the start. Fix this by using (end - start) as this is only
called when updating a hint within a block.Signed-off-by: Dennis Zhou
Reviewed-by: Peng Fan
13 Mar, 2019
2 commits
-
As all the memblock allocation functions return NULL in case of error
rather than panic(), the duplicates with _nopanic suffix can be removed.Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport
Acked-by: Greg Kroah-Hartman
Reviewed-by: Petr Mladek [printk]
Cc: Catalin Marinas
Cc: Christophe Leroy
Cc: Christoph Hellwig
Cc: "David S. Miller"
Cc: Dennis Zhou
Cc: Geert Uytterhoeven
Cc: Greentime Hu
Cc: Guan Xuetao
Cc: Guo Ren
Cc: Guo Ren [c-sky]
Cc: Heiko Carstens
Cc: Juergen Gross [Xen]
Cc: Mark Salter
Cc: Matt Turner
Cc: Max Filippov
Cc: Michael Ellerman
Cc: Michal Simek
Cc: Paul Burton
Cc: Richard Weinberger
Cc: Rich Felker
Cc: Rob Herring
Cc: Rob Herring
Cc: Russell King
Cc: Stafford Horne
Cc: Tony Luck
Cc: Vineet Gupta
Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add panic() calls if memblock_alloc() returns NULL.
The panic() format duplicates the one used by memblock itself and in
order to avoid explosion with long parameters list replace open coded
allocation size calculations with a local variable.Link: http://lkml.kernel.org/r/1548057848-15136-17-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport
Cc: Catalin Marinas
Cc: Christophe Leroy
Cc: Christoph Hellwig
Cc: "David S. Miller"
Cc: Dennis Zhou
Cc: Geert Uytterhoeven
Cc: Greentime Hu
Cc: Greg Kroah-Hartman
Cc: Guan Xuetao
Cc: Guo Ren
Cc: Guo Ren [c-sky]
Cc: Heiko Carstens
Cc: Juergen Gross [Xen]
Cc: Mark Salter
Cc: Matt Turner
Cc: Max Filippov
Cc: Michael Ellerman
Cc: Michal Simek
Cc: Paul Burton
Cc: Petr Mladek
Cc: Richard Weinberger
Cc: Rich Felker
Cc: Rob Herring
Cc: Rob Herring
Cc: Russell King
Cc: Stafford Horne
Cc: Tony Luck
Cc: Vineet Gupta
Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Feb, 2019
1 commit
-
group_cnt array is defined with NR_CPUS entries, but normally
nr_groups will not reach up to NR_CPUS. So there is no issue
to the current code.Checking other parts of pcpu_build_alloc_info, use nr_groups as
check condition, so make it consistent to use 'group < nr_groups'
as for loop check. In case we do have nr_groups equals with NR_CPUS,
we could also avoid memory access out of bounds.Signed-off-by: Peng Fan
Signed-off-by: Dennis Zhou
02 Nov, 2018
1 commit
-
Pull percpu fixes from Dennis Zhou:
"Two small things for v4.20.The first fixes a clang uninitialized variable warning for arm64 in
the default path calls BUILD_BUG(). The second removes an unnecessary
unlikely() in a WARN_ON() use"* 'for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
arm64: percpu: Initialize ret in the default case
mm: percpu: remove unnecessary unlikely()
31 Oct, 2018
3 commits
-
When a memblock allocation APIs are called with align = 0, the alignment
is implicitly set to SMP_CACHE_BYTES.Implicit alignment is done deep in the memblock allocator and it can
come as a surprise. Not that such an alignment would be wrong even
when used incorrectly but it is better to be explicit for the sake of
clarity and the prinicple of the least surprise.Replace all such uses of memblock APIs with the 'align' parameter
explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
in the memblock internal allocation functions.For the case when memblock APIs are used via helper functions, e.g. like
iommu_arena_new_node() in Alpha, the helper functions were detected with
Coccinelle's help and then manually examined and updated where
appropriate.The direct memblock APIs users were updated using the semantic patch below:
@@
expression size, min_addr, max_addr, nid;
@@
(
|
- memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
|
- memblock_alloc(size, 0)
+ memblock_alloc(size, SMP_CACHE_BYTES)
|
- memblock_alloc_raw(size, 0)
+ memblock_alloc_raw(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from(size, 0, min_addr)
+ memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_nopanic(size, 0)
+ memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low(size, 0)
+ memblock_alloc_low(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low_nopanic(size, 0)
+ memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from_nopanic(size, 0, min_addr)
+ memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_node(size, 0, nid)
+ memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
)[mhocko@suse.com: changelog update]
[akpm@linux-foundation.org: coding-style fixes]
[rppt@linux.ibm.com: fix missed uses of implicit alignment]
Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Suggested-by: Michal Hocko
Acked-by: Paul Burton [MIPS]
Acked-by: Michael Ellerman [powerpc]
Acked-by: Michal Hocko
Cc: Catalin Marinas
Cc: Chris Zankel
Cc: Geert Uytterhoeven
Cc: Guan Xuetao
Cc: Ingo Molnar
Cc: Matt Turner
Cc: Michal Simek
Cc: Richard Weinberger
Cc: Russell King
Cc: Thomas Gleixner
Cc: Tony Luck
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include@@
@@
- #include
+ #include[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Signed-off-by: Stephen Rothwell
Acked-by: Michal Hocko
Cc: Catalin Marinas
Cc: Chris Zankel
Cc: "David S. Miller"
Cc: Geert Uytterhoeven
Cc: Greentime Hu
Cc: Greg Kroah-Hartman
Cc: Guan Xuetao
Cc: Ingo Molnar
Cc: "James E.J. Bottomley"
Cc: Jonas Bonn
Cc: Jonathan Corbet
Cc: Ley Foon Tan
Cc: Mark Salter
Cc: Martin Schwidefsky
Cc: Matt Turner
Cc: Michael Ellerman
Cc: Michal Simek
Cc: Palmer Dabbelt
Cc: Paul Burton
Cc: Richard Kuo
Cc: Richard Weinberger
Cc: Rich Felker
Cc: Russell King
Cc: Serge Semin
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Vineet Gupta
Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The conversion is done using
sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
$(git grep -l memblock_virt_alloc)Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Cc: Catalin Marinas
Cc: Chris Zankel
Cc: "David S. Miller"
Cc: Geert Uytterhoeven
Cc: Greentime Hu
Cc: Greg Kroah-Hartman
Cc: Guan Xuetao
Cc: Ingo Molnar
Cc: "James E.J. Bottomley"
Cc: Jonas Bonn
Cc: Jonathan Corbet
Cc: Ley Foon Tan
Cc: Mark Salter
Cc: Martin Schwidefsky
Cc: Matt Turner
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Michal Simek
Cc: Palmer Dabbelt
Cc: Paul Burton
Cc: Richard Kuo
Cc: Richard Weinberger
Cc: Rich Felker
Cc: Russell King
Cc: Serge Semin
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Vineet Gupta
Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Oct, 2018
1 commit
-
The commit ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
introduced bitmap metadata blocks. These metadata blocks are allocated
whenever a new chunk is created, but they are never freed. Fix it.Fixes: ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
Signed-off-by: Mike Rapoport
Cc: stable@vger.kernel.org
Signed-off-by: Dennis Zhou