08 Apr, 2018

1 commit

  • commit 47504ee04b9241548ae2c28be7d0b01cff3b7aa6 upstream.

    Percpu memory using the vmalloc area based chunk allocator lazily
    populates chunks by first requesting the full virtual address space
    required for the chunk and subsequently adding pages as allocations come
    through. To ensure atomic allocations can succeed, a workqueue item is
    used to maintain a minimum number of empty pages. In certain scenarios,
    such as the one reported in [1], physical memory can become quite
    scarce, which can result in either a rather long time spent trying to
    find free pages or, worse, a kernel panic.

    This patch adds support for __GFP_NORETRY and __GFP_NOWARN, passing them
    through to the underlying allocators. This should prevent any
    unnecessary panics potentially caused by the workqueue item. The gfp
    flags are passed around as additional flags rather than as a full set
    of flags. The next patch will change these to caller-passed semantics.
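
    As a rough illustration of the intent (function and parameter names here
    are hypothetical, not the actual mm/percpu.c code), the balance path can
    OR the extra bits into its base flags before calling into the page
    allocator:

        #include <linux/mm.h>
        #include <linux/errno.h>

        /*
         * Illustrative sketch only: with __GFP_NORETRY | __GFP_NOWARN passed
         * in, a failed refill gives up quietly instead of retrying
         * indefinitely or warning under memory pressure.
         */
        static int example_populate_pages(struct page **pages, int nr_pages,
                                          gfp_t extra_gfp)
        {
                const gfp_t gfp = GFP_KERNEL | extra_gfp;
                int i;

                for (i = 0; i < nr_pages; i++) {
                        pages[i] = alloc_page(gfp);
                        if (!pages[i])
                                return -ENOMEM; /* caller unwinds */
                }
                return 0;
        }

        /* e.g., from the balance work item:
         * example_populate_pages(pages, nr, __GFP_NORETRY | __GFP_NOWARN);
         */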

    V2:
    Added const modifier to gfp flags in the balance path.
    Removed an extra whitespace.

    [1] https://lkml.org/lkml/2018/2/12/551

    Signed-off-by: Dennis Zhou
    Suggested-by: Daniel Borkmann
    Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
    Acked-by: Christoph Lameter
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Dennis Zhou
     

27 Jul, 2017

1 commit

  • The percpu memory allocator is experiencing scalability issues when
    allocating and freeing large numbers of counters as in BPF.
    Additionally, there is a corner case where iteration over all chunks is
    triggered if the contig_hint is the right size but has the wrong
    alignment.

    This patch replaces the area map allocator with a basic bitmap allocator
    implementation. Each subsequent patch will introduce new features and
    replace full scanning functions with faster non-scanning options when
    possible.

    Implementation:
    This patchset removes the area map allocator in favor of a bitmap
    allocator backed by metadata blocks. The primary goal is to provide
    consistency in performance and memory footprint with a focus on small
    allocations (< 64 bytes). The bitmap removes the heavy memmove from the
    freeing critical path and provides a consistent memory footprint. The
    metadata blocks provide a bound on the amount of scanning required by
    maintaining a set of hints.
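
    To make the structure concrete, here is a heavily simplified sketch of
    the idea (names and layout are illustrative, not the actual percpu
    internals): each chunk carries one allocation bit per minimum-sized
    allocation unit, and each metadata block caches hints that bound how far
    an allocation has to scan.

        /*
         * Simplified, hypothetical layout: one bit per allocation unit plus
         * per-block hints so most allocations scan a single block rather
         * than the whole chunk.
         */
        struct example_block_md {
                int contig_hint;        /* largest known free run in the block */
                int first_free;         /* first free bit, bounds forward scans */
                int free_bits;          /* total free bits in the block */
        };

        struct example_chunk_md {
                unsigned long *alloc_map;       /* allocation bitmap, 1 bit/unit */
                struct example_block_md *md_blocks;
                int contig_hint;        /* chunk-wide hint, may be conservative */
        };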

    In an effort to make freeing fast, the metadata is updated on the free
    path only if the new free area makes a page free, a block free, or
    spans across blocks. This can leave the chunk's contig hint smaller
    than what could actually be allocated, by up to the smaller of a page
    or a block. If the chunk's contig hint is contained within a block, a
    check occurs and the hint is kept accurate. Metadata is always kept
    accurate on allocation, so a chunk will never have a larger contig hint
    than is actually available.
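
    A self-contained sketch of that free-path rule, using a deliberately
    rough proxy for "frees a page or a block" (the real code tracks this
    exactly via the block hints; the helper below is hypothetical):

        #include <linux/types.h>

        /*
         * Hypothetical helper: only frees that cross a block boundary, or
         * that span at least a block or page worth of bits, trigger a full
         * chunk-hint refresh; smaller frees update just their block, so the
         * chunk-wide contig hint may temporarily understate the largest
         * free run.
         */
        static bool example_needs_full_refresh(int bit_off, int bits,
                                               int bits_per_block,
                                               int bits_per_page)
        {
                int start_block = bit_off / bits_per_block;
                int end_block = (bit_off + bits - 1) / bits_per_block;

                return start_block != end_block ||
                       bits >= bits_per_block ||
                       bits >= bits_per_page;
        }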

    Evaluation:
    I have primarily tested against a simple workload: allocating 1 million
    objects (2^20) of varying size. Deallocation was done in order,
    alternating, and in reverse. These numbers were collected after
    rebasing on top of a80099a152. I present the worst-case numbers here:

    Area Map Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             310 |           4770
            16B |             557 |           1325
            64B |             436 |            273
           256B |             776 |            131
          1024B |            3280 |            122

    Bitmap Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             490 |             70
            16B |             515 |             75
            64B |             610 |             80
           256B |             950 |            100
          1024B |            3520 |            200

    This data demonstrates the area map allocator's inability to handle
    less-than-ideal situations. In the best case of reverse deallocation,
    the area map allocator was able to perform within range of the bitmap
    allocator. In the worst case, freeing 1 million 4-byte objects took
    nearly 5 seconds. The bitmap allocator dramatically improves the
    consistency of the free path; the small allocations performed nearly
    identically regardless of the freeing pattern.

    While it does add to allocation latency, the allocation scenario here
    is optimal for the area map allocator. The area map allocator runs
    into trouble when it is allocating in chunks where the latter half is
    full. This is difficult to replicate, so I present a variant where the
    second half of each page is filled. Freeing was done sequentially.
    Below are the numbers for this scenario:

    Area Map Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |            4118 |           4892
            16B |            1651 |           1163
            64B |             598 |            285
           256B |             771 |            158
          1024B |            3034 |            160

    Bitmap Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             481 |             67
            16B |             506 |             69
            64B |             636 |             75
           256B |             892 |             90
          1024B |            3262 |            147

    The data shows a parabolic performance curve for the area map
    allocator. At the lower object sizes, the memmove operation is the
    dominant cost because more objects are packed into a chunk; at higher
    object sizes, traversal of the chunk slots becomes the dominating
    cost. The bitmap allocator suffers this problem as well. Overall, the
    data shows that the allocation path of the area map allocator does not
    scale, while the bitmap allocator demonstrates consistent performance
    in general.

    The second problem, the extra scanning in the alignment corner case
    mentioned above, can result in the area map allocator taking 52
    minutes to allocate 1 million 4-byte objects with 8-byte alignment.
    The same workload takes approximately 16 seconds to complete with the
    bitmap allocator.

    V2:
    Fixed a bug in pcpu_alloc_first_chunk where end_offset was setting the
    bitmap using bytes instead of bits.

    Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Tejun Heo

    Dennis Zhou (Facebook)
     

29 Jun, 2017

1 commit


21 Jun, 2017

2 commits

  • Add support for tracepoints to the following events: chunk allocation,
    chunk free, area allocation, area free, and area allocation failure.
    This should let us replay percpu memory requests and evaluate
    corresponding decisions.
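
    For illustration, a tracepoint for one of these events might be defined
    roughly as follows (the event name and fields are hypothetical, and the
    usual trace-header boilerplate such as the TRACE_SYSTEM definition and
    include guards is omitted; see include/trace/events/ for real examples):

        #include <linux/tracepoint.h>

        /* Hypothetical area-allocation event: size, alignment, reserved flag. */
        TRACE_EVENT(example_percpu_alloc,

                TP_PROTO(size_t size, size_t align, bool reserved),

                TP_ARGS(size, align, reserved),

                TP_STRUCT__entry(
                        __field(size_t, size)
                        __field(size_t, align)
                        __field(bool, reserved)
                ),

                TP_fast_assign(
                        __entry->size = size;
                        __entry->align = align;
                        __entry->reserved = reserved;
                ),

                TP_printk("size=%zu align=%zu reserved=%d",
                          __entry->size, __entry->align, __entry->reserved)
        );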

    Signed-off-by: Dennis Zhou
    Signed-off-by: Tejun Heo

    Dennis Zhou
     
    There is limited visibility into the use of percpu memory, leaving us
    unable to reason about the correctness of parameters and the overall
    use of percpu memory. These counters and statistics aim to provide
    basic information about percpu memory, such as the number of
    allocations over the lifetime of the system, allocation sizes, and
    fragmentation.

    New Config: PERCPU_STATS
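
    As a hedged sketch of what such a stats option might track (field names
    are illustrative, not the actual mm/percpu-stats.c structures):

        #include <linux/types.h>

        /* Hypothetical counters gathered when PERCPU_STATS is enabled. */
        struct example_percpu_stats {
                u64 nr_alloc;           /* allocations over the system lifetime */
                u64 nr_dealloc;         /* frees over the system lifetime */
                u64 nr_cur_alloc;       /* currently live allocations */
                u64 nr_max_alloc;       /* high-water mark of live allocations */
                size_t min_alloc_size;  /* smallest allocation seen */
                size_t max_alloc_size;  /* largest allocation seen */
        };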

    Signed-off-by: Dennis Zhou
    Signed-off-by: Tejun Heo

    Dennis Zhou
     

18 Mar, 2016

2 commits

    Use the normal pr_fmt mechanism to make the logging output consistently
    "percpu:" instead of a mix of "PERCPU:" and "percpu:".

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
    Most of the mm subsystem uses pr_<level> so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

03 Sep, 2014

5 commits

  • pcpu_nr_empty_pop_pages counts the number of empty populated pages
    across all chunks and chunk->nr_populated counts the number of
    populated pages in a chunk. Both will be used to implement pre/async
    population for atomic allocations.

    pcpu_chunk_[de]populated() are added to update chunk->populated,
    chunk->nr_populated and pcpu_nr_empty_pop_pages together. All
    successful chunk [de]populations should be followed by the
    corresponding pcpu_chunk_[de]populated() calls.
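
    A simplified sketch of that bookkeeping (structure and helper names are
    illustrative stand-ins, not the real mm/percpu.c helpers, which are
    called with pcpu_lock held):

        #include <linux/bitmap.h>

        struct example_chunk {
                unsigned long *populated;       /* populated-page bitmap */
                int nr_populated;               /* populated pages in this chunk */
        };

        static int example_nr_empty_pop_pages; /* stand-in for pcpu_nr_empty_pop_pages */

        /* Keep the bitmap and both counters in sync in one place. */
        static void example_chunk_populated(struct example_chunk *chunk,
                                            int page_start, int page_end)
        {
                int nr = page_end - page_start;

                bitmap_set(chunk->populated, page_start, nr);
                chunk->nr_populated += nr;
                example_nr_empty_pop_pages += nr; /* new pages hold no allocations yet */
        }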

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • At first, the percpu allocator required a sleepable context for both
    alloc and free paths and used pcpu_alloc_mutex to protect everything.
    Later, pcpu_lock was introduced to protect the index data structure so
    that the free path can be invoked from atomic contexts. The
    conversion only updated what's necessary and left most of the
    allocation path under pcpu_alloc_mutex.

    Atomic allocation support is planned for the percpu allocator, and
    this patch restructures the locking so that the coverage of
    pcpu_alloc_mutex is further reduced; a rough sketch of the resulting
    split follows after the points below.

    * pcpu_alloc() now grabs pcpu_alloc_mutex only while creating a new
    chunk and populating the allocated area. Everything else is now
    protected solely by pcpu_lock.

    After this change, multiple instances of pcpu_extend_area_map() may
    race but the function already implements sufficient synchronization
    using pcpu_lock.

    This also allows multiple allocators to arrive at new chunk
    creation. To avoid creating multiple empty chunks back-to-back, a
    new chunk is created iff there is no other empty chunk after
    grabbing pcpu_alloc_mutex.

    * pcpu_lock is now held while modifying chunk->populated bitmap.
    After this, all data structures are protected by pcpu_lock.
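
    The resulting split, roughly (illustrative skeleton only, not the
    actual pcpu_alloc() body; the lock variables are stand-ins for
    pcpu_alloc_mutex and pcpu_lock):

        #include <linux/mutex.h>
        #include <linux/spinlock.h>

        static DEFINE_MUTEX(example_alloc_mutex);
        static DEFINE_SPINLOCK(example_lock);

        static void example_alloc_locking(void)
        {
                spin_lock_irq(&example_lock);
                /* fast path: scan existing chunks and their area maps */
                spin_unlock_irq(&example_lock);

                mutex_lock(&example_alloc_mutex);
                /* slow path only: create a new chunk, populate pages */
                mutex_unlock(&example_alloc_mutex);

                spin_lock_irq(&example_lock);
                /* publish the new chunk / update the populated bitmap */
                spin_unlock_irq(&example_lock);
        }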

    Signed-off-by: Tejun Heo

    Tejun Heo
     
    percpu-km instantiates the whole chunk on creation and doesn't make
    use of the chunk->populated bitmap, leaving it as zero. While this
    currently doesn't cause any problem, the inconsistency makes it
    difficult to build further logic on top of chunk->populated. This
    patch makes percpu-km fill chunk->populated on creation so that the
    bitmap is always consistent.
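
    The idea of the fix, as a minimal sketch (illustrative, not the exact
    mm/percpu-km.c hunk):

        #include <linux/bitmap.h>

        /* The whole chunk is backed up front, so every page is populated. */
        static void example_km_mark_populated(unsigned long *populated,
                                              int nr_pages)
        {
                bitmap_fill(populated, nr_pages);
        }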

    Signed-off-by: Tejun Heo
    Acked-by: Christoph Lameter

    Tejun Heo
     
    Previously, pcpu_[de]populate_chunk() were called with a range which
    may contain multiple target regions, and they iterated over those
    regions themselves. This has the
    benefit of batching up cache flushes for all the regions; however,
    we're planning to add more bookkeeping logic around [de]population to
    support atomic allocations and this delegation of iterations gets in
    the way.

    This patch moves the region iterations out of
    pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and
    pcpu_reclaim() - so that we can later add logic to track more states
    around them. This change may make cache and tlb flushes more frequent
    but multi-region [de]populations are rare anyway and if this actually
    becomes a problem, it's not difficult to factor out cache flushes as
    separate callbacks which are directly invoked from percpu.c.
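
    Structurally, the change looks roughly like the self-contained sketch
    below: the caller owns the loop over regions and the [de]populate
    helper handles exactly one contiguous region, leaving a natural spot
    for per-region bookkeeping (all names here are hypothetical, not
    mm/percpu.c functions):

        /* Handles exactly one contiguous region of pages. */
        static int example_populate_region(int page_start, int page_end)
        {
                /* map and flush pages in [page_start, page_end) */
                return 0;
        }

        /* The caller iterates, so per-region state tracking can live here. */
        static int example_populate_range(const int *region_start,
                                          const int *region_end, int nr_regions)
        {
                int i, ret;

                for (i = 0; i < nr_regions; i++) {
                        ret = example_populate_region(region_start[i],
                                                      region_end[i]);
                        if (ret)
                                return ret;
                        /* per-region bookkeeping (counters, etc.) goes here */
                }
                return 0;
        }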

    Signed-off-by: Tejun Heo

    Tejun Heo
     
    percpu-vm and percpu-km implement separate versions of
    pcpu_[de]populate_chunk(), and some parts which are or should be
    common currently live in the specific implementations. Make the
    following
    changes.

    * Allocated area clearing is moved from the pcpu_populate_chunk()
    implementations to pcpu_alloc(). This makes percpu-km's version a
    noop.

    * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved
    to their respective callers so that they are applied to percpu-km
    too. This doesn't make any meaningful difference as both functions
    are noops for percpu-km; however, it is more consistent and will
    help in implementing atomic allocation support.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

10 Sep, 2010

1 commit


08 Sep, 2010

1 commit

  • On UP, percpu allocations were redirected to kmalloc. This has the
    following problems.

    * For a certain number of allocations (determined by
    PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), the percpu
    allocator can be used before the usual kernel memory allocator is
    brought online. On SMP, this is used to initialize the kernel
    memory allocator.

    * The percpu allocator honors alignment up to PAGE_SIZE but kmalloc()
    doesn't. For example, workqueue makes use of larger alignments for
    cpu_workqueues.

    Currently, users of percpu allocators need to handle UP differently,
    which is somewhat fragile and ugly. Other than a small amount of
    memory, there isn't much to lose by enabling the percpu allocator on
    UP. It can simply use the kernel memory based chunk allocation which
    was added for SMP archs w/o MMUs.

    This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
    makes UP build use percpu-km. As percpu addresses and kernel
    addresses are always identity mapped and static percpu variables don't
    need any special treatment, nothing is arch dependent and mm/percpu.c
    implements generic setup_per_cpu_areas() for UP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Acked-by: Pekka Enberg

    Tejun Heo
     

01 May, 2010

1 commit

    Implement an alternate percpu chunk management scheme based on kernel
    memory for nommu SMP architectures. Instead of mapping into the
    vmalloc area, chunks are allocated as contiguous kernel memory using
    alloc_pages(). As such, the percpu allocator on nommu will have the
    following restrictions.

    * It can't fill chunks on-demand page-by-page. It has to allocate
    each chunk fully upfront.

    * It can't support sparse chunk for NUMA configurations. SMP w/o mmu
    is crazy enough. Let's hope no one does NUMA w/o mmu. :-P

    * If the chunk size isn't a power-of-two multiple of PAGE_SIZE, the
    unaligned amount will be wasted on each chunk. So, archs which use
    this had better align the chunk size.

    For instructions on how to use this, read the comment on top of
    mm/percpu-km.c.
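
    The core of the scheme, as a hedged sketch (simplified; the function
    name is illustrative and the real implementation lives in
    mm/percpu-km.c):

        #include <linux/gfp.h>
        #include <linux/mm.h>

        /*
         * One higher-order allocation backs the whole chunk, so there is no
         * page-by-page population step; any gap between chunk_size and
         * (PAGE_SIZE << order) is simply wasted, which is why the chunk size
         * should be a power-of-two multiple of PAGE_SIZE.
         */
        static struct page *example_km_alloc_chunk_pages(size_t chunk_size)
        {
                unsigned int order = get_order(chunk_size);

                return alloc_pages(GFP_KERNEL, order);
        }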

    Signed-off-by: Tejun Heo
    Reviewed-by: David Howells
    Cc: Graff Yang
    Cc: Sonic Zhang

    Tejun Heo