13 Aug, 2020

1 commit

  • Percpu memory is becoming more and more widely used by various subsystems,
    and the total amount of memory controlled by the percpu allocator can make
    up a significant part of the total memory.

    As an example, bpf maps can consume a lot of percpu memory, and they are
    created by users. Also, some cgroup internals (e.g. memory controller
    statistics) can be quite large. On a machine with many CPUs and a big
    number of cgroups, they can consume hundreds of megabytes.

    So the lack of memcg accounting is creating a breach in the memory
    isolation. Similar to the slab memory, percpu memory should be accounted
    by default.

    To implement the percpu accounting, it's possible to take the slab memory
    accounting as a model to follow. Let's introduce two types of percpu
    chunks: root and memcg. What makes memcg chunks different is an
    additional space allocated to store memcg membership information. If
    __GFP_ACCOUNT is passed on allocation, a memcg chunk should be used.
    If it's possible to charge the corresponding size to the target memory
    cgroup, allocation is performed, and the memcg ownership data is recorded.
    System-wide allocations are performed using root chunks, so there is no
    additional memory overhead.

    To implement a fast reparenting of percpu memory on memcg removal, we
    don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
    introduced for slab accounting.
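
    As a minimal sketch of the accounted allocation path (simplified; the
    helper name here is an assumption for illustration, not necessarily the
    exact hook in mm/percpu.c):

        /* Sketch: pick the accounting behaviour and charge before allocating. */
        static bool pcpu_account_pre_alloc(size_t size, gfp_t gfp,
                                           struct obj_cgroup **objcgp)
        {
                struct obj_cgroup *objcg;

                if (!(gfp & __GFP_ACCOUNT))
                        return true;            /* root chunk, no extra overhead */

                objcg = get_obj_cgroup_from_current();
                if (!objcg)
                        return true;

                if (obj_cgroup_charge(objcg, gfp, size)) {
                        obj_cgroup_put(objcg);
                        return false;           /* charge failed, fail the allocation */
                }

                *objcgp = objcg;                /* ownership recorded in the memcg chunk */
                return true;
        }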

    [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
    [akpm@linux-foundation.org: move unreachable code, per Roman]
    [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
    Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com

    Signed-off-by: Roman Gushchin
    Signed-off-by: Bixuan Cui
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Dennis Zhou
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Tobin C. Harding
    Cc: Vlastimil Babka
    Cc: Waiman Long
    Cc: Bixuan Cui
    Cc: Michal Koutný
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this file is released under the gplv2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 68 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Armijn Hemel
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

14 Mar, 2019

1 commit

  • Previously, block size was flexible based on the constraint that the
    GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the
    overhead that keeping a floating number of populated free pages required
    scanning over the free regions of a chunk.

    Setting the block size to be fixed at PAGE_SIZE lets us know when an
    empty page becomes used as we will break a full contig_hint of a block.
    This means we no longer have to scan the whole chunk upon breaking a
    contig_hint, which is what empty page management piggybacked off. A later
    patch takes advantage of this to optimize the allocation path by scanning
    only forward, using the scan_hint it introduces.
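
    A minimal sketch of the invariant this buys us (struct and constant names
    follow mm/percpu-internal.h, but treat the helper as illustrative):

        /*
         * With PCPU_BITMAP_BLOCK_SIZE == PAGE_SIZE, a block whose contig_hint
         * spans the whole block is exactly an empty page.  Any allocation that
         * breaks such a hint therefore turns an empty page into a used one,
         * with no need to rescan the chunk.
         */
        static bool pcpu_block_was_empty_page(const struct pcpu_block_md *block)
        {
                return block->contig_hint == PCPU_BITMAP_BLOCK_BITS;
        }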

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     

19 Dec, 2018

1 commit

  • From Michael Cree:
    "Bisection lead to commit b38d08f3181c ("percpu: restructure
    locking") as being the cause of lockups at initial boot on
    the kernel built for generic Alpha.

    On a suggestion by Tejun Heo that:

    So, the only thing I can think of is that it's calling
    spin_unlock_irq() while irq handling isn't set up yet.
    Can you please try the followings?

    1. Convert all spin_[un]lock_irq() to
    spin_lock_irqsave/unlock_irqrestore()."
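
    The conversion in question looks roughly like this (illustrative fragment,
    not the exact hunks from the patch):

        unsigned long flags;

        /* before: unconditionally re-enables interrupts on unlock */
        spin_lock_irq(&pcpu_lock);
        /* ... */
        spin_unlock_irq(&pcpu_lock);

        /* after: safe even before IRQ handling is fully set up at early boot */
        spin_lock_irqsave(&pcpu_lock, flags);
        /* ... */
        spin_unlock_irqrestore(&pcpu_lock, flags);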

    Fixes: b38d08f3181c ("percpu: restructure locking")
    Reported-and-tested-by: Michael Cree
    Acked-by: Tejun Heo
    Signed-off-by: Dennis Zhou

    Dennis Zhou
     

18 Feb, 2018

2 commits

  • The prior patch added support for passing gfp flags through to the
    underlying allocators. This patch allows users to pass gfp flags
    (currently only __GFP_NORETRY and __GFP_NOWARN) along to those
    allocators, so they can decide whether they are ok with allocations
    failing and recovering in a more graceful way.

    Additionally, gfp passing was done as additional flags in the previous
    patch. Instead, change this to caller passed semantics. GFP_KERNEL is
    also removed as the default flag. It continues to be used for internally
    caused underlying percpu allocations.
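
    For example, a caller that prefers a fast, quiet failure over reclaim
    retries might do something like this (illustrative usage):

        /* Opt out of retries and allocation-failure warnings. */
        void __percpu *counters;

        counters = __alloc_percpu_gfp(sizeof(u64), __alignof__(u64),
                                      GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
        if (!counters)
                return -ENOMEM;         /* recover gracefully instead of retrying */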

    V2:
    Removed gfp_percpu_mask in favor of doing it inline.
    Removed GFP_KERNEL as a default flag for __alloc_percpu_gfp.

    Signed-off-by: Dennis Zhou
    Suggested-by: Daniel Borkmann
    Acked-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Dennis Zhou
     
  • Percpu memory using the vmalloc area based chunk allocator lazily
    populates chunks by first requesting the full virtual address space
    required for the chunk and subsequently adding pages as allocations come
    through. To ensure atomic allocations can succeed, a workqueue item is
    used to maintain a minimum number of empty pages. In certain scenarios,
    such as reported in [1], it is possible that physical memory becomes
    quite scarce, which can result in either a rather long time spent trying
    to find free pages or, worse, a kernel panic.

    This patch adds support for __GFP_NORETRY and __GFP_NOWARN passing them
    through to the underlying allocators. This should prevent any
    unnecessary panics potentially caused by the workqueue item. The passing
    of gfp around is as additional flags rather than a full set of flags.
    The next patch will change these to caller passed semantics.
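
    Inside the population path, the idea is simply to OR the caller-supplied
    flags into the mask handed to the page allocator (a simplified sketch,
    not the exact code in mm/percpu-vm.c):

        /* Sketch: back one percpu page for a given CPU. */
        static struct page *pcpu_alloc_one_page(int cpu, gfp_t gfp)
        {
                /*
                 * Percpu's own work still needs GFP_KERNEL; __GFP_NORETRY and
                 * __GFP_NOWARN ride along so that a scarce-memory situation
                 * fails fast instead of stalling or panicking.
                 */
                return alloc_pages_node(cpu_to_node(cpu), gfp | GFP_KERNEL, 0);
        }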

    V2:
    Added const modifier to gfp flags in the balance path.
    Removed an extra whitespace.

    [1] https://lkml.org/lkml/2018/2/12/551

    Signed-off-by: Dennis Zhou
    Suggested-by: Daniel Borkmann
    Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
    Acked-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Dennis Zhou
     

27 Jul, 2017

1 commit

  • The percpu memory allocator is experiencing scalability issues when
    allocating and freeing large numbers of counters as in BPF.
    Additionally, there is a corner case where iteration is triggered over
    all chunks if the contig_hint is the right size, but wrong alignment.

    This patch replaces the area map allocator with a basic bitmap allocator
    implementation. Each subsequent patch will introduce new features and
    replace full scanning functions with faster non-scanning options when
    possible.

    Implementation:
    This patchset removes the area map allocator in favor of a bitmap
    allocator backed by metadata blocks. The primary goal is to provide
    consistency in performance and memory footprint with a focus on small
    allocations (< 64 bytes). The bitmap removes the heavy memmove from the
    freeing critical path and provides a consistent memory footprint. The
    metadata blocks provide a bound on the amount of scanning required by
    maintaining a set of hints.

    In an effort to make freeing fast, the metadata is updated on the free
    path if the new free area makes a page free, a block free, or spans
    across blocks. This causes the chunk's contig hint to potentially be
    smaller than what it could allocate by up to the smaller of a page or a
    block. If the chunk's contig hint is contained within a block, a check
    occurs and the hint is kept accurate. Metadata is always kept accurate
    on allocation, so there will not be a situation where a chunk has a
    later contig hint than available.
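
    A simplified view of the per-block metadata that bounds the scanning
    (field names are modeled on mm/percpu-internal.h but abbreviated here):

        /* One metadata block covers PCPU_BITMAP_BLOCK_BITS allocation units. */
        struct pcpu_block_md {
                int contig_hint;        /* size of the largest contiguous free area */
                int contig_hint_start;  /* block-relative start of that area */
                int left_free;          /* free units at the start of the block */
                int right_free;         /* free units at the end of the block */
                int first_free;         /* first free unit, where scanning begins */
        };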

    Evaluation:
    I have primarily done testing against a simple workload of allocation of
    1 million objects (2^20) of varying size. Deallocation was done in
    order, alternating, and in reverse. These numbers were collected after
    rebasing on top of a80099a152. I present the worst-case numbers here:

    Area Map Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             310 |           4770
            16B |             557 |           1325
            64B |             436 |            273
           256B |             776 |            131
          1024B |            3280 |            122

    Bitmap Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             490 |             70
            16B |             515 |             75
            64B |             610 |             80
           256B |             950 |            100
          1024B |            3520 |            200

    This data demonstrates the inability of the area map allocator to
    handle less-than-ideal situations. In the best case of reverse
    deallocation, the area map allocator was able to perform within range
    of the bitmap allocator. In the worst case, freeing took nearly
    5 seconds for 1 million 4-byte objects. The bitmap allocator
    dramatically improves the consistency of the free path. The small
    allocations performed nearly identically regardless of the freeing
    pattern.

    While it does add to the allocation latency, the allocation scenario
    here is optimal for the area map allocator. The area map allocator runs
    into trouble when it is allocating in chunks where the latter half is
    full. It is difficult to replicate this, so I present a variant where
    the pages are second half filled. Freeing was done sequentially. Below
    are the numbers for this scenario:

    Area Map Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |            4118 |           4892
            16B |            1651 |           1163
            64B |             598 |            285
           256B |             771 |            158
          1024B |            3034 |            160

    Bitmap Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             481 |             67
            16B |             506 |             69
            64B |             636 |             75
           256B |             892 |             90
          1024B |            3262 |            147

    The data shows a parabolic curve of performance for the area map
    allocator. This is because the memmove operation is the dominant cost at
    lower object sizes, as more objects are packed into a chunk, while at
    higher object sizes the traversal of the chunk slots is the dominating
    cost. The bitmap allocator suffers this problem as well. The above data
    shows the area map allocator's inability to scale on the allocation path,
    while the bitmap allocator demonstrates consistent performance in
    general.

    The second problem of additional scanning can result in the area map
    allocator completing in 52 minutes when trying to allocate 1 million
    4-byte objects with 8-byte alignment. The same workload takes
    approximately 16 seconds to complete for the bitmap allocator.

    V2:
    Fixed a bug in pcpu_alloc_first_chunk: end_offset was setting the bitmap
    using bytes instead of bits.

    Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Tejun Heo

    Dennis Zhou (Facebook)
     

21 Jun, 2017

2 commits

  • Add support for tracepoints to the following events: chunk allocation,
    chunk free, area allocation, area free, and area allocation failure.
    This should let us replay percpu memory requests and evaluate
    corresponding decisions.
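
    As an illustration of the kind of event added (the exact TP_PROTO and
    field layout in include/trace/events/percpu.h may differ):

        TRACE_EVENT(percpu_destroy_chunk,

                TP_PROTO(void *base_addr),

                TP_ARGS(base_addr),

                TP_STRUCT__entry(
                        __field(void *, base_addr)
                ),

                TP_fast_assign(
                        __entry->base_addr = base_addr;
                ),

                TP_printk("base_addr=%p", __entry->base_addr)
        );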

    Signed-off-by: Dennis Zhou
    Signed-off-by: Tejun Heo

    Dennis Zhou
     
  • There is limited visibility into the use of percpu memory leaving us
    unable to reason about correctness of parameters and overall use of
    percpu memory. These counters and statistics aim to help understand
    basic statistics about percpu memory such as number of allocations over
    the lifetime, allocation sizes, and fragmentation.

    New Config: PERCPU_STATS
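
    Roughly, the counters kept look like this (an abbreviated sketch, not the
    exact definition in mm/percpu-internal.h):

        struct percpu_stats {
                u64 nr_alloc;           /* lifetime number of allocations */
                u64 nr_dealloc;         /* lifetime number of frees */
                u64 nr_cur_alloc;       /* currently live allocations */
                u64 nr_max_alloc;       /* high-water mark of live allocations */
                size_t min_alloc_size;  /* smallest allocation size seen */
                size_t max_alloc_size;  /* largest allocation size seen */
        };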

    Signed-off-by: Dennis Zhou
    Signed-off-by: Tejun Heo

    Dennis Zhou
     

18 Mar, 2016

2 commits

  • Use the normal pr_fmt mechanism to make the logging output consistently
    "percpu:" instead of a mix of "PERCPU:" and "percpu:".

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Most of the mm subsystem uses pr_<level>, so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

03 Sep, 2014

5 commits

  • pcpu_nr_empty_pop_pages counts the number of empty populated pages
    across all chunks and chunk->nr_populated counts the number of
    populated pages in a chunk. Both will be used to implement pre/async
    population for atomic allocations.

    pcpu_chunk_[de]populated() are added to update chunk->populated,
    chunk->nr_populated and pcpu_nr_empty_pop_pages together. All
    successful chunk [de]populations should be followed by the
    corresponding pcpu_chunk_[de]populated() calls.
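
    A minimal sketch of the bookkeeping helper (simplified from the
    mm/percpu.c of that era):

        /* Must be called with pcpu_lock held. */
        static void pcpu_chunk_populated(struct pcpu_chunk *chunk,
                                         int page_start, int page_end)
        {
                int nr = page_end - page_start;

                lockdep_assert_held(&pcpu_lock);

                bitmap_set(chunk->populated, page_start, nr);
                chunk->nr_populated += nr;
                pcpu_nr_empty_pop_pages += nr;
        }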

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • At first, the percpu allocator required a sleepable context for both
    alloc and free paths and used pcpu_alloc_mutex to protect everything.
    Later, pcpu_lock was introduced to protect the index data structure so
    that the free path can be invoked from atomic contexts. The
    conversion only updated what's necessary and left most of the
    allocation path under pcpu_alloc_mutex.

    The percpu allocator is planned to add support for atomic allocation
    and this patch restructures locking so that the coverage of
    pcpu_alloc_mutex is further reduced.

    * pcpu_alloc() now grabs pcpu_alloc_mutex only while creating a new
    chunk and populating the allocated area. Everything else is now
    protected solely by pcpu_lock.

    After this change, multiple instances of pcpu_extend_area_map() may
    race but the function already implements sufficient synchronization
    using pcpu_lock.

    This also allows multiple allocators to arrive at new chunk
    creation. To avoid creating multiple empty chunks back-to-back, a
    new chunk is created iff there is no other empty chunk after
    grabbing pcpu_alloc_mutex.

    * pcpu_lock is now held while modifying chunk->populated bitmap.
    After this, all data structures are protected by pcpu_lock.
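
    The resulting shape of pcpu_alloc() is roughly the following (heavily
    simplified, with error handling omitted):

        spin_lock_irqsave(&pcpu_lock, flags);
        /* fast path: search existing chunks for a free area (index data) */
        spin_unlock_irqrestore(&pcpu_lock, flags);

        /* slow path: only chunk creation and page population take the mutex */
        mutex_lock(&pcpu_alloc_mutex);
        /* create a new chunk iff no empty chunk exists; populate pages */
        mutex_unlock(&pcpu_alloc_mutex);

        spin_lock_irqsave(&pcpu_lock, flags);
        /* publish the area and update chunk->populated under pcpu_lock */
        spin_unlock_irqrestore(&pcpu_lock, flags);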

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • percpu-km instantiates the whole chunk on creation and doesn't make
    use of the chunk->populated bitmap, leaving it as zero. While this
    currently doesn't cause any problem, the inconsistency makes it
    difficult to build further logic on top of chunk->populated. This
    patch makes percpu-km fill chunk->populated on creation so that the
    bitmap is always consistent.
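
    The fix amounts to marking every page populated when the chunk is created,
    along the lines of:

        /* In percpu-km's pcpu_create_chunk(), after the pages are allocated: */
        bitmap_fill(chunk->populated, nr_pages);        /* whole chunk is backed */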

    Signed-off-by: Tejun Heo
    Acked-by: Christoph Lameter

    Tejun Heo
     
  • Previously, pcpu_[de]populate_chunk() were called with the range which
    may contain multiple target regions in it and
    pcpu_[de]populate_chunk() iterated over the regions. This has the
    benefit of batching up cache flushes for all the regions; however,
    we're planning to add more bookkeeping logic around [de]population to
    support atomic allocations and this delegation of iterations gets in
    the way.

    This patch moves the region iterations out of
    pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and
    pcpu_reclaim() - so that we can later add logic to track more states
    around them. This change may make cache and tlb flushes more frequent
    but multi-region [de]populations are rare anyway and if this actually
    becomes a problem, it's not difficult to factor out cache flushes as
    separate callbacks which are directly invoked from percpu.c.
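
    After the move, the caller-side loop looks roughly like this (illustrative;
    the iteration macro follows the mm/percpu.c of that time):

        /* pcpu_alloc(): populate only the still-unpopulated regions. */
        pcpu_for_each_unpop_region(chunk, rs, re, page_start, page_end) {
                ret = pcpu_populate_chunk(chunk, rs, re);
                if (ret)
                        break;  /* undo and fail the allocation */
                /* per-region bookkeeping for atomic allocation goes here later */
        }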

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • percpu-vm and percpu-km implement separate versions of
    pcpu_[de]populate_chunk(), and some parts which are, or should be,
    common currently live in the specific implementations. Make the
    following changes.

    * Allocated area clearing is moved from the pcpu_populate_chunk()
    implementations to pcpu_alloc(), as sketched after this list. This
    makes percpu-km's version a noop.

    * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved
    to their respective callers so that they are applied to percpu-km
    too. This doesn't make any meaningful difference as both functions
    are noop for percpu-km; however, this is more consistent and will
    help implementing atomic allocation support.
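
    The clearing now happens once in pcpu_alloc(), roughly:

        /* zero the returned area for every possible CPU */
        for_each_possible_cpu(cpu)
                memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);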

    Signed-off-by: Tejun Heo

    Tejun Heo
     

08 Sep, 2010

1 commit

  • On UP, percpu allocations were redirected to kmalloc. This has the
    following problems.

    * For a certain number of allocations (determined by
    PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), the percpu
    allocator can be used before the usual kernel memory allocator is
    brought online. On SMP, this is used to initialize the kernel
    memory allocator.

    * percpu allocator honors alignment up to PAGE_SIZE but kmalloc()
    doesn't. For example, workqueue makes use of larger alignments for
    cpu_workqueues.

    Currently, users of percpu allocators need to handle UP differently,
    which is somewhat fragile and ugly. Other than a small amount of
    memory, there isn't much to lose by enabling the percpu allocator on UP.
    It can simply use kernel memory based chunk allocation which was added
    for SMP archs w/o MMUs.

    This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
    makes UP build use percpu-km. As percpu addresses and kernel
    addresses are always identity mapped and static percpu variables don't
    need any special treatment, nothing is arch dependent and mm/percpu.c
    implements generic setup_per_cpu_areas() for UP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Acked-by: Pekka Enberg

    Tejun Heo
     

01 May, 2010

1 commit

  • Implement an alternate percpu chunk management based on kernel memory
    for nommu SMP architectures. Instead of mapping into the vmalloc area,
    chunks are allocated as a contiguous kernel memory using
    alloc_pages(). As such, percpu allocator on nommu will have the
    following restrictions.

    * It can't fill chunks on-demand page-by-page. It has to allocate
    each chunk fully upfront.

    * It can't support sparse chunks for NUMA configurations. SMP w/o mmu
    is crazy enough. Let's hope no one does NUMA w/o mmu. :-P

    * If chunk size isn't a power-of-two multiple of PAGE_SIZE, the
    unaligned amount will be wasted on each chunk. So, archs which use
    this better align chunk size.

    For instructions on how to use this, read the comment on top of
    mm/percpu-km.c.
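
    The core of the scheme is just one high-order page allocation per chunk,
    roughly (a simplified sketch of mm/percpu-km.c):

        static struct page *pcpu_km_alloc_chunk_pages(int nr_pages)
        {
                /* one contiguous, identity-mapped block backs the whole chunk */
                return alloc_pages(GFP_KERNEL, order_base_2(nr_pages));
        }

        /* chunk->base_addr is then derived via page_address(); no vmalloc mapping */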

    Signed-off-by: Tejun Heo
    Reviewed-by: David Howells
    Cc: Graff Yang
    Cc: Sonic Zhang

    Tejun Heo