02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

27 Jul, 2017

3 commits

  • The bitmap allocator must keep metadata consistent. The easiest way is
    to scan after every allocation for each affected block and the entire
    chunk. This is rather expensive.

    The free path can take advantage of current contig hints to prevent
    scanning within the start and end block. If a scan is needed, it can
    be done by scanning backwards from the start and forwards from the end
    to identify the entire free area this can be combined with. The blocks
    can then be updated by some basic checks rather than complete block
    scans.

    A chunk scan happens when the freed area makes a page free, a block
    free, or spans across blocks. This is necessary as the contig hint at
    this point could span across blocks. The check uses the minimum of page
    size and the block size to allow for variable sized blocks. There is a
    tradeoff here with not updating after every free. It is possible a
    contig hint in one block can be merged with the contig hint in the next
    block. This means the contig hint can be off by up to a page. However,
    if the chunk's contig hint is contained in one block, the contig hint
    will be accurate.
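
    The backward and forward scan on the free path can be illustrated with
    a minimal userspace sketch. It assumes a byte-per-unit allocation map
    instead of the kernel's packed bitmap, and the names free_and_merge()
    and struct free_area are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: the whole free run a freed region joins. */
struct free_area {
    size_t start; /* first free unit of the merged run */
    size_t end;   /* one past the last free unit of the merged run */
};

/*
 * Free units [start, end) in alloc_map (1 = allocated, 0 = free), then
 * scan backwards from start and forwards from end to find the entire
 * free run that the freed region is now part of.
 */
static struct free_area free_and_merge(unsigned char *alloc_map,
                                       size_t nr_units,
                                       size_t start, size_t end)
{
    struct free_area area;
    size_t i;

    assert(start < end && end <= nr_units);

    for (i = start; i < end; i++)
        alloc_map[i] = 0;

    /* Scan backwards from the start of the freed region. */
    area.start = start;
    while (area.start > 0 && !alloc_map[area.start - 1])
        area.start--;

    /* Scan forwards from the end of the freed region. */
    area.end = end;
    while (area.end < nr_units && !alloc_map[area.end])
        area.end++;

    return area;
}
```

    Only the units bordering the freed region are inspected, so the cost
    depends on the size of the neighbouring free space rather than on the
    size of the chunk.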

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Tejun Heo

    Dennis Zhou (Facebook)
     
  • This patch introduces the bitmap metadata blocks and adds the skeleton
    of the code that will be used to maintain these blocks. Each chunk's
    bitmap is made up of full metadata blocks. These blocks maintain basic
    metadata to help prevent scanning unnecessarily to update hints. Full
    scanning methods are used for the skeleton and will be replaced in the
    coming patches. A number of helper functions are added as well to do
    conversion of pages to blocks and manage offsets. Comments will be
    updated as the final version of each function is added.

    There exists a relationship between PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE,
    the region size, and unit_size. Every chunk's region (including offsets)
    is page aligned at the beginning to preserve alignment. The end is
    aligned to LCM(PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE) to ensure that the end
    can fit with the populated page map which is by page and every metadata
    block is fully accounted for. The unit_size is already page aligned, but
    must also be aligned with PCPU_BITMAP_BLOCK_SIZE to ensure full metadata
    blocks.
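
    The end alignment rule can be worked through with a small sketch. The
    helper names below are hypothetical, and the sizes are passed as
    parameters rather than taken from PAGE_SIZE and PCPU_BITMAP_BLOCK_SIZE,
    so this shows only the arithmetic, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor by Euclid's algorithm. */
static size_t gcd_sz(size_t a, size_t b)
{
    while (b) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple; both arguments must be non-zero. */
static size_t lcm_sz(size_t a, size_t b)
{
    assert(a && b);
    return a / gcd_sz(a, b) * b;
}

/* Round x up to the next multiple of align. */
static size_t align_up(size_t x, size_t align)
{
    assert(align);
    return (x + align - 1) / align * align;
}

/*
 * Size of a region once its end is aligned to LCM(page_size, block_size),
 * so that it covers whole pages and whole metadata blocks.
 */
static size_t aligned_region_size(size_t size, size_t page_size,
                                  size_t block_size)
{
    return align_up(size, lcm_sz(page_size, block_size));
}
```

    When the block size equals the page size the LCM is just the page
    size, so the rule reduces to ordinary page alignment.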

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Tejun Heo

    Dennis Zhou (Facebook)
     
  • The percpu memory allocator is experiencing scalability issues when
    allocating and freeing large numbers of counters as in BPF.
    Additionally, there is a corner case where iteration is triggered over
    all chunks if the contig_hint is the right size, but wrong alignment.

    This patch replaces the area map allocator with a basic bitmap allocator
    implementation. Each subsequent patch will introduce new features and
    replace full scanning functions with faster non-scanning options when
    possible.

    Implementation:
    This patchset removes the area map allocator in favor of a bitmap
    allocator backed by metadata blocks. The primary goal is to provide
    consistency in performance and memory footprint with a focus on small
    allocations (< 64 bytes). The bitmap removes the heavy memmove from the
    freeing critical path and provides a consistent memory footprint. The
    metadata blocks provide a bound on the amount of scanning required by
    maintaining a set of hints.

    In an effort to make freeing fast, the metadata is updated on the free
    path if the new free area makes a page free, a block free, or spans
    across blocks. This causes the chunk's contig hint to potentially be
    smaller than what it could allocate by up to the smaller of a page or a
    block. If the chunk's contig hint is contained within a block, a check
    occurs and the hint is kept accurate. Metadata is always kept accurate
    on allocation, so there will not be a situation where a chunk has a
    larger contig hint than available.

    Evaluation:
    I have primarily done testing against a simple workload of allocation of
    1 million objects (2^20) of varying size. Deallocation was done in
    order, alternating, and in reverse. These numbers were collected after
    rebasing on top of a80099a152. I present the worst-case numbers here:

    Area Map Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             310 |           4770
            16B |             557 |           1325
            64B |             436 |            273
           256B |             776 |            131
          1024B |            3280 |            122

    Bitmap Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             490 |             70
            16B |             515 |             75
            64B |             610 |             80
           256B |             950 |            100
          1024B |            3520 |            200

    This data demonstrates the inability of the area map allocator to
    handle less than ideal situations. In the best case of reverse
    deallocation, the area map allocator was able to perform within range
    of the bitmap allocator. In the worst case situation, freeing took
    nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
    dramatically improves the consistency of the free path. The small
    allocations performed nearly identically regardless of the freeing
    pattern.

    While it does add to the allocation latency, the allocation scenario
    here is optimal for the area map allocator. The area map allocator runs
    into trouble when it is allocating in chunks where the latter half is
    full. It is difficult to replicate this, so I present a variant where
    the pages are second half filled. Freeing was done sequentially. Below
    are the numbers for this scenario:

    Area Map Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |            4118 |           4892
            16B |            1651 |           1163
            64B |             598 |            285
           256B |             771 |            158
          1024B |            3034 |            160

    Bitmap Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             481 |             67
            16B |             506 |             69
            64B |             636 |             75
           256B |             892 |             90
          1024B |            3262 |            147

    The data shows a parabolic curve of performance for the area map
    allocator. This is due to the memmove operation being the dominant cost
    with the lower object sizes as more objects are packed in a chunk and at
    higher object sizes, the traversal of the chunk slots is the dominating
    cost. The bitmap allocator suffers this problem as well. The above data
    shows the inability to scale for the allocation path with the area map
    allocator and that the bitmap allocator demonstrates consistent
    performance in general.

    The second problem of additional scanning can result in the area map
    allocator completing in 52 minutes when trying to allocate 1 million
    4-byte objects with 8-byte alignment. The same workload takes
    approximately 16 seconds to complete for the bitmap allocator.

    V2:
    Fixed a bug in pcpu_alloc_first_chunk where end_offset was setting the
    bitmap using bytes instead of bits.

    Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Tejun Heo

    Dennis Zhou (Facebook)
     

26 Jul, 2017

1 commit

  • This patch increases the minimum allocation size of percpu memory to
    4-bytes. This change will help minimize the metadata overhead
    associated with the bitmap allocator. The assumption is that most
    allocations will be of objects or structs greater than 2 bytes with
    integers or longs being used rather than shorts.

    The first chunk regions are now aligned with the minimum allocation
    size. The reserved region is expected to be set as a multiple of the
    minimum allocation size. The static region is aligned up and the delta
    is removed from the dynamic size. This works because the dynamic size is
    increased to be page aligned. If the static size is not minimum
    allocation size aligned, then there must be a gap that is added to the
    dynamic size. The dynamic size will never be smaller than the set value.
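
    The static/dynamic adjustment above amounts to the following
    arithmetic. This is a sketch under stated assumptions:
    DEMO_MIN_ALLOC_SIZE, struct first_chunk_sizes and align_first_chunk()
    are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the 4-byte minimum allocation size. */
#define DEMO_MIN_ALLOC_SIZE 4

struct first_chunk_sizes {
    size_t static_size;
    size_t dyn_size;
};

/*
 * Align the static region up to the minimum allocation size and take the
 * added padding out of the dynamic region, so the combined size of the
 * two regions is unchanged.
 */
static struct first_chunk_sizes align_first_chunk(size_t static_size,
                                                  size_t dyn_size)
{
    struct first_chunk_sizes out;
    size_t aligned = (static_size + DEMO_MIN_ALLOC_SIZE - 1) /
                     DEMO_MIN_ALLOC_SIZE * DEMO_MIN_ALLOC_SIZE;
    size_t delta = aligned - static_size;

    /* The padding is always smaller than the minimum allocation size. */
    assert(delta < DEMO_MIN_ALLOC_SIZE);
    assert(delta <= dyn_size);

    out.static_size = aligned;
    out.dyn_size = dyn_size - delta;
    return out;
}
```

    For example, a 1022-byte static region with a 2050-byte dynamic region
    becomes 1024 and 2048 bytes: the two bytes of padding move from one
    region to the other.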

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Tejun Heo

    Dennis Zhou (Facebook)
     

16 Mar, 2017

1 commit

  • If a PER_CPU struct which contains a spin_lock is statically initialized
    via:

    DEFINE_PER_CPU(struct foo, bla) = {
    .lock = __SPIN_LOCK_UNLOCKED(bla.lock)
    };

    then lockdep assigns a separate key to each lock because the logic for
    assigning a key to statically initialized locks is to use the address as
    the key. With per CPU locks the address is obviously different on each CPU.

    That's wrong, because all locks should have the same key.

    To solve this the following modifications are required:

    1) Extend the is_kernel/module_percpu_addr() functions to hand back the
    canonical address of the per CPU address, i.e. the per CPU address
    minus the per CPU offset.

    2) Check the lock address with these functions and if the per CPU check
    matches use the returned canonical address as the lock key, so all per
    CPU locks have the same key.

    3) Move the static_obj(key) check into look_up_lock_class() so this check
    can be avoided for statically initialized per CPU locks. That's
    required because the canonical address fails the static_obj(key) check
    for obvious reasons.
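
    The idea behind steps 1) and 2) can be shown with a userspace sketch.
    It models the per CPU copies as rows of an array; percpu_area,
    DEMO_NR_CPUS, DEMO_UNIT_SIZE and percpu_canonical() are hypothetical
    names and do not reflect the real is_kernel/module_percpu_addr()
    implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_CPUS   4
#define DEMO_UNIT_SIZE 64

/* One copy of the "per CPU section" per CPU; row 0 is the canonical copy. */
static unsigned char percpu_area[DEMO_NR_CPUS][DEMO_UNIT_SIZE];

/*
 * If addr lies inside some CPU's copy, store the canonical address (the
 * address minus that CPU's offset from the first copy) and return true.
 */
static bool percpu_canonical(const void *addr, uintptr_t *can_addr)
{
    uintptr_t a = (uintptr_t)addr;
    int cpu;

    for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
        uintptr_t base = (uintptr_t)percpu_area[cpu];

        if (a >= base && a < base + DEMO_UNIT_SIZE) {
            uintptr_t offset = base - (uintptr_t)percpu_area[0];

            *can_addr = a - offset;
            return true;
        }
    }
    return false;
}
```

    Using the returned value as the key means the same lock in every CPU's
    copy maps to one key, while addresses outside the per CPU area are
    left to the normal key logic.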

    Reported-by: Mike Galbraith
    Signed-off-by: Thomas Gleixner
    [ Merged Dan's fixups for !MODULES and !SMP into this patch. ]
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Dan Murphy
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20170227143736.pectaimkjkan5kow@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

21 May, 2016

1 commit

  • printk() takes some locks and cannot be used in a safe way in NMI
    context.

    The chance of a deadlock is real especially when printing stacks from
    all CPUs. This particular problem has been addressed on x86 by the
    commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all
    CPUs").

    The patchset brings two big advantages. First, it makes the NMI
    backtraces safe on all architectures for free. Second, it makes all NMI
    messages almost safe on all architectures (the temporary buffer is
    limited. We still should keep the number of messages in NMI context at
    minimum).

    Note that there already are several messages printed in NMI context:
    WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
    handlers. These are not easy to avoid.

    This patch reuses most of the code and makes it generic. It is useful
    for all messages and architectures that support NMI.

    The alternative printk_func is set when entering and is reset when
    leaving NMI context. It queues IRQ work to copy the messages into the
    main ring buffer in a safe context.

    __printk_nmi_flush() copies all available messages and resets the buffer.
    We can then use simple cmpxchg operations to get synchronized with
    writers. A spinlock is also used to get synchronized with other
    flushers.
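
    The cmpxchg synchronization between writers and the flusher can be
    sketched in userspace with C11 atomics. This is an illustration under
    assumptions, not the code in printk/nmi.c: struct demo_buf,
    demo_write() and demo_flush() are hypothetical names, and the sketch
    models neither real NMI nesting nor the spinlock between flushers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE 64

struct demo_buf {
    atomic_size_t len;          /* number of valid bytes in data */
    char data[DEMO_BUF_SIZE];
};

/*
 * Writer: copy the message behind the current end and publish it by
 * moving len with a compare-and-exchange. If len changed in the meantime
 * (for example the flusher reset it), retry from the new position.
 * Returns the number of bytes stored; the message is truncated when the
 * buffer is full.
 */
static size_t demo_write(struct demo_buf *b, const char *msg, size_t n)
{
    for (;;) {
        size_t old = atomic_load(&b->len);
        size_t room = DEMO_BUF_SIZE - old;
        size_t add = n < room ? n : room;

        memcpy(b->data + old, msg, add);
        if (atomic_compare_exchange_strong(&b->len, &old, old + add))
            return add;
    }
}

/*
 * Flusher: copy out everything seen so far, then try to reset len to 0.
 * If a writer appended in between, the exchange fails and the copy is
 * repeated with the new length.
 */
static size_t demo_flush(struct demo_buf *b, char *out, size_t out_size)
{
    for (;;) {
        size_t len = atomic_load(&b->len);

        assert(len <= out_size);
        memcpy(out, b->data, len);
        if (atomic_compare_exchange_strong(&b->len, &len, 0))
            return len;
    }
}
```

    Neither side takes a lock that the other could hold: each one simply
    retries when it notices that the length moved underneath it.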

    We no longer use seq_buf because it depends on an external lock. It
    would be hard to make all supported operations safe for a lockless use.
    It would be confusing and error prone to make only some operations safe.

    The code is put into separate printk/nmi.c as suggested by Steven
    Rostedt. It needs a per-CPU buffer and is compiled only on
    architectures that call nmi_enter(). This is achieved by the new
    HAVE_NMI Kconfig flag.

    The exceptions are the MN10300 and Xtensa architectures. We need to
    clean up NMI handling there first. Let's do it separately.

    The patch is heavily based on the draft from Peter Zijlstra, see

    https://lkml.org/lkml/2015/6/10/327

    [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
    [akpm@linux-foundation.org: min_t->min - all types are size_t here]
    Signed-off-by: Petr Mladek
    Suggested-by: Peter Zijlstra
    Suggested-by: Steven Rostedt
    Cc: Jan Kara
    Acked-by: Russell King [arm part]
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

16 Nov, 2015

1 commit


22 Nov, 2014

1 commit

  • To avoid include hell, the per_cpu variable printk_func was declared
    in percpu.h. But it is only defined if printk is defined.

    As users of printk may also use the printk_func variable, it needs to
    be defined even if CONFIG_PRINTK is not.

    Also add a printk.h include in percpu.h just to be safe.

    Link: http://lkml.kernel.org/r/20141121183215.01ba539c@canb.auug.org.au

    Reported-by: Stephen Rothwell
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

20 Nov, 2014

1 commit

  • Being able to divert printk to call another function besides the normal
    logging is useful for such things as NMI handling. If functions that
    call printk() are to be called from NMI, it is possible to lock up
    the box if the NMI handler triggers while another printk is happening.

    One example of this use is to perform a stack trace on all CPUs via NMI.
    But if the NMI is to do the printk() it can cause the system to lock up.
    By allowing the printk to be diverted to another function that can safely
    record the printk output and then print it when in a safe context,
    NMIs will be safe to call functions like show_regs().
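
    The diversion mechanism can be sketched in userspace as a function
    pointer that selects where formatted output goes. All names here
    (demo_printk(), print_func, log_default(), log_deferred(),
    flush_deferred()) are hypothetical, and a single global pointer stands
    in for the per CPU variable used by the kernel.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

typedef int (*print_func_t)(const char *fmt, va_list args);

static char main_log[256];      /* stands in for the normal log */
static char deferred_buf[256];  /* records output from unsafe context */

/* Append formatted text to a NUL-terminated buffer of the given size. */
static int append_vfmt(char *buf, size_t size, const char *fmt, va_list args)
{
    size_t len = strlen(buf);

    return vsnprintf(buf + len, size - len, fmt, args);
}

static int log_default(const char *fmt, va_list args)
{
    return append_vfmt(main_log, sizeof(main_log), fmt, args);
}

static int log_deferred(const char *fmt, va_list args)
{
    return append_vfmt(deferred_buf, sizeof(deferred_buf), fmt, args);
}

/* The current output function; swapped around the unsafe section. */
static print_func_t print_func = log_default;

static int demo_printk(const char *fmt, ...)
{
    va_list args;
    int ret;

    va_start(args, fmt);
    ret = print_func(fmt, args);
    va_end(args);
    return ret;
}

/* Called from a safe context: move recorded output into the main log. */
static void flush_deferred(void)
{
    size_t len = strlen(main_log);

    snprintf(main_log + len, sizeof(main_log) - len, "%s", deferred_buf);
    deferred_buf[0] = '\0';
}
```

    A caller would set print_func to log_deferred on entry to the unsafe
    section, restore log_default on exit, and call flush_deferred() once it
    is safe to touch the main log.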

    Link: http://lkml.kernel.org/p/20140619213952.209176403@goodmis.org

    Tested-by: Jiri Kosina
    Acked-by: Jiri Kosina
    Acked-by: Paul E. McKenney
    Reviewed-by: Petr Mladek
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

03 Sep, 2014

2 commits

  • The percpu allocator now supports atomic allocations by only
    allocating from already populated areas but the mechanism to ensure
    that there's adequate amount of populated areas was missing.

    This patch expands pcpu_balance_work so that in addition to freeing
    excess free chunks it also populates chunks to maintain an adequate
    level of populated areas. pcpu_alloc() schedules pcpu_balance_work if
    the amount of free populated areas is too low or after an atomic
    allocation failure.

    * PERCPU_DYNAMIC_RESERVE is increased by two pages to account for
    PCPU_EMPTY_POP_PAGES_LOW.

    * pcpu_async_enabled is added to gate both async jobs -
    chunk->map_extend_work and pcpu_balance_work - so that we don't end
    up scheduling them while the needed subsystems aren't up yet.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Now that pcpu_alloc_area() can allocate only from populated areas,
    it's easy to add atomic allocation support to [__]alloc_percpu().
    Update pcpu_alloc() so that it accepts @gfp and skips all the blocking
    operations and allocates only from the populated areas if @gfp doesn't
    contain GFP_KERNEL. New interface functions [__]alloc_percpu_gfp()
    are added.

    While this means that atomic allocations are possible, this isn't
    complete yet as there's no mechanism to ensure that certain amount of
    populated areas is kept available and atomic allocations may keep
    failing under certain conditions.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

18 Jun, 2014

4 commits

  • We're in the process of moving all percpu accessors and operations to
    include/linux/percpu-defs.h so that they're available to arch headers
    without having to include full include/linux/percpu.h which may cause
    cyclic inclusion dependency.

    This patch moves {raw|this}_cpu_*() definitions from
    include/linux/percpu.h to include/linux/percpu-defs.h. The code is
    moved mostly verbatim; however, raw_cpu_*() are placed above
    this_cpu_*() which is more conventional as the raw operations may be
    used to define other variants.

    This is pure reorganization.

    Signed-off-by: Tejun Heo
    Acked-by: Christoph Lameter

    Tejun Heo
     
  • {raw|this}_cpu_*_N() operations are expected to be provided by archs
    and the generic definitions are provided as fallbacks. As such, these
    firmly belong to include/asm-generic/percpu.h.

    Move the generic definitions to include/asm-generic/percpu.h. The
    code is moved mostly verbatim; however, raw_cpu_*_N() are placed above
    this_cpu_*_N() which is more conventional as the raw operations may be
    used to define other variants.

    This is pure reorganization.

    Signed-off-by: Tejun Heo
    Acked-by: Christoph Lameter

    Tejun Heo
     
  • Currently, percpu allows two separate methods for overriding
    {raw|this}_cpu_*() ops - for a given operation, an arch can provide
    whole replacement or sized sub operations to override specific parts
    of it. e.g. arch either can provide this_cpu_add() or
    this_cpu_add_4() to override only the 4 byte operation.
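
    The sized sub-operation scheme can be illustrated with a userspace
    sketch. The demo_add*() macros and the used_arch_override counter are
    hypothetical; the sketch only shows how an operation dispatches on
    operand size and how an override of a single size is picked up while
    the other sizes fall back to generic definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Counts how often the size-specific override below was taken. */
static int used_arch_override;

/* "Arch" override: only the 4 byte operation is replaced. */
#define demo_add_4(var, val) (used_arch_override++, (var) += (val))

/* Generic fallbacks, used for every size the arch did not override. */
#ifndef demo_add_1
#define demo_add_1(var, val) ((var) += (val))
#endif
#ifndef demo_add_2
#define demo_add_2(var, val) ((var) += (val))
#endif
#ifndef demo_add_4
#define demo_add_4(var, val) ((var) += (val))
#endif
#ifndef demo_add_8
#define demo_add_8(var, val) ((var) += (val))
#endif

/* The operation itself only dispatches on the size of the operand. */
#define demo_add(var, val)                      \
do {                                            \
    switch (sizeof(var)) {                      \
    case 1: demo_add_1(var, val); break;        \
    case 2: demo_add_2(var, val); break;        \
    case 4: demo_add_4(var, val); break;        \
    case 8: demo_add_8(var, val); break;        \
    default: assert(0);                         \
    }                                           \
} while (0)
```

    With only sized overrides, the dispatching macro stays in generic code
    and every size is still free to be replaced individually.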

    While quite flexible at a glance, the dual-overriding scheme
    complicates the code path for no actual gain. It complicates the
    already complex operation definitions and if an arch wants to override
    all sizes, it can easily provide all variants anyway. In fact, no
    arch is actually making use of whole operation override.

    Another oddity is that __this_cpu_*() operations are defined in the
    same way as raw_cpu_*() but ignore full overrides of the raw_cpu_*()
    and don't allow full operation override, so if an arch provides
    whole overrides for raw_cpu_*() operations, __this_cpu_*() end up
    using the generic implementations.

    More importantly, it takes away the layering between arch-specific and
    generic parts making it impossible for the generic part to implement
    arch-independent features on top of arch-specific overrides.

    This patch removes the support for whole operation overrides. As no
    arch is using it, this doesn't cause any actual difference.

    Signed-off-by: Tejun Heo
    Acked-by: Christoph Lameter

    Tejun Heo
     
  • include/linux/percpu-defs.h is going to host all accessors and operations
    so that arch headers can make use of them too without worrying about
    circular dependency through include/linux/percpu.h.

    This patch moves the following accessors from include/linux/percpu.h
    to include/linux/percpu-defs.h.

    * get/put_cpu_var()
    * get/put_cpu_ptr()
    * per_cpu_ptr()

    This is pure reorganization.

    Signed-off-by: Tejun Heo
    Acked-by: Christoph Lameter

    Tejun Heo
     

10 Jun, 2014

1 commit

  • Pull percpu updates from Tejun Heo:
    "Nothing too exciting. percpu_ref is going through some interface
    changes and getting new features with more changes in the pipeline but
    given its young age and few users, it's very low impact"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu-refcount: implement percpu_ref_tryget()
    percpu-refcount: rename percpu_ref_tryget() to percpu_ref_tryget_live()
    percpu: Replace __get_cpu_var with this_cpu_ptr

    Linus Torvalds
     

15 May, 2014

1 commit

  • The definition for raw_cpu_add_return() uses the operation prefix
    "raw_add_return_", but the definitions in the various percpu.h files
    expect "raw_cpu_add_return_". This commit therefore appropriately
    adjusts the definition of raw_cpu_add_return().

    Signed-off-by: Paul E. McKenney
    Acked-by: Christoph Lameter
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

16 Apr, 2014

1 commit

  • __this_cpu_ptr is being phased out. Use raw_cpu_ptr instead which was
    introduced in 3.15-rc1. One case of using __get_cpu_var in the
    get_cpu_var macro for address calculation was remaining in
    include/linux/percpu.h.

    tj: Updated patch description.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

08 Apr, 2014

2 commits

  • We define a check function in order to avoid trouble with the include
    files. Then the higher level __this_cpu macros are modified to invoke
    the preemption check.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Lameter
    Acked-by: Ingo Molnar
    Cc: Tejun Heo
    Tested-by: Grygorii Strashko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The kernel has never been audited to ensure that this_cpu operations are
    consistently used throughout the kernel. The code generated in many
    places can be improved through the use of this_cpu operations (which
    uses a segment register for relocation of per cpu offsets instead of
    performing address calculations).

    The patch set also addresses various consistency issues in general with
    the per cpu macros.

    A. The semantics of __this_cpu_ptr() differs from this_cpu_ptr only
    because checks are skipped. This is typically shown through a raw_
    prefix. So this patch set changes the places where __this_cpu_ptr()
    is used to raw_cpu_ptr().

    B. There has been the long term wish by some that __this_cpu operations
    would check for preemption. However, there are cases where preemption
    checks need to be skipped. This patch set adds raw_cpu operations that
    do not check for preemption and then adds preemption checks to the
    __this_cpu operations.

    C. The use of __get_cpu_var is always a reference to a percpu variable
    that can also be handled via a this_cpu operation. This patch set
    replaces all uses of __get_cpu_var with this_cpu operations.

    D. We can then use this_cpu RMW operations in various places replacing
    sequences of instructions by a single one.

    E. The use of this_cpu operations throughout will allow other arches than
    x86 to implement optimized references and RMW operations to work with
    per cpu local data.

    F. The use of this_cpu operations opens up the possibility to
    further optimize code that relies on synchronization through
    per cpu data.

    The patch set works in a couple of stages:

    I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
    Also converts the existing __this_cpu_xx_# primitive in the x86
    code to raw_cpu_xx_#.

    II. Patches 2-4 use the raw_cpu operations in places that would give
    us false positives once they are enabled.

    III. Patch 5 adds preemption checks to __this_cpu operations to allow
    checking if preemption is properly disabled when these functions
    are used.

    IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
    with this_cpu_ptr. They do not depend on any changes to the percpu
    code. No preemption tests are skipped if they are applied.

    V. Patches 21-46 are conversion patches that use this_cpu operations
    in various kernel subsystems/drivers or arch code.

    VI. Patches 47/48 (not included in this series) remove no longer used
    functions (__this_cpu_ptr and __get_cpu_var). These should only be
    applied after all the conversion patches have made it and after we
    have done additional passes through the kernel to ensure that none of
    the uses of these functions remain.

    This patch (of 46):

    The patches following this one will add preemption checks to __this_cpu
    ops so we need to have an alternative way to use this_cpu operations
    without preemption checks.

    raw_cpu_ops will be the basis for all other ops since these will be the
    operations that do not implement any checks.

    Primitive operations are renamed by this patch from __this_cpu_xxx to
    raw_cpu_xxx.

    Also change the uses of the x86 percpu primitives in preempt.h.
    These depend directly on asm/percpu.h (header #include nesting issue).

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Christoph Lameter
    Acked-by: Ingo Molnar
    Cc: Tejun Heo
    Cc: "James E.J. Bottomley"
    Cc: "Paul E. McKenney"
    Cc: Alex Shi
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Bryan Wu
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: Daniel Lezcano
    Cc: David Daney
    Cc: David Miller
    Cc: David S. Miller
    Cc: Dimitri Sivanich
    Cc: Dipankar Sarma
    Cc: Eric Dumazet
    Cc: Fenghua Yu
    Cc: Frederic Weisbecker
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Hedi Berriche
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: James Hogan
    Cc: Jens Axboe
    Cc: John Stultz
    Cc: Martin Schwidefsky
    Cc: Masami Hiramatsu
    Cc: Matt Turner
    Cc: Mike Frysinger
    Cc: Mike Travis
    Cc: Neil Brown
    Cc: Nicolas Pitre
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Cc: Rafael J. Wysocki
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Robert Richter
    Cc: Russell King
    Cc: Russell King
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Will Deacon
    Cc: Wim Van Sebroeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

24 Jan, 2014

1 commit

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    I've recently noticed based on the requests to add a small piece of code
    that dumps the page to various VM_BUG_ON sites that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

13 Nov, 2013

1 commit

  • Pull percpu changes from Tejun Heo:
    "Two smallish changes for percpu. Two patches to remove unused
    this_cpu_xor() and one to fix a bug in percpu init failure path so
    that it can reach the proper BUG() instead of oopsing earlier"

    * 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    x86: remove this_cpu_xor() implementation
    percpu: remove this_cpu_xor() implementation
    percpu: fix bootmem error handling in pcpu_page_first_chunk()

    Linus Torvalds
     

31 Oct, 2013

1 commit

  • this_cpu_sub() is implemented as negation and addition.

    This patch casts the adjustment to the counter type before negation to
    sign extend the adjustment. This helps in cases where the counter type
    is wider than an unsigned adjustment. An alternative to this patch is
    to declare such operations unsupported, but it seemed useful to avoid
    surprises.

    This patch specifically helps the following example:
    unsigned int delta = 1;
    preempt_disable();
    this_cpu_write(long_counter, 0);
    this_cpu_sub(long_counter, delta);
    preempt_enable();

    Before this change long_counter on a 64 bit machine ends with value
    0xffffffff, rather than 0xffffffffffffffff. This is because
    this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
    which is basically:
    long_counter = 0 + 0xffffffff
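
    The effect of the cast can be reproduced in plain C. This is a
    userspace sketch: fixed-width types stand in for a long on a 64 bit
    machine and for unsigned int, and the two helper functions are
    hypothetical names that show the arithmetic only, not the percpu
    macros.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Subtract by adding the negated delta without a cast. The negation is
 * done in 32 bit unsigned arithmetic, so for delta == 1 the value added
 * is 0xffffffff, which is then zero extended to 64 bits.
 */
static int64_t sub_without_cast(int64_t counter, uint32_t delta)
{
    return counter + -delta;
}

/*
 * Cast the delta to the counter type before negating it, so for
 * delta == 1 the value added is -1 in 64 bits.
 */
static int64_t sub_with_cast(int64_t counter, uint32_t delta)
{
    return counter + -(int64_t)delta;
}
```

    Starting from 0 with a delta of 1, the first helper yields 0xffffffff
    and the second yields -1, which is the difference described above.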

    Also apply the same cast to:
    __this_cpu_sub()
    __this_cpu_sub_return()
    this_cpu_sub_return()

    All percpu_test.ko passes, especially the following cases which
    previously failed:

    l -= ui_one;
    __this_cpu_sub(long_counter, ui_one);
    CHECK(l, long_counter, -1);

    l -= ui_one;
    this_cpu_sub(long_counter, ui_one);
    CHECK(l, long_counter, -1);
    CHECK(l, long_counter, 0xffffffffffffffff);

    ul -= ui_one;
    __this_cpu_sub(ulong_counter, ui_one);
    CHECK(ul, ulong_counter, -1);
    CHECK(ul, ulong_counter, 0xffffffffffffffff);

    ul = this_cpu_sub_return(ulong_counter, ui_one);
    CHECK(ul, ulong_counter, 2);

    ul = __this_cpu_sub_return(ulong_counter, ui_one);
    CHECK(ul, ulong_counter, 1);

    Signed-off-by: Greg Thelen
    Acked-by: Tejun Heo
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

27 Oct, 2013

1 commit

  • There is not a single user in the whole kernel.
    Besides that this_cpu_xor() is broken anyway since it gets
    translated to this_cpu_or() (see __pcpu_size_call() line).

    So instead of fixing an unused definition just remove it.

    Signed-off-by: Heiko Carstens
    Acked-by: Ingo Molnar
    Signed-off-by: Tejun Heo

    Heiko Carstens
     

06 Oct, 2012

1 commit


15 May, 2012

1 commit

  • Remove the percpu_xxx series of functions; all of them have been
    replaced by the this_cpu_xxx or __this_cpu_xxx series of functions.

    Signed-off-by: Alex Shi
    Acked-by: Christoph Lameter
    Acked-by: Tejun Heo
    Acked-by: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Tejun Heo

    Alex Shi
     

05 Mar, 2012

1 commit


22 Feb, 2012

2 commits

  • It doesn't make sense to trace irq off or do irq flags
    lock proving inside 'this_cpu' operations, so replace local_irq_*
    with raw_local_irq_* in 'this_cpu' op.

    The replacement also fixes a lockdep warning[1]; see below.

    In commit 933393f58fef9963eac61db8093689544e29a600 (percpu:
    Remove irqsafe_cpu_xxx variants), local_irq_save/restore(flags) were
    added inside the this_cpu_inc operation, so trace_hardirqs_off_caller
    is called directly by trace_hardirqs_on_caller because
    __debug_atomic_inc is implemented as this_cpu_inc. This may trigger
    the lockdep warning[1], for example in the following ARM scenario:

    kernel_thread_helper /*irq disabled*/
    ->trace_hardirqs_on_caller /*hardirqs_enabled was set*/
    ->trace_hardirqs_off_caller /*hardirqs_enabled cleared*/
    __this_cpu_add(redundant_hardirqs_on)
    ->trace_hardirqs_off_caller /*irq disabled, so call here*/

    The 'unannotated irqs-on' warning will be triggered somewhere because
    irq is just enabled after the irq trace in kernel_thread_helper.

    [1],
    [ 0.162841] ------------[ cut here ]------------
    [ 0.167694] WARNING: at kernel/lockdep.c:3493 check_flags+0xc0/0x1d0()
    [ 0.174468] Modules linked in:
    [ 0.177703] Backtrace:
    [ 0.180328] [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
    [ 0.189086] r6:c051f778 r5:00000da5 r4:00000000 r3:60000093
    [ 0.195007] [] (dump_stack+0x0/0x1c) from [] (warn_slowpath_common+0x54/0x6c)
    [ 0.204223] [] (warn_slowpath_common+0x0/0x6c) from [] (warn_slowpath_null+0x24/0x2c)
    [ 0.214111] r8:00000000 r7:00000000 r6:ee069598 r5:60000013 r4:ee082000
    [ 0.220825] r3:00000009
    [ 0.223693] [] (warn_slowpath_null+0x0/0x2c) from [] (check_flags+0xc0/0x1d0)
    [ 0.232910] [] (check_flags+0x0/0x1d0) from [] (lock_acquire+0x4c/0x11c)
    [ 0.241668] [] (lock_acquire+0x0/0x11c) from [] (_raw_spin_lock+0x3c/0x74)
    [ 0.250610] [] (_raw_spin_lock+0x0/0x74) from [] (set_task_comm+0x20/0xc0)
    [ 0.259521] r6:ee069588 r5:ee0691c0 r4:ee082000
    [ 0.264404] [] (set_task_comm+0x0/0xc0) from [] (kthreadd+0x28/0x108)
    [ 0.272857] r8:00000000 r7:00000013 r6:c0044a08 r5:ee0691c0 r4:ee082000
    [ 0.279571] r3:ee083fe0
    [ 0.282470] [] (kthreadd+0x0/0x108) from [] (do_exit+0x0/0x6dc)
    [ 0.290405] r5:c0060758 r4:00000000
    [ 0.294189] ---[ end trace 1b75b31a2719ed1c ]---
    [ 0.299041] possible reason: unannotated irqs-on.
    [ 0.303955] irq event stamp: 5
    [ 0.307159] hardirqs last enabled at (4): [] no_work_pending+0x8/0x2c
    [ 0.314880] hardirqs last disabled at (5): [] trace_hardirqs_on_caller+0x60/0x26c
    [ 0.323547] softirqs last enabled at (0): [] copy_process+0x33c/0xef4
    [ 0.331207] softirqs last disabled at (0): [< (null)>] (null)
    [ 0.337585] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000

    Acked-by: Christoph Lameter
    Signed-off-by: Ming Lei
    Signed-off-by: Tejun Heo

    Ming Lei
     
  • This patch adds the missing "__" to the function prefix.
    Otherwise, on all architectures except x86, it expands to the irq/preemption-safe
    variant, _this_cpu_generic_add_return(), which does an extra irq-save/irq-restore.
    The optimal generic implementation is __this_cpu_generic_add_return().
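
The cost difference between the two variants can be illustrated with a small user-space model. This is only a sketch: `model_irq_save()`, `model_irq_restore()`, `raw_add_return()` and `safe_add_return()` are hypothetical names, and a plain flag plus a counter stand in for the real interrupt state.

```c
#include <stdbool.h>

/* Stand-ins for interrupt state; purely illustrative, not kernel API. */
static bool irqs_enabled = true;
static unsigned int irq_save_calls;

static bool model_irq_save(void)
{
    bool flags = irqs_enabled;

    irqs_enabled = false;
    irq_save_calls++;
    return flags;
}

static void model_irq_restore(bool flags)
{
    irqs_enabled = flags;
}

/* Raw variant: the caller already guarantees exclusion, so no extra work. */
static long raw_add_return(long *pcp, long val)
{
    *pcp += val;
    return *pcp;
}

/* Protected variant: wraps the raw operation in irq save/restore. */
static long safe_add_return(long *pcp, long val)
{
    bool flags = model_irq_save();
    long ret = raw_add_return(pcp, val);

    model_irq_restore(flags);
    return ret;
}
```

In this model, a caller that is already in a protected context and uses `safe_add_return()` pays for one redundant save/restore pair per call, which is the overhead the missing "__" prefix introduced.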

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Konstantin Khlebnikov
     

23 Dec, 2011

1 commit

  • We simply say that regular this_cpu use must be safe regardless of
    preemption and interrupt state. That has no material change for x86
    and s390 implementations of this_cpu operations. However, arches that
    do not provide their own implementation for this_cpu operations will
    now get code generated that disables interrupts instead of preemption.

    -tj: This is part of on-going percpu API cleanup. For detailed
    discussion of the subject, please refer to the following thread.

    http://thread.gmane.org/gmane.linux.kernel/1222078

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo
    LKML-Reference:

    Christoph Lameter
     

04 Jun, 2011

1 commit

  • On an architecture without CMPXCHG_LOCAL but with DEBUG_VM enabled,
    the VM_BUG_ON() in __pcpu_double_call_return_bool() will cause an early
    panic during boot unless we always align cpu_slab properly.

    In principle we could remove the alignment-testing VM_BUG_ON() for
    architectures that don't have CMPXCHG_LOCAL, but leaving it in means
    that new code will tend not to break x86 even if it is introduced
    on another platform, and it's low cost to require alignment.
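
A user-space sketch of such an alignment requirement follows. `struct word_pair` and `is_double_word_aligned()` are hypothetical names chosen for illustration; the check is only similar in spirit to the kernel's VM_BUG_ON() and assumes C11 `alignas`.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of adjacent words; the first member is forced onto a
 * double-word boundary so a double-word compare-and-exchange could use it. */
struct word_pair {
    alignas(2 * sizeof(void *)) void *first;
    uintptr_t second;
};

/* True when ptr sits on a double-word (2 * pointer size) boundary. */
static bool is_double_word_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % (2 * sizeof(void *))) == 0;
}
```

Requiring the alignment in the type itself, rather than only asserting it at run time, is what keeps the check cheap to leave enabled on every architecture.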

    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Signed-off-by: Chris Metcalf
    Signed-off-by: Pekka Enberg

    Chris Metcalf
     

05 May, 2011

1 commit

  • The SLUB allocator use of the cmpxchg_double logic was wrong: it
    actually needs the irq-safe one.

    That happens automatically when we use the native unlocked 'cmpxchg8b'
    instruction, but when compiling the kernel for older x86 CPUs that do
    not support that instruction, we fall back to the generic emulation
    code.

    And if you don't specify that you want the irq-safe version, the generic
    code ends up just open-coding the cmpxchg8b equivalent without any
    protection against interrupts or preemption. Which definitely doesn't
    work for SLUB.

    This was reported by Werner Landgraf, who saw
    instability with his distro-kernel that was compiled to support pretty
    much everything under the sun. Most big Linux distributions tend to
    compile for PPro and later, and would never have noticed this problem.

    This also fixes the prototypes for the irqsafe cmpxchg_double functions
    to use 'bool' like they should.

    [ Btw, that whole "generic code defaults to no protection" design just
    sounds stupid - if the code needs no protection, there is no reason to
    use "cmpxchg_double" to begin with. So we should probably just remove
    the unprotected version entirely as pointless. - Linus ]

    Signed-off-by: Thomas Gleixner
    Reported-and-tested-by: werner
    Acked-and-tested-by: Ingo Molnar
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Jens Axboe
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1105041539050.3005@ionos
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

28 Feb, 2011

1 commit

  • Introduce this_cpu_cmpxchg_double(). this_cpu_cmpxchg_double() allows
    the comparison between two consecutive words and replaces them if
    there is a match.

    bool this_cpu_cmpxchg_double(pcp1, pcp2,
    old_word1, old_word2, new_word1, new_word2)

    this_cpu_cmpxchg_double does not return the old value (difficult since
    there are two words) but a boolean indicating if the operation was
    successful.

    The first percpu variable must be double word aligned!
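
The semantics described above can be modelled in plain C. The sketch below is a hypothetical user-space helper, `model_cmpxchg_double()`; it is neither atomic nor per-cpu and only shows the compare-both, replace-both, return-bool contract.

```c
#include <stdbool.h>

/* Hypothetical model: succeeds only if BOTH words match their expected
 * old values, in which case both are replaced. Not atomic. */
static bool model_cmpxchg_double(unsigned long *w1, unsigned long *w2,
                                 unsigned long old1, unsigned long old2,
                                 unsigned long new1, unsigned long new2)
{
    if (*w1 != old1 || *w2 != old2)
        return false;       /* mismatch: leave both words untouched */

    *w1 = new1;
    *w2 = new2;
    return true;
}
```

Because only a boolean comes back, a caller that loses the race has to re-read both words itself before retrying.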

    -tj: Updated to return bool instead of int, converted size check to
    BUILD_BUG_ON() instead of VM_BUG_ON() and other cosmetic changes.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

18 Dec, 2010

1 commit

  • Generic code to provide new per cpu atomic features

    this_cpu_cmpxchg
    this_cpu_xchg

    The fallback uses functions that disable and re-enable interrupts
    to ensure correct per cpu atomicity.

    Fallback to regular cmpxchg and xchg is not possible, since per cpu atomic
    semantics include the guarantee that the current cpu's per cpu data is
    accessed atomically. Regular cmpxchg and xchg require the address of the
    per cpu data to be determined first, and that address calculation
    cannot be folded atomically into an xchg or cmpxchg without a segment
    override.
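
As a rough illustration of the interrupt-based fallback, the following user-space model brackets each operation with stand-ins for interrupt disable/enable. All names (`model_irq_off()`, `model_irq_on()`, `model_cmpxchg()`, `model_xchg()`) are hypothetical, and a depth counter replaces real interrupt state.

```c
/* Hypothetical stand-ins for local interrupt disable/enable. */
static int irq_disable_depth;

static void model_irq_off(void) { irq_disable_depth++; }
static void model_irq_on(void)  { irq_disable_depth--; }

/* Store new_val only if the current value equals old; return the prior value. */
static long model_cmpxchg(long *pcp, long old, long new_val)
{
    long prev;

    model_irq_off();
    prev = *pcp;
    if (prev == old)
        *pcp = new_val;
    model_irq_on();
    return prev;
}

/* Unconditionally store new_val; return the prior value. */
static long model_xchg(long *pcp, long new_val)
{
    long prev;

    model_irq_off();
    prev = *pcp;
    *pcp = new_val;
    model_irq_on();
    return prev;
}
```

In both helpers the read and the write happen inside the same protected region, which is the property the fallback relies on when no single instruction can do the job.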

    tj: - Relocated new ops to conform better to the general organization.
    - This patch contains a trivial comment fix.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

17 Dec, 2010

2 commits

  • - include/linux/percpu.h: this_cpu_add_return() and friends were
    located next to __this_cpu_add_return(). However, the overall
    organization is to first group by preemption safeness. Relocate
    this_cpu_add_return() and friends to preemption-safe area.

    - arch/x86/include/asm/percpu.h: Relocate percpu_add_return_op() after
    other more basic operations. Relocate [__]this_cpu_add_return_8()
    so that they're first grouped by preemption safeness.

    Signed-off-by: Tejun Heo
    Cc: Christoph Lameter

    Tejun Heo
     
  • Introduce generic support for this_cpu_add_return etc.

    The fallback is to realize these operations with simpler __this_cpu_ops.

    tj: - Reformatted __pcpu_size_call_return2() to make it more consistent
    with its neighbors.
    - Dropped unnecessary temp variable ret__ from
    __this_cpu_generic_add_return().

    Reviewed-by: Tejun Heo
    Reviewed-by: Mathieu Desnoyers
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

23 Oct, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: update comments to reflect that percpu allocations are always zero-filled
    percpu: Optimize __get_cpu_var()
    x86, percpu: Optimize this_cpu_ptr
    percpu: clear memory allocated with the km allocator
    percpu: fix build breakage on s390 and cleanup build configuration tests
    percpu: use percpu allocator on UP too
    percpu: reduce PCPU_MIN_UNIT_SIZE to 32k
    vmalloc: pcpu_get/free_vm_areas() aren't needed on UP

    Fixed up trivial conflicts in include/linux/percpu.h

    Linus Torvalds
     

21 Sep, 2010

1 commit


08 Sep, 2010

1 commit

  • On UP, percpu allocations were redirected to kmalloc. This has the
    following problems.

    * For a certain amount of allocations (determined by
    PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), the percpu
    allocator can be used before the usual kernel memory allocator is
    brought online. On SMP, this is used to initialize the kernel
    memory allocator.

    * The percpu allocator honors alignment up to PAGE_SIZE but kmalloc()
    doesn't. For example, workqueue makes use of larger alignments for
    cpu_workqueues.

    Currently, users of percpu allocators need to handle UP differently,
    which is somewhat fragile and ugly. Other than a small amount of
    memory, there isn't much to lose by enabling the percpu allocator on UP.
    It can simply use kernel memory based chunk allocation which was added
    for SMP archs w/o MMUs.

    This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
    makes UP build use percpu-km. As percpu addresses and kernel
    addresses are always identity mapped and static percpu variables don't
    need any special treatment, nothing is arch dependent and mm/percpu.c
    implements generic setup_per_cpu_areas() for UP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Acked-by: Pekka Enberg

    Tejun Heo