22 Jul, 2015

1 commit

  • commit 8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2 upstream.

    Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the
    following INFO splat is logged:

    ===============================
    [ INFO: suspicious RCU usage. ]
    4.1.0-rc7-next-20150612 #1 Not tainted
    -------------------------------
    kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
    other info that might help us debug this:
    rcu_scheduler_active = 1, debug_locks = 0
    3 locks held by systemd/1:
    #0: (rtnl_mutex){+.+.+.}, at: [] rtnetlink_rcv+0x1f/0x40
    #1: (rcu_read_lock_bh){......}, at: [] ipv6_add_addr+0x62/0x540
    #2: (addrconf_hash_lock){+...+.}, at: [] ipv6_add_addr+0x184/0x540
    stack backtrace:
    CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
    Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014
    Call Trace:
    dump_stack+0x4c/0x6e
    lockdep_rcu_suspicious+0xe7/0x120
    ___might_sleep+0x1d5/0x1f0
    __might_sleep+0x4d/0x90
    kmem_cache_alloc+0x47/0x250
    create_object+0x39/0x2e0
    kmemleak_alloc_percpu+0x61/0xe0
    pcpu_alloc+0x370/0x630

    Additional backtrace lines are truncated. In addition, the above splat
    is followed by several "BUG: sleeping function called from invalid
    context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau,
    these are the clue to the fix. Routine kmemleak_alloc_percpu() always
    uses GFP_KERNEL for its allocations, whereas it should follow the gfp
    from its callers.

    Reviewed-by: Catalin Marinas
    Reviewed-by: Kamalesh Babulal
    Acked-by: Martin KaFai Lau
    Signed-off-by: Larry Finger
    Cc: Martin KaFai Lau
    Cc: Catalin Marinas
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Larry Finger
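
    The shape of the fix is simply plumbing the caller's gfp down instead of
    hardcoding GFP_KERNEL. Below is a toy userspace sketch of that idea; the
    names and flag values are invented and this is not the kmemleak API:

    #include <stdio.h>
    #include <stdlib.h>

    #define F_CAN_SLEEP 0x1u   /* stand-in for a GFP_KERNEL-style context */
    #define F_ATOMIC    0x2u   /* stand-in for a GFP_ATOMIC/GFP_NOWAIT context */

    /* Before the fix: the tracker ignored the caller's flags and always
     * behaved as if sleeping were allowed. */
    static void *track_object_old(size_t size, unsigned int caller_flags)
    {
        (void)caller_flags;
        printf("old tracker: %zu bytes, always CAN_SLEEP (bug)\n", size);
        return malloc(size);
    }

    /* After the fix: the caller's flags are forwarded to the internal
     * metadata allocation. */
    static void *track_object_new(size_t size, unsigned int caller_flags)
    {
        printf("new tracker: %zu bytes, %s\n", size,
               (caller_flags & F_CAN_SLEEP) ? "CAN_SLEEP" : "ATOMIC");
        return malloc(size);
    }

    int main(void)
    {
        /* A caller in atomic context, e.g. under the BH read-side lock above. */
        free(track_object_old(64, F_ATOMIC));
        free(track_object_new(64, F_ATOMIC));
        return 0;
    }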
     

25 Mar, 2015

1 commit


14 Feb, 2015

1 commit

  • printk and friends can now format bitmaps using '%*pb[l]'. cpumask
    and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
    respectively which can be used to generate the two printf arguments
    necessary to format the specified cpu/nodemask.

    Signed-off-by: Tejun Heo
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
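
    A brief kernel-style usage sketch (not compiled here); foo_report_cpus()
    and the mask parameter are made up, while '%*pbl', cpumask_pr_args() and
    pr_info() are the interfaces described above:

    #include <linux/cpumask.h>
    #include <linux/printk.h>

    static void foo_report_cpus(const struct cpumask *mask)
    {
        /* '%*pb' prints the bitmap as hex, '%*pbl' as a ranged list,
         * e.g. "0-3,8"; cpumask_pr_args() expands to the two printf
         * arguments (width and bitmap pointer) the specifier needs. */
        pr_info("foo: cpus %*pbl\n", cpumask_pr_args(mask));
    }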
     

29 Oct, 2014

1 commit


09 Oct, 2014

1 commit

  • When @gfp is specified, the percpu allocator is interested in whether
    it contains all of GFP_KERNEL or not. If it does, the normal
    allocation path is taken; otherwise, the atomic allocation path.
    Unfortunately, pcpu_alloc() was incorrectly testing for whether @gfp
    contains any part of GFP_KERNEL.

    Fix it by testing "(gfp & GFP_KERNEL) != GFP_KERNEL" instead of
    "!(gfp & GFP_KERNEL)" to decide whether the allocation should be
    atomic or not.

    Signed-off-by: Tejun Heo

    Tejun Heo
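
    A standalone illustration of the two tests; the flag values below are
    made up for the demo and are not the kernel's real gfp bits, but
    GFP_NOIO sharing a bit with GFP_KERNEL mirrors the situation the fix
    addresses:

    #include <stdio.h>

    #define __F_RECLAIM 0x1u
    #define __F_IO      0x2u
    #define __F_FS      0x4u

    #define GFP_KERNEL (__F_RECLAIM | __F_IO | __F_FS)
    #define GFP_NOIO   (__F_RECLAIM)   /* may reclaim, but no IO/FS */

    static const char *path_old(unsigned int gfp)
    {
        return !(gfp & GFP_KERNEL) ? "atomic" : "normal";   /* any bit */
    }

    static const char *path_new(unsigned int gfp)
    {
        return (gfp & GFP_KERNEL) != GFP_KERNEL ? "atomic" : "normal"; /* all bits */
    }

    int main(void)
    {
        printf("GFP_NOIO:   old=%s new=%s\n", path_old(GFP_NOIO), path_new(GFP_NOIO));
        printf("GFP_KERNEL: old=%s new=%s\n", path_old(GFP_KERNEL), path_new(GFP_KERNEL));
        return 0;
    }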
     

22 Sep, 2014

1 commit

  • This reverts commit 3189eddbcafc ("percpu: free percpu allocation info for
    uniprocessor system").

    The commit causes a hang with a crisv32 image. This may be an architecture
    problem, but at least for now the revert is necessary to be able to boot a
    crisv32 image.

    Cc: Tejun Heo
    Cc: Honggang Li
    Signed-off-by: Guenter Roeck
    Signed-off-by: Tejun Heo
    Fixes: 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system")
    Cc: stable@vger.kernel.org # Please don't apply 3189eddbcafc

    Guenter Roeck
     

09 Sep, 2014

1 commit


03 Sep, 2014

10 commits

  • The percpu allocator now supports atomic allocations by only
    allocating from already populated areas, but the mechanism to ensure
    that there's an adequate amount of populated areas was missing.

    This patch expands pcpu_balance_work so that in addition to freeing
    excess free chunks it also populates chunks to maintain an adequate
    level of populated areas. pcpu_alloc() schedules pcpu_balance_work if
    the amount of free populated areas is too low or after an atomic
    allocation failure.

    * PERCPU_DYNAMIC_RESERVE is increased by two pages to account for
    PCPU_EMPTY_POP_PAGES_LOW.

    * pcpu_async_enabled is added to gate both async jobs -
    chunk->map_extend_work and pcpu_balance_work - so that we don't end
    up scheduling them while the needed subsystems aren't up yet.

    Signed-off-by: Tejun Heo

    Tejun Heo
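
    A toy, single-threaded model of the scheme; the watermark names and
    values are invented, and the deferred work item is simply called inline
    where a workqueue would run it:

    #include <stdio.h>

    #define EMPTY_POP_LOW  2
    #define EMPTY_POP_HIGH 4

    static int empty_pop_pages = EMPTY_POP_HIGH;
    static int balance_scheduled;

    static void schedule_balance(void) { balance_scheduled = 1; }

    static void balance_workfn(void)
    {
        while (empty_pop_pages < EMPTY_POP_HIGH)
            empty_pop_pages++;          /* "populate" another empty page */
        balance_scheduled = 0;
        printf("balance: repopulated up to %d empty pages\n", empty_pop_pages);
    }

    static void atomic_alloc(void)
    {
        if (empty_pop_pages > 0)
            empty_pop_pages--;          /* allocation consumed populated space */
        if (empty_pop_pages < EMPTY_POP_LOW)
            schedule_balance();         /* mirrors pcpu_alloc() scheduling work */
        printf("alloc: %d empty populated pages left\n", empty_pop_pages);
    }

    int main(void)
    {
        for (int i = 0; i < 5; i++) {
            atomic_alloc();
            if (balance_scheduled)
                balance_workfn();       /* the workqueue would run this later */
        }
        return 0;
    }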
     
  • pcpu_reclaim_work will also be used to populate chunks asynchronously.
    Rename it to pcpu_balance_work in preparation. pcpu_reclaim() is
    renamed to pcpu_balance_workfn() and some of its local variables are
    renamed too.

    This is pure rename.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • pcpu_nr_empty_pop_pages counts the number of empty populated pages
    across all chunks and chunk->nr_populated counts the number of
    populated pages in a chunk. Both will be used to implement pre/async
    population for atomic allocations.

    pcpu_chunk_[de]populated() are added to update chunk->populated,
    chunk->nr_populated and pcpu_nr_empty_pop_pages together. All
    successful chunk [de]populations should be followed by the
    corresponding pcpu_chunk_[de]populated() calls.

    Signed-off-by: Tejun Heo

    Tejun Heo
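
    A minimal standalone sketch of the bookkeeping idea, with heavily
    simplified stand-ins for the chunk state; in this toy chunk nothing is
    ever allocated, so every populated page counts as an empty populated
    page:

    #include <stdbool.h>
    #include <stdio.h>

    #define CHUNK_PAGES 8

    struct chunk {
        bool populated[CHUNK_PAGES]; /* stand-in for chunk->populated bitmap */
        int  nr_populated;           /* stand-in for chunk->nr_populated     */
    };

    static int nr_empty_pop_pages;   /* stand-in for pcpu_nr_empty_pop_pages */

    static void chunk_populated(struct chunk *c, int page)
    {
        if (!c->populated[page]) {
            c->populated[page] = true;
            c->nr_populated++;
            nr_empty_pop_pages++;    /* the three pieces of state move together */
        }
    }

    static void chunk_depopulated(struct chunk *c, int page)
    {
        if (c->populated[page]) {
            c->populated[page] = false;
            c->nr_populated--;
            nr_empty_pop_pages--;
        }
    }

    int main(void)
    {
        struct chunk c = { { false } };

        chunk_populated(&c, 0);
        chunk_populated(&c, 1);
        chunk_depopulated(&c, 0);
        printf("chunk pages populated: %d, global empty populated pages: %d\n",
               c.nr_populated, nr_empty_pop_pages);
        return 0;
    }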
     
  • An allocation attempt may require extending chunk->map array which
    requires GFP_KERNEL context which isn't available for atomic
    allocations. This patch ensures that chunk->map array usually keeps
    some amount of available space by directly allocating buffer space
    during GFP_KERNEL allocations and scheduling async extension during
    atomic ones. This should make atomic allocation failures from map
    space exhaustion rare.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Now that pcpu_alloc_area() can allocate only from populated areas,
    it's easy to add atomic allocation support to [__]alloc_percpu().
    Update pcpu_alloc() so that it accepts @gfp and skips all the blocking
    operations and allocates only from the populated areas if @gfp doesn't
    contain GFP_KERNEL. New interface functions [__]alloc_percpu_gfp()
    are added.

    While this means that atomic allocations are possible, this isn't
    complete yet as there's no mechanism to ensure that a certain amount
    of populated areas is kept available, and atomic allocations may keep
    failing under certain conditions.

    Signed-off-by: Tejun Heo

    Tejun Heo
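
    A brief kernel-style usage sketch (not compiled here) of the new
    interface; foo_stats and the functions are made up, while
    alloc_percpu_gfp(), free_percpu() and GFP_NOWAIT are real interfaces:

    #include <linux/percpu.h>
    #include <linux/gfp.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    struct foo_stats {
        u64 hits;
        u64 misses;
    };

    static struct foo_stats __percpu *stats;

    /* May be called from a context that cannot sleep, hence GFP_NOWAIT
     * instead of the implicit GFP_KERNEL of alloc_percpu(). */
    static int foo_init_stats_atomic(void)
    {
        stats = alloc_percpu_gfp(struct foo_stats, GFP_NOWAIT);
        if (!stats)
            return -ENOMEM;     /* an atomic allocation may legitimately fail */
        return 0;
    }

    static void foo_free_stats(void)
    {
        free_percpu(stats);
    }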
     
  • The next patch will conditionalize the population block in
    pcpu_alloc() which will end up making a rather large indentation
    change obfuscating the actual logic change. This patch puts the block
    under "if (true)" so that the next patch can avoid indentation
    changes. The definitions of the local variables which are used only in
    the block are moved into the block.

    This patch is purely cosmetic.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Update pcpu_alloc_area() so that it can skip unpopulated areas if the
    new parameter @pop_only is true. This is implemented by a new
    function, pcpu_fit_in_area(), which determines the amount of head
    padding considering the alignment and populated state.

    @pop_only is currently always false but this will be used to implement
    atomic allocation.

    Signed-off-by: Tejun Heo

    Tejun Heo
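
    A simplified standalone sketch of the fit test; the names, page layout
    and padding rule here are assumptions for illustration, not the actual
    pcpu_fit_in_area() algorithm:

    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_SZ 4096
    #define NPAGES  4

    static bool populated[NPAGES] = { true, false, false, false };

    static bool range_populated(int start, int end)
    {
        for (int pg = start / PAGE_SZ; pg <= (end - 1) / PAGE_SZ; pg++)
            if (!populated[pg])
                return false;
        return true;
    }

    /* Returns the padded start offset, or -1 if the area doesn't fit.
     * The sleeping path may populate pages later, so only @pop_only
     * callers care about the populated state. */
    static int fit_in_area(int off, int area_size, int size, int align, bool pop_only)
    {
        int head = (align - (off % align)) % align;  /* head padding for alignment */

        if (head + size > area_size)
            return -1;
        if (pop_only && !range_populated(off + head, off + head + size))
            return -1;
        return off + head;
    }

    int main(void)
    {
        /* a free area starting at offset 4000, 8000 bytes long */
        printf("sleeping alloc: start %d\n", fit_in_area(4000, 8000, 512, 64, false));
        printf("atomic alloc:   start %d\n", fit_in_area(4000, 8000, 512, 64, true));
        return 0;
    }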
     
  • At first, the percpu allocator required a sleepable context for both
    alloc and free paths and used pcpu_alloc_mutex to protect everything.
    Later, pcpu_lock was introduced to protect the index data structure so
    that the free path can be invoked from atomic contexts. The
    conversion only updated what's necessary and left most of the
    allocation path under pcpu_alloc_mutex.

    The percpu allocator is planned to add support for atomic allocation
    and this patch restructures locking so that the coverage of
    pcpu_alloc_mutex is further reduced.

    * pcpu_alloc() now grabs pcpu_alloc_mutex only while creating a new
    chunk and populating the allocated area. Everything else is now
    protected solely by pcpu_lock.

    After this change, multiple instances of pcpu_extend_area_map() may
    race but the function already implements sufficient synchronization
    using pcpu_lock.

    This also allows multiple allocators to arrive at new chunk
    creation. To avoid creating multiple empty chunks back-to-back, a
    new chunk is created iff there is no other empty chunk after
    grabbing pcpu_alloc_mutex.

    * pcpu_lock is now held while modifying chunk->populated bitmap.
    After this, all data structures are protected by pcpu_lock.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Previously, pcpu_[de]populate_chunk() were called with the range which
    may contain multiple target regions in it and
    pcpu_[de]populate_chunk() iterated over the regions. This has the
    benefit of batching up cache flushes for all the regions; however,
    we're planning to add more bookkeeping logic around [de]population to
    support atomic allocations and this delegation of iterations gets in
    the way.

    This patch moves the region iterations out of
    pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and
    pcpu_reclaim() - so that we can later add logic to track more states
    around them. This change may make cache and tlb flushes more frequent
    but multi-region [de]populations are rare anyway and if this actually
    becomes a problem, it's not difficult to factor out cache flushes as
    separate callbacks which are directly invoked from percpu.c.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • percpu-vm and percpu-km implement separate versions of
    pcpu_[de]populate_chunk(), and some parts which are or should be
    common currently live in the specific implementations. Make the
    following changes.

    * Allocated area clearing is moved from the pcpu_populate_chunk()
    implementations to pcpu_alloc(). This makes percpu-km's version a
    noop.

    * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved
    to their respective callers so that they are applied to percpu-km
    too. This doesn't make any meaningful difference as both functions
    are noops for percpu-km; however, it is more consistent and will
    help in implementing atomic allocation support.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

16 Aug, 2014

1 commit


19 Jun, 2014

1 commit


15 Apr, 2014

1 commit

  • pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
    BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long)

    It could hardly ever be bigger than PAGE_SIZE even on a large-scale
    machine, but for consistency with its counterpart pcpu_mem_zalloc(),
    use pcpu_mem_free() instead.

    Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use
    pcpu_mem_free() instead of kfree()") addressed this problem, but
    missed this one.

    tj: commit message updated

    Signed-off-by: Jianyu Zhan
    Signed-off-by: Tejun Heo
    Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online")
    Cc: stable@vger.kernel.org

    Jianyu Zhan
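
    A small standalone check of the size expression quoted above, using the
    usual BITS_TO_LONGS() definition; the struct and the pcpu_unit_pages
    value are stand-ins, not the kernel's:

    #include <stdio.h>
    #include <limits.h>

    #define BITS_PER_LONG    (CHAR_BIT * sizeof(long))
    #define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

    struct pcpu_chunk_like {        /* stand-in; the real struct is larger */
        void *data;
        int   free_size;
        unsigned long populated[];  /* flexible bitmap tail, as in the kernel */
    };

    int main(void)
    {
        size_t pcpu_unit_pages = 64;    /* assumed example value */
        size_t sz = sizeof(struct pcpu_chunk_like) +
                    BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long);

        printf("chunk struct size: %zu bytes (well below a 4096-byte page)\n", sz);
        return 0;
    }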
     

29 Mar, 2014

1 commit


18 Mar, 2014

1 commit

  • 723ad1d90b56 ("percpu: store offsets instead of lengths in ->map[]")
    updated the percpu area allocator to use the lowest bit, instead of the
    sign, to signify whether an area is occupied and forced the minimum
    alignment to 2; unfortunately, it forgot to force the allocation size
    to be even, causing malfunctions for the very rare odd-sized
    allocations.

    Always force the allocations to be even sized.

    tj: Wrote patch description.

    Original-patch-by: Al Viro
    Signed-off-by: Tejun Heo

    Al Viro
     

07 Mar, 2014

3 commits

  • If we know that first N areas are all in use, we can obviously skip
    them when searching for a free one. And that kind of hint is very
    easy to maintain.

    Signed-off-by: Al Viro
    Signed-off-by: Tejun Heo

    Al Viro
     
  • Current code keeps +-length for each area in chunk->map[]. It has
    several unpleasant consequences:
    * even if we know that the first 50 areas are all in use, allocation
    still needs to go through all those areas just to sum their sizes,
    just to get the offset of a free one.
    * freeing needs to find the array entry referring to the area in
    question; again, we need to sum the sizes until we reach the offset
    we are interested in. Note that offsets are monotonically increasing,
    so a simple binary search would do here.

    New data representation: an array of <offset, in-use flag> pairs.
    Each pair is represented by one int - we use offset|1 for <offset, in use>
    and offset for <offset, free> (we make sure that all offsets are even).
    In the end we put a sentry entry - <total size, in use>. The first
    entry is <0, flag>; it would be possible to store together the flag
    for the Nth area and the offset for the N+1st, but that leads to much
    hairier code.

    In other words, where the old variant would have
    4, -8, -4, 4, -12, 100
    (4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
    <0,f>, <4,u>, <12,u>, <16,f>, <20,u>, <32,f>, <132,u>
    i.e.
    0, 5, 13, 16, 21, 32, 133

    This commit switches to new data representation and takes care of a couple
    of low-hanging fruits in free_pcpu_area() - one is the switch to binary
    search, another is not doing two memmove() when one would do. Speeding
    the alloc side up (by keeping track of how many areas in the beginning are
    known to be all in use) also becomes possible - that'll be done in the next
    commit.

    Signed-off-by: Al Viro
    Signed-off-by: Tejun Heo

    Al Viro
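
    A standalone illustration of the encoding and of the binary search it
    enables; simplified (no merging of adjacent free areas, no kernel
    types) and reproducing the 0, 5, 13, 16, 21, 32, 133 example above:

    #include <stdio.h>

    static int map[] = { 0 | 0, 4 | 1, 12 | 1, 16 | 0, 20 | 1, 32 | 0, 132 | 1 };
    #define MAP_USED (int)(sizeof(map) / sizeof(map[0]))

    static void dump_map(void)
    {
        for (int i = 0; i + 1 < MAP_USED; i++) {
            int off  = map[i] & ~1;
            int next = map[i + 1] & ~1;
            printf("area at %3d, size %3d, %s\n",
                   off, next - off, (map[i] & 1) ? "in use" : "free");
        }
    }

    /* Binary search for the area starting at @off; possible because the
     * stored offsets are monotonically increasing. */
    static int find_area(int off)
    {
        int lo = 0, hi = MAP_USED - 1;

        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if ((map[mid] & ~1) < off)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    int main(void)
    {
        dump_map();
        int i = find_area(20);
        map[i] &= ~1;   /* mark the 12-byte area at offset 20 free; a real
                         * free would also merge neighbouring free areas */
        printf("after freeing offset 20:\n");
        dump_map();
        return 0;
    }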
     
  • ... and simplify the results a bit. Makes the next step easier
    to deal with - we will be changing the data representation for
    chunk->map[] and it's easier to do if the code in question is
    not split between pcpu_alloc_area() and pcpu_split_block().

    Signed-off-by: Al Viro
    Signed-off-by: Tejun Heo

    Al Viro
     

22 Jan, 2014

2 commits

  • Merge first patch-bomb from Andrew Morton:

    - a couple of misc things

    - inotify/fsnotify work from Jan

    - ocfs2 updates (partial)

    - about half of MM

    * emailed patches from Andrew Morton: (117 commits)
    mm/migrate: remove unused function, fail_migrate_page()
    mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
    mm/migrate: correct failure handling if !hugepage_migration_support()
    mm/migrate: add comment about permanent failure path
    mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
    mm: compaction: reset scanner positions immediately when they meet
    mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
    mm: compaction: detect when scanners meet in isolate_freepages
    mm: compaction: reset cached scanner pfn's before reading them
    mm: compaction: encapsulate defer reset logic
    mm: compaction: trace compaction begin and end
    memcg, oom: lock mem_cgroup_print_oom_info
    sched: add tracepoints related to NUMA task migration
    mm: numa: do not automatically migrate KSM pages
    mm: numa: trace tasks that fail migration due to rate limiting
    mm: numa: limit scope of lock for NUMA migrate rate limiting
    mm: numa: make NUMA-migrate related functions static
    lib/show_mem.c: show num_poisoned_pages when oom
    mm/hwpoison: add '#' to hwpoison_inject
    mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter
    ...

    Linus Torvalds
     
  • Switch to memblock interfaces for the early memory allocator instead
    of the bootmem allocator. No functional change in behavior from the
    bootmem users' point of view.

    Archs already converted to NO_BOOTMEM now directly use memblock
    interfaces instead of bootmem wrappers built on top of memblock. For
    the archs which still use bootmem, these new APIs simply fall back to
    the existing bootmem APIs.

    Signed-off-by: Santosh Shilimkar
    Cc: "Rafael J. Wysocki"
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Greg Kroah-Hartman
    Cc: Grygorii Strashko
    Cc: H. Peter Anvin
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michal Hocko
    Cc: Paul Walmsley
    Cc: Pavel Machek
    Cc: Russell King
    Cc: Tejun Heo
    Cc: Tony Lindgren
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Santosh Shilimkar
     

21 Jan, 2014

1 commit


23 Sep, 2013

1 commit

  • If a memory allocation in pcpu_embed_first_chunk() fails, the already
    allocated memory is not released correctly: the release loop also
    releases the elements that were never allocated, which leads to the
    following kernel BUG on systems with very little memory:

    [ 0.000000] kernel BUG at mm/bootmem.c:307!
    [ 0.000000] illegal operation: 0001 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [ 0.000000] Modules linked in:
    [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0 #22
    [ 0.000000] task: 0000000000a20ae0 ti: 0000000000a08000 task.ti: 0000000000a08000
    [ 0.000000] Krnl PSW : 0400000180000000 0000000000abda7a (__free+0x116/0x154)
    [ 0.000000] R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:0 PM:0 EA:3
    ...
    [ 0.000000] [] mark_bootmem_node+0xde/0xf0
    [ 0.000000] [] mark_bootmem+0xa8/0x118
    [ 0.000000] [] pcpu_embed_first_chunk+0xe7a/0xf0c
    [ 0.000000] [] setup_per_cpu_areas+0x4a/0x28c

    To fix the problem, only the allocated elements are now released. This
    then leads to the correct kernel panic:

    [ 0.000000] Kernel panic - not syncing: Failed to initialize percpu areas.
    ...
    [ 0.000000] Call Trace:
    [ 0.000000] ([] show_trace+0x132/0x150)
    [ 0.000000] [] show_stack+0xc4/0xd4
    [ 0.000000] [] dump_stack+0x74/0xd8
    [ 0.000000] [] panic+0xea/0x264
    [ 0.000000] [] setup_per_cpu_areas+0x5c/0x28c

    tj: Flipped if conditional so that it doesn't need "continue".

    Signed-off-by: Michael Holzheu
    Signed-off-by: Tejun Heo

    Michael Holzheu
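
    A generic standalone sketch of the error-handling pattern the fix
    applies - release only what was actually allocated; the names are
    illustrative and ordinary malloc()/free() stand in for the bootmem
    allocator:

    #include <stdio.h>
    #include <stdlib.h>

    #define NR_GROUPS 4

    int main(void)
    {
        void *areas[NR_GROUPS] = { NULL };
        int i;

        for (i = 0; i < NR_GROUPS; i++) {
            /* pretend the third allocation fails */
            areas[i] = (i == 2) ? NULL : malloc(4096);
            if (!areas[i])
                goto out_free_areas;
        }
        printf("all groups allocated\n");
        return 0;

    out_free_areas:
        /* free only the groups that were allocated; in the kernel, running
         * the cleanup over never-allocated slots hit the bootmem BUG above */
        for (int j = 0; j < i; j++)
            free(areas[j]);
        fprintf(stderr, "group %d failed, released %d earlier groups\n", i, i);
        return 1;
    }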
     

02 Dec, 2012

1 commit


29 Oct, 2012

1 commit


06 Oct, 2012

1 commit


10 May, 2012

2 commits

  • Kmemleak tracks the percpu allocations via a specific API and the
    originally allocated areas must be removed from kmemleak (via
    kmemleak_free). The code was already doing this for SMP systems; do
    the same for the UP (!SMP) percpu allocator.

    Reported-by: Sami Liedes
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Signed-off-by: Catalin Marinas
    Signed-off-by: Tejun Heo

    Catalin Marinas
     
  • pcpu_embed_first_chunk() allocates memory for each node, copies percpu
    data and frees unused portions of it before proceeding to the next
    group. This assumes that allocations for different nodes don't
    overlap; however, depending on memory topology, the bootmem allocator
    may end up allocating memory from a different node than the requested
    one, which may overlap with the portion freed from one of the previous
    percpu areas. This leads to percpu groups for different nodes
    overlapping, which is a serious bug.

    This patch separates out copy & partial free from the allocation loop
    such that all allocations are complete before partial frees happen.

    This also fixes overlapping frees which could happen on the allocation
    failure path - the out_free_areas path frees whole groups, but the
    groups could already have portions freed at that point.

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Reported-by: "Pavel V. Panteleev"
    Tested-by: "Pavel V. Panteleev"
    LKML-Reference:

    Tejun Heo
     

30 Mar, 2012

1 commit


15 Jan, 2012

1 commit

  • Kmemleak patches

    Main features:
    - Handle percpu memory allocations (only scanning them, not actually
    reporting).
    - Memory hotplug support.

    Usability improvements:
    - Show the origin of early allocations.
    - Report previously found leaks even if kmemleak has been disabled by
    some error.

    * tag 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux:
    kmemleak: Add support for memory hotplug
    kmemleak: Handle percpu memory allocation
    kmemleak: Report previously found leaks even after an error
    kmemleak: When the early log buffer is exceeded, report the actual number
    kmemleak: Show where early_log issues come from

    Linus Torvalds
     

16 Dec, 2011

1 commit

  • per_cpu_ptr_to_phys() incorrectly rounds down its result for the
    non-kmalloc case to the page boundary, which is bogus for any
    non-page-aligned address.

    This affects the only in-tree user of this function - sysfs handler
    for per-cpu 'crash_notes' physical address. The trouble is that the
    crash_notes per-cpu variable is not page-aligned:

    crash_notes = 0xc08e8ed4
    PER-CPU OFFSET VALUES:
    CPU 0: 3711f000
    CPU 1: 37129000
    CPU 2: 37133000
    CPU 3: 3713d000

    So, the per-cpu addresses are:
    crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
    crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
    crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
    crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

    However, /sys/devices/system/cpu/cpu*/crash_notes says:
    /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
    /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
    /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
    /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

    As you can see, all values are rounded down to a page
    boundary. Consequently, this is where kexec sets up the NOTE segments,
    and thus where the secondary kernel is looking for them. However, when
    the first kernel crashes, it saves the notes to the unaligned
    addresses, where they are not found.

    Fix it by adding offset_in_page() to the translated page address.

    -tj: Combined Eugene's and Petr's commit messages.

    Signed-off-by: Eugene Surovegin
    Signed-off-by: Tejun Heo
    Reported-by: Petr Tesarik
    Cc: stable@kernel.org

    Eugene Surovegin
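
    A standalone arithmetic check of the fix using the CPU 0 numbers from
    the log above; offset_in_page() is re-created locally for the demo:

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u
    #define offset_in_page(p) ((uint32_t)(p) & (PAGE_SIZE - 1))

    int main(void)
    {
        uint32_t vaddr     = 0xf7a07ed4;  /* crash_notes on CPU 0 (from the log)  */
        uint32_t page_phys = 0x36b57000;  /* physical base of the containing page */

        unsigned old_result = page_phys;                          /* rounded down */
        unsigned new_result = page_phys + offset_in_page(vaddr);  /* with the fix */

        printf("old: %#x\n", old_result);  /* 0x36b57000 - where kexec looks  */
        printf("new: %#x\n", new_result);  /* 0x36b57ed4 - where the notes live */
        return 0;
    }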
     

03 Dec, 2011

1 commit

  • This patch adds kmemleak callbacks from the percpu allocator, reducing a
    number of false positives caused by kmemleak not scanning such memory
    blocks. The percpu chunks are never reported as leaks because of current
    kmemleak limitations with the __percpu pointer not pointing directly to
    the actual chunks.

    Reported-by: Huajun Li
    Acked-by: Christoph Lameter
    Acked-by: Tejun Heo
    Signed-off-by: Catalin Marinas

    Catalin Marinas
     

24 Nov, 2011

1 commit


23 Nov, 2011

1 commit

  • Percpu allocator recorded the cpus which map to the first and last
    units in pcpu_first/last_unit_cpu respectively and used them to
    determine the address range of a chunk - e.g. it assumed that the
    first unit has the lowest address in a chunk while the last unit has
    the highest address.

    This simply isn't true. Groups in a chunk can have arbitrary positive
    or negative offsets from the previous one and there is no guarantee
    that the first unit occupies the lowest offset while the last one the
    highest.

    Fix it by actually comparing unit offsets to determine cpus occupying
    the lowest and highest offsets. Also, rename pcpu_first/last_unit_cpu
    to pcpu_low/high_unit_cpu to avoid confusion.

    The chunk address range is used to flush cache on vmalloc area
    map/unmap and decide whether a given address is in the first chunk by
    per_cpu_ptr_to_phys() and the bug was discovered by invalid
    per_cpu_ptr_to_phys() translation for crash_note.

    Kudos to Dave Young for tracking down the problem.

    Signed-off-by: Tejun Heo
    Reported-by: WANG Cong
    Reported-by: Dave Young
    Tested-by: Dave Young
    LKML-Reference:
    Cc: stable@kernel.org

    Tejun Heo
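
    A tiny standalone illustration of the corrected logic - scan all unit
    offsets for the minimum and maximum instead of trusting the first and
    last unit; the offsets are made up:

    #include <stdio.h>

    int main(void)
    {
        /* per-cpu unit offsets relative to the chunk base; deliberately
         * neither sorted nor starting at the lowest value */
        long unit_off[] = { 0x20000, -0x10000, 0x40000, 0x0 };
        int low = 0, high = 0;

        for (int i = 1; i < 4; i++) {
            if (unit_off[i] < unit_off[low])
                low = i;        /* cpu occupying the lowest offset  */
            if (unit_off[i] > unit_off[high])
                high = i;       /* cpu occupying the highest offset */
        }
        printf("low unit: cpu %d (offset %ld), high unit: cpu %d (offset %ld)\n",
               low, unit_off[low], high, unit_off[high]);
        return 0;
    }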