12 Jan, 2021

1 commit


24 Dec, 2020

1 commit


31 Oct, 2020

1 commit


19 Oct, 2020

1 commit

  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations went unaccounted, mostly because charging
    the memory cgroup of the current process wasn't an option; performance
    was likely a concern as well.

    The remote charging API makes it possible to temporarily override the
    currently active memory cgroup, so that all memory allocations are
    accounted towards a specified memory cgroup instead of the memory cgroup
    of the current process.

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory accounting
    will likely be the first user of it: a typical example is a bpf program
    parsing an incoming network packet, which allocates an entry in a hashmap
    to store some information.
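
    As a hedged illustration of the remote charging pattern described above
    (a sketch using the API names from this series, not the verbatim kernel
    code):

    #include <linux/sched/mm.h>   /* set_active_memcg() */
    #include <linux/slab.h>

    /* Charge an allocation made outside the target task's context (e.g. in
     * an interrupt handler) to a chosen memory cgroup. */
    static void *alloc_charged_to(struct mem_cgroup *memcg, size_t size)
    {
            struct mem_cgroup *old_memcg;
            void *p;

            old_memcg = set_active_memcg(memcg);           /* override */
            p = kmalloc(size, GFP_ATOMIC | __GFP_ACCOUNT); /* charged to memcg */
            set_active_memcg(old_memcg);                   /* restore */
            return p;
    }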

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.
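
    As a hedged sketch of the resulting shape (illustrative, not the verbatim
    diff), the bypass check moves inside the helper so callers no longer
    repeat it:

    /* After the change: callers simply get NULL when kmem accounting is
     * bypassed (e.g. in interrupts or for kernel threads). */
    __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
    {
            struct obj_cgroup *objcg = NULL;

            if (memcg_kmem_bypass())
                    return NULL;

            /* ... look up and take a reference on the current objcg ... */
            return objcg;
    }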

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

18 Sep, 2020

1 commit

  • The variable populated, a member of struct pcpu_chunk, is sized in units
    of unsigned long. However, its size is miscounted, so fix this minor
    issue.

    Fixes: 8ab16c43ea79 ("percpu: change the number of pages marked in the first_chunk pop bitmap")
    Cc: # 4.14+
    Signed-off-by: Sunghyun Jin
    Signed-off-by: Dennis Zhou

    Sunghyun Jin
     

13 Aug, 2020

3 commits

  • Percpu memory can represent a noticeable chunk of the total memory
    consumption, especially on big machines with many CPUs. Let's track
    percpu memory usage for each memcg and display it in memory.stat.

    A percpu allocation is usually scattered over multiple pages (and nodes),
    and can be significantly smaller than a page. So let's add a byte-sized
    counter at the memcg level: MEMCG_PERCPU_B. The byte-sized vmstat
    infrastructure created for slabs can be reused for the percpu case.
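
    As a hedged sketch (the exact call sites in the series differ), charging
    'size' bytes of percpu memory to a memcg boils down to bumping the new
    byte-sized counter, which is surfaced as "percpu" in memory.stat:

    #include <linux/memcontrol.h>

    static void account_percpu_bytes(struct mem_cgroup *memcg, size_t size)
    {
            /* MEMCG_PERCPU_B is byte-sized, unlike the page-sized counters. */
            mod_memcg_state(memcg, MEMCG_PERCPU_B, size);
    }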

    [guro@fb.com: v3]
    Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Dennis Zhou
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Tobin C. Harding
    Cc: Vlastimil Babka
    Cc: Waiman Long
    Cc: Bixuan Cui
    Cc: Michal Koutný
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Percpu memory is becoming more and more widely used by various subsystems,
    and the total amount of memory controlled by the percpu allocator can
    account for a good part of total memory.

    As an example, bpf maps can consume a lot of percpu memory, and they are
    created by users. Also, some cgroup internals (e.g. memory controller
    statistics) can be quite large. On a machine with many CPUs and a large
    number of cgroups they can consume hundreds of megabytes.

    So the lack of memcg accounting creates a breach in memory isolation.
    Like slab memory, percpu memory should be accounted by default.

    To implement the percpu accounting it's possible to take the slab memory
    accounting as a model to follow. Let's introduce two types of percpu
    chunks: root and memcg. What makes memcg chunks different is the
    additional space allocated to store memcg membership information. If
    __GFP_ACCOUNT is passed on allocation, a memcg chunk should be used.
    If it's possible to charge the corresponding size to the target memory
    cgroup, allocation is performed, and the memcg ownership data is recorded.
    System-wide allocations are performed using root chunks, so there is no
    additional memory overhead.

    To implement a fast reparenting of percpu memory on memcg removal, we
    don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
    introduced for slab accounting.
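
    A hedged usage sketch from the caller's side (using the existing percpu
    API; illustrative only): passing __GFP_ACCOUNT is what routes the request
    to a memcg-aware chunk.

    #include <linux/percpu.h>

    /* The backing percpu memory of this counter is charged to the caller's
     * memory cgroup. */
    static u64 __percpu *alloc_accounted_counter(void)
    {
            return __alloc_percpu_gfp(sizeof(u64), __alignof__(u64),
                                      GFP_KERNEL | __GFP_ACCOUNT);
    }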

    [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
    [akpm@linux-foundation.org: move unreachable code, per Roman]
    [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
    Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com

    Signed-off-by: Roman Gushchin
    Signed-off-by: Bixuan Cui
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Dennis Zhou
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Tobin C. Harding
    Cc: Vlastimil Babka
    Cc: Waiman Long
    Cc: Bixuan Cui
    Cc: Michal Koutný
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: memcg accounting of percpu memory", v3.

    This patchset adds percpu memory accounting to memory cgroups. It's based
    on the rework of the slab controller and reuses concepts and features
    introduced for the per-object slab accounting.

    Percpu memory is becoming more and more widely used by various subsystems,
    and the total amount of memory controlled by the percpu allocator can
    account for a good part of total memory.

    As an example, bpf maps can consume a lot of percpu memory, and they are
    created by users. Also, some cgroup internals (e.g. memory controller
    statistics) can be quite large. On a machine with many CPUs and a large
    number of cgroups they can consume hundreds of megabytes.

    So the lack of memcg accounting creates a breach in memory isolation.
    Like slab memory, percpu memory should be accounted by default.

    Percpu allocations are by nature scattered over multiple pages, so they
    can't be tracked on a per-page basis. Instead, the per-object tracking
    introduced by the new slab controller is reused.

    The patchset implements charging of percpu allocations, adds memcg-level
    statistics, enables accounting for percpu allocations made by memory
    cgroup internals and provides some basic tests.

    To implement the accounting of percpu memory without significant memory
    and performance overhead, the following approach is used: all accounted
    allocations are placed into a separate percpu chunk (or chunks). These
    chunks are similar to default chunks, except that they have an attached
    vector of pointers to obj_cgroup objects, big enough to hold a pointer
    for each allocated object. On allocation, if the allocation has to be
    accounted (__GFP_ACCOUNT is passed, the allocating process belongs to a
    non-root memory cgroup, etc.), the memory cgroup is charged, and if the
    maximum limit is not exceeded the allocation is performed from a
    memcg-aware chunk. Otherwise -ENOMEM is returned or the allocation is
    forced over the limit, depending on gfp (as for any other kernel memory
    allocation). The memory cgroup information is saved in the obj_cgroup
    vector at the corresponding offset. At release time the memcg
    information is read back from the vector and the cgroup is uncharged.
    Unaccounted allocations (at this point the absolute majority of all
    percpu allocations) are performed the old way, so no additional overhead
    is expected.

    To avoid pinning dying memory cgroups by outstanding allocations,
    obj_cgroup API is used instead of directly saving memory cgroup pointers.
    obj_cgroup is basically a pointer to a memory cgroup with a standalone
    reference counter. The trick is that it can be atomically swapped to
    point at the parent cgroup, so that the original memory cgroup can be
    released before all of the objects charged to it have gone away. Because
    all charges and statistics are fully recursive, it's perfectly correct to
    uncharge the parent cgroup instead. This scheme is already used for slab
    memory accounting, and percpu memory can simply follow it.
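
    A minimal sketch of that obj_cgroup usage, assuming the obj_cgroup API
    introduced by the slab rework (illustrative, not the percpu code itself):

    #include <linux/memcontrol.h>

    /* Charge 'size' bytes against the current task's obj_cgroup and return
     * the objcg so the release path can uncharge the right (possibly
     * reparented) cgroup later. */
    static struct obj_cgroup *charge_bytes(size_t size, gfp_t gfp)
    {
            struct obj_cgroup *objcg;

            objcg = get_obj_cgroup_from_current();  /* takes a reference */
            if (!objcg)
                    return NULL;

            if (obj_cgroup_charge(objcg, gfp, size)) {
                    obj_cgroup_put(objcg);
                    return NULL;
            }
            return objcg;
    }

    static void uncharge_bytes(struct obj_cgroup *objcg, size_t size)
    {
            obj_cgroup_uncharge(objcg, size);
            obj_cgroup_put(objcg);
    }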

    This patch (of 5):

    To implement accounting of percpu memory we need the information about the
    size of freed object. Return it from pcpu_free_area().

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Dennis Zhou
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Tobin C. Harding
    Cc: Vlastimil Babka
    Cc: Waiman Long
    Cc: Bixuan Cui
    Cc: Michal Koutný
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
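
    For illustration, uninitialized_var(x) simply expanded to a
    self-assignment, so the script reduces a declaration like the following
    (hedged example, not taken from the actual diff):

    /* Before: the macro expanded to "flags = flags", silencing the warning. */
    unsigned long uninitialized_var(flags);

    /* After running the script: */
    unsigned long flags;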

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

03 Jun, 2020

1 commit

  • The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.
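
    A hedged before/after sketch of an affected call site:

    /* Before: the pgprot argument was always PAGE_KERNEL. */
    buf = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);

    /* After this change: */
    buf = __vmalloc(size, GFP_KERNEL);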

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Michael Kelley [hyperv]
    Acked-by: Gao Xiang [erofs]
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Wei Liu
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

08 May, 2020

1 commit

  • Since 5.7-rc1, on btrfs we have a percpu counter initialization for
    which we always pass a GFP_KERNEL gfp_t argument (this happens since
    commit 2992df73268f78 ("btrfs: Implement DREW lock")).

    That is safe in some contexts but not in others where allowing fs
    reclaim could lead to a deadlock, because we are either holding some
    btrfs lock needed for a transaction commit or holding a btrfs
    transaction handle open. Because of that we surround the call to the
    function that initializes the percpu counter with a NOFS context using
    memalloc_nofs_save() (this is done at btrfs_init_fs_root()).
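
    That pattern looks roughly like the sketch below (illustrative, not the
    verbatim btrfs code):

    #include <linux/percpu_counter.h>
    #include <linux/sched/mm.h>

    static int init_counter_nofs(struct percpu_counter *counter)
    {
            unsigned int nofs_flag;
            int ret;

            nofs_flag = memalloc_nofs_save();       /* enter NOFS scope */
            ret = percpu_counter_init(counter, 0, GFP_KERNEL);
            memalloc_nofs_restore(nofs_flag);       /* leave NOFS scope */
            return ret;
    }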

    However it turns out that this is not enough to prevent a possible
    deadlock because percpu_alloc() determines if it is in an atomic context
    by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
    case) and it is not aware that a NOFS context is set.

    Because percpu_alloc() thinks it is in a non-atomic context it locks the
    pcpu_alloc_mutex. This can result in a btrfs deadlock when
    pcpu_balance_workfn() is running, has acquired that mutex and is waiting
    for reclaim, while the btrfs task that called percpu_counter_init() (and
    therefore percpu_alloc()) is holding either the btrfs commit_root
    semaphore or a transaction handle (done at fs/btrfs/backref.c:
    iterate_extent_inodes()), which prevents reclaim from finishing, as an
    attempt to commit the current btrfs transaction would deadlock.

    Lockdep reports this issue with the following trace:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.6.0-rc7-btrfs-next-77 #1 Not tainted
    ------------------------------------------------------
    kswapd0/91 is trying to acquire lock:
    ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]

    but task is already holding lock:
    ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #4 (fs_reclaim){+.+.}:
    fs_reclaim_acquire.part.0+0x25/0x30
    __kmalloc+0x5f/0x3a0
    pcpu_create_chunk+0x19/0x230
    pcpu_balance_workfn+0x56a/0x680
    process_one_work+0x235/0x5f0
    worker_thread+0x50/0x3b0
    kthread+0x120/0x140
    ret_from_fork+0x3a/0x50

    -> #3 (pcpu_alloc_mutex){+.+.}:
    __mutex_lock+0xa9/0xaf0
    pcpu_alloc+0x480/0x7c0
    __percpu_counter_init+0x50/0xd0
    btrfs_drew_lock_init+0x22/0x70 [btrfs]
    btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
    resolve_indirect_refs+0x120/0xa30 [btrfs]
    find_parent_nodes+0x50b/0xf30 [btrfs]
    btrfs_find_all_leafs+0x60/0xb0 [btrfs]
    iterate_extent_inodes+0x139/0x2f0 [btrfs]
    iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
    btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
    btrfs_ioctl+0x165a/0x3130 [btrfs]
    ksys_ioctl+0x87/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x5c/0x260
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #2 (&fs_info->commit_root_sem){++++}:
    down_write+0x38/0x70
    btrfs_cache_block_group+0x2ec/0x500 [btrfs]
    find_free_extent+0xc6a/0x1600 [btrfs]
    btrfs_reserve_extent+0x9b/0x180 [btrfs]
    btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
    alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
    __btrfs_cow_block+0x122/0x5a0 [btrfs]
    btrfs_cow_block+0x106/0x240 [btrfs]
    commit_cowonly_roots+0x55/0x310 [btrfs]
    btrfs_commit_transaction+0x509/0xb20 [btrfs]
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x93/0xc0
    exit_to_usermode_loop+0xf9/0x100
    do_syscall_64+0x20d/0x260
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #1 (&space_info->groups_sem){++++}:
    down_read+0x3c/0x140
    find_free_extent+0xef6/0x1600 [btrfs]
    btrfs_reserve_extent+0x9b/0x180 [btrfs]
    btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
    alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
    __btrfs_cow_block+0x122/0x5a0 [btrfs]
    btrfs_cow_block+0x106/0x240 [btrfs]
    btrfs_search_slot+0x50c/0xd60 [btrfs]
    btrfs_lookup_inode+0x3a/0xc0 [btrfs]
    __btrfs_update_delayed_inode+0x90/0x280 [btrfs]
    __btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
    __btrfs_run_delayed_items+0x8e/0x180 [btrfs]
    btrfs_commit_transaction+0x31b/0xb20 [btrfs]
    iterate_supers+0x87/0xf0
    ksys_sync+0x60/0xb0
    __ia32_sys_sync+0xa/0x10
    do_syscall_64+0x5c/0x260
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (&delayed_node->mutex){+.+.}:
    __lock_acquire+0xef0/0x1c80
    lock_acquire+0xa2/0x1d0
    __mutex_lock+0xa9/0xaf0
    __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    btrfs_evict_inode+0x40d/0x560 [btrfs]
    evict+0xd9/0x1c0
    dispose_list+0x48/0x70
    prune_icache_sb+0x54/0x80
    super_cache_scan+0x124/0x1a0
    do_shrink_slab+0x176/0x440
    shrink_slab+0x23a/0x2c0
    shrink_node+0x188/0x6e0
    balance_pgdat+0x31d/0x7f0
    kswapd+0x238/0x550
    kthread+0x120/0x140
    ret_from_fork+0x3a/0x50

    other info that might help us debug this:

    Chain exists of:
    &delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim

    Possible unsafe locking scenario:

         CPU0                          CPU1
         ----                          ----
    lock(fs_reclaim);
                                       lock(pcpu_alloc_mutex);
                                       lock(fs_reclaim);
    lock(&delayed_node->mutex);

    *** DEADLOCK ***

    3 locks held by kswapd0/91:
    #0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
    #1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
    #2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50

    stack backtrace:
    CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8f/0xd0
    check_noncircular+0x170/0x190
    __lock_acquire+0xef0/0x1c80
    lock_acquire+0xa2/0x1d0
    __mutex_lock+0xa9/0xaf0
    __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    btrfs_evict_inode+0x40d/0x560 [btrfs]
    evict+0xd9/0x1c0
    dispose_list+0x48/0x70
    prune_icache_sb+0x54/0x80
    super_cache_scan+0x124/0x1a0
    do_shrink_slab+0x176/0x440
    shrink_slab+0x23a/0x2c0
    shrink_node+0x188/0x6e0
    balance_pgdat+0x31d/0x7f0
    kswapd+0x238/0x550
    kthread+0x120/0x140
    ret_from_fork+0x3a/0x50

    This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
    to percpu_counter_init() in contexts where it is not reclaim safe.
    However, that type of approach has been discouraged since
    memalloc_[nofs|noio]_save() were introduced. Therefore this change makes
    pcpu_alloc() check for an existing nofs/noio context before deciding
    whether it is in an atomic context or not.
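
    A hedged sketch of the idea on the percpu side, assuming the
    current_gfp_context() helper (the helper name pcpu_gfp_is_atomic() and
    its exact placement in pcpu_alloc() are illustrative):

    #include <linux/sched/mm.h>   /* current_gfp_context() */

    static bool pcpu_gfp_is_atomic(gfp_t gfp)
    {
            /* Respect an active memalloc_nofs/noio scope: mask __GFP_FS and
             * __GFP_IO first, then treat anything weaker than GFP_KERNEL as
             * atomic, so pcpu_alloc_mutex is not taken in such contexts. */
            gfp = current_gfp_context(gfp);
            return (gfp & GFP_KERNEL) != GFP_KERNEL;
    }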

    Signed-off-by: Filipe Manana
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Tejun Heo
    Acked-by: Dennis Zhou
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org
    Signed-off-by: Linus Torvalds

    Filipe Manana
     

02 Apr, 2020

1 commit


20 Jan, 2020

1 commit


05 Sep, 2019

1 commit

  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

    struct pcpu_alloc_info {
            ...
            struct pcpu_group_info groups[];
    };

    Make use of the struct_size() helper instead of an open-coded version
    in order to avoid any potential type mistakes.

    So, replace the following form:

    sizeof(*ai) + nr_groups * sizeof(ai->groups[0])

    with:

    struct_size(ai, groups, nr_groups)

    This code was detected with the help of Coccinelle.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Dennis Zhou

    Gustavo A. R. Silva
     

24 Jul, 2019

1 commit


04 Jul, 2019

1 commit


05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this file is released under the gplv2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 68 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Armijn Hemel
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

14 May, 2019

1 commit

  • Pull percpu updates from Dennis Zhou:

    - scan hint update which helps address performance issues with heavily
    fragmented blocks

    - lockdep fix when freeing an allocation causes balance work to be
    scheduled

    * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
    percpu: remove spurious lock dependency between percpu and sched
    percpu: use chunk scan_hint to skip some scanning
    percpu: convert chunk hints to be based on pcpu_block_md
    percpu: make pcpu_block_md generic
    percpu: use block scan_hint to only scan forward
    percpu: remember largest area skipped during allocation
    percpu: add block level scan_hint
    percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
    percpu: relegate chunks unusable when failing small allocations
    percpu: manage chunks based on contig_bits instead of free_bytes
    percpu: introduce helper to determine if two regions overlap
    percpu: do not search past bitmap when allocating an area
    percpu: update free path with correct new free region

    Linus Torvalds
     

09 May, 2019

1 commit

  • In free_percpu() we sometimes call pcpu_schedule_balance_work() to
    queue a work item (which does a wakeup) while holding pcpu_lock.
    This creates an unnecessary lock dependency between pcpu_lock and
    the scheduler's pi_lock. There are other places where we call
    pcpu_schedule_balance_work() without holding pcpu_lock, and this case
    doesn't need to be different.
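
    A hedged sketch of the reordering inside free_percpu() (names follow
    mm/percpu.c; the need_balance condition is simplified):

    bool need_balance = false;
    unsigned long flags;

    spin_lock_irqsave(&pcpu_lock, flags);
    /* ... return the area to its chunk and note whether balance work is
     * needed, e.g. because the chunk became fully free ... */
    need_balance = true;    /* placeholder for that condition */
    spin_unlock_irqrestore(&pcpu_lock, flags);

    /* Queue the work (which performs a wakeup) only after pcpu_lock has
     * been dropped, removing the pcpu_lock -> pool->lock dependency. */
    if (need_balance)
            pcpu_schedule_balance_work();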

    Moving the call outside the lock prevents the following lockdep splat
    when running tools/testing/selftests/bpf/{test_maps,test_progs} in
    sequence with lockdep enabled:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.1.0-dbg-DEV #1 Not tainted
    ------------------------------------------------------
    kworker/23:255/18872 is trying to acquire lock:
    000000000bc79290 (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0xb2/0x520

    but task is already holding lock:
    00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #4 (pcpu_lock){..-.}:
    lock_acquire+0x9e/0x180
    _raw_spin_lock_irqsave+0x3a/0x50
    pcpu_alloc+0xfa/0x780
    __alloc_percpu_gfp+0x12/0x20
    alloc_htab_elem+0x184/0x2b0
    __htab_percpu_map_update_elem+0x252/0x290
    bpf_percpu_hash_update+0x7c/0x130
    __do_sys_bpf+0x1912/0x1be0
    __x64_sys_bpf+0x1a/0x20
    do_syscall_64+0x59/0x400
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #3 (&htab->buckets[i].lock){....}:
    lock_acquire+0x9e/0x180
    _raw_spin_lock_irqsave+0x3a/0x50
    htab_map_update_elem+0x1af/0x3a0

    -> #2 (&rq->lock){-.-.}:
    lock_acquire+0x9e/0x180
    _raw_spin_lock+0x2f/0x40
    task_fork_fair+0x37/0x160
    sched_fork+0x211/0x310
    copy_process.part.43+0x7b1/0x2160
    _do_fork+0xda/0x6b0
    kernel_thread+0x29/0x30
    rest_init+0x22/0x260
    arch_call_rest_init+0xe/0x10
    start_kernel+0x4fd/0x520
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0x6f/0x72
    secondary_startup_64+0xa4/0xb0

    -> #1 (&p->pi_lock){-.-.}:
    lock_acquire+0x9e/0x180
    _raw_spin_lock_irqsave+0x3a/0x50
    try_to_wake_up+0x41/0x600
    wake_up_process+0x15/0x20
    create_worker+0x16b/0x1e0
    workqueue_init+0x279/0x2ee
    kernel_init_freeable+0xf7/0x288
    kernel_init+0xf/0x180
    ret_from_fork+0x24/0x30

    -> #0 (&(&pool->lock)->rlock){-.-.}:
    __lock_acquire+0x101f/0x12a0
    lock_acquire+0x9e/0x180
    _raw_spin_lock+0x2f/0x40
    __queue_work+0xb2/0x520
    queue_work_on+0x38/0x80
    free_percpu+0x221/0x260
    pcpu_freelist_destroy+0x11/0x20
    stack_map_free+0x2a/0x40
    bpf_map_free_deferred+0x3c/0x50
    process_one_work+0x1f7/0x580
    worker_thread+0x54/0x410
    kthread+0x10f/0x150
    ret_from_fork+0x24/0x30

    other info that might help us debug this:

    Chain exists of:
    &(&pool->lock)->rlock --> &htab->buckets[i].lock --> pcpu_lock

    Possible unsafe locking scenario:

         CPU0                          CPU1
         ----                          ----
    lock(pcpu_lock);
                                       lock(&htab->buckets[i].lock);
                                       lock(pcpu_lock);
    lock(&(&pool->lock)->rlock);

    *** DEADLOCK ***

    3 locks held by kworker/23:255/18872:
    #0: 00000000b36a6e16 ((wq_completion)events){+.+.},
    at: process_one_work+0x17a/0x580
    #1: 00000000dfd966f0 ((work_completion)(&map->work)){+.+.},
    at: process_one_work+0x17a/0x580
    #2: 00000000e3e7a6aa (pcpu_lock){..-.},
    at: free_percpu+0x36/0x260

    stack backtrace:
    CPU: 23 PID: 18872 Comm: kworker/23:255 Not tainted 5.1.0-dbg-DEV #1
    Hardware name: ...
    Workqueue: events bpf_map_free_deferred
    Call Trace:
    dump_stack+0x67/0x95
    print_circular_bug.isra.38+0x1c6/0x220
    check_prev_add.constprop.50+0x9f6/0xd20
    __lock_acquire+0x101f/0x12a0
    lock_acquire+0x9e/0x180
    _raw_spin_lock+0x2f/0x40
    __queue_work+0xb2/0x520
    queue_work_on+0x38/0x80
    free_percpu+0x221/0x260
    pcpu_freelist_destroy+0x11/0x20
    stack_map_free+0x2a/0x40
    bpf_map_free_deferred+0x3c/0x50
    process_one_work+0x1f7/0x580
    worker_thread+0x54/0x410
    kthread+0x10f/0x150
    ret_from_fork+0x24/0x30

    Signed-off-by: John Sperbeck
    Signed-off-by: Dennis Zhou

    John Sperbeck
     

19 Mar, 2019

1 commit

  • Since commit ad67b74d2469d9b8 ("printk: hash addresses printed with %p"),
    at boot "____ptrval____" is printed instead of actual addresses:

    percpu: Embedded 38 pages/cpu @(____ptrval____) s124376 r0 d31272 u524288

    Instead of changing the print to "%px", and leaking kernel addresses,
    just remove the print completely, cfr. e.g. commit 071929dbdd865f77
    ("arm64: Stop printing the virtual memory layout").

    Signed-off-by: Matteo Croce
    Signed-off-by: Dennis Zhou

    Matteo Croce
     

14 Mar, 2019

12 commits

  • Just like blocks, chunks now maintain a scan_hint. This can be used to
    skip some scanning by promoting the scan_hint to be the contig_hint.
    The chunk's scan_hint is primarily updated on the backside and relies on
    full scanning when a block becomes free or the free region spans across
    blocks.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     
  • As mentioned in the last patch, a chunk's hints are no different than a
    block just responsible for more bits. This converts chunk level hints to
    use a pcpu_block_md to maintain them. This lets us reuse the same hint
    helper functions as a block. The left_free and right_free are unused by
    the chunk's pcpu_block_md.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     
  • In reality, a chunk is just a block covering a larger number of bits.
    The hints themselves are one and the same. Rather than maintaining the
    hints separately, first introduce nr_bits to genericize
    pcpu_block_update() to correctly maintain block->right_free. The next
    patch will convert chunk hints to be managed as a pcpu_block_md.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     
  • Blocks now remember the latest scan_hint. This can be used on the
    allocation path as when a contig_hint is broken, we can promote the
    scan_hint to the contig_hint and scan forward from there. This works
    because pcpu_block_refresh_hint() is only called on the allocation path
    while block free regions are updated manually in
    pcpu_block_update_hint_free().

    Signed-off-by: Dennis Zhou

    Dennis Zhou
     
  • Percpu allocations attempt to do first fit by scanning forward from the
    first_free of a block. However, fragmentation from allocation requests
    can cause holes not seen by block hint update functions. To address
    this, create a local version of bitmap_find_next_zero_area_off() that
    remembers the largest area skipped over. The caveat is that it only sees
    regions skipped over due to not fitting, not regions skipped due to
    alignment.

    Prior to updating the scan_hint, a scan backwards is done to try and
    recover free bits skipped due to alignment. While this can cause
    scanning to miss earlier possible free areas, smaller allocations will
    eventually fill those holes due to first fit.

    Signed-off-by: Dennis Zhou

    Dennis Zhou
     
  • Fragmentation can cause both blocks and chunks to have an early
    first_free bit available, but only be able to satisfy allocations much
    later on. This patch introduces a scan_hint to help mitigate some
    unnecessary scanning.

    The scan_hint remembers the largest area prior to the contig_hint. If
    the contig_hint == scan_hint, then scan_hint_start > contig_hint_start.
    This is necessary for scan_hint discovery when refreshing a block.
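
    For reference, a sketch of the block metadata after this series; the
    field list is reconstructed from the descriptions in these log entries,
    so treat it as illustrative rather than a verbatim copy of
    mm/percpu-internal.h:

    struct pcpu_block_md {
            int     scan_hint;         /* scan hint for the block */
            int     scan_hint_start;   /* start offset of the scan hint */
            int     contig_hint;       /* size of the largest free area */
            int     contig_hint_start; /* start of the largest free area */
            int     left_free;         /* free bits at the start of the block */
            int     right_free;        /* free bits at the end of the block */
            int     first_free;        /* first free bit in the block */
            int     nr_bits;           /* total bits this metadata covers */
    };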

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     
  • Previously, block size was flexible based on the constraint that the
    GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the
    overhead that keeping a floating number of populated free pages required
    scanning over the free regions of a chunk.

    Setting the block size to be fixed at PAGE_SIZE lets us know when an
    empty page becomes used as we will break a full contig_hint of a block.
    This means we no longer have to scan the whole chunk upon breaking a
    contig_hint, which empty page management piggybacked off of. A later
    patch takes advantage of this to optimize the allocation path by scanning
    forward only, using the scan_hint introduced there.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     
  • In certain cases, requestors of percpu memory may want specific
    alignments. However, it is possible to end up in situations where the
    contig_hint matches, but the alignment does not. This causes excess
    scanning of chunks that will fail. To prevent this, if a small
    allocation fails (< 32B), the chunk is moved to the empty list. Once an
    allocation is freed from that chunk, it is placed back into rotation.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     
  • When a chunk becomes fragmented, it can end up having a large number of
    small allocation areas free. The free_bytes sorting of chunks leads to
    unnecessary checking of chunks that cannot satisfy the allocation.
    Switch to contig_bits sorting to prevent scanning chunks that may not be
    able to service the allocation request.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     
  • While block hints were always accurate, it's possible when spanning
    across blocks that we miss updating the chunk's contig_hint. Rather than
    rely on correctness of the boundaries of hints, do a full overlap
    comparison.

    A future patch introduces the scan_hint which makes the contig_hint
    slightly fuzzy as they can at times be smaller than the actual hint.

    Signed-off-by: Dennis Zhou

    Dennis Zhou
     
  • pcpu_find_block_fit() guarantees that a fit is found within
    PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as
    it compares against the block's contig_hint. This can lead to
    incorrectly scanning past the end of the bitmap. The behavior was okay
    given the check after for bit_off >= end and the correctness of the
    hints from pcpu_find_block_fit().

    This patch fixes this by bounding the end offset by the number of bits
    in a chunk.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     
  • When updating the chunk's contig_hint on the free path of a hint that
    does not touch the page boundaries, it was incorrectly using the
    starting offset of the free region and the block's contig_hint. This
    could lead to incorrect assumptions about fit given a size and better
    alignment of the start. Fix this by using (end - start) as this is only
    called when updating a hint within a block.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Peng Fan

    Dennis Zhou
     

13 Mar, 2019

2 commits

  • As all the memblock allocation functions return NULL in case of error
    rather than panic(), the duplicates with _nopanic suffix can be removed.

    Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Petr Mladek [printk]
    Cc: Catalin Marinas
    Cc: Christophe Leroy
    Cc: Christoph Hellwig
    Cc: "David S. Miller"
    Cc: Dennis Zhou
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Guo Ren [c-sky]
    Cc: Heiko Carstens
    Cc: Juergen Gross [Xen]
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Paul Burton
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Rob Herring
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Add panic() calls if memblock_alloc() returns NULL.

    The panic() format duplicates the one used by memblock itself, and in
    order to avoid an explosion of long parameter lists, open-coded
    allocation size calculations are replaced with a local variable.
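
    The resulting call-site pattern looks roughly like this (a hedged sketch;
    the caller setup_table() and its size calculation are hypothetical):

    #include <linux/init.h>
    #include <linux/memblock.h>

    static void __init setup_table(size_t nr_entries)
    {
            size_t alloc_size = nr_entries * sizeof(unsigned long);
            unsigned long *table;

            table = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
            if (!table)
                    panic("%s: Failed to allocate %zu bytes\n",
                          __func__, alloc_size);
            /* ... */
    }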

    Link: http://lkml.kernel.org/r/1548057848-15136-17-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Catalin Marinas
    Cc: Christophe Leroy
    Cc: Christoph Hellwig
    Cc: "David S. Miller"
    Cc: Dennis Zhou
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Guo Ren [c-sky]
    Cc: Heiko Carstens
    Cc: Juergen Gross [Xen]
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Paul Burton
    Cc: Petr Mladek
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Rob Herring
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

24 Feb, 2019

1 commit

  • The group_cnt array is defined with NR_CPUS entries, but nr_groups
    normally does not reach NR_CPUS, so the current code has no issue.

    Other parts of pcpu_build_alloc_info() use nr_groups as the loop
    condition, so make this consistent by using 'group < nr_groups' in the
    for loop. If nr_groups ever equals NR_CPUS, this also avoids an
    out-of-bounds memory access.

    Signed-off-by: Peng Fan
    Signed-off-by: Dennis Zhou

    Peng Fan
     

02 Nov, 2018

1 commit


31 Oct, 2018

3 commits

  • When memblock allocation APIs are called with align = 0, the alignment
    is implicitly set to SMP_CACHE_BYTES.

    Implicit alignment is done deep in the memblock allocator and it can
    come as a surprise. Not that such an alignment would be wrong even
    when used incorrectly, but it is better to be explicit for the sake of
    clarity and the principle of least surprise.

    Replace all such uses of memblock APIs with the 'align' parameter
    explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
    in the memblock internal allocation functions.

    For the case when memblock APIs are used via helper functions, e.g.
    iommu_arena_new_node() on Alpha, the helper functions were detected with
    Coccinelle's help and then manually examined and updated where
    appropriate.

    The direct memblock APIs users were updated using the semantic patch below:

    @@
    expression size, min_addr, max_addr, nid;
    @@
    (
    |
    - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
    + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
    nid)
    |
    - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
    + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
    nid)
    |
    - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
    + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
    |
    - memblock_alloc(size, 0)
    + memblock_alloc(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_raw(size, 0)
    + memblock_alloc_raw(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_from(size, 0, min_addr)
    + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
    |
    - memblock_alloc_nopanic(size, 0)
    + memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_low(size, 0)
    + memblock_alloc_low(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_low_nopanic(size, 0)
    + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_from_nopanic(size, 0, min_addr)
    + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
    |
    - memblock_alloc_node(size, 0, nid)
    + memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
    )

    [mhocko@suse.com: changelog update]
    [akpm@linux-foundation.org: coding-style fixes]
    [rppt@linux.ibm.com: fix missed uses of implicit alignment]
    Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
    Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Suggested-by: Michal Hocko
    Acked-by: Paul Burton [MIPS]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: Matt Turner
    Cc: Michal Simek
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below and then
    semi-automated removal of duplicated '#include <linux/memblock.h>':

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The conversion is done using

    sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
    $(git grep -l memblock_virt_alloc)

    Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

08 Oct, 2018

1 commit

  • The commit ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
    introduced bitmap metadata blocks. These metadata blocks are allocated
    whenever a new chunk is created, but they are never freed. Fix it.

    Fixes: ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
    Signed-off-by: Mike Rapoport
    Cc: stable@vger.kernel.org
    Signed-off-by: Dennis Zhou

    Mike Rapoport