06 Nov, 2020

1 commit

  • Zero-fill element values for all cpus other than the current one, just as
    when not using prealloc. This is the only way the bpf program can
    ensure known initial values for all cpus ('onallcpus' cannot be
    set when coming from the bpf program).

    The scenario is: bpf program inserts some elements in a per-cpu
    map, then deletes some (or userspace does). When later adding
    new elements using bpf_map_update_elem(), the bpf program can
    only set the value of the new elements for the current cpu.
    When prealloc is enabled, previously deleted elements are re-used.
    Without the fix, values for other cpus remain whatever they were
    when the re-used entry was previously freed.
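
    A minimal sketch of the program-side update in this scenario (map,
    section and variable names are illustrative, not taken from the patch
    or its selftest):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
        __uint(max_entries, 16);
        __type(key, __u32);
        __type(value, __u64);
    } pcpu_hash SEC(".maps");

    SEC("tp/syscalls/sys_enter_getpgid")
    int update_elem(void *ctx)
    {
        __u32 key = 1;
        __u64 init = 42;    /* becomes the value for the current cpu only */

        /* With prealloc (the default), this may re-use a previously freed
         * element; after this fix the other cpus' slots read back as zero
         * instead of whatever the old element held. */
        bpf_map_update_elem(&pcpu_hash, &key, &init, BPF_ANY);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";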

    A selftest is added to validate correct operation in the above
    scenario as well as in the case of LRU per-cpu map element re-use.

    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Signed-off-by: David Verbeiren
    Signed-off-by: Alexei Starovoitov
    Acked-by: Matthieu Baerts
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201104112332.15191-1-david.verbeiren@tessares.net

    David Verbeiren
     

12 Oct, 2020

1 commit

  • Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
    ("bpf: Relax max_entries check for most of the inner map types") added support
    for dynamic inner max elements for most map-in-map types. Exceptions were maps
    like array or prog array where the map_gen_lookup() callback uses the maps'
    max_entries field as a constant when emitting instructions.

    We recently implemented Maglev consistent hashing into Cilium's load balancer
    which uses map-in-map with an outer map being hash and inner being array holding
    the Maglev backend table for each service. It was designed this way in
    order to reduce overall memory consumption, given that the outer hash map
    avoids preallocating a large, flat memory area for all services. Also,
    the number of service mappings is not always known a priori.

    The use case for dynamic inner array map entries is to further reduce
    memory overhead: some services might have just a small number of backends
    while others could have a large number. Right now the Maglev backend tables
    for small and large numbers of backends would need to have the same number
    of inner array map entries, which adds a lot of unneeded overhead.

    Dynamic inner array map entries can be realized by avoiding the inlined code
    generation for their lookup. The lookup will still be efficient since it will
    be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
    The patch adds a BPF_F_INNER_MAP flag to map creation which skips the
    inline code generation and relaxes the array_map_meta_equal() check to
    ignore both maps' max_entries. This still allows faster lookups for
    map-in-map when BPF_F_INNER_MAP is not specified and dynamic max_entries
    is therefore not needed.
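
    A hedged sketch of how such a map-in-map could be declared with libbpf's
    BTF-defined maps (names are illustrative, not the Cilium or selftest
    code); the inner template carries BPF_F_INNER_MAP so that inner maps
    inserted at runtime may have a max_entries different from the template:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    /* Inner template only; max_entries is a placeholder because
     * BPF_F_INNER_MAP relaxes the max_entries equality check. */
    struct backend_tmpl {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(map_flags, BPF_F_INNER_MAP);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u32);
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
        __uint(max_entries, 256);            /* one slot per service */
        __uint(key_size, sizeof(__u32));
        __array(values, struct backend_tmpl);
    } services SEC(".maps");

    SEC("tc")
    int pick_backend(struct __sk_buff *skb)
    {
        __u32 svc = 0, slot = 0;
        void *inner;

        /* Differently sized inner arrays are inserted from user space at
         * runtime via bpf_map_update_elem(outer_fd, &svc, &inner_fd). */
        inner = bpf_map_lookup_elem(&services, &svc);
        if (!inner)
            return 0;
        /* Not inlined: goes through array_map_lookup_elem() directly. */
        bpf_map_lookup_elem(inner, &slot);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";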

    Example code generation where inner map is dynamic sized array:

    # bpftool p d x i 125
    int handle__sys_enter(void * ctx):
    ; int handle__sys_enter(void *ctx)
    0: (b4) w1 = 0
    ; int key = 0;
    1: (63) *(u32 *)(r10 -4) = r1
    2: (bf) r2 = r10
    ;
    3: (07) r2 += -4
    ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
    4: (18) r1 = map[id:468]
    6: (07) r1 += 272
    7: (61) r0 = *(u32 *)(r2 +0)
    8: (35) if r0 >= 0x3 goto pc+5
    9: (67) r0 <
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net

    Daniel Borkmann
     

23 Sep, 2020

1 commit

  • Two minor conflicts:

    1) net/ipv4/route.c, adding a new local variable while
    moving another local variable and removing its
    initial assignment.

    2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
    One pretty prints the port mode differently, whilst another
    changes the driver to try and obtain the port mode from
    the port node rather than the switch node.

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Sep, 2020

1 commit

  • Currently, for hashmap, the bpf iterator will grab a bucket lock, a
    spinlock, before traversing the elements in the bucket. This can ensure
    all bpf-visited elements are valid. But this mechanism may cause a
    deadlock if an update/deletion happens to the same bucket of the
    visited map in the program. For example, if we added bpf_map_update_elem()
    call to the same visited element in selftests bpf_iter_bpf_hash_map.c,
    we will have the following deadlock:

    ============================================
    WARNING: possible recursive locking detected
    5.9.0-rc1+ #841 Not tainted
    --------------------------------------------
    test_progs/1750 is trying to acquire lock:
    ffff9a5bb73c5e70 (&htab->buckets[i].raw_lock){....}-{2:2}, at: htab_map_update_elem+0x1cf/0x410

    but task is already holding lock:
    ffff9a5bb73c5e20 (&htab->buckets[i].raw_lock){....}-{2:2}, at: bpf_hash_map_seq_find_next+0x94/0x120

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&htab->buckets[i].raw_lock);
    lock(&htab->buckets[i].raw_lock);

    *** DEADLOCK ***
    ...
    Call Trace:
    dump_stack+0x78/0xa0
    __lock_acquire.cold.74+0x209/0x2e3
    lock_acquire+0xba/0x380
    ? htab_map_update_elem+0x1cf/0x410
    ? __lock_acquire+0x639/0x20c0
    _raw_spin_lock_irqsave+0x3b/0x80
    ? htab_map_update_elem+0x1cf/0x410
    htab_map_update_elem+0x1cf/0x410
    ? lock_acquire+0xba/0x380
    bpf_prog_ad6dab10433b135d_dump_bpf_hash_map+0x88/0xa9c
    ? find_held_lock+0x34/0xa0
    bpf_iter_run_prog+0x81/0x16e
    __bpf_hash_map_seq_show+0x145/0x180
    bpf_seq_read+0xff/0x3d0
    vfs_read+0xad/0x1c0
    ksys_read+0x5f/0xe0
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    ...

    The bucket_lock is first grabbed in seq_ops->next(), called by bpf_seq_read(),
    and then grabbed again in htab_map_update_elem() in the bpf program, causing
    the deadlock.

    Actually, we do not need the bucket_lock here; we can just use rcu_read_lock(),
    similar to the netlink iterator, where the rcu_read_{lock,unlock} usage looks
    like this:
    seq_ops->start():
        rcu_read_lock();
    seq_ops->next():
        rcu_read_unlock();
        /* next element */
        rcu_read_lock();
    seq_ops->stop():
        rcu_read_unlock();

    Compared to the old bucket_lock mechanism, if a concurrent update/delete
    happens, we may visit stale elements, miss some elements, or repeat some
    elements. I think this is a reasonable compromise. Users wanting to avoid
    stale, missing or repeated accesses can use the bpf_map batch access
    syscall interface instead.
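
    A hedged sketch of the pattern that used to deadlock and is safe once the
    bucket lock is replaced by rcu_read_lock() (loosely modeled on the
    bpf_iter_bpf_hash_map.c selftest mentioned above; the iterator link is
    attached to the map from user space):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u64);
    } hmap SEC(".maps");

    SEC("iter/bpf_map_elem")
    int touch_visited(struct bpf_iter__bpf_map_elem *ctx)
    {
        __u32 *key = ctx->key;
        __u64 zero = 0;

        if (!key)
            return 0;
        /* Updating the map that is currently being iterated: with the old
         * scheme this re-acquired the bucket lock already held by the
         * iterator; with rcu_read_lock() only the update path takes it. */
        bpf_map_update_elem(&hmap, key, &zero, BPF_ANY);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";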

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200902235340.2001375-1-yhs@fb.com

    Yonghong Song
     

29 Aug, 2020

1 commit

  • Introduce sleepable BPF programs that can request such property for themselves
    via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
    to use helpers like bpf_copy_from_user() that might sleep. At present only
    fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
    when they are attached to kernel functions that are known to allow sleeping.

    The non-sleepable programs rely on implicit rcu_read_lock() and
    migrate_disable() to protect the lifetime of programs, maps that they use and
    per-cpu kernel structures used to pass info between bpf programs and the
    kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
    migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
    should not be enclosed in migrate_disable() either. Therefore
    rcu_read_lock_trace is used to protect the lifetime of sleepable progs.

    There are many networking and tracing program types. In many cases the
    'struct bpf_prog *' pointer itself is rcu protected within some other kernel
    data structure and the kernel code is using rcu_dereference() to load that
    program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
    Instead sleepable bpf programs are allowed with bpf trampoline only. The
    program pointers are hard-coded into generated assembly of bpf trampoline and
    synchronize_rcu_tasks_trace() is used to protect the lifetime of the program.
    The same trampoline can hold both sleepable and non-sleepable progs.

    When rcu_read_lock_trace is held it means that some sleepable bpf program is
    running from bpf trampoline. Those programs can use bpf arrays and preallocated
    hash/lru maps. These map types are waiting on programs to complete via
    synchronize_rcu_tasks_trace();

    Updates to the trampoline now have to do synchronize_rcu_tasks_trace() and
    synchronize_rcu_tasks() to wait for sleepable progs to finish and for the
    trampoline assembly to finish.

    This is the first step of introducing sleepable progs. Eventually dynamically
    allocated hash maps can be allowed and networking program types can become
    sleepable too.
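
    A hedged sketch of a minimal sleepable program (attach point and names are
    illustrative; the ".s" section suffix is what makes libbpf set the
    BPF_F_SLEEPABLE flag at load time):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char buf[16];

    SEC("fentry.s/__x64_sys_nanosleep")      /* ".s" => BPF_F_SLEEPABLE */
    int BPF_PROG(sleepable_probe, struct pt_regs *regs)
    {
        /* Allowed only in sleepable programs: may fault and sleep while
         * copying from a user address (NULL is just for illustration,
         * the copy then fails with -EFAULT). */
        bpf_copy_from_user(buf, sizeof(buf), NULL);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";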

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Josef Bacik
    Acked-by: Andrii Nakryiko
    Acked-by: KP Singh
    Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

28 Aug, 2020

1 commit

  • Some properties of the inner map are used at verification time.
    When an inner map is inserted into an outer map at runtime,
    bpf_map_meta_equal() is currently used to ensure those properties
    of the inserted inner map stay the same as at verification
    time.

    In particular, the current bpf_map_meta_equal() checks max_entries, which
    turns out to be too restrictive for most of the maps that do not use
    max_entries during verification. It limits the use case of replacing
    a smaller inner map with a larger inner map. There are some maps that
    do use max_entries during verification though. For example,
    map_gen_lookup in array_map_ops uses max_entries to generate
    the inline lookup code.

    To accommodate differences between maps, the map_meta_equal callback is
    added to bpf_map_ops. Each map type can decide what to check when its
    map is used as an inner map at runtime.

    Also, some map types cannot be used as an inner map and they are
    currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.
    It is not unusual for new map types to be unaware that such a
    blacklist exists. This patch enforces an explicit opt-in
    and only allows a map to be used as an inner map if it has
    implemented the map_meta_equal ops. It is based on the
    discussion in [1].

    All maps that support being an inner map have their map_meta_equal
    pointing to bpf_map_meta_equal in this patch. A later patch will
    relax the max_entries check for most maps. bpf_types.h
    counts 28 map types. This patch adds 23 ".map_meta_equal"
    entries by using coccinelle. The remaining 5 are:
    BPF_MAP_TYPE_PROG_ARRAY
    BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
    BPF_MAP_TYPE_STRUCT_OPS
    BPF_MAP_TYPE_ARRAY_OF_MAPS
    BPF_MAP_TYPE_HASH_OF_MAPS

    The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
    is moved such that the same error is returned.

    [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/
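
    A hedged sketch of what the opt-in looks like in a map type's ops table
    (neighbouring fields are condensed; only the .map_meta_equal line is the
    point here):

    /* kernel/bpf/map_in_map.c: generic check usable by most map types */
    bool bpf_map_meta_equal(const struct bpf_map *meta0,
                            const struct bpf_map *meta1);

    /* kernel/bpf/hashtab.c (condensed): opting in to being an inner map */
    const struct bpf_map_ops htab_map_ops = {
        .map_meta_equal = bpf_map_meta_equal,
        .map_alloc_check = htab_map_alloc_check,
        .map_alloc = htab_map_alloc,
        /* ... */
    };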

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com

    Martin KaFai Lau
     

04 Aug, 2020

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-08-04

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 73 non-merge commits during the last 9 day(s) which contain
    a total of 135 files changed, 4603 insertions(+), 1013 deletions(-).

    The main changes are:

    1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF
    syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko.

    2) Add BPF iterator for map elements and to iterate all BPF programs for efficient
    in-kernel inspection, from Yonghong Song and Alexei Starovoitov.

    3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid
    unwinder errors, from Song Liu.

    4) Allow cgroup local storage map to be shared between programs on the same
    cgroup. Also extend BPF selftests with coverage, from YiFei Zhu.

    5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM
    load instructions, from Jean-Philippe Brucker.

    6) Follow-up fixes on BPF socket lookup in combination with reuseport group
    handling. Also add related BPF selftests, from Jakub Sitnicki.

    7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for
    socket create/release as well as bind functions, from Stanislav Fomichev.

    8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct
    xdp_statistics, from Peilin Ye.

    9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime.

    10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6}
    fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin.

    11) Fix a bpftool segfault due to missing program type name and make it more robust
    to prevent them in future gaps, from Quentin Monnet.

    12) Consolidate cgroup helper functions across selftests and fix a v6 localhost
    resolver issue, from John Fastabend.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Aug, 2020

1 commit


30 Jul, 2020

1 commit

  • Fix HASH_OF_MAPS bug of not putting inner map pointer on bpf_map_elem_update()
    operation. This is due to per-cpu extra_elems optimization, which bypassed
    free_htab_elem() logic doing proper clean ups. Make sure that inner map is put
    properly in optimized case as well.

    Fixes: 8c290e60fa2a ("bpf: fix hashmap extra_elems logic")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200729040913.2815687-1-andriin@fb.com

    Andrii Nakryiko
     

26 Jul, 2020

1 commit

  • The bpf iterators for hash, percpu hash, lru hash
    and lru percpu hash are implemented. During link time,
    bpf_iter_reg->check_target() will check the map type
    and ensure that the program's access to the key/value region is
    within the map-defined key/value size limit.

    For percpu hash and lru hash maps, the bpf program
    will receive values for all cpus. The map element
    bpf iterator infrastructure will prepare value
    properly before passing the value pointer to the
    bpf program.

    This patch set supports read-only map keys and
    read/write map values. It does not support deleting
    map elements, e.g., from hash tables. If there is
    a use case for this, the following mechanism can
    be used to support map deletion for hashtab, etc.:
    - permit a new bpf program return value, e.g., 2,
    to let bpf iterator know the map element should
    be removed.
    - since bucket lock is taken, the map element will be
    queued.
    - once bucket lock is released after all elements under
    this bucket are traversed, all to-be-deleted map
    elements can be deleted.
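
    A minimal sketch of a map element iterator program (close to, but not
    copied from, the selftests; it assumes a hash map with 4-byte keys and
    8-byte values, and the iterator link is created against that map's fd
    from user space):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("iter/bpf_map_elem")
    int dump_hash(struct bpf_iter__bpf_map_elem *ctx)
    {
        struct seq_file *seq = ctx->meta->seq;
        __u32 *key = ctx->key;
        __u64 *val = ctx->value;

        if (!key || !val)
            return 0;                /* end of iteration */

        BPF_SEQ_PRINTF(seq, "%u: %llu\n", *key, *val);
        *val += 1;                   /* values are read/write, keys read-only */
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";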

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200723184114.590470-1-yhs@fb.com

    Yonghong Song
     

01 Jul, 2020

1 commit

  • bpf_free_used_maps() or close(map_fd) will trigger map_free callback.
    bpf_free_used_maps() is called after bpf prog is no longer executing:
    bpf_prog_put->call_rcu->bpf_prog_free->bpf_free_used_maps.
    Hence there is no need to call synchronize_rcu() to protect map elements.

    Note that hash_of_maps and array_of_maps update/delete inner maps via
    sys_bpf() that calls maybe_wait_bpf_programs() and synchronize_rcu().

    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Paul E. McKenney
    Link: https://lore.kernel.org/bpf/20200630043343.53195-2-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

23 Jun, 2020

2 commits

  • Set map_btf_name and map_btf_id for all map types so that map fields can
    be accessed by bpf programs.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com

    Andrey Ignatov
     
  • There are multiple use cases where it's convenient to have access to bpf
    map fields, both `struct bpf_map` and map-type-specific structs such as
    `struct bpf_array`, `struct bpf_htab`, etc.

    For example while working with sock arrays it can be necessary to
    calculate the key based on map->max_entries (some_hash % max_entries).
    Currently this is solved by communicating max_entries via an "out-of-band"
    channel, e.g. via an additional map with a known key to get info about the
    target map. That works, but it is not very convenient and is error-prone
    when working with many maps.

    In other cases necessary data is dynamic (i.e. unknown at loading time)
    and it's impossible to get it at all. For example while working with a
    hash table it can be convenient to know how much capacity is already
    used (bpf_htab.count.counter for BPF_F_NO_PREALLOC case).

    At the same time kernel knows this info and can provide it to bpf
    program.

    Fill this gap by adding support to access bpf map fields from bpf
    program for both `struct bpf_map` and map type specific fields.

    Support is implemented via btf_struct_access() so that a user can define
    their own `struct bpf_map` or map type specific struct in their program
    with only necessary fields and preserve_access_index attribute, cast a
    map to this struct and use a field.

    For example:

    struct bpf_map {
    __u32 max_entries;
    } __attribute__((preserve_access_index));

    struct bpf_array {
    struct bpf_map map;
    __u32 elem_size;
    } __attribute__((preserve_access_index));

    struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 4);
    __type(key, __u32);
    __type(value, __u32);
    } m_array SEC(".maps");

    SEC("cgroup_skb/egress")
    int cg_skb(void *ctx)
    {
    struct bpf_array *array = (struct bpf_array *)&m_array;
    struct bpf_map *map = (struct bpf_map *)&m_array;

    /* .. use map->max_entries or array->map.max_entries .. */
    }

    Similarly to other btf_struct_access() use-cases (e.g. struct tcp_sock
    in net/ipv4/bpf_tcp_ca.c) the patch allows access to any fields of
    corresponding struct. Only reading from map fields is supported.

    For btf_struct_access() to work there should be a way to know btf id of
    a struct that corresponds to a map type. To get btf id there should be a
    way to get a stringified name of map-specific struct, such as
    "bpf_array", "bpf_htab", etc for a map type. Two new fields are added to
    `struct bpf_map_ops` to handle it:
    * .map_btf_name keeps a btf name of a struct returned by map_alloc();
    * .map_btf_id is used to cache btf id of that struct.

    To make btf id calculation cheaper, they're calculated once while
    preparing btf_vmlinux and cached the same way as it's done for the btf_id
    field of `struct bpf_func_proto`.

    While calculating btf ids, struct names are NOT checked for collision.
    Collisions will be checked as a part of the work to prepare btf ids used
    in verifier in compile time that should land soon. The only known
    collision for `struct bpf_htab` (kernel/bpf/hashtab.c vs
    net/core/sock_map.c) was fixed earlier.

    Both new fields .map_btf_name and .map_btf_id must be set for a map type
    for the feature to work. If neither is set for a map type, the verifier will
    return ENOTSUPP on an attempt to access a map_ptr of the corresponding type.
    If just one of them is set, it's a verifier misconfiguration.

    Only `struct bpf_array` for BPF_MAP_TYPE_ARRAY and `struct bpf_htab` for
    BPF_MAP_TYPE_HASH are supported by this patch. Other map types will be
    supported separately.

    The feature is available only for CONFIG_DEBUG_INFO_BTF=y and gated by
    perfmon_capable() so that unpriv programs won't have access to bpf map
    fields.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/6479686a0cd1e9067993df57b4c3eef0e276fec9.1592600985.git.rdna@fb.com

    Andrey Ignatov
     

15 May, 2020

1 commit

  • Implement permissions as stated in uapi/linux/capability.h.
    In order to do that, the verifier allow_ptr_leaks flag is split
    into four flags and they are set as:
    env->allow_ptr_leaks = bpf_allow_ptr_leaks();
    env->bypass_spec_v1 = bpf_bypass_spec_v1();
    env->bypass_spec_v4 = bpf_bypass_spec_v4();
    env->bpf_capable = bpf_capable();

    The first three are currently equivalent to perfmon_capable(), since leaking kernel
    pointers and reading kernel memory via side channel attacks is roughly
    equivalent to reading kernel memory with cap_perfmon.

    'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
    other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions,
    subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
    verifier, run time mitigations in bpf array, and enables indirect variable
    access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
    by the verifier.

    That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
    will have speculative checks done by the verifier and other Spectre mitigations
    applied. Such a networking BPF program will not be able to leak kernel pointers
    and will not be able to access arbitrary kernel memory.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

28 Feb, 2020

1 commit

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these ones is a flexible array member[1][2],
    introduced in C99:

    struct foo {
    int stuff;
    struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]

    This issue was found with the help of Coccinelle.

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor

    Gustavo A. R. Silva
     

25 Feb, 2020

5 commits

  • PREEMPT_RT forbids certain operations like memory allocations (even with
    GFP_ATOMIC) from atomic contexts. This is required because even with
    GFP_ATOMIC the memory allocator calls into code paths which acquire locks
    with long held lock sections. To ensure the deterministic behaviour these
    locks are regular spinlocks, which are converted to 'sleepable' spinlocks
    on RT. The only true atomic contexts on an RT kernel are the low level
    hardware handling, scheduling, low level interrupt handling, NMIs etc. None
    of these contexts should ever do memory allocations.

    As regular device interrupt handlers and soft interrupts are forced into
    thread context, the existing code which does
    spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();
    just works.

    In theory the BPF locks could be converted to regular spinlocks as well,
    but the bucket locks and percpu_freelist locks can be taken from arbitrary
    contexts (perf, kprobes, tracepoints) which are required to be atomic
    contexts even on RT. These mechanisms require preallocated maps, so there
    is no need to invoke memory allocations within the lock held sections.

    BPF maps which need dynamic allocation are only used from (forced) thread
    context on RT and can therefore use regular spinlocks which in turn allows
    to invoke memory allocations from the lock held section.

    To achieve this make the hash bucket lock a union of a raw and a regular
    spinlock and initialize and lock/unlock either the raw spinlock for
    preallocated maps or the regular variant for maps which require memory
    allocations.

    On a non RT kernel this distinction is neither possible nor required.
    spinlock maps to raw_spinlock and the extra code and conditional is
    optimized out by the compiler. No functional change.
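
    A condensed, hedged sketch of the shape this takes (the real definitions
    and lock/unlock helpers live in kernel/bpf/hashtab.c and also handle the
    irqsave flags):

    struct bucket {
        struct hlist_nulls_head head;
        union {
            raw_spinlock_t raw_lock;   /* preallocated maps: any context */
            spinlock_t     lock;       /* dynamically allocating maps    */
        };
    };

    static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
    {
        return !IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab);
    }

    static inline void htab_lock_bucket(const struct bpf_htab *htab,
                                        struct bucket *b)
    {
        if (htab_use_raw_lock(htab))
            raw_spin_lock(&b->raw_lock);
        else
            spin_lock(&b->lock);
    }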

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.509685912@linutronix.de

    Thomas Gleixner
     
  • As a preparation for making the BPF locking RT friendly, factor out the
    hash bucket lock operations into inline functions. This allows to do the
    necessary RT modification in one place instead of sprinkling it all over
    the place. No functional change.

    The now unused htab argument of the lock/unlock functions will be used in
    the next step which adds PREEMPT_RT support.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.420416916@linutronix.de

    Thomas Gleixner
     
  • The required protection is that the caller cannot be migrated to a
    different CPU as these places take either a hash bucket lock or might
    trigger a kprobe inside the memory allocator. Both scenarios can lead to
    deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
    variable which temporarily blocks the invocation of BPF programs from perf
    and kprobes.

    Replace the open coded preempt_disable/enable() and this_cpu_inc/dec()
    pairs with the new recursion prevention helpers to prepare BPF to work on
    PREEMPT_RT enabled kernels. On a non-RT kernel the migrate disable/enable
    in the helpers map to preempt_disable/enable(), i.e. no functional change.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.211208533@linutronix.de

    Thomas Gleixner
     
  • If an element is freed via RCU then recursion into BPF instrumentation
    functions is not a concern. The element is already detached from the map
    and the RCU callback does not hold any locks on which a kprobe, perf event
    or tracepoint attached BPF program could deadlock.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.259118710@linutronix.de

    Thomas Gleixner
     
  • The comment where the bucket lock is acquired says:

    /* bpf_map_update_elem() can be called in_irq() */

    which is not really helpful and, aside from that, does not explain the
    subtle details of the hash bucket locks, especially in the context of BPF
    and perf, kprobes and tracing.

    Add a comment at the top of the file which explains the protection scopes
    and the details how potential deadlocks are prevented.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145642.755793061@linutronix.de

    Thomas Gleixner
     

20 Feb, 2020

2 commits

  • Commit 057996380a42 ("bpf: Add batch ops to all htab bpf map")
    added lookup_and_delete batch operation for hash table.
    The current implementation has bpf_lru_push_free() inside
    the bucket lock, which may cause a deadlock.

    syzbot reports:
    -> #2 (&htab->buckets[i].lock#2){....}:
    __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
    _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
    htab_lru_map_delete_node+0xce/0x2f0 kernel/bpf/hashtab.c:593
    __bpf_lru_list_shrink_inactive kernel/bpf/bpf_lru_list.c:220 [inline]
    __bpf_lru_list_shrink+0xf9/0x470 kernel/bpf/bpf_lru_list.c:266
    bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:340 [inline]
    bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline]
    bpf_lru_pop_free+0x87c/0x1670 kernel/bpf/bpf_lru_list.c:499
    prealloc_lru_pop+0x2c/0xa0 kernel/bpf/hashtab.c:132
    __htab_lru_percpu_map_update_elem+0x67e/0xa90 kernel/bpf/hashtab.c:1069
    bpf_percpu_hash_update+0x16e/0x210 kernel/bpf/hashtab.c:1585
    bpf_map_update_value.isra.0+0x2d7/0x8e0 kernel/bpf/syscall.c:181
    generic_map_update_batch+0x41f/0x610 kernel/bpf/syscall.c:1319
    bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
    __do_sys_bpf+0x9b7/0x41e0 kernel/bpf/syscall.c:3460
    __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
    __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (&loc_l->lock){....}:
    check_prev_add kernel/locking/lockdep.c:2475 [inline]
    check_prevs_add kernel/locking/lockdep.c:2580 [inline]
    validate_chain kernel/locking/lockdep.c:2970 [inline]
    __lock_acquire+0x2596/0x4a00 kernel/locking/lockdep.c:3954
    lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4484
    __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
    _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
    bpf_common_lru_push_free kernel/bpf/bpf_lru_list.c:516 [inline]
    bpf_lru_push_free+0x250/0x5b0 kernel/bpf/bpf_lru_list.c:555
    __htab_map_lookup_and_delete_batch+0x8d4/0x1540 kernel/bpf/hashtab.c:1374
    htab_lru_map_lookup_and_delete_batch+0x34/0x40 kernel/bpf/hashtab.c:1491
    bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
    __do_sys_bpf+0x1f7d/0x41e0 kernel/bpf/syscall.c:3456
    __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
    __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Possible unsafe locking scenario:

    CPU0                                    CPU2
    ----                                    ----
    lock(&htab->buckets[i].lock#2);
                                            lock(&l->lock);
                                            lock(&htab->buckets[i].lock#2);
    lock(&loc_l->lock);

    *** DEADLOCK ***

    To fix the issue, for htab_lru_map_lookup_and_delete_batch() on CPU0,
    let us do bpf_lru_push_free() outside the htab bucket lock. This
    avoids the above deadlock scenario.

    Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
    Reported-by: syzbot+a38ff3d9356388f2fb83@syzkaller.appspotmail.com
    Reported-by: syzbot+122b5421d14e68f29cd1@syzkaller.appspotmail.com
    Suggested-by: Hillf Danton
    Suggested-by: Martin KaFai Lau
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Jakub Sitnicki
    Acked-by: Brian Vazquez
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200219234757.3544014-1-yhs@fb.com

    Yonghong Song
     
  • Grabbing the spinlock for every bucket, even if it's empty, was causing
    a significant performance cost when traversing htab maps that have only a
    few entries. This patch addresses the issue by first checking the
    bucket_cnt; if the bucket has some entries, then we go and grab the
    spinlock and proceed with the batching.

    Tested with a htab of size 50K and different value of populated entries.

    Before:
      Benchmark                Time(ns)     CPU(ns)
      ---------------------------------------------
      BM_DumpHashMap/1          2759655     2752033
      BM_DumpHashMap/10         2933722     2930825
      BM_DumpHashMap/200        3171680     3170265
      BM_DumpHashMap/500        3639607     3635511
      BM_DumpHashMap/1000       4369008     4364981
      BM_DumpHashMap/5k        11171919    11134028
      BM_DumpHashMap/20k       69150080    69033496
      BM_DumpHashMap/39k      190501036   190226162

    After:
      Benchmark                Time(ns)     CPU(ns)
      ---------------------------------------------
      BM_DumpHashMap/1           202707      200109
      BM_DumpHashMap/10          213441      210569
      BM_DumpHashMap/200         478641      472350
      BM_DumpHashMap/500         980061      967102
      BM_DumpHashMap/1000       1863835     1839575
      BM_DumpHashMap/5k         8961836     8902540
      BM_DumpHashMap/20k       69761497    69322756
      BM_DumpHashMap/39k      187437830   186551111

    Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
    Signed-off-by: Brian Vazquez
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200218172552.215077-1-brianvv@google.com

    Brian Vazquez
     

16 Jan, 2020

1 commit

  • htab can't use generic batch support due to some problematic behaviours
    inherent to the data structure, i.e. while iterating the bpf map a
    concurrent program might delete the next entry that the batch was about to
    use; in that case there's no easy solution to retrieve the next entry.
    The issue has been discussed multiple times (see [1] and [2]).

    The only way hmap can be traversed without the problem previously
    exposed is by making sure that the map is traversed in whole buckets.
    This commit implements those strict requirements for hmap; the
    implementation follows the same interaction as the generic support, with
    some exceptions:

    - If the keys/values buffers are not big enough to traverse a bucket,
    ENOSPC will be returned.
    - out_batch contains the value of the next bucket in the iteration, not
    the next key, but this is transparent to the user since the user
    should never use out_batch for anything other than bpf batch syscalls.

    This commit implements BPF_MAP_LOOKUP_BATCH and adds support for the new
    command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
    batch ops it is possible to use the generic implementations.

    [1] https://lore.kernel.org/bpf/20190724165803.87470-1-brianvv@google.com/
    [2] https://lore.kernel.org/bpf/20190906225434.3635421-1-yhs@fb.com/
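
    A hedged user-space sketch of driving the batch lookup through libbpf
    (assumes 4-byte keys and 8-byte values; buffer sizing and error handling
    are trimmed). out_batch carries the opaque next-bucket cursor between
    calls, ENOSPC means the buffers could not hold a whole bucket, and ENOENT
    signals the end of the iteration:

    #include <errno.h>
    #include <bpf/bpf.h>

    static int dump_all(int map_fd)
    {
        __u32 keys[128], next_bucket, count;
        __u64 vals[128];
        void *in = NULL;                /* NULL: start from the first bucket */
        int err;

        do {
            count = 128;                /* in: buffer capacity, out: returned */
            err = bpf_map_lookup_batch(map_fd, in, &next_bucket,
                                       keys, vals, &count, NULL);
            if (err && errno != ENOENT)
                return -errno;          /* e.g. ENOSPC: grow keys/vals */
            /* ... consume 'count' key/value pairs here ... */
            in = &next_bucket;          /* resume from the next bucket */
        } while (!err);
        return 0;
    }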

    Signed-off-by: Yonghong Song
    Signed-off-by: Brian Vazquez
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200115184308.162644-6-brianvv@google.com

    Yonghong Song
     

18 Jun, 2019

1 commit


05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of version 2 of the gnu general public license as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 64 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 Jun, 2019

3 commits

  • Most bpf map types do similar checks and bytes-to-pages
    conversions during memory allocation and charging.

    Let's unify these checks by moving them into bpf_map_charge_init().

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     
  • In order to unify the existing memlock charging code with the
    memcg-based memory accounting, which will be added later, let's
    rework the current scheme.

    Currently the following design is used:
    1) .alloc() callback optionally checks if the allocation will likely
    succeed using bpf_map_precharge_memlock()
    2) .alloc() performs actual allocations
    3) .alloc() callback calculates map cost and sets map.memory.pages
    4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
    and performs actual charging; in case of failure the map is
    destroyed

    1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
    performs uncharge and releases the user
    2) .map_free() callback releases the memory

    The scheme can be simplified and made more robust:
    1) .alloc() calculates map cost and calls bpf_map_charge_init()
    2) bpf_map_charge_init() sets map.memory.user and performs actual
    charge
    3) .alloc() performs actual allocations

    1) .map_free() callback releases the memory
    2) bpf_map_charge_finish() performs uncharge and releases the user

    The new scheme also allows reusing the bpf_map_charge_init()/finish()
    functions for memcg-based accounting. Because charges are performed
    before actual allocations and uncharges after freeing the memory,
    no bogus memory pressure can be created.

    In cases when the map structure is not available (e.g. it's not
    created yet, or is already destroyed), on-stack bpf_map_memory
    structure is used. The charge can be transferred with the
    bpf_map_charge_move() function.
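
    A hedged sketch of the new ordering inside a map's .map_alloc() callback
    (condensed; my_map, my_map_alloc and elem_size are placeholders for a
    map type's own structures, not real kernel symbols):

    static struct bpf_map *my_map_alloc(union bpf_attr *attr)
    {
        const u32 elem_size = 64;        /* placeholder element size */
        struct bpf_map_memory mem;
        struct my_map *map;
        u64 cost;
        int err;

        /* calculate cost and charge before any allocation */
        cost = (u64)attr->max_entries * elem_size + sizeof(struct my_map);
        err = bpf_map_charge_init(&mem, cost);
        if (err)
            return ERR_PTR(err);

        /* actual allocations */
        map = my_map_area_alloc(attr);
        if (!map) {
            bpf_map_charge_finish(&mem);  /* uncharge on failure */
            return ERR_PTR(-ENOMEM);
        }
        /* hand the on-stack charge over to the map being created */
        bpf_map_charge_move(&map->map.memory, &mem);
        return &map->map;
    }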

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     
  • Group "user" and "pages" fields of bpf_map into the bpf_map_memory
    structure. Later it can be extended with "memcg" and other related
    information.

    The main reason for such a change (besides cosmetics) is to pass
    the bpf_map_memory structure to charging functions before the actual
    allocation of bpf_map.

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     

15 May, 2019

1 commit

  • One of the biggest issues we face right now with picking LRU map over
    regular hash table is that a map walk out of user space, for example,
    to just dump the existing entries or to remove certain ones, will
    completely mess up LRU eviction heuristics and wrong entries such
    as just created ones will get evicted instead. The reason for this
    is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
    system call lookup side as well. Thus upon walk, all entries are
    being marked, so information about the actual least recently used ones
    is "lost".

    In case of Cilium where it can be used (besides others) as a BPF
    based connection tracker, this current behavior causes disruption
    upon control plane changes that need to walk the map from user space
    to evict certain entries. Discussion result from bpfconf [0] was that
    we should simply just remove marking from system call side as no
    good use case could be found where it's actually needed there.
    Therefore this patch removes marking for regular LRU and per-CPU
    flavor. If there ever should be a need in future, the behavior could
    be selected via map creation flag, but due to mentioned reason we
    avoid this here.

    [0] http://vger.kernel.org/bpfconf.html

    Fixes: 29ba732acbee ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
    Fixes: 8f8449384ec3 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

10 Apr, 2019

1 commit

  • This work adds two new map creation flags BPF_F_RDONLY_PROG
    and BPF_F_WRONLY_PROG in order to allow for read-only or
    write-only BPF maps from a BPF program side.

    Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
    applies to system call side, meaning the BPF program has full
    read/write access to the map as usual while bpf(2) calls with
    map fd can either only read or write into the map depending
    on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allow
    for the exact opposite, such that the verifier is going to reject
    program loads if a write into a read-only map or a read from a
    write-only map is detected. For the read-only map case, helpers
    that would alter the map state, such as map deletion and update,
    are also forbidden. As opposed to the two
    BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
    as BPF_F_WRONLY_PROG really do correspond to the map lifetime.

    We've enabled this generic map extension to various non-special
    maps holding normal user data: array, hash, lru, lpm, local
    storage, queue and stack. Further generic map types could follow
    in the future depending on use case. The main use case
    here is to forbid writes into .rodata map values from the verifier
    side.
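
    A hedged sketch of a map that programs may only read (declarative libbpf
    style; names are illustrative): program-side writes are rejected at load
    time, while user space keeps full read/write access unless BPF_F_RDONLY /
    BPF_F_WRONLY are also used:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 64);
        __uint(map_flags, BPF_F_RDONLY_PROG);  /* read-only from BPF side */
        __type(key, __u32);
        __type(value, __u64);
    } config_map SEC(".maps");

    SEC("tc")
    int use_config(struct __sk_buff *skb)
    {
        __u32 key = 0;
        __u64 *cfg;

        cfg = bpf_map_lookup_elem(&config_map, &key);   /* reads are allowed */
        /* bpf_map_update_elem(&config_map, ...) or a store through 'cfg'
         * would be rejected by the verifier due to BPF_F_RDONLY_PROG. */
        return cfg && *cfg ? 1 : 0;
    }

    char LICENSE[] SEC("license") = "GPL";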

    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

09 Feb, 2019

1 commit


02 Feb, 2019

2 commits

  • Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands
    and for map_update() helper function.
    In all these cases take a lock of existing element (which was provided
    in BTF description) before copying (in or out) the rest of map value.

    Implementation details that are part of uapi:

    Array:
    The array map takes the element lock for lookup/update.

    Hash:
    hash map also takes the lock for lookup/update and tries to avoid the bucket lock.
    If an old element exists it takes the element lock and updates the element in place.
    If the element doesn't exist it allocates a new one and inserts it into the hash table
    while holding the bucket lock.
    In the rare case the hashmap has to take both the bucket lock and the element lock
    to update an old value in place.

    Cgroup local storage:
    It is similar to array: update in place and lookup are done with the lock taken.
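
    A hedged user-space sketch of the flag (the value layout is illustrative,
    and the struct bpf_spin_lock must be described in the map's BTF for
    BPF_F_LOCK to be accepted):

    #include <bpf/bpf.h>        /* also pulls in struct bpf_spin_lock */

    struct val {
        struct bpf_spin_lock lock;   /* never copied to/from user space */
        __u64 counter;
    };

    static void locked_access(int map_fd, __u32 key)
    {
        struct val v = { .counter = 42 };

        /* copy the value in while holding the element's bpf_spin_lock */
        bpf_map_update_elem(map_fd, &key, &v, BPF_F_LOCK);

        /* copy the value out while holding the element's bpf_spin_lock */
        bpf_map_lookup_elem_flags(map_fd, &key, &v, BPF_F_LOCK);
    }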

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
    bpf program serialize access to other variables.

    Example:
    struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
    };
    struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
    if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
    }

    Restrictions and safety checks:
    - bpf_spin_lock is only allowed inside HASH and ARRAY maps.
    - BTF description of the map is mandatory for safety analysis.
    - bpf program can take one bpf_spin_lock at a time, since two or more can
    cause deadlocks.
    - only one 'struct bpf_spin_lock' is allowed per map element.
    It drastically simplifies implementation yet allows bpf program to use
    any number of bpf_spin_locks.
    - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
    - bpf program must bpf_spin_unlock() before return.
    - bpf program can access 'struct bpf_spin_lock' only via
    bpf_spin_lock()/bpf_spin_unlock() helpers.
    - load/store into 'struct bpf_spin_lock lock;' field is not allowed.
    - to use bpf_spin_lock() helper the BTF description of map value must be
    a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
    Nested lock inside another struct is not allowed.
    - syscall map_lookup doesn't copy bpf_spin_lock field to user space.
    - syscall map_update and program map_update do not update bpf_spin_lock field.
    - bpf_spin_lock cannot be on the stack or inside networking packet.
    bpf_spin_lock can only be inside HASH or ARRAY map value.
    - bpf_spin_lock is available to root only and to all program types.
    - bpf_spin_lock is not allowed in inner maps of map-in-map.
    - ld_abs is not allowed inside spin_lock-ed region.
    - tracing progs and socket filter progs cannot use bpf_spin_lock due to
    insufficient preemption checks

    Implementation details:
    - cgroup-bpf class of programs can nest with xdp/tc programs.
    Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
    Other solutions to avoid nested bpf_spin_lock are possible.
    Like making sure that all networking progs run with softirq disabled.
    spin_lock_irqsave is the simplest and doesn't add overhead to the
    programs that don't use it.
    - arch_spinlock_t is used when its implemented as queued_spin_lock
    - archs can force their own arch_spinlock_t
    - on architectures where queued_spin_lock is not available and
    sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
    - presence of bpf_spin_lock inside map value could have been indicated via
    extra flag during map_create, but specifying it via BTF is cleaner.
    It provides introspection for map key/value and reduces user mistakes.

    Next steps:
    - allow bpf_spin_lock in other map types (like cgroup local storage)
    - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
    to request kernel to grab bpf_spin_lock before rewriting the value.
    That will serialize access to map elements.

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

01 Feb, 2019

1 commit

  • Lockdep warns about false positive:
    [ 12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
    [ 12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [ 12.493275] (&rq->lock){-.-.}
    [ 12.493276]
    [ 12.493276]
    [ 12.493276] and interrupts could create inverse lock ordering between them.
    [ 12.493276]
    [ 12.494435]
    [ 12.494435] other info that might help us debug this:
    [ 12.494979] Possible interrupt unsafe locking scenario:
    [ 12.494979]
    [ 12.495518]        CPU0                    CPU1
    [ 12.495879]        ----                    ----
    [ 12.496243]   lock(&head->lock);
    [ 12.496502]                                local_irq_disable();
    [ 12.496969]                                lock(&rq->lock);
    [ 12.497431]                                lock(&head->lock);
    [ 12.497890]   <Interrupt>
    [ 12.498104]     lock(&rq->lock);
    [ 12.498368]
    [ 12.498368] *** DEADLOCK ***
    [ 12.498368]
    [ 12.498837] 1 lock held by dd/276:
    [ 12.499110] #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
    [ 12.499747]
    [ 12.499747] the shortest dependencies between 2nd lock and 1st lock:
    [ 12.500389] -> (&rq->lock){-.-.} {
    [ 12.500669] IN-HARDIRQ-W at:
    [ 12.500934] _raw_spin_lock+0x2f/0x40
    [ 12.501373] scheduler_tick+0x4c/0xf0
    [ 12.501812] update_process_times+0x40/0x50
    [ 12.502294] tick_periodic+0x27/0xb0
    [ 12.502723] tick_handle_periodic+0x1f/0x60
    [ 12.503203] timer_interrupt+0x11/0x20
    [ 12.503651] __handle_irq_event_percpu+0x43/0x2c0
    [ 12.504167] handle_irq_event_percpu+0x20/0x50
    [ 12.504674] handle_irq_event+0x37/0x60
    [ 12.505139] handle_level_irq+0xa7/0x120
    [ 12.505601] handle_irq+0xa1/0x150
    [ 12.506018] do_IRQ+0x77/0x140
    [ 12.506411] ret_from_intr+0x0/0x1d
    [ 12.506834] _raw_spin_unlock_irqrestore+0x53/0x60
    [ 12.507362] __setup_irq+0x481/0x730
    [ 12.507789] setup_irq+0x49/0x80
    [ 12.508195] hpet_time_init+0x21/0x32
    [ 12.508644] x86_late_time_init+0xb/0x16
    [ 12.509106] start_kernel+0x390/0x42a
    [ 12.509554] secondary_startup_64+0xa4/0xb0
    [ 12.510034] IN-SOFTIRQ-W at:
    [ 12.510305] _raw_spin_lock+0x2f/0x40
    [ 12.510772] try_to_wake_up+0x1c7/0x4e0
    [ 12.511220] swake_up_locked+0x20/0x40
    [ 12.511657] swake_up_one+0x1a/0x30
    [ 12.512070] rcu_process_callbacks+0xc5/0x650
    [ 12.512553] __do_softirq+0xe6/0x47b
    [ 12.512978] irq_exit+0xc3/0xd0
    [ 12.513372] smp_apic_timer_interrupt+0xa9/0x250
    [ 12.513876] apic_timer_interrupt+0xf/0x20
    [ 12.514343] default_idle+0x1c/0x170
    [ 12.514765] do_idle+0x199/0x240
    [ 12.515159] cpu_startup_entry+0x19/0x20
    [ 12.515614] start_kernel+0x422/0x42a
    [ 12.516045] secondary_startup_64+0xa4/0xb0
    [ 12.516521] INITIAL USE at:
    [ 12.516774] _raw_spin_lock_irqsave+0x38/0x50
    [ 12.517258] rq_attach_root+0x16/0xd0
    [ 12.517685] sched_init+0x2f2/0x3eb
    [ 12.518096] start_kernel+0x1fb/0x42a
    [ 12.518525] secondary_startup_64+0xa4/0xb0
    [ 12.518986] }
    [ 12.519132] ... key at: [] __key.71384+0x0/0x8
    [ 12.519649] ... acquired at:
    [ 12.519892] pcpu_freelist_pop+0x7b/0xd0
    [ 12.520221] bpf_get_stackid+0x1d2/0x4d0
    [ 12.520563] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.520887]
    [ 12.521008] -> (&head->lock){+...} {
    [ 12.521292] HARDIRQ-ON-W at:
    [ 12.521539] _raw_spin_lock+0x2f/0x40
    [ 12.521950] pcpu_freelist_push+0x2a/0x40
    [ 12.522396] bpf_get_stackid+0x494/0x4d0
    [ 12.522828] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.523296] INITIAL USE at:
    [ 12.523537] _raw_spin_lock+0x2f/0x40
    [ 12.523944] pcpu_freelist_populate+0xc0/0x120
    [ 12.524417] htab_map_alloc+0x405/0x500
    [ 12.524835] __do_sys_bpf+0x1a3/0x1a90
    [ 12.525253] do_syscall_64+0x4a/0x180
    [ 12.525659] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 12.526167] }
    [ 12.526311] ... key at: [] __key.13130+0x0/0x8
    [ 12.526812] ... acquired at:
    [ 12.527047] __lock_acquire+0x521/0x1350
    [ 12.527371] lock_acquire+0x98/0x190
    [ 12.527680] _raw_spin_lock+0x2f/0x40
    [ 12.527994] pcpu_freelist_push+0x2a/0x40
    [ 12.528325] bpf_get_stackid+0x494/0x4d0
    [ 12.528645] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.528970]
    [ 12.529092]
    [ 12.529092] stack backtrace:
    [ 12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
    [ 12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
    [ 12.530750] Call Trace:
    [ 12.530948] dump_stack+0x5f/0x8b
    [ 12.531248] check_usage_backwards+0x10c/0x120
    [ 12.531598] ? ___bpf_prog_run+0x8b4/0x11a0
    [ 12.531935] ? mark_lock+0x382/0x560
    [ 12.532229] mark_lock+0x382/0x560
    [ 12.532496] ? print_shortest_lock_dependencies+0x180/0x180
    [ 12.532928] __lock_acquire+0x521/0x1350
    [ 12.533271] ? find_get_entry+0x17f/0x2e0
    [ 12.533586] ? find_get_entry+0x19c/0x2e0
    [ 12.533902] ? lock_acquire+0x98/0x190
    [ 12.534196] lock_acquire+0x98/0x190
    [ 12.534482] ? pcpu_freelist_push+0x2a/0x40
    [ 12.534810] _raw_spin_lock+0x2f/0x40
    [ 12.535099] ? pcpu_freelist_push+0x2a/0x40
    [ 12.535432] pcpu_freelist_push+0x2a/0x40
    [ 12.535750] bpf_get_stackid+0x494/0x4d0
    [ 12.536062] ___bpf_prog_run+0x8b4/0x11a0

    It has been explained that this is a false positive here:
    https://lkml.org/lkml/2018/7/25/756
    Recap:
    - stackmap uses pcpu_freelist
    - The lock in pcpu_freelist is a percpu lock
    - stackmap is only used by tracing bpf_prog
    - A tracing bpf_prog cannot be run if another bpf_prog
    has already been running (ensured by the percpu bpf_prog_active counter).

    Eric pointed out that this lockdep splat stops other
    legit lockdep splats in selftests/bpf/test_progs.c.

    Fix this by calling local_irq_save/restore for stackmap.

    Another false positive had also been worked around by calling
    local_irq_save in commit 89ad2fa3f043 ("bpf: fix lockdep splat").
    That commit added unnecessary irq_save/restore to fast path of
    bpf hash map. irqs are already disabled at that point, since htab
    is holding per bucket spin_lock with irqsave.

    Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop
    functions w/o irqsave and converting pcpu_freelist_push/pop to irqsave
    to be used elsewhere (right now only in stackmap).
    It stops the lockdep false positive in stackmap with a bit of acceptable overhead.

    Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
    Reported-by: Naresh Kamboju
    Reported-by: Eric Dumazet
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

20 Nov, 2018

1 commit

  • Add a new flag BPF_F_ZERO_SEED, which forces a hash map
    to initialize the seed to zero. This is useful when doing
    performance analysis of both individual BPF programs and
    the kernel's hash table implementation.
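
    A hedged sketch of requesting the flag in a declarative libbpf map
    definition (names illustrative); with a zero seed, jhash() bucket
    placement is deterministic across runs, which makes benchmark results
    comparable:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 4096);
        __uint(map_flags, BPF_F_ZERO_SEED);   /* hashtab seed forced to 0 */
        __type(key, __u32);
        __type(value, __u64);
    } bench_map SEC(".maps");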

    Signed-off-by: Lorenz Bauer
    Signed-off-by: Daniel Borkmann

    Lorenz Bauer
     

30 Aug, 2018

1 commit

  • Added bpffs pretty print for percpu arraymap, percpu hashmap
    and percpu lru hashmap.

    For each <key, value> pair, the format is:
    <key_value>: {
        cpu0: <value_on_cpu0>
        cpu1: <value_on_cpu1>
        ...
        cpun: <value_on_cpun>
    }

    For example, on my VM, there are 4 cpus, and
    for test_btf test in the next patch:
    cat /sys/fs/bpf/pprint_test_percpu_hash

    You may get:
    ...
    43602: {
    cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
    cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
    cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
    cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
    }
    72847: {
    cpu0: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
    cpu1: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
    cpu2: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
    cpu3: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
    }
    ...

    Signed-off-by: Yonghong Song
    Signed-off-by: Daniel Borkmann

    Yonghong Song
     

24 Aug, 2018

1 commit

  • All BPF hash and LRU maps currently have a known and global seed
    we feed into jhash() which is 0. This is suboptimal, thus fix it
    by generating a random seed upon hashtab setup time which we can
    later on feed into jhash() on lookup, update and deletions.

    Fixes: 0f8e4bd8a1fc8 ("bpf: add hashtable type of eBPF maps")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: Song Liu
    Reviewed-by: Eduardo Valentin

    Daniel Borkmann
     

13 Aug, 2018

1 commit

  • Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
    the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
    print for hash/lru_hash maps") enabled support for BTF and
    dumping via BPF fs for array and hash/lru map. However, both
    can be decoupled from each other such that regular BPF maps
    can be supported for attaching BTF key/value information,
    while not all maps necessarily need to dump via map_seq_show_elem()
    callback.

    The basic sanity check which is a prerequisite for all maps
    is that key/value size has to match in any case, and some maps
    can have extra checks via map_check_btf() callback, e.g.
    probing certain types or indicating no support in general. With
    that we can also enable retrieving BTF info for per-cpu map
    types and lpm.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: Yonghong Song

    Daniel Borkmann
     

11 Aug, 2018

1 commit

  • Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
    the basic arraymap") added pretty print support to array map.
    This patch adds pretty print for hash and lru_hash maps.
    The following example shows the pretty-print result of
    a pinned hashmap:

    struct map_value {
    int count_a;
    int count_b;
    };

    cat /sys/fs/bpf/pinned_hash_map:

    87907: {87907,87908}
    57354: {37354,57355}
    76625: {76625,76626}
    ...

    Signed-off-by: Yonghong Song
    Signed-off-by: Daniel Borkmann

    Yonghong Song
     

04 Jul, 2018

1 commit