12 Oct, 2020

1 commit

  • Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
    ("bpf: Relax max_entries check for most of the inner map types") added support
    for dynamic inner max elements for most map-in-map types. Exceptions were maps
    like array or prog array where the map_gen_lookup() callback uses the maps'
    max_entries field as a constant when emitting instructions.

    We recently implemented Maglev consistent hashing in Cilium's load balancer,
    which uses map-in-map with an outer hash map and inner array maps holding the
    Maglev backend table for each service. It has been designed this way in order
    to reduce overall memory consumption, given the outer hash map allows us to
    avoid preallocating a large, flat memory area for all services. Also, the
    number of service mappings is not always known a priori.

    The use case for dynamic inner array map entries is to further reduce memory
    overhead: for example, some services might just have a small number of backends
    while others could have a large number. Right now the Maglev backend table for
    small and large numbers of backends would need to have the same inner array
    map entries, which adds a lot of unneeded overhead.

    Dynamic inner array map entries can be realized by avoiding the inlined code
    generation for their lookup. The lookup will still be efficient since it will
    be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
    The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
    inline code generation and relaxes the array_map_meta_equal() check to ignore
    both maps' max_entries. This still allows faster lookups for map-in-map when
    BPF_F_INNER_MAP is not specified and dynamic max_entries is hence not needed.
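
    As a hedged illustration (not part of the patch itself; map names and sizes
    are made up, reusing outer_arr_dyn from the dump below), a libbpf BTF-defined
    map-in-map could opt into this by setting BPF_F_INNER_MAP on the inner array
    template, so inner arrays with differing max_entries can be inserted at runtime:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Inner array template: BPF_F_INNER_MAP skips the inlined lookup and
     * relaxes the max_entries comparison against inserted inner maps. */
    struct inner_arr {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(map_flags, BPF_F_INNER_MAP);
        __uint(max_entries, 1);             /* template size only */
        __type(key, __u32);
        __type(value, __u32);
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
        __uint(max_entries, 256);
        __type(key, __u32);
        __array(values, struct inner_arr);
    } outer_arr_dyn SEC(".maps");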

    Example code generation where inner map is dynamic sized array:

    # bpftool p d x i 125
    int handle__sys_enter(void * ctx):
    ; int handle__sys_enter(void *ctx)
    0: (b4) w1 = 0
    ; int key = 0;
    1: (63) *(u32 *)(r10 -4) = r1
    2: (bf) r2 = r10
    ;
    3: (07) r2 += -4
    ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
    4: (18) r1 = map[id:468]
    6: (07) r1 += 272
    7: (61) r0 = *(u32 *)(r2 +0)
    8: (35) if r0 >= 0x3 goto pc+5
    9: (67) r0 <
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net

    Daniel Borkmann
     

01 Oct, 2020

1 commit

  • Currently, a perf event in a perf event array is removed from the array when
    the map fd used to add the event is closed. This behavior makes it
    difficult to share perf events with a perf event array.

    Introduce perf event map that keeps the perf event open with a new flag
    BPF_F_PRESERVE_ELEMS. With this flag set, perf events in the array are not
    removed when the original map fd is closed. Instead, the perf event will
    stay in the map until 1) it is explicitly removed from the array; or 2)
    the array is freed.
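
    As a rough sketch (map name and sizing are illustrative), a BTF-defined perf
    event array requesting this behavior could look like:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* BPF_F_PRESERVE_ELEMS keeps perf events in the array after the map fd
     * used to populate it has been closed. */
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(map_flags, BPF_F_PRESERVE_ELEMS);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
        __uint(max_entries, 64);
    } shared_events SEC(".maps");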

    Signed-off-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200930224927.1936644-2-songliubraving@fb.com

    Song Liu
     

18 Sep, 2020

2 commits

  • This commit serves two things:
    1) it optimizes BPF prologue/epilogue generation
    2) it makes it possible to have tailcalls within BPF subprograms

    Both points are related to each other since without 1), 2) could not be
    achieved.

    In [1], Alexei says:
    "The prologue will look like:
    nop5
    xor eax,eax  // two new bytes if bpf_tail_call() is used in this
    // function
    push rbp
    mov rbp, rsp
    sub rsp, rounded_stack_depth
    push rax // zero init tail_call counter
    variable number of push rbx,r13,r14,r15

    Then bpf_tail_call will pop variable number rbx,..
    and final 'pop rax'
    Then 'add rsp, size_of_current_stack_frame'
    jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
    rbp, rsp'

    This way new function will set its own stack size and will init tail
    call
    counter with whatever value the parent had.

    If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
    Instead it would need to have 'nop2' in there."

    Implement that suggestion.

    Since the layout of the stack is changed, tail call counter handling can no
    longer rely on popping it to rbx just like it has been handled for the
    constant prologue case with the later overwrite of rbx with the actual value
    of rbx pushed to the stack. Therefore, let's use one of the registers (%rcx)
    that is considered to be volatile/caller-saved and pop the value of the tail
    call counter into it in the epilogue.

    Drop the BUILD_BUG_ON in emit_prologue and in
    emit_bpf_tail_call_indirect where instruction layout is not constant
    anymore.

    Introduce a new poke target, 'tailcall_bypass', to the poke descriptor that is
    dedicated to skipping the register pops and stack unwind that are
    generated right before the actual jump to the target program.
    For the case when the target program is not present, the BPF program will skip
    the pop instructions and the nop5 dedicated for the jmpq $target. An example of
    such a state when only R6 of the callee saved registers is used by the program:

    ffffffffc0513aa1: e9 0e 00 00 00 jmpq 0xffffffffc0513ab4
    ffffffffc0513aa6: 5b pop %rbx
    ffffffffc0513aa7: 58 pop %rax
    ffffffffc0513aa8: 48 81 c4 00 00 00 00 add $0x0,%rsp
    ffffffffc0513aaf: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
    ffffffffc0513ab4: 48 89 df mov %rbx,%rdi

    When the target program is inserted, the jump that was there to skip
    pops/nop5 will become the nop5, so the CPU will go over the pops and do the
    actual tailcall.

    One might ask why there simply cannot be pushes after the nop5?
    In the following example snippet:

    ffffffffc037030c: 48 89 fb mov %rdi,%rbx
    (...)
    ffffffffc0370332: 5b pop %rbx
    ffffffffc0370333: 58 pop %rax
    ffffffffc0370334: 48 81 c4 00 00 00 00 add $0x0,%rsp
    ffffffffc037033b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
    ffffffffc0370340: 48 81 ec 00 00 00 00 sub $0x0,%rsp
    ffffffffc0370347: 50 push %rax
    ffffffffc0370348: 53 push %rbx
    ffffffffc0370349: 48 89 df mov %rbx,%rdi
    ffffffffc037034c: e8 f7 21 00 00 callq 0xffffffffc0372548

    There is the bpf2bpf call (at ffffffffc037034c) right after the tailcall
    and the jump target is not present. ctx is in the %rbx register and the BPF
    subprogram that we will call into at ffffffffc037034c is relying on it,
    e.g. it will pick ctx from there. Such a code layout is therefore broken
    as we would overwrite the content of %rbx with the value that was pushed
    in the prologue. That is the reason for the 'bypass' approach.

    Special care needs to be taken during the install/update/remove of
    tailcall target. In case when target program is not present, the CPU
    must not execute the pop instructions that precede the tailcall.

    To address that, the following states can be defined:
    A nop, unwind, nop
    B nop, unwind, tail
    C skip, unwind, nop
    D skip, unwind, tail

    A is forbidden (it leads to incorrectness). The state transitions between
    tailcall install/update/remove will work as follows:

    First install tail call f: C->D->B(f)
    * poke the tailcall, after that get rid of the skip
    Update tail call f to f': B(f)->B(f')
    * poke the tailcall (poke->tailcall_target) and do NOT touch the
    poke->tailcall_bypass
    Remove tail call: B(f')->C(f')
    * poke->tailcall_bypass is poked back to jump, then we wait for an RCU
    grace period so that other programs will finish their execution and
    after that we are safe to remove the poke->tailcall_target
    Install new tail call (f''): C(f')->D(f'')->B(f'').
    * same as first step

    This way CPU can never be exposed to "unwind, tail" state.

    Last but not least, when tailcalls get mixed with bpf2bpf calls, it
    would be possible to encounter an endless loop due to clearing the
    tailcall counter if, for example, we would use a subprogram-based variant
    of the tailcall3-like program from the BPF selftests, meaning the tailcall
    would be present within the BPF subprogram.

    This test, broken down to particular steps, would do:
    entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
    func0 -> call subprog_tail
    (we are NOT skipping the first 11 bytes of prologue and this subprogram
    has a tailcall, therefore we clear the counter...)
    subprog -> do the same thing as entry

    and then loop forever.
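
    Sketched in C (a hedged approximation of the subprogram-based selftest
    shape; section, function and map names are illustrative), the problematic
    pattern is:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 1);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
    } jmp_table SEC(".maps");

    static __noinline int subprog_tail(struct __sk_buff *skb)
    {
        /* tail call issued from within a BPF subprogram; without tail call
         * counter propagation this would reset the counter each time */
        bpf_tail_call(skb, &jmp_table, 0);
        return 0;
    }

    SEC("classifier")
    int entry(struct __sk_buff *skb)
    {
        return subprog_tail(skb);
    }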

    To address this, the idea is to go through the call chain of bpf2bpf progs
    and look for a tailcall presence throughout whole chain. If we saw a single
    tail call then each node in this call chain needs to be marked as a subprog
    that can reach the tailcall. We would later feed the JIT with this info
    and:
    - set eax to 0 only when tailcall is reachable and this is the entry prog
    - if tailcall is reachable but there's no tailcall in insns of currently
    JITed prog then push rax anyway, so that it will be possible to
    propagate further down the call chain
    - finally if tailcall is reachable, then we need to precede the 'call'
    insn with mov rax, [rbp - (stack_depth + 8)]

    Tail call related cases from test_verifier kselftest are also working
    fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
    work properly as well.

    [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Maciej Fijalkowski
    Signed-off-by: Alexei Starovoitov

    Maciej Fijalkowski
     
  • Reflect the actual purpose of poke->ip and rename it to
    poke->tailcall_target so that it will not be confused with another
    poke target that will be introduced in the next commit.

    While at it, do the same thing with poke->ip_stable - rename it to
    poke->tailcall_target_stable.

    Signed-off-by: Maciej Fijalkowski
    Signed-off-by: Alexei Starovoitov

    Maciej Fijalkowski
     

29 Aug, 2020

1 commit

  • Introduce sleepable BPF programs that can request such property for themselves
    via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
    to use helpers like bpf_copy_from_user() that might sleep. At present only
    fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
    when they are attached to kernel functions that are known to allow sleeping.
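
    A hedged sketch of what such a program can look like with libbpf (the attach
    point and names are illustrative; the ".s" section suffix is what requests
    BPF_F_SLEEPABLE at load time):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char buf[16];
    char LICENSE[] SEC("license") = "GPL";

    SEC("fentry.s/do_sys_open")     /* ".s" marks the program sleepable;
                                     * attach point is illustrative */
    int BPF_PROG(sleepable_open, int dfd, const char *filename)
    {
        /* may fault in user memory and sleep, hence sleepable-only helper */
        bpf_copy_from_user(buf, sizeof(buf), filename);
        return 0;
    }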

    The non-sleepable programs rely on implicit rcu_read_lock() and
    migrate_disable() to protect the lifetime of programs, maps that they use and
    per-cpu kernel structures used to pass info between bpf programs and the
    kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
    migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
    should not be enclosed in migrate_disable() as well. Therefore
    rcu_read_lock_trace is used to protect the lifetime of sleepable progs.

    There are many networking and tracing program types. In many cases the
    'struct bpf_prog *' pointer itself is rcu protected within some other kernel
    data structure and the kernel code is using rcu_dereference() to load that
    program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
    Instead sleepable bpf programs are allowed with bpf trampoline only. The
    program pointers are hard-coded into the generated assembly of the bpf
    trampoline and synchronize_rcu_tasks_trace() is used to protect the lifetime
    of the program.
    The same trampoline can hold both sleepable and non-sleepable progs.

    When rcu_read_lock_trace is held it means that some sleepable bpf program is
    running from bpf trampoline. Those programs can use bpf arrays and preallocated
    hash/lru maps. These map types are waiting on programs to complete via
    synchronize_rcu_tasks_trace();

    Updates to the trampoline now have to do synchronize_rcu_tasks_trace() and
    synchronize_rcu_tasks() to wait for sleepable progs to finish and for
    trampoline assembly to finish.

    This is the first step of introducing sleepable progs. Eventually dynamically
    allocated hash maps can be allowed and networking program types can become
    sleepable too.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Josef Bacik
    Acked-by: Andrii Nakryiko
    Acked-by: KP Singh
    Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

28 Aug, 2020

2 commits

  • Most of the maps do not use max_entries during verification time.
    Thus, those map_meta_equal() do not need to enforce max_entries
    when a map is inserted as an inner map during runtime. The max_entries
    check is removed from the default implementation bpf_map_meta_equal().

    The prog_array_map and xsk_map are exceptions. Their map_gen_lookup
    uses max_entries to generate inline lookup code. Thus, they will
    implement their own map_meta_equal() to enforce max_entries.
    Since there are only two cases now, the max_entries check
    is not refactored and stays in its own .c file.
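
    Roughly speaking (paraphrased, not a verbatim copy of the patch), the relaxed
    default check then compares everything but max_entries:

    /* Rough shape of the relaxed default: inner maps must still match the
     * verification-time template in type, key/value size and flags. */
    bool bpf_map_meta_equal(const struct bpf_map *meta0,
                            const struct bpf_map *meta1)
    {
        return meta0->map_type == meta1->map_type &&
               meta0->key_size == meta1->key_size &&
               meta0->value_size == meta1->value_size &&
               meta0->map_flags == meta1->map_flags;
    }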

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200828011813.1970516-1-kafai@fb.com

    Martin KaFai Lau
     
  • Some properties of the inner map are used at verification time.
    When an inner map is inserted into an outer map at runtime,
    bpf_map_meta_equal() is currently used to ensure those properties
    of the inserted inner map stay the same as at verification
    time.

    In particular, the current bpf_map_meta_equal() checks max_entries, which
    turns out to be too restrictive for most of the maps which do not use
    max_entries during verification time. It limits the use case that
    wants to replace a smaller inner map with a larger inner map. There are
    some maps that do use max_entries during verification though. For example,
    the map_gen_lookup in array_map_ops uses max_entries to generate
    the inline lookup code.

    To accommodate differences between maps, the map_meta_equal is added
    to bpf_map_ops. Each map-type can decide what to check when its
    map is used as an inner map during runtime.

    Also, some map types cannot be used as an inner map and they are
    currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.
    It is not unusual that new map types may not be aware that such a
    blacklist exists. This patch enforces an explicit opt-in
    and only allows a map to be used as an inner map if it has
    implemented the map_meta_equal ops. It is based on the
    discussion in [1].

    All maps that support being used as an inner map have their map_meta_equal
    pointing to bpf_map_meta_equal in this patch. A later patch will
    relax the max_entries check for most maps. bpf_types.h
    counts 28 map types. This patch adds 23 ".map_meta_equal"
    by using coccinelle. -5 for
    BPF_MAP_TYPE_PROG_ARRAY
    BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
    BPF_MAP_TYPE_STRUCT_OPS
    BPF_MAP_TYPE_ARRAY_OF_MAPS
    BPF_MAP_TYPE_HASH_OF_MAPS

    The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
    is moved such that the same error is returned.

    [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com

    Martin KaFai Lau
     

26 Jul, 2020

1 commit

  • The bpf iterators for array and percpu array
    are implemented. Similar to hash maps, for the percpu
    array map, the bpf program will receive values
    from all cpus.
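
    A hedged sketch of such an iterator program (helper macro availability
    depends on the libbpf version; BPF_SEQ_PRINTF is assumed here to come from
    bpf_tracing.h or the selftests' helper header):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    SEC("iter/bpf_map_elem")
    int dump_bpf_array(struct bpf_iter__bpf_map_elem *ctx)
    {
        struct seq_file *seq = ctx->meta->seq;
        __u32 *key = ctx->key;
        __u64 *val = ctx->value;

        /* key/value are NULL on the final call of the iteration */
        if (!key || !val)
            return 0;

        BPF_SEQ_PRINTF(seq, "%u: %llu\n", *key, *val);
        return 0;
    }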

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200723184115.590532-1-yhs@fb.com

    Yonghong Song
     

01 Jul, 2020

1 commit

  • bpf_free_used_maps() or close(map_fd) will trigger map_free callback.
    bpf_free_used_maps() is called after bpf prog is no longer executing:
    bpf_prog_put->call_rcu->bpf_prog_free->bpf_free_used_maps.
    Hence there is no need to call synchronize_rcu() to protect map elements.

    Note that hash_of_maps and array_of_maps update/delete inner maps via
    sys_bpf() that calls maybe_wait_bpf_programs() and synchronize_rcu().

    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Paul E. McKenney
    Link: https://lore.kernel.org/bpf/20200630043343.53195-2-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

23 Jun, 2020

2 commits

  • Set map_btf_name and map_btf_id for all map types so that map fields can
    be accessed by bpf programs.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com

    Andrey Ignatov
     
  • There are multiple use-cases when it's convenient to have access to bpf
    map fields, both `struct bpf_map` and map type specific struct-s such as
    `struct bpf_array`, `struct bpf_htab`, etc.

    For example while working with sock arrays it can be necessary to
    calculate the key based on map->max_entries (some_hash % max_entries).
    Currently this is solved by communicating max_entries via an "out-of-band"
    channel, e.g. via an additional map with a known key to get info about the
    target map. That works, but is not very convenient and is error-prone when
    working with many maps.

    In other cases necessary data is dynamic (i.e. unknown at loading time)
    and it's impossible to get it at all. For example while working with a
    hash table it can be convenient to know how much capacity is already
    used (bpf_htab.count.counter for BPF_F_NO_PREALLOC case).

    At the same time kernel knows this info and can provide it to bpf
    program.

    Fill this gap by adding support to access bpf map fields from bpf
    program for both `struct bpf_map` and map type specific fields.

    Support is implemented via btf_struct_access() so that a user can define
    their own `struct bpf_map` or map type specific struct in their program
    with only necessary fields and preserve_access_index attribute, cast a
    map to this struct and use a field.

    For example:

    struct bpf_map {
    __u32 max_entries;
    } __attribute__((preserve_access_index));

    struct bpf_array {
    struct bpf_map map;
    __u32 elem_size;
    } __attribute__((preserve_access_index));

    struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 4);
    __type(key, __u32);
    __type(value, __u32);
    } m_array SEC(".maps");

    SEC("cgroup_skb/egress")
    int cg_skb(void *ctx)
    {
    struct bpf_array *array = (struct bpf_array *)&m_array;
    struct bpf_map *map = (struct bpf_map *)&m_array;

    /* .. use map->max_entries or array->map.max_entries .. */
    }

    Similarly to other btf_struct_access() use-cases (e.g. struct tcp_sock
    in net/ipv4/bpf_tcp_ca.c) the patch allows access to any fields of
    corresponding struct. Only reading from map fields is supported.

    For btf_struct_access() to work there should be a way to know btf id of
    a struct that corresponds to a map type. To get btf id there should be a
    way to get a stringified name of map-specific struct, such as
    "bpf_array", "bpf_htab", etc for a map type. Two new fields are added to
    `struct bpf_map_ops` to handle it:
    * .map_btf_name keeps a btf name of a struct returned by map_alloc();
    * .map_btf_id is used to cache btf id of that struct.

    To make btf ids calculation cheaper they're calculated once while
    preparing btf_vmlinux and cached the same way as it's done for the btf_id
    field of `struct bpf_func_proto`.

    While calculating btf ids, struct names are NOT checked for collision.
    Collisions will be checked as a part of the work to prepare btf ids used
    in verifier in compile time that should land soon. The only known
    collision for `struct bpf_htab` (kernel/bpf/hashtab.c vs
    net/core/sock_map.c) was fixed earlier.

    Both new fields .map_btf_name and .map_btf_id must be set for a map type
    for the feature to work. If neither is set for a map type, the verifier will
    return ENOTSUPP on an attempt to access map_ptr of the corresponding type. If
    just one of them is set, it's a verifier misconfiguration.

    Only `struct bpf_array` for BPF_MAP_TYPE_ARRAY and `struct bpf_htab` for
    BPF_MAP_TYPE_HASH are supported by this patch. Other map types will be
    supported separately.

    The feature is available only for CONFIG_DEBUG_INFO_BTF=y and gated by
    perfmon_capable() so that unpriv programs won't have access to bpf map
    fields.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/6479686a0cd1e9067993df57b4c3eef0e276fec9.1592600985.git.rdna@fb.com

    Andrey Ignatov
     

15 May, 2020

2 commits

  • Implement permissions as stated in uapi/linux/capability.h
    In order to do that the verifier allow_ptr_leaks flag is split
    into four flags and they are set as:
    env->allow_ptr_leaks = bpf_allow_ptr_leaks();
    env->bypass_spec_v1 = bpf_bypass_spec_v1();
    env->bypass_spec_v4 = bpf_bypass_spec_v4();
    env->bpf_capable = bpf_capable();

    The first three are currently equivalent to perfmon_capable(), since leaking
    kernel pointers and reading kernel memory via side channel attacks is roughly
    equivalent to reading kernel memory with cap_perfmon.

    'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
    other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions,
    subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
    verifier, run time mitigations in bpf array, and enables indirect variable
    access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
    by the verifier.

    That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
    will have speculative checks done by the verifier and other spectre mitigations
    applied. Such a networking BPF program will not be able to leak kernel pointers
    and will not be able to access arbitrary kernel memory.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     
  • The mmap() subsystem allows a user-space application to memory-map a region
    with an initial page offset. This wasn't taken into account in the initial
    implementation of BPF array memory-mapping. This would result in the wrong
    pages, not taking into account the requested page shift, being memory-mapped
    into user-space. This patch fixes this gap and adds a test for such a scenario.
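
    A minimal userspace sketch of the affected case (assuming a BPF_F_MMAPABLE
    array whose value area spans multiple pages; function name is illustrative):

    #include <sys/mman.h>
    #include <unistd.h>

    /* Map only the second page of the array's value area; before the fix
     * the requested page offset was ignored and the wrong pages were mapped. */
    void *map_second_page(int map_fd)
    {
        long page = sysconf(_SC_PAGE_SIZE);

        return mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED,
                    map_fd, page /* byte offset, must be page-aligned */);
    }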

    Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200512235925.3817805-1-andriin@fb.com

    Andrii Nakryiko
     

16 Jan, 2020

1 commit

  • This adds the generic batch ops functionality to the bpf arraymap; note that
    since deletion is not a valid operation for an arraymap, only the batch lookup
    and update operations are added.
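
    A hedged userspace sketch using libbpf's batch wrappers from the same series
    (error handling trimmed; value type and function name are illustrative):

    #include <bpf/bpf.h>

    int dump_array(int map_fd, __u32 *keys, __u64 *values, __u32 *count)
    {
        DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts);
        __u32 batch;

        /* in_batch == NULL starts from the beginning; *count is updated to
         * the number of elements actually returned */
        return bpf_map_lookup_batch(map_fd, NULL, &batch, keys, values,
                                    count, &opts);
    }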

    Signed-off-by: Brian Vazquez
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200115184308.162644-5-brianvv@google.com

    Brian Vazquez
     

25 Nov, 2019

3 commits

  • Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
    and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
    old_addr as well as new_addr, it's a bit redundant and unnecessarily
    complicates __bpf_arch_text_poke() itself since we can derive the same from
    the *_addr that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP}
    as types, which also allows us to clean up the call-sites.

    In addition to that, __bpf_arch_text_poke() currently verifies that the text
    matches the expected old_insn before we invoke text_poke_bp(). Also add a check
    on new_insn and skip the rewrite if it already matches. The reason why this is
    rather useful is that it avoids any special casing in prog_array_map_poke_run()
    when the old and new prog were NULL, and it has the benefit that also for this
    case we perform a check on the text as to whether it really matches our
    expectations.

    Suggested-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • This work adds program tracking to prog array maps. This is needed such
    that upon prog array updates/deletions we can fix up all programs which
    make use of this tail call map. We add ops->map_poke_{un,}track()
    helpers to maps to maintain the list of programs and ops->map_poke_run()
    for triggering the actual update.

    bpf_array_aux is extended to contain the list head and poke_mutex in
    order to serialize program patching during updates/deletions.
    bpf_free_used_maps() will untrack the program shortly before dropping
    the reference to the map. For clearing out the prog array once all urefs
    are dropped we need to use schedule_work() to have a sleepable context.

    The prog_array_map_poke_run() is triggered during updates/deletions and
    walks the maintained prog list. It checks in their poke_tabs whether the
    map and key is matching and runs the actual bpf_arch_text_poke() for
    patching in the nop or new jmp location. Depending on the type of update,
    we use one of BPF_MOD_{NOP_TO_JUMP,JUMP_TO_NOP,JUMP_TO_JUMP}.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/1fb364bb3c565b3e415d5ea348f036ff379e779d.1574452833.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • We're going to extend this with further information which is only
    relevant for prog array at this point. Given this info is not used
    in critical path, move it into its own structure such that the main
    array map structure can be kept on diet.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/b9ddccdb0f6f7026489ee955f16c96381e1e7238.1574452833.git.daniel@iogearbox.net

    Daniel Borkmann
     

20 Nov, 2019

1 commit

  • Fix sparse warning:

    kernel/bpf/arraymap.c:481:5: warning:
    symbol 'array_map_mmap' was not declared. Should it be static?

    Reported-by: Hulk Robot
    Signed-off-by: YueHaibing
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20191119142113.15388-1-yuehaibing@huawei.com

    YueHaibing
     

18 Nov, 2019

1 commit

  • Add the ability to memory-map the contents of a BPF array map. This is
    extremely useful for working with BPF global data from userspace programs. It
    allows avoiding the typical bpf_map_{lookup,update}_elem operations, improving
    both performance and usability.

    There had to be special considerations for map freezing, to avoid having
    writable memory view into a frozen map. To solve this issue, map freezing and
    mmap-ing is happening under mutex now:
    - if map is already frozen, no writable mapping is allowed;
    - if map has writable memory mappings active (accounted in map->writecnt),
    map freezing will keep failing with -EBUSY;
    - once number of writable memory mappings drops to zero, map freezing can be
    performed again.

    Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
    can't be memory mapped either.

    For a BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
    to be mmap()'able. We also need to make sure that array data memory is
    page-sized and page-aligned, so we over-allocate memory in such a way that
    struct bpf_array is at the end of a single page of memory with array->value
    being aligned with the start of the second page. On deallocation we need to
    accommodate this memory arrangement to free vmalloc()'ed memory correctly.

    One important consideration concerns how the memory-mapping subsystem
    functions. It provides a few optional callbacks, among them open()
    and close(). close() is called for each memory region that is unmapped, so
    that users can decrease their reference counters and free up resources, if
    necessary. open() is *almost* symmetrical: it's called for each memory region
    that is being mapped, **except** the very first one. So bpf_map_mmap does the
    initial refcnt bump, while open() will do any extra ones after that. Thus the
    number of close() calls is equal to the number of open() calls plus one more.
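
    A hedged userspace sketch (using the libbpf API of that era; sizes and the
    function name are arbitrary) of creating and mapping such an array:

    #include <sys/mman.h>
    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    int map_array(__u64 **out)
    {
        int fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(__u32),
                                sizeof(__u64), 512, BPF_F_MMAPABLE);

        if (fd < 0)
            return -1;

        /* value area is page-sized and page-aligned, so it can be mapped
         * directly and read/written without lookup/update syscalls */
        *out = mmap(NULL, 512 * sizeof(__u64), PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
        return *out == MAP_FAILED ? -1 : fd;
    }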

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Acked-by: John Fastabend
    Acked-by: Johannes Weiner
    Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com

    Andrii Nakryiko
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of version 2 of the gnu general public license as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 64 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 Jun, 2019

3 commits

  • Most bpf map types do similar checks and bytes-to-pages conversion
    during memory allocation and charging.

    Let's unify these checks by moving them into bpf_map_charge_init().

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     
  • In order to unify the existing memlock charging code with the
    memcg-based memory accounting, which will be added later, let's
    rework the current scheme.

    Currently the following design is used:
    1) .alloc() callback optionally checks if the allocation will likely
    succeed using bpf_map_precharge_memlock()
    2) .alloc() performs actual allocations
    3) .alloc() callback calculates map cost and sets map.memory.pages
    4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
    and performs actual charging; in case of failure the map is
    destroyed

    1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
    performs uncharge and releases the user
    2) .map_free() callback releases the memory

    The scheme can be simplified and made more robust:
    1) .alloc() calculates map cost and calls bpf_map_charge_init()
    2) bpf_map_charge_init() sets map.memory.user and performs actual
    charge
    3) .alloc() performs actual allocations

    1) .map_free() callback releases the memory
    2) bpf_map_charge_finish() performs uncharge and releases the user

    The new scheme also allows reusing the bpf_map_charge_init()/finish()
    functions for memcg-based accounting. Because charges are performed
    before actual allocations and uncharges after freeing the memory,
    no bogus memory pressure can be created.

    In cases when the map structure is not available (e.g. it's not
    created yet, or is already destroyed), on-stack bpf_map_memory
    structure is used. The charge can be transferred with the
    bpf_map_charge_move() function.
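
    A paraphrased sketch of the new flow from a map's .alloc() callback (function
    names follow the description above; exact signatures and the cost formula are
    assumptions, not copied from the patch):

    static struct bpf_map *example_map_alloc(union bpf_attr *attr)
    {
        u64 cost = sizeof(struct bpf_array) +
                   (u64)attr->max_entries * round_up(attr->value_size, 8);
        struct bpf_map_memory mem;
        struct bpf_array *array;

        /* 1) charge memlock up front based on the calculated cost */
        if (bpf_map_charge_init(&mem, cost))
            return ERR_PTR(-ENOMEM);

        /* 2) perform the actual allocation; undo the charge on failure */
        array = bpf_map_area_alloc(cost, NUMA_NO_NODE);
        if (!array) {
            bpf_map_charge_finish(&mem);
            return ERR_PTR(-ENOMEM);
        }

        /* 3) hand the charge over to the map being created */
        bpf_map_charge_move(&array->map.memory, &mem);
        return &array->map;
    }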

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     
  • Group "user" and "pages" fields of bpf_map into the bpf_map_memory
    structure. Later it can be extended with "memcg" and other related
    information.

    The main reason for such a change (besides cosmetics) is to pass the
    bpf_map_memory structure to charging functions before the actual
    allocation of the bpf_map.

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     

10 Apr, 2019

3 commits

  • Given we'll be reusing BPF array maps for global data/bss/rodata
    sections, we need a way to associate BTF DataSec type as its map
    value type. In usual cases we have this ugly BPF_ANNOTATE_KV_PAIR()
    macro hack e.g. via 38d5d3b3d5db ("bpf: Introduce BPF_ANNOTATE_KV_PAIR")
    to get initial map to type association going. While more use cases
    for it are discouraged, this also won't work for global data since
    the use of array map is a BPF loader detail and therefore unknown
    at compilation time. For array maps with just a single entry we make
    an exception in terms of BTF in that key type is declared optional
    if value type is of DataSec type. The latter LLVM is guaranteed to
    emit and it also aligns with how we regard global data maps as just
    a plain buffer area reusing existing map facilities for allowing
    things like introspection with existing tools.

    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     
  • This work adds two new map creation flags BPF_F_RDONLY_PROG
    and BPF_F_WRONLY_PROG in order to allow for read-only or
    write-only BPF maps from a BPF program side.

    Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
    applies to system call side, meaning the BPF program has full
    read/write access to the map as usual while bpf(2) calls with
    map fd can either only read or write into the map depending
    on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allow
    for the exact opposite, such that the verifier is going to reject
    program loads if a write into a read-only map or a read from a
    write-only map is detected. For the read-only map case, some
    helpers are also forbidden for programs that would alter the map
    state, such as map deletion, update, etc. As opposed to the two
    BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
    as BPF_F_WRONLY_PROG really do correspond to the map lifetime.

    We've enabled this generic map extension to various non-special
    maps holding normal user data: array, hash, lru, lpm, local
    storage, queue and stack. Further generic map types could be
    followed up in future depending on use-case. Main use case
    here is to forbid writes into .rodata map values from verifier
    side.
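
    As a hedged sketch (BTF-defined map syntax shown for brevity; names and
    sizes are illustrative), a map that the program may only read could be
    declared as:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Program-side read-only: the verifier rejects programs writing to it,
     * while bpf(2) syscall access from userspace stays read/write. */
    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(map_flags, BPF_F_RDONLY_PROG);
        __uint(max_entries, 16);
        __type(key, __u32);
        __type(value, __u64);
    } config_map SEC(".maps");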

    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     
  • This generic extension to BPF maps allows for directly loading
    an address residing inside a BPF map value as a single BPF
    ldimm64 instruction!

    The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
    is a special src_reg flag for the ldimm64 instruction that indicates
    that inside the first part of the double insn's imm field is a
    file descriptor which the verifier then replaces with the full 64bit
    address of the map into both imm parts. For the newly added
    BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
    the first part of the double insn's imm field is again a file
    descriptor corresponding to the map, and the second part of the
    imm field is an offset into the value. The verifier will then
    replace both imm parts with an address that points into the BPF
    map value at the given value offset for maps that support this
    operation. Currently supported is the array map with a single entry.
    It is possible to support more than just a single map element by
    reusing both 16bit off fields of the insns as a map index, so a
    full array map lookup could be expressed that way. It hasn't
    been implemented here due to lack of a concrete use case, but
    could easily be done in the future in a compatible way, since
    both off fields right now have to be 0 and would correctly
    denote a map index 0.

    The BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with
    BPF_PSEUDO_MAP_FD we could not distinguish, at offset 0, between a load of
    the map pointer versus a load of the map's value at offset 0, and changing
    BPF_PSEUDO_MAP_FD's encoding into off by one to differ between a
    regular map pointer and a map value pointer would add unnecessary
    complexity and increase the barrier for debuggability, and is thus less
    suitable. Using the second part of the imm field as an offset
    into the value does /not/ come with limitations since the maximum
    possible value size is in the u32 universe anyway.

    This optimization allows for efficiently retrieving an address
    to a map value memory area without having to issue a helper call
    which needs to prepare registers according to calling convention,
    etc, without needing the extra NULL test, and without having to
    add the offset in an additional instruction to the value base
    pointer. The verifier then treats the destination register as
    PTR_TO_MAP_VALUE with constant reg->off from the user passed
    offset from the second imm field, and guarantees that this is
    within bounds of the map value. Any subsequent operations are
    normally treated as typical map value handling without anything
    extra needed from verification side.

    The two map operations for direct value access have been added to
    array map for now. In future other types could be supported as
    well depending on the use case. The main use case for this commit
    is to allow for BPF loader support for global variables that
    reside in .data/.rodata/.bss sections such that we can directly
    load the address of them with minimal additional infrastructure
    required. Loader support has been added in subsequent commits for
    libbpf library.
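
    In practice this is what lets a plain global variable work, roughly as in
    this hedged sketch (names are illustrative): the loader places the variable
    into a single-entry array and emits a BPF_PSEUDO_MAP_VALUE ldimm64 for its
    address, with no helper call or NULL test needed:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    long pkt_count;         /* ends up in a single-entry .bss array map */

    SEC("xdp")
    int count_packets(struct xdp_md *ctx)
    {
        /* compiles down to a direct load of the value address via ldimm64 */
        __sync_fetch_and_add(&pkt_count, 1);
        return XDP_PASS;
    }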

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

02 Feb, 2019

2 commits

  • Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands
    and for map_update() helper function.
    In all these cases take a lock of existing element (which was provided
    in BTF description) before copying (in or out) the rest of map value.

    Implementation details that are part of uapi:

    Array:
    The array map takes the element lock for lookup/update.

    Hash:
    hash map also takes the lock for lookup/update and tries to avoid the bucket lock.
    If old element exists it takes the element lock and updates the element in place.
    If element doesn't exist it allocates new one and inserts into hash table
    while holding the bucket lock.
    In rare case the hashmap has to take both the bucket lock and the element lock
    to update old value in place.

    Cgroup local storage:
    It is similar to array. Update in place and lookup are done with the lock taken.
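
    A hedged userspace sketch of using the flag via libbpf's syscall wrappers
    (the value layout and function name are illustrative; each call takes the
    element's bpf_spin_lock individually):

    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    struct val {
        struct bpf_spin_lock lock;
        long counter;
    };

    int read_and_bump(int map_fd, __u32 key)
    {
        struct val v;

        /* copies the value out under the element lock (lock field not copied) */
        if (bpf_map_lookup_elem_flags(map_fd, &key, &v, BPF_F_LOCK))
            return -1;

        v.counter++;
        /* writes the value back under the element lock (lock field untouched) */
        return bpf_map_update_elem(map_fd, &key, &v, BPF_F_LOCK);
    }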

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
    bpf program serialize access to other variables.

    Example:
    struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
    };
    struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
    if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
    }

    Restrictions and safety checks:
    - bpf_spin_lock is only allowed inside HASH and ARRAY maps.
    - BTF description of the map is mandatory for safety analysis.
    - bpf program can take one bpf_spin_lock at a time, since two or more can
    cause deadlocks.
    - only one 'struct bpf_spin_lock' is allowed per map element.
    It drastically simplifies implementation yet allows bpf program to use
    any number of bpf_spin_locks.
    - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
    - bpf program must bpf_spin_unlock() before return.
    - bpf program can access 'struct bpf_spin_lock' only via
    bpf_spin_lock()/bpf_spin_unlock() helpers.
    - load/store into 'struct bpf_spin_lock lock;' field is not allowed.
    - to use bpf_spin_lock() helper the BTF description of map value must be
    a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
    Nested lock inside another struct is not allowed.
    - syscall map_lookup doesn't copy bpf_spin_lock field to user space.
    - syscall map_update and program map_update do not update bpf_spin_lock field.
    - bpf_spin_lock cannot be on the stack or inside networking packet.
    bpf_spin_lock can only be inside HASH or ARRAY map value.
    - bpf_spin_lock is available to root only and to all program types.
    - bpf_spin_lock is not allowed in inner maps of map-in-map.
    - ld_abs is not allowed inside spin_lock-ed region.
    - tracing progs and socket filter progs cannot use bpf_spin_lock due to
    insufficient preemption checks.

    Implementation details:
    - cgroup-bpf class of programs can nest with xdp/tc programs.
    Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
    Other solutions to avoid nested bpf_spin_lock are possible.
    Like making sure that all networking progs run with softirq disabled.
    spin_lock_irqsave is the simplest and doesn't add overhead to the
    programs that don't use it.
    - arch_spinlock_t is used when its implemented as queued_spin_lock
    - archs can force their own arch_spinlock_t
    - on architectures where queued_spin_lock is not available and
    sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
    - presence of bpf_spin_lock inside map value could have been indicated via
    extra flag during map_create, but specifying it via BTF is cleaner.
    It provides introspection for map key/value and reduces user mistakes.

    Next steps:
    - allow bpf_spin_lock in other map types (like cgroup local storage)
    - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
    to request kernel to grab bpf_spin_lock before rewriting the value.
    That will serialize access to map elements.

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

13 Dec, 2018

1 commit

  • If key_type or value_type are of non-trivial data types
    (e.g. structure or typedef), it's not possible to check them without
    the additional information, which can't be obtained without a pointer
    to the btf structure.

    So, let's pass btf pointer to the map_check_btf() callbacks.

    Signed-off-by: Roman Gushchin
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     

10 Oct, 2018

1 commit

  • Return ERR_PTR(-EOPNOTSUPP) from map_lookup_elem() methods of below
    map types:
    - BPF_MAP_TYPE_PROG_ARRAY
    - BPF_MAP_TYPE_STACK_TRACE
    - BPF_MAP_TYPE_XSKMAP
    - BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH

    Signed-off-by: Prashant Bhole
    Acked-by: Alexei Starovoitov
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Prashant Bhole
     

12 Sep, 2018

1 commit

  • Added bpffs pretty print for the program array map. For a particular
    array index, if the program array points to a valid program,
    "<index>: <prog_id>" will be printed out, like
    0: 6
    which means the bpf program with id "6" is installed at index "0".

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov

    Yonghong Song
     

30 Aug, 2018

1 commit

  • Added bpffs pretty print for percpu arraymap, percpu hashmap
    and percpu lru hashmap.

    For each <key, value> pair, the format is:
    <key_value>: {
    cpu0: <value_on_cpu0>
    cpu1: <value_on_cpu1>
    ...
    cpun: <value_on_cpun>
    }

    For example, on my VM, there are 4 cpus, and
    for test_btf test in the next patch:
    cat /sys/fs/bpf/pprint_test_percpu_hash

    You may get:
    ...
    43602: {
    cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
    cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
    cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
    cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
    }
    72847: {
    cpu0: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
    cpu1: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
    cpu2: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
    cpu3: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
    }
    ...

    Signed-off-by: Yonghong Song
    Signed-off-by: Daniel Borkmann

    Yonghong Song
     

13 Aug, 2018

1 commit

  • Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
    the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
    print for hash/lru_hash maps") enabled support for BTF and
    dumping via BPF fs for array and hash/lru map. However, both
    can be decoupled from each other such that regular BPF maps
    can be supported for attaching BTF key/value information,
    while not all maps necessarily need to dump via map_seq_show_elem()
    callback.

    The basic sanity check which is a prerequisite for all maps
    is that key/value size has to match in any case, and some maps
    can have extra checks via map_check_btf() callback, e.g.
    probing certain types or indicating no support in general. With
    that we can also enable retrieving BTF info for per-cpu map
    types and lpm.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: Yonghong Song

    Daniel Borkmann
     

11 Aug, 2018

1 commit

  • This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.

    To unleash the full potential of a bpf prog, it is essential for the
    userspace to be capable of directly setting up a bpf map which can then
    be consumed by the bpf prog to make decisions. In this case, to decide which
    SO_REUSEPORT sk to serve the incoming request.

    By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
    and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
    A later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
    the bpf prog can directly select a sk from the bpf map. That will
    raise the programmability of the bpf prog attached to a reuseport
    group (a group of sk serving the same IP:PORT).

    For example, in UDP, the bpf prog can peek into the payload (e.g.
    through the "data" pointer introduced in the later patch) to learn
    the application level's connection information and then decide which sk
    to pick from a bpf map. The userspace can tightly couple the sk's location
    in a bpf map with the application logic in generating the UDP payload's
    connection information. This connection info contact/API stays within the
    userspace.

    Also, when used with map-in-map, the userspace can switch the
    old-server-process's inner map to a new-server-process's inner map
    in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
    The bpf prog will then direct incoming requests to the new process instead
    of the old process. The old process can finish draining the pending
    requests (e.g. by "accept()") before closing the old-fds. [Note that
    deleting a fd from a bpf map does not necessarily mean the fd is closed]

    During map_update_elem(),
    only a SO_REUSEPORT sk (i.e. one which has already been added
    to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is
    "bind()" for UDP or "bind()+listen()" for TCP. These conditions are
    ensured in "reuseport_array_update_check()".
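
    As a hedged userspace sketch (IPv4 UDP; error handling trimmed; function
    name is illustrative), installing such a socket into the sockarray can look
    like:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <bpf/bpf.h>

    int install_sk(int map_fd, __u32 index, const struct sockaddr_in *addr)
    {
        int one = 1;
        int sk = socket(AF_INET, SOCK_DGRAM, 0);

        setsockopt(sk, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        bind(sk, (const struct sockaddr *)addr, sizeof(*addr));

        /* value is the socket fd (value_size == 4); the kernel resolves
         * and stores the underlying sk */
        return bpf_map_update_elem(map_fd, &index, &sk, BPF_ANY);
    }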

    A SO_REUSEPORT sk can only be added once to a map (i.e. the
    same sk cannot be added twice even to the same map). SO_REUSEPORT
    already allows another sk to be created for the same IP:PORT.
    There is no need to re-create a similar usage in the BPF side.

    When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
    it will notify the bpf map to remove it from the map also. It is
    done through "bpf_sk_reuseport_detach()" and it will only be called
    if >=1 of the "reuse->sock[]" has ever been added to a bpf map.

    The map_update()/map_delete() has to be in-sync with the
    "reuse->socks[]". Hence, the same "reuseport_lock" used
    by "reuse->socks[]" has to be used here also. Care has
    been taken to ensure the lock is only acquired when the
    sk being added passes some strict tests, and
    freeing the map does not require the reuseport_lock.

    The reuseport_array will also support lookup from the syscall
    side. It will return a sock_gen_cookie(). The sock_gen_cookie()
    is on-demand (i.e. a sk's cookie is not generated until the very
    first map_lookup_elem()).

    The lookup cookie is 64bits but it goes against the logical userspace
    expectation of a 32bits sizeof(fd) (as other fd based bpf maps do also).
    It may catch the user by surprise if we enforce value_size=8 while
    userspace still passes a 32bits fd during update. Supporting different
    value_size between lookup and update seems unintuitive also.

    We also need to consider what if other existing fd based maps want
    to return a 64bits value from the syscall's lookup in the future.
    Hence, reuseport_array supports both value_size 4 and 8, and
    assumes the user will usually use value_size=4. The syscall's lookup
    will return ENOSPC on value_size=4. It will only
    return the 64bits value from sock_gen_cookie() when the user consciously
    chooses value_size=8 (as a signal that lookup is desired), which then
    requires a 64bits value in both lookup and update.

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Martin KaFai Lau
     

27 Jul, 2018

1 commit

  • The current map_check_btf() in BPF_MAP_TYPE_ARRAY rejects
    '> map->value_size' to ensure map_seq_show_elem() will not
    access things beyond an array element.

    Yonghong suggested that using '!=' is a more correct
    check. The 8 bytes round_up on value_size is stored
    in array->elem_size. Hence, using '!=' on map->value_size
    is a proper check.

    This patch also adds new tests to check the btf array
    key type and value type. Two of these new tests verify
    the btf's value_size (the change in this patch).

    It also fixes two existing tests that wrongly encoded
    a btf's type size (pprint_test) and the value_type_id (in one
    of the raw_tests[]). However, that does not affect these two
    BTF verification tests before or after this change.
    These two tests mainly failed at array creation time after
    this patch.

    Fixes: a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap")
    Suggested-by: Yonghong Song
    Acked-by: Yonghong Song
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann

    Martin KaFai Lau
     

23 May, 2018

1 commit

  • In "struct bpf_map_info", the name "btf_id", "btf_key_id" and "btf_value_id"
    could cause confusion because the "id" of "btf_id" means the BPF obj id
    given to the BTF object while
    "btf_key_id" and "btf_value_id" means the BTF type id within
    that BTF object.

    To make it clear, btf_key_id and btf_value_id are
    renamed to btf_key_type_id and btf_value_type_id.

    Suggested-by: Daniel Borkmann
    Signed-off-by: Martin KaFai Lau
    Acked-by: Yonghong Song
    Signed-off-by: Daniel Borkmann

    Martin KaFai Lau
     

24 Apr, 2018

1 commit

  • Relying on map_release hook to decrement the reference counts when a
    map is removed only works if the map is not being pinned. In the
    pinned case the ref is decremented immediately and the BPF programs
    released. After this BPF programs may not be in-use which is not
    what the user would expect.

    This patch moves the release logic into bpf_map_put_uref() and brings
    sockmap in-line with how a similar case is handled in prog array maps.

    Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend