28 Aug, 2020

1 commit

  • Some properties of the inner map are used at verification time.
    When an inner map is inserted into an outer map at runtime,
    bpf_map_meta_equal() is currently used to ensure that those
    properties of the inserted inner map stay the same as they were
    at verification time.

    In particular, the current bpf_map_meta_equal() checks max_entries,
    which turns out to be too restrictive for most maps, since they do
    not use max_entries at verification time. It blocks the use case of
    replacing a smaller inner map with a larger one. Some maps do use
    max_entries during verification, though. For example, map_gen_lookup
    in array_map_ops uses max_entries to generate the inline lookup
    code.

    To accommodate these differences between maps, a map_meta_equal
    callback is added to bpf_map_ops. Each map type can decide what to
    check when its map is used as an inner map at runtime.

    Also, some map types cannot be used as an inner map and are
    currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.
    It would not be unusual for a new map type to be unaware that
    such a blacklist exists. This patch therefore enforces an explicit
    opt-in and only allows a map to be used as an inner map if it has
    implemented the map_meta_equal ops. This is based on the discussion
    in [1].

    In this patch, all maps that support being used as an inner map
    have their map_meta_equal pointing to bpf_map_meta_equal. A later
    patch will relax the max_entries check for most maps. bpf_types.h
    counts 28 map types. This patch adds 23 ".map_meta_equal" by using
    coccinelle. The remaining 5:
    BPF_MAP_TYPE_PROG_ARRAY
    BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
    BPF_MAP_TYPE_STRUCT_OPS
    BPF_MAP_TYPE_ARRAY_OF_MAPS
    BPF_MAP_TYPE_HASH_OF_MAPS

    The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
    is moved such that the same error is returned.
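
    As a rough sketch (not the literal diff), the new ops hook and the
    common helper that most map types point to have this shape; the
    field list follows what bpf_map_meta_equal() compares, with the
    max_entries check still in place at this stage:

    /* in struct bpf_map_ops: per-map-type inner map compatibility check */
    bool (*map_meta_equal)(const struct bpf_map *meta0,
                           const struct bpf_map *meta1);

    bool bpf_map_meta_equal(const struct bpf_map *meta0,
                            const struct bpf_map *meta1)
    {
            /* no need to compare ops; it is covered by map_type */
            return meta0->map_type == meta1->map_type &&
                   meta0->key_size == meta1->key_size &&
                   meta0->value_size == meta1->value_size &&
                   meta0->map_flags == meta1->map_flags &&
                   meta0->max_entries == meta1->max_entries;
    }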

    [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com


01 Jul, 2020

1 commit

  • bpf_free_used_maps() or close(map_fd) will trigger the map_free
    callback. bpf_free_used_maps() is called after the bpf prog is no
    longer executing:
    bpf_prog_put -> call_rcu -> bpf_prog_free -> bpf_free_used_maps.
    Hence there is no need to call synchronize_rcu() to protect map
    elements.

    Note that hash_of_maps and array_of_maps update/delete inner maps
    via sys_bpf(), which calls maybe_wait_bpf_programs() and
    synchronize_rcu().
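
    As an illustrative sketch (taking the queue/stack map as an
    example), a map_free callback can then free the map area directly,
    without a preceding synchronize_rcu():

    static void queue_stack_map_free(struct bpf_map *map)
    {
            struct bpf_queue_stack *qs = bpf_queue_stack(map);

            /* bpf_free_used_maps() or close(map_fd) reaches this point
             * only after the prog can no longer be executing, so the
             * elements need no extra RCU grace period here
             */
            bpf_map_area_free(qs);
    }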

    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Paul E. McKenney
    Link: https://lore.kernel.org/bpf/20200630043343.53195-2-alexei.starovoitov@gmail.com


23 Jun, 2020

1 commit

  • Set map_btf_name and map_btf_id for all map types so that map fields can
    be accessed by bpf programs.
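
    A sketch of the wiring for one map type (the queue map; ops list
    abbreviated):

    static int queue_map_btf_id;

    const struct bpf_map_ops queue_map_ops = {
            .map_alloc_check = queue_stack_map_alloc_check,
            .map_alloc = queue_stack_map_alloc,
            .map_free = queue_stack_map_free,
            /* name of the kernel struct backing this map type, plus a
             * cached BTF id resolved from that name, so bpf programs
             * can access the map's own fields
             */
            .map_btf_name = "bpf_queue_stack",
            .map_btf_id = &queue_map_btf_id,
    };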

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com


15 May, 2020

1 commit

  • Implement permissions as stated in uapi/linux/capability.h.
    In order to do that, the verifier's allow_ptr_leaks flag is split
    into four flags, set as follows:
    env->allow_ptr_leaks = bpf_allow_ptr_leaks();
    env->bypass_spec_v1 = bpf_bypass_spec_v1();
    env->bypass_spec_v4 = bpf_bypass_spec_v4();
    env->bpf_capable = bpf_capable();

    The first three are currently equivalent to perfmon_capable(),
    since leaking kernel pointers and reading kernel memory via side
    channel attacks is roughly equivalent to reading kernel memory
    with cap_perfmon.

    'bpf_capable' enables bounded loops, precision tracking, bpf-to-bpf
    calls and other verifier features. 'allow_ptr_leaks' enables pointer
    leaks, pointer conversions, and subtraction of pointers.
    'bypass_spec_v1' disables speculative analysis in the verifier and
    the run-time mitigations in bpf arrays, and enables indirect
    variable access in bpf programs. 'bypass_spec_v4' disables emission
    of sanitization code by the verifier.
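
    A sketch of what these helpers amount to, following the semantics
    described above (not the literal patch):

    static inline bool bpf_allow_ptr_leaks(void)
    {
            return perfmon_capable();
    }

    static inline bool bpf_bypass_spec_v1(void)
    {
            return perfmon_capable();
    }

    static inline bool bpf_bypass_spec_v4(void)
    {
            return perfmon_capable();
    }

    static inline bool bpf_capable(void)
    {
            return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
    }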

    That means that a networking BPF program loaded with CAP_BPF +
    CAP_NET_ADMIN will have speculative checks done by the verifier and
    other Spectre mitigations applied. Such a networking BPF program
    will not be able to leak kernel pointers or access arbitrary kernel
    memory.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com


11 May, 2020

1 commit

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to
    declare variable-length types such as these is a flexible array
    member[1][2], introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler
    warning in case the flexible array does not occur last in the
    structure, which will help us prevent some kinds of
    undefined-behavior bugs from being inadvertently introduced[3]
    into the codebase from now on.

    Also, notice that dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]

    sizeof(flexible-array-member) triggers a warning because flexible array
    members have incomplete type[1]. There are some instances of code in
    which the sizeof operator is being incorrectly/erroneously applied to
    zero-length arrays and the result is zero. Such instances may be hiding
    some bugs. So, this work (flexible-array member conversions) will also
    help to get completely rid of those sorts of issues.

    This issue was found with the help of Coccinelle.
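
    Taking the queue/stack map as an example, the conversion amounts to
    a one-line change (a sketch of the struct):

    struct bpf_queue_stack {
            struct bpf_map map;
            raw_spinlock_t lock;
            u32 head, tail;
            u32 size; /* max_entries + 1 */
            char elements[] __aligned(8);   /* was: char elements[0] */
    };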

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200507185057.GA13981@embeddedor


01 Jun, 2019

3 commits

  • Most bpf map types do similar checks and the same bytes-to-pages
    conversion during memory allocation and charging.

    Let's unify these checks by moving them into bpf_map_charge_init().
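
    A sketch of the unified helper (bpf_charge_memlock() here stands in
    for the RLIMIT_MEMLOCK accounting; details abbreviated):

    int bpf_map_charge_init(struct bpf_map_memory *mem, u64 size)
    {
            u32 pages = round_up(size, PAGE_SIZE) >> PAGE_SHIFT;
            struct user_struct *user;
            int ret;

            /* common sanity check, previously repeated per map type */
            if (size >= U32_MAX - PAGE_SIZE)
                    return -E2BIG;

            user = get_current_user();
            ret = bpf_charge_memlock(user, pages);
            if (ret) {
                    free_uid(user);
                    return ret;
            }

            mem->pages = pages;
            mem->user = user;
            return 0;
    }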

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

  • In order to unify the existing memlock charging code with the
    memcg-based memory accounting, which will be added later, let's
    rework the current scheme.

    Currently the following design is used:
    1) .alloc() callback optionally checks if the allocation will likely
       succeed using bpf_map_precharge_memlock()
    2) .alloc() performs actual allocations
    3) .alloc() callback calculates map cost and sets map.memory.pages
    4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
       and performs actual charging; in case of failure the map is
       destroyed

    1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
       performs uncharge and releases the user
    2) .map_free() callback releases the memory

    The scheme can be simplified and made more robust:
    1) .alloc() calculates map cost and calls bpf_map_charge_init()
    2) bpf_map_charge_init() sets map.memory.user and performs actual
       charge
    3) .alloc() performs actual allocations

    1) .map_free() callback releases the memory
    2) bpf_map_charge_finish() performs uncharge and releases the user

    The new scheme also allows the bpf_map_charge_init()/finish()
    functions to be reused for memcg-based accounting. Because charges
    are performed before actual allocations, and uncharges after
    freeing the memory, no bogus memory pressure can be created.

    In cases when the map structure is not available (e.g. it's not
    created yet, or is already destroyed), an on-stack bpf_map_memory
    structure is used. The charge can be transferred with the
    bpf_map_charge_move() function.
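
    Put together, an .alloc() callback under the new scheme looks
    roughly like this (a hypothetical example_map, not a real map
    type):

    static struct bpf_map *example_map_alloc(union bpf_attr *attr)
    {
            u64 cost = sizeof(struct example_map) +
                       (u64) attr->max_entries * attr->value_size;
            struct bpf_map_memory mem;
            struct example_map *m;
            int err;

            /* 1) charge before allocating, so a failure here creates
             *    no bogus memory pressure
             */
            err = bpf_map_charge_init(&mem, cost);
            if (err)
                    return ERR_PTR(err);

            /* 2) actual allocation */
            m = bpf_map_area_alloc(cost, NUMA_NO_NODE);
            if (!m) {
                    bpf_map_charge_finish(&mem);
                    return ERR_PTR(-ENOMEM);
            }

            /* 3) transfer the on-stack charge into the new map */
            bpf_map_charge_move(&m->map.memory, &mem);
            return &m->map;
    }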

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

  • Group "user" and "pages" fields of bpf_map into the bpf_map_memory
    structure. Later it can be extended with "memcg" and other related
    information.

    The main reason for such a change (besides cosmetics) is to pass
    the bpf_map_memory structure to charging functions before the
    actual allocation of the bpf_map.
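
    A sketch of the grouping; the structure is embedded into struct
    bpf_map as "memory", hence the map.memory.pages and map.memory.user
    references in the entries above:

    struct bpf_map_memory {
            u32 pages;                /* charged size in pages */
            struct user_struct *user; /* user being charged */
    };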

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov


10 Apr, 2019

1 commit

  • This work adds two new map creation flags BPF_F_RDONLY_PROG
    and BPF_F_WRONLY_PROG in order to allow for read-only or
    write-only BPF maps from the BPF program side.

    Today we have BPF_F_RDONLY and BPF_F_WRONLY, but these only
    apply to the system call side, meaning the BPF program has full
    read/write access to the map as usual while bpf(2) calls with
    the map fd can either only read or only write into the map,
    depending on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG
    allow for the exact opposite, such that the verifier is going to
    reject program loads if a write into a read-only map or a read
    from a write-only map is detected. For the read-only map case,
    some helpers that would alter the map state, such as map deletion
    and update, are also forbidden. As opposed to the two BPF_F_RDONLY
    / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well as
    BPF_F_WRONLY_PROG really do correspond to the map lifetime.

    We've enabled this generic map extension for various non-special
    maps holding normal user data: array, hash, lru, lpm, local
    storage, queue and stack. Further generic map types could follow
    in the future depending on use cases. The main use case here is
    to forbid writes into .rodata map values from the verifier side.
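
    A minimal usage sketch from the system call side (hypothetical
    example, not part of the patch):

    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    int create_prog_rdonly_array(void)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.map_type = BPF_MAP_TYPE_ARRAY;
            attr.key_size = sizeof(__u32);
            attr.value_size = sizeof(__u64);
            attr.max_entries = 1;
            /* read-only from the BPF program side; bpf(2) calls can
             * still update the map as usual
             */
            attr.map_flags = BPF_F_RDONLY_PROG;

            return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    }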

    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov


23 Nov, 2018

1 commit

  • Fix the following issues:

    - allow queue_stack_map for root only
    - fix u32 max_entries overflow
    - disallow value_size == 0
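
    A condensed sketch of what the fixed checks amount to (not the
    literal diff; the E2BIG bound on value_size is omitted):

    static int queue_stack_map_alloc_check(union bpf_attr *attr)
    {
            if (!capable(CAP_SYS_ADMIN))    /* root only */
                    return -EPERM;

            if (attr->max_entries == 0 || attr->key_size != 0 ||
                attr->value_size == 0 ||    /* disallow value_size == 0 */
                attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
                    return -EINVAL;

            return 0;
    }

    /* and in queue_stack_map_alloc(), the size math is widened to u64
     * to avoid the u32 max_entries overflow:
     */
    u64 size = (u64) attr->max_entries + 1;
    u64 queue_size = sizeof(*qs) + size * attr->value_size;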

    Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
    Reported-by: Wei Wu
    Signed-off-by: Alexei Starovoitov
    Cc: Mauricio Vasquez B
    Signed-off-by: Daniel Borkmann


26 Oct, 2018

1 commit

  • Commit f1a2e44a3aec ("bpf: add queue and stack maps") added helpers
    with ARG_PTR_TO_UNINIT_MAP_VALUE. Meaning, the helper is supposed to
    fill the map value buffer with data instead of reading from it like
    in other helpers such as map update. However, given the buffer is
    allowed to be uninitialized (since we fill it in the helper anyway),
    it also means that the helper is obliged to wipe the memory in case
    of an error in order to not allow for leaking uninitialized memory.
    Given pop/peek are both handled inside __{stack,queue}_map_get(),
    let's wipe it there in the error case, that is, when the
    stack/queue is empty.
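
    A condensed sketch of the resulting function:

    static int __queue_map_get(struct bpf_map *map, void *value, bool delete)
    {
            struct bpf_queue_stack *qs = bpf_queue_stack(map);
            unsigned long flags;
            int err = 0;
            void *ptr;

            raw_spin_lock_irqsave(&qs->lock, flags);

            if (queue_stack_map_is_empty(qs)) {
                    /* the fix: wipe the caller's uninitialized buffer
                     * so a failed pop/peek cannot leak kernel memory
                     */
                    memset(value, 0, qs->map.value_size);
                    err = -ENOENT;
                    goto out;
            }

            ptr = &qs->elements[qs->tail * qs->map.value_size];
            memcpy(value, ptr, qs->map.value_size);

            if (delete) {
                    if (unlikely(++qs->tail >= qs->size))
                            qs->tail = 0;
            }

    out:
            raw_spin_unlock_irqrestore(&qs->lock, flags);
            return err;
    }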

    Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Mauricio Vasquez B
    Acked-by: Mauricio Vasquez B
    Signed-off-by: Alexei Starovoitov


20 Oct, 2018

1 commit

  • Queue/stack maps implement FIFO/LIFO data storage for eBPF
    programs. These maps support peek, pop and push operations that are
    exposed to eBPF programs through the new bpf_map[peek/pop/push]
    helpers. Those operations are exposed to userspace applications
    through the already existing syscalls in the following way:

    BPF_MAP_LOOKUP_ELEM -> peek
    BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
    BPF_MAP_UPDATE_ELEM -> push

    Queue/stack maps are implemented using a buffer, tail and head indexes,
    hence BPF_F_NO_PREALLOC is not supported.

    As opposed to other maps, queue and stack do not use RCU for
    protecting map values; the bpf_map[peek/pop] helpers have an
    ARG_PTR_TO_UNINIT_MAP_VALUE argument that is a pointer to a memory
    zone where the value of the map is saved. It is basically the same
    as ARG_PTR_TO_UNINIT_MEM, but the size does not have to be passed
    as an extra argument.

    Our main motivation for implementing queue/stack maps was to keep
    track of a pool of elements, like network ports in a SNAT; however,
    we foresee other use cases, such as saving the last N kernel events
    in a map and then analysing them from userspace.
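
    A minimal program-side usage sketch (hypothetical map and function
    names, assuming the bpf_helpers.h of that era for SEC() and the
    helper declarations):

    #include <linux/bpf.h>
    #include "bpf_helpers.h"

    struct bpf_map_def SEC("maps") my_queue = {
            .type = BPF_MAP_TYPE_QUEUE,
            .key_size = 0,                 /* queue/stack maps have no key */
            .value_size = sizeof(__u32),
            .max_entries = 128,
    };

    SEC("xdp")
    int sample(struct xdp_md *ctx)
    {
            __u32 val = 42;

            /* push fails with a negative error if the queue is full */
            bpf_map_push_elem(&my_queue, &val, BPF_ANY);

            /* pop retrieves and removes the oldest element (FIFO) */
            if (bpf_map_pop_elem(&my_queue, &val) == 0) {
                    /* val now holds the popped element */
            }

            return XDP_PASS;
    }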

    Signed-off-by: Mauricio Vasquez B
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov
