15 Sep, 2020

4 commits

  • Add a quiet option (-Q) that disables the statistics print outs of
    xdpsock. This is good to have when measuring 0% loss rate performance
    as it will be quite terrible if the application uses printfs.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1599726666-8431-4-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • Fix a possible deadlock in the l2fwd application in xdpsock that can
    occur when there is no space in the Tx ring. There are two ways to get
    the kernel to consume entries in the Tx ring: calling sendto() to make
    it send packets and freeing entries from the completion ring, as the
    kernel will not send a packet if there is no space for it to add a
    completion entry in the completion ring. The Tx loop in l2fwd only
    used to call sendto(). This patches adds cleaning the completion ring
    in that loop.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1599726666-8431-3-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • Fix the sending of a single packet (or small burst) in xdpsock when
    executing in copy mode. Currently, the l2fwd application in xdpsock
    only transmits the packets after a batch of them has been received,
    which might be confusing if you only send one packet and expect that
    it is returned pronto. Fix this by calling sendto() more often and add
    a comment in the code that states that this can be optimized if
    needed.

    Reported-by: Tirthendu Sarkar
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1599726666-8431-2-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • In order to branch around tail calls (due to out-of-bounds index,
    exceeding tail call count or missing tail call target), JIT uses
    label[0] field, which contains the address of the instruction following
    the tail call. When there are multiple tail calls, label[0] value comes
    from handling of a previous tail call, which is incorrect.

    Fix by getting rid of label array and resolving the label address
    locally: for all 3 branches that jump to it, emit 0 offsets at the
    beginning, and then backpatch them with the correct value.

    Also, do not use the long jump infrastructure: the tail call sequence
    is known to be short, so make all 3 jumps short.

    Fixes: 6651ee070b31 ("s390/bpf: implement bpf_tail_call() helper")
    Signed-off-by: Ilya Leoshkevich
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200909232141.3099367-1-iii@linux.ibm.com

    Ilya Leoshkevich
     

11 Sep, 2020

22 commits

  • Neal Cardwell says:

    ====================
    This patch series reorganizes TCP congestion control initialization so that if
    EBPF code called by tcp_init_transfer() sets the congestion control algorithm
    by calling setsockopt(TCP_CONGESTION) then the TCP stack initializes the
    congestion control module immediately, instead of having tcp_init_transfer()
    later initialize the congestion control module.

    This increases flexibility for the EBPF code that runs at connection
    establishment time, and simplifies the code.

    This has the following benefits:

    (1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

    (2) Does not flip the order of EBPF and CC init, to avoid causing bugs
    for existing code upstream that depends on the current order.

    (3) Does not cause 2 initializations for for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new CC
    algorithm.

    (4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.

    changes in v2:

    o rebase onto bpf-next

    o add another follow-on simplification suggested by Martin KaFai Lau:
    "tcp: simplify tcp_set_congestion_control() load=false case"

    changes in v3:

    o no change in commits

    o resent patch series from @gmail.com, since mail from ncardwell@google.com
    stopped being accepted at netdev@vger.kernel.org mid-way through processing
    the v2 patch series (between patches 2 and 3), confusing patchwork about
    which patches belonged to the v2 patch series
    ====================

    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Alexei Starovoitov
     
  • Simplify tcp_set_congestion_control() by removing the initialization
    code path for the !load case.

    There are only two call sites for tcp_set_congestion_control(). The
    EBPF call site is the only one that passes load=false; it also passes
    cap_net_admin=true. Because of that, the exact same behavior can be
    achieved by removing the special if (!load) branch of the logic. Both
    before and after this commit, the EBPF case will call
    bpf_try_module_get(), and if that succeeds then call
    tcp_reinit_congestion_control() or if that fails then return EBUSY.

    Note that this returns the logic to a structure very similar to the
    structure before:
    commit 91b5b21c7c16 ("bpf: Add support for changing congestion control")
    except that the CAP_NET_ADMIN status is passed in as a function
    argument.

    This clean-up was suggested by Martin KaFai Lau.

    Suggested-by: Martin KaFai Lau
    Signed-off-by: Neal Cardwell
    Signed-off-by: Alexei Starovoitov
    Cc: Lawrence Brakmo
    Cc: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Kevin Yang

    Neal Cardwell
     
  • Now that the previous patches have removed the code that uses the
    flags argument to _bpf_setsockopt(), we can remove that argument.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yuchung Cheng
    Acked-by: Kevin Yang
    Cc: Lawrence Brakmo

    Neal Cardwell
     
  • Now that the previous patches ensure that all call sites for
    tcp_set_congestion_control() want to initialize congestion control, we
    can simplify tcp_set_congestion_control() by removing the reinit
    argument and the code to support it.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yuchung Cheng
    Acked-by: Kevin Yang
    Cc: Lawrence Brakmo

    Neal Cardwell
     
  • Now that the previous patch ensures we don't initialize the congestion
    control twice, when EBPF sets the congestion control algorithm at
    connection establishment we can simplify the code by simply
    initializing the congestion control module at that time.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yuchung Cheng
    Acked-by: Kevin Yang
    Cc: Lawrence Brakmo

    Neal Cardwell
     
  • Change tcp_init_transfer() to only initialize congestion control if it
    has not been initialized already.

    With this new approach, we can arrange things so that if the EBPF code
    sets the congestion control by calling setsockopt(TCP_CONGESTION) then
    tcp_init_transfer() will not re-initialize the CC module.

    This is an approach that has the following beneficial properties:

    (1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

    (2) Does not flip the order of EBPF and CC init, to avoid causing bugs
    for existing code upstream that depends on the current order.

    (3) Does not cause 2 initializations for for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new CC
    algorithm.

    (4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yuchung Cheng
    Acked-by: Kevin Yang
    Cc: Lawrence Brakmo

    Neal Cardwell
     
  • The "SEE ALSO" sections of bpftool's manual pages refer to bpf(2),
    bpf-helpers(7), then all existing bpftool man pages (save the current
    one).

    This leads to nearly-identical lists being duplicated in all manual
    pages. Ideally, when a new page is created, all lists should be updated
    accordingly, but this has led to omissions and inconsistencies multiple
    times in the past.

    Let's take it out of the RST files and generate the "SEE ALSO" sections
    automatically in the Makefile when generating the man pages. The lists
    are not really useful in the RST anyway because all other pages are
    available in the same directory.

    v3:
    - Fix conflict with a previous patchset that introduced RST2MAN_OPTS
    variable passed to rst2man.

    v2:
    - Use "echo -n" instead of "printf" in Makefile, to avoid any risk of
    passing a format string directly to the command.

    Signed-off-by: Quentin Monnet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200910203935.25304-1-quentin@isovalent.com

    Quentin Monnet
     
  • This should be "current" not "skb".

    Fixes: c6b5fb8690fa ("bpf: add documentation for eBPF helpers (42-50)")
    Signed-off-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Cc:
    Link: https://lore.kernel.org/bpf/20200910203314.70018-1-songliubraving@fb.com

    Song Liu
     
  • When tweaking llvm optimizations, I found that selftest build failed
    with the following error:
    libbpf: elf: skipping unrecognized data section(6) .rodata.str1.1
    libbpf: prog 'sysctl_tcp_mem': bad map relo against '.L__const.is_tcp_mem.tcp_mem_name'
    in section '.rodata.str1.1'
    Error: failed to open BPF object file: Relocation failed
    make: *** [/work/net-next/tools/testing/selftests/bpf/test_sysctl_prog.skel.h] Error 255
    make: *** Deleting file `/work/net-next/tools/testing/selftests/bpf/test_sysctl_prog.skel.h'

    The local string constant "tcp_mem_name" is put into '.rodata.str1.1' section
    which libbpf cannot handle. Using untweaked upstream llvm, "tcp_mem_name"
    is completely inlined after loop unrolling.

    Commit 7fb5eefd7639 ("selftests/bpf: Fix test_sysctl_loop{1, 2}
    failure due to clang change") solved a similar problem by defining
    the string const as a global. Let us do the same here
    for test_sysctl_prog.c so it can weather future potential llvm changes.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200910202718.956042-1-yhs@fb.com

    Yonghong Song
     
  • On non-SMP kernels __per_cpu_start is not 0, so look it up in kallsyms.

    Signed-off-by: Ilya Leoshkevich
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200910171336.3161995-1-iii@linux.ibm.com

    Ilya Leoshkevich
     
  • As Alexei points out, struct bpf_sk_lookup_kern has two 4-byte holes.
    This leads to suboptimal instructions being generated (IPv4, x86):

    1372 struct bpf_sk_lookup_kern ctx = {
    0xffffffff81b87f30 : xor %eax,%eax
    0xffffffff81b87f32 : mov $0x6,%ecx
    0xffffffff81b87f37 : lea 0x90(%rsp),%rdi
    0xffffffff81b87f3f : movl $0x110002,0x88(%rsp)
    0xffffffff81b87f4a : rep stos %rax,%es:(%rdi)
    0xffffffff81b87f4d : mov 0x8(%rsp),%eax
    0xffffffff81b87f51 : mov %r13d,0x90(%rsp)
    0xffffffff81b87f59 : incl %gs:0x7e4970a0(%rip)
    0xffffffff81b87f60 : mov %eax,0x8c(%rsp)
    0xffffffff81b87f67 : movzwl 0x10(%rsp),%eax
    0xffffffff81b87f6c : mov %ax,0xa8(%rsp)
    0xffffffff81b87f74 : movzwl 0x38(%rsp),%eax
    0xffffffff81b87f79 : mov %ax,0xaa(%rsp)

    Fix this by moving around sport and dport. pahole confirms there
    are no more holes:

    struct bpf_sk_lookup_kern {
    u16 family; /* 0 2 */
    u16 protocol; /* 2 2 */
    __be16 sport; /* 4 2 */
    u16 dport; /* 6 2 */
    struct {
    __be32 saddr; /* 8 4 */
    __be32 daddr; /* 12 4 */
    } v4; /* 8 8 */
    struct {
    const struct in6_addr * saddr; /* 16 8 */
    const struct in6_addr * daddr; /* 24 8 */
    } v6; /* 16 16 */
    struct sock * selected_sk; /* 32 8 */
    bool no_reuseport; /* 40 1 */

    /* size: 48, cachelines: 1, members: 8 */
    /* padding: 7 */
    /* last cacheline: 48 bytes */
    };

    The assembly also doesn't contain the pesky rep stos anymore:

    1372 struct bpf_sk_lookup_kern ctx = {
    0xffffffff81b87f60 : movzwl 0x10(%rsp),%eax
    0xffffffff81b87f65 : movq $0x0,0xa8(%rsp)
    0xffffffff81b87f71 : movq $0x0,0xb0(%rsp)
    0xffffffff81b87f7d : mov %ax,0x9c(%rsp)
    0xffffffff81b87f85 : movzwl 0x38(%rsp),%eax
    0xffffffff81b87f8a : movq $0x0,0xb8(%rsp)
    0xffffffff81b87f96 : mov %ax,0x9e(%rsp)
    0xffffffff81b87f9e : mov 0x8(%rsp),%eax
    0xffffffff81b87fa2 : movq $0x0,0xc0(%rsp)
    0xffffffff81b87fae : movl $0x110002,0x98(%rsp)
    0xffffffff81b87fb9 : mov %eax,0xa0(%rsp)
    0xffffffff81b87fc0 : mov %r13d,0xa4(%rsp)

    1: https://lore.kernel.org/bpf/CAADnVQKE6y9h2fwX6OS837v-Uf+aBXnT_JXiN_bbo2gitZQ3tA@mail.gmail.com/

    Fixes: e9ddbb7707ff ("bpf: Introduce SK_LOOKUP program type with a dedicated attach point")
    Suggested-by: Alexei Starovoitov
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jesper Dangaard Brouer
    Link: https://lore.kernel.org/bpf/20200910110248.198326-1-lmb@cloudflare.com

    Lorenz Bauer
     
  • There is no support for creating maps of types array-of-map or
    hash-of-map in bpftool. This is because the kernel needs an inner_map_fd
    to collect metadata on the inner maps to be supported by the new map,
    but bpftool does not provide a way to pass this file descriptor.

    Add a new optional "inner_map" keyword that can be used to pass a
    reference to a map, retrieve a fd to that map, and pass it as the
    inner_map_fd.

    Add related documentation and bash completion. Note that we can
    reference the inner map by its name, meaning we can have several times
    the keyword "name" with different meanings (mandatory outer map name,
    and possibly a name to use to find the inner_map_fd). The bash
    completion will offer it just once, and will not suggest "name" on the
    following command:

    # bpftool map create /sys/fs/bpf/my_outer_map type hash_of_maps \
    inner_map name my_inner_map [TAB]

    Fixing that specific case seems too convoluted. Completion will work as
    expected, however, if the outer map name comes first and the "inner_map
    name ..." is passed second.

    Signed-off-by: Quentin Monnet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200910102652.10509-4-quentin@isovalent.com

    Quentin Monnet
     
  • When dumping outer maps or prog_array maps, and on lookup failure,
    bpftool simply skips the entry with no error message. This is because
    the kernel returns non-zero when no value is found for the provided key,
    which frequently happen for those maps if they have not been filled.

    When such a case occurs, errno is set to ENOENT. It seems unlikely we
    could receive other error codes at this stage (we successfully retrieved
    map info just before), but to be on the safe side, let's skip the entry
    only if errno was ENOENT, and not for the other errors.

    v3: New patch

    Signed-off-by: Quentin Monnet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200910102652.10509-3-quentin@isovalent.com

    Quentin Monnet
     
  • The function used to dump a map entry in bpftool is a bit difficult to
    follow, as a consequence to earlier refactorings. There is a variable
    ("num_elems") which does not appear to be necessary, and the error
    handling would look cleaner if moved to its own function. Let's clean it
    up. No functional change.

    v2:
    - v1 was erroneously removing the check on fd maps in an attempt to get
    support for outer map dumps. This is already working. Instead, v2
    focuses on cleaning up the dump_map_elem() function, to avoid
    similar confusion in the future.

    Signed-off-by: Quentin Monnet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200910102652.10509-2-quentin@isovalent.com

    Quentin Monnet
     
  • Add a test that exercises a basic sockmap / sockhash iteration. For
    now we simply count the number of elements seen. Once sockmap update
    from iterators works we can extend this to perform a full copy.

    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200909162712.221874-4-lmb@cloudflare.com

    Lorenz Bauer
     
  • Add bpf_iter support for sockmap / sockhash, based on the bpf_sk_storage and
    hashtable implementation. sockmap and sockhash share the same iteration
    context: a pointer to an arbitrary key and a pointer to a socket. Both
    pointers may be NULL, and so BPF has to perform a NULL check before accessing
    them. Technically it's not possible for sockhash iteration to yield a NULL
    socket, but we ignore this to be able to use a single iteration point.

    Iteration will visit all keys that remain unmodified during the lifetime of
    the iterator. It may or may not visit newly added ones.

    Switch from using rcu_dereference_raw to plain rcu_dereference, so we gain
    another guard rail if CONFIG_PROVE_RCU is enabled.

    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200909162712.221874-3-lmb@cloudflare.com

    Lorenz Bauer
     
  • The lookup paths for sockmap and sockhash currently include a check
    that returns NULL if the socket we just found is not a full socket.
    However, this check is not necessary. On insertion we ensure that
    we have a full socket (caveat around sock_ops), so request sockets
    are not a problem. Time-wait sockets are allocated separate from
    the original socket and then fed into the hashdance. They don't
    affect the sockets already stored in the sockmap.

    Suggested-by: Jakub Sitnicki
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200909162712.221874-2-lmb@cloudflare.com

    Lorenz Bauer
     
  • Nearly all man pages for bpftool have the same common set of option
    flags (--help, --version, --json, --pretty, --debug). The description is
    duplicated across all the pages, which is more difficult to maintain if
    the description of an option changes. It may also be confusing to sort
    out what options are not "common" and should not be copied when creating
    new manual pages.

    Let's move the description for those common options to a separate file,
    which is included with a RST directive when generating the man pages.

    Signed-off-by: Quentin Monnet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200909162500.17010-3-quentin@isovalent.com

    Quentin Monnet
     
  • Bpftool has a number of features that can be included or left aside
    during compilation. This includes:

    - Support for libbfd, providing the disassembler for JIT-compiled
    programs.
    - Support for BPF skeletons, used for profiling programs or iterating on
    the PIDs of processes associated with BPF objects.

    In order to make it easy for users to understand what features were
    compiled for a given bpftool binary, print the status of the two
    features above when showing the version number for bpftool ("bpftool -V"
    or "bpftool version"). Document this in the main manual page. Example
    invocations:

    $ bpftool version
    ./bpftool v5.9.0-rc1
    features: libbfd, skeletons

    $ bpftool -p version
    {
    "version": "5.9.0-rc1",
    "features": {
    "libbfd": true,
    "skeletons": true
    }
    }

    Some other parameters are optional at compilation
    ("DISASM_FOUR_ARGS_SIGNATURE", LIBCAP support) but they do not impact
    significantly bpftool's behaviour from a user's point of view, so their
    status is not reported.

    Available commands and supported program types depend on the version
    number, and are therefore not reported either. Note that they are
    already available, albeit without JSON, via bpftool's help messages.

    v3:
    - Use a simple list instead of boolean values for plain output.

    v2:
    - Fix JSON (object instead or array for the features).

    Signed-off-by: Quentin Monnet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200909162500.17010-2-quentin@isovalent.com

    Quentin Monnet
     
  • eBPF selftests include a script to check that bpftool builds correctly
    with different command lines. Let's add one build for bpftool's
    documentation so as to detect errors or warning reported by rst2man when
    compiling the man pages. Also add a build to the selftests Makefile to
    make sure we build bpftool documentation along with bpftool when
    building the selftests.

    This also builds and checks warnings for the man page for eBPF helpers,
    which is built along bpftool's documentation.

    This change adds rst2man as a dependency for selftests (it comes with
    Python's "docutils").

    v2:
    - Use "--exit-status=1" option for rst2man instead of counting lines
    from stderr.
    - Also build bpftool as part as the selftests build (and not only when
    the tests are actually run).

    Signed-off-by: Quentin Monnet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200909162251.15498-3-quentin@isovalent.com

    Quentin Monnet
     
  • To build man pages for bpftool (and for eBPF helper functions), rst2man
    can log different levels of information. Let's make it log all levels
    to keep the RST files clean.

    Doing so, rst2man complains about double colons, used for literal
    blocks, that look like underlines for section titles. Let's add the
    necessary blank lines.

    v2:
    - Use "--verbose" instead of "-r 1" (same behaviour but more readable).
    - Pass it through a RST2MAN_OPTS variable so we can easily pass other
    options too.

    Signed-off-by: Quentin Monnet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200909162251.15498-2-quentin@isovalent.com

    Quentin Monnet
     
  • Remove duplicate headers which are included twice.

    Signed-off-by: Chen Zhou
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200908132201.184005-1-chenzhou10@huawei.com

    Chen Zhou
     

10 Sep, 2020

2 commits

  • Switch from deprecated bpf_program__title() API to
    bpf_program__section_name(). Also drop unnecessary error checks because
    neither bpf_program__title() nor bpf_program__section_name() can fail or
    return NULL.

    Fixes: 521095842027 ("libbpf: Deprecate notion of BPF program "title" in favor of "section name"")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Tobias Klauser
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/bpf/20200908180127.1249-1-andriin@fb.com

    Andrii Nakryiko
     
  • Andrii reported that with latest clang, when building selftests, we have
    error likes:
    error: progs/test_sysctl_loop1.c:23:16: in function sysctl_tcp_mem i32 (%struct.bpf_sysctl*):
    Looks like the BPF stack limit of 512 bytes is exceeded.
    Please move large on stack variables into BPF per-cpu array map.

    The error is triggered by the following LLVM patch:
    https://reviews.llvm.org/D87134

    For example, the following code is from test_sysctl_loop1.c:
    static __always_inline int is_tcp_mem(struct bpf_sysctl *ctx)
    {
    volatile char tcp_mem_name[] = "net/ipv4/tcp_mem/very_very_very_very_long_pointless_string";
    ...
    }
    Without the above LLVM patch, the compiler did optimization to load the string
    (59 bytes long) with 7 64bit loads, 1 8bit load and 1 16bit load,
    occupying 64 byte stack size.

    With the above LLVM patch, the compiler only uses 8bit loads, but subregister is 32bit.
    So stack requirements become 4 * 59 = 236 bytes. Together with other stuff on
    the stack, total stack size exceeds 512 bytes, hence compiler complains and quits.

    To fix the issue, removing "volatile" key word or changing "volatile" to
    "const"/"static const" does not work, the string is put in .rodata.str1.1 section,
    which libbpf did not process it and errors out with
    libbpf: elf: skipping unrecognized data section(6) .rodata.str1.1
    libbpf: prog 'sysctl_tcp_mem': bad map relo against '.L__const.is_tcp_mem.tcp_mem_name'
    in section '.rodata.str1.1'

    Defining the string const as global variable can fix the issue as it puts the string constant
    in '.rodata' section which is recognized by libbpf. In the future, when libbpf can process
    '.rodata.str*.*' properly, the global definition can be changed back to local definition.

    Defining tcp_mem_name as a global, however, triggered a verifier failure.
    ./test_progs -n 7/21
    libbpf: load bpf program failed: Permission denied
    libbpf: -- BEGIN DUMP LOG ---
    libbpf:
    invalid stack off=0 size=1
    verification time 6975 usec
    stack depth 160+64
    processed 889 insns (limit 1000000) max_states_per_insn 4 total_states
    14 peak_states 14 mark_read 10

    libbpf: -- END LOG --
    libbpf: failed to load program 'sysctl_tcp_mem'
    libbpf: failed to load object 'test_sysctl_loop2.o'
    test_bpf_verif_scale:FAIL:114
    #7/21 test_sysctl_loop2.o:FAIL
    This actually exposed a bpf program bug. In test_sysctl_loop{1,2}, we have code
    like
    const char tcp_mem_name[] = "";
    ...
    char name[64];
    ...
    for (i = 0; i < sizeof(tcp_mem_name); ++i)
    if (name[i] != tcp_mem_name[i])
    return 0;
    In the above code, if sizeof(tcp_mem_name) > 64, name[i] access may be
    out of bound. The sizeof(tcp_mem_name) is 59 for test_sysctl_loop1.c and
    79 for test_sysctl_loop2.c.

    Without promotion-to-global change, old compiler generates code where
    the overflowed stack access is actually filled with valid value, so hiding
    the bpf program bug. With promotion-to-global change, the code is different,
    more specifically, the previous loading constants to stack is gone, and
    "name" occupies stack[-64:0] and overflow access triggers a verifier error.
    To fix the issue, adjust "name" buffer size properly.

    Reported-by: Andrii Nakryiko
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200909171542.3673449-1-yhs@fb.com

    Yonghong Song
     

09 Sep, 2020

2 commits

  • Change selftest map_ptr_kern.c with disabling inlining for
    one of subtests, which will fail the test without previous
    verifier change. Also added to verifier test for both
    "map_ptr += scalar" and "scalar += map_ptr" arithmetic.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200908175703.2463721-1-yhs@fb.com

    Yonghong Song
     
  • Commit 41c48f3a98231 ("bpf: Support access
    to bpf map fields") added support to access map fields
    with CORE support. For example,

    struct bpf_map {
    __u32 max_entries;
    } __attribute__((preserve_access_index));

    struct bpf_array {
    struct bpf_map map;
    __u32 elem_size;
    } __attribute__((preserve_access_index));

    struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 4);
    __type(key, __u32);
    __type(value, __u32);
    } m_array SEC(".maps");

    SEC("cgroup_skb/egress")
    int cg_skb(void *ctx)
    {
    struct bpf_array *array = (struct bpf_array *)&m_array;

    /* .. array->map.max_entries .. */
    }

    In kernel, bpf_htab has similar structure,

    struct bpf_htab {
    struct bpf_map map;
    ...
    }

    In the above cg_skb(), to access array->map.max_entries, with CORE, the clang will
    generate two builtin's.
    base = &m_array;
    /* access array.map */
    map_addr = __builtin_preserve_struct_access_info(base, 0, 0);
    /* access array.map.max_entries */
    max_entries_addr = __builtin_preserve_struct_access_info(map_addr, 0, 0);
    max_entries = *max_entries_addr;

    In the current llvm, if two builtin's are in the same function or
    in the same function after inlining, the compiler is smart enough to chain
    them together and generates like below:
    base = &m_array;
    max_entries = *(base + reloc_offset); /* reloc_offset = 0 in this case */
    and we are fine.

    But if we force no inlining for one of functions in test_map_ptr() selftest, e.g.,
    check_default(), the above two __builtin_preserve_* will be in two different
    functions. In this case, we will have code like:
    func check_hash():
    reloc_offset_map = 0;
    base = &m_array;
    map_base = base + reloc_offset_map;
    check_default(map_base, ...)
    func check_default(map_base, ...):
    max_entries = *(map_base + reloc_offset_max_entries);

    In kernel, map_ptr (CONST_PTR_TO_MAP) does not allow any arithmetic.
    The above "map_base = base + reloc_offset_map" will trigger a verifier failure.
    ; VERIFY(check_default(&hash->map, map));
    0: (18) r7 = 0xffffb4fe8018a004
    2: (b4) w1 = 110
    3: (63) *(u32 *)(r7 +0) = r1
    R1_w=invP110 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0
    ; VERIFY_TYPE(BPF_MAP_TYPE_HASH, check_hash);
    4: (18) r1 = 0xffffb4fe8018a000
    6: (b4) w2 = 1
    7: (63) *(u32 *)(r1 +0) = r2
    R1_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R2_w=invP1 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0
    8: (b7) r2 = 0
    9: (18) r8 = 0xffff90bcb500c000
    11: (18) r1 = 0xffff90bcb500c000
    13: (0f) r1 += r2
    R1 pointer arithmetic on map_ptr prohibited

    To fix the issue, let us permit map_ptr + 0 arithmetic which will
    result in exactly the same map_ptr.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200908175702.2463625-1-yhs@fb.com

    Yonghong Song
     

07 Sep, 2020

3 commits


04 Sep, 2020

7 commits

  • This commit adds xsk_fwd test file to .gitignore which is newly added
    to samples/bpf.

    Signed-off-by: Daniel T. Lee
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200904063434.24963-2-danieltimlee@gmail.com

    Daniel T. Lee
     
  • From commit 521095842027 ("libbpf: Deprecate notion of BPF program
    "title" in favor of "section name""), the term title has been replaced
    with section name in libbpf.

    Since the bpf_program__title() has been deprecated, this commit
    switches this function to bpf_program__section_name(). Due to
    this commit, the compilation warning issue has also been resolved.

    Fixes: 521095842027 ("libbpf: Deprecate notion of BPF program "title" in favor of "section name"")
    Signed-off-by: Daniel T. Lee
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200904063434.24963-1-danieltimlee@gmail.com

    Daniel T. Lee
     
  • Detected by LGTM static analyze in Github repo, fix potential multiplication
    overflow before result is casted to size_t.

    Fixes: 8505e8709b5e ("libbpf: Implement generalized .BTF.ext func/line info adjustment")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200904041611.1695163-2-andriin@fb.com

    Andrii Nakryiko
     
  • Another issue of __u64 needing either %lu or %llu, depending on the
    architecture. Fix with cast to `unsigned long long`.

    Fixes: 7e06aad52929 ("libbpf: Add multi-prog section support for struct_ops")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200904041611.1695163-1-andriin@fb.com

    Andrii Nakryiko
     
  • The returned value of bpf_object__open_file() should be checked with
    libbpf_get_error() rather than NULL. This fix prevents test_progs from
    crash when test_global_data.o is not present.

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200903200528.747884-1-haoluo@google.com

    Hao Luo
     
  • Andrii Nakryiko says:

    ====================
    Currently, libbpf supports a limited form of BPF-to-BPF subprogram calls. The
    restriction is that entry-point BPF program should use *all* of defined
    sub-programs in BPF .o file. If any of the subprograms is not used, such
    entry-point BPF program will be rejected by verifier as containing unreachable
    dead code. This is not a big limitation for cases with single entry-point BPF
    programs, but is quite a heavy restriction for multi-programs that use only
    partially overlapping set of subprograms.

    This patch set removes all such restrictions and adds complete support for
    using BPF sub-program calls on BPF side. This is achieved through libbpf
    tracking subprograms individually and detecting which subprograms are used by
    any given entry-point BPF program, and subsequently only appending and
    relocating code for just those used subprograms.

    In addition, libbpf now also supports multiple entry-point BPF programs within
    the same ELF section. This allows to structure code so that there are few
    variants of BPF programs of the same type and attaching to the same target
    (e.g., for tracepoints and kprobes) without the need to worry about ELF
    section name clashes.

    This patch set opens way for more wider adoption of BPF subprogram calls,
    especially for real-world production use-cases with complicated net of
    subprograms. This will allow to further scale BPF verification process through
    good use of global functions, which can be verified independently. This is
    also important prerequisite for static linking which allows static BPF
    libraries to not worry about naming clashes for section names, as well as use
    static non-inlined functions (subprograms) without worries of verifier
    rejecting program due to dead code.

    Patch set is structured as follows:
    - patched 1-6 contain all the libbpf changes necessary to support multi-prog
    sections and bpf2bpf subcalls;
    - patch 7 adds dedicated selftests validating all combinations of possible
    sub-calls (within and across sections, static vs global functions);
    - patch 8 deprecated bpf_program__title() in favor of
    bpf_program__section_name(). The intent was to also deprecate
    bpf_object__find_program_by_title() as it's now non-sensical with multiple
    programs per section. But there were too many selftests uses of this and
    I didn't want to delay this patches further and make it even bigger, so left
    it for a follow up cleanup;
    - patches 9-10 remove uses for title-related APIs from bpftool and
    bpf_program__title() use from selftests;
    - patch 11 is converting fexit_bpf2bpf to have explicit subtest (it does
    contain 4 subtests, which are not handled as sub-tests);
    - patches 12-14 convert few complicated BPF selftests to use __noinline
    functions to further validate correctness of libbpf's bpf2bpf processing
    logic.

    v2->v3:
    - explained subprog relocation algorithm in more details (Alexei);
    - pyperf, strobelight and cls_redirect got new subprog variants, leaving
    other modes intact (Alexei);
    v1->v2:
    - rename DEPRECATED to LIBBPF_DEPRECATED to avoid name clashes;
    - fix test_subprogs build;
    - convert a bunch of complicated selftests to __noinline (Alexei).
    ====================

    Signed-off-by: Alexei Starovoitov

    Alexei Starovoitov
     
  • As one of the most complicated and close-to-real-world programs, cls_redirect
    is a good candidate to exercise libbpf's logic of handling bpf2bpf calls. So
    add variant with using explicit __noinline for majority of functions except
    few most basic ones. If those few functions are inlined, verifier starts to
    complain about program instruction limit of 1mln instructions being exceeded,
    most probably due to instruction overhead of doing a sub-program call.
    Convert user-space part of selftest to have to sub-tests: with and without
    inlining.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Cc: Lorenz Bauer
    Link: https://lore.kernel.org/bpf/20200903203542.15944-15-andriin@fb.com

    Andrii Nakryiko