08 Oct, 2020

2 commits

  • When CONFIG_NET is not defined, I hit the following build error:
    kernel/trace/bpf_trace.o:(.rodata+0x110): undefined reference to `bpf_prog_test_run_raw_tp'

    Commit 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
    added test_run support for raw_tracepoint in kernel/trace/bpf_trace.c.
    But the test_run function bpf_prog_test_run_raw_tp is defined in
    net/bpf/test_run.c and is only available with CONFIG_NET=y.

    Adding a CONFIG_NET guard around the
    .test_run = bpf_prog_test_run_raw_tp,
    assignment fixed the above build issue, as sketched below.
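
    A minimal sketch of the guarded assignment in kernel/trace/bpf_trace.c
    (surrounding members omitted):

    const struct bpf_prog_ops raw_tracepoint_prog_ops = {
    #ifdef CONFIG_NET
            .test_run = bpf_prog_test_run_raw_tp,
    #endif
    };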

    Fixes: 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201007062933.3425899-1-yhs@fb.com

    Yonghong Song
     
  • Fix build errors in kernel/bpf/verifier.c when CONFIG_NET is
    not enabled.

    ../kernel/bpf/verifier.c:3995:13: error: ‘btf_sock_ids’ undeclared here (not in a function); did you mean ‘bpf_sock_ops’?
    .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],

    ../kernel/bpf/verifier.c:3995:26: error: ‘BTF_SOCK_TYPE_SOCK_COMMON’ undeclared here (not in a function); did you mean ‘PTR_TO_SOCK_COMMON’?
    .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],

    Fixes: 1df8f55a37bd ("bpf: Enable bpf_skc_to_* sock casting helper to networking prog type")
    Signed-off-by: Randy Dunlap
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201007021613.13646-1-rdunlap@infradead.org

    Randy Dunlap
     

06 Oct, 2020

1 commit

  • Recent improvements in lockdep highlighted a potential A-A deadlock with
    pcpu_freelist in NMI:

    ./tools/testing/selftests/bpf/test_progs -t stacktrace_build_id_nmi

    [ 18.984807] ================================
    [ 18.984807] WARNING: inconsistent lock state
    [ 18.984808] 5.9.0-rc6-01771-g1466de1330e1 #2967 Not tainted
    [ 18.984809] --------------------------------
    [ 18.984809] inconsistent {INITIAL USE} -> {IN-NMI} usage.
    [ 18.984810] test_progs/1990 [HC2[2]:SC0[0]:HE0:SE1] takes:
    [ 18.984810] ffffe8ffffc219c0 (&head->lock){....}-{2:2}, at: __pcpu_freelist_pop+0xe3/0x180
    [ 18.984813] {INITIAL USE} state was registered at:
    [ 18.984814] lock_acquire+0x175/0x7c0
    [ 18.984814] _raw_spin_lock+0x2c/0x40
    [ 18.984815] __pcpu_freelist_pop+0xe3/0x180
    [ 18.984815] pcpu_freelist_pop+0x31/0x40
    [ 18.984816] htab_map_alloc+0xbbf/0xf40
    [ 18.984816] __do_sys_bpf+0x5aa/0x3ed0
    [ 18.984817] do_syscall_64+0x2d/0x40
    [ 18.984818] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 18.984818] irq event stamp: 12
    [...]
    [ 18.984822] other info that might help us debug this:
    [ 18.984823] Possible unsafe locking scenario:
    [ 18.984823]
    [ 18.984824] CPU0
    [ 18.984824] ----
    [ 18.984824] lock(&head->lock);
    [ 18.984826] <Interrupt>
    [ 18.984826] lock(&head->lock);
    [ 18.984827]
    [ 18.984828] *** DEADLOCK ***
    [ 18.984828]
    [ 18.984829] 2 locks held by test_progs/1990:
    [...]
    [ 18.984838] <NMI>
    [ 18.984838] dump_stack+0x9a/0xd0
    [ 18.984839] lock_acquire+0x5c9/0x7c0
    [ 18.984839] ? lock_release+0x6f0/0x6f0
    [ 18.984840] ? __pcpu_freelist_pop+0xe3/0x180
    [ 18.984840] _raw_spin_lock+0x2c/0x40
    [ 18.984841] ? __pcpu_freelist_pop+0xe3/0x180
    [ 18.984841] __pcpu_freelist_pop+0xe3/0x180
    [ 18.984842] pcpu_freelist_pop+0x17/0x40
    [ 18.984842] ? lock_release+0x6f0/0x6f0
    [ 18.984843] __bpf_get_stackid+0x534/0xaf0
    [ 18.984843] bpf_prog_1fd9e30e1438d3c5_oncpu+0x73/0x350
    [ 18.984844] bpf_overflow_handler+0x12f/0x3f0

    This is because pcpu_freelist_head.lock is accessed in both NMI and
    non-NMI context. Fix this issue by using raw_spin_trylock() in NMI.

    Since an NMI interrupts non-NMI context, when the NMI context tries to
    lock the raw_spinlock, the non-NMI context of the same CPU may already
    hold the lock and be blocked from unlocking it. For a system with N CPUs,
    there could be N NMIs at the same time, and they may block N non-NMI
    raw_spinlocks. This is tricky for pcpu_freelist_push(), where unlike
    _pop(), a failing _push() means leaking memory. This issue is more likely
    to trigger on a non-SMP system.

    Fix this issue with an extra list, pcpu_freelist.extralist. The extralist
    is primarily used to absorb _push() operations when raw_spin_trylock()
    fails on all the per-CPU lists. It should be empty most of the time. The
    following summarizes the behavior of pcpu_freelist in NMI and non-NMI
    context:

    non-NMI pop():  use _lock(); check per-CPU lists first;
                    if all per-CPU lists are empty, check extralist;
                    if extralist is empty, return NULL.

    non-NMI push(): use _lock(); only push to per-CPU lists.

    NMI pop():      use _trylock(); check per-CPU lists first;
                    if all per-CPU lists are locked or empty, check extralist;
                    if extralist is locked or empty, return NULL.

    NMI push():     use _trylock(); check per-CPU lists first;
                    if all per-CPU lists are locked, try push to extralist;
                    if extralist is also locked, keep trying on per-CPU lists.
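
    A condensed sketch of the NMI-side push logic described above (derived
    from the behavior summary, not a verbatim copy of
    kernel/bpf/percpu_freelist.c):

    static void pcpu_freelist_push_nmi(struct pcpu_freelist *s,
                                       struct pcpu_freelist_node *node)
    {
            int cpu, orig_cpu;

            orig_cpu = cpu = raw_smp_processor_id();
            while (1) {
                    struct pcpu_freelist_head *head;

                    head = per_cpu_ptr(s->freelist, cpu);
                    /* _trylock() only: a plain _lock() could deadlock
                     * against the non-NMI context this NMI interrupted. */
                    if (raw_spin_trylock(&head->lock)) {
                            node->next = head->first;
                            head->first = node;
                            raw_spin_unlock(&head->lock);
                            return;
                    }
                    cpu = cpumask_next(cpu, cpu_possible_mask);
                    if (cpu >= nr_cpu_ids)
                            cpu = 0;
                    /* One full round over the per-CPU lists failed: try the
                     * shared extralist before spinning over the CPUs again. */
                    if (cpu == orig_cpu &&
                        raw_spin_trylock(&s->extralist.lock)) {
                            node->next = s->extralist.first;
                            s->extralist.first = node;
                            raw_spin_unlock(&s->extralist.lock);
                            return;
                    }
            }
    }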

    Reported-by: Alexei Starovoitov
    Signed-off-by: Song Liu
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20201005165838.3735218-1-songliubraving@fb.com

    Song Liu
     

05 Oct, 2020

1 commit

  • Replace /* fallthrough */ comments with the new pseudo-keyword
    macro fallthrough [1].

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
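
    For illustration (a made-up switch, not a hunk from this commit), the
    conversion replaces the comment with the macro, which expands to
    __attribute__((fallthrough)) on compilers that support it:

    switch (opcode) {
    case BPF_ADD:
            res = a + b;
            fallthrough;    /* was: a bare comment the compiler can't verify */
    case BPF_SUB:
            res -= b;
            break;
    }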

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201002234217.GA12280@embeddedor

    Gustavo A. R. Silva
     

03 Oct, 2020

4 commits

  • We are missing a deref for the case when we are doing BPF_PROG_BIND_MAP
    on a map that's being already held by the program.
    There is 'if (ret) bpf_map_put(map)' below which doesn't trigger
    because we don't consider this an error.
    Let's add missing bpf_map_put() for this specific condition.
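
    A sketch of the fix in bpf_prog_bind_map() (kernel/bpf/syscall.c), with
    the surrounding context abbreviated:

    /* The program may already hold a reference to this map. */
    for (i = 0; i < prog->aux->used_map_cnt; i++) {
            if (used_maps_old[i] == map) {
                    bpf_map_put(map);  /* drop the ref taken just above */
                    goto out_unlock;
            }
    }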

    Fixes: ef15314aa5de ("bpf: Add BPF_PROG_BIND_MAP syscall")
    Reported-by: Alexei Starovoitov
    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201003002544.3601440-1-sdf@google.com

    Stanislav Fomichev
     
  • Add bpf_this_cpu_ptr() to help access percpu var on this cpu. This
    helper always returns a valid pointer, therefore no need to check
    returned value for NULL. Also note that all programs run with
    preemption disabled, which means that the returned pointer is stable
    during all the execution of the program.
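
    A hypothetical usage, modeled on the selftests for this helper
    (runqueues is a real percpu kernel variable; the program body is
    illustrative):

    extern const struct rq runqueues __ksym;

    SEC("raw_tp/sys_enter")
    int on_sys_enter(const void *ctx)
    {
            /* Always non-NULL; no check required. */
            struct rq *rq = bpf_this_cpu_ptr(&runqueues);

            bpf_printk("nr_running=%u", rq->nr_running);
            return 0;
    }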

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200929235049.2533242-6-haoluo@google.com

    Hao Luo
     
  • Add bpf_per_cpu_ptr() to help bpf programs access percpu vars.
    bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the kernel
    except that it may return NULL. This happens when the cpu parameter is
    out of range. So the caller must check the returned value.
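
    A hypothetical counterpart using bpf_per_cpu_ptr(), showing the required
    NULL check (again borrowing the real runqueues percpu variable):

    extern const struct rq runqueues __ksym;

    SEC("raw_tp/sys_enter")
    int read_cpu1(const void *ctx)
    {
            struct rq *rq = bpf_per_cpu_ptr(&runqueues, 1);

            if (!rq)        /* cpu 1 may be out of range */
                    return 0;
            bpf_printk("nr_running=%u", rq->nr_running);
            return 0;
    }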

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200929235049.2533242-5-haoluo@google.com

    Hao Luo
     
  • Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a
    ksym so that further dereferences on the ksym can use the BTF info
    to validate accesses. Internally, when seeing a pseudo_btf_id ld insn,
    the verifier reads the btf_id stored in the insn[0]'s imm field and
    marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND,
    which is encoded in btf_vmlinux by pahole. If the VAR is not of a struct
    type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID
    and the mem_size is resolved to the size of the VAR's type.

    From the VAR btf_id, the verifier can also read the address of the
    ksym's corresponding kernel var from kallsyms and use that to fill
    dst_reg.

    Therefore, the proper functionality of pseudo_btf_id depends on (1)
    kallsyms and (2) the encoding of kernel global VARs in pahole, which
    should be available since pahole v1.18.
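
    From the BPF C side (a sketch modeled on the selftests for this feature),
    a ksym variable is declared with the __ksym attribute and each reference
    compiles to a pseudo_btf_id ld_imm64:

    extern const struct rq runqueues __ksym;   /* struct type -> PTR_TO_BTF_ID */
    extern const int bpf_prog_active __ksym;   /* non-struct  -> PTR_TO_MEM */

    SEC("raw_tp/sys_enter")
    int handler(const void *ctx)
    {
            /* The verifier fills in the kallsyms addresses. */
            __u64 rq_addr = (__u64)&runqueues;
            __u64 active_addr = (__u64)&bpf_prog_active;

            bpf_printk("rq=%llx active=%llx", rq_addr, active_addr);
            return 0;
    }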

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com

    Hao Luo
     

01 Oct, 2020

2 commits

  • Currently, a perf event in a perf event array is removed from the array
    when the map fd used to add the event is closed. This behavior makes it
    difficult to share perf events with a perf event array.

    Introduce a new map flag, BPF_F_PRESERVE_ELEMS, that keeps the perf
    events open. With this flag set, perf events in the array are not
    removed when the original map fd is closed. Instead, the perf event will
    stay in the map until 1) it is explicitly removed from the array; or 2)
    the array is freed.
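
    A hypothetical BTF-style map definition using the new flag:

    struct {
            __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
            __uint(map_flags, BPF_F_PRESERVE_ELEMS);
            __uint(key_size, sizeof(int));
            __uint(value_size, sizeof(int));
            __uint(max_entries, 64);
    } events SEC(".maps");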

    Signed-off-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200930224927.1936644-2-songliubraving@fb.com

    Song Liu
     
  • With its use in BPF, the cookie generator can be called very frequently,
    in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
    and attached to the root cgroup, for example, when used in v1/v2 mixed
    environments. When there's a high churn on sockets in the system, there
    can be many parallel requests to the bpf_get_socket_cookie() and
    bpf_get_netns_cookie() helpers, which then cause contention on the
    atomic counter.

    As similarly done in f991bd2e1421 ("fs: introduce a per-cpu last_ino
    allocator"), add a small helper library that both can use for the 64 bit
    counters. Given this can be called from different contexts, we also need
    to deal with potential nested calls even though in practice they are
    considered extremely rare. One idea as suggested by Eric Dumazet was
    to use a reverse counter for this situation since we don't expect 64 bit
    overflows anyways; that way, we can avoid bigger gaps in the 64 bit
    counter space compared to just batch-wise increase. Even on machines
    with a small number of cores (e.g. 4), cookie generation shrinks from
    min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run
    in parallel from multiple CPUs.
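
    A condensed sketch of the scheme (per-CPU batches allocated forward from
    a shared counter, with the reverse counter absorbing rare nested calls);
    names follow the new include/linux/cookie.h but are abbreviated here:

    #define COOKIE_LOCAL_BATCH      4096

    struct pcpu_gen_cookie {
            local_t nesting;
            u64 last;
    } __aligned(16);

    struct gen_cookie {
            struct pcpu_gen_cookie __percpu *local;
            atomic64_t forward_last ____cacheline_aligned_in_smp;
            atomic64_t reverse_last;
    };

    static __always_inline u64 gen_cookie_next(struct gen_cookie *gc)
    {
            struct pcpu_gen_cookie *local = this_cpu_ptr(gc->local);
            u64 val;

            if (likely(local_inc_return(&local->nesting) == 1)) {
                    val = local->last;
                    if (unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
                            /* Refill this CPU's batch from the shared counter. */
                            s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
                                                           &gc->forward_last);
                            val = next - COOKIE_LOCAL_BATCH;
                    }
                    local->last = ++val;
            } else {
                    /* Nested call: allocate from the reverse counter to
                     * avoid gaps in the forward counter space. */
                    val = atomic64_dec_return(&gc->reverse_last);
            }
            local_dec(&local->nesting);
            return val;
    }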

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Acked-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net

    Daniel Borkmann
     

30 Sep, 2020

4 commits

  • Eelco reported we can't properly access arguments if the tracing
    program is attached to an extension program.

    Given the following program:

    SEC("classifier/test_pkt_md_access")
    int test_pkt_md_access(struct __sk_buff *skb)

    with its extension:

    SEC("freplace/test_pkt_md_access")
    int test_pkt_md_access_new(struct __sk_buff *skb)

    and tracing that extension with:

    SEC("fentry/test_pkt_md_access_new")
    int BPF_PROG(fentry, struct sk_buff *skb)

    It's not possible to access the skb argument in the fentry program;
    the verifier fails with the following error:

    ; int BPF_PROG(fentry, struct sk_buff *skb)
    0: (79) r1 = *(u64 *)(r1 +0)
    invalid bpf_context access off=0 size=8

    The problem is that btf_ctx_access gets the context type for the
    traced program, which in this case is the extension.

    But when we trace an extension program, we want to get the context
    type of the program that the extension is attached to, so we can
    access the argument properly in the trace program.

    This version of the patch is tweaked slightly from Jiri's original one,
    since the refactoring in the previous patches means we have to get the
    target prog type from the new variable in prog->aux instead of directly
    from the target prog.

    Reported-by: Eelco Chaudron
    Suggested-by: Jiri Olsa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355278.48470.17057040257274725638.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • This enables support for attaching freplace programs to multiple attach
    points. It does this by amending the UAPI for bpf_link_create with a target
    btf ID that can be used to supply the new attachment point along with the
    target program fd. The target must be compatible with the target that was
    supplied at program load time.

    The implementation reuses the checks that were factored out of
    check_attach_btf_id() to ensure compatibility between the BTF types of the
    old and new attachment. If these match, a new bpf_tracing_link will be
    created for the new attach target, allowing multiple attachments to
    co-exist simultaneously.

    The code could theoretically support multiple-attach of other types of
    tracing programs as well, but since I don't have a use case for any of
    those, there is no API support for doing so.
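
    A minimal sketch of re-attaching an freplace program via the extended
    UAPI (raw bpf(2) syscall; fds and btf_id are placeholders, error
    handling omitted):

    union bpf_attr attr = {};

    attr.link_create.prog_fd = freplace_prog_fd;      /* loaded freplace prog */
    attr.link_create.target_fd = new_target_prog_fd;  /* program to attach to */
    attr.link_create.target_btf_id = btf_id;          /* function within it */
    link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));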

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355169.48470.17165680973640685368.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • In preparation for allowing multiple attachments of freplace programs, move
    the references to the target program and trampoline into the
    bpf_tracing_link structure when that is created. To do this atomically,
    introduce a new mutex in prog->aux to protect writing to the two pointers
    to target prog and trampoline, and rename the members to make it clear that
    they are related.

    With this change, it is no longer possible to attach the same tracing
    program multiple times (detaching in-between), since the reference from the
    tracing program to the target disappears on the first attach. However,
    since the next patch will let the caller supply an attach target, that will
    also make it possible to attach to the same place multiple times.

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355059.48470.2503076992210324984.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • The Makefile in bpf/preload builds a local copy of libbpf, but does not
    properly clean up after itself. This can lead to subsequent compilation
    failures, since the feature detection cache is kept around, which can
    cause subsequent detection runs to fail.

    Fix this by properly setting clean-files, and while we're at it, also add a
    .gitignore for the directory to ignore the build artifacts.

    Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates bpffs.")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200927193005.8459-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

29 Sep, 2020

11 commits

  • A helper is added to allow seq file writing of kernel data
    structures using vmlinux BTF. Its signature is

    long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr,
                            u32 btf_ptr_size, u64 flags);

    Flags and struct btf_ptr definitions/use are identical to the
    bpf_snprintf_btf helper, and the helper returns 0 on success
    or a negative error value.
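
    A hypothetical task iterator using the helper:

    SEC("iter/task")
    int dump_task_struct(struct bpf_iter__task *ctx)
    {
            struct seq_file *seq = ctx->meta->seq;
            struct task_struct *task = ctx->task;
            struct btf_ptr ptr = {};

            if (!task)
                    return 0;
            ptr.ptr = task;
            ptr.type_id = __builtin_btf_type_id(struct task_struct, 1);
            return bpf_seq_printf_btf(seq, &ptr, sizeof(ptr), 0);
    }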

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-8-git-send-email-alan.maguire@oracle.com

    Alan Maguire
     
  • BPF iter size is limited to PAGE_SIZE; if we wish to display BTF-based
    representations of larger kernel data structures such as task_struct,
    this will be insufficient.

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-6-git-send-email-alan.maguire@oracle.com

    Alan Maguire
     
  • A helper is added to support tracing kernel type information in BPF
    using the BPF Type Format (BTF). Its signature is

    long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
                          u32 btf_ptr_size, u64 flags);

    struct btf_ptr * specifies:

    - a pointer to the data to be traced;
    - the BTF id of the type of data pointed to; and
    - a flags field for future use. These flags are not to be confused
      with the BTF_F_* flags below that control how the btf_ptr is
      displayed; the flags member of struct btf_ptr may be used to
      disambiguate types in kernel versus module BTF, etc. The main
      distinction is that these flags relate to the type and the
      information needed to identify it, not to how it is displayed.

    For example, a BPF program with a struct sk_buff *skb
    could do the following:

    static struct btf_ptr b = { };

    b.ptr = skb;
    b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
    bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0);

    Default output looks like this:

    (struct sk_buff){
     .transport_header = (__u16)65535,
     .mac_header = (__u16)65535,
     .end = (sk_buff_data_t)192,
     .head = (unsigned char *)0x000000007524fd8b,
     .data = (unsigned char *)0x000000007524fd8b,
     .truesize = (unsigned int)768,
     .users = (refcount_t){
      .refs = (atomic_t){
       .counter = (int)1,
      },
     },
    }

    Flags modifying display are as follows:

    - BTF_F_COMPACT: no formatting around type information
    - BTF_F_NONAME: no struct/union member names/types
    - BTF_F_PTR_RAW: show raw (unobfuscated) pointer values;
      equivalent to %px
    - BTF_F_ZERO: show zero-valued struct/union members;
      they are not displayed by default

    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com

    Alan Maguire
     
  • Generalize the "seq_show" seq file support in btf.c to support
    a generic show callback, of which we support two instances: the
    current seq file show, and a show with snprintf() behaviour which
    instead writes the type data to a supplied string.

    Both classes of show function call btf_type_show() with different
    targets: the seq file or the string to be written. In the string
    case we need to track additional data - the length left in the string
    to write, and the length we would have written (a la snprintf).

    By default, show will display type information, field members and
    their types and values, etc., and the information is indented
    based upon structure depth. Zeroed fields are omitted.

    Show, however, supports flags which modify its behaviour:

    BTF_SHOW_COMPACT - suppress newline/indent.
    BTF_SHOW_NONAME - suppress show of type and member names.
    BTF_SHOW_PTR_RAW - do not obfuscate pointer values.
    BTF_SHOW_UNSAFE - do not copy data to safe buffer before display.
    BTF_SHOW_ZERO - show zeroed values (by default they are not shown).

    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-3-git-send-email-alan.maguire@oracle.com

    Alan Maguire
     
  • It will be used later for BPF structure display support.

    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-2-git-send-email-alan.maguire@oracle.com

    Alan Maguire
     
  • The check_attach_btf_id() function really does three things:

    1. It performs a bunch of checks on the program to ensure that the
       attachment is valid.

    2. It stores a bunch of state about the attachment being requested in
       the verifier environment and struct bpf_prog objects.

    3. It allocates a trampoline for the attachment.

    This patch splits out (1.) and (3.) into separate functions which will
    perform the checks, but return the computed values instead of directly
    modifying the environment. This is done in preparation for reusing the
    checks when the actual attachment is happening, which will allow tracing
    programs to have multiple (compatible) attachments.

    This also fixes a bug where a bunch of checks were skipped if a trampoline
    already existed for the tracing target.

    Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
    Fixes: 1e6c62a88215 ("bpf: Introduce sleepable BPF programs")
    Acked-by: Andrii Nakryiko
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov

    Toke Høiland-Jørgensen
     
  • In preparation for moving code around, change a bunch of references to
    env->log (and the verbose() logging helper) to use bpf_log() and a direct
    pointer to struct bpf_verifier_log. While we're touching the function
    signature, mark the 'prog' argument to bpf_check_type_match() as const.

    Also enhance the bpf_verifier_log_needed() check to handle NULL pointers
    for the log struct so we can re-use the code with logging disabled.

    Acked-by: Andrii Nakryiko
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov

    Toke Høiland-Jørgensen
     
  • From the checks and commit messages for modify_return, it seems it was
    never the intention that it should be possible to attach a tracing program
    with expected_attach_type == BPF_MODIFY_RETURN to another BPF program.
    However, check_attach_modify_return() will only look at the function name,
    so if the target function starts with "security_", the attach will be
    allowed even for bpf2bpf attachment.

    Fix this oversight by also blocking the modification if a target program is
    supplied.

    Fixes: 18644cec714a ("bpf: Fix use-after-free in fmod_ret check")
    Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
    Acked-by: Andrii Nakryiko
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov

    Toke Høiland-Jørgensen
     
  • Allow passing a pointer to a BTF struct sock_common* when updating
    a sockmap or sockhash. Since BTF pointers can fault and therefore be
    NULL at runtime we need to add an additional !sk check to
    sock_map_update_elem. Since we may be passed a request or timewait
    socket we also need to check sk_fullsock. Doing this allows calling
    map_update_elem on sockmap from bpf_iter context, which uses
    BTF pointers.

    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200928090805.23343-2-lmb@cloudflare.com

    Lorenz Bauer
     
  • Get rid of bpf_cpu_map_entry pointer in cpu_map_build_skb routine
    signature since it is no longer needed.

    Signed-off-by: Lorenzo Bianconi
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/33cb9b7dc447de3ea6fd6ce713ac41bca8794423.1601292015.git.lorenzo@kernel.org

    Lorenzo Bianconi
     
  • Add .test_run for raw_tracepoint. Also, introduce a new feature that runs
    the target program on a specific CPU. This is achieved by a new flag in
    bpf_attr.test, BPF_F_TEST_RUN_ON_CPU. When this flag is set, the program
    is triggered on the cpu with id bpf_attr.test.cpu. This feature is needed
    for BPF programs that handle perf_event and other percpu resources, as the
    program can access these resources locally.
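
    A hypothetical invocation via the raw syscall; the field names follow
    the bpf_attr.test extension described above:

    union bpf_attr attr = {};

    attr.test.prog_fd = prog_fd;
    attr.test.flags = BPF_F_TEST_RUN_ON_CPU;
    attr.test.cpu = 2;      /* trigger the program on CPU 2 */
    err = syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr, sizeof(attr));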

    Signed-off-by: Song Liu
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200925205432.1777-2-songliubraving@fb.com

    Song Liu
     

26 Sep, 2020

4 commits

  • In the BPF_AND and BPF_OR alu cases we have this pattern when the src
    and dst tnums are constants:

    1 dst_reg->var_off = tnum_[op](dst_reg->var_off, src_reg.var_off)
    2 scalar32_min_max_[op]
    3    if (known) return
    4 scalar_min_max_[op]
    5    if (known)
    6       __mark_reg_known(dst_reg,
                             dst_reg->var_off.value [op] src_reg.var_off.value)

    The result is that in 1 we calculate the var_off value and store it in
    the dst_reg. Then in 6 we duplicate this logic, doing the op again on
    the value.

    The duplication comes from the tnum_[op] handlers because they have
    already done the value calculation. For example, this is tnum_and():

    struct tnum tnum_and(struct tnum a, struct tnum b)
    {
            u64 alpha, beta, v;

            alpha = a.value | a.mask;
            beta = b.value | b.mask;
            v = a.value & b.value;
            return TNUM(v, alpha & beta & ~v);
    }

    So let's remove the redundant op calculation. It's confusing for readers
    and unnecessary. It's also not harmful because those ops have the
    property r1 & r1 = r1 and r1 | r1 = r1.

    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    John Fastabend
     
  • This patch changes the bpf_sk_storage_*() helpers to take
    ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will also work with the
    pointers returned by the bpf_skc_to_*() helpers.

    A micro benchmark has been done on a "cgroup_skb/egress" bpf program
    which does a bpf_sk_storage_get(). It was driven by netperf doing
    a 4096 connected UDP_STREAM test with 64-byte packets.
    The stats from "kernel.bpf_stats_enabled" show no meaningful difference.

    The sk_storage_get_btf_proto, sk_storage_delete_btf_proto,
    btf_sk_storage_get_proto, and btf_sk_storage_delete_proto are
    no longer needed, so they are removed.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Lorenz Bauer
    Link: https://lore.kernel.org/bpf/20200925000402.3856307-1-kafai@fb.com

    Martin KaFai Lau
     
  • There is a constant need to add more fields to bpf_tcp_sock
    for the bpf programs running at tc, sock_ops, etc.

    A current workaround could be to use bpf_probe_read_kernel(). However,
    other than requiring another helper call to read each field and missing
    CO-RE, it is also not as intuitive as directly reading
    "tp->lsndtime", for example. Since the program already needs the perfmon
    cap to do bpf_probe_read_kernel(), it will be much easier if the bpf
    prog can directly read from the tcp_sock.

    This patch tries to do that by using the existing casting-helpers
    bpf_skc_to_*() whose func_proto returns a btf_id. For example, the
    func_proto of bpf_skc_to_tcp_sock returns the btf_id of the
    kernel "struct tcp_sock".

    These helpers are also added to is_ptr_cast_function(), which ensures
    the returned reg (BPF_REG_0) also carries the ref_obj_id. That keeps
    the ref-tracking working properly.

    The bpf_skc_to_* helpers are made available to most of the bpf prog
    types in filter.c. The bpf_skc_to_* helpers will be limited by
    perfmon cap.

    This patch adds an ARG_PTR_TO_BTF_ID_SOCK_COMMON. A helper accepting
    this arg can accept a btf-id-ptr (PTR_TO_BTF_ID + &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON])
    or a legacy-ctx-convert-skc-ptr (PTR_TO_SOCK_COMMON). The bpf_skc_to_*()
    helpers are changed to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that
    they will accept pointer obtained from skb->sk.

    Instead of specifying both arg_type and arg_btf_id in the same func_proto
    which is how the current ARG_PTR_TO_BTF_ID does, the arg_btf_id of
    the new ARG_PTR_TO_BTF_ID_SOCK_COMMON is specified in the
    compatible_reg_types[] in verifier.c. The reason is the arg_btf_id is
    always the same. Discussion in this thread:
    https://lore.kernel.org/bpf/20200922070422.1917351-1-kafai@fb.com/

    The ARG_PTR_TO_BTF_ID_ part gives a clear expectation that the helper is
    expecting a PTR_TO_BTF_ID which could be NULL. This is the same
    behavior as the existing helper taking ARG_PTR_TO_BTF_ID.

    The _SOCK_COMMON part means the helper is also expecting the legacy
    SOCK_COMMON pointer.

    By excluding the _OR_NULL part, the bpf prog cannot call helper
    with a literal NULL which doesn't make sense in most cases.
    e.g. bpf_skc_to_tcp_sock(NULL) will be rejected. All PTR_TO_*_OR_NULL
    reg has to do a NULL check first before passing into the helper or else
    the bpf prog will be rejected. This behavior is nothing new and
    consistent with the current expectation during bpf-prog-load.

    [ ARG_PTR_TO_BTF_ID_SOCK_COMMON will be used to replace
    ARG_PTR_TO_SOCK* of other existing helpers later such that
    those existing helpers can take the PTR_TO_BTF_ID returned by
    the bpf_skc_to_*() helpers.

    The only special case is bpf_sk_lookup_assign() which can accept a
    literal NULL ptr. It has to be handled specially in another follow
    up patch if there is a need (e.g. by renaming ARG_PTR_TO_SOCKET_OR_NULL
    to ARG_PTR_TO_BTF_ID_SOCK_COMMON_OR_NULL). ]

    [ When converting the older helpers that take ARG_PTR_TO_SOCK* in
    the later patch, if the kernel does not support BTF,
    ARG_PTR_TO_BTF_ID_SOCK_COMMON will behave like ARG_PTR_TO_SOCK_COMMON
    because no reg->type could have PTR_TO_BTF_ID in this case.

    It is not a concern for the newer-btf-only helper like the bpf_skc_to_*()
    here though because these helpers must require BTF vmlinux to begin
    with. ]

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200925000350.3855720-1-kafai@fb.com

    Martin KaFai Lau
     
  • check_reg_type() checks whether a reg can be used as an arg of a
    func_proto. For PTR_TO_BTF_ID, the check is not actually complete
    until reg->btf_id points to a kernel struct that is acceptable to
    the func_proto.

    Thus, this patch moves the btf_id check into check_reg_type().
    "arg_type" and "arg_btf_id" are passed to check_reg_type() instead of
    "compatible". The compatible_reg_types[] usage is localized in
    check_reg_type() now.

    The "if (!btf_id) verbose(...); " is also removed since it won't happen.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Lorenz Bauer
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200925000344.3854828-1-kafai@fb.com

    Martin KaFai Lau
     

24 Sep, 2020

4 commits

  • …nel/git/paulmck/linux-rcu into bpf-next

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

    Alexei Starovoitov
     
  • This reverts commit 31f23a6a181c81543b10a1a9056b0e6c7ef1c747.

    This change made many selftests/bpf tests flaky: flow_dissector,
    sk_lookup, sk_assign and others. There was no issue in the code.

    Signed-off-by: Alexei Starovoitov

    Alexei Starovoitov
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-09-23

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 95 non-merge commits during the last 22 day(s) which contain
    a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

    The main changes are:

    1) Full multi function support in libbpf, from Andrii.

    2) Refactoring of function argument checks, from Lorenz.

    3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.

    4) Program metadata support, from YiFei.

    5) bpf iterator optimizations, from Yonghong.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Arrays with designated initializers have an implicit length of the highest
    initialized index plus one. I used this to ensure that newly added entries
    in enum bpf_reg_type get a NULL entry in compatible_reg_types.

    This is difficult to understand since it requires knowledge of the
    peculiarities of designated initializers. Use __BPF_ARG_TYPE_MAX to size
    the array instead.

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200923160156.80814-1-lmb@cloudflare.com

    Lorenz Bauer
     

23 Sep, 2020

3 commits

  • Two minor conflicts:

    1) net/ipv4/route.c, adding a new local variable while
    moving another local variable and removing its
    initial assignment.

    2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
    One pretty prints the port mode differently, whilst another
    changes the driver to try and obtain the port mode from
    the port node rather than the switch node.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from Jakub Kicinski:

    - fix failure to add bond interfaces to a bridge, the offload-handling
    code was too defensive there and recent refactoring unearthed that.
    Users complained (Ido)

    - fix unnecessarily reflecting ECN bits within TOS values / QoS marking
    in TCP ACK and reset packets (Wei)

    - fix a deadlock with bpf iterator. Hopefully we're in the clear on
    this front now... (Yonghong)

    - BPF fix for clobbering r2 in bpf_gen_ld_abs (Daniel)

    - fix AQL on mt76 devices with FW rate control and add a couple of AQL
    issues in mac80211 code (Felix)

    - fix authentication issue with mwifiex (Maximilian)

    - WiFi connectivity fix: revert IGTK support in ti/wlcore (Mauro)

    - fix exception handling for multipath routes via same device (David
    Ahern)

    - revert back to a BH spin lock flavor for nsid_lock: there are paths
    which do require the BH context protection (Taehee)

    - fix interrupt / queue / NAPI handling in the lantiq driver (Hauke)

    - fix ife module load deadlock (Cong)

    - make an adjustment to netlink reply message type for code added in
    this release (the sole change touching uAPI here) (Michal)

    - a number of fixes for small NXP and Microchip switches (Vladimir)

    [ Pull request acked by David: "you can expect more of this in the
    future as I try to delegate more things to Jakub" ]

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (167 commits)
    net: mscc: ocelot: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries
    net: dsa: seville: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries
    net: dsa: felix: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries
    inet_diag: validate INET_DIAG_REQ_PROTOCOL attribute
    net: bridge: br_vlan_get_pvid_rcu() should dereference the VLAN group under RCU
    net: Update MAINTAINERS for MediaTek switch driver
    net/mlx5e: mlx5e_fec_in_caps() returns a boolean
    net/mlx5e: kTLS, Avoid kzalloc(GFP_KERNEL) under spinlock
    net/mlx5e: kTLS, Fix leak on resync error flow
    net/mlx5e: kTLS, Add missing dma_unmap in RX resync
    net/mlx5e: kTLS, Fix napi sync and possible use-after-free
    net/mlx5e: TLS, Do not expose FPGA TLS counter if not supported
    net/mlx5e: Fix using wrong stats_grps in mlx5e_update_ndo_stats()
    net/mlx5e: Fix multicast counter not up-to-date in "ip -s"
    net/mlx5e: Fix endianness when calculating pedit mask first bit
    net/mlx5e: Enable adding peer miss rules only if merged eswitch is supported
    net/mlx5e: CT: Fix freeing ct_label mapping
    net/mlx5e: Fix memory leak of tunnel info when rule under multipath not ready
    net/mlx5e: Use synchronize_rcu to sync with NAPI
    net/mlx5e: Use RCU to protect rq->xdp_prog
    ...

    Linus Torvalds
     
  • Pull tracing fixes from Steven Rostedt:

    - Check kprobe is enabled before unregistering from ftrace as it isn't
    registered when disabled.

    - Remove kprobes enabled via command-line that is on init text when
    freed.

    - Add missing RCU synchronization for ftrace trampoline symbols removed
    from kallsyms.

    - Free trampoline on error path if ftrace_startup() fails.

    - Give more space for the longer PID numbers in trace output.

    - Fix a possible double free in the histogram code.

    - A couple of fixes that were discovered by sparse.

    * tag 'trace-v5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    bootconfig: init: make xbc_namebuf static
    kprobes: tracing/kprobes: Fix to kill kprobes on initmem after boot
    tracing: fix double free
    ftrace: Let ftrace_enable_sysctl take a kernel pointer buffer
    tracing: Make the space reserved for the pid wider
    ftrace: Fix missing synchronize_rcu() removing trampoline from kallsyms
    ftrace: Free the trampoline when ftrace_startup() fails
    kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()

    Linus Torvalds
     

22 Sep, 2020

4 commits

  • Some kernel builds might inline the vfs_getattr call within the fstat
    syscall code path, so the fentry/vfs_getattr trampoline is not called.

    Add security_inode_getattr to the allowlist and switch the d_path test
    stat trampoline to security_inode_getattr.

    Keep dentry_open and filp_close, because they are in their own files
    and thus unlikely to be inlined; but in case they are, also add
    security_file_open.

    Add flags that indicate the trampolines were called, and fail the
    test if any of them got missed, so it's easier to identify the
    issue next time.

    Fixes: e4d1af4b16f8 ("selftests/bpf: Add test for d_path helper")
    Suggested-by: Alexei Starovoitov
    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200918112338.2618444-1-jolsa@kernel.org

    Jiri Olsa
     
  • The mapping between bpf_arg_type and bpf_reg_type is encoded in a big
    hairy if statement that is hard to follow. The debug output also leaves
    something to be desired: if a reg_type doesn't match, we only print one
    of the options instead of printing all the valid ones.

    Convert the if statement into a table which is then used to drive type
    checking. If none of the reg_types match we print all options, e.g.:

    R2 type=rdonly_buf expected=fp, pkt, pkt_meta, map_value
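
    An abbreviated sketch of the table (the real one in kernel/bpf/verifier.c
    covers every argument type):

    struct bpf_reg_types {
            const enum bpf_reg_type types[10];
    };

    static const struct bpf_reg_types map_key_value_types = {
            .types = {
                    PTR_TO_STACK,
                    PTR_TO_PACKET,
                    PTR_TO_PACKET_META,
                    PTR_TO_MAP_VALUE,
            },
    };

    static const struct bpf_reg_types *compatible_reg_types[] = {
            [ARG_PTR_TO_MAP_KEY]    = &map_key_value_types,
            [ARG_PTR_TO_MAP_VALUE]  = &map_key_value_types,
            /* ... one entry per bpf_arg_type ... */
    };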

    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200921121227.255763-12-lmb@cloudflare.com

    Lorenz Bauer
     
  • check_func_arg has a plethora of weird if statements with empty branches.
    They work around the fact that *_OR_NULL argument types should accept a
    SCALAR_VALUE register, as long as its value is 0. These statements make
    it difficult to reason about the type checking logic.

    Instead, skip the more detailed type checking logic iff the register is 0
    and the function expects a nullable type. This allows simplifying the type
    checking itself.

    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200921121227.255763-11-lmb@cloudflare.com

    Lorenz Bauer
     
  • Move the check for PTR_TO_MAP_VALUE to check_func_arg, where all other
    checking is done as well. Move the invocation of process_spin_lock away
    from the register type checking, to allow a future refactoring.

    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200921121227.255763-10-lmb@cloudflare.com

    Lorenz Bauer