30 Sep, 2020

1 commit

  • In preparation for allowing multiple attachments of freplace programs, move
    the references to the target program and trampoline into the
    bpf_tracing_link structure when that is created. To do this atomically,
    introduce a new mutex in prog->aux to protect writing to the two pointers
    to target prog and trampoline, and rename the members to make it clear that
    they are related.

    With this change, it is no longer possible to attach the same tracing
    program multiple times (detaching in-between), since the reference from the
    tracing program to the target disappears on the first attach. However,
    since the next patch will let the caller supply an attach target, that will
    also make it possible to attach to the same place multiple times.
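
    For illustration, a rough sketch of the resulting layout (exact member
    names may differ from the patch):

        struct bpf_tracing_link {
                struct bpf_link link;
                struct bpf_trampoline *trampoline;  /* ref moved here on attach */
                struct bpf_prog *tgt_prog;          /* ref to the target program */
        };

        struct bpf_prog_aux {
                /* ... other members ... */
                struct mutex dst_mutex;             /* protects the two pointers below */
                struct bpf_prog *dst_prog;          /* default attach target */
                struct bpf_trampoline *dst_trampoline;
        };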

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355059.48470.2503076992210324984.stgit@toke.dk

    Toke Høiland-Jørgensen
     

29 Sep, 2020

1 commit

  • The check_attach_btf_id() function really does three things:

    1. It performs a bunch of checks on the program to ensure that the
    attachment is valid.

    2. It stores a bunch of state about the attachment being requested in
    the verifier environment and struct bpf_prog objects.

    3. It allocates a trampoline for the attachment.

    This patch splits out (1.) and (3.) into separate functions which will
    perform the checks, but return the computed values instead of directly
    modifying the environment. This is done in preparation for reusing the
    checks when the actual attachment is happening, which will allow tracing
    programs to have multiple (compatible) attachments.
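
    A rough sketch of the resulting shape (function and struct names are
    approximate, not necessarily those used in the patch):

        /* (1) pure checks: compute the attach target info, do not modify
         * the verifier environment or the prog
         */
        int bpf_check_attach_target(struct bpf_verifier_log *log,
                                    const struct bpf_prog *prog,
                                    const struct bpf_prog *tgt_prog,
                                    u32 btf_id,
                                    struct bpf_attach_target_info *tgt_info);

        /* (3) allocate (or look up) the trampoline for the computed target */
        struct bpf_trampoline *bpf_trampoline_get(u64 key,
                                                  struct bpf_attach_target_info *tgt_info);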

    This also fixes a bug where a bunch of checks were skipped if a trampoline
    already existed for the tracing target.

    Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
    Fixes: 1e6c62a88215 ("bpf: Introduce sleepable BPF programs")
    Acked-by: Andrii Nakryiko
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov

    Toke Høiland-Jørgensen
     

01 Sep, 2020

1 commit

  • Technically, BPF programs can sleep while attached to bpf_lsm_file_mprotect,
    but such programs need to access user memory, so they fall into the
    might_fault() category. That means they cannot be called from the
    file_mprotect LSM hook, which takes a write lock on mm->mmap_lock.
    Adjust the test accordingly.

    Also add might_fault() to __bpf_prog_enter_sleepable() to catch such deadlocks early.
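
    Roughly, the sleepable entry path then looks like this (sketch):

        void notrace __bpf_prog_enter_sleepable(void)
        {
                rcu_read_lock_trace();
                migrate_disable();
                might_fault();  /* complain early if called from a context
                                 * that cannot fault, e.g. under mmap_lock
                                 */
        }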

    Fixes: 1e6c62a88215 ("bpf: Introduce sleepable BPF programs")
    Fixes: e68a144547fc ("selftests/bpf: Add sleepable tests")
    Reported-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200831201651.82447-1-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

29 Aug, 2020

1 commit

  • Introduce sleepable BPF programs that can request this property for themselves
    via the BPF_F_SLEEPABLE flag at program load time. In such a case they will be
    able to use helpers like bpf_copy_from_user() that might sleep. At present only
    fentry/fexit/fmod_ret and lsm programs can request to be sleepable, and only
    when they are attached to kernel functions that are known to allow sleeping.

    The non-sleepable programs rely on implicit rcu_read_lock() and
    migrate_disable() to protect the lifetime of the programs, the maps they use
    and the per-cpu kernel structures used to pass info between bpf programs and
    the kernel. The sleepable programs cannot be enclosed in rcu_read_lock().
    migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
    should not be enclosed in migrate_disable() either. Therefore
    rcu_read_lock_trace is used to protect the lifetime of sleepable progs.

    There are many networking and tracing program types. In many cases the
    'struct bpf_prog *' pointer itself is RCU protected within some other kernel
    data structure, and the kernel code uses rcu_dereference() to load that
    program pointer and call BPF_PROG_RUN() on it. None of these cases is touched.
    Instead, sleepable bpf programs are allowed with the bpf trampoline only. The
    program pointers are hard-coded into the generated assembly of the bpf
    trampoline and synchronize_rcu_tasks_trace() is used to protect the lifetime
    of the program. The same trampoline can hold both sleepable and non-sleepable
    progs.

    When rcu_read_lock_trace is held it means that some sleepable bpf program is
    running from a bpf trampoline. Those programs can use bpf arrays and
    preallocated hash/lru maps. These map types wait for programs to complete via
    synchronize_rcu_tasks_trace().

    Updates to a trampoline now have to do synchronize_rcu_tasks_trace() and
    synchronize_rcu_tasks() to wait for sleepable progs to finish and for the
    trampoline assembly to finish.

    This is the first step of introducing sleepable progs. Eventually dynamically
    allocated hash maps can be allowed and networking program types can become
    sleepable too.
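
    For illustration, a minimal sleepable program sketch using current libbpf
    conventions (the ".s" section suffix requests BPF_F_SLEEPABLE; the attach
    point and buffer handling are hypothetical):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        char buf[16];

        SEC("fentry.s/do_sys_open")             /* hypothetical attach point */
        int BPF_PROG(probe_open, int dfd, const char *filename)
        {
                /* sleepable helper: may fault in user memory */
                bpf_copy_from_user(buf, sizeof(buf), filename);
                return 0;
        }

        char _license[] SEC("license") = "GPL";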

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Josef Bacik
    Acked-by: Andrii Nakryiko
    Acked-by: KP Singh
    Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

30 Mar, 2020

1 commit

  • JITed BPF programs are dynamically attached to the LSM hooks
    using BPF trampolines. The trampoline prologue generates code to handle
    conversion of the signature of the hook to the appropriate BPF context.

    The allocated trampoline programs are attached to the nop functions
    initialized as LSM hooks.

    BPF_PROG_TYPE_LSM programs must have a GPL-compatible license and
    need CAP_SYS_ADMIN (required for loading eBPF programs).

    Upon attachment:

    * A BPF fexit trampoline is used for LSM hooks with a void return type.
    * A BPF fmod_ret trampoline is used for LSM hooks which return an
    int. The attached programs can override the return value of the
    bpf LSM hook to indicate a MAC Policy decision.
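
    A minimal sketch of such a program from the BPF side (libbpf conventions):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        /* file_mprotect returns an int, so this goes through the fmod_ret-style
         * trampoline; returning non-zero vetoes the operation.
         */
        SEC("lsm/file_mprotect")
        int BPF_PROG(mprotect_hook, struct vm_area_struct *vma,
                     unsigned long reqprot, unsigned long prot, int ret)
        {
                if (ret != 0)
                        return ret;   /* an earlier program/LSM already denied it */
                return 0;             /* 0 == allow */
        }

        char _license[] SEC("license") = "GPL";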

    Signed-off-by: KP Singh
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Brendan Jackman
    Reviewed-by: Florent Revest
    Acked-by: Andrii Nakryiko
    Acked-by: James Morris
    Link: https://lore.kernel.org/bpf/20200329004356.27286-5-kpsingh@chromium.org

    KP Singh
     

14 Mar, 2020

3 commits

  • Sparse reports a warning at __bpf_prog_enter() and __bpf_prog_exit():

    warning: context imbalance in __bpf_prog_enter() - wrong count at exit
    warning: context imbalance in __bpf_prog_exit() - unexpected unlock

    The root cause is the missing annotations at __bpf_prog_enter()
    and __bpf_prog_exit().

    Add the missing __acquires(RCU) and __releases(RCU) annotations.
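
    The annotations end up looking roughly like this:

        u64 notrace __bpf_prog_enter(void)
                __acquires(RCU)
        {
                rcu_read_lock();
                /* ... */
        }

        void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start)
                __releases(RCU)
        {
                /* ... */
                rcu_read_unlock();
        }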

    Signed-off-by: Jules Irenge
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200311010908.42366-2-jbi.octave@gmail.com

    Jules Irenge
     
  • Now that we have all the objects (bpf_prog, bpf_trampoline,
    bpf_dispatcher) linked in bpf_tree, there's no need to have
    a separate bpf_image tree for images.

    Reverting the bpf_image tree together with struct bpf_image,
    because it's no longer needed.

    Also removing the bpf_image_alloc function and restoring the original
    bpf_jit_alloc_exec_page interface instead.

    The kernel_text_address function can now rely only on is_bpf_text_address,
    because it checks the bpf_tree that contains all the objects.

    Keeping bpf_image_ksym_add and bpf_image_ksym_del because they are
    useful wrappers around perf's ksymbol interface calls.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200312195610.346362-13-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Adding trampolines to kallsyms. It's displayed as
    bpf_trampoline_<ID> [bpf]

    where <ID> is the BTF id of the trampoline function.

    Adding bpf_image_ksym_add/del functions that set up
    the start/end values and call the KSYMBOL perf event
    handlers.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200312195610.346362-11-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     

05 Mar, 2020

2 commits

  • When multiple programs are attached, each program receives the return
    value from the previous program on the stack and the last program
    provides the return value to the attached function.

    The fmod_ret bpf programs are run after the fentry programs and before
    the fexit programs. The original function is only called if all the
    fmod_ret programs return 0, to avoid any unintended side-effects. The
    success value, i.e. 0, is not currently configurable, but it could be made
    so, with user space specifying it at load time.

    For example:

    int func_to_be_attached(int a, int b)
    {  <--- do_fentry

    do_fmod_ret:
        <update ret by calling fmod_ret programs>
        if (ret != 0)
            goto do_fexit;

    original_function:

        <side_effects_happen_here>

    do_fexit:
        <update ret by calling fexit programs>
        return ret;
    }

    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200304191853.1529-4-kpsingh@chromium.org

    KP Singh
     
  • As we need to introduce a third type of attachment for trampolines, the
    flattened signature of arch_prepare_bpf_trampoline gets even more
    complicated.

    Refactor the prog and count arguments of arch_prepare_bpf_trampoline into a
    bpf_tramp_progs structure to simplify the addition of, and accounting for,
    new attachment types.
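
    A sketch of the refactored interface (the array size constant name is
    approximate):

        struct bpf_tramp_progs {
                struct bpf_prog *progs[BPF_MAX_TRAMP_PROGS];
                int nr_progs;
        };

        int arch_prepare_bpf_trampoline(void *image, void *image_end,
                                        const struct btf_func_model *m, u32 flags,
                                        struct bpf_tramp_progs *tprogs,
                                        void *orig_call);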

    Signed-off-by: KP Singh
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200304191853.1529-2-kpsingh@chromium.org

    KP Singh
     

25 Feb, 2020

1 commit

  • Use migrate_disable()/migrate_enable() instead of preemption disable/enable
    to reflect the purpose. This allows PREEMPT_RT to substitute it with an
    actual migration-disable implementation. On non-RT kernels this is still
    mapped to preempt_disable()/enable().
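
    Roughly, the enter/exit pair around trampoline-invoked programs becomes
    (sketch, stats handling elided):

        u64 notrace __bpf_prog_enter(void)
        {
                rcu_read_lock();
                migrate_disable();      /* was preempt_disable(); maps back to it
                                         * on non-RT kernels
                                         */
                /* ... */
        }

        void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start)
        {
                /* ... */
                migrate_enable();       /* was preempt_enable() */
                rcu_read_unlock();
        }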

    Signed-off-by: David S. Miller
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.891428873@linutronix.de

    David Miller
     

25 Jan, 2020

1 commit

  • When unwinding the stack we need to identify each address
    to successfully continue. Adding a latch tree to keep trampolines
    for quick lookup during the unwind.

    The patch uses the first 48 bytes of the page for the latch tree node,
    leaving 4048 bytes of the rest of the page for trampoline- or
    dispatcher-generated code.

    That is still enough not to affect the maximum counts of trampoline
    and dispatcher progs.
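
    The layout is roughly:

        struct bpf_image {
                struct latch_tree_node tnode;   /* 48 bytes on x86-64 */
                u8 data[];                      /* trampoline/dispatcher code */
        };

        #define BPF_IMAGE_SIZE  (PAGE_SIZE - sizeof(struct bpf_image))  /* 4048 */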

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200123161508.915203-3-jolsa@kernel.org

    Jiri Olsa
     

23 Jan, 2020

1 commit

  • Introduce dynamic program extensions. The users can load additional BPF
    functions and replace global functions in previously loaded BPF programs while
    these programs are executing.

    Global functions are verified individually by the verifier based on their types only.
    Hence a global function in the new program whose type matches an older function can
    safely replace that corresponding function.

    This new function/program is called 'an extension' of the old program. At load time
    the verifier uses the (attach_prog_fd, attach_btf_id) pair to identify the function
    to be replaced. The extension program's type is derived from the target program;
    technically, bpf_verifier_ops is copied from the target program. The
    BPF_PROG_TYPE_EXT program type is a placeholder: it has empty verifier_ops.
    The extension program can call the same bpf helper functions as the target program.
    A single BPF_PROG_TYPE_EXT type is used to extend XDP, SKB and all other program
    types. The verifier allows only one level of replacement, meaning that an
    extension program cannot recursively extend an extension. That also means that
    the maximum stack size increases from 512 to 1024 bytes and the maximum
    function nesting level from 8 to 16. The programs don't always consume that
    much. The stack usage is determined by the number of on-stack variables used by
    the program. The verifier could have enforced a 512-byte limit for the combined
    original plus extension program, but that makes for a difficult user experience.
    The main use case for extensions is to provide a generic mechanism to plug
    external programs into a policy program or into function call chaining.
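
    A minimal extension program sketch using current libbpf conventions (the
    target function name is hypothetical; the target prog fd and BTF id are
    supplied at load/attach time):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        /* Must have the same signature as the global function it replaces. */
        SEC("freplace/global_policy_func")
        int new_policy_func(struct xdp_md *ctx)
        {
                return XDP_PASS;        /* replacement behaviour */
        }

        char _license[] SEC("license") = "GPL";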

    BPF trampoline is used to track both fentry/fexit and program extensions
    because both are using the same nop slot at the beginning of every BPF
    function. Attaching fentry/fexit to a function that was replaced is not
    allowed. The opposite is true as well: replacing a function that is currently
    being analyzed with fentry/fexit is not allowed. The executable page allocated
    by BPF trampoline is not used by program extensions. This inefficiency will be
    optimized in future patches.

    Function-by-function verification of global functions supports only scalars and
    pointers to context. Hence program extensions are supported only for that class
    of global functions. In the future the verifier will be extended with
    support for pointers to structures, arrays with sizes, etc.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Andrii Nakryiko
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200121005348.2769920-2-ast@kernel.org

    Alexei Starovoitov
     

22 Jan, 2020

1 commit

  • Though the second half of the trampoline page is unused, a task could be
    preempted in the middle of the first half of the trampoline, and two
    updates to the trampoline would change the code from underneath the
    preempted task. Hence wait for tasks to voluntarily schedule or go
    to userspace. Add a similar wait before freeing the trampoline.
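
    The ordering becomes, roughly (sketch):

        /* Before installing a new trampoline image, and before freeing one:
         * wait until every task has voluntarily scheduled or returned to
         * user space, so nobody can still be executing the old half.
         */
        synchronize_rcu_tasks();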

    Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
    Reported-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Paul E. McKenney
    Link: https://lore.kernel.org/bpf/20200121032231.3292185-1-ast@kernel.org

    Alexei Starovoitov
     

10 Jan, 2020

1 commit

  • The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value
    is a kernel struct with its func ptr implemented in bpf prog.
    This new map is the interface to register/unregister/introspect
    a bpf implemented kernel struct.

    The kernel struct is actually embedded inside another new struct
    (or called the "value" struct in the code). For example,
    "struct tcp_congestion_ops" is embedded in:

        struct bpf_struct_ops_tcp_congestion_ops {
                refcount_t refcnt;
                enum bpf_struct_ops_state state;
                struct tcp_congestion_ops data;  /* the kernel subsystem struct */
        };

    The state of the embedded struct transitions as:

        INIT =>
        INUSE (map updated, i.e. reg) =>
        TOBEFREE (map value deleted, i.e. unreg)

    The kernel subsystem needs to call bpf_struct_ops_get() and
    bpf_struct_ops_put() to manage the "refcnt" in the
    "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt
    for the purpose of tracking the subsystem usage. Another approach
    is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
    the subsystem's usage by doing map->refcnt - map->usercnt to filter out
    the map-fd/pinned-map usage. However, that will also tie down the
    future semantics of map->refcnt and map->usercnt.

    The very first subsystem's refcnt (during reg()) holds one
    count to map->refcnt. When the very last subsystem's refcnt
    is gone, it will also release the map->refcnt. All bpf_prog will be
    freed when the map->refcnt reaches 0 (i.e. during map_free()).

    Here is what the bpftool map command output looks like:
    [root@arch-fb-vm1 bpf]# bpftool map show
    6: struct_ops name dctcp flags 0x0
    key 4B value 256B max_entries 1 memlock 4096B
    btf_id 6
    [root@arch-fb-vm1 bpf]# bpftool map dump id 6
    [{
         "value": {
             "refcnt": {
                 "refs": {
                     "counter": 1
                 }
             },
             "state": 1,
             "data": {
                 "list": {
                     "next": 0,
                     "prev": 0
                 },
                 "key": 0,
                 "flags": 2,
                 "init": 24,
                 "release": 0,
                 "ssthresh": 25,
                 "cong_avoid": 30,
                 "set_state": 27,
                 "cwnd_event": 28,
                 "in_ack_event": 26,
                 "undo_cwnd": 29,
                 "pkts_acked": 0,
                 "min_tso_segs": 0,
                 "sndbuf_expand": 0,
                 "cong_control": 0,
                 "get_info": 0,
                 "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0],
                 "owner": 0
             }
         }
     }
    ]
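
    The dump above corresponds roughly to a BPF-side definition like the
    following (libbpf conventions, heavily abbreviated and illustrative):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        SEC("struct_ops/dctcp_ssthresh")
        __u32 BPF_PROG(dctcp_ssthresh, struct sock *sk)
        {
                return 2;               /* placeholder logic */
        }

        SEC(".struct_ops")
        struct tcp_congestion_ops dctcp = {
                .ssthresh = (void *)dctcp_ssthresh,
                .name     = "bpf_dctcp",
        };

        char _license[] SEC("license") = "GPL";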

    Misc Notes:
    * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
    It does an in-place update on "*value" instead of returning a pointer
    to syscall.c. Otherwise, it would need a separate copy of the "zero" value
    for the BPF_STRUCT_OPS_STATE_INIT to avoid races.

    * The bpf_struct_ops_map_delete_elem() is also called without
    preempt_disable() from map_delete_elem(). It is because
    the "->unreg()" may requires sleepable context, e.g.
    the "tcp_unregister_congestion_control()".

    * "const" is added to some of the existing "struct btf_func_model *"
    function arg to avoid a compiler warning caused by this patch.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com

    Martin KaFai Lau
     

28 Dec, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2019-12-27

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 127 non-merge commits during the last 17 day(s) which contain
    a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).

    There are three merge conflicts. The conflicts and resolutions look as follows:

    1) Merge conflict in net/bpf/test_run.c:

    There was a tree-wide cleanup c593642c8be0 ("treewide: Use sizeof_field() macro")
    which gets in the way with b590cb5f802d ("bpf: Switch to offsetofend in
    BPF_PROG_TEST_RUN"):

    <<<<<<< HEAD
    if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
    sizeof_field(struct __sk_buff, priority),
    =======
    if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    There are a few occasions that look similar to this. Always take the chunk with
    offsetofend(). Note that there is one where the fields differ in here:

    <<<<<<< HEAD
    if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
    sizeof_field(struct __sk_buff, tstamp),
    =======
    if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    Just take the one with offsetofend() /and/ gso_segs. The latter is correct due to
    850a88cc4096 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").

    2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:

    (I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)

    <<<<<<< HEAD
    if (is_13b_check(off, insn))
    return -1;
    emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
    =======
    emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    Result should look like:

    emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);

    3) Merge conflict in arch/riscv/include/asm/pgtable.h:

    <<<<<<< HEAD
    =======
    #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
    #define VMALLOC_END (PAGE_OFFSET - 1)
    #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)

    #define BPF_JIT_REGION_SIZE (SZ_128M)
    #define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
    #define BPF_JIT_REGION_END (VMALLOC_END)

    /*
    * Roughly size the vmemmap space to be large enough to fit enough
    * struct pages to map half the virtual address space. Then
    * position vmemmap directly below the VMALLOC region.
    */
    #define VMEMMAP_SHIFT \
    (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
    #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
    #define VMEMMAP_END (VMALLOC_START - 1)
    #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)

    #define vmemmap ((struct page *)VMEMMAP_START)

    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    Only take the BPF_* defines from there and move them higher up in the
    same file. Remove the rest from the chunk. The VMALLOC_* etc defines
    got moved via 01f52e16b868 ("riscv: define vmemmap before pfn_to_page
    calls"). Result:

    [...]
    #define __S101 PAGE_READ_EXEC
    #define __S110 PAGE_SHARED_EXEC
    #define __S111 PAGE_SHARED_EXEC

    #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
    #define VMALLOC_END (PAGE_OFFSET - 1)
    #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)

    #define BPF_JIT_REGION_SIZE (SZ_128M)
    #define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
    #define BPF_JIT_REGION_END (VMALLOC_END)

    /*
    * Roughly size the vmemmap space to be large enough to fit enough
    * struct pages to map half the virtual address space. Then
    * position vmemmap directly below the VMALLOC region.
    */
    #define VMEMMAP_SHIFT \
    (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
    #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
    #define VMEMMAP_END (VMALLOC_START - 1)
    #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)

    [...]

    Let me know if there are any other issues.

    Anyway, the main changes are:

    1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
    to a provided BPF object file. This provides an alternative, simplified API
    compared to standard libbpf interaction. Also, add libbpf extern variable
    resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.

    2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
    generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
    add various BPF riscv JIT improvements, from Björn Töpel.

    3) Extend bpftool to allow matching BPF programs and maps by name,
    from Paul Chaignon.

    4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
    flag for allowing updates without service interruption, from Andrey Ignatov.

    5) Cleanup and simplification of ring access functions for AF_XDP with a
    bonus of 0-5% performance improvement, from Magnus Karlsson.

    6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
    audit support for BPF, from Daniel Borkmann and later with Jiri Olsa.

    7) Move and extend test_select_reuseport into BPF program tests under
    BPF selftests, from Jakub Sitnicki.

    8) Various BPF sample improvements for xdpsock for customizing parameters
    to set up and benchmark AF_XDP, from Jay Jayatheerthan.

    9) Improve libbpf to provide a ulimit hint on permission denied errors.
    Also change XDP sample programs to attach in driver mode by default,
    from Toke Høiland-Jørgensen.

    10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
    programs, from Nikita V. Shirokov.

    11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.

    12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
    libbpf conversion, from Jesper Dangaard Brouer.

    13) Minor misc improvements from various others.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     


12 Dec, 2019

1 commit

  • Make BPF trampoline attach its generated assembly code to kernel functions via
    the register_ftrace_direct() API. It helps ftrace-based tracers co-exist with BPF
    trampoline on the same kernel function. It also switches attaching logic from
    arch specific text_poke to generic ftrace that is available on many
    architectures. text_poke is still necessary for bpf-to-bpf attach and for
    bpf_tail_call optimization.
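
    Conceptually (error handling elided; variable names are illustrative):

        unsigned long ip = (unsigned long)target_kernel_func;   /* fentry nop site */
        unsigned long tr = (unsigned long)trampoline_image;     /* generated asm */

        register_ftrace_direct(ip, tr);     /* attach the trampoline via ftrace */
        /* ... */
        unregister_ftrace_direct(ip, tr);   /* detach */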

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191209000114.1876138-3-ast@kernel.org

    Alexei Starovoitov
     

25 Nov, 2019

1 commit

  • Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
    and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
    old_addr as well as new_addr, it's a bit redundant and unnecessarily
    complicates __bpf_arch_text_poke() itself since we can derive the same from
    the *_addr that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP}
    as types which also allows to clean up call-sites.

    In addition to that, __bpf_arch_text_poke() currently verifies that text
    matches expected old_insn before we invoke text_poke_bp(). Also add a check
    on new_insn and skip the rewrite if it already matches. The reason why this is
    rather useful is that it avoids any special casing in prog_array_map_poke_run()
    when the old and new prog were NULL, and it has the benefit that also for this
    case we perform a check on the text as to whether it really matches our
    expectations.
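
    The simplified interface boils down to:

        enum bpf_text_poke_type {
                BPF_MOD_CALL,
                BPF_MOD_JUMP,
        };

        int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
                               void *old_addr, void *new_addr);

    Whether a nop, a call or a jump currently sits at the site follows from
    old_addr/new_addr being NULL or not, so the extra *_TO_* poke types are no
    longer needed.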

    Suggested-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net

    Daniel Borkmann
     

16 Nov, 2019

1 commit

  • Introduce BPF trampoline concept to allow kernel code to call into BPF programs
    with practically zero overhead. The trampoline generation logic is
    architecture dependent. It's converting native calling convention into BPF
    calling convention. BPF ISA is 64-bit (even on 32-bit architectures). The
    registers R1 to R5 are used to pass arguments into BPF functions. The main BPF
    program accepts only single argument "ctx" in R1. Whereas CPU native calling
    convention is different. x86-64 is passing first 6 arguments in registers
    and the rest on the stack. x86-32 is passing first 3 arguments in registers.
    sparc64 is passing first 6 in registers. And so on.

    The trampolines between BPF and kernel already exist. BPF_CALL_x macros in
    include/linux/filter.h statically compile trampolines from BPF into kernel
    helpers. They convert up to five u64 arguments into kernel C pointers and
    integers. On 64-bit architectures these BPF-to-kernel trampolines are nops. On
    32-bit architectures they're meaningful.

    The opposite job, kernel-to-BPF trampolines, is done by CAST_TO_U64 macros and
    __bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert
    kernel function arguments into array of u64s that BPF program consumes via
    R1=ctx pointer.

    This patch set is doing the same job as __bpf_trace_##call() static
    trampolines, but dynamically for any kernel function. There are ~22k global
    kernel functions that are attachable via nop at function entry. The function
    arguments and types are described in BTF. The job of btf_distill_func_proto()
    function is to extract useful information from BTF into "function model" that
    architecture dependent trampoline generators will use to generate assembly code
    to cast kernel function arguments into an array of u64s. For example, the kernel
    function eth_type_trans has two pointer arguments. They will be cast to u64 and
    stored onto the stack of the generated trampoline. The pointer to that stack
    space will be passed into the BPF program in R1. On x86-64 such a generated
    trampoline will consume 16 bytes of stack and perform two stores of %rdi and
    %rsi into that stack. The verifier will
    make sure that only two u64 are accessed read-only by BPF program. The verifier
    will also recognize the precise type of the pointers being accessed and will
    not allow typecasting of the pointer to a different type within BPF program.
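
    Using today's libbpf conventions, attaching to the example above looks
    roughly like this from the BPF side:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        /* The trampoline stores the two pointer arguments of eth_type_trans
         * on its stack and passes a pointer to that area as ctx in R1;
         * BPF_PROG() unpacks it.
         */
        SEC("fentry/eth_type_trans")
        int BPF_PROG(trace_eth_type_trans, struct sk_buff *skb,
                     struct net_device *dev)
        {
                return 0;
        }

        char _license[] SEC("license") = "GPL";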

    The tracing use case in the datacenter demonstrated that certain key kernel
    functions (like tcp_retransmit_skb) have 2 or more kprobes that are always
    active. Other functions have both kprobe and kretprobe. So it is essential to
    keep both kernel code and BPF programs executing at maximum speed. Hence
    the generated BPF trampoline is re-generated every time a program is attached or
    detached to maintain maximum performance.

    To avoid the high cost of retpoline the attached BPF programs are called
    directly. __bpf_prog_enter/exit() are used to support per-program execution
    stats. In the future this logic will be optimized further by adding support
    for bpf_stats_enabled_key inside generated assembly code. Introduction of
    preemptible and sleepable BPF programs will completely remove the need to call
    to __bpf_prog_enter/exit().

    Detach of a BPF program from the trampoline should not fail. To avoid memory
    allocation in the detach path, half of the page is used as a reserve and flipped
    after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly
    which is enough for BPF tracing use cases. This limit can be increased in the
    future.

    BPF_TRACE_FENTRY programs have access to raw kernel function arguments while
    BPF_TRACE_FEXIT programs have access to kernel return value as well. Often
    kprobe BPF program remembers function arguments in a map while kretprobe
    fetches arguments from a map and analyzes them together with return value.
    BPF_TRACE_FEXIT accelerates this typical use case.

    Recursion prevention for kprobe BPF programs is done via per-cpu
    bpf_prog_active counter. In practice that turned out to be a mistake. It
    caused programs to randomly skip execution. The tracing tools missed results
    they were looking for. Hence BPF trampoline doesn't provide builtin recursion
    prevention. It's a job of BPF program itself and will be addressed in the
    follow up patches.

    BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
    in the future. For example to remove retpoline cost from XDP programs.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org

    Alexei Starovoitov