30 Sep, 2020

2 commits

  • There is no particular reason for keeping the "sub 0, %rsp" insn within
    the BPF's x64 JIT prologue.

    When the tail call code was skipping the whole prologue section, these 7
    bytes that represent the rsp subtraction could not simply be discarded, as
    the jump target address would be broken. An option to address that would
    be to substitute it with a nop7.

    Right now the tail call skips only the first 11 bytes of the target
    program's prologue and "sub X, %rsp" is the first insn that is processed,
    so if the stack depth is zero then this insn can be omitted without the
    need for a nop7 swap.

    Therefore, do not emit the "sub 0, %rsp" in the prologue when the program
    is not making use of the R10 register. Also, make the emission of
    "add X, %rsp" conditional in the tail call code logic and take into account
    the presence of the mentioned insn when calculating the jump offsets.
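
    A minimal sketch of the resulting conditional emission, assuming the
    EMIT-style helpers used throughout arch/x86/net/bpf_jit_comp.c (an
    illustration, not the exact diff):

    /* prologue: only reserve stack when the program actually uses it */
    if (stack_depth)
        /* sub rsp, rounded_stack_depth */
        EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));

    /* tail call path: only unwind what the prologue reserved */
    if (stack_depth)
        /* add rsp, rounded_stack_depth */
        EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));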

    Signed-off-by: Maciej Fijalkowski
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200929204653.4325-3-maciej.fijalkowski@intel.com

    Maciej Fijalkowski
     
  • Back when all of the callee-saved registers were always pushed to the stack
    in the x64 JIT prologue, the tail call counter was placed at the bottom of
    the BPF program's stack frame, which had the following layout:

    +-------------+
    | ret addr |
    +-------------+
    | rbp |
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200929204653.4325-2-maciej.fijalkowski@intel.com

    Maciej Fijalkowski
     

18 Sep, 2020

3 commits

  • This commit serves two things:
    1) it optimizes BPF prologue/epilogue generation
    2) it makes it possible to have tailcalls within BPF subprograms

    Both points are related to each other since without 1), 2) could not be
    achieved.

    In [1], Alexei says:
    "The prologue will look like:
    nop5
    xor eax,eax  // two new bytes if bpf_tail_call() is used in this
    // function
    push rbp
    mov rbp, rsp
    sub rsp, rounded_stack_depth
    push rax // zero init tail_call counter
    variable number of push rbx,r13,r14,r15

    Then bpf_tail_call will pop variable number rbx,..
    and final 'pop rax'
    Then 'add rsp, size_of_current_stack_frame'
    jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
    rbp, rsp'

    This way new function will set its own stack size and will init tail
    call
    counter with whatever value the parent had.

    If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
    Instead it would need to have 'nop2' in there."

    Implement that suggestion.

    Since the layout of the stack is changed, tail call counter handling can no
    longer rely on popping it to rbx just like it has been handled for the
    constant prologue case, where rbx was later overwritten with the actual
    value pushed to the stack. Therefore, let's use one of the registers (%rcx)
    that is considered volatile/caller-saved and pop the value of the tail call
    counter into it in the epilogue.

    Drop the BUILD_BUG_ON in emit_prologue and in
    emit_bpf_tail_call_indirect, where the instruction layout is not constant
    anymore.

    Introduce a new poke target, 'tailcall_bypass', in the poke descriptor; it
    is dedicated to skipping the register pops and stack unwind that are
    generated right before the actual jump to the target program.
    For the case when the target program is not present, the BPF program will skip
    the pop instructions and the nop5 dedicated for jmpq $target. An example of
    such a state when only R6 of the callee-saved registers is used by the program:

    ffffffffc0513aa1: e9 0e 00 00 00 jmpq 0xffffffffc0513ab4
    ffffffffc0513aa6: 5b pop %rbx
    ffffffffc0513aa7: 58 pop %rax
    ffffffffc0513aa8: 48 81 c4 00 00 00 00 add $0x0,%rsp
    ffffffffc0513aaf: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
    ffffffffc0513ab4: 48 89 df mov %rbx,%rdi

    When the target program is inserted, the jump that was there to skip
    pops/nop5 will become a nop5, so the CPU will go over the pops and do the
    actual tailcall.

    One might ask why the pushes can not simply be placed after the nop5.
    In the following example snippet:

    ffffffffc037030c: 48 89 fb mov %rdi,%rbx
    (...)
    ffffffffc0370332: 5b pop %rbx
    ffffffffc0370333: 58 pop %rax
    ffffffffc0370334: 48 81 c4 00 00 00 00 add $0x0,%rsp
    ffffffffc037033b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
    ffffffffc0370340: 48 81 ec 00 00 00 00 sub $0x0,%rsp
    ffffffffc0370347: 50 push %rax
    ffffffffc0370348: 53 push %rbx
    ffffffffc0370349: 48 89 df mov %rbx,%rdi
    ffffffffc037034c: e8 f7 21 00 00 callq 0xffffffffc0372548

    There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall
    and the jump target is not present. ctx is in the %rbx register and the BPF
    subprogram that we will call into at ffffffffc037034c relies on it,
    e.g. it will pick ctx up from there. Such a code layout is therefore broken,
    as we would overwrite the contents of %rbx with the value that was pushed
    in the prologue. That is the reason for the 'bypass' approach.

    Special care needs to be taken during the install/update/removal of the
    tailcall target. In the case when the target program is not present, the CPU
    must not execute the pop instructions that precede the tailcall.

    To address that, the following states can be defined:
    A nop, unwind, nop
    B nop, unwind, tail
    C skip, unwind, nop
    D skip, unwind, tail

    A is forbidden (it leads to incorrectness). The state transitions between
    tailcall install/update/remove will work as follows:

    First install tail call f: C->D->B(f)
    * poke the tailcall, after that get rid of the skip
    Update tail call f to f': B(f)->B(f')
    * poke the tailcall (poke->tailcall_target) and do NOT touch the
    poke->tailcall_bypass
    Remove tail call: B(f')->C(f')
    * poke->tailcall_bypass is poked back to jump, then we wait for the RCU
    grace period so that other programs will finish their execution and
    after that we are safe to remove the poke->tailcall_target
    Install new tail call (f''): C(f')->D(f'')->B(f'').
    * same as the first step

    This way CPU can never be exposed to "unwind, tail" state.

    Last but not least, when tailcalls get mixed with bpf2bpf calls, it
    would be possible to encounter an endless loop due to clearing the
    tailcall counter if, for example, we used a subprogram-based variant of the
    tailcall3 program from BPF selftests, meaning the tailcall
    would be present within a BPF subprogram.

    This test, broken down into particular steps, would do:
    entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
    func0 -> call subprog_tail
    (we are NOT skipping the first 11 bytes of prologue and this subprogram
    has a tailcall, therefore we clear the counter...)
    subprog -> do the same thing as entry

    and then loop forever.

    To address this, the idea is to go through the call chain of bpf2bpf progs
    and look for a tailcall presence throughout the whole chain. If we see a
    single tail call, then each node in this call chain needs to be marked as a
    subprog that can reach the tailcall. We would later feed the JIT with this
    info and (a rough sketch follows the list below):
    - set eax to 0 only when tailcall is reachable and this is the entry prog
    - if tailcall is reachable but there's no tailcall in insns of currently
    JITed prog then push rax anyway, so that it will be possible to
    propagate further down the call chain
    - finally if tailcall is reachable, then we need to precede the 'call'
    insn with mov rax, [rbp - (stack_depth + 8)]
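
    A rough sketch of how the JIT can act on that marking (the flag and helper
    names here are assumptions for illustration, not the exact ones used):

    /* prologue */
    if (tail_call_reachable) {
        if (is_entry_prog)
            EMIT2(0x31, 0xC0);  /* xor eax, eax - zero the tail call counter */
        EMIT1(0x50);            /* push rax - propagate the counter downwards */
    }

    /* right before a bpf2bpf 'call' insn */
    if (tail_call_reachable)
        /* mov rax, qword ptr [rbp - (stack_depth + 8)] */
        EMIT3_off32(0x48, 0x8B, 0x85, -round_up(stack_depth, 8) - 8);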

    Tail call related cases from test_verifier kselftest are also working
    fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
    work properly as well.

    [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Maciej Fijalkowski
    Signed-off-by: Alexei Starovoitov

    Maciej Fijalkowski
     
  • Reflect the actual purpose of poke->ip and rename it to
    poke->tailcall_target so that it will not be confused with another
    poke target that will be introduced in the next commit.

    While at it, do the same thing with poke->ip_stable - rename it to
    poke->tailcall_target_stable.

    Signed-off-by: Maciej Fijalkowski
    Signed-off-by: Alexei Starovoitov

    Maciej Fijalkowski
     
  • Currently, %rax is used to store the jump target when the BPF program is
    emitting the retpoline instructions that handle the indirect
    tailcall.

    There is a plan to use %rax for a different purpose, namely storing the
    tail call counter. In order to preserve this value across the tailcalls,
    adjust the BPF indirect tailcalls so that the target program will reside
    in %rcx, and teach the retpoline instructions about the new location of
    the jump target.

    Signed-off-by: Maciej Fijalkowski
    Signed-off-by: Alexei Starovoitov

    Maciej Fijalkowski
     

29 Aug, 2020

1 commit

  • Introduce sleepable BPF programs that can request such property for themselves
    via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
    to use helpers like bpf_copy_from_user() that might sleep. At present only
    fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
    when they are attached to kernel functions that are known to allow sleeping.

    The non-sleepable programs rely on implicit rcu_read_lock() and
    migrate_disable() to protect the lifetime of programs, the maps that they
    use and the per-cpu kernel structures used to pass info between bpf programs
    and the kernel. The sleepable programs cannot be enclosed in rcu_read_lock().
    migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
    should not be enclosed in migrate_disable() either. Therefore
    rcu_read_lock_trace is used to protect the lifetime of sleepable progs.

    There are many networking and tracing program types. In many cases the
    'struct bpf_prog *' pointer itself is rcu protected within some other kernel
    data structure and the kernel code is using rcu_dereference() to load that
    program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
    Instead sleepable bpf programs are allowed with bpf trampoline only. The
    program pointers are hard-coded into generated assembly of bpf trampoline and
    synchronize_rcu_tasks_trace() is used to protect the life time of the program.
    The same trampoline can hold both sleepable and non-sleepable progs.

    When rcu_read_lock_trace is held it means that some sleepable bpf program is
    running from bpf trampoline. Those programs can use bpf arrays and preallocated
    hash/lru maps. These map types wait for programs to complete via
    synchronize_rcu_tasks_trace().

    Updates to the trampoline now have to do synchronize_rcu_tasks_trace() and
    synchronize_rcu_tasks() to wait for sleepable progs to finish and for
    trampoline assembly to finish.

    This is the first step of introducing sleepable progs. Eventually dynamically
    allocated hash maps can be allowed and networking program types can become
    sleepable too.
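
    For illustration only (not part of this patch set): with libbpf, a program
    requests BPF_F_SLEEPABLE by using an ".s" section suffix and may then call
    sleeping helpers such as bpf_copy_from_user(). The attach point below is
    just an example.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char buf[64];

    SEC("fentry.s/__x64_sys_nanosleep")  /* ".s" => BPF_F_SLEEPABLE */
    int BPF_PROG(handle_nanosleep, struct pt_regs *regs)
    {
        const void *ureq = (const void *)PT_REGS_PARM1_CORE(regs);

        /* may fault in user memory and sleep - only legal in sleepable progs */
        bpf_copy_from_user(buf, sizeof(buf), ureq);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";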

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Josef Bacik
    Acked-by: Andrii Nakryiko
    Acked-by: KP Singh
    Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

07 May, 2020

1 commit

  • The '==' expression itself is bool, no need to convert it to bool again.
    This fixes the following coccicheck warning:

    arch/x86/net/bpf_jit_comp32.c:1478:50-55: WARNING: conversion to bool not needed here
    arch/x86/net/bpf_jit_comp32.c:1479:50-55: WARNING: conversion to bool not needed here

    Signed-off-by: Jason Yan
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200506140352.37154-1-yanaijie@huawei.com

    Jason Yan
     

25 Apr, 2020

3 commits

  • When verifier_zext is true, we don't need to emit code
    for zero-extension.
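
    Sketched (not the literal diff), the pattern is to guard the explicit
    clearing of the upper half of the destination register:

    if (!bpf_prog->aux->verifier_zext)
        /* xor dst_hi,dst_hi - clear upper 32 bits */
        EMIT2(0x33, add_2reg(0xC0, dst_hi, dst_hi));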

    Fixes: 836256bf5f37 ("x32: bpf: eliminate zero extension code-gen")
    Signed-off-by: Wang YanQing
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200423050637.GA4029@udknight

    Wang YanQing
     
  • The current JIT clobbers the destination register for BPF_JSET BPF_X
    and BPF_K by using "and" and "or" instructions. This is fine when the
    destination register is a temporary loaded from a register stored on
    the stack but not otherwise.

    This patch fixes the problem (for both BPF_K and BPF_X) by always loading
    the destination register into temporaries since BPF_JSET should not
    modify the destination register.

    This bug may not be currently triggerable as BPF_REG_AX is the only
    register not stored on the stack and the verifier uses it in a limited
    way.

    Fixes: 03f5781be2c7b ("bpf, x86_32: add eBPF JIT compiler for ia32")
    Signed-off-by: Xi Wang
    Signed-off-by: Luke Nelson
    Signed-off-by: Alexei Starovoitov
    Acked-by: Wang YanQing
    Link: https://lore.kernel.org/bpf/20200422173630.8351-2-luke.r.nels@gmail.com

    Luke Nelson
     
  • The current JIT uses the following sequence to zero-extend into the
    upper 32 bits of the destination register for BPF_LDX BPF_{B,H,W},
    when the destination register is not on the stack:

    EMIT3(0xC7, add_1reg(0xC0, dst_hi), 0);

    The problem is that C7 /0 encodes a MOV instruction that requires a 4-byte
    immediate; the current code emits only 1 byte of the immediate. This
    means that the first 3 bytes of the next instruction will be treated as
    the rest of the immediate, breaking the stream of instructions.

    This patch fixes the problem by instead emitting "xor dst_hi,dst_hi"
    to clear the upper 32 bits. This fixes the problem and is more efficient
    than using MOV to load a zero immediate.
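
    Roughly, using the same helpers, the before/after encodings look like this
    (a sketch, not the exact patch):

    /* broken: C7 /0 is 'mov r/m32, imm32' and needs 4 immediate bytes */
    EMIT3(0xC7, add_1reg(0xC0, dst_hi), 0);
    /* fix: register-register xor needs no immediate at all */
    EMIT2(0x33, add_2reg(0xC0, dst_hi, dst_hi));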

    This bug may not be currently triggerable as BPF_REG_AX is the only
    register not stored on the stack and the verifier uses it in a limited
    way, and the verifier implements a zero-extension optimization. But the
    JIT should avoid emitting incorrect encodings regardless.

    Fixes: 03f5781be2c7b ("bpf, x86_32: add eBPF JIT compiler for ia32")
    Signed-off-by: Xi Wang
    Signed-off-by: Luke Nelson
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: H. Peter Anvin (Intel)
    Acked-by: Wang YanQing
    Link: https://lore.kernel.org/bpf/20200422173630.8351-1-luke.r.nels@gmail.com

    Luke Nelson
     

21 Apr, 2020

1 commit

  • This patch fixes an encoding bug in emit_stx for BPF_B when the source
    register is BPF_REG_FP.

    The current implementation for BPF_STX BPF_B in emit_stx saves one REX
    byte when the operands can be encoded using Mod-R/M alone. The lower 8
    bits of registers %rax, %rbx, %rcx, and %rdx can be accessed without using
    a REX prefix via %al, %bl, %cl, and %dl, respectively. Other registers
    (e.g., %rsi, %rdi, %rbp, %rsp) require a REX prefix to use their 8-bit
    equivalents (%sil, %dil, %bpl, %spl).

    The current code checks if the source for BPF_STX BPF_B is BPF_REG_1
    or BPF_REG_2 (which map to %rdi and %rsi), in which case it emits the
    required REX prefix. However, it misses the case when the source is
    BPF_REG_FP (mapped to %rbp).

    The result is that BPF_STX BPF_B with BPF_REG_FP as the source operand
    will read from register %ch instead of the correct %bpl. This patch fixes
    the problem by fixing and refactoring the check on which registers need
    the extra REX byte. Since no BPF registers map to %rsp, there is no need
    to handle %spl.
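
    A sketch of the corrected condition (the helper that flags registers whose
    8-bit form needs REX is assumed here to be called is_ereg_8l()):

    if (is_ereg(dst_reg) || is_ereg_8l(src_reg))
        /* Add empty REX prefix so %sil, %dil, %bpl become reachable */
        EMIT1(add_2mod(0x40, dst_reg, src_reg));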

    Fixes: 622582786c9e0 ("net: filter: x86: internal BPF JIT")
    Signed-off-by: Xi Wang
    Signed-off-by: Luke Nelson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200418232655.23870-1-luke.r.nels@gmail.com

    Luke Nelson
     

26 Mar, 2020

1 commit


11 Mar, 2020

1 commit

  • fmod_ret progs are emitted as:

    start = __bpf_prog_enter();
    call fmod_ret
    *(u64 *)(rbp - 8) = rax
    __bpf_prog_exit(, start);
    test eax, eax
    jne do_fexit

    That 'test eax, eax' is working by accident. The compiler is free to use rax
    inside __bpf_prog_exit() or inside functions that __bpf_prog_exit() is calling.
    Which caused "test_progs -t modify_return" to sporadically fail depending on
    compiler version and kconfig. Fix it by using 'cmp [rbp - 8], 0' instead of
    'test eax, eax'.
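
    The replacement comparison, sketched in the JIT's raw-byte style:

    /* cmp QWORD PTR [rbp - 8], 0 - rbp - 8 holds the saved fmod_ret value */
    EMIT4(0x48, 0x83, 0x7D, 0xF8);
    EMIT1(0x00);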

    Fixes: ae24082331d9 ("bpf: Introduce BPF_MODIFY_RETURN")
    Reported-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: KP Singh
    Link: https://lore.kernel.org/bpf/20200311003906.3643037-1-ast@kernel.org

    Alexei Starovoitov
     

06 Mar, 2020

1 commit

  • The current x32 BPF JIT is incorrect for JMP32 JSET BPF_X when the upper
    32 bits of operand registers are non-zero in certain situations.

    The problem is in the following code:

    case BPF_JMP | BPF_JSET | BPF_X:
    case BPF_JMP32 | BPF_JSET | BPF_X:
    ...

    /* and dreg_lo,sreg_lo */
    EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo));
    /* and dreg_hi,sreg_hi */
    EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi));
    /* or dreg_lo,dreg_hi */
    EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi));

    This code checks the upper bits of the operand registers regardless of
    whether the BPF instruction is BPF_JMP32 or BPF_JMP (64-bit). Registers
    dreg_hi and dreg_lo are not loaded from the stack for BPF_JMP32; however,
    they can still be polluted with values from previous instructions.

    The following BPF program demonstrates the bug. The jset64 instruction
    loads the temporary registers and performs the jump, since ((u64)r7 &
    (u64)r8) is non-zero. The jset32 should _not_ be taken, as the lower
    32 bits are all zero; however, the current JIT will take the branch due
    to the pollution of temporary registers from the earlier jset64.

    mov64 r0, 0
    ld64 r7, 0x8000000000000000
    ld64 r8, 0x8000000000000000
    jset64 r7, r8, 1
    exit
    jset32 r7, r8, 1
    mov64 r0, 2
    exit

    The expected return value of this program is 2; under the buggy x32 JIT
    it returns 0. The fix is to skip using the upper 32 bits for jset32 and
    compare the upper 32 bits for jset64 only.
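
    Sketched, the fix wraps the upper-half operations in a check for the 64-bit
    variant (is_jmp64 distinguishes BPF_JMP from BPF_JMP32 in the x32 JIT):

    /* and dreg_lo,sreg_lo */
    EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo));
    if (is_jmp64) {
        /* and dreg_hi,sreg_hi */
        EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi));
        /* or dreg_lo,dreg_hi */
        EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi));
    }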

    All tests in test_bpf.ko and selftests/bpf/test_verifier continue to
    pass with this change.

    We found this bug using our automated verification tool, Serval.

    Fixes: 69f827eb6e14 ("x32: bpf: implement jitting of JMP32")
    Co-developed-by: Xi Wang
    Signed-off-by: Xi Wang
    Signed-off-by: Luke Nelson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200305234416.31597-1-luke.r.nels@gmail.com

    Luke Nelson
     

05 Mar, 2020

3 commits

  • When multiple programs are attached, each program receives the return
    value from the previous program on the stack and the last program
    provides the return value to the attached function.

    The fmod_ret bpf programs are run after the fentry programs and before
    the fexit programs. The original function is only called if all the
    fmod_ret programs return 0 to avoid any unintended side-effects. The
    success value, i.e. 0, is not currently configurable, but it can be made
    so that user space can specify it at load time.

    For example:

    int func_to_be_attached(int a, int b)
    {  <--- do_fentry

    do_fmod_ret:
       <update ret by calling fmod_ret>
       if (ret != 0)
           goto do_fexit;

    original_function:

       <side_effects_happen_here>

    }  <--- do_fexit
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200304191853.1529-4-kpsingh@chromium.org

    KP Singh
     
  • * Split the invoke_bpf program to prepare for special handling of
    fmod_ret programs introduced in a subsequent patch.
    * Move the definition of emit_cond_near_jump and emit_nops as they are
    needed for fmod_ret.
    * Refactor branch target alignment into its own generic helper function
    i.e. emit_align.

    Signed-off-by: KP Singh
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200304191853.1529-3-kpsingh@chromium.org

    KP Singh
     
  • As we need to introduce a third type of attachment for trampolines, the
    flattened signature of arch_prepare_bpf_trampoline gets even more
    complicated.

    Refactor the prog and count argument to arch_prepare_bpf_trampoline to
    use bpf_tramp_progs to simplify the addition and accounting for new
    attachment types.
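
    The shape of the new container, roughly (one such struct per attachment
    kind; treat the exact layout as a sketch):

    struct bpf_tramp_progs {
        struct bpf_prog *progs[BPF_MAX_TRAMP_PROGS];
        int nr_progs;
    };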

    Signed-off-by: KP Singh
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200304191853.1529-2-kpsingh@chromium.org

    KP Singh
     

10 Jan, 2020

1 commit

  • The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value
    is a kernel struct with its func ptr implemented in bpf prog.
    This new map is the interface to register/unregister/introspect
    a bpf implemented kernel struct.

    The kernel struct is actually embedded inside another new struct
    (or called the "value" struct in the code). For example,
    "struct tcp_congestion_ops" is embbeded in:
    struct bpf_struct_ops_tcp_congestion_ops {
    refcount_t refcnt;
    enum bpf_struct_ops_state state;
    struct tcp_congestion_ops data; /*
    INUSE (map updated, i.e. reg) =>
    TOBEFREE (map value deleted, i.e. unreg)

    The kernel subsystem needs to call bpf_struct_ops_get() and
    bpf_struct_ops_put() to manage the "refcnt" in the
    "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt
    for the purose of tracking the subsystem usage. Another approach
    is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
    the subsystem's usage by doing map->refcnt - map->usercnt to filter out
    the map-fd/pinned-map usage. However, that will also tie down the
    future semantics of map->refcnt and map->usercnt.

    The very first subsystem's refcnt (during reg()) holds one
    count to map->refcnt. When the very last subsystem's refcnt
    is gone, it will also release the map->refcnt. All bpf_prog will be
    freed when the map->refcnt reaches 0 (i.e. during map_free()).

    Here is what the bpftool map command output looks like:
    [root@arch-fb-vm1 bpf]# bpftool map show
    6: struct_ops name dctcp flags 0x0
    key 4B value 256B max_entries 1 memlock 4096B
    btf_id 6
    [root@arch-fb-vm1 bpf]# bpftool map dump id 6
    [{
    "value": {
    "refcnt": {
    "refs": {
    "counter": 1
    }
    },
    "state": 1,
    "data": {
    "list": {
    "next": 0,
    "prev": 0
    },
    "key": 0,
    "flags": 2,
    "init": 24,
    "release": 0,
    "ssthresh": 25,
    "cong_avoid": 30,
    "set_state": 27,
    "cwnd_event": 28,
    "in_ack_event": 26,
    "undo_cwnd": 29,
    "pkts_acked": 0,
    "min_tso_segs": 0,
    "sndbuf_expand": 0,
    "cong_control": 0,
    "get_info": 0,
    "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
    ],
    "owner": 0
    }
    }
    }
    ]

    Misc Notes:
    * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
    It does an in-place update on "*value" instead of returning a pointer
    to syscall.c. Otherwise, it would need a separate copy of the "zero" value
    for the BPF_STRUCT_OPS_STATE_INIT to avoid races.

    * The bpf_struct_ops_map_delete_elem() is also called without
    preempt_disable() from map_delete_elem(). It is because
    the "->unreg()" may require a sleepable context, e.g.
    the "tcp_unregister_congestion_control()".

    * "const" is added to some of the existing "struct btf_func_model *"
    function arg to avoid a compiler warning caused by this patch.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com

    Martin KaFai Lau
     

14 Dec, 2019

2 commits

  • From Intel 64 and IA-32 Architectures Optimization Reference Manual,
    3.4.1.4 Code Alignment, Assembly/Compiler Coding Rule 11: All branch
    targets should be 16-byte aligned.

    This commit aligns branch targets according to the Intel manual.

    The nops used to align branch targets make the dispatcher larger, and
    therefore the number of supported dispatch points/programs is
    decreased from 64 to 48.

    Signed-off-by: Björn Töpel
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20191213175112.30208-7-bjorn.topel@gmail.com

    Björn Töpel
     
  • The BPF dispatcher is a multi-way branch code generator, mainly
    targeted for XDP programs. When an XDP program is executed via
    bpf_prog_run_xdp(), it is invoked via an indirect call. The indirect
    call has a substantial performance impact when retpolines are
    enabled. The dispatcher transforms indirect calls to direct calls, and
    therefore avoids the retpoline. The dispatcher is generated using the
    BPF JIT, and relies on text poking provided by bpf_arch_text_poke().

    The dispatcher hijacks a trampoline function via the trampoline's
    __fentry__ nop. One dispatcher instance currently supports up to 64
    dispatch points. A user creates a dispatcher with its corresponding
    trampoline with the DEFINE_BPF_DISPATCHER macro.
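
    An illustrative usage sketch; only DEFINE_BPF_DISPATCHER is named by the
    commit itself, the change-prog helper below is an assumption:

    DEFINE_BPF_DISPATCHER(my_disp)

    static void my_switch_prog(struct bpf_prog *from, struct bpf_prog *to)
    {
        /* re-generate the dispatcher trampoline for the new program set */
        bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(my_disp), from, to);
    }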

    Signed-off-by: Björn Töpel
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20191213175112.30208-3-bjorn.topel@gmail.com

    Björn Töpel
     

25 Nov, 2019

3 commits

  • Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
    and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
    old_addr as well as new_addr, it's a bit redundant and unnecessarily
    complicates __bpf_arch_text_poke() itself since we can derive the same from
    the *_addr that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP}
    as types, which also allows us to clean up the call-sites.

    In addition to that, __bpf_arch_text_poke() currently verifies that the text
    matches the expected old_insn before we invoke text_poke_bp(). Also add a check
    on new_insn and skip the rewrite if it already matches. The reason why this is
    rather useful is that it avoids any special casing in prog_array_map_poke_run()
    when the old and new prog are NULL, and it has the benefit that also for this
    case we perform a check on the text as to whether it really matches our expectations.

    Suggested-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • Add initial code emission for *direct* jumps for tail call maps in
    order to avoid the retpoline overhead from a493a87f38cf ("bpf, x64:
    implement retpoline for tail call") for situations that allow for
    it, meaning, for known constant keys at verification time which are
    used as index into the tail call map. In case of Cilium which makes
    heavy use of tail calls, constant keys are used in the vast majority,
    only for a single occurrence we use a dynamic key.

    High level outline is that if the target prog is NULL in the map, we
    emit a 5-byte nop for the fall-through case and if not, we emit a
    5-byte direct relative jmp to the target bpf_func + skipped prologue
    offset. Later during runtime, we patch these 5-byte nop/jmps upon
    tail call map update or deletions dynamically. Note that on x86-64
    the direct jmp works as we reuse the same stack frame and skip
    prologue (as opposed to some other JIT implementations).

    One of the issues is that the tail call map slots can change at any
    given time even during JITing. Therefore, we have two passes: i) emit
    nops for all patchable locations during main JITing phase until we
    declare prog->jited = 1 eventually. At this point the image is stable,
    not public yet and with all jmps disabled. While JITing, we collect
    additional info like poke->ip in order to remember the patch location
    for later modifications. In ii) bpf_tail_call_direct_fixup() walks
    over the prog's poke_tab, locks the tail call map's poke_mutex to
    prevent parallel updates, and patches in the right locations via
    __bpf_arch_text_poke(). Note, the main bpf_arch_text_poke() cannot
    be used at this point since we're not yet exposed to kallsyms. For
    the update we use plain memcpy() since the image is not public and
    still in read-write mode. After patching, we activate that poke entry
    through poke->ip_stable. Meaning, at this point any tail call map
    updates/deletions are not going to ignore that poke entry anymore.
    Then, bpf_arch_text_poke() might still occur on the read-write image
    until we finally locked it as read-only. Both modifications on the
    given image are under text_mutex to avoid interference with each
    other when update requests come in in parallel for different tail
    call maps (current one we have locked in JIT and different one where
    poke->ip_stable was already set).

    Example prog:

    # ./bpftool p d x i 1655
    0: (b7) r3 = 0
    1: (18) r2 = map[id:526]
    3: (85) call bpf_tail_call#12
    4: (b7) r0 = 1
    5: (95) exit

    Before:

    # ./bpftool p d j i 1655
    0xffffffffc076e55c:
    0: nopl 0x0(%rax,%rax,1)
    5: push %rbp
    6: mov %rsp,%rbp
    9: sub $0x200,%rsp
    10: push %rbx
    11: push %r13
    13: push %r14
    15: push %r15
    17: pushq $0x0 _
    19: xor %edx,%edx |_ index (arg 3)
    1b: movabs $0xffff88d95cc82600,%rsi |_ map (arg 2)
    25: mov %edx,%edx | index >= array->map.max_entries
    27: cmp %edx,0x24(%rsi) |
    2a: jbe 0x0000000000000066 |_
    2c: mov -0x224(%rbp),%eax | tail call limit check
    32: cmp $0x20,%eax |
    35: ja 0x0000000000000066 |
    37: add $0x1,%eax |
    3a: mov %eax,-0x224(%rbp) |_
    40: mov 0xd0(%rsi,%rdx,8),%rax |_ prog = array->ptrs[index]
    48: test %rax,%rax | prog == NULL check
    4b: je 0x0000000000000066 |_
    4d: mov 0x30(%rax),%rax | goto *(prog->bpf_func + prologue_size)
    51: add $0x19,%rax |
    55: callq 0x0000000000000061 | retpoline for indirect jump
    5a: pause |
    5c: lfence |
    5f: jmp 0x000000000000005a |
    61: mov %rax,(%rsp) |
    65: retq |_
    66: mov $0x1,%eax
    6b: pop %rbx
    6c: pop %r15
    6e: pop %r14
    70: pop %r13
    72: pop %rbx
    73: leaveq
    74: retq

    After; state after JIT:

    # ./bpftool p d j i 1655
    0xffffffffc08e8930:
    0: nopl 0x0(%rax,%rax,1)
    5: push %rbp
    6: mov %rsp,%rbp
    9: sub $0x200,%rsp
    10: push %rbx
    11: push %r13
    13: push %r14
    15: push %r15
    17: pushq $0x0 _
    19: xor %edx,%edx |_ index (arg 3)
    1b: movabs $0xffff9d8afd74c000,%rsi |_ map (arg 2)
    25: mov -0x224(%rbp),%eax | tail call limit check
    2b: cmp $0x20,%eax |
    2e: ja 0x000000000000003e |
    30: add $0x1,%eax |
    33: mov %eax,-0x224(%rbp) |_
    39: jmpq 0xfffffffffffd1785 |_ [direct] goto *(prog->bpf_func + prologue_size)
    3e: mov $0x1,%eax
    43: pop %rbx
    44: pop %r15
    46: pop %r14
    48: pop %r13
    4a: pop %rbx
    4b: leaveq
    4c: retq

    After; state after map update (target prog):

    # ./bpftool p d j i 1655
    0xffffffffc08e8930:
    0: nopl 0x0(%rax,%rax,1)
    5: push %rbp
    6: mov %rsp,%rbp
    9: sub $0x200,%rsp
    10: push %rbx
    11: push %r13
    13: push %r14
    15: push %r15
    17: pushq $0x0
    19: xor %edx,%edx
    1b: movabs $0xffff9d8afd74c000,%rsi
    25: mov -0x224(%rbp),%eax
    2b: cmp $0x20,%eax .
    2e: ja 0x000000000000003e .
    30: add $0x1,%eax .
    33: mov %eax,-0x224(%rbp) |_
    39: jmpq 0xffffffffffb09f55 |_ goto *(prog->bpf_func + prologue_size)
    3e: mov $0x1,%eax
    43: pop %rbx
    44: pop %r15
    46: pop %r14
    48: pop %r13
    4a: pop %rbx
    4b: leaveq
    4c: retq

    After; state after map update (no prog):

    # ./bpftool p d j i 1655
    0xffffffffc08e8930:
    0: nopl 0x0(%rax,%rax,1)
    5: push %rbp
    6: mov %rsp,%rbp
    9: sub $0x200,%rsp
    10: push %rbx
    11: push %r13
    13: push %r14
    15: push %r15
    17: pushq $0x0
    19: xor %edx,%edx
    1b: movabs $0xffff9d8afd74c000,%rsi
    25: mov -0x224(%rbp),%eax
    2b: cmp $0x20,%eax .
    2e: ja 0x000000000000003e .
    30: add $0x1,%eax .
    33: mov %eax,-0x224(%rbp) |_
    39: nopl 0x0(%rax,%rax,1) |_ fall-through nop
    3e: mov $0x1,%eax
    43: pop %rbx
    44: pop %r15
    46: pop %r14
    48: pop %r13
    4a: pop %rbx
    4b: leaveq
    4c: retq

    Nice bonus is that this also shrinks the code emission quite a bit
    for every tail call invocation.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/6ada4c1c9d35eeb5f4ecfab94593dafa6b5c4b09.1574452833.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • Add BPF_MOD_{NOP_TO_JUMP,JUMP_TO_JUMP,JUMP_TO_NOP} patching for x86
    JIT in order to be able to patch direct jumps or nop them out. We need
    this facility in order to patch tail call jumps and in later work also
    BPF static keys.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/aa4784196a8e5e985af4b30a4fe5336bce6e9643.1574452833.git.daniel@iogearbox.net

    Daniel Borkmann
     

16 Nov, 2019

5 commits

  • Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type
    including their subprograms. This feature allows snooping on input and output
    packets in XDP and TC programs, including their return values. In order to do that,
    the verifier needs to track types not only of vmlinux, but types of other BPF
    programs as well. The verifier also needs to translate uapi/linux/bpf.h types
    used by networking programs into kernel internal BTF types used by FENTRY/FEXIT
    BPF programs. In some cases LLVM optimizations can remove arguments from BPF
    subprograms without adjusting the BTF info that the LLVM backend knows about.
    When the BTF info disagrees with the actual types that the verifier sees, the
    BPF trampoline has to fall back to being conservative and treat all arguments
    as u64. The FENTRY/FEXIT
    program can still attach to such subprograms, but it won't be able to recognize
    pointer types like 'struct sk_buff *' and it won't be able to pass them to
    bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program
    would need to use bpf_probe_read_kernel() instead.

    The BPF_PROG_LOAD command is extended with an attach_prog_fd field. When it's set
    to zero, the attach_btf_id is one of the vmlinux BTF type ids. When attach_prog_fd
    points to a previously loaded BPF program, the attach_btf_id is the BTF type id of
    the main function or one of its subprograms.
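
    Sketch of the corresponding load-time attributes (illustrative values; the
    field names are the uapi ones):

    union bpf_attr attr = {
        .prog_type            = BPF_PROG_TYPE_TRACING,
        .expected_attach_type = BPF_TRACE_FENTRY,
        .attach_prog_fd       = target_prog_fd, /* 0 => attach_btf_id is a vmlinux BTF id */
        .attach_btf_id        = subprog_btf_id, /* BTF type id of main func or subprog */
        /* plus the usual insns/insn_cnt/license fields */
    };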

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org

    Alexei Starovoitov
     
  • BPF trampoline can be made to work with existing 5 bytes of BPF program
    prologue, but let's add 5 bytes of NOPs to the beginning of every JITed BPF
    program to make BPF trampoline job easier. They can be removed in the future.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-14-ast@kernel.org

    Alexei Starovoitov
     
  • Introduce BPF trampoline concept to allow kernel code to call into BPF programs
    with practically zero overhead. The trampoline generation logic is
    architecture dependent. It's converting native calling convention into BPF
    calling convention. BPF ISA is 64-bit (even on 32-bit architectures). The
    registers R1 to R5 are used to pass arguments into BPF functions. The main BPF
    program accepts only single argument "ctx" in R1. Whereas CPU native calling
    convention is different. x86-64 is passing first 6 arguments in registers
    and the rest on the stack. x86-32 is passing first 3 arguments in registers.
    sparc64 is passing first 6 in registers. And so on.

    The trampolines between BPF and kernel already exist. BPF_CALL_x macros in
    include/linux/filter.h statically compile trampolines from BPF into kernel
    helpers. They convert up to five u64 arguments into kernel C pointers and
    integers. On 64-bit architectures these BPF_to_kernel trampolines are nops. On
    32-bit architectures they're meaningful.

    The opposite job, kernel_to_BPF trampolines, is done by CAST_TO_U64 macros and
    __bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert
    kernel function arguments into array of u64s that BPF program consumes via
    R1=ctx pointer.

    This patch set is doing the same job as __bpf_trace_##call() static
    trampolines, but dynamically for any kernel function. There are ~22k global
    kernel functions that are attachable via nop at function entry. The function
    arguments and types are described in BTF. The job of btf_distill_func_proto()
    function is to extract useful information from BTF into "function model" that
    architecture dependent trampoline generators will use to generate assembly code
    to cast kernel function arguments into array of u64s. For example the kernel
    function eth_type_trans has two pointers. They will be casted to u64 and stored
    into stack of generated trampoline. The pointer to that stack space will be
    passed into BPF program in R1. On x86-64 such generated trampoline will consume
    16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will
    make sure that only two u64 are accessed read-only by BPF program. The verifier
    will also recognize the precise type of the pointers being accessed and will
    not allow typecasting of the pointer to a different type within BPF program.
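
    The "function model" referred to above is a small descriptor roughly along
    these lines (field layout shown as an approximation):

    struct btf_func_model {
        u8 ret_size;                    /* size of the return value */
        u8 nr_args;                     /* number of arguments */
        u8 arg_size[MAX_BPF_FUNC_ARGS]; /* size of each argument */
    };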

    The tracing use case in the datacenter demonstrated that certain key kernel
    functions (like tcp_retransmit_skb) have 2 or more kprobes that are always
    active. Other functions have both kprobe and kretprobe. So it is essential to
    keep both kernel code and BPF programs executing at maximum speed. Hence
    generated BPF trampoline is re-generated every time new program is attached or
    detached to maintain maximum performance.

    To avoid the high cost of retpoline the attached BPF programs are called
    directly. __bpf_prog_enter/exit() are used to support per-program execution
    stats. In the future this logic will be optimized further by adding support
    for bpf_stats_enabled_key inside generated assembly code. Introduction of
    preemptible and sleepable BPF programs will completely remove the need to call
    to __bpf_prog_enter/exit().

    Detach of a BPF program from the trampoline should not fail. To avoid memory
    allocation in the detach path, half of the page is used as a reserve and flipped
    after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly
    which is enough for BPF tracing use cases. This limit can be increased in the
    future.

    BPF_TRACE_FENTRY programs have access to raw kernel function arguments while
    BPF_TRACE_FEXIT programs have access to kernel return value as well. Often
    kprobe BPF program remembers function arguments in a map while kretprobe
    fetches arguments from a map and analyzes them together with return value.
    BPF_TRACE_FEXIT accelerates this typical use case.

    Recursion prevention for kprobe BPF programs is done via per-cpu
    bpf_prog_active counter. In practice that turned out to be a mistake. It
    caused programs to randomly skip execution. The tracing tools missed results
    they were looking for. Hence BPF trampoline doesn't provide builtin recursion
    prevention. It's a job of BPF program itself and will be addressed in the
    follow up patches.

    BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
    in the future. For example to remove retpoline cost from XDP programs.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org

    Alexei Starovoitov
     
  • Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch
    nops/calls in kernel text into calls into BPF trampoline and to patch
    calls/nops inside BPF programs too.
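
    For reference, the helper's shape, shown with the simplified
    BPF_MOD_{CALL,JUMP} poke types that a follow-up commit (see the 25 Nov
    entry above) settles on:

    enum bpf_text_poke_type {
        BPF_MOD_CALL,
        BPF_MOD_JUMP,
    };

    int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type type,
                           void *old_addr, void *new_addr);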

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org

    Alexei Starovoitov
     
  • Refactor x86 JITing of LDX, STX, CALL instructions into separate helper
    functions. No functional changes in the LDX and STX helpers. There is a minor
    change in the CALL helper. It will populate the target address correctly on the
    first pass of the JIT instead of the second pass. That won't reduce the total
    number of JIT passes though.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-3-ast@kernel.org

    Alexei Starovoitov
     

17 Oct, 2019

1 commit

  • Pointer to BTF object is a pointer to kernel object or NULL.
    Such pointers can only be used by BPF_LDX instructions.
    The verifier changed their opcode from LDX|MEM|size
    to LDX|PROBE_MEM|size to make JITing easier.
    The number of entries in extable is the number of BPF_LDX insns
    that access kernel memory via "pointer to BTF type".
    Only these load instructions can fault.
    Since the x86 extable is relative, it has to be allocated in the same
    memory region as the JITed code.
    Allocate it prior to the last pass of JITing and let the last pass populate it.
    Pointer to extable in bpf_prog_aux is necessary to make page fault
    handling fast.
    Page fault handling is done in two steps:
    1. bpf_prog_kallsyms_find() finds BPF program that page faulted.
    It's done by walking rb tree.
    2. then extable for given bpf program is binary searched.
    This process is similar to how page faulting is done for kernel modules.
    The exception handler skips over faulting x86 instruction and
    initializes destination register with zero. This mimics exact
    behavior of bpf_probe_read (when probe_kernel_read faults dest is zeroed).
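
    An illustrative sketch of that fixup step (the exact encoding of the fixup
    field is an assumption here):

    static bool ex_handler_bpf(const struct exception_table_entry *x,
                               struct pt_regs *regs)
    {
        u32 reg_off  = x->fixup >> 8;   /* pt_regs offset of the dest register */
        u32 insn_len = x->fixup & 0xff; /* length of the faulting load */

        *(unsigned long *)((void *)regs + reg_off) = 0; /* dest = 0 */
        regs->ip += insn_len;                           /* skip the faulting load */
        return true;
    }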

    JITs for other architectures can add support in similar way.
    Until then they will reject unknown opcode and fallback to interpreter.

    Since the extable should be aligned and placed near the JITed code,
    make bpf_jit_binary_alloc() return a 4-byte aligned image offset,
    so that the extable aligning formula in bpf_int_jit_compile() doesn't need
    to rely on the internal implementation of bpf_jit_binary_alloc().
    On x86 gcc defaults to 16-byte alignment for regular kernel functions
    due to better performance. JITed code may be aligned to 16 in the future,
    but it will use 4 in the meantime.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20191016032505.2089704-10-ast@kernel.org

    Alexei Starovoitov
     

05 Oct, 2019

1 commit

  • Replace 'cmp reg, 0' with 'test reg, reg' for comparisons against
    zero. Saves 1 byte of instruction encoding per occurrence. The flag
    results of test 'reg, reg' are identical to 'cmp reg, 0' in all
    cases except for AF which we don't use/care about. In terms of
    macro-fusibility in combination with a subsequent conditional jump
    instruction, both have the same properties for the jumps used in
    the JIT translation. For example, the same JITed Cilium program can
    shrink a bit, from e.g. 12,455 to 12,317 bytes, as tests with 0 are
    used quite frequently.
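
    The two encodings, sketched with the JIT's helpers (the REX/mod prefix is
    emitted the same way in both cases):

    /* before: cmp dst_reg, 0 - opcode 0x83 /7, needs an immediate byte */
    EMIT3(0x83, add_1reg(0xF8, dst_reg), 0);
    /* after: test dst_reg, dst_reg - opcode 0x85 /r, one byte shorter */
    EMIT2(0x85, add_2reg(0xC0, dst_reg, dst_reg));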

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Acked-by: John Fastabend

    Daniel Borkmann
     

02 Aug, 2019

1 commit

  • Introduction of bounded loops exposed an old bug in the x64 JIT.
    JIT maintains the array of offsets to the end of all instructions to
    compute jmp offsets.
    addrs[0] - offset of the end of the 1st insn (that includes prologue).
    addrs[1] - offset of the end of the 2nd insn.
    JIT didn't keep the offset of the beginning of the 1st insn,
    since classic BPF didn't have backward jumps and valid extended BPF
    couldn't have a branch to 1st insn, because it didn't allow loops.
    With bounded loops it's possible to construct a valid program that
    jumps backwards to the 1st insn.
    Fix JIT by computing:
    addrs[0] - offset of the end of prologue == start of the 1st insn.
    addrs[1] - offset of the end of 1st insn.
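
    Sketched (not the literal diff), the fix gives addrs[] one extra slot and
    shifts its meaning by one:

    addrs = kmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL);

    /* initial worst-case estimate before the real JIT passes refine it */
    for (proglen = 0, i = 0; i <= prog->len; i++) {
        proglen += 64;
        addrs[i] = proglen;
    }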

    v1->v2:
    - Yonghong noticed a bug in jit linfo.
    Fix it by passing 'addrs + 1' to bpf_prog_fill_jited_linfo(),
    since it expects insn_to_jit_off array to be offsets to last byte.

    Reported-by: syzbot+35101610ff3e83119b1b@syzkaller.appspotmail.com
    Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
    Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu

    Alexei Starovoitov
     

09 Jul, 2019

1 commit


04 Jul, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2019-07-03

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH
    on BE architectures, from Jiong.

    2) Fix several bugs in the x32 BPF JIT for handling shifts by 0,
    from Luke and Xi.

    3) Fix NULL pointer deref in btf_type_is_resolve_source_only(),
    from Stanislav.

    4) Properly handle the check that forwarding is enabled on the device
    in bpf_ipv6_fib_lookup() helper code, from Anton.

    5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit
    alignment such as m68k, from Baruch.

    6) Fix kernel hanging in unregister_netdevice loop while unregistering
    device bound to XDP socket, from Ilya.

    7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan.

    8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri.

    9) Fix bpftool to use correct argument in cgroup errors, from Jakub.

    10) Fix detaching dummy prog in XDP redirect sample code, from Prashant.

    11) Add Jonathan to AF_XDP reviewers, from Björn.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Jul, 2019

2 commits

  • The current x32 BPF JIT does not correctly compile shift operations when
    the immediate shift amount is 0. The expected behavior is for this to
    be a no-op.

    The following program demonstrates the bug. The expected result is 1,
    but the current JITed code returns 2.

    r0 = 1
    r1 = 1
    r1 <<= 0
    if r1 == 1 goto l0
    r0 = 2
    l0:
    exit
    Signed-off-by: Xi Wang
    Signed-off-by: Luke Nelson
    Signed-off-by: Daniel Borkmann

    Luke Nelson
     
  • The current x32 BPF JIT for shift operations is not correct when the
    shift amount in a register is 0. The expected behavior is a no-op, whereas
    the current implementation changes bits in the destination register.

    The following example demonstrates the bug. The expected result of this
    program is 1, but the current JITed code returns 2.

    r0 = 1
    r1 = 1
    r2 = 0
    r1 <<= r2
    if r1 == 1 goto l0
    r0 = 2
    l0:
    exit
    Signed-off-by: Xi Wang
    Signed-off-by: Luke Nelson
    Signed-off-by: Daniel Borkmann

    Luke Nelson
     

18 Jun, 2019

2 commits

  • Honestly all the conflicts were simple overlapping changes,
    nothing really interesting to report.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:
    "Lots of bug fixes here:

    1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.

    2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
    Crispin.

    3) Use after free in psock backlog workqueue, from John Fastabend.

    4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
    Salem.

    5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.

    6) Network header needs to be set for packet redirect in nfp, from
    John Hurley.

    7) Fix udp zerocopy refcnt, from Willem de Bruijn.

    8) Don't assume linear buffers in vxlan and geneve error handlers,
    from Stefano Brivio.

    9) Fix TOS matching in mlxsw, from Jiri Pirko.

    10) More SCTP cookie memory leak fixes, from Neil Horman.

    11) Fix VLAN filtering in rtl8366, from Linus Walleij.

    12) Various TCP SACK payload size and fragmentation memory limit fixes
    from Eric Dumazet.

    13) Use after free in pneigh_get_next(), also from Eric Dumazet.

    14) LAPB control block leak fix from Jeremy Sowden"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
    lapb: fixed leak of control-blocks.
    tipc: purge deferredq list for each grp member in tipc_group_delete
    ax25: fix inconsistent lock state in ax25_destroy_timer
    neigh: fix use-after-free read in pneigh_get_next
    tcp: fix compile error if !CONFIG_SYSCTL
    hv_sock: Suppress bogus "may be used uninitialized" warnings
    be2net: Fix number of Rx queues used for flow hashing
    net: handle 802.1P vlan 0 packets properly
    tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
    tcp: add tcp_min_snd_mss sysctl
    tcp: tcp_fragment() should apply sane memory limits
    tcp: limit payload size of sacked skbs
    Revert "net: phylink: set the autoneg state in phylink_phy_change"
    bpf: fix nested bpf tracepoints with per-cpu data
    bpf: Fix out of bounds memory access in bpf_sk_storage
    vsock/virtio: set SOCK_DONE on peer shutdown
    net: dsa: rtl8366: Fix up VLAN filtering
    net: phylink: set the autoneg state in phylink_phy_change
    net: add high_order_alloc_disable sysctl/static key
    tcp: add tcp_tx_skb_cache sysctl
    ...

    Linus Torvalds
     

15 Jun, 2019

1 commit

  • Since commit 177366bf7ceb, %rbp stopped pointing to the %rbp of the
    previous stack frame. That broke frame pointer based stack unwinding.
    This commit is a partial revert of it.
    Note that the location of tail_call_cnt is fixed, since the verifier
    enforces MAX_BPF_STACK stack size for programs with tail calls.

    Fixes: 177366bf7ceb ("bpf: change x86 JITed program stack layout")
    Signed-off-by: Alexei Starovoitov

    Alexei Starovoitov
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation version 2 of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 315 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 May, 2019

1 commit