29 Feb, 2020

1 commit

  • commit e20d3a055a457a10a4c748ce5b7c2ed3173a1324 upstream.

    This `if` guards whether user-space wants a copy of the offload-jited
    bytecode and whether this bytecode exists. Erroneously doing a bitwise
    AND instead of a logical AND on the user- and kernel-space buffer sizes
    can lead to no data being copied to user-space, especially when the
    user-space size is a power of two and bigger than the kernel-space buffer.
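    The failure mode is plain C operator semantics; a minimal user-space illustration with made-up sizes (not the kernel code itself):

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t ulen = 8; /* made-up user buffer size: a power of two  */
        uint32_t klen = 4; /* made-up kernel jited-image size (smaller) */

        /* Bitwise AND: 0b1000 & 0b0100 == 0, so the copy is skipped. */
        assert((ulen & klen) == 0);

        /* Logical AND: both sizes are non-zero, so the copy proceeds. */
        assert(ulen && klen);
        return 0;
    }
    ```

    Any pair of sizes with no set bits in common triggers the bug, but powers of two make it likely.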

    Fixes: fcfb126defda ("bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info")
    Signed-off-by: Johannes Krude
    Signed-off-by: Daniel Borkmann
    Acked-by: Jakub Kicinski
    Link: https://lore.kernel.org/bpf/20200212193227.GA3769@phlox.h.transitiv.net
    Signed-off-by: Greg Kroah-Hartman

    Johannes Krude
     

24 Feb, 2020

1 commit

  • [ Upstream commit 90435a7891a2259b0f74c5a1bc5600d0d64cba8f ]

    If a seq_file .next function does not change the position index,
    a read after some lseek can generate unexpected output.

    See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283

    v1 -> v2: removed a missed increment at the end of the function
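    The broken pattern can be illustrated as a user-space analogy (this is not the kernel seq_file API, just its iterator contract):

    ```c
    #include <assert.h>
    #include <stddef.h>

    static const int data[] = { 10, 20, 30 };
    static const size_t len = sizeof(data) / sizeof(data[0]);

    /* Broken .next analogue: returns an element but never advances *pos. */
    static const int *next_broken(size_t *pos)
    {
        return *pos < len ? &data[*pos] : NULL;
    }

    /* Fixed .next analogue: always advances the position index. */
    static const int *next_fixed(size_t *pos)
    {
        return *pos < len ? &data[(*pos)++] : NULL;
    }

    int main(void)
    {
        size_t pos = 0;

        /* A reader driving the broken iterator keeps seeing record 10. */
        assert(*next_broken(&pos) == 10);
        assert(*next_broken(&pos) == 10);

        pos = 0;
        assert(*next_fixed(&pos) == 10);
        assert(*next_fixed(&pos) == 20);
        return 0;
    }
    ```

    seq_file relies on .next moving the position even when it returns NULL; a stuck position plus an lseek yields repeated or skewed records.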

    Signed-off-by: Vasily Averin
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com
    Signed-off-by: Sasha Levin

    Vasily Averin
     

11 Feb, 2020

1 commit

  • commit 485ec2ea9cf556e9c120e07961b7b459d776a115 upstream.

    head is traversed using hlist_for_each_entry_rcu outside an RCU
    read-side critical section but under the protection of dtab->index_lock.

    Hence, add corresponding lockdep expression to silence false-positive
    lockdep warnings, and harden RCU lists.
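    The change described above amounts to passing the held lock as the optional lockdep condition of the RCU list iterator; a sketch of the pattern (identifiers taken from the description, not a verbatim diff):

    ```c
    /* Traversal is protected by dtab->index_lock rather than rcu_read_lock(),
     * so tell lockdep about it via the iterator's optional cond argument. */
    hlist_for_each_entry_rcu(dev, head, index_hlist,
                             lockdep_is_held(&dtab->index_lock)) {
            /* ... */
    }
    ```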

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Signed-off-by: Amol Grover
    Signed-off-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200123120437.26506-1-frextrite@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Amol Grover
     

26 Jan, 2020

1 commit

  • [ Upstream commit 071cdecec57fb5d5df78e6a12114ad7bccea5b0e ]

    Tetsuo pointed out that it was not only the device unregister hook that was
    broken for devmap_hash types, it was also cleanup on map free. So better
    fix this as well.

    While we're at it, there's no reason to allocate the netdev_map array for
    DEVMAP_HASH, so skip that and adjust the cost accordingly.

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-by: Tetsuo Handa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20191121133612.430414-1-toke@redhat.com
    Signed-off-by: Sasha Levin

    Toke Høiland-Jørgensen
     

23 Jan, 2020

1 commit

  • commit 0af2ffc93a4b50948f9dad2786b7f1bd253bf0b9 upstream.

    Anatoly has been fuzzing with the kBdysch harness and reported a hang in one
    of the outcomes:

    0: R1=ctx(id=0,off=0,imm=0) R10=fp0
    0: (85) call bpf_get_socket_cookie#46
    1: R0_w=invP(id=0) R10=fp0
    1: (57) r0 &= 808464432
    2: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
    2: (14) w0 -= 810299440
    3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
    3: (c4) w0 s>>= 1
    4: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
    4: (76) if w0 s>= 0x30303030 goto pc+216
    221: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
    221: (95) exit
    processed 6 insns (limit 1000000) [...]

    Taking a closer look, the program was xlated as follows:

    # ./bpftool p d x i 12
    0: (85) call bpf_get_socket_cookie#7800896
    1: (bf) r6 = r0
    2: (57) r6 &= 808464432
    3: (14) w6 -= 810299440
    4: (c4) w6 s>>= 1
    5: (76) if w6 s>= 0x30303030 goto pc+216
    6: (05) goto pc-1
    7: (05) goto pc-1
    8: (05) goto pc-1
    [...]
    220: (05) goto pc-1
    221: (05) goto pc-1
    222: (95) exit

    Meaning, the visible effect is very similar to f54c7898ed1c ("bpf: Fix
    precision tracking for unbounded scalars"), that is, the fall-through
    branch at instruction 5 is considered never taken given the
    conclusion from the min/max bounds tracking in w6, and therefore the
    dead-code sanitation rewrites it as goto pc-1. However, real-life input
    disagrees with the verification analysis since a soft-lockup was observed.

    The bug sits in the analysis of the ARSH. The definition is that we shift
    the target register value right by K bits through shifting in copies of
    its sign bit. In adjust_scalar_min_max_vals(), we first coerce the
    register into 32 bit mode; the same happens after simulating the operation.
    However, when simulating the actual ARSH, we don't take the
    mode into account and act as if it's always 64 bit, but the location of the
    sign bit is different:

    dst_reg->smin_value >>= umin_val;
    dst_reg->smax_value >>= umin_val;
    dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);

    Consider an unknown R0 where bpf_get_socket_cookie() (or others) would
    for example return 0xffff. With the above ARSH simulation, we'd see the
    following results:

    [...]
    1: R1=ctx(id=0,off=0,imm=0) R2_w=invP65535 R10=fp0
    1: (85) call bpf_get_socket_cookie#46
    2: R0_w=invP(id=0) R10=fp0
    2: (57) r0 &= 808464432
    -> R0_runtime = 0x3030
    3: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
    3: (14) w0 -= 810299440
    -> R0_runtime = 0xcfb40000
    4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
    (0xffffffff)
    4: (c4) w0 s>>= 1
    -> R0_runtime = 0xe7da0000
    5: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
    (0x67c00000) (0x7ffbfff8)
    [...]

    In insn 3, we have a runtime value of 0xcfb40000, which is '1100 1111 1011
    0100 0000 0000 0000 0000'; the result after the shift, 0xe7da0000, is
    '1110 0111 1101 1010 0000 0000 0000 0000', where the sign bit is correctly
    retained in 32 bit mode. In insn 4, the umax was 0xffffffff, and changed into
    0x7ffbfff8 after the shift, that is, '0111 1111 1111 1011 1111 1111 1111 1000',
    which means the simulation didn't retain the sign bit. With the above
    logic, the updates happen on the 64 bit min/max bounds and, given we coerced
    the register, the sign bits of the bounds are cleared as well; meaning, we
    need to force the simulation into s32 space for 32 bit alu mode.
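    The 32- vs 64-bit sign-bit difference can be checked in plain C (an illustration of the arithmetic, using the runtime value from the trace; signed right shift of a negative value is implementation-defined in C but arithmetic on mainstream compilers):

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* Runtime value from the trace, zero-extended into a 64-bit register. */
        uint64_t v = 0xcfb40000;

        /* 64-bit arithmetic shift: bit 63 is 0, so no sign bits shift in. */
        assert((uint64_t)((int64_t)v >> 1) == 0x67da0000);

        /* 32-bit arithmetic shift: bit 31 is 1, so the sign is propagated,
         * matching the expected runtime result 0xe7da0000. */
        assert((uint32_t)((int32_t)(uint32_t)v >> 1) == 0xe7da0000u);
        return 0;
    }
    ```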

    Verification after the fix below. We're first analyzing the fall-through branch
    on 32 bit signed >= test eventually leading to rejection of the program in this
    specific case:

    0: R1=ctx(id=0,off=0,imm=0) R10=fp0
    0: (b7) r2 = 808464432
    1: R1=ctx(id=0,off=0,imm=0) R2_w=invP808464432 R10=fp0
    1: (85) call bpf_get_socket_cookie#46
    2: R0_w=invP(id=0) R10=fp0
    2: (bf) r6 = r0
    3: R0_w=invP(id=0) R6_w=invP(id=0) R10=fp0
    3: (57) r6 &= 808464432
    4: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
    4: (14) w6 -= 810299440
    5: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
    5: (c4) w6 s>>= 1
    6: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
    (0x67c00000) (0xfffbfff8)
    6: (76) if w6 s>= 0x30303030 goto pc+216
    7: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
    7: (30) r0 = *(u8 *)skb[808464432]
    BPF_LD_[ABS|IND] uses reserved fields
    processed 8 insns (limit 1000000) [...]

    Fixes: 9cbe1f5a32dc ("bpf/verifier: improve register value range tracking with ARSH")
    Reported-by: Anatoly Trosinenko
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200115204733.16648-1-daniel@iogearbox.net
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

18 Jan, 2020

1 commit

  • commit e10360f815ca6367357b2c2cfef17fc663e50f7b upstream.

    Before commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
    cgroup bpf structures were released with
    corresponding cgroup structures. It guaranteed the hierarchical order
    of destruction: children were always first. It preserved attached
    programs from being released before their propagated copies.

    But with cgroup auto-detachment there are no such guarantees anymore:
    cgroup bpf is released as soon as the cgroup is offline and there are
    no live associated sockets. It means that an attached program can be
    detached and released, while its propagated copy is still living
    in the cgroup subtree. This will obviously lead to a use-after-free
    bug.

    To reproduce the issue the following script can be used:

    #!/bin/bash

    CGROOT=/sys/fs/cgroup

    mkdir -p ${CGROOT}/A ${CGROOT}/B ${CGROOT}/A/C
    sleep 1

    ./test_cgrp2_attach ${CGROOT}/A egress &
    A_PID=$!
    ./test_cgrp2_attach ${CGROOT}/B egress &
    B_PID=$!

    echo $$ > ${CGROOT}/A/C/cgroup.procs
    iperf -s &
    S_PID=$!
    iperf -c localhost -t 100 &
    C_PID=$!

    sleep 1

    echo $$ > ${CGROOT}/B/cgroup.procs
    echo ${S_PID} > ${CGROOT}/B/cgroup.procs
    echo ${C_PID} > ${CGROOT}/B/cgroup.procs

    sleep 1

    rmdir ${CGROOT}/A/C
    rmdir ${CGROOT}/A

    sleep 1

    kill -9 ${S_PID} ${C_PID} ${A_PID} ${B_PID}

    On the unpatched kernel the following stacktrace can be obtained:

    [ 33.619799] BUG: unable to handle page fault for address: ffffbdb4801ab002
    [ 33.620677] #PF: supervisor read access in kernel mode
    [ 33.621293] #PF: error_code(0x0000) - not-present page
    [ 33.622754] Oops: 0000 [#1] SMP NOPTI
    [ 33.623202] CPU: 0 PID: 601 Comm: iperf Not tainted 5.5.0-rc2+ #23
    [ 33.625545] RIP: 0010:__cgroup_bpf_run_filter_skb+0x29f/0x3d0
    [ 33.635809] Call Trace:
    [ 33.636118] ? __cgroup_bpf_run_filter_skb+0x2bf/0x3d0
    [ 33.636728] ? __switch_to_asm+0x40/0x70
    [ 33.637196] ip_finish_output+0x68/0xa0
    [ 33.637654] ip_output+0x76/0xf0
    [ 33.638046] ? __ip_finish_output+0x1c0/0x1c0
    [ 33.638576] __ip_queue_xmit+0x157/0x410
    [ 33.639049] __tcp_transmit_skb+0x535/0xaf0
    [ 33.639557] tcp_write_xmit+0x378/0x1190
    [ 33.640049] ? _copy_from_iter_full+0x8d/0x260
    [ 33.640592] tcp_sendmsg_locked+0x2a2/0xdc0
    [ 33.641098] ? sock_has_perm+0x10/0xa0
    [ 33.641574] tcp_sendmsg+0x28/0x40
    [ 33.641985] sock_sendmsg+0x57/0x60
    [ 33.642411] sock_write_iter+0x97/0x100
    [ 33.642876] new_sync_write+0x1b6/0x1d0
    [ 33.643339] vfs_write+0xb6/0x1a0
    [ 33.643752] ksys_write+0xa7/0xe0
    [ 33.644156] do_syscall_64+0x5b/0x1b0
    [ 33.644605] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fix this by grabbing a reference to the bpf structure of each ancestor
    on the initialization of the cgroup bpf structure, and dropping the
    reference at the end of releasing the cgroup bpf structure.

    This will restore the hierarchical order of cgroup bpf releasing,
    without adding any operations on hot paths.
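    A sketch of the shape of that fix follows; the helper names cgroup_bpf_get()/cgroup_bpf_put() and the cgroup_parent() traversal are assumptions drawn from the description above, not a verbatim diff:

    ```c
    /* On cgroup bpf initialization, pin every ancestor's bpf struct so that
     * ancestors cannot be released while descendants hold propagated copies. */
    for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
            cgroup_bpf_get(p);

    /* ...and only at the end of releasing this cgroup's bpf struct: */
    for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
            cgroup_bpf_put(p);
    ```

    This restores children-first destruction without touching any hot path.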

    Thanks to Josef Bacik for the debugging and the initial analysis of
    the problem.

    Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
    Reported-by: Josef Bacik
    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

12 Jan, 2020

1 commit

  • commit 6d4f151acf9a4f6fab09b615f246c717ddedcf0c upstream.

    Anatoly has been fuzzing with the kBdysch harness and reported a KASAN
    slab-out-of-bounds in one of the outcomes:

    [...]
    [ 77.359642] BUG: KASAN: slab-out-of-bounds in bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.360463] Read of size 4 at addr ffff8880679bac68 by task bpf/406
    [ 77.361119]
    [ 77.361289] CPU: 2 PID: 406 Comm: bpf Not tainted 5.5.0-rc2-xfstests-00157-g2187f215eba #1
    [ 77.362134] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    [ 77.362984] Call Trace:
    [ 77.363249] dump_stack+0x97/0xe0
    [ 77.363603] print_address_description.constprop.0+0x1d/0x220
    [ 77.364251] ? bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.365030] ? bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.365860] __kasan_report.cold+0x37/0x7b
    [ 77.366365] ? bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.366940] kasan_report+0xe/0x20
    [ 77.367295] bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.367821] ? bpf_skb_load_helper_8+0xf0/0xf0
    [ 77.368278] ? mark_lock+0xa3/0x9b0
    [ 77.368641] ? kvm_sched_clock_read+0x14/0x30
    [ 77.369096] ? sched_clock+0x5/0x10
    [ 77.369460] ? sched_clock_cpu+0x18/0x110
    [ 77.369876] ? bpf_skb_load_helper_8+0xf0/0xf0
    [ 77.370330] ___bpf_prog_run+0x16c0/0x28f0
    [ 77.370755] __bpf_prog_run32+0x83/0xc0
    [ 77.371153] ? __bpf_prog_run64+0xc0/0xc0
    [ 77.371568] ? match_held_lock+0x1b/0x230
    [ 77.371984] ? rcu_read_lock_held+0xa1/0xb0
    [ 77.372416] ? rcu_is_watching+0x34/0x50
    [ 77.372826] sk_filter_trim_cap+0x17c/0x4d0
    [ 77.373259] ? sock_kzfree_s+0x40/0x40
    [ 77.373648] ? __get_filter+0x150/0x150
    [ 77.374059] ? skb_copy_datagram_from_iter+0x80/0x280
    [ 77.374581] ? do_raw_spin_unlock+0xa5/0x140
    [ 77.375025] unix_dgram_sendmsg+0x33a/0xa70
    [ 77.375459] ? do_raw_spin_lock+0x1d0/0x1d0
    [ 77.375893] ? unix_peer_get+0xa0/0xa0
    [ 77.376287] ? __fget_light+0xa4/0xf0
    [ 77.376670] __sys_sendto+0x265/0x280
    [ 77.377056] ? __ia32_sys_getpeername+0x50/0x50
    [ 77.377523] ? lock_downgrade+0x350/0x350
    [ 77.377940] ? __sys_setsockopt+0x2a6/0x2c0
    [ 77.378374] ? sock_read_iter+0x240/0x240
    [ 77.378789] ? __sys_socketpair+0x22a/0x300
    [ 77.379221] ? __ia32_sys_socket+0x50/0x50
    [ 77.379649] ? mark_held_locks+0x1d/0x90
    [ 77.380059] ? trace_hardirqs_on_thunk+0x1a/0x1c
    [ 77.380536] __x64_sys_sendto+0x74/0x90
    [ 77.380938] do_syscall_64+0x68/0x2a0
    [ 77.381324] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 77.381878] RIP: 0033:0x44c070
    [...]

    After further debugging, it turns out that while for other helper functions
    we disallow passing a modified ctx, the special case of the ld/abs/ind
    instruction, which has similar semantics (except r6 being the ctx argument),
    is missing such a check. A modified ctx is impossible here as
    bpf_skb_load_helper_8_no_cache() and others expect skb fields in their
    original position; hence, add check_ctx_reg() to reject any modified ctx.
    The issue was first introduced back
    in f1174f77b50c ("bpf/verifier: rework value tracking").
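    The described fix adds a ctx check to the verifier's ld/abs/ind handling; roughly (a sketch based on the description, with the ctx_reg-is-r6 detail assumed, not the exact diff):

    ```c
    /* ld/abs/ind implicitly uses r6 as the skb ctx: reject any ctx register
     * that carries a non-zero offset or variable offset. */
    err = check_ctx_reg(env, &regs[ctx_reg], ctx_reg);
    if (err < 0)
            return err;
    ```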

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Reported-by: Anatoly Trosinenko
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200106215157.3553-1-daniel@iogearbox.net
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

09 Jan, 2020

1 commit

  • commit f54c7898ed1c3c9331376c0337a5049c38f66497 upstream.

    Anatoly has been fuzzing with the kBdysch harness and reported a hang in one
    of the outcomes. Upon closer analysis, it turns out that precise scalar
    value tracking is missing a few precision markings for unknown scalars:

    0: R1=ctx(id=0,off=0,imm=0) R10=fp0
    0: (b7) r0 = 0
    1: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    1: (35) if r0 >= 0xf72e goto pc+0
    --> only follow fallthrough
    2: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    2: (35) if r0 >= 0x80fe0000 goto pc+0
    --> only follow fallthrough
    3: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    3: (14) w0 -= -536870912
    4: R0_w=invP536870912 R1=ctx(id=0,off=0,imm=0) R10=fp0
    4: (0f) r1 += r0
    5: R0_w=invP536870912 R1_w=inv(id=0) R10=fp0
    5: (55) if r1 != 0x104c1500 goto pc+0
    --> push other branch for later analysis
    R0_w=invP536870912 R1_w=inv273421568 R10=fp0
    6: R0_w=invP536870912 R1_w=inv273421568 R10=fp0
    6: (b7) r0 = 0
    7: R0=invP0 R1=inv273421568 R10=fp0
    7: (76) if w1 s>= 0xffffff00 goto pc+3
    --> only follow goto
    11: R0=invP0 R1=inv273421568 R10=fp0
    11: (95) exit
    6: R0_w=invP536870912 R1_w=inv(id=0) R10=fp0
    6: (b7) r0 = 0
    propagating r0
    7: safe
    processed 11 insns [...]

    In the analysis of the second path coming after the successful exit above,
    the path is being pruned at line 7. Pruning analysis found that both r0 are
    precise P0 and both R1 are non-precise scalars and given prior path with
    R1 as non-precise scalar succeeded, this one is therefore safe as well.

    However, the problem is that given the condition at insn 7 in the first run,
    we only followed the goto and didn't push the other branch for later
    analysis; we never walked the few insns in there and therefore dead-code
    sanitation rewrites it as goto pc-1, causing the hang depending on the skb
    address hitting these conditions. The issue is that R1 should have been
    marked as precise as well, such that pruning enforces a range check and
    concludes that the new R1 is not in the range of the old R1. In insn 4, we
    mark R1 (skb) as unknown scalar via __mark_reg_unbounded() but not
    mark_reg_unbounded() and therefore regs->precise remains false.

    Back in b5dc0163d8fd ("bpf: precise scalar_value tracking"), this was not
    the case since marking out of __mark_reg_unbounded() had this covered as well.
    Once both are set as precise in insn 4 as they should have been, the check
    at insn 7 concludes that, given R1 was 0x104c1500 in the prior fall-through
    path and is now completely unknown, we need to continue walking.
    Analysis after the fix:

    0: R1=ctx(id=0,off=0,imm=0) R10=fp0
    0: (b7) r0 = 0
    1: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    1: (35) if r0 >= 0xf72e goto pc+0
    2: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    2: (35) if r0 >= 0x80fe0000 goto pc+0
    3: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    3: (14) w0 -= -536870912
    4: R0_w=invP536870912 R1=ctx(id=0,off=0,imm=0) R10=fp0
    4: (0f) r1 += r0
    5: R0_w=invP536870912 R1_w=invP(id=0) R10=fp0
    5: (55) if r1 != 0x104c1500 goto pc+0
    R0_w=invP536870912 R1_w=invP273421568 R10=fp0
    6: R0_w=invP536870912 R1_w=invP273421568 R10=fp0
    6: (b7) r0 = 0
    7: R0=invP0 R1=invP273421568 R10=fp0
    7: (76) if w1 s>= 0xffffff00 goto pc+3
    11: R0=invP0 R1=invP273421568 R10=fp0
    11: (95) exit
    6: R0_w=invP536870912 R1_w=invP(id=0) R10=fp0
    6: (b7) r0 = 0
    7: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    7: (76) if w1 s>= 0xffffff00 goto pc+3
    R0_w=invP0 R1_w=invP(id=0) R10=fp0
    8: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    8: (a5) if r0 < 0x2007002a goto pc+0
    9: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    9: (57) r0 &= -16316416
    10: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    10: (a6) if w0 < 0x1201 goto pc+0
    11: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    11: (95) exit
    11: R0=invP0 R1=invP(id=0) R10=fp0
    11: (95) exit
    processed 16 insns [...]

    Fixes: 6754172c208d ("bpf: fix precision tracking in presence of bpf2bpf calls")
    Reported-by: Anatoly Trosinenko
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20191222223740.25297-1-daniel@iogearbox.net
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

31 Dec, 2019

2 commits

  • [ Upstream commit 581738a681b6faae5725c2555439189ca81c0f1f ]

    With the latest llvm (trunk https://github.com/llvm/llvm-project),
    test_progs, which has +alu32 enabled, failed for strobemeta.o.
    The verifier output looks like below, edited to replace large
    decimal numbers with hex ones.
    193: (85) call bpf_probe_read_user_str#114
    R0=inv(id=0)
    194: (26) if w0 > 0x1 goto pc+4
    R0_w=inv(id=0,umax_value=0xffffffff00000001)
    195: (6b) *(u16 *)(r7 +80) = r0
    196: (bc) w6 = w0
    R6_w=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    197: (67) r6 <<= 32
    R6_w=inv(id=0,umax_value=0xffffffff00000000,var_off=(0x0; 0xffffffff00000000))
    198: (77) r6 >>= 32
    R6=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    ...
    201: (79) r8 = *(u64 *)(r10 -416)
    R8_w=map_value(id=0,off=40,ks=4,vs=13872,imm=0)
    202: (0f) r8 += r6
    R8_w=map_value(id=0,off=40,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    203: (07) r8 += 9696
    R8_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    ...
    255: (bf) r1 = r8
    R1_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    ...
    257: (85) call bpf_probe_read_user_str#114
    R1 unbounded memory access, make sure to bounds check any array access into a map

    The value range for register r6 at insn 198 should really be just 0/1.
    The umax_value=0xffffffff caused later verification failure.

    After jmp instructions, the current verifier already tries to use the
    just-obtained information to get a better register range. The current
    mechanism works for 64bit registers only. This patch implements tightening
    the range of 32bit sub-registers after jmp32 instructions.
    With the patch, we have the below ranges for the
    above code sequence:
    193: (85) call bpf_probe_read_user_str#114
    R0=inv(id=0)
    194: (26) if w0 > 0x1 goto pc+4
    R0_w=inv(id=0,smax_value=0x7fffffff00000001,umax_value=0xffffffff00000001,
    var_off=(0x0; 0xffffffff00000001))
    195: (6b) *(u16 *)(r7 +80) = r0
    196: (bc) w6 = w0
    R6_w=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0x1))
    197: (67) r6 <<= 32
    R6_w=inv(id=0,umax_value=0x100000000,var_off=(0x0; 0x100000000))
    198: (77) r6 >>= 32
    R6=inv(id=0,umax_value=1,var_off=(0x0; 0x1))
    ...
    201: (79) r8 = *(u64 *)(r10 -416)
    R8_w=map_value(id=0,off=40,ks=4,vs=13872,imm=0)
    202: (0f) r8 += r6
    R8_w=map_value(id=0,off=40,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
    203: (07) r8 += 9696
    R8_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
    ...
    255: (bf) r1 = r8
    R1_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
    ...
    257: (85) call bpf_probe_read_user_str#114
    ...

    At insn 194, the register R0 has a better var_off.mask and smax_value.
    Especially, the var_off.mask ensures that the later lshift and rshift
    maintain a proper value range.
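    The core observation is ordinary sub-register arithmetic; a small C illustration with a made-up value (not verifier code):

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* A 64-bit value whose low 32 bits satisfy "w0 > 1 is false". */
        uint64_t r0 = 0xffffffff00000001ull;

        assert((uint32_t)r0 <= 1); /* the jmp32 test only inspects w0       */
        assert(r0 > 1);            /* the full 64-bit register stays huge   */

        uint32_t w6 = (uint32_t)r0; /* w6 = w0 copies just the sub-register */
        assert(w6 <= 1);            /* so w6 really is only 0 or 1          */
        return 0;
    }
    ```

    The jmp32 comparison bounds only the low 32 bits, so after `w6 = w0` the copy is genuinely in [0, 1] even though the source's 64-bit value is not; that is exactly the information the patch propagates.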

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20191121170650.449030-1-yhs@fb.com
    Signed-off-by: Sasha Levin

    Yonghong Song
     
  • [ Upstream commit eac9153f2b584c702cea02c1f1a57d85aa9aea42 ]

    bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
    deadlock on rq_lock():

    rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [...]
    Call Trace:
    try_to_wake_up+0x1ad/0x590
    wake_up_q+0x54/0x80
    rwsem_wake+0x8a/0xb0
    bpf_get_stack+0x13c/0x150
    bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
    bpf_overflow_handler+0x60/0x100
    __perf_event_overflow+0x4f/0xf0
    perf_swevent_overflow+0x99/0xc0
    ___perf_sw_event+0xe7/0x120
    __schedule+0x47d/0x620
    schedule+0x29/0x90
    futex_wait_queue_me+0xb9/0x110
    futex_wait+0x139/0x230
    do_futex+0x2ac/0xa50
    __x64_sys_futex+0x13c/0x180
    do_syscall_64+0x42/0x100
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This can be reproduced by:
    1. Start a multi-thread program that does parallel mmap() and malloc();
    2. taskset the program to 2 CPUs;
    3. Attach bpf program to trace_sched_switch and gather stackmap with
    build-id, e.g. with trace.py from bcc tools:
    trace.py -U -p <pid> -s <binary> t:sched:sched_switch

    A sample reproducer is attached at the end.

    This could also trigger deadlock with other locks that are nested with
    rq_lock.

    Fix this by checking whether irqs are disabled. Since rq_lock and all
    other nested locks are irq safe, it is safe to do up_read() when irqs are
    not disabled. If irqs are disabled, postpone up_read() in irq_work.
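    The decision described above can be sketched as follows; the struct name up_read_work and the exact flow are assumptions for illustration, not the committed code:

    ```c
    /* Sketch: releasing mmap_sem may wake waiters and take rq_lock, which
     * deadlocks if an irq-safe lock is already held on this CPU. */
    if (irqs_disabled()) {
            /* defer the unlock: up_read() will run from irq_work context */
            work = this_cpu_ptr(&up_read_work);
            work->sem = &current->mm->mmap_sem;
            irq_work_queue(&work->irq_work);
    } else {
            up_read(&current->mm->mmap_sem);
    }
    ```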

    Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
    Signed-off-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Cc: Peter Zijlstra
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191014171223.357174-1-songliubraving@fb.com

    Reproducer:
    ============================ 8< ============================

    /* Headers and THREAD_COUNT were not part of the original snippet and are
     * added here so the reproducer compiles standalone. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define THREAD_COUNT 4 /* assumed; pick any small thread count */

    char *filename;

    void *worker(void *p)
    {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
            return NULL;
        while (1) {
            struct timespec ts = {0, 1000 + rand() % 2000};

            ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
            usleep(1);
            if (ptr == MAP_FAILED) {
                printf("failed to mmap\n");
                break;
            }
            munmap(ptr, 4096 * 64);
            usleep(1);
            pptr = malloc(1);
            usleep(1);
            pptr[0] = 1;
            usleep(1);
            free(pptr);
            usleep(1);
            nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
            return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
            if (pthread_create(threads + i, NULL, worker, NULL)) {
                fprintf(stderr, "Error creating thread\n");
                return 0;
            }
        }

        for (i = 0; i < THREAD_COUNT; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }
    ============================ 8< ============================

    Signed-off-by: Sasha Levin

    Song Liu
     

07 Nov, 2019

1 commit


01 Nov, 2019

1 commit

    Prior to this commit, the functions bpf_map_area_alloc() and
    bpf_map_charge_init() took the size parameter as size_t. This commit
    changes it to u64.

    All users of these functions avoid size_t overflows on 32-bit systems
    by explicitly using u64 when calculating the allocation size and
    memory charge cost. However, since the result was narrowed to size_t
    when passing size and cost to the functions, the overflow
    handling was in vain.

    Instead of changing all call sites to handle the overflow before the
    call, the parameter is changed to u64 and checked in the
    functions above.
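    The narrowing problem is ordinary C integer conversion; a minimal user-space illustration with made-up sizes (not the kernel code):

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t entries = 1ULL << 30;      /* hypothetical map size       */
        uint64_t cost = entries * 8;        /* u64 math: 8 GiB, no wrap    */
        uint32_t narrowed = (uint32_t)cost; /* what a 32-bit size_t sees   */

        assert(cost == 0x200000000ULL);     /* the real cost survives      */
        assert(narrowed == 0);              /* an overflow check performed
                                               after narrowing is useless  */
        return 0;
    }
    ```

    Widening the parameter to u64 lets the callee see the full value and reject it, instead of silently allocating a truncated size.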

    Fixes: d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
    Fixes: c85d69135a91 ("bpf: move memory size checks to bpf_map_charge_init()")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Kicinski
    Link: https://lore.kernel.org/bpf/20191029154307.23053-1-bjorn.topel@gmail.com

    Björn Töpel
     

31 Oct, 2019

1 commit

  • "ctx:file_pos sysctl:read read ok narrow" works on s390 by accident: it
    reads the wrong byte, which happens to have the expected value of 0.
    Improve the test by seeking to the 4th byte and expecting 4 instead of
    0.

    This makes the latent problem apparent: the test attempts to read the
    first byte of bpf_sysctl.file_pos, assuming this is the least-significant
    byte, which is not the case on big-endian machines: a non-zero offset is
    needed.

    The point of the test is to verify narrow loads, so we cannot cheat our
    way out by simply using BPF_W. The existence of the test means that such
    loads have to be supported, most likely because llvm can generate them.
    Fix the test by adding a big-endian variant, which uses an offset to
    access the least-significant byte of bpf_sysctl.file_pos.

    This reveals the final problem: verifier rejects accesses to bpf_sysctl
    fields with offset > 0. Such accesses are already allowed for a wide
    range of structs: __sk_buff, bpf_sock_addr and sk_msg_md to name a few.
    Extend this support to bpf_sysctl by using bpf_ctx_range instead of
    offsetof when matching field offsets.
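    The endianness point can be demonstrated in plain C (a user-space illustration of where the least-significant byte of a 64-bit field lives, not the selftest itself):

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        uint64_t file_pos = 4; /* the value the improved test expects */
        unsigned char b[sizeof(file_pos)];

        memcpy(b, &file_pos, sizeof(b));

        /* A 1-byte narrow load at offset 0 sees the value only on little
         * endian; on big endian the LSB sits at offset 7 instead. */
        if (b[0] == 4)
            assert(b[7] == 0); /* little-endian host */
        else
            assert(b[7] == 4); /* big-endian host */
        return 0;
    }
    ```

    This is why the big-endian test variant needs a non-zero offset, and why the verifier must accept bpf_sysctl accesses with offset > 0.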

    Fixes: 7b146cebe30c ("bpf: Sysctl hook")
    Fixes: e1550bfe0de4 ("bpf: Add file_pos field to bpf_sysctl ctx")
    Fixes: 9a1027e52535 ("selftests/bpf: Test file_pos field in bpf_sysctl ctx")
    Signed-off-by: Ilya Leoshkevich
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrey Ignatov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20191028122902.9763-1-iii@linux.ibm.com

    Ilya Leoshkevich
     

23 Oct, 2019

2 commits

  • There is one more problematic case I noticed while recently fixing BPF kallsyms
    handling in cd7455f1013e ("bpf: Fix use after free in subprog's jited symbol
    removal") and that is bpf_get_prog_name().

    If BTF has been attached to the prog, then we may be able to fetch the function
    signature type id in kallsyms through prog->aux->func_info[prog->aux->func_idx].type_id.
    However, while the BTF object itself is torn down via an RCU callback, the
    prog's aux->func_info is immediately freed via kvfree(prog->aux->func_info)
    once the prog's refcount either hits zero or when subprograms were already
    exposed via kallsyms and we hit the error path added in 5482e9a93c83 ("bpf:
    Fix memleak in aux->func_info and aux->btf").

    This violates RCU as well since kallsyms could be walked in parallel where we
    could access aux->func_info. Hence, defer kvfree() to after RCU grace period.
    Looking at ba64e7d85252 ("bpf: btf: support proper non-jit func info") there
    is no reason/dependency where we couldn't defer the kvfree(aux->func_info) into
    the RCU callback.

    Fixes: 5482e9a93c83 ("bpf: Fix memleak in aux->func_info and aux->btf")
    Fixes: ba64e7d85252 ("bpf: btf: support proper non-jit func info")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Cc: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/875f2906a7c1a0691f2d567b4d8e4ea2739b1e88.1571779205.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • syzkaller managed to trigger the following crash:

    [...]
    BUG: unable to handle page fault for address: ffffc90001923030
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD aa551067 P4D aa551067 PUD aa552067 PMD a572b067 PTE 80000000a1173163
    Oops: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 7982 Comm: syz-executor912 Not tainted 5.4.0-rc3+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:bpf_jit_binary_hdr include/linux/filter.h:787 [inline]
    RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:531 [inline]
    RIP: 0010:bpf_tree_comp kernel/bpf/core.c:600 [inline]
    RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
    RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
    RIP: 0010:bpf_prog_kallsyms_find kernel/bpf/core.c:674 [inline]
    RIP: 0010:is_bpf_text_address+0x184/0x3b0 kernel/bpf/core.c:709
    [...]
    Call Trace:
    kernel_text_address kernel/extable.c:147 [inline]
    __kernel_text_address+0x9a/0x110 kernel/extable.c:102
    unwind_get_return_address+0x4c/0x90 arch/x86/kernel/unwind_frame.c:19
    arch_stack_walk+0x98/0xe0 arch/x86/kernel/stacktrace.c:26
    stack_trace_save+0xb6/0x150 kernel/stacktrace.c:123
    save_stack mm/kasan/common.c:69 [inline]
    set_track mm/kasan/common.c:77 [inline]
    __kasan_kmalloc+0x11c/0x1b0 mm/kasan/common.c:510
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:518
    slab_post_alloc_hook mm/slab.h:584 [inline]
    slab_alloc mm/slab.c:3319 [inline]
    kmem_cache_alloc+0x1f5/0x2e0 mm/slab.c:3483
    getname_flags+0xba/0x640 fs/namei.c:138
    getname+0x19/0x20 fs/namei.c:209
    do_sys_open+0x261/0x560 fs/open.c:1091
    __do_sys_open fs/open.c:1115 [inline]
    __se_sys_open fs/open.c:1110 [inline]
    __x64_sys_open+0x87/0x90 fs/open.c:1110
    do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [...]

    After further debugging it turns out that kallsyms is walked while, in
    parallel, we tear down a BPF program whose subprograms have already been
    JITed, even though the program itself was never fully exposed and
    eventually bails out with an error.

    The bpf_prog_kallsyms_del_subprogs() in bpf_prog_load()'s error path removes
    the symbols, however, bpf_prog_free() tears down the JIT memory too early via
    scheduled work. Instead, it needs to properly respect the RCU grace period,
    as the kallsyms walk for BPF is done under RCU.

    Fix it by refactoring __bpf_prog_put()'s teardown and reusing it in our
    error path, where we defer final destruction when the program has subprogs.

    Fixes: 7d1982b4e335 ("bpf: fix panic in prog load calls cleanup")
    Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs")
    Reported-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Tested-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com
    Link: https://lore.kernel.org/bpf/55f6367324c2d7e9583fa9ccf5385dcbba0d7a6e.1571752452.git.daniel@iogearbox.net

    Daniel Borkmann
     

22 Oct, 2019

1 commit

  • It seems I forgot to add handling of devmap_hash type maps to the device
    unregister hook for devmaps. This omission causes devices not to be
    released properly, which causes hangs.

    Fix this by adding the missing handler.

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-by: Tetsuo Handa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20191019111931.2981954-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

19 Oct, 2019

1 commit

  • Tetsuo pointed out that without an explicit cast, the cost calculation for
    devmap_hash type maps could overflow on 32-bit builds. This adds the
    missing cast.

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-by: Tetsuo Handa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20191017105702.2807093-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

29 Sep, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Sanity check URB networking device parameters to avoid divide by
    zero, from Oliver Neukum.

    2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6
    don't work properly. Longer term this needs a better fix though. From
    Vijay Khemka.

    3) Small fixes to selftests (use ping when ping6 is not present, etc.)
    from David Ahern.

    4) Bring back the rt_uses_gateway member of struct rtable; its semantics
    were not well understood and trying to remove it broke things. From
    David Ahern.

    5) Move usbnet sanity checking, ignore endpoints with invalid
    wMaxPacketSize. From Bjørn Mork.

    6) Missing Kconfig deps for sja1105 driver, from Mao Wenan.

    7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel,
    Alex Vesker, and Yevgeny Kliteynik

    8) Missing CAP_NET_RAW checks in various places, from Ori Nimron.

    9) Fix crash when removing sch_cbs entry while offloading is enabled,
    from Vinicius Costa Gomes.

    10) Signedness bug fixes, generally in looking at the result given by
    of_get_phy_mode() and friends. From Dan Carpenter.

    11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet.

    12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern.

    13) Fix quantization code in tcp_bbr, from Kevin Yang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits)
    net: tap: clean up an indentation issue
    nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
    tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
    sk_buff: drop all skb extensions on free and skb scrubbing
    tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
    mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
    Documentation: Clarify trap's description
    mlxsw: spectrum: Clear VLAN filters during port initialization
    net: ena: clean up indentation issue
    NFC: st95hf: clean up indentation issue
    net: phy: micrel: add Asym Pause workaround for KSZ9021
    net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
    ptp: correctly disable flags on old ioctls
    lib: dimlib: fix help text typos
    net: dsa: microchip: Always set regmap stride to 1
    nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
    nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
    net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
    vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
    net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
    ...

    Linus Torvalds
     

26 Sep, 2019

2 commits

  • There is a statement that is indented one level too deeply, remove
    the extraneous tab.

    Signed-off-by: Colin Ian King
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20190925093835.19515-1-colin.king@canonical.com

    Colin Ian King
     
  • When kzalloc() failed, NULL was returned to the caller, which
    tested the pointer with IS_ERR(); since NULL is not an error pointer
    the test did not match, and the NULL pointer was dereferenced later.

    Return ERR_PTR(-ENOMEM) instead of NULL.

    Reported-by: syzbot+491c1b7565ba9069ecae@syzkaller.appspotmail.com
    Fixes: 0402acd683c6 ("xsk: remove AF_XDP socket from map when the socket is released")
    Signed-off-by: Jonathan Lemon
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Jonathan Lemon
     

25 Sep, 2019

1 commit

  • Pull more mount API conversions from Al Viro:
    "Assorted conversions of options parsing to new API.

    gfs2 is probably the most serious one here; the rest is trivial stuff.

    Other things in what used to be #work.mount are going to wait for the
    next cycle (and preferably go via git trees of the filesystems
    involved)"

    * 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    gfs2: Convert gfs2 to fs_context
    vfs: Convert spufs to use the new mount API
    vfs: Convert hypfs to use the new mount API
    hypfs: Fix error number left in struct pointer member
    vfs: Convert functionfs to use the new mount API
    vfs: Convert bpf to use the new mount API

    Linus Torvalds
     

19 Sep, 2019

2 commits

  • vmlinux BTF has enums that are 8 bytes and 1 byte in size.
    A 2-byte enum is a valid construct as well.
    Fix BTF enum verification to accept those sizes.

    Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • Convert the bpf filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    Signed-off-by: David Howells
    cc: Alexei Starovoitov
    cc: Daniel Borkmann
    cc: Martin KaFai Lau
    cc: Song Liu
    cc: Yonghong Song
    cc: netdev@vger.kernel.org
    cc: bpf@vger.kernel.org
    Signed-off-by: Al Viro

    David Howells
     

16 Sep, 2019

3 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2019-09-16

    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) Now that initial BPF backend for gcc has been merged upstream, enable
    BPF kselftest suite for bpf-gcc. Also fix a BE issue with access to
    bpf_sysctl.file_pos, from Ilya.

    2) Follow-up fix for link-vmlinux.sh to remove bash-specific extensions
    related to recent work on exposing BTF info through sysfs, from Andrii.

    3) AF_XDP zero copy fixes for i40e and ixgbe driver which caused umem
    headroom to be added twice, from Ciara.

    4) Refactoring work to convert sock opt tests into test_progs framework
    in BPF kselftests, from Stanislav.

    5) Fix a general protection fault in dev_map_hash_update_elem(), from Toke.

    6) Cleanup to use BPF_PROG_RUN() macro in KCM, from Sami.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
    "ctx:file_pos sysctl:read write ok" fails on s390 with "Read value !=
    nux". This is because the verifier rewrites a complete 32-bit
    bpf_sysctl.file_pos update into a partial update of the first 32 bits of
    the 64-bit *bpf_sysctl_kern.ppos, which is not correct on big-endian
    systems.

    Fix by using an offset on big-endian systems.

    Ditto for bpf_sysctl.file_pos reads. Currently the test does not detect
    a problem there, since it expects to see 0, which it gets with high
    probability in error cases, so change it to seek to offset 3 and expect
    3 in bpf_sysctl.file_pos.

    Fixes: e1550bfe0de4 ("bpf: Add file_pos field to bpf_sysctl ctx")
    Signed-off-by: Ilya Leoshkevich
    Acked-by: Yonghong Song
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20190816105300.49035-1-iii@linux.ibm.com/

    Ilya Leoshkevich
     
  • syzbot found a crash in dev_map_hash_update_elem(), when replacing an
    element with a new one. Jesper correctly identified the cause of the crash
    as a race condition between the initial lookup in the map (which is done
    before taking the lock), and the removal of the old element.

    Rather than just add a second lookup into the hashmap after taking the
    lock, fix this by reworking the function logic to take the lock before the
    initial lookup.

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-and-tested-by: syzbot+4e7a85b1432052e8d6f8@syzkaller.appspotmail.com
    Signed-off-by: Toke Høiland-Jørgensen
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Toke Høiland-Jørgensen
     

15 Sep, 2019

1 commit


06 Sep, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) Add the ability to use unaligned chunks in the AF_XDP umem. By
    relaxing where the chunks can be placed, it allows using an
    arbitrary buffer size and placing a chunk wherever there is a free
    address in the umem. Helps more seamless DPDK AF_XDP driver
    integration. Support for i40e, ixgbe and mlx5e, from Kevin and
    Maxim.

    2) Addition of a wakeup flag for AF_XDP tx and fill rings so the
    application can wake up the kernel for rx/tx processing, which
    avoids busy-spinning of the latter; useful when app and driver
    are located on the same core. Support for i40e, ixgbe and mlx5e,
    from Magnus and Maxim.

    3) bpftool fixes for printf()-like functions so compiler can actually
    enforce checks, bpftool build system improvements for custom output
    directories, and addition of 'bpftool map freeze' command, from Quentin.

    4) Support attaching/detaching XDP programs from 'bpftool net' command,
    from Daniel.

    5) Automatic xskmap cleanup when AF_XDP socket is released, and several
    barrier/{read,write}_once fixes in AF_XDP code, from Björn.

    6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf
    inclusion as well as libbpf versioning improvements, from Andrii.

    7) Several new BPF kselftests for verifier precision tracking, from Alexei.

    8) Several BPF kselftest fixes wrt endianness to run on s390x, from Ilya.

    9) And more BPF kselftest improvements all over the place, from Stanislav.

    10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub.

    11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan.

    12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni.

    13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin.

    14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar.

    15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari,
    Peter, Wei, Yue.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Sep, 2019

1 commit

  • The problem can be seen in the following two tests:
    0: (bf) r3 = r10
    1: (55) if r3 != 0x7b goto pc+0
    2: (7a) *(u64 *)(r3 -8) = 0
    3: (79) r4 = *(u64 *)(r10 -8)
    ..
    0: (85) call bpf_get_prandom_u32#7
    1: (bf) r3 = r10
    2: (55) if r3 != 0x7b goto pc+0
    3: (7b) *(u64 *)(r3 -8) = r0
    4: (79) r4 = *(u64 *)(r10 -8)

    When backtracking needs to mark R4 it will mark slot fp-8.
    But an ST or STX into fp-8 could belong to the same block of instructions.
    When backtracking is done the parent state may have the fp-8 slot
    as "unallocated stack", which will cause the verifier to warn
    and incorrectly reject such programs.

    Writes into the stack via a non-R10 register are rare; llvm always
    generates canonical stack spills/fills.
    For such pathological cases fall back to conservative precision
    tracking instead of rejecting.

    Reported-by: syzbot+c8d66267fd2b5955287e@syzkaller.appspotmail.com
    Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

03 Sep, 2019

1 commit


28 Aug, 2019

2 commits


27 Aug, 2019

1 commit

  • Since BPF constant blinding is performed after the verifier pass, the
    ALU32 instructions inserted for doubleword immediate loads don't have a
    corresponding zext instruction. This is causing a kernel oops on powerpc
    and can be reproduced by running 'test_cgroup_storage' with
    bpf_jit_harden=2.

    Fix this by emitting BPF_ZEXT during constant blinding if
    prog->aux->verifier_zext is set.

    Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result")
    Reported-by: Michael Ellerman
    Signed-off-by: Naveen N. Rao
    Reviewed-by: Jiong Wang
    Signed-off-by: Daniel Borkmann

    Naveen N. Rao
     

24 Aug, 2019

2 commits

  • syzkaller managed to trigger the warning in bpf_jit_free() which checks via
    bpf_prog_kallsyms_verify_off() for potentially unlinked JITed BPF progs
    in kallsyms, and subsequently trips over GPF when walking kallsyms entries:

    [...]
    8021q: adding VLAN 0 to HW filter on device batadv0
    8021q: adding VLAN 0 to HW filter on device batadv0
    WARNING: CPU: 0 PID: 9869 at kernel/bpf/core.c:810 bpf_jit_free+0x1e8/0x2a0
    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Workqueue: events bpf_prog_free_deferred
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x113/0x167 lib/dump_stack.c:113
    panic+0x212/0x40b kernel/panic.c:214
    __warn.cold.8+0x1b/0x38 kernel/panic.c:571
    report_bug+0x1a4/0x200 lib/bug.c:186
    fixup_bug arch/x86/kernel/traps.c:178 [inline]
    do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
    do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290
    invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
    RIP: 0010:bpf_jit_free+0x1e8/0x2a0
    Code: 02 4c 89 e2 83 e2 07 38 d0 7f 08 84 c0 0f 85 86 00 00 00 48 ba 00 02 00 00 00 00 ad de 0f b6 43 02 49 39 d6 0f 84 5f fe ff ff 0b e9 58 fe ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1
    RSP: 0018:ffff888092f67cd8 EFLAGS: 00010202
    RAX: 0000000000000007 RBX: ffffc90001947000 RCX: ffffffff816e9d88
    RDX: dead000000000200 RSI: 0000000000000008 RDI: ffff88808769f7f0
    RBP: ffff888092f67d00 R08: fffffbfff1394059 R09: fffffbfff1394058
    R10: fffffbfff1394058 R11: ffffffff89ca02c7 R12: ffffc90001947002
    R13: ffffc90001947020 R14: ffffffff881eca80 R15: ffff88808769f7e8
    BUG: unable to handle kernel paging request at fffffbfff400d000
    #PF error: [normal kernel read fault]
    PGD 21ffee067 P4D 21ffee067 PUD 21ffed067 PMD 9f942067 PTE 0
    Oops: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Workqueue: events bpf_prog_free_deferred
    RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:495 [inline]
    RIP: 0010:bpf_tree_comp kernel/bpf/core.c:558 [inline]
    RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
    RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
    RIP: 0010:bpf_prog_kallsyms_find+0x107/0x2e0 kernel/bpf/core.c:632
    Code: 00 f0 ff ff 44 38 c8 7f 08 84 c0 0f 85 fa 00 00 00 41 f6 45 02 01 75 02 0f 0b 48 39 da 0f 82 92 00 00 00 48 89 d8 48 c1 e8 03 0f b6 04 30 84 c0 74 08 3c 03 0f 8e 45 01 00 00 8b 03 48 c1 e0
    [...]

    Upon further debugging, it turns out that whenever we trigger this
    issue, the kallsyms removal in bpf_prog_ksym_node_del() was /skipped/,
    yet bpf_jit_free() reported that the entry is /in use/.

    Problem is that symbol exposure via bpf_prog_kallsyms_add() but also
    perf_event_bpf_event() were done /after/ bpf_prog_new_fd(). Once the
    fd is exposed to the public, a parallel close request came in right
    before we attempted to do the bpf_prog_kallsyms_add().

    Given at this time the prog reference count is one, we start to rip
    everything underneath us via bpf_prog_release() -> bpf_prog_put().
    The memory is eventually released via deferred free, so we're seeing
    that bpf_jit_free() has a kallsym entry because we added it from
    bpf_prog_load() but /after/ bpf_prog_put() from the remote CPU.

    Therefore, move both notifications /before/ we install the fd. The
    issue was never seen between bpf_prog_alloc_id() and bpf_prog_new_fd()
    because upon bpf_prog_get_fd_by_id() we'll take another reference to
    the BPF prog, so we're still holding the original reference from the
    bpf_prog_load().

    Fixes: 6ee52e2a3fe4 ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT")
    Fixes: 74451e66d516 ("bpf: make jited programs visible in traces")
    Reported-by: syzbot+bd3bba6ff3fcea7a6ec6@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Cc: Song Liu

    Daniel Borkmann
     
  • While adding extra tests for precision tracking and extra infra
    to adjust verifier heuristics the existing test
    "calls: cross frame pruning - liveness propagation" started to fail.
    The root cause is the same as described in verifer.c comment:

    * Also if parent's curframe > frame where backtracking started,
    * the verifier need to mark registers in both frames, otherwise callees
    * may incorrectly prune callers. This is similar to
    * commit 7640ead93924 ("bpf: verifier: make sure callees don't prune with caller differences")
    * For now backtracking falls back into conservative marking.

    It turned out, though, that returning -ENOTSUPP from backtrack_insn() and
    doing mark_all_scalars_precise() in the current parentage chain is not enough.
    Depending on how the is_state_visited() heuristic creates the parentage chain,
    it is possible that a callee will incorrectly prune a caller.
    Fix the issue by setting precise=true earlier and more aggressively.
    Before this fix the precision tracking _within_ functions that don't do
    bpf2bpf calls would still work. Whereas now precision tracking is completely
    disabled when bpf2bpf calls are present anywhere in the program.

    No difference in cilium tests (they don't have bpf2bpf calls).
    No difference in test_progs though some of them have bpf2bpf calls,
    but precision tracking wasn't effective there.

    Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

21 Aug, 2019

1 commit

  • Add a new command for the bpf() system call: BPF_BTF_GET_NEXT_ID is used
    to cycle through all BTF objects loaded on the system.

    The motivation is to be able to inspect (list) all BTF objects present
    on the system.

    Signed-off-by: Quentin Monnet
    Reviewed-by: Jakub Kicinski
    Signed-off-by: Alexei Starovoitov

    Quentin Monnet
     

20 Aug, 2019

2 commits


18 Aug, 2019

2 commits

  • The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
    an entry. This patch addresses that.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • When an AF_XDP socket is released/closed, the XSKMAP still holds a
    reference to the socket in a "released" state. The socket will still
    use the netdev queue resource and block newly created sockets from
    attaching to that queue, but no user application can access the
    fill/complete/rx/tx queues. As a result, every application needs
    to explicitly clear the map entry left by the old "zombie state"
    socket. This should be done automatically.

    In this patch, the socket tracks, and holds a reference to, the maps
    it resides in. When the socket is released, it removes itself from
    all of those maps.

    Suggested-by: Bruce Richardson
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel