03 Apr, 2019

1 commit

  • commit 0803278b0b4d8eeb2b461fb698785df65a725d9e upstream.

    Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.

    Call trace:

    dump_stack+0xbf/0x12e
    print_address_description+0x6a/0x280
    kasan_report+0x237/0x360
    sanitize_ptr_alu+0x85a/0x8d0
    adjust_ptr_min_max_vals+0x8f2/0x1ca0
    adjust_reg_min_max_vals+0x8ed/0x22e0
    do_check+0x1ca6/0x5d00
    bpf_check+0x9ca/0x2570
    bpf_prog_load+0xc91/0x1030
    __se_sys_bpf+0x61e/0x1f00
    do_syscall_64+0xc8/0x550
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fault injection trace:

     kfree+0xea/0x290
     free_func_state+0x4a/0x60
     free_verifier_state+0x61/0xe0
     push_stack+0x216/0x2f0
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Xu Yu
     

24 Mar, 2019

2 commits

  • [ Upstream commit 7c0cdf0b3940f63d9777c3fcf250a2f83859ca54 ]

    trie_delete_elem() was deleting an entry even though it did not match,
    as long as the prefixlen was correct. This patch adds a check on matchlen.

    Reproducer:

    $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
    $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
    $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
    key: 10 00 00 00 aa bb cc dd value: 01
    Found 1 element
    $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
    $ echo $?
    0
    $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
    Found 0 elements

    A similar reproducer is added in the selftests.

    Without the patch:

    $ sudo ./tools/testing/selftests/bpf/test_lpm_map
    test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
    Aborted

    With the patch: test_lpm_map runs without errors.
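    The same sequence can also be driven from a stand-alone C program; below is a
    minimal sketch of such a reproducer, assuming a pre-1.0 libbpf that still
    exports bpf_create_map() (newer libbpf uses bpf_map_create() instead) and
    using hypothetical prefixes. On a fixed kernel the delete with a
    non-matching key must fail with ENOENT.

    /* Hypothetical stand-alone reproducer mirroring the bpftool steps above.
     * Build sketch: gcc lpm_del.c -lbpf -o lpm_del && sudo ./lpm_del
     */
    #include <errno.h>
    #include <stdio.h>
    #include <assert.h>
    #include <arpa/inet.h>
    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    struct lpm_key {
            __u32 prefixlen;
            __u32 ipv4;          /* data part of the LPM key */
    };

    int main(void)
    {
            struct lpm_key key = {};
            __u8 value = 1;
            int fd, err;

            /* key 8, value 1, entries 128, flags BPF_F_NO_PREALLOC,
             * exactly as in the bpftool reproducer above. */
            fd = bpf_create_map(BPF_MAP_TYPE_LPM_TRIE, sizeof(key),
                                sizeof(value), 128, BPF_F_NO_PREALLOC);
            if (fd < 0) {
                    perror("bpf_create_map");
                    return 1;
            }

            key.prefixlen = 16;
            key.ipv4 = inet_addr("192.168.0.0");   /* hypothetical /16 entry */
            err = bpf_map_update_elem(fd, &key, &value, BPF_ANY);
            assert(!err);

            /* Same prefixlen, different data: must NOT delete the element. */
            key.ipv4 = inet_addr("10.0.0.0");
            err = bpf_map_delete_elem(fd, &key);
            printf("delete of non-matching key: err=%d errno=%d (expect failure with ENOENT)\n",
                   err, errno);
            return 0;
    }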

    Fixes: e454cf595853 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
    Cc: Craig Gallek
    Signed-off-by: Alban Crequy
    Acked-by: Craig Gallek
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alban Crequy
     
  • [ Upstream commit 3defaf2f15b2bfd86c6664181ac009e91985f8ac ]

    Lockdep warns about false positive:
    [ 11.211460] ------------[ cut here ]------------
    [ 11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
    [ 11.223874] ? __local_bh_enable+0x7a/0x80
    [ 11.224199] up_read+0x1c/0xa0
    [ 11.224446] do_up_read+0x12/0x20
    [ 11.224713] irq_work_run_list+0x43/0x70
    [ 11.225030] irq_work_run+0x26/0x50
    [ 11.225310] smp_irq_work_interrupt+0x57/0x1f0
    [ 11.225662] irq_work_interrupt+0xf/0x20

    since the rw_semaphore is released in a different task than the one that locked it.
    This is expected behavior.
    Fix the warning with up_read_non_owner() and rwsem_release() annotation.

    Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alexei Starovoitov
     

14 Mar, 2019

2 commits

  • [ Upstream commit 7c4cd051add3d00bbff008a133c936c515eaa8fe ]

    The map_lookup_elem() path used to not acquire the spinlock
    in order to optimize for readers.

    That was true until commit 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation").
    The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
    bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
    If that is the case, pcpu_freelist_push() saves this elem for reuse later.
    This push requires a spinlock.

    If a tracing bpf_prog got run in the middle of the syscall's
    map_lookup_elem(stackmap) and this tracing bpf_prog is calling
    bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
    spinlock, it may end up with a dead lock situation as reported by
    Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/

    The situation is the same as the syscall's map_update_elem() which
    needs to acquire the pcpu_freelist's spinlock and could race
    with tracing bpf_prog. Hence, this patch fixes it by protecting
    bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
    to prevent tracing bpf_prog from running.

    A later syscall map_lookup_elem path, added in commit f1a2e44a3aec ("bpf: add
    queue and stack maps"), also acquires a spinlock and races with tracing bpf_prog similarly.
    Hence, this patch is forward looking and protects the majority
    of the map lookups. bpf_map_offload_lookup_elem() is the exception
    since it is for network bpf_prog only (i.e. never called by tracing
    bpf_prog).
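    A rough sketch of the resulting syscall-side shape (not the verbatim patch;
    surrounding variables such as map, key, value and err are assumed):

    if (bpf_map_is_dev_bound(map)) {
            /* Offloaded lookup: network-only, never races with tracing progs. */
            err = bpf_map_offload_lookup_elem(map, key, value);
    } else {
            /* Keep tracing bpf_progs off this CPU while the copy below may
             * push a freed element back onto the pcpu_freelist (spinlock). */
            preempt_disable();
            this_cpu_inc(bpf_prog_active);
            err = bpf_stackmap_copy(map, key, value);
            this_cpu_dec(bpf_prog_active);
            preempt_enable();
    }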

    Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
    Reported-by: Eric Dumazet
    Acked-by: Alexei Starovoitov
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Martin KaFai Lau
     
  • [ Upstream commit a89fac57b5d080771efd4d71feaae19877cf68f0 ]

    Lockdep warns about false positive:
    [ 12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
    [ 12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [ 12.493275] (&rq->lock){-.-.}
    [ 12.493276]
    [ 12.493276]
    [ 12.493276] and interrupts could create inverse lock ordering between them.
    [ 12.493276]
    [ 12.494435]
    [ 12.494435] other info that might help us debug this:
    [ 12.494979] Possible interrupt unsafe locking scenario:
    [ 12.494979]
    [ 12.495518] CPU0 CPU1
    [ 12.495879] ---- ----
    [ 12.496243] lock(&head->lock);
    [ 12.496502] local_irq_disable();
    [ 12.496969] lock(&rq->lock);
    [ 12.497431] lock(&head->lock);
    [ 12.497890]
    [ 12.498104] lock(&rq->lock);
    [ 12.498368]
    [ 12.498368] *** DEADLOCK ***
    [ 12.498368]
    [ 12.498837] 1 lock held by dd/276:
    [ 12.499110] #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
    [ 12.499747]
    [ 12.499747] the shortest dependencies between 2nd lock and 1st lock:
    [ 12.500389] -> (&rq->lock){-.-.} {
    [ 12.500669] IN-HARDIRQ-W at:
    [ 12.500934] _raw_spin_lock+0x2f/0x40
    [ 12.501373] scheduler_tick+0x4c/0xf0
    [ 12.501812] update_process_times+0x40/0x50
    [ 12.502294] tick_periodic+0x27/0xb0
    [ 12.502723] tick_handle_periodic+0x1f/0x60
    [ 12.503203] timer_interrupt+0x11/0x20
    [ 12.503651] __handle_irq_event_percpu+0x43/0x2c0
    [ 12.504167] handle_irq_event_percpu+0x20/0x50
    [ 12.504674] handle_irq_event+0x37/0x60
    [ 12.505139] handle_level_irq+0xa7/0x120
    [ 12.505601] handle_irq+0xa1/0x150
    [ 12.506018] do_IRQ+0x77/0x140
    [ 12.506411] ret_from_intr+0x0/0x1d
    [ 12.506834] _raw_spin_unlock_irqrestore+0x53/0x60
    [ 12.507362] __setup_irq+0x481/0x730
    [ 12.507789] setup_irq+0x49/0x80
    [ 12.508195] hpet_time_init+0x21/0x32
    [ 12.508644] x86_late_time_init+0xb/0x16
    [ 12.509106] start_kernel+0x390/0x42a
    [ 12.509554] secondary_startup_64+0xa4/0xb0
    [ 12.510034] IN-SOFTIRQ-W at:
    [ 12.510305] _raw_spin_lock+0x2f/0x40
    [ 12.510772] try_to_wake_up+0x1c7/0x4e0
    [ 12.511220] swake_up_locked+0x20/0x40
    [ 12.511657] swake_up_one+0x1a/0x30
    [ 12.512070] rcu_process_callbacks+0xc5/0x650
    [ 12.512553] __do_softirq+0xe6/0x47b
    [ 12.512978] irq_exit+0xc3/0xd0
    [ 12.513372] smp_apic_timer_interrupt+0xa9/0x250
    [ 12.513876] apic_timer_interrupt+0xf/0x20
    [ 12.514343] default_idle+0x1c/0x170
    [ 12.514765] do_idle+0x199/0x240
    [ 12.515159] cpu_startup_entry+0x19/0x20
    [ 12.515614] start_kernel+0x422/0x42a
    [ 12.516045] secondary_startup_64+0xa4/0xb0
    [ 12.516521] INITIAL USE at:
    [ 12.516774] _raw_spin_lock_irqsave+0x38/0x50
    [ 12.517258] rq_attach_root+0x16/0xd0
    [ 12.517685] sched_init+0x2f2/0x3eb
    [ 12.518096] start_kernel+0x1fb/0x42a
    [ 12.518525] secondary_startup_64+0xa4/0xb0
    [ 12.518986] }
    [ 12.519132] ... key at: [] __key.71384+0x0/0x8
    [ 12.519649] ... acquired at:
    [ 12.519892] pcpu_freelist_pop+0x7b/0xd0
    [ 12.520221] bpf_get_stackid+0x1d2/0x4d0
    [ 12.520563] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.520887]
    [ 12.521008] -> (&head->lock){+...} {
    [ 12.521292] HARDIRQ-ON-W at:
    [ 12.521539] _raw_spin_lock+0x2f/0x40
    [ 12.521950] pcpu_freelist_push+0x2a/0x40
    [ 12.522396] bpf_get_stackid+0x494/0x4d0
    [ 12.522828] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.523296] INITIAL USE at:
    [ 12.523537] _raw_spin_lock+0x2f/0x40
    [ 12.523944] pcpu_freelist_populate+0xc0/0x120
    [ 12.524417] htab_map_alloc+0x405/0x500
    [ 12.524835] __do_sys_bpf+0x1a3/0x1a90
    [ 12.525253] do_syscall_64+0x4a/0x180
    [ 12.525659] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 12.526167] }
    [ 12.526311] ... key at: [] __key.13130+0x0/0x8
    [ 12.526812] ... acquired at:
    [ 12.527047] __lock_acquire+0x521/0x1350
    [ 12.527371] lock_acquire+0x98/0x190
    [ 12.527680] _raw_spin_lock+0x2f/0x40
    [ 12.527994] pcpu_freelist_push+0x2a/0x40
    [ 12.528325] bpf_get_stackid+0x494/0x4d0
    [ 12.528645] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.528970]
    [ 12.529092]
    [ 12.529092] stack backtrace:
    [ 12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
    [ 12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
    [ 12.530750] Call Trace:
    [ 12.530948] dump_stack+0x5f/0x8b
    [ 12.531248] check_usage_backwards+0x10c/0x120
    [ 12.531598] ? ___bpf_prog_run+0x8b4/0x11a0
    [ 12.531935] ? mark_lock+0x382/0x560
    [ 12.532229] mark_lock+0x382/0x560
    [ 12.532496] ? print_shortest_lock_dependencies+0x180/0x180
    [ 12.532928] __lock_acquire+0x521/0x1350
    [ 12.533271] ? find_get_entry+0x17f/0x2e0
    [ 12.533586] ? find_get_entry+0x19c/0x2e0
    [ 12.533902] ? lock_acquire+0x98/0x190
    [ 12.534196] lock_acquire+0x98/0x190
    [ 12.534482] ? pcpu_freelist_push+0x2a/0x40
    [ 12.534810] _raw_spin_lock+0x2f/0x40
    [ 12.535099] ? pcpu_freelist_push+0x2a/0x40
    [ 12.535432] pcpu_freelist_push+0x2a/0x40
    [ 12.535750] bpf_get_stackid+0x494/0x4d0
    [ 12.536062] ___bpf_prog_run+0x8b4/0x11a0

    It has been explained that it is a false positive here:
    https://lkml.org/lkml/2018/7/25/756
    Recap:
    - stackmap uses pcpu_freelist
    - The lock in pcpu_freelist is a percpu lock
    - stackmap is only used by tracing bpf_prog
    - A tracing bpf_prog cannot run if another bpf_prog
    is already running (ensured by the percpu bpf_prog_active counter).

    Eric pointed out that this lockdep splat stops other
    legitimate lockdep splats in selftests/bpf/test_progs.c.

    Fix this by calling local_irq_save/restore for stackmap.

    Another false positive had also been worked around by calling
    local_irq_save in commit 89ad2fa3f043 ("bpf: fix lockdep splat").
    That commit added unnecessary irq_save/restore to the fast path of
    the bpf hash map. IRQs are already disabled at that point, since htab
    holds a per-bucket spin_lock with irqsave.

    Let's reduce the overhead for htab by introducing __pcpu_freelist_push/pop
    functions without irqsave and converting pcpu_freelist_push/pop to irqsave
    variants to be used elsewhere (right now only in stackmap).
    This stops the lockdep false positive in stackmap with a bit of acceptable overhead.
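    A sketch of that split (assumed shape, not the exact patch): the
    double-underscore variant keeps the lock-only fast path for htab, while the
    irqsave wrapper is what stackmap now calls.

    /* Lock-only variant: callers such as htab already run with IRQs disabled. */
    void __pcpu_freelist_push(struct pcpu_freelist *s,
                              struct pcpu_freelist_node *node)
    {
            struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);

            raw_spin_lock(&head->lock);
            node->next = head->first;
            head->first = node;
            raw_spin_unlock(&head->lock);
    }

    /* IRQ-safe wrapper used elsewhere, e.g. from stackmap. */
    void pcpu_freelist_push(struct pcpu_freelist *s,
                            struct pcpu_freelist_node *node)
    {
            unsigned long flags;

            local_irq_save(flags);
            __pcpu_freelist_push(s, node);
            local_irq_restore(flags);
    }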

    Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
    Reported-by: Naresh Kamboju
    Reported-by: Eric Dumazet
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alexei Starovoitov
     

10 Mar, 2019

1 commit

  • commit 3612af783cf52c74a031a2f11b82247b2599d3cd upstream.

    Marek reported that he saw an issue with the below snippet in that
    timing measurements were off when loaded as unpriv, while results
    were reasonable when loaded as privileged:

    [...]
    uint64_t a = bpf_ktime_get_ns();
    uint64_t b = bpf_ktime_get_ns();
    uint64_t delta = b - a;
    if ((int64_t)delta > 0) {
    [...]

    Turns out there is a bug where a corner case is missing in the fix
    d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar
    type from different paths"), namely fixup_bpf_calls() only checks
    whether aux has a non-zero alu_state, but it also needs to test for
    the case of BPF_ALU_NON_POINTER, since in both cases we need to
    skip the masking rewrite (as there is nothing to mask).
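    A sketch of the shape of that check in fixup_bpf_calls() (not the literal diff):

    /* Skip the masking rewrite when there is nothing to mask: either no
     * alu_state was recorded at all, or the recorded state says the ALU op
     * never involved a pointer (BPF_ALU_NON_POINTER). */
    if (!aux->alu_state ||
        aux->alu_state == BPF_ALU_NON_POINTER)
            continue;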

    Fixes: d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
    Reported-by: Marek Majkowski
    Reported-by: Arthur Fabre
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

27 Feb, 2019

3 commits

  • [ Upstream commit 4af396ae4836c4ecab61e975b8e61270c551894d ]

    When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset,
    make sure that build_id field is empty. Since we are using percpu
    free list, there is a possibility that we might reuse some previous
    bpf_stack_build_id with non-zero build_id.

    Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
    Acked-by: Song Liu
    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Stanislav Fomichev
     
  • [ Upstream commit 0b698005a9d11c0e91141ec11a2c4918a129f703 ]

    Build-id length is not fixed to 20, it can be (`man ld` /--build-id):
    * 128-bit (uuid)
    * 160-bit (sha1)
    * any length specified in ld --build-id=0xhexstring

    To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids,
    assume that build-id is somewhere in the range of 1 .. 20.
    Set the remaining bytes to zero.
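    Written directly from the description above, the handling roughly becomes
    (note header and destination buffer names are assumptions):

    /* Accept any build-id between 1 and BPF_BUILD_ID_SIZE (20) bytes and
     * zero-pad the remainder so callers always see a fixed-size field. */
    if (nhdr->n_descsz == 0 ||
        nhdr->n_descsz > BPF_BUILD_ID_SIZE)
            return -EINVAL;

    memcpy(build_id, note_start + desc_off, nhdr->n_descsz);
    memset(build_id + nhdr->n_descsz, 0,
           BPF_BUILD_ID_SIZE - nhdr->n_descsz);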

    v2:
    * don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)",
    we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE

    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Stanislav Fomichev
     
  • [ Upstream commit beaf3d1901f4ea46fbd5c9d857227d99751de469 ]

    As Naresh reported, test_stacktrace_build_id() causes panic on i386 and
    arm32 systems. This is caused by page_address() returns NULL in certain
    cases.

    This patch fixes this error by using kmap_atomic/kunmap_atomic instead
    of page_address.

    Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
    Reported-by: Naresh Kamboju
    Signed-off-by: Song Liu
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Song Liu
     

31 Jan, 2019

12 commits

  • [ commit 9d5564ddcf2a0f5ba3fa1c3a1f8a1b59ad309553 upstream ]

    During review I noticed that inner meta map setup for map in
    map is buggy in that it does not propagate all needed data
    from the reference map which the verifier is later accessing.

    In particular one such case is index masking to prevent out of
    bounds access under speculative execution due to missing the
    map's unpriv_array/index_mask field propagation. Fix this such
    that the verifier is generating the correct code for inlined
    lookups in case of unprivileged use.

    Before patch (test_verifier's 'map in map access' dump):

    # bpftool prog dump xla id 3
    0: (62) *(u32 *)(r10 -4) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -4
    3: (18) r1 = map[id:4]
    5: (07) r1 += 272 |
    6: (61) r0 = *(u32 *)(r2 +0) |
    7: (35) if r0 >= 0x1 goto pc+6 | Inlined map in map lookup
    8: (54) (u32) r0 &= (u32) 0 | with index masking for
    9: (67) r0 <<= 3 | unpriv_array.
    10: (0f) r0 += r1 |
    11: (79) r0 = *(u64 *)(r0 +0) |
    12: (15) if r0 == 0x0 goto pc+1 |
    13: (05) goto pc+1 |
    14: (b7) r0 = 0 |
    15: (15) if r0 == 0x0 goto pc+11
    16: (62) *(u32 *)(r10 -4) = 0
    17: (bf) r2 = r10
    18: (07) r2 += -4
    19: (bf) r1 = r0
    20: (07) r1 += 272 |
    21: (61) r0 = *(u32 *)(r2 +0) | Index masking missing (!)
    22: (35) if r0 >= 0x1 goto pc+3 | for inner map despite
    23: (67) r0 <<= 3 | unpriv_array set.
    24: (0f) r0 += r1 |
    25: (05) goto pc+1 |
    26: (b7) r0 = 0 |
    27: (b7) r0 = 0
    28: (95) exit

    After patch:

    # bpftool prog dump xla id 1
    0: (62) *(u32 *)(r10 -4) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -4
    3: (18) r1 = map[id:2]
    5: (07) r1 += 272 |
    6: (61) r0 = *(u32 *)(r2 +0) |
    7: (35) if r0 >= 0x1 goto pc+6 | Same inlined map in map lookup
    8: (54) (u32) r0 &= (u32) 0 | with index masking due to
    9: (67) r0 <<= 3 | unpriv_array.
    10: (0f) r0 += r1 |
    11: (79) r0 = *(u64 *)(r0 +0) |
    12: (15) if r0 == 0x0 goto pc+1 |
    13: (05) goto pc+1 |
    14: (b7) r0 = 0 |
    15: (15) if r0 == 0x0 goto pc+12
    16: (62) *(u32 *)(r10 -4) = 0
    17: (bf) r2 = r10
    18: (07) r2 += -4
    19: (bf) r1 = r0
    20: (07) r1 += 272 |
    21: (61) r0 = *(u32 *)(r2 +0) |
    22: (35) if r0 >= 0x1 goto pc+4 | Now fixed inlined inner map
    23: (54) (u32) r0 &= (u32) 0 | lookup with proper index masking
    24: (67) r0 <<= 3 | unpriv_array.
    25: (0f) r0 += r1 |
    26: (05) goto pc+1 |
    27: (b7) r0 = 0 |
    28: (b7) r0 = 0
    29: (95) exit

    Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit d3bd7413e0ca40b60cf60d4003246d067cafdeda upstream ]

    While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer
    arithmetic") took care of rejecting alu op on pointer when e.g. pointer
    came from two different map values with different map properties such as
    value size, Jann reported that a case was not covered yet when a given
    alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
    different branches where we would incorrectly try to sanitize based
    on the pointer's limit. Catch this corner case and reject the program
    instead.

    Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
    Reported-by: Jann Horn
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 upstream ]

    Jann reported that the original commit back in b2157399cc98
    ("bpf: prevent out-of-bounds speculation") was not sufficient
    to stop CPU from speculating out of bounds memory access:
    While b2157399cc98 only focussed on masking array map access
    for unprivileged users for tail calls and data access such
    that the user provided index gets sanitized from BPF program
    and syscall side, there is still a more generic form affected
    from BPF programs that applies to most maps that hold user
    data in relation to dynamic map access when dealing with
    unknown scalars or "slow" known scalars as access offset, for
    example:

    - Load a map value pointer into R6
    - Load an index into R7
    - Do a slow computation (e.g. with a memory dependency) that
    loads a limit into R8 (e.g. load the limit from a map for
    high latency, then mask it to make the verifier happy)
    - Exit if R7 >= R8 (mispredicted branch)
    - Load R0 = R6[R7]
    - Load R0 = R6[R0]

    For unknown scalars there are two options in the BPF verifier
    where we could derive knowledge from in order to guarantee
    safe access to the memory: i) While <, >, <=, >= variants won't
    allow to derive any lower or upper bounds from the unknown
    scalar where it would be safe to add it to the map value
    pointer, it is possible through ==/!= test however. ii) another
    option is to transform the unknown scalar into a known scalar,
    for example, through ALU ops combination such as R &= <const>
    followed by R |= <const> or any similar combination where the
    original information from the unknown scalar would be destroyed
    entirely leaving R with a constant. The initial slow load still
    precedes the latter ALU ops on that register, so the CPU
    executes speculatively from that point. Once we have the known
    scalar, any compare operation would work then. A third option
    only involving registers with known scalars could be crafted
    as described in [0] where a CPU port (e.g. Slow Int unit)
    would be filled with many dependent computations such that
    the subsequent condition depending on its outcome has to wait
    for evaluation on its execution port and thereby executing
    speculatively if the speculated code can be scheduled on a
    different execution port, or any other form of mistraining
    as described in [1], for example. Given this is not limited
    to only unknown scalars, not only map but also stack access
    is affected since both is accessible for unprivileged users
    and could potentially be used for out of bounds access under
    speculation.

    In order to prevent any of these cases, the verifier is now
    sanitizing pointer arithmetic on the offset such that any
    out of bounds speculation would be masked in a way where the
    pointer arithmetic result in the destination register will
    stay unchanged, meaning offset masked into zero similar as
    in array_index_nospec() case. With regards to implementation,
    there are three options that were considered: i) new insn
    for sanitation, ii) push/pop insn and sanitation as inlined
    BPF, iii) reuse of ax register and sanitation as inlined BPF.

    Option i) has the downside that we end up using reserved
    bits from the opcode space, but also that we would require
    each JIT to emit masking as native arch opcodes meaning
    mitigation would have slow adoption till everyone implements
    it eventually which is counter-productive. Option ii) and iii)
    have both in common that a temporary register is needed in
    order to implement the sanitation as inlined BPF since we
    are not allowed to modify the source register. While a push /
    pop insn in ii) would be useful to have in any case, it
    requires once again that every JIT needs to implement it
    first. While possible, amount of changes needed would also
    be unsuitable for a -stable patch. Therefore, the path which
    has fewer changes, less BPF instructions for the mitigation
    and does not require anything to be changed in the JITs is
    option iii) which this work is pursuing. The ax register is
    already mapped to a register in all JITs (modulo arm32 where
    it's mapped to stack as various other BPF registers there)
    and used in constant blinding for JITs-only so far. It can
    be reused for verifier rewrites under certain constraints.
    The interpreter's tmp "register" has therefore been remapped
    into extending the register set with hidden ax register and
    reusing that for a number of instructions that needed the
    prior temporary variable internally (e.g. div, mod). This
    allows for zero increase in stack space usage in the interpreter,
    and enables (restricted) generic use in rewrites otherwise as
    long as such a patchlet does not make use of these instructions.
    The sanitation mask is dynamic and relative to the offset the
    map value or stack pointer currently holds.

    There are various cases that need to be taken under consideration
    for the masking, e.g. such operation could look as follows:
    ptr += val or val += ptr or ptr -= val. Thus, the value to be
    sanitized could reside either in source or in destination
    register, and the limit is different depending on whether
    the ALU op is addition or subtraction and depending on the
    current known and bounded offset. The limit is derived as
    follows: limit := max_value_size - (smin_value + off). For
    subtraction: limit := umax_value + off. This holds because
    we do not allow any pointer arithmetic that would
    temporarily go out of bounds or would have an unknown
    value with mixed signed bounds where it is unclear at
    verification time whether the actual runtime value would
    be either negative or positive. For example, we have a
    derived map pointer value with constant offset and bounded
    one, so limit based on smin_value works because the verifier
    requires that statically analyzed arithmetic on the pointer
    must be in bounds, and thus it checks if resulting
    smin_value + off and umax_value + off is still within map
    value bounds at time of arithmetic in addition to time of
    access. Similarly, for the case of stack access we derive
    the limit as follows: MAX_BPF_STACK + off for subtraction
    and -off for the case of addition where off := ptr_reg->off +
    ptr_reg->var_off.value. Subtraction is a special case for
    the masking which can be in form of ptr += -val, ptr -= -val,
    or ptr -= val. In the first two cases where we know that
    the value is negative, we need to temporarily negate the
    value in order to do the sanitation on a positive value
    where we later swap the ALU op, and restore original source
    register if the value was in source.

    The sanitation of pointer arithmetic alone is still not fully
    sufficient as is, since a scenario like the following could
    happen ...

    PTR += 0x1000 (e.g. K-based imm)
    PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
    PTR += 0x1000
    PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
    [...]

    ... which under speculation could end up as ...

    PTR += 0x1000
    PTR -= 0 [ truncated by mitigation ]
    PTR += 0x1000
    PTR -= 0 [ truncated by mitigation ]
    [...]

    ... and therefore still access out of bounds. To prevent such
    case, the verifier is also analyzing safety for potential out
    of bounds access under speculative execution. Meaning, it is
    also simulating pointer access under truncation. We therefore
    "branch off" and push the current verification state after the
    ALU operation with known 0 to the verification stack for later
    analysis. Given the current path analysis succeeded it is
    likely that the one under speculation can be pruned. In any
    case, it is also subject to existing complexity limits and
    therefore anything beyond this point will be rejected. In
    terms of pruning, it needs to be ensured that the verification
    state from speculative execution simulation must never prune
    a non-speculative execution path, therefore, we mark verifier
    state accordingly at the time of push_stack(). If verifier
    detects out of bounds access under speculative execution from
    one of the possible paths that includes a truncation, it will
    reject such program.

    Given we mask every reg-based pointer arithmetic for
    unprivileged programs, we've been looking into how it could
    affect real-world programs in terms of size increase. As the
    majority of programs are targeted for privileged-only use
    case, we've unconditionally enabled masking (with its alu
    restrictions on top of it) for privileged programs for the
    sake of testing in order to check i) whether they get rejected
    in its current form, and ii) by how much the number of
    instructions and size will increase. We've tested this by
    using Katran, Cilium and test_l4lb from the kernel selftests.
    For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
    and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
    we've used test_l4lb.o as well as test_l4lb_noinline.o. We
    found that none of the programs got rejected by the verifier
    with this change, and that impact is rather minimal to none.
    balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
    7,797 bytes JITed before and after the change. Most complex
    program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
    and 18,538 bytes JITed before and after and none of the other
    tail call programs in bpf_lxc.o had any changes either. For
    the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
    increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
    before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
    after the change. Other programs from that object file had
    similar small increase. Both test_l4lb.o had no change and
    remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
    JITed and for test_l4lb_noinline.o constant at 5,080 bytes
    (634 insns) xlated and 3,313 bytes JITed. This can be explained
    in that LLVM typically optimizes stack based pointer arithmetic
    by using K-based operations and that use of dynamic map access
    is not overly frequent. However, in future we may decide to
    optimize the algorithm further under known guarantees from
    branch and value speculation. Latter seems also unclear in
    terms of prediction heuristics that today's CPUs apply as well
    as whether there could be collisions in e.g. the predictor's
    Value History/Pattern Table for triggering out of bounds access,
    thus masking is performed unconditionally at this point but could
    be subject to relaxation later on. We were generally also
    brainstorming various other approaches for mitigation, but the
    blocker was always lack of available registers at runtime and/or
    overhead for runtime tracking of limits belonging to a specific
    pointer. Thus, we found this to be minimally intrusive under
    given constraints.

    With that in place, a simple example with sanitized access on
    unprivileged load at post-verification time looks as follows:

    # bpftool prog dump xlated id 282
    [...]
    28: (79) r1 = *(u64 *)(r7 +0)
    29: (79) r2 = *(u64 *)(r7 +8)
    30: (57) r1 &= 15
    31: (79) r3 = *(u64 *)(r0 +4608)
    32: (57) r3 &= 1
    33: (47) r3 |= 1
    34: (2d) if r2 > r3 goto pc+19
    35: (b4) (u32) r11 = (u32) 20479 |
    36: (1f) r11 -= r2 | Dynamic sanitation for pointer
    37: (4f) r11 |= r2 | arithmetic with registers
    38: (87) r11 = -r11 | containing bounded or known
    39: (c7) r11 s>>= 63 | scalars in order to prevent
    40: (5f) r11 &= r2 | out of bounds speculation.
    41: (0f) r4 += r11 |
    42: (71) r4 = *(u8 *)(r4 +0)
    43: (6f) r4 <<= r3
    [...]
    21: (5f) r2 &= r11
    22: (0f) r2 += r0
    23: (61) r0 = *(u32 *)(r2 +0)
    [...]
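    The r11 sequence in insns 35-41 above is plain two's-complement arithmetic
    and can be checked in isolation; a small user-space demonstration (limit
    20479 as in the dump, other values chosen arbitrarily):

    #include <stdint.h>
    #include <stdio.h>

    /* Mirrors insns 35-40: yields val unchanged if 0 <= val <= limit and 0
     * otherwise, without any conditional branch; insn 41 then adds the masked
     * value to the pointer. Arithmetic right shift of a negative value is
     * assumed (as with the BPF s>> and common compilers). */
    static uint64_t sanitize(uint64_t val, uint64_t limit)
    {
            uint64_t r11;

            r11 = limit;                 /* 35: r11 = limit  */
            r11 -= val;                  /* 36: r11 -= val   */
            r11 |= val;                  /* 37: r11 |= val   */
            r11 = -r11;                  /* 38: r11 = -r11   */
            r11 = (int64_t)r11 >> 63;    /* 39: r11 s>>= 63  */
            return r11 & val;            /* 40: r11 &= val   */
    }

    int main(void)
    {
            printf("%llu\n", (unsigned long long)sanitize(100, 20479));   /* 100   */
            printf("%llu\n", (unsigned long long)sanitize(20479, 20479)); /* 20479 */
            printf("%llu\n", (unsigned long long)sanitize(30000, 20479)); /* 0     */
            return 0;
    }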

    JIT blinding example with non-conflicting use of r10:

    [...]
    d5: je 0x0000000000000106 _
    d7: mov 0x0(%rax),%edi |
    da: mov $0xf153246,%r10d | Index load from map value and
    e0: xor $0xf153259,%r10 | (const blinded) mask with 0x1f.
    e7: and %r10,%rdi |_
    ea: mov $0x2f,%r10d |
    f0: sub %rdi,%r10 | Sanitized addition. Both use r10
    f3: or %rdi,%r10 | but do not interfere with each
    f6: neg %r10 | other. (Neither do these instructions
    f9: sar $0x3f,%r10 | interfere with the use of ax as temp
    fd: and %r10,%rdi | in interpreter.)
    100: add %rax,%rdi |_
    103: mov 0x0(%rdi),%eax
    [...]

    Tested that it fixes Jann's reproducer, and also checked that test_verifier
    and test_progs suite with interpreter, JIT and JIT with hardening enabled
    on x86-64 and arm64 runs successfully.

    [0] Speculose: Analyzing the Security Implications of Speculative
    Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
    https://arxiv.org/pdf/1801.04084.pdf

    [1] A Systematic Evaluation of Transient Execution Attacks and
    Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
    Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
    Dmitry Evtyushkin, Daniel Gruss,
    https://arxiv.org/pdf/1811.05441.pdf

    Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
    Reported-by: Jann Horn
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit b7137c4eab85c1cf3d46acdde90ce1163b28c873 upstream ]

    In check_map_access() we probe actual bounds through __check_map_access()
    with offset of reg->smin_value + off for lower bound and offset of
    reg->umax_value + off for the upper bound. However, even though the
    reg->smin_value could have a negative value, the final result of the
    sum with off could be positive when pointer arithmetic with known and
    unknown scalars is combined. In this case we reject the program with
    an error such as "R min value is negative, either use unsigned index
    or do a if (index >=0) check." even though the access itself would be
    fine. Therefore extend the check to probe whether the actual resulting
    reg->smin_value + off is less than zero.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit 9d7eceede769f90b66cfa06ad5b357140d5141ed upstream ]

    For unknown scalars of mixed signed bounds, meaning their smin_value is
    negative and their smax_value is positive, we need to reject arithmetic
    with pointer to map value. For unprivileged the goal is to mask every
    map pointer arithmetic and this cannot reliably be done when it is
    unknown at verification time whether the scalar value is negative or
    positive. Given this is a corner case, the likelihood of breaking should
    be very small.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit e4298d25830a866cc0f427d4bccb858e76715859 upstream ]

    Restrict stack pointer arithmetic for unprivileged users in that
    arithmetic itself must not go out of bounds as opposed to the actual
    access later on. Therefore after each adjust_ptr_min_max_vals() with
    a stack pointer as a destination we simulate a check_stack_access()
    of 1 byte on the destination and once that fails the program is
    rejected for unprivileged program loads. This is analog to map
    value pointer arithmetic and needed for masking later on.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit 0d6303db7970e6f56ae700fa07e11eb510cda125 upstream ]

    Restrict map value pointer arithmetic for unprivileged users in that
    arithmetic itself must not go out of bounds as opposed to the actual
    access later on. Therefore after each adjust_ptr_min_max_vals() with a
    map value pointer as a destination it will simulate a check_map_access()
    of 1 byte on the destination and once that fails the program is rejected
    for unprivileged program loads. We use this later on for masking any
    pointer arithmetic with the remainder of the map value space. The
    likelihood of breaking any existing real-world unprivileged eBPF
    program is very small for this corner case.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 upstream ]

    Right now we are using BPF ax register in JIT for constant blinding as
    well as in interpreter as temporary variable. Verifier will not be able
    to use it simply because its use will get overridden from the former in
    bpf_jit_blind_insn(). However, it can be made to work in that blinding
    will be skipped if there is prior use in either source or destination
    register on the instruction. Taking constraints of ax into account, the
    verifier is then open to use it in rewrites under some constraints. Note,
    ax register already has mappings in every eBPF JIT.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit 144cd91c4c2bced6eb8a7e25e590f6618a11e854 upstream ]

    This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
    into the hidden ax register. The latter is currently only used in JITs
    for constant blinding as a temporary scratch register, meaning the BPF
    interpreter will never see the use of ax. Therefore it is safe to use
    it for the cases where tmp has been used earlier. This is needed to later
    on allow restricted hidden use of ax in both interpreter and JITs.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit c08435ec7f2bc8f4109401f696fd55159b4b40cb upstream ]

    Move prev_insn_idx and insn_idx from the do_check() function into
    the verifier environment, so they can be read inside the various
    helper functions for handling the instructions. It's easier to put
    this into the environment rather than changing all call-sites only
    to pass it along. insn_idx is useful in particular since this later
    on allows to hold state in env->insn_aux_data[env->insn_idx].

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ commit ceefbc96fa5c5b975d87bf8e89ba8416f6b764d9 upstream ]

    A malicious bpf program may try to force the verifier to remember
    a lot of distinct verifier states.
    Put a limit on the number of per-insn 'struct bpf_verifier_state'.
    Note that hitting the limit doesn't reject the program.
    It potentially makes the verifier do more steps to analyze the program.
    It means that malicious programs will hit BPF_COMPLEXITY_LIMIT_INSNS sooner
    instead of spending cpu time walking a long linked list.

    The limit of BPF_COMPLEXITY_LIMIT_STATES==64 affects cilium progs
    with slight increase in number of "steps" it takes to successfully verify
    the programs:
                            before   after
    bpf_lb-DLB_L3.o           1940    1940
    bpf_lb-DLB_L4.o           3089    3089
    bpf_lb-DUNKNOWN.o         1065    1065
    bpf_lxc-DDROP_ALL.o      28052 | 28162
    bpf_lxc-DUNKNOWN.o       35487 | 35541
    bpf_netdev.o             10864   10864
    bpf_overlay.o             6643    6643
    bpf_lcx_jit.o            38437   38437

    But it also makes the malicious program get rejected in 0.4 seconds vs 6.5.
    Hence apply this limit to unprivileged programs only.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Edward Cree
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alexei Starovoitov
     
  • [ commit 4f7b3e82589e0de723780198ec7983e427144c0a upstream ]

    Pathological bpf programs may try to force the verifier to explode in
    the number of branch states:
    20: (d5) if r1 s< 0x2100ecf4 goto pc+0
    25: (d5) if r1 s< 0xf4041810 goto pc+0
    27: (d5) if r1 s< 0x6d0020da goto pc+0
    31: (35) if r0 >= 0x2100ecf4 goto pc+0

    Teach the verifier to recognize always-taken and always-not-taken branches.
    This analysis is already done for == and != comparisons.
    Expand it to all other branches.

    It also helps real bpf programs to be verified faster:
                            before   after
    bpf_lb-DLB_L3.o           2003    1940
    bpf_lb-DLB_L4.o           3173    3089
    bpf_lb-DUNKNOWN.o         1080    1065
    bpf_lxc-DDROP_ALL.o      29584   28052
    bpf_lxc-DUNKNOWN.o       36916   35487
    bpf_netdev.o             11188   10864
    bpf_overlay.o             6679    6643
    bpf_lcx_jit.o            39555   38437
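    The always-taken / never-taken decision described above only needs the
    tracked value bounds of the register; a small stand-alone illustration of
    the rule, simplified to an unsigned 'if reg >= K' (BPF_JGE) against an
    immediate:

    #include <stdint.h>
    #include <stdio.h>

    /* 1 = always taken, 0 = never taken, -1 = unknown at verification time. */
    static int branch_taken_jge(uint64_t umin, uint64_t umax, uint64_t k)
    {
            if (umin >= k)
                    return 1;   /* even the smallest possible value passes */
            if (umax < k)
                    return 0;   /* even the largest possible value fails   */
            return -1;          /* both outcomes possible: explore both    */
    }

    int main(void)
    {
            /* e.g. '31: (35) if r0 >= 0x2100ecf4 goto pc+0' with r0 known to
             * be in [0, 0xffff]: never taken, no extra branch state needed. */
            printf("%d\n", branch_taken_jge(0, 0xffff, 0x2100ecf4));              /*  0 */
            printf("%d\n", branch_taken_jge(0x30000000, UINT64_MAX, 0x2100ecf4)); /*  1 */
            printf("%d\n", branch_taken_jge(0, UINT64_MAX, 0x2100ecf4));          /* -1 */
            return 0;
    }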

    Reported-by: Anatoly Trosinenko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Edward Cree
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alexei Starovoitov
     

26 Jan, 2019

2 commits

  • [ Upstream commit e434b8cdf788568ba65a0a0fd9f3cb41f3ca1803 ]

    Currently, the destination register is marked as unknown for 32-bit
    sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is
    SCALAR_VALUE.

    This is so conservative that some valid cases will be rejected.
    In particular, this may turn a constant scalar value into an unknown value,
    which could break some assumptions of the verifier.

    For example, test_l4lb_noinline.c has the following C code:

    struct real_definition *dst

    1: if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
    2: return TC_ACT_SHOT;
    3:
    4: if (dst->flags & F_IPV6) {

    get_packet_dst is responsible for initializing "dst" into valid pointer and
    return true (1), otherwise return false (0). The compiled instruction
    sequence using alu32 will be:

    412: (54) (u32) r7 &= (u32) 1
    413: (bc) (u32) r0 = (u32) r7
    414: (95) exit

    Insn 413, a BPF_MOV | BPF_ALU, however, will turn r0 into an unknown value
    even though r7 contains SCALAR_VALUE 1.

    This causes trouble when the verifier is walking the code path that hasn't
    initialized "dst" inside get_packet_dst, in which case 0 is returned and we
    would then expect the verifier to conclude that line 1 in the above C code
    passes the "if" check and therefore skip the fall-through path starting at
    line 4. Now, because r0 returned from the callee has become an unknown value,
    the verifier won't skip analyzing the path starting at line 4, and "dst->flags"
    requires dereferencing the pointer "dst" which actually hasn't been
    initialized for this path.

    This patch relaxes the code marking the sub-register move destination. For a
    SCALAR_VALUE, it is safe to just copy the value from the source and then
    truncate it to 32 bits.
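    A sketch of the relaxed handling of the 32-bit BPF_MOV (helper names follow
    the verifier's usual naming; this is not the literal patch):

    if (src_reg->type == SCALAR_VALUE) {
            /* Safe for scalars: keep the known value/bounds of the source
             * and only model the 32-bit truncation of the move. */
            *dst_reg = *src_reg;
            coerce_reg_to_size(dst_reg, 4);
    } else {
            /* Non-scalar source: destination stays unknown as before. */
            mark_reg_unknown(env, regs, insn->dst_reg);
    }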

    A unit test is also included to demonstrate this issue. The test fails
    without this patch.

    This relaxation could let the verifier skip more paths for conditional
    comparison against an immediate. It also lets the verifier record a more
    accurate/strict value for one register at one state; if this state ends up
    going through exit without rejection and is used for state comparison later,
    then it is possible that an inaccurate/permissive value is better. So the
    real impact on the number of verifier-processed insns is complex. But all
    in all, without this fix, valid programs could be rejected.

    From real benchmarking on kernel selftests and Cilium bpf tests, there is
    no impact on the processed instruction number when tests are compiled with
    default compilation options. There are slight improvements when they are
    compiled with -mattr=+alu32 after this patch.

    Also, test_xdp_noinline/-mattr=+alu32 now passed verification. It is
    rejected before this fix.

    Insn processed before/after this patch:

                          default       -mattr=+alu32

    Kernel selftest
    ===
    test_xdp.o            371/371       369/369
    test_l4lb.o           6345/6345     5623/5623
    test_xdp_noinline.o   2971/2971     rejected/2727
    test_tcp_estates.o    429/429       430/430

    Cilium bpf
    ===
    bpf_lb-DLB_L3.o:      2085/2085     1685/1687
    bpf_lb-DLB_L4.o:      2287/2287     1986/1982
    bpf_lb-DUNKNOWN.o:    690/690       622/622
    bpf_lxc.o:            95033/95033   N/A
    bpf_netdev.o:         7245/7245     N/A
    bpf_overlay.o:        2898/2898     3085/2947

    NOTE:
    - bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by
    verifier due to another issue inside verifier on supporting alu32
    binary.
    - Each cilium bpf program could generate several processed insn number,
    above number is sum of them.

    v1->v2:
    - Restrict the change on SCALAR_VALUE.
    - Update benchmark numbers on Cilium bpf tests.

    Signed-off-by: Jiong Wang
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    Jiong Wang
     
  • [ Upstream commit 46f53a65d2de3e1591636c22b626b09d8684fd71 ]

    Currently BPF verifier allows narrow loads for a context field only with
    offset zero. E.g. if there is a __u32 field then only the following
    loads are permitted:
    * off=0, size=1 (narrow);
    * off=0, size=2 (narrow);
    * off=0, size=4 (full).

    On the other hand LLVM can generate a load with offset different than
    zero that make sense from program logic point of view, but verifier
    doesn't accept it.

    E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:

    #define DST_IP4 0xC0A801FEU /* 192.168.1.254 */
    ...
    if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&

    where ctx is struct bpf_sock_addr.

    Some versions of LLVM can produce the following byte code for it:

    8: 71 12 07 00 00 00 00 00 r2 = *(u8 *)(r1 + 7)
    9: 67 02 00 00 18 00 00 00 r2 <<= 24

    where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1
    and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
    rejected by verifier.

    Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok()
    what means any is_valid_access implementation, that uses the function,
    works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or
    sock_addr_is_valid_access() for bpf_sock_addr.

    The patch makes such loads supported. Offset can be in [0; size_default)
    but has to be multiple of load size. E.g. for __u32 field the following
    loads are supported now:
    * off=0, size=1 (narrow);
    * off=1, size=1 (narrow);
    * off=2, size=1 (narrow);
    * off=3, size=1 (narrow);
    * off=0, size=2 (narrow);
    * off=2, size=2 (narrow);
    * off=0, size=4 (full).
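    The new rule, offset within [0; size_default) and a multiple of the load
    size, is easy to state as a predicate; a small stand-alone check that
    reproduces the list above for a 4-byte field:

    #include <stdio.h>

    /* Simplified model of the relaxed narrow-load rule for a context field
     * whose full size is size_default bytes (4 for a __u32 field). */
    static int narrow_load_ok(unsigned int off, unsigned int size,
                              unsigned int size_default)
    {
            return off < size_default &&            /* stay inside the field   */
                   off + size <= size_default &&
                   off % size == 0;                 /* offset multiple of size */
    }

    int main(void)
    {
            unsigned int off, size;

            for (size = 1; size <= 4; size *= 2)
                    for (off = 0; off < 4; off++)
                            if (narrow_load_ok(off, size, 4))
                                    printf("off=%u, size=%u\n", off, size);
            return 0;
    }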

    Reported-by: Yonghong Song
    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    Andrey Ignatov
     

21 Dec, 2018

1 commit

  • [ Upstream commit c3494801cd1785e2c25f1a5735fa19ddcf9665da ]

    Malicious user space may try to force the verifier to use as much cpu
    time and memory as possible. Hence check for pending signals
    while verifying the program.
    Note that suspend of sys_bpf(PROG_LOAD) syscall will lead to EAGAIN,
    since the kernel has to release the resources used for program verification.
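    A minimal sketch of such a check inside the verifier's main loop (placement
    assumed):

    /* Bail out if userspace signalled us; sys_bpf(PROG_LOAD) then returns
     * -EAGAIN and the load can simply be retried. */
    if (signal_pending(current))
            return -EAGAIN;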

    Reported-by: Anatoly Trosinenko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Edward Cree
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alexei Starovoitov
     

17 Dec, 2018

2 commits

  • commit afd594240806acc138cf696c09f2f4829d55d02f upstream.

    When patching in a new sequence for the first insn of a subprog, the start
    of that subprog does not change (it's the first insn of the sequence), so
    adjust_subprog_starts should check start <= off (rather than start < off).
    Also added a test to test_verifier.c (it's essentially the syz reproducer).

    Fixes: cc8b0b92a169 ("bpf: introduce function calls (function boundaries)")
    Reported-by: syzbot+4fc427c7af994b0948be@syzkaller.appspotmail.com
    Signed-off-by: Edward Cree
    Acked-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Edward Cree
     
  • [ Upstream commit 569a933b03f3c48b392fe67c0086b3a6b9306b5a ]

    Naresh reported an issue with the non-atomic memory allocation of
    cgroup local storage buffers:

    [ 73.047526] BUG: sleeping function called from invalid context at
    /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
    [ 73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name: test_cgroup_sto
    [ 73.068342] INFO: lockdep is turned off.
    [ 73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted
    4.20.0-rc2-next-20181113 #1
    [ 73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
    2.0b 07/27/2017
    [ 73.088018] Call Trace:
    [ 73.090463] dump_stack+0x70/0xa5
    [ 73.093783] ___might_sleep+0x152/0x240
    [ 73.097619] __might_sleep+0x4a/0x80
    [ 73.101191] __kmalloc_node+0x1cf/0x2f0
    [ 73.105031] ? cgroup_storage_update_elem+0x46/0x90
    [ 73.109909] cgroup_storage_update_elem+0x46/0x90

    cgroup_storage_update_elem() (as well as other map update
    callbacks) is called with preemption disabled, so a GFP_ATOMIC
    allocation should be used, as e.g. alloc_htab_elem() in hashtab.c does.

    Reported-by: Naresh Kamboju
    Tested-by: Naresh Kamboju
    Signed-off-by: Roman Gushchin
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    Roman Gushchin
     

27 Nov, 2018

1 commit

  • [ Upstream commit 28c2fae726bf5003cd209b0d5910a642af98316f ]

    While dbecd7388476 ("bpf: get kernel symbol addresses via syscall")
    zeroed info.nr_jited_ksyms in bpf_prog_get_info_by_fd() for queries
    from unprivileged users, commit 815581c11cc2 ("bpf: get JITed image
    lengths of functions via syscall") forgot about doing so and therefore
    returns the #elems of the user set up buffer which is incorrect. It
    also needs to indicate a info.nr_jited_func_lens of zero.

    Fixes: 815581c11cc2 ("bpf: get JITed image lengths of functions via syscall")
    Signed-off-by: Daniel Borkmann
    Cc: Sandipan Das
    Cc: Song Liu
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     

14 Nov, 2018

3 commits

  • commit 1ae80cf31938c8f77c37a29bbe29e7f1cd492be8 upstream.

    The map-in-map frequently serves as a mechanism for atomic
    snapshotting of state that a BPF program might record. The current
    implementation is dangerous to use in this way, however, since
    userspace has no way of knowing when all programs that might have
    retrieved the "old" value of the map may have completed.

    This change ensures that map update operations on map-in-map map types
    always wait for all references to the old map to drop before returning
    to userspace.
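    The waiting step can be sketched roughly as follows (shape assumed, not the
    verbatim patch), called at the end of the update/delete syscall paths:

    static void maybe_wait_bpf_programs(struct bpf_map *map)
    {
            /* BPF programs access inner maps under rcu_read_lock(); once a
             * grace period has elapsed, no program can still be using the
             * old inner map. */
            if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
                map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
                    synchronize_rcu();
    }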

    Signed-off-by: Daniel Colascione
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Chenbo Feng
    Signed-off-by: Greg Kroah-Hartman

    Daniel Colascione
     
  • [ Upstream commit a9c676bc8fc58d00eea9836fb14ee43c0346416a ]

    Edward Cree says:
    In check_mem_access(), for the PTR_TO_CTX case, after check_ctx_access()
    has supplied a reg_type, the other members of the register state are set
    appropriately. Previously reg.range was set to 0, but as it is in a
    union with reg.map_ptr, which is larger, upper bytes of the latter were
    left in place. This then caused the memcmp() in regsafe() to fail,
    preventing some branches from being pruned (and occasionally causing the
    same program to take a varying number of processed insns on repeated
    verifier runs).

    Fix the instability by clearing bpf_reg_state in __mark_reg_[un]known()

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Debugged-by: Edward Cree
    Acked-by: Edward Cree
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • commit 0962590e553331db2cc0aef2dc35c57f6300dbbe upstream.

    ALU operations on pointers such as scalar_reg += map_value_ptr are
    handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr
    and range in the register state share a union, so transferring state
    through dst_reg->range = ptr_reg->range is just buggy as any new
    map_ptr in the dst_reg is then truncated (or null) for subsequent
    checks. Fix this by adding a raw member and use it for copying state
    over to dst_reg.

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Signed-off-by: Daniel Borkmann
    Cc: Edward Cree
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Acked-by: Edward Cree
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     

11 Oct, 2018

1 commit

  • The XSKMAP update and delete functions called synchronize_net(), which
    can sleep. It is not allowed to sleep during an RCU read section.

    Instead we need to make sure that the sock sk_destruct (xsk_destruct)
    function is asynchronously called after an RCU grace period. Setting
    the SOCK_RCU_FREE flag for XDP sockets takes care of this.

    Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
    Reported-by: Eric Dumazet
    Signed-off-by: Björn Töpel
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

06 Oct, 2018

1 commit

  • When I wrote commit 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification"), I
    assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it
    is sufficient to just truncate the output to 32 bits; and so I just moved
    the register size coercion that used to be at the start of the function to
    the end of the function.

    That assumption is true for almost every op, but not for 32-bit right
    shifts, because those can propagate information towards the least
    significant bit. Fix it by always truncating inputs for 32-bit ops to 32
    bits.

    Also get rid of the coerce_reg_to_size() after the ALU op, since that has
    no effect.
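    The right-shift corner case is easy to see with concrete numbers; a small
    user-space illustration of why truncating only the output is not enough
    (simplified model, not verifier code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t reg = 0x1ffffffffULL;  /* upper bits stale from 64-bit math */

            /* Wrong emulation of a 32-bit 'reg >>= 1': shift first, truncate
             * the output afterwards. The stale bit 32 leaks into bit 31. */
            uint32_t wrong = (uint32_t)(reg >> 1);

            /* Correct emulation: truncate the input to 32 bits first. */
            uint32_t right = (uint32_t)((uint64_t)(uint32_t)reg >> 1);

            printf("wrong = 0x%08x\n", wrong);  /* 0xffffffff */
            printf("right = 0x%08x\n", right);  /* 0x7fffffff */
            return 0;
    }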

    Fixes: 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification")
    Acked-by: Daniel Borkmann
    Signed-off-by: Jann Horn
    Signed-off-by: Daniel Borkmann

    Jann Horn
     

02 Oct, 2018

1 commit


28 Sep, 2018

1 commit

    cgroup_storage_update_elem() shouldn't accept any flags
    argument values except BPF_ANY and BPF_EXIST, to guarantee
    backward compatibility should a new flag value be added.

    Fixes: de9cbbaadba5 ("bpf: introduce cgroup storage maps")
    Signed-off-by: Roman Gushchin
    Reported-by: Daniel Borkmann
    Cc: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Roman Gushchin
     

22 Sep, 2018

2 commits

    It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
    state via tcp_disconnect() without actually calling tcp_close, which
    would then call our bpf_tcp_close() callback. Because of this a user
    could disconnect a socket and then put it in a LISTEN state, which would
    break our assumptions about sockets always being in ESTABLISHED state.

    To resolve this rely on the unhash hook, which is called in the
    disconnect case, to remove the sock from the sockmap.

    Reported-by: Eric Dumazet
    Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
    Signed-off-by: John Fastabend
    Acked-by: Yonghong Song
    Signed-off-by: Daniel Borkmann

    John Fastabend
     
  • After this patch we only allow socks that are in ESTABLISHED state or
    are being added via a sock_ops event that is transitioning into an
    ESTABLISHED state. By allowing sock_ops events we allow users to
    manage sockmaps directly from sock ops programs. The two supported
    sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
    BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.

    Similar to TLS ULP this ensures sk_user_data is correct.
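    A minimal sock_ops program of the kind described above might look as follows
    (sketch only; the section names, the legacy bpf_map_def layout and the header
    paths are assumptions modelled on the sockmap selftests):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct bpf_map_def SEC("maps") sock_map = {
            .type        = BPF_MAP_TYPE_SOCKMAP,
            .key_size    = sizeof(int),
            .value_size  = sizeof(int),
            .max_entries = 1,
    };

    SEC("sockops")
    int add_established(struct bpf_sock_ops *skops)
    {
            int key = 0;

            switch (skops->op) {
            case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
            case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
                    /* Socket is (transitioning to) ESTABLISHED, so this
                     * update is accepted after the change above. */
                    bpf_sock_map_update(skops, &sock_map, &key, BPF_ANY);
                    break;
            }
            return 0;
    }

    char _license[] SEC("license") = "GPL";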

    Reported-by: Eric Dumazet
    Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
    Signed-off-by: John Fastabend
    Acked-by: Yonghong Song
    Signed-off-by: Daniel Borkmann

    John Fastabend
     

13 Sep, 2018

2 commits

  • Subtraction of pointers was accidentally allowed for unpriv programs
    by commit 82abbf8d2fc4. Revert that part of commit.

    Fixes: 82abbf8d2fc4 ("bpf: do not allow root to mangle valid pointers")
    Reported-by: Jann Horn
    Acked-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • The end boundary math for type section is incorrect in
    btf_check_all_metas(). It just happens that hdr->type_off
    is always 0 for now because there are only two sections
    (type and string) and string section must be at the end (ensured
    in btf_parse_str_sec).

    However, type_off may not be 0 if a new section would be added later.
    This patch fixes it.

    Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Martin KaFai Lau
    Acked-by: Yonghong Song
    Signed-off-by: Daniel Borkmann

    Martin KaFai Lau
     

03 Sep, 2018

1 commit

  • Currently we check sk_user_data is non NULL to determine if the sk
    exists in a map. However, this is not sufficient to ensure the psock
    or the ULP ops are not in use by another user, such as kcm or TLS. To
    avoid this when adding a sock to a map also verify it is of the
    correct ULP type. Additionally, when releasing a psock verify that
    it is the TCP_ULP_BPF type before releasing the ULP. The error case
    where we abort an update due to ULP collision can cause this error
    path.

    For example,

    __sock_map_ctx_update_elem()
    [...]
    err = tcp_set_ulp_id(sock, TCP_ULP_BPF)
    Signed-off-by: Daniel Borkmann

    John Fastabend
     

28 Aug, 2018

1 commit

  • Currently, when a redirect occurs in sockmap and an error occurs in
    the redirect call we unwind the scatterlist once in the error path
    of bpf_tcp_sendmsg_do_redirect() and then again in sendmsg(). Then
    in the error path of sendmsg we decrement the copied count by the
    send size.

    However, it's possible we partially sent data before the error was
    generated. This can happen if do_tcp_sendpages() partially sends the
    scatterlist before encountering a memory pressure error. If this
    happens we need to decrement the copied value (the value tracking
    how many bytes were actually sent to TCP stack) by the number of
    remaining bytes _not_ the entire send size. Otherwise we risk
    confusing userspace.

    Also we don't need two calls to free the scatterlist; one is
    good enough. So remove the one in bpf_tcp_sendmsg_do_redirect() and
    then properly reduce copied by the number of remaining bytes which
    may in fact be the entire send size if no bytes were sent.

    To do this use bool to indicate if free_start_sg() should do mem
    accounting or not.

    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    John Fastabend