08 Sep, 2022

1 commit

  • [ Upstream commit 7d6620f107bae6ed687ff07668e8e8f855487aa9 ]

    Syzkaller reported a triggered kernel BUG as follows:

    ------------[ cut here ]------------
    kernel BUG at kernel/bpf/cgroup.c:925!
    invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 1 PID: 194 Comm: detach Not tainted 5.19.0-14184-g69dac8e431af #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    RIP: 0010:__cgroup_bpf_detach+0x1f2/0x2a0
    Code: 00 e8 92 60 30 00 84 c0 75 d8 4c 89 e0 31 f6 85 f6 74 19 42 f6 84
    28 48 05 00 00 02 75 0e 48 8b 80 c0 00 00 00 48 85 c0 75 e5 0b 48
    8b 0c5
    RSP: 0018:ffffc9000055bdb0 EFLAGS: 00000246
    RAX: 0000000000000000 RBX: ffff888100ec0800 RCX: ffffc900000f1000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888100ec4578
    RBP: 0000000000000000 R08: ffff888100ec0800 R09: 0000000000000040
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff888100ec4000
    R13: 000000000000000d R14: ffffc90000199000 R15: ffff888100effb00
    FS: 00007f68213d2b80(0000) GS:ffff88813bc80000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f74a0e5850 CR3: 0000000102836000 CR4: 00000000000006e0
    Call Trace:

    cgroup_bpf_prog_detach+0xcc/0x100
    __sys_bpf+0x2273/0x2a00
    __x64_sys_bpf+0x17/0x20
    do_syscall_64+0x3b/0x90
    entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f68214dbcb9
    Code: 08 44 89 e0 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 48 89 f8 48 89
    f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01
    f0 ff8
    RSP: 002b:00007ffeb487db68 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
    RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f68214dbcb9
    RDX: 0000000000000090 RSI: 00007ffeb487db70 RDI: 0000000000000009
    RBP: 0000000000000003 R08: 0000000000000012 R09: 0000000b00000003
    R10: 00007ffeb487db70 R11: 0000000000000246 R12: 00007ffeb487dc20
    R13: 0000000000000004 R14: 0000000000000001 R15: 000055f74a1011b0

    Modules linked in:
    ---[ end trace 0000000000000000 ]---

    Steps to reproduce:

    For the following cgroup tree,

    root
    |
    cg1
    |
    cg2

    1. attach prog2 to cg2, and then attach prog1 to cg1; both BPF progs'
    attach type is NONE or OVERRIDE.
    2. write 1 to /proc/thread-self/fail-nth for failslab.
    3. detach prog1 from cg1, and the kernel BUG occurs.

    Failslab injection causes the kmalloc to fail and the code falls back to
    purge_effective_progs(). The problem is that cg2 has another prog attached,
    so when the iteration goes through the cg2 layer it advances pos by 1,
    the subsequent operations are skipped by the following condition, and cg
    ends up NULL.

    `if (pos && !(cg->bpf.flags[atype] & BPF_F_ALLOW_MULTI))`

    A NULL cg means that no link or prog matched; this is expected and not a
    bug. So just skip the no-match situation here.

    Fixes: 4c46091ee985 ("bpf: Fix KASAN use-after-free Read in compute_effective_progs")
    Signed-off-by: Pu Lehui
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20220813134030.1972696-1-pulehui@huawei.com
    Signed-off-by: Sasha Levin

    Pu Lehui
     

17 Aug, 2022

1 commit

  • commit 4c46091ee985ae84c60c5e95055d779fcd291d87 upstream.

    Syzbot found a use-after-free bug in compute_effective_progs().
    The reproducer creates a number of BPF links and causes a
    fault-injected allocation to fail while calling bpf_link_detach on them.
    Link detach triggers the link to be freed by bpf_link_free(),
    which calls __cgroup_bpf_detach() and update_effective_progs().
    If the memory allocation in this function fails, the function restores
    the pointer to the bpf_cgroup_link on the cgroup list, but the memory
    gets freed just after it returns. After this, every subsequent call to
    update_effective_progs() causes this already deallocated pointer to be
    dereferenced in prog_list_length(), and triggers KASAN UAF error.

    To fix this issue don't preserve the pointer to the prog or link in the
    list, but remove it and replace it with a dummy prog without shrinking
    the table. The subsequent call to __cgroup_bpf_detach() or
    __cgroup_bpf_attach() will correct it.

    Fixes: af6eea57437a ("bpf: Implement bpf_link-based cgroup BPF program attachment")
    Reported-by:
    Signed-off-by: Tadeusz Struk
    Signed-off-by: Andrii Nakryiko
    Cc:
    Link: https://syzkaller.appspot.com/bug?id=8ebf179a95c2a2670f7cf1ba62429ec044369db4
    Link: https://lore.kernel.org/bpf/20220517180420.87954-1-tadeusz.struk@linaro.org
    Signed-off-by: Greg Kroah-Hartman

    Tadeusz Struk
     

01 May, 2022

1 commit

  • commit 216e3cd2f28dbbf1fe86848e0e29e6693b9f0a20 upstream.

    Some helper functions may modify their arguments, for example,
    bpf_d_path, bpf_get_stack etc. Previously, their argument types
    were marked as ARG_PTR_TO_MEM, which is compatible with read-only
    mem types, such as PTR_TO_RDONLY_BUF. Therefore it's legitimate,
    but technically incorrect, to modify read-only memory by passing
    it into one of these helper functions.

    This patch tags the bpf_args compatible with immutable memory with
    the MEM_RDONLY flag. The arguments that don't have this flag will
    only be compatible with mutable memory types, preventing the helper
    from modifying read-only memory. The bpf_args that have MEM_RDONLY
    are compatible with both mutable memory and immutable memory.

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20211217003152.48334-9-haoluo@google.com
    Cc: stable@vger.kernel.org # 5.15.x
    Signed-off-by: Greg Kroah-Hartman

    Hao Luo
     

25 Nov, 2021

1 commit

  • commit 5e0bc3082e2e403ac0753e099c2b01446bb35578 upstream.

    Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing
    progs may result in locking issues.

    bpf_ktime_get_coarse_ns() uses ktime_get_coarse_ns() time accessor that
    isn't safe for any context:
    ======================================================
    WARNING: possible circular locking dependency detected
    5.15.0-syzkaller #0 Not tainted
    ------------------------------------------------------
    syz-executor.4/14877 is trying to acquire lock:
    ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255

    but task is already holding lock:
    ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&obj_hash[i].lock){-.-.}-{2:2}:
    lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
    __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
    _raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162
    __debug_object_init+0xd9/0x1860 lib/debugobjects.c:569
    debug_hrtimer_init kernel/time/hrtimer.c:414 [inline]
    debug_init kernel/time/hrtimer.c:468 [inline]
    hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592
    ntp_init_cmos_sync kernel/time/ntp.c:676 [inline]
    ntp_init+0xa1/0xad kernel/time/ntp.c:1095
    timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639
    start_kernel+0x267/0x56e init/main.c:1030
    secondary_startup_64_no_verify+0xb1/0xbb

    -> #0 (tk_core.seq.seqcount){----}-{0:0}:
    check_prev_add kernel/locking/lockdep.c:3051 [inline]
    check_prevs_add kernel/locking/lockdep.c:3174 [inline]
    validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789
    __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015
    lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
    seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103
    ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
    ktime_get_coarse include/linux/timekeeping.h:120 [inline]
    ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline]
    ____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline]
    bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171
    bpf_prog_a99735ebafdda2f1+0x10/0xb50
    bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline]
    __bpf_prog_run include/linux/filter.h:626 [inline]
    bpf_prog_run include/linux/filter.h:633 [inline]
    BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline]
    trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127
    perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708
    perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39
    trace_lock_release+0x128/0x150 include/trace/events/lock.h:58
    lock_release+0x82/0x810 kernel/locking/lockdep.c:5636
    __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline]
    _raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194
    debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline]
    debug_deactivate kernel/time/hrtimer.c:481 [inline]
    __run_hrtimer kernel/time/hrtimer.c:1653 [inline]
    __hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749
    hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811
    local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
    __sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103
    sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097
    asm_sysvec_apic_timer_interrupt+0x12/0x20
    __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
    _raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194
    try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118
    wake_up_process kernel/sched/core.c:4200 [inline]
    wake_up_q+0x9a/0xf0 kernel/sched/core.c:953
    futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184
    do_futex+0x367/0x560 kernel/futex/syscalls.c:127
    __do_sys_futex kernel/futex/syscalls.c:199 [inline]
    __se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    There is a possible deadlock with the bpf_timer_* set of helpers:

    hrtimer_start()
      lock_base();
      trace_hrtimer...()
        perf_event()
          bpf_run()
            bpf_timer_start()
              hrtimer_start()
                lock_base()

    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru
    Signed-off-by: Greg Kroah-Hartman

    Dmitrii Banshchikov
     

24 Aug, 2021

1 commit

  • Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf
    attach types and a function to map bpf_attach_type values to the new
    enum. Inspired by netns_bpf_attach_type.

    Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever
    possible. Functionality is unchanged as attach_type_to_prog_type
    switches in bpf/syscall.c were preventing non-cgroup programs from
    making use of the invalid cgroup_bpf array slots.

    As a result struct cgroup_bpf uses 504 fewer bytes relative to when its
    arrays were sized using MAX_BPF_ATTACH_TYPE.

    bpf_cgroup_storage is notably not migrated as struct
    bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type
    member which is not meant to be opaque. Similarly, bpf_cgroup_link
    continues to report its bpf_attach_type member to userspace via fdinfo
    and bpf_link_info.

    To ease disambiguation, bpf_attach_type variables are renamed from
    'type' to 'atype' when changed to cgroup_bpf_attach_type.
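
    A minimal userspace model of the idea, for illustration only: a compact
    enum holding just the cgroup attach types plus a mapping function that
    rejects everything else. Names and values below are invented for the
    sketch and are not the kernel's actual definitions.

    #include <stdio.h>

    enum attach_type_model {                /* stand-in for bpf_attach_type */
            MODEL_CGROUP_INET_INGRESS,
            MODEL_CGROUP_INET_EGRESS,
            MODEL_XDP,                      /* not a cgroup attach type */
            MODEL_MAX_BPF_ATTACH_TYPE,
    };

    enum cgroup_atype_model {               /* compact, cgroup-only enum */
            MODEL_CG_ATYPE_INVALID = -1,
            MODEL_CG_INET_INGRESS,
            MODEL_CG_INET_EGRESS,
            MODEL_MAX_CG_ATTACH_TYPE,       /* used to size the cgroup_bpf arrays */
    };

    static enum cgroup_atype_model to_cg_atype(enum attach_type_model type)
    {
            switch (type) {
            case MODEL_CGROUP_INET_INGRESS: return MODEL_CG_INET_INGRESS;
            case MODEL_CGROUP_INET_EGRESS:  return MODEL_CG_INET_EGRESS;
            default:                        return MODEL_CG_ATYPE_INVALID;
            }
    }

    int main(void)
    {
            printf("array slots: %d instead of %d\n",
                   MODEL_MAX_CG_ATTACH_TYPE, MODEL_MAX_BPF_ATTACH_TYPE);
            printf("XDP maps to %d (invalid)\n", to_cg_atype(MODEL_XDP));
            return 0;
    }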

    Signed-off-by: Dave Marchevsky
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com

    Dave Marchevsky
     

20 Aug, 2021

1 commit

  • Add logic to call bpf_setsockopt() and bpf_getsockopt() from setsockopt BPF
    programs. An example use case is when the user sets the IPV6_TCLASS socket
    option, we would also like to change the tcp-cc for that socket.

    We don't have any use case for calling bpf_setsockopt() from supposedly read-
    only sys_getsockopt(), so it is made available to BPF_CGROUP_SETSOCKOPT only
    at this point.
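
    The example use case above might look roughly like the following
    cgroup/setsockopt program (a hedged sketch: the constants are local
    mirrors of their uapi values, "cubic" is just an assumed congestion
    control, and error handling is omitted):

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define SOL_IPV6        41      /* mirrors of uapi socket constants */
    #define IPV6_TCLASS     67
    #define SOL_TCP         6
    #define TCP_CONGESTION  13

    SEC("cgroup/setsockopt")
    int set_tclass_and_cc(struct bpf_sockopt *ctx)
    {
            char cc[] = "cubic";

            if (ctx->level != SOL_IPV6 || ctx->optname != IPV6_TCLASS)
                    return 1;       /* not our option: let the kernel handle it */

            /* Also change the tcp-cc for this socket. */
            bpf_setsockopt(ctx->sk, SOL_TCP, TCP_CONGESTION, cc, sizeof(cc));
            return 1;               /* proceed with the original setsockopt() */
    }

    char _license[] SEC("license") = "GPL";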

    Signed-off-by: Prankur Gupta
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20210817224221.3257826-2-prankgup@fb.com

    Prankur Gupta
     

18 Aug, 2021

1 commit

    The variable allow is being initialized with a value that is never read; it
    is updated later on. The assignment is redundant and can be removed.

    Addresses-Coverity: ("Unused value")

    Signed-off-by: Colin Ian King
    Signed-off-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20210817170842.495440-1-colin.king@canonical.com

    Colin Ian King
     

17 Aug, 2021

2 commits

  • Similar to BPF_PROG_RUN, turn BPF_PROG_RUN_ARRAY macros into proper functions
    with all the same readability and maintainability benefits. Making them into
    functions required shuffling around bpf_set_run_ctx/bpf_reset_run_ctx
    functions. Also, explicitly specifying the type of the BPF prog run callback
    required adjusting __bpf_prog_run_save_cb() to accept const void *, casted
    internally to const struct sk_buff.

    Further, split out a cgroup-specific BPF_PROG_RUN_ARRAY_CG and
    BPF_PROG_RUN_ARRAY_CG_FLAGS from the more generic BPF_PROG_RUN_ARRAY due to
    the differences in bpf_run_ctx used for those two different use cases.

    I think BPF_PROG_RUN_ARRAY_CG would benefit from further refactoring to accept
    struct cgroup and enum bpf_attach_type instead of bpf_prog_array, fetching
    cgrp->bpf.effective[type] and RCU-dereferencing it internally. But that
    required including include/linux/cgroup-defs.h, which I wasn't sure is ok with
    everyone.

    The remaining generic BPF_PROG_RUN_ARRAY function will be extended to
    pass-through user-provided context value in the next patch.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20210815070609.987780-3-andrii@kernel.org

    Andrii Nakryiko
     
    Turn BPF_PROG_RUN into a proper always-inlined function. No functional or
    performance changes are intended, but it makes it much easier to understand
    what's going on with how BPF programs actually get executed. It's more
    obvious what types and callbacks are expected. Also, extra () around input
    parameters can be dropped, as well as `__` variable prefixes intended to avoid
    naming collisions, which makes the code simpler to read and write.

    This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
    a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
    BPF_PROG_RUN into a function causes naming conflict compilation error. So
    rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
    bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
    BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org

    Andrii Nakryiko
     

17 Feb, 2021

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2021-02-16

    The following pull-request contains BPF updates for your *net-next* tree.

    There's a small merge conflict between 7eeba1706eba ("tcp: Add receive timestamp
    support for receive zerocopy.") from net-next tree and 9cacf81f8161 ("bpf: Remove
    extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:

    [...]
    lock_sock(sk);
    err = tcp_zerocopy_receive(sk, &zc, &tss);
    err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
                                              &zc, &len, err);
    release_sock(sk);
    [...]

    We've added 116 non-merge commits during the last 27 day(s) which contain
    a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).

    The main changes are:

    1) Adds support of pointers to types with known size among global function
    args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.

    2) Add bpf_iter for task_vma which can be used to generate information similar
    to /proc/pid/maps, from Song Liu.

    3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
    rewriting bind user ports from BPF side below the ip_unprivileged_port_start
    range, both from Stanislav Fomichev.

    4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
    as well as per-cpu maps for the latter, from Alexei Starovoitov.

    5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
    for sleepable programs, both from KP Singh.

    6) Extend verifier to enable variable offset read/write access to the BPF
    program stack, from Andrei Matei.

    7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
    query device MTU from programs, from Jesper Dangaard Brouer.

    8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF
    tracing programs, from Florent Revest.

    9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
    otherwise too many passes are required, from Gary Lin.

    10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
    verification both related to zero-extension handling, from Ilya Leoshkevich.

    11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.

    12) Batch of AF_XDP selftest cleanups and small performance improvement
    for libbpf's xsk map redirect for newer kernels, from Björn Töpel.

    13) Follow-up BPF doc and verifier improvements around atomics with
    BPF_FETCH, from Brendan Jackman.

    14) Permit zero-sized data sections e.g. if ELF .rodata section contains
    read-only data from local variables, from Yonghong Song.

    15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Jan, 2021

1 commit

  • At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
    to the privileged ones (< ip_unprivileged_port_start), but it will
    be rejected later on in the __inet_bind or __inet6_bind.

    Let's add another return value to indicate that CAP_NET_BIND_SERVICE
    check should be ignored. Use the same idea as we currently use
    in cgroup/egress where bit #1 indicates CN. Instead, for
    cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should
    be bypassed.
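
    A hedged sketch of a cgroup/bind4 program using the new return bit: bit #0
    keeps its usual allow/deny meaning while bit #1 asks the kernel to skip the
    CAP_NET_BIND_SERVICE check, per the description above. The port number is
    arbitrary and the exact bit layout should be checked against the uapi
    documentation.

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("cgroup/bind4")
    int allow_unprivileged_443(struct bpf_sock_addr *ctx)
    {
            /* Let an unprivileged process bind to port 443. */
            if (ctx->user_port == bpf_htons(443))
                    return 1 | (1 << 1);    /* allow + bypass CAP_NET_BIND_SERVICE */

            return 1;                       /* allow, normal capability checks apply */
    }

    char _license[] SEC("license") = "GPL";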

    v5:
    - rename flags to be less confusing (Andrey Ignatov)
    - rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags
    and accept BPF_RET_SET_CN (no behavioral changes)

    v4:
    - Add missing IPv6 support (Martin KaFai Lau)

    v3:
    - Update description (Martin KaFai Lau)
    - Fix capability restore in selftest (Martin KaFai Lau)

    v2:
    - Switch to explicit return code (Martin KaFai Lau)

    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Martin KaFai Lau
    Acked-by: Andrey Ignatov
    Link: https://lore.kernel.org/bpf/20210127193140.3170382-1-sdf@google.com

    Stanislav Fomichev
     

23 Jan, 2021

2 commits

    Since ctx.optlen is signed, a value larger than max_value could be
    passed; as it is later used as unsigned, this causes a WARN_ON_ONCE
    in copy_to_user.

    Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
    Signed-off-by: Loris Reiff
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/20210122164232.61770-2-loris.reiff@liblor.ch

    Loris Reiff
     
    A TOCTOU issue in `__cgroup_bpf_run_filter_getsockopt` can trigger a
    WARN_ON_ONCE in a check of `copy_from_user`.

    `*optlen` is checked to be non-negative in the individual getsockopt
    functions beforehand. Changing `*optlen` in a race to a negative value
    will result in a `copy_from_user(ctx.optval, optval, ctx.optlen)` with
    `ctx.optlen` being a negative integer.
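
    A tiny userspace illustration (not kernel code) of the signedness pitfall
    behind both fixes: a negative signed length, once used as an unsigned size,
    turns into an enormous value, so such values have to be rejected up front.

    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
            int optlen = -1;            /* what a racing writer could produce */
            size_t copy_len = optlen;   /* implicit signed -> unsigned conversion */

            printf("optlen=%d becomes a copy length of %zu\n", optlen, copy_len);

            /* The fix boils down to checking the signed value first: */
            if (optlen < 0 || optlen > 4096 /* stand-in for max_optlen */)
                    puts("rejected before any copy takes place");
            return 0;
    }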

    Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
    Signed-off-by: Loris Reiff
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/20210122164232.61770-1-loris.reiff@liblor.ch

    Loris Reiff
     

21 Jan, 2021

3 commits

    When we attach any cgroup hook, the rest (even if unused/unattached) start
    to contribute a small overhead. In particular, the one we want to avoid is
    __cgroup_bpf_run_filter_skb, which does two redirections to get to
    the cgroup and pushes/pulls the skb.

    Let's split cgroup_bpf_enabled to be per-attach to make sure
    only used attach types trigger.

    I've dropped some existing high-level cgroup_bpf_enabled in some
    places because BPF_PROG_CGROUP_XXX_RUN macros usually have another
    cgroup_bpf_enabled check.

    I also had to copy-paste BPF_CGROUP_RUN_SA_PROG_LOCK for
    GETPEERNAME/GETSOCKNAME because type for cgroup_bpf_enabled[type]
    has to be constant and known at compile time.

    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20210115163501.805133-4-sdf@google.com

    Stanislav Fomichev
     
    When we attach a bpf program to cgroup/getsockopt, any other getsockopt()
    syscall starts incurring kzalloc/kfree cost.

    Let's add a small buffer on the stack and use it for small (majority of)
    {s,g}etsockopt values. The buffer is small enough to fit into
    a cache line and covers the majority of simple options (most
    of them are 4-byte ints).

    It seems natural to do the same for setsockopt, but it's a bit more
    involved when the BPF program modifies the data (where we have to
    kmalloc). The assumption is that for the majority of setsockopt
    calls (which are doing pure BPF options or apply policy) this
    will bring some benefit as well.

    Without this patch (we remove about 1% __kmalloc):
    3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt
    |
    --3.30%--__cgroup_bpf_run_filter_getsockopt
    |
    --0.81%--__kmalloc
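
    The pattern is the usual small-buffer-on-the-stack optimization; a minimal
    userspace model of it (sizes and names invented for illustration) looks like:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define STACK_BUF_SIZE 64               /* assumed "big enough" for most options */

    static void handle_optval(const void *optval, size_t optlen)
    {
            char stack_buf[STACK_BUF_SIZE];
            void *buf = stack_buf;

            if (optlen > sizeof(stack_buf)) {
                    buf = malloc(optlen);   /* slow path, the kzalloc() analogue */
                    if (!buf)
                            return;
            }

            memcpy(buf, optval, optlen);    /* work on the private copy */
            printf("handled %zu bytes, heap=%s\n", optlen,
                   buf == stack_buf ? "no" : "yes");

            if (buf != stack_buf)
                    free(buf);
    }

    int main(void)
    {
            int small = 1;
            char big[256] = { 0 };

            handle_optval(&small, sizeof(small));   /* stays on the stack */
            handle_optval(big, sizeof(big));        /* falls back to malloc */
            return 0;
    }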

    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20210115163501.805133-3-sdf@google.com

    Stanislav Fomichev
     
  • Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
    We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
    call in do_tcp_getsockopt using the on-stack data. This removes
    3% overhead for locking/unlocking the socket.

    Without this patch:
    3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt
    |
    --3.30%--__cgroup_bpf_run_filter_getsockopt
    |
    --0.81%--__kmalloc

    With the patch applied:
    0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern

    Note, exporting uapi/tcp.h requires removing netinet/tcp.h
    from test_progs.h because those headers have conflicting
    definitions.

    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com

    Stanislav Fomichev
     

13 Jan, 2021

1 commit

    optlen == 0 indicates that the kernel should ignore the BPF buffer
    and use the original one from the user. We, however, forget
    to free the temporary buffer that we've allocated for BPF.

    Fixes: d8fe449a9c51 ("bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE")
    Reported-by: Martin KaFai Lau
    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20210112162829.775079-1-sdf@google.com

    Stanislav Fomichev
     

23 Oct, 2020

1 commit

  • Pull initial set_fs() removal from Al Viro:
    "Christoph's set_fs base series + fixups"

    * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Allow a NULL pos pointer to __kernel_read
    fs: Allow a NULL pos pointer to __kernel_write
    powerpc: remove address space overrides using set_fs()
    powerpc: use non-set_fs based maccess routines
    x86: remove address space overrides using set_fs()
    x86: make TASK_SIZE_MAX usable from assembly code
    x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
    lkdtm: remove set_fs-based tests
    test_bitmap: remove user bitmap tests
    uaccess: add infrastructure for kernel builds with set_fs()
    fs: don't allow splice read/write without explicit ops
    fs: don't allow kernel reads and writes without iter ops
    sysctl: Convert to iter interfaces
    proc: add a read_iter method to proc proc_ops
    proc: cleanup the compat vs no compat file ops
    proc: remove a level of indentation in proc_get_inode

    Linus Torvalds
     

09 Sep, 2020

1 commit

  • Using the read_iter/write_iter interfaces allows for in-kernel users
    to set sysctls without using set_fs(). Also, the buffer is a string,
    so give it the real type of 'char *', not void *.

    [AV: Christoph's fixup folded in]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Matthew Wilcox (Oracle)
     

24 Aug, 2020

1 commit

    Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through
    markings where they are unnecessary.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

02 Aug, 2020

1 commit

    Add a LINK_DETACH command to force-detach a bpf_link without destroying it. It
    has the same behavior as the auto-detaching of a bpf_link due to a cgroup dying
    for bpf_cgroup_link or a net_device being destroyed for bpf_xdp_link. In such a
    case, the bpf_link is still a valid kernel object, but is defunct and doesn't
    hold the BPF program attached to the corresponding BPF hook. This functionality
    allows users with enough access rights to manually force-detach an attached
    bpf_link without killing the respective owner process.

    This patch implements LINK_DETACH for cgroup, xdp, and netns links, mostly
    re-using existing link release handling code.
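
    From userspace, force-detaching a pinned cgroup link might look like the
    following hedged sketch (the pin path is made up; bpf_obj_get() and
    bpf_link_detach() are assumed to come from a libbpf recent enough to know
    about LINK_DETACH):

    #include <stdio.h>
    #include <bpf/bpf.h>

    int main(void)
    {
            int link_fd = bpf_obj_get("/sys/fs/bpf/my_cgroup_link");

            if (link_fd < 0) {
                    perror("bpf_obj_get");
                    return 1;
            }

            /* The link object stays alive, but no longer holds the program
             * attached to its hook. */
            if (bpf_link_detach(link_fd)) {
                    perror("bpf_link_detach");
                    return 1;
            }

            puts("link force-detached");
            return 0;
    }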

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200731182830.286260-2-andriin@fb.com

    Andrii Nakryiko
     

26 Jul, 2020

1 commit

  • This change comes in several parts:

    One, the restriction that the CGROUP_STORAGE map can only be used
    by one program is removed. This results in the removal of the field
    'aux' in struct bpf_cgroup_storage_map, and removal of relevant
    code associated with the field, and removal of now-noop functions
    bpf_free_cgroup_storage and bpf_cgroup_storage_release.

    Second, we permit a key of type u64 as the key to the map.
    Providing such a key type indicates that the map should ignore
    attach type when comparing map keys. However, for simplicity newly
    linked storage will still have the attach type at link time in
    its key struct. cgroup_storage_check_btf is adapted to accept
    u64 as the type of the key.

    Third, because the storages are now shared, the storages cannot
    be unconditionally freed on program detach. There could be two
    ways to solve this issue:
    * A. Reference count the usage of the storages, and free when the
    last program is detached.
    * B. Free only when the storage is impossible to be referred to
    again, i.e. when either the cgroup_bpf it is attached to, or
    the map itself, is freed.
    Option A has the side effect that, when the user detaches and
    reattaches a program, whether the program gets fresh storage
    depends on whether there is another program attached using that
    storage. This could trigger races if the user is multi-threaded,
    and since nondeterminism in data races is evil, go with option B.

    Both the map and the cgroup_bpf now track their associated
    storages, and the storage unlink and free are removed from
    cgroup_bpf_detach and added to cgroup_bpf_release and
    cgroup_storage_map_free. The latter now also holds the cgroup_mutex
    to prevent any races with the former.

    Fourth, on attach, we reuse the old storage if the key already
    exists in the map, via cgroup_storage_lookup. If the storage
    does not exist yet, we create a new one, and publish it at the
    last step in the attach process. This does not create a race
    condition because for the whole attach the cgroup_mutex is held.
    We keep track of an array of new storages that was allocated
    and if the process fails only the new storages would get freed.
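
    On the BPF side, the attach-type-agnostic key described in the second part
    might be declared roughly as below (a hedged sketch: the map/value layout
    and section name are illustrative, and the map is keyed by a plain __u64
    instead of struct bpf_cgroup_storage_key):

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_CGROUP_STORAGE);
            __type(key, __u64);     /* attach type is ignored when comparing keys */
            __type(value, __u64);
    } shared_counter SEC(".maps");

    SEC("cgroup_skb/egress")
    int count_egress(struct __sk_buff *skb)
    {
            __u64 *counter = bpf_get_local_storage(&shared_counter, 0);

            __sync_fetch_and_add(counter, 1);
            return 1;               /* allow the packet */
    }

    char _license[] SEC("license") = "GPL";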

    Signed-off-by: YiFei Zhu
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/d5401c6106728a00890401190db40020a1f84ff1.1595565795.git.zhuyifei@google.com

    YiFei Zhu
     

18 Jun, 2020

1 commit

  • Attaching to these hooks can break iptables because its optval is
    usually quite big, or at least bigger than the current PAGE_SIZE limit.
    David also mentioned some SCTP options can be big (around 256k).

    For such optvals we expose only the first PAGE_SIZE bytes to
    the BPF program. The BPF program has two options (see the sketch below):
    1. Set ctx->optlen to 0 to indicate that the BPF optval
    should be ignored and the kernel should use the original userspace
    value.
    2. Set ctx->optlen to something smaller than PAGE_SIZE.
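
    A minimal cgroup/getsockopt program following option 1 could look like this
    hedged sketch (4096 stands in for PAGE_SIZE; retval and optval are left
    untouched so the kernel's result passes through unchanged):

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define EXPOSED_LIMIT 4096      /* assumed PAGE_SIZE */

    SEC("cgroup/getsockopt")
    int passthrough_big_optvals(struct bpf_sockopt *ctx)
    {
            if (ctx->optlen > EXPOSED_LIMIT)
                    ctx->optlen = 0;        /* tell the kernel to keep the original value */

            return 1;                       /* allow the getsockopt() to proceed */
    }

    char _license[] SEC("license") = "GPL";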

    v5:
    * use ctx->optlen == 0 with trimmed buffer (Alexei Starovoitov)
    * update the docs accordingly

    v4:
    * use temporary buffer to avoid optval == optval_end == NULL;
    this removes the corner case in the verifier that might assume
    non-zero PTR_TO_PACKET/PTR_TO_PACKET_END.

    v3:
    * don't increase the limit, bypass the argument

    v2:
    * proper comments formatting (Jakub Kicinski)

    Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Alexei Starovoitov
    Cc: David Laight
    Link: https://lore.kernel.org/bpf/20200617010416.93086-1-sdf@google.com

    Stanislav Fomichev
     

10 Jun, 2020

1 commit

  • When using BPF_PROG_ATTACH to attach a program to a cgroup in
    BPF_F_ALLOW_MULTI mode, it is not possible to replace a program
    with itself. This is because the check for duplicate programs
    doesn't take the replacement program into account.

    Replacing a program with itself might seem weird, but it has
    some uses: first, it allows resetting the associated cgroup storage.
    Second, it makes the API consistent with the non-ALLOW_MULTI usage,
    where it is possible to replace a program with itself. Third, it
    aligns BPF_PROG_ATTACH with bpf_link, where replacing itself is
    also supported.
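
    In terms of the raw syscall, replacing a program with itself in
    BPF_F_ALLOW_MULTI mode boils down to passing the same FD twice (a hedged
    sketch; the FDs and attach type are placeholders and error handling is
    omitted):

    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    static int replace_prog_with_itself(int cgroup_fd, int prog_fd)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.target_fd      = cgroup_fd;
            attr.attach_bpf_fd  = prog_fd;
            attr.attach_type    = BPF_CGROUP_INET_INGRESS;
            attr.attach_flags   = BPF_F_ALLOW_MULTI | BPF_F_REPLACE;
            attr.replace_bpf_fd = prog_fd;  /* replace the program with itself */

            return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
    }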

    Since this code has been refactored a few times this change will
    only apply to v5.7 and later. Adjustments could be made to
    commit 1020c1f24a94 ("bpf: Simplify __cgroup_bpf_attach") and
    commit d7bf2c10af05 ("bpf: allocate cgroup storage entries on attaching bpf programs")
    as well as commit 324bda9e6c5a ("bpf: multi program support for cgroup+bpf")

    Fixes: af6eea57437a ("bpf: Implement bpf_link-based cgroup BPF program attachment")
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200608162202.94002-1-lmb@cloudflare.com

    Lorenz Bauer
     

02 Jun, 2020

1 commit

  • Failure to update a bpf_link because it has been auto-detached by a dying
    cgroup currently results in EINVAL error, even though the arguments passed
    to bpf() syscall are not wrong.

    bpf_links attaching to netns in this case will return ENOLINK, which
    carries the message that the link is no longer attached to anything.

    Change cgroup bpf_links to do the same to keep the uAPI errors consistent.

    Fixes: 0c991ebc8c69 ("bpf: Implement bpf_prog replacement for an active bpf_cgroup_link")
    Suggested-by: Lorenz Bauer
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200531082846.2117903-6-jakub@cloudflare.com

    Jakub Sitnicki
     

29 Apr, 2020

4 commits

  • Add ability to fetch bpf_link details through BPF_OBJ_GET_INFO_BY_FD command.
    Also enhance show_fdinfo to potentially include bpf_link type-specific
    information (similarly to obj_info).

    Also introduce enum bpf_link_type stored in bpf_link itself and expose it in
    UAPI. bpf_link_tracing also now will store and return bpf_attach_type.
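
    From userspace, querying a link's metadata then looks roughly like the
    hedged sketch below (the pin path is an example; id, type and prog_id are
    fields of struct bpf_link_info in the uapi):

    #include <stdio.h>
    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    int main(void)
    {
            struct bpf_link_info info = {};
            __u32 len = sizeof(info);
            int link_fd = bpf_obj_get("/sys/fs/bpf/my_link");

            if (link_fd < 0)
                    return 1;
            if (bpf_obj_get_info_by_fd(link_fd, &info, &len))
                    return 1;

            printf("link id=%u type=%u prog_id=%u\n",
                   info.id, info.type, info.prog_id);
            return 0;
    }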

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200429001614.1544-5-andriin@fb.com

    Andrii Nakryiko
     
  • Generate ID for each bpf_link using IDR, similarly to bpf_map and bpf_prog.
    bpf_link creation, initialization, attachment, and exposing to user-space
    through FD and ID is a complicated multi-step process, abstract it away
    through bpf_link_primer and bpf_link_prime(), bpf_link_settle(), and
    bpf_link_cleanup() internal API. They guarantee that until bpf_link is
    properly attached, user-space won't be able to access a partially-initialized
    bpf_link either from FD or ID. All this allows simplifying bpf_link attachment
    and error handling code.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200429001614.1544-3-andriin@fb.com

    Andrii Nakryiko
     
  • Make bpf_link update support more generic by making it into another
    bpf_link_ops methods. This allows generic syscall handling code to be agnostic
    to various conditionally compiled features (e.g., the case of
    CONFIG_CGROUP_BPF). This also allows link type-specific code to remain
    static within its respective code base. Refactor the existing bpf_cgroup_link
    code and take advantage of this.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200429001614.1544-2-andriin@fb.com

    Andrii Nakryiko
     
  • Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
    methods to take kernel pointers instead. It gets rid of the set_fs address
    space overrides used by BPF. As per discussion, pull in the feature branch
    into bpf-next as it relates to BPF sysctl progs.

    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/

    Daniel Borkmann
     

28 Apr, 2020

1 commit

    Except for a few of the networking hooks called from modular ipv4
    or ipv6 code, all of these hooks are called from code that is
    guaranteed to be built in.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrey Ignatov
    Link: https://lore.kernel.org/bpf/20200424064338.538313-2-hch@lst.de

    Christoph Hellwig
     

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common handlers,
    a lot of the changes are mechanical.

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

26 Apr, 2020

1 commit

  • Currently the following prog types don't fall back to bpf_base_func_proto()
    (instead they have cgroup_base_func_proto which has a limited set of
    helpers from bpf_base_func_proto):
    * BPF_PROG_TYPE_CGROUP_DEVICE
    * BPF_PROG_TYPE_CGROUP_SYSCTL
    * BPF_PROG_TYPE_CGROUP_SOCKOPT

    I don't see any specific reason why we shouldn't use bpf_base_func_proto(),
    every other type of program (except bpf-lirc and, understandably, tracing)
    use it, so let's fall back to bpf_base_func_proto for those prog types
    as well.

    This basically boils down to adding access to the following helpers:
    * BPF_FUNC_get_prandom_u32
    * BPF_FUNC_get_smp_processor_id
    * BPF_FUNC_get_numa_node_id
    * BPF_FUNC_tail_call
    * BPF_FUNC_ktime_get_ns
    * BPF_FUNC_spin_lock (CAP_SYS_ADMIN)
    * BPF_FUNC_spin_unlock (CAP_SYS_ADMIN)
    * BPF_FUNC_jiffies64 (CAP_SYS_ADMIN)

    I've also added bpf_perf_event_output() because it's really handy for
    logging and debugging.

    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200420174610.77494-1-sdf@google.com

    Stanislav Fomichev
     

31 Mar, 2020

3 commits

  • Signed-off-by: David S. Miller

    David S. Miller
     
    Add a new operation (LINK_UPDATE), which allows replacing the active bpf_prog
    under a given bpf_link. Currently this is only supported for bpf_cgroup_link,
    but will be extended to other kinds of bpf_links in follow-up patches.

    For bpf_cgroup_link, implemented functionality matches existing semantics for
    direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either
    unconditionally set new bpf_prog regardless of which bpf_prog is currently
    active under given bpf_link, or, optionally, can specify expected active
    bpf_prog. If active bpf_prog doesn't match expected one, no changes are
    performed, old bpf_link stays intact and attached, operation returns
    a failure.

    The cgroup_bpf_replace() operation resolves the race between auto-detachment
    and bpf_prog update in the same fashion as is done for bpf_link detachment,
    except that in this case the update has no way of succeeding because the
    target cgroup is marked as dying, so an error is returned.
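
    Through libbpf this maps to bpf_link_update(); a hedged sketch of the
    "replace only if the expected program is still attached" variant (the FDs
    are placeholders):

    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    static int swap_prog(int link_fd, int old_prog_fd, int new_prog_fd)
    {
            DECLARE_LIBBPF_OPTS(bpf_link_update_opts, opts,
                    .flags = BPF_F_REPLACE,         /* fail unless the old prog matches */
                    .old_prog_fd = old_prog_fd,
            );

            return bpf_link_update(link_fd, new_prog_fd, &opts);
    }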

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com

    Andrii Nakryiko
     
  • Implement new sub-command to attach cgroup BPF programs and return FD-based
    bpf_link back on success. bpf_link, once attached to cgroup, cannot be
    replaced, except by owner having its FD. Cgroup bpf_link supports only
    BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
    attachments can be freely intermixed.
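
    From userspace, creating such a link might look like the hedged sketch
    below (cgroup path, attach type and error handling are illustrative;
    bpf_link_create() is libbpf's wrapper around BPF_LINK_CREATE):

    #include <fcntl.h>
    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    static int attach_cgroup_link(int prog_fd, const char *cgroup_path)
    {
            int cgroup_fd = open(cgroup_path, O_RDONLY);

            if (cgroup_fd < 0)
                    return -1;

            /* On success the returned FD owns the attachment; closing it (or
             * the cgroup dying) detaches the program. */
            return bpf_link_create(prog_fd, cgroup_fd, BPF_CGROUP_INET_EGRESS, NULL);
    }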

    To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
    BPF program can be executed, implement auto-detachment of link. When
    cgroup_bpf_release() is called, all attached bpf_links are forced to release
    cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
    well as still owning underlying bpf_prog. This is because user-space might
    still have FDs open and active, so bpf_link as a user-referenced object can't
    be freed yet. Once last active FD is closed, bpf_link will be freed and
    underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
    touched, because cgroup is released already.

    The inherent race between bpf_cgroup_link release (from closing last FD) and
    cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
    the only additional check required is when bpf_cgroup_link attempts to detach
    itself from cgroup. At that time we need to check whether there is still
    cgroup associated with that link. And if not, exit with success, because
    bpf_cgroup_link was already successfully detached.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Roman Gushchin
    Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com

    Andrii Nakryiko
     

27 Mar, 2020

1 commit

    Refactor cgroup attach/detach code to abstract away common operations
    performed on all types of cgroup storages. This makes the high-level logic
    more apparent, plus allows reusing more code across multiple functions.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200325065746.640559-2-andriin@fb.com

    Andrii Nakryiko
     

10 Mar, 2020

2 commits

    There is no compensating cgroup_bpf_put() for each ancestor cgroup in
    cgroup_bpf_inherit(). If compute_effective_progs returns an error, those
    cgroups will never be freed. Fix it by putting them in the cleanup code path.

    Fixes: e10360f815ca ("bpf: cgroup: prevent out-of-order release of cgroup bpf")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Roman Gushchin
    Link: https://lore.kernel.org/bpf/20200309224017.1063297-1-andriin@fb.com

    Andrii Nakryiko
     
    The local storage array isn't initialized, so if cgroup storage allocation
    fails for BPF_CGROUP_STORAGE_SHARED, the error handling code will attempt to
    free an uninitialized pointer for the BPF_CGROUP_STORAGE_PERCPU storage type.
    Avoid this by always initializing the storage pointers to NULL.

    Fixes: 8bad74f9840f ("bpf: extend cgroup bpf core to allow multiple cgroup storage types")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200309222756.1018737-1-andriin@fb.com

    Andrii Nakryiko
     

10 Jan, 2020

1 commit