30 Jan, 2020

1 commit

  • Pull openat2 support from Al Viro:
    "This is the openat2() series from Aleksa Sarai.

    I'm afraid that the rest of namei stuff will have to wait - it got
    zero review the last time I'd posted #work.namei, and there had been a
    leak in the posted series I'd caught only last weekend. I was going to
    repost it on Monday, but the window opened and the odds of getting any
    review during that... Oh, well.

    Anyway, openat2 part should be ready; that _did_ get sane amount of
    review and public testing, so here it comes"

    From Aleksa's description of the series:
    "For a very long time, extending openat(2) with new features has been
    incredibly frustrating. This stems from the fact that openat(2) is
    possibly the most famous counter-example to the mantra "don't silently
    accept garbage from userspace" -- it doesn't check whether unknown
    flags are present[1].

    This means that (generally) the addition of new flags to openat(2) has
    been fraught with backwards-compatibility issues (O_TMPFILE has to be
    defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
    kernels gave errors, since it's insecure to silently ignore the
    flag[2]). All new security-related flags therefore have a tough road
    to being added to openat(2).

    Furthermore, the need for some sort of control over VFS's path
    resolution (to avoid malicious paths resulting in inadvertent
    breakouts) has been a very long-standing desire of many userspace
    applications.

    This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
    (which was a variant of David Drysdale's O_BENEATH patchset[4] which
    was a spin-off of the Capsicum project[5]) with a few additions and
    changes made based on the previous discussion within [6] as well as
    others I felt were useful.

    In line with the conclusions of the original discussion of
    AT_NO_JUMPS, the flag has been split up into separate flags. However,
    instead of being an openat(2) flag it is provided through a new
    syscall openat2(2) which provides several other improvements to the
    openat(2) interface (see the patch description for more details). The
    following new LOOKUP_* flags are added:

    LOOKUP_NO_XDEV:

    Blocks all mountpoint crossings (upwards, downwards, or through
    absolute links). Absolute pathnames alone in openat(2) do not
    trigger this. Magic-link traversal which implies a vfsmount jump is
    also blocked (though magic-link jumps on the same vfsmount are
    permitted).

    LOOKUP_NO_MAGICLINKS:

    Blocks resolution through /proc/$pid/fd-style links. This is done
    by blocking the usage of nd_jump_link() during resolution in a
    filesystem. The term "magic-links" is used to match with the only
    reference to these links in Documentation/, but I'm happy to change
    the name.

    It should be noted that this is different to the scope of
    ~LOOKUP_FOLLOW in that it applies to all path components. However,
    you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
    will *not* fail (assuming that no parent component was a
    magic-link), and you will have an fd for the magic-link.

    In order to correctly detect magic-links, the introduction of a new
    LOOKUP_MAGICLINK_JUMPED state flag was required.

    LOOKUP_BENEATH:

    Disallows escapes to outside the starting dirfd's
    tree, using techniques such as ".." or absolute links. Absolute
    paths in openat(2) are also disallowed.

    Conceptually this flag is to ensure you "stay below" a certain
    point in the filesystem tree -- but this requires some additional
    to protect against various races that would allow escape using
    "..".

    Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because
    magic-links can trivially beam you around the filesystem (breaking the
    protection). In future, there might be similar safety checks done
    as in LOOKUP_IN_ROOT, but that requires more discussion.

    In addition, two new flags are added that expand on the above ideas:

    LOOKUP_NO_SYMLINKS:

    Does what it says on the tin. No symlink resolution is allowed at
    all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
    can still be used with NOFOLLOW to open an fd for the symlink as
    long as no parent path had a symlink component.

    LOOKUP_IN_ROOT:

    This is an extension of LOOKUP_BENEATH that, rather than blocking
    attempts to move past the root, forces all such movements to be
    scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that
    chroot(2) is not.

    If a race is detected (as with LOOKUP_BENEATH) then an error is
    generated, and similar to LOOKUP_BENEATH it is not permitted to
    cross magic-links with LOOKUP_IN_ROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[7] when opening
    paths in a potentially malicious container.

    There is a long list of CVEs that could have been mitigated by
    having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
    CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
    few).

    In order to make all of the above more usable, I'm working on
    libpathrs[8] which is a C-friendly library for safe path resolution.
    It features a userspace-emulated backend if the kernel doesn't support
    openat2(2). Hopefully we can get userspace to switch to using it, and
    thus get openat2(2) support for free once it's ready.

    Future work would include implementing things like
    RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
    programs to be sure they don't hit DoSes through stale NFS handles)"
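
    As a rough illustration of how the new syscall is meant to be used from
    userspace (a minimal sketch, not part of the series above; it assumes
    <linux/openat2.h> and a __NR_openat2 definition are available, and uses
    RESOLVE_IN_ROOT, the openat2(2) counterpart of LOOKUP_IN_ROOT):

    #include <fcntl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/openat2.h>      /* struct open_how, RESOLVE_* flags */

    /* Open "path" as if rootfs_fd were the root, without the races and
     * cost of a real chroot(2) + open(2) sequence. */
    static int open_in_root(int rootfs_fd, const char *path)
    {
            struct open_how how = {
                    .flags   = O_RDONLY,
                    .resolve = RESOLVE_IN_ROOT, /* scope "..", absolute paths
                                                 * and symlinks to rootfs_fd */
            };

            return syscall(__NR_openat2, rootfs_fd, path, &how, sizeof(how));
    }

    A caller would typically treat -1 with errno ENOSYS as "kernel without
    openat2(2)" and fall back to userspace path scoping, as the libpathrs
    library mentioned above aims to do.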

    * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    Documentation: path-lookup: include new LOOKUP flags
    selftests: add openat2(2) selftests
    open: introduce openat2(2) syscall
    namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
    namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
    namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
    namei: LOOKUP_NO_XDEV: block mountpoint crossing
    namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
    namei: LOOKUP_NO_SYMLINKS: block symlink resolution
    namei: allow set_root() to produce errors
    namei: allow nd_jump_link() to produce errors
    nsfs: clean-up ns_get_path() signature to return int
    namei: only return -ECHILD from follow_dotdot_rcu()

    Linus Torvalds
     

27 Jan, 2020

3 commits

  • Now that we depend on call_rcu() and synchronize_rcu() to also wait
    for preempt-disabled regions to complete, the RCU read-side critical
    section in __dev_map_flush() is no longer required, except in a few
    special cases in drivers that need it for other reasons.

    These originally ensured the map reference was safe while a map was
    also being freed, and additionally that bpf program updates via
    ndo_bpf did not happen while flush updates were in flight. But under
    the new rules flush can only be called from preempt-disabled NAPI
    context. The synchronize_rcu from the map free path and the call_rcu
    from the delete path will ensure the reference there is safe. So let's
    remove the rcu_read_lock and rcu_read_unlock pair to avoid any
    confusion around how this is being protected.

    If the rcu_read_lock were still required it would mean there are
    errors in the above logic and the original patch would also be wrong.

    Now that we have done the above, we put the rcu_read_lock in the
    driver code where it is needed, in a driver-dependent way. I think
    this helps readability of the code so we know where and why we are
    taking read locks. Most drivers will not need rcu_read_locks here,
    and further XDP drivers already have rcu_read_locks in their code
    paths for reading xdp programs on the RX side, so this makes it
    symmetric instead of having half of the rcu critical sections defined
    in the driver and the other half in devmap.

    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Link: https://lore.kernel.org/bpf/1580084042-11598-4-git-send-email-john.fastabend@gmail.com

    John Fastabend
     
  • Now that we rely on synchronize_rcu and call_rcu waiting to
    exit preempt-disabled regions (NAPI), let's update the comments
    to reflect this.

    Fixes: 0536b85239b84 ("xdp: Simplify devmap cleanup")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/1580084042-11598-2-git-send-email-john.fastabend@gmail.com

    John Fastabend
     
  • If a seq_file .next function does not change the position index,
    a read after some lseek can generate unexpected output.

    See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283

    v1 -> v2: removed missed increment in end of function
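
    The pattern being fixed looks roughly like this (a hedged sketch with a
    hypothetical iterator, not the actual bpf code):

    static void *example_seq_next(struct seq_file *seq, void *v, loff_t *pos)
    {
            ++(*pos);                    /* always advance the position index,
                                          * even if there is no next element */
            return example_next_elem(v); /* hypothetical; may return NULL */
    }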

    Signed-off-by: Vasily Averin
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com

    Vasily Averin
     

25 Jan, 2020

2 commits

  • When unwinding the stack we need to identify each address
    to successfully continue. Add a latch tree to keep trampolines
    for quick lookup during the unwind.

    The patch uses the first 48 bytes of the page for the latch tree
    node, leaving 4048 bytes of the rest of the page for trampoline- or
    dispatcher-generated code.

    That is still enough not to affect the maximum counts of trampoline
    and dispatcher programs.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200123161508.915203-3-jolsa@kernel.org

    Jiri Olsa
     
  • When accessing the context we allow access to arguments with
    scalar type and pointer to struct. But we deny access for
    pointer to scalar type, which is the case for many functions.

    Alexei suggested taking a conservative approach and, for now,
    allowing only string pointer access, which covers most
    functions:

    Add a check whether the pointer is to a string type and allow access to it.

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200123161508.915203-2-jolsa@kernel.org

    Jiri Olsa
     

24 Jan, 2020

1 commit

  • head is traversed using hlist_for_each_entry_rcu outside an RCU
    read-side critical section but under the protection of dtab->index_lock.

    Hence, add the corresponding lockdep expression to silence
    false-positive lockdep warnings, and harden RCU lists.
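
    The resulting pattern looks roughly like the following (a sketch; the
    field names follow the devmap_hash code but treat it as illustrative):

    struct bpf_dtab_netdev *dev;

    hlist_for_each_entry_rcu(dev, head, index_hlist,
                             lockdep_is_held(&dtab->index_lock))
            if (dev->idx == key)
                    return dev;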

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Signed-off-by: Amol Grover
    Signed-off-by: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200123120437.26506-1-frextrite@gmail.com

    Amol Grover
     

23 Jan, 2020

3 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-01-22

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 92 non-merge commits during the last 16 day(s) which contain
    a total of 320 files changed, 7532 insertions(+), 1448 deletions(-).

    The main changes are:

    1) function by function verification and program extensions from Alexei.

    2) massive cleanup of selftests/bpf from Toke and Andrii.

    3) batched bpf map operations from Brian and Yonghong.

    4) tcp congestion control in bpf from Martin.

    5) bulking for non-map xdp_redirect form Toke.

    6) bpf_send_signal_thread helper from Yonghong.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch adds a helper to read the 64-bit jiffies. It will be used
    in a later patch to implement bpf_cubic.c.

    The helper is inlined when jit_requested is set and BITS_PER_LONG is
    64, as is done for map_gen_lookup(). Other cases could be considered
    together with map_gen_lookup() if needed.
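
    A hedged sketch of calling the new helper from a BPF program (section
    name and return value are illustrative; it assumes a libbpf version
    whose helper definitions already declare bpf_jiffies64()):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("sockops")                       /* illustrative program type */
    int note_jiffies(struct bpf_sock_ops *skops)
    {
            __u64 now = bpf_jiffies64(); /* current 64-bit jiffies */

            /* a cubic-style CC would compare 'now' against a stored
             * per-connection timestamp; here we only demonstrate the call */
            return now != 0;
    }

    char _license[] SEC("license") = "GPL";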

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200122233646.903260-1-kafai@fb.com

    Martin KaFai Lau
     
  • Introduce dynamic program extensions. The users can load additional BPF
    functions and replace global functions in previously loaded BPF programs while
    these programs are executing.

    Global functions are verified individually by the verifier based on their types only.
    Hence a global function in the new program whose type matches an older function can
    safely replace that corresponding function.

    This new function/program is called 'an extension' of the old program. At load time
    the verifier uses the (attach_prog_fd, attach_btf_id) pair to identify the function
    to be replaced. The BPF program type of the extension is derived from the target
    program. Technically bpf_verifier_ops is copied from the target program.
    The BPF_PROG_TYPE_EXT program type is a placeholder. It has empty verifier_ops.
    The extension program can call the same bpf helper functions as the target program.
    A single BPF_PROG_TYPE_EXT type is used to extend XDP, SKB and all other program
    types. The verifier allows only one level of replacement, meaning that an
    extension program cannot recursively extend an extension. That also means that
    the maximum stack size increases from 512 to 1024 bytes and the maximum
    function nesting level from 8 to 16. The programs don't always consume that
    much. The stack usage is determined by the number of on-stack variables used by
    the program. The verifier could have enforced the 512 limit for the combined
    original plus extension program, but that makes for a difficult user experience.
    The main use case for extensions is to provide a generic mechanism to plug
    external programs into a policy program or function call chaining.

    BPF trampoline is used to track both fentry/fexit and program extensions
    because both are using the same nop slot at the beginning of every BPF
    function. Attaching fentry/fexit to a function that was replaced is not
    allowed. The opposite is true as well: replacing a function that is
    currently being traced with fentry/fexit is not allowed. The executable page
    allocated by BPF trampoline is not used by program extensions. This
    inefficiency will be optimized in future patches.

    Function-by-function verification of global functions supports scalars and
    pointers to context only. Hence program extensions are supported only for
    that class of global functions. In the future the verifier will be extended
    with support for pointers to structures, arrays with sizes, etc.
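
    A hedged BPF C sketch of the kind of global function an extension can
    later replace (the section names are libbpf conventions and the policy
    function is hypothetical; the extension itself would be loaded as
    BPF_PROG_TYPE_EXT with attach_prog_fd/attach_btf_id pointing at it):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Global (non-static) function: verified on its own, based on its type
     * only, so a matching extension program can replace it at runtime. */
    __attribute__((noinline))
    int xdp_policy(struct xdp_md *ctx)
    {
            return XDP_PASS;             /* default policy */
    }

    SEC("xdp")
    int xdp_main(struct xdp_md *ctx)
    {
            return xdp_policy(ctx);
    }

    char _license[] SEC("license") = "GPL";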

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Andrii Nakryiko
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20200121005348.2769920-2-ast@kernel.org

    Alexei Starovoitov
     

22 Jan, 2020

3 commits

  • Restore the 'if (env->cur_state)' check that was incorrectly removed during
    code move. Under memory pressure env->cur_state can be freed and zeroed inside
    do_check(). Hence the check is necessary.

    Fixes: 51c39bb1d5d1 ("bpf: Introduce function-by-function verification")
    Reported-by: syzbot+b296579ba5015704d9fa@syzkaller.appspotmail.com
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200122024138.3385590-1-ast@kernel.org

    Alexei Starovoitov
     
  • Though the second half of the trampoline page is unused, a task could
    be preempted in the middle of the first half of the trampoline, and two
    updates to the trampoline would change the code from underneath the
    preempted task. Hence wait for tasks to voluntarily schedule or go
    to userspace. Add a similar wait before freeing the trampoline.

    Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
    Reported-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Paul E. McKenney
    Link: https://lore.kernel.org/bpf/20200121032231.3292185-1-ast@kernel.org

    Alexei Starovoitov
     
  • kernel/bpf/inode.c misuses kern_path...() - it's much simpler (and
    more efficient, on top of that) to use user_path...() counterparts
    rather than bothering with doing getname() manually.

    Signed-off-by: Al Viro
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200120232858.GF8904@ZenIV.linux.org.uk

    Al Viro
     

21 Jan, 2020

1 commit

  • Generic update/delete batch ops functions were using __bpf_copy_key
    without properly freeing the memory. Handle the memory allocation and
    copy_from_user separately.

    Fixes: aa2e93b8e58e ("bpf: Add generic support for update and delete batch ops")
    Reported-by: Dan Carpenter
    Signed-off-by: Brian Vazquez
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200119194040.128369-1-brianvv@google.com

    Brian Vazquez
     

20 Jan, 2020

1 commit


17 Jan, 2020

4 commits

  • kernel/bpf/syscall.c: In function generic_map_lookup_batch:
    kernel/bpf/syscall.c:1339:7: warning: variable first_key set but not used [-Wunused-but-set-variable]

    It is never used, so remove it.

    Reported-by: Hulk Robot
    Signed-off-by: YueHaibing
    Signed-off-by: Alexei Starovoitov
    Acked-by: Brian Vazquez
    Link: https://lore.kernel.org/bpf/20200116145300.59056-1-yuehaibing@huawei.com

    YueHaibing
     
  • Now that we don't have a reference to a devmap when flushing the device
    bulk queue, let's change the devmap_xmit tracepoint to remove the
    map_id and map_index fields entirely. Rearrange the fields so 'drops' and
    'sent' stay in the same position in the tracepoint struct, to make it
    possible for the xdp_monitor utility to read both the old and the new
    format.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/157918768613.1458396.9165902403373826572.stgit@toke.dk

    Jesper Dangaard Brouer
     
  • Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
    we can re-use the bulking for the non-map version of the bpf_redirect()
    helper. This is a simple matter of having xdp_do_redirect_slow() queue the
    frame on the bulk queue instead of sending it out with __bpf_tx_xdp().

    Unfortunately we can't make the bpf_redirect() helper return an error if
    the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
    have a reference to the network namespace of the ingress device at the time
    the helper is called. So we have to leave it as-is and keep the device
    lookup in xdp_do_redirect_slow().

    Since this leaves less reason to have the non-map redirect code in a
    separate function, we get rid of the xdp_do_redirect_slow() function
    entirely. This does lose us the tracepoint disambiguation, but fortunately
    the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
    entry structures. This means both can contain a map index, so we can just
    amend the tracepoint definitions so we always emit the xdp_redirect(_err)
    tracepoints, but with the map ID only populated if a map is present. This
    means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
    the definitions around in case someone is still listening for them.

    With this change, the performance of the xdp_redirect sample program goes
    from 5Mpps to 8.4Mpps (a 68% increase).

    Since the flush functions are no longer map-specific, rename the flush()
    functions to drop _map from their names. One of the renamed functions is
    the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
    keep from having to update all drivers, use a #define to keep the old name
    working, and only update the virtual drivers in this patch.
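
    The compatibility shim described above amounts to roughly the following
    (a sketch of the idea rather than the exact diff):

    void xdp_do_flush(void);              /* new, map-agnostic name */
    #define xdp_do_flush_map xdp_do_flush /* keep existing drivers building */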

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • Commit 96360004b862 ("xdp: Make devmap flush_list common for all map
    instances"), changed devmap flushing to be a global operation instead of a
    per-map operation. However, the queue structure used for bulking was still
    allocated as part of the containing map.

    This patch moves the devmap bulk queue into struct net_device. The
    motivation for this is reusing it for the non-map variant of XDP_REDIRECT,
    which will be changed in a subsequent commit. To avoid other fields of
    struct net_device moving to different cache lines, we also move a couple of
    other members around.

    We defer the actual allocation of the bulk queue structure until the
    NETDEV_REGISTER notification in devmap.c. This makes it possible to check for
    ndo_xdp_xmit support before allocating the structure, which is not possible
    at the time struct net_device is allocated. However, we keep the freeing in
    free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER.

    Because of this change, we lose the reference back to the map that
    originated the redirect, so change the tracepoint to always return 0 as the
    map ID and index. Otherwise no functional change is intended with this
    patch.

    After this patch, the relevant part of struct net_device looks like this,
    according to pahole:

    /* --- cacheline 14 boundary (896 bytes) --- */
    struct netdev_queue * _tx __attribute__((__aligned__(64))); /* 896 8 */
    unsigned int num_tx_queues; /* 904 4 */
    unsigned int real_num_tx_queues; /* 908 4 */
    struct Qdisc * qdisc; /* 912 8 */
    unsigned int tx_queue_len; /* 920 4 */
    spinlock_t tx_global_lock; /* 924 4 */
    struct xdp_dev_bulk_queue * xdp_bulkq; /* 928 8 */
    struct xps_dev_maps * xps_cpus_map; /* 936 8 */
    struct xps_dev_maps * xps_rxqs_map; /* 944 8 */
    struct mini_Qdisc * miniq_egress; /* 952 8 */
    /* --- cacheline 15 boundary (960 bytes) --- */
    struct hlist_head qdisc_hash[16]; /* 960 128 */
    /* --- cacheline 17 boundary (1088 bytes) --- */
    struct timer_list watchdog_timer; /* 1088 40 */

    /* XXX last struct has 4 bytes of padding */

    int watchdog_timeo; /* 1128 4 */

    /* XXX 4 bytes hole, try to pack */

    struct list_head todo_list; /* 1136 16 */
    /* --- cacheline 18 boundary (1152 bytes) --- */

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Björn Töpel
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/157918768397.1458396.12673224324627072349.stgit@toke.dk

    Toke Høiland-Jørgensen
     

16 Jan, 2020

6 commits

  • htab can't use the generic batch support due to some problematic
    behaviours inherent to the data structure, i.e. while iterating the bpf
    map a concurrent program might delete the next entry that the batch was
    about to use; in that case there's no easy solution to retrieve the
    next entry. The issue has been discussed multiple times (see [1] and [2]).

    The only way hmap can be traversed without the problem previously
    exposed is by making sure that the map is traversed in entire buckets.
    This commit implements those strict requirements for hmap; the
    implementation follows the same interaction as the generic support with
    some exceptions:

    - If the keys/values buffers are not big enough to traverse a bucket,
    ENOSPC will be returned.
    - out_batch contains the value of the next bucket in the iteration, not
    the next key, but this is transparent for the user since the user
    should never use out_batch for anything other than bpf batch syscalls.

    This commit implements BPF_MAP_LOOKUP_BATCH and adds support for the new
    command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
    batch ops it is possible to use the generic implementations.

    [1] https://lore.kernel.org/bpf/20190724165803.87470-1-brianvv@google.com/
    [2] https://lore.kernel.org/bpf/20190906225434.3635421-1-yhs@fb.com/

    Signed-off-by: Yonghong Song
    Signed-off-by: Brian Vazquez
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200115184308.162644-6-brianvv@google.com

    Yonghong Song
     
  • This adds the generic batch ops functionality to bpf arraymap; note that
    since deletion is not a valid operation for arraymap, only the lookup and
    update batch ops are added.

    Signed-off-by: Brian Vazquez
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200115184308.162644-5-brianvv@google.com

    Brian Vazquez
     
  • This commit adds generic support for update and delete batch ops that
    can be used for almost all the bpf maps. These commands share the same
    UAPI attr that lookup and lookup_and_delete batch ops use and the
    syscall commands are:

    BPF_MAP_UPDATE_BATCH
    BPF_MAP_DELETE_BATCH

    The main difference between update/delete and lookup batch ops is that
    for update/delete the keys/values must be specified by userspace and,
    because of that, neither in_batch nor out_batch is used.

    Suggested-by: Stanislav Fomichev
    Signed-off-by: Brian Vazquez
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200115184308.162644-4-brianvv@google.com

    Brian Vazquez
     
  • This commit introduces generic support for the bpf_map_lookup_batch.
    This implementation can be used by almost all the bpf maps since its core
    implementation is relying on the existing map_get_next_key and
    map_lookup_elem. The bpf syscall subcommand introduced is:

    BPF_MAP_LOOKUP_BATCH

    The UAPI attribute is:

    struct { /* struct used by BPF_MAP_*_BATCH commands */
            __aligned_u64 in_batch;  /* start batch,
                                      * NULL to start from beginning
                                      */
            __aligned_u64 out_batch; /* output: next start batch */
            __aligned_u64 keys;
            __aligned_u64 values;
            __u32 count;             /* input/output:
                                      * input: # of key/value
                                      * elements
                                      * output: # of filled elements
                                      */
            __u32 map_fd;
            __u64 elem_flags;
            __u64 flags;
    } batch;

    in_batch/out_batch are opaque values used to communicate between
    user/kernel space; in_batch/out_batch must be of key_size length.

    To start iterating from the beginning, in_batch must be NULL;
    count is the # of key/value elements to retrieve. Note that the 'keys'
    buffer must be a buffer of key_size * count size and the 'values' buffer
    must be value_size * count, where value_size must be aligned to 8 bytes
    by userspace if it's dealing with percpu maps. 'count' will contain the
    number of keys/values successfully retrieved. Note that 'count' is an
    input/output variable and it can contain a lower value after a call.

    If there are no more entries to retrieve, ENOENT will be returned. If the error
    is ENOENT, count might be > 0 in case it copied some values but there were
    no more entries to retrieve.

    Note that if the return code is an error and not -EFAULT,
    count indicates the number of elements successfully processed.
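
    A hedged userspace sketch of driving the new command through the raw
    bpf(2) syscall, following the attribute layout above (the helper name
    is made up, error handling is minimal, and it assumes UAPI headers
    that already define BPF_MAP_LOOKUP_BATCH):

    #include <errno.h>
    #include <stdbool.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    /* Walk the whole map, fetching up to 'count' key/value pairs per call. */
    static int dump_map(int map_fd, __u32 key_size,
                        void *keys, void *values, __u32 count)
    {
            char cursor[key_size];  /* opaque in_batch/out_batch buffer, key_size bytes */
            bool first = true;

            for (;;) {
                    union bpf_attr attr;

                    memset(&attr, 0, sizeof(attr));
                    attr.batch.map_fd    = map_fd;
                    attr.batch.in_batch  = first ? 0 : (__u64)(unsigned long)cursor; /* NULL: start from the beginning */
                    attr.batch.out_batch = (__u64)(unsigned long)cursor;
                    attr.batch.keys      = (__u64)(unsigned long)keys;
                    attr.batch.values    = (__u64)(unsigned long)values;
                    attr.batch.count     = count; /* in: capacity, out: elements filled */

                    int err = syscall(__NR_bpf, BPF_MAP_LOOKUP_BATCH, &attr, sizeof(attr));

                    /* ... consume attr.batch.count entries from keys/values here ... */
                    if (err)
                            return errno == ENOENT ? 0 : -errno; /* ENOENT: iteration finished */
                    first = false;
            }
    }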

    Suggested-by: Stanislav Fomichev
    Signed-off-by: Brian Vazquez
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200115184308.162644-3-brianvv@google.com

    Brian Vazquez
     
  • This commit moves reusable code from map_lookup_elem and map_update_elem
    to avoid code duplication in kernel/bpf/syscall.c.

    Signed-off-by: Brian Vazquez
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200115184308.162644-2-brianvv@google.com

    Brian Vazquez
     
  • Anatoly has been fuzzing with kBdysch harness and reported a hang in one
    of the outcomes:

    0: R1=ctx(id=0,off=0,imm=0) R10=fp0
    0: (85) call bpf_get_socket_cookie#46
    1: R0_w=invP(id=0) R10=fp0
    1: (57) r0 &= 808464432
    2: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
    2: (14) w0 -= 810299440
    3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
    3: (c4) w0 s>>= 1
    4: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
    4: (76) if w0 s>= 0x30303030 goto pc+216
    221: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
    221: (95) exit
    processed 6 insns (limit 1000000) [...]

    Taking a closer look, the program was xlated as follows:

    # ./bpftool p d x i 12
    0: (85) call bpf_get_socket_cookie#7800896
    1: (bf) r6 = r0
    2: (57) r6 &= 808464432
    3: (14) w6 -= 810299440
    4: (c4) w6 s>>= 1
    5: (76) if w6 s>= 0x30303030 goto pc+216
    6: (05) goto pc-1
    7: (05) goto pc-1
    8: (05) goto pc-1
    [...]
    220: (05) goto pc-1
    221: (05) goto pc-1
    222: (95) exit

    Meaning, the visible effect is very similar to f54c7898ed1c ("bpf: Fix
    precision tracking for unbounded scalars"), that is, the fall-through
    branch in the instruction 5 is considered to be never taken given the
    conclusion from the min/max bounds tracking in w6, and therefore the
    dead-code sanitation rewrites it as goto pc-1. However, real-life input
    disagrees with verification analysis since a soft-lockup was observed.

    The bug sits in the analysis of the ARSH. The definition is that we shift
    the target register value right by K bits through shifting in copies of
    its sign bit. In adjust_scalar_min_max_vals(), we first coerce the
    register into 32 bit mode; the same happens after simulating the operation.
    However, for the case of simulating the actual ARSH, we don't take the
    mode into account and act as if it's always 64 bit, but the location of the
    sign bit is different:

    dst_reg->smin_value >>= umin_val;
    dst_reg->smax_value >>= umin_val;
    dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);

    Consider an unknown R0 where bpf_get_socket_cookie() (or others) would
    for example return 0xffff. With the above ARSH simulation, we'd see the
    following results:

    [...]
    1: R1=ctx(id=0,off=0,imm=0) R2_w=invP65535 R10=fp0
    1: (85) call bpf_get_socket_cookie#46
    2: R0_w=invP(id=0) R10=fp0
    2: (57) r0 &= 808464432
    -> R0_runtime = 0x3030
    3: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
    3: (14) w0 -= 810299440
    -> R0_runtime = 0xcfb40000
    4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
    (0xffffffff)
    4: (c4) w0 s>>= 1
    -> R0_runtime = 0xe7da0000
    5: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
    (0x67c00000) (0x7ffbfff8)
    [...]

    In insn 3, we have a runtime value of 0xcfb40000, which is '1100 1111 1011
    0100 0000 0000 0000 0000', and the result after the shift is 0xe7da0000, that
    is '1110 0111 1101 1010 0000 0000 0000 0000', where the sign bit is correctly
    retained in 32 bit mode. In insn 4, the umax was 0xffffffff, and changed into
    0x7ffbfff8 after the shift, that is, '0111 1111 1111 1011 1111 1111 1111 1000',
    which means here that the simulation didn't retain the sign bit. With the above
    logic, the updates happen on the 64 bit min/max bounds and given we coerced
    the register, the sign bits of the bounds are cleared as well, meaning we
    need to force the simulation into s32 space for 32 bit alu mode.

    Verification after the fix below. We're first analyzing the fall-through branch
    on 32 bit signed >= test eventually leading to rejection of the program in this
    specific case:

    0: R1=ctx(id=0,off=0,imm=0) R10=fp0
    0: (b7) r2 = 808464432
    1: R1=ctx(id=0,off=0,imm=0) R2_w=invP808464432 R10=fp0
    1: (85) call bpf_get_socket_cookie#46
    2: R0_w=invP(id=0) R10=fp0
    2: (bf) r6 = r0
    3: R0_w=invP(id=0) R6_w=invP(id=0) R10=fp0
    3: (57) r6 &= 808464432
    4: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
    4: (14) w6 -= 810299440
    5: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
    5: (c4) w6 s>>= 1
    6: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
    (0x67c00000) (0xfffbfff8)
    6: (76) if w6 s>= 0x30303030 goto pc+216
    7: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
    7: (30) r0 = *(u8 *)skb[808464432]
    BPF_LD_[ABS|IND] uses reserved fields
    processed 8 insns (limit 1000000) [...]

    Fixes: 9cbe1f5a32dc ("bpf/verifier: improve register value range tracking with ARSH")
    Reported-by: Anatoly Trosinenko
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200115204733.16648-1-daniel@iogearbox.net

    Daniel Borkmann
     

15 Jan, 2020

1 commit

  • Instead of using bpf_struct_ops_map_lookup_elem() which is
    not implemented, bpf_struct_ops_map_seq_show_elem() should
    also use bpf_struct_ops_map_sys_lookup_elem() which does
    an inplace update to the value. The change allocates
    a value to pass to bpf_struct_ops_map_sys_lookup_elem().

    [root@arch-fb-vm1 bpf]# cat /sys/fs/bpf/dctcp
    {{{1}},BPF_STRUCT_OPS_STATE_INUSE,{{00000000df93eebc,00000000df93eebc},0,2, ...

    Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
    Reported-by: Dan Carpenter
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200114072647.3188298-1-kafai@fb.com

    Martin KaFai Lau
     

11 Jan, 2020

1 commit

  • New llvm, and old llvm with libbpf help, produce BTF that distinguishes global and
    static functions. Unlike the arguments of a static function, the arguments of global
    functions cannot be removed or optimized away by llvm. The compiler has to use
    exactly the arguments specified in the function prototype. The argument type
    information allows the verifier to validate each global function independently.
    For now the only supported argument types are pointer to context and scalars. In
    the future pointers to structures, sizes, pointer to packet data can be
    supported as well. Consider the following example:

    static int f1(int ...)
    {
    ...
    }

    int f3(int b);

    int f2(int a)
    {
    f1(a) + f3(a);
    }

    int f3(int b)
    {
    ...
    }

    int main(...)
    {
    f1(...) + f2(...) + f3(...);
    }

    The verifier will start its safety checks from the first global function f2().
    It will recursively descend into f1() because it's static. Then it will check
    that arguments match for the f3() invocation inside f2(). It will not descend
    into f3(). It will finish f2() that has to be successfully verified for all
    possible values of 'a'. Then it will proceed with f3(). That function also has
    to be safe for all possible values of 'b'. Then it will start subprog 0 (which
    is main() function). It will recursively descend into f1() and will skip full
    check of f2() and f3(), since they are global. The order of processing global
    functions doesn't affect safety, since all global functions must be proven safe
    based on their arguments only.

    Such function by function verification can drastically improve speed of the
    verification and reduce complexity.

    Note that the stack limit of 512 still applies to the call chain regardless of
    whether the functions were static or global. The nesting level of 8 also still applies. The
    same recursion prevention checks are in place as well.

    The type information and static/global kind is preserved after the verification,
    hence in the above example the global functions f2() and f3() can later be replaced
    by equivalent functions with the same types that are loaded and verified separately,
    without affecting the safety of this main() program. Such replacement (re-linking)
    of global functions is a subject of future patches.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org

    Alexei Starovoitov
     

10 Jan, 2020

8 commits

  • The ungrafting from PRIO bug fixes in net, when merged into net-next,
    merge cleanly but create a build failure. The resolution used here is
    from Petr Machata.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch makes "struct tcp_congestion_ops" the first user
    of BPF STRUCT_OPS. It allows implementing a tcp_congestion_ops
    in bpf.

    The BPF-implemented tcp_congestion_ops can be used like a
    regular kernel tcp-cc through sysctl and setsockopt, e.g.
    [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
    net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
    net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
    net.ipv4.tcp_congestion_control = bpf_cubic

    There have been attempts to move the TCP CC to user space
    (e.g. CCP in TCP). The common arguments are a faster turnaround and
    getting away from long-tail kernel versions in production, etc.,
    which are legit points.

    BPF has been a continuous effort to join the upsides of both kernel
    and userspace together (e.g. XDP to gain the performance
    advantage without bypassing the kernel). The recent BPF
    advancements (in particular the BTF-aware verifier, BPF trampoline,
    BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
    possible in BPF. It allows a faster turnaround for testing algorithms
    in production while leveraging the existing (and continuously growing)
    BPF feature/framework instead of building one specifically for
    userspace TCP CC.

    This patch allows write access to a few fields in tcp-sock
    (in bpf_tcp_ca_btf_struct_access()).

    The optional "get_info" is unsupported now. It can be added
    later. One possible way is to output the info with a btf-id
    to describe the content.
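
    For instance, once the bpf_cubic struct_ops is registered, a socket can
    opt into it with the regular setsockopt(2) knob (plain userspace sketch):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int use_bpf_cubic(int sock_fd)
    {
            static const char name[] = "bpf_cubic"; /* name the BPF CC registered under */

            /* the same interface used for built-in CCs like "reno" or "cubic" */
            return setsockopt(sock_fd, IPPROTO_TCP, TCP_CONGESTION,
                              name, sizeof(name) - 1);
    }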

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com

    Martin KaFai Lau
     
  • The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value
    is a kernel struct with its func ptr implemented in bpf prog.
    This new map is the interface to register/unregister/introspect
    a bpf implemented kernel struct.

    The kernel struct is actually embedded inside another new struct
    (or called the "value" struct in the code). For example,
    "struct tcp_congestion_ops" is embedded in:

        struct bpf_struct_ops_tcp_congestion_ops {
                refcount_t refcnt;
                enum bpf_struct_ops_state state;
                struct tcp_congestion_ops data; /* the embedded kernel struct */
        };

    The map value's "state" transitions as:
    INUSE (map updated, i.e. reg) => TOBEFREE (map value deleted, i.e. unreg).

    The kernel subsystem needs to call bpf_struct_ops_get() and
    bpf_struct_ops_put() to manage the "refcnt" in the
    "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt
    for the purose of tracking the subsystem usage. Another approach
    is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
    the subsystem's usage by doing map->refcnt - map->usercnt to filter out
    the map-fd/pinned-map usage. However, that will also tie down the
    future semantics of map->refcnt and map->usercnt.

    The very first subsystem's refcnt (during reg()) holds one
    count to map->refcnt. When the very last subsystem's refcnt
    is gone, it will also release the map->refcnt. All bpf_prog will be
    freed when the map->refcnt reaches 0 (i.e. during map_free()).

    Here is how the bpftool map command will look like:
    [root@arch-fb-vm1 bpf]# bpftool map show
    6: struct_ops name dctcp flags 0x0
    key 4B value 256B max_entries 1 memlock 4096B
    btf_id 6
    [root@arch-fb-vm1 bpf]# bpftool map dump id 6
    [{
    "value": {
    "refcnt": {
    "refs": {
    "counter": 1
    }
    },
    "state": 1,
    "data": {
    "list": {
    "next": 0,
    "prev": 0
    },
    "key": 0,
    "flags": 2,
    "init": 24,
    "release": 0,
    "ssthresh": 25,
    "cong_avoid": 30,
    "set_state": 27,
    "cwnd_event": 28,
    "in_ack_event": 26,
    "undo_cwnd": 29,
    "pkts_acked": 0,
    "min_tso_segs": 0,
    "sndbuf_expand": 0,
    "cong_control": 0,
    "get_info": 0,
    "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
    ],
    "owner": 0
    }
    }
    }
    ]

    Misc Notes:
    * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
    It does an inplace update on "*value" instead of returning a pointer
    to syscall.c. Otherwise, it needs a separate copy of the "zero" value
    for the BPF_STRUCT_OPS_STATE_INIT to avoid races.

    * The bpf_struct_ops_map_delete_elem() is also called without
    preempt_disable() from map_delete_elem(). This is because
    the "->unreg()" may require a sleepable context, e.g.
    the "tcp_unregister_congestion_control()".

    * "const" is added to some of the existing "struct btf_func_model *"
    function arg to avoid a compiler warning caused by this patch.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com

    Martin KaFai Lau
     
  • This patch allows the kernel's struct ops (i.e. func ptr) to be
    implemented in BPF. The first use case in this series is the
    "struct tcp_congestion_ops" which will be introduced in a
    latter patch.

    This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
    The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
    func ptr of a kernel struct. The attr->attach_btf_id is the btf id
    of a kernel struct. The attr->expected_attach_type is the member
    "index" of that kernel struct. The first member of a struct starts
    with member index 0. That will avoid ambiguity when a kernel struct
    has multiple func ptrs with the same func signature.

    For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
    to implement the "init" func ptr of the "struct tcp_congestion_ops".
    The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
    of the _running_ kernel. The attr->expected_attach_type is 3.

    The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
    by arch_prepare_bpf_trampoline that will be done in the next
    patch when introducing BPF_MAP_TYPE_STRUCT_OPS.

    "struct bpf_struct_ops" is introduced as a common interface for the kernel
    struct that supports BPF_PROG_TYPE_STRUCT_OPS prog. The supporting kernel
    struct will need to implement an instance of the "struct bpf_struct_ops".

    The supporting kernel struct also needs to implement a bpf_verifier_ops.
    During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
    bpf_verifier_ops by searching the attr->attach_btf_id.

    A new "btf_struct_access" is also added to the bpf_verifier_ops such
    that the supporting kernel struct can optionally provide its own specific
    check on accessing the func arg (e.g. provide limited write access).

    After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
    to initialize some values (e.g. the btf id of the supporting kernel
    struct) and it can only be done once the btf_vmlinux is available.

    The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
    if the return type of the prog->aux->attach_func_proto is "void".

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003503.3855825-1-kafai@fb.com

    Martin KaFai Lau
     
  • This patch allows bitfield access as a scalar.

    It checks "off + size > t->size" to avoid accessing bitfield
    end up accessing beyond the struct. This check is done
    outside of the loop since it is applicable to all access.

    It also takes this chance to break early on the "off < moff" case.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003501.3855427-1-kafai@fb.com

    Martin KaFai Lau
     
  • It allows a bpf prog (e.g. tracing) to attach
    to a kernel function that takes an enum argument.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003459.3855366-1-kafai@fb.com

    Martin KaFai Lau
     
  • info->btf_id expects the btf_id of a struct, so it should
    store the final result after skipping modifiers (if any).

    It also takes this chance to add a missing newline in one of the
    bpf_log() messages.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003456.3855176-1-kafai@fb.com

    Martin KaFai Lau
     
  • This patch makes the verifier save the PTR_TO_BTF_ID register state when
    spilling to the stack.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003454.3854870-1-kafai@fb.com

    Martin KaFai Lau
     

07 Jan, 2020

2 commits

  • Anatoly has been fuzzing with kBdysch harness and reported a KASAN
    slab oob in one of the outcomes:

    [...]
    [ 77.359642] BUG: KASAN: slab-out-of-bounds in bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.360463] Read of size 4 at addr ffff8880679bac68 by task bpf/406
    [ 77.361119]
    [ 77.361289] CPU: 2 PID: 406 Comm: bpf Not tainted 5.5.0-rc2-xfstests-00157-g2187f215eba #1
    [ 77.362134] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    [ 77.362984] Call Trace:
    [ 77.363249] dump_stack+0x97/0xe0
    [ 77.363603] print_address_description.constprop.0+0x1d/0x220
    [ 77.364251] ? bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.365030] ? bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.365860] __kasan_report.cold+0x37/0x7b
    [ 77.366365] ? bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.366940] kasan_report+0xe/0x20
    [ 77.367295] bpf_skb_load_helper_8_no_cache+0x71/0x130
    [ 77.367821] ? bpf_skb_load_helper_8+0xf0/0xf0
    [ 77.368278] ? mark_lock+0xa3/0x9b0
    [ 77.368641] ? kvm_sched_clock_read+0x14/0x30
    [ 77.369096] ? sched_clock+0x5/0x10
    [ 77.369460] ? sched_clock_cpu+0x18/0x110
    [ 77.369876] ? bpf_skb_load_helper_8+0xf0/0xf0
    [ 77.370330] ___bpf_prog_run+0x16c0/0x28f0
    [ 77.370755] __bpf_prog_run32+0x83/0xc0
    [ 77.371153] ? __bpf_prog_run64+0xc0/0xc0
    [ 77.371568] ? match_held_lock+0x1b/0x230
    [ 77.371984] ? rcu_read_lock_held+0xa1/0xb0
    [ 77.372416] ? rcu_is_watching+0x34/0x50
    [ 77.372826] sk_filter_trim_cap+0x17c/0x4d0
    [ 77.373259] ? sock_kzfree_s+0x40/0x40
    [ 77.373648] ? __get_filter+0x150/0x150
    [ 77.374059] ? skb_copy_datagram_from_iter+0x80/0x280
    [ 77.374581] ? do_raw_spin_unlock+0xa5/0x140
    [ 77.375025] unix_dgram_sendmsg+0x33a/0xa70
    [ 77.375459] ? do_raw_spin_lock+0x1d0/0x1d0
    [ 77.375893] ? unix_peer_get+0xa0/0xa0
    [ 77.376287] ? __fget_light+0xa4/0xf0
    [ 77.376670] __sys_sendto+0x265/0x280
    [ 77.377056] ? __ia32_sys_getpeername+0x50/0x50
    [ 77.377523] ? lock_downgrade+0x350/0x350
    [ 77.377940] ? __sys_setsockopt+0x2a6/0x2c0
    [ 77.378374] ? sock_read_iter+0x240/0x240
    [ 77.378789] ? __sys_socketpair+0x22a/0x300
    [ 77.379221] ? __ia32_sys_socket+0x50/0x50
    [ 77.379649] ? mark_held_locks+0x1d/0x90
    [ 77.380059] ? trace_hardirqs_on_thunk+0x1a/0x1c
    [ 77.380536] __x64_sys_sendto+0x74/0x90
    [ 77.380938] do_syscall_64+0x68/0x2a0
    [ 77.381324] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 77.381878] RIP: 0033:0x44c070
    [...]

    After further debugging, it turns out that while for other helper functions
    we disallow passing a modified ctx, the special case of the ld/abs/ind instruction,
    which has similar semantics (except r6 being the ctx argument), is missing
    such a check. A modified ctx is impossible here as bpf_skb_load_helper_8_no_cache()
    and others expect skb fields in their original position, hence, add
    check_ctx_reg() to reject any modified ctx. The issue was first introduced back
    in f1174f77b50c ("bpf/verifier: rework value tracking").

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Reported-by: Anatoly Trosinenko
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200106215157.3553-1-daniel@iogearbox.net

    Daniel Borkmann
     
  • Before commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
    cgroup bpf structures were released with
    corresponding cgroup structures. It guaranteed the hierarchical order
    of destruction: children were always first. It preserved attached
    programs from being released before their propagated copies.

    But with cgroup auto-detachment there are no such guarantees anymore:
    cgroup bpf is released as soon as the cgroup is offline and there are
    no live associated sockets. It means that an attached program can be
    detached and released, while its propagated copy is still living
    in the cgroup subtree. This will obviously lead to a use-after-free
    bug.

    To reproduce the issue the following script can be used:

    #!/bin/bash

    CGROOT=/sys/fs/cgroup

    mkdir -p ${CGROOT}/A ${CGROOT}/B ${CGROOT}/A/C
    sleep 1

    ./test_cgrp2_attach ${CGROOT}/A egress &
    A_PID=$!
    ./test_cgrp2_attach ${CGROOT}/B egress &
    B_PID=$!

    echo $$ > ${CGROOT}/A/C/cgroup.procs
    iperf -s &
    S_PID=$!
    iperf -c localhost -t 100 &
    C_PID=$!

    sleep 1

    echo $$ > ${CGROOT}/B/cgroup.procs
    echo ${S_PID} > ${CGROOT}/B/cgroup.procs
    echo ${C_PID} > ${CGROOT}/B/cgroup.procs

    sleep 1

    rmdir ${CGROOT}/A/C
    rmdir ${CGROOT}/A

    sleep 1

    kill -9 ${S_PID} ${C_PID} ${A_PID} ${B_PID}

    On the unpatched kernel the following stacktrace can be obtained:

    [ 33.619799] BUG: unable to handle page fault for address: ffffbdb4801ab002
    [ 33.620677] #PF: supervisor read access in kernel mode
    [ 33.621293] #PF: error_code(0x0000) - not-present page
    [ 33.622754] Oops: 0000 [#1] SMP NOPTI
    [ 33.623202] CPU: 0 PID: 601 Comm: iperf Not tainted 5.5.0-rc2+ #23
    [ 33.625545] RIP: 0010:__cgroup_bpf_run_filter_skb+0x29f/0x3d0
    [ 33.635809] Call Trace:
    [ 33.636118] ? __cgroup_bpf_run_filter_skb+0x2bf/0x3d0
    [ 33.636728] ? __switch_to_asm+0x40/0x70
    [ 33.637196] ip_finish_output+0x68/0xa0
    [ 33.637654] ip_output+0x76/0xf0
    [ 33.638046] ? __ip_finish_output+0x1c0/0x1c0
    [ 33.638576] __ip_queue_xmit+0x157/0x410
    [ 33.639049] __tcp_transmit_skb+0x535/0xaf0
    [ 33.639557] tcp_write_xmit+0x378/0x1190
    [ 33.640049] ? _copy_from_iter_full+0x8d/0x260
    [ 33.640592] tcp_sendmsg_locked+0x2a2/0xdc0
    [ 33.641098] ? sock_has_perm+0x10/0xa0
    [ 33.641574] tcp_sendmsg+0x28/0x40
    [ 33.641985] sock_sendmsg+0x57/0x60
    [ 33.642411] sock_write_iter+0x97/0x100
    [ 33.642876] new_sync_write+0x1b6/0x1d0
    [ 33.643339] vfs_write+0xb6/0x1a0
    [ 33.643752] ksys_write+0xa7/0xe0
    [ 33.644156] do_syscall_64+0x5b/0x1b0
    [ 33.644605] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fix this by grabbing a reference to the bpf structure of each ancestor
    on the initialization of the cgroup bpf structure, and dropping the
    reference at the end of releasing the cgroup bpf structure.

    This will restore the hierarchical order of cgroup bpf releasing,
    without adding any operations on hot paths.

    Thanks to Josef Bacik for the debugging and the initial analysis of
    the problem.

    Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
    Reported-by: Josef Bacik
    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     

01 Jan, 2020

1 commit


28 Dec, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2019-12-27

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 127 non-merge commits during the last 17 day(s) which contain
    a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).

    There are three merge conflicts. Conflicts and resolution looks as follows:

    1) Merge conflict in net/bpf/test_run.c:

    There was a tree-wide cleanup c593642c8be0 ("treewide: Use sizeof_field() macro")
    which gets in the way with b590cb5f802d ("bpf: Switch to offsetofend in
    BPF_PROG_TEST_RUN"):

    <<<<<<< HEAD
    if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
    sizeof_field(struct __sk_buff, priority),
    =======
    if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    There are a few occasions that look similar to this. Always take the chunk with
    offsetofend(). Note that there is one where the fields differ in here:

    <<<<<<< HEAD
    if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
    sizeof_field(struct __sk_buff, tstamp),
    =======
    if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
    850a88cc4096 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").

    2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:

    (I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)

    <<<<<<< HEAD
    if (is_13b_check(off, insn))
    return -1;
    emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
    =======
    emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    Result should look like:

    emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);

    3) Merge conflict in arch/riscv/include/asm/pgtable.h:

    <<<<<<< HEAD
    =======
    #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
    #define VMALLOC_END (PAGE_OFFSET - 1)
    #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)

    #define BPF_JIT_REGION_SIZE (SZ_128M)
    #define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
    #define BPF_JIT_REGION_END (VMALLOC_END)

    /*
    * Roughly size the vmemmap space to be large enough to fit enough
    * struct pages to map half the virtual address space. Then
    * position vmemmap directly below the VMALLOC region.
    */
    #define VMEMMAP_SHIFT \
    (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
    #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
    #define VMEMMAP_END (VMALLOC_START - 1)
    #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)

    #define vmemmap ((struct page *)VMEMMAP_START)

    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    Only take the BPF_* defines from there and move them higher up in the
    same file. Remove the rest from the chunk. The VMALLOC_* etc defines
    got moved via 01f52e16b868 ("riscv: define vmemmap before pfn_to_page
    calls"). Result:

    [...]
    #define __S101 PAGE_READ_EXEC
    #define __S110 PAGE_SHARED_EXEC
    #define __S111 PAGE_SHARED_EXEC

    #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
    #define VMALLOC_END (PAGE_OFFSET - 1)
    #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)

    #define BPF_JIT_REGION_SIZE (SZ_128M)
    #define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
    #define BPF_JIT_REGION_END (VMALLOC_END)

    /*
    * Roughly size the vmemmap space to be large enough to fit enough
    * struct pages to map half the virtual address space. Then
    * position vmemmap directly below the VMALLOC region.
    */
    #define VMEMMAP_SHIFT \
    (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
    #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
    #define VMEMMAP_END (VMALLOC_START - 1)
    #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)

    [...]

    Let me know if there are any other issues.

    Anyway, the main changes are:

    1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
    to a provided BPF object file. This provides an alternative, simplified API
    compared to standard libbpf interaction. Also, add libbpf extern variable
    resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.

    2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
    generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
    add various BPF riscv JIT improvements, from Björn Töpel.

    3) Extend bpftool to allow matching BPF programs and maps by name,
    from Paul Chaignon.

    4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
    flag for allowing updates without service interruption, from Andrey Ignatov.

    5) Cleanup and simplification of ring access functions for AF_XDP with a
    bonus of 0-5% performance improvement, from Magnus Karlsson.

    6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
    audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.

    7) Move and extend test_select_reuseport into BPF program tests under
    BPF selftests, from Jakub Sitnicki.

    8) Various BPF sample improvements for xdpsock for customizing parameters
    to set up and benchmark AF_XDP, from Jay Jayatheerthan.

    9) Improve libbpf to provide a ulimit hint on permission denied errors.
    Also change XDP sample programs to attach in driver mode by default,
    from Toke Høiland-Jørgensen.

    10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
    programs, from Nikita V. Shirokov.

    11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.

    12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
    libbpf conversion, from Jesper Dangaard Brouer.

    13) Minor misc improvements from various others.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Dec, 2019

1 commit

  • Anatoly has been fuzzing with kBdysch harness and reported a hang in one
    of the outcomes. Upon closer analysis, it turns out that precise scalar
    value tracking is missing a few precision markings for unknown scalars:

    0: R1=ctx(id=0,off=0,imm=0) R10=fp0
    0: (b7) r0 = 0
    1: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    1: (35) if r0 >= 0xf72e goto pc+0
    --> only follow fallthrough
    2: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    2: (35) if r0 >= 0x80fe0000 goto pc+0
    --> only follow fallthrough
    3: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    3: (14) w0 -= -536870912
    4: R0_w=invP536870912 R1=ctx(id=0,off=0,imm=0) R10=fp0
    4: (0f) r1 += r0
    5: R0_w=invP536870912 R1_w=inv(id=0) R10=fp0
    5: (55) if r1 != 0x104c1500 goto pc+0
    --> push other branch for later analysis
    R0_w=invP536870912 R1_w=inv273421568 R10=fp0
    6: R0_w=invP536870912 R1_w=inv273421568 R10=fp0
    6: (b7) r0 = 0
    7: R0=invP0 R1=inv273421568 R10=fp0
    7: (76) if w1 s>= 0xffffff00 goto pc+3
    --> only follow goto
    11: R0=invP0 R1=inv273421568 R10=fp0
    11: (95) exit
    6: R0_w=invP536870912 R1_w=inv(id=0) R10=fp0
    6: (b7) r0 = 0
    propagating r0
    7: safe
    processed 11 insns [...]

    In the analysis of the second path coming after the successful exit above,
    the path is being pruned at line 7. The pruning analysis found that both r0 are
    precise P0 and both R1 are non-precise scalars, and given the prior path with
    R1 as a non-precise scalar succeeded, this one is therefore safe as well.

    However, the problem is that given the condition at insn 7 in the first run, we
    only followed the goto and didn't push the other branch for later analysis; we've
    never walked the few insns in there and therefore dead-code sanitation
    rewrites it as goto pc-1, causing the hang depending on the skb address
    hitting these conditions. The issue is that R1 should have been marked as
    precise as well, such that pruning enforces a range check and concludes that
    the new R1 is not in range of the old R1. In insn 4, we mark R1 (skb) as unknown
    scalar via __mark_reg_unbounded() but not mark_reg_unbounded() and therefore
    regs->precise remains as false.

    Back in b5dc0163d8fd ("bpf: precise scalar_value tracking"), this was not
    the case since marking out of __mark_reg_unbounded() had this covered as well.
    Once both are set as precise in insn 4 as they should have been, the check at
    insn 7 concludes that, given R1 was 0x104c1500 in the prior fall-through path
    and is now completely unknown, we need to continue walking.
    Analysis after the fix:

    0: R1=ctx(id=0,off=0,imm=0) R10=fp0
    0: (b7) r0 = 0
    1: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    1: (35) if r0 >= 0xf72e goto pc+0
    2: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    2: (35) if r0 >= 0x80fe0000 goto pc+0
    3: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
    3: (14) w0 -= -536870912
    4: R0_w=invP536870912 R1=ctx(id=0,off=0,imm=0) R10=fp0
    4: (0f) r1 += r0
    5: R0_w=invP536870912 R1_w=invP(id=0) R10=fp0
    5: (55) if r1 != 0x104c1500 goto pc+0
    R0_w=invP536870912 R1_w=invP273421568 R10=fp0
    6: R0_w=invP536870912 R1_w=invP273421568 R10=fp0
    6: (b7) r0 = 0
    7: R0=invP0 R1=invP273421568 R10=fp0
    7: (76) if w1 s>= 0xffffff00 goto pc+3
    11: R0=invP0 R1=invP273421568 R10=fp0
    11: (95) exit
    6: R0_w=invP536870912 R1_w=invP(id=0) R10=fp0
    6: (b7) r0 = 0
    7: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    7: (76) if w1 s>= 0xffffff00 goto pc+3
    R0_w=invP0 R1_w=invP(id=0) R10=fp0
    8: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    8: (a5) if r0 < 0x2007002a goto pc+0
    9: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    9: (57) r0 &= -16316416
    10: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    10: (a6) if w0 < 0x1201 goto pc+0
    11: R0_w=invP0 R1_w=invP(id=0) R10=fp0
    11: (95) exit
    11: R0=invP0 R1=invP(id=0) R10=fp0
    11: (95) exit
    processed 16 insns [...]

    Fixes: 6754172c208d ("bpf: fix precision tracking in presence of bpf2bpf calls")
    Reported-by: Anatoly Trosinenko
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20191222223740.25297-1-daniel@iogearbox.net

    Daniel Borkmann