26 Jan, 2021

2 commits

  • Fold xp_assign_dev and __xp_assign_dev. The former directly calls the
    latter.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Maciej Fijalkowski
    Link: https://lore.kernel.org/bpf/20210122105351.11751-3-bjorn.topel@gmail.com

    Björn Töpel
     
  • The explicit_free parameter of the __xsk_rcv() function was used to
    mark whether the call was via the generic XDP or the native XDP
    path. Instead of cluttering the code with if-statements and
    "true/false" parameters that are hard to understand, simply move the
    explicit free to __xsk_map_redirect(), which is always called from
    the native XDP path.
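
    A sketch of the reshaped flow after the change (illustrative only;
    the exact upstream diff may differ in detail):

        /* __xsk_rcv() no longer takes an explicit_free parameter */
        static int __xsk_map_redirect(struct xdp_sock *xs,
                                      struct xdp_buff *xdp)
        {
            int err;

            err = __xsk_rcv(xs, xdp);
            if (err)
                return err;

            /* only the native path gets here, so free the buff
             * unconditionally instead of passing a flag down
             */
            xdp_return_buff(xdp);
            return 0;
        }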

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Maciej Fijalkowski
    Link: https://lore.kernel.org/bpf/20210122105351.11751-2-bjorn.topel@gmail.com

    Björn Töpel
     

20 Jan, 2021

1 commit

  • The number of queues can change by means other than ethtool. For
    example, attaching an mqprio qdisc with num_tc > 1 leads to creating
    multiple sets of TX queues, which may then be destroyed when mqprio is
    deleted. If an AF_XDP socket is created while mqprio is active,
    dev->_tx[queue_id].pool will be filled, but real_num_tx_queues may
    later decrease when mqprio is deleted, which means that the pool won't
    be NULLed, and a further increase of the number of TX queues may
    expose a dangling pointer.

    To avoid any potential misbehavior, this commit clears the pool for RX
    and TX queues, regardless of real_num_*_queues, while still taking
    num_*_queues into consideration to avoid overflows.
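
    A minimal sketch of the clearing logic after the fix (helper and
    field names assumed from the surrounding xsk code):

        static void xsk_clear_pool_at_qid(struct net_device *dev,
                                          u16 queue_id)
        {
            /* check against num_*_queues, not real_num_*_queues,
             * so pools of currently inactive queues are cleared too
             */
            if (queue_id < dev->num_rx_queues)
                dev->_rx[queue_id].pool = NULL;
            if (queue_id < dev->num_tx_queues)
                dev->_tx[queue_id].pool = NULL;
        }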

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Fixes: a41b4f3c58dd ("xsk: simplify xdp_clear_umem_at_qid implementation")
    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20210118160333.333439-1-maximmi@mellanox.com

    Maxim Mikityanskiy
     

18 Dec, 2020

3 commits

  • Roll back the reservation in the completion ring when we get a
    NETDEV_TX_BUSY. When this error is received from the driver, we are
    supposed to let the user application retry the transmit again. And in
    order to do this, we need to roll back the failed send so it can be
    retried. Unfortunately, we did not cancel the reservation we had made
    in the completion ring. By not doing this, we actually make the
    completion ring one entry smaller per NETDEV_TX_BUSY error we get, and
    after enough of these errors the completion ring will be of size zero
    and transmit will stop working.

    Fix this by cancelling the reservation when we get a NETDEV_TX_BUSY
    error.
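
    A sketch of the rollback, assuming a small producer-side cancel
    helper (the helper name is an assumption based on the description):

        static inline void xskq_prod_cancel(struct xsk_queue *q)
        {
            q->cached_prod--; /* undo the last xskq_prod_reserve() */
        }

        /* Tx error path: on NETDEV_TX_BUSY, give the reserved
         * completion-ring entry back before asking the app to retry.
         */
        if (err == NETDEV_TX_BUSY) {
            xskq_prod_cancel(xs->pool->cq);
            err = -EAGAIN;
        }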

    Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    Reported-by: Xuan Zhuo
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201218134525.13119-3-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • Fix a race when multiple sockets are simultaneously calling sendto()
    when the completion ring is shared in the SKB case. This is the case
    when you share the same netdev and queue id through the
    XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
    be in xsk_generic_xmit() and call the backpressure mechanism in
    xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
    specific scenario, a race might occur since the rings are
    single-producer single-consumer.

    Fix this by moving the tx_completion_lock from the socket to the pool
    as the pool is shared between the sockets that share the completion
    ring. (The pool is not shared when this is not the case.) And then
    protect the accesses to xskq_prod_reserve() with this lock. The
    tx_completion_lock is renamed cq_lock to better reflect that it
    protects accesses to the potentially shared completion ring.
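
    A sketch of the protected reservation in the generic Tx path (lock
    placement per the description; surrounding context abbreviated):

        unsigned long flags;

        spin_lock_irqsave(&xs->pool->cq_lock, flags);
        if (xskq_prod_reserve(xs->pool->cq)) {
            /* shared completion ring is full: apply backpressure */
            spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
            kfree_skb(skb);
            goto out;
        }
        spin_unlock_irqrestore(&xs->pool->cq_lock, flags);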

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Reported-by: Xuan Zhuo
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • Fix a possible memory leak when a bind of an AF_XDP socket fails. When
    the fill and completion rings are created, they are tied to the
    socket. But when the buffer pool is later created at bind time, the
    ownership of these two rings is transferred to the buffer pool, as
    they might be shared between sockets (and the buffer pool cannot be
    created until we know what we are binding to). So, before the buffer
    pool is created, these two rings are cleaned up with the socket, and
    after they have been transferred they are cleaned up together with
    the buffer pool.

    The problem is that ownership was transferred before it was absolutely
    certain that the buffer pool could be created and initialized
    correctly, and when one of these errors occurred, the fill and
    completion rings belonged neither to the socket nor to the pool and
    were therefore leaked. Solve this by moving the ownership transfer
    to the point where the buffer pool has been completely set up and
    there is no way it can fail.
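
    A sketch of the reordering in the bind code (field names are
    assumptions based on the description, not the exact diff):

        /* everything that can fail happens first; on error the
         * fq/cq rings are still owned and freed by the socket
         */
        err = xp_assign_dev(xs->pool, dev, qid, flags);
        if (err)
            goto out_unlock;

        /* setup can no longer fail: hand the rings to the pool */
        xs->pool->fq = xs->fq_tmp;
        xs->pool->cq = xs->cq_tmp;
        xs->fq_tmp = NULL;
        xs->cq_tmp = NULL;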

    Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool")
    Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com

    Magnus Karlsson
     

15 Dec, 2020

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-12-14

    1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest.

    2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua.

    3) Support for finding BTF based kernel attach targets through libbpf's
    bpf_program__set_attach_target() API, from Andrii Nakryiko.

    4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song.

    5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet.

    6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo.

    7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel.

    8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman.

    9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner,
    KP Singh, Jiri Olsa and various others.

    * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
    selftests/bpf: Add a test for ptr_to_map_value on stack for helper access
    bpf: Permits pointers on stack for helper calls
    libbpf: Expose libbpf ring_buffer epoll_fd
    selftests/bpf: Add set_attach_target() API selftest for module target
    libbpf: Support modules in bpf_program__set_attach_target() API
    selftests/bpf: Silence ima_setup.sh when not running in verbose mode.
    selftests/bpf: Drop the need for LLVM's llc
    selftests/bpf: fix bpf_testmod.ko recompilation logic
    samples/bpf: Fix possible hang in xdpsock with multiple threads
    selftests/bpf: Make selftest compilation work on clang 11
    selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
    selftests/bpf: Drop tcp-{client,server}.py from Makefile
    selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
    selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
    selftests/bpf: Xsk selftests - DRV POLL, NOPOLL
    selftests/bpf: Xsk selftests - SKB POLL, NOPOLL
    selftests/bpf: Xsk selftests framework
    bpf: Only provide bpf_sock_from_file with CONFIG_NET
    bpf: Return -ENOTSUPP when attaching to non-kernel BTF
    xsk: Validate socket state in xsk_recvmsg, prior touching socket members
    ...
    ====================

    Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

09 Dec, 2020

1 commit

  • In AF_XDP, the socket state needs to be checked prior to touching
    the members of the socket. This was not the case for the recvmsg
    implementation. Fix that by moving the xsk_is_bound() call.
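
    A sketch of the reordered checks (abbreviated; surrounding logic
    assumed):

        static int xsk_recvmsg(struct socket *sock, struct msghdr *m,
                               size_t len, int flags)
        {
            struct xdp_sock *xs = xdp_sk(sock->sk);

            /* validate the socket state first ... */
            if (unlikely(!xsk_is_bound(xs)))
                return -ENXIO;
            /* ... only then touch members such as xs->dev */
            if (unlikely(!(xs->dev->flags & IFF_UP)))
                return -ENETDOWN;
            /* ... */
        }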

    Fixes: 45a86681844e ("xsk: Add support for recvmsg()")
    Reported-by: kernel test robot
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20201207082008.132263-1-bjorn.topel@gmail.com

    Björn Töpel
     

04 Dec, 2020

2 commits

  • If force_zc is set, we should exit with an error, not fall back to
    copy mode.

    Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
    Reported-by: Hulk Robot
    Signed-off-by: Zhang Changzhong
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/1607077277-41995-1-git-send-email-zhangchangzhong@huawei.com

    Zhang Changzhong
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-12-03

    The main changes are:

    1) Support BTF in kernel modules, from Andrii.

    2) Introduce preferred busy-polling, from Björn.

    3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

    4) Memcg-based memory accounting for bpf objects, from Roman.

    5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

    * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
    selftests/bpf: Fix invalid use of strncat in test_sockmap
    libbpf: Use memcpy instead of strncpy to please GCC
    selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
    selftests/bpf: Add tp_btf CO-RE reloc test for modules
    libbpf: Support attachment of BPF tracing programs to kernel modules
    libbpf: Factor out low-level BPF program loading helper
    bpf: Allow to specify kernel module BTFs when attaching BPF programs
    bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
    selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
    selftests/bpf: Add support for marking sub-tests as skipped
    selftests/bpf: Add bpf_testmod kernel module for testing
    libbpf: Add kernel module BTF support for CO-RE relocations
    libbpf: Refactor CO-RE relocs to not assume a single BTF object
    libbpf: Add internal helper to load BTF data by FD
    bpf: Keep module's btf_data_size intact after load
    bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
    selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
    bpf: Adds support for setting window clamp
    samples/bpf: Fix spelling mistake "recieving" -> "receiving"
    bpf: Fix cold build of test_progs-no_alu32
    ...
    ====================

    Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

03 Dec, 2020

4 commits

  • Do not use rlimit-based memory accounting for xskmap maps.
    It has been replaced with the memcg-based memory accounting.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20201201215900.3569844-31-guro@fb.com

    Roman Gushchin
     
  • Extend xskmap memory accounting to include the memory taken by
    the xsk_map_node structure.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201201215900.3569844-18-guro@fb.com

    Roman Gushchin
     
  • Modify the Tx writeable condition from "the queue is not full" to
    "the number of entries present in the Tx queue is less than half the
    total number of entries". Because the Tx queue is not full only for a
    very short time, the old condition caused a large number of EPOLLOUT
    events and, in turn, a large number of process wakeups.
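
    The new condition roughly corresponds to a helper like this (names
    assumed from the xsk queue code):

        static bool xsk_tx_writeable(struct xdp_sock *xs)
        {
            /* writeable only while at least half the ring is free */
            if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2)
                return false;

            return true;
        }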

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Xuan Zhuo
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/508fef55188d4e1160747ead64c6dcda36735880.1606555939.git.xuanzhuo@linux.alibaba.com

    Xuan Zhuo
     
  • datagram_poll judges the current socket status (EPOLLIN, EPOLLOUT)
    based on traditional socket information (e.g., sk_wmem_alloc), which
    does not apply to xsk. So this patch uses sock_poll_wait instead of
    datagram_poll, and the mask is calculated by xsk_poll.
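
    A condensed sketch of the resulting poll function (helper names
    assumed; unrelated details elided):

        static __poll_t xsk_poll(struct file *file, struct socket *sock,
                                 struct poll_table_struct *wait)
        {
            struct xdp_sock *xs = xdp_sk(sock->sk);
            __poll_t mask = 0;

            sock_poll_wait(file, sock, wait); /* was datagram_poll() */

            if (xs->rx && !xskq_prod_is_empty(xs->rx))
                mask |= EPOLLIN | EPOLLRDNORM;
            if (xs->tx && xsk_tx_writeable(xs))
                mask |= EPOLLOUT | EPOLLWRNORM;

            return mask;
        }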

    Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
    Signed-off-by: Xuan Zhuo
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/e82f4697438cd63edbf271ebe1918db8261b7c09.1606555939.git.xuanzhuo@linux.alibaba.com

    Xuan Zhuo
     

01 Dec, 2020

4 commits

  • Add napi_id to the xdp_rxq_info structure, and make sure the XDP
    socket picks up the napi_id in the Rx path. The napi_id is used to
    find the corresponding NAPI structure for socket busy polling.
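
    In rough terms (function names assumed from the series; not the
    full diff):

        /* driver side: the ring's NAPI id is now passed at
         * registration time
         */
        err = xdp_rxq_info_reg(&rxq->xdp_rxq, netdev, queue_index,
                               napi_id);

        /* xsk Rx path: record the id on the socket so busy polling
         * can find the right NAPI context
         */
        sk_mark_napi_id_once_xdp(&xs->sk, xdp);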

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Acked-by: Ilias Apalodimas
    Acked-by: Michael S. Tsirkin
    Acked-by: Tariq Toukan
    Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com

    Björn Töpel
     
  • Wire up XDP socket busy-poll support for recvmsg() and sendmsg(). If
    the XDP socket prefers busy-polling, make sure that no wakeup/IPI is
    performed.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20201130185205.196029-6-bjorn.topel@gmail.com

    Björn Töpel
     
  • Add a check for the need wakeup flag in sendmsg(), so that if a user
    calls sendmsg() when no wakeup is needed, no wakeup is triggered.

    To simplify the need wakeup check in the syscall, unconditionally
    enable the need wakeup flag for Tx. This has a side-effect for poll():
    if poll() is called for a socket without need wakeup enabled, a Tx
    wakeup is unconditionally performed.

    The wakeup matrix for AF_XDP now looks like:

    need wakeup | poll()      | sendmsg()   | recvmsg()
    ------------+-------------+-------------+------------
    disabled    | wake Tx     | wake Tx     | nop
    enabled     | check flag; | check flag; | check flag;
                | wake Tx/Rx  | wake Tx     | wake Rx

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20201130185205.196029-5-bjorn.topel@gmail.com

    Björn Töpel
     
  • Add support for non-blocking recvmsg() to XDP sockets. Previously,
    only sendmsg() was supported by XDP socket. Now, for symmetry and the
    upcoming busy-polling support, recvmsg() is added.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20201130185205.196029-4-bjorn.topel@gmail.com

    Björn Töpel
     

28 Nov, 2020

1 commit

  • The functions xsk_map_put() and xsk_map_inc() are simple wrappers, so
    replace them with direct calls to bpf_map_put() and bpf_map_inc() and
    remove some error testing code.
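
    A plausible shape of the removed wrappers (illustrative; the real
    ones may have differed slightly):

        static struct xsk_map *xsk_map_inc(struct xsk_map *map)
        {
            bpf_map_inc(&map->map);
            return map;
        }

        static void xsk_map_put(struct xsk_map *map)
        {
            bpf_map_put(&map->map);
        }

    After the change, call sites simply use bpf_map_inc(&map->map) and
    bpf_map_put(&map->map) directly.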

    Signed-off-by: Zhu Yanjun
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/1606402998-12562-1-git-send-email-yanjunz@nvidia.com

    Zhu Yanjun
     

25 Nov, 2020

1 commit

  • Commit 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    addressed the problem that packets were discarded from the Tx AF_XDP
    ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was
    bumping the skbuff reference count, so that the buffer would not be
    freed by dev_direct_xmit(). A reference count larger than one means
    that the skbuff is "shared", which is not the case.

    If the "shared" skbuff is sent to the generic XDP receive path,
    netif_receive_generic_xdp(), and pskb_expand_head() is entered the
    BUG_ON(skb_shared(skb)) will trigger.

    This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(),
    where a user can select the skbuff free policy. This allows AF_XDP to
    avoid bumping the reference count, but still keep the NETDEV_TX_BUSY
    behavior.
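
    The resulting split looks roughly like this (sketch based on the
    description):

        int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);

        /* default free-on-failure policy kept for existing callers */
        static inline int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
        {
            int ret = __dev_direct_xmit(skb, queue_id);

            if (!dev_xmit_complete(ret))
                kfree_skb(skb);
            return ret;
        }

    AF_XDP calls __dev_direct_xmit() and keeps the skb itself on
    NETDEV_TX_BUSY, so no extra reference is needed.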

    Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    Reported-by: Yonghong Song
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com

    Björn Töpel
     

23 Nov, 2020

1 commit

  • Fix an incorrect netdev reference count in the xsk_bind operation. The
    incorrect reference count appears when a user calls bind with the
    XDP_ZEROCOPY flag on an interface which does not support zero-copy.
    In such a case, an error is returned but the reference count is not
    decreased. This change fixes the fault by decreasing the reference
    count in case of such an error.

    The problem being corrected first appeared in '162c820ed896', and the
    code was moved to a new file location over time with commit
    'c2d3d6a47462'. This specific patch applies to all versions starting
    from 'c2d3d6a47462'. For versions from '162c820ed896' up to but
    excluding 'c2d3d6a47462', the same solution should be applied to a
    different file (net/xdp/xdp_umem.c) and function
    (xdp_umem_assign_dev).

    Fixes: 162c820ed896 ("xdp: hold device for umem regardless of zero-copy mode")
    Signed-off-by: Marek Majtyka
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20201120151443.105903-1-marekx.majtyka@intel.com

    Marek Majtyka
     

20 Nov, 2020

1 commit

  • Fix a bug that is triggered when a partially setup socket is
    destroyed. For a fully setup socket, a socket that has been bound to a
    device, the cleanup of the umem is performed at the end of the buffer
    pool's cleanup work queue item. This has to be performed in a work
    queue, and not in RCU cleanup, as it is doing a vunmap that cannot
    execute in interrupt context. However, when a socket has only been
    partially set up so that a umem has been created but the buffer pool
    has not, the code erroneously directly calls the umem cleanup function
    instead of using a work queue, and this leads to a BUG_ON() in
    vunmap().

    As there is no buffer pool in this case, we cannot use its work
    queue, so we need to introduce a work queue for the umem and schedule
    the cleanup there. If there is a pool, the cleanup of the umem is
    still performed by the pool's work queue, as it is important that the
    umem is cleaned up after the pool.
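
    A sketch of the resulting cleanup split (names assumed from the
    description):

        static void xdp_umem_release_deferred(struct work_struct *work)
        {
            struct xdp_umem *umem = container_of(work, struct xdp_umem,
                                                 work);

            xdp_umem_release(umem); /* vunmap() ok in process context */
        }

        void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
        {
            if (!umem)
                return;

            if (refcount_dec_and_test(&umem->users)) {
                if (defer_cleanup) {
                    /* no pool: use the umem's own work queue */
                    INIT_WORK(&umem->work, xdp_umem_release_deferred);
                    schedule_work(&umem->work);
                } else {
                    xdp_umem_release(umem);
                }
            }
        }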

    Fixes: e5e1a4bc916d ("xsk: Fix possible memory leak at socket close")
    Reported-by: Marek Majtyka
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Tested-by: Marek Majtyka
    Link: https://lore.kernel.org/bpf/1605873219-21629-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

18 Nov, 2020

2 commits

  • Introduce batched descriptor interfaces in the xsk core code for the
    Tx path, to be used by drivers to write a code path with higher
    performance. This interface will be used by the i40e driver in the
    next patch, though other drivers would likely benefit from it too.

    Note that batching is only implemented for the common case when
    there is only one socket bound to the same device and queue id. When
    this is not the case, we fall back to the old non-batched version of
    the function.
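
    A hypothetical driver-side use of the batched interface (function
    name, signature and BATCH_MAX are assumptions drawn from the
    description):

        struct xdp_desc descs[BATCH_MAX];
        u32 i, nb_pkts;

        /* peek up to budget descriptors and release them in one go */
        nb_pkts = xsk_tx_peek_release_desc_batch(pool, descs, budget);
        for (i = 0; i < nb_pkts; i++) {
            dma_addr_t dma = xsk_buff_raw_get_dma(pool, descs[i].addr);

            /* post descs[i].len bytes at dma to the HW Tx ring */
        }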

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/1605525167-14450-5-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • Introduce one cache line worth of padding between the consumer pointer
    and the flags field, as well as between the flags field and the start
    of the descriptors, in all the lockless rings. This is so that the x86
    HW adjacency prefetcher will not prefetch the adjacent pointer/field
    when only one pointer/field is going to be used. This improves
    throughput performance for the l2fwd sample app by 1% on my machine
    with HW prefetching turned on in the BIOS.
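
    Together with the earlier producer/consumer padding, the ring header
    plausibly ends up shaped like this (illustrative, not the exact
    upstream struct):

        struct xdp_ring {
            u32 producer ____cacheline_aligned_in_smp;
            /* pad: keep the prefetcher off the consumer pointer */
            u32 pad1 ____cacheline_aligned_in_smp;
            u32 consumer ____cacheline_aligned_in_smp;
            u32 pad2 ____cacheline_aligned_in_smp;
            u32 flags;
            /* pad: separate flags from the start of the descriptors */
            u32 pad3 ____cacheline_aligned_in_smp;
        };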

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/1605525167-14450-4-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

29 Oct, 2020

1 commit

  • Fix a possible memory leak at xsk socket close that is caused by the
    refcounting of the umem object being wrong. The reference count of the
    umem was decremented only after the pool had been freed. Note that if
    the buffer pool is destroyed, it is important that the umem is
    destroyed after the pool, otherwise the umem would disappear while the
    driver is still running. And as the buffer pool needs to be destroyed
    in a work queue, the umem is also (if its refcount reaches zero)
    destroyed after the buffer pool in that same work queue.

    What was missing is that the refcount also needs to be decremented
    when the pool is not freed and when the pool has not even been
    created. The first case happens when the refcount of the pool is
    higher than 1, i.e. it is still being used by some other socket using
    the same device and queue id. In this case, it is safe to decrement
    the refcount of the umem outside of the work queue as the umem will
    never be freed because the refcount of the umem is always greater than
    or equal to the refcount of the buffer pool. The second case is if the
    buffer pool has not been created yet, i.e. the socket was closed
    before it was bound but after the umem was created. In this case, it
    is safe to destroy the umem outside of the work queue, since there is
    no pool that can use it by definition.

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Reported-by: syzbot+eb71df123dc2be2c1456@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1603801921-2712-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

13 Oct, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-10-12

    The main changes are:

    1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.

    2) libbpf relocation support for different size load/store, from Andrii.

    3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.

    4) BPF support for per-cpu variables, from Hao.

    5) sockmap improvements, from John.
    ====================

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

12 Oct, 2020

1 commit

  • Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
    ("bpf: Relax max_entries check for most of the inner map types") added support
    for dynamic inner max elements for most map-in-map types. Exceptions were maps
    like array or prog array where the map_gen_lookup() callback uses the maps'
    max_entries field as a constant when emitting instructions.

    We recently implemented Maglev consistent hashing into Cilium's load
    balancer, which uses map-in-map with an outer hash map and an inner
    array map holding the Maglev backend table for each service. It has
    been designed this way in order to reduce overall memory consumption,
    given that the outer hash map avoids preallocating a large, flat
    memory area for all services. Also, the number of service mappings is
    not always known a priori.

    The use case for dynamic inner array map entries is to further reduce
    memory overhead: for example, some services might just have a small
    number of backends while others could have a large number. Right now,
    the Maglev backend tables for services with few and with many backends
    would need the same number of inner array map entries, which adds a
    lot of unneeded overhead.

    Dynamic inner array map entries can be realized by avoiding the
    inlined code generation for their lookup. The lookup will still be
    efficient since it will call into array_map_lookup_elem() directly and
    thus avoid a retpoline. The patch adds a BPF_F_INNER_MAP flag to map
    creation which therefore skips inline code generation, and relaxes the
    array_map_meta_equal() check to ignore both maps' max_entries. This
    still allows faster lookups for map-in-map when BPF_F_INNER_MAP is not
    specified and dynamic max_entries is not needed.

    Example code generation where the inner map is a dynamically sized array:

    # bpftool p d x i 125
    int handle__sys_enter(void * ctx):
    ; int handle__sys_enter(void *ctx)
    0: (b4) w1 = 0
    ; int key = 0;
    1: (63) *(u32 *)(r10 -4) = r1
    2: (bf) r2 = r10
    ;
    3: (07) r2 += -4
    ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
    4: (18) r1 = map[id:468]
    6: (07) r1 += 272
    7: (61) r0 = *(u32 *)(r2 +0)
    8: (35) if r0 >= 0x3 goto pc+5
    9: (67) r0 <
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net

    Daniel Borkmann
     

09 Oct, 2020

1 commit

  • Introduce one cache line worth of padding between the producer and
    consumer pointers in all the lockless rings. This is so that the HW
    adjacency prefetcher will not prefetch the consumer pointer when the
    producer pointer is used, and vice versa. This improves throughput
    performance for the l2fwd sample app by 2% on my machine with HW
    prefetching turned on.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1602166338-21378-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

05 Oct, 2020

1 commit

  • Christoph Hellwig correctly pointed out [1] that the AF_XDP core was
    pointlessly including internal headers. Let us remove those includes.

    [1] https://lore.kernel.org/bpf/20201005084341.GA3224@infradead.org/

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Reported-by: Christoph Hellwig
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Acked-by: Christoph Hellwig
    Link: https://lore.kernel.org/bpf/20201005090525.116689-1-bjorn.topel@gmail.com

    Björn Töpel
     

30 Sep, 2020

1 commit

  • After 'peeking' the ring, the consumer, not the producer, reads the data.
    Fix this mistake in the comments.

    Fixes: 15d8c9162ced ("xsk: Add function naming comments and reorder functions")
    Signed-off-by: Ciara Loftus
    Signed-off-by: Alexei Starovoitov
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20200928082344.17110-1-ciara.loftus@intel.com

    Ciara Loftus
     

29 Sep, 2020

1 commit

  • Fix a possible crash in socket_release when an out-of-memory error has
    occurred in the bind call. If a socket using the XDP_SHARED_UMEM flag
    encountered an error in xp_create_and_assign_umem, the bind code
    jumped to the exit routine but erroneously forgot to set the err value
    before jumping. This meant that the exit routine thought the setup
    went well and set the state of the socket to XSK_BOUND. The xsk socket
    release code would then, at application exit, think that this is a
    properly set up socket and crash, since the fields in the socket had
    in fact not been initialized properly. Fix this by setting the err
    variable in xsk_bind so that the socket is never set to XSK_BOUND,
    which means the bound-socket clean-up in xsk_release is not
    triggered.
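
    The shape of the fix in xsk_bind() (sketch; label names assumed):

        xs->pool = xp_create_and_assign_umem(xs, umem_xs->umem);
        if (!xs->pool) {
            err = -ENOMEM; /* previously left unset here */
            sockfd_put(sock);
            goto out_unlock;
        }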

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Reported-by: syzbot+ddc7b4944bc61da19b81@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1601112373-10595-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

24 Sep, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-09-23

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 95 non-merge commits during the last 22 day(s) which contain
    a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

    The main changes are:

    1) Full multi function support in libbpf, from Andrii.

    2) Refactoring of function argument checks, from Lorenz.

    3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.

    4) Program metadata support, from YiFei.

    5) bpf iterator optimizations, from Yonghong.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Sep, 2020

1 commit

  • Two minor conflicts:

    1) net/ipv4/route.c, adding a new local variable while
    moving another local variable and removing its
    initial assignment.

    2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
    One pretty prints the port mode differently, whilst another
    changes the driver to try and obtain the port mode from
    the port node rather than the switch node.

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Sep, 2020

1 commit

  • In the skb Tx path, transmission of a packet is performed with
    dev_direct_xmit(). When a driver returns NETDEV_TX_BUSY, it signifies
    that it was not possible to send the packet right now; please try
    again later. Unfortunately, the xsk transmit code discarded the packet
    and returned EBUSY to the application. Fix this unnecessary packet
    loss by not discarding the packet in the Tx ring and returning EAGAIN
    instead. As EAGAIN is returned to the application, it can retry the
    send operation later, and the packet will then likely be sent, as the
    driver will by then likely have the space/resources to send it.

    In summary, EAGAIN tells the application that the packet was not
    discarded from the Tx ring and that it needs to call send()
    again. EBUSY, on the other hand, signifies that the packet was not
    sent and was discarded from the Tx ring. The application needs to put
    the packet on the Tx ring again if it wants it to be sent.
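
    A condensed sketch of the new error mapping (the full patch also
    adjusts the skb handling so the buffer survives the retry):

        err = dev_direct_xmit(skb, xs->queue_id);
        if (err == NETDEV_TX_BUSY) {
            /* descriptor stays in the Tx ring; tell user space
             * to simply call send() again
             */
            err = -EAGAIN;
            goto out;
        }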

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Reported-by: Arkadiusz Zema
    Suggested-by: Arkadiusz Zema
    Suggested-by: Daniel Borkmann
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jesse Brandeburg
    Link: https://lore.kernel.org/bpf/1600257625-2353-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

15 Sep, 2020

2 commits

  • Fix a potential refcount warning in xp_dma_map about a zero value
    being increased to one, by initializing the refcount to one to start
    with, instead of zero plus a refcount_inc().
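
    In other words (sketch; surrounding allocation code assumed):

        dma_map = kzalloc(sizeof(*dma_map), GFP_KERNEL);
        if (!dma_map)
            return NULL;

        /* born referenced: avoids a 0 -> 1 refcount_inc() later */
        refcount_set(&dma_map->users, 1);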

    Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/1600095036-23868-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • For AF_XDP sockets, there was a discrepancy between the number of
    pinned pages and the size of the umem region.

    The size of the umem region is used to validate the AF_XDP descriptor
    addresses. The logic that pinned the pages covered by the region only
    took whole pages into consideration, creating a mismatch between the
    size and pinned pages. A user could then pass AF_XDP addresses outside
    the range of pinned pages, but still within the size of the region,
    crashing the kernel.

    This change correctly calculates the number of pages to be
    pinned. Further, the size check for the aligned mode is
    simplified. Now the code simply checks if the size is divisible by the
    chunk size.
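
    A sketch of the corrected accounting (variable names assumed):

        u32 npgs_rem, chunks_rem;
        u64 npgs, chunks;

        /* round up so every addressable byte is backed by a pinned page */
        npgs = div_u64_rem(size, PAGE_SIZE, &npgs_rem);
        if (npgs_rem)
            npgs++;

        /* aligned mode: size must be divisible by the chunk size */
        chunks = div_u64_rem(size, chunk_size, &chunks_rem);
        if (!unaligned_chunks && chunks_rem)
            return -EINVAL;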

    Fixes: bbff2f321a86 ("xsk: new descriptor addressing scheme")
    Reported-by: Ciara Loftus
    Signed-off-by: Björn Töpel
    Signed-off-by: Alexei Starovoitov
    Tested-by: Ciara Loftus
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200910075609.7904-1-bjorn.topel@gmail.com

    Björn Töpel
     

03 Sep, 2020

2 commits

  • Fix use-after-free when a shared umem bind fails. The code incorrectly
    tried to free the allocated buffer pool both in the bind code and then
    later also when the socket was released. Fix this by setting the
    buffer pool pointer to NULL after the bind code has freed the pool, so
    that the socket release code will not try to free the pool. This is
    the same solution as the regular, non-shared umem code path has. This
    was missing from the shared umem path.
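
    A sketch of the fix in the shared umem branch of the bind code
    (context and helper names assumed):

        err = xp_assign_dev_shared(xs->pool, umem_xs->umem, dev, qid);
        if (err) {
            xp_destroy(xs->pool);
            xs->pool = NULL; /* prevent a second free at release */
            sockfd_put(sock);
            goto out_unlock;
        }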

    Fixes: b5aea28dca13 ("xsk: Add shared umem support between queue ids")
    Reported-by: syzbot+5334f62e4d22804e646a@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1599032164-25684-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Currently, dma_map is being null-checked, when the right object
    identifier to check is dma_map->dma_pages instead.

    Fix this by null-checking dma_map->dma_pages.
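
    A sketch of the corrected check inside the dma_map creation path
    (context assumed from the Coverity report):

        dma_map->dma_pages = kvcalloc(nr_pages,
                                      sizeof(*dma_map->dma_pages),
                                      GFP_KERNEL);
        if (!dma_map->dma_pages) { /* was: if (!dma_map), dead code */
            kfree(dma_map);
            return NULL;
        }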

    Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
    Addresses-Coverity-ID: 1496811 ("Logically dead code")
    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20200902150750.GA7257@embeddedor

    Gustavo A. R. Silva