17 Jan, 2021

2 commits

  • commit b1b95cb5c0a9694d47d5f845ba97e226cfda957d upstream.

    Roll back the reservation in the completion ring when we get a
    NETDEV_TX_BUSY. When this error is received from the driver, we are
    supposed to let the user application retry the transmit. And in
    order to do this, we need to roll back the failed send so it can be
    retried. Unfortunately, we did not cancel the reservation we had made
    in the completion ring. By not doing this, we actually make the
    completion ring one entry smaller for every NETDEV_TX_BUSY error we
    get, and after enough of these errors the completion ring will be of
    size zero and transmit will stop working.

    Fix this by cancelling the reservation when we get a NETDEV_TX_BUSY
    error.

    Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    Reported-by: Xuan Zhuo
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201218134525.13119-3-magnus.karlsson@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Magnus Karlsson
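
    A minimal userspace sketch of the reserve/cancel idea (illustrative
    only; the kernel helpers are xskq_prod_reserve()/xskq_prod_cancel()
    and differ in detail): a failed send must hand back the completion
    slot it reserved, otherwise every NETDEV_TX_BUSY permanently shrinks
    the usable ring by one entry.

      #include <stdbool.h>
      #include <stdint.h>

      /* Single-producer ring tracked through a cached producer index. */
      struct prod_ring {
              uint32_t size;         /* total number of slots              */
              uint32_t cached_prod;  /* producer index incl. reservations  */
              uint32_t cached_cons;  /* last observed consumer index       */
      };

      static bool prod_reserve(struct prod_ring *r)
      {
              if (r->cached_prod - r->cached_cons == r->size)
                      return false;   /* ring full */
              r->cached_prod++;       /* claim one slot */
              return true;
      }

      /* The step that was missing on the NETDEV_TX_BUSY path. */
      static void prod_cancel(struct prod_ring *r)
      {
              r->cached_prod--;       /* give the claimed slot back */
      }

    On the error path the send code then reserves a slot, attempts the
    transmit, and calls prod_cancel() if the driver reports busy, so that
    the very same send can simply be retried later.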
     
  • commit f09ced4053bc0a2094a12b60b646114c966ef4c6 upstream.

    Fix a race when multiple sockets are simultaneously calling sendto()
    when the completion ring is shared in the SKB case. This is the case
    when you share the same netdev and queue id through the
    XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
    be in xsk_generic_xmit() and call the backpressure mechanism in
    xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
    specific scenario, a race might occur since the rings are
    single-producer single-consumer.

    Fix this by moving the tx_completion_lock from the socket to the pool
    as the pool is shared between the sockets that share the completion
    ring. (The pool is not shared when this is not the case.) And then
    protect the accesses to xskq_prod_reserve() with this lock. The
    tx_completion_lock is renamed cq_lock to better reflect that it
    protects accesses to the potentially shared completion ring.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Reported-by: Xuan Zhuo
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Magnus Karlsson
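
    A hedged userspace model of the locking change (names follow the
    commit text; the kernel implementation differs): the lock that
    serializes producers of the shared completion ring has to live in the
    shared pool, not in the per-socket structure.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdint.h>

      /* One pool per (netdev, queue id); shared by sockets bound with
       * XDP_SHARED_UMEM, so the cq_lock naturally belongs here.
       * (pthread_spin_init() on the lock is omitted for brevity.) */
      struct xsk_pool_model {
              pthread_spinlock_t cq_lock;   /* was tx_completion_lock per socket */
              uint32_t cq_prod;
              uint32_t cq_cons;
              uint32_t cq_size;
      };

      /* Every socket sharing the pool must take cq_lock before producing,
       * since the ring itself is single-producer/single-consumer. */
      static bool cq_reserve(struct xsk_pool_model *p)
      {
              bool ok;

              pthread_spin_lock(&p->cq_lock);
              ok = (p->cq_prod - p->cq_cons) != p->cq_size;
              if (ok)
                      p->cq_prod++;
              pthread_spin_unlock(&p->cq_lock);
              return ok;
      }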
     

13 Jan, 2021

1 commit

  • commit 8bee683384087a6275c9183a483435225f7bb209 upstream.

    Fix a possible memory leak when a bind of an AF_XDP socket fails. When
    the fill and completion rings are created, they are tied to the
    socket. But when the buffer pool is later created at bind time, the
    ownership of these two rings is transferred to the buffer pool as
    they might be shared between sockets (and the buffer pool cannot be
    created until we know what we are binding to). So, before the buffer
    pool is created, these two rings are cleaned up with the socket, and
    after they have been transferred they are cleaned up together with
    the buffer pool.

    The problem is that ownership was transferred before it was absolutely
    certain that the buffer pool could be created and initialized
    correctly, and when one of these errors occurred, the fill and
    completion rings belonged neither to the socket nor to the pool and
    were therefore leaked. Solve this by moving the ownership transfer
    to the point where the buffer pool has been completely set up and
    there is no way it can fail.

    Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool")
    Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Magnus Karlsson
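
    The fix is essentially an ordering rule; a hedged sketch, with
    hypothetical names, of the "transfer ownership only after nothing can
    fail any more" pattern:

      struct rings { int fill_and_comp; };            /* stand-in for the two rings */
      struct xsk_sock_model { struct rings *rings; };
      struct pool_model { struct rings *rings; };

      /* Stand-in for all the fallible buffer pool initialization. */
      static int pool_setup(struct pool_model *pool) { (void)pool; return 0; }

      static int do_bind(struct xsk_sock_model *xs, struct pool_model *pool)
      {
              if (pool_setup(pool))
                      return -1;          /* rings still owned (and freed) with the socket */

              pool->rings = xs->rings;    /* ownership moves only after full setup */
              xs->rings = NULL;           /* the socket no longer frees them */
              return 0;
      }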
     

04 Dec, 2020

1 commit

  • If force_zc is set, we should exit with an error rather than fall back
    to copy mode.

    Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
    Reported-by: Hulk Robot
    Signed-off-by: Zhang Changzhong
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/1607077277-41995-1-git-send-email-zhangchangzhong@huawei.com

    Zhang Changzhong
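
    From the application's point of view the change is visible at bind
    time; a hedged sketch using the AF_XDP UAPI (umem and ring setup, and
    most error handling, omitted):

      #include <errno.h>
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      /* Ask for zero-copy explicitly. With this fix the bind fails with an
       * error instead of silently degrading to copy mode; the caller can
       * then decide whether retrying with XDP_COPY is acceptable. */
      static int bind_zerocopy(int xsk_fd, int ifindex, int queue_id)
      {
              struct sockaddr_xdp sxdp = {
                      .sxdp_family   = AF_XDP,
                      .sxdp_ifindex  = ifindex,
                      .sxdp_queue_id = queue_id,
                      .sxdp_flags    = XDP_ZEROCOPY,  /* force_zc on the kernel side */
              };

              if (bind(xsk_fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0)
                      return -errno;
              return 0;
      }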
     

03 Dec, 2020

2 commits

  • Modify the tx writeable condition from "the queue is not full" to "the
    number of outstanding entries in the tx queue is less than half of the
    total number of entries". Because the queue being not full lasts only a
    very short time, the old condition caused a large number of EPOLLOUT
    events and, with them, a large number of process wakeups.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Xuan Zhuo
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/508fef55188d4e1160747ead64c6dcda36735880.1606555939.git.xuanzhuo@linux.alibaba.com

    Xuan Zhuo
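
    A sketch of the new condition (names hypothetical; the in-kernel
    helper and counters differ): the socket is reported writeable only
    while less than half of the tx ring is outstanding, so EPOLLOUT no
    longer fires every time a single slot briefly frees up.

      #include <stdbool.h>
      #include <stdint.h>

      static bool tx_writeable(uint32_t outstanding_entries, uint32_t ring_size)
      {
              return outstanding_entries < ring_size / 2;
      }

    The poll mask would then include EPOLLOUT only when tx_writeable()
    holds, batching wakeups instead of generating one per freed slot.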
     
  • datagram_poll determines the current socket status (EPOLLIN, EPOLLOUT)
    based on traditional socket state (e.g. sk_wmem_alloc), but this does
    not apply to xsk. So this patch uses sock_poll_wait instead of
    datagram_poll, and the mask is calculated by xsk_poll.

    Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
    Signed-off-by: Xuan Zhuo
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/e82f4697438cd63edbf271ebe1918db8261b7c09.1606555939.git.xuanzhuo@linux.alibaba.com

    Xuan Zhuo
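
    The userspace API is unchanged; a hedged example of waiting on an
    AF_XDP socket, where the returned mask is now computed by xsk_poll():

      #include <poll.h>

      static int wait_for_xsk(int xsk_fd, int timeout_ms)
      {
              struct pollfd pfd = {
                      .fd     = xsk_fd,
                      .events = POLLIN | POLLOUT,
              };

              /* POLLIN/POLLOUT in pfd.revents now reflect the state of the
               * xsk rings rather than generic socket accounting. */
              return poll(&pfd, 1, timeout_ms);
      }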
     

25 Nov, 2020

1 commit

  • Commit 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    addressed the problem that packets were discarded from the Tx AF_XDP
    ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was
    bumping the skbuff reference count, so that the buffer would not be
    freed by dev_direct_xmit(). A reference count larger than one means
    that the skbuff is "shared", which is not the case.

    If the "shared" skbuff is sent to the generic XDP receive path,
    netif_receive_generic_xdp(), and pskb_expand_head() is entered, the
    BUG_ON(skb_shared(skb)) will trigger.

    This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(),
    where a user can select the skbuff free policy. This allows AF_XDP to
    avoid bumping the reference count, but still keep the NETDEV_TX_BUSY
    behavior.

    Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    Reported-by: Yonghong Song
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com

    Björn Töpel
     

23 Nov, 2020

1 commit

  • Fix an incorrect netdev reference count in the xsk_bind operation. The
    incorrect reference count appears when a user calls bind with the
    XDP_ZEROCOPY flag on an interface which does not support zero-copy.
    In such a case, an error is returned but the reference count is not
    decreased. This change fixes the fault by decreasing the reference
    count in such an error case.

    The problem first appeared in commit 162c820ed896, and the code was
    later moved to a new file location by commit c2d3d6a47462. This patch
    applies to all versions starting from c2d3d6a47462. For versions from
    162c820ed896 up to, but excluding, c2d3d6a47462, the same fix should be
    applied to a different file (net/xdp/xdp_umem.c) and function
    (xdp_umem_assign_dev).

    Fixes: 162c820ed896 ("xdp: hold device for umem regardless of zero-copy mode")
    Signed-off-by: Marek Majtyka
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20201120151443.105903-1-marekx.majtyka@intel.com

    Marek Majtyka
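
    The pattern being restored, as a hedged sketch with hypothetical
    names (the kernel itself uses dev_hold()/dev_put() on the netdev):

      /* Every reference taken before a failure must be dropped again on
       * that failure path. */
      struct netdev_model { int refcnt; };

      static void dev_take(struct netdev_model *d) { d->refcnt++; }
      static void dev_drop(struct netdev_model *d) { d->refcnt--; }

      static int bind_to_dev(struct netdev_model *d, int want_zc, int dev_has_zc)
      {
              dev_take(d);
              if (want_zc && !dev_has_zc) {
                      dev_drop(d);    /* the missing step: undo the hold before erroring out */
                      return -1;      /* e.g. zero-copy not supported */
              }
              return 0;
      }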
     

20 Nov, 2020

1 commit

  • Fix a bug that is triggered when a partially setup socket is
    destroyed. For a fully setup socket, a socket that has been bound to a
    device, the cleanup of the umem is performed at the end of the buffer
    pool's cleanup work queue item. This has to be performed in a work
    queue, and not in RCU cleanup, as it is doing a vunmap that cannot
    execute in interrupt context. However, when a socket has only been
    partially set up so that a umem has been created but the buffer pool
    has not, the code erroneously directly calls the umem cleanup function
    instead of using a work queue, and this leads to a BUG_ON() in
    vunmap().

    As there is no buffer pool in this case, we cannot use its work queue,
    so we need to introduce a work queue for the umem and schedule this for
    the cleanup. So in the case there is no pool, we are going to use the
    umem's own work queue to schedule the cleanup. But if there is a
    pool, the cleanup of the umem is still being performed by the pool's
    work queue, as it is important that the umem is cleaned up after the
    pool.

    Fixes: e5e1a4bc916d ("xsk: Fix possible memory leak at socket close")
    Reported-by: Marek Majtyka
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Tested-by: Marek Majtyka
    Link: https://lore.kernel.org/bpf/1605873219-21629-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

29 Oct, 2020

1 commit

  • Fix a possible memory leak at xsk socket close that is caused by the
    refcounting of the umem object being wrong. The reference count of the
    umem was decremented only after the pool had been freed. Note that if
    the buffer pool is destroyed, it is important that the umem is
    destroyed after the pool, otherwise the umem would disappear while the
    driver is still running. And as the buffer pool needs to be destroyed
    in a work queue, the umem is also (if its refcount reaches zero)
    destroyed after the buffer pool in that same work queue.

    What was missing is that the refcount also needs to be decremented
    when the pool is not freed and when the pool has not even been
    created. The first case happens when the refcount of the pool is
    higher than 1, i.e. it is still being used by some other socket using
    the same device and queue id. In this case, it is safe to decrement
    the refcount of the umem outside of the work queue as the umem will
    never be freed because the refcount of the umem is always greater than
    or equal to the refcount of the buffer pool. The second case is if the
    buffer pool has not been created yet, i.e. the socket was closed
    before it was bound but after the umem was created. In this case, it
    is safe to destroy the umem outside of the work queue, since there is
    no pool that can use it by definition.

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Reported-by: syzbot+eb71df123dc2be2c1456@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1603801921-2712-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

13 Oct, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-10-12

    The main changes are:

    1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.

    2) libbpf relocation support for different size load/store, from Andrii.

    3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.

    4) BPF support for per-cpu variables, from Hao.

    5) sockmap improvements, from John.
    ====================

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

12 Oct, 2020

1 commit

  • Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
    ("bpf: Relax max_entries check for most of the inner map types") added support
    for dynamic inner max elements for most map-in-map types. Exceptions were maps
    like array or prog array where the map_gen_lookup() callback uses the maps'
    max_entries field as a constant when emitting instructions.

    We recently implemented Maglev consistent hashing in Cilium's load
    balancer, which uses map-in-map with an outer hash map and inner array
    maps holding the Maglev backend table for each service. It was designed
    this way to reduce overall memory consumption, given that the outer
    hash map avoids preallocating a large, flat memory area for all
    services. Also, the number of service mappings is not always known
    a priori.

    The use case for dynamic inner array map entries is to further reduce
    memory overhead: some services might have only a small number of
    backends while others could have a large number. Right now, the Maglev
    backend tables for services with few and with many backends must have
    the same number of inner array map entries, which adds a lot of
    unneeded overhead.

    Dynamic inner array map entries can be realized by avoiding the inlined
    code generation for their lookup. The lookup will still be efficient
    since it will call into array_map_lookup_elem() directly and thus avoid
    a retpoline. The patch adds a BPF_F_INNER_MAP flag to map creation,
    which skips the inline code generation and relaxes the
    array_map_meta_equal() check to ignore both maps' max_entries. This
    still allows faster lookups for map-in-map when BPF_F_INNER_MAP is not
    specified and dynamic max_entries is therefore not needed.

    Example code generation where inner map is dynamic sized array:

    # bpftool p d x i 125
    int handle__sys_enter(void * ctx):
    ; int handle__sys_enter(void *ctx)
    0: (b4) w1 = 0
    ; int key = 0;
    1: (63) *(u32 *)(r10 -4) = r1
    2: (bf) r2 = r10
    ;
    3: (07) r2 += -4
    ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
    4: (18) r1 = map[id:468]
    6: (07) r1 += 272
    7: (61) r0 = *(u32 *)(r2 +0)
    8: (35) if r0 >= 0x3 goto pc+5
    9: (67) r0 <
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net

    Daniel Borkmann
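
    A hedged userspace sketch of creating such maps through the raw
    bpf(2) syscall (error handling trimmed; assumes kernel headers new
    enough to define BPF_F_INNER_MAP):

      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      static int bpf_map_create_raw(union bpf_attr *attr)
      {
              return syscall(__NR_bpf, BPF_MAP_CREATE, attr, sizeof(*attr));
      }

      /* Inner array: BPF_F_INNER_MAP skips the inlined lookup so its
       * max_entries no longer has to match the verification-time meta map. */
      static int create_inner_array(unsigned int max_entries)
      {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.map_type    = BPF_MAP_TYPE_ARRAY;
              attr.key_size    = sizeof(unsigned int);
              attr.value_size  = sizeof(unsigned int);
              attr.max_entries = max_entries;
              attr.map_flags   = BPF_F_INNER_MAP;
              return bpf_map_create_raw(&attr);
      }

      /* Outer array-of-maps; different-sized inner arrays created as above
       * can then be inserted into it at runtime. */
      static int create_outer_map(int inner_fd, unsigned int nr_slots)
      {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.map_type     = BPF_MAP_TYPE_ARRAY_OF_MAPS;
              attr.key_size     = sizeof(unsigned int);
              attr.value_size   = sizeof(unsigned int);  /* slot holds a map fd */
              attr.max_entries  = nr_slots;
              attr.inner_map_fd = inner_fd;
              return bpf_map_create_raw(&attr);
      }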
     

09 Oct, 2020

1 commit

  • Introduce one cache line worth of padding between the producer and
    consumer pointers in all the lockless rings. This is so that the HW
    adjacency prefetcher will not prefetch the consumer pointer when the
    producer pointer is used, and vice versa. This improves throughput for
    the l2fwd sample app by 2% on my machine with HW prefetching turned
    on.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1602166338-21378-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
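
    A hedged sketch of the layout idea (the kernel uses its own cacheline
    alignment macros; 64-byte lines assumed here): a full spare cache line
    sits between the producer and consumer indices, so touching one never
    makes the adjacency prefetcher pull in the line holding the other.

      #include <stdint.h>

      #define CACHELINE 64    /* assumption for this example */

      struct ring_indices {
              uint32_t producer;
              uint8_t  pad_rest_of_line[CACHELINE - sizeof(uint32_t)];
              uint8_t  pad_spare_line[CACHELINE];   /* the added cache line of padding */
              uint32_t consumer;
      };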
     

06 Oct, 2020

1 commit


05 Oct, 2020

1 commit

  • Christoph Hellwig correctly pointed out [1] that the AF_XDP core was
    pointlessly including internal headers. Let us remove those includes.

    [1] https://lore.kernel.org/bpf/20201005084341.GA3224@infradead.org/

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Reported-by: Christoph Hellwig
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Acked-by: Christoph Hellwig
    Link: https://lore.kernel.org/bpf/20201005090525.116689-1-bjorn.topel@gmail.com

    Björn Töpel
     

30 Sep, 2020

1 commit

  • After 'peeking' the ring, the consumer, not the producer, reads the data.
    Fix this mistake in the comments.

    Fixes: 15d8c9162ced ("xsk: Add function naming comments and reorder functions")
    Signed-off-by: Ciara Loftus
    Signed-off-by: Alexei Starovoitov
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20200928082344.17110-1-ciara.loftus@intel.com

    Ciara Loftus
     

29 Sep, 2020

1 commit

  • Fix possible crash in socket_release when an out-of-memory error has
    occurred in the bind call. If a socket using the XDP_SHARED_UMEM flag
    encountered an error in xp_create_and_assign_umem, the bind code
    jumped to the exit routine but erroneously forgot to set the err value
    before jumping. This meant that the exit routine thought the setup
    went well and set the state of the socket to XSK_BOUND. The xsk socket
    release code will then, at application exit, treat this as a properly
    set up socket when it is not, leading to a crash because not all
    fields in the socket have been initialized properly. Fix this by
    setting the err variable in xsk_bind so that the socket is never
    marked XSK_BOUND, which means the clean-up for bound sockets in
    xsk_release is not triggered on this half-initialized socket.

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Reported-by: syzbot+ddc7b4944bc61da19b81@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1601112373-10595-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

24 Sep, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-09-23

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 95 non-merge commits during the last 22 day(s) which contain
    a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

    The main changes are:

    1) Full multi function support in libbpf, from Andrii.

    2) Refactoring of function argument checks, from Lorenz.

    3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.

    4) Program metadata support, from YiFei.

    5) bpf iterator optimizations, from Yonghong.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Sep, 2020

1 commit

  • Two minor conflicts:

    1) net/ipv4/route.c, adding a new local variable while
    moving another local variable and removing it's
    initial assignment.

    2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
    One pretty prints the port mode differently, whilst another
    changes the driver to try and obtain the port mode from
    the port node rather than the switch node.

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Sep, 2020

1 commit

  • In the skb Tx path, transmission of a packet is performed with
    dev_direct_xmit(). When a driver returns NETDEV_TX_BUSY, it signifies
    that it was not possible to send the packet right now, and that the
    caller should try again later. Unfortunately, the xsk transmit code
    discarded the packet and returned EBUSY to the application. Fix this
    unnecessary packet loss by not discarding the packet in the Tx ring
    and returning EAGAIN instead. As EAGAIN is returned to the
    application, it can retry the send operation later, and the packet
    will then likely be sent as the driver will likely have the
    space/resources to send it by then.

    In summary, EAGAIN tells the application that the packet was not
    discarded from the Tx ring and that it needs to call send()
    again. EBUSY, on the other hand, signifies that the packet was not
    sent and was discarded from the Tx ring. The application needs to put
    the packet on the Tx ring again if it wants it to be sent.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Reported-by: Arkadiusz Zema
    Suggested-by: Arkadiusz Zema
    Suggested-by: Daniel Borkmann
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jesse Brandeburg
    Link: https://lore.kernel.org/bpf/1600257625-2353-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
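
    On the application side the visible difference is the errno; a hedged
    sketch of the usual AF_XDP tx "kick" with a retry on EAGAIN:

      #include <errno.h>
      #include <stddef.h>
      #include <sys/socket.h>

      /* Ask the kernel to transmit the descriptors already placed on the
       * Tx ring. With this fix, EAGAIN means nothing was discarded and the
       * same descriptors will go out on a later attempt. */
      static int xsk_tx_kick(int xsk_fd)
      {
              for (;;) {
                      if (sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0) >= 0)
                              return 0;
                      if (errno == EAGAIN)
                              continue;     /* or poll()/back off before retrying */
                      return -errno;
              }
      }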
     

15 Sep, 2020

2 commits

  • Fix a potential refcount warning about a zero value being increased to
    one in xp_dma_map by initializing the refcount to one from the start,
    instead of to zero followed by a refcount_inc().

    Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/1600095036-23868-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • For AF_XDP sockets, there was a discrepancy between the number of
    pinned pages and the size of the umem region.

    The size of the umem region is used to validate the AF_XDP descriptor
    addresses. The logic that pinned the pages covered by the region only
    took whole pages into consideration, creating a mismatch between the
    size and pinned pages. A user could then pass AF_XDP addresses outside
    the range of pinned pages, but still within the size of the region,
    crashing the kernel.

    This change correctly calculates the number of pages to be
    pinned. Further, the size check for the aligned mode is
    simplified. Now the code simply checks if the size is divisible by the
    chunk size.

    Fixes: bbff2f321a86 ("xsk: new descriptor addressing scheme")
    Reported-by: Ciara Loftus
    Signed-off-by: Björn Töpel
    Signed-off-by: Alexei Starovoitov
    Tested-by: Ciara Loftus
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200910075609.7904-1-bjorn.topel@gmail.com

    Björn Töpel
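
    A hedged sketch of the two calculations involved (page size assumed;
    names hypothetical):

      #include <stdbool.h>
      #include <stdint.h>

      #define EXAMPLE_PAGE_SIZE 4096ULL

      /* Pin every page the region touches, including a trailing partial
       * page, so the pinned range always covers the full validated size. */
      static uint64_t umem_nr_pages(uint64_t size)
      {
              return (size + EXAMPLE_PAGE_SIZE - 1) / EXAMPLE_PAGE_SIZE;
      }

      /* Aligned mode: the region must hold a whole number of chunks. */
      static bool umem_size_valid(uint64_t size, uint32_t chunk_size)
      {
              return chunk_size != 0 && size % chunk_size == 0;
      }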
     

03 Sep, 2020

2 commits

  • Fix use-after-free when a shared umem bind fails. The code incorrectly
    tried to free the allocated buffer pool both in the bind code and then
    later also when the socket was released. Fix this by setting the
    buffer pool pointer to NULL after the bind code has freed the pool, so
    that the socket release code will not try to free the pool. This is
    the same solution as the regular, non-shared umem code path has. This
    was missing from the shared umem path.

    Fixes: b5aea28dca13 ("xsk: Add shared umem support between queue ids")
    Reported-by: syzbot+5334f62e4d22804e646a@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1599032164-25684-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Currently, dma_map is being null-checked, when the right object to be
    null-checked is dma_map->dma_pages instead.

    Fix this by null-checking dma_map->dma_pages.

    Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
    Addresses-Coverity-ID: 1496811 ("Logically dead code")
    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20200902150750.GA7257@embeddedor

    Gustavo A. R. Silva
     

02 Sep, 2020

2 commits

  • Fix possible segfault when entry is inserted into xskmap. This can
    happen if the socket is in a state where the umem has been set up, the
    Rx ring created but it has yet to be bound to a device. In this case
    the pool has not yet been created and we cannot reference it for the
    existence of the fill ring. Fix this by removing the whole
    xsk_is_setup_for_bpf_map function. Once upon a time, it was used to
    make sure that the Rx and fill rings were set up before the driver
    could call xsk_rcv, since there are no tests for the existence of
    these rings in the data path. But these days, we have a state variable
    that we test instead. When it is XSK_BOUND, everything has been set up
    correctly and the socket has been bound. So no reason to have the
    xsk_is_setup_for_bpf_map function anymore.

    Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool")
    Reported-by: syzbot+febe51d44243fbc564ee@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1599037569-26690-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Fix possible segfault in the xsk diagnostics code when dumping
    information about the umem. This can happen when a umem has been
    created, but the socket has not been bound yet. In this case, the xsk
    buffer pool does not exist yet and we cannot dump the information
    that was moved from the umem to the buffer pool. Fix this by testing
    for the existence of the buffer pool and if not there, do not dump any
    of that information.

    Fixes: c2d3d6a47462 ("xsk: Move queue_id, dev and need_wakeup to buffer pool")
    Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool")
    Reported-by: syzbot+3f04d36b7336f7868066@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1599036743-26454-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

01 Sep, 2020

10 commits

  • Add support to share a umem between different devices. This mode
    can be invoked with the XDP_SHARED_UMEM bind flag. Previously,
    sharing was only supported within the same device. Note that when
    sharing a umem between devices, just as in the case of sharing a
    umem between queue ids, you need to create a fill ring and a
    completion ring and tie them to the socket (with two setsockopts,
    one for each ring) before you do the bind with the
    XDP_SHARED_UMEM flag. This so that the single-producer
    single-consumer semantics of the rings can be upheld.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-13-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
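
    A hedged userspace sketch of the setup order described above: the new
    socket gets its own fill and completion rings via setsockopt() before
    binding with XDP_SHARED_UMEM against the umem owner's fd (Rx/Tx ring
    setup and error handling omitted):

      #include <string.h>
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      static int bind_shared_umem(int new_fd, int umem_owner_fd,
                                  int ifindex, int queue_id)
      {
              int ring_size = 2048;
              struct sockaddr_xdp sxdp;

              /* Per-socket fill and completion rings, required before bind. */
              if (setsockopt(new_fd, SOL_XDP, XDP_UMEM_FILL_RING,
                             &ring_size, sizeof(ring_size)) < 0)
                      return -1;
              if (setsockopt(new_fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
                             &ring_size, sizeof(ring_size)) < 0)
                      return -1;

              memset(&sxdp, 0, sizeof(sxdp));
              sxdp.sxdp_family         = AF_XDP;
              sxdp.sxdp_ifindex        = ifindex;   /* may differ from the owner's */
              sxdp.sxdp_queue_id       = queue_id;  /* may differ from the owner's */
              sxdp.sxdp_flags          = XDP_SHARED_UMEM;
              sxdp.sxdp_shared_umem_fd = umem_owner_fd;

              return bind(new_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
      }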
     
  • Add support to share a umem between queue ids on the same
    device. This mode can be invoked with the XDP_SHARED_UMEM bind
    flag. Previously, sharing was only supported within the same
    queue id and device, and you shared one set of fill and
    completion rings. However, note that when sharing a umem between
    queue ids, you need to create a fill ring and a completion ring
    and tie them to the socket before you do the bind with the
    XDP_SHARED_UMEM flag. This so that the single-producer
    single-consumer semantics can be upheld.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-12-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Enable the sharing of dma mappings by moving them out from the buffer
    pool. Instead we put each dma mapped umem region in a list in the umem
    structure. If dma has already been mapped for this umem and device, it
    is not mapped again and the existing dma mappings are reused.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-9-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Replicate the addrs pointer in the buffer pool to the umem. This mapping
    will be the same for all buffer pools sharing the same umem. In the
    buffer pool we leave the addrs pointer for performance reasons.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-8-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Move the xsk_tx_list and the xsk_tx_list_lock from the umem to
    the buffer pool. This so that we in a later commit can share the
    umem between multiple HW queues. There is one xsk_tx_list per
    device and queue id, so it should be located in the buffer pool.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-7-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Move queue_id, dev, and need_wakeup from the umem to the
    buffer pool. This so that we in a later commit can share the umem
    between multiple HW queues. There is one buffer pool per dev and
    queue id, so these variables should belong to the buffer pool, not
    the umem. Need_wakeup is also something that is set on a per napi
    level, so there is usually one per device and queue id. So move
    this to the buffer pool too.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-6-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Move the fill and completion rings from the umem to the buffer
    pool. This so that we in a later commit can share the umem
    between multiple HW queue ids. In this case, we need one fill and
    completion ring per queue id. As the buffer pool is per queue id
    and napi id this is a natural place for it and one umem
    structure can be shared between these buffer pools.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-5-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Create and free the buffer pool independently from the umem. Move
    these operations that are performed on the buffer pool from the
    umem create and destroy functions to new create and destroy
    functions just for the buffer pool. This so that in later commits
    we can instantiate multiple buffer pools per umem when sharing a
    umem between HW queues and/or devices. We also eradicate the
    back pointer from the umem to the buffer pool as this will not
    work when we introduce the possibility to have multiple buffer
    pools per umem.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-4-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Rename the AF_XDP zero-copy driver interface functions to better
    reflect what they do after the replacement of umems with buffer
    pools in the previous commit. Mostly it is about replacing the
    umem name from the function names with xsk_buff and also have
    them take a buffer pool pointer instead of a umem. The
    various ring functions have also been renamed in the process so
    that they have the same naming convention as the internal
    functions in xsk_queue.h. This so that it will be clearer what
    they do and also for consistency.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-3-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Replace the explicit umem reference passed to the driver in AF_XDP
    zero-copy mode with the buffer pool instead. This in preparation for
    extending the functionality of the zero-copy mode so that umems can be
    shared between queues on the same netdev and also between netdevs. In
    this commit, only an umem reference has been added to the buffer pool
    struct. But later commits will add other entities to it. These are
    going to be entities that are different between different queue ids
    and netdevs even though the umem is shared between them.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

28 Aug, 2020

2 commits

  • Most of the maps do not use max_entries during verification time.
    Thus, their map_meta_equal() does not need to enforce max_entries
    when such a map is inserted as an inner map at runtime. The max_entries
    check is removed from the default implementation bpf_map_meta_equal().

    The prog_array_map and xsk_map are exceptions. Their map_gen_lookup
    uses max_entries to generate inline lookup code. Thus, they
    implement their own map_meta_equal() to enforce max_entries.
    Since there are only these two cases now, the max_entries check
    is not refactored and stays in its own .c file.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200828011813.1970516-1-kafai@fb.com

    Martin KaFai Lau
     
  • Some properties of the inner map are used at verification time.
    When an inner map is inserted into an outer map at runtime,
    bpf_map_meta_equal() is currently used to ensure that those properties
    of the inserted inner map stay the same as at verification
    time.

    In particular, the current bpf_map_meta_equal() checks max_entries,
    which turns out to be too restrictive for most of the maps, which do
    not use max_entries during verification time. It limits the use case
    that wants to replace a smaller inner map with a larger inner map.
    There are some maps that do use max_entries during verification,
    though. For example, the map_gen_lookup in array_map_ops uses
    max_entries to generate the inline lookup code.

    To accommodate differences between maps, map_meta_equal is added
    to bpf_map_ops. Each map type can decide what to check when its
    map is used as an inner map at runtime.

    Also, some map types cannot be used as an inner map and they are
    currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.
    It is not unusual that new map types may not be aware that such a
    blacklist exists. This patch enforces an explicit opt-in
    and only allows a map to be used as an inner map if it has
    implemented the map_meta_equal ops. It is based on the
    discussion in [1].

    All maps that support being used as an inner map have their
    map_meta_equal pointing to bpf_map_meta_equal in this patch. A later
    patch will relax the max_entries check for most maps. bpf_types.h
    counts 28 map types. This patch adds 23 ".map_meta_equal"
    by using coccinelle. -5 for
    BPF_MAP_TYPE_PROG_ARRAY
    BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
    BPF_MAP_TYPE_STRUCT_OPS
    BPF_MAP_TYPE_ARRAY_OF_MAPS
    BPF_MAP_TYPE_HASH_OF_MAPS

    The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
    is moved such that the same error is returned.

    [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com

    Martin KaFai Lau
     

28 Jul, 2020

1 commit

  • xsk_getsockopt() is copying uninitialized stack memory to userspace when
    'extra_stats' is 'false'. Fix it. Doing '= {};' is sufficient since currently
    'struct xdp_statistics' is defined as follows:

    struct xdp_statistics {
            __u64 rx_dropped;
            __u64 rx_invalid_descs;
            __u64 tx_invalid_descs;
            __u64 rx_ring_full;
            __u64 rx_fill_ring_empty_descs;
            __u64 tx_ring_empty_descs;
    };

    When being copied to the userspace, 'stats' will not contain any uninitialized
    'holes' between struct fields.

    Fixes: 8aa5a33578e9 ("xsk: Add new statistics")
    Suggested-by: Dan Carpenter
    Signed-off-by: Peilin Ye
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Acked-by: Song Liu
    Acked-by: Arnd Bergmann
    Link: https://lore.kernel.org/bpf/20200728053604.404631-1-yepeilin.cs@gmail.com

    Peilin Ye
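
    For reference, the read side that could observe the leaked bytes; a
    hedged userspace sketch:

      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      /* With the fix, every byte copied out is either a real counter or
       * zero, never uninitialized kernel stack contents. */
      static int read_xdp_statistics(int xsk_fd, struct xdp_statistics *stats)
      {
              socklen_t optlen = sizeof(*stats);

              return getsockopt(xsk_fd, SOL_XDP, XDP_STATISTICS, stats, &optlen);
      }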
     

25 Jul, 2020

1 commit

  • Rework the remaining setsockopt code to pass a sockptr_t instead of a
    plain user pointer. This removes the last remaining set_fs(KERNEL_DS)
    outside of architecture specific code.

    Signed-off-by: Christoph Hellwig
    Acked-by: Stefan Schmidt [ieee802154]
    Acked-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Christoph Hellwig