17 Jan, 2021

2 commits

  • commit b1b95cb5c0a9694d47d5f845ba97e226cfda957d upstream.

    Roll back the reservation in the completion ring when we get a
    NETDEV_TX_BUSY. When this error is received from the driver, we are
    supposed to let the user application retry the transmit. And in
    order to do this, we need to roll back the failed send so it can be
    retried. Unfortunately, we did not cancel the reservation we had made
    in the completion ring. By not doing this, we actually make the
    completion ring one entry smaller for every NETDEV_TX_BUSY error we
    get, and after enough of these errors the completion ring will be of
    size zero and transmit will stop working.

    Fix this by cancelling the reservation when we get a NETDEV_TX_BUSY
    error.

    Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    Reported-by: Xuan Zhuo
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201218134525.13119-3-magnus.karlsson@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Magnus Karlsson
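
    A minimal userspace sketch of the reserve/cancel idea (illustrative
    only; the kernel helpers are xskq_prod_reserve()/xskq_prod_cancel()
    and differ in detail): a failed send must hand back the completion
    slot it reserved, otherwise every NETDEV_TX_BUSY permanently shrinks
    the usable ring by one entry.

      #include <stdbool.h>
      #include <stdint.h>

      /* Single-producer ring tracked through a cached producer index. */
      struct prod_ring {
              uint32_t size;         /* total number of slots              */
              uint32_t cached_prod;  /* producer index incl. reservations  */
              uint32_t cached_cons;  /* last observed consumer index       */
      };

      static bool prod_reserve(struct prod_ring *r)
      {
              if (r->cached_prod - r->cached_cons == r->size)
                      return false;   /* ring full */
              r->cached_prod++;       /* claim one slot */
              return true;
      }

      /* The step that was missing on the NETDEV_TX_BUSY path. */
      static void prod_cancel(struct prod_ring *r)
      {
              r->cached_prod--;       /* give the claimed slot back */
      }

    On the error path the send code then reserves a slot, attempts the
    transmit, and calls prod_cancel() if the driver reports busy, so that
    the very same send can simply be retried later.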
     
  • commit f09ced4053bc0a2094a12b60b646114c966ef4c6 upstream.

    Fix a race when multiple sockets are simultaneously calling sendto()
    when the completion ring is shared in the SKB case. This is the case
    when you share the same netdev and queue id through the
    XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
    be in xsk_generic_xmit() and call the backpressure mechanism in
    xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
    specific scenario, a race might occur since the rings are
    single-producer single-consumer.

    Fix this by moving the tx_completion_lock from the socket to the pool
    as the pool is shared between the sockets that share the completion
    ring. (The pool is not shared when this is not the case.) And then
    protect the accesses to xskq_prod_reserve() with this lock. The
    tx_completion_lock is renamed cq_lock to better reflect that it
    protects accesses to the potentially shared completion ring.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Reported-by: Xuan Zhuo
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Magnus Karlsson
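
    A hedged userspace model of the locking change (names follow the
    commit text; the kernel implementation differs): the lock that
    serializes producers of the shared completion ring has to live in the
    shared pool, not in the per-socket structure.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdint.h>

      /* One pool per (netdev, queue id); shared by sockets bound with
       * XDP_SHARED_UMEM, so the cq_lock naturally belongs here.
       * (pthread_spin_init() on the lock is omitted for brevity.) */
      struct xsk_pool_model {
              pthread_spinlock_t cq_lock;   /* was tx_completion_lock per socket */
              uint32_t cq_prod;
              uint32_t cq_cons;
              uint32_t cq_size;
      };

      /* Every socket sharing the pool must take cq_lock before producing,
       * since the ring itself is single-producer/single-consumer. */
      static bool cq_reserve(struct xsk_pool_model *p)
      {
              bool ok;

              pthread_spin_lock(&p->cq_lock);
              ok = (p->cq_prod - p->cq_cons) != p->cq_size;
              if (ok)
                      p->cq_prod++;
              pthread_spin_unlock(&p->cq_lock);
              return ok;
      }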
     

13 Jan, 2021

1 commit

  • commit 8bee683384087a6275c9183a483435225f7bb209 upstream.

    Fix a possible memory leak when a bind of an AF_XDP socket fails. When
    the fill and completion rings are created, they are tied to the
    socket. But when the buffer pool is later created at bind time, the
    ownership of these two rings is transferred to the buffer pool as
    they might be shared between sockets (and the buffer pool cannot be
    created until we know what we are binding to). So, before the buffer
    pool is created, these two rings are cleaned up with the socket, and
    after they have been transferred they are cleaned up together with
    the buffer pool.

    The problem is that ownership was transferred before it was absolutely
    certain that the buffer pool could be created and initialized
    correctly, and when one of these errors occurred, the fill and
    completion rings belonged neither to the socket nor to the pool and
    were therefore leaked. Solve this by moving the ownership transfer
    to the point where the buffer pool has been completely set up and
    there is no way it can fail.

    Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool")
    Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Magnus Karlsson
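
    The fix is essentially an ordering rule; a hedged sketch, with
    hypothetical names, of the "transfer ownership only after nothing can
    fail any more" pattern:

      struct rings { int fill_and_comp; };            /* stand-in for the two rings */
      struct xsk_sock_model { struct rings *rings; };
      struct pool_model { struct rings *rings; };

      /* Stand-in for all the fallible buffer pool initialization. */
      static int pool_setup(struct pool_model *pool) { (void)pool; return 0; }

      static int do_bind(struct xsk_sock_model *xs, struct pool_model *pool)
      {
              if (pool_setup(pool))
                      return -1;          /* rings still owned (and freed) with the socket */

              pool->rings = xs->rings;    /* ownership moves only after full setup */
              xs->rings = NULL;           /* the socket no longer frees them */
              return 0;
      }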
     

04 Dec, 2020

1 commit

  • If force_zc is set, we should exit with an error rather than fall back
    to copy mode.

    Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
    Reported-by: Hulk Robot
    Signed-off-by: Zhang Changzhong
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/1607077277-41995-1-git-send-email-zhangchangzhong@huawei.com

    Zhang Changzhong
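
    From the application's point of view the change is visible at bind
    time; a hedged sketch using the AF_XDP UAPI (umem and ring setup, and
    most error handling, omitted):

      #include <errno.h>
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      /* Ask for zero-copy explicitly. With this fix the bind fails with an
       * error instead of silently degrading to copy mode; the caller can
       * then decide whether retrying with XDP_COPY is acceptable. */
      static int bind_zerocopy(int xsk_fd, int ifindex, int queue_id)
      {
              struct sockaddr_xdp sxdp = {
                      .sxdp_family   = AF_XDP,
                      .sxdp_ifindex  = ifindex,
                      .sxdp_queue_id = queue_id,
                      .sxdp_flags    = XDP_ZEROCOPY,  /* force_zc on the kernel side */
              };

              if (bind(xsk_fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0)
                      return -errno;
              return 0;
      }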
     

03 Dec, 2020

2 commits

  • Modify the tx writeable condition from "the queue is not full" to "the
    number of outstanding entries in the tx queue is less than half of the
    total number of entries". Because the queue being not full lasts only a
    very short time, the old condition caused a large number of EPOLLOUT
    events and, with them, a large number of process wakeups.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Xuan Zhuo
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/508fef55188d4e1160747ead64c6dcda36735880.1606555939.git.xuanzhuo@linux.alibaba.com

    Xuan Zhuo
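
    A sketch of the new condition (names hypothetical; the in-kernel
    helper and counters differ): the socket is reported writeable only
    while less than half of the tx ring is outstanding, so EPOLLOUT no
    longer fires every time a single slot briefly frees up.

      #include <stdbool.h>
      #include <stdint.h>

      static bool tx_writeable(uint32_t outstanding_entries, uint32_t ring_size)
      {
              return outstanding_entries < ring_size / 2;
      }

    The poll mask would then include EPOLLOUT only when tx_writeable()
    holds, batching wakeups instead of generating one per freed slot.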
     
  • datagram_poll determines the current socket status (EPOLLIN, EPOLLOUT)
    based on traditional socket state (e.g. sk_wmem_alloc), but this does
    not apply to xsk. So this patch uses sock_poll_wait instead of
    datagram_poll, and the mask is calculated by xsk_poll.

    Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
    Signed-off-by: Xuan Zhuo
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/e82f4697438cd63edbf271ebe1918db8261b7c09.1606555939.git.xuanzhuo@linux.alibaba.com

    Xuan Zhuo
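
    The userspace API is unchanged; a hedged example of waiting on an
    AF_XDP socket, where the returned mask is now computed by xsk_poll():

      #include <poll.h>

      static int wait_for_xsk(int xsk_fd, int timeout_ms)
      {
              struct pollfd pfd = {
                      .fd     = xsk_fd,
                      .events = POLLIN | POLLOUT,
              };

              /* POLLIN/POLLOUT in pfd.revents now reflect the state of the
               * xsk rings rather than generic socket accounting. */
              return poll(&pfd, 1, timeout_ms);
      }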
     

25 Nov, 2020

1 commit

  • Commit 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    addressed the problem that packets were discarded from the Tx AF_XDP
    ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was
    bumping the skbuff reference count, so that the buffer would not be
    freed by dev_direct_xmit(). A reference count larger than one means
    that the skbuff is "shared", which is not the case.

    If the "shared" skbuff is sent to the generic XDP receive path,
    netif_receive_generic_xdp(), and pskb_expand_head() is entered, the
    BUG_ON(skb_shared(skb)) will trigger.

    This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(),
    where a user can select the skbuff free policy. This allows AF_XDP to
    avoid bumping the reference count, but still keep the NETDEV_TX_BUSY
    behavior.

    Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
    Reported-by: Yonghong Song
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com

    Björn Töpel
     

23 Nov, 2020

1 commit

  • Fix an incorrect netdev reference count in the xsk_bind operation. The
    incorrect reference count appears when a user calls bind with the
    XDP_ZEROCOPY flag on an interface which does not support zero-copy.
    In such a case, an error is returned but the reference count is not
    decreased. This change fixes the fault by decreasing the reference
    count in such an error case.

    The problem first appeared in commit 162c820ed896, and the code was
    later moved to a new file location by commit c2d3d6a47462. This patch
    applies to all versions starting from c2d3d6a47462. For versions from
    162c820ed896 up to, but excluding, c2d3d6a47462, the same fix should be
    applied to a different file (net/xdp/xdp_umem.c) and function
    (xdp_umem_assign_dev).

    Fixes: 162c820ed896 ("xdp: hold device for umem regardless of zero-copy mode")
    Signed-off-by: Marek Majtyka
    Signed-off-by: Daniel Borkmann
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20201120151443.105903-1-marekx.majtyka@intel.com

    Marek Majtyka
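
    The pattern being restored, as a hedged sketch with hypothetical
    names (the kernel itself uses dev_hold()/dev_put() on the netdev):

      /* Every reference taken before a failure must be dropped again on
       * that failure path. */
      struct netdev_model { int refcnt; };

      static void dev_take(struct netdev_model *d) { d->refcnt++; }
      static void dev_drop(struct netdev_model *d) { d->refcnt--; }

      static int bind_to_dev(struct netdev_model *d, int want_zc, int dev_has_zc)
      {
              dev_take(d);
              if (want_zc && !dev_has_zc) {
                      dev_drop(d);    /* the missing step: undo the hold before erroring out */
                      return -1;      /* e.g. zero-copy not supported */
              }
              return 0;
      }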
     

20 Nov, 2020

1 commit

  • Fix a bug that is triggered when a partially setup socket is
    destroyed. For a fully setup socket, a socket that has been bound to a
    device, the cleanup of the umem is performed at the end of the buffer
    pool's cleanup work queue item. This has to be performed in a work
    queue, and not in RCU cleanup, as it is doing a vunmap that cannot
    execute in interrupt context. However, when a socket has only been
    partially set up so that a umem has been created but the buffer pool
    has not, the code erroneously directly calls the umem cleanup function
    instead of using a work queue, and this leads to a BUG_ON() in
    vunmap().

    As there is no buffer pool in this case, we cannot use its work queue,
    so we need to introduce a work queue for the umem and schedule this for
    the cleanup. So in the case there is no pool, we are going to use the
    umem's own work queue to schedule the cleanup. But if there is a
    pool, the cleanup of the umem is still being performed by the pool's
    work queue, as it is important that the umem is cleaned up after the
    pool.

    Fixes: e5e1a4bc916d ("xsk: Fix possible memory leak at socket close")
    Reported-by: Marek Majtyka
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Tested-by: Marek Majtyka
    Link: https://lore.kernel.org/bpf/1605873219-21629-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

29 Oct, 2020

1 commit

  • Fix a possible memory leak at xsk socket close that is caused by the
    refcounting of the umem object being wrong. The reference count of the
    umem was decremented only after the pool had been freed. Note that if
    the buffer pool is destroyed, it is important that the umem is
    destroyed after the pool, otherwise the umem would disappear while the
    driver is still running. And as the buffer pool needs to be destroyed
    in a work queue, the umem is also (if its refcount reaches zero)
    destroyed after the buffer pool in that same work queue.

    What was missing is that the refcount also needs to be decremented
    when the pool is not freed and when the pool has not even been
    created. The first case happens when the refcount of the pool is
    higher than 1, i.e. it is still being used by some other socket using
    the same device and queue id. In this case, it is safe to decrement
    the refcount of the umem outside of the work queue as the umem will
    never be freed because the refcount of the umem is always greater than
    or equal to the refcount of the buffer pool. The second case is if the
    buffer pool has not been created yet, i.e. the socket was closed
    before it was bound but after the umem was created. In this case, it
    is safe to destroy the umem outside of the work queue, since there is
    no pool that can use it by definition.

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Reported-by: syzbot+eb71df123dc2be2c1456@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1603801921-2712-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

13 Oct, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-10-12

    The main changes are:

    1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.

    2) libbpf relocation support for different size load/store, from Andrii.

    3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.

    4) BPF support for per-cpu variables, from Hao.

    5) sockmap improvements, from John.
    ====================

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

12 Oct, 2020

1 commit

  • Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
    ("bpf: Relax max_entries check for most of the inner map types") added support
    for dynamic inner max elements for most map-in-map types. Exceptions were maps
    like array or prog array where the map_gen_lookup() callback uses the maps'
    max_entries field as a constant when emitting instructions.

    We recently implemented Maglev consistent hashing in Cilium's load
    balancer, which uses map-in-map with an outer hash map and inner array
    maps holding the Maglev backend table for each service. It was designed
    this way to reduce overall memory consumption, given that the outer
    hash map avoids preallocating a large, flat memory area for all
    services. Also, the number of service mappings is not always known
    a priori.

    The use case for dynamic inner array map entries is to further reduce
    memory overhead: some services might have only a small number of
    backends while others could have a large number. Right now, the Maglev
    backend tables for services with few and with many backends must have
    the same number of inner array map entries, which adds a lot of
    unneeded overhead.

    Dynamic inner array map entries can be realized by avoiding the inlined
    code generation for their lookup. The lookup will still be efficient
    since it will call into array_map_lookup_elem() directly and thus avoid
    a retpoline. The patch adds a BPF_F_INNER_MAP flag to map creation,
    which skips the inline code generation and relaxes the
    array_map_meta_equal() check to ignore both maps' max_entries. This
    still allows faster lookups for map-in-map when BPF_F_INNER_MAP is not
    specified and dynamic max_entries is therefore not needed.

    Example code generation where inner map is dynamic sized array:

    # bpftool p d x i 125
    int handle__sys_enter(void * ctx):
    ; int handle__sys_enter(void *ctx)
    0: (b4) w1 = 0
    ; int key = 0;
    1: (63) *(u32 *)(r10 -4) = r1
    2: (bf) r2 = r10
    ;
    3: (07) r2 += -4
    ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
    4: (18) r1 = map[id:468]
    6: (07) r1 += 272
    7: (61) r0 = *(u32 *)(r2 +0)
    8: (35) if r0 >= 0x3 goto pc+5
    9: (67) r0 <
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net

    Daniel Borkmann
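
    A hedged userspace sketch of creating such maps through the raw
    bpf(2) syscall (error handling trimmed; assumes kernel headers new
    enough to define BPF_F_INNER_MAP):

      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      static int bpf_map_create_raw(union bpf_attr *attr)
      {
              return syscall(__NR_bpf, BPF_MAP_CREATE, attr, sizeof(*attr));
      }

      /* Inner array: BPF_F_INNER_MAP skips the inlined lookup so its
       * max_entries no longer has to match the verification-time meta map. */
      static int create_inner_array(unsigned int max_entries)
      {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.map_type    = BPF_MAP_TYPE_ARRAY;
              attr.key_size    = sizeof(unsigned int);
              attr.value_size  = sizeof(unsigned int);
              attr.max_entries = max_entries;
              attr.map_flags   = BPF_F_INNER_MAP;
              return bpf_map_create_raw(&attr);
      }

      /* Outer array-of-maps; different-sized inner arrays created as above
       * can then be inserted into it at runtime. */
      static int create_outer_map(int inner_fd, unsigned int nr_slots)
      {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.map_type     = BPF_MAP_TYPE_ARRAY_OF_MAPS;
              attr.key_size     = sizeof(unsigned int);
              attr.value_size   = sizeof(unsigned int);  /* slot holds a map fd */
              attr.max_entries  = nr_slots;
              attr.inner_map_fd = inner_fd;
              return bpf_map_create_raw(&attr);
      }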
     

09 Oct, 2020

1 commit

  • Introduce one cache line worth of padding between the producer and
    consumer pointers in all the lockless rings. This is so that the HW
    adjacency prefetcher will not prefetch the consumer pointer when the
    producer pointer is used, and vice versa. This improves throughput for
    the l2fwd sample app by 2% on my machine with HW prefetching turned
    on.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1602166338-21378-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
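
    A hedged sketch of the layout idea (the kernel uses its own cacheline
    alignment macros; 64-byte lines assumed here): a full spare cache line
    sits between the producer and consumer indices, so touching one never
    makes the adjacency prefetcher pull in the line holding the other.

      #include <stdint.h>

      #define CACHELINE 64    /* assumption for this example */

      struct ring_indices {
              uint32_t producer;
              uint8_t  pad_rest_of_line[CACHELINE - sizeof(uint32_t)];
              uint8_t  pad_spare_line[CACHELINE];   /* the added cache line of padding */
              uint32_t consumer;
      };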
     

06 Oct, 2020

1 commit


05 Oct, 2020

1 commit

  • Christoph Hellwig correctly pointed out [1] that the AF_XDP core was
    pointlessly including internal headers. Let us remove those includes.

    [1] https://lore.kernel.org/bpf/20201005084341.GA3224@infradead.org/

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Reported-by: Christoph Hellwig
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Acked-by: Christoph Hellwig
    Link: https://lore.kernel.org/bpf/20201005090525.116689-1-bjorn.topel@gmail.com

    Björn Töpel
     

30 Sep, 2020

1 commit

  • After 'peeking' the ring, the consumer, not the producer, reads the data.
    Fix this mistake in the comments.

    Fixes: 15d8c9162ced ("xsk: Add function naming comments and reorder functions")
    Signed-off-by: Ciara Loftus
    Signed-off-by: Alexei Starovoitov
    Acked-by: Magnus Karlsson
    Link: https://lore.kernel.org/bpf/20200928082344.17110-1-ciara.loftus@intel.com

    Ciara Loftus
     

29 Sep, 2020

1 commit

  • Fix possible crash in socket_release when an out-of-memory error has
    occurred in the bind call. If a socket using the XDP_SHARED_UMEM flag
    encountered an error in xp_create_and_assign_umem, the bind code
    jumped to the exit routine but erroneously forgot to set the err value
    before jumping. This meant that the exit routine thought the setup
    went well and set the state of the socket to XSK_BOUND. The xsk socket
    release code will then, at application exit, treat this as a properly
    set up socket when it is not, leading to a crash because not all
    fields in the socket have been initialized properly. Fix this by
    setting the err variable in xsk_bind so that the socket is never
    marked XSK_BOUND, which means the clean-up for bound sockets in
    xsk_release is not triggered on this half-initialized socket.

    Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem")
    Reported-by: syzbot+ddc7b4944bc61da19b81@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1601112373-10595-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     

24 Sep, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-09-23

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 95 non-merge commits during the last 22 day(s) which contain
    a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

    The main changes are:

    1) Full multi function support in libbpf, from Andrii.

    2) Refactoring of function argument checks, from Lorenz.

    3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.

    4) Program metadata support, from YiFei.

    5) bpf iterator optimizations, from Yonghong.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Sep, 2020

1 commit

  • Two minor conflicts:

    1) net/ipv4/route.c, adding a new local variable while
    moving another local variable and removing it's
    initial assignment.

    2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
    One pretty prints the port mode differently, whilst another
    changes the driver to try and obtain the port mode from
    the port node rather than the switch node.

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Sep, 2020

1 commit

  • In the skb Tx path, transmission of a packet is performed with
    dev_direct_xmit(). When a driver returns NETDEV_TX_BUSY, it signifies
    that it was not possible to send the packet right now, and that the
    caller should try again later. Unfortunately, the xsk transmit code
    discarded the packet and returned EBUSY to the application. Fix this
    unnecessary packet loss by not discarding the packet in the Tx ring
    and returning EAGAIN instead. As EAGAIN is returned to the
    application, it can retry the send operation later, and the packet
    will then likely be sent as the driver will likely have the
    space/resources to send it by then.

    In summary, EAGAIN tells the application that the packet was not
    discarded from the Tx ring and that it needs to call send()
    again. EBUSY, on the other hand, signifies that the packet was not
    sent and was discarded from the Tx ring. The application needs to put
    the packet on the Tx ring again if it wants it to be sent.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Reported-by: Arkadiusz Zema
    Suggested-by: Arkadiusz Zema
    Suggested-by: Daniel Borkmann
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jesse Brandeburg
    Link: https://lore.kernel.org/bpf/1600257625-2353-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
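
    On the application side the visible difference is the errno; a hedged
    sketch of the usual AF_XDP tx "kick" with a retry on EAGAIN:

      #include <errno.h>
      #include <stddef.h>
      #include <sys/socket.h>

      /* Ask the kernel to transmit the descriptors already placed on the
       * Tx ring. With this fix, EAGAIN means nothing was discarded and the
       * same descriptors will go out on a later attempt. */
      static int xsk_tx_kick(int xsk_fd)
      {
              for (;;) {
                      if (sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0) >= 0)
                              return 0;
                      if (errno == EAGAIN)
                              continue;     /* or poll()/back off before retrying */
                      return -errno;
              }
      }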
     

15 Sep, 2020

2 commits

  • Fix a potential refcount warning about a zero value being increased to
    one in xp_dma_map by initializing the refcount to one from the start,
    instead of to zero followed by a refcount_inc().

    Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/1600095036-23868-1-git-send-email-magnus.karlsson@gmail.com

    Magnus Karlsson
     
  • For AF_XDP sockets, there was a discrepancy between the number of
    pinned pages and the size of the umem region.

    The size of the umem region is used to validate the AF_XDP descriptor
    addresses. The logic that pinned the pages covered by the region only
    took whole pages into consideration, creating a mismatch between the
    size and pinned pages. A user could then pass AF_XDP addresses outside
    the range of pinned pages, but still within the size of the region,
    crashing the kernel.

    This change correctly calculates the number of pages to be
    pinned. Further, the size check for the aligned mode is
    simplified. Now the code simply checks if the size is divisible by the
    chunk size.

    Fixes: bbff2f321a86 ("xsk: new descriptor addressing scheme")
    Reported-by: Ciara Loftus
    Signed-off-by: Björn Töpel
    Signed-off-by: Alexei Starovoitov
    Tested-by: Ciara Loftus
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200910075609.7904-1-bjorn.topel@gmail.com

    Björn Töpel
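
    A hedged sketch of the two calculations involved (page size assumed;
    names hypothetical):

      #include <stdbool.h>
      #include <stdint.h>

      #define EXAMPLE_PAGE_SIZE 4096ULL

      /* Pin every page the region touches, including a trailing partial
       * page, so the pinned range always covers the full validated size. */
      static uint64_t umem_nr_pages(uint64_t size)
      {
              return (size + EXAMPLE_PAGE_SIZE - 1) / EXAMPLE_PAGE_SIZE;
      }

      /* Aligned mode: the region must hold a whole number of chunks. */
      static bool umem_size_valid(uint64_t size, uint32_t chunk_size)
      {
              return chunk_size != 0 && size % chunk_size == 0;
      }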
     

03 Sep, 2020

2 commits

  • Fix use-after-free when a shared umem bind fails. The code incorrectly
    tried to free the allocated buffer pool both in the bind code and then
    later also when the socket was released. Fix this by setting the
    buffer pool pointer to NULL after the bind code has freed the pool, so
    that the socket release code will not try to free the pool. This is
    the same solution as the regular, non-shared umem code path has. This
    was missing from the shared umem path.

    Fixes: b5aea28dca13 ("xsk: Add shared umem support between queue ids")
    Reported-by: syzbot+5334f62e4d22804e646a@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1599032164-25684-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Currently, dma_map is being null-checked, when the right object to be
    null-checked is dma_map->dma_pages instead.

    Fix this by null-checking dma_map->dma_pages.

    Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
    Addresses-Coverity-ID: 1496811 ("Logically dead code")
    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20200902150750.GA7257@embeddedor

    Gustavo A. R. Silva
     

02 Sep, 2020

2 commits

  • Fix possible segfault when entry is inserted into xskmap. This can
    happen if the socket is in a state where the umem has been set up, the
    Rx ring created but it has yet to be bound to a device. In this case
    the pool has not yet been created and we cannot reference it for the
    existence of the fill ring. Fix this by removing the whole
    xsk_is_setup_for_bpf_map function. Once upon a time, it was used to
    make sure that the Rx and fill rings were set up before the driver
    could call xsk_rcv, since there are no tests for the existence of
    these rings in the data path. But these days, we have a state variable
    that we test instead. When it is XSK_BOUND, everything has been set up
    correctly and the socket has been bound. So no reason to have the
    xsk_is_setup_for_bpf_map function anymore.

    Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool")
    Reported-by: syzbot+febe51d44243fbc564ee@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1599037569-26690-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Fix possible segfault in the xsk diagnostics code when dumping
    information about the umem. This can happen when a umem has been
    created, but the socket has not been bound yet. In this case, the xsk
    buffer pool does not exist yet and we cannot dump the information
    that was moved from the umem to the buffer pool. Fix this by testing
    for the existence of the buffer pool and if not there, do not dump any
    of that information.

    Fixes: c2d3d6a47462 ("xsk: Move queue_id, dev and need_wakeup to buffer pool")
    Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool")
    Reported-by: syzbot+3f04d36b7336f7868066@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1599036743-26454-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

01 Sep, 2020

10 commits

  • Add support to share a umem between different devices. This mode
    can be invoked with the XDP_SHARED_UMEM bind flag. Previously,
    sharing was only supported within the same device. Note that when
    sharing a umem between devices, just as in the case of sharing a
    umem between queue ids, you need to create a fill ring and a
    completion ring and tie them to the socket (with two setsockopts,
    one for each ring) before you do the bind with the
    XDP_SHARED_UMEM flag. This so that the single-producer
    single-consumer semantics of the rings can be upheld.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-13-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
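
    A hedged userspace sketch of the setup order described above: the new
    socket gets its own fill and completion rings via setsockopt() before
    binding with XDP_SHARED_UMEM against the umem owner's fd (Rx/Tx ring
    setup and error handling omitted):

      #include <string.h>
      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      static int bind_shared_umem(int new_fd, int umem_owner_fd,
                                  int ifindex, int queue_id)
      {
              int ring_size = 2048;
              struct sockaddr_xdp sxdp;

              /* Per-socket fill and completion rings, required before bind. */
              if (setsockopt(new_fd, SOL_XDP, XDP_UMEM_FILL_RING,
                             &ring_size, sizeof(ring_size)) < 0)
                      return -1;
              if (setsockopt(new_fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
                             &ring_size, sizeof(ring_size)) < 0)
                      return -1;

              memset(&sxdp, 0, sizeof(sxdp));
              sxdp.sxdp_family         = AF_XDP;
              sxdp.sxdp_ifindex        = ifindex;   /* may differ from the owner's */
              sxdp.sxdp_queue_id       = queue_id;  /* may differ from the owner's */
              sxdp.sxdp_flags          = XDP_SHARED_UMEM;
              sxdp.sxdp_shared_umem_fd = umem_owner_fd;

              return bind(new_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
      }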
     
  • Add support to share a umem between queue ids on the same
    device. This mode can be invoked with the XDP_SHARED_UMEM bind
    flag. Previously, sharing was only supported within the same
    queue id and device, and you shared one set of fill and
    completion rings. However, note that when sharing a umem between
    queue ids, you need to create a fill ring and a completion ring
    and tie them to the socket before you do the bind with the
    XDP_SHARED_UMEM flag. This so that the single-producer
    single-consumer semantics can be upheld.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-12-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Enable the sharing of dma mappings by moving them out from the buffer
    pool. Instead we put each dma mapped umem region in a list in the umem
    structure. If dma has already been mapped for this umem and device, it
    is not mapped again and the existing dma mappings are reused.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-9-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Replicate the addrs pointer in the buffer pool to the umem. This mapping
    will be the same for all buffer pools sharing the same umem. In the
    buffer pool we leave the addrs pointer for performance reasons.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-8-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Move the xsk_tx_list and the xsk_tx_list_lock from the umem to
    the buffer pool. This so that we in a later commit can share the
    umem between multiple HW queues. There is one xsk_tx_list per
    device and queue id, so it should be located in the buffer pool.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-7-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Move queue_id, dev, and need_wakeup from the umem to the
    buffer pool. This so that we in a later commit can share the umem
    between multiple HW queues. There is one buffer pool per dev and
    queue id, so these variables should belong to the buffer pool, not
    the umem. Need_wakeup is also something that is set on a per napi
    level, so there is usually one per device and queue id. So move
    this to the buffer pool too.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-6-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Move the fill and completion rings from the umem to the buffer
    pool. This so that we in a later commit can share the umem
    between multiple HW queue ids. In this case, we need one fill and
    completion ring per queue id. As the buffer pool is per queue id
    and napi id this is a natural place for it and one umem
    structure can be shared between these buffer pools.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-5-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Create and free the buffer pool independently from the umem. Move
    these operations that are performed on the buffer pool from the
    umem create and destroy functions to new create and destroy
    functions just for the buffer pool. This so that in later commits
    we can instantiate multiple buffer pools per umem when sharing a
    umem between HW queues and/or devices. We also eradicate the
    back pointer from the umem to the buffer pool as this will not
    work when we introduce the possibility to have multiple buffer
    pools per umem.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-4-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Rename the AF_XDP zero-copy driver interface functions to better
    reflect what they do after the replacement of umems with buffer
    pools in the previous commit. Mostly it is about replacing the
    umem name from the function names with xsk_buff and also have
    them take a buffer pool pointer instead of a umem. The
    various ring functions have also been renamed in the process so
    that they have the same naming convention as the internal
    functions in xsk_queue.h. This so that it will be clearer what
    they do and also for consistency.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-3-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Replace the explicit umem reference passed to the driver in AF_XDP
    zero-copy mode with the buffer pool instead. This in preparation for
    extending the functionality of the zero-copy mode so that umems can be
    shared between queues on the same netdev and also between netdevs. In
    this commit, only an umem reference has been added to the buffer pool
    struct. But later commits will add other entities to it. These are
    going to be entities that are different between different queue ids
    and netdevs even though the umem is shared between them.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

28 Aug, 2020

2 commits

  • Most of the maps do not use max_entries during verification time.
    Thus, their map_meta_equal() does not need to enforce max_entries
    when such a map is inserted as an inner map at runtime. The max_entries
    check is removed from the default implementation bpf_map_meta_equal().

    The prog_array_map and xsk_map are exceptions. Their map_gen_lookup
    uses max_entries to generate inline lookup code. Thus, they
    implement their own map_meta_equal() to enforce max_entries.
    Since there are only these two cases now, the max_entries check
    is not refactored and stays in its own .c file.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200828011813.1970516-1-kafai@fb.com

    Martin KaFai Lau
     
  • Some properties of the inner map are used at verification time.
    When an inner map is inserted into an outer map at runtime,
    bpf_map_meta_equal() is currently used to ensure that those properties
    of the inserted inner map stay the same as at verification
    time.

    In particular, the current bpf_map_meta_equal() checks max_entries,
    which turns out to be too restrictive for most of the maps, which do
    not use max_entries during verification time. It limits the use case
    that wants to replace a smaller inner map with a larger inner map.
    There are some maps that do use max_entries during verification,
    though. For example, the map_gen_lookup in array_map_ops uses
    max_entries to generate the inline lookup code.

    To accommodate differences between maps, map_meta_equal is added
    to bpf_map_ops. Each map type can decide what to check when its
    map is used as an inner map at runtime.

    Also, some map types cannot be used as an inner map and they are
    currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.
    It is not unusual that new map types may not be aware that such a
    blacklist exists. This patch enforces an explicit opt-in
    and only allows a map to be used as an inner map if it has
    implemented the map_meta_equal ops. It is based on the
    discussion in [1].

    All maps that support being used as an inner map have their
    map_meta_equal pointing to bpf_map_meta_equal in this patch. A later
    patch will relax the max_entries check for most maps. bpf_types.h
    counts 28 map types. This patch adds 23 ".map_meta_equal"
    by using coccinelle. -5 for
    BPF_MAP_TYPE_PROG_ARRAY
    BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
    BPF_MAP_TYPE_STRUCT_OPS
    BPF_MAP_TYPE_ARRAY_OF_MAPS
    BPF_MAP_TYPE_HASH_OF_MAPS

    The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
    is moved such that the same error is returned.

    [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com

    Martin KaFai Lau
     

28 Jul, 2020

1 commit

  • xsk_getsockopt() is copying uninitialized stack memory to userspace when
    'extra_stats' is 'false'. Fix it. Doing '= {};' is sufficient since currently
    'struct xdp_statistics' is defined as follows:

    struct xdp_statistics {
            __u64 rx_dropped;
            __u64 rx_invalid_descs;
            __u64 tx_invalid_descs;
            __u64 rx_ring_full;
            __u64 rx_fill_ring_empty_descs;
            __u64 tx_ring_empty_descs;
    };

    When being copied to the userspace, 'stats' will not contain any uninitialized
    'holes' between struct fields.

    Fixes: 8aa5a33578e9 ("xsk: Add new statistics")
    Suggested-by: Dan Carpenter
    Signed-off-by: Peilin Ye
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Acked-by: Song Liu
    Acked-by: Arnd Bergmann
    Link: https://lore.kernel.org/bpf/20200728053604.404631-1-yepeilin.cs@gmail.com

    Peilin Ye
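
    For reference, the read side that could observe the leaked bytes; a
    hedged userspace sketch:

      #include <sys/socket.h>
      #include <linux/if_xdp.h>

      /* With the fix, every byte copied out is either a real counter or
       * zero, never uninitialized kernel stack contents. */
      static int read_xdp_statistics(int xsk_fd, struct xdp_statistics *stats)
      {
              socklen_t optlen = sizeof(*stats);

              return getsockopt(xsk_fd, SOL_XDP, XDP_STATISTICS, stats, &optlen);
      }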
     

25 Jul, 2020

1 commit

  • Rework the remaining setsockopt code to pass a sockptr_t instead of a
    plain user pointer. This removes the last remaining set_fs(KERNEL_DS)
    outside of architecture specific code.

    Signed-off-by: Christoph Hellwig
    Acked-by: Stefan Schmidt [ieee802154]
    Acked-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Christoph Hellwig