06 Feb, 2021

1 commit

  • Commit c80794323e82 ("net: Fix packet reordering caused by GRO and
    listified RX cooperation") had the unfortunate effect of adding
    latencies in common workloads.

    Before the patch, GRO packets were immediately passed to
    upper stacks.

After the patch, we can accumulate quite a lot of GRO
packets (depending on NAPI budget).

My fix is to count the number of segments in napi->rx_count
instead of the number of logical packets.
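
    A minimal sketch of the idea (helper names per the existing
    listified-RX code: gro_normal_one(), gro_normal_list() and the
    gro_normal_batch sysctl): flush the batch once enough segments, not
    packets, have accumulated, which bounds the latency a coalesced
    packet can add.

    static void gro_normal_one(struct napi_struct *napi,
                               struct sk_buff *skb, int segs)
    {
            list_add_tail(&skb->list, &napi->rx_list);
            napi->rx_count += segs;        /* was: napi->rx_count++ */
            if (napi->rx_count >= gro_normal_batch)
                    gro_normal_list(napi); /* flush batch to upper stacks */
    }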

    Fixes: c80794323e82 ("net: Fix packet reordering caused by GRO and listified RX cooperation")
    Signed-off-by: Eric Dumazet
    Bisected-by: John Sperbeck
    Tested-by: Jian Yang
    Cc: Maxim Mikityanskiy
    Reviewed-by: Saeed Mahameed
    Reviewed-by: Edward Cree
    Reviewed-by: Alexander Lobakin
    Link: https://lore.kernel.org/r/20210204213146.4192368-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     

05 Feb, 2021

1 commit

When iteratively computing a checksum with csum_block_add, track the
offset "pos" so that csum_block_add can rotate the intermediate
checksum correctly when the offset is odd.

The open-coded implementation of skb_copy_and_csum_datagram did this.
    With the switch to __skb_datagram_iter calling csum_and_copy_to_iter,
    pos was reinitialized to 0 on each call.

Bring back pos by passing it along with the csum to the callback.
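
    For reference, the helper itself (roughly as defined in
    include/net/checksum.h) shows why a stale pos breaks the fold: the
    partial sum must be byte-rotated whenever its block starts at an odd
    offset.

    static inline __wsum csum_block_add(__wsum csum, __wsum csum2, int offset)
    {
            u32 sum = (__force u32)csum2;

            /* rotate sum to align it with a 16b boundary */
            if (offset & 1)
                    sum = ror32(sum, 8);

            return csum_add(csum, (__force __wsum)sum);
    }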

    Changes v1->v2
    - pass csum value, instead of csump pointer (Alexander Duyck)

    Link: https://lore.kernel.org/netdev/20210128152353.GB27281@optiplex/
    Fixes: 950fcaecd5cc ("datagram: consolidate datagram copy to iter helpers")
    Reported-by: Oliver Graute
    Signed-off-by: Willem de Bruijn
    Reviewed-by: Alexander Duyck
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20210203192952.1849843-1-willemdebruijn.kernel@gmail.com
    Signed-off-by: Jakub Kicinski

    Willem de Bruijn
     

31 Jan, 2021

1 commit

The following race condition was detected:

- (t0) neigh_flush_dev() is under execution and calls
neigh_mark_dead(n), marking the neighbour entry 'n' as dead and
removing it from gc_list.

- (t1) On another CPU: __netif_receive_skb() ->
__netif_receive_skb_core() -> arp_rcv() -> arp_process().
arp_process() calls __neigh_lookup(), which takes a reference on
the neighbour entry 'n'.

- (t2) neigh_flush_dev() moves further along and calls
neigh_cleanup_and_release(n); but since the reference count was
increased in t1, 'n' cannot be destroyed.

- (t3) arp_process() moves further along and calls
neigh_update() -> __neigh_update() -> neigh_update_gc_list(), which
adds the neighbour entry back to gc_list (neigh_mark_dead() removed
it from gc_list earlier, in t0).

- (t4) arp_process() finally calls neigh_release(n), destroying
the neighbour entry.

    This leads to 'n' still being part of gc_list, but the actual
    neighbour structure has been freed.

    The situation can be prevented from happening if we disallow a dead
    entry to have any possibility of updating gc_list. This is what the
    patch intends to achieve.
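
    A sketch of the guard; the surrounding locking and list handling
    approximate the upstream neigh_update_gc_list():

    static void neigh_update_gc_list(struct neighbour *n)
    {
            bool on_gc_list, exempt_from_gc;

            write_lock_bh(&n->tbl->lock);
            write_lock(&n->lock);

            /* the fix: a dead entry must never re-enter gc_list */
            if (n->dead)
                    goto out;

            exempt_from_gc = n->nud_state & NUD_PERMANENT ||
                             n->flags & NTF_EXT_LEARNED;
            on_gc_list = !list_empty(&n->gc_list);

            if (exempt_from_gc && on_gc_list) {
                    list_del_init(&n->gc_list);
                    atomic_dec(&n->tbl->gc_entries);
            } else if (!exempt_from_gc && !on_gc_list) {
                    list_add_tail(&n->gc_list, &n->tbl->gc_list);
                    atomic_inc(&n->tbl->gc_entries);
            }
    out:
            write_unlock(&n->lock);
            write_unlock_bh(&n->tbl->lock);
    }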

    Fixes: 9c29a2f55ec0 ("neighbor: Fix locking order for gc_list changes")
    Signed-off-by: Chinmay Agarwal
    Reviewed-by: Cong Wang
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com
    Signed-off-by: Jakub Kicinski

    Chinmay Agarwal
     

20 Jan, 2021

2 commits

With NETIF_F_HW_TLS_RX, packets are decrypted in HW. This cannot
logically be done when RXCSUM offload is off.
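
    A sketch of the dependency check, in the style of the other feature
    fixups in netdev_fix_features() (message text assumed):

    if ((features & NETIF_F_HW_TLS_RX) && !(features & NETIF_F_RXCSUM)) {
            netdev_dbg(dev, "Dropping TLS RX HW offload feature since no RXCSUM feature.\n");
            features &= ~NETIF_F_HW_TLS_RX;
    }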

    Fixes: 14136564c8ee ("net: Add TLS RX offload feature")
    Signed-off-by: Tariq Toukan
    Reviewed-by: Boris Pismenny
    Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com
    Signed-off-by: Jakub Kicinski

    Tariq Toukan
     
  • Fix incorrect user_ptr dereferencing when handling port param get/set:

idx[0] stores the 'struct devlink' pointer;
idx[1] stores the 'struct devlink_port' pointer;
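
    In code terms, the port-param handlers should read (a sketch; local
    variable names assumed):

    struct devlink *devlink = info->user_ptr[0];
    struct devlink_port *devlink_port = info->user_ptr[1]; /* was [0] */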

    Fixes: 637989b5d77e ("devlink: Always use user_ptr[0] for devlink and simplify post_doit")
    CC: Parav Pandit
    Signed-off-by: Oleksandr Mazur
    Signed-off-by: Vadym Kochan
    Link: https://lore.kernel.org/r/20210119085333.16833-1-vadym.kochan@plvision.eu
    Signed-off-by: Jakub Kicinski

    Oleksandr Mazur
     

17 Jan, 2021

1 commit

  • Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
    tiny skbs") ensured that skbs with data size lower than 1025 bytes
    will be kmalloc'ed to avoid excessive page cache fragmentation and
    memory consumption.
However, the fix addressed only __napi_alloc_skb() (primarily for
virtio_net and napi_get_frags()), but the issue can still be triggered
through __netdev_alloc_skb(), which is still used by several drivers.
    Drivers often allocate a tiny skb for headers and place the rest of
    the frame to frags (so-called copybreak).
    Mirror the condition to __netdev_alloc_skb() to handle this case too.
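
    A sketch of the mirrored condition at the top of __netdev_alloc_skb(),
    matching what 3226b158e67c added to __napi_alloc_skb():

    if (len <= SKB_WITH_OVERHEAD(1024) ||
        len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
        (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
            /* small or oversized request: kmalloc skb->head directly */
            skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
            if (!skb)
                    goto skb_fail;
            goto skb_success;
    }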

    Since v1 [0]:
    - fix "Fixes:" tag;
    - refine commit message (mention copybreak usecase).

    [0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me

    Fixes: a1c7fff7e18f ("net: netdev_alloc_skb() use build_skb()")
    Signed-off-by: Alexander Lobakin
    Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me
    Signed-off-by: Jakub Kicinski

    Alexander Lobakin
     

16 Jan, 2021

1 commit

A syzbot report reminded us that very big ewma_log values were supported
in the past, even if they made little sense.

    tc qdisc replace dev xxx root est 1sec 131072sec ...

    While fixing the bug, also add boundary checks for ewma_log, in line
    with range supported by iproute2.
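
    A sketch of the added sanity check in gen_new_estimator(), with
    bounds matching the range iproute2 can generate:

    if (parm->interval < -2 || parm->interval > 3 ||
        parm->ewma_log == 0 || parm->ewma_log >= 31)
            return -EINVAL;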

    UBSAN: shift-out-of-bounds in net/core/gen_estimator.c:83:38
    shift exponent -1 is negative
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:79 [inline]
    dump_stack+0x107/0x163 lib/dump_stack.c:120
    ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
    __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
    est_timer.cold+0xbb/0x12d net/core/gen_estimator.c:83
    call_timer_fn+0x1a5/0x710 kernel/time/timer.c:1417
    expire_timers kernel/time/timer.c:1462 [inline]
    __run_timers.part.0+0x692/0xa80 kernel/time/timer.c:1731
    __run_timers kernel/time/timer.c:1712 [inline]
    run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1744
    __do_softirq+0x2bc/0xa77 kernel/softirq.c:343
    asm_call_irq_on_stack+0xf/0x20

    __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
    run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
    do_softirq_own_stack+0xaa/0xd0 arch/x86/kernel/irq_64.c:77
    invoke_softirq kernel/softirq.c:226 [inline]
    __irq_exit_rcu+0x17f/0x200 kernel/softirq.c:420
    irq_exit_rcu+0x5/0x20 kernel/softirq.c:432
    sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1096
    asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628
    RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline]
    RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:79 [inline]
    RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:169 [inline]
    RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline]
    RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516

    Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Link: https://lore.kernel.org/r/20210114181929.1717985-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     

15 Jan, 2021

2 commits

The cited patch below blocked the TLS TX device offload unless HW_CSUM
is set. This broke devices that use IP_CSUM && IP6_CSUM.
    Here we fix it.

    Note that the single HW_TLS_TX feature flag indicates support for
    both IPv4/6, hence it should still be disabled in case only one of
    (IP_CSUM | IPV6_CSUM) is set.
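
    A sketch of the relaxed check in netdev_fix_features():
    NETIF_F_HW_TLS_TX now survives if either HW_CSUM or both of IP_CSUM
    and IPV6_CSUM are present.

    if (features & NETIF_F_HW_TLS_TX) {
            bool ip_csum = (features & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM)) ==
                           (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
            bool hw_csum = features & NETIF_F_HW_CSUM;

            if (!ip_csum && !hw_csum)
                    features &= ~NETIF_F_HW_TLS_TX;
    }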

    Fixes: ae0b04b238e2 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled")
    Signed-off-by: Tariq Toukan
    Reported-by: Rohit Maheshwari
    Reviewed-by: Maxim Mikityanskiy
    Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com
    Signed-off-by: Jakub Kicinski

    Tariq Toukan
     
Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head.

While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
underestimating memory usage.

For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (an order-3 page on x86), or even 64 on PowerPC.
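
    Back-of-the-envelope view of the under-estimation (x86 defaults
    assumed): an order-3 page is 32768 bytes and a small skb->head
    accounts roughly 1024 bytes of truesize, so one lingering skb can pin
    the whole page while its 31 freed siblings are no longer accounted,
    i.e. about 32768 real bytes against ~1024 accounted, a ~32x gap.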

We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
while consuming far more memory for TCP buffers than instructed in tcp_mem[2].

Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768.

This patch makes sure that small skb heads are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.

Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb().

Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page).

I would like to thank Greg Thelen for his precious help on this matter;
analysing crash dumps is always a time-consuming task.

    Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Cc: Greg Thelen
    Reviewed-by: Alexander Duyck
    Acked-by: Michael S. Tsirkin
    Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     

12 Jan, 2021

1 commit

  • skb_seq_read iterates over an skb, returning pointer and length of
    the next data range with each call.

    It relies on kmap_atomic to access highmem pages when needed.

    An skb frag may be backed by a compound page, but kmap_atomic maps
    only a single page. There are not enough kmap slots to always map all
    pages concurrently.

    Instead, if kmap_atomic is needed, iterate over each page.

    As this increases the number of calls, avoid this unless needed.
    The necessary condition is captured in skb_frag_must_loop.

    I tried to make the change as obvious as possible. It should be easy
    to verify that nothing changes if skb_frag_must_loop returns false.
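
    The gating helper is small; roughly (per the kernel's HIGHMEM
    handling, with the CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP variant omitted):

    static inline bool skb_frag_must_loop(struct page *p)
    {
    #if defined(CONFIG_HIGHMEM)
            if (PageHighMem(p))
                    return true;
    #endif
            return false;
    }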

    Tested:
    On an x86 platform with
    CONFIG_HIGHMEM=y
    CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
    CONFIG_NETFILTER_XT_MATCH_STRING=y

    Run
    ip link set dev lo mtu 1500
iptables -A OUTPUT -m string --string 'badstring' --algo bm -j ACCEPT
    dd if=/dev/urandom of=in bs=1M count=20
    nc -l -p 8000 > /dev/null &
    nc -w 1 -q 0 localhost 8000 < in

    Signed-off-by: Willem de Bruijn
    Signed-off-by: Jakub Kicinski

    Willem de Bruijn
     

09 Jan, 2021

5 commits

  • If register_netdevice() fails at the very last stage - the
    notifier call - some subsystems may have already seen it and
    grabbed a reference. struct net_device can't be freed right
    away without calling netdev_wait_all_refs().

    Now that we have a clean interface in form of dev->needs_free_netdev
    and lenient free_netdev() we can undo what commit 93ee31f14f6f ("[NET]:
    Fix free_netdev on register_netdev failure.") has done and complete
    the unregistration path by bringing the net_set_todo() call back.

After registration fails, the user is still expected to explicitly
free the net_device, so make sure ->needs_free_netdev is cleared;
otherwise rolling back the registration will cause the old double
free for callers who release rtnl_lock before the free.

    This also solves the problem of priv_destructor not being called
    on notifier error.

    net_set_todo() will be moved back into unregister_netdevice_queue()
    in a follow up.
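
    A sketch of the resulting notifier-failure path in
    register_netdevice():

    ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);
    ret = notifier_to_errno(ret);
    if (ret) {
            /* Expect explicit free_netdev() on failure */
            dev->needs_free_netdev = false;
            unregister_netdevice_queue(dev, NULL);
            goto out;
    }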

    Reported-by: Hulk Robot
    Reported-by: Yang Yingliang
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • There are two flavors of handling netdev registration:
    - ones called without holding rtnl_lock: register_netdev() and
    unregister_netdev(); and
    - those called with rtnl_lock held: register_netdevice() and
    unregister_netdevice().

    While the semantics of the former are pretty clear, the same can't
    be said about the latter. The netdev_todo mechanism is utilized to
    perform some of the device unregistering tasks and it hooks into
    rtnl_unlock() so the locked variants can't actually finish the work.
    In general free_netdev() does not mix well with locked calls. Most
    drivers operating under rtnl_lock set dev->needs_free_netdev to true
    and expect core to make the free_netdev() call some time later.

    The part where this becomes most problematic is error paths. There is
    no way to unwind the state cleanly after a call to register_netdevice(),
    since unreg can't be performed fully without dropping locks.

    Make free_netdev() more lenient, and defer the freeing if device
    is being unregistered. This allows error paths to simply call
    free_netdev() both after register_netdevice() failed, and after
    a call to unregister_netdevice() but before dropping rtnl_lock.

    Simplify the error paths which are currently doing gymnastics
    around free_netdev() handling.
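
    A sketch of the deferral at the top of free_netdev():

    /* When called immediately after register_netdevice() failed, the
     * unwind may still be dismantling the device; defer the free until
     * the todo list processes it.
     */
    if (dev->reg_state == NETREG_UNREGISTERING) {
            ASSERT_RTNL();
            dev->needs_free_netdev = true;
            return;
    }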

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • Explain the two basic flows of struct net_device's operation.

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • reuse->socks[] is modified concurrently by reuseport_add_sock. To
    prevent reading values that have not been fully initialized, only read
    the array up until the last known safe index instead of incorrectly
    re-reading the last index of the array.
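
    A sketch of the bounded scan (shape assumed from the surrounding
    reuseport_select_sock() code): num_socks is sampled once and the
    wrap-around uses that snapshot instead of re-reading the field.

    u32 socks = READ_ONCE(reuse->num_socks);

    smp_rmb();     /* paired with smp_wmb() in reuseport_add_sock() */

    i = j = reciprocal_scale(hash, socks);
    while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) {
            i++;
            if (i >= socks)     /* was: i >= reuse->num_socks (re-read) */
                    i = 0;
            if (i == j)
                    goto out;
    }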

    Fixes: acdcecc61285f ("udp: correct reuseport selection with connected sockets")
    Signed-off-by: Baptiste Lepers
    Acked-by: Willem de Bruijn
    Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
    Signed-off-by: Jakub Kicinski

    Baptiste Lepers
     
  • skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
    writes, it will call skb_ensure_writable -> pskb_expand_head to create
    a private linear section for the head_skb. And then call
    skb_clone_fraglist -> skb_get on each skb in the fraglist.

    skb_segment_list overwrites part of the skb linear section of each
    fragment itself. Even after skb_clone, the frag_skbs share their
    linear section with their clone in PF_PACKET.

    Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
    a link for the same frag_skbs chain. If a new skb (not frags) is
    queued to one of the sk_receive_queue, multiple ptypes can see and
    release this. It causes use-after-free.

    [ 4443.426215] ------------[ cut here ]------------
    [ 4443.426222] refcount_t: underflow; use-after-free.
    [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
    refcount_dec_and_test_checked+0xa4/0xc8
    [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
    [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
    [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
    [ 4443.426808] Call trace:
    [ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8
    [ 4443.426823] skb_release_data+0x144/0x264
    [ 4443.426828] kfree_skb+0x58/0xc4
    [ 4443.426832] skb_queue_purge+0x64/0x9c
    [ 4443.426844] packet_set_ring+0x5f0/0x820
    [ 4443.426849] packet_setsockopt+0x5a4/0xcd0
    [ 4443.426853] __sys_setsockopt+0x188/0x278
    [ 4443.426858] __arm64_sys_setsockopt+0x28/0x38
    [ 4443.426869] el0_svc_common+0xf0/0x1d0
    [ 4443.426873] el0_svc_handler+0x74/0x98
    [ 4443.426880] el0_svc+0x8/0xc

Fixes: 3a1296a38d0c ("net: Support GRO/GSO fraglist chaining.")
    Signed-off-by: Dongseok Yi
    Acked-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
    Signed-off-by: Jakub Kicinski

    Dongseok Yi
     

29 Dec, 2020

5 commits

pneigh_enqueue() tries to obtain a random delay by taking
prandom_u32() modulo NEIGH_VAR(p, PROXY_DELAY). However,
NEIGH_VAR(p, PROXY_DELAY) might be zero at that point because someone
could write zero to /proc/sys/net/ipv4/neigh/[device]/proxy_delay
after the callers check it.

    This patch uses prandom_u32_max() to get a random delay instead
    which avoids potential division by zero.
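
    A sketch of the change in pneigh_enqueue(); prandom_u32_max(0) simply
    yields 0, so a zero proxy_delay no longer divides by zero:

    /* was: sched_next = now + (prandom_u32() % NEIGH_VAR(p, PROXY_DELAY)); */
    unsigned long sched_next = now +
                               prandom_u32_max(NEIGH_VAR(p, PROXY_DELAY));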

    Signed-off-by: weichenchen
    Signed-off-by: David S. Miller

    weichenchen
     
  • Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
    protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
    see an actual bug being triggered, but let's be safe here and take the
    rtnl lock while accessing the map in sysfs.
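
    A sketch of the read side (shape of xps_rxqs_show() assumed): sysfs
    handlers cannot block unconditionally on rtnl, hence the
    trylock/restart pattern.

    if (!rtnl_trylock())
            return restart_syscall();

    /* ... read dev->num_tc and dev->xps_rxqs_map under rtnl ... */

    rtnl_unlock();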

    Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • Two race conditions can be triggered when storing xps rxqs, resulting in
    various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

- netif_set_xps_queue uses dev->num_tc as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->num_tc is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.

- netdev_set_num_tc sets dev->num_tc.

If new_dev_maps is allocated using dev->num_tc and dev->num_tc is
then set to a higher value through netdev_set_num_tc, later accesses
to new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps, triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

2.1. netdev_set_num_tc starts by resetting the xps queues;
dev->num_tc isn't updated yet.

2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.

2.3. netdev_set_num_tc updates dev->num_tc.

2.4. Later accesses to the map lead to out-of-bounds accesses and
oops.

    A similar issue can be found with netdev_reset_tc.

    One way of triggering this is to set an iface up (for which the driver
    uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
    xps_rxqs in a concurrent thread. With the right timing an oops is
    triggered.

    Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
    and netdev_reset_tc should be mutually exclusive. We do that by taking
    the rtnl lock in xps_rxqs_store.

    Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
    protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
    see an actual bug being triggered, but let's be safe here and take the
    rtnl lock while accessing the map in sysfs.

    Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • Two race conditions can be triggered when storing xps cpus, resulting in
    various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

- netif_set_xps_queue uses dev->num_tc as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->num_tc is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.

- netdev_set_num_tc sets dev->num_tc.

If new_dev_maps is allocated using dev->num_tc and dev->num_tc is
then set to a higher value through netdev_set_num_tc, later accesses
to new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps, triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

2.1. netdev_set_num_tc starts by resetting the xps queues;
dev->num_tc isn't updated yet.

2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.

2.3. netdev_set_num_tc updates dev->num_tc.

2.4. Later accesses to the map lead to out-of-bounds accesses and
oops.

    A similar issue can be found with netdev_reset_tc.

    One way of triggering this is to set an iface up (for which the driver
    uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
    xps_cpus in a concurrent thread. With the right timing an oops is
    triggered.

    Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
    and netdev_reset_tc should be mutually exclusive. We do that by taking
    the rtnl lock in xps_cpus_store.
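
    A sketch of the store-side serialization (shape of xps_cpus_store()
    assumed; the same pattern applies to xps_rxqs_store()):

    if (!rtnl_trylock()) {
            free_cpumask_var(mask);
            return restart_syscall();
    }

    err = netif_set_xps_queue(dev, mask, index);
    rtnl_unlock();

    free_cpumask_var(mask);
    return err ? : len;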

    Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     

17 Dec, 2020

1 commit

There are some use cases for netdev_notify_peers in contexts where
the rtnl lock is already held. Introduce a lockless version of the
netdev_notify_peers call to save the extra code needed to call
call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
After that, convert netdev_notify_peers to call the new helper.
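
    A sketch of the resulting pair of helpers:

    void __netdev_notify_peers(struct net_device *dev)
    {
            ASSERT_RTNL();
            call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
            call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
    }

    void netdev_notify_peers(struct net_device *dev)
    {
            rtnl_lock();
            __netdev_notify_peers(dev);
            rtnl_unlock();
    }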

    Suggested-by: Nathan Lynch
    Signed-off-by: Lijun Pan
    Signed-off-by: Jakub Kicinski

    Lijun Pan
     

16 Dec, 2020

1 commit

  • Pull networking updates from Jakub Kicinski:
    "Core:

    - support "prefer busy polling" NAPI operation mode, where we defer
    softirq for some time expecting applications to periodically busy
    poll

    - AF_XDP: improve efficiency by more batching and hindering the
    adjacency cache prefetcher

    - af_packet: make packet_fanout.arr size configurable up to 64K

    - tcp: optimize TCP zero copy receive in presence of partial or
    unaligned reads making zero copy a performance win for much smaller
    messages

    - XDP: add bulk APIs for returning / freeing frames

    - sched: support fragmenting IP packets as they come out of conntrack

    - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

    BPF:

    - BPF switch from crude rlimit-based to memcg-based memory accounting

    - BPF type format information for kernel modules and related tracing
    enhancements

    - BPF implement task local storage for BPF LSM

    - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
    bpf_sk_storage

    Protocols:

    - mptcp: improve multiple xmit streams support, memory accounting and
    many smaller improvements

    - TLS: support CHACHA20-POLY1305 cipher

    - seg6: add support for SRv6 End.DT4/DT6 behavior

    - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

    - ppp_generic: add ability to bridge channels directly

    - bridge: Connectivity Fault Management (CFM) support as is defined
    in IEEE 802.1Q section 12.14.

    Drivers:

    - mlx5: make use of the new auxiliary bus to organize the driver
    internals

    - mlx5: more accurate port TX timestamping support

    - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
    the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging

- rtw88: major Bluetooth co-existence improvements

    - iwlwifi: support new 6 GHz frequency band

    - ath11k: Fast Initial Link Setup (FILS)

    - mt7915: dual band concurrent (DBDC) support

    - net: ipa: add basic support for IPA v4.5

    Refactor:

    - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
    Siewior

    - phy: add support for shared interrupts; get rid of multiple driver
    APIs and have the drivers write a full IRQ handler, slight growth
    of driver code should be compensated by the simpler API which also
    allows shared IRQs

    - add common code for handling netdev per-cpu counters

    - move TX packet re-allocation from Ethernet switch tag drivers to a
    central place

    - improve efficiency and rename nla_strlcpy

    - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot

    Old code removal:

    - wan: delete the DLCI / SDLA drivers

    - wimax: move to staging

    - wifi: remove old WDS wifi bridging support"

    * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
    net: hns3: fix expression that is currently always true
    net: fix proc_fs init handling in af_packet and tls
    nfc: pn533: convert comma to semicolon
    af_vsock: Assign the vsock transport considering the vsock address flags
    af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
    vsock_addr: Check for supported flag values
    vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
    vm_sockets: Add flags field in the vsock address data structure
    net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
    tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
    net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
    nfc: s3fwrn5: Release the nfc firmware
    net: vxget: clean up sparse warnings
    mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
    mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
    mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
    mlxsw: reg: Add Router LPM Cache Enable Register
    mlxsw: reg: Add Router LPM Cache ML Delete Register
    mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
    mlxsw: reg: Add XM Router M Table Register
    ...

    Linus Torvalds
     

15 Dec, 2020

5 commits

  • With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be
    logically done when HW_CSUM offload is off.

    Fixes: 2342a8512a1e ("net: Add TLS TX offload features")
    Signed-off-by: Tariq Toukan
    Reviewed-by: Boris Pismenny
    Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.com
    Signed-off-by: Jakub Kicinski

    Tariq Toukan
     
syzbot reproduces the BUG_ON in skb_checksum_help():
tun creates a (bogus) skb with a huge partial-checksummed area and
a small ip packet inside. Then ip_rcv trims the skb based on the size
of the internal ip packet; after that, the csum offset points beyond
the trimmed skb. checksum_tg(), called via a netfilter hook, then
triggers the BUG_ON:

    offset = skb_checksum_start_offset(skb);
    BUG_ON(offset >= skb_headlen(skb));

To work around the problem, this patch forces pskb_trim_rcsum_slow()
to return -EINVAL in the described scenario. This allows its callers
to drop such packets.
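
    A sketch of the added guard in pskb_trim_rcsum_slow(): if the
    checksum start/offset of a CHECKSUM_PARTIAL skb would land beyond the
    trimmed headlen, the packet is rejected.

    if (skb->ip_summed == CHECKSUM_PARTIAL) {
            int hdlen = (len > skb_headlen(skb)) ? skb_headlen(skb) : len;
            int offset = skb_checksum_start_offset(skb) + skb->csum_offset;

            if (offset + sizeof(__sum16) > hdlen)
                    return -EINVAL;
    }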

    Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0
    Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com
    Signed-off-by: Vasily Averin
    Acked-by: Willem de Bruijn
    Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com
    Signed-off-by: Jakub Kicinski

    Vasily Averin
     
  • Pull scheduler updates from Thomas Gleixner:

    - migrate_disable/enable() support which originates from the RT tree
    and is now a prerequisite for the new preemptible kmap_local() API
    which aims to replace kmap_atomic().

    - A fair amount of topology and NUMA related improvements

    - Improvements for the frequency invariant calculations

    - Enhanced robustness for the global CPU priority tracking and decision
    making

    - The usual small fixes and enhancements all over the place

    * tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
    sched/fair: Trivial correction of the newidle_balance() comment
    sched/fair: Clear SMT siblings after determining the core is not idle
    sched: Fix kernel-doc markup
    x86: Print ratio freq_max/freq_base used in frequency invariance calculations
    x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
    x86, sched: Calculate frequency invariance for AMD systems
    irq_work: Optimize irq_work_single()
    smp: Cleanup smp_call_function*()
    irq_work: Cleanup
    sched: Limit the amount of NUMA imbalance that can exist at fork time
    sched/numa: Allow a floating imbalance between NUMA nodes
    sched: Avoid unnecessary calculation of load imbalance at clone time
    sched/numa: Rename nr_running and break out the magic number
    sched: Make migrate_disable/enable() independent of RT
    sched/topology: Condition EAS enablement on FIE support
    arm64: Rebuild sched domains on invariance status changes
    sched/topology,schedutil: Wrap sched domains rebuild
    sched/uclamp: Allow to reset a task uclamp constraint value
    sched/core: Fix typos in comments
    Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
    ...

    Linus Torvalds
     
  • Pull misc fixes from Christian Brauner:
    "This contains several fixes which felt worth being combined into a
    single branch:

- Use put_nsproxy() instead of open-coding it in switch_task_namespaces()

    - Kirill's work to unify lifecycle management for all namespaces. The
    lifetime counters are used identically for all namespaces types.
    Namespaces may of course have additional unrelated counters and
    these are not altered. This work allows us to unify the type of the
    counters and reduces maintenance cost by moving the counter in one
    place and indicating that basic lifetime management is identical
    for all namespaces.

    - Peilin's fix adding three byte padding to Dmitry's
    PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.

- Two small patches to convert from the /* fall through */ comment
    annotation to the fallthrough keyword annotation which I had taken
    into my branch and into -next before df561f6688fe ("treewide: Use
    fallthrough pseudo-keyword") made it upstream which fixed this
    tree-wide.

    Since I didn't want to invalidate all testing for other commits I
    didn't rebase and kept them"

    * tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    nsproxy: use put_nsproxy() in switch_task_namespaces()
    sys: Convert to the new fallthrough notation
    signal: Convert to the new fallthrough notation
    time: Use generic ns_common::count
    cgroup: Use generic ns_common::count
    mnt: Use generic ns_common::count
    user: Use generic ns_common::count
    pid: Use generic ns_common::count
    ipc: Use generic ns_common::count
    uts: Use generic ns_common::count
    net: Use generic ns_common::count
    ns: Add a common refcount into ns_common
    ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()

    Linus Torvalds
     
  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-12-14

    1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest.

    2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua.

    3) Support for finding BTF based kernel attach targets through libbpf's
    bpf_program__set_attach_target() API, from Andrii Nakryiko.

    4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song.

    5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet.

    6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo.

    7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel.

    8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman.

    9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner,
    KP Singh, Jiri Olsa and various others.

    * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
    selftests/bpf: Add a test for ptr_to_map_value on stack for helper access
    bpf: Permits pointers on stack for helper calls
    libbpf: Expose libbpf ring_buffer epoll_fd
    selftests/bpf: Add set_attach_target() API selftest for module target
    libbpf: Support modules in bpf_program__set_attach_target() API
    selftests/bpf: Silence ima_setup.sh when not running in verbose mode.
    selftests/bpf: Drop the need for LLVM's llc
    selftests/bpf: fix bpf_testmod.ko recompilation logic
    samples/bpf: Fix possible hang in xdpsock with multiple threads
    selftests/bpf: Make selftest compilation work on clang 11
    selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
    selftests/bpf: Drop tcp-{client,server}.py from Makefile
    selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
    selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
    selftests/bpf: Xsk selftests - DRV POLL, NOPOLL
    selftests/bpf: Xsk selftests - SKB POLL, NOPOLL
    selftests/bpf: Xsk selftests framework
    bpf: Only provide bpf_sock_from_file with CONFIG_NET
    bpf: Return -ENOTSUPP when attaching to non-kernel BTF
    xsk: Validate socket state in xsk_recvmsg, prior touching socket members
    ...
    ====================

    Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

11 Dec, 2020

2 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2020-12-10

    The following pull-request contains BPF updates for your *net* tree.

    We've added 21 non-merge commits during the last 12 day(s) which contain
    a total of 21 files changed, 163 insertions(+), 88 deletions(-).

    The main changes are:

    1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.

    2) Fix ring_buffer__poll() return value, from Andrii.

    3) Fix race in lwt_bpf, from Cong.

    4) Fix test_offload, from Toke.

    5) Various xsk fixes.

    Please consider pulling these changes from:

    git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

    Thanks a lot!

    Also thanks to reporters, reviewers and testers of commits in this pull-request:

    Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
    Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We use rcu_assign_pointer to assign both the table and the entries,
    but the entries are not marked as __rcu. This generates sparse
    warnings.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

10 Dec, 2020

1 commit

  • The offending commit introduces a cleanup callback that is invoked
    when the driver module is removed to clean up the tunnel device
    flow block. But it returns on the first iteration of the for loop.
    The remaining indirect flow blocks will never be freed.
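
    The shape of the fix (a sketch; list and field names assumed): keep
    walking the list and move every match, rather than returning on the
    first iteration.

    list_for_each_entry_safe(this, next, &flow_block_indr_list, list) {
            if (this->release != release ||
                this->indr.cb_priv != cb_priv)
                    continue;   /* was effectively: return */
            list_move(&this->indr.list, cleanup_list);
    }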

    Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure")
    CC: Pablo Neira Ayuso
    Signed-off-by: Chris Mi
    Reviewed-by: Roi Dayan

    Chris Mi
     

09 Dec, 2020

3 commits

  • Since commit 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF
    programs in net_device"), the XDP program attachment info is now maintained
    in the core code. This interacts badly with the xdp_attachment_flags_ok()
    check that prevents unloading an XDP program with different load flags than
    it was loaded with. In practice, two kinds of failures are seen:

    - An XDP program loaded without specifying a mode (and which then ends up
    in driver mode) cannot be unloaded if the program mode is specified on
    unload.

    - The dev_xdp_uninstall() hook always calls the driver callback with the
    mode set to the type of the program but an empty flags argument, which
    means the flags_ok() check prevents the program from being removed,
    leading to bpf prog reference leaks.

    The original reason this check was added was to avoid ambiguity when
    multiple programs were loaded. With the way the checks are done in the core
    now, this is quite simple to enforce in the core code, so let's add a check
    there and get rid of the xdp_attachment_flags_ok() callback entirely.

    Fixes: 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Daniel Borkmann
    Acked-by: Jakub Kicinski
    Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • This moves the bpf_sock_from_file definition into net/core/filter.c
    which only gets compiled with CONFIG_NET and also moves the helper proto
    usage next to other tracing helpers that are conditional on CONFIG_NET.

This avoids:
ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file':
bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file'
when compiling a kernel with BPF and without NET.

    Reported-by: kernel test robot
    Reported-by: Randy Dunlap
    Signed-off-by: Florent Revest
    Signed-off-by: Alexei Starovoitov
    Acked-by: Randy Dunlap
    Acked-by: Martin KaFai Lau
    Acked-by: KP Singh
    Link: https://lore.kernel.org/bpf/20201208173623.1136863-1-revest@chromium.org

    Florent Revest
     
  • Simplify the return expression.

    Signed-off-by: Zheng Yongjun
    Signed-off-by: David S. Miller

    Zheng Yongjun
     

08 Dec, 2020

2 commits

migrate_disable() is just a wrapper for preempt_disable() on
non-RT kernels. It is safe to replace it, and RT kernels will
benefit.

Note that migrate_disable() has only existed since Feb 2020.
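
    For context, the non-RT fallback at the time reduced to preemption
    control, so the substitution is behavior-preserving there (simplified
    sketch):

    #ifndef CONFIG_PREEMPT_RT
    static inline void migrate_disable(void)
    {
            preempt_disable();
    }

    static inline void migrate_enable(void)
    {
            preempt_enable();
    }
    #endif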

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201205075946.497763-2-xiyou.wangcong@gmail.com

    Cong Wang
     
  • The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
    and BPF redirect helpers. Callers on RX path are all in BH context,
    disabling preemption is not sufficient to prevent BH interruption.

    In production, we observed strange packet drops because of the race
    condition between LWT xmit and TC ingress, and we verified this issue
    is fixed after we disable BH.

    Although this bug was technically introduced from the beginning, that
    is commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
    at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
    So this patch may not work well before RCU flavor consolidation has been
    completed around v5.0.

    Update the comments above the code too, as call_rcu() is now BH friendly.
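
    Combined with the follow-up that swaps in migrate_disable(), the
    protected section in run_lwt_bpf() looks roughly like:

    /* Migration disable and BH disable are needed to protect per-cpu
     * redirect_info between the BPF prog and skb_do_redirect().
     */
    migrate_disable();
    local_bh_disable();
    bpf_compute_data_pointers(skb);
    ret = bpf_prog_run_save_cb(lwt->prog, skb);
    local_bh_enable();
    migrate_enable();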

    Signed-off-by: Dongdong Wang
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Cong Wang
    Link: https://lore.kernel.org/bpf/20201205075946.497763-1-xiyou.wangcong@gmail.com

    Dongdong Wang
     

05 Dec, 2020

2 commits

  • Iterators are currently used to expose kernel information to userspace
    over fast procfs-like files but iterators could also be used to
    manipulate local storage. For example, the task_file iterator could be
    used to initialize a socket local storage with associations between
    processes and sockets or to selectively delete local storage values.
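
    A hypothetical iterator program along those lines, using the
    bpf_sock_from_file helper this series introduces (sk_stg_map is an
    assumed BPF_MAP_TYPE_SK_STORAGE map declared elsewhere; the deletion
    policy is invented for illustration):

    SEC("iter/task_file")
    int clean_sk_storage(struct bpf_iter__task_file *ctx)
    {
            struct file *file = ctx->file;
            struct socket *sock;

            if (!file)
                    return 0;

            sock = bpf_sock_from_file(file);
            if (sock)
                    /* selectively drop this socket's local storage value */
                    bpf_sk_storage_delete(&sk_stg_map, sock->sk);
            return 0;
    }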

    Signed-off-by: Florent Revest
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Acked-by: KP Singh
    Link: https://lore.kernel.org/bpf/20201204113609.1850150-3-revest@google.com

    Florent Revest
     
Currently, the sock_from_file prototype takes an "err" pointer that is
either not set or set to -ENOTSOCK iff the returned socket is NULL.
This makes the error redundant, and it is ignored by a few callers.

    This patch simplifies the API by letting callers deduce the error based
    on whether the returned socket is NULL or not.
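
    The simplified helper and a typical caller then look roughly like:

    struct socket *sock_from_file(struct file *file)
    {
            if (file->f_op == &socket_file_ops)
                    return file->private_data;  /* set in sock_alloc_file() */
            return NULL;
    }

    /* caller side: the error is now deduced from NULL */
    sock = sock_from_file(file);
    if (!sock)
            return -ENOTSOCK;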

    Suggested-by: Al Viro
    Signed-off-by: Florent Revest
    Signed-off-by: Daniel Borkmann
    Reviewed-by: KP Singh
    Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com

    Florent Revest
     

04 Dec, 2020

2 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-12-03

    The main changes are:

    1) Support BTF in kernel modules, from Andrii.

    2) Introduce preferred busy-polling, from Björn.

    3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

    4) Memcg-based memory accounting for bpf objects, from Roman.

    5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

    * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
    selftests/bpf: Fix invalid use of strncat in test_sockmap
    libbpf: Use memcpy instead of strncpy to please GCC
    selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
    selftests/bpf: Add tp_btf CO-RE reloc test for modules
    libbpf: Support attachment of BPF tracing programs to kernel modules
    libbpf: Factor out low-level BPF program loading helper
    bpf: Allow to specify kernel module BTFs when attaching BPF programs
    bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
    selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
    selftests/bpf: Add support for marking sub-tests as skipped
    selftests/bpf: Add bpf_testmod kernel module for testing
    libbpf: Add kernel module BTF support for CO-RE relocations
    libbpf: Refactor CO-RE relocs to not assume a single BTF object
    libbpf: Add internal helper to load BTF data by FD
    bpf: Keep module's btf_data_size intact after load
    bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
    selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
    bpf: Adds support for setting window clamp
    samples/bpf: Fix spelling mistake "recieving" -> "receiving"
    bpf: Fix cold build of test_progs-no_alu32
    ...
    ====================

    Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
Add a new bpf_setsockopt option for TCP sockets, TCP_BPF_WINDOW_CLAMP,
which sets the maximum receiver window size. It will be useful for
limiting the receiver window based on RTT.
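
    Illustrative sockops usage (the clamp value and attach point are
    invented; TCP_BPF_WINDOW_CLAMP is the optname this patch adds):

    SEC("sockops")
    int clamp_rwnd(struct bpf_sock_ops *skops)
    {
            int clamp = 65535;  /* hypothetical receive window clamp */

            if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
                    bpf_setsockopt(skops, SOL_TCP, TCP_BPF_WINDOW_CLAMP,
                                   &clamp, sizeof(clamp));
            return 1;
    }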

    Signed-off-by: Prankur gupta
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com

    Prankur gupta