Eric Lee / smarc-fsl-linux-kernel

08 Jan, 2021

6 commits

b19218b27 nexthop: Bounce NHA_GATEWAY in FDB nexthop groups

The function nh_check_attr_group() is called to validate nexthop groups.
The intention of that code seems to have been to bounce all attributes
above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
these attributes except when NHA_FDB attribute is present--then it accepts
them.

NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.

But that still leaves NHA_GATEWAY as an attribute that would be accepted in
FDB nexthop groups (with no meaning), so long as it keeps the address
family as unspecified:

# ip nexthop add id 1 fdb via 127.0.0.1
# ip nexthop add id 10 fdb via default group 1

The nexthop code is still relatively new and likely not used very broadly,
and the FDB bits are newer still. Even though there is a reproducer out
there, it relies on improbable gateway arguments "via default", "via
all" or "via any". Given all this, I believe it is OK to reformulate the
condition to do the right thing and bounce NHA_GATEWAY.
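The difference between the old and the reformulated condition can be sketched in userspace C. This is a simplified model rather than the kernel code: the attribute table is reduced to an array of booleans, the enum reuses the NHA_* names purely for illustration, and check_group_attrs_old() and check_group_attrs_new() are hypothetical names.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-ins for the NHA_* attribute indices. */
enum {
	NHA_UNSPEC, NHA_ID, NHA_GROUP, NHA_GROUP_TYPE,
	NHA_BLACKHOLE, NHA_OIF, NHA_GATEWAY, NHA_ENCAP_TYPE,
	NHA_ENCAP, NHA_GROUPS, NHA_MASTER, NHA_FDB,
	NHA_COUNT	/* hypothetical: number of modelled attributes */
};

/* Old behaviour: once NHA_FDB is present, every later attribute passes. */
static int check_group_attrs_old(const bool present[NHA_COUNT])
{
	for (int i = NHA_GROUP_TYPE + 1; i < NHA_COUNT; i++) {
		if (!present[i])
			continue;
		if (present[NHA_FDB])
			continue;
		return -EINVAL;
	}
	return 0;
}

/* Reformulated: only NHA_FDB itself is exempt from the bounce. */
static int check_group_attrs_new(const bool present[NHA_COUNT])
{
	for (int i = NHA_GROUP_TYPE + 1; i < NHA_COUNT; i++) {
		if (!present[i])
			continue;
		if (i == NHA_FDB)
			continue;
		return -EINVAL;
	}
	return 0;
}
```

With NHA_FDB and NHA_GATEWAY both present, the old check in this model returns 0 while the new one returns -EINVAL.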

Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops")
Signed-off-by: Petr Machata
Signed-off-by: Ido Schimmel
Reviewed-by: David Ahern
Signed-off-by: Jakub Kicinski

Petr Machata
2021-01-08 10:47:18 +0800
7b01e53ee nexthop: Unlink nexthop group entry in error path

In case of error, remove the nexthop group entry from the list to which
it was previously added.

Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel
Reviewed-by: Petr Machata
Reviewed-by: David Ahern
Signed-off-by: Jakub Kicinski

Ido Schimmel
2021-01-08 10:47:18 +0800
07e61a979 nexthop: Fix off-by-one error in error path

A reference was not taken for the current nexthop entry, so do not try
to put it in the error path.
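The general shape of such an unwind loop can be sketched as follows. This is a generic illustration and not the nexthop code: struct entry, take_all() and the fail_at parameter are hypothetical names introduced only for the example.

```c
struct entry {
	int refcnt;
};

/*
 * Take one reference on each of n entries. fail_at is a hypothetical
 * knob: acquisition fails at that index (use -1 for no failure).
 */
static int take_all(struct entry *e, int n, int fail_at)
{
	int i;

	for (i = 0; i < n; i++) {
		if (i == fail_at)
			goto unwind;
		e[i].refcnt++;
	}
	return 0;

unwind:
	/* Entry i never got a reference, so start releasing at i - 1. */
	for (i--; i >= 0; i--)
		e[i].refcnt--;
	return -1;
}
```

Starting the unwind at i instead of i - 1 would drop a reference that was never taken, which is the off-by-one the commit title refers to.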

Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel
Reviewed-by: Petr Machata
Reviewed-by: David Ahern
Signed-off-by: Jakub Kicinski

Ido Schimmel
2021-01-08 10:47:18 +0800
bb4cc1a18 net: ip: always refragment ip defragmented packets

Conntrack reassembly records the largest fragment size seen in IPCB.
However, when this gets forwarded/transmitted, fragmentation will only
be forced if one of the fragmented packets had the DF bit set.

In that case, a flag in IPCB will force fragmentation even if the
MTU is large enough.

This should work fine, but this breaks with ip tunnels.
Consider a client that sends a UDP datagram of size X to another host.

The client fragments the datagram, so two packets, of size y and z, are
sent. DF bit is not set on any of these packets.

Middlebox netfilter reassembles those packets back to single size-X
packet, before routing decision.

packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit
isn't set. At output time, ip refragmentation is skipped as well
because x is still smaller than the mtu of the output device.

If the transmit device is an ip tunnel, the packet size increases to
x+overhead.

Also, tunnel might be configured to force DF bit on outer header.

In this case, packet will be dropped (exceeds MTU) and an ICMP error is
generated back to sender.

But sender already respects the announced MTU, all the packets that
it sent did fit the announced mtu.

Force refragmentation as per original sizes unconditionally so ip tunnel
will encapsulate the fragments instead.

The only other solution I see is to place ip refragmentation in
the ip_tunnel code to handle this case.
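The change in the output-path decision can be modelled roughly as below. This is a sketch under assumptions, not the kernel implementation: struct pkt and both predicates are hypothetical and only summarise the behaviour described in this message (DF-triggered refragmentation before, unconditional refragmentation of reassembled packets after).

```c
#include <stdbool.h>

/* Hypothetical summary of what reassembly recorded for a packet. */
struct pkt {
	unsigned int len;		/* size after reassembly */
	unsigned int frag_max_size;	/* largest fragment seen, 0 if never fragmented */
	bool df_seen;			/* a fragment carried the DF bit */
};

/* Before: refragment only when too big or when DF was seen. */
static bool must_fragment_old(const struct pkt *p, unsigned int mtu)
{
	return p->len > mtu || p->df_seen;
}

/* After: refragment whenever the packet had been reassembled. */
static bool must_fragment_new(const struct pkt *p, unsigned int mtu)
{
	return p->len > mtu || p->frag_max_size != 0;
}
```

For a reassembled packet that fits the output MTU and never carried DF, the old predicate skips fragmentation, so the tunnel sees one large packet. The new predicate restores the original fragment sizes first.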

Fixes: d6b915e29f4ad ("ip_fragment: don't forward defragmented DF packet")
Reported-by: Christian Perle
Signed-off-by: Florian Westphal
Acked-by: Pablo Neira Ayuso
Signed-off-by: Jakub Kicinski

Florian Westphal
2021-01-08 06:42:36 +0800
50c661670 net: fix pmtu check in nopmtudisc mode

For some reason ip_tunnel insists on setting the DF bit anyway when the
inner header has the DF bit set, EVEN if the tunnel was configured with
'nopmtudisc'.

This means that the script added in the previous commit
cannot be made to work by adding the 'nopmtudisc' flag to the
ip tunnel configuration. Doing so breaks connectivity even for the
without-conntrack/netfilter scenario.

When nopmtudisc is set, the tunnel will skip the mtu check, so no
icmp error is sent to client. Then, because inner header has DF set,
the outer header gets added with DF bit set as well.

IP stack then sends an error to itself because the packet exceeds
the device MTU.

Fixes: 23a3647bc4f93 ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Stefano Brivio
Signed-off-by: Florian Westphal
Acked-by: Pablo Neira Ayuso
Signed-off-by: Jakub Kicinski

Florian Westphal
2021-01-08 06:42:36 +0800
d8f5c2965 net: ipv6: fib: flush exceptions when purging route

Route removal is handled by two code paths. The main removal path is via
fib6_del_route() which will handle purging any PMTU exceptions from the
cache, removing all per-cpu copies of the DST entry used by the route, and
releasing the fib6_info struct.

The second removal location is during fib6_add_rt2node() during a route
replacement operation. This path also calls fib6_purge_rt() to handle
cleaning up the per-cpu copies of the DST entries and releasing the
fib6_info associated with the older route, but it does not flush any PMTU
exceptions that the older route had. Since the older route is removed from
the tree during the replacement, we lose any way of accessing it again.

As these lingering DSTs and the fib6_info struct are holding references to
the underlying netdevice struct as well, unregistering that device from the
kernel can never complete.

Fixes: 2b760fcf5cfb3 ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Sean Tranchetti
Reviewed-by: David Ahern
Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
Signed-off-by: Jakub Kicinski

Sean Tranchetti
2021-01-08 04:03:16 +0800

06 Jan, 2021

4 commits

4beb17e55 net: qrtr: fix null-ptr-deref in qrtr_ns_remove

A null-ptr-deref bug is reported by Hulk Robot like this:
--------------
KASAN: null-ptr-deref in range [0x0000000000000128-0x000000000000012f]
Call Trace:
qrtr_ns_remove+0x22/0x40 [ns]
qrtr_proto_fini+0xa/0x31 [qrtr]
__x64_sys_delete_module+0x337/0x4e0
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x468ded
--------------

When qrtr_ns_init fails in qrtr_proto_init, qrtr_ns_remove which would
be called later on would raise a null-ptr-deref because qrtr_ns.workqueue
has been destroyed.

Fix it by making qrtr_ns_init have a return value and adding a check in
qrtr_proto_init.

Reported-by: Hulk Robot
Signed-off-by: Qinglang Miao
Signed-off-by: David S. Miller

Qinglang Miao
2021-01-06 08:50:09 +0800
55b7ab117 net: vlan: avoid leaks on register_vlan_dev() failures

VLAN checks for NETREG_UNINITIALIZED to distinguish between
registration failure and unregistration in progress.

Since commit cb626bf566eb ("net-sysfs: Fix reference count leak")
registration failure may, however, result in NETREG_UNREGISTERED
as well as NETREG_UNINITIALIZED.

This fix is similar to cebb69754f37 ("rtnetlink: Fix
memory(net_device) leak when ->newlink fails")

Fixes: cb626bf566eb ("net-sysfs: Fix reference count leak")
Signed-off-by: Jakub Kicinski
Signed-off-by: David S. Miller

Jakub Kicinski
2021-01-06 08:25:31 +0800
152a8a6c0 cfg80211: select CONFIG_CRC32

Without crc32 support, this fails to link:

arm-linux-gnueabi-ld: net/wireless/scan.o: in function `cfg80211_scan_6ghz':
scan.c:(.text+0x928): undefined reference to `crc32_le'

Fixes: c8cb5b854b40 ("nl80211/cfg80211: support 6 GHz scanning")
Signed-off-by: Arnd Bergmann
Signed-off-by: David S. Miller

Arnd Bergmann
2021-01-06 07:50:36 +0800
aa35e45cd Merge tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter, wireless and bpf
trees.

Current release - regressions:

- mt76: fix NULL pointer dereference in mt76u_status_worker and
mt76s_process_tx_queue

- net: ipa: fix interconnect enable bug

Current release - always broken:

- netfilter: fixes possible oops in mtype_resize in ipset

- ath11k: fix number of coding issues found by static analysis tools
and spurious error messages

Previous releases - regressions:

- e1000e: re-enable s0ix power saving flows for systems with the
Intel i219-LM Ethernet controllers to fix power use regression

- virtio_net: fix recursive call to cpus_read_lock() to avoid a
deadlock

- ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()

- sysfs: take the rtnl lock around XPS configuration

- xsk: fix memory leak for failed bind and rollback reservation at
NETDEV_TX_BUSY

- r8169: work around power-saving bug on some chip versions

Previous releases - always broken:

- dcb: validate netlink message in DCB handler

- tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
to prevent unnecessary retries

- vhost_net: fix ubuf refcount when sendmsg fails

- bpf: save correct stopping point in file seq iteration

- ncsi: use real net-device for response handler

- neighbor: fix div by zero caused by a data race (TOCTOU)

- bareudp: fix use of incorrect min_headroom size and a false
positive lockdep splat from the TX lock

- mvpp2:
- clear force link UP during port init procedure in case
bootloader had set it
- add TCAM entry to drop flow control pause frames
- fix PPPoE with ipv6 packet parsing
- fix GoP Networking Complex Control config of port 3
- fix pkt coalescing IRQ-threshold configuration

- xsk: fix race in SKB mode transmit with shared cq

- ionic: account for vlan tag len in rx buffer len

- stmmac: ignore the second clock input, current clock framework does
not handle exclusive clock use well, other drivers may reconfigure
the second clock

Misc:

- ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
existing scheme"

* tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access
net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs
net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_event
r8169: work around power-saving bug on some chip versions
net: usb: qmi_wwan: add Quectel EM160R-GL
selftests: mlxsw: Set headroom size of correct port
net: macb: Correct usage of MACB_CAPS_CLK_HW_CHG flag
ibmvnic: fix: NULL pointer dereference.
docs: networking: packet_mmap: fix old config reference
docs: networking: packet_mmap: fix formatting for C macros
vhost_net: fix ubuf refcount incorrectly when sendmsg fails
bareudp: Fix use of incorrect min_headroom size
bareudp: set NETIF_F_LLTX flag
net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
atlantic: remove architecture depends
erspan: fix version 1 check in gre_parse_header()
net: hns: fix return value check in __lb_other_process()
net: sched: prevent invalid Scell_log shift count
net: neighbor: fix a crash caused by mod zero
ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
...

Linus Torvalds
2021-01-06 04:38:56 +0800

05 Jan, 2021

2 commits

a8f33c038 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Missing sanitization of rateest userspace string, bug has been
triggered by syzbot, patch from Florian Westphal.

2) Report EOPNOTSUPP on missing set features in nft_dynset, otherwise
error reporting to userspace via EINVAL is misleading since this is
reserved for malformed netlink requests.

3) New binaries with old kernels might silently accept several set
element expressions. New binaries set the NFT_SET_EXPR and
NFT_DYNSET_F_EXPR flags to request several expressions per
element, hence old kernels which do not support this bail out
with EOPNOTSUPP.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nftables: add set expression flags
netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
netfilter: xt_RATEEST: reject non-null terminated string from userspace
====================

Link: https://lore.kernel.org/r/20210103192920.18639-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski

Jakub Kicinski
2021-01-05 06:02:02 +0800
b40f97b91 net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_event

In lapb_device_event, lapb_devtostruct is called to get a reference to
an object of "struct lapb_cb". lapb_devtostruct increases the refcount
of the object and returns a pointer to it. However, we didn't decrease
the refcount after we finished using the pointer. This patch fixes this
problem.

Fixes: a4989fa91110 ("net/lapb: support netdev events")
Cc: Martin Schiller
Signed-off-by: Xie He
Link: https://lore.kernel.org/r/20201231174331.64539-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski

Xie He
2021-01-05 05:42:41 +0800

29 Dec, 2020

12 commits

4bfc47148 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2020-12-28

The following pull-request contains BPF updates for your *net* tree.

There is a small merge conflict between bpf tree commit 69ca310f3416
("bpf: Save correct stopping point in file seq iteration") and net tree
commit 66ed594409a1 ("bpf/task_iter: In task_file_seq_get_next use
task_lookup_next_fd_rcu"). The get_files_struct() does not exist anymore
in net, so take the hunk in HEAD and add the `info->tid = curr_tid` to
the error path:

[...]
curr_task = task_seq_get_next(ns, &curr_tid, true);
if (!curr_task) {
info->task = NULL;
info->tid = curr_tid;
return NULL;
}

/* set info->task and info->tid */
[...]

We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 11 files changed, 75 insertions(+), 20 deletions(-).

The main changes are:

1) Various AF_XDP fixes such as fill/completion ring leak on failed bind and
fixing a race in skb mode's backpressure mechanism, from Magnus Karlsson.

2) Fix latency spikes on lockdep enabled kernels by adding a rescheduling
point to BPF hashtab initialization, from Eric Dumazet.

3) Fix a splat in task iterator by saving the correct stopping point in the
seq file iteration, from Jonathan Lemon.

4) Fix BPF maps selftest by adding retries in case hashtab returns EBUSY
errors on update/deletes, from Andrii Nakryiko.

5) Fix BPF selftest error reporting to something more user friendly if the
vmlinux BTF cannot be found, from Kamal Mostafa.
====================

Signed-off-by: David S. Miller

David S. Miller
2020-12-29 07:26:11 +0800
085c7c4e1 erspan: fix version 1 check in gre_parse_header()

Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
have an erspan header. So the check in gre_parse_header() is wrong,
we have to distinguish version 1 from version 0.

We can just check the gre header length like is_erspan_type1().

Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup")
Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
Cc: William Tu
Cc: Lorenzo Bianconi
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2020-12-29 07:00:00 +0800
bd1248f1d net: sched: prevent invalid Scell_log shift count

Check Scell_log shift size in red_check_params() and modify all callers
of red_check_params() to pass Scell_log.

This prevents a shift out-of-bounds as detected by UBSAN:
UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22
shift exponent 72 is too large for 32-bit type 'int'
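The kind of check involved can be illustrated in plain C. This is a minimal sketch and not red_check_params() itself: shift_count_ok() and checked_shl() are hypothetical helpers, and the bound of 32 is taken from the 32-bit type named in the UBSAN report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reject shift counts that are not smaller than the width of the type. */
static bool shift_count_ok(uint8_t scell_log)
{
	return scell_log < 32;
}

/* Hypothetical helper: returns false instead of performing an invalid shift. */
static bool checked_shl(uint32_t value, uint8_t scell_log, uint32_t *out)
{
	if (!shift_count_ok(scell_log))
		return false;
	*out = value << scell_log;
	return true;
}
```

A count of 72, as in the report above, is rejected before any shift is attempted.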

Fixes: 8afa10cbe281 ("net_sched: red: Avoid illegal values")
Signed-off-by: Randy Dunlap
Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com
Cc: Nogah Frankel
Cc: Jamal Hadi Salim
Cc: Cong Wang
Cc: Jiri Pirko
Cc: netdev@vger.kernel.org
Cc: "David S. Miller"
Cc: Jakub Kicinski
Signed-off-by: David S. Miller

Randy Dunlap
2020-12-29 06:52:54 +0800
a533b70a6 net: neighbor: fix a crash caused by mod zero

pneigh_enqueue() tries to obtain a random delay by mod
NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY)
might be zero at that point because someone could write zero
to /proc/sys/net/ipv4/neigh/[device]/proxy_delay after the
caller checks it.

This patch uses prandom_u32_max() to get a random delay instead
which avoids potential division by zero.
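Why a bounded-random helper sidesteps the problem can be shown with the multiply-and-shift scaling such helpers are commonly built on. The sketch assumes that prandom_u32_max() works along these lines. scale_below() is a hypothetical userspace stand-in, and the random input is passed in explicitly so the function stays deterministic.

```c
#include <stdint.h>

/*
 * Map a 32-bit random value rnd into [0, bound) with a multiply and a
 * shift. No division or modulo is involved, so bound == 0 simply yields 0.
 */
static uint32_t scale_below(uint32_t rnd, uint32_t bound)
{
	return (uint32_t)(((uint64_t)rnd * bound) >> 32);
}
```

Compared with rnd % bound, a bound that races to zero after the caller checked it produces a zero delay rather than a division fault.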

Signed-off-by: weichenchen
Signed-off-by: David S. Miller

weichenchen
2020-12-29 06:49:48 +0800
21fdca22e ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()

RT_TOS() only clears one of the ECN bits. Therefore, when
fib_compute_spec_dst() resorts to a fib lookup, it can return
different results depending on the value of the second ECN bit.

For example, ECT(0) and ECT(1) packets could be treated differently.

$ ip netns add ns0
$ ip netns add ns1
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev lo up
$ ip -netns ns1 link set dev lo up
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up

$ ip -netns ns0 address add 192.0.2.10/24 dev veth01
$ ip -netns ns1 address add 192.0.2.11/24 dev veth10

$ ip -netns ns1 address add 192.0.2.21/32 dev lo
$ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
$ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0

With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
(ping uses -Q to set all TOS and ECN bits):

$ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms

But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
because the "tos 4" route isn't matched:

$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms

After this patch the ECN bits don't affect the result anymore:

$ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
[...]
64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
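The two ping results above follow from simple masking arithmetic, sketched here in userspace C. The constants are written out by hand: 0x1E is the mask RT_TOS() applies, and 0x1C is the same mask with both ECN bits cleared. The macro and function names are hypothetical, and the assumption is that the patch uses a mask equivalent to 0x1C for the lookup.

```c
#include <stdint.h>

#define TOS_MASK_WITH_ECN_BIT 0x1E /* mask applied by RT_TOS(): keeps one ECN bit */
#define TOS_MASK_NO_ECN       0x1C /* same mask with both ECN bits cleared */

/* Lookup key as computed before the patch. */
static uint8_t tos_key_old(uint8_t tos)
{
	return tos & TOS_MASK_WITH_ECN_BIT;
}

/* Lookup key with the ECN bits ignored. */
static uint8_t tos_key_new(uint8_t tos)
{
	return tos & TOS_MASK_NO_ECN;
}
```

With -Q 5 the old key is 4 and the "tos 4" route matches. With -Q 6 the old key stays 6 and the route is missed, whereas the new key is 4 in both cases.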

Fixes: 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper.")
Signed-off-by: Guillaume Nault
Signed-off-by: David S. Miller

Guillaume Nault
2020-12-29 06:44:32 +0800
e7579d5d5 net: mptcp: cap forward allocation to 1M

the following syzkaller reproducer:

r0 = socket$inet_mptcp(0x2, 0x1, 0x106)
bind$inet(r0, &(0x7f0000000080)={0x2, 0x4e24, @multicast2}, 0x10)
connect$inet(r0, &(0x7f0000000480)={0x2, 0x4e24, @local}, 0x10)
sendto$inet(r0, &(0x7f0000000100)="f6", 0xffffffe7, 0xc000, 0x0, 0x0)

systematically triggers the following warning:

WARNING: CPU: 2 PID: 8618 at net/core/stream.c:208 sk_stream_kill_queues+0x3fa/0x580
Modules linked in:
CPU: 2 PID: 8618 Comm: syz-executor Not tainted 5.10.0+ #334
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/04
RIP: 0010:sk_stream_kill_queues+0x3fa/0x580
Code: df 48 c1 ea 03 0f b6 04 02 84 c0 74 04 3c 03 7e 40 8b ab 20 02 00 00 e9 64 ff ff ff e8 df f0 81 2
RSP: 0018:ffffc9000290fcb0 EFLAGS: 00010293
RAX: ffff888011cb8000 RBX: 0000000000000000 RCX: ffffffff86eecf0e
RDX: 0000000000000000 RSI: ffffffff86eecf6a RDI: 0000000000000005
RBP: 0000000000000e28 R08: ffff888011cb8000 R09: fffffbfff1f48139
R10: ffffffff8fa409c7 R11: fffffbfff1f48138 R12: ffff8880215e6220
R13: ffffffff8fa409c0 R14: ffffc9000290fd30 R15: 1ffff92000521fa2
FS: 00007f41c78f4800(0000) GS:ffff88802d000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f95c803d088 CR3: 0000000025ed2000 CR4: 00000000000006f0
Call Trace:
__mptcp_destroy_sock+0x4f5/0x8e0
mptcp_close+0x5e2/0x7f0
inet_release+0x12b/0x270
__sock_release+0xc8/0x270
sock_close+0x18/0x20
__fput+0x272/0x8e0
task_work_run+0xe0/0x1a0
exit_to_user_mode_prepare+0x1df/0x200
syscall_exit_to_user_mode+0x19/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9

userspace programs provide arbitrarily high values of 'len' in sendmsg():
this is causing integer overflow of 'amount'. Cap forward allocation to 1
megabyte: higher values are not really useful.
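The capping step can be sketched as a small clamp applied before the length is narrowed to an int. This is an illustration only: capped_amount() and FWD_ALLOC_CAP are hypothetical names, and the real patch may place the clamp differently.

```c
#include <stddef.h>

#define FWD_ALLOC_CAP ((size_t)1 << 20) /* 1 megabyte, as in the commit text */

/* Clamp the caller-supplied length before it is stored in an int. */
static int capped_amount(size_t len)
{
	if (len > FWD_ALLOC_CAP)
		len = FWD_ALLOC_CAP;
	return (int)len;
}
```

The length 0xffffffe7 from the reproducer above is reduced to 1048576 instead of wrapping to a negative value.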

Suggested-by: Paolo Abeni
Fixes: e93da92896bc ("mptcp: implement wmem reservation")
Signed-off-by: Davide Caratti
Link: https://lore.kernel.org/r/3334d00d8b2faecafdfab9aa593efcbf61442756.1608584474.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski

Davide Caratti
2020-12-29 05:53:57 +0800
4ae2bb816 net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc

Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart
Reviewed-by: Alexander Duyck
Signed-off-by: Jakub Kicinski

Antoine Tenart
2020-12-29 05:26:46 +0800
2d57b4f14 net-sysfs: take the rtnl lock when storing xps_rxqs

Two race conditions can be triggered when storing xps rxqs, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

- netif_set_xps_queue uses dev->num_tc as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->num_tc is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.

- netdev_set_num_tc sets dev->num_tc.

If new_dev_maps is allocated using dev->num_tc and then dev->num_tc
is set to a higher value through netdev_set_num_tc, later accesses to
new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps, triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

2.1. netdev_set_num_tc starts by resetting the xps queues,
dev->num_tc isn't updated yet.

2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.

2.3. netdev_set_num_tc updates dev->num_tc.

2.4. Later accesses to the map lead to out of bound accesses and
oops.

A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_rxqs in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_rxqs_store.

Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart
Reviewed-by: Alexander Duyck
Signed-off-by: Jakub Kicinski

Antoine Tenart
2020-12-29 05:26:46 +0800
fb2503858 net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc

Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart
Reviewed-by: Alexander Duyck
Signed-off-by: Jakub Kicinski

Antoine Tenart
2020-12-29 05:26:46 +0800
1ad58225d net-sysfs: take the rtnl lock when storing xps_cpus

Two race conditions can be triggered when storing xps cpus, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

- netif_set_xps_queue uses dev->num_tc as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->num_tc is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.

- netdev_set_num_tc sets dev->num_tc.

If new_dev_maps is allocated using dev->num_tc and then dev->num_tc
is set to a higher value through netdev_set_num_tc, later accesses to
new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps, triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

2.1. netdev_set_num_tc starts by resetting the xps queues,
dev->num_tc isn't updated yet.

2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.

2.3. netdev_set_num_tc updates dev->num_tc.

2.4. Later accesses to the map lead to out of bound accesses and
oops.

A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_cpus in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_cpus_store.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart
Reviewed-by: Alexander Duyck
Signed-off-by: Jakub Kicinski

Antoine Tenart
2020-12-29 05:26:46 +0800
f5f2c9a0e libceph: align session_key and con_secret to 16 bytes

crypto_shash_setkey() and crypto_aead_setkey() will do a (small)
GFP_ATOMIC allocation to align the key if it isn't suitably aligned.
It's not a big deal, but at the same time easy to avoid.

The actual alignment requirement is dynamic, queryable with
crypto_shash_alignmask() and crypto_aead_alignmask(), but shouldn't
be stricter than 16 bytes for our algorithms.

Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Signed-off-by: Ilya Dryomov

Ilya Dryomov
2020-12-29 03:34:33 +0800
ad32fe880 libceph: fix auth_signature buffer allocation in secure mode

auth_signature frame is 68 bytes in plain mode and 96 bytes in
secure mode but we are requesting 68 bytes in both modes. By luck,
this doesn't actually result in any invalid memory accesses because
the allocation is satisfied out of kmalloc-96 slab and so exactly
96 bytes are allocated, but KASAN rightfully complains.

Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Reported-by: Luis Henriques
Signed-off-by: Ilya Dryomov

Ilya Dryomov
2020-12-29 03:34:32 +0800

28 Dec, 2020

2 commits

b4e70d8dd netfilter: nftables: add set expression flags

The set flag NFT_SET_EXPR provides a hint to the kernel that userspace
supports multiple expressions per set element. In the same
direction, NFT_DYNSET_F_EXPR specifies that dynset expression defines
multiple expressions per set element.

This allows new userspace software with old kernels to bail out with
EOPNOTSUPP. This update is similar to ef516e8625dd ("netfilter:
nf_tables: reintroduce the NFT_SET_CONCAT flag"). The NFT_SET_EXPR flag
needs to be set when the NFTA_SET_EXPRESSIONS attribute is specified.
The NFT_SET_EXPR flag is not set with NFTA_SET_EXPR to retain
backward compatibility in old userspace binaries.

Fixes: 48b0ae046ee9 ("netfilter: nftables: netlink support for several set element expressions")
Signed-off-by: Pablo Neira Ayuso

Pablo Neira Ayuso
2020-12-28 17:50:26 +0800
95cd4bca7 netfilter: nft_dynset: report EOPNOTSUPP on missing set feature

If userspace requests a feature which is not available in the original set
definition, then bail out with EOPNOTSUPP. If userspace sends
unsupported dynset flags (new feature not supported by this kernel),
then report EOPNOTSUPP to userspace. EINVAL should be only used to
report malformed netlink messages from userspace.
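The error-code policy described here can be sketched as a small validation routine. Everything in it is hypothetical: the DEMO_F_* flag layout, the attr_len parameter and check_dynset_flags() are invented for illustration and do not mirror the nft_dynset code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag layout, for illustration only. */
#define DEMO_F_INV   0x1u
#define DEMO_F_EXPR  0x2u
#define DEMO_F_KNOWN (DEMO_F_INV | DEMO_F_EXPR)

/*
 * attr_len models the length of the attribute carrying the flags;
 * set_features models what the set was created with.
 */
static int check_dynset_flags(size_t attr_len, uint32_t flags,
			      uint32_t set_features)
{
	if (attr_len != sizeof(uint32_t))
		return -EINVAL;		/* malformed message */
	if (flags & ~DEMO_F_KNOWN)
		return -EOPNOTSUPP;	/* flag this kernel does not know */
	if ((flags & DEMO_F_EXPR) && !(set_features & DEMO_F_EXPR))
		return -EOPNOTSUPP;	/* feature missing from the set definition */
	return 0;
}
```

Keeping EINVAL for structurally broken requests lets userspace tell "retry without this feature" apart from "the request itself is wrong".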

Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso

Pablo Neira Ayuso
2020-12-28 17:50:16 +0800

27 Dec, 2020

1 commit

6cb56218a netfilter: xt_RATEEST: reject non-null terminated string from userspace

syzbot reports:
detected buffer overflow in strlen
[..]
Call Trace:
strlen include/linux/string.h:325 [inline]
strlcpy include/linux/string.h:348 [inline]
xt_rateest_tg_checkentry+0x2a5/0x6b0 net/netfilter/xt_RATEEST.c:143

strlcpy assumes src is a c-string. Check info->name before it is used.
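A check of this kind can be sketched in portable C as follows. It is a minimal illustration, not the xt_RATEEST code: NAME_FIELD_LEN, check_name() and the choice of -ENAMETOOLONG as the error code are assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define NAME_FIELD_LEN 16 /* hypothetical fixed-size field from userspace */

/* Accept the field only if a terminating NUL lies inside it. */
static int check_name(const char name[NAME_FIELD_LEN])
{
	if (memchr(name, '\0', NAME_FIELD_LEN) == NULL)
		return -ENAMETOOLONG;
	return 0;
}
```

Only after such a check is it safe to hand the buffer to functions that call strlen() on it.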

Reported-by: syzbot+e86f7c428c8c50db65b4@syzkaller.appspotmail.com
Fixes: 5859034d7eb8793 ("[NETFILTER]: x_tables: add RATEEST target")
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2020-12-27 18:52:26 +0800

24 Dec, 2020

2 commits

427c94055 net/ncsi: Use real net-device for response handler

When aggregating ncsi interfaces and dedicated interfaces to bond
interfaces, the ncsi response handler will use the wrong net device to
find ncsi_dev, so that the ncsi interface will not work properly.
Here, we use the original net device to fix it.

Fixes: 138635cc27c9 ("net/ncsi: NCSI response packet handler")
Signed-off-by: John Wang
Link: https://lore.kernel.org/r/20201223055523.2069-1-wangzhiqiang.bj@bytedance.com
Signed-off-by: Jakub Kicinski

John Wang
2020-12-24 04:22:23 +0800
826f328e2 net: dcb: Validate netlink message in DCB handler

DCB uses the same handler function for both RTM_GETDCB and RTM_SETDCB
messages. dcb_doit() bounces RTM_SETDCB messages if the user does not have
the CAP_NET_ADMIN capability.

However, the operation to be performed is not decided from the DCB message
type, but from the DCB command. Thus DCB_CMD_*_GET commands are used for
reading DCB objects, the corresponding SET and DEL commands are used for
manipulation.

The assumption is that set-like commands will be sent via an RTM_SETDCB
message, and get-like ones via RTM_GETDCB. However, this assumption is not
enforced.

It is therefore possible to manipulate DCB objects without CAP_NET_ADMIN
capability by sending the corresponding command in an RTM_GETDCB message.
That is a bug. Fix it by validating the type of the request message against
the type used for the response.
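The validation idea can be modelled with a per-command table that records which message type each command belongs to. This sketch is hypothetical throughout: the enums, the required_type[] table, validate_request() and the specific error codes are placeholders, not the DCB implementation.

```c
#include <errno.h>

/* Hypothetical stand-ins for the two netlink message types. */
enum msg_type { MSG_GET, MSG_SET };

/* Hypothetical commands; each one is tied to a message type below. */
enum cmd { CMD_STATE_GET, CMD_STATE_SET, CMD_COUNT };

static const enum msg_type required_type[CMD_COUNT] = {
	[CMD_STATE_GET] = MSG_GET,
	[CMD_STATE_SET] = MSG_SET,
};

static int validate_request(enum msg_type type, enum cmd cmd,
			    int has_cap_net_admin)
{
	/* Privilege check keyed on the message type, as before. */
	if (type == MSG_SET && !has_cap_net_admin)
		return -EPERM;
	/* New: the command must match the message type it arrived in. */
	if (required_type[cmd] != type)
		return -EINVAL;
	return 0;
}
```

Without the second check, an unprivileged caller in this model could send CMD_STATE_SET inside a MSG_GET message and skip the capability test entirely.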

Fixes: 2f90b8657ec9 ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
Signed-off-by: Petr Machata
Link: https://lore.kernel.org/r/a2a9b88418f3a58ef211b718f2970128ef9e3793.1608673640.git.me@pmachata.org
Signed-off-by: Jakub Kicinski

Petr Machata
2020-12-24 04:19:48 +0800

22 Dec, 2020

1 commit

70990afa3 Merge tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux

Pull 9p update from Dominique Martinet:

- fix long-standing limitation on open-unlink-fop pattern

- add refcount to p9_fid (fixes the above and will allow for more
cleanups and simplifications in the future)

* tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux:
9p: Remove unnecessary IS_ERR() check
9p: Uninitialized variable in v9fs_writeback_fid()
9p: Fix writeback fid incorrectly being attached to dentry
9p: apply review requests for fid refcounting
9p: add refcount to p9_fid struct
fs/9p: search open fids first
fs/9p: track open fids
fs/9p: fix create-unlink-getattr idiom

Linus Torvalds
2020-12-22 02:28:02 +0800

19 Dec, 2020

3 commits

1e72faedc Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Incorrect loop in error path of nft_set_elem_expr_clone(),
from Colin Ian King.

2) Missing xt_table_get_private_protected() to access table
private data in x_tables, from Subash Abhinov Kasiviswanathan.

3) Possible oops in ipset hash type resize, from Vasily Averin.

4) Fix shift-out-of-bounds in ipset hash type, also from Vasily.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: ipset: fix shift-out-of-bounds in htable_bits()
netfilter: ipset: fixes possible oops in mtype_resize
netfilter: x_tables: Update remaining dereference to RCU
netfilter: nftables: fix incorrect increment of loop counter
====================

Link: https://lore.kernel.org/r/20201218120409.3659-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski

Jakub Kicinski
2020-12-19 10:07:14 +0800
698285da7 net/sched: sch_taprio: ensure to reset/destroy all child qdiscs

taprio_graft() can insert a NULL element in the array of child qdiscs. As
a consequence, taprio_reset() might not reset child qdiscs completely, and
taprio_destroy() might leak resources. Fix it by ensuring that loops that
iterate over q->qdiscs[] don't end when they find the first NULL item.
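
The difference between the two loop shapes can be illustrated with a minimal userspace sketch. The struct child type and both helper functions are hypothetical; they only mimic an array of child pointers that may contain NULL holes.

```c
#include <stddef.h>

struct child { int resets; };	/* hypothetical stand-in for a child qdisc */

/* Problematic pattern: stops at the first NULL slot, so any children
 * stored after the hole are never visited. */
size_t reset_until_null(struct child **q, size_t n)
{
	size_t i, done = 0;

	for (i = 0; i < n && q[i]; i++) {
		q[i]->resets++;
		done++;
	}
	return done;
}

/* Fixed pattern: visit every slot and skip the NULL ones. */
size_t reset_all(struct child **q, size_t n)
{
	size_t i, done = 0;

	for (i = 0; i < n; i++) {
		if (!q[i])
			continue;
		q[i]->resets++;
		done++;
	}
	return done;
}
```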

Fixes: 44d4775ca518 ("net/sched: sch_taprio: reset child qdiscs before freeing them")
Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Suggested-by: Jakub Kicinski
Signed-off-by: Davide Caratti
Link: https://lore.kernel.org/r/13edef6778fef03adc751582562fba4a13e06d6a.1608240532.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski

Davide Caratti
2020-12-19 08:43:29 +0800
abdcd06c4 net: af_packet: fix procfs header for 64-bit pointers

On 64-bit systems the packet procfs header field names following 'sk'
are not aligned correctly:

sk RefCnt Type Proto Iface R Rmem User Inode
00000000605d2c64 3 3 0003 7 1 450880 0 16643
00000000080e9b80 2 2 0000 0 0 0 0 17404
00000000b23b8a00 2 2 0000 0 0 0 0 17421
...

With this change field names are correctly aligned:

sk RefCnt Type Proto Iface R Rmem User Inode
000000005c3b1d97 3 3 0003 7 1 21568 0 16178
000000007be55bb7 3 3 fbce 8 1 0 0 16250
00000000be62127d 3 3 fbcd 8 1 0 0 16254
...
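
One way to get this kind of alignment is to size the "sk" column from the pointer width. The following userspace sketch assumes that a pointer is printed as two hex digits per byte followed by one space; format_header() is a hypothetical helper and not the function changed by this commit.

```c
#include <stdio.h>

/* Build the header so that the "sk" column is as wide as a printed
 * pointer: two hex digits per byte, plus one separating space. */
int format_header(char *buf, size_t len)
{
	int width = (int)(2 * sizeof(void *)) + 1;

	/* "%-*s" left-justifies "sk" in a field of the computed width. */
	return snprintf(buf, len,
			"%-*sRefCnt Type Proto  Iface R Rmem   User   Inode\n",
			width, "sk");
}
```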

Signed-off-by: Baruch Siach
Link: https://lore.kernel.org/r/54917251d8433735d9a24e935a6cb8eb88b4058a.1608103684.git.baruch@tkos.co.il
Signed-off-by: Jakub Kicinski

Baruch Siach
2020-12-19 04:17:23 +0800

18 Dec, 2020

7 commits

b1b95cb5c xsk: Rollback reservation at NETDEV_TX_BUSY

Roll back the reservation in the completion ring when we get a
NETDEV_TX_BUSY. When this error is received from the driver, we are
supposed to let the user application retry the transmit. And in
order to do this, we need to roll back the failed send so it can be
retried. Unfortunately, we did not cancel the reservation we had made
in the completion ring. By not doing this, we actually make the
completion ring one entry smaller per NETDEV_TX_BUSY error we get, and
after enough of these errors the completion ring will be of size zero
and transmit will stop working.

Fix this by cancelling the reservation when we get a NETDEV_TX_BUSY
error.
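
The effect of the missing cancel can be seen in a small model of the completion ring. This is only a sketch: struct cq, cq_reserve(), cq_cancel() and xmit_one() are hypothetical names that track nothing but the number of reserved entries.

```c
#include <stdbool.h>

/* Hypothetical producer-side view of a completion ring. */
struct cq {
	unsigned int size;	/* total number of entries */
	unsigned int reserved;	/* entries reserved but not yet completed */
};

static bool cq_reserve(struct cq *cq)
{
	if (cq->reserved == cq->size)
		return false;	/* ring is full: backpressure */
	cq->reserved++;
	return true;
}

static void cq_cancel(struct cq *cq)
{
	cq->reserved--;
}

enum tx_status { TX_OK, TX_BUSY };

/* Returns 0 on success, -1 if the caller should retry later. */
int xmit_one(struct cq *cq, enum tx_status driver_result)
{
	if (!cq_reserve(cq))
		return -1;
	if (driver_result == TX_BUSY) {
		/* Without this cancel, every busy error would permanently
		 * consume one entry of the ring. */
		cq_cancel(cq);
		return -1;
	}
	return 0;
}
```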

Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Xuan Zhuo
Signed-off-by: Magnus Karlsson
Signed-off-by: Daniel Borkmann
Acked-by: Björn Töpel
Link: https://lore.kernel.org/bpf/20201218134525.13119-3-magnus.karlsson@gmail.com

Magnus Karlsson
2020-12-18 23:10:21 +0800
f09ced405 xsk: Fix race in SKB mode transmit with shared cq

Fix a race when multiple sockets are simultaneously calling sendto()
when the completion ring is shared in the SKB case. This is the case
when you share the same netdev and queue id through the
XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
be in xsk_generic_xmit() and call the backpressure mechanism in
xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
specific scenario, a race might occur since the rings are
single-producer single-consumer.

Fix this by moving the tx_completion_lock from the socket to the pool
as the pool is shared between the sockets that share the completion
ring. (The pool is not shared when this is not the case.) And then
protect the accesses to xskq_prod_reserve() with this lock. The
tx_completion_lock is renamed cq_lock to better reflect that it
protects accesses to the potentially shared completion ring.
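
The idea of keeping the lock next to the shared resource can be sketched in userspace as below. A pthread mutex stands in for the kernel lock, and struct pool, struct xsock and sock_reserve() are hypothetical; only the placement of cq_lock in the pool mirrors the description above.

```c
#include <pthread.h>

/* The lock lives in the pool, because the pool (and its completion
 * ring) is what the sockets share. */
struct pool {
	pthread_mutex_t cq_lock;
	unsigned int cq_size;
	unsigned int cq_reserved;
};

struct xsock {
	struct pool *pool;	/* may be shared with other sockets */
};

/* Reserve one completion entry; returns 0 on success, -1 if full. */
int sock_reserve(struct xsock *xs)
{
	struct pool *p = xs->pool;
	int ret = -1;

	pthread_mutex_lock(&p->cq_lock);
	if (p->cq_reserved < p->cq_size) {
		p->cq_reserved++;
		ret = 0;
	}
	pthread_mutex_unlock(&p->cq_lock);
	return ret;
}
```

A per-socket lock would not help here, since two sockets holding two different locks could still update the same ring concurrently.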

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Reported-by: Xuan Zhuo
Signed-off-by: Magnus Karlsson
Signed-off-by: Daniel Borkmann
Acked-by: Björn Töpel
Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com

Magnus Karlsson
2020-12-18 23:10:21 +0800
8bee68338 xsk: Fix memory leak for failed bind

Fix a possible memory leak when a bind of an AF_XDP socket fails. When
the fill and completion rings are created, they are tied to the
socket. But when the buffer pool is later created at bind time, the
ownership of these two rings is transferred to the buffer pool as
they might be shared between sockets (and the buffer pool cannot be
created until we know what we are binding to). So, before the buffer
pool is created, these two rings are cleaned up with the socket, and
after they have been transferred they are cleaned up together with
the buffer pool.

The problem is that ownership was transferred before it was absolutely
certain that the buffer pool could be created and initialized
correctly. When one of these errors occurred, the fill and
completion rings belonged to neither the socket nor the pool and
were therefore leaked. Solve this by moving the ownership transfer
to the point where the buffer pool has been completely set up and
there is no way it can fail.
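
The ordering described above can be sketched as follows. All names here (struct ring, struct pool, struct xsock, pool_setup(), bind_sock()) are hypothetical, and the should_fail flag merely simulates a setup error; the point is that the rings change owner only after the last step that can fail.

```c
#include <stdlib.h>

struct ring { int dummy; };

struct pool {
	struct ring *fq, *cq;	/* owned by the pool after a successful bind */
};

struct xsock {
	struct ring *fq_tmp, *cq_tmp;	/* owned by the socket before bind */
	struct pool *pool;
};

/* Hypothetical setup step that may fail. */
static int pool_setup(struct pool *p, int should_fail)
{
	(void)p;
	return should_fail ? -1 : 0;
}

int bind_sock(struct xsock *xs, int should_fail)
{
	struct pool *p = calloc(1, sizeof(*p));

	if (!p)
		return -1;
	if (pool_setup(p, should_fail)) {
		/* The rings still belong to the socket, so nothing leaks. */
		free(p);
		return -1;
	}
	/* Nothing can fail past this point: hand over ownership. */
	p->fq = xs->fq_tmp;
	p->cq = xs->cq_tmp;
	xs->fq_tmp = NULL;
	xs->cq_tmp = NULL;
	xs->pool = p;
	return 0;
}
```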

Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool")
Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson
Signed-off-by: Daniel Borkmann
Acked-by: Björn Töpel
Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com

Magnus Karlsson
2020-12-18 05:48:55 +0800
d64c6f96b Merge tag 'net-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Current release - always broken:

- net/smc: fix access to parent of an ib device

- devlink: use _BITUL() macro instead of BIT() in the UAPI header

- handful of mptcp fixes

Previous release - regressions:

- intel: AF_XDP: clear the status bits for the next_to_use descriptor

- dpaa2-eth: fix the size of the mapped SGT buffer

Previous release - always broken:

- mptcp: fix security context on server socket

- ethtool: fix string set id check

- ethtool: fix error paths in ethnl_set_channels()

- lan743x: fix rx_napi_poll/interrupt ping-pong

- qca: ar9331: fix sleeping function called from invalid context bug"

* tag 'net-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (32 commits)
net/sched: sch_taprio: reset child qdiscs before freeing them
nfp: move indirect block cleanup to flower app stop callback
octeontx2-af: Fix undetected unmap PF error check
net: nixge: fix spelling mistake in Kconfig: "Instuments" -> "Instruments"
qlcnic: Fix error code in probe
mptcp: fix pending data accounting
mptcp: push pending frames when subflow has free space
mptcp: properly annotate nested lock
mptcp: fix security context on server socket
net/mlx5: Fix compilation warning for 32-bit platform
mptcp: clear use_ack and use_map when dropping other suboptions
devlink: use _BITUL() macro instead of BIT() in the UAPI header
net: korina: fix return value
net/smc: fix access to parent of an ib device
ethtool: fix error paths in ethnl_set_channels()
nfc: s3fwrn5: Remove unused NCI prop commands
nfc: s3fwrn5: Remove the delay for NFC sleep
phy: fix kdoc warning
tipc: do sanity check payload of a netlink message
use __netdev_notify_peers in hyperv
...

Linus Torvalds
2020-12-18 05:45:24 +0800
74f602dc9 Merge tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
"Highlights include:

Features:

- NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
support

- A series of patches to improve readdir performance, particularly
with large directories

- Basic support for using NFS/RDMA with the pNFS files and flexfiles
drivers

- Micro-optimisations for RDMA

- RDMA tracing improvements

Bugfixes:

- Fix a long standing bug with xs_read_xdr_buf() when receiving
partial pages (Dan Aloni)

- Various fixes for getxattr and listxattr, when used over non-TCP
transports

- Fixes for containerised NFS from Sargun Dhillon

- switch nfsiod to be an UNBOUND workqueue (Neil Brown)

- READDIR should not ask for security label information if there is
no LSM policy (Olga Kornievskaia)

- Avoid using interval-based rebinding with TCP in lockd (Calum
Mackay)

- A series of RPC and NFS layer fixes to support the NFSv4.2
READ_PLUS code

- A couple of fixes for pnfs/flexfiles read failover

Cleanups:

- Various cleanups for the SUNRPC xdr code in conjunction with the
READ_PLUS fixes"

* tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits)
NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read
NFSv4/pnfs: Add tracing for the deviceid cache
fs/lockd: convert comma to semicolon
NFSv4.2: fix error return on memory allocation failure
NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet
NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow
NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer
NFSv4.2: decode_read_plus_hole() needs to check the extent offset
NFSv4.2: decode_read_plus_data() must skip padding after data segment
NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
SUNRPC: When expanding the buffer, we may need grow the sparse pages
SUNRPC: Cleanup - constify a number of xdr_buf helpers
SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field
SUNRPC: _copy_to/from_pages() now check for zero length
SUNRPC: Cleanup xdr_shrink_bufhead()
SUNRPC: Fix xdr_expand_hole()
SUNRPC: Fixes for xdr_align_data()
SUNRPC: _shift_data_left/right_pages should check the shift length
...

Linus Torvalds
2020-12-18 04:15:03 +0800
be695ee29 Merge tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
"The big ticket item here is support for msgr2 on-wire protocol, which
adds the option of full in-transit encryption using AES-GCM algorithm
(myself).

On top of that we have a series to avoid intermittent errors during
recovery with recover_session=clean and some MDS request encoding work
from Jeff, a cap handling fix and assorted observability improvements
from Luis and Xiubo and a good number of cleanups.

Luis also ran into a corner case with quotas which sadly means that we
are back to denying cross-quota-realm renames"

* tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client: (59 commits)
libceph: drop ceph_auth_{create,update}_authorizer()
libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
libceph: introduce connection modes and ms_mode option
libceph, rbd: ignore addr->type while comparing in some cases
libceph, ceph: get and handle cluster maps with addrvecs
libceph: factor out finish_auth()
libceph: drop ac->ops->name field
libceph: amend cephx init_protocol() and build_request()
libceph, ceph: incorporate nautilus cephx changes
libceph: safer en/decoding of cephx requests and replies
libceph: more insight into ticket expiry and invalidation
libceph: move msgr1 protocol specific fields to its own struct
libceph: move msgr1 protocol implementation to its own file
libceph: separate msgr1 protocol implementation
libceph: export remaining protocol independent infrastructure
libceph: export zero_page
libceph: rename and export con->flags bits
libceph: rename and export con->state states
libceph: make con->state an int
...

Linus Torvalds
2020-12-18 03:53:52 +0800
44d4775ca net/sched: sch_taprio: reset child qdiscs before freeing them

syzkaller shows that packets can still be dequeued while taprio_destroy()
is running. Let sch_taprio use the reset() function to cancel the advance
timer and drop all skbs from the child qdiscs.

Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Link: https://syzkaller.appspot.com/bug?id=f362872379bf8f0017fb667c1ab158f2d1e764ae
Reported-by: syzbot+8971da381fb5a31f542d@syzkaller.appspotmail.com
Signed-off-by: Davide Caratti
Acked-by: Vinicius Costa Gomes
Link: https://lore.kernel.org/r/63b6d79b0e830ebb0283e020db4df3cdfdfb2b94.1608142843.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski

Davide Caratti
2020-12-18 02:57:57 +0800