Eric Lee / smarc-fsl-linux-kernel

14 Jun, 2020

2 commits

96144c58a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

Pull networking fixes from David Miller:

1) Fix cfg80211 deadlock, from Johannes Berg.

2) RXRPC fails to send norigications, from David Howells.

3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
Geliang Tang.

4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.

5) The ucc_geth driver needs __netdev_watchdog_up exported, from
Valentin Longchamp.

6) Fix hashtable memory leak in dccp, from Wang Hai.

7) Fix how nexthops are marked as FDB nexthops, from David Ahern.

8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.

9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.

10) Fix link speed reporting in iavf driver, from Brett Creeley.

11) When a channel is used for XSK and then reused again later for XSK,
we forget to clear out the relevant data structures in mlx5 which
causes all kinds of problems. Fix from Maxim Mikityanskiy.

12) Fix memory leak in genetlink, from Cong Wang.

13) Disallow sockmap attachments to UDP sockets, it simply won't work.
From Lorenz Bauer.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
net: ethernet: ti: ale: fix allmulti for nu type ale
net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
net: atm: Remove the error message according to the atomic context
bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
libbpf: Support pre-initializing .bss global variables
tools/bpftool: Fix skeleton codegen
bpf: Fix memlock accounting for sock_hash
bpf: sockmap: Don't attach programs to UDP sockets
bpf: tcp: Recv() should return 0 when the peer socket is closed
ibmvnic: Flush existing work items before device removal
genetlink: clean up family attributes allocations
net: ipa: header pad field only valid for AP->modem endpoint
net: ipa: program upper nibbles of sequencer type
net: ipa: fix modem LAN RX endpoint id
net: ipa: program metadata mask differently
ionic: add pcie_print_link_status
rxrpc: Fix race between incoming ACK parser and retransmitter
net/mlx5: E-Switch, Fix some error pointer dereferences
net/mlx5: Don't fail driver on failure to create debugfs
net/mlx5e: CT: Fix ipv6 nat header rewrite actions
...

Linus Torvalds
2020-06-14 07:27:13 +0800
fa7566a0d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf ... Browse Code »

Alexei Starovoitov says:

====================
pull-request: bpf 2020-06-12

The following pull-request contains BPF updates for your *net* tree.

We've added 26 non-merge commits during the last 10 day(s) which contain
a total of 27 files changed, 348 insertions(+), 93 deletions(-).

The main changes are:

1) sock_hash accounting fix, from Andrey.

2) libbpf fix and probe_mem sanitizing, from Andrii.

3) sock_hash fixes, from Jakub.

4) devmap_val fix, from Jesper.

5) load_bytes_relative fix, from YiFei.
====================

Signed-off-by: David S. Miller

David S. Miller
2020-06-14 06:28:08 +0800

13 Jun, 2020

2 commits

60e5ca8a6 bpf: Fix memlock accounting for sock_hash ... Browse Code »

Add missed bpf_map_charge_init() in sock_hash_alloc() and
correspondingly bpf_map_charge_finish() on ENOMEM.

It was found accidentally while working on unrelated selftest that
checks "map->memory.pages > 0" is true for all map types.

Before:
# bpftool m l
...
3692: sockhash name m_sockhash flags 0x0
key 4B value 4B max_entries 8 memlock 0B

After:
# bpftool m l
...
84: sockmap name m_sockmap flags 0x0
key 4B value 4B max_entries 8 memlock 4096B

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Andrey Ignatov
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/20200612000857.2881453-1-rdna@fb.com

Andrey Ignatov
2020-06-13 06:21:29 +0800
f6fede856 bpf: sockmap: Don't attach programs to UDP sockets ... Browse Code »

The stream parser infrastructure isn't set up to deal with UDP
sockets, so we mustn't try to attach programs to them.

I remember making this change at some point, but I must have lost
it while rebasing or something similar.

Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support")
Signed-off-by: Lorenz Bauer
Signed-off-by: Alexei Starovoitov
Acked-by: Jakub Sitnicki
Link: https://lore.kernel.org/bpf/20200611172520.327602-1-lmb@cloudflare.com

Lorenz Bauer
2020-06-13 06:13:43 +0800

11 Jun, 2020

2 commits

0f5d82f18 net/filter: Permit reading NET in load_bytes_relative when MAC not set ... Browse Code »

Added a check in the switch case on start_header that checks for
the existence of the header, and in the case that MAC is not set
and the caller requests for MAC, -EFAULT. If the caller requests
for NET then MAC's existence is completely ignored.

There is no function to check NET header's existence and as far
as cgroup_skb/egress is concerned it should always be set.

Removed for ptr >= the start of header, considering offset is
bounded unsigned and should always be true. len
Signed-off-by: Daniel Borkmann
Reviewed-by: Stanislav Fomichev
Link: https://lore.kernel.org/bpf/76bb820ddb6a95f59a772ecbd8c8a336f646b362.1591812755.git.zhuyifei@google.com

YiFei Zhu
2020-06-11 22:05:56 +0800
1c3837266 Merge branch 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull sysctl fixes from Al Viro:
"Fixups to regressions in sysctl series"

* 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sysctl: reject gigantic reads/write to sysctl files
cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
trace: fix an incorrect __user annotation on stack_trace_sysctl
random: fix an incorrect __user annotation on proc_do_entropy
net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl

Linus Torvalds
2020-06-11 07:05:54 +0800

10 Jun, 2020

3 commits

845e0ebb4 net: change addr_list_lock back to static key ... Browse Code »

The dynamic key update for addr_list_lock still causes troubles,
for example the following race condition still exists:

CPU 0: CPU 1:
(RCU read lock) (RTNL lock)
dev_mc_seq_show() netdev_update_lockdep_key()
-> lockdep_unregister_key()
-> netif_addr_lock_bh()

because lockdep doesn't provide an API to update it atomically.
Therefore, we have to move it back to static keys and use subclass
for nest locking like before.

In commit 1a33e10e4a95 ("net: partially revert dynamic lockdep key
changes"), I already reverted most parts of commit ab92d68fc22f
("net: core: add generic lockdep keys").

This patch reverts the rest and also part of commit f3b0a18bb6cb
("net: remove unnecessary variables and callback"). After this
patch, addr_list_lock changes back to using static keys and
subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
not have to change back to ->ndo_get_lock_subclass().

And hopefully this reduces some syzbot lockdep noises too.

Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
Cc: Taehee Yoo
Cc: Dmitry Vyukov
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2020-06-10 03:59:45 +0800
75e68e5bf bpf, sockhash: Synchronize delete from bucket list on map free ... Browse Code »

We can end up modifying the sockhash bucket list from two CPUs when a
sockhash is being destroyed (sock_hash_free) on one CPU, while a socket
that is in the sockhash is unlinking itself from it on another CPU
it (sock_hash_delete_from_link).

This results in accessing a list element that is in an undefined state as
reported by KASAN:

| ==================================================================
| BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280
| Write of size 8 at addr dead000000000122 by task kworker/2:1/95
|
| CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
| dump_stack+0x97/0xe0
| ? sock_hash_free+0x13c/0x280
| __kasan_report.cold+0x5/0x40
| ? mark_lock+0xbc1/0xc00
| ? sock_hash_free+0x13c/0x280
| kasan_report+0x38/0x50
| ? sock_hash_free+0x152/0x280
| sock_hash_free+0x13c/0x280
| bpf_map_free_deferred+0xb2/0xd0
| ? bpf_map_charge_finish+0x50/0x50
| ? rcu_read_lock_sched_held+0x81/0xb0
| ? rcu_read_lock_bh_held+0x90/0x90
| process_one_work+0x59a/0xac0
| ? lock_release+0x3b0/0x3b0
| ? pwq_dec_nr_in_flight+0x110/0x110
| ? rwlock_bug.part.0+0x60/0x60
| worker_thread+0x7a/0x680
| ? _raw_spin_unlock_irqrestore+0x4c/0x60
| kthread+0x1cc/0x220
| ? process_one_work+0xac0/0xac0
| ? kthread_create_on_node+0xa0/0xa0
| ret_from_fork+0x24/0x30
| ==================================================================

Fix it by reintroducing spin-lock protected critical section around the
code that removes the elements from the bucket on sockhash free.

To do that we also need to defer processing of removed elements, until out
of atomic context so that we can unlink the socket from the map when
holding the sock lock.

Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free")
Reported-by: Eric Dumazet
Signed-off-by: Jakub Sitnicki
Signed-off-by: Alexei Starovoitov
Acked-by: John Fastabend
Link: https://lore.kernel.org/bpf/20200607205229.2389672-3-jakub@cloudflare.com

Jakub Sitnicki
2020-06-10 01:59:04 +0800
33a7c8315 bpf, sockhash: Fix memory leak when unlinking sockets in sock_hash_free ... Browse Code »

When sockhash gets destroyed while sockets are still linked to it, we will
walk the bucket lists and delete the links. However, we are not freeing the
list elements after processing them, leaking the memory.

The leak can be triggered by close()'ing a sockhash map when it still
contains sockets, and observed with kmemleak:

unreferenced object 0xffff888116e86f00 (size 64):
comm "race_sock_unlin", pid 223, jiffies 4294731063 (age 217.404s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
81 de e8 41 00 00 00 00 c0 69 2f 15 81 88 ff ff ...A.....i/.....
backtrace:
[] sock_hash_update_common+0x4ca/0x760
[] sock_hash_update_elem+0x1d2/0x200
[] __do_sys_bpf+0x2046/0x2990
[] do_syscall_64+0xad/0x9a0
[] entry_SYSCALL_64_after_hwframe+0x49/0xb3

Fix it by freeing the list element when we're done with it.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Jakub Sitnicki
Signed-off-by: Alexei Starovoitov
Acked-by: John Fastabend
Link: https://lore.kernel.org/bpf/20200607205229.2389672-2-jakub@cloudflare.com

Jakub Sitnicki
2020-06-10 01:59:04 +0800

08 Jun, 2020

1 commit

56965ac72 net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl ... Browse Code »

cpumask_parse_user works on __user pointers, so this is wrong now.

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Reported-by: build test robot
Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2020-06-08 22:13:56 +0800

05 Jun, 2020

3 commits

11d6011c2 net: core: device_rename: Use rwsem instead of a seqcount ... Browse Code »

Sequence counters write paths are critical sections that must never be
preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.

Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
netdev name retrieval.") handled a deadlock, observed with
CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
infinitely spinning: it got scheduled after the seqcount write side
blocked inside its own critical section.

To fix that deadlock, among other issues, the commit added a
cond_resched() inside the read side section. While this will get the
non-preemptible kernel eventually unstuck, the seqcount reader is fully
exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.

The fix is also still broken: if the seqcount reader belongs to a
real-time scheduling policy, it can spin forever and the kernel will
livelock.

Disabling preemption over the seqcount write side critical section will
not work: inside it are a number of GFP_KERNEL allocations and mutex
locking through the drivers/base/ :: device_rename() call chain.

>From all the above, replace the seqcount with a rwsem.

Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
Cc:
Reported-by: kbuild test robot [ v1 missing up_read() on error exit ]
Reported-by: Dan Carpenter [ v1 missing up_read() on error exit ]
Signed-off-by: Ahmed S. Darwish
Reviewed-by: Sebastian Andrzej Siewior
Signed-off-by: David S. Miller

Ahmed S. Darwish
2020-06-05 06:50:42 +0800
bb986a504 seg6: fix seg6_validate_srh() to avoid slab-out-of-bounds ... Browse Code »

The seg6_validate_srh() is used to validate SRH for three cases:

case1: SRH of data-plane SRv6 packets to be processed by the Linux kernel.
Case2: SRH of the netlink message received from user-space (iproute2)
Case3: SRH injected into packets through setsockopt

In case1, the SRH can be encoded in the Reduced way (i.e., first SID is
carried in DA only and not represented as SID in the SRH) and the
seg6_validate_srh() now handles this case correctly.

In case2 and case3, the SRH shouldn’t be encoded in the Reduced way
otherwise we lose the first segment (i.e., the first hop).

The current implementation of the seg6_validate_srh() allow SRH of case2
and case3 to be encoded in the Reduced way. This leads a slab-out-of-bounds
problem.

This patch verifies SRH of case1, case2 and case3. Allowing case1 to be
reduced while preventing SRH of case2 and case3 from being reduced .

Reported-by: syzbot+e8c028b62439eac42073@syzkaller.appspotmail.com
Reported-by: YueHaibing
Fixes: 0cb7498f234e ("seg6: fix SRH processing to comply with RFC8754")
Signed-off-by: Ahmed Abdelsalam
Signed-off-by: David S. Miller

Ahmed Abdelsalam
2020-06-05 06:39:32 +0800
49b235759 bpf: Fix unused-var without NETDEVICES ... Browse Code »

A recent commit added new variables only used if CONFIG_NETDEVICES is
set. A simple fix would be to only declare these variables if the same
condition is valid but Alexei suggested an even simpler solution:

since CONFIG_NETDEVICES doesn't change anything in .h I think the
best is to remove #ifdef CONFIG_NETDEVICES from net/core/filter.c
and rely on sock_bindtoindex() returning ENOPROTOOPT in the extreme
case of oddly configured kernels.

Fixes: 70c58997c1e8 ("bpf: Allow SO_BINDTODEVICE opt in bpf_setsockopt")
Suggested-by: Alexei Starovoitov
Signed-off-by: Matthieu Baerts
Signed-off-by: Daniel Borkmann
Acked-by: Song Liu
Link: https://lore.kernel.org/bpf/20200603190347.2310320-1-matthieu.baerts@tessares.net

Matthieu Baerts
2020-06-05 04:52:10 +0800

04 Jun, 2020

2 commits

cb8e59cc8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next ... Browse Code »

Pull networking updates from David Miller:

1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.

2) Add GSO partial support to igc, from Sasha Neftin.

3) Several cleanups and improvements to r8169 from Heiner Kallweit.

4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.

5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.

6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

8) Add sriov and vf support to hinic, from Luo bin.

9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...

Linus Torvalds
2020-06-04 07:27:18 +0800
e7c93cbfe Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux ... Browse Code »

Pull thread updates from Christian Brauner:
"We have been discussing using pidfds to attach to namespaces for quite
a while and the patches have in one form or another already existed
for about a year. But I wanted to wait to see how the general api
would be received and adopted.

This contains the changes to make it possible to use pidfds to attach
to the namespaces of a process, i.e. they can be passed as the first
argument to the setns() syscall.

When only a single namespace type is specified the semantics are
equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
equals setns(pidfd, CLONE_NEWNET).

However, when a pidfd is passed, multiple namespace flags can be
specified in the second setns() argument and setns() will attach the
caller to all the specified namespaces all at once or to none of them.

Specifying 0 is not valid together with a pidfd. Here are just two
obvious examples:

setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);

Allowing to also attach subsets of namespaces supports various
use-cases where callers setns to a subset of namespaces to retain
privilege, perform an action and then re-attach another subset of
namespaces.

Apart from significantly reducing the number of syscalls needed to
attach to all currently supported namespaces (eight "open+setns"
sequences vs just a single "setns()"), this also allows atomic setns
to a set of namespaces, i.e. either attaching to all namespaces
succeeds or we fail without having changed anything.

This is centered around a new internal struct nsset which holds all
information necessary for a task to switch to a new set of namespaces
atomically. Fwiw, with this change a pidfd becomes the only token
needed to interact with a container. I'm expecting this to be
picked-up by util-linux for nsenter rather soon.

Associated with this change is a shiny new test-suite dedicated to
setns() (for pidfds and nsfds alike)"

* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests/pidfd: add pidfd setns tests
nsproxy: attach to namespaces via pidfds
nsproxy: add struct nsset

Linus Torvalds
2020-06-04 04:12:57 +0800

03 Jun, 2020

2 commits

7cdec54f9 bpf: Add csum_level helper for fixing up csum levels ... Browse Code »

Add a bpf_csum_level() helper which BPF programs can use in combination
with bpf_skb_adjust_room() when they pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET
flag to the latter to avoid falling back to CHECKSUM_NONE.

The bpf_csum_level() allows to adjust CHECKSUM_UNNECESSARY skb->csum_levels
via BPF_CSUM_LEVEL_{INC,DEC} which calls __skb_{incr,decr}_checksum_unnecessary()
on the skb. The helper also allows a BPF_CSUM_LEVEL_RESET which sets the skb's
csum to CHECKSUM_NONE as well as a BPF_CSUM_LEVEL_QUERY to just return the
current level. Without this helper, there is no way to otherwise adjust the
skb->csum_level. I did not add an extra dummy flags as there is plenty of free
bitspace in level argument itself iff ever needed in future.

Signed-off-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov
Reviewed-by: Alan Maguire
Acked-by: Lorenz Bauer
Link: https://lore.kernel.org/bpf/279ae3717cb3d03c0ffeb511493c93c450a01e1a.1591108731.git.daniel@iogearbox.net

Daniel Borkmann
2020-06-03 02:50:23 +0800
836e66c21 bpf: Fix up bpf_skb_adjust_room helper's skb csum setting ... Browse Code »

Lorenz recently reported:

In our TC classifier cls_redirect [0], we use the following sequence of
helper calls to decapsulate a GUE (basically IP + UDP + custom header)
encapsulated packet:

bpf_skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
bpf_redirect(skb->ifindex, BPF_F_INGRESS)

It seems like some checksums of the inner headers are not validated in
this case. For example, a TCP SYN packet with invalid TCP checksum is
still accepted by the network stack and elicits a SYN ACK. [...]

That is, we receive the following packet from the driver:

| ETH | IP | UDP | GUE | IP | TCP |
skb->ip_summed == CHECKSUM_UNNECESSARY

ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading.
On this packet we run skb_adjust_room_mac(-encap_len), and get the following:

| ETH | IP | TCP |
skb->ip_summed == CHECKSUM_UNNECESSARY

Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing
into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is
turned into a no-op due to CHECKSUM_UNNECESSARY.

The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally,
it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does
not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original
skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now
there is no way to adjust the skb->csum_level. NICs that have checksum offload
disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected.

Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and
add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users
from opting out. Opting out is useful for the case where we don't remove/add
full protocol headers, or for the case where a user wants to adjust the csum
level manually e.g. through bpf_csum_level() helper that is added in subsequent
patch.

The bpf_skb_proto_{4_to_6,6_to_4}() for NAT64/46 translation from the BPF
bpf_skb_change_proto() helper uses bpf_skb_net_hdr_{push,pop}() pair internally
as well but doesn't change layers, only transitions between v4 to v6 and vice
versa, therefore no adoption is required there.

[0] https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/

Fixes: 2be7e212d541 ("bpf: add bpf_skb_adjust_room helper")
Reported-by: Lorenz Bauer
Reported-by: Alan Maguire
Signed-off-by: Daniel Borkmann
Signed-off-by: Lorenz Bauer
Signed-off-by: Alexei Starovoitov
Reviewed-by: Alan Maguire
Link: https://lore.kernel.org/bpf/CACAyw9-uU_52esMd1JjuA80fRPHJv5vsSg8GnfW3t_qDU4aVKQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/11a90472e7cce83e76ddbfce81fdfce7bfc68808.1591108731.git.daniel@iogearbox.net

Daniel Borkmann
2020-06-03 02:50:23 +0800

02 Jun, 2020

21 commits

b27f7bb59 flow_dissector: Move out netns_bpf prog callbacks ... Browse Code »

Move functions to manage BPF programs attached to netns that are not
specific to flow dissector to a dedicated module named
bpf/net_namespace.c.

The set of functions will grow with the addition of bpf_link support for
netns attached programs. This patch prepares ground by creating a place
for it.

This is a code move with no functional changes intended.

Signed-off-by: Jakub Sitnicki
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/20200531082846.2117903-4-jakub@cloudflare.com

Jakub Sitnicki
2020-06-02 06:21:02 +0800
a3fd7ceee net: Introduce netns_bpf for BPF programs attached to netns ... Browse Code »

In order to:

(1) attach more than one BPF program type to netns, or
(2) support attaching BPF programs to netns with bpf_link, or
(3) support multi-prog attach points for netns

we will need to keep more state per netns than a single pointer like we
have now for BPF flow dissector program.

Prepare for the above by extracting netns_bpf that is part of struct net,
for storing all state related to BPF programs attached to netns.

Turn flow dissector callbacks for querying/attaching/detaching a program
into generic ones that operate on netns_bpf. Next patch will move the
generic callbacks into their own module.

This is similar to how it is organized for cgroup with cgroup_bpf.

Signed-off-by: Jakub Sitnicki
Signed-off-by: Alexei Starovoitov
Cc: Stanislav Fomichev
Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com

Jakub Sitnicki
2020-06-02 06:21:02 +0800
171526f6f flow_dissector: Pull locking up from prog attach callback ... Browse Code »

Split out the part of attach callback that happens with attach/detach lock
acquired. This structures the prog attach callback in a way that opens up
doors for moving the locking out of flow_dissector and into generic
callbacks for attaching/detaching progs to netns in subsequent patches.

Signed-off-by: Jakub Sitnicki
Signed-off-by: Alexei Starovoitov
Reviewed-by: Stanislav Fomichev
Link: https://lore.kernel.org/bpf/20200531082846.2117903-2-jakub@cloudflare.com

Jakub Sitnicki
2020-06-02 06:21:02 +0800
70c58997c bpf: Allow SO_BINDTODEVICE opt in bpf_setsockopt ... Browse Code »

Extending the supported sockopts in bpf_setsockopt with
SO_BINDTODEVICE. We call sock_bindtoindex with parameter
lock_sk = false in this context because we already owning
the socket.

Signed-off-by: Ferenc Fejes
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/4149e304867b8d5a606a305bc59e29b063e51f49.1590871065.git.fejes@inf.elte.hu

Ferenc Fejes
2020-06-02 05:57:14 +0800
8ea204c2b net: Make locking in sock_bindtoindex optional ... Browse Code »

The sock_bindtoindex intended for kernel wide usage however
it will lock the socket regardless of the context. This modification
relax this behavior optionally: locking the socket will be optional
by calling the sock_bindtoindex with lock_sk = true.

The modification applied to all users of the sock_bindtoindex.

Signed-off-by: Ferenc Fejes
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/bee6355da40d9e991b2f2d12b67d55ebb5f5b207.1590871065.git.fejes@inf.elte.hu

Ferenc Fejes
2020-06-02 05:57:14 +0800
e91de6afa bpf: Fix running sk_skb program types with ktls ... Browse Code »

KTLS uses a stream parser to collect TLS messages and send them to
the upper layer tls receive handler. This ensures the tls receiver
has a full TLS header to parse when it is run. However, when a
socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
is enabled we end up with two stream parsers running on the same
socket.

The result is both try to run on the same socket. First the KTLS
stream parser runs and calls read_sock() which will tcp_read_sock
which in turn calls tcp_rcv_skb(). This dequeues the skb from the
sk_receive_queue. When this is done KTLS code then data_ready()
callback which because we stacked KTLS on top of the bpf stream
verdict program has been replaced with sk_psock_start_strp(). This
will in turn kick the stream parser again and eventually do the
same thing KTLS did above calling into tcp_rcv_skb() and dequeuing
a skb from the sk_receive_queue.

At this point the data stream is broke. Part of the stream was
handled by the KTLS side some other bytes may have been handled
by the BPF side. Generally this results in either missing data
or more likely a "Bad Message" complaint from the kTLS receive
handler as the BPF program steals some bytes meant to be in a
TLS header and/or the TLS header length is no longer correct.

We've already broke the idealized model where we can stack ULPs
in any order with generic callbacks on the TX side to handle this.
So in this patch we do the same thing but for RX side. We add
a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
program is running and add a tls_sw_has_ctx_rx() helper so BPF
side can learn there is a TLS ULP on the socket.

Then on BPF side we omit calling our stream parser to avoid
breaking the data stream for the KTLS receiver. Then on the
KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
receiver is done with the packet but before it posts the
msg to userspace. This gives us symmetry between the TX and
RX halfs and IMO makes it usable again. On the TX side we
process packets in this order BPF -> TLS -> TCP and on
the receive side in the reverse order TCP -> TLS -> BPF.

Discovered while testing OpenSSL 3.0 Alpha2.0 release.

Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov

John Fastabend
2020-06-02 05:48:32 +0800
ca2f5f21d bpf: Refactor sockmap redirect code so its easy to reuse ... Browse Code »

We will need this block of code called from tls context shortly
lets refactor the redirect logic so its easy to use. This also
cleans up the switch stmt so we have fewer fallthrough cases.

No logic changes are intended.

Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Signed-off-by: Alexei Starovoitov
Reviewed-by: Jakub Sitnicki
Acked-by: Song Liu
Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov

John Fastabend
2020-06-02 05:48:32 +0800
64b59025c xdp: Add xdp_txq_info to xdp_buff ... Browse Code »

Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the
moment only the device is added. Other fields (queue_index)
can be added as use cases arise.

>From a UAPI perspective, add egress_ifindex to xdp context for
bpf programs to see the Tx device.

Update the verifier to only allow accesses to egress_ifindex by
XDP programs with BPF_XDP_DEVMAP expected attach type.

Signed-off-by: David Ahern
Signed-off-by: Alexei Starovoitov
Acked-by: Toke Høiland-Jørgensen
Link: https://lore.kernel.org/bpf/20200529220716.75383-4-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov

David Ahern
2020-06-02 05:48:32 +0800
fbee97fee bpf: Add support to attach bpf program to a devmap entry ... Browse Code »

Add BPF_XDP_DEVMAP attach type for use with programs associated with a
DEVMAP entry.

Allow DEVMAPs to associate a program with a device entry by adding
a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program
id, so the fd and id are a union. bpf programs can get access to the
struct via vmlinux.h.

The program associated with the fd must have type XDP with expected
attach type BPF_XDP_DEVMAP. When a program is associated with a device
index, the program is run on an XDP_REDIRECT and before the buffer is
added to the per-cpu queue. At this point rxq data is still valid; the
next patch adds tx device information allowing the prorgam to see both
ingress and egress device indices.

XDP generic is skb based and XDP programs do not work with skb's. Block
the use case by walking maps used by a program that is to be attached
via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with

Block attach of BPF_XDP_DEVMAP programs to devices.

Signed-off-by: David Ahern
Signed-off-by: Alexei Starovoitov
Acked-by: Toke Høiland-Jørgensen
Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov

David Ahern
2020-06-02 05:48:32 +0800
c3c16f2ea bpf: Add rx_queue_mapping to bpf_sock ... Browse Code »

Add "rx_queue_mapping" to bpf_sock. This gives read access for the
existing field (sk_rx_queue_mapping) of struct sock from bpf_sock.
Semantics for the bpf_sock rx_queue_mapping access are similar to
sk_rx_queue_get(), i.e the value NO_QUEUE_MAPPING is not allowed
and -1 is returned in that case. This is useful for transmit queue
selection based on the received queue index which is cached in the
socket in the receive path.

v3: Addressed review comments to add usecase in patch description,
and fixed default value for rx_queue_mapping.
v2: fixed build error for CONFIG_XPS wrapping, reported by
kbuild test robot

Signed-off-by: Amritha Nambiar
Signed-off-by: Alexei Starovoitov

Amritha Nambiar
2020-06-02 05:38:23 +0800
13d70f5a5 bpf, sk_msg: Add get socket storage helpers ... Browse Code »

Add helpers to use local socket storage.

Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann
Acked-by: Yonghong Song
Link: https://lore.kernel.org/bpf/159033907577.12355.14740125020572756560.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov

John Fastabend
2020-06-02 05:38:20 +0800
abe3cac87 bpf, sk_msg: Add some generic helpers that may be useful from sk_msg ... Browse Code »

Add these generic helpers that may be useful to use from sk_msg programs.
The helpers do not depend on ctx so we can simply add them here,

BPF_FUNC_perf_event_output
BPF_FUNC_get_current_uid_gid
BPF_FUNC_get_current_pid_tgid
BPF_FUNC_get_current_cgroup_id
BPF_FUNC_get_current_ancestor_cgroup_id
BPF_FUNC_get_cgroup_classid

Signed-off-by: John Fastabend
Signed-off-by: Daniel Borkmann
Acked-by: Yonghong Song
Link: https://lore.kernel.org/bpf/159033903373.12355.15489763099696629346.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov

John Fastabend
2020-06-02 05:38:20 +0800
81e8c10da Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 ... Browse Code »

Pull crypto updates from Herbert Xu:
"API:
- Introduce crypto_shash_tfm_digest() and use it wherever possible.
- Fix use-after-free and race in crypto_spawn_alg.
- Add support for parallel and batch requests to crypto_engine.

Algorithms:
- Update jitter RNG for SP800-90B compliance.
- Always use jitter RNG as seed in drbg.

Drivers:
- Add Arm CryptoCell driver cctrng.
- Add support for SEV-ES to the PSP driver in ccp"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
crypto: hisilicon - fix driver compatibility issue with different versions of devices
crypto: engine - do not requeue in case of fatal error
crypto: cavium/nitrox - Fix a typo in a comment
crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
crypto: hisilicon/qm - add debugfs to the QM state machine
crypto: hisilicon/qm - add debugfs for QM
crypto: stm32/crc32 - protect from concurrent accesses
crypto: stm32/crc32 - don't sleep in runtime pm
crypto: stm32/crc32 - fix multi-instance
crypto: stm32/crc32 - fix run-time self test issue.
crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
crypto: hisilicon/zip - Use temporary sqe when doing work
crypto: hisilicon - add device error report through abnormal irq
crypto: hisilicon - remove codes of directly report device errors through MSI
crypto: hisilicon - QM memory management optimization
crypto: hisilicon - unify initial value assignment into QM
...

Linus Torvalds
2020-06-02 03:00:10 +0800
5eb18a2b6 devlink: Add ACL control packet traps ... Browse Code »

Add packet traps for packets that are sampled / trapped by ACLs, so that
capable drivers could register them with devlink. Add documentation for
every added packet trap and packet trap group.

Signed-off-by: Ido Schimmel
Reviewed-by: Jiri Pirko
Signed-off-by: David S. Miller

Ido Schimmel
2020-06-02 02:49:23 +0800
d77cfd162 devlink: Add layer 3 control packet traps ... Browse Code »

Add layer 3 control packet traps such as ARP and DHCP, so that capable
device drivers could register them with devlink. Add documentation for
every added packet trap and packet trap group.

Signed-off-by: Ido Schimmel
Reviewed-by: Jiri Pirko
Signed-off-by: David S. Miller

Ido Schimmel
2020-06-02 02:49:23 +0800
515eac677 devlink: Add layer 2 control packet traps ... Browse Code »

Add layer 2 control packet traps such as STP and IGMP query, so that
capable device drivers could register them with devlink. Add
documentation for every added packet trap and packet trap group.

Signed-off-by: Ido Schimmel
Reviewed-by: Jiri Pirko
Signed-off-by: David S. Miller

Ido Schimmel
2020-06-02 02:49:23 +0800
30a4e9a29 devlink: Add 'control' trap type ... Browse Code »

This type is used for traps that trap control packets such as ARP
request and IGMP query to the CPU.

Do not report such packets to the kernel's drop monitor as they were not
dropped by the device no encountered an exception during forwarding.

Signed-off-by: Ido Schimmel
Reviewed-by: Jiri Pirko
Signed-off-by: David S. Miller

Ido Schimmel
2020-06-02 02:49:23 +0800
9eefeabed devlink: Add 'mirror' trap action ... Browse Code »

The action is used by control traps such as IGMP query. The packet is
flooded by the device, but also trapped to the CPU in order for the
software bridge to mark the receiving port as a multicast router port.
Such packets are marked with 'skb->offload_fwd_mark = 1' in order to
prevent the software bridge from flooding them again.

Signed-off-by: Ido Schimmel
Reviewed-by: Jiri Pirko
Signed-off-by: David S. Miller

Ido Schimmel
2020-06-02 02:49:23 +0800
678eb199c devlink: Create dedicated trap group for layer 3 exceptions ... Browse Code »

Packets that hit exceptions during layer 3 forwarding must be trapped to
the CPU for the control plane to function properly. Create a dedicated
group for them, so that user space could choose to assign a different
policer for them.

Signed-off-by: Ido Schimmel
Reviewed-by: Jiri Pirko
Signed-off-by: David S. Miller

Ido Schimmel
2020-06-02 02:49:23 +0800
709ffbe19 net: remove indirect block netdev event registration ... Browse Code »

Drivers do not register to netdev events to set up indirect blocks
anymore. Remove __flow_indr_block_cb_register() and
__flow_indr_block_cb_unregister().

The frontends set up the callbacks through flow_indr_dev_setup_block()

Signed-off-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller

Pablo Neira Ayuso
2020-06-02 02:41:50 +0800
1fac52da5 net: flow_offload: consolidate indirect flow_block infrastructure ... Browse Code »

Tunnel devices provide no dev->netdev_ops->ndo_setup_tc(...) interface.
The tunnel device and route control plane does not provide an obvious
way to relate tunnel and physical devices.

This patch allows drivers to register a tunnel device offload handler
for the tc and netfilter frontends through flow_indr_dev_register() and
flow_indr_dev_unregister().

The frontend calls flow_indr_dev_setup_offload() that iterates over the
list of drivers that are offering tunnel device hardware offload
support and it sets up the flow block for this tunnel device.

If the driver module is removed, the indirect flow_block ends up with a
stale callback reference. The module removal path triggers the
dev_shutdown() path to remove the qdisc and the flow_blocks for the
physical devices. However, this is not useful for tunnel devices, where
relation between the physical and the tunnel device is not explicit.

This patch introduces a cleanup callback that is invoked when the driver
module is removed to clean up the tunnel device flow_block. This patch
defines struct flow_block_indr and it uses it from flow_block_cb to
store the information that front-end requires to perform the
flow_block_cb cleanup on module removal.

Signed-off-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller

Pablo Neira Ayuso
2020-06-02 02:41:12 +0800

01 Jun, 2020

1 commit

1806c13dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.

The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.

Signed-off-by: David S. Miller

David S. Miller
2020-06-01 08:48:46 +0800

30 May, 2020

1 commit

96d10d5b1 neigh: fix ARP retransmit timer guard ... Browse Code »

In commit 19e16d220f0a ("neigh: support smaller retrans_time settting")
we add more accurate control for ARP and NS. But for ARP I forgot to
update the latest guard in neigh_timer_handler(), then the next
retransmit would be reset to jiffies + HZ/2 if we set the retrans_time
less than 500ms. Fix it by setting the time_before() check to HZ/100.

IPv6 does not have this issue.

Reported-by: Jianwen Ji
Fixes: 19e16d220f0a ("neigh: support smaller retrans_time settting")
Signed-off-by: Hangbin Liu
Signed-off-by: David S. Miller

Hangbin Liu
2020-05-30 07:56:53 +0800