Eric Lee / smarc-fsl-linux-kernel

25 Jan, 2021

1 commit

2259bc93f Merge remote-tracking branch 'lts/lf-5.10.y' into lf-5.10.y_android ... Browse Code »

Change-Id: I173bc0325c93daaf82bb894f77ba35fa827864de

zhang sanshan
2021-01-25 16:19:35 +0800

17 Jan, 2021

1 commit

69363e37d net: ipv6: fib: flush exceptions when purging route ... Browse Code »

[ Upstream commit d8f5c29653c3f6995e8979be5623d263e92f6b86 ]

Route removal is handled by two code paths. The main removal path is via
fib6_del_route() which will handle purging any PMTU exceptions from the
cache, removing all per-cpu copies of the DST entry used by the route, and
releasing the fib6_info struct.

The second removal location is during fib6_add_rt2node() during a route
replacement operation. This path also calls fib6_purge_rt() to handle
cleaning up the per-cpu copies of the DST entries and releasing the
fib6_info associated with the older route, but it does not flush any PMTU
exceptions that the older route had. Since the older route is removed from
the tree during the replacement, we lose any way of accessing it again.

As these lingering DSTs and the fib6_info struct are holding references to
the underlying netdevice struct as well, unregistering that device from the
kernel can never complete.

Fixes: 2b760fcf5cfb3 ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Sean Tranchetti
Reviewed-by: David Ahern
Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
Signed-off-by: Jakub Kicinski
Signed-off-by: Greg Kroah-Hartman

Sean Tranchetti
2021-01-17 21:16:56 +0800

13 Jan, 2021

1 commit

27bc60d96 netfilter: x_tables: Update remaining dereference to RCU ... Browse Code »

commit 443d6e86f821a165fae3fc3fc13086d27ac140b1 upstream.

This fixes the dereference to fetch the RCU pointer when holding
the appropriate xtables lock.

Reported-by: kernel test robot
Fixes: cc00bcaa5899 ("netfilter: x_tables: Switch synchronization to RCU")
Signed-off-by: Subash Abhinov Kasiviswanathan
Reviewed-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: Greg Kroah-Hartman

Subash Abhinov Kasiviswanathan
2021-01-13 03:18:25 +0800

11 Dec, 2020

1 commit

a07068169 Merge 33dc9614dc20 ("Merge tag 'ktest-v5.10-rc6' of git://git.kernel.org/pub/scm… ... Browse Code »

…/linux/kernel/git/rostedt/linux-ktest") into android-mainline

Steps on the way to 5.10-rc8/final

Resolves conflicts with:
net/xfrm/xfrm_state.c

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I06495103fc0f2531b241b3577820f4461ba83dd5

Greg Kroah-Hartman
2020-12-11 17:26:31 +0800

10 Dec, 2020

2 commits

b7e4ba9a9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Switch to RCU in x_tables to fix possible NULL pointer dereference,
from Subash Abhinov Kasiviswanathan.

2) Fix netlink dump of dynset timeouts later than 23 days.

3) Add comment for the indirect serialization of the nft commit mutex
with rtnl_mutex.

4) Remove bogus check for confirmed conntrack when matching on the
conntrack ID, from Brett Mastbergen.
====================

Signed-off-by: David S. Miller

David S. Miller
2020-12-10 10:55:46 +0800
8ef44b6fe tcp: Retain ECT bits for tos reflection ... Browse Code »

For DCTCP, we have to retain the ECT bits set by the congestion control
algorithm on the socket when reflecting syn TOS in syn-ack, in order to
make ECN work properly.

Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket")
Reported-by: Alexander Duyck
Signed-off-by: Wei Wang
Reviewed-by: Eric Dumazet
Signed-off-by: David S. Miller

Wei Wang
2020-12-10 08:08:23 +0800

09 Dec, 2020

1 commit

ca0c76873 Merge 5.10-rc7 into android-mainline ... Browse Code »

Linux 5.10-rc7

Signed-off-by: Greg Kroah-Hartman
Change-Id: Ie61b3510311a825ee57bee12610e25bc1500b350

Greg Kroah-Hartman
2020-12-09 15:09:26 +0800

08 Dec, 2020

1 commit

cc00bcaa5 netfilter: x_tables: Switch synchronization to RCU ... Browse Code »

When running concurrent iptables rules replacement with data, the per CPU
sequence count is checked after the assignment of the new information.
The sequence count is used to synchronize with the packet path without the
use of any explicit locking. If there are any packets in the packet path using
the table information, the sequence count is incremented to an odd value and
is incremented to an even after the packet process completion.

The new table value assignment is followed by a write memory barrier so every
CPU should see the latest value. If the packet path has started with the old
table information, the sequence counter will be odd and the iptables
replacement will wait till the sequence count is even prior to freeing the
old table info.

However, this assumes that the new table information assignment and the memory
barrier is actually executed prior to the counter check in the replacement
thread. If CPU decides to execute the assignment later as there is no user of
the table information prior to the sequence check, the packet path in another
CPU may use the old table information. The replacement thread would then free
the table information under it leading to a use after free in the packet
processing context-

Unable to handle kernel NULL pointer dereference at virtual
address 000000000000008e
pc : ip6t_do_table+0x5d0/0x89c
lr : ip6t_do_table+0x5b8/0x89c
ip6t_do_table+0x5d0/0x89c
ip6table_filter_hook+0x24/0x30
nf_hook_slow+0x84/0x120
ip6_input+0x74/0xe0
ip6_rcv_finish+0x7c/0x128
ipv6_rcv+0xac/0xe4
__netif_receive_skb+0x84/0x17c
process_backlog+0x15c/0x1b8
napi_poll+0x88/0x284
net_rx_action+0xbc/0x23c
__do_softirq+0x20c/0x48c

This could be fixed by forcing instruction order after the new table
information assignment or by switching to RCU for the synchronization.

Fixes: 80055dab5de0 ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
Reported-by: Sean Tranchetti
Reported-by: kernel test robot
Suggested-by: Florian Westphal
Signed-off-by: Subash Abhinov Kasiviswanathan
Signed-off-by: Pablo Neira Ayuso

Subash Abhinov Kasiviswanathan
2020-12-08 19:57:39 +0800

03 Dec, 2020

1 commit

832ba5964 net: ip6_gre: set dev->hard_header_len when using header_ops ... Browse Code »

syzkaller managed to crash the kernel using an NBMA ip6gre interface. I
could reproduce it creating an NBMA ip6gre interface and forwarding
traffic to it:

skbuff: skb_under_panic: text:ffffffff8250e927 len:148 put:44 head:ffff8c03c7a33
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:109!
Call Trace:
skb_push+0x10/0x10
ip6gre_header+0x47/0x1b0
neigh_connected_output+0xae/0xf0

ip6gre tunnel provides its own header_ops->create, and sets it
conditionally when initializing the tunnel in NBMA mode. When
header_ops->create is used, dev->hard_header_len should reflect the
length of the header created. Otherwise, when not used,
dev->needed_headroom should be used.

Fixes: eb95f52fc72d ("net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap")
Cc: Maria Pasechnik
Signed-off-by: Antoine Tenart
Link: https://lore.kernel.org/r/20201130161911.464106-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski

Antoine Tenart
2020-12-03 03:16:12 +0800

28 Nov, 2020

1 commit

b19651bfc Merge c84e1efae022 ("Merge tag 'asm-generic-fixes-5.10-2' of git://git.kernel.or… ... Browse Code »

…g/pub/scm/linux/kernel/git/arnd/asm-generic") into android-mainline

Steps on the way to 5.10-rc5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I644783003a83186a34cdbb753aa492f4350f49ee

Greg Kroah-Hartman
2020-11-28 15:23:09 +0800

26 Nov, 2020

1 commit

e255e11e6 ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_init ... Browse Code »

kmemleak report a memory leak as follows:

unreferenced object 0xffff8880059c6a00 (size 64):
comm "ip", pid 23696, jiffies 4296590183 (age 1755.384s)
hex dump (first 32 bytes):
20 01 00 10 00 00 00 00 00 00 00 00 00 00 00 00 ...............
1c 00 00 00 00 00 00 00 00 00 00 00 07 00 00 00 ................
backtrace:
[] ip6addrlbl_add+0x90/0xbb0
[] ip6addrlbl_net_init+0x109/0x170
[] ops_init+0xa8/0x3c0
[] setup_net+0x2de/0x7e0
[] copy_net_ns+0x27d/0x530
[] create_new_namespaces+0x382/0xa30
[] unshare_nsproxy_namespaces+0xa1/0x1d0
[] ksys_unshare+0x3a4/0x780
[] __x64_sys_unshare+0x2d/0x40
[] do_syscall_64+0x33/0x40
[] entry_SYSCALL_64_after_hwframe+0x44/0xa9

We should free all rules when we catch an error in ip6addrlbl_net_init().
otherwise a memory leak will occur.

Fixes: 2a8cc6c89039 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
Reported-by: Hulk Robot
Signed-off-by: Wang Hai
Link: https://lore.kernel.org/r/20201124071728.8385-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski

Wang Hai
2020-11-26 03:20:16 +0800

25 Nov, 2020

1 commit

407c85c7d tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN ... Browse Code »

When a BPF program is used to select between a type of TCP congestion
control algorithm that uses either ECN or not there is a case where the
synack for the frame was coming up without the ECT0 bit set. A bit of
research found that this was due to the final socket being configured to
dctcp while the listener socket was staying in cubic.

To reproduce it all that is needed is to monitor TCP traffic while running
the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
assuming tcp_dctcp module is loaded or compiled in and the traffic matches
the rules in the sample file, is that for all frames with the exception of
the synack the ECT0 bit is set.

To address that it is necessary to make one additional call to
tcp_bpf_ca_needs_ecn using the request socket and then use the output of
that to set the ECT0 bit for the tos/tclass of the packet.

Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck
Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski

Alexander Duyck
2020-11-25 06:12:55 +0800

24 Nov, 2020

1 commit

01770a166 tcp: fix race condition when creating child sockets from syncookies ... Browse Code »

When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.

Signed-off-by: Ricardo Dias
Signed-off-by: Eric Dumazet
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski

Ricardo Dias
2020-11-24 08:32:33 +0800

21 Nov, 2020

1 commit

861602b57 tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header ... Browse Code »

An issue was recently found where DCTCP SYN/ACK packets did not have the
ECT bit set in the L3 header. A bit of code review found that the recent
change referenced below had gone though and added a mask that prevented the
ECN bits from being populated in the L3 header.

This patch addresses that by rolling back the mask so that it is only
applied to the flags coming from the incoming TCP request instead of
applying it to the socket tos/tclass field. Doing this the ECT bits were
restored in the SYN/ACK packets in my testing.

One thing that is not addressed by this patch set is the fact that
tcp_reflect_tos appears to be incompatible with ECN based congestion
avoidance algorithms. At a minimum the feature should likely be documented
which it currently isn't.

Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket")
Signed-off-by: Alexander Duyck
Acked-by: Wei Wang
Signed-off-by: Jakub Kicinski

Alexander Duyck
2020-11-21 10:09:47 +0800

20 Nov, 2020

2 commits

d53cfb36d Merge 4d02da974ea8 ("Merge tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/li… ... Browse Code »

…nux/kernel/git/netdev/net") into android-mainline

Steps on the way to 5.10-rc5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I00726ee0d08f08ae6ac5edd07c8fa502b41d4800

Greg Kroah-Hartman
2020-11-20 22:06:42 +0800
2d8f6481c ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module ... Browse Code »

IPV6=m
NF_DEFRAG_IPV6=y

ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function
`nf_ct_frag6_gather':
net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to
`ipv6_frag_thdr_truncated'

Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This
dependency is forcing IPV6=y.

Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This
is the same solution as used with a similar issues: Referring to
commit 70b095c843266 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6
module")

Fixes: 9d9e937b1c8b ("ipv6/netfilter: Discard first fragment not including all headers")
Reported-by: Randy Dunlap
Reported-by: kernel test robot
Signed-off-by: Georg Kohmann
Acked-by: Pablo Neira Ayuso
Acked-by: Randy Dunlap # build-tested
Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.com
Signed-off-by: Jakub Kicinski

Georg Kohmann
2020-11-20 02:49:50 +0800

19 Nov, 2020

1 commit

a5ebcbdf3 ah6: fix error return code in ah6_input() ... Browse Code »

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Hulk Robot
Signed-off-by: Zhang Changzhong
Link: https://lore.kernel.org/r/1605581105-35295-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski

Zhang Changzhong
2020-11-19 02:53:16 +0800

17 Nov, 2020

1 commit

9d9e937b1 ipv6/netfilter: Discard first fragment not including all headers ... Browse Code »

Packets are processed even though the first fragment don't include all
headers through the upper layer header. This breaks TAHI IPv6 Core
Conformance Test v6LC.1.3.6.

Referring to RFC8200 SECTION 4.5: "If the first fragment does not include
all headers through an Upper-Layer header, then that fragment should be
discarded and an ICMP Parameter Problem, Code 3, message should be sent to
the source of the fragment, with the Pointer field set to zero."

The fragment needs to be validated the same way it is done in
commit 2efdaaaf883a ("IPv6: reply ICMP error if the first fragment don't
include all headers") for ipv6. Wrap the validation into a common function,
ipv6_frag_thdr_truncated() to check for truncation in the upper layer
header. This validation does not fullfill all aspects of RFC 8200,
section 4.5, but is at the moment sufficient to pass mentioned TAHI test.

In netfilter, utilize the fragment offset returned by find_prev_fhdr() to
let ipv6_frag_thdr_truncated() start it's traverse from the fragment
header.

Return 0 to drop the fragment in the netfilter. This is the same behaviour
as used on other protocol errors in this function, e.g. when
nf_ct_frag6_queue() returns -EPROTO. The Fragment will later be picked up
by ipv6_frag_rcv() in reassembly.c. ipv6_frag_rcv() will then send an
appropriate ICMP Parameter Problem message back to the source.

References commit 2efdaaaf883a ("IPv6: reply ICMP error if the first
fragment don't include all headers")

Signed-off-by: Georg Kohmann
Acked-by: Pablo Neira Ayuso
Link: https://lore.kernel.org/r/20201111115025.28879-1-geokohma@cisco.com
Signed-off-by: Jakub Kicinski

Georg Kohmann
2020-11-17 02:15:11 +0800

14 Nov, 2020

2 commits

ceb736e1d ipv6: Fix error path to cancel the meseage ... Browse Code »

genlmsg_cancel() needs to be called in the error path of
inet6_fill_ifmcaddr and inet6_fill_ifacaddr to cancel
the message.

Fixes: 6ecf4c37eb3e ("ipv6: enable IFA_TARGET_NETNSID for RTM_GETADDR")
Reported-by: Hulk Robot
Signed-off-by: Zhang Qilong
Link: https://lore.kernel.org/r/20201112080950.1476302-1-zhangqilong3@huawei.com
Signed-off-by: Jakub Kicinski

Zhang Qilong
2020-11-14 10:20:00 +0800
8cf8821e1 net: Exempt multicast addresses from five-second neighbor lifetime ... Browse Code »

Commit 58956317c8de ("neighbor: Improve garbage collection")
guarantees neighbour table entries a five-second lifetime. Processes
which make heavy use of multicast can fill the neighour table with
multicast addresses in five seconds. At that point, neighbour entries
can't be GC-ed because they aren't five seconds old yet, the kernel
log starts to fill up with "neighbor table overflow!" messages, and
sends start to fail.

This patch allows multicast addresses to be thrown out before they've
lived out their five seconds. This makes room for non-multicast
addresses and makes messages to all addresses more reliable in these
circumstances.

Fixes: 58956317c8de ("neighbor: Improve garbage collection")
Signed-off-by: Jeff Dike
Reviewed-by: David Ahern
Link: https://lore.kernel.org/r/20201113015815.31397-1-jdike@akamai.com
Signed-off-by: Jakub Kicinski

Jeff Dike
2020-11-14 06:24:39 +0800

13 Nov, 2020

2 commits

3428a2f78 Merge 585e5b17b92d ("Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/s… ... Browse Code »

…cm/fs/fscrypt/fscrypt") into android-mainline

Steps on the way to 5.10-rc4

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8554ba37704bee02192ff6117d4909fde568fca2

Greg Kroah-Hartman
2020-11-13 15:26:07 +0800
55e729889 net: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO ... Browse Code »

udp{4,6}_lib_lookup_skb() use ip{,v6}_hdr() to get IP header of the
packet. While it's probably OK for non-frag0 paths, this helpers
will also point to junk on Fast/frag0 GRO when all headers are
located in frags. As a result, sk/skb lookup may fail or give wrong
results. To support both GRO modes, skb_gro_network_header() might
be used. To not modify original functions, add private versions of
udp{4,6}_lib_lookup_skb() only to perform correct sk lookups on GRO.

Present since the introduction of "application-level" UDP GRO
in 4.7-rc1.

Misc: replace totally unneeded ternaries with plain ifs.

Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket")
Suggested-by: Willem de Bruijn
Cc: Eric Dumazet
Signed-off-by: Alexander Lobakin
Acked-by: Willem de Bruijn
Signed-off-by: Jakub Kicinski

Alexander Lobakin
2020-11-13 01:55:51 +0800

11 Nov, 2020

1 commit

909172a14 net: Update window_clamp if SOCK_RCVBUF is set ... Browse Code »

When net.ipv4.tcp_syncookies=1 and syn flood is happened,
cookie_v4_check or cookie_v6_check tries to redo what
tcp_v4_send_synack or tcp_v6_send_synack did,
rsk_window_clamp will be changed if SOCK_RCVBUF is set,
which will make rcv_wscale is different, the client
still operates with initial window scale and can overshot
granted window, the client use the initial scale but local
server use new scale to advertise window value, and session
work abnormally.

Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
Signed-off-by: Mao Wenan
Signed-off-by: Eric Dumazet
Link: https://lore.kernel.org/r/1604967391-123737-1-git-send-email-wenan.mao@linux.alibaba.com
Signed-off-by: Jakub Kicinski

Mao Wenan
2020-11-11 09:42:35 +0800

10 Nov, 2020

1 commit

8ef9ba4d6 IPv6: Set SIT tunnel hard_header_len to zero ... Browse Code »

Due to the legacy usage of hard_header_len for SIT tunnels while
already using infrastructure from net/ipv4/ip_tunnel.c the
calculation of the path MTU in tnl_update_pmtu is incorrect.
This leads to unnecessary creation of MTU exceptions for any
flow going over a SIT tunnel.

As SIT tunnels do not have a header themsevles other than their
transport (L3, L2) headers we're leaving hard_header_len set to zero
as tnl_update_pmtu is already taking care of the transport headers
sizes.

This will also help avoiding unnecessary IPv6 GC runs and spinlock
contention seen when using SIT tunnels and for more than
net.ipv6.route.gc_thresh flows.

Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Oliver Herms
Acked-by: Willem de Bruijn
Link: https://lore.kernel.org/r/20201103104133.GA1573211@tws
Signed-off-by: Jakub Kicinski

Oliver Herms
2020-11-10 07:07:40 +0800

09 Nov, 2020

1 commit

2cfc344f8 Merge 5.10-rc3 into android-mainline ... Browse Code »

Linux 5.10-rc3

Signed-off-by: Greg Kroah-Hartman
Change-Id: I7884051ea7b86204b2685b51462368e122ad0772

Greg Kroah-Hartman
2020-11-09 19:49:27 +0800

05 Nov, 2020

1 commit

2da4c187a Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec ... Browse Code »

Steffen Klassert says:

====================
1) Fix packet receiving of standard IP tunnels when the xfrm_interface
module is installed. From Xin Long.

2) Fix a race condition between spi allocating and hash list
resizing. From zhuoliang zhang.
====================

Signed-off-by: Jakub Kicinski

Jakub Kicinski
2020-11-05 00:12:52 +0800

01 Nov, 2020

2 commits

859191b23 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Incorrect netlink report logic in flowtable and genID.

2) Add a selftest to check that wireguard passes the right sk
to ip_route_me_harder, from Jason A. Donenfeld.

3) Pass the actual sk to ip_route_me_harder(), also from Jason.

4) Missing expression validation of updates via nft --check.

5) Update byte and packet counters regardless of whether they
match, from Stefano Brivio.
====================

Signed-off-by: Jakub Kicinski

Jakub Kicinski
2020-11-01 08:34:19 +0800
2efdaaaf8 IPv6: reply ICMP error if the first fragment don't include all headers ... Browse Code »

Based on RFC 8200, Section 4.5 Fragment Header:

- If the first fragment does not include all headers through an
Upper-Layer header, then that fragment should be discarded and
an ICMP Parameter Problem, Code 3, message should be sent to
the source of the fragment, with the Pointer field set to zero.

Checking each packet header in IPv6 fast path will have performance impact,
so I put the checking in ipv6_frag_rcv().

As the packet may be any kind of L4 protocol, I only checked some common
protocols' header length and handle others by (offset + 1) > skb->len.
Also use !(frag_off & htons(IP6_OFFSET)) to catch atomic fragments
(fragmented packet with only one fragment).

When send ICMP error message, if the 1st truncated fragment is ICMP message,
icmp6_send() will break as is_ineligible() return true. So I added a check
in is_ineligible() to let fragment packet with nexthdr ICMP but no ICMP header
return false.

Signed-off-by: Hangbin Liu
Signed-off-by: Jakub Kicinski

Hangbin Liu
2020-11-01 04:16:02 +0800

30 Oct, 2020

2 commits

9e7c5b396 ip6_tunnel: set inner ipproto before ip6_tnl_encap ... Browse Code »

ip6_tnl_encap assigns to proto transport protocol which
encapsulates inner packet, but we must pass to set_inner_ipproto
protocol of that inner packet.

Calling set_inner_ipproto after ip6_tnl_encap might break gso.
For example, in case of encapsulating ipv6 packet in fou6 packet, inner_ipproto
would be set to IPPROTO_UDP instead of IPPROTO_IPV6. This would lead to
incorrect calling sequence of gso functions:
ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> udp6_ufo_fragment
instead of:
ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> ip6ip6_gso_segment

Fixes: 6c11fbf97e69 ("ip6_tunnel: add MPLS transmit support")
Signed-off-by: Alexander Ovechkin
Link: https://lore.kernel.org/r/20201029171012.20904-1-ovov@yandex-team.ru
Signed-off-by: Jakub Kicinski

Alexander Ovechkin
2020-10-30 23:07:30 +0800
46d6c5ae9 netfilter: use actual socket sk rather than skb sk when routing harder ... Browse Code »

If netfilter changes the packet mark when mangling, the packet is
rerouted using the route_me_harder set of functions. Prior to this
commit, there's one big difference between route_me_harder and the
ordinary initial routing functions, described in the comment above
__ip_queue_xmit():

/* Note: skb->sk can be different from sk, in case of tunnels */
int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,

That function goes on to correctly make use of sk->sk_bound_dev_if,
rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a
tunnel will receive a packet in ndo_start_xmit with an initial skb->sk.
It will make some transformations to that packet, and then it will send
the encapsulated packet out of a *new* socket. That new socket will
basically always have a different sk_bound_dev_if (otherwise there'd be
a routing loop). So for the purposes of routing the encapsulated packet,
the routing information as it pertains to the socket should come from
that socket's sk, rather than the packet's original skb->sk. For that
reason __ip_queue_xmit() and related functions all do the right thing.

One might argue that all tunnels should just call skb_orphan(skb) before
transmitting the encapsulated packet into the new socket. But tunnels do
*not* do this -- and this is wisely avoided in skb_scrub_packet() too --
because features like TSQ rely on skb->destructor() being called when
that buffer space is truely available again. Calling skb_orphan(skb) too
early would result in buffers filling up unnecessarily and accounting
info being all wrong. Instead, additional routing must take into account
the new sk, just as __ip_queue_xmit() notes.

So, this commit addresses the problem by fishing the correct sk out of
state->sk -- it's already set properly in the call to nf_hook() in
__ip_local_out(), which receives the sk as part of its normal
functionality. So we make sure to plumb state->sk through the various
route_me_harder functions, and then make correct use of it following the
example of __ip_queue_xmit().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason A. Donenfeld
Reviewed-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Jason A. Donenfeld
2020-10-30 19:57:39 +0800

27 Oct, 2020

1 commit

79c83f152 Merge 1f70935f637d ("Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/li… ... Browse Code »

…nux/kernel/git/soc/soc") into android-mainline

Steps on the way to 5.10-rc1

Resolves conflicts in:
Documentation/admin-guide/sysctl/vm.rst

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic58f28718f28dae42948c935dfb0c62122fe86fc

Greg Kroah-Hartman
2020-10-27 18:47:16 +0800

26 Oct, 2020

1 commit

5a8acc99f Merge 9ff9b0d392ea ("Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/l… ... Browse Code »

…inux/kernel/git/netdev/net-next") into android-mainline

Steps on the way to 5.10-rc1

Resolves merge issues in:
drivers/net/virtio_net.c
net/xfrm/xfrm_state.c
net/xfrm/xfrm_user.c

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3132e7802f25cb775eb02d0b3a03068da39a6fe2

Greg Kroah-Hartman
2020-10-26 16:20:21 +0800

21 Oct, 2020

1 commit

0b5642206 Merge c90578360c92 ("Merge branch 'work.csum_and_copy' of git://git.kernel.org/p… ... Browse Code »

…ub/scm/linux/kernel/git/viro/vfs") into android-mainline

Steps on the way to 5.10-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I64f970f0cd1bfe4edf42b4170a54bef3ca613c02

Greg Kroah-Hartman
2020-10-21 21:44:50 +0800

20 Oct, 2020

1 commit

68f9f9c2c netfilter: Drop fragmented ndisc packets assembled in netfilter ... Browse Code »

Fragmented ndisc packets assembled in netfilter not dropped as specified
in RFC 6980, section 5. This behaviour breaks TAHI IPv6 Core Conformance
Tests v6LC.2.1.22/23, V6LC.2.2.26/27 and V6LC.2.3.18.

Setting IP6SKB_FRAGMENTED flag during reassembly.

References: commit b800c3b966bc ("ipv6: drop fragmented ndisc packets by default (RFC 6980)")
Signed-off-by: Georg Kohmann
Signed-off-by: Pablo Neira Ayuso

Georg Kohmann
2020-10-20 19:54:53 +0800

16 Oct, 2020

3 commits

9ff9b0d39 Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next ... Browse Code »

Pull networking updates from Jakub Kicinski:

- Add redirect_neigh() BPF packet redirect helper, allowing to limit
stack traversal in common container configs and improving TCP
back-pressure.

Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

- Expand netlink policy support and improve policy export to user
space. (Ge)netlink core performs request validation according to
declared policies. Expand the expressiveness of those policies
(min/max length and bitmasks). Allow dumping policies for particular
commands. This is used for feature discovery by user space (instead
of kernel version parsing or trial and error).

- Support IGMPv3/MLDv2 multicast listener discovery protocols in
bridge.

- Allow more than 255 IPv4 multicast interfaces.

- Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
packets of TCPv6.

- In Multi-patch TCP (MPTCP) support concurrent transmission of data on
multiple subflows in a load balancing scenario. Enhance advertising
addresses via the RM_ADDR/ADD_ADDR options.

- Support SMC-Dv2 version of SMC, which enables multi-subnet
deployments.

- Allow more calls to same peer in RxRPC.

- Support two new Controller Area Network (CAN) protocols - CAN-FD and
ISO 15765-2:2016.

- Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
kernel problem.

- Add TC actions for implementing MPLS L2 VPNs.

- Improve nexthop code - e.g. handle various corner cases when nexthop
objects are removed from groups better, skip unnecessary
notifications and make it easier to offload nexthops into HW by
converting to a blocking notifier.

- Support adding and consuming TCP header options by BPF programs,
opening the doors for easy experimental and deployment-specific TCP
option use.

- Reorganize TCP congestion control (CC) initialization to simplify
life of TCP CC implemented in BPF.

- Add support for shipping BPF programs with the kernel and loading
them early on boot via the User Mode Driver mechanism, hence reusing
all the user space infra we have.

- Support sleepable BPF programs, initially targeting LSM and tracing.

- Add bpf_d_path() helper for returning full path for given 'struct
path'.

- Make bpf_tail_call compatible with bpf-to-bpf calls.

- Allow BPF programs to call map_update_elem on sockmaps.

- Add BPF Type Format (BTF) support for type and enum discovery, as
well as support for using BTF within the kernel itself (current use
is for pretty printing structures).

- Support listing and getting information about bpf_links via the bpf
syscall.

- Enhance kernel interfaces around NIC firmware update. Allow
specifying overwrite mask to control if settings etc. are reset
during update; report expected max time operation may take to users;
support firmware activation without machine reboot incl. limits of
how much impact reset may have (e.g. dropping link or not).

- Extend ethtool configuration interface to report IEEE-standard
counters, to limit the need for per-vendor logic in user space.

- Adopt or extend devlink use for debug, monitoring, fw update in many
drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
dpaa2-eth).

- In mlxsw expose critical and emergency SFP module temperature alarms.
Refactor port buffer handling to make the defaults more suitable and
support setting these values explicitly via the DCBNL interface.

- Add XDP support for Intel's igb driver.

- Support offloading TC flower classification and filtering rules to
mscc_ocelot switches.

- Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
fixed interval period pulse generator and one-step timestamping in
dpaa-eth.

- Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
offload.

- Add Lynx PHY/PCS MDIO module, and convert various drivers which have
this HW to use it. Convert mvpp2 to split PCS.

- Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
7-port Mediatek MT7531 IP.

- Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
and wcn3680 support in wcn36xx.

- Improve performance for packets which don't require much offloads on
recent Mellanox NICs by 20% by making multiple packets share a
descriptor entry.

- Move chelsio inline crypto drivers (for TLS and IPsec) from the
crypto subtree to drivers/net. Move MDIO drivers out of the phy
directory.

- Clean up a lot of W=1 warnings, reportedly the actively developed
subsections of networking drivers should now build W=1 warning free.

- Make sure drivers don't use in_interrupt() to dynamically adapt their
code. Convert tasklets to use new tasklet_setup API (sadly this
conversion is not yet complete).

* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
net, sockmap: Don't call bpf_prog_put() on NULL pointer
bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
bpf, sockmap: Add locking annotations to iterator
netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
net: fix pos incrementment in ipv6_route_seq_next
net/smc: fix invalid return code in smcd_new_buf_create()
net/smc: fix valid DMBE buffer sizes
net/smc: fix use-after-free of delayed events
bpfilter: Fix build error with CONFIG_BPFILTER_UMH
cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
bpf: Fix register equivalence tracking.
rxrpc: Fix loss of final ack on shutdown
rxrpc: Fix bundle counting for exclusive connections
netfilter: restore NF_INET_NUMHOOKS
ibmveth: Identify ingress large send packets.
ibmveth: Switch order of ibmveth_helper calls.
cxgb4: handle 4-tuple PEDIT to NAT mode translation
selftests: Add VRF route leaking tests
...

Linus Torvalds
2020-10-16 09:42:13 +0800
2295cddf9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

Minor conflicts in net/mptcp/protocol.h and
tools/testing/selftests/net/Makefile.

In both cases code was added on both sides in the same place
so just keep both.

Signed-off-by: Jakub Kicinski

Jakub Kicinski
2020-10-16 03:43:21 +0800
6617dfd44 net: fix pos incrementment in ipv6_route_seq_next ... Browse Code »

Commit 4fc427e05158 ("ipv6_route_seq_next should increase position index")
tried to fix the issue where seq_file pos is not increased
if a NULL element is returned with seq_ops->next(). See bug
https://bugzilla.kernel.org/show_bug.cgi?id=206283
The commit effectively does:
- increase pos for all seq_ops->start()
- increase pos for all seq_ops->next()

For ipv6_route, increasing pos for all seq_ops->next() is correct.
But increasing pos for seq_ops->start() is not correct
since pos is used to determine how many items to skip during
seq_ops->start():
iter->skip = *pos;
seq_ops->start() just fetches the *current* pos item.
The item can be skipped only after seq_ops->show() which essentially
is the beginning of seq_ops->next().

For example, I have 7 ipv6 route entries,
root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=4096
00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001 eth0
fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001 eth0
00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001 lo
fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001 eth0
ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000004 00000000 00000001 eth0
00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
0+1 records in
0+1 records out
1050 bytes (1.0 kB, 1.0 KiB) copied, 0.00707908 s, 148 kB/s
root@arch-fb-vm1:~/net-next

In the above, I specify buffer size 4096, so all records can be returned
to user space with a single trip to the kernel.

If I use buffer size 128, since each record size is 149, internally
kernel seq_read() will read 149 into its internal buffer and return the data
to user space in two read() syscalls. Then user read() syscall will trigger
next seq_ops->start(). Since the current implementation increased pos even
for seq_ops->start(), it will skip record #2, #4 and #6, assuming the first
record is #1.

root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=128
00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001 eth0
00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001 eth0
00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200 lo
4+1 records in
4+1 records out
600 bytes copied, 0.00127758 s, 470 kB/s

To fix the problem, create a fake pos pointer so seq_ops->start()
won't actually increase seq_file pos. With this fix, the
above `dd` command with `bs=128` will show correct result.

Fixes: 4fc427e05158 ("ipv6_route_seq_next should increase position index")
Cc: Alexei Starovoitov
Suggested-by: Vasily Averin
Reviewed-by: Vasily Averin
Signed-off-by: Yonghong Song
Acked-by: Martin KaFai Lau
Acked-by: Andrii Nakryiko
Signed-off-by: Jakub Kicinski

Yonghong Song
2020-10-16 01:21:28 +0800

15 Oct, 2020

1 commit

272928d1c ipv6/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) ... Browse Code »

As per RFC4443, the destination address field for ICMPv6 error messages
is copied from the source address field of the invoking packet.

In configurations with Virtual Routing and Forwarding tables, looking up
which routing table to use for sending ICMPv6 error messages is
currently done by using the destination net_device.

If the source and destination interfaces are within separate VRFs, or
one in the global routing table and the other in a VRF, looking up the
source address of the invoking packet in the destination interface's
routing table will fail if the destination interface's routing table
contains no route to the invoking packet's source address.

One observable effect of this issue is that traceroute6 does not work in
the following cases:

- Route leaking between global routing table and VRF
- Route leaking between VRFs

Use the source device routing table when sending ICMPv6 error
messages.

[ In the context of ipv4, it has been pointed out that a similar issue
may exist with ICMP errors triggered when forwarding between network
namespaces. It would be worthwhile to investigate whether ipv6 has
similar issues, but is outside of the scope of this investigation. ]

[ Testing shows that similar issues exist with ipv6 unreachable /
fragmentation needed messages. However, investigation of this
additional failure mode is beyond this investigation's scope. ]

Link: https://tools.ietf.org/html/rfc4443
Signed-off-by: Mathieu Desnoyers
Reviewed-by: David Ahern
Signed-off-by: Jakub Kicinski

Mathieu Desnoyers
2020-10-15 08:14:23 +0800

14 Oct, 2020

2 commits

6159e9633 net/ipv6: use semicolons rather than commas to separate statements ... Browse Code »

Replace commas with semicolons. Commas introduce unnecessary
variability in the code structure and are hard to see. What is done
is essentially described by the following Coccinelle semantic patch
(http://coccinelle.lip6.fr/):

//
@@ expression e1,e2; @@
e1
-,
+;
e2
... when any
//

Signed-off-by: Julia Lawall
Acked-by: Paul Moore
Link: https://lore.kernel.org/r/1602412498-32025-5-git-send-email-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski

Julia Lawall
2020-10-14 08:11:52 +0800
0d9826bc1 netfilter: nf_log: missing vlan offload tag and proto ... Browse Code »

Dump vlan tag and proto for the usual vlan offload case if the
NF_LOG_MACDECODE flag is set on. Without this information the logging is
misleading as there is no reference to the VLAN header.

[12716.993704] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0800 SRC=192.168.10.2 DST=172.217.168.163 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=2548 DF PROTO=TCP SPT=55848 DPT=80 WINDOW=501 RES=0x00 ACK FIN URGP=0
[12721.157643] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0806 ARP HTYPE=1 PTYPE=0x0800 OPCODE=2 MACSRC=86:6c:92:ea:d6:73 IPSRC=192.168.10.2 MACDST=0e:3b:eb:86:73:76 IPDST=192.168.10.1

Fixes: 83e96d443b37 ("netfilter: log: split family specific code to nf_log_{ip,ip6,common}.c files")
Signed-off-by: Pablo Neira Ayuso

Pablo Neira Ayuso
2020-10-14 07:25:14 +0800