Eric Lee / smarc-fsl-linux-kernel

18 Feb, 2017

7 commits

7c4c32a29 tcp: fix mark propagation with fwmark_reflect enabled ... Browse Code »

commit bf99b4ded5f8a4767dbb9d180626f06c51f9881f upstream.

Otherwise, RST packets generated by the TCP stack for non-existing
sockets always have mark 0.
The mark from the original packet is assigned to the netns_ipv4/6
socket used to send the response so that it can get copied into the
response skb when the socket sends it.

Fixes: e110861f8609 ("net: add a sysctl to reflect the fwmark on replies")
Cc: Lorenzo Colitti
Signed-off-by: Pau Espin Pedrol
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: Greg Kroah-Hartman

Pau Espin Pedrol
2017-02-18 22:11:43 +0800
16a3fbe52 igmp, mld: Fix memory leak in igmpv3/mld_del_delrec() ... Browse Code »

[ Upstream commit 9c8bb163ae784be4f79ae504e78c862806087c54 ]

In function igmpv3/mld_add_delrec() we allocate pmc and put it in
idev->mc_tomb, so we should free it when we don't need it in del_delrec().
But I removed kfree(pmc) incorrectly in latest two patches. Now fix it.

Fixes: 24803f38a5c0 ("igmp: do not remove igmp souce list info when ...")
Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when ...")
Reported-by: Daniel Borkmann
Signed-off-by: Hangbin Liu
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Hangbin Liu
2017-02-18 22:11:43 +0800
a700cf26a ping: fix a null pointer dereference ... Browse Code »

[ Upstream commit 73d2c6678e6c3af7e7a42b1e78cd0211782ade32 ]

Andrey reported a kernel crash:

general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 2 PID: 3880 Comm: syz-executor1 Not tainted 4.10.0-rc6+ #124
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880060048040 task.stack: ffff880069be8000
RIP: 0010:ping_v4_push_pending_frames net/ipv4/ping.c:647 [inline]
RIP: 0010:ping_v4_sendmsg+0x1acd/0x23f0 net/ipv4/ping.c:837
RSP: 0018:ffff880069bef8b8 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: ffff880069befb90 RCX: 0000000000000000
RDX: 0000000000000018 RSI: ffff880069befa30 RDI: 00000000000000c2
RBP: ffff880069befbb8 R08: 0000000000000008 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000000 R12: ffff880069befab0
R13: ffff88006c624a80 R14: ffff880069befa70 R15: 0000000000000000
FS: 00007f6f7c716700(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004a6f28 CR3: 000000003a134000 CR4: 00000000000006e0
Call Trace:
inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
sock_sendmsg_nosec net/socket.c:635 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:645
SYSC_sendto+0x660/0x810 net/socket.c:1687
SyS_sendto+0x40/0x50 net/socket.c:1655
entry_SYSCALL_64_fastpath+0x1f/0xc2

This is because we miss a check for NULL pointer for skb_peek() when
the queue is empty. Other places already have the same check.

Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Reported-by: Andrey Konovalov
Tested-by: Andrey Konovalov
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

WANG Cong
2017-02-18 22:11:43 +0800
0f895f51a tcp: avoid infinite loop in tcp_splice_read() ... Browse Code »

[ Upstream commit ccf7abb93af09ad0868ae9033d1ca8108bdaec82 ]

Splicing from TCP socket is vulnerable when a packet with URG flag is
received and stored into receive queue.

__tcp_splice_read() returns 0, and sk_wait_data() immediately
returns since there is the problematic skb in queue.

This is a nice way to burn cpu (aka infinite loop) and trigger
soft lockups.

Again, this gem was found by syzkaller tool.

Fixes: 9c55e01c0cc8 ("[TCP]: Splice receive support.")
Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Cc: Willy Tarreau
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2017-02-18 22:11:42 +0800
66cdd4347 netlabel: out of bound access in cipso_v4_validate() ... Browse Code »

[ Upstream commit d71b7896886345c53ef1d84bda2bc758554f5d61 ]

syzkaller found another out of bound access in ip_options_compile(),
or more exactly in cipso_v4_validate()

Fixes: 20e2a8648596 ("cipso: handle CIPSO options correctly when NetLabel is disabled")
Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Cc: Paul Moore
Acked-by: Paul Moore
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2017-02-18 22:11:41 +0800
f5b544466 ipv4: keep skb->dst around in presence of IP options ... Browse Code »

[ Upstream commit 34b2cef20f19c87999fff3da4071e66937db9644 ]

Andrey Konovalov got crashes in __ip_options_echo() when a NULL skb->dst
is accessed.

ipv4_pktinfo_prepare() should not drop the dst if (evil) IP options
are present.

We could refine the test to the presence of ts_needtime or srr,
but IP options are not often used, so let's be conservative.

Thanks to syzkaller team for finding this bug.

Fixes: d826eb14ecef ("ipv4: PKTINFO doesnt need dst reference")
Signed-off-by: Eric Dumazet
Reported-by: Andrey Konovalov
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2017-02-18 22:11:41 +0800
ca876dff1 tcp: fix 0 divide in __tcp_select_window() ... Browse Code »

[ Upstream commit 06425c308b92eaf60767bc71d359f4cbc7a561f8 ]

syszkaller fuzzer was able to trigger a divide by zero, when
TCP window scaling is not enabled.

SO_RCVBUF can be used not only to increase sk_rcvbuf, also
to decrease it below current receive buffers utilization.

If mss is negative or 0, just return a zero TCP window.

Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2017-02-18 22:11:41 +0800

04 Feb, 2017

6 commits

89c258862 net: Specify the owning module for lwtunnel ops ... Browse Code »

[ Upstream commit 88ff7334f25909802140e690c0e16433e485b0a0 ]

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.

Signed-off-by: Robert Shearman
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Robert Shearman
2017-02-04 16:47:11 +0800
0c687a735 tcp: initialize max window for a new fastopen socket ... Browse Code »

[ Upstream commit 0dbd7ff3ac5017a46033a9d0a87a8267d69119d9 ]

Found that if we run LTP netstress test with large MSS (65K),
the first attempt from server to send data comparable to this
MSS on fastopen connection will be delayed by the probe timer.

Here is an example:

< S seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32
> S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
< . ack 1 win 342 length 0

Inside tcp_sendmsg(), tcp_send_mss() returns max MSS in 'mss_now',
as well as in 'size_goal'. This results the segment not queued for
transmition until all the data copied from user buffer. Then, inside
__tcp_push_pending_frames(), it breaks on send window test and
continues with the check probe timer.

Fragmentation occurs in tcp_write_wakeup()...

+0.2 > P. seq 1:43777 ack 1 win 342 length 43776
< . ack 43777, win 1365 length 0
> P. seq 43777:65001 ack 1 win 342 options [...] length 21224
...

This also contradicts with the fact that we should bound to the half
of the window if it is large.

Fix this flaw by correctly initializing max_window. Before that, it
could have large values that affect further calculations of 'size_goal'.

Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Alexey Kodanev
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Alexey Kodanev
2017-02-04 16:47:10 +0800
e9db042dc lwtunnel: fix autoload of lwt modules ... Browse Code »

[ Upstream commit 9ed59592e3e379b2e9557dc1d9e9ec8fcbb33f16]

Trying to add an mpls encap route when the MPLS modules are not loaded
hangs. For example:

CONFIG_MPLS=y
CONFIG_NET_MPLS_GSO=m
CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m

$ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

The ip command hangs:
root 880 826 0 21:25 pts/0 00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

$ cat /proc/880/stack
[] call_usermodehelper_exec+0xd6/0x134
[] __request_module+0x27b/0x30a
[] lwtunnel_build_state+0xe4/0x178
[] fib_create_info+0x47f/0xdd4
[] fib_table_insert+0x90/0x41f
[] inet_rtm_newroute+0x4b/0x52
...

modprobe is trying to load rtnl-lwt-MPLS:

root 881 5 0 21:25 ? 00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS

and it hangs after loading mpls_router:

$ cat /proc/881/stack
[] rtnl_lock+0x12/0x14
[] register_netdevice_notifier+0x16/0x179
[] mpls_init+0x25/0x1000 [mpls_router]
[] do_one_initcall+0x8e/0x13f
[] do_init_module+0x5a/0x1e5
[] load_module+0x13bd/0x17d6
...

The problem is that lwtunnel_build_state is called with rtnl lock
held preventing mpls_init from registering.

Given the potential references held by the time lwtunnel_build_state it
can not drop the rtnl lock to the load module. So, extract the module
loading code from lwtunnel_build_state into a new function to validate
the encap type. The new function is called while converting the user
request into a fib_config which is well before any table, device or
fib entries are examined.

Fixes: 745041e2aaf1 ("lwtunnel: autoload of lwt modules")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-02-04 16:47:10 +0800
3524f6422 tcp: fix tcp_fastopen unaligned access complaints on sparc ... Browse Code »

[ Upstream commit 003c941057eaa868ca6fedd29a274c863167230d ]

Fix up a data alignment issue on sparc by swapping the order
of the cookie byte array field with the length field in
struct tcp_fastopen_cookie, and making it a proper union
to clean up the typecasting.

This addresses log complaints like these:
log_unaligned: 113 callbacks suppressed
Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360

Cc: Eric Dumazet
Signed-off-by: Shannon Nelson
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Shannon Nelson
2017-02-04 16:47:09 +0800
958bb1bdc net: ipv4: fix table id in getroute response ... Browse Code »

[ Upstream commit 8a430ed50bb1b19ca14a46661f3b1b35f2fb5c39 ]

rtm_table is an 8-bit field while table ids are allowed up to u32. Commit
709772e6e065 ("net: Fix routing tables with id > 255 for legacy software")
added the preference to set rtm_table in dumps to RT_TABLE_COMPAT if the
table id is > 255. The table id returned on get route requests should do
the same.

Fixes: c36ba6603a11 ("net: Allow user to get table id from route lookup")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-02-04 16:47:08 +0800
6980c52c4 net: lwtunnel: Handle lwtunnel_fill_encap failure ... Browse Code »

[ Upstream commit ea7a80858f57d8878b1499ea0f1b8a635cc48de7 ]

Handle failure in lwtunnel_fill_encap adding attributes to skb.

Fixes: 571e722676fe ("ipv4: support for fib route lwtunnel encap attributes")
Fixes: 19e42e451506 ("ipv6: support for fib route lwtunnel encap attributes")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-02-04 16:47:08 +0800

15 Jan, 2017

6 commits

7b7a5a85b net: ipv4: Fix multipath selection with vrf ... Browse Code »

[ Upstream commit 7a18c5b9fb31a999afc62b0e60978aa896fc89e9 ]

fib_select_path does not call fib_select_multipath if oif is set in the
flow struct. For VRF use cases oif is always set, so multipath route
selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
check similar to what is done in fib_table_lookup.

Add saddr and proto to the flow struct for the fib lookup done by the
VRF driver to better match hash computation for a flow.

Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-01-15 20:42:55 +0800
efc455f08 ipv4: Do not allow MAIN to be alias for new LOCAL w/ custom rules ... Browse Code »

[ Upstream commit 5350d54f6cd12eaff623e890744c79b700bd3f17 ]

In the case of custom rules being present we need to handle the case of the
LOCAL table being intialized after the new rule has been added. To address
that I am adding a new check so that we can make certain we don't use an
alias of MAIN for LOCAL when allocating a new table.

Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Oliver Brunel
Signed-off-by: Alexander Duyck
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Alexander Duyck
2017-01-15 20:42:54 +0800
fe1e13cfe igmp: Make igmp group member RFC 3376 compliant ... Browse Code »

[ Upstream commit 7ababb782690e03b78657e27bd051e20163af2d6 ]

5.2. Action on Reception of a Query

When a system receives a Query, it does not respond immediately.
Instead, it delays its response by a random amount of time, bounded
by the Max Resp Time value derived from the Max Resp Code in the
received Query message. A system may receive a variety of Queries on
different interfaces and of different kinds (e.g., General Queries,
Group-Specific Queries, and Group-and-Source-Specific Queries), each
of which may require its own delayed response.

Before scheduling a response to a Query, the system must first
consider previously scheduled pending responses and in many cases
schedule a combined response. Therefore, the system must be able to
maintain the following state:

o A timer per interface for scheduling responses to General Queries.

o A per-group and interface timer for scheduling responses to Group-
Specific and Group-and-Source-Specific Queries.

o A per-group and interface list of sources to be reported in the
response to a Group-and-Source-Specific Query.

When a new Query with the Router-Alert option arrives on an
interface, provided the system has state to report, a delay for a
response is randomly selected in the range (0, [Max Resp Time]) where
Max Resp Time is derived from Max Resp Code in the received Query
message. The following rules are then used to determine if a Report
needs to be scheduled and the type of Report to schedule. The rules
are considered in order and only the first matching rule is applied.

1. If there is a pending response to a previous General Query
scheduled sooner than the selected delay, no additional response
needs to be scheduled.

2. If the received Query is a General Query, the interface timer is
used to schedule a response to the General Query after the
selected delay. Any previously pending response to a General
Query is canceled.
--8
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Michal Tesar
2017-01-15 20:42:54 +0800
a8a213f29 net: ipv4: dst for local input routes should use l3mdev if relevant ... Browse Code »

[ Upstream commit f5a0aab84b74de68523599817569c057c7ac1622 ]

IPv4 output routes already use l3mdev device instead of loopback for dst's
if it is applicable. Change local input routes to do the same.

This fixes icmp responses for unreachable UDP ports which are directed
to the wrong table after commit 9d1a6c4ea43e4 because local_input
routes use the loopback device. Moving from ingress device to loopback
loses the L3 domain causing responses based on the dst to get to lost.

Fixes: 9d1a6c4ea43e4 ("net: icmp_route_lookup should use rt dev to
determine L3 domain")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-01-15 20:42:54 +0800
e7422080e net: fix incorrect original ingress device index in PKTINFO ... Browse Code »

[ Upstream commit f0c16ba8933ed217c2688b277410b2a37ba81591 ]

When we send a packet for our own local address on a non-loopback
interface (e.g. eth0), due to the change had been introduced from
commit 0b922b7a829c ("net: original ingress device index in PKTINFO"), the
original ingress device index would be set as the loopback interface.
However, the packet should be considered as if it is being arrived via the
sending interface (eth0), otherwise it would break the expectation of the
userspace application (e.g. the DHCPRELEASE message from dhcp_release
binary would be ignored by the dnsmasq daemon, since it come from lo which
is not the interface dnsmasq bind to)

Fixes: 0b922b7a829c ("net: original ingress device index in PKTINFO")
Acked-by: David Ahern
Signed-off-by: Wei Zhang
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Wei Zhang
2017-01-15 20:42:54 +0800
d36a1cb1e inet: fix IP(V6)_RECVORIGDSTADDR for udp sockets ... Browse Code »

[ Upstream commit 39b2dd765e0711e1efd1d1df089473a8dd93ad48 ]

Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
the packet. For sockets that have transport headers pulled, transport
offset can be negative. Use signed comparison to avoid overflow.

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Reported-by: Nisar Jagabar
Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Willem de Bruijn
2017-01-15 20:42:52 +0800

07 Dec, 2016

1 commit

dcb17d22e tcp: warn on bogus MSS and try to amend it ... Browse Code »

There have been some reports lately about TCP connection stalls caused
by NIC drivers that aren't setting gso_size on aggregated packets on rx
path. This causes TCP to assume that the MSS is actually the size of the
aggregated packet, which is invalid.

Although the proper fix is to be done at each driver, it's often hard
and cumbersome for one to debug, come to such root cause and report/fix
it.

This patch amends this situation in two ways. First, it adds a warning
on when this situation occurs, so it gives a hint to those trying to
debug this. It also limit the maximum probed MSS to the adverised MSS,
as it should never be any higher than that.

The result is that the connection may not have the best performance ever
but it shouldn't stall, and the admin will have a hint on what to look
for.

Tested with virtio by forcing gso_size to 0.

v2: updated msg per David's suggestion
v3: use skb_iif to find the interface and also log its name, per Eric
Dumazet's suggestion. As the skb may be backlogged and the interface
gone by then, we need to check if the number still has a meaning.
v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per
David's suggestion

Cc: Jonathan Maxwell
Signed-off-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

Marcelo Ricardo Leitner
2016-12-07 00:01:19 +0800

06 Dec, 2016

3 commits

0eab121ef net: ping: check minimum size on ICMP header length ... Browse Code »

Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there
was no check that the iovec contained enough bytes for an ICMP header,
and the read loop would walk across neighboring stack contents. Since the
iov_iter conversion, bad arguments are noticed, but the returned error is
EFAULT. Returning EINVAL is a clearer error and also solves the problem
prior to v3.19.

This was found using trinity with KASAN on v3.18:

BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0
Read of size 8 by task trinity-c2/9623
page:ffffffbe034b9a08 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G BU 3.18.0-dirty #15
Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
Call trace:
[] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90
[] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171
[< inline >] __dump_stack lib/dump_stack.c:15
[] dump_stack+0x7c/0xd0 lib/dump_stack.c:50
[< inline >] print_address_description mm/kasan/report.c:147
[< inline >] kasan_report_error mm/kasan/report.c:236
[] kasan_report+0x380/0x4b8 mm/kasan/report.c:259
[< inline >] check_memory_region mm/kasan/kasan.c:264
[] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507
[] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15
[< inline >] memcpy_from_msg include/linux/skbuff.h:2667
[] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674
[] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714
[] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749
[< inline >] __sock_sendmsg_nosec net/socket.c:624
[< inline >] __sock_sendmsg net/socket.c:632
[] sock_sendmsg+0x124/0x164 net/socket.c:643
[< inline >] SYSC_sendto net/socket.c:1797
[] SyS_sendto+0x178/0x1d8 net/socket.c:1761

CVE-2016-8399

Reported-by: Qidan He
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook
Signed-off-by: David S. Miller

Kees Cook
2016-12-06 02:35:38 +0800
a52ca62c4 ipv4: Drop suffix update from resize code ... Browse Code »

It has been reported that update_suffix can be expensive when it is called
on a large node in which most of the suffix lengths are the same. The time
required to add 200K entries had increased from around 3 seconds to almost
49 seconds.

In order to address this we need to move the code for updating the suffix
out of resize and instead just have it handled in the cases where we are
pushing a node that increases the suffix length, or will decrease the
suffix length.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Reported-by: Robert Shearman
Signed-off-by: Alexander Duyck
Reviewed-by: Robert Shearman
Tested-by: Robert Shearman
Signed-off-by: David S. Miller

Alexander Duyck
2016-12-06 02:15:58 +0800
1a239173c ipv4: Drop leaf from suffix pull/push functions ... Browse Code »

It wasn't necessary to pass a leaf in when doing the suffix updates so just
drop it. Instead just pass the suffix and work with that.

Since we dropped the leaf there is no need to include that in the name so
the names are updated to node_push_suffix and node_pull_suffix.

Finally I noticed that the logic for pulling the suffix length back
actually had some issues. Specifically it would stop prematurely if there
was a longer suffix, but it was not as long as the original suffix. I
updated the code to address that in node_pull_suffix.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Suggested-by: Robert Shearman
Signed-off-by: Alexander Duyck
Reviewed-by: Robert Shearman
Tested-by: Robert Shearman
Signed-off-by: David S. Miller

Alexander Duyck
2016-12-06 02:15:58 +0800

03 Dec, 2016

1 commit

f41804391 ipv4: Set skb->protocol properly for local output ... Browse Code »

When xfrm is applied to TSO/GSO packets, it follows this path:

xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
fixing a bug where GSO packets sent through a sit tunnel are dropped
when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper
Signed-off-by: David S. Miller

Eli Cooper
2016-12-03 01:34:22 +0800

02 Dec, 2016

2 commits

7bbf91ce2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec ... Browse Code »

Steffen Klassert says:

====================
pull request (net): ipsec 2016-12-01

1) Change the error value when someone tries to run 32bit
userspace on a 64bit host from -ENOTSUPP to the userspace
exported -EOPNOTSUPP. Fix from Yi Zhao.

2) On inbound, ESN sequence numbers are already in network
byte order. So don't try to convert it again, this fixes
integrity verification for ESN. Fixes from Tobias Brunner.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-12-02 00:35:49 +0800
3d2dd617f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

This is a large batch of Netfilter fixes for net, they are:

1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist
structure that allows to have several objects with the same key.
Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is
expecting a return value similar to memcmp(). Change location of
the nat_bysource field in the nf_conn structure to avoid zeroing
this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us
to crashes. From Florian Westphal.

2) Don't allow malformed fragments go through in IPv6, drop them,
otherwise we hit GPF, patch from Florian Westphal.

3) Fix crash if attributes are missing in nft_range, from Liping Zhang.

4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia.

5) Two patches from David Ahern to fix netfilter interaction with vrf.
From David Ahern.

6) Fix element timeout calculation in nf_tables, we take milliseconds
from userspace, but we use jiffies from kernelspace. Patch from
Anders K. Pedersen.

7) Missing validation length netlink attribute for nft_hash, from
Laura Garcia.

8) Fix nf_conntrack_helper documentation, we don't default to off
anymore for a bit of time so let's get this in sync with the code.

I know is late but I think these are important, specifically the NAT
bits, as they are mostly addressing fallout from recent changes. I also
read there are chances to have -rc8, if that is the case, that would
also give us a bit more time to test this.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-12-02 00:04:41 +0800

01 Dec, 2016

1 commit

17a49cd54 netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel ... Browse Code »

Since 09d9686047db ("netfilter: x_tables: do compat validation via
translate_table"), it used compatr structure to assign newinfo
structure. In translate_compat_table of ip_tables.c and ip6_tables.c,
it used compatr->hook_entry to replace info->hook_entry and
compatr->underflow to replace info->underflow, but not do the same
replacement in arp_tables.c.

It caused invoking 32-bit "arptbale -P INPUT ACCEPT" failed in 64bit
kernel.
--------------------------------------
root@qemux86-64:~# arptables -P INPUT ACCEPT
root@qemux86-64:~# arptables -P INPUT ACCEPT
ERROR: Policy for `INPUT' offset 448 != underflow 0
arptables: Incompatible with this kernel
--------------------------------------

Fixes: 09d9686047db ("netfilter: x_tables: do compat validation via translate_table")
Signed-off-by: Hongxu Jia
Acked-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Hongxu Jia
2016-12-01 03:50:23 +0800

30 Nov, 2016

2 commits

7c7fedd51 esp4: Fix integrity verification when ESN are used ... Browse Code »

When handling inbound packets, the two halves of the sequence number
stored on the skb are already in network order.

Fixes: 7021b2e1cddd ("esp4: Switch to new AEAD interface")
Signed-off-by: Tobias Brunner
Acked-by: Herbert Xu
Signed-off-by: Steffen Klassert

Tobias Brunner
2016-11-30 18:09:39 +0800
a51088782 GSO: Reload iph after pskb_may_pull ... Browse Code »

As it may get stale and lead to use after free.

Acked-by: Eric Dumazet
Cc: Alexander Duyck
Cc: Andrey Konovalov
Fixes: cbc53e08a793 ("GSO: Add GSO type for fixed IPv4 ID")
Signed-off-by: Arnaldo Carvalho de Melo
Acked-by: Alexander Duyck
Signed-off-by: David S. Miller

Arnaldo Carvalho de Melo
2016-11-30 09:45:54 +0800

29 Nov, 2016

1 commit

4df21dfcf tcp: Set DEFAULT_TCP_CONG to bbr if DEFAULT_BBR is set ... Browse Code »

Signed-off-by: Julian Wollrath
Signed-off-by: David S. Miller

Julian Wollrath
2016-11-29 01:15:00 +0800

25 Nov, 2016

1 commit

30c7be26f udplite: call proper backlog handlers ... Browse Code »

In commits 93821778def10 ("udp: Fix rcv socket locking") and
f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
__udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
was forgotten.

This leads to crashes if UDPlite header is pulled twice, which happens
starting from commit e6afc8ace6dd ("udp: remove headers from UDP packets
before queueing")

Bug found by syzkaller team, thanks a lot guys !

Note that backlog use in UDP/UDPlite is scheduled to be removed starting
from linux-4.10, so this patch is only needed up to linux-4.9

Fixes: 93821778def1 ("udp: Fix rcv socket locking")
Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet
Reported-by: Andrey Konovalov
Cc: Benjamin LaHaise
Cc: Herbert Xu
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-25 04:32:14 +0800

24 Nov, 2016

1 commit

6d8b49c3a netfilter: Update ip_route_me_harder to consider L3 domain ... Browse Code »

ip_route_me_harder is not considering the L3 domain and sending lookups
to the wrong table. For example consider the following output rule:

iptables -I OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset

using perf to analyze lookups via the fib_table_lookup tracepoint shows:

vrf-test 1187 [001] 46887.295927: fib:fib_table_lookup: table 255 oif 0 iif 0 src 0.0.0.0 dst 10.100.1.254 tos 0 scope 0 flags 0
ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
ffffffff8148dda3 __inet_dev_addr_type ([kernel.kallsyms])
ffffffff8148ddf6 inet_addr_type ([kernel.kallsyms])
ffffffff8149e344 ip_route_me_harder ([kernel.kallsyms])

and

vrf-test 1187 [001] 46887.295933: fib:fib_table_lookup: table 255 oif 0 iif 1 src 10.100.1.254 dst 10.100.1.2 tos 0 scope 0 flags
ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
ffffffff814998ff fib4_rule_action ([kernel.kallsyms])
ffffffff81437f35 fib_rules_lookup ([kernel.kallsyms])
ffffffff81499758 __fib_lookup ([kernel.kallsyms])
ffffffff8144f010 fib_lookup.constprop.34 ([kernel.kallsyms])
ffffffff8144f759 __ip_route_output_key_hash ([kernel.kallsyms])
ffffffff8144fc6a ip_route_output_flow ([kernel.kallsyms])
ffffffff8149e39b ip_route_me_harder ([kernel.kallsyms])

In both cases the lookups are directed to table 255 rather than the
table associated with the device via the L3 domain. Update both
lookups to pull the L3 domain from the dst currently attached to the
skb.

Signed-off-by: David Ahern
Signed-off-by: Pablo Neira Ayuso

David Ahern
2016-11-24 19:44:36 +0800

22 Nov, 2016

1 commit

7082c5c3f tcp: zero ca_priv area when switching cc algorithms ... Browse Code »

We need to zero out the private data area when application switches
connection to different algorithm (TCP_CONGESTION setsockopt).

When congestion ops get assigned at connect time everything is already
zeroed because sk_alloc uses GFP_ZERO flag. But in the setsockopt case
this contains whatever previous cc placed there.

Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2016-11-22 02:13:56 +0800

17 Nov, 2016

2 commits

3114cdfe6 ipv4: Fix memory leak in exception case for splitting tries ... Browse Code »

Fix a small memory leak that can occur where we leak a fib_alias in the
event of us not being able to insert it into the local table.

Fixes: 0ddcf43d5d4a0 ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Eric Dumazet
Signed-off-by: Alexander Duyck
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Alexander Duyck
2016-11-17 02:24:50 +0800
3b7093346 ipv4: Restore fib_trie_flush_external function and fix call ordering ... Browse Code »

The patch that removed the FIB offload infrastructure was a bit too
aggressive and also removed code needed to clean up us splitting the table
if additional rules were added. Specifically the function
fib_trie_flush_external was called at the end of a new rule being added to
flush the foreign trie entries from the main trie.

I updated the code so that we only call fib_trie_flush_external on the main
table so that we flush the entries for local from main. This way we don't
call it for every rule change which is what was happening previously.

Fixes: 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure")
Reported-by: Eric Dumazet
Cc: Jiri Pirko
Signed-off-by: Alexander Duyck
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Alexander Duyck
2016-11-17 02:24:50 +0800

16 Nov, 2016

2 commits

73e2d5e34 udp: restore UDPlite many-cast delivery ... Browse Code »

Honor udptable parameter that is passed to __udp*_lib_mcast_deliver(),
otherwise udplite broadcast/multicast use the wrong table and it breaks.

Fixes: 2dc41cff7545 ("udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.")
Signed-off-by: Pablo Neira Ayuso
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Pablo Neira
2016-11-16 11:14:27 +0800
24803f38a igmp: do not remove igmp souce list info when set link down ... Browse Code »

In commit 24cf3af3fed5 ("igmp: call ip_mc_clear_src..."), we forgot to remove
igmpv3_clear_delrec() in ip_mc_down(), which also called ip_mc_clear_src().
This make us clear all IGMPv3 source filter info after NETDEV_DOWN.
Move igmpv3_clear_delrec() to ip_mc_destroy_dev() and then no need
ip_mc_clear_src() in ip_mc_destroy_dev().

On the other hand, we should restore back instead of free all source filter
info in igmpv3_del_delrec(). Or we will not able to restore IGMPv3 source
filter info after NETDEV_UP and NETDEV_POST_TYPE_CHANGE.

Fixes: 24cf3af3fed5 ("igmp: call ip_mc_clear_src() only when ...")
Signed-off-by: Hangbin Liu
Signed-off-by: David S. Miller

Hangbin Liu
2016-11-16 08:51:16 +0800

14 Nov, 2016

2 commits

ac6e78007 tcp: take care of truncations done by sk_filter() ... Browse Code »

With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()

Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.

We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq

Many thanks to syzkaller team and Marco for giving us a reproducer.

Signed-off-by: Eric Dumazet
Reported-by: Marco Grassi
Reported-by: Vladis Dronov
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-14 01:30:02 +0800
969447f22 ipv4: use new_gw for redirect neigh lookup ... Browse Code »

In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
and then the state of the neigh for the new_gw is checked. If the state
isn't valid then the redirected route is deleted. This behavior is
maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
is assigned to peer->redirect_learned.a4 before calling
ipv4_neigh_lookup().

After commit 5943634fc559 ("ipv4: Maintain redirect and PMTU info in
struct rtable again."), ipv4_neigh_lookup() is performed without the
rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
isn't zero, the function uses it as the key. The neigh is most likely
valid since the old_gw is the one that sends the ICMP redirect message.
Then the new_gw is assigned to fib_nh_exception. The problem is: the
new_gw ARP may never gets resolved and the traffic is blackholed.

So, use the new_gw for neigh lookup.

Changes from v1:
- use __ipv4_neigh_lookup instead (per Eric Dumazet).

Fixes: 5943634fc559 ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
Signed-off-by: Stephen Suryaputra Lin
Signed-off-by: David S. Miller

Stephen Suryaputra Lin
2016-11-14 01:24:44 +0800

11 Nov, 2016

1 commit

0ace81ec7 ipv4: update comment to document GSO fragmentation cases. ... Browse Code »

This is a follow-up to commit 9ee6c5dc816a ("ipv4: allow local
fragmentation in ip_finish_output_gso()"), updating the comment
documenting cases in which fragmentation is needed for egress
GSO packets.

Suggested-by: Shmulik Ladkani
Reviewed-by: Shmulik Ladkani
Signed-off-by: Lance Richardson
Signed-off-by: David S. Miller

Lance Richardson
2016-11-11 01:01:54 +0800