Eric Lee / smarc-fsl-linux-kernel

18 Feb, 2017

3 commits

0d4c19ee6 tcp: don't annotate mark on control socket from tcp_v6_send_response() ... Browse Code »

commit 92e55f412cffd016cc245a74278cb4d7b89bb3bc upstream.

Unlike ipv4, this control socket is shared by all cpus so we cannot use
it as scratchpad area to annotate the mark that we pass to ip6_xmit().

Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket
family caches the flowi6 structure in the sctp_transport structure, so
we cannot use to carry the mark unless we later on reset it back, which
I discarded since it looks ugly to me.

Fixes: bf99b4ded5f8 ("tcp: fix mark propagation with fwmark_reflect enabled")
Suggested-by: Eric Dumazet
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Pablo Neira
2017-02-18 22:11:44 +0800
2b7f50d67 lwtunnel: valid encap attr check should return 0 when lwtunnel is disabled ... Browse Code »

[ Upstream commit 2bd137de531367fb573d90150d1872cb2a2095f7 ]

An error was reported upgrading to 4.9.8:
root@Typhoon:~# ip route add default table 210 nexthop dev eth0 via 10.68.64.1
weight 1 nexthop dev eth0 via 10.68.64.2 weight 1
RTNETLINK answers: Operation not supported

The problem occurs when CONFIG_LWTUNNEL is not enabled and a multipath
route is submitted.

The point of lwtunnel_valid_encap_type_attr is catch modules that
need to be loaded before any references are taken with rntl held. With
CONFIG_LWTUNNEL disabled, there will be no modules to load so the
lwtunnel_valid_encap_type_attr stub should just return 0.

Fixes: 9ed59592e3e3 ("lwtunnel: fix autoload of lwt modules")
Reported-by: pupilla@libero.it
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-02-18 22:11:42 +0800
66cdd4347 netlabel: out of bound access in cipso_v4_validate() ... Browse Code »

[ Upstream commit d71b7896886345c53ef1d84bda2bc758554f5d61 ]

syzkaller found another out of bound access in ip_options_compile(),
or more exactly in cipso_v4_validate()

Fixes: 20e2a8648596 ("cipso: handle CIPSO options correctly when NetLabel is disabled")
Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Cc: Paul Moore
Acked-by: Paul Moore
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2017-02-18 22:11:41 +0800

04 Feb, 2017

3 commits

e972cce0c lwtunnel: Fix oops on state free after encap module unload ... Browse Code »

[ Upstream commit 85c814016ce3b371016c2c054a905fa2492f5a65 ]

When attempting to free lwtunnel state after the module for the encap
has been unloaded an oops occurs:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: lwtstate_free+0x18/0x40
[..]
task: ffff88003e372380 task.stack: ffffc900001fc000
RIP: 0010:lwtstate_free+0x18/0x40
RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300
[..]
Call Trace:

free_fib_info_rcu+0x195/0x1a0
? rt_fibinfo_free+0x50/0x50
rcu_process_callbacks+0x2d3/0x850
? rcu_process_callbacks+0x296/0x850
__do_softirq+0xe4/0x4cb
irq_exit+0xb0/0xc0
smp_apic_timer_interrupt+0x3d/0x50
apic_timer_interrupt+0x93/0xa0
[..]
Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8

The problem is after the module for the encap can be unloaded the
corresponding ops is removed and is thus NULL here.

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so grab the module
reference for the ops on creating lwtunnel state and of course release
the reference when freeing the state.

Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation")
Signed-off-by: Robert Shearman
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Robert Shearman
2017-02-04 16:47:11 +0800
89c258862 net: Specify the owning module for lwtunnel ops ... Browse Code »

[ Upstream commit 88ff7334f25909802140e690c0e16433e485b0a0 ]

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.

Signed-off-by: Robert Shearman
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Robert Shearman
2017-02-04 16:47:11 +0800
e9db042dc lwtunnel: fix autoload of lwt modules ... Browse Code »

[ Upstream commit 9ed59592e3e379b2e9557dc1d9e9ec8fcbb33f16]

Trying to add an mpls encap route when the MPLS modules are not loaded
hangs. For example:

CONFIG_MPLS=y
CONFIG_NET_MPLS_GSO=m
CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m

$ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

The ip command hangs:
root 880 826 0 21:25 pts/0 00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

$ cat /proc/880/stack
[] call_usermodehelper_exec+0xd6/0x134
[] __request_module+0x27b/0x30a
[] lwtunnel_build_state+0xe4/0x178
[] fib_create_info+0x47f/0xdd4
[] fib_table_insert+0x90/0x41f
[] inet_rtm_newroute+0x4b/0x52
...

modprobe is trying to load rtnl-lwt-MPLS:

root 881 5 0 21:25 ? 00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS

and it hangs after loading mpls_router:

$ cat /proc/881/stack
[] rtnl_lock+0x12/0x14
[] register_netdevice_notifier+0x16/0x179
[] mpls_init+0x25/0x1000 [mpls_router]
[] do_one_initcall+0x8e/0x13f
[] do_init_module+0x5a/0x1e5
[] load_module+0x13bd/0x17d6
...

The problem is that lwtunnel_build_state is called with rtnl lock
held preventing mpls_init from registering.

Given the potential references held by the time lwtunnel_build_state it
can not drop the rtnl lock to the load module. So, extract the module
loading code from lwtunnel_build_state into a new function to validate
the encap type. The new function is called while converting the user
request into a fib_config which is well before any table, device or
fib entries are examined.

Fixes: 745041e2aaf1 ("lwtunnel: autoload of lwt modules")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-02-04 16:47:10 +0800

09 Jan, 2017

1 commit

1976c7689 cfg80211/mac80211: fix BSS leaks when abandoning assoc attempts ... Browse Code »

commit e6f462df9acd2a3295e5d34eb29e2823220cf129 upstream.

When mac80211 abandons an association attempt, it may free
all the data structures, but inform cfg80211 and userspace
about it only by sending the deauth frame it received, in
which case cfg80211 has no link to the BSS struct that was
used and will not cfg80211_unhold_bss() it.

Fix this by providing a way to inform cfg80211 of this with
the BSS entry passed, so that it can clean up properly, and
use this ability in the appropriate places in mac80211.

This isn't ideal: some code is more or less duplicated and
tracing is missing. However, it's a fairly small change and
it's thus easier to backport - cleanups can come later.

Signed-off-by: Johannes Berg
Signed-off-by: Greg Kroah-Hartman

Johannes Berg
2017-01-09 15:32:17 +0800

02 Dec, 2016

1 commit

3d2dd617f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

This is a large batch of Netfilter fixes for net, they are:

1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist
structure that allows to have several objects with the same key.
Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is
expecting a return value similar to memcmp(). Change location of
the nat_bysource field in the nf_conn structure to avoid zeroing
this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us
to crashes. From Florian Westphal.

2) Don't allow malformed fragments go through in IPv6, drop them,
otherwise we hit GPF, patch from Florian Westphal.

3) Fix crash if attributes are missing in nft_range, from Liping Zhang.

4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia.

5) Two patches from David Ahern to fix netfilter interaction with vrf.
From David Ahern.

6) Fix element timeout calculation in nf_tables, we take milliseconds
from userspace, but we use jiffies from kernelspace. Patch from
Anders K. Pedersen.

7) Missing validation length netlink attribute for nft_hash, from
Laura Garcia.

8) Fix nf_conntrack_helper documentation, we don't default to off
anymore for a bit of time so let's get this in sync with the code.

I know is late but I think these are important, specifically the NAT
bits, as they are mostly addressing fallout from recent changes. I also
read there are chances to have -rc8, if that is the case, that would
also give us a bit more time to test this.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-12-02 00:04:41 +0800

01 Dec, 2016

1 commit

0382a25af l2tp: lock socket before checking flags in connect() ... Browse Code »

Socket flags aren't updated atomically, so the socket must be locked
while reading the SOCK_ZAPPED flag.

This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
also brings error handling for __ip6_datagram_connect() failures.

Signed-off-by: Guillaume Nault
Signed-off-by: David S. Miller

Guillaume Nault
2016-12-01 03:14:07 +0800

24 Nov, 2016

3 commits

5173bc679 netfilter: nat: fix crash when conntrack entry is re-used ... Browse Code »

Stas Nichiporovich reports oops in nf_nat_bysource_cmp(), trying to
access nf_conn struct at address 0xffffffffffffff50.

This is the result of fetching a null rhash list (struct embedded at
offset 176; 0 - 176 gets us ...fff50).

The problem is that conntrack entries are allocated from a
SLAB_DESTROY_BY_RCU cache, i.e. entries can be free'd and reused
on another cpu while nf nat bysource hash access the same conntrack entry.

Freeing is fine (we hold rcu read lock); zeroing rhlist_head isn't.

-> Move the rhlist struct outside of the memset()-inited area.

Fixes: 7c9664351980aaa6a ("netfilter: move nat hlist_head to nf_conn")
Reported-by: Stas Nichiporovich
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-11-24 21:43:35 +0800
d3e2a1110 netfilter: nf_tables: fix inconsistent element expiration calculation ... Browse Code »

As Liping Zhang reports, after commit a8b1e36d0d1d ("netfilter: nft_dynset:
fix element timeout for HZ != 1000"), priv->timeout was stored in jiffies,
while set->timeout was stored in milliseconds. This is inconsistent and
incorrect.

Firstly, we already call msecs_to_jiffies in nft_set_elem_init, so
priv->timeout will be converted to jiffies twice.

Secondly, if the user did not specify the NFTA_DYNSET_TIMEOUT attr,
set->timeout will be used, but we forget to call msecs_to_jiffies
when do update elements.

Fix this by using jiffies internally for traditional sets and doing the
conversions to/from msec when interacting with userspace - as dynset
already does.

This is preferable to doing the conversions, when elements are inserted or
updated, because this can happen very frequently on busy dynsets.

Fixes: a8b1e36d0d1d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Reported-by: Liping Zhang
Signed-off-by: Anders K. Pedersen
Acked-by: Liping Zhang
Signed-off-by: Pablo Neira Ayuso

Anders K. Pedersen
2016-11-24 21:43:34 +0800
7223ecd46 netfilter: nat: switch to new rhlist interface ... Browse Code »

I got offlist bug report about failing connections and high cpu usage.
This happens because we hit 'elasticity' checks in rhashtable that
refuses bucket list exceeding 16 entries.

The nat bysrc hash unfortunately needs to insert distinct objects that
share same key and are identical (have same source tuple), this cannot
be avoided.

Switch to the rhlist interface which is designed for this.

The nulls_base is removed here, I don't think its needed:

A (unlikely) false positive results in unneeded port clash resolution,
a false negative results in packet drop during conntrack confirmation,
when we try to insert the duplicate into main conntrack hash table.

Tested by adding multiple ip addresses to host, then adding
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

... and then creating multiple connections, from same source port but
different addresses:

for i in $(seq 2000 2032);do nc -p 1234 192.168.7.1 $i > /dev/null & done

(all of these then get hashed to same bysource slot)

Then, to test that nat conflict resultion is working:

nc -s 10.0.0.1 -p 1234 192.168.7.1 2000
nc -s 10.0.0.2 -p 1234 192.168.7.1 2000

tcp .. src=10.0.0.1 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1024 [ASSURED]
tcp .. src=10.0.0.2 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1025 [ASSURED]
tcp .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1234 [ASSURED]
tcp .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2001 src=192.168.7.1 dst=192.168.7.10 sport=2001 dport=1234 [ASSURED]
[..]

-> nat altered source ports to 1024 and 1025, respectively.
This can also be confirmed on destination host which shows
ESTAB 0 0 192.168.7.1:2000 192.168.7.10:1024
ESTAB 0 0 192.168.7.1:2000 192.168.7.10:1025
ESTAB 0 0 192.168.7.1:2000 192.168.7.10:1234

Cc: Herbert Xu
Fixes: 870190a9ec907 ("netfilter: nat: convert nat bysrc hash to rhashtable")
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-11-24 21:43:34 +0800

23 Nov, 2016

1 commit

39385cb5f Bluetooth: Fix using the correct source address type ... Browse Code »

The hci_get_route() API is used to look up local HCI devices, however
so far it has been incapable of dealing with anything else than the
public address of HCI devices. This completely breaks with LE-only HCI
devices that do not come with a public address, but use a static
random address instead.

This patch exteds the hci_get_route() API with a src_type parameter
that's used for comparing with the right address of each HCI device.

Signed-off-by: Johan Hedberg
Signed-off-by: Marcel Holtmann

Johan Hedberg
2016-11-23 05:50:46 +0800

19 Nov, 2016

1 commit

0f5258cd9 netns: fix get_net_ns_by_fd(int pid) typo ... Browse Code »

The argument to get_net_ns_by_fd() is a /proc/$PID/ns/net file
descriptor not a pid. Fix the typo.

Signed-off-by: Stefan Hajnoczi
Acked-by: Rami Rosen
Signed-off-by: David S. Miller

Stefan Hajnoczi
2016-11-19 03:01:58 +0800

17 Nov, 2016

1 commit

3b7093346 ipv4: Restore fib_trie_flush_external function and fix call ordering ... Browse Code »

The patch that removed the FIB offload infrastructure was a bit too
aggressive and also removed code needed to clean up us splitting the table
if additional rules were added. Specifically the function
fib_trie_flush_external was called at the end of a new rule being added to
flush the foreign trie entries from the main trie.

I updated the code so that we only call fib_trie_flush_external on the main
table so that we flush the entries for local from main. This way we don't
call it for every rule change which is what was happening previously.

Fixes: 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure")
Reported-by: Eric Dumazet
Cc: Jiri Pirko
Signed-off-by: Alexander Duyck
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Alexander Duyck
2016-11-17 02:24:50 +0800

16 Nov, 2016

1 commit

e88a27661 gro_cells: mark napi struct as not busy poll candidates ... Browse Code »

Rolf Neugebauer reported very long delays at netns dismantle.

Eric W. Biederman was kind enough to look at this problem
and noticed synchronize_net() occurring from netif_napi_del() that was
added in linux-4.5

Busy polling makes no sense for tunnels NAPI.
If busy poll is used for sessions over tunnels, the poller will need to
poll the physical device queue anyway.

netif_tx_napi_add() could be used here, but function name is misleading,
and renaming it is not stable material, so set NAPI_STATE_NO_BUSY_POLL
bit directly.

This will avoid inserting gro_cells napi structures in napi_hash[]
and avoid the problematic synchronize_net() (per possible cpu) that
Rolf reported.

Fixes: 93d05d4a320c ("net: provide generic busy polling to all NAPI drivers")
Signed-off-by: Eric Dumazet
Reported-by: Rolf Neugebauer
Reported-by: Eric W. Biederman
Acked-by: Cong Wang
Tested-by: Rolf Neugebauer
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-16 11:27:27 +0800

14 Nov, 2016

1 commit

ac6e78007 tcp: take care of truncations done by sk_filter() ... Browse Code »

With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()

Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.

We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq

Many thanks to syzkaller team and Marco for giving us a reproducer.

Signed-off-by: Eric Dumazet
Reported-by: Marco Grassi
Reported-by: Vladis Dronov
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-14 01:30:02 +0800

10 Nov, 2016

1 commit

9fa684ec8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains a larger than usual batch of Netfilter
fixes for your net tree. This series contains a mixture of old bugs and
recently introduced bugs, they are:

1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
support the set element updates from the packet path. From Liping
Zhang.

2) Fix leak when nft_expr_clone() fails, from Liping Zhang.

3) Fix a race when inserting new elements to the set hash from the
packet path, also from Liping.

4) Handle segmented TCP SIP packets properly, basically avoid that the
INVITE in the allow header create bogus expectations by performing
stricter SIP message parsing, from Ulrich Weber.

5) nft_parse_u32_check() should return signed integer for errors, from
John Linville.

6) Fix wrong allocation instead of connlabels, allocate 16 instead of
32 bytes, from Florian Westphal.

7) Fix compilation breakage when building the ip_vs_sync code with
CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.

8) Destroy the new set if the transaction object cannot be allocated,
also from Liping Zhang.

9) Use device to route duplicated packets via nft_dup only when set by
the user, otherwise packets may not follow the right route, again
from Liping.

10) Fix wrong maximum genetlink attribute definition in IPVS, from
WANG Cong.

11) Ignore untracked conntrack objects from xt_connmark, from Florian
Westphal.

12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC
via CT target, otherwise we cannot use the h.245 helper, from
Florian.

13) Revisit garbage collection heuristic in the new workqueue-based
timer approach for conntrack to evict objects earlier, again from
Florian.

14) Fix crash in nf_tables when inserting an element into a verdict map,
from Liping Zhang.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-11-10 09:38:18 +0800

04 Nov, 2016

3 commits

c3f24cfb3 dccp: do not release listeners too soon ... Browse Code »

Andrey Konovalov reported following error while fuzzing with syzkaller :

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[] []
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8 EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
[] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
[] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
[] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
[] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
[] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
[< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
[< inline >] NF_HOOK ./include/linux/netfilter.h:255
[] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
[< inline >] dst_input ./include/net/dst.h:507
[] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
[< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
[< inline >] NF_HOOK ./include/linux/netfilter.h:255
[] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
[] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
[] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
[] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
[] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
[] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
[] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
[< inline >] new_sync_write fs/read_write.c:499
[] __vfs_write+0x334/0x570 fs/read_write.c:512
[] vfs_write+0x17b/0x500 fs/read_write.c:560
[< inline >] SYSC_write fs/read_write.c:607
[] SyS_write+0xd4/0x1a0 fs/read_write.c:599
[] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet
Reported-by: Andrey Konovalov
Tested-by: Andrey Konovalov
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-04 04:16:50 +0800
9ee6c5dc8 ipv4: allow local fragmentation in ip_finish_output_gso() ... Browse Code »

Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka
Signed-off-by: Lance Richardson
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Lance Richardson
2016-11-04 04:10:26 +0800
da96786e2 net: tcp: check skb is non-NULL for exact match on lookups ... Browse Code »

Andrey reported the following error report while running the syzkaller
fuzzer:

general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800398c4480 task.stack: ffff88003b468000
RIP: 0010:[] [< inline >]
inet_exact_dif_match include/net/tcp.h:808
RIP: 0010:[] []
__inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219
RSP: 0018:ffff88003b46f270 EFLAGS: 00010202
RAX: 0000000000000004 RBX: 0000000000004242 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffc90000e3c000 RDI: 0000000000000054
RBP: ffff88003b46f2d8 R08: 0000000000004000 R09: ffffffff830910e7
R10: 0000000000000000 R11: 000000000000000a R12: ffffffff867fa0c0
R13: 0000000000004242 R14: 0000000000000003 R15: dffffc0000000000
FS: 00007fb135881700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020cc3000 CR3: 000000006d56a000 CR4: 00000000000006f0
Stack:
0000000000000000 000000000601a8c0 0000000000000000 ffffffff00004242
424200003b9083c2 ffff88003def4041 ffffffff84e7e040 0000000000000246
ffff88003a0911c0 0000000000000000 ffff88003a091298 ffff88003b9083ae
Call Trace:
[] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643
[] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718
[] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
...

MD5 has a code path that calls __inet_lookup_listener with a null skb,
so inet{6}_exact_dif_match needs to check skb against null before pulling
the flag.

Fixes: a04a480d4392 ("net: Require exact match for TCP socket lookups if
dif is l3mdev")
Reported-by: Andrey Konovalov
Signed-off-by: David Ahern
Tested-by: Andrey Konovalov
Signed-off-by: David S. Miller

David Ahern
2016-11-04 04:05:44 +0800

03 Nov, 2016

1 commit

23f4ffedb ip6_tunnel: Clear IP6CB in ip6tunnel_xmit() ... Browse Code »

skb->cb may contain data from previous layers. In the observed scenario,
the garbage data were misinterpreted as IP6CB(skb)->frag_max_size, so
that small packets sent through the tunnel are mistakenly fragmented.

This patch unconditionally clears the control buffer in ip6tunnel_xmit(),
which affects ip6_tunnel, ip6_udp_tunnel and ip6_gre. Currently none of
these tunnels set IP6CB(skb)->flags, otherwise it needs to be done earlier.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper
Signed-off-by: David S. Miller

Eli Cooper
2016-11-03 03:18:36 +0800

01 Nov, 2016

1 commit

dae399d7f sctp: hold transport instead of assoc when lookup assoc in rx path ... Browse Code »

Prior to this patch, in rx path, before calling lock_sock, it needed to
hold assoc when got it by __sctp_lookup_association, in case other place
would free/put assoc.

But in __sctp_lookup_association, it lookup and hold transport, then got
assoc by transport->assoc, then hold assoc and put transport. It means
it didn't hold transport, yet it was returned and later on directly
assigned to chunk->transport.

Without the protection of sock lock, the transport may be freed/put by
other places, which would cause a use-after-free issue.

This patch is to fix this issue by holding transport instead of assoc.
As holding transport can make sure to access assoc is also safe, and
actually it looks up assoc by searching transport rhashtable, to hold
transport here makes more sense.

Note that the function will be renamed later on on another patch.

Signed-off-by: Xin Long
Acked-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

Xin Long
2016-11-01 04:20:33 +0800

30 Oct, 2016

3 commits

2a26d99b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:
"Lots of fixes, mostly drivers as is usually the case.

1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
Khoroshilov.

2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
Pedersen.

3) Don't put aead_req crypto struct on the stack in mac80211, from
Ard Biesheuvel.

4) Several uninitialized variable warning fixes from Arnd Bergmann.

5) Fix memory leak in cxgb4, from Colin Ian King.

6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.

7) Several VRF semantic fixes from David Ahern.

8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.

9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.

10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.

11) Fix stale link state during failover in NCSCI driver, from Gavin
Shan.

12) Fix netdev lower adjacency list traversal, from Ido Schimmel.

13) Propvide proper handle when emitting notifications of filter
deletes, from Jamal Hadi Salim.

14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.

15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.

16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.

17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.

18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
Leitner.

19) Revert a netns locking change that causes regressions, from Paul
Moore.

20) Add recursion limit to GRO handling, from Sabrina Dubroca.

21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.

22) Avoid accessing stale vxlan/geneve socket in data path, from
Pravin Shelar"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
geneve: avoid using stale geneve socket.
vxlan: avoid using stale vxlan socket.
qede: Fix out-of-bound fastpath memory access
net: phy: dp83848: add dp83822 PHY support
enic: fix rq disable
tipc: fix broadcast link synchronization problem
ibmvnic: Fix missing brackets in init_sub_crq_irqs
ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
net/mlx4_en: Save slave ethtool stats command
net/mlx4_en: Fix potential deadlock in port statistics flow
net/mlx4: Fix firmware command timeout during interrupt test
net/mlx4_core: Do not access comm channel if it has not yet been initialized
net/mlx4_en: Fix panic during reboot
net/mlx4_en: Process all completions in RX rings after port goes up
net/mlx4_en: Resolve dividing by zero in 32-bit system
net/mlx4_core: Change the default value of enable_qos
net/mlx4_core: Avoid setting ports to auto when only one port type is supported
net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
...

Linus Torvalds
2016-10-30 11:33:20 +0800
c6fcc4fc5 vxlan: avoid using stale vxlan socket. ... Browse Code »

When vxlan device is closed vxlan socket is freed. This
operation can race with vxlan-xmit function which
dereferences vxlan socket. Following patch uses RCU
mechanism to avoid this situation.

Signed-off-by: Pravin B Shelar
Signed-off-by: David S. Miller

pravin shelar
2016-10-30 08:56:31 +0800
880b583ce Merge tag 'mac80211-for-davem-2016-10-27' of git://git.kernel.org/pub/scm/linux/… ... Browse Code »

…kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two fixes:
* a fix to process all events while suspending, so any
potential calls into the driver are done before it is
suspended
* small markup fixes for the sphinx documentation conversion
that's coming into the tree via the doc tree
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2016-10-30 03:54:16 +0800

28 Oct, 2016

5 commits

d5d32e4b7 net: ipv6: Do not consider link state for nexthop validation ... Browse Code »

Similar to IPv4, do not consider link state when validating next hops.

Currently, if the link is down default routes can fail to insert:
$ ip -6 ro add vrf blue default via 2100:2::64 dev eth2
RTNETLINK answers: No route to host

With this patch the command succeeds.

Fixes: 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-10-28 04:33:12 +0800
830218c1a net: ipv6: Fix processing of RAs in presence of VRF ... Browse Code »

rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB
table from the device index, but the corresponding rt6_get_route_info
and rt6_get_dflt_router functions were not leading to the failure to
process RA's:

ICMPv6: RA: ndisc_router_discovery failed to add default route

Fix the 'get' functions by using the table id associated with the
device when applicable.

Also, now that default routes can be added to tables other than the
default table, rt6_purge_dflt_routers needs to be updated as well to
look at all tables. To handle that efficiently, add a flag to the table
denoting if it is has a default route via RA.

Fixes: ca254490c8dfd ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-10-28 04:30:52 +0800
cdb436d18 netfilter: conntrack: avoid excess memory allocation ... Browse Code »

This is now a fixed-size extension, so we don't need to pass a variable
alloc size. This (harmless) error results in allocating 32 instead of
the needed 16 bytes for this extension as the size gets passed twice.

Fixes: 23014011ba420 ("netfilter: conntrack: support a fixed size of 128 distinct labels")
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-10-28 00:29:02 +0800
f1d505bb7 netfilter: nf_tables: fix type mismatch with error return from nft_parse_u32_check ... Browse Code »

Commit 36b701fae12ac ("netfilter: nf_tables: validate maximum value of
u32 netlink attributes") introduced nft_parse_u32_check with a return
value of "unsigned int", yet on error it returns "-ERANGE".

This patch corrects the mismatch by changing the return value to "int",
which happens to match the actual users of nft_parse_u32_check already.

Found by Coverity, CID 1373930.

Note that commit 21a9e0f1568ea ("netfilter: nft_exthdr: fix error
handling in nft_exthdr_init()) attempted to address the issue, but
did not address the return type of nft_parse_u32_check.

Signed-off-by: John W. Linville
Cc: Laura Garcia Liebana
Cc: Pablo Neira Ayuso
Cc: Dan Carpenter
Fixes: 36b701fae12ac ("netfilter: nf_tables: validate maximum value...")
Signed-off-by: Pablo Neira Ayuso

John W. Linville
2016-10-28 00:29:01 +0800
61f9e2924 netfilter: nf_tables: fix *leak* when expr clone fail ... Browse Code »

When nft_expr_clone failed, a series of problems will happen:

1. module refcnt will leak, we call __module_get at the beginning but
we forget to put it back if ops->clone returns fail
2. memory will be leaked, if clone fail, we just return NULL and forget
to free the alloced element
3. set->nelems will become incorrect when set->size is specified. If
clone fail, we should decrease the set->nelems

Now this patch fixes these problems. And fortunately, clone fail will
only happen on counter expression when memory is exhausted.

Fixes: 086f332167d6 ("netfilter: nf_tables: add clone interface to expression operations")
Signed-off-by: Liping Zhang
Signed-off-by: Pablo Neira Ayuso

Liping Zhang
2016-10-28 00:20:45 +0800

27 Oct, 2016

2 commits

10df8e615 udp: fix IP_CHECKSUM handling ... Browse Code »

First bug was added in commit ad6f939ab193 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));

Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now UDP headers are pulled
before skb are put in receive queue.

Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv")
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet
Cc: Sam Kumar
Cc: Willem de Bruijn
Tested-by: Willem de Bruijn
Signed-off-by: David S. Miller

Eric Dumazet
2016-10-27 05:33:22 +0800
293de7dee doc: update docbook annotations for socket and skb ... Browse Code »

The skbuff and sock structure both had missing parameter annotation
values.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

Stephen Hemminger
2016-10-27 05:31:23 +0800

26 Oct, 2016

1 commit

b4f7f4ad4 mac80211: fix some sphinx warnings ... Browse Code »

Signed-off-by: Jani Nikula
Signed-off-by: Johannes Berg

Jani Nikula
2016-10-26 14:01:07 +0800

21 Oct, 2016

2 commits

8651be8f1 ipv6: fix a potential deadlock in do_ipv6_setsockopt() ... Browse Code »

Baozeng reported this deadlock case:

CPU0 CPU1
---- ----
lock([ 165.136033] sk_lock-AF_INET6);
lock([ 165.136033] rtnl_mutex);
lock([ 165.136033] sk_lock-AF_INET6);
lock([ 165.136033] rtnl_mutex);

Similar to commit 87e9f0315952
("ipv4: fix a potential deadlock in mcast getsockopt() path")
this is due to we still have a case, ipv6_sock_mc_close(),
where we acquire sk_lock before rtnl_lock. Close this deadlock
with the similar solution, that is always acquire rtnl lock first.

Fixes: baf606d9c9b1 ("ipv4,ipv6: grab rtnl before locking the socket")
Reported-by: Baozeng Ding
Tested-by: Baozeng Ding
Cc: Marcelo Ricardo Leitner
Signed-off-by: Cong Wang
Reviewed-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

WANG Cong
2016-10-21 23:29:02 +0800
286c72dea udp: must lock the socket in udp_disconnect() ... Browse Code »

Baozeng Ding reported KASAN traces showing uses after free in
udp_lib_get_port() and other related UDP functions.

A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.

I could write a reproducer with two threads doing :

static int sock_fd;
static void *thr1(void *arg)
{
for (;;) {
connect(sock_fd, (const struct sockaddr *)arg,
sizeof(struct sockaddr_in));
}
}

static void *thr2(void *arg)
{
struct sockaddr_in unspec;

for (;;) {
memset(&unspec, 0, sizeof(unspec));
connect(sock_fd, (const struct sockaddr *)&unspec,
sizeof(unspec));
}
}

Problem is that udp_disconnect() could run without holding socket lock,
and this was causing list corruptions.

Signed-off-by: Eric Dumazet
Reported-by: Baozeng Ding
Signed-off-by: David S. Miller

Eric Dumazet
2016-10-21 02:45:52 +0800

18 Oct, 2016

1 commit

5cbee5573 Merge tag 'mac80211-for-davem-2016-10-18' of git://git.kernel.org/pub/scm/linux/… ... Browse Code »

…kernel/git/jberg/mac80211

Johannes Berg says:

====================
This is relatively small, mostly to get the SG/crypto
from stack removal fix that crashes things when VMAP
stack is used in conjunction with software crypto.

Aside from that, we have:
* a fix for AP_VLAN usage with the nl80211 frame command
* two fixes (and two preparation patches) for A-MSDU, one
to discard group-addressed (multicast) and unexpected
4-address A-MSDUs, the other to validate A-MSDU inner
MAC addresses properly to prevent controlled port bypass
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2016-10-18 22:26:15 +0800

17 Oct, 2016

1 commit

a04a480d4 net: Require exact match for TCP socket lookups if dif is l3mdev ... Browse Code »

Currently, socket lookups for l3mdev (vrf) use cases can match a socket
that is bound to a port but not a device (ie., a global socket). If the
sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
based on the main table even though the packet came in from an L3 domain.
The end result is that the connection does not establish creating
confusion for users since the service is running and a socket shows in
ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
skb came through an interface enslaved to an l3mdev device and the
tcp_l3mdev_accept is not set.

skb's through an l3mdev interface are marked by setting a flag in
inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
inet_skb_parm struct is moved in the cb per commit 971f10eca186, so the
match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
move is done after the socket lookup, so IP6CB is used.

The flags field in inet_skb_parm struct needs to be increased to add
another flag. There is currently a 1-byte hole following the flags,
so it can be expanded to u16 without increasing the size of the struct.

Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-10-17 22:17:05 +0800

15 Oct, 2016

1 commit

d4d24d2d0 Merge tag 'docs-4.9-2' of git://git.lwn.net/linux ... Browse Code »

Pull one more documentation update from Jonathan Corbet:
"A single commit converting the mac80211 DocBook template over to
Sphinx. Only 32 more to go..."

* tag 'docs-4.9-2' of git://git.lwn.net/linux:
docs-rst: sphinxify 802.11 documentation

Linus Torvalds
2016-10-15 05:11:22 +0800

14 Oct, 2016

1 commit

76506a986 IPv6: fix DESYNC_FACTOR ... Browse Code »

The IPv6 temporary address generation uses a variable called DESYNC_FACTOR
to prevent hosts updating the addresses at the same time. Quoting RFC 4941:

... The value DESYNC_FACTOR is a random value (different for each
client) that ensures that clients don't synchronize with each other and
generate new addresses at exactly the same time ...

DESYNC_FACTOR is defined as:

DESYNC_FACTOR -- A random value within the range 0 - MAX_DESYNC_FACTOR.
It is computed once at system start (rather than each time it is used)
and must never be greater than (TEMP_VALID_LIFETIME - REGEN_ADVANCE).

First, I believe the RFC has a typo in it and meant to say: "and must
never be greater than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE)"

The reason is that at various places in the RFC, DESYNC_FACTOR is used in
a calculation like (TEMP_PREFERRED_LIFETIME - DESYNC_FACTOR) or
(TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE - DESYNC_FACTOR). It needs to be
smaller than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE) for the result of
these calculations to be larger than zero. It's never used in a
calculation together with TEMP_VALID_LIFETIME.

I already submitted an errata to the rfc-editor:
https://www.rfc-editor.org/errata_search.php?rfc=4941

The Linux implementation of DESYNC_FACTOR is very wrong:
max_desync_factor is used in places DESYNC_FACTOR should be used.
max_desync_factor is initialized to the RFC-recommended value for
MAX_DESYNC_FACTOR (600) but the whole point is to get a _random_ value.

And nothing ensures that the value used is not greater than
(TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE), which leads to underflows. The
effect can easily be observed when setting the temp_prefered_lft sysctl
e.g. to 60. The preferred lifetime of the temporary addresses will be
bogus.

TEMP_PREFERRED_LIFETIME and REGEN_ADVANCE are not constants and can be
influenced by these three sysctls: regen_max_retry, dad_transmits and
temp_prefered_lft. Thus, the upper bound for desync_factor needs to be
re-calculated each time a new address is generated and if desync_factor is
larger than the new upper bound, a new random value needs to be
re-generated.

And since we already have max_desync_factor configurable per interface, we
also need to calculate and store desync_factor per interface.

Signed-off-by: Jiri Bohac
Signed-off-by: David S. Miller

Jiri Bohac
2016-10-14 22:59:15 +0800