Eric Lee / smarc-fsl-linux-kernel

13 Nov, 2016

1 commit

4e3264d21 bpf: Fix bpf_redirect to an ipip/ip6tnl dev

If the bpf program calls bpf_redirect(dev, 0) and dev is
an ipip/ip6tnl, the packet currently still includes the mac header.
For example, if dev is ipip, the end result is IP-EthHdr-IP instead
of IP-IP.

The fix is to pull the mac header. At ingress, skb_postpull_rcsum()
is not needed because the ethhdr should have been pulled once already
and then got pushed back just before calling the bpf_prog.
At egress, this patch calls skb_postpull_rcsum().

If bpf_redirect(dev, BPF_F_INGRESS) is called,
it also fails now because it calls dev_forward_skb() which
eventually calls eth_type_trans(skb, dev). The eth_type_trans()
will set skb->pkt_type = PACKET_OTHERHOST because the mac address
does not match the redirecting dev->dev_addr. The PACKET_OTHERHOST
will eventually cause ip_rcv() to error out. To fix this,
____dev_forward_skb() is added.
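
The header layout problem can be pictured with a toy model. The sketch below is plain Python, not kernel code: a packet is just a list of header labels, and the names encap_ipip and pull_mac are hypothetical, chosen only for this illustration.

```python
# Toy model, not kernel code: a packet is a list of header labels,
# outermost header first.
def encap_ipip(packet, pull_mac):
    """Prepend an outer IP header, optionally pulling the mac header first."""
    inner = list(packet)
    if pull_mac and inner and inner[0] == "EthHdr":
        inner = inner[1:]  # drop the Ethernet header before tunnelling
    return ["IP"] + inner


frame = ["EthHdr", "IP"]
print("-".join(encap_ipip(frame, pull_mac=False)))  # IP-EthHdr-IP
print("-".join(encap_ipip(frame, pull_mac=True)))   # IP-IP
```

Without the pull, the Ethernet header ends up sandwiched between the outer and inner IP headers, which is the IP-EthHdr-IP result described above.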

Joint work with Daniel Borkmann.

Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Acked-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: Martin KaFai Lau
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-11-13 12:38:07 +0800

11 Nov, 2016

1 commit

0ace81ec7 ipv4: update comment to document GSO fragmentation cases.

This is a follow-up to commit 9ee6c5dc816a ("ipv4: allow local
fragmentation in ip_finish_output_gso()"), updating the comment
documenting cases in which fragmentation is needed for egress
GSO packets.

Suggested-by: Shmulik Ladkani
Reviewed-by: Shmulik Ladkani
Signed-off-by: Lance Richardson
Signed-off-by: David S. Miller

Lance Richardson
2016-11-11 01:01:54 +0800

10 Nov, 2016

6 commits

9b6c14d51 net: tcp response should set oif only if it is L3 master

Lorenzo noted an Android unit test failed due to e0d56fdd7342:
"The expectation in the test was that the RST replying to a SYN sent to a
closed port should be generated with oif=0. In other words it should not
prefer the interface where the SYN came in on, but instead should follow
whatever the routing table says it should do."

Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
that the oif in the flow is set to the skb_iif only if skb_iif is an L3
master.
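
As a rough sketch of the rule being restored (plain Python, not kernel code): the reply's oif follows the ingress interface only when that interface is an L3 master, and is 0 otherwise so the routing table decides. The function name reply_oif and the set of master ifindexes are hypothetical stand-ins for the kernel's own check.

```python
def reply_oif(skb_iif, l3_master_ifindexes):
    """Return the oif for a reply flow.

    skb_iif is the ifindex the packet arrived on; l3_master_ifindexes is a
    hypothetical set of ifindexes that are L3 master devices.
    """
    # Only pin the reply to the ingress device if it is an L3 master;
    # otherwise leave oif at 0 so the normal route lookup applies.
    return skb_iif if skb_iif in l3_master_ifindexes else 0
```

With this rule, a RST answering a SYN that arrived on an ordinary interface is generated with oif=0, which is what the Android test expected.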

Fixes: e0d56fdd7342 ("net: l3mdev: remove redundant calls")
Reported-by: Lorenzo Colitti
Signed-off-by: David Ahern
Tested-by: Lorenzo Colitti
Acked-by: Lorenzo Colitti
Signed-off-by: David S. Miller

David Ahern
2016-11-10 11:32:10 +0800
9fa684ec8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains a larger than usual batch of Netfilter
fixes for your net tree. This series contains a mixture of old bugs and
recently introduced bugs, they are:

1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
support the set element updates from the packet path. From Liping
Zhang.

2) Fix leak when nft_expr_clone() fails, from Liping Zhang.

3) Fix a race when inserting new elements to the set hash from the
packet path, also from Liping.

4) Handle segmented TCP SIP packets properly, basically avoid that the
INVITE in the allow header create bogus expectations by performing
stricter SIP message parsing, from Ulrich Weber.

5) nft_parse_u32_check() should return signed integer for errors, from
John Linville.

6) Fix wrong allocation size for connlabels: allocate 16 instead of
32 bytes, from Florian Westphal.

7) Fix compilation breakage when building the ip_vs_sync code with
CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.

8) Destroy the new set if the transaction object cannot be allocated,
also from Liping Zhang.

9) Use device to route duplicated packets via nft_dup only when set by
the user, otherwise packets may not follow the right route, again
from Liping.

10) Fix wrong maximum genetlink attribute definition in IPVS, from
WANG Cong.

11) Ignore untracked conntrack objects from xt_connmark, from Florian
Westphal.

12) Allow using conntrack helpers that are registered as NFPROTO_UNSPEC
via the CT target, otherwise we cannot use the h.245 helper, from
Florian.

13) Revisit garbage collection heuristic in the new workqueue-based
timer approach for conntrack to evict objects earlier, again from
Florian.

14) Fix crash in nf_tables when inserting an element into a verdict map,
from Liping Zhang.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-11-10 09:38:18 +0800
f567e950b rtnl: reset calcit fptr in rtnl_unregister()

To avoid having dangling function pointers left behind, reset calcit in
rtnl_unregister(), too.

This is no issue so far, as only the rtnl core registers a netlink
handler with a calcit hook which won't be unregistered, but may become
one if new code makes use of the calcit hook.

Fixes: c7ac8679bec9 ("rtnetlink: Compute and store minimum ifinfo...")
Cc: Jeff Kirsher
Cc: Greg Rose
Signed-off-by: Mathias Krause
Signed-off-by: David S. Miller

Mathias Krause
2016-11-10 09:18:19 +0800
9d1a6c4ea net: icmp_route_lookup should use rt dev to determine L3 domain

icmp_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have an rt.
Update icmp_route_lookup to use the rt on the skb to determine L3
domain.

Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-11-10 07:49:39 +0800
fb56be83e net-ipv6: on device mtu change do not add mtu to mtu-less routes

Routes can specify an mtu explicitly or inherit the mtu from
the underlying device - this inheritance is implemented in
dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().

Currently changing the mtu of a device adds mtu explicitly
to routes using that device.

For example:
# ip link set dev lo mtu 65536
# ip -6 route add local 2000::1 dev lo
# ip -6 route get 2000::1
local 2000::1 dev lo table local src ... metric 1024 pref medium

# ip link set dev lo mtu 65535
# ip -6 route get 2000::1
local 2000::1 dev lo table local src ... metric 1024 mtu 65535 pref medium

# ip link set dev lo mtu 65536
# ip -6 route get 2000::1
local 2000::1 dev lo table local src ... metric 1024 mtu 65536 pref medium

# ip -6 route del local 2000::1

After this patch the route entry no longer changes unless it already has an mtu.
There is no need: this inheritance is already done in ip6_mtu().

# ip link set dev lo mtu 65536
# ip -6 route add local 2000::1 dev lo
# ip -6 route add local 2000::2 dev lo mtu 2000
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium

# ip link set dev lo mtu 65535
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium

# ip link set dev lo mtu 1501
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 1501 pref medium

# ip link set dev lo mtu 65536
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 65536 pref medium

# ip -6 route del local 2000::1
# ip -6 route del local 2000::2

This is desirable because changing device mtu and then resetting it
to the previous value shouldn't change the user visible routing table.
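
The behaviour in the second transcript can be reproduced with a small model. This is a sketch in plain Python, not the kernel implementation: the Route class and function names are hypothetical, and the update rule for routes that already carry an mtu is inferred from the transcript above (an assumption, not taken from the patch itself).

```python
class Route:
    def __init__(self, mtu=None):
        self.mtu = mtu  # None means "no explicit mtu, inherit from the device"


def effective_mtu(route, dev_mtu):
    """Inheritance step: an mtu-less route uses the device mtu."""
    return route.mtu if route.mtu is not None else dev_mtu


def on_dev_mtu_change(routes, old_dev_mtu, new_dev_mtu):
    for rt in routes:
        if rt.mtu is None:
            continue  # after the patch, mtu-less routes stay mtu-less
        # Assumed rule, inferred from the transcript: clamp routes whose mtu
        # no longer fits, and follow the device if the route was tracking it.
        if rt.mtu >= new_dev_mtu or rt.mtu == old_dev_mtu:
            rt.mtu = new_dev_mtu


r1, r2 = Route(), Route(2000)
dev_mtu = 65536
for new_mtu in (65535, 1501, 65536):
    on_dev_mtu_change([r1, r2], dev_mtu, new_mtu)
    dev_mtu = new_mtu
    print(new_mtu, r1.mtu, r2.mtu)
```

The route without an mtu never gains one, while the route created with mtu 2000 goes through 2000, 1501 and 65536, matching the `ip -6 route get` output shown above.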

Signed-off-by: Maciej Żenczykowski
CC: Eric Dumazet
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Maciej Żenczykowski
2016-11-10 02:19:32 +0800
3023898b7 sock: fix sendmmsg for partial sendmsg

Do not send the next message in sendmmsg for partial sendmsg
invocations.

sendmmsg assumes that it can continue sending the next message
when the return value of the individual sendmsg invocations
is positive. It results in corrupting the data for TCP,
SCTP, and UNIX streams.

For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
of "aefgh" if the first sendmsg invocation sends only the first
byte while the second sendmsg goes through.

Datagram sockets either send the entire datagram or fail, so
this patch affects only sockets of type SOCK_STREAM and
SOCK_SEQPACKET.
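
The "aefgh" corruption can be reproduced with a toy model of the loop. The sketch below is plain Python rather than the kernel code; ShortWriteStream and sendmmsg_model are hypothetical names, and the fake socket simply accepts only part of the first message.

```python
class ShortWriteStream:
    """Fake stream socket: the first send accepts only `first_limit` bytes."""

    def __init__(self, first_limit):
        self.first_limit = first_limit
        self.calls = 0
        self.data = b""

    def send(self, msg):
        n = len(msg) if self.calls else min(len(msg), self.first_limit)
        self.calls += 1
        self.data += msg[:n]  # bytes that actually entered the stream
        return n


def sendmmsg_model(sock, msgs, stop_on_partial):
    """Send messages in order; optionally stop after a partial send."""
    sent = []
    for msg in msgs:
        n = sock.send(msg)
        sent.append(n)
        if stop_on_partial and n < len(msg):
            break  # do not start the next message after a short write
    return sent


old = ShortWriteStream(first_limit=1)
sendmmsg_model(old, [b"abcd", b"efgh"], stop_on_partial=False)
print(old.data)  # b'aefgh'

new = ShortWriteStream(first_limit=1)
sendmmsg_model(new, [b"abcd", b"efgh"], stop_on_partial=True)
print(new.data)  # b'a'
```

Stopping after the short write leaves the caller free to resend the remaining "bcd" before "efgh", so the byte stream stays in order.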

Fixes: 228e548e6020 ("net: Add sendmmsg socket system call")
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Eric Dumazet
Signed-off-by: Willem de Bruijn
Signed-off-by: Neal Cardwell
Acked-by: Maciej Żenczykowski
Signed-off-by: David S. Miller

Soheil Hassas Yeganeh
2016-11-10 02:18:12 +0800

09 Nov, 2016

5 commits

58c78e104 netfilter: nf_tables: fix oops when inserting an element into a verdict map

Dalegaard says:
The following ruleset, when loaded with 'nft -f bad.txt'
----snip----
flush ruleset
table ip inlinenat {
map sourcemap {
type ipv4_addr : verdict;
}

chain postrouting {
ip saddr vmap @sourcemap accept
}
}
add chain inlinenat test
add element inlinenat sourcemap { 100.123.10.2 : jump test }
----snip----

results in a kernel oops:
BUG: unable to handle kernel paging request at 0000000000001344
IP: [] nf_tables_check_loops+0x114/0x1f0 [nf_tables]
[...]
Call Trace:
[] ? nft_data_init+0x13e/0x1a0 [nf_tables]
[] nft_validate_register_store+0x60/0xb0 [nf_tables]
[] nft_add_set_elem+0x545/0x5e0 [nf_tables]
[] ? nft_table_lookup+0x30/0x60 [nf_tables]
[] ? nla_strcmp+0x40/0x50
[] nf_tables_newsetelem+0x11e/0x210 [nf_tables]
[] ? nla_validate+0x60/0x80
[] nfnetlink_rcv+0x354/0x5a7 [nfnetlink]

This happens because we forget to fill the net pointer in bind_ctx, so
dereferencing it may crash the kernel.

Reported-by: Dalegaard
Signed-off-by: Liping Zhang
Signed-off-by: Pablo Neira Ayuso

Liping Zhang
2016-11-09 06:53:39 +0800
e0df8cae6 netfilter: conntrack: refine gc worker heuristics

Nicolas Dichtel says:
After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
remove timed-out entries"), netlink conntrack deletion events may be
sent with a huge delay.

Nicolas further points at this line:

goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

and indeed, this isn't optimal at all. Rationale here was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge. But in order to have some guarantee
about maximum time period where a scan of the full conntrack table
completes we should always use a fixed slice size, so that once every
N scans the full table has been examined at least once.

We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).

So, after some discussion with Nicolas:

1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)

2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries

3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead

4. Want reasonable short time where we detect timed-out entry when
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.

The old gc interval is turned into a max number, scans can now happen
every jiffy if stale entries are present.

Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').
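
The scheduling idea in points 1 to 4 can be sketched as follows. This is a toy model in plain Python, not the kernel's gc worker: the function names, the cap of 200 jiffies and the halve/double rule are all assumptions made for illustration. The commit only says that delays grow when nothing was evicted, shrink when timed-out entries are seen, are capped by the old interval, and may drop to one jiffy.

```python
# Toy model of an adaptive gc schedule. The constants and the halve/double
# rule are hypothetical, not the values or formula used by the kernel.
GC_INTERVAL_MAX = 200  # hypothetical cap on the delay, in jiffies
GC_INTERVAL_MIN = 1    # scans may happen every jiffy when entries are stale


def next_gc_delay(current_delay, evicted_entries):
    """Shrink the delay after evictions, grow it after an idle scan."""
    if evicted_entries > 0:
        delay = current_delay // 2
    else:
        delay = current_delay * 2
    return max(GC_INTERVAL_MIN, min(delay, GC_INTERVAL_MAX))


def scans_for_full_table(table_size, buckets_per_scan):
    """With a fixed slice per scan, this many scans cover the whole table."""
    return -(-table_size // buckets_per_scan)  # ceiling division
```

Because each scan covers a fixed fraction of the table and the delay is bounded, the time until every bucket has been visited is bounded too, which is the hard guarantee asked for in point 1.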

Reported-by: Nicolas Dichtel
Signed-off-by: Florian Westphal
Acked-by: Nicolas Dichtel
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-11-09 06:53:38 +0800
6114cc516 netfilter: conntrack: fix CT target for UNSPEC helpers

Thomas reports it's not possible to attach the H.245 helper:

iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
iptables: No chain/target/match by that name.
xt_CT: No such helper "H.245"

This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.

We should treat UNSPEC as wildcard and ignore the l3num instead.
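
A minimal sketch of the wildcard rule, in plain Python rather than kernel code. The function find_helper and the list of (name, l3num) pairs are hypothetical; the NFPROTO_* values mirror the kernel's uapi constants.

```python
NFPROTO_UNSPEC, NFPROTO_IPV4, NFPROTO_IPV6 = 0, 2, 10


def find_helper(helpers, name, l3num):
    """Look up a helper by name, treating NFPROTO_UNSPEC as a wildcard.

    helpers is a hypothetical list of (name, registered_l3num) pairs.
    """
    for h_name, h_l3num in helpers:
        if h_name != name:
            continue
        # A helper registered as UNSPEC matches any requested family.
        if h_l3num == NFPROTO_UNSPEC or h_l3num == l3num:
            return (h_name, h_l3num)
    return None


helpers = [("H.245", NFPROTO_UNSPEC), ("ftp", NFPROTO_IPV4)]
print(find_helper(helpers, "H.245", NFPROTO_IPV4))  # ('H.245', 0)
print(find_helper(helpers, "ftp", NFPROTO_IPV6))    # None
```

With a strict comparison on l3num, the H.245 entry would never match a request for NFPROTO_IPV4 or NFPROTO_IPV6, which is the failure reported above.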

Reported-by: Thomas Woerner
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-11-09 06:53:37 +0800
fb9c9649a netfilter: connmark: ignore skbs with magic untracked conntrack objects

The (percpu) untracked conntrack entries can end up with nonzero connmarks.

The 'untracked' conntrack objects are merely a way to distinguish INVALID
(i.e. protocol connection tracker says payload doesn't meet some
requirements or packet was never seen by the connection tracking code)
from packets that are intentionally not tracked (some icmpv6 types such as
neigh solicitation, or by using 'iptables -j CT --notrack' option).

Untracked conntrack objects are an implementation detail; we might as well use
an invalid magic address instead to tell INVALID and UNTRACKED apart.

Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.

Reported-by: XU Tianwen
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-11-09 06:53:36 +0800
8fbfef7f5 ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr

family.maxattr is the max index for policy[]; the size of
ops[] is determined with ARRAY_SIZE().

Reported-by: Andrey Konovalov
Tested-by: Andrey Konovalov
Cc: Pablo Neira Ayuso
Signed-off-by: Cong Wang
Signed-off-by: Simon Horman
Signed-off-by: Pablo Neira Ayuso

WANG Cong
2016-11-09 06:53:30 +0800

08 Nov, 2016

3 commits

fd0285a39 fib_trie: Correct /proc/net/route off by one error

The display of /proc/net/route has had a couple issues due to the fact that
when I originally rewrote most of fib_trie I made it so that the iterator
was tracking the next value to use instead of the current.

In addition it had an off by 1 error where I was tracking the first piece
of data as position 0, even though in reality that belonged to the
SEQ_START_TOKEN.

This patch updates the code so the iterator tracks the last reported
position and key instead of the next expected position and key. In
addition it shifts things so that all of the leaves start at 1 instead of
trying to report leaves starting with offset 0 as being valid. With these
two issues addressed this should resolve any off by one errors that were
present in the display of /proc/net/route.

Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in /proc/net/route")
Cc: Andy Whitcroft
Reported-by: Jason Baron
Tested-by: Jason Baron
Signed-off-by: Alexander Duyck
Signed-off-by: David S. Miller

Alexander Duyck
2016-11-08 09:40:27 +0800
5d41ce29e net: icmp6_send should use dst dev to determine L3 domain

icmp6_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have a dst set.
Update icmp6_send to use the dst on the skb to determine L3 domain.

Fixes: ca254490c8dfd ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-11-08 09:30:19 +0800
7233bc84a sctp: assign assoc_id earlier in __sctp_connect

sctp_wait_for_connect() already holds the asoc to keep it
alive during the sleep, in case another thread releases it. But Andrey
Konovalov and Dmitry Vyukov reported a use-after-free in this
situation.

The problem is that __sctp_connect() doesn't get a ref on the asoc and
reads from the asoc after calling sctp_wait_for_connect(), but by then
another thread may have closed it, and the _put in sctp_wait_for_connect()
will actually release it, causing the use-after-free.

The fix is to do the read before waiting for the connect rather than
after it, which avoids the issue because the socket is still locked then.
There should be no issue with returning the asoc id in case of failure, as
the application shouldn't trust that number in such situations
anyway.

This issue doesn't exist in sctp_sendmsg() path.

Reported-by: Dmitry Vyukov
Reported-by: Andrey Konovalov
Tested-by: Andrey Konovalov
Signed-off-by: Marcelo Ricardo Leitner
Reviewed-by: Xin Long
Acked-by: Neil Horman
Signed-off-by: David S. Miller

Marcelo Ricardo Leitner
2016-11-08 02:18:37 +0800

04 Nov, 2016

11 commits

00ffc1ba0 genetlink: fix a memory leak on error path

In __genl_register_family(), when genl_validate_assign_mc_groups()
fails, we forget to free the memory we possibly allocate for
family->attrbuf.

Note that some callers call genl_unregister_family() to clean up
on the error path; this doesn't work because the family is inserted
into the global list only in nearly the last step.

Cc: Jakub Kicinski
Cc: Johannes Berg
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

WANG Cong
2016-11-04 04:52:29 +0800
990ff4d84 ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped

While fuzzing kernel with syzkaller, Andrey reported a nasty crash
in inet6_bind() caused by DCCP lacking a required method.

Fixes: ab1e0a13d7029 ("[SOCK] proto: Add hashinfo member to struct proto")
Signed-off-by: Eric Dumazet
Reported-by: Andrey Konovalov
Tested-by: Andrey Konovalov
Cc: Arnaldo Carvalho de Melo
Acked-by: Arnaldo Carvalho de Melo
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-04 04:50:27 +0800
1aa9d1a0e ipv6: dccp: fix out of bound access in dccp_v6_err()

dccp_v6_err() does not use pskb_may_pull() and might access garbage.

We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmpv6_notify() are more than enough.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-04 04:16:51 +0800
93636d1f1 netlink: netlink_diag_dump() runs without locks

A recent commit removed locking from netlink_diag_dump() but forgot
one error case.

=====================================
[ BUG: bad unlock balance detected! ]
4.9.0-rc3+ #336 Not tainted
-------------------------------------
syz-executor/4018 is trying to release lock ([ 36.220068] nl_table_lock
) at:
[] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
but there are no more locks to release!

other info that might help us debug this:
3 locks held by syz-executor/4018:
#0: [ 36.220068] (
sock_diag_mutex[ 36.220068] ){+.+.+.}
, at: [ 36.220068] [] sock_diag_rcv+0x1b/0x40
#1: [ 36.220068] (
sock_diag_table_mutex[ 36.220068] ){+.+.+.}
, at: [ 36.220068] [] sock_diag_rcv_msg+0x140/0x3a0
#2: [ 36.220068] (
nlk->cb_mutex[ 36.220068] ){+.+.+.}
, at: [ 36.220068] [] netlink_dump+0x50/0xac0

stack backtrace:
CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ #336
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800
ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca
dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[] dump_stack+0xb3/0x10f lib/dump_stack.c:51
[] print_unlock_imbalance_bug+0x17a/0x1a0
kernel/locking/lockdep.c:3388
[< inline >] __lock_release kernel/locking/lockdep.c:3512
[] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765
[< inline >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225
[] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
[] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
[] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110

Fixes: ad202074320c ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Eric Dumazet
Reported-by: Andrey Konovalov
Tested-by: Andrey Konovalov
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-04 04:16:51 +0800
6706a97fe dccp: fix out of bound access in dccp_v4_err()

dccp_v4_err() does not use pskb_may_pull() and might access garbage.

We only need 4 bytes at the beginning of the DCCP header, like TCP,
so the 8 bytes pulled in icmp_socket_deliver() are more than enough.

This patch might allow processing more ICMP messages, as some routers
are still limiting the size of reflected bytes to 28 (RFC 792), instead
of extended lengths (RFC 1812 4.3.2.3).

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-04 04:16:51 +0800
346da62cc dccp: do not send reset to already closed sockets

Andrey reported the following warning while fuzzing with syzkaller:

WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff88003d4c7738 ffffffff81b474f4 0000000000000003 dffffc0000000000
ffffffff844f8b00 ffff88003d4c7804 ffff88003d4c7800 ffffffff8140c06a
0000000041b58ab3 ffffffff8479ab7d ffffffff8140beae ffffffff8140cd00
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[] dump_stack+0xb3/0x10f lib/dump_stack.c:51
[] panic+0x1bc/0x39d kernel/panic.c:179
[] __warn+0x1cc/0x1f0 kernel/panic.c:542
[] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
[] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
[] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
[] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
[] sock_release+0x8e/0x1d0 net/socket.c:570
[] sock_close+0x16/0x20 net/socket.c:1017
[] __fput+0x29d/0x720 fs/file_table.c:208
[] ____fput+0x15/0x20 fs/file_table.c:244
[] task_work_run+0xf8/0x170 kernel/task_work.c:116
[< inline >] exit_task_work include/linux/task_work.h:21
[] do_exit+0x883/0x2ac0 kernel/exit.c:828
[] do_group_exit+0x10e/0x340 kernel/exit.c:931
[] get_signal+0x634/0x15a0 kernel/signal.c:2307
[] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
[] exit_to_usermode_loop+0xe5/0x130
arch/x86/entry/common.c:156
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[] syscall_return_slowpath+0x1a8/0x1e0
arch/x86/entry/common.c:259
[] entry_SYSCALL_64_fastpath+0xc0/0xc2
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled

Fix this the same way we did for TCP in commit 565b7b2d2e63
("tcp: do not send reset to already closed sockets")

Signed-off-by: Eric Dumazet
Reported-by: Andrey Konovalov
Tested-by: Andrey Konovalov
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-04 04:16:51 +0800
c3f24cfb3 dccp: do not release listeners too soon

Andrey Konovalov reported the following error while fuzzing with syzkaller:

IPv4: Attempt to release alive inet socket ffff880068e98940
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b9e0000 task.stack: ffff880068770000
RIP: 0010:[] []
selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
RSP: 0018:ffff8800687771c8 EFLAGS: 00010202
RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
Stack:
ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
Call Trace:
[] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
[] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
[] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
[] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
[] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
[< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
[< inline >] NF_HOOK ./include/linux/netfilter.h:255
[] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
[< inline >] dst_input ./include/net/dst.h:507
[] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
[< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
[< inline >] NF_HOOK ./include/linux/netfilter.h:255
[] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
[] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
[] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
[] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
[] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
[] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
[] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
[< inline >] new_sync_write fs/read_write.c:499
[] __vfs_write+0x334/0x570 fs/read_write.c:512
[] vfs_write+0x17b/0x500 fs/read_write.c:560
[< inline >] SYSC_write fs/read_write.c:607
[] SyS_write+0xd4/0x1a0 fs/read_write.c:599
[] entry_SYSCALL_64_fastpath+0x1f/0xc2

It turns out DCCP calls __sk_receive_skb(), and this broke when
lookups no longer took a reference on listeners.

Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
so that sock_put() is used only when needed.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet
Reported-by: Andrey Konovalov
Tested-by: Andrey Konovalov
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-04 04:16:50 +0800
79d8665b9 tcp: fix return value for partial writes

After my commit, tcp_sendmsg() might restart its loop after
processing socket backlog.

If sk_err is set, we blindly return an error, even though we
copied data from user space before.

We should instead return the number of bytes that could be copied,
otherwise user space might resend data and corrupt the stream.
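
The intended return-value rule can be written down in a few lines. This is a sketch in plain Python, not the tcp_sendmsg() code; sendmsg_return is a hypothetical name, and sk_err is modelled as a positive errno value or 0.

```python
import errno


def sendmsg_return(copied, sk_err):
    """Return value of a send call that may have queued some bytes already."""
    if copied > 0:
        return copied  # report progress first; the error can surface later
    return -sk_err if sk_err else 0


print(sendmsg_return(100, errno.EPIPE))  # 100
print(sendmsg_return(0, errno.EPIPE))    # negative errno
```

Returning the byte count whenever something was queued keeps the caller from resending bytes that are already in the stream.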

This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
to process timestamps.

Issue was diagnosed by Soheil and Willem, big kudos to them!

Fixes: d41a69f1d390f ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet
Cc: Willem de Bruijn
Cc: Soheil Hassas Yeganeh
Cc: Yuchung Cheng
Cc: Neal Cardwell
Tested-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-04 04:12:06 +0800
9ee6c5dc8 ipv4: allow local fragmentation in ip_finish_output_gso()

Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
has shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka
Signed-off-by: Lance Richardson
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Lance Richardson
2016-11-04 04:10:26 +0800
ac9e70b17 tcp: fix potential memory corruption

Imagine initial value of max_skb_frags is 17, and last
skb in write queue has 15 frags.

Then max_skb_frags is lowered to 14 or smaller value.

tcp_sendmsg() will then be allowed to add additional page frags
and eventually go past MAX_SKB_FRAGS, overflowing struct
skb_shared_info.
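
The scenario can be replayed with a toy counter. The sketch is plain Python, not kernel code, and it assumes the limit was tested with an equality comparison before the fix; that assumption and the helper names are mine, not taken from the commit message. MAX_SKB_FRAGS is set to 17 to match the example.

```python
MAX_SKB_FRAGS = 17  # size of the frags array in this example


def equality_check(frags, limit):
    return frags == limit  # assumed pre-fix style test


def bounded_check(frags, limit):
    return frags >= limit  # also stops when the limit was lowered below frags


def frags_after_appends(start_frags, limit, appends, limit_reached):
    """Count frags on the last skb after trying `appends` more page frags."""
    frags = start_frags
    for _ in range(appends):
        if limit_reached(frags, limit):
            break  # a new skb would be started instead
        frags += 1
    return frags


print(frags_after_appends(15, 14, 5, equality_check))  # 20
print(frags_after_appends(15, 14, 5, bounded_check))   # 15
```

Starting at 15 frags with the limit lowered to 14, an equality test never fires and the count runs past MAX_SKB_FRAGS, while a greater-or-equal test stops immediately.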

Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags")
Signed-off-by: Eric Dumazet
Cc: Hans Westgaard Ry
Cc: Håkon Bugge
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-04 03:33:30 +0800
14135f30e inet: fix sleeping inside inet_wait_for_connect()

Andrey reported this kernel warning:

WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
__might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
do not call blocking ops when !TASK_RUNNING; state=1 set at
[] prepare_to_wait+0xbc/0x210
kernel/sched/wait.c:178
Modules linked in:
CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000
ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237
ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[] dump_stack+0xb3/0x10f lib/dump_stack.c:51
[] __warn+0x1a7/0x1f0 kernel/panic.c:550
[] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
[] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
[< inline >] slab_pre_alloc_hook mm/slab.h:393
[< inline >] slab_alloc_node mm/slub.c:2634
[< inline >] slab_alloc mm/slub.c:2716
[] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
[] kmemdup+0x24/0x50 mm/util.c:113
[] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
[< inline >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
[< inline >] dccp_feat_change_recv net/dccp/feat.c:1141
[] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
[] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
[] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
[] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
[< inline >] sk_backlog_rcv ./include/net/sock.h:872
[] __release_sock+0x126/0x3a0 net/core/sock.c:2044
[] release_sock+0x59/0x1c0 net/core/sock.c:2502
[< inline >] inet_wait_for_connect net/ipv4/af_inet.c:547
[] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
[] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
[] SYSC_connect+0x244/0x2f0 net/socket.c:1533
[] SyS_connect+0x24/0x30 net/socket.c:1514
[] entry_SYSCALL_64_fastpath+0x1f/0xc2
arch/x86/entry/entry_64.S:209

Unlike commit 26cabd31259ba43f68026ce3f62b78094124333f
("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the
sleeping function is called before schedule_timeout(), so this is indeed
a bug. Fix this by moving the wait logic to the new API, similar
to commit ff960a731788a7408b6f66ec4fd772ff18833211
("netdev, sched/wait: Fix sleeping inside wait event").

Reported-by: Andrey Konovalov
Cc: Andrey Konovalov
Cc: Eric Dumazet
Cc: Peter Zijlstra
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

WANG Cong
2016-11-04 03:18:07 +0800

03 Nov, 2016

1 commit

4fd19c15d ip6_udp_tunnel: remove unused IPCB related codes

Some IPCB fields are currently set in udp_tunnel6_xmit_skb(), which are
never used before it reaches ip6tunnel_xmit(), and past that point the
control buffer is no longer interpreted as IPCB.

This removes the unused IPCB-related code. Currently there is no skb
scrubbing in ip6_udp_tunnel, otherwise IPCB(skb)->opt might need to be
cleared for IPv4 packets, as shown in 5146d1f1511
("tunnel: Clear IPCB(skb)->opt before dst_link_failure called").

Signed-off-by: Eli Cooper
Signed-off-by: David S. Miller

Eli Cooper
2016-11-03 03:18:36 +0800

02 Nov, 2016

1 commit

e7947ea77 unix: escape all null bytes in abstract unix domain socket

An abstract unix domain socket name may embed null characters;
these should be translated to '@' when printed out to
proc the same way the null prefix is currently being
translated.

This helps for tools such as netstat, lsof and the proc
based implementation in ss to show all the significant
bytes of the name (instead of getting cut at the first
null occurrence).
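
The translation described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: escape_abstract_name() is a hypothetical helper, not the kernel function touched by this commit, and the caller is assumed to know the name length and to supply a large enough output buffer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the kernel code): copy a socket name of `len`
 * bytes into `out`, writing '@' for every null byte so that the whole
 * name stays visible when printed. `out` must hold at least len + 1 bytes.
 */
static void escape_abstract_name(const char *name, size_t len, char *out)
{
    size_t i;

    assert(name != NULL && out != NULL);
    for (i = 0; i < len; i++)
        out[i] = name[i] ? name[i] : '@';
    out[len] = '\0'; /* terminate so the result can be printed as a string */
}
```

Because the length is passed explicitly, the loop does not stop at the first null byte, which is the point of the change: a tool reading the printed name sees every significant byte.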

Signed-off-by: Isaac Boukris
Signed-off-by: David S. Miller

Isaac Boukris
2016-11-02 00:15:13 +0800

01 Nov, 2016

10 commits

2c8657d22 Merge tag 'linux-can-fixes-for-4.9-20161031' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can ... Browse Code »

Marc Kleine-Budde says:

====================
pull-request: can 2016-10-31

this is a pull request of two patches for the upcoming v4.9 release.

The first patch is by Lukas Resch for the sja1000 plx_pci driver that adds
support for Moxa CAN devices. The second patch is by Oliver Hartkopp and fixes
a potential kernel panic in the CAN broadcast manager.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2016-11-01 09:01:03 +0800
dae399d7f sctp: hold transport instead of assoc when lookup assoc in rx path ... Browse Code »

Prior to this patch, the rx path had to hold the assoc obtained from
__sctp_lookup_association before calling lock_sock, in case the assoc
was freed/put elsewhere.

But __sctp_lookup_association looks up and holds the transport, gets the
assoc via transport->assoc, then holds the assoc and puts the transport.
This means the transport is not held, yet it is returned and later
assigned directly to chunk->transport.

Without the protection of sock lock, the transport may be freed/put by
other places, which would cause a use-after-free issue.

This patch fixes the issue by holding the transport instead of the assoc.
Holding the transport also makes access to the assoc safe, and since the
assoc is actually looked up by searching the transport rhashtable,
holding the transport here makes more sense.

Note that the function will be renamed later on in another patch.

Signed-off-by: Xin Long
Acked-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

Xin Long
2016-11-01 04:20:33 +0800
7c17fcc72 sctp: return back transport in __sctp_rcv_init_lookup ... Browse Code »

Prior to this patch, __sctp_rcv_init_lookup() saved the transport looked
up by __sctp_lookup_association() in a local variable and did not return
it. But sctp_rcv uses it to initialize chunk->transport. So on this path,
even if the transport was found, chunk->transport was still initialized
with null.

This patch returns the transport through the transport pointer passed
in from __sctp_rcv_lookup_harder().

Signed-off-by: Xin Long
Acked-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

Xin Long
2016-11-01 04:20:32 +0800
cd26da4ff sctp: hold transport instead of assoc in sctp_diag ... Browse Code »

In sctp_transport_lookup_process(), commit 1cceda784980 ("sctp: fix
the issue sctp_diag uses lock_sock in rcu_read_lock") moved cb() out
of the rcu lock, but it puts the transport and holds the assoc instead,
ignoring that cb() still uses the transport. This may cause a
use-after-free issue.

This patch holds the transport instead of the assoc there.

Fixes: 1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock")
Signed-off-by: Xin Long
Acked-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

Xin Long
2016-11-01 04:20:32 +0800
deb507f91 can: bcm: fix warning in bcm_connect/proc_register ... Browse Code »

Andrey Konovalov reported an issue with proc_register in bcm.c.
As suggested by Cong Wang this patch adds a lock_sock() protection and
a check for unsuccessful proc_create_data() in bcm_connect().

Reference: http://marc.info/?l=linux-netdev&m=147732648731237

Reported-by: Andrey Konovalov
Suggested-by: Cong Wang
Signed-off-by: Oliver Hartkopp
Acked-by: Cong Wang
Tested-by: Andrey Konovalov
Cc: linux-stable
Signed-off-by: Marc Kleine-Budde

Oliver Hartkopp
2016-11-01 03:48:19 +0800
4f2e4ad56 net: mangle zero checksum in skb_checksum_help() ... Browse Code »

Sending zero checksum is ok for TCP, but not for UDP.

UDPv6 receiver should by default drop a frame with a 0 checksum,
and UDPv4 would not verify the checksum and might accept a corrupted
packet.

Simply replace such checksum by 0xffff, regardless of transport.

This error was caught on SIT tunnels, but seems generic.
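
A minimal userspace sketch of the replacement rule described above, not the kernel's skb_checksum_help() itself. The helper names csum_fold32() and checksum_no_zero() are hypothetical, and the buffer is assumed to be at most 64 KiB so the 32-bit accumulator cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement sum into 16 bits and invert it. */
static uint16_t csum_fold32(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Hypothetical helper: Internet checksum over `len` bytes, read as
 * big-endian 16-bit words with an odd trailing byte padded by zero.
 * A computed value of 0 is replaced by 0xffff, regardless of transport.
 * Assumes len <= 65536 so that `sum` cannot overflow.
 */
static uint16_t checksum_no_zero(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    uint16_t csum;
    size_t i;

    assert(data != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;

    csum = csum_fold32(sum);
    return csum ? csum : 0xffff;
}
```

In one's-complement arithmetic 0x0000 and 0xffff both represent zero, so the substitution does not change what a verifying receiver computes, while it avoids the value that UDP reserves for "no checksum".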

Signed-off-by: Eric Dumazet
Cc: Maciej Żenczykowski
Cc: Willem de Bruijn
Acked-by: Maciej Żenczykowski
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-01 03:29:11 +0800
e551c32d5 net: clear sk_err_soft in sk_clone_lock() ... Browse Code »

At accept() time, it is possible the parent has a non zero
sk_err_soft, leftover from a prior error.

Make sure we do not leave this value in the child, as it
makes future getsockopt(SO_ERROR) calls quite unreliable.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-01 03:25:55 +0800
ce6dd2332 dctcp: avoid bogus doubling of cwnd after loss ... Browse Code »

If a congestion control module doesn't provide .undo_cwnd function,
tcp_undo_cwnd_reduction() will set cwnd to

tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1);

... which makes sense for reno (it sets ssthresh to half the current cwnd),
but it makes no sense for dctcp, which sets ssthresh based on the current
congestion estimate.

This can cause severe growth of cwnd (eventually overflowing u32).

Fix this by saving the last cwnd on loss and restoring cwnd based on that,
similar to cubic and other algorithms.

Fixes: e3118e8359bb7c ("net: tcp: add DCTCP congestion control algorithm")
Cc: Lawrence Brakmo
Cc: Andrew Shewmaker
Cc: Glenn Judd
Acked-by: Daniel Borkmann
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2016-11-01 03:16:28 +0800
19bda36c4 ipv6: add mtu lock check in __ip6_rt_update_pmtu ... Browse Code »

Prior to this patch, ipv6 did not check the mtu lock in ip6_update_pmtu.
As a result, the mtu lock did not really work when an
ICMPV6_PKT_TOOBIG packet was received.

This patch adds an mtu lock check in __ip6_rt_update_pmtu, just as ipv4
does in __ip_rt_update_pmtu.

Acked-by: Hannes Frederic Sowa
Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2016-11-01 02:24:24 +0800
f89c56ce7 ipv6: Don't use ufo handling on later transformed packets ... Browse Code »

Similar to commit c146066ab802 ("ipv4: Don't use ufo handling on later
transformed packets"), don't perform UFO on packets that will be IPsec
transformed. To detect it we rely on the fact that headerlen in
dst_entry is non-zero only for transformation bundles (xfrm_dst
objects).

Unwanted segmentation can be observed with a NETIF_F_UFO capable device,
such as a dummy device:

DEV=dum0 LEN=1493

ip li add $DEV type dummy
ip addr add fc00::1/64 dev $DEV nodad
ip link set $DEV up
ip xfrm policy add dir out src fc00::1 dst fc00::2 \
tmpl src fc00::1 dst fc00::2 proto esp spi 1
ip xfrm state add src fc00::1 dst fc00::2 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b

tcpdump -n -nn -i $DEV -t &
socat /dev/zero,readbytes=$LEN udp6:[fc00::2]:$LEN

tcpdump output before:

IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
IP6 fc00::1 > fc00::2: frag (1448|48)
IP6 fc00::1 > fc00::2: ESP(spi=0x00000001,seq=0x2), length 88

... and after:

IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
IP6 fc00::1 > fc00::2: frag (1448|80)

Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")

Signed-off-by: Jakub Sitnicki
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Jakub Sitnicki
2016-11-01 01:10:41 +0800

31 Oct, 2016

1 commit

b73b8a1ba netfilter: nft_dup: do not use sreg_dev if the user doesn't specify it ... Browse Code »

The NFTA_DUP_SREG_DEV attribute is optional, so we should use it
in the routing lookup only when the user specifies it.

Fixes: d877f07112f1 ("netfilter: nf_tables: add nft_dup expression")
Signed-off-by: Liping Zhang
Signed-off-by: Pablo Neira Ayuso

Liping Zhang
2016-10-31 20:17:38 +0800