Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

19 May, 2016

5 commits

1c76c5d5f net/route: enforce hoplimit max value

[ Upstream commit 626abd59e51d4d8c6367e03aae252a8aa759ac78 ]

Currently, when creating or updating a route, no check is performed
on the hoplimit value in either the ipv4 or the ipv6 code.

The caller can, e.g., set hoplimit to 256, and when such a route is
used, packets will be sent with hoplimit/ttl equal to 0.

This commit adds checks for the RTAX_HOPLIMIT value, in both the ipv4
and ipv6 route code, substituting any value greater than 255 with 255.

This is consistent with what is currently done for ADVMSS and MTU
in the ipv4 code.
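
The wrap-around and the clamp can be modelled in user space. The following is a minimal Python sketch, not the kernel code: the helper names `ttl_on_wire` and `clamp_hoplimit` are hypothetical, and the truncation assumes the metric ends up in the one-byte TTL / hop limit header field.

```python
HOPLIMIT_MAX = 255  # largest value a one-byte TTL / hop limit field can hold


def ttl_on_wire(hoplimit: int) -> int:
    """Model storing a route metric into a one-byte header field."""
    return hoplimit & 0xFF


def clamp_hoplimit(hoplimit: int) -> int:
    """Substitute any value greater than 255 with 255, as the commit describes."""
    return min(hoplimit, HOPLIMIT_MAX)


if __name__ == "__main__":
    print(ttl_on_wire(256))                  # unchecked value wraps to 0
    print(ttl_on_wire(clamp_hoplimit(256)))  # clamped value stays at 255
```

Values that already fit in one byte pass through `clamp_hoplimit` unchanged, so the check only affects out-of-range requests.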

Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Paolo Abeni
2016-05-19 08:06:43 +0800
2cddc95ad tcp: refresh skb timestamp at retransmit time

[ Upstream commit 10a81980fc47e64ffac26a073139813d3f697b64 ]

In the very unlikely case __tcp_retransmit_skb() can not use the cloning
done in tcp_transmit_skb(), we need to refresh skb_mstamp before doing
the copy and transmit, otherwise TCP TS val will be an exact copy of
original transmit.

Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet
Cc: Yuchung Cheng
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2016-05-19 08:06:43 +0800
c98578079 gre: do not pull header in ICMP error processing

[ Upstream commit b7f8fe251e4609e2a437bd2c2dea01e61db6849c ]

iptunnel_pull_header expects that IP header was already pulled; with this
expectation, it pulls the tunnel header. This is not true in gre_err.
Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points
to the IP header.

We cannot pull the tunnel header in this path. It's just a matter of not
calling iptunnel_pull_header - we don't need any of its effects.

Fixes: bda7bb463436 ("gre: Allow multiple protocol listener for gre protocol.")
Signed-off-by: Jiri Benc
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Jiri Benc
2016-05-19 08:06:39 +0800
063318504 ipv4/fib: don't warn when primary address is missing if in_dev is dead

[ Upstream commit 391a20333b8393ef2e13014e6e59d192c5594471 ]

After commit fbd40ea0180a ("ipv4: Don't do expensive useless work
during inetdev destroy.") when deleting an interface,
fib_del_ifaddr() can be executed without any primary address
present on the dead interface.

The above is safe, but triggers some "bug: prim == NULL" warnings.

This commit avoids the warning if the in_dev is dead.

Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Paolo Abeni
2016-05-19 08:06:36 +0800
d0bfda58b route: do not cache fib route info on local routes with oif

[ Upstream commit d6d5e999e5df67f8ec20b6be45e2229455ee3699 ]

For local routes that require a particular output interface we do not want
to cache the result. Caching the result causes incorrect behaviour when
there are multiple source addresses on the interface. The end result
being that if the intended recipient is waiting on that interface for the
packet he won't receive it because it will be delivered on the loopback
interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
interface as well.

This can be tested by running a program such as "dhcp_release" which
attempts to inject a packet on a particular interface so that it is
received by another program on the same board. The receiving process
should see an IP_PKTINFO ipi_ifindex value of the source interface
(e.g., eth1) instead of the loopback interface (e.g., lo). The packet
will still appear on the loopback interface in tcpdump but the important
aspect is that the CMSG info is correct.

Sample dhcp_release command line:

dhcp_release eth1 192.168.204.222 02:11:33:22:44:66

Signed-off-by: Allain Legacy
Signed-off-by: Chris Friesen
Reviewed-by: Julian Anastasov
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Chris Friesen
2016-05-19 08:06:35 +0800

20 Apr, 2016

9 commits

80de2e411 ipv4: initialize flowi4_flags before calling fib_lookup()

[ Upstream commit 4cfc86f3dae6ca38ed49cdd78f458a03d4d87992 ]

Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
before calling fib_lookup(), which means fib_table_lookup() is
using non-deterministic data at this line:

if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {

Fix by initializing the entire fl4 structure, which will prevent
similar issues as fields are added in the future by ensuring that
all fields are initialized to zero unless explicitly initialized
to another value.

Fixes: 58189ca7b2741 ("net: Fix vti use case with oif in dst lookups")
Suggested-by: David Ahern
Signed-off-by: Lance Richardson
Acked-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Lance Richardson
2016-04-20 14:42:05 +0800
2ddb18139 ipv4: fix broadcast packets reception

[ Upstream commit ad0ea1989cc4d5905941d0a9e62c63ad6d859cef ]

Currently, ingress ipv4 broadcast datagrams are dropped since,
in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on
bcast packets.

This patch addresses the issue, invoking ip_check_mc_rcu()
only for mcast packets.
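
The difference between the two behaviours can be illustrated with a loose user-space model. This Python sketch is an assumption-laden illustration, not the kernel logic: `accepts_before_fix`, `accepts_after_fix`, and the `joined_groups` set are hypothetical names, and the group-membership test merely stands in for what ip_check_mc_rcu() decides.

```python
import ipaddress


def is_multicast(daddr: str) -> bool:
    """True for addresses in 224.0.0.0/4."""
    return ipaddress.IPv4Address(daddr).is_multicast


def accepts_before_fix(daddr: str, joined_groups: set) -> bool:
    """Membership check applied to every non-unicast datagram."""
    return daddr in joined_groups


def accepts_after_fix(daddr: str, joined_groups: set) -> bool:
    """Membership check applied only to multicast destinations."""
    if is_multicast(daddr):
        return daddr in joined_groups
    return True  # broadcast datagrams skip the multicast check


if __name__ == "__main__":
    groups = {"224.0.0.251"}
    print(accepts_before_fix("255.255.255.255", groups))
    print(accepts_after_fix("255.255.255.255", groups))
```

A broadcast address is never a member of a multicast group, which is why applying the membership check to it always fails in this model.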

Fixes: 6e5403093261 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()")
Signed-off-by: Paolo Abeni
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Paolo Abeni
2016-04-20 14:42:05 +0800
bd33d14ac tcp/dccp: remove obsolete WARN_ON() in icmp handlers

[ Upstream commit e316ea62e3203d524ff0239a40c56d3a39ad1b5c ]

Now SYN_RECV request sockets are installed in ehash table, an ICMP
handler can find a request socket while another cpu handles an incoming
packet transforming this SYN_RECV request socket into an ESTABLISHED
socket.

We need to remove the now obsolete WARN_ON(req->sk), since req->sk
is set when a new child is created and added into listener accept queue.

If this race happens, the ICMP will do nothing special.

Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet
Reported-by: Ben Lazarus
Reported-by: Neal Cardwell
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2016-04-20 14:42:04 +0800
547897599 ipv4: Don't do expensive useless work during inetdev destroy.

[ Upstream commit fbd40ea0180a2d328c5adc61414dc8bab9335ce2 ]

When an inetdev is destroyed, every address assigned to the interface
is removed. And in this scenario we do two pointless things which can
be very expensive if the number of assigned interfaces is large:

1) Address promotion. We are deleting all addresses, so there is no
point in doing this.

2) A full nf conntrack table purge for every address. We only need to
do this once, as is already caught by the existing
masq_dev_notifier so masq_inet_event() can skip this.

Reported-by: Solar Designer
Signed-off-by: David S. Miller
Tested-by: Cyrill Gorcunov
Signed-off-by: Greg Kroah-Hartman

David S. Miller
2016-04-20 14:42:03 +0800
36b9c7cc0 tcp: fix tcpi_segs_in after connection establishment

[ Upstream commit a9d99ce28ed359d68cf6f3c1a69038aefedf6d6a ]

If the final packet (ACK) of the 3WHS is lost, it appears we do not properly
account the following incoming segment into tcpi_segs_in.

While we are at it, start segs_in with one, to count the SYN packet.

We do not yet count number of SYN we received for a request sock, we
might add this someday.

packetdrill script showing proper behavior after fix :

// Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792
+0 > S. 0:0(0) ack 1
+.020 < P. 1:1001(1000) ack 1 win 32792

+0 accept(3, ..., ...) = 4

+.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%

Fixes: 2efd055c53c06 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2016-04-20 14:42:00 +0800
d9bbdcd83 mld, igmp: Fix reserved tailroom calculation

[ Upstream commit 1837b2e2bcd23137766555a63867e649c0b637f0 ]

The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data____________|__tlen___|__extra__]
^                                               ^
head                                            skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data____________|__extra__|__tlen___]
^                                               ^
head                                            skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
  = data + extra + tlen - min(mtu, data + extra)
  = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
  = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
  reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
  if mtu < skb_tailroom - tlen:
      reserved_tailroom = skb_tailroom - mtu
  else:
      reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.
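
The two expressions from the derivation can be compared numerically. The sketch below is plain Python, not kernel code; the function names are hypothetical and the sizes used in the demo are made-up values chosen only to show the effect of ignoring hlen and tlen.

```python
def reserved_tailroom_old(skb_end_offset: int, mtu: int) -> int:
    """The expression quoted as the current code: ignores hlen and tlen."""
    return skb_end_offset - min(mtu, skb_end_offset)


def reserved_tailroom_new(skb_end_offset: int, hlen: int, tlen: int, mtu: int) -> int:
    """Third line of the derivation, evaluated after skb_reserve(hlen)."""
    skb_tailroom = skb_end_offset - hlen
    return skb_tailroom - min(mtu, skb_tailroom - tlen)


if __name__ == "__main__":
    # Made-up sizes: a 1920-byte buffer, 16 bytes of headroom, 8 of tailroom.
    print(reserved_tailroom_old(1920, 9000))          # 0: less than tlen
    print(reserved_tailroom_new(1920, 16, 8, 9000))   # 8: exactly tlen
    print(reserved_tailroom_new(1920, 16, 8, 1500))   # 404
```

With a large MTU the old expression reserves nothing, which is less than the 8 bytes of tailroom assumed here, while the corrected expression never drops below tlen.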

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier
Acked-by: Hannes Frederic Sowa
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Benjamin Poirier
2016-04-20 14:41:58 +0800
e948c9ade ipv4: only create late gso-skb if skb is already set up with CHECKSUM_PARTIAL

[ Upstream commit a8c4a2522a0808c5c2143612909717d1115c40cf ]

Otherwise we break the contract with GSO to only pass CHECKSUM_PARTIAL
skbs down. This can easily happen with UDP+IPv4 sockets with the first
MSG_MORE write smaller than the MTU, second write is a sendfile.

Returning -EOPNOTSUPP lets the callers fall back into normal sendmsg path,
where we calculate the checksum manually during copying.

Commit d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked
sockets") started to expose this bug.

Fixes: d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")
Reported-by: Jiri Benc
Cc: Jiri Benc
Reported-by: Wakko Warner
Cc: Wakko Warner
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Hannes Frederic Sowa
2016-04-20 14:41:57 +0800
207485dc4 tunnel: Clear IPCB(skb)->opt before dst_link_failure called

[ Upstream commit 5146d1f151122e868e594c7b45115d64825aee5f ]

IPCB may contain data from previous layers (in the observed case the
qdisc layer). In the observed scenario, the data was misinterpreted as
ip header options, which later caused the ihl to be set to an invalid
value.

This patch clears IPCB(skb)->opt before dst_link_failure can be called for
various types of tunnels. This change only applies to encapsulated ipv4
packets.

The code introduced in 11c21a30 which clears all of IPCB has been removed
to be consistent with these changes, and instead the opt field is cleared
unconditionally in ip_tunnel_xmit. The change in ip_tunnel_xmit applies to
SIT, GRE, and IPIP tunnels.

The relevant vti, l2tp, and pptp functions already contain similar code for
clearing the IPCB.

Signed-off-by: Bernie Harris
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Bernie Harris
2016-04-20 14:41:56 +0800
d5322b91e tcp: convert cached rtt from usec to jiffies when feeding initial rto

[ Upstream commit 9bdfb3b79e61c60e1a3e2dc05ad164528afa6b8a ]

Currently the cached rtt is converted into msecs, so only kernels built
with HZ=1000 (where one jiffy is one msec) are unaffected.
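
The unit mismatch can be shown with a little arithmetic. This is a Python sketch under stated assumptions, not the kernel helpers: `usecs_to_jiffies` here is a simplified model that assumes HZ divides one second evenly, and `msecs_used_as_jiffies` is a hypothetical name for the mistaken conversion.

```python
USEC_PER_SEC = 1_000_000
USEC_PER_MSEC = 1_000


def usecs_to_jiffies(usecs: int, hz: int) -> int:
    """Round up to whole ticks; assumes hz divides USEC_PER_SEC evenly."""
    usecs_per_jiffy = USEC_PER_SEC // hz
    return (usecs + usecs_per_jiffy - 1) // usecs_per_jiffy


def msecs_used_as_jiffies(usecs: int) -> int:
    """Model of the bug: a millisecond count is treated as a tick count."""
    return usecs // USEC_PER_MSEC


if __name__ == "__main__":
    rtt_us = 100_000  # a 100 ms cached RTT
    for hz in (100, 250, 1000):
        print(hz, usecs_to_jiffies(rtt_us, hz), msecs_used_as_jiffies(rtt_us))
```

At HZ=250 the mistaken value is 100 ticks, which corresponds to 400 ms instead of the intended 100 ms; at HZ=1000 both conversions agree.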

Signed-off-by: Konstantin Khlebnikov
Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Konstantin Khlebnikov
2016-04-20 14:41:56 +0800

04 Mar, 2016

10 commits

b7c2e2acc rtnl: RTM_GETNETCONF: fix wrong return value

[ Upstream commit a97eb33ff225f34a8124774b3373fd244f0e83ce ]

An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.
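
Netlink reports failures as a negative errno in the error field of struct nlmsgerr, with zero meaning an acknowledgement. The Python sketch below shows why a positive value is misleading to a receiver that follows this convention; `describe_nlmsgerr` is a hypothetical helper, not part of any netlink library.

```python
import errno
import os


def describe_nlmsgerr(error: int) -> str:
    """Interpret the error field of struct nlmsgerr as userspace usually does."""
    if error == 0:
        return "ack"
    if error < 0:
        return "error: " + os.strerror(-error)
    # A positive value does not follow the convention and cannot be trusted.
    return "unexpected positive value %d" % error


if __name__ == "__main__":
    print(describe_nlmsgerr(-errno.EINVAL))
    print(describe_nlmsgerr(errno.EINVAL))
```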

Signed-off-by: Anton Protopopov
Acked-by: Cong Wang
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Anton Protopopov
2016-03-04 07:07:07 +0800
9653359eb tcp/dccp: fix another race at listener dismantle

[ Upstream commit 7716682cc58e305e22207d5bb315f26af6b1e243 ]

Ilya reported following lockdep splat:

kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0: (rcu_read_lock){......}, at: []
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1: (rcu_read_lock){......}, at: []
ip_local_deliver_finish+0x3f/0x380
kernel: #2: (slock-AF_INET){+.-...}, at: []
sk_clone_lock+0x19b/0x440
kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[] inet_csk_reqsk_queue_add+0x28/0xa0

To properly fix this issue, inet_csk_reqsk_queue_add() needs
to report to its callers whether the child has been queued
into the accept queue.

We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.

Reported-by: Ilya Dryomov
Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2016-03-04 07:07:07 +0800
54d77a220 route: check and remove route cache when we get route

[ Upstream commit deed49df7390d5239024199e249190328f1651e7 ]

Since the gc of ipv4 routes was removed, a cached route has no chance
of being removed, and even after it has timed out it can still be used,
because no code checks its expiry.

Fix this issue by checking and removing route cache when we get route.
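
The idea of checking expiry at lookup time, rather than relying on a garbage collector, can be sketched in a few lines. This Python model is purely illustrative: the `RouteCache` class and its methods are hypothetical and say nothing about how the kernel stores cached routes.

```python
import time


class RouteCache:
    """Toy cache keyed by destination; entries carry an absolute expiry time."""

    def __init__(self):
        self._entries = {}

    def add(self, daddr, nexthop, ttl, now=None):
        now = time.monotonic() if now is None else now
        self._entries[daddr] = (nexthop, now + ttl)

    def lookup(self, daddr, now=None):
        """Return the cached next hop, dropping the entry if it has expired."""
        now = time.monotonic() if now is None else now
        entry = self._entries.get(daddr)
        if entry is None:
            return None
        nexthop, expires = entry
        if now >= expires:
            del self._entries[daddr]  # expired: remove instead of reusing it
            return None
        return nexthop
```

Because the check runs on every lookup, a stale entry is never handed out even though nothing sweeps the cache in the background.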

Signed-off-by: Xin Long
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Xin Long
2016-03-04 07:07:07 +0800
a4b84d5ef tcp: md5: release request socket instead of listener

[ Upstream commit 729235554d805c63e5e274fcc6a98e71015dd847 ]

If tcp_v4_inbound_md5_hash() returns an error, we must release
the refcount on the request socket, not on the listener.

The bug was added for IPv4 only.

Fixes: 079096f103fac ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2016-03-04 07:07:06 +0800
6b567a1ab ipv4: fix memory leaks in ip_cmsg_send() callers

[ Upstream commit 919483096bfe75dda338e98d56da91a263746a0a ]

Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.

Callers are responsible for the freeing.

Many thanks to Dmitry for the report and diagnostic.

Reported-by: Dmitry Vyukov
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2016-03-04 07:07:06 +0800
1bec5f406 net:Add sysctl_max_skb_frags

[ Upstream commit 5f74f82ea34c0da80ea0b49192bb5ea06e063593 ]

Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.

Signed-off-by: Hans Westgaard Ry
Reviewed-by: Håkon Bugge
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Hans Westgaard Ry
2016-03-04 07:07:05 +0800
2679161c7 tcp: do not drop syn_recv on all icmp reports

[ Upstream commit 9cf7490360bf2c46a16b7525f899e4970c5fc144 ]

Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.

This is of course incorrect.

A specific list of ICMP messages should be able to drop a SYN_RECV.

For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2016-03-04 07:07:05 +0800
87e40d8d8 tcp: beware of alignments in tcp_get_info()

[ Upstream commit ff5d749772018602c47509bdc0093ff72acd82ec ]

With some combinations of user-provided flags in a netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-byte
aligned.

It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.

Current iproute2 package does not trigger this particular issue.

Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ecf82e ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2016-03-04 07:07:05 +0800
649dc6c32 inet: frag: Always orphan skbs inside ip_defrag()

[ Upstream commit 8282f27449bf15548cb82c77b6e04ee0ab827bdc ]

Later parts of the stack (including fragmentation) expect that there is
never a socket attached to frag in a frag_list, however this invariant
was not enforced on all defrag paths. This could lead to the
BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
end of this commit message.

While the call could be added to openvswitch to fix this particular
error, the head and tail of the frags list are already orphaned
indirectly inside ip_defrag(), so it seems like the remaining fragments
should all be orphaned in all circumstances.

kernel BUG at net/ipv4/ip_output.c:586!
[...]
Call Trace:

[] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
[] ovs_fragment+0xcc/0x214 [openvswitch]
[] ? dst_discard_out+0x20/0x20
[] ? dst_ifdown+0x80/0x80
[] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
[] ? mod_timer_pending+0x65/0x210
[] ? __lock_acquire+0x3db/0x1b90
[] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
[] ? __lock_is_held+0x54/0x70
[] do_output.isra.29+0xe3/0x1b0 [openvswitch]
[] do_execute_actions+0xe11/0x11f0 [openvswitch]
[] ? __lock_is_held+0x54/0x70
[] ovs_execute_actions+0x32/0xd0 [openvswitch]
[] ovs_dp_process_packet+0x85/0x140 [openvswitch]
[] ? __lock_is_held+0x54/0x70
[] ovs_execute_actions+0xb2/0xd0 [openvswitch]
[] ovs_dp_process_packet+0x85/0x140 [openvswitch]
[] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
[] ovs_vport_receive+0x5d/0xa0 [openvswitch]
[] ? __lock_acquire+0x3db/0x1b90
[] ? __lock_acquire+0x3db/0x1b90
[] ? __lock_acquire+0x3db/0x1b90
[] ? internal_dev_xmit+0x5/0x140 [openvswitch]
[] internal_dev_xmit+0x6c/0x140 [openvswitch]
[] ? internal_dev_xmit+0x5/0x140 [openvswitch]
[] dev_hard_start_xmit+0x2b9/0x5e0
[] ? netif_skb_features+0xd1/0x1f0
[] __dev_queue_xmit+0x800/0x930
[] ? __dev_queue_xmit+0x50/0x930
[] ? mark_held_locks+0x71/0x90
[] ? neigh_resolve_output+0x106/0x220
[] dev_queue_xmit+0x10/0x20
[] neigh_resolve_output+0x178/0x220
[] ? ip_finish_output2+0x1ff/0x590
[] ip_finish_output2+0x1ff/0x590
[] ? ip_finish_output2+0x7e/0x590
[] ip_do_fragment+0x831/0x8a0
[] ? ip_copy_metadata+0x1b0/0x1b0
[] ip_fragment.constprop.49+0x43/0x80
[] ip_finish_output+0x17c/0x340
[] ? nf_hook_slow+0xe4/0x190
[] ip_output+0x70/0x110
[] ? ip_fragment.constprop.49+0x80/0x80
[] ip_local_out+0x39/0x70
[] ip_send_skb+0x19/0x40
[] ip_push_pending_frames+0x33/0x40
[] icmp_push_reply+0xea/0x120
[] icmp_reply.constprop.23+0x1ed/0x230
[] icmp_echo.part.21+0x4e/0x50
[] ? __lock_is_held+0x54/0x70
[] ? rcu_read_lock_held+0x5e/0x70
[] icmp_echo+0x36/0x70
[] icmp_rcv+0x271/0x450
[] ip_local_deliver_finish+0x127/0x3a0
[] ? ip_local_deliver_finish+0x41/0x3a0
[] ip_local_deliver+0x60/0xd0
[] ? ip_rcv_finish+0x560/0x560
[] ip_rcv_finish+0xdd/0x560
[] ip_rcv+0x283/0x3e0
[] ? match_held_lock+0x192/0x200
[] ? inet_del_offload+0x40/0x40
[] __netif_receive_skb_core+0x392/0xae0
[] ? process_backlog+0x8e/0x230
[] ? mark_held_locks+0x71/0x90
[] __netif_receive_skb+0x18/0x60
[] process_backlog+0x78/0x230
[] ? process_backlog+0xdd/0x230
[] net_rx_action+0x155/0x400
[] __do_softirq+0xcc/0x420
[] ? ip_finish_output2+0x217/0x590
[] do_softirq_own_stack+0x1c/0x30

[] do_softirq+0x4e/0x60
[] __local_bh_enable_ip+0xa8/0xb0
[] ip_finish_output2+0x240/0x590
[] ? ip_do_fragment+0x831/0x8a0
[] ip_do_fragment+0x831/0x8a0
[] ? ip_copy_metadata+0x1b0/0x1b0
[] ip_fragment.constprop.49+0x43/0x80
[] ip_finish_output+0x17c/0x340
[] ? nf_hook_slow+0xe4/0x190
[] ip_output+0x70/0x110
[] ? ip_fragment.constprop.49+0x80/0x80
[] ip_local_out+0x39/0x70
[] ip_send_skb+0x19/0x40
[] ip_push_pending_frames+0x33/0x40
[] raw_sendmsg+0x7d3/0xc30
[] ? __lock_acquire+0x3db/0x1b90
[] ? inet_sendmsg+0xc7/0x1d0
[] ? __lock_is_held+0x54/0x70
[] inet_sendmsg+0x10a/0x1d0
[] ? inet_sendmsg+0x5/0x1d0
[] sock_sendmsg+0x38/0x50
[] ___sys_sendmsg+0x25f/0x270
[] ? handle_mm_fault+0x8dd/0x1320
[] ? _raw_spin_unlock+0x27/0x40
[] ? __do_page_fault+0x1e2/0x460
[] ? __fget_light+0x66/0x90
[] __sys_sendmsg+0x42/0x80
[] SyS_sendmsg+0x12/0x20
[] entry_SYSCALL_64_fastpath+0x12/0x6f
Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f
66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
RIP [] ip_do_fragment+0x892/0x8a0
RSP

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Signed-off-by: Joe Stringer
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Joe Stringer
2016-03-04 07:07:04 +0800
e5abc10d1 tcp: fix NULL deref in tcp_v4_send_ack()

[ Upstream commit e62a123b8ef7c5dc4db2c16383d506860ad21b47 ]

Neal reported crashes with this stack trace :

RIP: 0010:[] tcp_v4_send_ack+0x41/0x20f
...
CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0
...
[] tcp_v4_reqsk_send_ack+0xa5/0xb4
[] tcp_check_req+0x2ea/0x3e0
[] tcp_rcv_state_process+0x850/0x2500
[] tcp_v4_do_rcv+0x141/0x330
[] sk_backlog_rcv+0x21/0x30
[] tcp_recvmsg+0x75d/0xf90
[] inet_recvmsg+0x80/0xa0
[] sock_aio_read+0xee/0x110
[] do_sync_read+0x6f/0xa0
[] SyS_read+0x1e1/0x290
[] system_call_fastpath+0x16/0x1b

The problem here is the skb we provide to tcp_v4_send_ack() had to
be parked in the backlog of a new TCP fastopen child because this child
was owned by the user at the time an out of window packet arrived.

Before queuing a packet, TCP has to set skb->dev to NULL as the device
could disappear before packet is removed from the queue.

Fix this issue by using the net pointer provided by the socket (being a
timewait or a request socket).

IPv6 is immune to the bug : tcp_v6_send_response() already gets the net
pointer from the socket if provided.

Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Reported-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Cc: Jerry Chu
Cc: Yuchung Cheng
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2016-03-04 07:07:04 +0800

01 Feb, 2016

3 commits

1f0bdf609 net: preserve IP control block during GSO segmentation

[ Upstream commit 9207f9d45b0ad071baa128e846d7e7ed85016df3 ]

skb_gso_segment() uses the skb control block during segmentation.
This patch adds 32 bytes of room for the previous control block, which
will be copied into all resulting segments.

This patch fixes a kernel crash when fragmenting forwarded packets.
Fragmentation requires a valid IP CB in the skb for clearing ip options.
The patch also removes the custom save/restore in the ovs code, which is
now redundant.

Signed-off-by: Konstantin Khlebnikov
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Konstantin Khlebnikov
2016-02-01 03:29:00 +0800
b856abbb4 udp: disallow UFO for sockets with SO_NO_CHECK option

[ Upstream commit 40ba330227ad00b8c0cdf2f425736ff9549cc423 ]

Commit acf8dd0a9d0b ("udp: only allow UFO for packets from SOCK_DGRAM
sockets") disallows UFO for packets sent from raw sockets. We need to do
the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if
for a bit different reason: while such socket would override the
CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
bad offloading flags warning is triggered in __skb_gso_segment().

In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.

Signed-off-by: Michal Kubecek
Tested-by: Shannon Nelson
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Michal Kubeček
2016-02-01 03:29:00 +0800
5f15ae009 tcp_yeah: don't set ssthresh below 2

[ Upstream commit 83d15e70c4d8909d722c0d64747d8fb42e38a48f ]

For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
and CUBIC, per RFC 5681 (equation 4).

tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh
value if the intended reduction is as big or bigger than the current
cwnd. Congestion control modules should never return a zero or
negative ssthresh. A zero ssthresh generally results in a zero cwnd,
causing the connection to stall. A negative ssthresh value will be
interpreted as a u32 and will set a target cwnd for PRR near 4
billion.
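
Both failure modes, and the floor that prevents them, can be reproduced with integer arithmetic. This is a Python sketch, not the tcp_yeah code: the function names are hypothetical and `reduction` stands in for whatever amount the module intends to subtract from cwnd.

```python
U32_MASK = 0xFFFFFFFF


def ssthresh_unfloored(cwnd: int, reduction: int) -> int:
    """Model of the old behaviour: the result may be zero or negative."""
    return cwnd - reduction


def ssthresh_floored(cwnd: int, reduction: int) -> int:
    """Apply the floor of 2 described in the commit."""
    return max(cwnd - reduction, 2)


def as_u32(value: int) -> int:
    """Reinterpret a possibly negative integer as an unsigned 32-bit value."""
    return value & U32_MASK


if __name__ == "__main__":
    print(ssthresh_unfloored(4, 4))          # 0: connection would stall
    print(as_u32(ssthresh_unfloored(4, 6)))  # 4294967294: near 4 billion
    print(ssthresh_floored(4, 6))            # 2
```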

Oleksandr Natalenko reported that a system using tcp_yeah with ECN
could see a warning about a prior_cwnd of 0 in
tcp_cwnd_reduction(). Testing verified that this was due to
tcp_yeah_ssthresh() misbehaving in this way.

Reported-by: Oleksandr Natalenko
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Neal Cardwell
2016-02-01 03:28:59 +0800

07 Jan, 2016

1 commit

8b8a321ff tcp: fix zero cwnd in tcp_cwnd_reduction

Patch 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode
conditionally") introduced a bug that cwnd may become 0 when both
inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
to a div-by-zero if the connection starts another cwnd reduction
phase by setting tp->prior_cwnd to the current cwnd (0) in
tcp_init_cwnd_reduction().

To prevent this we skip PRR operation when nothing is acked or
sacked. Then cwnd must be positive in all cases as long as ssthresh
is positive:

1) The proportional reduction mode
inflight > ssthresh > 0

2) The reduction bound mode
a) inflight == ssthresh > 0

b) inflight < ssthresh
sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh

Therefore in all cases inflight and sndcnt can not both be 0.
We check invalid tp->prior_cwnd to avoid potential div0 bugs.
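
The case analysis above can be traced with a simplified model. The Python sketch below is an assumption-heavy illustration rather than the kernel function: `cwnd_after_reduction` is a hypothetical name, the parameters mirror the quantities named in the commit message, and the two branches implement only the proportional and reduction-bound cases as described there.

```python
def cwnd_after_reduction(cwnd, prior_cwnd, ssthresh, inflight,
                         newly_acked_sacked, prr_delivered=0, prr_out=0):
    """Return the new cwnd; simplified model, not the kernel function."""
    # Guard described in the commit: skip when nothing is acked or sacked,
    # and never divide by an invalid prior_cwnd.
    if newly_acked_sacked <= 0 or prior_cwnd <= 0:
        return cwnd
    prr_delivered += newly_acked_sacked
    delta = ssthresh - inflight
    if delta < 0:
        # 1) proportional reduction mode: inflight > ssthresh
        dividend = ssthresh * prr_delivered + prior_cwnd - 1
        sndcnt = dividend // prior_cwnd - prr_out
    else:
        # 2) reduction bound mode: inflight <= ssthresh
        sndcnt = min(delta, newly_acked_sacked)
    sndcnt = max(sndcnt, 0)
    return inflight + sndcnt


if __name__ == "__main__":
    # Nothing acked while inflight is 0: cwnd is left alone instead of becoming 0.
    print(cwnd_after_reduction(10, 10, 5, 0, 0))
```

Without the guard, the last call would compute inflight + sndcnt = 0 + 0, which is the zero cwnd the commit sets out to prevent.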

In reality this bug is triggered only with a sequence of less common
events. For example, the connection is terminating an ECN-triggered
cwnd reduction with an inflight 0, then it receives reordered/old
ACKs or DSACKs from prior transmission (which acks nothing). Or the
connection is in fast recovery stage that marks everything lost,
but fails to retransmit due to local issues, then receives data
packets from other end which acks nothing.

Fixes: 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode conditionally")
Reported-by: Oleksandr Natalenko
Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2016-01-07 05:39:56 +0800

05 Jan, 2016

1 commit

b5bdacf3b net: Propagate lookup failure in l3mdev_get_saddr to caller

Commands run in a vrf context are not failing as expected on a route lookup:
root@kenny:~# ip ro ls table vrf-red
unreachable default

root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
ping: Warning: source address might be selected on device other than vrf-red.
PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.

--- 10.100.1.254 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

Since the vrf table does not have a route for 10.100.1.254 the ping
should have failed. The saddr lookup causes a full VRF table lookup.
Propagating a lookup failure to the user allows the command to fail as
expected:

root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
connect: No route to host

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-01-05 11:58:30 +0800

23 Dec, 2015

1 commit

024f35c55 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2015-12-22

Just one patch to fix dst_entries_init with multiple namespaces.
From Dan Streetman.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-12-23 05:26:31 +0800

19 Dec, 2015

1 commit

6d3c348a6 ipip: ioctl: Remove superfluous IP-TTL handling.

IP-TTL case is already handled in ip_tunnel_ioctl() API.

Signed-off-by: Pravin B Shelar
Signed-off-by: David S. Miller

Pravin B Shelar
2015-12-19 05:07:59 +0800

18 Dec, 2015

1 commit

07e100f98 tcp: restore fastopen with no data in SYN packet

Yuchung tracked a regression caused by commit 57be5bdad759 ("ip: convert
tcp_sendmsg() to iov_iter primitives") for TCP Fast Open.

Some Fast Open users do not actually add any data in the SYN packet.

Fixes: 57be5bdad759 ("ip: convert tcp_sendmsg() to iov_iter primitives")
Reported-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Cc: Al Viro
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-18 04:37:39 +0800

17 Dec, 2015

1 commit

3036facbb fou: clean up socket with kfree_rcu

fou->udp_offloads is managed by RCU. As it is actually included inside
the fou sockets, we cannot let the memory go out of scope before a grace
period. We either can synchronize_rcu or switch over to kfree_rcu to
manage the sockets. kfree_rcu seems appropriate as it is used by vxlan
and geneve.

Fixes: 23461551c00628c ("fou: Support for foo-over-udp RX path")
Cc: Tom Herbert
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Hannes Frederic Sowa
2015-12-17 08:03:02 +0800

15 Dec, 2015

3 commits

5037e9ef9 net: fix IP early demux races

David Wilder reported crashes caused by dst reuse.

I am seeing a crash on a distro V4.2.3 kernel caused by a double
release of a dst_entry. In ipv4_dst_destroy() the call to
list_empty() finds a poisoned next pointer, indicating the dst_entry
has already been removed from the list and freed. The crash occurs
18 to 24 hours into a run of a network stress exerciser.

Thanks to his detailed report and analysis, we were able to understand
the core issue.

IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.

When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.

The problem is that this acquisition simply uses atomic_inc(),
which works well unless the dst was already queued for destruction
because dst_release() noticed the dst refcount going to zero while
DST_NOCACHE was set on the dst.

We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.
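
The "increment only if not already zero" rule can be sketched in userspace C11 as follows. This is a minimal illustration, not the code from the patch: struct obj, obj_hold_safe() and obj_put() are hypothetical names, and C11 atomics stand in for the kernel's atomic primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a dst entry. */
struct obj {
    atomic_int refcnt;
};

/* Take a reference only if the count is still non-zero. */
static inline bool obj_hold_safe(struct obj *o)
{
    int old = atomic_load(&o->refcnt);

    while (old != 0) {
        /* On failure, old is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
            return true;
    }
    return false; /* count already hit zero: object is being destroyed */
}

/* Drop a reference; returns true when the caller released the last one. */
static inline bool obj_put(struct obj *o)
{
    int old = atomic_fetch_sub(&o->refcnt, 1);

    assert(old > 0); /* a put without a matching hold is a bug */
    return old == 1;
}
```

A plain unconditional increment would happily move the count from 0 back to 1 and hand out a reference to an object that is about to be freed; the compare-and-swap loop refuses in that case, so the caller can fall back to not caching the object.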

This patch, being a stable candidate, adds two new helpers, and uses
them only from the problematic IP early demux paths.

It might be possible to merge skb_dst_force() and skb_dst_force_safe()
in net-next, but I prefer having the smallest patch for stable
kernels: some skb_dst_force() callers may not expect that skb->dst
can suddenly be cleared.

Can probably be backported back to linux-3.6 kernels.

Reported-by: David J. Wilder
Tested-by: David J. Wilder
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-15 12:52:00 +0800
79462ad02 net: add validation for the socket syscall protocol argument ... Browse Code »

郭永刚 reported that one could simply crash the kernel as root by
using a simple program:

int socket_fd;
struct sockaddr_in addr;
addr.sin_port = 0;
addr.sin_addr.s_addr = INADDR_ANY;
addr.sin_family = 10;

socket_fd = socket(10,3,0x40000000);
connect(socket_fd , &addr,16);

AF_INET and AF_INET6 sockets actually only support 8-bit protocol
identifiers, and inet_sock's skc_protocol field is sized accordingly.
Larger protocol identifiers therefore simply have their higher bits
cut off, which can store a zero in the protocol fields.

This could lead to e.g. a NULL function pointer dereference: as a
result of the cut-off, inet_num is zero and we call down to
inet_autobind, which is NULL for raw sockets.

kernel: Call Trace:
kernel: [] ? inet_autobind+0x2e/0x70
kernel: [] inet_dgram_connect+0x54/0x80
kernel: [] SYSC_connect+0xd9/0x110
kernel: [] ? ptrace_notify+0x5b/0x80
kernel: [] ? syscall_trace_enter_phase2+0x108/0x200
kernel: [] SyS_connect+0xe/0x10
kernel: [] tracesys_phase2+0x84/0x89
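
The truncation described above, and the kind of range check that prevents it, can be sketched in plain C. This is an illustration under stated assumptions rather than the actual patch: PROTO_LIMIT, store_protocol() and validate_protocol() are hypothetical names chosen for this example.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bound: protocol identifiers must fit in 8 bits. */
#define PROTO_LIMIT 256

/* Mimics storing the protocol argument in an 8-bit field. */
static inline uint8_t store_protocol(int protocol)
{
    return (uint8_t)protocol; /* higher bits are silently dropped */
}

/* Reject values that would not survive the 8-bit store. */
static inline int validate_protocol(int protocol)
{
    if (protocol < 0 || protocol >= PROTO_LIMIT)
        return -EINVAL;
    return 0;
}
```

With the value from the reproducer, store_protocol(0x40000000) yields 0 because only the low 8 bits survive, while validate_protocol(0x40000000) returns -EINVAL before anything is stored.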

I found no particular commit which introduced this problem.

CVE: CVE-2015-8543
Cc: Cong Wang
Reported-by: 郭永刚
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Hannes Frederic Sowa
2015-12-15 05:09:30 +0800
9e5be5bd4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
specifically for nf_tables and nfnetlink_queue, they are:

1) Avoid a compilation warning in nfnetlink_queue that was introduced
in the previous merge window with the simplification of the conntrack
integration, from Arnd Bergmann.

2) nfnetlink_queue is leaking the pernet subsystem registration from
a failure path, patch from Nikolay Borisov.

3) Pass down netns pointer to batch callback in nfnetlink, this is the
largest patch and it is not a bugfix but it is a dependency to
resolve a splat in the correct way.

4) Fix a splat due to incorrect socket memory accounting with nfnetlink
skbuff clones.

5) Add missing conntrack dependencies to NFT_DUP_IPV4 and NFT_DUP_IPV6.

6) Traverse the nftables commit list in reverse order from the commit
path, otherwise we crash when the user applies an incremental update
via 'nft -f' that deletes an object that was just introduced in this
batch, from Xin Long.

Regarding the compilation warning fix, many people have sent us (and
keep sending us) patches to address this, that's why I'm including this
batch even if this is not critical.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-12-15 00:09:01 +0800

14 Dec, 2015

1 commit

7f49e7a38 net: Flush local routes when device changes vrf association ... Browse Code »

The VRF driver cycles netdevs when an interface is enslaved or released:
the down event is used to flush neighbor and route tables and the up
event (if the interface was already up) effectively moves local and
connected routes to the proper table.

As of 4f823defdd5b the local route is left hanging around after a link
down, so when a netdev is moved from one VRF to another (or released
from a VRF altogether) local routes are left in the wrong table.

Fix by handling the NETDEV_CHANGEUPPER event. When the upper dev is
an L3mdev, call fib_disable_ip to flush all routes, local ones
too.

Fixes: 4f823defdd5b ("ipv4: fix to not remove local route on link down")
Cc: Julian Anastasov
Signed-off-by: David Ahern
Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller

David Ahern
2015-12-14 12:58:44 +0800

11 Dec, 2015

1 commit

d3340b79e netfilter: nf_dup: add missing dependencies with NF_CONNTRACK ... Browse Code »

CONFIG_NF_CONNTRACK=m
CONFIG_NF_DUP_IPV4=y

results in:

net/built-in.o: In function `nf_dup_ipv4':
>> (.text+0xd434f): undefined reference to `nf_conntrack_untracked'

Reported-by: kbuild test robot
Signed-off-by: Pablo Neira Ayuso

Pablo Neira Ayuso
2015-12-11 01:17:06 +0800

04 Dec, 2015

1 commit

4eba7bb1d ipv4: igmp: Allow removing groups from a removed interface ... Browse Code »

When a multicast group is joined on a socket, a struct ip_mc_socklist
is appended to the socket's mc_list containing information about the
joined group.

If the interface is hot unplugged, this entry becomes stale. Prior to
commit 52ad353a5344f ("igmp: fix the problem when mc leave group") it
was possible to remove the stale entry by performing an
IP_DROP_MEMBERSHIP, passing either the old ifindex or IP address on
the interface. However, this fix enforces that the interface must
still exist. Thus with time, the number of stale entries grows, until
sysctl_igmp_max_memberships is reached and then it is not possible to
join any more groups.

The previous patch fixes an issue where an IP_DROP_MEMBERSHIP is
performed without specifying the interface, either by ifindex or IP
address. However, here we do supply one of these. So loosen the
restriction on device existence to only apply when the interface has
not been specified. This then restores the ability to clean up the
stale entries.

Signed-off-by: Andrew Lunn
Fixes: 52ad353a5344f ("igmp: fix the problem when mc leave group")
Signed-off-by: David S. Miller

Andrew Lunn
2015-12-04 01:07:05 +0800

02 Dec, 2015

1 commit

9cd3e072b net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA ... Browse Code »

This patch is a cleanup to make the following patch easier to
review.

The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async().

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int nr, struct sock *sk), are added so that
the following patch can change their implementation.
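
The value of routing every flag update through a pair of helpers can be sketched in userspace C. This is not the kernel's code: the _sketch structs and functions are hypothetical, and C11 atomics stand in for the kernel's bit operations. Only the helper signatures (bit number first, socket second) follow the commit message.

```c
#include <stdatomic.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct socket_sketch {
    atomic_ulong flags;
};

struct sock_sketch {
    struct socket_sketch *sk_socket;
};

/* Set bit nr in the flags word that currently holds the async bits. */
static inline void sk_set_bit_sketch(int nr, struct sock_sketch *sk)
{
    atomic_fetch_or(&sk->sk_socket->flags, 1UL << nr);
}

/* Clear bit nr in the same flags word. */
static inline void sk_clear_bit_sketch(int nr, struct sock_sketch *sk)
{
    atomic_fetch_and(&sk->sk_socket->flags, ~(1UL << nr));
}
```

Because callers never touch the flags word directly, a later change can point the two helper bodies at a different structure without editing any call site.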

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-02 04:45:05 +0800