Eric Lee / smarc-fsl-linux-kernel

29 Apr, 2016

5 commits

946b636f1 gre: reject GUE and FOU in collect metadata mode

Collect metadata mode supports neither GUE nor FOU. This might be
implemented later; until then, we should reject such configurations.

I think this is okay to be changed. It's unlikely anyone has such
configuration (as it doesn't work anyway) and we may need a way to
distinguish whether it's supported or not by the kernel later.

For backwards compatibility with iproute2, it's not possible to just check
the attribute presence (iproute2 always includes the attribute), the actual
value has to be checked, too.
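
The rule above (check the value, not just the presence, of the attribute) can be sketched as follows. This is a hypothetical Python model, not the kernel code: the dictionary stands in for the parsed netlink attributes, and the names `ENCAP_NONE`, `encap_type`, and `collect_md` are illustrative assumptions.

```python
# Hypothetical model of the validation rule; all names are illustrative.
ENCAP_NONE = 0  # assumed value meaning "no FOU/GUE encapsulation"


def validate_collect_md(attrs):
    """Return True if the tunnel configuration is acceptable."""
    if not attrs.get("collect_md"):
        return True
    # Presence alone proves nothing: iproute2 always sends the attribute,
    # so only a non-"none" value means FOU/GUE was actually requested.
    return attrs.get("encap_type", ENCAP_NONE) == ENCAP_NONE
```

A configuration that carries the attribute with the "none" value is still accepted, which is what keeps existing iproute2 versions working.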

Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc
Signed-off-by: David S. Miller

Jiri Benc
2016-04-29 05:09:37 +0800
2090714e1 gre: build header correctly for collect metadata tunnels

In ipgre (i.e. not gretap) + collect metadata mode, the skb was assumed to
contain Ethernet header and was encapsulated as ETH_P_TEB. This is not the
case, the interface is ARPHRD_IPGRE and the protocol to be used for
encapsulation is skb->protocol.

Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc
Acked-by: Pravin B Shelar
Reviewed-by: Simon Horman
Signed-off-by: David S. Miller

Jiri Benc
2016-04-29 05:02:45 +0800
a64b04d86 gre: do not assign header_ops in collect metadata mode

In ipgre mode (i.e. not gretap) with collect metadata flag set, the tunnel
is incorrectly assumed to be mGRE in NBMA mode (see commit 6a5f44d7a048c).
This is not the case, we're controlling the encapsulation addresses by
lwtunnel metadata. And anyway, assigning dev->header_ops in collect metadata
mode does not make sense.

Although it would be more user friendly to reject requests that specify
both the collect metadata flag and a remote/local IP address, this would
break current users of gretap or introduce ugly code and differences in
handling ipgre and gretap configuration. Keep the current behavior of
remote/local IP address being ignored in such case.

v3: Back to v1, added explanation paragraph.
v2: Reject configuration specifying both remote/local address and collect
metadata flag.

Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc
Signed-off-by: David S. Miller

Jiri Benc
2016-04-29 05:02:44 +0800
12395d064 Merge tag 'mac80211-for-davem-2016-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just a single fix, for a per-CPU memory leak in a
(root user triggerable) error case.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2016-04-29 04:55:26 +0800
956a7ffe0 Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
In this patchset you can find the following fixes:

1) check skb size to avoid reading beyond its border when delivering
payloads, by Sven Eckelmann
2) initialize last_seen time in neigh_node object to prevent cleanup
routine from accidentally purging it, by Marek Lindner
3) release "recently added" slave interfaces upon virtual/batman
interface shutdown, by Sven Eckelmann
4) properly decrease router object reference counter upon routing table
update, by Sven Eckelmann
5) release queue slots when purging OGM packets of deactivating slave
interface, by Linus Lüssing

Patch 2 and 3 have no "Fixes:" tag because the offending commits date
back to when batman-adv was not yet officially in the net tree.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-04-29 04:42:40 +0800

27 Apr, 2016

1 commit

e6436be21 mac80211: fix statistics leak if dev_alloc_name() fails

In the case that dev_alloc_name() fails, e.g. because the name was
given by the user and already exists, we need to clean up properly
and free the per-CPU statistics. Fix that.

Cc: stable@vger.kernel.org
Fixes: 5a490510ba5f ("mac80211: use per-CPU TX/RX statistics")
Signed-off-by: Johannes Berg

Johannes Berg
2016-04-27 16:06:58 +0800

26 Apr, 2016

3 commits

38bd10c44 net: ipv6: Delete host routes on an ifdown

It was a simple idea -- save IPv6 configured addresses on a link down
so that IPv6 behaves similar to IPv4. As always the devil is in the
details and the IPv6 stack has too many behavioral differences from IPv4
making the simple idea more complicated than it needs to be.

The current implementation for keeping IPv6 addresses can panic or spit
out a warning in one of many paths:

1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
rt6_fill_node while handling a route dump request.

2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
fib6_del

3. Panic in fib6_purge_rt because rt6i_ref count is not 1.

The root cause of all these is references related to the host route for
an address that is retained.

So, this patch deletes the host route every time the ifdown loop runs.
Since the host route is deleted and will be re-generated on up, there is
no longer a need for the l3mdev fix up. On the 'admin up' side move
addrconf_permanent_addr into the NETDEV_UP event handling so that it
runs only once versus on UP and CHANGE events.

All of the current panics and warnings appear to be related to
addresses on the loopback device, but given the catastrophic nature when
a bug is triggered this patch takes the conservative approach and evicts
all host routes rather than trying to determine when it can be re-used
and when it can not. That can be a later optimization if desired.

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-04-26 23:48:26 +0800
6a923934c Revert "ipv6: Revert optional address flusing on ifdown."

This reverts commit 841645b5f2dfceac69b78fcd0c9050868d41ea61.

Ok, this puts the feature back. I've decided to apply David A.'s
bug fix and run with that rather than make everyone wait another
whole release for this feature.

Signed-off-by: David S. Miller

David S. Miller
2016-04-26 23:47:41 +0800
841645b5f ipv6: Revert optional address flusing on ifdown.

This reverts the following three commits:

70af921db6f8835f4b11c65731116560adb00c14
799977d9aafbf0ca0b9c39b04cbfb16db71302c9
f1705ec197e705b79ea40fe7a2cc5acfa1d3bfac

The feature was ill conceived, has terrible semantics, and has added
nothing but regressions to the already fragile ipv6 stack.

Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David S. Miller

David S. Miller
2016-04-26 03:33:55 +0800

25 Apr, 2016

4 commits

391a20333 ipv4/fib: don't warn when primary address is missing if in_dev is dead

After commit fbd40ea0180a ("ipv4: Don't do expensive useless work
during inetdev destroy.") when deleting an interface,
fib_del_ifaddr() can be executed without any primary address
present on the dead interface.

The above is safe, but triggers some "bug: prim == NULL" warnings.

This commit avoids the warning if the in_dev is dead.

Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller

Paolo Abeni
2016-04-25 11:26:29 +0800
45ebcce56 bridge: mdb: Marking port-group as offloaded

There is a race condition when updating the mdb offload flag without using
the multicast_lock. This reverts commit 9e8430f8d60d98 ("bridge: mdb:
Passing the port-group pointer to br_mdb module").

This patch marks an offloaded MDB entry as "offload" by setting
MDB_PG_FLAGS_OFFLOAD in the port-group flags.

When switchdev PORT_MDB succeeds in adding a multicast group, the
completion callback "br_mdb_complete" is invoked. The completion function
takes the multicast_lock, finds the right net_bridge_port_group, and
marks it as offloaded.

Fixes: 9e8430f8d60d98 ("bridge: mdb: Passing the port-group pointer to br_mdb module")
Reported-by: Nikolay Aleksandrov
Signed-off-by: Elad Raz
Signed-off-by: Jiri Pirko
Reviewed-by: Ido Schimmel
Acked-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller

Elad Raz
2016-04-25 02:23:32 +0800
6dd684c0f bridge: mdb: Common function for mdb entry translation

There is duplicate code that translates br_mdb_entry to br_ip; let's wrap it
in a common function.

Signed-off-by: Elad Raz
Signed-off-by: Jiri Pirko
Reviewed-by: Ido Schimmel
Acked-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller

Elad Raz
2016-04-25 02:23:32 +0800
7ceb2afbd switchdev: Adding complete operation to deferred switchdev ops

When using a switchdev deferred operation (SWITCHDEV_F_DEFER), the operation
is executed in a different context and the application has no way to learn
the real status of the operation.

Adding a completion callback fixes that. This patch adds a "complete_priv"
field to switchdev_attr and switchdev_obj, which is used by the "complete"
callback.

The application can set a complete function, which will be called once the
operation has executed.
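
As a rough illustration of the pattern, here is a minimal Python sketch of a deferred operation that reports its status through a completion callback. The class `DeferredOps` and its methods are hypothetical stand-ins and are not the switchdev API; only the names `complete` and `complete_priv` are borrowed from the description above.

```python
from collections import deque


class DeferredOps:
    """Toy stand-in for a deferred-operation queue (not the switchdev API)."""

    def __init__(self):
        self._queue = deque()

    def defer(self, op, complete=None, complete_priv=None):
        # Queue the operation; the caller returns before it runs.
        self._queue.append((op, complete, complete_priv))

    def process(self):
        # Runs later, in a different context from the caller.
        while self._queue:
            op, complete, complete_priv = self._queue.popleft()
            try:
                op()
                err = 0
            except OSError as exc:
                err = -(exc.errno or 1)
            if complete is not None:
                # The callback is the only way the caller learns the status.
                complete(err, complete_priv)


results = []
ops = DeferredOps()
ops.defer(lambda: None,
          complete=lambda err, priv: results.append((priv, err)),
          complete_priv="group-1")
ops.process()
```

The private pointer lets one callback serve many pending operations, since each invocation is told which request it belongs to.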

Signed-off-by: Elad Raz
Signed-off-by: Jiri Pirko
Reviewed-by: Ido Schimmel
Acked-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller

Elad Raz
2016-04-25 02:23:32 +0800

24 Apr, 2016

5 commits

c4fdb6cff batman-adv: Fix broadcast/ogm queue limit on a removed interface

When a single interface is removed while a broadcast or ogm packet is
still pending, the forward packet is freed without releasing the
queue slots again.

This patch is supposed to fix this issue.

Fixes: 6d5808d4ae1b ("batman-adv: Add missing hardif_free_ref in forw_packet_free")
Signed-off-by: Linus Lüssing
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann
Signed-off-by: Marek Lindner
Signed-off-by: Antonio Quartulli

Linus Lüssing
2016-04-24 15:41:56 +0800
d1a65f174 batman-adv: Reduce refcnt of removed router when updating route

_batadv_update_route rcu_dereferences orig_ifinfo->router outside of a
spinlock protected region to print some information messages to the debug
log. But this pointer is not checked again when the new pointer is assigned
in the spinlock protected region. Thus it can happen that the value of
orig_ifinfo->router changed in the meantime and thus the reference counter
of the wrong router gets reduced after the spinlock protected region.

Just rcu_dereferencing the value of orig_ifinfo->router inside the spinlock
protected region (which also sets the new pointer) is enough to get the
correct old router object.

Fixes: e1a5382f978b ("batman-adv: Make orig_node->router an rcu protected pointer")
Signed-off-by: Sven Eckelmann
Signed-off-by: Marek Lindner
Signed-off-by: Antonio Quartulli

Sven Eckelmann
2016-04-24 15:41:25 +0800
f2d23861b batman-adv: Deactivate TO_BE_ACTIVATED hardif on shutdown

The shutdown of a batman-adv interface can happen with one of its slave
interfaces still being in the BATADV_IF_TO_BE_ACTIVATED state. A possible
reason for it is that the routing algorithm BATMAN_V was selected and
batadv_schedule_bat_ogm was not yet called for this interface. This slave
interface still has to be set to BATADV_IF_INACTIVE or the batman-adv
interface will never reduce its usage counter and thus never gets shut down.

This problem can be simulated via:

$ modprobe dummy
$ modprobe batman-adv routing_algo=BATMAN_V
$ ip link add bat0 type batadv
$ ip link set dummy0 master bat0
$ ip link set dummy0 up
$ ip link del bat0
unregister_netdevice: waiting for bat0 to become free. Usage count = 3

Reported-by: Matthias Schiffer
Signed-off-by: Sven Eckelmann
Signed-off-by: Marek Lindner
Signed-off-by: Antonio Quartulli

Sven Eckelmann
2016-04-24 15:40:23 +0800
e48474ed8 batman-adv: init neigh node last seen field

Signed-off-by: Marek Lindner
[sven@narfation.org: fix conflicts with current version]
Signed-off-by: Sven Eckelmann
Signed-off-by: Antonio Quartulli

Marek Lindner
2016-04-24 15:39:19 +0800
c78296665 batman-adv: Check skb size before using encapsulated ETH+VLAN header

The encapsulated ethernet and VLAN header may be outside the received
ethernet frame. Thus the skb buffer size has to be checked before it can be
parsed to find out if it encapsulates another batman-adv packet.
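
The idea of checking the buffer length before each header is read can be sketched in Python as below. This is an illustrative model only: the function name `encapsulated_proto` is hypothetical, and the sketch parses a plain byte string rather than an skb.

```python
import struct

ETH_HLEN = 14          # destination MAC + source MAC + EtherType
VLAN_HLEN = 4          # 802.1Q tag
ETH_P_8021Q = 0x8100


def encapsulated_proto(payload):
    """Return the inner EtherType, or None if the buffer is too short."""
    if len(payload) < ETH_HLEN:
        return None
    (proto,) = struct.unpack_from("!H", payload, 12)
    if proto == ETH_P_8021Q:
        # The VLAN header may lie beyond the received data: check again
        # before reading the EtherType that follows the tag.
        if len(payload) < ETH_HLEN + VLAN_HLEN:
            return None
        (proto,) = struct.unpack_from("!H", payload, 16)
    return proto
```

Returning early on a short buffer is what prevents the read beyond the end of the received frame.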

Fixes: 420193573f11 ("batman-adv: softif bridge loop avoidance")
Signed-off-by: Sven Eckelmann
Signed-off-by: Marek Lindner
Signed-off-by: Antonio Quartulli

Sven Eckelmann
2016-04-24 15:37:21 +0800

22 Apr, 2016

6 commits

c5edde3a8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix memory leak in iwlwifi, from Matti Gottlieb.

2) Add missing registration of netfilter arp_tables into initial
namespace, from Florian Westphal.

3) Fix potential NULL deref in DecNET routing code.

4) Restrict NETLINK_URELEASE to truly bound sockets only, from Dmitry
Ivanov.

5) Fix dst ref counting in VRF, from David Ahern.

6) Fix TSO segmenting limits in i40e driver, from Alexander Duyck.

7) Fix heap leak in PACKET_DIAG_MCLIST, from Mathias Krause.

8) Revalidate IPV6 datagram socket cached routes properly, particularly
with UDP, from Martin KaFai Lau.

9) Fix endian bug in RDS dp_ack_seq handling, from Qing Huang.

10) Fix stats typing in bcmgenet driver, from Eric Dumazet.

11) Openvswitch needs to orphan SKBs before ipv6 fragmentation handling,
from Joe Stringer.

12) SPI device reference leak in spi_ks8895 PHY driver, from Mark Brown.

13) atl2 doesn't actually support scatter-gather, so don't advertise the
feature. From Ben Hutchings.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
openvswitch: use flow protocol when recalculating ipv6 checksums
Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets
atl2: Disable unimplemented scatter/gather feature
net/mlx4_en: Split SW RX dropped counter per RX ring
net/mlx4_core: Don't allow to VF change global pause settings
net/mlx4_core: Avoid repeated calls to pci enable/disable
net/mlx4_core: Implement pci_resume callback
net: phy: spi_ks8895: Don't leak references to SPI devices
net: ethernet: davinci_emac: Fix platform_data overwrite
net: ethernet: davinci_emac: Fix Unbalanced pm_runtime_enable
qede: Fix single MTU sized packet from firmware GRO flow
qede: Fix setting Skb network header
qede: Fix various memory allocation error flows for fastpath
tcp: Merge tx_flags and tskey in tcp_shifted_skb
tcp: Merge tx_flags and tskey in tcp_collapse_retrans
drivers: net: cpsw: fix wrong regs access in cpsw_ndo_open
tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks
openvswitch: Orphan skbs before IPv6 defrag
Revert "Prevent NUll pointer dereference with two PHYs on cpsw"
VSOCK: Only check error on skb_recv_datagram when skb is NULL
...

Linus Torvalds
2016-04-22 03:57:34 +0800
b4f70527f openvswitch: use flow protocol when recalculating ipv6 checksums

When using masked actions the ipv6_proto field of an action
to set IPv6 fields may be zero rather than the prevailing protocol
which will result in skipping checksum recalculation.

This patch resolves the problem by relying on the protocol
in the flow key rather than that in the set field action.

Fixes: 83d2b9ba1abc ("net: openvswitch: Support masked set actions.")
Cc: Jarno Rajahalme
Signed-off-by: Simon Horman
Signed-off-by: David S. Miller

Simon Horman
2016-04-22 03:28:47 +0800
cfea5a688 tcp: Merge tx_flags and tskey in tcp_shifted_skb

After receiving sacks, tcp_shifted_skb() will collapse
skbs if possible. tx_flags and tskey also have to be
merged.

This patch reuses the tcp_skb_collapse_tstamp() to handle
them.

BPF Output Before:
~~~~~

BPF Output After:
~~~~~
<...>-2024 [007] d.s. 88.644374: : ee_data:14599

Packetdrill Script:
~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792
0.100 > S. 0:0(0) ack 1
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 1460) = 1460
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 13140) = 13140

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Neal Cardwell
Cc: Soheil Hassas Yeganeh
Cc: Willem de Bruijn
Cc: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Tested-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-04-22 02:40:55 +0800
082ac2d51 tcp: Merge tx_flags and tskey in tcp_collapse_retrans

If two skbs are merged/collapsed during retransmission, the current
logic does not merge the tx_flags and tskey. The end result is
the SCM_TSTAMP_ACK timestamp could be missing for a packet.

The patch:
1. Merge the tx_flags
2. Overwrite the prev_skb's tskey with the next_skb's tskey
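
The two steps listed above can be sketched in Python as follows. This is a simplified model under stated assumptions: skbs are plain dictionaries, the flag values are illustrative rather than the kernel's SKBTX_* constants, and the guard that skips skbs without a timestamp request is an assumption added for the sketch.

```python
# Illustrative flag bits; the real SKBTX_* values live in the kernel headers.
TSTAMP_SND = 0x1
TSTAMP_ACK = 0x2
ANY_TSTAMP = TSTAMP_SND | TSTAMP_ACK


def collapse_tstamp(prev_skb, next_skb):
    """Fold next_skb's timestamp request into prev_skb before merging."""
    if next_skb["tx_flags"] & ANY_TSTAMP:
        # 1. Merge the tx_flags.
        prev_skb["tx_flags"] |= next_skb["tx_flags"] & ANY_TSTAMP
        # 2. The merged skb now ends where next_skb ended, so take its tskey.
        prev_skb["tskey"] = next_skb["tskey"]
    return prev_skb
```

Without step 1 the merged skb would carry no timestamp request at all, which matches the missing SCM_TSTAMP_ACK described above.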

BPF Output Before:
~~~~~~

BPF Output After:
~~~~~~
packetdrill-2092 [001] d.s. 453.998486: : ee_data:1459

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792
0.100 > S. 0:0(0) ack 1
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
0.200 write(4, ..., 11680) = 11680
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257
0.300 < . 1:1(0) ack 1 win 257
0.300 < . 1:1(0) ack 1 win 257
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Neal Cardwell
Cc: Soheil Hassas Yeganeh
Cc: Willem de Bruijn
Cc: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Tested-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-04-22 02:40:55 +0800
479f85c36 tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks

Assume SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received,
the kernel could incorrectly think that a skb has already
been acked and queue a SCM_TSTAMP_ACK cmsg to the
sk->sk_error_queue.

In tcp_ack_tstamp(), it checks
'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'.
If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill
script, between() returns true but the tskey is actually not acked.
e.g. try between(3, 2, 1).

The fix is to replace between() with one before() and one !before().
By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be
removed.
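
To see why between(3, 2, 1) is true, here is a minimal Python sketch of 32-bit sequence comparisons, written to mirror the modular arithmetic described above. The helpers are modelled on the kernel's before() and between(); the names `acked_old` and `acked_new` are hypothetical labels for the check before and after the fix, not kernel symbols.

```python
MASK32 = 0xFFFFFFFF


def before(seq1, seq2):
    """True if seq1 precedes seq2 in 32-bit sequence space."""
    diff = (seq1 - seq2) & MASK32
    return diff >= 0x80000000  # the signed 32-bit difference is negative


def between(seq1, seq2, seq3):
    """True if seq2 <= seq1 <= seq3, modulo 2**32."""
    return ((seq3 - seq2) & MASK32) >= ((seq1 - seq2) & MASK32)


def acked_old(tskey, prior_snd_una, snd_una):
    # Check as described before the fix: between() with a -1 offset.
    return between(tskey, prior_snd_una, (snd_una - 1) & MASK32)


def acked_new(tskey, prior_snd_una, snd_una):
    # Check as described after the fix: one !before() and one before().
    return (not before(tskey, prior_snd_una)) and before(tskey, snd_una)
```

With prior_snd_una == snd_una == 2, the upper bound passed to between() wraps to 1, so the range covers almost the whole sequence space and acked_old(3, 2, 2) is true, while acked_new(3, 2, 2) is false.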

A packetdrill script is used to reproduce the dup ack scenario.
Because packetdrill lacks cmsg support (or maybe I could not find it),
a BPF prog is used to kprobe sock_queue_err_skb() and print out the
value of serr->ee.ee_data.

Both the packetdrill and the bcc BPF scripts are attached at the end of
this commit message.

BPF Output Before Fix:
~~~~~~
<...>-2056 [001] d.s. 433.927987: : ee_data:1459 #incorrect
packetdrill-2056 [001] d.s. 433.929563: : ee_data:1459 #incorrect
packetdrill-2056 [001] d.s. 433.930765: : ee_data:1459 #incorrect
packetdrill-2056 [001] d.s. 434.028177: : ee_data:1459
packetdrill-2056 [001] d.s. 434.029686: : ee_data:14599

BPF Output After Fix:
~~~~~~
<...>-2049 [000] d.s. 113.517039: : ee_data:1459
<...>-2049 [000] d.s. 113.517253: : ee_data:14599

BCC BPF Script:
~~~~~~
#!/usr/bin/env python

from __future__ import print_function
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#include <linux/errqueue.h>

#ifdef memset
#undef memset
#endif

int trace_err_skb(struct pt_regs *ctx)
{
    struct sk_buff *skb = (struct sk_buff *)ctx->si;
    struct sock *sk = (struct sock *)ctx->di;
    struct sock_exterr_skb *serr;
    u32 ee_data = 0;

    if (!sk || !skb)
        return 0;

    serr = SKB_EXT_ERR(skb);
    bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
    bpf_trace_printk("ee_data:%u\\n", ee_data);

    return 0;
};
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
print("Attached to kprobe")
b.trace_print()

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792
0.100 > S. 0:0(0) ack 1
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 1460) = 1460
0.200 write(4, ..., 13140) = 13140

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257
0.300 < . 1:1(0) ack 1 win 257
0.300 < . 1:1(0) ack 1 win 257
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Neal Cardwell
Cc: Soheil Hassas Yeganeh
Cc: Willem de Bruijn
Cc: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Tested-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-04-22 01:45:43 +0800
49e261a8a openvswitch: Orphan skbs before IPv6 defrag

This is the IPv6 counterpart to commit 8282f27449bf ("inet: frag: Always
orphan skbs inside ip_defrag()").

Prior to commit 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free
clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be
cloned (implicitly orphaning) prior to queueing for reassembly. As such,
when the IPv6 message is eventually reassembled, the skb->sk for all
fragments would be NULL. After that commit was introduced, rather than
cloning, the original skbs were queued directly without orphaning. The
end result is that all frags except for the first and last may have a
socket attached.

This commit explicitly orphans such skbs during nf_ct_frag6_gather() to
prevent BUG_ON(skb->sk) during a later call to ip6_fragment().

kernel BUG at net/ipv6/ip6_output.c:631!
[...]
Call Trace:

[] ? __lock_acquire+0x927/0x20a0
[] ? do_output.isra.28+0x1b0/0x1b0 [openvswitch]
[] ? __lock_is_held+0x52/0x70
[] ovs_fragment+0x1f7/0x280 [openvswitch]
[] ? mark_held_locks+0x75/0xa0
[] ? _raw_spin_unlock_irqrestore+0x36/0x50
[] ? dst_discard_out+0x20/0x20
[] ? dst_ifdown+0x80/0x80
[] do_output.isra.28+0xf3/0x1b0 [openvswitch]
[] do_execute_actions+0x709/0x12c0 [openvswitch]
[] ? ovs_flow_stats_update+0x74/0x1e0 [openvswitch]
[] ? ovs_flow_stats_update+0xa1/0x1e0 [openvswitch]
[] ? _raw_spin_unlock+0x27/0x40
[] ovs_execute_actions+0x45/0x120 [openvswitch]
[] ovs_dp_process_packet+0x85/0x150 [openvswitch]
[] ? _raw_spin_unlock+0x27/0x40
[] ovs_execute_actions+0xc4/0x120 [openvswitch]
[] ovs_dp_process_packet+0x85/0x150 [openvswitch]
[] ? key_extract+0x442/0xc10 [openvswitch]
[] ovs_vport_receive+0x5d/0xb0 [openvswitch]
[] ? __lock_acquire+0x927/0x20a0
[] ? __lock_acquire+0x927/0x20a0
[] ? __lock_acquire+0x927/0x20a0
[] ? _raw_spin_unlock_irqrestore+0x36/0x50
[] internal_dev_xmit+0x6d/0x150 [openvswitch]
[] ? internal_dev_xmit+0x5/0x150 [openvswitch]
[] dev_hard_start_xmit+0x2df/0x660
[] ? validate_xmit_skb.isra.105.part.106+0x1a/0x2b0
[] __dev_queue_xmit+0x8f5/0x950
[] ? __dev_queue_xmit+0x50/0x950
[] ? mark_held_locks+0x75/0xa0
[] dev_queue_xmit+0x10/0x20
[] neigh_resolve_output+0x178/0x220
[] ? ip6_finish_output2+0x219/0x7b0
[] ip6_finish_output2+0x219/0x7b0
[] ? ip6_finish_output2+0x65/0x7b0
[] ? ip_idents_reserve+0x6b/0x80
[] ? ip6_fragment+0x93f/0xc50
[] ip6_fragment+0xba1/0xc50
[] ? ip6_flush_pending_frames+0x40/0x40
[] ip6_finish_output+0xcb/0x1d0
[] ip6_output+0x5f/0x1a0
[] ? ip6_fragment+0xc50/0xc50
[] ip6_local_out+0x3d/0x80
[] ip6_send_skb+0x2f/0xc0
[] ip6_push_pending_frames+0x4d/0x50
[] icmpv6_push_pending_frames+0xac/0xe0
[] icmpv6_echo_reply+0x42e/0x500
[] icmpv6_rcv+0x4cf/0x580
[] ip6_input_finish+0x1a7/0x690
[] ? ip6_input_finish+0x5/0x690
[] ip6_input+0x30/0xa0
[] ? ip6_rcv_finish+0x1a0/0x1a0
[] ip6_rcv_finish+0x4e/0x1a0
[] ipv6_rcv+0x45f/0x7c0
[] ? ipv6_rcv+0x36/0x7c0
[] ? ip6_make_skb+0x1c0/0x1c0
[] __netif_receive_skb_core+0x229/0xb80
[] ? mark_held_locks+0x75/0xa0
[] ? process_backlog+0x6f/0x230
[] __netif_receive_skb+0x16/0x70
[] process_backlog+0x78/0x230
[] ? process_backlog+0xdd/0x230
[] net_rx_action+0x203/0x480
[] ? mark_held_locks+0x75/0xa0
[] __do_softirq+0xde/0x49f
[] ? ip6_finish_output2+0x228/0x7b0
[] do_softirq_own_stack+0x1c/0x30

[] do_softirq.part.18+0x3b/0x40
[] __local_bh_enable_ip+0xb6/0xc0
[] ip6_finish_output2+0x251/0x7b0
[] ? ip6_fragment+0xba1/0xc50
[] ? ip_idents_reserve+0x6b/0x80
[] ? ip6_fragment+0x93f/0xc50
[] ip6_fragment+0xba1/0xc50
[] ? ip6_flush_pending_frames+0x40/0x40
[] ip6_finish_output+0xcb/0x1d0
[] ip6_output+0x5f/0x1a0
[] ? ip6_fragment+0xc50/0xc50
[] ip6_local_out+0x3d/0x80
[] ip6_send_skb+0x2f/0xc0
[] ip6_push_pending_frames+0x4d/0x50
[] rawv6_sendmsg+0xa28/0xe30
[] ? inet_sendmsg+0xc7/0x1d0
[] inet_sendmsg+0x106/0x1d0
[] ? inet_sendmsg+0x5/0x1d0
[] sock_sendmsg+0x38/0x50
[] SYSC_sendto+0xf6/0x170
[] ? trace_hardirqs_on_thunk+0x1b/0x1d
[] SyS_sendto+0xe/0x10
[] entry_SYSCALL_64_fastpath+0x18/0xa8
Code: 06 48 83 3f 00 75 26 48 8b 87 d8 00 00 00 2b 87 d0 00 00 00 48 39 d0 72 14 8b 87 e4 00 00 00 83 f8 01 75 09 48 83 7f 18 00 74 9a 0b 41 8b 86 cc 00 00 00 49 8#
RIP [] ip6_fragment+0x73a/0xc50
RSP

Fixes: 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free clone
operations")
Reported-by: Daniele Di Proietto
Signed-off-by: Joe Stringer
Signed-off-by: David S. Miller

Joe Stringer
2016-04-22 01:42:05 +0800

20 Apr, 2016

1 commit

9c995cc9a VSOCK: Only check error on skb_recv_datagram when skb is NULL

If skb_recv_datagram returns an skb, we should ignore the err
value returned. Otherwise, datagram receives will return EAGAIN
when they have to wait for a datagram.
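
The rule can be sketched in Python as below. This is a toy model, not the VSOCK code: `recv_datagram` and `dgram_dequeue` are hypothetical names, and the leftover error value on the success path is an assumption made to show why the error must be ignored when a datagram is returned.

```python
import errno


def recv_datagram(queue):
    """Toy receive: returns (skb, err). err may be stale when skb is set."""
    if queue:
        # Assumption for illustration: err still holds a leftover value
        # from an earlier wait even though a datagram is returned.
        return queue.pop(0), -errno.EAGAIN
    return None, -errno.EAGAIN


def dgram_dequeue(queue):
    skb, err = recv_datagram(queue)
    if skb is None:
        return err  # only trust err when nothing was received
    return len(skb)
```

Checking err unconditionally would turn the successful receive in this model into a spurious EAGAIN.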

Acked-by: Adit Ranadive
Signed-off-by: Jorgen Hansen
Signed-off-by: David S. Miller

Jorgen Hansen
2016-04-20 08:42:01 +0800

17 Apr, 2016

2 commits

e47db94e1 RDS: Fix the atomicity for congestion map update

Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Let's use atomics to avoid such an issue.
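
The lost update can be illustrated with a deterministic Python sketch. It is a toy model: the 64-bit word size, the simple port-to-bit mapping, and the example ports are assumptions, and the real congestion map layout may differ.

```python
BITS_PER_WORD = 64  # assumption: one machine word covers 64 ports


def word_and_bit(port):
    return port // BITS_PER_WORD, port % BITS_PER_WORD


# Ports 4000 and 4001 share a word in this model.
cong_map = [0] * (65536 // BITS_PER_WORD)
wa, ba = word_and_bit(4000)
wb, bb = word_and_bit(4001)

# Non-atomic read-modify-write, interleaved: both threads read first,
# then both write back, so the second write discards the first bit.
seen_a = cong_map[wa]
seen_b = cong_map[wb]
cong_map[wa] = seen_a | (1 << ba)
cong_map[wb] = seen_b | (1 << bb)
lost_update = cong_map[wa]

# Atomic equivalent: each read-modify-write completes before the next starts.
cong_map[wa] = 0
cong_map[wa] |= 1 << ba
cong_map[wb] |= 1 << bb
atomic_result = cong_map[wa]
```

In the interleaved case only the bit for port 4001 survives, whereas the serialized updates keep both bits set.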

Full credit to Wengang for
finding the issue, analysing it, and also pointing out
the offending code with a spin lock based fix.

Reviewed-by: Leon Romanovsky
Signed-off-by: Wengang Wang
Signed-off-by: Santosh Shilimkar
Signed-off-by: David S. Miller

santosh.shilimkar@oracle.com
2016-04-17 07:01:05 +0800
a7c556546 RDS: fix endianness for dp_ack_seq

dp->dp_ack_seq is used in big endian format. We need to do the
big endianness conversion when we assign a value in host format
to it.
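
A small Python sketch of the conversion, assuming a 64-bit field: the helper name `cpu_to_be64` is borrowed from the kernel macro of the same name, but here it is just a wrapper around `struct.pack`, and the sample value is arbitrary.

```python
import struct


def cpu_to_be64(value):
    """Bytes of a 64-bit value in big-endian (network) order."""
    return struct.pack(">Q", value)


ack_seq = 0x0102030405060708
wire = cpu_to_be64(ack_seq)
# On a little-endian host, storing the value without conversion would
# put the bytes on the wire in the opposite order.
unconverted_le = struct.pack("<Q", ack_seq)
```

A peer that reads the field as big endian recovers the original value only from the converted form.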

Signed-off-by: Qing Huang
Signed-off-by: Santosh Shilimkar
Signed-off-by: David S. Miller

Qing Huang
2016-04-17 07:01:05 +0800

16 Apr, 2016

1 commit

9241e2df4 vlan: pull on __vlan_insert_tag error path and fix csum correction

When __vlan_insert_tag() fails from skb_vlan_push() path due to the
skb_cow_head(), we need to undo the __skb_push() in the error path
as well that was done earlier to move skb->data pointer to mac header.

Moreover, I noticed that when in the non-error path the __skb_pull()
is done and the original offset to mac header was non-zero, we fixup
from a wrong skb->data offset in the checksum complete processing.

So the skb_postpush_rcsum() really needs to be done before __skb_pull()
where skb->data still points to the mac header start and thus operates
under the same conditions as in __vlan_insert_tag().

Fixes: 93515d53b133 ("net: move vlan pop/push functions into common code")
Signed-off-by: Daniel Borkmann
Reviewed-by: Jiri Pirko
Signed-off-by: David S. Miller

Daniel Borkmann
2016-04-16 11:20:11 +0800

15 Apr, 2016

7 commits

16382ed97 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:
"This fixes an NFS regression caused by the skcipher/hash conversion in
sunrpc. It also fixes a build problem in certain configurations with
bcm63xx"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
hwrng: bcm63xx - fix device tree compilation
sunrpc: Fix skcipher/shash conversion

Linus Torvalds
2016-04-15 09:15:40 +0800
d894ba18d soreuseport: fix ordering for mixed v4/v6 sockets

With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.

Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).

The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match. They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.

To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list. This will force AF_INET sockets to always be
considered first.
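
The insertion rule can be modelled with a short Python sketch. It is a simplification under stated assumptions: the socket list is a `deque`, sockets are dictionaries, all sockets are taken to have SO_REUSEPORT set, and `insert_reuseport` and `first_match` are hypothetical names.

```python
from collections import deque

AF_INET, AF_INET6 = "AF_INET", "AF_INET6"  # labels only, not socket constants


def insert_reuseport(bucket, sock):
    """Insert per the rule above: v4 at the head, v6 at the tail."""
    if sock["family"] == AF_INET6:
        bucket.append(sock)
    else:
        bucket.appendleft(sock)


def first_match(bucket):
    # Models a lookup that stops at the first matching socket.
    return bucket[0] if bucket else None
```

Because the lookup short-circuits on the first match, keeping AF_INET sockets at the head restores the old preference no matter which socket was bound most recently.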

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")

Reported-by: Maciej Żenczykowski
Signed-off-by: Craig Gallek
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Craig Gallek
2016-04-15 09:14:03 +0800
e646b657f ipv6: udp: Do a route lookup and update during release_cb

This patch adds a release_cb for UDPv6. It does a route lookup
and updates sk->sk_dst_cache if it is needed. It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.

It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).

Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau
Reported-by: Wei Wang
Cc: Cong Wang
Cc: Eric Dumazet
Cc: Wei Wang
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-04-15 04:29:53 +0800
33c162a98 ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update

There is a case with a connected UDP socket in which
getsockopt(IPV6_MTU) returns a stale MTU value. A sequence that
reproduces it is the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
sk->sk_dst_cache instead of the newly created RTF_CACHE clone.

This patch updates the sk->sk_dst_cache for a connected datagram sk
during the pmtu-update code path.

Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph). This is because a UDP socket can become
connected after sending out some datagrams in the unconnected state, or
it can be connected multiple times to different destinations. Hence,
iph may not be related to where sk is currently connected.

It is done under the '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect() call (i.e. changing
the sk->sk_v6_daddr) while the dst lookup is happening in the pmtu-update
code path.

For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.
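
The split between the immediate update and the deferred one can be sketched with a toy model. This is only an illustration under stated assumptions: ToySock, route_lookup_mtu(), pmtu_update() and release() are hypothetical names, and a dict keyed by destination address stands in for the routing table. The lookup uses the address the socket is connected to, not anything taken from the packet.

```python
class ToySock:
    """Hypothetical model of a connected datagram socket; not kernel code."""

    def __init__(self, daddr, mtu):
        self.daddr = daddr            # where the socket is connected to
        self.cached_mtu = mtu         # stands in for sk->sk_dst_cache
        self.owned_by_user = False    # stands in for sock_owned_by_user(sk)
        self.pending_update = False   # work left over for release()


def route_lookup_mtu(routes, daddr, default_mtu=1500):
    """Look up the MTU for daddr in the toy routing table."""
    return routes.get(daddr, default_mtu)


def pmtu_update(sk, routes):
    """Refresh the cache now, or defer if the user currently owns the socket."""
    if sk.owned_by_user:
        sk.pending_update = True
        return
    sk.cached_mtu = route_lookup_mtu(routes, sk.daddr)


def release(sk, routes):
    """Model of the release callback: finish any deferred update."""
    sk.owned_by_user = False
    if sk.pending_update:
        sk.pending_update = False
        sk.cached_mtu = route_lookup_mtu(routes, sk.daddr)


routes = {"2fac:face::face": 1300}
sk = ToySock("2fac:face::face", 1500)
sk.owned_by_user = True
pmtu_update(sk, routes)
mtu_while_owned = sk.cached_mtu
release(sk, routes)
mtu_after_release = sk.cached_mtu
print(mtu_while_owned, mtu_after_release)
```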

Test:

Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium

A simple Python script that creates a connected UDP socket:

import socket
import errno

HOST = '2fac::1'
PORT = 8080

s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
    try:
        data = s.recv(1024)
    except socket.error as se:
        if se.errno == errno.EMSGSIZE:
            # 41 = IPPROTO_IPV6, 24 = IPV6_MTU (Linux values)
            pmtu = s.getsockopt(41, 24)
            print("PMTU:%d" % pmtu)
            break
s.close()

Python program output after getting an ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300

Cache routes after receiving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0 metric 0
cache expires 463sec mtu 1300 pref medium

Client (Send the ICMPV6_PKT_TOOBIG):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scapy is used to generate the TOOBIG message. Here is the scapy script I have
used:

>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')

Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau
Reported-by: Wei Wang
Cc: Cong Wang
Cc: Eric Dumazet
Cc: Wei Wang
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-04-15 04:29:51 +0800
7e2040db1 ipv6: datagram: Refactor dst lookup and update codes to a new function

This patch moves the route lookup and update code for connected
datagram sk to a newly created function, ip6_datagram_dst_update().

It will be reused during the pmtu update in a later patch.

Signed-off-by: Martin KaFai Lau
Cc: Cong Wang
Cc: Eric Dumazet
Cc: Wei Wang
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-04-15 04:29:50 +0800
80fbdb208 ipv6: datagram: Refactor flowi6 init codes to a new function

Move the flowi6 init code for connected datagram sk to a newly created
function, ip6_datagram_flow_key_init().

Notes:
1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect()
2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of
(addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init()

This new function will be reused during the pmtu update in a later patch.

Signed-off-by: Martin KaFai Lau
Cc: Cong Wang
Cc: Eric Dumazet
Cc: Wei Wang
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-04-15 04:29:49 +0800
5e2650291 Merge tag 'mac80211-for-davem-2016-04-14' of git://git.kernel.org/pub/scm/linux/…

…kernel/git/jberg/mac80211

Johannes Berg says:

====================
This has just the single fix from Dmitry Ivanov, adding the missing
netlink notifier family check to avoid the socket close DoS problem.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2016-04-15 00:00:59 +0800

14 Apr, 2016

5 commits

3dcd493fb net: sched: do not requeue a NULL skb

A failure in validate_xmit_skb_list() triggered an unconditional call
to dev_requeue_skb with skb=NULL. This slowly grows the queue
discipline's qlen count until all traffic through the queue stops.

We take the optimistic approach and continue running the queue after a
failure since it is unknown if later packets also will fail in the
validate path.
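
The accounting problem can be illustrated with a toy model. It is a sketch only: ToyQdisc and send_one() are hypothetical names, None plays the role of a NULL skb, and the guard flag switches between the old and the fixed behaviour. Both variants return True to keep the queue running after a validation failure, which is the optimistic approach described above.

```python
class ToyQdisc:
    """Hypothetical queue model; qlen counts packets the qdisc believes it holds."""

    def __init__(self):
        self.qlen = 0
        self.held = None

    def requeue(self, skb):
        # Accounting only: the counter grows even when skb is None.
        self.held = skb
        self.qlen += 1


def send_one(q, skb, validate, guard):
    """Validate one packet; return True if the queue should keep running."""
    skb = validate(skb)
    if skb is None:
        if not guard:
            q.requeue(skb)   # old behaviour: requeue a NULL skb
        return True          # keep running: later packets may still validate
    return True              # packet handed to the driver in this model


def drop_all(skb):
    """Validation stub that always fails."""
    return None


buggy, fixed = ToyQdisc(), ToyQdisc()
for _ in range(3):
    send_one(buggy, b"pkt", drop_all, guard=False)
    send_one(fixed, b"pkt", drop_all, guard=True)
print(buggy.qlen, fixed.qlen)
```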

Fixes: 55a93b3ea780 ("qdisc: validate skb without holding lock")
Signed-off-by: Lars Persson
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Lars Persson
2016-04-14 13:28:51 +0800
309cf37fe packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interface

Because we fail to wipe the remainder of i->addr[] in packet_mc_add(),
pdiag_put_mclist() leaks uninitialized heap bytes via the
PACKET_DIAG_MCLIST netlink attribute.

Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].
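
The leak and the fix can be illustrated with a toy model. This is a sketch under stated assumptions: mc_add() and ADDR_LEN are hypothetical, and a buffer pre-filled with 0xAA stands in for uninitialized heap memory. Only the first len(addr) bytes are written, so without the wipe the stale tail is reported along with the address.

```python
ADDR_LEN = 32  # hypothetical fixed size of the addr[] field in this model


def mc_add(heap_chunk, addr, wipe_tail):
    """Copy addr into a recycled buffer, optionally zeroing the unused tail."""
    entry = bytearray(heap_chunk)   # simulates uninitialized heap memory
    entry[:len(addr)] = addr
    if wipe_tail:
        entry[len(addr):] = bytes(len(entry) - len(addr))
    return bytes(entry)


stale = b"\xaa" * ADDR_LEN                 # leftover data from an earlier user
addr = bytes.fromhex("0180c2000001")       # a 6-byte link-layer address

leaky = mc_add(stale, addr, wipe_tail=False)
clean = mc_add(stale, addr, wipe_tail=True)
print(leaky[len(addr):].hex())
print(clean[len(addr):].hex())
```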

Fixes: eea68e2f1a00 ("packet: Report socket mclist info via diag module")
Signed-off-by: Mathias Krause
Cc: Eric W. Biederman
Cc: Pavel Emelyanov
Acked-by: Pavel Emelyanov
Signed-off-by: David S. Miller

Mathias Krause
2016-04-14 12:46:39 +0800
d6d5e999e route: do not cache fib route info on local routes with oif

For local routes that require a particular output interface we do not want
to cache the result. Caching the result causes incorrect behaviour when
there are multiple source addresses on the interface. As a result, a
recipient waiting on that interface for the packet will not receive it,
because the packet is delivered on the loopback interface and the
IP_PKTINFO ipi_ifindex is set to the loopback interface as well.

This can be tested by running a program such as "dhcp_release" which
attempts to inject a packet on a particular interface so that it is
received by another program on the same board. The receiving process
should see an IP_PKTINFO ipi_ifindex value of the source interface
(e.g., eth1) instead of the loopback interface (e.g., lo). The packet
will still appear on the loopback interface in tcpdump but the important
aspect is that the CMSG info is correct.

Sample dhcp_release command line:

dhcp_release eth1 192.168.204.222 02:11:33:22:44:66
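
On the receiving side, the ipi_ifindex value can be read from the IP_PKTINFO control message. The following is a minimal sketch for Linux, not part of the patch: parse_pktinfo() and recv_with_ifindex() are hypothetical helpers, and the constant 8 for IP_PKTINFO is the Linux value, defined locally on the assumption that the socket module may not export it.

```python
import socket
import struct

IP_PKTINFO = 8  # Linux value; defined here in case the socket module lacks it


def parse_pktinfo(cmsg_data):
    """Decode struct in_pktinfo into (ifindex, spec_dst, addr)."""
    ifindex, spec_dst, addr = struct.unpack("=i4s4s", cmsg_data[:12])
    return ifindex, socket.inet_ntoa(spec_dst), socket.inet_ntoa(addr)


def recv_with_ifindex(sock, bufsize=2048):
    """Receive one datagram and return (data, ifindex), or None if absent."""
    data, ancdata, _flags, _addr = sock.recvmsg(bufsize, socket.CMSG_SPACE(12))
    for level, ctype, cdata in ancdata:
        if level == socket.IPPROTO_IP and ctype == IP_PKTINFO:
            return data, parse_pktinfo(cdata)[0]
    return data, None
```

The receiver has to enable the option first with sock.setsockopt(socket.IPPROTO_IP, IP_PKTINFO, 1). The returned index can be turned into a name such as eth1 with socket.if_indextoname().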

Signed-off-by: Allain Legacy
Signed-off-by: Chris Friesen
Reviewed-by: Julian Anastasov
Signed-off-by: David S. Miller

Chris Friesen
2016-04-14 11:33:01 +0800
70af921db net: ipv6: Do not keep linklocal and loopback addresses

f1705ec197e7 added the option to retain user configured addresses on an
admin down. A comment to one of the later revisions suggested using the
IFA_F_PERMANENT flag rather than adding a user_managed boolean to the
ifaddr struct. A side effect of this change is that link-local and
loopback addresses are also retained, which is not part of the objective
of f1705ec197e7. Add a check to drop those addresses.
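
The intended rule can be sketched in a few lines using the standard ipaddress module. This is a simplified model, not the kernel check: keep_on_ifdown() is a hypothetical helper, the permanent argument stands in for the IFA_F_PERMANENT flag, and any other condition the kernel applies is ignored.

```python
import ipaddress


def keep_on_ifdown(addr, permanent):
    """Return True if this model would retain addr on an admin down."""
    ip = ipaddress.IPv6Address(addr)
    if ip.is_link_local or ip.is_loopback:
        return False      # never retained, even if flagged permanent
    return permanent      # only user-configured (permanent) addresses stay


for candidate in ("2001:db8::1", "fe80::1", "::1"):
    print(candidate, keep_on_ifdown(candidate, permanent=True))
```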

Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-04-14 10:58:37 +0800
60e19518d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree. More
specifically, they are:

1) Fix missing filter table per-netns registration in arptables, from
Florian Westphal.

2) Resolve out-of-bounds access when parsing TCP options in
nf_conntrack_tcp, patch from Jozsef Kadlecsik.

3) Prefer NFPROTO_BRIDGE extensions over NFPROTO_UNSPEC in ebtables,
this resolves conflict between xt_limit and ebt_limit, from Phil Sutter.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-04-14 09:49:03 +0800