Eric Lee / smarc-fsl-linux-kernel

14 Nov, 2016

3 commits

ac6e78007 tcp: take care of truncations done by sk_filter()

With syzkaller's help, Marco Grassi found a bug in the TCP stack,
crashing in tcp_collapse().

Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.

We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq
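
The two steps above can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct fake_skb, filter_trim_cap() and end_seq_after_filter() are hypothetical names, not the kernel's, and the real fix operates on struct sk_buff and TCP_SKB_CB().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few skb fields this sketch needs. */
struct fake_skb {
    uint32_t len;      /* total bytes, TCP header included */
    uint32_t hdr_len;  /* TCP header length in bytes */
    uint32_t seq;      /* sequence number of the first payload byte */
    int syn, fin;      /* SYN and FIN each consume one sequence number */
};

/* Truncate to new_len as a filter might, but never below the TCP header. */
static void filter_trim_cap(struct fake_skb *skb, uint32_t new_len)
{
    assert(skb->len >= skb->hdr_len);
    if (new_len < skb->hdr_len)
        new_len = skb->hdr_len;   /* step 1: keep the whole header */
    if (new_len < skb->len)
        skb->len = new_len;
}

/* Step 2: recompute end_seq from the length that survived the filter. */
static uint32_t end_seq_after_filter(const struct fake_skb *skb)
{
    return skb->seq + skb->syn + skb->fin + (skb->len - skb->hdr_len);
}
```

A filter asking for 8 bytes on a segment with a 20-byte header leaves exactly the header, so end_seq only advances for SYN/FIN.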

Many thanks to syzkaller team and Marco for giving us a reproducer.

Signed-off-by: Eric Dumazet
Reported-by: Marco Grassi
Reported-by: Vladis Dronov
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-14 01:30:02 +0800
969447f22 ipv4: use new_gw for redirect neigh lookup

In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
and then the state of the neigh for the new_gw is checked. If the state
isn't valid then the redirected route is deleted. This behavior is
maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
is assigned to peer->redirect_learned.a4 before calling
ipv4_neigh_lookup().

After commit 5943634fc559 ("ipv4: Maintain redirect and PMTU info in
struct rtable again."), ipv4_neigh_lookup() is performed without the
rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
isn't zero, the function uses it as the key. The neigh is most likely
valid since the old_gw is the one that sends the ICMP redirect message.
Then the new_gw is assigned to fib_nh_exception. The problem is: the
new_gw ARP may never get resolved and the traffic is blackholed.

So, use the new_gw for neigh lookup.

Changes from v1:
- use __ipv4_neigh_lookup instead (per Eric Dumazet).

Fixes: 5943634fc559 ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
Signed-off-by: Stephen Suryaputra Lin
Signed-off-by: David S. Miller

Stephen Suryaputra Lin
2016-11-14 01:24:44 +0800
ca0a75316 r8152: Fix error path in open function

If usb_submit_urb() called from the open function fails, the following
crash may be observed.

r8152 8-1:1.0 eth0: intr_urb submit failed: -19
...
r8152 8-1:1.0 eth0: v1.08.3
Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b7b
pgd = ffffffc0e7305000
[6b6b6b6b6b6b6b7b] *pgd=0000000000000000, *pud=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT SMP
...
PC is at notifier_chain_register+0x2c/0x58
LR is at blocking_notifier_chain_register+0x54/0x70
...
Call trace:
notifier_chain_register+0x2c/0x58
blocking_notifier_chain_register+0x54/0x70
register_pm_notifier+0x24/0x2c
rtl8152_open+0x3dc/0x3f8 [r8152]
__dev_open+0xac/0x104
__dev_change_flags+0xb0/0x148
dev_change_flags+0x34/0x70
do_setlink+0x2c8/0x888
rtnl_newlink+0x328/0x644
rtnetlink_rcv_msg+0x1a8/0x1d4
netlink_rcv_skb+0x68/0xd0
rtnetlink_rcv+0x2c/0x3c
netlink_unicast+0x16c/0x234
netlink_sendmsg+0x340/0x364
sock_sendmsg+0x48/0x60
SyS_sendto+0xe0/0x120
SyS_send+0x40/0x4c
el0_svc_naked+0x24/0x28

Clean up error handling to avoid registering the notifier if the open
function is going to fail.
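
A minimal sketch of that ordering, assuming a hypothetical struct fake_dev and fake_open() in place of the driver's real state and open function: everything that can fail runs first, and the notifier is registered last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state standing in for the driver's resources. */
struct fake_dev {
    bool urb_submit_fails;    /* simulate the URB submit returning an error */
    bool buffers_allocated;
    bool notifier_registered;
};

static int fake_open(struct fake_dev *d)
{
    assert(d != NULL);
    d->buffers_allocated = true;

    if (d->urb_submit_fails)
        goto out_free;        /* fail before anything is registered */

    /* Register the notifier only once nothing else can fail. */
    d->notifier_registered = true;
    return 0;

out_free:
    d->buffers_allocated = false;
    return -19;               /* same error value as in the log above */
}
```

With this ordering a failed open leaves no notifier behind that could later point at freed memory.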

Signed-off-by: Guenter Roeck
Signed-off-by: David S. Miller

Guenter Roeck
2016-11-14 01:03:20 +0800

13 Nov, 2016

5 commits

10b217681 net: bpqether.h: remove if_ether.h guard

__LINUX_IF_ETHER_H is not defined anywhere, and if_ether.h can keep itself from
double inclusion, though it uses a single underscore prefix.

Signed-off-by: Baruch Siach
Signed-off-by: David S. Miller

Baruch Siach
2016-11-13 13:57:53 +0800
34fad54c2 net: __skb_flow_dissect() must cap its return value

After Tom's patch, the thoff field could point past the end of the buffer,
this could fool some callers.

If an skb was provided, skb->len should be the upper limit.
If not, hlen is supposed to be the upper limit.
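
The capping rule can be sketched like this. cap_thoff() is a hypothetical helper for illustration; a NULL skb_len stands for "no skb was provided".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: clamp a dissected transport-header offset.
 * skb_len points at the skb length when an skb was provided, else NULL. */
static unsigned int cap_thoff(unsigned int thoff,
                              const unsigned int *skb_len,
                              unsigned int hlen)
{
    unsigned int limit = skb_len ? *skb_len : hlen;

    return thoff < limit ? thoff : limit;
}
```

Callers can then rely on the returned offset never exceeding the data they actually hold.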

Fixes: a6e544b0a88b ("flow_dissector: Jump to exit code in __skb_flow_dissect")
Signed-off-by: Eric Dumazet
Reported-by: Yibin Yang
Acked-by: Willem de Bruijn
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-13 12:41:53 +0800
79774d6bf Merge branch 'fix-bpf_redirect'

Martin KaFai Lau says:

====================
bpf: Fix bpf_redirect to an ipip/ip6tnl dev

This patch set fixes a bug in bpf_redirect(dev, flags) when dev is an
ipip/ip6tnl. The current problem is IP-EthHdr-IP is sent out instead of
IP-IP.

Patch 1 adds a dev->type test similar to dev_is_mac_header_xmit()
in act_mirred.c which is only available in net-next. We can consider
refactoring it once this patch is pulled into net-next from net.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-11-13 12:38:08 +0800
90e02896f bpf: Add test for bpf_redirect to ipip/ip6tnl

The test creates two netns, ns1 and ns2. The host (the default netns)
has an ipip or ip6tnl dev configured for tunneling traffic to the ns2.

ping VIPs from ns1 <-> host <-> ns2 (VIPs at loopback)

The test is to have ns1 pinging VIPs configured at the loopback
interface in ns2.

The VIPs are 10.10.1.102 and 2401:face::66 (which are configured
at lo@ns2). [Note: 0x66 => 102].

At ns1, the VIPs are routed _via_ the host.

At the host, bpf programs are installed at the veth to redirect packets
from a veth to the ipip/ip6tnl. The test is configured in a way so
that both ingress and egress can be tested.

At ns2, the ipip/ip6tnl dev is configured with the local and remote address
specified. The return path is routed to the dev ipip/ip6tnl.

During egress test, the host also locally tests pinging the VIPs to ensure
that bpf_redirect at egress also works for the direct egress (i.e. not
forwarding from dev ve1 to ve2).

Acked-by: Alexei Starovoitov
Signed-off-by: Martin KaFai Lau
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-11-13 12:38:07 +0800
4e3264d21 bpf: Fix bpf_redirect to an ipip/ip6tnl dev

If the bpf program calls bpf_redirect(dev, 0) and dev is
an ipip/ip6tnl, it currently includes the mac header.
e.g. If dev is ipip, the end result is IP-EthHdr-IP instead
of IP-IP.

The fix is to pull the mac header. At ingress, skb_postpull_rcsum()
is not needed because the ethhdr should have been pulled once already
and then got pushed back just before calling the bpf_prog.
At egress, this patch calls skb_postpull_rcsum().

If bpf_redirect(dev, BPF_F_INGRESS) is called,
it also fails now because it calls dev_forward_skb() which
eventually calls eth_type_trans(skb, dev). The eth_type_trans()
will set skb->pkt_type = PACKET_OTHERHOST because the mac address
does not match the redirecting dev->dev_addr. The PACKET_OTHERHOST
will eventually cause ip_rcv() to error out. To fix this,
____dev_forward_skb() is added.

Joint work with Daniel Borkmann.

Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Acked-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: Martin KaFai Lau
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-11-13 12:38:07 +0800

11 Nov, 2016

7 commits

23dd83154 Merge branch 'mlxsw-fixes'

Jiri Pirko says:

====================
mlxsw: Couple of router fixes

v1->v2:
- patch2:
- use net_eq
====================

Signed-off-by: David S. Miller

David S. Miller
2016-11-11 02:02:25 +0800
0e3715c9c mlxsw: spectrum_router: Ignore FIB notification events for non-init namespaces

Until now, tables with the same id in multiple network namespaces were
squashed into a single virtual router. That is not only incorrect, it also
causes error messages when the RALUE register is used to remove the same
FIB entries twice, like this one:

mlxsw_spectrum 0000:03:00.0: EMAD reg access failed (tid=facb831c00007b20,reg_id=8013(ralue),type=write,status=7(bad parameter))

Since we don't allow ports to change namespaces (NETIF_F_NETNS_LOCAL),
and the infrastructure is not yet prepared to handle netnamespaces, just
ignore FIB notification events for non-init namespaces. That is fine to
do since we don't need to offload them.

Fixes: b45f64d16d45 ("mlxsw: spectrum_router: Use FIB notifications instead of switchdev calls")
Signed-off-by: Jiri Pirko
Acked-by: Ido Schimmel
Signed-off-by: David S. Miller

Jiri Pirko
2016-11-11 02:02:15 +0800
33b1341cd mlxsw: spectrum_router: Fix handling of neighbour structure

__neigh_create function works in a different way than assumed.
It passes "n" as a parameter to ndo_neigh_construct. But this "n" might
be destroyed right away before __neigh_create() returns in case there is
already another neighbour struct in the hashtable with the same dev and
primary key. That is not expected by mlxsw_sp_router_neigh_construct()
and the stored "n" points to freed memory, eventually leading to crash.

Fix this by doing tight 1:1 coupling between neighbour struct and
internal driver neigh_entry. That allows narrowing down the key in the
internal driver hashtable to do lookups by "n" only.

Fixes: 6cf3c971dc84 ("mlxsw: spectrum_router: Add private neigh table")
Signed-off-by: Jiri Pirko
Acked-by: Ido Schimmel
Signed-off-by: David S. Miller

Jiri Pirko
2016-11-11 02:02:15 +0800
2ce0af8fd Merge branch 'qed-fixes'

Yuval Mintz says:

====================
qed: Fix RoCE infrastructure

This series fixes 2 basic issues with RoCE support,
one handles a missing configuration in the initial infrastructure
support while the other is a regression introduced by one of the
initial fix submissions.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-11-11 01:55:26 +0800
5c5f26090 qed: Correct rdma params configuration

A previous fix broke RoCE support, as the rdma_pf_params are now
being set into the parameters only after the params are already assigned
into the hw-function.

Fixes: 0189efb8f4f8 ("qed*: Fix Kconfig dependencies with INFINIBAND_QEDR")
Signed-off-by: Ram Amrani
Signed-off-by: Yuval Mintz
Signed-off-by: David S. Miller

Ram Amrani
2016-11-11 01:55:20 +0800
8d1d8fcb2 qed: configure ll2 RoCE v1/v2 flavor correctly

Currently RoCE v2 won't operate with RDMA CM due to missing setting of
the roce-flavour in the ll2 configuration.
This patch properly sets the flavour, and deletes incorrect HSI
that doesn't [yet] exist.

Fixes: abd49676c707 ("qed: Add RoCE ll2 & GSI support")
Signed-off-by: Ram Amrani
Signed-off-by: Yuval Mintz
Signed-off-by: David S. Miller

Ram Amrani
2016-11-11 01:55:20 +0800
0ace81ec7 ipv4: update comment to document GSO fragmentation cases.

This is a follow-up to commit 9ee6c5dc816a ("ipv4: allow local
fragmentation in ip_finish_output_gso()"), updating the comment
documenting cases in which fragmentation is needed for egress
GSO packets.

Suggested-by: Shmulik Ladkani
Reviewed-by: Shmulik Ladkani
Signed-off-by: Lance Richardson
Signed-off-by: David S. Miller

Lance Richardson
2016-11-11 01:01:54 +0800

10 Nov, 2016

16 commits

9b6c14d51 net: tcp response should set oif only if it is L3 master

Lorenzo noted an Android unit test failed due to e0d56fdd7342:
"The expectation in the test was that the RST replying to a SYN sent to a
closed port should be generated with oif=0. In other words it should not
prefer the interface where the SYN came in on, but instead should follow
whatever the routing table says it should do."

Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
that the oif in the flow is set to the skb_iif only if skb_iif is an L3
master.

Fixes: e0d56fdd7342 ("net: l3mdev: remove redundant calls")
Reported-by: Lorenzo Colitti
Signed-off-by: David Ahern
Tested-by: Lorenzo Colitti
Acked-by: Lorenzo Colitti
Signed-off-by: David S. Miller

David Ahern
2016-11-10 11:32:10 +0800
8da3cf2a4 Net Driver: Add Cypress GX3 VID=04b4 PID=3610.

Add support for Cypress GX3 SuperSpeed to Gigabit Ethernet
Bridge Controller (Vendor=04b4 ProdID=3610).

Patch verified on x64 linux kernel 4.7.4, 4.8.6, 4.9-rc4 systems
with the Kensington SD4600P USB-C Universal Dock with Power,
which uses the Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge
Controller.

A similar patch was signed-off and tested-by Allan Chou
on 2015-12-01.

Allan verified his similar patch on x86 Linux kernel 4.1.6 system
with Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge Controller.

Tested-by: Allan Chou
Tested-by: Chris Roth
Tested-by: Artjom Simon

Signed-off-by: Allan Chou
Signed-off-by: Chris Roth
Signed-off-by: David S. Miller

Allan Chou
2016-11-10 10:45:34 +0800
9fa684ec8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains a larger than usual batch of Netfilter
fixes for your net tree. This series contains a mixture of old bugs and
recently introduced bugs, they are:

1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
support the set element updates from the packet path. From Liping
Zhang.

2) Fix leak when nft_expr_clone() fails, from Liping Zhang.

3) Fix a race when inserting new elements to the set hash from the
packet path, also from Liping.

4) Handle segmented TCP SIP packets properly, basically avoid that the
INVITE in the allow header creates bogus expectations by performing
stricter SIP message parsing, from Ulrich Weber.

5) nft_parse_u32_check() should return signed integer for errors, from
John Linville.

6) Fix wrong allocation size for connlabels, allocate 16 instead of
32 bytes, from Florian Westphal.

7) Fix compilation breakage when building the ip_vs_sync code with
CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.

8) Destroy the new set if the transaction object cannot be allocated,
also from Liping Zhang.

9) Use device to route duplicated packets via nft_dup only when set by
the user, otherwise packets may not follow the right route, again
from Liping.

10) Fix wrong maximum genetlink attribute definition in IPVS, from
WANG Cong.

11) Ignore untracked conntrack objects from xt_connmark, from Florian
Westphal.

12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC
via CT target, otherwise we cannot use the h.245 helper, from
Florian.

13) Revisit garbage collection heuristic in the new workqueue-based
timer approach for conntrack to evict objects earlier, again from
Florian.

14) Fix crash in nf_tables when inserting an element into a verdict map,
from Liping Zhang.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-11-10 09:38:18 +0800
f567e950b rtnl: reset calcit fptr in rtnl_unregister()

To avoid having dangling function pointers left behind, reset calcit in
rtnl_unregister(), too.

This is no issue so far, as only the rtnl core registers a netlink
handler with a calcit hook which won't be unregistered, but may become
one if new code makes use of the calcit hook.
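
As a rough illustration of the idea, assuming a hypothetical struct fake_rtnl_link with one slot per handler type (the kernel's own table layout differs): unregistering should clear every function pointer, not just some of them.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(void);

/* Hypothetical per-message-type handler slots. */
struct fake_rtnl_link {
    handler_fn doit;
    handler_fn dumpit;
    handler_fn calcit;
};

/* Placeholder handler so the slots have something to point at. */
static int fake_handler(void)
{
    return 0;
}

static void fake_unregister(struct fake_rtnl_link *link)
{
    assert(link != NULL);
    link->doit = NULL;
    link->dumpit = NULL;
    link->calcit = NULL;   /* without this, a stale pointer would remain */
}
```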

Fixes: c7ac8679bec9 ("rtnetlink: Compute and store minimum ifinfo...")
Cc: Jeff Kirsher
Cc: Greg Rose
Signed-off-by: Mathias Krause
Signed-off-by: David S. Miller

Mathias Krause
2016-11-10 09:18:19 +0800
4053ab1bf vxlan: hide unused local variable

A bugfix introduced a harmless warning in v4.9-rc4:

drivers/net/vxlan.c: In function 'vxlan_group_used':
drivers/net/vxlan.c:947:21: error: unused variable 'sock6' [-Werror=unused-variable]

This hides the variable inside of the same #ifdef that is
around its user. The extraneous initialization is removed
at the same time, it was accidentally introduced in the
same commit.
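
The pattern looks roughly like the sketch below. FAKE_IPV6 is a hypothetical macro standing in for the real config symbol, and sockets_in_use() is not the vxlan function; the point is only that the variable lives inside the same #ifdef as its single user.

```c
#include <assert.h>

/* FAKE_IPV6 is a hypothetical stand-in for a config symbol. */
static int sockets_in_use(int sock4_busy, int sock6_busy)
{
    int used = sock4_busy;

    assert(sock4_busy >= 0 && sock6_busy >= 0);
#ifdef FAKE_IPV6
    /* Declared next to its only user, so other builds never see it. */
    int sock6 = sock6_busy;

    used += sock6;
#else
    (void)sock6_busy;   /* parameter is intentionally unused here */
#endif
    return used;
}
```

Built without FAKE_IPV6, the compiler has no unused variable to warn about.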

Fixes: c6fcc4fc5f8b ("vxlan: avoid using stale vxlan socket.")
Signed-off-by: Arnd Bergmann
Acked-by: Jiri Benc
Signed-off-by: David S. Miller

Arnd Bergmann
2016-11-10 07:59:50 +0800
6dbcd8fb5 ibmvnic: Start completion queue negotiation at server-provided optimum values

Use the opt_* fields to determine the starting point for negotiating the
number of tx/rx completion queues with the vnic server. These contain the
number of queues that the vnic server estimates that it will be able to
allocate. While renegotiation may still occur, using the opt_* fields will
reduce the number of times this needs to happen and will prevent driver
probe timeout on systems using large numbers of ibmvnic client devices per
vnic port.

Signed-off-by: John Allen
Signed-off-by: David S. Miller

John Allen
2016-11-10 07:52:41 +0800
9d1a6c4ea net: icmp_route_lookup should use rt dev to determine L3 domain

icmp_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have an rt.
Update icmp_route_lookup to use the rt on the skb to determine L3
domain.

Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-11-10 07:49:39 +0800
fd6f24d75 Merge branch 'qcom-emac-pause'

Timur Tabi says:

====================
net: qcom/emac: ensure that pause frames are enabled

The qcom emac driver experiences significant packet loss (through frame
check sequence errors) if flow control is not enabled and the phy is
not configured to allow pause frames to pass through it. Therefore, we
need to enable flow control and force the phy to pass pause frames.
====================

Signed-off-by: David S. Miller

David S. Miller
2016-11-10 07:45:36 +0800
df63022e1 net: qcom/emac: enable flow control if requested

If the PHY has been configured to allow pause frames, then the MAC
should be configured to generate and/or accept those frames.

Signed-off-by: Timur Tabi
Signed-off-by: David S. Miller

Timur Tabi
2016-11-10 07:45:24 +0800
3e8844934 net: qcom/emac: configure the external phy to allow pause frames

Pause frames are used to enable flow control. A MAC can send and
receive pause frames in order to throttle traffic. However, the PHY
must be configured to allow those frames to pass through.

Reviewed-by: Florian Fainelli
Signed-off-by: Timur Tabi
Signed-off-by: David S. Miller

Timur Tabi
2016-11-10 07:45:24 +0800
cdb26d338 net: bgmac: fix reversed checks for clock control flag

This fixes a regression introduced by the patch adding feature flags. It was
already reported and a patch followed (it got accepted), but it appears it
was incorrect. Instead of fixing the reversed condition it broke a good one.

This patch was verified to actually fix SoC hangs caused by bgmac on
BCM47186B0.

Fixes: db791eb2970b ("net: ethernet: bgmac: convert to feature flags")
Fixes: 4af1474e6198 ("net: bgmac: Fix errant feature flag check")
Cc: Jon Mason
Signed-off-by: Rafał Miłecki
Signed-off-by: David S. Miller

Rafał Miłecki
2016-11-10 02:32:06 +0800
d667f7851 bna: Add synchronization for tx ring.

We received two reports of BUG_ON in bnad_txcmpl_process() where
hw_consumer_index appeared to be ahead of producer_index. Out of order
write/read of these variables could explain these reports.

bnad_start_xmit(), as a producer of tx descriptors, has a few memory
barriers sprinkled around writes to producer_index and the device's
doorbell but they're not paired with anything in bnad_txcmpl_process(), a
consumer.

Since we are synchronizing with a device, we must use mandatory barriers,
not smp_*. Also, I didn't see the purpose of the last smp_mb() in
bnad_start_xmit().

Signed-off-by: Benjamin Poirier
Signed-off-by: David S. Miller

Benjamin Poirier
2016-11-10 02:31:10 +0800
f91d71815 Revert "net/mlx4_en: Fix panic during reboot"

This reverts commit 9d2afba058722d40cc02f430229c91611c0e8d16.

The original issue would possibly exist if an external module
tried calling our "ethtool_ops" without checking if it still
exists.

The right way of solving it is by simply doing the check in
the caller side.
Currently, no action is required as there's no such use case.

Signed-off-by: Tariq Toukan
Signed-off-by: David S. Miller

Tariq Toukan
2016-11-10 02:29:32 +0800
fb56be83e net-ipv6: on device mtu change do not add mtu to mtu-less routes

Routes can specify an mtu explicitly or inherit the mtu from
the underlying device - this inheritance is implemented in
dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().

Currently changing the mtu of a device adds mtu explicitly
to routes using that device.

For example:
# ip link set dev lo mtu 65536
# ip -6 route add local 2000::1 dev lo
# ip -6 route get 2000::1
local 2000::1 dev lo table local src ... metric 1024 pref medium

# ip link set dev lo mtu 65535
# ip -6 route get 2000::1
local 2000::1 dev lo table local src ... metric 1024 mtu 65535 pref medium

# ip link set dev lo mtu 65536
# ip -6 route get 2000::1
local 2000::1 dev lo table local src ... metric 1024 mtu 65536 pref medium

# ip -6 route del local 2000::1

After this patch the route entry no longer changes unless it already has an mtu.
There is no need: this inheritance is already done in ip6_mtu()

# ip link set dev lo mtu 65536
# ip -6 route add local 2000::1 dev lo
# ip -6 route add local 2000::2 dev lo mtu 2000
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium

# ip link set dev lo mtu 65535
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium

# ip link set dev lo mtu 1501
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 1501 pref medium

# ip link set dev lo mtu 65536
# ip -6 route get 2000::1; ip -6 route get 2000::2
local 2000::1 dev lo table local src ... metric 1024 pref medium
local 2000::2 dev lo table local src ... metric 1024 mtu 65536 pref medium

# ip -6 route del local 2000::1
# ip -6 route del local 2000::2

This is desirable because changing device mtu and then resetting it
to the previous value shouldn't change the user visible routing table.
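
The behaviour shown in the transcripts above can be modelled with a small sketch. The update rule here is inferred from that output only and is an assumption, as are the names struct fake_route, dev_mtu_changed() and effective_mtu(); mtu == 0 is used to mean "no explicit mtu".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical route: mtu == 0 means "inherit from the device". */
struct fake_route {
    unsigned int mtu;
};

/* Apply a device mtu change to one route. */
static void dev_mtu_changed(struct fake_route *rt,
                            unsigned int old_dev_mtu,
                            unsigned int new_dev_mtu)
{
    assert(rt != NULL);
    if (rt->mtu == 0)
        return;                    /* mtu-less route: leave it alone */
    /* Lower an explicit mtu that no longer fits, or follow the device
     * when the route mtu was tracking the old device mtu. */
    if (rt->mtu >= new_dev_mtu || rt->mtu == old_dev_mtu)
        rt->mtu = new_dev_mtu;
}

/* What a lookup would report: explicit mtu if set, else the device's. */
static unsigned int effective_mtu(const struct fake_route *rt,
                                  unsigned int dev_mtu)
{
    return rt->mtu ? rt->mtu : dev_mtu;
}
```

Replaying the 65536, 65535, 1501, 65536 sequence leaves the mtu-less route untouched while the explicit one goes 2000, 2000, 1501, 65536, matching the output above.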

Signed-off-by: Maciej Żenczykowski
CC: Eric Dumazet
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Maciej Żenczykowski
2016-11-10 02:19:32 +0800
3023898b7 sock: fix sendmmsg for partial sendmsg

Do not send the next message in sendmmsg for partial sendmsg
invocations.

sendmmsg assumes that it can continue sending the next message
when the return value of the individual sendmsg invocations
is positive. It results in corrupting the data for TCP,
SCTP, and UNIX streams.

For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
of "aefgh" if the first sendmsg invocation sends only the first
byte while the second sendmsg goes through.

Datagram sockets either send the entire datagram or fail, so
this patch affects only sockets of type SOCK_STREAM and
SOCK_SEQPACKET.
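
A userspace model of the fixed loop, under stated assumptions: struct fake_stream, fake_send() and fake_sendmmsg() are hypothetical and only mimic a stream that accepts a limited number of bytes; they are not the socket API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stream that accepts at most `budget` more bytes. */
struct fake_stream {
    char buf[64];
    size_t used;
    size_t budget;
};

/* Append up to len bytes; returns how many were actually accepted. */
static size_t fake_send(struct fake_stream *s, const char *msg, size_t len)
{
    size_t n = len < s->budget ? len : s->budget;

    assert(s->used + n < sizeof(s->buf));
    memcpy(s->buf + s->used, msg, n);
    s->used += n;
    s->buf[s->used] = '\0';
    s->budget -= n;
    return n;
}

/* Returns the number of messages attempted; stops after a partial send. */
static int fake_sendmmsg(struct fake_stream *s,
                         const char *const *msgs, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        size_t len = strlen(msgs[i]);

        if (fake_send(s, msgs[i], len) < len)
            return i + 1;   /* partial: do not start the next message */
    }
    return count;
}
```

With room for only one byte, the stream ends up holding "a" and the second message is never started, so the "aefgh" interleaving cannot occur.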

Fixes: 228e548e6020 ("net: Add sendmmsg socket system call")
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Eric Dumazet
Signed-off-by: Willem de Bruijn
Signed-off-by: Neal Cardwell
Acked-by: Maciej Żenczykowski
Signed-off-by: David S. Miller

Soheil Hassas Yeganeh
2016-11-10 02:18:12 +0800
aa5fd0fb7 driver: macvlan: Destroy new macvlan port if macvlan_common_newlink failed.

When there is no existing macvlan port in lowdev, a new macvlan port
is created. But it is not destroyed when something fails later,
which causes a memory leak.

Now add one flag to indicate if new macvlan port is created.
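
The flag-based cleanup can be sketched as below. struct fake_lowerdev and fake_newlink() are hypothetical; the sketch only shows that the error path tears down the port when, and only when, this call created it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lower device that may or may not already have a port. */
struct fake_lowerdev {
    bool has_port;
};

static int fake_newlink(struct fake_lowerdev *dev, bool later_step_fails)
{
    bool created = false;

    assert(dev != NULL);
    if (!dev->has_port) {
        dev->has_port = true;
        created = true;            /* remember that this call made it */
    }

    if (later_step_fails) {
        if (created)
            dev->has_port = false; /* destroy only what we created */
        return -1;
    }
    return 0;
}
```

A port that existed before the call survives a failure, since other users may still depend on it.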

Signed-off-by: Gao Feng
Signed-off-by: David S. Miller

Gao Feng
2016-11-10 02:14:47 +0800

09 Nov, 2016

5 commits

58c78e104 netfilter: nf_tables: fix oops when inserting an element into a verdict map

Dalegaard says:
The following ruleset, when loaded with 'nft -f bad.txt'
----snip----
flush ruleset
table ip inlinenat {
map sourcemap {
type ipv4_addr : verdict;
}

chain postrouting {
ip saddr vmap @sourcemap accept
}
}
add chain inlinenat test
add element inlinenat sourcemap { 100.123.10.2 : jump test }
----snip----

results in a kernel oops:
BUG: unable to handle kernel paging request at 0000000000001344
IP: nf_tables_check_loops+0x114/0x1f0 [nf_tables]
[...]
Call Trace:
? nft_data_init+0x13e/0x1a0 [nf_tables]
nft_validate_register_store+0x60/0xb0 [nf_tables]
nft_add_set_elem+0x545/0x5e0 [nf_tables]
? nft_table_lookup+0x30/0x60 [nf_tables]
? nla_strcmp+0x40/0x50
nf_tables_newsetelem+0x11e/0x210 [nf_tables]
? nla_validate+0x60/0x80
nfnetlink_rcv+0x354/0x5a7 [nfnetlink]

This happens because we forget to fill in the net pointer in bind_ctx, so
dereferencing it may cause a kernel crash.

Reported-by: Dalegaard
Signed-off-by: Liping Zhang
Signed-off-by: Pablo Neira Ayuso

Liping Zhang
2016-11-09 06:53:39 +0800
e0df8cae6 netfilter: conntrack: refine gc worker heuristics

Nicolas Dichtel says:
After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
remove timed-out entries"), netlink conntrack deletion events may be
sent with a huge delay.

Nicolas further points at this line:

goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

and indeed, this isn't optimal at all. Rationale here was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge. But in order to have some guarantee
about maximum time period where a scan of the full conntrack table
completes we should always use a fixed slice size, so that once every
N scans the full table has been examined at least once.

We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).

So, after some discussion with Nicolas:

1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)

2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries

3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead

4. want a reasonably short time to detect a timed-out entry when the
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.

The old gc interval is turned into a max number, scans can now happen
every jiffy if stale entries are present.

Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').
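
The adaptive delay described in points 2 and 4 could look roughly like this. The doubling and halving steps and both constants are assumptions made for illustration; the actual worker may adjust its interval differently.

```c
#include <assert.h>

#define FAKE_GC_MAX_DELAY 100u  /* hypothetical upper bound, in ticks */
#define FAKE_GC_MIN_DELAY 1u    /* rescan on the next tick at the fastest */

/* Pick the delay before the next partial scan of the table. */
static unsigned int next_gc_delay(unsigned int cur, unsigned int evicted)
{
    assert(cur >= FAKE_GC_MIN_DELAY && cur <= FAKE_GC_MAX_DELAY);

    if (evicted == 0) {
        /* Nothing expired: back off, up to the maximum. */
        cur *= 2;
        return cur > FAKE_GC_MAX_DELAY ? FAKE_GC_MAX_DELAY : cur;
    }

    /* Stale entries were found: scan again sooner. */
    cur /= 2;
    return cur < FAKE_GC_MIN_DELAY ? FAKE_GC_MIN_DELAY : cur;
}
```

Because the delay is capped, the table is still fully scanned within a bounded time even when no evictions happen.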

Reported-by: Nicolas Dichtel
Signed-off-by: Florian Westphal
Acked-by: Nicolas Dichtel
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-11-09 06:53:38 +0800
6114cc516 netfilter: conntrack: fix CT target for UNSPEC helpers

Thomas reports it's not possible to attach the H.245 helper:

iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
iptables: No chain/target/match by that name.
xt_CT: No such helper "H.245"

This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.

We should treat UNSPEC as wildcard and ignore the l3num instead.
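
The wildcard rule amounts to a comparison like the one below. The enum values and helper_family_matches() are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical family codes for this sketch only. */
enum fake_family {
    FAKE_UNSPEC = 0,
    FAKE_IPV4 = 2,
    FAKE_IPV6 = 10
};

/* A helper registered as UNSPEC matches any requested family. */
static bool helper_family_matches(enum fake_family helper,
                                  enum fake_family requested)
{
    return helper == FAKE_UNSPEC || helper == requested;
}
```

A helper registered for IPv4 only still does not match an IPv6 request, so the wildcard does not loosen family-specific helpers.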

Reported-by: Thomas Woerner
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-11-09 06:53:37 +0800
fb9c9649a netfilter: connmark: ignore skbs with magic untracked conntrack objects

The (percpu) untracked conntrack entries can end up with nonzero connmarks.

The 'untracked' conntrack objects are merely a way to distinguish INVALID
(i.e. protocol connection tracker says payload doesn't meet some
requirements or packet was never seen by the connection tracking code)
from packets that are intentionally not tracked (some icmpv6 types such as
neigh solicitation, or by using 'iptables -j CT --notrack' option).

Untracked conntrack objects are an implementation detail; we might as well use
an invalid magic address instead to tell INVALID and UNTRACKED apart.

Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.

Reported-by: XU Tianwen
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2016-11-09 06:53:36 +0800
8fbfef7f5 ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr

family.maxattr is the max index for policy[], the size of
ops[] is determined with ARRAY_SIZE().
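
To make the distinction concrete, here is a sketch with hypothetical attribute and command enums (not the IPVS ones): the highest attribute index bounds the policy array, while the number of operations comes from the size of the ops array, and the two are unrelated.

```c
#include <assert.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical attribute numbering with a trailing count sentinel. */
enum { FAKE_ATTR_UNSPEC, FAKE_ATTR_SERVICE, FAKE_ATTR_DEST, FAKE_ATTR_COUNT };
#define FAKE_ATTR_MAX (FAKE_ATTR_COUNT - 1)

/* Hypothetical command numbering, independent of the attributes. */
enum { FAKE_CMD_UNSPEC, FAKE_CMD_NEW, FAKE_CMD_SET, FAKE_CMD_DEL, FAKE_CMD_GET };

/* One policy slot per attribute index, 0..FAKE_ATTR_MAX. */
static const int fake_policy[FAKE_ATTR_MAX + 1] = { 0 };
static const int fake_ops[] = {
    FAKE_CMD_NEW, FAKE_CMD_SET, FAKE_CMD_DEL, FAKE_CMD_GET
};

/* maxattr is the highest valid index into the policy array. */
static unsigned int fake_maxattr(void)
{
    return ARRAY_SIZE(fake_policy) - 1;
}

/* The number of ops is derived from the ops array itself. */
static unsigned int fake_n_ops(void)
{
    return ARRAY_SIZE(fake_ops);
}
```

Using a command-based maximum for maxattr would let attribute indices run past the end of the policy array.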

Reported-by: Andrey Konovalov
Tested-by: Andrey Konovalov
Cc: Pablo Neira Ayuso
Signed-off-by: Cong Wang
Signed-off-by: Simon Horman
Signed-off-by: Pablo Neira Ayuso

WANG Cong
2016-11-09 06:53:30 +0800

08 Nov, 2016

4 commits

fd0285a39 fib_trie: Correct /proc/net/route off by one error

The display of /proc/net/route has had a couple issues due to the fact that
when I originally rewrote most of fib_trie I made it so that the iterator
was tracking the next value to use instead of the current.

In addition it had an off by 1 error where I was tracking the first piece
of data as position 0, even though in reality that belonged to the
SEQ_START_TOKEN.

This patch updates the code so the iterator tracks the last reported
position and key instead of the next expected position and key. In
addition it shifts things so that all of the leaves start at 1 instead of
trying to report leaves starting with offset 0 as being valid. With these
two issues addressed this should resolve any off by one errors that were
present in the display of /proc/net/route.

Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in /proc/net/route")
Cc: Andy Whitcroft
Reported-by: Jason Baron
Tested-by: Jason Baron
Signed-off-by: Alexander Duyck
Signed-off-by: David S. Miller

Alexander Duyck
2016-11-08 09:40:27 +0800
8e0140a2d Documentation: networking: dsa: Update tagging protocols

Add Qualcomm QCA tagging introduced in cafdc45c9 to the
list of supported protocols.

Signed-off-by: Fabian Mewes
Reviewed-by: Andrew Lunn
Acked-by: Florian Fainelli
Signed-off-by: David S. Miller

Fabian Mewes
2016-11-08 09:39:15 +0800
f3358507c virtio-net: drop legacy features in virtio 1 mode

Virtio 1.0 spec says VIRTIO_F_ANY_LAYOUT and VIRTIO_NET_F_GSO are
legacy-only feature bits. Do not negotiate them in virtio 1 mode. Note
this is a spec violation so we need to backport it to stable/downstream
kernels.
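
One way to picture the rule is as a mask applied to the feature set. This is a sketch, not how the driver implements it; the FAKE_ bit positions reflect my understanding of the virtio headers and should be treated as assumptions, as should the helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions assumed for illustration; check the virtio headers. */
#define FAKE_VIRTIO_NET_F_GSO     6
#define FAKE_VIRTIO_F_ANY_LAYOUT  27
#define FAKE_VIRTIO_F_VERSION_1   32

/* Drop legacy-only bits when the device is used in virtio 1 mode. */
static uint64_t fake_filter_features(uint64_t offered)
{
    const uint64_t legacy_only = (1ULL << FAKE_VIRTIO_NET_F_GSO) |
                                 (1ULL << FAKE_VIRTIO_F_ANY_LAYOUT);

    assert((legacy_only & (1ULL << FAKE_VIRTIO_F_VERSION_1)) == 0);
    if (offered & (1ULL << FAKE_VIRTIO_F_VERSION_1))
        return offered & ~legacy_only;
    return offered;    /* legacy mode: leave the set untouched */
}
```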

Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin
Reviewed-by: Cornelia Huck
Acked-by: Jason Wang
Signed-off-by: David S. Miller

Michael S. Tsirkin
2016-11-08 09:35:46 +0800
5d41ce29e net: icmp6_send should use dst dev to determine L3 domain

icmp6_send is called in response to some event. The skb may not have
the device set (skb->dev is NULL), but it is expected to have a dst set.
Update icmp6_send to use the dst on the skb to determine L3 domain.

Fixes: ca254490c8dfd ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-11-08 09:30:19 +0800