Eric Lee / smarc-fsl-linux-kernel

26 Sep, 2015

3 commits

518a7cb69 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) When we run a tap on netlink sockets, we have to copy mmap'd SKBs
instead of cloning them. From Daniel Borkmann.

2) When converting classical BPF into eBPF, fix the setting of the
source reg to BPF_REG_X. From Tycho Andersen.

3) Fix igmpv3/mldv2 report parsing in the bridge multicast code, from
Linus Lussing.

4) Fix dst refcounting for ipv6 tunnels, from Martin KaFai Lau.

5) Set NLM_F_REPLACE flag properly when replacing ipv6 routes, from
Roopa Prabhu.

6) Add some new cxgb4 PCI device IDs, from Hariprasad Shenai.

7) Fix headroom tests and SKB leaks in ipv6 fragmentation code, from
Florian Westphal.

8) Check DMA mapping errors in bna driver, from Ivan Vecera.

9) Several 8139cp bug fixes (dev_kfree_skb_any in interrupt context,
misclearing of interrupt status in TX timeout handler, etc.) from
David Woodhouse.

10) In tipc, reset SKB header pointer after skb_linearize(), from Erik
Hugne.

11) Fix autobind races et al. in netlink code, from Herbert Xu with
help from Tejun Heo and others.

12) Missing SET_NETDEV_DEV in sunvnet driver, from Sowmini Varadhan.

13) Fix various races in timewait timer and reqsk_queue_hadh_req, from
Eric Dumazet.

14) Fix array overruns in mac80211, from Johannes Berg and Dan
Carpenter.

15) Fix data race in rhashtable_rehash_one(), from Dmitriy Vyukov.

16) Fix race between poll_one_napi and napi_disable, from Neil Horman.

17) Fix byte order in geneve tunnel port config, from John W Linville.

18) Fix handling of ARP replies over lightweight tunnels, from Jiri
Benc.

19) We can loop when fib rule dumps cross multiple SKBs, fix from Wilson
Kok and Roopa Prabhu.

20) Several reference count handling bug fixes in the PHY/MDIO layer
from Russel King.

21) Fix lockdep splat in ppp_dev_uninit(), from Guillaume Nault.

22) Fix crash in icmp_route_lookup(), from David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
net: Fix panic in icmp_route_lookup
net: update docbook comment for __mdiobus_register()
ppp: fix lockdep splat in ppp_dev_uninit()
net: via/Kconfig: GENERIC_PCI_IOMAP required if PCI not selected
phy: marvell: add link partner advertised modes
net: fix net_device refcounting
phy: add phy_device_remove()
phy: fixed-phy: properly validate phy in fixed_phy_update_state()
net: fix phy refcounting in a bunch of drivers
of_mdio: fix MDIO phy device refcounting
phy: add proper phy struct device refcounting
phy: fix mdiobus module safety
net: dsa: fix of_mdio_find_bus() device refcount leak
phy: fix of_mdio_find_bus() device refcount leak
ip6_tunnel: Reduce log level in ip6_tnl_err() to debug
ip6_gre: Reduce log level in ip6gre_err() to debug
fib_rules: fix fib rule dumps across multiple skbs
bnx2x: byte swap rss_key to comply to Toeplitz specs
net: revert "net_sched: move tp->root allocation into fw_init()"
lwtunnel: remove source and destination UDP port config option
...

Linus Torvalds
2015-09-26 18:01:33 +0800
bdb06cbf7 net: Fix panic in icmp_route_lookup ... Browse Code »

Andrey reported a panic:

[ 7249.865507] BUG: unable to handle kernel pointer dereference at 000000b4
[ 7249.865559] IP: [] icmp_route_lookup+0xaa/0x320
[ 7249.865598] *pdpt = 0000000030f7f001 *pde = 0000000000000000
[ 7249.865637] Oops: 0000 [#1]
...
[ 7249.866811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.3.0-999-generic #201509220155
[ 7249.866876] Hardware name: MSI MS-7250/MS-7250, BIOS 080014 08/02/2006
[ 7249.866916] task: c1a5ab00 ti: c1a52000 task.ti: c1a52000
[ 7249.866949] EIP: 0060:[] EFLAGS: 00210246 CPU: 0
[ 7249.866981] EIP is at icmp_route_lookup+0xaa/0x320
[ 7249.867012] EAX: 00000000 EBX: f483ba48 ECX: 00000000 EDX: f2e18a00
[ 7249.867045] ESI: 000000c0 EDI: f483ba70 EBP: f483b9ec ESP: f483b974
[ 7249.867077] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 7249.867108] CR0: 8005003b CR2: 000000b4 CR3: 36ee07c0 CR4: 000006f0
[ 7249.867141] Stack:
[ 7249.867165] 320310ee 00000000 00000042 320310ee 00000000 c1aeca00
f3920240 f0c69180
[ 7249.867268] f483ba04 f855058b a89b66cd f483ba44 f8962f4b 00000000
e659266c f483ba54
[ 7249.867361] 8004753c f483ba5c f8962f4b f2031140 000003c1 ffbd8fa0
c16b0e00 00000064
[ 7249.867448] Call Trace:
[ 7249.867494] [] ? e1000_xmit_frame+0x87b/0xdc0 [e1000e]
[ 7249.867534] [] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867576] [] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867615] [] ? icmp_send+0xa0/0x380
[ 7249.867648] [] icmp_send+0x2cf/0x380
[ 7249.867681] [] nf_send_unreach+0xa6/0xc0 [nf_reject_ipv4]
[ 7249.867714] [] reject_tg+0x7a/0x9f [ipt_REJECT]
[ 7249.867746] [] ipt_do_table+0x317/0x70c [ip_tables]
[ 7249.867780] [] ? __nf_conntrack_find_get+0x166/0x3b0
[nf_conntrack]
[ 7249.867838] [] ? nf_conntrack_in+0x398/0x600 [nf_conntrack]
[ 7249.867889] [] iptable_filter_hook+0x35/0x80 [iptable_filter]
[ 7249.867933] [] nf_iterate+0x71/0x80
[ 7249.867970] [] nf_hook_slow+0x65/0xc0
[ 7249.868002] [] __ip_local_out_sk+0xc1/0xd0
[ 7249.868034] [] ? ip_forward_options+0x1a0/0x1a0
[ 7249.868066] [] ip_local_out_sk+0x16/0x30
[ 7249.868097] [] ip_send_skb+0x14/0x80
[ 7249.868129] [] ip_push_pending_frames+0x34/0x40
[ 7249.868163] [] ip_send_unicast_reply+0x282/0x310
[ 7249.868196] [] tcp_v4_send_reset+0x1b3/0x380
[ 7249.868227] [] tcp_v4_rcv+0x323/0x990
[ 7249.868257] [] ? nf_iterate+0x71/0x80
[ 7249.868289] [] ip_local_deliver_finish+0x8b/0x230
[ 7249.868322] [] ip_local_deliver+0x4c/0xa0
[ 7249.868353] [] ? ip_rcv_finish+0x390/0x390
[ 7249.868384] [] ip_rcv_finish+0x7c/0x390
[ 7249.868415] [] ip_rcv+0x2e0/0x420
...

Prior to the VRF change the oif was not set in the flow struct, so the
VRF support should really have only added the vrf_master_ifindex lookup.

Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
Cc: Andrey Melnikov
Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2015-09-26 12:44:02 +0800
101688f53 Merge tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs ... Browse Code »

Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:

Stable patches:
- fix v4.2 SEEK on files over 2 gigs
- Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
- Fix recovery of recalled read delegations

Bugfixes:
- Fix a case where NFSv4 fails to send CLOSE after a server reboot
- Fix sunrpc to wait for connections to complete before retrying
- Fix sunrpc races between transport connect/disconnect and shutdown
- Fix an infinite loop when layoutget fail with BAD_STATEID
- nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
- Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
- Fix layoutreturn/close ordering issues"

* tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS41: make close wait for layoutreturn
NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set
NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget
NFSv4: Recovery of recalled read delegations is broken
NFS: Fix an infinite loop when layoutget fail with BAD_STATEID
NFS: Do cleanup before resetting pageio read/write to mds
SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
SUNRPC: Lock the transport layer on shutdown
nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
SUNRPC: Ensure that we wait for connections to complete before retrying
SUNRPC: drop null test before destroy functions
nfs: fix v4.2 SEEK on files over 2 gigs
SUNRPC: Fix races between socket connection and destroy code
nfs: fix pg_test page count calculation
Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount

Linus Torvalds
2015-09-26 02:33:52 +0800

25 Sep, 2015

10 commits

9861f7207 net: fix net_device refcounting ... Browse Code »

of_find_net_device_by_node() uses class_find_device() internally to
lookup the corresponding network device. class_find_device() returns
a reference to the embedded struct device, with its refcount
incremented.

Add a comment to the definition in net/core/net-sysfs.c indicating the
need to drop this refcount, and fix the DSA code to drop this refcount
when the OF-generated platform data is cleaned up and freed. Also
arrange for the ref to be dropped when handling errors.

Signed-off-by: Russell King
Signed-off-by: David S. Miller

Russell King
2015-09-25 14:04:53 +0800
e496ae690 net: dsa: fix of_mdio_find_bus() device refcount leak ... Browse Code »

Current users of of_mdio_find_bus() leak a struct device refcount, as
they fail to clean up the reference obtained inside class_find_device().

Fix the DSA code to properly refcount the returned MDIO bus by:
1. taking a reference on the struct device whenever we assign it to
pd->chip[x].host_dev.
2. dropping the reference when we overwrite the existing reference.
3. dropping the reference when we free the data structure.
4. dropping the initial reference we obtained after setting up the
platform data structure, or on failure.

In step 2 above, where we obtain a new MDIO bus, there is no need to
take a reference on it as we would only have to drop it immediately
after assignment again, iow:

put_device(cd->host_dev); /* drop original assignment ref */
cd->host_dev = get_device(&mdio_bus_switch->dev); /* get our ref */
put_device(&mdio_bus_switch->dev); /* drop of_mdio_find_bus ref */

Signed-off-by: Russell King
Signed-off-by: David S. Miller

Russell King
2015-09-25 14:04:52 +0800
17a10c921 ip6_tunnel: Reduce log level in ip6_tnl_err() to debug ... Browse Code »

Currently error log messages in ip6_tnl_err are printed at 'warn'
level. This is different to other tunnel types which don't print
any messages. These log messages don't provide any information that
couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".

This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.

Signed-off-by: Matt Bennett
Signed-off-by: David S. Miller

Matt Bennett
2015-09-25 07:08:37 +0800
deccbe80b Merge tag 'mac80211-for-davem-2015-09-22' of git://git.kernel.org/pub/scm/linux/… ... Browse Code »

…kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two small fixes:
* VHT MCS mask array overrun, reported by Dan Carpenter
* reset CQM history to always get a notification, from Sara Sharon
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2015-09-25 06:36:20 +0800
a46496ce3 ip6_gre: Reduce log level in ip6gre_err() to debug ... Browse Code »

Currently error log messages in ip6gre_err are printed at 'warn'
level. This is different to most other tunnel types which don't
print any messages. These log messages don't provide any information
that couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".

This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.

Signed-off-by: Matt Bennett
Signed-off-by: David S. Miller

Matt Bennett
2015-09-25 06:28:43 +0800
41fc01433 fib_rules: fix fib rule dumps across multiple skbs ... Browse Code »

dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.

This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.

Signed-off-by: Wilson Kok
Signed-off-by: Roopa Prabhu
Signed-off-by: David S. Miller

Wilson Kok
2015-09-25 06:21:54 +0800
d8aecb101 net: revert "net_sched: move tp->root allocation into fw_init()" ... Browse Code »

fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.

Fixes: 33f8b9ecdb15 ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar
Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2015-09-25 05:33:30 +0800
b194f30c6 lwtunnel: remove source and destination UDP port config option ... Browse Code »

The UDP tunnel config is asymmetric wrt. to the ports used. The source and
destination ports from one direction of the tunnel are not related to the
ports of the other direction. We need to be able to respond to ARP requests
using the correct ports without involving routing.

As the consequence, UDP ports need to be fixed property of the tunnel
interface and cannot be set per route. Remove the ability to set ports per
route. This is still okay to do, as no kernel has been released with these
attributes yet.

Note that the ability to specify source and destination ports is preserved
for other users of the lwtunnel API which don't use routes for tunnel key
specification (like openvswitch).

If in the future we rework ARP handling to allow port specification, the
attributes can be added back.

Signed-off-by: Jiri Benc
Acked-by: Thomas Graf
Signed-off-by: David S. Miller

Jiri Benc
2015-09-25 05:31:37 +0800
63d008a4e ipv4: send arp replies to the correct tunnel ... Browse Code »

When using ip lwtunnels, the additional data for xmit (basically, the actual
tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
metadata dst. When replying to ARP requests, we need to send the reply to
the same tunnel the request came from. This means we need to construct
proper metadata dst for ARP replies.

We could perform another route lookup to get a dst entry with the correct
lwtstate. However, this won't always ensure that the outgoing tunnel is the
same as the incoming one, and it won't work anyway for IPv4 duplicate
address detection.

The only thing to do is to "reverse" the ip_tunnel_info.

Signed-off-by: Jiri Benc
Acked-by: Thomas Graf
Signed-off-by: David S. Miller

Jiri Benc
2015-09-25 05:31:36 +0800
da314c992 netlink: Replace rhash_portid with bound ... Browse Code »

On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way. You have to pair store_release
> and load_acquire. Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above. Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here. This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart. It's about actually understanding
what's going on with the code. I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> > }
> > }
> >
> > - if (!nlk->portid) {
> > + if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable. That doesn't change anything. Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered. The two bound tests here used to be a single
one. Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket. So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> > return -EPERM;
> >
> > - if (!nlk->portid)
> > + if (!nlk->bound)
>
> Don't we need load_acquire here too? Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo
Reported-by: Linus Torvalds
Signed-off-by: Herbert Xu
Nacked-by: Tejun Heo
Signed-off-by: David S. Miller

Herbert Xu
2015-09-25 03:07:08 +0800

24 Sep, 2015

3 commits

d3869efe7 Fix AF_PACKET ABI breakage in 4.2 ... Browse Code »

Commit 7d82410950aa ("virtio: add explicit big-endian support to memory
accessors") accidentally changed the virtio_net header used by
AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

Since virtio_legacy_is_little_endian() is a very long identifier,
define a vio_le macro and use that throughout the code instead of the
hard-coded 'false' for little-endian.

This restores the ABI to match 4.1 and earlier kernels, and makes my
test program work again.

Signed-off-by: David Woodhouse
Signed-off-by: David S. Miller

David Woodhouse
2015-09-24 05:33:55 +0800
2d8bff126 netpoll: Close race condition between poll_one_napi and napi_disable ... Browse Code »

Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, its possible for a race condition to exist between
poll_one_napi and napi_disable. That is to say, poll_one_napi only tests the
NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
the following may happen:

CPU0 CPU1
ndo_tx_timeout napi_poll_dev
napi_disable poll_one_napi
test_and_set_bit (ret 0)
test_bit (ret 1)
reset adapter napi_poll_routine

If the adapter gets a tx timeout without a napi instance scheduled, its possible
for the adapter to think it has exclusive access to the hardware (as the napi
instance is now scheduled via the napi_disable call), while the netpoll code
thinks there is simply work to do. The result is parallel hardware access
leading to corrupt data structures in the driver, and a crash.

Additionaly, there is another, more critical race between netpoll and
napi_disable. The disabled napi state is actually identical to the scheduled
state for a given napi instance. The implication being that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was dong something
requiring exclusive access. In the case above, its fairly clear that not having
the rings in a state ready to be polled will cause any number of crashes.

The fix should be pretty easy. netpoll uses its own bit to indicate that that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit. That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion

Change notes:
V2)
Remove a trailing whtiespace
Resubmit with proper subject prefix

V3)
Clean up spacing nits

Signed-off-by: Neil Horman
CC: "David S. Miller"
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller

Neil Horman
2015-09-24 05:32:50 +0800
675ee231d tcp: add proper TS val into RST packets ... Browse Code »

RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val 7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0

B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0

We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp

Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :

Once TSopt has been successfully negotiated, that is both and
contain TSopt, the TSopt MUST be sent in every non-
segment for the duration of the connection, and SHOULD be sent in an
segment (see Section 5.2 for details)

Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :

When an segment is
received, it MUST NOT be subjected to the PAWS check by verifying an
acceptable value in SEG.TSval, and information from the Timestamps
option MUST NOT be used to update connection state information.
SEG.TSecr MAY be used to provide stricter acceptance checks.

In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.

Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller

Eric Dumazet
2015-09-24 05:24:07 +0800

23 Sep, 2015

3 commits

fbd03513b net: dsa: Fix Marvell Egress Trailer check ... Browse Code »

The Marvell Egress rx trailer check must be fixed to
correctly detect bad bits in the third byte of the
Eggress trailer as described in the Table 28 of the
88E6060 datasheet.
The current code incorrectly omits to check the third
byte and checks the fourth byte twice.

Signed-off-by: Neil Armstrong
Acked-by: Guenter Roeck
Signed-off-by: David S. Miller

Neil Armstrong
2015-09-23 08:37:03 +0800
ae5f2fb1d openvswitch: Zero flows on allocation. ... Browse Code »

When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller

Jesse Gross
2015-09-23 08:33:41 +0800
ac5be6b47 userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key" ... Browse Code »

This reverts commit 51360155eccb907ff8635bd10fc7de876408c2e0 and adapts
fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults. No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli
Cc: Dr. David Alan Gilbert
Cc: Michael Ellerman
Cc: Shuah Khan
Cc: Thierry Reding
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2015-09-23 06:09:53 +0800

22 Sep, 2015

4 commits

babc305e2 mac80211: reset CQM history upon reconfiguration ... Browse Code »

The current behavior of notifying CQM events is inconsistent:
Upon first configuration there is a cqm event with the current
status according to threshold configured, regardless of signal
stability.
When there is reconfiguration no event is sent unless there is
a significant change to the signal level according to the new
configuration.

Since the current reconfiguration behavior might cause missing
CQM events in case the current signal did not change but is on
the other side of the new threshold, fix that by resetting the
stored signal level upon reconfiguration.

Signed-off-by: Sara Sharon
Signed-off-by: Luca Coelho
Signed-off-by: Johannes Berg

Sara Sharon
2015-09-22 21:22:50 +0800
2df1b131b mac80211: fix VHT MCS mask array overrun ... Browse Code »

The HT MCS mask has 9 bytes, the VHT one only has 8 streams.
Split the loops to handle this correctly.

Reported-by: Dan Carpenter
Signed-off-by: Johannes Berg

Johannes Berg
2015-09-22 21:19:48 +0800
29c685260 inet: fix races in reqsk_queue_hash_req() ... Browse Code »

Before allowing lockless LISTEN processing, we need to make
sure to arm the SYN_RECV timer before the req socket is visible
in hash tables.

Also, req->rsk_hash should be written before we set rsk_refcnt
to a non zero value.

Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet
Cc: Ying Cai
Signed-off-by: David S. Miller

Eric Dumazet
2015-09-22 07:32:29 +0800
ed2e92394 tcp/dccp: fix timewait races in timer handling ... Browse Code »

When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.

As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.

This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.

Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.

To make things more readable I introduced inet_twsk_reschedule() helper.

When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.

Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.

A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.

Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet
Reported-by: Ying Cai
Signed-off-by: David S. Miller

Eric Dumazet
2015-09-22 07:32:29 +0800

21 Sep, 2015

5 commits

1f770c0a0 netlink: Fix autobind race condition that leads to zero port ID ... Browse Code »

The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID. This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo
Reported-by: Linus Torvalds
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller

Herbert Xu
2015-09-21 13:55:31 +0800
bc22a0e2e iptunnel: make rx/tx bytes counters consistent ... Browse Code »

This was already done a long time ago in
commit 64194c31a0b6 ("inet: Make tunnel RX/TX byte counters more consistent")
but tx path was broken (at least since 3.10).

Before the patch the gre header was included on tx.

After the patch:
$ ping -c1 192.168.0.121 ; ip -s l ls dev gre1
PING 192.168.0.121 (192.168.0.121) 56(84) bytes of data.
64 bytes from 192.168.0.121: icmp_req=1 ttl=64 time=2.95 ms

--- 192.168.0.121 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.955/2.955/2.955/0.000 ms
7: gre1@NONE: mtu 1468 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/gre 10.16.0.249 peer 10.16.0.121
RX: bytes packets errors dropped overrun mcast
84 1 0 0 0 0
TX: bytes packets errors dropped carrier collsns
84 1 0 0 0 0

Reported-by: Julien Meunier
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2015-09-21 13:36:22 +0800
ac8137449 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patch contains Netfilter fixes for your net tree, they are:

1) nf_log_unregister() should only set to NULL the logger that is being
unregistered, instead of everything else. Patch from Florian Westphal.

2) Fix a crash when accessing physoutdev from PREROUTING in br_netfilter.
This is partially reverting the patch to shrink nf_bridge_info to 32 bytes.
Also from Florian.

3) Use existing match/target extensions in the internal nft_compat extension
lists when the extension is family unspecific (ie. NFPROTO_UNSPEC).

4) Wait for rcu grace period before leaving nf_log_unregister().
====================

Signed-off-by: David S. Miller

David S. Miller
2015-09-21 13:32:20 +0800
4e3ae0010 tipc: reinitialize pointer after skb linearize ... Browse Code »

The msg pointer into header may change after skb linearization.
We must reinitialize it after calling skb_linearize to prevent
operating on a freed or invalid pointer.

Signed-off-by: Erik Hugne
Reported-by: Tamás Végh
Acked-by: Ying Xue
Signed-off-by: David S. Miller

Erik Hugne
2015-09-21 13:31:20 +0800
0315e3827 net: Fix behaviour of unreachable, blackhole and prohibit routes ... Browse Code »

Man page of ip-route(8) says following about route types:

unreachable - these destinations are unreachable. Packets are dis‐
carded and the ICMP message host unreachable is generated. The local
senders get an EHOSTUNREACH error.

blackhole - these destinations are unreachable. Packets are dis‐
carded silently. The local senders get an EINVAL error.

prohibit - these destinations are unreachable. Packets are discarded
and the ICMP message communication administratively prohibited is
generated. The local senders get an EACCES error.

In the inet6 address family, this was correct, except the local senders
got ENETUNREACH error instead of EHOSTUNREACH in case of unreachable route.
In the inet address family, all three route types generated ICMP message
net unreachable, and the local senders got ENETUNREACH error.

In both address families all three route types now behave consistently
with documentation.

Signed-off-by: Nikola Forró
Signed-off-by: David S. Miller

Nikola Forró
2015-09-21 12:45:08 +0800

20 Sep, 2015

2 commits

4b0ab51db SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose ... Browse Code »

Under all conditions, it should be quite sufficient just to mark
the socket as disconnected. It will then be closed by the
transport shutdown or reconnect code.

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-09-20 04:38:35 +0800
79234c3db SUNRPC: Lock the transport layer on shutdown ... Browse Code »

Avoid all races with the connect/disconnect handlers by taking the
transport lock.

Reported-by:"Suzuki K. Poulose"
Acked-by: Jeff Layton
Signed-off-by: Trond Myklebust

Trond Myklebust
2015-09-20 04:37:43 +0800

18 Sep, 2015

10 commits

c2e7204d1 tcp_cubic: do not set epoch_start in the future ... Browse Code »

Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
is normally set at ACK processing time, not at send time.

Doing a proper fix would need to add an additional state variable,
and does not seem worth the trouble, given CUBIC bug has been there
forever before Jana noticed it.

Let's simply not set epoch_start in the future, otherwise
bictcp_update() could overflow and CUBIC would again
grow cwnd too fast.

This was detected thanks to a packetdrill test Neal wrote that was flaky
before applying this fix.

Fixes: 30927520dbae ("tcp_cubic: better follow cubic curve after idle period")
Signed-off-by: Eric Dumazet
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Cc: Jana Iyengar
Signed-off-by: David S. Miller

Eric Dumazet
2015-09-18 13:35:07 +0800
1dbb2413c Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth ... Browse Code »

Johan Hedberg says:

====================
pull request: bluetooth 2015-09-17

Here's one important patch for the 4.3-rc series that fixes an issue
with Bluetooth LE encryption failing because of a too early check for
the SMP context.

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-09-18 13:25:51 +0800
34f5b0066 atm: deal with setting entry before mkip was called ... Browse Code »

If we didn't call ATMARP_MKIP before ATMARP_ENCAP the VCC descriptor is
non-existant and we'll end up dereferencing a NULL ptr:

[1033173.491930] kasan: GPF could be caused by NULL-ptr deref or user memory accessirq event stamp: 123386
[1033173.493678] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[1033173.493689] Modules linked in:
[1033173.493697] CPU: 9 PID: 23815 Comm: trinity-c64 Not tainted 4.2.0-next-20150911-sasha-00043-g353d875-dirty #2545
[1033173.493706] task: ffff8800630c4000 ti: ffff880063110000 task.ti: ffff880063110000
[1033173.493823] RIP: clip_ioctl (net/atm/clip.c:320 net/atm/clip.c:689)
[1033173.493826] RSP: 0018:ffff880063117a88 EFLAGS: 00010203
[1033173.493828] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 000000000000000c
[1033173.493830] RDX: 0000000000000002 RSI: ffffffffb3f10720 RDI: 0000000000000014
[1033173.493832] RBP: ffff880063117b80 R08: ffff88047574d9a4 R09: 0000000000000000
[1033173.493834] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1000c622f53
[1033173.493836] R13: ffff8800cb905500 R14: ffff8808d6da2000 R15: 00000000fffffdfd
[1033173.493840] FS: 00007fa56b92d700(0000) GS:ffff880478000000(0000) knlGS:0000000000000000
[1033173.493843] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1033173.493845] CR2: 0000000000000000 CR3: 00000000630e8000 CR4: 00000000000006a0
[1033173.493855] Stack:
[1033173.493862] ffffffffb0b60444 000000000000eaea 0000000041b58ab3 ffffffffb3c3ce32
[1033173.493867] ffffffffb0b6f3e0 ffffffffb0b60444 ffffffffb5ea2e50 1ffff1000c622f5e
[1033173.493873] ffff8800630c4cd8 00000000000ee09a ffffffffb3ec4888 ffffffffb5ea2de8
[1033173.493874] Call Trace:
[1033173.494108] do_vcc_ioctl (net/atm/ioctl.c:170)
[1033173.494113] vcc_ioctl (net/atm/ioctl.c:189)
[1033173.494116] svc_ioctl (net/atm/svc.c:605)
[1033173.494200] sock_do_ioctl (net/socket.c:874)
[1033173.494204] sock_ioctl (net/socket.c:958)
[1033173.494244] do_vfs_ioctl (fs/ioctl.c:43 fs/ioctl.c:607)
[1033173.494290] SyS_ioctl (fs/ioctl.c:622 fs/ioctl.c:613)
[1033173.494295] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
[1033173.494362] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 50 09 00 00 49 8b 9e 60 06 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 14 48 89 fa 48 c1 ea 03 b6 04 02 48 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 14 09 00
All code

========
0: fa cli
1: 48 c1 ea 03 shr $0x3,%rdx
5: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
9: 0f 85 50 09 00 00 jne 0x95f
f: 49 8b 9e 60 06 00 00 mov 0x660(%r14),%rbx
16: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
1d: fc ff df
20: 48 8d 7b 14 lea 0x14(%rbx),%rdi
24: 48 89 fa mov %rdi,%rdx
27: 48 c1 ea 03 shr $0x3,%rdx
2b:* 0f b6 04 02 movzbl (%rdx,%rax,1),%eax

Signed-off-by: Sasha Levin
Signed-off-by: David S. Miller

Sasha Levin
2015-09-18 13:13:32 +0800
1d325d217 ipv6: ip6_fragment: fix headroom tests and skb leak ... Browse Code »

David Woodhouse reports skb_under_panic when we try to push ethernet
header to fragmented ipv6 skbs:

skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 head:dec98000
data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
[..]
ip6_finish_output2+0x196/0x4da

David further debugged this:
[..] offending fragments were arriving here with skb_headroom(skb)==10.
Which is reasonable, being the Solos ADSL card's header of 8 bytes
followed by 2 bytes of PPP frame type.

The problem is that if netfilter ipv6 defragmentation is used, skb_cow()
in ip6_forward will only see reassembled skb.

Therefore, headroom is overestimated by 8 bytes (we pulled fragment
header) and we don't check the skbs in the frag_list either.

We can't do these checks in netfilter defrag since outdev isn't known yet.

Furthermore, existing tests in ip6_fragment did not consider the fragment
or ipv6 header size when checking headroom of the fraglist skbs.

While at it, also fix a skb leak on memory allocation -- ip6_fragment
must consume the skb.

I tested this e1000 driver hacked to not allocate additional headroom
(we end up in slowpath, since LL_RESERVED_SPACE is 16).

If 2 bytes of headroom are allocated, fastpath is taken (14 byte
ethernet header was pulled, so 16 byte headroom available in all
fragments).

Reported-by: David Woodhouse
Diagnosed-by: David Woodhouse
Signed-off-by: Florian Westphal
Tested-by: David Woodhouse
Signed-off-by: David S. Miller

Florian Westphal
2015-09-18 12:36:15 +0800
58189ca7b net: Fix vti use case with oif in dst lookups ... Browse Code »

Steffen reported that the recent change to add oif to dst lookups breaks
the VTI use case. The problem is that with the oif set in the flow struct
the comparison to the nh_oif is triggered. Fix by splitting the
FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache
bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare
nh oif (FLOWI_FLAG_SKIP_NH_OIF).

Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups")

Signed-off-by: David Ahern
Acked-by: Steffen Klassert
Signed-off-by: David S. Miller

David Ahern
2015-09-18 07:36:34 +0800
cc5706056 openvswitch: Fix IPv6 exthdr handling with ct helpers. ... Browse Code »

Static code analysis reveals the following bug:

net/openvswitch/conntrack.c:281 ovs_ct_helper()
warn: unsigned 'protoff' is never less than zero.

This signedness bug breaks error handling for IPv6 extension headers when
using conntrack helpers. Fix the error by using a local signed variable.

Fixes: cae3a2627520: "openvswitch: Allow attaching helpers to ct
action"
Reported-by: Dan Carpenter
Signed-off-by: Joe Stringer
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller

Joe Stringer
2015-09-18 06:31:49 +0800
0fdea1e8a SUNRPC: Ensure that we wait for connections to complete before retrying ... Browse Code »

Commit 718ba5b87343, moved the responsibility for unlocking the socket to
xs_tcp_setup_socket, meaning that the socket will be unlocked before we
know that it has finished trying to connect. The following patch is based on
an initial patch by Russell King to ensure that we delay clearing the
XPRT_CONNECTING flag until we either know that we failed to initiate
a connection attempt, or the connection attempt itself failed.

Fixes: 718ba5b87343 ("SUNRPC: Add helpers to prevent socket create from racing")
Reported-by: Russell King
Reported-by: Russell King
Tested-by: Russell King
Tested-by: Benjamin Coddington
Signed-off-by: Trond Myklebust

Trond Myklebust
2015-09-18 06:01:28 +0800
37a1d3611 ipv6: include NLM_F_REPLACE in route replace notifications ... Browse Code »

This patch adds NLM_F_REPLACE flag to ipv6 route replace notifications.
This makes nlm_flags in ipv6 replace notifications consistent
with ipv4.

Signed-off-by: Roopa Prabhu
Acked-by: Nicolas Dichtel
Reviewed-by: Michal Kubecek
Signed-off-by: David S. Miller

Roopa Prabhu
2015-09-18 06:00:27 +0800
17a9618e9 SUNRPC: drop null test before destroy functions ... Browse Code »

Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

//
@@ expression x; @@
-if (x != NULL)
$kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy$(x);
//

Signed-off-by: Julia Lawall
Signed-off-by: Trond Myklebust

Julia Lawall
2015-09-18 03:48:24 +0800
03c78827d SUNRPC: Fix races between socket connection and destroy code ... Browse Code »

When we're destroying the socket transport, we need to ensure that
we cancel any existing delayed connection attempts, and order them
w.r.t. the call to xs_close().

Reported-by:"Suzuki K. Poulose"
Acked-by: Jeff Layton
Signed-off-by: Trond Myklebust

Trond Myklebust
2015-09-18 03:48:23 +0800