Eric Lee / smarc-fsl-linux-kernel

01 Aug, 2015

2 commits

7c764cec3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Must teardown SR-IOV before unregistering netdev in igb driver, from
Alex Williamson.

2) Fix ipv6 route unreachable crash in IPVS, from Alex Gartrell.

3) Default route selection in ipv4 should take the prefix length, table
ID, and TOS into account, from Julian Anastasov.

4) sch_plug must have a reset method in order to purge all buffered
packets when the qdisc is reset, likewise for sch_choke, from WANG
Cong.

5) Fix deadlock and races in slave_changelink/br_setport in bridging.
From Nikolay Aleksandrov.

6) mlx4 bug fixes (wrong index in port event propagation to VFs,
overzealous BUG_ON assertion, etc.) from Ido Shamay, Jack
Morgenstein, and Or Gerlitz.

7) Turn off klog message about SCTP userspace interface compat that
makes no sense at all, from Daniel Borkmann.

8) Fix unbounded restarts of inet frag eviction process, causing NMI
watchdog soft lockup messages, from Florian Westphal.

9) Suspend/resume fixes for r8152 from Hayes Wang.

10) Fix busy loop when MSG_WAITALL|MSG_PEEK is used in TCP recv, from
Sabrina Dubroca.

11) Fix performance regression when removing a lot of routes from the
ipv4 routing tables, from Alexander Duyck.

12) Fix device leak in AF_PACKET, from Lars Westerhoff.

13) AF_PACKET also has a header length comparison bug due to signedness,
from Alexander Drozdov.

14) Fix bug in EBPF tail call generation on x86, from Daniel Borkmann.

15) Memory leaks, TSO stats, watchdog timeout and other fixes to
thunderx driver from Sunil Goutham and Thanneeru Srinivasulu.

16) act_bpf can leak memory when replacing programs, from Daniel
Borkmann.

17) WOL packet fixes in gianfar driver, from Claudiu Manoil.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (79 commits)
stmmac: fix missing MODULE_LICENSE in stmmac_platform
gianfar: Enable device wakeup when appropriate
gianfar: Fix suspend/resume for wol magic packet
gianfar: Fix warning when CONFIG_PM off
act_pedit: check binding before calling tcf_hash_release()
net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
net: sched: fix refcount imbalance in actions
r8152: reset device when tx timeout
r8152: add pre_reset and post_reset
qlcnic: Fix corruption while copying
act_bpf: fix memory leaks when replacing bpf programs
net: thunderx: Fix for crash while BGX teardown
net: thunderx: Add PCI driver shutdown routine
net: thunderx: Fix crash when changing rss with mutliple traffic flows
net: thunderx: Set watchdog timeout value
net: thunderx: Wakeup TXQ only if CQE_TX are processed
net: thunderx: Suppress alloc_pages() failure warnings
net: thunderx: Fix TSO packet statistic
net: thunderx: Fix memory leak when changing queue count
net: thunderx: Fix RQ_DROP miscalculation
...

Linus Torvalds
2015-08-01 08:10:56 +0800
5175f7106 act_pedit: check binding before calling tcf_hash_release()

When we share an action within a filter, the bind refcnt
should increase, therefore we should not call tcf_hash_release().

Fixes: 1a29321ed045 ("net_sched: act: Dont increment refcnt on replace")
Cc: Jamal Hadi Salim
Cc: Daniel Borkmann
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

WANG Cong
2015-08-01 06:22:34 +0800

31 Jul, 2015

2 commits

8a6817369 net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket

The newsk returned by sk_clone_lock should hold a get_net()
reference if, and only if, the parent is not a kernel socket
(making this similar to sk_alloc()).

E.g., for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
socket is a kernel socket (defined in sk_alloc() as having
sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
and should not hold a get_net() reference.
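
The rule above can be sketched in user space as follows. This is a toy
model, not the kernel code: toy_net, toy_sock and toy_clone() are
hypothetical names, and the counter only stands in for what get_net()
would do.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins; these are not the kernel's struct net / struct sock. */
struct toy_net {
    int count;                 /* models the netns reference count */
};

struct toy_sock {
    struct toy_net *net;
    int net_refcnt;            /* models sk_net_refcnt: 0 for kernel sockets */
};

/* Clone a socket; take a netns reference only if the parent holds one. */
static struct toy_sock *toy_clone(const struct toy_sock *parent)
{
    struct toy_sock *newsk;

    assert(parent != NULL && parent->net != NULL);
    newsk = malloc(sizeof(*newsk));
    if (newsk == NULL)
        return NULL;

    *newsk = *parent;          /* child inherits net and net_refcnt */
    if (newsk->net_refcnt)
        newsk->net->count++;   /* models get_net() */
    return newsk;
}
```

Cloning from a parent with net_refcnt == 0 leaves the counter untouched,
which is the behaviour the commit message asks for on the SYN_RECV path.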

Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the
netns of kernel sockets.")
Acked-by: Eric Dumazet
Cc: Eric W. Biederman
Signed-off-by: Sowmini Varadhan
Signed-off-by: David S. Miller

Sowmini Varadhan
2015-07-31 06:59:12 +0800
28e6b67f0 net: sched: fix refcount imbalance in actions

Since commit 55334a5db5cd ("net_sched: act: refuse to remove bound action
outside"), we end up with a wrong reference count for a tc action.

Test case 1:

FOO="1,6 0 0 4294967295,"
BAR="1,6 0 0 4294967294,"
tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
action bpf bytecode "$FOO"
tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 1 ref 1 bind 1
tc actions replace action bpf bytecode "$BAR" index 1
tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
index 1 ref 2 bind 1
tc actions replace action bpf bytecode "$FOO" index 1
tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 1 ref 3 bind 1

Test case 2:

FOO="1,6 0 0 4294967295,"
tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
tc actions show action gact
action order 0: gact action pass
random type none pass val 0
index 1 ref 1 bind 1
tc actions add action drop index 1
RTNETLINK answers: File exists [...]
tc actions show action gact
action order 0: gact action pass
random type none pass val 0
index 1 ref 2 bind 1
tc actions add action drop index 1
RTNETLINK answers: File exists [...]
tc actions show action gact
action order 0: gact action pass
random type none pass val 0
index 1 ref 3 bind 1

What happens is that in tcf_hash_check(), we check tcf_common for a given
index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
found an existing action. Now there are the following cases:

1) We do a late binding of an action. In that case, we leave the
tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
handler. This is correctly handled.

2) We replace the given action, or we try to add one without replacing
and find out that the action at a specific index already exists
(thus, we go out with error in that case).

In case of 2), we have to undo the reference count increase from
tcf_hash_check() in the tcf_hash_release() function. Currently, we fail to
do so because of the 'tcfc_bindcnt > 0' check which bails out early with
an -EPERM error.

Now, while commit 55334a5db5cd prevents 'tc actions del action ...' on an
already classifier-bound action from dropping the reference count (which
could then become negative, wrap around, etc.), this restriction only
accounts for invocations outside a specific action's ->init() handler.

One possible solution is to add a flag so that we trigger the -EPERM
only in situations where it is indeed relevant.

After the patch, the above test cases have the correct reference count again.
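
A toy model of the imbalance, assuming the flag is a boolean named
"strict" (a hypothetical name, as are toy_action, toy_check(),
toy_release() and toy_replace()). It is only a sketch of the counting
logic described above, not the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_action {
    int refcnt;    /* models tcfc_refcnt */
    int bindcnt;   /* models tcfc_bindcnt */
};

/* Lookup hit: always take a reference, and a bind reference if binding. */
static void toy_check(struct toy_action *a, bool bind)
{
    a->refcnt++;
    if (bind)
        a->bindcnt++;
}

/* strict == true models an explicit delete from outside ->init(). */
static int toy_release(struct toy_action *a, bool bind, bool strict)
{
    assert(a->refcnt > 0);
    if (bind)
        a->bindcnt--;
    else if (strict && a->bindcnt > 0)
        return -EPERM;         /* bails out before dropping the reference */
    a->refcnt--;
    return 0;
}

/* Replace path: look the action up, update it, drop the extra reference. */
static int toy_replace(struct toy_action *a, bool strict)
{
    toy_check(a, false);
    /* ... new parameters would be installed here ... */
    return toy_release(a, false, strict);
}
```

With strict set on the replace path, a bound action goes from "ref 1
bind 1" to "ref 2 bind 1", as in test case 1; with strict cleared the
count returns to 1.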

Fixes: 55334a5db5cd ("net_sched: act: refuse to remove bound action outside")
Signed-off-by: Daniel Borkmann
Reviewed-by: Cong Wang
Signed-off-by: David S. Miller

Daniel Borkmann
2015-07-31 05:20:39 +0800

30 Jul, 2015

5 commits

f4eaed28c act_bpf: fix memory leaks when replacing bpf programs

We currently trigger multiple memory leaks when replacing bpf
actions, among others:

comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s)
hex dump (first 32 bytes):
01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 ................
18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00 ...m............
backtrace:
[] kmemleak_alloc+0x4e/0xb0
[] __vmalloc_node_range+0x1bd/0x2c0
[] __vmalloc+0x4a/0x50
[] bpf_prog_alloc+0x3a/0xa0
[] bpf_prog_create+0x44/0xa0
[] tcf_bpf_init+0x28b/0x3c0 [act_bpf]
[] tcf_action_init_1+0x191/0x1b0
[] tcf_action_init+0x82/0xf0
[] tcf_exts_validate+0xb2/0xc0
[] cls_bpf_modify_existing+0x98/0x340 [cls_bpf]
[] cls_bpf_change+0x1a6/0x274 [cls_bpf]
[] tc_ctl_tfilter+0x335/0x910
[] rtnetlink_rcv_msg+0x95/0x240
[] netlink_rcv_skb+0xaf/0xc0
[] rtnetlink_rcv+0x2e/0x40
[] netlink_unicast+0xef/0x1b0

The issue is that the old content of tcf_bpf is allocated and needs
to be released when we replace it. We appear to have leaked the filter
and insns since the beginning of act_bpf, and later on the name as
well.

Example test case, after patch:

# FOO="1,6 0 0 4294967295,"
# BAR="1,6 0 0 4294967294,"
# tc actions add action bpf bytecode "$FOO" index 2
# tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 2 ref 1 bind 0
# tc actions replace action bpf bytecode "$BAR" index 2
# tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
index 2 ref 1 bind 0
# tc actions replace action bpf bytecode "$FOO" index 2
# tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 2 ref 1 bind 0
# tc actions del action bpf index 2
[...]
# echo "scan" > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l
0

Fixes: d23b8ad8ab23 ("tc: add BPF based action")
Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller

Daniel Borkmann
2015-07-30 14:56:22 +0800
c8507fb23 ipv6: flush nd cache on IFF_NOARP change

This patch is the IPv6 equivalent of commit
6c8b4e3ff81b ("arp: flush arp cache on IFF_NOARP change")

Without it, we keep buggy neighbours in the cache, with destination
MAC address equal to our own MAC address.

Tested:
tcpdump -i eth0 -s 0 ip6 -n -e &
ip link set dev eth0 arp off
ping6 remote // sends buggy frames
ip link set dev eth0 arp on
ping6 remote // should work once kernel is patched

Signed-off-by: Eric Dumazet
Reported-by: Mario Fanelli
Signed-off-by: David S. Miller

Eric Dumazet
2015-07-30 14:01:39 +0800
7ae90a4f9 bridge: mdb: fix delmdb state in the notification

Since mdb states were introduced, when deleting an entry the state was
left as it was set in the user's delete request, which leads to
the following output when doing a monitor (for example):
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 temp
^^^
Note the "temp" state in the delete notification, which is wrong since
the entry was permanent. The state in a delete is always reported as
"temp" regardless of the real state of the entry.

After this patch:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent

One important note: the state is actually not matched when doing a
delete, so one can delete a permanent entry by stating "temp" at the
end of the command. I've chosen this fix in order not to break
user-space tools which rely on this (incorrect) behaviour.

So to give an example after this patch and using the wrong state:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent

Note the state of the entry that got deleted is correct in the
notification.

Signed-off-by: Nikolay Aleksandrov
Fixes: ccb1c31a7a87 ("bridge: add flags to distinguish permanent mdb entires")
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2015-07-30 06:02:30 +0800
544586f74 bridge: mcast: give fast leave precedence over multicast router and querier

When fast leave is configured on a bridge port and an IGMP leave is
received for a group, the group is not deleted immediately if there is
a router detected or if multicast querier is configured.
Ideally the group should be deleted immediately when fast leave is
configured.

Signed-off-by: Satish Ashok
Signed-off-by: David S. Miller

Satish Ashok
2015-07-30 05:57:05 +0800
df356d5e8 bridge: Fix network header pointer for vlan tagged packets

There are several devices that can receive vlan tagged packets with
CHECKSUM_PARTIAL like tap, possibly veth and xennet.
When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
by bridge to a device with the IP_CSUM feature, they end up with checksum
error because before entering bridge, the network header is set to
ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.

Since the network header is expected to be ETH_HLEN in flow-dissection
and hash-calculation in RPS in rx path, and since the header pointer fix
is needed only in tx path, set the appropriate network header on forwarding
packets.

Signed-off-by: Toshiaki Makita
Signed-off-by: David S. Miller

Toshiaki Makita
2015-07-30 03:20:16 +0800

29 Jul, 2015

5 commits

dbd46ab41 packet: tpacket_snd(): fix signed/unsigned comparison

tpacket_fill_skb() can return a negative value (-errno) which
is stored in tp_len variable. In that case the following
condition will be (but shouldn't be) true:

tp_len > dev->mtu + dev->hard_header_len

as dev->mtu and dev->hard_header_len are both unsigned.

That may lead to just returning an incorrect EMSGSIZE errno
to the user.
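
The conversion can be reproduced in plain C. The sketch below uses a
hypothetical toy_dev struct in place of struct net_device, and
too_big_checked() shows one possible guard, not necessarily the form the
patch uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two unsigned fields of struct net_device. */
struct toy_dev {
    unsigned int mtu;
    unsigned short hard_header_len;
};

/*
 * Buggy form: the right-hand side is unsigned int, so tp_len is converted
 * to unsigned int as well. A negative errno such as -14 becomes a huge
 * positive value and the comparison is true.
 */
static bool too_big_buggy(int tp_len, const struct toy_dev *dev)
{
    assert(dev != NULL);
    return tp_len > dev->mtu + dev->hard_header_len;
}

/* One possible guard: treat negative values as errors before comparing. */
static bool too_big_checked(int tp_len, const struct toy_dev *dev)
{
    assert(dev != NULL);
    if (tp_len < 0)
        return false;          /* an error code, not an oversized packet */
    return (unsigned int)tp_len > dev->mtu + dev->hard_header_len;
}
```

With mtu = 1500 and hard_header_len = 14, a tp_len of -14 is reported as
too big by the first function and correctly ignored by the second.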

Fixes: 52f1454f629fa ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
Signed-off-by: Alexander Drozdov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Alexander Drozdov
2015-07-29 15:09:58 +0800
11c91ef98 arp: filter NOARP neighbours for SIOCGARP

When arp is off on a device, and ioctl(SIOCGARP) is queried,
a buggy answer is given with MAC address of the device, instead
of the MAC address of the destination/gateway.

We filter out NUD_NOARP neighbours for /proc/net/arp,
we must do the same for SIOCGARP ioctl.

Tested:

lpaa23:~# ./arp 10.246.7.190
MAC=00:01:e8:22:cb:1d // correct answer

lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# cat /proc/net/arp # check arp table is now 'empty'
IP address HW type Flags HW address Mask Device
lpaa23:~# ./arp 10.246.7.190
MAC=00:1a:11:c3:0d:7f // buggy answer before patch (this is eth0 mac)

After patch:

lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# ./arp 10.246.7.190
ioctl(SIOCGARP) failed: No such device or address

Signed-off-by: Eric Dumazet
Reported-by: Vytautas Valancius
Cc: Willem de Bruijn
Signed-off-by: David S. Miller

Eric Dumazet
2015-07-29 14:41:24 +0800
865b80424 net/ipv4: suppress NETDEV_UP notification on address lifetime update

This notification causes the FIB to be updated, which is not needed
because the address already exists, and more importantly it may undo
intentional changes that were made to the FIB after the address was
originally added. (As a point of comparison, when an address becomes
deprecated because its preferred lifetime expired, a notification on
this chain is not generated.)

The motivation for this commit is fixing an incompatibility between
DHCP clients which set and update the address lifetime according to
the lease, and a commercial VPN client which replaces kernel routes
in a way that outbound traffic is sent only through the tunnel (and
disconnects if any further route changes are detected via netlink).

Signed-off-by: David Ward
Signed-off-by: David S. Miller

David Ward
2015-07-29 14:38:13 +0800
76b91c32d bridge: stp: when using userspace stp stop kernel hello and hold timers

These should be handled only by the respective STP which is in control.
They become problematic for devices with limited resources and many
ports, because the hold_timer is per port and fires every second, and the
hello timer fires every 2 seconds even though it's global. In
user-space STP mode these timers are completely unnecessary, so it's better
to keep them off.
Also ensure that when the bridge is up these timers are started only when
running with kernel STP.

Signed-off-by: Satish Ashok
Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2015-07-29 14:33:20 +0800
d8132e08d Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:

Stable patches:
- Fix a situation where the client uses the wrong (zero) stateid.
- Fix a memory leak in nfs_do_recoalesce

Bugfixes:
- Plug a memory leak when ->prepare_layoutcommit fails
- Fix an Oops in the NFSv4 open code
- Fix a backchannel deadlock
- Fix a livelock in sunrpc when sendmsg fails due to low memory
availability
- Don't revalidate the mapping if both size and change attr are up to
date
- Ensure we don't miss a file extension when doing pNFS
- Several fixes to handle NFSv4.1 sequence operation status bits
correctly
- Several pNFS layout return bugfixes"

* tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
nfs: plug memory leak when ->prepare_layoutcommit fails
SUNRPC: Report TCP errors to the caller
sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
NFSv4.2: handle NFS-specific llseek errors
NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
NFS: Fix a memory leak in nfs_do_recoalesce
NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
NFS: Don't revalidate the mapping if both size and change attr are up to date
NFSv4/pnfs: Ensure we don't miss a file extension
NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
SUNRPC: xprt_complete_bc_request must also decrement the free slot count
SUNRPC: Fix a backchannel deadlock
pNFS: Don't throw out valid layout segments
pNFS: pnfs_roc_drain() fix a race with open
pNFS: Fix races between return-on-close and layoutreturn.
pNFS: pnfs_roc_drain should return 'true' when sleeping
pNFS: Layoutreturn must invalidate all existing layout segments.
...

Linus Torvalds
2015-07-29 00:37:44 +0800

28 Jul, 2015

3 commits

158cd4af8 packet: missing dev_put() in packet_do_bind()

When binding a PF_PACKET socket, the use count of the bound interface is
always increased with dev_hold in dev_get_by_{index,name}. However,
when rebound with the same protocol and device as in the previous bind,
the use count of the interface was not decreased. Ultimately, this
caused the deletion of the interface to fail with the following message:

unregister_netdevice: waiting for dummy0 to become free. Usage count = 1

This patch moves the dev_put out of the conditional part that was only
executed when either the protocol or device changed on a bind.
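
A toy model of the leak and of moving the put out of the conditional.
All names here (toy_dev, toy_po, toy_bind_buggy(), toy_bind_fixed()) are
hypothetical, and the sketch assumes a non-NULL device for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dev {
    int usage;                 /* models the device use count */
};

struct toy_po {
    struct toy_dev *bound;     /* device the socket is currently bound to */
    int proto;
};

static void toy_bind_buggy(struct toy_po *po, struct toy_dev *dev, int proto)
{
    bool need_rehook;

    assert(po != NULL && dev != NULL);
    dev->usage++;              /* models dev_hold() in dev_get_by_*() */
    need_rehook = po->proto != proto || po->bound != dev;
    if (need_rehook) {
        if (po->bound)
            po->bound->usage--;    /* models dev_put() on the old device */
        po->proto = proto;
        po->bound = dev;           /* the socket keeps the new hold */
    }
    /* same proto and device: the hold taken above is never dropped */
}

static void toy_bind_fixed(struct toy_po *po, struct toy_dev *dev, int proto)
{
    bool need_rehook;

    assert(po != NULL && dev != NULL);
    dev->usage++;
    need_rehook = po->proto != proto || po->bound != dev;
    if (po->bound)
        po->bound->usage--;    /* now unconditional */
    if (need_rehook) {
        po->proto = proto;
        po->bound = dev;
    }
}
```

Binding twice to the same device and protocol leaves the count at 2 in
the buggy variant and at 1 in the fixed one.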

Fixes: 902fefb82ef7 ('packet: improve socket create/bind latency in some cases')
Signed-off-by: Lars Westerhoff
Signed-off-by: Dan Carpenter
Reviewed-by: Daniel Borkmann
Signed-off-by: David S. Miller

Lars Westerhoff
2015-07-28 06:38:58 +0800
f580dd042 SUNRPC: Report TCP errors to the caller

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-07-28 05:56:57 +0800
1513069ed fib_trie: Drop unnecessary calls to leaf_pull_suffix

It was reported that update_suffix was taking a long time on systems where
a large number of leaves were attached to a single node. As it turns out
fib_table_flush was calling update_suffix for each leaf that didn't have all
of the aliases stripped from it. As a result, on this large node removing
one leaf would result in us calling update_suffix for every other leaf on
the node.

The fix is to just remove the calls to leaf_pull_suffix since they are
redundant as we already have a call in resize that will go through and
update the suffix length for the node before we exit out of
fib_table_flush or fib_table_flush_external.

Reported-by: David Ahern
Signed-off-by: Alexander Duyck
Tested-by: David Ahern
Signed-off-by: David S. Miller

Alexander Duyck
2015-07-28 05:29:11 +0800

27 Jul, 2015

9 commits

743c69e7c sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.

The networking layer does not reliably report the distinction between
a non-blocking write failing because:
1/ the queue is too full already and
2/ a memory allocation attempt failed.

The distinction is important because in the first case it is
appropriate to retry as soon as the socket reports that it is
writable, and in the second case a small delay is required as the
socket will most likely report as writable but kmalloc could still
fail.

sk_stream_wait_memory() exhibits this distinction nicely, setting
'vm_wait' if a small wait is needed. However in the non-blocking case
it always returns -EAGAIN no matter the cause of the failure. This
-EAGAIN can get all the way to sunrpc.

The sunrpc layer expects EAGAIN to indicate the first cause, and
ENOBUFS to indicate the second. Various documentation suggests that
this is not unreasonable, but does not guarantee the desired error
codes.

The result of getting -EAGAIN when -ENOBUFS is expected is that the
send is tried again in a tight loop and soft lockups are reported.

So: add tests after calls to xs_sendpages() to translate -EAGAIN into
-ENOBUFS if the socket is writable. This cannot happen inside
xs_sendpages() as the test for "is socket writable" is different
between TCP and UDP.

With this change, the tight loop retrying xs_sendpages() becomes a
loop which only retries every 250ms, and so will not trigger a
soft-lockup warning.

It is possible that the write did fail because the queue was too full
and by the time xs_sendpages() completed, the queue was writable
again. In this case an extra 250ms delay is inserted that isn't
really needed. This circumstance suggests a degree of congestion so a
delay is not necessarily a bad thing, and it can only cause a single
250ms delay, not a series of them.
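
The translation step can be sketched as a pure function. The function
names below are hypothetical; only the error codes and the 250ms figure
come from the description above, and the "is socket writable" test is
passed in because it differs between TCP and UDP.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * If the send failed with -EAGAIN although the socket is writable, the
 * cause was most likely a failed allocation, so report -ENOBUFS instead.
 */
static int toy_translate_send_status(int status, bool sock_writable)
{
    if (status == -EAGAIN && sock_writable)
        return -ENOBUFS;
    return status;
}

/*
 * Delay before the next attempt: -ENOBUFS backs off for 250ms, while
 * -EAGAIN simply waits for the socket to become writable (no fixed delay).
 */
static unsigned int toy_retry_delay_ms(int status)
{
    assert(status < 0);        /* only meaningful for a failed send */
    return status == -ENOBUFS ? 250u : 0u;
}
```

A caller would run the translation right after the send attempt and
then pick the delay from the translated status.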

Signed-off-by: NeilBrown
Signed-off-by: Trond Myklebust

NeilBrown
2015-07-27 23:16:56 +0800
dfbafc995 tcp: fix recv with flags MSG_WAITALL | MSG_PEEK

Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.

sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.

Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.
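
The change in the wake-up condition can be illustrated with a toy queue.
toy_skb, toy_queue and the two predicates are hypothetical names; the
sketch only contrasts "queue not empty" with "tail differs from the last
skb seen".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_skb {
    int len;
};

struct toy_queue {
    const struct toy_skb *tail;    /* NULL when the queue is empty */
};

/* Old condition: wake as soon as the queue is not empty. */
static bool toy_wake_old(const struct toy_queue *q)
{
    assert(q != NULL);
    return q->tail != NULL;
}

/* New condition: wake only when something was queued after 'last'. */
static bool toy_wake_new(const struct toy_queue *q, const struct toy_skb *last)
{
    assert(q != NULL);
    return q->tail != last;
}
```

With MSG_PEEK the already-peeked skb stays on the queue, so the old
predicate is immediately true on every pass, while the new one stays
false until another skb arrives.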

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258
Reported-by: Enrico Scholz
Reported-by: Dan Searle
Signed-off-by: Sabrina Dubroca
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Sabrina Dubroca
2015-07-27 16:06:53 +0800
03de104f7 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Johan Hedberg says:

====================
pull request: bluetooth 2015-07-23

Here's another one-patch pull request for 4.2 which targets a potential
NULL pointer dereference in the LE Security Manager code that can be
triggered by using older user space tools. The issue has been there
since 4.0 so there's the appropriate "Cc: stable" in place.

Let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-07-27 12:53:08 +0800
caaecdd3d inet: frags: remove INET_FRAG_EVICTED and use list_evictor for the test

We can simply remove the INET_FRAG_EVICTED flag to avoid all the flags
race conditions with the evictor and use a participation test for the
evictor list. When we're at that point (after inet_frag_kill) in the
timer, there are 2 possible cases:

1. The evictor added the entry to its evictor list while the timer was
waiting for the chainlock
or
2. The timer unchained the entry and the evictor won't see it

In both cases we should be able to see list_evictor correctly due
to the sync on the chainlock.

Joint work with Florian Westphal.

Tested-by: Frank Schreuder
Signed-off-by: Nikolay Aleksandrov
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2015-07-27 12:00:15 +0800
5719b296f inet: frag: don't wait for timer deletion when evicting

Frank reports 'NMI watchdog: BUG: soft lockup' errors when
load is high. Instead of (potentially) unbounded restarts of the
eviction process, just skip to the next entry.

One caveat is that, when a netns is exiting, a timer may still be running
by the time inet_evict_bucket returns.

We use the frag memory accounting to wait for outstanding timers,
so that when we free the percpu counter we can be sure no running
timer will trip over it.

Reported-and-tested-by: Frank Schreuder
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2015-07-27 12:00:14 +0800
0e60d245a inet: frag: change *_frag_mem_limit functions to take netns_frags as argument

A follow-up patch will call it after the inet_frag_queue was freed, so q->net
doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would).
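
The pattern in parentheses, spelled out as a user-space sketch. The
types and functions are toy stand-ins with hypothetical names; the point
is only that the netns_frags pointer is saved before the queue is freed.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_netns_frags {
    int mem;                       /* models the frag memory accounting */
};

struct toy_frag_queue {
    struct toy_netns_frags *net;
};

/* Takes the netns_frags directly, so it never touches a queue. */
static int toy_mem_limit(const struct toy_netns_frags *nf)
{
    return nf->mem;
}

/* Free the queue, then consult the accounting via the saved pointer. */
static int toy_destroy(struct toy_frag_queue *q)
{
    struct toy_netns_frags *nf;

    assert(q != NULL && q->net != NULL);
    nf = q->net;                   /* save before q goes away */
    free(q);
    return toy_mem_limit(nf);      /* q->net would be a use-after-free here */
}
```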

Tested-by: Frank Schreuder
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2015-07-27 12:00:14 +0800
d1fe19444 inet: frag: don't re-use chainlist for evictor

commit 65ba1f1ec0eff ("inet: frags: fix a race between inet_evict_bucket
and inet_frag_kill") describes the bug, but the fix doesn't work reliably.

The problem is that the ->flags member can be set on another CPU without
the chainlock being held by that task, i.e. the RMW cycle can clear the
INET_FRAG_EVICTED bit after we put the element on the evictor private list.

We can crash when walking the 'private' evictor list since an element can
be deleted from the list underneath the evictor.

Joint work with Nikolay Alexandrov.

Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue")
Reported-by: Johan Schuijt
Tested-by: Frank Schreuder
Signed-off-by: Nikolay Alexandrov
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2015-07-27 12:00:14 +0800
81296fc67 net: sctp: stop spamming klog with rfc6458, 5.3.2. deprecation warnings

Back then when we added support for SCTP_SNDINFO/SCTP_RCVINFO from
RFC6458 5.3.4/5.3.5, we decided to add a deprecation warning for the
(as per RFC deprecated) SCTP_SNDRCV via commit bbbea41d5e53 ("net:
sctp: deprecate rfc6458, 5.3.2. SCTP_SNDRCV support"), see [1].

Imho, it was not a good idea, and we should just revert that message
for a couple of reasons:

1) It's uapi and therefore set in stone forever.

2) To be able to run on older and newer kernels, an SCTP application
would need to probe for both, SCTP_SNDRCV, but also SCTP_SNDINFO/
SCTP_RCVINFO support, so that on older kernels, it can make use
of SCTP_SNDRCV, and on newer kernels SCTP_SNDINFO/SCTP_RCVINFO.
In my (limited) experience, a lot of SCTP appliances are migrating
to newer kernels only ve(ee)ry slowly.

3) Some people don't have the chance to change their applications,
e.g. due to proprietary legacy stuff. So, they'll hit this warning
in fast path and are stuck with older kernels.

But especially due to point 1) I really fail to see the benefit of a
warning. So just revert that for now; the issue was reported by Jamal.

[1] http://thread.gmane.org/gmane.linux.network/321960/

Reported-by: Jamal Hadi Salim
Signed-off-by: Daniel Borkmann
Cc: Michael Tuexen
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Daniel Borkmann
2015-07-27 07:32:41 +0800
963ad9485 bridge: netlink: fix slave_changelink/br_setport race conditions

Since slave_changelink support was added there have been a few race
conditions when using br_setport() since some of the port functions it
uses require the bridge lock. It is very easy to trigger a lockup due to
some internal spin_lock() usage without bh disabled, also it's possible to
get the bridge into an inconsistent state.

Signed-off-by: Nikolay Aleksandrov
Fixes: 3ac636b8591c ("bridge: implement rtnl_link_ops->slave_changelink")
Reviewed-by: Jiri Pirko
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2015-07-27 07:27:22 +0800

25 Jul, 2015

6 commits

485164381 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains ten Netfilter/IPVS fixes, they are:

1) Address refcount leak when creating an expectation from the ctnetlink
interface.

2) Fix bug splat in the IDLETIMER target related to sysfs, from Dmitry
Torokhov.

3) Resolve panic for unreachable route in IPVS with locally generated
traffic in the output path, from Alex Gartrell.

4) Fix wrong source address in rare cases for tunneled traffic in IPVS,
from Julian Anastasov.

5) Fix crash if scheduler is changed via ipvsadm -E, again from Julian.

6) Make sure skb->sk is unset for forwarded traffic through IPVS, again from
Alex Gartrell.

7) Fix crash with IPVS sync protocol v0 and FTP, from Julian.

8) Reset sender cpu for forwarded traffic in IPVS, also from Julian.

9) Allocate template conntracks through kmalloc() to resolve netns dependency
problems with the conntrack kmem_cache.

10) Fix zones with expectations that clash using the same tuple, from Joe
Stringer.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-07-25 15:18:10 +0800
cc9f4daa6 cgroup: net_cls: fix false-positive "suspicious RCU usage"

In dev_queue_xmit(), net_cls is protected with rcu-bh.

[ 270.730026] ===============================
[ 270.730029] [ INFO: suspicious RCU usage. ]
[ 270.730033] 4.2.0-rc3+ #2 Not tainted
[ 270.730036] -------------------------------
[ 270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage!
[ 270.730041] other info that might help us debug this:
[ 270.730043] rcu_scheduler_active = 1, debug_locks = 1
[ 270.730045] 2 locks held by dhclient/748:
[ 270.730046] #0: (rcu_read_lock_bh){......}, at: [] __dev_queue_xmit+0x50/0x960
[ 270.730085] #1: (&qdisc_tx_lock){+.....}, at: [] __dev_queue_xmit+0x240/0x960
[ 270.730090] stack backtrace:
[ 270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2
[ 270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[ 270.730100] 0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007
[ 270.730103] ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00
[ 270.730105] ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8
[ 270.730108] Call Trace:
[ 270.730121] [] dump_stack+0x4c/0x65
[ 270.730148] [] lockdep_rcu_suspicious+0xe2/0x120
[ 270.730153] [] task_cls_state+0x92/0xa0
[ 270.730158] [] cls_cgroup_classify+0x4f/0x120 [cls_cgroup]
[ 270.730164] [] tc_classify_compat+0x74/0xc0
[ 270.730166] [] tc_classify+0x33/0x90
[ 270.730170] [] htb_enqueue+0xaa/0x4a0 [sch_htb]
[ 270.730172] [] __dev_queue_xmit+0x306/0x960
[ 270.730174] [] ? __dev_queue_xmit+0x50/0x960
[ 270.730176] [] dev_queue_xmit_sk+0x13/0x20
[ 270.730185] [] dev_queue_xmit+0x10/0x20
[ 270.730187] [] packet_snd.isra.62+0x54c/0x760
[ 270.730190] [] packet_sendmsg+0x2f5/0x3f0
[ 270.730203] [] ? sock_def_readable+0x5/0x190
[ 270.730210] [] ? _raw_spin_unlock+0x2b/0x40
[ 270.730216] [] ? unix_dgram_sendmsg+0x5cc/0x640
[ 270.730219] [] sock_sendmsg+0x47/0x50
[ 270.730221] [] sock_write_iter+0x7f/0xd0
[ 270.730232] [] __vfs_write+0xa7/0xf0
[ 270.730234] [] vfs_write+0xb8/0x190
[ 270.730236] [] SyS_write+0x52/0xb0
[ 270.730239] [] entry_SYSCALL_64_fastpath+0x12/0x76

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: David S. Miller

Konstantin Khlebnikov
2015-07-25 15:13:18 +0800
77e62da6e sch_choke: drop all packets in queue during reset

Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

WANG Cong
2015-07-25 13:57:15 +0800
fe6bea7f1 sch_plug: purge buffered packets during reset

Otherwise the skbuff related structures are not correctly
refcount'ed.

Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

WANG Cong
2015-07-25 13:57:14 +0800
2392debc2 ipv4: consider TOS in fib_select_default

fib_select_default considers alternative routes only when
res->fi is for the first alias in res->fa_head. In the
common case this can happen only when the initial lookup
matches the first alias with highest TOS value. This
prevents the alternative routes from requiring a specific TOS.

This patch solves the problem as follows:

- routes that require specific TOS should be returned by
fib_select_default only when TOS matches, as already done
in fib_table_lookup. This rule implies that depending on the
TOS we can have many different lists of alternative gateways
and we have to keep the last used gateway (fa_default) in first
alias for the TOS instead of using single tb_default value.

- as the aliases are ordered by many keys (TOS desc,
fib_priority asc), we restrict the possible results to
routes with matching TOS and lowest metric (fib_priority)
and routes that match any TOS, again with lowest metric.

For example, a packet with TOS 8 cannot use gw3 (not lowest
metric), gw4 (different TOS) or gw6 (not lowest metric);
all other gateways can be used:

tos 8 via gw1 metric 2    <-- fa_head and res->fi
tos 8 via gw2 metric 2
tos 8 via gw3 metric 3
tos 4 via gw4
tos 0 via gw5
tos 0 via gw6 metric 1
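
The selection rule above can be sketched in userspace C. This is only an
illustrative model, not the kernel code: struct alias, select_candidates()
and the example[] table are hypothetical names, an omitted metric is taken
to be 0, and the list is assumed to be sorted by TOS descending and then by
metric ascending, as the message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a fib alias (not the kernel struct). */
struct alias {
	const char *gw;
	unsigned char tos;
	unsigned int metric;
};

/* The example list from the text; a missing metric is assumed to be 0. */
static const struct alias example[6] = {
	{ "gw1", 8, 2 },
	{ "gw2", 8, 2 },
	{ "gw3", 8, 3 },
	{ "gw4", 4, 0 },
	{ "gw5", 0, 0 },
	{ "gw6", 0, 1 },
};

/*
 * Mark which aliases may serve a packet with the given TOS.
 * The list must be sorted by TOS descending, then metric ascending, so the
 * first alias of each TOS group carries that group's lowest metric.
 * Returns the number of usable aliases.
 */
static size_t select_candidates(const struct alias *a, size_t n,
				unsigned char tos, bool *usable)
{
	size_t count = 0;

	assert(a != NULL && usable != NULL);
	for (size_t i = 0; i < n; i++) {
		/* Walk back to the first alias that shares this TOS. */
		size_t first = i;

		while (first > 0 && a[first - 1].tos == a[i].tos)
			first--;

		/* Either the TOS matches exactly, or the route accepts any TOS. */
		bool tos_ok = (a[i].tos == tos) || (a[i].tos == 0);

		usable[i] = tos_ok && a[i].metric == a[first].metric;
		if (usable[i])
			count++;
	}
	return count;
}
```

Calling select_candidates(example, 6, 8, usable) marks gw1, gw2 and gw5 as
usable, matching the example in the message.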

Reported-by: Hagen Paul Pfeifer
Signed-off-by: Julian Anastasov
Signed-off-by: David S. Miller

Julian Anastasov
2015-07-25 13:46:11 +0800
18a912e9a ipv4: fib_select_default should match the prefix

fib_trie starting from 4.1 can link fib aliases from
different prefixes in same list. Make sure the alternative
gateways are in same table and for same prefix (0) by
checking tb_id and fa_slen.
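
As a rough illustration of that check, the sketch below walks a mixed list
and counts only the entries that share the table ID and suffix length of
the route being resolved. It is a userspace model under stated assumptions:
struct alias_key and count_alternatives() are hypothetical, and only the
field names tb_id and slen echo the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of a fib alias to the two fields being compared. */
struct alias_key {
	unsigned int tb_id;	/* routing table the alias belongs to */
	unsigned char slen;	/* suffix length, identifies the prefix */
};

/*
 * Count the aliases that can act as alternative gateways: entries from
 * another table or for another prefix are skipped, not treated as matches.
 */
static size_t count_alternatives(const struct alias_key *list, size_t n,
				 unsigned int tb_id, unsigned char slen)
{
	size_t count = 0;

	assert(list != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (list[i].slen != slen)
			continue;	/* different prefix in the same list */
		if (list[i].tb_id != tb_id)
			continue;	/* different table */
		count++;
	}
	return count;
}
```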

Fixes: 79e5ad2ceb00 ("fib_trie: Remove leaf_info")
Signed-off-by: Julian Anastasov
Signed-off-by: David S. Miller

Julian Anastasov
2015-07-25 13:46:09 +0800

24 Jul, 2015

1 commit

d1a343a02 Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio/vhost fixes from Michael Tsirkin:
"Bugfixes and documentation fixes.

Igor's patch that allows users to tweak memory table size is
borderline, but it does fix known crashes, so I merged it"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: add max_mem_regions module parameter
vhost: extend memory regions allocation to vmalloc
9p/trans_virtio: reset virtio device on remove
virtio/s390: rename drivers/s390/kvm -> drivers/s390/virtio
MAINTAINERS: separate section for s390 virtio drivers
virtio: define virtio_pci_cfg_cap in header.
virtio: Fix typecast of pointer in vring_init()
virtio scsi: fix unused variable warning
vhost: use binary search instead of linear in find_region()
virtio_net: document VIRTIO_NET_CTRL_GUEST_OFFLOADS

Linus Torvalds
2015-07-24 04:07:04 +0800

23 Jul, 2015

4 commits

25ba26539 Bluetooth: Fix NULL pointer dereference in smp_conn_security

The l2cap_conn->smp pointer may be NULL for various valid reasons where SMP has
failed to initialize properly. One such scenario is when crypto support is
missing, another when the adapter has been powered on through a legacy method.
The smp_conn_security() function should have the appropriate check for this
situation to avoid NULL pointer dereferences.

Signed-off-by: Johan Hedberg
Signed-off-by: Marcel Holtmann
Cc: stable@vger.kernel.org # 4.0+

Johan Hedberg
2015-07-23 22:41:24 +0800
c5dfd654d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Don't use shared bluetooth antenna in iwlwifi driver for management
frames, from Emmanuel Grumbach.

2) Fix device ID check in ath9k driver, from Felix Fietkau.

3) Off by one in xen-netback BUG checks, from Dan Carpenter.

4) Fix IFLA_VF_PORT netlink attribute validation, from Daniel Borkmann.

5) Fix races in setting peeked bit flag in SKBs during datagram
receive. If it's shared we have to clone it otherwise the value can
easily be corrupted. Fix from Herbert Xu.

6) Revert fec clock handling change, causes regressions. From Fabio
Estevam.

7) Fix use after free in fq_codel and sfq packet schedulers, from WANG
Cong.

8) ipvlan bug fixes (memory leaks, missing rcu_dereference_bh, etc.)
from WANG Cong and Konstantin Khlebnikov.

9) Memory leak in act_bpf packet action, from Alexei Starovoitov.

10) ARM bpf JIT bug fixes from Nicolas Schichan.

11) Fix backwards compat of ANY_LAYOUT in virtio_net driver, from
Michael S Tsirkin.

12) Destruction of bond with different ARP header types not handled
correctly, fix from Nikolay Aleksandrov.

13) Revert GRO receive support in ipv6 SIT tunnel driver, causes
regressions because the GRO packets created cannot be processed
properly on the GSO side if we forward the frame. From Herbert Xu.

14) TCCR update race and other fixes to ravb driver from Sergei
Shtylyov.

15) Fix SKB leaks in caif_queue_rcv_skb(), from Eric Dumazet.

16) Fix panics on packet scheduler filter replace, from Daniel Borkmann.

17) Make sure AF_PACKET sees properly IP headers in defragmented frames
(via PACKET_FANOUT_FLAG_DEFRAG option), from Edward Hyunkoo Jee.

18) AF_NETLINK cannot hold mutex in RCU callback, fix from Florian
Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (84 commits)
ravb: fix ring memory allocation
net: phy: dp83867: Fix warning check for setting the internal delay
openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes
netlink: don't hold mutex in rcu callback when releasing mmapd ring
ARM: net: fix vlan access instructions in ARM JIT.
ARM: net: handle negative offsets in BPF JIT.
ARM: net: fix condition for load_order > 0 when translating load instructions.
tcp: suppress a division by zero warning
drivers: net: cpsw: remove tx event processing in rx napi poll
inet: frags: fix defragmented packet's IP header for af_packet
net: mvneta: fix refilling for Rx DMA buffers
stmmac: fix setting of driver data in stmmac_dvr_probe
sched: cls_flow: fix panic on filter replace
sched: cls_flower: fix panic on filter replace
sched: cls_bpf: fix panic on filter replace
net/mdio: fix mdio_bus_match for c45 PHY
net: ratelimit warnings about dst entry refcount underflow or overflow
caif: fix leaks and race in caif_queue_rcv_skb()
qmi_wwan: add the second QMI/network interface for Sierra Wireless MC7305/MC7355
ravb: fix race updating TCCR
...

Linus Torvalds
2015-07-23 05:45:25 +0800
1980bd4d8 SUNRPC: xprt_complete_bc_request must also decrement the free slot count

Calling xprt_complete_bc_request() effectively causes the slot to be allocated,
so it needs to decrement the backchannel free slot count as well.

Fixes: 0d2a970d0ae5 ("SUNRPC: Fix a backchannel race")
Signed-off-by: Trond Myklebust

Trond Myklebust
2015-07-23 05:10:50 +0800
68514471c SUNRPC: Fix a backchannel deadlock

xprt_alloc_bc_request() cannot call xprt_free_bc_request() without
deadlocking, since it already holds the xprt->bc_pa_lock.

Reported-by: Chuck Lever
Fixes: 0d2a970d0ae55 ("SUNRPC: Fix a backchannel race")
Signed-off-by: Trond Myklebust

Trond Myklebust
2015-07-23 04:33:59 +0800

22 Jul, 2015

3 commits

4b31814d2 netfilter: nf_conntrack: Support expectations in different zones

When zones were originally introduced, the expectation functions were
all extended to perform lookup using the zone. However, insertion was
not modified to check the zone. This means that two expectations which
are intended to apply for different connections that have the same tuple
but exist in different zones cannot both be tracked.
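
The effect can be modelled with a small userspace sketch. Everything here
is hypothetical (struct expect, struct expect_table, expect_insert(), and a
tuple reduced to a single integer); it only shows why a duplicate check
that ignores the zone rejects the second expectation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_EXPECT 8

/* Hypothetical expectation: the tuple is collapsed to one integer. */
struct expect {
	unsigned int tuple;
	unsigned short zone;
};

struct expect_table {
	struct expect slot[MAX_EXPECT];
	size_t used;
};

/*
 * Insert an expectation unless an equal one is already tracked.
 * With zone_aware == false the comparison looks at the tuple only, which
 * models the insertion path described above; with zone_aware == true the
 * zone takes part in the comparison as well.
 */
static bool expect_insert(struct expect_table *t, struct expect e,
			  bool zone_aware)
{
	assert(t != NULL);
	for (size_t i = 0; i < t->used; i++) {
		bool same = t->slot[i].tuple == e.tuple;

		if (zone_aware)
			same = same && t->slot[i].zone == e.zone;
		if (same)
			return false;	/* treated as a duplicate */
	}
	if (t->used == MAX_EXPECT)
		return false;		/* table full */
	t->slot[t->used++] = e;
	return true;
}
```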

Fixes: 5d0aa2ccd4 (netfilter: nf_conntrack: add support for "conntrack zones")
Signed-off-by: Joe Stringer
Signed-off-by: Pablo Neira Ayuso

Joe Stringer
2015-07-22 23:00:47 +0800
bac541e46 openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes

Some architectures like POWER can have a NUMA node_possible_map that
contains sparse entries. This causes memory corruption with openvswitch
since it allocates flow_cache with a multiple of num_possible_nodes() and
assumes the node variable returned by for_each_node will index into
flow->stats[node].

Use nr_node_ids to allocate a maximal sparse array instead of
num_possible_nodes().
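
A minimal userspace sketch of the difference, assuming the possible-node
map is modelled as a plain bitmask: count_possible() stands in for
num_possible_nodes() and id_space() for nr_node_ids (highest possible node
ID plus one). All names below are hypothetical, and the sparse map used in
the usage note is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of set bits: models num_possible_nodes(). */
static unsigned int count_possible(unsigned long mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Highest set bit plus one: models nr_node_ids. Returns 0 for an empty map. */
static unsigned int id_space(unsigned long mask)
{
	unsigned int n = 0;

	for (; mask; mask >>= 1)
		n++;
	return n;
}

/* True when stats[node] stays inside an array of `slots` entries. */
static bool index_in_bounds(unsigned int node, unsigned int slots)
{
	return node < slots;
}

/* Every possible node must index safely into an array of id_space() slots. */
static void check_sized_by_id_space(unsigned long mask)
{
	unsigned int slots = id_space(mask);

	for (unsigned int node = 0; node < slots; node++)
		if (mask & (1UL << node))
			assert(index_in_bounds(node, slots));
}
```

With possible nodes 0, 1, 16 and 17, count_possible() gives 4 while
id_space() gives 18, so an array of 4 entries indexed by node 17 is
overrun, whereas one with 18 entries is not.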

The crash was noticed after 3af229f2 was applied as it changed the
node_possible_map to match node_online_map on boot.
Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861

Signed-off-by: Chris J Arges
Acked-by: Pravin B Shelar
Acked-by: Nishanth Aravamudan
Signed-off-by: David S. Miller

Chris J Arges
2015-07-22 13:26:03 +0800
0470eb99b netlink: don't hold mutex in rcu callback when releasing mmapd ring

Kirill A. Shutemov says:

This simple test-case triggers a few locking asserts in the kernel:

#include <stdlib.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

int main(int argc, char **argv)
{
	unsigned int block_size = 16 * 4096;
	struct nl_mmap_req req = {
		.nm_block_size = block_size,
		.nm_block_nr = 64,
		.nm_frame_size = 16384,
		.nm_frame_nr = 64 * block_size / 16384,
	};
	unsigned int ring_size;
	int fd;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
	/* Set up both an RX and a TX ring with the same geometry. */
	if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
		exit(1);
	if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
		exit(1);

	/* Map both rings, then exit without tearing them down. */
	ring_size = req.nm_block_nr * req.nm_block_size;
	mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	return 0;
}

+++ exited with 0 +++
BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
#0: (reboot_mutex){+.+...}, at: [] SyS_reboot+0xa9/0x220
#1: ((reboot_notifier_list).rwsem){.+.+..}, at: [] __blocking_notifier_call_chain+0x39/0x70
#2: (rcu_callback){......}, at: [] rcu_do_batch.isra.49+0x160/0x10c0
Preemption disabled at:[] __delay+0xf/0x20

CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
Call Trace:
[] dump_stack+0x4f/0x7b
[] ___might_sleep+0x16d/0x270
[] __might_sleep+0x4d/0x90
[] mutex_lock_nested+0x2f/0x430
[] ? _raw_spin_unlock_irqrestore+0x5d/0x80
[] ? __this_cpu_preempt_check+0x13/0x20
[] netlink_set_ring+0x1ed/0x350
[] ? netlink_undo_bind+0x70/0x70
[] netlink_sock_destruct+0x80/0x150
[] __sk_free+0x1d/0x160
[] sk_free+0x19/0x20
[..]

Cong Wang says:

We can't hold mutex lock in a rcu callback, [..]

Thomas Graf says:

The socket should be dead at this point. It might be simpler to
add a netlink_release_ring() function which doesn't require
locking at all.

Reported-by: "Kirill A. Shutemov"
Diagnosed-by: Cong Wang
Suggested-by: Thomas Graf
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2015-07-22 13:22:56 +0800