Eric Lee / smarc-fsl-linux-kernel

04 Dec, 2015

9 commits

071f5d105 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:
"A lot of Thanksgiving turkey leftovers accumulated, here goes:

1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg.

2) IDs for some new iwlwifi chips, from Oren Givon.

3) Fix rtlwifi lockups on boot, from Larry Finger.

4) Fix memory leak in fm10k, from Stephen Hemminger.

5) We have a route leak in the ipv6 tunnel infrastructure, fix from
Paolo Abeni.

6) Fix buffer pointer handling in arm64 bpf JIT,f rom Zi Shen Lim.

7) Wrong lockdep annotations in tcp md5 support, fix from Eric
Dumazet.

8) Work around some middle boxes which prevent proper handling of TCP
Fast Open, from Yuchung Cheng.

9) TCP repair can do huge kmalloc() requests, build paged SKBs
instead. From Eric Dumazet.

10) Fix msg_controllen overflow in scm_detach_fds, from Daniel
Borkmann.

11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from
Nikolay Aleksandrov.

12) Fix use after free in epoll with AF_UNIX sockets, from Rainer
Weikusat.

13) Fix double free in VRF code, from Nikolay Aleksandrov.

14) Fix skb leaks on socket receive queue in tipc, from Ying Xue.

15) Fix ifup/ifdown crach in xgene driver, from Iyappan Subramanian.

16) Fix clearing of persistent array maps in bpf, from Daniel
Borkmann.

17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq
early enough. From Eric Dumazet.

18) Fix out of bounds accesses in bpf array implementation when
updating elements, from Daniel Borkmann.

19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric
Dumazet.

20) When dumping proxy neigh entries, we have to accomodate NULL
device pointers properly, from Konstantin Khlebnikov.

21) SCTP doesn't release all ipv6 socket resources properly, fix from
Eric Dumazet.

22) Prevent underflows of sch->q.qlen for multiqueue packet
schedulers, also from Eric Dumazet.

23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey
Huang and Michael Chan.

24) Don't actively scan radar channels, from Antonio Quartulli"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits)
net: phy: reset only targeted phy
bnxt_en: Setup uc_list mac filters after resetting the chip.
bnxt_en: enforce proper storing of MAC address
bnxt_en: Fixed incorrect implementation of ndo_set_mac_address
net: lpc_eth: remove irq > NR_IRQS check from probe()
net_sched: fix qdisc_tree_decrease_qlen() races
openvswitch: fix hangup on vxlan/gre/geneve device deletion
ipv4: igmp: Allow removing groups from a removed interface
ipv6: sctp: implement sctp_v6_destroy_sock()
arm64: bpf: add 'store immediate' instruction
ipv6: kill sk_dst_lock
ipv6: sctp: add rcu protection around np->opt
net/neighbour: fix crash at dumping device-agnostic proxy entries
sctp: use GFP_USER for user-controlled kmalloc
sctp: convert sack_needed and sack_generation to bits
ipv6: add complete rcu protection around np->opt
bpf: fix allocation warnings in bpf maps and integer overflow
mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0
net: mvneta: enable setting custom TX IP checksum limit
net: mvneta: fix error path for building skb
...

Linus Torvalds
2015-12-04 08:02:46 +0800
e3c9b1ef7 Merge tag 'mac80211-for-davem-2015-12-02' of git://git.kernel.org/pub/scm/linux/… ... Browse Code »

…kernel/git/jberg/mac80211

Johannes Berg says:

====================
A small set of fixes for 4.4:
* fix scanning in mac80211 to not actively scan radar
channels (from Antonio)
* fix uninitialized variable in remain-on-channel that
could lead to treating frame TX as remain-on-channel
and not sending the frame at all
* remove NL80211_FEATURE_FULL_AP_CLIENT_STATE again, it
was broken and needs more work, we'll enable it later
* fix call_rcu() induced use-after-reset/free in mesh
(that was suddenly causing issues in certain tests)
* always request block-ack window size 64 as we found
some APs will otherwise crash (really ...)
* fix P2P-Device teardown sequence to avoid restarting
with uninitialized data
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2015-12-04 04:56:22 +0800
4eaf3b84f net_sched: fix qdisc_tree_decrease_qlen() races ... Browse Code »

qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
devices.

One problem is that it updates sch->q.qlen and sch->qstats.drops
on the mq/mqprio root qdisc, while it should not : Daniele
reported underflows errors :
[ 681.774821] PAX: sch->q.qlen: 0 n: 1
[ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
[ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
[ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
[ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
[ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
[ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
[ 681.774962] Call Trace:
[ 681.774967] [] dump_stack+0x4c/0x7f
[ 681.774970] [] report_size_overflow+0x34/0x50
[ 681.774972] [] qdisc_tree_decrease_qlen+0x152/0x160
[ 681.774976] [] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
[ 681.774978] [] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
[ 681.774980] [] __qdisc_run+0x4d/0x1d0
[ 681.774983] [] net_tx_action+0xc2/0x160
[ 681.774985] [] __do_softirq+0xf1/0x200
[ 681.774987] [] run_ksoftirqd+0x1e/0x30
[ 681.774989] [] smpboot_thread_fn+0x150/0x260
[ 681.774991] [] ? sort_range+0x40/0x40
[ 681.774992] [] kthread+0xe4/0x100
[ 681.774994] [] ? kthread_worker_fn+0x170/0x170
[ 681.774995] [] ret_from_fork+0x3e/0x70

mq/mqprio have their own ways to report qlen/drops by folding stats on
all their queues, with appropriate locking.

A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
without proper locking : concurrent qdisc updates could corrupt the list
that qdisc_match_from_root() parses to find a qdisc given its handle.

Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
qdisc_tree_decrease_qlen() can use to abort its tree traversal,
as soon as it meets a mq/mqprio qdisc children.

Second problem can be fixed by RCU protection.
Qdisc are already freed after RCU grace period, so qdisc_list_add() and
qdisc_list_del() simply have to use appropriate rcu list variants.

A future patch will add a per struct netdev_queue list anchor, so that
qdisc_tree_decrease_qlen() can have more efficient lookups.

Reported-by: Daniele Fucini
Signed-off-by: Eric Dumazet
Cc: Cong Wang
Cc: Jamal Hadi Salim
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-04 03:59:05 +0800
131753030 openvswitch: fix hangup on vxlan/gre/geneve device deletion ... Browse Code »

Each openvswitch tunnel vport (vxlan,gre,geneve) holds a reference
to the underlying tunnel device, but never released it when such
device is deleted.
Deleting the underlying device via the ip tool cause the kernel to
hangup in the netdev_wait_allrefs() loop.
This commit ensure that on device unregistration dp_detach_port_notify()
is called for all vports that hold the device reference, properly
releasing it.

Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device")
Fixes: b2acd1dc3949 ("openvswitch: Use regular GRE net_device instead of vport")
Fixes: 6b001e682e90 ("openvswitch: Use Geneve device.")
Signed-off-by: Paolo Abeni
Acked-by: Flavio Leitner
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller

Paolo Abeni
2015-12-04 03:29:25 +0800
4eba7bb1d ipv4: igmp: Allow removing groups from a removed interface ... Browse Code »

When a multicast group is joined on a socket, a struct ip_mc_socklist
is appended to the sockets mc_list containing information about the
joined group.

If the interface is hot unplugged, this entry becomes stale. Prior to
commit 52ad353a5344f ("igmp: fix the problem when mc leave group") it
was possible to remove the stale entry by performing a
IP_DROP_MEMBERSHIP, passing either the old ifindex or ip address on
the interface. However, this fix enforces that the interface must
still exist. Thus with time, the number of stale entries grows, until
sysctl_igmp_max_memberships is reached and then it is not possible to
join and more groups.

The previous patch fixes an issue where a IP_DROP_MEMBERSHIP is
performed without specifying the interface, either by ifindex or ip
address. However here we do supply one of these. So loosen the
restriction on device existence to only apply when the interface has
not been specified. This then restores the ability to clean up the
stale entries.

Signed-off-by: Andrew Lunn
Fixes: 52ad353a5344f "(igmp: fix the problem when mc leave group")
Signed-off-by: David S. Miller

Andrew Lunn
2015-12-04 01:07:05 +0800
602dd62df ipv6: sctp: implement sctp_v6_destroy_sock() ... Browse Code »

Dmitry Vyukov reported a memory leak using IPV6 SCTP sockets.

We need to call inet6_destroy_sock() to properly release
inet6 specific fields.

Reported-by: Dmitry Vyukov
Signed-off-by: Eric Dumazet
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-04 01:05:57 +0800
79aecc721 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth ... Browse Code »

Johan Hedberg says:

====================
pull request: bluetooth 2015-12-01

Here's a Bluetooth fix for the 4.4-rc series that fixes a memory leak of
the Security Manager L2CAP channel that'll happen for every LE
connection.

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-12-04 01:04:05 +0800
6bd4f355d ipv6: kill sk_dst_lock ... Browse Code »

While testing the np->opt RCU conversion, I found that UDP/IPv6 was
using a mixture of xchg() and sk_dst_lock to protect concurrent changes
to sk->sk_dst_cache, leading to possible corruptions and crashes.

ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
way to fix the mess is to remove sk_dst_lock completely, as we did for
IPv4.

__ip6_dst_store() and ip6_dst_store() share same implementation.

sk_setup_caps() being called with socket lock being held or not,
we have to use sk_dst_set() instead of __sk_dst_set()

Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
in ip6_dst_store() before the sk_setup_caps(sk, dst) call.

This is because ip6_dst_store() can be called from process context,
without any lock held.

As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
from another cpu doing a concurrent ip6_dst_store()

Doing the dst dereference before doing the install is needed to make
sure no use after free would trigger.

Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-04 00:32:06 +0800
c836a8ba9 ipv6: sctp: add rcu protection around np->opt ... Browse Code »

This patch completes the work I did in commit 45f6fad84cc3
("ipv6: add complete rcu protection around np->opt"), as I missed
sctp part.

This simply makes sure np->opt is used with proper RCU locking
and accessors.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-04 00:30:58 +0800

03 Dec, 2015

7 commits

6adc5fd6a net/neighbour: fix crash at dumping device-agnostic proxy entries ... Browse Code »

Proxy entries could have null pointer to net-device.

Signed-off-by: Konstantin Khlebnikov
Fixes: 84920c1420e2 ("net: Allow ipv6 proxies and arp proxies be shown with iproute2")
Signed-off-by: David S. Miller

Konstantin Khlebnikov
2015-12-03 13:07:51 +0800
cacc06215 sctp: use GFP_USER for user-controlled kmalloc ... Browse Code »

Dmitry Vyukov reported that the user could trigger a kernel warning by
using a large len value for getsockopt SCTP_GET_LOCAL_ADDRS, as that
value directly affects the value used as a kmalloc() parameter.

This patch thus switches the allocation flags from all user-controllable
kmalloc size to GFP_USER to put some more restrictions on it and also
disables the warn, as they are not necessary.

Signed-off-by: Marcelo Ricardo Leitner
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Marcelo Ricardo Leitner
2015-12-03 12:39:46 +0800
45f6fad84 ipv6: add complete rcu protection around np->opt ... Browse Code »

This patch addresses multiple problems :

UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
while socket is not locked : Other threads can change np->opt
concurrently. Dmitry posted a syzkaller
(http://github.com/google/syzkaller) program desmonstrating
use-after-free.

Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
and dccp_v6_request_recv_sock() also need to use RCU protection
to dereference np->opt once (before calling ipv6_dup_options())

This patch adds full RCU protection to np->opt

Reported-by: Dmitry Vyukov
Signed-off-by: Eric Dumazet
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-03 12:37:16 +0800
c1df932c0 mac80211: fix off-channel mgmt-tx uninitialized variable usage ... Browse Code »

In the last change here, I neglected to update the cookie in one code
path: when a mgmt-tx has no real cookie sent to userspace as it doesn't
wait for a response, but is off-channel. The original code used the SKB
pointer as the cookie and always assigned the cookie to the TX SKB in
ieee80211_start_roc_work(), but my change turned this around and made
the code rely on a valid cookie being passed in.

Unfortunately, the off-channel no-wait TX path wasn't assigning one at
all, resulting in an uninitialized stack value being used. This wasn't
handed back to userspace as a cookie (since in the no-wait case there
isn't a cookie), but it was tested for non-zero to distinguish between
mgmt-tx and off-channel.

Fix this by assigning a dummy non-zero cookie unconditionally, and get
rid of a misleading comment and some dead code while at it. I'll clean
up the ACK SKB handling separately later.

Fixes: 3b79af973cf4 ("mac80211: stop using pointers as userspace cookies")
Signed-off-by: Johannes Berg

Johannes Berg
2015-12-03 05:27:53 +0800
4e39ccac0 mac80211: do not actively scan DFS channels ... Browse Code »

DFS channels should not be actively scanned as we can't be sure
if we are allowed or not.

If the current channel is in the DFS band, active scan might be
performed after CSA, but we have no guarantee about other channels,
therefore it is safer to prevent active scanning at all.

Cc: stable@vger.kernel.org
Signed-off-by: Antonio Quartulli
Signed-off-by: Johannes Berg

Antonio Quartulli
2015-12-03 05:27:53 +0800
835112b28 mac80211: don't teardown sdata on sdata stop ... Browse Code »

Interfaces are being initialized (setup) on addition,
and torn down on removal.

However, p2p device is being torn down when stopped,
resulting in the next p2p start operation being done
on uninitialized interface.

Solve it by calling ieee80211_teardown_sdata() only
on interface removal (for the non-netdev case).

Signed-off-by: Eliad Peller
Signed-off-by: Emmanuel Grumbach
[squashed in fix to call teardown after unregister]
Signed-off-by: Johannes Berg

Eliad Peller
2015-12-03 05:27:27 +0800
83e4bf7a7 openvswitch: properly refcount vport-vxlan module ... Browse Code »

After 614732eaa12d, no refcount is maintained for the vport-vxlan module.
This allows the userspace to remove such module while vport-vxlan
devices still exist, which leads to later oops.

v1 -> v2:
- move vport 'owner' initialization in ovs_vport_ops_register()
and make such function a macro

Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device")
Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller

Paolo Abeni
2015-12-03 00:50:59 +0800

02 Dec, 2015

3 commits

ceb5d58b2 net: fix sock_wake_async() rcu protection ... Browse Code »

Dmitry provided a syzkaller (http://github.com/google/syzkaller)
triggering a fault in sock_wake_async() when async IO is requested.

Said program stressed af_unix sockets, but the issue is generic
and should be addressed in core networking stack.

The problem is that by the time sock_wake_async() is called,
we should not access the @flags field of 'struct socket',
as the inode containing this socket might be freed without
further notice, and without RCU grace period.

We already maintain an RCU protected structure, "struct socket_wq"
so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
is the safe route.

It also reduces number of cache lines needing dirtying, so might
provide a performance improvement anyway.

In followup patches, we might move remaining flags (SOCK_NOSPACE,
SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
being mostly read and let it being shared between cpus.

Reported-by: Dmitry Vyukov
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-02 04:45:05 +0800
9cd3e072b net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA ... Browse Code »

This patch is a cleanup to make following patch easier to
review.

Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-02 04:45:05 +0800
304d888b2 Revert "ipv6: ndisc: inherit metadata dst when creating ndisc requests" ... Browse Code »

This reverts commit ab450605b35caa768ca33e86db9403229bf42be4.

In IPv6, we cannot inherit the dst of the original dst. ndisc packets
are IPv6 packets and may take another route than the original packet.

This patch breaks the following scenario: a packet comes from eth0 and
is forwarded through vxlan1. The encapsulated packet triggers an NS
which cannot be sent because of the wrong route.

CC: Jiri Benc
CC: Thomas Graf
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2015-12-02 04:07:59 +0800

01 Dec, 2015

2 commits

142a2e7ec tcp: initialize tp->copied_seq in case of cross SYN connection ... Browse Code »

Dmitry provided a syzkaller (http://github.com/google/syzkaller)
generated program that triggers the WARNING at
net/ipv4/tcp.c:1729 in tcp_recvmsg() :

WARN_ON(tp->copied_seq != tp->rcv_nxt &&
!(flags & (MSG_PEEK | MSG_TRUNC)));

His program is specifically attempting a Cross SYN TCP exchange,
that we support (for the pleasure of hackers ?), but it looks we
lack proper tcp->copied_seq initialization.

Thanks again Dmitry for your report and testings.

Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Tested-by: Dmitry Vyukov
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-01 04:34:17 +0800
9490f886b af-unix: passcred support for sendpage ... Browse Code »

sendpage did not care about credentials at all. This could lead to
situations in which because of fd passing between processes we could
append data to skbs with different scm data. It is illegal to splice those
skbs together. Instead we have to allocate a new skb and if requested
fill out the scm details.

Fixes: 869e7c62486ec ("net: af_unix: implement stream sendpage support")
Reported-by: Al Viro
Cc: Al Viro
Cc: Eric Dumazet
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Hannes Frederic Sowa
2015-12-01 04:16:06 +0800

30 Nov, 2015

1 commit

880621c26 packet: Allow packets with only a header (but no payload) ... Browse Code »

Commit 9c7077622dd91 ("packet: make packet_snd fail on len smaller
than l2 header") added validation for the packet size in packet_snd.
This change enforces that every packet needs a header (with at least
hard_header_len bytes) plus a payload with at least one byte. Before
this change the payload was optional.

This fixes PPPoE connections which do not have a "Service" or
"Host-Uniq" configured (which is violating the spec, but is still
widely used in real-world setups). Those are currently failing with the
following message: "pppd: packet size is too short (24
Signed-off-by: David S. Miller

Martin Blumenstingl
2015-11-30 11:17:17 +0800

25 Nov, 2015

6 commits

8c7188b23 RDS: fix race condition when sending a message on unbound socket ... Browse Code »

Sasha's found a NULL pointer dereference in the RDS connection code when
sending a message to an apparently unbound socket. The problem is caused
by the code checking if the socket is bound in rds_sendmsg(), which checks
the rs_bound_addr field without taking a lock on the socket. This opens a
race where rs_bound_addr is temporarily set but where the transport is not
in rds_bind(), leading to a NULL pointer dereference when trying to
dereference 'trans' in __rds_conn_create().

Vegard wrote a reproducer for this issue, so kindly ask him to share if
you're interested.

I cannot reproduce the NULL pointer dereference using Vegard's reproducer
with this patch, whereas I could without.

Complete earlier incomplete fix to CVE-2015-6937:

74e98eb08588 ("RDS: verify the underlying transport exists before creating a connection")

Cc: David S. Miller
Cc: stable@vger.kernel.org

Reviewed-by: Vegard Nossum
Reviewed-by: Sasha Levin
Acked-by: Santosh Shilimkar
Signed-off-by: Quentin Casasnovas
Signed-off-by: David S. Miller

Quentin Casasnovas
2015-11-25 06:20:09 +0800
20f795666 net: openvswitch: Remove invalid comment ... Browse Code »

During pre-upstream development, the openvswitch datapath used a custom
hashtable to store vports that could fail on delete due to lack of
memory. However, prior to upstream submission, this code was reworked to
use an hlist based hastable with flexible-array based buckets. As such
the failure condition was eliminated from the vport_del path, rendering
this comment invalid.

Signed-off-by: Aaron Conole
Signed-off-by: David S. Miller

Aaron Conole
2015-11-25 06:18:00 +0800
fbdd29bfd net: ipmr, ip6mr: fix vif/tunnel failure race condition ... Browse Code »

Since (at least) commit b17a7c179dd3 ("[NET]: Do sysfs registration as
part of register_netdevice."), netdev_run_todo() deals only with
unregistration, so we don't need to do the rtnl_unlock/lock cycle to
finish registration when failing pimreg or dvmrp device creation. In
fact that opens a race condition where someone can delete the device
while rtnl is unlocked because it's fully registered. The problem gets
worse when netlink support is introduced as there are more points of entry
that can cause it and it also makes reusing that code correctly impossible.

Signed-off-by: Nikolay Aleksandrov
Reviewed-by: Cong Wang
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2015-11-25 06:15:56 +0800
33c40e242 rxrpc: Correctly handle ack at end of client call transmit phase ... Browse Code »

Normally, the transmit phase of a client call is implicitly ack'd by the
reception of the first data packet of the response being received.
However, if a security negotiation happens, the transmit phase, if it is
entirely contained in a single packet, may get an ack packet in response
and then may get aborted due to security negotiation failure.

Because the client has shifted state to RXRPC_CALL_CLIENT_AWAIT_REPLY due
to having transmitted all the data, the code that handles processing of the
received ack packet doesn't note the hard ack the data packet.

The following abort packet in the case of security negotiation failure then
incurs an assertion failure when it tries to drain the Tx queue because the
hard ack state is out of sync (hard ack means the packets have been
processed and can be discarded by the sender; a soft ack means that the
packets are received but could still be discarded and rerequested by the
receiver).

To fix this, we should record the hard ack we received for the ack packet.

The assertion failure looks like:

RxRPC: Assertion failed
1 ] [] rxrpc_rotate_tx_window+0xbc/0x131 [af_rxrpc]
...

Signed-off-by: David Howells
Signed-off-by: David S. Miller

David Howells
2015-11-25 06:14:50 +0800
264640fc2 ipv6: distinguish frag queues by device for multicast and link-local packets ... Browse Code »

If a fragmented multicast packet is received on an ethernet device which
has an active macvlan on top of it, each fragment is duplicated and
received both on the underlying device and the macvlan. If some
fragments for macvlan are processed before the whole packet for the
underlying device is reassembled, the "overlapping fragments" test in
ip6_frag_queue() discards the whole fragment queue.

To resolve this, add device ifindex to the search key and require it to
match reassembling multicast packets and packets to link-local
addresses.

Note: similar patch has been already submitted by Yoshifuji Hideaki in

http://patchwork.ozlabs.org/patch/220979/

but got lost and forgotten for some reason.

Signed-off-by: Michal Kubecek
Signed-off-by: David S. Miller

Michal Kubeček
2015-11-25 05:45:47 +0800
7098356ba tipc: fix error handling of expanding buffer headroom ... Browse Code »

Coverity says:

*** CID 1338065: Error handling issues (CHECKED_RETURN)
/net/tipc/udp_media.c: 162 in tipc_udp_send_msg()
156 struct udp_media_addr *dst = (struct udp_media_addr *)&dest->value;
157 struct udp_media_addr *src = (struct udp_media_addr *)&b->addr.value;
158 struct sk_buff *clone;
159 struct rtable *rt;
160
161 if (skb_headroom(skb) < UDP_MIN_HEADROOM)
>>> CID 1338065: Error handling issues (CHECKED_RETURN)
>>> Calling "pskb_expand_head" without checking return value (as is done elsewhere 51 out of 56 times).
162 pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
163
164 clone = skb_clone(skb, GFP_ATOMIC);
165 skb_set_inner_protocol(clone, htons(ETH_P_TIPC));
166 ub = rcu_dereference_rtnl(b->media_ptr);
167 if (!ub) {

When expanding buffer headroom over udp tunnel with pskb_expand_head(),
it's unfortunate that we don't check its return value. As a result, if
the function returns an error code due to the lack of memory, it may
cause unpredictable consequence as we unconditionally consider that
it's always successful.

Fixes: e53567948f82 ("tipc: conditionally expand buffer headroom over udp tunnel")
Reported-by:
Cc: Stephen Hemminger
Signed-off-by: Ying Xue
Signed-off-by: David S. Miller

Ying Xue
2015-11-25 00:26:19 +0800

24 Nov, 2015

5 commits

f4195d1ea tipc: avoid packets leaking on socket receive queue ... Browse Code »

Even if we drain receive queue thoroughly in tipc_release() after tipc
socket is removed from rhashtable, it is possible that some packets
are in flight because some CPU runs receiver and did rhashtable lookup
before we removed socket. They will achieve receive queue, but nobody
delete them at all. To avoid this leak, we register a private socket
destructor to purge receive queue, meaning releasing packets pending
on receive queue will be delayed until the last reference of tipc
socket will be released.

Signed-off-by: Ying Xue
Signed-off-by: David S. Miller

Ying Xue
2015-11-24 12:45:15 +0800
38b7631fb nfs4: limit callback decoding to received bytes ... Browse Code »

A truncated cb_compound request will cause the client to decode null or
data from a previous callback for nfs4.1 backchannel case, or uninitialized
data for the nfs4.0 case. This is because the path through
svc_process_common() advances the request's iov_base and decrements iov_len
without adjusting the overall xdr_buf's len field. That causes
xdr_init_decode() to set up the xdr_stream with an incorrect length in
nfs4_callback_compound().

Fixing this for the nfs4.1 backchannel case first requires setting the
correct iov_len and page_len based on the length of received data in the
same manner as the nfs4.0 case.

Then the request's xdr_buf length can be adjusted for both cases based upon
the remaining iov_len and page_len.

Signed-off-by: Benjamin Coddington
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust

Benjamin Coddington
2015-11-24 11:03:15 +0800
3d1a54e80 net/hsr: fix a warning message ... Browse Code »

WARN_ON_ONCE() takes a condition, it doesn't take an error message. I
have converted this to WARN() instead.

Signed-off-by: Dan Carpenter
Signed-off-by: David S. Miller

Dan Carpenter
2015-11-24 03:56:15 +0800
7d267278a unix: avoid use-after-free in ep_remove_wait_queue ... Browse Code »

Rainer Weikusat writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Reviewed-by: Jason Baron
Signed-off-by: David S. Miller

Rainer Weikusat
2015-11-24 01:29:58 +0800
3b13758f5 cgroups: Allow dynamically changing net_classid ... Browse Code »

The classid of a process is changed either when a process is moved to
or from a cgroup or when the net_cls.classid file is updated.
Previously net_cls only supported propogating these changes to the
cgroup's related sockets when a process was added or removed from the
cgroup. This means it was neccessary to remove and re-add all processes
to a cgroup in order to update its classid. This change introduces
support for doing this dynamically - i.e. when the value is changed in
the net_cls_classid file, this will also trigger an update to the
classid associated with all sockets controlled by the cgroup.
This mimics the behaviour of other cgroup subsystems.
net_prio circumvents this issue by storing an index into a table with
each socket (and so any updates to the table, don't require updating
the value associated with the socket). net_cls, however, passes the
socket the classid directly, and so this additional step is needed.

Signed-off-by: Nina Schiff
Acked-by: Tejun Heo
Signed-off-by: David S. Miller

Nina Schiff
2015-11-24 01:13:46 +0800

23 Nov, 2015

3 commits

4c6980462 net: ip6mr: fix static mfc/dev leaks on table destruction ... Browse Code »

Similar to ipv4, when destroying an mrt table the static mfc entries and
the static devices are kept, which leads to devices that can never be
destroyed (because of refcnt taken) and leaked memory. Make sure that
everything is cleaned up on netns destruction.

Fixes: 8229efdaef1e ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code")
CC: Benjamin Thery
Signed-off-by: Nikolay Aleksandrov
Reviewed-by: Cong Wang
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2015-11-23 09:44:47 +0800
0e615e960 net: ipmr: fix static mfc/dev leaks on table destruction ... Browse Code »

When destroying an mrt table the static mfc entries and the static
devices are kept, which leads to devices that can never be destroyed
(because of refcnt taken) and leaked memory, for example:
unreferenced object 0xffff880034c144c0 (size 192):
comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
hex dump (first 32 bytes):
98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff .S.4.....S.4....
ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00 ................
backtrace:
[] kmemleak_alloc+0x4e/0xb0
[] kmem_cache_alloc+0x190/0x300
[] ip_mroute_setsockopt+0x5cb/0x910
[] do_ip_setsockopt.isra.11+0x105/0xff0
[] ip_setsockopt+0x30/0xa0
[] raw_setsockopt+0x33/0x90
[] sock_common_setsockopt+0x14/0x20
[] SyS_setsockopt+0x71/0xc0
[] entry_SYSCALL_64_fastpath+0x16/0x7a
[] 0xffffffffffffffff

Make sure that everything is cleaned on netns destruction.

Signed-off-by: Nikolay Aleksandrov
Reviewed-by: Cong Wang
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2015-11-23 09:44:46 +0800
6900317f5 net, scm: fix PaX detected msg_controllen overflow in scm_detach_fds ... Browse Code »

David and HacKurx reported a following/similar size overflow triggered
in a grsecurity kernel, thanks to PaX's gcc size overflow plugin:

(Already fixed in later grsecurity versions by Brad and PaX Team.)

[ 1002.296137] PAX: size overflow detected in function scm_detach_fds net/core/scm.c:314
cicus.202_127 min, count: 4, decl: msg_controllen; num: 0; context: msghdr;
[ 1002.296145] CPU: 0 PID: 3685 Comm: scm_rights_recv Not tainted 4.2.3-grsec+ #7
[ 1002.296149] Hardware name: Apple Inc. MacBookAir5,1/Mac-66F35F19FE2A0D05, [...]
[ 1002.296153] ffffffff81c27366 0000000000000000 ffffffff81c27375 ffffc90007843aa8
[ 1002.296162] ffffffff818129ba 0000000000000000 ffffffff81c27366 ffffc90007843ad8
[ 1002.296169] ffffffff8121f838 fffffffffffffffc fffffffffffffffc ffffc90007843e60
[ 1002.296176] Call Trace:
[ 1002.296190] [] dump_stack+0x45/0x57
[ 1002.296200] [] report_size_overflow+0x38/0x60
[ 1002.296209] [] scm_detach_fds+0x2ce/0x300
[ 1002.296220] [] unix_stream_read_generic+0x609/0x930
[ 1002.296228] [] unix_stream_recvmsg+0x4f/0x60
[ 1002.296236] [] ? unix_set_peek_off+0x50/0x50
[ 1002.296243] [] sock_recvmsg+0x47/0x60
[ 1002.296248] [] ___sys_recvmsg+0xe2/0x1e0
[ 1002.296257] [] __sys_recvmsg+0x46/0x80
[ 1002.296263] [] SyS_recvmsg+0x2c/0x40
[ 1002.296271] [] entry_SYSCALL_64_fastpath+0x12/0x85

Further investigation showed that this can happen when an *odd* number of
fds are being passed over AF_UNIX sockets.

In these cases CMSG_LEN(i * sizeof(int)) and CMSG_SPACE(i * sizeof(int)),
where i is the number of successfully passed fds, differ by 4 bytes due
to the extra CMSG_ALIGN() padding in CMSG_SPACE() to an 8 byte boundary
on 64 bit. The padding is used to align subsequent cmsg headers in the
control buffer.

When the control buffer passed in from the receiver side *lacks* these 4
bytes (e.g. due to buggy/wrong API usage), then msg->msg_controllen will
overflow in scm_detach_fds():

int cmlen = CMSG_LEN(i * sizeof(int)); cmsg_level);
if (!err)
err = put_user(SCM_RIGHTS, &cm->cmsg_type);
if (!err)
err = put_user(cmlen, &cm->cmsg_len);
if (!err) {
cmlen = CMSG_SPACE(i * sizeof(int)); msg_control += cmlen;
msg->msg_controllen -= cmlen; msg_controllen of 20 bytes, and the sender
properly transferred 1 fd to the receiver, so that its CMSG_LEN results
in 20 bytes and CMSG_SPACE in 24 bytes.

In case of MSG_CMSG_COMPAT (scm_detach_fds_compat()), I haven't seen an
issue in my tests as alignment seems always on 4 byte boundary. Same
should be in case of native 32 bit, where we end up with 4 byte boundaries
as well.

In practice, passing msg->msg_controllen of 20 to recvmsg() while receiving
a single fd would mean that on successful return, msg->msg_controllen is
being set by the kernel to 24 bytes instead, thus more than the input
buffer advertised. It could f.e. become an issue if such application later
on zeroes or copies the control buffer based on the returned msg->msg_controllen
elsewhere.

Maximum number of fds we can send is a hard upper limit SCM_MAX_FD (253).

Going over the code, it seems like msg->msg_controllen is not being read
after scm_detach_fds() in scm_recv() anymore by the kernel, good!

Relevant recvmsg() handler are unix_dgram_recvmsg() (unix_seqpacket_recvmsg())
and unix_stream_recvmsg(). Both return back to their recvmsg() caller,
and ___sys_recvmsg() places the updated length, that is, new msg_control -
old msg_control pointer into msg->msg_controllen (hence the 24 bytes seen
in the example).

Long time ago, Wei Yongjun fixed something related in commit 1ac70e7ad24a
("[NET]: Fix function put_cmsg() which may cause usr application memory
overflow").

RFC3542, section 20.2. says:

The fields shown as "XX" are possible padding, between the cmsghdr
structure and the data, and between the data and the next cmsghdr
structure, if required by the implementation. While sending an
application may or may not include padding at the end of last
ancillary data in msg_controllen and implementations must accept both
as valid. On receiving a portable application must provide space for
padding at the end of the last ancillary data as implementations may
copy out the padding at the end of the control message buffer and
include it in the received msg_controllen. When recvmsg() is called
if msg_controllen is too small for all the ancillary data items
including any trailing padding after the last item an implementation
may set MSG_CTRUNC.

Since we didn't place MSG_CTRUNC for already quite a long time, just do
the same as in 1ac70e7ad24a to avoid an overflow.

Btw, even man-page author got this wrong :/ See db939c9b26e9 ("cmsg.3: Fix
error in SCM_RIGHTS code sample"). Some people must have copied this (?),
thus it got triggered in the wild (reported several times during boot by
David and HacKurx).

No Fixes tag this time as pre 2002 (that is, pre history tree).

Reported-by: David Sterba
Reported-by: HacKurx
Cc: PaX Team
Cc: Emese Revfy
Cc: Brad Spengler
Cc: Wei Yongjun
Cc: Eric Dumazet
Reviewed-by: Hannes Frederic Sowa
Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller

Daniel Borkmann
2015-11-23 09:34:58 +0800

21 Nov, 2015

1 commit

9a6508382 tipc: correct settings of broadcast link state ... Browse Code »

Since commit 5266698661401afc5e ("tipc: let broadcast packet
reception use new link receive function") the broadcast send
link state was meant to always be set to LINK_ESTABLISHED, since
we don't need this link to follow the regular link FSM rules. It
was also the intention that this state anyway shouldn't impact
the run-time working state of the link, since the latter in
reality is controlled by the number of registered peers.

We have now discovered that this assumption is not quite correct.
If the broadcast link is reset because of too many retransmissions,
its state will inadvertently go to LINK_RESETTING, and never go
back to LINK_ESTABLISHED, because the LINK_FAILURE event was not
anticipated. This will work well once, but if it happens a second
time, the reset on a link in LINK_RESETTING has has no effect, and
neither the broadcast link nor the unicast links will go down as
they should.

Furthermore, it is confusing that the management tool shows that
this link is in UP state when that obviously isn't the case.

We now ensure that this state strictly follows the true working
state of the link. The state is set to LINK_ESTABLISHED when
the number of peers is non-zero, and to LINK_RESET otherwise.

Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller

Jon Paul Maloy
2015-11-21 03:08:51 +0800

20 Nov, 2015

3 commits

5d4c9bfba tcp: fix potential huge kmalloc() calls in TCP_REPAIR ... Browse Code »

tcp_send_rcvq() is used for re-injecting data into tcp receive queue.

Problems :

- No check against size is performed, allowed user to fool kernel in
attempting very large memory allocations, eventually triggering
OOM when memory is fragmented.

- In case of fault during the copy we do not return correct errno.

Lets use alloc_skb_with_frags() to cook optimal skbs.

Fixes: 292e8d8c8538 ("tcp: Move rcvq sending to tcp_input.c")
Fixes: c0e88ff0f256 ("tcp: Repair socket queues")
Signed-off-by: Eric Dumazet
Cc: Pavel Emelyanov
Acked-by: Pavel Emelyanov
Signed-off-by: David S. Miller

Eric Dumazet
2015-11-20 23:57:33 +0800
dd52bc2b4 tcp: fix Fast Open snmp over-counting bug ... Browse Code »

Fix incrementing TCPFastOpenActiveFailed snmp stats multiple times
when the handshake experiences multiple SYN timeouts.

Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2015-11-20 23:51:12 +0800
0e45f4da5 tcp: disable Fast Open on timeouts after handshake ... Browse Code »

Some middle-boxes black-hole the data after the Fast Open handshake
(https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf).
The exact reason is unknown. The work-around is to disable Fast Open
temporarily after multiple recurring timeouts with few or no data
delivered in the established state.

Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Reported-by: Christoph Paasch
Signed-off-by: David S. Miller

Yuchung Cheng
2015-11-20 23:51:12 +0800