Eric Lee / smarc-fsl-linux-kernel

24 May, 2015

1 commit

086e8ddb5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client ... Browse Code »

Pull two Ceph fixes from Sage Weil:
"These fix an issue with the RBD notifications when there are topology
changes in the cluster"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"
libceph: request a new osdmap if lingering request maps to no osd

Linus Torvalds
2015-05-24 02:28:25 +0800

23 May, 2015

7 commits

93a33a584 bridge: fix lockdep splat ... Browse Code »

Following lockdep splat was reported :

[ 29.382286] ===============================
[ 29.382315] [ INFO: suspicious RCU usage. ]
[ 29.382344] 4.1.0-0.rc0.git11.1.fc23.x86_64 #1 Not tainted
[ 29.382380] -------------------------------
[ 29.382409] net/bridge/br_private.h:626 suspicious
rcu_dereference_check() usage!
[ 29.382455]
other info that might help us debug this:

[ 29.382507]
rcu_scheduler_active = 1, debug_locks = 0
[ 29.382549] 2 locks held by swapper/0/0:
[ 29.382576] #0: (((&p->forward_delay_timer))){+.-...}, at:
[] call_timer_fn+0x5/0x4f0
[ 29.382660] #1: (&(&br->lock)->rlock){+.-...}, at:
[] br_forward_delay_timer_expired+0x31/0x140
[bridge]
[ 29.382754]
stack backtrace:
[ 29.382787] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.1.0-0.rc0.git11.1.fc23.x86_64 #1
[ 29.382838] Hardware name: LENOVO 422916G/LENOVO, BIOS A1KT53AUS 04/07/2015
[ 29.382882] 0000000000000000 3ebfc20364115825 ffff880666603c48
ffffffff81892d4b
[ 29.382943] 0000000000000000 ffffffff81e124e0 ffff880666603c78
ffffffff8110bcd7
[ 29.383004] ffff8800785c9d00 ffff88065485ac58 ffff880c62002800
ffff880c5fc88ac0
[ 29.383065] Call Trace:
[ 29.383084] [] dump_stack+0x4c/0x65
[ 29.383130] [] lockdep_rcu_suspicious+0xe7/0x120
[ 29.383178] [] br_fill_ifinfo+0x4a9/0x6a0 [bridge]
[ 29.383225] [] br_ifinfo_notify+0x11b/0x4b0 [bridge]
[ 29.383271] [] ? br_hold_timer_expired+0x70/0x70 [bridge]
[ 29.383320] []
br_forward_delay_timer_expired+0x58/0x140 [bridge]
[ 29.383371] [] ? br_hold_timer_expired+0x70/0x70 [bridge]
[ 29.383416] [] call_timer_fn+0xc3/0x4f0
[ 29.383454] [] ? call_timer_fn+0x5/0x4f0
[ 29.383493] [] ? lock_release_holdtime.part.29+0xf/0x200
[ 29.383541] [] ? br_hold_timer_expired+0x70/0x70 [bridge]
[ 29.383587] [] run_timer_softirq+0x244/0x490
[ 29.383629] [] __do_softirq+0xec/0x670
[ 29.383666] [] irq_exit+0x145/0x150
[ 29.383703] [] smp_apic_timer_interrupt+0x46/0x60
[ 29.383744] [] apic_timer_interrupt+0x73/0x80
[ 29.383782] [] ? cpuidle_enter_state+0x5f/0x2f0
[ 29.383832] [] ? cpuidle_enter_state+0x5b/0x2f0

Problem here is that br_forward_delay_timer_expired() is a timer
handler, calling br_ifinfo_notify() which assumes either rcu_read_lock()
or RTNL are held.

Simplest fix seems to add rcu read lock section.

Signed-off-by: Eric Dumazet
Reported-by: Josh Boyer
Reported-by: Dominick Grift
Cc: Vlad Yasevich
Signed-off-by: David S. Miller

Eric Dumazet
2015-05-23 04:23:56 +0800
f96dee13b net: core: 'ethtool' issue with querying phy settings ... Browse Code »

When trying to configure the settings for PHY1, using commands
like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to
modify other settings apart from the speed of the PHY1, in the
above case.

The ethtool seems to query the settings for PHY0, and use this
as the base to apply the new settings to the PHY1. This is
causing the other settings of the PHY 1 to be wrongly
configured.

The issue is caused by the '_ethtool_get_settings()' API, which
gets called because of the 'ETHTOOL_GSET' command, is clearing
the 'cmd' pointer (of type 'struct ethtool_cmd') by calling
memset. This clears all the parameters (if any) passed for the
'ETHTOOL_GSET' cmd. So the driver's callback is always invoked
with 'cmd->phy_address' as '0'.

The '_ethtool_get_settings()' is called from other files in the
'net/core'. So the fix is applied to the 'ethtool_get_settings()'
which is only called in the context of the 'ethtool'.

Signed-off-by: Arun Parameswaran
Reviewed-by: Ray Jui
Reviewed-by: Scott Branden
Signed-off-by: David S. Miller

Arun Parameswaran
2015-05-23 04:14:17 +0800
47cc84ce0 bridge: fix parsing of MLDv2 reports ... Browse Code »

When more than a multicast address is present in a MLDv2 report, all but
the first address is ignored, because the code breaks out of the loop if
there has not been an error adding that address.

This has caused failures when two guests connected through the bridge
tried to communicate using IPv6. Neighbor discoveries would not be
transmitted to the other guest when both used a link-local address and a
static address.

This only happens when there is a MLDv2 querier in the network.

The fix will only break out of the loop when there is a failure adding a
multicast address.

The mdb before the patch:

dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp

After the patch:

dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::fb temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
dev ovirtmgmt port bond0.86 grp ff02::d temp
dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
dev ovirtmgmt port bond0.86 grp ff02::16 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp

Fixes: 08b202b67264 ("bridge br_multicast: IPv6 MLD support.")
Reported-by: Rik Theys
Signed-off-by: Thadeu Lima de Souza Cascardo
Tested-by: Rik Theys
Signed-off-by: David S. Miller

Thadeu Lima de Souza Cascardo
2015-05-23 03:08:20 +0800
d4e64c290 ipv4: fill in table id when replacing a route ... Browse Code »

When replacing an IPv4 route, tb_id member of the new fib_alias
structure is not set in the replace code path so that the new route is
ignored.

Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
Signed-off-by: Michal Kubecek
Acked-by: Alexander Duyck
Signed-off-by: David S. Miller

Michal Kubeček
2015-05-23 02:33:17 +0800
572152adf Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contain Netfilter fixes for your net tree, they are:

1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash.
This problem is due to wrong order in the per-net registration and netlink
socket events. Patch from Francesco Ruggeri.

2) Make sure that counters that userspace pass us are higher than 0 in all the
x_tables frontends. Discovered via Trinity, patch from Dave Jones.

3) Revert a patch for br_netfilter to rely on the conntrack status bits. This
breaks stateless IPv6 NAT transformations. Patch from Florian Westphal.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-05-23 02:25:45 +0800
381c759d9 ipv4: Avoid crashing in ip_error ... Browse Code »

ip_error does not check if in_dev is NULL before dereferencing it.

IThe following sequence of calls is possible:
CPU A CPU B
ip_rcv_finish
ip_route_input_noref()
ip_route_input_slow()
inetdev_destroy()
dst_input()

With the result that a network device can be destroyed while processing
an input packet.

A crash was triggered with only unicast packets in flight, and
forwarding enabled on the only network device. The error condition
was created by the removal of the network device.

As such it is likely the that error code was -EHOSTUNREACH, and the
action taken by ip_error (if in_dev had been accessible) would have
been to not increment any counters and to have tried and likely failed
to send an icmp error as the network device is going away.

Therefore handle this weird case by just dropping the packet if
!in_dev. It will result in dropping the packet sooner, and will not
result in an actual change of behavior.

Fixes: 251da4130115b ("ipv4: Cache ip_error() routes even when not forwarding.")
Reported-by: Vittorio Gambaletta
Tested-by: Vittorio Gambaletta
Signed-off-by: Vittorio Gambaletta
Signed-off-by: "Eric W. Biederman"
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric W. Biederman
2015-05-23 02:23:40 +0800
d654976cb tcp: fix a potential deadlock in tcp_get_info() ... Browse Code »

Taking socket spinlock in tcp_get_info() can deadlock, as
inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
while packet processing can use the reverse locking order.

We could avoid this locking for TCP_LISTEN states, but lockdep would
certainly get confused as all TCP sockets share same lockdep classes.

[ 523.722504] ======================================================
[ 523.728706] [ INFO: possible circular locking dependency detected ]
[ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
[ 523.739202] -------------------------------------------------------
[ 523.745474] ss/18032 is trying to acquire lock:
[ 523.750002] (slock-AF_INET){+.-...}, at: [] tcp_get_info+0x2c4/0x360
[ 523.758129]
[ 523.758129] but task is already holding lock:
[ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [] inet_diag_dump_icsk+0x1d5/0x6c0
[ 523.774661]
[ 523.774661] which lock already depends on the new lock.
[ 523.774661]
[ 523.782850]
[ 523.782850] the existing dependency chain (in reverse order) is:
[ 523.790326]
-> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
[ 523.796599] [] lock_acquire+0xbb/0x270
[ 523.802565] [] _raw_spin_lock+0x38/0x50
[ 523.808628] [] __inet_hash_nolisten+0x78/0x110
[ 523.815273] [] tcp_v4_syn_recv_sock+0x24b/0x350
[ 523.822067] [] tcp_check_req+0x3c1/0x500
[ 523.828199] [] tcp_v4_do_rcv+0x239/0x3d0
[ 523.834331] [] tcp_v4_rcv+0xa8e/0xc10
[ 523.840202] [] ip_local_deliver_finish+0x133/0x3e0
[ 523.847214] [] ip_local_deliver+0xaa/0xc0
[ 523.853440] [] ip_rcv_finish+0x168/0x5c0
[ 523.859624] [] ip_rcv+0x307/0x420

Lets use u64_sync infrastructure instead. As a bonus, 64bit
arches get optimized, as these are nop for them.

Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-05-23 01:46:06 +0800

22 May, 2015

1 commit

c78e1746d net: sched: fix call_rcu() race on classifier module unloads ... Browse Code »

Vijay reported that a loop as simple as ...

while true; do
tc qdisc add dev foo root handle 1: prio
tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1
tc qdisc del dev foo root
rmmod cls_u32
done

... will panic the kernel. Moreover, he bisected the change
apparently introducing it to 78fd1d0ab072 ("netlink: Re-add
locking to netlink_lookup() and seq walker").

The removal of synchronize_net() from the netlink socket
triggering the qdisc to be removed, seems to have uncovered
an RCU resp. module reference count race from the tc API.
Given that RCU conversion was done after e341694e3eb5 ("netlink:
Convert netlink_lookup() to use RCU protected hash table")
which added the synchronize_net() originally, occasion of
hitting the bug was less likely (not impossible though):

When qdiscs that i) support attaching classifiers and,
ii) have at least one of them attached, get deleted, they
invoke tcf_destroy_chain(), and thus call into ->destroy()
handler from a classifier module.

After RCU conversion, all classifier that have an internal
prio list, unlink them and initiate freeing via call_rcu()
deferral.

Meanhile, tcf_destroy() releases already reference to the
tp->ops->owner module before the queued RCU callback handler
has been invoked.

Subsequent rmmod on the classifier module is then not prevented
since all module references are already dropped.

By the time, the kernel invokes the RCU callback handler from
the module, that function address is then invalid.

One way to fix it would be to add an rcu_barrier() to
unregister_tcf_proto_ops() to wait for all pending call_rcu()s
to complete.

synchronize_rcu() is not appropriate as under heavy RCU
callback load, registered call_rcu()s could be deferred
longer than a grace period. In case we don't have any pending
call_rcu()s, the barrier is allowed to return immediately.

Since we came here via unregister_tcf_proto_ops(), there
are no users of a given classifier anymore. Further nested
call_rcu()s pointing into the module space are not being
done anywhere.

Only cls_bpf_delete_prog() may schedule a work item, to
unlock pages eventually, but that is not in the range/context
of cls_bpf anymore.

Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto")
Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
Reported-by: Vijay Subramanian
Signed-off-by: Daniel Borkmann
Cc: John Fastabend
Cc: Eric Dumazet
Cc: Thomas Graf
Cc: Jamal Hadi Salim
Cc: Alexei Starovoitov
Tested-by: Vijay Subramanian
Acked-by: Alexei Starovoitov
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Daniel Borkmann
2015-05-22 06:48:18 +0800

21 May, 2015

4 commits

521a04d06 Revert "libceph: clear r_req_lru_item in __unregister_linger_request()" ... Browse Code »

This reverts commit ba9d114ec5578e6e99a4dfa37ff8ae688040fd64.

.. which introduced a regression that prevented all lingering requests
requeued in kick_requests() from ever being sent to the OSDs, resulting
in a lot of missed notifies. In retrospect it's pretty obvious that
r_req_lru_item item in the case of lingering requests can be used not
only for notarget, but also for unsent linkage due to how tightly
actual map and enqueue operations are coupled in __map_request().

The assertion that was being silenced is taken care of in the previous
("libceph: request a new osdmap if lingering request maps to no osd")
commit: by always kicking homeless lingering requests we ensure that
none of them ends up on the notarget list outside of the critical
section guarded by request_mutex.

Cc: stable@vger.kernel.org # 3.18+, needs b0494532214b "libceph: request a new osdmap if lingering request maps to no osd"
Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil

Ilya Dryomov
2015-05-21 02:02:46 +0800
b04945322 libceph: request a new osdmap if lingering request maps to no osd ... Browse Code »

This commit does two things. First, if there are any homeless
lingering requests, we now request a new osdmap even if the osdmap that
is being processed brought no changes, i.e. if a given lingering
request turned homeless in one of the previous epochs and remained
homeless in the current epoch. Not doing so leaves us with a stale
osdmap and as a result we may miss our window for reestablishing the
watch and lose notifies.

MON=1 OSD=1:

# cat linger-needmap.sh
#!/bin/bash
rbd create --size 1 test
DEV=$(rbd map test)
ceph osd out 0
rbd map dne/dne # obtain a new osdmap as a side effect (!)
sleep 1
ceph osd in 0
rbd resize --size 2 test
# rbd info test | grep size -> 2M
# blockdev --getsize $DEV -> 1M

N.B.: Not obtaining a new osdmap in between "osd out" and "osd in"
above is enough to make it miss that resize notify, but that is a
bug^Wlimitation of ceph watch/notify v1.

Second, homeless lingering requests are now kicked just like those
lingering requests whose mapping has changed. This is mainly to
recognize that a homeless lingering request makes no sense and to
preserve the invariant that a registered lingering request is not
sitting on any of r_req_lru_item lists. This spares us a WARN_ON,
which commit ba9d114ec557 ("libceph: clear r_req_lru_item in
__unregister_linger_request()") tried to fix the _wrong_ way.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov
Reviewed-by: Sage Weil

Ilya Dryomov
2015-05-21 02:02:14 +0800
275964724 ipv6: fix ECMP route replacement ... Browse Code »

When replacing an IPv6 multipath route with "ip route replace", i.e.
NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first
matching route without fixing its siblings, resulting in corrupted
siblings linked list; removing one of the siblings can then end in an
infinite loop.

IPv6 ECMP implementation is a bit different from IPv4 so that route
replacement cannot work in exactly the same way. This should be a
reasonable approximation:

1. If the new route is ECMP-able and there is a matching ECMP-able one
already, replace it and all its siblings (if any).

2. If the new route is ECMP-able and no matching ECMP-able route exists,
replace first matching non-ECMP-able (if any) or just add the new one.

3. If the new route is not ECMP-able, replace first matching
non-ECMP-able route (if any) or add the new route.

We also need to remove the NLM_F_REPLACE flag after replacing old
route(s) by first nexthop of an ECMP route so that each subsequent
nexthop does not replace previous one.

Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Michal Kubecek
Acked-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Michal Kubeček
2015-05-21 00:02:26 +0800
35f1b4e96 ipv6: do not delete previously existing ECMP routes if add fails ... Browse Code »

If adding a nexthop of an IPv6 multipath route fails, comment in
ip6_route_multipath() says we are going to delete all nexthops already
added. However, current implementation deletes even the routes it
hasn't even tried to add yet. For example, running

ip route add 1234:5678::/64 \
nexthop via fe80::aa dev dummy1 \
nexthop via fe80::bb dev dummy1 \
nexthop via fe80::cc dev dummy1

twice results in removing all routes first command added.

Limit the second (delete) run to nexthops that succeeded in the first
(add) run.

Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Michal Kubecek
Acked-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Michal Kubeček
2015-05-21 00:02:25 +0800

20 May, 2015

7 commits

faecbb45e Revert "netfilter: bridge: query conntrack about skb dnat" ... Browse Code »

This reverts commit c055d5b03bb4cb69d349d787c9787c0383abd8b2.

There are two issues:
'dnat_took_place' made me think that this is related to
-j DNAT/MASQUERADE.

But thats only one part of the story. This is also relevant for SNAT
when we undo snat translation in reverse/reply direction.

Furthermore, I originally wanted to do this mainly to avoid
storing ipv6 addresses once we make DNAT/REDIRECT work
for ipv6 on bridges.

However, I forgot about SNPT/DNPT which is stateless.

So we can't escape storing address for ipv6 anyway. Might as
well do it for ipv4 too.

Reported-and-tested-by: Bernhard Thaler
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2015-05-20 19:51:25 +0800
1086bbe97 netfilter: ensure number of counters is >0 in do_replace() ... Browse Code »

After improving setsockopt() coverage in trinity, I started triggering
vmalloc failures pretty reliably from this code path:

warn_alloc_failed+0xe9/0x140
__vmalloc_node_range+0x1be/0x270
vzalloc+0x4b/0x50
__do_replace+0x52/0x260 [ip_tables]
do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
nf_setsockopt+0x65/0x90
ip_setsockopt+0x61/0xa0
raw_setsockopt+0x16/0x60
sock_common_setsockopt+0x14/0x20
SyS_setsockopt+0x71/0xd0

It turns out we don't validate that the num_counters field in the
struct we pass in from userspace is initialized.

The same problem also exists in ebtables, arptables, ipv6, and the
compat variants.

Signed-off-by: Dave Jones
Signed-off-by: Pablo Neira Ayuso

Dave Jones
2015-05-20 19:46:49 +0800
3bfe04980 netfilter: nfnetlink_{log,queue}: Register pernet in first place ... Browse Code »

nfnetlink_{log,queue}_init() register the netlink callback nf*_rcv_nl_event
before registering the pernet_subsys, but the callback relies on data
structures allocated by pernet init functions.

When nfnetlink_{log,queue} is loaded, if a netlink message is received after
the netlink callback is registered but before the pernet_subsys is registered,
the kernel will panic in the sequence

nfulnl_rcv_nl_event
nfnl_log_pernet
net_generic
BUG_ON(id == 0) where id is nfnl_log_net_id.

The panic can be easily reproduced in 4.0.3 by:

while true ;do modprobe nfnetlink_log ; rmmod nfnetlink_log ; done &
while true ;do ip netns add dummy ; ip netns del dummy ; done &

This patch moves register_pernet_subsys to earlier in nfnetlink_log_init.

Notice that the BUG_ON hit in 4.0.3 was recently removed in 2591ffd308
["netns: remove BUG_ONs from net_generic()"].

Signed-off-by: Francesco Ruggeri
Signed-off-by: Pablo Neira Ayuso

Francesco Ruggeri
2015-05-20 19:46:48 +0800
892bd6291 Merge tag 'mac80211-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux/… ... Browse Code »

…kernel/git/jberg/mac80211

Johannes Berg says:

====================
This has just a single fix, for a WEP tailroom check
problem that leads to dropped frames.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2015-05-20 04:44:25 +0800
b7b0ed910 tcp: don't over-send F-RTO probes ... Browse Code »

After sending the new data packets to probe (step 2), F-RTO may
incorrectly send more probes if the next ACK advances SND_UNA and
does not sack new packet. However F-RTO RFC 5682 probes at most
once. This bug may cause sender to always send new data instead of
repairing holes, inducing longer HoL blocking on the receiver for
the application.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller

Yuchung Cheng
2015-05-20 04:36:57 +0800
da34ac762 tcp: only undo on partial ACKs in CA_Loss ... Browse Code »

Undo based on TCP timestamps should only happen on ACKs that advance
SND_UNA, according to the Eifel algorithm in RFC 3522:

Section 3.2:

(4) If the value of the Timestamp Echo Reply field of the
acceptable ACK's Timestamps option is smaller than the
value of RetransmitTS, then proceed to step (5),

Section Terminology:
We use the term 'acceptable ACK' as defined in [RFC793]. That is an
ACK that acknowledges previously unacknowledged data.

This is because upon receiving an out-of-order packet, the receiver
returns the last timestamp that advances RCV_NXT, not the current
timestamp of the packet in the DUPACK. Without checking the flag,
the DUPACK will cause tcp_packet_delayed() to return true and
tcp_try_undo_loss() will revert cwnd reduction.

Note that we check the condition in CA_Recovery already by only
calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or
tcp_try_undo_recovery() if snd_una crosses high_seq.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller

Yuchung Cheng
2015-05-20 04:36:57 +0800
33b4b015e net/ipv6/udp: Fix ipv6 multicast socket filter regression ... Browse Code »

Commit ("udp: Simplify__udp*_lib_mcast_deliver")
simplified the filter for incoming IPv6 multicast but removed
the check of the local socket address and the UDP destination
address.

This patch restores the filter to prevent sockets bound to a IPv6
multicast IP to receive other UDP traffic link unicast.

Signed-off-by: Henning Rogge
Fixes: 5cf3d46192fc ("udp: Simplify__udp*_lib_mcast_deliver")
Cc: "David S. Miller"
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Henning Rogge
2015-05-20 04:34:43 +0800

19 May, 2015

1 commit

456cdf53e Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth ... Browse Code »

Johan Hedberg says:

====================
pull request: bluetooth 2015-05-17

A couple more Bluetooth updates for 4.1:

- New USB IDs for ath3k & btusb
- Fix for remote name resolving during device discovery

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-05-19 04:15:31 +0800

18 May, 2015

2 commits

21858cd02 tcp/ipv6: fix flow label setting in TIME_WAIT state ... Browse Code »

commit 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages
send from TIME_WAIT") added the flow label in the last TCP packets.
Unfortunately, it was not casted properly.

This patch replace the buggy shift with be32_to_cpu/cpu_to_be32.

Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages")
Reported-by: Eric Dumazet
Signed-off-by: Florent Fourcot
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Florent Fourcot
2015-05-18 11:41:59 +0800
ed2a80ab7 rtnl/bond: don't send rtnl msg for unregistered iface ... Browse Code »

Before the patch, the command 'ip link add bond2 type bond mode 802.3ad'
causes the kernel to send a rtnl message for the bond2 interface, with an
ifindex 0.

'ip monitor' shows:
0: bond2: mtu 1500 state DOWN group default
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
9: bond2@NONE: mtu 1500 qdisc noop state DOWN group default
link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff
[snip]

The patch fixes the spotted bug by checking in bond driver if the interface
is registered before calling the notifier chain.
It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the
future.

Fixes: d4261e565000 ("bonding: create netlink event when bonding option is changed")
CC: Jiri Pirko
Reported-by: Julien Meunier
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2015-05-18 10:43:07 +0800

17 May, 2015

2 commits

c0bb07df7 netlink: Reset portid after netlink_insert failure ... Browse Code »

The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
because it doesn't reset portid after a failed netlink_insert.

This means that should autobind fail the first time around, then
the socket will be stuck in limbo as it can never be bound again
since it already has a non-zero portid.

Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock")
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller

Herbert Xu
2015-05-17 05:08:57 +0800
1d6057019 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
The following patchset contains Netfilter fixes for your net tree, they are:

1) Fix a leak in IPVS, the sysctl table is not released accordingly when
destroying a netns, patch from Tommi Rantala.

2) Fix a build error when TPROXY and socket are built-in but IPv6 defrag is
compiled as module, from Florian Westphal.

3) Fix TCP tracket wrt. RFC5961 challenge ACK when in LAST_ACK state, patch
from Jesper Dangaard Brouer.

4) Fix a bogus WARN_ON() in nf_tables when deleting a set element that stores
a map, from Mirek Kratochvil.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-05-17 04:40:22 +0800

16 May, 2015

3 commits

960bd2c26 netfilter: nf_tables: fix bogus warning in nft_data_uninit() ... Browse Code »

The values 0x00000000-0xfffffeff are reserved for userspace datatype. When,
deleting set elements with maps, a bogus warning is triggered.

WARNING: CPU: 0 PID: 11133 at net/netfilter/nf_tables_api.c:4481 nft_data_uninit+0x35/0x40 [nf_tables]()

This fixes the check accordingly to enum definition in
include/linux/netfilter/nf_tables.h

Fixes: https://bugzilla.netfilter.org/show_bug.cgi?id=1013
Signed-off-by: Mirek Kratochvil
Signed-off-by: Pablo Neira Ayuso

Mirek Kratochvil
2015-05-16 04:07:30 +0800
b3cad287d conntrack: RFC5961 challenge ACK confuse conntrack LAST-ACK transition ... Browse Code »

In compliance with RFC5961, the network stack send challenge ACK in
response to spurious SYN packets, since commit 0c228e833c88 ("tcp:
Restore RFC5961-compliant behavior for SYN packets").

This pose a problem for netfilter conntrack in state LAST_ACK, because
this challenge ACK is (falsely) seen as ACKing last FIN, causing a
false state transition (into TIME_WAIT).

The challenge ACK is hard to distinguish from real last ACK. Thus,
solution introduce a flag that tracks the potential for seeing a
challenge ACK, in case a SYN packet is let through and current state
is LAST_ACK.

When conntrack transition LAST_ACK to TIME_WAIT happens, this flag is
used for determining if we are expecting a challenge ACK.

Scapy based reproducer script avail here:
https://github.com/netoptimizer/network-testing/blob/master/scapy/tcp_hacks_3WHS_LAST_ACK.py

Fixes: 0c228e833c88 ("tcp: Restore RFC5961-compliant behavior for SYN packets")
Signed-off-by: Jesper Dangaard Brouer
Acked-by: Jozsef Kadlecsik
Signed-off-by: Pablo Neira Ayuso

Jesper Dangaard Brouer
2015-05-16 02:50:56 +0800
595ca5880 netfilter: avoid build error if TPROXY/SOCKET=y && NF_DEFRAG_IPV6=m ... Browse Code »

With TPROXY=y but DEFRAG_IPV6=m we get build failure:

net/built-in.o: In function `tproxy_tg_init':
net/netfilter/xt_TPROXY.c:588: undefined reference to `nf_defrag_ipv6_enable'

If DEFRAG_IPV6 is modular, TPROXY must be too.
(or both must be builtin).

This enforces =m for both.

Reported-and-tested-by: Liu Hua
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2015-05-16 02:18:27 +0800

15 May, 2015

3 commits

eea39946a rename RTNH_F_EXTERNAL to RTNH_F_OFFLOAD ... Browse Code »

RTNH_F_EXTERNAL today is printed as "offload" in iproute2 output.

This patch renames the flag to be consistent with what the user sees.

Signed-off-by: Roopa Prabhu
Signed-off-by: David S. Miller

Roopa Prabhu
2015-05-15 10:45:39 +0800
e87a468eb ipv6: Fix udp checksums with raw sockets ... Browse Code »

It was reported that trancerout6 would cause
a kernel to crash when trying to compute checksums
on raw UDP packets. The cause was the check in
__ip6_append_data that would attempt to use
partial checksums on the packet. However,
raw sockets do not initialize partial checksum
fields so partial checksums can't be used.

Solve this the same way IPv4 does it. raw sockets
pass transhdrlen value of 0 to ip_append_data which
causes the checksum to be computed in software. Use
the same check in ip6_append_data (check transhdrlen).

Reported-by: Wolfgang Walter
CC: Wolfgang Walter
CC: Eric Dumazet
Signed-off-by: Vladislav Yasevich
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Vlad Yasevich
2015-05-15 10:27:03 +0800
91dd93f95 netlink: move nl_table in read_mostly section ... Browse Code »

netlink sockets creation and deletion heavily modify nl_table_users
and nl_table_lock.

If nl_table is sharing one cache line with one of them, netlink
performance is really bad on SMP.

ffffffff81ff5f00 B nl_table
ffffffff81ff5f0c b nl_table_users

Putting nl_table in read_mostly section increased performance
of my open/delete netlink sockets test by about 80 %

This came up while diagnosing a getaddrinfo() problem.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-05-15 05:49:06 +0800

14 May, 2015

2 commits

177d0506a Bluetooth: Fix remote name event return directly. ... Browse Code »

This patch fixes hci_remote_name_evt dose not resolve name during
discovery status is RESOLVING. Before simultaneous dual mode scan enabled,
hci_check_pending_name will set discovery status to STOPPED eventually.

Signed-off-by: Wesley Kuo
Signed-off-by: Marcel Holtmann

Wesley Kuo
2015-05-14 16:35:04 +0800
be346ffaa vlan: Correctly propagate promisc|allmulti flags in notifier. ... Browse Code »

Currently vlan notifier handler will try to update all vlans
for a device when that device comes up. A problem occurs,
however, when the vlan device was set to promiscuous, but not
by the user (ex: a bridge). In that case, dev->gflags are
not updated. What results is that the lower device ends
up with an extra promiscuity count. Here are the
backtraces that prove this:
[62852.052179] [] __dev_set_promiscuity+0x38/0x1e0
[62852.052186] [] ? _raw_spin_unlock_bh+0x1b/0x40
[62852.052188] [] ? dev_set_rx_mode+0x2e/0x40
[62852.052190] [] dev_set_promiscuity+0x24/0x50
[62852.052194] [] vlan_dev_open+0xd5/0x1f0 [8021q]
[62852.052196] [] __dev_open+0xbf/0x140
[62852.052198] [] __dev_change_flags+0x9d/0x170
[62852.052200] [] dev_change_flags+0x29/0x60

The above comes from the setting the vlan device to IFF_UP state.

[62852.053569] [] __dev_set_promiscuity+0x38/0x1e0
[62852.053571] [] ? vlan_dev_set_rx_mode+0x2b/0x30
[8021q]
[62852.053573] [] __dev_change_flags+0xe5/0x170
[62852.053645] [] dev_change_flags+0x29/0x60
[62852.053647] [] vlan_device_event+0x18a/0x690
[8021q]
[62852.053649] [] notifier_call_chain+0x4c/0x70
[62852.053651] [] raw_notifier_call_chain+0x16/0x20
[62852.053653] [] call_netdevice_notifiers+0x2d/0x60
[62852.053654] [] __dev_notify_flags+0x33/0xa0
[62852.053656] [] dev_change_flags+0x52/0x60
[62852.053657] [] do_setlink+0x397/0xa40

And this one comes from the notification code. What we end
up with is a vlan with promiscuity count of 1 and and a physical
device with a promiscuity count of 2. They should both have
a count 1.

To resolve this issue, vlan code can use dev_get_flags() api
which correctly masks promiscuity and allmulti flags.

Signed-off-by: Vlad Yasevich
Signed-off-by: David S. Miller

Vlad Yasevich
2015-05-14 12:54:32 +0800

13 May, 2015

2 commits

110bc7672 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from
Avri Altman.

2) Use the correct FW API for scan completions in iwlwifi, from Avraham
Stern.

3) FW monitor in iwlwifi accidently uses unmapped memory, fix from Liad
Kaufman.

4) rhashtable conversion of mac80211 station table was buggy, the
virtual interface was not taken into account. Fix from Johannes
Berg.

5) Fix deadlock in rtlwifi by not using a zero timeout for
usb_control_msg(), from Larry Finger.

6) Update reordering state before calculating loss detection, from
Yuchung Cheng.

7) Fix off by one in bluetooth firmward parsing, from Dan Carpenter.

8) Fix extended frame handling in xiling_can driver, from Jeppe
Ledet-Pedersen.

9) Fix CODEL packet scheduler behavior in the presence of TSO packets,
from Eric Dumazet.

10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck.

11) macvlan needs to propagate promisc settings down the the lower
device, from Vlad Yasevich.

12) igb driver can oops when changing number of rings, from Toshiaki
Makita.

13) Source specific default routes not handled properly in ipv6, from
Markus Stenberg.

14) Use after free in tc_ctl_tfilter(), from WANG Cong.

15) Use softirq spinlocking in netxen driver, from Tony Camuso.

16) Two ARM bpf JIT fixes from Nicolas Schichan.

17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from
Mathias Kretschmer.

18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei
Starovoitov.

19) ll_temac driver DMA maps TX packet header with incorrect length, fix
from Michal Simek.

20) We removed pm_qos bits from netdevice.h, but some indirect
references remained. Kill them. From David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
net: Remove remaining remnants of pm_qos from netdevice.h
e1000e: Add pm_qos header
net: phy: micrel: Fix regression in kszphy_probe
net: ll_temac: Fix DMA map size bug
x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions
netns: return RTM_NEWNSID instead of RTM_GETNSID on a get
Update be2net maintainers' email addresses
net_sched: gred: use correct backlog value in WRED mode
pppoe: drop pppoe device in pppoe_unbind_sock_work
net: qca_spi: Fix possible race during probe
net: mdio-gpio: Allow for unspecified bus id
af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).
bnx2x: limit fw delay in kdump to 5s after boot
ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits.
ARM: net fix emit_udiv() for BPF_ALU | BPF_DIV | BPF_K intruction.
mpls: Change reserved label names to be consistent with netbsd
usbnet: avoid integer overflow in start_xmit
netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2)
net: xgene_enet: Set hardware dependency
net: amd-xgbe: Add hardware dependency
...

Linus Torvalds
2015-05-13 12:10:38 +0800
e3d8ecb70 netns: return RTM_NEWNSID instead of RTM_GETNSID on a get ... Browse Code »

Usually, RTM_NEWxxx is returned on a get (same as a dump).

Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2015-05-13 06:53:25 +0800

12 May, 2015

1 commit

145a42b3a net_sched: gred: use correct backlog value in WRED mode ... Browse Code »

In WRED mode, the backlog for a single virtual queue (VQ) should not be
used to determine queue behavior; instead the backlog is summed across
all VQs. This sum is currently used when calculating the average queue
lengths. It also needs to be used when determining if the queue's hard
limit has been reached, or when reporting each VQ's backlog via netlink.
q->backlog will only be used if the queue switches out of WRED mode.

Signed-off-by: David Ward
Signed-off-by: David S. Miller

David Ward
2015-05-12 01:26:26 +0800

11 May, 2015

2 commits

47b4e1fc4 mac80211: move WEP tailroom size check ... Browse Code »

Remove checking tailroom when adding IV as it uses only
headroom, and move the check to the ICV generation that
actually needs the tailroom.

In other case I hit such warning and datapath don't work,
when testing:
- IBSS + WEP
- ath9k with hw crypt enabled
- IPv6 data (ping6)

WARNING: CPU: 3 PID: 13301 at net/mac80211/wep.c:102 ieee80211_wep_add_iv+0x129/0x190 [mac80211]()
[...]
Call Trace:
[] dump_stack+0x45/0x57
[] warn_slowpath_common+0x8a/0xc0
[] warn_slowpath_null+0x1a/0x20
[] ieee80211_wep_add_iv+0x129/0x190 [mac80211]
[] ieee80211_crypto_wep_encrypt+0x6b/0xd0 [mac80211]
[] invoke_tx_handlers+0xc51/0xf30 [mac80211]
[...]

Cc: stable@vger.kernel.org
Signed-off-by: Janusz Dziedzic
Signed-off-by: Johannes Berg

Janusz Dziedzic
2015-05-11 20:51:29 +0800
fbf33a280 af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT). ... Browse Code »

This patch fixes an issue where the send(MSG_DONTWAIT) call
on a TX_RING is not fully non-blocking in cases where the device's sndBuf is
full. We pass nonblock=true to sock_alloc_send_skb() and return any possibly
occuring error code (most likely EGAIN) to the caller. As the fast-path stays
as it is, we keep the unlikely() around skb == NULL.

Signed-off-by: Mathias Kretschmer
Signed-off-by: David S. Miller

Kretschmer, Mathias
2015-05-11 07:40:08 +0800

10 May, 2015

2 commits

78f5b8991 mpls: Change reserved label names to be consistent with netbsd ... Browse Code »

Since these are now visible to userspace it is nice to be consistent
with BSD (sys/netmpls/mpls.h in netBSD).

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2015-05-10 10:29:50 +0800
d74431857 net_sched: fix a use-after-free in tc_ctl_tfilter() ... Browse Code »

When tcf_destroy() returns true, tp could be already destroyed,
we should not use tp->next after that.

For long term, we probably should move tp list to list_head.

Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2015-05-10 04:14:04 +0800