Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

17 Apr, 2014

3 commits

8c923ce21 ip_tunnel: use the right netns in ioctl handler ... Browse Code »

Because the netdevice may be in another netns than the i/o netns, we should
use the i/o netns instead of dev_net(dev).

The variable 'tunnel' was used only to get 'itn', hence to simplify code I
remove it and use 't' instead.

Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2014-04-17 03:16:02 +0800
0d5edc687 ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source() ... Browse Code »

In my special case, when a packet is redirected from veth0 to lo,
its skb->dev->ifindex would be LOOPBACK_IFINDEX. Meanwhile we
pass the hard-coded LOOPBACK_IFINDEX to fib_validate_source()
in ip_route_input_slow(). This would cause the following check
in fib_validate_source() fail:

(dev->ifindex != oif || !IN_DEV_TX_REDIRECTS(idev))

when rp_filter is disabeld on loopback. As suggested by Julian,
the caller should pass 0 here so that we will not end up by
calling __fib_validate_source().

Cc: Eric Biederman
Cc: Julian Anastasov
Cc: David S. Miller
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2014-04-17 03:05:12 +0800
6a662719c ipv4, fib: pass LOOPBACK_IFINDEX instead of 0 to flowi4_iif ... Browse Code »

As suggested by Julian:

Simply, flowi4_iif must not contain 0, it does not
look logical to ignore all ip rules with specified iif.

because in fib_rule_match() we do:

if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
goto out;

flowi4_iif should be LOOPBACK_IFINDEX by default.

We need to move LOOPBACK_IFINDEX to include/net/flow.h:

1) It is mostly used by flowi_iif

2) Fix the following compile error if we use it in flow.h
by the patches latter:

In file included from include/linux/netfilter.h:277:0,
from include/net/netns/netfilter.h:5,
from include/net/net_namespace.h:21,
from include/linux/netdevice.h:43,
from include/linux/icmpv6.h:12,
from include/linux/ipv6.h:61,
from include/net/ipv6.h:16,
from include/linux/sunrpc/clnt.h:27,
from include/linux/nfs_fs.h:30,
from init/do_mounts.c:32:
include/net/flow.h: In function ‘flowi4_init_output’:
include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)

Cc: Eric Biederman
Cc: Julian Anastasov
Cc: David S. Miller
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2014-04-17 03:05:11 +0800

16 Apr, 2014

2 commits

aad88724c ipv4: add a sock pointer to dst->output() path. ... Browse Code »
5

In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.

The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.

Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
Reported-by: lucien xin
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-04-16 01:47:15 +0800
b0270e910 ipv4: add a sock pointer to ip_queue_xmit() ... Browse Code »
5

ip_queue_xmit() assumes the skb it has to transmit is attached to an
inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
changed l2tp to not change skb ownership and thus broke this assumption.

One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
so that we do not assume skb->sk points to the socket used by l2tp
tunnel.

Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
Reported-by: Zhan Jianyu
Tested-by: Zhan Jianyu
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-04-16 00:58:34 +0800

14 Apr, 2014

2 commits

91146153d ipv4: return valid RTA_IIF on ip route get ... Browse Code »
5

Extend commit 13378cad02afc2adc6c0e07fca03903c7ada0b37
("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid
RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0
which is displayed as 'iif *'.

inet_iif is not appropriate to use because skb_iif is not set.
Use the skb->dev->ifindex instead.

Signed-off-by: Julian Anastasov
Signed-off-by: David S. Miller

Julian Anastasov
2014-04-14 11:23:41 +0800
b04c46190 net: ipv4: current group_info should be put after using. ... Browse Code »
5

Plug a group_info refcount leak in ping_init.
group_info is only needed during initialization and
the code failed to release the reference on exit.
While here move grabbing the reference to a place
where it is actually needed.

Signed-off-by: Chuansheng Liu
Signed-off-by: Zhang Dongxing
Signed-off-by: xiaoming wang
Signed-off-by: David S. Miller

Wang, Xiaoming
2014-04-14 10:52:58 +0800

13 Apr, 2014

2 commits

8d89dcdf8 vti: don't allow to add the same tunnel twice ... Browse Code »
19

Before the patch, it was possible to add two times the same tunnel:
ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41
ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41

It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
argument dev->type, which was set only later (when calling ndo_init handler
in register_netdevice()). Let's set this type in the setup handler, which is
called before newlink handler.

Introduced by commit b9959fd3b0fa ("vti: switch to new ip tunnel code").

CC: Cong Wang
CC: Steffen Klassert
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2014-04-13 05:03:11 +0800
5a4552752 gre: don't allow to add the same tunnel twice ... Browse Code »
5

Before the patch, it was possible to add two times the same tunnel:
ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249
ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249

It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
argument dev->type, which was set only later (when calling ndo_init handler
in register_netdevice()). Let's set this type in the setup handler, which is
called before newlink handler.

Introduced by commit c54419321455 ("GRE: Refactor GRE tunneling code.").

CC: Pravin B Shelar
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2014-04-13 05:03:11 +0800

12 Apr, 2014

1 commit

676d23690 net: Fix use after free by removing length arg from sk_data_ready callbacks. ... Browse Code »
13

Several spots in the kernel perform a sequence like:

skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller

David S. Miller
2014-04-12 04:15:36 +0800

09 Apr, 2014

1 commit

ce7613db2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull more networking updates from David Miller:

1) If a VXLAN interface is created with no groups, we can crash on
reception of packets. Fix from Mike Rapoport.

2) Missing includes in CPTS driver, from Alexei Starovoitov.

3) Fix string validations in isdnloop driver, from YOSHIFUJI Hideaki
and Dan Carpenter.

4) Missing irq.h include in bnxw2x, enic, and qlcnic drivers. From
Josh Boyer.

5) AF_PACKET transmit doesn't statistically count TX drops, from Daniel
Borkmann.

6) Byte-Queue-Limit enabled drivers aren't handled properly in
AF_PACKET transmit path, also from Daniel Borkmann.

Same problem exists in pktgen, and Daniel fixed it there too.

7) Fix resource leaks in driver probe error paths of new sxgbe driver,
from Francois Romieu.

8) Truesize of SKBs can gradually get more and more corrupted in NAPI
packet recycling path, fix from Eric Dumazet.

9) Fix uniprocessor netfilter build, from Florian Westphal. In the
longer term we should perhaps try to find a way for ARRAY_SIZE() to
work even with zero sized array elements.

10) Fix crash in netfilter conntrack extensions due to mis-estimation of
required extension space. From Andrey Vagin.

11) Since we commit table rule updates before trying to copy the
counters back to userspace (it's the last action we perform), we
really can't signal the user copy with an error as we are beyond the
point from which we can unwind everything. This causes all kinds of
use after free crashes and other mysterious behavior.

From Thomas Graf.

12) Restore previous behvaior of div/mod by zero in BPF filter
processing. From Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
net: sctp: wake up all assocs if sndbuf policy is per socket
isdnloop: several buffer overflows
netdev: remove potentially harmful checks
pktgen: fix xmit test for BQL enabled devices
net/at91_ether: avoid NULL pointer dereference
tipc: Let tipc_release() return 0
at86rf230: fix MAX_CSMA_RETRIES parameter
mac802154: fix duplicate #include headers
sxgbe: fix duplicate #include headers
net: filter: be more defensive on div/mod by X==0
netfilter: Can't fail and free after table replacement
xen-netback: Trivial format string fix
net: bcmgenet: Remove unnecessary version.h inclusion
net: smc911x: Remove unused local variable
bonding: Inactive slaves should keep inactive flag's value
netfilter: nf_tables: fix wrong format in request_module()
netfilter: nf_tables: set names cannot be larger than 15 bytes
netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len
netfilter: Add {ipt,ip6t}_osf aliases for xt_osf
netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf hooks
...

Linus Torvalds
2014-04-09 03:41:23 +0800

08 Apr, 2014

1 commit

3ed66e910 net: replace __this_cpu_inc in route.c with raw_cpu_inc ... Browse Code »

The RT_CACHE_STAT_INC macro triggers the new preemption checks
for __this_cpu ops.

I do not see any other synchronization that would allow the use of a
__this_cpu operation here however in commit dbd2915ce87e ("[IPV4]:
RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of
raw_smp_processor_id() here because "we do not care" about races. In
the past we agreed that the price of disabling interrupts here to get
consistent counters would be too high. These counters may be inaccurate
due to race conditions.

The use of __this_cpu op improves the situation already from what commit
dbd2915ce87e did since the single instruction emitted on x86 does not
allow the race to occur anymore. However, non x86 platforms could still
experience a race here.

Trace:

__this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
caller is __this_cpu_preempt_check+0x38/0x60
CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF 3.12.0-rc4+ #187
Call Trace:
check_preemption_disabled+0xec/0x110
__this_cpu_preempt_check+0x38/0x60
__ip_route_output_key+0x575/0x8c0
ip_route_output_flow+0x27/0x70
udp_sendmsg+0x825/0xa20
inet_sendmsg+0x85/0xc0
sock_sendmsg+0x9c/0xd0
___sys_sendmsg+0x37c/0x390
__sys_sendmsg+0x49/0x90
SyS_sendmsg+0x12/0x20
tracesys+0xe1/0xe6

Signed-off-by: Christoph Lameter
Acked-by: David S. Miller
Acked-by: Ingo Molnar
Cc: Eric Dumazet
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2014-04-08 07:36:14 +0800

05 Apr, 2014

1 commit

c58dd2dd4 netfilter: Can't fail and free after table replacement ... Browse Code »
5

All xtables variants suffer from the defect that the copy_to_user()
to copy the counters to user memory may fail after the table has
already been exchanged and thus exposed. Return an error at this
point will result in freeing the already exposed table. Any
subsequent packet processing will result in a kernel panic.

We can't copy the counters before exposing the new tables as we
want provide the counter state after the old table has been
unhooked. Therefore convert this into a silent error.

Cc: Florian Westphal
Signed-off-by: Thomas Graf
Signed-off-by: Pablo Neira Ayuso

Thomas Graf
2014-04-05 23:46:22 +0800

04 Apr, 2014

1 commit

32d01dc7b Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:
"A lot updates for cgroup:

- The biggest one is cgroup's conversion to kernfs. cgroup took
after the long abandoned vfs-entangled sysfs implementation and
made it even more convoluted over time. cgroup's internal objects
were fused with vfs objects which also brought in vfs locking and
object lifetime rules. Naturally, there are places where vfs rules
don't fit and nasty hacks, such as credential switching or lock
dance interleaving inode mutex and cgroup_mutex with object serial
number comparison thrown in to decide whether the operation is
actually necessary, needed to be employed.

After conversion to kernfs, internal object lifetime and locking
rules are mostly isolated from vfs interactions allowing shedding
of several nasty hacks and overall simplification. This will also
allow implmentation of operations which may affect multiple cgroups
which weren't possible before as it would have required nesting
i_mutexes.

- Various simplifications including dropping of module support,
easier cgroup name/path handling, simplified cgroup file type
handling and task_cg_lists optimization.

- Prepatory changes for the planned unified hierarchy, which is still
a patchset away from being actually operational. The dummy
hierarchy is updated to serve as the default unified hierarchy.
Controllers which aren't claimed by other hierarchies are
associated with it, which BTW was what the dummy hierarchy was for
anyway.

- Various fixes from Li and others. This pull request includes some
patches to add missing slab.h to various subsystems. This was
triggered xattr.h include removal from cgroup.h. cgroup.h
indirectly got included a lot of files which brought in xattr.h
which brought in slab.h.

There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch"

* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
cgroup: remove useless argument from cgroup_exit()
cgroup: fix spurious lockdep warning in cgroup_exit()
cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
cgroup: break kernfs active_ref protection in cgroup directory operations
cgroup: fix cgroup_taskset walking order
cgroup: implement CFTYPE_ONLY_ON_DFL
cgroup: make cgrp_dfl_root mountable
cgroup: drop const from @buffer of cftype->write_string()
cgroup: rename cgroup_dummy_root and related names
cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup: reorganize cgroup bootstrapping
cgroup: relocate setting of CGRP_DEAD
cpuset: use rcu_read_lock() to protect task_cs()
cgroup_freezer: document freezer_fork() subtleties
cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup: drop task_lock() protection around task->cgroups
cgroup: update how a newly forked task gets associated with css_set
...

Linus Torvalds
2014-04-04 04:05:42 +0800

30 Mar, 2014

1 commit

64c27237a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/marvell/mvneta.c

The mvneta.c conflict is a case of overlapping changes,
a conversion to devm_ioremap_resource() vs. a conversion
to netdev_alloc_pcpu_stats.

Signed-off-by: David S. Miller

David S. Miller
2014-03-30 06:48:54 +0800

29 Mar, 2014

1 commit

e2a1d3e47 tcp: fix get_timewait4_sock() delay computation on 64bit ... Browse Code »

It seems I missed one change in get_timewait4_sock() to compute
the remaining time before deletion of IPV4 timewait socket.

This could result in wrong output in /proc/net/tcp for tm->when field.

Fixes: 96f817fedec4 ("tcp: shrink tcp6_timewait_sock by one cache line")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-03-29 04:43:14 +0800

28 Mar, 2014

1 commit

a0b8486ca tcp: tcp_make_synack() minor changes ... Browse Code »

There is no need to allocate 15 bytes in excess for a SYNACK packet,
as it contains no data, only headers.

SYNACK are always generated in softirq context, and contain a single
segment, we can use TCP_INC_STATS_BH()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-03-28 03:10:10 +0800

27 Mar, 2014

2 commits

cc93fc51f tcp: delete unused parameter in tcp_nagle_check() ... Browse Code »

After commit d4589926d7a9 (tcp: refine TSO splits), tcp_nagle_check() does
not use parameter mss_now anymore.

Signed-off-by: Weiping Pan
Signed-off-by: David S. Miller

Peter Pan(潘卫平)
2014-03-27 03:43:40 +0800
fbd02dd40 ip_tunnel: Fix dst ref-count. ... Browse Code »

Commit 10ddceb22ba (ip_tunnel:multicast process cause panic due
to skb->_skb_refdst NULL pointer) removed dst-drop call from
ip-tunnel-recv.

Following commit reintroduce dst-drop and fix the original bug by
checking loopback packet before releasing dst.
Original bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681

CC: Xin Long
Signed-off-by: Pravin B Shelar
Signed-off-by: David S. Miller

Pravin B Shelar
2014-03-27 03:18:40 +0800

26 Mar, 2014

1 commit

04f58c885 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
Documentation/devicetree/bindings/net/micrel-ks8851.txt
net/core/netpoll.c

The net/core/netpoll.c conflict is a bug fix in 'net' happening
to code which is completely removed in 'net-next'.

In micrel-ks8851.txt we simply have overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2014-03-26 08:29:20 +0800

25 Mar, 2014

1 commit

0b8c7f6f2 ipv4: remove ip_rt_dump from route.c ... Browse Code »

ip_rt_dump do nothing after IPv4 route caches removal, so we can remove it.

Signed-off-by: Li RongQing
Signed-off-by: David S. Miller

Li RongQing
2014-03-25 00:45:01 +0800

24 Mar, 2014

1 commit

4a4eb21fd ipv4: remove ipv4_ifdown_dst from route.c ... Browse Code »

ipv4_ifdown_dst does nothing after IPv4 route caches removal,
so we can remove it.

Signed-off-by: Li RongQing
Signed-off-by: David S. Miller

Li RongQing
2014-03-24 12:18:44 +0800

21 Mar, 2014

2 commits

65886f439 ipmr: fix mfc notification flags ... Browse Code »

Commit 8cd3ac9f9b7b ("ipmr: advertise new mfc entries via rtnl") reuses the
function ipmr_fill_mroute() to notify mfc events.
But this function was used only for dump and thus was always setting the
flag NLM_F_MULTI, which is wrong in case of a single notification.

Libraries like libnl will wait forever for NLMSG_DONE.

CC: Thomas Graf
Signed-off-by: Nicolas Dichtel
Acked-by: Thomas Graf
Signed-off-by: David S. Miller

Nicolas Dichtel
2014-03-21 04:24:28 +0800
e35bad5d8 net: remove empty lines from tcp_syn_flood_action ... Browse Code »

Signed-off-by: Daniel Baluta
Signed-off-by: David S. Miller

Daniel Baluta
2014-03-21 04:17:22 +0800

19 Mar, 2014

2 commits

4d3bb511b cgroup: drop const from @buffer of cftype->write_string() ... Browse Code »

cftype->write_string() just passes on the writeable buffer from kernfs
and there's no reason to add const restriction on the buffer. The
only thing const achieves is unnecessarily complicating parsing of the
buffer. Drop const from @buffer.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Cc: Peter Zijlstra
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Daniel Borkmann
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki

Tejun Heo
2014-03-19 22:23:54 +0800
995dca4ce Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next ... Browse Code »

Steffen Klassert says:

====================
One patch to rename a newly introduced struct. The rest is
the rework of the IPsec virtual tunnel interface for ipv6 to
support inter address family tunneling and namespace crossing.

1) Rename the newly introduced struct xfrm_filter to avoid a
conflict with iproute2. From Nicolas Dichtel.

2) Introduce xfrm_input_afinfo to access the address family
dependent tunnel callback functions properly.

3) Add and use a IPsec protocol multiplexer for ipv6.

4) Remove dst_entry caching. vti can lookup multiple different
dst entries, dependent of the configured xfrm states. Therefore
it does not make to cache a dst_entry.

5) Remove caching of flow informations. vti6 does not use the the
tunnel endpoint addresses to do route and xfrm lookups.

6) Update the vti6 to use its own receive hook.

7) Remove the now unused xfrm_tunnel_notifier. This was used from vti
and is replaced by the IPsec protocol multiplexer hooks.

8) Support inter address family tunneling for vti6.

9) Check if the tunnel endpoints of the xfrm state and the vti interface
are matching and return an error otherwise.

10) Enable namespace crossing for vti devices.
====================

Signed-off-by: David S. Miller

David S. Miller
2014-03-19 02:09:07 +0800

18 Mar, 2014

1 commit

e86e180b8 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next,
most relevantly they are:

* cleanup to remove double semicolon from stephen hemminger.

* calm down sparse warning in xt_ipcomp, from Fan Du.

* nf_ct_labels support for nf_tables, from Florian Westphal.

* new macros to simplify rcu dereferences in the scope of nfnetlink
and nf_tables, from Patrick McHardy.

* Accept queue and drop (including reason for drop) to verdict
parsing in nf_tables, also from Patrick.

* Remove unused random seed initialization in nfnetlink_log, from
Florian Westphal.

* Allow to attach user-specific information to nf_tables rules, useful
to attach user comments to rule, from me.

* Return errors in ipset according to the manpage documentation, from
Jozsef Kadlecsik.

* Fix coccinelle warnings related to incorrect bool type usage for ipset,
from Fengguang Wu.

* Add hash:ip,mark set type to ipset, from Vytas Dauksa.

* Fix message for each spotted by ipset for each netns that is created,
from Ilia Mirkin.

* Add forceadd option to ipset, which evicts a random entry from the set
if it becomes full, from Josh Hunt.

* Minor IPVS cleanups and fixes from Andi Kleen and Tingwei Liu.

* Improve conntrack scalability by removing a central spinlock, original
work from Eric Dumazet. Jesper Dangaard Brouer took them over to address
remaining issues. Several patches to prepare this change come in first
place.

* Rework nft_hash to resolve bugs (leaking chain, missing rcu synchronization
on element removal, etc. from Patrick McHardy.

* Restore context in the rule deletion path, as we now release rule objects
synchronously, from Patrick McHardy. This gets back event notification for
anonymous sets.

* Fix NAT family validation in nft_nat, also from Patrick.

* Improve scalability of xt_connlimit by using an array of spinlocks and
by introducing a rb-tree of hashtables for faster lookup of accounted
objects per network. This patch was preceded by several patches and
refactorizations to accomodate this change including the use of kmem_cache,
from Florian Westphal.
====================

Signed-off-by: David S. Miller

David S. Miller
2014-03-18 03:06:24 +0800

15 Mar, 2014

2 commits

57a7744e0 net: Replace u64_stats_fetch_begin_bh to u64_stats_fetch_begin_irq ... Browse Code »

Replace the bh safe variant with the hard irq safe variant.

We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.

Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Eric W. Biederman
2014-03-15 10:41:36 +0800
85dcce7a7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/usb/r8152.c
drivers/net/xen-netback/netback.c

Both the r8152 and netback conflicts were simple overlapping
changes.

Signed-off-by: David S. Miller

David S. Miller
2014-03-15 10:31:55 +0800

14 Mar, 2014

1 commit

2f32b51b6 xfrm: Introduce xfrm_input_afinfo to access the the callbacks properly ... Browse Code »

IPv6 can be build as a module, so we need mechanism to access
the address family dependent callback functions properly.
Therefore we introduce xfrm_input_afinfo, similar to that
what we have for the address family dependent part of
policies and states.

Signed-off-by: Steffen Klassert

Steffen Klassert
2014-03-14 14:28:07 +0800

12 Mar, 2014

1 commit

c3f9b0184 tcp: tcp_release_cb() should release socket ownership ... Browse Code »

Lars Persson reported following deadlock :

-000 |M:0x0:0x802B6AF8(asm)
Tested-by: Lars Persson
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-03-12 04:45:59 +0800

11 Mar, 2014

1 commit

431a91242 tcp: timestamp SYN+DATA messages ... Browse Code »
18

All skb in socket write queue should be properly timestamped.

In case of FastOpen, we special case the SYN+DATA 'message' as we
queue in socket wrote queue the two fallback skbs:

1) SYN message by itself.
2) DATA segment by itself.

We should make sure these skbs have proper timestamps.

Add a WARN_ON_ONCE() to eventually catch future violations.

Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: Eric Dumazet
Cc: Neal Cardwell
Cc: Yuchung Cheng
Acked-by: Neal Cardwell
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller

Eric Dumazet
2014-03-11 04:15:54 +0800

08 Mar, 2014

1 commit

219626924 tcp: do not leak non zero tstamp in output packets ... Browse Code »

Usage of skb->tstamp should remain private to TCP stack
(only set on packets on write queue, not on cloned ones)

Otherwise, packets given to loopback interface with a non null tstamp
can confuse netif_rx() / net_timestamp_check()

Other possibility would be to clear tstamp in loopback_xmit(),
as done in skb_scrub_packet()

Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: Eric Dumazet
Reported-by: Willem de Bruijn
Signed-off-by: David S. Miller

Eric Dumazet
2014-03-08 03:32:57 +0800

07 Mar, 2014

2 commits

e588e2f28 inet: frag: make sure forced eviction removes all frags ... Browse Code »

Quoting Alexander Aring:
While fragmentation and unloading of 6lowpan module I got this kernel Oops
after few seconds:

BUG: unable to handle kernel paging request at f88bbc30
[..]
Modules linked in: ipv6 [last unloaded: 6lowpan]
Call Trace:
[] ? call_timer_fn+0x54/0xb3
[] ? process_timeout+0xa/0xa
[] run_timer_softirq+0x140/0x15f

Problem is that incomplete frags are still around after unload; when
their frag expire timer fires, we get crash.

When a netns is removed (also done when unloading module), inet_frag
calls the evictor with 'force' argument to purge remaining frags.

The evictor loop terminates when accounted memory ('work') drops to 0
or the lru-list becomes empty. However, the mem accounting is done
via percpu counters and may not be accurate, i.e. loop may terminate
prematurely.

Alter evictor to only stop once the lru list is empty when force is
requested.

Reported-by: Phoebe Buckheister
Reported-by: Alexander Aring
Tested-by: Alexander Aring
Signed-off-by: Florian Westphal
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Florian Westphal
2014-03-07 04:28:45 +0800
f7324acd9 tcp: Use NET_ADD_STATS instead of NET_ADD_STATS_BH in tcp_event_new_data_sent() ... Browse Code »

Can be invoked from non-BH context.

Based upon a patch by Eric Dumazet.

Fixes: f19c29e3e391 ("tcp: snmp stats for Fast Open, SYN rtx, and data pkts")
Reported-by: Sergey Senozhatsky
Signed-off-by: David S. Miller

David S. Miller
2014-03-07 04:19:43 +0800

06 Mar, 2014

2 commits

67ddc87f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/wireless/ath/ath9k/recv.c
drivers/net/wireless/mwifiex/pcie.c
net/ipv6/sit.c

The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.

The two wireless conflicts were overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2014-03-06 09:32:02 +0800
24b9bf43e net: fix for a race condition in the inet frag code ... Browse Code »

I stumbled upon this very serious bug while hunting for another one,
it's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions
(basically the users of inet_frag_kill/inet_frag_put).

What happens is that after a fragment has been added to the hash chain
but before it's been added to the lru_list (inet_frag_lru_add) in
inet_frag_intern, it may get deleted (either by an expired timer if
the system load is high or the timer sufficiently low, or by the
fraq_queue function for different reasons) before it's added to the
lru_list, then after it gets added it's a matter of time for the
evictor to get to a piece of memory which has been freed leading to a
number of different bugs depending on what's left there.

I've been able to trigger this on both IPv4 and IPv6 (which is normal
as the frag code is the same), but it's been much more difficult to
trigger on IPv4 due to the protocol differences about how fragments
are treated.

The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
in a RR bond, so the same flow can be seen on multiple cards at the
same time. Then I used multiple instances of ping/ping6 to generate
fragmented packets and flood the machines with them while running
other processes to load the attacked machine.

*It is very important to have the _same flow_ coming in on multiple CPUs
concurrently. Usually the attacked machine would die in less than 30
minutes, if configured properly to have many evictor calls and timeouts
it could happen in 10 minutes or so.

An important point to make is that any caller (frag_queue or timer) of
inet_frag_kill will remove both the timer refcount and the
original/guarding refcount thus removing everything that's keeping the
frag from being freed at the next inet_frag_put. All of this could
happen before the frag was ever added to the LRU list, then it gets
added and the evictor uses a freed fragment.

An example for IPv6 would be if a fragment is being added and is at
the stage of being inserted in the hash after the hash lock is
released, but before inet_frag_lru_add executes (or is able to obtain
the lru lock) another overlapping fragment for the same flow arrives
at a different CPU which finds it in the hash, but since it's
overlapping it drops it invoking inet_frag_kill and thus removing all
guarding refcounts, and afterwards freeing it by invoking
inet_frag_put which removes the last refcount added previously by
inet_frag_find, then inet_frag_lru_add gets executed by
inet_frag_intern and we have a freed fragment in the lru_list.

The fix is simple, just move the lru_add under the hash chain locked
region so when a removing function is called it'll have to wait for
the fragment to be added to the lru_list, and then it'll remove it (it
works because the hash chain removal is done before the lru_list one
and there's no window between the two list adds when the frag can get
dropped). With this fix applied I couldn't kill the same machine in 24
hours with the same setup.

Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of
rwlock")

CC: Florian Westphal
CC: Jesper Dangaard Brouer
CC: David S. Miller

Signed-off-by: Nikolay Aleksandrov
Acked-by: Jesper Dangaard Brouer
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2014-03-06 09:31:42 +0800

04 Mar, 2014

3 commits

f19c29e3e tcp: snmp stats for Fast Open, SYN rtx, and data pkts ... Browse Code »

Add the following snmp stats:

TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse
the remote does not accept it or the attempts timed out.

TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

TCPOrigDataSent: number of outgoing packets with original data (excluding
retransmission but including data-in-SYN). This counter is different from
TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
more useful to track the TCP retransmission rate.

Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to
TCPFastOpenPassive.

Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Signed-off-by: Nandita Dukkipati
Signed-off-by: Lawrence Brakmo
Signed-off-by: David S. Miller

Yuchung Cheng
2014-03-04 04:58:03 +0800
10ddceb22 ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointer ... Browse Code »

when ip_tunnel process multicast packets, it may check if the packet is looped
back packet though 'rt_is_output_route(skb_rtable(skb))' in ip_tunnel_rcv(),
but before that , skb->_skb_refdst has been dropped in iptunnel_pull_header(),
so which leads to a panic.

fix the bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2014-03-04 04:56:40 +0800
c84a57113 tcp: fix bogus RTT on special retransmission ... Browse Code »

RTT may be bogus with tall loss probe (TLP) when a packet
is retransmitted and latter (s)acked without TCPCB_SACKED_RETRANS flag.

For example, TLP calls __tcp_retransmit_skb() instead of
tcp_retransmit_skb(). The skb timestamps are updated but the sacked
flag is not marked with TCPCB_SACKED_RETRANS. As a result we'll
get bogus RTT in tcp_clean_rtx_queue() or in tcp_sacktag_one() on
spurious retransmission.

The fix is to apply the sticky flag TCP_EVER_RETRANS to enforce Karn's
check on RTT sampling. However this will disable F-RTO if timeout occurs
after TLP, by resetting undo_marker in tcp_enter_loss(). We relax this
check to only if any pending retransmists are still in-flight.

Signed-off-by: Yuchung Cheng
Acked-by: Eric Dumazet
Acked-by: Neal Cardwell
Acked-by: Nandita Dukkipati
Signed-off-by: David S. Miller

Yuchung Cheng
2014-03-04 04:33:02 +0800