Doug / smarc-fsl-linux-kernel | Embedian Git Server

31 Aug, 2012

1 commit

c5ae7d419 ipv4: must use rcu protection while calling fib_lookup

The following lockdep splat was reported by Pavel Roskin:

[ 1570.586223] ===============================
[ 1570.586225] [ INFO: suspicious RCU usage. ]
[ 1570.586228] 3.6.0-rc3-wl-main #98 Not tainted
[ 1570.586229] -------------------------------
[ 1570.586231] /home/proski/src/linux/net/ipv4/route.c:645 suspicious rcu_dereference_check() usage!
[ 1570.586233]
[ 1570.586233] other info that might help us debug this:
[ 1570.586233]
[ 1570.586236]
[ 1570.586236] rcu_scheduler_active = 1, debug_locks = 0
[ 1570.586238] 2 locks held by Chrome_IOThread/4467:
[ 1570.586240] #0: (slock-AF_INET){+.-...}, at: [] release_sock+0x2c/0xa0
[ 1570.586253] #1: (fnhe_lock){+.-...}, at: [] update_or_create_fnhe+0x2c/0x270
[ 1570.586260]
[ 1570.586260] stack backtrace:
[ 1570.586263] Pid: 4467, comm: Chrome_IOThread Not tainted 3.6.0-rc3-wl-main #98
[ 1570.586265] Call Trace:
[ 1570.586271] [] lockdep_rcu_suspicious+0xfd/0x130
[ 1570.586275] [] update_or_create_fnhe+0x15c/0x270
[ 1570.586278] [] __ip_rt_update_pmtu+0x73/0xb0
[ 1570.586282] [] ip_rt_update_pmtu+0x29/0x90
[ 1570.586285] [] inet_csk_update_pmtu+0x2c/0x80
[ 1570.586290] [] tcp_v4_mtu_reduced+0x2e/0xc0
[ 1570.586293] [] tcp_release_cb+0xa4/0xb0
[ 1570.586296] [] release_sock+0x55/0xa0
[ 1570.586300] [] tcp_sendmsg+0x4af/0xf50
[ 1570.586305] [] inet_sendmsg+0x120/0x230
[ 1570.586308] [] ? inet_sk_rebuild_header+0x40/0x40
[ 1570.586312] [] ? sock_update_classid+0xbd/0x3b0
[ 1570.586315] [] ? sock_update_classid+0x130/0x3b0
[ 1570.586320] [] do_sock_write+0xc5/0xe0
[ 1570.586323] [] sock_aio_write+0x53/0x80
[ 1570.586328] [] do_sync_write+0xa3/0xe0
[ 1570.586332] [] vfs_write+0x165/0x180
[ 1570.586335] [] sys_write+0x45/0x90
[ 1570.586340] [] system_call_fastpath+0x16/0x1b

Signed-off-by: Eric Dumazet
Reported-by: Pavel Roskin
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-31 01:33:08 +0800

24 Aug, 2012

1 commit

78df76a06 ipv4: take rt_uncached_lock only if needed

Multicast traffic allocates dst with DST_NOCACHE, but dst is
not inserted into rt_uncached_list.

This slows down multicast workloads on SMP because rt_uncached_lock is
contended.

Change the test before taking the lock to actually check the dst
was inserted into rt_uncached_list.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-24 23:47:48 +0800

23 Aug, 2012

1 commit

9b04f3500 ipv4: properly update pmtu

Sylvain Munaut reported the following:

- TCP connections get "stuck" with data in the send queue when doing
"large" transfers (like typing 'ps ax' on an ssh connection)
- Only happens on paths where the PMTU is lower than the MTU of
the interface
- Is not present right after boot, it only appears 10-20min after
boot or so. (and that's inside the _same_ TCP connection, it works
fine at first and then in the same ssh session, it'll get stuck)
- Definitely seems related to fragments somehow since I see a router
sending ICMP message saying fragmentation is needed.
- Exact same setup works fine with kernel 3.5.1

The problem happens when the 10-minute (ip_rt_mtu_expires) expiration
period is over.

ip_rt_update_pmtu() calls dst_set_expires() to rearm a new expiration,
but dst_set_expires() does nothing because dst.expires is already set.

It seems we want to set the expires field to a new value, regardless
of prior one.

With help from Julian Anastasov.
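
A minimal user-space sketch of the expiry problem described above. It assumes the helper only arms the timer when none is set or when the new deadline is earlier than the current one; the toy_* names and the explicit "now" parameter are hypothetical stand-ins, not the kernel's API.

```c
#include <stdbool.h>

/* Simplified stand-in for a dst entry; only the expiry field is modelled. */
struct toy_dst {
    unsigned long expires;   /* 0 means "no expiry armed" */
};

/* Wrap-safe "a is before b" comparison on a free-running tick counter. */
static bool toy_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* Assumed pre-fix behaviour: keep the current deadline unless it is
 * unset or the new deadline would be earlier. */
static void toy_set_expires_if_earlier(struct toy_dst *dst,
                                       unsigned long now,
                                       unsigned long timeout)
{
    unsigned long expires = now + timeout;

    if (expires == 0)
        expires = 1;
    if (dst->expires == 0 || toy_time_before(expires, dst->expires))
        dst->expires = expires;
}

/* Idea of the fix: overwrite the deadline regardless of the prior value. */
static void toy_set_expires_always(struct toy_dst *dst,
                                   unsigned long now,
                                   unsigned long timeout)
{
    unsigned long expires = now + timeout;

    dst->expires = expires ? expires : 1;
}
```

With a first PMTU event at tick 1000 and a timeout of 600, both variants arm the deadline at 1600. A second event at tick 2000 leaves the stale 1600 in place with the first variant, while the second moves it to 2600.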

Reported-by: Sylvain Munaut
Signed-off-by: Eric Dumazet
CC: Julian Anastasov
Tested-by: Sylvain Munaut
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-23 10:14:30 +0800

15 Aug, 2012

1 commit

7bd86cc28 ipv4: Cache local output routes

Commit caacf05e5ad1abf causes a big drop in UDP loopback performance.
The cause of the regression is that we do not cache the local output
routes. Each time we send a datagram from an unconnected UDP socket,
the kernel allocates a dst_entry and adds it to the rt_uncached_list.
This creates lock contention on the rt_uncached_lock.

Reported-by: Alex Shi
Signed-off-by: Yan, Zheng
Signed-off-by: David S. Miller

Yan, Zheng
2012-08-15 05:45:07 +0800

02 Aug, 2012

1 commit

e33cdac01 ipv4: route.c cleanup

Remove unused includes after IP cache removal

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-02 17:54:43 +0800

01 Aug, 2012

4 commits

caacf05e5 ipv4: Properly purge netdev references on uncached routes.

When a device is unregistered, we have to purge all of the
references to it that may exist in the entire system.

If a route is uncached, we currently have no way of accomplishing
this.

So create a global list that is scanned when a network device goes
down. This mirrors the logic in net/core/dst.c's dst_ifdown().

Signed-off-by: David S. Miller

David S. Miller
2012-08-01 06:06:50 +0800
c5038a832 ipv4: Cache routes in nexthop exception entries.

Signed-off-by: David S. Miller

David S. Miller
2012-08-01 06:02:02 +0800
d26b3a7c4 ipv4: percpu nh_rth_output cache

The input path mostly runs under RCU and doesn't touch the dst refcnt.

But the output path on forwarding or UDP workloads hits the dst refcount
hard, and we get a lot of false sharing, for example
in ipv4_mtu() when reading rt->rt_pmtu.

Using a percpu cache for nh_rth_output gives a nice performance
increase at a small cost.

24-process udpflood test on my 24-CPU machine (dummy0 output device)
(each process sends 1,000,000 UDP frames, 24 processes are started)

before : 5.24 s
after : 2.06 s
For reference, time on linux-3.5 : 6.60 s

Signed-off-by: Eric Dumazet
Tested-by: Alexander Duyck
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-01 05:41:39 +0800
54764bb64 ipv4: Restore old dst_free() behavior.

commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried
to solve a race but added a problem at device/fib dismantle time :

We really want to call dst_free() as soon as possible, even if sockets
still have dst in their cache.
dst_release() calls in free_fib_info_rcu() are not welcome.

The root of the problem was that, now that we also cache output routes (in
nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
rt_free(), because output route lookups are done in process context.

Based on feedback and initial patch from David Miller (adding another
call_rcu_bh() call in fib, but it appears it was not the right fix)

I left the inet_sk_rx_dst_set() helper and added __rcu attributes
to nh_rth_output and nh_rth_input to better document what is going on in
this code.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-01 05:41:38 +0800

31 Jul, 2012

1 commit

404e0a8b6 net: ipv4: fix RCU races on dst refcounts

commit c6cffba4ffa2 (ipv4: Fix input route performance regression.)
added various fatal races with dst refcounts.

Crashes happen on TCP workloads if routes are added/deleted at the same
time.

The dst_free() calls from free_fib_info_rcu() are clearly racy.

We instead need regular dst refcounting (dst_release()) and must make
sure dst_release() is aware of RCU grace periods:

Add a DST_RCU_FREE flag so that dst_release() respects an RCU grace period
before dst destruction for cached dsts.

Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
to make sure we don't increment a zero refcount (on a dst currently
waiting for an RCU grace period before destruction).

rt_cache_route() must take a reference on the new cached route, and
release it if it was not able to install it.
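
The "increment unless zero" idea behind atomic_inc_not_zero() can be sketched in user space with C11 atomics. This is an illustrative model only; toy_inc_not_zero is a hypothetical name and the kernel's own primitive is implemented differently.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still alive (refcount != 0).
 * Returns true if the reference was taken. */
static bool toy_inc_not_zero(atomic_int *refcnt)
{
    int old = atomic_load(refcnt);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value and the
         * loop re-checks it against zero before trying again. */
        if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
            return true;
    }
    return false;   /* object is already on its way to destruction */
}
```

A caller that gets false back must not use the object: its count already reached zero and it is only waiting for a grace period before being freed.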

With this patch, my machines survive various benchmarks.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-07-31 05:53:22 +0800

27 Jul, 2012

1 commit

c6cffba4f ipv4: Fix input route performance regression.

With the routing cache removal we lost the "noref" code paths on
input, and this can kill some routing workloads.

Reinstate the noref path when we hit a cached route in the FIB
nexthops.

With help from Eric Dumazet.

Reported-by: Alexander Duyck
Signed-off-by: David S. Miller
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

David S. Miller
2012-07-27 06:50:39 +0800

26 Jul, 2012

1 commit

4331debc5 ipv4: rt_cache_valid must check expired routes

commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.)
introduced the rt_cache_valid() helper. Unfortunately, it doesn't check
whether the route is expired before caching it.

I noticed sk_setup_caps() was constantly called on a tcp workload.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-07-26 06:24:14 +0800

24 Jul, 2012

3 commits

13378cad0 ipv4: Change rt->rt_iif encoding.

On input packet processing, rt->rt_iif will be zero if we should
use skb->dev->ifindex.

Since we access rt->rt_iif consistently via inet_iif(), that is
the only spot whose interpretation has to be adjusted.
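
As a rough illustration of the encoding described above, an accessor can fall back to the receiving device's index whenever the route stores zero. The struct and function names below are hypothetical simplifications, not the kernel's definitions.

```c
/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct toy_net_device { int ifindex; };
struct toy_iif_route  { int rt_iif; };   /* 0 means "use the skb's device" */
struct toy_skb {
    const struct toy_net_device *dev;
    const struct toy_iif_route *rt;
};

/* Single accessor that hides the encoding from all callers. */
static int toy_inet_iif(const struct toy_skb *skb)
{
    if (skb->rt->rt_iif)
        return skb->rt->rt_iif;
    return skb->dev->ifindex;
}
```

Because every caller goes through the one accessor, a route with rt_iif set to zero can be shared by packets arriving on different devices.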

Signed-off-by: David S. Miller

David S. Miller
2012-07-24 07:36:27 +0800
92101b3b2 ipv4: Prepare for change of rt->rt_iif encoding.

Use inet_iif() consistently, and for TCP record the input interface of
cached RX dst in inet sock.

rt->rt_iif is going to be encoded differently, so that we can
legitimately cache input routes in the FIB info more aggressively.

When the input interface is "use SKB device index" the rt->rt_iif will
be set to zero.

This forces us to move the TCP RX dst cache installation into the
ipv4-specific code, which is where it belongs anyway: caching the route
for ipv6 is pointless at the moment because it is not yet inspected in
the ipv6 input paths.

Also, remove the unlikely on dst->obsolete, since all ipv4 dsts have
obsolete set to a non-zero value to force invocation of the check
callback.

Signed-off-by: David S. Miller

David S. Miller
2012-07-24 07:36:26 +0800
fe3edf457 ipv4: Remove all RTCF_DIRECTSRC handling.

The last and final kernel user, ICMP address replies,
has been removed.

Signed-off-by: David S. Miller

David S. Miller
2012-07-24 04:22:20 +0800

21 Jul, 2012

17 commits

2860583fe ipv4: Kill rt->fi

It's not really needed.

We only grabbed a reference to the fib_info for the sake of fib_info
local metrics.

However, fib_info objects are freed using RCU, and therefore so are their
private metrics (if any).

We would have triggered a route cache flush if we eliminated a
reference to a fib_info object in the routing tables.

Therefore, any existing cached routes will first check and see that
they have been invalidated before an errant reference to these
metric values would occur.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:40:07 +0800
9917e1e87 ipv4: Turn rt->rt_route_iif into rt->rt_is_input.

That is this value's only use, as a boolean to indicate whether
a route is an input route or not.

So implement it that way, using a u16 gap present in the struct
already.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:40:02 +0800
4fd551d7b ipv4: Kill rt->rt_oif

Never actually used.

It was being set on output routes to the original OIF specified in the
flow key used for the lookup.

Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of
the flowi4_oif and flowi4_iif values, thanks to feedback from Julian
Anastasov.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:38:34 +0800
93ac53410 ipv4: Dirty less cache lines in route caching paths.

Don't bother incrementing dst->__use and setting dst->lastuse,
they are completely pointless and just slow things down.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:36:55 +0800
ba3f7f04e ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:36:54 +0800
d2d68ba9f ipv4: Cache input routes in fib_info nexthops.

Caching input routes is slightly simpler than output routes, since we
don't need to be concerned with nexthop exceptions. (locally
destined, and routed packets, never trigger PMTU events or redirects
that will be processed by us).

However, we have to elide caching for the DIRECTSRC and non-zero itag
cases.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:36:40 +0800
f2bb4bedf ipv4: Cache output routes in fib_info nexthops.

If we have an output route that lacks nexthop exceptions, we can cache
it in the FIB info nexthop.

Such routes will have DST_HOST cleared because such routes refer to a
family of destinations, rather than just one.

The sequence of the handling of exceptions during route lookup is
adjusted to make the logic work properly.

Before we allocate the route, we lookup the exception.

Then we know if we will cache this route or not, and therefore whether
DST_HOST should be set on the allocated route.

Then we use DST_HOST to key off whether we should store the resulting
route, during rt_set_nexthop(), in the FIB nexthop cache.

With help from Eric Dumazet.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:36:16 +0800
ceb332061 ipv4: Kill routes during PMTU/redirect updates.

Mark them obsolete so there will be a re-lookup to fetch the
FIB nexthop exception info.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:31:22 +0800
f5b0a8743 net: Document dst->obsolete better.

Add a big comment explaining how the field works, and use defines
instead of magic constants for the values assigned to it.

Suggested by Joe Perches.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:31:21 +0800
f8126f1d5 ipv4: Adjust semantics of rt->rt_gateway.

In order to allow prefixed routes, we have to adjust how rt_gateway
is set and interpreted.

The new interpretation is:

1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

2) rt_gateway != 0, destination requires a nexthop gateway

Abstract the fetching of the proper nexthop value using a new
inline helper, rt_nexthop(), as suggested by Joe Perches.
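
The two cases above map onto a very small helper. The following is a sketch of that interpretation with hypothetical toy_* names and plain 32-bit integers standing in for the kernel's address types.

```c
#include <stdint.h>

/* Hypothetical, reduced route structure: only the gateway field. */
struct toy_gw_route {
    uint32_t rt_gateway;   /* 0: destination is on-link */
};

/* Pick the address to resolve at the link layer for this packet. */
static uint32_t toy_rt_nexthop(const struct toy_gw_route *rt, uint32_t daddr)
{
    if (rt->rt_gateway)
        return rt->rt_gateway;   /* case 2: go through the gateway */
    return daddr;                /* case 1: deliver directly to daddr */
}
```

Leaving the gateway at zero for on-link destinations lets one route entry serve a whole prefix, since the per-packet destination is supplied by the caller.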

Signed-off-by: David S. Miller
Tested-by: Vijay Subramanian

David S. Miller
2012-07-21 04:31:20 +0800
f1ce3062c ipv4: Remove 'rt_dst' from 'struct rtable'

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:31:19 +0800
b48698895 ipv4: Remove 'rt_mark' from 'struct rtable'

Signed-off-by: David S. Miller

David Miller
2012-07-21 04:31:18 +0800
d6c0a4f60 ipv4: Kill 'rt_src' from 'struct rtable'

Signed-off-by: David S. Miller

David Miller
2012-07-21 04:31:00 +0800
1a00fee4f ipv4: Remove rt_key_{src,dst,tos} from struct rtable.

They are always used in contexts where they can be reconstituted,
or where the finally resolved rt->rt_{src,dst} is semantically
equivalent.

Signed-off-by: David S. Miller

David Miller
2012-07-21 04:30:59 +0800
38a424e46 ipv4: Kill ip_route_input_noref().

The "noref" argument to ip_route_input_common() is now always ignored
because we do not cache routes, and in that case we must always grab
a reference to the resulting 'dst'.

Signed-off-by: David S. Miller

David Miller
2012-07-21 04:30:59 +0800
89aef8921 ipv4: Delete routing cache.

The ipv4 routing cache is non-deterministic, performance-wise, and is
subject to denial-of-service attacks that are reasonably easy to launch.

The routing cache works great for well behaved traffic, and the world
was a much friendlier place when the tradeoffs that led to the routing
cache's design were considered.

What it boils down to is that the performance of the routing cache is
a product of the traffic patterns seen by a system rather than being a
product of the contents of the routing tables. The former are
controllable by external entities.

Even for "well behaved" legitimate traffic, high volume sites can see
hit rates in the routing cache of only ~10%.

Signed-off-by: David S. Miller

David S. Miller
2012-07-21 04:30:27 +0800
521f54909 ipv4: show pmtu in route list

Override the metrics with rt_pmtu

Signed-off-by: Julian Anastasov
Signed-off-by: David S. Miller

Julian Anastasov
2012-07-21 02:16:49 +0800

20 Jul, 2012

2 commits

f31fd3838 ipv4: Fix again the time difference calculation

Fix the diff value in rt_bind_exception again
after a collision between the two latest patches; my original commit
actually fixed the same problem.

Signed-off-by: Julian Anastasov
Signed-off-by: David S. Miller

Julian Anastasov
2012-07-20 04:01:44 +0800
aee06da67 ipv4: use seqlock for nh_exceptions

Use global seqlock for the nh_exceptions. Call
fnhe_oldest with the right hash chain. Correct the diff
value for dst_set_expires.

v2: after suggestions from Eric Dumazet:
* get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
* continue daddr search in rt_bind_exception

v3:
* remove the daddr check before seqlock in rt_bind_exception
* restart lookup in rt_bind_exception on detected seqlock change,
as suggested by David Miller
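
The "restart lookup on detected seqlock change" pattern mentioned above can be modelled in user space with C11 atomics. This is a generic sequence-counter sketch, not the kernel's seqlock API; the structure, field, and function names are hypothetical, and writers are assumed to be serialized by the caller.

```c
#include <stdatomic.h>
#include <stdint.h>

struct toy_exception {
    atomic_uint seq;              /* even: stable, odd: write in progress */
    _Atomic uint32_t pmtu;
    _Atomic uint32_t gateway;
};

/* Writer side: bump the counter before and after touching the data. */
static void toy_exception_write(struct toy_exception *e,
                                uint32_t pmtu, uint32_t gateway)
{
    atomic_fetch_add(&e->seq, 1);         /* counter becomes odd */
    atomic_store(&e->pmtu, pmtu);
    atomic_store(&e->gateway, gateway);
    atomic_fetch_add(&e->seq, 1);         /* counter is even again */
}

/* Reader side: retry until both fields come from the same stable state. */
static void toy_exception_read(struct toy_exception *e,
                               uint32_t *pmtu, uint32_t *gateway)
{
    unsigned int start;

    do {
        start = atomic_load(&e->seq);
        *pmtu = atomic_load(&e->pmtu);
        *gateway = atomic_load(&e->gateway);
    } while ((start & 1u) || start != atomic_load(&e->seq));
}
```

Readers never block the writer; they only pay for a retry when a write overlapped their snapshot, which suits data that is read far more often than it is updated.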

Signed-off-by: Julian Anastasov
Signed-off-by: David S. Miller

Julian Anastasov
2012-07-20 01:30:14 +0800

19 Jul, 2012

1 commit

7fed84f62 ipv4: Fix time difference calculation in rt_bind_exception().

Reported-by: Steffen Klassert
Signed-off-by: David S. Miller

David S. Miller
2012-07-19 23:46:59 +0800

18 Jul, 2012

2 commits

5abf7f7e0 ipv4: fix rcu splat

free_nh_exceptions() should use rcu_dereference_protected(..., 1)
since it's called after one RCU grace period.

Also add some const-ification in recent code.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-07-18 04:47:33 +0800
d3a25c980 ipv4: Fix nexthop exception hash computation.

Need to mask it with (FNHE_HASH_SIZE - 1).
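
To illustrate why the mask matters, here is a small sketch of a hash that is folded into the table range. The table size of 2048 matches the nexthop exception commit in this log; the mixing step and the toy_* names are assumptions made for illustration and may not match the kernel's function.

```c
#include <stdint.h>

#define TOY_FNHE_HASH_SIZE 2048u   /* power of two, so size - 1 is a bit mask */

/* Map a destination address to a bucket index in [0, TOY_FNHE_HASH_SIZE). */
static uint32_t toy_fnhe_hash(uint32_t daddr)
{
    uint32_t hval = daddr;

    /* Hypothetical mixing step that folds high bits into the low bits. */
    hval ^= (hval >> 11) ^ (hval >> 22);

    /* Without this mask hval can be any 32-bit value and would index far
     * beyond the end of a TOY_FNHE_HASH_SIZE-entry table. */
    return hval & (TOY_FNHE_HASH_SIZE - 1);
}
```

The mask only works as a range reduction because the table size is a power of two; for other sizes a modulo would be needed instead.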

Signed-off-by: David S. Miller

David S. Miller
2012-07-18 04:23:08 +0800

17 Jul, 2012

2 commits

4895c771c ipv4: Add FIB nexthop exceptions.

In a regime where we have subnetted route entries, we need
persistent storage for destination-specific learned values
such as redirects and PMTU values.

This is implemented here via nexthop exceptions.

The initial implementation is a 2048 entry hash table with reclaiming
starting at chain length 5. A more sophisticated scheme can be
devised if that proves necessary.
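
One way to picture the reclaim policy is a per-bucket chain that stops growing past a fixed depth and reuses its stalest entry instead. The sketch below is a single-threaded user-space model; the names, the fields, and the exact threshold comparison are assumptions, not the kernel code.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_RECLAIM_DEPTH 5

struct toy_fnhe {
    struct toy_fnhe *next;
    uint32_t daddr;
    uint32_t pmtu;
    unsigned long stamp;     /* last update time, used to pick a victim */
};

/* Insert or update the exception for daddr in one hash chain.
 * Returns the entry, or NULL if allocation fails. */
static struct toy_fnhe *toy_fnhe_update(struct toy_fnhe **chain,
                                        uint32_t daddr, uint32_t pmtu,
                                        unsigned long now)
{
    struct toy_fnhe *fnhe, *oldest = NULL;
    int depth = 0;

    for (fnhe = *chain; fnhe; fnhe = fnhe->next) {
        if (fnhe->daddr == daddr)
            break;                       /* existing entry: just update it */
        if (!oldest || fnhe->stamp < oldest->stamp)
            oldest = fnhe;
        depth++;
    }

    if (!fnhe) {
        if (depth > TOY_RECLAIM_DEPTH) {
            fnhe = oldest;               /* reuse the stalest entry in place */
        } else {
            fnhe = malloc(sizeof(*fnhe));
            if (!fnhe)
                return NULL;
            fnhe->next = *chain;         /* push onto the head of the chain */
            *chain = fnhe;
        }
        fnhe->daddr = daddr;
    }
    fnhe->pmtu = pmtu;
    fnhe->stamp = now;
    return fnhe;
}

/* Count the entries currently linked into a chain. */
static int toy_chain_length(const struct toy_fnhe *chain)
{
    int n = 0;

    for (; chain; chain = chain->next)
        n++;
    return n;
}
```

With this threshold a chain holds at most six entries; further distinct destinations overwrite whichever entry was updated least recently, which bounds both memory use and lookup cost per bucket.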

Signed-off-by: David S. Miller

David S. Miller
2012-07-17 23:48:50 +0800
6700c2709 net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}()

This will be used so that we can compose a full flow key.

Even though we have a route in this context, we need more. In the
future the routes will be without destination address, source address,
etc. keying. One ipv4 route will cover entire subnets, etc.

In this environment we have to have a way to possess persistent storage
for redirects and PMTU information. This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here. Using that flow key will do a fib_lookup()
and create/update the persistent entry.

Signed-off-by: David S. Miller

David S. Miller
2012-07-17 18:29:28 +0800

13 Jul, 2012

1 commit

85b91b033 ipv4: Don't store a rule pointer in fib_result.

We only use it to fetch the rule's tclassid, so just store the
tclassid there instead.

This also decreases the size of fib_result by a full 8 bytes on
64-bit. On 32-bit it's a wash.
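
The size effect can be reproduced with two toy structures. The field layout below is an assumption chosen to resemble a lookup result (four byte-sized fields followed by pointers), not a copy of the kernel's struct fib_result, and the numbers hold for common ABIs with natural alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins for the pointed-to kernel types. */
struct toy_rule;
struct toy_fib_info;
struct toy_table;
struct toy_alias_list;

/* Before: the result carries a pointer to the matched rule. */
struct toy_result_with_rule_ptr {
    unsigned char prefixlen, nh_sel, type, scope;
    struct toy_fib_info *fi;
    struct toy_table *table;
    struct toy_alias_list *fa_head;
    struct toy_rule *r;
};

/* After: only the 32-bit tclassid is kept. On 64-bit it fits into the
 * padding that follows the four byte-sized fields. */
struct toy_result_with_tclassid {
    unsigned char prefixlen, nh_sel, type, scope;
    uint32_t tclassid;
    struct toy_fib_info *fi;
    struct toy_table *table;
    struct toy_alias_list *fa_head;
};

/* Bytes saved by the second layout on the platform compiling this. */
static size_t toy_bytes_saved(void)
{
    return sizeof(struct toy_result_with_rule_ptr) -
           sizeof(struct toy_result_with_tclassid);
}
```

On a typical 64-bit ABI the pointer variant occupies 40 bytes and the tclassid variant 32; with 4-byte pointers both come to 20 bytes, which is the "wash" mentioned above.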

Signed-off-by: David S. Miller

David S. Miller
2012-07-13 23:21:29 +0800