Doug / smarc-fsl-linux-kernel | Embedian Git Server

23 Dec, 2011

1 commit

e688a6048 net: introduce DST_NOPEER dst flag

Chris Boot reported crashes occurring in ipv6_select_ident().

[ 461.457562] RIP: 0010:[] [] ipv6_select_ident+0x31/0xa7

[ 461.578229] Call Trace:
[ 461.580742]
[ 461.582870] [] ? udp6_ufo_fragment+0x124/0x1a2
[ 461.589054] [] ? ipv6_gso_segment+0xc0/0x155
[ 461.595140] [] ? skb_gso_segment+0x208/0x28b
[ 461.601198] [] ? ipv6_confirm+0x146/0x15e [nf_conntrack_ipv6]
[ 461.608786] [] ? nf_iterate+0x41/0x77
[ 461.614227] [] ? dev_hard_start_xmit+0x357/0x543
[ 461.620659] [] ? nf_hook_slow+0x73/0x111
[ 461.626440] [] ? br_parse_ip_options+0x19a/0x19a [bridge]
[ 461.633581] [] ? dev_queue_xmit+0x3af/0x459
[ 461.639577] [] ? br_dev_queue_push_xmit+0x72/0x76 [bridge]
[ 461.646887] [] ? br_nf_post_routing+0x17d/0x18f [bridge]
[ 461.653997] [] ? nf_iterate+0x41/0x77
[ 461.659473] [] ? br_flood+0xfa/0xfa [bridge]
[ 461.665485] [] ? nf_hook_slow+0x73/0x111
[ 461.671234] [] ? br_flood+0xfa/0xfa [bridge]
[ 461.677299] [] ? nf_bridge_update_protocol+0x20/0x20 [bridge]
[ 461.684891] [] ? nf_ct_zone+0xa/0x17 [nf_conntrack]
[ 461.691520] [] ? br_flood+0xfa/0xfa [bridge]
[ 461.697572] [] ? NF_HOOK.constprop.8+0x3c/0x56 [bridge]
[ 461.704616] [] ? nf_bridge_push_encap_header+0x1c/0x26 [bridge]
[ 461.712329] [] ? br_nf_forward_finish+0x8a/0x95 [bridge]
[ 461.719490] [] ? nf_bridge_pull_encap_header+0x1c/0x27 [bridge]
[ 461.727223] [] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge]
[ 461.734292] [] ? nf_iterate+0x41/0x77
[ 461.739758] [] ? __br_deliver+0xa0/0xa0 [bridge]
[ 461.746203] [] ? nf_hook_slow+0x73/0x111
[ 461.751950] [] ? __br_deliver+0xa0/0xa0 [bridge]
[ 461.758378] [] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]

This is caused by bridge netfilter special dst_entry (fake_rtable), a
special shared entry, where attaching an inetpeer makes no sense.

Problem is present since commit 87c48fa3b46 (ipv6: make fragment
identifications less predictable)

Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and
__ip_select_ident() fall back to the 'no peer attached' handling.

Reported-by: Chris Boot
Tested-by: Chris Boot
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-12-23 11:34:56 +0800
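
The fallback described in the DST_NOPEER commit above can be pictured with a small user-space sketch. Everything here (`toy_dst`, `toy_peer`, `TOY_DST_NOPEER`, `toy_select_ident()`, the flag's bit value) is a hypothetical stand-in rather than the kernel's definitions; it only illustrates how a flag on a shared entry can steer callers onto the 'no peer attached' path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit value; the real DST_NOPEER flag is defined by the kernel. */
#define TOY_DST_NOPEER 0x0040u

struct toy_peer {
    uint32_t next_id;       /* per-destination identification counter */
};

struct toy_dst {
    unsigned int flags;
    struct toy_peer *peer;  /* may be NULL when no peer is attached */
};

/* Shared counter used whenever no peer may be consulted. */
static uint32_t toy_fallback_id;

/* Pick a fragment identification for a packet routed through dst. */
static uint32_t toy_select_ident(struct toy_dst *dst)
{
    assert(dst != NULL);

    /* A NOPEER dst (for example a shared fake entry) never touches dst->peer. */
    if (!(dst->flags & TOY_DST_NOPEER) && dst->peer != NULL)
        return dst->peer->next_id++;

    /* 'No peer attached' handling. */
    return toy_fallback_id++;
}
```

With the flag set, the peer pointer is ignored even if it happens to be non-NULL, which is the property a shared entry such as fake_rtable would need.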

27 Nov, 2011

2 commits

618f9bc74 net: Move mtu handling down to the protocol depended handlers

We move all mtu handling from dst_mtu() down to the protocol
layer. So each protocol can implement the mtu handling in
a different manner.

Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller

Steffen Klassert
2011-11-27 03:29:51 +0800

ebb762f27 net: Rename the dst_opt default_mtu method to mtu

We plan to invoke the dst_opt->default_mtu() method unconditionally
from dst_mtu(). So rename the method to dst_opt->mtu() to match
the name with the new meaning.

Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller

Steffen Klassert
2011-11-27 03:29:50 +0800

18 Aug, 2011

1 commit

bdeab9919 rps: Add flag to skb to indicate rxhash is based on L4 tuple

The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4-tuple for the
packet which includes the port information in the encapsulated
transport packet. This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2011-08-18 11:06:03 +0800

03 Aug, 2011

1 commit

f2c31e32b net: fix NULL dereferences in check_peer_redir()

Gergely Kalman reported crashes in check_peer_redir().

It appears commit f39925dbde778 (ipv4: Cache learned redirect
information in inetpeer.) added a race, leading to possible NULL ptr
dereference.

Since we can now change dst neighbour, we should make sure a reader can
safely use a neighbour.

Add RCU protection to dst neighbour, and make sure check_peer_redir()
can be called safely by different cpus in parallel.

As neighbours are already freed after one RCU grace period, this patch
should not add typical RCU penalty (cache cold effects)

Many thanks to Gergely for providing a pretty report pointing to the
bug.

Reported-by: Gergely Kalman
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2011-08-03 18:34:12 +0800

18 Jul, 2011

2 commits

d3aaeb38c net: Add ->neigh_lookup() operation to dst_ops

In the future dst entries will be neigh-less. In that environment we
need to have an easy transition point for current users of
dst->neighbour outside of the packet output fast path.

Signed-off-by: David S. Miller

David S. Miller
2011-07-18 15:40:17 +0800

69cce1d14 net: Abstract dst->neighbour accesses behind helpers.

dst_{get,set}_neighbour()

Signed-off-by: David S. Miller

David S. Miller
2011-07-18 14:11:35 +0800

14 Jul, 2011

1 commit

f6b72b621 net: Embed hh_cache inside of struct neighbour.

Now that there is a one-to-one correspondence between neighbour
and hh_cache entries, we no longer need:

1) dynamic allocation
2) attachment to dst->hh
3) refcounting

Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.

Signed-off-by: David S. Miller

David S. Miller
2011-07-14 22:53:20 +0800

02 Jul, 2011

1 commit

957c665f3 ipv6: Don't put artificial limit on routing table size.

IPV6, unlike IPV4, doesn't have a routing cache.

Routing table entries, as well as clones made in response
to route lookup requests, all live in the same table. And
all of these things are together collected in the destination
cache table for ipv6.

This means that routing table entries count against the garbage
collection limits, even though such entries cannot ever be reclaimed
and are added explicitly by the administrator (rather than being
created in response to lookups).

Therefore it makes no sense to count ipv6 routing table entries
against the GC limits.

Add a DST_NOCOUNT destination cache entry flag, and skip the counting
if it is set. Use this flag bit in ipv6 when adding routing table
entries.

Signed-off-by: David S. Miller

David S. Miller
2011-07-02 08:30:43 +0800
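
To make the DST_NOCOUNT idea above concrete, here is a minimal user-space sketch. The names (`toy_dst_ops`, `toy_dst_alloc()`, `TOY_DST_NOCOUNT`) and the bit value are assumptions made for illustration, and the "limit" is simplified to a hard refusal, whereas the kernel runs garbage collection instead. The point shown is only that flagged entries neither count against the limit nor are blocked by it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bit value; the real DST_NOCOUNT flag is defined by the kernel. */
#define TOY_DST_NOCOUNT 0x0020u

struct toy_dst_ops {
    int entries;    /* number of counted entries currently allocated */
    int gc_thresh;  /* simplified limit on counted entries */
};

struct toy_dst {
    struct toy_dst_ops *ops;
    unsigned int flags;
};

static struct toy_dst *toy_dst_alloc(struct toy_dst_ops *ops, unsigned int flags)
{
    struct toy_dst *dst;

    assert(ops != NULL);

    /* Only counted entries are subject to the limit. */
    if (!(flags & TOY_DST_NOCOUNT) && ops->entries >= ops->gc_thresh)
        return NULL;

    dst = calloc(1, sizeof(*dst));
    if (dst == NULL)
        return NULL;

    dst->ops = ops;
    dst->flags = flags;
    if (!(flags & TOY_DST_NOCOUNT))
        ops->entries++;
    return dst;
}

static void toy_dst_free(struct toy_dst *dst)
{
    if (dst == NULL)
        return;
    /* Mirror the allocation path: uncounted entries were never added. */
    if (!(dst->flags & TOY_DST_NOCOUNT))
        dst->ops->entries--;
    free(dst);
}
```

In this sketch an administrator-added route would be allocated with `TOY_DST_NOCOUNT`, while clones created by lookups would be allocated with flags of 0 and remain subject to the limit.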

25 May, 2011

1 commit

1f37070d3 dst: catch uninitialized metrics

Catch cases where dst_metric_set() and other functions are called
but _metrics is NULL.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

Stephen Hemminger
2011-05-25 01:50:52 +0800

19 May, 2011

1 commit

6882f933c ipv4: Kill RT_CACHE_DEBUG

It's way past its usefulness. And this gets rid of a bunch
of stray ->rt_{dst,src} references.

Even the comment documenting the macro was inaccurate (stated
default was 1 when it's 0).

If reintroduced, it should be done properly, with dynamic debug
facilities.

Signed-off-by: David S. Miller

David S. Miller
2011-05-19 06:23:21 +0800

29 Apr, 2011

1 commit

5c1e6aa30 net: Make dst_alloc() take more explicit initializations.

Now the dst->dev, dst->obsolete, and dst->flags values can
be specified as well.

Signed-off-by: David S. Miller

David S. Miller
2011-04-29 13:25:59 +0800

25 Apr, 2011

1 commit

2a9e95070 net: Remove __KERNEL__ cpp checks from include/net

These header files are never installed for user consumption, so any
__KERNEL__ cpp checks are superfluous.

Projects should also not copy these files into their userland utility
sources and try to use them there. If they insist on doing so, the
onus is on them to sanitize the headers as needed.

Signed-off-by: David S. Miller

David S. Miller
2011-04-25 01:54:56 +0800

28 Mar, 2011

1 commit

e433430a0 dst: Clone child entry in skb_dst_pop

We clone the child entry in skb_dst_pop before we call
skb_dst_drop(). Otherwise we might kill the child right
before we return it to the caller.

Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller

Steffen Klassert
2011-03-28 08:55:01 +0800
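
The ordering issue in the skb_dst_pop commit above can be sketched in user space as follows. The types and helpers (`toy_dst`, `toy_skb`, `toy_dst_clone()`, `toy_dst_release()`, `toy_skb_dst_pop()`) are hypothetical, and "freeing" is modelled by setting a `freed` marker so the effect stays observable. The sketch assumes that a parent entry holds one reference on its child and drops it when the parent itself is destroyed.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dst {
    int refcnt;
    struct toy_dst *child;
    int freed;              /* set instead of really freeing memory */
};

struct toy_skb {
    struct toy_dst *dst;
};

/* Take an extra reference on d (NULL is allowed). */
static struct toy_dst *toy_dst_clone(struct toy_dst *d)
{
    if (d != NULL)
        d->refcnt++;
    return d;
}

/* Drop one reference; the last one destroys the entry and lets go of its child. */
static void toy_dst_release(struct toy_dst *d)
{
    if (d == NULL)
        return;
    if (--d->refcnt == 0) {
        d->freed = 1;
        toy_dst_release(d->child);
    }
}

/* Detach the skb's dst and hand its child to the caller with a reference. */
static struct toy_dst *toy_skb_dst_pop(struct toy_skb *skb)
{
    struct toy_dst *child;

    assert(skb != NULL && skb->dst != NULL);

    /* Clone first: releasing the parent may drop the only reference to child. */
    child = toy_dst_clone(skb->dst->child);
    toy_dst_release(skb->dst);
    skb->dst = NULL;
    return child;
}
```

If the clone and the release were swapped, releasing the parent could destroy the child before the caller ever received it, which is the failure mode the commit message describes.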

03 Mar, 2011

1 commit

452edd598 xfrm: Return dst directly from xfrm_lookup()

Instead of on the stack.

Signed-off-by: David S. Miller

David S. Miller
2011-03-03 05:27:41 +0800

02 Mar, 2011

2 commits

2774c131b xfrm: Handle blackhole route creation via afinfo.

That way we don't have to potentially do this in every xfrm_lookup()
caller.

Signed-off-by: David S. Miller

David S. Miller
2011-03-02 06:59:04 +0800

80c0bc9e3 xfrm: Kill XFRM_LOOKUP_WAIT flag.

This can be determined from the flow flags instead.

Signed-off-by: David S. Miller

David S. Miller
2011-03-02 06:36:37 +0800

23 Feb, 2011

1 commit

dee9f4bce net: Make flow cache paths use a const struct flowi.

Signed-off-by: David S. Miller

David S. Miller
2011-02-23 10:44:31 +0800

18 Feb, 2011

1 commit

3c7bd1a14 net: Add initial_ref arg to dst_alloc().

This allows avoiding multiple writes to the initial __refcnt.

The simplest cases of wanting an initial reference of "1"
in ipv4 and ipv6 have been converted; the rest have been left
alone and kept at the existing "0".

Signed-off-by: David S. Miller

David S. Miller
2011-02-18 07:44:00 +0800

09 Feb, 2011

1 commit

e7b66bdc0 net: Remove bogus barrier() in dst_allfrag().

I simply missed this one when modifying the other dst
metric interfaces earlier.

Signed-off-by: David S. Miller

David S. Miller
2011-02-09 07:33:22 +0800

05 Feb, 2011

1 commit

92d868292 inetpeer: Move ICMP rate limiting state into inet_peer entries.

Like metrics, the ICMP rate limiting bits are cached state about
a destination. So move it into the inet_peer entries.

If an inet_peer cannot be bound (the reason is memory allocation
failure or similar), the policy is to allow.

Signed-off-by: David S. Miller

David S. Miller
2011-02-05 07:59:53 +0800

29 Jan, 2011

1 commit

725d1e1b4 ipv4: Attach FIB info to dst_default_metrics when possible

If there are no explicit metrics attached to a route, hook
fi->fib_info up to dst_default_metrics.

Signed-off-by: David S. Miller

David S. Miller
2011-01-29 06:05:05 +0800

27 Jan, 2011

1 commit

62fa8a846 net: Implement read-only protection and COW'ing of metrics.

Routing metrics are now copy-on-write.

Initially a route entry points its metrics at a read-only location.
If a routing table entry exists, it will point there. Else it will
point at the all zero metric place-holder called 'dst_default_metrics'.

The writeability state of the metrics is stored in the low bits of the
metrics pointer; we have two bits left to spare if we want to store
more states.

For the initial implementation, COW is implemented simply via kmalloc.
However future enhancements will change this to place the writable
metrics somewhere else, in order to increase sharing. Very likely
this "somewhere else" will be the inetpeer cache.

Note also that this means that metrics updates may transiently fail
if we cannot COW the metrics successfully.

But even by itself, this patch should decrease memory usage and
increase cache locality especially for routing workloads. In those
cases the read-only metric copies stay in place and never get written
to.

TCP workloads where metrics get updated, and those rare cases where
PMTU triggers occur, will take a very slight performance hit. But
that hit will be alleviated when the long-term writable metrics
move to a more sharable location.

Since the metrics storage went from a u32 array of RTAX_MAX entries to
what is essentially a pointer, some retooling of the dst_entry layout
was necessary.

Most importantly, we need to preserve the alignment of the reference
count so that it doesn't share cache lines with the read-mostly state,
as per Eric Dumazet's alignment assertion checks.

The only non-trivial bit here is the move of the 'flags' member into
the writeable cacheline. This is OK since we are always accessing the
flags around the same moment when we made a modification to the
reference count.

Signed-off-by: David S. Miller

David S. Miller
2011-01-27 12:51:05 +0800
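
The copy-on-write scheme in the commit above (state kept in the low bits of the metrics pointer, first write allocating a private copy) can be sketched in user space. This is a sketch under stated assumptions, not the kernel's code: `TOY_RTAX_MAX`, the mask values and every `toy_*` name are hypothetical, and `malloc()` stands in for kmalloc. It relies on `uint32_t` arrays being at least 4-byte aligned, which leaves the two low pointer bits free.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_RTAX_MAX 4                          /* hypothetical metric count */
#define TOY_METRICS_READ_ONLY ((uintptr_t)0x1)  /* state bit in the pointer */
#define TOY_METRICS_FLAGS     ((uintptr_t)0x3)  /* low bits reserved for state */

/* All-zero placeholder shared by every entry without explicit metrics. */
static const uint32_t toy_default_metrics[TOY_RTAX_MAX] = { 0 };

struct toy_dst {
    uintptr_t _metrics;     /* pointer to metrics plus state bits */
};

/* Point the entry at a shared, read-only metrics array. */
static void toy_dst_init_metrics(struct toy_dst *dst, const uint32_t *src)
{
    dst->_metrics = (uintptr_t)src | TOY_METRICS_READ_ONLY;
}

static uint32_t *toy_metrics_ptr(const struct toy_dst *dst)
{
    return (uint32_t *)(dst->_metrics & ~TOY_METRICS_FLAGS);
}

static int toy_metrics_read_only(const struct toy_dst *dst)
{
    return (dst->_metrics & TOY_METRICS_READ_ONLY) != 0;
}

static uint32_t toy_dst_metric(const struct toy_dst *dst, int idx)
{
    assert(idx >= 0 && idx < TOY_RTAX_MAX);
    return toy_metrics_ptr(dst)[idx];
}

/* Returns 0 on success, -1 if the private copy could not be allocated. */
static int toy_dst_metric_set(struct toy_dst *dst, int idx, uint32_t val)
{
    assert(idx >= 0 && idx < TOY_RTAX_MAX);

    if (toy_metrics_read_only(dst)) {
        uint32_t *copy = malloc(sizeof(toy_default_metrics));

        if (copy == NULL)
            return -1;      /* the update fails transiently */
        memcpy(copy, toy_metrics_ptr(dst), sizeof(toy_default_metrics));
        dst->_metrics = (uintptr_t)copy;    /* low bits clear: writable */
    }
    toy_metrics_ptr(dst)[idx] = val;
    return 0;
}

/* Free a private copy, if any, and fall back to the shared placeholder. */
static void toy_dst_destroy_metrics(struct toy_dst *dst)
{
    if (!toy_metrics_read_only(dst))
        free(toy_metrics_ptr(dst));
    toy_dst_init_metrics(dst, toy_default_metrics);
}
```

Entries that are only read keep sharing one array, which is where the memory and cache-locality benefit mentioned in the commit message would come from; only entries that are written pay for an allocation.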

14 Jan, 2011

2 commits

0134e89c7 Merge branch 'master' of git://1984.lsi.us.es/net-next-2.6

Conflicts:
net/ipv4/route.c

Signed-off-by: Patrick McHardy

Patrick McHardy
2011-01-14 21:12:37 +0800

c7066f70d netfilter: fix Kconfig dependencies

Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
which itself depends on NET_SCHED; this dependency is missing from netfilter.

Since matching on realms is also useful without having NET_SCHED enabled and
the option really only controls whether the tclassid member is included in
route and dst entries, rename the config option to IP_ROUTE_CLASSID and move
it outside of traffic scheduling context to get rid of the NET_SCHED dependency.

Reported-by: Vladis Kletnieks
Signed-off-by: Patrick McHardy

Patrick McHardy
2011-01-14 20:36:42 +0800

15 Dec, 2010

1 commit

d33e45533 net: Abstract default MTU metric calculation behind an accessor.

Like RTAX_ADVMSS, make the default calculation go through a dst_ops
method rather than caching the computation in the routing cache
entries.

Now dst metrics are pretty much left as-is when new entries are
created, thus optimizing metric sharing becomes a real possibility.

Signed-off-by: David S. Miller

David S. Miller
2010-12-15 05:01:14 +0800

14 Dec, 2010

1 commit

0dbaee3b3 net: Abstract default ADVMSS behind an accessor.

Make all RTAX_ADVMSS metric accesses go through a new helper function,
dst_metric_advmss().

Leave the actual default metric as "zero" in the real metric slot,
and compute the actual default value dynamically via a new dst_ops
AF specific callback.

For stacked IPSEC routes, we use the advmss of the path which
preserves existing behavior.

Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
advmss on pmtu updates. This inconsistency in advmss handling
results in more raw metric accesses than I wish we ended up with.

Signed-off-by: David S. Miller

David S. Miller
2010-12-14 04:52:14 +0800
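
The accessor pattern in the ADVMSS commit above (a zero in the metric slot means "compute the default through an address-family callback") might look roughly like this in user space. All `toy_*` names are hypothetical, and the default rule used here (device MTU minus 40 bytes for a 20-byte IPv4 header plus a 20-byte TCP header, with no clamping) is a deliberate simplification rather than the kernel's exact calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_RTAX_ADVMSS = 0, TOY_RTAX_MAX = 1 };

struct toy_dst;

struct toy_dst_ops {
    /* Address-family specific default, used when the slot holds zero. */
    uint32_t (*default_advmss)(const struct toy_dst *dst);
};

struct toy_dst {
    const struct toy_dst_ops *ops;
    uint32_t metrics[TOY_RTAX_MAX];
    uint32_t dev_mtu;       /* stand-in for the output device's MTU */
};

/* Single entry point for reading the advertised MSS of a route. */
static uint32_t toy_dst_metric_advmss(const struct toy_dst *dst)
{
    uint32_t advmss = dst->metrics[TOY_RTAX_ADVMSS];

    assert(dst->ops != NULL && dst->ops->default_advmss != NULL);

    /* Zero means "no explicit value": compute the default on demand. */
    if (advmss == 0)
        advmss = dst->ops->default_advmss(dst);
    return advmss;
}

/* Simplified IPv4-style default: MTU minus IPv4 and TCP header sizes. */
static uint32_t toy_ipv4_default_advmss(const struct toy_dst *dst)
{
    return dst->dev_mtu > 40 ? dst->dev_mtu - 40 : 0;
}

static const struct toy_dst_ops toy_ipv4_ops = { toy_ipv4_default_advmss };
```

Because the slot itself stays zero, entries created from the same route can keep identical metric contents, which is what makes sharing them plausible.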

13 Dec, 2010

2 commits

323e126f0 ipv4: Don't pre-seed hoplimit metric.

Always go through a new ip4_dst_hoplimit() helper, just like ipv6.

This allowed several simplifications:

1) The interim dst_metric_hoplimit() can go as it's no longer
used.

2) The sysctl_ip_default_ttl entry no longer needs to use
ipv4_doint_and_flush, since the sysctl is not cached in
routing cache metrics any longer.

3) ipv4_doint_and_flush no longer needs to be exported and
therefore can be marked static.

When ipv4_doint_and_flush_strategy was removed some time ago,
the external declaration in ip.h was mistakenly left around
so kill that off too.

We have to move the sysctl_ip_default_ttl declaration into
ipv4's route cache definition header net/route.h, because
currently net/ip.h (where the declaration lives now) has
a back dependency on net/route.h

Signed-off-by: David S. Miller

David S. Miller
2010-12-13 14:08:17 +0800

5170ae824 net: Abstract RTAX_HOPLIMIT metric accesses behind helper.

Signed-off-by: David S. Miller

David S. Miller
2010-12-13 13:35:57 +0800

10 Dec, 2010

1 commit

defb3519a net: Abstract away all dst_entry metrics accesses.

Use helper functions to hide all direct accesses, especially writes,
to dst_entry metrics values.

This will allow us to:

1) More easily change how the metrics are stored.

2) Implement COW for metrics.

In particular this will help us put metrics into the inetpeer
cache if that is what we end up doing. We can make the _metrics
member a pointer instead of an array, initially have it point
at the read-only metrics in the FIB, and then on the first set
grab an inetpeer entry and point the _metrics member there.

Signed-off-by: David S. Miller
Acked-by: Eric Dumazet

David S. Miller
2010-12-10 02:46:36 +0800

09 Nov, 2010

1 commit

fc766e4c4 decnet: RCU conversion and get rid of dev_base_lock

While tracking dev_base_lock users, I found decnet used it in
dnet_select_source(), but for a wrong purpose:

Writers only hold RTNL, not dev_base_lock, so readers must use RCU if
they cannot use RTNL.

Add an rcu_head in struct dn_ifaddr and handle proper RCU management.

Add __rcu annotation in dn_route as well.

Signed-off-by: Eric Dumazet
Acked-by: Steven Whitehouse
Signed-off-by: David S. Miller

Eric Dumazet
2010-11-09 05:50:08 +0800

28 Oct, 2010

1 commit

1c31720a7 ipv4: add __rcu annotations to routes.c

Add __rcu annotations to:
(struct dst_entry)->rt_next
(struct rt_hash_bucket)->chain

And use appropriate rcu primitives to reduce sparse warnings if
CONFIG_SPARSE_RCU_POINTER=y

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-10-28 02:37:31 +0800

04 Oct, 2010

1 commit

c7d4426a9 net: introduce DST_NOCACHE flag

While doing stress tests with IP route cache disabled, and multi queue
devices, I noticed a very high contention on one rwlock used in
neighbour code.

When many cpus are trying to send frames (possibly using a high
performance multiqueue device) to the same neighbour, they fight for the
neigh->lock rwlock in order to call neigh_hh_init(), and fight on
hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test())

But we don't need to call neigh_hh_init() for dst that are used only
once. It costs four atomic operations at least, on two contended cache
lines, plus the high contention on neigh->lock rwlock.

Introduce a new dst flag, DST_NOCACHE, that is set when dst was not
inserted in route cache.

With the stress test bench, sending 160000000 frames on one neighbour,
results are:

Before patch:

real 2m28.406s
user 0m11.781s
sys 36m17.964s

After patch:

real 1m26.532s
user 0m12.185s
sys 20m3.903s

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-10-04 13:17:54 +0800

28 Sep, 2010

1 commit

290b895e0 tunnels: prepare percpu accounting

Tunnels are going to use percpu for their accounting.

They are going to use a new tstats field in net_device.

skb_tunnel_rx() is changed to be a wrapper around __skb_tunnel_rx()

IPTUNNEL_XMIT() is changed to be a wrapper around __IPTUNNEL_XMIT()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-09-28 12:30:42 +0800

27 Sep, 2010

1 commit

693019e90 net: reset skb queue mapping when rx'ing over tunnel

Reset queue mapping when an skb is reentering the stack via a tunnel.
On second pass, the queue mapping from the original device is no
longer valid.

Signed-off-by: Tom Herbert
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Tom Herbert
2010-09-27 09:48:40 +0800

05 Jun, 2010

1 commit

8764ab2ca net: check for refcount if pop a stacked dst_entry

xfrm triggers a warning if dst_pop() drops a refcount
on a noref dst. This patch changes dst_pop() to
skb_dst_pop(). skb_dst_pop() drops the refcnt only
on a refcounted dst. Also we don't clone the child
dst_entry, so it is not refcounted and we can use
skb_dst_set_noref() in xfrm_output_one().

Signed-off-by: Steffen Klassert
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Steffen Klassert
2010-06-05 06:56:00 +0800

18 May, 2010

2 commits

d19d56ddc net: Introduce skb_tunnel_rx() helper

skb rxhash should be cleared when a skb is handled by a tunnel before
being delivered again, so that correct packet steering can take place.

There are other cleanups and accounting that we can factorize in a new
helper, skb_tunnel_rx()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-05-18 13:36:55 +0800

7fee226ad net: add a noref bit on skb dst

Use low order bit of skb->_skb_dst to tell dst is not refcounted.

Change _skb_dst to _skb_refdst to make sure all uses are caught.

skb_dst() returns the dst, regardless of noref bit set or not, but
with a lockdep check to make sure a noref dst is not given if current
user is not rcu protected.

New skb_dst_set_noref() helper to set a non-refcounted dst on a skb.
(with lockdep check)

skb_dst_drop() drops a reference only if skb dst was refcounted.

skb_dst_force() helper is used to force a refcount on dst, when skb
is queued and no longer RCU protected.

Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
!IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
sock_queue_rcv_skb(), in __nf_queue().

Use skb_dst_force() in dev_requeue_skb().

Note: dst_use_noref() still dirties dst, we might transform it
later to do one dirtying per jiffies.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-05-18 08:18:50 +0800
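
The noref-bit scheme from the commit above can be illustrated with a user-space sketch. The `toy_*` names and the mask constant are hypothetical, the lockdep/RCU checks mentioned in the commit message are left out entirely, and the sketch assumes `struct toy_dst` is at least 2-byte aligned so the low pointer bit is free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SKB_DST_NOREF ((uintptr_t)0x1)  /* low bit: dst is not refcounted */

struct toy_dst {
    int refcnt;
};

struct toy_skb {
    uintptr_t _skb_refdst;  /* dst pointer plus the noref bit */
};

/* Return the dst regardless of whether the noref bit is set. */
static struct toy_dst *toy_skb_dst(const struct toy_skb *skb)
{
    return (struct toy_dst *)(skb->_skb_refdst & ~TOY_SKB_DST_NOREF);
}

static int toy_skb_dst_is_noref(const struct toy_skb *skb)
{
    return (skb->_skb_refdst & TOY_SKB_DST_NOREF) != 0 &&
           toy_skb_dst(skb) != NULL;
}

/* Attach dst without taking a reference. */
static void toy_skb_dst_set_noref(struct toy_skb *skb, struct toy_dst *dst)
{
    assert(((uintptr_t)dst & TOY_SKB_DST_NOREF) == 0);
    skb->_skb_refdst = (uintptr_t)dst | TOY_SKB_DST_NOREF;
}

/* Convert a borrowed dst into a refcounted one, e.g. before queueing. */
static void toy_skb_dst_force(struct toy_skb *skb)
{
    if (toy_skb_dst_is_noref(skb)) {
        toy_skb_dst(skb)->refcnt++;
        skb->_skb_refdst &= ~TOY_SKB_DST_NOREF;
    }
}

/* Detach the dst, dropping a reference only if one was held. */
static void toy_skb_dst_drop(struct toy_skb *skb)
{
    struct toy_dst *dst = toy_skb_dst(skb);

    if (dst != NULL && !toy_skb_dst_is_noref(skb))
        dst->refcnt--;
    skb->_skb_refdst = 0;
}
```

The fast path thus never touches the reference count; only the "force" step, needed when the skb outlives its protected section, pays for the atomic-style update.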

13 Apr, 2010

1 commit

b6c6712a4 net: sk_dst_cache RCUification

With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
work.

sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

This rwlock is readlocked for a very small amount of time, and dst
entries are already freed after RCU grace period. This calls for RCU
again :)

This patch converts sk_dst_lock to a spinlock, and uses RCU for readers.

__sk_dst_get() is supposed to be called with rcu_read_lock() held or
with the socket locked by the user, so use the appropriate
rcu_dereference_check() condition (rcu_read_lock_held() || sock_owned_by_user(sk))

This patch avoids two atomic ops per tx packet on UDP connected sockets,
for example, and permits sk_dst_lock to be much less dirtied.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-04-13 16:41:33 +0800

24 Dec, 2009

1 commit

31d12926e net: Add rtnetlink init_rcvwnd to set the TCP initial receive window

Add rtnetlink init_rcvwnd to set the TCP initial receive window size
advertised by passive and active TCP connections.

The current Linux TCP implementation limits the advertised TCP initial
receive window to the one prescribed by slow start. For short-lived
TCP connections used for transaction-type traffic (e.g. HTTP
requests), bounding the advertised TCP initial receive window results
in increased latency to complete the transaction.

Setting the initial congestion window is already supported
using rtnetlink init_cwnd, but the feature is useless without the
ability to set a larger TCP initial receive window.

The rtnetlink init_rcvwnd allows increasing the TCP initial receive
window, allowing TCP connections to advertise a larger TCP receive window
than the one bounded by slow start.

Signed-off-by: Laurent Chavey
Signed-off-by: David S. Miller

Laurent Chavey
2009-12-24 06:13:30 +0800