23 Dec, 2011
1 commit
-
Chris Boot reported crashes occurring in ipv6_select_ident().
[ 461.457562] RIP: 0010:[] []
ipv6_select_ident+0x31/0xa7
[ 461.578229] Call Trace:
[ 461.580742]
[ 461.582870] [] ? udp6_ufo_fragment+0x124/0x1a2
[ 461.589054] [] ? ipv6_gso_segment+0xc0/0x155
[ 461.595140] [] ? skb_gso_segment+0x208/0x28b
[ 461.601198] [] ? ipv6_confirm+0x146/0x15e
[nf_conntrack_ipv6]
[ 461.608786] [] ? nf_iterate+0x41/0x77
[ 461.614227] [] ? dev_hard_start_xmit+0x357/0x543
[ 461.620659] [] ? nf_hook_slow+0x73/0x111
[ 461.626440] [] ? br_parse_ip_options+0x19a/0x19a
[bridge]
[ 461.633581] [] ? dev_queue_xmit+0x3af/0x459
[ 461.639577] [] ? br_dev_queue_push_xmit+0x72/0x76
[bridge]
[ 461.646887] [] ? br_nf_post_routing+0x17d/0x18f
[bridge]
[ 461.653997] [] ? nf_iterate+0x41/0x77
[ 461.659473] [] ? br_flood+0xfa/0xfa [bridge]
[ 461.665485] [] ? nf_hook_slow+0x73/0x111
[ 461.671234] [] ? br_flood+0xfa/0xfa [bridge]
[ 461.677299] [] ?
nf_bridge_update_protocol+0x20/0x20 [bridge]
[ 461.684891] [] ? nf_ct_zone+0xa/0x17 [nf_conntrack]
[ 461.691520] [] ? br_flood+0xfa/0xfa [bridge]
[ 461.697572] [] ? NF_HOOK.constprop.8+0x3c/0x56
[bridge]
[ 461.704616] [] ?
nf_bridge_push_encap_header+0x1c/0x26 [bridge]
[ 461.712329] [] ? br_nf_forward_finish+0x8a/0x95
[bridge]
[ 461.719490] [] ?
nf_bridge_pull_encap_header+0x1c/0x27 [bridge]
[ 461.727223] [] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge]
[ 461.734292] [] ? nf_iterate+0x41/0x77
[ 461.739758] [] ? __br_deliver+0xa0/0xa0 [bridge]
[ 461.746203] [] ? nf_hook_slow+0x73/0x111
[ 461.751950] [] ? __br_deliver+0xa0/0xa0 [bridge]
[ 461.758378] [] ? NF_HOOK.constprop.4+0x56/0x56
[bridge]

This is caused by bridge netfilter special dst_entry (fake_rtable), a
special shared entry, where attaching an inetpeer makes no sense.

Problem is present since commit 87c48fa3b46 (ipv6: make fragment
identifications less predictable)

Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and
__ip_select_ident() fall back to the 'no peer attached' handling.
Reported-by: Chris Boot
Tested-by: Chris Boot
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
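The fallback described in this commit can be modelled in user space roughly as follows. This is a minimal sketch and not the kernel code: `struct dst_model`, `struct peer_model`, `select_ident()`, the flag value and the shared fallback counter are all hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit value; the real DST_NOPEER flag lives in the kernel headers. */
#define DST_NOPEER_MODEL 0x0001u

struct peer_model {
    uint32_t next_id;          /* per-destination identification counter */
};

struct dst_model {
    unsigned int flags;
    struct peer_model *peer;   /* may be NULL when no peer is attached */
};

/* Shared counter used by the "no peer attached" path in this sketch. */
static uint32_t fallback_id = 1;

/* Pick a fragment identification; never touch a peer on NOPEER entries. */
static uint32_t select_ident(struct dst_model *dst)
{
    assert(dst != NULL);
    if (!(dst->flags & DST_NOPEER_MODEL) && dst->peer != NULL)
        return dst->peer->next_id++;
    return fallback_id++;
}
```

An entry flagged with the no-peer bit always takes the second branch, even if a peer pointer happens to be set, which is the property a shared entry such as fake_rtable needs.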
27 Nov, 2011
2 commits
-
We move all mtu handling from dst_mtu() down to the protocol
layer. So each protocol can implement the mtu handling in
a different manner.
Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller
-
We plan to invoke the dst_ops->default_mtu() method unconditionally
from dst_mtu(). So rename the method to dst_ops->mtu() to match
the name with the new meaning.
Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller
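The two commits above move the MTU decision behind a per-protocol method. A rough user-space illustration of that shape follows; the struct layouts, `proto_mtu()` and the "cached value, else device MTU" policy are assumptions made for this sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct dst_model;

/* Hypothetical per-protocol operations table with a single mtu() hook. */
struct dst_ops_model {
    unsigned int (*mtu)(const struct dst_model *dst);
};

struct dst_model {
    const struct dst_ops_model *ops;
    unsigned int cached_mtu;   /* 0 means "nothing cached" in this sketch */
    unsigned int dev_mtu;
};

/* One possible protocol policy: prefer a cached value, else the device MTU. */
static unsigned int proto_mtu(const struct dst_model *dst)
{
    return dst->cached_mtu ? dst->cached_mtu : dst->dev_mtu;
}

/* The generic accessor decides nothing itself; it always calls the hook. */
static unsigned int dst_mtu_model(const struct dst_model *dst)
{
    assert(dst != NULL && dst->ops != NULL && dst->ops->mtu != NULL);
    return dst->ops->mtu(dst);
}
```

Because the generic accessor calls the hook unconditionally, each protocol is free to implement a different policy without the generic layer knowing about it.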
18 Aug, 2011
1 commit
-
The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4 tuple for the
packet which includes the port information in the encapsulated
transport packet. This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
03 Aug, 2011
1 commit
-
Gergely Kalman reported crashes in check_peer_redir().
It appears commit f39925dbde778 (ipv4: Cache learned redirect
information in inetpeer.) added a race, leading to possible NULL ptr
dereference.

Since we can now change dst neighbour, we should make sure a reader can
safely use a neighbour.

Add RCU protection to dst neighbour, and make sure check_peer_redir()
can be called safely by different cpus in parallel.

As neighbours are already freed after one RCU grace period, this patch
should not add typical RCU penalty (cache cold effects)

Many thanks to Gergely for providing a pretty report pointing to the
bug.
Reported-by: Gergely Kalman
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
18 Jul, 2011
2 commits
-
In the future dst entries will be neigh-less. In that environment we
need to have an easy transition point for current users of
dst->neighbour outside of the packet output fast path.
Signed-off-by: David S. Miller
-
dst_{get,set}_neighbour()
Signed-off-by: David S. Miller
14 Jul, 2011
1 commit
-
Now that there is a one-to-one correspondence between neighbour
and hh_cache entries, we no longer need:

1) dynamic allocation
2) attachment to dst->hh
3) refcounting

Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.
Signed-off-by: David S. Miller
02 Jul, 2011
1 commit
-
IPV6, unlike IPV4, doesn't have a routing cache.
Routing table entries, as well as clones made in response
to route lookup requests, all live in the same table. And
all of these things are together collected in the destination
cache table for ipv6.

This means that routing table entries count against the garbage
collection limits, even though such entries cannot ever be reclaimed
and are added explicitly by the administrator (rather than being
created in response to lookups).

Therefore it makes no sense to count ipv6 routing table entries
against the GC limits.

Add a DST_NOCOUNT destination cache entry flag, and skip the counting
if it is set. Use this flag bit in ipv6 when adding routing table
entries.
Signed-off-by: David S. Miller
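The counting rule can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `dst_alloc_model()`, `dst_free_model()`, the `gc_entries` counter and the flag value are hypothetical names invented here, and the real kernel accounting is more involved.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bit value standing in for DST_NOCOUNT. */
#define DST_NOCOUNT_MODEL 0x0001u

struct dst_model {
    unsigned int flags;
};

/* Number of entries currently counted against the GC limit. */
static int gc_entries;

/* Allocate an entry; skip the accounting when the no-count bit is set. */
static struct dst_model *dst_alloc_model(unsigned int flags)
{
    struct dst_model *dst = malloc(sizeof(*dst));

    if (dst == NULL)
        return NULL;
    dst->flags = flags;
    if (!(flags & DST_NOCOUNT_MODEL))
        gc_entries++;
    return dst;
}

/* Free an entry; only counted entries are subtracted again. */
static void dst_free_model(struct dst_model *dst)
{
    if (dst == NULL)
        return;
    if (!(dst->flags & DST_NOCOUNT_MODEL)) {
        assert(gc_entries > 0);
        gc_entries--;
    }
    free(dst);
}
```

Administrator-added routing table entries would be allocated with the flag set, so only clones created by lookups push the counter toward the GC threshold.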
25 May, 2011
1 commit
-
Catch cases where dst_metric_set() and other functions are called
but _metrics is NULL.
Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller
19 May, 2011
1 commit
-
It's way past its usefulness. And this gets rid of a bunch
of stray ->rt_{dst,src} references.

Even the comment documenting the macro was inaccurate (stated
default was 1 when it's 0).

If reintroduced, it should be done properly, with dynamic debug
facilities.
Signed-off-by: David S. Miller
29 Apr, 2011
1 commit
-
Now the dst->dev, dst->obsolete, and dst->flags values can
be specified as well.
Signed-off-by: David S. Miller
25 Apr, 2011
1 commit
-
These header files are never installed for user consumption, so any
__KERNEL__ cpp checks are superfluous.

Projects should also not copy these files into their userland utility
sources and try to use them there. If they insist on doing so, the
onus is on them to sanitize the headers as needed.
Signed-off-by: David S. Miller
28 Mar, 2011
1 commit
-
We clone the child entry in skb_dst_pop before we call
skb_dst_drop(). Otherwise we might kill the child right
before we return it to the caller.
Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller
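The ordering problem fixed here (take a reference on the child before dropping the parent) can be shown with a toy reference-counting model. The types and the `_model` helpers below are hypothetical; the sketch assumes that destroying a parent entry releases the reference it holds on its child.

```c
#include <assert.h>
#include <stddef.h>

struct dst_model {
    int refcnt;
    struct dst_model *child;
    int freed;                 /* set once the last reference is gone */
};

struct skb_model {
    struct dst_model *dst;
};

static struct dst_model *dst_clone_model(struct dst_model *dst)
{
    if (dst != NULL)
        dst->refcnt++;
    return dst;
}

/* Drop one reference; destroying an entry also releases its child. */
static void dst_release_model(struct dst_model *dst)
{
    if (dst != NULL && --dst->refcnt == 0) {
        dst->freed = 1;
        dst_release_model(dst->child);
    }
}

/* Detach the current dst from the skb and return a referenced child. */
static struct dst_model *skb_dst_pop_model(struct skb_model *skb)
{
    struct dst_model *child;

    assert(skb != NULL && skb->dst != NULL);
    child = dst_clone_model(skb->dst->child);   /* clone first ... */
    dst_release_model(skb->dst);                /* ... then drop the parent */
    skb->dst = NULL;
    return child;
}
```

If the two statements in `skb_dst_pop_model()` were swapped, dropping the last reference on the parent could destroy the child before the caller ever received it.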
03 Mar, 2011
1 commit
-
Instead of on the stack.
Signed-off-by: David S. Miller
02 Mar, 2011
2 commits
-
That way we don't have to potentially do this in every xfrm_lookup()
caller.
Signed-off-by: David S. Miller
-
This can be determined from the flow flags instead.
Signed-off-by: David S. Miller
23 Feb, 2011
1 commit
-
Signed-off-by: David S. Miller
18 Feb, 2011
1 commit
-
This allows avoiding multiple writes to the initial __refcnt.
The simplest cases of wanting an initial reference of "1"
in ipv4 and ipv6 have been converted, the rest have been left
alone and kept at the existing "0".
Signed-off-by: David S. Miller
09 Feb, 2011
1 commit
-
I simply missed this one when modifying the other dst
metric interfaces earlier.
Signed-off-by: David S. Miller
05 Feb, 2011
1 commit
-
Like metrics, the ICMP rate limiting bits are cached state about
a destination. So move it into the inet_peer entries.

If an inet_peer cannot be bound (the reason is memory allocation
failure or similar), the policy is to allow.
Signed-off-by: David S. Miller
29 Jan, 2011
1 commit
-
If there are no explicit metrics attached to a route, hook
fi->fib_info up to dst_default_metrics.
Signed-off-by: David S. Miller
27 Jan, 2011
1 commit
-
Routing metrics are now copy-on-write.

Initially a route entry points its metrics at a read-only location.
If a routing table entry exists, it will point there. Else it will
point at the all zero metric place-holder called 'dst_default_metrics'.

The writeability state of the metrics is stored in the low bits of the
metrics pointer; we have two bits left to spare if we want to store
more states.

For the initial implementation, COW is implemented simply via kmalloc.
However future enhancements will change this to place the writable
metrics somewhere else, in order to increase sharing. Very likely
this "somewhere else" will be the inetpeer cache.

Note also that this means that metrics updates may transiently fail
if we cannot COW the metrics successfully.

But even by itself, this patch should decrease memory usage and
increase cache locality especially for routing workloads. In those
cases the read-only metric copies stay in place and never get written
to.

TCP workloads where metrics get updated, and those rare cases where
PMTU triggers occur, will take a very slight performance hit. But
that hit will be alleviated when the long-term writable metrics
move to a more sharable location.

Since the metrics storage went from a u32 array of RTAX_MAX entries to
what is essentially a pointer, some retooling of the dst_entry layout
was necessary.

Most importantly, we need to preserve the alignment of the reference
count so that it doesn't share cache lines with the read-mostly state,
as per Eric Dumazet's alignment assertion checks.

The only non-trivial bit here is the move of the 'flags' member into
the writeable cacheline. This is OK since we are always accessing the
flags around the same moment when we made a modification to the
reference count.
Signed-off-by: David S. Miller
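The copy-on-write scheme with a state bit kept in the low bits of the pointer can be sketched in user space as below. This is a simplified model, not the kernel implementation: `METRICS_MAX`, `METRICS_READONLY`, the `_model` helpers and the use of malloc() in place of kmalloc are assumptions made for illustration, and only one state bit is modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define METRICS_MAX      4               /* stand-in for RTAX_MAX */
#define METRICS_READONLY ((uintptr_t)1)  /* state bit kept in the pointer */

struct dst_model {
    uintptr_t metrics;   /* address of uint32_t[METRICS_MAX] plus state bit */
};

/* Shared all-zero placeholder, playing the role of 'dst_default_metrics'. */
static const uint32_t default_metrics[METRICS_MAX];

/* Strip the state bit to recover the array address. */
static uint32_t *metrics_ptr(const struct dst_model *dst)
{
    return (uint32_t *)(dst->metrics & ~METRICS_READONLY);
}

/* Point the entry at shared metrics and mark them read-only. */
static void dst_init_metrics_model(struct dst_model *dst, const uint32_t *shared)
{
    dst->metrics = (uintptr_t)shared | METRICS_READONLY;
}

static uint32_t dst_metric_get_model(const struct dst_model *dst, int idx)
{
    return metrics_ptr(dst)[idx];
}

/* Returns 0 on success, -1 if the private copy could not be allocated. */
static int dst_metric_set_model(struct dst_model *dst, int idx, uint32_t val)
{
    assert(idx >= 0 && idx < METRICS_MAX);
    if (dst->metrics & METRICS_READONLY) {
        uint32_t *copy = malloc(sizeof(uint32_t) * METRICS_MAX);

        if (copy == NULL)
            return -1;               /* the update transiently fails */
        memcpy(copy, metrics_ptr(dst), sizeof(uint32_t) * METRICS_MAX);
        dst->metrics = (uintptr_t)copy;   /* writable: state bit clear */
    }
    metrics_ptr(dst)[idx] = val;
    return 0;
}
```

Entries that are only read keep sharing the same array; the first write to an entry pays for a private copy, and a failed allocation leaves the shared metrics untouched. The trick relies on the arrays being at least 2-byte aligned so that the low bit of the address is free.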
14 Jan, 2011
2 commits
-
Conflicts:
net/ipv4/route.c
Signed-off-by: Patrick McHardy
-
Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
which itself depends on NET_SCHED; this dependency is missing from netfilter.

Since matching on realms is also useful without having NET_SCHED enabled and
the option really only controls whether the tclassid member is included in
route and dst entries, rename the config option to IP_ROUTE_CLASSID and move
it outside of traffic scheduling context to get rid of the NET_SCHED dependency.
Reported-by: Vladis Kletnieks
Signed-off-by: Patrick McHardy
15 Dec, 2010
1 commit
-
Like RTAX_ADVMSS, make the default calculation go through a dst_ops
method rather than caching the computation in the routing cache
entries.

Now dst metrics are pretty much left as-is when new entries are
created, thus optimizing metric sharing becomes a real possibility.
Signed-off-by: David S. Miller
14 Dec, 2010
1 commit
-
Make all RTAX_ADVMSS metric accesses go through a new helper function,
dst_metric_advmss().

Leave the actual default metric as "zero" in the real metric slot,
and compute the actual default value dynamically via a new dst_ops
AF specific callback.

For stacked IPSEC routes, we use the advmss of the path which
preserves existing behavior.

Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
advmss on pmtu updates. This inconsistency in advmss handling
results in more raw metric accesses than I wish we ended up with.
Signed-off-by: David S. Miller
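The "zero in the slot means compute the default" convention can be sketched as follows. The struct layouts and `example_default_advmss()` are hypothetical, and the "MTU minus 40 bytes" rule is only an assumed example of what an address-family callback might compute.

```c
#include <assert.h>
#include <stddef.h>

struct dst_model;

/* Hypothetical operations table with an address-family specific callback. */
struct dst_ops_model {
    unsigned int (*default_advmss)(const struct dst_model *dst);
};

struct dst_model {
    const struct dst_ops_model *ops;
    unsigned int advmss;   /* raw metric slot; 0 means "use the default" */
    unsigned int mtu;
};

/* Assumed example policy: MTU minus 40 bytes of headers. */
static unsigned int example_default_advmss(const struct dst_model *dst)
{
    return dst->mtu > 40 ? dst->mtu - 40 : 0;
}

/* Accessor: an explicit metric wins, otherwise ask the callback. */
static unsigned int dst_metric_advmss_model(const struct dst_model *dst)
{
    assert(dst != NULL && dst->ops != NULL);
    if (dst->advmss != 0)
        return dst->advmss;
    return dst->ops->default_advmss(dst);
}
```

Since the default is never written into the slot, entries without an explicit advmss keep identical metric contents, which is what makes sharing metrics between entries practical.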
13 Dec, 2010
2 commits
-
Always go through a new ip4_dst_hoplimit() helper, just like ipv6.
This allowed several simplifications:

1) The interim dst_metric_hoplimit() can go as it's no longer
used.

2) The sysctl_ip_default_ttl entry no longer needs to use
ipv4_doint_and_flush, since the sysctl is not cached in
routing cache metrics any longer.

3) ipv4_doint_and_flush no longer needs to be exported and
therefore can be marked static.

When ipv4_doint_and_flush_strategy was removed some time ago,
the external declaration in ip.h was mistakenly left around
so kill that off too.

We have to move the sysctl_ip_default_ttl declaration into
ipv4's route cache definition header net/route.h, because
currently net/ip.h (where the declaration lives now) has
a back dependency on net/route.h
Signed-off-by: David S. Miller
-
Signed-off-by: David S. Miller
10 Dec, 2010
1 commit
-
Use helper functions to hide all direct accesses, especially writes,
to dst_entry metrics values.

This will allow us to:

1) More easily change how the metrics are stored.
2) Implement COW for metrics.

In particular this will help us put metrics into the inetpeer
cache if that is what we end up doing. We can make the _metrics
member a pointer instead of an array, initially have it point
at the read-only metrics in the FIB, and then on the first set
grab an inetpeer entry and point the _metrics member there.
Signed-off-by: David S. Miller
Acked-by: Eric Dumazet
09 Nov, 2010
1 commit
-
While tracking dev_base_lock users, I found decnet used it in
dnet_select_source(), but for a wrong purpose:

Writers only hold RTNL, not dev_base_lock, so readers must use RCU if
they cannot use RTNL.

Add an rcu_head in struct dn_ifaddr and handle proper RCU management.
Add __rcu annotation in dn_route as well.
Signed-off-by: Eric Dumazet
Acked-by: Steven Whitehouse
Signed-off-by: David S. Miller
28 Oct, 2010
1 commit
-
Add __rcu annotations to:
(struct dst_entry)->rt_next
(struct rt_hash_bucket)->chain

And use appropriate rcu primitives to reduce sparse warnings if
CONFIG_SPARSE_RCU_POINTER=y
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
04 Oct, 2010
1 commit
-
While doing stress tests with IP route cache disabled, and multi queue
devices, I noticed a very high contention on one rwlock used in
neighbour code.

When many cpus are trying to send frames (possibly using a high
performance multiqueue device) to the same neighbour, they fight for the
neigh->lock rwlock in order to call neigh_hh_init(), and fight on
hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test())

But we don't need to call neigh_hh_init() for dst that are used only
once. It costs four atomic operations at least, on two contended cache
lines, plus the high contention on neigh->lock rwlock.

Introduce a new dst flag, DST_NOCACHE, that is set when dst was not
inserted in route cache.

With the stress test bench, sending 160000000 frames on one neighbour,
results are:

Before patch:
real 2m28.406s
user 0m11.781s
sys 36m17.964s

After patch:
real 1m26.532s
user 0m12.185s
sys 20m3.903s
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
28 Sep, 2010
1 commit
-
Tunnels are going to use percpu for their accounting.
They are going to use a new tstats field in net_device.
skb_tunnel_rx() is changed to be a wrapper around __skb_tunnel_rx()
IPTUNNEL_XMIT() is changed to be a wrapper around __IPTUNNEL_XMIT()
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
27 Sep, 2010
1 commit
-
Reset queue mapping when an skb is reentering the stack via a tunnel.
On second pass, the queue mapping from the original device is no
longer valid.
Signed-off-by: Tom Herbert
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
05 Jun, 2010
1 commit
-
xfrm triggers a warning if dst_pop() drops a refcount
on a noref dst. This patch changes dst_pop() to
skb_dst_pop(). skb_dst_pop() drops the refcnt only
on a refcounted dst. Also we don't clone the child
dst_entry, so it is not refcounted and we can use
skb_dst_set_noref() in xfrm_output_one().
Signed-off-by: Steffen Klassert
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
18 May, 2010
2 commits
-
skb rxhash should be cleared when a skb is handled by a tunnel before
being delivered again, so that correct packet steering can take place.

There are other cleanups and accounting that we can factorize in a new
helper, skb_tunnel_rx()
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Use low order bit of skb->_skb_dst to tell dst is not refcounted.
Change _skb_dst to _skb_refdst to make sure all uses are caught.
skb_dst() returns the dst, regardless of noref bit set or not, but
with a lockdep check to make sure a noref dst is not given if current
user is not rcu protected.

New skb_dst_set_noref() helper to set a non-refcounted dst on a skb.
(with lockdep check)

skb_dst_drop() drops a reference only if skb dst was refcounted.
skb_dst_force() helper is used to force a refcount on dst, when skb
is queued and not anymore RCU protected.

Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
!IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
sock_queue_rcv_skb(), in __nf_queue().

Use skb_dst_force() in dev_requeue_skb().
Note: dst_use_noref() still dirties dst, we might transform it
later to do one dirtying per jiffies.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
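The noref scheme above stores a "not refcounted" marker in the low-order bit of the dst pointer held by the skb. A user-space approximation is sketched below; the `_model` names are hypothetical, the lockdep/RCU checks mentioned in the commit are omitted, and a plain int stands in for the atomic reference count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKB_DST_NOREF_MODEL ((uintptr_t)1)   /* low-order tag bit */

struct dst_model {
    int refcnt;
};

struct skb_model {
    uintptr_t refdst;    /* dst address, possibly tagged with the noref bit */
};

/* Return the dst regardless of whether the noref bit is set. */
static struct dst_model *skb_dst_model(const struct skb_model *skb)
{
    return (struct dst_model *)(skb->refdst & ~SKB_DST_NOREF_MODEL);
}

/* Attach a dst without taking a reference. */
static void skb_dst_set_noref_model(struct skb_model *skb, struct dst_model *dst)
{
    /* The tag only works if the low bit of the address is free. */
    assert(dst != NULL && ((uintptr_t)dst & SKB_DST_NOREF_MODEL) == 0);
    skb->refdst = (uintptr_t)dst | SKB_DST_NOREF_MODEL;
}

/* Convert a noref dst into a refcounted one, e.g. before queueing. */
static void skb_dst_force_model(struct skb_model *skb)
{
    if (skb->refdst & SKB_DST_NOREF_MODEL) {
        struct dst_model *dst = skb_dst_model(skb);

        dst->refcnt++;
        skb->refdst = (uintptr_t)dst;
    }
}

/* Drop a reference only if the skb actually holds one. */
static void skb_dst_drop_model(struct skb_model *skb)
{
    if (skb->refdst != 0 && !(skb->refdst & SKB_DST_NOREF_MODEL))
        skb_dst_model(skb)->refcnt--;
    skb->refdst = 0;
}
```

In this model a packet that stays on the fast path never touches the reference count, while a packet that gets queued calls the force helper first and then pays for exactly one increment and one decrement.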
13 Apr, 2010
1 commit
-
With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
work.

sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)
This rwlock is readlocked for a very small amount of time, and dst
entries are already freed after RCU grace period. This calls for RCU
again :)

This patch converts sk_dst_lock to a spinlock, and uses RCU for readers.
__sk_dst_get() is supposed to be called with rcu_read_lock() or if
socket locked by user, so use appropriate rcu_dereference_check()
condition (rcu_read_lock_held() || sock_owned_by_user(sk))

This patch avoids two atomic ops per tx packet on UDP connected sockets,
for example, and permits sk_dst_lock to be much less dirtied.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
24 Dec, 2009
1 commit
-
Add rtnetlink init_rcvwnd to set the TCP initial receive window size
advertised by passive and active TCP connections.
The current Linux TCP implementation limits the advertised TCP initial
receive window to the one prescribed by slow start. For short lived
TCP connections used for transaction type of traffic (i.e. http
requests), bounding the advertised TCP initial receive window results
in increased latency to complete the transaction.
Setting the initial congestion window is already supported
using rtnetlink init_cwnd, but the feature is useless without the
ability to set a larger TCP initial receive window.
The rtnetlink init_rcvwnd allows increasing the TCP initial receive
window, allowing TCP connection to advertise larger TCP receive window
than the ones bounded by slow start.
Signed-off-by: Laurent Chavey
Signed-off-by: David S. Miller