23 Dec, 2011

1 commit

  • Chris Boot reported crashes occurring in ipv6_select_ident().

    [ 461.457562] RIP: 0010:[] [] ipv6_select_ident+0x31/0xa7

    [ 461.578229] Call Trace:
    [ 461.580742]
    [ 461.582870] [] ? udp6_ufo_fragment+0x124/0x1a2
    [ 461.589054] [] ? ipv6_gso_segment+0xc0/0x155
    [ 461.595140] [] ? skb_gso_segment+0x208/0x28b
    [ 461.601198] [] ? ipv6_confirm+0x146/0x15e [nf_conntrack_ipv6]
    [ 461.608786] [] ? nf_iterate+0x41/0x77
    [ 461.614227] [] ? dev_hard_start_xmit+0x357/0x543
    [ 461.620659] [] ? nf_hook_slow+0x73/0x111
    [ 461.626440] [] ? br_parse_ip_options+0x19a/0x19a [bridge]
    [ 461.633581] [] ? dev_queue_xmit+0x3af/0x459
    [ 461.639577] [] ? br_dev_queue_push_xmit+0x72/0x76 [bridge]
    [ 461.646887] [] ? br_nf_post_routing+0x17d/0x18f [bridge]
    [ 461.653997] [] ? nf_iterate+0x41/0x77
    [ 461.659473] [] ? br_flood+0xfa/0xfa [bridge]
    [ 461.665485] [] ? nf_hook_slow+0x73/0x111
    [ 461.671234] [] ? br_flood+0xfa/0xfa [bridge]
    [ 461.677299] [] ? nf_bridge_update_protocol+0x20/0x20 [bridge]
    [ 461.684891] [] ? nf_ct_zone+0xa/0x17 [nf_conntrack]
    [ 461.691520] [] ? br_flood+0xfa/0xfa [bridge]
    [ 461.697572] [] ? NF_HOOK.constprop.8+0x3c/0x56 [bridge]
    [ 461.704616] [] ? nf_bridge_push_encap_header+0x1c/0x26 [bridge]
    [ 461.712329] [] ? br_nf_forward_finish+0x8a/0x95 [bridge]
    [ 461.719490] [] ? nf_bridge_pull_encap_header+0x1c/0x27 [bridge]
    [ 461.727223] [] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge]
    [ 461.734292] [] ? nf_iterate+0x41/0x77
    [ 461.739758] [] ? __br_deliver+0xa0/0xa0 [bridge]
    [ 461.746203] [] ? nf_hook_slow+0x73/0x111
    [ 461.751950] [] ? __br_deliver+0xa0/0xa0 [bridge]
    [ 461.758378] [] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]

    This is caused by the bridge netfilter's special dst_entry
    (fake_rtable), a shared entry to which attaching an inetpeer makes
    no sense.

    The problem has been present since commit 87c48fa3b46 (ipv6: make
    fragment identifications less predictable).

    Introduce a DST_NOPEER dst flag and make sure ipv6_select_ident() and
    __ip_select_ident() fall back to the 'no peer attached' handling.

    Reported-by: Chris Boot
    Tested-by: Chris Boot
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
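
The fallback described in the commit above can be pictured with a small userspace sketch. It is not the kernel code: every `ex_`/`EX_` name is hypothetical, the peer is reduced to a single counter, and the plain global counter stands in for whatever the real 'no peer attached' path uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's DST_NOPEER flag. */
#define EX_DST_NOPEER 0x1u

/* Simplified stand-in for an inetpeer: just a per-destination ID counter. */
struct ex_peer {
    uint32_t next_id;
};

/* Simplified stand-in for a dst_entry. */
struct ex_dst {
    unsigned int flags;
    struct ex_peer *peer;
};

/* Shared counter used by the 'no peer attached' path (assumption). */
static uint32_t ex_fallback_id = 1;

/* Pick a fragment identification for a packet routed through dst. */
static uint32_t ex_select_ident(struct ex_dst *dst)
{
    /* A shared entry flagged NOPEER must never touch a peer. */
    if (dst != NULL && !(dst->flags & EX_DST_NOPEER) && dst->peer != NULL)
        return dst->peer->next_id++;

    /* No dst, no peer, or a NOPEER entry: use the shared counter. */
    return ex_fallback_id++;
}
```

The point of the flag is that the check happens before any peer is looked up or attached, so a shared entry such as fake_rtable always takes the second branch.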
     

27 Nov, 2011

2 commits


18 Aug, 2011

1 commit

  • The l4_rxhash flag was added to the skb structure to indicate
    that the rxhash value was computed over the 4-tuple for the
    packet, which includes the port information in the encapsulated
    transport packet. This is used by the stack to preserve the
    rxhash value in __skb_tunnel_rx.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

03 Aug, 2011

1 commit

  • Gergely Kalman reported crashes in check_peer_redir().

    It appears commit f39925dbde778 (ipv4: Cache learned redirect
    information in inetpeer.) added a race, leading to possible NULL ptr
    dereference.

    Since we can now change dst neighbour, we should make sure a reader can
    safely use a neighbour.

    Add RCU protection to dst neighbour, and make sure check_peer_redir()
    can be called safely by different cpus in parallel.

    As neighbours are already freed after one RCU grace period, this patch
    should not add the typical RCU penalty (cache-cold effects).

    Many thanks to Gergely for providing a pretty report pointing to the
    bug.

    Reported-by: Gergely Kalman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Jul, 2011

2 commits


14 Jul, 2011

1 commit

  • Now that there is a one-to-one correspondence between neighbour
    and hh_cache entries, we no longer need:

    1) dynamic allocation
    2) attachment to dst->hh
    3) refcounting

    Initialization of the hh_cache entry is indicated by hh_len
    being non-zero, and such initialization is always done with
    the neighbour's lock held as a writer.

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Jul, 2011

1 commit

  • IPv6, unlike IPv4, doesn't have a routing cache.

    Routing table entries, as well as clones made in response
    to route lookup requests, all live in the same table. And
    all of these things are together collected in the destination
    cache table for ipv6.

    This means that routing table entries count against the garbage
    collection limits, even though such entries cannot ever be reclaimed
    and are added explicitly by the administrator (rather than being
    created in response to lookups).

    Therefore it makes no sense to count ipv6 routing table entries
    against the GC limits.

    Add a DST_NOCOUNT destination cache entry flag, and skip the counting
    if it is set. Use this flag bit in ipv6 when adding routing table
    entries.

    Signed-off-by: David S. Miller

    David S. Miller
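
A minimal userspace sketch of the DST_NOCOUNT idea from the commit above, assuming a per-family counter that garbage collection compares against its limit. All `ex_`/`EX_` names are hypothetical and the structures are heavily simplified.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's DST_NOCOUNT flag. */
#define EX_DST_NOCOUNT 0x2u

/* Per-family bookkeeping: the number of entries that count against GC. */
struct ex_dst_ops {
    int entries;
};

struct ex_dst_entry {
    struct ex_dst_ops *ops;
    unsigned int flags;
};

/* Allocate an entry; only entries without NOCOUNT bump the GC counter. */
static struct ex_dst_entry *ex_dst_alloc(struct ex_dst_ops *ops,
                                         unsigned int flags)
{
    struct ex_dst_entry *dst = calloc(1, sizeof(*dst));

    if (dst == NULL)
        return NULL;
    dst->ops = ops;
    dst->flags = flags;
    if (!(flags & EX_DST_NOCOUNT))
        ops->entries++;
    return dst;
}

/* Free an entry, undoing the accounting only if it was counted. */
static void ex_dst_free(struct ex_dst_entry *dst)
{
    if (!(dst->flags & EX_DST_NOCOUNT))
        dst->ops->entries--;
    free(dst);
}
```

In this sketch an administrator-added routing table entry would be allocated with `EX_DST_NOCOUNT`, while a clone created by a lookup would be allocated with no flags and would count as before.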
     

25 May, 2011

1 commit


19 May, 2011

1 commit

  • It's way past its usefulness. And this gets rid of a bunch
    of stray ->rt_{dst,src} references.

    Even the comment documenting the macro was inaccurate (stated
    default was 1 when it's 0).

    If reintroduced, it should be done properly, with dynamic debug
    facilities.

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Apr, 2011

1 commit


25 Apr, 2011

1 commit

  • These header files are never installed for user consumption, so any
    __KERNEL__ cpp checks are superfluous.

    Projects should also not copy these files into their userland utility
    sources and try to use them there. If they insist on doing so, the
    onus is on them to sanitize the headers as needed.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Mar, 2011

1 commit


03 Mar, 2011

1 commit


02 Mar, 2011

2 commits


23 Feb, 2011

1 commit


18 Feb, 2011

1 commit


09 Feb, 2011

1 commit


05 Feb, 2011

1 commit


29 Jan, 2011

1 commit


27 Jan, 2011

1 commit

  • Routing metrics are now copy-on-write.

    Initially a route entry points its metrics at a read-only location.
    If a routing table entry exists, it will point there. Else it will
    point at the all zero metric place-holder called 'dst_default_metrics'.

    The writeability state of the metrics is stored in the low bits of the
    metrics pointer; we have two bits left to spare if we want to store
    more states.

    For the initial implementation, COW is implemented simply via kmalloc.
    However future enhancements will change this to place the writable
    metrics somewhere else, in order to increase sharing. Very likely
    this "somewhere else" will be the inetpeer cache.

    Note also that this means that metrics updates may transiently fail
    if we cannot COW the metrics successfully.

    But even by itself, this patch should decrease memory usage and
    increase cache locality especially for routing workloads. In those
    cases the read-only metric copies stay in place and never get written
    to.

    TCP workloads where metrics get updated, and those rare cases where
    PMTU triggers occur, will take a very slight performance hit. But
    that hit will be alleviated when the long-term writable metrics
    move to a more sharable location.

    Since the metrics storage went from a u32 array of RTAX_MAX entries to
    what is essentially a pointer, some retooling of the dst_entry layout
    was necessary.

    Most importantly, we need to preserve the alignment of the reference
    count so that it doesn't share cache lines with the read-mostly state,
    as per Eric Dumazet's alignment assertion checks.

    The only non-trivial bit here is the move of the 'flags' member into
    the writeable cacheline. This is OK since we are always accessing the
    flags around the same moment when we made a modification to the
    reference count.

    Signed-off-by: David S. Miller

    David S. Miller
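
The copy-on-write scheme described in the commit above can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the kernel implementation: the `ex_`/`EX_` names, the array size, and the exact bit assignments are hypothetical, and the private copy is made with plain `malloc` in place of `kmalloc`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define EX_RTAX_MAX          4      /* hypothetical; not the kernel value */
#define EX_METRICS_READ_ONLY 0x1UL  /* state bit kept in the pointer */
#define EX_METRICS_FLAGS     0x3UL  /* low bits free because u32 is 4-aligned */

/* All-zero placeholder used when no routing table entry supplies metrics. */
static const uint32_t ex_default_metrics[EX_RTAX_MAX] = { 0 };

struct ex_dst {
    uintptr_t _metrics;             /* pointer plus state in the low bits */
};

/* Strip the state bits to recover the real array pointer. */
static uint32_t *ex_metrics_ptr(const struct ex_dst *dst)
{
    return (uint32_t *)(dst->_metrics & ~EX_METRICS_FLAGS);
}

/* Point the entry at shared read-only metrics (or the zero placeholder). */
static void ex_dst_init_metrics(struct ex_dst *dst, const uint32_t *src)
{
    if (src == NULL)
        src = ex_default_metrics;
    dst->_metrics = (uintptr_t)src | EX_METRICS_READ_ONLY;
}

static uint32_t ex_dst_metric(const struct ex_dst *dst, int idx)
{
    return ex_metrics_ptr(dst)[idx];
}

/* Write a metric, copying first if needed. Returns -1 if the copy fails. */
static int ex_dst_metric_set(struct ex_dst *dst, int idx, uint32_t val)
{
    if (dst->_metrics & EX_METRICS_READ_ONLY) {
        uint32_t *copy = malloc(sizeof(uint32_t) * EX_RTAX_MAX);

        if (copy == NULL)
            return -1;              /* the update transiently fails */
        memcpy(copy, ex_metrics_ptr(dst), sizeof(uint32_t) * EX_RTAX_MAX);
        dst->_metrics = (uintptr_t)copy;    /* writable: flag bit clear */
    }
    ex_metrics_ptr(dst)[idx] = val;
    return 0;
}

/* Release the private copy, if one was ever made. */
static void ex_dst_destroy_metrics(struct ex_dst *dst)
{
    if (!(dst->_metrics & EX_METRICS_READ_ONLY))
        free(ex_metrics_ptr(dst));
    dst->_metrics = 0;
}
```

Entries that are only read keep sharing one array, which is where the memory and cache-locality benefit claimed in the commit message would come from. Only the first write to an entry pays for an allocation and a copy.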
     

14 Jan, 2011

2 commits

  • Conflicts:
    net/ipv4/route.c

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
    which itself depends on NET_SCHED; this dependency is missing from netfilter.

    Since matching on realms is also useful without having NET_SCHED enabled and
    the option really only controls whether the tclassid member is included in
    route and dst entries, rename the config option to IP_ROUTE_CLASSID and move
    it outside of traffic scheduling context to get rid of the NET_SCHED dependency.

    Reported-by: Vladis Kletnieks
    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

15 Dec, 2010

1 commit


14 Dec, 2010

1 commit

  • Make all RTAX_ADVMSS metric accesses go through a new helper function,
    dst_metric_advmss().

    Leave the actual default metric as "zero" in the real metric slot,
    and compute the actual default value dynamically via a new dst_ops
    AF specific callback.

    For stacked IPSEC routes, we use the advmss of the path which
    preserves existing behavior.

    Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
    advmss on pmtu updates. This inconsistency in advmss handling
    results in more raw metric accesses than I wish we ended up with.

    Signed-off-by: David S. Miller

    David S. Miller
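
The "zero in the slot, default computed on demand" pattern from the commit above can be sketched like this. Only `dst_metric_advmss()` is named in the commit; the `ex_` names, the structure layout, and the simplified defaults (MTU minus the fixed IP and TCP header sizes, with no clamping) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct ex_dst;

/* Per-address-family operations, reduced to the one callback of interest. */
struct ex_dst_ops {
    uint32_t (*default_advmss)(const struct ex_dst *dst);
};

struct ex_dst {
    const struct ex_dst_ops *ops;
    uint32_t mtu;
    uint32_t advmss;    /* raw metric slot; 0 means "use the default" */
};

/* Helper through which every advmss read would go. */
static uint32_t ex_dst_metric_advmss(const struct ex_dst *dst)
{
    uint32_t advmss = dst->advmss;

    if (advmss == 0)
        advmss = dst->ops->default_advmss(dst);
    return advmss;
}

/* IPv4-like default: MTU minus 20 bytes of IP and 20 bytes of TCP header. */
static uint32_t ex_ipv4_default_advmss(const struct ex_dst *dst)
{
    return dst->mtu - 40;
}

/* IPv6-like default: MTU minus 40 bytes of IPv6 and 20 bytes of TCP header. */
static uint32_t ex_ipv6_default_advmss(const struct ex_dst *dst)
{
    return dst->mtu - 60;
}
```

Because the slot stays zero unless someone sets it explicitly, an entry can keep pointing at shared metrics and still report a sensible per-family default.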
     

13 Dec, 2010

2 commits

  • Always go through a new ip4_dst_hoplimit() helper, just like ipv6.

    This allowed several simplifications:

    1) The interim dst_metric_hoplimit() can go as it's no longer
    used.

    2) The sysctl_ip_default_ttl entry no longer needs to use
    ipv4_doint_and_flush, since the sysctl is not cached in
    routing cache metrics any longer.

    3) ipv4_doint_and_flush no longer needs to be exported and
    therefore can be marked static.

    When ipv4_doint_and_flush_strategy was removed some time ago,
    the external declaration in ip.h was mistakenly left around
    so kill that off too.

    We have to move the sysctl_ip_default_ttl declaration into
    ipv4's route cache definition header net/route.h, because
    currently net/ip.h (where the declaration lives now) has
    a back dependency on net/route.h

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: David S. Miller

    David S. Miller
     

10 Dec, 2010

1 commit

  • Use helper functions to hide all direct accesses, especially writes,
    to dst_entry metrics values.

    This will allow us to:

    1) More easily change how the metrics are stored.

    2) Implement COW for metrics.

    In particular this will help us put metrics into the inetpeer
    cache if that is what we end up doing. We can make the _metrics
    member a pointer instead of an array, initially have it point
    at the read-only metrics in the FIB, and then on the first set
    grab an inetpeer entry and point the _metrics member there.

    Signed-off-by: David S. Miller
    Acked-by: Eric Dumazet

    David S. Miller
     

09 Nov, 2010

1 commit

  • While tracking dev_base_lock users, I found decnet used it in
    dnet_select_source(), but for a wrong purpose:

    Writers only hold RTNL, not dev_base_lock, so readers must use RCU if
    they cannot use RTNL.

    Add an rcu_head to struct dn_ifaddr and handle RCU management properly.

    Add an __rcu annotation in dn_route as well.

    Signed-off-by: Eric Dumazet
    Acked-by: Steven Whitehouse
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Oct, 2010

1 commit


04 Oct, 2010

1 commit

  • While doing stress tests with IP route cache disabled, and multi queue
    devices, I noticed a very high contention on one rwlock used in
    neighbour code.

    When many cpus are trying to send frames (possibly using a high
    performance multiqueue device) to the same neighbour, they fight for the
    neigh->lock rwlock in order to call neigh_hh_init(), and fight on
    hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test())

    But we don't need to call neigh_hh_init() for dst that are used only
    once. It costs four atomic operations at least, on two contended cache
    lines, plus the high contention on neigh->lock rwlock.

    Introduce a new dst flag, DST_NOCACHE, that is set when dst was not
    inserted in route cache.

    With the stress test bench, sending 160000000 frames on one neighbour,
    results are :

    Before patch:

    real 2m28.406s
    user 0m11.781s
    sys 36m17.964s

    After patch:

    real 1m26.532s
    user 0m12.185s
    sys 20m3.903s

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Sep, 2010

1 commit

  • Tunnels are going to use percpu for their accounting.

    They are going to use a new tstats field in net_device.

    skb_tunnel_rx() is changed to be a wrapper around __skb_tunnel_rx()

    IPTUNNEL_XMIT() is changed to be a wrapper around __IPTUNNEL_XMIT()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Sep, 2010

1 commit


05 Jun, 2010

1 commit

  • xfrm triggers a warning if dst_pop() drops a refcount
    on a noref dst. This patch changes dst_pop() to
    skb_dst_pop(). skb_dst_pop() drops the refcnt only
    on a refcounted dst. Also we don't clone the child
    dst_entry, so it is not refcounted and we can use
    skb_dst_set_noref() in xfrm_output_one().

    Signed-off-by: Steffen Klassert
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Steffen Klassert
     

18 May, 2010

2 commits

  • skb rxhash should be cleared when a skb is handled by a tunnel before
    being delivered again, so that correct packet steering can take place.

    There are other cleanups and accounting that we can factorize in a new
    helper, skb_tunnel_rx()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Use low order bit of skb->_skb_dst to tell dst is not refcounted.

    Change _skb_dst to _skb_refdst to make sure all uses are caught.

    skb_dst() returns the dst, regardless of noref bit set or not, but
    with a lockdep check to make sure a noref dst is not given if current
    user is not rcu protected.

    New skb_dst_set_noref() helper to set a non-refcounted dst on a skb.
    (with lockdep check)

    skb_dst_drop() drops a reference only if skb dst was refcounted.

    skb_dst_force() helper is used to force a refcount on dst, when skb
    is queued and no longer RCU protected.

    Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
    !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
    sock_queue_rcv_skb(), in __nf_queue().

    Use skb_dst_force() in dev_requeue_skb().

    Note: dst_use_noref() still dirties dst, we might transform it
    later to do one dirtying per jiffies.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
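
The low-order-bit trick in the commit above can be illustrated with a small userspace sketch. The helper names mirror those in the commit message with an `ex_` prefix, but the bodies are assumptions: the refcount is a plain int, there is no RCU or lockdep checking, and the mask values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_DST_NOREF   0x1UL            /* bit 0: dst is not refcounted */
#define EX_DST_PTRMASK (~EX_DST_NOREF)

struct ex_dst {
    int refcnt;                         /* plain int for illustration only */
};

struct ex_sk_buff {
    uintptr_t _skb_refdst;              /* dst pointer plus the noref bit */
};

/* Return the dst regardless of whether the noref bit is set. */
static struct ex_dst *ex_skb_dst(const struct ex_sk_buff *skb)
{
    return (struct ex_dst *)(skb->_skb_refdst & EX_DST_PTRMASK);
}

/* Attach a dst whose reference the skb now owns. */
static void ex_skb_dst_set(struct ex_sk_buff *skb, struct ex_dst *dst)
{
    skb->_skb_refdst = (uintptr_t)dst;
}

/* Attach a dst without taking a reference. */
static void ex_skb_dst_set_noref(struct ex_sk_buff *skb, struct ex_dst *dst)
{
    skb->_skb_refdst = (uintptr_t)dst | EX_DST_NOREF;
}

/* Upgrade a noref dst to a real reference, e.g. before queueing the skb. */
static void ex_skb_dst_force(struct ex_sk_buff *skb)
{
    if (skb->_skb_refdst & EX_DST_NOREF) {
        struct ex_dst *dst = ex_skb_dst(skb);

        dst->refcnt++;
        skb->_skb_refdst = (uintptr_t)dst;
    }
}

/* Detach the dst, dropping a reference only if one was held. */
static void ex_skb_dst_drop(struct ex_sk_buff *skb)
{
    if (skb->_skb_refdst != 0) {
        if (!(skb->_skb_refdst & EX_DST_NOREF))
            ex_skb_dst(skb)->refcnt--;
        skb->_skb_refdst = 0;
    }
}
```

The common path (set noref, use, drop) never touches the reference count. Only a skb that outlives its protected section pays for the increment in `ex_skb_dst_force()`.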
     

13 Apr, 2010

1 commit

  • With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
    work.

    sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

    This rwlock is readlocked for a very small amount of time, and dst
    entries are already freed after RCU grace period. This calls for RCU
    again :)

    This patch converts sk_dst_lock to a spinlock, and uses RCU for readers.

    __sk_dst_get() is supposed to be called with rcu_read_lock() held or
    with the socket locked by the user, so use the appropriate
    rcu_dereference_check() condition
    (rcu_read_lock_held() || sock_owned_by_user(sk))

    This patch avoids two atomic ops per tx packet on UDP connected sockets,
    for example, and permits sk_dst_lock to be much less dirtied.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Dec, 2009

1 commit

  • Add rtnetlink init_rcvwnd to set the TCP initial receive window size
    advertised by passive and active TCP connections.

    The current Linux TCP implementation limits the advertised TCP initial
    receive window to the one prescribed by slow start. For short-lived
    TCP connections used for transaction-type traffic (e.g. HTTP
    requests), bounding the advertised TCP initial receive window results
    in increased latency to complete the transaction.

    Setting the initial congestion window is already supported
    using rtnetlink init_cwnd, but the feature is useless without the
    ability to set a larger TCP initial receive window.

    The rtnetlink init_rcvwnd allows increasing the TCP initial receive
    window, allowing a TCP connection to advertise a larger TCP receive
    window than the one bounded by slow start.

    Signed-off-by: Laurent Chavey
    Signed-off-by: David S. Miller

    Laurent Chavey