31 Aug, 2012

1 commit

  • Following lockdep splat was reported by Pavel Roskin :

    [ 1570.586223] ===============================
    [ 1570.586225] [ INFO: suspicious RCU usage. ]
    [ 1570.586228] 3.6.0-rc3-wl-main #98 Not tainted
    [ 1570.586229] -------------------------------
    [ 1570.586231] /home/proski/src/linux/net/ipv4/route.c:645 suspicious rcu_dereference_check() usage!
    [ 1570.586233]
    [ 1570.586233] other info that might help us debug this:
    [ 1570.586233]
    [ 1570.586236]
    [ 1570.586236] rcu_scheduler_active = 1, debug_locks = 0
    [ 1570.586238] 2 locks held by Chrome_IOThread/4467:
    [ 1570.586240] #0: (slock-AF_INET){+.-...}, at: [] release_sock+0x2c/0xa0
    [ 1570.586253] #1: (fnhe_lock){+.-...}, at: [] update_or_create_fnhe+0x2c/0x270
    [ 1570.586260]
    [ 1570.586260] stack backtrace:
    [ 1570.586263] Pid: 4467, comm: Chrome_IOThread Not tainted 3.6.0-rc3-wl-main #98
    [ 1570.586265] Call Trace:
    [ 1570.586271] [] lockdep_rcu_suspicious+0xfd/0x130
    [ 1570.586275] [] update_or_create_fnhe+0x15c/0x270
    [ 1570.586278] [] __ip_rt_update_pmtu+0x73/0xb0
    [ 1570.586282] [] ip_rt_update_pmtu+0x29/0x90
    [ 1570.586285] [] inet_csk_update_pmtu+0x2c/0x80
    [ 1570.586290] [] tcp_v4_mtu_reduced+0x2e/0xc0
    [ 1570.586293] [] tcp_release_cb+0xa4/0xb0
    [ 1570.586296] [] release_sock+0x55/0xa0
    [ 1570.586300] [] tcp_sendmsg+0x4af/0xf50
    [ 1570.586305] [] inet_sendmsg+0x120/0x230
    [ 1570.586308] [] ? inet_sk_rebuild_header+0x40/0x40
    [ 1570.586312] [] ? sock_update_classid+0xbd/0x3b0
    [ 1570.586315] [] ? sock_update_classid+0x130/0x3b0
    [ 1570.586320] [] do_sock_write+0xc5/0xe0
    [ 1570.586323] [] sock_aio_write+0x53/0x80
    [ 1570.586328] [] do_sync_write+0xa3/0xe0
    [ 1570.586332] [] vfs_write+0x165/0x180
    [ 1570.586335] [] sys_write+0x45/0x90
    [ 1570.586340] [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Eric Dumazet
    Reported-by: Pavel Roskin
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Aug, 2012

1 commit

  • Multicast traffic allocates dst with DST_NOCACHE, but dst is
    not inserted into rt_uncached_list.

    This slowdown multicast workloads on SMP because rt_uncached_lock is
    contended.

    Change the test before taking the lock to actually check the dst
    was inserted into rt_uncached_list.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Aug, 2012

1 commit

  • Sylvain Munault reported following info :

    - TCP connection get "stuck" with data in send queue when doing
    "large" transfers ( like typing 'ps ax' on a ssh connection )
    - Only happens on path where the PMTU is lower than the MTU of
    the interface
    - Is not present right after boot, it only appears 10-20min after
    boot or so. (and that's inside the _same_ TCP connection, it works
    fine at first and then in the same ssh session, it'll get stuck)
    - Definitely seems related to fragments somehow since I see a router
    sending ICMP message saying fragmentation is needed.
    - Exact same setup works fine with kernel 3.5.1

    Problem happens when the 10 minutes (ip_rt_mtu_expires) expiration
    period is over.

    ip_rt_update_pmtu() calls dst_set_expires() to rearm a new expiration,
    but dst_set_expires() does nothing because dst.expires is already set.

    It seems we want to set the expires field to a new value, regardless
    of prior one.

    With help from Julian Anastasov.

    Reported-by: Sylvain Munaut
    Signed-off-by: Eric Dumazet
    CC: Julian Anastasov
    Tested-by: Sylvain Munaut
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Aug, 2012

1 commit

  • Commit caacf05e5ad1abf causes big drop of UDP loop back performance.
    The cause of the regression is that we do not cache the local output
    routes. Each time we send a datagram from unconnected UDP socket,
    the kernel allocates a dst_entry and adds it to the rt_uncached_list.
    It creates lock contention on the rt_uncached_lock.

    Reported-by: Alex Shi
    Signed-off-by: Yan, Zheng
    Signed-off-by: David S. Miller

    Yan, Zheng
     

02 Aug, 2012

1 commit


01 Aug, 2012

4 commits

  • When a device is unregistered, we have to purge all of the
    references to it that may exist in the entire system.

    If a route is uncached, we currently have no way of accomplishing
    this.

    So create a global list that is scanned when a network device goes
    down. This mirrors the logic in net/core/dst.c's dst_ifdown().

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: David S. Miller

    David S. Miller
     
  • Input path is mostly run under RCU and doesnt touch dst refcnt

    But output path on forwarding or UDP workloads hits
    badly dst refcount, and we have lot of false sharing, for example
    in ipv4_mtu() when reading rt->rt_pmtu

    Using a percpu cache for nh_rth_output gives a nice performance
    increase at a small cost.

    24 udpflood test on my 24 cpu machine (dummy0 output device)
    (each process sends 1.000.000 udp frames, 24 processes are started)

    before : 5.24 s
    after : 2.06 s
    For reference, time on linux-3.5 : 6.60 s

    Signed-off-by: Eric Dumazet
    Tested-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried
    to solve a race but added a problem at device/fib dismantle time :

    We really want to call dst_free() as soon as possible, even if sockets
    still have dst in their cache.
    dst_release() calls in free_fib_info_rcu() are not welcomed.

    Root of the problem was that now we also cache output routes (in
    nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
    rt_free(), because output route lookups are done in process context.

    Based on feedback and initial patch from David Miller (adding another
    call_rcu_bh() call in fib, but it appears it was not the right fix)

    I left the inet_sk_rx_dst_set() helper and added __rcu attributes
    to nh_rth_output and nh_rth_input to better document what is going on in
    this code.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Jul, 2012

1 commit

  • commit c6cffba4ffa2 (ipv4: Fix input route performance regression.)
    added various fatal races with dst refcounts.

    crashes happen on tcp workloads if routes are added/deleted at the same
    time.

    The dst_free() calls from free_fib_info_rcu() are clearly racy.

    We need instead regular dst refcounting (dst_release()) and make
    sure dst_release() is aware of RCU grace periods :

    Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
    before dst destruction for cached dst

    Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
    to make sure we dont increase a zero refcount (On a dst currently
    waiting an rcu grace period before destruction)

    rt_cache_route() must take a reference on the new cached route, and
    release it if was not able to install it.

    With this patch, my machines survive various benchmarks.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jul, 2012

1 commit

  • With the routing cache removal we lost the "noref" code paths on
    input, and this can kill some routing workloads.

    Reinstate the noref path when we hit a cached route in the FIB
    nexthops.

    With help from Eric Dumazet.

    Reported-by: Alexander Duyck
    Signed-off-by: David S. Miller
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     

26 Jul, 2012

1 commit

  • commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.)
    introduced rt_cache_valid() helper. It unfortunately doesn't check if
    route is expired before caching it.

    I noticed sk_setup_caps() was constantly called on a tcp workload.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Jul, 2012

3 commits

  • On input packet processing, rt->rt_iif will be zero if we should
    use skb->dev->ifindex.

    Since we access rt->rt_iif consistently via inet_iif(), that is
    the only spot whose interpretation have to adjust.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Use inet_iif() consistently, and for TCP record the input interface of
    cached RX dst in inet sock.

    rt->rt_iif is going to be encoded differently, so that we can
    legitimately cache input routes in the FIB info more aggressively.

    When the input interface is "use SKB device index" the rt->rt_iif will
    be set to zero.

    This forces us to move the TCP RX dst cache installation into the ipv4
    specific code, and as well it should since doing the route caching for
    ipv6 is pointless at the moment since it is not inspected in the ipv6
    input paths yet.

    Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
    obsolete set to a non-zero value to force invocation of the check
    callback.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The last and final kernel user, ICMP address replies,
    has been removed.

    Signed-off-by: David S. Miller

    David S. Miller
     

21 Jul, 2012

17 commits


20 Jul, 2012

2 commits

  • Fix again the diff value in rt_bind_exception
    after collision of two latest patches, my original commit
    actually fixed the same problem.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Use global seqlock for the nh_exceptions. Call
    fnhe_oldest with the right hash chain. Correct the diff
    value for dst_set_expires.

    v2: after suggestions from Eric Dumazet:
    * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
    * continue daddr search in rt_bind_exception

    v3:
    * remove the daddr check before seqlock in rt_bind_exception
    * restart lookup in rt_bind_exception on detected seqlock change,
    as suggested by David Miller

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

19 Jul, 2012

1 commit


18 Jul, 2012

2 commits


17 Jul, 2012

2 commits

  • In a regime where we have subnetted route entries, we need a way to
    store persistent storage about destination specific learned values
    such as redirects and PMTU values.

    This is implemented here via nexthop exceptions.

    The initial implementation is a 2048 entry hash table with relaiming
    starting at chain length 5. A more sophisticated scheme can be
    devised if that proves necessary.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This will be used so that we can compose a full flow key.

    Even though we have a route in this context, we need more. In the
    future the routes will be without destination address, source address,
    etc. keying. One ipv4 route will cover entire subnets, etc.

    In this environment we have to have a way to possess persistent storage
    for redirects and PMTU information. This persistent storage will exist
    in the FIB tables, and that's why we'll need to be able to rebuild a
    full lookup flow key here. Using that flow key will do a fib_lookup()
    and create/update the persistent entry.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 Jul, 2012

1 commit