27 Nov, 2011

1 commit

  • Now that inetpeer is the place where we cache redirect information for
    IPv4 destinations, we must be able to invalidate that information when
    a route is added/removed on the host.

    As inetpeer is not yet namespace aware, this patch adds a shared
    redirect_genid and a per-inetpeer redirect_genid. This might be changed
    later if inetpeer becomes namespace aware.

    Cached information for one inetpeer is valid as long as its
    redirect_genid has the same value as the global redirect_genid (see the
    sketch after this entry).

    Reported-by: Arkadiusz Miśkiewicz
    Tested-by: Arkadiusz Miśkiewicz
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
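
    A minimal sketch of the genid scheme in kernel-style C; the struct
    layout and helper names are illustrative, not the actual patch:

    /* shared generation counter, bumped whenever a route is added or
     * removed; comparing snapshots makes invalidation O(1) */
    static atomic_t redirect_genid = ATOMIC_INIT(0);

    struct inet_peer_sketch {
            int    redirect_genid;   /* snapshot taken when caching */
            __be32 redirect_learned; /* cached redirect gateway */
    };

    /* invalidate every cached redirect at once, no tree walk needed */
    static void invalidate_redirects(void)
    {
            atomic_inc(&redirect_genid);
    }

    /* a cached redirect is usable only while the snapshots match */
    static bool peer_redirect_valid(const struct inet_peer_sketch *p)
    {
            return p->redirect_genid == atomic_read(&redirect_genid);
    }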
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h> (the helper is
    sketched after this entry).

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
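
    In the kernel, atomic_inc_not_zero() is a thin wrapper around
    atomic_add_unless(); a sketch of the consolidated helper:

    /* increment v only if it is non-zero; returns non-zero on success.
     * Typical use: take a reference only while the object is still live. */
    static inline int atomic_inc_not_zero_sketch(atomic_t *v)
    {
            return atomic_add_unless(v, 1, 0); /* add 1 unless v == 0 */
    }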
     

22 Jul, 2011

1 commit

  • IPv6 fragment identification generation lags way behind what we use
    for IPv4: it relies on a single generator. That is not scalable and
    allows DoS attacks.

    Now that inetpeer is IPv6 aware, we can use it to provide a more secure
    and scalable frag ident generator (per destination, instead of system
    wide), as sketched after this entry.

    This patch:
    1) defines a new secure_ipv6_id() helper
    2) extends inet_getid() to provide 32-bit results
    3) extends ipv6_select_ident() with a new dest parameter

    Reported-by: Fernando Gont
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
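
    A rough sketch of points 2) and 3), assuming a peer structure that
    carries a per-destination 32-bit counter; names are illustrative:

    struct inet_peer_sketch {
            atomic_t ip_id_count;   /* 32-bit counter, one per destination */
    };

    static void ipv6_select_ident_sketch(struct frag_hdr *fhdr,
                                         struct inet_peer_sketch *dest)
    {
            /* each destination draws from its own counter instead of a
             * single system-wide generator */
            __u32 id = (__u32)atomic_add_return(1, &dest->ip_id_count) - 1;

            fhdr->identification = htonl(id);
    }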
     

09 Jun, 2011

2 commits

  • Profiles show false sharing in addr_compare() because refcnt/dtime
    changes dirty the first inet_peer cache line, where the keys used at
    lookup time live. If many CPUs are calling inet_getpeer() and
    inet_putpeer(), or need frag ids, addr_compare() is in second position
    in "perf top" (one possible layout fix is sketched after this entry).

    Before the patch, my udpflood bench (16 threads) on my 2x4x2 machine:

    5784.00 9.7% csum_partial_copy_generic [kernel]
    3356.00 5.6% addr_compare [kernel]
    2638.00 4.4% fib_table_lookup [kernel]
    2625.00 4.4% ip_fragment [kernel]
    1934.00 3.2% neigh_lookup [kernel]
    1617.00 2.7% udp_sendmsg [kernel]
    1608.00 2.7% __ip_route_output_key [kernel]
    1480.00 2.5% __ip_append_data [kernel]
    1396.00 2.3% kfree [kernel]
    1195.00 2.0% kmem_cache_free [kernel]
    1157.00 1.9% inet_getpeer [kernel]
    1121.00 1.9% neigh_resolve_output [kernel]
    1012.00 1.7% dev_queue_xmit [kernel]
    # time ./udpflood.sh

    real 0m44.511s
    user 0m20.020s
    sys 11m22.780s

    # time ./udpflood.sh

    real 0m44.099s
    user 0m20.140s
    sys 11m15.870s

    After the patch, addr_compare() no longer appears in profiles:

    4171.00 10.7% csum_partial_copy_generic [kernel]
    1787.00 4.6% fib_table_lookup [kernel]
    1756.00 4.5% ip_fragment [kernel]
    1234.00 3.2% udp_sendmsg [kernel]
    1191.00 3.0% neigh_lookup [kernel]
    1118.00 2.9% __ip_append_data [kernel]
    1022.00 2.6% kfree [kernel]
    993.00 2.5% __ip_route_output_key [kernel]
    841.00 2.2% neigh_resolve_output [kernel]
    816.00 2.1% kmem_cache_free [kernel]
    658.00 1.7% ia32_sysenter_target [kernel]
    632.00 1.6% kmem_cache_alloc_node [kernel]

    # time ./udpflood.sh

    real 0m41.587s
    user 0m19.190s
    sys 10m36.370s

    # time ./udpflood.sh

    real 0m41.486s
    user 0m19.290s
    sys 10m33.650s

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
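
    One way to lay the structure out so that the hot refcnt/dtime writes
    stop dirtying the key's cache line; a sketch, not necessarily the exact
    arrangement the patch chose:

    struct inet_peer_sketch {
            /* first cache line: links and key, read-mostly during lookups */
            struct inet_peer_sketch *avl_left, *avl_right;
            __be32 v4daddr;         /* the key addr_compare() reads */
            __u32  avl_height;

            /* written on every inet_getpeer()/inet_putpeer(); pushed onto
             * a separate cache line so those writes no longer invalidate
             * the line that lookups read */
            atomic_t refcnt ____cacheline_aligned_in_smp;
            __u32    dtime;
    };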
     
  • Andi Kleen and Tim Chen reported huge contention on inetpeer's
    unused_peers.lock, on a memcached workload on a 40-core machine with
    the route cache disabled.

    It appears we constantly flip peer refcnts between 0 and 1, and we must
    insert/remove peers from unused_peers.list while holding a contended
    spinlock.

    Remove this list completely and perform garbage collection on the fly,
    at lookup time, using the expired nodes we encounter during the tree
    traversal (see the sketch after this entry).

    This removes a lot of code, makes locking more standard, and obsoletes
    two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). It also
    removes two pointers from the inet_peer structure.

    There is still a false sharing effect because refcnt is in the first
    cache line of the object [where the links and keys used by lookups are
    located]; we might move it to the end of the inet_peer structure to
    keep this first cache line mostly read-only for CPUs.

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Tim Chen
    Signed-off-by: David S. Miller

    Eric Dumazet
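
    A sketch of the on-the-fly collection, assumed to run under the tree
    lock; peer_expired(), reap_peers() and the stack bound are hypothetical:

    struct inet_peer_sketch {
            struct inet_peer_sketch *avl_left, *avl_right;
            __be32   v4daddr;
            atomic_t refcnt;
    };

    static bool peer_expired(const struct inet_peer_sketch *p);    /* hypothetical */
    static void reap_peers(struct inet_peer_sketch **gc, int cnt); /* hypothetical */

    #define GC_STACK_MAX 64 /* illustrative bound */

    static struct inet_peer_sketch *
    lookup_and_gc(__be32 daddr, struct inet_peer_sketch *root)
    {
            struct inet_peer_sketch *gc[GC_STACK_MAX];
            struct inet_peer_sketch *p = root;
            int cnt = 0;

            while (p) {
                    if (p->v4daddr == daddr)
                            return p;       /* found: caller takes a ref */
                    /* remember unreferenced, expired nodes met on the way,
                     * instead of maintaining a locked "unused" list */
                    if (cnt < GC_STACK_MAX &&
                        atomic_read(&p->refcnt) == 0 && peer_expired(p))
                            gc[cnt++] = p;
                    if ((__force __u32)daddr < (__force __u32)p->v4daddr)
                            p = p->avl_left;
                    else
                            p = p->avl_right;
            }
            reap_peers(gc, cnt);    /* unlink what the walk found */
            return NULL;
    }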
     

23 Apr, 2011

1 commit


11 Feb, 2011

2 commits

  • Validity of the cached PMTU information is indicated by its expiration
    value being non-zero, just as with dst->expires.

    The scheme we will use is to remember the pre-ICMP value held in the
    metrics or route entry, and then restore that value at expiration time
    (sketched after this entry).

    In this way PMTU expiration does not kill off the cached route, as is
    done currently.

    Redirect information is permanent, or at least valid until another
    redirect is received.

    Signed-off-by: David S. Miller

    David S. Miller
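
    A minimal sketch of the restore-on-expiry scheme; field names are
    illustrative:

    struct pmtu_cache_sketch {
            __u32         pmtu;         /* currently advertised PMTU */
            __u32         pmtu_orig;    /* pre-ICMP value to restore */
            unsigned long pmtu_expires; /* 0 => no learned PMTU cached */
    };

    static __u32 cached_pmtu(struct pmtu_cache_sketch *c)
    {
            if (c->pmtu_expires && time_after(jiffies, c->pmtu_expires)) {
                    /* expire by restoring the pre-ICMP value, rather than
                     * killing off the cached route */
                    c->pmtu = c->pmtu_orig;
                    c->pmtu_expires = 0;
            }
            return c->pmtu;
    }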
     
  • Future changes will add caching information, and some of these new
    elements will be addresses.

    Since the family is implicit via the ->daddr.family member, replicating
    the family in every address we store is entirely redundant (see the
    sketch after this entry).

    Signed-off-by: David S. Miller

    David S. Miller
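
    A sketch of the resulting layout, with illustrative names:

    /* the keyed address carries the family exactly once... */
    struct inetpeer_addr_sketch {
            union {
                    __be32          a4;
                    struct in6_addr a6;
            } addr;
            __u16 family;   /* AF_INET or AF_INET6 */
    };

    struct inet_peer_sketch {
            struct inetpeer_addr_sketch daddr;
            /* ...so later cached addresses can be bare unions, their
             * family implied by ->daddr.family */
            union {
                    __be32          a4;
                    struct in6_addr a6;
            } redirect_learned;     /* hypothetical cached element */
    };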
     

05 Feb, 2011

1 commit


28 Jan, 2011

2 commits


02 Dec, 2010

2 commits


01 Dec, 2010

3 commits


28 Oct, 2010

1 commit

  • Adds __rcu annotations to inetpeer:
    (struct inet_peer)->avl_left
    (struct inet_peer)->avl_right

    This is a tedious cleanup, but it removes one smp_wmb() from
    link_to_pool(), since we now use the more self-documenting
    rcu_assign_pointer().

    Note the use of RCU_INIT_POINTER() instead of rcu_assign_pointer() in
    all cases where we don't need a memory barrier (both idioms are
    sketched after this entry).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
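
    Both publication idioms, sketched:

    struct node_sketch {
            struct node_sketch __rcu *avl_left;     /* __rcu: sparse-checked */
            struct node_sketch __rcu *avl_right;
    };

    static void link_examples(struct node_sketch *parent,
                              struct node_sketch *n)
    {
            /* no reader can see n yet, so no memory barrier is needed: */
            RCU_INIT_POINTER(n->avl_left, NULL);
            RCU_INIT_POINTER(n->avl_right, NULL);

            /* publish the initialized node to concurrent RCU readers; the
             * barrier inside rcu_assign_pointer() is what replaces the
             * explicit smp_wmb() this commit removed */
            rcu_assign_pointer(parent->avl_left, n);
    }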
     

17 Jun, 2010

1 commit

  • The addition of rcu_head to struct inet_peer added 16 bytes on 64-bit
    arches.

    That's a bit unfortunate, since the old size was exactly 64 bytes.

    This can be solved using a union between this rcu_head and four fields
    that are normally used only when a refcount is taken on an inet_peer;
    rcu_head is used only when refcnt == -1, right before the structure is
    freed (see the sketch after this entry).

    Add an inet_peer_refcheck() function to check this assertion for a
    while.

    We can bring back the SLAB_HWCACHE_ALIGN qualifier in kmem cache
    creation.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
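
    A sketch of such a union, using the four fields named in the entry
    below (rid, ip_id_count, tcp_ts, tcp_ts_stamp):

    struct inet_peer_sketch {
            atomic_t refcnt;        /* -1 == about to be freed */
            union {
                    struct {
                            /* valid only while refcnt > 0 */
                            atomic_t rid;
                            atomic_t ip_id_count;
                            __u32    tcp_ts;
                            __u32    tcp_ts_stamp;
                    };
                    /* used only once refcnt has dropped to -1 */
                    struct rcu_head rcu;
            };
    };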
     

16 Jun, 2010

1 commit

  • inetpeer currently uses an AVL tree protected by an rwlock.

    It's possible to make most lookups use RCU:

    1) Add a struct rcu_head to struct inet_peer.

    2) Add a lookup_rcu_bh() helper to perform a lockless and opportunistic
    lookup. This is a normal function, not a macro like lookup().

    3) Add a limit to the number of links followed by lookup_rcu_bh(). This
    is needed in case we fall into a loop (a sketch of this bounded lookup
    follows this entry).

    4) Add an smp_wmb() in link_to_pool() right before node insertion.

    5) Make unlink_from_pool() use atomic_cmpxchg() to make sure it can
    take the last reference to an inet_peer, since lockless readers could
    increase the refcount even while we hold peers.lock.

    6) Delay struct inet_peer freeing until after an RCU grace period, so
    that lookup_rcu_bh() cannot crash.

    7) Make inet_getpeer() first attempt a lockless lookup.
    Note this lookup can fail even if the target is in the AVL tree,
    because a concurrent writer can leave the tree in a transiently
    inconsistent form. If this attempt fails, the lock is taken and a
    regular lookup is performed again.

    8) Convert peers.lock from an rwlock to a spinlock.

    9) Remove SLAB_HWCACHE_ALIGN when peer_cachep is created, because
    rcu_head adds 16 bytes on 64-bit arches, doubling the effective size
    (64 -> 128 bytes).
    In a future patch, it will probably be possible to revert this part by
    putting the rcu field in a union to share space with rid, ip_id_count,
    tcp_ts & tcp_ts_stamp, since these fields are manipulated only with
    refcnt > 0.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
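
    A sketch of the bounded lockless walk from points 2), 3) and 7); names
    and the link limit are illustrative:

    struct inet_peer_sketch {
            struct inet_peer_sketch __rcu *avl_left, *avl_right;
            __be32   v4daddr;
            atomic_t refcnt;        /* -1 means "being freed" */
    };

    #define RCU_MAXDEPTH_SKETCH 32  /* illustrative bound */

    static struct inet_peer_sketch *
    lookup_rcu_bh_sketch(__be32 daddr, struct inet_peer_sketch __rcu **root)
    {
            struct inet_peer_sketch *u = rcu_dereference_bh(*root);
            int count = 0;

            while (u) {
                    if (u->v4daddr == daddr) {
                            /* never revive a peer that is being freed:
                             * take a reference only if refcnt != -1 */
                            if (!atomic_add_unless(&u->refcnt, 1, -1))
                                    u = NULL;
                            return u;
                    }
                    if ((__force __u32)daddr < (__force __u32)u->v4daddr)
                            u = rcu_dereference_bh(u->avl_left);
                    else
                            u = rcu_dereference_bh(u->avl_right);
                    if (++count == RCU_MAXDEPTH_SKETCH)
                            return NULL; /* possible loop: take locked path */
            }
            return NULL;
    }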
     

14 Nov, 2009

1 commit

  • While investigating network latencies, I found that inet_getid() was a
    contention point for some workloads, as inet_peer_idlock is shared by
    all inet_getid() users regardless of peers.

    One way to fix this is to make ip_id_count an atomic_t instead of a
    __u16, and use atomic_add_return() (see the sketch after this entry).

    In order to keep sizeof(struct inet_peer) == 64 on 64-bit arches,
    tcp_ts_stamp is also converted to __u32 instead of "unsigned long".

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
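
    Before/after, sketched with illustrative names:

    static DEFINE_SPINLOCK(inet_peer_idlock_sketch); /* the old shared lock */

    struct inet_peer_sketch {
            __u16    ip_id_old;     /* before: plain counter under the lock */
            atomic_t ip_id_count;   /* after: per-peer atomic */
    };

    /* before: every peer funnels through one global spinlock */
    static __u16 inet_getid_locked(struct inet_peer_sketch *p, int more)
    {
            __u16 id;

            spin_lock_bh(&inet_peer_idlock_sketch);
            id = p->ip_id_old;
            p->ip_id_old += 1 + more;
            spin_unlock_bh(&inet_peer_idlock_sketch);
            return id;
    }

    /* after: lock-free, contention limited to the peer itself */
    static __u16 inet_getid_atomic(struct inet_peer_sketch *p, int more)
    {
            more++;
            return atomic_add_return(more, &p->ip_id_count) - more;
    }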
     

04 Nov, 2009

1 commit

  • This cleanup patch puts struct/union/enum opening braces on the first
    line, to ease grep games.

    struct something
    {

    becomes:

    struct something {

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jun, 2008

1 commit


13 Nov, 2007

1 commit


20 Oct, 2006

1 commit


16 Oct, 2006

1 commit


29 Sep, 2006

1 commit

  • This one is interesting - we use the net-endian value as the search
    key, but order the tree by *host-endian* comparisons of the keys.
    That's OK, since we only care about lookups. Annotated inet_getpeer()
    and friends (sketched after this entry).

    Signed-off-by: Al Viro
    Signed-off-by: David S. Miller

    Al Viro
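
    Sketched, the annotated comparison looks something like:

    /* the key is net-endian (__be32), but tree order only needs to be a
     * consistent total order, so comparing the raw bits as host integers
     * is fine; __force documents the intentional cast for sparse */
    static inline int key_cmp(__be32 a, __be32 b)
    {
            __u32 ha = (__force __u32)a;
            __u32 hb = (__force __u32)b;

            return ha < hb ? -1 : (ha > hb ? 1 : 0);
    }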
     

04 Jan, 2006

1 commit

  • Another spin of Herbert Xu's "safer ip reassembly" patch
    for 2.6.16.

    (The original patch is here:
    http://marc.theaimsgroup.com/?l=linux-netdev&m=112281936522415&w=2
    and my only contribution is to have tested it.)

    This patch (optionally) does additional checks before accepting IP
    fragments, which can greatly reduce the possibility of reassembling
    fragments that originated from different IP datagrams (one plausible
    form of such a check is sketched after this entry).

    Signed-off-by: Herbert Xu
    Signed-off-by: Arthur Kepner
    Signed-off-by: David S. Miller

    Herbert Xu
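
    The message does not spell out the checks; one plausible form, in the
    spirit of per-source IP id tracking (all names hypothetical):

    struct frag_peer_sketch {
            atomic_t rid;   /* last IP id seen from this source */
    };

    /* fragments from one source carry roughly sequential IP ids, so an id
     * "too far" from the last one seen likely belongs to a different
     * datagram and is better dropped than reassembled into the wrong one */
    static bool frag_too_far(struct frag_peer_sketch *peer,
                             __u16 id, __u16 max_dist)
    {
            __u16 prev = (__u16)atomic_xchg(&peer->rid, id);

            return max_dist && (__u16)(id - prev) > max_dist;
    }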
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds