07 Aug, 2011

1 commit

  • Computers have become a lot faster since we compromised on the
    partial MD4 hash we currently use for performance reasons.

    MD5 is a much safer choice, and is in line with both RFC 1948 and
    other ISS generators (OpenBSD, Solaris, etc.).

    Furthermore, only having 24 bits of the sequence number be truly
    unpredictable is a very serious limitation. So the periodic
    regeneration and 8-bit counter have been removed. We compute and
    use a full 32-bit sequence number.

    For IPv6, DCCP was found to use a 32-bit truncated initial sequence
    number (it needs 43 bits), and that is fixed here as well.

    Reported-by: Dan Kaminsky
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    David S. Miller
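A minimal sketch of the full 32-bit scheme described above, with a simple keyed mixer standing in for the kernel's MD5 computation (mix_hash() and its constants are illustrative, not the kernel's code):

```c
#include <stdint.h>

/* Illustrative keyed mixer: a stand-in for the MD5-over-secret hash the
 * kernel actually uses (per RFC 1948); only the shape of the scheme matters. */
static uint32_t mix_hash(uint32_t a, uint32_t b, uint32_t c, uint32_t secret)
{
    uint32_t h = secret;
    h ^= a; h *= 0x9e3779b1u;
    h ^= b; h *= 0x9e3779b1u;
    h ^= c; h *= 0x9e3779b1u;
    return h ^ (h >> 16);
}

/* Full 32-bit ISN: a keyed hash of the connection 4-tuple plus a clock
 * term, so every bit is unpredictable yet sequence numbers still advance. */
uint32_t secure_tcp_isn(uint32_t saddr, uint32_t daddr,
                        uint16_t sport, uint16_t dport,
                        uint32_t secret, uint32_t usec_clock)
{
    uint32_t ports = ((uint32_t)sport << 16) | dport;
    return mix_hash(saddr, daddr, ports, secret) + usec_clock;
}
```

The clock term is added outside the hash, so the same tuple yields ISNs that advance over time while remaining unpredictable to an off-path attacker.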
     

22 Jul, 2011

1 commit

  • IPv6 fragment identification generation is way behind what we use for
    IPv4: it uses a single generator. It's not scalable and allows DoS
    attacks.

    Now that inetpeer is IPv6 aware, we can use it to provide a more secure
    and scalable frag ident generator (per destination, instead of
    system-wide).

    This patch:
    1) defines a new secure_ipv6_id() helper
    2) extends inet_getid() to provide 32bit results
    3) extends ipv6_select_ident() with a new dest parameter

    Reported-by: Fernando Gont
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
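The per-destination idea can be sketched like this (struct and function names are hypothetical simplifications of the inetpeer-based generator, and the real helper performs the increment atomically):

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for an inetpeer entry: each destination
 * carries its own ident counter, seeded from a secure hash of the address,
 * instead of one system-wide generator that any peer can probe or exhaust. */
struct peer {
    uint32_t ip_id_count;   /* seeded by something like secure_ipv6_id(dst) */
};

/* Sketch of the 32-bit inet_getid(): return the current id and advance the
 * per-destination counter ('more' fragments will be consumed by the caller). */
uint32_t peer_getid(struct peer *p, uint32_t more)
{
    uint32_t id = p->ip_id_count;
    p->ip_id_count = id + more + 1;
    return id;
}
```

Because every destination draws from its own counter, one peer can no longer predict or exhaust the id space used for another.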
     

12 Jul, 2011

1 commit

  • We can currently free inetpeer entries too early:

    [ 782.636674] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f130f44c)
    [ 782.636677] 1f7b13c100000000000000000000000002000000000000000000000000000000
    [ 782.636686] i i i i u u u u i i i i u u u u i i i i u u u u u u u u u u u u
    [ 782.636694] ^
    [ 782.636696]
    [ 782.636698] Pid: 4638, comm: ssh Not tainted 3.0.0-rc5+ #270 Hewlett-Packard HP Compaq 6005 Pro SFF PC/3047h
    [ 782.636702] EIP: 0060:[] EFLAGS: 00010286 CPU: 0
    [ 782.636707] EIP is at inet_getpeer+0x25b/0x5a0
    [ 782.636709] EAX: 00000002 EBX: 00010080 ECX: f130f3c0 EDX: f0209d30
    [ 782.636711] ESI: 0000bc87 EDI: 0000ea60 EBP: f0209ddc ESP: c173134c
    [ 782.636712] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    [ 782.636714] CR0: 8005003b CR2: f0beca80 CR3: 30246000 CR4: 000006d0
    [ 782.636716] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    [ 782.636717] DR6: ffff4ff0 DR7: 00000400
    [ 782.636718] [] rt_set_nexthop.clone.45+0x56/0x220
    [ 782.636722] [] __ip_route_output_key+0x309/0x860
    [ 782.636724] [] tcp_v4_connect+0x124/0x450
    [ 782.636728] [] inet_stream_connect+0xa3/0x270
    [ 782.636731] [] sys_connect+0xa1/0xb0
    [ 782.636733] [] sys_socketcall+0x25d/0x2a0
    [ 782.636736] [] sysenter_do_call+0x12/0x28
    [ 782.636738] [] 0xffffffff

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jun, 2011

1 commit

  • Andi Kleen and Tim Chen reported huge contention on inetpeer
    unused_peers.lock with a memcached workload on a 40-core machine, with
    the route cache disabled.

    It appears we constantly flip peer refcnts between the values 0 and 1,
    and we must insert/remove peers from unused_peers.list while holding a
    contended spinlock.

    Remove this list completely and perform a garbage collection on-the-fly,
    at lookup time, using the expired nodes we met during the tree
    traversal.

    This removes a lot of code, makes locking more standard, and obsoletes
    two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
    removes two pointers in inet_peer structure.

    There is still a false-sharing effect because refcnt is in the first
    cache line of the object [where the links and keys used by lookups are
    located]; we might move it to the end of the inet_peer structure so that
    this first cache line stays mostly read-only for CPUs.

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Tim Chen
    Signed-off-by: David S. Miller

    Eric Dumazet
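The lookup-time garbage collection can be sketched as follows (the flat 'path' array stands in for the nodes visited during the real AVL traversal; field names and the ttl policy are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

struct peer {
    uint32_t key;
    int refcnt;    /* 0 means currently unused */
    long stamp;    /* last-use timestamp */
    int freed;     /* stands in for unlink + kfree in this sketch */
};

/* Walk the nodes on the lookup path; collect any expired entry we happen
 * to pass, and reap them after the walk -- no global unused list, no
 * contended unused_peers.lock. */
struct peer *lookup_and_gc(struct peer *path[], size_t n,
                           uint32_t key, long now, long ttl)
{
    struct peer *found = NULL;
    struct peer *expired[16];
    size_t nexp = 0, i;

    for (i = 0; i < n; i++) {
        struct peer *p = path[i];
        if (p->key == key)
            found = p;
        else if (p->refcnt == 0 && now - p->stamp > ttl && nexp < 16)
            expired[nexp++] = p;      /* met during traversal */
    }
    for (i = 0; i < nexp; i++)
        expired[i]->freed = 1;        /* reap on the fly */
    return found;
}
```

Reclaiming only nodes met on the traversal amortizes the gc cost across lookups instead of paying for list maintenance on every refcount change.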
     

28 May, 2011

1 commit

  • Several crashes in cleanup_once() were reported in recent kernels.

    Commit d6cc1d642de9 (inetpeer: various changes) added a race in
    unlink_from_unused().

    One way to avoid taking unused_peers.lock before doing the list_empty()
    test is to catch 0->1 refcnt transitions, using full-barrier atomic
    operation variants (atomic_cmpxchg() and atomic_inc_return()) instead
    of the previous atomic_inc() and atomic_add_unless() variants.

    We then call unlink_from_unused() only for the owner of the 0->1
    transition.

    Add a new atomic_add_unless_return() static helper.

    With help from Arun Sharma.

    Refs: https://bugzilla.kernel.org/show_bug.cgi?id=32772

    Reported-by: Arun Sharma
    Reported-by: Maximilian Engelhardt
    Reported-by: Yann Dupont
    Reported-by: Denys Fedoryshchenko
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
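A user-space sketch of the helper and the 0->1 ownership test, using C11 atomics in place of the kernel's atomic_t API (the names mirror the patch, the bodies are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Add 'a' to *v unless it equals 'u'; report the value observed so the
 * caller can tell exactly which thread performed the 0->1 transition.
 * The cmpxchg loop gives the full-barrier semantics the fix relies on. */
static bool atomic_add_unless_return(atomic_int *v, int a, int u, int *old)
{
    int c = atomic_load(v);
    for (;;) {
        *old = c;
        if (c == u)
            return false;              /* left untouched */
        if (atomic_compare_exchange_weak(v, &c, c + a))
            return true;               /* we moved c -> c+a */
    }
}

/* Only the caller that saw refcnt == 0 owns the unlink_from_unused() work
 * (here -1 is the 'deleted' marker that must never be incremented). */
bool take_ref_is_owner(atomic_int *refcnt)
{
    int old = 0;
    atomic_add_unless_return(refcnt, 1, -1, &old);
    return old == 0;
}
```

Because the cmpxchg reports the exact value it replaced, exactly one concurrent caller can observe the 0->1 transition, removing the race in unlink_from_unused().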
     

13 Apr, 2011

1 commit

  • On 64-bit arches, we use 752 bytes of stack when cleanup_once() is
    called from inet_getpeer().

    Let's share the AVL stack to save ~376 bytes.

    Before patch:

    # objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl

    0x000006c3 unlink_from_pool [inetpeer.o]: 376
    0x00000721 unlink_from_pool [inetpeer.o]: 376
    0x00000cb1 inet_getpeer [inetpeer.o]: 376
    0x00000e6d inet_getpeer [inetpeer.o]: 376
    0x0004 inet_initpeers [inetpeer.o]: 112
    # size net/ipv4/inetpeer.o
    text data bss dec hex filename
    5320 432 21 5773 168d net/ipv4/inetpeer.o

    After patch:

    # objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl
    0x00000c11 inet_getpeer [inetpeer.o]: 376
    0x00000dcd inet_getpeer [inetpeer.o]: 376
    0x00000ab9 peer_check_expire [inetpeer.o]: 328
    0x00000b7f peer_check_expire [inetpeer.o]: 328
    0x0004 inet_initpeers [inetpeer.o]: 112
    # size net/ipv4/inetpeer.o
    text data bss dec hex filename
    5163 432 21 5616 15f0 net/ipv4/inetpeer.o

    Signed-off-by: Eric Dumazet
    Cc: Scot Doyle
    Cc: Stephen Hemminger
    Cc: Hiroaki SHIMODA
    Reviewed-by: Hiroaki SHIMODA
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Mar, 2011

2 commits

  • After commit 7b46ac4e77f3224a (inetpeer: Don't disable BH for initial
    fast RCU lookup.), we should use call_rcu() to wait for a proper RCU
    grace period.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • On current net-next-2.6, when Linux receives ICMP Type: 3, Code: 4
    (Destination unreachable (Fragmentation needed)),

    icmp_unreach
       -> ip_rt_frag_needed
          (peer->pmtu_expires is set here)
       -> tcp_v4_err
          -> do_pmtu_discovery
             -> ip_rt_update_pmtu
                (peer->pmtu_expires is already set,
                 so check_peer_pmtu is skipped.)
                -> check_peer_pmtu

    check_peer_pmtu is skipped and the MTU is not updated.

    To fix this, let check_peer_pmtu execute unconditionally, along with
    some minor fixes:
    1) Avoid peer->pmtu_expires potentially being set to zero.
    2) In check_peer_pmtu, the arguments of time_before are reversed.
    3) check_peer_pmtu expects peer->pmtu_orig to be initialized to zero,
       but it is not.

    Signed-off-by: Hiroaki SHIMODA
    Signed-off-by: David S. Miller

    Hiroaki SHIMODA
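The reversed time_before() in fix 2 is easy to get wrong because the jiffies helpers compare wrapped counters via signed subtraction; a sketch of the correct expiry test (the macros follow the kernel definitions, pmtu_expired() is illustrative):

```c
#include <stdbool.h>

/* Wrap-safe time comparison, as defined in <linux/jiffies.h>: the signed
 * difference keeps working when the counter wraps around. */
#define time_after(a, b)   ((long)((b) - (a)) < 0)
#define time_before(a, b)  time_after(b, a)

/* Correct argument order: the deadline has passed when 'expires' is
 * before 'now'.  Swapping the two inverts every expiry decision. */
bool pmtu_expired(unsigned long expires, unsigned long now)
{
    return time_before(expires, now);
}
```

With the arguments reversed, an entry would look expired for its whole validity window and valid afterwards, which is exactly the class of bug the patch corrects.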
     

05 Mar, 2011

1 commit

  • David noticed :

    ------------------
    Eric, I was profiling the non-routing-cache case and something that
    stuck out is the case of calling inet_getpeer() with create==0.

    If an entry is not found, we have to redo the lookup under a spinlock
    to make certain that a concurrent writer rebalancing the tree does
    not "hide" an existing entry from us.

    This makes the case of a create==0 lookup for a not-present entry
    really expensive. It is on the order of 600 cpu cycles on my
    Niagara2.

    I added a hack to not do the relookup under the lock when create==0
    and it now costs less than 300 cycles.

    This is now a pretty common operation with the way we handle COW'd
    metrics, so I think it's definitely worth optimizing.
    -----------------

    One solution is to use a seqlock instead of a spinlock to protect struct
    inet_peer_base.

    After a failed AVL tree lookup, we can easily detect whether a writer
    made changes during our lookup. Taking the lock and redoing the lookup
    is only necessary in that case.

    Note: add one private rcu_deref_locked() macro to centralize access to
    the spinlock embedded in the seqlock.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
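The retry logic can be sketched with a bare sequence counter (memory barriers and the embedded spinlock are omitted; this only shows why a failed lockless lookup rarely needs the locked re-lookup):

```c
/* Minimal seqcount, standing in for the seqlock protecting inet_peer_base. */
struct seqcount { unsigned seq; };

/* Writers bump the counter before and after rebalancing the tree... */
static void write_seqbegin(struct seqcount *s) { s->seq++; }
static void write_seqend(struct seqcount *s)   { s->seq++; }

/* ...readers sample it, do the lockless AVL walk, then check: retry (i.e.
 * take the lock and redo the lookup) only if a writer was active or ran
 * in between.  A clean miss can be trusted without ever taking the lock. */
static unsigned read_seqbegin(const struct seqcount *s) { return s->seq; }
static int read_seqretry(const struct seqcount *s, unsigned start)
{
    return (start & 1) || s->seq != start;
}
```

In the common case of a create==0 miss with no concurrent writer, read_seqretry() returns 0 and the expensive locked re-lookup is skipped entirely.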
     

11 Feb, 2011

2 commits

  • Validity of the cached PMTU information is indicated by its
    expiration value being non-zero, just as with dst->expires.

    The scheme we will use is that we will remember the pre-ICMP value
    held in the metrics or route entry, and then at expiration time
    we will restore that value.

    In this way PMTU expiration does not kill off the cached route as is
    done currently.

    Redirect information is permanent, or at least until another redirect
    is received.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Future changes will add caching information, and some of
    these new elements will be addresses.

    Since the family is implicit via the ->daddr.family member,
    replicating the family in every address we store is entirely
    redundant.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Oct, 2010

1 commit

  • Adds __rcu annotations to inetpeer
    (struct inet_peer)->avl_left
    (struct inet_peer)->avl_right

    This is a tedious cleanup, but it removes one smp_wmb() from
    link_to_pool(), since we now use the more self-documenting
    rcu_assign_pointer().

    Note the use of RCU_INIT_POINTER() instead of rcu_assign_pointer() in
    all cases where we don't need a memory barrier.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jun, 2010

1 commit

  • Addition of an rcu_head to struct inet_peer added 16 bytes on 64-bit
    arches.

    That's a bit unfortunate, since the old size was exactly 64 bytes.

    This can be solved using a union between this rcu_head and four fields
    that are normally used only when a refcount is taken on the inet_peer;
    rcu_head is used only when refcnt = -1, right before structure freeing.

    Add an inet_peer_refcheck() function to check this assertion for a
    while.

    We can bring back the SLAB_HWCACHE_ALIGN qualifier in kmem cache
    creation.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
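The union trick reads roughly like this (a simplified sketch; the field types and the rcu_head layout are illustrative, not the kernel's exact definitions):

```c
#include <stdint.h>

/* Stand-in for the kernel's struct rcu_head (two pointers). */
struct rcu_head_sketch { void *next; void (*func)(void *); };

/* The four fields below are only meaningful while a reference is held
 * (refcnt >= 0); rcu_head is only needed after refcnt drops to the -1
 * 'dead' marker, right before freeing.  The two lifetimes never overlap,
 * so overlaying them in a union recovers the 16 bytes. */
struct inet_peer_sketch {
    int refcnt;                     /* -1 once awaiting RCU free */
    union {
        struct {
            uint16_t rid;           /* frag reassembly id */
            uint16_t ip_id_count;
            uint32_t tcp_ts;
            uint32_t tcp_ts_stamp;
        };
        struct rcu_head_sketch rcu;
    };
};
```

An inet_peer_refcheck()-style assertion then just verifies refcnt >= 0 before any of the overlaid fields are touched.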
     

16 Jun, 2010

2 commits

  • Followup of commit aa1039e73cc2 (inetpeer: RCU conversion)

    Unused inet_peer entries have a null refcnt.

    Using atomic_inc_not_zero() in RCU lookups is not going to work for
    them, so the slow path is taken.

    Fix this by using a -1 marker instead of 0 for deleted entries.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • inetpeer currently uses an AVL tree protected by an rwlock.

    It's possible to make most lookups use RCU:

    1) Add a struct rcu_head to struct inet_peer

    2) add a lookup_rcu_bh() helper to perform lockless and opportunistic
    lookup. This is a normal function, not a macro like lookup().

    3) Add a limit to the number of links followed by lookup_rcu_bh(). This
    is needed in case we fall into a loop.

    4) add an smp_wmb() in link_to_pool() right before node insert.

    5) Make unlink_from_pool() use atomic_cmpxchg() to make sure it can take
    the last reference to an inet_peer, since lockless readers could
    increase the refcount even while we hold peers.lock.

    6) Delay struct inet_peer freeing until after an RCU grace period so
    that lookup_rcu_bh() cannot crash.

    7) inet_getpeer() first attempts a lockless lookup.
    Note this lookup can fail even if the target is in the AVL tree, because
    a concurrent writer can leave the tree in a transiently inconsistent
    form. If this attempt fails, the lock is taken and a regular lookup is
    performed again.

    8) Convert peers.lock from an rwlock to a spinlock.

    9) Remove SLAB_HWCACHE_ALIGN when peer_cachep is created, because
    rcu_head adds 16 bytes on 64-bit arches, doubling the effective size
    (64 -> 128 bytes).
    It should be possible to revert this part in a future patch, if the rcu
    field is put in a union to share space with rid, ip_id_count, tcp_ts &
    tcp_ts_stamp, since these fields are manipulated only with refcnt > 0.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
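Step 3's bounded walk can be sketched like this (plain pointers stand in for rcu_dereference(); the node layout and limit are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_LINKS 32   /* give up if a concurrently rebalanced tree loops us */

struct node {
    uint32_t key;
    struct node *left, *right;
};

/* Opportunistic lockless lookup: follow at most MAX_LINKS pointers.  A
 * miss is not authoritative -- a writer may have the tree in a transient
 * shape -- so on NULL the caller retries under the lock (step 7). */
struct node *lookup_rcu_sketch(struct node *root, uint32_t key)
{
    struct node *n = root;
    int links = 0;

    while (n && links++ < MAX_LINKS) {
        if (n->key == key)
            return n;
        n = (key < n->key) ? n->left : n->right;
    }
    return NULL;
}
```

The link budget is what makes a concurrent rebalance harmless: the worst a transiently cyclic path can cost a reader is MAX_LINKS pointer chases before falling back to the locked path.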
     

15 Jun, 2010

1 commit

  • Try to reduce cache-line contention in peer management, to reduce IP
    defragmentation overhead.

    - peer_fake_node is marked 'const' to make sure it's not modified.
    (tested with CONFIG_DEBUG_RODATA=y)

    - Group variables in two structures to reduce the number of dirtied
    cache lines. One, named "peers", holds the AVL tree root, its number of
    entries, and the associated lock. (candidate for RCU conversion)

    - A second one, named "unused_peers", holds the unused list and its lock.

    - Add a !list_empty() test in unlink_from_unused() to avoid taking lock
    when entry is not unused.

    - Use atomic_dec_and_lock() in inet_putpeer() to avoid taking lock in
    some cases.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Nov, 2009

1 commit

  • While investigating network latencies, I found inet_getid() was a
    contention point for some workloads, as inet_peer_idlock is shared
    by all inet_getid() users regardless of peers.

    One way to fix this is to make ip_id_count an atomic_t instead
    of __u16, and use atomic_add_return().

    In order to keep sizeof(struct inet_peer) == 64 on 64-bit arches,
    tcp_ts_stamp is also converted to __u32 instead of "unsigned long".

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jul, 2007

1 commit

  • CC net/ipv4/inetpeer.o
    net/ipv4/inetpeer.c: In function 'unlink_from_pool':
    net/ipv4/inetpeer.c:297: warning: the address of 'stack' will always evaluate as 'true'
    net/ipv4/inetpeer.c:297: warning: the address of 'stack' will always evaluate as 'true'
    net/ipv4/inetpeer.c: In function 'inet_getpeer':
    net/ipv4/inetpeer.c:409: warning: the address of 'stack' will always evaluate as 'true'
    net/ipv4/inetpeer.c:409: warning: the address of 'stack' will always evaluate as 'true'

    "Fix" by checking for != NULL.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

26 Apr, 2007

1 commit

  • 1) Some sysctl vars are declared __read_mostly

    2) We can avoid updating stack[] when doing an AVL lookup only.

    The lookup() macro is extended to receive a second parameter, which may
    be NULL in the case of a pure lookup (no need to save the AVL path).
    This removes unnecessary instructions, because the compiler knows
    whether this _stack parameter is NULL.

    The text size of net/ipv4/inetpeer.o is 2063 bytes instead of 2107 on
    x86_64.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Feb, 2007

1 commit

  • After Al Viro (finally) succeeded in removing the sched.h #include from
    module.h recently, it makes sense again to remove other superfluous
    sched.h includes.
    There are quite a lot of files which include it but don't actually need
    anything defined in there. Presumably these includes were once needed for
    macros that used to live in sched.h, but moved to other header files in the
    course of cleaning it up.

    To ease the pain, this time I did not fiddle with any header files and only
    removed #includes from .c-files, which tend to cause less trouble.

    Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
    arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
    allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
    configs in arch/arm/configs on arm. I also checked that no new warnings were
    introduced by the patch (actually, some warnings are removed that were emitted
    by unnecessarily included header files).

    Signed-off-by: Tim Schmielau
    Acked-by: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Schmielau
     

08 Dec, 2006

1 commit

  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h" | xargs grep -l $1`; do
        quilt add $file
        sed -e "1,\$s/$1/$2/g" $file > /tmp/$$
        mv /tmp/$$ $file
        quilt refresh
    done

    The script was run like this:

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

29 Sep, 2006

1 commit

  • This one is interesting - we use the net-endian value as the search
    key, but order the tree by *host-endian* comparisons of keys. OK since
    we only care about lookups. Annotated inet_getpeer() and friends.

    Signed-off-by: Al Viro
    Signed-off-by: David S. Miller

    Al Viro
     
