15 Jul, 2007

1 commit


09 Feb, 2007

1 commit

  • The ehash table layout is currently as follows:

    First half of this table is used by sockets not in TIME_WAIT state
    Second half of it is used by sockets in TIME_WAIT state.

    This is suboptimal because, for a given hash or socket, the two chain heads
    are located in separate cache lines.
    Moreover the locks of the second half are never used.

    If, instead of this halving, we use two list heads in inet_ehash_bucket
    rather than only one, we can probably avoid one cache miss and reduce RAM
    usage, particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK,
    CONFIG_DEBUG_LOCK_ALLOC settings). So we still halve the table, but we keep
    related chains together to speed up lookups and socket state changes, as
    sketched below.
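
    A minimal sketch of the layout change (the twchain field name follows the
    mainline patch; the before/after structs are illustrative, not the literal
    diff):

    /* Before: one chain per bucket; TIME_WAIT sockets hash into the second
     * half of the table, so the two related chain heads sit in separate
     * cache lines and the second half's locks go unused.
     */
    struct inet_ehash_bucket_old {
            rwlock_t          lock;
            struct hlist_head chain;
    };

    /* After: both chains share one bucket (and one lock), so a lookup that
     * must scan established and TIME_WAIT sockets touches a single
     * chain-head cache line.
     */
    struct inet_ehash_bucket {
            rwlock_t          lock;
            struct hlist_head chain;    /* sockets not in TIME_WAIT */
            struct hlist_head twchain;  /* TIME_WAIT sockets */
    };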

    In this patch I did not try to align struct inet_ehash_bucket, but a future
    patch could try to give this structure a convenient size (a power of two
    or a multiple of L1_CACHE_SIZE). I guess the rwlock will just vanish as soon
    as RCU is plugged into ehash :), so maybe we don't need to scratch our heads
    over aligning the bucket...

    Note: In case struct inet_ehash_bucket is not a power of two, we could
    probably change alloc_large_system_hash() (in case it uses __get_free_pages())
    to free the unused space. It currently allocates a big zone, but the last
    quarter of it could be freed. Again, this should be a temporary 'problem'.

    Patch tested on IPv4 TCP only, but should be OK for IPv6 and DCCP.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Dec, 2006

2 commits

  • * master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (48 commits)
    [NETFILTER]: Fix non-ANSI func. decl.
    [TG3]: Identify Serdes devices more clearly.
    [TG3]: Use msleep.
    [TG3]: Use netif_msg_*.
    [TG3]: Allow partial speed advertisement.
    [TG3]: Add TG3_FLG2_IS_NIC flag.
    [TG3]: Add 5787F device ID.
    [TG3]: Fix Phy loopback.
    [WANROUTER]: Kill kmalloc debugging code.
    [TCP] inet_twdr_hangman: Delete unnecessary memory barrier().
    [NET]: Memory barrier cleanups
    [IPSEC]: Fix inetpeer leak in ipv4 xfrm dst entries.
    audit: disable ipsec auditing when CONFIG_AUDITSYSCALL=n
    audit: Add auditing to ipsec
    [IRDA] irlan: Fix compile warning when CONFIG_PROC_FS=n
    [IrDA]: Incorrect TTP header reservation
    [IrDA]: PXA FIR code device model conversion
    [GENETLINK]: Fix misplaced command flags.
    [NETLINK]: Add a pointer to the Generic Netlink wiki page.
    [IPV6] RAW: Don't release unlocked sock.
    ...

    Linus Torvalds
     
  • SLAB_ATOMIC is an alias of GFP_ATOMIC
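
    Illustratively, the conversion amounts to spelling the flag by its real
    name (a sketch of the kind of call site converted in this file, not the
    literal diff):

    struct inet_timewait_sock *tw =
            kmem_cache_alloc(sk->sk_prot_creator->twsk_prot->twsk_slab,
                             GFP_ATOMIC);  /* was: SLAB_ATOMIC */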

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

07 Dec, 2006

2 commits

  • As per Ralf Baechle's observations, the schedule_work() call
    should give enough of a memory barrier, so the explicit one
    here is totally unnecessary.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • I believe all the memory barriers below only matter on SMP, so the
    smp_* variant of the barrier should be used.

    I'm wondering if the barrier in net/ipv4/inet_timewait_sock.c should be
    dropped entirely. schedule_work()'s implementation currently implies a
    memory barrier, and I think sane semantics of schedule_work() should imply
    a memory barrier as needed, so the caller shouldn't have to worry.
    It's not quite obvious why the barrier in net/packet/af_packet.c is
    needed; maybe it should be implied through flush_dcache_page()?
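
    A minimal illustration of the substitution (hypothetical call site):

    /* The ordering constraint here is only against other CPUs, so use the
     * SMP variant: on uniprocessor kernels smp_mb() reduces to a compiler
     * barrier, while mb() always emits the hardware barrier instruction.
     */
    smp_mb();    /* was: mb() */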

    Signed-off-by: Ralf Baechle
    Signed-off-by: David S. Miller

    Ralf Baechle
     

22 Nov, 2006

1 commit

  • Pass the work_struct pointer to the work function rather than context data.
    The work function can use container_of() to work out the data, as sketched
    below.
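
    A minimal sketch of the new convention (the my_dev/my_work_fn names are
    hypothetical):

    struct my_dev {
            int                     state;
            struct work_struct      work;
    };

    static void my_work_fn(struct work_struct *work)
    {
            /* Recover the containing object from the work_struct pointer. */
            struct my_dev *dev = container_of(work, struct my_dev, work);

            /* ... operate on dev->state ... */
    }

    static void my_dev_setup(struct my_dev *dev)
    {
            /* No separate context pointer is passed any more; the handler
             * receives the work_struct itself.
             */
            INIT_WORK(&dev->work, my_work_fn);
    }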

    For the cases where the container of the work_struct may go away the moment the
    pending bit is cleared, it is made possible to defer the release of the
    structure by deferring the clearing of the pending bit.

    To make this work, an extra flag is introduced into the management side of the
    work_struct. This governs auto-release of the structure upon execution.

    Ordinarily, the work queue executor would release the work_struct for further
    scheduling or deallocation by clearing the pending bit prior to jumping to the
    work function. This means that, unless the driver itself guarantees that
    the work_struct won't go away, the work function may not access anything
    else in the work_struct or its container, lest they be deallocated. This is
    a problem if the auxiliary data is taken away (as done by the last patch).

    However, if the pending bit is *not* cleared before jumping to the work
    function, then the work function *may* access the work_struct and its container
    with no problems. But then the work function must itself release the
    work_struct by calling work_release().

    In most cases, automatic release is fine, so this is the default. Special
    initiators exist for the non-auto-release case (ending in _NAR).
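
    A sketch of that non-auto-release pattern (work_release() and the _NAR
    initializer are the ones named in this message; the my_obj names are
    hypothetical):

    struct my_obj {
            int                     refs;
            struct work_struct      work;
    };

    static void my_obj_work(struct work_struct *work)
    {
            struct my_obj *obj = container_of(work, struct my_obj, work);

            /* The pending bit is still set here, so 'obj' cannot have been
             * freed or rescheduled; use it, then release it explicitly.
             */
            obj->refs--;
            work_release(work);
    }

    /* Initialized with the non-auto-release variant:
     * INIT_WORK_NAR(&obj->work, my_obj_work);
     */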

    Signed-Off-By: David Howells

    David Howells
     

01 Jul, 2006

1 commit


04 Jan, 2006

1 commit

  • This lets us share several timewait-socket-related functions and makes
    the timewait mini-socket infrastructure closer to the request mini-socket
    one.

    Next changesets will take advantage of this, moving more code out of
    TCP and DCCP v4 and v6 to common infrastructure.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

11 Oct, 2005

1 commit


04 Oct, 2005

1 commit

  • Arnaldo and I agreed it could be applied now, because I have other
    pending patches depending on this one (thank you, Arnaldo).

    (The other important patch moves skc_refcnt into a separate cache line,
    so that SMP/NUMA performance doesn't suffer from cache line ping-pongs.)

    1) First some performance data :
    --------------------------------

    tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()

    The most time-critical code is:

    sk_for_each(sk, node, &head->chain) {
            if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
                    goto hit; /* You sunk my battleship! */
    }

    The sk_for_each() does use prefetch() hints, but only the beginning of
    "struct sock" is prefetched.

    As INET_MATCH's first comparison uses inet_sk(__sk)->daddr, which is far
    away from the beginning of "struct sock", it has to bring a cold cache
    line into the CPU cache. Each iteration has to use at least 2 cache
    lines.

    This can be problematic if some chains are very long.

    2) The goal
    -----------

    The idea I had is to change things so that INET_MATCH() may return
    FALSE in 99% of cases using only data already in the CPU cache, i.e.
    one cache line per iteration.

    3) Description of the patch
    ---------------------------

    Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
    filling a 32-bit hole on 64-bit platforms.

    struct sock_common {
            unsigned short          skc_family;
            volatile unsigned char  skc_state;
            unsigned char           skc_reuse;
            int                     skc_bound_dev_if;
            struct hlist_node       skc_node;
            struct hlist_node       skc_bind_node;
            atomic_t                skc_refcnt;
    +       unsigned int            skc_hash;
            struct proto            *skc_prot;
    };

    Store in this 32-bit field the full hash, not masked by (ehash_size - 1).
    Using this full hash as the first comparison done in INET_MATCH permits us
    to immediately skip the element, without touching a second cache line, in
    case of a miss.

    Remove the sk_hashent/tw_hashent fields, since skc_hash (aliased to
    sk_hash and tw_hash) already contains the slot number if we mask with
    (ehash_size - 1), as sketched below.
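
    For instance (a sketch; 'hashinfo' stands for the protocol's struct
    inet_hashinfo):

    /* The full hash doubles as the chain index once masked. */
    unsigned int slot = sk->sk_hash & (ehash_size - 1);
    struct inet_ehash_bucket *head = &hashinfo->ehash[slot];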

    File include/net/inet_hashtables.h

    64 bits platforms :
    #define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
            (((__sk)->sk_hash == (__hash)) &&                                 \
             ((*((__u64 *)&(inet_sk(__sk)->daddr))) == (__cookie)) &&         \
             ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) &&          \
             (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))

    32bits platforms:
    #define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
            (((__sk)->sk_hash == (__hash)) &&                                 \
             (inet_sk(__sk)->daddr == (__saddr)) &&                           \
             (inet_sk(__sk)->rcv_saddr == (__daddr)) &&                       \
             (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))

    Adds a prefetch(head->chain.first) in
    __inet_lookup_established()/__tcp_v4_check_established(),
    __inet6_lookup_established()/__tcp_v6_check_established() and
    __dccp_v4_check_established(), to bring the first element of the list into
    cache before the {read|write}_lock(&head->lock), as sketched below.
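
    Illustratively, each of those lookup sites gains a prefetch of this shape
    (a sketch, not the literal diff):

    prefetch(head->chain.first);    /* start pulling in the first element... */
    read_lock(&head->lock);         /* ...while the bucket lock is taken */
    sk_for_each(sk, node, &head->chain) {
            /* the sk_hash test hits the already-prefetched line first */
            if (INET_MATCH(sk, hash, acookie, saddr, daddr, ports, dif))
                    goto hit;
    }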

    Signed-off-by: Eric Dumazet
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Aug, 2005

5 commits