07 Aug, 2011

1 commit

  • Computers have become a lot faster since we compromised on the
    partial MD4 hash which we use currently for performance reasons.

    MD5 is a much safer choice, and is in line with both RFC 1948 and
    other ISS generators (OpenBSD, Solaris, etc.)

    Furthermore, having only 24 bits of the sequence number be truly
    unpredictable is a very serious limitation. So the periodic
    regeneration and 8-bit counter have been removed. We compute and
    use a full 32-bit sequence number.

    For IPv6, DCCP was found to use a 32-bit truncated initial sequence
    number (it needs 43 bits) and that is fixed here as well.
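
    A minimal user-space sketch of the scheme described above: MD5 over the
    connection 4-tuple plus a secret, keeping the full 32-bit result. The
    function name, the fixed demo secret, and the omission of the clock
    component the real generator adds are illustrative assumptions, not the
    kernel code.

```python
import hashlib
import struct

# Illustrative fixed secret; the real generator uses per-boot random bytes.
NET_SECRET = bytes(range(16))

def secure_tcp_isn(saddr: bytes, daddr: bytes, sport: int, dport: int) -> int:
    """MD5 the connection 4-tuple plus the secret and keep the full 32-bit
    result: no 24-bit truncation, no separate 8-bit counter."""
    msg = saddr + daddr + struct.pack(">HH", sport, dport) + NET_SECRET
    digest = hashlib.md5(msg).digest()
    return struct.unpack(">I", digest[:4])[0]
```

    The real generator also folds in a fine-grained clock so sequence numbers
    keep advancing over time; that part is omitted here.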

    Reported-by: Dan Kaminsky
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    David S. Miller
     

29 Nov, 2010

1 commit

  • inet sockets corresponding to passive connections are added to the bind hash
    using __inet_inherit_port(). These sockets are later removed from the bind
    hash using __inet_put_port(). These two functions are not exactly symmetrical:
    __inet_put_port() decrements hashinfo->bsockets and tb->num_owners, whereas
    __inet_inherit_port() does not increment them. This results in both of these
    counters going negative.

    This patch fixes this by calling inet_bind_hash() from __inet_inherit_port(),
    which does the right thing.

    'bsockets' and 'num_owners' were introduced by commit a9d8f9110d7e953c
    (inet: Allowing more than 64k connections and heavily optimize bind(0))
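
    The asymmetry is easy to model in miniature. This toy sketch (counter and
    function names loosely follow the commit; everything else is invented for
    illustration) shows why routing the inherit path through the same
    bind_hash() helper keeps both counters balanced:

```python
class Bucket:
    """Toy inet_bind_bucket: owners list plus the num_owners counter."""
    def __init__(self, port):
        self.port = port
        self.owners = []
        self.num_owners = 0

class BindHash:
    """Toy hashinfo with the global bsockets counter."""
    def __init__(self):
        self.bsockets = 0
        self.buckets = {}

    def bind_hash(self, sk, tb):
        # The one place where a socket joins a bucket: both counters move.
        tb.owners.append(sk)
        tb.num_owners += 1
        self.bsockets += 1

    def inherit_port(self, sk, port):
        # Fixed version: go through bind_hash() instead of touching
        # tb.owners directly (which left both counters unincremented).
        tb = self.buckets.setdefault(port, Bucket(port))
        self.bind_hash(sk, tb)

    def put_port(self, sk, port):
        tb = self.buckets[port]
        tb.owners.remove(sk)
        tb.num_owners -= 1
        self.bsockets -= 1
```

    With the buggy version, inherit_port() would append to tb.owners directly,
    and the later put_port() would drive both counters below zero.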

    Signed-off-by: Nagendra Singh Tomar
    Acked-by: Eric Dumazet
    Acked-by: Evgeniy Polyakov
    Signed-off-by: David S. Miller

    Nagendra Tomar
     

21 Oct, 2010

1 commit

  • When __inet_inherit_port() is called on a tproxy connection, the wrong locks are
    held for the inet_bind_bucket it is added to. __inet_inherit_port() made an
    implicit assumption that the listener's port number (and thus its bind bucket)
    is the same as that of the child socket. Unfortunately, if you're using the
    TPROXY target to redirect skbs to a transparent proxy, that assumption no
    longer holds and things break.

    This patch adds code to __inet_inherit_port() so that it can handle this case
    by looking up or creating a new bind bucket for the child socket and updates
    callers of __inet_inherit_port() to gracefully handle __inet_inherit_port()
    failing.

    Reported by and original patch from Stephen Buck.
    See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion.

    Signed-off-by: KOVACS Krisztian
    Signed-off-by: Patrick McHardy

    Balazs Scheidler
     

13 Jul, 2010

1 commit


16 May, 2010

1 commit

  • (Dropped the infiniband part, because Tetsuo modified the related code;
    I will send a separate patch for it once this is accepted.)

    This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
    allows users to reserve ports for third-party applications.

    The reserved ports will not be used by automatic port assignments
    (e.g. when calling connect() or bind() with port number 0). Explicit
    port allocation behavior is unchanged.
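
    A toy sketch of the behaviour described above, under the assumption of a
    simple randomized port walk (the names mirror the sysctls; the logic is
    invented for illustration):

```python
import random

ip_local_reserved_ports = {32800, 32801}   # analogue of the new sysctl
ip_local_port_range = range(32768, 32868)  # small range for the demo

def pick_local_port(in_use):
    """Automatic assignment (connect()/bind() with port 0) skips reserved ports."""
    ports = list(ip_local_port_range)
    random.shuffle(ports)
    for port in ports:
        if port not in ip_local_reserved_ports and port not in in_use:
            in_use.add(port)
            return port
    raise OSError("EADDRNOTAVAIL")

def bind_explicit(port, in_use):
    """Explicit binds are unchanged: a reserved port can still be requested."""
    if port in in_use:
        raise OSError("EADDRINUSE")
    in_use.add(port)
    return port
```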

    Signed-off-by: Octavian Purdila
    Signed-off-by: WANG Cong
    Cc: Neil Horman
    Cc: Eric Dumazet
    Cc: Eric W. Biederman
    Signed-off-by: David S. Miller

    Amerigo Wang
     

09 Dec, 2009

2 commits

  • When we find a timewait connection in __inet_hash_connect() and reuse
    it for a new connection request, we have a race window: we release the bind
    list lock and reacquire it in __inet_twsk_kill() to remove the timewait
    socket from the list.

    Another thread might find the timewait socket we already chose, leading to
    list corruption and crashes.

    The fix is to remove the timewait socket from the bind list before releasing
    the bind lock.

    Note: this problem happens only if sysctl_tcp_tw_reuse is set.

    Reported-by: kapil dakhane
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The first patch changes __inet_hash_nolisten() and __inet6_hash()
    to take a timewait parameter, so the timewait socket can be unhashed
    from the ehash at the same time the new socket is inserted into the hash.

    This makes sure the timewait socket won't be found by a concurrent
    writer in __inet_check_established().

    Reported-by: kapil dakhane
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Dec, 2009

1 commit

  • It's currently possible that several threads issuing a connect() find
    the same timewait socket and try to reuse it, leading to list
    corruptions.

    The condition for the bug is that these threads bound their sockets to
    the same address/port as the to-be-found timewait socket, and connected
    to the same target. (SO_REUSEADDR is needed.)

    To fix this problem, we unhash the timewait socket while holding the
    ehash lock, to make sure lookups/changes are serialized. Only the
    first thread finds the timewait socket; the other ones find the
    established socket and return an EADDRNOTAVAIL error.

    This second version takes into account Evgeniy's review and makes sure
    inet_twsk_put() is called outside of locked sections.
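
    The serialization can be sketched in user space: the timewait entry is
    unhashed and the new socket inserted in one critical section, so exactly
    one of several racing connect() calls wins. The lock, the state strings,
    and the EADDRNOTAVAIL mapping are all illustrative, not the kernel code.

```python
import threading

ehash_lock = threading.Lock()
ehash = {}        # 4-tuple key -> "timewait" or "established"
results = []

def connect_reuse(key):
    """Unhash the timewait entry and insert the new socket in one critical
    section, so only the first racing connect() can claim it."""
    with ehash_lock:
        if ehash.get(key) == "timewait":
            ehash[key] = "established"
            results.append("ok")
        else:
            results.append("EADDRNOTAVAIL")

ehash["tuple"] = "timewait"
threads = [threading.Thread(target=connect_reuse, args=("tuple",)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
```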

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Nov, 2009

1 commit

  • Generated with the following semantic patch

    @@
    struct net *n1;
    struct net *n2;
    @@
    - n1 == n2
    + net_eq(n1, n2)

    @@
    struct net *n1;
    struct net *n2;
    @@
    - n1 != n2
    + !net_eq(n1, n2)

    applied over {include,net,drivers/net}.

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
     

19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    The goal is to transfer the fields used at lookup time into the first
    read-mostly cache line (inside struct sock_common) and to move sk_refcnt
    to a separate cache line (only written by the rx path).

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2009

1 commit


02 Feb, 2009

1 commit


22 Jan, 2009

1 commit

  • With a simple extension to the binding mechanism, which allows binding
    more than 64k sockets (or a smaller amount, depending on sysctl
    parameters), we have to traverse the whole bind hash table to find an
    empty bucket. And while that is not a problem for, say, 32k connections,
    bind() completion time grows exponentially (since after each successful
    binding we have to traverse one more bucket to find an empty one) even if
    we start each time from a random offset inside the hash table.

    So, when the hash table is full and we want to add another socket, we have
    to traverse the whole table no matter what, so effectively this will be
    the worst-case performance and it will be constant.

    Attached picture shows bind() time depending on number of already bound
    sockets.

    The green area corresponds to the usual binding-to-port-zero process,
    which turns on kernel port selection as described above. The red area is
    the bind process when the number of reuse-bound sockets is not limited to
    64k (or by the sysctl parameters). The same exponential growth (hidden by
    the green area) appears before the number of ports reaches the sysctl
    limit.

    At this point the bind hash table has exactly one reuse-enabled socket
    per bucket, but it is possible that they have different addresses.
    Actually the kernel selects the first port to try randomly, so at the
    beginning bind will take roughly constant time, but over time the number
    of ports to check after the random start increases. That produces the
    exponential growth, but because of the random selection above, not every
    port selection will necessarily take longer than the previous one. So we
    have to consider the area below in the graph (if you could zoom in, you
    would find many different times placed there): one area can hide another.

    Blue area corresponds to the port selection optimization.

    This is a rather simple design approach: the hash table now maintains an
    (imprecise and racily updated) count of currently bound sockets, and when
    that count becomes greater than a predefined value (I use the maximum
    port range defined by the sysctls), we stop traversing the whole bind
    hash table and just stop at the first matching bucket after the random
    start. The limit roughly corresponds to the case when the bind hash table
    is full and we have turned on the mechanism allowing more reuse-enabled
    sockets to be bound, so it does not change the behaviour of other sockets.
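
    A sketch of that port walk under stated assumptions (a single flat table
    and a toy threshold check; none of this is the kernel code):

```python
import random

def select_port(buckets, bound_count, threshold, ports):
    """Toy version of the optimized port walk.

    Below the threshold: old behaviour, keep walking until an empty bucket
    is found. Above it (the table is full of reuse-enabled sockets): take
    the first bucket after the random start, empty or not."""
    start = random.randrange(len(ports))
    for i in range(len(ports)):
        port = ports[(start + i) % len(ports)]
        if not buckets.get(port):
            return port                      # empty bucket: always acceptable
        if bound_count > threshold:
            return port                      # full table: first match wins
    return None                              # below threshold, no empty bucket
```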

    Signed-off-by: Evgeniy Polyakov
    Tested-by: Denys Fedoryschenko
    Signed-off-by: David S. Miller

    Evgeniy Polyakov
     

24 Nov, 2008

2 commits

  • The rule for calling sock_prot_inuse_add() is that BHs must
    be disabled. Some new calls were added where this was not
    true, and this triggers warnings, as reported by Ilpo.

    Fix this by adding explicit BH disabling around those call sites,
    or moving sock_prot_inuse_add() call inside an existing BH disabled
    section.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is the last step to be able to perform full RCU lookups
    in __inet_lookup() : After established/timewait tables, we
    add RCU lookups to listening hash table.

    The only trick here is that a socket of a given type (TCP ipv4,
    TCP ipv6, ...) can now move between two different tables
    (established and listening) during an RCU grace period, so we
    must use different 'nulls' end-of-chain values for the two tables.

    We define a large value:

    #define LISTENING_NULLS_BASE (1U << 29)

    This way, slots in the listening table are guaranteed to have end-of-chain
    values different from those of slots in the established table, and a reader
    can still detect that it finished its lookup in the right chain.
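
    The trick can be sketched with plain lists, using an integer end-of-chain
    marker in place of the kernel's tagged nulls pointer (the chain layout and
    return values are illustrative):

```python
LISTENING_NULLS_BASE = 1 << 29

def established_nulls(slot):
    return slot

def listening_nulls(slot):
    return LISTENING_NULLS_BASE + slot

def lookup(chain, key, expected_nulls):
    """Walk a nulls-terminated chain: [sk, sk, ..., nulls_value].
    If the final marker is not the one we expected, an entry we followed
    moved to another table mid-walk and the lookup must be restarted."""
    for item in chain:
        if isinstance(item, int):            # end-of-chain 'nulls' marker
            return None if item == expected_nulls else "restart"
        if item == key:
            return item
    raise AssertionError("chain must end with a nulls value")
```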

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Nov, 2008

1 commit

  • Now that TCP & DCCP use RCU lookups, we can convert the ehash rwlocks to
    spinlocks.

    /proc/net/tcp and other seq_file 'readers' can safely be converted to 'writers'.

    This should speed up writers, since spin_lock()/spin_unlock()
    use only one atomic operation instead of two for write_lock()/write_unlock().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Nov, 2008

1 commit

  • This patch prepares RCU migration of listening_hash table for
    TCP/DCCP protocols.

    The listening_hash table being small (32 slots per protocol), we add
    a spinlock for each slot instead of a single rwlock for the whole table.

    This should reduce lock hold times seen by readers and improve
    concurrency among writers.
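
    In miniature, with threading.Lock standing in for a spinlock (the size
    and the INET_LHTABLE_SIZE name follow the kernel; the structure is a toy):

```python
import threading

INET_LHTABLE_SIZE = 32    # listening hash size, as in the kernel

class ListeningHash:
    """One lock per slot instead of a single rwlock for the whole table."""
    def __init__(self):
        self.slots = [[] for _ in range(INET_LHTABLE_SIZE)]
        self.locks = [threading.Lock() for _ in range(INET_LHTABLE_SIZE)]

    def add(self, sk, hashval):
        i = hashval % INET_LHTABLE_SIZE
        with self.locks[i]:              # writers on other slots don't contend
            self.slots[i].append(sk)

    def remove(self, sk, hashval):
        i = hashval % INET_LHTABLE_SIZE
        with self.locks[i]:
            self.slots[i].remove(sk)
```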

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Nov, 2008

1 commit

  • RCU was added to UDP lookups using a fast infrastructure:
    - the sockets' kmem_cache uses SLAB_DESTROY_BY_RCU and doesn't pay the
      price of call_rcu() at freeing time.
    - hlist_nulls permits using fewer memory barriers.

    This patch uses the same infrastructure for TCP/DCCP established
    and timewait sockets.

    Thanks to SLAB_DESTROY_BY_RCU, there is no slowdown for applications
    using short-lived TCP connections. A follow-up patch converting
    rwlocks to spinlocks will speed up this case even further.

    __inet_lookup_established() is pretty fast now that we don't have to
    dirty a contended cache line (read_lock/read_unlock).

    Only the established and timewait hash tables are converted to RCU
    (the bind table and listen table still use traditional locking).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Nov, 2008

1 commit


26 Jul, 2008

1 commit

  • Removes a legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better with automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuitively, BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON(), though some instances might deserve
    promotion to BUG_ON(); I left that for the future.

    I could make at least one BUILD_BUG_ON conversion.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

17 Jul, 2008

1 commit


17 Jun, 2008

3 commits


18 Apr, 2008

1 commit


16 Apr, 2008

1 commit


01 Apr, 2008

1 commit


26 Mar, 2008

2 commits


23 Mar, 2008

1 commit

  • Inspired by commit ab1e0a13 ([SOCK] proto: Add hashinfo member to
    struct proto) from Arnaldo, I did a similar thing for the UDP/-Lite IPv4
    and -v6 protocols.

    The result is not that exciting, but it removes some levels of
    indirection in udpxxx_get_port and saves some space in code and text.

    The first step is to union the existing hashinfo and the new udp_hash on
    struct proto and to give this union a name, since future initialization
    of tcpxxx_prot, dccp_vx_protinfo and udpxxx_protinfo would cause a gcc
    warning about the inability to initialize an anonymous member this way.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

14 Feb, 2008

2 commits


05 Feb, 2008

1 commit


03 Feb, 2008

1 commit

  • This way we can remove the TCP and DCCP specific versions of:

    sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
    sk->sk_prot->hash: inet_hash is directly used, only v6 needs
    a specific version to deal with mapped sockets
    sk->sk_prot->unhash: both v4 and v6 use inet_unhash directly

    struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
    that inet_csk_get_port can find the per-family routine.

    Now only the lookup routines receive a struct inet_hashinfo as a parameter.

    With this we further reuse code, reducing the difference among INET transport
    protocols.

    Eventually work has to be done on UDP and SCTP to make them share this
    infrastructure and get as a bonus inet_diag interfaces so that iproute can be
    used with these protocols.

    net-2.6/net/ipv4/inet_hashtables.c:
    struct proto | +8
    struct inet_connection_sock_af_ops | +8
    2 structs changed
    __inet_hash_nolisten | +18
    __inet_hash | -210
    inet_put_port | +8
    inet_bind_bucket_create | +1
    __inet_hash_connect | -8
    5 functions changed, 27 bytes added, 218 bytes removed, diff: -191

    net-2.6/net/core/sock.c:
    proto_seq_show | +3
    1 function changed, 3 bytes added, diff: +3

    net-2.6/net/ipv4/inet_connection_sock.c:
    inet_csk_get_port | +15
    1 function changed, 15 bytes added, diff: +15

    net-2.6/net/ipv4/tcp.c:
    tcp_set_state | -7
    1 function changed, 7 bytes removed, diff: -7

    net-2.6/net/ipv4/tcp_ipv4.c:
    tcp_v4_get_port | -31
    tcp_v4_hash | -48
    tcp_v4_destroy_sock | -7
    tcp_v4_syn_recv_sock | -2
    tcp_unhash | -179
    5 functions changed, 267 bytes removed, diff: -267

    net-2.6/net/ipv6/inet6_hashtables.c:
    __inet6_hash | +8
    1 function changed, 8 bytes added, diff: +8

    net-2.6/net/ipv4/inet_hashtables.c:
    inet_unhash | +190
    inet_hash | +242
    2 functions changed, 432 bytes added, diff: +432

    vmlinux:
    16 functions changed, 485 bytes added, 492 bytes removed, diff: -7

    /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
    tcp_v6_get_port | -31
    tcp_v6_hash | -7
    tcp_v6_syn_recv_sock | -9
    3 functions changed, 47 bytes removed, diff: -47

    /home/acme/git/net-2.6/net/dccp/proto.c:
    dccp_destroy_sock | -7
    dccp_unhash | -179
    dccp_hash | -49
    dccp_set_state | -7
    dccp_done | +1
    5 functions changed, 1 bytes added, 242 bytes removed, diff: -241

    /home/acme/git/net-2.6/net/dccp/ipv4.c:
    dccp_v4_get_port | -31
    dccp_v4_request_recv_sock | -2
    2 functions changed, 33 bytes removed, diff: -33

    /home/acme/git/net-2.6/net/dccp/ipv6.c:
    dccp_v6_get_port | -31
    dccp_v6_hash | -7
    dccp_v6_request_recv_sock | +5
    3 functions changed, 5 bytes added, 38 bytes removed, diff: -33

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

01 Feb, 2008

3 commits

  • Add a net argument to inet_lookup and propagate it further
    into the lookup calls, plus tune __inet_check_established.

    The dccp and inet_diag code, which use these lookup functions,
    pass init_net into them.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This tags the inet_bind_bucket struct with a net pointer,
    initializes it during creation and filters on it
    during lookup.

    A better hash function that takes the net into account is
    left for the future; currently all bind buckets with the
    same port will be in one hash chain.
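
    A sketch of the tagged bucket and the filtering lookup (Net, the chain
    layout and find_bucket are invented stand-ins; net_eq mirrors the kernel
    helper's pointer comparison):

```python
class Net:
    """Stand-in for struct net."""

def net_eq(n1, n2):
    return n1 is n2

class InetBindBucket:
    """The bucket now records the net it was created in."""
    def __init__(self, net, port):
        self.net = net
        self.port = port

def find_bucket(chain, net, port):
    # All buckets for one port hash to the same chain, whatever their net,
    # so the lookup must filter on the owning net.
    for tb in chain:
        if tb.port == port and net_eq(tb.net, net):
            return tb
    return None
```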

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • These two functions are the same except for what they call
    as "check_established" and "hash" for a socket.

    This saves half a kilobyte for ipv4 and ipv6.
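
    The unification can be sketched as one body parameterized by two
    callables (the table, keys and callbacks here are toys, not the kernel
    signatures):

```python
class Table:
    def __init__(self, port_range):
        self.port_range = port_range
        self.established = set()

def hash_connect(table, key, check_established, hash_fn):
    """Shared body: only the clash check and the hashing step differ
    between the v4 and v6 callers."""
    for port in table.port_range:
        if check_established(table, key, port):
            continue                     # 4-tuple already taken, try next port
        hash_fn(table, key, port)
        return port
    raise OSError("EADDRNOTAVAIL")

# Family-specific pieces are plain callables:
def check_v4(table, key, port):
    return (key, port) in table.established

def hash_v4(table, key, port):
    table.established.add((key, port))
```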

    add/remove: 1/0 grow/shrink: 1/4 up/down: 582/-1128 (-546)
    function old new delta
    __inet_hash_connect - 577 +577
    arp_ignore 108 113 +5
    static.hint 8 4 -4
    rt_worker_func 376 372 -4
    inet6_hash_connect 584 25 -559
    inet_hash_connect 586 25 -561

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

29 Jan, 2008

3 commits

  • 1) Cleanups (all functions are prefixed by sock_prot_inuse)

    sock_prot_inc_use(prot) -> sock_prot_inuse_add(prot, 1)
    sock_prot_dec_use(prot) -> sock_prot_inuse_add(prot, -1)
    sock_prot_inuse() -> sock_prot_inuse_get()

    New functions:

    sock_prot_inuse_init() and sock_prot_inuse_free() to abstract pcounter use.

    2) If CONFIG_PROC_FS=n, we can zap the 'inuse' member from struct proto,
    since nobody wants to read the inuse value.

    This saves 1372 bytes on i386/SMP and some CPU cycles.
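
    A pcounter can be sketched as a per-CPU array that is cheap to update and
    summed only when read (the class and the cpu argument are illustrative;
    the kernel uses real per-CPU variables):

```python
class PCounter:
    """Toy pcounter: one slot per CPU, updated cheaply, summed on read."""
    def __init__(self, ncpus=4):
        self.slots = [0] * ncpus

    def add(self, cpu, val):
        self.slots[cpu % len(self.slots)] += val   # per-CPU fast path

    def get(self):
        return sum(self.slots)                     # slow path sums all slots

def sock_prot_inuse_add(inuse, cpu, val):
    inuse.add(cpu, val)

def sock_prot_inuse_get(inuse):
    return inuse.get()
```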

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Add __acquires() and __releases() annotations to suppress some sparse
    warnings.

    example of warnings :

    net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong count at exit
    net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' - unexpected unlock

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is -700 bytes from the net/ipv4/built-in.o

    add/remove: 1/0 grow/shrink: 1/3 up/down: 340/-1040 (-700)
    function old new delta
    __inet_lookup_established - 339 +339
    tcp_sacktag_write_queue 2254 2255 +1
    tcp_v4_err 1304 973 -331
    tcp_v4_rcv 2089 1744 -345
    tcp_v4_do_rcv 826 462 -364

    The export is for the dccp module (used via e.g. inet_lookup).

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov