04 Dec, 2009

2 commits

  • Both netlink and /proc/net/tcp interfaces can report transient
    negative values for rx queue.

    ss ->
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB -6 6 127.0.0.1:45956 127.0.0.1:3333

    netstat ->
    tcp 4294967290 6 127.0.0.1:37784 127.0.0.1:3333 ESTABLISHED

    This is because we don't lock the socket while computing
    tp->rcv_nxt - tp->copied_seq,
    and another CPU can update copied_seq before rcv_nxt in the RX path.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
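
    A minimal sketch of the clamp this implies when reporting the queue
    (helper name and placement are illustrative, not the actual patch):

        /* rcv_nxt and copied_seq are read without the socket lock, so
         * the difference can transiently be negative; clamp it to 0
         * before reporting it to netlink or /proc/net/tcp.
         */
        static inline int tcp_rx_queue_len(const struct tcp_sock *tp)
        {
                int answ = tp->rcv_nxt - tp->copied_seq;

                return answ < 0 ? 0 : answ;
        }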
     
  • This function walks the whole hashtable, so there is no point in
    passing it a network namespace. Instead, I purge all timewait
    sockets from dead network namespaces that I find. If the namespace
    is one of the ones I am trying to purge, I am guaranteed no new
    timewait sockets can be formed, so this will get them all. If the
    namespace is one I am not acting for, it might form a few more, but
    I will call inet_twsk_purge again shortly to get rid of them. In
    any event, if the network namespace is dead, timewait sockets are
    useless.

    Move the calls of inet_twsk_purge into batch_exit routines so
    that if I am killing a bunch of namespaces at once I will just
    call inet_twsk_purge once and save a lot of redundant unnecessary
    work.

    In my simple 4k network namespace exit test, the cleanup time dropped
    from roughly 8.2s to 1.6s, while the time spent running inet_twsk_purge
    fell to about 2ms: 1ms for ipv4 and 1ms for ipv6.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
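
    A hedged sketch of the batch-exit shape described above, using the
    ipv4 TCP names; the exact signatures are assumptions, not a quote of
    the patch:

        /* Called once for a whole list of dying namespaces, so the
         * hashtable is walked once per batch instead of once per netns.
         */
        static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
        {
                inet_twsk_purge(&tcp_hashinfo, &tcp_death_row, AF_INET);
        }

        static struct pernet_operations tcp_sk_ops = {
                .init       = tcp_sk_init,
                .exit       = tcp_sk_exit,
                .exit_batch = tcp_sk_exit_batch,
        };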
     

03 Dec, 2009

3 commits

  • Parse incoming TCP_COOKIE option(s).

    Calculate TCP_COOKIE option.

    Send optional data.

    This is a significantly revised implementation of an earlier (year-old)
    patch that no longer applies cleanly, with permission of the original
    author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

    Requires:
    TCPCT part 1a: add request_values parameter for sending SYNACK
    TCPCT part 1b: generate Responder Cookie secret
    TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
    TCPCT part 1d: define TCP cookie option, extend existing struct's
    TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS
    TCPCT part 1f: Initiator Cookie => Responder

    Signed-off-by: William.Allen.Simpson@gmail.com
    Signed-off-by: David S. Miller

    William Allen Simpson
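
    For reference, parsing a new option fits into the usual TCP option
    walk; the sketch below uses a placeholder option kind and a
    hypothetical helper, not the actual TCPCT definitions:

        const unsigned char *ptr = (const unsigned char *)(th + 1);
        int length = (th->doff * 4) - sizeof(struct tcphdr);

        while (length > 0) {
                int opcode = *ptr++;
                int opsize;

                if (opcode == TCPOPT_EOL)
                        break;
                if (opcode == TCPOPT_NOP) {
                        length--;
                        continue;
                }
                opsize = *ptr++;
                if (opsize < 2 || opsize > length)
                        break;                            /* malformed */
                if (opcode == TCPOPT_COOKIE_PAIR)         /* placeholder kind */
                        tcp_record_cookie(ptr, opsize - 2); /* hypothetical */
                ptr += opsize - 2;
                length -= opsize;
        }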
     
  • Data structures are carefully composed to require minimal additions.
    For example, the struct tcp_options_received cookie_plus variable fits
    between existing 16-bit and 8-bit variables, requiring no additional
    space (taking alignment into consideration). There are no additions to
    tcp_request_sock, and only 1 pointer in tcp_sock.

    This is a significantly revised implementation of an earlier (year-old)
    patch that no longer applies cleanly, with permission of the original
    author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

    The principal difference is using a TCP option to carry the cookie nonce,
    instead of a user-configured offset in the data. This is more flexible and
    less subject to user configuration error. Such a cookie option has been
    suggested for many years, and is also useful without SYN data, allowing
    several related concepts to use the same extension option.

    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

    "Re: what a new TCP header might look like", May 12, 1998.
    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

    These functions will also be used in subsequent patches that implement
    additional features.

    Requires:
    TCPCT part 1a: add request_values parameter for sending SYNACK
    TCPCT part 1b: generate Responder Cookie secret
    TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS

    Signed-off-by: William.Allen.Simpson@gmail.com
    Signed-off-by: David S. Miller

    William Allen Simpson
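
    The alignment point can be illustrated with a toy layout (field names
    invented for the example): an 8-bit member slotted in after existing
    16-bit and 8-bit members lands in padding that is already there, so
    sizeof() does not grow.

        struct toy_options {
                u16 mss_clamp;     /* 2 bytes                           */
                u8  num_sacks;     /* 1 byte                            */
                u8  cookie_plus;   /* 1 byte - occupies what would      */
                                   /* otherwise be padding before the   */
                                   /* next 4-byte-aligned member        */
                u32 rcv_tsval;     /* starts on a 4-byte boundary       */
        };                         /* sizeof() == 8 either way          */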
     
  • Add optional function parameters associated with sending SYNACK.
    These parameters are not needed after sending SYNACK, and are not
    used for retransmission. Avoids extending struct tcp_request_sock,
    and avoids allocating kernel memory.

    Also affects DCCP as it uses common struct request_sock_ops,
    but this parameter is currently reserved for future use.

    Signed-off-by: William.Allen.Simpson@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    William Allen Simpson
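
    A hedged sketch of the "extra parameter instead of extra struct
    members" idea; the type and signature below are assumptions about the
    shape of the change, not a quote of it:

        /* Opaque data needed only while the SYNACK is being built, so it
         * is passed down the call chain rather than being stored in
         * tcp_request_sock.
         */
        struct request_values;

        static int tcp_v4_send_synack(struct sock *sk,
                                      struct request_sock *req,
                                      struct request_values *rvp);

        /* DCCP shares struct request_sock_ops, so its callback gains the
         * same parameter but simply ignores it for now.
         */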
     

14 Nov, 2009

2 commits

  • While investigating for network latencies, I found inet_getid() was a
    contention point for some workloads, as inet_peer_idlock is shared
    by all inet_getid() users regardless of peers.

    One way to fix this is to make ip_id_count an atomic_t instead
    of __u16, and use atomic_add_return().

    In order to keep sizeof(struct inet_peer) = 64 on 64-bit arches,
    tcp_ts_stamp is also converted to __u32 instead of "unsigned long".

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
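
    A minimal sketch of the lock-free ID generation this describes (close
    to, though not guaranteed identical to, the final code):

        /* ip_id_count is now an atomic_t, so the update needs no shared
         * spinlock and is private to this peer.
         */
        static inline __u16 inet_getid(struct inet_peer *p, int more)
        {
                more++;
                return atomic_add_return(more, &p->ip_id_count) - more;
        }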
     
  • Define two symbols needed in both kernel and user space.

    Remove old (somewhat incorrect) kernel variant that wasn't used in
    most cases. Default should apply to both RMSS and SMSS (RFC2581).

    Replace numeric constants with defined symbols.

    Stand-alone patch, originally developed for TCPCT.

    Signed-off-by: William.Allen.Simpson@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    William Allen Simpson
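
    A hedged sketch of the kind of symbols meant here; treat the exact
    names and comments as assumptions:

        /* Shared by kernel and user space instead of scattered magic
         * numbers; the default applies to both RMSS and SMSS (RFC 2581).
         */
        #define TCP_MSS_DEFAULT  536U   /* RFC 1122 / RFC 2581 default */
        #define TCP_MSS_DESIRED 1220U   /* fits the IPv6 minimum MTU   */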
     

29 Oct, 2009

1 commit


19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    The goal is to transfer the fields used at lookup time into the first,
    read-mostly cache line (inside struct sock_common) and to move sk_refcnt
    to a separate cache line (only written by the rx path).

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
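
    The rename itself is mechanical (inet->daddr becomes inet->inet_daddr,
    and so on); what it buys is the sk_refcnt-style aliasing a later patch
    can then apply without name clashes. The line below is purely an
    illustration of that style, not the follow-up patch itself:

        /* Illustrative only: with the inet_ prefix, the field can later
         * be redirected into the shared, read-mostly sock_common area.
         */
        #define inet_daddr  sk.__sk_common.skc_daddr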
     

13 Oct, 2009

1 commit


15 Sep, 2009

1 commit

  • Once upon a time snd_ssthresh was a 16-bit quantity. ...That has not
    been true for a long period of time. I ran across some ancient
    compares which still seem to trust that legacy. Put all that magic
    into a single place; I have hopefully found all of them.

    Compile tested, though linking an allyesconfig is ridiculous
    nowadays, it seems.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
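
    A hedged sketch of "putting the magic into a single place": one
    sentinel plus one helper replace the scattered 16-bit-era compares
    (the symbol and helper names are assumptions):

        /* snd_ssthresh is a full 32-bit value these days. */
        #define TCP_INFINITE_SSTHRESH   0x7fffffff

        static inline bool tcp_in_initial_slowstart(const struct tcp_sock *tp)
        {
                return tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH;
        }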
     

03 Sep, 2009

1 commit

  • This fixed a lockdep warning which appeared when doing stress
    memory tests over NFS:

    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.

    page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock

    mount_root => nfs_root_data => tcp_close => lock sk_lock =>
    tcp_send_fin => alloc_skb_fclone => page reclaim

    David raised a concern that if the allocation fails in tcp_send_fin(), and it's
    GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting
    for the allocation to succeed.

    But the fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield()
    looks weird, but it is no worse than the implicit sleep inside GFP_KERNEL.
    Both could loop endlessly under memory pressure.

    CC: Arnaldo Carvalho de Melo
    CC: David S. Miller
    CC: Herbert Xu
    Signed-off-by: Wu Fengguang
    Signed-off-by: David S. Miller

    Wu Fengguang
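
    The allocation pattern under discussion looks roughly like this sketch
    of tcp_send_fin()'s slow path (not a verbatim quote):

        /* GFP_ATOMIC keeps page reclaim - which may itself need this
         * socket's lock via NFS writeback - out of the picture; retry
         * and yield() instead of sleeping inside the allocator.
         */
        for (;;) {
                skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
                if (skb)
                        break;
                yield();
        }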
     

02 Sep, 2009

2 commits


01 Sep, 2009

2 commits

  • Here, an ICMP host/network unreachable message whose payload fits
    TCP's SND.UNA is taken as an indication that the RTO retransmission
    has not been lost due to congestion, but because of a route failure
    somewhere along the path.
    With true congestion, a router won't trigger such a message and the
    patched TCP will operate as standard TCP.

    This patch reverts one RTO backoff if an ICMP host/network unreachable
    message whose payload fits TCP's SND.UNA arrives.
    Based on the new RTO, the retransmission timer is reset to reflect the
    remaining time, or - if the revert clocked out the timer - a retransmission
    is sent out immediately.
    Backoffs are only reverted if TCP is in RTO loss recovery, i.e. if there
    have already been retransmissions and reversible backoffs.

    Changes from v2:
    1) Renaming of skb in tcp_v4_err() moved to another patch.
    2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
    3) Fixed code comments.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
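
    Schematically, reverting one backoff and re-arming the timer looks
    like the sketch below (simplified; how the elapsed time is obtained is
    elided, and the RTO-recovery precondition is checked by the caller):

        static void tcp_revert_one_backoff(struct sock *sk, u32 elapsed)
        {
                struct inet_connection_sock *icsk = inet_csk(sk);
                struct tcp_sock *tp = tcp_sk(sk);
                u32 remaining;

                icsk->icsk_backoff--;           /* undo one doubling */
                icsk->icsk_rto = __tcp_set_rto(tp) << icsk->icsk_backoff;

                remaining = icsk->icsk_rto - min(icsk->icsk_rto, elapsed);
                if (remaining)
                        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                                  remaining, TCP_RTO_MAX);
                else
                        tcp_retransmit_timer(sk); /* clocked out: resend now */
        }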
     
  • This supplementary patch renames skb to icmp_skb in tcp_v4_err() in order to
    disambiguate from another sk_buff variable, which will be introduced
    in a separate patch.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
     

20 Jul, 2009

2 commits

  • When the TCP connection handshake completes on the passive
    side, a variety of state must be set up in the "child" sock,
    including the key if MD5 authentication is being used. Fix TCP
    for both address families to label the key with the peer's
    destination address, rather than the address from the listening
    sock, which is usually the wildcard.

    Reported-by: Stephen Hemminger
    Signed-off-by: John Dykstra
    Signed-off-by: David S. Miller

    John Dykstra
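
    For the IPv4 side, the fix amounts to copying the listener's key under
    the child's peer address rather than the listener's own address
    (sketch only; helper names approximate the code of that era):

        /* In tcp_v4_syn_recv_sock(), after newsk/newinet are set up: */
        key = tcp_v4_md5_do_lookup(sk, newinet->daddr);
        if (key) {
                char *newkey = kmemdup(key->key, key->keylen, GFP_ATOMIC);

                if (newkey)
                        /* label the copy with the peer's address, not the
                         * usually-wildcard address of the listening sock */
                        tcp_v4_md5_do_add(newsk, newinet->daddr,
                                          newkey, key->keylen);
        }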
     
  • Fix MD5 signature checking so that an IPv4 active open
    to an IPv6 socket can succeed. In particular, use the
    correct address family's signature generation function
    for the SYN/ACK.

    Reported-by: Stephen Hemminger
    Signed-off-by: John Dykstra
    Signed-off-by: David S. Miller

    John Dykstra
     

03 Jun, 2009

2 commits

  • Define three accessors to get/set dst attached to a skb

    struct dst_entry *skb_dst(const struct sk_buff *skb)

    void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

    void skb_dst_drop(struct sk_buff *skb)

    This one should replace occurrences of:
    dst_release(skb->dst);
    skb->dst = NULL;

    Delete the skb->dst field.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
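
    The three accessors can be sketched as follows (the private field name
    is an assumption; the behaviour is as described above):

        static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
        {
                return skb->_skb_dst;
        }

        static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
        {
                skb->_skb_dst = dst;
        }

        static inline void skb_dst_drop(struct sk_buff *skb)
        {
                dst_release(skb_dst(skb));
                skb_dst_set(skb, NULL);
        }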
     
  • Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

    Delete skb->rtable field

    Setting rtable directly is not allowed; just set dst instead, as rtable
    is an alias.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
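
    The accessor is essentially a typed view of the dst pointer (sketch):

        static inline struct rtable *skb_rtable(const struct sk_buff *skb)
        {
                /* struct rtable embeds a dst_entry as its first member,
                 * so the route is just a cast of skb_dst().
                 */
                return (struct rtable *)skb_dst(skb);
        }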
     

06 May, 2009

1 commit


27 Apr, 2009

1 commit

  • On a brand new GRO skb, we cannot call ip_hdr since the header
    may lie in the non-linear area. This patch adds the helper
    skb_gro_network_header to handle this.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
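
    A sketch of the helper: when GRO kept the headers in the first
    fragment (frag0), read them from there, otherwise fall back to the
    linear area (close to, but not necessarily identical to, the real
    definition):

        static inline void *skb_gro_network_header(struct sk_buff *skb)
        {
                return (NAPI_GRO_CB(skb)->frag0 ?: skb->data) +
                       skb_network_offset(skb);
        }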
     

28 Mar, 2009

1 commit

  • The current placement of the security_inet_conn_request() hooks do not allow
    individual LSMs to override the IP options of the connection's request_sock.
    This is a problem as both SELinux and Smack have the ability to use labeled
    networking protocols which make use of IP options to carry security attributes
    and the inability to set the IP options at the start of the TCP handshake is
    problematic.

    This patch moves the IPv4 security_inet_conn_request() hooks past the code
    where the request_sock's IP options are set/reset so that the LSM can safely
    manipulate the IP options as needed. This patch intentionally does not change
    the related IPv6 hooks, as IPv6-based labeling protocols which use IPv6
    options are not currently implemented; once they are, we will have a
    better idea of the correct placement for the IPv6 hooks.

    Signed-off-by: Paul Moore
    Acked-by: David S. Miller
    Signed-off-by: James Morris

    Paul Moore
     

12 Mar, 2009

1 commit

  • Some systems send SYN packets with apparently wrong RFC1323 timestamp
    option values [timestamp tsval=0 tsecr=0].
    It might be for security reasons (http://www.secuobs.com/plugs/25220.shtml ).

    The Linux TCP stack ignores this option and sends back a SYN+ACK packet
    without the timestamp option, thus many TCP flows cannot use timestamps
    and lose some of the benefit of RFC1323.

    Other operating systems seem not to care about the initial tsval value,
    and let TCP flows negotiate the timestamp option.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
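
    The change boils down to dropping a special case of roughly this shape
    from the SYN path, so a zero tsval no longer disables timestamps for
    the whole flow (reconstructed from the description, not quoted):

        /* Old behaviour being removed (sketch): */
        if (tmp_opt.saw_tstamp && !tmp_opt.rcv_tsval) {
                /* SYN carried tsval=0: stop advertising timestamps. */
                tmp_opt.saw_tstamp = 0;
                tmp_opt.tstamp_ok  = 0;
        }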
     

23 Feb, 2009

1 commit


30 Jan, 2009

1 commit

  • Unfortunately, simplicity isn't always the best. The fraginfo
    interface turned out to be suboptimal. The problem was quite
    obvious: for every packet, we have to copy the headers from
    the frags structure into skb->head, even though for 99% of
    packets this part is immediately thrown away after the merge.

    LRO didn't have this problem because it read the headers
    directly from the frags structure.

    This patch attempts to address this by creating an interface
    that allows GRO to access the headers in the first frag without
    having to copy them. Because all drivers that use frags place the
    headers in the first frag, this optimisation should be enough.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

07 Jan, 2009

1 commit


30 Dec, 2008

1 commit

  • When we converted the protocol atomic counters, such as the orphan
    count and the total socket count, deadlocks were introduced due to
    the mismatch in BH status of the spots that used the percpu counter
    operations.

    Based on the diagnosis and patch by Peter Zijlstra, this patch
    fixes these issues by disabling BH where we may be in process
    context.

    Reported-by: Jeff Kirsher
    Tested-by: Ingo Molnar
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
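
    The fix pattern is the usual one for a percpu counter shared between
    process context and BH context (sketch; the counter shown is just one
    of the affected spots):

        /* In process context, where the same counter is also updated
         * from softirq, BH must be disabled around the update.
         */
        local_bh_disable();
        percpu_counter_inc(sk->sk_prot->orphan_count);
        local_bh_enable();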
     

16 Dec, 2008

1 commit

  • This patch adds the TCP-specific portion of GRO. The criterion for
    merging is extremely strict (the TCP header must match exactly apart
    from the checksum) so as to allow refragmentation. Otherwise this
    is pretty much identical to LRO, except that we support the merging
    of ECN packets.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
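
    The "must match exactly apart from the checksum" rule can be sketched
    as a word-wise comparison of the two TCP headers, masking only the
    flags that may legitimately differ (simplified; th/th2 are the new and
    the held header, thlen the header length including options):

        u32 flush;
        int i;

        flush  = (tcp_flag_word(th) ^ tcp_flag_word(th2)) &
                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
        flush |= th->ack_seq ^ th2->ack_seq;
        for (i = sizeof(*th); i < thlen; i += 4)    /* option words */
                flush |= *(u32 *)((u8 *)th + i) ^ *(u32 *)((u8 *)th2 + i);

        if (flush)
                goto out_check_final;       /* this pair cannot be merged */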
     

26 Nov, 2008

1 commit

  • Instead of using one atomic_t per protocol, use a percpu_counter
    for "sockets_allocated", to reduce cache line contention on
    heavy duty network servers.

    Note: we revert commit 248969ae31e1b3276fc4399d67ce29a5d81e6fd9
    ("net: af_unix can make unix_nr_socks visbile in /proc"),
    since it is no longer used after the addition of sock_prot_inuse_add().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Nov, 2008

1 commit

  • This is the last step to be able to perform full RCU lookups
    in __inet_lookup() : After established/timewait tables, we
    add RCU lookups to listening hash table.

    The only trick here is that a socket of a given type (TCP ipv4,
    TCP ipv6, ...) can now be in flight between two different tables
    (established and listening) during an RCU grace period, so we
    must use different 'nulls' end-of-chain values for the two tables.

    We define a large value:

    #define LISTENING_NULLS_BASE (1U << 29)

    so that slots in the listening table are guaranteed to have different
    end-of-chain values from slots in the established table. A reader can
    still detect whether it finished its lookup in the right chain.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
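
    In a lockless listener lookup the trick looks roughly like the sketch
    below: every chain ends in a tagged 'nulls' value, so a reader that got
    moved to a chain of the wrong table mid-walk notices and restarts.
    match() is a placeholder for the real scoring logic; sk, node, ilb and
    hash come from the surrounding lookup function:

        #define LISTENING_NULLS_BASE (1U << 29)

    begin:
        sk_nulls_for_each_rcu(sk, node, &ilb->head) {
                if (match(sk))                  /* placeholder */
                        goto found;
        }
        /* Established slots end in (slot); listening slots end in
         * (LISTENING_NULLS_BASE + slot). Anything else means we were
         * moved to another chain during the walk, so start over.
         */
        if (get_nulls_value(node) != (hash + LISTENING_NULLS_BASE))
                goto begin;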
     

21 Nov, 2008

1 commit

  • Now that TCP & DCCP use RCU lookups, we can convert the ehash rwlocks
    to spinlocks.

    /proc/net/tcp and other seq_file 'readers' can safely be converted to
    'writers'.

    This should speed up writers, since spin_lock()/spin_unlock()
    use only one atomic operation instead of the two used by
    write_lock()/write_unlock().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Nov, 2008

2 commits


17 Nov, 2008

1 commit

  • RCU was added to UDP lookups, using a fast infrastructure:
    - the socket kmem_cache uses SLAB_DESTROY_BY_RCU and doesn't pay the
      price of call_rcu() at freeing time.
    - hlist_nulls permits the use of few memory barriers.

    This patch uses the same infrastructure for TCP/DCCP established
    and timewait sockets.

    Thanks to SLAB_DESTROY_BY_RCU, there is no slowdown for applications
    using short-lived TCP connections. A follow-up patch, converting
    rwlocks to spinlocks, will speed up this case even more.

    __inet_lookup_established() is pretty fast now that we don't have to
    dirty a contended cache line (read_lock/read_unlock).

    Only the established and timewait hashtables are converted to RCU
    (the bind table and listen table still use traditional locking).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
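
    With SLAB_DESTROY_BY_RCU, a socket's memory can be reused for another
    socket within the same RCU grace period, so the lockless lookup takes
    a reference and then re-checks the keys, roughly as sketched below
    (the INET_MATCH arguments follow that era's macro and are an
    assumption here):

    begin:
        sk_nulls_for_each_rcu(sk, node, &head->chain) {
                if (INET_MATCH(sk, net, hash, acookie,
                               saddr, daddr, ports, dif)) {
                        if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
                                goto begin;     /* socket is being freed */
                        if (unlikely(!INET_MATCH(sk, net, hash, acookie,
                                                 saddr, daddr, ports, dif))) {
                                sock_put(sk);   /* slab reused the memory */
                                goto begin;
                        }
                        goto found;
                }
        }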
     

03 Nov, 2008

1 commit


31 Oct, 2008

1 commit


10 Oct, 2008

1 commit

  • Maybe it's just me, but I guess those MD5 people made a mess
    out of it by having *_md5_hash_* use daddr, saddr order
    instead of the one that is natural (and equal to what the csum
    functions use). For the segment we're sending, the original
    addresses are reversed, so buff's saddr == skb's daddr and
    vice versa.

    Maybe I can finally proceed with unification of some code
    after fixing it first... :-)

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

09 Oct, 2008

1 commit

  • While looking for some common code, I came across a difference
    in checksum calculation between tcp_v6_send_(reset|ack) that I
    couldn't explain. I checked both v4 and v6 and found that
    both seem to have the same "feature". I couldn't find anything
    in the RFC, nor anywhere else, which would state that the MD5
    option should be ignored the way it was in the reset case, so I
    came to the conclusion that this is probably a genuine bug. I
    suspect that the addition of MD5 was simply fooled by the excessive
    copy-paste code in those functions and that the reset path was
    never tested well enough to find the problem.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

08 Oct, 2008

1 commit