02 Nov, 2011

1 commit

  • the tcp and udp code creates a set of struct file_operations at runtime
    while it can also be done at compile time, with the added benefit of then
    having these file operations be const.

    the trickiest part was to get the "THIS_MODULE" reference right; the naive
    method of declaring a struct in the place of registration would not work
    for this reason.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: David S. Miller

    Arjan van de Ven
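
    The difference between the two approaches can be sketched in plain
    user-space C. This is only an illustration, not the kernel code: the
    struct demo_file_ops type, the owner string standing in for
    THIS_MODULE, and every demo_* name are hypothetical.

```c
/* Hypothetical stand-in for the kernel's struct file_operations. */
struct demo_file_ops {
	const char *owner;      /* plays the role of THIS_MODULE */
	int (*open)(void);
	int (*release)(void);
};

int demo_open(void)    { return 0; }
int demo_release(void) { return 0; }

/* Runtime variant: the table has to stay writable so it can be filled in. */
struct demo_file_ops demo_runtime_ops;

void demo_fill_ops_at_runtime(const char *owner)
{
	demo_runtime_ops.owner = owner;
	demo_runtime_ops.open = demo_open;
	demo_runtime_ops.release = demo_release;
}

/*
 * Compile-time variant: fully initialized where it is defined, so it can
 * be const. The owner is named here, in the file that owns the callbacks,
 * rather than at the place of registration.
 */
static const struct demo_file_ops demo_const_ops = {
	.owner   = "demo_module",
	.open    = demo_open,
	.release = demo_release,
};

const struct demo_file_ops *demo_get_const_ops(void)
{
	return &demo_const_ops;
}
```

    With the const table, any later attempt to assign to one of its
    members is rejected by the compiler.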
     

25 Oct, 2011

1 commit


24 Oct, 2011

2 commits


21 Oct, 2011

1 commit

  • Adding const qualifiers to pointers can ease code review, and spot some
    bugs. It might allow the compiler to optimize code further.

    For example, is it legal to temporarily write a null cksum into tcphdr
    in tcp_md5_hash_header() ? I am afraid a sniffer could catch the
    temporary null value...

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Sep, 2011

1 commit

  • Rename struct tcp_skb_cb "flags" to "tcp_flags" to ease code review and
    maintenance.

    Its content is a combination of FIN/SYN/RST/PSH/ACK/URG/ECE/CWR flags

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Sep, 2011

2 commits

  • struct tcp_skb_cb contains a "flags" field containing either tcp flags
    or IP dsfield depending on context (input or output path)

    Introduce ip_dsfield to make the difference clear and ease maintenance.
    If later we want to save space, we can union flags/ip_dsfield

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • While playing with a new ADSL box at home, I discovered that ECN
    blackhole can trigger suboptimal quickack mode on linux : We send one
    ACK for each incoming data frame, without any delay and eventual
    piggyback.

    This is because TCP_ECN_check_ce() considers that if no ECT is seen on a
    segment, this is because this segment was a retransmit.

    Refine this heuristic and apply it only if we have seen ECT in a previous
    segment, to detect ECN blackhole at IP level.

    Signed-off-by: Eric Dumazet
    CC: Jamal Hadi Salim
    CC: Jerry Chu
    CC: Ilpo Järvinen
    CC: Jim Gettys
    CC: Dave Taht
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Eric Dumazet
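
    The refined heuristic can be modelled in user space roughly as
    follows. This is a sketch under assumptions, not the kernel function:
    struct demo_conn and its fields are hypothetical, and only the ECN
    codepoint values are taken from RFC 3168.

```c
#include <stdbool.h>
#include <stdint.h>

/* ECN codepoints carried in the two low bits of the IP DS field (RFC 3168). */
enum {
	ECN_NOT_ECT = 0,
	ECN_ECT_1   = 1,
	ECN_ECT_0   = 2,
	ECN_CE      = 3,
	ECN_MASK    = 3,
};

/* Hypothetical per-connection state used only for this illustration. */
struct demo_conn {
	bool ecn_seen;    /* an ECT or CE codepoint was seen on an earlier segment */
	bool demand_cwr;  /* CE was seen: the sender should reduce its window */
	bool quickack;    /* ACK every segment immediately */
};

void demo_ecn_check_ce(struct demo_conn *c, uint8_t ip_dsfield)
{
	switch (ip_dsfield & ECN_MASK) {
	case ECN_NOT_ECT:
		/*
		 * Treat a segment without ECT as a probable retransmit only
		 * if ECT was seen before. If ECT never arrives (ECN
		 * blackhole), quickack mode is not entered.
		 */
		if (c->ecn_seen)
			c->quickack = true;
		break;
	case ECN_CE:
		c->demand_cwr = true;
		c->ecn_seen = true;
		break;
	default:
		c->ecn_seen = true;
		break;
	}
}
```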
     

22 Sep, 2011

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/Kconfig
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
    drivers/net/ethernet/broadcom/tg3.c
    drivers/net/wireless/iwlwifi/iwl-pci.c
    drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
    drivers/net/wireless/rt2x00/rt2800usb.c
    drivers/net/wireless/wl12xx/main.c

    David S. Miller
     

19 Sep, 2011

1 commit


17 Sep, 2011

1 commit

  • tcp_md5sig_pool is currently an 'array' (a percpu object) of pointers to
    struct tcp_md5sig_pool. Only the pointers are NUMA aware, but objects
    themselves are all allocated on a single node.

    Remove this extra indirection to get proper percpu memory (NUMA aware)
    and make code simpler.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Sep, 2011

1 commit

  • "Possible SYN flooding on port xxxx " messages can fill logs on servers.

    Change logic to log the message only once per listener, and add two new
    SNMP counters to track :

    TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client

    TCPReqQFullDrop : number of times a SYN request was dropped because
    syncookies were not enabled.

    Based on a prior patch from Tom Herbert, and suggestions from David.

    Signed-off-by: Eric Dumazet
    CC: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
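
    A minimal user-space sketch of the "log once per listener, count
    every time" logic described above. The struct, the function, and the
    plain global counters are hypothetical stand-ins for the listener
    state and the SNMP counters.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical listener state: one warning flag per listening socket. */
struct demo_listener {
	unsigned short port;
	bool synflood_warned;
};

/* Stand-ins for the TCPReqQFullDoCookies / TCPReqQFullDrop counters. */
unsigned long demo_req_q_full_do_cookies;
unsigned long demo_req_q_full_drop;

/*
 * Called when the request queue is full. Returns true if a SYN cookie
 * should be sent, false if the SYN is dropped.
 */
bool demo_syn_queue_full(struct demo_listener *l, bool syncookies_enabled)
{
	/* Counters are bumped on every event ... */
	if (syncookies_enabled)
		demo_req_q_full_do_cookies++;
	else
		demo_req_q_full_drop++;

	/* ... but the log message is emitted only once per listener. */
	if (!l->synflood_warned) {
		l->synflood_warned = true;
		fprintf(stderr, "Possible SYN flooding on port %u. %s.\n",
			(unsigned int)l->port,
			syncookies_enabled ? "Sending cookies"
					   : "Dropping request");
	}
	return syncookies_enabled;
}
```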
     

09 Jun, 2011

1 commit

  • This patch lowers the default initRTO from 3secs to 1sec per
    RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
    has been retransmitted, AND the TCP timestamp option is not on.

    It also adds support to take RTT sample during 3WHS on the passive
    open side, just like its active open counterpart, and uses it, if
    valid, to seed the initRTO for the data transmission phase.

    The patch also resets ssthresh to its initial default at the
    beginning of the data transmission phase, and reduces cwnd to 1 if
    there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
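
    The two decisions described in this message can be written down as
    small pure functions. This is a simplified sketch: the function and
    macro names are hypothetical, and the use of a measured RTT sample to
    seed the RTO is left out.

```c
#include <stdbool.h>

#define DEMO_INIT_RTO_MS     1000u  /* new default initRTO */
#define DEMO_FALLBACK_RTO_MS 3000u  /* previous default, used as fallback */

/*
 * Initial RTO: 1 sec by default, 3 secs only if the SYN or SYN-ACK was
 * retransmitted AND the timestamp option is not on.
 */
unsigned int demo_initial_rto_ms(unsigned int syn_retransmits,
				 bool timestamps_on)
{
	if (syn_retransmits > 0 && !timestamps_on)
		return DEMO_FALLBACK_RTO_MS;
	return DEMO_INIT_RTO_MS;
}

/*
 * cwnd at the start of the data transmission phase: reduced to 1 only if
 * there was MORE THAN ONE retransmission during the handshake.
 */
unsigned int demo_initial_cwnd(unsigned int syn_retransmits,
			       unsigned int default_cwnd)
{
	return syn_retransmits > 1 ? 1u : default_cwnd;
}
```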
     

21 Feb, 2011

1 commit


06 Feb, 2011

1 commit


03 Feb, 2011

1 commit


25 Jan, 2011

1 commit

  • Quoting Ben Hutchings: we presumably won't be defining features that
    can only be enabled on 64-bit architectures.

    Occurrences found by `grep -r` on net/, drivers/net, include/

    [ Move features and vlan_features next to each other in
    struct netdev, as per Eric Dumazet's suggestion -DaveM ]

    Signed-off-by: Michał Mirosław
    Signed-off-by: David S. Miller

    Michał Mirosław
     

21 Dec, 2010

1 commit

  • This patch changes the default initial receive window to 10 mss
    (defined constant). The default window is limited to the maximum
    of 10*1460 and 2*mss (when mss > 1460).

    draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends
    increasing TCP's initial congestion window to 10 mss or about 15KB.
    Leading up to this proposal were several large-scale live Internet
    experiments with an initial congestion window of 10 mss (IW10), where
    we showed that the average latency of HTTP responses improved by
    approximately 10%. This was accompanied by a slight increase in
    retransmission rate (0.5%), most of which is coming from applications
    opening multiple simultaneous connections. To understand the extreme
    worst case scenarios, and fairness issues (IW10 versus IW3), we further
    conducted controlled testbed experiments. We came away finding minimal
    negative impact even under low link bandwidths (dial-ups) and small
    buffers. These results are extremely encouraging to adopting IW10.

    However, an initial congestion window of 10 mss is useless unless a TCP
    receiver advertises an initial receive window of at least 10 mss.
    Fortunately, in the large-scale Internet experiments we found that most
    widely used operating systems advertised large initial receive windows
    of 64KB, allowing us to experiment with a wide range of initial
    congestion windows. Linux systems were among the few exceptions that
    advertised a small receive window of 6KB. The purpose of this patch is
    to fix this shortcoming.

    References:
    1. A comprehensive list of all IW10 references to date.
    http://code.google.com/speed/protocols/tcpm-IW10.html

    2. Paper describing results from large-scale Internet experiments with IW10.
    http://ccr.sigcomm.org/drupal/?q=node/621

    3. Controlled testbed experiments under worst case scenarios and a
    fairness study.
    http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf

    4. Raw test data from testbed experiments (Linux senders/receivers)
    with initial congestion and receive windows of both 10 mss.
    http://research.csc.ncsu.edu/netsrv/?q=content/iw10

    5. Internet-Draft. Increasing TCP's Initial Window.
    https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/

    Signed-off-by: Nandita Dukkipati
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Nandita Dukkipati
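
    The limit stated in the first paragraph of this message can be worked
    through with a small helper. This follows the wording of the message
    rather than the exact kernel arithmetic, and the names are
    hypothetical.

```c
#include <stdint.h>

#define DEMO_INIT_RCVWND_SEGS 10u    /* default initial receive window, in mss */
#define DEMO_ETH_MSS          1460u  /* mss of a standard Ethernet path */

/* Upper bound, in bytes, applied to the default initial receive window. */
uint32_t demo_init_rcvwnd_limit(uint32_t mss)
{
	uint32_t ten_eth, two_mss;

	if (mss <= DEMO_ETH_MSS)
		return DEMO_INIT_RCVWND_SEGS * mss;

	/* For mss > 1460: the maximum of 10*1460 and 2*mss. */
	ten_eth = DEMO_INIT_RCVWND_SEGS * DEMO_ETH_MSS;
	two_mss = 2u * mss;
	return ten_eth > two_mss ? ten_eth : two_mss;
}
```

    For example, an mss of 1460 gives 14600 bytes, an mss of 4000 still
    gives 14600 bytes, and an mss of 9000 gives 18000 bytes.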
     

20 Dec, 2010

1 commit


17 Dec, 2010

1 commit

  • Some windows versions have wrong RFC1323 implementations, with SYN and
    SYNACKS messages containing zero tcp timestamps.

    We relaxed in commit fc1ad92dfc4e363 the passive connection case
    (Windows connects to a linux machine), but the reverse case (linux
    connects to a Windows machine) has an analogous problem when tsvals from
    windows machine are 'negative' (high order bit set) : PAWS triggers and
    we drop incoming messages.

    Fix this by making zero ts_recent value special, allowing frame to be
    processed.

    Based on a report and initial patch from Dmitriy Balakin

    Bugzilla reference : https://bugzilla.kernel.org/show_bug.cgi?id=24842

    Reported-by: dmitriy.balakin@nicneiron.ru
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
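
    Why a 'negative' tsval trips PAWS when ts_recent is zero, and how the
    special case avoids it, can be seen in a reduced model of the check.
    The function name is hypothetical and the sketch covers only the
    signed comparison and the zero special case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the segment passes the (simplified) PAWS check. */
bool demo_paws_ok(uint32_t ts_recent, uint32_t rcv_tsval, int32_t paws_win)
{
	/*
	 * Normal case: the incoming timestamp must not be older than the
	 * most recent one, compared as a signed 32-bit difference.
	 */
	if ((int32_t)(ts_recent - rcv_tsval) <= paws_win)
		return true;

	/*
	 * Special case: ts_recent == 0 means the peer sent a zero timestamp
	 * in its SYN or SYNACK, so the comparison above is meaningless.
	 */
	if (ts_recent == 0)
		return true;

	return false;
}
```

    With ts_recent = 0 and rcv_tsval = 0x80000001, the signed difference
    is 0x7fffffff, which is positive, so without the special case the
    segment would be dropped.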
     

03 Dec, 2010

1 commit


02 Dec, 2010

1 commit

  • The only thing AF-specific about remembering the timestamp
    for a time-wait TCP socket is getting the peer.

    Abstract that behind a new timewait_sock_ops vector.

    Support for real IPV6 sockets is not filled in yet, but
    curiously this makes timewait recycling start to work
    for v4-mapped ipv6 sockets.

    Signed-off-by: David S. Miller

    David S. Miller
     

01 Dec, 2010

1 commit


11 Nov, 2010

1 commit

  • Robin Holt tried to boot a 16TB machine and found some limits were
    reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]

    We can switch infrastructure to use "long" instead of "int", now
    atomic_long_t primitives are available for free.

    Signed-off-by: Eric Dumazet
    Reported-by: Robin Holt
    Reviewed-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Eric Dumazet
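
    The overflow is easy to reproduce with arithmetic alone. The sketch
    below assumes 4 KiB pages and a 32-bit int, neither of which is stated
    in the message, and uses hypothetical helper names.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ull  /* assumed page size */

/* Number of pages needed to cover a given amount of memory. */
uint64_t demo_pages_for_bytes(uint64_t bytes)
{
	return bytes / DEMO_PAGE_SIZE;
}

/* True if a page count can be stored in an int without overflow. */
bool demo_fits_in_int(uint64_t pages)
{
	return pages <= (uint64_t)INT_MAX;
}
```

    Under these assumptions, 16 TiB is 2^44 bytes, or 2^32 pages, which is
    larger than INT_MAX (2^31 - 1), while 1 TiB (2^28 pages) still fits.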
     

30 Sep, 2010

1 commit


27 Sep, 2010

1 commit


16 Sep, 2010

1 commit

  • If peer uses tiny MSS (say, 75 bytes) and similarly tiny advertised
    window, the SWS logic will packetize to half the MSS unnecessarily.

    This causes problems with some embedded devices.

    However for large MSS devices we do want to half-MSS packetize
    otherwise we never get enough packets into the pipe for things
    like fast retransmit and recovery to work.

    Be careful also to handle the case where MSS > window, otherwise
    we'll never send until the probe timer.

    Reported-by: ツ Leandro Melo de Sales
    Signed-off-by: David S. Miller

    Alexey Kuznetsov
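
    One way to express the three cases above (tiny window, large window,
    MSS > window) is a single cutoff computation. This is a sketch, not
    the patch itself: the 512-byte threshold separating "tiny" from
    "large" windows and the function name are assumptions made for the
    illustration.

```c
#define DEMO_TINY_WINDOW 512  /* assumed threshold, in bytes */

/* Bound the packet size using the largest window the peer has advertised. */
int demo_bound_to_half_wnd(int max_window, int pktsize)
{
	int cutoff;

	/* Large windows: packetize to half the window, as before. */
	if (max_window >= DEMO_TINY_WINDOW)
		cutoff = max_window / 2;
	else
		/* Tiny windows: do not halve, use the whole window. */
		cutoff = max_window;

	/* Also covers MSS > window: never build a packet the window cannot take. */
	if (cutoff > 0 && pktsize > cutoff)
		return cutoff;
	return pktsize;
}
```

    With a 75-byte window and a 75-byte MSS the packet size stays at 75;
    with a 60-byte window it is capped at 60, so data can be sent without
    waiting for the probe timer.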
     

10 Sep, 2010

1 commit


31 Aug, 2010

2 commits

  • This updates the use of larger initial windows, as originally specified in
    RFC 3390, to use the newer IW values specified in RFC 5681, section 3.1.

    The changes made in RFC 5681 are:
    a) the setting now is more clearly specified in units of segments (as the
    comments by John Heffner emphasized, this was not very clear in RFC 3390);
    b) for connections with 1095 < SMSS
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This patch consolidates initial-window code common to TCP and CCID-2:
    * TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and
    * CCID-2 uses RFC 3390 in a packet-oriented manner (RFC 4341).

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     

25 Aug, 2010

1 commit

  • As reported by Anton Blanchard when we use
    percpu_counter_read_positive() to make our orphan socket limit checks,
    the check can be off by up to num_cpus_online() * batch (which is 32
    by default) which on a 128 cpu machine can be as large as the default
    orphan limit itself.

    Fix this by doing the full expensive sum check if the optimized check
    triggers.

    Reported-by: Anton Blanchard
    Signed-off-by: David S. Miller
    Acked-by: Eric Dumazet

    David S. Miller
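
    The "cheap check first, exact sum only if it triggers" pattern can be
    modelled in user space as below. The struct is a simplified,
    hypothetical stand-in for a percpu counter; the 128 CPUs and the batch
    of 32 are the figures quoted in the message.

```c
#include <stdbool.h>

#define DEMO_NCPUS 128
#define DEMO_BATCH 32   /* each per-CPU delta stays below this in magnitude */

/* Simplified stand-in for a percpu counter. */
struct demo_percpu_counter {
	long global;             /* folded value, cheap to read */
	long local[DEMO_NCPUS];  /* per-CPU deltas not yet folded in */
};

/* Cheap, approximate read: may be off by up to DEMO_NCPUS * DEMO_BATCH. */
long demo_read_positive(const struct demo_percpu_counter *c)
{
	return c->global > 0 ? c->global : 0;
}

/* Expensive, exact read: walks every CPU's delta. */
long demo_sum_positive(const struct demo_percpu_counter *c)
{
	long sum = c->global;

	for (int i = 0; i < DEMO_NCPUS; i++)
		sum += c->local[i];
	return sum > 0 ? sum : 0;
}

bool demo_too_many_orphans(const struct demo_percpu_counter *c, long limit)
{
	/* Fast path: the approximate value is under the limit. */
	if (demo_read_positive(c) <= limit)
		return false;
	/* Approximate check triggered: confirm with the exact sum. */
	return demo_sum_positive(c) > limit;
}
```

    With these figures the approximate value can be off by as much as
    128 * 32 = 4096.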
     

16 Jul, 2010

1 commit


13 Jul, 2010

2 commits

    A new boolean flag no_autobind is added to structure proto to avoid the autobind
    calls when the protocol is TCP. Then sock_rps_record_flow() is called in the
    TCP's sendmsg() and sendpage() paths.

    Signed-off-by: Changli Gao
    ----
    include/net/inet_common.h | 4 ++++
    include/net/sock.h | 1 +
    include/net/tcp.h | 8 ++++----
    net/ipv4/af_inet.c | 15 +++++++++------
    net/ipv4/tcp.c | 11 +++++------
    net/ipv4/tcp_ipv4.c | 3 +++
    net/ipv6/af_inet6.c | 8 ++++----
    net/ipv6/tcp_ipv6.c | 3 +++
    8 files changed, 33 insertions(+), 20 deletions(-)
    Signed-off-by: David S. Miller

    Changli Gao
     
  • remove useless blanks.

    Signed-off-by: Changli Gao
    ----
    include/net/inet_common.h | 55 ++++-------
    include/net/tcp.h | 222 +++++++++++++++++-----------------------------
    include/net/udp.h | 38 +++----
    3 files changed, 123 insertions(+), 192 deletions(-)
    Signed-off-by: David S. Miller

    Changli Gao
     

27 Jun, 2010

1 commit

  • Allows use of ECN when syncookies are in effect by encoding ecn_ok
    into the syn-ack tcp timestamp.

    While at it, remove an unneeded #ifdef CONFIG_SYN_COOKIES.
    With CONFIG_SYN_COOKIES=n, want_cookie is ifdef'd to 0 and gcc
    removes the "if (0)".

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

17 Jun, 2010

1 commit

  • Discard the ACK if we find options that do not match current sysctl
    settings.

    Previously it was possible to create a connection with sack, wscale,
    etc. enabled even if the feature was disabled via sysctl.

    Also remove an unneeded call to tcp_sack_reset() in
    cookie_check_timestamp: Both call sites (cookie_v4_check,
    cookie_v6_check) zero "struct tcp_options_received", hand it to
    tcp_parse_options() (which does not change tcp_opt->num_sacks/dsack)
    and then call cookie_check_timestamp().

    Even if num_sacks/dsacks were changed, the structure is allocated on
    the stack and after cookie_check_timestamp returns only a few selected
    members are copied to the inet_request_sock.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
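
    The first paragraph of this message amounts to a consistency check
    between the options recovered from the cookie ACK and the current
    sysctl settings. A minimal sketch, with hypothetical struct and field
    names:

```c
#include <stdbool.h>

/* Options recovered from the ACK that completes a syncookie handshake. */
struct demo_tcp_opts {
	bool tstamp_ok;
	bool sack_ok;
	bool wscale_ok;
};

/* Current sysctl settings (hypothetical field names). */
struct demo_sysctls {
	bool timestamps;
	bool sack;
	bool window_scaling;
};

/* Returns false if the ACK asks for a feature that is disabled locally. */
bool demo_cookie_opts_acceptable(const struct demo_tcp_opts *o,
				 const struct demo_sysctls *s)
{
	if (o->tstamp_ok && !s->timestamps)
		return false;
	if (o->sack_ok && !s->sack)
		return false;
	if (o->wscale_ok && !s->window_scaling)
		return false;
	return true;
}
```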
     

16 Jun, 2010

1 commit

  • unify tcp flag macros: TCPHDR_FIN, TCPHDR_SYN, TCPHDR_RST, TCPHDR_PSH,
    TCPHDR_ACK, TCPHDR_URG, TCPHDR_ECE and TCPHDR_CWR. TCPCB_FLAG_* are replaced
    with the corresponding TCPHDR_*.

    Signed-off-by: Changli Gao
    ----
    include/net/tcp.h | 24 ++++++-------
    net/ipv4/tcp.c | 8 ++--
    net/ipv4/tcp_input.c | 2 -
    net/ipv4/tcp_output.c | 59 ++++++++++++++++-----------------
    net/netfilter/nf_conntrack_proto_tcp.c | 32 ++++++-----------
    net/netfilter/xt_TCPMSS.c | 4 --
    6 files changed, 58 insertions(+), 71 deletions(-)
    Signed-off-by: David S. Miller

    Changli Gao
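
    For illustration, the unified macros can be used as a plain bit mask
    over the flags byte of the TCP header. The values below are the bit
    positions of that byte; the helper demo_is_synack() is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per flag, in the order of the TCP header flags byte. */
#define TCPHDR_FIN 0x01
#define TCPHDR_SYN 0x02
#define TCPHDR_RST 0x04
#define TCPHDR_PSH 0x08
#define TCPHDR_ACK 0x10
#define TCPHDR_URG 0x20
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80

/* True if both SYN and ACK are set, whatever the other flags are. */
bool demo_is_synack(uint8_t tcp_flags)
{
	const uint8_t want = TCPHDR_SYN | TCPHDR_ACK;

	return (tcp_flags & want) == want;
}
```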
     

07 Jun, 2010

1 commit

    This patch addresses a serious performance issue in reading the
    TCP sockets table (/proc/net/tcp).

    Reading the full table is done by a number of sequential read
    operations. At each read operation, a seek is done to find the
    last socket that was previously read. This seek operation requires
    that the sockets in the table need to be counted up to the current
    file position, and to count each of these requires taking a lock for
    each non-empty bucket. The whole algorithm is O(n^2).

    The fix is to cache the last bucket value, offset within the bucket,
    and the file position returned by the last read operation. On the
    next sequential read, the bucket and offset are used to find the
    last read socket immediately without needing to scan the previous
    buckets of the table. This algorithm to read the whole table is O(n).

    The improvement offered by this patch is easily shown by
    cat'ing /proc/net/tcp on a machine with a lot of connections. With
    about 182K connections in the table, I see the following:

    - Without patch
    time cat /proc/net/tcp > /dev/null

    real 1m56.729s
    user 0m0.214s
    sys 1m56.344s

    - With patch
    time cat /proc/net/tcp > /dev/null

    real 0m0.894s
    user 0m0.290s
    sys 0m0.594s

    Signed-off-by: Tom Herbert
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Tom Herbert
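
    The effect of caching the bucket, the offset and the file position
    can be shown with a toy model that counts how many entries are walked.
    This is a sketch under simplifying assumptions: one entry is returned
    per read, locking is ignored, and all names are hypothetical.

```c
/* Toy hash table: only the number of sockets in each bucket matters here. */
struct demo_table {
	const int *bucket_len;
	int nbuckets;
};

/* Cached iterator state: bucket, offset within it, and absolute position. */
struct demo_cursor {
	int bucket;
	int offset;
	long pos;
};

/*
 * Move the cursor to absolute position pos. Returns the number of entries
 * walked to get there, or -1 if pos is past the end of the table.
 */
long demo_seek(const struct demo_table *t, struct demo_cursor *cur, long pos)
{
	long steps = 0;

	if (pos < cur->pos) {
		/* Backwards seek: the cache is of no use, restart. */
		cur->bucket = 0;
		cur->offset = 0;
		cur->pos = 0;
	}
	while (cur->bucket < t->nbuckets) {
		long left = t->bucket_len[cur->bucket] - cur->offset;

		if (pos - cur->pos < left) {
			steps += pos - cur->pos;
			cur->offset += (int)(pos - cur->pos);
			cur->pos = pos;
			return steps;
		}
		/* Target is not in this bucket: skip what is left of it. */
		steps += left;
		cur->pos += left;
		cur->bucket++;
		cur->offset = 0;
	}
	return -1;
}

/* Read the whole table one entry at a time; returns total entries walked. */
long demo_read_all(const struct demo_table *t, int use_cache)
{
	struct demo_cursor cur = { 0, 0, 0 };
	long total = 0;

	for (long pos = 0; ; pos++) {
		long steps;

		if (!use_cache) {
			/* Old behaviour: every read recounts from the start. */
			cur.bucket = 0;
			cur.offset = 0;
			cur.pos = 0;
		}
		steps = demo_seek(t, &cur, pos);
		if (steps < 0)
			break;
		total += steps;
	}
	return total;
}
```

    For a table with bucket lengths {2, 0, 3} (five sockets), reading
    without the cache walks 0+1+2+3+4 = 10 entries in total, while reading
    with the cache walks 4.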
     

17 May, 2010

1 commit


16 May, 2010

1 commit

  • TCP MD5 support uses percpu data for temporary storage. It currently
    disables preemption so that the same storage cannot be reclaimed by
    another thread on the same cpu.

    We also have to make sure a softirq handler won't also try to use the
    same context. Various bug reports demonstrated corruptions.

    Fix is to disable preemption and BH.

    Reported-by: Bhaskar Dutta
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet