12 Dec, 2011

1 commit


01 Dec, 2011

1 commit

  • Rick Jones reported that a TCP_CONGESTION sockopt performed on a listener
    was ignored for its child sockets: right after accept(), the
    congestion control for the new socket is the system default one.

    This seems to be an oversight in the initial design (quoted from Stephen).

    Based on prior investigation and patch from Rick.

    Reported-by: Rick Jones
    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    CC: Yuchung Cheng
    Tested-by: Rick Jones
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Nov, 2011

1 commit


09 Nov, 2011

1 commit


27 Oct, 2011

1 commit

  • commit 66b13d99d96a (ipv4: tcp: fix TOS value in ACK messages sent from
    TIME_WAIT) fixed IPv4 only.

    This part is for the IPv6 side, adding a tclass param to ip6_xmit()

    We alias tw_tclass and tw_tos, if socket family is INET6.

    [ If the socket is IPv4-mapped, only the IP_TOS socket option is used to
    fill the TOS field; TCLASS is not taken into account. ]

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Oct, 2011

1 commit


21 Oct, 2011

1 commit

  • Adding const qualifiers to pointers can ease code review and help spot
    some bugs. It might also allow the compiler to optimize code further.

    For example, is it legal to temporarily write a null cksum into tcphdr
    in tcp_md5_hash_header()? I am afraid a sniffer could catch the
    temporary null value...

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2011

1 commit

  • The transparent socket option setting was not copied to the time wait
    socket when an inet socket was being replaced by a time wait socket. This
    broke the --transparent option of the socket match and may have caused
    FIN packets belonging to sockets in FIN_WAIT2 or TIME_WAIT state
    to be dropped by the packet filter.

    Signed-off-by: KOVACS Krisztian
    Signed-off-by: David S. Miller

    KOVACS Krisztian
     

09 Jun, 2011

1 commit

  • This patch lowers the default initRTO from 3secs to 1sec per
    RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
    has been retransmitted, AND the TCP timestamp option is not on.

    It also adds support to take RTT sample during 3WHS on the passive
    open side, just like its active open counterpart, and uses it, if
    valid, to seed the initRTO for the data transmission phase.

    The patch also resets ssthresh to its initial default at the
    beginning of the data transmission phase, and reduces cwnd to 1 if
    there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
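
A minimal sketch of the two decisions described in this message, written as plain user-space C rather than kernel code. The function and constant names are hypothetical, and the RTT-sample seeding on the passive open side is left out; only the 1 sec / 3 sec fallback rule and the cwnd reduction are modeled.

```c
#include <stdbool.h>

/* Hypothetical constants, in milliseconds. */
#define INIT_RTO_MS     1000u   /* lowered default initRTO (1 sec) */
#define FALLBACK_RTO_MS 3000u   /* previous default, kept as fallback (3 secs) */

/*
 * RTO to start the data transmission phase with when no RTT sample
 * is available: fall back to 3 secs only if the SYN or SYN-ACK was
 * retransmitted AND the TCP timestamp option is off.
 */
unsigned int rto_for_data_phase_ms(unsigned int handshake_retransmits,
                                   bool timestamps_on)
{
    if (handshake_retransmits > 0 && !timestamps_on)
        return FALLBACK_RTO_MS;
    return INIT_RTO_MS;
}

/*
 * cwnd at the start of the data transmission phase: reduced to 1 only
 * if there was MORE THAN ONE retransmission during the 3WHS.
 */
unsigned int cwnd_for_data_phase(unsigned int default_cwnd,
                                 unsigned int handshake_retransmits)
{
    return handshake_retransmits > 1 ? 1u : default_cwnd;
}
```

With timestamps on, a retransmitted SYN does not trigger the fallback in this sketch, presumably because the timestamp echo still allows an unambiguous RTT measurement.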
     

09 Dec, 2010

2 commits


02 Dec, 2010

1 commit

  • The only thing AF-specific about remembering the timestamp
    for a time-wait TCP socket is getting the peer.

    Abstract that behind a new timewait_sock_ops vector.

    Support for real IPV6 sockets is not filled in yet, but
    curiously this makes timewait recycling start to work
    for v4-mapped ipv6 sockets.

    Signed-off-by: David S. Miller

    David S. Miller
     

01 Dec, 2010

1 commit


24 Sep, 2010

1 commit


13 Jul, 2010

1 commit


12 Apr, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

22 Mar, 2010

1 commit

  • It's currently hard to diagnose when ACK frames are dropped because an
    application set TCP_DEFER_ACCEPT on its listening socket.

    See http://bugzilla.kernel.org/show_bug.cgi?id=15507

    This patch adds an SNMP value, named TCPDeferAcceptDrop:

    netstat -s | grep TCPDeferAcceptDrop
    TCPDeferAcceptDrop: 0

    This counter is incremented every time we drop a pure ACK frame received
    by a socket in SYN_RECV state because its SYNACK retrans count is lower
    than defer_accept value.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
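
The condition under which the counter is bumped can be sketched as follows. This is a simplified user-space model, not the kernel code: the struct, the function, and the global counter are hypothetical stand-ins for the request sock, the check on the SYN_RECV path, and the TCPDeferAcceptDrop SNMP value.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the TCPDeferAcceptDrop SNMP counter. */
unsigned long defer_accept_drop_count;

/* Hypothetical, simplified view of a request in SYN_RECV state. */
struct synrecv_req {
    unsigned int synack_retrans;    /* SYN-ACK retransmissions so far */
};

/*
 * Returns true (and bumps the counter) when a pure ACK is dropped
 * because the SYN-ACK retransmit count is still lower than the
 * listener's defer_accept value.
 */
bool defer_accept_drops_ack(const struct synrecv_req *req,
                            unsigned int defer_accept,
                            bool ack_carries_data)
{
    if (ack_carries_data)
        return false;               /* not a pure ACK: not deferred */
    if (req->synack_retrans < defer_accept) {
        defer_accept_drop_count++;
        return true;
    }
    return false;
}
```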
     

06 Mar, 2010

1 commit


16 Dec, 2009

1 commit

  • It creates a regression, triggering badness for SYN_RECV
    sockets, for example:

    [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
    [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
    [19148.023035] REGS: eeecbd30 TRAP: 0700 Not tainted (2.6.32)
    [19148.023496] MSR: 00029032 CR: 24002442 XER: 00000000
    [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000

    This is likely caused by the change in the 'estab' parameter
    passed to tcp_parse_options() when invoked by the functions
    in net/ipv4/tcp_minisocks.c

    But even if that is fixed, the ->conn_request() changes made in
    this patch series are fundamentally wrong. They try to use the
    listening socket's 'dst' to probe the route settings. The
    listening socket doesn't even have a route, and you can't
    get the right route (the child request one) until much later
    after we setup all of the state, and it must be done by hand.

    This stuff really isn't ready, so the best thing to do is a
    full revert. This reverts the following commits:

    f55017a93f1a74d50244b1254b9a2bd7ac9bbf7d
    022c3f7d82f0f1c68018696f2f027b87b9bb45c2
    1aba721eba1d84a2defce45b950272cee1e6c72a
    cda42ebd67ee5fdf09d7057b5a4584d36fe8a335
    345cda2fd695534be5a4494f1b59da9daed33663
    dc343475ed062e13fc260acccaab91d7d80fd5b2
    05eaade2782fb0c90d3034fd7a7d5a16266182bb
    6a2a2d6bf8581216e08be15fcb563cfd6c430e1e

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Dec, 2009

3 commits

  • Parse incoming TCP_COOKIE option(s).

    Calculate TCP_COOKIE option.

    Send optional data.

    This is a significantly revised implementation of an earlier (year-old)
    patch that no longer applies cleanly, with permission of the original
    author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

    Requires:
    TCPCT part 1a: add request_values parameter for sending SYNACK
    TCPCT part 1b: generate Responder Cookie secret
    TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
    TCPCT part 1d: define TCP cookie option, extend existing struct's
    TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS
    TCPCT part 1f: Initiator Cookie => Responder

    Signed-off-by: William.Allen.Simpson@gmail.com
    Signed-off-by: David S. Miller

    William Allen Simpson
     
  • Data structures are carefully composed to require minimal additions.
    For example, the struct tcp_options_received cookie_plus variable fits
    between existing 16-bit and 8-bit variables, requiring no additional
    space (taking alignment into consideration). There are no additions to
    tcp_request_sock, and only 1 pointer in tcp_sock.

    This is a significantly revised implementation of an earlier (year-old)
    patch that no longer applies cleanly, with permission of the original
    author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

    The principal difference is using a TCP option to carry the cookie nonce,
    instead of a user configured offset in the data. This is more flexible and
    less subject to user configuration error. Such a cookie option has been
    suggested for many years, and is also useful without SYN data, allowing
    several related concepts to use the same extension option.

    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

    "Re: what a new TCP header might look like", May 12, 1998.
    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

    These functions will also be used in subsequent patches that implement
    additional features.

    Requires:
    TCPCT part 1a: add request_values parameter for sending SYNACK
    TCPCT part 1b: generate Responder Cookie secret
    TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS

    Signed-off-by: William.Allen.Simpson@gmail.com
    Signed-off-by: David S. Miller

    William Allen Simpson
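
The "fits between existing 16-bit and 8-bit variables" remark can be illustrated with a small layout experiment. The two structs below are hypothetical and only loosely modeled on the description above; they are not the real struct tcp_options_received, and the result assumes a common ABI where 16-bit fields are 2-byte aligned.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical "before" layout: the 8-bit field after the 16-bit
 * field is followed by one byte of alignment padding. */
struct opts_before {
    uint16_t flags;
    uint8_t  num_sacks;
    /* 1 byte of padding here on common ABIs */
    uint16_t user_mss;
    uint16_t mss_clamp;
};

/* Hypothetical "after" layout: the new 8-bit field takes the place
 * of the padding byte, so the struct does not grow. */
struct opts_after {
    uint16_t flags;
    uint8_t  cookie_plus;
    uint8_t  num_sacks;
    uint16_t user_mss;
    uint16_t mss_clamp;
};

/* Size difference caused by adding the new field (0 if it fit). */
size_t added_bytes(void)
{
    return sizeof(struct opts_after) - sizeof(struct opts_before);
}
```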
     
  • Add optional function parameters associated with sending SYNACK.
    These parameters are not needed after sending SYNACK, and are not
    used for retransmission. Avoids extending struct tcp_request_sock,
    and avoids allocating kernel memory.

    Also affects DCCP as it uses common struct request_sock_ops,
    but this parameter is currently reserved for future use.

    Signed-off-by: William.Allen.Simpson@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    William Allen Simpson
     

22 Nov, 2009

1 commit


14 Nov, 2009

1 commit

  • Define two symbols needed in both kernel and user space.

    Remove old (somewhat incorrect) kernel variant that wasn't used in
    most cases. Default should apply to both RMSS and SMSS (RFC2581).

    Replace numeric constants with defined symbols.

    Stand-alone patch, originally developed for TCPCT.

    Signed-off-by: William.Allen.Simpson@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    William Allen Simpson
     

05 Nov, 2009

1 commit

  • Calling the IPv4-specific inet_csk_route_req() in tcp_check_req()
    is a bad idea and crashes the machine on IPv6 connections, as reported
    by Valdis Kletnieks.

    Also, all we are really interested in is the timestamp
    option in the header, so calling tcp_parse_options()
    with the "estab" flag set to false is overkill, as
    it tries to parse half a dozen other TCP options.

    We know whether timestamp should be enabled or not
    using data from request_sock.

    Signed-off-by: Gilad Ben-Yossef
    Tested-by: Valdis.Kletnieks@vt.edu
    Signed-off-by: David S. Miller

    Gilad Ben-Yossef
     

29 Oct, 2009

2 commits


20 Oct, 2009

2 commits

  • Willy Tarreau and many other folks in recent years
    were concerned about what happens when the TCP_DEFER_ACCEPT period
    expires for clients which sent an ACK packet. They prefer clients
    that actively resend ACK on our SYN-ACK retransmissions to be
    converted from open requests to sockets and queued to the
    listener for accepting after the deferring period is finished.
    Then the application server can decide to wait longer for data
    or to properly terminate the connection with FIN if read()
    returns EAGAIN, which is an indication of accepting after
    the deferring period.

    This change can still have side effects
    for applications that always expect to see data on the accepted
    socket. Others can be prepared to work in both modes (with or
    without a TCP_DEFER_ACCEPT period): their data processing can
    ignore the read=EAGAIN notification and allocate resources for
    clients which proved to have no data to send during the deferring
    period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as a flag (not
    as a timeout) to wait for data will notice clients that didn't
    send data for 3 seconds but that still resend ACKs.

    Thanks to Willy Tarreau for the initial idea and to
    Eric Dumazet for reviewing and testing the change.

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • This reverts commit 6d01a026b7d3009a418326bdcf313503a314f1ea.

    Julian Anastasov, Willy Tarreau and Eric Dumazet have come up
    with a more correct way to deal with this.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 Oct, 2009

1 commit

  • I was trying to use TCP_DEFER_ACCEPT and noticed that if the
    client does not talk, the connection is never accepted and
    remains in SYN_RECV state until the retransmits expire, where
    it finally is deleted. This is bad when some firewall such as
    netfilter sits between the client and the server because the
    firewall sees the connection in ESTABLISHED state while the
    server will finally silently drop it without sending an RST.

    This behaviour contradicts the man page which says it should
    wait only for some time :

    TCP_DEFER_ACCEPT (since Linux 2.4)
    Allows a listener to be awakened only when data arrives
    on the socket. Takes an integer value (seconds), this
    can bound the maximum number of attempts TCP will
    make to complete the connection. This option should not
    be used in code intended to be portable.

    Also, looking at ipv4/tcp.c, a retransmit counter is correctly
    computed :

    case TCP_DEFER_ACCEPT:
            icsk->icsk_accept_queue.rskq_defer_accept = 0;
            if (val > 0) {
                    /* Translate value in seconds to number of
                     * retransmits */
                    while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
                           val > ((TCP_TIMEOUT_INIT / HZ) <<
                                  icsk->icsk_accept_queue.rskq_defer_accept))
                            icsk->icsk_accept_queue.rskq_defer_accept++;
                    icsk->icsk_accept_queue.rskq_defer_accept++;
            }
            break;

    ==> rskq_defer_accept is used as a counter of retransmits.

    But in tcp_minisocks.c, this counter is only checked. And in
    fact, I have found no location which updates it. So I think
    that what was intended was to decrease it in tcp_minisocks
    whenever it is checked, which the trivial patch below does.

    Signed-off-by: Willy Tarreau
    Signed-off-by: David S. Miller

    Willy Tarreau
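
The seconds-to-retransmits translation quoted in this message can be tried out in user space with a sketch like the one below. The function name is hypothetical, and the 3-second initial timeout is an assumption standing in for TCP_TIMEOUT_INIT / HZ; the shift is done in 64 bits so large values do not overflow.

```c
/* Assumed initial SYN-ACK timeout in seconds (TCP_TIMEOUT_INIT / HZ). */
#define TIMEOUT_INIT_SECS 3ULL

/*
 * Translate a TCP_DEFER_ACCEPT value in seconds into a number of
 * retransmits, following the loop quoted from ipv4/tcp.c: count how
 * many doublings of the initial timeout stay below val, plus one.
 */
unsigned int defer_accept_secs_to_retrans(int val)
{
    unsigned int retrans = 0;

    if (val > 0) {
        while (retrans < 32 &&
               (unsigned long long)val > (TIMEOUT_INIT_SECS << retrans))
            retrans++;
        retrans++;
    }
    return retrans;
}
```

Under these assumptions a value of 1 second maps to 1 retransmit, 4 seconds to 2, and 10 seconds to 3.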
     

16 Sep, 2009

1 commit

  • I recently came across a preemption imbalance detected by:

    huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101?
    ------------[ cut here ]------------
    kernel BUG at /usr/src/linux/kernel/timer.c:664!
    invalid opcode: 0000 [1] PREEMPT SMP

    with ffffffff80644630 being inet_twdr_hangman().

    This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a
    bit, so I looked at what might have caused it.

    One thing that struck me as strange is tcp_twsk_destructor(), as it
    calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the
    detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well,
    as far as I can tell.

    Signed-off-by: Robert Varga
    Signed-off-by: David S. Miller

    Robert Varga
     

15 Sep, 2009

1 commit

  • Once upon a time, snd_ssthresh was a 16-bit quantity.
    ...That has not been true for a long period of time. I ran across
    some ancient compares which still seem to trust such legacy.
    Put all that magic into a single place; I hopefully found all
    of them.

    Compile tested, though linking of allyesconfig is ridiculous
    nowadays it seems.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

03 Sep, 2009

1 commit

  • This fixed a lockdep warning which appeared when doing stress
    memory tests over NFS:

    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.

    page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock

    mount_root => nfs_root_data => tcp_close => lock sk_lock =>
    tcp_send_fin => alloc_skb_fclone => page reclaim

    David raised a concern that if the allocation fails in tcp_send_fin(), and it's
    GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting
    for the allocation to succeed.

    But the fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks
    weird, but it is no worse than the implicit sleep inside GFP_KERNEL. Both could
    loop endlessly under memory pressure.

    CC: Arnaldo Carvalho de Melo
    CC: David S. Miller
    CC: Herbert Xu
    Signed-off-by: Wu Fengguang
    Signed-off-by: David S. Miller

    Wu Fengguang
     

29 Aug, 2009

1 commit


26 Jun, 2009

1 commit

  • RFC0793 defines that in FIN-WAIT-2 state, if the ACK bit is off, drop
    the segment and return [Page 72]. But this check is missing in function
    tcp_timewait_state_process(). This causes a segment with the FIN flag but
    no ACK to be handled in two different ways:

    Case 1:
    Node A                      Node B

                                (enter FIN-WAIT-2)
    FIN    ------------->       discard
                                (move sk to tw list)

    Case 2:
    Node A                      Node B

                                (enter FIN-WAIT-2)
                                (move sk to tw list)
    FIN    ------------->


    Signed-off-by: David S. Miller

    Wei Yongjun
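
The missing check can be sketched as a tiny user-space model. The types and names below are hypothetical and cover only the rule cited from RFC0793 (no ACK bit in FIN-WAIT-2 means drop); the sequence number checks performed by the real function are not modeled.

```c
#include <stdbool.h>

/* Hypothetical verdicts for a segment hitting a FIN-WAIT-2 socket. */
enum fw2_verdict {
    FW2_DROP,       /* discard the segment and return */
    FW2_PROCESS     /* continue with normal processing */
};

/* Hypothetical, minimal view of the TCP flags of a received segment. */
struct seg_flags {
    bool fin;
    bool ack;
};

/*
 * In FIN-WAIT-2, a segment without the ACK bit is dropped whether or
 * not the socket has already been moved to the time-wait list, so
 * both cases above behave the same way.
 */
enum fw2_verdict fin_wait2_check(const struct seg_flags *seg)
{
    if (!seg->ack)
        return FW2_DROP;
    return FW2_PROCESS;
}
```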
     

16 Mar, 2009

1 commit

  • Wow, it was quite tricky to merge that stream of negations
    but I think I finally got it right:

    check & replace_ts_recent:
    (s32)(rcv_tsval - ts_recent) >= 0               => 0
    (s32)(ts_recent - rcv_tsval) <= 0               => 0

    discard:
    (s32)(ts_recent - rcv_tsval) >  TCP_PAWS_WINDOW => 1
    (s32)(ts_recent - rcv_tsval) <= TCP_PAWS_WINDOW => 0

    I toggled the return values of tcp_paws_check around since
    the old encoding added yet-another negation making tracking
    of truth-values really complicated.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
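
A minimal sketch of a merged check in the spirit of this message, as user-space C. The function name and the "true means the timestamp passes" convention are assumptions made for illustration, and any other conditions the real tcp_paws_check() may apply are left out. The signed 32-bit difference is what makes the comparison survive timestamp wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when rcv_tsval is acceptable relative to ts_recent,
 * allowing it to lag by at most paws_win ticks. Callers that decide
 * whether to replace ts_recent would pass 0; callers that decide
 * whether to discard would pass their PAWS window.
 */
bool paws_check(uint32_t ts_recent, uint32_t rcv_tsval, int32_t paws_win)
{
    /* Unsigned subtraction wraps; the cast recovers the signed distance. */
    return (int32_t)(ts_recent - rcv_tsval) <= paws_win;
}
```

For example, with a window of 0, a rcv_tsval one tick newer than ts_recent passes, one tick older fails, and a newer value that has wrapped past 2^32 still passes.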
     

03 Mar, 2009

1 commit

  • The above functions from include/net/tcp.h have been defined with an
    argument that they never use. The argument is 'u32 ack' which is never
    used inside the function body, and thus it can be removed. The rest of
    the patch involves the necessary changes to the function callers of the
    above two functions.

    Signed-off-by: Hantzis Fotis
    Signed-off-by: David S. Miller

    Hantzis Fotis
     

02 Mar, 2009

1 commit


03 Nov, 2008

1 commit