20 Dec, 2011

1 commit

  • DaveM said:
    Please, this kind of stuff rots forever and not using bool properly
    drives me crazy.

    Joe Perches gave me the spatch script:

    @@
    bool b;
    @@
    -b = 0
    +b = false
    @@
    bool b;
    @@
    -b = 1
    +b = true

    I merely installed coccinelle, read the documentation and took credit.

    Signed-off-by: Rusty Russell
    Signed-off-by: David S. Miller

    Rusty Russell
     

13 Dec, 2011

1 commit

  • This patch replaces all uses of struct sock fields' memory_pressure,
    memory_allocated, sockets_allocated, and sysctl_mem to acessor
    macros. Those macros can either receive a socket argument, or a mem_cgroup
    argument, depending on the context they live in.

    Since we're only doing a macro wrapping here, no performance impact at all is
    expected in the case where we don't have cgroups disabled.

    Signed-off-by: Glauber Costa
    Reviewed-by: Hiroyouki Kamezawa
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
     

12 Dec, 2011

1 commit


25 Oct, 2011

1 commit


21 Feb, 2011

1 commit


18 Oct, 2010

1 commit

  • As CWR is stronger than CA_Disorder state, we can miscount
    SACK/Reno failure into other timeouts. Not a bad problem as
    it can happen only due to ECN, FRTO detecting spurious RTO
    or xmit error which are the only callers of tcp_enter_cwr.
    And even then losses and RTO must still follow thereafter
    to actually end up into the relevant code paths.

    Compile tested.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

05 Oct, 2010

1 commit


29 Sep, 2010

1 commit

  • Fixes kernel Bugzilla Bug 18952

    This patch adds a syn_set parameter to the retransmits_timed_out()
    routine and updates its callers. If not set, TCP_RTO_MIN is taken
    as the calculation basis as before. If set, TCP_TIMEOUT_INIT is
    used instead, so that sysctl_syn_retries represents the actual
    amount of SYN retransmissions in case no SYNACKs are received when
    establishing a new connection.

    Signed-off-by: Damian Lukowski
    Signed-off-by: David S. Miller

    Damian Lukowski
     

10 Sep, 2010

1 commit


31 Aug, 2010

1 commit

  • This patch provides a "user timeout" support as described in RFC793. The
    socket option is also needed for the the local half of RFC5482 "TCP User
    Timeout Option".

    TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
    when > 0, to specify the maximum amount of time in ms that transmitted
    data may remain unacknowledged before TCP will forcefully close the
    corresponding connection and return ETIMEDOUT to the application. If
    0 is given, TCP will continue to use the system default.

    Increasing the user timeouts allows a TCP connection to survive extended
    periods without end-to-end connectivity. Decreasing the user timeouts
    allows applications to "fail fast" if so desired. Otherwise it may take
    upto 20 minutes with the current system defaults in a normal WAN
    environment.

    The socket option can be made during any state of a TCP connection, but
    is only effective during the synchronized states of a connection
    (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
    Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
    TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
    connection due to keepalive failure.

    The option does not change in anyway when TCP retransmits a packet, nor
    when a keepalive probe will be sent.

    This option, like many others, will be inherited by an acceptor from its
    listener.

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
     

25 Aug, 2010

1 commit

  • As reported by Anton Blanchard when we use
    percpu_counter_read_positive() to make our orphan socket limit checks,
    the check can be off by up to num_cpus_online() * batch (which is 32
    by default) which on a 128 cpu machine can be as large as the default
    orphan limit itself.

    Fix this by doing the full expensive sum check if the optimized check
    triggers.

    Reported-by: Anton Blanchard
    Signed-off-by: David S. Miller
    Acked-by: Eric Dumazet

    David S. Miller
     

13 Jul, 2010

1 commit


28 Apr, 2010

1 commit

  • RFC 1122 says the following:
    ...
    Keep-alive packets MUST only be sent when no data or
    acknowledgement packets have been received for the
    connection within an interval.
    ...

    The acknowledgement packet is reseting the keepalive
    timer but the data packet isn't. This patch fixes it by
    checking the timestamp of the last received data packet
    too when the keepalive timer expires.

    Signed-off-by: Flavio Leitner
    Signed-off-by: Eric Dumazet
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Flavio Leitner
     

13 Apr, 2010

1 commit

  • With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
    work.

    sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

    This rwlock is readlocked for a very small amount of time, and dst
    entries are already freed after RCU grace period. This calls for RCU
    again :)

    This patch converts sk_dst_lock to a spinlock, and use RCU for readers.

    __sk_dst_get() is supposed to be called with rcu_read_lock() or if
    socket locked by user, so use appropriate rcu_dereference_check()
    condition (rcu_read_lock_held() || sock_owned_by_user(sk))

    This patch avoids two atomic ops per tx packet on UDP connected sockets,
    for example, and permits sk_dst_lock to be much less dirtied.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

08 Mar, 2010

1 commit


19 Feb, 2010

1 commit

  • This patch will make TCP use only linear timeouts if the
    stream is thin. This will help to avoid the very high latencies
    that thin stream suffer because of exponential backoff. This
    mechanism is only active if enabled by iocontrol or syscontrol
    and the stream is identified as thin. A maximum of 6 linear
    timeouts is tried before exponential backoff is resumed.

    Signed-off-by: Andreas Petlund
    Signed-off-by: David S. Miller

    Andreas Petlund
     

09 Feb, 2010

1 commit

  • In particular, several occurances of funny versions of 'success',
    'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
    'beginning', 'desirable', 'separate' and 'necessary' are fixed.

    Signed-off-by: Daniel Mack
    Cc: Joe Perches
    Cc: Junio C Hamano
    Signed-off-by: Jiri Kosina

    Daniel Mack
     

18 Jan, 2010

1 commit

  • Currently we don't increment SYN-ACK timeouts & retransmissions
    although we do increment the same stats for SYN. We seem to have lost
    the SYN-ACK accounting with the introduction of tcp_syn_recv_timer
    (commit 2248761e in the netdev-vger-cvs tree).

    This patch fixes this issue. In the process we also rename the v4/v6
    syn/ack retransmit functions for clarity. We also add a new
    request_socket operations (syn_ack_timeout) so we can keep code in
    inet_connection_sock.c protocol agnostic.

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
     

09 Dec, 2009

1 commit


21 Oct, 2009

1 commit

  • dst_negative_advice() should check for changed dst and reset
    sk_tx_queue_mapping accordingly. Pass sock to the callers of
    dst_negative_advice.

    (sk_reset_txq is defined just for use by dst_negative_advice. The
    only way I could find to get around this is to move dst_negative_()
    from dst.h to dst.c, include sock.h in dst.c, etc)

    Signed-off-by: Krishna Kumar
    Signed-off-by: David S. Miller

    Krishna Kumar
     

19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    Goal is to transfert fields used at lookup time in the first
    read-mostly cache line (inside struct sock_common) and move sk_refcnt
    to a separate cache line (only written by rx path)

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Sep, 2009

2 commits

  • RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
    which may represent a number of allowed retransmissions or a timeout value.
    Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
    in number of allowed retransmissions.

    For any desired threshold R2 (by means of time) one can specify tcp_retries2
    (by means of number of retransmissions) such that TCP will not time out
    earlier than R2. This is the case, because the RTO schedule follows a fixed
    pattern, namely exponential backoff.

    However, the RTO behaviour is not predictable any more if RTO backoffs can be
    reverted, as it is the case in the draft
    "Make TCP more Robust to Long Connectivity Disruptions"
    (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).

    In the worst case TCP would time out a connection after 3.2 seconds, if the
    initial RTO equaled MIN_RTO and each backoff has been reverted.

    This patch introduces a function retransmits_timed_out(N),
    which calculates the timeout of a TCP connection, assuming an initial
    RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.

    Whenever timeout decisions are made by comparing the retransmission counter
    to some value N, this function can be used, instead.

    The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
    can occur than the value indicates. However, it yields a timeout which is
    similar to the one of an unpatched, exponentially backing off TCP in the same
    scenario. As no application could rely on an RTO greater than MIN_RTO, there
    should be no risk of a regression.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
     
  • Here, an ICMP host/network unreachable message, whose payload fits to
    TCP's SND.UNA, is taken as an indication that the RTO retransmission has
    not been lost due to congestion, but because of a route failure
    somewhere along the path.
    With true congestion, a router won't trigger such a message and the
    patched TCP will operate as standard TCP.

    This patch reverts one RTO backoff, if an ICMP host/network unreachable
    message, whose payload fits to TCP's SND.UNA, arrives.
    Based on the new RTO, the retransmission timer is reset to reflect the
    remaining time, or - if the revert clocked out the timer - a retransmission
    is sent out immediately.
    Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if
    there have been retransmissions and reversible backoffs, already.

    Changes from v2:
    1) Renaming of skb in tcp_v4_err() moved to another patch.
    2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
    3) Fixed code comments.

    Signed-off-by: Damian Lukowski
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Damian Lukowski
     

29 Aug, 2009

1 commit


02 Mar, 2009

1 commit


19 Dec, 2008

1 commit


26 Nov, 2008

1 commit


03 Nov, 2008

1 commit


31 Oct, 2008

1 commit


30 Oct, 2008

1 commit


29 Oct, 2008

1 commit


08 Oct, 2008

1 commit


26 Jul, 2008

1 commit

  • Removes legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better to automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuively BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON() though some might actually be
    promoted to BUG_ON() but I left that to future.

    I could make at least one BUILD_BUG_ON conversion.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

17 Jul, 2008

1 commit


03 Jul, 2008

1 commit

  • There are some places in TCP that select one MIB index to
    bump snmp statistics like this:

    if ()
    NET_INC_STATS_BH();
    else if ()
    NET_INC_STATS_BH();
    ...
    else
    NET_INC_STATS_BH();

    or in a more tricky but still similar way.

    On the other hand, this NET_INC_STATS_BH is a camouflaged
    increment of percpu variable, which is not that small.

    Factoring those cases out de-bloats 235 bytes on non-preemptible
    i386 config and drives parts of the code into 80 columns.

    add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-235 (-235)
    function old new delta
    tcp_fastretrans_alert 1437 1424 -13
    tcp_dsack_set 137 124 -13
    tcp_xmit_retransmit_queue 690 676 -14
    tcp_try_undo_recovery 283 265 -18
    tcp_sacktag_write_queue 1550 1515 -35
    tcp_update_reordering 162 106 -56
    tcp_retransmit_timer 990 904 -86

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

14 Jun, 2008

1 commit


13 Jun, 2008

1 commit

  • This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87
    ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
    the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38
    ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").

    This change causes several problems, first reported by Ingo Molnar
    as a distcc-over-loopback regression where connections were getting
    stuck.

    Ilpo Järvinen first spotted the locking problems. The new function
    added by this code, tcp_defer_accept_check(), only has the
    child socket locked, yet it is modifying state of the parent
    listening socket.

    Fixing that is non-trivial at best, because we can't simply just grab
    the parent listening socket lock at this point, because it would
    create an ABBA deadlock. The normal ordering is parent listening
    socket --> child socket, but this code path would require the
    reverse lock ordering.

    Next is a problem noticed by Vitaliy Gusev, he noted:

    ----------------------------------------
    >--- a/net/ipv4/tcp_timer.c
    >+++ b/net/ipv4/tcp_timer.c
    >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
    > goto death;
    > }
    >
    >+ if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
    >+ tcp_send_active_reset(sk, GFP_ATOMIC);
    >+ goto death;

    Here socket sk is not attached to listening socket's request queue. tcp_done()
    will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
    release this sk) as socket is not DEAD. Therefore socket sk will be lost for
    freeing.
    ----------------------------------------

    Finally, Alexey Kuznetsov argues that there might not even be any
    real value or advantage to these new semantics even if we fix all
    of the bugs:

    ----------------------------------------
    Hiding from accept() sockets with only out-of-order data only
    is the only thing which is impossible with old approach. Is this really
    so valuable? My opinion: no, this is nothing but a new loophole
    to consume memory without control.
    ----------------------------------------

    So revert this thing for now.

    Signed-off-by: David S. Miller

    David S. Miller
     

12 Jun, 2008

1 commit


14 Apr, 2008

1 commit