21 Dec, 2011

1 commit


13 Dec, 2011

1 commit

  • This patch replaces all uses of the struct sock fields memory_pressure,
    memory_allocated, sockets_allocated, and sysctl_mem with accessor
    macros. Those macros can receive either a socket argument or a
    mem_cgroup argument, depending on the context they live in.

    Since we're only doing a macro wrapping here, no performance impact is
    expected in the case where cgroups are not used.

    Signed-off-by: Glauber Costa
    Reviewed-by: Hiroyuki Kamezawa
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
     

12 Dec, 2011

1 commit


04 Dec, 2011

1 commit

  • Denys Fedoryshchenko reported that SYN+FIN attacks were bringing his
    Linux machines to their limits.

    Don't call conn_request() if the TCP flags include the FIN flag along
    with SYN.

    Reported-by: Denys Fedoryshchenko
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Nov, 2011

5 commits

  • The problem: Senders were overriding cwnd values picked during an undo
    by calling tcp_moderate_cwnd() in tcp_try_to_open().

    The fix: Don't moderate cwnd in tcp_try_to_open() if we're in
    TCP_CA_Open, since doing so is generally unnecessary and specifically
    would override a DSACK-based undo of a cwnd reduction made in fast
    recovery.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Previously, SACK-enabled connections hung around in TCP_CA_Disorder
    state while snd_una==high_seq, just waiting to accumulate DSACKs and
    hopefully undo a cwnd reduction. This could and did lead to the
    following unfortunate scenario: if some incoming ACKs advance snd_una
    beyond high_seq then we were setting undo_marker to 0 and moving to
    TCP_CA_Open, so if (due to reordering in the ACK return path) we
    shortly thereafter received a DSACK then we were no longer able to
    undo the cwnd reduction.

    The change: Simplify the congestion avoidance state machine by
    removing the behavior where SACK-enabled connections hung around in
    the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
    snd_una advances to high_seq or beyond we typically move to
    TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
    TCP_CA_Disorder if we later receive enough DSACKs.

    Other patches in this series will provide other changes that are
    necessary to fully fix this problem.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • The bug: When the ACK field is below snd_una (which can happen when
    ACKs are reordered), senders ignored DSACKs (preventing undo) and did
    not call tcp_fastretrans_alert, so they did not increment
    prr_delivered to reflect newly-SACKed sequence ranges, and did not
    call tcp_xmit_retransmit_queue, thus passing up chances to send out
    more retransmitted and new packets based on any newly-SACKed packets.

    The change: When the ACK field is below snd_una (the "old_ack" goto
    label), call tcp_fastretrans_alert to allow undo based on any
    newly-arrived DSACKs and try to send out more packets based on
    newly-SACKed packets.

    Other patches in this series will provide other changes that are
    necessary to fully fix this problem.

    Signed-off-by: Neal Cardwell
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • The bug: Senders ignored DSACKs after recovery when there were no
    outstanding packets (a common scenario for HTTP servers).

    The change: when there are no outstanding packets (the "no_queue" goto
    label), call tcp_fastretrans_alert() in order to use DSACKs to undo
    congestion window reductions.

    Other patches in this series will provide other changes that are
    necessary to fully fix this problem.

    Signed-off-by: Neal Cardwell
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Allow callers to decide whether an ACK is a duplicate ACK. This is a
    prerequisite to allowing fastretrans_alert to be called from new
    contexts, such as the no_queue and old_ack code paths, from which we
    have extra info that tells us whether an ACK is a dupack.

    Signed-off-by: Neal Cardwell
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Neal Cardwell
     

21 Oct, 2011

3 commits

  • Adding const qualifiers to pointers can ease code review and help spot
    bugs. It might also allow the compiler to optimize the code further.

    For example, is it legal to temporarily write a null checksum into the
    tcphdr in tcp_md5_hash_header()? I am afraid a sniffer could catch the
    temporary null value...

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • tcp_fin() only needs socket pointer, we can remove skb and th params.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since commit 356f039822b ("TCP: increase default initial receive
    window."), we allow the sender to send 10 (TCP_DEFAULT_INIT_RCVWND)
    segments.

    Change tcp_fixup_rcvbuf() to reflect this change, even though no real
    change is expected, since sysctl_tcp_rmem[1] = 87380 and this value is
    bigger than the rcvmem computed by tcp_fixup_rcvbuf() (~23720).

    Note: since commit 356f039822b limited the default window to the
    maximum of 10*1460 and 2*MSS, we use the same heuristic in this patch.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Oct, 2011

1 commit


14 Oct, 2011

1 commit

  • skb truesize currently accounts for the sk_buff struct and only part
    of the skb head; kmalloc() roundings are also ignored.

    Considering that skb_shared_info is larger than sk_buff, it's time to
    take it into account for better memory accounting.

    This patch introduces the SKB_TRUESIZE(X) macro to centralize the
    various assumptions in a single place.

    At skb alloc phase, we put the skb_shared_info struct at the exact end
    of the skb head, to allow better use of memory (lowering the number of
    reallocations), since kmalloc() gives us power-of-two memory blocks.

    Unless SLAB/SLUB debug is active, both skb->head and skb_shared_info
    are aligned to cache lines, as before.

    Note: This patch might trigger performance regressions because of
    misconfigured protocol stacks hitting per-socket or global memory
    limits that were previously not reached. But it's a necessary step
    toward more accurate memory accounting.

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Oct, 2011

1 commit


05 Oct, 2011

1 commit

  • lost_skb_hint is used by tcp_mark_head_lost() to mark the first
    unhandled skb, and lost_cnt_hint is the number of packets or sacked
    packets before the lost_skb_hint. When shifting an skb that is before
    the lost_skb_hint: if tcp_is_fack() is true, the skb has already been
    counted in the lost_cnt_hint; if tcp_is_fack() is false,
    tcp_sacktag_one() will increase the lost_cnt_hint. So tcp_shifted_skb()
    does not need to adjust the lost_cnt_hint by itself. When shifting an
    skb that is equal to lost_skb_hint, the shifted packets will not be
    counted by tcp_mark_head_lost(), so tcp_shifted_skb() should adjust
    the lost_cnt_hint even if tcp_is_fack(tp) is true.

    Signed-off-by: Zheng Yan
    Signed-off-by: David S. Miller

    Yan, Zheng
     

28 Sep, 2011

1 commit

  • Rename struct tcp_skb_cb "flags" to "tcp_flags" to ease code review and
    maintenance.

    Its content is a combination of FIN/SYN/RST/PSH/ACK/URG/ECE/CWR flags

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Sep, 2011

2 commits

  • struct tcp_skb_cb contains a "flags" field containing either tcp flags
    or IP dsfield depending on context (input or output path)

    Introduce ip_dsfield to make the difference clear and ease maintenance.
    If later we want to save space, we can union flags/ip_dsfield

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • While playing with a new ADSL box at home, I discovered that an ECN
    blackhole can trigger a suboptimal quickack mode on Linux: we send one
    ACK for each incoming data frame, without any delay or possible
    piggybacking.

    This is because TCP_ECN_check_ce() considers that if no ECT is seen on
    a segment, the segment must have been a retransmit.

    Refine this heuristic and apply it only if we have seen ECT in a
    previous segment, to detect an ECN blackhole at the IP level.

    Signed-off-by: Eric Dumazet
    CC: Jamal Hadi Salim
    CC: Jerry Chu
    CC: Ilpo Järvinen
    CC: Jim Gettys
    CC: Dave Taht
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2011

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/Kconfig
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
    drivers/net/ethernet/broadcom/tg3.c
    drivers/net/wireless/iwlwifi/iwl-pci.c
    drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
    drivers/net/wireless/rt2x00/rt2800usb.c
    drivers/net/wireless/wl12xx/main.c

    David S. Miller
     

19 Sep, 2011

1 commit

  • D-SACK is allowed to reside below snd_una. But the corresponding check
    in tcp_is_sackblock_valid() is the exact opposite. It looks like a typo.

    Signed-off-by: Zheng Yan
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Zheng Yan
     

25 Aug, 2011

1 commit

  • This patch implements Proportional Rate Reduction (PRR) for TCP.
    PRR is an algorithm that determines TCP's sending rate in fast
    recovery. PRR avoids excessive window reductions and aims for
    the actual congestion window size at the end of recovery to be as
    close as possible to the window determined by the congestion control
    algorithm. PRR also improves accuracy of the amount of data sent
    during loss recovery.

    The patch implements the recommended flavor of PRR called PRR-SSRB
    (Proportional rate reduction with slow start reduction bound) and
    replaces the existing rate halving algorithm. PRR improves upon the
    existing Linux fast recovery under a number of conditions including:
    1) burst losses where the losses implicitly reduce the amount of
    outstanding data (pipe) below the ssthresh value selected by the
    congestion control algorithm and,
    2) losses near the end of short flows where the application runs out
    of data to send.

    As an example, with the existing rate halving implementation a single
    loss event can cause a connection carrying short Web transactions to
    go into the slow start mode after the recovery. This is because during
    recovery Linux pulls the congestion window down to packets_in_flight+1
    on every ACK. A short Web response often runs out of new data to send
    and its pipe reduces to zero by the end of recovery when all its packets
    are drained from the network. Subsequent HTTP responses using the same
    connection will have to slow start to raise cwnd to ssthresh. PRR on
    the other hand aims for the cwnd to be as close as possible to ssthresh
    by the end of recovery.

    A description of PRR and a discussion of its performance can be found at
    the following links:
    - IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
    - IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
    - Paper to appear in Internet Measurements Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng

    Signed-off-by: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     

09 Jun, 2011

1 commit

  • This patch lowers the default initRTO from 3secs to 1sec per
    RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
    has been retransmitted, AND the TCP timestamp option is not on.

    It also adds support to take RTT sample during 3WHS on the passive
    open side, just like its active open counterpart, and uses it, if
    valid, to seed the initRTO for the data transmission phase.

    The patch also resets ssthresh to its initial default at the
    beginning of the data transmission phase, and reduces cwnd to 1 if
    there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
     

23 Mar, 2011

2 commits

  • Signed-off-by: David S. Miller

    David S. Miller
     
  • In the current undo logic, cwnd is moderated after it is restored to
    the value it had prior to entering fast recovery. It is moderated
    first in tcp_try_undo_recovery, then again in tcp_complete_cwr.

    Since the undo indicates recovery was false, these moderations
    are not necessary. If the undo is triggered when most of the
    outstanding data have been acknowledged, the (restored) cwnd is
    falsely pulled down to a small value.

    This patch removes these cwnd moderations if cwnd is undone
    a) during fast-recovery
    b) by receiving DSACKs past fast-recovery

    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

16 Mar, 2011

1 commit


15 Mar, 2011

1 commit

  • In the congestion control interface, the callback for each ACK
    includes an estimated round-trip time in microseconds. Some algorithms
    need high resolution (Vegas style), but most only need jiffy
    resolution. If the RTT is not accurate (as with a retransmission), -1
    is used as a flag value.

    When doing coarse resolution, if the RTT is less than a jiffy then 0
    should be returned rather than no estimate. Otherwise algorithms that
    expect good ACKs to trigger slow start (like CUBIC Hystart) will be
    confused.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     

04 Mar, 2011

1 commit


22 Feb, 2011

1 commit

  • Fix a bug where undo_retrans is incorrectly decremented when
    undo_marker is not set or undo_retrans is already 0. This happens when
    the sender receives more DSACK ACKs than packets retransmitted during
    the current undo phase. It may also happen when the sender receives a
    DSACK after the undo operation is completed or cancelled.

    Fix another bug where undo_retrans is incorrectly incremented when the
    sender retransmits an skb and tcp_skb_pcount(skb) > 1 (TSO). This case
    is rare but not impossible.

    Signed-off-by: Yuchung Cheng
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

03 Feb, 2011

1 commit


26 Jan, 2011

1 commit

  • This patch fixes a bug that causes TCP RST packets to be generated
    on otherwise correctly behaved applications, e.g., no unread data
    on close,..., etc. To trigger the bug, at least two conditions must
    be met:

    1. The FIN flag is set on the last data packet, i.e., it's not on a
    separate, FIN only packet.
    2. The size of the last data chunk on the receive side matches
    exactly with the size of buffer posted by the receiver, and the
    receiver closes the socket without any further read attempt.

    This bug was first noticed on our netperf-based testbed for our IW10
    proposal to the IETF, where a large number of RST packets were
    observed. netperf's read-side code meets condition 2 above 100% of
    the time.

    Before the fix, tcp_data_queue() would queue the last skb that meets
    condition 1 to sk_receive_queue even though its data had already been
    fully copied out (skb_copy_datagram_iovec()). Then if condition 2 is
    also met, tcp_recvmsg() often returns all the copied-out data
    successfully without actually consuming the skb, due to a check
    "if ((chunk = len - tp->ucopy.len) != 0) {"
    and
    "len -= chunk;"
    after tcp_prequeue_process() that causes "len" to become 0 and an
    early exit from the big while loop.

    I don't see any reason not to free the skb whose data have been fully
    consumed in tcp_data_queue(), regardless of the FIN flag. We won't
    get there if MSG_PEEK is on. Am I missing some arcane cases related
    to urgent data?

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
     

24 Dec, 2010

1 commit

  • Commit 86bcebafc5e7f5 ("tcp: fix >2 iw selection") fixed a case where
    congestion window initialization had been mistakenly omitted, by
    introducing a cwnd label and putting a backwards goto at the end of
    the function.

    This makes the code unnecessarily tricky to read and understand at
    first sight.

    Shuffle the code around a little bit to make it more obvious.

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

10 Dec, 2010

1 commit

  • Use helper functions to hide all direct accesses, especially writes,
    to dst_entry metrics values.

    This will allow us to:

    1) More easily change how the metrics are stored.

    2) Implement COW for metrics.

    In particular this will help us put metrics into the inetpeer
    cache if that is what we end up doing. We can make the _metrics
    member a pointer instead of an array, initially have it point
    at the read-only metrics in the FIB, and then on the first set
    grab an inetpeer entry and point the _metrics member there.

    Signed-off-by: David S. Miller
    Acked-by: Eric Dumazet

    David S. Miller
     

11 Nov, 2010

1 commit

  • Robin Holt tried to boot a 16TB machine and found that some limits
    were reached: sysctl_tcp_mem[2], sysctl_udp_mem[2].

    We can switch the infrastructure to use "long" instead of "int", now
    that atomic_long_t primitives are available for free.

    Signed-off-by: Eric Dumazet
    Reported-by: Robin Holt
    Reviewed-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Oct, 2010

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Update broken web addresses in arch directory.
    Update broken web addresses in the kernel.
    Revert "drivers/usb: Remove unnecessary return's from void functions" for musb gadget
    Revert "Fix typo: configuation => configuration" partially
    ida: document IDA_BITMAP_LONGS calculation
    ext2: fix a typo on comment in ext2/inode.c
    drivers/scsi: Remove unnecessary casts of private_data
    drivers/s390: Remove unnecessary casts of private_data
    net/sunrpc/rpc_pipe.c: Remove unnecessary casts of private_data
    drivers/infiniband: Remove unnecessary casts of private_data
    drivers/gpu/drm: Remove unnecessary casts of private_data
    kernel/pm_qos_params.c: Remove unnecessary casts of private_data
    fs/ecryptfs: Remove unnecessary casts of private_data
    fs/seq_file.c: Remove unnecessary casts of private_data
    arm: uengine.c: remove C99 comments
    arm: scoop.c: remove C99 comments
    Fix typo configue => configure in comments
    Fix typo: configuation => configuration
    Fix typo interrest[ing|ed] => interest[ing|ed]
    Fix various typos of valid in comments
    ...

    Fix up trivial conflicts in:
    drivers/char/ipmi/ipmi_si_intf.c
    drivers/usb/gadget/rndis.c
    net/irda/irnet/irnet_ppp.c

    Linus Torvalds
     

18 Oct, 2010

2 commits

  • The patch below updates broken web addresses in the kernel

    Signed-off-by: Justin P. Mattock
    Cc: Maciej W. Rozycki
    Cc: Geert Uytterhoeven
    Cc: Finn Thain
    Cc: Randy Dunlap
    Cc: Matt Turner
    Cc: Dimitry Torokhov
    Cc: Mike Frysinger
    Acked-by: Ben Pfaff
    Acked-by: Hans J. Koch
    Reviewed-by: Finn Thain
    Signed-off-by: Jiri Kosina

    Justin P. Mattock
     
  • When only fast rexmit should be done, tcp_mark_head_lost marks L too
    far. Also, sacked_upto below 1 is a perfectly valid number; the
    packets == 0 case then needs to be trapped elsewhere.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

07 Oct, 2010

1 commit


05 Oct, 2010

1 commit


30 Sep, 2010

1 commit