21 Dec, 2011
1 commit
-
to record the state of SACK/FACK and DSACK for better readability and maintenance.
Signed-off-by: Vijay Subramanian
Signed-off-by: David S. Miller
13 Dec, 2011
1 commit
-
This patch replaces all uses of struct sock fields' memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem to acessor
macros. Those macros can either receive a socket argument, or a mem_cgroup
argument, depending on the context they live in.Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.Signed-off-by: Glauber Costa
Reviewed-by: Hiroyouki Kamezawa
CC: David S. Miller
CC: Eric W. Biederman
CC: Eric Dumazet
Signed-off-by: David S. Miller
12 Dec, 2011
1 commit
-
Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
04 Dec, 2011
1 commit
-
Denys Fedoryshchenko reported that SYN+FIN attacks were bringing his
linux machines to their limits.Dont call conn_request() if the TCP flags includes SYN flag
Reported-by: Denys Fedoryshchenko
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
28 Nov, 2011
5 commits
-
The problem: Senders were overriding cwnd values picked during an undo
by calling tcp_moderate_cwnd() in tcp_try_to_open().The fix: Don't moderate cwnd in tcp_try_to_open() if we're in
TCP_CA_Open, since doing so is generally unnecessary and specifically
would override a DSACK-based undo of a cwnd reduction made in fast
recovery.Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller -
Previously, SACK-enabled connections hung around in TCP_CA_Disorder
state while snd_una==high_seq, just waiting to accumulate DSACKs and
hopefully undo a cwnd reduction. This could and did lead to the
following unfortunate scenario: if some incoming ACKs advance snd_una
beyond high_seq then we were setting undo_marker to 0 and moving to
TCP_CA_Open, so if (due to reordering in the ACK return path) we
shortly thereafter received a DSACK then we were no longer able to
undo the cwnd reduction.The change: Simplify the congestion avoidance state machine by
removing the behavior where SACK-enabled connections hung around in
the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
snd_una advances to high_seq or beyond we typically move to
TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
TCP_CA_Disorder if we later receive enough DSACKs.Other patches in this series will provide other changes that are
necessary to fully fix this problem.Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller -
The bug: When the ACK field is below snd_una (which can happen when
ACKs are reordered), senders ignored DSACKs (preventing undo) and did
not call tcp_fastretrans_alert, so they did not increment
prr_delivered to reflect newly-SACKed sequence ranges, and did not
call tcp_xmit_retransmit_queue, thus passing up chances to send out
more retransmitted and new packets based on any newly-SACKed packets.The change: When the ACK field is below snd_una (the "old_ack" goto
label), call tcp_fastretrans_alert to allow undo based on any
newly-arrived DSACKs and try to send out more packets based on
newly-SACKed packets.Other patches in this series will provide other changes that are
necessary to fully fix this problem.Signed-off-by: Neal Cardwell
Acked-by: Ilpo Järvinen
Signed-off-by: David S. Miller -
The bug: Senders ignored DSACKs after recovery when there were no
outstanding packets (a common scenario for HTTP servers).The change: when there are no outstanding packets (the "no_queue" goto
label), call tcp_fastretrans_alert() in order to use DSACKs to undo
congestion window reductions.Other patches in this series will provide other changes that are
necessary to fully fix this problem.Signed-off-by: Neal Cardwell
Acked-by: Ilpo Järvinen
Signed-off-by: David S. Miller -
Allow callers to decide whether an ACK is a duplicate ACK. This is a
prerequisite to allowing fastretrans_alert to be called from new
contexts, such as the no_queue and old_ack code paths, from which we
have extra info that tells us whether an ACK is a dupack.Signed-off-by: Neal Cardwell
Acked-by: Ilpo Järvinen
Signed-off-by: David S. Miller
21 Oct, 2011
3 commits
-
Adding const qualifiers to pointers can ease code review, and spot some
bugs. It might allow compiler to optimize code further.For example, is it legal to temporary write a null cksum into tcphdr
in tcp_md5_hash_header() ? I am afraid a sniffer could catch the
temporary null value...Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller -
tcp_fin() only needs socket pointer, we can remove skb and th params.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller -
Since commit 356f039822b (TCP: increase default initial receive
window.), we allow sender to send 10 (TCP_DEFAULT_INIT_RCVWND) segments.Change tcp_fixup_rcvbuf() to reflect this change, even if no real change
is expected, since sysctl_tcp_rmem[1] = 87380 and this value
is bigger than tcp_fixup_rcvbuf() computed rcvmem (~23720)Note: Since commit 356f039822b limited default window to maximum of
10*1460 and 2*MSS, we use same heuristic in this patch.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
20 Oct, 2011
1 commit
-
Initial cwnd being 10 (TCP_INIT_CWND) instead of 3, change
tcp_fixup_sndbuf() to get more than 16384 bytes (sysctl_tcp_wmem[1]) in
initial sk_sndbufSigned-off-by: Eric Dumazet
Signed-off-by: David S. Miller
14 Oct, 2011
1 commit
-
skb truesize currently accounts for sk_buff struct and part of skb head.
kmalloc() roundings are also ignored.Considering that skb_shared_info is larger than sk_buff, its time to
take it into account for better memory accounting.This patch introduces SKB_TRUESIZE(X) macro to centralize various
assumptions into a single place.At skb alloc phase, we put skb_shared_info struct at the exact end of
skb head, to allow a better use of memory (lowering number of
reallocations), since kmalloc() gives us power-of-two memory blocks.Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
aligned to cache lines, as before.Note: This patch might trigger performance regressions because of
misconfigured protocol stacks, hitting per socket or global memory
limits that were previously not reached. But its a necessary step for a
more accurate memory accounting.Signed-off-by: Eric Dumazet
CC: Andi Kleen
CC: Ben Hutchings
Signed-off-by: David S. Miller
08 Oct, 2011
1 commit
-
Conflicts:
net/batman-adv/soft-interface.c
05 Oct, 2011
1 commit
-
lost_skb_hint is used by tcp_mark_head_lost() to mark the first unhandled skb.
lost_cnt_hint is the number of packets or sacked packets before the lost_skb_hint;
When shifting a skb that is before the lost_skb_hint, if tcp_is_fack() is ture,
the skb has already been counted in the lost_cnt_hint; if tcp_is_fack() is false,
tcp_sacktag_one() will increase the lost_cnt_hint. So tcp_shifted_skb() does not
need to adjust the lost_cnt_hint by itself. When shifting a skb that is equal to
lost_skb_hint, the shifted packets will not be counted by tcp_mark_head_lost().
So tcp_shifted_skb() should adjust the lost_cnt_hint even tcp_is_fack(tp) is true.Signed-off-by: Zheng Yan
Signed-off-by: David S. Miller
28 Sep, 2011
1 commit
-
Rename struct tcp_skb_cb "flags" to "tcp_flags" to ease code review and
maintenance.Its content is a combination of FIN/SYN/RST/PSH/ACK/URG/ECE/CWR flags
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
27 Sep, 2011
2 commits
-
struct tcp_skb_cb contains a "flags" field containing either tcp flags
or IP dsfield depending on context (input or output path)Introduce ip_dsfield to make the difference clear and ease maintenance.
If later we want to save space, we can union flags/ip_dsfieldSigned-off-by: Eric Dumazet
Signed-off-by: David S. Miller -
While playing with a new ADSL box at home, I discovered that ECN
blackhole can trigger suboptimal quickack mode on linux : We send one
ACK for each incoming data frame, without any delay and eventual
piggyback.This is because TCP_ECN_check_ce() considers that if no ECT is seen on a
segment, this is because this segment was a retransmit.Refine this heuristic and apply it only if we seen ECT in a previous
segment, to detect ECN blackhole at IP level.Signed-off-by: Eric Dumazet
CC: Jamal Hadi Salim
CC: Jerry Chu
CC: Ilpo Järvinen
CC: Jim Gettys
CC: Dave Taht
Acked-by: Ilpo Järvinen
Signed-off-by: David S. Miller
22 Sep, 2011
1 commit
-
Conflicts:
MAINTAINERS
drivers/net/Kconfig
drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
drivers/net/ethernet/broadcom/tg3.c
drivers/net/wireless/iwlwifi/iwl-pci.c
drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
drivers/net/wireless/rt2x00/rt2800usb.c
drivers/net/wireless/wl12xx/main.c
19 Sep, 2011
1 commit
-
D-SACK is allowed to reside below snd_una. But the corresponding check
in tcp_is_sackblock_valid() is the exact opposite. It looks like a typo.Signed-off-by: Zheng Yan
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
25 Aug, 2011
1 commit
-
This patch implements Proportional Rate Reduction (PRR) for TCP.
PRR is an algorithm that determines TCP's sending rate in fast
recovery. PRR avoids excessive window reductions and aims for
the actual congestion window size at the end of recovery to be as
close as possible to the window determined by the congestion control
algorithm. PRR also improves accuracy of the amount of data sent
during loss recovery.The patch implements the recommended flavor of PRR called PRR-SSRB
(Proportional rate reduction with slow start reduction bound) and
replaces the existing rate halving algorithm. PRR improves upon the
existing Linux fast recovery under a number of conditions including:
1) burst losses where the losses implicitly reduce the amount of
outstanding data (pipe) below the ssthresh value selected by the
congestion control algorithm and,
2) losses near the end of short flows where application runs out of
data to send.As an example, with the existing rate halving implementation a single
loss event can cause a connection carrying short Web transactions to
go into the slow start mode after the recovery. This is because during
recovery Linux pulls the congestion window down to packets_in_flight+1
on every ACK. A short Web response often runs out of new data to send
and its pipe reduces to zero by the end of recovery when all its packets
are drained from the network. Subsequent HTTP responses using the same
connection will have to slow start to raise cwnd to ssthresh. PRR on
the other hand aims for the cwnd to be as close as possible to ssthresh
by the end of recovery.A description of PRR and a discussion of its performance can be found at
the following links:
- IETF Draft:
http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
- IETF Slides:
http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
- Paper to appear in Internet Measurements Conference (IMC) 2011:
Improving TCP Loss Recovery
Nandita Dukkipati, Matt Mathis, Yuchung ChengSigned-off-by: Nandita Dukkipati
Signed-off-by: David S. Miller
09 Jun, 2011
1 commit
-
This patch lowers the default initRTO from 3secs to 1sec per
RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
has been retransmitted, AND the TCP timestamp option is not on.It also adds support to take RTT sample during 3WHS on the passive
open side, just like its active open counterpart, and uses it, if
valid, to seed the initRTO for the data transmission phase.The patch also resets ssthresh to its initial default at the
beginning of the data transmission phase, and reduces cwnd to 1 if
there has been MORE THAN ONE retransmission during 3WHS per RFC5681.Signed-off-by: H.K. Jerry Chu
Signed-off-by: David S. Miller
23 Mar, 2011
2 commits
-
Signed-off-by: David S. Miller
-
In the current undo logic, cwnd is moderated after it was restored
to the value prior entering fast-recovery. It was moderated first
in tcp_try_undo_recovery then again in tcp_complete_cwr.Since the undo indicates recovery was false, these moderations
are not necessary. If the undo is triggered when most of the
outstanding data have been acknowledged, the (restored) cwnd is
falsely pulled down to a small value.This patch removes these cwnd moderations if cwnd is undone
a) during fast-recovery
b) by receiving DSACKs past fast-recoverySigned-off-by: Yuchung Cheng
Signed-off-by: David S. Miller
16 Mar, 2011
1 commit
15 Mar, 2011
1 commit
-
In the congestion control interface, the callback for each ACK
includes an estimated round trip time in microseconds.
Some algorithms need high resolution (Vegas style) but most only
need jiffie resolution. If RTT is not accurate (like a retransmission)
-1 is used as a flag value.When doing coarse resolution if RTT is less than a a jiffie
then 0 should be returned rather than no estimate. Otherwise algorithms
that expect good ack's to trigger slow start (like CUBIC Hystart)
will be confused.Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller
04 Mar, 2011
1 commit
-
Conflicts:
drivers/net/bnx2x/bnx2x.h
22 Feb, 2011
1 commit
-
Fix a bug that undo_retrans is incorrectly decremented when undo_marker is
not set or undo_retrans is already 0. This happens when sender receives
more DSACK ACKs than packets retransmitted during the current
undo phase. This may also happen when sender receives DSACK after
the undo operation is completed or cancelled.Fix another bug that undo_retrans is incorrectly incremented when
sender retransmits an skb and tcp_skb_pcount(skb) > 1 (TSO). This case
is rare but not impossible.Signed-off-by: Yuchung Cheng
Acked-by: Ilpo Järvinen
Signed-off-by: David S. Miller
03 Feb, 2011
1 commit
-
Signed-off-by: David S. Miller
Acked-by: Nandita Dukkipati
26 Jan, 2011
1 commit
-
This patch fixes a bug that causes TCP RST packets to be generated
on otherwise correctly behaved applications, e.g., no unread data
on close,..., etc. To trigger the bug, at least two conditions must
be met:1. The FIN flag is set on the last data packet, i.e., it's not on a
separate, FIN only packet.
2. The size of the last data chunk on the receive side matches
exactly with the size of buffer posted by the receiver, and the
receiver closes the socket without any further read attempt.This bug was first noticed on our netperf based testbed for our IW10
proposal to IETF where a large number of RST packets were observed.
netperf's read side code meets the condition 2 above 100%.Before the fix, tcp_data_queue() will queue the last skb that meets
condition 1 to sk_receive_queue even though it has fully copied out
(skb_copy_datagram_iovec()) the data. Then if condition 2 is also met,
tcp_recvmsg() often returns all the copied out data successfully
without actually consuming the skb, due to a check
"if ((chunk = len - tp->ucopy.len) != 0) {"
and
"len -= chunk;"
after tcp_prequeue_process() that causes "len" to become 0 and an
early exit from the big while loop.I don't see any reason not to free the skb whose data have been fully
consumed in tcp_data_queue(), regardless of the FIN flag. We won't
get there if MSG_PEEK is on. Am I missing some arcane cases related
to urgent data?Signed-off-by: H.K. Jerry Chu
Signed-off-by: David S. Miller
24 Dec, 2010
1 commit
-
Commit 86bcebafc5e7f5 ("tcp: fix >2 iw selection") fixed a case
when congestion window initialization has been mistakenly omitted
by introducing cwnd label and putting backwards goto from the
end of the function.This makes the code unnecessarily tricky to read and understand
on a first sight.Shuffle the code around a little bit to make it more obvious.
Signed-off-by: Jiri Kosina
Signed-off-by: David S. Miller
10 Dec, 2010
1 commit
-
Use helper functions to hide all direct accesses, especially writes,
to dst_entry metrics values.This will allow us to:
1) More easily change how the metrics are stored.
2) Implement COW for metrics.
In particular this will help us put metrics into the inetpeer
cache if that is what we end up doing. We can make the _metrics
member a pointer instead of an array, initially have it point
at the read-only metrics in the FIB, and then on the first set
grab an inetpeer entry and point the _metrics member there.Signed-off-by: David S. Miller
Acked-by: Eric Dumazet
11 Nov, 2010
1 commit
-
Robin Holt tried to boot a 16TB machine and found some limits were
reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]We can switch infrastructure to use long "instead" of "int", now
atomic_long_t primitives are available for free.Signed-off-by: Eric Dumazet
Reported-by: Robin Holt
Reviewed-by: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: David S. Miller
25 Oct, 2010
1 commit
-
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
Update broken web addresses in arch directory.
Update broken web addresses in the kernel.
Revert "drivers/usb: Remove unnecessary return's from void functions" for musb gadget
Revert "Fix typo: configuation => configuration" partially
ida: document IDA_BITMAP_LONGS calculation
ext2: fix a typo on comment in ext2/inode.c
drivers/scsi: Remove unnecessary casts of private_data
drivers/s390: Remove unnecessary casts of private_data
net/sunrpc/rpc_pipe.c: Remove unnecessary casts of private_data
drivers/infiniband: Remove unnecessary casts of private_data
drivers/gpu/drm: Remove unnecessary casts of private_data
kernel/pm_qos_params.c: Remove unnecessary casts of private_data
fs/ecryptfs: Remove unnecessary casts of private_data
fs/seq_file.c: Remove unnecessary casts of private_data
arm: uengine.c: remove C99 comments
arm: scoop.c: remove C99 comments
Fix typo configue => configure in comments
Fix typo: configuation => configuration
Fix typo interrest[ing|ed] => interest[ing|ed]
Fix various typos of valid in comments
...Fix up trivial conflicts in:
drivers/char/ipmi/ipmi_si_intf.c
drivers/usb/gadget/rndis.c
net/irda/irnet/irnet_ppp.c
18 Oct, 2010
2 commits
-
The patch below updates broken web addresses in the kernel
Signed-off-by: Justin P. Mattock
Cc: Maciej W. Rozycki
Cc: Geert Uytterhoeven
Cc: Finn Thain
Cc: Randy Dunlap
Cc: Matt Turner
Cc: Dimitry Torokhov
Cc: Mike Frysinger
Acked-by: Ben Pfaff
Acked-by: Hans J. Koch
Reviewed-by: Finn Thain
Signed-off-by: Jiri Kosina -
When only fast rexmit should be done, tcp_mark_head_lost marks
L too far. Also, sacked_upto below 1 is perfectly valid number,
the packets == 0 then needs to be trapped elsewhere.Signed-off-by: Ilpo Järvinen
Signed-off-by: David S. Miller
07 Oct, 2010
1 commit
-
This looks like a simple typo that has gone unnoticed for some time. The
impact is relatively low but it's clearly wrong.Signed-off-by: John Heffner
Signed-off-by: David S. Miller
05 Oct, 2010
1 commit
-
Conflicts:
net/ipv4/Kconfig
net/ipv4/tcp_timer.c
30 Sep, 2010
1 commit
-
Function only used in tcp_input.c
Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller