24 Aug, 2018
1 commit
-
[ Upstream commit a69258f7aa2623e0930212f09c586fd06674ad79 ]
After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
related callbacks are no longer neededSigned-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Acked-by: Neal Cardwell
Acked-by: Lawrence Brakmo
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
28 Jul, 2018
2 commits
-
[ Upstream commit 27cde44a259c380a3c09066fc4b42de7dde9b1ad ]
Currently when a DCTCP receiver delays an ACK and receive a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking previous and latest sequences
respectly (for ECN accounting).Previously sending the first ACK may mark off the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowleges
the latest sequence, which is not true for the first special ACK.The fix is to not make the assumption in tcp_send_ack and check the
actual ack sequence before cancelling the delayed ACK. Further it's
safer to pass the ack sequence number as a local variable into
tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
future bugs like this.Reported-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Acked-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 2987babb6982306509380fc11b450227a844493b ]
Refactor and create helpers to send the special ACK in DCTCP.
Signed-off-by: Yuchung Cheng
Acked-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
25 May, 2018
1 commit
-
[ Upstream commit 7f582b248d0a86bae5788c548d7bb5bca6f7691a ]
syzkaller found a reliable way to crash the host, hitting a BUG()
in __tcp_retransmit_skb()Malicous MSG_FASTOPEN is the root cause. We need to purge write queue
in tcp_connect_init() at the point we init snd_una/write_seq.This patch also replaces the BUG() by a less intrusive WARN_ON_ONCE()
kernel BUG at net/ipv4/tcp_output.c:2837!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 5276 Comm: syz-executor0 Not tainted 4.17.0-rc3+ #51
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__tcp_retransmit_skb+0x2992/0x2eb0 net/ipv4/tcp_output.c:2837
RSP: 0000:ffff8801dae06ff8 EFLAGS: 00010206
RAX: ffff8801b9fe61c0 RBX: 00000000ffc18a16 RCX: ffffffff864e1a49
RDX: 0000000000000100 RSI: ffffffff864e2e12 RDI: 0000000000000005
RBP: ffff8801dae073a0 R08: ffff8801b9fe61c0 R09: ffffed0039c40dd2
R10: ffffed0039c40dd2 R11: ffff8801ce206e93 R12: 00000000421eeaad
R13: ffff8801ce206d4e R14: ffff8801ce206cc0 R15: ffff8801cd4f4a80
FS: 0000000000000000(0000) GS:ffff8801dae00000(0063) knlGS:00000000096bc900
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000020000000 CR3: 00000001c47b6000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
tcp_retransmit_skb+0x2e/0x250 net/ipv4/tcp_output.c:2923
tcp_retransmit_timer+0xc50/0x3060 net/ipv4/tcp_timer.c:488
tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573
tcp_write_timer+0x111/0x1d0 net/ipv4/tcp_timer.c:593
call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
expire_timers kernel/time/timer.c:1363 [inline]
__run_timers+0x79e/0xc50 kernel/time/timer.c:1666
run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
__do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1d1/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:525 [inline]
smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Signed-off-by: Eric Dumazet
Cc: Yuchung Cheng
Cc: Neal Cardwell
Reported-by: syzbot
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
09 Mar, 2018
2 commits
-
[ Upstream commit 350c9f484bde93ef229682eedd98cd5f74350f7f ]
BBR uses tcp_tso_autosize() in an attempt to probe what would be the
burst sizes and to adjust cwnd in bbr_target_cwnd() with following
gold formula :/* Allow enough full-sized skbs in flight to utilize end systems. */
cwnd += 3 * bbr->tso_segs_goal;But GSO can be lacking or be constrained to very small
units (ip link set dev ... gso_max_segs 2)What we really want is to have enough packets in flight so that both
GSO and GRO are efficient.So in the case GSO is off or downgraded, we still want to have the same
number of packets in flight as if GSO/TSO was fully operational, so
that GRO can hopefully be working efficiently.To fix this issue, we make tcp_tso_autosize() unaware of
sk->sk_gso_max_segsOnly tcp_tso_segs() has to enforce the gso_max_segs limit.
Tested:
ethtool -K eth0 tso off gso off
tc qd replace dev eth0 root pfifo_fastBefore patch:
for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
691 (ss -temoi shows cwnd is stuck around 6 )
667
651
631
517After patch :
# for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
1733 (ss -temoi shows cwnd is around 386 )
1778
1746
1781
1718Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Eric Dumazet
Reported-by: Oleksandr Natalenko
Acked-by: Neal Cardwell
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 808cf9e38cd7923036a99f459ccc8cf2955e47af ]
Avoid SKB coalescing if eor bit is set in one of the relevant
SKBs.Fixes: c134ecb87817 ("tcp: Make use of MSG_EOR in tcp_sendmsg")
Signed-off-by: Ilya Lesokhin
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
17 Dec, 2017
1 commit
-
[ Upstream commit ed66dfaf236c04d414de1d218441296e57fb2bd2 ]
Fix the TLP scheduling logic so that when scheduling a TLP probe, we
ensure that the estimated time at which an RTO would fire accounts for
the fact that ACKs indicating forward progress should push back RTO
times.After the following fix:
df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
we had an unintentional behavior change in the following kind of
scenario: suppose the RTT variance has been very low recently. Then
suppose we send out a flight of N packets and our RTT is 100ms:t=0: send a flight of N packets
t=100ms: receive an ACK for N-1 packetsThe response before df92c8394e6e that was:
-> schedule a TLP for now + RTO_intervalThe response after df92c8394e6e is:
-> schedule a TLP for t=0 + RTO_intervalSince RTO_interval = srtt + RTT_variance, this means that we have
scheduled a TLP timer at a point in the future that only accounts for
RTT_variance. If the RTT_variance term is small, this means that the
timer fires soon.Before df92c8394e6e this would not happen, because in that code, when
we receive an ACK for a prefix of flight, we did:1) Near the top of tcp_ack(), switch from TLP timer to RTO
at write_queue_head->paket_tx_time + RTO_interval:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
tcp_rearm_rto(sk);2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
if (flag & FLAG_ACKED) {
tcp_rearm_rto(sk);3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
to TLP at now + RTO_interval:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
tcp_schedule_loss_probe(sk);In df92c8394e6e we removed that 3-phase dance, and instead directly
set the TLP timer once: we set the TLP timer in cases like this to
write_queue_head->packet_tx_time + RTO_interval. So if the RTT
variance is small, then this means that this is setting the TLP timer
to fire quite soon. This means if the ACK for the tail of the flight
takes longer than an RTT to arrive (often due to delayed ACKs), then
the TLP timer fires too quickly.Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
03 Nov, 2017
1 commit
-
Christoph Paasch sent a patch to address the following issue :
tcp_make_synack() is leaving some TCP private info in skb->cb[],
then send the packet by other means than tcp_transmit_skb()tcp_transmit_skb() makes sure to clear skb->cb[] to not confuse
IPv4/IPV6 stacks, but we have no such cleanup for SYNACK.tcp_make_synack() should not use tcp_init_nondata_skb() :
tcp_init_nondata_skb() really should be limited to skbs put in write/rtx
queues (the ones that are only sent via tcp_transmit_skb())This patch fixes the issue and should even save few cpu cycles ;)
Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet
Reported-by: Christoph Paasch
Reviewed-by: Christoph Paasch
Signed-off-by: David S. Miller
01 Nov, 2017
1 commit
-
Based on SNMP values provided by Roman, Yuchung made the observation
that some crashes in tcp_sacktag_walk() might be caused by MTU probing.Looking at tcp_mtu_probe(), I found that when a new skb was placed
in front of the write queue, we were not updating tcp highest sack.If one skb is freed because all its content was copied to the new skb
(for MTU probing), then tp->highest_sack could point to a now freed skb.Bad things would then happen, including infinite loops.
This patch renames tcp_highest_sack_combine() and uses it
from tcp_mtu_probe() to fix the bug.Note that I also removed one test against tp->sacked_out,
since we want to replace tp->highest_sack regardless of whatever
condition, since keeping a stale pointer to freed skb is a recipe
for disaster.Fixes: a47e5a988a57 ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
Signed-off-by: Eric Dumazet
Reported-by: Alexei Starovoitov
Reported-by: Roman Gushchin
Reported-by: Oleksandr Natalenko
Acked-by: Alexei Starovoitov
Acked-by: Neal Cardwell
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
28 Oct, 2017
1 commit
-
In the unlikely event tcp_mtu_probe() is sending a packet, we
want tp->tcp_mstamp being as accurate as possible.This means we need to call tcp_mstamp_refresh() a bit earlier in
tcp_write_xmit().Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
26 Oct, 2017
1 commit
-
Current implementation calls tcp_rate_skb_sent() when tcp_transmit_skb()
is called when it clones skb only. Not calling tcp_rate_skb_sent() is OK
for all such code paths except from __tcp_retransmit_skb() which happens
when skb->data address is not aligned. This may rarely happen e.g. when
small amount of data is sent initially and the receiver partially acks
odd number of bytes for some reason, possibly malicious.Signed-off-by: Yousuk Seung
Signed-off-by: Neal Cardwell
Signed-off-by: Soheil Hassas Yeganeh
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
23 Oct, 2017
1 commit
-
When retransmission on TSQ handler was introduced in the commit
f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
skbs' timestamps were updated on the actual transmission. In the later
commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), it stops
being done so. In the commit, the comment says "We try to refresh
tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
tcp_v4_mtu_reduced applies to this. About the latter, it's okay since
it's rare enough.About the former, even though possible retransmissions on the tasklet
comes just after the destructor run in NET_RX softirq handling, the time
between them could be nonnegligibly large to the extent that
tcp_rack_advance or rto rearming be affected if other (remaining) RX,
BLOCK and (preceding) TASKLET sofirq handlings are unexpectedly heavy.So in the same way as tcp_write_timer_handler does, doing tcp_mstamp_refresh
ensures the accuracy of algorithms relying on it.Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: Koichiro Den
Reviewed-by: Eric Dumazet
Signed-off-by: David S. Miller
20 Sep, 2017
1 commit
-
Our recent change exposed a bug in TCP Fastopen Client that syzkaller
found right away [1]When we prepare skb with SYN+DATA, we attempt to transmit it,
and we update socket state as if the transmit was a success.In socket RTX queue we have two skbs, one with the SYN alone,
and a second one containing the DATA.When (malicious) ACK comes in, we now complain that second one had no
skb_mstamp.The proper fix is to make sure that if the transmit failed, we do not
pretend we sent the DATA skb, and make it our send_head.When 3WHS completes, we can now send the DATA right away, without having
to wait for a timeout.[1]
WARNING: CPU: 0 PID: 100189 at net/ipv4/tcp_input.c:3117 tcp_clean_rtx_queue+0x2057/0x2ab0 net/ipv4/tcp_input.c:3117()WARN_ON_ONCE(last_ackt == 0);
Modules linked in:
CPU: 0 PID: 100189 Comm: syz-executor1 Not tainted
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
0000000000000000 ffff8800b35cb1d8 ffffffff81cad00d 0000000000000000
ffffffff828a4347 ffff88009f86c080 ffffffff8316eb20 0000000000000d7f
ffff8800b35cb220 ffffffff812c33c2 ffff8800baad2440 00000009d46575c0
Call Trace:
[] __dump_stack
[] dump_stack+0xc1/0x124
[] warn_slowpath_common+0xe2/0x150
[] warn_slowpath_null+0x2e/0x40
[] tcp_clean_rtx_queue+0x2057/0x2ab0 n
[] tcp_ack+0x151d/0x3930
[] tcp_rcv_state_process+0x1c69/0x4fd0
[] tcp_v4_do_rcv+0x54f/0x7c0
[] sk_backlog_rcv
[] __release_sock+0x12b/0x3a0
[] release_sock+0x5e/0x1c0
[] inet_wait_for_connect
[] __inet_stream_connect+0x545/0xc50
[] tcp_sendmsg_fastopen
[] tcp_sendmsg+0x2298/0x35a0
[] inet_sendmsg+0xe5/0x520
[] sock_sendmsg_nosec
[] sock_sendmsg+0xcf/0x110Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully")
Fixes: 783237e8daf1 ("net-tcp: Fast Open client - sending SYN-data")
Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Cc: Neal Cardwell
Cc: Yuchung Cheng
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
19 Sep, 2017
1 commit
-
remove tcp_may_send_now and tcp_snd_test that are no longer used
Fixes: 840a3cbe8969 ("tcp: remove forward retransmit feature")
Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
17 Sep, 2017
1 commit
-
Now skb->mstamp_skb is updated later, we also need to call
tcp_rate_skb_sent() after the update is done.Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully")
Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
16 Sep, 2017
1 commit
-
liujian reported a problem in TCP_USER_TIMEOUT processing with a patch
in tcp_probe_timer() :
https://www.spinics.net/lists/netdev/msg454496.htmlAfter investigations, the root cause of the problem is that we update
skb->skb_mstamp of skbs in write queue, even if the attempt to send a
clone or copy of it failed. One reason being a routing problem.This patch prevents this, solving liujian issue.
It also removes a potential RTT miscalculation, since
__tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with
TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has
been changed.A future ACK would then lead to a very small RTT sample and min_rtt
would then be lowered to this too small value.Tested:
# cat user_timeout.pkt
--local_ip=192.168.102.640 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0+0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0`
+0 < S 0:0(0) win 0
+0 > S. 0:0(0) ack 1+.1 < . 1:1(0) ack 1 win 65530
+0 accept(3, ..., ...) = 4+0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
+0 write(4, ..., 24) = 24
+0 > P. 1:25(24) ack 1 win 29200
+.1 < . 1:1(0) ack 25 win 65530//change the ipaddress
+1 `ifconfig tun0 192.168.0.10/16`+1 write(4, ..., 24) = 24
+1 write(4, ..., 24) = 24
+1 write(4, ..., 24) = 24
+1 write(4, ..., 24) = 24+0 `ifconfig tun0 192.168.102.64/16`
+0 < . 1:2(1) ack 25 win 65530
+0 `ifconfig tun0 192.168.0.10/16`+3 write(4, ..., 24) = -1
# ./packetdrill user_timeout.pkt
Signed-off-by: Eric Dumazet
Reported-by: liujian
Acked-by: Neal Cardwell
Acked-by: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
31 Aug, 2017
1 commit
-
This reverts commit 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78.
Eric Dumazet says:
We found at Google a significant regression caused by
45f119bf936b1f9f546a0b139c5b56f9bb2bdc78 tcp: remove header predictionIn typical RPC (TCP_RR), when a TCP socket receives data, we now call
tcp_ack() while we used to not call it.This touches enough cache lines to cause a slowdown.
so problem does not seem to be HP removal itself but the tcp_ack()
call. Therefore, it might be possible to remove HP after all, provided
one finds a way to elide tcp_ack for most cases.Reported-by: Eric Dumazet
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller
10 Aug, 2017
1 commit
-
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.The TCP conflict was a case of local variables in a function
being removed from both net and net-next.In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.Signed-off-by: David S. Miller
09 Aug, 2017
1 commit
-
With new TCP_FASTOPEN_CONNECT socket option, there is a possibility
to call tcp_connect() while socket sk_dst_cache is either NULL
or invalid.+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
+0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
+0 connect(4, ..., ...) = 0<< sk->sk_dst_cache becomes obsolete, or even set to NULL >>
+1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
We need to refresh the route otherwise bad things can happen,
especially when syzkaller is running on the host :/Fixes: 19f6d3f3c8422 ("net/tcp-fastopen: Add new API support")
Reported-by: Dmitry Vyukov
Signed-off-by: Eric Dumazet
Cc: Wei Wang
Cc: Yuchung Cheng
Acked-by: Wei Wang
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
04 Aug, 2017
2 commits
-
Fix a TCP loss recovery performance bug raised recently on the netdev
list, in two threads:(i) July 26, 2017: netdev thread "TCP fast retransmit issues"
(ii) July 26, 2017: netdev thread:
"[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
outstanding TLP retransmission"The basic problem is that incoming TCP packets that did not indicate
forward progress could cause the xmit timer (TLP or RTO) to be rearmed
and pushed back in time. In certain corner cases this could result in
the following problems noted in these threads:- Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
could cause TCP to repeatedly schedule TLPs forever. We kept
sending TLPs after every ~200ms, which elicited bogus SACKs, which
caused more TLPs, ad infinitum; we never fired an RTO to fill in
the holes.- Incoming data segments could, in some cases, cause us to reschedule
our RTO or TLP timer further out in time, for no good reason. This
could cause repeated inbound data to result in stalls in outbound
data, in the presence of packet loss.This commit fixes these bugs by changing the TLP and RTO ACK
processing to:(a) Only reschedule the xmit timer once per ACK.
(b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
ACK indicates sufficient forward progress (a packet was
cumulatively ACKed, or we got a SACK for a packet that was sent
before the most recent retransmit of the write queue head).This brings us back into closer compliance with the RFCs, since, as
the comment for tcp_rearm_rto() notes, we should only restart the RTO
timer after forward progress on the connection. Previously we were
restarting the xmit timer even in these cases where there was no
forward progress.As a side benefit, this commit simplifies and speeds up the TCP timer
arming logic. We had been calling inet_csk_reset_xmit_timer() three
times on normal ACKs that cumulatively acknowledged some data:1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
tcp_rearm_rto(sk);2) Once in tcp_clean_rtx_queue(), to update the RTO:
if (flag & FLAG_ACKED) {
tcp_rearm_rto(sk);3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
to TLP:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
tcp_schedule_loss_probe(sk);This commit, by only rescheduling the xmit timer once per ACK,
simplifies the code and reduces CPU overhead.This commit was tested in an A/B test with Google web server
traffic. SNMP stats and request latency metrics were within noise
levels, substantiating that for normal web traffic patterns this is a
rare issue. This commit was also tested with packetdrill tests to
verify that it fixes the timer behavior in the corner cases discussed
in the netdev threads mentioned above.This patch is a bug fix patch intended to be queued for -stable
relases.Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Reported-by: Klavs Klavsen
Reported-by: Mao Wenan
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller -
Have tcp_schedule_loss_probe() base the TLP scheduling decision based
on when the RTO *should* fire. This is to enable the upcoming xmit
timer fix in this series, where tcp_schedule_loss_probe() cannot
assume that the last timer installed was an RTO timer (because we are
no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
ACK). So tcp_schedule_loss_probe() must independently figure out when
an RTO would want to fire.In the new TLP implementation following in this series, we cannot
assume that icsk_timeout was set based on an RTO; after processing a
cumulative ACK the icsk_timeout we see can be from a previous TLP or
RTO. So we need to independently recalculate the RTO time (instead of
reading it out of icsk_timeout). Removing this dependency on the
nature of icsk_timeout makes things a little easier to reason about
anyway.Note that the old and new code should be equivalent, since they are
both saying: "if the RTO is in the future, but at an earlier time than
the normal TLP time, then set the TLP timer to fire when the RTO would
have fired".Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
02 Aug, 2017
1 commit
-
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).Signed-off-by: David S. Miller
01 Aug, 2017
1 commit
-
Like prequeue, I am not sure this is overly useful nowadays.
If we receive a train of packets, GRO will aggregate them if the
headers are the same (HP predates GRO by several years) so we don't
get a per-packet benefit, only a per-aggregated-packet one.Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller
30 Jul, 2017
1 commit
-
When using CONFIG_UBSAN_SANITIZE_ALL, the TCP code produces a
false-positive warning:net/ipv4/tcp_output.c: In function 'tcp_connect':
net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
^~
net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~I have opened a gcc bug for this, but distros have already shipped
compilers with this problem, and it's not clear yet whether there is
a way for gcc to avoid the warning. As the problem is related to the
bitfield access, this introduces a temporary variable to store the old
enum value.I did not notice this warning earlier, since UBSAN is disabled when
building with COMPILE_TEST, and that was always turned on in both
allmodconfig and randconfig tests.Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81601
Signed-off-by: Arnd Bergmann
Signed-off-by: David S. Miller
20 Jul, 2017
1 commit
-
This patch adjusts the timeout formula to schedule the TCP loss probe
(TLP). The previous formula uses 2*SRTT or 1.5*RTT + DelayACKMax if
only one packet is in flight. It keeps a lower bound of 10 msec which
is too large for short RTT connections (e.g. within a data-center).
The new formula = 2*RTT + (inflight == 1 ? 200ms : 2ticks) which
performs better for short and fast connections.Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller
04 Jul, 2017
1 commit
-
SYN-ACK responses on a server in response to a SYN from a client
did not get the injected skb mark that was tagged on the SYN packet.Fixes: 84f39b08d786 ("net: support marking accepting TCP sockets")
Reviewed-by: Lorenzo Colitti
Signed-off-by: Jamal Hadi Salim
Signed-off-by: David S. Miller
02 Jul, 2017
4 commits
-
Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.Signed-off-by: Lawrence Brakmo
Signed-off-by: David S. Miller -
Added callbacks to BPF SOCK_OPS type program before an active
connection is intialized and after a passive or active connection is
established.The following patch demostrates how they can be used to set send and
receive buffer sizes.Signed-off-by: Lawrence Brakmo
Signed-off-by: David S. Miller -
This patch adds suppport for setting the initial advertized window from
within a BPF_SOCK_OPS program. This can be used to support larger
initial cwnd values in environments where it is known to be safe.Signed-off-by: Lawrence Brakmo
Signed-off-by: David S. Miller -
This patch adds support for setting a per connection SYN and
SYN_ACK RTOs from within a BPF_SOCK_OPS program. For example,
to set small RTOs when it is known both hosts are within a
datacenter.Signed-off-by: Lawrence Brakmo
Signed-off-by: David S. Miller
01 Jul, 2017
1 commit
-
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.Signed-off-by: Elena Reshetova
Signed-off-by: Hans Liljestrand
Signed-off-by: Kees Cook
Signed-off-by: David Windsor
Signed-off-by: David S. Miller
08 Jun, 2017
3 commits
-
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller -
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller -
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
03 Jun, 2017
1 commit
-
__pskb_trim_head() does not need to reset skb tail pointer.
Also change the comments, __pskb_pull_head() does not exist.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
18 May, 2017
5 commits
-
TCP Timestamps option is defined in RFC 7323
Traditionally on linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.For TCP flows on the Internet, 1 ms resolution would be much better
than 4ms or 10ms (HZ=250 or HZ=100 respectively)For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1]Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller -
After this patch, all uses of tcp_time_stamp will require
a change when we introduce 1 ms and/or 1 us TCP TS option.Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller -
tcp_time_stamp will no longer be tied to jiffies.
Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller -
Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller -
Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller