13 Jun, 2019

1 commit

  • Adding delays to TCP flows is crucial for studying behavior
    of TCP stacks, including congestion control modules.

    Linux offers the netem module, but it has impractical constraints:
    - Root access is required to change the qdisc
    - Hard to set up on egress when combined with a non-trivial qdisc like FQ
    - A single delay for all flows.

    EDT (Earliest Departure Time) adoption in the TCP stack allows us
    to enable a per-socket delay at a very small cost.

    Networking tools can now establish thousands of flows, each of them
    with a different delay, simulating real world conditions.

    This requires the FQ packet scheduler or an EDT-enabled NIC.

    This patch adds the TCP_TX_DELAY socket option, to set a delay in
    usec units.

    unsigned int tx_delay = 10000; /* 10 msec */

    setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
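    A slightly fuller user-space sketch with error checking (the helper name
    is mine; TCP_TX_DELAY may be missing from older uapi headers, so the
    fallback value below is an assumption taken from current
    include/uapi/linux/tcp.h):

    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    #ifndef TCP_TX_DELAY
    #define TCP_TX_DELAY 37  /* assumption: value in current uapi headers */
    #endif

    /* Give one socket its own artificial transmit delay, in usec.
     * IPPROTO_TCP is the same level as SOL_TCP on Linux. */
    static int set_tx_delay(int fd, unsigned int delay_usec)
    {
            if (setsockopt(fd, IPPROTO_TCP, TCP_TX_DELAY,
                           &delay_usec, sizeof(delay_usec)) < 0) {
                    perror("setsockopt(TCP_TX_DELAY)");
                    return -1;
            }
            return 0;
    }

    Each socket can get a different delay, e.g. set_tx_delay(fd1, 10000) for
    10 msec on one flow and set_tx_delay(fd2, 50000) for 50 msec on another.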

    Note that FQ packet scheduler limits might need some tweaking:

    man tc-fq

    PARAMETERS
    limit
    Hard limit on the real queue size. When this limit is
    reached, new packets are dropped. If the value is lowered,
    packets are dropped so that the new limit is met. Default
    is 10000 packets.

    flow_limit
    Hard limit on the maximum number of packets queued per
    flow. Default value is 100.

    Use of the TCP_TX_DELAY option will increase the number of skbs in the
    FQ qdisc, so packets will be dropped if either of the above limits is hit.

    Use of a jump label makes this support free at runtime for hosts
    that never use the option.

    Also note that TSQ (TCP Small Queues) limits are slightly changed
    with this patch: we need to account for the fact that artificially
    delayed skbs must not stop us from providing more skbs to feed the
    pipe (netem uses skb_orphan_partial() for this purpose, but FQ
    cannot use this trick).

    Because of that, using big delays might very well trigger
    old bugs in the TSO auto-defer logic and/or sndbuf-limited detection.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jun, 2019

1 commit

    When autoflowlabel is in action, skb_get_hash_flowi6()
    derives the flowlabel from a non-zero skb->hash.

    If skb->hash is zero, a flow dissection is performed.

    Since all TCP skbs sent from the ESTABLISHED state inherit their
    skb->hash from sk->sk_txhash, we had better keep a copy
    of sk->sk_txhash in the TIME_WAIT socket.

    After this patch, ACK or RST packets sent on behalf of
    a TIME_WAIT socket have the flowlabel that was previously
    used by the flow.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 May, 2019

1 commit

    Linux implements RFC6298 and uses an initial congestion window
    of 1 upon establishing the connection if the SYNACK packet is
    retransmitted 2 or more times. In cellular networks SYNACK timeouts
    are often spurious if the wireless radio was dormant or idle. Also,
    some network paths have an RTT longer than the default SYNACK timeout.
    In both cases falsely starting with a minimal cwnd is detrimental
    to performance.

    This patch avoids doing so when the final ACK's TCP timestamp
    indicates the original SYNACK was delivered. It remembers the
    original SYNACK timestamp when a SYNACK timeout has occurred and
    conveniently re-uses the function that detects spurious SYN timeouts.

    Note that a server may receive multiple SYNs and immediately
    retransmit SYNACKs without any SYNACK timeout. This often happens
    when the client's SYNs have timed out due to the wireless delay
    described above. In this case the server will still use the default
    initial congestion window (e.g. 10) because tp->undo_marker is reset
    in tcp_init_metrics(). This is an intentional design because packets
    are not lost but delayed.

    This patch only covers regular TCP passive open. Fast Open is
    supported in the next patch.
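    A minimal illustration of the timestamp check described above (names and
    types are mine, not the kernel's): the timestamp of the first SYNACK is
    remembered when the timeout fires, and if the handshake-completing ACK
    echoes that original value, the SYNACK was in fact delivered.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical sketch: if the timestamp echoed by the final ACK matches
     * the timestamp of the first SYNACK we sent (saved when the retransmit
     * timer fired), the original SYNACK reached the client, so the cwnd
     * reduction to 1 can be undone. */
    static bool synack_timeout_was_spurious(uint32_t first_synack_tsval,
                                            uint32_t final_ack_tsecr)
    {
            return final_ack_tsecr == first_synack_tsval;
    }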

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

27 Feb, 2019

1 commit


18 Jan, 2019

11 commits


01 Sep, 2018

1 commit

  • RFC 1337 says:
    ''Ignore RST segments in TIME-WAIT state.
    If the 2 minute MSL is enforced, this fix avoids all three hazards.''

    So with net.ipv4.tcp_rfc1337=1, the expected behaviour is to let the
    TIME-WAIT sk expire rather than removing it instantly when a reset is received.

    However, Linux will also re-start the TIME-WAIT timer.

    This causes connect() to fail when trying to re-use ports, or causes very
    long delays (until the SYN retry interval exceeds the MSL).

    packetdrill test case:
    // Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
    `sysctl net.ipv4.tcp_rfc1337=1`

    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    0.000 bind(3, ..., ...) = 0
    0.000 listen(3, 1) = 0

    0.100 < S 0:0(0) win 29200
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4

    // Receive first segment
    0.310 < P. 1:1001(1000) ack 1 win 46

    // Send one ACK
    0.310 > . 1:1(0) ack 1001

    // read 1000 bytes
    0.310 read(4, ..., 1000) = 1000

    // Application writes 100 bytes
    0.350 write(4, ..., 100) = 100
    0.350 > P. 1:101(100) ack 1001

    // ACK
    0.500 < . 1001:1001(0) ack 101 win 257

    // close the connection
    0.600 close(4) = 0
    0.600 > F. 101:101(0) ack 1001 win 244

    // Our side is in FIN_WAIT_1 & waits for ack to fin
    0.7 < . 1001:1001(0) ack 102 win 244

    // Our side is in FIN_WAIT_2 with no outstanding data.
    0.8 < F. 1001:1001(0) ack 102 win 244
    0.8 > . 102:102(0) ack 1002 win 244

    // Our side is now in TIME_WAIT state, send ack for fin.
    0.9 < F. 1002:1002(0) ack 102 win 244
    0.9 > . 102:102(0) ack 1002 win 244

    // Peer reopens with in-window SYN:
    1.000 < S 1000:1000(0) win 9200

    // Therefore, reply with ACK.
    1.000 > . 102:102(0) ack 1002 win 244

    // Peer sends RST for this ACK. Normally this RST results
    // in tw socket removal, but rfc1337=1 setting prevents this.
    1.100 < R 1002:1002(0) win 244

    // second syn. Due to rfc1337=1 expect another pure ACK.
    31.0 < S 1000:1000(0) win 9200
    31.0 > . 102:102(0) ack 1002 win 244

    // .. and another RST from peer.
    31.1 < R 1002:1002(0) win 244
    31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

    // third syn after one minute. Time-Wait socket should have expired by now.
    63.0 < S 1000:1000(0) win 9200

    // so we expect a syn-ack & 3whs to proceed from here on.
    63.0 > S. 0:0(0) ack 1

    Without this patch, 'ss' shows restarts of the tw timer, and the last
    packet is thus just another pure ACK, more than one minute later.

    This restores the original code from commit 283fd6cf0be690a83
    ("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git .

    For some reason the else branch was removed/lost in 1f28b683339f7
    ("Merge in TCP/UDP optimizations and [..]") and timer restart became
    unconditional.
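    A compact sketch of the restored behaviour (the enum and names are
    illustrative, not the kernel's):

    #include <stdbool.h>

    enum tw_rst_action { TW_KILL, TW_IGNORE };

    /* With rfc1337 enabled, an RST received in TIME-WAIT is ignored
     * entirely: the socket is neither removed nor is its timer re-armed,
     * so it expires after the normal 2*MSL period.  With rfc1337 disabled
     * (the default), the RST removes the TIME-WAIT socket immediately. */
    static enum tw_rst_action tw_handle_rst(bool rfc1337_enabled)
    {
            return rfc1337_enabled ? TW_IGNORE : TW_KILL;
    }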

    Reported-by: Michal Tesar
    Signed-off-by: Florian Westphal
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Florian Westphal
     

13 Jul, 2018

1 commit

  • Using get_seconds() for timestamps is deprecated since it can lead
    to overflows on 32-bit systems. While the interface generally doesn't
    overflow until year 2106, the specific implementation of the TCP PAWS
    algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
    overflow.

    A related problem is that the local timestamps in CLOCK_REALTIME form
    lead to unexpected behavior when settimeofday is called to set the system
    clock backwards or forwards by more than 24 days.

    While the first problem could be solved by using an overflow-safe method
    of comparing the timestamps, a nicer solution is to use a monotonic
    clocksource with ktime_get_seconds() that simply doesn't overflow (at
    least not until 136 years after boot) and that doesn't change during
    settimeofday().

    To make 32-bit and 64-bit architectures behave the same way here, and
    also save a few bytes in the tcp_options_received structure, I'm changing
    the type to a 32-bit integer, which is now safe on all architectures.

    Finally, the ts_recent_stamp field also (confusingly) gets used to store
    a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
    This is currently safe, but changing the type to 32-bit requires
    some small changes there to keep it working.
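    A user-space analogue of the change, for illustration only: take coarse
    timestamps from a monotonic clock and keep 32 bits, which behaves the
    same on 32-bit and 64-bit systems and is unaffected by settimeofday().

    #include <stdint.h>
    #include <time.h>

    /* Monotonic seconds since boot, truncated to 32 bits; wraps only after
     * ~136 years of uptime and never jumps when the wall clock is set. */
    static uint32_t monotonic_seconds(void)
    {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint32_t)ts.tv_sec;
    }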

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

28 Jun, 2018

1 commit


05 Jun, 2018

1 commit


11 May, 2018

1 commit

  • This version has some suggestions by Eric Dumazet:

    - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
    races.
    - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
    statement.
    - Factorize code, as the sk_fullsock() check is not necessary.

    Aidan McGurn from Openwave Mobility systems reported the following bug:

    "Marked routing is broken on customer deployment. Its effects are large
    increase in Uplink retransmissions caused by the client never receiving
    the final ACK to their FINACK - this ACK misses the mark and routes out
    of the incorrect route."

    Currently, marks are added to sk_buffs for replies when the "fwmark_reflect"
    sysctl is enabled, but not for TW sockets that had sk->sk_mark set via
    setsockopt(SO_MARK..).

    Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
    original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location,
    then propagate it so that the skb gets sent with the correct mark. Do the same
    for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
    netfilter rules are still honored.
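    An illustrative reduction of the mark-selection rule (user-space types;
    the real code uses the IP4_REPLY_MARK() helper mentioned above): the
    fwmark_reflect-derived mark wins, otherwise the stored socket or
    TIME_WAIT mark is used.

    #include <stdint.h>

    /* reflected_mark is non-zero only when the fwmark_reflect sysctl is on;
     * sk_or_tw_mark is sk->sk_mark, or tw->tw_mark for TIME_WAIT sockets. */
    static uint32_t reply_mark(uint32_t reflected_mark, uint32_t sk_or_tw_mark)
    {
            return reflected_mark ? reflected_mark : sk_or_tw_mark;
    }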

    Signed-off-by: Jon Maxwell
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jon Maxwell
     

01 Apr, 2018

1 commit


15 Feb, 2018

1 commit

  • 배석진 reported that in some situations, packets for a given 5-tuple
    end up being processed by different CPUs.

    This involves RPS, and fragmentation.

    배석진 is seeing packet drops when a SYN_RECV request socket is
    moved into the ESTABLISHED state. Other states are protected by the
    socket lock.

    This is caused by a CPU losing the race and simply not caring enough
    to retry.

    Since this seems to occur frequently, we can do better and perform
    a second lookup.

    Note that all needed memory barriers are already in the existing code,
    thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
    and reqsk_put(). The second lookup must find the new socket,
    unless it has already been accepted and closed by another cpu.

    Note that the fragmentation could be avoided in the first place by
    using a correct TCP MSS option in the SYN{ACK} packet, but this
    does not mean we cannot be more robust.
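    A deliberately abstract sketch of the pattern (none of these names are
    kernel functions; the helpers are hypothetical): when the first lookup
    loses the promotion race, look up again instead of dropping the packet.

    struct pkt;
    struct sock;

    /* Hypothetical helpers for the sketch only. */
    struct sock *lookup_socket(const struct pkt *p);
    int process_on_socket(struct sock *sk, struct pkt *p); /* <0: race lost */

    static int receive(struct pkt *p)
    {
            struct sock *sk = lookup_socket(p);
            if (!sk)
                    return -1;
            if (process_on_socket(sk, p) < 0) {
                    /* Another CPU just turned the request socket into a full
                     * ESTABLISHED socket: redo the lookup and try again
                     * rather than dropping the packet. */
                    sk = lookup_socket(p);
                    return sk ? process_on_socket(sk, p) : -1;
            }
            return 0;
    }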

    Many thanks to 배석진 for a very detailed analysis.

    Reported-by: 배석진
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Dec, 2017

1 commit


02 Dec, 2017

1 commit

  • Maciej Żenczykowski reported some panics in tcp_twsk_destructor()
    that might be caused by the following bug.

    The timewait timer is pinned to the cpu, because we want to transition
    the timewait refcount from 0 to 4 in one go, once everything has been
    initialized.

    At the time commit ed2e92394589 ("tcp/dccp: fix timewait races in timer
    handling") was merged, TCP was always running from the BH handler.

    After commit 5413d1babe8f ("net: do not block BH while processing
    socket backlog") we definitely can run tcp_time_wait() from process
    context.

    We need to block BH in the critical section so that the pinned timer
    still serves its purpose.

    This bug is more likely to happen under stress and when very small RTOs
    are used in datacenter flows.
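    A minimal kernel-context sketch of the idea, under the assumption that
    the critical section is the one that hashes the timewait socket and arms
    its timer (this is not the literal diff):

    local_bh_disable();
    /* hash the timewait socket and arm its timer here; with BH blocked the
     * timer stays pinned to this CPU and the refcount can go from 0 to its
     * final value in one step, as the original design assumed */
    local_bh_enable();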

    Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Nov, 2017

2 commits

  • Replace the reordering distance measurement in packet units with a
    sequence-based approach. Previously it tracked the number of "packets"
    toward the forward ACK (i.e. the highest SACKed sequence) in a state
    variable "fackets_out".

    Precisely measuring the reordering degree in packet distance has little
    benefit, as the degree constantly changes with factors like path, load,
    and congestion window. It is also complicated and prone to arcane bugs.
    This patch replaces it with a sequence-based approach that's much simpler.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Reviewed-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • FACK loss detection has been disabled by default and the
    successor RACK subsumed FACK and can handle reordering better.
    This patch removes FACK to simplify TCP loss recovery.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Reviewed-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

05 Nov, 2017

1 commit

  • Currently TCP RACK loss detection does not work well if packets are
    being reordered beyond its static reordering window (min_rtt/4). Under
    such reordering it may falsely trigger loss recoveries and reduce TCP
    throughput significantly.

    This patch improves that by increasing and reducing the reordering
    window based on DSACK, which is now supported in major TCP implementations.
    It makes RACK's reo_wnd adaptive, based on DSACK and the number of recoveries.

    - If a DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
    by srtt), since there is a possibility that the spurious retransmission
    was due to a reordering delay longer than reo_wnd.

    - Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
    successful recoveries (this accounts for a full DSACK-based loss
    recovery undo). After that, reset it to the default (min_rtt/4).

    - reo_wnd is incremented at most once per RTT, so that the new DSACK
    we are reacting to is (approximately) due to a spurious retransmission
    sent after reo_wnd was last updated.

    - reo_wnd is tracked in steps (of min_rtt/4), rather than as an
    absolute value, to account for changes in rtt.

    In our internal testing, we observed a significant increase in throughput
    in scenarios where reordering exceeds min_rtt/4 (the previous static value).
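    A user-space sketch of the adaptation described above (field names and
    structure are mine; steps of min_rtt/4, upper-bounded by srtt, persisting
    for 16 recoveries):

    #include <stdint.h>

    struct rack {
            uint32_t reo_wnd_steps;   /* multiples of min_rtt/4, starts at 1 */
            uint32_t reo_wnd_persist; /* recoveries left before reset */
    };

    static uint32_t rack_reo_wnd_us(const struct rack *r,
                                    uint32_t min_rtt_us, uint32_t srtt_us)
    {
            uint32_t wnd = r->reo_wnd_steps * (min_rtt_us / 4);
            return wnd < srtt_us ? wnd : srtt_us;   /* upper bound: srtt */
    }

    /* Called at most once per RTT, when a DSACK arrives. */
    static void rack_on_dsack(struct rack *r)
    {
            r->reo_wnd_steps += 1;
            r->reo_wnd_persist = 16;  /* TCP_RACK_RECOVERY_THRESH */
    }

    /* Called after each successful recovery; decay back to the default. */
    static void rack_on_recovery(struct rack *r)
    {
            if (r->reo_wnd_persist)
                    r->reo_wnd_persist--;
            else
                    r->reo_wnd_steps = 1; /* default reo_wnd = min_rtt/4 */
    }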

    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Priyaranjan Jha
     

28 Oct, 2017

1 commit


27 Oct, 2017

3 commits


26 Oct, 2017

1 commit

  • The SMC protocol [1] relies on the use of a new TCP experimental
    option [2, 3]. With this option, SMC capabilities are exchanged
    between peers during the TCP three way handshake. This patch adds
    support for this experimental option to TCP.

    References:
    [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
    [2] Shared Use of TCP Experimental Options RFC 6994:
    https://tools.ietf.org/rfc/rfc6994.txt
    [3] IANA ExID SMCR:
    http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-exids

    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Ursula Braun
     

24 Oct, 2017

1 commit


06 Oct, 2017

1 commit

  • This patch adds a new queue (list) that tracks the sent but not yet
    acked or SACKed skbs for a TCP connection. The list is chronologically
    ordered by skb->skb_mstamp (the head is the oldest sent skb).

    This list will be used to optimize TCP RACK recovery, which checks
    an skb's timestamp to judge whether it has been lost and needs to be
    retransmitted. Since the TCP write queue is ordered by sequence instead
    of sent time, RACK has to scan over the write queue to catch all
    eligible packets to detect lost retransmissions, and iterates through
    SACKed skbs repeatedly.

    Special care for rare events:
    1. TCP repair fakes skb transmission, so the send queue needs adjusting.
    2. SACK reneging would require re-inserting SACKed skbs into the
    send queue. For now I believe it's not worth the complexity to
    make RACK work perfectly on SACK reneging, so we do nothing here.
    3. Fast Open: currently for non-TFO, the send queue correctly queues
    the pure SYN packet. For TFO, which queues a pure SYN and
    then a data packet, the send queue only queues the data packet but
    not the pure SYN, due to the structure of the TFO code. This is okay
    because the SYN receiver would never respond with a SACK on a
    missing SYN (i.e. the SYN is never fast-retransmitted by SACK/RACK).

    In order not to grow sk_buff, we use a union for the new list and the
    _skb_refdst/destructor fields. This is a bit complicated because
    we need to make sure _skb_refdst and destructor are properly zeroed
    before the skb is cloned/copied at transmit, and before being freed.
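    A small user-space illustration of the data structure (names are mine):
    newly sent packets are appended at the tail, so a time-based scan can
    start at the head (the oldest) and stop at the first packet that is
    still too young to be declared lost.

    #include <stddef.h>
    #include <stdint.h>

    struct sent_pkt {
            uint64_t sent_us;        /* transmit timestamp */
            struct sent_pkt *next;   /* next, more recently sent packet */
    };

    struct sent_list {
            struct sent_pkt *head;   /* oldest un-acked packet */
            struct sent_pkt *tail;   /* newest; append point on transmit */
    };

    static void sent_list_append(struct sent_list *l, struct sent_pkt *p)
    {
            p->next = NULL;
            if (l->tail)
                    l->tail->next = p;
            else
                    l->head = p;
            l->tail = p;
    }

    /* Everything sent before 'deadline_us' is a loss candidate; because the
     * list is time-ordered, the scan stops at the first packet newer than
     * the deadline. */
    static const struct sent_pkt *first_recent_pkt(const struct sent_list *l,
                                                   uint64_t deadline_us)
    {
            const struct sent_pkt *p = l->head;
            while (p && p->sent_us < deadline_us)
                    p = p->next;
            return p;
    }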

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Aug, 2017

1 commit

  • This reverts commit 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78.

    Eric Dumazet says:
    We found at Google a significant regression caused by
    45f119bf936b1f9f546a0b139c5b56f9bb2bdc78 tcp: remove header prediction

    In typical RPC (TCP_RR), when a TCP socket receives data, we now call
    tcp_ack() while we used to not call it.

    This touches enough cache lines to cause a slowdown.

    So the problem does not seem to be HP removal itself but the tcp_ack()
    call. Therefore, it might be possible to remove HP after all, provided
    one finds a way to elide tcp_ack() for most cases.

    Reported-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

01 Aug, 2017

2 commits

  • Like prequeue, I am not sure this is overly useful nowadays.

    If we receive a train of packets, GRO will aggregate them if the
    headers are the same (HP predates GRO by several years), so we don't
    get a per-packet benefit, only a per-aggregated-packet one.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • prequeue is a tcp receive optimization that moves part of rx processing
    from bh to process context.

    This only works if the socket being processed belongs to a process that
    is blocked in recv on that socket.

    In practice, this doesn't happen that often anymore, because nowadays
    servers tend to use an event-driven (epoll) model.

    Even normal client applications (web browsers) commonly use many tcp
    connections in parallel.

    This has a measurable impact only in netperf (which uses plain recv and
    thus allows prequeue use) from a host to a locally running VM (~4%);
    there were no changes when using netperf between two physical hosts with
    ixgbe interfaces.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

02 Jul, 2017

1 commit


08 Jun, 2017

1 commit