14 Oct, 2019

1 commit

  • Both tcp_v4_err() and tcp_v6_err() do the following operations
    while they do not own the socket lock :

    fastopen = tp->fastopen_rsk;
    snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;

    The problem is that without an appropriate barrier, the compiler
    might reload tp->fastopen_rsk and trigger a NULL deref.

    Request sockets are protected by RCU, so we can simply add
    the missing annotations and barriers to solve the issue.

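    A minimal sketch of the annotated read (illustrative, not the
    verbatim patch) :

    /* rcu_dereference() keeps the compiler from reloading the pointer */
    fastopen = rcu_dereference(tp->fastopen_rsk);
    snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
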
    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Oct, 2019

1 commit

  • The cited commit exposed an old retransmits_timed_out() bug
    which assumed it could call tcp_model_timeout() with
    TCP_RTO_MIN as rto_base for all states.

    But flows in SYN_SENT or SYN_RECV state use a different
    RTO base (1 sec instead of 200 ms, unless BPF chooses
    another value).

    This caused a reduction of SYN retransmits from 6 to 4 with
    the default /proc/sys/net/ipv4/tcp_syn_retries value.

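    A sketch of the corrected base selection in retransmits_timed_out()
    (tcp_timeout_init() is the BPF-overridable SYN RTO; illustrative) :

    unsigned int rto_base = TCP_RTO_MIN;

    if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
            rto_base = tcp_timeout_init(sk); /* 1 sec unless BPF overrides */
    timeout = tcp_model_timeout(sk, boundary, rto_base);
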
    Fixes: a41e8a88b06e ("tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Marek Majkowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Sep, 2019

1 commit

  • Yuchung Cheng and Marek Majkowski independently reported a weird
    behavior of TCP_USER_TIMEOUT option when used at connect() time.

    When the TCP_USER_TIMEOUT is reached, tcp_write_timeout()
    believes the flow should live, and the following condition
    in tcp_clamp_rto_to_user_timeout() programs one-jiffie timers :

    remaining = icsk->icsk_user_timeout - elapsed;
    if (remaining <= 0)
            return 1; /* user timeout has passed */

    Reported-by: Yuchung Cheng
    Reported-by: Marek Majkowski
    Cc: Jon Maxwell
    Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2
    Acked-by: Jon Maxwell
    Tested-by: Marek Majkowski
    Signed-off-by: Marek Majkowski
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Aug, 2019

1 commit

  • The current implementation of TCP MTU probing can considerably
    underestimate the MTU on lossy connections allowing the MSS to get down to
    48. We have found that in almost all of these cases on our networks these
    paths can handle much larger MTUs meaning the connections are being
    artificially limited. Even though TCP MTU probing can raise the MSS back
    up, we have seen this not happen in practice, leaving connections "stuck"
    with an MSS of 48 when heavy loss is present.

    Prior to pushing out this change we could not keep TCP MTU probing enabled
    because of the above reasons. Now, with a reasonable floor set, we've had
    it enabled for the past 6 months.

    The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
    administrators the ability to control the floor of MSS probing.

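    Roughly, the floor slots into the existing clamp in tcp_mtu_probing()
    (illustrative sketch, names per net/ipv4/tcp_timer.c) :

    mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
    mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
    mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor); /* new floor */
    mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
    icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
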
    Signed-off-by: Josh Hunt
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Josh Hunt
     

16 Jun, 2019

1 commit

  • If MTU probing is enabled, tcp_mtu_probing() could very well end up
    with too small an MSS.

    Use the new sysctl tcp_min_snd_mss to make sure the MSS search
    is performed in an acceptable range.

    CVE-2019-11479 -- tcp mss hardcoded to 48

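    A sketch of the enforcement added to tcp_mtu_probing() (illustrative) :

    mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
    mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
    mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss); /* keep MSS sane */
    icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
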
    Signed-off-by: Eric Dumazet
    Reported-by: Jonathan Lemon
    Cc: Jonathan Looney
    Acked-by: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Tyler Hicks
    Cc: Bruce Curtis
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

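    As it now appears at the top of each converted C source file :

    // SPDX-License-Identifier: GPL-2.0-only
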
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 May, 2019

1 commit

  • The TCP sender would use a congestion window of 1 packet on the second SYN
    and SYNACK timeout, except for passive TCP Fast Open. This makes passive
    TFO too aggressive and unfair during congestion at handshake. This
    patch fixes this issue so TCP (fast open or not, passive or active)
    always conforms to RFC 6298.

    Note that tcp_enter_loss() is called only once during recurring
    timeouts. This is because during handshake, high_seq and snd_una
    are the same, so tcp_enter_loss() would incorrectly set the undo state
    variables multiple times.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

28 Jan, 2019

1 commit

  • Instead of using pingpong as a single bit of information, we refactor the
    code to treat it as a counter. When an interactive session is detected,
    we set the pingpong count to TCP_PINGPONG_THRESH, and when the pingpong
    count is >= TCP_PINGPONG_THRESH, we consider the session in pingpong mode.

    This patch is a pure refactor and sets foundation for the next patch.
    This patch itself does not change any pingpong logic.

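    A sketch of the resulting helpers (per include/net/inet_connection_sock.h;
    illustrative) :

    static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
    {
            inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH;
    }

    static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
    {
            return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
    }
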
    Signed-off-by: Wei Wang
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

18 Jan, 2019

6 commits

  • Previously when the sender fails to retransmit a data packet on
    timeout due to congestion in the local host (e.g. throttling in
    qdisc), it'll retry within an RTO up to 500ms.

    In low-RTT networks such as data-centers, RTO is often far
    below the default minimum of 200ms (and the 500ms cap). Then local
    host congestion could trigger a retry storm, pouring gas on the
    fire. Worse yet, the retry counter (icsk_retransmits) is not
    properly updated, so the aggressive retry may exceed the system
    limit (15 rounds) until the packet finally slips through.

    On such rare events, it's wise to retry more conservatively (500ms)
    and update the stats properly to reflect these incidents and follow
    the system limit. Note that this is consistent with the behavior
    when a keep-alive probe is dropped due to local congestion.

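    A sketch of the intended behavior in tcp_retransmit_timer()
    (illustrative, not the verbatim patch) :

    if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) > 0) {
            /* Retransmission failed because of local congestion:
             * count the attempt and retry conservatively (500 ms).
             */
            icsk->icsk_retransmits++;
            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                      TCP_RESOURCE_PROBE_INTERVAL,
                                      TCP_RTO_MAX);
            goto out;
    }
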
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously we use the next unsent skb's timestamp to determine
    when to abort a socket stalling on window probes. This no longer
    works as skb timestamp reflects the last instead of the first
    transmission.

    Instead we can estimate how long the socket has been stalling
    with the probe count and the exponential backoff behavior.

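    A sketch of the estimate in tcp_probe_timer() (illustrative) :

    if (icsk->icsk_user_timeout) {
            /* reconstruct elapsed time from probes_out and the backoff */
            u32 elapsed = tcp_model_timeout(sk, icsk->icsk_probes_out,
                                            tcp_probe0_base(sk));

            if (elapsed >= icsk->icsk_user_timeout)
                    goto abort;
    }
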
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Create a helper to model TCP exponential backoff for the next patch.
    This is a pure refactor with no behavior change.

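    The helper, as a sketch (per net/ipv4/tcp_timer.c) :

    static u32 tcp_model_timeout(struct sock *sk,
                                 u32 boundary,
                                 u32 rto_base)
    {
            u32 linear_backoff_thresh, timeout;

            /* RTO doubles each round until TCP_RTO_MAX, then stays flat */
            linear_backoff_thresh = ilog2(TCP_RTO_MAX / rto_base);
            if (boundary <= linear_backoff_thresh)
                    timeout = ((2 << boundary) - 1) * rto_base;
            else
                    timeout = ((2 << linear_backoff_thresh) - 1) * rto_base +
                              (boundary - linear_backoff_thresh) * TCP_RTO_MAX;
            return jiffies_to_msecs(timeout);
    }
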
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch addresses a corner issue in the timeout behavior of a
    passive Fast Open socket. A passive Fast Open server may write
    and close the socket while it is re-trying the SYN-ACK to complete
    the handshake. After the handshake is completed, the server does
    not properly stamp the recovery start time (tp->retrans_stamp is
    0), and the socket may abort immediately on the very first FIN
    timeout, instead of retrying until it passes the system or user
    specified limit.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously a TCP socket's retrans_stamp is not set if the
    retransmission has failed to send. As a result, if a socket is
    experiencing local issues retransmitting packets, determining when
    to abort the socket is complicated without knowing the starting time of
    the recovery, since retrans_stamp may remain zero.

    This complication causes sub-optimal behavior: TCP may use the
    latest, instead of the first, retransmission time to compute the
    elapsed time of a connection stalling due to local issues. TCP may
    then disregard the TCP retries settings and keep retrying until it
    finally succeeds: not a good idea when the local host is already strained.

    The simple fix is to always timestamp the start of a recovery.
    It's worth noting that retrans_stamp is also used to compare echo
    timestamp values to detect spurious recovery. This patch does
    not break that because retrans_stamp is still later than when the
    original packet was sent.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously TCP only warns if its RTO timer fires and the
    retransmission queue is empty, but this causes a NULL pointer
    dereference later on. It's better to avoid such a catastrophic
    failure and simply exit with a warning.

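    A sketch of the guard in tcp_retransmit_timer() (illustrative) :

    struct sk_buff *skb = tcp_rtx_queue_head(sk);

    if (WARN_ON_ONCE(!skb))
            return; /* exit with a warning instead of dereferencing NULL */
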
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

11 Jan, 2019

1 commit

  • Previously upon SYN timeouts the sender recomputes the txhash to
    try a different path. However, this does not apply to the initial
    timeout of SYN-data (active Fast Open). Therefore an active IPv6
    Fast Open connection may incur a one-second RTO penalty, taking on
    a new path only after the second SYN retransmission uses a new flow label.

    This patch removes this undesirable behavior so Fast Open changes
    the flow label just like regular connections do. This also helps
    avoid falsely disabling Fast Open on the sender, which triggers
    after two consecutive SYN timeouts on Fast Open.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

01 Dec, 2018

2 commits

  • Previously the SNMP TCPTIMEOUTS counter had inconsistent accounting:
    1. It counts all SYN and SYN-ACK timeouts
    2. It counts timeouts in other states except recurring timeouts and
    timeouts after fast recovery or disorder state.

    Such selective accounting makes analysis difficult and complicated. For
    example the monitoring system needs to collect many other SNMP counters
    to infer the total amount of timeout events. This patch makes the
    TCPTIMEOUTS counter simply count all retransmit timeouts (SYN, data,
    or FIN).

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously there was an off-by-one bug in determining when to abort
    a stalled window-probing socket. This patch fixes that so it is
    consistent with tcp_write_timeout().

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

25 Nov, 2018

1 commit

  • When a qdisc setup including pacing FQ is dismantled and recreated,
    some TCP packets are sent earlier than instructed by TCP stack.

    TCP can be fooled when the ACK comes back, because the following
    operation can return a negative value.

    tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;

    Some paths in the TCP stack were not dealing properly with this;
    this patch addresses four of them.

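    A sketch of the defensive pattern applied in those paths
    (illustrative) :

    s32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;

    if (delta < 0)
            return; /* packet left early; do not feed a bogus RTT sample */
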
    Fixes: ab408b6dc744 ("tcp: switch tcp and sch_fq to new earliest departure time model")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Nov, 2018

1 commit

  • Jean-Louis reported a TCP regression and bisected it to the recent
    SACK compression.

    After a loss episode (receiver not able to keep up and dropping
    packets because its backlog is full), the linux TCP stack sends
    a single SACK (DUPACK).

    The sender then waits a full RTO timer before recovering losses.

    While RFC 6675 says in section 5, "Algorithm Details",

    (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
    indicating at least three segments have arrived above the current
    cumulative acknowledgment point, which is taken to indicate loss
    -- go to step (4).
    ...
    (4) Invoke fast retransmit and enter loss recovery as follows:

    there are old TCP stacks not implementing this strategy, and
    still counting the dupacks before starting fast retransmit.

    While these stacks probably perform poorly when receivers implement
    LRO/GRO, we should be a little more gentle to them.

    This patch makes sure we do not enable SACK compression unless
    3 dupacks have been sent since last rcv_nxt update.

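    A sketch of the gating in __tcp_ack_snd_check() (illustrative) :

    if (tp->compressed_ack_rcv_nxt != tp->rcv_nxt) {
            tp->compressed_ack_rcv_nxt = tp->rcv_nxt;
            tp->compressed_ack = 0; /* rcv_nxt advanced: restart counting */
    }
    /* send the first DupThresh (3) dupacks uncompressed */
    if (++tp->compressed_ack <= TCP_FASTRETRANS_THRESH)
            goto send_now;
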
    Ideally we should even rearm the timer to send one or two
    more DUPACK if no more packets are coming, but that will
    be work aiming for linux-4.21.

    Many thanks to Jean-Louis for bisecting the issue, providing
    packet captures and testing this patch.

    Fixes: 5d9f4262b7ea ("tcp: add SACK compression")
    Reported-by: Jean-Louis Dupond
    Tested-by: Jean-Louis Dupond
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Oct, 2018

1 commit

  • In the EDT design, I made the mistake of using tcp_wstamp_ns
    to store the last tcp_clock_ns() sample and to store the
    pacing virtual timer.

    This causes major regressions at high speed flows.

    Introduce tcp_clock_cache to store last tcp_clock_ns().
    This is needed because some arches have a slow high-resolution
    kernel time service.

    tcp_wstamp_ns is only updated when a packet is sent.

    Note that we can remove tcp_mstamp in the future since
    tcp_mstamp is essentially tcp_clock_cache/1000, so the
    apparent socket size increase is temporary.

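    A sketch of the refresh helper (per include/net/tcp.h; simplified) :

    static inline void tcp_mstamp_refresh(struct tcp_sock *tp)
    {
            u64 val = tcp_clock_ns();

            tp->tcp_clock_cache = val;                    /* nsec cache */
            tp->tcp_mstamp = div_u64(val, NSEC_PER_USEC); /* usec view  */
    }
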
    Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field")
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Oct, 2018

1 commit

  • In the recent TCP/EDT patch series, I switched TCP and sch_fq
    clocks from MONOTONIC to TAI, in order to match the choice made
    earlier for the sch_etf packet scheduler.

    But sure enough, this broke some setups where the TAI clock
    jumps forward (by almost 50 years...), as reported
    by Leonard Crestez.

    If we want to converge later, we'll probably need to add
    an skb field to differentiate the clock bases, or a socket option.

    In the meantime, a UDP application will need to use the CLOCK_MONOTONIC
    base for its SCM_TXTIME timestamps if using the fq packet scheduler.

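    For such an application, roughly (illustrative) :

    struct sock_txtime txtime = {
            .clockid = CLOCK_MONOTONIC, /* must match fq's clock base */
    };

    setsockopt(fd, SOL_SOCKET, SO_TXTIME, &txtime, sizeof(txtime));
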
    Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
    Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
    Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
    Signed-off-by: Eric Dumazet
    Reported-by: Leonard Crestez
    Tested-by: Leonard Crestez
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2018

2 commits

  • The next patch will use tcp_wstamp_ns to feed the internal
    TCP pacing timer, so switch to CLOCK_TAI to share the same base.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
    from usec units to nsec units.

    Do not clear skb->tstamp before entering IP stacks in TX,
    so that qdiscs or devices can implement pacing based on the
    earliest departure time instead of socket sk->sk_pacing_rate.

    Packets are fed with tcp_wstamp_ns, and a following patch
    will update tcp_wstamp_ns when both TCP and sch_fq switch to
    the earliest departure time mechanism.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Jul, 2018

3 commits

  • Create the tcp_clamp_rto_to_user_timeout() helper routine to calculate
    the correct RTO, so that the TCP_USER_TIMEOUT socket option is more
    accurate. This takes into account suggestions and feedback from
    Eric Dumazet, Neal Cardwell and David Laight. Thanks to the 1st commit
    in this series we can avoid the msecs_to_jiffies() and jiffies_to_msecs()
    dance.

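    A sketch of the helper (illustrative) :

    static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
    {
            struct inet_connection_sock *icsk = inet_csk(sk);
            u32 elapsed, start_ts;

            start_ts = tcp_retransmit_stamp(sk);
            if (!icsk->icsk_user_timeout || !start_ts)
                    return icsk->icsk_rto;
            elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts;
            if (elapsed >= icsk->icsk_user_timeout)
                    return 1; /* user timeout has passed */
            return min_t(u32, icsk->icsk_rto,
                         msecs_to_jiffies(icsk->icsk_user_timeout - elapsed));
    }
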
    Signed-off-by: Jon Maxwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jon Maxwell
     
  • Create a separate helper routine, as per Neal Cardwell's suggestion,
    to be used by the final commit in this series and by
    retransmits_timed_out().

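    The helper, roughly (illustrative) :

    static u32 tcp_retransmit_stamp(const struct sock *sk)
    {
            u32 start_ts = tcp_sk(sk)->retrans_stamp;

            if (unlikely(!start_ts)) {
                    struct sk_buff *head = tcp_rtx_queue_head(sk);

                    if (!head)
                            return 0;
                    start_ts = tcp_skb_timestamp(head);
            }
            return start_ts;
    }
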
    Signed-off-by: Jon Maxwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jon Maxwell
     
  • This is a preparatory commit, part of a series that improves the
    accuracy of the socket TCP_USER_TIMEOUT option. Implement Eric Dumazet's
    idea to convert icsk->icsk_user_timeout from jiffies to msecs, to
    eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance in the
    future.

    Signed-off-by: Jon Maxwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jon Maxwell
     

18 May, 2018

1 commit

  • When TCP receives an out-of-order packet, it immediately sends
    a SACK packet, generating network load but also forcing the
    receiver to send 1-MSS pathological packets, increasing its
    RTX queue length/depth, and thus processing time.

    Wifi networks suffer from this aggressive behavior, but generally
    speaking, all these SACK packets add fuel to the fire when networks
    are under congestion.

    This patch adds a high resolution timer and tp->compressed_ack counter.

    Instead of sending a SACK, we program this timer with a small delay,
    based on RTT and capped to 1 ms :

    delay = min(5% of RTT, 1 ms)

    If subsequent SACKs need to be sent while the timer has not yet
    expired, we simply increment tp->compressed_ack.

    When timer expires, a SACK is sent with the latest information.
    Whenever an ACK is sent (if data is sent, or if in-order
    data is received) timer is canceled.

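    Arming the timer looks roughly like this (illustrative sketch) :

    /* delay = min(5% of RTT, 1 ms); rcv_rtt_est.rtt_us is stored << 3 */
    delay = min_t(unsigned long, NSEC_PER_MSEC,
                  tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >> 3) / 20);
    sock_hold(sk);
    hrtimer_start(&tp->compressed_ack_timer, ns_to_ktime(delay),
                  HRTIMER_MODE_REL_PINNED_SOFT);
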
    Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
    if the sack blocks need to be shuffled, even if the timer has not
    expired.

    A new SNMP counter is added in the following patch.

    Two other patches add sysctls to allow changing the 1,000,000 and 44
    values that this commit hard-coded.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 May, 2018

1 commit

  • linux-4.16 got support for softirq-based hrtimers.
    TCP can switch its pacing hrtimer to this variant, since this
    avoids going through a tasklet and some atomic operations.

    Pacing timer logic then looks like the other (jiffies-based) TCP timers.

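    The switch itself is mostly a different hrtimer mode (illustrative) :

    hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
                 HRTIMER_MODE_ABS_PINNED_SOFT);
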
    v2: use hrtimer_try_to_cancel() in tcp_clear_xmit_timers()
    to correctly release reference on socket if needed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Mar, 2018

1 commit

  • When the connection is aborted, there is no point in
    keeping the packets on the write queue until the connection
    is closed.

    Similar to a27fd7a8ed38 ("tcp: purge write queue upon RST"),
    this is essential for a correct MSG_ZEROCOPY implementation,
    because userspace cannot call close(fd) before receiving
    zerocopy signals even when the connection is aborted.

    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

26 Jan, 2018

1 commit

  • Adds an optional call to the sock_ops BPF program based on whether the
    BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
    The BPF program is passed 2 arguments: icsk_retransmits and whether the
    RTO has expired.

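    A minimal sock_ops program using the new callback might look like
    this (illustrative sketch) :

    SEC("sockops")
    int watch_rto(struct bpf_sock_ops *skops)
    {
            switch (skops->op) {
            case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
                    /* opt this socket in to RTO callbacks */
                    bpf_sock_ops_cb_flags_set(skops,
                                              BPF_SOCK_OPS_RTO_CB_FLAG);
                    break;
            case BPF_SOCK_OPS_RTO_CB:
                    /* args[0] = icsk_retransmits, args[1] = RTO expired? */
                    break;
            }
            return 1;
    }
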
    Signed-off-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    Lawrence Brakmo
     

25 Jan, 2018

1 commit

  • When a tcp socket is closed, if it detects that its net namespace is
    exiting, close immediately and do not wait for FIN sequence.

    For normal sockets, a reference is taken to their net namespace, so it will
    never exit while the socket is open. However, kernel sockets do not take a
    reference to their net namespace, so it may begin exiting while the kernel
    socket is still open. In this case if the kernel socket is a tcp socket,
    it will stay open trying to complete its close sequence. The sock's dst(s)
    hold references to their interfaces, and those references are all
    transferred to the namespace's loopback interface when the real
    interfaces are taken down.
    When the namespace tries to take down its loopback interface, it hangs
    waiting for all references to the loopback interface to release, which
    results in messages like:

    unregister_netdevice: waiting for lo to become free. Usage count = 1

    These messages continue until the socket finally times out and closes.
    Since the net namespace cleanup holds the net_mutex while calling its
    registered pernet callbacks, any new net namespace initialization is
    blocked until the current net namespace finishes exiting.

    After this change, the tcp socket notices the exiting net namespace, and
    closes immediately, releasing its dst(s) and their reference to the
    loopback interface, which lets the net namespace continue exiting.

    Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
    Signed-off-by: Dan Streetman
    Signed-off-by: David S. Miller

    Dan Streetman
     

14 Dec, 2017

2 commits

  • Only the retransmit timer currently refreshes tcp_mstamp.

    We should do the same for delayed acks and keepalives.

    Even if RFC 7323 does not request it, this is consistent with what linux
    did in the past, when TS values were based on jiffies.

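    The fix is essentially one refresh near the top of each handler
    (illustrative) :

    /* in tcp_delack_timer_handler() and tcp_keepalive_timer() */
    tcp_mstamp_refresh(tcp_sk(sk));
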
    Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Mike Maloney
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Mike Maloney
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Prior to this patch, active Fast Open is paused on a specific
    destination IP address if the previous connections to the
    IP address have experienced recurring timeouts. But recent
    experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla
    browsers indicate the issue is often caused by broken middle-boxes
    sitting close to the client. Therefore it is a much better user
    experience if Fast Open is disabled outright globally to avoid
    experiencing further timeouts on connections toward other
    destinations.

    This patch changes the destination-IP disablement to global
    disablement when a connection experiences recurring timeouts
    or aborts due to timeout. Repeated incidents would still
    exponentially increase the pause time, starting from an hour.
    This is extremely conservative but an unfortunate compromise to
    minimize bad experience due to broken middle-boxes.

    Reported-by: Dragana Damjanovic
    Reported-by: Patrick McManus
    Signed-off-by: Yuchung Cheng
    Reviewed-by: Wei Wang
    Reviewed-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 Oct, 2017

1 commit

  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.

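    The conversion pattern, as applied to e.g. the TCP write timer
    (illustrative sketch) :

    timer_setup(&icsk->icsk_retransmit_timer, tcp_write_timer, 0);

    static void tcp_write_timer(struct timer_list *t)
    {
            struct inet_connection_sock *icsk =
                            from_timer(icsk, t, icsk_retransmit_timer);
            struct sock *sk = &icsk->icsk_inet.sk;

            /* handler body unchanged; the timer now carries its context */
    }
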
    Cc: "David S. Miller"
    Cc: Gerrit Renker
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: netdev@vger.kernel.org
    Cc: dccp@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     

07 Oct, 2017

1 commit

  • Using a linear list to store all skbs in write queue has been okay
    for quite a while : O(N) is not too bad when N < 500.

    Things get messy when N is on the order of 100,000 : Modern TCP stacks
    want 10Gbit+ of throughput even with 200 ms RTT flows.

    40 ns per cache line miss means a full scan can use 4 ms,
    blowing away CPU caches.

    SACK processing often can use various hints to avoid parsing
    whole retransmit queue. But with high packet losses and/or high
    reordering, hints no longer work.

    The sender has to process thousands of unfriendly SACKs, accumulating
    a huge socket backlog, burning a CPU and massively dropping packets.

    Using an rb-tree for the retransmit queue has been avoided for years
    because it added complexity and overhead, but now is the time
    to be more resistant and say no to quadratic behavior.

    1) RTX queue is no longer part of the write queue : already sent skbs
    are stored in one rb-tree.

    2) Since reaching the head of the write queue no longer needs
    sk->sk_send_head, we added a union of sk_send_head and tcp_rtx_queue;
    the helpers sketched below walk the new rb-tree.

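    The rb-tree is accessed through small helpers (per include/net/tcp.h;
    illustrative) :

    static inline struct sk_buff *tcp_rtx_queue_head(const struct sock *sk)
    {
            return skb_rb_first(&sk->tcp_rtx_queue);
    }

    static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
    {
            return skb_rb_last(&sk->tcp_rtx_queue);
    }
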
    Tested:

    On receiver :
    netem on ingress : delay 150ms 200us loss 1
    GRO disabled to force stress and SACK storms.

    for f in `seq 1 10`
    do
    ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
    done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'

    Before patch :

    323.87
    351.48
    339.59
    338.62
    306.72
    204.07
    304.93
    291.88
    202.47
    176.88
    2840

    After patch:

    1700.83
    2207.98
    2070.17
    1544.26
    2114.76
    2124.89
    1693.14
    1080.91
    2216.82
    1299.94
    18053

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet