04 Nov, 2017

1 commit


03 Nov, 2017

1 commit

  • icsk_accept_queue.fastopenq.lock is only fully initialized at listen()
    time.

    lockdep is not happy if we attempt a spin_lock_bh() on it, because
    of the missing annotation. (Although the kernel runs just fine.)

    Let's use net->ipv4.tcp_fastopen_ctx_lock to protect ctx access.

    Fixes: 1fba70e5b6be ("tcp: socket option to set TCP fast open key")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Christoph Paasch
    Reviewed-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and where references to the
    license had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX license identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
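    For illustration, the identifier referred to above is a single comment on
    the first line of a source file, replacing the multi-paragraph license
    boilerplate; for a default-licensed kernel .c file it looks like:

    ```
    // SPDX-License-Identifier: GPL-2.0
    ```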
     

24 Oct, 2017

1 commit

  • We already allow enabling TFO without a cookie by using the
    fastopen sysctl and setting it to TFO_SERVER_COOKIE_NOT_REQD (or
    TFO_CLIENT_NO_COOKIE).
    This is safe to do in certain environments where we know that there
    isn't a malicious host (e.g., data centers), or when the
    application protocol already provides an authentication mechanism in the
    first flight of data.

    A server however might be providing multiple services or talking to both
    sides (public Internet and data-center). So, this server would want to
    enable cookie-less TFO for certain services and/or for connections that
    go to the data-center.

    This patch exposes a socket-option and a per-route attribute to enable such
    fine-grained configurations.

    Signed-off-by: Christoph Paasch
    Reviewed-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Christoph Paasch
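    A sketch of the resulting configuration surface (the route attribute and
    socket option names are as introduced by this patch; the prefix, gateway,
    and file descriptor are illustrative):

    ```
    # per route: allow cookie-less TFO only toward the trusted
    # data-center prefix; the public side still requires cookies
    ip route add 10.0.0.0/8 via 10.0.0.1 fastopen_no_cookie 1

    # per socket, before listen()/connect():
    #   int one = 1;
    #   setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_NO_COOKIE, &one, sizeof(one));
    ```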
     

20 Oct, 2017

1 commit

  • New socket option TCP_FASTOPEN_KEY to allow different keys per
    listener. The listener by default uses the global key until the
    socket option is set. The key is 16 bytes of binary data. This
    option has no effect on regular non-listener TCP sockets.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Eric Dumazet
    Reviewed-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Yuchung Cheng
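    A minimal userspace sketch of setting a per-listener key. The fixed key
    bytes are purely illustrative, and the fallback option number is an
    assumption for older headers (TCP_FASTOPEN_KEY ships in linux/tcp.h from
    4.14 on):

    ```c
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_FASTOPEN_KEY
    #define TCP_FASTOPEN_KEY 33   /* value from linux/tcp.h, for older headers */
    #endif

    int main(void)
    {
        /* a fixed 16-byte example key; never hardcode a real key */
        uint8_t key[16] = {
            0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
            0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff,
        };

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        /* per-listener key, overriding the global tcp_fastopen_key;
         * reports an error here on pre-4.14 kernels */
        if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_KEY, key, sizeof(key)) != 0)
            perror("setsockopt(TCP_FASTOPEN_KEY)");

        /* dump the key bytes grouped by four, similar in spirit to the
         * /proc/sys/net/ipv4/tcp_fastopen_key display (which prints u32
         * words, so its byte order may differ) */
        for (int i = 0; i < 16; i++)
            printf("%s%02x", (i && i % 4 == 0) ? "-" : "", key[i]);
        printf("\n");
        return 0;
    }
    ```

    The option would normally be set on the listening socket before
    listen(); per the commit above, non-listener sockets are unaffected.
    
    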
     

07 Oct, 2017

1 commit


06 Oct, 2017

1 commit

  • Currently in the TCP code, the initialization sequence for cached
    metrics, congestion control, BPF, etc., after a successful connection
    is very inconsistent. This introduces inconsistent behavior and is
    prone to bugs. The current call sequence is as follows:

    (1) for active case (tcp_finish_connect() case):
    tcp_mtup_init(sk);
    icsk->icsk_af_ops->rebuild_header(sk);
    tcp_init_metrics(sk);
    tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
    tcp_init_congestion_control(sk);
    tcp_init_buffer_space(sk);

    (2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
    icsk->icsk_af_ops->rebuild_header(sk);
    tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
    tcp_init_congestion_control(sk);
    tcp_mtup_init(sk);
    tcp_init_buffer_space(sk);
    tcp_init_metrics(sk);

    (3) for TFO passive case (tcp_fastopen_create_child()):
    inet_csk(child)->icsk_af_ops->rebuild_header(child);
    tcp_init_congestion_control(child);
    tcp_mtup_init(child);
    tcp_init_metrics(child);
    tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
    tcp_init_buffer_space(child);

    This commit unifies the above code paths to use the following sequence:
    tcp_mtup_init(sk);
    icsk->icsk_af_ops->rebuild_header(sk);
    tcp_init_metrics(sk);
    tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
    tcp_init_congestion_control(sk);
    tcp_init_buffer_space(sk);
    This sequence is the same as the (1) active case. We pick this sequence
    because this order correctly allows BPF to override the settings
    including congestion control module and initial cwnd, etc from
    the route, and then allows the CC module to see those settings.

    Suggested-by: Neal Cardwell
    Tested-by: Neal Cardwell
    Signed-off-by: Wei Wang
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

02 Oct, 2017

4 commits

  • Applications in different namespaces might require different time
    periods (in seconds) for disabling Fastopen on active TCP sockets.

    Tested:
    Simulated the following situation, in which the server's data gets
    dropped after the 3WHS:
    C ---- syn-data ---> S
    C <--- syn/ack ----- S
    C ---- ack --------> S
    S (accept & write)
    C?  X <- data ------ S
        [retry and timeout]
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • Applications in different namespaces might require different
    tcp_fastopen_key values, independently of the host.

    David Miller pointed out there is a leak without releasing the context
    of tcp_fastopen_key during netns teardown. So add the release action in
    exit_batch path.

    Tested:
    1. Container namespace:
    # cat /proc/sys/net/ipv4/tcp_fastopen_key:
    2817fff2-f803cf97-eadfd1f3-78c0992b

    cookie key in tcp syn packets:
    Fast Open Cookie
    Kind: TCP Fast Open Cookie (34)
    Length: 10
    Fast Open Cookie: 1e5dd82a8c492ca9

    2. Host:
    # cat /proc/sys/net/ipv4/tcp_fastopen_key:
    107d7c5f-68eb2ac7-02fb06e6-ed341702

    cookie key in tcp syn packets:
    Fast Open Cookie
    Kind: TCP Fast Open Cookie (34)
    Length: 10
    Fast Open Cookie: e213c02bf0afbc8a

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • The 'publish' logic is not necessary after commit dfea2aa65424 ("tcp:
    Do not call tcp_fastopen_reset_cipher from interrupt context"), because
    tcp_fastopen_cookie_gen no longer calls tcp_fastopen_init_key_once.

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • Applications in different namespaces might require enabling the TCP
    Fast Open feature independently of the host.

    This patch series continues making more of the TCP Fast Open related
    sysctl knobs per net-namespace.

    Reported-by: Luca BRUNO
    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
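    A sketch of exercising the now per-namespace knobs (the namespace name
    and values are illustrative; requires CAP_NET_ADMIN):

    ```
    ip netns add tfo-test
    # each namespace now carries its own copies of these knobs
    ip netns exec tfo-test sysctl -w net.ipv4.tcp_fastopen=3
    ip netns exec tfo-test cat /proc/sys/net/ipv4/tcp_fastopen_key
    # the host's values are unaffected
    sysctl net.ipv4.tcp_fastopen
    ```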
     

23 Aug, 2017

1 commit


02 Jul, 2017

1 commit

  • Added callbacks to BPF SOCK_OPS type programs before an active
    connection is initialized and after a passive or active connection is
    established.

    The following patch demonstrates how they can be used to set send and
    receive buffer sizes.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

01 Jul, 2017

1 commit

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the _hint() version must
    be used, we might need to revisit the API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
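    A userspace sketch of the inc-not-zero semantics the commit relies on,
    using C11 atomics. This is a toy version: the kernel's refcount_t
    additionally saturates near overflow, which is omitted here.

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* increment the counter only if it is non-zero, so a racing reader
     * can never resurrect an object whose last reference was dropped */
    static bool refcount_inc_not_zero(atomic_uint *r)
    {
        unsigned int old = atomic_load(r);
        do {
            if (old == 0)
                return false;   /* object already dead */
        } while (!atomic_compare_exchange_weak(r, &old, old + 1));
        return true;
    }

    int main(void)
    {
        atomic_uint live = 2, dead = 0;

        bool got_live = refcount_inc_not_zero(&live);
        printf("live: %d -> %u\n", got_live, (unsigned)atomic_load(&live));

        bool got_dead = refcount_inc_not_zero(&dead);
        printf("dead: %d -> %u\n", got_dead, (unsigned)atomic_load(&dead));
        return 0;
    }
    ```
    
    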
     

25 Apr, 2017

2 commits

  • This counter records the number of times the firewall blackhole issue is
    detected and active TFO is disabled.

    Signed-off-by: Wei Wang
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Wei Wang
     
  • Middlebox firewall issues can potentially cause a server's data to be
    blackholed after a successful 3WHS using TFO. Following are the related
    reports from Apple:
    https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
    Slide 31 identifies an issue where the client ACK to the server's data
    sent during a TFO'd handshake is dropped:
    C ---> syn-data ---> S
    C <--- syn/ack ----- S
    C ---> ack ---> X    S
              [retry and timeout]

    https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
    Slide 5 shows a similar situation, in which the server's data gets
    dropped after the 3WHS:
    C ---- syn-data ---> S
    C <--- syn/ack ----- S
    C ---- ack --------> S
    S (accept & write)
    C?  X <- data ------ S
        [retry and timeout]
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Wei Wang
     

28 Jan, 2017

1 commit


26 Jan, 2017

2 commits

  • This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
    alternative way to perform Fast Open on the active side (client). Prior
    to this patch, a client needs to replace the connect() call with
    sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
    to use Fast Open: these socket operations are often done in lower-layer
    libraries used by many other applications. Changing these libraries
    and/or the socket call sequences is not trivial. A more convenient
    approach is to perform Fast Open by simply enabling a socket option when
    the socket is created, without changing the rest of the socket call
    sequence:
    s = socket()
        // create a new socket
    setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
        // newly introduced sockopt
        If set, the new functionality described below will be used.
        Returns ENOTSUPP if TFO is not supported or not enabled in the
        kernel.

    connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and
    return -1 with errno = EINPROGRESS.

    write()/sendmsg()
    With cookie present, send out SYN with data and return the number of
    bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno =
    EINPROGRESS.
    No MSG_FASTOPEN flag is needed.

    read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
    write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
    established but no msg is received yet.
    Return number of bytes read if socket is established and there is
    msg received.

    The new API simplifies life for applications that always perform a write()
    immediately after a successful connect(). Such applications can now take
    advantage of Fast Open by merely making one new setsockopt() call at the time
    of creating the socket. Nothing else about the application's socket call
    sequence needs to change.

    Signed-off-by: Wei Wang
    Acked-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Wei Wang
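    A minimal sketch of the client-side setup described above. The fallback
    option number is from linux/tcp.h and the option needs a 4.11+ kernel;
    this program only probes for support and does not actually connect:

    ```c
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_FASTOPEN_CONNECT
    #define TCP_FASTOPEN_CONNECT 30   /* value from linux/tcp.h, for older headers */
    #endif

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        /* after this succeeds, connect() can return immediately and the
         * first write()/sendmsg() carries the SYN (with data once a
         * cookie is cached); no MSG_FASTOPEN flag is needed */
        int one = 1;
        int rc = setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
                            &one, sizeof(one));
        printf("TCP_FASTOPEN_CONNECT supported: %s\n", rc == 0 ? "yes" : "no");
        return 0;
    }
    ```
    
    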
     
  • Refactor the cookie check logic in tcp_send_syn_data() into a function.
    This function will be called elsewhere in later changes.

    Signed-off-by: Wei Wang
    Acked-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Wei Wang
     

20 Jan, 2017

1 commit

  • Found that if we run LTP netstress test with large MSS (65K),
    the first attempt from server to send data comparable to this
    MSS on fastopen connection will be delayed by the probe timer.

    Here is an example:

    < S seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32
    > S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
    < . ack 1 win 342 length 0

    Inside tcp_sendmsg(), tcp_send_mss() returns the max MSS in 'mss_now',
    as well as in 'size_goal'. This results in the segment not being queued
    for transmission until all the data is copied from the user buffer. Then,
    inside __tcp_push_pending_frames(), it breaks on the send window test and
    continues with the probe timer check.

    Fragmentation occurs in tcp_write_wakeup()...

    +0.2 > P. seq 1:43777 ack 1 win 342 length 43776
    < . ack 43777, win 1365 length 0
    > P. seq 43777:65001 ack 1 win 342 options [...] length 21224
    ...

    This also contradicts the fact that we should be bounded by half
    of the window if it is large.

    Fix this flaw by correctly initializing max_window. Before that, it
    could have large values that affect further calculations of 'size_goal'.

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Alexey Kodanev
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexey Kodanev
     

14 Jan, 2017

1 commit

  • Fix up a data alignment issue on sparc by swapping the order
    of the cookie byte array field with the length field in
    struct tcp_fastopen_cookie, and making it a proper union
    to clean up the typecasting.

    This addresses log complaints like these:
    log_unaligned: 113 callbacks suppressed
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
    Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
    Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
    Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360

    Cc: Eric Dumazet
    Signed-off-by: Shannon Nelson
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shannon Nelson
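    A userspace sketch of the reordering trick (the field and constant names
    mirror the kernel's, but this replica is illustrative): placing the byte
    array first, inside a union with u64 words, gives it natural 8-byte
    alignment, so word-sized accesses on sparc no longer trap.

    ```c
    #include <stdalign.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TCP_FASTOPEN_COOKIE_MAX 16

    /* replica of the reordered struct tcp_fastopen_cookie layout */
    struct tfo_cookie {
        union {
            uint8_t  val[TCP_FASTOPEN_COOKIE_MAX];
            uint64_t key[TCP_FASTOPEN_COOKIE_MAX / sizeof(uint64_t)];
        };
        int8_t len;
        uint8_t exp;    /* experimental-option flag */
    };

    int main(void)
    {
        /* the cookie bytes now sit at offset 0 with 8-byte alignment */
        printf("offsetof(val) = %zu\n", offsetof(struct tfo_cookie, val));
        printf("alignof = %zu\n", alignof(struct tfo_cookie));
        return 0;
    }
    ```
    
    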
     

09 Sep, 2016

1 commit

  • When DATA and/or FIN are carried in a SYN/ACK message or SYN message,
    we append an skb in socket receive queue, but we forget to call
    sk_forced_mem_schedule().

    Effect is that the socket has a negative sk->sk_forward_alloc as long as
    the message is not read by the application.

    Josh Hunt fixed a similar issue in commit d22e15371811 ("tcp: fix tcp
    fin memory accounting")

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Eric Dumazet
    Reviewed-by: Josh Hunt
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Sep, 2016

1 commit

  • Yuchung noticed that on the first TFO server data packet sent after
    the (TFO) handshake, the server echoed the TCP timestamp value in the
    SYN/data instead of the timestamp value in the final ACK of the
    handshake. This problem did not happen on regular opens.

    The tcp_replace_ts_recent() logic that decides whether to remember an
    incoming TS value needs tp->rcv_wup to hold the latest receive
    sequence number that we have ACKed (latest tp->rcv_nxt we have
    ACKed). This commit fixes this issue by ensuring that a TFO server
    properly updates tp->rcv_wup to match tp->rcv_nxt at the time it sends
    a SYN/ACK for the SYN/data.

    Reported-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: David S. Miller

    Neal Cardwell
     

03 May, 2016

1 commit

  • We want to make the TCP stack preemptible, as draining the prequeue
    and backlog queues can take a lot of time.

    Many SNMP updates were assuming that BH (and preemption) was disabled.

    We need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
    and some __TCP_INC_STATS() to TCP_INC_STATS().

    Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
    and tcp_v4_send_ack(), we add an explicit preempt disabled section.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2016

1 commit


20 Mar, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support more Realtek wireless chips, from Jes Sorenson.

    2) New BPF types for per-cpu hash and array maps, from Alexei
    Starovoitov.

    3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

    4) Allow the use of SO_REUSEPORT in order to do per-thread processing
    of incoming TCP/UDP connections. The muxing can be done using a
    BPF program which hashes the incoming packet. From Craig Gallek.

    5) Add a multiplexer for TCP streams, to provide a messaged based
    interface. BPF programs can be used to determine the message
    boundaries. From Tom Herbert.

    6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

    7) Avoid factorial complexity when taking down an inetdev interface
    with lots of configured addresses. We were doing things like
    traversing the entire address list for each address removed, and
    flushing the entire netfilter conntrack table for every address as
    well.

    8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

    9) Allow offloading u32 classifiers to hardware, and implement for
    ixgbe, from John Fastabend.

    10) Allow configuring IRQ coalescing parameters on a per-queue basis,
    from Kan Liang.

    11) Extend ethtool so that larger link mode masks can be supported.
    From David Decotigny.

    12) Introduce devlink, which can be used to configure port link types
    (ethernet vs Infiniband, etc.), port splitting, and switch device
    level attributes as a whole. From Jiri Pirko.

    13) Hardware offload support for flower classifiers, from Amir Vadai.

    14) Add "Local Checksum Offload". Basically, for a tunneled packet
    the checksum of the outer header is 'constant' (because with the
    checksum field filled into the inner protocol header, the payload
    of the outer frame checksums to 'zero'), and we can take advantage
    of that in various ways. From Edward Cree"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
    bonding: fix bond_get_stats()
    net: bcmgenet: fix dma api length mismatch
    net/mlx4_core: Fix backward compatibility on VFs
    phy: mdio-thunder: Fix some Kconfig typos
    lan78xx: add ndo_get_stats64
    lan78xx: handle statistics counter rollover
    RDS: TCP: Remove unused constant
    RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
    net: smc911x: convert pxa dma to dmaengine
    team: remove duplicate set of flag IFF_MULTICAST
    bonding: remove duplicate set of flag IFF_MULTICAST
    net: fix a comment typo
    ethernet: micrel: fix some error codes
    ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
    bpf, dst: add and use dst_tclassid helper
    bpf: make skb->tc_classid also readable
    net: mvneta: bm: clarify dependencies
    cls_bpf: reset class and reuse major in da
    ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
    ldmvsw: Add ldmvsw.c driver code
    ...

    Linus Torvalds
     

15 Mar, 2016

1 commit

  • Per RFC4898, they count segments sent/received
    containing a positive length data segment (that includes
    retransmission segments carrying data). Unlike
    tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
    carrying no data (e.g. pure ack).

    The patch also updates the segs_in in tcp_fastopen_add_skb()
    so that segs_in >= data_segs_in property is kept.

    Together with retransmission data, tcpi_data_segs_out
    gives a better signal on the rxmit rate.

    v6: Rebase on the latest net-next

    v5: Eric pointed out that checking skb->len is still needed in
    tcp_fastopen_add_skb() because skb can carry a FIN without data.
    Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
    helper is used. Comment is added to the fastopen case to explain why
    segs_in has to be reset and tcp_segs_in() has to be called before
    __skb_pull().

    v4: Add comment to the changes in tcp_fastopen_add_skb()
    and also add remark on this case in the commit message.

    v3: Add const modifier to the skb parameter in tcp_segs_in()

    v2: Rework based on recent fix by Eric:
    commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment")

    Signed-off-by: Martin KaFai Lau
    Cc: Chris Rapier
    Cc: Eric Dumazet
    Cc: Marcelo Ricardo Leitner
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

07 Feb, 2016

1 commit

  • When we acknowledge a FIN, it is not enough to ack the sequence number
    and queue the skb into receive queue. We also have to call tcp_fin()
    to properly update socket state and send proper poll() notifications.

    It seems we also had the problem if we received a SYN packet with the
    FIN flag set, but it does not seem an urgent issue, as no known
    implementation can do that.

    Fixes: 61d2bcae99f6 ("tcp: fastopen: accept data/FIN present in SYNACK message")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Feb, 2016

2 commits

  • If we remove the SYN flag from the skbs that tcp_fastopen_add_skb()
    places in socket receive queue, then we can remove the test that
    tcp_recvmsg() has to perform in fast path.

    All we have to do is to adjust SEQ in the slow path.

    For the moment, we place an unlikely() and output a message
    if we find an skb with the SYN flag set.
    The goal is to get rid of the test completely.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • RFC 7413 (TCP Fast Open) section 4.2.2 states that the SYNACK message
    MAY include data and/or a FIN.

    This patch adds support for the client side:

    If we receive a SYNACK with payload or FIN, queue the skb instead
    of ignoring it.

    Since we already support the same for SYN, we refactor the existing
    code and reuse it. Note we need to clone the skb, so this operation
    might fail under memory pressure.

    Sara Dickinson pointed out FreeBSD server Fast Open implementation
    was planned to generate such SYNACK in the future.

    The server side might be implemented on Linux later.

    Reported-by: Sara Dickinson
    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jan, 2016

1 commit


23 Oct, 2015

1 commit

  • Multiple cpus can process duplicates of incoming ACK messages
    matching a SYN_RECV request socket. This is a rare event under
    normal operations, but definitely can happen.

    Only one must win the race, otherwise corruption would occur.

    To fix this without adding new atomic ops, we use logic in
    inet_ehash_nolisten() to detect the request was present in the same
    ehash bucket where we try to insert the new child.

    If request socket was not found, we have to undo the child creation.

    This actually removes a spin_lock()/spin_unlock() pair in
    reqsk_queue_unlink() for the fast path.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Oct, 2015

1 commit

  • There are multiple races that need fixes :

    1) skb_get() + queue skb + kfree_skb() is racy

    An accept() can be done on another cpu, data consumed immediately.
    tcp_recvmsg() uses __kfree_skb(), as it is assumed all skbs found in the
    socket receive queue are private.

    Then the kfree_skb() in tcp_rcv_state_process() uses an already-freed skb.

    2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen()
    for the same reasons.

    3) We want to send the SYNACK before queueing child into accept queue,
    otherwise we might reintroduce the ooo issue fixed in
    commit 7c85af881044 ("tcp: avoid reorders for TFO passive connections")

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Oct, 2015

1 commit

  • If a listen backlog is very big (to avoid syncookies), then
    the listener sk->sk_wmem_alloc is the main source of false
    sharing, as we need to touch it twice per SYNACK re-transmit
    and TX completion.

    (One SYN packet takes the listener lock once, but up to 6 SYNACKs
    are generated.)

    By attaching the skb to the request socket, we remove this
    source of contention.

    Tested:

    listen(fd, 10485760); // single listener (no SO_REUSEPORT)
    16 RX/TX queue NIC
    Sustain a SYNFLOOD attack of ~320,000 SYN per second,
    Sending ~1,400,000 SYNACK per second.
    Perf profiles now show listener spinlock being next bottleneck.

    20.29% [kernel] [k] queued_spin_lock_slowpath
    10.06% [kernel] [k] __inet_lookup_established
    5.12% [kernel] [k] reqsk_timer_handler
    3.22% [kernel] [k] get_next_timer_interrupt
    3.00% [kernel] [k] tcp_make_synack
    2.77% [kernel] [k] ipt_do_table
    2.70% [kernel] [k] run_timer_softirq
    2.50% [kernel] [k] ip_finish_output
    2.04% [kernel] [k] cascade

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2015

1 commit

  • While auditing TCP stack for upcoming 'lockless' listener changes,
    I found I had to change fastopen_init_queue() to properly init the object
    before publishing it.

    Otherwise another cpu could try to lock the spinlock before it gets
    properly initialized.

    Instead of adding appropriate barriers, just remove the dynamic memory
    allocations:
    - The structure is 28 bytes on 64bit arches. Using an additional 8 bytes
    for holding a pointer seems overkill.
    - Two listeners can share the same cache line, and performance would suffer.

    If we really want to save a few bytes, we could instead dynamically
    allocate the whole struct request_sock_queue in the future.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Sep, 2015

1 commit

  • We found that a TCP Fast Open passive connection was vulnerable
    to reorders, as the exchange might look like

    [1] C -> S S
    [2] S -> C S. ack request
    [3] S -> C .

    packets [2] and [3] can be generated at almost the same time.

    If C receives the 3rd packet before the 2nd, it will drop it as
    the socket is in SYN_SENT state and expects a SYNACK.

    S will have to retransmit the answer.

    Current OOO avoidance in linux is defeated because SYNACK
    packets are attached to the LISTEN socket, while DATA packets
    are attached to the children. They might be sent by different cpus,
    and different TX queues might be selected.

    It turns out that for TFO, we created a child, which is a
    full-blown socket in TCP_SYN_RECV state, and we can simply attach
    the SYNACK packet to this socket.

    This means that at the time tcp_sendmsg() pushes DATA packet,
    skb->ooo_okay will be set iff the SYNACK packet had been sent
    and TX completed.

    This removes the reorder source at the host level.

    We also removed the export of tcp_try_fastopen(), as it is no
    longer called from IPv6.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Jun, 2015

1 commit

  • tcp_fastopen_reset_cipher really cannot be called from interrupt
    context. It allocates the tcp_fastopen_context with GFP_KERNEL and
    calls crypto_alloc_cipher, which allocates all kinds of things with
    GFP_KERNEL.

    Thus, we might sleep when the key-generation is triggered by an
    incoming TFO cookie-request which would then happen in interrupt-
    context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:

    [ 36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
    [ 36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
    [ 36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
    [ 36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [ 36.008250] 00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
    [ 36.009630] ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
    [ 36.011076] 0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
    [ 36.012494] Call Trace:
    [ 36.012953] [] dump_stack+0x4f/0x6d
    [ 36.014085] [] ___might_sleep+0x103/0x170
    [ 36.015117] [] __might_sleep+0x52/0x90
    [ 36.016117] [] kmem_cache_alloc_trace+0x47/0x190
    [ 36.017266] [] ? tcp_fastopen_reset_cipher+0x42/0x130
    [ 36.018485] [] tcp_fastopen_reset_cipher+0x42/0x130
    [ 36.019679] [] tcp_fastopen_init_key_once+0x61/0x70
    [ 36.020884] [] __tcp_fastopen_cookie_gen+0x1c/0x60
    [ 36.022058] [] tcp_try_fastopen+0x58f/0x730
    [ 36.023118] [] tcp_conn_request+0x3e8/0x7b0
    [ 36.024185] [] ? __module_text_address+0x12/0x60
    [ 36.025327] [] tcp_v4_conn_request+0x51/0x60
    [ 36.026410] [] tcp_rcv_state_process+0x190/0xda0
    [ 36.027556] [] ? __inet_lookup_established+0x47/0x170
    [ 36.028784] [] tcp_v4_do_rcv+0x16d/0x3d0
    [ 36.029832] [] ? security_sock_rcv_skb+0x16/0x20
    [ 36.030936] [] tcp_v4_rcv+0x77a/0x7b0
    [ 36.031875] [] ? iptable_filter_hook+0x33/0x70
    [ 36.032953] [] ip_local_deliver_finish+0x92/0x1f0
    [ 36.034065] [] ip_local_deliver+0x9a/0xb0
    [ 36.035069] [] ? ip_rcv+0x3d0/0x3d0
    [ 36.035963] [] ip_rcv_finish+0x119/0x330
    [ 36.036950] [] ip_rcv+0x2e7/0x3d0
    [ 36.037847] [] __netif_receive_skb_core+0x552/0x930
    [ 36.038994] [] __netif_receive_skb+0x27/0x70
    [ 36.040033] [] process_backlog+0xd2/0x1f0
    [ 36.041025] [] net_rx_action+0x122/0x310
    [ 36.042007] [] __do_softirq+0x103/0x2f0
    [ 36.042978] [] do_softirq_own_stack+0x1c/0x30

    This patch moves the call to tcp_fastopen_init_key_once to the places
    where a listener socket creates its TFO state, which always happens in
    user context (either from the setsockopt, or implicitly during the
    listen() call).

    Cc: Eric Dumazet
    Cc: Hannes Frederic Sowa
    Fixes: 222e83d2e0ae ("tcp: switch tcp_fastopen key generation to net_get_random_once")
    Signed-off-by: Christoph Paasch
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Christoph Paasch
     

23 May, 2015

1 commit

  • Taking socket spinlock in tcp_get_info() can deadlock, as
    inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
    while packet processing can use the reverse locking order.

    We could avoid this locking for TCP_LISTEN states, but lockdep would
    certainly get confused as all TCP sockets share same lockdep classes.

    [ 523.722504] ======================================================
    [ 523.728706] [ INFO: possible circular locking dependency detected ]
    [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
    [ 523.739202] -------------------------------------------------------
    [ 523.745474] ss/18032 is trying to acquire lock:
    [ 523.750002] (slock-AF_INET){+.-...}, at: [] tcp_get_info+0x2c4/0x360
    [ 523.758129]
    [ 523.758129] but task is already holding lock:
    [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [] inet_diag_dump_icsk+0x1d5/0x6c0
    [ 523.774661]
    [ 523.774661] which lock already depends on the new lock.
    [ 523.774661]
    [ 523.782850]
    [ 523.782850] the existing dependency chain (in reverse order) is:
    [ 523.790326]
    -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
    [ 523.796599] [] lock_acquire+0xbb/0x270
    [ 523.802565] [] _raw_spin_lock+0x38/0x50
    [ 523.808628] [] __inet_hash_nolisten+0x78/0x110
    [ 523.815273] [] tcp_v4_syn_recv_sock+0x24b/0x350
    [ 523.822067] [] tcp_check_req+0x3c1/0x500
    [ 523.828199] [] tcp_v4_do_rcv+0x239/0x3d0
    [ 523.834331] [] tcp_v4_rcv+0xa8e/0xc10
    [ 523.840202] [] ip_local_deliver_finish+0x133/0x3e0
    [ 523.847214] [] ip_local_deliver+0xaa/0xc0
    [ 523.853440] [] ip_rcv_finish+0x168/0x5c0
    [ 523.859624] [] ip_rcv+0x307/0x420

    Let's use the u64_sync infrastructure instead. As a bonus, 64bit
    arches get optimized, as these are nops for them.

    Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Apr, 2015

1 commit

  • This patch tracks total number of payload bytes received on a TCP socket.
    This is the sum of all changes done to tp->rcv_nxt.

    RFC4898 names this tcpEStatsAppHCThruOctetsReceived.

    This is a 64bit field, and can be fetched both from TCP_INFO
    getsockopt() if one has a handle on a TCP socket, or from inet_diag
    netlink facility (iproute2/ss patch will follow)

    Note that tp->bytes_received was placed near tp->rcv_nxt for
    best data locality and minimal performance impact.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Matt Mathis
    Cc: Eric Salo
    Cc: Martin Lau
    Cc: Chris Rapier
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Apr, 2015

1 commit

  • Fast Open has been using the experimental option with a magic number
    (RFC6994) to request and grant Fast Open cookies. This patch enables
    the server to support the official IANA option 34 in RFC7413 in
    addition.

    The change has passed all existing Fast Open tests with both
    old and new options at Google.

    Signed-off-by: Daniel Lee
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Daniel Lee