23 Sep, 2012

2 commits

  • When taking SYNACK RTT samples for servers using TCP Fast Open, fix
    the code to ensure that we only call tcp_valid_rtt_meas() after we
    receive the ACK that completes the 3-way handshake.

    Previously we were always taking an RTT sample in
    tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
    tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
    receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
    to take the RTT sample.

    To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
    before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
    already ensures that we only take a SYNACK RTT sample if snt_synack is
    non-zero. To be careful, we only take a snt_synack timestamp when
    a SYNACK transmit or retransmit succeeds.
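
    Below is a minimal standalone sketch of that rule (illustrative
    names and types, not the kernel's code): the timestamp is recorded
    only on a successful (re)transmit, and the sample is taken only
    when the stamp is non-zero.

        #include <stdbool.h>
        #include <stdint.h>

        struct synack_info { uint32_t snt_synack; /* 0 == no valid stamp */ };

        /* Stamp the request only when the SYNACK (re)transmit succeeded. */
        static void mark_synack_sent(struct synack_info *si, bool xmit_ok,
                                     uint32_t now)
        {
                if (xmit_ok)
                        si->snt_synack = now;
        }

        /* Sample the RTT only once the handshake-completing ACK arrives,
         * and only if a valid timestamp exists. */
        static bool synack_rtt(const struct synack_info *si, uint32_t now,
                               uint32_t *rtt)
        {
                if (!si->snt_synack)
                        return false;
                *rtt = now - si->snt_synack;
                return true;
        }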

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • In preparation for adding another spot where we compute the SYNACK
    RTT, extract this code so that it can be shared.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

04 Sep, 2012

1 commit

  • Use proportional rate reduction (PRR) algorithm to reduce cwnd in CWR state,
    in addition to Recovery state. Retire the current rate-halving in CWR.
    When losses are detected via ACKs in CWR state, the sender enters Recovery
    state but the cwnd reduction continues and does not restart.

    Rename and refactor the cwnd reduction functions, since both CWR and
    Recovery use the same algorithm:
    tcp_init_cwnd_reduction() is new and initializes the reduction state
    variables.
    tcp_cwnd_reduction() was previously tcp_update_cwnd_in_recovery().
    tcp_end_cwnd_reduction() was previously tcp_complete_cwr().

    The rate halving functions and logic such as tcp_cwnd_down(), tcp_min_cwnd(),
    and the cwnd moderation inside tcp_enter_cwr() are removed. The unused
    parameter, flag, in tcp_cwnd_reduction() is also removed.
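
    For reference, a standalone sketch of the proportional rate
    reduction step that CWR and Recovery now share, following RFC 6937
    (illustrative names, not the kernel's exact code):

        #include <stdint.h>

        /* Packets the sender may transmit on this ACK, so that cwnd
         * shrinks toward ssthresh in proportion to delivered data. */
        static uint32_t prr_sndcnt(uint64_t prr_delivered, uint32_t ssthresh,
                                   uint32_t prior_cwnd, uint32_t prr_out)
        {
                /* ceil(prr_delivered * ssthresh / prior_cwnd) - prr_out */
                uint64_t budget =
                        (prr_delivered * ssthresh + prior_cwnd - 1) / prior_cwnd;

                return budget > prr_out ? (uint32_t)(budget - prr_out) : 0;
        }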

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

01 Sep, 2012

3 commits

  • This patch builds on top of the previous patch to add support
    for TFO listeners. This includes:

    1. allocating, properly initializing, and managing the per listener
    fastopen_queue structure when TFO is enabled

    2. changes to the inet_csk_accept code to support TFO. E.g., the
    request_sock can no longer be freed upon accept(), not until 3WHS
    finishes

    3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
    if it's a TFO socket

    4. properly closing a TFO listener, and a TFO socket before 3WHS
    finishes

    5. supporting TCP_FASTOPEN socket option

    6. modifying tcp_check_req() to check a TFO socket as well as a
    request_sock

    7. supporting TCP's TFO cookie option

    8. adding a new SYN-ACK retransmit handler to use the timer directly
    off the TFO socket rather than the listener socket. Note that TFO
    server side will not retransmit anything other than SYN-ACK until
    the 3WHS is completed.

    The patch also contains an important function
    "reqsk_fastopen_remove()" to manage the somewhat complex relation
    between a listener, its request_sock, and the corresponding child
    socket. See the comment above the function for details.

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
  • This patch adds all the necessary data structure and support
    functions to implement TFO server side. It also documents a number
    of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
    extension MIBs.

    In addition, it includes the following:

    1. a new TCP_FASTOPEN socket option that an application must set,
    supplying the maximum backlog allowed, in order to enable TFO on its
    listener.

    2. A number of key data structures:
    "fastopen_rsk" in tcp_sock - for a big socket to access its
    request_sock for retransmission and ack processing purpose. It is
    non-NULL iff 3WHS not completed.

    "fastopenq" in request_sock_queue - points to a per Fast Open
    listener data structure "fastopen_queue" to keep track of qlen (# of
    outstanding Fast Open requests) and max_qlen, among other things.

    "listener" in tcp_request_sock - to point to the original listener
    for book-keeping purpose, i.e., to maintain qlen against max_qlen
    as part of defense against IP spoofing attack.

    3. various data structures and functions, many in tcp_fastopen.c, to
    support server-side Fast Open cookie operations, including
    /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
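
    For orientation, a simplified sketch of how these pieces might
    relate (field names follow the description above; the real kernel
    definitions carry more fields):

        /* Per-listener Fast Open state, reached via request_sock_queue. */
        struct fastopen_queue_sketch {
                struct request_sock_sketch *rskq_rst_head; /* pending resets */
                int qlen;     /* # of outstanding (3WHS incomplete) requests */
                int max_qlen; /* cap supplied via the TCP_FASTOPEN sockopt */
        };

        /* Each TFO request keeps a backpointer to its listener so qlen
         * can be maintained against max_qlen (anti-spoofing defense). */
        struct request_sock_sketch {
                struct sock_sketch *listener;
        };

        /* The child tcp_sock keeps fastopen_rsk: non-NULL iff the 3WHS
         * has not completed yet. */
        struct tcp_sock_sketch {
                struct request_sock_sketch *fastopen_rsk;
        };

        struct sock_sketch {
                struct fastopen_queue_sketch *fastopenq;
        };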

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
  • Commit 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS for
    the passive open side") changed the initRTO from 3secs to 1sec in
    accordance to RFC6298 (former RFC2988bis). This reduced the time till
    the last SYN retransmission packet gets sent from 93secs to 31secs.

    RFC1122 states that the retransmission should be done for at least 3
    minutes, but this seems to be quite high.

    "However, the values of R1 and R2 may be different for SYN
    and data segments. In particular, R2 for a SYN segment MUST
    be set large enough to provide retransmission of the segment
    for at least 3 minutes. The application can close the
    connection (i.e., give up on the open attempt) sooner, of
    course."

    This patch increases the value of TCP_SYN_RETRIES to 6, providing a
    retransmission window of 63secs (with a 1sec initRTO and exponential
    backoff: 1 + 2 + 4 + 8 + 16 + 32 = 63).

    The comments for SYN and SYNACK retries have also been updated to
    describe the current settings. The same goes for the documentation file
    "Documentation/networking/ip-sysctl.txt".

    Signed-off-by: Alexander Bergmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alex Bergmann
     

25 Aug, 2012

1 commit


15 Aug, 2012

1 commit


10 Aug, 2012

1 commit

  • commit 5d299f3d3c8a2fb (net: ipv6: fix TCP early demux) added a
    regression for the ipv6_mapped case.

    [ 67.422369] SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
    [ 67.449678] SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
    [ 92.631060] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 92.631435] IP: [< (null)>] (null)
    [ 92.631645] PGD 0
    [ 92.631846] Oops: 0010 [#1] SMP
    [ 92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd uhci_hcd
    [ 92.634294] CPU 0
    [ 92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
    [ 92.634294] RIP: 0010:[] [< (null)>] (null)
    [ 92.634294] RSP: 0018:ffff880245fc7cb0 EFLAGS: 00010282
    [ 92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX: 0000000000000000
    [ 92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI: ffff88024827ad00
    [ 92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09: 0000000000000000
    [ 92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12: ffff880254735380
    [ 92.634294] R13: ffff880254735380 R14: 0000000000000000 R15: 7fffffffffff0218
    [ 92.634294] FS: 00007f4516ccd6f0(0000) GS:ffff880256600000(0000) knlGS:0000000000000000
    [ 92.634294] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4: 00000000000007f0
    [ 92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000, task ffff880254b8cac0)
    [ 92.634294] Stack:
    [ 92.634294] ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8 ffff880245fc7d68
    [ 92.634294] ffffffff81385083 00000000001d2680 ffff8802547353a8 ffff880245fc7d18
    [ 92.634294] ffffffff8105903a ffff88024827ad60 0000000000000002 00000000000000ff
    [ 92.634294] Call Trace:
    [ 92.634294] [] ? tcp_finish_connect+0x2c/0xfa
    [ 92.634294] [] tcp_rcv_state_process+0x2b6/0x9c6
    [ 92.634294] [] ? sched_clock_cpu+0xc3/0xd1
    [ 92.634294] [] ? local_clock+0x2b/0x3c
    [ 92.634294] [] tcp_v4_do_rcv+0x63a/0x670
    [ 92.634294] [] release_sock+0x128/0x1bd
    [ 92.634294] [] __inet_stream_connect+0x1b1/0x352
    [ 92.634294] [] ? lock_sock_nested+0x74/0x7f
    [ 92.634294] [] ? wake_up_bit+0x25/0x25
    [ 92.634294] [] ? lock_sock_nested+0x74/0x7f
    [ 92.634294] [] ? inet_stream_connect+0x22/0x4b
    [ 92.634294] [] inet_stream_connect+0x33/0x4b
    [ 92.634294] [] sys_connect+0x78/0x9e
    [ 92.634294] [] ? sysret_check+0x1b/0x56
    [ 92.634294] [] ? __audit_syscall_entry+0x195/0x1c8
    [ 92.634294] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [ 92.634294] [] system_call_fastpath+0x16/0x1b
    [ 92.634294] Code: Bad RIP value.
    [ 92.634294] RIP [< (null)>] (null)
    [ 92.634294] RSP
    [ 92.634294] CR2: 0000000000000000
    [ 92.648982] ---[ end trace 24e2bed94314c8d9 ]---
    [ 92.649146] Kernel panic - not syncing: Fatal exception in interrupt

    Fix this using inet_sk_rx_dst_set(), and export this function in case
    IPv6 is modular.
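
    For illustration, a userspace-compilable sketch of what such a
    helper does (stand-in types; the real function operates on struct
    sock and struct sk_buff):

        /* Stand-in types; the kernel versions live in include/net/. */
        struct dst_entry { int refcnt; };
        struct sk_buff { struct dst_entry *dst; int skb_iif; };
        struct sock_sketch { struct dst_entry *sk_rx_dst; int rx_dst_ifindex; };

        static void dst_hold(struct dst_entry *dst) { dst->refcnt++; }

        /* inet_sk_rx_dst_set()-style helper: take a reference on the
         * skb's route and cache it, plus the input ifindex, on the
         * socket, so the IPv4 and ipv6_mapped paths share one place. */
        static void rx_dst_set(struct sock_sketch *sk, struct sk_buff *skb)
        {
                struct dst_entry *dst = skb->dst;

                dst_hold(dst);
                sk->sk_rx_dst = dst;
                sk->rx_dst_ifindex = skb->skb_iif;
        }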

    Reported-by: Andrew Morton
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jul, 2012

1 commit

  • The modern TCP stack depends heavily on tcp_write_timer() having
    small latency, but the current implementation doesn't quite meet
    that expectation.

    When a timer fires but finds the socket is owned by the user, it rearms
    itself for an additional delay hoping next run will be more
    successful.

    tcp_write_timer(), for example, uses a 50ms delay for the next try,
    which defeats many attempts to get predictable TCP behavior in terms
    of latency.

    Use the recently introduced tcp_release_cb(), so that the user owning
    the socket will call various handlers right before socket release.

    This will permit us to post a followup patch to address the
    tcp_tso_should_defer() syndrome (some deferred packets have to wait
    for the RTO timer to be transmitted, while cwnd should allow us to
    send them sooner).
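
    A compilable sketch of the deferral pattern described above
    (illustrative names; the kernel's flag handling differs in detail):

        #include <stdatomic.h>
        #include <stdbool.h>

        enum { WRITE_TIMER_DEFERRED = 1u << 0 };

        static _Atomic unsigned int tsq_flags;

        /* Timer handler: if the user owns the socket, record a flag
         * instead of rearming for a 50ms retry. */
        static void write_timer_fired(bool sock_owned_by_user)
        {
                if (sock_owned_by_user) {
                        atomic_fetch_or(&tsq_flags, WRITE_TIMER_DEFERRED);
                        return;
                }
                /* ... run the (re)transmit logic directly ... */
        }

        /* Called from release_sock() right before the lock is dropped. */
        static void release_cb(void)
        {
                unsigned int flags = atomic_exchange(&tsq_flags, 0);

                if (flags & WRITE_TIMER_DEFERRED) {
                        /* ... run the deferred (re)transmit logic now ... */
                }
        }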

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Nandita Dukkipati
    Cc: H.K. Jerry Chu
    Cc: John Heffner
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jul, 2012

6 commits

  • In trusted networks, e.g., an intranet or a data center, the client
    does not need to use a Fast Open cookie to mitigate DoS attacks. In
    cookie-less mode, sendmsg() with the MSG_FASTOPEN flag will send
    SYN-data regardless of cookie availability.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • On paths with firewalls dropping SYN packets with data or experimental
    TCP options, Fast Open connections will experience SYN timeouts and
    bad performance. The solution is to track such incidents in the cookie
    cache and disable Fast Open temporarily.

    Since only the original SYN includes data and/or the Fast Open option,
    the SYN-ACK has some tell-tale signs (tcp_rcv_fastopen_synack()) to
    detect such drops. If a path has recurring Fast Open SYN drops, Fast
    Open is disabled for 2^(recurring_losses) minutes, starting from four
    minutes up to roughly one and a half days. sendmsg() with the
    MSG_FASTOPEN flag will still succeed, but it behaves as connect()
    followed by write().

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
    and write(2). The application should replace connect() with it to
    send data in the opening SYN packet.

    For a blocking socket, sendmsg() blocks until all the data is
    buffered locally and the handshake is completed, as a connect() call
    does. It returns a similar errno to connect() if the TCP handshake
    fails.

    For a non-blocking socket, it returns the number of bytes queued (and
    transmitted in the SYN-data packet) if a cookie is available. If a
    cookie is not available, it transmits a data-less SYN packet with the
    Fast Open cookie request option and returns -EINPROGRESS like
    connect().

    Using MSG_FASTOPEN on a connecting or connected socket will result in
    similar errnos to repeated connect() calls. Therefore the application
    should only use this flag on new sockets.

    The buffer size of sendmsg() is independent of the MSS of the connection.
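
    For illustration, a minimal hypothetical client using this API; the
    address, port, and payload are placeholders, and MSG_FASTOPEN is
    defined locally in case older headers lack it:

        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <arpa/inet.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        #ifndef MSG_FASTOPEN
        #define MSG_FASTOPEN 0x20000000
        #endif

        int main(void)
        {
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                struct sockaddr_in addr;
                const char req[] = "GET / HTTP/1.0\r\n\r\n";
                ssize_t n;

                memset(&addr, 0, sizeof(addr));
                addr.sin_family = AF_INET;
                addr.sin_port = htons(80);
                inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);

                /* Replaces connect() + write(): the payload rides in
                 * the SYN when a Fast Open cookie is cached. */
                n = sendto(fd, req, sizeof(req) - 1, MSG_FASTOPEN,
                           (struct sockaddr *)&addr, sizeof(addr));
                if (n < 0)
                        perror("sendto");
                close(fd);
                return 0;
        }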

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements sending SYN-data in tcp_connect(). The data is
    from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).

    The length of the cookie in tcp_fastopen_req, initialized to 0,
    controls the type of the SYN. If the cookie is not cached (len == 0),
    the host sends a data-less SYN with the Fast Open cookie request
    option to solicit a cookie from the remote. If a cookie is available
    (len > 0), the host sends a SYN-data with the Fast Open cookie
    option. If the cookie length is negative, the SYN will not include
    any Fast Open option (for fallback operations).
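
    A trivial standalone sketch of this three-way dispatch (names are
    illustrative):

        enum syn_type {
                SYN_COOKIE_REQUEST,   /* len == 0: solicit a cookie   */
                SYN_DATA_WITH_COOKIE, /* len > 0: send SYN-data       */
                SYN_PLAIN,            /* len < 0: no Fast Open option */
        };

        static enum syn_type classify_syn(int cookie_len)
        {
                if (cookie_len == 0)
                        return SYN_COOKIE_REQUEST;
                if (cookie_len > 0)
                        return SYN_DATA_WITH_COOKIE;
                return SYN_PLAIN;
        }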

    To deal with middleboxes that may drop SYN with data or experimental TCP
    option, the SYN-data is only sent once. SYN retransmits do not include
    data or Fast Open options. The connection will fall back to regular TCP
    handshake.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • With help from Eric Dumazet, add Fast Open metrics to the TCP metrics
    cache. The basic ones are the MSS and the cookies. A later patch will
    cache more to handle unfriendly middleboxes.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements the common code for both the client and server.

    1. TCP Fast Open option processing. Since Fast Open does not have an
    option number assigned by IANA yet, it shares the experimental option
    code 254 by implementing draft-ietf-tcpm-experimental-options
    with a 16-bit magic number 0xF989. This enables global experiments
    without clashing with the scarce experimental option codepoints (only
    two) available for TCP.

    When the draft becomes a standard (maybe), the client should
    switch to the newly assigned option number while the server supports
    both numbers for the transition. A sketch of the option layout
    appears after this list.

    2. The new sysctl tcp_fastopen

    3. A placeholder init function
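
    A sketch of the resulting option layout on the wire, assuming an
    8-byte cookie (the cookie bytes below are made up):

        #include <stdio.h>

        int main(void)
        {
                unsigned char opt[] = {
                        254,        /* kind: experimental option       */
                        12,         /* length: 2 + 2 magic + 8 cookie  */
                        0xF9, 0x89, /* magic identifying Fast Open     */
                        0xDE, 0xAD, 0xBE, 0xEF,
                        0x01, 0x02, 0x03, 0x04, /* example cookie      */
                };
                unsigned int i;

                for (i = 0; i < sizeof(opt); i++)
                        printf("%02x ", opt[i]);
                printf("\n");
                return 0;
        }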

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

17 Jul, 2012

1 commit

  • Implement the RFC 5961 mitigation against the Blind
    Reset attack using the RST bit.

    The idea is to validate the incoming RST sequence number
    against an exact match with RCV.NXT, instead of the previously
    accepted in-window range (RCV.NXT <= SEG.SEQ < RCV.NXT + RCV.WND).

    If the sequence is in the window but not an exact match, send
    a "challenge ACK", so that the other party can resend an
    RST with the appropriate sequence number.
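
    A standalone sketch of this validation (illustrative, using
    wrap-safe sequence arithmetic; not the kernel's exact code):

        #include <stdint.h>

        enum rst_action { RST_DROP, RST_CHALLENGE_ACK, RST_ACCEPT };

        static enum rst_action validate_rst(uint32_t seg_seq, uint32_t rcv_nxt,
                                            uint32_t rcv_wnd)
        {
                /* Entirely outside the window: drop silently. */
                if ((int32_t)(seg_seq - rcv_nxt) < 0 ||
                    (int32_t)(seg_seq - (rcv_nxt + rcv_wnd)) >= 0)
                        return RST_DROP;
                /* Exact match with RCV.NXT: accept the reset. */
                if (seg_seq == rcv_nxt)
                        return RST_ACCEPT;
                /* In window but inexact: make the peer prove itself. */
                return RST_CHALLENGE_ACK;
        }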

    Add a new sysctl, tcp_challenge_ack_limit, to limit the
    number of challenge ACKs sent per second.

    Add a new SNMP counter to count the number of challenge ACKs sent.
    (netstat -s | grep TCPChallengeACK)

    Signed-off-by: Eric Dumazet
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2012

1 commit

  • This introduces TSQ (TCP Small Queues).

    The goal of TSQ is to reduce the number of TCP packets in xmit queues
    (qdisc & device queues), to reduce RTT and cwnd bias, part of the
    bufferbloat problem.

    sk->sk_wmem_alloc is not allowed to grow above a given limit,
    allowing no more than ~128KB [1] per tcp socket in the qdisc/dev
    layers at a given time.

    TSO packets are sized/capped to half the limit, so that we have two
    TSO packets in flight, allowing better bandwidth use.

    As a side effect, setting the limit to 40000 automatically reduces
    the standard gso max limit (65536) to 40000/2 = 20000: smaller TSO
    packets can help reduce the latency of high-priority packets (see the
    sketch after the notes below).

    This means we divert sock_wfree() to a tcp_wfree() handler, to
    queue/send following frames when skb_orphan() [2] is called for the
    already queued skbs.

    Results on my dev machines (tg3/ixgbe nics) are really impressive,
    using standard pfifo_fast, and with or without TSO/GSO.

    Without reduction of nominal bandwidth, we have reduction of buffering
    per bulk sender :
    < 1ms on Gbit (instead of 50ms with TSO)
    < 8ms on 100Mbit (instead of 132 ms)

    I no longer have 4 MBytes backlogged in the qdisc by a single netperf
    session, and socket autotuning on both sides no longer uses 4 MBytes.

    As the skb destructor cannot restart transmission itself (the qdisc
    lock might be held at this point), we delegate the work to a tasklet,
    one tasklet per cpu for performance reasons.

    If the tasklet finds a socket owned by the user, it sets the
    TSQ_OWNED flag. This flag is tested in a new protocol method called
    from release_sock(), to eventually send new segments.

    [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
    [2] skb_orphan() is usually called at TX completion time,
    but some drivers call it in their start_xmit() handler.
    These drivers should at least use BQL, or else a single TCP
    session can still fill the whole NIC TX ring, since TSQ will
    have no effect.
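
    A trivial sketch of the sizing rule mentioned above (illustrative
    names):

        /* Cap each TSO packet at half the per-socket limit so two
         * packets fit in flight; e.g. a 40000-byte limit caps TSO at
         * 20000 instead of the 65536 default. */
        static unsigned int tso_size_cap(unsigned int limit_output_bytes,
                                         unsigned int gso_max_size)
        {
                unsigned int cap = limit_output_bytes / 2;

                return cap < gso_max_size ? cap : gso_max_size;
        }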

    Signed-off-by: Eric Dumazet
    Cc: Dave Taht
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jul, 2012

4 commits


28 Jun, 2012

3 commits

  • It's completely unnecessary.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7.

    This change has several unwanted side effects:

    1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
    thus never create a real cached route.

    2) All TCP traffic will use DST_NOCACHE and never use the routing
    cache at all.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • DDoS synflood attacks hit the IP route cache badly.

    On typical machines, this cache is allowed to hold up to 8 million
    dst entries, at 256 bytes each, for a total of 2GB of memory.

    rt_garbage_collect() triggers and tries to cleanup things.

    Eventually route cache is disabled but machine is under fire and might
    OOM and crash.

    This patch exploits the new TCP early demux to set a nocache
    boolean in case the incoming TCP frame is for a not-yet-ESTABLISHED
    or TIMEWAIT socket.

    This 'nocache' boolean is then used, in case the dst entry is not
    found in the route cache, to create an unhashed dst entry
    (DST_NOCACHE).

    SYN-cookie ACKs sent use a similar mechanism (ipv4: tcp: dont cache
    output dst for syncookies), so after this patch, a machine is able to
    absorb a DDoS synflood attack without polluting its IP route cache.

    Signed-off-by: Eric Dumazet
    Cc: Hans Schillstrom
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jun, 2012

1 commit

  • Input packet processing for local sockets involves two major demuxes.
    One for the route and one for the socket.

    But we can optimize this down to one demux for certain kinds of local
    sockets.

    Currently we only do this for established TCP sockets, but it could
    at least in theory be expanded to other kinds of connections.

    If a TCP socket is established then its identity is fully specified.

    This means that whatever input route was used during the three-way
    handshake must work equally well for the rest of the connection since
    the keys will not change.

    Once we move to established state, we cache the receive packet's input
    route to use later.

    Like the existing cached route in sk->sk_dst_cache used for output
    packets, we have to check for route invalidations using dst->obsolete
    and dst->ops->check().

    Early demux occurs outside of a socket locked section, so when a route
    invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
    actually inside of established state packet processing and thus have
    the socket locked.
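
    A compilable sketch of that validity check (stand-in types; the
    kernel checks dst->obsolete and dst->ops->check() on its own types):

        #include <stddef.h>
        #include <stdint.h>

        struct dst_entry;
        struct dst_ops {
                struct dst_entry *(*check)(struct dst_entry *dst,
                                           uint32_t cookie);
        };
        struct dst_entry {
                int obsolete;
                struct dst_ops *ops;
        };

        /* Before reusing the cached input route, re-validate it just
         * like the cached output route in sk_dst_cache. */
        static struct dst_entry *rx_dst_check(struct dst_entry *dst,
                                              uint32_t cookie)
        {
                if (dst && dst->obsolete && !dst->ops->check(dst, cookie))
                        return NULL; /* invalidated: do a full lookup */
                return dst;
        }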

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jun, 2012

1 commit


09 Jun, 2012

1 commit

  • The get_peer method TCP uses is full of special cases that make no
    sense to accommodate, and it also gets in the way of doing more
    reasonable things here.

    First of all, if the socket doesn't have a usable cached route, there
    is no sense in trying to optimize timewait recycling.

    Likewise for the case where we have IP options, such as SRR enabled,
    that make the IP header destination address (and thus the destination
    address of the route key) differ from that of the connection's
    destination address.

    Just return a NULL peer in these cases, and thus we're also able to
    get rid of the clumsy inetpeer release logic.

    Signed-off-by: David S. Miller

    David S. Miller
     

18 May, 2012

1 commit


11 May, 2012

1 commit


05 May, 2012

1 commit

  • It appears some networks play bad games with the two bits reserved
    for ECN. This can trigger false congestion notifications and very
    slow transfers.

    Since RFC 3168 (6.1.1) forbids SYN packets to carry ECT bits, we can
    disable TCP ECN negotiation if it happens we receive mangled ECT bits
    in the SYN packet.
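
    A standalone sketch of the resulting check (illustrative;
    INET_ECN_MASK covers the two ECN bits of the TOS byte):

        #include <stdbool.h>
        #include <stdint.h>

        #define INET_ECN_MASK 0x03 /* the two ECN bits */

        /* A SYN arriving with either ECN bit already set suggests a
         * mangling middlebox: skip ECN negotiation for this connection. */
        static bool syn_ecn_ok(uint8_t tos)
        {
                return (tos & INET_ECN_MASK) == 0;
        }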

    Signed-off-by: Eric Dumazet
    Cc: Perry Lorier
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Wilmer van der Gaast
    Cc: Ankur Jain
    Cc: Tom Herbert
    Cc: Dave Täht
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 May, 2012

3 commits

  • Extend tcp coalescing by implementing it from tcp_queue_rcv(), the
    main receive function when the application is not blocked in
    recvmsg().

    Function tcp_queue_rcv() is moved a bit to allow it to be called from
    tcp_data_queue().

    This gives good results especially if GRO could not kick in, and if
    the skb head is a fragment.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Implement the advanced early retransmit (sysctl_tcp_early_retrans==2),
    which delays the fast retransmit by an interval of RTT/4. We borrow
    the RTO timer to implement the delay. If we receive another ACK or
    send a new packet, the timer is cancelled and restored to the
    original RTO value, offset by the time elapsed. When the delayed-ER
    timer fires, we enter fast recovery and perform fast retransmit.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements RFC 5827 early retransmit (ER) for TCP.
    It reduces the DUPACK threshold (dupthresh) when fewer than 4 packets
    are outstanding, so losses are recovered by fast recovery instead of
    a timeout.

    While the algorithm is simple, small but frequent network reordering
    makes this feature dangerous: the connection repeatedly enters
    false recovery and degrades performance. Therefore we implement
    a mitigation suggested in the appendix of the RFC that delays
    entering fast recovery by a small interval, i.e., RTT/4. Currently
    ER is conservative and is disabled for the rest of the connection
    after the first reordering event. A large-scale web server
    experiment on the performance impact of ER is summarized in
    section 6 of the paper "Proportional Rate Reduction for TCP",
    IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

    Note that Linux has a similar feature called THIN_DUPACK. The
    differences are that THIN_DUPACK does not mitigate reordering and is
    only used after slow start. Currently ER is disabled if THIN_DUPACK
    is enabled. I would be happy to merge the THIN_DUPACK feature with ER
    if people think it's a good idea.

    ER is enabled by sysctl_tcp_early_retrans:
    0: Disables ER

    1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

    2: (Default) reduce dupthresh like mode 1. In addition, delay
    entering fast recovery by RTT/4.

    Note: mode 2 is implemented in the third part of this patch series.
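
    A standalone sketch of the mode-1 dupthresh adjustment (illustrative
    names; the kernel integrates this with its reordering state):

        /* Lower dupthresh when fewer than 4 packets are outstanding,
         * so fast recovery can still trigger; default_dupthresh is
         * normally 3. */
        static int er_dupthresh(int packets_out, int default_dupthresh)
        {
                if (packets_out > 1 && packets_out < 4)
                        return packets_out - 1;
                return default_dupthresh;
        }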

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

27 Apr, 2012

1 commit

  • Quoting Tore Anderson from
    https://bugzilla.kernel.org/show_bug.cgi?id=42572:

    When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
    size does not take into account the size of the IPv6 Fragmentation
    header that needs to be included in outbound packets, causing every
    transmitted TCP segment to be fragmented across two IPv6 packets, the
    latter of which will only contain 8 bytes of actual payload.

    RTAX_FEATURE_ALLFRAG is typically set on a route in response to
    receiving an ICMPv6 Packet Too Big message indicating a Path MTU of
    less than 1280 bytes. 1280 bytes is the minimum IPv6 MTU; however,
    ICMPv6 PTBs with MTU < 1280 are still valid, in particular when an
    IPv6 packet is sent to an IPv4 destination through a stateless
    translator.
    Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
    the path will be translated to ICMPv6 PTB which may then indicate an
    MTU of less than 1280.

    The Linux kernel refuses to reduce the effective MTU to anything below
    1280 bytes, instead it sets it to exactly 1280 bytes, and
    RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
    to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
    instead of 1232 (additionally taking into account the 8 bytes required
    by the IPv6 Fragmentation extension header).

    This in turn results in rather inefficient transmission, as every
    transmitted TCP segment now is split in two fragments containing
    1232+8 bytes of payload.

    After this patch, all outgoing packets that include a Fragmentation
    header are "atomic" or "non-fragmented" fragments, i.e., they have
    both Offset=0 and More Fragments=0.

    With help from David S. Miller

    Reported-by: Tore Anderson
    Signed-off-by: Eric Dumazet
    Cc: Maciej Żenczykowski
    Cc: Tom Herbert
    Tested-by: Tore Anderson
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Apr, 2012

3 commits

  • This commit moves the (substantial) common code shared between
    tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
    independent function, tcp_init_sock().

    Centralizing this functionality should help avoid drift issues,
    e.g. where the IPv4 side is updated without a corresponding update to
    IPv6. There was already some drift: IPv4 initialized snd_cwnd to
    TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
    2 (in this case it should not matter, since snd_cwnd is also
    initialized in tcp_init_metrics(), but the general risks and
    maintenance overhead remain).

    When diffing the old and new code, note that the new tcp_init_sock()
    function uses the order of steps from the tcp_v4_init_sock()
    implementation (the order is slightly different in
    tcp_v6_init_sock()).

    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • This includes (according to the previous description):

    * TCP_REPAIR sockoption

    This one just puts the socket in/out of the repair mode.
    Allowed for CAP_NET_ADMIN and for closed/established sockets only.
    When repair mode is turned off and the socket happens to be in
    the established state, a window probe is sent to the peer to
    'unlock' the connection.

    * TCP_REPAIR_QUEUE sockoption

    This one sets the queue which we're about to repair. The
    'no-queue' is set by default.

    * TCP_QUEUE_SEQ sockoption

    Sets the write_seq/rcv_nxt of a selected repaired queue.
    Allowed for TCP_CLOSE-d sockets only. When the socket changes
    its state, the other seqs are changed by the kernel according
    to the protocol rules (most of the existing code is actually
    reused).

    * Ability to forcibly bind a socket to a port

    The sk->sk_reuse is set to SK_FORCE_REUSE.

    * Immediate connect modification

    The connect syscall initializes the connection, then directly jumps
    to the code which finalizes it.

    * Silent close modification

    The close just aborts the connection (similar to SO_LINGER with 0
    timeout) but without sending any FIN/RST-s to the peer; see the usage
    sketch below.
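
    A hypothetical userspace sketch of driving these options (constants
    follow this series' uapi values, defined locally in case headers
    predate them):

        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <stdio.h>

        #ifndef TCP_REPAIR
        #define TCP_REPAIR       19
        #define TCP_REPAIR_QUEUE 20
        #define TCP_QUEUE_SEQ    21
        #endif
        #ifndef TCP_SEND_QUEUE
        #define TCP_SEND_QUEUE   2
        #endif

        int main(void)
        {
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                int on = 1, queue = TCP_SEND_QUEUE;
                unsigned int seq = 1000000; /* arbitrary example value */

                /* Enter repair mode (CAP_NET_ADMIN; closed socket). */
                if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on,
                               sizeof(on)) < 0)
                        perror("TCP_REPAIR");
                /* Pick the send queue, then restore its sequence. */
                setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue,
                           sizeof(queue));
                setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq,
                           sizeof(seq));
                return 0;
        }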

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This is just a preparation patch, which makes the code needed for
    TCP repair ready for use.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

17 Apr, 2012

1 commit

  • Commit b82d1bb4 inadvertently placed unrelated new code between
    TCPCB_EVER_RETRANS and TCPCB_RETRANS and the other macros that refer
    to the sacked field in struct tcp_skb_cb (probably because there
    was a misleading empty line there). This commit fixes up the
    formatting so that all macros related to the sacked field are
    adjacent again.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

16 Apr, 2012

1 commit