30 Aug, 2013

2 commits

  • After hearing many people complain over the past years about TSO
    being bursty or even buggy, we are proud to present automatic sizing
    of TSO packets.

    One part of the problem is that tcp_tso_should_defer() uses a
    heuristic relying on upcoming ACKs instead of a timer; more generally,
    big TSO packets make little sense at low rates, as they tend to create
    micro-bursts on the network, and the general consensus is to reduce
    the amount of buffering.

    This patch introduces a per socket sk_pacing_rate, that approximates
    the current sending rate, and allows us to size the TSO packets so
    that we try to send one packet every ms.

    This field could be set by other transports.

    Patch has no impact for high speed flows, where having large TSO packets
    makes sense to reach line rate.

    For other flows, this helps better packet scheduling and ACK clocking.

    This patch increases performance of TCP flows in lossy environments.

    A new sysctl (tcp_min_tso_segs) is added, to specify the
    minimal size of a TSO packet (default being 2).

    A follow-up patch will provide a new packet scheduler (FQ), using
    sk_pacing_rate as an input to perform optional per flow pacing.

    This explains why we chose to set sk_pacing_rate to twice the current
    rate, allowing 'slow start' ramp up.

    sk_pacing_rate = 2 * cwnd * mss / srtt
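
    A hedged illustration of that computation (not this patch's kernel
    code; function name and units are assumptions):

    #include <stdint.h>

    /* Sketch: pacing rate in bytes/sec from cwnd (packets), mss (bytes)
     * and a smoothed RTT in microseconds. Illustrative only. */
    static uint64_t pacing_rate_bytes_per_sec(uint32_t cwnd, uint32_t mss,
                                              uint32_t srtt_us)
    {
        if (srtt_us == 0)
            return 0;       /* no RTT sample yet */
        return (2ULL * cwnd * mss * 1000000ULL) / srtt_us;
    }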

    v2: Neal Cardwell reported a suspicious deferral of the last two
    segments on an initial write of 10 MSS; I had to change
    tcp_tso_should_defer() to take tp->xmit_size_goal_segs into account.

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Van Jacobson
    Cc: Tom Herbert
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch implements RFC 6980: drop fragmented ndisc packets by
    default. If a fragmented ndisc packet is received, the user is
    informed that it is possible to disable the check.

    Cc: Fernando Gont
    Cc: YOSHIFUJI Hideaki
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

14 Aug, 2013

1 commit

  • Commit cab70040dfd95ee32144f02fade64f0cb94f31a0 ("net: igmp:
    Reduce Unsolicited report interval to 1s when using IGMPv3") and
    2690048c01f32bf45d1c1e1ab3079bc10ad2aea7 ("net: igmp: Allow user-space
    configuration of igmp unsolicited report interval") by William Manley
    made IGMP unsolicited report intervals configurable per interface and
    corrected the resend interval of unsolicited IGMPv3 report messages
    to 1s.

    The same needs to be done for IPv6:

    MLDv1 (RFC2710 7.10.): 10 seconds
    MLDv2 (RFC3810 9.11.): 1 second

    Both intervals are configurable via new procfs knobs
    mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval.

    (also added .force_mld_version to ipv6_devconf_dflt to bring structs in
    line without semantic changes)
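
    A hedged userspace sketch of using one of these knobs (the path
    follows the usual per-interface ipv6 conf layout; treating the value
    as milliseconds is an assumption here):

    #include <stdio.h>

    /* Sketch: set the MLDv2 unsolicited report interval for one
     * interface. Path layout and ms units are assumptions. */
    static int set_mldv2_interval(const char *ifname, int msecs)
    {
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/proc/sys/net/ipv6/conf/%s/mldv2_unsolicited_report_interval",
                 ifname);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", msecs);
        return fclose(f);
    }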

    v2:
    a) Joined documentation update for IPv4 and IPv6 MLD/IGMP
    unsolicited_report_interval procfs knobs.
    b) incorporated stylistic feedback from William Manley

    v3:
    a) add new DEVCONF_* values to the end of the enum (thanks to David
    Miller)

    Cc: Cong Wang
    Cc: William Manley
    Cc: Benjamin LaHaise
    Cc: YOSHIFUJI Hideaki
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

25 Jul, 2013

1 commit

  • The idea of this patch is to add an optional limit on the number of
    unsent bytes in TCP sockets, to reduce kernel memory usage.

    A TCP receiver might announce a big window, and TCP sender autotuning
    might allow a large number of bytes in the write queue, but this has
    little performance benefit if a large part of this buffering is
    wasted:

    The write queue needs to be large only to deal with a large BDP, not
    necessarily to cope with scheduling delays (incoming ACKs make room
    for the application to queue more bytes).

    For most workloads, a value of 128 KB or less gives applications
    enough time to react to POLLOUT events in time (or to be woken up in
    a blocking sendmsg()).

    This patch adds two ways to set the limit :

    1) Per socket option TCP_NOTSENT_LOWAT

    2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets not
    using the TCP_NOTSENT_LOWAT socket option (or setting a zero value).
    The default value is UINT_MAX (0xFFFFFFFF), meaning the limit has no
    effect.

    This changes poll()/select()/epoll() to report POLLOUT only if the
    number of unsent bytes is below tp->notsent_lowat.

    Note this might increase the number of sendmsg()/sendfile() calls
    when using non-blocking sockets, and the number of context switches
    for blocking sockets.

    Note this is not related to SO_SNDLOWAT (SO_SNDLOWAT is defined as:
    specify the minimum number of bytes in the buffer until the socket
    layer will pass the data to the protocol).
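
    A hedged usage sketch of the per-socket option (value illustrative;
    the #define fallback assumes the uapi value from this patch):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCP_NOTSENT_LOWAT
    #define TCP_NOTSENT_LOWAT 25    /* assumed uapi value */
    #endif

    /* Sketch: cap unsent bytes at 128 KB on one socket. */
    static int set_notsent_lowat(int fd)
    {
        int lowat = 128 * 1024;

        return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                          &lowat, sizeof(lowat));
    }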

    Tested:

    netperf sessions, and watching /proc/net/protocols "memory" column for TCP

    With 200 concurrent netperf -t TCP_STREAM sessions, the amount of
    kernel memory used by TCP buffers shrinks by ~55% (20567 pages
    instead of 45458):

    lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
    TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
    TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y

    lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
    TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
    TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y

    Using 128KB has no bad effect on the throughput or cpu usage of a
    single flow, although there is an increase in context switches.

    A bonus is that we hold the socket lock for a shorter amount of
    time, which should improve latencies of ACK processing.

    lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
    OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
    Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
    Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
    Size Size Size (sec) Util Util Util Util Demand Demand Units
    Final Final % Method % Method
    1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB

    Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

    412,514 context-switches

    200.034645535 seconds time elapsed

    lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
    OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
    Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
    Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
    Size Size Size (sec) Util Util Util Util Demand Demand Units
    Final Final % Method % Method
    1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB

    Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

    2,675,818 context-switches

    200.029651391 seconds time elapsed

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-By: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Jul, 2013

1 commit

  • Conflicts:
    drivers/net/ethernet/freescale/fec_main.c
    drivers/net/ethernet/renesas/sh_eth.c
    net/ipv4/gre.c

    The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
    and the splitting of the gre.c code into separate files.

    The FEC conflict was two sets of changes adding ethtool support code
    in an "!CONFIG_M5272" CPP protected block.

    Finally, the sh_eth.c conflict was between one commit adding bits to
    the .eesr_err_check mask while another commit removed the
    .tx_error_check member and its assignments.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 Jun, 2013

1 commit

  • commit 6648bd7e0e62c0c8c03b (ipv4: Add sysctl knob to control early
    socket demux) introduced such a sysctl, but forgot to add
    documentation to Documentation/networking/ip-sysctl.txt. This patch
    adds it.

    Basically I grabbed the doc from the description of commit
    41063e9dd11956f2d285 (ipv4: Early TCP socket demux.) and the above
    commit.

    Cc: Eric Dumazet
    Cc: Alexander Duyck
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

21 Mar, 2013

2 commits

  • This patch implements F-RTO (forward RTO recovery):

    When the first retransmission after timeout is acknowledged, F-RTO
    sends new data instead of old data. If the next ACK acknowledges
    some never-retransmitted data, then the timeout was spurious and the
    congestion state is reverted. Otherwise if the next ACK selectively
    acknowledges the new data, then the timeout was genuine and the
    loss recovery continues. This idea applies to recurring timeouts
    as well. While F-RTO sends different data during timeout recovery,
    it does not (and should not) change the congestion control.
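
    A hedged sketch of that per-ACK decision (names illustrative, not
    the kernel's):

    /* Sketch: outcome of the ACK following the post-timeout probe. */
    enum frto_result {
        FRTO_UNDO,      /* spurious timeout: revert congestion state */
        FRTO_LOSS,      /* genuine loss: continue loss recovery */
        FRTO_PROBE,     /* undecided: wait for the next ACK */
    };

    static enum frto_result frto_next_ack(int acks_never_retrans_data,
                                          int sacks_new_data)
    {
        if (acks_never_retrans_data)
            return FRTO_UNDO;
        if (sacks_new_data)
            return FRTO_LOSS;
        return FRTO_PROBE;
    }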

    The implementation follows the three steps of the SACK-enhanced
    algorithm (section 3) in RFC 5682. Step 1 is in tcp_enter_loss();
    steps 2 and 3 are in tcp_process_loss(). The basic version is not
    supported, because the SACK-enhanced version also works for non-SACK
    connections.

    The new implementation is functionally at parity with the old F-RTO
    implementation, except for one case where it increases undo events:
    in addition to the RFC algorithm, a spurious timeout may be detected
    without sending data in step 2, as long as the SACK confirms that
    not all the original data are dropped. When this happens, the sender
    will undo the cwnd and perhaps enter fast recovery instead. This
    additional check increases F-RTO undo events by 5x compared to the
    prior implementation on Google Web servers, since the sender often
    does not have new data to send for HTTP.

    Note that F-RTO may detect a spurious timeout before Eifel with
    timestamps does so.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch series refactors the F-RTO feature (RFC4138/5682).

    This is to simplify the loss recovery processing. Existing F-RTO
    was developed during the experimental stage (RFC4138) and has
    many experimental features. It takes a separate code path from
    the traditional timeout processing by overloading CA_Disorder
    instead of using CA_Loss state. This complicates CA_Disorder state
    handling because it's also used for handling dubious ACKs and undos.
    While the algorithm in the RFC does not change the congestion control,
    the implementation intercepts congestion control in various places
    (e.g., frto_cwnd in tcp_ack()).

    The new code implements the newer F-RTO RFC 5682 using the CA_Loss
    processing path. F-RTO becomes a small extension in the timeout
    processing and interfaces with the congestion control and Eifel undo
    modules. It lets the congestion control (module) determine how much
    to send independently. F-RTO only chooses what to send in order to
    detect spurious retransmission. If the timeout is found spurious, it
    invokes existing Eifel undo algorithms such as DSACK or TCP
    timestamp based detection.

    The first patch removes all F-RTO code except sysctl_tcp_frto, which
    is left for the new implementation. Since CA_EVENT_FRTO is removed,
    TCP Westwood now computes ssthresh on the regular timeout
    CA_EVENT_LOSS event.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 Mar, 2013

1 commit

  • TCPCT uses option number 253, which is reserved for experimental use
    and should not be used in production environments. Further, TCPCT
    does not fully implement RFC 6013.

    As a nice side-effect, removing TCPCT increases TCP's performance for
    very short flows:

    An apache-benchmark run with -c 100 -n 100000, sending HTTP requests
    for files of 1KB size:

    before this patch:
    average (among 7 runs) of 20845.5 Requests/Second
    after:
    average (among 7 runs) of 21403.6 Requests/Second

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

12 Mar, 2013

1 commit

  • This patch series implements the Tail loss probe (TLP) algorithm
    described in
    http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01.
    The first patch implements the basic algorithm.

    TLP's goal is to reduce tail latency of short transactions. It
    achieves this by converting retransmission timeouts (RTOs) occurring
    due to tail losses (losses at the end of transactions) into fast
    recovery. TLP transmits one packet in two round-trips when a
    connection is in Open state and isn't receiving any ACKs. The
    transmitted packet, aka loss probe, can be either new or a
    retransmission. When there is tail loss, the ACK from a loss probe
    triggers FACK/early-retransmit based fast recovery, thus avoiding a
    costly RTO. In the absence of loss, there is no change in the
    connection state.

    PTO stands for probe timeout. It is a timer event indicating that an
    ACK is overdue, and it triggers a loss probe packet. The PTO value is
    set to max(2*SRTT, 10ms) and is adjusted to account for the delayed
    ACK timer when there is only one outstanding packet.
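
    A hedged sketch of that computation in milliseconds (illustrative,
    not the kernel code):

    /* Sketch of the PTO rule above. */
    static unsigned int tlp_pto_ms(unsigned int srtt_ms, unsigned int rto_ms,
                                   unsigned int packets_out)
    {
        unsigned int pto = 2 * srtt_ms;

        if (packets_out > 1) {
            if (pto < 10)
                pto = 10;                   /* max(2*SRTT, 10ms) */
        } else {
            unsigned int dack = srtt_ms + srtt_ms / 2 + 200;
            if (pto < dack)
                pto = dack;                 /* max(2*RTT, 1.5*RTT + 200ms) */
        }
        return pto < rto_ms ? pto : rto_ms; /* PTO = min(PTO, RTO) */
    }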

    TLP Algorithm

    On transmission of new data in Open state:
    -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
    -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
    -> PTO = min(PTO, RTO)

    Conditions for scheduling PTO:
    -> Connection is in Open state.
    -> Connection is either cwnd limited or no new data to send.
    -> Number of probes per tail loss episode is limited to one.
    -> Connection is SACK enabled.

    When PTO fires:
    new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

    no_new_packet:
    -> retransmit the last segment.
    Its ACK triggers FACK or early retransmit based recovery.

    ACK path:
    -> rearm RTO at start of ACK processing.
    -> reschedule PTO if need be.

    In addition, the patch includes a small variation to the Early
    Retransmit (ER) algorithm, such that ER and TLP together can in
    principle recover any N-degree of tail loss through fast recovery.
    TLP is controlled by the same sysctl as ER, the tcp_early_retrans
    sysctl:
    tcp_early_retrans==0; disables TLP and ER.
    ==1; enables RFC5827 ER.
    ==2; delayed ER.
    ==3; TLP and delayed ER. [DEFAULT]
    ==4; TLP only.

    The TLP patch series has been extensively tested on Google Web
    servers. It is most effective for short Web transactions, where it
    reduced RTOs by 15% and improved HTTP response time (average by 6%,
    99th percentile by 10%). The transmitted probes account for <0.5% of
    all transmissions.

    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     

06 Feb, 2013

1 commit

  • TCP Appropriate Byte Count was added by me, but later disabled.
    There is no point in maintaining it, since it is a potential source
    of bugs and Linux already implements other, better window-protection
    heuristics.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

23 Jan, 2013

1 commit

  • Since we have removed the NCE (Neighbour Cache Entry) reference from
    routing entries, the only refcnt holders of an NCE are, in the usual
    case, its timer (if running) and its owner table. As a result,
    neigh_periodic_work() purges NCEs over and over again, even for
    gateways.

    It does not make sense to purge entries if their number is very
    small, so keep them. The minimum number of entries to keep is
    specified by gc_thresh1.

    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    YOSHIFUJI Hideaki / 吉藤英明
     

11 Jan, 2013

1 commit

  • A recent commit (7e3a2dc52953 "doc: make the description of how
    tcp_ecn works more explicit and clear") clarified the behavior of the
    tcp_ecn sysctl variable, but the description is inconsistent: when
    requested by incoming connections, ECN is enabled not just with
    tcp_ecn = 2 but also with tcp_ecn = 1.

    This patch makes it clear that with tcp_ecn = 1, ECN is enabled when
    requested by incoming connections.

    Also fix spelling of 'incoming'.

    Signed-off-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Vijay Subramanian
     

11 Dec, 2012

1 commit

  • The description for tcp_fin_timeout should be tighter and clearer.

    In addition to being tighter, we should make the spelling of the
    state name consistent with what utilities report, remove the
    now-dated reference to 2.2, and put the default in the consistent
    place.

    Signed-off-by: Rick Jones
    Signed-off-by: David S. Miller

    Rick Jones
     

26 Oct, 2012

1 commit

  • Currently sctp allows for the optional use of the md5 or sha1 hmac
    algorithms to generate cookie values when establishing new
    connections, via two build-time config options. There's no real
    reason to make this a static selection. We can add a sysctl that
    allows for the dynamic selection of these algorithms at run time,
    with the default value determined by the availability of the
    corresponding crypto library.

    This comes in handy when, for example, running a system in FIPS
    mode, where use of md5 is disallowed but SHA1 is permitted.

    Note: This new sysctl has no corresponding socket option to select the cookie
    hmac algorithm. I chose not to implement that intentionally, as RFC 6458
    contains no option for this value, and I opted not to pollute the socket option
    namespace.
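
    A hedged userspace sketch of flipping the new sysctl (the procfs
    path and the accepted strings are assumptions here):

    #include <stdio.h>

    /* Sketch: select the cookie hmac algorithm at run time, e.g.
     * set_cookie_hmac("sha1"). Path and values are assumptions. */
    static int set_cookie_hmac(const char *alg)
    {
        FILE *f = fopen("/proc/sys/net/sctp/cookie_hmac_alg", "w");

        if (!f)
            return -1;
        fprintf(f, "%s\n", alg);
        return fclose(f);
    }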

    Change notes:
    v2)
    * Updated subject to have the proper sctp prefix as per Dave M.
    * Replaced default selection options with new options that allow
    developers to explicitly select available hmac algs at build time,
    as per suggestion by Vlad Y.

    Signed-off-by: Neil Horman
    CC: Vlad Yasevich
    CC: "David S. Miller"
    CC: netdev@vger.kernel.org
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Neil Horman
     

01 Sep, 2012

2 commits

  • This patch adds all the necessary data structures and support
    functions to implement TFO server side. It also documents a number
    of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
    extension MIBs.

    In addition, it includes the following:

    1. A new TCP_FASTOPEN socket option that an application must set,
    supplying the max backlog allowed, in order to enable TFO on its
    listener (see the sketch after this list).

    2. A number of key data structures:
    "fastopen_rsk" in tcp_sock - for a big socket to access its
    request_sock for retransmission and ACK processing purposes. It is
    non-NULL iff the 3WHS is not completed.

    "fastopenq" in request_sock_queue - points to a per Fast Open
    listener data structure "fastopen_queue" to keep track of qlen (# of
    outstanding Fast Open requests) and max_qlen, among other things.

    "listener" in tcp_request_sock - to point to the original listener
    for book-keeping purposes, i.e., to maintain qlen against max_qlen
    as part of the defense against IP spoofing attacks.

    3. Various data structures and functions, many in tcp_fastopen.c, to
    support server-side Fast Open cookie operations, including
    /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
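
    A hedged sketch of item 1 (the #define fallback assumes the uapi
    value of this era; the backlog value is up to the application):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCP_FASTOPEN
    #define TCP_FASTOPEN 23         /* assumed uapi value */
    #endif

    /* Sketch: enable TFO on a listener; qlen caps outstanding Fast
     * Open requests (the max_qlen described above). */
    static int enable_tfo_listener(int listen_fd, int qlen)
    {
        return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                          &qlen, sizeof(qlen));
    }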

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
  • Commit 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS for
    the passive open side") changed the initRTO from 3secs to 1sec in
    accordance with RFC6298 (formerly RFC2988bis). This reduced the time
    until the last SYN retransmission packet gets sent from 93secs to
    31secs.

    RFC1122 states that the retransmission should be done for at least 3
    minutes, but this seems to be quite high.

    "However, the values of R1 and R2 may be different for SYN
    and data segments. In particular, R2 for a SYN segment MUST
    be set large enough to provide retransmission of the segment
    for at least 3 minutes. The application can close the
    connection (i.e., give up on the open attempt) sooner, of
    course."

    This patch increases TCP_SYN_RETRIES to 6, providing a
    retransmission window of 63secs (with a 1sec initRTO and exponential
    backoff: 1 + 2 + 4 + 8 + 16 + 32 = 63).

    The comments for SYN and SYNACK retries have also been updated to
    describe the current settings. The same goes for the documentation file
    "Documentation/networking/ip-sysctl.txt".

    Signed-off-by: Alexander Bergmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alex Bergmann
     

23 Jul, 2012

1 commit

  • I've seen several attempts recently made to do quick failover of
    sctp transports by reducing various retransmit timers and counters.
    While it's possible to implement a faster failover on multihomed
    sctp associations, it's not particularly robust, in that it can lead
    to unneeded retransmits as well as false connection failures due to
    intermittent latency on a network.

    Instead, let's implement the new ietf quick failover draft found
    here:
    http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05

    This will let the sctp stack identify transports that have had a
    small number of errors and quickly avoid using them until their
    reliability can be re-established. I've tested this out on two virt
    guests connected via multiple isolated virt networks and believe
    it's in compliance with the above draft and works well.
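
    A hedged sketch of the resulting transport states (threshold names
    follow the draft's "potentially failed" idea; illustrative only):

    /* Sketch: classify a transport by its error count. */
    enum sctp_path_state { PATH_ACTIVE, PATH_PF, PATH_FAILED };

    static enum sctp_path_state classify_path(unsigned int error_count,
                                              unsigned int pf_retrans,
                                              unsigned int path_max_retrans)
    {
        if (error_count > path_max_retrans)
            return PATH_FAILED;   /* traditional failure threshold */
        if (error_count > pf_retrans)
            return PATH_PF;       /* potentially failed: prefer others */
        return PATH_ACTIVE;
    }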

    Signed-off-by: Neil Horman
    CC: Vlad Yasevich
    CC: Sridhar Samudrala
    CC: "David S. Miller"
    CC: linux-sctp@vger.kernel.org
    CC: joe@perches.com
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Neil Horman
     

20 Jul, 2012

2 commits

  • In trusted networks, e.g., an intranet or data-center, the client
    does not need to use the Fast Open cookie to mitigate DoS attacks.
    In cookie-less mode, sendmsg() with the MSG_FASTOPEN flag will send
    SYN-data regardless of cookie availability.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
    and write(2). The application should replace connect() with it to
    send data in the opening SYN packet.

    For a blocking socket, sendmsg() blocks until all the data are
    buffered locally and the handshake is completed, like a connect()
    call. It returns a similar errno to connect() if the TCP handshake
    fails.

    For a non-blocking socket, it returns the number of bytes queued
    (and transmitted in the SYN-data packet) if a cookie is available.
    If a cookie is not available, it transmits a data-less SYN packet
    with the Fast Open cookie request option and returns -EINPROGRESS
    like connect().

    Using MSG_FASTOPEN on a connecting or connected socket will result
    in a similar errno to repeated connect() calls. Therefore the
    application should only use this flag on new sockets.

    The buffer size of sendmsg() is independent of the MSS of the connection.
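
    A hedged client-side sketch (destination setup omitted; the #define
    fallback assumes the uapi value of this era):

    #include <sys/socket.h>
    #include <sys/types.h>

    #ifndef MSG_FASTOPEN
    #define MSG_FASTOPEN 0x20000000  /* assumed uapi value */
    #endif

    /* Sketch: replace connect()+write() with one sendto(); the data
     * may ride in the SYN if a Fast Open cookie is cached. */
    static ssize_t tfo_connect_send(int fd, const void *buf, size_t len,
                                    const struct sockaddr *dst,
                                    socklen_t dst_len)
    {
        return sendto(fd, buf, len, MSG_FASTOPEN, dst, dst_len);
    }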

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

17 Jul, 2012

1 commit

  • Implement the RFC 5961 mitigation against the Blind Reset attack
    using the RST bit.

    The idea is to validate the incoming RST sequence to match the
    RCV.NXT value exactly, instead of the previously accepted window:
    (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)

    If the sequence is in the window but not an exact match, send a
    "challenge ACK", so that the other party can resend an RST with the
    appropriate sequence.
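
    A hedged sketch of that validation (sequence arithmetic simplified
    to unsigned wraparound; illustrative, not the kernel code):

    /* Sketch: decide what to do with an incoming RST. */
    enum rst_verdict { RST_DROP, RST_ACCEPT, RST_CHALLENGE };

    static enum rst_verdict check_rst_seq(unsigned int seg_seq,
                                          unsigned int rcv_nxt,
                                          unsigned int rcv_wnd)
    {
        if (seg_seq == rcv_nxt)
            return RST_ACCEPT;      /* exact match: process the reset */
        if (seg_seq - rcv_nxt < rcv_wnd)
            return RST_CHALLENGE;   /* in window: send challenge ACK */
        return RST_DROP;            /* out of window: ignore */
    }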

    Add a new sysctl, tcp_challenge_ack_limit, to limit the number of
    challenge ACKs sent per second.

    Add a new SNMP counter to count the number of challenge ACKs sent.
    (netstat -s | grep TCPChallengeACK)

    Signed-off-by: Eric Dumazet
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2012

1 commit

  • This introduces TSQ (TCP Small Queues).

    TSQ's goal is to reduce the number of TCP packets in xmit queues
    (qdisc & device queues), to reduce RTT and cwnd bias, part of the
    bufferbloat problem.

    sk->sk_wmem_alloc is not allowed to grow above a given limit,
    allowing no more than ~128KB [1] per tcp socket in the qdisc/dev
    layers at a given time.

    TSO packets are sized/capped to half the limit, so that we have two
    TSO packets in flight, allowing better bandwidth use.

    As a side effect, setting the limit to 40000 automatically reduces
    the standard gso max limit (65536) to 40000/2: it can help to reduce
    latencies of high-prio packets, by having smaller TSO packets.

    This means we divert sock_wfree() to a tcp_wfree() handler, to
    queue/send following frames when skb_orphan() [2] is called for the
    already queued skbs.

    Results on my dev machines (tg3/ixgbe nics) are really impressive,
    using standard pfifo_fast, and with or without TSO/GSO.

    Without reduction of nominal bandwidth, we have a reduction of
    buffering per bulk sender:
    < 1ms on Gbit (instead of 50ms with TSO)
    < 8ms on 100Mbit (instead of 132 ms)

    I no longer have 4 MBytes backlogged in qdisc by a single netperf
    session, and socket autotuning on both sides no longer uses 4
    MBytes.

    As the skb destructor cannot restart xmit itself (the qdisc lock
    might be taken at this point), we delegate the work to a tasklet.
    We use one tasklet per cpu for performance reasons.

    If the tasklet finds a socket owned by the user, it sets the
    TSQ_OWNED flag. This flag is tested in a new protocol method called
    from release_sock(), to eventually send new segments.

    [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
    [2] skb_orphan() is usually called at TX completion time,
    but some drivers call it in their start_xmit() handler.
    These drivers should at least use BQL, or else a single TCP
    session can still fill the whole NIC TX ring, since TSQ will
    have no effect.
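
    A hedged sketch of the resulting gate (names illustrative):

    #include <stdbool.h>

    /* Sketch: allow queuing a new skb to the qdisc/device layers only
     * while the socket's in-flight byte budget [1] is not exhausted;
     * tcp_wfree()/the per-cpu tasklet resumes transmission later. */
    static bool tsq_can_queue(unsigned int sk_wmem_alloc,
                              unsigned int limit_output_bytes)
    {
        return sk_wmem_alloc < limit_output_bytes;
    }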

    Signed-off-by: Eric Dumazet
    Cc: Dave Taht
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Jun, 2012

1 commit

  • Routing of 127/8 is traditionally forbidden; we consider packets
    from that address block martian when routing and do not process
    corresponding ARP requests.

    This is a sane default but renders a huge address space practically
    unusable.

    The RFC states that no address within the 127/8 block should ever
    appear on any network anywhere, but it does not forbid the use of
    such addresses outside of the loopback device in particular, for
    example to address a pool of virtual guests behind a load balancer.

    This patch adds a new interface option 'route_localnet'
    enabling routing of the 127/8 address block and processing
    of ARP requests on a specific interface.

    Note that for the feature to work, the default local route
    covering 127/8 dev lo needs to be removed.

    Example:
    $ sysctl -w net.ipv4.conf.eth0.route_localnet=1
    $ ip route del 127.0.0.0/8 dev lo table local
    $ ip addr add 127.1.0.1/16 dev eth0
    $ ip route flush cache

    V2: Fix invalid check to auto flush cache (thanks davem)

    Signed-off-by: Thomas Graf
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Thomas Graf
     

09 May, 2012

1 commit

  • If the net.bridge.bridge-nf-filter-vlan-tagged sysctl is enabled,
    bridge netfilter removes the vlan header temporarily and then feeds
    the packet to ip(6)tables.

    When the new "bridge-nf-pass-vlan-input-device" sysctl is on
    (default off), bridge netfilter will also set the in-interface to
    the vlan interface, if such an interface exists.

    This is needed to make iptables REDIRECT target work with
    "vlan-on-top-of-bridge" setups and to allow use of "iptables -i" to
    match the vlan device name.

    Also update Documentation with current brnf default settings.

    Signed-off-by: Florian Westphal
    Acked-by: Bart De Schuymer
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso