Eric Lee / smarc-fsl-linux-kernel

24 Aug, 2016

1 commit

cebc5cbab net-tcp: retire TFO_SERVER_WO_SOCKOPT2 config

TFO_SERVER_WO_SOCKOPT2 was intended for debugging purposes during
Fast Open development. Remove this config option and also
update/clean-up the documentation of the Fast Open sysctl.

Reported-by: Piotr Jurkiewicz
Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Signed-off-by: Neal Cardwell
Signed-off-by: David S. Miller

Yuchung Cheng
2016-08-24 08:01:01 +0800

30 May, 2016

1 commit

176b346b3 Documentation: ip-sysctl.txt: clarify secure_redirects

Clarify how secure_redirects works. Mention that RFC1122 always applies.

Signed-off-by: Eric Garver
Signed-off-by: David S. Miller

Eric Garver
2016-05-30 13:40:53 +0800

12 Apr, 2016

1 commit

a6db4494d net: ipv4: Consider failed nexthops in multipath routes

Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

                [h2]                   [h3]   15.0.0.5
                 |                      |
                3|                     3|
               [SP1]                  [SP2]--+
                1  2                   1     2
                |  |     /-------------+     |
                |   \   /                    |
                |     X                      |
                |    / \                     |
                |   /   \---------------\    |
                1  2                     1   2
    12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                 4                         4
                  \                       /
                    \                    /
                     \                  /
                      -------|   |-----/
                             1   2
                            [TOR3]
                              3|
                               |
                              [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

root@h1:~# ip ro ls
...
12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
15.0.0.0/16
nexthop via 12.0.0.2 dev swp1 weight 1
nexthop via 12.0.0.3 dev swp1 weight 1
...

If the link between tor3 and tor1 is down, and so is the link between tor1
and tor2, then tor1 is effectively cut off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

root@h1:~# ip neigh show
...
12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
12.0.0.2 dev swp1 FAILED
...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.
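
The selection rule described above can be sketched as follows. This is a simplified illustration, not the kernel code: the `Nexthop` type, `nexthop_usable`, `select_nexthop`, the set of "sane" states, and the hash-based starting point are all assumptions made for the example.

```python
from dataclasses import dataclass

# Neighbor states treated as "sane" in this sketch (an assumption for illustration).
USABLE_STATES = {"REACHABLE", "STALE", "DELAY", "PROBE", "PERMANENT", "NOARP"}


@dataclass(frozen=True)
class Nexthop:
    gateway: str
    dev: str


def nexthop_usable(nh, neigh_table, use_neigh):
    """Decide whether a nexthop may be selected."""
    if not use_neigh:
        return True  # sysctl off: keep the old behaviour, ignore neighbor state
    state = neigh_table.get((nh.gateway, nh.dev))
    if state is None:
        return True  # no neighbor entry: nothing is known, so give it a shot
    return state in USABLE_STATES


def select_nexthop(nexthops, flow_hash, neigh_table, use_neigh):
    """Pick a nexthop for a flow, skipping hops whose neighbor is known to be bad."""
    if not nexthops:
        raise ValueError("route has no nexthops")
    n = len(nexthops)
    for i in range(n):
        nh = nexthops[(flow_hash + i) % n]
        if nexthop_usable(nh, neigh_table, use_neigh):
            return nh
    # Every hop looks bad: fall back to the plain hashed choice.
    return nexthops[flow_hash % n]
```

With the neighbor table from the example (12.0.0.2 FAILED, 12.0.0.3 REACHABLE), a flow that hashes to the 12.0.0.2 hop is steered to 12.0.0.3 when the check is enabled and keeps using 12.0.0.2 when it is disabled.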

Signed-off-by: David Ahern
Reviewed-by: Julian Anastasov
Signed-off-by: David S. Miller

David Ahern
2016-04-12 03:16:13 +0800

22 Mar, 2016

2 commits

537377d3b igmp: Document sysctl_igmp_max_msf

Signed-off-by: Benjamin Poirier
Signed-off-by: David S. Miller

Benjamin Poirier
2016-03-22 10:56:37 +0800

6b226e2f8 net: Fix indentation of the conf/ documentation block

Commit d67ef35fff67 ("clarify documentation for
net.ipv4.igmp_max_memberships") mistakenly indented a block of
documentation such that it now looks like it belongs to a specific sysctl.
Restore that block's original position.

Cc: Jeremy Eder
Signed-off-by: Benjamin Poirier
Signed-off-by: David S. Miller

Benjamin Poirier
2016-03-22 10:56:37 +0800

26 Feb, 2016

1 commit

f1705ec19 net: ipv6: Make address flushing on ifdown optional

Currently, all ipv6 addresses are flushed when the interface is configured
down, including global, static addresses:

$ ip -6 addr show dev eth1
3: eth1: mtu 1500 state UP qlen 1000
inet6 2100:1::2/120 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
valid_lft forever preferred_lft forever
$ ip link set dev eth1 down
$ ip -6 addr show dev eth1
<< nothing; all addresses have been flushed>>

Add a new sysctl to make this behavior optional. The new setting defaults to
flushing all addresses to maintain backwards compatibility. When it is set, global
addresses with no expire times are not flushed on an admin down. The sysctl
is per-interface or system-wide for all interfaces:

$ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1
or
$ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1

Will keep addresses on eth1 on an admin down.

$ ip -6 addr show dev eth1
3: eth1: mtu 1500 state UP qlen 1000
inet6 2100:1::2/120 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
valid_lft forever preferred_lft forever
$ ip link set dev eth1 down
$ ip -6 addr show dev eth1
3: eth1: mtu 1500 state DOWN qlen 1000
inet6 2100:1::2/120 scope global tentative
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative
valid_lft forever preferred_lft forever

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-02-26 10:45:15 +0800

11 Feb, 2016

4 commits

7a02bf892 ipv6: add option to drop unsolicited neighbor advertisements

In certain 802.11 wireless deployments, there will be NA proxies
that use knowledge of the network to correctly answer requests.
To prevent unsolicited advertisements on the shared medium from
being a problem, on such deployments wireless needs to drop them.

Enable this by providing an option called "drop_unsolicited_na".

Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2016-02-11 17:27:36 +0800

abbc30436 ipv6: add option to drop unicast encapsulated in L2 multicast

In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv6 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.

Reviewed-by: Julian Anastasov
Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2016-02-11 17:27:36 +0800

97daf3314 ipv4: add option to drop gratuitous ARP packets

In certain 802.11 wireless deployments, there will be ARP proxies
that use knowledge of the network to correctly answer requests.
To prevent gratuitous ARP frames on the shared medium from being
a problem, on such deployments wireless needs to drop them.

Enable this by providing an option called "drop_gratuitous_arp".

Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2016-02-11 17:27:35 +0800

12b74dfad ipv4: add option to drop unicast encapsulated in L2 multicast

In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv4 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.

Additionally, enabling this option provides compliance with a SHOULD
clause of RFC 1122.

Reviewed-by: Julian Anastasov
Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2016-02-11 17:27:35 +0800

21 Jan, 2016

1 commit

bffae6975 net: change tcp_syn_retries documentation

Documentation should be kept consistent with the code:

static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
#define MAX_TCP_SYNCNT 127

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2016-01-21 10:55:08 +0800

19 Dec, 2015

1 commit

6dd9a14e9 net: Allow accepted sockets to be bound to l3mdev domain

Allow accepted sockets to derive their sk_bound_dev_if setting from the
l3mdev domain in which the packets originated. A sysctl setting is added
to control the behavior which is similar to sk_mark and
sysctl_tcp_fwmark_accept.

This effectively allows a process to have a "VRF-global" listen socket,
with child sockets bound to the VRF device in which the packet originated.
A similar behavior can be achieved using sk_mark, but a solution using marks
is incomplete as it does not handle duplicate addresses in different L3
domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
domain provides a complete solution.

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2015-12-19 03:43:38 +0800

16 Dec, 2015

1 commit

566178f85 net: sctp: dynamically enable or disable pf state

As we all know, the value of pf_retrans >= max_retrans_path can
disable pf state. The variables of pf_retrans and max_retrans_path
can be changed by the userspace application.

Sometimes the user expects to disable pf state while the 2
variables are changed to enable pf state. So it is necessary to
introduce a new variable to disable pf state.

According to the suggestions from Vlad Yasevich, extra1 and extra2
are removed. The initialization of pf_enable is added.

Acked-by: Vlad Yasevich
Signed-off-by: Zhu Yanjun
Acked-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

Zhu Yanjun
2015-12-16 23:56:50 +0800

10 Nov, 2015

1 commit

821b41440 net: Documentation: Fix default value tcp_limit_output_bytes

Commit c39c4c6abb89 ("tcp: double default TSQ output bytes limit")
updated the default value of tcp_limit_output_bytes.

Signed-off-by: Niklas Cassel
Signed-off-by: David S. Miller

Niklas Cassel
2015-11-10 01:17:34 +0800

30 Oct, 2015

1 commit

e7b63ff11 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2015-10-30

1) The flow cache is limited by the flow cache limit which
depends on the number of cpus and the xfrm garbage collector
threshold which is independent of the number of cpus. This
leads to the fact that on systems with more than 16 cpus
we hit the xfrm garbage collector limit and refuse new
allocations, so new flows are dropped. On systems with 16
or less cpus, we hit the flowcache limit. In this case, we
shrink the flow cache instead of refusing new flows.

We increase the xfrm garbage collector threshold to INT_MAX
to get the same behaviour, independent of the number of cpus.

2) Fix some unaligned accesses on sparc systems.
From Sowmini Varadhan.

3) Fix some header checks in _decode_session4. We may call
pskb_may_pull with a negative value converted to unsigned
int from pskb_may_pull. This can lead to incorrect policy
lookups. We fix this by a check of the data pointer position
before we call pskb_may_pull.

4) Reload skb header pointers after calling pskb_may_pull
in _decode_session4 as this may change the pointers into
the packet.

5) Add a missing statistic counter on inner mode errors.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-10-30 19:51:56 +0800

21 Oct, 2015

2 commits

4f41b1c58 tcp: use RACK to detect losses

This patch implements the second half of RACK that uses the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if a not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.
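
A minimal sketch of the check described above, under stated assumptions: packets are plain `(seq, sent_us, sacked)` tuples, times are integer microseconds, and `rack_mark_lost` is a hypothetical helper rather than the kernel's tcp_rack_mark_lost(). The comparison uses "at least reo_wnd", following the wording of the commit message.

```python
def rack_mark_lost(packets, rack_xmit_us, reo_wnd_us):
    """Return sequence numbers deemed lost.

    packets      -- iterable of (seq, sent_us, sacked) tuples (illustrative layout)
    rack_xmit_us -- send time of the most recently delivered packet
    reo_wnd_us   -- reordering window
    """
    lost = []
    for seq, sent_us, sacked in packets:
        if sacked:
            continue  # already delivered, cannot be lost
        # Lost if it was sent at least reo_wnd before the most recently
        # delivered packet was sent.
        if rack_xmit_us - sent_us >= reo_wnd_us:
            lost.append(seq)
    return lost
```

For example, with a 1 msec window, an unsacked packet sent 5 msec before the most recently delivered one is marked lost, while one sent 0.5 msec before it is not.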

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2015-10-21 22:00:53 +0800

f67225839 tcp: track min RTT using windowed min-filter

Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worst-case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrite the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.
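
The idea can be sketched in a few lines. This is an illustrative reading of the description, not the kernel implementation: the `Sample` and `WindowedMin` names, the integer time units, and the quarter-window and half-window thresholds used to keep the three choices spread apart are assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: int  # measurement time
    v: int  # measured value


class WindowedMin:
    """Track the minimum over a sliding window with three samples."""

    def __init__(self, win):
        self.win = win
        self.s = None  # [best, 2nd best, 3rd best] once initialised

    def _reset(self, t, v):
        self.s = [Sample(t, v), Sample(t, v), Sample(t, v)]
        return v

    def update(self, t, v):
        s = self.s
        # A new min, or nothing left in the window: forget earlier samples.
        if s is None or v <= s[0].v or t - s[2].t > self.win:
            return self._reset(t, v)
        if v <= s[1].v:
            s[1] = Sample(t, v)
            s[2] = Sample(t, v)
        elif v <= s[2].v:
            s[2] = Sample(t, v)

        dt = t - s[0].t
        if dt > self.win:
            # Best choice expired: promote the 2nd and 3rd choices.
            s[0], s[1], s[2] = s[1], s[2], Sample(t, v)
            if t - s[0].t > self.win:
                s[0], s[1], s[2] = s[1], s[2], Sample(t, v)
        elif s[1].t == s[0].t and dt > self.win // 4:
            # A quarter window passed with no 2nd choice: take one now.
            s[1] = Sample(t, v)
            s[2] = Sample(t, v)
        elif s[2].t == s[1].t and dt > self.win // 2:
            # Half a window passed with no 3rd choice: take one now.
            s[2] = Sample(t, v)
        return s[0].v
```

Each update touches a fixed number of samples, which is what gives the constant space and constant time per update mentioned above.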

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2015-10-21 22:00:43 +0800

14 Oct, 2015

1 commit

02a6d6136 Revert "ipv4/icmp: redirect messages can use the ingress daddr as source"

Revert the commit e2ca690b657f ("ipv4/icmp: redirect messages
can use the ingress daddr as source"), which tried to introduce a more
suitable behaviour for ICMP redirect messages generated by VRRP routers.
However RFC 5798 section 8.1.1 states:

The IPv4 source address of an ICMP redirect should be the address
that the end-host used when making its next-hop routing decision.

while said commit used the destination address of the generating packet,
which does not match the above and in most cases leads to
no redirect packets being generated.

Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller

Paolo Abeni
2015-10-14 21:01:07 +0800

13 Oct, 2015

1 commit

e2ca690b6 ipv4/icmp: redirect messages can use the ingress daddr as source

This patch allows configuring how the source address of ICMP
redirect messages is selected; by default the old behaviour is
retained, while setting icmp_redirects_use_orig_daddr forces the
use of the destination address of the packet that caused the
redirect.

The new behaviour closely fits RFC 5798 section 8.1.1 and fixes the
following scenario:

Two machines are set up with VRRP to act as routers out of a subnet,
they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
x.x.x.254/24.

If a host in said subnet needs to get an ICMP redirect from the VRRP
router, i.e. to reach a destination behind a different gateway, the
source IP in the ICMP redirect is chosen as the primary IP on the
interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.

The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
and will continue to use the wrong next-hop.

Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller

Paolo Abeni
2015-10-13 10:38:02 +0800

29 Sep, 2015

1 commit

c386578f1 xfrm: Let the flowcache handle its size by default.

The xfrm flowcache size is limited by the flowcache limit
(4096 * number of online cpus) and the xfrm garbage collector
threshold (2 * 32768), whichever is reached first. This means
that we can hit the garbage collector limit only on systems
with more than 16 cpus. On such systems we simply refuse
new allocations if we reach the limit, so new flows are dropped.
On systems with 16 or fewer cpus, we hit the flowcache limit.
In this case, we shrink the flow cache instead of refusing new
flows.
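
The 16-cpu boundary follows directly from the two numbers quoted above. A small worked sketch (the constant and function names are made up for illustration; only the 4096 and 2 * 32768 figures come from the commit message):

```python
FLOWCACHE_ENTRIES_PER_CPU = 4096   # flowcache limit = 4096 * online cpus
OLD_GC_LIMIT = 2 * 32768           # old xfrm garbage collector limit


def first_limit_hit(online_cpus):
    """Return which limit is reached first for a given cpu count."""
    flowcache_limit = FLOWCACHE_ENTRIES_PER_CPU * online_cpus
    # At exactly 16 cpus both limits equal 65536; the text counts that
    # case as hitting the flowcache limit.
    return "flowcache" if flowcache_limit <= OLD_GC_LIMIT else "gc"
```

Since 2 * 32768 / 4096 = 16, only systems with more than 16 cpus reach the garbage collector limit before the flowcache limit.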

We increase the xfrm garbage collector threshold to INT_MAX
to get the same behaviour, independent of the number of cpus.

The xfrm garbage collector threshold can still be set below
the flowcache limit to reduce the memory usage of the flowcache.

Tested-by: Dan Streetman
Signed-off-by: Steffen Klassert

Steffen Klassert
2015-09-29 17:44:16 +0800

01 Sep, 2015

1 commit

87583ebb9 IGMP: Document igmp_link_local_mcast_reports

Document the addition of a new sysctl variable which controls the
generation of IGMP reports for link local multicast groups in the
224.0.0.X range.

IGMP reports for local multicast groups can now be optionally
inhibited by setting the value to zero e.g.:
echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports

To retain backwards compatibility the previous behaviour is retained
by default on system boot or reverted by setting the value back to
non-zero.

Signed-off-by: Philip Downey
Signed-off-by: David S. Miller

Philip Downey
2015-09-01 03:30:37 +0800

26 Aug, 2015

1 commit

43e122b01 tcp: refine pacing rate determination

When TCP pacing was added back in linux-3.12, we chose
to apply a fixed ratio of 200 % against current rate,
to allow probing for optimal throughput even during
slow start phase, where cwnd can be doubled every other gRTT.

At Google, we found it was better applying a different ratio
while in Congestion Avoidance phase.
This ratio was set to 120 %.

We've used the normal tcp_in_slow_start() helper for a while,
then tuned the condition to select the conservative ratio
as soon as cwnd >= ssthresh/2 :

- After cwnd reduction, it is safer to ramp up more slowly,
as we approach optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling
cwnd every other RTT.
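
The ratio selection described above can be sketched as follows. The function names and the integer percent arithmetic are assumptions for illustration; the 200 % and 120 % ratios and the cwnd >= ssthresh/2 condition come from the commit message.

```python
SLOW_START_RATIO = 200            # percent, used while ramping up
CONGESTION_AVOIDANCE_RATIO = 120  # percent, the conservative ratio


def pacing_ratio_percent(cwnd, ssthresh):
    """Select the conservative ratio as soon as cwnd >= ssthresh / 2."""
    if cwnd < ssthresh / 2:
        return SLOW_START_RATIO
    return CONGESTION_AVOIDANCE_RATIO


def pacing_rate(current_rate, cwnd, ssthresh):
    """Scale the current rate by the selected ratio."""
    return current_rate * pacing_ratio_percent(cwnd, ssthresh) // 100
```

During the initial ramp up, ssthresh can be modelled as `float("inf")`, so the 200 % ratio always applies there.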

Signed-off-by: Eric Dumazet
Cc: Neal Cardwell
Cc: Yuchung Cheng
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller

Eric Dumazet
2015-08-26 02:33:54 +0800

12 Aug, 2015

1 commit

e69948a0a net: Document xfrm4_gc_thresh and xfrm6_gc_thresh

This change adds documentation for xfrm4_gc_thresh and xfrm6_gc_thresh
based on the comments in commit eeb1b73378b56 ("xfrm: Increase the garbage
collector threshold").

Signed-off-by: Alexander Duyck
Signed-off-by: Steffen Klassert

Alexander Duyck
2015-08-12 14:28:04 +0800

01 Aug, 2015

2 commits

b56774163 ipv6: Enable auto flow labels by default

Initialize auto_flowlabels to one. This enables automatic flow labels;
individual sockets may disable them using the IPV6_AUTOFLOWLABEL socket
option.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2015-08-01 08:07:12 +0800

42240901f ipv6: Implement different admin modes for automatic flow labels

Change the meaning of net.ipv6.auto_flowlabels to provide a mode for
automatic flow labels generation. There are four modes:

0: flow labels are disabled
1: flow labels are enabled, sockets can opt-out
2: flow labels are allowed, sockets can opt-in
3: flow labels are enabled and enforced, no opt-out for sockets

np->autoflowlabel is initialized according to the sysctl value.
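
One way to read the four modes is as a small decision function. The sketch below is an interpretation for illustration only: `auto_flowlabel_enabled` is a hypothetical helper, and `socket_pref` stands for a per-socket choice (`None` when the socket has not expressed one).

```python
def auto_flowlabel_enabled(mode, socket_pref=None):
    """Return True if an automatic flow label would be generated.

    mode        -- value of net.ipv6.auto_flowlabels (0..3)
    socket_pref -- True/False if the socket opted in/out, None if unset
    """
    if mode == 0:
        return False  # disabled
    if mode == 1:
        return True if socket_pref is None else socket_pref  # on, opt-out
    if mode == 2:
        return False if socket_pref is None else socket_pref  # off, opt-in
    if mode == 3:
        return True  # enforced, no opt-out
    raise ValueError("auto_flowlabels mode must be 0, 1, 2 or 3")
```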

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2015-08-01 08:07:11 +0800

31 Jul, 2015

1 commit

8013d1d7e net/ipv6: add sysctl option accept_ra_min_hop_limit

Commit 6fd99094de2b ("ipv6: Don't reduce hop limit for an interface")
stopped accepting a hop limit from an RA if it is smaller than the current hop
limit, for security reasons. But this behavior conflicts with the RFC definition.

RFC 4861, 6.3.4. Processing Received Router Advertisements
A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
and Retrans Timer) may contain a value denoting that it is
unspecified. In such cases, the parameter should be ignored and the
host should continue using whatever value it is already using.

If the received Cur Hop Limit value is non-zero, the host SHOULD set
its CurHopLimit variable to the received value.

So add the sysctl option accept_ra_min_hop_limit to let users choose the minimum
hop limit value they accept from an RA, and set the default to 1 to meet RFC
standards.
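
A minimal sketch of the resulting decision, assuming a hypothetical helper `apply_ra_hop_limit` (the kernel code differs in detail): a received value of zero means "unspecified" and is ignored, and non-zero values are accepted only when they are at least the configured minimum.

```python
def apply_ra_hop_limit(current, received, min_hop_limit=1):
    """Return the hop limit to use after processing a Router Advertisement."""
    if received == 0:
        return current   # unspecified in the RA: keep the current value
    if received >= min_hop_limit:
        return received  # acceptable: adopt the advertised value
    return current       # below the configured minimum: ignore it
```

With the default minimum of 1, any non-zero advertised value is adopted, which matches the RFC text quoted above; raising the minimum restores protection against RAs that try to push the hop limit very low.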

Signed-off-by: Hangbin Liu
Acked-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller

Hangbin Liu
2015-07-31 06:56:40 +0800

23 Jul, 2015

1 commit

3985e8a36 ipv6: sysctl to restrict candidate source addresses

Per RFC 6724, section 4, "Candidate Source Addresses":

It is RECOMMENDED that the candidate source addresses be the set
of unicast addresses assigned to the interface that will be used
to send to the destination (the "outgoing" interface).

Add a sysctl to enable this behaviour.

Signed-off-by: Erik Kline
Signed-off-by: David S. Miller

Erik Kline
2015-07-23 01:54:11 +0800

10 Jul, 2015

1 commit

35a256fee ipv6: Nonlocal bind

Add support to allow non-local binds similar to how this was done for IPv4.
Non-local binds are very useful in emulating the Internet in a box, etc.

This adds the ip_nonlocal_bind sysctl under ipv6.

Testing:

Set up nonlocal binding and receive routing on a host, e.g.:

ip -6 rule add from ::/0 iif eth0 lookup 200
ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
sysctl -w net.ipv6.ip_nonlocal_bind=1

Set up routing to 2001:0:0:1::/64 on peer to go to first host

ping6 -I 2001:0:0:1::1 peer-address -- to verify

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2015-07-10 12:09:10 +0800

28 May, 2015

1 commit

07f4c9006 tcp/dccp: try to not exhaust ip_local_port_range in connect()

A long standing problem on busy servers is the tiny available TCP port
range (/proc/sys/net/ipv4/ip_local_port_range) and the default
sequential allocation of source ports in connect() system call.

If a host is having a lot of active TCP sessions, chances are
very high that all ports are in use by at least one flow,
and subsequent bind(0) attempts fail, or have to scan a big portion of
space to find a slot.

In this patch, I changed the starting point in __inet_hash_connect()
so that we try to favor even [1] ports, leaving odd ports for bind()
users.

We still perform a sequential search, so there is no guarantee, but
if connect() targets are very different, end result is we leave
more ports available to bind(), and we spread them all over the range,
lowering time for both connect() and bind() to find a slot.

This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range
is even, ie if start/end values have different parity.

Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to
32768 - 60999 (instead of 32768 - 61000)
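
The parity argument can be checked with a little arithmetic. In this sketch the helper names are made up, and the assumption that connect() favors ports with the same parity as the low end of the range is an illustration of footnote [1], not a statement about the exact kernel logic.

```python
def port_count(low, high):
    """Number of ports in an inclusive range."""
    return high - low + 1


def split_by_parity(low, high):
    """Split a range into ports favored by connect() and ports left for bind().

    Assumption for illustration: connect() favors the parity of `low`.
    """
    connect_ports = range(low, high + 1, 2)
    bind_ports = range(low + 1, high + 1, 2)
    return connect_ports, bind_ports


new_connect, new_bind = split_by_parity(32768, 60999)
old_connect, old_bind = split_by_parity(32768, 61000)
```

With 32768 - 60999 the range holds 28232 ports and splits into two equal halves of 14116; with the old 32768 - 61000 range it holds 28233 ports, so the two halves cannot be the same size.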

There is no change on security aspects here, only some poor hashing
schemes could be eventually impacted by this change.

[1] : The odd/even property depends on ip_local_port_range values parity

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-05-28 01:30:44 +0800

20 May, 2015

1 commit

492135557 tcp: add rfc3168, section 6.1.1.1. fallback

This work is a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested by the RFC on the first timeout:

[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

1) Normal ECN-capable path:

SYN ECE CWR ----->

2) Path with broken middlebox, when client has fallback:

SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->

In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:

SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)

In any case, only a small percentage of sites would see such additional
setup latency: it was found at the end of 2014 that ~56% of IPv4 and 65% of
IPv6 servers of the Alexa 1 million list would negotiate ECN (aka tcp_ecn=2
default), while 0.42% of these webservers will fail to connect when trying to
negotiate with ECN (tcp_ecn=1) due to timeouts, which the fallback would
mitigate with a slight latency trade-off. Recent related paper on this topic:

Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is set, the patch will perform the RFC3168,
section 6.1.1.1. fallback on timeout. For users who explicitly do not want
this, which can be the case in DC deployments, we add a
net.ipv4.tcp_ecn_fallback knob that allows for disabling the fallback.
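
The client-side behaviour in the diagrams above can be sketched as a function of the attempt number. `syn_flags` is a hypothetical helper written for this illustration; it models only which flags an outgoing SYN carries, with attempt 0 being the initial SYN and higher numbers being retransmissions after a timeout.

```python
def syn_flags(attempt, tcp_ecn, ecn_fallback):
    """Return the set of TCP flags on an outgoing SYN.

    attempt      -- 0 for the first SYN, 1+ for retransmissions
    tcp_ecn      -- 1: request ECN on outgoing connections; other values: do not
    ecn_fallback -- whether the RFC 3168 section 6.1.1.1 fallback is enabled
    """
    flags = {"SYN"}
    wants_ecn = tcp_ecn == 1
    fell_back = ecn_fallback and attempt > 0
    if wants_ecn and not fell_back:
        flags |= {"ECE", "CWR"}  # ECN-setup SYN
    return flags
```

With the fallback enabled, the first retransmission already drops ECE and CWR; with it disabled, every retransmission keeps them, as in the last diagram.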

tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann
Signed-off-by: Florian Westphal
Signed-off-by: Mirja Kühlewind
Signed-off-by: Brian Trammell
Cc: Eric Dumazet
Cc: Dave That
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Daniel Borkmann
2015-05-20 04:53:37 +0800

04 May, 2015

1 commit

82a584b7c ipv6: Flow label state ranges

This patch divides the IPv6 flow label space into two ranges:
0-7ffff is reserved for flow label manager, 80000-fffff will be
used for creating auto flow labels (per RFC6438). This only affects how
labels are set on transmit, it does not affect receive. This range split
can be disabled by sysctl.
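
The split can be illustrated with a few bit operations. This is a sketch under assumptions: the constant and function names are invented for the example, and deriving the auto label by forcing the top bit of a 20-bit hash is one straightforward way to land in the 80000-fffff range, not necessarily the exact kernel expression.

```python
FLOWLABEL_MASK = 0xFFFFF    # flow labels are 20 bits wide
AUTO_RANGE_FLAG = 0x80000   # top bit set: 80000-fffff, the auto range


def classify_flowlabel(label):
    """Say which half of the split space a label belongs to."""
    if not 0 <= label <= FLOWLABEL_MASK:
        raise ValueError("flow label must fit in 20 bits")
    return "auto" if label & AUTO_RANGE_FLAG else "stateful"


def make_auto_flowlabel(flow_hash, state_ranges=True):
    """Derive an auto flow label from a flow hash.

    With state_ranges enabled the label is forced into 80000-fffff;
    with it disabled the whole 20-bit space may be used.
    """
    label = flow_hash & FLOWLABEL_MASK
    if state_ranges:
        label |= AUTO_RANGE_FLAG
    return label
```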

Background:

IPv6 flow labels have been an unmitigated disappointment thus far
in the lifetime of IPv6. Support in HW devices to use them for ECMP
is lacking, and OSes don't turn them on by default. If we had these
we could get much better hashing in IPv6 networks without resorting
to DPI, possibly eliminating some of the motivations to define new
encaps in UDP just for getting ECMP.

Unfortunately, the initial specifications of IPv6 did not clarify
how they are to be used. There has always been a vague concept that
these can be used for ECMP, flow hashing, etc. and we do now have a
good standard for how to do this in RFC6438. The problem is that flow labels
can be either stateful or stateless (as in RFC6438), and we are
presented with the possibility that a stateless label may collide
with a stateful one. Attempts to split the flow label space were
rejected in IETF. When we added support in Linux for RFC6438, we
could not turn on flow labels by default due to this conflict.

This patch splits the flow label space and should give us
a path to enabling auto flow labels by default for all IPv6 packets.
This is an API change so we need to consider compatibility with
existing deployment. The stateful range is chosen to be the lower
values in hopes that most uses would have chosen small numbers.

Once we resolve the stateless/stateful issue, we can proceed to
look at enabling RFC6438 flow labels by default (starting with
scaled testing).

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2015-05-04 09:58:01 +0800

24 Mar, 2015

1 commit

9f0761c15 ipv6: add documentation for stable_secret, idgen_delay and idgen_retries knobs

Cc: Erik Kline
Cc: Fernando Gont
Cc: Lorenzo Colitti
Cc: YOSHIFUJI Hideaki/吉藤英明
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Hannes Frederic Sowa
2015-03-24 10:12:09 +0800

21 Mar, 2015

1 commit

89c69d3ce net: neighbour: Document {mcast, ucast}_solicit, mcast_resolicit.

Signed-off-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller

YOSHIFUJI Hideaki/吉藤英明
2015-03-21 09:47:40 +0800

07 Mar, 2015

1 commit

fab427608 ipv4: Documenting two sysctls for tcp PMTU probe

Namely tcp_probe_interval to control how often to restart
a probe, and tcp_probe_threshold to control when to stop
probing with respect to the width of the search range in bytes.

Signed-off-by: Fan Du
Signed-off-by: David S. Miller

Fan Du
2015-03-07 03:57:42 +0800

08 Feb, 2015

1 commit

032ee4236 tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks

Helpers for mitigating ACK loops by rate-limiting dupacks sent in
response to incoming out-of-window packets.

This patch includes:

- rate-limiting logic
- sysctl to control how often we allow dupacks to out-of-window packets
- SNMP counter for cases where we rate-limited our dupack sending

The rate-limiting logic in this patch decides to not send dupacks in
response to out-of-window segments if (a) they are SYNs or pure ACKs
and (b) the remote endpoint is sending them faster than the configured
rate limit.

We rate-limit our responses rather than blocking them entirely or
resetting the connection, because legitimate connections can rely on
dupacks in response to some out-of-window segments. For example, zero
window probes are typically sent with a sequence number that is below
the current window, and ZWPs thus expect to elicit a dupack in
response.

We allow dupacks in response to TCP segments with data, because these
may be spurious retransmissions for which the remote endpoint wants to
receive DSACKs. This is safe because segments with data can't
realistically be part of ACK loops, which by their nature consist of
each side sending pure/data-less ACKs to each other.

The dupack interval is controlled by a new sysctl knob,
tcp_invalid_ratelimit, given in milliseconds, in case an administrator
needs to dial this upward in the face of a high-rate DoS attack. The
name and units are chosen to be analogous to the existing
knob for ICMP, icmp_ratelimit.

The default value for tcp_invalid_ratelimit is 500ms, which allows at
most one such dupack per 500ms. This is chosen to be 2x faster than
the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
2.4). We allow the extra 2x factor because network delay variations
can cause packets sent at 1 second intervals to be compressed and
arrive much closer.
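
The rate-limiting rule can be sketched as a small stateful helper. The `DupackRateLimiter` class and its method names are hypothetical, and time is passed in explicitly as integer milliseconds to keep the example deterministic; only the rules themselves (SYNs and pure ACKs are limited, segments with data are not, default interval 500ms) come from the description above.

```python
class DupackRateLimiter:
    """Decide whether to answer an out-of-window segment with a dupack."""

    def __init__(self, ratelimit_ms=500):
        self.ratelimit_ms = ratelimit_ms
        self.last_dupack_ms = None  # time of the last rate-limited-class dupack

    def allow(self, now_ms, is_syn, payload_len):
        # Segments carrying data are never limited: they may be spurious
        # retransmissions whose sender wants a DSACK.
        if payload_len > 0 and not is_syn:
            return True
        # SYNs and pure ACKs: at most one dupack per interval.
        if (self.last_dupack_ms is not None
                and now_ms - self.last_dupack_ms < self.ratelimit_ms):
            return False
        self.last_dupack_ms = now_ms
        return True
```

A pair of endpoints stuck in an ACK loop would then exchange at most one such dupack per interval instead of one per received packet.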

Reported-by: Avery Fay
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Neal Cardwell
2015-02-08 17:03:12 +0800

26 Jan, 2015

1 commit

c2943f145 net: ipv6: Add sysctl entry to disable MTU updates from RA

The kernel forcefully applies MTU values received in router
advertisements provided the new MTU is less than the current. This
behavior is undesirable when the user space is managing the MTU. Instead
a sysctl flag 'accept_ra_mtu' is introduced such that the user space
can control whether or not RA provided MTU updates should be applied. The
default behavior is unchanged; user space must explicitly set this flag
to 0 for RA MTUs to be ignored.

Signed-off-by: Harout Hedeshian
Signed-off-by: David S. Miller

Harout Hedeshian
2015-01-26 06:54:41 +0800

13 Jan, 2015

1 commit

25050c63a update ip-sysctl.txt documentation (v2)

Update documentation to reflect the fact that
/proc/sys/net/ipv4/route/max_size is no longer used for ipv4.

Signed-off-by: Ani Sinha
Signed-off-by: David S. Miller

Ani Sinha
2015-01-13 04:38:43 +0800

07 Nov, 2014

1 commit

4e84b496f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

David S. Miller
2014-11-07 11:01:18 +0800

06 Nov, 2014

1 commit

219b5f29a net: Add missing descriptions for fwmark_reflect for ipv4 and ipv6.

It was initially sent by Lorenzo Colitti, but was subsequently
lost in the final diff he submitted.

Signed-off-by: Loganaden Velvindron
Signed-off-by: David S. Miller

Loganaden Velvindron
2014-11-06 04:43:57 +0800

30 Oct, 2014

1 commit

7fd2561e4 net: ipv6: Add a sysctl to make optimistic addresses useful candidates

Add a sysctl that causes an interface's optimistic addresses
to be considered equivalent to other non-deprecated addresses
for source address selection purposes. Preferred addresses
will still take precedence over optimistic addresses, subject
to other ranking in the source address selection algorithm.

This is useful where different interfaces are connected to
different networks from different ISPs (e.g., a cell network
and a home wifi network).

The current behaviour complies with RFC 3484/6724, and it
makes sense if the host has only one interface, or has
multiple interfaces on the same network (same or cooperating
administrative domain(s)), but not in the multiple distinct
networks case.

For example, if a mobile device has an IPv6 address on an LTE
network and then connects to IPv6-enabled wifi, while the wifi
IPv6 address is undergoing DAD, IPv6 connections will try to use
the wifi default route with the LTE IPv6 address, and will get
stuck until they time out.

Also, because optimistic nodes can receive frames, issue
an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMISTIC
flag appropriately set). A second RTM_NEWADDR is sent if DAD
completes (the address flags have changed), otherwise an
RTM_DELADDR is sent.

Also: add an entry in ip-sysctl.txt for optimistic_dad.

Signed-off-by: Erik Kline
Acked-by: Lorenzo Colitti
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Erik Kline
2014-10-30 03:11:36 +0800