25 Apr, 2017
2 commits
-
Middlebox firewall issues can potentially cause the server's data to be
blackholed after a successful 3WHS using TFO. The following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C X S
[retry and timeout]

https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation in which the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C S
S (accept & write)
C? X
Acked-by: Yuchung Cheng
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller
-
systemd-sysctl is triggering a suspicious RCU usage message when
net.ipv4.tcp_early_demux or net.ipv4.udp_early_demux is changed via
a sysctl config file:

[ 33.896184] ===============================
[ 33.899558] [ ERR: suspicious RCU usage. ]
[ 33.900624] 4.11.0-rc7+ #104 Not tainted
[ 33.901698] -------------------------------
[ 33.903059] /home/dsa/kernel-2.git/net/ipv4/sysctl_net_ipv4.c:305 suspicious rcu_dereference_check() usage!
[ 33.905724]
other info that might help us debug this:
[ 33.907656]
rcu_scheduler_active = 2, debug_locks = 0
[ 33.909288] 1 lock held by systemd-sysctl/143:
[ 33.910373] #0: (sb_writers#5){.+.+.+}, at: [] file_start_write+0x45/0x48
[ 33.912407]
stack backtrace:
[ 33.914018] CPU: 0 PID: 143 Comm: systemd-sysctl Not tainted 4.11.0-rc7+ #104
[ 33.915631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 33.917870] Call Trace:
[ 33.918431] dump_stack+0x81/0xb6
[ 33.919241] lockdep_rcu_suspicious+0x10f/0x118
[ 33.920263] proc_configure_early_demux+0x65/0x10a
[ 33.921391] proc_udp_early_demux+0x3a/0x41

Add RCU locking to proc_configure_early_demux.

Fixes: dddb64bcb3461 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
25 Mar, 2017
1 commit
-
Certain systems process a significant unconnected UDP workload.
It would be preferable to disable UDP early demux for those systems
and enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system:
782 -> 788Mbps unconnected single stream UDPv4
633 -> 654Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecture and cache
sizes. There will not be much difference seen if the entire UDP hash
table is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1->v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2->v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3->v4: Store and use read once result instead of querying pointer
again incorrectly.

v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan
Suggested-by: Eric Dumazet
Cc: Stephen Hemminger
Cc: Tom Herbert
Cc: David Miller
Signed-off-by: David S. Miller
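
The function-pointer approach from v1->v2 and the read-once rule from v2->v3 can be illustrated with a minimal Python sketch. It is not the kernel code: the names Protocol, configure_early_demux and receive are hypothetical, and the sketch only assumes that disabling the sysctl clears the pointer that the receive path consults.

```python
def udp_early_demux(packet):
    """Stand-in for the real early demux routine (hypothetical)."""
    return f"early-demuxed:{packet}"


class Protocol:
    """Hypothetical protocol entry with a switchable early demux pointer."""

    def __init__(self, handler):
        self.early_demux_handler = handler  # the real implementation, never cleared
        self.early_demux = handler          # pointer consulted by the receive path


def configure_early_demux(proto, enabled):
    # Toggling the sysctl swaps the pointer instead of adding a
    # conditional on the sysctl value to the receive path.
    proto.early_demux = proto.early_demux_handler if enabled else None


def receive(proto, packet):
    # Read the pointer once and use the stored value, so a concurrent
    # toggle cannot change it between the check and the call.
    demux = proto.early_demux
    if demux is not None:
        return demux(packet)
    return f"normal-lookup:{packet}"
```

With the pointer cleared, receive() falls through to the normal lookup path without ever testing the sysctl value itself.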
22 Mar, 2017
1 commit
-
This patch adds support for ECMP hash policy choice via a new sysctl
called fib_multipath_hash_policy and also adds support for L4 hashes.
The current values for fib_multipath_hash_policy are:
0 - layer 3 (default)
1 - layer 4
If there's an skb hash already set and it matches the chosen policy then it
will be used instead of being calculated (currently only for L4).
In L3 mode we always calculate the hash due to the ICMP error special
case. The flow dissector's field consistentification should handle the
address order, thus we can remove the address reversals.
If the skb is provided we always use it for the hash calculation,
otherwise we fall back to fl4, that is, if skb is NULL fl4 has to be set.

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
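
As a rough illustration of the two policy values, the following Python sketch hashes either the address pair (policy 0) or the full 5-tuple (policy 1) and maps the result onto a nexthop list. The hash function (CRC32), the key layout and the function names are assumptions made for the example; the kernel uses its own flow hashing.

```python
import zlib


def multipath_hash(src_ip, dst_ip, src_port=0, dst_port=0, proto=0, policy=0):
    """Illustrative flow hash: policy 0 = layer 3, policy 1 = layer 4."""
    if policy == 0:
        # Layer 3: only the addresses contribute to the hash.
        key = f"{src_ip}|{dst_ip}"
    elif policy == 1:
        # Layer 4: ports and protocol contribute as well.
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}"
    else:
        raise ValueError("unknown hash policy")
    return zlib.crc32(key.encode())


def select_nexthop(nexthops, flow_hash):
    """Map a flow hash onto one of the equal-cost nexthops."""
    return nexthops[flow_hash % len(nexthops)]
```

Under policy 0 every flow between the same two hosts takes the same path; under policy 1 flows that differ only in their ports may be spread across nexthops.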
17 Mar, 2017
1 commit
-
The tcp_tw_recycle was already broken for connections
behind NAT, since the per-destination timestamp is not
monotonically increasing for multiple machines behind
a single destination address.

After the randomization of TCP timestamp offsets
in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
for each connection), the tcp_tw_recycle is broken for all
types of connections for the same reason: the timestamps
received from a single machine are not monotonically increasing
anymore.

Remove tcp_tw_recycle, since it is not functional. Also, remove
the PAWSPassive SNMP counter since it is only used for
tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
since the strict argument is only set when tcp_tw_recycle is
enabled.

Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Eric Dumazet
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Cc: Lutz Vieweg
Cc: Florian Westphal
Signed-off-by: David S. Miller
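
The failure mode can be sketched in a few lines of Python. This is a simplified model, not the removed kernel logic: the class name is hypothetical, timestamp wraparound and expiry are ignored, and the only assumption is that a SYN is rejected when its timestamp is older than the last one remembered for the same source address.

```python
class PerDestTimestampCheck:
    """Toy model of a per-address monotonic timestamp check."""

    def __init__(self):
        self.last_ts = {}  # source address -> last accepted timestamp value

    def accept_syn(self, src_addr, tsval):
        last = self.last_ts.get(src_addr)
        if last is not None and tsval < last:
            # The timestamp appears to go backwards for this address.
            return False
        self.last_ts[src_addr] = tsval
        return True
```

Two hosts behind one NAT address, or one host using a different random offset per connection, produce timestamp sequences that look non-monotonic to such a check, so legitimate connections get rejected.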
31 Jan, 2017
1 commit
-
Packets arriving in a VRF currently are delivered to UDP sockets that
aren't bound to any interface. TCP defaults to not delivering packets
arriving in a VRF to unbound sockets. IP route lookup and socket
transmit both assume that unbound means using the default table and
UDP applications that haven't been changed to be aware of VRFs may not
function correctly in this case since they may not be able to handle
overlapping IP address ranges, or be able to send packets back to the
original sender if required.

So add a sysctl, udp_l3mdev_accept, to control this behaviour with it
being analogous to the existing tcp_l3mdev_accept, namely to allow a
process to have a VRF-global listen socket. Have this default to off
as this is the behaviour that users will expect, given that there is
no explicit mechanism to set an unmodified VRF-unaware application into a
default VRF.

Signed-off-by: Robert Shearman
Acked-by: David Ahern
Tested-by: David Ahern
Signed-off-by: David S. Miller
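
The matching rule described above can be sketched as a small Python predicate. This is a simplified model under stated assumptions: the function name and the representation of a VRF as a string (or None for the default table) are hypothetical, and the real socket lookup considers more than this.

```python
def socket_matches(sk_bound_dev, pkt_l3mdev, l3mdev_accept):
    """Decide whether a socket may receive a packet (toy model).

    sk_bound_dev  -- device the socket is bound to, or None if unbound
    pkt_l3mdev    -- VRF the packet arrived in, or None for the default table
    l3mdev_accept -- value of the (tcp|udp)_l3mdev_accept style switch
    """
    if sk_bound_dev is not None:
        # A bound socket only sees traffic from its own VRF.
        return sk_bound_dev == pkt_l3mdev
    # An unbound socket always sees default-table traffic, and sees
    # VRF traffic only when the accept switch is on.
    return pkt_l3mdev is None or l3mdev_accept
```

With the switch off, which is the default chosen in the commit, an unbound socket never receives packets that arrived in a VRF.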
25 Jan, 2017
1 commit
-
Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
that denotes the first unprivileged inet port in the namespace. To
disable all privileged ports set this to zero. It also checks for
overlap with the local port range. The privileged and local range may
not overlap.

The use case for this change is to allow containerized processes to bind
to privileged ports, but prevent them from ever being allowed to modify
their container's network configuration. The latter is accomplished by
ensuring that the network namespace is not a child of the user
namespace. This modification was needed to allow the container manager
to disable a namespace's privileged port restrictions without exposing
control of the network namespace to processes in the user namespace.

Signed-off-by: Krister Johansen
Signed-off-by: David S. Miller
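
The overlap rule can be worked through with a short Python sketch. The function names are hypothetical and the check is derived only from the description above: ports below the start value are privileged, so the privileged range overlaps the local port range exactly when the start value is greater than the low end of the local range.

```python
def is_privileged(port, unprivileged_port_start=1024):
    """A port is privileged if it lies below the first unprivileged port."""
    return port < unprivileged_port_start


def validate_unprivileged_port_start(new_start, local_port_range):
    """Return new_start if acceptable, else raise ValueError (toy model)."""
    low, _high = local_port_range
    if not 0 <= new_start <= 65535:
        raise ValueError("port out of range")
    # Privileged ports are 0 .. new_start - 1; they must stay below `low`.
    if new_start > low:
        raise ValueError("privileged range overlaps local port range")
    return new_start
```

Setting the value to zero leaves no privileged ports at all, so a process in that namespace could bind port 80 without extra capabilities.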
14 Jan, 2017
1 commit
-
Thin stream DUPACK is to start fast recovery on only one DUPACK
provided the connection is a thin stream (i.e., low inflight). But
this older feature is now subsumed by RACK. If a connection
receives only a single DUPACK, RACK would arm a reordering timer
and soon start fast recovery instead of timeout if no further
ACKs are received.

The socket option (THIN_DUPACK) is kept as a nop for compatibility.
Note that this patch does not change another thin-stream feature
which enables linear RTO. Although it might be good to generalize
that in the future (i.e., linear RTO for the first say 3 retries).

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
12 Jan, 2017
1 commit
-
Two AF_* families adding entries to the lockdep tables
at the same time.

Signed-off-by: David S. Miller
10 Jan, 2017
1 commit
-
> cat /proc/sys/net/ipv4/tcp_notsent_lowat
-1
> echo 4294967295 > /proc/sys/net/ipv4/tcp_notsent_lowat
-bash: echo: write error: Invalid argument
> echo -2147483648 > /proc/sys/net/ipv4/tcp_notsent_lowat
> cat /proc/sys/net/ipv4/tcp_notsent_lowat
-2147483648

but in documentation we have "tcp_notsent_lowat - UNSIGNED INTEGER"

v2: simplify to just proc_douintvec

Signed-off-by: Pavel Tikhomirov
Signed-off-by: David S. Miller
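
The session above is what one would expect if the value were parsed as a signed 32-bit integer while being documented as unsigned. The following Python sketch models the two parsing behaviours; the function names are hypothetical and the range checks are an assumption about the intended semantics, not a copy of the kernel's proc handlers.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1
UINT_MAX = 2**32 - 1


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as signed (two's complement)."""
    return value - 2**32 if value >= 2**31 else value


def parse_signed(text):
    """Accept only values that fit a signed 32-bit integer."""
    value = int(text)
    if not INT_MIN <= value <= INT_MAX:
        raise ValueError("Invalid argument")
    return value


def parse_unsigned(text):
    """Accept only values that fit an unsigned 32-bit integer."""
    value = int(text)
    if not 0 <= value <= UINT_MAX:
        raise ValueError("Invalid argument")
    return value
```

Here to_signed32(4294967295) gives -1, matching the first read in the session, and parse_signed rejects 4294967295 while accepting -2147483648. The unsigned parser does the opposite, which is the behaviour the documentation describes.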
30 Dec, 2016
2 commits
-
Different namespace applications might require a different maximal
number of remembered connection requests.

Signed-off-by: Haishuang Yan
Signed-off-by: David S. Miller
-
Different namespace applications might require fast recycling of
TIME-WAIT sockets independently of the host.

Signed-off-by: Haishuang Yan
Signed-off-by: David S. Miller
28 Dec, 2016
1 commit
-
Different namespaces might have different requirements to reuse
TIME-WAIT sockets for new connections. This might be required in
cases where different namespace applications are in place which
require TIME_WAIT socket connections to be reduced independently
of the host.

Signed-off-by: Haishuang Yan
Signed-off-by: David S. Miller
23 Oct, 2016
1 commit
-
This reverts commit a681574c99be23e4d20b769bf0e543239c364af5
("ipv4: disable BH in set_ping_group_range()") because we never
read ping_group_range in BH context (unlike local_port_range).

Then, since we already have a lock for ping_group_range, those
using ip_local_ports.lock for ping_group_range are clearly typos.

We might consider sharing the same lock for both ping_group_range
and local_port_range w.r.t. space saving, but that should be for
net-next.

Fixes: a681574c99be ("ipv4: disable BH in set_ping_group_range()")
Fixes: ba6b918ab234 ("ping: move ping_group_range out of CONFIG_SYSCTL")
Cc: Eric Dumazet
Cc: Eric Salo
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
21 Oct, 2016
1 commit
-
In commit 4ee3bd4a8c746 ("ipv4: disable BH when changing ip local port
range") Cong added BH protection in set_local_port_range() but missed
that the same fix was needed in set_ping_group_range().

Fixes: b8f1a55639e6 ("udp: Add function to make source port for UDP tunnels")
Signed-off-by: Eric Dumazet
Reported-by: Eric Salo
Signed-off-by: David S. Miller
24 May, 2016
1 commit
-
Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
moves the default TTL assignment, and as a side effect IPv4 TTL now
has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y).

The sysctl_ip_default_ttl is fundamental for IP to work properly,
as it provides the TTL to be used as default. The default TTL may be
used in ip_select_ttl, through the following flow:

  ip_select_ttl
    ip4_dst_hoplimit
      net->ipv4.sysctl_ip_default_ttl

This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl
in net_init_net, called during ipv4's initialization.

Without this commit, a kernel built without sysctl support will send
all IP packets with zero TTL (unless a TTL is explicitly set, e.g.
with setsockopt).

Given a similar issue might appear on the other knobs that were
namespaceified, this commit also moves them.

Fixes: fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
Signed-off-by: Ezequiel Garcia
Signed-off-by: David S. Miller
12 Apr, 2016
1 commit
-
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

           [h2]                   [h3]   15.0.0.5
            |                      |
           3|                     3|
          [SP1]                  [SP2]--+
           1  2                   1     2
           |  |     /-------------+     |
           |   \   /                    |
           |     X                      |
           |    / \                     |
           |   /   \---------------\    |
           1  2                     1   2
12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
           4                         4
            \                       /
              \                    /
               \                  /
                -------|   |-----/
                       1   2
                      [TOR3]
                        3|
                         |
                        [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

root@h1:~# ip ro ls
...
12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
15.0.0.0/16
nexthop via 12.0.0.2 dev swp1 weight 1
nexthop via 12.0.0.3 dev swp1 weight 1
...

If the link between tor3 and tor1 is down, as well as the link between
tor1 and tor2, then tor1 is effectively cut off from h1. Yet the route
lookups in h1 are alternating between the 2 routes: ping 15.0.0.5 gets
one and ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

root@h1:~# ip neigh show
...
12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
12.0.0.2 dev swp1 FAILED
...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.

Signed-off-by: David Ahern
Reviewed-by: Julian Anastasov
Signed-off-by: David S. Miller
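
The selection rule can be sketched in Python using the h1 example from the commit message. This is an illustration under assumptions, not the kernel implementation: the function names are hypothetical, neighbor states are modelled as strings, the set of states treated as "sane" is an assumption, and the linear probe to the next candidate is chosen only for simplicity.

```python
# Assumption: states treated as usable for forwarding in this sketch.
SANE_STATES = {"PERMANENT", "NOARP", "REACHABLE", "PROBE", "STALE", "DELAY"}


def nexthop_usable(nexthop, neigh_table):
    state = neigh_table.get(nexthop)
    if state is None:
        # No neighbor entry: nothing is known, so give it a shot.
        return True
    return state in SANE_STATES


def select_nexthop_with_neigh(nexthops, flow_hash, neigh_table, use_neigh):
    """Pick a nexthop by hash, skipping known-failed neighbors if enabled."""
    count = len(nexthops)
    first = flow_hash % count
    if not use_neigh:
        return nexthops[first]
    for offset in range(count):
        candidate = nexthops[(first + offset) % count]
        if nexthop_usable(candidate, neigh_table):
            return candidate
    # Every nexthop looks failed: fall back to the hash-selected one.
    return nexthops[first]
```

With the neighbor table from the example (12.0.0.2 FAILED, 12.0.0.3 REACHABLE) and use_neigh enabled, every flow ends up on 12.0.0.3; with it disabled, flows hashing to the first slot still go to the failed 12.0.0.2.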
17 Feb, 2016
3 commits
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
11 Feb, 2016
4 commits
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
This was initially introduced in df2cf4a78e488d26 ("IGMP: Inhibit
reports for local multicast groups") by defining the sysctl in the
ipv4_net_table array, however it was never implemented to be
namespace aware. Fix this by changing the code accordingly.

Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
08 Feb, 2016
9 commits
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
21 Jan, 2016
1 commit
-
tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions
and mem_cgroup->tcp_mem init/destroy stuff. This doesn't belong to
network subsys. Let's move it to memcontrol.c. This also allows us to
reuse generic code for handling legacy memcg files.

Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: "David S. Miller"
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Jan, 2016
3 commits
-
This is the final part required to namespaceify the tcp
keep alive mechanism.

Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
This is required to have full tcp keepalive mechanism namespace
support.

Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
-
Different net namespaces might have different requirements as to
the keepalive time of tcp sockets. This might be required in cases
where different firewall rules are in place which require tcp
timeout sockets to be increased/decreased independently of the host.

Signed-off-by: Nikolay Borisov
Signed-off-by: David S. Miller
19 Dec, 2015
1 commit
-
Allow accepted sockets to derive their sk_bound_dev_if setting from the
l3mdev domain in which the packets originated. A sysctl setting is added
to control the behavior which is similar to sk_mark and
sysctl_tcp_fwmark_accept.

This effectively allows a process to have a "VRF-global" listen socket,
with child sockets bound to the VRF device in which the packet originated.
A similar behavior can be achieved using sk_mark, but a solution using marks
is incomplete as it does not handle duplicate addresses in different L3
domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
domain provides a complete solution.

Signed-off-by: David Ahern
Signed-off-by: David S. Miller
05 Nov, 2015
1 commit
-
This fixes the following lockdep warning:
[ INFO: inconsistent lock state ]
4.3.0-rc7+ #1197 Not tainted
---------------------------------
inconsistent {IN-SOFTIRQ-R} -> {SOFTIRQ-ON-W} usage.
sysctl/1019 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&(&net->ipv4.ip_local_ports.lock)->seqcount){+.+-..}, at: [] ipv4_local_port_range+0xb4/0x12a
{IN-SOFTIRQ-R} state was registered at:
[] __lock_acquire+0x2f6/0xdf0
[] lock_acquire+0x11c/0x1a4
[] inet_get_local_port_range+0x4e/0xae
[] udp_flow_src_port.constprop.40+0x23/0x116
[] vxlan_xmit_one+0x219/0xa6a
[] vxlan_xmit+0xa6b/0xaa5
[] dev_hard_start_xmit+0x2ae/0x465
[] __dev_queue_xmit+0x531/0x633
[] dev_queue_xmit_sk+0x13/0x15
[] neigh_resolve_output+0x12f/0x14d
[] ip6_finish_output2+0x344/0x39f
[] ip6_finish_output+0x88/0x8e
[] ip6_output+0x91/0xe5
[] dst_output_sk+0x47/0x4c
[] NF_HOOK_THRESH.constprop.30+0x38/0x82
[] mld_sendpack+0x189/0x266
[] mld_ifc_timer_expire+0x1ef/0x223
[] call_timer_fn+0xfb/0x28c
[] run_timer_softirq+0x1c7/0x1f1

Fixes: b8f1a55639e6 ("udp: Add function to make source port for UDP tunnels")
Cc: Tom Herbert
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
21 Oct, 2015
1 commit
-
This patch implements the second half of RACK that uses the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if a not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
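
The loss rule described above can be sketched in Python. The function names and the millisecond float representation are assumptions for illustration; the sketch follows only the wording of the commit message (1 msec window initially, min-RTT/4 once reordering is seen, a packet is lost if it was sent at least reo_wnd before the most recently delivered one).

```python
def rack_reo_wnd_ms(min_rtt_ms, reordering_seen):
    """Reordering window: 1 msec by default, min-RTT/4 once reordering is seen."""
    return min_rtt_ms / 4 if reordering_seen else 1.0


def rack_mark_lost(unsacked_sent_times_ms, latest_delivered_sent_ms, reo_wnd_ms):
    """Return send times of not-yet-sacked packets deemed lost.

    A packet is deemed lost if it was sent at least reo_wnd_ms before
    the send time of the most recently delivered packet.
    """
    return [
        sent
        for sent in unsacked_sent_times_ms
        if latest_delivered_sent_ms - sent >= reo_wnd_ms
    ]
```

For instance, with unsacked packets sent at 100, 104.5 and 105 msec and the most recently delivered packet sent at 105 msec, only the packet from 100 msec is marked lost under the 1 msec window. With a 40 msec min-RTT and reordering observed, the window grows to 10 msec and none of them is marked.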