Eric Lee / smarc-fsl-linux-kernel

04 Nov, 2015

1 commit

73186df8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Minor overlapping changes in net/ipv4/ipmr.c, in 'net' we were
fixing the "BH-ness" of the counter bumps whilst in 'net-next'
the functions were modified to take an explicit 'net' parameter.

Signed-off-by: David S. Miller

David S. Miller
2015-11-04 02:41:45 +0800

03 Nov, 2015

4 commits

1d6119baf net: fix percpu memory leaks

This patch fixes the following problems:

1) percpu_counter_init() can return an error, therefore
init_frag_mem_limit() must propagate this error so that
inet_frags_init_net() can do the same up to its callers.

2) If ip[46]_frags_ns_ctl_register() fails, we must unwind
properly and free the percpu_counter.

Without this fix, we leave a freed object in the percpu_counters
global list (if CONFIG_HOTPLUG_CPU) leading to crashes.
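
The two fixes above amount to a standard propagate-and-unwind pattern. The following is a minimal userspace sketch of that pattern, not the kernel code: struct frag_state, counter_init(), counter_destroy(), ctl_register() and frags_init_net() are hypothetical stand-ins for the per-netns fragment state, percpu_counter_init(), percpu_counter_destroy(), ip[46]_frags_ns_ctl_register() and inet_frags_init_net().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-netns fragment state. */
struct frag_state {
	bool counter_live;	/* true while the counter sits on the "global list" */
};

/* Models percpu_counter_init(): it may fail, so it returns an error code. */
static int counter_init(struct frag_state *st, bool fail)
{
	if (fail)
		return -ENOMEM;
	st->counter_live = true;
	return 0;
}

/* Models percpu_counter_destroy(): takes the counter off the list again. */
static void counter_destroy(struct frag_state *st)
{
	st->counter_live = false;
}

/* Models ip[46]_frags_ns_ctl_register(), which may also fail. */
static int ctl_register(bool fail)
{
	return fail ? -ENOMEM : 0;
}

/* Propagate the init error (problem 1) and unwind on a later failure (problem 2). */
static int frags_init_net(struct frag_state *st, bool fail_counter, bool fail_ctl)
{
	int err;

	assert(st != NULL);
	err = counter_init(st, fail_counter);
	if (err)
		return err;

	err = ctl_register(fail_ctl);
	if (err)
		counter_destroy(st);	/* do not leave a stale counter behind */
	return err;
}
```

In this model, a failure at either step returns a negative errno and leaves counter_live false, which is the property whose absence led to the crashes described above.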

This bug was detected by KASAN and the syzkaller tool
(http://github.com/google/syzkaller)

Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Cc: Hannes Frederic Sowa
Cc: Jesper Dangaard Brouer
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Eric Dumazet
2015-11-03 11:47:14 +0800
9e17f8a47 net: make skb_set_owner_w() more robust

skb_set_owner_w() is called from various places that assume
skb->sk always points to a full-blown socket (as it changes
sk->sk_wmem_alloc).

We'd like to attach skb to request sockets, and in the future
to timewait sockets as well. For these kinds of pseudo sockets,
we need to take a traditional refcount and use sock_edemux()
as the destructor.

It is now time to un-inline skb_set_owner_w(), as it has become too big.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet
Bisected-by: Haiyang Zhang
Signed-off-by: David S. Miller

Eric Dumazet
2015-11-03 05:28:49 +0800
44f49dd8b ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context.

Fixes the following kernel BUG:

BUG: using __this_cpu_add() in preemptible [00000000] code: bash/2758
caller is __this_cpu_preempt_check+0x13/0x15
CPU: 0 PID: 2758 Comm: bash Tainted: P O 3.18.19 #2
ffffffff8170eaca ffff880110d1b788 ffffffff81482b2a 0000000000000000
0000000000000000 ffff880110d1b7b8 ffffffff812010ae ffff880007cab800
ffff88001a060800 ffff88013a899108 ffff880108b84240 ffff880110d1b7c8
Call Trace:
[] dump_stack+0x52/0x80
[] check_preemption_disabled+0xce/0xe1
[] __this_cpu_preempt_check+0x13/0x15
[] ipmr_queue_xmit+0x647/0x70c
[] ip_mr_forward+0x32f/0x34e
[] ip_mroute_setsockopt+0xe03/0x108c
[] ? get_parent_ip+0x11/0x42
[] ? pollwake+0x4d/0x51
[] ? default_wake_function+0x0/0xf
[] ? get_parent_ip+0x11/0x42
[] ? __wake_up_common+0x45/0x77
[] ? _raw_spin_unlock_irqrestore+0x1d/0x32
[] ? __wake_up_sync_key+0x4a/0x53
[] ? sock_def_readable+0x71/0x75
[] do_ip_setsockopt+0x9d/0xb55
[] ? unix_seqpacket_sendmsg+0x3f/0x41
[] ? sock_sendmsg+0x6d/0x86
[] ? sockfd_lookup_light+0x12/0x5d
[] ? SyS_sendto+0xf3/0x11b
[] ? new_sync_read+0x82/0xaa
[] compat_ip_setsockopt+0x3b/0x99
[] compat_raw_setsockopt+0x11/0x32
[] compat_sock_common_setsockopt+0x18/0x1f
[] compat_SyS_setsockopt+0x1a9/0x1cf
[] compat_SyS_socketcall+0x180/0x1e3
[] cstar_dispatch+0x7/0x1e

Signed-off-by: Ani Sinha
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Ani Sinha
2015-11-03 04:57:12 +0800
9920e48b8 ipv4: use l4 hash for locally generated multipath flows

This patch changes how the multipath hash is computed for locally
generated flows: now the hash comprises l4 information.

This allows better utilization of the available paths when the existing
flows have the same source IP and the same destination IP: with l3 hash,
even when multiple connections are in place simultaneously, a single path
will be used, while with l4 hash we can use all the available paths.
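
To illustrate the difference, here is a toy sketch with a deliberately simple mixing function. It is not get_hash_from_flowi4() and not the kernel's path selection; struct flow, l3_hash(), l4_hash() and pick_path() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key; the field names are illustrative, not struct flowi4. */
struct flow {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Toy L3 hash: mixes the two addresses only. */
static uint32_t l3_hash(const struct flow *f)
{
	return f->saddr * 31u + f->daddr;
}

/* Toy L4 hash: additionally mixes both ports. */
static uint32_t l4_hash(const struct flow *f)
{
	return (l3_hash(f) * 31u + f->sport) * 31u + f->dport;
}

/* Map a hash onto one of n_paths nexthops. */
static unsigned int pick_path(uint32_t hash, unsigned int n_paths)
{
	assert(n_paths > 0);
	return hash % n_paths;
}
```

Two connections between the same pair of hosts that differ only in source port always get the same l3_hash() value and therefore the same path, while l4_hash() can spread them over the available paths.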

v2 changes:
- use get_hash_from_flowi4() instead of implementing just another l4 hash
function

Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller

Paolo Abeni
2015-11-03 03:38:43 +0800

02 Nov, 2015

4 commits

c9b3292ee ipv4: update RTNH_F_LINKDOWN flag on UP event

When a nexthop is part of a multipath route we should clear the
LINKDOWN flag when the link goes UP or when the first address is added.
This is needed because we always set the LINKDOWN flag when the DEAD flag
was set, but now on UP the nexthop is not dead anymore. Examples where the
LINKDOWN bit can be forgotten when no NETDEV_CHANGE is delivered:

- link goes down (LINKDOWN is set), then link goes UP and device
shows carrier OK but LINKDOWN remains set

- last address is deleted (LINKDOWN is set), then address is
added and device shows carrier OK but LINKDOWN remains set

Steps to reproduce:
modprobe dummy
ifconfig dummy0 192.168.168.1 up

here add a multipath route where one nexthop is for dummy0:

ip route add 1.2.3.4 nexthop dev dummy0 nexthop dev SOME_OTHER_DEVICE
ifconfig dummy0 down
ifconfig dummy0 up

now ip route shows nexthop that is not dead. Now set the sysctl var:

echo 1 > /proc/sys/net/ipv4/conf/dummy0/ignore_routes_with_linkdown

now ip route will show a dead nexthop because the forgotten
RTNH_F_LINKDOWN is propagated as RTNH_F_DEAD.

Fixes: 8a3d03166f19 ("net: track link-status of ipv4 nexthops")
Signed-off-by: Julian Anastasov
Signed-off-by: David S. Miller

Julian Anastasov
2015-11-02 05:57:39 +0800
4f823defd ipv4: fix to not remove local route on link down

When fib_netdev_event calls fib_disable_ip on NETDEV_DOWN event
we should not delete the local routes if the local address
is still present. The confusion comes from the fact that both
fib_netdev_event and fib_inetaddr_event use the NETDEV_DOWN
constant. Fix it by bringing back the variable 'force'.

Steps to reproduce:
modprobe dummy
ifconfig dummy0 192.168.168.1 up
ifconfig dummy0 down
ip route list table local | grep dummy | grep host
local 192.168.168.1 dev dummy0 proto kernel scope host src 192.168.168.1

Fixes: 8a3d03166f19 ("net: track link-status of ipv4 nexthops")
Signed-off-by: Julian Anastasov
Signed-off-by: David S. Miller

Julian Anastasov
2015-11-02 05:57:39 +0800
dbd3393c5 ipv4: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment

CHECKSUM_PARTIAL skbs should never arrive in ip_fragment. If we get one
of those, warn about it once and handle it gracefully by recalculating
the checksum.

Cc: Eric Dumazet
Cc: Vlad Yasevich
Cc: Benjamin Coddington
Cc: Tom Herbert
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Hannes Frederic Sowa
2015-11-02 01:01:27 +0800
d749c9cbf ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets

We cannot reliably calculate packet size on MSG_MORE corked sockets
and thus cannot decide if they are going to be fragmented later on,
so better not use CHECKSUM_PARTIAL in the first place.

Cc: Eric Dumazet
Cc: Vlad Yasevich
Cc: Benjamin Coddington
Cc: Tom Herbert
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Hannes Frederic Sowa
2015-11-02 01:01:27 +0800

01 Nov, 2015

1 commit

b75ec3af2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

David S. Miller
2015-11-01 12:15:30 +0800

30 Oct, 2015

1 commit

e7b63ff11 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2015-10-30

1) The flow cache is limited by the flow cache limit which
depends on the number of cpus and the xfrm garbage collector
threshold which is independent of the number of cpus. This
leads to the fact that on systems with more than 16 cpus
we hit the xfrm garbage collector limit and refuse new
allocations, so new flows are dropped. On systems with 16
or fewer cpus, we hit the flowcache limit. In this case, we
shrink the flow cache instead of refusing new flows.

We increase the xfrm garbage collector threshold to INT_MAX
to get the same behaviour, independent of the number of cpus.

2) Fix some unaligned accesses on sparc systems.
From Sowmini Varadhan.

3) Fix some header checks in _decode_session4. We may call
pskb_may_pull with a negative value that is converted to
unsigned int. This can lead to incorrect policy
lookups. We fix this by checking the data pointer position
before we call pskb_may_pull.

4) Reload skb header pointers after calling pskb_may_pull
in _decode_session4 as this may change the pointers into
the packet.

5) Add a missing statistic counter on inner mode errors.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-10-30 19:51:56 +0800

28 Oct, 2015

1 commit

c2229fe14 fib_trie: leaf_walk_rcu should not compute key if key is less than pn->key

We were computing the child index in cases where the key value we were
looking for was actually less than the base key of the tnode. As a result
we were getting incorrect index values that would cause us to skip over
some children.

To fix this I have added a test that will force us to use child index 0 if
the key we are looking for is less than the key of the current tnode.
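
A small sketch of why the unguarded computation goes wrong, assuming the child index is derived as (key ^ tnode key) >> pos. The struct layout and the names toy_tnode, get_cindex() and walk_cindex() are simplified assumptions for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t t_key;

/* Simplified tnode: 'key' is the base key, child bits start at 'pos'. */
struct toy_tnode {
	t_key key;
	unsigned char pos;
	unsigned char bits;	/* node has 1 << bits children */
};

/* Unguarded index: only meaningful when 'key' falls under this tnode. */
static unsigned long get_cindex(t_key key, const struct toy_tnode *tn)
{
	assert(tn->pos < 32);
	return (unsigned long)((key ^ tn->key) >> tn->pos);
}

/* Guarded index: a key below the tnode's base key starts at child 0. */
static unsigned long walk_cindex(t_key key, const struct toy_tnode *tn)
{
	return (key > tn->key) ? get_cindex(key, tn) : 0;
}
```

With a tnode based at 0x0A000000 (pos 8, bits 8) and a lookup key of 0x09FFFF00, the unguarded index is 0x03FFFF, far outside the 256 children, while the guarded version yields 0 so the walk starts at the first child.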

Fixes: 8be33e955cb9 ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf")
Reported-by: Brian Rak
Signed-off-by: Alexander Duyck
Signed-off-by: David S. Miller

Alexander Duyck
2015-10-28 09:14:51 +0800

27 Oct, 2015

1 commit

7e3b6e742 ipv6: gre: support SIT encapsulation

gre_gso_segment() chokes if SIT frames were aggregated by GRO engine.

Fixes: 61c1db7fae21e ("ipv6: sit: add GSO/TSO support")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-10-27 13:01:18 +0800

24 Oct, 2015

1 commit

ba3e2084f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Conflicts:
net/ipv6/xfrm6_output.c
net/openvswitch/flow_netlink.c
net/openvswitch/vport-gre.c
net/openvswitch/vport-vxlan.c
net/openvswitch/vport.c
net/openvswitch/vport.h

The openvswitch conflicts were overlapping changes. One was
the egress tunnel info fix in 'net' and the other was the
vport ->send() op simplification in 'net-next'.

The xfrm6_output.c conflict was also a simplification
overlapping a bug fix.

Signed-off-by: David S. Miller

David S. Miller
2015-10-24 21:54:12 +0800

23 Oct, 2015

6 commits

5e0724d02 tcp/dccp: fix hashdance race for passive sessions

Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.

Only one must win the race, otherwise corruption would occur.

To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect whether the request was present in the
same ehash bucket where we try to insert the new child.

If the request socket was not found, we have to undo the child creation.

This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.

Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-10-23 20:42:21 +0800
7b1311807 ipv4: implement support for NOPREFIXROUTE ifa flag for ipv4 address

Currently adding a new ipv4 address always causes the creation of the
related network route, with default metric. When a host has multiple
interfaces on the same network, multiple routes with the same metric
are created.

If userspace wants to set a specific metric on each route, e.g.
giving a better metric to ethernet links than to Wi-Fi ones,
the network routes must be deleted and recreated, which is error-prone.

This patch implements support for IFA_F_NOPREFIXROUTE for ipv4
addresses. When an address is added with this flag set, no associated
network route is created, no network route is deleted when
said IP is gone, and it's up to user space to manage such a route.

Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller

Paolo Abeni
2015-10-23 17:54:54 +0800
c80dbe046 tcp: allow dctcp alpha to drop to zero

If alpha is strictly reduced by alpha >> dctcp_shift_g and if alpha is less
than 1 << dctcp_shift_g, then alpha may never reach zero. For example,
given shift_g=4 and alpha=15, alpha >> dctcp_shift_g yields 0 and alpha
remains 15. The effect isn't noticeable in this case below cwnd=137, but
could gradually drive uncongested flows with leftover alpha down to
cwnd=137. A larger dctcp_shift_g would have a greater effect.

This change causes alpha=15 to drop to 0 instead of being decremented by 1
as it would when alpha=16. However, it requires one less conditional to
implement since it doesn't have to guard against subtracting 1 from 0U. A
decay of 15 is not unreasonable since an equal or greater amount occurs at
alpha >= 240.
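
The arithmetic can be checked with a small sketch. The helper names below are hypothetical, and the "new" variant assumes the decay is implemented by subtracting the smaller non-zero value of alpha and alpha >> shift_g, which matches the behaviour described above (alpha=15 drops to 0, alpha=16 drops by 1).

```c
#include <assert.h>
#include <stdint.h>

/* Smaller of the two values, ignoring a zero operand (0 only if both are 0). */
static uint32_t min_not_zero_u32(uint32_t a, uint32_t b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Old decay: gets stuck once alpha < (1 << shift_g). */
static uint32_t dctcp_decay_old(uint32_t alpha, unsigned int shift_g)
{
	assert(shift_g < 32);
	return alpha - (alpha >> shift_g);
}

/* Assumed new decay: always subtracts something until alpha reaches 0. */
static uint32_t dctcp_decay_new(uint32_t alpha, unsigned int shift_g)
{
	assert(shift_g < 32);
	return alpha - min_not_zero_u32(alpha, alpha >> shift_g);
}
```

With shift_g=4, the old form leaves alpha=15 at 15 forever, while the new form takes 15 to 0, 16 to 15 and 240 to 225, so the jump of 15 at the bottom is no larger than the step already taken at alpha=240.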

Signed-off-by: Andrew G. Shewmaker
Acked-by: Florian Westphal
Signed-off-by: David S. Miller

Andrew Shewmaker
2015-10-23 17:46:52 +0800
ea673a4d3 xfrm4: Reload skb header pointers after calling pskb_may_pull.

A call to pskb_may_pull may change the pointers into the packet,
so reload the pointers after the call.

Signed-off-by: Steffen Klassert

Steffen Klassert
2015-10-23 13:32:39 +0800
1a14f1e55 xfrm4: Fix header checks in _decode_session4.

We skip the header information if the data pointer already
points behind the header in question for some protocols.
This is because in this case we call pskb_may_pull with a
negative value that is converted to unsigned int.
Skipping the header information can lead to incorrect policy
lookups, so fix it by checking the data pointer position
before we call pskb_may_pull.
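
The signed-to-unsigned pitfall can be reproduced in a userspace sketch. Everything here is a simplified model: struct toy_skb, toy_may_pull() and the two ports_readable_*() helpers are hypothetical, with offsets standing in for the skb data pointer and the transport header pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy packet: 'data_off' is where the data pointer sits, 'len' the buffer end. */
struct toy_skb {
	ptrdiff_t data_off;
	ptrdiff_t len;
};

/* Models pskb_may_pull(): are 'n' bytes available starting at data? */
static bool toy_may_pull(const struct toy_skb *skb, unsigned int n)
{
	assert(skb->data_off >= 0 && skb->data_off <= skb->len);
	return (unsigned long long)n <=
	       (unsigned long long)(skb->len - skb->data_off);
}

/* Without the check: a negative length wraps to a huge unsigned value. */
static bool ports_readable_unchecked(const struct toy_skb *skb, ptrdiff_t hdr_off)
{
	ptrdiff_t need = hdr_off + 4 - skb->data_off;

	return toy_may_pull(skb, (unsigned int)need);
}

/* With the check: if data is already past the 4 bytes, nothing must be pulled. */
static bool ports_readable_checked(const struct toy_skb *skb, ptrdiff_t hdr_off)
{
	if (hdr_off + 4 < skb->data_off)
		return true;
	return toy_may_pull(skb, (unsigned int)(hdr_off + 4 - skb->data_off));
}
```

For a header at offset 20 and a data pointer at offset 40, the unchecked variant asks for roughly 4 billion bytes and reports failure, so the header would be skipped; the checked variant recognises that the bytes are already behind the data pointer.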

Signed-off-by: Steffen Klassert

Steffen Klassert
2015-10-23 13:31:23 +0800
fc4099f17 openvswitch: Fix egress tunnel info.

While transitioning to netdev based vport we broke the OVS
feature which allows the user to retrieve tunnel packet egress
information for lwtunnel devices. The following patch fixes it
by introducing an ndo operation to get the tunnel egress info.
The same ndo operation can be used for lwtunnel devices and compat
ovs-tnl-vport devices. So after adding such a device operation
we can remove the similar operation from ovs-vport.

Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device").
Signed-off-by: Pravin B Shelar
Signed-off-by: David S. Miller

Pravin B Shelar
2015-10-23 10:39:25 +0800

22 Oct, 2015

4 commits

199c65506 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2015-10-22

1) Fix IPsec pre-encap fragmentation for GSO packets.
From Herbert Xu.

2) Fix some header checks in _decode_session6.
We skip the header information if the data pointer already
points behind the header in question for some protocols.
This is because in this case we call pskb_may_pull with a
negative value that is converted to unsigned int.
Skipping the header information can lead to incorrect policy
lookups. From Mathias Krause.

3) Allow to change the replay threshold and expiry timer of a
state without having to set other attributes like replay
counter and byte lifetime. Changing these other attributes
may break the SA. From Michael Rossberg.

4) Fix pmtu discovery for locally generated packets.
We may fail to dispatch to the inner address family.
As a result, the local error handler is not called
and the mtu value is not reported back to userspace.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-10-22 22:46:05 +0800
e2e8009ff tcp: remove improper preemption check in tcp_xmit_probe_skb()

Commit e520af48c7e5a introduced the following bug when setting the
TCP_REPAIR sockoption:

[ 2860.657036] BUG: using __this_cpu_add() in preemptible [00000000] code: daemon/12164
[ 2860.657045] caller is __this_cpu_preempt_check+0x13/0x20
[ 2860.657049] CPU: 1 PID: 12164 Comm: daemon Not tainted 4.2.3 #1
[ 2860.657051] Hardware name: Dell Inc. PowerEdge R210 II/0JP7TR, BIOS 2.0.5 03/13/2012
[ 2860.657054] ffffffff81c7f071 ffff880231e9fdf8 ffffffff8185d765 0000000000000002
[ 2860.657058] 0000000000000001 ffff880231e9fe28 ffffffff8146ed91 ffff880231e9fe18
[ 2860.657062] ffffffff81cd1a5d ffff88023534f200 ffff8800b9811000 ffff880231e9fe38
[ 2860.657065] Call Trace:
[ 2860.657072] [] dump_stack+0x4f/0x7b
[ 2860.657075] [] check_preemption_disabled+0xe1/0xf0
[ 2860.657078] [] __this_cpu_preempt_check+0x13/0x20
[ 2860.657082] [] tcp_xmit_probe_skb+0xc7/0x100
[ 2860.657085] [] tcp_send_window_probe+0x2d/0x30
[ 2860.657089] [] do_tcp_setsockopt.isra.29+0x74c/0x830
[ 2860.657093] [] tcp_setsockopt+0x2c/0x30
[ 2860.657097] [] sock_common_setsockopt+0x14/0x20
[ 2860.657100] [] SyS_setsockopt+0x71/0xc0
[ 2860.657104] [] entry_SYSCALL_64_fastpath+0x16/0x75

Since tcp_xmit_probe_skb() can be called from process context, use
NET_INC_STATS() instead of NET_INC_STATS_BH().

Fixes: e520af48c7e5 ("tcp: add TCPWinProbe and TCPKeepAlive SNMP counters")
Signed-off-by: Renato Westphal
Signed-off-by: David S. Miller

Renato Westphal
2015-10-22 10:29:26 +0800
36a28b211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains four Netfilter fixes for net, they are:

1) Fix Kconfig dependencies of new nf_dup_ipv4 and nf_dup_ipv6.

2) Remove bogus test nh_scope in IPv4 rpfilter match that is breaking
--accept-local, from Xin Long.

3) Wait for RCU grace period after dropping the pending packets in the
nfqueue, from Florian Westphal.

4) Fix sleeping allocation while holding spin_lock_bh, from Nikolay Borisov.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-10-22 10:26:17 +0800
b1974ed05 netlink: Rightsize IFLA_AF_SPEC size calculation

if_nlmsg_size() overestimates the minimum allocation size of a netlink
dump request (when called from rtnl_calcit()) or the size of the
message (when called from rtnl_getlink()). This is because
ext_filter_mask is not supported by rtnl_link_get_af_size() and
rtnl_link_get_size().

The over-estimation is significant when at least one netdev has many
VLANs configured (8 bytes for each configured VLAN).

This patch-set "rightsizes" the protocol specific attribute size
calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
and adding it as an argument to the get_link_af_size op in rtnl_af_ops.

Bridge module already used filtering aware sizing for notifications.
br_get_link_af_size_filtered() is consistent with the modified
get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
br_get_link_af_size() becomes unused and thus removed.

Signed-off-by: Ronen Arad
Acked-by: Sridhar Samudrala
Signed-off-by: David S. Miller

Arad, Ronen
2015-10-22 10:15:20 +0800

21 Oct, 2015

6 commits

4f41b1c58 tcp: use RACK to detect losses

This patch implements the second half of RACK that uses the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if a not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.
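
The rule can be sketched in a few lines of userspace C. This is a simplified model under stated assumptions, not tcp_rack_mark_lost() itself: struct tx_pkt, rack_mark_lost() and the microsecond timestamps are hypothetical, and real code walks the retransmit queue rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet state on the sender's queue. */
struct tx_pkt {
	uint64_t sent_us;	/* last transmission time */
	bool sacked;		/* already delivered */
	bool lost;		/* already marked lost */
};

/*
 * Mark every outstanding packet that was sent at least reo_wnd_us before
 * the send time of the most recently delivered packet (rack_mstamp_us).
 * Returns the number of packets newly marked lost.
 */
static size_t rack_mark_lost(struct tx_pkt *q, size_t n,
			     uint64_t rack_mstamp_us, uint64_t reo_wnd_us)
{
	size_t newly_lost = 0;

	assert(q != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (q[i].sacked || q[i].lost)
			continue;
		if (q[i].sent_us <= rack_mstamp_us &&
		    rack_mstamp_us - q[i].sent_us >= reo_wnd_us) {
			q[i].lost = true;
			newly_lost++;
		}
	}
	return newly_lost;
}
```

With a 1000 us reordering window and a most recently delivered packet sent at 3000 us, an unsacked packet sent at 1000 us is marked lost, while one sent at 2500 us is still given the benefit of the doubt.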

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2015-10-21 22:00:53 +0800
659a8ad56 tcp: track the packet timings in RACK

This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must have been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery but not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on every (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example in which the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch, which marks packets
lost using the RACK timestamp.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2015-10-21 22:00:48 +0800
77c631273 tcp: add tcp_tsopt_ecr_before helper

A helper to prepare for the main RACK patch.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2015-10-21 22:00:45 +0800
af82f4e84 tcp: remove tcp_mark_lost_retrans()

Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops the overloading of the ack_seq field of
the skb control block.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2015-10-21 22:00:44 +0800
f67225839 tcp: track min RTT using windowed min-filter

Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worst-case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrite the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.
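
The description above can be turned into a compact sketch. This is an independent reconstruction of a three-sample windowed min filter, not the code from this patch: struct minfilter, minfilter_reset(), minfilter_update(), and the quarter-window and half-window thresholds used to keep the samples spread out are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct mf_sample {
	uint32_t t;	/* measurement time */
	uint32_t v;	/* measured value */
};

/* s[0] is the best (smallest) value, s[1] the 2nd best, s[2] the 3rd best. */
struct minfilter {
	struct mf_sample s[3];
};

/* Forget everything: all three choices become the new sample. */
static uint32_t minfilter_reset(struct minfilter *m, uint32_t t, uint32_t v)
{
	struct mf_sample val = { t, v };

	m->s[0] = m->s[1] = m->s[2] = val;
	return v;
}

/* Feed one measurement taken at time t; returns the current windowed min. */
static uint32_t minfilter_update(struct minfilter *m, uint32_t win,
				 uint32_t t, uint32_t v)
{
	struct mf_sample val = { t, v };
	uint32_t dt;

	assert(win > 0);
	/* A new overall min, or nothing recent in the window: restart fresh. */
	if (v <= m->s[0].v || t - m->s[2].t > win)
		return minfilter_reset(m, t, v);

	if (v <= m->s[1].v)
		m->s[2] = m->s[1] = val;	/* new 2nd best */
	else if (v <= m->s[2].v)
		m->s[2] = val;			/* new 3rd best */

	dt = t - m->s[0].t;
	if (dt > win) {
		/* Best choice aged out: promote 2nd and 3rd, append the sample. */
		m->s[0] = m->s[1];
		m->s[1] = m->s[2];
		m->s[2] = val;
		if (t - m->s[0].t > win) {
			m->s[0] = m->s[1];
			m->s[1] = m->s[2];
			m->s[2] = val;
		}
	} else if (m->s[1].t == m->s[0].t && dt > win / 4) {
		/* A quarter window passed without a distinct 2nd choice. */
		m->s[2] = m->s[1] = val;
	} else if (m->s[2].t == m->s[1].t && dt > win / 2) {
		/* Half a window passed without a distinct 3rd choice. */
		m->s[2] = val;
	}
	return m->s[0].v;
}
```

Each update touches at most three samples, which is where the constant space and constant time per update come from. When the best sample ages out of the window, the 2nd best is already in place to take over.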

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2015-10-21 22:00:43 +0800
9e45a3e36 tcp: apply Kern's check on RTTs used for congestion control

Currently ca_seq_rtt_us does not use Karn's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.
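
As a reminder of what the check does, here is a minimal sketch: an ACK yields no RTT sample if any packet it acknowledges was retransmitted, because the ACK cannot be attributed to a specific transmission. struct acked_pkt and karn_rtt_sample_us() are hypothetical names, and measuring from the last packet in the batch is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one packet covered by an incoming ACK. */
struct acked_pkt {
	uint64_t sent_us;
	bool retransmitted;
};

/*
 * Returns an RTT sample in microseconds, measured from the last packet in
 * the batch, or -1 if any acked packet was retransmitted (ambiguous sample).
 */
static int64_t karn_rtt_sample_us(const struct acked_pkt *acked, size_t n,
				  uint64_t now_us)
{
	assert(acked != NULL && n > 0);
	for (size_t i = 0; i < n; i++) {
		if (acked[i].retransmitted)
			return -1;
	}
	return (int64_t)(now_us - acked[n - 1].sent_us);
}
```

A caller would skip both the RTT estimator update and the congestion-control RTT update whenever the function returns -1.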

Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2015-10-21 22:00:41 +0800

20 Oct, 2015

1 commit

26440c835 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Conflicts:
drivers/net/usb/asix_common.c
net/ipv4/inet_connection_sock.c
net/switchdev/switchdev.c

In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.

The other two conflicts were overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2015-10-20 21:08:27 +0800

19 Oct, 2015

4 commits

ca064bd89 xfrm: Fix pmtu discovery for local generated packets.

Commit 044a832a777 ("xfrm: Fix local error reporting crash
with interfamily tunnels") moved the setting of skb->protocol
behind the last access of the inner mode family to fix an
interfamily crash. Unfortunately now skb->protocol might not
be set at all, so we fail to dispatch to the inner address family.
As a result, the local error handler is not called and the
mtu value is not reported back to userspace.

We fix this by setting skb->protocol on message size errors
before we call xfrm_local_error.

Fixes: 044a832a7779c ("xfrm: Fix local error reporting crash with interfamily tunnels")
Signed-off-by: Steffen Klassert

Steffen Klassert
2015-10-19 16:30:05 +0800
371f1c7e0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next
tree. Most relevantly, updates for the nfnetlink_log to integrate with
conntrack, fixes for cttimeout and improvements for nf_queue core, they are:

1) Remove useless ifdef around static inline function in IPVS, from
Eric W. Biederman.

2) Simplify the conntrack support for nfnetlink_queue: Merge
nfnetlink_queue_ct.c file into nfnetlink_queue_core.c, then rename it back
to nfnetlink_queue.c

3) Use y2038 safe timestamp from nfnetlink_queue.

4) Get rid of dead function definition in nf_conntrack, from Flavio
Leitner.

5) Attach conntrack support for nfnetlink_log.c, from Ken-ichirou MATSUZAWA.
This adds a new NETFILTER_NETLINK_GLUE_CT Kconfig switch that
controls enabling both nfqueue and nflog integration with conntrack.
The userspace application can request this via NFULNL_CFG_F_CONNTRACK
configuration flag.

6) Remove unused netns variables in IPVS, from Eric W. Biederman and
Simon Horman.

7) Don't put back the refcount on the cttimeout object from xt_CT on success.

8) Fix crash on cttimeout policy object removal. We have to flush out
the cttimeout extension area of the conntrack not to refer to a nonexistent
object that was just removed.

9) Make sure rcu_callback completes before nfnetlink_cttimeout
module removal.

10) Fix compilation warning in br_netfilter when no nf_defrag_ipv4 and
nf_defrag_ipv6 are enabled. Patch from Arnd Bergmann.

11) Autoload ctnetlink dependencies when NFULNL_CFG_F_CONNTRACK is
requested. Again from Ken-ichirou MATSUZAWA.

12) Don't use pointer to previous hook when reinjecting traffic via
nf_queue with NF_REPEAT verdict since it may be already gone. This
also avoids a deadloop if the userspace application keeps returning
NF_REPEAT.

13) A bunch of cleanups for netfilter IPv4 and IPv6 code from Ian Morris.

14) Consolidate logger instance existence check in nfulnl_recv_config().

15) Fix broken atomicity when applying configuration updates to logger
instances in nfnetlink_log.

16) Get rid of the .owner attribute in our hook object. We don't need
this anymore since we're dropping pending packets that have escaped
from the kernel when unremoving the hook. Patch from Florian Westphal.

17) Remove unnecessary rcu_read_lock() from nf_reinject code, we always
assume RCU read side lock from .call_rcu in nfnetlink. Also from Florian.

18) Use static inline functions instead of macros to define NF_HOOK() and
NF_HOOK_COND() when no netfilter support is on, from Arnd Bergmann.
====================

Signed-off-by: David S. Miller

David S. Miller
2015-10-19 13:48:34 +0800
dc6ef6be5 tcp: do not set queue_mapping on SYNACK

At the time of commit fff326990789 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had few ways to cope with SYN floods.

We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.

Note that all SYNACK retransmits were picking TX queue 0; this is no longer
a win given that SYNACK rtx are now distributed on all cpus.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-10-19 13:26:02 +0800
26fb342c7 ipconfig: send Client-identifier in DHCP requests

A dhcp server may provide parameters to a client from a pool of IP
addresses and using a shared rootfs, or provide a specific set of
parameters for a specific client, usually using the MAC address to
identify each client individually. The dhcp protocol also specifies
a client-id field which can be used to determine the correct
parameters to supply when no MAC address is available. There is
currently no way to tell the kernel to supply a specific client-id.
Only the userspace dhcp clients support this feature, but they cannot
be used when the network is needed before userspace is available,
such as when the root filesystem is on NFS.

This patch makes it possible to pass something like
"ip=dhcp,client_id_type,client_id_value" as a kernel parameter, enabling
the kernel to identify itself to the server.

Signed-off-by: Li RongQing
Signed-off-by: David S. Miller

Li RongQing
2015-10-19 10:23:52 +0800

17 Oct, 2015

5 commits

f0a0a978b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

This merge resolves conflicts with 75aec9df3a78 ("bridge: Remove
br_nf_push_frag_xmit_sk") as part of Eric Biederman's effort to improve
netns support in the network stack that reached upstream via David's
net-next tree.

Signed-off-by: Pablo Neira Ayuso

Conflicts:
net/bridge/br_netfilter_hooks.c

Pablo Neira Ayuso
2015-10-17 20:28:03 +0800
c8d71d08a netfilter: ipv4: whitespace around operators

This patch cleanses whitespace around arithmetical operators.

No changes detected by objdiff.

Signed-off-by: Ian Morris
Signed-off-by: Pablo Neira Ayuso

Ian Morris
2015-10-17 01:19:23 +0800
24cebe3f2 netfilter: ipv4: code indentation

Use tabs instead of spaces to indent code.

No changes detected by objdiff.

Signed-off-by: Ian Morris
Signed-off-by: Pablo Neira Ayuso

Ian Morris
2015-10-17 01:19:15 +0800
6c28255b4 netfilter: ipv4: function definition layout

Use tabs instead of spaces to indent second line of parameters in
function definitions.

No changes detected by objdiff.

Signed-off-by: Ian Morris
Signed-off-by: Pablo Neira Ayuso

Ian Morris
2015-10-17 01:19:10 +0800
27951a016 netfilter: ipv4: ternary operator layout

Correct whitespace layout of ternary operators in the netfilter-ipv4
code.

No changes detected by objdiff.

Signed-off-by: Ian Morris
Signed-off-by: Pablo Neira Ayuso

Ian Morris
2015-10-17 01:19:04 +0800