15 Dec, 2015
4 commits
-
[ Upstream commit 45f6fad84cc305103b28d73482b344d7f5b76f39 ]
This patch addresses multiple problems :

UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
while socket is not locked : Other threads can change np->opt
concurrently. Dmitry posted a syzkaller
(http://github.com/google/syzkaller) program demonstrating
use-after-free.

Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
and dccp_v6_request_recv_sock() also need to use RCU protection
to dereference np->opt once (before calling ipv6_dup_options())

This patch adds full RCU protection to np->opt
Reported-by: Dmitry Vyukov
Signed-off-by: Eric Dumazet
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 264640fc2c5f4f913db5c73fa3eb1ead2c45e9d7 ]
If a fragmented multicast packet is received on an ethernet device which
has an active macvlan on top of it, each fragment is duplicated and
received both on the underlying device and the macvlan. If some
fragments for macvlan are processed before the whole packet for the
underlying device is reassembled, the "overlapping fragments" test in
ip6_frag_queue() discards the whole fragment queue.

To resolve this, add device ifindex to the search key and require it to
match reassembling multicast packets and packets to link-local
addresses.

Note: similar patch has been already submitted by Yoshifuji Hideaki in
http://patchwork.ozlabs.org/patch/220979/
but got lost and forgotten for some reason.
Signed-off-by: Michal Kubecek
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 4c6980462f32b4f282c5d8e5f7ea8070e2937725 ]
Similar to ipv4, when destroying an mrt table the static mfc entries and
the static devices are kept, which leads to devices that can never be
destroyed (because of refcnt taken) and leaked memory. Make sure that
everything is cleaned up on netns destruction.
Fixes: 8229efdaef1e ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code")
CC: Benjamin Thery
Signed-off-by: Nikolay Aleksandrov
Reviewed-by: Cong Wang
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 41033f029e393a64e81966cbe34d66c6cf8a2e7e ]
The OUTMCAST stat is double incremented, getting bumped once in the mcast code
itself, and again in the common ip output path. Remove the mcast bump, as it's
not needed.

Validated by the reporter, with good results.
Signed-off-by: Neil Horman
Reported-by: Claus Jensen
CC: Claus Jensen
CC: David Miller
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
10 Dec, 2015
2 commits
-
[ Upstream commit 2a189f9e57650e9f310ddf4aad75d66c1233a064 ]
In ipv6_add_dev, when addrconf_sysctl_register fails, we do not clean up
the dev_snmp6 entry that we have already registered for this device.
Call snmp6_unregister_dev in this case.
Fixes: a317a2f19da7d ("ipv6: fail early when creating netdev named all or default")
Reported-by: Dmitry Vyukov
Signed-off-by: Sabrina Dubroca
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 4ece9009774596ee3df0acba65a324b7ea79387c ]
sit0 device allocates its percpu storage twice :
- One time in ipip6_tunnel_init()
- One time in ipip6_fb_tunnel_init()

Thus we leak 48 bytes per possible cpu per network namespace dismantle.
ipip6_fb_tunnel_init() can be much simpler and does not
return an error, and should be called after register_netdev()

Note that ipip6_tunnel_clone_6rd() also needs to be called
after register_netdev() (calling ipip6_tunnel_init())
Fixes: ebe084aafb7e ("sit: Use ipip6_tunnel_init as the ndo_init function.")
Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Cc: Steffen Klassert
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
03 Oct, 2015
4 commits
-
[ Upstream commit 6b9ea5a64ed5eeb3f68f2e6fcce0ed1179801d1e ]
Problem:
The ecmp route replace support for ipv6 in the kernel deletes the
existing ecmp route too early, i.e. when it installs the first nexthop.
If there is an error in installing the subsequent nexthops, it's too late
to recover the already deleted existing route, leaving the fib
in an inconsistent state.

This patch reduces the possibility of this by doing the following:
a) Changes the existing multipath route add code to a two stage process:
build rt6_infos + insert them
ip6_route_add rt6_info creation code is moved into
ip6_route_info_create.
b) This ensures that most errors are caught during building rt6_infos
and we fail early
c) Separates multipath add and del code. Because add needs the special
two stage mode in a) and delete essentially does not care.
d) In any event if the code fails during inserting a route again, a
warning is printed (This should be unlikely)

Before the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
/* previously added ecmp route 3000:1000:1000:1000::2 disappears from
 * kernel */

After the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024
Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
Signed-off-by: Roopa Prabhu
Reviewed-by: Nikolay Aleksandrov
Acked-by: Nicolas Dichtel
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ]
In the IPv6 multicast routing code the mrt_lock was not being released
correctly in the MFC iterator, as a result adding or deleting a MIF would
cause a hang because the mrt_lock could not be acquired.

This fix is a copy of the code for the IPv4 case and ensures that the lock
is released correctly.
Signed-off-by: Richard Laing
Acked-by: Cong Wang
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ]
We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
but in error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.
Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit d4257295ba1b389c693b79de857a96e4b7cd8ac0 ]
When a tunnel is deleted, the cached dst entry should be released.
This problem may prevent the removal of a netns (seen with a x-netns IPv6
gre tunnel):
unregister_netdevice: waiting for lo to become free. Usage count = 3
CC: Dmitry Kozlov
Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: huaibin Wang
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
30 Sep, 2015
4 commits
-
[ Upstream commit 3257d8b12f954c462d29de6201664a846328a522 ]
In commit b357a364c57c9 ("inet: fix possible panic in
reqsk_queue_unlink()"), I missed fact that tcp_check_req()
can return the listener socket in one case, and that we must
release the request socket refcount or we leak it.

Tested:
Following packetdrill test template shows the issue
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 2920
+0 > S. 0:0(0) ack 1
+.002 < . 1:1(0) ack 21 win 2920
+0 > R 21:21(0)
Fixes: b357a364c57c9 ("inet: fix possible panic in reqsk_queue_unlink()")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit fdbf5b097bbd9693a86c0b8bfdd071a9a2117cfc ]
This patch reverts 19424e052fb44da2f00d1a868cbb51f3e9f4bbb5 ("sit:
Add gro callbacks to sit_offload") because it generates packets
that cannot be handled even by our own GSO.
Reported-by: Wolfgang Walter
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 03645a11a570d52e70631838cb786eb4253eb463 ]
ip6_datagram_connect() is doing a lot of socket changes without
socket being locked.

This looks wrong, at least for udp_lib_rehash() which could corrupt
lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses.
Signed-off-by: Eric Dumazet
Acked-by: Herbert Xu
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 4c938d22c88a9ddccc8c55a85e0430e9c62b1ac5 ]
Before commit daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it
from ip6_mc_input().") MLD packets were only processed locally. After the
change, a copy of MLD packet goes through ip6_mr_input, causing
MRT6MSG_NOCACHE message to be generated to user space.

Make MLD packet only processed locally.
Fixes: daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().")
Signed-off-by: Hermin Anggawijaya
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
11 Jul, 2015
1 commit
-
[ Upstream commit 34b99df4e6256ddafb663c6de0711dceceddfe0e ]
ICMP messages can trigger ICMP and local errors. In this case
serr->port is 0 and starting from Linux 4.0 we do not return
the original target address to the error queue readers.
Add function to define which errors provide addr_offset.
With this fix my ping command is not silent anymore.
Fixes: c247f0534cc5 ("ip: fix error queue empty skb handling")
Signed-off-by: Julian Anastasov
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
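The "function to define which errors provide addr_offset" mentioned above can be pictured with a small userspace sketch. The enum, the struct and the function name below are hypothetical stand-ins, not the kernel's definitions; the assumption is that errors originating from ICMP, ICMPv6 or the local stack should carry the original target address even when the port is 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the error origin values and the
 * extended-error record; the names are illustrative only. */
enum err_origin { ORIGIN_NONE, ORIGIN_LOCAL, ORIGIN_ICMP, ORIGIN_ICMP6 };

struct err_info {
	enum err_origin origin;	/* who generated the error */
	uint16_t port;		/* 0 when the error was triggered by ICMP */
};

/* Return true when the queued error should carry the original
 * target address, i.e. when addr_offset is meaningful. */
static bool err_supports_addr(const struct err_info *e)
{
	return e->origin == ORIGIN_ICMP6 ||
	       e->origin == ORIGIN_ICMP ||
	       e->origin == ORIGIN_LOCAL ||
	       e->port != 0;
}
```

Keeping the decision in one predicate means every reader of the error queue applies the same rule, instead of each call site testing the port on its own.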
11 Jun, 2015
1 commit
-
This reverts commit 0243508edd317ff1fa63b495643a7c192fbfcd92.
It introduces new regressions.
Signed-off-by: David S. Miller
09 Jun, 2015
2 commits
-
UDP encapsulation is broken on IPv6. This is because the logic to resubmit
the nexthdr is inverted, checking for a ret value > 0 instead of < 0. Also,
the resubmit label is in the wrong position since we already get the
nexthdr value when performing decapsulation. In addition the skb pull is no
longer necessary either.

This changes the return value check to look for < 0, using it for the
nexthdr on the next iteration, and moves the resubmit label to the proper
location.

With these changes the v6 code now matches what we do in the v4 ip input
code wrt resubmitting when decapsulating.
Signed-off-by: Josh Hunt
Acked-by: "Tom Herbert"
Signed-off-by: David S. Miller
-
The memory pointed to by idev->stats.icmpv6msgdev,
idev->stats.icmpv6dev and idev->stats.ipv6 can each be used in an RCU
read context without taking a reference on idev. For example, through
IP6_*_STATS_* calls in ip6_rcv. These memory blocks are freed without
waiting for an RCU grace period to elapse. This could lead to the
memory being written to after it has been freed.

Fix this by using call_rcu to free the memory used for stats, as well
as idev after an RCU grace period has elapsed.
Signed-off-by: Robert Shearman
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
02 Jun, 2015
1 commit
-
We currently rely on the PMTU discovery of xfrm.
However, if a packet is locally sent, the PMTU mechanism
of xfrm tries to do local socket notification, which
might not work for applications like ping that don't
check for this. So add pmtu handling to vti6_xmit to
report MTU changes immediately.
Signed-off-by: Steffen Klassert
Signed-off-by: Alexander Duyck
Signed-off-by: David S. Miller
01 Jun, 2015
1 commit
-
We have two problems in UDP stack related to bogus checksums :
1) We return -EAGAIN to application even if receive queue is not empty.
This breaks applications using edge trigger epoll()

2) Under UDP flood, we can loop forever without yielding to other
processes, potentially hanging the host, especially on non SMP.

This patch is an attempt to make things better.
We might in the future add extra support for rt applications
wanting to better control time spent doing a recv() in a hostile
environment. For example we could validate checksums before queuing
packets in socket receive queue.
Signed-off-by: Eric Dumazet
Cc: Willem de Bruijn
Signed-off-by: David S. Miller
29 May, 2015
1 commit
-
Steffen Klassert says:
====================
pull request (net): ipsec 2015-05-28

1) Fix a race in xfrm_state_lookup_byspi, we need to take
the refcount before we release xfrm_state_lock.
From Li RongQing.

2) Fix IV generation on ESN state. We used just the
low order sequence numbers for IV generation on
ESN, as a result the IV can repeat on the same
state. Fix this by using the high order sequence
number bits too and make sure to always initialize
the high order bits with zero. These patches are
serious stable candidates. Fixes from Herbert Xu.

3) Fix the skb->mark handling on vti. We don't
reset skb->mark in skb_scrub_packet anymore,
so vti must care to restore the original
value back after it was used to lookup the
vti policy and state. Fixes from Alexander Duyck.

Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller
28 May, 2015
2 commits
-
The vti6_rcv_cb and vti_rcv_cb calls were leaving the skb->mark modified
after completing the function. This resulted in the original skb->mark
value being lost. Since we only need skb->mark to be set for
xfrm_policy_check we can pull the assignment into the rcv_cb calls and then
just restore the original mark after xfrm_policy_check has been completed.
Signed-off-by: Alexander Duyck
Signed-off-by: Steffen Klassert
-
Instead of modifying skb->mark we can simply modify the flowi_mark that is
generated as a result of the xfrm_decode_session. By doing this we don't
need to actually touch the skb->mark and it can be preserved as it passes
out through the tunnel.
Signed-off-by: Alexander Duyck
Signed-off-by: Steffen Klassert
23 May, 2015
1 commit
-
Pablo Neira Ayuso says:
====================
Netfilter fixes for net

The following patchset contain Netfilter fixes for your net tree, they are:
1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash.
This problem is due to wrong order in the per-net registration and netlink
socket events. Patch from Francesco Ruggeri.

2) Make sure that counters that userspace pass us are higher than 0 in all the
x_tables frontends. Discovered via Trinity, patch from Dave Jones.

3) Revert a patch for br_netfilter to rely on the conntrack status bits. This
breaks stateless IPv6 NAT transformations. Patch from Florian Westphal.
====================
Signed-off-by: David S. Miller
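Item 2 above (rejecting a zero counter count coming from userspace) amounts to an early sanity check before anything is allocated. A minimal sketch follows; the struct and function names are hypothetical and far smaller than the real x_tables structures.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down replace request; the real x_tables
 * structures carry many more fields. */
struct replace_req {
	unsigned int num_counters;	/* supplied by userspace */
};

/* Reject a request whose counter count is zero before any allocation
 * is sized from it. Returns 0 on success, -EINVAL otherwise. */
static int check_replace_req(const struct replace_req *req)
{
	if (req == NULL || req->num_counters == 0)
		return -EINVAL;
	return 0;
}
```

Failing early with -EINVAL keeps a bogus value from ever reaching the allocator.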
21 May, 2015
2 commits
-
When replacing an IPv6 multipath route with "ip route replace", i.e.
NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first
matching route without fixing its siblings, resulting in corrupted
siblings linked list; removing one of the siblings can then end in an
infinite loop.

IPv6 ECMP implementation is a bit different from IPv4 so that route
replacement cannot work in exactly the same way. This should be a
reasonable approximation:

1. If the new route is ECMP-able and there is a matching ECMP-able one
already, replace it and all its siblings (if any).

2. If the new route is ECMP-able and no matching ECMP-able route exists,
replace first matching non-ECMP-able (if any) or just add the new one.

3. If the new route is not ECMP-able, replace first matching
non-ECMP-able route (if any) or add the new route.

We also need to remove the NLM_F_REPLACE flag after replacing old
route(s) by first nexthop of an ECMP route so that each subsequent
nexthop does not replace previous one.
Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Michal Kubecek
Acked-by: Nicolas Dichtel
Signed-off-by: David S. Miller
-
If adding a nexthop of an IPv6 multipath route fails, comment in
ip6_route_multipath() says we are going to delete all nexthops already
added. However, current implementation deletes even the routes it
hasn't even tried to add yet. For example, running

  ip route add 1234:5678::/64 \
      nexthop via fe80::aa dev dummy1 \
      nexthop via fe80::bb dev dummy1 \
      nexthop via fe80::cc dev dummy1

twice results in removing all routes first command added.
Limit the second (delete) run to nexthops that succeeded in the first
(add) run.
Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Michal Kubecek
Acked-by: Nicolas Dichtel
Signed-off-by: David S. Miller
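The rollback rule described above (delete only the nexthops that the add run actually installed) can be illustrated with a toy table. Everything here is hypothetical: nexthops are plain integer ids and the "routing table" is a small array, which is not how the kernel stores routes.

```c
#include <stddef.h>

/* Toy table: a nexthop id may be present at most once. */
#define MAX_ID 16
static int table[MAX_ID];

static int nh_add(int id)
{
	if (id < 0 || id >= MAX_ID || table[id])
		return -1;	/* duplicate or out of range */
	table[id] = 1;
	return 0;
}

static void nh_del(int id)
{
	if (id >= 0 && id < MAX_ID)
		table[id] = 0;
}

/* Add all nexthops; on failure remove only the ones this call added. */
static int multipath_add(const int *ids, size_t n)
{
	size_t added = 0;

	for (; added < n; added++) {
		if (nh_add(ids[added]) != 0)
			break;
	}
	if (added == n)
		return 0;

	/* Roll back entries [0, added) only; never touch later ones. */
	while (added > 0)
		nh_del(ids[--added]);
	return -1;
}
```

Calling multipath_add() twice with the same ids fails on the first entry the second time, so the rollback loop has nothing to undo and the entries from the first call stay in place.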
20 May, 2015
2 commits
-
After improving setsockopt() coverage in trinity, I started triggering
vmalloc failures pretty reliably from this code path:

warn_alloc_failed+0xe9/0x140
__vmalloc_node_range+0x1be/0x270
vzalloc+0x4b/0x50
__do_replace+0x52/0x260 [ip_tables]
do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
nf_setsockopt+0x65/0x90
ip_setsockopt+0x61/0xa0
raw_setsockopt+0x16/0x60
sock_common_setsockopt+0x14/0x20
SyS_setsockopt+0x71/0xd0

It turns out we don't validate that the num_counters field in the
struct we pass in from userspace is initialized.

The same problem also exists in ebtables, arptables, ipv6, and the
compat variants.
Signed-off-by: Dave Jones
Signed-off-by: Pablo Neira Ayuso
-
Commit ("udp: Simplify__udp*_lib_mcast_deliver")
simplified the filter for incoming IPv6 multicast but removed
the check of the local socket address and the UDP destination
address.

This patch restores the filter to prevent sockets bound to an IPv6
multicast IP from receiving other UDP traffic like unicast.
Signed-off-by: Henning Rogge
Fixes: 5cf3d46192fc ("udp: Simplify__udp*_lib_mcast_deliver")
Cc: "David S. Miller"
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
18 May, 2015
1 commit
-
commit 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages
send from TIME_WAIT") added the flow label in the last TCP packets.
Unfortunately, it was not cast properly.

This patch replaces the buggy shift with be32_to_cpu/cpu_to_be32.
Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages")
Reported-by: Eric Dumazet
Signed-off-by: Florent Fourcot
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
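As a userspace illustration of why a byte-order conversion is the right tool here rather than a shift: the flow label is the low 20 bits of the first big-endian 32-bit word of the IPv6 header. The sketch below uses ntohl()/htonl() as userspace analogues of be32_to_cpu/cpu_to_be32; the helper names and the mask macro are hypothetical.

```c
#include <arpa/inet.h>
#include <stdint.h>

#define FLOWLABEL_MASK_HOST 0x000FFFFFu	/* low 20 bits, host order */

/* Extract the 20-bit flow label from a big-endian header word. */
static uint32_t flowlabel_to_host(uint32_t flow_be)
{
	return ntohl(flow_be) & FLOWLABEL_MASK_HOST;
}

/* Convert a host-order flow label back to big-endian. */
static uint32_t flowlabel_to_be(uint32_t label)
{
	return htonl(label & FLOWLABEL_MASK_HOST);
}
```

Because both directions go through an explicit conversion, the result is the same on little-endian and big-endian hosts, which a hand-written shift does not guarantee.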
15 May, 2015
1 commit
-
It was reported that traceroute6 would cause
a kernel to crash when trying to compute checksums
on raw UDP packets. The cause was the check in
__ip6_append_data that would attempt to use
partial checksums on the packet. However,
raw sockets do not initialize partial checksum
fields so partial checksums can't be used.

Solve this the same way IPv4 does it. raw sockets
pass transhdrlen value of 0 to ip_append_data which
causes the checksum to be computed in software. Use
the same check in ip6_append_data (check transhdrlen).
Reported-by: Wolfgang Walter
CC: Wolfgang Walter
CC: Eric Dumazet
Signed-off-by: Vladislav Yasevich
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
13 May, 2015
1 commit
-
I noticed we were only using the low-order bits for IV generation
when ESN is enabled. This is very bad because it means that the
IV can repeat. We must use the full 64 bits.
Signed-off-by: Herbert Xu
Signed-off-by: Steffen Klassert
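To see why the low-order bits alone are not enough: with ESN the sequence number is 64 bits wide, so the low 32 bits repeat every 2^32 packets while the full value does not. A minimal sketch of combining the two halves follows; the function name is hypothetical and how the value is then fed to the IV generator is not shown.

```c
#include <stdint.h>

/* Combine the high- and low-order halves of an extended sequence number. */
static uint64_t esn_seq64(uint32_t seq_hi, uint32_t seq_lo)
{
	return ((uint64_t)seq_hi << 32) | seq_lo;
}
```

Two packets with the same low-order half but different high-order halves now yield distinct inputs, which is the property the fix relies on.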
10 May, 2015
1 commit
-
If there are only IPv6 source specific default routes present, the
host gets -ENETUNREACH on e.g. connect() because ip6_dst_lookup_tail
calls ip6_route_output first, and given source address any, it fails,
and ip6_route_get_saddr is never called.

The change is to use the ip6_route_get_saddr, even if the initial
ip6_route_output fails, and then doing ip6_route_output _again_ after
we have appropriate source address available.

Note that this is '99% fix' to the problem; a correct fix would be to
do route lookups only within addrconf.c when picking a source address,
and never call ip6_route_output before source address has been
populated.
Signed-off-by: Markus Stenberg
Signed-off-by: David S. Miller
24 Apr, 2015
1 commit
-
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
0000000000000080
[ 3897.931025] IP: [] reqsk_timer_handler+0x1a6/0x243

There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.

Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
timer"), listener spinlock was held and race could not happen.

To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and we return a status, to conditionally release a
refcount on the request sock.

This also means tcp_check_req() in non fastopen case might or not
consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
to properly handle this.

(Same remark for dccp_check_req() and its callers)
inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.
Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet
Reported-by: Yuchung Cheng
Signed-off-by: David S. Miller
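The pattern described above (unlink returns a status, and the caller releases a reference only when it actually did the unlink) can be sketched on a plain singly linked list. This is a single-threaded illustration with hypothetical names; the locking the real code needs is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

struct req {
	struct req *next;
	int refcnt;
};

/* Remove req from the singly linked list at *head.
 * Returns true only if req was found and unlinked by this call. */
static bool queue_unlink(struct req **head, struct req *req)
{
	struct req **pp;

	for (pp = head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == req) {
			*pp = req->next;
			req->next = NULL;
			return true;
		}
	}
	return false;	/* someone else already unlinked it */
}

/* Drop the queue's reference only if this caller did the unlink. */
static void queue_drop(struct req **head, struct req *req)
{
	if (queue_unlink(head, req))
		req->refcnt--;
}
```

A second queue_drop() on the same request is then harmless: the unlink reports "not found" and the reference count is left alone.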
23 Apr, 2015
1 commit
-
The code there just open-codes the same, so use the provided macro instead.
Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller
15 Apr, 2015
2 commits
-
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next

A final pull request, I know it's very late but this time I think it's worth a
bit of rush.

The following patchset contains Netfilter/nf_tables updates for net-next, more
specifically concatenation support and dynamic stateful expression
instantiation.

This also comes with a couple of small patches. One to fix the ebtables.h
userspace header and another to get rid of an obsolete example file in tree
that describes a nf_tables expression.

This time, I decided to paste the original descriptions. This will result in a
rather large commit description, but I think these bytes to keep.

Patrick McHardy says:
====================
netfilter: nf_tables: concatenation support

The following patches add support for concatenations, which allow multi
dimensional exact matches in O(1).

The basic idea is to split the data registers, currently consisting of
4 registers of 16 bytes each, into smaller units, 16 registers of 4
bytes each, and making sure each register store always leaves the
full 32 bit in a well defined state, meaning smaller stores will
zero the remaining bits.

Based on that, we can load multiple adjacent registers with different
values, thereby building a concatenated bigger value, and use that
value for set lookups.

Sets are changed to use variable sized extensions for their key and
data values, removing the fixed limit of 16 bytes while saving memory
if less space is needed.

As a side effect, these patches will allow some nice optimizations in
the future, like using jhash2 in nft_hash, removing the masking in
nft_cmp_fast, optimized data comparison using 32 bit word size etc.
These are not done so far however.

The patches are split up as follows:

* the first five patches add length validation to register loads and
stores to make sure we stay within bounds and prepare the validation
functions for the new addressing mode

* the next patches prepare for changing to 32 bit addressing by
introducing a struct nft_regs, which holds the verdict register as
well as the data registers. The verdict members are moved to a new
struct nft_verdict to allow to pull struct nft_data out of the stack.

* the next patches contain preparatory conversions of expressions and
sets to use 32 bit addressing

* the next patch introduces so far unused register conversion helpers
for parsing and dumping register numbers over netlink

* following is the real conversion to 32 bit addressing, consisting of
replacing struct nft_data in struct nft_regs by an array of u32s and
actually translating and validating the new register numbers.

* the final two patches add support for variable sized data items and
variable sized keys / data in set elements

The patches have been verified to work correctly with nft binaries using
both old and new addressing.
====================

Patrick McHardy says:
====================
netfilter: nf_tables: dynamic stateful expression instantiation

The following patches are the grand finale of my nf_tables set work,
using all the building blocks put in place by the previous patches
to support something like iptables hashlimit, but a lot more powerful.

Sets are extended to allow attaching expressions to set elements.
The dynset expression dynamically instantiates these expressions
based on a template when creating new set elements and evaluates
them for all new or updated set members.

In combination with concatenations this effectively creates state
tables for arbitrary combinations of keys, using the existing
expression types to maintain that state. Regular set GC takes care
of purging expired states.

We currently support two different stateful expressions, counter
and limit. Using limit as a template we can express the functionality
of hashlimit, but completely unrestricted in the combination of keys.
Using counter we can perform accounting for arbitrary flows.

The following examples from patch 5/5 show some possibilities.
Userspace syntax is still WIP, especially the listing of state
tables will most likely be separated from normal set listings
and use a more structured format:

1. Limit the rate of new SSH connections per host, similar to iptables
hashlimit:

flow ip saddr timeout 60s \
limit 10/second \
accept

2. Account network traffic between each set of /24 networks:

flow ip saddr & 255.255.255.0 . ip daddr & 255.255.255.0 \
counter

3. Account traffic to each host per user:

flow skuid . ip daddr \
counter

4. Account traffic for each combination of source address and TCP flags:

flow ip saddr . tcp flags \
counter

The resulting set content after a Xmas-scan look like this:

{
192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
192.168.122.1 . ack : counter packets 74 bytes 3848,
192.168.122.1 . psh | ack : counter packets 35 bytes 3144
}

In the future the "expressions attached to elements" will be extended
to also support user created non-stateful expressions to allow to
efficiently select between a set of parameter sets, f.i. a set of log
statements with different prefixes based on the interface, which currently
require one rule each. This will most likely have to wait until the next
kernel version though.
====================
Signed-off-by: David S. Miller
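The register-store rule from the concatenation description above (16 registers of 4 bytes, with short stores zeroing the rest of the last word) can be sketched in userspace as follows. The function name, the constant and the bounds policy are assumptions made for illustration; this is not the nf_tables implementation.

```c
#include <stdint.h>
#include <string.h>

#define REG32_COUNT 16	/* 16 registers of 4 bytes, as described above */

/* Store len bytes at register index reg, zeroing the last touched word
 * first so a short store leaves no stale bytes. Returns 0 on success,
 * -1 if the store would fall outside the register file. */
static int reg_store(uint32_t regs[REG32_COUNT], unsigned int reg,
		     const void *data, size_t len)
{
	size_t words = (len + 3) / 4;

	if (len == 0 || reg >= REG32_COUNT || words > REG32_COUNT - reg)
		return -1;
	regs[reg + words - 1] = 0;	/* pad the final word */
	memcpy(&regs[reg], data, len);
	return 0;
}
```

With every word left in a well-defined state, two values stored into adjacent registers form one contiguous key that can be hashed or compared as a whole, which is what makes the concatenated set lookup possible.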
-
The dwmac-socfpga.c conflict was a case of a bug fix overlapping
changes in net-next to handle an error pointer differently.
Signed-off-by: David S. Miller
14 Apr, 2015
1 commit
-
Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:
On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :
About 90% increase of throughput :
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
13 Apr, 2015
2 commits
-
Switch the nf_tables registers from 128 bit addressing to 32 bit
addressing to support so called concatenations, where multiple values
can be concatenated over multiple registers for O(1) exact matches of
multiple dimensions using sets.

The old register values are mapped to areas of 128 bits for compatibility.
When dumping register numbers, values are expressed using the old values
if they refer to the beginning of a 128 bit area for compatibility.

To support concatenations, register loads of less than a full 32 bit
value need to be padded. This mainly affects the payload and exthdr
expressions, which both unconditionally zero the last word before
copying the data.

Userspace fully passes the testsuite using both old and new register
addressing.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
Replace the array of registers passed to expressions by a struct nft_regs,
containing the verdict as a separate member, which aliases to the
NFT_REG_VERDICT register.

This is needed to separate the verdict from the data registers completely,
so their size can be changed.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
10 Apr, 2015
1 commit
-
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next tree.
They are:

* nf_tables set timeout infrastructure from Patrick Mchardy.

1) Add support for set timeout support.
2) Add support for set element timeouts using the new set extension
infrastructure.

4) Add garbage collection helper functions to get rid of stale elements.
Elements are accumulated in a batch that are asynchronously released
via RCU when the batch is full.

5) Add garbage collection synchronization helpers. This introduces a new
element busy bit to address concurrent access from the netlink API and the
garbage collector.

5) Add timeout support for the nft_hash set implementation. The garbage
collector periodically checks for stale elements from the workqueue.

* iptables/nftables cgroup fixes:

6) Ignore non full-socket objects from the input path, otherwise cgroup
match may crash, from Daniel Borkmann.

7) Fix cgroup in nf_tables.
8) Save some cycles from xt_socket by skipping packet header parsing when
skb->sk is already set because of early demux. Also from Daniel.

* br_netfilter updates from Florian Westphal.

9) Save frag_max_size and restore it from the forward path too.
10) Use a per-cpu area to restore the original source MAC address when traffic
is DNAT'ed.

11) Add helper functions to access physical devices.
12) Use these new physdev helper function from xt_physdev.
13) Add another nf_bridge_info_get() helper function to fetch the br_netfilter
state information.

14) Annotate original layer 2 protocol number in nf_bridge info, instead of
using kludgy flags.

15) Also annotate the pkttype mangling when the packet travels back and forth
from the IP to the bridge layer, instead of using a flag.

* More nf_tables set enhancement from Patrick:

16) Fix possible usage of set variant that doesn't support timeouts.
17) Avoid spurious "set is full" errors from Netlink API when there are pending
stale elements scheduled to be released.

18) Restrict loop checks to set maps.
19) Add support for dynamic set updates from the packet path.
20) Add support to store optional user data (eg. comments) per set element.
BTW, I have also pulled net-next into nf-next to anticipate the conflict
resolution between your okfn() signature changes and Florian's br_netfilter
updates.
====================
Signed-off-by: David S. Miller