10 Jan, 2014
15 commits
-
We have to unregister the chain type if netns registration fails.
Signed-off-by: Pablo Neira Ayuso
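The error path described in this message follows the usual init-time rollback pattern: whatever was registered first has to be unregistered again if a later step fails. A minimal Python model of that ordering is sketched below. It is not kernel code, and `Registry`, `module_init` and the `fail_netns` switch are hypothetical names used only for illustration.

```python
import errno


class Registry:
    """Toy stand-in for the kernel's registration tables (hypothetical)."""

    def __init__(self, fail_netns=False):
        self.chain_types = set()
        self.netns_ops = set()
        self.fail_netns = fail_netns  # simulate a failing netns registration

    def register_chain_type(self, name):
        self.chain_types.add(name)

    def unregister_chain_type(self, name):
        self.chain_types.discard(name)

    def register_netns(self, name):
        if self.fail_netns:
            return -errno.ENOMEM
        self.netns_ops.add(name)
        return 0


def module_init(reg, name):
    """Register the chain type, then the netns ops; undo step one on failure."""
    reg.register_chain_type(name)
    ret = reg.register_netns(name)
    if ret < 0:
        # Roll back so a failed init leaves nothing registered.
        reg.unregister_chain_type(name)
    return ret
```

In this model a failed `module_init()` returns the negative error code and leaves `chain_types` empty, which is the property the fix is after.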
-
We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
We currently leak the set memory when deleting a table that still has
sets in it. Return EBUSY when attempting to delete a table with sets.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
The table refers to data of the AF module, so we need to make sure the
module isn't unloaded while the table exists.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
Simplifies error handling. Additionally use the correct type u32 for the
host byte order flags value.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
Minor nf_chain_type cleanups:
- reorder struct to plug a hole
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
To avoid races, we need to replay the request after dropping the nfnl_mutex
to auto-load the chain type module.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
In some cases we neither take a reference to the AF info nor to the
chain type, allowing the module to be unloaded while in use.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
The chain type module reference handling makes no sense at all: we take
a reference immediately when the module is registered, preventing the
module from ever being unloaded.

Fix by taking a reference when we're actually creating a chain of the
chain type and release the reference when destroying the chain.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
The table use counter is only increased for new chains, so move the check
to the correct position.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
Chain counter validation is performed after the chain policy has
potentially been changed. Move counter validation/setting before
changing of the chain policy to fix this.

Additionally fix a memory leak if chain counter allocation fails
for new chains, remove an unnecessary free_percpu() and move
counter allocation for new chains
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
Currently nf_tables_newchain() atomicity is broken because of having
validation of some netlink attributes performed after changing attributes
of the chain. The chain policy is (currently) fine, but split it up as
preparation for the following fixes and to avoid future mistakes.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
We have to validate that the input register is in the range of
allowed registers, otherwise we can take an incorrect register
value as input that may lead us to a crash.
Signed-off-by: Pablo Neira Ayuso
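The check described in this message amounts to rejecting any register index outside the allowed range before it is ever used to index the register file. A minimal Python model is sketched below. The register count `NUM_REGISTERS = 5` and the names `validate_input_register` and `load_register` are assumptions made for illustration, not the kernel's definitions.

```python
import errno

NUM_REGISTERS = 5  # assumed size of the register file, for illustration only


def validate_input_register(reg):
    """Return 0 if reg is a usable index, else a negative errno."""
    if not isinstance(reg, int) or reg < 0 or reg >= NUM_REGISTERS:
        return -errno.ERANGE
    return 0


def load_register(registers, reg):
    """Read a register only after its index has been validated."""
    err = validate_input_register(reg)
    if err < 0:
        raise ValueError(f"register {reg} out of range (err={err})")
    return registers[reg]
```

Validating at setup time, rather than at each evaluation, keeps the per-packet path free of bounds checks while still ruling out reads beyond the register file.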
-
This patch adds kernel support for setting properties of tracked
connections. Currently, only connmark is supported. One use-case
for this feature is to provide the same functionality as
-j CONNMARK --save-mark in iptables.

Some restructuring was needed to implement the set op. The new
structure follows that of nft_meta.
Signed-off-by: Kristian Evensen
Signed-off-by: Pablo Neira Ayuso
08 Jan, 2014
12 commits
-
The ct expression can currently not be used in the inet family since
we don't have a conntrack module for NFPROTO_INET, so
nf_ct_l3proto_try_module_get() fails. Add some manual handling to
load the modules for both NFPROTO_IPV4 and NFPROTO_IPV6 if the
ct expression is used in the inet family.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
For L3-proto independent rules we need to get at the L4 protocol value
directly. Add it to the nft_pktinfo struct and use the meta expression
to retrieve it.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
Needed by multi-family tables to distinguish IPv4 and IPv6 packets.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
This patch adds a new table family and a new filter chain that you can
use to attach IPv4 and IPv6 rules. This should help to simplify
rule-set maintenance in dual-stack setups.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
Add support to register chains to multiple hooks for different address
families for mixed IPv4/IPv6 tables.
Signed-off-by: Patrick McHardy
-
Multi-family tables need the AF from the hook ops. Add a pointer to the
hook ops and replace usage of the hooknum member in struct nft_pktinfo.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
Currently the AF-specific hook functions override the chain-type specific
hook functions. That doesn't make too much sense since the chain types
are a special case of the AF-specific hooks.

Make the AF-specific hook functions the default and make the optional
chain type hooks override them.

As a side effect, the necessary code restructuring reduces the code size,
for instance in the case of nf_tables_ipv4.o:

nf_tables_ipv4_init_net | -24
nft_do_chain_ipv4 | -113
2 functions changed, 137 bytes removed, diff: -137
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
-
net/netfilter/nft_reject.c: In function 'nft_reject_eval':
net/netfilter/nft_reject.c:37:14: warning: unused variable 'net' [-Wunused-variable]
Reported-by: kbuild test robot
Signed-off-by: Pablo Neira Ayuso
-
There are many cases where this feature does not improve performance or even
reduces it.

For example, here are the results from tests that I've run using 3.12.6 on one
Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
from the Xeon, but they're similar on the i7. All numbers report the
mean±stddev over 10 runs of 10s.

1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache
copy from user on transmit"
There is no statistically significant difference between tx-nocache-copy
on/off.
nic irqs spread out (one queue per cpu)

200x netperf -r 1400,1
tx-nocache-copy off
692000±1000 tps
50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
tx-nocache-copy on
693000±1000 tps
50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7

200x netperf -r 14000,14000
tx-nocache-copy off
86450±80 tps
50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
tx-nocache-copy on
86110±60 tps
50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20

2) single stream throughput tests
tx-nocache-copy leads to higher service demand

                      throughput  cpu0        cpu1         demand
                      (Mb/s)      (Gcycle)    (Gcycle)     (cycle/B)

nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)
tx-nocache-copy off   9402±5      9.4±0.2                  0.80±0.01
tx-nocache-copy on    9403±3      9.85±0.04                0.838±0.004

nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)
tx-nocache-copy off   9401±5      5.83±0.03   5.0±0.1      0.923±0.007
tx-nocache-copy on    9404±2      5.74±0.03   5.523±0.009  0.958±0.002

As a second example, here are some results from Eric Dumazet with latest
net-next.
tx-nocache-copy also leads to higher service demand

(cpu is Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)
lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000
Performance counter stats for './netperf -H lpq84 -c':
4282.648396 task-clock # 0.423 CPUs utilized
9,348 context-switches # 0.002 M/sec
88 CPU-migrations # 0.021 K/sec
355 page-faults # 0.083 K/sec
11,812,797,651 cycles # 2.758 GHz [82.79%]
9,020,522,817 stalled-cycles-frontend # 76.36% frontend cycles idle [82.54%]
4,579,889,681 stalled-cycles-backend # 38.77% backend cycles idle [67.33%]
6,053,172,792 instructions # 0.51 insns per cycle
# 1.49 stalled cycles per insn [83.64%]
597,275,583 branches # 139.464 M/sec [83.70%]
8,960,541 branch-misses # 1.50% of all branches [83.65%]
10.128990264 seconds time elapsed
lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000
Performance counter stats for './netperf -H lpq84 -c':
2847.375441 task-clock # 0.281 CPUs utilized
11,632 context-switches # 0.004 M/sec
49 CPU-migrations # 0.017 K/sec
354 page-faults # 0.124 K/sec
7,646,889,749 cycles # 2.686 GHz [83.34%]
6,115,050,032 stalled-cycles-frontend # 79.97% frontend cycles idle [83.31%]
1,726,460,071 stalled-cycles-backend # 22.58% backend cycles idle [66.55%]
2,079,702,453 instructions # 0.27 insns per cycle
# 2.94 stalled cycles per insn [83.22%]
363,773,213 branches # 127.757 M/sec [83.29%]
4,242,732 branch-misses # 1.17% of all branches [83.51%]
10.128449949 seconds time elapsed
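The "demand (cycle/B)" column in the single-stream results of this message can be reproduced from the other columns: the cycles spent on both CPUs divided by the bytes sent during one run. The sketch below assumes the throughput figures are in Mb/s (the only reading consistent with a 10GbE ixgbe link) and uses the 10 s run length stated in the message. `service_demand` is a hypothetical helper name.

```python
def service_demand(throughput_mbps, gcycles, duration_s=10.0):
    """Cycles spent per byte sent.

    throughput_mbps: mean throughput in 10^6 bits/s (assumed unit)
    gcycles: iterable of per-CPU cycle counts, in 10^9 cycles
    duration_s: length of one run in seconds
    """
    bytes_sent = throughput_mbps * 1e6 * duration_s / 8
    total_cycles = sum(gcycles) * 1e9
    return total_cycles / bytes_sent


# nic irqs and netperf on cpu0, tx-nocache-copy off
same_cpu_off = service_demand(9402, [9.4])
# nic irqs on cpu0, netperf on cpu1, tx-nocache-copy off / on
split_off = service_demand(9401, [5.83, 5.0])
split_on = service_demand(9404, [5.74, 5.523])
```

With these inputs the helper gives roughly 0.80, 0.92 and 0.96 cycles per byte, in line with the reported 0.80±0.01, 0.923±0.007 and 0.958±0.002.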
CC: Tom Herbert
Signed-off-by: Benjamin Poirier
Signed-off-by: David S. Miller
-
When lo is brought up, new ifa is created. Then, devconf and neigh values
bitfield should be set so later changes of default values would not
affect lo values.

Note that the same behaviour is in ipv6. Also note that this is likely
not an issue in many distros (for example Fedora 19) because userspace
sets address to lo manually before bringing it up.
Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller
-
This change makes it possible to follow a recommendation of RFC4942.
- Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
as source addresses for ICMPv6 echo reply. This sysctl is false by default
to preserve existing behavior.
- Add inline check ipv6_anycast_destination().
- Use them in icmpv6_echo_reply().

Reference:
RFC4942 - IPv6 Transition/Coexistence Security Considerations
(http://tools.ietf.org/html/rfc4942#section-2.1.6)

2.1.6. Anycast Traffic Identification and Security
[...]
To avoid exposing knowledge about the internal structure of the
network, it is recommended that anycast servers now take advantage of
the ability to return responses with the anycast address as the
source address if possible.
Signed-off-by: Francois-Xavier Le Bail
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller
-
Fix to return a negative error code from the error handling
case instead of 0.
Fixes: 837052d0ccc5 ('net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling')
Signed-off-by: Wei Yongjun
Signed-off-by: David S. Miller
07 Jan, 2014
13 commits
-
- Replace pr_warn_ratelimited() with net_ratelimit() and netdev_warn().
- Adjust the alignment of some messages.
- Remove the period.
- Fix some messages that don't have a terminating newline.
Signed-off-by: Hayes Wang
Signed-off-by: David S. Miller
-
> as reported for linux-next of Dec.20, 2013
> when CONFIG_NEED_DMA_MAP_STATE is not enabled:
>
> drivers/net/ethernet/brocade/bna/bnad.c: In function 'bnad_start_xmit':
> drivers/net/ethernet/brocade/bna/bnad.c:3074:26: error: 'struct bnad_tx_vector' has no member named 'dma_len'
Reported-by: Randy Dunlap
Signed-off-by: David S. Miller
-
GRO/GSO layers can be enabled on a node, even if said
node is only forwarding packets.

This patch permits GSO (and upcoming GRO) support for GRE
encapsulated packets, even if the host has no GRE tunnel setup.
Signed-off-by: Eric Dumazet
Cc: H.K. Jerry Chu
Signed-off-by: David S. Miller
-
This fixes some typos found by Sergei.
Reported-by: Sergei Shtylyov
Signed-off-by: Hauke Mehrtens
Signed-off-by: David S. Miller
-
Jesse Gross says:
====================
[GIT net-next] Open vSwitch

Open vSwitch changes for net-next/3.14. Highlights are:
* Performance improvements in the mechanism to get packets to userspace
using memory mapped netlink and skb zero copy where appropriate.
* Per-cpu flow stats in situations where flows are likely to be shared
across CPUs. Standard flow stats are used in other situations to save
memory and allocation time.
* A handful of code cleanups and rationalization.
====================
Signed-off-by: David S. Miller
-
Several functions and data structures could be local.
Found with 'make namespacecheck'.
Signed-off-by: Stephen Hemminger
Signed-off-by: Jesse Gross
-
The copy & csum optimization is no longer present with zerocopy
enabled. Compute the checksum in skb_gso_segment() directly by
dropping the HW CSUM capability from the features passed in.
Signed-off-by: Thomas Graf
Signed-off-by: Jesse Gross
-
Use of skb_zerocopy() can avoid the expensive call to memcpy()
when copying the packet data into the Netlink skb. Completes
checksum through skb_checksum_help() if not already done in
GSO segmentation.

Zerocopy is only performed if user space supported unaligned
Netlink messages. Memory mapped netlink i/o is preferred over
zerocopy if it is set up.

Cost of upcall is significantly reduced from:
+ 7.48% vhost-8471 [k] memcpy
+ 5.57% ovs-vswitchd [k] memcpy
+ 2.81% vhost-8471 [k] csum_partial_copy_generic

to:
+ 5.72% ovs-vswitchd [k] memcpy
+ 3.32% vhost-5153 [k] memcpy
+ 0.68% vhost-5153 [k] skb_zerocopy

(megaflows disabled)
Signed-off-by: Thomas Graf
Signed-off-by: Jesse Gross
-
Allows removing the net and dp_ifindex arguments and simplifies the
code.
Signed-off-by: Thomas Graf
Signed-off-by: Jesse Gross
-
Drop user features if an outdated user space instance that does not
understand the concept of user_features attempted to create a new
datapath.
Signed-off-by: Thomas Graf
Signed-off-by: Jesse Gross
-
Signed-off-by: Thomas Graf
Reviewed-by: Daniel Borkmann
Signed-off-by: Jesse Gross
-
Make the skb zerocopy logic written for nfnetlink queue available for
use by other modules.
Signed-off-by: Thomas Graf
Reviewed-by: Daniel Borkmann
Acked-by: David S. Miller
Signed-off-by: Jesse Gross
-
Remove duplicated include.
Signed-off-by: Wei Yongjun
Signed-off-by: Jesse Gross