09 Jan, 2018
40 commits
-
This new bit tells us that the conntrack entry is owned by the flow
table offload infrastructure.
# cat /proc/net/nf_conntrack
ipv4 2 tcp 6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] mark=0 zone=0 use=2
Note the [OFFLOAD] tag in the listing.
The timer of such conntrack entries looks stopped from userspace, and
they display no internal state in the case of TCP flows. In practice,
to make sure the conntrack entry does not go away, the conntrack timer
is periodically set to an arbitrarily large value that gets refreshed
on every iteration of the garbage collector, so it never expires. This
allows us to save a bit check from the packet path via
nf_ct_is_expired().
Conntrack entries that have been offloaded to the flow table
infrastructure cannot be deleted/flushed via ctnetlink. The flow table
infrastructure is also responsible for releasing this conntrack entry.
Signed-off-by: Pablo Neira Ayuso
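A rough userspace sketch of the idea in the commit message above: the garbage collector keeps pushing the expiry of offloaded entries far into the future, so the packet path only needs the usual timeout comparison. All names (sketch_conn, SKETCH_OFFLOAD_BIT, SKETCH_FAR_FUTURE, ...) are hypothetical stand-ins and not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and values; this models the idea, not the kernel code. */
#define SKETCH_OFFLOAD_BIT (1u << 0)   /* entry is owned by the flow table */
#define SKETCH_FAR_FUTURE  (1u << 30)  /* arbitrary large timeout, in ticks */

struct sketch_conn {
	uint32_t status;   /* flag bits */
	uint32_t timeout;  /* absolute expiry tick */
};

/* Packet path: one timeout comparison, no separate test of the offload bit. */
static bool sketch_is_expired(const struct sketch_conn *ct, uint32_t now)
{
	return (int32_t)(ct->timeout - now) <= 0;
}

/* One garbage collector pass: refresh the timeout of every offloaded entry.
 * Returns the number of entries that were refreshed. */
static unsigned int sketch_gc_pass(struct sketch_conn *tbl, unsigned int n,
				   uint32_t now)
{
	unsigned int refreshed = 0;

	assert(tbl != NULL || n == 0);
	for (unsigned int i = 0; i < n; i++) {
		if (tbl[i].status & SKETCH_OFFLOAD_BIT) {
			tbl[i].timeout = now + SKETCH_FAR_FUTURE;
			refreshed++;
		}
	}
	return refreshed;
}
```

As long as the collector runs more often than SKETCH_FAR_FUTURE ticks apart, an offloaded entry never looks expired to sketch_is_expired().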
-
This macro is unnecessary, it just hides details for one single caller.
nfnl_dereference() is just enough.
Signed-off-by: Pablo Neira Ayuso
-
Users cannot forge malformed IPv4/IPv6 headers via raw sockets that they
can inject into the stack. Specifically, not for IPv4 since 55888dfb6ba7
("AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl
(v2)"). IPv6 raw sockets also ensure that packets have a well-formed
IPv6 header available in the skbuff.
At a quick glance, br_netfilter also validates layer 3 headers and it
drops both malformed IPv4 and IPv6 packets.
Therefore, let's remove this defensive check all over the place.
Signed-off-by: Pablo Neira Ayuso
-
replacement for iptables "-m policy --dir in --policy {ipsec,none}".
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
This abstraction has no clients anymore, remove it.
This is what remains from previous authors, so correct copyright
statement after recent modifications and code removal.
Signed-off-by: Pablo Neira Ayuso
-
This is only needed by nf_queue, place this code where it belongs.
Signed-off-by: Pablo Neira Ayuso
-
We cannot make a direct call to nf_ip6_reroute() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define reroute indirection in nf_ipv6_ops where this really
belongs.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty inline
stub for IPv4 in such a case.
Signed-off-by: Pablo Neira Ayuso
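The pattern described above (indirection through an ops structure for the modular IPv6 side, a direct call plus an empty inline stub for IPv4) could look roughly like this in plain C. This is only a sketch: sketch_v6_ops, SKETCH_NO_INET and the return values are hypothetical and do not mirror the real nf_ipv6_ops layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ops table filled in by the IPv6 side. */
struct sketch_v6_ops {
	int (*reroute)(int pkt_id);
};

/* NULL until the IPv6 side registers itself; no symbol dependency on it. */
static const struct sketch_v6_ops *sketch_v6_ops_ptr;

static void sketch_register_v6_ops(const struct sketch_v6_ops *ops)
{
	assert(ops != NULL);
	sketch_v6_ops_ptr = ops;
}

/* What the IPv6 side would provide; here it simply echoes the packet id. */
static int sketch_ip6_reroute_impl(int pkt_id)
{
	return pkt_id;
}

static const struct sketch_v6_ops sketch_v6_module_ops = {
	.reroute = sketch_ip6_reroute_impl,
};

#ifndef SKETCH_NO_INET
/* IPv4 is assumed to be built in, so a direct call is possible. */
static int sketch_ip4_reroute(int pkt_id)
{
	return pkt_id >= 0 ? 0 : -1;
}
#else
/* Empty inline stub for builds without IPv4 support. */
static inline int sketch_ip4_reroute(int pkt_id)
{
	(void)pkt_id;
	return 0;
}
#endif

static int sketch_reroute(int family, int pkt_id)
{
	switch (family) {
	case 4:
		return sketch_ip4_reroute(pkt_id);            /* direct call */
	case 6:
		if (sketch_v6_ops_ptr && sketch_v6_ops_ptr->reroute)
			return sketch_v6_ops_ptr->reroute(pkt_id); /* indirect call */
		return 0;                                     /* IPv6 side not registered */
	default:
		return 0;
	}
}
```

The caller never references an IPv6 symbol directly, which is what avoids pulling in the module through symbol dependencies.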
-
We cannot make a direct call to nf_ip6_route() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define route indirection in nf_ipv6_ops where this really
belongs.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty inline
stub for IPv4 in such a case.
Signed-off-by: Pablo Neira Ayuso
-
This is only used by nf_queue.c and this function comes with no symbol
dependencies with IPv6, it just refers to structure layouts. Therefore,
we can replace it by a direct function call from where it belongs.
Signed-off-by: Pablo Neira Ayuso
-
We cannot make a direct call to nf_ip6_checksum_partial() because that
would result in autoloading the 'ipv6' module because of symbol
dependencies. Therefore, define checksum_partial indirection in
nf_ipv6_ops where this really belongs.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty inline
stub for IPv4 in such a case.
Signed-off-by: Pablo Neira Ayuso
-
We cannot make a direct call to nf_ip6_checksum() because that would
result in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define checksum indirection in nf_ipv6_ops where this really
belongs.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty inline
stub for IPv4 in such a case.
Signed-off-by: Pablo Neira Ayuso
-
This allows reusing the xt_connlimit infrastructure from nf_tables.
The upcoming nf_tables frontend can just pass in an nftables register
as input key, this allows limiting by any nft-supported key, including
concatenations.
For xt_connlimit, pass in the zone and the ip/ipv6 address.
With help from Yi-Hung Wei.
Signed-off-by: Florian Westphal
Acked-by: Yi-Hung Wei
Signed-off-by: Pablo Neira Ayuso
-
They don't belong to the family definition, move them to the filter
chain type definition instead.
Signed-off-by: Pablo Neira Ayuso
-
Since NFPROTO_INET is handled from the core, we don't need to maintain
extra infrastructure in nf_tables to handle the double hook
registration, one for IPv4 and another for IPv6.
Signed-off-by: Pablo Neira Ayuso
-
Use new native NFPROTO_INET support in netfilter core, this gets rid of
ad-hoc code in the nf_tables API codebase.
Signed-off-by: Pablo Neira Ayuso
-
Expand NFPROTO_INET in two hook registrations, one for NFPROTO_IPV4 and
another for NFPROTO_IPV6. Hence, we handle NFPROTO_INET from the core.
Signed-off-by: Pablo Neira Ayuso
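A minimal sketch of what expanding one inet registration into an IPv4 and an IPv6 registration might look like. The names, the per-family counters and the rollback on partial failure are assumptions of this sketch, not taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical family identifiers; SK_INET stands for "IPv4 and IPv6". */
enum sketch_family { SK_IPV4, SK_IPV6, SK_INET, SK_FAMILY_MAX };

#define SKETCH_MAX_HOOKS 8

struct sketch_hook_ops {
	enum sketch_family pf;
	int hooknum;
};

/* Number of hooks registered per concrete family. */
static int sketch_nhooks[SK_FAMILY_MAX];

static int sketch_register_one(enum sketch_family pf)
{
	if (sketch_nhooks[pf] >= SKETCH_MAX_HOOKS)
		return -1;
	sketch_nhooks[pf]++;
	return 0;
}

static void sketch_unregister_one(enum sketch_family pf)
{
	if (sketch_nhooks[pf] > 0)
		sketch_nhooks[pf]--;
}

/* The core expands SK_INET itself, so callers register only once. */
static int sketch_register_hook(const struct sketch_hook_ops *ops)
{
	int err;

	assert(ops != NULL);
	if (ops->pf != SK_INET)
		return sketch_register_one(ops->pf);

	err = sketch_register_one(SK_IPV4);
	if (err)
		return err;
	err = sketch_register_one(SK_IPV6);
	if (err)
		sketch_unregister_one(SK_IPV4);   /* roll back the first half */
	return err;
}
```

With this in the core, a frontend no longer needs its own code for the double registration.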
-
So static_key_slow_dec applies to the family behind NFPROTO_INET.
Signed-off-by: Pablo Neira Ayuso
-
Instead of passing struct nf_hook_ops, this is needed by follow up
patches to handle NFPROTO_INET from the core.
Signed-off-by: Pablo Neira Ayuso
-
Just a cleanup, __nf_unregister_net_hook() is used by a follow up patch
when handling NFPROTO_INET as a real family from the core.
Signed-off-by: Pablo Neira Ayuso
-
Add helper function to test for the NFT_SET_ANONYMOUS flag.
Signed-off-by: Pablo Neira Ayuso
-
Instead of calling this function from the family specific variant, this
reduces the code size in the fast path for the netdev, bridge and inet
families. After this change, we must call nft_set_pktinfo() upfront from
the chain hook indirection.
Before:
   text    data     bss     dec     hex filename
   2145     208       0    2353     931 net/netfilter/nf_tables_netdev.o
After:
   text    data     bss     dec     hex filename
   2125     208       0    2333     91d net/netfilter/nf_tables_netdev.o
Signed-off-by: Pablo Neira Ayuso
-
46928a0b49f3 ("netfilter: nf_tables: remove multihook chains and
families") already removed this, this is a leftover.
Signed-off-by: Pablo Neira Ayuso
-
No problem for iptables as priorities are fixed values defined in the
nat modules, but in nftables the priority is coming from userspace.
Reject in case we see that such a hook would not work.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
The netfilter NAT core cannot deal with more than one NAT hook per hook
location (prerouting, input ...), because the NAT hooks install a NAT null
binding in case the iptables nat table (iptable_nat hooks) or the
corresponding nftables chain (nft nat hooks) doesn't specify a nat
transformation.
Null bindings are needed to detect port collisions between NAT-ed and
non-NAT-ed connections.
This causes nftables NAT rules to not work when the iptable_nat module is
loaded, and vice versa, because the nat binding has already been attached
when the second nat hook is consulted.
The netfilter core is not really the correct location to handle this
(hooks are just hooks, the core has no notion of what kinds of side
effects a hook implements), but it's the only place where we can check
for conflicts between both iptables hooks and nftables hooks without
adding dependencies.
So add a nat annotation to hook_ops to describe those hooks that will
add NAT bindings and then make the core reject if such a hook already exists.
The annotation fills a padding hole; in case further restrictions appear
we might change this to a 'u8 type' instead of bool.
iptables error if nft nat hook active:
iptables -t nat -A POSTROUTING -j MASQUERADE
iptables v1.4.21: can't initialize iptables table `nat': File exists
Perhaps iptables or your kernel needs to be upgraded.
nftables error if iptables nat table present:
nft -f /etc/nftables/ipv4-nat
/usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists
table nat {
^^
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
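To illustrate the NAT hook conflict check from the preceding commit message: a sketch of a registration function that refuses a second hook flagged as NAT at the same hook location. The "File exists" text in the error listings is what userspace prints for EEXIST; the structure, table sizes and other error codes here are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NUMHOOKS     5   /* hook locations (prerouting, input, ...) */
#define SKETCH_MAX_PER_HOOK 4   /* illustrative capacity per location */

struct sketch_hook_ops {
	int hooknum;
	int priority;
	bool nat_hook;   /* annotation: this hook installs NAT bindings */
};

static const struct sketch_hook_ops *sketch_hooks[SKETCH_NUMHOOKS][SKETCH_MAX_PER_HOOK];
static int sketch_count[SKETCH_NUMHOOKS];

static int sketch_register(const struct sketch_hook_ops *ops)
{
	int h;

	assert(ops != NULL);
	h = ops->hooknum;
	if (h < 0 || h >= SKETCH_NUMHOOKS)
		return -EINVAL;
	if (sketch_count[h] >= SKETCH_MAX_PER_HOOK)
		return -ENOSPC;

	/* At most one NAT hook per hook location, whoever registered it. */
	if (ops->nat_hook) {
		for (int i = 0; i < sketch_count[h]; i++)
			if (sketch_hooks[h][i]->nat_hook)
				return -EEXIST;
	}

	sketch_hooks[h][sketch_count[h]++] = ops;
	return 0;
}
```

The check needs no knowledge of which frontend owns a hook; the boolean annotation is enough to detect the conflict.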
Currently we always return -ENOENT to userspace if we can't find
a particular table, or if the table initialization fails.
A followup patch will make nat table init fail in case nftables already
registered a nat hook, so this change makes xt_find_table_lock return
an ERR_PTR to return the errno value reported from the table init
function.
Add xt_request_find_table_lock as try_then_request_module replacement
and use it where needed.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
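The ERR_PTR convention mentioned in the preceding commit message encodes a negative errno value in the returned pointer, so a lookup can report why it failed instead of collapsing every failure into one code. Below is a userspace re-creation of that idea; the sketch_* helpers and the table list are hypothetical, and this is not the kernel's err.h.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer, and decode it again. */
static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Error pointers occupy the topmost SKETCH_MAX_ERRNO addresses. */
static inline int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_table {
	const char *name;
	int init_err;   /* 0, or the errno reported by the table's init */
};

static struct sketch_table sketch_tables[] = {
	{ "filter", 0 },
	{ "nat", -EEXIST },   /* pretend init failed: NAT hook already taken */
};

/* Returns the table, or an encoded errno saying why it is unavailable. */
static struct sketch_table *sketch_find_table(const char *name)
{
	for (size_t i = 0; i < sizeof(sketch_tables) / sizeof(sketch_tables[0]); i++) {
		struct sketch_table *t = &sketch_tables[i];

		if (strcmp(t->name, name) != 0)
			continue;
		if (t->init_err)
			return sketch_err_ptr(t->init_err);
		return t;
	}
	return sketch_err_ptr(-ENOENT);
}
```

A caller checks sketch_is_err() first and only then treats the return value as a table or passes sketch_ptr_err() up to userspace.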
This can be same as NF_INET_NUMHOOKS if we don't support DECNET.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
no need to define hook points if the family isn't supported.
Because we need these hooks for either nftables, arp/ebtables
or the 'call-iptables' hack we have in the bridge layer add two
new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the
users select them.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
no need to define hook points if the family isn't supported.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
Not all families share the same hook count, adjust sizes to what is
needed.
struct net before:
/* size: 6592, cachelines: 103, members: 46 */
after:
/* size: 5952, cachelines: 93, members: 46 */
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
The kernel already has defines for this, but they are in uapi exposed
headers.
Including these from netns.h causes build errors and also adds unneeded
dependencies on headers that we don't need.
So move these defines to netfilter_defs.h and place the uapi ones
in ifndef __KERNEL__ to keep them for userspace.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
struct net contains:
struct nf_hook_entries __rcu *hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
which stores the hook entry point locations for the various protocol
families and the hooks.
Using an array results in compact C code when doing accesses, i.e.
x = rcu_dereference(net->nf.hooks[pf][hook]);
but it's also wasting a lot of memory, as most families are
not used.
So split the array into those families that are used, which
are only 5 (instead of 13). In most cases, the 'pf' argument is
constant, i.e. gcc removes the switch statement.
struct net before:
/* size: 5184, cachelines: 81, members: 46 */
after:
/* size: 4672, cachelines: 73, members: 46 */
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
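A sketch of the split described in the preceding commit message: one array per family that is actually used, selected through a switch that the compiler can fold away when the family is a compile-time constant. The family list, member names and SKETCH_MAX_HOOKS are illustrative assumptions, not the real struct layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical family identifiers; SK_PF_OTHER stands for any unused family. */
enum sketch_pf {
	SK_PF_IPV4, SK_PF_IPV6, SK_PF_ARP, SK_PF_BRIDGE, SK_PF_DECNET, SK_PF_OTHER
};

#define SKETCH_MAX_HOOKS 8

struct sketch_entries {
	int num;
};

/* One array per supported family instead of hooks[NUMPROTO][MAX_HOOKS]. */
struct sketch_net_nf {
	struct sketch_entries *hooks_ipv4[SKETCH_MAX_HOOKS];
	struct sketch_entries *hooks_ipv6[SKETCH_MAX_HOOKS];
	struct sketch_entries *hooks_arp[SKETCH_MAX_HOOKS];
	struct sketch_entries *hooks_bridge[SKETCH_MAX_HOOKS];
	struct sketch_entries *hooks_decnet[SKETCH_MAX_HOOKS];
};

/* Returns the slot for (pf, hook), or NULL for unsupported combinations. */
static struct sketch_entries **sketch_hook_head(struct sketch_net_nf *nf,
						enum sketch_pf pf,
						unsigned int hook)
{
	assert(nf != NULL);
	if (hook >= SKETCH_MAX_HOOKS)
		return NULL;

	switch (pf) {
	case SK_PF_IPV4:
		return &nf->hooks_ipv4[hook];
	case SK_PF_IPV6:
		return &nf->hooks_ipv6[hook];
	case SK_PF_ARP:
		return &nf->hooks_arp[hook];
	case SK_PF_BRIDGE:
		return &nf->hooks_bridge[hook];
	case SK_PF_DECNET:
		return &nf->hooks_decnet[hook];
	default:
		return NULL;   /* family without hook storage */
	}
}
```

The memory for the unused families is simply never allocated, at the price of a lookup helper in place of plain array indexing.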
Giuseppe Scrivano says:
"SELinux, if enabled, registers for each new network namespace 6
netfilter hooks."
Cost for this is high. With synchronize_net() removed:
"The net benefit on an SMP machine with two cores is that creating a
new network namespace takes -40% of the original time."
This patch replaces synchronize_net+kvfree with call_rcu().
We store rcu_head at the tail of a structure that has no fixed layout,
i.e. we cannot use offsetof() to compute the start of the original
allocation. Thus store this information right after the rcu head.
We could simplify this by just placing the rcu_head at the start
of struct nf_hook_entries. However, this structure is used in
packet processing hotpath, so only place what is needed for that
at the beginning of the struct.
Reported-by: Giuseppe Scrivano
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
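The tail layout from the preceding commit message can be modelled in userspace as follows: the callback only receives a pointer to the head at the end of a variable-sized block, so the start of the allocation is stored right after it. Everything here is a hypothetical stand-in; in particular sketch_call_rcu() runs the callback immediately, whereas a real call_rcu() defers it until after a grace period.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_rcu_head {
	void (*func)(struct sketch_rcu_head *head);
};

/* Lives at the tail of the block; its offset depends on the entry count. */
struct sketch_tail {
	struct sketch_rcu_head head;
	void *allocation;   /* start of the original allocation */
};

struct sketch_entries {
	unsigned short num;   /* hot-path data stays at the front */
	void *hooks[];        /* num entries, followed by struct sketch_tail */
};

static int sketch_freed;   /* counts completed frees, for demonstration */

static struct sketch_tail *sketch_tail_of(struct sketch_entries *e)
{
	return (struct sketch_tail *)&e->hooks[e->num];
}

static struct sketch_entries *sketch_alloc(unsigned short num)
{
	size_t sz = offsetof(struct sketch_entries, hooks) +
		    num * sizeof(void *) + sizeof(struct sketch_tail);
	struct sketch_entries *e = calloc(1, sz);

	if (!e)
		return NULL;
	e->num = num;
	sketch_tail_of(e)->allocation = e;
	return e;
}

/* Callback: recover the allocation start from the field after the head. */
static void sketch_free_cb(struct sketch_rcu_head *head)
{
	/* head is the first member of struct sketch_tail, so this cast is valid */
	struct sketch_tail *tail = (struct sketch_tail *)head;

	free(tail->allocation);
	sketch_freed++;
}

/* Stand-in for call_rcu(): invokes the callback right away. */
static void sketch_call_rcu(struct sketch_rcu_head *head,
			    void (*func)(struct sketch_rcu_head *head))
{
	head->func = func;
	head->func(head);
}

static void sketch_entries_free(struct sketch_entries *e)
{
	if (e)
		sketch_call_rcu(&sketch_tail_of(e)->head, sketch_free_cb);
}
```

Keeping the head and the back pointer at the tail leaves the front of the structure free for the data the packet path reads.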
since commit 960632ece6949b ("netfilter: convert hook list to an array")
nfqueue no longer stores a pointer to the hook that caused the packet
to be queued. Therefore no extra synchronize_net() call is needed after
dropping the packets enqueued by the old rule blob.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
This reverts commit d3ad2c17b4047
("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").
Nothing wrong with it. However, a followup patch will delay freeing of hooks
with call_rcu, so all synchronize_net() calls become obsolete and there
is no need anymore for this batching.
This revert causes a temporary performance degradation when destroying
network namespace, but it's resolved with the upcoming call_rcu conversion.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
Change old multi-line comment style to kernel comment style and
remove unwanted comments.
Signed-off-by: Varsha Rao
Signed-off-by: Pablo Neira Ayuso
-
When sets are extremely large we can get softlockup during ipset -L.
We could fix this by adding cond_resched_rcu() at the right location
during iteration, but this only works if RCU nesting depth is 1.
At this time the entire variant->list() is called under rcu_read_lock_bh.
This used to be a read_lock_bh() but as rcu doesn't really lock anything,
it does not appear to be needed, so remove it (ipset increments set
reference count before this, so a set deletion should not be possible).
Reported-by: Li Shuang
Signed-off-by: Florian Westphal
Acked-by: Jozsef Kadlecsik
Signed-off-by: Pablo Neira Ayuso
-
Check that we really hold nfnl mutex here instead of relying on correct
usage alone.
Signed-off-by: Florian Westphal
Acked-by: Jozsef Kadlecsik
Signed-off-by: Pablo Neira Ayuso
-
The ipvsh parameter of frag_safe_skb_hp isn't used now. So remove it and
update the callers too.
Signed-off-by: Gao Feng
Acked-by: Simon Horman
Signed-off-by: Pablo Neira Ayuso
-
Nowadays this is just the default template that is used when setting up
the net namespace, so nothing writes to these locations.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: Gustavo A. R. Silva
Signed-off-by: Simon Horman
Signed-off-by: Pablo Neira Ayuso
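For illustration, an intentional fall-through marked with a comment, which is one of the forms GCC accepts at its default -Wimplicit-fallthrough level. The state machine below is hypothetical and unrelated to the code touched by the patch.

```c
#include <assert.h>

/* Hypothetical connection states, ordered from newest to fully closed. */
enum sketch_state { SK_NEW, SK_ESTABLISHED, SK_CLOSING, SK_CLOSED };

/* Counts the teardown steps still needed; earlier states need all later ones. */
static int sketch_cleanup_steps(enum sketch_state st)
{
	int steps = 0;

	switch (st) {
	case SK_NEW:
		steps++;   /* undo setup */
		/* fall through */
	case SK_ESTABLISHED:
		steps++;   /* send shutdown */
		/* fall through */
	case SK_CLOSING:
		steps++;   /* release resources */
		break;
	case SK_CLOSED:
		break;     /* nothing left to do */
	}
	return steps;
}
```

The comment has to sit directly before the next case label; a case that reaches the next label without such a marker would be reported once the warning is enabled.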