31 Mar, 2020
3 commits
-
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Add support to specify a stateful expression in set definitions,
   this allows users to specify e.g. counters per set element.
2) Flowtable software counter support.
3) Flowtable hardware offload counter support, from wenxu.
4) Parallelize flowtable hardware offload requests, from Paul Blakey.
   This includes a patch to add one work entry per offload command.
5) Several patches to rework nf_queue refcount handling, from Florian
   Westphal.
6) A few fixes for the flowtable tunnel offload: Fix crash if tunneling
   information is missing and set up indirect flow block as TC_SETUP_FT,
   patch from wenxu.
7) Stricter netlink attribute sanity check on filters, from Romain Bellan
   and Florent Fourcot.
8) Annotations to make sparse happy, from Jules Irenge.
9) Improve icmp errors in debugging information, from Haishuang Yan.
10) Fix warning in IPVS icmp error debugging, from Haishuang Yan.
11) Fix endianness issue in tcp extension header, from Sergey Marinkevich.
====================

Signed-off-by: David S. Miller
-
If outer_proto is not set, GCC emits the following warning:
In file included from net/netfilter/ipvs/ip_vs_core.c:52:
net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_in_icmp':
include/net/ip_vs.h:233:4: warning: 'outer_proto' may be used uninitialized in this function [-Wmaybe-uninitialized]
233 | printk(KERN_DEBUG pr_fmt(msg), ##__VA_ARGS__); \
| ^~~~~~
net/netfilter/ipvs/ip_vs_core.c:1666:8: note: 'outer_proto' was declared here
1666 | char *outer_proto;
| ^~~~~~~~~~~

Fixes: 73348fed35d0 ("ipvs: optimize tunnel dumps for icmp errors")
Signed-off-by: Haishuang Yan
Acked-by: Julian Anastasov
Signed-off-by: Pablo Neira Ayuso
-
I hit a problem on MIPS with big-endian enabled: every time netfilter
tried to change the TCP MSS, it returned early because new.v16 was
greater than old.v16. But the real MSS was 1460 and my rule was:

    add rule table chain tcp option maxseg size set 1400

And 1400 is less than 1460, not greater.

I later found that the root cause is the cast from u32 to __be16.

Debugging:

In this example MSS = 1400 (hex: 0x578). Below is the representation of
each byte as it sits in memory, by address from left to right (e.g.
[0x0 0x1 0x2 0x3]). LE is a little-endian system, BE is big-endian, and
the left column is the type.

                         LE               BE
    u32:                 [78 05 00 00]    [00 00 05 78]

As you can see, a cast from u32 to u16 takes its bytes from a different
half of the 4-byte address range depending on endianness. But nf_tables
uses registers to store data of various sizes, and the TCP MSS is stored
in 2 bytes. The registers are still defined as u32:

    struct nft_regs {
            union {
                    u32                     data[20];
                    struct nft_verdict      verdict;
            };
    };

So an access like regs->data[priv->sreg] is exactly a u32. According to
the table above, the per-byte representation of the TCP MSS stored in
the register is:

                         LE               BE
    (u32)regs->data[]:   [78 05 00 00]    [05 78 00 00]
                                           ^^ ^^

We see that the register uses just half of the u32, and the other 2 bytes
may be used for other data. But nft_exthdr_tcp_set_eval() casts it
directly, u32 -> __be16:

    new.v16 = src

A u32 does not fit into a __be16, so the cast keeps the 2 low-order
bytes. For clarity, here is one more table (<xx> marks the bytes used by
the cast).

                         LE                 BE
    u32:                 [<78 05> 00 00]    [00 00 <05 78>]
    (u32)regs->data[]:   [<78 05> 00 00]    [05 78 <00 00>]

As you can see, nothing changes for little-endian, but on big-endian we
take the wrong half. In my case there was some other data there instead
of zeros, so the new MSS was wrongly greater.

To fix this bug I used the same solution as for port ranges. Applying
this patch does not affect little-endian systems.

Signed-off-by: Sergey Marinkevich
Acked-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
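
The difference between reading the register by address and truncating it by value, as described in the endianness fix above, can be sketched in plain userspace C. This is only an illustration, not the kernel patch: reg_store_be16(), reg_load_be16() and reg_truncate() are hypothetical helpers, and the filler value stands in for whatever unrelated data shares the register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Model of a 32-bit register whose first two bytes (by address) hold a
 * 16-bit value in network byte order. The last two bytes carry unrelated
 * filler data. All names are hypothetical.
 */
static uint32_t reg_store_be16(uint16_t value, uint16_t filler)
{
    unsigned char bytes[4] = {
        (unsigned char)(value >> 8), (unsigned char)(value & 0xff),
        (unsigned char)(filler >> 8), (unsigned char)(filler & 0xff),
    };
    uint32_t reg;

    memcpy(&reg, bytes, sizeof(reg));
    return reg;
}

/* Read by address: always the first two bytes, whatever the host order. */
static uint16_t reg_load_be16(const uint32_t *reg)
{
    unsigned char bytes[2];

    assert(reg != NULL);
    memcpy(bytes, reg, sizeof(bytes));
    return (uint16_t)((bytes[0] << 8) | bytes[1]);
}

/*
 * Read by value: keeps the low-order 16 bits, which sit at the start of
 * the register on little-endian hosts and at the end on big-endian ones.
 */
static uint16_t reg_truncate(uint32_t reg)
{
    return (uint16_t)reg;
}
```

With a register built by reg_store_be16(1400, 0xBEEF), reg_load_be16() yields 1400 on any host, while reg_truncate() picks up the filler bytes on a big-endian host, which matches the "wrong half" in the last table.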
30 Mar, 2020
6 commits
-
Store the conntrack counters to the conntrack entry in the
HW flowtable offload.

Signed-off-by: wenxu
Signed-off-by: Pablo Neira Ayuso
-
Add nf_ct_acct_add function to update the conntrack counter
with packets and bytes.

Signed-off-by: wenxu
Signed-off-by: Pablo Neira Ayuso
-
The bitmap set does not support expressions, so skip it in the
estimation step.

Signed-off-by: Pablo Neira Ayuso
-
If the global set expression definition mismatches the dynset
expression, then bail out.

Signed-off-by: Pablo Neira Ayuso
-
Otherwise, nft_lookup might dereference an uninitialized pointer to the
element extension.

Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Pablo Neira Ayuso
-
When CONFIG_NF_CONNTRACK_MARK is not set, CTA_MARK and CTA_MARK_MASK
in a netlink message are not supported. We should return an error when
either of them is set, not only when both are.

Fixes: 9306425b70bf ("netfilter: ctnetlink: must check mark attributes vs NULL")
Signed-off-by: Romain Bellan
Signed-off-by: Florent Fourcot
Signed-off-by: Pablo Neira Ayuso
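
The stricter attribute check above comes down to testing for either attribute instead of both. A minimal sketch, assuming a hypothetical struct mark_attrs in place of the parsed netlink attribute array, and assuming -EOPNOTSUPP as the error code (the entry above does not name it):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the parsed netlink attributes. */
struct mark_attrs {
    const void *mark;      /* CTA_MARK payload, or NULL if absent */
    const void *mark_mask; /* CTA_MARK_MASK payload, or NULL if absent */
};

/*
 * Check used when mark support is compiled out: reject the request if
 * either attribute is present. Testing with && instead of || would let
 * a message carrying only one of them through.
 */
static int check_mark_unsupported(const struct mark_attrs *attrs)
{
    assert(attrs != NULL);
    if (attrs->mark || attrs->mark_mask)
        return -EOPNOTSUPP; /* assumed error code */
    return 0;
}
```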
29 Mar, 2020
4 commits
-
Instead of dropping refs+kfree, use the helper added in previous patch.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
nf_queue is problematic when another NF_QUEUE invocation happens
from nf_reinject().

1. nf_queue is invoked, increments state->sk refcount.
2. skb is queued, waiting for verdict.
3. sk is closed/released.
4. verdict comes back, nf_reinject is called.
5. nf_reinject drops the reference -- refcount can now drop to 0

Instead of get_ref/release_ref pattern, we need to nest the get_ref calls:
    get_ref
        get_ref
        release_ref
    release_ref

So that when we invoke the next processing stage (another netfilter
or the okfn()), we hold at least one reference count on the
devices/socket.

After previous patch, it is now safe to put the entry even after okfn()
has potentially free'd the skb.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
The refcount is done via entry->skb, which does work fine.
Major problem: When putting the refcount of the bridge ports, we
must always put the references while the skb is still around.

However, we will need to put the references after okfn() to avoid
a possible 1 -> 0 -> 1 refcount transition, so we cannot use the
skb pointer anymore.

Place the physports in the queue entry structure instead to allow
for refcounting changes in the next patch.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
This is a preparation patch, no logical changes.
Move free_entry into core and rename it to something more sensible.

Will ease followup patches which will complicate the refcount handling.
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
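
The nested get_ref/release_ref pattern from the nf_queue entries above can be modelled with a toy refcount. This is a sketch under simplifying assumptions, not the kernel code: struct object stands in for the socket or devices, next_stage() for the next hook or okfn(), and "freed" is just a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for the socket/devices. */
struct object {
    int refcount;
    bool freed;
};

static void get_ref(struct object *obj)
{
    obj->refcount++;
}

static void release_ref(struct object *obj)
{
    assert(obj->refcount > 0);
    if (--obj->refcount == 0)
        obj->freed = true; /* a real implementation would free here */
}

/*
 * Next processing stage: takes and drops its own reference, as a second
 * queueing step would. Returns false if the object was already gone,
 * which would be the 1 -> 0 -> 1 transition.
 */
static bool next_stage(struct object *obj)
{
    if (obj->freed)
        return false;
    get_ref(obj);
    release_ref(obj);
    return true;
}

/* Old pattern: drop the queue entry's reference, then continue. */
static bool reinject_release_first(struct object *obj)
{
    release_ref(obj);
    return next_stage(obj);
}

/* Nested pattern: continue first, drop the entry's reference last. */
static bool reinject_nested(struct object *obj)
{
    bool ok = next_stage(obj);

    release_ref(obj);
    return ok;
}
```

Starting from an object whose only remaining reference is the queue entry's (the owner has already closed it), reinject_release_first() runs the next stage on a dead object, while reinject_nested() keeps it alive until the stage returns.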
28 Mar, 2020
10 commits
-
To allow offload commands to execute in parallel, create workqueue
for flow table offload, and use a work entry per offload command.

Signed-off-by: Paul Blakey
Reviewed-by: Oz Shlomo
Signed-off-by: Pablo Neira Ayuso
-
Currently flow offload threads are synchronized by the flow block mutex.
Use rw lock instead to increase flow insertion (read) concurrency.

Signed-off-by: Paul Blakey
Reviewed-by: Oz Shlomo
Signed-off-by: Pablo Neira Ayuso
-
It is safe to traverse &net->nft.tables with &net->nft.commit_mutex
held using list_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false
positive,

WARNING: suspicious RCU usage
net/netfilter/nf_tables_api.c:523 RCU-list traversed in non-reader section!!

other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by iptables/1384:
#0: ffffffff9745c4a8 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x25/0x60 [nf_tables]

Call Trace:
dump_stack+0xa1/0xea
lockdep_rcu_suspicious+0x103/0x10d
nft_table_lookup.part.0+0x116/0x120 [nf_tables]
nf_tables_newtable+0x12c/0x7d0 [nf_tables]
nfnetlink_rcv_batch+0x559/0x1190 [nfnetlink]
nfnetlink_rcv+0x1da/0x210 [nfnetlink]
netlink_unicast+0x306/0x460
netlink_sendmsg+0x44b/0x770
____sys_sendmsg+0x46b/0x4a0
___sys_sendmsg+0x138/0x1a0
__sys_sendmsg+0xb6/0x130
__x64_sys_sendmsg+0x48/0x50
do_syscall_64+0x69/0xf4
entry_SYSCALL_64_after_hwframe+0x49/0xb3

Signed-off-by: Qian Cai
Acked-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
-
The indirect block setup should use TC_SETUP_FT as the type instead of
TC_SETUP_BLOCK. Adjust existing users of the indirect flow block
infrastructure.

Fixes: b5140a36da78 ("netfilter: flowtable: add indr block setup support")
Signed-off-by: wenxu
Signed-off-by: Pablo Neira Ayuso
-
Add a new flag to turn on flowtable counters which are stored in the
conntrack entry.

Signed-off-by: Pablo Neira Ayuso
-
Expose the NFT_FLOWTABLE_HW_OFFLOAD flag through uapi.
Signed-off-by: Pablo Neira Ayuso
-
This function allows you to update the conntrack counters.
Signed-off-by: Pablo Neira Ayuso
-
After stripping the GRE/UDP tunnel header for icmp errors, it's better to
show "GRE/UDP" instead of "IPIP" in the debug message.

Signed-off-by: Haishuang Yan
Acked-by: Julian Anastasov
Signed-off-by: Pablo Neira Ayuso
-
…_conntrack_all_unlock()
Sparse reports warnings at nf_conntrack_all_lock()
and nf_conntrack_all_unlock()

warning: context imbalance in nf_conntrack_all_lock()
- wrong count at exit
warning: context imbalance in nf_conntrack_all_unlock()
- unexpected unlock

Add the missing __acquires(&nf_conntrack_locks_all_lock)
Add missing __releases(&nf_conntrack_locks_all_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Sparse reports a warning at ctnetlink_parse_nat_setup()

warning: context imbalance in ctnetlink_parse_nat_setup()
- unexpected unlock

The root cause is the missing annotation at ctnetlink_parse_nat_setup()
Add the missing __must_hold(RCU) annotation

Signed-off-by: Jules Irenge
Signed-off-by: Pablo Neira Ayuso
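
For readers unfamiliar with the sparse context annotations used in the two entries above, here is a minimal userspace sketch. The macro definitions mirror the usual kernel pattern (active only when sparse defines __CHECKER__, empty otherwise). The lock itself is a hypothetical boolean flag, not a real spinlock, and all_lock()/all_unlock() are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>

/* Annotations are only meaningful to sparse; the compiler sees nothing. */
#ifdef __CHECKER__
# define __acquires(x) __attribute__((context(x, 0, 1)))
# define __releases(x) __attribute__((context(x, 1, 0)))
#else
# define __acquires(x)
# define __releases(x)
#endif

/* Hypothetical stand-in for a lock: true while held. */
static bool all_lock_held;

/* Exits with the lock held, so the function is marked __acquires. */
static void all_lock(void)
    __acquires(all_lock_held)
{
    assert(!all_lock_held);
    all_lock_held = true;
}

/* Enters with the lock held and drops it, so it is marked __releases. */
static void all_unlock(void)
    __releases(all_lock_held)
{
    assert(all_lock_held);
    all_lock_held = false;
}
```

Without the annotations, a checker that tracks lock context sees a function that returns with a different lock count than it entered with, which is what the "context imbalance" warnings report.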
26 Mar, 2020
2 commits
-
Overlapping header include additions in macsec.c

A bug fix in 'net' overlapping with the removal of 'version'
string in ena_netdev.c

Overlapping test additions in selftests Makefile

Overlapping PCI ID table adjustments in iwlwifi driver.
Signed-off-by: David S. Miller
-
net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
pkt->skb->tc_redirected = 1;
^~
net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
pkt->skb->tc_from_ingress = 1;
^~

To avoid a direct dependency with tc actions from netfilter, wrap the
redirect bits around CONFIG_NET_REDIRECT and move helpers to
include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
only existing client of these bits in the tree.

This patch adds skb_set_redirected(), which sets the redirected bit
on the skbuff, specifies whether the packet was redirected from ingress,
and resets the timestamp (the timestamp reset was originally missing in
the netfilter bugfix).

Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
Reported-by: noreply@ellerman.id.au
Reported-by: Geert Uytterhoeven
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller
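
The shape of such a helper can be sketched in userspace as follows. This is a model built only from the description above, not the kernel definition: struct pkt and pkt_set_redirected() are hypothetical, the config symbol is simply defined locally, and the sketch clears the timestamp unconditionally because the description does not state any condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONFIG_NET_REDIRECT 1 /* assumed enabled for this sketch */

/* Hypothetical stand-in for the relevant packet fields. */
struct pkt {
#ifdef CONFIG_NET_REDIRECT
    unsigned int redirected:1;
    unsigned int from_ingress:1;
#endif
    int64_t tstamp;
};

/*
 * Single entry point for callers: the redirect bits are only touched
 * when the feature is compiled in, so callers need no #ifdef of their
 * own and no direct dependency on the code that consumes the bits.
 */
static void pkt_set_redirected(struct pkt *p, bool from_ingress)
{
    assert(p != NULL);
#ifdef CONFIG_NET_REDIRECT
    p->redirected = 1;
    p->from_ingress = from_ingress;
    p->tstamp = 0; /* timestamp reset, as described above */
#endif
}
```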
25 Mar, 2020
6 commits
-
Set skb->tc_redirected to 1, otherwise the ifb driver drops the packet.
Set skb->tc_from_ingress to 1 to reinject the packet back to the ingress
path after leaving the ifb egress path.

This patch unconditionally sets these two skb fields, which are
meaningful to the ifb driver. The existing forward action is guaranteed
to run from the ingress path.

Fixes: 39e6dea28adc ("netfilter: nf_tables: add forward expression to the netdev family")
Signed-off-by: Pablo Neira Ayuso
-
Make sure the forward action is only used from ingress.
Fixes: 39e6dea28adc ("netfilter: nf_tables: add forward expression to the netdev family")
Signed-off-by: Pablo Neira Ayuso
-
...and return -ENOTEMPTY to the front-end in this case, instead of
proceeding. Currently, nft takes care of checking for these cases
and not sending them to the kernel, but if we drop the set_overlap()
call in nft we can end up in situations like:

    # nft add table t
    # nft add set t s '{ type inet_service ; flags interval ; }'
    # nft add element t s '{ 1 - 5 }'
    # nft add element t s '{ 6 - 10 }'
    # nft add element t s '{ 4 - 7 }'
    # nft list set t s
    table ip t {
            set s {
                    type inet_service
                    flags interval
                    elements = { 1-3, 4-5, 6-7 }
            }
    }

This change has the primary purpose of making the behaviour
consistent with nft_set_pipapo, but is also functional to avoid
inconsistent behaviour if userspace sends overlapping elements for
any reason.

v2: When we meet the same key data in the tree, as start element while
inserting an end element, or as end element while inserting a start
element, actually check that the existing element is active, before
resetting the overlap flag (Pablo Neira Ayuso)

Signed-off-by: Stefano Brivio
Signed-off-by: Pablo Neira Ayuso
-
Replace negations of nft_rbtree_interval_end() with a new helper,
nft_rbtree_interval_start(), wherever this helps to visualise the
problem at hand, that is, for all the occurrences except for the
comparison against given flags in __nft_rbtree_get().

This gets especially useful in the next patch.

Signed-off-by: Stefano Brivio
Signed-off-by: Pablo Neira Ayuso
-
...and return -ENOTEMPTY to the front-end on collision, -EEXIST if
an identical element already exists. Together with the previous patch,
element collision will now be returned to the user as -EEXIST.

Reported-by: Phil Sutter
Signed-off-by: Stefano Brivio
Signed-off-by: Pablo Neira Ayuso
-
Currently, the -EEXIST return code of ->insert() callbacks is ambiguous: it
might indicate that a given element (including intervals) already exists as
such, or that the new element would clash with existing ones.

If identical elements already exist, the front-end ignores this without
returning an error, in case NLM_F_EXCL is not set. However, if the new
element can't be inserted due to an overlap, we should report this to the
user.

For this purpose, allow set back-ends to return -ENOTEMPTY on collision with
existing elements, translate that to -EEXIST, and return that to userspace,
no matter if NLM_F_EXCL was set.

Reported-by: Phil Sutter
Signed-off-by: Stefano Brivio
Signed-off-by: Pablo Neira Ayuso
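
The error-code split described in the set entries above can be sketched with a toy interval set. This is an illustration only: the array-backed set, backend_insert() and frontend_add() are hypothetical, the excl parameter stands in for NLM_F_EXCL, and the real back-ends use an rbtree or pipapo rather than a linear scan.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_INTERVALS 16

struct interval {
    unsigned int start, end; /* inclusive bounds */
};

struct interval_set {
    struct interval items[MAX_INTERVALS];
    size_t count;
};

/*
 * Back-end insert: -EEXIST if the identical interval is already there,
 * -ENOTEMPTY if the new interval overlaps an existing one.
 */
static int backend_insert(struct interval_set *set, struct interval item)
{
    size_t i;

    assert(item.start <= item.end);
    for (i = 0; i < set->count; i++) {
        const struct interval *cur = &set->items[i];

        if (cur->start == item.start && cur->end == item.end)
            return -EEXIST;
        if (item.start <= cur->end && cur->start <= item.end)
            return -ENOTEMPTY;
    }
    if (set->count == MAX_INTERVALS)
        return -ENOSPC;
    set->items[set->count++] = item;
    return 0;
}

/*
 * Front-end: an identical element is silently accepted unless excl is
 * set; an overlap is always reported to the caller as -EEXIST.
 */
static int frontend_add(struct interval_set *set, struct interval item,
                        bool excl)
{
    int err = backend_insert(set, item);

    if (err == -EEXIST && !excl)
        return 0;
    if (err == -ENOTEMPTY)
        return -EEXIST;
    return err;
}
```

Replaying the nft example from the rbtree entry, adding 1-5 and 6-10 succeeds, and adding 4-7 is rejected instead of splitting the existing elements.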
20 Mar, 2020
5 commits
-
nf_flow_rule_match() sets control.addr_type in key, so needs to also set
the corresponding mask. An exact match is wanted, so mask is all ones.

Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Edward Cree
Signed-off-by: Pablo Neira Ayuso
-
The tc ct action does not cache the route in the flowtable entry.
Fixes: 88bf6e4114d5 ("netfilter: flowtable: add tunnel encap/decap action offload support")
Fixes: cfab6dbd0ecf ("netfilter: flowtable: add tunnel match offload support")
Signed-off-by: wenxu
Signed-off-by: Pablo Neira Ayuso
-
When freeing a flowtable with offloaded flows, the flows are deleted from
hardware but not from the flow table, leaking them and leaving their
offload bit on.

Add a second pass of the disabled gc to delete these flows from
the flow table before freeing it.

Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Paul Blakey
Signed-off-by: Pablo Neira Ayuso
-
Since pskb_may_pull() may change skb->data, we need to reload ip{v6}h at
the right place.

Fixes: a908fdec3dda ("netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table")
Fixes: 7d2086871762 ("netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table")
Signed-off-by: Haishuang Yan
Signed-off-by: Pablo Neira Ayuso
-
Since nf_flow_snat_port and nf_flow_snat_ip{v6} call pskb_may_pull(),
which may change skb->data, we need to reload ip{v6}h at the right
place.

Fixes: a908fdec3dda ("netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table")
Fixes: 7d2086871762 ("netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table")
Signed-off-by: Haishuang Yan
Signed-off-by: Pablo Neira Ayuso
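
The stale-pointer hazard behind the two header reload fixes above can be reproduced with a toy buffer. A minimal sketch, assuming a hypothetical struct buf whose buf_may_pull() always moves the data to a fresh allocation (the real pskb_may_pull() only may do so):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical packet buffer. */
struct buf {
    unsigned char *data;
    size_t len;
};

/*
 * Stand-in for a pull that may reallocate. To make the hazard visible,
 * this version always moves the data to a new allocation.
 */
static int buf_may_pull(struct buf *b, size_t len)
{
    unsigned char *copy;

    if (len > b->len)
        return -1;
    copy = malloc(b->len);
    if (!copy)
        return -1;
    memcpy(copy, b->data, b->len);
    free(b->data);
    b->data = copy;
    return 0;
}

/* Increment one header byte, reloading the header pointer after the pull. */
static int buf_bump_header_byte(struct buf *b, size_t offset)
{
    unsigned char *hdr = b->data;
    unsigned char old;

    assert(b->data != NULL);
    if (offset >= b->len)
        return -1;
    old = hdr[offset];       /* valid: nothing has moved the data yet */

    if (buf_may_pull(b, offset + 1) < 0)
        return -1;

    hdr = b->data;           /* reload: the earlier pointer is stale */
    hdr[offset] = (unsigned char)(old + 1);
    return 0;
}
```

Dropping the reload line would make the final write go through a pointer into freed memory, which is the class of bug these fixes address.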
19 Mar, 2020
4 commits
-
This patch adds nft_set_elem_expr_destroy() to destroy stateful
expressions in set elements.

This patch also updates the commit path to call this function to invoke
expr->ops->destroy_clone when required.

This is implicitly fixing up a module reference counter leak and
a memory leak in expressions that allocated internal state, e.g.
nft_counter.

Fixes: 409444522976 ("netfilter: nf_tables: add elements with stateful expressions")
Signed-off-by: Pablo Neira Ayuso
-
After copying the expression to the set element extension, release the
expression and reset the pointer to avoid a double-free from the error
path.

Fixes: 409444522976 ("netfilter: nf_tables: add elements with stateful expressions")
Signed-off-by: Pablo Neira Ayuso
-
This patch allows users to specify the stateful expression for the
elements in this set via NFTA_SET_EXPR. This new feature allows you to
turn on counters for all of the elements in this set.

Signed-off-by: Pablo Neira Ayuso
-
The patch that adds support for stateful expressions in set definitions
requires this.

Signed-off-by: Pablo Neira Ayuso
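
The pointer reset from the double-free fix earlier in this group (the entry that releases the expression after copying it into the set element extension) can be sketched as follows. All names are hypothetical; fail_late simulates an error that occurs after the copy and jumps to the shared error path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct expr { int state; };       /* hypothetical expression */
struct elem { struct expr ext; }; /* element with inline extension */

static int free_calls; /* counts real frees, for demonstration */

static void expr_free(struct expr *e)
{
    if (e) {
        free_calls++;
        free(e);
    }
}

static int elem_init(struct elem *el, int state, int fail_late)
{
    struct expr *expr = malloc(sizeof(*expr));
    int err;

    assert(el != NULL);
    if (!expr)
        return -1;
    expr->state = state;

    memcpy(&el->ext, expr, sizeof(*expr)); /* copy into the extension */
    expr_free(expr);                       /* release the original... */
    expr = NULL;                           /* ...and reset the pointer */

    if (fail_late) {
        err = -1;
        goto err_out;
    }
    return 0;

err_out:
    expr_free(expr); /* shared error path: a no-op now that expr is NULL */
    return err;
}
```

Without the reset to NULL, the error path would free the same allocation a second time.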