Eric Lee / smarc-fsl-linux-kernel

21 Nov, 2018

1 commit

54ab59528 netfilter: conntrack: fix calculation of next bucket number in early_drop

commit f393808dc64149ccd0e5a8427505ba2974a59854 upstream.

If there's no entry to drop in the bucket that corresponds to the hash,
early_drop() should look for it in other buckets. But since it increments
the hash instead of the bucket number, it actually looks in the same bucket
8 times: hsize is 16k by default (14 bits) and hash is a 32-bit value, so
reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
most cases.

Fix it by incrementing the bucket number instead of the hash, and rename
_hash to bucket to avoid future confusion.

Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic")
Cc: # v4.7+
Signed-off-by: Vasily Khoruzhick
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: Greg Kroah-Hartman

Vasily Khoruzhick
2018-11-21 16:24:09 +0800
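The effect described in the early_drop fix above can be reproduced in userspace. The following is a minimal sketch, not the kernel code: reciprocal_scale() mirrors the arithmetic of the kernel helper of the same name, while walk_by_hash() and walk_by_bucket() are hypothetical helpers that count how many different buckets each strategy visits.

```c
#include <stdint.h>

/* Same arithmetic as the kernel's reciprocal_scale(): map a 32-bit
 * value into [0, ep_ro) with a multiply and a shift. */
static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Old behaviour (hypothetical helper): step the hash and rescale it on
 * every try. Returns how many times the visited bucket changed, plus one
 * for the first bucket. */
static unsigned int walk_by_hash(uint32_t hash, uint32_t hsize,
				 unsigned int tries)
{
	unsigned int visited = 0;
	uint32_t prev = 0;

	for (unsigned int i = 0; i < tries; i++) {
		uint32_t bucket = reciprocal_scale(hash + i, hsize);

		if (i == 0 || bucket != prev)
			visited++;
		prev = bucket;
	}
	return visited;
}

/* Fixed behaviour (hypothetical helper): scale the hash once, then step
 * the bucket number itself, wrapping at hsize. */
static unsigned int walk_by_bucket(uint32_t hash, uint32_t hsize,
				   unsigned int tries)
{
	unsigned int visited = 0;
	uint32_t bucket = reciprocal_scale(hash, hsize);
	uint32_t prev = 0;

	for (unsigned int i = 0; i < tries; i++) {
		if (i == 0 || bucket != prev)
			visited++;
		prev = bucket;
		bucket = (bucket + 1) % hsize;
	}
	return visited;
}
```

With hsize = 16384 each bucket covers 2^18 consecutive hash values, so walk_by_hash(0x12345678, 16384, 8) stays in a single bucket, while walk_by_bucket() with the same arguments visits eight.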

24 Aug, 2018

1 commit

e653e79ac netfilter: nf_conntrack: Fix possible crash on module loading.

[ Upstream commit 2045cdfa1b40d66f126f3fd05604fc7c754f0022 ]

Loading the nf_conntrack module with a doubled hashsize parameter, i.e.

  modprobe nf_conntrack hashsize=12345 hashsize=12345

causes a NULL pointer dereference.

If 'hashsize' is specified twice, nf_conntrack_set_hashsize() is also
called twice.
The first nf_conntrack_set_hashsize() call will set the
'nf_conntrack_htable_size' variable:

  nf_conntrack_set_hashsize()
      ...
      /* On boot, we can set this without any fancy locking. */
      if (!nf_conntrack_htable_size)
          return param_set_uint(val, kp);

But on the second invocation, nf_conntrack_htable_size is already set,
so nf_conntrack_set_hashsize() takes a different path and calls
nf_conntrack_hash_resize(), which crashes on the attempt to dereference
the 'nf_conntrack_hash' pointer:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack]
Call Trace:
nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack]
parse_args+0x1f9/0x5a0
load_module+0x1281/0x1a50
__se_sys_finit_module+0xbe/0xf0
do_syscall_64+0x7c/0x390
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fix this by checking !nf_conntrack_hash instead of
!nf_conntrack_htable_size. nf_conntrack_hash is initialized only
after the module is loaded, so the second invocation of
nf_conntrack_set_hashsize() won't crash; it will just set
nf_conntrack_htable_size again.

Signed-off-by: Andrey Ryabinin
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Andrey Ryabinin
2018-08-24 19:09:15 +0800
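The control flow behind the doubled-hashsize crash above can be modelled without the kernel. This is a simplified sketch under stated assumptions: struct ct_model and both functions are hypothetical stand-ins, and the NULL dereference in the resize path is reported as a return value instead of actually crashing.

```c
#include <stddef.h>

/* Hypothetical model of the two globals involved. */
struct ct_model {
	unsigned int htable_size;	/* nf_conntrack_htable_size */
	void *hash;			/* nf_conntrack_hash, NULL until init */
};

enum outcome { SET_DIRECTLY, RESIZED, WOULD_DEREF_NULL };

/* Old check: keyed on the size, which the first parameter parse already
 * sets, so a second parse falls through to the resize path. */
static enum outcome set_hashsize_old(struct ct_model *m, unsigned int val)
{
	if (!m->htable_size) {
		m->htable_size = val;
		return SET_DIRECTLY;
	}
	if (!m->hash)
		return WOULD_DEREF_NULL;	/* the resize path needs the table */
	m->htable_size = val;
	return RESIZED;
}

/* Fixed check: keyed on the table pointer, which stays NULL until the
 * module has finished initializing. */
static enum outcome set_hashsize_fixed(struct ct_model *m, unsigned int val)
{
	if (!m->hash) {
		m->htable_size = val;
		return SET_DIRECTLY;
	}
	m->htable_size = val;
	return RESIZED;
}
```

Calling set_hashsize_old() twice on a zeroed model reproduces the failing sequence; the fixed variant simply overwrites the size on the second call.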

07 Sep, 2017

1 commit

aae3dbb47 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:

1) Support ipv6 checksum offload in sunvnet driver, from Shannon
Nelson.

2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
Dumazet.

3) Allow generic XDP to work on virtual devices, from John Fastabend.

4) Add bpf device maps and XDP_REDIRECT, which can be used to build
arbitrary switching frameworks using XDP. From John Fastabend.

5) Remove UFO offloads from the tree, gave us little other than bugs.

6) Remove the IPSEC flow cache, from Florian Westphal.

7) Support ipv6 route offload in mlxsw driver.

8) Support VF representors in bnxt_en, from Sathya Perla.

9) Add support for forward error correction modes to ethtool, from
Vidya Sagar Ravipati.

10) Add time filter for packet scheduler action dumping, from Jamal Hadi
Salim.

11) Extend the zerocopy sendmsg() used by virtio and tap to regular
sockets via MSG_ZEROCOPY. From Willem de Bruijn.

12) Significantly rework value tracking in the BPF verifier, from Edward
Cree.

13) Add new jump instructions to eBPF, from Daniel Borkmann.

14) Rework rtnetlink plumbing so that operations can be run without
taking the RTNL semaphore. From Florian Westphal.

15) Support XDP in tap driver, from Jason Wang.

16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

17) Add Huawei hinic ethernet driver.

18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
Delalande.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
i40e: avoid NVM acquire deadlock during NVM update
drivers: net: xgene: Remove return statement from void function
drivers: net: xgene: Configure tx/rx delay for ACPI
drivers: net: xgene: Read tx/rx delay for ACPI
rocker: fix kcalloc parameter order
rds: Fix non-atomic operation on shared flag variable
net: sched: don't use GFP_KERNEL under spin lock
vhost_net: correctly check tx avail during rx busy polling
net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
rxrpc: Make service connection lookup always check for retry
net: stmmac: Delete dead code for MDIO registration
gianfar: Fix Tx flow control deactivation
cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
cxgb4: Fix pause frame count in t4_get_port_stats
cxgb4: fix memory leak
tun: rename generic_xdp to skb_xdp
tun: reserve extra headroom only when XDP is set
net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
net: dsa: bcm_sf2: Advertise number of egress queues
...

Linus Torvalds
2017-09-07 05:45:08 +0800

04 Sep, 2017

2 commits

44d6e2f27 net: Replace NF_CT_ASSERT() with WARN_ON().

This patch removes NF_CT_ASSERT() and instead uses WARN_ON().

Signed-off-by: Varsha Rao

Varsha Rao
2017-09-04 19:25:19 +0800

d1c1e39de netfilter: remove unused hooknum arg from packet functions

Tested with an allmodconfig build.

Signed-off-by: Florian Westphal

Florian Westphal
2017-09-04 19:25:18 +0800

25 Aug, 2017

1 commit

b3480fe05 netfilter: conntrack: make protocol tracker pointers const

Doesn't change generated code, but will make it easier to eventually
make the actual trackers themselves const.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-08-25 00:52:33 +0800

02 Aug, 2017

1 commit

2a04aabf5 netfilter: constify nf_conntrack_l3/4proto parameters

When a nf_conntrack_l3/4proto parameter is not on the left hand side
of an assignment, its address is not taken, and it is not passed to a
function that may modify its fields, then it can be declared as const.

This change is useful from a documentation point of view, and can
possibly facilitate making some nf_conntrack_l3/4proto structures const
subsequently.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall
Signed-off-by: Pablo Neira Ayuso

Julia Lawall
2017-08-02 20:25:57 +0800

01 Aug, 2017

3 commits

e2a750070 netfilter: conntrack: destroy functions need to free queued packets

queued skbs might be using conntrack extensions that are being removed,
such as timeout. This happens for skbs that have a skb->nfct in
unconfirmed state (i.e., not in hash table yet).

This is destructive, but there are only two use cases:
- module removal (rare)
- netns cleanup (most likely no conntracks exist, and if they do,
they are removed anyway later on).

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-08-01 01:09:39 +0800

84657984c netfilter: add and use nf_ct_unconfirmed_destroy

This also removes __nf_ct_unconfirmed_destroy() call from
nf_ct_iterate_cleanup_net, so that function can be used only
when missing conntracks from unconfirmed list isn't a problem.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-08-01 01:09:39 +0800

a232cd0e0 netfilter: conntrack: Change to deferrable work queue

A delayed workqueue causes wakeups on idle CPUs. This was
causing a power impact on devices. Use a deferrable work
queue instead so that gc_worker runs only when the CPU is active.

Signed-off-by: Subash Abhinov Kasiviswanathan
Signed-off-by: Pablo Neira Ayuso

subashab@codeaurora.org
2017-08-01 01:03:50 +0800

26 Jul, 2017

1 commit

3ef0c7a73 net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()

As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().

V2: With improved comments, to clearly show how the barriers
pair.

Fixes: b16c29191dc8 ("netfilter: nf_conntrack: use safer way to lock all buckets")
Signed-off-by: Manfred Spraul
Cc:
Cc: Alan Stern
Cc: Sasha Levin
Cc: Pablo Neira Ayuso
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Paul E. McKenney

Manfred Spraul
2017-07-26 01:08:58 +0800

24 Jul, 2017

1 commit

0b35f6031 netfilter: Remove duplicated rcu_read_lock.

This patch removes duplicate rcu_read_lock().

1. IPVS part:

As Julian Anastasov explained, the IPVS contexts are described at
http://marc.info/?l=netfilter-devel&m=149562884514072&w=2. In summary:

- packet RX/TX: does not need locks because packets come from hooks.
- sync msg RX: backup server uses RCU locks while registering new
connections.
- ip_vs_ctl.c: configuration get/set, RCU locks needed.
- xt_ipvs.c: It is a netfilter match, running from hook context.

As a result, rcu_read_lock and rcu_read_unlock can be removed from:

- ip_vs_core.c: all
- ip_vs_ctl.c:
- only from ip_vs_has_real_service
- ip_vs_ftp.c: all
- ip_vs_proto_sctp.c: all
- ip_vs_proto_tcp.c: all
- ip_vs_proto_udp.c: all
- ip_vs_xmit.c: all (contains only packet processing)

2. Netfilter part:

There are three types of functions that are guaranteed to run under
rcu_read_lock(). First, functions that are only called by nf_hook():

- nf_conntrack_broadcast_help(), pptp_expectfn(), set_expected_rtp_rtcp().
- tcpmss_reverse_mtu(), tproxy_laddr4(), tproxy_laddr6().
- match_lookup_rt6(), check_hlist(), hashlimit_mt_common().
- xt_osf_match_packet().

Second, functions whose callers already hold rcu_read_lock():
- destroy_conntrack(), ctnetlink_conntrack_event().
- ctnl_timeout_find_get(), nfqnl_nf_hook_drop().

Third, functions that mix both types: they are called by nf_hook() and
also by ordinary functions that already hold rcu_read_lock():

- __ctnetlink_glue_build(), ctnetlink_expect_event().
- ctnetlink_proto_size().

The affected files are:

- nf_conntrack_broadcast.c, nf_conntrack_core.c, nf_conntrack_netlink.c.
- nf_conntrack_pptp.c, nf_conntrack_sip.c, nfnetlink_cttimeout.c.
- nfnetlink_queue.c, xt_TCPMSS.c, xt_TPROXY.c, xt_addrtype.c.
- xt_connlimit.c, xt_hashlimit.c, xt_osf.c

Detailed calltrace can be found at:
http://marc.info/?l=netfilter-devel&m=149667610710350&w=2

Signed-off-by: Taehee Yoo
Acked-by: Julian Anastasov
Signed-off-by: Pablo Neira Ayuso

Taehee Yoo
2017-07-24 19:24:46 +0800

20 Jun, 2017

1 commit

7866cc57b netns: add and use net_ns_barrier

Quoting Joe Stringer:
If a user loads nf_conntrack_ftp, sends FTP traffic through a network
namespace, destroys that namespace then unloads the FTP helper module,
then the kernel will crash.

Events that lead to the crash:
1. conntrack is created with ftp helper in netns x
2. This netns is destroyed
3. netns destruction is scheduled
4. netns destruction wq starts, removes netns from global list
5. ftp helper is unloaded, which resets all helpers of the conntracks
via for_each_net()

but because netns is already gone from list the for_each_net() loop
doesn't include it, therefore all of these conntracks are unaffected.

6. helper module unload finishes
7. netns wq invokes destructor for rmmod'ed helper

CC: "Eric W. Biederman"
Reported-by: Joe Stringer
Signed-off-by: Florian Westphal
Acked-by: David S. Miller
Acked-by: "Eric W. Biederman"
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-06-20 01:09:19 +0800

29 May, 2017

4 commits

0d02d5646 netfilter: conntrack: restart iteration on resize

We could miss some conntracks when a resize occurs in parallel.

Avoid this by sampling generation seqcnt and doing a restart if needed.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-05-29 18:46:11 +0800

2843fb699 netfilter: conntrack: add nf_ct_iterate_destroy

sledgehammer to be used on module unload (to remove affected conntracks
from all namespaces).

It will also flag all unconfirmed conntracks as dying, i.e. they will
not be committed to main table.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-05-29 18:46:10 +0800

b0feacaad netfilter: conntrack: don't call iter for non-confirmed conntracks

nf_ct_iterate_cleanup_net currently calls iter() callback also for
conntracks on the unconfirmed list, but this is unsafe.

Accesses to nf_conn are fine, but some users access the extension area
in the iter() callback, and that only works reliably for confirmed
conntracks (ct->ext can be reallocated at any time for unconfirmed
conntrack).

The second issue is that there is a short window where a conntrack entry
is neither on the list nor in the table: To confirm an entry, it is first
removed from the unconfirmed list, then inserted into the table.

Fix this by iterating the unconfirmed list first and marking all entries
as dying, then wait for rcu grace period.

This makes sure all entries that were about to be confirmed either are
in the main table, or will be dropped soon.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-05-29 18:46:09 +0800

9fd6452d6 netfilter: conntrack: rename nf_ct_iterate_cleanup

There are several places where we needlessly call nf_ct_iterate_cleanup;
we should instead iterate the full table at module unload time.

This is a leftover from back when the conntrack table got duplicated
per net namespace.

So rename nf_ct_iterate_cleanup to nf_ct_iterate_cleanup_net.
A later patch will then add a non-net variant.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-05-29 18:46:08 +0800

11 May, 2017

1 commit

de4d19530 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU updates from Ingo Molnar:
"The main changes are:

- Debloat RCU headers

- Parallelize SRCU callback handling (plus overlapping patches)

- Improve the performance of Tree SRCU on a CPU-hotplug stress test

- Documentation updates

- Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
rcu: Open-code the rcu_cblist_n_lazy_cbs() function
rcu: Open-code the rcu_cblist_n_cbs() function
rcu: Open-code the rcu_cblist_empty() function
rcu: Separately compile large rcu_segcblist functions
srcu: Debloat the header
srcu: Adjust default auto-expediting holdoff
srcu: Specify auto-expedite holdoff time
srcu: Expedite first synchronize_srcu() when idle
srcu: Expedited grace periods with reduced memory contention
srcu: Make rcutorture writer stalls print SRCU GP state
srcu: Exact tracking of srcu_data structures containing callbacks
srcu: Make SRCU be built by default
srcu: Fix Kconfig botch when SRCU not selected
rcu: Make non-preemptive schedule be Tasks RCU quiescent state
srcu: Expedite srcu_schedule_cbs_snp() callback invocation
srcu: Parallelize callback handling
kvm: Move srcu_struct fields to end of struct kvm
rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
rcu: Use true/false in assignment to bool
rcu: Use bool value directly
...

Linus Torvalds
2017-05-11 01:30:46 +0800

03 May, 2017

1 commit

ab71632c4 netfilter: conntrack: Force inlining of build check to prevent build failure

If gcc (e.g. 4.1.2) decides not to inline total_extension_size(), the
build will fail with:

net/built-in.o: In function `nf_conntrack_init_start':
(.text+0x9baf6): undefined reference to `__compiletime_assert_1893'

or

ERROR: "__compiletime_assert_1893" [net/netfilter/nf_conntrack.ko] undefined!

Fix this by forcing inlining of total_extension_size().

Fixes: b3a5db109e0670d6 ("netfilter: conntrack: use u8 for extension sizes again")
Signed-off-by: Geert Uytterhoeven
Acked-by: Arnd Bergmann
Acked-by: Florian Westphal
Signed-off-by: David S. Miller

Geert Uytterhoeven
2017-05-03 21:51:26 +0800

23 Apr, 2017

1 commit

58d30c36d Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu

Pull RCU updates from Paul E. McKenney:

- Documentation updates.

- Miscellaneous fixes.

- Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>

Ingo Molnar
2017-04-23 17:12:44 +0800

19 Apr, 2017

3 commits

c6dd940b1 netfilter: allow early drop of assured conntracks

If insertion of a new conntrack fails because the table is full, the kernel
searches the next buckets of the hash slot where the new connection
was supposed to be inserted for an entry that hasn't seen traffic
in the reply direction (non-assured). If it finds one, that entry is
dropped and the new connection entry is allocated.

Allow the conntrack gc worker to also remove *assured* conntracks if
resources are low.

Do this by querying the l4 tracker, e.g. tcp connections are now dropped
if they are no longer established (e.g. in finwait).

This could be refined further, e.g. by adding 'soft' established timeout
(i.e., a timeout that is only used once we get close to resource
exhaustion).

Cc: Jozsef Kadlecsik
Signed-off-by: Florian Westphal
Acked-by: Jozsef Kadlecsik
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-04-19 23:55:17 +0800
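The eviction rule described in the early-drop commit above can be sketched as a small predicate. This is an illustration under stated assumptions, not the kernel implementation: the struct, the reduced TCP state list, and all function names are hypothetical, and only the rule itself (non-assured entries are always candidates, assured TCP entries only once they are no longer established) comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical, reduced set of TCP conntrack states. */
enum tcp_ct_state {
	TCP_CT_ESTABLISHED,
	TCP_CT_FIN_WAIT,
	TCP_CT_CLOSE_WAIT,
	TCP_CT_LAST_ACK,
	TCP_CT_TIME_WAIT,
	TCP_CT_CLOSE,
};

/* Hypothetical, minimal view of a conntrack entry. */
struct ct_entry {
	bool assured;			/* has seen traffic in the reply direction */
	bool is_tcp;
	enum tcp_ct_state tcp_state;	/* only meaningful when is_tcp is set */
};

/* Per-protocol hook: a TCP entry may go once it has left ESTABLISHED. */
static bool tcp_can_early_drop(const struct ct_entry *ct)
{
	switch (ct->tcp_state) {
	case TCP_CT_FIN_WAIT:
	case TCP_CT_CLOSE_WAIT:
	case TCP_CT_LAST_ACK:
	case TCP_CT_TIME_WAIT:
	case TCP_CT_CLOSE:
		return true;
	default:
		return false;
	}
}

/* Decision used when resources are low. */
static bool can_early_drop(const struct ct_entry *ct)
{
	if (!ct->assured)
		return true;		/* non-assured: always a candidate */
	if (ct->is_tcp)
		return tcp_can_early_drop(ct);
	return false;			/* no protocol hook in this sketch */
}
```

Keeping the protocol-specific test behind its own function mirrors the idea of querying the l4 tracker, so other protocols could supply their own rule later.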
b3a5db109 netfilter: conntrack: use u8 for extension sizes again

commit 223b02d923ecd7c84cf9780bb3686f455d279279
("netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len")
had to increase size of the extension offsets because total size of the
extensions had increased to a point where u8 did overflow.

3 years later we've managed to diet extensions a bit and we no longer
need u16. Furthermore we can now add a compile-time assertion for this
problem.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-04-19 23:55:17 +0800
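A compile-time check of the kind mentioned in the commit above can be sketched in plain C11. The extension structs, their sizes, and the 8-byte alignment are assumptions made up for this illustration; only the idea (fail the build if the summed, aligned extension sizes no longer fit in a u8) comes from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extension payloads; the real structs live in the kernel. */
struct ext_helper { void *helper; uint8_t data[32]; };
struct ext_nat    { uint32_t masq_index; void *ct; };
struct ext_acct   { uint64_t packets[2]; uint64_t bytes[2]; };

#define EXT_ALIGN 8	/* assumed per-extension alignment */
#define EXT_ALIGN_UP(x) \
	(((x) + (size_t)EXT_ALIGN - 1) & ~((size_t)EXT_ALIGN - 1))

#define TOTAL_EXT_SIZE \
	(EXT_ALIGN_UP(sizeof(struct ext_helper)) + \
	 EXT_ALIGN_UP(sizeof(struct ext_nat)) + \
	 EXT_ALIGN_UP(sizeof(struct ext_acct)))

/* The build breaks here if an offset could overflow an 8-bit field. */
_Static_assert(TOTAL_EXT_SIZE <= UINT8_MAX,
	       "extension offsets no longer fit in u8");

/* Runtime accessor for the same constant (hypothetical helper). */
static size_t total_extension_size(void)
{
	return TOTAL_EXT_SIZE;
}
```

Because the sum is a constant expression, the check costs nothing at run time and does not depend on the compiler choosing to inline anything.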
5f0d5a3ae mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU

A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.

Signed-off-by: Paul E. McKenney
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrew Morton
Cc:
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]
Acked-by: David Rientjes

Paul E. McKenney
2017-04-19 02:42:36 +0800

15 Apr, 2017

1 commit

cc41c84b7 netfilter: kill the fake untracked conntrack objects

resurrect an old patch from Pablo Neira to remove the untracked objects.

Currently, there are four possible states of an skb wrt. conntrack.

1. No conntrack attached, ct is NULL.
2. Normal (kmem cache allocated) ct attached.
3. a template (kmalloc'd), not in any hash tables at any point in time
4. the 'untracked' conntrack, a percpu nf_conn object, tagged via
IPS_UNTRACKED_BIT in ct->status.

Untracked is supposed to be identical to case 1. It exists only
so users can check

-m conntrack --ctstate UNTRACKED vs.
-m conntrack --ctstate INVALID

e.g. attempts to set connmark on INVALID or UNTRACKED conntracks are
supposed to be a no-op.

Thus currently we need to check
ct == NULL || nf_ct_is_untracked(ct)

in a lot of places in order to avoid altering untracked objects.

The other consequence of the percpu untracked object is that all
-j NOTRACK (and, later, kfree_skb of such skbs) result in an atomic op
(inc/dec the untracked conntracks refcount).

This adds a new kernel-private ctinfo state, IP_CT_UNTRACKED, to
make the distinction instead.

The (few) places that care about packet invalid (ct is NULL) vs.
packet untracked now need to test ct == NULL vs. ctinfo == IP_CT_UNTRACKED,
but all other places can omit the nf_ct_is_untracked() check.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-04-15 17:47:57 +0800
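The distinction drawn in the commit above (ct == NULL for invalid packets, a dedicated ctinfo value for untracked ones) can be sketched as follows. All names here are hypothetical stand-ins chosen for the example; CT_UNTRACKED plays the role of IP_CT_UNTRACKED and its numeric value is arbitrary.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the ctinfo states. */
enum ctinfo { CT_ESTABLISHED, CT_NEW, CT_UNTRACKED };

enum pkt_class { PKT_TRACKED, PKT_UNTRACKED, PKT_INVALID };

struct conn { unsigned int mark; };

/* Only the few callers that must tell "invalid" from "untracked"
 * need to look at ctinfo when no conntrack is attached. */
static enum pkt_class classify(const struct conn *ct, enum ctinfo info)
{
	if (ct)
		return PKT_TRACKED;
	return info == CT_UNTRACKED ? PKT_UNTRACKED : PKT_INVALID;
}

/* Every other caller gets by with a plain NULL check: setting a mark is
 * a no-op for both invalid and untracked packets. Returns 1 if applied. */
static int set_connmark(struct conn *ct, unsigned int mark)
{
	if (!ct)
		return 0;
	ct->mark = mark;
	return 1;
}
```

The second function shows the simplification the commit is after: no separate "is this the untracked object" test is needed before modifying conntrack state.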

07 Apr, 2017

1 commit

6e699867f netfilter: nat: avoid use of nf_conn_nat extension

successful insert into the bysource hash sets IPS_SRC_NAT_DONE status bit
so we can check that instead of presence of nat extension which requires
extra deref.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-04-07 04:01:42 +0800

24 Mar, 2017

1 commit

16ae1f223 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Conflicts:
drivers/net/ethernet/broadcom/genet/bcmmii.c
drivers/net/hyperv/netvsc.c
kernel/bpf/hashtab.c

Almost entirely overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2017-03-24 07:41:27 +0800

14 Mar, 2017

1 commit

fc09e4a75 netfilter: nf_conntrack: reduce resolve_normal_ct args

also mark init_conntrack noinline, in most cases resolve_normal_ct will
find an existing conntrack entry.

text data bss dec hex filename
16735 5707 176 22618 585a net/netfilter/nf_conntrack_core.o
16687 5707 176 22570 582a net/netfilter/nf_conntrack_core.o

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-03-14 02:30:20 +0800

13 Mar, 2017

1 commit

170a1fb9c netfilter: Force fake conntrack entry to be at least 8 bytes aligned

Since the nfct and nfctinfo have been combined, the nf_conn structure
must be at least 8 bytes aligned, as the 3 LSB bits are used for the
nfctinfo. But there's a fake nf_conn structure to denote untracked
connections, which is created by a PER_CPU construct. This does not
guarantee that it will be 8 bytes aligned and can break the logic in
determining the correct nfctinfo.

I triggered this on a 32bit machine with the following error:

BUG: unable to handle kernel NULL pointer dereference at 00000af4
IP: nf_ct_deliver_cached_events+0x1b/0xfb
*pdpt = 0000000031962001 *pde = 0000000000000000

Oops: 0000 [#1] SMP
[Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport
OK ]
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
task: c126ec00 task.stack: c1258000
EIP: nf_ct_deliver_cached_events+0x1b/0xfb
EFLAGS: 00010202 CPU: 0
EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a
ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0
Call Trace:

? ipv6_skip_exthdr+0xac/0xcb
ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6]
nf_hook_slow+0x22/0xc7
nf_hook+0x9a/0xad [ipv6]
? ip6t_do_table+0x356/0x379 [ip6_tables]
? ip6_fragment+0x9e9/0x9e9 [ipv6]
ip6_output+0xee/0x107 [ipv6]
? ip6_fragment+0x9e9/0x9e9 [ipv6]
dst_output+0x36/0x4d [ipv6]
NF_HOOK.constprop.37+0xb2/0xba [ipv6]
? icmp6_dst_alloc+0x2c/0xfd [ipv6]
? local_bh_enable+0x14/0x14 [ipv6]
mld_sendpack+0x1c5/0x281 [ipv6]
? mark_held_locks+0x40/0x5c
mld_ifc_timer_expire+0x1f6/0x21e [ipv6]
call_timer_fn+0x135/0x283
? detach_if_pending+0x55/0x55
? mld_dad_timer_expire+0x3e/0x3e [ipv6]
__run_timers+0x111/0x14b
? mld_dad_timer_expire+0x3e/0x3e [ipv6]
run_timer_softirq+0x1c/0x36
__do_softirq+0x185/0x37c
? test_ti_thread_flag.constprop.19+0xd/0xd
do_softirq_own_stack+0x22/0x28

irq_exit+0x5a/0xa4
smp_apic_timer_interrupt+0x2a/0x34
apic_timer_interrupt+0x37/0x3c

By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte
alignment, as all cache line sizes are at least 8 bytes.

Fixes: a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage area")
Signed-off-by: Steven Rostedt (VMware)
Acked-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Steven Rostedt (VMware)
2017-03-13 20:33:58 +0800

04 Feb, 2017

1 commit

52e01b84a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree, they are:

1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from
sk_buff so we only access one single cacheline in the conntrack
hotpath. Patchset from Florian Westphal.

2) Don't leak pointer to internal structures when exporting x_tables
ruleset back to userspace, from Willem DeBruijn. This includes new
helper functions to copy data to userspace such as xt_data_to_user()
as well as conversions of our ip_tables, ip6_tables and arp_tables
clients to use it. Not surprisingly, ebtables requires an ad-hoc
update. There is also a new field in x_tables extensions to indicate
the amount of bytes that we copy to userspace.

3) Add nf_log_all_netns sysctl: This new knob allows you to enable
logging via nf_log infrastructure for all existing netnamespaces.
Given the effort to provide pernet syslog has been discontinued,
let's provide a way to restore logging using netfilter kernel logging
facilities in trusted environments. Patch from Michal Kubecek.

4) Validate SCTP checksum from conntrack helper, from Davide Caratti.

5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly
a copy&paste from the original helper, from Florian Westphal.

6) Reset netfilter state when duplicating packets, also from Florian.

7) Remove unnecessary check for broadcast in IPv6 in pkttype match and
nft_meta, from Liping Zhang.

8) Add missing code to deal with loopback packets from nft_meta when
used by the netdev family, also from Liping.

9) Several cleanups on nf_tables, one to remove unnecessary check from
the netlink control plane path to add table, set and stateful objects
and code consolidation when unregister chain hooks, from Gao Feng.

10) Fix harmless reference counter underflow in IPVS that, however,
results in problems with the introduction of the new refcount_t
type, from David Windsor.

11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp,
from Davide Caratti.

12) Missing documentation on nf_tables uapi header, from Liping Zhang.

13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.
====================

Signed-off-by: David S. Miller

David S. Miller
2017-02-04 05:58:20 +0800

02 Feb, 2017

6 commits

a9e419dc7 netfilter: merge ctinfo into nfct pointer storage area

After this change conntrack operations (lookup, creation, matching from
ruleset) only access one instead of two sk_buff cache lines.

This works for normal conntracks because those are allocated from a slab
that guarantees hw cacheline or 8byte alignment (whatever is larger)
so the 3 bits needed for ctinfo won't overlap with nf_conn addresses.

Template allocation now does manual address alignment (see previous change)
on arches that don't have sufficient kmalloc min alignment.

Some spots intentionally use skb->_nfct instead of skb_nfct() helpers,
this is to avoid undoing the skb_nfct() use when we remove untracked
conntrack object in the future.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-02-02 21:31:56 +0800
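Storing ctinfo in the low bits of the conntrack pointer, as the commit above describes, is ordinary pointer tagging. The sketch below shows the idea in userspace; the helper names are hypothetical, and the mask value of 7 is simply what three bits require.

```c
#include <stdint.h>

#define CT_INFOMASK ((uintptr_t)7)	/* low 3 bits carry ctinfo */
#define CT_PTRMASK  (~CT_INFOMASK)

/* Combine an object address and a 3-bit info value into one word.
 * Only lossless if the address is 8-byte aligned. */
static uintptr_t ct_pack(const void *ct, unsigned int info)
{
	return (uintptr_t)ct | ((uintptr_t)info & CT_INFOMASK);
}

/* Recover the address by clearing the tag bits. */
static void *ct_unpack_ptr(uintptr_t v)
{
	return (void *)(v & CT_PTRMASK);
}

/* Recover the 3-bit info value. */
static unsigned int ct_unpack_info(uintptr_t v)
{
	return (unsigned int)(v & CT_INFOMASK);
}
```

If the address is not 8-byte aligned, its own low bits are mixed into the tag: packing an address ending in binary 100 with info 1 reads back as info 5 and a pointer four bytes too low. That is why every object stored this way must be at least 8-byte aligned.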
303223092 netfilter: guarantee 8 byte minalign for template addresses

The next change will merge skb->nfct pointer and skb->nfctinfo
status bits into single skb->_nfct (unsigned long) area.

For this to work nf_conn addresses must always be aligned at least on
an 8 byte boundary since we will need the lower 3bits to store nfctinfo.

Conntrack templates are allocated via kmalloc.
kbuild test robot reported
BUILD_BUG_ON failed: NFCT_INFOMASK >= ARCH_KMALLOC_MINALIGN
on v1 of this patchset, so not all platforms meet this requirement.

Do manual alignment if needed, the alignment offset is stored in the
nf_conn entry protocol area. This works because templates are not
handed off to L4 protocol trackers.

Reported-by: kbuild test robot
Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-02-02 21:31:55 +0800
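The manual alignment described in the commit above can be sketched in userspace: over-allocate, round the address up to the next 8-byte boundary, and remember the offset inside the object so the raw pointer can be recovered for freeing. struct tmpl and the tmpl_* helpers are hypothetical; the kernel keeps the offset in the nf_conn protocol area, which this sketch replaces with a dedicated field.

```c
#include <stdint.h>
#include <stdlib.h>

#define TMPL_ALIGN 8u

/* Hypothetical template object. */
struct tmpl {
	uint8_t raw_offset;	/* bytes between the raw allocation and this object */
	int zone;
};

/* Place a template at the first aligned address at or after 'raw'.
 * The caller must provide at least sizeof(struct tmpl) + TMPL_ALIGN bytes. */
static struct tmpl *tmpl_place(void *raw)
{
	uintptr_t start = (uintptr_t)raw;
	uintptr_t aligned = (start + TMPL_ALIGN - 1) & ~(uintptr_t)(TMPL_ALIGN - 1);
	struct tmpl *t = (struct tmpl *)aligned;

	t->raw_offset = (uint8_t)(aligned - start);
	return t;
}

/* Undo the adjustment to get back the pointer the allocator returned. */
static void *tmpl_raw(struct tmpl *t)
{
	return (unsigned char *)t - t->raw_offset;
}

static struct tmpl *tmpl_alloc(void)
{
	void *raw = calloc(1, sizeof(struct tmpl) + TMPL_ALIGN);

	return raw ? tmpl_place(raw) : NULL;
}

static void tmpl_free(struct tmpl *t)
{
	if (t)
		free(tmpl_raw(t));
}
```

On platforms where the allocator already returns 8-byte aligned memory the stored offset is simply 0, so the extra bookkeeping only matters where the minimum alignment is smaller.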
c74454fad netfilter: add and use nf_ct_set helper

Add a helper to assign a nf_conn entry and the ctinfo bits to an sk_buff.
This avoids changing code in followup patch that merges skb->nfct and
skb->nfctinfo into skb->_nfct.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-02-02 21:31:54 +0800

cb9c68363 skbuff: add and use skb_nfct helper

Followup patch renames skb->nfct and changes its type so add a helper to
avoid intrusive rename change later.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-02-02 21:31:53 +0800

97a6ad13d netfilter: reduce direct skb->nfct usage

Next patch makes direct skb->nfct access illegal, reduce noise
in next patch by using accessors we already have.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-02-02 21:31:52 +0800

11df4b760 netfilter: conntrack: no need to pass ctinfo to error handler

It is never accessed for reading and the only places that write to it
are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo).

The conntrack core specifically checks for attached skb->nfct after
->error() invocation and returns early in this case.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-02-02 21:31:51 +0800

19 Jan, 2017

2 commits

e5072053b netfilter: conntrack: refine gc worker heuristics, redux

This further refines the changes made to conntrack gc_worker in
commit e0df8cae6c16 ("netfilter: conntrack: refine gc worker heuristics").

The main idea of that change was to reduce the scan interval when evictions
take place.

However, on the reporters' setup, there are 1-2 million conntrack entries
in total and roughly 8k new (and closing) connections per second.

In this case we'll always evict at least one entry per gc cycle and scan
interval is always at 1 jiffy because of this test:

} else if (expired_count) {
gc_work->next_gc_run /= 2U;
next_run = msecs_to_jiffies(1);

being true almost all the time.

Given we scan ~10k entries per run, it's clearly wrong to reduce the interval
based on a nonzero eviction count; it will only waste cpu cycles since the vast
majority of conntracks have not timed out.

Thus only look at the ratio (scanned entries vs. evicted entries) to make
a decision on whether to reduce or not.

Because the evictor is supposed to only kick in when the system turns idle
after a busy period, pick a high ratio -- this makes it 50%. We thus keep
the idea of increasing the scan rate when it's likely that the table contains
many expired entries.

In order to not let timed-out entries hang around for too long
(important when using event logging, in which case we want timely
destroy events), we now scan the full table within at most
GC_MAX_SCAN_JIFFIES (16 seconds) even in worst-case scenario where all
timed-out entries sit in same slot.

I tested this with a vm under synflood (with
sysctl net.netfilter.nf_conntrack_tcp_timeout_syn_recv=3).

While flood is ongoing, interval now stays at its max rate
(GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV -> 125ms).

With feedback from Nicolas Dichtel.

Reported-by: Denys Fedoryshchenko
Cc: Nicolas Dichtel
Fixes: b87a2f9199ea82eaadc ("netfilter: conntrack: add gc worker to remove timed-out entries")
Signed-off-by: Florian Westphal
Tested-by: Nicolas Dichtel
Acked-by: Nicolas Dichtel
Tested-by: Denys Fedoryshchenko
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-01-19 21:28:01 +0800
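The ratio-based scheduling described in the commit above can be sketched as a pure function. This is an illustration under stated assumptions: it works in milliseconds rather than jiffies, the divisor of 128 is inferred from the 16 s and 125 ms figures in the commit message, and the 1 ms minimum interval and the function name are hypothetical.

```c
#include <stdint.h>

#define GC_MAX_SCAN_MS     16000u	/* 16 seconds, from the commit message */
#define GC_MAX_BUCKETS_DIV   128u	/* assumed: 16000 / 128 = 125 ms */
#define GC_EVICT_RATIO        50u	/* percent, from the commit message */
#define GC_MIN_INTERVAL_MS     1u	/* assumed minimum interval */

/* Return the delay before the next gc run, given the current delay and
 * how many entries the last run scanned and evicted. */
static unsigned int gc_next_interval_ms(unsigned int cur_ms,
					unsigned int scanned,
					unsigned int expired)
{
	unsigned int max_ms = GC_MAX_SCAN_MS / GC_MAX_BUCKETS_DIV;
	unsigned int ratio = scanned ?
		(unsigned int)((uint64_t)expired * 100u / scanned) : 0;

	/* Many expired entries: scan again as soon as possible. */
	if (ratio > GC_EVICT_RATIO)
		return GC_MIN_INTERVAL_MS;

	/* Otherwise back off gradually, up to the cap. */
	cur_ms += GC_MIN_INTERVAL_MS;
	return cur_ms > max_ms ? max_ms : cur_ms;
}
```

With roughly 10k entries scanned and a single eviction the ratio rounds to 0, so the interval keeps growing toward the 125 ms cap instead of collapsing to the minimum as it did under the old nonzero-count test.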
524b698db netfilter: conntrack: remove GC_MAX_EVICTS break

Instead of breaking loop and instant resched, don't bother checking
this in first place (the loop calls cond_resched for every bucket anyway).

Suggested-by: Nicolas Dichtel
Signed-off-by: Florian Westphal
Acked-by: Nicolas Dichtel
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
2017-01-19 21:27:41 +0800

26 Dec, 2016

1 commit

2456e8553 ktime: Get rid of the union

ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machines and in an endianness-optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
became completely pointless.

Get rid of the union and just keep ktime_t as simple typedef of type s64.

The conversion was done with coccinelle and some manual mopping up.

Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra

Thomas Gleixner
2016-12-26 00:21:22 +0800

15 Nov, 2016

1 commit

bb598c1b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Several cases of bug fixes in 'net' overlapping other changes in
'net-next'.

Signed-off-by: David S. Miller

David S. Miller
2016-11-15 23:54:36 +0800

10 Nov, 2016

1 commit

56a62e221 netfilter: conntrack: fix NF_REPEAT handling

gcc correctly identified a theoretical uninitialized variable use:

net/netfilter/nf_conntrack_core.c: In function 'nf_conntrack_in':
net/netfilter/nf_conntrack_core.c:1125:14: error: 'l4proto' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This could only happen when we 'goto out' before looking up l4proto,
and then enter the retry, implying that l3proto->get_l4proto()
returned NF_REPEAT. This does not currently get returned in any
code path and probably won't ever happen, but is not good to
rely on.

Moving the repeat handling up a little should have the same
behavior as today but avoids the warning by making that case
impossible to enter.

[ I have mangled this original patch to remove the check for tmpl, we
should unconditionally jump back to the repeat label in case we hit
NF_REPEAT instead. I have also moved the comment that explains this
where it belongs. --pablo ]

Fixes: 08733a0cb7de ("netfilter: handle NF_REPEAT from nf_conntrack_in()")
Signed-off-by: Arnd Bergmann
Signed-off-by: Pablo Neira Ayuso

Arnd Bergmann
2016-11-10 07:19:33 +0800