03 Apr, 2019
1 commit
-
[ Upstream commit 408f13ef358aa5ad56dc6230c2c7deb92cf462b1 ]
As it stands if a shrink is delayed because of an outstanding
rehash, we will go into a rescheduling loop without ever doing
the rehash.

This patch fixes this by still carrying out the rehash and then
rescheduling so that we can shrink after the completion of the
rehash should it still be necessary.

The return value of EEXIST captures this case and other cases
(e.g., another thread expanded/rehashed the table at the same
time) where we should still proceed with the rehash.
Fixes: da20420f83ea ("rhashtable: Add nested tables")
Reported-by: Josh Elsasser
Signed-off-by: Herbert Xu
Tested-by: Josh Elsasser
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
28 Aug, 2018
1 commit
-
Pull networking fixes from David Miller:
1) ICE, E1000, IGB, IXGBE, and I40E bug fixes from the Intel folks.
2) Better fix for AB-BA deadlock in packet scheduler code, from Cong
Wang.
3) bpf sockmap fixes (zero sized key handling, etc.) from Daniel
Borkmann.
4) Send zero IPID in TCP resets and SYN-RECV state ACKs, to prevent
attackers using it as a side-channel. From Eric Dumazet.
5) Memory leak in mediatek bluetooth driver, from Gustavo A. R. Silva.
6) Hook up rt->dst.input of ipv6 anycast routes properly, from Hangbin
Liu.
7) hns and hns3 bug fixes from Huazhong Tan.
8) Fix RIF leak in mlxsw driver, from Ido Schimmel.
9) iova range check fix in vhost, from Jason Wang.
10) Fix hang in do_tcp_sendpages() with tls, from John Fastabend.
11) More r8152 chips need to disable RX aggregation, from Kai-Heng Feng.
12) Memory exposure in TCA_U32_SEL handling, from Kees Cook.
13) TCP BBR congestion control fixes from Kevin Yang.
14) hv_netvsc, ignore non-PCI devices, from Stephen Hemminger.
15) qed driver fixes from Tomer Tayar.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
net: sched: Fix memory exposure from short TCA_U32_SEL
qed: fix spelling mistake "comparsion" -> "comparison"
vhost: correctly check the iova range when waking virtqueue
qlge: Fix netdev features configuration.
net: macb: do not disable MDIO bus at open/close time
Revert "net: stmmac: fix build failure due to missing COMMON_CLK dependency"
net: macb: Fix regression breaking non-MDIO fixed-link PHYs
mlxsw: spectrum_switchdev: Do not leak RIFs when removing bridge
i40e: fix condition of WARN_ONCE for stat strings
i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled
ixgbe: fix driver behaviour after issuing VFLR
ixgbe: Prevent unsupported configurations with XDP
ixgbe: Replace GFP_ATOMIC with GFP_KERNEL
igb: Replace mdelay() with msleep() in igb_integrated_phy_loopback()
igb: Replace GFP_ATOMIC with GFP_KERNEL in igb_sw_init()
igb: Use an advanced ctx descriptor for launchtime
e1000: ensure to free old tx/rx rings in set_ringparam()
e1000: check on netif_running() before calling e1000_up()
ixgb: use dma_zalloc_coherent instead of allocator/memset
ice: Trivial formatting fixes
...
23 Aug, 2018
2 commits
-
rhashtable_init() may fail due to -ENOMEM, thus making the entire api
unusable. This patch removes this scenario, however unlikely. In order
to guarantee memory allocation, this patch always ends up doing
GFP_KERNEL|__GFP_NOFAIL for both the tbl as well as
alloc_bucket_spinlocks().

Upon the first table allocation failure, we shrink the size to the
smallest value that makes sense and retry with __GFP_NOFAIL semantics.
With the defaults, this means that from 64 buckets, we retry with only 4.
Any later issues regarding performance due to collisions or larger table
resizing (when more memory becomes available) are the least of our
problems.
Link: http://lkml.kernel.org/r/20180712185241.4017-9-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso
Signed-off-by: Manfred Spraul
Acked-by: Herbert Xu
Cc: Dmitry Vyukov
Cc: Kees Cook
Cc: Michael Kerrisk
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
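A minimal userspace sketch of the fallback described above, assuming a hypothetical allocator (tbl_alloc) whose fail_above argument simulates memory pressure. The names bucket_tbl_sketch, tbl_alloc and tbl_init_sketch are illustrative and are not the kernel's; userspace has no equivalent of __GFP_NOFAIL, so the retry here merely disables the simulated failure.

```c
#include <stdint.h>
#include <stdlib.h>

struct bucket_tbl_sketch {
	size_t size;    /* number of buckets */
	void **buckets;
};

/*
 * Hypothetical allocator: requests for more than fail_above buckets fail,
 * which simulates memory pressure. A genuine malloc()/calloc() failure
 * also yields NULL.
 */
static struct bucket_tbl_sketch *tbl_alloc(size_t size, size_t fail_above)
{
	struct bucket_tbl_sketch *tbl;

	if (size > fail_above)
		return NULL;
	tbl = malloc(sizeof(*tbl));
	if (!tbl)
		return NULL;
	tbl->buckets = calloc(size, sizeof(*tbl->buckets));
	if (!tbl->buckets) {
		free(tbl);
		return NULL;
	}
	tbl->size = size;
	return tbl;
}

/* Try the requested size once, then fall back to the smallest size. */
static struct bucket_tbl_sketch *tbl_init_sketch(size_t size, size_t min_size,
						 size_t fail_above)
{
	struct bucket_tbl_sketch *tbl = tbl_alloc(size, fail_above);

	if (!tbl)
		/* SIZE_MAX turns the simulated failure off for the retry. */
		tbl = tbl_alloc(min_size, SIZE_MAX);
	return tbl;
}
```

With a requested size of 64 and a minimum of 4, a simulated failure above 16 buckets leaves the caller with a 4-bucket table, matching the "from 64 buckets, we retry with only 4" case.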
As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
incompatible gfp flags") we can simplify the caller and trust kvzalloc()
to just do the right thing. In the GFP_ATOMIC case we can drop the
__GFP_NORETRY flag for obvious reasons. The __GFP_NOWARN flag is now
passed by the caller instead of being handled inside
bucket_table_alloc().

This slightly changes the gfp flags passed on to nested_table_alloc() as
it will now also use GFP_ATOMIC | __GFP_NOWARN. However, I consider this
a positive consequence, since we want nowarn semantics there for the same
reasons we want them in bucket_table_alloc().
[manfred@colorfullife.com: commit id extended to 12 digits, line wraps updated]
Link: http://lkml.kernel.org/r/20180712185241.4017-8-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso
Signed-off-by: Manfred Spraul
Acked-by: Michal Hocko
Cc: Dmitry Vyukov
Cc: Herbert Xu
Cc: Kees Cook
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Aug, 2018
1 commit
-
Remove duplicated include.
Signed-off-by: Yue Haibing
Signed-off-by: David S. Miller
21 Jul, 2018
1 commit
-
All conflicts were trivial overlapping changes, so reasonably
easy to resolve.
Signed-off-by: David S. Miller
19 Jul, 2018
1 commit
-
rhashtable_init() currently does not take into account the user-passed
min_size parameter unless param->nelem_hint is set as well. As such,
the default size (number of buckets) will always be HASH_DEFAULT_SIZE
even if the smallest allowed size is larger than that. Remediate this
by unconditionally calling into rounded_hashtable_size() and handling
things accordingly.
Signed-off-by: Davidlohr Bueso
Acked-by: Herbert Xu
Signed-off-by: David S. Miller
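A userspace sketch of the sizing rule described above, in which min_size is honoured whether or not a hint is given. The struct params_sketch and the helper names are hypothetical, and both the 4/3 headroom factor applied to the hint and the value 64 for HASH_DEFAULT_SIZE are assumptions made for illustration.

```c
#define HASH_DEFAULT_SIZE 64UL /* default bucket count; value assumed here */

/* Hypothetical stand-in for the two parameters discussed above. */
struct params_sketch {
	unsigned long nelem_hint; /* expected element count, 0 if unknown */
	unsigned long min_size;   /* smallest allowed number of buckets */
};

/* Smallest power of two >= n; assumes n is at most ULONG_MAX / 2 + 1. */
static unsigned long roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* min_size is applied on both paths, with or without a hint. */
static unsigned long rounded_size_sketch(const struct params_sketch *p)
{
	unsigned long size = p->nelem_hint ?
		roundup_pow2(p->nelem_hint * 4 / 3) : HASH_DEFAULT_SIZE;

	return size > p->min_size ? size : p->min_size;
}
```

In this sketch a caller that passes min_size = 256 and no hint gets 256 buckets rather than the default.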
10 Jul, 2018
1 commit
-
rhashtable_free_and_destroy() cancels the deferred re-hash work and
then walks and destroys elements. At that moment, some elements can
still be in future_tbl. Those elements are not destroyed.

Test case:
nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
all elements of sets before destroying sets and chains.
But rhashtable_free_and_destroy() doesn't destroy elements of future_tbl,
so a splat occurred.

Test script:
%cat test.nft
table ip aa {
map map1 {
type ipv4_addr : verdict;
elements = {
0 : jump a0,
1 : jump a0,
2 : jump a0,
3 : jump a0,
4 : jump a0,
5 : jump a0,
6 : jump a0,
7 : jump a0,
8 : jump a0,
9 : jump a0,
}
}
chain a0 {
}
}
flush ruleset
table ip aa {
map map1 {
type ipv4_addr : verdict;
elements = {
0 : jump a0,
1 : jump a0,
2 : jump a0,
3 : jump a0,
4 : jump a0,
5 : jump a0,
6 : jump a0,
7 : jump a0,
8 : jump a0,
9 : jump a0,
}
}
chain a0 {
}
}
flush ruleset

%while :; do nft -f test.nft; done
Splat looks like:
[ 200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
[ 200.806944] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
[ 200.820297] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[ 200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
[ 200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 0b 48 89 da 48 b8 00 00 00 00 00 fc ff
[ 200.860366] RSP: 0000:ffff880118dbf4d0 EFLAGS: 00010282
[ 200.866354] RAX: 0000000000000061 RBX: ffff88010cdeaf08 RCX: 0000000000000000
[ 200.874355] RDX: 0000000000000061 RSI: 0000000000000008 RDI: ffffed00231b7e90
[ 200.882361] RBP: ffff880118dbf4e8 R08: ffffed002373bcfb R09: ffffed002373bcfa
[ 200.890354] R10: 0000000000000000 R11: ffffed002373bcfb R12: dead000000000200
[ 200.898356] R13: dead000000000100 R14: ffffffffbb62af38 R15: dffffc0000000000
[ 200.906354] FS: 00007fefc31fd700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
[ 200.915533] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 200.922355] CR2: 0000557f1c8e9128 CR3: 0000000106880000 CR4: 00000000001006e0
[ 200.930353] Call Trace:
[ 200.932351] ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
[ 200.939525] ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
[ 200.947525] ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
[ 200.952383] ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
[ 200.959532] ? nla_parse+0xab/0x230
[ 200.963529] ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
[ 200.968384] ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
[ 200.975525] ? debug_show_all_locks+0x290/0x290
[ 200.980363] ? debug_show_all_locks+0x290/0x290
[ 200.986356] ? sched_clock_cpu+0x132/0x170
[ 200.990352] ? find_held_lock+0x39/0x1b0
[ 200.994355] ? sched_clock_local+0x10d/0x130
[ 200.999531] ? memset+0x1f/0x40

V2:
- free all tables requested by Herbert Xu
Signed-off-by: Taehee Yoo
Acked-by: Herbert Xu
Signed-off-by: David S. Miller
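A simplified userspace model of the "free all tables" idea from V2: destruction follows the future_tbl link so that elements already moved to (or inserted into) a pending table are released too. The types and helpers (table_sketch, table_new, table_put, free_and_destroy_sketch) are hypothetical, hold at most one element per slot, and ignore locking and RCU entirely.

```c
#include <stdlib.h>

struct elem_sketch {
	int key;
};

struct table_sketch {
	size_t size;
	struct elem_sketch **slots;      /* at most one element per slot */
	struct table_sketch *future_tbl; /* next table while a rehash is pending */
};

static struct table_sketch *table_new(size_t size)
{
	struct table_sketch *tbl = calloc(1, sizeof(*tbl));

	if (!tbl)
		return NULL;
	tbl->slots = calloc(size, sizeof(*tbl->slots));
	if (!tbl->slots) {
		free(tbl);
		return NULL;
	}
	tbl->size = size;
	return tbl;
}

/* Store a new element in slot idx; returns 0 on success, -1 on failure. */
static int table_put(struct table_sketch *tbl, size_t idx, int key)
{
	struct elem_sketch *e;

	if (idx >= tbl->size || tbl->slots[idx])
		return -1;
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->key = key;
	tbl->slots[idx] = e;
	return 0;
}

/* Free every element in tbl and in each table reachable via future_tbl. */
static size_t free_and_destroy_sketch(struct table_sketch *tbl)
{
	size_t freed = 0;

	while (tbl) {
		struct table_sketch *next = tbl->future_tbl;
		size_t i;

		for (i = 0; i < tbl->size; i++) {
			if (tbl->slots[i]) {
				free(tbl->slots[i]);
				freed++;
			}
		}
		free(tbl->slots);
		free(tbl);
		tbl = next;
	}
	return freed;
}
```

Stopping after the first table, as the pre-fix behaviour is described above, would leave the elements of the second table allocated.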
03 Jul, 2018
1 commit
-
In file lib/rhashtable.c line 777, skip variable is assigned to
itself. The following error was observed:

lib/rhashtable.c:777:41: warning: explicitly assigning value of
variable of type 'int' to itself [-Wself-assign] error, forbidden
warning: rhashtable.c:777

This error was found when compiling with Clang 6.0. Change it to iter->skip.
Signed-off-by: Rishabh Bhatnagar
Acked-by: Herbert Xu
Reviewed-by: NeilBrown
Signed-off-by: David S. Miller
22 Jun, 2018
6 commits
-
Using rht_dereference_bucket() to dereference
->future_tbl looks like a type error, and could be confusing.
Using rht_dereference_rcu() to test a pointer for NULL
adds an unnecessary barrier - rcu_access_pointer() is preferred
for NULL tests when no lock is held.

This uses 3 different ways to access ->future_tbl.
- if we know the mutex is held, use rht_dereference()
- if we don't hold the mutex, and are only testing for NULL,
  use rcu_access_pointer()
- otherwise (using RCU protection for true dereference),
  use rht_dereference_rcu().

Note that this includes a simplification of the call to
rhashtable_last_table() - we don't do an extra dereference
before the call any more.
Acked-by: Herbert Xu
Signed-off-by: NeilBrown
Signed-off-by: David S. Miller -
Rather than borrowing one of the bucket locks to
protect ->future_tbl updates, use cmpxchg().
This gives more freedom to change how bucket locking
is implemented.
Acked-by: Herbert Xu
Signed-off-by: NeilBrown
Signed-off-by: David S. Miller -
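A userspace analogue of publishing ->future_tbl with a compare-and-exchange instead of a lock, written with C11 atomics rather than the kernel's cmpxchg(). The struct tbl_sketch and the function attach_future_tbl are hypothetical names, and returning -EEXIST when another table is already attached is an assumption about how a loser of the race would be reported.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct tbl_sketch {
	_Atomic(struct tbl_sketch *) future_tbl;
};

/*
 * Attach new_tbl only if no future table has been published yet.
 * Returns 0 on success, -EEXIST if some table was already attached.
 */
static int attach_future_tbl(struct tbl_sketch *old_tbl,
			     struct tbl_sketch *new_tbl)
{
	struct tbl_sketch *expected = NULL;

	if (!atomic_compare_exchange_strong(&old_tbl->future_tbl,
					    &expected, new_tbl))
		return -EEXIST;
	return 0;
}
```

Only one of several concurrent callers can succeed, so no bucket lock is needed to serialise the update in this model.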
Now that we don't use the hash value or shift in nested_table_alloc()
there is room for simplification.
We only need to pass a "is this a leaf" flag to nested_table_alloc(),
and don't need to track as much information in
rht_bucket_nested_insert().

Note there is another minor cleanup in nested_table_alloc() here.
The number of elements in a page of "union nested_tables" is most naturally
    PAGE_SIZE / sizeof(ntbl[0])
The previous code had
    PAGE_SIZE / sizeof(ntbl[0].bucket)
which happens to be the correct value only because the bucket uses all
the space in the union.
Acked-by: Herbert Xu
Signed-off-by: NeilBrown
Signed-off-by: David S. Miller -
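The sizeof point above can be checked in userspace. This sketch assumes a 4096-byte page and uses hypothetical stand-in types (nested_table_sketch, wide_sketch); it shows that the two expressions agree only while the bucket member fills the whole union.

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u /* assumed page size, for illustration only */

struct rhash_head_sketch; /* opaque stand-in; only pointers to it are used */

/* Both members are pointers, so the bucket member fills the union. */
union nested_table_sketch {
	union nested_table_sketch *table;
	struct rhash_head_sketch *bucket;
};

/* Hypothetical variant with a wider member: sizes no longer match. */
union wide_sketch {
	char pad[64];
	struct rhash_head_sketch *bucket;
};

/* Entries per page computed from the whole union (the natural form). */
static size_t entries_by_union(void)
{
	return PAGE_SIZE_SKETCH / sizeof(union nested_table_sketch);
}

/* Entries per page computed from one member (the previous form). */
static size_t entries_by_bucket(void)
{
	return PAGE_SIZE_SKETCH /
	       sizeof(((union nested_table_sketch *)0)->bucket);
}
```

For wide_sketch the member-based division would overestimate how many entries fit in a page, which is why dividing by the size of the whole union is the safer form.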
The 'ht' and 'hash' arguments to INIT_RHT_NULLS_HEAD() are
no longer used - so drop them. This allows us to also
remove the nhash argument from nested_table_alloc().
Acked-by: Herbert Xu
Signed-off-by: NeilBrown
Signed-off-by: David S. Miller -
This "feature" is unused, undocumented, and untested and so doesn't
really belong. A patch is under development to properly implement
support for detecting when a search gets diverted down a different
chain, which is the common purpose of nulls markers.

This patch actually fixes a bug too. The table resizing allows a
table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
any growth beyond 2^27 is wasteful and ineffective.

This patch results in NULLS_MARKER(0) being used for all chains,
and leaves the use of rht_is_a_null() to test for it.
Acked-by: Herbert Xu
Signed-off-by: NeilBrown
Signed-off-by: David S. Miller -
Due to the use of rhashtables in net namespaces,
rhashtable.h is included in lots of the kernel,
so a small change can require a large recompilation.
This makes development painful.

This patch splits out rhashtable-types.h which just includes
the major type declarations, and does not include (non-trivial)
inline code. rhashtable.h is no longer included by anything
in the include/ directory.
Common include files only include rhashtable-types.h so a large
recompilation is only triggered when that changes.
Acked-by: Herbert Xu
Signed-off-by: NeilBrown
Signed-off-by: David S. Miller
25 Apr, 2018
3 commits
-
When a walk of an rhashtable is interrupted with rhashtable_walk_stop()
and then rhashtable_walk_start(), the location to restart from is based
on a 'skip' count in the current hash chain, and this can be incorrect
if insertions or deletions have happened. This does not happen when
the walk is not stopped and started as iter->p is a placeholder which
is safe to use while holding the RCU read lock.

In rhashtable_walk_start() we can revalidate that 'p' is still in the
same hash chain. If it isn't then the current method is still used.

With this patch, if a rhashtable walker ensures that the current
object remains in the table over a stop/start period (possibly by
elevating the reference count if that is sufficient), it can be sure
that a walk will not miss objects that were in the hashtable for the
whole time of the walk.

rhashtable_walk_start() may not find the object even though it is
still in the hashtable if a rehash has moved it to a new table. In
this case it will (eventually) get -EAGAIN and will need to proceed
through the whole table again to be sure to see everything at least
once.
Acked-by: Herbert Xu
Signed-off-by: NeilBrown
Signed-off-by: David S. Miller -
The documentation claims that when rhashtable_walk_start_check()
detects a resize event, it will rewind back to the beginning
of the table. This is not true. We need to set ->slot and
->skip to be zero for it to be true.
Acked-by: Herbert Xu
Signed-off-by: NeilBrown
Signed-off-by: David S. Miller -
Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, though
they do take a spinlock without irq protection.
So revise the comments to accurately state the contexts in which
these functions can be called.
Acked-by: Herbert Xu
Signed-off-by: NeilBrown
Signed-off-by: David S. Miller
01 Apr, 2018
1 commit
-
Rehashing and destroying a large hash table takes a lot of time,
and happens in process context. It is safe to add cond_resched()
in rhashtable_rehash_table() and rhashtable_free_and_destroy().
Signed-off-by: Eric Dumazet
Acked-by: Herbert Xu
Signed-off-by: David S. Miller
07 Mar, 2018
1 commit
-
When inserting duplicate objects (those with the same key),
the current rhlist implementation messes up the chain pointers by
updating the bucket pointer instead of the previous element's next
pointer to point to the newly inserted node. This causes missing
elements on removal and traversal.

Fix that by properly updating the pprev pointer to point to
the correct rhash_head next pointer.
Issue: 1241076
Change-Id: I86b2c140bcb4aeb10b70a72a267ff590bb2b17e7
Fixes: ca26893f05e8 ('rhashtable: Add rhlist interface')
Signed-off-by: Paul Blakey
Acked-by: Herbert Xu
Signed-off-by: David S. Miller
11 Dec, 2017
3 commits
-
To allocate the array of bucket locks for the hash table we now
call library function alloc_bucket_spinlocks. This function is
based on the old alloc_bucket_locks in rhashtable and should
produce the same effect.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller -
This function is like rhashtable_walk_next except that it only returns
the current element in the iter and does not advance the iter.

This patch also creates __rhashtable_walk_find_next. It finds the next
element in the table when the entry cached in iter is NULL or at the end
of a slot. __rhashtable_walk_find_next is called from
rhashtable_walk_next and rhashtable_walk_peek.

end_of_table is an added field to the iter structure. This indicates
that the end of table was reached (walker.tbl being NULL is not a
sufficient condition for end of table).
Signed-off-by: Tom Herbert
Acked-by: Herbert Xu
Signed-off-by: David S. Miller -
Most callers of rhashtable_walk_start don't care about a resize event
which is indicated by a return value of -EAGAIN. So calls to
rhashtable_walk_start are wrapped with code to ignore -EAGAIN. Something
like this is common:

    ret = rhashtable_walk_start(rhiter);
    if (ret && ret != -EAGAIN)
            goto out;

Since zero and -EAGAIN are the only possible return values from the
function this check is pointless. The condition never evaluates to true.

This patch changes rhashtable_walk_start to return void. This simplifies
code for the callers that ignore -EAGAIN. For the few cases where the
caller cares about the resize event, particularly where the table can be
walked in multiple parts for netlink or seq file dump, the function
rhashtable_walk_start_check has been added that returns -EAGAIN on a
resize event.
Signed-off-by: Tom Herbert
Acked-by: Herbert Xu
Signed-off-by: David S. Miller
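The claim that the old caller-side check can never fire is easy to confirm in isolation. In this sketch, old_check_fires is a hypothetical helper that copies only the condition from the snippet above; it does not call any rhashtable function.

```c
#include <errno.h>
#include <stdbool.h>

/* The condition callers used to wrap around rhashtable_walk_start(). */
static bool old_check_fires(int ret)
{
	return ret && ret != -EAGAIN;
}
```

For the two values the function could return, 0 and -EAGAIN, the predicate is false. It would only be true for some other error code, such as -ENOMEM, which the function was never documented to return.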
20 Sep, 2017
1 commit
-
Clarify that rhashtable_walk_{stop,start} will not reset the iterator to
the beginning of the hash table. Confusion between rhashtable_walk_enter
and rhashtable_walk_start has already led to a bug.
Signed-off-by: Andreas Gruenbacher
Signed-off-by: David S. Miller
16 Jul, 2017
1 commit
-
Pull random updates from Ted Ts'o:
"Add wait_for_random_bytes() and get_random_*_wait() functions so that
callers can more safely get random bytes if they can block until the
CRNG is initialized.

Also print a warning if get_random_*() is called before the CRNG is
initialized. By default, only one single-line warning will be printed
per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a
warning will be printed for each function which tries to get random
bytes before the CRNG is initialized. This can get spammy for certain
architecture types, so it is not enabled by default"

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: reorder READ_ONCE() in get_random_uXX
random: suppress spammy warnings about unseeded randomness
random: warn when kernel uses unseeded randomness
net/route: use get_random_int for random counter
net/neighbor: use get_random_u32 for 32-bit hash random
rhashtable: use get_random_u32 for hash_rnd
ceph: ensure RNG is seeded before using
iscsi: ensure RNG is seeded before use
cifs: use get_random_u32 for 32-bit lock random
random: add get_random_{bytes,u32,u64,int,long,once}_wait family
random: add wait_for_random_bytes() API
11 Jul, 2017
1 commit
-
bucket_table_alloc() can be currently called with GFP_KERNEL or
GFP_ATOMIC. For the former we basically have an open coded kvzalloc()
while the latter only uses kzalloc(). Let's simplify the code a bit by
dropping the open coded path and replacing it with kvzalloc().
Link: http://lkml.kernel.org/r/20170531155145.17111-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: Thomas Graf
Cc: Herbert Xu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Jun, 2017
1 commit
-
This is much faster and just as secure. It also has the added benefit of
probably returning better randomness at early-boot on systems with
architectural RNGs.
Signed-off-by: Jason A. Donenfeld
Cc: Thomas Graf
Cc: Herbert Xu
Signed-off-by: Theodore Ts'o
09 May, 2017
1 commit
-
alloc_bucket_locks allocation pattern is quite unusual. We are
preferring vmalloc when CONFIG_NUMA is enabled. The rationale is that
vmalloc will respect the memory policy of the current process and so the
backing memory will get distributed over multiple nodes if the requester
is configured properly. At least that is the intention, in reality
rhashtable is shrunk and expanded from a kernel worker so no mempolicy
can be assumed.

Let's just simplify the code and use kvmalloc helper, which is a
transparent way to use kmalloc with vmalloc fallback, if the caller is
allowed to block and use the flag otherwise.
Link: http://lkml.kernel.org/r/20170306103032.2540-4-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Tom Herbert
Cc: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 May, 2017
1 commit
-
By using smaller datatypes this (rather large) struct shrinks considerably
(80 -> 48 bytes on x86_64).

As this is embedded in other structs, this also reduces size of several
others, e.g. cls_fl_head or nft_hash.
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller
28 Apr, 2017
1 commit
-
The commit 6d684e54690c ("rhashtable: Cap total number of entries
to 2^31") breaks rhashtable users that do not set max_size. This
is because when max_size is zero max_elems is also incorrectly set
to zero instead of 2^31.

This patch fixes it by only lowering max_elems when max_size is not
zero.
Fixes: 6d684e54690c ("rhashtable: Cap total number of entries to 2^31")
Reported-by: Florian Fainelli
Reported-by: kernel test robot
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
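A small sketch of the arithmetic behind this fix. Both functions are hypothetical reconstructions from the description above, not the kernel code: the first lowers max_elems unconditionally, so max_size == 0 yields a cap of zero, and the second lowers it only when max_size is non-zero.

```c
/* Behaviour described as broken: max_size == 0 drags the cap down to 0. */
static unsigned int max_elems_buggy(unsigned int max_size)
{
	unsigned int max_elems = 1u << 31;

	if (max_size < max_elems / 2)
		max_elems = max_size * 2;
	return max_elems;
}

/* Behaviour described as fixed: only lower the cap when max_size is set. */
static unsigned int max_elems_fixed(unsigned int max_size)
{
	unsigned int max_elems = 1u << 31;

	if (max_size && max_size < max_elems / 2)
		max_elems = max_size * 2;
	return max_elems;
}
```

With max_size left at zero the fixed variant keeps the 2^31 cap, while a max_size of 1024 still lowers the cap to 2048, twice max_size.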
27 Apr, 2017
2 commits
-
When max_size is not set or if it is set to a sufficiently large
value, the nelems counter can overflow. This would cause havoc
with the automatic shrinking as it would then attempt to fit a
huge number of entries into a tiny hash table.

This patch fixes this by adding max_elems to struct rhashtable
to cap the number of elements. This is set to 2^31 as nelems is
not a precise count. This is sufficiently smaller than UINT_MAX
that it should be safe.

When max_size is set max_elems will be lowered to at most twice
max_size as is the status quo.
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller -
No users in the tree, insecure_max_entries is always set to
ht->p.max_size * 2 in rhashtable_init().

Replace the only spot that uses it with a ht->p.max_size check.
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller
19 Apr, 2017
1 commit
-
commit 83e7e4ce9e93c3 ("mac80211: Use rhltable instead of rhashtable")
removed the last user that made use of 'insecure_elasticity' parameter,
i.e. the default of 16 is used everywhere.

Replace it with a constant.
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller
02 Mar, 2017
1 commit
-
We don't actually need the full rculist.h header in sched.h anymore,
we will be able to include the smaller rcupdate.h header instead.

But first update code that relied on the implicit header inclusion.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
27 Feb, 2017
2 commits
-
The current annotation is wrong as it says that we're only called
under spinlock. In fact it should be marked as under either
spinlock or RCU read lock.
Fixes: da20420f83ea ("rhashtable: Add nested tables")
Reported-by: Fengguang Wu
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller -
Dan Carpenter reported a use before NULL check bug in the function
bucket_table_free. In fact we don't need the NULL check at all as
no caller can provide a NULL argument. So this patch fixes this by
simply removing it.
Reported-by: Dan Carpenter
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
18 Feb, 2017
1 commit
-
This patch adds code that handles GFP_ATOMIC kmalloc failure on
insertion. As we cannot use vmalloc, we solve it by making our
hash table nested. That is, we allocate single pages at each level
and reach our desired table size by nesting them.

When a nested table is created, only a single page is allocated
at the top-level. Lower levels are allocated on demand during
insertion. Therefore for each insertion to succeed, only two
(non-consecutive) pages are needed.

After a nested table is created, a rehash will be scheduled in
order to switch to a vmalloced table as soon as possible. Also,
the rehash code will never rehash into a nested table. If we
detect a nested table during a rehash, the rehash will be aborted
and a new rehash will be scheduled.
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
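A toy two-level table in userspace, to illustrate the "allocate lower levels on demand" idea. It is a sketch under loud assumptions: SLOTS stands in for the number of pointers per page, buckets are plain int counters instead of hash chains, there is no locking or RCU, and the names (nested_tbl_sketch, nested_bucket) are hypothetical.

```c
#include <stdlib.h>

#define SLOTS 8 /* tiny stand-in for the number of entries per page */

struct nested_tbl_sketch {
	int *leaves[SLOTS];      /* top level: one pointer per leaf "page" */
	size_t leaves_allocated; /* how many leaves exist so far */
};

/*
 * Return the bucket for hash. When create is non-zero, a missing leaf is
 * allocated on demand; otherwise a missing leaf yields NULL.
 */
static int *nested_bucket(struct nested_tbl_sketch *t, unsigned int hash,
			  int create)
{
	unsigned int top = (hash / SLOTS) % SLOTS;
	unsigned int low = hash % SLOTS;

	if (!t->leaves[top]) {
		if (!create)
			return NULL;
		t->leaves[top] = calloc(SLOTS, sizeof(int));
		if (!t->leaves[top])
			return NULL;
		t->leaves_allocated++;
	}
	return &t->leaves[top][low];
}
```

Each allocation in this model is one small fixed-size block, so an insertion never needs a large contiguous region, which is the property the nested layout is after.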
20 Sep, 2016
1 commit
-
The insecure_elasticity setting is an ugly wart brought out by
users who need to insert duplicate objects (that is, distinct
objects with identical keys) into the same table.

In fact, those users have a much bigger problem. Once those
duplicate objects are inserted, they don't have an interface to
find them (unless you count the walker interface which walks
over the entire table).

Some users have resorted to doing a manual walk over the hash
table which is of course broken because they don't handle the
potential existence of multiple hash tables. The result is that
they will break sporadically when they encounter a hash table
resize/rehash.

This patch provides a way out for those users, at the expense
of an extra pointer per object. Essentially each object is now
a list of objects carrying the same key. The hash table will
only see the lists so nothing changes as far as rhashtable is
concerned.

To use this new interface, you need to insert a struct rhlist_head
into your objects instead of struct rhash_head. While the hash
table is unchanged, for type-safety you'll need to use struct
rhltable instead of struct rhashtable. All the existing interfaces
have been duplicated for rhlist, including the hash table walker.

One missing feature is nulls marking because AFAIK the only potential
user of it does not need duplicate objects. Should anyone need
this it shouldn't be too hard to add.
Signed-off-by: Herbert Xu
Acked-by: Thomas Graf
Signed-off-by: David S. Miller
07 Sep, 2016
1 commit
-
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.

More specifically, they are:
1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
transport area in dccp, sctp, tcp, udp and udplite protocol
conntrackers, from Gao Feng.
2) Missing whitespace on error message in physdev match, from Hangbin Liu.
3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.
4) Add nf_ct_expires() helper function and use it, from Florian Westphal.
5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
from Florian.
6) Rename nf_tables set implementation to nft_set_{name}.c
7) Introduce the hash expression to allow arbitrary hashing of selector
concatenations, from Laura Garcia Liebana.
8) Remove ip_conntrack sysctl backward compatibility code, this code has
been around for long time already, and we have two interfaces to do
this already: nf_conntrack sysctl and ctnetlink.
9) Use nf_conntrack_get_ht() helper function whenever possible, instead
of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.
10) Add quota expression for nf_tables.
11) Add number generator expression for nf_tables, this supports
incremental and random generators that can be combined with maps,
very useful for load balancing purpose, again from Laura Garcia Liebana.
12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.
13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
configuration, this is used by a follow up patch to perform better chain
update validation.
14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
nft_set_hash implementation to honor the NLM_F_EXCL flag.
15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
patch from Florian Westphal.
16) Don't use the DYING bit to know if the conntrack event has been already
delivered, instead a state variable to track event re-delivery
states, also from Florian.
17) Remove the per-conntrack timer, use the workqueue approach that was
discussed during the NFWS, from Florian Westphal.
18) Use the netlink conntrack table dump path to kill stale entries,
again from Florian.
19) Add a garbage collector to get rid of stale conntracks, from
Florian.
20) Reschedule garbage collector if eviction rate is high.
21) Get rid of the __nf_ct_kill_acct() helper.
22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.
23) Make nf_log_set() interface assertive on unsupported families.
====================
Signed-off-by: David S. Miller
30 Aug, 2016
1 commit
-
All three conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller