07 Jun, 2020

1 commit

  • This patch replaces some unnecessary uses of rcu_dereference_raw
    in the rhashtable code with rcu_dereference_protected.

    The top-level nested table entry is only marked as RCU because it
    shares the same type as the tree entries underneath it. So it
    doesn't need any RCU protection.

    We also don't need RCU protection when we're freeing a nested RCU
    table, because by this stage we are long past the memory barrier
    after which no one can change the nested table.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
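
    As a rough illustration of the pattern this entry describes - the
    struct and function below are hypothetical, not the rhashtable code -
    rcu_dereference_protected() with an always-true condition documents
    that the caller already has exclusive access:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct nested_entry {                   /* hypothetical element type */
            struct nested_entry __rcu *child;
    };

    static void free_nested_slot(struct nested_entry __rcu **slot)
    {
            /*
             * Teardown path: no updater can touch *slot any more, so no
             * RCU read-side protection is needed; the constant "1"
             * documents exactly that and keeps sparse/lockdep quiet.
             */
            struct nested_entry *e = rcu_dereference_protected(*slot, 1);

            kfree(e);
    }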
     

19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

17 May, 2019

2 commits

  • As cmpxchg is a non-RCU mechanism, it will cause sparse warnings
    when we use it on RCU-protected pointers. This patch adds explicit
    casts to silence those warnings. This should probably be moved into
    RCU itself in the future.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
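
    A minimal sketch of the kind of cast being described, using made-up
    types rather than the rhashtable internals: the explicit cast drops
    the __rcu address-space annotation so sparse does not warn when the
    pointer is handed to the non-RCU cmpxchg():

    #include <linux/atomic.h>
    #include <linux/rcupdate.h>

    struct node {                           /* hypothetical type */
            int v;
    };

    static struct node *publish_once(struct node __rcu **slot,
                                     struct node *new)
    {
            /* Cast away __rcu for cmpxchg(); the ordering guarantees of
             * cmpxchg() still make the publication safe. */
            return cmpxchg((struct node **)slot, NULL, new);
    }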
     
  • The opaque type rhash_lock_head should not be marked with __rcu
    because it can never be dereferenced. We should apply the RCU
    marking when we turn it into a pointer which can be dereferenced.

    This patch does exactly that. This fixes a number of sparse
    warnings as well as getting rid of some unnecessary RCU checking.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

13 Apr, 2019

5 commits

  • As reported by Guenter Roeck, the new bit-locking using
    BIT(1) doesn't work on the m68k architecture. m68k only requires
    2-byte alignment for words and longwords, so there is only one
    unused bit in pointers to structs. We currently use two: one for the
    NULLS marker at the end of the linked list, and one for the bit-lock
    in the head of the list.

    The two uses don't need to conflict as we never need the head of the
    list to be a NULLS marker - the marker is only needed to check if an
    object has moved to a different table, and the bucket head cannot
    move. The NULLS marker is only needed in a ->next pointer.

    As we already have different types for the bucket head pointer (struct
    rhash_lock_head) and the ->next pointers (struct rhash_head), it is
    fairly easy to treat the lsb differently in each.

    So: initialize bucket heads to NULL, and use the lsb for locking.
    When loading the pointer from the bucket head, if it is NULL (ignoring
    the lock bit), report it as being the expected NULLS marker.
    When storing a value into a bucket head, if it is a NULLS marker,
    store NULL instead.

    And convert all places that used bit 1 for locking to use bit 0.

    Fixes: 8f0db018006a ("rhashtable: use bit_spin_locks to protect hash bucket.")
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
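
    A minimal sketch of the load/store translation described above, with
    simplified, hypothetical helpers rather than the real rht_ptr()
    family: bit 0 of the bucket word is the lock, an empty head reads
    back as the expected NULLS marker, and a NULLS marker is stored back
    as NULL:

    #include <linux/bits.h>
    #include <linux/types.h>

    #define BKT_LOCK        BIT(0)          /* the only spare bit on m68k */

    static inline unsigned long bkt_load(unsigned long bkt, unsigned long nulls)
    {
            unsigned long head = bkt & ~BKT_LOCK;   /* ignore the lock bit */

            /* A NULL bucket head is reported as the expected NULLS marker. */
            return head ? head : nulls;
    }

    static inline unsigned long bkt_store(unsigned long head)
    {
            /* NULLS markers (odd values) are never stored; store NULL instead. */
            return (head & 1UL) ? 0UL : head;
    }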
     
  • The only times rht_ptr_locked() is used, it is to store a new
    value in a bucket-head. This is also the only time it makes sense
    to use it. So replace it with a function which does the
    whole task: set the lock bit and assign to a bucket head.

    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
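
    A one-function sketch of the idea, with invented names and a plain
    unsigned long bucket word standing in for the real types: setting the
    lock bit and publishing the new head happen in a single store:

    #include <linux/atomic.h>
    #include <linux/bits.h>

    struct entry {                          /* hypothetical */
            struct entry *next;
    };

    /* Publish a new bucket head while keeping the bucket locked (bit 0 set). */
    static inline void bkt_assign_locked(unsigned long *bkt, struct entry *head)
    {
            smp_store_release(bkt, (unsigned long)head | BIT(0));
    }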
     
  • Rather than dereferencing a pointer to a bucket and then passing the
    result to rht_ptr(), we now pass in the pointer and do the dereference
    in rht_ptr().

    This requires that we pass in the tbl and hash as well to support RCU
    checks, and means that the various rht_for_each functions can expect a
    pointer that can be dereferenced without further care.

    There are two places where we dereference a bucket pointer
    where there is no testable protection - in each case we know
    that we must have exclusive access without having taken a lock.
    The previous code used rht_dereference() to pretend that holding
    the mutex provided protection, but holding the mutex never provides
    protection for accessing buckets.

    So instead introduce rht_ptr_exclusive() that can be used when
    there is known to be exclusive access without holding any locks.

    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
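
    A simplified sketch of the two flavours of dereference described
    above, using hypothetical types rather than the rhashtable ones: the
    helper itself does the dereference and strips the lock bit, and a
    separate variant serves callers with known exclusive access:

    #include <linux/bits.h>
    #include <linux/rcupdate.h>

    struct entry {                          /* hypothetical element type */
            struct entry __rcu *next;
    };

    /* Dereference a bucket under RCU and strip the lock bit in one place. */
    static inline struct entry *bkt_ptr_rcu(struct entry __rcu **bkt)
    {
            struct entry *p = rcu_dereference(*bkt);  /* needs rcu_read_lock() */

            return (struct entry *)((unsigned long)p & ~BIT(0));
    }

    /* For callers that know they have exclusive access and hold no lock. */
    static inline struct entry *bkt_ptr_exclusive(struct entry **bkt)
    {
            return (struct entry *)((unsigned long)READ_ONCE(*bkt) & ~BIT(0));
    }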
     
  • With these annotations, the rhashtable now gets no
    warnings when compiled with "C=1" for sparse checking.

    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     
  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along with
    memory for some number of elements for that array. For example:

    struct foo {
            int stuff;
            struct boo entry[];
    };

    size = sizeof(struct foo) + count * sizeof(struct boo);
    instance = kvzalloc(size, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

    instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL);

    This code was detected with the help of Coccinelle.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     

08 Apr, 2019

4 commits

  • Native bit_spin_locks are not tracked by lockdep.

    The bit_spin_locks used for rhashtable buckets are local
    to the rhashtable implementation, so there is little opportunity
    for the sort of misuse that lockdep might detect.
    However locks are held while a hash function or compare
    function is called, and if one of these took a lock,
    a misbehaviour is possible.

    As it is quite easy to add lockdep support, this unlikely
    possibility seems to be enough justification.

    So create a lockdep class for bucket bit_spin_lock and attach
    through a lockdep_map in each bucket_table.

    Without the 'nested' annotation in rhashtable_rehash_one(), lockdep
    correctly reports a possible problem as this lock is taken
    while another bucket lock (in another table) is held. This
    confirms that the added support works.
    With the correct nested annotation in place, lockdep reports
    no problems.

    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
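
    A minimal sketch of attaching a lockdep map to an otherwise invisible
    bit_spin_lock, as described above; the table layout and names are
    illustrative, not the rhashtable implementation:

    #include <linux/bit_spinlock.h>
    #include <linux/lockdep.h>

    struct toy_table {
            struct lockdep_map dep_map;     /* one lock class for all buckets */
            unsigned long buckets[64];      /* bit 0 of each word is the lock */
    };

    static void toy_table_init(struct toy_table *t)
    {
            static struct lock_class_key key;

            lockdep_init_map(&t->dep_map, "toy_bucket_lock", &key, 0);
    }

    static void toy_lock_bucket(struct toy_table *t, unsigned int i)
    {
            bit_spin_lock(0, &t->buckets[i]);
            lock_map_acquire(&t->dep_map);  /* make the lock visible to lockdep */
    }

    static void toy_unlock_bucket(struct toy_table *t, unsigned int i)
    {
            lock_map_release(&t->dep_map);
            bit_spin_unlock(0, &t->buckets[i]);
    }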
     
  • This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
    bucket pointer to lock the hash chain for that bucket.

    The benefits of a bit spin_lock are:
    - no need to allocate a separate array of locks.
    - no need to have a configuration option to guide the
    choice of the size of this array
    - locking cost is often a single test-and-set in a cache line
    that will have to be loaded anyway. When inserting at, or removing
    from, the head of the chain, the unlock is free - writing the new
    address in the bucket head implicitly clears the lock bit.
    For __rhashtable_insert_fast() we ensure this always happens
    when adding a new key.
    - even when locking costs 2 updates (lock and unlock), they are
    in a cacheline that needs to be read anyway.

    The cost of using a bit spin_lock is a little bit of code complexity,
    which I think is quite manageable.

    Bit spin_locks are sometimes inappropriate because they are not fair -
    if multiple CPUs repeatedly contend for the same lock, one CPU can
    easily be starved. This is not a credible situation with rhashtable.
    Multiple CPUs may want to repeatedly add or remove objects, but they
    will typically do so at different buckets, so they will attempt to
    acquire different locks.

    As we have more bit-locks than we previously had spinlocks (by at
    least a factor of two) we can expect slightly less contention to
    go with the slightly better cache behavior and reduced memory
    consumption.

    To enhance type checking, a new struct is introduced to represent the
    pointer plus lock-bit that is stored in the bucket-table. This is
    "struct rhash_lock_head" and is empty. A pointer to this needs to be
    cast to either an unsigned long or a "struct rhash_head *" to be
    useful. Variables of this type are most often called "bkt".

    Previously "pprev" would sometimes point to a bucket, and sometimes a
    ->next pointer in an rhash_head. As these are now different types,
    pprev is NULL when it would have pointed to the bucket. In that case,
    'blk' is used, together with correct locking protocol.

    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
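
    A minimal sketch of the "unlock is free" point made above, with
    simplified, hypothetical types: insertion at the head releases the
    bit lock simply by storing a new head whose lock bit is clear:

    #include <linux/atomic.h>
    #include <linux/bit_spinlock.h>
    #include <linux/preempt.h>

    struct entry {                          /* hypothetical */
            unsigned long next;             /* next pointer or NULLS marker */
    };

    static void insert_head(unsigned long *bkt, struct entry *obj)
    {
            bit_spin_lock(0, bkt);          /* sets bit 0, disables preemption */

            obj->next = *bkt & ~1UL;        /* old head with the lock bit stripped */

            /* Publishing a head whose bit 0 is clear also drops the bit lock... */
            smp_store_release(bkt, (unsigned long)obj);
            /* ...but the preempt count taken by bit_spin_lock() must still be put. */
            preempt_enable();
    }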
     
  • Rather than returning a pointer to a static nulls, rht_bucket_var()
    now returns NULL if the bucket doesn't exist.
    This will make the next patch, which stores a bitlock in the
    bucket pointer, somewhat cleaner.

    This change involves introducing __rht_bucket_nested(), which is
    like rht_bucket_nested() but doesn't provide the static nulls,
    and changing rht_bucket_nested() to call this and possibly
    provide a static nulls - as is still needed for the non-var case.

    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     
  • nested_table_alloc() relies on the fact that there is
    at most one spinlock allocated for every slot in the top
    level nested table, so it is not possible for two threads
    to try to allocate the same table at the same time.

    This assumption is a little fragile (it is not explicit) and is
    unnecessary as cmpxchg() can be used instead.

    A future patch will replace the spinlocks by per-bucket bitlocks,
    and then we won't be able to protect the slot pointer with a spinlock.

    So replace rcu_assign_pointer() with cmpxchg() - which has equivalent
    barrier properties.
    If the cmpxchg() fails, free the table that was just allocated.

    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
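
    A minimal sketch of the allocate-then-cmpxchg pattern described
    above, with hypothetical types: whichever thread loses the race frees
    the table it just allocated and uses the winner's:

    #include <linux/atomic.h>
    #include <linux/slab.h>

    struct ntbl {
            void *slots[64];                /* hypothetical nested table */
    };

    static struct ntbl *get_or_alloc(struct ntbl **slot, gfp_t gfp)
    {
            struct ntbl *new, *old;

            old = READ_ONCE(*slot);
            if (old)
                    return old;

            new = kzalloc(sizeof(*new), gfp);
            if (!new)
                    return NULL;

            /* cmpxchg() gives the same ordering rcu_assign_pointer() did. */
            old = cmpxchg(slot, NULL, new);
            if (!old)
                    return new;             /* we installed it */

            kfree(new);                     /* somebody beat us to it */
            return old;
    }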
     

22 Mar, 2019

3 commits

  • The pattern set by list.h is that for_each..continue()
    iterators start at the next entry after the given one,
    while for_each..from() iterators start at the given
    entry.

    The rht_for_each*continue() iterators are documented as though they
    start at the 'next' entry, but they actually start at the given entry,
    and they are used expecting that behaviour.
    So fix the documentation and change the names to *from for consistency
    with list.h.

    Acked-by: Herbert Xu
    Acked-by: Miguel Ojeda
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
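
    A small illustration of the list.h convention referred to above; the
    struct and helpers are made up for the example:

    #include <linux/list.h>

    struct item {
            struct list_head node;
            int val;
    };

    static int sum_after(struct item *cur, struct list_head *head)
    {
            int sum = 0;

            /* *_continue(): starts at the entry AFTER 'cur'. */
            list_for_each_entry_continue(cur, head, node)
                    sum += cur->val;
            return sum;
    }

    static int sum_from(struct item *cur, struct list_head *head)
    {
            int sum = 0;

            /* *_from(): starts AT 'cur' itself. */
            list_for_each_entry_from(cur, head, node)
                    sum += cur->val;
            return sum;
    }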
     
  • rhashtable_try_insert() currently holds a lock on the bucket in
    the first table, while also locking buckets in subsequent tables.
    This is unnecessary and looks like a hold-over from some earlier
    version of the implementation.

    As insert and remove always lock a bucket in each table in turn, and
    as insert only inserts in the final table, there cannot be any races
    that are not covered by simply locking a bucket in each table in turn.

    When an insert call reaches the last table it can be sure that there
    is no matching entry in any other table, as it has searched them all, and
    insertion never happens anywhere but in the last table. The fact that the
    code tests for the existence of future_tbl while holding a lock on
    the relevant bucket ensures that two threads inserting the same key
    will make compatible decisions about which is the "last" table.

    This simplifies the code and allows the ->rehash field to be
    discarded.

    We still need a way to ensure that a dead bucket_table is never
    re-linked by rhashtable_walk_stop(). This can be achieved by calling
    call_rcu() inside the locked region, and checking with
    rcu_head_after_call_rcu() in rhashtable_walk_stop() to see if the
    bucket table is empty and dead.

    Acked-by: Herbert Xu
    Reviewed-by: Paul E. McKenney
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
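
    A minimal sketch of the rcu_head_after_call_rcu() check mentioned
    above, with hypothetical types: a walker can refuse to re-link a
    table whose rcu_head has already been handed to call_rcu():

    #include <linux/kernel.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct toy_tbl {
            struct rcu_head rcu;
            /* ... buckets ... */
    };

    static void toy_tbl_free_rcu(struct rcu_head *head)
    {
            kfree(container_of(head, struct toy_tbl, rcu));
    }

    static struct toy_tbl *toy_tbl_alloc(gfp_t gfp)
    {
            struct toy_tbl *tbl = kzalloc(sizeof(*tbl), gfp);

            if (tbl)
                    rcu_head_init(&tbl->rcu);   /* required for the check below */
            return tbl;
    }

    /* True once call_rcu(&tbl->rcu, toy_tbl_free_rcu) has been issued. */
    static bool toy_tbl_is_dying(struct toy_tbl *tbl)
    {
            return rcu_head_after_call_rcu(&tbl->rcu, toy_tbl_free_rcu);
    }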
     
  • As it stands if a shrink is delayed because of an outstanding
    rehash, we will go into a rescheduling loop without ever doing
    the rehash.

    This patch fixes this by still carrying out the rehash and then
    rescheduling so that we can shrink after the completion of the
    rehash should it still be necessary.

    The return value of EEXIST captures this case and other cases
    (e.g., another thread expanded/rehashed the table at the same
    time) where we should still proceed with the rehash.

    Fixes: da20420f83ea ("rhashtable: Add nested tables")
    Reported-by: Josh Elsasser
    Signed-off-by: Herbert Xu
    Tested-by: Josh Elsasser
    Signed-off-by: David S. Miller

    Herbert Xu
     

04 Dec, 2018

1 commit

  • Some users of rhashtables might need to move an object from one table
    to another - this appears to be the reason for the incomplete usage
    of NULLS markers.

    To support these, we store a unique NULLS_MARKER at the end of
    each chain, and when a search fails to find a match, we check
    if the NULLS marker found was the expected one. If not, the search
    may not have examined all objects in the target bucket, so it is
    repeated.

    The unique NULLS_MARKER is derived from the address of the
    head of the chain. As this cannot be derived at load-time the
    static rhnull in rht_bucket_nested() needs to be initialised
    at run time.

    Any caller of a lookup function must still be prepared for the
    possibility that the object returned is in a different table - it
    might have been there for some time.

    Note that this does NOT provide support for other uses of
    NULLS_MARKERs such as allocating with SLAB_TYPESAFE_BY_RCU or changing
    the key of an object and re-inserting it in the same table.
    These could only be done safely if new objects were inserted
    at the *start* of a hash chain, and that is not currently the case.

    Signed-off-by: NeilBrown
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    NeilBrown
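
    A minimal sketch of the lookup-retry idea described above, using the
    generic hlist_nulls helpers and made-up types: each chain ends in a
    NULLS value identifying its bucket, and a search that runs off onto a
    different chain is restarted:

    #include <linux/list_nulls.h>
    #include <linux/rculist_nulls.h>
    #include <linux/rcupdate.h>
    #include <linux/types.h>

    struct item {
            struct hlist_nulls_node node;
            u32 key;
    };

    /* Caller holds rcu_read_lock(); chain i was set up with
     * INIT_HLIST_NULLS_HEAD(&tbl[i], i), i.e. its slot number as NULLS value. */
    static struct item *lookup(struct hlist_nulls_head *tbl,
                               unsigned int nbuckets, u32 key)
    {
            unsigned int slot = key % nbuckets;     /* toy hash */
            struct hlist_nulls_node *pos;
            struct item *it;

    restart:
            hlist_nulls_for_each_entry_rcu(it, pos, &tbl[slot], node) {
                    if (it->key == key)
                            return it;
            }
            /* Ended on another chain's marker: the walk may have skipped
             * entries after a concurrent move, so repeat it. */
            if (get_nulls_value(pos) != slot)
                    goto restart;
            return NULL;
    }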
     

28 Aug, 2018

1 commit

  • Pull networking fixes from David Miller:

    1) ICE, E1000, IGB, IXGBE, and I40E bug fixes from the Intel folks.

    2) Better fix for AB-BA deadlock in packet scheduler code, from Cong
    Wang.

    3) bpf sockmap fixes (zero sized key handling, etc.) from Daniel
    Borkmann.

    4) Send zero IPID in TCP resets and SYN-RECV state ACKs, to prevent
    attackers using it as a side-channel. From Eric Dumazet.

    5) Memory leak in mediatek bluetooth driver, from Gustavo A. R. Silva.

    6) Hook up rt->dst.input of ipv6 anycast routes properly, from Hangbin
    Liu.

    7) hns and hns3 bug fixes from Huazhong Tan.

    8) Fix RIF leak in mlxsw driver, from Ido Schimmel.

    9) iova range check fix in vhost, from Jason Wang.

    10) Fix hang in do_tcp_sendpages() with tls, from John Fastabend.

    11) More r8152 chips need to disable RX aggregation, from Kai-Heng Feng.

    12) Memory exposure in TCA_U32_SEL handling, from Kees Cook.

    13) TCP BBR congestion control fixes from Kevin Yang.

    14) hv_netvsc, ignore non-PCI devices, from Stephen Hemminger.

    15) qed driver fixes from Tomer Tayar.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
    net: sched: Fix memory exposure from short TCA_U32_SEL
    qed: fix spelling mistake "comparsion" -> "comparison"
    vhost: correctly check the iova range when waking virtqueue
    qlge: Fix netdev features configuration.
    net: macb: do not disable MDIO bus at open/close time
    Revert "net: stmmac: fix build failure due to missing COMMON_CLK dependency"
    net: macb: Fix regression breaking non-MDIO fixed-link PHYs
    mlxsw: spectrum_switchdev: Do not leak RIFs when removing bridge
    i40e: fix condition of WARN_ONCE for stat strings
    i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled
    ixgbe: fix driver behaviour after issuing VFLR
    ixgbe: Prevent unsupported configurations with XDP
    ixgbe: Replace GFP_ATOMIC with GFP_KERNEL
    igb: Replace mdelay() with msleep() in igb_integrated_phy_loopback()
    igb: Replace GFP_ATOMIC with GFP_KERNEL in igb_sw_init()
    igb: Use an advanced ctx descriptor for launchtime
    e1000: ensure to free old tx/rx rings in set_ringparam()
    e1000: check on netif_running() before calling e1000_up()
    ixgb: use dma_zalloc_coherent instead of allocator/memset
    ice: Trivial formatting fixes
    ...

    Linus Torvalds
     

23 Aug, 2018

2 commits

  • rhashtable_init() may fail due to -ENOMEM, thus making the entire api
    unusable. This patch removes this scenario, however unlikely. In order
    to guarantee memory allocation, this patch always ends up doing
    GFP_KERNEL|__GFP_NOFAIL for both the tbl as well as
    alloc_bucket_spinlocks().

    Upon the first table allocation failure, we shrink the size to the
    smallest value that makes sense and retry with __GFP_NOFAIL semantics.
    With the defaults, this means that from 64 buckets, we retry with only 4.
    Any later issues regarding performance due to collisions or larger table
    resizing (when more memory becomes available) are the least of our
    problems.

    Link: http://lkml.kernel.org/r/20180712185241.4017-9-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Acked-by: Herbert Xu
    Cc: Dmitry Vyukov
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
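
    A minimal sketch of the fallback described above, with a hypothetical
    helper rather than the real bucket_table_alloc(): if the first
    allocation fails, retry at the smallest sensible size with
    __GFP_NOFAIL so initialisation cannot fail:

    #include <linux/mm.h>
    #include <linux/slab.h>

    #define MIN_BUCKETS     4U

    static void *alloc_buckets(unsigned int *nbuckets, gfp_t gfp)
    {
            void *tbl;

            tbl = kvzalloc(*nbuckets * sizeof(void *), gfp);
            if (tbl)
                    return tbl;

            /* Shrink to the minimum and insist: this cannot return NULL. */
            *nbuckets = MIN_BUCKETS;
            return kvzalloc(*nbuckets * sizeof(void *), gfp | __GFP_NOFAIL);
    }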
     
  • As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
    incompatible gfp flags") we can simplify the caller and trust kvzalloc()
    to just do the right thing. For the GFP_ATOMIC context, we can drop the
    __GFP_NORETRY flag for obvious reasons; for the __GFP_NOWARN case, the
    change is that the caller now passes the flag instead of
    bucket_table_alloc() adding it itself.

    This slightly changes the gfp flags passed on to nested_table_alloc() as
    it will now also use GFP_ATOMIC | __GFP_NOWARN. However, I consider this
    a positive consequence, as we want nowarn semantics in
    bucket_table_alloc() for the same reasons.

    [manfred@colorfullife.com: commit id extended to 12 digits, line wraps updated]
    Link: http://lkml.kernel.org/r/20180712185241.4017-8-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     


19 Jul, 2018

1 commit

  • rhashtable_init() currently does not take into account the user-passed
    min_size parameter unless param->nelem_hint is set as well. As such,
    the default size (number of buckets) will always be HASH_DEFAULT_SIZE
    even if the smallest allowed size is larger than that. Remediate this
    by unconditionally calling into rounded_hashtable_size() and handling
    things accordingly.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Davidlohr Bueso
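
    A minimal sketch of the sizing rule described above; the names and
    constants are illustrative, not the exact rhashtable code:

    #include <linux/kernel.h>
    #include <linux/log2.h>

    #define HASH_DEFAULT_SIZE       64U
    #define HASH_MIN_SIZE           4U

    static unsigned int rounded_size(unsigned int nelem_hint,
                                     unsigned int min_size)
    {
            unsigned int size = HASH_DEFAULT_SIZE;

            if (nelem_hint)                 /* leave ~25% headroom */
                    size = roundup_pow_of_two(nelem_hint * 4 / 3);

            /* Respect min_size even when no element hint was supplied. */
            return max3(size, min_size, HASH_MIN_SIZE);
    }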
     

10 Jul, 2018

1 commit

  • rhashtable_free_and_destroy() cancels the deferred re-hash work and
    then walks and destroys the elements. At this moment, some elements
    can still be in future_tbl, and those elements are not destroyed.

    Test case:
    nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
    all elements of sets before destroying sets and chains.
    But rhashtable_free_and_destroy() doesn't destroy elements of
    future_tbl, so the splat below occurred.

    test script:
    %cat test.nft
    table ip aa {
        map map1 {
            type ipv4_addr : verdict;
            elements = {
                0 : jump a0,
                1 : jump a0,
                2 : jump a0,
                3 : jump a0,
                4 : jump a0,
                5 : jump a0,
                6 : jump a0,
                7 : jump a0,
                8 : jump a0,
                9 : jump a0,
            }
        }
        chain a0 {
        }
    }
    flush ruleset
    table ip aa {
        map map1 {
            type ipv4_addr : verdict;
            elements = {
                0 : jump a0,
                1 : jump a0,
                2 : jump a0,
                3 : jump a0,
                4 : jump a0,
                5 : jump a0,
                6 : jump a0,
                7 : jump a0,
                8 : jump a0,
                9 : jump a0,
            }
        }
        chain a0 {
        }
    }
    flush ruleset

    %while :; do nft -f test.nft; done

    Splat looks like:
    [ 200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
    [ 200.806944] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
    [ 200.820297] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
    [ 200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
    [ 200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 0b 48 89 da 48 b8 00 00 00 00 00 fc ff
    [ 200.860366] RSP: 0000:ffff880118dbf4d0 EFLAGS: 00010282
    [ 200.866354] RAX: 0000000000000061 RBX: ffff88010cdeaf08 RCX: 0000000000000000
    [ 200.874355] RDX: 0000000000000061 RSI: 0000000000000008 RDI: ffffed00231b7e90
    [ 200.882361] RBP: ffff880118dbf4e8 R08: ffffed002373bcfb R09: ffffed002373bcfa
    [ 200.890354] R10: 0000000000000000 R11: ffffed002373bcfb R12: dead000000000200
    [ 200.898356] R13: dead000000000100 R14: ffffffffbb62af38 R15: dffffc0000000000
    [ 200.906354] FS: 00007fefc31fd700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
    [ 200.915533] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 200.922355] CR2: 0000557f1c8e9128 CR3: 0000000106880000 CR4: 00000000001006e0
    [ 200.930353] Call Trace:
    [ 200.932351] ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
    [ 200.939525] ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
    [ 200.947525] ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
    [ 200.952383] ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
    [ 200.959532] ? nla_parse+0xab/0x230
    [ 200.963529] ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
    [ 200.968384] ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
    [ 200.975525] ? debug_show_all_locks+0x290/0x290
    [ 200.980363] ? debug_show_all_locks+0x290/0x290
    [ 200.986356] ? sched_clock_cpu+0x132/0x170
    [ 200.990352] ? find_held_lock+0x39/0x1b0
    [ 200.994355] ? sched_clock_local+0x10d/0x130
    [ 200.999531] ? memset+0x1f/0x40

    V2:
    - free all tables, as requested by Herbert Xu

    Signed-off-by: Taehee Yoo
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Taehee Yoo
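
    A minimal sketch of the fix's idea, with simplified, hypothetical
    types: teardown must walk the whole chain of tables rather than just
    the first one, because entries may still live in future_tbl:

    #include <linux/rcupdate.h>

    struct toy_tbl {
            struct toy_tbl __rcu *future_tbl;
            /* ... buckets ... */
    };

    static void destroy_all_tables(struct toy_tbl *tbl,
                                   void (*free_tbl)(struct toy_tbl *))
    {
            struct toy_tbl *next;

            while (tbl) {
                    /* Mutex held during teardown, so no concurrent resizes. */
                    next = rcu_dereference_protected(tbl->future_tbl, 1);
                    free_tbl(tbl);          /* destroy elements, then the table */
                    tbl = next;
            }
    }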
     

03 Jul, 2018

1 commit

  • In lib/rhashtable.c, line 777, the 'skip' variable is assigned to
    itself. The following error was observed:

    lib/rhashtable.c:777:41: warning: explicitly assigning value of
    variable of type 'int' to itself [-Wself-assign] error, forbidden
    warning: rhashtable.c:777
    This error was found when compiling with Clang 6.0. Change it to iter->skip.

    Signed-off-by: Rishabh Bhatnagar
    Acked-by: Herbert Xu
    Reviewed-by: NeilBrown
    Signed-off-by: David S. Miller

    Rishabh Bhatnagar
     

22 Jun, 2018

6 commits

  • Using rht_dereference_bucket() to dereference
    ->future_tbl looks like a type error, and could be confusing.
    Using rht_dereference_rcu() to test a pointer for NULL
    adds an unnecessary barrier - rcu_access_pointer() is preferred
    for NULL tests when no lock is held.

    This patch uses three different ways to access ->future_tbl:
    - if we know the mutex is held, use rht_dereference()
    - if we don't hold the mutex, and are only testing for NULL,
    use rcu_access_pointer()
    - otherwise (using RCU protection for true dereference),
    use rht_dereference_rcu().

    Note that this includes a simplification of the call to
    rhashtable_last_table() - we don't do an extra dereference
    before the call any more.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
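
    A sketch of the three access patterns listed above, with hypothetical
    types standing in for the rhashtable internals; which helper is used
    documents what actually protects the access:

    #include <linux/lockdep.h>
    #include <linux/mutex.h>
    #include <linux/rcupdate.h>

    struct tbl {
            struct tbl __rcu *future_tbl;
    };

    struct ht {
            struct mutex mutex;
    };

    /* 1. Update path: the hashtable mutex is held. */
    static struct tbl *next_tbl_locked(struct ht *ht, struct tbl *t)
    {
            return rcu_dereference_protected(t->future_tbl,
                                             lockdep_is_held(&ht->mutex));
    }

    /* 2. Only testing for NULL: no dereference, so no barrier is needed. */
    static bool has_future_tbl(struct tbl *t)
    {
            return rcu_access_pointer(t->future_tbl) != NULL;
    }

    /* 3. A real dereference under rcu_read_lock(). */
    static struct tbl *next_tbl_rcu(struct tbl *t)
    {
            return rcu_dereference(t->future_tbl);
    }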
     
  • Rather than borrowing one of the bucket locks to
    protect ->future_tbl updates, use cmpxchg().
    This gives more freedom to change how bucket locking
    is implemented.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     
  • Now that we don't use the hash value or shift in nested_table_alloc()
    there is room for simplification.
    We only need to pass a "is this a leaf" flag to nested_table_alloc(),
    and don't need to track as much information in
    rht_bucket_nested_insert().

    Note there is another minor cleanup in nested_table_alloc() here.
    The number of elements in a page of "union nested_tables" is most naturally

    PAGE_SIZE / sizeof(ntbl[0])

    The previous code had

    PAGE_SIZE / sizeof(ntbl[0].bucket)

    which happens to be the correct value only because the bucket uses all
    the space in the union.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     
  • The 'ht' and 'hash' arguments to INIT_RHT_NULLS_HEAD() are
    no longer used - so drop them. This allows us to also
    remove the nhash argument from nested_table_alloc().

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     
  • This "feature" is unused, undocumented, and untested and so doesn't
    really belong. A patch is under development to properly implement
    support for detecting when a search gets diverted down a different
    chain, which the common purpose of nulls markers.

    This patch actually fixes a bug too. The table resizing allows a
    table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
    any growth beyond 2^27 is wasteful an ineffective.

    This patch results in NULLS_MARKER(0) being used for all chains,
    and leaves the use of rht_is_a_null() to test for it.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     
  • Due to the use of rhashtables in net namespaces,
    rhashtable.h is included in a lot of the kernel,
    so a small change can require a large recompilation.
    This makes development painful.

    This patch splits out rhashtable-types.h which just includes
    the major type declarations, and does not include (non-trivial)
    inline code. rhashtable.h is no longer included by anything
    in the include/ directory.
    Common include files only include rhashtable-types.h so a large
    recompilation is only triggered when that changes.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
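
    A small illustration of the split described above; the structure and
    file names are hypothetical:

    /* foo.h - embedded in many other headers, so keep the include light. */
    #include <linux/rhashtable-types.h>

    struct foo_net {
            struct rhashtable sessions;     /* only the type is needed here */
    };

    /* foo.c - the one place that really manipulates the table. */
    #include <linux/rhashtable.h>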
     

25 Apr, 2018

3 commits

  • When a walk of an rhashtable is interrupted with rhastable_walk_stop()
    and then rhashtable_walk_start(), the location to restart from is based
    on a 'skip' count in the current hash chain, and this can be incorrect
    if insertions or deletions have happened. This does not happen when
    the walk is not stopped and started as iter->p is a placeholder which
    is safe to use while holding the RCU read lock.

    In rhashtable_walk_start() we can revalidate that 'p' is still in the
    same hash chain. If it isn't then the current method is still used.

    With this patch, if a rhashtable walker ensures that the current
    object remains in the table over a stop/start period (possibly by
    elevating the reference count if that is sufficient), it can be sure
    that a walk will not miss objects that were in the hashtable for the
    whole time of the walk.

    rhashtable_walk_start() may not find the object even though it is
    still in the hashtable if a rehash has moved it to a new table. In
    this case it will (eventually) get -EAGAIN and will need to proceed
    through the whole table again to be sure to see everything at least
    once.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
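
    A minimal usage sketch of the walk API this entry is about, with a
    hypothetical object type: the iteration tolerates -EAGAIN after a
    resize, and may be stopped and restarted around sleeping sections:

    #include <linux/err.h>
    #include <linux/rhashtable.h>

    struct obj {
            u32 key;
            struct rhash_head node;
    };

    static void walk_all(struct rhashtable *ht)
    {
            struct rhashtable_iter iter;
            struct obj *o;

            rhashtable_walk_enter(ht, &iter);
            rhashtable_walk_start(&iter);

            while ((o = rhashtable_walk_next(&iter)) != NULL) {
                    if (IS_ERR(o)) {
                            if (PTR_ERR(o) == -EAGAIN)
                                    continue;  /* resize: entries may repeat */
                            break;
                    }
                    /* process 'o'; call rhashtable_walk_stop() before sleeping */
            }

            rhashtable_walk_stop(&iter);
            rhashtable_walk_exit(&iter);
    }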
     
  • The documentation claims that when rhashtable_walk_start_check()
    detects a resize event, it will rewind back to the beginning
    of the table. This is not true. We need to set ->slot and
    ->skip to be zero for it to be true.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     
  • Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, though
    they do take a spinlock without irq protection.
    So revise the comments to accurately state the contexts in which
    these functions can be called.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     

01 Apr, 2018

1 commit

  • Rehashing and destroying a large hash table takes a lot of time,
    and happens in process context. It is safe to add cond_resched()
    in rhashtable_rehash_table() and rhashtable_free_and_destroy().

    Signed-off-by: Eric Dumazet
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
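
    A minimal sketch of the point above, with hypothetical helpers: a
    long teardown loop running in process context yields the CPU
    periodically:

    #include <linux/sched.h>

    static void destroy_buckets(void **buckets, unsigned int n,
                                void (*free_one)(void *))
    {
            unsigned int i;

            for (i = 0; i < n; i++) {
                    free_one(buckets[i]);
                    cond_resched();         /* process context, no locks held */
            }
    }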
     

07 Mar, 2018

1 commit

  • When inserting duplicate objects (those with the same key),
    the current rhlist implementation messes up the chain pointers by
    updating the bucket pointer instead of the previous node's next
    pointer to point to the newly inserted node. This causes missing
    elements on removal and traversal.

    Fix that by properly updating the pprev pointer to point to
    the correct rhash_head next pointer.

    Issue: 1241076
    Change-Id: I86b2c140bcb4aeb10b70a72a267ff590bb2b17e7
    Fixes: ca26893f05e8 ('rhashtable: Add rhlist interface')
    Signed-off-by: Paul Blakey
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Paul Blakey
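
    A small usage sketch of the rhlist interface this fix touches, with a
    hypothetical object type and parameters: several objects sharing one
    key hang off a single bucket entry:

    #include <linux/rhashtable.h>
    #include <linux/stddef.h>

    struct item {
            u32 key;
            struct rhlist_head list;
    };

    static const struct rhashtable_params item_params = {
            .key_len     = sizeof(u32),
            .key_offset  = offsetof(struct item, key),
            .head_offset = offsetof(struct item, list),
    };

    static int add_item(struct rhltable *hlt, struct item *it)
    {
            /* Duplicate keys are allowed; they are chained behind one entry. */
            return rhltable_insert(hlt, &it->list, item_params);
    }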
     

11 Dec, 2017

1 commit