09 Jul, 2015

1 commit

  • If rhashtable_walk_next detects a resize operation in progress, it jumps
    to the new table and continues walking that one. But it fails to drop
    the reference to its current item, so it keeps traversing the bucket of
    the new table into which the current item was sorted and, after reaching
    that bucket's end, continues with the new table's second bucket instead
    of the first one, thereby potentially missing items.

    This fixes the rhashtable runtime test for me. The bug was probably
    introduced by Herbert Xu's patch eddee5ba ("rhashtable: Fix walker
    behaviour during rehash"), although this was not explicitly tested.
    (A caller-side sketch of the walk API follows this entry.)

    Fixes: eddee5ba ("rhashtable: Fix walker behaviour during rehash")
    Signed-off-by: Phil Sutter
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Phil Sutter
     
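    A caller-side sketch of the walk API discussed above, assuming a
    hypothetical struct test_obj element type; any walker has to tolerate
    -EAGAIN and possible duplicates whenever a resize is detected:

        #include <linux/err.h>
        #include <linux/rhashtable.h>

        struct test_obj {                       /* hypothetical element */
                int id;
                struct rhash_head node;
        };

        static void walk_table(struct rhashtable *ht)
        {
                struct rhashtable_iter iter;
                struct test_obj *obj;

                if (rhashtable_walk_init(ht, &iter))
                        return;

                if (rhashtable_walk_start(&iter) == -EAGAIN)
                        pr_info("resize since last walk: duplicates possible\n");

                while ((obj = rhashtable_walk_next(&iter)) != NULL) {
                        if (IS_ERR(obj)) {
                                /* ERR_PTR(-EAGAIN): a resize was detected */
                                if (PTR_ERR(obj) == -EAGAIN)
                                        continue;
                                break;
                        }
                        pr_info("saw object %d\n", obj->id);
                }

                rhashtable_walk_stop(&iter);
                rhashtable_walk_exit(&iter);
        }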

23 May, 2015

1 commit

  • Conflicts:
    drivers/net/ethernet/cadence/macb.c
    drivers/net/phy/phy.c
    include/linux/skbuff.h
    net/ipv4/tcp.c
    net/switchdev/switchdev.c

    Switchdev was a case of RTNH_F_{EXTERNAL --> OFFLOAD}
    renaming overlapping with net-next changes of various
    sorts.

    phy.c was a case of two changes, one adding a local
    variable to a function whilst the second removed one.

    tcp.c overlapped a deadlock fix with the addition of new tcp_info
    statistic values.

    macb.c involved the addition of two Zynq device entries.

    skbuff.h involved adding back ipv4_daddr to nf_bridge_info
    whilst net-next changes put two other existing members of
    that struct into a union.

    Signed-off-by: David S. Miller

    David S. Miller
     

17 May, 2015

1 commit

  • We currently have no limit on the number of elements in a hash table.
    This is a problem because some users (tipc) set a ceiling on the
    maximum table size, and when that is reached the hash table may
    degenerate. Others may encounter OOM when growing, and if we allow
    insertions when that happens the hash table performance may also
    suffer.

    This patch adds a new parameter, insecure_max_entries, which becomes
    the cap on the table. If unset, it defaults to max_size * 2; if that
    too is zero, there is no cap on the number of elements in the table.
    However, the table will grow whenever the utilisation hits 100%, and
    if that growth fails, you will get ENOMEM on insertion. (A params
    sketch follows this entry.)

    As allowing oversubscription is potentially dangerous, the name
    contains the word insecure.

    Note that the cap is not a hard limit. This is done for performance
    reasons as enforcing a hard limit will result in use of atomic ops
    that are heavier than the ones we currently use.

    The reasoning is that we're only guarding against a gross over-
    subscription of the table, rather than a small breach of the limit.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
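    A hedged sketch of the new parameter, using a hypothetical element
    type (my_obj); insecure_max_entries is the field this patch
    introduces:

        #include <linux/rhashtable.h>

        struct my_obj {                         /* hypothetical element */
                u32 key;
                struct rhash_head node;
        };

        static const struct rhashtable_params my_params = {
                .head_offset          = offsetof(struct my_obj, node),
                .key_offset           = offsetof(struct my_obj, key),
                .key_len              = sizeof(u32),
                .max_size             = 1 << 16,
                /* soft cap on elements; when left unset it defaults to
                 * max_size * 2, and insertions beyond it fail */
                .insecure_max_entries = 1 << 17,
        };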

23 Apr, 2015

2 commits

  • The current code only stops inserting rehashes into the
    chain when no resizes are currently scheduled. As long as resizes
    are scheduled, and while inserting above the utilization watermark,
    more and more rehashes will be scheduled.

    This led to a perfect DoS storm with thousands of rehashes
    scheduled, which led to thousands of spinlocks being taken
    sequentially.

    Instead, only allow either a series of resizes or a single rehash.
    Drop any further rehashes and return -EBUSY (a caller-side sketch
    follows this entry).

    Fixes: ccd57b1bd324 ("rhashtable: Add immediate rehash during insertion")
    Signed-off-by: Thomas Graf
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Thomas Graf
     
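    A caller-side sketch of the new behaviour, reusing the hypothetical
    my_obj/my_params setup from the earlier sketch (ht and obj assumed in
    scope):

        int err = rhashtable_insert_fast(&ht, &obj->node, my_params);

        if (err == -EBUSY) {
                /* a rehash is already scheduled: treat as transient
                 * pressure and drop (or queue for retry) */
                kfree(obj);
        }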
  • When rhashtable_insert_rehash() fails with ENOMEM, this indicates that
    we can't allocate the necessary memory in the current context, but the
    limits as set by the user would still allow the table to grow.

    Thus, attempt an async resize in the background, where we can allocate
    using GFP_KERNEL, which is more likely to succeed. The insertion itself
    will still fail, in order to indicate pressure (a sketch of this
    fallback follows the entry).

    This fixes a bug where the table would never continue growing once the
    utilization is above 100%.

    Fixes: ccd57b1bd324 ("rhashtable: Add immediate rehash during insertion")
    Signed-off-by: Thomas Graf
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Thomas Graf
     
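    A simplified sketch of the fallback pattern described above, not the
    verbatim kernel code (bucket_table_alloc and run_work are the
    implementation's internals; new_tbl, size and ht assumed in scope):

        new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC);
        if (new_tbl == NULL) {
                /* punt the grow to process context, where GFP_KERNEL
                 * allocations are more likely to succeed */
                schedule_work(&ht->run_work);
                return -ENOMEM;         /* insertion still signals pressure */
        }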

26 Mar, 2015

1 commit

  • nftables sets will be converted to use so-called set extensions,
    moving the key to a non-fixed position. To hash it, obj_hashfn must
    be used; however, so far it doesn't receive the length parameter.

    Pass the key length to obj_hashfn() and convert existing users (a
    sketch of the new signature follows this entry).

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
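    A sketch of an obj_hashfn under the new signature, assuming a
    hypothetical element type whose key position is resolved at runtime
    (my_elem_key() is a made-up accessor):

        static u32 my_obj_hash(const void *data, u32 len, u32 seed)
        {
                const struct my_elem *elem = data;

                /* key located per object rather than at a fixed offset */
                return jhash(my_elem_key(elem), len, seed);
        }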

24 Mar, 2015

7 commits

  • The commit 963ecbd41a1026d99ec7537c050867428c397b89 ("rhashtable:
    Fix use-after-free in rhashtable_walk_stop") fixed a real bug
    but created another one because we may end up sleeping inside an
    RCU critical section.

    This patch fixes it properly by replacing the mutex with a spinlock
    that specifically protects the walker lists (see the sketch below).

    Reported-by: Sasha Levin
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
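    A sketch of the resulting pattern; unlike a mutex, a spinlock may be
    taken inside an RCU read-side critical section (ht->lock and
    tbl->walkers follow the rhashtable implementation):

        spin_lock(&ht->lock);
        list_add(&walker->list, &tbl->walkers);
        spin_unlock(&ht->lock);
        /* a mutex here could sleep, which is illegal under rcu_read_lock() */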
  • This patch reintroduces immediate rehash during insertion. If
    we find during insertion that the table is full or the chain
    length exceeds a set limit (currently 16, but it may be disabled
    with insecure_elasticity; see the params sketch after this
    entry), then we will force an immediate rehash. The rehash will
    contain an expansion if the table utilisation exceeds 75%.

    If this rehash fails then the insertion will fail. Otherwise the
    insertion will be reattempted in the new hash table.

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     
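    A hedged params sketch for the knob mentioned above, reusing the
    hypothetical my_obj element type:

        static const struct rhashtable_params unlimited_chain_params = {
                .head_offset         = offsetof(struct my_obj, node),
                .key_offset          = offsetof(struct my_obj, key),
                .key_len             = sizeof(u32),
                /* opt out of the chain-length limit (default 16) */
                .insecure_elasticity = true,
        };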
  • This patch adds the ability to allocate bucket table with GFP_ATOMIC
    instead of GFP_KERNEL. This is needed when we perform an immediate
    rehash during insertion.

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch adds the missing bits to allow multiple rehashes. The
    read-side as well as remove already handle this correctly. So it's
    only the rehasher and insertion that need modification to handle
    this.

    Note that this patch doesn't actually enable it so for now rehashing
    is still only performed by the worker thread.

    This patch also disables the explicit expand/shrink interface because
    the table is meant to expand and shrink automatically, and continuing
    to export these interfaces unnecessarily complicates the life of the
    rehasher since the rehash process is now composed of two parts.

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch changes rhashtable_shrink to shrink to the smallest
    size possible rather than halving the table. This is needed
    because with multiple rehashing we will defer shrinking until
    all other rehashing is done, meaning that when we do shrink
    we may be able to shrink a lot.

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Since every current rhashtable user uses jhash as their hash
    function, the fact that jhash is an inline function causes each
    user to generate a copy of its code.

    This patch solves the problem by allowing hashfn to be left unset,
    in which case rhashtable will automatically set it to jhash.
    Furthermore, if the key length is a multiple of 4, it will switch
    over to jhash2 (see the params sketch after this entry).

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     
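    A sketch with hashfn left unset, reusing the hypothetical my_obj
    type: rhashtable then picks jhash automatically, or jhash2 when
    key_len is a multiple of 4:

        static const struct rhashtable_params auto_hash_params = {
                .head_offset = offsetof(struct my_obj, node),
                .key_offset  = offsetof(struct my_obj, key),
                .key_len     = sizeof(u32),     /* multiple of 4: jhash2 */
                /* .hashfn deliberately left unset */
        };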
  • The walker is a lockless reader so it too needs an smp_rmb before
    reading the future_tbl field in order to see any new tables that
    may contain elements that we should have walked over.

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     

21 Mar, 2015

3 commits

  • Now that all rhashtable users have been converted over to the
    inline interface, this patch removes the unused out-of-line
    interface.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch deals with the complaint that we make indirect function
    calls on the fast paths unnecessarily in rhashtable. We resolve
    it by moving the fast paths into inline functions that take struct
    rhashtable_param (which obviously must be the same set of parameters
    supplied to rhashtable_init) as an argument.

    The only remaining indirect call is to obj_hashfn (or key_hashfn if
    obj_hashfn is unset) on the rehash as well as on the insert-during-
    rehash slow path (a caller-side sketch of the inlined fast path
    follows this entry).

    This patch also extends the support of variable-length keys to
    include those where the key is fixed in length but scattered in the
    object. For example, in netlink we want to key off the namespace and
    the portid, but they're not next to each other.

    This patch does this by directly using the object hash function
    as the indicator of whether the key is accessible or not. It
    also adds a new function obj_cmpfn to compare a key against an
    object. This means that the caller no longer needs to supply
    explicit compare functions.

    All this is done in a backwards compatible manner so no existing
    users are affected until they convert to the new interface.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
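    A caller-side sketch of the inlined fast path, reusing the earlier
    hypothetical my_obj/my_params examples (ht and new_obj assumed in
    scope): the same constant params handed to rhashtable_init is passed
    to each call so the compiler can fold it:

        u32 key = 42;
        struct my_obj *found;
        int err = 0;

        found = rhashtable_lookup_fast(&ht, &key, my_params);
        if (!found)
                err = rhashtable_insert_fast(&ht, &new_obj->node,
                                             my_params);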
  • This patch marks the rhashtable_init params argument const as
    there is no reason to modify it since we will always make a copy
    of it in the rhashtable.

    This patch also fixes a bug where we don't actually round up the
    value of min_size unless it is less than HASH_MIN_SIZE.

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     

20 Mar, 2015

1 commit

  • Round min_size up and max_size down, respectively, to the next power
    of two, to make sure we always respect the limits specified by the
    user. This is required because we compare the table size against the
    limit before we expand or shrink.

    This patch also fixes a minor bug where we modified min_size in the
    params provided instead of the copy stored in struct rhashtable (a
    sketch of the rounding follows this entry).

    Signed-off-by: Thomas Graf
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Thomas Graf
     
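    A sketch of the rounding described above; roundup_pow_of_two() and
    rounddown_pow_of_two() are the stock linux/log2.h helpers, while the
    exact placement inside rhashtable_init is paraphrased:

        if (params->min_size)
                ht->p.min_size = roundup_pow_of_two(params->min_size);
        if (params->max_size)
                ht->p.max_size = rounddown_pow_of_two(params->max_size);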

16 Mar, 2015

2 commits

  • The commit 9d901bc05153bbf33b5da2cd6266865e531f0545 ("rhashtable:
    Free bucket tables asynchronously after rehash") causes gratuitous
    failures in rhashtable_remove.

    The reason is that it inadvertently introduced multiple rehashing
    from the perspective of readers. IOW it is now possible to see
    more than two tables during a single RCU critical section.

    Fortunately, the other reader, rhashtable_lookup, already deals with
    this correctly thanks to c4db8848af6af92f90462258603be844baeab44d
    ("rhashtable: Move future_tbl into struct bucket_table"), so only
    rhashtable_remove is broken by this change.

    This patch fixes this by looping over every table from the first
    one to the last, or until we find the element that we were trying
    to delete (a loop sketch follows this entry).

    Incidentally the simple test for detecting rehashing to prevent
    starting another shrinking no longer works. Since it isn't needed
    anyway (the work queue and the mutex serves as a natural barrier
    to unnecessary rehashes) I've simply killed the test.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
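    A sketch of the loop shape described above, with __rhashtable_remove()
    standing in as a hypothetical per-table helper (ht and obj assumed in
    scope):

        struct bucket_table *tbl;
        int err = -ENOENT;

        tbl = rht_dereference_rcu(ht->tbl, ht);
        do {
                err = __rhashtable_remove(ht, tbl, obj);
        } while (err == -ENOENT &&
                 (tbl = rht_dereference_rcu(tbl->future_tbl, ht)) != NULL);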
  • The commit c4db8848af6af92f90462258603be844baeab44d ("rhashtable:
    Move future_tbl into struct bucket_table") introduced a use-after-
    free bug in rhashtable_walk_stop because it dereferences tbl after
    dropping the RCU read lock.

    This patch fixes it by moving the RCU read unlock down to the bottom
    of rhashtable_walk_stop. In fact this was how I had it originally
    but it got dropped while rearranging patches because this one
    depended on the async freeing of bucket_table.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

15 Mar, 2015

6 commits

  • This patch moves future_tbl to open up the possibility of having
    multiple rehashes on the same table.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch adds a rehash counter to bucket_table to indicate
    the last bucket that has been rehashed. This serves two purposes:

    1. Any bucket that has been rehashed can never gain a new object.
    2. If the rehash counter reaches the size of the table, the table
       will forever remain empty.

    This patch also downsizes bucket_table->size to an unsigned int,
    since we do not support sizes greater than 32 bits yet (an abridged
    sketch of the affected fields follows this entry).

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
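    The affected fields, sketched in abridged form (the remaining members
    of the real struct are elided):

        struct bucket_table {
                unsigned int    size;           /* now unsigned int */
                unsigned int    rehash;         /* last bucket rehashed */
                /* ... remaining members elided ... */
        };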
  • There is in fact no need to wait for an RCU grace period in the
    rehash function, since all insertions are guaranteed to go into
    the new table through spin locks.

    This patch uses call_rcu to free the old/rehashed table at our
    leisure (a one-line sketch follows this entry).

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
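    A sketch of the deferred freeing described above, assuming struct
    bucket_table embeds a struct rcu_head named rcu and a matching
    bucket_table_free_rcu() callback:

        call_rcu(&old_tbl->rcu, bucket_table_free_rcu);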
  • It seems that I have already made every rehash redo the random
    seed even though my commit message indicated otherwise :)

    Since we have already taken that step, this patch goes one step
    further and moves the seed initialisation into bucket_table_alloc.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • We only nest one level deep; there is no need to roll our own
    subclasses.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Previously, whenever the walker encountered a resize, it simply
    snapped back to the beginning and started again. However, this only
    worked if the rehash started and completed while the walker was
    idle.

    If the walker attempts to restart while the rehash is still ongoing,
    we may miss objects that we shouldn't have.

    This patch fixes this by making the walker walk the old table
    followed by the new table just like all other readers. If a
    rehash is detected we will still signal our caller of the fact
    so they can prepare for duplicates but we will simply continue
    the walk onto the new table after the old one is finished either
    by us or by the rehasher.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

13 Mar, 2015

3 commits

  • This patch fixes a typo in rhashtable_lookup_compare where we fail
    to recompute the hash when looking up the new table. This causes
    elements to be missed and potentially a crash during a resize.

    Reported-by: Thomas Graf
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Commit c0c09bfdc415 ("rhashtable: avoid unnecessary wakeup for worker
    queue") changed ht->shift to be atomic, which is actually unnecessary.

    Instead of leaving the current shift in the core rhashtable structure,
    it can be cached inside the individual bucket tables.

    There, it will only be initialized once during a new table allocation
    in the shrink/expansion slow path, and from then onward it stays
    immutable for the rest of the bucket table's lifetime.

    That allows shift to be non-atomic. The patch also moves hash_rnd
    management into the table setup. The rhashtable structure now consumes
    3 instead of 4 cachelines.

    Signed-off-by: Daniel Borkmann
    Cc: Ying Xue
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • There is a potential race condition between readers and the rehasher.
    In particular, the rehasher could have started a rehash while the
    reader finishes a scan of the old table but fails to see the new
    table pointer.

    This patch closes this window by adding an smp_wmb/smp_rmb pair
    (a simplified sketch follows this entry).

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
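
    A simplified sketch of the barrier pairing described above (pointer
    naming follows the code of the time, when future_tbl still lived in
    struct rhashtable; the writer publishes the new table before
    migrating any entries):

        /* rehasher (writer) */
        rcu_assign_pointer(ht->future_tbl, new_tbl);
        smp_wmb();              /* publish before any entry migration */
        /* ... move entries from the old table into the new one ... */

        /* reader, after scanning an old bucket to its end */
        smp_rmb();              /* pairs with the smp_wmb above */
        tbl = rht_dereference_rcu(ht->future_tbl, ht);
        /* if the scan saw entries already migrated away, the future
         * table pointer is guaranteed to be visible here */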