21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have MODULE_LICENSE("GPL*") inside, which was used in the initial
    scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

29 Dec, 2018

8 commits

  • Size and 'next bit' were swapped; this bug could cause the worker to
    reschedule itself even if the system was idle.

    Fixes: 5c789e131cbb9 ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
    Reviewed-by: Shawn Bohrer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
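
    A minimal sketch of the call in question (illustrative names, not the
    actual nf_conncount code): find_next_bit() takes the bitmap size as its
    second argument and the search offset as its third, so swapping the two
    makes the worker believe more trees are still pending.

      #include <linux/bitops.h>

      /* Correct argument order is (addr, size, offset). */
      static bool more_trees_pending(const unsigned long *pending_trees,
                                     unsigned int nr_trees,
                                     unsigned int current_tree)
      {
          return find_next_bit(pending_trees, nr_trees,
                               current_tree + 1) < nr_trees;
      }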
     
  • Instead of removing an empty list node that might be reintroduced soon
    thereafter, tentatively place the empty list node on the list passed to
    tree_nodes_free(), then re-check whether the list is still empty before
    erasing it from the tree.

    [ Florian: rebase on top of pending nf_conncount fixes ]

    Fixes: 5c789e131cbb9 ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
    Reviewed-by: Shawn Bohrer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Two CPUs may race to remove a connection from the list; the existing
    conn->dead flag does not prevent this and results in a use-after-free.
    Use the per-list spinlock to protect list iterations.

    As all accesses to the list now happen while holding the per-list lock,
    we no longer need to delay free operations with rcu.

    Joint work with Florian.

    Fixes: 5c789e131cbb9 ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
    Reviewed-by: Shawn Bohrer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
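
    A minimal sketch of the locking pattern described above, with
    illustrative type and field names rather than the upstream code: every
    list walk takes the per-list lock, so stale entries can be unlinked and
    freed directly instead of waiting for an RCU grace period.

      #include <linux/list.h>
      #include <linux/slab.h>
      #include <linux/spinlock.h>

      struct example_conn {
          struct list_head node;
          bool stale;                /* stand-in for the real staleness test */
      };

      struct example_list {
          spinlock_t list_lock;      /* protects head and count */
          struct list_head head;
          unsigned int count;
      };

      static void example_gc_list(struct example_list *list)
      {
          struct example_conn *conn, *next;

          spin_lock_bh(&list->list_lock);
          list_for_each_entry_safe(conn, next, &list->head, node) {
              if (conn->stale) {
                  list_del(&conn->node);
                  list->count--;
                  kfree(conn);       /* no lockless readers remain */
              }
          }
          spin_unlock_bh(&list->list_lock);
      }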
     
  • 'lookup' is always followed by 'add'.
    Merge both and make the list-walk part of nf_conncount_add().

    This also avoids one unneeded unlock/re-lock pair.

    Extra care needs to be taken in count_tree, as we only hold the rcu
    read lock, i.e. we can only insert into an existing tree node after
    acquiring its lock and making sure it has a nonzero count.

    As a zero count should be rare, just fall back to insert_tree()
    (which acquires tree lock).

    This issue and its solution were pointed out by Shawn Bohrer
    during patch review.

    Reviewed-by: Shawn Bohrer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
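
    A sketch of the nonzero-count check described above (illustrative,
    reusing the example_list type from the earlier sketch; the real
    functions are nf_conncount_add(), count_tree() and insert_tree()): a
    node found under the rcu read lock is only used if its count is still
    nonzero once its list lock is held.

      /* Returns false if the node is being reclaimed; the caller then
       * falls back to insert_tree(), which takes the tree lock. */
      static bool example_try_attach(struct example_list *list)
      {
          bool ok;

          spin_lock_bh(&list->list_lock);
          ok = list->count != 0;     /* zero: a concurrent GC emptied it */
          if (ok)
              list->count++;         /* stand-in for adding the new entry */
          spin_unlock_bh(&list->list_lock);

          return ok;
      }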
     
  • Shawn Bohrer reported the following crash:
    |RIP: 0010:rb_erase+0xae/0x360
    [..]
    Call Trace:
    nf_conncount_destroy+0x59/0xc0 [nf_conncount]
    cleanup_match+0x45/0x70 [ip_tables]
    ...

    Shawn tracked this down to a bogus 'parent' pointer: the problem is that
    when we insert a new node, there is a chance that the 'parent' we found
    was also passed to tree_nodes_free() (because that node was empty) for
    erase+free.

    Instead of trying to be clever and detect when this happens, restart
    the search if we have evicted one or more nodes. To prevent frequent
    restarts, do not perform gc on the second round.

    Also, unconditionally schedule the gc worker.
    The condition

    gc_count > ARRAY_SIZE(gc_nodes)

    cannot be true unless the tree grows very large, as the height of the
    tree will be low even with hundreds of nodes present.

    Fixes: 5c789e131cbb9 ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
    Reported-by: Shawn Bohrer
    Reviewed-by: Shawn Bohrer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
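
    A control-flow sketch of the restart logic (hypothetical stub helpers,
    not the real insert_tree()): if the first walk evicted nodes, the
    'parent' found during that walk can no longer be trusted, so the walk
    is redone with GC disabled to bound the number of restarts.

      #include <linux/types.h>

      /* Stubs standing in for the rbtree walk details. */
      static unsigned int example_walk_and_gc(bool do_gc) { return do_gc; }
      static void example_free_evicted(void) { }
      static void example_link_new_node(void) { }

      static void example_insert_restart(void)
      {
          bool do_gc = true;

      restart:
          if (example_walk_and_gc(do_gc)) {   /* evicted node count */
              example_free_evicted();
              do_gc = false;                  /* second round: no GC */
              goto restart;
          }
          example_link_new_node();            /* parent pointer is valid */
      }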
     
  • The lockless workqueue garbage collector can race with the packet path
    garbage collector to delete list nodes, as it calls tree_nodes_free()
    with the addresses of nodes that might have been free'd already from
    another cpu.

    To fix this, split gc into two phases.

    One phase to perform gc on the connections: from a locking perspective,
    this is the same as count_tree(): we hold the rcu lock, but we do not
    change the tree, we only change the nodes' contents.

    The second phase acquires the tree lock and reaps empty nodes.
    This avoids a race condition of the garbage collection vs. packet path:
    If a node has been free'd already, the second phase won't find it anymore.

    This second phase is, from a locking perspective, the same as
    insert_tree().

    The former only modifies nodes (list content, count); the latter
    modifies the tree itself (rb_erase or rb_insert).

    Fixes: 5c789e131cbb9 ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
    Reviewed-by: Shawn Bohrer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
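
    A minimal sketch of the two phases (illustrative names, not the
    upstream tree_gc_worker()): phase one runs under the RCU read lock and
    only ages out list entries, phase two takes the tree lock and erases
    nodes whose lists have become empty.

      #include <linux/rbtree.h>
      #include <linux/rcupdate.h>
      #include <linux/spinlock.h>

      /* Hypothetical stand-ins for the per-node list handling. */
      static void example_gc_node_list(struct rb_node *node) { }
      static bool example_node_is_empty(struct rb_node *node) { return false; }

      static void example_tree_gc(struct rb_root *root, spinlock_t *tree_lock)
      {
          struct rb_node *node, *next;

          /* Phase 1: shrink the per-node lists, never touch the tree. */
          rcu_read_lock();
          for (node = rb_first(root); node; node = rb_next(node))
              example_gc_node_list(node);
          rcu_read_unlock();

          /* Phase 2: reap empty nodes while holding the tree lock. */
          spin_lock_bh(tree_lock);
          for (node = rb_first(root); node; node = next) {
              next = rb_next(node);
              if (example_node_is_empty(node))
                  rb_erase(node, root);   /* freeing of the node omitted */
          }
          spin_unlock_bh(tree_lock);
      }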
     
  • age is a signed integer, so the result can be negative when the
    timestamps have a large delta. In this case we want to discard the
    entry.

    Instead of using 'age >= 2 || age < 0', just make it unsigned.

    Fixes: b36e4523d4d56 ("netfilter: nf_conncount: fix garbage collection confirm race")
    Reviewed-by: Shawn Bohrer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
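
    A tiny user-space illustration of the arithmetic (not kernel code):
    with unsigned subtraction, a timestamp that lies "in the future" wraps
    to a huge value, so the single 'age >= 2' test covers both genuinely
    old and bogus entries, and the extra 'age < 0' check becomes
    unnecessary.

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
          uint32_t now = 100, stamp = 103;      /* stamp is ahead of 'now' */
          int32_t signed_age = now - stamp;     /* -3: 'age >= 2' misses it */
          uint32_t unsigned_age = now - stamp;  /* 4294967293: discarded */

          printf("signed=%d unsigned=%u\n", signed_age, unsigned_age);
          return 0;
      }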
     
  • Most of the time these were the same value anyway, but when
    CONFIG_LOCKDEP was enabled we would use a smaller number of locks to
    reduce overhead. Unfortunately having two values is confusing and not
    worth the complexity.

    This fixes a bug where tree_gc_worker() would only GC up to
    CONNCOUNT_LOCK_SLOTS trees, which meant that when CONFIG_LOCKDEP was
    enabled not all trees would be GCed by tree_gc_worker().

    Fixes: 5c789e131cbb9 ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
    Signed-off-by: Florian Westphal
    Signed-off-by: Shawn Bohrer
    Signed-off-by: Pablo Neira Ayuso

    Shawn Bohrer
     

13 Dec, 2018

1 commit


27 Nov, 2018

1 commit

  • All lists that reach the tree_nodes_free() function have both a zero
    counter and the dead flag set. The reason is that lists to be released
    are selected by nf_conncount_gc_list(), which already decrements the
    list counter and sets the dead flag. Therefore, this if statement in
    tree_nodes_free() is unnecessary and wrong.

    Fixes: 31568ec09ea0 ("netfilter: nf_conncount: fix list_del corruption in conn_free")
    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
     

12 Nov, 2018

3 commits

  • When list->count is 0, the list is deleted by GC. But list->count never
    reaches 0, because the initial count value is 1 and it is increased
    whenever a node is inserted. So the initial value of list->count should
    be 0.

    Originally, GC only found a zero-count list by deleting nodes and
    decreasing the count. However, a list may be left empty because node
    insertion can fail, e.g. due to an allocation problem. To solve this,
    the GC routine now also finds zero-count lists without deleting a node.

    Fixes: cb2b36f5a97d ("netfilter: nf_conncount: Switch to plain list")
    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
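
    A sketch of the initialisation change (illustrative, reusing the
    example_list type from the earlier sketches): the count now starts at 0
    and is only incremented when an entry is actually inserted, so a list
    that stayed empty, e.g. after a failed allocation, can be recognised by
    the GC through count == 0 alone.

      static void example_list_init(struct example_list *list)
      {
          spin_lock_init(&list->list_lock);
          INIT_LIST_HEAD(&list->head);
          list->count = 0;    /* previously 1, so it never reached 0 again */
      }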
     
  • nf_conncount_tuple is an element of nft_connlimit and is deleted by
    conn_free(). Elements can be deleted by both the GC routine and the
    data path functions (nf_conncount_lookup, nf_conncount_add), and both
    call conn_free() to free them. But conn_free() only protects the lists,
    not each element, so list_del corruption can occur.

    conn_free() does not check whether an element has already been deleted.
    In order to protect elements, a dead flag is added: when an element is
    deleted, the dead flag is set. Only conn_free() can delete elements, so
    the list lock and the dead flag together are enough to protect them.

    test commands:
    %nft add table ip filter
    %nft add chain ip filter input { type filter hook input priority 0\; }
    %nft add rule filter input meter test { ip id ct count over 2 } counter

    splat looks like:
    [ 1779.495778] list_del corruption, ffff8800b6e12008->prev is LIST_POISON2 (dead000000000200)
    [ 1779.505453] ------------[ cut here ]------------
    [ 1779.506260] kernel BUG at lib/list_debug.c:50!
    [ 1779.515831] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 1779.516772] CPU: 0 PID: 33 Comm: kworker/0:2 Not tainted 4.19.0-rc6+ #22
    [ 1779.516772] Workqueue: events_power_efficient nft_rhash_gc [nf_tables_set]
    [ 1779.516772] RIP: 0010:__list_del_entry_valid+0xd8/0x150
    [ 1779.516772] Code: 39 48 83 c4 08 b8 01 00 00 00 5b 5d c3 48 89 ea 48 c7 c7 00 c3 5b 98 e8 0f dc 40 ff 0f 0b 48 c7 c7 60 c3 5b 98 e8 01 dc 40 ff 0b 48 c7 c7 c0 c3 5b 98 e8 f3 db 40 ff 0f 0b 48 c7 c7 20 c4 5b
    [ 1779.516772] RSP: 0018:ffff880119127420 EFLAGS: 00010286
    [ 1779.516772] RAX: 000000000000004e RBX: dead000000000200 RCX: 0000000000000000
    [ 1779.516772] RDX: 000000000000004e RSI: 0000000000000008 RDI: ffffed0023224e7a
    [ 1779.516772] RBP: ffff88011934bc10 R08: ffffed002367cea9 R09: ffffed002367cea9
    [ 1779.516772] R10: 0000000000000001 R11: ffffed002367cea8 R12: ffff8800b6e12008
    [ 1779.516772] R13: ffff8800b6e12010 R14: ffff88011934bc20 R15: ffff8800b6e12008
    [ 1779.516772] FS: 0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000
    [ 1779.516772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1779.516772] CR2: 00007fc876534010 CR3: 000000010da16000 CR4: 00000000001006f0
    [ 1779.516772] Call Trace:
    [ 1779.516772] conn_free+0x9f/0x2b0 [nf_conncount]
    [ 1779.516772] ? nf_ct_tmpl_alloc+0x2a0/0x2a0 [nf_conntrack]
    [ 1779.516772] ? nf_conncount_add+0x520/0x520 [nf_conncount]
    [ 1779.516772] ? do_raw_spin_trylock+0x1a0/0x1a0
    [ 1779.516772] ? do_raw_spin_trylock+0x10/0x1a0
    [ 1779.516772] find_or_evict+0xe5/0x150 [nf_conncount]
    [ 1779.516772] nf_conncount_gc_list+0x162/0x360 [nf_conncount]
    [ 1779.516772] ? nf_conncount_lookup+0xee0/0xee0 [nf_conncount]
    [ 1779.516772] ? _raw_spin_unlock_irqrestore+0x45/0x50
    [ 1779.516772] ? trace_hardirqs_off+0x6b/0x220
    [ 1779.516772] ? trace_hardirqs_on_caller+0x220/0x220
    [ 1779.516772] nft_rhash_gc+0x16b/0x540 [nf_tables_set]
    [ ... ]

    Fixes: 5c789e131cbb ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
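
    A sketch of the 'dead' flag idea (illustrative types, not the upstream
    conn_free()): the flag is tested and set under the list lock, so a
    second path racing to delete the same element simply sees dead == true
    and skips the already-done list_del(); the free itself stays deferred
    via RCU so that a racing caller may still look at the flag.

      #include <linux/rculist.h>
      #include <linux/slab.h>
      #include <linux/spinlock.h>

      struct example_tuple {
          struct list_head node;
          struct rcu_head rcu;
          bool dead;
      };

      struct example_tuple_list {
          spinlock_t list_lock;
          struct list_head head;
          unsigned int count;
      };

      static void example_conn_free(struct example_tuple_list *list,
                                    struct example_tuple *conn)
      {
          spin_lock_bh(&list->list_lock);
          if (conn->dead) {                   /* already unlinked elsewhere */
              spin_unlock_bh(&list->list_lock);
              return;
          }
          conn->dead = true;
          list_del_rcu(&conn->node);
          list->count--;
          spin_unlock_bh(&list->list_lock);

          kfree_rcu(conn, rcu);               /* deferred free */
      }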
     
  • conn_free() takes its lock with spin_lock() and is called by both
    nf_conncount_lookup() and nf_conncount_gc_list(). nf_conncount_lookup()
    is called from bottom-half context and nf_conncount_gc_list() from
    process context, so the plain spin_lock() call is not safe. Hence
    conn_free() should use spin_lock_bh() instead of spin_lock().

    test commands:
    %nft add table ip filter
    %nft add chain ip filter input { type filter hook input priority 0\; }
    %nft add rule filter input meter test { ip saddr ct count over 2 } \
    counter

    splat looks like:
    [ 461.996507] ================================
    [ 461.998999] WARNING: inconsistent lock state
    [ 461.998999] 4.19.0-rc6+ #22 Not tainted
    [ 461.998999] --------------------------------
    [ 461.998999] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
    [ 461.998999] kworker/0:2/134 [HC0[0]:SC0[0]:HE1:SE1] takes:
    [ 461.998999] 00000000a71a559a (&(&list->list_lock)->rlock){+.?.}, at: conn_free+0x69/0x2b0 [nf_conncount]
    [ 461.998999] {IN-SOFTIRQ-W} state was registered at:
    [ 461.998999] _raw_spin_lock+0x30/0x70
    [ 461.998999] nf_conncount_add+0x28a/0x520 [nf_conncount]
    [ 461.998999] nft_connlimit_eval+0x401/0x580 [nft_connlimit]
    [ 461.998999] nft_dynset_eval+0x32b/0x590 [nf_tables]
    [ 461.998999] nft_do_chain+0x497/0x1430 [nf_tables]
    [ 461.998999] nft_do_chain_ipv4+0x255/0x330 [nf_tables]
    [ 461.998999] nf_hook_slow+0xb1/0x160
    [ ... ]
    [ 461.998999] other info that might help us debug this:
    [ 461.998999] Possible unsafe locking scenario:
    [ 461.998999]
    [ 461.998999] CPU0
    [ 461.998999] ----
    [ 461.998999] lock(&(&list->list_lock)->rlock);
    [ 461.998999]
    [ 461.998999] lock(&(&list->list_lock)->rlock);
    [ 461.998999]
    [ 461.998999] *** DEADLOCK ***
    [ 461.998999]
    [ ... ]

    Fixes: 5c789e131cbb ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
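
    A minimal illustration of the rule behind the lockdep report above
    (generic example, not the nf_conncount code): a lock that is also taken
    from softirq context must be taken with spin_lock_bh() by
    process-context callers, otherwise a softirq arriving on the same CPU
    while the lock is held can try to take it again and deadlock.

      #include <linux/spinlock.h>

      static DEFINE_SPINLOCK(example_lock);

      /* Packet path: already runs in softirq (bottom-half) context. */
      static void example_datapath(void)
      {
          spin_lock(&example_lock);
          /* ... */
          spin_unlock(&example_lock);
      }

      /* GC worker: runs in process context, so it must disable BHs. */
      static void example_worker(void)
      {
          spin_lock_bh(&example_lock);
          /* ... */
          spin_unlock_bh(&example_lock);
      }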
     

18 Jul, 2018

6 commits

  • This patch is originally from Florian Westphal.

    This patch does the following 3 main tasks.

    1) Add a list lock to 'struct nf_conncount_list' so that we can alter
    the lists containing the individual connections without holding the
    main tree lock. This is useful when we only need to add to or remove
    from a list without allocating or removing a node in the tree. With
    this change, nft_connlimit is updated accordingly, since it no longer
    needs to maintain its own list lock.

    2) Use RCU for the initial tree search to improve tree lookup performance.

    3) Add a garbage collection worker. This worker is scheduled when there
    are excessive tree nodes that need to be recycled.

    Moreover, the rbnode reclaim logic is moved from the tree search path to
    the tree insertion path to avoid a race condition.

    Signed-off-by: Yi-Hung Wei
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Yi-Hung Wei
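
    A rough sketch of the resulting data layout, restricted to the fields
    the three tasks above touch (illustrative names; the real structures
    carry more state):

      #include <linux/list.h>
      #include <linux/rbtree.h>
      #include <linux/spinlock.h>
      #include <linux/workqueue.h>

      /* One list of connections sharing a key; now has its own lock (1). */
      struct example_conncount_list {
          spinlock_t list_lock;
          struct list_head head;
          unsigned int count;
      };

      /* One tree node; the tree itself is searched under RCU (2). */
      struct example_conncount_rb {
          struct rb_node node;
          struct example_conncount_list list;
      };

      /* Top-level object; the worker performs deferred tree GC (3). */
      struct example_conncount_data {
          struct rb_root root;         /* plus a tree lock, not shown */
          struct work_struct gc_work;  /* scheduled when nodes pile up */
      };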
     
  • This patch is originally from Florian Westphal.

    When we have a very coarse grouping, e.g. by large subnets, zone id,
    etc., it's likely that we do not need to do tree rotation because
    we'll find a node where we can attach the new entry. Based on this
    observation, we split tree traversal and insertion.

    Later on, we can make traversal lockless (tree protected
    by RCU), and add extra lock in the individual nodes to protect list
    insertion/deletion, thereby allowing parallel insert/delete in different
    tree nodes.

    Signed-off-by: Yi-Hung Wei
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Yi-Hung Wei
     
  • This patch is originally from Florian Westphal.

    This is a preparation patch to allow lockless traversal
    of the tree via RCU.

    Signed-off-by: Yi-Hung Wei
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Yi-Hung Wei
     
  • This patch is originally from Florian Westphal.

    This patch does the following three tasks.

    It applies the same early-exit technique to nf_conncount_lookup().

    Since now we keep the number of connections in 'struct nf_conncount_list',
    we no longer need to return the count in nf_conncount_lookup().

    Moreover, we expose the garbage collection function nf_conncount_gc_list()
    for nft_connlimit.

    Signed-off-by: Yi-Hung Wei
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Yi-Hung Wei
     
  • Original patch is from Florian Westphal.

    This patch switches from hlist to a plain list to store the list of
    connections with the same filtering key in nf_conncount. With the
    plain list, we can insert new connections at the tail, so over time
    the beginning of the list holds long-running connections and those are
    expired, while the newly created ones are at the end.

    Later on, we could probably move checked ones to the end of the list,
    so the next run has a higher chance to reclaim stale entries at the
    front.

    Signed-off-by: Yi-Hung Wei
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Yi-Hung Wei
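
    A sketch of the insertion order (illustrative, reusing the example
    types from the sketches above, locking omitted): new connections go to
    the tail, so a GC pass starting at the head visits the oldest, most
    likely expired entries first.

      static void example_list_add(struct example_list *list,
                                   struct example_conn *conn)
      {
          list_add_tail(&conn->node, &list->head);  /* oldest stay in front */
          list->count++;
      }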
     
  • This patch is originally from Florian Westphal.

    We use an extra function with early exit for garbage collection.
    It is not necessary to traverse the full list for every node since
    it is enough to zap a couple of entries for garbage collection.

    Signed-off-by: Yi-Hung Wei
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Yi-Hung Wei
     

27 Jun, 2018

1 commit

  • Yi-Hung Wei and Justin Pettit found a race in the garbage collection scheme
    used by nf_conncount.

    When doing a list walk, we look up the tuple in the conntrack table.
    If the lookup fails we remove this tuple from our list because
    the conntrack entry is gone.

    This is the common cause, but it turns out it's not the only one.
    The list entry could have been created just before by another cpu, i.e. the
    conntrack entry might not yet have been inserted into the global hash.

    To avoid this, we introduce a timestamp and the owning cpu.
    If the entry appears to be stale, evict only if:
    1. The current cpu is the one that added the entry, or,
    2. The timestamp is older than two jiffies

    The second constraint allows GC to be taken over by another cpu too
    (e.g. because a cpu was offlined or napi got moved to another cpu).

    We can't pretend the 'doubtful' entry wasn't in our list.
    Instead, when we don't find an entry, indicate via IS_ERR whether the
    entry was removed ('did not exist') or withheld
    ('might-be-unconfirmed').

    This most likely also fixes a xt_connlimit imbalance earlier reported by
    Dmitry Andrianov.

    Cc: Dmitry Andrianov
    Reported-by: Justin Pettit
    Reported-by: Yi-Hung Wei
    Signed-off-by: Florian Westphal
    Acked-by: Yi-Hung Wei
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
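
    A sketch of the eviction rule (illustrative names, not the real
    find_or_evict(); the upstream code reports the verdict as an ERR_PTR()
    value checked with IS_ERR(), reduced here to plain error codes): an
    entry whose conntrack lookup failed is evicted only if this CPU created
    it or if it is at least two jiffies old.

      #include <linux/errno.h>
      #include <linux/jiffies.h>
      #include <linux/smp.h>
      #include <linux/types.h>

      struct example_entry {
          int cpu;            /* CPU that added the entry */
          u32 jiffies32;      /* creation time, truncated jiffies */
      };

      static int example_verdict(const struct example_entry *e,
                                 bool found_in_conntrack)
      {
          u32 age = (u32)jiffies - e->jiffies32;

          if (found_in_conntrack)
              return 0;           /* keep it and count it */
          if (e->cpu == raw_smp_processor_id() || age >= 2)
              return -ENOENT;     /* conntrack entry is really gone: evict */
          return -EAGAIN;         /* might not be confirmed yet: keep it */
      }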
     

13 Jun, 2018

1 commit

  • Currently, we use check_hlist() for garbage collection. However, we
    use the 'zone' from the counted entry to query the existence of
    existing entries in the hlist. This could be wrong when they are in
    different zones; this patch fixes that issue.

    Fixes: e59ea3df3fc2 ("netfilter: xt_connlimit: honor conntrack zone if available")
    Signed-off-by: Yi-Hung Wei
    Signed-off-by: Pablo Neira Ayuso

    Yi-Hung Wei
     

03 Jun, 2018

1 commit


20 Mar, 2018

2 commits

  • Currently, nf_conncount_count() counts the number of connections that
    match a key and inserts a conntrack 'tuple' with the same key into the
    accounting data structure. This patch supports another use case that
    only counts the number of connections, where no 'tuple' is provided.
    Therefore, nf_conncount_count() is changed to support the case where
    'tuple' is NULL. This could be useful for querying statistics or for
    debugging purposes.

    Signed-off-by: Yi-Hung Wei
    Acked-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Yi-Hung Wei
     
  • Remove the parameter 'family' in nf_conncount_count() and count_tree().
    The parameter is no longer useful after commit 625c556118f3
    ("netfilter: connlimit: split xt_connlimit into front and backend").

    Signed-off-by: Yi-Hung Wei
    Acked-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Yi-Hung Wei
     

19 Jan, 2018

1 commit


09 Jan, 2018

1 commit

  • This allows the xt_connlimit infrastructure to be reused from nf_tables.
    The upcoming nf_tables frontend can just pass in an nftables register
    as the input key; this allows limiting by any nft-supported key,
    including concatenations.

    For xt_connlimit, pass in the zone and the ip/ipv6 address.

    With help from Yi-Hung Wei.

    Signed-off-by: Florian Westphal
    Acked-by: Yi-Hung Wei
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal