21 Aug, 2020

2 commits

  • Getting creative with nft and omitting the interval_overlap()
    check from the set_overlap() function, without omitting
    set_overlap() altogether, led to the observation of a partial
    overlap that wasn't detected, and would actually result in
    replacement of the end element of an existing interval.

    This is due to the fact that we'll return -EEXIST on a matching,
    pre-existing start element, instead of -ENOTEMPTY, and the error
    is cleared by the API if NLM_F_EXCL is not given. At this point, we
    can insert a matching start, and duplicate the end element as long
    as we don't end up inside other intervals.

    For instance, inserting interval 0 - 2 with an existing 0 - 3
    interval would result in a single 0 - 2 interval, and a dangling
    '3' end element. This is because nft will proceed after inserting
    the '0' start element as no error is reported, and no further
    conflicting intervals are detected on insertion of the end element.

    This needs a different approach as it's a local condition that can
    be detected by looking for duplicate ends coming from left and
    right, separately. Track those and directly report -ENOTEMPTY on
    duplicated end elements for a matching start.

    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
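
The failure mode described in the commit above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the kernel code: elements are plain (key, kind) tuples, and naive_insert(), read_intervals() and checked_insert() are hypothetical names introduced only for illustration. The check shown compares the end of the interval that owns a matching start, which is a simplification of the left/right tracking of duplicate ends that the message describes.

```python
import errno

START, END = "start", "end"

def naive_insert(elements, lo, hi):
    """Insert interval lo-hi element by element, with -EEXIST on the start masked."""
    elements = set(elements)
    # A matching start yields -EEXIST, which is cleared without NLM_F_EXCL,
    # so the front-end carries on and inserts the end element as well.
    elements.add((lo, START))
    elements.add((hi, END))
    return sorted(elements)

def read_intervals(elements):
    """Pair each start with the next end; collect ends left without a start."""
    intervals, dangling, open_start = [], [], None
    for key, kind in sorted(elements):
        if kind == START:
            open_start = key
        elif open_start is not None:
            intervals.append((open_start, key))
            open_start = None
        else:
            dangling.append(key)
    return intervals, dangling

def checked_insert(elements, lo, hi):
    """Return 0 or a negative errno; a different end for a matching start is rejected."""
    intervals, _ = read_intervals(elements)
    for start, end in intervals:
        if start == lo:
            return -errno.EEXIST if end == hi else -errno.ENOTEMPTY
    return 0

existing = [(0, START), (3, END)]
```

With the existing 0 - 3 interval, naive_insert(existing, 0, 2) leaves a 0 - 2 interval plus a dangling end element 3, while checked_insert(existing, 0, 2) reports -ENOTEMPTY before anything is modified.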
     
  • Checks for partial overlaps on insertion assume that end elements
    are always descendant nodes of their corresponding start, because
    they are inserted later. However, this is not the case if a
    previous delete operation caused a tree rotation as part of
    rebalancing.

    Taking the issue reported by Andreas Fischer as an example, if we
    omit delete operations, the existing procedure works because,
    equivalently, we are inserting a start item with value 40 in
    this region of the red-black tree with single-sized intervals:

                          overlap flag
    10 (start)
     / \                  false
    20 (start)
     / \                  false
    30 (start)
     / \                  false
    60 (start)
     / \                  false
    50 (end)
     / \                  false
    20 (end)
     / \                  false
    40 (start)

    if we now delete interval 30 - 30, the tree can be rearranged in
    a way similar to this (note the rotation involving 50 - 50):

                          overlap flag
    10 (start)
     / \                  false
    20 (start)
     / \                  false
    25 (start)
     / \                  false
    70 (start)
     / \                  false
    50 (end)
     / \                  true (from rule a1.)
    50 (start)
     / \                  true
    40 (start)

    and we traverse interval 50 - 50 from the opposite direction
    compared to what was expected.

    To deal with those cases, add a start-before-start rule, b4.,
    that covers traversal of existing intervals from the right.

    We now need to restrict start-after-end rule b3. to cases
    where there are no occurring nodes between existing start and
    end elements, because addition of rule b4. isn't sufficient to
    ensure that the pre-existing end element we encounter while
    descending the tree corresponds to a start element of an
    interval that we already traversed entirely.

    Different types of overlap detection on trees with rotations
    resulting from re-balancing will be covered by nft test case
    sets/0044interval_overlap_1.

    Reported-by: Andreas Fischer
    Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1449
    Cc: # 5.6.x
    Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
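
The assumption that the commit above stops relying on, namely that an end element is always a descendant of its start, can be seen breaking in a small Python model of a rotation. This is a sketch only: the Node class, the helper names and the keys 40 and 60 are hypothetical, and colours and the actual rebalancing rules of a red-black tree are left out.

```python
class Node:
    """Minimal binary tree node; colours and rebalancing rules are omitted."""
    def __init__(self, key, kind):
        self.key, self.kind = key, kind
        self.left = self.right = None

def rotate_left(node):
    """Standard left rotation: the right child becomes the new subtree root."""
    pivot = node.right
    node.right = pivot.left
    pivot.left = node
    return pivot

def is_descendant(ancestor, target):
    """True if target sits somewhere below ancestor."""
    stack = [ancestor.left, ancestor.right]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        if node is target:
            return True
        stack.extend((node.left, node.right))
    return False

def in_order(node):
    """Sorted view of the subtree, which a rotation must not change."""
    if node is None:
        return []
    return in_order(node.left) + [(node.key, node.kind)] + in_order(node.right)

start = Node(40, "start")
end = Node(60, "end")
start.right = end                      # the end was inserted later, below its start
end_below_start_before = is_descendant(start, end)
order_before = in_order(start)

root = rotate_left(start)              # e.g. triggered by rebalancing after a delete
```

After the rotation the in-order sequence is unchanged, but the end element is now the ancestor of its start, so a descent from the root meets the interval from the opposite direction.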
     

29 Jul, 2020

1 commit

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_rwlock_t data type, which allows a rwlock to be
    associated with the sequence counter. This enables lockdep to verify that
    the rwlock used for writer serialization is held when the write side
    critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200720155530.1173732-16-a.darwish@linutronix.de

    Ahmed S. Darwish
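
The idea of tying a lock to a sequence counter can be illustrated with a rough Python analogue. This is a sketch under assumptions, not the kernel API: SeqCountWithLock and update() are hypothetical names, and Lock.locked() only tells whether the lock is held by anyone, whereas lockdep verifies the holder. Unlike the kernel type, the check here is always performed and is never compiled out.

```python
import threading

class SeqCountWithLock:
    """Sequence counter that remembers which lock serializes its writers."""
    def __init__(self, lock):
        self.lock = lock
        self.sequence = 0

    def write_begin(self):
        # Stand-in for the lockdep assertion: the associated lock must be held.
        if not self.lock.locked():
            raise RuntimeError("write section entered without the associated lock")
        self.sequence += 1          # odd value: a write is in progress

    def write_end(self):
        self.sequence += 1          # even value: the write is complete

    def read_begin(self):
        return self.sequence

    def read_retry(self, start):
        # Retry if a write was in progress or has completed since read_begin().
        return (start & 1) == 1 or self.sequence != start

def update(seq, state, value):
    """Writer: take the associated lock, then enter the write section."""
    with seq.lock:
        seq.write_begin()
        state["value"] = value
        seq.write_end()
```

A reader samples the counter with read_begin(), reads the protected data, and repeats the read if read_retry() returns true.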
     

09 Jun, 2020

1 commit

  • While checking the validity of insertion in __nft_rbtree_insert(),
    we currently ignore conflicting elements and intervals only if they
    are not active within the next generation.

    However, if we consider expired elements and intervals as
    potentially conflicting and overlapping, we'll return error for
    entries that should be added instead. This is particularly visible
    with garbage collection intervals that are comparable with the
    element timeout itself, as reported by Mike Dillinger.

    Other than the simple issue of denying insertion of valid entries,
    this might also result in insertion of a single element (opening or
    closing) out of a given interval. With single entries (that are
    inserted as intervals of size 1), this leads in turn to the creation
    of new intervals. For example:

    # nft add element t s { 192.0.2.1 }
    # nft list ruleset
    [...]
    elements = { 192.0.2.1-255.255.255.255 }

    Always ignore expired elements active in the next generation, while
    checking for conflicts.

    It might be more convenient to introduce a new macro that covers
    both inactive and expired items, as this type of check also appears
    quite frequently in other set back-ends. This is however beyond the
    scope of this fix and can be deferred to a separate patch.

    Other than the overlap detection cases introduced by commit
    7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps
    on insertion"), we also have to cover the original conflict check
    dealing with conflicts between two intervals of size 1, which
    predates timeout support. This won't return an error to the user,
    as -EEXIST is masked by nft if NLM_F_EXCL is not given, but would
    result in a silent failure to add the entry.

    Reported-by: Mike Dillinger
    Cc: # 5.6.x
    Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support")
    Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
    Signed-off-by: Stefano Brivio
    Acked-by: Phil Sutter
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
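
The rule stated in the commit above (skip expired elements, as well as those inactive in the next generation, while checking for conflicts) can be written as a short Python predicate. This is a minimal sketch: the Elem class, its field names and the time values are hypothetical and do not mirror kernel structures.

```python
from dataclasses import dataclass

@dataclass
class Elem:
    key: int
    active_next_gen: bool   # still active in the next generation
    expires_at: float       # absolute expiry time, in seconds

def is_expired(elem, now):
    """An element stops counting as soon as its timeout has elapsed."""
    return now >= elem.expires_at

def conflicts(elements, key, now):
    """Only elements that are both active and not expired can conflict."""
    return any(
        e.key == key and e.active_next_gen and not is_expired(e, now)
        for e in elements
    )

elements = [Elem(1, True, 10.0), Elem(2, False, 100.0)]
```

Here key 1 conflicts at time 5.0 but no longer at time 10.0, even if garbage collection has not removed the element yet.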
     

12 May, 2020

1 commit

  • Expired intervals would still match and be dumped to user space until
    garbage collection wiped them out. Make sure they stop matching and
    disappear (from users' perspective) as soon as they expire.

    Fixes: 8d8540c4f5e03 ("netfilter: nft_set_rbtree: add timeout support")
    Signed-off-by: Phil Sutter
    Signed-off-by: Pablo Neira Ayuso

    Phil Sutter
     

06 Apr, 2020

1 commit

  • Case a1. for overlap detection in __nft_rbtree_insert() is not a valid
    one: start-after-start is not needed to detect any type of interval
    overlap and it actually results in a false positive if, while
    descending the tree, this is the only step we hit after starting from
    the root.

    This introduced a regression, as reported by Pablo, in Python test
    cases ip/ip.t and ip/numgen.t:

    ip/ip.t: ERROR: line 124: add rule ip test-ip4 input ip hdrlength vmap { 0-4 : drop, 5 : accept, 6 : continue } counter: This rule should not have failed.
    ip/numgen.t: ERROR: line 7: add rule ip test-ip4 pre dnat to numgen inc mod 10 map { 0-5 : 192.168.10.100, 6-9 : 192.168.20.200}: This rule should not have failed.

    Drop case a1. and renumber others, so that they are a bit clearer. In
    order for these diagrams to be readily understandable, a bigger rework
    is probably needed, such as an ASCII art of the actual rbtree (instead
    of a flattened version).

    Shell script test sets/0044interval_overlap_0 should cover all
    possible cases for false negatives, so I consider that test case still
    sufficient after this change.

    v2: Fix comments for cases a3. and b3.

    Reported-by: Pablo Neira Ayuso
    Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
     

26 Mar, 2020

1 commit


25 Mar, 2020

2 commits

  • ...and return -ENOTEMPTY to the front-end in this case, instead of
    proceeding. Currently, nft takes care of checking for these cases
    and not sending them to the kernel, but if we drop the set_overlap()
    call in nft we can end up in situations like:

    # nft add table t
    # nft add set t s '{ type inet_service ; flags interval ; }'
    # nft add element t s '{ 1 - 5 }'
    # nft add element t s '{ 6 - 10 }'
    # nft add element t s '{ 4 - 7 }'
    # nft list set t s
    table ip t {
    set s {
    type inet_service
    flags interval
    elements = { 1-3, 4-5, 6-7 }
    }
    }

    This change has the primary purpose of making the behaviour
    consistent with nft_set_pipapo, but it also serves to avoid
    inconsistent behaviour if userspace sends overlapping elements for
    any reason.

    v2: When we meet the same key data in the tree, as start element while
    inserting an end element, or as end element while inserting a start
    element, actually check that the existing element is active, before
    resetting the overlap flag (Pablo Neira Ayuso)

    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
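
The outcome that the commit above aims for can be summarised at the level of whole intervals. The Python sketch below is an assumption-laden model, not the tree-descent logic of the patch: check_insert() is a hypothetical helper, intervals are inclusive (lo, hi) pairs, and any intersection other than an identical interval is treated as a partial overlap.

```python
import errno

def check_insert(existing, new):
    """Return 0, -EEXIST for an identical interval, -ENOTEMPTY for an overlap."""
    lo, hi = new
    for start, end in existing:
        if (start, end) == (lo, hi):
            return -errno.EEXIST
        if lo <= end and start <= hi:      # inclusive ranges intersect
            return -errno.ENOTEMPTY
    return 0

existing = [(1, 5), (6, 10)]
```

With the set from the example, check_insert(existing, (4, 7)) returns -ENOTEMPTY, so the 4 - 7 element is refused and the set is not reshaped into 1-3, 4-5, 6-7.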
     
  • Replace negations of nft_rbtree_interval_end() with a new helper,
    nft_rbtree_interval_start(), wherever this helps to visualise the
    problem at hand, that is, for all the occurrences except for the
    comparison against given flags in __nft_rbtree_get().

    This gets especially useful in the next patch.

    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
     

15 Mar, 2020

1 commit


27 Jan, 2020

1 commit

  • Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
    to specify the length of each field in a set concatenation.

    This allows set implementations to support concatenation of multiple
    ranged items, as they can divide the input key into matching data for
    every single field. Such set implementations would be selected as
    they specify support for NFT_SET_INTERVAL and allow desc->field_count
    to be greater than one. Explicitly disallow this for nft_set_rbtree.

    In order to specify the interval for a set entry, userspace would
    include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass
    range endpoints as two separate keys, represented by attributes
    NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.

    While at it, export the number of 32-bit registers available for
    packet matching, as nftables will need this to know the maximum
    number of field lengths that can be specified.

    For example, "packets with an IPv4 address between 192.0.2.0 and
    192.0.2.42, with destination port between 22 and 25", can be
    expressed as two concatenated elements:

    NFTA_SET_ELEM_KEY: 192.0.2.0 . 22
    NFTA_SET_ELEM_KEY_END: 192.0.2.42 . 25

    and NFTA_SET_DESC_CONCAT attribute would contain:

    NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN: 4
    NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN: 2

    v4: No changes
    v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
    v2: No changes

    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
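
To see why per-field lengths are needed, the example from the commit above can be replayed in Python. This is a sketch only: pack_key(), split_fields() and matches() are hypothetical helpers, keys are packed back to back in network byte order, and any padding of fields to register boundaries is ignored.

```python
import ipaddress

FIELD_LENS = [4, 2]   # IPv4 address, then port, as in the NFTA_SET_FIELD_LEN example

def pack_key(addr, port):
    """Concatenate the two fields in network byte order."""
    return ipaddress.IPv4Address(addr).packed + port.to_bytes(2, "big")

def split_fields(key, field_lens):
    """Cut a concatenated key into its fields using the given lengths."""
    fields, offset = [], 0
    for length in field_lens:
        fields.append(key[offset:offset + length])
        offset += length
    return fields

def matches(key, key_start, key_end, field_lens):
    """A key matches only if every field lies within its own range."""
    triples = zip(
        split_fields(key, field_lens),
        split_fields(key_start, field_lens),
        split_fields(key_end, field_lens),
    )
    return all(lo <= field <= hi for field, lo, hi in triples)

key_start = pack_key("192.0.2.0", 22)    # NFTA_SET_ELEM_KEY
key_end = pack_key("192.0.2.42", 25)     # NFTA_SET_ELEM_KEY_END
```

Comparing the concatenated key as a single value would accept 192.0.2.10 . 80, because the address bytes already place it between the two endpoints; splitting by field length rejects it on the port field.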
     

09 Dec, 2019

1 commit

  • The existing rbtree implementation might store consecutive elements
    where the closing element and the opening element might overlap, e.g.

    [ a, a+1) [ a+1, a+2)

    This patch removes the optimization for non-anonymous sets in the exact
    matching case, which stopped the search as soon as the closing element
    was found. Instead, invalidate the candidate interval and keep looking
    further in the tree.

    Without this change, the lookup/get operation might return false even
    though there is a matching element in the rbtree. Moreover, the get
    operation returns true as if a+2 were in the tree. This happens with
    named sets after several set updates.

    The existing lookup optimization (that only works for anonymous sets)
    might not reach the opening [ a+1,... element if the closing ...,a+1)
    is found first when walking over the rbtree. Hence, walking the full
    tree in that case is needed.

    This patch fixes the lookup and get operations.

    Fixes: e701001e7cbe ("netfilter: nft_rbtree: allow adjacent intervals with dynamic updates")
    Fixes: ba0e4d9917b4 ("netfilter: nf_tables: get set elements via netlink")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
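
The adjacent-interval problem from the commit above can be reproduced with a flat list standing in for the order in which tree nodes happen to be visited. This is a sketch, not the kernel lookup: both functions are hypothetical, elements are (key, is_end) pairs, and the list order is an assumed walk order in which the closing element for a+1 is met before the opening one.

```python
def lookup_first_candidate(elements, key):
    """Keep the first element seen with the greatest key <= key, whatever its kind."""
    best = None
    for elem in elements:
        if elem[0] <= key and (best is None or elem[0] > best[0]):
            best = elem
    return best is not None and not best[1]

def lookup_prefer_start(elements, key):
    """Same search, but a start with an equal key replaces an end candidate."""
    best = None
    for elem in elements:
        if elem[0] > key:
            continue
        same_key_start = (
            best is not None and elem[0] == best[0] and best[1] and not elem[1]
        )
        if best is None or elem[0] > best[0] or same_key_start:
            best = elem
    return best is not None and not best[1]

# [10, 11) followed by [11, 12): key 11 is both a closing and an opening element.
walk = [(10, False), (11, True), (11, False), (12, True)]
```

For key 11 the first variant settles on the closing element and reports no match, while the second keeps looking and finds the opening element of the adjacent interval.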
     

13 Aug, 2019

1 commit

  • Sparse rightly complains about undeclared symbols.

    CHECK net/netfilter/nft_set_hash.c
    net/netfilter/nft_set_hash.c:647:21: warning: symbol 'nft_set_rhash_type' was not declared. Should it be static?
    net/netfilter/nft_set_hash.c:670:21: warning: symbol 'nft_set_hash_type' was not declared. Should it be static?
    net/netfilter/nft_set_hash.c:690:21: warning: symbol 'nft_set_hash_fast_type' was not declared. Should it be static?
    CHECK net/netfilter/nft_set_bitmap.c
    net/netfilter/nft_set_bitmap.c:296:21: warning: symbol 'nft_set_bitmap_type' was not declared. Should it be static?
    CHECK net/netfilter/nft_set_rbtree.c
    net/netfilter/nft_set_rbtree.c:470:21: warning: symbol 'nft_set_rbtree_type' was not declared. Should it be static?

    Include nf_tables_core.h rather than nf_tables.h to pick up the additional definitions.

    Signed-off-by: Valdis Kletnieks
    Signed-off-by: Pablo Neira Ayuso

    Valdis Klētnieks
     

19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

18 Mar, 2019

1 commit


11 Oct, 2018

1 commit


28 Sep, 2018

1 commit

  • nft_set_gc_batch_check() checks whether the gc buffer is full. If it
    is, the buffer is released internally by nft_set_gc_batch_complete().
    In the rbtree case, rb_erase() should be called before
    nft_set_gc_batch_complete(), and therefore it should also be called
    before nft_set_gc_batch_check().

    test commands:
    table ip filter {
    set set1 {
    type ipv4_addr; flags interval, timeout;
    gc-interval 10s;
    timeout 1s;
    elements = {
    1-2,
    3-4,
    5-6,
    ...
    10000-10001,
    }
    }
    }
    %nft -f test.nft

    splat looks like:
    [ 430.273885] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 430.282158] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 430.283116] CPU: 1 PID: 190 Comm: kworker/1:2 Tainted: G B 4.18.0+ #7
    [ 430.283116] Workqueue: events_power_efficient nft_rbtree_gc [nf_tables_set]
    [ 430.313559] RIP: 0010:rb_next+0x81/0x130
    [ 430.313559] Code: 08 49 bd 00 00 00 00 00 fc ff df 48 bb 00 00 00 00 00 fc ff df 48 85 c0 75 05 eb 58 48 89 d4
    [ 430.313559] RSP: 0018:ffff88010cdb7680 EFLAGS: 00010207
    [ 430.313559] RAX: 0000000000b84854 RBX: dffffc0000000000 RCX: ffffffff83f01973
    [ 430.313559] RDX: 000000000017090c RSI: 0000000000000008 RDI: 0000000000b84864
    [ 430.313559] RBP: ffff8801060d4588 R08: fffffbfff09bc349 R09: fffffbfff09bc349
    [ 430.313559] R10: 0000000000000001 R11: fffffbfff09bc348 R12: ffff880100f081a8
    [ 430.313559] R13: dffffc0000000000 R14: ffff880100ff8688 R15: dffffc0000000000
    [ 430.313559] FS: 0000000000000000(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
    [ 430.313559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 430.313559] CR2: 0000000001551008 CR3: 000000005dc16000 CR4: 00000000001006e0
    [ 430.313559] Call Trace:
    [ 430.313559] nft_rbtree_gc+0x112/0x5c0 [nf_tables_set]
    [ 430.313559] process_one_work+0xc13/0x1ec0
    [ 430.313559] ? _raw_spin_unlock_irq+0x29/0x40
    [ 430.313559] ? pwq_dec_nr_in_flight+0x3c0/0x3c0
    [ 430.313559] ? set_load_weight+0x270/0x270
    [ 430.313559] ? __switch_to_asm+0x34/0x70
    [ 430.313559] ? __switch_to_asm+0x40/0x70
    [ 430.313559] ? __switch_to_asm+0x34/0x70
    [ 430.313559] ? __switch_to_asm+0x34/0x70
    [ 430.313559] ? __switch_to_asm+0x40/0x70
    [ 430.313559] ? __switch_to_asm+0x34/0x70
    [ 430.313559] ? __switch_to_asm+0x40/0x70
    [ 430.313559] ? __switch_to_asm+0x34/0x70
    [ 430.313559] ? __switch_to_asm+0x34/0x70
    [ 430.313559] ? __switch_to_asm+0x40/0x70
    [ 430.313559] ? __switch_to_asm+0x34/0x70
    [ 430.313559] ? __schedule+0x6d3/0x1f50
    [ 430.313559] ? find_held_lock+0x39/0x1c0
    [ 430.313559] ? __sched_text_start+0x8/0x8
    [ 430.313559] ? cyc2ns_read_end+0x10/0x10
    [ 430.313559] ? save_trace+0x300/0x300
    [ 430.313559] ? sched_clock_local+0xd4/0x140
    [ 430.313559] ? find_held_lock+0x39/0x1c0
    [ 430.313559] ? worker_thread+0x353/0x1120
    [ 430.313559] ? worker_thread+0x353/0x1120
    [ 430.313559] ? lock_contended+0xe70/0xe70
    [ 430.313559] ? __lock_acquire+0x4500/0x4500
    [ 430.535635] ? do_raw_spin_unlock+0xa5/0x330
    [ 430.535635] ? do_raw_spin_trylock+0x101/0x1a0
    [ 430.535635] ? do_raw_spin_lock+0x1f0/0x1f0
    [ 430.535635] ? _raw_spin_lock_irq+0x10/0x70
    [ 430.535635] worker_thread+0x15d/0x1120
    [ ... ]

    Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support")
    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
     

17 Aug, 2018

1 commit

  • In order to determine the allocation size of a set, ->privsize is
    invoked. At this point, both desc->size and the size of each data
    structure of the set are used. desc->size is the number of elements
    given by the user. It is of type u32, so the upper limit for the number
    of set elements is 4294967295. However, the return type of ->privsize is
    also u32, hence an overflow can occur.

    test commands:
    %nft add table ip filter
    %nft add set ip filter hash1 { type ipv4_addr \; size 4294967295 \; }
    %nft list ruleset

    splat looks like:
    [ 1239.202910] kasan: CONFIG_KASAN_INLINE enabled
    [ 1239.208788] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 1239.217625] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 1239.219329] CPU: 0 PID: 1603 Comm: nft Not tainted 4.18.0-rc5+ #7
    [ 1239.229091] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
    [ 1239.229091] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
    [ 1239.229091] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
    [ 1239.229091] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
    [ 1239.229091] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
    [ 1239.229091] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
    [ 1239.229091] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
    [ 1239.229091] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
    [ 1239.229091] FS: 00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
    [ 1239.229091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1239.229091] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
    [ 1239.229091] Call Trace:
    [ 1239.229091] ? nft_hash_remove+0xf0/0xf0 [nf_tables_set]
    [ 1239.229091] ? memset+0x1f/0x40
    [ 1239.229091] ? __nla_reserve+0x9f/0xb0
    [ 1239.229091] ? memcpy+0x34/0x50
    [ 1239.229091] nf_tables_dump_set+0x9a1/0xda0 [nf_tables]
    [ 1239.229091] ? __kmalloc_reserve.isra.29+0x2e/0xa0
    [ 1239.229091] ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
    [ 1239.229091] ? nf_tables_commit+0x2c60/0x2c60 [nf_tables]
    [ 1239.229091] netlink_dump+0x470/0xa20
    [ 1239.229091] __netlink_dump_start+0x5ae/0x690
    [ 1239.229091] nft_netlink_dump_start_rcu+0xd1/0x160 [nf_tables]
    [ 1239.229091] nf_tables_getsetelem+0x2e5/0x4b0 [nf_tables]
    [ 1239.229091] ? nft_get_set_elem+0x440/0x440 [nf_tables]
    [ 1239.229091] ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
    [ 1239.229091] ? nf_tables_dump_obj_done+0x70/0x70 [nf_tables]
    [ 1239.229091] ? nla_parse+0xab/0x230
    [ 1239.229091] ? nft_get_set_elem+0x440/0x440 [nf_tables]
    [ 1239.229091] nfnetlink_rcv_msg+0x7f0/0xab0 [nfnetlink]
    [ 1239.229091] ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
    [ 1239.229091] ? debug_show_all_locks+0x290/0x290
    [ 1239.229091] ? sched_clock_cpu+0x132/0x170
    [ 1239.229091] ? find_held_lock+0x39/0x1b0
    [ 1239.229091] ? sched_clock_local+0x10d/0x130
    [ 1239.229091] netlink_rcv_skb+0x211/0x320
    [ 1239.229091] ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
    [ 1239.229091] ? netlink_ack+0x7b0/0x7b0
    [ 1239.229091] ? ns_capable_common+0x6e/0x110
    [ 1239.229091] nfnetlink_rcv+0x2d1/0x310 [nfnetlink]
    [ 1239.229091] ? nfnetlink_rcv_batch+0x10f0/0x10f0 [nfnetlink]
    [ 1239.229091] ? netlink_deliver_tap+0x829/0x930
    [ 1239.229091] ? lock_acquire+0x265/0x2e0
    [ 1239.229091] netlink_unicast+0x406/0x520
    [ 1239.509725] ? netlink_attachskb+0x5b0/0x5b0
    [ 1239.509725] ? find_held_lock+0x39/0x1b0
    [ 1239.509725] netlink_sendmsg+0x987/0xa20
    [ 1239.509725] ? netlink_unicast+0x520/0x520
    [ 1239.509725] ? _copy_from_user+0xa9/0xc0
    [ 1239.509725] __sys_sendto+0x21a/0x2c0
    [ 1239.509725] ? __ia32_sys_getpeername+0xa0/0xa0
    [ 1239.509725] ? retint_kernel+0x10/0x10
    [ 1239.509725] ? sched_clock_cpu+0x132/0x170
    [ 1239.509725] ? find_held_lock+0x39/0x1b0
    [ 1239.509725] ? lock_downgrade+0x540/0x540
    [ 1239.509725] ? up_read+0x1c/0x100
    [ 1239.509725] ? __do_page_fault+0x763/0x970
    [ 1239.509725] ? retint_user+0x18/0x18
    [ 1239.509725] __x64_sys_sendto+0x177/0x180
    [ 1239.509725] do_syscall_64+0xaa/0x360
    [ 1239.509725] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 1239.509725] RIP: 0033:0x7f5a8f468e03
    [ 1239.509725] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb d0 0f 1f 84 00 00 00 00 00 83 3d 49 c9 2b 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8
    [ 1239.509725] RSP: 002b:00007ffd78d0b778 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    [ 1239.509725] RAX: ffffffffffffffda RBX: 00007ffd78d0c890 RCX: 00007f5a8f468e03
    [ 1239.509725] RDX: 0000000000000034 RSI: 00007ffd78d0b7e0 RDI: 0000000000000003
    [ 1239.509725] RBP: 00007ffd78d0b7d0 R08: 00007f5a8f15c160 R09: 000000000000000c
    [ 1239.509725] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd78d0b7e0
    [ 1239.509725] R13: 0000000000000034 R14: 00007f5a8f9aff60 R15: 00005648040094b0
    [ 1239.509725] Modules linked in: nf_tables_set nf_tables nfnetlink ip_tables x_tables
    [ 1239.670713] ---[ end trace 39375adcda140f11 ]---
    [ 1239.676016] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
    [ 1239.682834] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
    [ 1239.705108] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
    [ 1239.711115] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
    [ 1239.719269] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
    [ 1239.727401] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
    [ 1239.735530] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
    [ 1239.743658] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
    [ 1239.751785] FS: 00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
    [ 1239.760993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1239.767560] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
    [ 1239.775679] Kernel panic - not syncing: Fatal exception
    [ 1239.776630] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [ 1239.776630] Rebooting in 5 seconds..

    Fixes: 20a69341f2d0 ("netfilter: nf_tables: add netlink set API")
    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
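
The arithmetic behind the overflow described in the commit above can be checked with a few lines of Python that emulate fixed-width integers by masking. This is only an illustration: the 16-byte header and 8-byte per-element size are made-up values, the function names are hypothetical, and the 64-bit variant merely shows that a wider result type does not truncate.

```python
U32_MASK = 0xFFFFFFFF
U64_MASK = 0xFFFFFFFFFFFFFFFF

def privsize_u32(nelems, elem_size, header_size):
    """Size computation whose result is truncated to 32 bits."""
    return (header_size + nelems * elem_size) & U32_MASK

def privsize_u64(nelems, elem_size, header_size):
    """Same computation with a 64-bit result."""
    return (header_size + nelems * elem_size) & U64_MASK

nelems = 4294967295                       # size given in the test commands
truncated = privsize_u32(nelems, 8, 16)   # wraps around to a tiny value
full = privsize_u64(nelems, 8, 16)
```

With these assumed sizes the 32-bit result is 8 bytes instead of roughly 32 GiB, which is how a far too small allocation can end up backing the set.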
     

18 Jul, 2018

1 commit

  • This patch fixes the following:
    1. Check for a NULL pointer returned by rb_next().
    rb_next() can return NULL, so a NULL check should be added.
    2. Add rcu_barrier() to the destroy routine.
    GC uses call_rcu() to remove elements, but all elements should be
    removed before destroying sets and chains, so rcu_barrier() is added.

    test script:
    %cat test.nft
    table inet aa {
    map map1 {
    type ipv4_addr : verdict; flags interval, timeout;
    elements = {
    0-1 : jump a0,
    3-4 : jump a0,
    6-7 : jump a0,
    9-10 : jump a0,
    12-13 : jump a0,
    15-16 : jump a0,
    18-19 : jump a0,
    21-22 : jump a0,
    24-25 : jump a0,
    27-28 : jump a0,
    }
    timeout 1s;
    }
    chain a0 {
    }
    }
    flush ruleset
    table inet aa {
    map map1 {
    type ipv4_addr : verdict; flags interval, timeout;
    elements = {
    0-1 : jump a0,
    3-4 : jump a0,
    6-7 : jump a0,
    9-10 : jump a0,
    12-13 : jump a0,
    15-16 : jump a0,
    18-19 : jump a0,
    21-22 : jump a0,
    24-25 : jump a0,
    27-28 : jump a0,
    }
    timeout 1s;
    }
    chain a0 {
    }
    }
    flush ruleset

    splat looks like:
    [ 2402.419838] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 2402.428433] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 2402.429343] CPU: 1 PID: 1350 Comm: kworker/1:1 Not tainted 4.18.0-rc2+ #1
    [ 2402.429343] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 03/23/2017
    [ 2402.429343] Workqueue: events_power_efficient nft_rbtree_gc [nft_set_rbtree]
    [ 2402.429343] RIP: 0010:rb_next+0x1e/0x130
    [ 2402.429343] Code: e9 de f2 ff ff 0f 1f 80 00 00 00 00 41 55 48 89 fa 41 54 55 53 48 c1 ea 03 48 b8 00 00 00 0
    [ 2402.429343] RSP: 0018:ffff880105f77678 EFLAGS: 00010296
    [ 2402.429343] RAX: dffffc0000000000 RBX: ffff8801143e3428 RCX: 1ffff1002287c69c
    [ 2402.429343] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
    [ 2402.429343] RBP: 0000000000000000 R08: ffffed0016aabc24 R09: ffffed0016aabc24
    [ 2402.429343] R10: 0000000000000001 R11: ffffed0016aabc23 R12: 0000000000000000
    [ 2402.429343] R13: ffff8800b6933388 R14: dffffc0000000000 R15: ffff8801143e3440
    [ 2402.534486] kasan: CONFIG_KASAN_INLINE enabled
    [ 2402.534212] FS: 0000000000000000(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
    [ 2402.534212] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 2402.534212] CR2: 0000000000863008 CR3: 00000000a3c16000 CR4: 00000000001006e0
    [ 2402.534212] Call Trace:
    [ 2402.534212] nft_rbtree_gc+0x2b5/0x5f0 [nft_set_rbtree]
    [ 2402.534212] process_one_work+0xc1b/0x1ee0
    [ 2402.540329] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 2402.534212] ? _raw_spin_unlock_irq+0x29/0x40
    [ 2402.534212] ? pwq_dec_nr_in_flight+0x3e0/0x3e0
    [ 2402.534212] ? set_load_weight+0x270/0x270
    [ 2402.534212] ? __schedule+0x6ea/0x1fb0
    [ 2402.534212] ? __sched_text_start+0x8/0x8
    [ 2402.534212] ? save_trace+0x320/0x320
    [ 2402.534212] ? sched_clock_local+0xe2/0x150
    [ 2402.534212] ? find_held_lock+0x39/0x1c0
    [ 2402.534212] ? worker_thread+0x35f/0x1150
    [ 2402.534212] ? lock_contended+0xe90/0xe90
    [ 2402.534212] ? __lock_acquire+0x4520/0x4520
    [ 2402.534212] ? do_raw_spin_unlock+0xb1/0x350
    [ 2402.534212] ? do_raw_spin_trylock+0x111/0x1b0
    [ 2402.534212] ? do_raw_spin_lock+0x1f0/0x1f0
    [ 2402.534212] worker_thread+0x169/0x1150

    Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support")
    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
     

07 Jul, 2018

1 commit


12 Jun, 2018

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS fixes for net

    The following patchset contains Netfilter/IPVS fixes for your net tree:

    1) Reject non-null terminated helper names from xt_CT, from Gao Feng.

    2) Fix KASAN splat due to out-of-bound access from commit phase, from
    Alexey Kodanev.

    3) Missing conntrack hook registration on IPVS FTP helper, from Julian
    Anastasov.

    4) Incorrect skbuff allocation size in bridge nft_reject, from Taehee Yoo.

    5) Fix inverted check on packet xmit to non-local addresses, also from
    Julian.

    6) Fix ebtables alignment compat problems, from Alin Nastac.

    7) Hook mask checks are not correct in xt_set, from Serhey Popovych.

    8) Fix timeout listing of element in ipsets, from Jozsef.

    9) Cap maximum timeout value in ipset, also from Jozsef.

    10) Don't allow family option for hash:mac sets, from Florent Fourcot.

    11) Restrict ebtables to work with NFPROTO_BRIDGE targets only, from
    Florian.

    12) Another bug reported by KASAN in the rbtree set backend, from
    Taehee Yoo.

    13) Missing __IPS_MAX_BIT update doesn't include IPS_OFFLOAD_BIT.
    From Gao Feng.

    14) Missing initialization of match/target in ebtables, from Florian
    Westphal.

    15) Remove useless nft_dup.h file in include path, from C. Labbe.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

07 Jun, 2018

1 commit

  • The 'this' parameter doesn't carry a flags value, so it can't be
    used by nft_rbtree_interval_end().

    test commands:
    %nft add table ip filter
    %nft add set ip filter s { type ipv4_addr \; flags interval \; }
    %nft add element ip filter s {0-1}
    %nft add element ip filter s {2-10}
    %nft add chain ip filter input { type filter hook input priority 0\; }
    %nft add rule ip filter input ip saddr @s

    Splat looks like:
    [ 246.752502] BUG: KASAN: slab-out-of-bounds in __nft_rbtree_lookup+0x677/0x6a0 [nft_set_rbtree]
    [ 246.752502] Read of size 1 at addr ffff88010d9efa47 by task http/1092

    [ 246.752502] CPU: 1 PID: 1092 Comm: http Not tainted 4.17.0-rc6+ #185
    [ 246.752502] Call Trace:
    [ 246.752502]
    [ 246.752502] dump_stack+0x74/0xbb
    [ 246.752502] ? __nft_rbtree_lookup+0x677/0x6a0 [nft_set_rbtree]
    [ 246.752502] print_address_description+0xc7/0x290
    [ 246.752502] ? __nft_rbtree_lookup+0x677/0x6a0 [nft_set_rbtree]
    [ 246.752502] kasan_report+0x22c/0x350
    [ 246.752502] __nft_rbtree_lookup+0x677/0x6a0 [nft_set_rbtree]
    [ 246.752502] nft_rbtree_lookup+0xc9/0x2d2 [nft_set_rbtree]
    [ 246.752502] ? sched_clock_cpu+0x144/0x180
    [ 246.752502] nft_lookup_eval+0x149/0x3a0 [nf_tables]
    [ 246.752502] ? __lock_acquire+0xcea/0x4ed0
    [ 246.752502] ? nft_lookup_init+0x6b0/0x6b0 [nf_tables]
    [ 246.752502] nft_do_chain+0x263/0xf50 [nf_tables]
    [ 246.752502] ? __nft_trace_packet+0x1a0/0x1a0 [nf_tables]
    [ 246.752502] ? sched_clock_cpu+0x144/0x180
    [ ... ]

    Fixes: f9121355eb6f ("netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups")
    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
     

23 May, 2018

1 commit


24 Apr, 2018

1 commit

  • Drop nft_set_type's ability to act as a container of multiple backend
    implementations it chooses from. Instead consolidate the whole selection
    logic in nft_select_set_ops() and the actual backend provided estimate()
    callback.

    This turns nf_tables_set_types into a list containing all available
    backends which is traversed when selecting one matching userspace
    requested criteria.

    Also, this change allows embedding the nft_set_ops structure into
    nft_set_type and pulling the flags field into the latter, as it's only
    used during the selection phase.

    A crucial part of this change is to make sure the new layout respects
    hash backend constraints formerly enforced by nft_hash_select_ops()
    function: This is achieved by introduction of a specific estimate()
    callback for nft_hash_fast_ops which returns false for key lengths != 4.
    In turn, nft_hash_estimate() is changed to return false for key lengths
    == 4 so it won't be chosen by accident. Also, both callbacks must return
    false for unbounded sets as their size estimate depends on a known
    maximum element count.
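
    The constraints described above can be sketched in plain userspace C.
    This is only an illustration under assumptions: struct demo_set_desc,
    struct demo_backend and all demo_* functions are hypothetical, the
    real estimate() callback takes more arguments, and picking the first
    accepting backend stands in for whatever comparison the real
    selection logic performs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set description: the real estimate() callback
 * takes more arguments than this sketch models. */
struct demo_set_desc {
	unsigned int klen;	/* key length in bytes */
	unsigned int size;	/* maximum element count, 0 if unbounded */
};

struct demo_backend {
	const char *name;
	bool (*estimate)(const struct demo_set_desc *desc);
};

/* Fast hash variant: bounded sets with 4-byte keys only. */
static bool demo_hash_fast_estimate(const struct demo_set_desc *desc)
{
	if (!desc->size)
		return false;	/* size estimate needs a known maximum */
	return desc->klen == 4;
}

/* Generic hash variant: bounded sets with any other key length. */
static bool demo_hash_estimate(const struct demo_set_desc *desc)
{
	if (!desc->size)
		return false;	/* size estimate needs a known maximum */
	return desc->klen != 4;
}

/* Flat list of all available backends. */
static const struct demo_backend demo_backends[] = {
	{ "hash_fast", demo_hash_fast_estimate },
	{ "hash", demo_hash_estimate },
};

/* Walk the list and return the first backend that accepts the
 * description, or NULL if none does. */
static const struct demo_backend *
demo_select_backend(const struct demo_set_desc *desc)
{
	size_t i;

	for (i = 0; i < sizeof(demo_backends) / sizeof(demo_backends[0]); i++) {
		if (demo_backends[i].estimate(desc))
			return &demo_backends[i];
	}
	return NULL;
}
```

    With this sketch, a bounded set with klen 4 selects the fast variant,
    a bounded set with any other klen selects the generic one, and an
    unbounded set is rejected by both.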

    Note that this patch partially reverts commit 4f2921ca21b71 ("netfilter:
    nf_tables: meter: pick a set backend that supports updates"):
    nft_set_ops_candidate() no longer looks explicitly for an update
    callback. Instead, NFT_SET_EVAL becomes a regular backend feature flag
    which is checked along with the others. This way all feature
    requirements are checked in one go.

    Signed-off-by: Phil Sutter
    Signed-off-by: Pablo Neira Ayuso

    Phil Sutter
     

07 Nov, 2017

1 commit


01 Aug, 2017

1 commit

  • Switch to lockless lookup. The write side now also increments a
    sequence counter. On lookup, sample the counter value and only take
    the lock if we did not find a match and the counter has changed.

    This avoids the need to write to the private area in the normal
    (lookup) case.

    In case we detect a writer (seqretry is true) we fall back to taking
    the readlock.
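
    The lookup scheme described above can be sketched in userspace C11.
    This is a sketch under loud assumptions, not the kernel code: a flat
    array stands in for the rbtree, an atomic_flag spinlock stands in for
    the read/write lock, and every demo_* name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_SET_MAX 64

/* Hypothetical stand-in for the set's private area. */
struct demo_seq_set {
	atomic_uint seq;		/* bumped before and after each write */
	atomic_flag lock;		/* stand-in for the real rwlock */
	atomic_int keys[DEMO_SET_MAX];	/* stand-in for the rbtree */
	atomic_uint count;
};

static void demo_set_init(struct demo_seq_set *s)
{
	atomic_init(&s->seq, 0);
	atomic_init(&s->count, 0);
	atomic_flag_clear(&s->lock);
}

static void demo_lock(struct demo_seq_set *s)
{
	while (atomic_flag_test_and_set(&s->lock))
		;	/* spin until the holder releases the lock */
}

static void demo_unlock(struct demo_seq_set *s)
{
	atomic_flag_clear(&s->lock);
}

/* Plain search, no locking and no counter handling. */
static bool demo_scan(struct demo_seq_set *s, int key)
{
	unsigned int i, n = atomic_load(&s->count);

	for (i = 0; i < n; i++) {
		if (atomic_load(&s->keys[i]) == key)
			return true;
	}
	return false;
}

/* Write side: take the lock and bump the counter around the update. */
static bool demo_set_insert(struct demo_seq_set *s, int key)
{
	unsigned int n;
	bool ok = false;

	demo_lock(s);
	atomic_fetch_add(&s->seq, 1);	/* odd: write in progress */
	n = atomic_load(&s->count);
	if (n < DEMO_SET_MAX) {
		atomic_store(&s->keys[n], key);
		atomic_store(&s->count, n + 1);
		ok = true;
	}
	atomic_fetch_add(&s->seq, 1);	/* even again: write done */
	demo_unlock(s);
	return ok;
}

/* Read side: lockless first, locked retry only on a miss that raced
 * with a writer. */
static bool demo_set_lookup(struct demo_seq_set *s, int key)
{
	unsigned int begin = atomic_load(&s->seq);
	bool found = demo_scan(s, key);

	if (found || (!(begin & 1) && atomic_load(&s->seq) == begin))
		return found;

	demo_lock(s);
	found = demo_scan(s, key);
	demo_unlock(s);
	return found;
}
```

    In the common case demo_set_lookup() only reads shared state, so
    concurrent lookups do not write to the set's private data at all.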

    The readlock is also used during dumps to ensure we get a consistent
    tree walk.

    Similar technique (rbtree+seqlock) was used by David Howells in rxrpc.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

30 Jun, 2017

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. This batch contains connection tracking updates for the cleanup
    iteration path, patches from Florian Westphal:

    X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
    dying bit to let the CPU release them.

    X) Add nf_ct_iterate_destroy() to be used on module removal, to kill
    conntracks from all namespaces.

    X) Restart iteration on hashtable resizing, since both may occur at
    the same time.

    X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
    mapping on module removal.

    X) Use nf_ct_iterate_destroy() to remove conntrack entries on helper
    module removal, from Liping Zhang.

    X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
    if user requests this, also from Liping.

    X) Add net_ns_barrier() and use it from the FTP helper, to make sure
    no namespace removal happens concurrently while the helper module is
    being removed.

    X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce
    module size. Same thing in nf_tables.

    Updates for the nf_tables infrastructure:

    X) Prepare usage of the extended ACK reporting infrastructure for
    nf_tables.

    X) Remove unnecessary forward declaration in nf_tables hash set.

    X) Skip set size estimation if number of elements is not specified.

    X) Changes to accommodate a (faster) unresizable hash set implementation,
    for anonymous sets and dynamic size fixed sets with no timeouts.

    X) Faster lookup function for unresizable hash table for 2 and 4
    bytes key.

    And, finally, a bunch of assorted small updates and cleanups:

    X) Do not hold a reference to the netdev from ipt_CLUSTER, instead
    subscribe to device events and look up the index from the packet path.
    This fixes an issue that has been present since the very beginning,
    patch from Xin Long.

    X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.

    X) Use ebt_invalid_target() whenever possible in the ebtables tree,
    from Gao Feng.

    X) Calm down compilation warning in nf_dup infrastructure, patch from
    Stephen Hemminger.

    X) Statify functions in nftables rt expression, also from Stephen.

    X) Update Makefile to use canonical method to specify nf_tables-objs.
    From Jike Song.

    X) Use nf_conntrack_helpers_register() in amanda and H323.

    X) Space cleanup for ctnetlink, from linzhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

29 May, 2017

3 commits


24 May, 2017

1 commit

  • The existing code selects no next branch to be inspected when
    re-inserting an inactive element into the rb-tree, looping endlessly.
    This patch restricts the check for active elements to the EEXIST case
    only.

    Fixes: e701001e7cbe ("netfilter: nft_rbtree: allow adjacent intervals with dynamic updates")
    Reported-by: Wolfgang Bumiller
    Tested-by: Wolfgang Bumiller
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

14 Mar, 2017

1 commit

  • Karel Rericha reported that in his test case, ICMP packets going through
    boxes normally had about 5ms latency. But when running nft to list the
    sets with interval flags, latency would go up to 30-100ms. This was
    observed with router throughput between 600Mbps and 2Gbps.

    This is because we use a single global spinlock to protect all rbtree
    sets, so "dumping sets" will inevitably race with "key lookup".
    But both are actually _readers_, so it's ok to convert the spinlock
    to an rwlock to avoid contention between them. Also use a per-set
    rwlock, since each set is independent.

    Reported-by: Karel Rericha
    Tested-by: Karel Rericha
    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     

03 Mar, 2017

1 commit


12 Feb, 2017

1 commit


08 Feb, 2017

4 commits


25 Jan, 2017

1 commit


07 Dec, 2016

1 commit

  • This patch adds support for set flushing, that consists of walking over
    the set elements if the NFTA_SET_ELEM_LIST_ELEMENTS attribute is set.
    This patch requires the following changes:

    1) Add set->ops->deactivate_one() operation: This allows us to
    deactivate an element from the set element walk path, given we can
    skip the lookup that happens in ->deactivate().

    2) Add a new nft_trans_alloc_gfp() function since we need to allocate
    transactions using GFP_ATOMIC, given the set walk path happens with
    rcu_read_lock held.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso