29 Aug, 2020

4 commits

  • We can delay refcount increment until we reassign the existing entry to
    the current skb.

    A 0 refcount can't happen while the nf_conn object is still in the
    hash table and parallel mutations are impossible because we hold the
    bucket lock.
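
    A rough sketch of the idea (illustrative only, not the actual diff; the
    helper name below is made up):

        /* Called during clash resolution with the hash bucket lock held.
         * The existing entry cannot drop to a 0 refcount while it is still
         * in the table, so the reference is taken only when the skb is
         * actually switched over to it.
         */
        static void switch_skb_to_existing(struct sk_buff *skb,
                                           struct nf_conn *loser_ct,
                                           struct nf_conn *ct,
                                           enum ip_conntrack_info ctinfo)
        {
                nf_ct_put(loser_ct);                /* drop the losing entry */
                nf_conntrack_get(&ct->ct_general);  /* increment only now */
                nf_ct_set(skb, ct, ctinfo);         /* reassign skb->_nfct */
        }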

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • There is a misconception about what "insert_failed" means.

    We increment this even when a clash got resolved, so it might not indicate
    a problem.

    Add a dedicated counter for clash resolution and only increment
    insert_failed if a clash cannot be resolved.

    For the old /proc interface, export this in place of an older stat
    that got removed a while back.
    For ctnetlink, export this with a new attribute.

    Also correct an outdated comment that implies we add a duplicate tuple --
    we only add the (unique) reply direction.
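
    A minimal sketch of the counting logic (the counter and helper names
    here are illustrative, not necessarily those used by the patch):

        ret = nf_ct_resolve_clash(skb, h, reply_hash);
        if (ret == NF_ACCEPT)
                NF_CT_STAT_INC(net, clash_resolve);  /* clash was resolved */
        else
                NF_CT_STAT_INC(net, insert_failed);  /* genuine failure */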

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • This counter increments when nf_conntrack_in sees a packet that already
    has a conntrack attached or when the packet is marked as UNTRACKED.
    Neither is an error.

    The former is normal for loopback traffic. The latter happens for
    certain ICMPv6 packets or when nftables/ip(6)tables notrack rules are
    in place.

    If someone needs to count UNTRACKED packets, or packets that are
    marked as untracked before conntrack_in, this can be done with both
    nftables and ip(6)tables rules.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The /proc interface for nf_conntrack displays the "error" counter as
    "icmp_error".

    It makes sense to not increment "invalid" when failing to handle an icmp
    packet since those are special.

    For example, it's possible for conntrack to see partial and/or fragmented
    packets inside icmp errors. This should be a separate event and not get
    mixed with the "invalid" counter.

    Likewise, remove the "error" increment for errors from get_l4proto().
    After this, the error counter will only increment for errors coming from
    icmp(v6) packet handling.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

11 Aug, 2020

1 commit

  • Pull locking updates from Thomas Gleixner:
    "A set of locking fixes and updates:

    - Untangle the header spaghetti which causes build failures in
    various situations caused by the lockdep additions to seqcount to
    validate that the write side critical sections are non-preemptible.

    - The seqcount associated lock debug addons which were blocked by the
    above fallout.

    seqcount writers, contrary to seqlock writers, must be externally
    serialized, which usually happens via locking - except for strict
    per-CPU seqcounts. As the lock is not part of the seqcount, lockdep
    cannot validate that the lock is held.

    This new debug mechanism adds the concept of associated locks.
    The sequence count now has lock type variants and corresponding
    initializers which take a pointer to the associated lock used for
    writer serialization. If lockdep is enabled the pointer is stored
    and write_seqcount_begin() has a lockdep assertion to validate that
    the lock is held.

    Aside from the type and the initializer no other code changes are
    required at the seqcount usage sites. The rest of the seqcount API
    is unchanged and determines the type at compile time with the help
    of _Generic which is possible now that the minimal GCC version has
    been moved up.

    Adding this lockdep coverage unearthed a handful of seqcount bugs
    which have been addressed already independent of this.

    While generally useful this comes with a Trojan Horse twist: On RT
    kernels the write side critical section can become preemptible if
    the writers are serialized by an associated lock, which leads to
    the well known reader preempts writer livelock. RT prevents this by
    storing the associated lock pointer independent of lockdep in the
    seqcount and changing the reader side to block on the lock when a
    reader detects that a writer is in the write side critical section.

    - Conversion of seqcount usage sites to associated types and
    initializers"

    * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    locking/seqlock, headers: Untangle the spaghetti monster
    locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
    x86/headers: Remove APIC headers from
    seqcount: More consistent seqprop names
    seqcount: Compress SEQCNT_LOCKNAME_ZERO()
    seqlock: Fold seqcount_LOCKNAME_init() definition
    seqlock: Fold seqcount_LOCKNAME_t definition
    seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
    hrtimer: Use sequence counter with associated raw spinlock
    kvm/eventfd: Use sequence counter with associated spinlock
    userfaultfd: Use sequence counter with associated spinlock
    NFSv4: Use sequence counter with associated spinlock
    iocost: Use sequence counter with associated spinlock
    raid5: Use sequence counter with associated spinlock
    vfs: Use sequence counter with associated spinlock
    timekeeping: Use sequence counter with associated raw spinlock
    xfrm: policy: Use sequence counters with associated lock
    netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
    netfilter: conntrack: Use sequence counter with associated spinlock
    sched: tasks: Use sequence counter with associated spinlock
    ...
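
    A small illustration of the associated-lock API described above (struct
    and field names are made up):

        struct sample_stats {
                spinlock_t          lock;
                seqcount_spinlock_t seq;   /* writers serialized by ->lock */
                u64                 packets;
        };

        static void sample_stats_init(struct sample_stats *s)
        {
                spin_lock_init(&s->lock);
                seqcount_spinlock_init(&s->seq, &s->lock);
        }

        static void sample_stats_update(struct sample_stats *s)
        {
                spin_lock(&s->lock);
                write_seqcount_begin(&s->seq);  /* lockdep checks ->lock */
                s->packets++;
                write_seqcount_end(&s->seq);
                spin_unlock(&s->lock);
        }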

    Linus Torvalds
     

05 Aug, 2020

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Flush the cleanup xtables worker to make sure destructors
    have completed, from Florian Westphal.

    2) iifgroup is matching erroneously, also from Florian.

    3) Add selftest for meta interface matching, from Florian Westphal.

    4) Move nf_ct_offload_timeout() to header, from Roi Dayan.

    5) Call nf_ct_offload_timeout() from flow_offload_add() to
    make sure garbage collection does not evict offloaded flow,
    from Roi Dayan.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Aug, 2020

1 commit


29 Jul, 2020

1 commit

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_spinlock_t data type, which allows to associate a
    spinlock with the sequence counter. This enables lockdep to verify that
    the spinlock used for writer serialization is held when the write side
    critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.
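
    The read side is unchanged by such a conversion; as a small illustration
    (names made up), a reader still just retries until the sequence is
    stable:

        static u64 sample_read(seqcount_spinlock_t *seq, const u64 *val)
        {
                unsigned int start;
                u64 v;

                do {
                        start = read_seqcount_begin(seq);
                        v = *val;
                } while (read_seqcount_retry(seq, start));

                return v;
        }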

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200720155530.1173732-15-a.darwish@linutronix.de

    Ahmed S. Darwish
     

14 Jul, 2020

1 commit


03 Jul, 2020

1 commit

  • __nf_conntrack_update() might change the conntrack object that is
    attached to the skbuff, so the attached conntrack has to be re-fetched
    after the update. Otherwise, this triggers a UAF:

    [ 633.200434] ==================================================================
    [ 633.200472] BUG: KASAN: use-after-free in nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200478] Read of size 1 at addr ffff888370804c00 by task nfqnl_test/6769

    [ 633.200487] CPU: 1 PID: 6769 Comm: nfqnl_test Not tainted 5.8.0-rc2+ #388
    [ 633.200490] Hardware name: LENOVO 23259H1/23259H1, BIOS G2ET32WW (1.12 ) 05/30/2012
    [ 633.200491] Call Trace:
    [ 633.200499] dump_stack+0x7c/0xb0
    [ 633.200526] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200532] print_address_description.constprop.6+0x1a/0x200
    [ 633.200539] ? _raw_write_lock_irqsave+0xc0/0xc0
    [ 633.200568] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200594] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200598] kasan_report.cold.9+0x1f/0x42
    [ 633.200604] ? call_rcu+0x2c0/0x390
    [ 633.200633] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200659] nf_conntrack_update+0x34e/0x770 [nf_conntrack]
    [ 633.200687] ? nf_conntrack_find_get+0x30/0x30 [nf_conntrack]
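
    A simplified sketch of the idea behind the fix (not the exact code): the
    conntrack attached to the skb is re-fetched after the update instead of
    reusing a possibly stale pointer.

        /* __nf_conntrack_update() may have replaced skb->_nfct, so re-read
         * the attached conntrack before using it any further.
         */
        ct = nf_ct_get(skb, &ctinfo);
        if (!ct)
                return NF_ACCEPT;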

    Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1436
    Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

02 Jun, 2020

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for net-next
    to extend ctnetlink and the flowtable infrastructure:

    1) Extend ctnetlink kernel side netlink dump filtering capabilities,
    from Romain Bellan.

    2) Generalise the flowtable hook parser to take a hook list.

    3) Pass a hook list to the flowtable hook registration/unregistration.

    4) Add a helper function to release the flowtable hook list.

    5) Update the flowtable event notifier to pass a flowtable hook list.

    6) Allow users to add new devices to an existing flowtable.

    7) Allow users to remove devices from an existing flowtable.

    8) Allow for registering a flowtable with no initial devices.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

28 May, 2020

1 commit

  • Conntrack dump does not support kernel-side filtering (only a "get"
    exists, but it returns a single entry and the user has to supply a
    full, valid tuple).

    This means that userspace has to filter out many irrelevant entries
    after receiving them, which wastes resources (the conntrack table is
    sometimes huge, much larger than a routing table, for example).

    This patch adds filtering on the kernel side. To achieve this goal, we:

    * Add a new CTA_FILTER netlink attribute, actually a flag list to
    parametrize filtering
    * Convert some *nlattr_to_tuple() functions, to allow a partial parsing
    of CTA_TUPLE_ORIG and CTA_TUPLE_REPLY (so the nf_conntrack_tuple is not
    fully set)

    Filtering is now possible on:
    * IP SRC/DST values
    * Ports for TCP and UDP flows
    * ICMP(v6) codes, types and IDs

    Filtering works as a logical "AND": for example, when the flags
    PROTO_SRC_PORT, PROTO_NUM and IP_SRC are set, only entries matching all
    three values are dumped.
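
    A sketch of the "AND" semantics on the kernel side (struct, flag and
    helper names below are illustrative, not the actual implementation):

        static bool example_filter_match(const struct nf_conn *ct,
                                         const struct example_filter *f)
        {
                const struct nf_conntrack_tuple *t =
                        &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;

                if ((f->flags & EXAMPLE_FILTER_IP_SRC) &&
                    !nf_inet_addr_cmp(&f->ip_src, &t->src.u3))
                        return false;

                if ((f->flags & EXAMPLE_FILTER_PROTO_SRC_PORT) &&
                    f->src_port != t->src.u.all)
                        return false;

                return true;    /* every requested field matched */
        }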

    Changes since v1:
    Set NLM_F_DUMP_FILTERED in nlm flags if entries are filtered

    Changes since v2:
    Move several constants to nf_internals.h
    Move a fix on netlink values check in a separate patch
    Add a check on not-supported flags
    Return EOPNOTSUPP if CTA_FILTER is set in ctnetlink_flush_conntrack
    (not yet implemented)
    Code style issues

    Changes since v3:
    Fix compilation warning reported by kbuild test robot

    Changes since v4:
    Fix a regression introduced in v3 (returned EINVAL for valid netlink
    messages without CTA_MARK)

    Changes since v5:
    Change definition of CTA_FILTER_F_ALL
    Fix a regression when CTA_TUPLE_ZONE is not set

    Signed-off-by: Romain Bellan
    Signed-off-by: Florent Fourcot
    Signed-off-by: Pablo Neira Ayuso

    Romain Bellan
     

27 May, 2020

2 commits

  • net/netfilter/nf_conntrack_core.c: In function nf_confirm_cthelper:
    net/netfilter/nf_conntrack_core.c:2117:15: warning: comparison of unsigned expression in < 0 is always false [-Wtype-limits]
    2117 | if (protoff < 0 || (frag_off & htons(~0x7)) != 0)
    | ^

    ipv6_skip_exthdr() returns a signed integer.
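
    A sketch of the fix idea (context trimmed): the return value of
    ipv6_skip_exthdr() must be stored in a signed variable for the error
    check to be meaningful.

        __be16 frag_off;
        u8 nexthdr = ipv6_hdr(skb)->nexthdr;
        int protoff;    /* signed: ipv6_skip_exthdr() can return -1 */

        protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr),
                                   &nexthdr, &frag_off);
        if (protoff < 0 || (frag_off & htons(~0x7)) != 0)
                return;    /* unsupported or fragmented extension headers */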

    Reported-by: Colin Ian King
    Fixes: 703acd70f249 ("netfilter: nfnetlink_cthelper: unbreak userspace helper support")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Clang warns:

    net/netfilter/nf_conntrack_core.c:2068:21: warning: variable 'ctinfo' is
    uninitialized when used here [-Wuninitialized]
    nf_ct_set(skb, ct, ctinfo);
    ^~~~~~
    net/netfilter/nf_conntrack_core.c:2024:2: note: variable 'ctinfo' is
    declared here
    enum ip_conntrack_info ctinfo;
    ^
    1 warning generated.

    nf_conntrack_update was split up into nf_conntrack_update and
    __nf_conntrack_update, where the assignment of ctinfo is in
    nf_conntrack_update but it is used in __nf_conntrack_update.

    Pass the value of ctinfo from nf_conntrack_update to
    __nf_conntrack_update so that uninitialized memory is not used
    and everything works properly.
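
    In sketch form (function and parameter names simplified), the caller now
    resolves ct/ctinfo once and passes ctinfo down:

        static int __example_update(struct net *net, struct sk_buff *skb,
                                    struct nf_conn *ct,
                                    enum ip_conntrack_info ctinfo)
        {
                /* ... may detach and re-attach the conntrack ... */
                nf_ct_set(skb, ct, ctinfo);
                return 0;
        }

        static int example_update(struct net *net, struct sk_buff *skb)
        {
                enum ip_conntrack_info ctinfo;
                struct nf_conn *ct = nf_ct_get(skb, &ctinfo);

                if (!ct)
                        return 0;

                return __example_update(net, skb, ct, ctinfo);
        }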

    Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again")
    Link: https://github.com/ClangBuiltLinux/linux/issues/1039
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Pablo Neira Ayuso

    Nathan Chancellor
     

26 May, 2020

1 commit

  • Florian Westphal says:

    "Problem is that after the helper hook was merged back into the confirm
    one, the queueing itself occurs from the confirm hook, i.e. we queue
    from the last netfilter callback in the hook-list.

    Therefore, on return, the packet bypasses the confirm action and the
    connection is never committed to the main conntrack table.

    To fix this there are several ways:
    1. revert the 'Fixes' commit and have an extra helper hook again.
    Works, but has the drawback of adding another indirect call for
    everyone.

    2. Special case this: split the hooks only when userspace helper
    gets added, so queueing occurs at a lower priority again,
    and normal enqueue reinject would eventually call the last hook.

    3. Extend the existing nf_queue ct update hook to allow a forced
    confirmation (plus run the seqadj code).

    This patch goes with option 3)."

    Fixes: 827318feb69cb ("netfilter: conntrack: remove helper hook again")
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

11 May, 2020

2 commits

  • 'rmmod nf_conntrack' can hang forever, because the netns exit
    gets stuck in nf_conntrack_cleanup_net_list():

    i_see_dead_people:
            busy = 0;
            list_for_each_entry(net, net_exit_list, exit_list) {
                    nf_ct_iterate_cleanup(kill_all, net, 0, 0);
                    if (atomic_read(&net->ct.count) != 0)
                            busy = 1;
            }
            if (busy) {
                    schedule();
                    goto i_see_dead_people;
            }

    When nf_ct_iterate_cleanup iterates the conntrack table, all nf_conn
    structures can be found twice:
    once for the original tuple and once for the conntrack's reply tuple.

    get_next_corpse() only calls the iterator when the entry is
    in original direction -- the idea was to avoid unneeded invocations
    of the iterator callback.

    When support for clashing entries was added, the assumption that all
    nf_conn objects are added twice (once for the original, once for the
    reply tuple) no longer holds: NF_CLASH_BIT entries are only added in
    the non-clashing reply direction.

    Thus, if at least one NF_CLASH entry is in the list then
    nf_conntrack_cleanup_net_list() always skips it completely.

    During normal netns destruction, this causes a hang of several
    seconds, until the gc worker removes the entry (NF_CLASH entries
    always have a 1 second timeout).

    But in the rmmod case, the gc worker has already been stopped, so
    ct.count never becomes 0.

    We can fix this in two ways:

    1. Add a second test for CLASH_BIT and call iterator for those
    entries as well, or:
    2. Skip the original tuple direction and use the reply tuple.

    2) is simpler, so do that.
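
    A sketch of option 2) (simplified; locking and restart handling are
    omitted):

        hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[bucket], hnnode) {
                /* Visit each conntrack exactly once, via its reply tuple.
                 * NF_CLASH entries only exist in the reply direction, so
                 * they are no longer skipped.
                 */
                if (NF_CT_DIRECTION(h) != IP_CT_DIR_REPLY)
                        continue;

                ct = nf_ct_tuplehash_to_ctrack(h);
                if (iter(ct, data))
                        return ct;      /* caller kills this entry */
        }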

    Fixes: 6a757c07e51f80ac ("netfilter: conntrack: allow insertion of clashing entries")
    Reported-by: Chen Yi
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • gcc-10 warns about a suspicious access to a zero-length array member:

    net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_alloc':
    net/netfilter/nf_conntrack_core.c:1522:9: warning: array subscript 0 is outside the bounds of an interior zero-length array 'u8[0]' {aka 'unsigned char[0]'} [-Wzero-length-bounds]
    1522 | memset(&ct->__nfct_init_offset[0], 0,
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~
    In file included from net/netfilter/nf_conntrack_core.c:37:
    include/net/netfilter/nf_conntrack.h:90:5: note: while referencing '__nfct_init_offset'
    90 | u8 __nfct_init_offset[0];
    | ^~~~~~~~~~~~~~~~~~

    The code is correct but a bit unusual. Rework it slightly in a way that
    does not trigger the warning, using an empty struct instead of an empty
    array. There are probably more elegant ways to do this, but this is the
    smallest change.
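
    The technique, in trimmed and illustrative form (the real struct has many
    more fields):

        struct example_conn {
                /* fields that must survive allocation (refcount, lock, ...) */
                spinlock_t lock;

                /* everything below here is zeroed on allocation */
                struct { } init_offset;

                u32 mark;
                u32 timeout;
        };

        static void example_zero_tail(struct example_conn *c)
        {
                memset(&c->init_offset, 0,
                       sizeof(*c) - offsetof(struct example_conn, init_offset));
        }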

    Fixes: c41884ce0562 ("netfilter: conntrack: avoid zeroing timer")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Pablo Neira Ayuso

    Arnd Bergmann
     

30 Mar, 2020

1 commit


28 Mar, 2020

2 commits


15 Mar, 2020

1 commit

  • TEMPLATE_NULLS_VAL is not used after commit 0838aa7fcfcd
    ("netfilter: fix netns dependencies with conntrack templates")

    PFX is not used after commit 8bee4bad03c5b ("netfilter: xt
    extensions: use pr_")

    Signed-off-by: Li RongQing
    Signed-off-by: Pablo Neira Ayuso

    Li RongQing
     

17 Feb, 2020

1 commit

  • This patch further relaxes the need to drop an skb due to a clash with
    an existing conntrack entry.

    Current clash resolution handles the case where the clash occurs between
    two identical entries (distinct nf_conn objects with same tuples), i.e.:


                   Original                             Reply
    existing: 10.2.3.4:42 -> 10.8.8.8:53       10.8.8.8:53 -> 10.2.3.4:42
    clashing: 10.2.3.4:42 -> 10.8.8.8:53       10.8.8.8:53 -> 10.2.3.4:42

    In this case the clashing entry is discarded and skb->_nfct is made to
    point to the existing one. The skb can then be processed normally, just
    as if the clash had not existed in the first place.

    For other clashes, the skb needs to be dropped.
    This frequently happens with DNS resolvers that send A and AAAA queries
    back-to-back when NAT rules are present that cause packets to get
    different DNAT transformations applied, for example:

    -m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353
    -m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353

    In this case the A or AAAA query is dropped which incurs a costly
    delay during name resolution.

    This patch also allows this collision type:

                   Original                             Reply
    existing: 10.2.3.4:42 -> 10.8.8.8:53       10.0.0.6:5353 -> 10.2.3.4:42
    clashing: 10.2.3.4:42 -> 10.8.8.8:53       10.0.0.7:5353 -> 10.2.3.4:42

    Here the clash exists only in the original direction; the reply tuples
    are still unique, so the clashing entry can be inserted into the table.

    Example sequence:

    1. 10.2.3.4:42 -> 10.8.8.8:53 (A)
    2. 10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
    3. Apply DNAT to the A query, reply changed to 10.0.0.6
    4. Apply DNAT to the AAAA query, reply changed to 10.0.0.7
    5. Confirm/commit the A entry to the conntrack table, no collisions
    6. Commit the clashing AAAA entry

    Reply comes in:

    10.0.0.6:5353 -> 10.2.3.4:42 (A): finds a conntrack, DNAT is reversed &
    the packet is forwarded to 10.2.3.4:42.
    10.0.0.7:5353 -> 10.2.3.4:42 (AAAA): finds a conntrack, DNAT is reversed &
    the packet is forwarded to 10.2.3.4:42. The conntrack entry is then
    deleted from the table, as it has the NAT_CLASH bit set.

    In case of a retransmit from ORIGINAL dir, all further packets will get
    the DNAT transformation to 10.0.0.6.

    I tried to come up with other solutions but they all have worse
    problems.

    Alternatives considered were:
    1. Confirm ct entries at allocation time, not in postrouting.
    a. will cause unnecessary work when the skb that creates the
    conntrack is dropped by ruleset.
    b. in case nat is applied, ct entry would need to be moved in
    the table, which requires another spinlock pair to be taken.
    c. breaks the 'unconfirmed entry is private to cpu' assumption:
    we would need to guard all nfct->ext allocation requests with
    ct->lock spinlock.

    2. Make the unconfirmed list a hash table instead of a pcpu list.
    Shares drawback c) of the first alternative.

    3. Document this is expected and force users to rearrange their
    ruleset (e.g. by using "-m cluster" instead of "-m statistics").
    nft has the 'jhash' expression which can be used instead of 'numgen'.

    Major drawback: doesn't fix what I consider a bug, is not very
    realistic, and I believe it's reasonable to have existing rulesets
    'just work'.

    4. Document this is expected and force users to steer problematic
    packets to the same CPU -- this would serialize the "allocate new
    conntrack entry/nat table evaluation/perform nat/confirm entry", so
    no race can occur. Similar drawback to 3.

    Another advantage of this patch compared to 1) and 2) is that there are
    no changes to the hot path; things are handled in the udp tracker and
    the clash resolution path.
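
    A very rough sketch of the new clash handling (the helper name is made
    up; protocol checks and locking are omitted):

        /* The clash is only in the original direction: if the reply tuple is
         * still unique, insert only that direction and flag the entry with a
         * short fixed timeout.
         */
        if (reply_tuple_is_unique(net, loser_ct)) {
                loser_ct->status |= IPS_FIXED_TIMEOUT | IPS_NAT_CLASH;
                loser_ct->timeout = nfct_time_stamp + HZ;
                hlist_nulls_add_head_rcu(
                        &loser_ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
                        &nf_conntrack_hash[reply_hash]);
                return NF_ACCEPT;
        }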

    Cc: rcu@vger.kernel.org
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Jozsef Kadlecsik
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

11 Feb, 2020

3 commits


01 Feb, 2020

1 commit


31 Dec, 2019

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for net-next:

    1) Remove #ifdef pollution around nf_ingress(), from Lukas Wunner.

    2) Document ingress hook in netdevice, also from Lukas.

    3) Remove htons() in tunnel metadata port netlink attributes,
    from Xin Long.

    4) Missing erspan netlink attribute validation also from Xin Long.

    5) Missing erspan version in tunnel, from Xin Long.

    6) Missing attribute nest in NFTA_TUNNEL_KEY_OPTS_{VXLAN,ERSPAN}
    Patch from Xin Long.

    7) Missing nla_nest_cancel() in tunnel netlink dump path,
    from Xin Long.

    8) Remove two exported conntrack symbols with no clients,
    from Florian Westphal.

    9) Add nft_meta_get_eval_time() helper to nft_meta, from Florian.

    10) Add nft_meta_pkttype helper for loopback, also from Florian.

    11) Add nft_meta_socket uid helper, from Florian Westphal.

    12) Add nft_meta_cgroup helper, from Florian.

    13) Add nft_meta_ifkind helper, from Florian.

    14) Group all interface related meta selector, from Florian.

    15) Add nft_prandom_u32() helper, from Florian.

    16) Add nft_meta_rtclassid helper, from Florian.

    17) Add support for matching on the slave device index,
    from Florian.

    This batch, among other things, contains updates for the netfilter
    tunnel netlink interface: this extension is still incomplete and lacking
    proper userspace support, which is actually my fault; I did not find the
    time to go back and finish it. This update breaks the tunnel UAPI in some
    aspects in order to fix it, but better to do it sooner than never.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Dec, 2019

1 commit


01 Dec, 2019

1 commit


27 Oct, 2019

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter/IPVS updates for net-next,
    more specifically:

    * Updates for ipset:

    1) Coding style fix for ipset comment extension, from Jeremy Sowden.

    2) De-inline many functions in ipset, from Jeremy Sowden.

    3) Move ipset function definition from header to source file.

    4) Move ip_set_put_flags() to source, export it as a symbol, remove
    inline.

    5) Move range_to_mask() to the source file where this is used.

    6) Move ip_set_get_ip_port() to the source file where this is used.

    * IPVS selftests and netns improvements:

    7) Two patches to speed up ipvs netns dismantle, from Haishuang Yan.

    8) Three patches to add selftest script for ipvs, also from
    Haishuang Yan.

    * Conntrack updates and new nf_hook_slow_list() function:

    9) Document ct ecache extension, from Florian Westphal.

    10) Skip ct extensions from ctnetlink dump, from Florian.

    11) Free ct extension immediately, from Florian.

    12) Skip access to ecache extension from nf_ct_deliver_cached_events()
    this is not correct as reported by Syzbot.

    13) Add and use nf_hook_slow_list(), from Florian.

    * Flowtable infrastructure updates:

    14) Move priority to nf_flowtable definition.

    15) Dynamic allocation of per-device hooks in flowtables.

    16) Allow to include netdevice only once in flowtable definitions.

    17) Rise maximum number of devices per flowtable.

    * Netfilter hardware offload infrastructure updates:

    18) Add nft_flow_block_chain() helper function.

    19) Pass callback list to nft_setup_cb_call().

    20) Add nft_flow_cls_offload_setup() helper function.

    21) Remove rules for the unregistered device via netdevice event.

    22) Support for multiple devices in a basechain definition at the
    ingress hook.

    22) Add nft_chain_offload_cmd() helper function.

    23) Add nft_flow_block_offload_init() helper function.

    24) Rewind in case of failing to bind multiple devices to hook.

    25) Typo in IPv6 tproxy module description, from Norman Rasmussen.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Oct, 2019

1 commit

  • Instead of waiting for rcu grace period just free it directly.

    This is safe because conntrack lookup doesn't consider extensions.

    Other accesses happen while ct->ext can't be free'd, either because
    a ct refcount was taken or because the conntrack hash bucket lock or
    the dying list spinlock have been taken.

    This allows __krealloc to be removed in a follow-up patch; netfilter
    was the only user.
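
    In sketch form, the release simply becomes a synchronous kfree() (helper
    name illustrative):

        static void example_free_ext(struct nf_conn *ct)
        {
                kfree(ct->ext);   /* no call_rcu()/grace period needed */
                ct->ext = NULL;
        }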

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

10 Oct, 2019

1 commit

  • As hinted by KCSAN, we need at least one READ_ONCE()
    to prevent a compiler optimization.

    More details at:
    https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE#it-may-improve-performance
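
    The pattern, in reduced form (simplified from the report below):

        static void example_refresh(struct nf_conn *ct, u32 new_timeout)
        {
                /* Read the timeout once; publish a new value with
                 * WRITE_ONCE() so concurrent readers and writers do not
                 * race on a plain load/store.
                 */
                u32 cur = READ_ONCE(ct->timeout);

                if (cur != new_timeout)
                        WRITE_ONCE(ct->timeout, new_timeout);
        }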

    syzbot report:
    BUG: KCSAN: data-race in __nf_ct_refresh_acct / __nf_ct_refresh_acct

    read to 0xffff888123eb4f08 of 4 bytes by interrupt on cpu 0:
    __nf_ct_refresh_acct+0xd4/0x1b0 net/netfilter/nf_conntrack_core.c:1796
    nf_ct_refresh_acct include/net/netfilter/nf_conntrack.h:201 [inline]
    nf_conntrack_tcp_packet+0xd40/0x3390 net/netfilter/nf_conntrack_proto_tcp.c:1161
    nf_conntrack_handle_packet net/netfilter/nf_conntrack_core.c:1633 [inline]
    nf_conntrack_in+0x410/0xaa0 net/netfilter/nf_conntrack_core.c:1727
    ipv4_conntrack_in+0x27/0x40 net/netfilter/nf_conntrack_proto.c:178
    nf_hook_entry_hookfn include/linux/netfilter.h:135 [inline]
    nf_hook_slow+0x83/0x160 net/netfilter/core.c:512
    nf_hook include/linux/netfilter.h:260 [inline]
    NF_HOOK include/linux/netfilter.h:303 [inline]
    ip_rcv+0x12f/0x1a0 net/ipv4/ip_input.c:523
    __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
    __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
    netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
    napi_skb_finish net/core/dev.c:5671 [inline]
    napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
    receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
    virtnet_receive drivers/net/virtio_net.c:1323 [inline]
    virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
    napi_poll net/core/dev.c:6352 [inline]
    net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
    __do_softirq+0x115/0x33f kernel/softirq.c:292

    write to 0xffff888123eb4f08 of 4 bytes by task 7191 on cpu 1:
    __nf_ct_refresh_acct+0xfb/0x1b0 net/netfilter/nf_conntrack_core.c:1797
    nf_ct_refresh_acct include/net/netfilter/nf_conntrack.h:201 [inline]
    nf_conntrack_tcp_packet+0xd40/0x3390 net/netfilter/nf_conntrack_proto_tcp.c:1161
    nf_conntrack_handle_packet net/netfilter/nf_conntrack_core.c:1633 [inline]
    nf_conntrack_in+0x410/0xaa0 net/netfilter/nf_conntrack_core.c:1727
    ipv4_conntrack_local+0xbe/0x130 net/netfilter/nf_conntrack_proto.c:200
    nf_hook_entry_hookfn include/linux/netfilter.h:135 [inline]
    nf_hook_slow+0x83/0x160 net/netfilter/core.c:512
    nf_hook include/linux/netfilter.h:260 [inline]
    __ip_local_out+0x1f7/0x2b0 net/ipv4/ip_output.c:114
    ip_local_out+0x31/0x90 net/ipv4/ip_output.c:123
    __ip_queue_xmit+0x3a8/0xa40 net/ipv4/ip_output.c:532
    ip_queue_xmit+0x45/0x60 include/net/ip.h:236
    __tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
    __tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
    tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
    tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 7191 Comm: syz-fuzzer Not tainted 5.3.0+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: cc16921351d8 ("netfilter: conntrack: avoid same-timeout update")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Cc: Jozsef Kadlecsik
    Cc: Florian Westphal
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     

28 Aug, 2019

1 commit

  • When a spinlock is locked/unlocked, its elements change, so marking
    it as __read_mostly is not suitable.

    Also remove a duplicate definition of nf_conntrack_locks_all_lock;
    strangely, the compiler does not complain about it.

    Signed-off-by: Li RongQing
    Signed-off-by: Pablo Neira Ayuso

    Li RongQing
     

14 Aug, 2019

1 commit

  • Change ct id hash calculation to only use invariants.

    Currently the ct id hash calculation is based on some fields that can
    change over the lifetime of a conntrack entry in some corner cases. The
    current hash uses the whole tuple, which contains an hlist pointer that
    changes when the conntrack is placed on the dying list, resulting in a
    ct id change.

    This patch also removes the reply-side tuple and extension pointer from
    the hash calculation so that the ct id will not change from
    initialization until confirmation.
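
    A sketch of the idea (the exact fields and helpers differ in the real
    patch): derive the id only from values that are stable from allocation
    to confirmation, mixed with a random seed.

        static u32 example_ct_id(const struct nf_conn *ct)
        {
                static siphash_key_t seed __read_mostly;

                net_get_random_once(&seed, sizeof(seed));

                return (u32)siphash_3u64((unsigned long)ct,
                                         (unsigned long)ct->master,
                                         (unsigned long)nf_ct_net(ct),
                                         &seed);
        }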

    Fixes: 3c79107631db1f7 ("netfilter: ctnetlink: don't use conntrack/expect object addresses as id")
    Signed-off-by: Dirk Morris
    Acked-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Dirk Morris
     

16 Jul, 2019

1 commit

  • In 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") the new
    generic nf_conntrack was introduced, and it came to supersede the old
    ip_conntrack.

    This change updates some of the obsolete comments referring to old
    file/function names of the ip_conntrack mechanism, as well as removes a
    few self-referencing comments that we shouldn't maintain anymore.

    I did not update any comments referring to historical actions (e.g.,
    comments like "this file was derived from ..." were left untouched, even
    if the referenced file is no longer here).

    Signed-off-by: Yonatan Goldschmidt
    Signed-off-by: Pablo Neira Ayuso

    Yonatan Goldschmidt
     

25 Jun, 2019

1 commit


19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

17 Jun, 2019

1 commit

  • ____nf_conntrack_find() performs checks on the conntrack objects in
    this order:

    1. if (nf_ct_is_expired(ct))

    This fetches ct->timeout, in third cache line.

    The hnnode that is used to store the list pointers resides in the first
    (origin) or second (reply tuple) cache lines.

    This test rarely passes, but it's necessary to reap obsolete entries.

    2. if (nf_ct_is_dying(ct))

    This fetches ct->status, also in third cache line.

    The test is useless, and can be removed:
    Consider:

    cpu0                                   cpu1
    ct = ____nf_conntrack_find()
    atomic_inc_not_zero(ct) -> ok
    nf_ct_key_equal -> ok
    is_dying -> DYING bit not set, ok
                                           set_bit(ct, DYING);
                                           ... unhash ... etc.
    return ct
    -> returning a ct with dying bit set, despite
       having a test for it.

    This (unlikely) case is fine - refcount prevents ct from getting free'd.

    3. if (nf_ct_key_equal(h, tuple, zone, net))

    nf_ct_key_equal checks in following order:

    1. Tuple equal (first or second cacheline)
    2. Zone equal (third cacheline)
    3. confirmed bit set (->status, third cacheline)
    4. net namespace match (third cacheline).

    Swapping "timeout" and "cpu" places timeout in the first cacheline.
    This has two advantages:

    1. For a conntrack that won't even match the original tuple,
    we will now only fetch the first and maybe the second cacheline
    instead of always accessing the 3rd one as well.

    2. in case of TCP ct->timeout changes frequently because we
    reduce/increase it when there are packets outstanding in the network.

    The first cacheline contains both the reference count and the ct spinlock,
    i.e. moving timeout there avoids writes to 3rd cacheline.

    The restart sequence in __nf_conntrack_find() is removed: if we found a
    candidate but then fail to increment the refcount, or discover the tuple
    has changed (object recycling), we just pretend we did not find an entry.

    A second lookup won't find anything until another CPU adds a new conntrack
    with identical tuple into the hash table, which is very unlikely.

    We have the confirmation-time checks (when we hold hash lock) that deal
    with identical entries and even perform clash resolution in some cases.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

22 Apr, 2019

1 commit

  • setting net.netfilter.nf_conntrack_timestamp=1 breaks xmit with fq
    scheduler. skb->tstamp might be "refreshed" using ktime_get_real(),
    but fq expects CLOCK_MONOTONIC.

    This patch removes all places in netfilter that check/set skb->tstamp:

    1. To fix the bogus "start" time seen with conntrack timestamping for
    outgoing packets, never use skb->tstamp and always use current time.
    2. In nfqueue and nflog, only use skb->tstamp for incoming packets,
    as determined by current hook (prerouting, input, forward).
    3. xt_time has to use system clock as well rather than skb->tstamp.
    We could still use skb->tstamp for prerouting/input/forward, but
    I see no advantage to make this conditional.

    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Cc: Eric Dumazet
    Reported-by: Michal Soltys
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

15 Apr, 2019

1 commit

  • Otherwise, we leak the addresses to userspace via ctnetlink events
    and dumps.

    Compute an ID on demand based on the immutable parts of nf_conn struct.

    Another advantage compared to using an address is that there is no
    immediate re-use of the same ID in case the conntrack entry is freed and
    reallocated again immediately.

    Fixes: 3583240249ef ("[NETFILTER]: nf_conntrack_expect: kill unique ID")
    Fixes: 7f85f914721f ("[NETFILTER]: nf_conntrack: kill unique ID")
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal