13 Aug, 2016

1 commit

  • This backward compatibility has been around for more than ten years,
    since Yasuyuki Kozakai introduced IPv6 support in conntrack. These days
    we have the alternate /proc/net/nf_conntrack* entries, and judging from
    the netfilter users mailing list, the ctnetlink interface and the
    conntrack utility have been widely adopted by the user community.

    So let's get rid of this.

    Note that nf_conntrack_htable_size and nf_conntrack_max no longer need
    to be exported as symbols.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

05 May, 2016

1 commit

  • We already include the netns address in the hash and compare the netns
    pointers during lookup, so even if namespaces have overlapping addresses,
    entries will be spread across the table.

    Assuming a 64k bucket size, this change saves 0.5 MB per namespace on a
    64-bit system.

    The NAT bysrc and expectation hashes are still per namespace; those will
    be changed soon as well.

    A future patch will also make the conntrack object slab cache global
    again.
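
    As a rough user-space sketch of the idea (the mixing function and the
    names are illustrative, not the kernel's exact code), folding the
    namespace pointer into the hash keeps identical tuples from different
    namespaces spread across the shared table:

        #include <stdint.h>
        #include <stdio.h>

        struct tuple { uint32_t src, dst; uint16_t sport, dport; };

        static uint32_t mix(uint32_t h, uint32_t v)
        {
                h ^= v;
                h *= 0x9e3779b1u;      /* multiplicative mixing step */
                return h ^ (h >> 16);
        }

        /* fold the netns pointer in so that two namespaces carrying the
           same tuple land in different buckets of the shared table */
        static uint32_t hash_conntrack(const void *net, const struct tuple *t,
                                       uint32_t seed, uint32_t nbuckets)
        {
                uint32_t h = mix(seed, (uint32_t)(uintptr_t)net);

                h = mix(h, t->src);
                h = mix(h, t->dst);
                h = mix(h, ((uint32_t)t->sport << 16) | t->dport);
                return h % nbuckets;
        }

        int main(void)
        {
                struct tuple t = { 0x01020304, 0x05060708, 1234, 80 };
                int ns1, ns2;  /* two distinct "namespace" addresses */

                printf("ns1: %u\n", hash_conntrack(&ns1, &t, 42, 65536));
                printf("ns2: %u\n", hash_conntrack(&ns2, &t, 42, 65536));
                return 0;
        }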

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

20 Jul, 2015

1 commit

  • Quoting Daniel Borkmann:

    "When adding connection tracking template rules to a netns, f.e. to
    configure netfilter zones, the kernel will endlessly busy-loop as soon
    as we try to delete the given netns in case there's at least one
    template present, which is problematic i.e. if there is such bravery that
    the priviledged user inside the netns is assumed untrusted.

    Minimal example:

    ip netns add foo
    ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
    ip netns del foo

    What happens is that when nf_ct_iterate_cleanup() is being called from
    nf_conntrack_cleanup_net_list() for a provided netns, we always end up
    with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
    don't get a soft-lockup as we still have a schedule() point, but the
    serving CPU spins on 100% from that point onwards.

    Since templates are normally allocated with nf_conntrack_alloc(), we
    also bump net->ct.count. The reason they are not yet nf_ct_put() is
    that the per-netns .exit() handler from x_tables (which would
    eventually invoke xt_CT's xt_ct_tg_destroy(), which drops the reference
    on info->ct) is called in the dependency chain at a *later* point in
    time than the per-netns .exit() handler for the connection tracker.

    This is clearly a chicken'n'egg problem: after the connection tracker
    .exit() handler, we've torn down all the connection tracking
    infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
    invoked at a later point in time during the netns cleanup, as that would
    lead to a use-after-free. At the same time, we cannot make x_tables depend
    on the connection tracker module, so that the xt_ct_tg_destroy() would
    be invoked earlier in the cleanup chain."

    Daniel confirms this has to do with the order in which modules are
    loaded, or with having nf_conntrack compiled as a module while x_tables
    is built in. So we have no guarantees regarding the order in which netns
    callbacks are executed.

    Fix this by allocating the templates through kmalloc() from the
    respective SYNPROXY and CT targets, so they don't depend on the
    conntrack kmem cache. Then, release them via nf_ct_tmpl_free() from
    destroy_conntrack(). This branch is marked as unlikely since conntrack
    templates are rarely allocated, and only from the configuration plane
    path.

    Note that templates are no longer kept in any list, to avoid further
    dependencies on nf_conntrack; thus, the tmpl larval list is removed.
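
    A sketch of the new allocation helper (hedged: this follows the shape
    described above, exact fields and flags may differ):

        struct nf_conn *nf_ct_tmpl_alloc(struct net *net, u16 zone,
                                         gfp_t flags)
        {
                struct nf_conn *tmpl;

                /* plain kzalloc: no dependency on the per-netns conntrack
                   kmem cache, so netns teardown never waits on templates */
                tmpl = kzalloc(sizeof(*tmpl), flags);
                if (tmpl == NULL)
                        return NULL;

                tmpl->status = IPS_TEMPLATE;
                write_pnet(&tmpl->ct_net, net);
                atomic_set(&tmpl->ct_general.use, 0);
                return tmpl;
        }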

    Reported-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso
    Tested-by: Daniel Borkmann

    Pablo Neira Ayuso
     

26 Jun, 2014

1 commit

  • This brings the (per-conntrack) ecache extension back to 24 bytes in
    size (it was 152 bytes on x86_64 with lockdep on).

    When event delivery fails, re-delivery is attempted via work queue.

    Redelivery is attempted at least every 0.1 seconds, but can happen
    more frequently if userspace is not congested.

    The nf_ct_release_dying_list() function is removed. With this patch,
    ownership of the to-be-redelivered conntracks (on the dying list with
    the DYING bit not yet set) lies with the work queue, which releases the
    references once the event is out.
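
    As a rough user-space analogue (illustrative only, not kernel code),
    the pattern is one background worker that rescans undelivered items on
    a fixed period, instead of every object carrying its own retry state:

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <time.h>

        #define NEVENTS 4

        static bool pending[NEVENTS] = { true, true, true, true };
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        /* stand-in for a delivery that fails while the listener is
           congested; here two of every three attempts fail */
        static bool try_deliver(int i)
        {
                static int attempts;

                (void)i;
                return ++attempts % 3 == 0;
        }

        static void *redelivery_worker(void *arg)
        {
                struct timespec period = { 0, 100000000L };  /* 0.1 s */
                bool work_left = true;

                (void)arg;
                while (work_left) {
                        work_left = false;
                        pthread_mutex_lock(&lock);
                        for (int i = 0; i < NEVENTS; i++) {
                                if (pending[i] && try_deliver(i))
                                        pending[i] = false;
                                work_left |= pending[i];
                        }
                        pthread_mutex_unlock(&lock);
                        nanosleep(&period, NULL);  /* retry every 0.1 s */
                }
                return NULL;
        }

        int main(void)
        {
                pthread_t t;

                pthread_create(&t, NULL, redelivery_worker, NULL);
                pthread_join(t, NULL);
                puts("all events delivered");
                return 0;
        }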

    Joint work with Pablo Neira Ayuso.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

07 Mar, 2014

2 commits

  • nf_conntrack_lock is a monolithic lock and suffers from huge contention
    on current-generation servers (8 or more cores/threads).

    perf shows the locking contention clearly on a base kernel:

    - 72.56% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock_bh
    - _raw_spin_lock_bh
    + 25.33% init_conntrack
    + 24.86% nf_ct_delete_from_lists
    + 24.62% __nf_conntrack_confirm
    + 24.38% destroy_conntrack
    + 0.70% tcp_packet
    + 2.21% ksoftirqd/6 [kernel.kallsyms] [k] fib_table_lookup
    + 1.15% ksoftirqd/6 [kernel.kallsyms] [k] __slab_free
    + 0.77% ksoftirqd/6 [kernel.kallsyms] [k] inet_getpeer
    + 0.70% ksoftirqd/6 [nf_conntrack] [k] nf_ct_delete
    + 0.55% ksoftirqd/6 [ip_tables] [k] ipt_do_table

    This patch changes the conntrack locking and provides a huge
    performance improvement. SYN-flood attack tested on a 24-core
    E5-2695v2(ES) with 10Gbit/s ixgbe (with the tool trafgen):

    Base kernel: 810,405 new conntrack/sec
    After patch: 2,233,876 new conntrack/sec

    Note that other flood attacks (SYN+ACK or ACK) can easily be deflected
    using:
    # iptables -A INPUT -m state --state INVALID -j DROP
    # sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

    Use an array of hashed spinlocks to protect insertions/deletions of
    conntracks into the hash table. 1024 spinlocks seem to give good
    results, at minimal cost (4KB of memory). Due to the lockdep maximum
    lock depth, 1024 becomes 8 if CONFIG_LOCKDEP=y.

    The hash resize is a bit tricky, because we need to take all locks in
    the array. A seqcount_t is used to synchronize the hash table users
    with the resizing process.
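
    A user-space sketch of the scheme (the lock count and the hash are
    illustrative): an array of locks indexed by a hash of the key, so that
    operations on unrelated buckets rarely contend:

        #include <pthread.h>
        #include <stdint.h>

        #define LOCK_COUNT 1024   /* 8 under lockdep, as noted above */

        static pthread_spinlock_t locks[LOCK_COUNT];

        static unsigned int lock_index(uint32_t key_hash)
        {
                return key_hash % LOCK_COUNT;
        }

        static void insert_entry(uint32_t key_hash)
        {
                pthread_spinlock_t *l = &locks[lock_index(key_hash)];

                pthread_spin_lock(l);
                /* ... link the entry into its hash chain ... */
                pthread_spin_unlock(l);
        }

        int main(void)
        {
                for (int i = 0; i < LOCK_COUNT; i++)
                        pthread_spin_init(&locks[i],
                                          PTHREAD_PROCESS_PRIVATE);
                insert_entry(0xdeadbeef);
                return 0;
        }

    Resizing must take all LOCK_COUNT locks (in order), which is why the
    seqcount mentioned above is needed to keep lockless readers consistent
    during a resize.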

    Signed-off-by: Eric Dumazet
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • One spinlock per cpu to protect the dying/unconfirmed/template special
    lists. (These lists are now per cpu, a bit like the untracked ct.)
    A @cpu field is added to nf_conn to make sure we hold the appropriate
    spinlock at removal time.
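
    In rough user-space terms (an illustrative sketch, not the kernel
    code), the trick is to record the owning list at insert time so that
    removal can take the same lock even when it runs on a different cpu:

        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>    /* sched_getcpu(), glibc */
        #include <stddef.h>

        #define NCPUS 64      /* illustrative upper bound */

        struct entry {
                int cpu;      /* which per-cpu list/lock owns this entry */
                struct entry *next;
        };

        static pthread_mutex_t cpu_lock[NCPUS];
        static struct entry *cpu_list[NCPUS];

        static void insert_entry(struct entry *e)
        {
                int c = sched_getcpu();            /* -1 on error */

                e->cpu = (c < 0 ? 0 : c) % NCPUS;  /* remember the owner */
                pthread_mutex_lock(&cpu_lock[e->cpu]);
                e->next = cpu_list[e->cpu];
                cpu_list[e->cpu] = e;
                pthread_mutex_unlock(&cpu_lock[e->cpu]);
        }

        static void remove_entry(struct entry *e)
        {
                /* the lock recorded at insert time, not the current cpu's */
                pthread_mutex_lock(&cpu_lock[e->cpu]);
                for (struct entry **p = &cpu_list[e->cpu]; *p;
                     p = &(*p)->next) {
                        if (*p == e) {
                                *p = e->next;
                                break;
                        }
                }
                pthread_mutex_unlock(&cpu_lock[e->cpu]);
        }

        int main(void)
        {
                struct entry e = { 0, NULL };

                for (int i = 0; i < NCPUS; i++)
                        pthread_mutex_init(&cpu_lock[i], NULL);
                insert_entry(&e);
                remove_entry(&e);
                return 0;
        }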

    Signed-off-by: Eric Dumazet
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     

13 Dec, 2013

1 commit

  • Reorder struct netns_ct so that changes to the atomic_t "count" don't
    slow down users of the read-mostly fields.

    This is based on Eric Dumazet's proposed patch:
    "netfilter: conntrack: remove the central spinlock"
    http://thread.gmane.org/gmane.linux.network/268758/focus=47306

    The tricky part of cache-aligning this structure is that it is embedded
    in struct net (include/net/net_namespace.h), so changes to other
    netns_xxx structures affect our alignment.

    Eric's original patch contained an ambiguity on 32-bit regarding the
    alignment in struct net. This patch also takes 32-bit into account,
    and in case the (struct net) alignment changes, the sysctl_xxx entries
    have been ordered according to how often they are accessed.
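
    The idea in miniature (illustrative field names, not the real layout):
    keep the hot, frequently written counter on its own cache line so that
    writes to it don't invalidate the line holding read-mostly fields in
    other cpus' caches:

        #include <stdalign.h>
        #include <stdatomic.h>
        #include <stddef.h>
        #include <stdio.h>

        #define CACHELINE 64

        struct netns_ct_like {
                /* read-mostly configuration, stays clean in all caches */
                unsigned int htable_size;
                unsigned int sysctl_checksum;

                /* the hot counter gets its own cache line */
                alignas(CACHELINE) atomic_uint count;
        };

        int main(void)
        {
                printf("count starts at offset %zu\n",
                       offsetof(struct netns_ct_like, count));
                return 0;
        }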

    Signed-off-by: Jesper Dangaard Brouer
    Reviewed-by: Jiri Benc
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     

18 Jan, 2013

1 commit

  • Similar to connmarks, except labels are bit-based, i.e. all labels may
    be attached to a flow at the same time.

    Up to 128 labels are supported. Supporting more labels
    is possible, but requires increasing the ct offset delta
    from u8 to u16 type due to increased extension sizes.

    Mapping of bit-identifier to label name is done in userspace.

    The extension is enabled at run-time once "-m connlabel" netfilter
    rules are added.
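
    Conceptually (an illustrative sketch, not the kernel's extension code),
    a label set is a fixed-size bitmap, so attaching one label never
    disturbs another:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define MAX_LABELS 128
        #define WORD_BITS  32

        struct conn_labels {
                uint32_t bits[MAX_LABELS / WORD_BITS];
        };

        static void label_set(struct conn_labels *l, unsigned int bit)
        {
                l->bits[bit / WORD_BITS] |= 1u << (bit % WORD_BITS);
        }

        static bool label_test(const struct conn_labels *l, unsigned int bit)
        {
                return l->bits[bit / WORD_BITS] & (1u << (bit % WORD_BITS));
        }

        int main(void)
        {
                struct conn_labels l = { { 0 } };

                label_set(&l, 3);    /* bit 3's name lives in userspace */
                label_set(&l, 127);
                printf("bit 3: %d, bit 4: %d\n",
                       label_test(&l, 3), label_test(&l, 4));
                return 0;
        }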

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

17 Dec, 2012

1 commit

  • In commit d871bef ("netfilter: ctnetlink: dump entries from the dying
    and unconfirmed lists"), we assumed that all conntrack objects are
    inserted in one of the existing lists. However, template conntrack
    objects were not. This results in hitting a BUG_ON in the
    destroy_conntrack path while removing a rule that uses the CT target.

    This patch fixes the situation by adding a template list, which is
    where template conntrack objects now reside.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

07 Jun, 2012

7 commits

  • This patch adds namespace support for ICMPv6 protocol tracker.

    Acked-by: Eric W. Biederman
    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     
  • This patch adds namespace support for ICMP protocol tracker.

    Acked-by: Eric W. Biederman
    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     
  • This patch adds namespace support for UDP protocol tracker.

    Acked-by: Eric W. Biederman
    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     
  • This patch adds namespace support for TCP protocol tracker.

    Acked-by: Eric W. Biederman
    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     
  • This patch adds namespace support for the generic layer 4 protocol
    tracker.

    Acked-by: Eric W. Biederman
    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     
  • This patch prepares the namespace support for layer 3 protocol trackers.
    Basically, this modifies the following interfaces:

    * nf_ct_l3proto_[un]register_sysctl.
    * nf_conntrack_l3proto_[un]register.

    We add a new helper, nf_ct_l3proto_net, that is used to get the pernet
    data of an l3proto.

    This also adds the new struct nf_ip_net, which stores the sysctl header
    and the l3proto_ipv4, l4proto_tcp(6), l4proto_udp(6) and
    l4proto_icmp(v6) data. Since protocols such as tcp and tcp6 share the
    same data, making nf_ip_net a field of netns_ct is the easiest way to
    manage it.

    This patch also adds an init_net hook to struct nf_conntrack_l3proto to
    initialize the layer 3 protocol pernet data.

    Acked-by: Eric W. Biederman
    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     
  • This patch prepares the namespace support for layer 4 protocol trackers.
    Basically, this modifies the following interfaces:

    * nf_ct_[un]register_sysctl
    * nf_conntrack_l4proto_[un]register

    to include the namespace parameter. We still use init_net in this patch
    to prepare the ground for follow-up patches for each layer 4 protocol
    tracker.

    We add a new net_id field to struct nf_conntrack_l4proto that is used
    to store the pernet_operations id for each layer 4 protocol tracker.

    Note that AF_INET6's protocols do not need to do sysctl compat. Thus,
    we only register compat sysctl when l4proto.l3proto != AF_INET6.
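
    As a hedged sketch of the pernet pattern involved (the foo_* names are
    hypothetical; register_pernet_subsys() and net_generic() are the real
    kernel interfaces): a tracker stores the id handed out at registration
    and uses it to reach its per-namespace data later.

        /* hypothetical layer 4 tracker, kernel-style sketch */
        static unsigned int foo_net_id;

        struct foo_net {
                unsigned int timeout;       /* per-namespace tunable */
        };

        static struct pernet_operations foo_net_ops = {
                .init = foo_net_init,       /* set up per-netns state */
                .exit = foo_net_exit,
                .id   = &foo_net_id,        /* filled in on registration */
                .size = sizeof(struct foo_net),
        };

        /* module init */
        register_pernet_subsys(&foo_net_ops);

        /* packet path: per-namespace data is one lookup away */
        struct foo_net *fn = net_generic(net, foo_net_id);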

    Acked-by: Eric W. Biederman
    Signed-off-by: Gao feng
    Signed-off-by: Pablo Neira Ayuso

    Gao feng
     

09 May, 2012

1 commit

  • This patch allows you to disable automatic conntrack helper
    lookup based on TCP/UDP ports, eg.

    echo 0 > /proc/sys/net/netfilter/nf_conntrack_helper

    [ Note: flows that already got a helper will keep using it even
    if automatic helper assignment has been disabled ]

    Once this behaviour has been disabled, you have to explicitly
    use the iptables CT target to attach helper to flows.

    There are good reasons to stop supporting automatic helper
    assignment, for further information, please read:

    http://www.netfilter.org/news.html#2012-04-03

    This patch also adds a message to inform users that automatic helper
    assignment is deprecated and will be removed soon (it is printed only
    once, for the first flow that gets a helper attached, to keep it as
    unobtrusive as possible).

    Signed-off-by: Eric Leblond
    Signed-off-by: Pablo Neira Ayuso

    Eric Leblond
     

22 Nov, 2011

1 commit

  • This patch fixes an oops that can be triggered following this recipe:

    0) make sure nf_conntrack_netlink and nf_conntrack_ipv4 are loaded.
    1) container is started.
    2) connect to it via lxc-console.
    3) generate some traffic with the container to create some conntrack
    entries in its table.
    4) stop the container: you hit an oops because the conntrack table
    cleanup tries to report the destroy event to user-space, but the
    per-netns nfnetlink socket has already gone away (the nfnetlink
    socket is per-netns, but event callback registration is global).

    To fix this situation, we make the ctnl_notifier per-netns so the
    callback is registered/unregistered if the container is
    created/destroyed.

    Alex Bligh and Alexey Dobriyan originally proposed a small patch to
    check whether the nfnetlink socket is gone in nfnetlink_has_listeners,
    but this is a heavily used path for events; thus, it may reduce
    performance, and it looks a bit hackish to check for the nfnetlink
    socket only to work around this situation. As a result, I decided
    to take the bigger change, which looks cleaner to me.

    Cc: Alexey Dobriyan
    Reported-by: Alex Bligh
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

19 Jan, 2011

1 commit

  • This patch adds flow-based timestamping for conntracks. This conntrack
    extension is disabled by default. Basically, we use two 64-bit
    variables: one stores the creation timestamp once the conntrack has
    been confirmed, and the other stores the deletion time. To enable the
    extension, you have to:

    echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp

    This patch saves memory for user-space flow-based loggers such as
    ulogd2. In short, ulogd2 no longer needs to keep a hashtable with the
    conntracks in user-space to know when they were created and destroyed;
    instead, we use the kernel timestamps. These nanosecond-resolution
    timestamps are also useful if we want a sane IPFIX implementation in
    user-space. Other custom user-space applications can benefit from this
    via libnetfilter_conntrack.

    This patch modifies the /proc output to display the delta time
    in seconds since the flow start. You can also obtain the
    flow-start date by means of the conntrack-tools.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     

17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to net.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.

    The macro and type tricks around snmp stats make things a bit
    interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field
    as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All
    snmp_mib_*() users which used to cast the argument to (void **) are
    updated to cast it to (void __percpu **).

    Signed-off-by: Tejun Heo
    Acked-by: David S. Miller
    Cc: Patrick McHardy
    Cc: Arnaldo Carvalho de Melo
    Cc: Vlad Yasevich
    Cc: netdev@vger.kernel.org
    Signed-off-by: David S. Miller

    Tejun Heo
     

09 Feb, 2010

2 commits

  • As noticed by Jon Masters, the conntrack hash
    size is global and not per namespace, but modifiable at runtime through
    /sys/module/nf_conntrack/hashsize. Changing the hash size will only
    resize the hash in the current namespace however, so other namespaces
    will use an invalid hash size. This can cause crashes when enlarging
    the hashsize, or false negative lookups when shrinking it.

    Move the hash size into the per-namespace data and only use the global
    hash size to initialize the per-namespace value when instantiating a
    new namespace. Additionally, restrict hash resizing to init_net for
    now, as other namespaces are not handled currently.

    Cc: stable@kernel.org
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • nf_conntrack_cachep is currently shared by all netns instances, but
    because of SLAB_DESTROY_BY_RCU special semantics, this is wrong.

    If we use a shared slab cache, one object can instantly fly from one
    hash table (netns ONE) to another (netns TWO), and a concurrent reader
    (doing a lookup in netns ONE, 'finding' an object of netns TWO) can be
    fooled without notice, because no RCU grace period has to be observed
    between freeing an object and reusing it.

    We don't have this problem with the UDP/TCP slab caches because the
    TCP/UDP hashtables are global to the machine (and each object has a
    pointer to its netns).

    If we use per-netns conntrack hash tables, we also *must* use per-netns
    conntrack slab caches, to guarantee that an object cannot escape from
    one namespace to another.
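
    The SLAB_DESTROY_BY_RCU contract behind this, sketched with the shape
    of the conntrack lookup (hedged; the details vary across kernel
    versions): after a lockless lookup, the reader must take a reference
    and then re-check the key and the netns, because the object may have
    been freed and reused in the meantime.

        rcu_read_lock();
        h = __nf_conntrack_find(net, tuple);   /* lockless hash lookup */
        if (h) {
                ct = nf_ct_tuplehash_to_ctrack(h);
                if (!atomic_inc_not_zero(&ct->ct_general.use)) {
                        ct = NULL;             /* object is being freed */
                } else if (!nf_ct_tuple_equal(tuple, &h->tuple) ||
                           !net_eq(net, nf_ct_net(ct))) {
                        /* the slab reused the object for another entry */
                        nf_ct_put(ct);
                        ct = NULL;             /* caller should retry */
                }
        }
        rcu_read_unlock();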

    Signed-off-by: Eric Dumazet
    [Patrick: added unique slab name allocation]
    Cc: stable@kernel.org
    Signed-off-by: Patrick McHardy

    Eric Dumazet
     

13 Jun, 2009

2 commits

  • This patch improves ctnetlink event reliability if one broadcast
    listener has set the NETLINK_BROADCAST_ERROR socket option.

    The logic is the following: if an event delivery fails, we keep the
    undelivered events in the missed-event cache. Once the next packet
    arrives, we add any new events to the missed events in the cache and
    try a new delivery, and so on. Thus, if ctnetlink fails to deliver an
    event, we retry once we see a new packet. We may lose some state
    transitions, but the userspace process gets back in sync at some
    point.

    In the worst case, if no events were delivered to userspace, we make
    sure that destroy events are successfully delivered. Basically, if
    ctnetlink fails to deliver the destroy event, we remove the conntrack
    entry from the hashes and insert it into the dying list, which contains
    inactive entries. Then, the conntrack timer is added with an extra
    grace timeout of random32() % 15 seconds to trigger the event again
    (this grace timeout is tunable via /proc). The use of a limited random
    timeout value distributes the "destroy" resends, thus avoiding the
    accumulation of lots of "destroy" events at the same time. Events may
    be delivered out of order, but we can identify them by means of the
    tuple plus the conntrack ID.

    The maximum number of conntrack entries (active or inactive) is
    still handled by nf_conntrack_max. Thus, we may start dropping
    packets at some point if we accumulate a lot of inactive conntrack
    entries that did not successfully report the destroy event to
    userspace.

    During my stress tests, consisting of setting a very small buffer of
    2048 bytes for conntrackd and the NETLINK_BROADCAST_ERROR socket flag,
    and generating lots of very small connections, I noticed very few
    destroy entries in flight waiting to be resent.

    A simple way to test this patch consists of creating a lot of entries,
    setting a very small Netlink buffer in conntrackd (+ a patch which is
    not in the git tree to set the BROADCAST_ERROR flag) and invoking
    `conntrack -F'.

    For expectations, no changes are introduced in this patch. Currently,
    event delivery is only done for new expectations (no events from
    expectation expiration, removal or confirmation). To support those,
    they would need a per-expectation event cache implementing the same
    idea that is exposed in this patch.

    This patch can be useful to provide reliable flow-accounting. We still
    have to add a new conntrack extension to store the creation and destroy
    times.
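
    Schematically (a user-space sketch with illustrative names, not the
    ctnetlink code), undelivered event bits are kept in a per-connection
    "missed" mask and folded into the next delivery attempt, so the
    listener eventually converges on the right state:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct event_cache {
                uint32_t cache;   /* events gathered since last attempt */
                uint32_t missed;  /* events whose delivery failed */
        };

        /* stand-in for a netlink broadcast that can fail (ENOBUFS) */
        static bool deliver(uint32_t events)
        {
                (void)events;
                return false;     /* simulate a congested listener */
        }

        /* called when the next packet of this flow is seen */
        static void event_flush(struct event_cache *e)
        {
                uint32_t events = e->cache | e->missed;  /* merge */

                e->cache = 0;
                e->missed = deliver(events) ? 0 : events;
        }

        int main(void)
        {
                struct event_cache e = { .cache = 0x3, .missed = 0 };

                event_flush(&e);  /* fails: bits are retained */
                printf("missed after failed delivery: %#x\n", e.missed);
                return 0;
        }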

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     
  • This patch reworks the per-cpu event caching to use the conntrack
    extension infrastructure.

    The main drawback is that we consume more memory per conntrack if
    event delivery is enabled. This patch is required by the reliable event
    delivery patch that follows.

    BTW, this patch allows you to enable/disable event delivery via
    /proc/sys/net/netfilter/nf_conntrack_events at runtime, although you
    can still disable event caching as a compile-time option.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     

26 Mar, 2009

1 commit

  • Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP.

    This permits an easy conversion from call_rcu() based hash lists to a
    SLAB_DESTROY_BY_RCU one.

    Avoiding the call_rcu() delay at nf_conn freeing time has numerous
    gains:

    - It doesn't fill the RCU queues (up to 10000 elements per cpu). This
      reduces the OOM possibility if queued elements are not taken into
      account, and reduces latency problems when the RCU queue size hits
      its high limit and triggers emergency mode.

    - It allows fast reuse of just-freed elements, permitting better use of
      the CPU cache.

    - We delete rcu_head from struct nf_conn, shrinking the size of this
      structure by 8 or 16 bytes.

    This patch only takes care of struct nf_conn. call_rcu() is still used
    for less critical conntrack parts, which may be converted later if
    necessary.
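
    The lookup side of the scheme, sketched with the real hlist_nulls
    helpers (hedged; the surrounding code is simplified): if a traversal
    ends on a nulls value that doesn't match the starting bucket, the chain
    was crossed during a concurrent free/reuse and the lookup restarts.

        begin:
        hlist_nulls_for_each_entry_rcu(h, n, &hash[bucket], hnnode) {
                if (nf_ct_tuple_equal(tuple, &h->tuple))
                        return h;   /* candidate; the caller still
                                       re-checks after taking a ref */
        }
        /* the nulls marker encodes which bucket the walk ended in, so a
           mismatch means we were moved to another chain: try again */
        if (get_nulls_value(n) != bucket)
                goto begin;
        return NULL;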

    Signed-off-by: Eric Dumazet
    Signed-off-by: Patrick McHardy

    Eric Dumazet
     

08 Oct, 2008

8 commits