11 Jul, 2016

1 commit

  • When we do "cat /proc/net/nf_conntrack" while the conntrack hash table is
    being resized via /sys/module/nf_conntrack/parameters/hashsize, a race can
    happen: a reader may observe the newly allocated hash but the old size
    (or vice versa). An oops like the following will then occur:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
    IP: [] seq_print_acct+0x11/0x50 [nf_conntrack]
    Call Trace:
    [] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
    [] seq_read+0x2cc/0x390
    [] proc_reg_read+0x42/0x70
    [] __vfs_read+0x37/0x130
    [] ? security_file_permission+0xa0/0xc0
    [] vfs_read+0x95/0x140
    [] SyS_read+0x55/0xc0
    [] entry_SYSCALL_64_fastpath+0x1a/0xa4

    It is very easy to reproduce this kernel crash.
    1. open one shell and input the following cmds:
    while : ; do
    echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
    done
    2. open more shells and input the following cmds:
    while : ; do
    cat /proc/net/nf_conntrack
    done
    3. just wait a moment; the oops will happen soon.

    The solution in this patch is based on Florian's commit 5e3c61f98175
    ("netfilter: conntrack: fix lookup race during hash resize"). It also
    adds a wrapper function, nf_conntrack_get_ht, to get the hash and hsize,
    as suggested by Florian Westphal.
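
    A rough sketch of the seqcount read loop that such a wrapper performs
    (simplified; the symbols nf_conntrack_generation, nf_conntrack_hash and
    nf_conntrack_htable_size are assumed from the kernel of that era, not
    quoted from the patch):

    /* snapshot the hash pointer and size consistently w.r.t. resizes */
    static void get_ht_snapshot(struct hlist_nulls_head **hash,
                                unsigned int *hsize)
    {
        unsigned int seq;

        do {
            seq = read_seqcount_begin(&nf_conntrack_generation);
            *hsize = nf_conntrack_htable_size;
            *hash  = nf_conntrack_hash;
        } while (read_seqcount_retry(&nf_conntrack_generation, seq));
    }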

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     

07 Jul, 2016

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for net-next,
    they are:

    1) Don't use userspace datatypes in bridge netfilter code, from
    Tobin Harding.

    2) Iterate only once over the expectation table when removing the
    helper module, instead of once per-netns, from Florian Westphal.

    3) Extra sanitization in xt_hook_ops_alloc() to return an error in case
    we ever pass zero hooks.

    4) Handle NFPROTO_INET from the logging core infrastructure, from
    Liping Zhang.

    5) Autoload loggers when TRACE target is used from rules, this doesn't
    change the behaviour in case the user already selected nfnetlink_log
    as preferred way to print tracing logs, also from Liping Zhang.

    6) Create conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging
    fields by cache lines; this increases the size of each entry by 11%.
    From Florian Westphal.

    7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian.

    8) Remove useless defensive check in nf_logger_find_get() from Shivani
    Bhardwaj.

    9) Remove the zone extension and place it in the conntrack object; it is
    always included in the hashing and we expect more intensive use of
    zones now that containers are in place. Also from Florian Westphal.

    10) Owner match now works from any namespace, from Eric W. Biederman.

    11) Make sure we only reply with TCP reset to TCP traffic from
    nf_reject_ipv4, patch from Liping Zhang.

    12) Introduce --nflog-size to indicate the amount of network packet bytes
    that are copied to userspace via the log message, from Vishwanath Pai.
    This obsoletes --nflog-range, which was designed to achieve this but
    has never worked.

    13) Introduce generic macros for nf_tables object generation masks.

    14) Use the generation mask in table, chain and set objects in nf_tables.
    This fixes interference between the ongoing preparation phase of
    the commit protocol and object listings happening at the same time.
    This update is introduced in three patches, one per object.

    15) Check if the object is active in the next generation for element
    deactivation in the rbtree implementation; given that deactivation
    happens from the commit phase path, we have to observe the future
    status of the object.

    16) Support for deletion of just added elements in the hash set type.

    17) Allow to resize hashtable from /proc entry, not only from the
    obscure /sys entry that maps to the module parameter, from Florian
    Westphal.

    18) Get rid of NFT_BASECHAIN_DISABLED, this code is not exercised
    anymore since we tear down the ruleset whenever the netdevice
    goes away.

    19) Support for matching inverted set lookups, from Arturo Borrero.

    20) Simplify the iptables_mangle_hook() by removing a superfluous
    extra branch.

    21) Introduce ether_addr_equal_masked() and use it from the netfilter
    codebase, from Joe Perches.

    22) Remove references to "Use netfilter MARK value as routing key"
    from the Netfilter Kconfig description given that this toggle
    hasn't existed for 10 years, from Moritz Sichert.

    23) Introduce generic NF_INVF() and use it from the xtables codebase,
    from Joe Perches (see the sketch after this list).

    24) Setting the logger to NONE via /proc was not working unless explicit
    nul-termination was included in the string. This fix keeps the former
    behaviour in place, so we don't break backward compatibility.
    ====================
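
    As a side note on item 23, the generic invert-flag helper boils down to a
    one-line macro; roughly (the exact definition may differ):

    #define NF_INVF(ptr, flag, boolean) \
            ((boolean) ^ !!((ptr)->invflags & (flag)))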

    Signed-off-by: David S. Miller

    David S. Miller
     

24 Jun, 2016

1 commit

  • No need to restrict this to the module parameter.

    We export a copy of the real hash size -- when user alters the value we
    allocate the new table, copy entries etc before we update the real size
    to the requested one.

    This is also needed because the real size is used by concurrent readers
    and cannot be changed without synchronizing the conntrack generation
    seqcnt.

    We only allow changing this value from the initial net namespace.

    Tested using http-client-benchmark vs. httpterm with the following
    running concurrently:

    while true;do
    echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
    done
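
    A rough sketch of the flow described above (simplified; the names
    nf_conntrack_hash_resize() and nf_conntrack_htable_size_user are
    assumptions, not quoted from the patch):

    static int nf_conntrack_hash_sysctl(struct ctl_table *table, int write,
                                        void __user *buffer, size_t *lenp,
                                        loff_t *ppos)
    {
        int ret;

        /* read or update the exported copy of the hash size */
        ret = proc_dointvec(table, write, buffer, lenp, ppos);
        if (ret < 0 || !write)
            return ret;

        /* allocate the new table and move entries, then publish the size */
        ret = nf_conntrack_hash_resize(nf_conntrack_htable_size_user);

        /* reflect the value actually in use (the resize may be refused) */
        nf_conntrack_htable_size_user = nf_conntrack_htable_size;
        return ret;
    }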

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

23 Jun, 2016

2 commits

  • Currently we store zone information as a conntrack extension.
    This has one drawback: for every lookup we need to fetch the zone data
    from the extension area.

    This change places the zone data directly in the main conntrack object
    structure and then removes the zone conntrack extension.

    The zone data is just 4 bytes; it fits into a padding hole before
    the tuplehash info, so we do not even increase the nf_conn structure size.
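
    For reference, the 4-byte zone object looks roughly like this (field
    names assumed from the zone infrastructure of that time):

    struct nf_conntrack_zone {
        u16 id;     /* zone identifier */
        u8  flags;  /* e.g. "derive id from skb->mark" */
        u8  dir;    /* direction(s) the zone applies to */
    };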

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • This increases the struct size by 32 bytes (288 -> 320), but it is the
    right thing to do; otherwise any attempt to (re-)arrange nf_conn members
    by cacheline won't work.
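
    The change boils down to passing SLAB_HWCACHE_ALIGN when creating the
    conntrack slab cache; a sketch (any flags besides SLAB_HWCACHE_ALIGN are
    assumptions):

    nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
                                            sizeof(struct nf_conn), 0,
                                            SLAB_DESTROY_BY_RCU |
                                            SLAB_HWCACHE_ALIGN, NULL);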

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

15 Jun, 2016

1 commit


16 May, 2016

1 commit

  • The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
    because we no longer have a per-netns conntrack hash.

    The ip_gre.c conflict as well as the iwlwifi ones were cases of
    overlapping changes.

    Conflicts:
    drivers/net/wireless/intel/iwlwifi/mvm/tx.c
    net/ipv4/ip_gre.c
    net/netfilter/nf_conntrack_core.c

    Signed-off-by: David S. Miller

    David S. Miller
     

15 May, 2016

1 commit

  • The slab name ends up being visible in the directory structure under
    /sys, and even if you don't have access rights to the file you can see
    the filenames.

    Just use a 64-bit counter instead of the pointer to the 'net' structure
    to generate a unique name.

    This code will go away in 4.7 when the conntrack code moves to a single
    kmemcache, but this is the backportable simple solution to avoiding
    leaking kernel pointers to user space.
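
    A sketch of the counter-based naming (simplified; the slabname field and
    the surrounding init path are assumptions):

    static atomic64_t unique_id;

    /* called from the per-netns conntrack init path */
    net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%llu",
                                 (u64)atomic64_inc_return(&unique_id));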

    Fixes: 5b3501faa874 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
    Signed-off-by: Linus Torvalds
    Acked-by: Eric Dumazet
    Cc: stable@vger.kernel.org
    Signed-off-by: David S. Miller

    Linus Torvalds
     

10 May, 2016

1 commit

  • A recent commit introduced an unconditional use of an uninitialized
    variable, as reported in this gcc warning:

    net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_confirm':
    net/netfilter/nf_conntrack_core.c:632:33: error: 'ctinfo' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    bytes = atomic64_read(&counter[CTINFO2DIR(ctinfo)].bytes);
    ^
    net/netfilter/nf_conntrack_core.c:628:26: note: 'ctinfo' was declared here
    enum ip_conntrack_info ctinfo;

    The problem is that a local variable shadows the function parameter.
    This removes the local variable, which looks like what Pablo originally
    intended.

    Signed-off-by: Arnd Bergmann
    Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race")
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

09 May, 2016

1 commit


05 May, 2016

11 commits

  • This patch introduces nf_ct_resolve_clash() to resolve a race condition on
    conntrack insertions.

    This is particularly a problem for connectionless protocols such as
    UDP, which have no initial handshake. Two or more packets may race to
    insert the entry, resulting in packet drops.

    Another problematic scenario is packets enqueued to userspace via
    NFQUEUE after the raw table, which makes it easier to trigger this
    race.

    To resolve this, the idea is to reset the conntrack entry to the one
    that won the race. Packet and byte counters are also merged.

    The 'insert_failed' stat still accounts for this situation; after
    this patch, the drop counter is bumped whenever we drop packets, so we
    can watch for unresolved clashes.
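
    A rough sketch of the resolution path (simplified; details such as the
    allow_clash protocol flag and the nf_ct_acct_merge() helper are
    assumptions about the implementation, not quoted from the patch):

    static int nf_ct_resolve_clash(struct net *net, struct sk_buff *skb,
                                   enum ip_conntrack_info ctinfo,
                                   struct nf_conntrack_tuple_hash *h)
    {
        /* the entry already in the table is the one that won the race */
        struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
        struct nf_conntrack_l4proto *l4proto;

        l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
        if (l4proto->allow_clash && !nf_ct_is_dying(ct) &&
            atomic_inc_not_zero(&ct->ct_general.use)) {
            /* merge the loser's packet/byte counters into the winner
             * and point the skb at the winning entry */
            nf_ct_acct_merge(ct, ctinfo, (struct nf_conn *)skb->nfct);
            nf_conntrack_put(skb->nfct);
            skb->nfct = &ct->ct_general;
            return NF_ACCEPT;
        }

        NF_CT_STAT_INC(net, drop);
        return NF_DROP;
    }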

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Introduce a helper function to update conntrack counters.
    __nf_ct_kill_acct() was unnecessarily subtracting skb_network_offset(),
    which is expected to be zero when called from the ipv4/ipv6 hooks.
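
    A sketch of such a helper (the name nf_ct_acct_update and the exact
    signature are assumptions):

    static void nf_ct_acct_update(struct nf_conn *ct,
                                  enum ip_conntrack_info ctinfo,
                                  unsigned int len)
    {
        struct nf_conn_acct *acct = nf_conn_acct_find(ct);

        if (acct) {
            struct nf_conn_counter *counter = acct->counter;

            /* skb->len is used directly; no skb_network_offset()
             * subtraction */
            atomic64_inc(&counter[CTINFO2DIR(ctinfo)].packets);
            atomic64_add(len, &counter[CTINFO2DIR(ctinfo)].bytes);
        }
    }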

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Remove the unnecessary check for a non-null pointer in destroy_conntrack(),
    given that __nf_ct_l4proto_find() returns the generic protocol tracker
    if the protocol is not supported.
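
    The resulting pattern in destroy_conntrack(), roughly:

    l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
    /* no NULL check needed: the generic tracker is returned instead */
    if (l4proto->destroy)
        l4proto->destroy(ct);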

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • When iterating, skip conntrack entries living in a different netns.

    We could ignore the netns and kill some other non-assured entry, but
    that has two problems:

    - a netns could kill non-assured conntracks in another namespace
    - we would start to 'over-subscribe' the affected/overlimit netns.
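
    In code terms, the walk simply gains a netns check, roughly:

    /* inside the early-drop style walk over the shared table */
    if (!net_eq(nf_ct_net(ct), net))
        continue;   /* entry belongs to another namespace */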

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • We already include the netns address in the hash and compare the netns
    pointers during lookup, so even if namespaces have overlapping addresses,
    entries will be spread across the table.

    Assuming a 64k bucket size, this change saves 0.5 MB per namespace on a
    64-bit system (65536 buckets * 8 bytes per bucket head = 512 KiB).

    The NAT bysrc and expectation hashes are still per namespace; those will
    be changed too, soon.

    A future patch will also make the conntrack object slab cache global again.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Once we place all conntracks into a global hash table, we want them to be
    spread across the entire hash table, even if namespaces have overlapping
    IP addresses.
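
    In practice this means mixing a per-netns value into the hash seed; a
    sketch (net_hash_mix() as the mixing helper is an assumption, and the
    hashed length is simplified):

    static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple,
                                  const struct net *net)
    {
        unsigned int n = sizeof(*tuple) / sizeof(u32);  /* simplified */
        u32 seed = nf_conntrack_hash_rnd ^ net_hash_mix(net);

        return jhash2((const u32 *)tuple, n, seed);
    }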

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Once we place all conntracks in the same hash table, we must also compare
    the netns pointer to skip conntracks that belong to a different namespace.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • This prepares for an upcoming change that places all conntracks into a
    single, global table. For this to work we will also need to compare the
    net pointer during lookup. To avoid open-coding such a check, use the
    nf_ct_key_equal helper and later extend it to also consider net_eq.
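
    A sketch of the helper once extended with the netns comparison
    (simplified; the exact argument list is an assumption):

    static bool nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
                                const struct nf_conntrack_tuple *tuple,
                                const struct nf_conntrack_zone *zone,
                                const struct net *net)
    {
        struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);

        return nf_ct_tuple_equal(tuple, &h->tuple) &&
               nf_ct_zone_equal(ct, zone, NF_CT_DIRECTION(h)) &&
               net_eq(net, nf_ct_net(ct));
    }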

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Once we place all conntracks into the same table, iteration becomes more
    costly because the table contains conntracks that we are not interested
    in (those belonging to other netns).

    So don't bother scanning if the current namespace has no entries.
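
    The short-circuit itself is essentially a one-liner at the start of the
    walk, roughly:

    if (atomic_read(&net->ct.count) == 0)
        return; /* nothing from this netns lives in the shared table */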

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • When resizing the conntrack hash table at runtime via
    echo 42 > /sys/module/nf_conntrack/parameters/hashsize, we are racing with
    the conntrack lookup path -- reads can happen in parallel and nothing
    prevents readers from observing the newly allocated hash but the old
    size (or vice versa).

    So access to hash[bucket] can trigger an OOB read in case the table got
    expanded and we saw the new size but the old hash pointer (or it got
    shrunk and we got the new hash pointer but the size of the old, larger
    table):

    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN
    CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107
    [..]
    Call Trace:
    [] ? nf_conntrack_tuple_taken+0x12a/0xe90
    [] ? nf_ct_invert_tuplepr+0x221/0x3a0
    [] get_unique_tuple+0xfb3/0x2760

    Use a generation counter to obtain the address/length of the same table.

    Also add a synchronize_net before freeing the old hash.
    AFAICS, without it we might access ct_hash[bucket] after ct_hash has been
    freed, provided that a lockless reader got delayed by another event:

    CPU1                        CPU2
    seq_begin
    seq_retry
                                resize occurs
                                free oldhash
    for_each(oldhash[size])

    Note that resize is only supported in init_netns; it took over 2 minutes
    of constant resizing+flooding to produce the warning, so this isn't a
    big problem in practice.
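
    A rough sketch of the resize (writer) side under these fixes (the
    seqcount name nf_conntrack_generation and the nf_ct_free_hashtable()
    helper are assumptions; simplified):

    /* publish the new table and size as one generation ... */
    write_seqcount_begin(&nf_conntrack_generation);
    /* ... install new_hash/new_size as the live table here ... */
    write_seqcount_end(&nf_conntrack_generation);

    /* ... then let delayed lockless readers finish before freeing */
    synchronize_net();
    nf_ct_free_hashtable(old_hash, old_size);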

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • No need to disable BH here anymore:

    stats are switched to the _ATOMIC variant (== this_cpu_inc()), which
    nowadays generates the same code as the non-_ATOMIC NF_STAT, at least
    on x86.
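
    For reference, the _ATOMIC variant is essentially (definition assumed
    from the conntrack headers of that era):

    #define NF_CT_STAT_INC_ATOMIC(net, count) \
            this_cpu_inc((net)->ct.stat->count)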

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

29 Apr, 2016

1 commit


25 Apr, 2016

2 commits


28 Mar, 2016

1 commit


15 Mar, 2016

1 commit


01 Feb, 2016

1 commit

  • Ulrich reports a soft lockup with the following (shortened) call chain:

    NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
    __netif_receive_skb_core+0x6e4/0x774
    process_backlog+0x94/0x160
    net_rx_action+0x88/0x178
    call_do_softirq+0x24/0x3c
    do_softirq+0x54/0x6c
    __local_bh_enable_ip+0x7c/0xbc
    nf_ct_iterate_cleanup+0x11c/0x22c [nf_conntrack]
    masq_inet_event+0x20/0x30 [nf_nat_masquerade_ipv6]
    atomic_notifier_call_chain+0x1c/0x2c
    ipv6_del_addr+0x1bc/0x220 [ipv6]

    The problem is that nf_ct_iterate_cleanup can run for a very long time
    since it can be interrupted by softirq processing.
    Moreover, atomic_notifier_call_chain runs with the RCU read lock held.

    So let's call cond_resched() in nf_ct_iterate_cleanup and defer
    the call to a work queue for the atomic_notifier_call_chain case.

    We also need another cond_resched in get_next_corpse, since we
    have to deal with iter() always returning false; in that case
    get_next_corpse will walk the entire conntrack table.
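
    Schematically, the cleanup loop gains a reschedule point, something like
    (the get_next_corpse() signature shown is an assumption):

    while ((ct = get_next_corpse(net, iter, data, &bucket)) != NULL) {
        /* ... kill and put the entry ... */
        cond_resched();   /* don't monopolize the CPU on huge tables */
    }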

    Reported-by: Ulrich Weber
    Tested-by: Ulrich Weber
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

20 Jan, 2016

1 commit

  • When we need to lock all buckets in the connection hashtable, we'd attempt
    to lock 1024 spinlocks, which is way more preemption levels than the
    kernel supports. Furthermore, this behavior was hidden by checking whether
    lockdep is enabled, and if it was, using only 8 buckets(!).

    Fix this by using a global lock and synchronizing all buckets on it when
    we need to lock them all. This is pretty heavyweight, but it is only done
    when we need to resize the hashtable, and that doesn't happen often
    (if at all).
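
    A sketch of the global-lock scheme described above (simplified; the real
    patch may differ in how it waits for per-bucket lock holders and in its
    memory barriers):

    static DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
    static bool nf_conntrack_locks_all;

    static void nf_conntrack_all_lock(void)
    {
        int i;

        spin_lock(&nf_conntrack_locks_all_lock);
        nf_conntrack_locks_all = true;

        /* wait for every current per-bucket lock holder to drop out */
        for (i = 0; i < CONNTRACK_LOCKS; i++) {
            spin_lock(&nf_conntrack_locks[i]);
            spin_unlock(&nf_conntrack_locks[i]);
        }
    }

    static void nf_conntrack_all_unlock(void)
    {
        nf_conntrack_locks_all = false;
        spin_unlock(&nf_conntrack_locks_all_lock);
    }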

    Signed-off-by: Sasha Levin
    Acked-by: Jesper Dangaard Brouer
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Sasha Levin
     

12 Oct, 2015

1 commit

  • The object and module refcounts are updated for each conntrack template;
    however, if we delete the iptables rules and flush the timeout
    database, we may end up with invalid references to a timeout object that
    is just gone.

    Resolve this problem by setting the timeout reference to NULL when the
    custom timeout entry is removed from our base. This patch requires some
    RCU trickery to ensure safe pointer handling.

    This handling is similar to what we already do with conntrack helpers:
    the idea is to avoid bumping the timeout object reference counter from
    the packet path, to avoid the cost of atomic ops.
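
    Schematically, when the custom timeout entry is removed, the template's
    reference is cleared under RCU, roughly (the nf_ct_timeout_find() helper
    and the timeout field name are assumptions):

    struct nf_conn_timeout *timeout_ext = nf_ct_timeout_find(ct);

    if (timeout_ext)
        RCU_INIT_POINTER(timeout_ext->timeout, NULL);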

    Reported-by: Stephen Hemminger
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

19 Sep, 2015

1 commit


06 Sep, 2015

1 commit

  • Conflicts:
    include/net/netfilter/nf_conntrack.h

    The conflict was an overlap between changing the type of the zone
    argument to nf_ct_tmpl_alloc() whilst exporting nf_ct_tmpl_free.

    Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net, they are:

    1) Oneliner to restore maps in nf_tables since we support addressing registers
    at 32 bits level.

    2) Restore previous default behaviour in bridge netfilter when CONFIG_IPV6=n,
    oneliner from Bernhard Thaler.

    3) Out of bound access in ipset hash:net* set types, reported by Dave Jones'
    KASan utility, patch from Jozsef Kadlecsik.

    4) Fix ipset compilation with gcc 4.4.7 related to C99 initialization of
    unnamed unions, patch from Elad Raz.

    5) Add a workaround to address inconsistent endianness in the res_id field of
    nfnetlink batch messages, reported by Florian Westphal.

    6) Fix error paths of CT/synproxy since the conntrack template was moved to use
    kmalloc, patch from Daniel Borkmann.

    All of them look good to me to reach 4.2, I can route this to -stable myself
    too, just let me know what you prefer.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Sep, 2015

1 commit

  • Fengguang reported that a randconfig build generated the following linker
    issue with the nf_ct_zone_dflt object involved:

    [...]
    CC init/version.o
    LD init/built-in.o
    net/built-in.o: In function `ipv4_conntrack_defrag':
    nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
    net/built-in.o: In function `ipv6_defrag':
    nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt'
    make: *** [vmlinux] Error 1

    Given that configurations exist where we have a built-in part that
    accesses nf_ct_zone_dflt, such as the two handlers nf_ct_defrag_user()
    and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
    module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
    area when netfilter is configured in general.

    Therefore, split the more generic parts into a common header under
    include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
    section that already holds parts related to CONFIG_NF_CONNTRACK in the
    netfilter core. This fixes the issue on my side.

    Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions")
    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

01 Sep, 2015

1 commit

  • Commit 0838aa7fcfcd ("netfilter: fix netns dependencies with conntrack
    templates") migrated templates to the new allocator api, but forgot to
    update error paths for them in CT and synproxy to use nf_ct_tmpl_free()
    instead of nf_conntrack_free().

    Due to that, memory is being freed into the wrong kmemcache, but we also
    drop the per-net reference count of ct objects, causing an imbalance.

    In Brad's case, this leads to a wrap-around of net->ct.count and thus
    lets __nf_conntrack_alloc() refuse to create a new ct object:

    [ 10.340913] xt_addrtype: ipv6 does not support BROADCAST matching
    [ 10.810168] nf_conntrack: table full, dropping packet
    [ 11.917416] r8169 0000:07:00.0 eth0: link up
    [ 11.917438] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
    [ 12.815902] nf_conntrack: table full, dropping packet
    [ 15.688561] nf_conntrack: table full, dropping packet
    [ 15.689365] nf_conntrack: table full, dropping packet
    [ 15.690169] nf_conntrack: table full, dropping packet
    [ 15.690967] nf_conntrack: table full, dropping packet
    [...]

    With slab debugging, it also reports the wrong kmemcache (kmalloc-512 vs.
    nf_conntrack_ffffffff81ce75c0) and reports poison overwrites, etc. Thus,
    to fix the problem, export and use nf_ct_tmpl_free() instead.
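
    Schematically, the corrected error paths now look like this (the failure
    condition is illustrative):

    ct = nf_ct_tmpl_alloc(net, zone, GFP_KERNEL);
    if (!ct)
        return -ENOMEM;

    if (setup_fails) {              /* illustrative condition */
        nf_ct_tmpl_free(ct);        /* not nf_conntrack_free() */
        return -ENOMEM;
    }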

    Fixes: 0838aa7fcfcd ("netfilter: fix netns dependencies with conntrack templates")
    Reported-by: Brad Jackson
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     

21 Aug, 2015

1 commit


18 Aug, 2015

2 commits

  • This work adds the possibility of deriving the zone id from the skb->mark
    field in a scalable manner. This allows for having only a single template
    serving hundreds/thousands of different zones, for example, instead of
    needing one match for each zone as an extra CT jump target.

    Note that we need to have this information attached to the template
    because, at the time we try to look up a possible ct object, we already
    need to know the zone information for a possible match when going into
    __nf_conntrack_find_get(). This work provides a minimal implementation
    for a possible mapping.

    In order to not add/expose an extra ct->status bit, the zone structure has
    been extended to carry a flag for deriving the mark.
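
    Schematically, when the template's zone carries that flag, the zone id
    used for the lookup is taken from the packet mark (the flag name is an
    assumption):

    const struct nf_conntrack_zone *zone = nf_ct_zone(tmpl);
    u16 zone_id;

    if (zone->flags & NF_CT_FLAG_MARK)
        zone_id = (u16)skb->mark;   /* derive the zone from skb->mark */
    else
        zone_id = zone->id;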

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     
  • This work adds a direction parameter to netfilter zones, so identity
    separation can be performed only in original/reply or both directions
    (default). This basically opens up the possibility of doing NAT with
    conflicting IP address/port tuples from multiple, isolated tenants
    on a host (e.g. from a netns) without requiring each tenant to NAT
    twice or to use its own dedicated IP address to SNAT to: overlapping
    tuples can be made unique with the zone identifier in the original
    direction, and the NAT engine will then allocate a unique tuple in the
    commonly shared default zone for the reply direction. In some
    restricted, local DNAT cases, port redirection could also be used to
    make the reply traffic unique without requiring SNAT.

    The consensus we've reached and discussed at NFWS and since the initial
    implementation [1] was to directly integrate the direction meta data
    into the existing zones infrastructure, as opposed to the ct->mark
    approach we proposed initially.

    As we pass the nf_conntrack_zone object directly around, we don't have
    to touch all call-sites, only those that contain equality checks of
    zones. Thus, based on the current direction (original or reply),
    we either return the actual id or the default NF_CT_DEFAULT_ZONE_ID.
    CT expectations are direction-agnostic entities when expectations are
    being compared among themselves, so we can only use the identifier
    in this case.
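
    A sketch of that per-direction id selection (the helper name
    nf_ct_zone_id() is an assumption; NF_CT_DEFAULT_ZONE_ID is from the
    text above):

    static u16 nf_ct_zone_id(const struct nf_conntrack_zone *zone,
                             enum ip_conntrack_dir dir)
    {
        /* return the configured id only for the direction(s) the zone
         * applies to, otherwise fall back to the shared default zone */
        return (zone->dir & (1 << dir)) ? zone->id : NF_CT_DEFAULT_ZONE_ID;
    }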

    Note that zone identifiers cannot be included in the hash mix
    anymore, as they don't contain a "stable" value that would be equal
    for both directions at all times; e.g. if only zone->id were
    unconditionally xor'ed into the table slot hash, replies would no
    longer find the corresponding conntrack entry.

    If no particular direction is specified when configuring zones, the
    behaviour is exactly as we expect currently (both directions).

    Support has been added for the CT netlink interface as well as the
    x_tables raw CT target, which both already offer existing interfaces
    to user space for the configuration of zones.

    Below is a minimal, simplified collision example (script in [2]) with
    netperf sessions:

    [topology diagram: tenant-1 and tenant-2 netns each run netperf with
    overlapping [ip,sport] := X and marks 1 and 2; mark routing feeds a
    gateway netns doing SNAT with CT zone := mark [ORIGINAL]; traffic
    reaches a netserver at [ip,port] := Y]
    On the gateway netns, example:

    iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
    iptables -t nat -A POSTROUTING -o -j SNAT --to-source --random-fully

    iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
    iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark

    With netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 running from
    each tenant netns, the conntrack dump from the gateway netns shows:

    tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
    src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
    [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

    tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
    src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
    [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1

    tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
    src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
    [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

    tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
    src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
    [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2

    Taking this further, the test script in [2] creates 200 tenants and runs
    an original-tuple-colliding netperf session in each. A conntrack -L dump
    in the gateway netns also confirms 200 overlapping entries, all in
    ESTABLISHED state, as expected.

    I also ran various other tests with some permutations of the script, to
    mention some: SNAT in random/random-fully/persistent mode, no zones (no
    overlaps), static zones (original, reply, both directions), etc.

    [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
    [2] https://paste.fedoraproject.org/242835/65657871/

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     

11 Aug, 2015

1 commit

  • This patch replaces the zone id, which is pushed down into functions,
    with the actual zone object. It's a bigger one-time change, but needed
    for later extending zones with a direction parameter, thus decoupling
    this additional information from all call-sites.

    No functional changes in this patch.

    The default zone becomes a global const object, namely nf_ct_zone_dflt,
    and will be returned directly in various cases, one being when there is
    e.g. no zoning support.
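
    Schematically, call-sites now deal with a zone pointer and the shared
    default, roughly:

    const struct nf_conntrack_zone *zone;

    /* use the template's zone when there is one, else the global default */
    zone = tmpl ? nf_ct_zone(tmpl) : &nf_ct_zone_dflt;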

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     

05 Aug, 2015

1 commit


30 Jul, 2015

1 commit


20 Jul, 2015

1 commit

  • Quoting Daniel Borkmann:

    "When adding connection tracking template rules to a netns, f.e. to
    configure netfilter zones, the kernel will endlessly busy-loop as soon
    as we try to delete the given netns in case there's at least one
    template present, which is problematic e.g. if there is such bravery that
    the privileged user inside the netns is assumed untrusted.

    Minimal example:

    ip netns add foo
    ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
    ip netns del foo

    What happens is that when nf_ct_iterate_cleanup() is being called from
    nf_conntrack_cleanup_net_list() for a provided netns, we always end up
    with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
    don't get a soft lockup as we still have a schedule() point, but the
    serving CPU spins at 100% from that point onwards.

    Since templates are normally allocated with nf_conntrack_alloc(), we
    also bump net->ct.count. The reason they have not yet been nf_ct_put()
    is that the per-netns .exit() handler from x_tables (which would
    eventually invoke xt_CT's xt_ct_tg_destroy(), which drops the reference
    on info->ct) is called in the dependency chain at a *later* point in
    time than the per-netns .exit() handler for the connection tracker.

    This is clearly a chicken'n'egg problem: after the connection tracker
    .exit() handler, we've already torn down all the connection tracking
    infrastructure, so rightfully, xt_ct_tg_destroy() cannot be
    invoked at a later point in time during the netns cleanup, as that would
    lead to a use-after-free. At the same time, we cannot make x_tables depend
    on the connection tracker module, so that the xt_ct_tg_destroy() would
    be invoked earlier in the cleanup chain."

    Daniel confirms this has to do with the order in which modules are loaded,
    or with having compiled nf_conntrack as a module while x_tables is
    built-in. So we have no guarantee regarding the order in which netns
    callbacks are executed.

    Fix this by allocating the templates through kmalloc() from the respective
    SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
    Then, release them via nf_ct_tmpl_free() from destroy_conntrack(). This branch
    is marked as unlikely since conntrack templates are rarely allocated and only
    from the configuration plane path.

    Note that templates are no longer kept in any list, to avoid further
    dependencies on nf_conntrack; thus, the tmpl larval list is removed.
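
    A rough sketch of the kmalloc-based template allocation (simplified; the
    exact signature of nf_ct_tmpl_alloc() at that time is assumed):

    struct nf_conn *nf_ct_tmpl_alloc(struct net *net, u16 zone, gfp_t flags)
    {
        struct nf_conn *tmpl = kzalloc(sizeof(*tmpl), flags);

        if (!tmpl)
            return NULL;

        tmpl->status = IPS_TEMPLATE;
        write_pnet(&tmpl->ct_net, net);
        /* zone/extensions are set up as before, but nothing is added to
         * any list and net->ct.count is not bumped */
        return tmpl;
    }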

    Reported-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso
    Tested-by: Daniel Borkmann

    Pablo Neira Ayuso