09 Feb, 2011

1 commit

  • The TCP tracking code has a special case that allows it to return
    NF_REPEAT if we receive a new SYN packet while in the TIME_WAIT state.

    In this situation, the TCP tracking code destroys the existing
    conntrack to start a new clean session.

    [DESTROY] tcp 6 src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925 [ASSURED]
    [NEW] tcp 6 120 SYN_SENT src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 [UNREPLIED] src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925

    However, this is a problem for the iptables CT target's event
    filtering, which does not work in this case since the conntrack
    template is no longer attached to the packet for the new session. To
    fix this, we reassign the conntrack template to the packet if we
    return NF_REPEAT.
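
    In outline, the tail of nf_conntrack_in() becomes something like this
    (a sketch based on the description above, not the verbatim patch):

    out:
            if (tmpl) {
                    /* Special case: we have to repeat this hook, so hand
                     * the template back to the packet rather than dropping
                     * the reference; the new session can then pick it up. */
                    if (ret == NF_REPEAT)
                            skb->nfct = (struct nf_conntrack *)tmpl;
                    else
                            nf_ct_put(tmpl);
            }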

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     

21 Sep, 2010

1 commit

  • Since the tuple in the original direction doesn't change, we can
    compute its hash once and save it in
    ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev (unused until the entry
    is confirmed) for later use by __nf_conntrack_confirm().

    __hash_conntrack() is split into two steps: hash_conntrack_raw() is used
    to get the raw hash, and __hash_bucket() is used to get the bucket id.

    In the SYN-flood case, early_drop() doesn't need to recompute the hash.
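
    In outline, the split and the stashed hash look like this (a sketch;
    the exact signatures in the patch may differ):

            /* compute the raw tuple hash exactly once per packet */
            u32 hash = hash_conntrack_raw(&tuple, zone);

            /* derive a bucket id from the raw hash wherever needed */
            unsigned int bucket = __hash_bucket(hash, net->ct.htable_size);

            /* stash the raw hash in the still-unused reply-direction
             * pprev for __nf_conntrack_confirm() to pick up later */
            ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev =
                    (struct hlist_nulls_node **)(unsigned long)hash;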

    Signed-off-by: Changli Gao
    Signed-off-by: Patrick McHardy

    Changli Gao
     

08 Jun, 2010

1 commit

  • NOTRACK makes all CPUs share a cache line on nf_conntrack_untracked
    twice per packet (one atomic refcount increment and one decrement).
    This is bad for performance, and the __read_mostly annotation is also
    a bad choice for data written this often.

    This patch introduces an IPS_UNTRACKED bit so that we can more easily
    switch to a per-cpu untracked structure later.

    A new helper, nf_ct_untracked_get(), returns a pointer to
    nf_conntrack_untracked.

    Another helper, nf_ct_untracked_status_or(), is used by nf_nat_init()
    to add the IPS_NAT_DONE_MASK bits to the untracked status.

    The nf_ct_is_untracked() prototype is changed to work on an nf_conn
    pointer.
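
    The new helpers are essentially (a sketch; return types may differ in
    the actual patch):

            static inline struct nf_conn *nf_ct_untracked_get(void)
            {
                    return &nf_conntrack_untracked;
            }

            static inline void nf_ct_untracked_status_or(unsigned long bits)
            {
                    nf_conntrack_untracked.status |= bits;
            }

            static inline bool nf_ct_is_untracked(const struct nf_conn *ct)
            {
                    return test_bit(IPS_UNTRACKED_BIT, &ct->status);
            }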

    Signed-off-by: Eric Dumazet
    Signed-off-by: Patrick McHardy

    Eric Dumazet
     

09 Feb, 2010

3 commits

  • As noticed by Jon Masters, the conntrack hash size is global and not
    per namespace, but modifiable at runtime through
    /sys/module/nf_conntrack/hashsize. Changing the hash size will only
    resize the hash in the current namespace, however, so other namespaces
    will use an invalid hash size. This can cause crashes when enlarging
    the hash size, or false negative lookups when shrinking it.

    Move the hash size into the per-namespace data and only use the global
    hash size to initialize the per-namespace value when instantiating a
    new namespace. Additionally, restrict hash resizing to init_net for
    now, as other namespaces are not currently handled.
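
    Roughly, each new namespace now seeds its own hash from the module
    parameter (a sketch; field and helper names follow this tree, error
    handling omitted):

            /* in the per-namespace init path */
            net->ct.htable_size = nf_conntrack_htable_size;
            net->ct.hash = nf_ct_alloc_hashtable(&net->ct.htable_size,
                                                 &net->ct.hash_vmalloc, 1);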

    Cc: stable@kernel.org
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • nf_conntrack_cachep is currently shared by all netns instances, but
    because of the special SLAB_DESTROY_BY_RCU semantics, this is wrong.

    If we use a shared slab cache, one object can instantly move from one
    hash table (netns ONE) to another (netns TWO), and a concurrent reader
    (doing a lookup in netns ONE and 'finding' an object that now belongs
    to netns TWO) can be fooled without notice, because no RCU grace
    period has to be observed between an object being freed and its reuse.

    We don't have this problem with the UDP/TCP slab caches because the
    TCP/UDP hashtables are global to the machine (and each object has a
    pointer to its netns).

    If we use per-netns conntrack hash tables, we also *must* use
    per-netns conntrack slab caches, to guarantee an object cannot escape
    from one namespace to another.
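
    The per-netns cache is created with a unique name per namespace (the
    allocation Patrick added), roughly:

            net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%p", net);
            if (net->ct.slabname == NULL)
                    return -ENOMEM;

            net->ct.nf_conntrack_cachep =
                    kmem_cache_create(net->ct.slabname,
                                      sizeof(struct nf_conn), 0,
                                      SLAB_DESTROY_BY_RCU, NULL);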

    Signed-off-by: Eric Dumazet
    [Patrick: added unique slab name allocation]
    Cc: stable@kernel.org
    Signed-off-by: Patrick McHardy

    Eric Dumazet
     
  • As discovered by Jon Masters, the "untracked" conntrack, which is
    located in the data section, might be accidentally freed when a new
    namespace is instantiated while the untracked conntrack is attached
    to an skb, because the reference count is re-initialized.

    The best fix would be to use a separate untracked conntrack per
    namespace, since it includes a namespace pointer. Unfortunately this
    is not possible without larger changes, since the namespace is not
    easily available everywhere we need it. For now, move the untracked
    conntrack initialization to the init_net setup function to make sure
    the reference count is not re-initialized, and handle cleanup in the
    init_net cleanup function to make sure namespaces can exit properly
    while the untracked conntrack is in use in other namespaces.
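
    A sketch of the init_net-only initialization described above (exact
    code may differ):

            if (net_eq(net, &init_net)) {
                    /* set up the untracked conntrack exactly once, so its
                     * refcount is never re-initialized behind an skb */
                    atomic_set(&nf_conntrack_untracked.ct_general.use, 1);
                    /* and make it look like a confirmed connection */
                    set_bit(IPS_CONFIRMED_BIT, &nf_conntrack_untracked.status);
            }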

    Cc: stable@kernel.org
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

03 Feb, 2010

3 commits

  • Support initializing selected parameters of new conntrack entries from a
    "conntrack template", which is a specially marked conntrack entry attached
    to the skb.

    Currently the helper and the event delivery masks can be initialized this
    way.
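
    For illustration, this is roughly how the iptables CT target from
    this series builds and attaches such a template (a sketch with
    hypothetical local names):

            /* at rule-checkentry time: build the template */
            struct nf_conntrack_tuple t;
            struct nf_conn *ct;

            memset(&t, 0, sizeof(t));
            ct = nf_conntrack_alloc(net, &t, &t, GFP_KERNEL);
            if (IS_ERR(ct))
                    return PTR_ERR(ct);
            __set_bit(IPS_TEMPLATE_BIT, &ct->status);
            __set_bit(IPS_CONFIRMED_BIT, &ct->status);

            /* at packet time: attach the template to the skb */
            atomic_inc(&ct->ct_general.use);
            skb->nfct = &ct->ct_general;
            skb->nfctinfo = IP_CT_NEW;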

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • Add two masks for conntrack and expectation events to struct
    nf_conntrack_ecache and use them to filter events. Their default value
    is "all events" when the event sysctl is on and "no events" when it is
    off. A following patch will add specific initializations. Expectation
    events depend on the ecache struct of their master conntrack.
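
    The extended structure then looks roughly like this (sketch):

            struct nf_conntrack_ecache {
                    unsigned long cache;    /* bitops want long */
                    unsigned long missed;   /* missed events */
                    u16 ctmask;             /* bitmask of ct events to be delivered */
                    u16 expmask;            /* bitmask of expect events to be delivered */
                    u32 pid;                /* netlink pid of destroyer */
            };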

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • Split up the IPCT_STATUS event into an IPCT_REPLY event, which is
    generated when the IPS_SEEN_REPLY bit is set, and an IPCT_ASSURED
    event, which is generated when the IPS_ASSURED bit is set.

    In combination with a following patch to support selective event
    delivery, this can be used for "sparse" conntrack replication: start
    replicating the conntrack entry only after it has reached the ASSURED
    state, which makes the replication SYN-flood resistant.
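
    Schematically, the two events are generated like this (sketch):

            if (!test_and_set_bit(IPS_SEEN_REPLY_BIT, &ct->status))
                    nf_conntrack_event_cache(IPCT_REPLY, ct);

            if (!test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
                    nf_conntrack_event_cache(IPCT_ASSURED, ct);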

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

10 Nov, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits)
    net/fsl_pq_mdio: add module license GPL
    can: fix WARN_ON dump in net/core/rtnetlink.c:rtmsg_ifinfo()
    can: should not use __dev_get_by_index() without locks
    hisax: remove bad udelay call to fix build error on ARM
    ipip: Fix handling of DF packets when pmtudisc is OFF
    qlge: Set PCIe reset type for EEH to fundamental.
    qlge: Fix early exit from mbox cmd complete wait.
    ixgbe: fix traffic hangs on Tx with ioatdma loaded
    ixgbe: Fix checking TFCS register for TXOFF status when DCB is enabled
    ixgbe: Fix gso_max_size for 82599 when DCB is enabled
    macsonic: fix crash on PowerBook 520
    NET: cassini, fix lock imbalance
    ems_usb: Fix byte order issues on big endian machines
    be2net: Bug fix to send config commands to hardware after netdev_register
    be2net: fix to set proper flow control on resume
    netfilter: xt_connlimit: fix regression caused by zero family value
    rt2x00: Don't queue ieee80211 work after USB removal
    Revert "ipw2200: fix oops on missing firmware"
    decnet: netdevice refcount leak
    netfilter: nf_nat: fix NAT issue in 2.6.30.4+
    ...

    Linus Torvalds
     

06 Nov, 2009

1 commit

  • Vitezslav Samel discovered that since 2.6.30.4+, active FTP cannot
    work over NAT. The "cause" of the problem was a fix of unacknowledged
    data detection with NAT (commit
    a3a9f79e361e864f0e9d75ebe2a0cb43d17c4272). Actually, that fix
    uncovered a long-standing bug in TCP conntrack: when NAT was enabled,
    we simply updated the maximum of the right edge of the segments we
    have seen (td_end) by the offset NAT produced when changing the
    IP/port in the data, but we did not update the other parameter
    (td_maxend), which is also affected by the NAT offset. It could thus
    drift away from the correct value, which broke active FTP.

    The patch below fixes the issue by *not* updating the conntrack
    parameters from NAT, but instead taking the NAT offsets into account
    in conntrack in a consistent way. (Updating from NAT would be harder
    and more expensive, because it would need to re-calculate parameters
    we already calculated in conntrack.)
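
    The idea, as a sketch of tcp_in_window() (NAT_OFFSET() stands in for
    the offset-lookup helper this fix introduces; the exact name and
    signature may differ):

            /* take the other direction's NAT sequence offset into account
             * instead of shifting td_maxend from the NAT engine */
            receiver_offset = NAT_OFFSET(pf, ct, !dir, ntohl(tcph->ack_seq) - 1);
            ack  = ntohl(tcph->ack_seq) - receiver_offset;
            sack = sack - receiver_offset;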

    Signed-off-by: Jozsef Kadlecsik
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jozsef Kadlecsik
     

22 Sep, 2009

1 commit

  • Sizing of memory allocations shouldn't depend on the number of
    physical pages found in a system, as that generally includes (perhaps
    a huge amount of) non-RAM pages. The amount of memory actually usable
    as storage should be used as the basis here instead.

    Some of the calculations (i.e. those not intending to use high memory)
    should likely even use (totalram_pages - totalhigh_pages).
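
    For the conntrack hashtable sizing this means, roughly (sketch):

            /* base the default hash size on usable RAM, not on the
             * physical page span */
            nf_conntrack_htable_size
                    = (((totalram_pages << PAGE_SHIFT) / 16384)
                       / sizeof(struct hlist_nulls_head));
            if (totalram_pages > (1024 * 1024 * 1024 / PAGE_SIZE))
                    nf_conntrack_htable_size = 16384;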

    Signed-off-by: Jan Beulich
    Acked-by: Rusty Russell
    Acked-by: Ingo Molnar
    Cc: Dave Airlie
    Cc: Kyle McMartin
    Cc: Jeremy Fitzhardinge
    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Patrick McHardy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

16 Jul, 2009

1 commit

  • When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when
    allocating objects, since the slab allocator may hand out a freed
    object that is still in use by lockless readers.

    In particular, nf_conntrack RCU lookups rely on
    ct->tuplehash[xxx].hnnode.next always being valid (i.e. containing a
    valid 'nulls' value, or a valid pointer to the next object in the
    hash chain).

    kmem_cache_zalloc() sets up the object with NULL values, but NULL is
    not a valid value for ct->tuplehash[xxx].hnnode.next.

    The fix is to call kmem_cache_alloc() and do the zeroing ourselves.

    As spotted by Patrick, we also need to make sure the lookup keys are
    committed to memory before setting the refcount to 1, or a lockless
    reader could get a reference on the old version of the object. Its
    key re-check could then pass the barrier.
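
    Schematically, the allocation path becomes (a sketch of the described
    fix):

            ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
            if (ct == NULL)
                    return ERR_PTR(-ENOMEM);

            /* zero everything behind the tuplehash array, leaving
             * tuplehash[].hnnode.next intact for lockless readers; the
             * tuples themselves are assigned explicitly afterwards */
            memset(&ct->tuplehash[IP_CT_DIR_MAX], 0,
                   sizeof(*ct) - offsetof(struct nf_conn,
                                          tuplehash[IP_CT_DIR_MAX]));

            /* commit the lookup keys to memory before the object becomes
             * visible via a non-zero refcount */
            smp_wmb();
            atomic_set(&ct->ct_general.use, 1);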

    Signed-off-by: Eric Dumazet
    Signed-off-by: Patrick McHardy

    Eric Dumazet
     

22 Jun, 2009

3 commits

  • The RCU-protected conntrack hash lookup only checks whether the entry
    has a refcount of zero to decide whether it is stale. This is not
    sufficient: entries are explicitly removed while there is at least
    one reference left, possibly more. Explicitly check whether the entry
    has been marked as dying to fix this.
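
    A sketch of the strengthened check in the lookup path:

            ct = nf_ct_tuplehash_to_ctrack(h);
            if (unlikely(nf_ct_is_dying(ct) ||
                         !atomic_inc_not_zero(&ct->ct_general.use)))
                    h = NULL;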

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • New connection tracking entries are inserted into the hash before they
    are fully set up, namely the CONFIRMED bit is not set and the timer is
    not started yet. This can theoretically lead to a race with the timer
    code, which could end up setting the timeout to a relative value, most
    likely one already in the past.

    Perform the hash insertion as the final step to fix this.
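
    The confirmation path then ends up ordered like this (sketch):

            /* timer and status first ... */
            ct->timeout.expires += jiffies;
            add_timer(&ct->timeout);
            atomic_inc(&ct->ct_general.use);
            set_bit(IPS_CONFIRMED_BIT, &ct->status);

            /* ... hash insertion as the very last step, making the entry
             * visible to other CPUs only when fully set up */
            __nf_conntrack_hash_insert(ct, hash, repl_hash);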

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     
  • death_by_timeout() might delete a conntrack from the hash list
    and insert it in the dying list:

    nf_ct_delete_from_lists(ct);
    nf_ct_insert_dying_list(ct);

    I believe a (lockless) reader could *catch* ct while doing a lookup
    and miss the end of its chain. (The nulls lookup algorithm must check
    the null value at the end of the lookup and restart if the null value
    is not the expected one; cf. Documentation/RCU/rculist_nulls.txt for
    details.)

    We need to change nf_conntrack_init_net() to use different "null"
    values, guaranteed not to be used in regular lists. Choose very large
    values, since the hash table uses null values in [0..size-1].
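
    Concretely, something like this (sketch; values well outside the hash
    table's range, as described above):

            #define UNCONFIRMED_NULLS_VAL   ((1 << 30) + 0)
            #define DYING_NULLS_VAL         ((1 << 30) + 1)

            INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, UNCONFIRMED_NULLS_VAL);
            INIT_HLIST_NULLS_HEAD(&net->ct.dying, DYING_NULLS_VAL);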

    Signed-off-by: Eric Dumazet
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Eric Dumazet
     

13 Jun, 2009

4 commits

  • This patch improves ctnetlink event reliability if one broadcast
    listener has set the NETLINK_BROADCAST_ERROR socket option.

    The logic is the following: if an event delivery fails, we keep
    the undelivered events in the missed-events cache. Once the next
    packet arrives, we add the new events (if any) to the missed
    events in the cache and try delivery again, and so on. Thus,
    if ctnetlink fails to deliver an event, we retry once we see a
    new packet. Therefore, we may lose state transitions, but the
    userspace process gets back in sync at some point.

    In the worst case, if no events were delivered to userspace, we
    make sure that destroy events are successfully delivered.
    Basically, if ctnetlink fails to deliver the destroy event, we
    remove the conntrack entry from the hashes and insert it into the
    dying list, which contains inactive entries. Then, the conntrack
    timer is added with an extra grace timeout of random32() % 15
    seconds to trigger the event again (this grace timeout is tunable
    via /proc). The use of a limited random timeout value spreads out
    the "destroy" resends, thus avoiding the accumulation of lots of
    "destroy" events at the same time. Events may be delivered out of
    order, but we can identify them by means of the tuple plus the
    conntrack ID.

    The maximum number of conntrack entries (active or inactive) is
    still handled by nf_conntrack_max. Thus, we may start dropping
    packets at some point if we accumulate a lot of inactive conntrack
    entries that did not successfully report the destroy event to
    userspace.

    During my stress tests, consisting of setting a very small buffer
    of 2048 bytes for conntrackd plus the NETLINK_BROADCAST_ERROR
    socket flag, and generating lots of very small connections, I
    noticed very few destroy entries on the fly waiting to be resent.

    A simple way to test this patch consists of creating a lot of
    entries, setting a very small Netlink buffer in conntrackd (+ a
    patch which is not in the git tree to set the BROADCAST_ERROR
    flag) and invoking `conntrack -F'.

    For expectations, no changes are introduced in this patch.
    Currently, event delivery is only done for new expectations (no
    events for expectation expiration, removal and confirmation).
    Covering those would require a per-expectation event cache
    implementing the same idea that this patch introduces.

    This patch can be useful to provide reliable flow accounting. We
    still have to add a new conntrack extension to store the creation
    and destroy time.
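
    A sketch of the destroy-event retry in the timer path (names as
    described above; details may differ from the actual patch):

            static void death_by_timeout(unsigned long ul_conntrack)
            {
                    struct nf_conn *ct = (void *)ul_conntrack;

                    if (!test_bit(IPS_DYING_BIT, &ct->status) &&
                        unlikely(nf_conntrack_event(IPCT_DESTROY, ct) < 0)) {
                            /* destroy event was not delivered: park the
                             * entry on the dying list; inserting it there
                             * rearms the timer with the random grace
                             * timeout to retry the event */
                            nf_ct_delete_from_lists(ct);
                            nf_ct_insert_dying_list(ct);
                            return;
                    }
                    set_bit(IPS_DYING_BIT, &ct->status);
                    nf_ct_delete_from_lists(ct);
                    nf_ct_put(ct);
            }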

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     
  • This patch moves the helper destruction to a function that lives
    in nf_conntrack_helper.c. This new function is used by the follow-up
    patch that adds reliable ctnetlink event delivery.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     
  • This patch reworks the per-cpu event caching to use the conntrack
    extension infrastructure.

    The main drawback is that we consume more memory per conntrack
    if event delivery is enabled. This patch is required by the
    reliable event delivery that follows.

    BTW, this patch allows you to enable/disable event delivery at
    runtime via /proc/sys/net/netfilter/nf_conntrack_events, although
    you can still disable event caching as a compile-time option.
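
    A sketch of the idea: the event cache is attached as an extension
    when the conntrack is allocated (the extension id is the one this
    patch adds):

            struct nf_conntrack_ecache *e;

            /* attach the event cache as an extension; if this fails the
             * entry simply generates no events */
            e = nf_ct_ext_add(ct, NF_CT_EXT_ECACHE, GFP_ATOMIC);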

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     
  • Use mod_timer_pending() instead of an atomic del_timer()/add_timer()
    sequence. mod_timer_pending() does not rearm an inactive timer,
    so we don't need the conntrack lock anymore to make sure we don't
    accidentally rearm the timer of a conntrack which is in the process
    of being destroyed.

    With this change, we don't need to take the global lock at all
    anymore; counter updates can be performed under the per-conntrack
    lock.
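
    Schematically, the refresh path becomes (sketch):

            if (!nf_ct_is_confirmed(ct)) {
                    /* not in the hash yet: the timer isn't running */
                    ct->timeout.expires = extra_jiffies;
            } else {
                    unsigned long newtime = jiffies + extra_jiffies;

                    /* only rearm a still-pending timer; a conntrack being
                     * destroyed is left alone, no global lock needed */
                    if (newtime - ct->timeout.expires >= HZ)
                            mod_timer_pending(&ct->timeout, newtime);
            }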

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

03 Jun, 2009

1 commit

  • This patch simplifies the conntrack event caching system by removing
    several events:

    * IPCT_[*]_VOLATILE, IPCT_HELPINFO and IPCT_NATINFO have been deleted
    since they have no clients.
    * IPCT_COUNTER_FILLING, which is a leftover from the 32-bit counter
    days.
    * IPCT_REFRESH, which is of no use since we always include the
    timeout in the messages.

    After this patch, the remaining events are:

    * IPCT_NEW, IPCT_RELATED and IPCT_DESTROY, which are used to identify
    the addition and deletion of entries.
    * IPCT_STATUS, which notes that the status bits have changed,
    e.g. IPS_SEEN_REPLY and IPS_ASSURED.
    * IPCT_PROTOINFO, which reports that internal protocol information has
    changed, e.g. the TCP, DCCP and SCTP protocol state.
    * IPCT_HELPER, which notes that a helper has been assigned to or
    unassigned from this entry.
    * IPCT_MARK and IPCT_SECMARK, which report that the mark has changed;
    this covers the case where a mark is set to zero.
    * IPCT_NATSEQADJ, which reports updates to the NAT sequence
    adjustment.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso