14 Apr, 2014

2 commits

  • nft_cmp_fast is used for equality comparisons of size <= 4 bytes. For
    comparisons of size < 4 bytes a mask is calculated that is applied to
    both the data from userspace (during initialization) and the register
    value (during runtime). Both values are stored using (in effect) memcpy
    to a memory area that is then interpreted as u32 by nft_cmp_fast.

    This works fine on little endian since smaller types have the same base
    address, however on big endian this is not true and the smaller types
    are interpreted as a big number with trailing zero bytes.

    The mask therefore must not include the lower bytes, but the higher bytes
    on big endian. Add a helper function that does a cpu_to_le32 to switch
    the bytes on big endian. Since we're dealing with a mask of just
    consecutive bits, this works out fine.
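
    A rough sketch of such a helper (name and exact placement assumed, not
    necessarily the upstream code):

      /* len is the compared length in bits (a multiple of 8, < 32).
       * On little endian this keeps the low-order bits; cpu_to_le32()
       * byteswaps the mask on big endian so it covers the bytes the
       * memcpy actually wrote. */
      static inline u32 nft_cmp_fast_mask(unsigned int len)
      {
              return cpu_to_le32(~0U >> (sizeof(u32) * BITS_PER_BYTE - len));
      }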

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • [ 251.920788] INFO: trying to register non-static key.
    [ 251.921386] the code is fine but needs lockdep annotation.
    [ 251.921386] turning off the locking correctness validator.
    [ 251.921386] CPU: 2 PID: 15715 Comm: socket_listen Not tainted 3.14.0+ #294
    [ 251.921386] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 251.921386] 0000000000000000 000000009d18c210 ffff880075f039b8 ffffffff816b7ecd
    [ 251.921386] ffffffff822c3b10 ffff880075f039c8 ffffffff816b36f4 ffff880075f03aa0
    [ 251.921386] ffffffff810c65ff ffffffff810c4a85 00000000fffffe01 ffffffffa0075172
    [ 251.921386] Call Trace:
    [ 251.921386] [] dump_stack+0x45/0x56
    [ 251.921386] [] register_lock_class.part.24+0x38/0x3c
    [ 251.921386] [] __lock_acquire+0x168f/0x1b40
    [ 251.921386] [] ? trace_hardirqs_on_caller+0x105/0x1d0
    [ 251.921386] [] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
    [ 251.921386] [] ? _raw_spin_unlock_bh+0x35/0x40
    [ 251.921386] [] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
    [ 251.921386] [] lock_acquire+0xa2/0x120
    [ 251.921386] [] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
    [ 251.921386] [] __nf_conntrack_confirm+0x129/0x410 [nf_conntrack]
    [ 251.921386] [] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
    [ 251.921386] [] ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
    [ 251.921386] [] ? ip_fragment+0x9f0/0x9f0
    [ 251.921386] [] nf_iterate+0xaa/0xc0
    [ 251.921386] [] ? ip_fragment+0x9f0/0x9f0
    [ 251.921386] [] nf_hook_slow+0xa4/0x190
    [ 251.921386] [] ? ip_fragment+0x9f0/0x9f0
    [ 251.921386] [] ip_output+0x92/0x100
    [ 251.921386] [] ip_local_out+0x29/0x90
    [ 251.921386] [] ip_queue_xmit+0x170/0x4c0
    [ 251.921386] [] ? ip_queue_xmit+0x5/0x4c0
    [ 251.921386] [] tcp_transmit_skb+0x498/0x960
    [ 251.921386] [] tcp_connect+0x812/0x960
    [ 251.921386] [] ? ktime_get_real+0x25/0x70
    [ 251.921386] [] ? secure_tcp_sequence_number+0x6a/0xc0
    [ 251.921386] [] tcp_v4_connect+0x317/0x470
    [ 251.921386] [] __inet_stream_connect+0xb5/0x330
    [ 251.921386] [] ? lock_sock_nested+0x33/0xa0
    [ 251.921386] [] ? trace_hardirqs_on+0xd/0x10
    [ 251.921386] [] ? __local_bh_enable_ip+0x75/0xe0
    [ 251.921386] [] inet_stream_connect+0x38/0x50
    [ 251.921386] [] SYSC_connect+0xe7/0x120
    [ 251.921386] [] ? current_kernel_time+0x69/0xd0
    [ 251.921386] [] ? trace_hardirqs_on_caller+0x105/0x1d0
    [ 251.921386] [] ? trace_hardirqs_on+0xd/0x10
    [ 251.921386] [] SyS_connect+0xe/0x10
    [ 251.921386] [] system_call_fastpath+0x16/0x1b
    [ 312.014104] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=60003 jiffies, g=42359, c=42358, q=333)
    [ 312.015097] INFO: Stall ended before state dump start

    Fixes: 93bb0ceb75be ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
    Cc: Jesper Dangaard Brouer
    Cc: Pablo Neira Ayuso
    Cc: Patrick McHardy
    Cc: Jozsef Kadlecsik
    Cc: "David S. Miller"
    Signed-off-by: Andrey Vagin
    Signed-off-by: Pablo Neira Ayuso

    Andrey Vagin
     

08 Apr, 2014

1 commit

  • nf_ct_gre_keymap_flush() removes a nf_ct_gre_keymap object from
    net_gre->keymap_list and frees the object. But it doesn't clean
    a reference on this object from ct_pptp_info->keymap[dir].
    Then nf_ct_gre_keymap_destroy() may release the same object again.

    So nf_ct_gre_keymap_flush() can be called only when we are sure that
    nf_ct_gre_keymap_destroy() will not be called afterwards.

    nf_ct_gre_keymap is created by nf_ct_gre_keymap_add() and the right way
    to destroy it is to call nf_ct_gre_keymap_destroy().

    This patch marks nf_ct_gre_keymap_flush() as static, so it may break
    compilation of third-party modules that use nf_ct_gre_keymap_flush().
    I'm not sure this is the right way to deprecate this function.
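
    In pseudo-C, the double release looks roughly like this (simplified,
    based on the description above):

      /* nf_ct_gre_keymap_flush(): frees the objects on the list ... */
      list_for_each_entry_safe(km, tmp, &net_gre->keymap_list, list) {
              list_del(&km->list);
              kfree(km);      /* ct_pptp_info->keymap[dir] still points here */
      }

      /* ... later, nf_ct_gre_keymap_destroy() for a conntrack: */
      if (ct_pptp_info->keymap[dir])
              list_del(&ct_pptp_info->keymap[dir]->list);  /* use after free */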

    [ 226.540793] general protection fault: 0000 [#1] SMP
    [ 226.541750] Modules linked in: nf_nat_pptp nf_nat_proto_gre
    nf_conntrack_pptp nf_conntrack_proto_gre ip_gre ip_tunnel gre
    ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc xt_nat
    iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
    nf_conntrack veth tun bridge stp llc ppdev microcode joydev pcspkr
    serio_raw virtio_console virtio_balloon floppy parport_pc parport
    pvpanic i2c_piix4 virtio_net drm_kms_helper ttm ata_generic virtio_pci
    virtio_ring virtio drm i2c_core pata_acpi [last unloaded: ip_tunnel]
    [ 226.541776] CPU: 0 PID: 49 Comm: kworker/u4:2 Not tainted 3.14.0-rc8+ #101
    [ 226.541776] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 226.541776] Workqueue: netns cleanup_net
    [ 226.541776] task: ffff8800371e0000 ti: ffff88003730c000 task.ti: ffff88003730c000
    [ 226.541776] RIP: 0010:[] [] __list_del_entry+0x29/0xd0
    [ 226.541776] RSP: 0018:ffff88003730dbd0 EFLAGS: 00010a83
    [ 226.541776] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8800374e6c40 RCX: dead000000200200
    [ 226.541776] RDX: 6b6b6b6b6b6b6b6b RSI: ffff8800371e07d0 RDI: ffff8800374e6c40
    [ 226.541776] RBP: ffff88003730dbd0 R08: 0000000000000000 R09: 0000000000000000
    [ 226.541776] R10: 0000000000000001 R11: ffff88003730d92e R12: 0000000000000002
    [ 226.541776] R13: ffff88007a4c42d0 R14: ffff88007aef0000 R15: ffff880036cf0018
    [ 226.541776] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
    [ 226.541776] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 226.541776] CR2: 00007f07f643f7d0 CR3: 0000000036fd2000 CR4: 00000000000006f0
    [ 226.541776] Stack:
    [ 226.541776] ffff88003730dbe8 ffffffff81389c5d ffff8800374ffbe4 ffff88003730dc28
    [ 226.541776] ffffffffa0162a43 ffffffffa01627c5 ffff88007a4c42d0 ffff88007aef0000
    [ 226.541776] ffffffffa01651c0 ffff88007a4c45e0 ffff88007aef0000 ffff88003730dc40
    [ 226.541776] Call Trace:
    [ 226.541776] [] list_del+0xd/0x30
    [ 226.541776] [] nf_ct_gre_keymap_destroy+0x283/0x2d0 [nf_conntrack_proto_gre]
    [ 226.541776] [] ? nf_ct_gre_keymap_destroy+0x5/0x2d0 [nf_conntrack_proto_gre]
    [ 226.541776] [] gre_destroy+0x27/0x70 [nf_conntrack_proto_gre]
    [ 226.541776] [] destroy_conntrack+0x83/0x200 [nf_conntrack]
    [ 226.541776] [] ? destroy_conntrack+0x27/0x200 [nf_conntrack]
    [ 226.541776] [] ? nf_conntrack_hash_check_insert+0x2e0/0x2e0 [nf_conntrack]
    [ 226.541776] [] nf_conntrack_destroy+0x72/0x180
    [ 226.541776] [] ? nf_conntrack_destroy+0x5/0x180
    [ 226.541776] [] ? kill_l3proto+0x20/0x20 [nf_conntrack]
    [ 226.541776] [] nf_ct_iterate_cleanup+0x14e/0x170 [nf_conntrack]
    [ 226.541776] [] nf_ct_l4proto_pernet_unregister+0x5b/0x90 [nf_conntrack]
    [ 226.541776] [] proto_gre_net_exit+0x19/0x30 [nf_conntrack_proto_gre]
    [ 226.541776] [] ops_exit_list.isra.1+0x39/0x60
    [ 226.541776] [] cleanup_net+0x100/0x1d0
    [ 226.541776] [] process_one_work+0x1ea/0x4f0
    [ 226.541776] [] ? process_one_work+0x188/0x4f0
    [ 226.541776] [] worker_thread+0x11b/0x3a0
    [ 226.541776] [] ? process_one_work+0x4f0/0x4f0
    [ 226.541776] [] kthread+0xed/0x110
    [ 226.541776] [] ? _raw_spin_unlock_irq+0x2c/0x40
    [ 226.541776] [] ? kthread_create_on_node+0x200/0x200
    [ 226.541776] [] ret_from_fork+0x7c/0xb0
    [ 226.541776] [] ? kthread_create_on_node+0x200/0x200
    [ 226.541776] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de
    48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48
    39 c8 74 7a 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89
    42 08
    [ 226.541776] RIP [] __list_del_entry+0x29/0xd0
    [ 226.541776] RSP
    [ 226.612193] ---[ end trace 985ae23ddfcc357c ]---

    Cc: Pablo Neira Ayuso
    Cc: Patrick McHardy
    Cc: Jozsef Kadlecsik
    Cc: "David S. Miller"
    Signed-off-by: Andrey Vagin
    Signed-off-by: Pablo Neira Ayuso

    Andrey Vagin
     

04 Apr, 2014

6 commits

  • The intended format specifier in request_module() is %.*s, not %*.s.
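
    For illustration (plain userspace C, not the kernel code): %.*s takes
    the precision from an argument, while %*.s takes the field width and
    uses a precision of zero:

      #include <stdio.h>

      int main(void)
      {
              const char *name = "nft-chain-42";
              int len = 3;

              printf("[%.*s]\n", len, name);  /* precision 3: "[nft]"          */
              printf("[%*.s]\n", len, name);  /* width 3, precision 0: "[   ]" */
              return 0;
      }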

    Reported-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Currently, nf_tables trims off the set name if it exceeds 15
    bytes, so explicitly reject set names that are too large.
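
    A minimal sketch of such a check (buffer size and error code assumed):

      /* reject instead of silently truncating to sizeof(set->name) - 1 */
      if (nla_strlcpy(set->name, nla[NFTA_SET_NAME],
                      sizeof(set->name)) >= sizeof(set->name))
              return -EINVAL;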

    Reported-by: Giuseppe Longo
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • These aliases do not exist, so the kernel cannot request the
    appropriate match table:

    $ iptables -I INPUT -p tcp -m osf --genre Windows --ttl 2 -j DROP
    iptables: No chain/target/match by that name.

    setsockopt() requests the ipt_osf module, which is not present. Add
    the aliases.
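
    Since xt match modules are requested per address family under the
    ipt_/ip6t_ prefixes, the fix amounts to adding aliases along these
    lines to xt_osf:

      MODULE_ALIAS("ipt_osf");
      MODULE_ALIAS("ip6t_osf");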

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Pablo Neira Ayuso

    Kirill Tkhai
     
  • This simple modification allows iptables to work with the INPUT chain
    in combination with the cgroup module. It could be useful for counting
    ingress traffic per cgroup with the nfacct netfilter module. Counting
    egress traffic that way already worked before.

    It's possible to get a classified sk_buff after PREROUTING because the
    socket lookup is done in early_demux (tcp_v4_early_demux). This works
    for UDP as well.
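
    The match itself boils down to comparing the configured classid against
    the one stored on the early-demuxed socket; a sketch under that
    assumption (struct and field names may differ):

      static bool
      cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
      {
              const struct xt_cgroup_info *info = par->matchinfo;

              if (skb->sk == NULL)    /* no early-demuxed socket yet */
                      return false;

              return (info->id == skb->sk->sk_classid) ^ info->invert;
      }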

    Trivial usage example, assuming we're in the same shell every step
    and we have enough permissions:

    1) Classic net_cls cgroup initialization:

    mkdir /sys/fs/cgroup/net_cls
    mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls

    2) Set up cgroup for interesting application:

    mkdir /sys/fs/cgroup/net_cls/wget
    echo 1 > /sys/fs/cgroup/net_cls/wget/net_cls.classid
    echo $BASHPID > /sys/fs/cgroup/net_cls/wget/cgroup.procs

    3) Create kernel counters:

    nfacct add wget-cgroup-in
    iptables -A INPUT -m cgroup ! --cgroup 1 -m nfacct --nfacct-name wget-cgroup-in

    nfacct add wget-cgroup-out
    iptables -A OUTPUT -m cgroup ! --cgroup 1 -m nfacct --nfacct-name wget-cgroup-out

    4) Network usage:

    wget https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-3.14-rc6.tar.xz

    5) Check results:

    nfacct list

    Cgroup approach is being used for the DataUsage (counting & blocking
    traffic) feature for Samsung's modification of the Tizen OS.

    Signed-off-by: Alexey Perevalov
    Acked-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso

    Alexey Perevalov
     
  • Eric points out that the locks can be global.
    Moreover, both Jesper and Eric note that using only 32 locks increases
    false sharing as only two cache lines are used.

    This increases the lock count to 256 (16 cache lines, assuming a 64-byte
    cacheline and 4 bytes per spinlock).

    Suggested-by: Jesper Dangaard Brouer
    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • We cannot use ARRAY_SIZE() if spinlock_t is an empty struct.
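
    The failure mode, sketched: with SMP and spinlock debugging disabled,
    spinlock_t can be an empty struct, so:

      typedef struct { } spinlock_t;          /* sizeof == 0 on gcc */
      static spinlock_t locks[256];

      /* ARRAY_SIZE(locks) expands to sizeof(locks) / sizeof(locks[0]),
       * i.e. 0 / 0: a division by zero in a constant expression, and
       * the build breaks.  Use the #define for the count instead. */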

    Fixes: 1442e7507dd597 ("netfilter: connlimit: use keyed locks")
    Reported-by: kbuild test robot
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

28 Mar, 2014

1 commit

  • skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
    orphan them. Also, it doesn't handle errors, so this patch takes care of that
    as well, and modifies the callers accordingly. skb_tx_error() is also added to
    the callers so they will signal the failed delivery towards the creator of the
    skb.
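
    A sketch of the resulting caller pattern (label name assumed):

      err = skb_zerocopy(to, from, len, hlen);  /* now returns an error */
      if (err) {
              skb_tx_error(from);   /* signal failed delivery to the creator */
              goto err_free_to;
      }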

    Signed-off-by: Zoltan Kiss
    Signed-off-by: David S. Miller

    Zoltan Kiss
     

18 Mar, 2014

2 commits

  • ARRAY_SIZE(nf_conntrack_locks) is undefined if spinlock_t is an
    empty structure. Replace it with CONNTRACK_LOCKS.

    Fixes: 93bb0ceb75be ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
    Reported-by: kbuild test robot
    Signed-off-by: Eric Dumazet
    Cc: Jesper Dangaard Brouer
    Cc: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter/IPVS updates for net-next,
    most relevantly they are:

    * cleanup to remove double semicolon, from Stephen Hemminger.

    * calm down sparse warning in xt_ipcomp, from Fan Du.

    * nf_ct_labels support for nf_tables, from Florian Westphal.

    * new macros to simplify rcu dereferences in the scope of nfnetlink
    and nf_tables, from Patrick McHardy.

    * Accept queue and drop (including reason for drop) to verdict
    parsing in nf_tables, also from Patrick.

    * Remove unused random seed initialization in nfnetlink_log, from
    Florian Westphal.

    * Allow attaching user-specific information to nf_tables rules, useful
    for attaching user comments to a rule, from me.

    * Return errors in ipset according to the manpage documentation, from
    Jozsef Kadlecsik.

    * Fix coccinelle warnings related to incorrect bool type usage for ipset,
    from Fengguang Wu.

    * Add hash:ip,mark set type to ipset, from Vytas Dauksa.

    * Fix the ipset initialization message being printed once per netns
    that is created, from Ilia Mirkin.

    * Add forceadd option to ipset, which evicts a random entry from the set
    if it becomes full, from Josh Hunt.

    * Minor IPVS cleanups and fixes from Andi Kleen and Tingwei Liu.

    * Improve conntrack scalability by removing a central spinlock, original
    work from Eric Dumazet. Jesper Dangaard Brouer took it over to address
    remaining issues. Several patches to prepare this change come in first
    place.

    * Rework nft_hash to resolve bugs (leaking chain, missing rcu synchronization
    on element removal, etc.), from Patrick McHardy.

    * Restore context in the rule deletion path, as we now release rule objects
    synchronously, from Patrick McHardy. This gets back event notification for
    anonymous sets.

    * Fix NAT family validation in nft_nat, also from Patrick.

    * Improve scalability of xt_connlimit by using an array of spinlocks and
    by introducing a rb-tree of hashtables for faster lookup of accounted
    objects per network. This patch was preceded by several patches and
    refactorings to accommodate this change, including the use of kmem_cache,
    from Florian Westphal.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Mar, 2014

3 commits

  • With the current match design, every invocation of the connlimit_match
    function means we have to perform (number_of_conntracks % 256) lookups
    in the conntrack table [to perform GC/delete stale entries].
    This is also the reason why ____nf_conntrack_find() in perf top has
    > 20% cpu time per core.

    This patch changes the storage to rbtree which cuts down the number of
    ct objects that need testing.

    When looking up a new tuple, we only test the connections of the host
    objects we visit while searching for the wanted host/network (or
    the leaf we need to insert at).

    The slot count is reduced to 32. Increasing the slot count doesn't
    speed things up much because of the rbtree's nature.

    before patch (50kpps rx, 10kpps tx):
    + 20.95% ksoftirqd/0 [nf_conntrack] [k] ____nf_conntrack_find
    + 20.50% ksoftirqd/1 [nf_conntrack] [k] ____nf_conntrack_find
    + 20.27% ksoftirqd/2 [nf_conntrack] [k] ____nf_conntrack_find
    + 5.76% ksoftirqd/1 [nf_conntrack] [k] hash_conntrack_raw
    + 5.39% ksoftirqd/2 [nf_conntrack] [k] hash_conntrack_raw
    + 5.35% ksoftirqd/0 [nf_conntrack] [k] hash_conntrack_raw

    after the patch (90kpps rx, 51kpps tx):
    + 17.24% swapper [nf_conntrack] [k] ____nf_conntrack_find
    + 6.60% ksoftirqd/2 [nf_conntrack] [k] ____nf_conntrack_find
    + 2.73% swapper [nf_conntrack] [k] hash_conntrack_raw
    + 2.36% swapper [xt_connlimit] [k] count_tree

    Obvious disadvantages to previous version are the increase in code
    complexity and the increased memory cost.

    Partially based on Eric Dumazet's fq scheduler.
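
    A sketch of the per-host node in the new storage (names assumed):

      struct xt_connlimit_rb {
              struct rb_node node;        /* keyed by source host/network */
              struct hlist_head hhead;    /* connections from that host   */
              union nf_inet_addr addr;    /* search key                   */
      };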

    Reviewed-by: Jesper Dangaard Brouer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The comparison helper currently returns 1 if the two addresses are the
    same. Make it work like memcmp/strcmp so it can be used as an rbtree
    search function.
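
    A sketch of the signed, memcmp-style comparison for the IPv4 case
    (helper and parameter names assumed):

      static int same_source_net(const union nf_inet_addr *u3,
                                 const union nf_inet_addr *addr,
                                 const union nf_inet_addr *mask)
      {
              u32 a = ntohl(addr->ip & mask->ip);
              u32 b = ntohl(u3->ip & mask->ip);

              /* <0, 0 or >0, like memcmp(), so it can steer an rbtree walk */
              return (a > b) - (a < b);
      }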

    Reviewed-by: Jesper Dangaard Brouer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • connlimit currently suffers from spinlock contention, example for
    4-core system with rps enabled:

    + 20.84% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock_bh
    + 20.76% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock_bh
    + 20.42% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock_bh
    + 6.07% ksoftirqd/2 [nf_conntrack] [k] ____nf_conntrack_find
    + 6.07% ksoftirqd/1 [nf_conntrack] [k] ____nf_conntrack_find
    + 5.97% ksoftirqd/0 [nf_conntrack] [k] ____nf_conntrack_find
    + 2.47% ksoftirqd/2 [nf_conntrack] [k] hash_conntrack_raw
    + 2.45% ksoftirqd/0 [nf_conntrack] [k] hash_conntrack_raw
    + 2.44% ksoftirqd/1 [nf_conntrack] [k] hash_conntrack_raw

    Use an array of keyed spinlocks instead; this may allow parallel
    lookup/insert/delete if entries are hashed to different slots. With
    the patch:

    + 20.95% ksoftirqd/0 [nf_conntrack] [k] ____nf_conntrack_find
    + 20.50% ksoftirqd/1 [nf_conntrack] [k] ____nf_conntrack_find
    + 20.27% ksoftirqd/2 [nf_conntrack] [k] ____nf_conntrack_find
    + 5.76% ksoftirqd/1 [nf_conntrack] [k] hash_conntrack_raw
    + 5.39% ksoftirqd/2 [nf_conntrack] [k] hash_conntrack_raw
    + 5.35% ksoftirqd/0 [nf_conntrack] [k] hash_conntrack_raw
    + 2.00% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_unlock

    Improved rx processing rate from ~35kpps to ~50kpps.
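
    The idea, sketched (array size and names assumed):

      static spinlock_t xt_connlimit_locks[CONNLIMIT_LOCK_SLOTS]; /* assumed */

      /* pick a lock from the connection hash instead of one global lock */
      spin_lock_bh(&xt_connlimit_locks[hash % CONNLIMIT_LOCK_SLOTS]);
      /* ... lookup/insert/delete within this hash slot ... */
      spin_unlock_bh(&xt_connlimit_locks[hash % CONNLIMIT_LOCK_SLOTS]);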

    Reviewed-by: Jesper Dangaard Brouer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

15 Mar, 2014

1 commit

  • Replace the bh safe variant with the hard irq safe variant.

    We need a hard irq safe variant to deal with netpoll transmitting
    packets from hard irq context, and we need it in most if not all of
    the places using the bh safe variant.

    Except on 32-bit uniprocessor systems the code is exactly the same, so
    don't bother with a bh variant; just have a hard irq safe variant that
    everyone can use.
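
    In other words (generic kernel locking API, not code from this patch):

      unsigned long flags;

      /* bh-safe variant: insufficient when netpoll transmits from hard irq */
      spin_lock_bh(&queue->lock);
      spin_unlock_bh(&queue->lock);

      /* hard-irq-safe variant that everyone can use: */
      spin_lock_irqsave(&queue->lock, flags);
      spin_unlock_irqrestore(&queue->lock, flags);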

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

07 Mar, 2014

9 commits

  • The hash set type is very broken and was never meant to be merged in this
    state. Missing RCU synchronization on element removal, leaking chain
    refcounts when used as a verdict map, races during lookups, a fixed table
    size are probably just some of the problems. Luckily it is currently
    never chosen by the kernel when the rbtree type is also available.

    Rewrite it to be usable.

    The new implementation supports automatic hash table resizing using RCU,
    based on Paul McKenney's and Josh Triplett's algorithm "Optimized Resizing
    For RCU-Protected Hash Tables" described in [1].

    Resizing doesn't require a second list head in the elements; it works by
    choosing a hash function that remaps elements to a predictable set of
    buckets, only resizing by integral factors, and

    - during expansion: linking new buckets to the old bucket that contains
    elements for any of the new buckets, thereby creating imprecise chains,
    then incrementally separating the elements until the new buckets only
    contain elements that hash directly to them.

    - during shrinking: linking the hash chains of all old buckets that hash
    to the same new bucket to form a single chain.

    Expansion requires at most as many grace periods as there are elements
    in the longest hash chain; shrinking requires a single grace period.

    Due to the requirement of having hash chains/elements linked to multiple
    buckets during resizing, homemade singly linked lists are used instead of
    the existing list helpers, which don't support this in a clean fashion.
    As a side effect, the amount of memory required per element is reduced by
    one pointer.

    Expansion is triggered when the load factor exceeds 75%, shrinking when
    the load factor goes below 30%. Both operations are allowed to fail and
    will be retried on the next insertion or removal if their respective
    conditions still hold.
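
    The triggers, sketched from the numbers above (function and field names
    assumed):

      /* load factor > 75%: try to grow; < 30%: try to shrink.  Both may
       * fail and are simply retried on a later insertion/removal. */
      if (ht->elements * 4 > ht->buckets * 3)
              nft_hash_expand(ht);
      else if (ht->elements * 10 < ht->buckets * 3)
              nft_hash_shrink(ht);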

    [1] http://dl.acm.org/citation.cfm?id=2002181.2002192

    Reviewed-by: Josh Triplett
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • nf_conntrack_lock is a monolithic lock and suffers from huge contention
    on current generation servers (8 or more cores/threads).

    Perf shows the locking congestion clearly on the base kernel:

    - 72.56% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock_bh
    - _raw_spin_lock_bh
    + 25.33% init_conntrack
    + 24.86% nf_ct_delete_from_lists
    + 24.62% __nf_conntrack_confirm
    + 24.38% destroy_conntrack
    + 0.70% tcp_packet
    + 2.21% ksoftirqd/6 [kernel.kallsyms] [k] fib_table_lookup
    + 1.15% ksoftirqd/6 [kernel.kallsyms] [k] __slab_free
    + 0.77% ksoftirqd/6 [kernel.kallsyms] [k] inet_getpeer
    + 0.70% ksoftirqd/6 [nf_conntrack] [k] nf_ct_delete
    + 0.55% ksoftirqd/6 [ip_tables] [k] ipt_do_table

    This patch changes conntrack locking and provides a huge performance
    improvement. SYN-flood attack tested on a 24-core E5-2695v2(ES) with
    10Gbit/s ixgbe (with tool trafgen):

    Base kernel: 810,405 new conntrack/sec
    After patch: 2,233,876 new conntrack/sec

    Note that other flood attacks (SYN+ACK or ACK) can easily be deflected
    using:
    # iptables -A INPUT -m state --state INVALID -j DROP
    # sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

    Use an array of hashed spinlocks to protect insertions/deletions of
    conntracks into the hash table. 1024 spinlocks seem to give good
    results, at minimal cost (4KB memory). Due to lockdep max depth,
    1024 becomes 8 if CONFIG_LOCKDEP=y.

    The hash resize is a bit tricky, because we need to take all locks in
    the array. A seqcount_t is used to synchronize the hash table users
    with the resizing process.
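
    Insertion and removal touch two buckets (one per direction), so the two
    bucket locks must be taken in a fixed order to avoid ABBA deadlocks; a
    sketch:

      static void nf_conntrack_double_lock(unsigned int h1, unsigned int h2)
      {
              h1 %= CONNTRACK_LOCKS;
              h2 %= CONNTRACK_LOCKS;
              if (h1 <= h2) {
                      spin_lock(&nf_conntrack_locks[h1]);
                      if (h1 != h2)
                              spin_lock_nested(&nf_conntrack_locks[h2],
                                               SINGLE_DEPTH_NESTING);
              } else {
                      spin_lock(&nf_conntrack_locks[h2]);
                      spin_lock_nested(&nf_conntrack_locks[h1],
                                       SINGLE_DEPTH_NESTING);
              }
      }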

    Signed-off-by: Eric Dumazet
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • Netfilter expectations are protected with the same lock as conntrack
    entries (nf_conntrack_lock). This patch splits out expectation locking
    to use its own lock (nf_conntrack_expect_lock).

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • Preparation for disconnecting the nf_conntrack_lock from the
    expectations code. Once the nf_conntrack_lock is lifted, a race
    condition is exposed.

    The expectation's master conntrack, exp->master, can race with
    delete operations, as the refcnt increment happens too late in
    init_conntrack(). The race is against other CPUs invoking
    ->destroy() (destroy_conntrack()) or nf_ct_delete() (via timeout
    or early_drop()).

    Avoid this race in nf_ct_find_expectation() by using atomic_inc_not_zero(),
    and checking if nf_ct_is_dying() (path via nf_ct_delete()).
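
    The pattern, sketched:

      /* don't take a new reference on a master whose refcnt may already
       * have dropped to zero or that is being torn down */
      if (nf_ct_is_dying(exp->master) ||
          !atomic_inc_not_zero(&exp->master->ct_general.use))
              return NULL;    /* lost the race; behave as if not found */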

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • One spinlock per cpu to protect dying/unconfirmed/template special lists.
    (These lists are now per cpu, a bit like the untracked ct)
    Add a @cpu field to nf_conn, to make sure we hold the appropriate
    spinlock at removal time.
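
    Roughly (layout as described; names assumed):

      struct ct_pcpu {
              spinlock_t              lock;
              struct hlist_nulls_head unconfirmed;
              struct hlist_nulls_head dying;
              struct hlist_nulls_head tmpl;
      };

      /* at removal time, ct->cpu selects the matching per-cpu lock: */
      struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, ct->cpu);

      spin_lock(&pcpu->lock);
      /* ... unlink ct from its special list ... */
      spin_unlock(&pcpu->lock);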

    Signed-off-by: Eric Dumazet
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • Changes while reading through the netfilter code.

    Added a hint about how the conntrack nf_conn refcnt is accessed,
    and renamed repl_hash to reply_hash for readability.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • Via Simon Horman:

    ====================
    * Whitespace cleanup spotted by checkpatch.pl, from Tingwei Liu.
    * Section conflict cleanup, basically removal of one wrong __read_mostly,
    from Andi Kleen.
    ====================

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Add whitespace after operators and put the open brace { on the previous
    line.

    Cc: Tingwei Liu
    Cc: lvs-devel@vger.kernel.org
    Signed-off-by: Tingwei Liu
    Signed-off-by: Simon Horman

    Tingwei Liu
     
  • const __read_mostly does not make any sense, because const data is
    already read-only. Remove the __read_mostly for the ipvs genl_ops.
    This avoids an LTO section conflict compile problem.

    Cc: Wensong Zhang
    Cc: Simon Horman
    Cc: Patrick McHardy
    Cc: lvs-devel@vger.kernel.org
    Signed-off-by: Andi Kleen
    Signed-off-by: Simon Horman

    Andi Kleen
     

06 Mar, 2014

3 commits

  • Adds a new property for hash set types: if a set is created with the
    'forceadd' option and the set becomes full, the next addition to the
    set may succeed and evict a random entry from the set.

    To keep overhead low, eviction is done very simply. It checks which
    bucket the new entry would be added to. If the bucket's pos value is
    non-zero (meaning there's at least one entry in the bucket), it replaces
    the first entry in the bucket. If pos is zero, it continues down the
    normal add process.
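
    In pseudo-C (simplified, helper names hypothetical):

      /* the normal add path found the set full and forceadd is enabled */
      if (forceadd && bucket->pos != 0) {
              /* at least one entry is present: overwrite slot 0 with the
               * new entry instead of failing with "set is full" */
              replace_entry(bucket, 0, new_entry);    /* hypothetical helper */
      }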

    This property is useful if you have a set for 'ban' lists where it may
    not matter if you release some entries from the set early.

    Signed-off-by: Josh Hunt
    Signed-off-by: Jozsef Kadlecsik

    Josh Hunt
     
  • Commit 1785e8f473 ("netfiler: ipset: Add net namespace for ipset") moved
    the initialization print into net_init, which can get called a lot due
    to namespaces. Move it back into init, reduce to pr_info.

    Signed-off-by: Ilia Mirkin
    Signed-off-by: Jozsef Kadlecsik

    Ilia Mirkin
     
  • Introduce a packet mark mask for the hash:ip,mark data type. This allows
    setting a mark bit filter for the ip set.

    Change-Id: Id8dd9ca7e64477c4f7b022a1d9c1a5b187f1c96e

    Signed-off-by: Jozsef Kadlecsik

    Vytas Dauksa