17 Mar, 2014

2 commits

  • currently returns 1 if they're the same. Make it work like mem/strcmp
    so it can be used as rbtree search function.
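
    A minimal sketch of why a memcmp-style result matters for a tree search.
    The key layout, the demo_* names and the plain binary tree are assumptions
    made for illustration; this is not the kernel rbtree API nor the helper
    changed by this commit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key: an address stored as four 32-bit words. */
struct demo_addr {
    uint32_t w[4];
};

/* memcmp-style contract: <0, 0 or >0, so 0 (not 1) means "equal". */
static int demo_addr_cmp(const struct demo_addr *a, const struct demo_addr *b)
{
    return memcmp(a->w, b->w, sizeof(a->w));
}

struct demo_node {
    struct demo_addr key;
    struct demo_node *left, *right;
};

/* The sign of the comparison picks the subtree; 0 ends the search. */
static struct demo_node *demo_search(struct demo_node *root,
                                     const struct demo_addr *key)
{
    while (root) {
        int d = demo_addr_cmp(key, &root->key);

        if (d < 0)
            root = root->left;
        else if (d > 0)
            root = root->right;
        else
            return root;
    }
    return NULL;
}
```

    A comparison that only reports "same" or "different" cannot steer the
    descent, which is why the three-way result is needed here.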

    Reviewed-by: Jesper Dangaard Brouer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • connlimit currently suffers from spinlock contention. Example for a
    4-core system with RPS enabled:

    + 20.84% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock_bh
    + 20.76% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock_bh
    + 20.42% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock_bh
    + 6.07% ksoftirqd/2 [nf_conntrack] [k] ____nf_conntrack_find
    + 6.07% ksoftirqd/1 [nf_conntrack] [k] ____nf_conntrack_find
    + 5.97% ksoftirqd/0 [nf_conntrack] [k] ____nf_conntrack_find
    + 2.47% ksoftirqd/2 [nf_conntrack] [k] hash_conntrack_raw
    + 2.45% ksoftirqd/0 [nf_conntrack] [k] hash_conntrack_raw
    + 2.44% ksoftirqd/1 [nf_conntrack] [k] hash_conntrack_raw

    May allow parallel lookup/insert/delete if the entry is hashed to
    another slot. With patch:

    + 20.95% ksoftirqd/0 [nf_conntrack] [k] ____nf_conntrack_find
    + 20.50% ksoftirqd/1 [nf_conntrack] [k] ____nf_conntrack_find
    + 20.27% ksoftirqd/2 [nf_conntrack] [k] ____nf_conntrack_find
    + 5.76% ksoftirqd/1 [nf_conntrack] [k] hash_conntrack_raw
    + 5.39% ksoftirqd/2 [nf_conntrack] [k] hash_conntrack_raw
    + 5.35% ksoftirqd/0 [nf_conntrack] [k] hash_conntrack_raw
    + 2.00% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_unlock

    Improved rx processing rate from ~35 kpps to ~50 kpps.

    Reviewed-by: Jesper Dangaard Brouer
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

13 Mar, 2014

1 commit


12 Mar, 2014

4 commits


08 Mar, 2014

5 commits


07 Mar, 2014

9 commits

  • The hash set type is very broken and was never meant to be merged in this
    state. Missing RCU synchronization on element removal, leaking chain
    refcounts when used as a verdict map, races during lookups, a fixed table
    size are probably just some of the problems. Luckily it is currently
    never chosen by the kernel when the rbtree type is also available.

    Rewrite it to be usable.

    The new implementation supports automatic hash table resizing using RCU,
    based on Paul McKenney's and Josh Triplett's algorithm "Optimized Resizing
    For RCU-Protected Hash Tables" described in [1].

    Resizing doesn't require a second list head in the elements. It works by
    choosing a hash function that remaps elements to a predictable set of buckets,
    only resizing by integral factors and

    - during expansion: linking new buckets to the old bucket that contains
    elements for any of the new buckets, thereby creating imprecise chains,
    then incrementally separating the elements until the new buckets only
    contain elements that hash directly to them.

    - during shrinking: linking the hash chains of all old buckets that hash
    to the same new bucket to form a single chain.

    Expansion requires at most the number of elements in the longest hash chain
    grace periods, shrinking requires a single grace period.
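
    The "predictable set of buckets" property can be sketched as follows.
    Power-of-two table sizes and a masked hash are assumptions chosen for the
    example (the text above only requires integral resize factors), and the
    demo_* names are hypothetical.

```c
#include <stdint.h>

/* Assumption: size is a power of two, bucket = hash masked by size - 1. */
static uint32_t demo_bucket(uint32_t hash, uint32_t size)
{
    return hash & (size - 1);
}

/*
 * Growing from old_size to old_size * factor (factor a power of two):
 * returns 1 if the element's new bucket is its old bucket plus
 * k * old_size for some 0 <= k < factor, i.e. one of a fixed,
 * predictable set of new buckets.
 */
static int demo_maps_predictably(uint32_t hash, uint32_t old_size,
                                 uint32_t factor)
{
    uint32_t old_b = demo_bucket(hash, old_size);
    uint32_t new_b = demo_bucket(hash, old_size * factor);

    return demo_bucket(new_b, old_size) == old_b &&
           (new_b - old_b) / old_size < factor;
}
```

    With such a mapping, every element in old bucket i can only move to a
    known group of new buckets, so the new buckets can initially point into
    the old chain and be separated step by step.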

    Due to the requirement of having hash chains/elements linked to multiple
    buckets during resizing, homemade singly linked lists are used instead of
    the existing list helpers, which don't support this in a clean fashion.
    As a side effect, the amount of memory required per element is reduced by
    one pointer.

    Expansion is triggered when the load factor exceeds 75%, shrinking when
    the load factor goes below 30%. Both operations are allowed to fail and
    will be retried on the next insertion or removal if their respective
    conditions still hold.
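
    A small sketch of the two thresholds, assuming the load factor is the
    number of elements divided by the number of buckets. The enum and the
    function name are hypothetical and not taken from the patch.

```c
enum demo_resize {
    DEMO_KEEP,
    DEMO_EXPAND,
    DEMO_SHRINK,
};

/*
 * Integer-only form of the checks:
 *   nelems / nbuckets > 75%  <=>  4 * nelems  > 3 * nbuckets
 *   nelems / nbuckets < 30%  <=>  10 * nelems < 3 * nbuckets
 */
static enum demo_resize demo_resize_action(unsigned int nelems,
                                           unsigned int nbuckets)
{
    if (4ULL * nelems > 3ULL * nbuckets)
        return DEMO_EXPAND;
    if (10ULL * nelems < 3ULL * nbuckets)
        return DEMO_SHRINK;
    return DEMO_KEEP;
}
```

    Because a failed resize leaves the condition true, calling such a check
    on the next insertion or removal retries the operation naturally.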

    [1] http://dl.acm.org/citation.cfm?id=2002181.2002192

    Reviewed-by: Josh Triplett
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • nf_conntrack_lock is a monolithic lock and suffers from huge contention
    on current generation servers (8 or more core/threads).

    Perf locking congestion is clear on base kernel:

    - 72.56% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock_bh
    - _raw_spin_lock_bh
    + 25.33% init_conntrack
    + 24.86% nf_ct_delete_from_lists
    + 24.62% __nf_conntrack_confirm
    + 24.38% destroy_conntrack
    + 0.70% tcp_packet
    + 2.21% ksoftirqd/6 [kernel.kallsyms] [k] fib_table_lookup
    + 1.15% ksoftirqd/6 [kernel.kallsyms] [k] __slab_free
    + 0.77% ksoftirqd/6 [kernel.kallsyms] [k] inet_getpeer
    + 0.70% ksoftirqd/6 [nf_conntrack] [k] nf_ct_delete
    + 0.55% ksoftirqd/6 [ip_tables] [k] ipt_do_table

    This patch changes conntrack locking and provides a huge performance
    improvement. SYN-flood attack tested on a 24-core E5-2695v2(ES) with
    10Gbit/s ixgbe (with tool trafgen):

    Base kernel: 810,405 new conntrack/sec
    After patch: 2,233,876 new conntrack/sec

    Note that other flood attacks (SYN+ACK or ACK) can easily be deflected using:
    # iptables -A INPUT -m state --state INVALID -j DROP
    # sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

    Use an array of hashed spinlocks to protect insertions/deletions of
    conntracks into the hash table. 1024 spinlocks seem to give good
    results, at minimal cost (4KB memory). Due to lockdep max depth,
    1024 becomes 8 if CONFIG_LOCKDEP=y

    The hash resize is a bit tricky, because we need to take all locks in
    the array. A seqcount_t is used to synchronize the hash table users
    with the resizing process.
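
    A userspace sketch of the hashed lock array, using C11 atomic_flag
    spinlocks in place of kernel spinlocks. All demo_* names are hypothetical,
    and the seqcount side of the resize is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NR_LOCKS 1024u /* number of locks quoted in the text above */

static atomic_flag demo_locks[DEMO_NR_LOCKS];

static void demo_locks_init(void)
{
    for (uint32_t i = 0; i < DEMO_NR_LOCKS; i++)
        atomic_flag_clear(&demo_locks[i]);
}

/* Entries whose hashes map to different indexes never contend. */
static uint32_t demo_lock_index(uint32_t hash)
{
    return hash % DEMO_NR_LOCKS;
}

static bool demo_trylock(uint32_t hash)
{
    return !atomic_flag_test_and_set_explicit(
        &demo_locks[demo_lock_index(hash)], memory_order_acquire);
}

static void demo_lock(uint32_t hash)
{
    while (!demo_trylock(hash))
        ; /* spin until the slot lock is free */
}

static void demo_unlock(uint32_t hash)
{
    atomic_flag_clear_explicit(&demo_locks[demo_lock_index(hash)],
                               memory_order_release);
}

/* A resize has to exclude every writer, so it takes all locks,
 * always in ascending order. */
static void demo_lock_all(void)
{
    for (uint32_t i = 0; i < DEMO_NR_LOCKS; i++)
        demo_lock(i);
}

static void demo_unlock_all(void)
{
    for (uint32_t i = 0; i < DEMO_NR_LOCKS; i++)
        demo_unlock(i);
}
```

    Insertions and deletions would bracket their hash-chain update with
    demo_lock(hash) and demo_unlock(hash), while a resize would run between
    demo_lock_all() and demo_unlock_all().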

    Signed-off-by: Eric Dumazet
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • Netfilter expectations are protected with the same lock as conntrack
    entries (nf_conntrack_lock). This patch splits out expectations locking
    to use its own lock (nf_conntrack_expect_lock).

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • Preparation for disconnecting the nf_conntrack_lock from the
    expectations code. Once the nf_conntrack_lock is lifted, a race
    condition is exposed.

    The expectation's master conntrack, exp->master, can race with
    delete operations, as the refcnt increment happens too late in
    init_conntrack(). Race is against other CPUs invoking
    ->destroy() (destroy_conntrack()), or nf_ct_delete() (via timeout
    or early_drop()).

    Avoid this race in nf_ct_find_expectation() by using atomic_inc_not_zero(),
    and checking if nf_ct_is_dying() (path via nf_ct_delete()).
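
    The increment-unless-zero idea can be sketched in userspace with C11
    atomics. This is an analogue written for illustration, not the kernel's
    atomic_inc_not_zero() implementation; struct demo_ct and its fields are
    hypothetical stand-ins for the master conntrack.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count has not already dropped to zero. */
static bool demo_inc_not_zero(atomic_int *ref)
{
    int old = atomic_load(ref);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;
}

/* Hypothetical stand-in for the master conntrack. */
struct demo_ct {
    atomic_int use; /* reference count */
    bool dying;     /* set once deletion has started */
};

/* Refuse objects that are being deleted or are already unreferenced. */
static bool demo_try_get(struct demo_ct *ct)
{
    if (ct->dying)
        return false;
    return demo_inc_not_zero(&ct->use);
}
```

    A plain increment would happily turn a count of 0 back into 1 and hand
    out a pointer to an object that another CPU is already destroying.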

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • One spinlock per cpu to protect dying/unconfirmed/template special lists.
    (These lists are now per cpu, a bit like the untracked ct)
    Add a @cpu field to nf_conn, to make sure we hold the appropriate
    spinlock at removal time.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • Changes while reading through the netfilter code.

    Added a hint about how the conntrack nf_conn refcnt is accessed,
    and renamed repl_hash to reply_hash for readability.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • Via Simon Horman:

    ====================
    * Whitespace cleanup spotted by checkpatch.pl from Tingwei Liu.
    * Section conflict cleanup, basically removal of one wrong __read_mostly,
    from Andi Kleen.
    ====================

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Add whitespace after operator and put open brace { on the previous line

    Cc: Tingwei Liu
    Cc: lvs-devel@vger.kernel.org
    Signed-off-by: Tingwei Liu
    Signed-off-by: Simon Horman

    Tingwei Liu
     
  • const __read_mostly does not make any sense, because const
    data is already read-only. Remove the __read_mostly
    for the ipvs genl_ops. This avoids a LTO
    section conflict compile problem.

    Cc: Wensong Zhang
    Cc: Simon Horman
    Cc: Patrick McHardy
    Cc: lvs-devel@vger.kernel.org
    Signed-off-by: Andi Kleen
    Signed-off-by: Simon Horman

    Andi Kleen
     

06 Mar, 2014

8 commits

  • Adds a new property for hash set types, where if a set is created
    with the 'forceadd' option and the set becomes full the next addition
    to the set may succeed and evict a random entry from the set.

    To keep overhead low, eviction is done very simply. It checks to see
    which bucket the new entry would be added to. If the bucket's pos value
    is non-zero (meaning there's at least one entry in the bucket) it
    replaces the first entry in the bucket. If pos is zero, then it continues
    down the normal add process.

    This property is useful if you have a set for 'ban' lists where it may
    not matter if you release some entries from the set early.
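
    The eviction rule described above can be sketched like this. The bucket
    layout, the slot count and the demo_* names are assumptions for the
    example and do not mirror the ipset data structures.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SLOTS 4

struct demo_bucket {
    unsigned int pos;           /* number of used slots in this bucket */
    uint32_t value[DEMO_SLOTS];
};

/*
 * Meant to be called when the set is full and was created with forceadd.
 * Returns true if the new value replaced the first entry of its bucket.
 * Returns false if the bucket is empty, in which case the caller
 * continues with the normal add path.
 */
static bool demo_forceadd(struct demo_bucket *b, uint32_t v)
{
    if (b->pos == 0)
        return false;

    b->value[0] = v; /* evict whatever was stored first */
    return true;
}
```

    No scan for an "oldest" or "least used" entry takes place, which is what
    keeps the overhead low.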

    Signed-off-by: Josh Hunt
    Signed-off-by: Jozsef Kadlecsik

    Josh Hunt
     
  • Commit 1785e8f473 ("netfilter: ipset: Add net namespace for ipset") moved
    the initialization print into net_init, which can get called a lot due
    to namespaces. Move it back into init, reduce to pr_info.

    Signed-off-by: Ilia Mirkin
    Signed-off-by: Jozsef Kadlecsik

    Ilia Mirkin
     
  • commit 2dfb973c0dcc6d2211 (add markmask for hash:ip,mark data type)
    inserted IPSET_ATTR_MARKMASK in-between other enum values, i.e.
    changing values of all further attributes. This causes 'ipset list'
    to segfault on existing kernels since ipset no longer finds
    IPSET_ATTR_MEMSIZE (it has a different value on kernel side).

    Jozsef points out it should be moved below IPSET_ATTR_MARK which
    works since there is some extra reserved space after that value.
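
    A sketch of the numbering problem with made-up attribute names and
    values; only the mechanism matches the description above, not the real
    IPSET_ATTR_* layout.

```c
/* Numbering already shipped; values 3..16 are unused (reserved). */
enum demo_attr_v1 {
    DEMO_V1_FLAGS = 1,
    DEMO_V1_MARK,
    DEMO_V1_GC = 17,
    DEMO_V1_MEMSIZE,
};

/* Bad: new value inserted between existing ones, later values shift. */
enum demo_attr_bad {
    DEMO_BAD_FLAGS = 1,
    DEMO_BAD_MARK,
    DEMO_BAD_GC = 17,
    DEMO_BAD_MARKMASK,
    DEMO_BAD_MEMSIZE,
};

/* Good: new value placed in the reserved gap, nothing else moves. */
enum demo_attr_good {
    DEMO_GOOD_FLAGS = 1,
    DEMO_GOOD_MARK,
    DEMO_GOOD_MARKMASK,
    DEMO_GOOD_GC = 17,
    DEMO_GOOD_MEMSIZE,
};

_Static_assert((int)DEMO_BAD_MEMSIZE != (int)DEMO_V1_MEMSIZE,
               "inserting mid-enum renumbers later attributes");
_Static_assert((int)DEMO_GOOD_MEMSIZE == (int)DEMO_V1_MEMSIZE,
               "using the reserved gap keeps old values stable");
```

    Userspace and kernel built from different versions of the "bad" enum
    disagree about which number means MEMSIZE, which is the kind of mismatch
    behind the crash.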

    Signed-off-by: Florian Westphal
    Signed-off-by: Jozsef Kadlecsik

    Florian Westphal
     
  • Signed-off-by: Jozsef Kadlecsik

    Jozsef Kadlecsik
     
  • Introduce packet mark mask for hash:ip,mark data type. This allows
    setting a mark bit filter for the ip set.

    Change-Id: Id8dd9ca7e64477c4f7b022a1d9c1a5b187f1c96e

    Signed-off-by: Jozsef Kadlecsik

    Vytas Dauksa
     
  • Introduce packet mark support with new ip,mark hash set. This includes
    userspace and kernelspace code, hash:ip,mark set tests and man page
    updates.

    The intended use of ip,mark set is similar to the ip:port type, but for
    protocols which don't use a predictable port number. Instead of port
    number it matches a firewall mark determined by a layer 7 filtering
    program like opendpi.

    As well as allowing or blocking traffic it will also be used for
    accounting packets and bytes sent for each protocol.

    Signed-off-by: Jozsef Kadlecsik

    Vytas Dauksa
     
  • net/netfilter/ipset/ip_set_hash_netnet.c:115:8-9: WARNING: return of 0/1 in function 'hash_netnet4_data_list' with return type bool
    /c/kernel-tests/src/cocci/net/netfilter/ipset/ip_set_hash_netnet.c:338:8-9: WARNING: return of 0/1 in function 'hash_netnet6_data_list' with return type bool

    Return statements in functions returning bool should use
    true/false instead of 1/0.
    Generated by: coccinelle/misc/boolreturn.cocci

    Signed-off-by: Fengguang Wu
    Signed-off-by: Jozsef Kadlecsik

    Fengguang Wu
     
  • ipset(8) for list:set says:
    The match will try to find a matching entry in the sets and the
    target will try to add an entry to the first set to which it can
    be added.

    However, the real behavior differs a bit from what is described. Consider this example:

    # ipset create test-1-v4 hash:ip family inet
    # ipset create test-1-v6 hash:ip family inet6
    # ipset create test-1 list:set
    # ipset add test-1 test-1-v4
    # ipset add test-1 test-1-v6

    # iptables -A INPUT -p tcp --destination-port 25 -j SET --add-set test-1 src
    # ip6tables -A INPUT -p tcp --destination-port 25 -j SET --add-set test-1 src

    When an iptables/ip6tables rule matches a packet, the IPSET target
    tries to add the packet's src to the list:set test-1, where the first
    entry is test-1-v4 and the second one is test-1-v6.

    For IPv4, since test-1-v4 is the first entry in test-1, src is added
    to test-1-v4 correctly, but for IPv6 src is not added!

    Placing test-1-v6 as the first element of the list:set makes the
    behavior correct for IPv6, but breaks it for IPv4.

    This is due to the result returned from ip_set_add() and ip_set_del() in
    net/netfilter/ipset/ip_set_core.c when a set in the list:set requires more
    parameters than given or the address families do not match (which is the
    case here).

    It seems wrong to return 0 from ip_set_add() and ip_set_del() in
    this case, as 0 should be returned only when an element is successfully
    added to or deleted from the set, contrary to ip_set_test(), which
    returns 0 when no entry exists and >0 when the entry is found in the set.

    Signed-off-by: Sergey Popovich
    Signed-off-by: Jozsef Kadlecsik

    Sergey Popovich
     

27 Feb, 2014

1 commit

  • This allows us to store user comment strings, but it could be also
    used to store any kind of information that the user application needs
    to link to the rule.

    Scratch 8 bits for the new ulen field that indicates the length of the
    user data area. 4 bits from the handle (so it's 42 bits long, according
    to Patrick, it would last 139 years with 1000 new rules per second)
    and 4 bits from dlen (so the expression data area is 4K, which seems
    sufficient by now even considering the compatibility layer).
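
    The figures above can be checked with a few lines of arithmetic. The bit
    widths come from the text (42-bit handle, 8-bit ulen, and a 12-bit dlen
    implied by the 4K expression area); the demo_* names and the 365-day
    year are assumptions of this sketch.

```c
#include <stdint.h>

#define DEMO_HANDLE_BITS 42
#define DEMO_DLEN_BITS   12
#define DEMO_ULEN_BITS    8

/* Number of distinct values an unsigned field of the given width holds. */
static uint64_t demo_field_values(unsigned int bits)
{
    return UINT64_C(1) << bits;
}

/* Whole years until the handle space runs out at a constant rule rate. */
static uint64_t demo_handle_years(uint64_t rules_per_sec)
{
    const uint64_t secs_per_year = UINT64_C(365) * 24 * 60 * 60;

    return demo_field_values(DEMO_HANDLE_BITS) / rules_per_sec /
           secs_per_year;
}
```

    demo_handle_years(1000) evaluates to 139, demo_field_values(12) to 4096
    (the 4K data area), and an 8-bit ulen allows up to 255 bytes of user data.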

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Patrick McHardy

    Pablo Neira Ayuso
     

25 Feb, 2014

5 commits


19 Feb, 2014

5 commits

  • This also adds NF_CT_LABELS_MAX_SIZE so it can be re-used
    as BUILD_BUG_ON in nft_ct.

    At this time, nft doesn't yet support writing to the label area;
    when this changes the label->words handling needs to be moved
    out of xt_connlabel.c into nf_conntrack_labels.c.

    Also removes a useless run-time check: words cannot grow beyond
    4 (32 bit) or 2 (64bit) since xt_connlabel enforces a maximum of
    128 labels.
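
    The word counts quoted above follow from rounding the label count up to
    whole machine words. This sketch assumes the label bits are stored in an
    array of long-sized words; the names are hypothetical.

```c
#define DEMO_MAX_LABELS 128u

/* Words needed to hold `labels` bits, rounded up. */
static unsigned int demo_label_words(unsigned int labels,
                                     unsigned int bits_per_word)
{
    return (labels + bits_per_word - 1) / bits_per_word;
}
```

    demo_label_words(DEMO_MAX_LABELS, 32) gives 4 and
    demo_label_words(DEMO_MAX_LABELS, 64) gives 2, so a bound that is already
    enforced at 128 labels makes a separate run-time check on the word count
    redundant.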

    Signed-off-by: Florian Westphal
    Acked-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • 0-DAY kernel build testing backend reported:

    sparse warnings: (new ones prefixed by >>)

    >> >> net/netfilter/xt_ipcomp.c:63:26: sparse: restricted __be16 degrades to integer
    >> >> net/netfilter/xt_ipcomp.c:63:26: sparse: cast to restricted __be32

    Fix this by using ntohs without shifting.

    Tested with: make C=1 CF=-D__CHECK_ENDIAN__

    Signed-off-by: Fan Du
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This is C not shell script

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Pablo Neira Ayuso

    stephen hemminger
     
  • Conflicts:
    drivers/net/bonding/bond_3ad.h
    drivers/net/bonding/bond_main.c

    Two minor conflicts in bonding, both of which were overlapping
    changes.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull drm fixes from Dave Airlie:
    "Lots of little small things, nothing too major: nouveau regression
    fixes, vmware fixes for the new hw support, memory leaks in error path
    fixes"

    * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (31 commits)
    drm/radeon/ni: fix typo in dpm sq ramping setup
    drm/radeon/si: fix typo in dpm sq ramping setup
    drm/radeon: fix CP semaphores on CIK
    drm/radeon: delete a stray tab
    drm/radeon: fix display tiling setup on SI
    drm/radeon/dpm: reduce r7xx vblank mclk threshold to 200
    drm/radeon: fill in DRM_CAPs for cursor size
    drm: add DRM_CAPs for cursor size
    drm/radeon: unify bpc handling
    drm/ttm: Fix memory leak in ttm_agp_backend.c
    drm/ttm: declare 'struct device' in ttm_page_alloc.h
    drm/nouveau: fix TTM_PL_TT memtype on pre-nv50
    drm/nv50/disp: use correct register to determine DP display bpp
    drm/nouveau/fb: use correct ram oclass for nv1a hardware
    drm/nv50/gr: add missing nv_error parameter priv
    drm/nouveau: fix ENG_RUNLIST register address
    drm/nv4c/bios: disallow retrieving from prom on nv4x igp's
    drm/nv4c/vga: decode register is in a different place on nv4x igp's
    drm/nv4c/mc: nv4x igp's have a different msi rearm register
    drm/nouveau: set irq_enabled manually
    ...

    Linus Torvalds