13 Sep, 2016

1 commit

  • These counters sit in the hot path and do show up in perf; this is
    especially true for 'found' and 'searched', which get incremented for
    every packet processed.

    Information like

    searched=212030105
    new=623431
    found=333613
    delete=623327

    does not seem too helpful nowadays:

    - on busy systems 'found' and 'searched' will overflow every few hours
    (these are 32bit integers), less busy ones every few days.

    - for debugging there are better methods, such as iptables' TRACE target
    and the conntrack log sysctls. Nowadays we also have the perf tool.

    This removes packet path stat counters except those that
    are expected to be 0 (or close to 0) on a normal system, e.g.
    'insert_failed' (race happened) or 'invalid' (proto tracker rejects).

    The insert stat is retained for the ctnetlink case.
    The found stat is retained for the tuple-is-taken check when NAT has to
    determine if it needs to pick a different source address.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

07 Sep, 2016

2 commits


18 Aug, 2016

1 commit


19 Jul, 2016

1 commit

  • The dummy ruleset I used to test the original validation change was
    broken; most rules were unreachable and were not tested by
    mark_source_chains().

    In some cases rulesets that used to load in a few seconds now require
    several minutes.

    sample ruleset that shows the behaviour:

    echo "*filter"
    for i in $(seq 0 100000); do
        printf ":chain_%06x - [0:0]\n" "$i"
    done
    for i in $(seq 0 100000); do
        printf -- "-A INPUT -j chain_%06x\n" "$i"
        printf -- "-A INPUT -j chain_%06x\n" "$i"
        printf -- "-A INPUT -j chain_%06x\n" "$i"
    done
    echo COMMIT

    [ pipe result into iptables-restore ]

    This ruleset will be about 74 MByte in size, with ~500k searches
    through all 500k[1] rule entries. iptables-restore will take forever
    (gave up after 10 minutes).

    Instead of always searching the entire blob for a match, fill an
    array with the start offsets of every single ipt_entry struct,
    then do a binary search to check if the jump target is present or not.

    After this change ruleset restore times get again close to what one
    gets when reverting 36472341017529e (~3 seconds on my workstation).

    [1] every user-defined rule gets an implicit RETURN, so we get
    300k jumps + 100k userchains + 100k returns -> 500k rule entries

    Fixes: 36472341017529e ("netfilter: x_tables: validate targets of jumps")
    Reported-by: Jeff Wu
    Tested-by: Jeff Wu
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

03 Jul, 2016

1 commit

  • netfilter uses multiple FWINV #defines with identical form that hide a
    specific structure variable and dereference its invflags member.

    $ git grep "#define FWINV"
    include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg))
    net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg))
    net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg)))
    net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
    net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg)))
    net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg)))

    Consolidate these macros into a single NF_INVF macro.

    Miscellanea:

    o Neaten the alignment around these uses
    o A few lines are > 80 columns for intelligibility

    Signed-off-by: Joe Perches
    Signed-off-by: Pablo Neira Ayuso

    Joe Perches
     

29 Apr, 2016

1 commit

  • This is a forward-port of the original patch from Andrzej Hajda,
    he said:

    "IS_ERR_VALUE should be used only with unsigned long type.
    Otherwise it can work incorrectly. To achieve this function
    xt_percpu_counter_alloc is modified to return unsigned long,
    and its result is assigned to temporary variable to perform
    error checking, before assigning to .pcnt field.

    The patch follows conclusion from discussion on LKML [1][2].

    [1]: http://permalink.gmane.org/gmane.linux.kernel/2120927
    [2]: http://permalink.gmane.org/gmane.linux.kernel/2150581"

    Original patch from Andrzej is here:

    http://patchwork.ozlabs.org/patch/582970/

    This patch has clashed with input validation fixes for x_tables.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

24 Apr, 2016

2 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree, mostly from Florian Westphal to sort out the lack of sufficient
    validation in x_tables and connlabel preparation patches to add
    nf_tables support. They are:

    1) Ensure we don't go over the ruleset blob boundaries in
    mark_source_chains().

    2) Validate that target jumps land on an existing xt_entry. This extra
    sanitization comes with a performance penalty when loading the ruleset.

    3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.

    4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.

    5) Make sure the minimal possible target size in x_tables.

    6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.

    7) Check that standard target size is valid.

    8) More sanitization to ensure that the target_offset field is correct.

    9) Add xt_check_entry_match() to validate that matches are well-formed.

    10-12) Three patches to reduce the number of parameters in
    translate_compat_table() for {arp,ip,ip6}tables by using a container
    structure.

    13) No need to return value from xt_compat_match_from_user(), so make
    it void.

    14) Consolidate translate_table() so it can be used by compat code too.

    15) Remove obsolete check for compat code, so we keep consistent with
    what was already removed in the native layout code (back in 2007).

    16) Get rid of target jump validation from mark_source_chains(),
    obsoleted by #2.

    17) Introduce xt_copy_counters_from_user() to consolidate counter
    copying, and use it from {arp,ip,ip6}tables.

    18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
    functions.

    19) Move nf_connlabel_match() to xt_connlabel.

    20) Skip event notification if connlabel did not change.

    21) Update of nf_connlabels_get() to make the upcoming nft connlabel
    support easier.

    23) Remove spinlock to read protocol state field in conntrack.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • nla_data() is now aligned on a 64-bit area.

    The temporary function nla_put_be64_32bit() is removed in this patch.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

14 Apr, 2016

5 commits


28 Mar, 2016

1 commit

  • This fix adds a new reference counter (ref_netlink) for the struct ip_set.
    The other reference counter (ref) can be swapped out by ip_set_swap and we
    need a separate counter to keep track of references for netlink events
    like dump. Using the same ref counter for dump causes a race condition
    which can be demonstrated by the following script:

    ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
    counters
    ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
    counters
    ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
    counters

    ipset save &

    ipset swap hash_ip3 hash_ip2
    ipset destroy hash_ip3 /* will crash the machine */

    Swap will exchange the values of ref, so destroy will see ref = 0 instead
    of ref = 1. With this fix in place, swap will not succeed because ipset
    save still holds ref_netlink on the set (ip_set_swap doesn't swap
    ref_netlink).

    Both delete and swap will error out if ref_netlink != 0 on the set.

    Note: the changes to the *_head functions are because previously we would
    increment ref whenever we called these functions; we don't do that
    anymore.

    Reviewed-by: Joshua Hunt
    Signed-off-by: Vishwanath Pai
    Signed-off-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Vishwanath Pai
     

03 Mar, 2016

1 commit

  • Delay hook registration until the table is being requested inside a
    namespace.

    Historically, a particular table (iptables mangle, ip6tables filter, etc)
    was registered on module load.

    When netns support was added to iptables only the ip/ip6tables ruleset was
    made namespace aware, not the actual hook points.

    This means e.g. that when the ipt_filter table/module is loaded on a
    system, each namespace on that system gets an (empty) iptables filter
    ruleset.

    In other words, if a namespace sends a packet, such an skb is 'caught' by
    the netfilter machinery and fed to the hook points for that table (i.e.
    INPUT, FORWARD, etc).

    Thanks to Eric Biederman, hooks are no longer global, but per namespace.

    This means that we can avoid allocation of empty ruleset in a namespace and
    defer hook registration until we need the functionality.

    We register a table's hook entry points ONLY in the initial namespace.
    When an iptables get/setsockopt is issued inside a given namespace, we
    check if the table is found in the per-namespace list.

    If not, we attempt to find it in the initial namespace, and, if found,
    create an empty default table in the requesting namespace and register the
    needed hooks.

    Hook points are destroyed only once the namespace is deleted; there is no
    'usage count' (it makes no sense since there is no 'remove table'
    operation in the xtables api).

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

19 Feb, 2016

1 commit


29 Dec, 2015

2 commits


19 Dec, 2015

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains the first batch of Netfilter updates for
    the upcoming 4.5 kernel. This batch contains userspace netfilter header
    compilation fixes, support for packet mangling in nf_tables, the new
    tracing infrastructure for nf_tables and cgroup2 support for iptables.
    More specifically, they are:

    1) Two patches to include dependencies in our netfilter userspace
    headers to resolve compilation problems, from Mikko Rapeli.

    2) Four cosmetic cleanup patches for the ebtables codebase, from Ian Morris.

    3) Remove duplicate include in the netfilter reject infrastructure,
    from Stephen Hemminger.

    4) Two patches to simplify the netfilter defragmentation code for IPv6,
    patch from Florian Westphal.

    5) Fix root ownership of /proc/net netfilter for unprivileged net
    namespaces, from Philip Whineray.

    6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal.

    7) Add mangling support to our nf_tables payload expression, from
    Patrick McHardy.

    8) Introduce a new netlink-based tracing infrastructure for nf_tables,
    from Florian Westphal.

    9) Change setter functions in nfnetlink_log to be void, from
    Rami Rosen.

    10) Add netns support to the cttimeout infrastructure.

    11) Add cgroup2 support to iptables, from Tejun Heo.

    12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian.

    13) Add support for mangling pkttype in the nf_tables meta expression,
    also from Florian.

    BTW, I need that you pull net into net-next, I have another batch that
    requires changes that I don't yet see in net.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Dec, 2015

1 commit


24 Nov, 2015

1 commit

  • ip_ct_sctp is an internal structure, embedded in the union
    nf_conntrack_proto to store sctp-specific information in conntrack
    entries. It has no business in UAPI.

    This patch moves it from UAPI to a saner place, together with similar
    structs for other protocols.

    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Neil Horman
    Signed-off-by: Pablo Neira Ayuso

    Marcelo Ricardo Leitner
     

07 Nov, 2015

1 commit

  • The data extensions in ipset lacked the proper memory alignment and
    thus could lead to kernel crash on several architectures. Therefore
    the structures have been reorganized and alignment attributes added
    where needed. The patch was tested on armv7h by Gerhard Wiesinger and
    on x86_64, sparc64 by Jozsef Kadlecsik.

    Reported-by: Gerhard Wiesinger
    Tested-by: Gerhard Wiesinger
    Tested-by: Jozsef Kadlecsik
    Signed-off-by: Jozsef Kadlecsik

    Jozsef Kadlecsik
     

09 Oct, 2015

1 commit


19 Sep, 2015

1 commit


03 Sep, 2015

1 commit

  • Fengguang reported that some randconfig builds generate the following
    linker issue with the nf_ct_zone_dflt object involved:

    [...]
    CC init/version.o
    LD init/built-in.o
    net/built-in.o: In function `ipv4_conntrack_defrag':
    nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
    net/built-in.o: In function `ipv6_defrag':
    nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt'
    make: *** [vmlinux] Error 1

    Given that configurations exist where we have a built-in part, which is
    accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user()
    and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
    module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
    area when netfilter is configured in general.

    Therefore, split the more generic parts into a common header under
    include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
    section that already holds parts related to CONFIG_NF_CONNTRACK in the
    netfilter core. This fixes the issue on my side.

    Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions")
    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

07 Aug, 2015

1 commit

  • - Move the nfnl_acct_list into the network namespace; initialize
    and destroy it per namespace
    - Keep track of the refcnt on nfacct objects; the old logic no
    longer works with a per-namespace list
    - Adjust xt_nfacct to pass the namespace when registering objects

    Signed-off-by: Andreas Schultz
    Signed-off-by: Pablo Neira Ayuso

    Andreas Schultz
     

16 Jul, 2015

2 commits

  • Don't bother testing whether we need to switch to the alternate stack
    unless the TEE target is used.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • In most cases there is no reentrancy into ip/ip6tables.

    For skbs sent by REJECT or SYNPROXY targets, there is one level of
    reentrancy, but it's not relevant as those targets issue an absolute
    verdict, i.e. the jumpstack can be clobbered since it is not used
    after the target issues an absolute verdict (ACCEPT, DROP, STOLEN, etc).

    So the only special case where it is relevant is the TEE target, which
    returns XT_CONTINUE.

    This patch changes ip(6)_do_table to always use the jump stack starting
    from 0.

    When we detect we're operating on an skb sent via TEE (percpu
    nf_skb_duplicated is 1) we switch to an alternate stack to leave
    the original one alone.

    Since there is no TEE support for arptables, it doesn't need to
    test if tee is active.

    The jump stack overflow tests are no longer needed as well --
    since ->stacksize is the largest call depth, we cannot exceed it.

    A much better alternative to the external jumpstack would be to just
    declare a jumps[32] stack on the local stack frame, but that would mean
    we'd have to reject iptables rulesets that used to work before.

    Another alternative would be to start rejecting rulesets with a larger
    call depth, e.g. 1000 -- in this case it would be feasible to allocate the
    entire stack in the percpu area which would avoid one dereference.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

19 Jun, 2015

1 commit

  • On 32bit archs gcc complains due to casts from void* to u64.
    Add intermediate casts to long to silence these warnings.

    include/linux/netfilter/x_tables.h:376:10: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
    include/linux/netfilter/x_tables.h:384:15: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    include/linux/netfilter/x_tables.h:391:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    include/linux/netfilter/x_tables.h:400:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

    Fixes: 71ae0dff02d756e ("netfilter: xtables: use percpu rule counters")
    Reported-by: kbuild test robot
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

18 Jun, 2015

1 commit

  • Let's force a 16-byte alignment on xt_counter percpu allocations,
    so that bytes and packets sit in the same cache line.

    xt_counter being exported to user space, we cannot add __align(16) on
    the structure itself.

    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Acked-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Eric Dumazet
     

16 Jun, 2015

1 commit

  • After Florian's patches, there is no need for XT_TABLE_INFO_SZ anymore:
    only one copy of the table is kept, instead of one copy per cpu.

    We can also avoid a dereference if we put the table data right after
    xt_table_info. It reduces register pressure and helps the compiler.

    Then we attempt a kmalloc() if the total size is under an order-3
    allocation, to reduce TLB pressure, as in many cases rules fit in 32 KB.

    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Eric Dumazet
     

14 Jun, 2015

5 commits


12 Jun, 2015

2 commits

  • We store the rule blob per (possible) cpu. Unfortunately this means we can
    waste a lot of memory on big smp machines. The ipt_entry structure
    ('rule head') is 112 bytes, so e.g. with maxcpu=64 one single rule eats
    close to 8k of RAM.

    Since previous patch made counters percpu it appears there is nothing
    left in the rule blob that needs to be percpu.

    On my test system (144 possible cpus, 400k dummy rules) this
    change saves close to 9 Gigabyte of RAM.

    Reported-by: Marcelo Ricardo Leitner
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The binary arp/ip/ip6tables ruleset is stored per cpu.

    The only reason left for the percpu duplication is the rule counters
    embedded in ipt_entry et al -- since each cpu has its own copy
    of the rules, all counters can be lockless.

    The downside is that the more cpus are supported, the more memory is
    required. Rules are not just duplicated per online cpu but for each
    possible cpu, i.e. if maxcpu is 144, then each rule is duplicated 144
    times, not just for the e.g. 64 cores actually present.

    To save some memory and also improve utilization of shared caches it
    would be preferable to only store the rule blob once.

    So we first need to separate counters and the rule blob.

    Instead of using entry->counters, allocate this percpu and store the
    percpu address in entry->counters.pcnt on CONFIG_SMP.

    This change makes no sense as-is; it is merely an intermediate step to
    remove the percpu duplication of the rule set in a followup patch.

    Suggested-by: Eric Dumazet
    Acked-by: Jesper Dangaard Brouer
    Reported-by: Marcelo Ricardo Leitner
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

16 May, 2015

1 commit

  • Currently, we have four xtables extensions that cannot be used from the
    xt over nft compat layer. The problem is that they need real access to
    the full-blown xt_entry to validate that the rule comes with the right
    dependencies. This check was introduced to overcome the lack of
    sufficient userspace dependency validation in iptables.

    To resolve this problem, this patch introduces a new field to the
    xt_tgchk_param structure that tells us if the extension is run from
    the nft_compat context.

    The affected extensions are:

    1) CLUSTERIP, this target has been superseded by xt_cluster. So just
    bail out by returning -EINVAL.

    2) TCPMSS. Relax the checking when used from nft_compat. If used with
    the wrong configuration, it will corrupt !syn packets by adding TCP
    MSS option.

    3) ebt_stp. Relax the check to make sure it uses the reserved
    destination MAC address for STP.

    Signed-off-by: Pablo Neira Ayuso
    Tested-by: Arturo Borrero Gonzalez

    Pablo Neira Ayuso