09 Jan, 2018

40 commits

  • This new bit tells us that the conntrack entry is owned by the flow
    table offload infrastructure.

    # cat /proc/net/nf_conntrack
    ipv4 2 tcp 6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] mark=0 zone=0 use=2

    Note the [OFFLOAD] tag in the listing.

    From userspace, the timer of such conntrack entries looks stopped.
    In practice, to make sure the conntrack entry does not go away, the
    garbage collector resets the conntrack timer to an arbitrarily large
    value on every iteration, so the entry never expires. In the case of
    TCP flows, these entries display no internal state. Refreshing the
    timer this way saves a bit check in the packet path, which can keep
    relying on nf_ct_is_expired().

    Conntrack entries that have been offloaded to the flow table
    infrastructure cannot be deleted/flushed via ctnetlink. The flow table
    infrastructure is also responsible for releasing this conntrack entry.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
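
The timer refresh described in the commit above can be modelled in userspace roughly as follows. This is a minimal sketch, not the kernel code: `sketch_ct`, `SKETCH_CT_OFFLOAD`, `SKETCH_OFFLOAD_TIMEOUT` and both functions are hypothetical names, and the timeout value is arbitrary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; this models the idea, it is not the kernel code. */
#define SKETCH_CT_OFFLOAD      (1u << 0)  /* entry is owned by the flow table */
#define SKETCH_OFFLOAD_TIMEOUT 86400u     /* arbitrary large value, in ticks */

struct sketch_ct {
    uint32_t status;   /* flag bits */
    uint32_t timeout;  /* absolute expiry time, in ticks */
};

/* Wrap-safe expiry test; it does not look at the offload flag at all. */
static bool sketch_ct_is_expired(const struct sketch_ct *ct, uint32_t now)
{
    return (int32_t)(ct->timeout - now) <= 0;
}

/* One garbage collector iteration: push offloaded entries into the future. */
static void sketch_gc_pass(struct sketch_ct *table, size_t n, uint32_t now)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].status & SKETCH_CT_OFFLOAD)
            table[i].timeout = now + SKETCH_OFFLOAD_TIMEOUT;
    }
}
```

As long as the collector runs more often than `SKETCH_OFFLOAD_TIMEOUT`, an offloaded entry never appears expired, so the expiry test needs no special case for it.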
     
  • This macro is unnecessary, it just hides details for one single caller.
    nfnl_dereference() is just enough.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Users cannot inject packets with malformed IPv4/IPv6 headers into the
    stack via raw sockets. For IPv4, this holds since 55888dfb6ba7
    ("AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl
    (v2)"). IPv6 raw sockets also ensure that packets have a well-formed
    IPv6 header available in the skbuff.

    At a quick glance, br_netfilter also validates layer 3 headers and
    drops malformed IPv4 and IPv6 packets.

    Therefore, let's remove this defensive check all over the place.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Replacement for iptables "-m policy --dir in --policy {ipsec,none}".

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • This abstraction has no clients anymore, remove it.

    This is what remains from previous authors, so correct the copyright
    statement after recent modifications and code removal.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This is only needed by nf_queue, place this code where it belongs.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • We cannot make a direct call to nf_ip6_reroute() because that would result
    in autoloading the 'ipv6' module because of symbol dependencies.
    Therefore, define reroute indirection in nf_ipv6_ops where this really
    belongs.

    For IPv4, we can indeed make a direct function call, which is faster,
    given IPv4 is built into the networking code by default. Still,
    CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty
    inline stub for IPv4 in that case.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
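
The indirection pattern used by this commit (and by the route and checksum commits below) can be sketched in plain C as follows. This is a simplified userspace model under stated assumptions: every `sketch_` name is hypothetical, and a single source file cannot show the real link-time effect, only the call structure.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sketch_pkt {
    int family;    /* 4 or 6 */
    int rerouted;  /* set by the reroute function that ran */
};

struct sketch_ipv6_ops {
    int (*reroute)(struct sketch_pkt *pkt);
};

/* Set by the IPv6 code when it loads; NULL while it is absent. */
static const struct sketch_ipv6_ops *sketch_v6ops;

/* IPv4 is assumed to be built in, so it can be called directly. */
static int sketch_ip_reroute(struct sketch_pkt *pkt)
{
    pkt->rerouted = 4;
    return 0;
}

/* What the IPv6 module would provide through its ops structure. */
static int sketch_ip6_reroute(struct sketch_pkt *pkt)
{
    pkt->rerouted = 6;
    return 0;
}

static const struct sketch_ipv6_ops sketch_ipv6_module_ops = {
    .reroute = sketch_ip6_reroute,
};

static int sketch_reroute(struct sketch_pkt *pkt)
{
    switch (pkt->family) {
    case 4:
        return sketch_ip_reroute(pkt);  /* direct call */
    case 6:
        /* Indirect call: the caller never names the IPv6 symbol. */
        if (sketch_v6ops && sketch_v6ops->reroute)
            return sketch_v6ops->reroute(pkt);
        return 0;
    default:
        return 0;
    }
}
```

Until `sketch_v6ops` is set, IPv6 packets pass through untouched; once the ops pointer is published, the same call site reaches the IPv6 implementation.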
     
  • We cannot make a direct call to nf_ip6_route() because that would result
    in autoloading the 'ipv6' module because of symbol dependencies.
    Therefore, define route indirection in nf_ipv6_ops where this really
    belongs.

    For IPv4, we can indeed make a direct function call, which is faster,
    given IPv4 is built into the networking code by default. Still,
    CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty
    inline stub for IPv4 in that case.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This is only used by nf_queue.c and this function comes with no symbol
    dependencies with IPv6, it just refers to structure layouts. Therefore,
    we can replace it by a direct function call from where it belongs.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • We cannot make a direct call to nf_ip6_checksum_partial() because that
    would result in autoloading the 'ipv6' module because of symbol
    dependencies. Therefore, define checksum_partial indirection in
    nf_ipv6_ops where this really belongs.

    For IPv4, we can indeed make a direct function call, which is faster,
    given IPv4 is built into the networking code by default. Still,
    CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty
    inline stub for IPv4 in that case.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • We cannot make a direct call to nf_ip6_checksum() because that would
    result in autoloading the 'ipv6' module because of symbol dependencies.
    Therefore, define checksum indirection in nf_ipv6_ops where this really
    belongs.

    For IPv4, we can indeed make a direct function call, which is faster,
    given IPv4 is built into the networking code by default. Still,
    CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define an empty
    inline stub for IPv4 in that case.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
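
The "empty inline stub" mentioned in the commit above can be sketched as follows. `SKETCH_CONFIG_INET` is a hypothetical stand-in for CONFIG_INET, the function names are invented for illustration, and returning 0 from the stub is an assumption of this sketch rather than the kernel's actual behaviour.

```c
#include <stdint.h>

/* SKETCH_CONFIG_INET is a hypothetical stand-in for CONFIG_INET. */
#ifdef SKETCH_CONFIG_INET
/* IPv4 is built in: the real function lives in another source file. */
uint16_t sketch_ip_checksum(const uint8_t *data, unsigned int len);
#else
/* IPv4 is compiled out: callers still build, and the stub does nothing. */
static inline uint16_t sketch_ip_checksum(const uint8_t *data, unsigned int len)
{
    (void)data;
    (void)len;
    return 0;  /* assumption: "nothing to do" is reported as 0 */
}
#endif

/* The caller is identical in both configurations: a plain direct call. */
static uint16_t sketch_hook_checksum(const uint8_t *data, unsigned int len)
{
    return sketch_ip_checksum(data, len);
}
```

The point of the stub is that the call site needs no #ifdef of its own.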
     
  • This allows reusing the xt_connlimit infrastructure from nf_tables.
    The upcoming nf_tables frontend can just pass in an nftables register
    as the input key, which allows limiting by any nft-supported key,
    including concatenations.

    For xt_connlimit, pass in the zone and the ip/ipv6 address.

    With help from Yi-Hung Wei.

    Signed-off-by: Florian Westphal
    Acked-by: Yi-Hung Wei
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
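
Counting connections by an opaque key, as the connlimit commit above describes, might look like the sketch below. It is a toy linear table, not the kernel data structure; `sketch_counter`, `sketch_count()` and the size limits are hypothetical, and the caller is assumed to keep `key_words` at or below `SKETCH_KEY_WORDS`.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_KEY_WORDS 5   /* hypothetical maximum key size, in 32-bit words */
#define SKETCH_SLOTS     16  /* hypothetical table size */

struct sketch_slot {
    uint32_t key[SKETCH_KEY_WORDS];
    unsigned int count;      /* 0 marks a free slot */
};

struct sketch_counter {
    unsigned int key_words;  /* words compared per key */
    struct sketch_slot slots[SKETCH_SLOTS];
};

/*
 * Account one connection under 'key' and return how many connections
 * now share that key, or 0 if the table is full.
 */
static unsigned int sketch_count(struct sketch_counter *c, const uint32_t *key)
{
    size_t len = c->key_words * sizeof(uint32_t);
    struct sketch_slot *free_slot = NULL;

    for (int i = 0; i < SKETCH_SLOTS; i++) {
        struct sketch_slot *s = &c->slots[i];

        if (s->count == 0) {
            if (!free_slot)
                free_slot = s;  /* remember the first free slot */
            continue;
        }
        if (memcmp(s->key, key, len) == 0)
            return ++s->count;
    }
    if (!free_slot)
        return 0;
    memcpy(free_slot->key, key, len);
    free_slot->count = 1;
    return 1;
}
```

Because the counter only compares raw words, the same code serves a "zone plus address" key and any other concatenation the caller builds.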
     
  • They don't belong to the family definition, move them to the filter
    chain type definition instead.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Since NFPROTO_INET is handled from the core, we don't need to maintain
    extra infrastructure in nf_tables to handle the double hook
    registration, one for IPv4 and another for IPv6.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Use new native NFPROTO_INET support in netfilter core, this gets rid of
    ad-hoc code in the nf_tables API codebase.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Expand NFPROTO_INET in two hook registrations, one for NFPROTO_IPV4 and
    another for NFPROTO_IPV6. Hence, we handle NFPROTO_INET from the core.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
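
Expanding one NFPROTO_INET registration into an IPv4 and an IPv6 registration, as described above, can be sketched like this. All names are hypothetical and the fixed-size arrays are a simplification; the rollback on failure is an assumption about sensible behaviour, not a quote of the kernel code.

```c
/* Hypothetical family identifiers; INET means "both IPv4 and IPv6". */
enum sketch_family { SKETCH_IPV4, SKETCH_IPV6, SKETCH_INET };

#define SKETCH_REAL_FAMILIES 2  /* only IPv4 and IPv6 own hook arrays */
#define SKETCH_MAX_HOOKS     8  /* hypothetical per-family capacity */

typedef int (*sketch_hook_fn)(void *pkt);

struct sketch_core {
    sketch_hook_fn hooks[SKETCH_REAL_FAMILIES][SKETCH_MAX_HOOKS];
    int nhooks[SKETCH_REAL_FAMILIES];
};

/* Example hook that accepts every packet. */
static int sketch_accept(void *pkt)
{
    (void)pkt;
    return 1;
}

/* Register for one real family (IPv4 or IPv6). */
static int sketch_register_one(struct sketch_core *core,
                               enum sketch_family pf, sketch_hook_fn fn)
{
    if (core->nhooks[pf] >= SKETCH_MAX_HOOKS)
        return -1;
    core->hooks[pf][core->nhooks[pf]++] = fn;
    return 0;
}

/* INET is not stored anywhere: it expands into two real registrations. */
static int sketch_register(struct sketch_core *core,
                           enum sketch_family pf, sketch_hook_fn fn)
{
    int err;

    if (pf != SKETCH_INET)
        return sketch_register_one(core, pf, fn);

    err = sketch_register_one(core, SKETCH_IPV4, fn);
    if (err)
        return err;
    err = sketch_register_one(core, SKETCH_IPV6, fn);
    if (err)
        core->nhooks[SKETCH_IPV4]--;  /* roll back the IPv4 half */
    return err;
}
```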
     
  • So static_key_slow_dec applies to the family behind NFPROTO_INET.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Instead of passing struct nf_hook_ops, this is needed by follow up
    patches to handle NFPROTO_INET from the core.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Just a cleanup, __nf_unregister_net_hook() is used by a follow up patch
    when handling NFPROTO_INET as a real family from the core.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Add helper function to test for the NFT_SET_ANONYMOUS flag.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
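
A flag-test helper of the kind this commit adds might look like the sketch below. The structure and names are hypothetical stand-ins; only the idea of wrapping the flag test in a small inline function comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SET_ANONYMOUS 0x1u  /* stands in for NFT_SET_ANONYMOUS */
#define SKETCH_SET_CONSTANT  0x2u  /* a second flag, for contrast */

struct sketch_set {
    uint32_t flags;
};

/* Callers ask a question instead of open-coding the bit test. */
static inline bool sketch_set_is_anonymous(const struct sketch_set *set)
{
    return set->flags & SKETCH_SET_ANONYMOUS;
}
```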
     
  • Instead of calling this function from the family specific variant, this
    reduces the code size in the fast path for the netdev, bridge and inet
    families. After this change, we must call nft_set_pktinfo() upfront from
    the chain hook indirection.

    Before:

       text    data     bss     dec     hex filename
       2145     208       0    2353     931 net/netfilter/nf_tables_netdev.o

    After:

       text    data     bss     dec     hex filename
       2125     208       0    2333     91d net/netfilter/nf_tables_netdev.o

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • 46928a0b49f3 ("netfilter: nf_tables: remove multihook chains and
    families") already removed this; this is a leftover.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This is no problem for iptables, as priorities are fixed values defined
    in the nat modules, but in nftables the priority comes from userspace.

    Reject in case we see that such a hook would not work.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The netfilter NAT core cannot deal with more than one NAT hook per hook
    location (prerouting, input ...), because the NAT hooks install a NAT null
    binding in case the iptables nat table (iptable_nat hooks) or the
    corresponding nftables chain (nft nat hooks) doesn't specify a nat
    transformation.

    Null bindings are needed to detect port collisions between NAT-ed and
    non-NAT-ed connections.

    This causes nftables NAT rules to not work when iptable_nat module is
    loaded, and vice versa because nat binding has already been attached
    when the second nat hook is consulted.

    The netfilter core is not really the correct location to handle this
    (hooks are just hooks, the core has no notion of what kinds of side
    effects a hook implements), but it's the only place where we can check
    for conflicts between both iptables hooks and nftables hooks without
    adding dependencies.

    So add nat annotation to hook_ops to describe those hooks that will
    add NAT bindings and then make core reject if such a hook already exists.
    The annotation fills a padding hole; in case further restrictions appear,
    we might change this to a 'u8 type' instead of bool.

    iptables error if nft nat hook active:
    iptables -t nat -A POSTROUTING -j MASQUERADE
    iptables v1.4.21: can't initialize iptables table `nat': File exists
    Perhaps iptables or your kernel needs to be upgraded.

    nftables error if iptables nat table present:
    nft -f /etc/nftables/ipv4-nat
    /usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists
    table nat {
    ^^

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
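
The conflict check described above can be sketched as a registration function that refuses a second NAT-annotated hook at the same hook location. The structures and names are hypothetical simplifications; the -EEXIST result matches the "File exists" errors quoted in the commit message.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_NUMHOOKS     5  /* prerouting, input, forward, output, postrouting */
#define SKETCH_MAX_PER_HOOK 8  /* hypothetical capacity per hook location */

struct sketch_hook_ops {
    unsigned int hooknum;
    int priority;
    bool nat_hook;             /* annotation: this hook installs NAT bindings */
};

struct sketch_hook_table {
    const struct sketch_hook_ops *ops[SKETCH_NUMHOOKS][SKETCH_MAX_PER_HOOK];
    int count[SKETCH_NUMHOOKS];
};

static int sketch_register_hook(struct sketch_hook_table *t,
                                const struct sketch_hook_ops *reg)
{
    if (reg->hooknum >= SKETCH_NUMHOOKS)
        return -EINVAL;
    if (t->count[reg->hooknum] >= SKETCH_MAX_PER_HOOK)
        return -ENOSPC;

    /* At most one NAT hook per hook location. */
    if (reg->nat_hook) {
        for (int i = 0; i < t->count[reg->hooknum]; i++) {
            if (t->ops[reg->hooknum][i]->nat_hook)
                return -EEXIST;
        }
    }

    t->ops[reg->hooknum][t->count[reg->hooknum]++] = reg;
    return 0;
}
```

The core stays ignorant of what NAT does; it only compares the annotation, which is why the check works for iptables and nftables hooks alike.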
     
  • Currently we always return -ENOENT to userspace if we can't find
    a particular table, or if the table initialization fails.

    A followup patch will make nat table init fail in case nftables already
    registered a nat hook, so this change makes xt_find_table_lock return
    an ERR_PTR to return the errno value reported from the table init
    function.

    Add xt_request_find_table_lock as try_then_request_module replacement
    and use it where needed.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
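
Returning an errno value through a pointer, as the commit above does with ERR_PTR, can be modelled in userspace as below. The `sketch_` helpers imitate the kernel's ERR_PTR/PTR_ERR/IS_ERR convention under the assumption that the top 4095 addresses are never valid pointers; the table structure and lookup function are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer and decode it again. */
static inline void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static inline bool sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_table {
    const char *name;
    int init_err;  /* 0, or the negative errno its init function reports */
    bool ready;
};

/* Find a table by name, initializing it on first use. */
static struct sketch_table *sketch_find_table(struct sketch_table *tables,
                                              size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tables[i].name, name) != 0)
            continue;
        if (!tables[i].ready) {
            if (tables[i].init_err)
                return sketch_err_ptr(tables[i].init_err);
            tables[i].ready = true;
        }
        return &tables[i];
    }
    return sketch_err_ptr(-ENOENT);  /* no such table */
}
```

With a plain NULL return the caller could only guess at -ENOENT; the encoded pointer lets a failing init report its own reason, such as -EEXIST.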
     
  • This can be the same as NF_INET_NUMHOOKS if we don't support DECNET.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • No need to define hook points if the family isn't supported.
    Because we need these hooks for either nftables, arp/ebtables
    or the 'call-iptables' hack we have in the bridge layer, add two
    new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the
    users select them.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • No need to define hook points if the family isn't supported.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Not all families share the same hook count, adjust sizes to what is
    needed.

    struct net before:
    /* size: 6592, cachelines: 103, members: 46 */
    after:
    /* size: 5952, cachelines: 93, members: 46 */

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The kernel already has defines for this, but they are in uapi exposed
    headers.

    Including these from netns.h causes build errors and also adds unneeded
    dependencies on headers that we don't need.

    So move these defines to netfilter_defs.h and place the uapi ones
    in ifndef __KERNEL__ to keep them for userspace.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • struct net contains:

    struct nf_hook_entries __rcu *hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];

    which stores the hook entry point locations for the various protocol
    families and the hooks.

    Using an array results in compact C code when doing accesses, i.e.
    x = rcu_dereference(net->nf.hooks[pf][hook]);

    but it's also wasting a lot of memory, as most families are
    not used.

    So split the array into those families that are used, which
    are only 5 (instead of 13). In most cases, the 'pf' argument is
    constant, so gcc removes the switch statement that selects the
    per-family array.

    struct net before:
    /* size: 5184, cachelines: 81, members: 46 */
    after:
    /* size: 4672, cachelines: 73, members: 46 */

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
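
The split described above replaces one two-dimensional array with one array per used family, selected by a switch. The sketch below is a userspace model with hypothetical names; it also gives each family its own array length, which is what the later "adjust sizes" commit in this list describes, and the family numbers and hook counts are assumptions of this sketch.

```c
#include <stddef.h>

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Sparse family numbers, as in a 13-entry protocol family space. */
enum sketch_pf {
    SKETCH_PF_IPV4   = 2,
    SKETCH_PF_ARP    = 3,
    SKETCH_PF_BRIDGE = 7,
    SKETCH_PF_IPV6   = 10,
    SKETCH_PF_DECNET = 12,
};

struct sketch_entries;  /* opaque hook entry blob */

/* One array per family that is actually used, each with its own size. */
struct sketch_netns_nf {
    struct sketch_entries *hooks_ipv4[5];
    struct sketch_entries *hooks_ipv6[5];
    struct sketch_entries *hooks_arp[3];
    struct sketch_entries *hooks_bridge[6];
    struct sketch_entries *hooks_decnet[7];
};

/* Return the slot for (pf, hook), or NULL if the pair is not supported. */
static struct sketch_entries **sketch_hook_head(struct sketch_netns_nf *nf,
                                                int pf, unsigned int hook)
{
    switch (pf) {
    case SKETCH_PF_IPV4:
        return hook < SKETCH_ARRAY_SIZE(nf->hooks_ipv4) ? &nf->hooks_ipv4[hook] : NULL;
    case SKETCH_PF_IPV6:
        return hook < SKETCH_ARRAY_SIZE(nf->hooks_ipv6) ? &nf->hooks_ipv6[hook] : NULL;
    case SKETCH_PF_ARP:
        return hook < SKETCH_ARRAY_SIZE(nf->hooks_arp) ? &nf->hooks_arp[hook] : NULL;
    case SKETCH_PF_BRIDGE:
        return hook < SKETCH_ARRAY_SIZE(nf->hooks_bridge) ? &nf->hooks_bridge[hook] : NULL;
    case SKETCH_PF_DECNET:
        return hook < SKETCH_ARRAY_SIZE(nf->hooks_decnet) ? &nf->hooks_decnet[hook] : NULL;
    default:
        return NULL;
    }
}
```

When a caller passes a compile-time constant for `pf`, the compiler can resolve the switch and keep only the matching branch.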
     
  • Giuseppe Scrivano says:
    "SELinux, if enabled, registers for each new network namespace 6
    netfilter hooks."

    Cost for this is high. With synchronize_net() removed:
    "The net benefit on an SMP machine with two cores is that creating a
    new network namespace takes -40% of the original time."

    This patch replaces synchronize_net+kvfree with call_rcu().
    We store rcu_head at the tail of a structure that has no fixed layout,
    i.e. we cannot use offsetof() to compute the start of the original
    allocation. Thus store this information right after the rcu head.

    We could simplify this by just placing the rcu_head at the start
    of struct nf_hook_entries. However, this structure is used in
    packet processing hotpath, so only place what is needed for that
    at the beginning of the struct.

    Reported-by: Giuseppe Scrivano
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
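
The layout trick described above, a callback head placed after a variable-length array together with a pointer back to the start of the allocation, can be sketched in userspace as follows. `sketch_rcu_head` only imitates struct rcu_head, the callback runs immediately instead of after a grace period, all names are hypothetical, and the code assumes data pointers and function pointers share the same alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef int (*sketch_hook_fn)(void *priv);

struct sketch_entry {
    sketch_hook_fn hook;
    void *priv;
};

/* Hot-path part: only what packet processing needs sits at the front. */
struct sketch_entries {
    uint16_t num;
    struct sketch_entry hooks[];  /* variable length, so no fixed layout */
};

/* Stand-in for struct rcu_head: a deferred callback record. */
struct sketch_rcu_head {
    void (*func)(struct sketch_rcu_head *head);
};

/* Placed after hooks[num]; remembers where the allocation starts. */
struct sketch_entries_tail {
    struct sketch_rcu_head head;
    void *allocation;
};

static uintptr_t sketch_last_freed;  /* records the freed address for inspection */

static struct sketch_entries_tail *sketch_tail(struct sketch_entries *e)
{
    return (struct sketch_entries_tail *)&e->hooks[e->num];
}

static struct sketch_entries *sketch_entries_alloc(uint16_t num)
{
    size_t size = sizeof(struct sketch_entries) +
                  num * sizeof(struct sketch_entry) +
                  sizeof(struct sketch_entries_tail);
    struct sketch_entries *e = calloc(1, size);

    if (!e)
        return NULL;
    e->num = num;
    sketch_tail(e)->allocation = e;
    return e;
}

/* The callback only receives the head, so it reads the stored pointer. */
static void sketch_entries_free_cb(struct sketch_rcu_head *head)
{
    /* 'head' is the first member of the tail, so the cast is valid. */
    struct sketch_entries_tail *tail = (struct sketch_entries_tail *)head;

    sketch_last_freed = (uintptr_t)tail->allocation;
    free(tail->allocation);
}

/* The kernel would defer this with call_rcu(); here it runs at once. */
static void sketch_entries_release(struct sketch_entries *e)
{
    struct sketch_rcu_head *head = &sketch_tail(e)->head;

    head->func = sketch_entries_free_cb;
    head->func(head);
}
```

Since the distance from the tail back to the start depends on `num`, offsetof() cannot recover it, which is why the start address is stored explicitly.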
     
  • Since commit 960632ece6949b ("netfilter: convert hook list to an array")
    nfqueue no longer stores a pointer to the hook that caused the packet
    to be queued. Therefore no extra synchronize_net() call is needed after
    dropping the packets enqueued by the old rule blob.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • This reverts commit d3ad2c17b4047
    ("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").

    Nothing wrong with it. However, a followup patch will delay freeing of
    hooks with call_rcu, so all synchronize_net() calls become obsolete and
    there is no need anymore for this batching.

    This revert causes a temporary performance degradation when destroying
    network namespace, but it's resolved with the upcoming call_rcu conversion.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Change old multi-line comment style to kernel comment style and
    remove unwanted comments.

    Signed-off-by: Varsha Rao
    Signed-off-by: Pablo Neira Ayuso

    Varsha Rao
     
  • When sets are extremely large we can get softlockup during ipset -L.
    We could fix this by adding cond_resched_rcu() at the right location
    during iteration, but this only works if RCU nesting depth is 1.

    At this time the entire variant->list() is called under rcu_read_lock_bh.
    This used to be a read_lock_bh() but as rcu doesn't really lock anything,
    it does not appear to be needed, so remove it (ipset increments set
    reference count before this, so a set deletion should not be possible).

    Reported-by: Li Shuang
    Signed-off-by: Florian Westphal
    Acked-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Check that we really hold nfnl mutex here instead of relying on correct
    usage alone.

    Signed-off-by: Florian Westphal
    Acked-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The ipvsh parameter of frag_safe_skb_hp isn't used now, so remove it
    and update the callers too.

    Signed-off-by: Gao Feng
    Acked-by: Simon Horman
    Signed-off-by: Pablo Neira Ayuso

    Gao Feng
     
  • Nowadays this is just the default template that is used when setting up
    the net namespace, so nothing writes to these locations.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • In preparation to enabling -Wimplicit-fallthrough, mark switch cases
    where we are expecting to fall through.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Simon Horman
    Signed-off-by: Pablo Neira Ayuso

    Gustavo A. R. Silva
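
Marking an intentional fall-through, as the commit above does, looks like this in plain C. The states and timeout values are hypothetical; only the annotation style is the point. GCC's -Wimplicit-fallthrough accepts a "fall through" comment placed directly before the next case label.

```c
/* Hypothetical states and values, chosen only to show the annotation. */
enum sketch_state { SKETCH_NEW, SKETCH_ESTABLISHED, SKETCH_CLOSING };

static int sketch_timeout(enum sketch_state st)
{
    int timeout = 0;

    switch (st) {
    case SKETCH_NEW:
        timeout += 30;
        /* fall through */
    case SKETCH_ESTABLISHED:
        timeout += 60;
        break;
    case SKETCH_CLOSING:
        timeout = 10;
        break;
    }
    return timeout;
}
```

Without the comment, the missing break after SKETCH_NEW is indistinguishable from a bug, both for the compiler and for a reviewer.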