14 Nov, 2014

3 commits

  • Reallocation is only required for shrinking and expanding; both rely
    on a mutex for synchronization, and callers of rhashtable_init() are
    in non-atomic context. Therefore, there is no reason to continue
    passing allocation hints through the API.

    Instead, use GFP_KERNEL and add __GFP_NOWARN | __GFP_NORETRY to allow
    a silent fallback to vzalloc() without the OOM killer jumping in, as
    pointed out by Eric Dumazet and Eric W. Biederman. (A hedged sketch of
    this allocation pattern follows this entry.)

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
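
    A minimal sketch of the allocation pattern described above, using a
    hypothetical helper (this is not the rhashtable code itself): kzalloc()
    is tried first with __GFP_NOWARN | __GFP_NORETRY so a failure stays
    silent and the OOM killer is not invoked, and vzalloc() is the fallback.

        #include <linux/mm.h>
        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        /* Hypothetical helper showing the GFP_KERNEL + vzalloc() fallback. */
        static void *alloc_bucket_array(size_t size)
        {
                void *tbl;

                /* GFP_KERNEL is fine: callers hold a mutex, not atomic context */
                tbl = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
                if (tbl == NULL)
                        tbl = vzalloc(size);    /* vzalloc() also zeroes memory */

                return tbl;
        }

        static void free_bucket_array(void *tbl)
        {
                kvfree(tbl);    /* handles kmalloc'ed and vmalloc'ed memory alike */
        }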
     
  • Currently mutex_is_held can only test locks that are global since it
    takes no arguments. This prevents rhashtable from being used in places
    where locks are local, e.g., per-namespace locks.

    This patch adds a parent field to mutex_is_held and rhashtable_params
    so that local locks can be used (and tested); a hedged usage sketch
    follows this entry.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
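
    A hedged sketch of how the new parent field might be used with a
    per-namespace lock; the my_net/my_hash_lock_is_held names are made up
    for illustration, only the parent argument and the rhashtable_params
    field come from the description above.

        #include <linux/mutex.h>
        #include <linux/rhashtable.h>

        struct my_net {
                struct mutex      hash_lock;    /* per-namespace lock */
                struct rhashtable ht;
        };

        /* lockdep check against the lock of the namespace passed as parent */
        static int my_hash_lock_is_held(void *parent)
        {
                struct my_net *mn = parent;

                return lockdep_is_held(&mn->hash_lock);
        }

        static int my_net_init(struct my_net *mn)
        {
                struct rhashtable_params params = {
                        /* key/head offsets and hash function omitted */
                        .mutex_is_held = my_hash_lock_is_held,
                        .parent        = mn,    /* handed back to the callback */
                };

                mutex_init(&mn->hash_lock);
                return rhashtable_init(&mn->ht, &params);
        }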
     
  • The rhashtable function mutex_is_held is only used when PROVE_LOCKING
    is enabled. This patch modifies netfilter so that rhashtable.h itself
    can later make mutex_is_held optional depending on PROVE_LOCKING; an
    illustrative sketch of that direction follows this entry.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
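
    Purely illustrative sketch of the direction this prepares for, with
    hypothetical names: the lockdep-only callback is compiled in, and wired
    up, only when PROVE_LOCKING is enabled.

        #include <linux/mutex.h>
        #include <linux/rhashtable.h>

        static DEFINE_MUTEX(my_hash_mutex);  /* hypothetical lock for the table */

        #ifdef CONFIG_PROVE_LOCKING
        static int my_mutex_is_held(void *parent)
        {
                return lockdep_is_held(&my_hash_mutex);
        }
        #endif

        static struct rhashtable_params my_params = {
                /* key/head offsets and hash function omitted */
        #ifdef CONFIG_PROVE_LOCKING
                .mutex_is_held = my_mutex_is_held,
        #endif
        };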
     

03 Sep, 2014

1 commit

  • The sets are released from the RCU callback, after the rule is removed
    from the chain list, which implies that nfnetlink cannot update the
    hashes (thus, no resizing may occur) and no packets are walking over
    the set anymore. (A short background sketch on the RCU/lockdep
    interaction follows this entry.)

    This resolves a lockdep splat in the nft_hash_destroy() path since the
    nfnl mutex is not held there.

    ===============================
    [ INFO: suspicious RCU usage. ]
    3.16.0-rc2+ #168 Not tainted
    -------------------------------
    net/netfilter/nft_hash.c:362 suspicious rcu_dereference_protected() usage!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    1 lock held by ksoftirqd/0/3:
    #0: (rcu_callback){......}, at: [] rcu_process_callbacks+0x27e/0x4c7

    stack backtrace:
    CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 3.16.0-rc2+ #168
    Hardware name: LENOVO 23259H1/23259H1, BIOS G2ET32WW (1.12 ) 05/30/2012
    0000000000000001 ffff88011769bb98 ffffffff8142c922 0000000000000006
    ffff880117694090 ffff88011769bbc8 ffffffff8107c3ff ffff8800cba52400
    ffff8800c476bea8 ffff8800c476bea8 ffff8800cba52400 ffff88011769bc08
    Call Trace:
    [] dump_stack+0x4e/0x68
    [] lockdep_rcu_suspicious+0xfa/0x103
    [] nft_hash_destroy+0x50/0x137 [nft_hash]
    [] nft_set_destroy+0x11/0x2a [nf_tables]

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Thomas Graf

    Pablo Neira Ayuso
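
    Background sketch only, not the fix itself (all names hypothetical):
    rcu_dereference_protected() carries a lockdep condition stating why the
    access is safe without rcu_read_lock(). Evaluating a "mutex is held"
    condition from an RCU callback, where that mutex is not held, produces
    exactly the splat quoted above; releasing the set only after the rule
    is unlinked removes the need for that mutex in the destroy path.

        #include <linux/mutex.h>
        #include <linux/rcupdate.h>

        struct elem {
                struct elem __rcu *next;
        };

        struct table {
                struct elem __rcu *first;
        };

        /* Legal only while update_lock is held; called from an RCU callback
         * the condition is false and lockdep reports suspicious usage. */
        static struct elem *first_elem(struct table *t, struct mutex *update_lock)
        {
                return rcu_dereference_protected(t->first,
                                                 lockdep_is_held(update_lock));
        }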
     

03 Aug, 2014

1 commit

  • The sizing of the hash table and the practice of requiring a lookup
    to retrieve the pprev to be stored in the element cookie before the
    deletion of an entry are left intact.

    Signed-off-by: Thomas Graf
    Acked-by: Patrick McHardy
    Reviewed-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Thomas Graf
     

05 Jun, 2014

1 commit


03 Apr, 2014

2 commits

  • Now that nf_tables performs global accounting of set elements, it is not
    needed in the hash type anymore.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • The current set selection simply chooses the first set type that provides
    the requested features, which always results in the rbtree being chosen
    by virtue of being the first set in the list.

    What we actually want to do is choose the implementation that can provide
    the requested features and is optimal from either a performance or memory
    perspective depending on the characteristics of the elements and the
    preferences specified by the user.

    The elements are not known when creating a set. Even if we provided
    them for anonymous (literal) sets, we'd still have standalone sets where
    the elements are not known in advance. We therefore need an abstract
    description of the data characteristics.

    The kernel already knows the size of the key; this patch starts by
    introducing a nested set description which so far contains only the
    maximum number of elements. Based on this, the set implementations are
    changed to provide an estimate of the required amount of memory and
    the lookup complexity class.

    The set ops have a new callback ->estimate() that is invoked during set
    selection. It receives a structure containing the attributes known to the
    kernel and is supposed to populate a struct nft_set_estimate with the
    complexity class and, in case the size is known, the complete amount of
    memory required, or the amount of memory required per element otherwise.

    Based on the policy specified by the user (performance/memory, defaulting
    to performance) the kernel will then select the best suited implementation.

    Even if a set implementation could hold more than the specified maximum
    number of elements, the limit is enforced, since a different
    implementation selected based on that maximum might not be able to
    grow beyond it. (A hedged sketch of an ->estimate() callback follows
    this entry.)

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
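
    A hedged sketch of an ->estimate() callback along the lines described
    above; the parameter and field names are assumptions based on that
    description, and the my_hash* structures are made up for illustration.

        #include <net/netfilter/nf_tables.h>

        struct my_hash_elem {
                struct my_hash_elem *next;
                /* key and optional data follow */
        };

        struct my_hash {
                unsigned int        buckets;
                struct my_hash_elem **tbl;
        };

        /* Report expected memory use and lookup complexity so the core can
         * pick the best suited implementation for the given description. */
        static bool my_hash_estimate(const struct nft_set_desc *desc, u32 features,
                                     struct nft_set_estimate *est)
        {
                if (desc->size)         /* maximum number of elements known */
                        est->size = sizeof(struct my_hash) +
                                    desc->size * sizeof(struct my_hash_elem);
                else                    /* otherwise report per-element cost */
                        est->size = sizeof(struct my_hash_elem);

                est->class = NFT_SET_CLASS_O_1;  /* average-case O(1) lookups */
                return true;
        }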
     

19 Mar, 2014

1 commit


07 Mar, 2014

1 commit

  • The hash set type is very broken and was never meant to be merged in this
    state. Missing RCU synchronization on element removal, leaking chain
    refcounts when used as a verdict map, races during lookups, a fixed table
    size are probably just some of the problems. Luckily it is currently
    never chosen by the kernel when the rbtree type is also available.

    Rewrite it to be usable.

    The new implementation supports automatic hash table resizing using RCU,
    based on Paul McKenney's and Josh Triplett's algorithm "Optimized Resizing
    For RCU-Protected Hash Tables" described in [1].

    Resizing doesn't require a second list head in the elements; it works
    by choosing a hash function that remaps elements to a predictable set
    of buckets, only resizing by integral factors, and

    - during expansion: linking new buckets to the old bucket that contains
    elements for any of the new buckets, thereby creating imprecise chains,
    then incrementally separating the elements until the new buckets only
    contain elements that hash directly to them.

    - during shrinking: linking the hash chains of all old buckets that hash
    to the same new bucket to form a single chain.

    Expansion requires at most as many grace periods as there are elements
    in the longest hash chain; shrinking requires a single grace period.

    Due to the requirement of having hash chains/elements linked to multiple
    buckets during resizing, homemade singly linked lists are used instead
    of the existing list helpers, which don't support this in a clean
    fashion. As a side effect, the amount of memory required per element
    is reduced by one pointer.

    Expansion is triggered when the load factor exceeds 75%, shrinking when
    the load factor goes below 30%. Both operations are allowed to fail and
    will be retried on the next insertion or removal if their respective
    conditions still hold. (A small standalone sketch of the bucket-mapping
    property follows this entry.)

    [1] http://dl.acm.org/citation.cfm?id=2002181.2002192

    Reviewed-by: Josh Triplett
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
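
    A small standalone sketch (user-space C, purely illustrative) of the
    property the resizing scheme relies on: with power-of-two table sizes
    and integral growth factors, every element of a new bucket comes from a
    single predictable old bucket, which is what makes the imprecise chains
    and their incremental unzipping possible.

        #include <stdio.h>

        /* With power-of-two sizes, masking the hash selects the bucket. */
        static unsigned int bucket(unsigned int hash, unsigned int size)
        {
                return hash & (size - 1);
        }

        int main(void)
        {
                unsigned int old_size = 4, new_size = 8;  /* grow by factor 2 */
                unsigned int hash;

                for (hash = 0; hash < 64; hash++) {
                        unsigned int nb = bucket(hash, new_size);
                        unsigned int ob = bucket(hash, old_size);

                        /* each new bucket draws elements from exactly one old
                         * bucket: the new bucket index modulo the old size */
                        if (ob != (nb & (old_size - 1)))
                                printf("mapping violated for hash %u\n", hash);
                }
                printf("new-bucket -> old-bucket mapping holds\n");
                return 0;
        }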
     

14 Oct, 2013

2 commits

  • This patch adds the new netlink API for maintaining nf_tables sets
    independently of the ruleset. The API supports the following operations:

    - creation of sets
    - deletion of sets
    - querying of specific sets
    - dumping of all sets

    - addition of set elements
    - removal of set elements
    - dumping of all set elements

    Sets are identified by name; each table defines an individual namespace.
    The name of a set may be allocated automatically; this is mostly useful
    in combination with the NFT_SET_ANONYMOUS flag, which destroys a set
    automatically once the last reference has been released.

    Sets can be marked constant, meaning they're not allowed to change while
    linked to a rule. This allows lockless operation for set types that
    would otherwise require locking.

    Additionally, if the implementation supports it, sets can (as before) be
    used as maps, associating a data value with each key (or range), by
    specifying the NFT_SET_MAP flag and can be used for interval queries by
    specifying the NFT_SET_INTERVAL flag.

    Set elements are added and removed incrementally. All element operations
    support batching, reducing netlink message and set lookup overhead.

    The old "set" and "hash" expressions are replaced by a generic "lookup"
    expression, which binds to the specified set. Userspace is not aware
    of the actual set implementation used by the kernel anymore, all
    configuration options are generic.

    Currently the implementation selection logic is largely missing and the
    kernel will simply use the first registered implementation supporting the
    requested operation. Eventually, the plan is to have userspace supply a
    description of the data characteristics and select the implementation
    based on expected performance and memory use.

    This patch includes the new 'lookup' expression to look up elements
    in the set.

    This patch includes kernel-doc descriptions for this set API and it
    also includes the following fixes.

    From Patrick McHardy:
    * netfilter: nf_tables: fix set element data type in dumps
    * netfilter: nf_tables: fix indentation of struct nft_set_elem comments
    * netfilter: nf_tables: fix oops in nft_validate_data_load()
    * netfilter: nf_tables: fix oops while listing sets of built-in tables
    * netfilter: nf_tables: destroy anonymous sets immediately if binding fails
    * netfilter: nf_tables: propagate context to set iter callback
    * netfilter: nf_tables: add loop detection

    From Pablo Neira Ayuso:
    * netfilter: nf_tables: allow to dump all existing sets
    * netfilter: nf_tables: fix wrong type for flags variable in newelem

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • This patch adds nftables which is the intended successor of iptables.
    This packet filtering framework reuses the existing netfilter hooks,
    the connection tracking system, the NAT subsystem, the transparent
    proxying engine, the logging infrastructure and the userspace packet
    queueing facilities.

    In a nutshell, nftables provides a pseudo-state machine with 4 general
    purpose registers of 128 bits and 1 specific purpose register to store
    verdicts. This pseudo-machine comes with an extensible instruction set,
    a.k.a. "expressions" in the nftables jargon. The expressions included
    in this patch provide the basic functionality, they are:

    * bitwise: to perform bitwise operations.
    * byteorder: to change between host and network endianness.
    * cmp: to compare data with the content of the registers.
    * counter: to enable counters on rules.
    * ct: to store conntrack keys into register.
    * exthdr: to match IPv6 extension headers.
    * immediate: to load data into registers.
    * limit: to limit matching based on packet rate.
    * log: to log packets.
    * meta: to match metainformation that usually comes with the skbuff.
    * nat: to perform Network Address Translation.
    * payload: to fetch data from the packet payload and store it into
    registers.
    * reject (IPv4 only): to explicitly close a connection, e.g., TCP RST.

    Using this instruction-set, the userspace utility 'nft' can transform
    the rules expressed in human-readable text representation (using a
    new syntax, inspired by tcpdump) to nftables bytecode.

    nftables also inherits the table, chain and rule objects from
    iptables, but in a more configurable way, and it also includes the
    original datatype-agnostic set infrastructure with mapping support.
    This set infrastructure is enhanced in the follow up patch (netfilter:
    nf_tables: add netlink set API).

    This patch includes the following components:

    * the netlink API: net/netfilter/nf_tables_api.c and
    include/uapi/netfilter/nf_tables.h
    * the packet filter core: net/netfilter/nf_tables_core.c
    * the expressions (described above): net/netfilter/nft_*.c
    * the filter tables: arp, IPv4, IPv6 and bridge:
    net/ipv4/netfilter/nf_tables_ipv4.c
    net/ipv6/netfilter/nf_tables_ipv6.c
    net/ipv4/netfilter/nf_tables_arp.c
    net/bridge/netfilter/nf_tables_bridge.c
    * the NAT table (IPv4 only):
    net/ipv4/netfilter/nf_table_nat_ipv4.c
    * the route table (similar to mangle):
    net/ipv4/netfilter/nf_table_route_ipv4.c
    net/ipv6/netfilter/nf_table_route_ipv6.c
    * internal definitions under:
    include/net/netfilter/nf_tables.h
    include/net/netfilter/nf_tables_core.h
    * It also includes a skeleton expression:
    net/netfilter/nft_expr_template.c
    and the preliminary implementation of the meta target
    net/netfilter/nft_meta_target.c

    It also includes a change in struct nf_hook_ops to add a new pointer
    for storing private data in the hook, which is used to hold the rule
    list per chain.

    This patch is based on the patch from Patrick McHardy, plus the merged
    accumulated cleanups, fixes and small enhancements to the nftables
    code that have been made since 2009, which are:

    From Patrick McHardy:
    * nf_tables: adjust netlink handler function signatures
    * nf_tables: only retry table lookup after successful table module load
    * nf_tables: fix event notification echo and avoid unnecessary messages
    * nft_ct: add l3proto support
    * nf_tables: pass expression context to nft_validate_data_load()
    * nf_tables: remove redundant definition
    * nft_ct: fix maxattr initialization
    * nf_tables: fix invalid event type in nf_tables_getrule()
    * nf_tables: simplify nft_data_init() usage
    * nf_tables: build in more core modules
    * nf_tables: fix double lookup expression unregistation
    * nf_tables: move expression initialization to nf_tables_core.c
    * nf_tables: build in payload module
    * nf_tables: use NFPROTO constants
    * nf_tables: rename pid variables to portid
    * nf_tables: save 48 bits per rule
    * nf_tables: introduce chain rename
    * nf_tables: check for duplicate names on chain rename
    * nf_tables: remove ability to specify handles for new rules
    * nf_tables: return error for rule change request
    * nf_tables: return error for NLM_F_REPLACE without rule handle
    * nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
    * nf_tables: fix NLM_F_MULTI usage in netlink notifications
    * nf_tables: include NLM_F_APPEND in rule dumps

    From Pablo Neira Ayuso:
    * nf_tables: fix stack overflow in nf_tables_newrule
    * nf_tables: nft_ct: fix compilation warning
    * nf_tables: nft_ct: fix crash with invalid packets
    * nft_log: group and qthreshold are 2^16
    * nf_tables: nft_meta: fix socket uid,gid handling
    * nft_counter: allow to restore counters
    * nf_tables: fix module autoload
    * nf_tables: allow to remove all rules placed in one chain
    * nf_tables: use 64-bits rule handle instead of 16-bits
    * nf_tables: fix chain after rule deletion
    * nf_tables: improve deletion performance
    * nf_tables: add missing code in route chain type
    * nf_tables: rise maximum number of expressions from 12 to 128
    * nf_tables: don't delete table if in use
    * nf_tables: fix basechain release

    From Tomasz Bursztyka:
    * nf_tables: Add support for changing users chain's name
    * nf_tables: Change chain's name to be fixed sized
    * nf_tables: Add support for replacing a rule by another one
    * nf_tables: Update uapi nftables netlink header documentation

    From Florian Westphal:
    * nft_log: group is u16, snaplen u32

    From Phil Oester:
    * nf_tables: operational limit match

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy