17 Oct, 2015

2 commits

  • A recent change to the dst_output handling caused a new warning
    when the call to NF_HOOK() is the only use of a local variable
    passed as 'dev', and CONFIG_NETFILTER is disabled:

    net/ipv6/ip6_output.c: In function 'ip6_output':
    net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' [-Wunused-variable]

    The reason for this is that the NF_HOOK macro in this case does
    not reference the variable at all, and the call to dev_net(dev)
    got removed from the ip6_output function. To avoid that warning now
    and in the future, this changes the macro into an equivalent
    inline function, which tells the compiler that the variable is
    passed correctly but still unused.

    The dn_forward function apparently had the same problem in
    the past and added a local workaround that no longer works
    with the inline function. In order to avoid a regression, we
    have to also remove the #ifdef from decnet in the same patch.
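
    The macro-versus-inline difference described above can be sketched in
    plain C. This is only an illustration: the types are stand-ins, and
    NF_HOOK_MACRO, nf_hook_inline, finish_output and output_with_inline are
    hypothetical names with far fewer arguments than the real NF_HOOK().

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for illustration; not the kernel definitions. */
struct sk_buff { int len; };
struct net_device { int ifindex; };

/* Macro form of a disabled hook: 'dev' disappears during preprocessing, so
 * a local variable used only in this call looks unused to the compiler. */
#define NF_HOOK_MACRO(skb, dev, okfn) (okfn)(skb)

/* Inline form: 'dev' is a real argument, so the caller's variable counts
 * as used even though the body ignores it. */
static inline int nf_hook_inline(struct sk_buff *skb, struct net_device *dev,
                                 int (*okfn)(struct sk_buff *))
{
    (void)dev;
    assert(okfn != NULL);
    return okfn(skb);
}

/* Hypothetical continuation function. */
static int finish_output(struct sk_buff *skb)
{
    return skb->len;
}

static int output_with_inline(struct sk_buff *skb, struct net_device *netdev)
{
    struct net_device *dev = netdev;   /* only use is the hook call below */

    return nf_hook_inline(skb, dev, finish_output);
}
```

    Swapping nf_hook_inline for NF_HOOK_MACRO in output_with_inline would
    leave 'dev' without any reference after preprocessing, which is the
    situation the warning above reports.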

    Fixes: ede2059dbaf9 ("dst: Pass net into dst->output")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Pablo Neira Ayuso

    Arnd Bergmann
     
  • Since commit 8405a8fff3f8 ("netfilter: nf_qeueue: Drop queue entries on
    nf_unregister_hook"), all pending queued entries are discarded.

    So we can simply remove all of the owner handling: when a module is
    removed, it also needs to unregister all its hooks.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

05 Oct, 2015

3 commits

  • get_ct does not update its skb argument, and the only current user of
    nfnl_ct_hook is nfqueue, so we can add a const qualifier.

    Signed-off-by: Ken-ichirou MATSUZAWA

    Ken-ichirou MATSUZAWA
     
  • The idea of this patch series is to attach conntrack information to
    nflog, as nfqueue already does. The basis nfqueue uses for attaching
    conntrack info is generic, so rename it to a generic name, glue.

    Signed-off-by: Ken-ichirou MATSUZAWA
    Signed-off-by: Pablo Neira Ayuso

    Ken-ichirou MATSUZAWA
     
  • The original intention was to avoid dependencies between nfnetlink_queue and
    conntrack without ifdef pollution. However, we can achieve this by moving the
    conntrack dependent code into ctnetlink and keep some glue code to access the
    nfq_ct indirection from nfqueue.

    After this patch, the nfq_ct indirection is always compiled in the netfilter
    core to avoid polluting nfqueue with ifdefs. Thus, if nf_conntrack is not
    compiled, this wastes only 8 bytes of memory on x86_64.

    This patch also adds ctnetlink_nfqueue_seqadj() to avoid exposing the nf_conn
    structure layout to nf_queue, which would create another dependency on
    nf_conntrack at compilation time.
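
    A rough userspace sketch of this kind of indirection follows. All names
    (ct_glue_hook, ct_glue, queue_ct_size, queue_seq_adjust, conntrack_glue)
    are hypothetical and the locking/RCU details are left out; the point is
    only that the queue side sees one pointer and an opaque struct nf_conn.

```c
#include <assert.h>
#include <stddef.h>

struct nf_conn;                       /* opaque to the queue side */

/* Hypothetical glue: function pointers provided by the conntrack side. */
struct ct_glue_hook {
    size_t (*build_size)(const struct nf_conn *ct);
    void (*seq_adjust)(struct nf_conn *ct, int off);
};

/* The single always-present pointer (8 bytes on x86_64); NULL while the
 * conntrack side is not available. */
static const struct ct_glue_hook *ct_glue;

/* Queue side: never dereferences struct nf_conn itself. */
static size_t queue_ct_size(const struct nf_conn *ct)
{
    return ct_glue ? ct_glue->build_size(ct) : 0;
}

static void queue_seq_adjust(struct nf_conn *ct, int off)
{
    if (ct_glue)
        ct_glue->seq_adjust(ct, off);
}

/* Conntrack side: the only place that knows the structure layout. */
struct nf_conn { int seq_offset; };

static size_t ct_build_size(const struct nf_conn *ct)
{
    assert(ct != NULL);
    return sizeof(ct->seq_offset);
}

static void ct_seq_adjust(struct nf_conn *ct, int off)
{
    assert(ct != NULL);
    ct->seq_offset += off;
}

static const struct ct_glue_hook conntrack_glue = {
    .build_size = ct_build_size,
    .seq_adjust = ct_seq_adjust,
};
```

    In this sketch, assigning &conntrack_glue to ct_glue plays the role of
    the conntrack side registering itself; until then the queue helpers
    quietly do nothing.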

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

30 Sep, 2015

1 commit


19 Sep, 2015

1 commit


18 Sep, 2015

5 commits

  • This is immediately motivated by the bridge code that chains functions that
    call into netfilter. Without passing net into the okfns the bridge code would
    need to guess about the best expression for the network namespace to process
    packets in.

    As net is frequently one of the first things computed in continuation functions
    after netfilter has done its job, passing in the desired network namespace is in
    many cases a code simplification.

    To support this change the function dst_output_okfn is introduced to
    simplify passing dst_output as an okfn. For the moment dst_output_okfn
    just silently drops the struct net.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Pass a network namespace parameter into the netfilter hooks. At the
    call site of the netfilter hooks the path a packet is taking through
    the network stack is well known which allows the network namespace to
    be determined easily and reliably.

    This allows the replacement of magic code like
    "dev_net(state->in?:state->out)" that appears at the start of most
    netfilter hooks with "state->net".

    In almost all cases the network namespace passed in is derived
    from the first network device passed in, guaranteeing those
    paths will not see any changes in practice.

    The exceptions are:
    xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
    ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
    ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
    ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
    ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
    ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
    ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
    br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

    In all cases these exceptions seem to be a better expression for the
    network namespace the packet is being processed in than the historic
    "dev_net(in?in:out)". I am documenting them in case something odd
    pops up and someone starts trying to track down what happened.
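
    As a small illustration of the replacement described above, the two
    ways of finding the namespace can be put side by side. The structures
    here are trimmed-down stand-ins (hook_state is a hypothetical reduction
    of nf_hook_state), and the GNU "?:" shorthand is spelled out in
    standard C.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for illustration; not the kernel definitions. */
struct net { int id; };
struct net_device { struct net *nd_net; };

struct hook_state {
    struct net_device *in;
    struct net_device *out;
    struct net *net;          /* filled in at the hook call site */
};

static struct net *dev_net(const struct net_device *dev)
{
    assert(dev != NULL);
    return dev->nd_net;
}

/* Old style: every hook guesses the namespace from whichever device is set. */
static struct net *hook_net_old(const struct hook_state *state)
{
    return dev_net(state->in ? state->in : state->out);
}

/* New style: the caller already knows the namespace and passes it along. */
static struct net *hook_net_new(const struct hook_state *state)
{
    return state->net;
}
```

    When the caller derives state->net from the first device, both helpers
    return the same namespace; they can only differ at call sites such as
    the exceptions listed above.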

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The !CONFIG_NETFILTER definition of nf_hook_thresh calls okfn when
    the CONFIG_NETFILTER definition does not, making it buggy.

    As the !CONFIG_NETFILTER definition of nf_hook_thresh is not used, remove
    it rather than fix it.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

03 Sep, 2015

1 commit

  • Fengguang reported that some randconfig generated the following linker
    issue with nf_ct_zone_dflt object involved:

    [...]
    CC init/version.o
    LD init/built-in.o
    net/built-in.o: In function `ipv4_conntrack_defrag':
    nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
    net/built-in.o: In function `ipv6_defrag':
    nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt'
    make: *** [vmlinux] Error 1

    Given that configurations exist where we have a built-in part, which is
    accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user()
    and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
    module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
    area when netfilter is configured in general.

    Therefore, split the more generic parts into a common header under
    include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
    section that already holds parts related to CONFIG_NF_CONNTRACK in the
    netfilter core. This fixes the issue on my side.

    Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions")
    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

23 Jul, 2015

1 commit


16 Jul, 2015

2 commits

  • This prepares for a TEE like expression in nftables.
    We want to ensure only one duplicate is sent, so both will
    use the same percpu variable to detect duplication.

    The other use case is detection of recursive calls to xtables, but since
    we don't want a dependency from nft on the xtables core, it is put into core.c
    instead of the x_tables core.
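
    The duplicate-detection idea can be modelled in userspace with a
    thread-local flag standing in for the percpu variable. This is a sketch
    under that assumption; dup_active, copies_sent, send_duplicate and
    process_packet are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Thread-local flag standing in for the shared percpu variable. */
static _Thread_local bool dup_active;
static int copies_sent;

static void process_packet(int depth);

/* Emit at most one duplicate, even if the duplicate itself hits the same
 * rule again while it is being processed. */
static void send_duplicate(int depth)
{
    if (dup_active)
        return;                  /* a duplicate is already in flight */
    dup_active = true;
    copies_sent++;
    process_packet(depth + 1);   /* the clone re-enters the ruleset */
    dup_active = false;
}

/* Hypothetical ruleset in which every packet matches a "duplicate" rule. */
static void process_packet(int depth)
{
    assert(depth < 8);           /* would trip if recursion were unbounded */
    send_duplicate(depth);
}
```

    Because both users check the same flag, a duplicate created by one of
    them is not duplicated again by the other.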

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • - Add a new set of functions for registering and unregistering per
    network namespace hooks.

    - Modify the old global namespace hook functions to use the per
    network namespace hooks in their implementation, so there remains a
    single list that needs to be walked for any hook (this is important
    for keeping the hook priority working and for keeping the code
    walking the hooks simple).

    - Only allow registering the per netdevice hooks in the network
    namespace where the network device lives.

    - Dynamically allocate the structures in the per network namespace
    hook list in nf_register_net_hook, and unregister them in
    nf_unregister_net_hook.

    Dynamic allocation is required somewhere as the number of network
    namespaces is not fixed, so we might as well allocate them in the
    registration function.

    The chain of registered hooks on any list is expected to be small so
    the cost of walking that list to find the entry we are unregistering
    should also be small.

    Performing the management of the dynamically allocated list entries
    in the registration and unregistration functions keeps the complexity
    from spreading.
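
    The registration scheme above might look roughly like this in a
    userspace model. Everything here is a simplified assumption: a singly
    linked list instead of the kernel list helpers, no locking or RCU, and
    hypothetical names (hook_ops, hook_entry, netns, register_net_hook,
    unregister_net_hook).

```c
#include <assert.h>
#include <stdlib.h>

struct hook_ops { int priority; };           /* caller-owned registration */

struct hook_entry {                          /* allocated per namespace */
    struct hook_entry *next;
    const struct hook_ops *ops;
};

struct netns { struct hook_entry *hooks; };  /* one list per namespace */

/* Allocate an entry and insert it so the list stays sorted by priority. */
static int register_net_hook(struct netns *net, const struct hook_ops *ops)
{
    struct hook_entry **pos;
    struct hook_entry *entry;

    assert(net != NULL && ops != NULL);
    entry = malloc(sizeof(*entry));
    if (!entry)
        return -1;

    pos = &net->hooks;
    while (*pos && (*pos)->ops->priority <= ops->priority)
        pos = &(*pos)->next;

    entry->ops = ops;
    entry->next = *pos;
    *pos = entry;
    return 0;
}

/* Walk the (short) list, unlink the entry for 'ops' and free it. */
static void unregister_net_hook(struct netns *net, const struct hook_ops *ops)
{
    struct hook_entry **pos = &net->hooks;

    while (*pos && (*pos)->ops != ops)
        pos = &(*pos)->next;
    if (*pos) {
        struct hook_entry *entry = *pos;

        *pos = entry->next;
        free(entry);
    }
}
```

    A global registration in this model would simply call
    register_net_hook() once for every existing namespace, which keeps a
    single priority-ordered list per hook to walk.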

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

15 Jul, 2015

1 commit


19 Jun, 2015

1 commit

  • This pulls the full hook netfilter definitions from all those that include
    net_namespace.h.

    Instead let's just include the bare minimum required in the new
    linux/netfilter_defs.h file, and use it from the netfilter netns header files.

    I also needed to include in.h and in6.h from linux/netfilter.h otherwise we hit
    this compilation error:

    In file included from include/linux/netfilter_defs.h:4:0,
    from include/net/netns/netfilter.h:4,
    from include/net/net_namespace.h:22,
    from include/linux/netdevice.h:43,
    from net/netfilter/nfnetlink_queue_core.c:23:
    include/uapi/linux/netfilter.h:76:17: error: field ‘in’ has incomplete type
    struct in_addr in;

    And also explicitly include linux/netfilter.h in several spots.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Eric W. Biederman

    Pablo Neira Ayuso
     

14 May, 2015

4 commits

  • This patch adds the Netfilter ingress hook just after the existing tc ingress
    hook, which seems to be the consensus solution for this.

    Note that the Netfilter hook resides under the global static key that enables
    ingress filtering. Nonetheless, Netfilter still also has its own static key for
    minimal impact on the existing handle_ing().

    * Without this patch:

    Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
    16086246pps 7721Mb/sec (7721398080bps) errors: 100000000

    42.46% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
    25.92% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
    7.81% kpktgend_0 [pktgen] [k] pktgen_thread_worker
    5.62% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
    2.70% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
    2.34% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
    1.44% kpktgend_0 [kernel.kallsyms] [k] __build_skb

    * With this patch:

    Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
    16090536pps 7723Mb/sec (7723457280bps) errors: 100000000

    41.23% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
    26.57% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
    7.72% kpktgend_0 [pktgen] [k] pktgen_thread_worker
    5.55% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
    2.78% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
    2.06% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
    1.43% kpktgend_0 [kernel.kallsyms] [k] __build_skb

    * Without this patch + tc ingress:

    tc filter add dev eth4 parent ffff: protocol ip prio 1 \
    u32 match ip dst 4.3.2.1/32

    Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
    10788648pps 5178Mb/sec (5178551040bps) errors: 100000000

    40.99% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
    17.50% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
    11.77% kpktgend_0 [cls_u32] [k] u32_classify
    5.62% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
    5.18% kpktgend_0 [pktgen] [k] pktgen_thread_worker
    3.23% kpktgend_0 [kernel.kallsyms] [k] tc_classify
    2.97% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
    1.83% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
    1.50% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
    0.99% kpktgend_0 [kernel.kallsyms] [k] __build_skb

    * With this patch + tc ingress:

    tc filter add dev eth4 parent ffff: protocol ip prio 1 \
    u32 match ip dst 4.3.2.1/32

    Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
    10743194pps 5156Mb/sec (5156733120bps) errors: 100000000

    42.01% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
    17.78% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
    11.70% kpktgend_0 [cls_u32] [k] u32_classify
    5.46% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
    5.16% kpktgend_0 [pktgen] [k] pktgen_thread_worker
    2.98% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
    2.84% kpktgend_0 [kernel.kallsyms] [k] tc_classify
    1.96% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
    1.57% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk

    Note that the results are very similar before and after.

    I can see gcc gets the code under the ingress static key out of the hot path.
    Then, on that cold branch, it generates the code to accommodate the netfilter
    ingress static key. My explanation for this is that this reduces the pressure
    on the instruction cache for non-users as the new code is out of the hot path,
    and it comes with minimal impact for tc ingress users.

    Using gcc version 4.8.4 on:

    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 8
    [...]
    L1d cache: 16K
    L1i cache: 64K
    L2 cache: 2048K
    L3 cache: 8192K

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • In preparation to have netfilter ingress per-device hook list.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
     

08 Apr, 2015

3 commits

  • On the output paths in particular, we have to sometimes deal with two
    socket contexts. First, and usually skb->sk, is the local socket that
    generated the frame.

    And second, is potentially the socket used to control a tunneling
    socket, such as one that encapsulates using UDP.

    We do not want to disassociate skb->sk when encapsulating in order
    to fix this, because that would break socket memory accounting.

    The most extreme case where this can cause huge problems is an
    AF_PACKET socket transmitting over a vxlan device. We hit code
    paths doing checks that assume they are dealing with an ipv4
    socket, but are actually operating upon the AF_PACKET one.

    Signed-off-by: David S. Miller

    David Miller
     
  • It is currently always set to NULL, but nf_queue is adjusted to be
    prepared for it being set to a real socket by taking and releasing a
    reference to that socket when necessary.

    Signed-off-by: David S. Miller

    David Miller
     
  • This way we can consolidate where we setup new nf_hook_state objects,
    to make sure the entire thing is initialized.

    The only other place an nf_hook_state object is instantiated is nf_queue,
    wherein a structure copy is used.
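
    A minimal sketch of the two ways a state object comes into existence,
    assuming a reduced set of fields: one initializer that fills in every
    member, and a plain structure copy on the queueing side. hook_state,
    hook_state_init, queue_entry and queue_packet are hypothetical names
    for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the per-invocation hook state. */
struct hook_state {
    unsigned int hook;
    int pf;
    void *in;
    void *out;
    void *sk;
    int (*okfn)(void *skb);
};

/* Single place that sets up a state object, so no field is left
 * uninitialized when a new member is added later. */
static inline void hook_state_init(struct hook_state *p, unsigned int hook,
                                   int pf, void *in, void *out, void *sk,
                                   int (*okfn)(void *))
{
    assert(p != NULL);
    p->hook = hook;
    p->pf = pf;
    p->in = in;
    p->out = out;
    p->sk = sk;
    p->okfn = okfn;
}

/* Queueing side: keeps its own copy of an already initialized state. */
struct queue_entry { struct hook_state state; };

static void queue_packet(struct queue_entry *entry,
                         const struct hook_state *state)
{
    entry->state = *state;       /* structure copy */
}
```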

    Signed-off-by: David S. Miller

    David Miller
     

05 Apr, 2015

2 commits


25 Aug, 2014

1 commit


14 Oct, 2013

2 commits

  • This patch adds nftables which is the intended successor of iptables.
    This packet filtering framework reuses the existing netfilter hooks,
    the connection tracking system, the NAT subsystem, the transparent
    proxying engine, the logging infrastructure and the userspace packet
    queueing facilities.

    In a nutshell, nftables provides a pseudo-state machine with 4 general
    purpose registers of 128 bits and 1 specific purpose register to store
    verdicts. This pseudo-machine comes with an extensible instruction set,
    a.k.a. "expressions" in the nftables jargon. The expressions included
    in this patch provide the basic functionality, they are:

    * bitwise: to perform bitwise operations.
    * byteorder: to change from host/network endianness.
    * cmp: to compare data with the content of the registers.
    * counter: to enable counters on rules.
    * ct: to store conntrack keys into register.
    * exthdr: to match IPv6 extension headers.
    * immediate: to load data into registers.
    * limit: to limit matching based on packet rate.
    * log: to log packets.
    * meta: to match metainformation that usually comes with the skbuff.
    * nat: to perform Network Address Translation.
    * payload: to fetch data from the packet payload and store it into
    registers.
    * reject (IPv4 only): to explicitly close connection, eg. TCP RST.

    Using this instruction-set, the userspace utility 'nft' can transform
    the rules expressed in human-readable text representation (using a
    new syntax, inspired by tcpdump) to nftables bytecode.
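
    To make the pseudo-state machine more concrete, here is a toy
    interpreter with four 128-bit data registers and one verdict register.
    It is a loose sketch, not the nf_tables code: the opcode names, the
    verdict values and the expr layout are all invented for illustration,
    and only three simplified expression kinds are modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NREGS     4     /* general purpose registers */
#define REG_BYTES 16    /* 128 bits each */

/* Hypothetical verdict values for this sketch. */
enum verdict { VERDICT_CONTINUE, VERDICT_BREAK, VERDICT_ACCEPT, VERDICT_DROP };

struct regs {
    uint8_t data[NREGS][REG_BYTES];
    enum verdict verdict;            /* the special purpose register */
};

enum op { OP_PAYLOAD, OP_CMP, OP_VERDICT };

/* One "expression": a single instruction operating on the registers. */
struct expr {
    enum op op;
    unsigned int reg;                /* register to read or write */
    unsigned int offset;             /* payload offset (OP_PAYLOAD) */
    unsigned int len;                /* number of bytes involved */
    uint8_t data[REG_BYTES];         /* constant to compare (OP_CMP) */
    enum verdict verdict;            /* verdict to load (OP_VERDICT) */
};

/* Evaluate the expressions of one rule until a verdict is set. */
static enum verdict run_rule(const struct expr *e, size_t n,
                             const uint8_t *pkt, size_t pkt_len)
{
    struct regs r;

    memset(&r, 0, sizeof(r));
    r.verdict = VERDICT_CONTINUE;

    for (size_t i = 0; i < n && r.verdict == VERDICT_CONTINUE; i++) {
        assert(e[i].reg < NREGS && e[i].len <= REG_BYTES);
        switch (e[i].op) {
        case OP_PAYLOAD:             /* packet bytes -> register */
            if ((size_t)e[i].offset + e[i].len > pkt_len)
                r.verdict = VERDICT_BREAK;
            else
                memcpy(r.data[e[i].reg], pkt + e[i].offset, e[i].len);
            break;
        case OP_CMP:                 /* register vs. constant */
            if (memcmp(r.data[e[i].reg], e[i].data, e[i].len) != 0)
                r.verdict = VERDICT_BREAK;
            break;
        case OP_VERDICT:             /* constant -> verdict register */
            r.verdict = e[i].verdict;
            break;
        }
    }
    return r.verdict;
}
```

    In this model a rule is just an array of struct expr; a mismatch in
    OP_CMP stops evaluation of the rule before the verdict expression is
    reached.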

    nftables also inherits the table, chain and rule objects from
    iptables, but in a more configurable way, and it also includes the
    original datatype-agnostic set infrastructure with mapping support.
    This set infrastructure is enhanced in the follow up patch (netfilter:
    nf_tables: add netlink set API).

    This patch includes the following components:

    * the netlink API: net/netfilter/nf_tables_api.c and
    include/uapi/netfilter/nf_tables.h
    * the packet filter core: net/netfilter/nf_tables_core.c
    * the expressions (described above): net/netfilter/nft_*.c
    * the filter tables: arp, IPv4, IPv6 and bridge:
    net/ipv4/netfilter/nf_tables_ipv4.c
    net/ipv6/netfilter/nf_tables_ipv6.c
    net/ipv4/netfilter/nf_tables_arp.c
    net/bridge/netfilter/nf_tables_bridge.c
    * the NAT table (IPv4 only):
    net/ipv4/netfilter/nf_table_nat_ipv4.c
    * the route table (similar to mangle):
    net/ipv4/netfilter/nf_table_route_ipv4.c
    net/ipv6/netfilter/nf_table_route_ipv6.c
    * internal definitions under:
    include/net/netfilter/nf_tables.h
    include/net/netfilter/nf_tables_core.h
    * It also includes a skeleton expression:
    net/netfilter/nft_expr_template.c
    and the preliminary implementation of the meta target
    net/netfilter/nft_meta_target.c

    It also includes a change in struct nf_hook_ops to add a new
    pointer to store private data to the hook, that is used to store
    the rule list per chain.

    This patch is based on the patch from Patrick McHardy, plus merged
    accumulated cleanups, fixes and small enhancements to the nftables
    code that has been done since 2009, which are:

    From Patrick McHardy:
    * nf_tables: adjust netlink handler function signatures
    * nf_tables: only retry table lookup after successful table module load
    * nf_tables: fix event notification echo and avoid unnecessary messages
    * nft_ct: add l3proto support
    * nf_tables: pass expression context to nft_validate_data_load()
    * nf_tables: remove redundant definition
    * nft_ct: fix maxattr initialization
    * nf_tables: fix invalid event type in nf_tables_getrule()
    * nf_tables: simplify nft_data_init() usage
    * nf_tables: build in more core modules
    * nf_tables: fix double lookup expression unregistation
    * nf_tables: move expression initialization to nf_tables_core.c
    * nf_tables: build in payload module
    * nf_tables: use NFPROTO constants
    * nf_tables: rename pid variables to portid
    * nf_tables: save 48 bits per rule
    * nf_tables: introduce chain rename
    * nf_tables: check for duplicate names on chain rename
    * nf_tables: remove ability to specify handles for new rules
    * nf_tables: return error for rule change request
    * nf_tables: return error for NLM_F_REPLACE without rule handle
    * nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
    * nf_tables: fix NLM_F_MULTI usage in netlink notifications
    * nf_tables: include NLM_F_APPEND in rule dumps

    From Pablo Neira Ayuso:
    * nf_tables: fix stack overflow in nf_tables_newrule
    * nf_tables: nft_ct: fix compilation warning
    * nf_tables: nft_ct: fix crash with invalid packets
    * nft_log: group and qthreshold are 2^16
    * nf_tables: nft_meta: fix socket uid,gid handling
    * nft_counter: allow to restore counters
    * nf_tables: fix module autoload
    * nf_tables: allow to remove all rules placed in one chain
    * nf_tables: use 64-bits rule handle instead of 16-bits
    * nf_tables: fix chain after rule deletion
    * nf_tables: improve deletion performance
    * nf_tables: add missing code in route chain type
    * nf_tables: rise maximum number of expressions from 12 to 128
    * nf_tables: don't delete table if in use
    * nf_tables: fix basechain release

    From Tomasz Bursztyka:
    * nf_tables: Add support for changing users chain's name
    * nf_tables: Change chain's name to be fixed sized
    * nf_tables: Add support for replacing a rule by another one
    * nf_tables: Update uapi nftables netlink header documentation

    From Florian Westphal:
    * nft_log: group is u16, snaplen u32

    From Phil Oester:
    * nf_tables: operational limit match

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • Pass the hook ops to the hookfn to allow for generic hook
    functions. This change is required by nf_tables.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     

27 Sep, 2013

1 commit

  • There is a mix of function prototypes with and without extern
    in the kernel sources. Standardize on not using extern for
    function prototypes.

    Function prototypes don't need to be written with extern.
    extern is assumed by the compiler. Its use is as unnecessary as
    using auto to declare automatic/local variables in a block.
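
    For illustration, the two prototype spellings below declare exactly the
    same function, since function declarations have external linkage by
    default. nf_example_add is a made-up name used only for this sketch.

```c
#include <assert.h>

/* Old style: explicit extern on the prototype. */
extern int nf_example_add(int a, int b);

/* Preferred style: same declaration, extern is implied. */
int nf_example_add(int a, int b);

/* One definition satisfies both declarations above. */
int nf_example_add(int a, int b)
{
    return a + b;
}
```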

    Signed-off-by: Joe Perches

    Joe Perches
     

28 Aug, 2013

1 commit

  • Split out sequence number adjustments from NAT and move them to the conntrack
    core to make them usable for SYN proxying. The sequence number adjustment
    information is moved to a separate extend. The extend is added to new
    conntracks when a NAT mapping is set up for a connection using a helper.

    As a side effect, this saves 24 bytes per connection with NAT in the common
    case that a connection does not have a helper assigned.

    Signed-off-by: Patrick McHardy
    Tested-by: Martin Topholm
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     

13 Aug, 2013

1 commit


01 Aug, 2013

1 commit


31 Jul, 2013

1 commit


23 May, 2013

1 commit


06 Apr, 2013

1 commit


13 Oct, 2012

1 commit


30 Aug, 2012

1 commit


22 Jun, 2012

1 commit