13 Jan, 2021

1 commit

  • commit 443d6e86f821a165fae3fc3fc13086d27ac140b1 upstream.

    This fixes the dereference to fetch the RCU pointer when holding
    the appropriate xtables lock.

    Reported-by: kernel test robot
    Fixes: cc00bcaa5899 ("netfilter: x_tables: Switch synchronization to RCU")
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Greg Kroah-Hartman

    Subash Abhinov Kasiviswanathan
     

08 Dec, 2020

1 commit

  • When running concurrent iptables rules replacement with data, the per CPU
    sequence count is checked after the assignment of the new information.
    The sequence count is used to synchronize with the packet path without the
    use of any explicit locking. If there are any packets in the packet path using
    the table information, the sequence count is incremented to an odd value and
    is incremented to an even value once packet processing completes.

    The new table value assignment is followed by a write memory barrier so every
    CPU should see the latest value. If the packet path has started with the old
    table information, the sequence counter will be odd and the iptables
    replacement will wait till the sequence count is even prior to freeing the
    old table info.

    However, this assumes that the new table information assignment and the memory
    barrier are actually executed prior to the counter check in the replacement
    thread. If the CPU decides to execute the assignment later, as there is no user
    of the table information prior to the sequence check, the packet path on another
    CPU may use the old table information. The replacement thread would then free
    the table information under it, leading to a use after free in the packet
    processing context:

    Unable to handle kernel NULL pointer dereference at virtual
    address 000000000000008e
    pc : ip6t_do_table+0x5d0/0x89c
    lr : ip6t_do_table+0x5b8/0x89c
    ip6t_do_table+0x5d0/0x89c
    ip6table_filter_hook+0x24/0x30
    nf_hook_slow+0x84/0x120
    ip6_input+0x74/0xe0
    ip6_rcv_finish+0x7c/0x128
    ipv6_rcv+0xac/0xe4
    __netif_receive_skb+0x84/0x17c
    process_backlog+0x15c/0x1b8
    napi_poll+0x88/0x284
    net_rx_action+0xbc/0x23c
    __do_softirq+0x20c/0x48c

    This could be fixed by forcing instruction order after the new table
    information assignment or by switching to RCU for the synchronization.
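
    The odd/even protocol described above can be modelled in userspace. The
    following is only a sketch for a single CPU using C11 atomics, not the
    kernel code: the names pkt_seq, live_table, packet_path_enter(),
    packet_path_exit(), replace_table() and reader_active() are hypothetical.
    It illustrates the first option (forcing the order of the assignment and
    the counter check), not the RCU conversion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the table information (not the kernel struct). */
struct table_info {
	int generation;
};

/* Stand-ins for one CPU's sequence count and the live table pointer. */
atomic_uint pkt_seq;
_Atomic(struct table_info *) live_table;

/* Packet path entry: make the count odd, then fetch the table. */
struct table_info *packet_path_enter(void)
{
	atomic_fetch_add(&pkt_seq, 1);
	return atomic_load(&live_table);
}

/* Packet path exit: make the count even again. */
void packet_path_exit(void)
{
	atomic_fetch_add(&pkt_seq, 1);
}

/*
 * Replacement: publish the new table and return the old one. The
 * sequentially consistent exchange cannot be reordered after the later
 * counter load in reader_active(), which is the ordering the commit
 * message says was missing.
 */
struct table_info *replace_table(struct table_info *new_info)
{
	assert(new_info != NULL);
	return atomic_exchange(&live_table, new_info);
}

/* An odd count means a reader may still be using the old table. */
bool reader_active(void)
{
	return (atomic_load(&pkt_seq) & 1u) != 0;
}
```

    In this model the replacement side would free the pointer returned by
    replace_table() only once reader_active() returns false.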

    Fixes: 80055dab5de0 ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
    Reported-by: Sean Tranchetti
    Reported-by: kernel test robot
    Suggested-by: Florian Westphal
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Signed-off-by: Pablo Neira Ayuso

    Subash Abhinov Kasiviswanathan
     

20 Nov, 2020

1 commit

  • IPV6=m
    NF_DEFRAG_IPV6=y

    ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function
    `nf_ct_frag6_gather':
    net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to
    `ipv6_frag_thdr_truncated'

    Netfilter depends on the ipv6 symbol ipv6_frag_thdr_truncated. This
    dependency forces IPV6=y.

    Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This
    is the same solution as used for a similar issue; see
    commit 70b095c843266 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6
    module")

    Fixes: 9d9e937b1c8b ("ipv6/netfilter: Discard first fragment not including all headers")
    Reported-by: Randy Dunlap
    Reported-by: kernel test robot
    Signed-off-by: Georg Kohmann
    Acked-by: Pablo Neira Ayuso
    Acked-by: Randy Dunlap # build-tested
    Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.com
    Signed-off-by: Jakub Kicinski

    Georg Kohmann
     

17 Nov, 2020

1 commit

  • Packets are processed even though the first fragment doesn't include all
    headers through the upper layer header. This breaks TAHI IPv6 Core
    Conformance Test v6LC.1.3.6.

    Referring to RFC8200 SECTION 4.5: "If the first fragment does not include
    all headers through an Upper-Layer header, then that fragment should be
    discarded and an ICMP Parameter Problem, Code 3, message should be sent to
    the source of the fragment, with the Pointer field set to zero."

    The fragment needs to be validated the same way it is done in
    commit 2efdaaaf883a ("IPv6: reply ICMP error if the first fragment don't
    include all headers") for ipv6. Wrap the validation into a common function,
    ipv6_frag_thdr_truncated(), to check for truncation in the upper layer
    header. This validation does not fulfill all aspects of RFC 8200,
    section 4.5, but is at the moment sufficient to pass the mentioned TAHI test.

    In netfilter, utilize the fragment offset returned by find_prev_fhdr() to
    let ipv6_frag_thdr_truncated() start its traversal from the fragment
    header.
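
    The kind of check being described can be sketched as follows. This is a
    simplified userspace illustration, not the kernel's
    ipv6_frag_thdr_truncated(): the function name, its parameters and the
    PROTO_* constants are hypothetical, extension header traversal is left
    out, and the minimum header sizes (20 bytes for TCP, 8 for UDP and
    ICMPv6) are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IANA protocol numbers for the upper layer headers handled here. */
enum { PROTO_TCP = 6, PROTO_UDP = 17, PROTO_ICMPV6 = 58 };

/*
 * Return true if a first fragment of pkt_len bytes ends before the
 * complete upper layer header that starts at thdr_offset.
 * Protocols not listed are treated as "not truncated" in this sketch.
 */
bool first_frag_thdr_truncated(uint8_t nexthdr, size_t thdr_offset,
			       size_t pkt_len)
{
	size_t need;

	switch (nexthdr) {
	case PROTO_TCP:
		need = 20;	/* minimum TCP header */
		break;
	case PROTO_UDP:
		need = 8;	/* UDP header */
		break;
	case PROTO_ICMPV6:
		need = 8;	/* fixed part of the ICMPv6 header */
		break;
	default:
		return false;
	}

	/* Guard against overflow of the addition below. */
	assert(thdr_offset <= SIZE_MAX - need);
	return thdr_offset + need > pkt_len;
}
```

    For example, a TCP header starting at offset 48 (40 bytes of IPv6 header
    plus an 8 byte fragment header) needs a first fragment of at least 68
    bytes to count as complete.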

    Return 0 to drop the fragment in netfilter. This is the same behaviour
    as used on other protocol errors in this function, e.g. when
    nf_ct_frag6_queue() returns -EPROTO. The fragment will later be picked up
    by ipv6_frag_rcv() in reassembly.c. ipv6_frag_rcv() will then send an
    appropriate ICMP Parameter Problem message back to the source.

    References commit 2efdaaaf883a ("IPv6: reply ICMP error if the first
    fragment don't include all headers")

    Signed-off-by: Georg Kohmann
    Acked-by: Pablo Neira Ayuso
    Link: https://lore.kernel.org/r/20201111115025.28879-1-geokohma@cisco.com
    Signed-off-by: Jakub Kicinski

    Georg Kohmann
     

30 Oct, 2020

1 commit

  • If netfilter changes the packet mark when mangling, the packet is
    rerouted using the route_me_harder set of functions. Prior to this
    commit, there's one big difference between route_me_harder and the
    ordinary initial routing functions, described in the comment above
    __ip_queue_xmit():

    /* Note: skb->sk can be different from sk, in case of tunnels */
    int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,

    That function goes on to correctly make use of sk->sk_bound_dev_if,
    rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a
    tunnel will receive a packet in ndo_start_xmit with an initial skb->sk.
    It will make some transformations to that packet, and then it will send
    the encapsulated packet out of a *new* socket. That new socket will
    basically always have a different sk_bound_dev_if (otherwise there'd be
    a routing loop). So for the purposes of routing the encapsulated packet,
    the routing information as it pertains to the socket should come from
    that socket's sk, rather than the packet's original skb->sk. For that
    reason __ip_queue_xmit() and related functions all do the right thing.

    One might argue that all tunnels should just call skb_orphan(skb) before
    transmitting the encapsulated packet into the new socket. But tunnels do
    *not* do this -- and this is wisely avoided in skb_scrub_packet() too --
    because features like TSQ rely on skb->destructor() being called when
    that buffer space is truly available again. Calling skb_orphan(skb) too
    early would result in buffers filling up unnecessarily and accounting
    info being all wrong. Instead, additional routing must take into account
    the new sk, just as __ip_queue_xmit() notes.

    So, this commit addresses the problem by fishing the correct sk out of
    state->sk -- it's already set properly in the call to nf_hook() in
    __ip_local_out(), which receives the sk as part of its normal
    functionality. So we make sure to plumb state->sk through the various
    route_me_harder functions, and then make correct use of it following the
    example of __ip_queue_xmit().

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Jason A. Donenfeld
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Jason A. Donenfeld
     

20 Oct, 2020

1 commit

  • Fragmented ndisc packets assembled in netfilter are not dropped as specified
    in RFC 6980, section 5. This behaviour breaks TAHI IPv6 Core Conformance
    Tests v6LC.2.1.22/23, V6LC.2.2.26/27 and V6LC.2.3.18.

    Fix this by setting the IP6SKB_FRAGMENTED flag during reassembly.

    References: commit b800c3b966bc ("ipv6: drop fragmented ndisc packets by default (RFC 6980)")
    Signed-off-by: Georg Kohmann
    Signed-off-by: Pablo Neira Ayuso

    Georg Kohmann
     

16 Oct, 2020

1 commit


14 Oct, 2020

1 commit

  • Dump vlan tag and proto for the usual vlan offload case if the
    NF_LOG_MACDECODE flag is set on. Without this information the logging is
    misleading as there is no reference to the VLAN header.

    [12716.993704] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0800 SRC=192.168.10.2 DST=172.217.168.163 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=2548 DF PROTO=TCP SPT=55848 DPT=80 WINDOW=501 RES=0x00 ACK FIN URGP=0
    [12721.157643] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0806 ARP HTYPE=1 PTYPE=0x0800 OPCODE=2 MACSRC=86:6c:92:ea:d6:73 IPSRC=192.168.10.2 MACDST=0e:3b:eb:86:73:76 IPDST=192.168.10.1

    Fixes: 83e96d443b37 ("netfilter: log: split family specific code to nf_log_{ip,ip6,common}.c files")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

29 Aug, 2020

1 commit

  • Detect and rewrite a prefix embedded in an ICMPv6 original packet that was
    rewritten by a corresponding DNPT/SNPT rule so it will be recognised by
    the host that sent the original packet.

    Example

    Rules in effect on the 1:2:3:4::/64 + 5:6:7:8::/64 side router:
    * SNPT src-pfx 1:2:3:4::/64 dst-pfx 5:6:7:8::/64
    * DNPT src-pfx 5:6:7:8::/64 dst-pfx 1:2:3:4::/64

    No rules on the 9:a:b:c::/64 side.

    1. 1:2:3:4::1 sends UDP packet to 9:a:b:c::1
    2. Router applies SNPT changing src to 5:6:7:8:ffef::1
    3. 9:a:b:c::1 receives packet with (src 5:6:7:8:ffef::1 dst 9:a:b:c::1)
    and replies with ICMPv6 port unreachable to 5:6:7:8:ffef::1,
    including original packet (src 5:6:7:8:ffef::1 dst 9:a:b:c::1)
    4. Router forwards ICMPv6 packet with (src 9:a:b:c::1 dst 5:6:7:8:ffef::1)
    including original packet (src 5:6:7:8:ffef::1 dst 9:a:b:c::1)
    and applies DNPT changing dst to 1:2:3:4::1
    5. 1:2:3:4::1 receives ICMPv6 packet with (src 9:a:b:c::1 dst 1:2:3:4::1)
    including original packet (src 5:6:7:8:ffef::1 dst 9:a:b:c::1).
    It doesn't recognise the original packet as the src doesn't
    match anything it originally sent

    With this change, at step 4, DNPT will also rewrite the original packet
    src to 1:2:3:4::1, so at step 5, 1:2:3:4::1 will recognise the ICMPv6
    error and provide feedback to the application properly.

    Conversely, SNPT will help when ICMPv6 errors are sent from the
    translated network.

    1. 9:a:b:c::1 sends UDP packet to 5:6:7:8:ffef::1
    2. Router applies DNPT changing dst to 1:2:3:4::1
    3. 1:2:3:4::1 receives packet with (src 9:a:b:c::1 dst 1:2:3:4::1)
    and replies with ICMPv6 port unreachable to 9:a:b:c::1
    including original packet (src 9:a:b:c::1 dst 1:2:3:4::1)
    4. Router forwards ICMPv6 packet with (src 1:2:3:4::1 dst 9:a:b:c::1)
    including original packet (src 9:a:b:c::1 dst 1:2:3:4::1)
    and applies SNPT changing src to 5:6:7:8:ffef::1
    5. 9:a:b:c::1 receives ICMPv6 packet with
    (src 5:6:7:8:ffef::1 dst 9:a:b:c::1) including
    original packet (src 9:a:b:c::1 dst 1:2:3:4::1).
    It doesn't recognise the original packet as the dst doesn't
    match anything it already sent

    The change to SNPT means the ICMPv6 original packet dst will be
    rewritten to 5:6:7:8:ffef::1 in step 4, allowing the error to be
    properly recognised in step 5.
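
    The ffef word in the translated address comes from the checksum-neutral
    prefix mapping of RFC 6296. The following is a minimal userspace sketch of
    that mapping for /64 prefixes only, with addresses held as eight 16-bit
    words in host order. The helper names oc_add(), oc_prefix_sum() and
    npt_map64() are hypothetical and this is not the kernel implementation;
    in particular it always adjusts word 4 and does not check that the
    address actually matches the "from" prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement 16-bit addition with end-around carry. */
uint16_t oc_add(uint16_t a, uint16_t b)
{
	uint32_t sum = (uint32_t)a + b;

	return (uint16_t)((sum & 0xffffu) + (sum >> 16));
}

/* One's-complement sum of the four 16-bit words of a /64 prefix. */
uint16_t oc_prefix_sum(const uint16_t pfx[4])
{
	uint16_t sum = 0;

	for (int i = 0; i < 4; i++)
		sum = oc_add(sum, pfx[i]);
	return sum;
}

/* Rewrite addr from prefix "from" to prefix "to", keeping its checksum. */
void npt_map64(uint16_t addr[8], const uint16_t from[4], const uint16_t to[4])
{
	uint16_t adjust;

	assert(addr != NULL && from != NULL && to != NULL);

	/* adjustment = sum(from) - sum(to) in one's-complement arithmetic */
	adjust = oc_add(oc_prefix_sum(from), (uint16_t)~oc_prefix_sum(to));

	for (int i = 0; i < 4; i++)
		addr[i] = to[i];

	/* Fold the adjustment into the first word after the prefix. */
	addr[4] = oc_add(addr[4], adjust);
	if (addr[4] == 0xffff)	/* normalise the second encoding of zero */
		addr[4] = 0;
}
```

    With the prefixes from the example, mapping 1:2:3:4::1 from 1:2:3:4::/64
    to 5:6:7:8::/64 sets word 4 to 0xffef, and mapping the result back
    restores 1:2:3:4::1. The commit applies the same kind of mapping to the
    address inside the ICMPv6 original packet.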

    Signed-off-by: Michael Zhou
    Signed-off-by: Pablo Neira Ayuso

    Michael Zhou
     

06 Aug, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Support 6GHz band in ath11k driver, from Rajkumar Manoharan.

    2) Support UDP segmentation in core TSO code, from Eric Dumazet.

    3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

    4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

    5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

    6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

    7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

    8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

    9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

    10) Switch qca8k driver over to phylink, from Jonathan McDowell.

    11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

    12) Several conversions over to generic power management, from Vaibhav
    Gupta.

    13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

    14) Various https url conversions, from Alexander A. Klimov.

    15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

    16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

    17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

    18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

    19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

    20) XDP support for xen-netfront, from Denis Kirjanov.

    21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

    22) Support EF100 chip in sfc driver, from Edward Cree.

    23) Add XDP support to mvpp2 driver, from Matteo Croce.

    24) Support MPTCP in sock_diag, from Paolo Abeni.

    25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

    26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

    27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

    28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

    29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

    30) Add rfc4884 support to icmp code, from Willem de Bruijn.

    31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

    32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

    33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

    34) Support TCP syncookies in MPTCP, from Florian Westphal.

    35) Fix several tricky cases of PMTU handling wrt. bridging, from Stefano
    Brivio.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
    net: thunderx: initialize VF's mailbox mutex before first usage
    usb: hso: remove bogus check for EINPROGRESS
    usb: hso: no complaint about kmalloc failure
    hso: fix bailout in error case of probe
    ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
    selftests/net: relax cpu affinity requirement in msg_zerocopy test
    mptcp: be careful on subflow creation
    selftests: rtnetlink: make kci_test_encap() return sub-test result
    selftests: rtnetlink: correct the final return value for the test
    net: dsa: sja1105: use detected device id instead of DT one on mismatch
    tipc: set ub->ifindex for local ipv6 address
    ipv6: add ipv6_dev_find()
    net: openvswitch: silence suspicious RCU usage warning
    Revert "vxlan: fix tos value before xmit"
    ptp: only allow phase values lower than 1 period
    farsync: switch from 'pci_' to 'dma_' API
    wan: wanxl: switch from 'pci_' to 'dma_' API
    hv_netvsc: do not use VF device if link is down
    dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
    net: macb: Properly handle phylink on at91sam9x
    ...

    Linus Torvalds
     

04 Aug, 2020

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    1) UAF in chain binding support from previous batch, from Dan Carpenter.

    2) Queue up delayed work to expire connections with no destination,
    from Andrew Sy Kim.

    3) Use fallthrough pseudo-keyword, from Gustavo A. R. Silva.

    4) Replace HTTP links with HTTPS, from Alexander A. Klimov.

    5) Remove superfluous null header checks in ip6tables, from
    Gaurav Singh.

    6) Add extended netlink error reporting for expression.

    7) Report EEXIST on overlapping chain, set elements and flowtable
    devices.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Jul, 2020

1 commit


29 Jul, 2020

1 commit

  • sockptr_advance never properly worked. Replace it with _offset variants
    of copy_from_sockptr and copy_to_sockptr.

    Fixes: ba423fdaa589 ("net: add a new sockptr_t type")
    Reported-by: Jason A. Donenfeld
    Reported-by: Ido Schimmel
    Signed-off-by: Christoph Hellwig
    Acked-by: Jason A. Donenfeld
    Tested-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

25 Jul, 2020

2 commits


20 Jul, 2020

2 commits


17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

01 Jul, 2020

1 commit

  • The REJECT statement can only be used in the INPUT, FORWARD and OUTPUT
    chains. This patch adds support for REJECT, both icmp and tcp
    reset, at the PREROUTING stage.

    The need for this patch comes from the requirement of some
    forwarding devices to reject traffic before the natting and
    routing decisions.

    The main use case is to send a graceful termination to
    legitimate clients when, for whatever reason, the NATed
    endpoints are not available. This option allows clients to
    decide either to reconnect or to handle the error on
    their side, instead of having the connection dropped and
    left to die by timeout.

    The ipv4, ipv6 and inet families are supported for the nft
    infrastructure.

    Signed-off-by: Laura Garcia Liebana
    Signed-off-by: Pablo Neira Ayuso

    Laura Garcia Liebana
     

25 Jun, 2020

3 commits


14 Jun, 2020

1 commit

  • Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over
    '---help---'"), the number of '---help---' has been gradually
    decreasing, but there are still more than 2400 instances.

    This commit finishes the conversion. While I touched the lines,
    I also fixed the indentation.

    There are a variety of indentation styles found.

    a) 4 spaces + '---help---'
    b) 7 spaces + '---help---'
    c) 8 spaces + '---help---'
    d) 1 space + 1 tab + '---help---'
    e) 1 tab + '---help---' (correct indentation)
    f) 1 tab + 1 space + '---help---'
    g) 1 tab + 2 spaces + '---help---'

    In order to convert all of them to 1 tab + 'help', I ran the
    following command:

    $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

    Signed-off-by: Masahiro Yamada

    Masahiro Yamada
     

15 Mar, 2020

1 commit

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these ones is a flexible array member[1][2],
    introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]
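
    As an illustration of why allocations are unaffected, here is a small
    userspace sketch of allocating a structure with a flexible array member.
    The struct boo contents and the helper foo_alloc() are hypothetical; the
    size computation is the usual header size plus count times element size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct boo {
	int value;
};

struct foo {
	int stuff;
	struct boo array[];	/* flexible array member: must be last */
};

/* Allocate a zeroed struct foo with room for count trailing elements. */
struct foo *foo_alloc(size_t count)
{
	struct foo *f;

	/* Guard the size computation below against overflow. */
	assert(count <= (SIZE_MAX - sizeof(*f)) / sizeof(f->array[0]));

	f = calloc(1, sizeof(*f) + count * sizeof(f->array[0]));
	if (f)
		f->stuff = (int)count;
	return f;
}
```

    The expression sizeof(*f) does not count the trailing elements, so the
    allocation size is computed the same way as it would be with a
    zero-length array.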

    Lastly, fix checkpatch.pl warning
    WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
    in net/bridge/netfilter/ebtables.c

    This issue was found with the help of Coccinelle.

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Pablo Neira Ayuso

    Gustavo A. R. Silva
     

13 Mar, 2020

1 commit

  • Convert the various uses of fallthrough comments to the fallthrough;
    pseudo-keyword.

    Done via script
    Link: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

    And by hand:

    net/ipv6/ip6_fib.c has a fallthrough comment outside of an #ifdef block
    that causes gcc to emit a warning if converted in-place.

    So move the new fallthrough; inside the containing #ifdef/#endif too.
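
    For readers unfamiliar with the pseudo-keyword, here is a small userspace
    sketch of the pattern. The macro definition is a simplified stand-in that
    assumes a compiler providing __has_attribute (such as recent GCC or
    Clang), with a no-op fallback otherwise; header_cost() and its values are
    hypothetical and only serve to show the control flow.

```c
#include <assert.h>

/* Simplified stand-in for the pseudo-keyword definition. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)	/* fallthrough */
#endif

/* Accumulate a cost for kind and for every lower kind. */
int header_cost(int kind)
{
	int cost = 0;

	assert(kind >= 0 && kind <= 2);

	switch (kind) {
	case 2:
		cost += 4;
		fallthrough;
	case 1:
		cost += 2;
		fallthrough;
	case 0:
		cost += 1;
		break;
	}
	return cost;
}
```

    Unlike a comment, the statement form is seen by the compiler after
    preprocessing, so it has to sit directly before the next case label. That
    is why it must be placed inside the same #ifdef block as the label it
    precedes.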

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

22 Nov, 2019

1 commit


16 Nov, 2019

1 commit

  • Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet
    flowtable type definitions. Rename the nf_flow_rule_route() function to
    nf_flow_rule_route_ipv4().

    Adjust maximum number of actions, which now becomes 16 to leave
    sufficient room for the IPv6 address mangling for NAT.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

13 Nov, 2019

2 commits

  • This patch adds the dataplane hardware offload to the flowtable
    infrastructure. Three new flags represent the hardware state of this
    flow:

    * FLOW_OFFLOAD_HW: This flow entry resides in the hardware.
    * FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be removed
    from hardware. This might be triggered by either packet path (via TCP
    RST/FIN packet) or via aging.
    * FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from
    the hardware, the software garbage collector can remove it from the
    software flowtable.

    This patch supports:

    * IPv4 only.
    * Aging via FLOW_CLS_STATS, no packet and byte counter synchronization
    at this stage.

    This patch also adds the action callback that specifies how to convert
    the flow entry into the flow_rule object that is passed to the driver.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • This patch adds the NFTA_FLOWTABLE_FLAGS attribute that allows users to
    specify the NF_FLOWTABLE_HW_OFFLOAD flag. This patch also adds a new
    setup interface for the flowtable type to perform the flowtable offload
    block callback configuration.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     

17 Oct, 2019

1 commit


02 Oct, 2019

1 commit

  • commit 174e23810cd31
    ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
    recycle always drop skb extensions. The additional skb_ext_del() that is
    performed via nf_reset on napi skb recycle is not needed anymore.

    Most nf_reset() calls in the stack are there so queued skb won't block
    'rmmod nf_conntrack' indefinitely.

    This removes the skb_ext_del from nf_reset, and renames it to a more
    fitting nf_reset_ct().

    In a few selected places, add a call to skb_ext_reset to make sure that
    no active extensions remain.

    I am submitting this for "net", because we're still early in the release
    cycle. The patch applies to net-next too, but I think the rename causes
    needless divergence between those trees.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

26 Sep, 2019

1 commit


13 Sep, 2019

2 commits


20 Aug, 2019

1 commit


09 Aug, 2019

1 commit

  • Before commit d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6
    defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
    generating many fragments) and running over an IPsec tunnel, reported
    more than 6Gbps throughput. After that patch, the same test gets only
    9Mbps when receiving on a be2net nic (driver can make a big difference
    here, for example, ixgbe doesn't seem to be affected).

    By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
    (IPv4 fragment coalescing was dropped by commit 14fe22e33462 ("Revert
    "ipv4: use skb coalescing in defragmentation"")).

    Without fragment coalescing, be2net runs out of Rx ring entries and
    starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
    the netperf traffic is only composed of UDP fragments, any lost packet
    prevents reassembly of the full datagram. Therefore, fragments which
    have no possibility to ever get reassembled pile up in the reassembly
    queue, until the memory accounting exceeds the threshold. At that point
    no fragment is accepted anymore, which effectively discards all
    netperf traffic.

    When the reassembly timeout expires, some stale fragments are removed from
    the reassembly queue, so a few packets can be received, reassembled
    and delivered to the netperf receiver. But the nic still drops frames
    and soon the reassembly queue gets filled again with stale fragments.
    These long time frames where no datagram can be received explain why
    the performance drop is so significant.

    Re-introducing fragment coalescing is enough to get the initial
    performances again (6.6Gbps with be2net): driver doesn't drop frames
    anymore (no more rx_drops_no_frags errors) and the reassembly engine
    works at full speed.

    This patch is quite conservative and only coalesces skbs for local
    IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
    forwarding). Coalescing could be extended in the future if need be, as
    more scenarios would probably benefit from it.

    [0]: Test configuration
    Sender:
    ip xfrm policy flush
    ip xfrm state flush
    ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
    ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
    ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
    ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
    netserver -D -L fc00:2::1

    Receiver:
    ip xfrm policy flush
    ip xfrm state flush
    ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
    ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
    ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
    ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
    netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6

    Signed-off-by: Guillaume Nault
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Guillaume Nault
     

04 Aug, 2019

1 commit


16 Jul, 2019

2 commits

  • Synproxy now sends the MSS value set by the user in the SYN-ACK packet to
    the client, instead of the MSS value that the client announced.

    Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target")
    Signed-off-by: Fernando Fernandez Mancera
    Signed-off-by: Pablo Neira Ayuso

    Fernando Fernandez Mancera
     
  • When firewalld is enabled with the ipv4/ipv6 rpfilter, VRF
    ipv4/ipv6 packets are dropped. A packet on a VRF device passes
    through the netfilter hook twice: once with the enslaved device
    and once with the l3 master device. The in device may therefore
    mismatch the out device, because the out device is always the
    enslaved device. The rpfilter check then fails and the packets
    are dropped by mistake.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Pablo Neira Ayuso

    Miaohe Lin
     

12 Jul, 2019

1 commit