16 Jul, 2018

39 commits

  • Not needed, we can have the l4trackers fetch it themselves.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Handle common protocols (udp, tcp, ...) in the core and only
    do the call if needed by the l4proto tracker.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Handle the common cases (tcp, udp, etc.) in the core and only
    do the indirect call for the protocols that need it (GRE, for instance).

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
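
The split described in the entry above (common protocols handled inline, an indirect call only for the rest) can be sketched roughly as follows. This is an illustrative model and not the conntrack code: `EXTRA_PARSERS`, `parse_ports` and `pkt_to_tuple` are hypothetical names, and only the port fields are extracted.

```python
from typing import Callable, Dict, Optional, Tuple

# IANA protocol numbers.
IPPROTO_TCP, IPPROTO_UDP, IPPROTO_GRE = 6, 17, 47

PortPair = Tuple[int, int]

# Hypothetical registry: only uncommon protocols install a callback here.
EXTRA_PARSERS: Dict[int, Callable[[bytes], Optional[PortPair]]] = {}


def parse_ports(l4: bytes) -> Optional[PortPair]:
    """TCP and UDP headers both start with 16-bit source and destination ports."""
    if len(l4) < 4:
        return None
    return int.from_bytes(l4[0:2], "big"), int.from_bytes(l4[2:4], "big")


def pkt_to_tuple(protonum: int, l4: bytes) -> Optional[PortPair]:
    # Common cases are handled directly, with no indirect call.
    if protonum in (IPPROTO_TCP, IPPROTO_UDP):
        return parse_ports(l4)
    # Everything else pays for a lookup plus an indirect call.
    parser = EXTRA_PARSERS.get(protonum)
    return parser(l4) if parser else None
```

A protocol such as GRE would register its own callback in the hypothetical registry, while TCP and UDP never touch it.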
     
  • Handle it in the core instead.

    ipv6_skip_exthdr() is built-in even if ipv6 is a module, i.e. this
    doesn't create an ipv6 dependency.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • It's simpler to just handle it directly in nf_ct_invert_tuple().
    This also gets rid of the need to pass the l3proto pointer to
    resolve_conntrack().

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • handle everything from ctnetlink directly.

    After all these years we still only support ipv4 and ipv6, so it
    seems reasonable to remove l3 protocol tracker support and instead
    handle ipv4/ipv6 from a common, always builtin inet tracker.

    Step 1: Get rid of all the l3proto->func() calls.

    Start with ctnetlink, then move on to packet-path ones.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Instead of depending on it.

    Signed-off-by: Máté Eckl
    Signed-off-by: Pablo Neira Ayuso

    Máté Eckl
     
  • These versions deal with the l3proto/l4proto details internally.
    This removes the only caller of nf_ct_get_tuple, so make it static.

    After this, l3proto->get_l4proto() can be removed in a followup patch.

    Signed-off-by: Florian Westphal
    Acked-by: Pravin B Shelar
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Similar to the previous change, this also allows removing it
    from nf_ipv6_ops and avoiding the indirection.

    It also removes the bogus dependency of nf_conntrack_ipv6 on ipv6 module:
    ipv6 checksum functions are built into kernel even if CONFIG_IPV6=m,
    but ipv6/netfilter.o isn't.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • This allows making nf_ip_checksum_partial static, as it no longer
    has an external caller.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • This function is also necessary to implement nft tproxy support

    Fixes: 45ca4e0cf273 ("netfilter: Libify xt_TPROXY")
    Signed-off-by: Máté Eckl
    Signed-off-by: Pablo Neira Ayuso

    Máté Eckl
     
  • This is one of the very few external callers of ->get_timeouts().

    We can use a fixed timeout instead, conntrack core will refresh this in
    case a new packet comes within this period.

    Use of ESTABLISHED timeout seems way too huge anyway.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • In nft_reject_br_send_v4_tcp_reset(), the ttl is already set by
    nf_reject_iphdr_put(), so the code below is unnecessary.

    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso

    Taehee Yoo
     
  • Boris Pismenny says:

    ====================
    TLS offload rx, netdev & mlx5

    The following series provides TLS RX inline crypto offload.

    v4->v5:
    - Remove the Kconfig to mutually exclude both IPsec and TLS

    v3->v4:
    - Remove the iov revert for zero copy send flow

    v2->v3:
    - Fix typo
    - Adjust cover letter
    - Fix bug in zero copy flows
    - Use network byte order for the record number in resync
    - Adjust the sequence provided in resync

    v1->v2:
    - Fix bisectability problems due to variable name changes
    - Fix potential uninitialized return value

    This series completes the generic infrastructure to offload TLS crypto to
    a network device. It enables the kernel TLS socket to skip decryption and
    authentication operations for SKBs marked as decrypted on the receive
    side of the data path, leaving those computationally expensive operations
    to the NIC.

    This infrastructure doesn't require a TCP offload engine. Instead, the
    NIC decrypts a packet's payload if the packet contains the expected TCP
    sequence number. The TLS record authentication tag remains unmodified
    regardless of decryption. If the packet is decrypted successfully and it
    contains an authentication tag, then the authentication check has passed.
    Otherwise, if the authentication fails, then the packet is provided
    unmodified and the KTLS layer is responsible for handling it.
    Out-Of-Order TCP packets are provided unmodified. As a result,
    in the slow path some of the SKBs are decrypted while others remain as
    ciphertext.

    The GRO and TCP layers must not coalesce decrypted and non-decrypted SKBs.
    In the worst case a received TLS record consists of both plaintext
    and ciphertext packets. These partially decrypted records must be
    re-encrypted, only to be decrypted again.

    The notable differences between SW KTLS and NIC offloaded TLS
    implementations are as follows:
    1. Partial decryption - Software must handle the case of a TLS record
    that was only partially decrypted by HW. This can happen due to packet
    reordering.
    2. Resynchronization - tls_read_size calls the device driver to
    resynchronize HW whenever it lost track of the TLS record framing in
    the TCP stream.

    The infrastructure should be extendable to support various NIC offload
    implementations. However it is currently written with the
    implementation below in mind:
    The NIC identifies packets that should be offloaded according to
    the 5-tuple and the TCP sequence number. If these match and the
    packet is decrypted and authenticated successfully, then a syndrome
    is provided to software. Otherwise, the packet is unmodified.
    Decrypted and non-decrypted packets aren't coalesced by the network stack,
    and the KTLS layer decrypts and authenticates partially decrypted records.
    The NIC provides an indication whenever a resync is required. The resync
    operation is triggered by the KTLS layer while parsing TLS record headers.

    Finally, we measure the performance obtained by running single stream
    iperf with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machines connected
    back-to-back with Innova TLS (40Gbps) NICs. We compare TCP (upper bound)
    and KTLS-Offload running both in Tx and Rx. The results show that the
    performance of offload is comparable to TCP.

                       | Bandwidth (Gbps) | CPU Tx (%) | CPU Rx (%)
    TCP                | 28.8             | 5          | 12
    KTLS-Offload-Tx-Rx | 28.6             | 7          | 14

    Paper: https://netdevconf.org/2.2/papers/pismenny-tlscrypto-talk.pdf
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch fixes the byte count indication in CQE for processed IPsec
    packets that contain a metadata header.

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • This patch adds common functions to handle Mellanox metadata headers.
    These functions are used by IPsec and TLS to process FPGA metadata.

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • This patch enables TLS Rx based on available HW capabilities.

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • This patch adds software statistics for TLS to count important
    events.

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • Implement the TLS rx offload data path according to the
    requirements of the TLS generic NIC offload infrastructure.

    A special metadata ethertype is used to pass information to
    the hardware.

    When hardware loses synchronization, a special resync request
    metadata message is used to request resync.

    Signed-off-by: Boris Pismenny
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • Add the mlx5 implementation of the TLS Rx routines to add/del TLS
    contexts, also add the tls_dev_resync_rx routine
    to work with the TLS inline Rx crypto offload infrastructure.

    Signed-off-by: Boris Pismenny
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • In Innova TLS, TLS contexts are added or deleted
    via a command message over the SBU connection.
    The HW then sends a response message over the same connection.

    Complete the implementation for Innova TLS (FPGA-based) hardware by
    adding support for rx inline crypto offload.

    Signed-off-by: Boris Pismenny
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • For symmetry, we rename mlx5e_tls_offload_context to
    mlx5e_tls_offload_context_tx before we add mlx5e_tls_offload_context_rx.

    Signed-off-by: Boris Pismenny
    Reviewed-by: Aviad Yehezkel
    Reviewed-by: Tariq Toukan
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • zerocopy_from_iter iterates over the message, but it doesn't revert the
    updates made by the iov iteration. This patch fixes it. Now, the iov can
    be used after calling zerocopy_from_iter.

    Fixes: 3c4d75591 ("tls: kernel TLS support")
    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
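
The fix described in the entry above amounts to undoing the iterator advance once the walk is finished. A minimal sketch of that idea, assuming a toy `ByteIter` class in place of the kernel's iov iterator (all names here are hypothetical):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ByteIter:
    """Toy stand-in for an iov iterator: a buffer plus a read offset."""
    data: bytes
    offset: int = 0

    def advance(self, n: int) -> bytes:
        chunk = self.data[self.offset:self.offset + n]
        self.offset += len(chunk)
        return chunk

    def revert(self, n: int) -> None:
        self.offset -= n


def map_pages(it: ByteIter, length: int, page_size: int = 4) -> List[bytes]:
    """Walk `length` bytes in page-sized chunks, then undo the advance."""
    pages: List[bytes] = []
    consumed = 0
    while consumed < length:
        chunk = it.advance(min(page_size, length - consumed))
        if not chunk:
            break
        pages.append(chunk)
        consumed += len(chunk)
    it.revert(consumed)  # leave the iterator where the caller handed it over
    return pages
```

After `map_pages` returns, the caller can still read the same bytes from the iterator, which is the property the commit message asks for.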
     
  • This patch completes the generic infrastructure to offload TLS crypto to a
    network device. It enables the kernel to skip decryption and
    authentication of some skbs marked as decrypted by the NIC. In the fast
    path, all packets received are decrypted by the NIC and the performance
    is comparable to plain TCP.

    This infrastructure doesn't require a TCP offload engine. Instead, the
    NIC only decrypts packets that contain the expected TCP sequence number.
    Out-Of-Order TCP packets are provided unmodified. As a result, in the
    worst case a received TLS record consists of both plaintext and ciphertext
    packets. These partially decrypted records must be re-encrypted,
    only to be decrypted again.

    The notable differences between SW KTLS Rx and this offload are as
    follows:
    1. Partial decryption - Software must handle the case of a TLS record
    that was only partially decrypted by HW. This can happen due to packet
    reordering.
    2. Resynchronization - tls_read_size calls the device driver to
    resynchronize HW after HW lost track of TLS record framing in
    the TCP stream.

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
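
The partial-decryption case in the entry above can be illustrated with a toy model. The sketch below swaps the real TLS cipher for a repeating-key XOR stream and leaves out authentication entirely; `Segment`, `xor_stream` and `decrypt_record` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    payload: bytes
    decrypted: bool  # set when the NIC already decrypted this segment


def xor_stream(data: bytes, key: bytes, start: int) -> bytes:
    """Toy stream cipher: XOR with a repeating key, indexed by record offset.

    Applying it twice restores the input, so it both encrypts and decrypts.
    """
    return bytes(b ^ key[(start + i) % len(key)] for i, b in enumerate(data))


def decrypt_record(segments: List[Segment], key: bytes) -> bytes:
    if all(s.decrypted for s in segments):
        # Fast path: the NIC did all the work.
        return b"".join(s.payload for s in segments)
    # Slow path: rebuild the full ciphertext by re-encrypting the segments
    # the NIC decrypted, then decrypt the whole record in software.
    ciphertext = b""
    pos = 0
    for s in segments:
        ciphertext += xor_stream(s.payload, key, pos) if s.decrypted else s.payload
        pos += len(s.payload)
    return xor_stream(ciphertext, key, 0)
```

The slow path only triggers when at least one segment arrives as ciphertext, which matches the reordering scenario the commit message describes.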
     
  • This patch allows tls_set_sw_offload to fill the context in case it was
    already allocated previously.

    We will use it in TLS_DEVICE to fill the RX software context.

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • This patch splits tls_sw_release_resources_rx into two functions: one
    which releases all inner software tls structures and another that also
    frees the containing structure.

    In TLS_DEVICE we will need to release the software structures without
    freeing the containing structure, which contains other information.

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • Previously, decrypt_skb also updated the TLS context.
    Now, decrypt_skb only decrypts the payload using the current context,
    while decrypt_skb_update also updates the state.

    Later, in the tls_device Rx flow, we will use decrypt_skb directly.

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
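
The split described in the entry above separates a pure decrypt step from a wrapper that also touches connection state. A rough sketch of that shape, assuming a toy single-byte XOR cipher and a record counter as the only state (both are assumptions, and the names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class RxContext:
    key: int          # toy single-byte key, stands in for the real cipher state
    rec_seq: int = 0  # assumed state: number of records processed so far


def decrypt_only(ctx: RxContext, record: bytes) -> bytes:
    """Decrypt using the current context; the context is left untouched."""
    return bytes(b ^ ctx.key for b in record)


def decrypt_and_update(ctx: RxContext, record: bytes) -> bytes:
    """Decrypt, then advance the per-connection state."""
    plaintext = decrypt_only(ctx, record)
    ctx.rec_seq += 1
    return plaintext
```

A caller that needs to control when the state advances can use the first function on its own, which is the use the commit message anticipates for the device Rx flow.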
     
  • For symmetry, we rename tls_offload_context to
    tls_offload_context_tx before we add tls_offload_context_rx.

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • Prevent coalescing of decrypted and encrypted SKBs in GRO
    and TCP layer.

    Signed-off-by: Boris Pismenny
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: David S. Miller

    Boris Pismenny
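
The coalescing rule from the entry above can be modelled in a few lines. This is a simplified sketch with a hypothetical `Skb` type; real GRO and TCP coalescing check many more conditions than the single flag shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Skb:
    payload: bytes
    decrypted: bool


def coalesce(queue: List[Skb]) -> List[Skb]:
    """Merge neighbouring buffers, but never across a decrypted/encrypted boundary."""
    merged: List[Skb] = []
    for skb in queue:
        if merged and merged[-1].decrypted == skb.decrypted:
            merged[-1] = Skb(merged[-1].payload + skb.payload, skb.decrypted)
        else:
            merged.append(skb)
    return merged
```

Keeping the flag uniform within each merged buffer lets the TLS layer decide per buffer whether decryption is still required.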
     
  • Add new netdev tls op for resynchronizing HW tls context

    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Boris Pismenny
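
The series cover letter notes that the record number is passed in network byte order during resync. The sketch below shows what that means for a 64-bit record number. The message layout (a 32-bit TCP sequence number followed by the record number) and the function names are assumptions made for illustration, not the driver's actual format.

```python
import struct
from typing import Tuple


def pack_resync(tcp_seq: int, record_number: int) -> bytes:
    """Hypothetical resync payload: 32-bit TCP sequence number followed by the
    64-bit TLS record number, both big-endian (network byte order)."""
    return struct.pack("!IQ", tcp_seq & 0xFFFFFFFF, record_number)


def unpack_resync(msg: bytes) -> Tuple[int, int]:
    """Inverse of pack_resync for the same assumed layout."""
    tcp_seq, record_number = struct.unpack("!IQ", msg)
    return tcp_seq, record_number
```

With `!` as the format prefix, `struct` uses big-endian byte order and no padding, so the payload is always 12 bytes.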
     
  • This patch adds a netdev feature to configure TLS RX inline crypto offload.

    Signed-off-by: Ilya Lesokhin
    Signed-off-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Ilya Lesokhin
     
  • The decrypted bit is propagated to cloned/copied skbs.
    This will be used later by the inline crypto receive side offload
    of tls.

    Signed-off-by: Boris Pismenny
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: David S. Miller

    Boris Pismenny
     
  • Maxime Chevallier says:

    ====================
    net: mvpp2: add debugfs interface

    The PPv2 Header Parser and Classifier are not straightforward to debug;
    having easy access to the configuration of some of the many lookup tables
    is helpful during development and debugging.

    This series adds a basic debugfs interface, which allows reading data from
    the Header Parser and some of the Classifier tables.

    For now, the interface is read-only, and contains only some basic info.

    This was actually used during RSS development, and might be useful to
    troubleshoot some issues we might find.

    The first patch of the series converts the mvpp2 files to SPDX, which
    eases adding the new debugfs dedicated file.

    The second patch adds the interface, and exposes basic Header Parser data.

    The 3rd patch adds a hit counter for the Header Parser TCAM.

    The 4th patch exposes classifier info.

    The 5th patch adds some hit counters for some of the classifier engines.

    Changes since V1:
    - Rebased on the latest net-next
    - Made cls_flow_get non static so that it can be used in mvpp2_debugfs
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The classification operations that are used for RSS make use of several
    lookup tables. Having hit counters for these tables is really helpful
    to determine what flows were matched by ingress traffic, and see the
    path of packets among all the classifier tables.

    This commit adds hit counters for the 3 tables used at the moment :

    - The decoding table (also called lookup_id table), that links flows
    identified by the Header Parser to the flow table.

    There's one entry per flow, located at :
    .../mvpp2//flows/XX/dec_hits

    Note that there are 21 flows in the decoding table, whereas there are
    52 flows in the Header Parser. That's because there are several kinds
    of traffic that will match a given flow. Reading the hit counter from
    one sub-flow will clear all hit counters that have the same flow_id.

    This also applies to the flow_hits.

    - The flow table, that contains all the different lookups to be
    performed by the classifier for each packet of a given flow. The match
    is done on the first entry of the flow sequence.

    - The C2 engine entries, that are used to assign the default rx queue,
    and enable or disable RSS for a given port.

    There's one entry per flow, located at:
    .../mvpp2//flows/XX/flow_hits

    There is one C2 entry per port, so the c2 hit counter is located at :
    .../mvpp2//ethX/c2_hits

    All hit counter values are 16-bit clear-on-read values.

    Signed-off-by: Maxime Chevallier
    Signed-off-by: David S. Miller

    Maxime Chevallier
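
Because the counters in the entry above are cleared on read and shared by every sub-flow with the same flow_id, a monitoring tool has to keep its own running totals. The sketch below models that in memory. The class names are hypothetical, and the choice to stop counting at the 16-bit maximum is an assumption, since the commit message does not describe overflow behaviour.

```python
from typing import Dict


class ClearOnReadCounters:
    """Toy model of 16-bit clear-on-read hit counters keyed by flow id."""

    def __init__(self) -> None:
        self._hw: Dict[int, int] = {}

    def hit(self, flow_id: int) -> None:
        # Assumption: the model simply stops counting at the 16-bit maximum.
        self._hw[flow_id] = min(self._hw.get(flow_id, 0) + 1, 0xFFFF)

    def read(self, flow_id: int) -> int:
        # Reading returns the value and resets it to zero.
        return self._hw.pop(flow_id, 0)


class HitAccumulator:
    """Keeps running totals so that repeated reads do not lose history."""

    def __init__(self, hw: ClearOnReadCounters, subflow_to_flow: Dict[int, int]) -> None:
        self.hw = hw
        self.subflow_to_flow = subflow_to_flow
        self.totals: Dict[int, int] = {}

    def poll(self, subflow: int) -> int:
        # Sub-flows sharing a flow id share one counter, so totals are per flow id.
        flow_id = self.subflow_to_flow[subflow]
        self.totals[flow_id] = self.totals.get(flow_id, 0) + self.hw.read(flow_id)
        return self.totals[flow_id]
```

Polling any sub-flow drains the shared counter, so totals are stored per flow id and not per sub-flow.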
     
  • The classifier configuration for RSS is quite complex, with several
    lookup tables being used. This commit adds useful info in debugfs to
    see how the different tables are configured :

    Added 2 new entries in the per-port directory :

    - .../eth0/default_rxq : The default rx queue on that port
    - .../eth0/rss_enable : Indicates if RSS is enabled in the C2 entry

    Added the 'flows' directory :

    It contains one entry per sub-flow. A 'sub-flow' is a unique path from
    Header Parser to the flow table. Multiple sub-flows can point to the
    same 'flow' (each flow has an id from 8 to 29, which is its index in the
    Lookup Id table) :

    - .../flows/00/...
               /01/...
               ...
               /51/id : The flow id. There are 21 unique flows. There's one
                        flow per combination of the following parameters :
                        - L4 protocol (TCP, UDP, none)
                        - L3 protocol (IPv4, IPv6)
                        - L3 parameters (Fragmented or not)
                        - L2 parameters (Vlan tag presence or not)
               .../type : The flow type. This is an even higher level flow,
                          that we manipulate with ethtool. It can be :
                          "udp4" "tcp4" "udp6" "tcp6" "ipv4" "ipv6" "other".
               .../eth0/...
               .../eth1/engine : The hash generation engine used for this
                                 flow on the given port
               .../hash_opts : The hash generation options indicating on
                               what data we base the hash (vlan tag, src
                               IP, src port, etc.)

    Signed-off-by: Maxime Chevallier
    Signed-off-by: David S. Miller

    Maxime Chevallier
     
  • One helpful feature to help debug the Header Parser TCAM filter in PPv2
    is to be able to see if the entries did match something when a packet
    comes in. This can be done by using the built-in hit counter for TCAM
    entries.

    This commit implements reading the counter, and exposing its value on
    debugfs for each filter entry.

    The counter is a 16-bit clear-on-read value, located at:
    .../mvpp2//parser/XXX/hits

    Signed-off-by: Maxime Chevallier
    Signed-off-by: David S. Miller

    Maxime Chevallier
     
  • The Marvell PPv2 Packet Header Parser has a TCAM based filter that is not
    trivial to configure and debug. Being able to dump TCAM entries from
    userspace can be really helpful for developing new features
    and debugging existing ones.

    This commit adds a basic debugfs interface for the PPv2 driver, focusing
    on TCAM related features.

    /mvpp2/ --- f2000000.ethernet
             \- f4000000.ethernet --- parser --- 000 ...
                                   |          \- 001
                                   |          \- ...
                                   |          \- 255 --- ai
                                   |                  \- header_data
                                   |                  \- lookup_id
                                   |                  \- sram
                                   |                  \- valid
                                   \- eth1 ...
                                   \- eth2 --- mac_filter
                                            \- parser_entries
                                            \- vid_filter

    There's one directory per PPv2 instance, named after pdev->name to make
    sure names are unique. In each of these directories, there's :

    - one directory per interface on the controller, each containing :

      - "mac_filter", which lists all filtered addresses for this port
        (based on TCAM, not on the kernel's uc / mc lists)

      - "parser_entries", which lists the indices of all valid TCAM
        entries that have this port in their port map

      - "vid_filter", which lists the vids allowed on this port, based on
        TCAM

    - one "parser" directory (the parser is common to all ports), containing :

      - one directory per TCAM entry (256 of them, from 0 to 255), each
        containing :

        - "ai" : Contains the 1 byte Additional Info field from TCAM, and

        - "header_data" : Contains the 8 bytes Header Data extracted from
          the packet

        - "lookup_id" : Contains the 4 bits LU_ID

        - "sram" : Contains the raw SRAM data, which is the result of the TCAM
          lookup. This is read-only at the moment.

        - "valid" : Indicates if the entry is valid or not.

    All entries are read-only, and everything is output in hex form.

    Signed-off-by: Maxime Chevallier
    Signed-off-by: David S. Miller

    Maxime Chevallier
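
A small reader for the per-entry layout described in the entry above could look like the following sketch. It assumes each file holds a single hex number as text (the commit only says output is in hex form) and takes the "parser" directory as an argument, since the debugfs mount point and controller name vary. The function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Per-entry file names taken from the commit message.
FIELDS = ("ai", "header_data", "lookup_id", "sram", "valid")


def read_entry(entry_dir: Path) -> Dict[str, str]:
    """Return the raw text of each per-entry file that exists."""
    return {
        name: (entry_dir / name).read_text().strip()
        for name in FIELDS
        if (entry_dir / name).is_file()
    }


def valid_entries(parser_dir: Path) -> List[str]:
    """List TCAM entry directories whose "valid" file parses to non-zero.

    Assumes the file contains one hex number, with or without a 0x prefix.
    """
    found: List[str] = []
    for entry in sorted(p for p in parser_dir.iterdir() if p.is_dir()):
        text = read_entry(entry).get("valid", "0")
        if int(text, 16) != 0:
            found.append(entry.name)
    return found
```

Passing the directory in explicitly also makes the helper easy to exercise against a temporary directory without real hardware.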
     
  • Use the appropriate SPDX license identifiers and drop the license text.
    This patch is only cosmetic.

    Signed-off-by: Antoine Tenart
    Signed-off-by: Maxime Chevallier
    Signed-off-by: David S. Miller

    Antoine Tenart
     

15 Jul, 2018

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2018-07-15

    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) Various arm32 JIT improvements in order to optimize code emission
    and make the JIT code itself more robust, from Russell.

    2) Support simultaneous driver and offloaded XDP in order to allow for advanced
    use-cases where some work is offloaded to the NIC and some to the host. Also
    add ability for bpftool to load programs and maps beyond just the cgroup case,
    from Jakub.

    3) Add BPF JIT support in nfp for multiplication as well as division. For the
    latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.

    4) Add BTF pretty print functionality to bpftool in plain and JSON output
    format, from Okash.

    5) Add build and installation of the BPF helper man page to bpftool, from Quentin.

    6) Add a TCP BPF callback for listening sockets which is triggered right after
    the socket transitions to TCP_LISTEN state, from Andrey.

    7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
    tree and prints all attached programs, from Roman.

    8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
    packets, from Jesper.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
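
Item 3 of the pull request above mentions emulating division with a reciprocal algorithm. The sketch below shows the general multiply-by-reciprocal technique for unsigned 32-bit operands, modelled on the kernel's reciprocal_div helpers as best recalled. It is not the nfp JIT code, and the Python names are illustrative.

```python
from typing import NamedTuple


class Reciprocal(NamedTuple):
    m: int    # 32-bit magic multiplier
    sh1: int  # first shift amount
    sh2: int  # second shift amount


def reciprocal_value(d: int) -> Reciprocal:
    """Precompute the constants for an unsigned 32-bit divisor d >= 1."""
    l = (d - 1).bit_length()  # ceil(log2(d))
    m = ((1 << 32) * ((1 << l) - d)) // d + 1
    return Reciprocal(m & 0xFFFFFFFF, min(l, 1), max(l - 1, 0))


def reciprocal_divide(a: int, r: Reciprocal) -> int:
    """Compute a // d for 32-bit a using one multiply and two shifts."""
    t = (a * r.m) >> 32
    return (t + ((a - t) >> r.sh1)) >> r.sh2
```

The precomputation runs once per divisor, so the per-operation cost is a multiply, a subtract, an add and two shifts, which suits hardware without a divide instruction.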