05 Aug, 2017

20 commits


04 Aug, 2017

20 commits

  • Willem de Bruijn says:

    ====================
    socket sendmsg MSG_ZEROCOPY

    Introduce zerocopy socket send flag MSG_ZEROCOPY. This extends the
    shared page support (SKBTX_SHARED_FRAG) from sendpage to sendmsg.
    Implement the feature for TCP initially, as large writes benefit
    most.

    On a send call with MSG_ZEROCOPY, the kernel pins user pages and
    links these directly into the skbuff frags[] array.

    Each send call with MSG_ZEROCOPY that transmits data will eventually
    queue a completion notification on the error queue: a per-socket u32
    incremented on each such call. A request may have to revert to copy
    to succeed, for instance when a device cannot support scatter-gather
    IO. In that case a flag is passed along to notify that the operation
    succeeded without zerocopy optimization.

    The implementation extends the existing zerocopy infra for tuntap,
    vhost and xen with features needed for TCP, notably reference
    counting to handle cloning on retransmit and GSO.

    For more details, see also the netdev 2.1 paper and presentation at
    https://netdevconf.org/2.1/session.html?debruijn

    Changelog:

    v3 -> v4:
    - dropped UDP, RAW and PF_PACKET for now
      Without loopback support, datagrams are usually smaller than
      the ~8KB size threshold needed to benefit from zerocopy.
    - style: a few reverse christmas tree
    - minor: SO_ZEROCOPY returns ENOTSUPP on unsupported protocols
    - minor: squashed SO_EE_CODE_ZEROCOPY_COPIED patch
    - minor: rebased on top of net-next with kmap_atomic fix

    v2 -> v3:
    - fix rebase conflict: SO_ZEROCOPY 59 -> 60

    v1 -> v2:
    - fix (kbuild-bot): do not remove uarg until patch 5
    - fix (kbuild-bot): move zerocopy_sg_from_iter doc with function
    - fix: remove unused extern in header file

    RFCv2 -> v1:
    - patch 2
      - review comment: in skb_copy_ubufs, always allocate order-0
        page, also when replacing compound source pages.
    - patch 3
      - fix: always queue completion notification on MSG_ZEROCOPY,
        also if revert to copy.
      - fix: on syscall abort, correctly revert notification state
      - minor: skip queue notification on SOCK_DEAD
      - minor: replace BUG_ON with WARN_ON in recoverable error
    - patch 4
      - new: add socket option SOCK_ZEROCOPY.
        only honor MSG_ZEROCOPY if set, ignore for legacy apps.
    - patch 5
      - fix: clear zerocopy state on skb_linearize
    - patch 6
      - fix: only coalesce if prev errqueue elem is zerocopy
      - minor: try coalescing with list tail instead of head
      - minor: merge bytelen limit patch
    - patch 7
      - new: signal when data had to be copied
    - patch 8 (tcp)
      - optimize: avoid setting PSH bit when exceeding max frags.
        that limits GRO on the client. do not goto new_segment.
      - fix: fail on MSG_ZEROCOPY | MSG_FASTOPEN
      - minor: do not wait for memory: does not work for optmem
      - minor: simplify alloc
    - patch 9 (udp)
      - new: add PF_INET6
      - fix: attach zerocopy notification even if revert to copy
      - minor: simplify alloc size arithmetic
    - patch 10 (raw hdrinc)
      - new: add PF_INET6
    - patch 11 (pf_packet)
      - minor: simplify slightly
    - patch 12
      - new msg_zerocopy regression test: use veth pair to test
          all protocols: ipv4/ipv6/packet, tcp/udp/raw, cork
          all relevant ethtool settings: rx off, sg off
          all relevant packet lengths: 0,

    RFC -> RFCv2:
    - review comment: do not loop skb with zerocopy frags onto rx:
      add skb_orphan_frags_rx to orphan even refcounted frags
      call this in __netif_receive_skb_core, deliver_skb and tun:
      same as commit 1080e512d44d ("net: orphan frags on receive")
    - fix: hold an explicit sk reference on each notification skb.
      previously relied on the reference (or wmem) held by the
      data skb that would trigger notification, but this breaks
      on skb_orphan.
    - fix: when aborting a send, do not inc the zerocopy counter
      this caused gaps in the notification chain
    - fix: in packet with SOCK_DGRAM, pull ll headers before calling
      zerocopy_sg_from_iter
    - fix: if sock_zerocopy_realloc does not allow coalescing,
      do not fail, just allocate a new ubuf
    - fix: in tcp, check return value of second allocation attempt
    - chg: allocate notification skbs from optmem
      to avoid affecting tcp write queue accounting (TSQ)
    - chg: limit #locked pages (ulimit) per user instead of per process
    - chg: grow notification ids from 16 to 32 bit
      - pass range [lo, hi] through 32 bit fields ee_info and ee_data
    - chg: rebased to davem-net-next on top of v4.10-rc7
    - add: limit notification coalescing
      sharing ubufs limits overhead, but delays notification until
      the last packet is released, possibly unbounded. Add a cap.
    - tests: add snd_zerocopy_lo pf_packet test
    - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

    Limitations / Known Issues:
    - TCP may build slightly smaller than max TSO packets due to
    exceeding MAX_SKB_FRAGS frags when zerocopy pages are unaligned.
    - All SKBTX_SHARED_FRAG may require additional __skb_linearize or
    skb_copy_ubufs calls in u32, skb_find_text, similar to
    skb_checksum_help.

    Notification skbuffs are allocated from optmem. For sockets that
    cannot effectively coalesce notifications, the optmem max may need
    to be increased to avoid hitting -ENOBUFS:

    sysctl -w net.core.optmem_max=1048576

    In application load, copy avoidance shows a roughly 5% systemwide
    reduction in cycles when streaming large flows and a 4-8% reduction in
    wall clock time on early tensorflow test workloads.

    For the single-machine veth tests to succeed, loopback support has to
    be temporarily enabled by making skb_orphan_frags_rx map to
    skb_orphan_frags.

    * Performance

    The below table shows cycles reported by perf for a netperf process
    sending a single 10 Gbps TCP_STREAM. The first three columns show
    Mcycles spent in the netperf process context. The second three columns
    show time spent systemwide (-a -C A,B) on the two cpus that run the
    process and interrupt handler. Reported is the median of at least 3
    runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
    Netperf is pinned to cpu 2, network interrupts to cpu 3, rps and rfs
    are disabled and the kernel is booted with idle=halt.

    NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size

    perf stat -e cycles $NETPERF
    perf stat -C 2,3 -a -e cycles $NETPERF

            --process cycles--   ----cpu cycles----
               std      zc   %      std      zc   %
    4K      27,609  11,217  41   49,217  39,175  79
    16K     21,370   3,823  18   43,540  29,213  67
    64K     20,557   2,312  11   42,189  26,910  64
    256K    21,110   2,134  10   43,006  27,104  63
    1M      20,987   1,610   8   42,759  25,931  61

    Perf record indicates the main source of these differences. Process
    cycles only at 1M writes (perf record; perf report -n):

    std:
    Samples: 42K of event 'cycles', Event count (approx.): 21258597313
    79.41% 33884 netperf [kernel.kallsyms] [k] copy_user_generic_string
    3.27% 1396 netperf [kernel.kallsyms] [k] tcp_sendmsg
    1.66% 694 netperf [kernel.kallsyms] [k] get_page_from_freelist
    0.79% 325 netperf [kernel.kallsyms] [k] tcp_ack
    0.43% 188 netperf [kernel.kallsyms] [k] __alloc_skb

    zc:
    Samples: 1K of event 'cycles', Event count (approx.): 1439509124
    30.36% 584 netperf.zerocop [kernel.kallsyms] [k] gup_pte_range
    14.63% 284 netperf.zerocop [kernel.kallsyms] [k] __zerocopy_sg_from_iter
    8.03% 159 netperf.zerocop [kernel.kallsyms] [k] skb_zerocopy_add_frags_iter
    4.84% 96 netperf.zerocop [kernel.kallsyms] [k] __alloc_skb
    3.10% 60 netperf.zerocop [kernel.kallsyms] [k] kmem_cache_alloc_node

    * Safety

    The number of pages that can be pinned on behalf of a user with
    MSG_ZEROCOPY is bound by the locked memory ulimit.

    While the kernel holds process memory pinned, a process cannot safely
    reuse those pages for other purposes. Packets looped onto the receive
    stack and queued to a socket can be held indefinitely. Avoid unbounded
    notification latency by restricting user pages to egress paths only.
    skb_orphan_frags_rx() will create a private copy of pages even for
    refcounted packets when these are looped, as did skb_orphan_frags for
    the original tun zerocopy implementation.

    Pages are not remapped read-only. Processes can modify packet contents
    while packets are in flight in the kernel path. Bytes on which kernel
    control flow depends (headers) are copied to avoid TOCTTOU attacks.
    Datapath integrity does not otherwise depend on payload, with three
    exceptions: checksums, optional sk_filter/tc u32/.. and device +
    driver logic. The effect of wrong checksums is limited to the
    misbehaving process. TC filters that access contents may have to be
    excluded by adding an skb_orphan_frags_rx.

    Processes can also safely avoid OOM conditions by bounding the number
    of bytes passed with MSG_ZEROCOPY and by removing shared pages after
    transmission from their own memory map.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Introduce regression test for msg_zerocopy feature. Send traffic from
    one process to another with and without zerocopy.

    Evaluate tcp, udp, raw and packet sockets, including variants
    - udp: corking and corking with mixed copy/zerocopy calls
    - raw: with and without hdrincl
    - packet: at both raw and dgram level

    Test on both ipv4 and ipv6, optionally with ethtool changes to
    disable scatter-gather, tx checksum or tso offload. All of these
    can affect zerocopy behavior.

    The regression test can be run on a single machine over a veth
    pair. In that case skb_orphan_frags_rx must be modified to be
    identical to skb_orphan_frags to allow forwarding zerocopy locally.

    The msg_zerocopy.sh script will set up the veth pair in network
    namespaces and run all tests.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Add MSG_ZEROCOPY support to the TCP stack. TSO and GSO are
    both supported. Only data sent to remote destinations is sent without
    copying. Packets looped onto a local destination have their payload
    copied to avoid unbounded latency.

    Tested:
    A 10x TCP_STREAM between two hosts showed a reduction in netserver
    process cycles by up to 70%, depending on packet size. Systemwide,
    savings are of course much less pronounced, at up to 20% best case.

    msg_zerocopy.sh 4 tcp:

    without zerocopy
    tx=121792 (7600 MB) txc=0 zc=n
    rx=60458 (7600 MB)

    with zerocopy
    tx=286257 (17863 MB) txc=286257 zc=y
    rx=140022 (17863 MB)

    This test opens a pair of sockets over veth. One calls send with
    64KB and optionally MSG_ZEROCOPY, and the other reads the initial
    bytes. The receiver truncates, so this is strictly an upper bound on
    what is achievable. It is more representative of sending data out of
    a physical NIC (when payload is not touched, either).

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Bound the number of pages that a user may pin.

    Follow the lead of perf tools to maintain a per-user bound on memory
    locked pages; see commit 789f90fcf6b0 ("perf_counter: per user mlock
    gift").
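
    The accounting idea can be sketched in user-space C. This is not the
    kernel's code: pages_spanned(), struct user_pin_account, pin_charge()
    and pin_uncharge() are hypothetical names, and the limit is assumed to
    have been derived from the locked memory ulimit by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by the byte range [addr, addr + len).
 * An unaligned buffer can touch one page more than len / page_size. */
static size_t pages_spanned(uintptr_t addr, size_t len, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	if (len == 0)
		return 0;
	uintptr_t first = addr / page_size;
	uintptr_t last = (addr + len - 1) / page_size;
	return (size_t)(last - first + 1);
}

/* Hypothetical per-user account: shared by all sockets of one user. */
struct user_pin_account {
	size_t pinned_pages;   /* pages currently pinned */
	size_t limit_pages;    /* assumed: locked memory ulimit in pages */
};

/* Charge npages to the user; refuse if the bound would be exceeded. */
static bool pin_charge(struct user_pin_account *u, size_t npages)
{
	assert(u != NULL && u->pinned_pages <= u->limit_pages);
	if (npages > u->limit_pages - u->pinned_pages)
		return false;
	u->pinned_pages += npages;
	return true;
}

/* Release npages once the completion for the send has been queued. */
static void pin_uncharge(struct user_pin_account *u, size_t npages)
{
	assert(u != NULL && npages <= u->pinned_pages);
	u->pinned_pages -= npages;
}
```

    Because the account is per user rather than per process, one process
    cannot evade the bound by opening more sockets or forking.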

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • In the simple case, each sendmsg() call generates data and eventually
    a zerocopy ready notification N, where N indicates the Nth successful
    invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

    TCP and corked sockets can cause send() calls to append new data to an
    existing sk_buff and, thus, ubuf_info. In that case the notification
    must hold a range. Modify ubuf_info to store an inclusive range [N..N+m]
    and add skb_zerocopy_realloc() to optionally extend an existing range.

    Also coalesce notifications in this common case: if a notification
    [1, 1] is about to be queued while [0, 0] is the queue tail, just modify
    the tail of the queue to read [0, 1].

    Coalescing is limited to a few TSO frames worth of data to bound
    notification latency.
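
    A minimal sketch of the coalescing rule, as a user-space model rather
    than the kernel implementation. struct zc_range, zc_try_coalesce() and
    the value of ZC_COALESCE_MAX_BYTES are assumptions made for
    illustration; the text above only says "a few TSO frames worth".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of notification IDs carried by one queued completion. */
struct zc_range {
	uint32_t lo;
	uint32_t hi;
	uint32_t bytes;   /* payload bytes covered by this range */
};

/* Hypothetical cap on the data one coalesced notification may cover. */
#define ZC_COALESCE_MAX_BYTES (4u * 65536u)

/* Try to merge a new notification [lo, hi] into the queue tail.
 * Returns true if merged; false means the caller queues a new entry. */
static bool zc_try_coalesce(struct zc_range *tail, uint32_t lo,
			    uint32_t hi, uint32_t bytes)
{
	assert(tail != NULL);
	/* Only directly adjacent ranges merge. Unsigned arithmetic lets
	 * the 32-bit counter wrap from 0xffffffff to 0. */
	if (lo != tail->hi + 1u)
		return false;
	/* Bound notification latency by capping the merged size. */
	if ((uint64_t)tail->bytes + bytes > ZC_COALESCE_MAX_BYTES)
		return false;
	tail->hi = hi;
	tail->bytes += bytes;
	return true;
}
```

    With [0, 0] at the tail, a new [1, 1] turns the tail into [0, 1]. A
    gap in the IDs or an exceeded cap results in a separate notification.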

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
    skb_zerocopy_clone() wherever needed due to skb split, merge, resize
    or clone.

    Split skb_orphan_frags into two variants. The split, merge, .. paths
    support reference counted zerocopy buffers, so do not do a deep copy.
    Add skb_orphan_frags_rx for paths that may loop packets to receive
    sockets. That is not allowed, as it may cause unbounded latency.
    Deep copy all zerocopy buffers, ref-counted or not, in this path.

    The exact locations to modify were chosen by exhaustively searching
    through all code that might modify skb_frag references and/or the
    SKBTX_DEV_ZEROCOPY tx_flags bit.

    The changes err on the safe side, in two ways.

    (1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

    (2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • The send call ignores unknown flags. Legacy applications may already
    unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
    socket opts in to zerocopy.

    Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
    Processes can also query this socket option to detect kernel support
    for the feature. Older kernels will return ENOPROTOOPT.
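
    From user space, the opt-in might look like the sketch below. The
    helper names zc_enable() and zc_send() are hypothetical. The fallback
    constants are the values from the generic Linux UAPI headers and are
    only used when the libc headers do not define them; some architectures
    use different values.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fallbacks for older libc headers (generic UAPI values, assumed). */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt the socket in to MSG_ZEROCOPY processing.
 * Returns 0 on success or a negative errno value; -ENOPROTOOPT
 * indicates a kernel without the feature. */
static int zc_enable(int fd)
{
	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
		return -errno;
	return 0;
}

/* Send with the zerocopy flag. Without a prior zc_enable() the flag is
 * ignored and the data is copied as usual. */
static ssize_t zc_send(int fd, const void *buf, size_t len)
{
	return send(fd, buf, len, MSG_ZEROCOPY);
}
```

    After a successful zc_send(), the caller should not modify or free
    buf until the matching completion notification has been read from
    the socket error queue.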

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • The kernel supports zerocopy sendmsg in virtio and tap. Expand the
    infrastructure to support other socket types. Introduce a completion
    notification channel over the socket error queue. Notifications are
    returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
    blocking the send/recv path on receiving notifications.

    Add reference counting, to support the skb split, merge, resize and
    clone operations possible with SOCK_STREAM and other socket types.

    The patch does not yet modify any datapaths.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Refine skb_copy_ubufs to support compound pages. With upcoming TCP
    zerocopy sendmsg, such fragments may appear.

    The existing code replaces each page one for one. Splitting each
    compound page into an independent number of regular pages can result
    in exceeding limit MAX_SKB_FRAGS if data is not exactly page aligned.

    Instead, fill all destination pages but the last to PAGE_SIZE.
    Split the existing alloc + copy loop into separate stages:
    1. compute bytelength and minimum number of pages to store this.
    2. allocate
    3. copy, filling each page except the last to PAGE_SIZE bytes
    4. update skb frag array
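
    Stages 1 and 3 can be illustrated with a small user-space model. It
    is a sketch under assumptions, not the kernel code: fragments are
    plain byte arrays, DEMO_PAGE_SIZE stands in for PAGE_SIZE, and the
    names pages_needed() and pack_frags() are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u   /* assumed page size for the sketch */

/* Stage 1: minimum number of pages for the total byte length. */
static size_t pages_needed(const size_t *frag_len, size_t nfrags)
{
	size_t total = 0;

	for (size_t i = 0; i < nfrags; i++)
		total += frag_len[i];
	return (total + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}

/* Stage 3: copy the source fragments into destination pages, filling
 * each page completely before moving on. dst must hold at least
 * pages_needed() pages. Returns the bytes used in the last page. */
static size_t pack_frags(unsigned char (*dst)[DEMO_PAGE_SIZE],
			 const unsigned char *const *frag,
			 const size_t *frag_len, size_t nfrags)
{
	size_t page = 0, off = 0;   /* write position in dst */

	assert(dst != NULL && frag != NULL && frag_len != NULL);
	for (size_t i = 0; i < nfrags; i++) {
		size_t done = 0;

		while (done < frag_len[i]) {
			if (off == DEMO_PAGE_SIZE) {   /* page full */
				page++;
				off = 0;
			}
			size_t n = frag_len[i] - done;
			if (n > DEMO_PAGE_SIZE - off)
				n = DEMO_PAGE_SIZE - off;
			memcpy(&dst[page][off], frag[i] + done, n);
			done += n;
			off += n;
		}
	}
	return off;
}
```

    Because only the last page may be partially filled, the number of
    destination pages depends on the total length alone and not on how
    the source fragments happened to be aligned.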

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Add sock_omalloc and sock_ofree to be able to allocate control skbs,
    for instance for looping errors onto sk_error_queue.

    The transmit budget (sk_wmem_alloc) is involved in transmit skb
    shaping, most notably in TCP Small Queues. Using this budget for
    control packets would impact transmission.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Jiri Pirko says:

    ====================
    mlxsw: Support for IPv6 UC router

    Ido says:

    This set adds support for IPv6 unicast routes offload. The first four
    patches make the FIB notification chain generic so that it could be used
    by address families other than IPv4. This is done by having each address
    family register its callbacks with the common code, so that its FIB tables
    and rules could be dumped upon registration to the chain, while ensuring
    the integrity of the dump. The exact mechanics are explained in detail in
    the first patch.

    The next six patches build upon this work and add the necessary callbacks
    in IPv6 code. This allows listeners of the chain to receive notifications
    about IPv6 routes addition, deletion and replacement as well as FIB rules
    notifications.

    Unlike user space notifications for IPv6 multipath routes, the FIB
    notification chain notifies these on a per-nexthop basis. This keeps
    the common code lean. Bundling the nexthops of a route into a single
    notification is also unnecessary here, as notifications are serialized
    by each table's lock, whereas applications maintaining netlink caches
    may suffer from concurrent dumps and deletions / additions of routes.

    The next five patches audit the different code paths reading the route's
    reference count (rt6i_ref) and remove assumptions regarding its meaning.
    This is needed since non-FIB users need to be able to hold a reference on
    the route and a non-zero reference count no longer means the route is in
    the FIB.

    The last six patches enable the mlxsw driver to offload IPv6 unicast
    routes to the Spectrum ASIC. Without resorting to ACLs, lookup is done
    solely based on the destination IP, so the abort mechanism is invoked
    upon the addition of source-specific routes.

    Follow-up patch sets will increase the scale of gatewayed routes by
    consolidating identical nexthop groups to one adjacency entry in the
    device's adjacency table (as in IPv4), as well as add support for
    NH_{ADD,DEL} events which enable support for the
    'ignore_routes_with_linkdown' sysctl.

    Changes in v2:
    * Provide offload indication for individual nexthops (David Ahern).
    * Use existing route reference count instead of adding another one.
    This resulted in several new patches to remove assumptions regarding
    current semantics of the existing reference count (David Ahern).
    * Add helpers to allow non-FIB users to take a reference on route.
    * Remove use of tb6_lock in mlxsw (David Ahern).
    * Add IPv6 dependency to mlxsw.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We now have all the necessary IPv6 infrastructure in place, so stop
    ignoring these notifications.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Without resorting to ACLs, the device performs route lookup solely based
    on the destination IP address.

    In case source-specific routing is needed, an error is returned and the
    abort mechanism is activated, thus allowing the kernel to take over
    forwarding decisions.

    Instead of aborting, we can trap specific destination prefixes where
    source-specific routes are present, but this will result in a lot more
    code that is unlikely to ever be used.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • In case we got a replace event, then the replaced route must exist. If
    the route isn't capable of multipath, then replace first matching
    non-multipath capable route.

    If the route is capable of multipath and matching multipath capable
    route is found, then replace it. Otherwise, replace first matching
    non-multipath capable route.

    The new route is inserted before the replaced one. In case the replaced
    route is currently offloaded, then it's overwritten in the device's table
    by the new route and later deleted, thus not impacting routed traffic.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Allow directly connected and remote unicast IPv6 routes to be programmed
    to the device's tables.

    As with IPv4, identical routes - sharing the same destination prefix -
    are ordered in a FIB node according to their table ID and then the
    metric. While the kernel doesn't share the same trie for the local and
    main table, this does happen in the device, so ordering according to
    table ID is needed.
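
    The ordering can be expressed as a comparator. This is only an
    illustration: struct demo_route and both functions are hypothetical,
    the driver keeps its entries in a list rather than sorting an array,
    and the direction of the table ID comparison is an assumption (higher
    ID first, so that the local table, 255, precedes the main table, 254).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a route sharing one destination prefix. */
struct demo_route {
	uint32_t table_id;
	uint32_t metric;
};

/* Order by table ID (assumed: higher first), then by ascending metric. */
static int demo_route_cmp(const void *a, const void *b)
{
	const struct demo_route *x = a;
	const struct demo_route *y = b;

	if (x->table_id != y->table_id)
		return x->table_id > y->table_id ? -1 : 1;
	if (x->metric != y->metric)
		return x->metric < y->metric ? -1 : 1;
	return 0;
}

/* Sort the routes of one FIB node into lookup order. */
static void demo_sort_routes(struct demo_route *routes, size_t n)
{
	assert(routes != NULL || n == 0);
	qsort(routes, n, sizeof(*routes), demo_route_cmp);
}
```

    Under these assumptions the first entry after sorting is the one the
    device would use for the prefix.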

    Since individual nexthops can be added and deleted in IPv6, each FIB
    entry stores a linked list of the rt6_info structs it represents. Upon
    the addition or deletion of a nexthop, a new nexthop group is allocated
    according to the new configuration and the old one is destroyed.
    Identical groups aren't currently consolidated, but will be in a
    follow-up patchset.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • We only allow FIB offload in the presence of default rules or an l3mdev
    rule. In a similar fashion to IPv4 FIB rules, sanitize IPv6 rules.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • The FIB notification block currently only handles IPv4 events, but we
    want to start handling IPv6 events soon, so lay the groundwork now.

    Do that by preparing the work item and process it according to the
    notified address family.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Similar to commit 1c677b3d2828 ("ipv4: fib: Add fib_info_hold() helper")
    and commit b423cb10807b ("ipv4: fib: Export free_fib_info()") add a
    helper to hold a reference on rt6_info and export rt6_release() to drop
    it and potentially release the route.

    This is needed so that drivers capable of FIB offload could hold a
    reference on the route before queueing it for offload and drop it after
    the route has been programmed to the device's tables.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • When an interface is brought back up, the kernel tries to restore the
    host routes tied to its permanent addresses.

    However, if the host route was removed from the FIB, then we need to
    reinsert it. This is done by releasing the current dst and allocating a
    new one, so as to not reuse a dst with obsolete values.

    Since this function is called under RTNL and using the same explanation
    from the previous patch, we can test if the route is in the FIB by
    checking its node pointer instead of its reference count.

    Tested using the following script and Andrey's reproducer mentioned
    in commit 8048ced9beb2 ("net: ipv6: regenerate host route if moved to gc
    list") and linked below:

    $ ip link set dev lo up
    $ ip link add dummy1 type dummy
    $ ip -6 address add cafe::1/64 dev dummy1
    $ ip link set dev lo down # cafe::1/128 is removed
    $ ip link set dev dummy1 up
    $ ip link set dev lo up

    The host route is correctly regenerated.

    Signed-off-by: Ido Schimmel
    Link: http://lkml.kernel.org/r/CAAeHK+zSe82vc5gCRgr_EoUwiALPnWVdWJBPwJZBpbxYz=kGJw@mail.gmail.com
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • When the loopback device is brought back up we need to check if the host
    route attached to the address is still in the FIB and regenerate one in
    case it's not.

    Host routes using the loopback device are always inserted into and
    removed from the FIB under RTNL (under which this function is called),
    so we can test their node pointer instead of the reference count in
    order to check if the route is in the FIB or not.

    Tested using the following script from Nicolas mentioned in
    commit a220445f9f43 ("ipv6: correctly add local routes when lo goes up"):

    $ ip link add dummy1 type dummy
    $ ip link set dummy1 up
    $ ip link set lo down ; ip link set lo up

    The host route is correctly regenerated.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel