20 Dec, 2018

1 commit

  • Prior to this commit, when the struct socket object was being
    released, the UMEM did not have its reference count decreased.
    Instead, this was done in the struct sock sk_destruct function.

    There is no reason to keep the UMEM reference around when the socket
    is being orphaned, so in this patch xdp_put_umem is called in the
    xsk_release function. As a result, the xsk_destruct function can be
    removed.

    Note that a struct xsk_sock reference might still linger in the
    XSKMAP after the UMEM is released, e.g. if a user does not clear the
    XSKMAP prior to closing the process. This sock will remain in a
    "released", zombie-like state until it is removed from the XSKMAP.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann


20 Oct, 2018

1 commit

  • net/sched/cls_api.c has overlapping changes to a call to
    nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
    to the 5th argument, and another (from 'net-next') added cb->extack
    instead of NULL to the 6th argument.

    net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
    code which moved (to mr_table_dump) in 'net-next'. Thanks to David
    Ahern for the heads up.

    Signed-off-by: David S. Miller


11 Oct, 2018

1 commit

  • The XSKMAP update and delete functions called synchronize_net(), which
    can sleep. It is not allowed to sleep during an RCU read section.

    Instead we need to make sure that the sock sk_destruct (xsk_destruct)
    function is asynchronously called after an RCU grace period. Setting
    the SOCK_RCU_FREE flag for XDP sockets takes care of this.

    Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
    Reported-by: Eric Dumazet
    Signed-off-by: Björn Töpel
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann


08 Oct, 2018

1 commit

  • The AF_XDP socket struct can exist in three different, implicit
    states: setup, bound and released. Setup is prior to the socket
    being bound to a device. Bound is when the socket is active for
    receive and send. Released is when the process/userspace side of the
    socket is released, but the sock object is still lingering, e.g.
    when there is a reference to the socket in an XSKMAP after process
    termination.

    The Rx fast-path code uses the "dev" member of struct xdp_sock to
    check whether a socket is bound or released, and the Tx code uses
    the struct xdp_umem "xsk_list" member in conjunction with "dev" to
    determine the state of a socket.

    However, the transition from bound to released did not tear the socket
    down in correct order.

    On the Rx side "dev" was cleared after synchronize_net() making the
    synchronization useless. On the Tx side, the internal queues were
    destroyed prior removing them from the "xsk_list".

    This commit corrects the cleanup order, and by doing so
    xdp_del_sk_umem() can be simplified and one synchronize_net() can be
    removed.

    Fixes: 965a99098443 ("xsk: add support for bind for Rx")
    Fixes: ac98d8aab61b ("xsk: wire upp Tx zero-copy functions")
    Reported-by: Jesper Dangaard Brouer
    Signed-off-by: Björn Töpel
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann


05 Oct, 2018

3 commits

  • As we now do not allow ethtool to deactivate the queue id we are
    running an AF_XDP socket on, we can simplify the implementation of
    xdp_clear_umem_at_qid().

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann

  • We already check that the RSS indirection table does not use queues
    which would be disabled by channel reconfiguration. Make sure the
    user does not try to disable queues which have a UMEM and a
    zero-copy AF_XDP socket installed.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Signed-off-by: Daniel Borkmann

  • Previously, the xsk code did not record which umem was bound to a
    specific queue id. This was not required if all drivers were zero-copy
    enabled as this had to be recorded in the driver anyway. So if a user
    tried to bind two umems to the same queue, the driver would say
    no. But if copy-mode was first enabled and then zero-copy mode (or the
    reverse order), we mistakenly enabled both of them on the same umem
    leading to buggy behavior. The main culprit for this is that we did
    not store the association of umem to queue id in the copy case and
    only relied on the driver reporting this. As this relation was not
    stored in the driver for copy mode (it does not rely on the AF_XDP
    NDOs), this obviously could not work.

    This patch fixes the problem by always recording the umem to queue id
    relationship in the netdev_queue and netdev_rx_queue structs. This way
    we always know what kind of umem has been bound to a queue id and can
    act appropriately at bind time.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann


26 Sep, 2018

1 commit

  • XSK UMEM is strongly single producer single consumer, so reuse of
    frames is challenging. Add a simple "stash" of FILL packets for
    drivers to optionally reuse. This is useful when the driver has to
    free (ndo_stop) or resize a ring with an active AF_XDP ZC socket.

    Signed-off-by: Jakub Kicinski
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher


01 Sep, 2018

2 commits

  • This commit gets rid of the structure xdp_umem_props. It was there to
    be able to break a dependency at one point, but this is no longer
    needed. The values in the struct are instead stored directly in the
    xdp_umem structure. This simplifies the xsk code as well as the
    AF_XDP zero-copy drivers, and as a bonus gets rid of one internal
    header file.

    The i40e driver is also adapted to the new interface in this commit.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann

  • When xdp_umem_query() was added, one assignment of bpf.command was
    missed during cleanup. Remove the assignment statement.

    Fixes: 84c6b86875e01a0 ("xsk: don't allow umem replace at stack level")
    Signed-off-by: Prashant Bhole
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann


30 Aug, 2018

2 commits

  • Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
    not include XDP meta data in the data buffers copied out to the user
    application.

    In this commit, we check if meta data is available, and if so, it is
    prepended to the frame.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

  • Move the xdp_umem_get_{data,dma} functions to include/net/xdp_sock.h,
    so that the upcoming zero-copy implementation in the Ethernet drivers
    can utilize them.

    Also, supply some dummy function implementations for
    CONFIG_XDP_SOCKETS=n configs.

    Signed-off-by: Björn Töpel
    Signed-off-by: Alexei Starovoitov


22 Aug, 2018

1 commit

  • s/ENOTSUPP/EOPNOTSUPP/ in function umem_assign_dev().
    This function's return value is directly returned by xsk_bind().
    EOPNOTSUPP is bind()'s possible return value.

    Fixes: f734607e819b ("xsk: refactor xdp_umem_assign_dev()")
    Signed-off-by: Prashant Bhole
    Acked-by: Song Liu
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann


06 Aug, 2018

1 commit


03 Aug, 2018

1 commit


01 Aug, 2018

2 commits

  • Currently drivers have to check if they already have a umem
    installed for a given queue and return an error if so. Make
    better use of XDP_QUERY_XSK_UMEM and move this functionality
    to the core.

    We need to keep rtnl across the calls now.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Acked-by: Björn Töpel
    Signed-off-by: David S. Miller

  • Return early and only take the ref on dev once there is no possibility
    of failing.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Acked-by: Björn Töpel
    Signed-off-by: David S. Miller


31 Jul, 2018

1 commit

  • xdp_return_buff() is used when a frame has been successfully handled
    (transmitted) or if an error occurred during delayed processing and
    there is no way to report it back to xdp_do_redirect().

    In the case of __xsk_rcv_zc(), the error is propagated all the way
    back to the driver, so there is no need to call xdp_return_buff().
    The driver will recycle the frame anyway after seeing that an error
    happened.

    Fixes: 173d3adb6f43 ("xsk: add zero-copy support for Rx")
    Signed-off-by: Jakub Kicinski
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann


26 Jul, 2018

1 commit

  • Polling for the ingress queues relies on reading the producer/consumer
    pointers of the Rx queue.

    Prior to this commit, a cached consumer pointer could be used
    instead of the actual consumer pointer, and POLLIN could therefore
    be reported prematurely.

    This patch makes sure that the non-cached consumer pointer is used
    instead.

    Reported-by: Qi Zhang
    Tested-by: Qi Zhang
    Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann


13 Jul, 2018

4 commits

  • This patch stops returning EMSGSIZE from sendmsg in copy mode when the
    size of the packet is larger than the MTU. Just send it to the device
    so that it will drop it as in zero-copy mode. This makes the error
    reporting consistent between copy mode and zero-copy mode.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann

  • This patch makes sure ENOBUFS is always returned from sendmsg if there
    is no TX queue configured. This was not the case for zero-copy
    mode. With this patch this error reporting is consistent between copy
    mode and zero-copy mode.

    Fixes: ac98d8aab61b ("xsk: wire upp Tx zero-copy functions")
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann

  • This patch stops returning EAGAIN in TX copy mode when the completion
    queue is full as zero-copy does not do this. Instead this situation
    can be detected by comparing the head and tail pointers of the
    completion queue in both modes. In any case, EAGAIN was not the
    correct error code here since no amount of calling sendmsg will solve
    the problem. Only consuming one or more messages on the completion
    queue will fix this.

    With this patch, the error reporting becomes consistent between copy
    mode and zero-copy mode.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann

  • This patch removes the ENXIO return code from TX copy-mode when
    someone has forcefully changed the number of queues on the device so
    that the queue bound to the socket is no longer available. Just
    silently stop sending anything as in zero-copy mode so the error
    reporting gets consistent between the two modes.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann


03 Jul, 2018

3 commits

  • There is a potential race in the TX completion code for the SKB
    case. One process enters the sendmsg code of an AF_XDP socket in order
    to send a frame. The execution eventually trickles down to the driver
    that is told to send the packet. However, it decides to drop the
    packet due to some error condition (e.g., rings full) and frees the
    SKB. This will trigger the SKB destructor and a completion will be
    sent to the AF_XDP user space through its
    single-producer/single-consumer queues.

    At the same time a TX interrupt has fired on another core and it
    dispatches the TX completion code in the driver. It does its HW
    specific things and ends up freeing the SKB associated with the
    transmitted packet. This will trigger the SKB destructor and a
    completion will be sent to the AF_XDP user space through its
    single-producer/single-consumer queues. With a pseudo call stack, it
    would look like this:

    Core 1:
    sendmsg() being called in the application
    netdev_start_xmit()
    Driver entered through ndo_start_xmit
    Driver decides to free the SKB for some reason (e.g., rings full)
    Destructor of SKB called
    xskq_produce_addr() is called to signal completion to user space

    Core 2:
    TX completion irq
    NAPI loop
    Driver irq handler for TX completions
    Frees the SKB
    Destructor of SKB called
    xskq_produce_addr() is called to signal completion to user space

    We now have a violation of the single-producer/single-consumer
    principle for our queues as there are two threads trying to produce at
    the same time on the same queue.

    Fixed by introducing a spin_lock in the destructor. In regards to the
    performance, I get around 1.74 Mpps for txonly before and after the
    introduction of the spinlock. There is of course some impact due to
    the spin lock but it is in the less significant digits that are too
    noisy for me to measure. But let us say that the version without the
    spin lock got 1.745 Mpps in the best case and the version with 1.735
    Mpps in the worst case, then that would mean a maximum drop in
    performance of 0.5%.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov

  • Fixed a bug in which a frame could be completed more than once
    when an error was returned from dev_direct_xmit(). The code
    erroneously retried sending the message leading to multiple
    calls to the SKB destructor and therefore multiple completions
    of the same buffer to user space.

    The error code in this case has been changed from EAGAIN to EBUSY
    in order to tell user space that the sending of the packet failed
    and the buffer has been returned to user space through the
    completion queue.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Magnus Karlsson
    Reported-by: Pavel Odintsov
    Signed-off-by: Alexei Starovoitov

  • The code in xskq_produce_addr erroneously checked if there
    was up to LAZY_UPDATE_THRESHOLD amount of space in the completion
    queue. It only needs to check if there is one slot left in the
    queue. This bug could under some circumstances lead to a WARN_ON_ONCE
    being triggered and the completion message to user space being lost.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Magnus Karlsson
    Reported-by: Pavel Odintsov
    Signed-off-by: Alexei Starovoitov


29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds


12 Jun, 2018

2 commits

  • Commit 173d3adb6f43 ("xsk: add zero-copy support for Rx") introduced a
    regression on the XDP_SKB receive path, when the queue id checks were
    removed. Now, they are back again.

    Fixes: 173d3adb6f43 ("xsk: add zero-copy support for Rx")
    Reported-by: Qi Zhang
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

  • syzkaller reported a warning from xdp_umem_pin_pages():

    WARNING: CPU: 1 PID: 4537 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
    ...
    __do_kmalloc mm/slab.c:3713 [inline]
    __kmalloc+0x25/0x760 mm/slab.c:3727
    kmalloc_array include/linux/slab.h:634 [inline]
    kcalloc include/linux/slab.h:645 [inline]
    xdp_umem_pin_pages net/xdp/xdp_umem.c:205 [inline]
    xdp_umem_reg net/xdp/xdp_umem.c:318 [inline]
    xdp_umem_create+0x5c9/0x10f0 net/xdp/xdp_umem.c:349
    xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
    __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
    __do_sys_setsockopt net/socket.c:1946 [inline]
    __se_sys_setsockopt net/socket.c:1943 [inline]
    __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
    do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    This is a warning about attempting to allocate more than
    KMALLOC_MAX_SIZE memory. The request originates from userspace, and if
    the request is too big, the kernel is free to deny its allocation. In
    this patch, the failed allocation attempt is silenced with
    __GFP_NOWARN.

    Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
    Reported-by: syzbot+4abadc5d69117b346506@syzkaller.appspotmail.com
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann


08 Jun, 2018

2 commits

  • syzkaller was able to trigger the following panic for AF_XDP:

    BUG: KASAN: null-ptr-deref in atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline]
    BUG: KASAN: null-ptr-deref in atomic_long_sub include/asm-generic/atomic-long.h:199 [inline]
    BUG: KASAN: null-ptr-deref in xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135
    Write of size 8 at addr 0000000000000060 by task syz-executor246/4527

    CPU: 1 PID: 4527 Comm: syz-executor246 Not tainted 4.17.0+ #89
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1b9/0x294 lib/dump_stack.c:113
    kasan_report_error mm/kasan/report.c:352 [inline]
    kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
    check_memory_region_inline mm/kasan/kasan.c:260 [inline]
    check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
    kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
    atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline]
    atomic_long_sub include/asm-generic/atomic-long.h:199 [inline]
    xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135
    xdp_umem_reg net/xdp/xdp_umem.c:334 [inline]
    xdp_umem_create+0xd6c/0x10f0 net/xdp/xdp_umem.c:349
    xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
    __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
    __do_sys_setsockopt net/socket.c:1946 [inline]
    __se_sys_setsockopt net/socket.c:1943 [inline]
    __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
    do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    In xdp_umem_reg(), the call to xdp_umem_account_pages() ran with
    CAP_IPC_LOCK, so we did not need to charge the memlock rlimit for
    the current user, and umem->user therefore remains NULL. Later on,
    through fault injection, syzkaller triggered a failure in either the
    umem->pgs or umem->pages allocation such that we bail out and undo
    accounting in xdp_umem_unaccount_pages(), where we eventually hit
    the panic since it tries to deref umem->user.

    The code is pretty close to mm_account_pinned_pages() and
    mm_unaccount_pinned_pages() pair and potentially could reuse
    it even in a later cleanup, and it appears that the initial
    commit c0c77d8fb787 ("xsk: add user memory registration support
    sockopt") got this right while later follow-up introduced the
    bug via a49049ea2576 ("xsk: simplified umem setup").

    Fixes: a49049ea2576 ("xsk: simplified umem setup")
    Reported-by: syzbot+979217770b09ebf5c407@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov

  • With gcc-4.1.2 on 32-bit:

    net/xdp/xsk.c:663: warning: integer constant is too large for ‘long’ type
    net/xdp/xsk.c:665: warning: integer constant is too large for ‘long’ type

    Add the missing "ULL" suffixes to the large XDP_UMEM_PGOFF_*_RING values
    to fix this.

    net/xdp/xsk.c:663: warning: comparison is always false due to limited range of data type
    net/xdp/xsk.c:665: warning: comparison is always false due to limited range of data type

    "unsigned long" is 32-bit on 32-bit systems, hence the offset is
    truncated, and can never be equal to any of the XDP_UMEM_PGOFF_*_RING
    values. Use loff_t (and the required cast) to fix this.

    Fixes: 423f38329d267969 ("xsk: add umem fill queue support and mmap")
    Fixes: fe2308328cd2f26e ("xsk: add umem completion queue support and mmap")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann


07 Jun, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Add Maglev hashing scheduler to IPVS, from Inju Song.

    2) Lots of new TC subsystem tests from Roman Mashak.

    3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

    4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

    5) Add ttl inherit support to vxlan, from Hangbin Liu.

    6) Properly separate ipv6 routes into their logically independent
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

    7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

    8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

    9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

    10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

    11) Plumb extack down into fib_rules, from Roopa Prabhu.

    12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

    13) Add UDP GSO support, from Willem de Bruijn.

    14) Add documentation for eBPF helpers, from Quentin Monnet.

    15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

    16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

    17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

    18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

    19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

    20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

    21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

    22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

    23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

    24) Add TCP SACK compression, from Eric Dumazet.

    25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

    26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

    27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

    28) Add lots of forwarding selftests, from Petr Machata.

    29) Add generic network device failover driver, from Sridhar Samudrala.

    * ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
    strparser: Add __strp_unpause and use it in ktls.
    rxrpc: Fix terminal retransmission connection ID to include the channel
    net: hns3: Optimize PF CMDQ interrupt switching process
    net: hns3: Fix for VF mailbox receiving unknown message
    net: hns3: Fix for VF mailbox cannot receiving PF response
    bnx2x: use the right constant
    Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
    net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
    enic: fix UDP rss bits
    netdev-FAQ: clarify DaveM's position for stable backports
    rtnetlink: validate attributes in do_setlink()
    mlxsw: Add extack messages for port_{un, }split failures
    netdevsim: Add extack error message for devlink reload
    devlink: Add extack to reload and port_{un, }split operations
    net: metrics: add proper netlink validation
    ipmr: fix error path when ipmr_new_table fails
    ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
    net: hns3: remove unused hclgevf_cfg_func_mta_filter
    netfilter: provide udp*_lib_lookup for nf_tproxy
    qed*: Utilize FW 8.37.2.0
    ...


05 Jun, 2018

4 commits


04 Jun, 2018

3 commits

  • Currently, AF_XDP only supports a fixed frame-size memory scheme where
    each frame is referenced via an index (idx). A user passes the frame
    index to the kernel, and the kernel acts upon the data. Some NICs,
    however, do not have a fixed frame-size model, instead they have a
    model where a memory window is passed to the hardware and multiple
    frames are filled into that window (referred to as the "type-writer"
    model).

    By changing the descriptor format from the current frame index
    addressing scheme, AF_XDP can in the future be extended to support
    these kinds of NICs.

    In the index-based model, an idx refers to a frame of size
    frame_size. Addressing a frame in the UMEM is done by offsetting the
    UMEM starting address by a global offset, idx * frame_size + offset.
    Communicating via the fill- and completion-rings is done by means of
    idx.

    In this commit, the idx is removed in favor of an address (addr),
    which is a relative address ranging over the UMEM. To convert an
    idx-based address to the new addr is simply: addr = idx * frame_size +
    offset.

    We also stop referring to the UMEM "frame" as a frame. Instead it is
    simply called a chunk.

    To transfer ownership of a chunk to the kernel, the addr of the
    chunk is passed in the fill-ring. Note that the kernel will mask
    addr to make it chunk aligned, so there is no need for userspace to
    do that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050
    or 3000 to the fill-ring will refer to the same chunk.

    On the completion-ring, the addr will match that of the Tx descriptor,
    passed to the kernel.

    Changing the descriptor format to use chunks/addr will allow for
    future changes to move to a type-writer based model, where multiple
    frames can reside in one chunk. In this model passing one single chunk
    into the fill-ring, would potentially result in multiple Rx
    descriptors.

    This commit changes the uapi of AF_XDP sockets, and updates the
    documentation.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

  • Previously, rx_dropped could be updated incorrectly, e.g. if the XDP
    program redirected the frame to a socket bound to a different queue
    than where the XDP program was executing.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

  • Previously the fill queue descriptor was not copied to kernel space
    prior to validating it, making it possible for userland to change
    the descriptor post-kernel-validation.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann


22 May, 2018

1 commit