20 Aug, 2019

1 commit


12 Jul, 2019

2 commits

  • There are 2 call chains:

    a) xsk_bind --> xdp_umem_assign_dev
    b) unregister_netdevice_queue --> xsk_notifier

    with the following locking order:

    a) xs->mutex --> rtnl_lock
    b) rtnl_lock --> xdp.lock --> xs->mutex

    Taking 'xs->mutex' and 'rtnl_lock' in different orders could produce a
    deadlock here. Fix that by taking 'rtnl_lock' before 'xs->mutex' in
    the bind call chain (a).

    Reported-by: syzbot+bf64ec93de836d7f4c2c@syzkaller.appspotmail.com
    Fixes: 455302d1c9ae ("xdp: fix hang while unregistering device bound to xdp socket")
    Signed-off-by: Ilya Maximets
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
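
The ordering problem described above can be illustrated with a minimal userspace sketch. It is not kernel code: the enum values and `record_chain()` are hypothetical names invented here, and the check only compares pairs of locks, which is far simpler than what the kernel's lock validator does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical identifiers standing in for the three locks named above. */
enum lock_id { XS_MUTEX, RTNL_LOCK, XDP_LOCK, NR_LOCKS };

/* ordered[a][b] is true once some chain has taken lock a before lock b. */
static bool ordered[NR_LOCKS][NR_LOCKS];

/*
 * Record the order in which one call chain takes its locks.
 * Returns false, recording nothing, if any pair contradicts an
 * order recorded earlier (a potential ABBA deadlock).
 */
bool record_chain(const enum lock_id *chain, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (ordered[chain[j]][chain[i]])
                return false;

    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            ordered[chain[i]][chain[j]] = true;

    return true;
}
```

Feeding it chain (b) first, then the old chain (a), reports a conflict; the reordered chain (a), which takes rtnl_lock first, is accepted.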
     
  • Completion queue address reservation could not be undone.
    In case of a bad 'queue_id' or skb allocation failure, the reserved
    entry is leaked, reducing the total capacity of the completion queue.

    Fix that by moving the reservation to the point where failure is no
    longer possible. Additionally, the 'queue_id' check is moved out of
    the loop since there is no point in checking it there.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Ilya Maximets
    Acked-by: Björn Töpel
    Tested-by: William Tu
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
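
A minimal sketch of the "reserve last" ordering, assuming a toy completion queue that only counts reserved slots. `struct cq`, `cq_reserve()` and `xmit_one()` are hypothetical names, the two boolean parameters stand in for the real checks, and the errno values are illustrative rather than taken from the kernel.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy completion queue: tracks only how many slots are reserved. */
struct cq {
    unsigned int reserved;
    unsigned int capacity;
};

/* Reserve one slot. There is no way to undo this afterwards. */
static int cq_reserve(struct cq *q)
{
    if (q->reserved >= q->capacity)
        return -ENOSPC;
    q->reserved++;
    return 0;
}

/*
 * Run every step that can fail before touching the queue, so that an
 * early error never leaves a reserved slot behind.
 */
int xmit_one(struct cq *q, bool queue_id_valid, bool alloc_ok)
{
    if (!queue_id_valid)
        return -ENXIO;      /* bad queue id: nothing reserved yet */
    if (!alloc_ok)
        return -ENOMEM;     /* allocation failed: nothing reserved yet */
    return cq_reserve(q);   /* last step that can fail */
}
```

With this order, a failed call leaves `reserved` unchanged, so the queue keeps its full capacity.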
     

09 Jul, 2019

2 commits

  • Two cases of overlapping changes, nothing fancy.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Unlike driver mode, generic xdp receive could be triggered
    by different threads on different CPU cores at the same time,
    leading to fill and rx queue breakage. For example, this
    could happen while sending packets from two processes to the
    first interface of a veth pair while the second end of the pair
    is opened with an AF_XDP socket.

    Take a lock for each generic receive to avoid the race.

    Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
    Signed-off-by: Ilya Maximets
    Acked-by: Magnus Karlsson
    Tested-by: William Tu
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
     

03 Jul, 2019

2 commits

  • A device bound to an XDP socket will not reach a zero refcount until
    the userspace application closes the socket. This leads to a hang
    inside 'netdev_wait_allrefs()' if device unregistration is requested:

    # ip link del p1
    < hang on recvmsg on netlink socket >

    # ps -x | grep ip
    5126 pts/0 D+ 0:00 ip link del p1

    # journalctl -b

    Jun 05 07:19:16 kernel:
    unregister_netdevice: waiting for p1 to become free. Usage count = 1

    Jun 05 07:19:27 kernel:
    unregister_netdevice: waiting for p1 to become free. Usage count = 1
    ...

    Fix that by implementing NETDEV_UNREGISTER event notification handler
    to properly clean up all the resources and unref device.

    This should also allow socket killing via ss(8) utility.

    Fixes: 965a99098443 ("xsk: add support for bind for Rx")
    Signed-off-by: Ilya Maximets
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
     
  • The device pointer is stored in the umem regardless of zero-copy mode,
    so we need to hold the device in all cases.

    Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
    Signed-off-by: Ilya Maximets
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
     

28 Jun, 2019

3 commits

  • Some drivers want to access the data transmitted in order to implement
    acceleration features of the NICs. It is also useful in AF_XDP TX flow.

    Change the xsk_umem_consume_tx API to return the whole xdp_desc, that
    contains the data pointer, length and DMA address, instead of only the
    latter two. Adapt the implementation of i40e and ixgbe to this change.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Tariq Toukan
    Acked-by: Saeed Mahameed
    Cc: Björn Töpel
    Cc: Magnus Karlsson
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Maxim Mikityanskiy
     
  • Make it possible for the application to determine whether the AF_XDP
    socket is running in zero-copy mode. To achieve this, add a new
    getsockopt option XDP_OPTIONS that returns flags. The only flag
    supported for now is the zero-copy mode indicator.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Tariq Toukan
    Acked-by: Saeed Mahameed
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Maxim Mikityanskiy
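
A minimal userspace sketch of how an application might query this option. It assumes the UAPI header `<linux/if_xdp.h>` from a kernel that includes this change, which provides `struct xdp_options`, `XDP_OPTIONS` and `XDP_OPTIONS_ZEROCOPY`. The helper name `xsk_is_zerocopy()` is made up here, and the `SOL_XDP` fallback is only for C libraries that do not define it.

```c
#include <linux/if_xdp.h>
#include <sys/socket.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* fallback for C libraries that lack the definition */
#endif

/*
 * Returns 1 if the AF_XDP socket runs in zero-copy mode, 0 if it does
 * not, and -1 if getsockopt() fails (errno is set by getsockopt()).
 */
int xsk_is_zerocopy(int fd)
{
    struct xdp_options opts = { 0 };
    socklen_t len = sizeof(opts);

    if (getsockopt(fd, SOL_XDP, XDP_OPTIONS, &opts, &len) < 0)
        return -1;

    return (opts.flags & XDP_OPTIONS_ZEROCOPY) ? 1 : 0;
}
```

The query is only meaningful after a successful bind, since the mode is decided at bind time.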
     
  • Add a function that checks whether the Fill Ring has the specified
    amount of descriptors available. It will be useful for mlx5e that wants
    to check in advance, whether it can allocate a bulk of RX descriptors,
    to get the best performance.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Tariq Toukan
    Acked-by: Saeed Mahameed
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Maxim Mikityanskiy
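
The check can be pictured as plain ring-index arithmetic. This is a sketch under assumptions, not the function the patch adds: `struct fill_ring` and `ring_has_entries()` are hypothetical names, and the indices are assumed to be free-running 32-bit counters that may wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Free-running producer and consumer indices of a ring. */
struct fill_ring {
    uint32_t producer;
    uint32_t consumer;
};

/*
 * True if at least n entries have been produced and not yet consumed.
 * The unsigned subtraction stays correct when the indices wrap around.
 */
bool ring_has_entries(const struct fill_ring *r, uint32_t n)
{
    return (uint32_t)(r->producer - r->consumer) >= n;
}
```

A driver could call such a check once before allocating a batch, instead of discovering halfway through that the ring ran dry.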
     

26 Jun, 2019

1 commit

  • Clang warns:

    In file included from net/xdp/xsk_queue.c:10:
    net/xdp/xsk_queue.h:292:2: warning: expression result unused
    [-Wunused-value]
    WRITE_ONCE(q->ring->producer, q->prod_tail);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    include/linux/compiler.h:284:6: note: expanded from macro 'WRITE_ONCE'
    __u.__val; \
    ~~~ ^~~~~
    1 warning generated.

    The q->prod_tail assignment has a comma at the end, not a semi-colon.
    Fix that so clang no longer warns and everything works as expected.

    Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
    Link: https://github.com/ClangBuiltLinux/linux/issues/544
    Signed-off-by: Nathan Chancellor
    Acked-by: Nick Desaulniers
    Acked-by: Jonathan Lemon
    Acked-by: Björn Töpel
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Nathan Chancellor
     

12 Jun, 2019

1 commit


21 May, 2019

1 commit


15 May, 2019

1 commit

  • Patch series "Add FOLL_LONGTERM to GUP fast and use it".

    HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
    advantages. These pages can be held for a significant time. But
    get_user_pages_fast() does not protect against mapping FS DAX pages.

    Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
    retains the performance while also adding the FS DAX checks. XDP has also
    shown interest in using this functionality.[1]

    In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
    and remove the specialized get_user_pages_longterm call.

    [1] https://lkml.org/lkml/2019/3/19/939

    This patch (of 7):

    This patch starts a series which aims to support FOLL_LONGTERM in
    get_user_pages_fast(). Some callers would like to do a longterm
    (user-controlled) pin of pages with the fast variant of GUP for
    performance purposes.

    Rather than have a separate get_user_pages_longterm() call, introduce
    FOLL_LONGTERM and change the longterm callers to use it.

    This patch does not change any functionality. In the short term
    "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
    in particular has been blocked. However, callers of get_user_pages_fast()
    were not "protected".

    FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
    requires vmas to determine if DAX is in use.

    NOTE: In merging with the CMA changes we opt to change the
    get_user_pages() call in check_and_migrate_cma_pages() to a call of
    __get_user_pages_locked() on the newly migrated pages. This makes the
    code read better in that we are calling __get_user_pages_locked() on the
    pages before and after a potential migration.

    As a side effect some of the interfaces are cleaned up, but this is not
    the primary purpose of the series.

    In review[1] it was asked:

    > This I don't get - if you do lock down long term mappings performance
    > of the actual get_user_pages call shouldn't matter to start with.
    >
    > What do I miss?

    A couple of points.

    First "longterm" is a relative thing and at this point is probably a
    misnomer. This is really flagging a pin which is going to be given to
    hardware and can't move. I've thought of a couple of alternative names
    but I think we have to settle on if we are going to use FL_LAYOUT or
    something else to solve the "longterm" problem. Then I think we can
    change the flag to a better name.

    Second, it depends on how often you are registering memory. I have spoken
    with some RDMA users who consider MR in the performance path... For the
    overall application performance. I don't have the numbers as the tests
    for HFI1 were done a long time ago. But there was a significant
    advantage. Some of which is probably due to the fact that you don't have
    to hold mmap_sem.

    Finally, architecturally I think it would be good for everyone to use
    *_fast. There are patches submitted to the RDMA list which would allow
    the use of *_fast (they rework the use of mmap_sem) and as soon as they
    are accepted I'll submit a patch to convert the RDMA core as well. Also
    to this point others are looking to use *_fast.

    As an aside, Jason pointed out in my previous submission that *_fast and
    *_unlocked look very much the same. I agree and I think further cleanup
    will be coming. But I'm focused on getting the final solution for DAX at
    the moment.

    [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

    [ira.weiny@intel.com: v3]
    Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
    Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
    Signed-off-by: Ira Weiny
    Reviewed-by: Andrew Morton
    Cc: Aneesh Kumar K.V
    Cc: Michal Hocko
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Jason Gunthorpe
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "David S. Miller"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Ralf Baechle
    Cc: James Hogan
    Cc: Dan Williams
    Cc: Mike Marshall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira Weiny
     

17 Apr, 2019

1 commit

  • The ring buffer code of XDP sockets is missing a memory barrier on the
    consumer side between the load of the data and the write that signals
    that it is ok for the producer to put new data into the buffer. On
    architectures that do not guarantee that stores are not reordered
    with older loads, the producer might put data into the ring before the
    consumer had the chance to read it. As IA does guarantee this
    ordering, it would only need a compiler barrier here, but there are no
    primitives in Linux for this specific case (preventing writes from
    being ordered before older reads) so I had to add a smp_mb() here which
    will translate into a run-time synch operation on IA.

    Added a longish comment in the code explaining what each barrier in
    the ring implementation accomplishes and what would happen if we
    removed one of them.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov

    Magnus Karlsson
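
The consumer-side ordering can be sketched in userspace with C11 atomics. This is an analogue under assumptions, not the kernel code: the patch uses smp_mb(), whereas this sketch uses a release store on the consumer index, which likewise keeps the load of the data from being reordered after the store that hands the slot back. `struct spsc_ring`, `ring_produce()` and `ring_consume()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 4u /* must be a power of two */

struct spsc_ring {
    _Atomic uint32_t producer;  /* written only by the producer */
    _Atomic uint32_t consumer;  /* written only by the consumer */
    uint64_t desc[RING_SIZE];
};

bool ring_produce(struct spsc_ring *r, uint64_t v)
{
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
    /* Acquire pairs with the consumer's release store below. */
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

    if (prod - cons == RING_SIZE)
        return false;           /* ring is full */
    r->desc[prod & (RING_SIZE - 1)] = v;
    /* Release: the data write is visible before the new producer index. */
    atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
    return true;
}

bool ring_consume(struct spsc_ring *r, uint64_t *out)
{
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

    if (prod == cons)
        return false;           /* ring is empty */
    *out = r->desc[cons & (RING_SIZE - 1)];
    /*
     * Release: the load of desc[] above cannot be reordered after this
     * store, so the producer cannot overwrite the slot before it is read.
     */
    atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
    return true;
}
```

Weakening the final store in `ring_consume()` to a relaxed store would reintroduce the hazard the commit describes on weakly ordered architectures.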
     

16 Mar, 2019

1 commit

  • When the umem is cleaned up, the task that created it might already be
    gone. If the task was gone, the xdp_umem_release function did not free
    the pages member of struct xdp_umem.

    It turned out that the task lookup was not needed at all; the code was
    a left-over from when we moved from task accounting to user accounting [1].

    This patch fixes the memory leak by removing the task lookup logic
    completely.

    [1] https://lore.kernel.org/netdev/20180131135356.19134-3-bjorn.topel@gmail.com/

    Link: https://lore.kernel.org/netdev/c1cb2ca8-6a14-3980-8672-f3de0bb38dfd@suse.cz/
    Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
    Reported-by: Jiri Slaby
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

09 Mar, 2019

2 commits

  • Passing a non-existing option in the options member of struct
    xdp_desc was, incorrectly, silently ignored. This patch addresses
    that behavior, and drops any Tx descriptor with non-existing options.

    We have examined existing user space code and, to the best of our knowledge,
    no one is relying on the current incorrect behavior. AF_XDP is still
    in its infancy, so from our perspective, the risk of breakage is very
    low, and addressing this problem now is important.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • Passing a non-existing flag in the sxdp_flags member of struct
    sockaddr_xdp was, incorrectly, silently ignored. This patch addresses
    that behavior, and rejects any non-existing flags.

    We have examined existing user space code and, to the best of our knowledge,
    no one is relying on the current incorrect behavior. AF_XDP is still
    in its infancy, so from our perspective, the risk of breakage is very
    low, and addressing this problem now is important.

    Fixes: 965a99098443 ("xsk: add support for bind for Rx")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
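
Both commits above apply the same idea: mask off the bits that are defined and reject the input if anything remains. A minimal sketch follows, with hypothetical flag names (`BIND_*`) and a hypothetical `check_bind_flags()` rather than the kernel's definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits; the real ones live in the AF_XDP UAPI header. */
#define BIND_SHARED_UMEM (1u << 0)
#define BIND_COPY        (1u << 1)
#define BIND_ZEROCOPY    (1u << 2)
#define BIND_KNOWN_FLAGS (BIND_SHARED_UMEM | BIND_COPY | BIND_ZEROCOPY)

/* Reject any bit outside the known set instead of silently ignoring it. */
int check_bind_flags(uint32_t flags)
{
    if (flags & ~BIND_KNOWN_FLAGS)
        return -EINVAL;
    return 0;
}
```

Rejecting unknown bits early keeps them available for future extensions, because no existing program can come to depend on them being ignored.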
     

07 Mar, 2019

1 commit

  • Fixes two typos in xsk_diag_put_umem()

    syzbot reported the following crash :

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 7641 Comm: syz-executor946 Not tainted 5.0.0-rc7+ #95
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:xsk_diag_put_umem net/xdp/xsk_diag.c:71 [inline]
    RIP: 0010:xsk_diag_fill net/xdp/xsk_diag.c:113 [inline]
    RIP: 0010:xsk_diag_dump+0xdcb/0x13a0 net/xdp/xsk_diag.c:143
    Code: 8d be c0 04 00 00 48 89 f8 48 c1 e8 03 42 80 3c 20 00 0f 85 39 04 00 00 49 8b 96 c0 04 00 00 48 8d 7a 14 48 89 f8 48 c1 e8 03 0f b6 0c 20 48 89 f8 83 e0 07 83 c0 03 38 c8 7c 08 84 c9 0f 85
    RSP: 0018:ffff888090bcf2d8 EFLAGS: 00010203
    RAX: 0000000000000002 RBX: ffff8880a0aacbc0 RCX: ffffffff86ffdc3c
    RDX: 0000000000000000 RSI: ffffffff86ffdc70 RDI: 0000000000000014
    RBP: ffff888090bcf438 R08: ffff88808e04a700 R09: ffffed1011c74174
    R10: ffffed1011c74173 R11: ffff88808e3a0b9f R12: dffffc0000000000
    R13: ffff888093a6d818 R14: ffff88808e365240 R15: ffff88808e3a0b40
    FS: 00000000011ea880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000080 CR3: 000000008fa13000 CR4: 00000000001406e0
    Call Trace:
    netlink_dump+0x55d/0xfb0 net/netlink/af_netlink.c:2252
    __netlink_dump_start+0x5b4/0x7e0 net/netlink/af_netlink.c:2360
    netlink_dump_start include/linux/netlink.h:226 [inline]
    xsk_diag_handler_dump+0x1b2/0x250 net/xdp/xsk_diag.c:170
    __sock_diag_cmd net/core/sock_diag.c:232 [inline]
    sock_diag_rcv_msg+0x322/0x410 net/core/sock_diag.c:263
    netlink_rcv_skb+0x17a/0x460 net/netlink/af_netlink.c:2485
    sock_diag_rcv+0x2b/0x40 net/core/sock_diag.c:274
    netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
    netlink_unicast+0x536/0x720 net/netlink/af_netlink.c:1336
    netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1925
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg+0xdd/0x130 net/socket.c:632
    sock_write_iter+0x27c/0x3e0 net/socket.c:923
    call_write_iter include/linux/fs.h:1863 [inline]
    do_iter_readv_writev+0x5e0/0x8e0 fs/read_write.c:680
    do_iter_write fs/read_write.c:956 [inline]
    do_iter_write+0x184/0x610 fs/read_write.c:937
    vfs_writev+0x1b3/0x2f0 fs/read_write.c:1001
    do_writev+0xf6/0x290 fs/read_write.c:1036
    __do_sys_writev fs/read_write.c:1109 [inline]
    __se_sys_writev fs/read_write.c:1106 [inline]
    __x64_sys_writev+0x75/0xb0 fs/read_write.c:1106
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x440139
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffcc966cc18 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440139
    RDX: 0000000000000001 RSI: 0000000020000080 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
    R10: 0000000000000004 R11: 0000000000000246 R12: 00000000004019c0
    R13: 0000000000401a50 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    ---[ end trace 460a3c24d0a656c9 ]---
    RIP: 0010:xsk_diag_put_umem net/xdp/xsk_diag.c:71 [inline]
    RIP: 0010:xsk_diag_fill net/xdp/xsk_diag.c:113 [inline]
    RIP: 0010:xsk_diag_dump+0xdcb/0x13a0 net/xdp/xsk_diag.c:143
    Code: 8d be c0 04 00 00 48 89 f8 48 c1 e8 03 42 80 3c 20 00 0f 85 39 04 00 00 49 8b 96 c0 04 00 00 48 8d 7a 14 48 89 f8 48 c1 e8 03 0f b6 0c 20 48 89 f8 83 e0 07 83 c0 03 38 c8 7c 08 84 c9 0f 85
    RSP: 0018:ffff888090bcf2d8 EFLAGS: 00010203
    RAX: 0000000000000002 RBX: ffff8880a0aacbc0 RCX: ffffffff86ffdc3c
    RDX: 0000000000000000 RSI: ffffffff86ffdc70 RDI: 0000000000000014
    RBP: ffff888090bcf438 R08: ffff88808e04a700 R09: ffffed1011c74174
    R10: ffffed1011c74173 R11: ffff88808e3a0b9f R12: dffffc0000000000
    R13: ffff888093a6d818 R14: ffff88808e365240 R15: ffff88808e3a0b40
    FS: 00000000011ea880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000001d22000 CR3: 000000008fa13000 CR4: 00000000001406f0

    Fixes: a36b38aa2af6 ("xsk: add sock_diag interface for AF_XDP")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Cc: Björn Töpel
    Cc: Daniel Borkmann
    Cc: Magnus Karlsson
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Eric Dumazet
     

25 Feb, 2019

1 commit

  • Three conflicts, one of which, for marvell10g.c is non-trivial and
    requires some follow-up from Heiner or someone else.

    The issue is that Heiner converted the marvell10g driver over to
    use the generic c45 code as much as possible.

    However, in 'net' a bug fix appeared which makes sure that a new
    local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
    is cleared.

    Signed-off-by: David S. Miller

    David S. Miller
     

21 Feb, 2019

1 commit

  • This reverts commit e2ce3674883ecba2605370404208c9d4a07ae1c3.

    It turns out that the sock destructor xsk_destruct was needed after
    all. The cleanup simplification broke the skb transmit cleanup path,
    because the umem was prematurely destroyed.

    The umem cannot be destroyed until all outstanding skbs are freed,
    which means that we cannot remove the umem until the sk_destruct has
    been called.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

20 Feb, 2019

1 commit


13 Feb, 2019

1 commit

  • Commit c9b47cc1fabc ("xsk: fix bug when trying to use both copy and
    zero-copy on one queue id") stores the umem into the netdev._rx
    struct. However, the patch incorrectly removed the umem from the
    netdev._rx struct when user-space passed "best-effort" mode
    (i.e. select the fastest possible option available), and zero-copy
    mode was not available. This commit fixes that.

    Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

12 Feb, 2019

1 commit

  • Holding mmap_sem exclusively for a gup() is overkill. Let's
    share the lock and replace the gup call with gup_longterm(), as
    it is better suited for the lifetime of the pinning.

    Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
    Signed-off-by: Davidlohr Bueso
    Cc: David S. Miller
    Cc: Bjorn Topel
    Cc: Magnus Karlsson
    CC: netdev@vger.kernel.org
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Davidlohr Bueso
     

11 Feb, 2019

1 commit

  • All the setup code in AF_XDP is protected by a mutex with the
    exception of the mmap code that cannot use it. To make sure that a
    process banging on the mmap call at the same time as another process
    is setting up the socket, smp_wmb() calls were added in the umem
    registration code and the queue creation code, so that the published
    structures that xsk_mmap needs would be consistent. However, the
    corresponding smp_rmb() calls were not added to the xsk_mmap
    code. This patch adds these calls.

    Fixes: 37b076933a8e3 ("xsk: add missing write- and data-dependency barrier")
    Fixes: c0c77d8fb787c ("xsk: add user memory registration support sockopt")
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov

    Magnus Karlsson
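
The writer/reader pairing described above can be sketched in userspace with C11 atomics. This is an analogue, not the kernel code: the kernel pairs smp_wmb() with smp_rmb(), while this sketch uses a release store and an acquire load. `struct queue_info`, `publish_queue()` and `lookup_queue()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical structure that a setup path fills in and then publishes. */
struct queue_info {
    unsigned int nentries;
};

static _Atomic(struct queue_info *) published;

/* Setup side: initialise the structure fully, then publish the pointer. */
void publish_queue(struct queue_info *q, unsigned int nentries)
{
    q->nentries = nentries;
    /* Release: the field write above is visible before the pointer is. */
    atomic_store_explicit(&published, q, memory_order_release);
}

/* Lookup side (the mmap path in the commit): acquire pairs with the release. */
const struct queue_info *lookup_queue(void)
{
    return atomic_load_explicit(&published, memory_order_acquire);
}
```

A reader that gets a non-NULL pointer from `lookup_queue()` is then guaranteed to see the initialised `nentries`; a barrier on the writer side alone does not give that guarantee.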
     

29 Jan, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2019-01-29

    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) Teach verifier dead code removal, this also allows for optimizing /
    removing conditional branches around dead code and to shrink the
    resulting image. Code store constrained architectures like nfp would
    have a hard time doing this at JIT level, from Jakub.

    2) Add JMP32 instructions to BPF ISA in order to allow for optimizing
    code generation for 32-bit sub-registers. Evaluation shows that this
    can result in code reduction of ~5-20% compared to 64 bit-only code
    generation. Also add implementation for most JITs, from Jiong.

    3) Add support for __int128 types in BTF which is also needed for
    vmlinux's BTF conversion to work, from Yonghong.

    4) Add a new command to bpftool in order to dump a list of BPF-related
    parameters from the system or for a specific network device e.g. in
    terms of available prog/map types or helper functions, from Quentin.

    5) Add AF_XDP sock_diag interface for querying sockets from user
    space which provides information about the RX/TX/fill/completion
    rings, umem, memory usage etc, from Björn.

    6) Add skb context access for skb_shared_info->gso_segs field, from Eric.

    7) Add support for testing flow dissector BPF programs by extending
    existing BPF_PROG_TEST_RUN infrastructure, from Stanislav.

    8) Split BPF kselftest's test_verifier into various subgroups of tests
    in order to better deal with merge conflicts in this area, from Jakub.

    9) Add support for queue/stack manipulations in bpftool, from Stanislav.

    10) Document BTF, from Yonghong.

    11) Dump supported ELF section names in libbpf on program load
    failure, from Taeung.

    12) Silence a false positive compiler warning in verifier's BTF
    handling, from Peter.

    13) Fix help string in bpftool's feature probing, from Prashant.

    14) Remove duplicate includes in BPF kselftests, from Yue.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

25 Jan, 2019

3 commits

  • This patch adds the sock_diag interface for querying sockets from user
    space. Tools like iproute2 ss(8) can use this interface to list open
    AF_XDP sockets.

    The user-space ABI is defined in linux/xdp_diag.h and includes netlink
    request and response structs. The request can query sockets and the
    response contains socket information about the rings, umems, inode and
    more.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • This commit adds an id to the umem structure. The id uniquely
    identifies a umem instance, and will be exposed to user-space via the
    socket monitoring interface.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • Track each AF_XDP socket in a per-netns list. This will be used later
    by the sock_diag interface for querying sockets from userspace.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

22 Jan, 2019

1 commit


16 Jan, 2019

1 commit

  • In the xdp_umem_assign_dev() path, the xsk code does not
    check if a queue for which umem is to be created exists.
    It leads to a situation where umem is not assigned to any
    Tx/Rx queue of a netdevice, without notifying the stack
    about an error. This affects both XDP_SKB and XDP_DRV
    modes - in case of XDP_DRV_ZC, queue index is checked by
    the driver.

    This patch fixes the xsk code so that, in both XDP_SKB and
    XDP_DRV mode of AF_XDP, an error is returned when the requested
    queue index exceeds the existing maximum.

    Fixes: c9b47cc1fabca ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
    Reported-by: Jakub Spizewski
    Signed-off-by: Krzysztof Kazimierczak
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Krzysztof Kazimierczak
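
A minimal sketch of such a bounds check, assuming the "existing maximum" is the larger of the device's Rx and Tx queue counts. `struct dev_queues` and `check_queue_id()` are hypothetical stand-ins, and -EINVAL is an illustrative choice of error.

```c
#include <errno.h>

/* Hypothetical stand-in for the queue counts of a net device. */
struct dev_queues {
    unsigned int real_num_rx_queues;
    unsigned int real_num_tx_queues;
};

/* Return an error if queue_id does not name an existing Rx or Tx queue. */
int check_queue_id(const struct dev_queues *dev, unsigned int queue_id)
{
    unsigned int max = dev->real_num_rx_queues > dev->real_num_tx_queues ?
                       dev->real_num_rx_queues : dev->real_num_tx_queues;

    if (queue_id >= max)
        return -EINVAL;
    return 0;
}
```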
     

20 Dec, 2018

1 commit

  • Prior to this commit, when the struct socket object was being released,
    the UMEM did not have its reference count decreased. Instead, this was
    done in the struct sock sk_destruct function.

    There is no reason to keep the UMEM reference around when the socket
    is being orphaned, so in this patch xdp_put_umem is called in the
    xsk_release function. As a result, the xsk_destruct function can be
    removed.

    Note that a struct xdp_sock reference might still linger in the
    XSKMAP after the UMEM is released, e.g. if a user does not clear the
    XSKMAP prior to closing the process. This sock will be in a
    "released" zombie-like state until the XSKMAP is removed.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

20 Oct, 2018

1 commit

  • net/sched/cls_api.c has overlapping changes to a call to
    nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
    to the 5th argument, and another (from 'net-next') added cb->extack
    instead of NULL to the 6th argument.

    net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
    code which moved (to mr_table_dump) in 'net-next'. Thanks to David
    Ahern for the heads up.

    Signed-off-by: David S. Miller

    David S. Miller
     

11 Oct, 2018

1 commit

  • The XSKMAP update and delete functions called synchronize_net(), which
    can sleep. It is not allowed to sleep during an RCU read section.

    Instead we need to make sure that the sock sk_destruct (xsk_destruct)
    function is asynchronously called after an RCU grace period. Setting
    the SOCK_RCU_FREE flag for XDP sockets takes care of this.

    Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
    Reported-by: Eric Dumazet
    Signed-off-by: Björn Töpel
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

08 Oct, 2018

1 commit

  • The AF_XDP socket struct can exist in three different, implicit
    states: setup, bound and released. Setup is before the socket has been
    bound to a device. Bound is when the socket is active for receive and
    send. Released is when the process/userspace side of the socket is
    released, but the sock object is still lingering, e.g. when there is a
    reference to the socket in an XSKMAP after process termination.

    The Rx fast-path code uses the "dev" member of struct xdp_sock to
    check whether a socket is bound or released, and the Tx code uses the
    struct xdp_umem "xsk_list" member in conjunction with "dev" to
    determine the state of a socket.

    However, the transition from bound to released did not tear the socket
    down in correct order.

    On the Rx side "dev" was cleared after synchronize_net() making the
    synchronization useless. On the Tx side, the internal queues were
    destroyed prior to removing them from the "xsk_list".

    This commit corrects the cleanup order, and by doing so
    xdp_del_sk_umem() can be simplified and one synchronize_net() can be
    removed.

    Fixes: 965a99098443 ("xsk: add support for bind for Rx")
    Fixes: ac98d8aab61b ("xsk: wire upp Tx zero-copy functions")
    Reported-by: Jesper Dangaard Brouer
    Signed-off-by: Björn Töpel
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

05 Oct, 2018

3 commits

  • As we now do not allow ethtool to deactivate the queue id we are
    running an AF_XDP socket on, we can simplify the implementation of
    xdp_clear_umem_at_qid().

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann

    Magnus Karlsson
     
  • We already check the RSS indirection table does not use queues which
    would be disabled by channel reconfiguration. Make sure user does not
    try to disable queues which have a UMEM and zero-copy AF_XDP socket
    installed.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Quentin Monnet
    Signed-off-by: Daniel Borkmann

    Jakub Kicinski
     
  • Previously, the xsk code did not record which umem was bound to a
    specific queue id. This was not required if all drivers were zero-copy
    enabled as this had to be recorded in the driver anyway. So if a user
    tried to bind two umems to the same queue, the driver would say
    no. But if copy-mode was first enabled and then zero-copy mode (or the
    reverse order), we mistakenly enabled both of them on the same umem
    leading to buggy behavior. The main culprit for this is that we did
    not store the association of umem to queue id in the copy case and
    only relied on the driver reporting this. As this relation was not
    stored in the driver for copy mode (it does not rely on the AF_XDP
    NDOs), this obviously could not work.

    This patch fixes the problem by always recording the umem to queue id
    relationship in the netdev_queue and netdev_rx_queue structs. This way
    we always know what kind of umem has been bound to a queue id and can
    act appropriately at bind time.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann

    Magnus Karlsson
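
The bookkeeping described above can be sketched as a per-queue table that refuses a second registration. This is a simplified illustration: the kernel stores the association in the netdev_queue and netdev_rx_queue structs, while this sketch uses one flat array. `NR_QUEUES`, `struct umem`, `reg_umem_at_qid()` and `clear_umem_at_qid()` are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>

#define NR_QUEUES 4u

struct umem {
    int id;
};

/* One slot per queue id, filled in regardless of copy or zero-copy mode. */
static struct umem *umem_at_qid[NR_QUEUES];

/* Bind a umem to a queue id; fail if the queue already has one. */
int reg_umem_at_qid(struct umem *umem, unsigned int qid)
{
    if (qid >= NR_QUEUES)
        return -EINVAL;
    if (umem_at_qid[qid])
        return -EBUSY;
    umem_at_qid[qid] = umem;
    return 0;
}

/* Drop the association again, for example on socket release. */
void clear_umem_at_qid(unsigned int qid)
{
    if (qid < NR_QUEUES)
        umem_at_qid[qid] = NULL;
}
```

Because the slot is filled in both modes, a second bind to the same queue id is rejected no matter which mode was enabled first.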
     

26 Sep, 2018

1 commit

  • XSK UMEM is strongly single producer single consumer so reuse of
    frames is challenging. Add a simple "stash" of FILL packets to
    reuse for drivers to optionally make use of. This is useful
    when driver has to free (ndo_stop) or resize a ring with an active
    AF_XDP ZC socket.

    Signed-off-by: Jakub Kicinski
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

    Jakub Kicinski
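
One way to picture such a stash is a small fixed-size stack of buffer handles that a driver fills when it has to give up unused FILL entries and drains before going back to the ring. This is a sketch under assumptions: `struct reuse_stash`, `stash_put()` and `stash_get()` are hypothetical names, not the interface the patch adds.

```c
#include <stdbool.h>
#include <stdint.h>

#define STASH_CAP 8u

/* Fixed-size LIFO of buffer handles taken from the fill ring but not used. */
struct reuse_stash {
    uint32_t length;
    uint64_t handles[STASH_CAP];
};

/* Park a handle for later reuse; fails if the stash is full. */
bool stash_put(struct reuse_stash *s, uint64_t handle)
{
    if (s->length == STASH_CAP)
        return false;
    s->handles[s->length++] = handle;
    return true;
}

/* Take back the most recently parked handle; fails if the stash is empty. */
bool stash_get(struct reuse_stash *s, uint64_t *handle)
{
    if (s->length == 0)
        return false;
    *handle = s->handles[--s->length];
    return true;
}
```

An allocation path would try `stash_get()` first and fall back to the fill ring only when the stash is empty.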
     

01 Sep, 2018

1 commit

  • This commit gets rid of the structure xdp_umem_props. It was there to
    be able to break a dependency at one point, but this is no longer
    needed. The values in the struct are instead stored directly in the
    xdp_umem structure. This simplifies the xsk code as well as af_xdp
    zero-copy drivers and as a bonus gets rid of one internal header file.

    The i40e driver is also adapted to the new interface in this commit.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann

    Magnus Karlsson