12 Jan, 2020

1 commit

  • [ Upstream commit 06870682087b58398671e8cdc896cd62314c4399 ]

    The XSK wakeup callback in drivers makes some sanity checks before
    triggering NAPI. However, some configuration changes may occur during
    this function that affect the result of those checks. For example, the
    interface can go down, and all the resources will be destroyed after the
    checks in the wakeup function, but before it attempts to use these
    resources. Wrap this callback in rcu_read_lock to allow drivers to
    call synchronize_rcu before actually destroying the resources.

    xsk_wakeup is a new function that encapsulates calling ndo_xsk_wakeup
    wrapped into the RCU lock. After this commit, xsk_poll starts using
    xsk_wakeup and checks xs->zc instead of ndo_xsk_wakeup != NULL to decide
    whether ndo_xsk_wakeup should be called. It also fixes a bug introduced with the
    need_wakeup feature: a non-zero-copy socket may be used with a driver
    supporting zero-copy, and in this case ndo_xsk_wakeup should not be
    called, so the xs->zc check is the correct one.

    Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191217162023.16011-2-maximmi@mellanox.com
    Signed-off-by: Sasha Levin

    Maxim Mikityanskiy
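
    The wrapper described above can be modelled in userspace C. This is a
    minimal sketch, not the kernel implementation: rcu_read_lock() and
    rcu_read_unlock() are stubbed to no-ops (in the kernel they let the
    driver use synchronize_rcu() before teardown), and fake_wakeup,
    xsk_poll_wakeup and the struct layouts are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

static int wakeups;  /* counts driver wakeup invocations */

struct netdev {
    int (*ndo_xsk_wakeup)(struct netdev *dev, int queue_id, int flags);
};

struct xdp_sock {
    struct netdev *dev;
    int queue_id;
    int zc;  /* zero-copy mode enabled? */
};

/* Stubs: in the kernel these delimit an RCU read-side critical section. */
static void rcu_read_lock(void) {}
static void rcu_read_unlock(void) {}

static int fake_wakeup(struct netdev *dev, int queue_id, int flags)
{
    (void)dev; (void)queue_id; (void)flags;
    wakeups++;
    return 0;
}

/* Call the driver's wakeup ndo under the RCU read lock, so the driver
 * can synchronize_rcu() before destroying the resources it checks. */
static int xsk_wakeup(struct xdp_sock *xs, int flags)
{
    int ret;

    rcu_read_lock();
    ret = xs->dev->ndo_xsk_wakeup(xs->dev, xs->queue_id, flags);
    rcu_read_unlock();
    return ret;
}

/* The poll path wakes the driver only for zero-copy sockets (xs->zc),
 * not merely when ndo_xsk_wakeup is non-NULL. */
static void xsk_poll_wakeup(struct xdp_sock *xs)
{
    if (xs->zc)
        xsk_wakeup(xs, 0);
}
```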
     

24 Oct, 2019

1 commit

  • Having Rx-only AF_XDP sockets can potentially lead to a crash in the
    system by a NULL pointer dereference in xsk_umem_consume_tx(). This
    function iterates through a list of all sockets tied to a umem and
    checks if there are any packets to send on the Tx ring. Rx-only
    sockets do not have a Tx ring, so this will cause a NULL pointer
    dereference. This will happen if you have registered one or more
    Rx-only sockets to a umem and the driver is checking the Tx ring even
    on Rx, or if the XDP_SHARED_UMEM mode is used and there is a mix of
    Rx-only and other sockets tied to the same umem.

    Fixed by only putting sockets with a Tx component on the list that
    xsk_umem_consume_tx() iterates over.

    Fixes: ac98d8aab61b ("xsk: wire upp Tx zero-copy functions")
    Reported-by: Kal Cutter Conley
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jonathan Lemon
    Link: https://lore.kernel.org/bpf/1571645818-16244-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
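
    The fix above can be sketched as a userspace model: only sockets with
    a Tx ring are put on the list that the Tx consumer walks, so an
    Rx-only socket can never be dereferenced there. All names here
    (xdp_add_sk_umem, xsk_umem_consume_tx_safe, the struct layouts) are
    illustrative, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SOCKS 8

struct xdp_sock {
    void *tx;   /* NULL for an Rx-only socket */
    void *rx;
};

struct xdp_umem {
    struct xdp_sock *xsk_list[MAX_SOCKS];
    int n;
};

static void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
{
    if (!xs->tx)          /* the fix: skip sockets without a Tx component */
        return;
    umem->xsk_list[umem->n++] = xs;
}

/* Every socket on the list is now guaranteed to have a Tx ring. */
static int xsk_umem_consume_tx_safe(const struct xdp_umem *umem)
{
    for (int i = 0; i < umem->n; i++)
        if (!umem->xsk_list[i]->tx)
            return 0;     /* this would have been the NULL dereference */
    return 1;
}
```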
     

03 Oct, 2019

1 commit

  • Fixes a crash in poll() when an AF_XDP socket is opened in copy mode
    and the bound device does not have ndo_xsk_wakeup defined. Avoid
    trying to call the non-existing ndo and instead call the internal xsk
    sendmsg function to send packets in the same way (from the
    application's point of view) as calling sendmsg() in any mode or
    poll() in zero-copy mode would have done. The application should
    behave in the same way regardless of whether zero-copy mode or copy
    mode is used.

    Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
    Reported-by: syzbot+a5765ed8cdb1cca4d249@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1569997919-11541-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

29 Sep, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Sanity check URB networking device parameters to avoid divide by
    zero, from Oliver Neukum.

    2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6
    don't work properly. Longer term this needs a better fix tho. From
    Vijay Khemka.

    3) Small fixes to selftests (use ping when ping6 is not present, etc.)
    from David Ahern.

    4) Bring back rt_uses_gateway member of struct rtable, its semantics
    were not well understood and trying to remove it broke things. From
    David Ahern.

    5) Move usbnet sanity checking, ignore endpoints with invalid
    wMaxPacketSize. From Bjørn Mork.

    6) Missing Kconfig deps for sja1105 driver, from Mao Wenan.

    7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel,
    Alex Vesker, and Yevgeny Kliteynik

    8) Missing CAP_NET_RAW checks in various places, from Ori Nimron.

    9) Fix crash when removing sch_cbs entry while offloading is enabled,
    from Vinicius Costa Gomes.

    10) Signedness bug fixes, generally in looking at the result given by
    of_get_phy_mode() and friends. From Dan Carpenter.

    11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet.

    12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern.

    13) Fix quantization code in tcp_bbr, from Kevin Yang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits)
    net: tap: clean up an indentation issue
    nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
    tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
    sk_buff: drop all skb extensions on free and skb scrubbing
    tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
    mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
    Documentation: Clarify trap's description
    mlxsw: spectrum: Clear VLAN filters during port initialization
    net: ena: clean up indentation issue
    NFC: st95hf: clean up indentation issue
    net: phy: micrel: add Asym Pause workaround for KSZ9021
    net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
    ptp: correctly disable flags on old ioctls
    lib: dimlib: fix help text typos
    net: dsa: microchip: Always set regmap stride to 1
    nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
    nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
    net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
    vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
    net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
    ...

    Linus Torvalds
     

25 Sep, 2019

2 commits

  • For pages that were retained via get_user_pages*(), release those pages
    via the new put_user_page*() routines, instead of via put_page() or
    release_pages().

    This is part of a tree-wide conversion, as described in fc1d8e7cca2d ("mm:
    introduce put_user_page*(), placeholder versions").

    Link: http://lkml.kernel.org/r/20190724044537.10458-4-jhubbard@nvidia.com
    Signed-off-by: John Hubbard
    Acked-by: Björn Töpel
    Cc: Björn Töpel
    Cc: Magnus Karlsson
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Patch series "Make working with compound pages easier", v2.

    These three patches add three helpers and convert the appropriate
    places to use them.

    This patch (of 3):

    It's unnecessarily hard to find out the size of a potentially huge page.
    Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).

    Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
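
    The replaced expression is simple enough to show in a userspace model.
    In the kernel, compound_order(page) comes from struct page metadata;
    here it is a plain field so the arithmetic can be demonstrated. The
    struct layout is illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

struct page {
    unsigned int compound_order;  /* 0 for a normal page */
};

/* Replaces open-coded 'PAGE_SIZE << compound_order(page)'. */
static unsigned long page_size(const struct page *page)
{
    return PAGE_SIZE << page->compound_order;
}
```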
     

19 Sep, 2019

1 commit

  • This patch removes the 64B alignment of the UMEM headroom. There is
    really no reason for it, and having a headroom less than 64B should be
    valid.

    Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

06 Sep, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) Add the ability to use unaligned chunks in the AF_XDP umem. By
    relaxing where the chunks can be placed, it allows using an
    arbitrary buffer size and placing a chunk wherever there is a free
    address in the umem. This helps more seamless DPDK AF_XDP driver
    integration. Support for i40e, ixgbe and mlx5e, from Kevin and
    Maxim.

    2) Addition of a wakeup flag for AF_XDP tx and fill rings so the
    application can wake up the kernel for rx/tx processing which
    avoids busy-spinning of the latter, useful when app and driver
    are located on the same core. Support for i40e, ixgbe and mlx5e,
    from Magnus and Maxim.

    3) bpftool fixes for printf()-like functions so compiler can actually
    enforce checks, bpftool build system improvements for custom output
    directories, and addition of 'bpftool map freeze' command, from Quentin.

    4) Support attaching/detaching XDP programs from 'bpftool net' command,
    from Daniel.

    5) Automatic xskmap cleanup when AF_XDP socket is released, and several
    barrier/{read,write}_once fixes in AF_XDP code, from Björn.

    6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf
    inclusion as well as libbpf versioning improvements, from Andrii.

    7) Several new BPF kselftests for verifier precision tracking, from Alexei.

    8) Several BPF kselftest fixes wrt endianness to run on s390x, from Ilya.

    9) And more BPF kselftest improvements all over the place, from Stanislav.

    10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub.

    11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan.

    12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni.

    13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin.

    14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar.

    15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari,
    Peter, Wei, Yue.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Sep, 2019

4 commits

  • When accessing the members of an XDP socket, the control mutex should
    be held. This commit fixes that.

    Acked-by: Jonathan Lemon
    Fixes: a36b38aa2af6 ("xsk: add sock_diag interface for AF_XDP")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
    Before the state variable was introduced by Ilya, the dev member was
    used to determine whether the socket was bound or not. However, when
    dev was read, proper SMP barriers and READ_ONCE were missing. In order
    to address the missing barriers and READ_ONCE, we start using the
    state variable as a point of synchronization. The state member
    read/write is paired with proper SMP barriers, and from this it
    follows that the members described above do not need READ_ONCE if
    used in conjunction with the state check.

    In all syscalls and the xsk_rcv path we check if state is
    XSK_BOUND. If that is the case we do a SMP read barrier, and this
    implies that the dev, umem and all rings are correctly set up. Note
    that no READ_ONCE is needed for these variables if used when state is
    XSK_BOUND (plus the read barrier).

    To summarize: The struct xdp_sock members dev, queue_id, umem,
    fq, cq, tx, rx, and state were read locklessly, with incorrect barriers
    and missing {READ,WRITE}_ONCE. Now, umem, fq, cq, tx, rx, and state
    are read locklessly. When these members are updated, WRITE_ONCE is
    used. When read, READ_ONCE is only used when reading outside the
    control mutex (e.g. mmap) or when not synchronized with the state
    member (XSK_BOUND plus smp_rmb()).

    Note that dev and queue_id do not need WRITE_ONCE or READ_ONCE, due
    to the introduced state synchronization (XSK_BOUND plus smp_rmb()).

    Introducing the state check also fixes a race, found by syzkaller, in
    xsk_poll() where umem could be accessed when stale.

    Suggested-by: Hillf Danton
    Reported-by: syzbot+c82697e3043781e08802@syzkaller.appspotmail.com
    Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
    Signed-off-by: Björn Töpel
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Björn Töpel
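
    The publish/observe pattern described above can be sketched in
    userspace. The barrier and *_ONCE primitives are stubbed to no-ops
    and plain accesses here (comments note their real kernel semantics),
    so only the control flow is modelled; xsk_bind and xsk_is_usable are
    illustrative names, not the kernel functions.

```c
#include <assert.h>
#include <stddef.h>

enum xsk_state { XSK_READY = 0, XSK_BOUND = 1 };

/* Stand-ins for kernel primitives. */
#define WRITE_ONCE(x, v) ((x) = (v))   /* kernel: single non-torn store */
#define READ_ONCE(x)     (x)           /* kernel: single non-torn load */
static void smp_wmb(void) {}           /* kernel: order prior stores */
static void smp_rmb(void) {}           /* kernel: order later loads */

struct xdp_sock {
    void *dev;
    void *umem;
    int state;
};

static void xsk_bind(struct xdp_sock *xs, void *dev, void *umem)
{
    xs->dev = dev;
    xs->umem = umem;
    smp_wmb();                        /* publish members before the flip */
    WRITE_ONCE(xs->state, XSK_BOUND);
}

/* Pattern used in the syscalls and the rcv path: only touch the members
 * after observing XSK_BOUND plus a read barrier. */
static int xsk_is_usable(struct xdp_sock *xs)
{
    if (READ_ONCE(xs->state) != XSK_BOUND)
        return 0;
    smp_rmb();
    return xs->dev != NULL && xs->umem != NULL;
}
```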
     
  • The umem member of struct xdp_sock is read outside of the control
    mutex, in the mmap implementation, and needs a WRITE_ONCE to avoid
    potential store-tearing.

    Acked-by: Jonathan Lemon
    Fixes: 423f38329d26 ("xsk: add umem fill queue support and mmap")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • Use WRITE_ONCE when doing the store of tx, rx, fq, and cq, to avoid
    potential store-tearing. These members are read outside of the control
    mutex in the mmap implementation.

    Acked-by: Jonathan Lemon
    Fixes: 37b076933a8e ("xsk: add missing write- and data-dependency barrier")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

31 Aug, 2019

1 commit

    Currently, addresses are chunk-size aligned. This means we are very
    restricted in terms of where we can place a chunk within the umem. For
    example, if we have a chunk size of 2k, then our chunks can only be placed
    at 0, 2k, 4k, 6k, 8k... and so on (i.e. every 2k starting from 0).

    This patch introduces the ability to use unaligned chunks. With these
    changes, we are no longer bound to having to place chunks at a 2k (or
    whatever your chunk size is) interval. Since we are no longer dealing with
    aligned chunks, they can now cross page boundaries. Checks for page
    contiguity have been added in order to keep track of which pages are
    followed by a physically contiguous page.

    Signed-off-by: Kevin Laatz
    Signed-off-by: Ciara Loftus
    Signed-off-by: Bruce Richardson
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Kevin Laatz
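
    The placement rules can be sketched as two small checks. This is an
    illustration of the idea, not the kernel's validation code: the
    function names and the per-page contiguity array are assumptions made
    for the example.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096ULL

/* Aligned mode: a chunk must start at a multiple of the chunk size. */
static int aligned_chunk_ok(unsigned long long addr,
                            unsigned long long chunk_size)
{
    return (addr % chunk_size) == 0;
}

/* Unaligned mode: a chunk may start anywhere, but if it crosses a page
 * boundary, each crossed page must be flagged as followed by a
 * physically contiguous page. */
static int unaligned_chunk_ok(unsigned long long addr,
                              unsigned long long len,
                              const unsigned char *page_contig)
{
    unsigned long long first = addr / PAGE_SIZE;
    unsigned long long last = (addr + len - 1) / PAGE_SIZE;

    for (unsigned long long p = first; p < last; p++)
        if (!page_contig[p])   /* page p not physically contiguous with p+1 */
            return 0;
    return 1;
}
```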
     

28 Aug, 2019

1 commit


21 Aug, 2019

1 commit

    For 64-bit there is no reason to use vmap/vunmap, so use page_address
    as it was initially. For 32-bit, in some applications, such as the
    xdpsock_user.c sample, when the number of pages in use is quite big,
    the kmap memory can be insufficient; besides, kmap is discouraged in
    such cases as it can block and should rather be used for dynamic
    mappings.

    Signed-off-by: Ivan Khoronzhuk
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Ivan Khoronzhuk
     

20 Aug, 2019

1 commit


18 Aug, 2019

3 commits

  • When an AF_XDP socket is released/closed the XSKMAP still holds a
    reference to the socket in a "released" state. The socket will still
    use the netdev queue resource, and block newly created sockets from
    attaching to that queue, but no user application can access the
    fill/complete/rx/tx queues. This means that all applications need
    to explicitly clear the map entry of the old "zombie state"
    socket. This should be done automatically.

    In this patch, the socket tracks, and holds a reference to, the maps
    it resides in. When the socket is released, it removes itself from
    all maps.

    Suggested-by: Bruce Richardson
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • This commit adds support for a new flag called need_wakeup in the
    AF_XDP Tx and fill rings. When this flag is set, it means that the
    application has to explicitly wake up the kernel Rx (for the bit in
    the fill ring) or kernel Tx (for bit in the Tx ring) processing by
    issuing a syscall. Poll() can wake up both depending on the flags
    submitted and sendto() will wake up tx processing only.

    The main reason for introducing this new flag is to be able to
    efficiently support the case when application and driver are executing
    on the same core. Previously, the driver was just busy-spinning on the
    fill ring if it ran out of buffers in the HW and there were none on
    the fill ring. This approach works when the application is running on
    another core as it can replenish the fill ring while the driver is
    busy-spinning. However, this is a lousy approach if both of them are
    running on the same core as the probability of the fill ring getting
    more entries when the driver is busy-spinning is zero. With this new
    feature the driver now sets the need_wakeup flag and returns to the
    application. The application can then replenish the fill queue and
    explicitly wake up the Rx processing in the kernel using the poll()
    syscall. For Tx, the flag is only set to one if the driver has
    no outstanding Tx completion interrupts. If it has some, the flag is
    zero as it will be woken up by a completion interrupt anyway.

    As a nice side effect, this new flag also improves the performance of
    the case where application and driver are running on two different
    cores as it reduces the number of syscalls to the kernel. The kernel
    tells user space if it needs to be woken up by a syscall, and this
    eliminates many of the syscalls.

    This flag needs some simple driver support. If the driver does not
    support this, the Rx flag is always zero and the Tx flag is always
    one. This makes any application relying on this feature default to the
    old behaviour of not requiring any syscalls in the Rx path and always
    having to call sendto() in the Tx path.

    For backwards compatibility reasons, this feature has to be explicitly
    turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend
    that you always turn it on, as it has so far always had a positive
    performance impact.

    The name and inspiration of the flag has been taken from io_uring by
    Jens Axboe. Details about this feature in io_uring can be found in
    http://kernel.dk/io_uring.pdf, section 8.3.

    Signed-off-by: Magnus Karlsson
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Magnus Karlsson
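
    From the application's point of view, the protocol boils down to: only
    issue the wakeup syscall when the ring's flag asks for it. A minimal
    userspace sketch, assuming XDP_RING_NEED_WAKEUP matches the uapi flag
    name; the ring struct, kick_tx helper and the syscall counter are
    illustrative stand-ins for a real sendto()/poll() call.

```c
#include <assert.h>

#define XDP_RING_NEED_WAKEUP (1U << 0)

struct ring {
    unsigned int flags;   /* set by the kernel/driver */
};

static int tx_syscalls;   /* how many times we actually entered the kernel */

static void kick_tx(struct ring *tx)
{
    /* Only issue the syscall when the driver asked to be woken; with the
     * flag clear, a completion interrupt will drive Tx anyway. */
    if (tx->flags & XDP_RING_NEED_WAKEUP) {
        tx_syscalls++;    /* sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0); */
    }
}
```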
     
  • This commit replaces ndo_xsk_async_xmit with ndo_xsk_wakeup. This new
    ndo provides the same functionality as before but with the addition of
    a new flags field that is used to specify if Rx, Tx or both should be
    woken up. The previous ndo only woke up Tx, as implied by the
    name. The i40e and ixgbe drivers (currently the only supported ones)
    are updated with this new interface.

    This new ndo will be used by the new need_wakeup functionality of XDP
    sockets that need to be able to wake up both Rx and Tx driver
    processing.

    Signed-off-by: Magnus Karlsson
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Magnus Karlsson
     

10 Aug, 2019

1 commit


12 Jul, 2019

2 commits

  • There are 2 call chains:

    a) xsk_bind --> xdp_umem_assign_dev
    b) unregister_netdevice_queue --> xsk_notifier

    with the following locking order:

    a) xs->mutex --> rtnl_lock
    b) rtnl_lock --> xdp.lock --> xs->mutex

    Taking 'xs->mutex' and 'rtnl_lock' in different orders could produce a
    deadlock here. Fix that by moving the 'rtnl_lock' before 'xs->mutex' in
    the bind call chain (a).

    Reported-by: syzbot+bf64ec93de836d7f4c2c@syzkaller.appspotmail.com
    Fixes: 455302d1c9ae ("xdp: fix hang while unregistering device bound to xdp socket")
    Signed-off-by: Ilya Maximets
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
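
    The discipline behind the fix (both chains must take rtnl_lock before
    xs->mutex) can be illustrated with a toy lock-order checker. This is
    purely an illustration of lock ranking, not kernel code; the names and
    the ranking scheme are assumptions made for the example.

```c
#include <assert.h>

/* Rank locks in the order they must be acquired: rtnl first. */
enum lock_id { LOCK_RTNL = 0, LOCK_XS_MUTEX = 1, NLOCKS };

static int held[NLOCKS];

/* A lock may only be taken while no higher-ranked lock is held;
 * violating this is exactly the condition that lets two tasks deadlock. */
static int take(enum lock_id id)
{
    for (int i = id + 1; i < NLOCKS; i++)
        if (held[i])
            return -1;  /* lock-order violation */
    held[id] = 1;
    return 0;
}

static void drop(enum lock_id id)
{
    held[id] = 0;
}
```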
     
    A completion queue address reservation could not be undone.
    In case of a bad 'queue_id' or an skb allocation failure, the reserved
    entry would be leaked, reducing the total capacity of the completion
    queue.

    Fix that by moving the reservation to the point where failure is not
    possible. Additionally, the 'queue_id' check is moved out of the loop
    since there is no point in checking it there.

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Ilya Maximets
    Acked-by: Björn Töpel
    Tested-by: William Tu
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
     

09 Jul, 2019

2 commits

  • Two cases of overlapping changes, nothing fancy.

    Signed-off-by: David S. Miller

    David S. Miller
     
    Unlike driver mode, generic xdp receive could be triggered
    by different threads on different CPU cores at the same time,
    leading to fill and rx queue breakage. For example, this
    could happen while sending packets from two processes to the
    first interface of a veth pair while the second end of it is
    open with an AF_XDP socket.

    Take a lock for each generic receive to avoid the race.

    Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
    Signed-off-by: Ilya Maximets
    Acked-by: Magnus Karlsson
    Tested-by: William Tu
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
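
    The race and its fix can be modelled in userspace: two concurrent
    "receivers" update a shared producer counter under a lock. The kernel
    fix uses a spinlock; a pthread mutex plays that role here, and all
    names are illustrative.

```c
#include <assert.h>
#include <pthread.h>

#define ITERS 100000

static pthread_mutex_t rx_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long rx_produced;

static void *generic_rcv_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&rx_lock);   /* serialize fill/rx queue updates */
        rx_produced++;
        pthread_mutex_unlock(&rx_lock);
    }
    return NULL;
}

/* Run two concurrent "receivers" and return the final producer count;
 * without the lock, lost updates would make the count come up short. */
static unsigned long run_demo(void)
{
    pthread_t a, b;

    rx_produced = 0;
    pthread_create(&a, NULL, generic_rcv_worker, NULL);
    pthread_create(&b, NULL, generic_rcv_worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return rx_produced;
}
```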
     

03 Jul, 2019

2 commits

    A device bound to an XDP socket will not have a zero refcount until
    the userspace application closes it. This leads to a hang inside
    'netdev_wait_allrefs()' if device unregistering is requested:

    # ip link del p1
    < hang on recvmsg on netlink socket >

    # ps -x | grep ip
    5126 pts/0 D+ 0:00 ip link del p1

    # journalctl -b

    Jun 05 07:19:16 kernel:
    unregister_netdevice: waiting for p1 to become free. Usage count = 1

    Jun 05 07:19:27 kernel:
    unregister_netdevice: waiting for p1 to become free. Usage count = 1
    ...

    Fix that by implementing a NETDEV_UNREGISTER event notification handler
    to properly clean up all the resources and unref the device.

    This should also allow socket killing via ss(8) utility.

    Fixes: 965a99098443 ("xsk: add support for bind for Rx")
    Signed-off-by: Ilya Maximets
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
     
    The device pointer is stored in the umem regardless of zero-copy mode,
    so we need to hold the device in all cases.

    Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
    Signed-off-by: Ilya Maximets
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Ilya Maximets
     

28 Jun, 2019

3 commits

    Some drivers want to access the data transmitted in order to implement
    acceleration features of the NICs. It is also useful in the AF_XDP TX
    flow.

    Change the xsk_umem_consume_tx API to return the whole xdp_desc, which
    contains the data pointer, length and DMA address, instead of only the
    latter two. Adapt the implementation of i40e and ixgbe to this change.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Tariq Toukan
    Acked-by: Saeed Mahameed
    Cc: Björn Töpel
    Cc: Magnus Karlsson
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Maxim Mikityanskiy
     
  • Make it possible for the application to determine whether the AF_XDP
    socket is running in zero-copy mode. To achieve this, add a new
    getsockopt option XDP_OPTIONS that returns flags. The only flag
    supported for now is the zero-copy mode indicator.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Tariq Toukan
    Acked-by: Saeed Mahameed
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Maxim Mikityanskiy
     
    Add a function that checks whether the fill ring has the specified
    number of descriptors available. It will be useful for mlx5e, which
    wants to check in advance whether it can allocate a bulk of RX
    descriptors, to get the best performance.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Tariq Toukan
    Acked-by: Saeed Mahameed
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Maxim Mikityanskiy
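
    With the free-running unsigned producer/consumer indices AF_XDP rings
    use, such an availability check reduces to a single subtraction, which
    stays correct across index wraparound. A minimal sketch; the function
    name is illustrative, not the kernel helper's.

```c
#include <assert.h>

/* Number of filled entries is prod - cons; unsigned arithmetic makes
 * this hold even after the 32-bit indices wrap around. */
static int ring_has_addrs(unsigned int prod, unsigned int cons,
                          unsigned int cnt)
{
    return prod - cons >= cnt;
}
```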
     

26 Jun, 2019

1 commit

  • Clang warns:

    In file included from net/xdp/xsk_queue.c:10:
    net/xdp/xsk_queue.h:292:2: warning: expression result unused
    [-Wunused-value]
    WRITE_ONCE(q->ring->producer, q->prod_tail);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    include/linux/compiler.h:284:6: note: expanded from macro 'WRITE_ONCE'
    __u.__val; \
    ~~~ ^~~~~
    1 warning generated.

    The q->prod_tail assignment has a comma at the end, not a semi-colon.
    Fix that so clang no longer warns and everything works as expected.

    Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
    Link: https://github.com/ClangBuiltLinux/linux/issues/544
    Signed-off-by: Nathan Chancellor
    Acked-by: Nick Desaulniers
    Acked-by: Jonathan Lemon
    Acked-by: Björn Töpel
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Nathan Chancellor
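
    The warning came from a stray comma fusing two statements into a
    single comma expression, which left WRITE_ONCE's value unused. The
    small demo below shows the comma operator's semantics (both operands
    are evaluated, the result is the right-hand one); comma_demo is an
    illustrative helper, not the kernel code.

```c
#include <assert.h>

static int comma_demo(int *a, int *b)
{
    int r;

    /* shaped like 'q->prod_tail = q->prod_head, WRITE_ONCE(...);' --
     * both side effects happen, r takes the right operand's value */
    r = (*a = 5, *b = 7);
    return r;
}
```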
     

12 Jun, 2019

1 commit


21 May, 2019

1 commit


15 May, 2019

1 commit

    Patch series "Add FOLL_LONGTERM to GUP fast and use it".

    HFI1, qib, and mthca use get_user_pages_fast() due to its performance
    advantages. These pages can be held for a significant time. But
    get_user_pages_fast() does not protect against mapping FS DAX pages.

    Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
    retains the performance while also adding the FS DAX checks. XDP has also
    shown interest in using this functionality.[1]

    In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
    and remove the specialized get_user_pages_longterm call.

    [1] https://lkml.org/lkml/2019/3/19/939

    "longterm" is a relative thing and at this point is probably a misnomer.
    This is really flagging a pin which is going to be given to hardware and
    can't move. I've thought of a couple of alternative names but I think we
    have to settle on if we are going to use FL_LAYOUT or something else to
    solve the "longterm" problem. Then I think we can change the flag to a
    better name.

    Secondly, it depends on how often you are registering memory. I have
    spoken with some RDMA users who consider MR in the performance path...
    For the overall application performance. I don't have the numbers as the
    tests for HFI1 were done a long time ago. But there was a significant
    advantage. Some of which is probably due to the fact that you don't have
    to hold mmap_sem.

    Finally, architecturally I think it would be good for everyone to use
    *_fast. There are patches submitted to the RDMA list which would allow
    the use of *_fast (they reworking the use of mmap_sem) and as soon as they
    are accepted I'll submit a patch to convert the RDMA core as well. Also
    to this point others are looking to use *_fast.

    As an aside, Jasons pointed out in my previous submission that *_fast and
    *_unlocked look very much the same. I agree and I think further cleanup
    will be coming. But I'm focused on getting the final solution for DAX at
    the moment.

    This patch (of 7):

    This patch starts a series which aims to support FOLL_LONGTERM in
    get_user_pages_fast(). Some callers who would like to do a longterm (user
    controlled pin) of pages with the fast variant of GUP for performance
    purposes.

    Rather than have a separate get_user_pages_longterm() call, introduce
    FOLL_LONGTERM and change the longterm callers to use it.

    This patch does not change any functionality. In the short term
    "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
    in particular has been blocked. However, callers of get_user_pages_fast()
    were not "protected".

    FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
    requires vmas to determine if DAX is in use.

    NOTE: In merging with the CMA changes we opt to change the
    get_user_pages() call in check_and_migrate_cma_pages() to a call of
    __get_user_pages_locked() on the newly migrated pages. This makes the
    code read better in that we are calling __get_user_pages_locked() on the
    pages before and after a potential migration.

    As a side effect some of the interfaces are cleaned up, but this is
    not the primary purpose of the series.

    In review[1] it was asked:

    > This I don't get - if you do lock down long term mappings performance
    > of the actual get_user_pages call shouldn't matter to start with.
    >
    > What do I miss?

    A couple of points.

    First "longterm" is a relative thing and at this point is probably a
    misnomer. This is really flagging a pin which is going to be given to
    hardware and can't move. I've thought of a couple of alternative names
    but I think we have to settle on if we are going to use FL_LAYOUT or
    something else to solve the "longterm" problem. Then I think we can
    change the flag to a better name.

    Second, it depends on how often you are registering memory. I have spoken
    with some RDMA users who consider MR in the performance path... For the
    overall application performance. I don't have the numbers as the tests
    for HFI1 were done a long time ago. But there was a significant
    advantage. Some of which is probably due to the fact that you don't have
    to hold mmap_sem.

    Finally, architecturally I think it would be good for everyone to use
    *_fast. There are patches submitted to the RDMA list which would allow
    the use of *_fast (they rework the use of mmap_sem) and as soon as they
    are accepted I'll submit a patch to convert the RDMA core as well. Also
    to this point others are looking to use *_fast.

    As an aside, Jason pointed out in my previous submission that *_fast and
    *_unlocked look very much the same. I agree and I think further cleanup
    will be coming. But I'm focused on getting the final solution for DAX at
    the moment.

    [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

    [ira.weiny@intel.com: v3]
    Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
    Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
    Signed-off-by: Ira Weiny
    Reviewed-by: Andrew Morton
    Cc: Aneesh Kumar K.V
    Cc: Michal Hocko
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Jason Gunthorpe
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "David S. Miller"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Ralf Baechle
    Cc: James Hogan
    Cc: Dan Williams
    Cc: Mike Marshall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira Weiny
     

17 Apr, 2019

1 commit

    The ring buffer code of XDP sockets is missing a memory barrier on the
    consumer side between the load of the data and the write that signals
    that it is ok for the producer to put new data into the buffer. On
    architectures that do not guarantee that stores are not reordered
    with older loads, the producer might put data into the ring before the
    consumer has had the chance to read it. As IA does guarantee this
    ordering, it would only need a compiler barrier here, but there are no
    primitives in Linux for this specific case (preventing writes from
    being ordered before older reads), so I had to add an smp_mb() here,
    which will translate into a run-time sync operation on IA.

    Added a longish comment in the code explaining what each barrier in
    the ring implementation accomplishes and what would happen if we
    removed one of them.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov

    Magnus Karlsson
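
    The ordering requirement can be sketched as a single-producer,
    single-consumer ring in userspace using C11 atomics; the kernel patch
    uses smp_mb(), modelled here by atomic_thread_fence. The ring layout
    and function names are illustrative, not the xsk_queue code.

```c
#include <assert.h>
#include <stdatomic.h>

#define RING_SZ 8u

struct ring {
    _Atomic unsigned int prod, cons;
    unsigned int data[RING_SZ];
};

static int ring_push(struct ring *r, unsigned int v)
{
    unsigned int p = atomic_load_explicit(&r->prod, memory_order_relaxed);

    if (p - atomic_load_explicit(&r->cons, memory_order_acquire) == RING_SZ)
        return -1;  /* full */
    r->data[p % RING_SZ] = v;
    /* release: data must be visible before the producer pointer moves */
    atomic_store_explicit(&r->prod, p + 1, memory_order_release);
    return 0;
}

static int ring_pop(struct ring *r, unsigned int *v)
{
    unsigned int c = atomic_load_explicit(&r->cons, memory_order_relaxed);

    if (c == atomic_load_explicit(&r->prod, memory_order_acquire))
        return -1;  /* empty */
    *v = r->data[c % RING_SZ];
    /* the barrier this commit adds: finish reading the data before the
     * consumer-pointer store tells the producer the slot is reusable */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&r->cons, c + 1, memory_order_relaxed);
    return 0;
}
```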
     

16 Mar, 2019

1 commit

  • When the umem is cleaned up, the task that created it might already be
    gone. If the task was gone, the xdp_umem_release function did not free
    the pages member of struct xdp_umem.

    It turned out that the task lookup was not needed at all; the code was
    a left-over from when we moved from task accounting to user accounting [1].

    This patch fixes the memory leak by removing the task lookup logic
    completely.

    [1] https://lore.kernel.org/netdev/20180131135356.19134-3-bjorn.topel@gmail.com/

    Link: https://lore.kernel.org/netdev/c1cb2ca8-6a14-3980-8672-f3de0bb38dfd@suse.cz/
    Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
    Reported-by: Jiri Slaby
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

09 Mar, 2019

2 commits

  • Passing a non-existing option in the options member of struct
    xdp_desc was, incorrectly, silently ignored. This patch addresses
    that behavior, and drops any Tx descriptor with non-existing options.

    We have examined existing user space code, and to our best knowledge,
    no one is relying on the current incorrect behavior. AF_XDP is still
    in its infancy, so from our perspective, the risk of breakage is very
    low, and addressing this problem now is important.
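    The fix boils down to strict validation of the options field. A
    minimal sketch (the helper and mask names are hypothetical; at the
    time of this commit no descriptor options were defined, so any set
    bit marks the descriptor as invalid):

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical name for the known-options mask.  No descriptor
     * options existed when this fix went in, so the mask is empty and
     * every set bit in xdp_desc.options is invalid. */
    #define XSK_VALID_DESC_OPTIONS 0u

    static bool xsk_desc_options_ok(uint32_t options)
    {
        /* any bit outside the known mask -> drop the Tx descriptor */
        return (options & ~XSK_VALID_DESC_OPTIONS) == 0;
    }

    int main(void)
    {
        assert(xsk_desc_options_ok(0));         /* no options: accepted */
        assert(!xsk_desc_options_ok(1u << 0));  /* unknown bit: dropped */
        return 0;
    }
    ```
    
    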

    Fixes: 35fcde7f8deb ("xsk: support for Tx")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • Passing a non-existing flag in the sxdp_flags member of struct
    sockaddr_xdp was, incorrectly, silently ignored. This patch addresses
    that behavior, and rejects any non-existing flags.

    We have examined existing user space code, and to our best knowledge,
    no one is relying on the current incorrect behavior. AF_XDP is still
    in its infancy, so from our perspective, the risk of breakage is very
    low, and addressing this problem now is important.
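    The same strict-validation pattern applies here: bind() must reject
    any sxdp_flags bit outside the recognized set. A standalone sketch
    (the helper name is illustrative; the flag values match the AF_XDP
    UAPI header of that era, re-declared so the example compiles on its
    own):

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Bind flags from <linux/if_xdp.h>, re-declared for a standalone example. */
    #define XDP_SHARED_UMEM (1 << 0)
    #define XDP_COPY        (1 << 1)
    #define XDP_ZEROCOPY    (1 << 2)

    static bool xsk_bind_flags_ok(uint16_t sxdp_flags)
    {
        /* after the fix: any unrecognized flag makes bind() fail
         * instead of being silently ignored */
        return (sxdp_flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY)) == 0;
    }

    int main(void)
    {
        assert(xsk_bind_flags_ok(XDP_COPY));
        assert(xsk_bind_flags_ok(XDP_SHARED_UMEM | XDP_ZEROCOPY));
        assert(!xsk_bind_flags_ok(1 << 5));  /* unknown flag: rejected */
        return 0;
    }
    ```
    
    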

    Fixes: 965a99098443 ("xsk: add support for bind for Rx")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

07 Mar, 2019

1 commit

  • Fixes two typos in xsk_diag_put_umem()

    syzbot reported the following crash :

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 7641 Comm: syz-executor946 Not tainted 5.0.0-rc7+ #95
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:xsk_diag_put_umem net/xdp/xsk_diag.c:71 [inline]
    RIP: 0010:xsk_diag_fill net/xdp/xsk_diag.c:113 [inline]
    RIP: 0010:xsk_diag_dump+0xdcb/0x13a0 net/xdp/xsk_diag.c:143
    Code: 8d be c0 04 00 00 48 89 f8 48 c1 e8 03 42 80 3c 20 00 0f 85 39 04 00 00 49 8b 96 c0 04 00 00 48 8d 7a 14 48 89 f8 48 c1 e8 03 0f b6 0c 20 48 89 f8 83 e0 07 83 c0 03 38 c8 7c 08 84 c9 0f 85
    RSP: 0018:ffff888090bcf2d8 EFLAGS: 00010203
    RAX: 0000000000000002 RBX: ffff8880a0aacbc0 RCX: ffffffff86ffdc3c
    RDX: 0000000000000000 RSI: ffffffff86ffdc70 RDI: 0000000000000014
    RBP: ffff888090bcf438 R08: ffff88808e04a700 R09: ffffed1011c74174
    R10: ffffed1011c74173 R11: ffff88808e3a0b9f R12: dffffc0000000000
    R13: ffff888093a6d818 R14: ffff88808e365240 R15: ffff88808e3a0b40
    FS: 00000000011ea880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000080 CR3: 000000008fa13000 CR4: 00000000001406e0
    Call Trace:
    netlink_dump+0x55d/0xfb0 net/netlink/af_netlink.c:2252
    __netlink_dump_start+0x5b4/0x7e0 net/netlink/af_netlink.c:2360
    netlink_dump_start include/linux/netlink.h:226 [inline]
    xsk_diag_handler_dump+0x1b2/0x250 net/xdp/xsk_diag.c:170
    __sock_diag_cmd net/core/sock_diag.c:232 [inline]
    sock_diag_rcv_msg+0x322/0x410 net/core/sock_diag.c:263
    netlink_rcv_skb+0x17a/0x460 net/netlink/af_netlink.c:2485
    sock_diag_rcv+0x2b/0x40 net/core/sock_diag.c:274
    netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
    netlink_unicast+0x536/0x720 net/netlink/af_netlink.c:1336
    netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1925
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg+0xdd/0x130 net/socket.c:632
    sock_write_iter+0x27c/0x3e0 net/socket.c:923
    call_write_iter include/linux/fs.h:1863 [inline]
    do_iter_readv_writev+0x5e0/0x8e0 fs/read_write.c:680
    do_iter_write fs/read_write.c:956 [inline]
    do_iter_write+0x184/0x610 fs/read_write.c:937
    vfs_writev+0x1b3/0x2f0 fs/read_write.c:1001
    do_writev+0xf6/0x290 fs/read_write.c:1036
    __do_sys_writev fs/read_write.c:1109 [inline]
    __se_sys_writev fs/read_write.c:1106 [inline]
    __x64_sys_writev+0x75/0xb0 fs/read_write.c:1106
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x440139
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffcc966cc18 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440139
    RDX: 0000000000000001 RSI: 0000000020000080 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
    R10: 0000000000000004 R11: 0000000000000246 R12: 00000000004019c0
    R13: 0000000000401a50 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    ---[ end trace 460a3c24d0a656c9 ]---
    RIP: 0010:xsk_diag_put_umem net/xdp/xsk_diag.c:71 [inline]
    RIP: 0010:xsk_diag_fill net/xdp/xsk_diag.c:113 [inline]
    RIP: 0010:xsk_diag_dump+0xdcb/0x13a0 net/xdp/xsk_diag.c:143
    Code: 8d be c0 04 00 00 48 89 f8 48 c1 e8 03 42 80 3c 20 00 0f 85 39 04 00 00 49 8b 96 c0 04 00 00 48 8d 7a 14 48 89 f8 48 c1 e8 03 0f b6 0c 20 48 89 f8 83 e0 07 83 c0 03 38 c8 7c 08 84 c9 0f 85
    RSP: 0018:ffff888090bcf2d8 EFLAGS: 00010203
    RAX: 0000000000000002 RBX: ffff8880a0aacbc0 RCX: ffffffff86ffdc3c
    RDX: 0000000000000000 RSI: ffffffff86ffdc70 RDI: 0000000000000014
    RBP: ffff888090bcf438 R08: ffff88808e04a700 R09: ffffed1011c74174
    R10: ffffed1011c74173 R11: ffff88808e3a0b9f R12: dffffc0000000000
    R13: ffff888093a6d818 R14: ffff88808e365240 R15: ffff88808e3a0b40
    FS: 00000000011ea880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000001d22000 CR3: 000000008fa13000 CR4: 00000000001406f0

    Fixes: a36b38aa2af6 ("xsk: add sock_diag interface for AF_XDP")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Cc: Björn Töpel
    Cc: Daniel Borkmann
    Cc: Magnus Karlsson
    Acked-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Eric Dumazet
     

25 Feb, 2019

1 commit

  • Three conflicts, one of which, for marvell10g.c is non-trivial and
    requires some follow-up from Heiner or someone else.

    The issue is that Heiner converted the marvell10g driver over to
    use the generic c45 code as much as possible.

    However, in 'net' a bug fix appeared which makes sure that a new
    local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
    is cleared.

    Signed-off-by: David S. Miller

    David S. Miller
     

21 Feb, 2019

1 commit

  • This reverts commit e2ce3674883ecba2605370404208c9d4a07ae1c3.

    It turns out that the sock destructor xsk_destruct was needed after
    all. The cleanup simplification broke the skb transmit cleanup path,
    because the umem was prematurely destroyed.

    The umem cannot be destroyed until all outstanding skbs are freed,
    which means that we cannot remove the umem until the sk_destruct has
    been called.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel