29 Feb, 2020

1 commit

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these is a flexible array member[1][2],
    introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kinds of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]
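
    As an illustration (not part of the patch; struct boo and
    alloc_foo() are stand-ins), dynamic allocation works the same with
    a flexible array member as it did with a zero-length array:

    #include <stdlib.h>

    struct boo { int x; };

    struct foo {
            int stuff;
            struct boo array[];     /* flexible array member, C99 */
    };

    /* sizeof(struct foo) covers only the fixed part, so the usual
     * "header plus n elements" size computation is unchanged. */
    struct foo *alloc_foo(size_t n)
    {
            return malloc(sizeof(struct foo) + n * sizeof(struct boo));
    }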

    This issue was found with the help of Coccinelle.

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Acked-by: Jonathan Lemon
    Acked-by: Björn Töpel
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     

11 Feb, 2020

1 commit

  • The commit 4b638f13bab4 ("xsk: Eliminate the RX batch size")
    introduced a much lazier way of updating the global consumer
    pointers from the kernel side, by only doing so when running out of
    entries in the fill or Tx rings (the rings consumed by the
    kernel). This can result in a deadlock with the user application if
    the kernel requires more than one entry to proceed and the application
    cannot put these entries in the fill ring: since the ring is not
    completely empty, the kernel never publishes its updated consumer
    pointer, so user space keeps seeing the ring as full.

    Fix this by publishing the local kernel side consumer pointer whenever
    we have completed Rx or Tx processing in the kernel. This way, user
    space will have an up-to-date view of the consumer pointers whenever it
    gets to execute in the one core case (application and driver on the
    same core), or after a certain number of packets have been processed
    in the two core case (application and driver on different cores).

    A side effect of this patch is that the one core case gets better
    performance, but the two core case gets worse. The reason that the one
    core case improves is that updating the global consumer pointer is
    relatively cheap since the application by definition is not running
    when the kernel is (they are on the same core) and it is beneficial
    for the application, once it gets to run, to have pointers that are
    as up to date as possible since it then can operate on more packets
    and buffers. In the two core case, the most important performance
    aspect is to minimize the number of accesses to the global pointers
    since they are shared between two cores and bounce between the caches
    of those cores. This patch results in more updates to global state,
    which means lower performance in the two core case.
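
    A minimal sketch of the idea (hypothetical names, not the kernel
    code as-is): the kernel consumes entries through a local cached
    pointer and now publishes it after every completed Rx/Tx pass
    instead of only when the ring runs dry:

    struct ring {
            unsigned int cached_cons;       /* kernel-local copy      */
            unsigned int *global_cons;      /* shared with user space */
    };

    static void kernel_process_pass(struct ring *r)
    {
            /* ... consume entries, advancing r->cached_cons ... */

            /* Before the fix, this store only happened once the ring
             * ran empty, so user space could keep seeing a stale
             * pointer and a seemingly full ring. Now it happens after
             * every completed pass. */
            *r->global_cons = r->cached_cons;
    }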

    Fixes: 4b638f13bab4 ("xsk: Eliminate the RX batch size")
    Reported-by: Ryan Goodfellow
    Reported-by: Maxim Mikityanskiy
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Acked-by: Jonathan Lemon
    Acked-by: Maxim Mikityanskiy
    Link: https://lore.kernel.org/bpf/1581348432-6747-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

01 Feb, 2020

2 commits

  • This provides a clearer, more symmetric API for pinning and
    unpinning DMA pages: pin_user_pages*() calls now match up with
    unpin_user_pages*() calls, and the API is a lot closer to being
    self-explanatory.

    Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
    Signed-off-by: John Hubbard
    Reviewed-by: Jan Kara
    Cc: Alex Williamson
    Cc: Aneesh Kumar K.V
    Cc: Björn Töpel
    Cc: Christoph Hellwig
    Cc: Daniel Vetter
    Cc: Dan Williams
    Cc: Hans Verkuil
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Jens Axboe
    Cc: Jerome Glisse
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Leon Romanovsky
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Convert net/xdp to use the new pin_user_pages() call, which sets
    FOLL_PIN. Setting FOLL_PIN is now required for code that requires
    tracking of pinned pages.

    In partial anticipation of this work, the net/xdp code was already calling
    put_user_page() instead of put_page(). Therefore, in order to convert
    from the get_user_pages()/put_page() model to the
    pin_user_pages()/put_user_page() model, the only change required here is
    to change get_user_pages() to pin_user_pages().
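
    The conversion itself is mechanical. As a sketch (simplified, with
    illustrative variable names rather than the exact diff):

    /* before */
    npgs = get_user_pages(address, umem->npgs, gup_flags,
                          &umem->pgs[0], NULL);

    /* after: FOLL_PIN is set internally, and the pages must later be
     * released via put_user_page*()/unpin_user_pages() rather than
     * put_page() */
    npgs = pin_user_pages(address, umem->npgs, gup_flags,
                          &umem->pgs[0], NULL);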

    Link: http://lkml.kernel.org/r/20200107224558.2362728-18-jhubbard@nvidia.com
    Signed-off-by: John Hubbard
    Acked-by: Björn Töpel
    Cc: Alex Williamson
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Daniel Vetter
    Cc: Dan Williams
    Cc: Hans Verkuil
    Cc: Ira Weiny
    Cc: Jan Kara
    Cc: Jason Gunthorpe
    Cc: Jens Axboe
    Cc: Jerome Glisse
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Leon Romanovsky
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     

22 Jan, 2020

1 commit

  • XDP sockets use the default implementation of struct sock's
    sk_data_ready callback, which is sock_def_readable(). This function
    is called in the XDP socket fast-path, and involves a retpoline. By
    giving sock_def_readable() external linkage and calling it
    directly, the retpoline can be avoided.
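
    Schematically (a sketch, not the exact patch), the change replaces
    an indirect call through the sk_data_ready function pointer with a
    direct call that needs no retpoline thunk:

    /* before: indirect call; with CONFIG_RETPOLINE this goes through
     * a retpoline on every received packet */
    xs->sk.sk_data_ready(&xs->sk);

    /* after: the known default implementation is called directly */
    sock_def_readable(&xs->sk);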

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200120092917.13949-1-bjorn.topel@gmail.com

    Björn Töpel
     

16 Jan, 2020

1 commit

  • When registering a umem area that is sufficiently large (>1G on an
    x86), kmalloc cannot be used to allocate one of the internal data
    structures, as the size requested gets too large. Use kvmalloc
    instead, which falls back on vmalloc if the allocation is too large
    for kmalloc.

    Also add accounting for this structure, as it is triggered by a user
    space action (the XDP_UMEM_REG setsockopt) and it is by far the
    largest kernel memory allocation in xsk.
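
    As a sketch of the pattern (close to, but not necessarily, the
    exact diff), kvcalloc() with __GFP_ACCOUNT gives both the vmalloc
    fallback and memory-cgroup accounting, and kvfree() releases the
    memory regardless of which allocator ended up being used:

    /* large, user-triggered allocation of the page pointer array */
    umem->pgs = kvcalloc(umem->npgs, sizeof(*umem->pgs),
                         GFP_KERNEL | __GFP_ACCOUNT);
    if (!umem->pgs)
            return -ENOMEM;

    /* and on the teardown path */
    kvfree(umem->pgs);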

    Reported-by: Ryan Goodfellow
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jonathan Lemon
    Link: https://lore.kernel.org/bpf/1578995365-7050-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

28 Dec, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2019-12-27

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 127 non-merge commits during the last 17 day(s) which contain
    a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).

    There are three merge conflicts. Conflicts and resolutions look as follows:

    1) Merge conflict in net/bpf/test_run.c:

    There was a tree-wide cleanup c593642c8be0 ("treewide: Use sizeof_field() macro")
    which gets in the way of b590cb5f802d ("bpf: Switch to offsetofend in
    BPF_PROG_TEST_RUN"):

    <<<<<<< HEAD
            if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                               sizeof_field(struct __sk_buff, priority),
    =======
            if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    There are a few occasions that look similar to this. Always take the chunk with
    offsetofend(). Note that there is one occasion where the fields differ:

    <<<<<<< HEAD
            if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                               sizeof_field(struct __sk_buff, tstamp),
    =======
            if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    Just take the one with offsetofend() /and/ gso_segs. The latter is correct due to
    850a88cc4096 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").

    2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:

    (I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)

    <<<<<<< HEAD
            if (is_13b_check(off, insn))
                    return -1;
            emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
    =======
            emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    Result should look like:

    emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);

    3) Merge conflict in arch/riscv/include/asm/pgtable.h:

    <<<<<<< HEAD
    =======
    #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
    #define VMALLOC_END (PAGE_OFFSET - 1)
    #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)

    #define BPF_JIT_REGION_SIZE (SZ_128M)
    #define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
    #define BPF_JIT_REGION_END (VMALLOC_END)

    /*
     * Roughly size the vmemmap space to be large enough to fit enough
     * struct pages to map half the virtual address space. Then
     * position vmemmap directly below the VMALLOC region.
     */
    #define VMEMMAP_SHIFT \
            (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
    #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
    #define VMEMMAP_END (VMALLOC_START - 1)
    #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)

    #define vmemmap ((struct page *)VMEMMAP_START)

    >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c

    Only take the BPF_* defines from there and move them higher up in the
    same file. Remove the rest from the chunk. The VMALLOC_* etc defines
    got moved via 01f52e16b868 ("riscv: define vmemmap before pfn_to_page
    calls"). Result:

    [...]
    #define __S101 PAGE_READ_EXEC
    #define __S110 PAGE_SHARED_EXEC
    #define __S111 PAGE_SHARED_EXEC

    #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
    #define VMALLOC_END (PAGE_OFFSET - 1)
    #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)

    #define BPF_JIT_REGION_SIZE (SZ_128M)
    #define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
    #define BPF_JIT_REGION_END (VMALLOC_END)

    /*
     * Roughly size the vmemmap space to be large enough to fit enough
     * struct pages to map half the virtual address space. Then
     * position vmemmap directly below the VMALLOC region.
     */
    #define VMEMMAP_SHIFT \
            (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
    #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
    #define VMEMMAP_END (VMALLOC_START - 1)
    #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)

    [...]

    Let me know if there are any other issues.

    Anyway, the main changes are:

    1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
    to a provided BPF object file. This provides an alternative, simplified API
    compared to standard libbpf interaction. Also, add libbpf extern variable
    resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.

    2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
    generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
    add various BPF riscv JIT improvements, from Björn Töpel.

    3) Extend bpftool to allow matching BPF programs and maps by name,
    from Paul Chaignon.

    4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
    flag for allowing updates without service interruption, from Andrey Ignatov.

    5) Cleanup and simplification of ring access functions for AF_XDP with a
    bonus of 0-5% performance improvement, from Magnus Karlsson.

    6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
    audit support for BPF, from Daniel Borkmann, the latter with Jiri Olsa.

    7) Move and extend test_select_reuseport into BPF program tests under
    BPF selftests, from Jakub Sitnicki.

    8) Various BPF sample improvements for xdpsock for customizing parameters
    to set up and benchmark AF_XDP, from Jay Jayatheerthan.

    9) Improve libbpf to provide a ulimit hint on permission denied errors.
    Also change XDP sample programs to attach in driver mode by default,
    from Toke Høiland-Jørgensen.

    10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
    programs, from Nikita V. Shirokov.

    11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.

    12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
    libbpf conversion, from Jesper Dangaard Brouer.

    13) Minor misc improvements from various others.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

21 Dec, 2019

12 commits

  • Improve readability and maintainability by using the struct_size()
    helper when allocating the AF_XDP rings.
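
    For reference, struct_size() (from <linux/overflow.h>) computes the
    size of a structure with a trailing flexible array member and checks
    for overflow. A sketch of the pattern, with illustrative names
    rather than the exact diff:

    /* ring layout: fixed header plus flexible descriptor array */
    struct xdp_umem_ring {
            struct xdp_ring ptrs;
            u64 desc[];
    };

    struct xdp_umem_ring *ring;

    /* before: open-coded and prone to overflow */
    size = sizeof(*ring) + nentries * sizeof(u64);

    /* after: overflow-checked and self-documenting */
    size = struct_size(ring, desc, nentries);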

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-13-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Add comments on how the ring access functions are named and how they
    are supposed to be used for producers and consumers. The functions are
    also reordered so that the consumer functions are in the beginning and
    the producer functions in the end, for easier reference. Put this in a
    separate patch as the diff might look a little odd, but no
    functionality has changed in this patch.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-12-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • There are two unnecessary READ_ONCE invocations on descriptor data.
    These are not needed since the data is written by the producer before
    it signals that the data is available by incrementing the producer
    pointer. As the access to this producer pointer is serialized and the
    consumer always reads the descriptor after it has read and synchronized
    with the producer counter, the write of the descriptor will have fully
    completed and it does not matter if the consumer has any read tearing.
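
    The ordering argument, as a simplified sketch (illustrative names):

    /* producer */
    ring->desc[idx] = val;                /* plain store of the data   */
    smp_wmb();                            /* order data before pointer */
    WRITE_ONCE(ring->producer, idx + 1);

    /* consumer */
    prod = READ_ONCE(ring->producer);
    smp_rmb();                            /* pairs with smp_wmb() above */
    if (idx < prod)
            val = ring->desc[idx];        /* plain load is fine: the
                                           * store has fully completed */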

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-11-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Change the name of xsk_umem_discard_addr to xsk_umem_release_addr to
    better reflect the new naming of the AF_XDP queue manipulation
    functions. As this function is used by drivers implementing support
    for AF_XDP zero-copy, the rename requires a corresponding change in
    these drivers. The function xsk_umem_release_addr_rq has also changed
    name in the same fashion.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-10-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Change the names of the validation functions to better reflect what
    they are doing. The uppermost ones are reading entries from the rings
    and only the bottom ones validate entries. So xskq_cons_read_ is a
    better prefix name.

    Also change the xskq_cons_read_ functions to return a bool,
    as the descriptor or address is already returned by reference
    in the parameters. Everyone is using the return value as a bool
    anyway.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-9-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Simplify and refactor consumer ring functions. The consumer first
    "peeks" to find descriptors or addresses that are available to
    read from the ring, then reads them and finally "releases" these
    descriptors once it is done. The two local variables cons_tail
    and cons_head are turned into one single variable called
    cached_cons. cons_tail held the cached value of the
    global consumer pointer and is now stored in cached_cons. For
    cons_head, we just use cached_prod instead as it was not used
    for a consumer queue before. It also better reflects what it
    really is now: a cached copy of the producer pointer.

    The names of the functions are also renamed in the same manner as
    the producer functions. The new functions are called xskq_cons_
    followed by what they do.
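
    The resulting usage pattern on, e.g., the Tx ring looks roughly
    like this (a sketch with simplified signatures):

    struct xdp_desc desc;

    while (xskq_cons_peek_desc(xs->tx, &desc, umem)) {
            /* transmit using the descriptor ... */
            xskq_cons_release(xs->tx);
    }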

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-8-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • At this point, there are no users of the functions xskq_nb_avail and
    xskq_nb_free that take any other number of entries argument than 1, so
    let us get rid of the second argument that takes the number of
    entries.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-7-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • In the xsk consumer ring code there is a variable called RX_BATCH_SIZE
    that dictates the minimum number of entries that we try to grab from
    the fill and Tx rings. In fact, the code always tries to grab the
    maximum amount of entries from these rings. The only thing this
    variable does is to throw an error if there are fewer than 16 (as it is
    defined) entries on the ring. There is no reason to do this and it
    will just lead to weird behavior from user space's point of view. So
    eliminate this variable.

    With this change, we will be able to simplify the xskq_nb_free and
    xskq_nb_avail code in the next commit.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-6-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Align the naming of the producer ring access functions with the
    naming convention used by the corresponding functions in libbpf,
    but adapted to the kernel. You first reserve a number of entries
    that you later submit to the global state of the ring. This is much
    clearer, IMO, than the previous kernel naming. Once renamed, we also
    discover that two functions are actually the same, so remove one of
    them. Some of the primitive ring submission operations are also the
    same, so break these out into __xskq_prod_submit, which the upper
    level ring access functions can use.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-5-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • Currently, the xsk ring code has two cached producer pointers:
    prod_head and prod_tail. This patch consolidates these two into a
    single one called cached_prod to make the code simpler and easier to
    maintain. This will be in line with the user space part of the
    code found in libbpf, which only uses a single cached pointer.

    The Rx path only uses the two top level functions
    xskq_produce_batch_desc and xskq_produce_flush_desc and they both use
    prod_head and never prod_tail. So just move them over to
    cached_prod.

    The Tx XDP_DRV path uses xskq_produce_addr_lazy and
    xskq_produce_flush_addr_n and unnecessarily operates on both prod_tail
    and prod_head, so move them over to just use cached_prod by skipping
    the intermediate step of updating prod_tail.

    The Tx path in XDP_SKB mode uses xskq_reserve_addr and
    xskq_produce_addr. They currently use both cached pointers, but we can
    operate on the global producer pointer in xskq_produce_addr since it
    has to be updated anyway, thus eliminating the use of both cached
    pointers. We can also remove the xskq_nb_free in xskq_produce_addr
    since it is already called in xskq_reserve_addr. No need to do it
    twice.

    When there is only one cached producer pointer, we can also simplify
    xskq_nb_free by removing one argument.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-4-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • In order to set the correct return flags for poll, the xsk code has to
    check if the Rx queue is empty and if the Tx queue is full. This code
    was unnecessarily large and complex as it used the functions that are
    used to update the local state from the global state (xskq_nb_free and
    xskq_nb_avail). Since we are not doing this nor updating any data
    dependent on this state, we can simplify the functions. Another
    benefit from this is that we can also simplify the xskq_nb_free and
    xskq_nb_avail functions in a later commit.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-3-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     
  • The lazy update threshold was introduced to keep the producer and
    consumer some distance apart in the completion ring. This was
    important in the beginning of the development of AF_XDP as the ring
    format at that point in time was very sensitive to the producer and
    consumer being on the same cache line. This is not the case
    anymore as the current ring format does not degrade in any noticeable
    way when this happens. Moreover, this threshold makes it impossible
    to run the system with rings that have less than 128 entries.

    So let us remove this threshold and just get one entry from the ring
    as in all other functions. This will enable us to remove this function
    in a later commit. Note that xskq_produce_addr_lazy followed by
    xskq_produce_flush_addr_n is still not the same as
    xskq_produce_addr(), as it operates on another cached pointer.

    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1576759171-28550-2-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

20 Dec, 2019

1 commit

  • The xskmap flush list is used to track entries that need to be flushed
    via the xdp_do_flush_map() function. This list used to be
    per-map, but there is really no reason for that. Instead make the
    flush list global for all xskmaps, which simplifies __xsk_map_flush()
    and xsk_map_alloc().

    Signed-off-by: Björn Töpel
    Signed-off-by: Alexei Starovoitov
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/20191219061006.21980-5-bjorn.topel@gmail.com

    Björn Töpel
     

19 Dec, 2019

1 commit

  • The XSK wakeup callback in drivers makes some sanity checks before
    triggering NAPI. However, some configuration changes may occur during
    this function that affect the result of those checks. For example, the
    interface can go down, and all the resources will be destroyed after the
    checks in the wakeup function, but before it attempts to use these
    resources. Wrap this callback in rcu_read_lock to allow the driver to
    synchronize_rcu before actually destroying the resources.

    xsk_wakeup is a new function that encapsulates calling ndo_xsk_wakeup
    wrapped in the RCU lock. After this commit, xsk_poll starts using
    xsk_wakeup and checks xs->zc instead of ndo_xsk_wakeup != NULL to decide
    whether ndo_xsk_wakeup should be called. It also fixes a bug introduced
    with the need_wakeup feature: a non-zero-copy socket may be used with a
    driver supporting zero-copy, and in this case ndo_xsk_wakeup should not
    be called, so the xs->zc check is the correct one.
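
    Schematically, the new helper looks roughly like this (a sketch;
    the copy-mode fallback in the callers is omitted):

    static int xsk_wakeup(struct xdp_sock *xs, u8 flags)
    {
            struct net_device *dev = xs->dev;
            int err;

            rcu_read_lock();
            err = dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id,
                                                  flags);
            rcu_read_unlock();

            return err;
    }

    /* in xsk_poll(): wake the driver only for zero-copy sockets */
    if (xs->zc)
            xsk_wakeup(xs, XDP_WAKEUP_RX);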

    Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191217162023.16011-2-maximmi@mellanox.com

    Maxim Mikityanskiy
     

25 Nov, 2019

1 commit

  • xsk_poll() is defined as returning 'unsigned int' but the
    .poll method is declared as returning '__poll_t', a bitwise type.

    Fix this by using the proper return type and using the EPOLL
    constants instead of the POLL ones, as required for __poll_t.
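
    In sketch form (simplified; the queue helpers shown are
    illustrative):

    static __poll_t xsk_poll(struct file *file, struct socket *sock,
                             struct poll_table_struct *wait)
    {
            struct xdp_sock *xs = xdp_sk(sock->sk);
            __poll_t mask = 0;      /* was: unsigned int */

            if (xs->rx && !xskq_empty_desc(xs->rx))
                    mask |= EPOLLIN | EPOLLRDNORM;   /* was POLLIN | POLLRDNORM */
            if (xs->tx && !xskq_full_desc(xs->tx))
                    mask |= EPOLLOUT | EPOLLWRNORM;  /* was POLLOUT | POLLWRNORM */

            return mask;
    }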

    Signed-off-by: Luc Van Oostenryck
    Signed-off-by: Daniel Borkmann
    Acked-by: Björn Töpel
    Link: https://lore.kernel.org/bpf/20191120001042.30830-1-luc.vanoostenryck@gmail.com

    Luc Van Oostenryck
     

03 Nov, 2019

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2019-11-02

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 30 non-merge commits during the last 7 day(s) which contain
    a total of 41 files changed, 1864 insertions(+), 474 deletions(-).

    The main changes are:

    1) Fix long standing user vs kernel access issue by introducing
    bpf_probe_read_user() and bpf_probe_read_kernel() helpers, from Daniel.

    2) Accelerated xskmap lookup, from Björn and Maciej.

    3) Support for automatic map pinning in libbpf, from Toke.

    4) Cleanup of BTF-enabled raw tracepoints, from Alexei.

    5) Various fixes to libbpf and selftests.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Nov, 2019

1 commit

  • In this commit the XSKMAP entry lookup function used by the XDP
    redirect code is moved from the xskmap.c file to the xdp_sock.h
    header, so the lookup can be inlined from, e.g., the
    bpf_xdp_redirect_map() function.

    Further, the __xsk_map_redirect() and __xsk_map_flush() functions are
    moved to xsk.c, which lets the compiler inline the xsk_rcv() and
    xsk_flush() functions.

    Finally, all the XDP socket functions were moved from linux/bpf.h to
    net/xdp_sock.h, where most of the XDP sockets functions are anyway.

    This yields a ~2% performance boost for the xdpsock "rx_drop"
    scenario.

    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191101110346.15004-4-bjorn.topel@gmail.com

    Björn Töpel
     

24 Oct, 2019

1 commit

  • Having Rx-only AF_XDP sockets can potentially lead to a crash in the
    system by a NULL pointer dereference in xsk_umem_consume_tx(). This
    function iterates through a list of all sockets tied to a umem and
    checks if there are any packets to send on the Tx ring. Rx-only
    sockets do not have a Tx ring, so this will cause a NULL pointer
    dereference. This will happen if you have registered one or more
    Rx-only sockets to a umem and the driver is checking the Tx ring even
    on Rx, or if the XDP_SHARED_UMEM mode is used and there is a mix of
    Rx-only and other sockets tied to the same umem.

    Fixed by only putting sockets with a Tx component on the list that
    xsk_umem_consume_tx() iterates over.

    Fixes: ac98d8aab61b ("xsk: wire upp Tx zero-copy functions")
    Reported-by: Kal Cutter Conley
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Alexei Starovoitov
    Acked-by: Jonathan Lemon
    Link: https://lore.kernel.org/bpf/1571645818-16244-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

03 Oct, 2019

1 commit

  • Fixes a crash in poll() when an AF_XDP socket is opened in copy mode
    and the bound device does not have ndo_xsk_wakeup defined. Avoid
    trying to call the non-existent ndo and instead call the internal xsk
    sendmsg function to send packets in the same way (from the
    application's point of view) as calling sendmsg() in any mode or
    poll() in zero-copy mode would have done. The application should
    behave in the same way independently of whether zero-copy mode or
    copy mode is used.

    Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
    Reported-by: syzbot+a5765ed8cdb1cca4d249@syzkaller.appspotmail.com
    Signed-off-by: Magnus Karlsson
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/1569997919-11541-1-git-send-email-magnus.karlsson@intel.com

    Magnus Karlsson
     

29 Sep, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Sanity check URB networking device parameters to avoid divide by
    zero, from Oliver Neukum.

    2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6
    don't work properly. Longer term this needs a better fix tho. From
    Vijay Khemka.

    3) Small fixes to selftests (use ping when ping6 is not present, etc.)
    from David Ahern.

    4) Bring back rt_uses_gateway member of struct rtable, its semantics
    were not well understood and trying to remove it broke things. From
    David Ahern.

    5) Move usbnet sanity checking, ignore endpoints with invalid
    wMaxPacketSize. From Bjørn Mork.

    6) Missing Kconfig deps for sja1105 driver, from Mao Wenan.

    7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel,
    Alex Vesker, and Yevgeny Kliteynik

    8) Missing CAP_NET_RAW checks in various places, from Ori Nimron.

    9) Fix crash when removing sch_cbs entry while offloading is enabled,
    from Vinicius Costa Gomes.

    10) Signedness bug fixes, generally in looking at the result given by
    of_get_phy_mode() and friends. From Dan Carpenter.

    11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet.

    12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern.

    13) Fix quantization code in tcp_bbr, from Kevin Yang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits)
    net: tap: clean up an indentation issue
    nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
    tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
    sk_buff: drop all skb extensions on free and skb scrubbing
    tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
    mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
    Documentation: Clarify trap's description
    mlxsw: spectrum: Clear VLAN filters during port initialization
    net: ena: clean up indentation issue
    NFC: st95hf: clean up indentation issue
    net: phy: micrel: add Asym Pause workaround for KSZ9021
    net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
    ptp: correctly disable flags on old ioctls
    lib: dimlib: fix help text typos
    net: dsa: microchip: Always set regmap stride to 1
    nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
    nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
    net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
    vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
    net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
    ...

    Linus Torvalds
     

25 Sep, 2019

2 commits

  • For pages that were retained via get_user_pages*(), release those pages
    via the new put_user_page*() routines, instead of via put_page() or
    release_pages().

    This is part of a tree-wide conversion, as described in fc1d8e7cca2d ("mm:
    introduce put_user_page*(), placeholder versions").

    Link: http://lkml.kernel.org/r/20190724044537.10458-4-jhubbard@nvidia.com
    Signed-off-by: John Hubbard
    Acked-by: Björn Töpel
    Cc: Björn Töpel
    Cc: Magnus Karlsson
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Patch series "Make working with compound pages easier", v2.

    These three patches add three helpers and convert the appropriate
    places to use them.

    This patch (of 3):

    It's unnecessarily hard to find out the size of a potentially huge page.
    Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
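
    For example (a sketch):

    /* before: the caller has to know about compound pages */
    unsigned long size = PAGE_SIZE << compound_order(page);

    /* after: works uniformly for normal and huge pages */
    unsigned long size = page_size(page);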

    Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

19 Sep, 2019

1 commit

  • This patch removes the 64B alignment of the UMEM headroom. There is
    really no reason for it, and having a headroom less than 64B should be
    valid.

    Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

06 Sep, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) Add the ability to use unaligned chunks in the AF_XDP umem. By
    relaxing where the chunks can be placed, it allows using an
    arbitrary buffer size and placing chunks wherever there is a free
    address in the umem. Helps more seamless DPDK AF_XDP driver
    integration. Support for i40e, ixgbe and mlx5e, from Kevin and
    Maxim.

    2) Addition of a wakeup flag for AF_XDP tx and fill rings so the
    application can wake up the kernel for rx/tx processing, which
    avoids busy-spinning of the latter, useful when app and driver are
    located on the same core. Support for i40e, ixgbe and mlx5e,
    from Magnus and Maxim.

    3) bpftool fixes for printf()-like functions so compiler can actually
    enforce checks, bpftool build system improvements for custom output
    directories, and addition of 'bpftool map freeze' command, from Quentin.

    4) Support attaching/detaching XDP programs from 'bpftool net' command,
    from Daniel.

    5) Automatic xskmap cleanup when AF_XDP socket is released, and several
    barrier/{read,write}_once fixes in AF_XDP code, from Björn.

    6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf
    inclusion as well as libbpf versioning improvements, from Andrii.

    7) Several new BPF kselftests for verifier precision tracking, from Alexei.

    8) Several BPF kselftest fixes wrt endianness to run on s390x, from Ilya.

    9) And more BPF kselftest improvements all over the place, from Stanislav.

    10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub.

    11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan.

    12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni.

    13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin.

    14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar.

    15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari,
    Peter, Wei, Yue.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Sep, 2019

4 commits

  • When accessing the members of an XDP socket, the control mutex should
    be held. This commit fixes that.

    Acked-by: Jonathan Lemon
    Fixes: a36b38aa2af6 ("xsk: add sock_diag interface for AF_XDP")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • Before the state variable was introduced by Ilya, the dev member was
    used to determine whether the socket was bound or not. However, when
    dev was read, proper SMP barriers and READ_ONCE were missing. In order
    to address the missing barriers and READ_ONCE, we start using the
    state variable as a point of synchronization. The state member
    read/write is paired with proper SMP barriers, and from this it follows
    that the members described above do not need READ_ONCE if used in
    conjunction with the state check.

    In all syscalls and the xsk_rcv path we check if state is
    XSK_BOUND. If that is the case we do an SMP read barrier, and this
    implies that the dev, umem and all rings are correctly set up. Note
    that no READ_ONCE is needed for these variables if used when state is
    XSK_BOUND (plus the read barrier).

    To summarize: the struct xdp_sock members dev, queue_id, umem,
    fq, cq, tx, rx, and state were read lock-less, with incorrect barriers
    and missing {READ, WRITE}_ONCE. Now, umem, fq, cq, tx, rx, and state
    are read lock-less. When these members are updated, WRITE_ONCE is
    used. When read, READ_ONCE is only used when reading outside the control
    mutex (e.g. mmap) or when not synchronized with the state member
    (XSK_BOUND plus smp_rmb()).

    Note that dev and queue_id do not need a WRITE_ONCE or READ_ONCE, due
    to the introduced state synchronization (XSK_BOUND plus smp_rmb()).
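
    The resulting bound check, roughly (a sketch of the pattern
    described above):

    static bool xsk_is_bound(struct xdp_sock *xs)
    {
            if (READ_ONCE(xs->state) == XSK_BOUND) {
                    /* pairs with the write barrier in bind(); after
                     * this, dev, umem and all rings are visible */
                    smp_rmb();
                    return true;
            }
            return false;
    }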

    Introducing the state check also fixes a race, found by syzkaller, in
    xsk_poll() where umem could be accessed when stale.

    Suggested-by: Hillf Danton
    Reported-by: syzbot+c82697e3043781e08802@syzkaller.appspotmail.com
    Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
    Signed-off-by: Björn Töpel
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • The umem member of struct xdp_sock is read outside of the control
    mutex, in the mmap implementation, and needs a WRITE_ONCE to avoid
    potential store-tearing.

    Acked-by: Jonathan Lemon
    Fixes: 423f38329d26 ("xsk: add umem fill queue support and mmap")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • Use WRITE_ONCE when doing the store of tx, rx, fq, and cq, to avoid
    potential store-tearing. These members are read outside of the control
    mutex in the mmap implementation.

    Acked-by: Jonathan Lemon
    Fixes: 37b076933a8e ("xsk: add missing write- and data-dependency barrier")
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     

31 Aug, 2019

1 commit

  • Currently, addresses are chunk size aligned. This means we are very
    restricted in terms of where we can place chunks within the umem. For
    example, if we have a chunk size of 2k, then our chunks can only be placed
    at 0,2k,4k,6k,8k... and so on (i.e. every 2k starting from 0).

    This patch introduces the ability to use unaligned chunks. With these
    changes, we are no longer bound to having to place chunks at a 2k (or
    whatever your chunk size is) interval. Since we are no longer dealing with
    aligned chunks, they can now cross page boundaries. Checks for page
    contiguity have been added in order to keep track of which pages are
    followed by a physically contiguous page.

    Signed-off-by: Kevin Laatz
    Signed-off-by: Ciara Loftus
    Signed-off-by: Bruce Richardson
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Kevin Laatz
     

21 Aug, 2019

1 commit

  • For 64-bit systems there is no reason to use vmap/vunmap, so use
    page_address as it was initially. For 32-bit systems, in some
    applications, such as the xdpsock_user.c sample, the number of pages
    in use can be quite big and the kmap memory can be insufficient.
    Besides that, kmap is best avoided in such cases, as it can block
    and should rather be used for dynamic mappings.

    Signed-off-by: Ivan Khoronzhuk
    Acked-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann

    Ivan Khoronzhuk
     

18 Aug, 2019

1 commit

  • When an AF_XDP socket is released/closed, the XSKMAP still holds a
    reference to the socket in a "released" state. The socket will still
    use the netdev queue resource, and block newly created sockets from
    attaching to that queue, but no user application can access the
    fill/complete/rx/tx queues. The result is that all applications need
    to explicitly clear the map entry of the old "zombie state"
    socket. This should be done automatically.

    In this patch, the socket tracks, and holds a reference to, the maps
    it resides in. When the socket is released, it removes itself from
    all maps.

    Suggested-by: Bruce Richardson
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann

    Björn Töpel