05 Mar, 2020

1 commit

  • commit 42d84c8490f9f0931786f1623191fcab397c3d64 upstream.

    Doing so, we save one call to get data we already have in the struct.

    Also, since there is no guarantee that getname does not use the
    sockaddr_ll parameter beyond its size, we add a little bit of
    security here. It should not write beyond MAX_ADDR_LEN, but syzbot
    found that ax25_getname writes more (72 bytes, the size of
    full_sockaddr_ax25, versus the 20 + 32 bytes of sockaddr_ll +
    MAX_ADDR_LEN in the syzbot repro).
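
    The shape of the fix is to check the socket's family directly
    instead of calling getname(); a hedged sketch of get_raw_socket()
    after the change (error handling abridged):

    static struct socket *get_raw_socket(int fd)
    {
            int r;
            struct socket *sock = sockfd_lookup(fd, &r);

            if (!sock)
                    return ERR_PTR(-ENOTSOCK);

            /* No getname() call: the family is already in the struct. */
            if (sock->sk->sk_family != AF_PACKET) {
                    sockfd_put(sock);
                    return ERR_PTR(-EPFNOSUPPORT);
            }
            return sock;
    }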

    Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
    Reported-by: syzbot+f2a62d07a5198c819c7b@syzkaller.appspotmail.com
    Signed-off-by: Eugenio Pérez
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eugenio Pérez
     

18 Jul, 2019

1 commit

  • Pull virtio, vhost updates from Michael Tsirkin:
    "Fixes, features, performance:

    - new iommu device

    - vhost guest memory access using vmap (just meta-data for now)

    - minor fixes"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio-mmio: add error check for platform_get_irq
    scsi: virtio_scsi: Use struct_size() helper
    iommu/virtio: Add event queue
    iommu/virtio: Add probe request
    iommu: Add virtio-iommu driver
    PCI: OF: Initialize dev->fwnode appropriately
    of: Allow the iommu-map property to omit untranslated devices
    dt-bindings: virtio: Add virtio-pci-iommu node
    dt-bindings: virtio-mmio: Add IOMMU description
    vhost: fix clang build warning
    vhost: access vq metadata through kernel virtual address
    vhost: factor out setting vring addr and num
    vhost: introduce helpers to get the size of metadata area
    vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
    vhost: fine grain userspace memory accessors
    vhost: generalize adding used elem

    Linus Torvalds
     

22 Jun, 2019

1 commit


19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 48 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Enrico Weigelt
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

18 Jun, 2019

1 commit

  • Vhost_net was known to suffer from HOL[1] issues which are not easy
    to fix. Several downstreams disable the feature by default. What's
    more, the datapath was split, and the datacopy path recently gained
    batching and XDP support, which makes it faster than the zerocopy
    path for small packet transmission.

    It looks to me that disabling zerocopy by default is more
    appropriate. It could be enabled by default again in the future if we
    fix the above issues.

    [1] https://patchwork.kernel.org/patch/3787671/
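
    The change itself is tiny; presumably it amounts to dropping the
    non-zero initializer of the module parameter in drivers/vhost/net.c
    (hedged sketch):

    -static int experimental_zcopytx = 1;
    +static int experimental_zcopytx;
     module_param(experimental_zcopytx, int, 0444);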

    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

06 Jun, 2019

1 commit


27 May, 2019

2 commits

  • When the rx buffer is too small for a packet, we will discard the vq
    descriptor and retry it for the next packet:

    while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
                                                  &busyloop_intr))) {
            ...
            /* On overrun, truncate and discard */
            if (unlikely(headcount > UIO_MAXIOV)) {
                    iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
                    err = sock->ops->recvmsg(sock, &msg,
                                             1, MSG_DONTWAIT | MSG_TRUNC);
                    pr_debug("Discarded rx packet: len %zd\n", sock_len);
                    continue;
            }
            ...
    }

    This makes it possible to trigger an infinite while..continue loop
    through the cooperation of two VMs:

    1) Malicious VM1 allocates a 1-byte rx buffer and tries to slow down
    the vhost process as much as possible, e.g. by using indirect
    descriptors.
    2) Malicious VM2 generates packets to VM1 as fast as possible.

    Fix this by checking against the weight at the end of the RX and TX
    loops. This also eliminates other similar cases where:

    - userspace is consuming the packets in the meantime
    - a theoretical TOCTOU attack occurs, with the guest moving the avail
    index back and forth to hit the continue path right after vhost finds
    the guest has just added new buffers

    This addresses CVE-2019-3900.

    Fixes: d8316f3991d20 ("vhost: fix total length when packets are too short")
    Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
    Signed-off-by: Jason Wang
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     
  • We used to have vhost_exceeds_weight() for vhost-net to:

    - prevent vhost kthread from hogging the cpu
    - balance the time spent between TX and RX

    This function could be useful for vsock and scsi as well. So move it
    to vhost.c. A device must specify a weight which counts the number of
    requests, and it can also specify a byte_weight which counts the
    number of bytes that have been processed.
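
    A hedged sketch of the helper after the move (simplified; the real
    code in drivers/vhost/vhost.c also treats byte_weight as optional):

    bool vhost_exceeds_weight(struct vhost_virtqueue *vq,
                              int pkts, int total_len)
    {
            struct vhost_dev *dev = vq->dev;

            if (total_len >= dev->byte_weight || pkts >= dev->weight) {
                    /* Yield: requeue the work instead of looping on. */
                    vhost_poll_queue(&vq->poll);
                    return true;
            }
            return false;
    }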

    Signed-off-by: Jason Wang
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     

29 Jan, 2019

1 commit

  • After batched used ring updating was introduced in commit e2b3b35eb989
    ("vhost_net: batch used ring update in rx"), we tend to batch heads in
    vq->heads for more than one packet. But the quota passed to
    get_rx_bufs() was not correctly limited, which can result in an OOB
    write to vq->heads.

    headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
                            vhost_len, &in, vq_log, &log,
                            likely(mergeable) ? UIO_MAXIOV : 1);

    UIO_MAXIOV was still used, which is wrong since we could already have
    batched heads in vq->heads. This causes an OOB write if the next
    buffer needs more than 960 (1024 (UIO_MAXIOV) - 64 (VHOST_NET_BATCH))
    heads after we've batched 64 (VHOST_NET_BATCH) heads:

    =============================================================================
    BUG kmalloc-8k (Tainted: G B ): Redzone overwritten
    -----------------------------------------------------------------------------

    INFO: 0x00000000fd93b7a2-0x00000000f0713384. First byte 0xa9 instead of 0xcc
    INFO: Allocated in alloc_pd+0x22/0x60 age=3933677 cpu=2 pid=2674
    kmem_cache_alloc_trace+0xbb/0x140
    alloc_pd+0x22/0x60
    gen8_ppgtt_create+0x11d/0x5f0
    i915_ppgtt_create+0x16/0x80
    i915_gem_create_context+0x248/0x390
    i915_gem_context_create_ioctl+0x4b/0xe0
    drm_ioctl_kernel+0xa5/0xf0
    drm_ioctl+0x2ed/0x3a0
    do_vfs_ioctl+0x9f/0x620
    ksys_ioctl+0x6b/0x80
    __x64_sys_ioctl+0x11/0x20
    do_syscall_64+0x43/0xf0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    INFO: Slab 0x00000000d13e87af objects=3 used=3 fp=0x (null) flags=0x200000000010201
    INFO: Object 0x0000000003278802 @offset=17064 fp=0x00000000e2e6652b

    Fix this by allocating UIO_MAXIOV + VHOST_NET_BATCH iovs for
    vhost-net. This is done by passing the limit through
    vhost_dev_init(), so that set_owner can allocate the number of iovs
    in a per-device manner.
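
    Presumably the init call in vhost-net then looks like this hedged
    sketch, with the extra iov limit argument being the new part:

            vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
                           UIO_MAXIOV + VHOST_NET_BATCH);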

    This fixes CVE-2018-16880.

    Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx")
    Acked-by: Stefan Hajnoczi
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

18 Jan, 2019

1 commit

  • The vhost dirty page logging API is designed to sync through GPA, but
    we try to log GIOVA when device IOTLB is enabled. This is wrong and
    may lead to missing data after migration.

    To solve this issue, when logging with device IOTLB enabled, we will:

    1) reuse the device IOTLB translation result of the GIOVA->HVA
    mapping to get the HVA; for writable descriptors, get the HVA through
    the iovec, and for used ring updates, translate the GIOVA to the HVA
    2) traverse the GPA->HVA mapping to get the possible GPAs and log
    through GPA. Note that this reverse mapping is not guaranteed to be
    unique, so we should log each possible GPA in this case.

    This fixes the failure of scp to the guest during migration. In
    -next, we will probably support passing GIOVA->GPA instead of
    GIOVA->HVA.
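
    A hedged sketch of step 2's reverse lookup (names follow struct
    vhost_umem_node; partial-overlap accounting and log-base details of
    the real drivers/vhost/vhost.c code are abridged):

    static int log_write_hva(struct vhost_virtqueue *vq, u64 hva, u64 len)
    {
            struct vhost_umem_node *u;
            int r;

            /* An HVA may be mapped by several GPA ranges; log every hit. */
            list_for_each_entry(u, &vq->umem->umem_list, link) {
                    if (u->userspace_addr >= hva + len ||
                        u->userspace_addr + u->size <= hva)
                            continue;       /* no overlap with this region */
                    r = log_write(vq->log_base,
                                  u->start + hva - u->userspace_addr,
                                  min(len, u->size - (hva - u->userspace_addr)));
                    if (r < 0)
                            return r;
            }
            return 0;
    }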

    Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
    Reported-by: Jintack Lim
    Cc: Jintack Lim
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

28 Dec, 2018

1 commit

  • Pull networking updates from David Miller:

    1) New ipset extensions for matching on destination MAC addresses, from
    Stefano Brivio.

    2) Add ipv4 ttl and tos, plus ipv6 flow label and hop limit offloads to
    nfp driver. From Stefano Brivio.

    3) Implement GRO for plain UDP sockets, from Paolo Abeni.

    4) Lots of work from Michał Mirosław to eliminate the VLAN_TAG_PRESENT
    bit so that we could support the entire vlan_tci value.

    5) Rework the IPSEC policy lookups to better optimize more usecases,
    from Florian Westphal.

    6) Infrastructure changes eliminating direct manipulation of SKB lists
    wherever possible, and to always use the appropriate SKB list
    helpers. This work is still ongoing...

    7) Lots of PHY driver and state machine improvements and
    simplifications, from Heiner Kallweit.

    8) Various TSO deferral refinements, from Eric Dumazet.

    9) Add ntuple filter support to aquantia driver, from Dmitry Bogdanov.

    10) Batch dropping of XDP packets in tuntap, from Jason Wang.

    11) Lots of cleanups and improvements to the r8169 driver from Heiner
    Kallweit, including support for ->xmit_more. This driver has been
    getting some much needed love since he started working on it.

    12) Lots of new forwarding selftests from Petr Machata.

    13) Enable VXLAN learning in mlxsw driver, from Ido Schimmel.

    14) Packed ring support for virtio, from Tiwei Bie.

    15) Add new Aquantia AQtion USB driver, from Dmitry Bezrukov.

    16) Add XDP support to dpaa2-eth driver, from Ioana Ciocoi Radulescu.

    17) Implement coalescing on TCP backlog queue, from Eric Dumazet.

    18) Implement carrier change in tun driver, from Nicolas Dichtel.

    19) Support msg_zerocopy in UDP, from Willem de Bruijn.

    20) Significantly improve garbage collection of neighbor objects when
    the table has many PERMANENT entries, from David Ahern.

    21) Remove egdev usage from nfp and mlx5, and remove the facility
    completely from the tree as it no longer has any users. From Oz
    Shlomo and others.

    22) Add a NETDEV_PRE_CHANGEADDR so that drivers can veto the change and
    therefore abort the operation before the commit phase (which is the
    NETDEV_CHANGEADDR event). From Petr Machata.

    23) Add indirect call wrappers to avoid retpoline overhead, and use them
    in the GRO code paths. From Paolo Abeni.

    24) Add support for netlink FDB get operations, from Roopa Prabhu.

    25) Support bloom filter in mlxsw driver, from Nir Dotan.

    26) Add SKB extension infrastructure. This consolidates the handling of
    the auxiliary SKB data used by IPSEC and bridge netfilter, and is
    designed to support the needs of MPTCP, which could be integrated in
    the future.

    27) Lots of XDP TX optimizations in mlx5 from Tariq Toukan.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1845 commits)
    net: dccp: fix kernel crash on module load
    drivers/net: appletalk/cops: remove redundant if statement and mask
    bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
    net/net_namespace: Check the return value of register_pernet_subsys()
    net/netlink_compat: Fix a missing check of nla_parse_nested
    ieee802154: lowpan_header_create check must check daddr
    net/mlx4_core: drop useless LIST_HEAD
    mlxsw: spectrum: drop useless LIST_HEAD
    net/mlx5e: drop useless LIST_HEAD
    iptunnel: Set tun_flags in the iptunnel_metadata_reply from src
    net/mlx5e: fix semicolon.cocci warnings
    staging: octeon: fix build failure with XFRM enabled
    net: Revert recent Spectre-v1 patches.
    can: af_can: Fix Spectre v1 vulnerability
    packet: validate address length if non-zero
    nfc: af_nfc: Fix Spectre v1 vulnerability
    phonet: af_phonet: Fix Spectre v1 vulnerability
    net: core: Fix Spectre v1 vulnerability
    net: minor cleanup in skb_ext_add()
    net: drop the unused helper skb_ext_get()
    ...

    Linus Torvalds
     

27 Dec, 2018

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The biggest RCU changes in this cycle were:

    - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

    - Replace calls of RCU-bh and RCU-sched update-side functions to
    their vanilla RCU counterparts. This series is a step towards
    complete removal of the RCU-bh and RCU-sched update-side functions.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - Documentation updates, including a number of flavor-consolidation
    updates from Joel Fernandes.

    - Miscellaneous fixes.

    - Automate generation of the initrd filesystem used for rcutorture
    testing.

    - Convert spin_is_locked() assertions to instead use lockdep.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - SRCU updates, especially including a fix from Dennis Krein for a
    bag-on-head-class bug.

    - RCU torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    rcutorture: Don't do busted forward-progress testing
    rcutorture: Use 100ms buckets for forward-progress callback histograms
    rcutorture: Recover from OOM during forward-progress tests
    rcutorture: Print forward-progress test age upon failure
    rcutorture: Print time since GP end upon forward-progress failure
    rcutorture: Print histogram of CB invocation at OOM time
    rcutorture: Print GP age upon forward-progress failure
    rcu: Print per-CPU callback counts for forward-progress failures
    rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
    rcutorture: Dump grace-period diagnostics upon forward-progress OOM
    rcutorture: Prepare for asynchronous access to rcu_fwd_startat
    torture: Remove unnecessary "ret" variables
    rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
    rcutorture: Break up too-long rcu_torture_fwd_prog() function
    rcutorture: Remove cbflood facility
    torture: Bring any extra CPUs online during kernel startup
    rcutorture: Add call_rcu() flooding forward-progress tests
    rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
    tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
    net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
    ...

    Linus Torvalds
     

21 Dec, 2018

1 commit


13 Dec, 2018

1 commit

  • We used to hold the mutex of the paired virtqueue in
    vhost_net_busy_poll(). But this results in an inconsistent lock
    order, which may cause a deadlock if we try to bring back the
    protection of the device IOTLB with the vq mutex, since that requires
    holding the mutexes of all virtqueues at the same time.

    Fix this simply by switching to mutex_trylock(); when it fails, just
    skip the busy polling. This can happen when the device IOTLB is being
    updated, which should be rare.
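
    A hedged sketch of the resulting pattern in vhost_net_busy_poll()
    (simplified):

            /* Do not spin on the paired vq mutex: an IOTLB update may
             * hold it, and the lock order must stay consistent.
             */
            if (!mutex_trylock(&vq->mutex))
                    return;
            ...
            mutex_unlock(&vq->mutex);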

    Fixes: 78139c94dc8c ("net: vhost: lock the vqs one by one")
    Cc: Tonghao Zhang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

28 Nov, 2018

1 commit


18 Nov, 2018

1 commit

  • We do a get_page() per packet, which involves an atomic operation.
    This patch mitigates the per-packet atomic operation by maintaining a
    reference bias which is initially USHRT_MAX. Each time a page is
    taken, instead of calling get_page() we decrease the bias, and when
    we find it's time to use a new page we drop the remaining references
    at one time through __page_frag_cache_drain().

    Testpmd(virtio_user + vhost_net) + XDP_DROP on TAP shows about 1.6%
    improvement.

    Before: 4.63Mpps
    After: 4.71Mpps
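
    A hedged sketch of the bias scheme (the struct and function names
    here are hypothetical; the real code lives in the tun/tap frag
    allocator):

    static struct page *frag_page_get(struct frag_cache *c)
    {
            if (!c->page) {
                    c->page = alloc_page(GFP_KERNEL);
                    if (!c->page)
                            return NULL;
                    /* Pre-charge the refcount once instead of per packet. */
                    page_ref_add(c->page, USHRT_MAX - 1);
                    c->bias = USHRT_MAX;
            }
            c->bias--;      /* replaces a per-packet get_page() */
            return c->page;
    }

    static void frag_page_retire(struct frag_cache *c)
    {
            /* Return all unused references in a single atomic op. */
            __page_frag_cache_drain(c->page, c->bias);
            c->page = NULL;
    }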

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

08 Oct, 2018

1 commit


27 Sep, 2018

3 commits

  • This patch improves guest receive performance.
    On the handle_tx side, we poll the sock receive queue at the same
    time; handle_rx does the same.

    We set poll-us=100us and use netperf to test throughput and mean
    latency. While running the tests, the vhost-net kthread of that VM
    is always at 100% CPU. The commands are shown below.

    Rx performance is greatly improved by this patch. There is no
    notable performance change on tx with this series, though. This
    patch is useful for bi-directional traffic.

    netperf -H IP -t TCP_STREAM -l 20 -- -O "THROUGHPUT, THROUGHPUT_UNITS, MEAN_LATENCY"

    Topology:
    [Host] ->linux bridge -> tap vhost-net ->[Guest]

    TCP_STREAM:
    * Without the patch: 19842.95 Mbps, 6.50 us mean latency
    * With the patch: 37598.20 Mbps, 3.43 us mean latency

    Signed-off-by: Tonghao Zhang
    Signed-off-by: David S. Miller

    Tonghao Zhang
     
  • Factor out the generic busy polling logic; it will be used in the tx
    path by the next patch. With this patch, qemu can set the
    busyloop_timeout differently for the rx queue.

    To avoid duplicated code, introduce the helper functions:
    * sock_has_rx_data (renamed from sk_has_rx_data)
    * vhost_net_busy_poll_try_queue

    Signed-off-by: Tonghao Zhang
    Signed-off-by: David S. Miller

    Tonghao Zhang
     
  • Use the VHOST_NET_VQ_XXX values as subclasses for mutex_lock_nested().
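
    A hedged illustration of the locking pattern (mutex_lock_nested()'s
    second argument is the lockdep subclass; VHOST_NET_VQ_RX is 0 and
    VHOST_NET_VQ_TX is 1 in the vhost-net vq enum):

            mutex_lock_nested(&net->vqs[VHOST_NET_VQ_RX].vq.mutex,
                              VHOST_NET_VQ_RX);
            mutex_lock_nested(&net->vqs[VHOST_NET_VQ_TX].vq.mutex,
                              VHOST_NET_VQ_TX);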

    Signed-off-by: Tonghao Zhang
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Tonghao Zhang
     

22 Sep, 2018

1 commit

  • We accidentally left out this error return, so it leads to some
    use-after-free bugs later on.

    Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets")
    Signed-off-by: Dan Carpenter
    Acked-by: Michael S. Tsirkin
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Dan Carpenter
     

14 Sep, 2018

2 commits

  • This patch implements XDP batching for vhost_net. The idea is to
    first try to do the userspace copy and build the XDP buff directly in
    vhost. Instead of submitting the packet immediately, vhost_net will
    batch packets in an array and submit every 64 (VHOST_NET_BATCH)
    packets to the underlying sockets through the msg_control of
    sendmsg().

    When XDP is enabled on the TUN/TAP, TUN/TAP can process XDP inside a
    loop without caring about GUP, and thus it can do batched map
    flushing. When XDP is not enabled or not supported, the underlying
    socket needs to build an skb and pass it to the network core. The
    batched packet submission allows us to do batching like
    netif_receive_skb_list() in the future.

    This saves lots of indirect calls for better cache utilization. For
    cases where we can't do batching, e.g. when sndbuf is limited or the
    packet size is too large, we fall back to the usual one packet per
    sendmsg().
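
    A hedged sketch of the batch flush (the helper name is illustrative;
    the real code passes the batched XDP array via the msg_control of
    sendmsg(), using the tun_msg_ctl type from the next patch):

    static int flush_batched_xdp(struct vhost_net_virtqueue *nvq,
                                 struct socket *sock, struct msghdr *msg)
    {
            struct tun_msg_ctl ctl = {
                    .type = TUN_MSG_PTR,
                    .ptr  = nvq->xdp,       /* array of batched XDP buffs */
            };

            if (!nvq->batched_xdp)
                    return 0;               /* nothing queued yet */
            msg->msg_control = &ctl;
            return sock->ops->sendmsg(sock, msg, 0);
    }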

    Doing testpmd on various setups gives us:

    Test /+pps%
    XDP_DROP on TAP /+44.8%
    XDP_REDIRECT on TAP /+29%
    macvtap (skb) /+26%

    Netperf tests shows obvious improvements for small packet transmission:

    size/session/+thu%/+normalize%
    64/ 1/ +2%/ 0%
    64/ 2/ +3%/ +1%
    64/ 4/ +7%/ +5%
    64/ 8/ +8%/ +6%
    256/ 1/ +3%/ 0%
    256/ 2/ +10%/ +7%
    256/ 4/ +26%/ +22%
    256/ 8/ +27%/ +23%
    512/ 1/ +3%/ +2%
    512/ 2/ +19%/ +14%
    512/ 4/ +43%/ +40%
    512/ 8/ +45%/ +41%
    1024/ 1/ +4%/ 0%
    1024/ 2/ +27%/ +21%
    1024/ 4/ +38%/ +73%
    1024/ 8/ +15%/ +24%
    2048/ 1/ +10%/ +7%
    2048/ 2/ +16%/ +12%
    2048/ 4/ 0%/ +2%
    2048/ 8/ 0%/ +2%
    4096/ 1/ +36%/ +60%
    4096/ 2/ -11%/ -26%
    4096/ 4/ 0%/ +14%
    4096/ 8/ 0%/ +4%
    16384/ 1/ -1%/ +5%
    16384/ 2/ 0%/ +2%
    16384/ 4/ 0%/ -3%
    16384/ 8/ 0%/ +4%
    65535/ 1/ 0%/ +10%
    65535/ 2/ 0%/ +8%
    65535/ 4/ 0%/ +1%
    65535/ 8/ 0%/ +3%

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch introduces a new tun/tap specific msg_control:

    #define TUN_MSG_UBUF 1
    #define TUN_MSG_PTR  2

    struct tun_msg_ctl {
            int type;
            void *ptr;
    };

    This allows us to pass different kinds of msg_control through
    sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF), which
    will be used by the existing vhost_net zerocopy code. The second is
    an XDP buff, which allows vhost_net to pass XDP buffs to TUN. This
    will be used to implement accepting an array of XDP buffs from
    vhost_net in the following patches.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

07 Aug, 2018

1 commit

  • We used to have a message format like:

    struct vhost_msg {
            int type;
            union {
                    struct vhost_iotlb_msg iotlb;
                    __u8 padding[64];
            };
    };

    Unfortunately, there will be a 32-bit hole on 64-bit machines because
    of the alignment. This leads to different formats between the 32-bit
    API and the 64-bit API. What's more, it will break 32-bit programs
    running on a 64-bit machine.

    So fix this by introducing a new message type with an explicit 32-bit
    reserved field after the type, like:

    struct vhost_msg_v2 {
            __u32 type;
            __u32 reserved;
            union {
                    struct vhost_iotlb_msg iotlb;
                    __u8 padding[64];
            };
    };

    We will have a consistent ABI after switching to use this. To enable
    this capability, introduce a new ioctl (VHOST_SET_BACKEND_FEATURES)
    for userspace to enable this feature (VHOST_BACKEND_F_IOTLB_MSG_V2).
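
    A hedged, userspace-style illustration of why the old layout differs
    across ABIs (the stub type here is a stand-in; its uint64_t member
    reproduces the alignment of vhost_iotlb_msg):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct iotlb_stub { uint64_t iova; };   /* same alignment driver */

    struct msg_v1 {                         /* old: implicit hole on 64-bit */
            int type;
            union {
                    struct iotlb_stub iotlb;
                    uint8_t padding[64];
            };
    };

    struct msg_v2 {                         /* new: explicit reserved field */
            uint32_t type;
            uint32_t reserved;
            union {
                    struct iotlb_stub iotlb;
                    uint8_t padding[64];
            };
    };

    int main(void)
    {
            /* 8 on x86-64 (4-byte hole after type), 4 on i386: incompatible */
            printf("v1 union offset: %zu\n", offsetof(struct msg_v1, iotlb));
            /* 8 everywhere by construction */
            printf("v2 union offset: %zu\n", offsetof(struct msg_v2, iotlb));
            return 0;
    }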

    Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

23 Jul, 2018

9 commits


04 Jul, 2018

4 commits

  • We may run out of avail rx ring descriptors under heavy load, but
    busypoll did not detect it, so busypoll may have exited prematurely.
    Avoid this by checking for rx ring full during busypoll.

    Signed-off-by: Toshiaki Makita
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • We may run handle_rx() while rx work is queued. For example, a packet
    can push the rx work during the window before handle_rx calls
    vhost_net_disable_vq().
    In that case busypoll immediately exits due to the vhost_has_work()
    condition and enables the vq again. This can lead to more unnecessary
    rx wake-ups, so poll the rx work instead of enabling the vq.

    Signed-off-by: Toshiaki Makita
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • Under heavy load, vhost busypoll may run without suppressing
    notification. For example, the tx zerocopy callback can push tx work
    while handle_tx() is running; then busyloop exits due to the
    vhost_has_work() condition and enables notification, but immediately
    reenters handle_tx() because the pushed work was tx. In this case
    handle_tx() tries to disable notification again, but when using
    event_idx it by design cannot. Then busyloop will run without
    suppressing notification.
    Another example is the case where handle_tx() tries to enable
    notification but the avail idx has advanced, so it disables it again.
    This case also leads to the same situation with event_idx.

    The problem is that once we enter this situation, busyloop does not
    work under heavy load for a considerable amount of time, because a
    notification is likely to happen during the busyloop and handle_tx()
    immediately enables notification after the notification happens.
    Specifically, busyloop detects notification by vhost_has_work() and
    then handle_tx() calls vhost_enable_notify(). Because the detected
    work was the tx work, it enters handle_tx(), and enters busyloop
    without suppression again. This is likely to be repeated, so with
    event_idx we are almost never able to suppress notification in this
    case.

    To fix this, poll the work instead of enabling notification when
    busypoll is interrupted by something. IMHO vhost_has_work() is a kind
    of interruption rather than a signal to completely cancel the
    busypoll, so let's run busypoll after the necessary work is done.

    Signed-off-by: Toshiaki Makita
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • So we can easily see which variable is for which, tx or rx.

    Signed-off-by: Toshiaki Makita
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Toshiaki Makita
     

23 Jun, 2018

1 commit

  • Sock will be NULL if we pass -1 to vhost_net_set_backend(), but when
    we meet errors during ubuf allocation, the code does not check for
    NULL before calling sockfd_put(), which leads to a NULL
    dereference. Fix this by checking the sock pointer first.
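
    The shape of the fix in the error path is presumably just (hedged
    sketch):

            if (sock)       /* sock is NULL when the backend fd was -1 */
                    sockfd_put(sock);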

    Fixes: bab632d69ee4 ("vhost: vhost TX zero-copy support")
    Reported-by: Dan Carpenter
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

13 Jun, 2018

1 commit

  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:
    kmalloc_array(a, b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

03 Jun, 2018

1 commit