07 Oct, 2020

1 commit

  • [ Upstream commit 4c7246dc45e2706770d5233f7ce1597a07e069ba ]

    We are going to add 'struct vsock_sock *' parameter to
    virtio_transport_get_ops().

    In some cases, like in the virtio_transport_reset_no_sock(),
    we don't have any socket assigned to the packet received,
    so we can't use the virtio_transport_get_ops().

    In order to allow virtio_transport_reset_no_sock() to use the
    '.send_pkt' callback from the 'vhost_transport' or 'virtio_transport',
    we add the 'struct virtio_transport *' to it and to its caller:
    virtio_transport_recv_pkt().

    We moved the 'vhost_transport' and 'virtio_transport' definitions
    so that their addresses can be passed to virtio_transport_recv_pkt()
    (a rough sketch follows this entry).

    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Stefano Garzarella
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Stefano Garzarella
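
    A rough sketch of the resulting call shape, grounded in the description
    above (simplified, unrelated details elided; not the exact upstream diff):

        /* The transport is now passed explicitly, so a reset can be sent
         * through its '.send_pkt' callback even when no socket is
         * associated with the received packet. */
        static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
                                                  struct virtio_vsock_pkt *pkt)
        {
                struct virtio_vsock_pkt *reply;
                ...
                if (!t)
                        return -ENOTCONN;

                reply = virtio_transport_alloc_pkt(&info, 0, ...);
                if (!reply)
                        return -ENOMEM;

                return t->send_pkt(reply);
        }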
     

05 Aug, 2020

1 commit

  • commit 295c1b9852d000580786375304a9800bd9634d15 upstream.

    vhost/scsi doesn't handle type conversion correctly for the request
    type when using virtio 1.0 and up on big-endian or cross-endian
    platforms.

    Fix it up using vhost32_to_cpu (a sketch follows this entry).

    Cc: stable@vger.kernel.org
    Signed-off-by: Michael S. Tsirkin
    Acked-by: Jason Wang
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
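
    A minimal sketch of the kind of fix described, assuming the usual vhost
    endian helper vhost32_to_cpu() and the request handling in
    drivers/vhost/scsi.c (not the exact upstream diff):

        /* Before: the raw (virtio, i.e. little-endian for virtio 1.0+)
         * field was used directly, which breaks on BE/cross-endian hosts:
         *
         *     switch (v_req.type) {
         *
         * After (sketch): convert to CPU endianness first. */
        switch (vhost32_to_cpu(vq, v_req.type)) {
        case VIRTIO_SCSI_T_TMF:
                ...
                break;
        default:
                ...
        }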
     

24 Jun, 2020

1 commit

  • [ Upstream commit 5ae6a6a915033bfee79e76e0c374d4f927909edc ]

    vhost-scsi pre-allocates the maximum sg entries per command and fails
    any command that requires more than VHOST_SCSI_PREALLOC_SGLS entries.
    This patch lets vhost communicate that sg limit when it registers
    vhost_scsi_ops with TCM. With this change, TCM reports the max sg
    entries through the "Block Limits" VPD page, which is typically queried
    by the SCSI initiator during device discovery. Knowing this limit, the
    initiator can keep the maximum transfer length less than or equal to
    what vhost-scsi reports. (A sketch of the registration follows this
    entry.)

    Link: https://lore.kernel.org/r/1590166317-953-1-git-send-email-sudhakar.panneerselvam@oracle.com
    Cc: Michael S. Tsirkin
    Cc: Jason Wang
    Cc: Paolo Bonzini
    Cc: Stefan Hajnoczi
    Reviewed-by: Mike Christie
    Signed-off-by: Sudhakar Panneerselvam
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Sasha Levin

    Sudhakar Panneerselvam
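
    A sketch of the registration described above, assuming TCM's
    max_data_sg_nents field in struct target_core_fabric_ops (other fields
    elided; not the exact upstream diff):

        static const struct target_core_fabric_ops vhost_scsi_ops = {
                .module                 = THIS_MODULE,
                .fabric_name            = "vhost",
                /* Advertise the preallocated SGL limit so TCM can report
                 * it via the "Block Limits" VPD page. */
                .max_data_sg_nents      = VHOST_SCSI_PREALLOC_SGLS,
                ...
        };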
     

10 May, 2020

1 commit

  • commit 0b841030625cde5f784dd62aec72d6a766faae70 upstream.

    Ning Bo reported an abnormal 2-second gap when booting a Kata
    container [1]. The unconditional timeout comes from
    VSOCK_DEFAULT_CONNECT_TIMEOUT on the client side: the vhost vsock
    client tries to connect to a virtio vsock server that is still
    initializing.

    The abnormal flow looks like:

    host-userspace          vhost vsock                       guest vsock
    ==============          ===========                       ============
    connect()  ---------->  vhost_transport_send_pkt_work()   initializing
        |                   vq->private_data==NULL
        |                   will not be queued
        V
    schedule_timeout(2s)
                            vhost_vsock_start()               (sets private_data)

    wait for 2s and failed
    connect() again         vq->private_data!=NULL            recv connecting pkt

    Details:
    1. Host userspace sends a connect pkt; at that time the guest vsock is
       still initializing, so vhost_vsock_start() has not been called yet.
       Hence vq->private_data == NULL and the pkt is not queued to be sent
       to the guest.
    2. The connect then sleeps for 2s.
    3. After the guest vsock finishes initializing, vq->private_data is set.
    4. When host userspace wakes up after 2s and sends the connect pkt
       again, everything is fine.

    As suggested by Stefano Garzarella, fix this by additionally kicking
    the send_pkt worker in vhost_vsock_start() once the virtio device is
    started, so the pending pkt is sent again (a sketch follows this
    entry).

    After this patch, kata-runtime (with vsock enabled) boot time is reduced
    from 3s to 1s on a ThunderX2 arm64 server.

    [1] https://github.com/kata-containers/runtime/issues/1917

    Reported-by: Ning Bo
    Suggested-by: Stefano Garzarella
    Signed-off-by: Jia He
    Link: https://lore.kernel.org/r/20200501043840.186557-1-justin.he@arm.com
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Stefano Garzarella
    Signed-off-by: Greg Kroah-Hartman

    Jia He
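
    A sketch of the fix described above, assuming vhost_vsock's existing
    send_pkt_work (not the exact upstream diff):

        static int vhost_vsock_start(struct vhost_vsock *vsock)
        {
                ...
                /* all vqs now have vq->private_data set */

                /* Packets may have been queued while the device was still
                 * initializing, so kick the send worker once here so the
                 * pending pkt is sent again. */
                vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
                ...
                return 0;
        }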
     

05 Mar, 2020

1 commit

  • commit 42d84c8490f9f0931786f1623191fcab397c3d64 upstream.

    Doing so, we save one call to get data we already have in the struct.

    Also, since there is no guarantee that getname uses the sockaddr_ll
    parameter only within its size, we add a little bit of security here.
    It should not write beyond MAX_ADDR_LEN, but syzbot found that
    ax25_getname writes more (72 bytes, the size of full_sockaddr_ax25,
    versus the 20 + 32 bytes of sockaddr_ll + MAX_ADDR_LEN in the syzbot
    repro). (A sketch of the check follows this entry.)

    Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
    Reported-by: syzbot+f2a62d07a5198c819c7b@syzkaller.appspotmail.com
    Signed-off-by: Eugenio Pérez
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eugenio Pérez
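
    A sketch of the resulting check in vhost-net's raw socket validation
    (simplified; not the exact upstream diff):

        static struct socket *get_raw_socket(int fd)
        {
                struct socket *sock = sockfd_lookup(fd, &r);
                ...
                /* Check the family directly from the struct instead of
                 * calling getname() into an on-stack sockaddr_ll, which
                 * some protocols (e.g. ax25) may write past. */
                if (sock->sk->sk_family != AF_PACKET) {
                        r = -EPFNOSUPPORT;
                        goto err;
                }
                ...
        }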
     

05 Jan, 2020

1 commit

  • [ Upstream commit 8a3cc29c316c17de590e3ff8b59f3d6cbfd37b0a ]

    When we receive a new packet from the guest, we check if the
    src_cid is correct, but we forgot to check the dst_cid.

    The host should accept only packets where dst_cid is equal to the
    host CID (a sketch of the check follows this entry).

    Signed-off-by: Stefano Garzarella
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stefano Garzarella
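
    A sketch of the check on the vhost receive path (simplified; not the
    exact upstream diff):

        /* Only accept correctly addressed packets: src_cid must be the
         * guest CID and dst_cid must be the host CID. */
        if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid &&
            le64_to_cpu(pkt->hdr.dst_cid) == vhost_transport_get_local_cid())
                virtio_transport_recv_pkt(pkt);
        else
                virtio_transport_free_pkt(pkt);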
     

13 Oct, 2019

1 commit

    When device stop was moved out of reset, the test device wasn't updated
    to stop before reset, which resulted in a use after free. Fix by
    invoking stop appropriately.

    Fixes: b211616d7125 ("vhost: move -net specific code out")
    Signed-off-by: Michael S. Tsirkin

    Michael S. Tsirkin
     

12 Sep, 2019

2 commits

    The code assumes log_num < in_num everywhere, and that is true as long
    as in_num is incremented by the descriptor iov count and log_num by 1.
    However, this breaks if there is a zero-sized descriptor.

    As a result, if a malicious guest creates a vring desc with
    desc.len = 0, it may cause the host kernel to crash by overflowing the
    log array. This bug can be triggered during VM migration.

    There is no need to log when desc.len = 0, so just don't increment
    log_num in this case (a sketch follows this entry).

    Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
    Cc: stable@vger.kernel.org
    Reviewed-by: Lidong Chen
    Signed-off-by: ruippan
    Signed-off-by: yongduan
    Acked-by: Michael S. Tsirkin
    Reviewed-by: Tyler Hicks
    Signed-off-by: Michael S. Tsirkin

    yongduan
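
    A sketch of the fix in the descriptor translation path, where 'ret' is
    the iovec count produced for the descriptor (simplified; not the exact
    upstream diff):

        /* Only log writable descriptors that actually produced iovecs
         * (ret != 0); a zero-length descriptor must not bump log_num,
         * otherwise log_num can exceed in_num and overflow the log
         * array. */
        if (unlikely(log && ret)) {
                log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
                log[*log_num].len = vhost32_to_cpu(vq, desc.len);
                ++*log_num;
        }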
     
    iovec addresses coming from vhost are assumed to be pre-validated, but
    in fact can be speculated to a value out of range.

    Userspace addresses are later validated with array_index_nospec, so we
    can be sure kernel info does not leak through these addresses, but
    vhost must also not leak userspace info outside the allowed memory
    table to guests.

    Following the defence-in-depth principle, make sure the translated
    address cannot be speculated outside the node range (a sketch follows
    this entry).

    Signed-off-by: Michael S. Tsirkin
    Cc: stable@vger.kernel.org
    Acked-by: Jason Wang
    Tested-by: Jason Wang

    Michael S. Tsirkin
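
    A sketch of the defence-in-depth clamp in translate_desc(), using
    array_index_nospec() (simplified; not the exact upstream diff):

        /* Clamp the offset into the memory node so that a speculated
         * out-of-range address cannot be turned into a pointer outside
         * the allowed memory table. */
        _iov->iov_base = (void __user *)(unsigned long)
                (node->userspace_addr +
                 array_index_nospec((unsigned long)(addr - node->start),
                                    node->size));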
     

04 Sep, 2019

4 commits

  • This reverts commit 7f466032dc ("vhost: access vq metadata through
    kernel virtual address"). The commit caused a bunch of issues, and
    while commit 73f628ec9e ("vhost: disable metadata prefetch
    optimization") disabled the optimization it's not nice to keep lots of
    dead code around.

    Signed-off-by: Michael S. Tsirkin

    Michael S. Tsirkin
     
    It is unnecessary to use a ret variable to return the error code;
    just return the error code directly.

    Signed-off-by: Yunsheng Lin
    Signed-off-by: Michael S. Tsirkin

    Yunsheng Lin
     
    Since vhost_exceeds_weight() was introduced, callers need to specify
    the packet weight and byte weight in vhost_dev_init(). Note that the
    packet weight isn't counted in this patch, to keep the original
    behavior unchanged.

    Fixes: e82b9b0727ff ("vhost: introduce vhost_exceeds_weight()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Tiwei Bie
    Signed-off-by: Michael S. Tsirkin
    Acked-by: Jason Wang

    Tiwei Bie
     
    Since the commit below, callers need to specify the iov_limit in
    vhost_dev_init() explicitly (a sketch covering this and the previous
    fix follows this entry).

    Fixes: b46a0bf78ad7 ("vhost: fix OOB in get_rx_bufs()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Tiwei Bie
    Signed-off-by: Michael S. Tsirkin
    Acked-by: Jason Wang

    Tiwei Bie
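
    A sketch covering this and the previous fix, assuming the
    vhost_dev_init() signature of that era (dev, vqs, nvqs, iov_limit,
    weight, byte_weight); the VHOST_TEST_* constants are illustrative names
    for the test device's limits:

        /* drivers/vhost/test.c, vhost_test_open() (sketch) */
        vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX,
                       UIO_MAXIOV,              /* iov_limit */
                       VHOST_TEST_PKT_WEIGHT,   /* packet weight */
                       VHOST_TEST_WEIGHT);      /* byte weight */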
     

31 Jul, 2019

2 commits

    If the packets to be sent to the guest are bigger than the available
    buffer, we can split them across multiple buffers, fixing up the length
    in the packet header (a sketch follows this entry).
    This is safe since virtio-vsock supports only stream sockets.

    Signed-off-by: Stefano Garzarella
    Reviewed-by: Stefan Hajnoczi
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Stefano Garzarella
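
    A sketch of the splitting logic in the vhost-vsock send worker
    (simplified; not the exact upstream diff):

        /* If the payload is bigger than the space available in the guest
         * buffer, send only what fits now; the rest of the packet will be
         * sent in the next buffers. */
        if (payload_len > iov_len - sizeof(pkt->hdr))
                payload_len = iov_len - sizeof(pkt->hdr);

        /* Fix up the length in the header for this (possibly partial)
         * transfer. */
        pkt->hdr.len = cpu_to_le32(payload_len);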
     
    Since virtio-vsock was introduced, the buffers filled by the host and
    pushed to the guest using the vring are queued directly in a per-socket
    list. These buffers are preallocated by the guest with a fixed size
    (4 KB).

    The maximum amount of memory used by each socket should be controlled
    by the credit mechanism.
    The default credit available per socket is 256 KB, but if we use only
    1 byte per packet, the guest can queue up to 262144 4 KB buffers, using
    up to 1 GB of memory per socket. In addition, the guest will continue
    to fill the vring with new 4 KB free buffers to avoid starvation of
    other sockets.

    This patch mitigates the issue by copying the payload of small packets
    (< 128 bytes) into the buffer of the last packet queued, in order to
    avoid wasting memory (a sketch follows this entry).

    Signed-off-by: Stefano Garzarella
    Reviewed-by: Stefan Hajnoczi
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Stefano Garzarella
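
    A sketch of the mitigation on the receive path; the 128-byte threshold
    is the one mentioned above, and the rx_queue/buf_len field names are
    assumptions (not the exact upstream diff):

        /* For small packets, try to append the payload to the last buffer
         * already queued on the socket instead of queuing another 4 KB
         * buffer. */
        if (pkt->len <= GOOD_COPY_LEN /* 128 */ && !list_empty(&vvs->rx_queue)) {
                struct virtio_vsock_pkt *last_pkt;

                last_pkt = list_last_entry(&vvs->rx_queue,
                                           struct virtio_vsock_pkt, list);

                if (pkt->len <= last_pkt->buf_len - last_pkt->len) {
                        memcpy(last_pkt->buf + last_pkt->len, pkt->buf,
                               pkt->len);
                        last_pkt->len += pkt->len;
                        free_pkt = true;
                        goto out;
                }
        }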
     

18 Jul, 2019

1 commit

  • Pull virtio, vhost updates from Michael Tsirkin:
    "Fixes, features, performance:

    - new iommu device

    - vhost guest memory access using vmap (just meta-data for now)

    - minor fixes"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio-mmio: add error check for platform_get_irq
    scsi: virtio_scsi: Use struct_size() helper
    iommu/virtio: Add event queue
    iommu/virtio: Add probe request
    iommu: Add virtio-iommu driver
    PCI: OF: Initialize dev->fwnode appropriately
    of: Allow the iommu-map property to omit untranslated devices
    dt-bindings: virtio: Add virtio-pci-iommu node
    dt-bindings: virtio-mmio: Add IOMMU description
    vhost: fix clang build warning
    vhost: access vq metadata through kernel virtual address
    vhost: factor out setting vring addr and num
    vhost: introduce helpers to get the size of metadata area
    vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
    vhost: fine grain userspace memory accessors
    vhost: generalize adding used elem

    Linus Torvalds
     

12 Jul, 2019

1 commit

  • Pull networking updates from David Miller:
    "Some highlights from this development cycle:

    1) Big refactoring of ipv6 route and neigh handling to support
    nexthop objects configurable as units from userspace. From David
    Ahern.

    2) Convert explored_states in BPF verifier into a hash table,
    significantly decreased state held for programs with bpf2bpf
    calls, from Alexei Starovoitov.

    3) Implement bpf_send_signal() helper, from Yonghong Song.

    4) Various classifier enhancements to mvpp2 driver, from Maxime
    Chevallier.

    5) Add aRFS support to hns3 driver, from Jian Shen.

    6) Fix use after free in inet frags by allocating fqdirs dynamically
    and reworking how rhashtable dismantle occurs, from Eric Dumazet.

    7) Add act_ctinfo packet classifier action, from Kevin
    Darbyshire-Bryant.

    8) Add TFO key backup infrastructure, from Jason Baron.

    9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

    10) Add devlink notifications for flash update status to mlxsw driver,
    from Jiri Pirko.

    11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

    12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

    13) Various enhancements to ipv6 flow label handling, from Eric
    Dumazet and Willem de Bruijn.

    14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
    der Merwe, and others.

    15) Various improvements to axienet driver including converting it to
    phylink, from Robert Hancock.

    16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

    17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
    Radulescu.

    18) Add devlink health reporting to mlx5, from Moshe Shemesh.

    19) Convert stmmac over to phylink, from Jose Abreu.

    20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
    Shalom Toledo.

    21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

    22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

    23) Track spill/fill of constants in BPF verifier, from Alexei
    Starovoitov.

    24) Support bounded loops in BPF, from Alexei Starovoitov.

    25) Various page_pool API fixes and improvements, from Jesper Dangaard
    Brouer.

    26) Just like ipv4, support ref-countless ipv6 route handling. From
    Wei Wang.

    27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

    28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

    29) Add flower GRE encap/decap support to nfp driver, from Pieter
    Jansen van Vuuren.

    30) Protect against stack overflow when using act_mirred, from John
    Hurley.

    31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

    32) Use page_pool API in netsec driver, Ilias Apalodimas.

    33) Add Google gve network driver, from Catherine Sullivan.

    34) More indirect call avoidance, from Paolo Abeni.

    35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

    36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

    37) Add MPLS manipulation actions to TC, from John Hurley.

    38) Add sending a packet to connection tracking from TC actions, and
    then allow flower classifier matching on conntrack state. From
    Paul Blakey.

    39) Netfilter hw offload support, from Pablo Neira Ayuso"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
    net/mlx5e: Return in default case statement in tx_post_resync_params
    mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
    net: dsa: add support for BRIDGE_MROUTER attribute
    pkt_sched: Include const.h
    net: netsec: remove static declaration for netsec_set_tx_de()
    net: netsec: remove superfluous if statement
    netfilter: nf_tables: add hardware offload support
    net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
    net: flow_offload: add flow_block_cb_is_busy() and use it
    net: sched: remove tcf block API
    drivers: net: use flow block API
    net: sched: use flow block API
    net: flow_offload: add flow_block_cb_{priv, incref, decref}()
    net: flow_offload: add list handling functions
    net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
    net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
    net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
    net: flow_offload: add flow_block_cb_setup_simple()
    net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
    net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
    ...

    Linus Torvalds
     

10 Jul, 2019

1 commit

  • Pull Documentation updates from Jonathan Corbet:
    "It's been a relatively busy cycle for docs:

    - A fair pile of RST conversions, many from Mauro. These create more
    than the usual number of simple but annoying merge conflicts with
    other trees, unfortunately. He has a lot more of these waiting on
    the wings that, I think, will go to you directly later on.

    - A new document on how to use merges and rebases in kernel repos,
    and one on Spectre vulnerabilities.

    - Various improvements to the build system, including automatic
    markup of function() references because some people, for reasons I
    will never understand, were of the opinion that
    :c:func:``function()`` is unattractive and not fun to type.

    - We now recommend using sphinx 1.7, but still support back to 1.4.

    - Lots of smaller improvements, warning fixes, typo fixes, etc"

    * tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
    docs: automarkup.py: ignore exceptions when seeking for xrefs
    docs: Move binderfs to admin-guide
    Disable Sphinx SmartyPants in HTML output
    doc: RCU callback locks need only _bh, not necessarily _irq
    docs: format kernel-parameters -- as code
    Doc : doc-guide : Fix a typo
    platform: x86: get rid of a non-existent document
    Add the RCU docs to the core-api manual
    Documentation: RCU: Add TOC tree hooks
    Documentation: RCU: Rename txt files to rst
    Documentation: RCU: Convert RCU UP systems to reST
    Documentation: RCU: Convert RCU linked list to reST
    Documentation: RCU: Convert RCU basic concepts to reST
    docs: filesystems: Remove uneeded .rst extension on toctables
    scripts/sphinx-pre-install: fix out-of-tree build
    docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
    Documentation: PGP: update for newer HW devices
    Documentation: Add section about CPU vulnerabilities for Spectre
    Documentation: platform: Delete x86-laptop-drivers.txt
    docs: Note that :c:func: should no longer be used
    ...

    Linus Torvalds
     

19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 48 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Enrico Weigelt
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

18 Jun, 2019

1 commit

    Vhost_net was known to suffer from HOL [1] issues which are not easy to
    fix. Several downstreams disable the feature by default. What's more,
    the datapath was split, and the datacopy path recently gained batching
    and XDP support, which makes it faster than the zerocopy path for
    small packet transmission.

    It looks to me that disabling zerocopy by default is more appropriate
    (a sketch follows this entry). It could be enabled by default again in
    the future if we fix the above issues.

    [1] https://patchwork.kernel.org/patch/3787671/

    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
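
    A sketch of the change, assuming the existing experimental_zcopytx
    module parameter in drivers/vhost/net.c:

        /* Was 1: zerocopy TX enabled by default. Now default to off; it
         * can still be enabled via the module parameter. */
        static int experimental_zcopytx;
        module_param(experimental_zcopytx, int, 0444);
        MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
                                               " 1 -Enable; 0 - Disable");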
     

09 Jun, 2019

1 commit

  • Mostly due to x86 and acpi conversion, several documentation
    links are still pointing to the old file. Fix them.

    Signed-off-by: Mauro Carvalho Chehab
    Reviewed-by: Wolfram Sang
    Reviewed-by: Sven Van Asbroeck
    Reviewed-by: Bhupesh Sharma
    Acked-by: Mark Brown
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

07 Jun, 2019

1 commit

  • Clang warns:

    drivers/vhost/vhost.c:2085:5: warning: macro expansion producing
    'defined' has undefined behavior [-Wexpansion-to-defined]
    #if VHOST_ARCH_CAN_ACCEL_UACCESS
    ^
    drivers/vhost/vhost.h:98:38: note: expanded from macro
    'VHOST_ARCH_CAN_ACCEL_UACCESS'
    #define VHOST_ARCH_CAN_ACCEL_UACCESS defined(CONFIG_MMU_NOTIFIER) && \
    ^

    It's being pedantic for the sake of portability, but the fix is easy
    enough.

    Rework the definition of VHOST_ARCH_CAN_ACCEL_UACCESS to expand to a
    constant (a sketch follows this entry).

    Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
    Link: https://github.com/ClangBuiltLinux/linux/issues/508
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Nathan Chancellor
    Tested-by: Nathan Chancellor

    Michael S. Tsirkin
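
    A sketch of the rework; the config condition shown reuses the
    ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE rule mentioned in the metadata-access
    patch below and should be treated as illustrative:

        /* Before (undefined behavior: the macro expansion produces
         * 'defined'):
         *
         *   #define VHOST_ARCH_CAN_ACCEL_UACCESS defined(CONFIG_MMU_NOTIFIER) && ...
         *
         * After: evaluate the condition once and expand to a constant. */
        #if defined(CONFIG_MMU_NOTIFIER) && ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 0
        #define VHOST_ARCH_CAN_ACCEL_UACCESS 1
        #else
        #define VHOST_ARCH_CAN_ACCEL_UACCESS 0
        #endif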
     

06 Jun, 2019

6 commits

    It was noticed that the copy_to/from_user() friends used to access
    virtqueue metadata tend to be very expensive for dataplane
    implementations like vhost, since they involve lots of software checks,
    speculation barriers, and hardware feature toggling (e.g. SMAP). The
    extra cost is more obvious when transferring small packets, since the
    time spent on metadata accesses becomes more significant.

    This patch tries to eliminate those overheads by accessing the metadata
    through a direct mapping of those pages. Invalidation callbacks are
    implemented for co-operation with general VM management (swap, KSM,
    THP or NUMA balancing). We try to get the direct mapping of vq
    metadata before each round of packet processing if it doesn't exist.
    If we fail, we simply fall back to the copy_to/from_user() friends.

    The invalidation and the direct-mapping accesses are synchronized
    through a spinlock and RCU. All metadata accesses through the direct
    map are protected by RCU, and setup or invalidation is done under the
    spinlock.

    This method does not work for highmem pages, which require a temporary
    mapping, so we just fall back to normal copy_to/from_user() there. Nor
    does it work for architectures with virtually tagged caches, since
    extra cache flushing would be needed to eliminate the alias, resulting
    in complex logic and bad performance; for those archs, this patch
    simply goes with the copy_to/from_user() friends. This is done by
    ruling out the kernel mapping code through
    ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.

    Note that this is only done when device IOTLB is not enabled. We could
    use a similar method to optimize IOTLB in the future. (A sketch of the
    fallback pattern follows this entry.)

    Tests show at most about a 23% improvement in TX PPS when using
    virtio-user + vhost_net + xdp1 + TAP on a 2.6GHz Broadwell:

                 SMAP on | SMAP off
        Before:  5.2Mpps | 7.1Mpps
        After:   6.4Mpps | 8.2Mpps

    Cc: Andrea Arcangeli
    Cc: James Bottomley
    Cc: Christoph Hellwig
    Cc: David Miller
    Cc: Jerome Glisse
    Cc: linux-mm@kvack.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-parisc@vger.kernel.org
    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
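
    A sketch of the access pattern described above.
    vhost_get_meta_ptr()/vhost_put_meta_ptr() are hypothetical helper names
    standing in for the RCU-protected map lookup; the fallback path uses
    the existing uaccess helpers:

        static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
        {
        #if VHOST_ARCH_CAN_ACCEL_UACCESS
                /* hypothetical helper: returns the direct-mapped used ring
                 * under RCU, or NULL if the mapping was invalidated */
                struct vring_used *used = vhost_get_meta_ptr(vq);

                if (likely(used)) {
                        used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
                        vhost_put_meta_ptr();   /* hypothetical: ends the
                                                 * RCU read-side section */
                        return 0;
                }
        #endif
                /* fall back to the classic copy_to_user() path */
                return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
                                      &vq->used->idx);
        }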
     
    Factor out the vring address and num setting, which need special care
    for accelerating vq metadata accesses.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     
  • To avoid code duplication since it will be used by kernel VA prefetching.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     
    Rename the function to be more accurate, since it actually tries to
    prefetch vq metadata addresses in the IOTLB. This will be used by a
    following patch to prefetch metadata virtual addresses.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     
    This is used to hide the metadata address from virtqueue helpers. It
    will allow implementing vmap-based fast access to metadata.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     
    Use one generic vhost_copy_to_user() instead of two dedicated
    accessors. This will simplify the conversion to fine-grained
    accessors. An improvement of about 2% in PPS was seen during a
    virtio-user txonly test.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     

27 May, 2019

4 commits

    This patch checks the weight and exits the loop if we exceed the
    weight. This is useful for preventing the scsi kthread from hogging
    the cpu, which is guest triggerable.

    This addresses CVE-2019-3900.

    Cc: Paolo Bonzini
    Cc: Stefan Hajnoczi
    Fixes: 057cbf49a1f0 ("tcm_vhost: Initial merge for vhost level target fabric driver")
    Signed-off-by: Jason Wang
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Stefan Hajnoczi

    Jason Wang
     
    This patch checks the weight and exits the loop if we exceed the
    weight. This is useful for preventing the vsock kthread from hogging
    the cpu, which is guest triggerable. The weight helps to avoid
    starving requests from one direction while the other direction is
    being processed.

    The value of the weight is picked from vhost-net.

    This addresses CVE-2019-3900.

    Cc: Stefan Hajnoczi
    Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")
    Signed-off-by: Jason Wang
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     
  • When the rx buffer is too small for a packet, we will discard the vq
    descriptor and retry it for the next packet:

    while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
                                                  &busyloop_intr))) {
            ...
            /* On overrun, truncate and discard */
            if (unlikely(headcount > UIO_MAXIOV)) {
                    iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
                    err = sock->ops->recvmsg(sock, &msg,
                                             1, MSG_DONTWAIT | MSG_TRUNC);
                    pr_debug("Discarded rx packet: len %zd\n", sock_len);
                    continue;
            }
            ...
    }

    This makes it possible to trigger an infinite while..continue loop
    through the co-operation of two VMs, like:

    1) A malicious VM1 allocates a 1 byte rx buffer and tries to slow down
       the vhost process as much as possible, e.g. by using indirect
       descriptors.
    2) A malicious VM2 generates packets to VM1 as fast as possible.

    Fix this by checking against the weight at the end of the RX and TX
    loops (a sketch follows this entry). This also eliminates other
    similar cases when:

    - userspace is consuming the packets in the meanwhile
    - a theoretical TOCTOU attack if the guest moves the avail index back
      and forth to hit the continue after vhost finds the guest just added
      new buffers

    This addresses CVE-2019-3900.

    Fixes: d8316f3991d20 ("vhost: fix total length when packets are too short")
    Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
    Signed-off-by: Jason Wang
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
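
    A sketch of the reworked RX loop, checking the weight at the end of
    each iteration (simplified; not the exact upstream diff):

        do {
                sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
                                                      &busyloop_intr);
                if (!sock_len)
                        break;
                ...
                /* handle or discard the packet as before */
        } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));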
     
    We used to have vhost_exceeds_weight() in vhost-net to:

    - prevent the vhost kthread from hogging the cpu
    - balance the time spent between TX and RX

    This function could be useful for vsock and scsi as well, so move it
    to vhost.c. A device must specify a weight, which counts the number of
    requests, and it can also specify a byte_weight, which counts the
    number of bytes that have been processed. (A sketch of the helper
    follows this entry.)

    Signed-off-by: Jason Wang
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
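
    A sketch of the common helper as moved into vhost.c (close to, but not
    necessarily identical to, the upstream code):

        bool vhost_exceeds_weight(struct vhost_virtqueue *vq,
                                  int pkts, int total_len)
        {
                struct vhost_dev *dev = vq->dev;

                if (unlikely(total_len >= dev->byte_weight) ||
                    unlikely(pkts >= dev->weight)) {
                        /* requeue the work so other devices/directions get
                         * a chance to run, then return to the caller */
                        vhost_poll_queue(&vq->poll);
                        return true;
                }

                return false;
        }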