02 Aug, 2016

6 commits

  • Detect and fail early if long wrap around is triggered.

    Signed-off-by: Michael S. Tsirkin

    Michael S. Tsirkin
     
  • This patch implements a device IOTLB for vhost. This could be used
    with a userspace (QEMU) implementation of DMA remapping to emulate
    an IOMMU for the guest.

    The idea is simple: cache the translations in a software device IOTLB
    (implemented as an interval tree) in vhost and use the vhost_net
    file descriptor for reporting IOTLB misses and for IOTLB
    update/invalidation. When vhost hits an IOTLB miss, the fault
    address, size, and access type can be read from the file. After
    userspace finishes the translation, it writes the translated address
    to the vhost_net file to update the device IOTLB.

    When the device IOTLB is enabled by setting VIRTIO_F_IOMMU_PLATFORM,
    all vq addresses set by ioctl are treated as IOVAs instead of virtual
    addresses, and accesses can only be done through the IOTLB instead of
    by direct userspace memory access. Before each round of vq
    processing, all vq metadata is prefetched into the device IOTLB to
    make sure no translation fault happens during vq processing.

    In most cases, virtqueues are contiguous even in virtual address
    space, so the IOTLB translation for the virtqueue itself may make
    things a little slower. We might add a fast-path cache on top of
    this patch.

    Signed-off-by: Jason Wang
    [mst: use virtio feature bit: VHOST_F_DEVICE_IOTLB -> VIRTIO_F_IOMMU_PLATFORM ]
    [mst: fix build warnings ]
    Signed-off-by: Michael S. Tsirkin
    [ weiyj.lk: missing unlock on error ]
    Signed-off-by: Wei Yongjun

    Jason Wang
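
As a rough illustration of the miss/update protocol described above, here is a minimal userspace sketch of a software device IOTLB. The kernel implementation stores entries in an interval tree; a small fixed array stands in for it here, and all names (iotlb_translate, iotlb_update, miss_iova) are hypothetical, not the real vhost API.

```c
#include <stdint.h>
#include <stddef.h>

struct iotlb_entry {
    uint64_t iova_start, iova_end;  /* guest IOVA range, inclusive */
    uint64_t uaddr;                 /* translated userspace address */
};

#define IOTLB_MAX 16
static struct iotlb_entry iotlb[IOTLB_MAX];
static int iotlb_n;

/* Last reported miss, as userspace would read it from the vhost fd. */
static uint64_t miss_iova;

/* Userspace "writes" a translation back to update the device IOTLB. */
int iotlb_update(uint64_t iova, uint64_t size, uint64_t uaddr)
{
    if (iotlb_n == IOTLB_MAX)
        return -1;  /* a real cache would evict (e.g. LRU) instead */
    iotlb[iotlb_n++] = (struct iotlb_entry){ iova, iova + size - 1, uaddr };
    return 0;
}

/* Translate one IOVA; on a miss, record the fault for userspace. */
int iotlb_translate(uint64_t iova, uint64_t *uaddr)
{
    for (int i = 0; i < iotlb_n; i++) {
        if (iova >= iotlb[i].iova_start && iova <= iotlb[i].iova_end) {
            *uaddr = iotlb[i].uaddr + (iova - iotlb[i].iova_start);
            return 0;               /* hit */
        }
    }
    miss_iova = iova;               /* reported to userspace via the fd */
    return -1;                      /* miss: caller waits for an update */
}
```

A real cache would also support range invalidation and an eviction policy; the point here is only the miss-report/translate-back round trip.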
     
  • The current pre-sorted memory region array has some limitations for
    the future device IOTLB conversion:

    1) adding or removing a single region needs extra work, and is
    expected to be slow because of sorting or memory re-allocation.
    2) removing a large range which may intersect several regions of
    different sizes needs extra work.
    3) a replacement policy like LRU needs tricks.

    To overcome the above shortcomings, this patch converts it to an
    interval tree, which easily addresses the above issues with almost no
    extra work.

    The patch could be used for:

    - Extending the current API so that userspace only sends diffs of the
    memory table.
    - Simplifying the device IOTLB implementation.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     
  • This patch introduces vhost memory accessors which are just wrappers
    around the userspace address access helpers. This is a requirement
    for the vhost device IOTLB implementation, which will add IOTLB
    translation to those accessors.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
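
A hedged sketch of the wrapper idea: the accessors below are illustrative stand-ins (not the kernel helpers) that funnel every access through a single vhost_translate() hook, which starts as the identity and is exactly where a later IOTLB patch can slot in address translation.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a future iova -> host address translation step. */
static void *vhost_translate(void *addr)
{
    return addr;    /* identity for now; the IOTLB patch hooks in here */
}

/* All reads and writes go through the wrappers, never directly. */
int vhost_get_u16(uint16_t *dst, void *src)
{
    memcpy(dst, vhost_translate(src), sizeof *dst);
    return 0;
}

int vhost_put_u16(void *dst, uint16_t val)
{
    memcpy(vhost_translate(dst), &val, sizeof val);
    return 0;
}
```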
     
  • We currently use a spinlock to synchronize the work list, which may
    cause unnecessary contention. This patch switches to llist to remove
    this contention. Pktgen tests show about a 5% improvement:

    Before:
    ~1300000 pps
    After:
    ~1370000 pps

    Signed-off-by: Jason Wang
    Reviewed-by: Michael S. Tsirkin
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
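
The llist pattern can be sketched in userspace with C11 atomics: producers push work items with a lock-free compare-and-swap, and the consumer detaches the whole list with a single atomic exchange, so neither side takes a spinlock. The names mirror the kernel's llist API, but this is an illustrative re-implementation, not the kernel code.

```c
#include <stdatomic.h>
#include <stddef.h>

struct llist_node {
    struct llist_node *next;
};

struct llist_head {
    _Atomic(struct llist_node *) first;
};

/* Lock-free push: retry the CAS until we win the race for the head. */
void llist_add(struct llist_node *node, struct llist_head *head)
{
    struct llist_node *first = atomic_load(&head->first);
    do {
        node->next = first;
    } while (!atomic_compare_exchange_weak(&head->first, &first, node));
}

/* Grab the entire list in one shot; entries come back LIFO. */
struct llist_node *llist_del_all(struct llist_head *head)
{
    return atomic_exchange(&head->first, NULL);
}
```

Because the detached entries are in LIFO order, a consumer that needs FIFO processing reverses the list once after detaching it.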
     
  • We used to implement work flushing by tracking the queued seq, the
    done seq, and the number of flushes in flight. This patch simplifies
    this by implementing flushing as another kind of vhost work with a
    completion. This will be used by the lockless enqueuing patch.

    Signed-off-by: Jason Wang
    Reviewed-by: Michael S. Tsirkin
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
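
The completion-based flush can be sketched with pthreads: instead of tracking sequence numbers, flush queues its own work item carrying a condition-variable completion and waits for the worker to execute it; since the queue is FIFO, every work queued earlier has run by then. This is an illustrative userspace analogue with made-up names, not the kernel code.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct work { void (*fn)(struct work *); struct work *next; };

struct flush_work {
    struct work work;
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool done;
};

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;
static struct work *q_head, *q_tail;

/* FIFO enqueue, so a flush work runs after everything queued before it. */
static void queue_work(struct work *w)
{
    pthread_mutex_lock(&q_lock);
    w->next = NULL;
    if (q_tail)
        q_tail->next = w;
    else
        q_head = w;
    q_tail = w;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (!q_head)
            pthread_cond_wait(&q_cond, &q_lock);
        struct work *w = q_head;
        q_head = w->next;
        if (!q_head)
            q_tail = NULL;
        pthread_mutex_unlock(&q_lock);
        w->fn(w);
    }
    return NULL;
}

/* The flush work itself just fires its completion. */
static void flush_fn(struct work *w)
{
    struct flush_work *fw = (struct flush_work *)w;
    pthread_mutex_lock(&fw->lock);
    fw->done = true;
    pthread_cond_signal(&fw->cond);
    pthread_mutex_unlock(&fw->lock);
}

/* Flush = queue a completion-carrying work and wait for it to run. */
void work_flush(void)
{
    struct flush_work fw = {
        .work = { flush_fn, NULL },
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .cond = PTHREAD_COND_INITIALIZER,
        .done = false,
    };
    queue_work(&fw.work);
    pthread_mutex_lock(&fw.lock);
    while (!fw.done)
        pthread_cond_wait(&fw.cond, &fw.lock);
    pthread_mutex_unlock(&fw.lock);
}
```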
     

11 Mar, 2016

3 commits

  • This patch polls the newly added tx buffers or the socket receive
    queue for a while at the end of tx/rx processing. The maximum time
    spent on polling is specified through a new kind of vring ioctl.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
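
The bounded polling loop can be sketched as follows. This is an illustrative userspace analogue (busy_poll, now_us, and the predicate are made-up names), spinning on a "work available" check until it succeeds or a caller-supplied budget in microseconds expires.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdbool.h>
#include <time.h>

static unsigned long long now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000;
}

/* Returns true if work became available within the polling budget. */
bool busy_poll(bool (*work_available)(void *), void *arg,
               unsigned long long budget_us)
{
    unsigned long long deadline = now_us() + budget_us;

    do {
        if (work_available(arg))
            return true;    /* found work while spinning: no sleep taken */
    } while (now_us() < deadline);

    return false;           /* budget exhausted: fall back to blocking */
}

/* Example predicate: "work" appears once the pointed-to flag is set. */
bool flag_set(void *arg)
{
    return *(bool *)arg;
}
```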
     
  • This patch introduces a helper which returns true if we're sure that
    the available ring is empty for a specific vq. When we're not sure,
    e.g. on vq access failure, it returns false instead. This could be
    used by busy polling code to exit the busy loop.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     
  • This patch introduces a helper which gives a hint about whether or
    not there's work queued in the work list. This could be used by busy
    polling code to exit the busy loop.

    Signed-off-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Jason Wang
     

02 Mar, 2016

3 commits

  • Looking at how callers use this, maybe we should just rename init_used
    to vhost_vq_init_access. The _used suffix was a hint that we
    access the vq used ring. But maybe what callers care about is
    that it must be called after access_ok.

    Also, this function manipulates the vq->is_le field which isn't related
    to the vq used ring.

    This patch simply renames vhost_init_used() to vhost_vq_init_access() as
    suggested by Michael.

    No behaviour change.

    Signed-off-by: Greg Kurz
    Signed-off-by: Michael S. Tsirkin

    Greg Kurz
     
  • The default use case for vhost is when the host and the vring have the
    same endianness (default native endianness). But there are cases where
    they differ and vhost should byteswap when accessing the vring.

    The first case is when the host is big endian and the vring belongs to
    a virtio 1.0 device, which is always little endian.

    This is covered by the vq->is_le field. This field is initialized when
    userspace calls the VHOST_SET_FEATURES ioctl. It is reset when the device
    stops.

    We already have a vhost_init_is_le() helper, but the reset operation
    is open-coded as follows:

    vq->is_le = virtio_legacy_is_little_endian();

    It isn't clear that we are resetting vq->is_le here.

    This patch moves the code to a helper with a more explicit name.

    The other case where we may have to byteswap is when the architecture can
    switch endianness at runtime (bi-endian). If endianness differs in the host
    and in the guest, then legacy devices need to be used in cross-endian mode.

    This mode is available with CONFIG_VHOST_CROSS_ENDIAN_LEGACY=y, which
    introduces a vq->user_be field. Userspace may enable cross-endian mode
    by calling the SET_VRING_ENDIAN ioctl before the device is started. The
    cross-endian mode is disabled when the device is stopped.

    The current names of the helpers that manipulate vq->user_be are unclear.

    This patch renames those helpers to clearly show that this is cross-endian
    stuff and with explicit enable/disable semantics.

    No behaviour change.

    Signed-off-by: Greg Kurz
    Signed-off-by: Michael S. Tsirkin

    Greg Kurz
     
  • We don't want side effects. If something fails, we roll back
    vq->is_le to its previous value.

    Signed-off-by: Greg Kurz
    Signed-off-by: Michael S. Tsirkin

    Greg Kurz
     

07 Dec, 2015

2 commits


27 Jul, 2015

2 commits

callers of vhost_kvzalloc() expect the same behaviour on
    allocation error as from kmalloc/vmalloc, i.e. a NULL return
    value. So just return the value returned by vzalloc() instead of
    returning ERR_PTR(-ENOMEM).

    Fixes: 4de7255f7d2be5 ("vhost: extend memory regions allocation to vmalloc")

    Spotted-by: Dan Carpenter
    Suggested-by: Julia Lawall
    Signed-off-by: Igor Mammedov
    Signed-off-by: Michael S. Tsirkin

    Igor Mammedov
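
To see why returning ERR_PTR(-ENOMEM) broke callers: kernel-style ERR_PTR encodes an errno in the pointer value itself, so the result is non-NULL and a plain `if (!p)` check never fires. A minimal userspace re-creation of the two helpers, for illustration only:

```c
#include <stddef.h>
#include <errno.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in a pointer value. */
static inline void *ERR_PTR(long error)
{
    return (void *)error;
}

/* True for the top MAX_ERRNO addresses, i.e. encoded errors only. */
static inline int IS_ERR(const void *ptr)
{
    return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```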
     
  • While reviewing vhost log code, I found out that log_file is never
    set. Note: I haven't tested the change (QEMU doesn't use LOG_FD yet).

    Cc: stable@vger.kernel.org
    Signed-off-by: Marc-André Lureau
    Signed-off-by: Michael S. Tsirkin

    Marc-André Lureau
     

14 Jul, 2015

2 commits

  • it became possible to use a bigger number of memory slots, which is
    used by memory hotplug for registering hotplugged memory.
    However, QEMU crashes if it's used with more than ~60 pc-dimm
    devices and vhost-net enabled, since the host kernel vhost-net
    module refuses to accept more than 64 memory regions.

    Allow tweaking the limit via a max_mem_regions module parameter
    with a default value of 64 slots.

    Signed-off-by: Igor Mammedov
    Signed-off-by: Michael S. Tsirkin

    Igor Mammedov
     
  • with a large number of memory regions we could end up with
    high-order allocations, and kmalloc could fail if the host is under
    memory pressure.
    Considering that the memory regions array is used on the hot path,
    try harder to allocate using kmalloc, and if that fails, resort
    to vmalloc.
    It's still better than just failing vhost_set_memory() and
    crashing the guest when new memory is hotplugged into it.

    I'll still look at a QEMU-side solution to reduce the number of
    memory regions it feeds to vhost to make things even better, but it
    doesn't hurt for the kernel to behave smarter and not crash older
    QEMUs which could use a large number of memory regions.

    Signed-off-by: Igor Mammedov
    Signed-off-by: Michael S. Tsirkin

    Igor Mammedov
     

01 Jul, 2015

1 commit

  • For default region layouts, performance stays the same as with
    linear search, i.e. translate_desc(), which inlines find_region(),
    takes around 210 ns on average.

    But it scales better with a larger number of regions: 235 ns for
    binary search vs. 300 ns for linear search with 55 memory regions,
    and the values will be about the same when the allowed number of
    slots is increased to 509, as has been done in KVM.

    Signed-off-by: Igor Mammedov

    Signed-off-by: Michael S. Tsirkin

    Igor Mammedov
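
A hedged sketch of find_region() as a binary search over regions sorted by guest physical address; the struct layout and names are illustrative, not the exact kernel ones.

```c
#include <stdint.h>
#include <stddef.h>

struct mem_region {
    uint64_t gpa;       /* guest physical start address */
    uint64_t size;
    uint64_t uaddr;     /* userspace address of the region start */
};

/* Binary search: O(log n) vs the old linear scan's O(n). */
const struct mem_region *find_region(const struct mem_region *regions,
                                     size_t n, uint64_t addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct mem_region *r = &regions[mid];

        if (addr < r->gpa)
            hi = mid;
        else if (addr >= r->gpa + r->size)
            lo = mid + 1;
        else
            return r;   /* gpa <= addr < gpa + size */
    }
    return NULL;        /* address not covered by any region */
}
```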
     

01 Jun, 2015

1 commit

  • This patch brings cross-endian support to vhost when used to implement
    legacy virtio devices. Since it is a relatively rare situation, the
    feature availability is controlled by a kernel config option (not set
    by default).

    The vq->is_le boolean field is added to cache the endianness to be
    used for ring accesses. It defaults to native endian, as expected
    by legacy virtio devices. When the ring gets active, we force little
    endian if the device is modern. When the ring is deactivated, we
    revert to the native endian default.

    If cross-endian support is compiled in, a vq->user_be boolean field
    is added so that userspace may request a specific endianness. This
    field is used to override the default when activating the ring of a
    legacy device. It has no effect on modern devices.

    Signed-off-by: Greg Kurz

    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Cornelia Huck
    Reviewed-by: David Gibson

    Greg Kurz
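
The endianness decision described above boils down to a small predicate. The sketch below mirrors the logic (modern rings are always little-endian; legacy rings follow user_be, which defaults to the host's native order) but uses made-up helper names rather than the kernel's.

```c
#include <stdbool.h>

/* Runtime check of the host's native byte order. */
static bool host_is_little_endian(void)
{
    unsigned int x = 1;
    return *(unsigned char *)&x == 1;
}

/* user_be: the vring is big-endian from the host's point of view. */
bool vq_is_le(bool modern_device, bool user_be)
{
    return modern_device || !user_be;   /* virtio 1.0 is always LE */
}

/* Default before any SET_VRING_ENDIAN request: native endianness. */
bool default_user_be(void)
{
    return !host_is_little_endian();
}
```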
     

04 Feb, 2015

1 commit


29 Dec, 2014

1 commit

  • virtio 1.0 only requires the used ring address to be 4-byte aligned,
    while vhost required 8 bytes (the size of vring_used_elem).
    Fix up vhost to match that.

    Additionally, while vhost correctly requires 8-byte alignment for
    the log, that is unconnected to the used ring: it's a consequence
    of the log having u64 entries.
    Tweak the code to make that clearer.

    Signed-off-by: Michael S. Tsirkin

    Michael S. Tsirkin
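
The two independent alignment requirements can be sketched directly; the function names are illustrative, not the kernel's.

```c
#include <stdint.h>
#include <stdbool.h>

/* Used ring: 4-byte alignment is all virtio 1.0 demands. */
bool used_ring_aligned(uint64_t addr)
{
    return (addr & 0x3) == 0;
}

/* Log: 8-byte alignment, purely because log entries are u64. */
bool log_aligned(uint64_t addr)
{
    return (addr & (sizeof(uint64_t) - 1)) == 0;
}
```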
     

09 Dec, 2014

2 commits


09 Jun, 2014

3 commits

commit 2ae76693b8bcabf370b981cd00c36cd41d33fabc ("vhost: replace
    rcu with mutex") replaced the RCU sync for memory accesses with VQ
    mutex lock/unlock.
    This is correct since all accesses are under the VQ mutex, but
    incomplete: we still do useless RCU lock/unlock operations, and
    someone might copy this code into some other context where that
    won't be right. This use of RCU is also non-standard and hard to
    understand.
    Let's copy the pointer into each VQ structure instead; this way the
    access rules become straightforward, and there's no need for RCU
    anymore.

    Reported-by: Eric Dumazet
    Signed-off-by: Michael S. Tsirkin

    Michael S. Tsirkin
     
  • Refactor code to make sure features are only accessed
    under VQ mutex. This makes everything simpler, no need
    for RCU here anymore.

    Signed-off-by: Michael S. Tsirkin

    Michael S. Tsirkin
     
  • All memory accesses are done under some VQ mutex.
    So locking and unlocking all VQs is a faster equivalent of
    synchronize_rcu() for memory access changes.
    Some guests cause a lot of these changes, so it's helpful
    to make them faster.

    Reported-by: "Gonglei (Arei)"
    Signed-off-by: Michael S. Tsirkin

    Michael S. Tsirkin
     

07 Dec, 2013

1 commit


17 Sep, 2013

1 commit

  • the wake_up_process call is enclosed by spin_lock/unlock in
    vhost_work_queue, but it could be done outside the spin_lock.
    I have tested it with kernel 3.0.27 and a suse11-sp2 guest using
    iperf; the numbers are below:

                   original             modified
    thread_num  tp(Gbps)  vhost(%)  tp(Gbps)  vhost(%)
    1           9.59      28.82     9.59      27.49
    8           9.61      32.92     9.62      26.77
    64          9.58      46.48     9.55      38.99
    256         9.6       63.7      9.6       52.59

    Signed-off-by: Chuanyu Qin
    Signed-off-by: Michael S. Tsirkin

    Qin Chuanyu
     

04 Sep, 2013

1 commit

  • Let vhost_add_used() use vhost_add_used_n() to reduce code
    duplication. To avoid the overhead brought by __copy_to_user(), we
    use put_user() when a single used element needs to be added.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
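
The single-element fast path can be sketched in plain C: the generic path does one bulk copy of n elements (the analogue of __copy_to_user()), while the n == 1 case writes the two fields directly (the analogue of put_user()). The structs and names below are illustrative userspace stand-ins, not the kernel API.

```c
#include <stdint.h>
#include <string.h>

struct vring_used_elem {
    uint32_t id;    /* head index of the used descriptor chain */
    uint32_t len;   /* bytes written into the buffer */
};

/* Generic path: bulk-copy n elements into the used ring. */
void add_used_n(struct vring_used_elem *ring, unsigned idx,
                const struct vring_used_elem *heads, unsigned n)
{
    memcpy(&ring[idx], heads, n * sizeof(*heads));
}

/* Fast path for n == 1: two direct stores instead of a bulk copy. */
void add_used(struct vring_used_elem *ring, unsigned idx,
              uint32_t id, uint32_t len)
{
    ring[idx].id = id;
    ring[idx].len = len;
}
```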
     

21 Aug, 2013

1 commit


07 Jul, 2013

2 commits


11 Jun, 2013

1 commit


06 May, 2013

2 commits


01 May, 2013

4 commits