25 Feb, 2016

1 commit

  • The problem:

    On -rt, an emulated LAPIC timer instance has the following path:

    1) hard interrupt
    2) ksoftirqd is scheduled
    3) ksoftirqd wakes up vcpu thread
    4) vcpu thread is scheduled

    This extra context switch introduces unnecessary latency in the
    LAPIC path for a KVM guest.

    The solution:

    Allow waking up vcpu thread from hardirq context,
    thus avoiding the need for ksoftirqd to be scheduled.

    Normal waitqueues make use of spinlocks, which on -RT
    are sleepable locks. Therefore, waking up a waitqueue
    waiter involves locking a sleeping lock, which
    is not allowed from hard interrupt context.

    cyclictest command line:

    This patch reduces the average latency in my tests from 14us to 11us.

    Daniel writes:
    Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
    benchmark on mainline. The test was run 1000 times on
    tip/sched/core 4.4.0-rc8-01134-g0905f04:

    ./x86-run x86/tscdeadline_latency.flat -cpu host

    with idle=poll.

    The test seems not to deliver really stable numbers though most of
    them are smaller. Paolo writes:

    "Anything above ~10000 cycles means that the host went to C1 or
    lower---the number means more or less nothing in that case.

    The mean shows an improvement indeed."

    Before:

                   min             max         mean           std
    count  1000.000000     1000.000000  1000.000000   1000.000000
    mean   5162.596000  2019270.084000  5824.491541  20681.645558
    std      75.431231   622607.723969    89.575700   6492.272062
    min    4466.000000    23928.000000  5537.926500    585.864966
    25%    5163.000000  1613252.750000  5790.132275  16683.745433
    50%    5175.000000  2281919.000000  5834.654000  23151.990026
    75%    5190.000000  2382865.750000  5861.412950  24148.206168
    max    5228.000000  4175158.000000  6254.827300  46481.048691

    After:

                   min            max         mean           std
    count  1000.000000     1000.00000  1000.000000   1000.000000
    mean   5143.511000  2076886.10300  5813.312474  21207.357565
    std      77.668322   610413.09583    86.541500   6331.915127
    min    4427.000000    25103.00000  5529.756600    559.187707
    25%    5148.000000  1691272.75000  5784.889825  17473.518244
    50%    5160.000000  2308328.50000  5832.025000  23464.837068
    75%    5172.000000  2393037.75000  5853.177675  24223.969976
    max    5222.000000  3922458.00000  6186.720500  42520.379830

    [Patch was originally based on the swait implementation found in the -rt
    tree. Daniel ported it to mainline's version and gathered the
    benchmark numbers for tscdeadline_latency test.]

    Signed-off-by: Daniel Wagner
    Acked-by: Peter Zijlstra (Intel)
    Cc: linux-rt-users@vger.kernel.org
    Cc: Boqun Feng
    Cc: Marcelo Tosatti
    Cc: Steven Rostedt
    Cc: Paul Gortmaker
    Cc: Paolo Bonzini
    Cc: "Paul E. McKenney"
    Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
    Signed-off-by: Thomas Gleixner

    Marcelo Tosatti
     

16 Jan, 2016

1 commit

  • To date, we have implemented two I/O usage models for persistent memory,
    PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
    userspace). This series adds a third, DAX-GUP, that allows DAX mappings
    to be the target of direct-i/o. It allows userspace to coordinate
    DMA/RDMA from/to persistent memory.

    The implementation leverages the ZONE_DEVICE mm-zone that went into
    4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
    and dynamically mapped by a device driver. The pmem driver, after
    mapping a persistent memory range into the system memmap via
    devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
    page-backed pmem-pfns via flags in the new pfn_t type.

    The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
    resulting pte(s) inserted into the process page tables with a new
    _PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
    off _PAGE_DEVMAP to pin the device hosting the page range active.
    Finally, get_page() and put_page() are modified to take references
    against the device driver established page mapping.
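
    The flag check described above can be sketched in a few lines. This is
    only an illustration: the demo_* names, the 4-bit flag field and the
    exact bit positions are assumptions made for the example, not taken
    from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: the top 4 bits of a 64-bit value carry flags. */
#define DEMO_PFN_FLAGS_MASK (~UINT64_C(0) << 60)
#define DEMO_PFN_DEV        (UINT64_C(1) << 61) /* assumed bit position */
#define DEMO_PFN_MAP        (UINT64_C(1) << 60) /* assumed bit position */

/* Wrapping the value in a struct keeps flagged and raw pfns from mixing. */
typedef struct { uint64_t val; } demo_pfn_t;

static demo_pfn_t demo_make_pfn(uint64_t pfn, uint64_t flags)
{
    assert((pfn & DEMO_PFN_FLAGS_MASK) == 0);    /* pfn must fit below flags */
    assert((flags & ~DEMO_PFN_FLAGS_MASK) == 0); /* only flag bits allowed */
    demo_pfn_t p = { .val = pfn | flags };
    return p;
}

/* Strip the flags to recover the raw page frame number. */
static uint64_t demo_pfn_value(demo_pfn_t p)
{
    return p.val & ~DEMO_PFN_FLAGS_MASK;
}

/* True only when both flags are set, i.e. the pfn is page-backed device
 * memory and the resulting pte should be marked as a device mapping. */
static bool demo_pfn_is_devmap(demo_pfn_t p)
{
    const uint64_t need = DEMO_PFN_DEV | DEMO_PFN_MAP;
    return (p.val & need) == need;
}
```

    A pfn carrying only one of the two flags is treated as pfn-only and
    does not get the device-mapping treatment.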

    Finally, this need for "struct page" for persistent memory requires
    memory capacity to store the memmap array. Given the memmap array for a
    large pool of persistent memory may exhaust available DRAM, introduce a
    mechanism to allocate the memmap from persistent memory. The new
    "struct vmem_altmap *" parameter to devm_memremap_pages() enables
    arch_add_memory() to use reserved pmem capacity rather than the page
    allocator.

    This patch (of 18):

    The core has developed a need for a "pfn_t" type [1]. Move the existing
    pfn_t in KVM to kvm_pfn_t [2].

    [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
    [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

    Signed-off-by: Dan Williams
    Acked-by: Christoffer Dall
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

09 Jan, 2016

4 commits


17 Dec, 2015

3 commits

  • It looks like this in action:

    kvm [5197]: vcpu0, guest rIP: 0xffffffff810187ba unhandled rdmsr: 0xc001102

    and helps to pinpoint quickly where in the guest we did the unsupported
    thing.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Paolo Bonzini

    Borislav Petkov
     
  • Per Hyper-V specification (and as required by Hyper-V-aware guests),
    SynIC provides 4 per-vCPU timers. Each timer is programmed via a pair
    of MSRs, and signals expiration by delivering a special format message
    to the configured SynIC message slot and triggering the corresponding
    synthetic interrupt.

    Note: as implemented by this patch, all periodic timers are "lazy"
    (i.e. if the vCPU wasn't scheduled for more than the timer period the
    timer events are lost), regardless of the corresponding configuration
    MSR. If deemed necessary, the "catch up" mode (the timer period is
    shortened until the timer catches up) will be implemented later.

    Changes v2:
    * Use remainder to calculate periodic timer expiration time
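
    A minimal sketch of how a remainder can give the next expiration of a
    "lazy" periodic timer. The function name and the time units are
    assumptions for illustration; the patch itself may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the previously programmed expiration time, the timer period and
 * the current time (all in the same arbitrary unit), return the next
 * expiration. Periods that elapsed while the vCPU was not scheduled are
 * skipped rather than replayed, which is the "lazy" behaviour.
 */
static uint64_t demo_next_expiration(uint64_t exp_time, uint64_t period,
                                     uint64_t now)
{
    assert(period != 0);

    if (now < exp_time)
        return exp_time; /* not expired yet, nothing to recompute */

    /* How far into the current period are we? */
    uint64_t remainder = (now - exp_time) % period;

    /* Stay on the original period grid instead of drifting to now + period. */
    return now + (period - remainder);
}
```

    For example, with an expiration at 100 and a period of 10, a vCPU that
    comes back at time 137 gets its next expiration at 140, and the events
    at 110, 120 and 130 are lost.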

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    CC: Gleb Natapov
    CC: Paolo Bonzini
    CC: "K. Y. Srinivasan"
    CC: Haiyang Zhang
    CC: Vitaly Kuznetsov
    CC: Roman Kagan
    CC: Denis V. Lunev
    CC: qemu-devel@nongnu.org
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     
  • The SynIC message protocol mandates that the message slot is claimed
    by atomically setting message type to something other than HVMSG_NONE.
    If another message is to be delivered while the slot is still busy,
    message pending flag is asserted to indicate to the guest that the
    hypervisor wants to be notified when the slot is released.

    To make sure the protocol works regardless of where the message
    sources are (kernel or userspace), clear the pending flag on SINT ACK
    notification, and let the message sources compete for the slot again.
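
    The slot protocol can be modelled with C11 atomics as follows. This is
    a userspace sketch under stated assumptions: the demo_* names and the
    struct layout are hypothetical, and HVMSG_NONE is taken to be 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_HVMSG_NONE 0u /* assumed value of HVMSG_NONE */

struct demo_msg_slot {
    _Atomic uint32_t type; /* DEMO_HVMSG_NONE means the slot is free */
    atomic_bool pending;   /* a sender wants to know when it is free */
};

static void demo_slot_init(struct demo_msg_slot *s)
{
    atomic_init(&s->type, DEMO_HVMSG_NONE);
    atomic_init(&s->pending, false);
}

/* Try to claim the slot; on failure, assert the pending flag instead. */
static bool demo_slot_try_deliver(struct demo_msg_slot *s, uint32_t type)
{
    uint32_t expected = DEMO_HVMSG_NONE;

    assert(type != DEMO_HVMSG_NONE);
    if (atomic_compare_exchange_strong(&s->type, &expected, type))
        return true;
    atomic_store(&s->pending, true);
    return false;
}

/* Guest side: mark the slot free after consuming the message. */
static void demo_slot_release(struct demo_msg_slot *s)
{
    atomic_store(&s->type, DEMO_HVMSG_NONE);
}

/* SINT ACK: clear the pending flag so that all sources compete again. */
static void demo_slot_sint_ack(struct demo_msg_slot *s)
{
    atomic_store(&s->pending, false);
}
```

    Because the claim is a single compare-and-exchange, it does not matter
    whether a sender lives in the kernel or in userspace: whoever wins the
    exchange owns the slot.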

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    CC: Gleb Natapov
    CC: Paolo Bonzini
    CC: "K. Y. Srinivasan"
    CC: Haiyang Zhang
    CC: Vitaly Kuznetsov
    CC: Roman Kagan
    CC: Denis V. Lunev
    CC: qemu-devel@nongnu.org
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     

30 Nov, 2015

2 commits

  • KVM creates debugfs files to export VM statistics to userland. To be
    able to remove them on kvm exit it tracks the files' dentries.

    Since their parent directory is also tracked and since each parent
    dentry knows its children, we can easily remove them by using
    debugfs_remove_recursive(kvm_debugfs_dir). Therefore we don't
    need the extra tracking in the kvm_stats_debugfs_item anymore.

    Signed-off-by: Janosch Frank
    Reviewed-By: Sascha Silbe
    Acked-by: Christian Borntraeger
    Signed-off-by: Christian Borntraeger

    Janosch Frank
     
  • Usually, VCPU ids match the array index. So let's try a fast
    lookup first before falling back to the slow iteration.
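
    A sketch of the fast-path-then-fallback lookup, using simplified
    stand-in types (struct demo_vm, struct demo_vcpu and the array size
    are assumptions for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_VCPUS 8

struct demo_vcpu {
    int vcpu_id;
};

struct demo_vm {
    struct demo_vcpu *vcpus[DEMO_MAX_VCPUS];
    int online_vcpus;
};

static struct demo_vcpu *demo_get_vcpu_by_id(struct demo_vm *vm, int id)
{
    assert(vm->online_vcpus <= DEMO_MAX_VCPUS);

    /* Fast path: the id usually equals the array index. */
    if (id >= 0 && id < vm->online_vcpus) {
        struct demo_vcpu *v = vm->vcpus[id];
        if (v && v->vcpu_id == id)
            return v;
    }

    /* Slow path: scan all online vCPUs. */
    for (int i = 0; i < vm->online_vcpus; i++) {
        struct demo_vcpu *v = vm->vcpus[i];
        if (v && v->vcpu_id == id)
            return v;
    }
    return NULL;
}
```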

    Suggested-by: Christian Borntraeger
    Reviewed-by: Dominik Dingel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     

26 Nov, 2015

4 commits

  • This patch makes kvm_is_visible_gfn() return bool, because the
    function only ever returns one or zero.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Paolo Bonzini

    Yaowei Bai
     
  • A new vcpu exit is introduced to notify the userspace of the
    changes in Hyper-V SynIC configuration triggered by guest writing to the
    corresponding MSRs.

    Changes v4:
    * exit into userspace only if guest writes into SynIC MSR's

    Changes v3:
    * added KVM_EXIT_HYPERV types and structs notes into docs

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Gleb Natapov
    CC: Paolo Bonzini
    CC: Roman Kagan
    CC: Denis V. Lunev
    CC: qemu-devel@nongnu.org
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     
  • SynIC (synthetic interrupt controller) is a lapic extension,
    which is controlled via MSRs and maintains for each vCPU
    - 16 synthetic interrupt "lines" (SINT's); each can be configured to
    trigger a specific interrupt vector optionally with auto-EOI
    semantics
    - a message page in the guest memory with 16 256-byte per-SINT message
    slots
    - an event flag page in the guest memory with 16 2048-bit per-SINT
    event flag areas
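
    The sizes listed above imply that each of the two pages is exactly
    4096 bytes. The layout sketch below makes that arithmetic explicit;
    the demo_* type names and the helper are hypothetical and only model
    the sizes given in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SINT_COUNT      16
#define DEMO_MSG_SLOT_BYTES  256
#define DEMO_EVENT_FLAG_BITS 2048

struct demo_synic_message_page {
    uint8_t slots[DEMO_SINT_COUNT][DEMO_MSG_SLOT_BYTES];
};

struct demo_synic_event_flags {
    uint64_t bits[DEMO_EVENT_FLAG_BITS / 64];
};

struct demo_synic_event_flags_page {
    struct demo_synic_event_flags areas[DEMO_SINT_COUNT];
};

/* 16 * 256 bytes and 16 * 2048 bits both come to one 4096-byte page. */
_Static_assert(sizeof(struct demo_synic_message_page) == 4096,
               "message page must be 4096 bytes");
_Static_assert(sizeof(struct demo_synic_event_flags_page) == 4096,
               "event flags page must be 4096 bytes");

/* Set one event flag for a SINT; returns true if the bit was newly set. */
static bool demo_set_event_flag(struct demo_synic_event_flags_page *page,
                                unsigned sint, unsigned flag)
{
    assert(sint < DEMO_SINT_COUNT && flag < DEMO_EVENT_FLAG_BITS);

    uint64_t *word = &page->areas[sint].bits[flag / 64];
    uint64_t mask = UINT64_C(1) << (flag % 64);
    bool newly_set = !(*word & mask);

    *word |= mask;
    return newly_set;
}
```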

    The host triggers a SINT whenever it delivers a new message to the
    corresponding slot or flips an event flag bit in the corresponding area.
    The guest informs the host that it can try delivering a message by
    explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
    MSR.

    The userspace (qemu) triggers interrupts and receives EOM notifications
    via irqfd with resampler; for that, a GSI is allocated for each
    configured SINT, and irq_routing api is extended to support GSI-SINT
    mapping.

    Changes v4:
    * added activation of SynIC by vcpu KVM_ENABLE_CAP
    * added per SynIC active flag
    * added deactivation of APICv upon SynIC activation

    Changes v3:
    * added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
    docs

    Changes v2:
    * do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
    * add Hyper-V SynIC vectors into EOI exit bitmap
    * Hyper-V SynIC SINT MSR write logic simplified

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Gleb Natapov
    CC: Paolo Bonzini
    CC: Roman Kagan
    CC: Denis V. Lunev
    CC: qemu-devel@nongnu.org
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     
  • kvm_arch_irq_routing_update() should really be named
    kvm_arch_post_irq_routing_update(), as it is called at the end
    of the irq routing update.

    This renaming frees the kvm_arch_irq_routing_update() function name
    for a weak function which will be used
    to update mappings for arch-specific irq routing entries
    (in particular, the upcoming Hyper-V synthetic interrupts).

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Gleb Natapov
    CC: Paolo Bonzini
    CC: Roman Kagan
    CC: Denis V. Lunev
    CC: qemu-devel@nongnu.org
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     

19 Nov, 2015

1 commit


10 Nov, 2015

1 commit

  • VMX and SVM calculate the TSC scaling ratio using similar logic, so this
    patch generalizes it to a common TSC scaling function.
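
    A rough sketch of fixed-point TSC scaling: multiply by a ratio, then
    shift right by the number of fractional bits. The demo_* names are
    hypothetical, and the 128-bit intermediate relies on the
    unsigned __int128 extension of GCC and Clang on 64-bit targets.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) >> shift, computed without losing the high bits of the product. */
static uint64_t demo_mul_u64_u64_shr(uint64_t a, uint64_t b, unsigned shift)
{
    assert(shift < 128);
    return (uint64_t)(((unsigned __int128)a * b) >> shift);
}

/*
 * Scale a TSC value by a fixed-point ratio that has frac_bits fractional
 * bits. The number of fractional bits differs between vendors, so it is
 * passed in as a parameter here.
 */
static uint64_t demo_scale_tsc(uint64_t tsc, uint64_t ratio,
                               unsigned frac_bits)
{
    assert(frac_bits < 64);
    return demo_mul_u64_u64_shr(tsc, ratio, frac_bits);
}
```

    With 32 fractional bits, a ratio of 1.5 is encoded as 3 << 31, so a
    TSC value of 1000 scales to 1500.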

    Signed-off-by: Haozhong Zhang
    [Inline the multiplication and shift steps into mul_u64_u64_shr. Remove
    BUG_ON. - Paolo]
    Signed-off-by: Paolo Bonzini

    Haozhong Zhang
     

04 Nov, 2015

3 commits


23 Oct, 2015

1 commit

  • Sometimes it is useful for architecture implementations of KVM to know
    when the VCPU thread is about to block or when it comes back from
    blocking (arm/arm64 needs to know this to properly implement timers, for
    example).

    Therefore provide a generic architecture callback function in line with
    what we do elsewhere for KVM generic-arch interactions.

    Reviewed-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     

16 Oct, 2015

2 commits

  • Allow for arch-specific interrupt types to be set. For that, add
    kvm_arch_set_irq() which takes interrupt type-specific action if it
    recognizes the interrupt type given, and -EWOULDBLOCK otherwise.

    The default implementation always returns -EWOULDBLOCK.
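
    The "default implementation that an architecture may override" pattern
    can be sketched with a weak symbol. The demo_* names and the argument
    list are hypothetical, and __attribute__((weak)) is a GCC/Clang
    extension rather than standard C.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Default: no interrupt type is recognized. An architecture would supply
 * a non-weak definition with the same name to handle its own types.
 */
__attribute__((weak)) int demo_arch_set_irq(int irq_type, int level)
{
    (void)irq_type;
    (void)level;
    return -EWOULDBLOCK;
}

/* Caller: report whether the arch hook took a type-specific action. */
static bool demo_set_irq_handled_by_arch(int irq_type, int level)
{
    assert(level == 0 || level == 1);
    return demo_arch_set_irq(irq_type, level) != -EWOULDBLOCK;
}
```

    With only the weak default linked in, every interrupt type is reported
    as unhandled and the caller has to fall back to its generic path.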

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Vitaly Kuznetsov
    CC: "K. Y. Srinivasan"
    CC: Gleb Natapov
    CC: Paolo Bonzini
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     
  • Factor out kvm_notify_acked_gsi() helper to iterate over EOI listeners
    and notify those matching the given gsi.
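
    A simplified model of such a helper, using a plain singly linked list
    in place of the kernel's listener list. All demo_* names are
    hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_ack_notifier {
    int gsi;
    unsigned acked; /* how many times this listener was notified */
    void (*irq_acked)(struct demo_ack_notifier *n);
    struct demo_ack_notifier *next;
};

/* Example callback: just count the notifications. */
static void demo_count_ack(struct demo_ack_notifier *n)
{
    n->acked++;
}

/* Notify every listener registered for gsi; return how many matched. */
static unsigned demo_notify_acked_gsi(struct demo_ack_notifier *head, int gsi)
{
    unsigned notified = 0;

    for (struct demo_ack_notifier *n = head; n != NULL; n = n->next) {
        if (n->gsi == gsi) {
            n->irq_acked(n);
            notified++;
        }
    }
    return notified;
}
```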

    It will be reused in the upcoming Hyper-V SynIC implementation.

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Vitaly Kuznetsov
    CC: "K. Y. Srinivasan"
    CC: Gleb Natapov
    CC: Paolo Bonzini
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     

01 Oct, 2015

7 commits

  • This patch updates the Posted-Interrupts Descriptor when vCPU
    is blocked.

    pre-block:
    - Add the vCPU to the blocked per-CPU list
    - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR

    post-block:
    - Remove the vCPU from the per-CPU list
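
    The two hooks can be modelled as below. The types, the list handling
    and the vector value are assumptions for illustration, and locking is
    left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_POSTED_INTR_WAKEUP_VECTOR 0xf1 /* assumed value */

struct demo_vcpu {
    uint8_t nv;                     /* notification vector in the descriptor */
    struct demo_vcpu *next_blocked; /* link in the per-CPU blocked list */
};

struct demo_cpu {
    struct demo_vcpu *blocked;      /* head of the blocked-vCPU list */
};

/* pre-block: queue the vCPU on this CPU and redirect notifications. */
static void demo_pre_block(struct demo_cpu *cpu, struct demo_vcpu *v)
{
    v->next_blocked = cpu->blocked;
    cpu->blocked = v;
    v->nv = DEMO_POSTED_INTR_WAKEUP_VECTOR;
}

/* post-block: unlink the vCPU from the per-CPU list. Anything else done
 * on wakeup is not described in the text and is omitted here. */
static void demo_post_block(struct demo_cpu *cpu, struct demo_vcpu *v)
{
    for (struct demo_vcpu **pp = &cpu->blocked; *pp; pp = &(*pp)->next_blocked) {
        if (*pp == v) {
            *pp = v->next_blocked;
            v->next_blocked = NULL;
            return;
        }
    }
}

static bool demo_is_blocked(const struct demo_cpu *cpu,
                            const struct demo_vcpu *v)
{
    for (const struct demo_vcpu *it = cpu->blocked; it; it = it->next_blocked)
        if (it == v)
            return true;
    return false;
}
```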

    Signed-off-by: Feng Wu
    [Concentrate invocation of pre/post-block hooks to vcpu_block. - Paolo]
    Signed-off-by: Paolo Bonzini

    Feng Wu
     
  • This patch adds an arch-specific hook 'arch_update' in
    'struct kvm_kernel_irqfd'. On the Intel side, it is used to
    update the IRTE when VT-d posted interrupts are used.

    Signed-off-by: Feng Wu
    Reviewed-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Feng Wu
     
  • This patch introduces
    - kvm_arch_irq_bypass_add_producer
    - kvm_arch_irq_bypass_del_producer
    - kvm_arch_irq_bypass_stop
    - kvm_arch_irq_bypass_start

    They make it possible to specialize the KVM IRQ bypass consumer in
    case CONFIG_KVM_HAVE_IRQ_BYPASS is set.

    Signed-off-by: Eric Auger
    [Add weak implementations of the callbacks. - Feng]
    Signed-off-by: Feng Wu
    Reviewed-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Eric Auger
     
  • The HV_X64_MSR_RESET MSR is used by Hyper-V based Windows guests
    to have the hypervisor reset the guest VM.

    This is necessary to support loading of winhv.sys in the guest, which
    in turn is required to support Windows VMBus.

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Paolo Bonzini
    CC: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     
  • In order to support a userspace IOAPIC interacting with an in kernel
    APIC, the EOI exit bitmaps need to be configurable.

    If the IOAPIC is in userspace (i.e. the irqchip has been split), the
    EOI exit bitmaps will be set whenever the GSI Routes are configured.
    In particular, the low MSI routes are reservable for userspace
    IOAPICs. For these MSI routes, the EOI Exit bit corresponding to the
    destination vector of the route will be set for the destination VCPU.
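
    A per-VCPU EOI exit bitmap with one bit per interrupt vector can be
    sketched as follows. The demo_* names are hypothetical; the 256-bit
    size simply covers all 8-bit vector numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per possible vector (0..255). */
struct demo_eoi_exit_bitmap {
    uint64_t bits[4];
};

/* Called when a route targets this VCPU with the given vector. */
static void demo_eoi_exit_set(struct demo_eoi_exit_bitmap *bm, uint8_t vector)
{
    bm->bits[vector / 64] |= UINT64_C(1) << (vector % 64);
}

/* On EOI of a vector: does the (userspace) IOAPIC need to be told? */
static bool demo_eoi_exit_test(const struct demo_eoi_exit_bitmap *bm,
                               uint8_t vector)
{
    return (bm->bits[vector / 64] >> (vector % 64)) & 1;
}
```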

    The intention is for the userspace IOAPICs to use the reservable MSI
    routes to inject interrupts into the guest.

    This is a slight abuse of the notion of an MSI Route, given that MSIs
    classically bypass the IOAPIC. It might be worthwhile to add an
    additional route type to improve clarity.

    Compile tested for Intel x86.

    Signed-off-by: Steve Rutherford
    Signed-off-by: Paolo Bonzini

    Steve Rutherford
     
  • Adds KVM_EXIT_IOAPIC_EOI which allows the kernel to EOI
    level-triggered IOAPIC interrupts.

    Uses a per VCPU exit bitmap to decide whether or not the IOAPIC needs
    to be informed (which is identical to the EOI_EXIT_BITMAP field used
    by modern x86 processors, but can also be used to elide kvm IOAPIC EOI
    exits on older processors).

    [Note: A prototype using ResampleFDs found that decoupling the EOI
    from the VCPU's thread made it possible for the VCPU to not see a
    recent EOI after reentering the guest. This does not match real
    hardware.]

    Compile tested for Intel x86.

    Signed-off-by: Steve Rutherford
    Signed-off-by: Paolo Bonzini

    Steve Rutherford
     
  • First patch in a series which enables the relocation of the
    PIC/IOAPIC to userspace.

    Adds capability KVM_CAP_SPLIT_IRQCHIP;

    KVM_CAP_SPLIT_IRQCHIP enables the construction of LAPICs without the
    rest of the irqchip.

    Compile tested for x86.

    Signed-off-by: Steve Rutherford
    Suggested-by: Andrew Honig
    Signed-off-by: Paolo Bonzini

    Steve Rutherford
     

06 Sep, 2015

1 commit


30 Jul, 2015

1 commit


29 Jul, 2015

1 commit


23 Jul, 2015

2 commits


10 Jul, 2015

1 commit

  • If there are no assigned devices, the guest PAT is not providing
    any useful information and can be overridden to writeback; VMX
    always does this because it has the "IPAT" bit in its extended
    page table entries, but SVM does not have anything similar.
    Hook into VFIO and legacy device assignment so that they
    provide this information to KVM.

    Reviewed-by: Alex Williamson
    Tested-by: Joerg Roedel
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

05 Jun, 2015

2 commits

  • Only two ioctls have to be modified; the address space id is
    placed in the higher 16 bits of their slot id argument.
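
    The encoding can be sketched like this, assuming a 32-bit slot
    argument (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Pack an address space id into the upper 16 bits of the slot argument. */
static uint32_t demo_make_slot_arg(uint16_t as_id, uint16_t slot)
{
    return ((uint32_t)as_id << 16) | slot;
}

/* Upper 16 bits: address space id. */
static uint16_t demo_slot_arg_as_id(uint32_t arg)
{
    return (uint16_t)(arg >> 16);
}

/* Lower 16 bits: slot id within that address space. */
static uint16_t demo_slot_arg_slot(uint32_t arg)
{
    return (uint16_t)(arg & 0xffff);
}
```

    With this layout, callers that only ever use address space 0 keep
    passing a plain slot id, since the upper 16 bits are then zero.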

    As of this patch, no architecture defines more than one
    address space; x86 will be the first.

    Reviewed-by: Radim Krčmář
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • We need to hide SMRAM from guests not running in SMM. Therefore, all
    uses of kvm_read_guest* and kvm_write_guest* must be changed to use
    different address spaces, depending on whether the VCPU is in system
    management mode. We need to introduce a new family of functions for
    this purpose.

    For now, the VCPU-based functions have the same behavior as the
    existing per-VM ones; they just accept a different type for the
    first argument. Later, however, they will be changed to use one of many
    "struct kvm_memslots" stored in struct kvm, through an architecture hook.
    VM-based functions will unconditionally use the first memslots pointer.

    Whenever possible, this patch introduces slot-based functions with an
    __ prefix, with two wrappers for generic and vcpu-based actions.
    The exceptions are kvm_read_guest and kvm_write_guest, which are copied
    into the new functions kvm_vcpu_read_guest and kvm_vcpu_write_guest.
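
    The planned split between VM-based and VCPU-based accessors can be
    sketched as below. Everything here is a hypothetical model: the
    demo_* types, the count of two address spaces and the hook that maps
    SMM to index 1 are assumptions, not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_ADDRESS_SPACE_NUM 2

struct demo_memslots {
    int id;
};

struct demo_vm {
    struct demo_memslots memslots[DEMO_ADDRESS_SPACE_NUM];
};

struct demo_vcpu {
    struct demo_vm *vm;
    bool in_smm;
};

/* Architecture hook: pick an address space from the VCPU state. */
static int demo_arch_vcpu_memslots_id(const struct demo_vcpu *vcpu)
{
    return vcpu->in_smm ? 1 : 0;
}

/* VM-based accessors always use the first memslots pointer. */
static struct demo_memslots *demo_vm_memslots(struct demo_vm *vm)
{
    return &vm->memslots[0];
}

/* VCPU-based accessors go through the architecture hook. */
static struct demo_memslots *demo_vcpu_memslots(struct demo_vcpu *vcpu)
{
    int id = demo_arch_vcpu_memslots_id(vcpu);

    assert(id >= 0 && id < DEMO_ADDRESS_SPACE_NUM);
    return &vcpu->vm->memslots[id];
}
```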

    Reviewed-by: Radim Krčmář
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

04 Jun, 2015

1 commit

  • This patch adds the interface between x86.c and the emulator: the
    SMBASE register, a new emulator flag, the RSM instruction. It also
    adds a new request bit that will be used by the KVM_SMI ioctl.

    Reviewed-by: Radim Krčmář
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

28 May, 2015

1 commit