04 Jun, 2014

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "At over 200 commits, covering almost all supported architectures, this
    was a pretty active cycle for KVM. Changes include:

    - a lot of s390 changes: optimizations, support for migration, GDB
    support and more

    - ARM changes are pretty small: support for the PSCI 0.2 hypercall
    interface on both the guest and the host (the latter acked by
    Catalin)

    - initial POWER8 and little-endian host support

    - support for running u-boot on embedded POWER targets

    - pretty large changes to MIPS too, completing the userspace
    interface and improving the handling of virtualized timer hardware

    - for x86, a larger set of changes is scheduled for 3.17. Still, we
    have a few emulator bugfixes and support for running nested
    fully-virtualized Xen guests (para-virtualized Xen guests have
    always worked). And some optimizations too.

    The only missing architecture here is ia64. It's not a coincidence
    that support for KVM on ia64 is scheduled for removal in 3.17"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
    KVM: add missing cleanup_srcu_struct
    KVM: PPC: Book3S PR: Rework SLB switching code
    KVM: PPC: Book3S PR: Use SLB entry 0
    KVM: PPC: Book3S HV: Fix machine check delivery to guest
    KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
    KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
    KVM: PPC: Book3S HV: Fix dirty map for hugepages
    KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
    KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
    KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
    KVM: PPC: Book3S: Add ONE_REG register names that were missed
    KVM: PPC: Add CAP to indicate hcall fixes
    KVM: PPC: MPIC: Reset IRQ source private members
    KVM: PPC: Graciously fail broken LE hypercalls
    PPC: ePAPR: Fix hypercall on LE guest
    KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
    KVM: PPC: BOOK3S: Always use the saved DAR value
    PPC: KVM: Make NX bit available with magic page
    KVM: PPC: Disable NX for old magic page using guests
    KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
    ...

    Linus Torvalds
     

03 Jun, 2014

1 commit


05 May, 2014

1 commit

  • When starting lots of dataplane devices, bootup takes a very long
    time on Christian's s390 with irqfd patches. With larger setups he is
    even able to trigger timeouts in some components. It turns out that
    the KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to
    0.1 sec) with multiple CPUs. This is caused by synchronize_rcu and
    s390's HZ=100. By changing the code to use a private srcu we can
    speed things up. This patch reduces the boot time until mounting root
    from 8 to 2 seconds on my s390 guest with 100 disks.

    Uses of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
    are fine because they do not have lockdep checks (hlist_for_each_entry_rcu
    uses rcu_dereference_raw rather than rcu_dereference, and write-sides
    do not do rcu lockdep at all).

    Note that we're hardly relying on the "sleepable" part of srcu. We just
    want SRCU's faster detection of grace periods.

    Testing was done by Andrew Theurer using netperf tests STREAM, MAERTS
    and RR. The difference between results "before" and "after" the patch
    has mean -0.2% and standard deviation 0.6%. Using a paired t-test on the
    data points says that there is a 2.5% probability that the patch is the
    cause of the performance difference (rather than a random fluctuation).

    (Restricting the t-test to RR, which is the most likely to be affected,
    changes the numbers to respectively -0.3% mean, 0.7% stdev, and 8%
    probability that the numbers actually say something about the patch.
    The probability increases mostly because there are fewer data points).

    Cc: Marcelo Tosatti
    Cc: Michael S. Tsirkin
    Tested-by: Christian Borntraeger # s390
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Paolo Bonzini

    Christian Borntraeger
     

01 May, 2014

1 commit


29 Apr, 2014

1 commit

  • Currently the check below in vgic_ioaddr_overlap will always succeed,
    because the vgic dist base and vgic cpu base are still kept UNDEF
    after initialization, so the code below will always return 0.

    if (IS_VGIC_ADDR_UNDEF(dist) || IS_VGIC_ADDR_UNDEF(cpu))
    return 0;

    So, before invoking vgic_ioaddr_overlap, the corresponding base
    address needs to be set first.
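    A minimal userspace sketch of the check: once both bases are set
    before the call, the early return no longer hides real overlaps.
    The region sizes and overlap arithmetic here are illustrative, not
    the actual VGIC constants.

```c
#include <stdint.h>

#define VGIC_ADDR_UNDEF (~0ULL)
#define IS_VGIC_ADDR_UNDEF(a) ((a) == VGIC_ADDR_UNDEF)

/* Toy model: compare a 0x1000-byte dist region against a 0x2000-byte
 * cpu region. If either base is still UNDEF, there is nothing to
 * compare and the function returns "no overlap" unconditionally. */
static int vgic_ioaddr_overlap_model(uint64_t dist, uint64_t cpu)
{
    if (IS_VGIC_ADDR_UNDEF(dist) || IS_VGIC_ADDR_UNDEF(cpu))
        return 0;                       /* the early return from the bug report */
    if (dist < cpu + 0x2000 && cpu < dist + 0x1000)
        return -1;                      /* ranges overlap */
    return 0;
}
```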

    Signed-off-by: Haibin Wang
    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Haibin Wang
     

28 Apr, 2014

6 commits

  • async_pf_execute() passes tsk == current to gup(); this doesn't hurt,
    but it is unnecessary and misleading. "tsk" is only used to account
    the number of faults, and current here is a random workqueue thread.

    Signed-off-by: Oleg Nesterov
    Suggested-by: Andrea Arcangeli
    Signed-off-by: Paolo Bonzini

    Oleg Nesterov
     
  • async_pf_execute() has no reason to adopt apf->mm; gup(current, mm)
    should work just fine even if current has another or NULL ->mm.

    Recently kvm_async_page_present_sync() was added inside the "use_mm"
    section, but it seems that it doesn't need current->mm either.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Paolo Bonzini

    Oleg Nesterov
     
  • Since KVM internally represents the ICFGR registers by stuffing two
    of them into one word, the offset for accessing the internal
    representation and the one for the MMIO based access are different.
    So keep the original offset around, but adjust the internal array
    offset by one bit.
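    A sketch of the offset adjustment described above, assuming byte
    offsets with two MMIO words packed into one internal word (the
    function name is illustrative):

```c
/* Keep the MMIO offset for selecting the accessed register, but shift
 * it right by one when indexing the internal array, since KVM packs
 * two ICFGR registers into one internal word. */
static unsigned int icfgr_internal_offset(unsigned int mmio_byte_offset)
{
    return mmio_byte_offset >> 1;
}
```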

    Reported-by: Haibin Wang
    Signed-off-by: Andre Przywara
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Andre Przywara
     
  • get_user_pages(mm) is simply wrong if mm->mm_users == 0 and exit_mmap/etc
    was already called (or is in progress), mm->mm_count can only pin mm->pgd
    and mm_struct itself.

    Change kvm_setup_async_pf/async_pf_execute to inc/dec mm->mm_users.

    kvm_create_vm/kvm_destroy_vm play with ->mm_count too, but this case
    looks fine at first glance; it seems that this ->mm is only used to
    verify that current->mm == kvm->mm.
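    A toy model of the distinction (plain C, not kernel code): mm_users
    pins the address space that get_user_pages() walks, while mm_count
    pins only the structure itself, which is why gup against an mm with
    mm_users == 0 is unsafe.

```c
#include <stdbool.h>

/* Toy refcount model of struct mm_struct's two counters. */
struct mm_model {
    int mm_users;   /* pins the address space (page tables, VMAs) */
    int mm_count;   /* pins only mm->pgd and the mm_struct itself */
};

/* After exit_mmap() runs, mm_users == 0 and the page tables are gone,
 * even though mm_count may still be positive; gup must not run then. */
static bool gup_is_safe(const struct mm_model *mm)
{
    return mm->mm_users > 0;
}
```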

    Signed-off-by: Oleg Nesterov
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Oleg Nesterov
     
  • When dispatching an SGI (mode == 0), the vcpu should send the SGI to
    the CPUs listed in target_cpus. So a "break" must be added to the
    case 0 branch.
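    A toy illustration of the fall-through bug class (mode values and
    masks here are made up, not the GIC encoding): without the "break"
    after case 0, execution falls into the next case and the wrong CPU
    set is used.

```c
/* Return the CPU mask an SGI should target for a given dispatch mode. */
static int sgi_target_mask(int mode, int target_cpus, int all_cpus)
{
    int mask = 0;
    switch (mode) {
    case 0:
        mask = target_cpus;  /* the CPUs in the target_cpus list */
        break;               /* this break was the missing one */
    case 1:
        mask = all_cpus;
        break;
    }
    return mask;
}
```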

    Cc: # 3.10+
    Signed-off-by: Haibin Wang
    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Haibin Wang
     
  • As result of deprecation of MSI-X/MSI enablement functions
    pci_enable_msix() and pci_enable_msi_block() all drivers
    using these two interfaces need to be updated to use the
    new pci_enable_msi_range() or pci_enable_msi_exact()
    and pci_enable_msix_range() or pci_enable_msix_exact()
    interfaces.
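    A sketch of the *_range() contract described above (an assumed
    model, not the PCI core implementation): allocate as many vectors as
    possible within [minvec, maxvec] and return the count actually
    allocated, or a negative errno if even minvec cannot be met.

```c
#include <errno.h>

/* Model of pci_enable_msix_range()-style semantics; "available" stands
 * in for how many vectors the hardware/platform can actually provide. */
static int msix_range_model(int available, int minvec, int maxvec)
{
    if (available < minvec)
        return -ENOSPC;            /* cannot satisfy the minimum */
    return available < maxvec ? available : maxvec;
}
```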

    Signed-off-by: Alexander Gordeev
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: kvm@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Alexander Gordeev
     

24 Apr, 2014

1 commit


22 Apr, 2014

2 commits

  • …vms390/linux into queue

    Lazy storage key handling
    -------------------------
    Linux does not use the ACC and F bits of the storage key. Newer Linux
    versions also do not use the storage keys for dirty and reference
    tracking. We can optimize the guest handling for those guests for faults
    as well as page-in and page-out by simply not caring about the guest
    visible storage key. We trap guest storage key instruction to enable
    those keys only on demand.

    Migration bitmap
    ----------------

    Until now s390 never provided a proper dirty bitmap. Let's provide a
    proper migration bitmap for s390. We also change the user dirty tracking
    to a fault based mechanism. This makes the host completely independent
    from the storage keys. Long term this will allow us to back guest memory
    with large pages.

    per-VM device attributes
    ------------------------
    To avoid the introduction of new ioctls, let's provide the
    attribute semantics also on the VM "device".

    Userspace controlled CMMA
    -------------------------
    The CMMA assist is changed from "always on" to "on if requested" via
    per-VM device attributes. In addition a callback to reset all usage
    states is provided.

    Proper guest DAT handling for intercepts
    ----------------------------------------
    While instructions handled by SIE take care of all addressing aspects,
    KVM/s390 currently does not care about guest address translation of
    intercepts. This worked out fine, because
    - the s390 Linux kernel has a 1:1 mapping between kernel virtual<->real
    for all pages up to memory size
    - intercepts happen only for a small amount of cases
    - all of these intercepts happen to be in the kernel text for current
    distros

    Of course we need to be better for other intercepts, kernel modules etc.
    We provide the infrastructure and rework all in-kernel intercepts to work
    on logical addresses (paging etc) instead of real ones. The code has
    been running internally for several months now, so it is time for going
    public.

    GDB support
    -----------
    We provide breakpoints, single stepping and watchpoints.

    Fixes/Cleanups
    --------------
    - Improve program check delivery
    - Factor out the handling of transactional memory on program checks
    - Use the existing define __LC_PGM_TDB
    - Several cleanups in the lowcore structure
    - Documentation

    NOTES
    -----
    - All patches touching base s390 are either ACKed or written by the s390
    maintainers
    - One base KVM patch "KVM: add kvm_is_error_gpa() helper"
    - One patch introduces the notion of VM device attributes

    Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

    Conflicts:
    include/uapi/linux/kvm.h

    Marcelo Tosatti
     
  • Replace the kvm_s390_sync_dirty_log() stub with code to construct the KVM
    dirty_bitmap from S390 memory change bits. Also add code to properly clear
    the dirty_bitmap size when clearing the bitmap.
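    A toy model of the construction, with the test-and-clear semantics
    mentioned above (function name and the 64-page limit are
    illustrative, not the kernel code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirror each page's change bit into a dirty-bitmap word, clearing the
 * per-page state as we go so the next pass only sees new changes, as
 * gmap_test_and_clear_dirty does on s390. */
static uint64_t sync_dirty_log_model(bool *page_dirty, int npages)
{
    uint64_t bitmap = 0;
    for (int i = 0; i < npages && i < 64; i++) {
        if (page_dirty[i]) {
            bitmap |= 1ULL << i;
            page_dirty[i] = false;   /* test-and-clear */
        }
    }
    return bitmap;
}
```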

    Signed-off-by: Jason J. Herne
    CC: Dominik Dingel
    [Dominik Dingel: use gmap_test_and_clear_dirty, locking fixes]
    Signed-off-by: Christian Borntraeger

    Jason J. Herne
     

18 Apr, 2014

2 commits

  • With KVM, MMIO is much slower than PIO, due to the need to do a page
    walk and emulation. But with EPT it does not have to be: we know the
    address from the VMCS, so if the address is unique we can look up
    the eventfd directly, bypassing emulation.

    Unfortunately, this only works if userspace does not need to match on
    access length and data. The implementation adds a separate FAST_MMIO
    bus internally. This serves two purposes:
    - minimize overhead for old userspace that does not use eventfd with length = 0
    - minimize disruption in other code (since we don't know the length,
    devices on the MMIO bus only get a valid address in write, this
    way we don't need to touch all devices to teach them to handle
    an invalid length)

    At the moment, this optimization only has effect for EPT on x86.

    It will be possible to speed up MMIO for NPT and MMU using the same
    idea in the future.

    With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
    I was unable to detect any measurable slowdown to non-eventfd MMIO.

    Making MMIO faster is important for the upcoming virtio 1.0 which
    includes an MMIO signalling capability.

    The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
    pre-review and suggestions.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Marcelo Tosatti

    Michael S. Tsirkin
     
  • It is sometimes beneficial to ignore IO size and only match on the
    address. In hindsight this would have been a better default than
    matching length when KVM_IOEVENTFD_FLAG_DATAMATCH is not set. In
    particular, this kind of access can be optimized on VMX: there is no
    need to do page lookups. This can currently be done with many
    ioeventfds, but in a suboptimal way.

    However we can't change kernel/userspace ABI without risk of breaking
    some applications.
    Use len = 0 to mean "ignore length for matching" in a more optimal way.
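    The matching rule can be sketched as follows (struct and function
    names are illustrative): len == 0 registers an address-only
    wildcard, so any access size hits; otherwise both address and
    length must match.

```c
#include <stdbool.h>
#include <stdint.h>

struct ioeventfd_model {
    uint64_t addr;
    uint32_t len;   /* 0 means "ignore length for matching" */
};

static bool ioeventfd_in_range(const struct ioeventfd_model *p,
                               uint64_t addr, uint32_t len)
{
    if (addr != p->addr)
        return false;
    return p->len == 0 || p->len == len;
}
```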

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Marcelo Tosatti

    Michael S. Tsirkin
     

15 Apr, 2014

1 commit

  • Pull KVM fixes from Marcelo Tosatti:
    - Fix for guest triggerable BUG_ON (CVE-2014-0155)
    - CR4.SMAP support
    - Spurious WARN_ON() fix

    * git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: x86: remove WARN_ON from get_kernel_ns()
    KVM: Rename variable smep to cr4_smep
    KVM: expose SMAP feature to guest
    KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
    KVM: Add SMAP support when setting CR4
    KVM: Remove SMAP bit from CR4_RESERVED_BITS
    KVM: ioapic: try to recover if pending_eoi goes out of range
    KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)

    Linus Torvalds
     

08 Apr, 2014

1 commit

  • Commit 8146875de7d4 (arm, kvm: Fix CPU hotplug callback registration)
    holds the lock before calling the two functions:

    kvm_vgic_hyp_init()
    kvm_timer_hyp_init()

    and both functions call register_cpu_notifier() to register a cpu
    notifier, causing a double lock on cpu_add_remove_lock.

    Considering that both functions are only called inside
    kvm_arch_init() with cpu_add_remove_lock held, simply use
    __register_cpu_notifier() to fix the problem.

    Fixes: 8146875de7d4 (arm, kvm: Fix CPU hotplug callback registration)
    Signed-off-by: Ming Lei
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Rafael J. Wysocki

    Ming Lei
     

04 Apr, 2014

3 commits

  • The RTC tracking code tracks the cardinality of rtc_status.dest_map
    into rtc_status.pending_eoi. It has some WARN_ONs that trigger if
    pending_eoi ever becomes negative; however, these do nothing to
    recover, and bad things will happen soon after they trigger.

    When the next RTC interrupt is triggered, rtc_check_coalesced() will
    return false, but ioapic_service will find pending_eoi != 0 and
    do a BUG_ON. To avoid this, should pending_eoi ever be nonzero,
    call kvm_rtc_eoi_tracking_restore_all to recompute a correct
    dest_map and pending_eoi.
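    The recovery rule can be sketched as below (a model; in the real
    code, an out-of-range value leads to kvm_rtc_eoi_tracking_restore_all
    recomputing dest_map and pending_eoi):

```c
/* Treat a pending_eoi outside [0, max_pending] as corrupted: discard
 * it so the caller can rebuild a correct value, instead of letting a
 * later BUG_ON fire in ioapic_service. */
static int rtc_fix_pending_eoi(int pending_eoi, int max_pending)
{
    if (pending_eoi < 0 || pending_eoi > max_pending)
        return 0;   /* out of range: reset, caller recomputes */
    return pending_eoi;
}
```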

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • QE reported that they got the BUG_ON in ioapic_service to trigger.
    I cannot reproduce it, but there are two reasons why this could happen.

    The less likely but also easiest one is when kvm_irq_delivery_to_apic
    does not deliver to any APIC and returns -1.

    Because irqe.shorthand == 0, the kvm_for_each_vcpu loop in that
    function is never reached. However, you can target the similar loop in
    kvm_irq_delivery_to_apic_fast; just program a zero logical destination
    address into the IOAPIC, or an out-of-range physical destination address.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Pull VFIO updates from Alex Williamson:
    "VFIO updates for v3.15 include:

    - Allow the vfio-type1 IOMMU to support multiple domains within a
    container
    - Plumb path to query whether all domains are cache-coherent
    - Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
    - Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)

    The first patch also makes the vfio-type1 IOMMU driver completely
    independent of the bus_type of the devices it's handling, which
    enables it to be used for both vfio-pci and a future vfio-platform
    (and hopefully combinations involving both simultaneously)"

    * tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio:
    vfio: always select ANON_INODES
    kvm/vfio: Support for DMA coherent IOMMUs
    vfio: Add external user check extension interface
    vfio/type1: Add extension to test DMA cache coherence of IOMMU
    vfio/iommu_type1: Multi-IOMMU domain support

    Linus Torvalds
     

03 Apr, 2014

1 commit

  • Pull kvm updates from Paolo Bonzini:
    "PPC and ARM do not have much going on this time. Most of the cool
    stuff, instead, is in s390 and (after a few releases) x86.

    ARM has some caching fixes and PPC has transactional memory support in
    guests. MIPS has some fixes, with more probably coming in 3.16 as
    QEMU will soon get support for MIPS KVM.

    For x86 there are optimizations for debug registers, which trigger on
    some Windows games, and other important fixes for Windows guests. We
    now expose to the guest Broadwell instruction set extensions and also
    Intel MPX. There's also a fix/workaround for OS X guests, nested
    virtualization features (preemption timer), and a couple kvmclock
    refinements.

    For s390, the main news is asynchronous page faults, together with
    improvements to IRQs (floating irqs and adapter irqs) that speed up
    virtio devices"

    * tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (96 commits)
    KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8
    KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset
    KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
    KVM: PPC: Book3S HV: Return ENODEV error rather than EIO
    KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code
    KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state
    KVM: PPC: Book3S HV: Add transactional memory support
    KVM: Specify byte order for KVM_EXIT_MMIO
    KVM: vmx: fix MPX detection
    KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n
    KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE
    KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write
    KVM: s390: clear local interrupts at cpu initial reset
    KVM: s390: Fix possible memory leak in SIGP functions
    KVM: s390: fix calculation of idle_mask array size
    KVM: s390: randomize sca address
    KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
    KVM: Bump KVM_MAX_IRQ_ROUTES for s390
    KVM: s390: irq routing for adapter interrupts.
    KVM: s390: adapter interrupt sources
    ...

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull x86 LTO changes from Peter Anvin:
    "More infrastructure work in preparation for link-time optimization
    (LTO). Most of these changes are to make sure symbols accessed from
    assembly code are properly marked as visible so the linker doesn't
    remove them.

    My understanding is that the changes to support LTO are still not
    upstream in binutils, but are on the way there. This patchset should
    conclude the x86-specific changes, and remaining patches to actually
    enable LTO will be fed through the Kbuild tree (other than keeping up
    with changes to the x86 code base, of course), although not
    necessarily in this merge window"

    * 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    Kbuild, lto: Handle basic LTO in modpost
    Kbuild, lto: Disable LTO for asm-offsets.c
    Kbuild, lto: Add a gcc-ld script to let run gcc as ld
    Kbuild, lto: add ld-version and ld-ifversion macros
    Kbuild, lto: Drop .number postfixes in modpost
    Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
    lto: Disable LTO for sys_ni
    lto: Handle LTO common symbols in module loader
    lto, workaround: Add workaround for initcall reordering
    lto: Make asmlinkage __visible
    x86, lto: Disable LTO for the x86 VDSO
    initconst, x86: Fix initconst mistake in ts5500 code
    initconst: Fix initconst mistake in dcdbas
    asmlinkage: Make trace_hardirqs_on/off_caller visible
    asmlinkage, x86: Fix 32bit memcpy for LTO
    asmlinkage: Make __stack_chk_failed and memcmp visible
    asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
    asmlinkage: Make main_extable_sort_needed visible
    asmlinkage, mutex: Mark __visible
    asmlinkage: Make trace_hardirq visible
    ...

    Linus Torvalds
     

21 Mar, 2014

4 commits


19 Mar, 2014

1 commit

  • When registering a new irqfd, we call its ->poll method to collect any
    event that might have previously been pending so that we can trigger it.
    This is done under the kvm->irqfds.lock, which means the eventfd's ctx
    lock is taken under it.

    However, if we get a POLLHUP in irqfd_wakeup, we will be called with the
    ctx lock held before getting the irqfds.lock to deactivate the irqfd,
    causing lockdep to complain.

    Calling the ->poll method does not really need the irqfds.lock, so let's
    just move it after we've given up the irqfds.lock in kvm_irqfd_assign().

    Signed-off-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Cornelia Huck
     

13 Mar, 2014

1 commit

  • Both QEMU and KVM have already accumulated a significant number of
    optimizations based on the hard-coded assumption that ioapic polarity
    will always use the ActiveHigh convention, where the logical and
    physical states of level-triggered irq lines always match (i.e.,
    active(asserted) == high == 1, inactive == low == 0). QEMU guests
    are expected to follow directions given via ACPI and configure the
    ioapic with polarity 0 (ActiveHigh). However, even when misbehaving
    guests (e.g. OS X […])
    Signed-off-by: Gabriel L. Somlo
    [Move documentation to KVM_IRQ_LINE, add ia64. - Paolo]
    Signed-off-by: Paolo Bonzini

    Gabriel L. Somlo
     

27 Feb, 2014

2 commits

  • VFIO now has support for using the IOMMU_CACHE flag and a mechanism
    for an external user to test the current operating mode of the IOMMU.
    Add support for this to the kvm-vfio pseudo device so that we only
    register noncoherent DMA when necessary.

    Signed-off-by: Alex Williamson
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Acked-by: Paolo Bonzini

    Alex Williamson
     
  • Use the arch specific function kvm_arch_vcpu_runnable() to add a
    further criterion to identify a suitable vcpu to yield to during
    undirected yield processing.

    Signed-off-by: Michael Mueller
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Paolo Bonzini

    Michael Mueller
     

18 Feb, 2014

1 commit

  • When this was introduced, kvm_flush_remote_tlbs() could be called
    without holding mmu_lock. It is now acknowledged that the function
    must be called before releasing mmu_lock, and all callers have already
    been changed to do so.

    There is no need to use smp_mb() and cmpxchg() any more.

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Paolo Bonzini

    Takuya Yoshikawa
     

14 Feb, 2014

3 commits


04 Feb, 2014

1 commit


30 Jan, 2014

4 commits

  • On s390 we are not able to cancel work. Instead we will flush the work and wait for
    completion.

    Signed-off-by: Dominik Dingel
    Signed-off-by: Christian Borntraeger

    Dominik Dingel
     
  • By setting a Kconfig option, the architecture can control when guest
    notifications will be presented by the apf backend. There is the
    default batch mechanism, working as before, where the vcpu thread
    pulls in this information. In contrast, there is now a direct
    mechanism that pushes the information to the guest. This way s390
    can use an already existing architecture interface.

    The vcpu thread should still call check_completion to clean up
    leftovers.

    Signed-off-by: Dominik Dingel
    Signed-off-by: Christian Borntraeger

    Dominik Dingel
     
  • If kvm_io_bus_register_dev() fails then it returns success but it should
    return an error code.

    I also did a little cleanup like removing an impossible NULL test.
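    A sketch of the fix pattern (the helper is hypothetical; the
    parameter stands in for the result of kvm_io_bus_register_dev()):

```c
/* Propagate the error from the bus registration instead of falling
 * through to "return 0" and masking the failure. */
static int register_zone_model(int bus_register_result)
{
    if (bus_register_result < 0)
        return bus_register_result;   /* previously silently dropped */
    return 0;
}
```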

    Cc: stable@vger.kernel.org
    Fixes: 2b3c246a682c ('KVM: Make coalesced mmio use a device per zone')
    Signed-off-by: Dan Carpenter
    Signed-off-by: Paolo Bonzini

    Dan Carpenter
     
  • This patch adds a floating irq controller as a kvm_device.
    It will be necessary for migration of floating interrupts as well
    as for hardening the reset code by allowing user space to explicitly
    remove all pending floating interrupts.

    Signed-off-by: Jens Freimann
    Reviewed-by: Cornelia Huck
    Signed-off-by: Christian Borntraeger

    Jens Freimann