24 Mar, 2011

3 commits

  • As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
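
    To illustrate what the conversion targets, here is a minimal userspace
    model of a little-endian bit operation (a sketch only: the kernel's
    __set_bit_le()/test_bit_le() work on unsigned long words rather than
    bytes):

        #include <stdint.h>
        #include <stddef.h>

        /*
         * Little-endian bit numbering is defined on the byte stream, so
         * the result is the same on big- and little-endian hosts: bit 0
         * is the least significant bit of byte 0, bit 8 of byte 1, etc.
         */
        static inline void set_bit_le(size_t nr, uint8_t *addr)
        {
                addr[nr / 8] |= (uint8_t)(1u << (nr % 8));
        }

        static inline int test_bit_le(size_t nr, const uint8_t *addr)
        {
                return (addr[nr / 8] >> (nr % 8)) & 1;
        }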
     
  • asm-generic/bitops/le.h is only intended to be included directly from
    asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h,
    which implement the generic ext2 and minix bit operations.

    This stops including asm-generic/bitops/le.h directly and uses ext2
    non-atomic bit operations instead.

    It seems odd to use ext2_set_bit() in kvm, but it will be replaced
    with __set_bit_le() after little-endian bit operations are introduced
    for all architectures. This indirect step is necessary to maintain
    bisectability for the architectures which have their own little-endian
    bit operations.

    Signed-off-by: Akinobu Mita
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • KVM uses a sysdev class and a sysdev for executing kvm_suspend()
    after interrupts have been turned off on the boot CPU (during system
    suspend) and for executing kvm_resume() before turning on interrupts
    on the boot CPU (during system resume). However, since both of these
    functions ignore their arguments, the entire mechanism may be replaced
    with a struct syscore_ops object, which is simpler.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Avi Kivity

    Rafael J. Wysocki
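
    A hedged sketch of the simpler mechanism, using the real
    register_syscore_ops() API; the callback bodies are placeholders, not
    KVM's actual implementation:

        #include <linux/syscore_ops.h>

        /* Called with interrupts disabled on the boot CPU during suspend. */
        static int kvm_suspend(void)
        {
                /* placeholder: disable virtualization extensions */
                return 0;
        }

        /* Called before interrupts are re-enabled during resume. */
        static void kvm_resume(void)
        {
                /* placeholder: re-enable virtualization extensions */
        }

        static struct syscore_ops kvm_syscore_ops = {
                .suspend = kvm_suspend,
                .resume  = kvm_resume,
        };

        static int __init kvm_pm_init(void)
        {
                /* no sysdev class/device boilerplate needed */
                register_syscore_ops(&kvm_syscore_ops);
                return 0;
        }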
     

18 Mar, 2011

9 commits

  • The RCU use in kvm_irqfd_deassign is tricky: we have rcu_assign_pointer
    but no synchronize_rcu; synchronize_rcu is done by kvm_irq_routing_update,
    with which we share a spinlock.

    Fix up a comment in an attempt to make this clearer.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Avi Kivity

    Michael S. Tsirkin
     
  • Code under this lock requires non-preemptibility. Ensure this also
    holds on -rt by converting it to a raw spinlock.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
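
    The conversion pattern looks roughly like this (sketch; on PREEMPT_RT
    a raw_spinlock_t keeps spinning while a plain spinlock_t becomes a
    sleeping lock):

        #include <linux/spinlock.h>

        static DEFINE_RAW_SPINLOCK(kvm_lock);  /* was: DEFINE_SPINLOCK(kvm_lock) */

        static void critical_section(void)
        {
                raw_spin_lock(&kvm_lock);      /* was: spin_lock(&kvm_lock) */
                /* ... must stay non-preemptible, even on -rt ... */
                raw_spin_unlock(&kvm_lock);
        }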
     
  • Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
    slowdowns of certain workloads, we instead use yield_to to get
    another VCPU in the same KVM guest to run sooner.

    This seems to give a 10-15% speedup in certain workloads.

    Signed-off-by: Rik van Riel
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Rik van Riel
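
    A hedged sketch of the directed yield (simplified; the real loop also
    remembers where it last yielded and keeps trying the next vcpu on
    failure):

        void kvm_vcpu_on_spin(struct kvm_vcpu *me)
        {
                struct kvm *kvm = me->kvm;
                struct kvm_vcpu *vcpu;
                int i;

                kvm_for_each_vcpu(i, vcpu, kvm) {
                        struct task_struct *task;

                        if (vcpu == me)
                                continue;
                        /* task tracking: see the next commit below */
                        task = get_pid_task(vcpu->pid, PIDTYPE_PID);
                        if (!task)
                                continue;
                        /* hand our timeslice to the chosen vcpu's task */
                        if (yield_to(task, 1)) {
                                put_task_struct(task);
                                break;
                        }
                        put_task_struct(task);
                }
        }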
     
  • Keep track of which task is running a KVM vcpu. This helps us
    figure out later what task to wake up if we want to boost a
    vcpu that got preempted.

    Unfortunately there are no guarantees that the same task
    always keeps the same vcpu, so we can only track the task
    across a single "run" of the vcpu.

    Signed-off-by: Rik van Riel
    Signed-off-by: Avi Kivity

    Rik van Riel
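
    A hedged sketch of the tracking itself, assuming a struct pid pointer
    in struct kvm_vcpu; holding a pid reference instead of a task_struct
    avoids pinning a task that has already exited:

        /* In vcpu_load(), i.e. whenever a task (re)takes the vcpu: */
        struct pid *oldpid = vcpu->pid;
        struct pid *newpid = get_task_pid(current, PIDTYPE_PID);

        if (oldpid != newpid) {
                rcu_assign_pointer(vcpu->pid, newpid);  /* reader: kvm_vcpu_on_spin() */
                synchronize_rcu();
                put_pid(oldpid);
        } else {
                put_pid(newpid);        /* drop the extra reference */
        }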
     
  • is_hwpoison_address only checks whether the page table entry is
    hwpoisoned, regardless of the memory page it maps, while
    __get_user_pages checks both.

    QEMU clears the poisoned page table entry (via unmap/map) to make it
    possible to allocate a new memory page for the virtual address across
    guest reboots. But it is also possible that the underlying memory page
    stays poisoned even after the corresponding page table entry is
    cleared, that is, a new memory page cannot be allocated.
    __get_user_pages can catch these situations.

    Signed-off-by: Huang Ying
    Signed-off-by: Marcelo Tosatti

    Huang Ying
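
    A hedged sketch of the check on the KVM side, assuming this era's mm
    interface where __get_user_pages() takes FOLL_* flags and reports
    poison via -EHWPOISON; the sentinel handling is illustrative:

        struct page *page;
        unsigned int flags = FOLL_TOUCH | FOLL_HWPOISON |
                             (write_fault ? FOLL_WRITE : 0);
        int npages;

        /* catches a poisoned pte *and* a poisoned underlying page */
        npages = __get_user_pages(current, current->mm, addr, 1,
                                  flags, &page, NULL, NULL);
        if (npages == -EHWPOISON)
                return page_to_pfn(hwpoison_page);  /* illustrative sentinel */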
     
  • Now that we have 'vcpu->mode' to judge whether we need to send an IPI
    to other cpus, which is exact, checking the request bit is needless,
    and the spinlock can be dropped as collateral.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
  • Currently we keep track of only two states: guest mode and host
    mode. This patch adds an "exiting guest mode" state that tells
    us that an IPI will happen soon, so unless we need to wait for the
    IPI, we can avoid it completely.

    Also:

    1. There is no need to read/write ->mode atomically from the vcpu's
    own thread.

    2. Reorganize struct kvm_vcpu so that ->mode and ->requests explicitly
    share the same cache line.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
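
    A sketch close to the code this adds (simplified): the third state
    lets a would-be sender detect that an exit is already imminent and
    skip the IPI:

        enum {
                OUTSIDE_GUEST_MODE,
                IN_GUEST_MODE,
                EXITING_GUEST_MODE,
        };

        /*
         * Returns the mode the vcpu was in; only the caller that observes
         * IN_GUEST_MODE (and wins the cmpxchg) must send the IPI, everyone
         * else can rely on the exit that is already on its way.
         */
        static int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
        {
                return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
        }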
     
  • Get rid of this warning:

    CC arch/s390/kvm/../../../virt/kvm/kvm_main.o
    arch/s390/kvm/../../../virt/kvm/kvm_main.c:596:12: warning: 'kvm_create_dirty_bitmap' defined but not used

    The only caller of the function is within a !CONFIG_S390 section, so add the
    same ifdef around kvm_create_dirty_bitmap() as well.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Marcelo Tosatti

    Heiko Carstens
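
    The shape of the fix (sketch):

        #ifndef CONFIG_S390
        static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
        {
                /* ... body unchanged ... */
        }
        #endif /* !CONFIG_S390 */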
     
  • Instead, drop large mappings, which were the reason we dropped shadow.

    Signed-off-by: Avi Kivity
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
     

14 Jan, 2011

3 commits

  • Cleanup some code with common compound_trans_head helper.

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Marcelo Tosatti
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • For GRU and EPT, we need gup-fast to set the referenced bit too (this
    is why it's correct to return 0 when shadow_access_mask is zero: it
    requires gup-fast to set the referenced bit). A qemu-kvm access
    already sets the young bit in the pte if it isn't zero-copy; if it's
    zero-copy or a shadow-paging/EPT minor fault, we rely on gup-fast to
    signal that the page is in use.

    We also need to check the young bits on the secondary pagetables for
    NPT, and not for the nested shadow mmu, as the data may never get
    accessed again by the primary pte.

    Without this closer accuracy, we'd have to remove the heuristic that
    avoids collapsing hugepages in hugepage virtual regions that don't
    have even a single subpage in use.

    ->test_young is fully backwards-compatible with GRU and other usages
    that don't have young bits in pagetables set by the hardware and that
    should nuke the secondary mmu mappings when ->clear_flush_young runs,
    just like EPT does.

    Removing the heuristic that checks the young bit in
    khugepaged/collapse_huge_page completely probably wouldn't be so bad
    either, but I thought it was worth keeping, and this makes it reliable.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This should work for both hugetlbfs and transparent hugepages.

    [akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
    Signed-off-by: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

12 Jan, 2011

24 commits

  • Since vmx blocks INIT signals, we disable virtualization extensions during
    reboot. This leads to virtualization instructions faulting; we trap these
    faults and spin while the reboot continues.

    Unfortunately spinning on a non-preemptible kernel may block a task that
    reboot depends on; this causes the reboot to hang.

    Fix by skipping over the instruction and hoping for the best.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Quote from Avi:
    | I don't think we need to flush immediately; set a "tlb dirty" bit somewhere
    | that is cleared when we flush the tlb. kvm_mmu_notifier_invalidate_page()
    | can consult the bit and force a flush if set.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
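
    A hedged sketch of the mechanism that follows from the quote (the
    tlbs_dirty name is taken from this series; treat the details as
    illustrative):

        /* instead of flushing immediately, record that tlbs went stale: */
        vcpu->kvm->tlbs_dirty++;

        /* in kvm_mmu_notifier_invalidate_page(): */
        need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
        if (need_tlb_flush)
                kvm_flush_remote_tlbs(kvm);     /* also clears tlbs_dirty */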
     
  • Store irq routing table pointer in the irqfd object,
    and use that to inject MSI directly without bouncing out to
    a kernel thread.

    While we touch this structure, rearrange irqfd fields to make fastpath
    better packed for better cache utilization.

    This also adds some comments about locking rules and rcu usage in code.

    Some notes on the design:
    - Use pointer into the rt instead of copying an entry,
    to make it possible to use rcu, thus side-stepping
    locking complexities. We also save some memory this way.
    - The old workqueue code is still used for level irqs. I don't think
    we DTRT with level anyway; however, it seems easier to keep the code
    around, as it has been thought through and debugged, and to fix level
    later rather than ripping it out and re-instating it afterwards.

    Signed-off-by: Michael S. Tsirkin
    Acked-by: Marcelo Tosatti
    Acked-by: Gregory Haskins
    Signed-off-by: Avi Kivity

    Michael S. Tsirkin
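
    A hedged sketch of the resulting fast path in the irqfd wakeup handler
    (the irq_entry field name follows this series; error handling omitted):

        struct kvm_kernel_irq_routing_entry *irq;

        rcu_read_lock();
        irq = rcu_dereference(irqfd->irq_entry);
        if (irq)
                /* cached MSI route: inject directly from the wakeup path */
                kvm_set_msi(irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1);
        else
                /* no route cached (e.g. level irq): keep the workqueue */
                schedule_work(&irqfd->inject);
        rcu_read_unlock();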
     
  • The naming convention of the hardware_[dis|en]able family is a little
    confusing because only hardware_[dis|en]able_all use the _nolock
    suffix.

    Renaming the current hardware_[dis|en]able() to *_nolock() and adding
    hardware_[dis|en]able() wrapper functions which take kvm_lock for them
    reduces the confusion.

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Marcelo Tosatti

    Takuya Yoshikawa
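
    The resulting shape (sketch; kvm_lock was still a plain spinlock at
    this point, see the 18 Mar raw-spinlock commit above):

        static void hardware_enable(void *junk)
        {
                spin_lock(&kvm_lock);
                hardware_enable_nolock(junk);
                spin_unlock(&kvm_lock);
        }

        static void hardware_disable(void *junk)
        {
                spin_lock(&kvm_lock);
                hardware_disable_nolock(junk);
                spin_unlock(&kvm_lock);
        }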
     
  • In kvm_cpu_hotplug(), only CPU_STARTING case is protected by kvm_lock.
    This patch adds missing protection for CPU_DYING case.

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Marcelo Tosatti

    Takuya Yoshikawa
     
  • Any arch not supporting device assignment will also not build
    assigned-dev.c, so testing for KVM_CAP_DEVICE_DEASSIGNMENT is
    pointless; KVM_CAP_ASSIGN_DEV_IRQ is unconditionally set. Moreover,
    add a default case for dispatching the ioctl.

    Acked-by: Alex Williamson
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
     
  • The guest may change states that pci_reset_function does not touch,
    so we had better save/restore the assigned device across guest usage.

    Acked-by: Alex Williamson
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
     
  • Cosmetic change, but it helps to correlate IRQs with PCI devices.

    Acked-by: Alex Williamson
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
     
  • This improves IRQ forwarding for assigned devices: by using the
    kernel's threaded IRQ scheme, we can get rid of the latency-prone work
    queue and simplify the code at the same time.

    Moreover, we no longer have to hold assigned_dev_lock while raising the
    guest IRQ, which can be a lengthy operation as we may have to iterate
    over all VCPUs. The lock is now only used for synchronizing masking vs.
    unmasking of INTx-type IRQs, and is thus renamed to intx_lock.

    Acked-by: Alex Williamson
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
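
    A hedged sketch of the conversion, built on the kernel's
    request_threaded_irq(); the handler below is a stand-in for the real
    one:

        static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
        {
                struct kvm_assigned_dev_kernel *dev = dev_id;

                /* raise the guest IRQ; no assigned_dev_lock held here */
                kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 1);
                return IRQ_HANDLED;
        }

        /* NULL hard handler + IRQF_ONESHOT keeps the line masked until
         * the threaded handler has run, replacing the old work queue. */
        r = request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
                                 IRQF_ONESHOT, dev->irq_name, dev);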
     
  • When we deassign a guest IRQ, clear the potentially asserted guest line.
    There might be no chance for the guest to do this, specifically if we
    switch from INTx to MSI mode.

    Acked-by: Alex Williamson
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
     
  • IA64 support forces us to abstract the allocation of the kvm structure.
    But instead of mixing this up with arch-specific initialization and
    doing the same on destruction, split both steps. This allows moving
    generic destruction calls into generic code.

    It also fixes error clean-up on failures of kvm_create_vm for IA64.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu->async_pf.done
    without holding vcpu->async_pf.lock; this will break if we are handling
    apfs at the same time.

    Also use 'list_empty_careful()' instead of 'list_empty()'.

    Signed-off-by: Xiao Guangrong
    Acked-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
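
    The shape of the fix (sketch): take the lock around the insertion of
    the dummy apf, and use the careful variant for the lockless emptiness
    check elsewhere:

        spin_lock(&vcpu->async_pf.lock);
        list_add_tail(&work->link, &vcpu->async_pf.done);
        spin_unlock(&vcpu->async_pf.lock);

        /* lockless check on another path: */
        if (list_empty_careful(&vcpu->async_pf.done))
                return;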
     
  • If there is no need to inject an async #PF into the PV guest, we can
    handle more completed apfs at one time, so we can retry the guest #PF
    as early as possible.

    Signed-off-by: Xiao Guangrong
    Acked-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • Let's use the newly introduced vzalloc().

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Jesper Juhl
    Signed-off-by: Marcelo Tosatti

    Takuya Yoshikawa
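
    The pattern being replaced (sketch; the allocation site shown is
    illustrative):

        /* before: allocate, then zero by hand */
        x = vmalloc(size);
        if (x)
                memset(x, 0, size);

        /* after: one call does both */
        x = vzalloc(size);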
     
  • Fixes this:

    CC arch/s390/kvm/../../../virt/kvm/kvm_main.o
    arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_dev_ioctl_create_vm':
    arch/s390/kvm/../../../virt/kvm/kvm_main.c:1828:10: warning: unused variable 'r'

    Signed-off-by: Heiko Carstens
    Signed-off-by: Marcelo Tosatti

    Heiko Carstens
     
  • Fixes this:

    CC arch/s390/kvm/../../../virt/kvm/kvm_main.o
    arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_clear_guest_page':
    arch/s390/kvm/../../../virt/kvm/kvm_main.c:1224:2: warning: passing argument 3 of 'kvm_write_guest_page' makes pointer from integer without a cast
    arch/s390/kvm/../../../virt/kvm/kvm_main.c:1185:5: note: expected 'const void *' but argument is of type 'long unsigned int'

    Signed-off-by: Heiko Carstens
    Signed-off-by: Marcelo Tosatti

    Heiko Carstens
     
  • Currently we are using vmalloc() for all dirty bitmaps, even if they
    are small enough, say, less than K bytes.

    We use kmalloc() if the dirty bitmap size is less than or equal to
    PAGE_SIZE so that we can avoid vmalloc area usage for VGA.

    This will also make the logging start/stop faster.

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Marcelo Tosatti

    Takuya Yoshikawa
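
    A sketch of the size-based choice (helper name from kvm_main.c; error
    handling trimmed):

        static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
        {
                unsigned long dirty_bytes = kvm_dirty_bitmap_bytes(memslot);

                /* VGA-sized slots fit in a page: avoid vmalloc space */
                if (dirty_bytes > PAGE_SIZE)
                        memslot->dirty_bitmap = vzalloc(dirty_bytes);
                else
                        memslot->dirty_bitmap = kzalloc(dirty_bytes, GFP_KERNEL);

                return memslot->dirty_bitmap ? 0 : -ENOMEM;
        }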
     
  • Currently x86's kvm_vm_ioctl_get_dirty_log() needs to allocate a bitmap
    by vmalloc() which will be used in the next logging, and this has been
    having a bad effect on VGA and live migration: vmalloc() consumes extra
    systime, triggers tlb flushes, etc.

    This patch resolves this issue by pre-allocating one more bitmap and switching
    between two bitmaps during dirty logging.

    Performance improvement:
    I measured performance for the VGA update case with trace-cmd.
    The result was 1.5 times faster than the original.

    In the case of live migration, the improvement ratio depends on the workload
    and the guest memory size. In general, the larger the memory size is the more
    benefits we get.

    Note:
    This does not change other architectures' logic, but the allocation
    size becomes twice as large. This will increase the actual memory
    consumption only when the new size changes the number of pages
    allocated by vmalloc().

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Fernando Luis Vazquez Cao
    Signed-off-by: Marcelo Tosatti

    Takuya Yoshikawa
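
    A hedged sketch of the swap in x86's kvm_vm_ioctl_get_dirty_log(),
    assuming a dirty_bitmap_head field that points at the doubled
    allocation (n is the bitmap size in bytes):

        /* pick whichever half of the doubled allocation is not live */
        dirty_bitmap = memslot->dirty_bitmap_head;
        if (memslot->dirty_bitmap == dirty_bitmap)
                dirty_bitmap += n / sizeof(long);
        memset(dirty_bitmap, 0, n);

        /* swap it in; the previously live half is copied to userspace,
         * and no vmalloc() happens on this path anymore */
        memslot->dirty_bitmap = dirty_bitmap;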
     
  • This makes it easy to change the way of allocating/freeing dirty bitmaps.

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Fernando Luis Vazquez Cao
    Signed-off-by: Marcelo Tosatti

    Takuya Yoshikawa
     
  • Add tracepoint for userspace exit.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • As suggested by Andrea, pass r/w error code to gup(), upgrading read fault
    to writable if host pte allows it.

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
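
    A hedged sketch of the call site (locals illustrative): the guest
    fault's write intent is passed straight to gup, and a read fault may
    still end up with a writable mapping when the host pte allows it:

        /* require write access only when the guest fault was a write */
        npages = get_user_pages_fast(addr, 1, write_fault, page);

        /* on a read fault, KVM can additionally probe whether the host
         * pte is already writable and, if so, map the page writable
         * (upgrade path omitted in this sketch) */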
     
  • Improve vma handling code readability in hva_to_pfn() and fix
    async pf handling code to properly check vma returned by find_vma().

    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • Send an async page fault to a PV guest if it accesses swapped-out
    memory. The guest will choose another task to run upon receiving the
    fault.

    Allow async page fault injection only when the guest is in user mode,
    since otherwise the guest may be in a non-sleepable context and will
    not be able to reschedule.

    The vcpu will be halted if the guest faults on the same page again or
    if the vcpu executes kernel code.

    Acked-by: Rik van Riel
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov