27 Dec, 2011

12 commits


26 Sep, 2011

1 commit

  • Currently the method of dealing with an IO operation on a bus (PIO/MMIO)
    is to call the read or write callback for each device registered
    on the bus until we find a device which handles it.

    Since the number of devices on a bus can be significant due to ioeventfds
    and coalesced MMIO zones, this leads to a lot of overhead on each IO
    operation.

    Instead of registering devices, we now register ranges which point to
    a device. Lookup is done using an efficient bsearch instead of a linear
    search.

    A performance test was conducted by comparing the exit count per second
    with 200 ioeventfds created on one byte while the guest continuously
    accesses a different byte (triggering usermode exits).
    Before the patch the guest achieved 259k exits per second; after the
    patch it does 274k exits per second.

    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Sasha Levin
    Signed-off-by: Avi Kivity

    Sasha Levin
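
    A minimal, self-contained illustration of the range-plus-bsearch idea
    (the struct and function names below are simplified stand-ins, not KVM's
    actual definitions, and lookups here match a registered (addr, len) pair
    exactly, while the kernel's comparison is a little more involved):

    #include <stdio.h>
    #include <stdlib.h>

    /* Simplified stand-in for KVM's per-bus bookkeeping. */
    struct io_range {
        unsigned long long addr;    /* start of the range (GPA or port) */
        int len;                    /* length in bytes */
        void *dev;                  /* device handling this range */
    };

    /* Ranges are kept sorted by start address, then length. */
    static int range_cmp(const void *a, const void *b)
    {
        const struct io_range *r1 = a, *r2 = b;

        if (r1->addr != r2->addr)
            return r1->addr < r2->addr ? -1 : 1;
        return r1->len - r2->len;
    }

    /* O(log n) lookup instead of asking every registered device in turn. */
    static void *bus_find_dev(struct io_range *ranges, size_t nr,
                              unsigned long long addr, int len)
    {
        struct io_range key = { addr, len, NULL };
        struct io_range *r = bsearch(&key, ranges, nr, sizeof(key), range_cmp);

        return r ? r->dev : NULL;
    }

    int main(void)
    {
        static int dev_a, dev_b;
        struct io_range bus[] = {
            { 0x2000, 4, &dev_b },
            { 0x1000, 1, &dev_a },
        };
        size_t nr = sizeof(bus) / sizeof(bus[0]);

        /* "Registration" keeps the array sorted. */
        qsort(bus, nr, sizeof(bus[0]), range_cmp);

        printf("hit:  %p\n", bus_find_dev(bus, nr, 0x2000, 4));
        printf("miss: %p\n", bus_find_dev(bus, nr, 0x3000, 1));
        return 0;
    }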
     

24 Jul, 2011

2 commits

  • The idea is from Avi:

    | We could cache the result of a miss in an spte by using a reserved bit, and
    | checking the page fault error code (or seeing if we get an ept violation or
    | ept misconfiguration), so if we get repeated mmio on a page, we don't need to
    | search the slot list/tree.
    | (https://lkml.org/lkml/2011/2/22/221)

    When the page fault is caused by mmio, we cache the info in the shadow page
    table, and also set the reserved bits in the shadow page table, so if the mmio
    is caused again, we can quickly identify it and emulate it directly.

    Searching for an mmio gfn in the memslots is heavy since we need to walk all
    memslots; this feature reduces that cost, and also avoids walking the guest
    page table for soft mmu.

    [jan: fix operator precedence issue]

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Xiao Guangrong
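
    A rough userspace illustration of the caching idea only, not the kernel's
    actual spte encoding: a software-chosen "reserved" bit marks the entry as
    a cached mmio miss, and the gfn plus access bits are packed into the
    remaining bits so a repeated fault can skip the memslot walk.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative layout only; the real spte masks and shifts differ. */
    #define MMIO_FLAG        (1ULL << 51)   /* pretend this is a reserved bit */
    #define MMIO_GFN_SHIFT   12
    #define MMIO_ACCESS_MASK 0x7ULL

    static uint64_t make_mmio_spte(uint64_t gfn, uint64_t access)
    {
        return MMIO_FLAG | (gfn << MMIO_GFN_SHIFT) | (access & MMIO_ACCESS_MASK);
    }

    static bool is_mmio_spte(uint64_t spte)
    {
        return spte & MMIO_FLAG;
    }

    static uint64_t mmio_spte_gfn(uint64_t spte)
    {
        return (spte & ~MMIO_FLAG) >> MMIO_GFN_SHIFT;
    }

    int main(void)
    {
        uint64_t spte = make_mmio_spte(0xfee00, 0x3);

        /* On a repeated fault the reserved-bit pattern is recognized and we
         * can go straight to emulation instead of walking the memslots. */
        if (is_mmio_spte(spte))
            printf("cached mmio fault: gfn=%#llx access=%llu\n",
                   (unsigned long long)mmio_spte_gfn(spte),
                   (unsigned long long)(spte & MMIO_ACCESS_MASK));
        return 0;
    }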
     
    If the page fault is caused by mmio, the gfn cannot be found in the memslots,
    and 'bad_pfn' is returned on the gfn_to_hva path, so we can use 'bad_pfn' to
    identify the mmio page fault.
    And, to clarify the meaning of an mmio pfn, we return the fault page instead
    of the bad page when the gfn is not allowed to be prefetched.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
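
    In outline (hypothetical names, not the exact kernel helpers), the fault
    path only needs a sentinel comparison:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t pfn_t;

    /* Hypothetical sentinel; in the kernel it is the frame of a dedicated
     * "bad page" set up at module init. */
    static const pfn_t bad_pfn = (pfn_t)-1;

    /* gfn_to_pfn() returning the sentinel means the gfn has no memslot,
     * so the fault is treated as mmio and the access is emulated. */
    static bool pfn_is_mmio(pfn_t pfn)
    {
        return pfn == bad_pfn;
    }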
     

12 Jul, 2011

4 commits

    Introduce a kvm_read_guest_cached() function in addition to the write one we
    already have.

    [ by glauber: export function signature in kvm header ]

    Signed-off-by: Gleb Natapov
    Signed-off-by: Glauber Costa
    Acked-by: Rik van Riel
    Tested-by: Eric Munson
    Signed-off-by: Avi Kivity

    Gleb Natapov
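
    A hedged usage sketch of the cache API as exported from the KVM headers
    around this time (struct shared_blob is a made-up example payload, and
    the kvm_gfn_to_hva_cache_init() signature has changed in later kernels):

    #include <linux/kvm_host.h>

    struct shared_blob {               /* hypothetical guest/host shared data */
        u64 flags;
    };

    static struct gfn_to_hva_cache blob_cache;

    static int blob_init(struct kvm *kvm, gpa_t gpa)
    {
        /* Resolve gpa -> hva once and remember it in the cache. */
        return kvm_gfn_to_hva_cache_init(kvm, &blob_cache, gpa);
    }

    static int blob_read(struct kvm *kvm, struct shared_blob *blob)
    {
        /* The new read-side counterpart of kvm_write_guest_cached(). */
        return kvm_read_guest_cached(kvm, &blob_cache, blob, sizeof(*blob));
    }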
     
  • KVM has an ioctl to define which signal mask should be used while running
    inside VCPU_RUN. At least for big endian systems, this mask is different
    on 32-bit and 64-bit systems (though the size is identical).

    Add a compat wrapper that converts the mask to whatever the kernel accepts,
    allowing 32-bit kvm user space to set signal masks.

    This patch fixes qemu with --enable-io-thread on ppc64 hosts when running
    a 32-bit userland.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
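
    The endianness problem can be shown in plain C: a 64-bit signal mask is
    built from two 32-bit words, and on a big-endian host those words sit in
    the opposite order, so a 32-bit mask copied in verbatim lands in the
    wrong half. The conversion below only illustrates the idea; it is not
    the kernel's compat helper.

    #include <stdint.h>
    #include <stdio.h>

    /* 32-bit (compat) mask: two 32-bit words, lowest signals in sig[0]. */
    struct compat_mask { uint32_t sig[2]; };

    /* Build the 64-bit mask explicitly instead of memcpy()ing the words,
     * so the result is correct regardless of host endianness. */
    static uint64_t compat_to_native(struct compat_mask m)
    {
        return (uint64_t)m.sig[0] | ((uint64_t)m.sig[1] << 32);
    }

    int main(void)
    {
        struct compat_mask m = { { 1u << 16, 0 } };   /* some low signal bit */

        printf("native mask: %#llx\n",
               (unsigned long long)compat_to_native(m));
        return 0;
    }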
     
    So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct if
    it fails. Move this confusing responsibility back into the hands of
    kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected;
    all other archs cannot fail.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
    Simply use __copy_to_user/__clear_user to write the guest page, since we have
    already verified the user address when the memslot was set.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
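
    In kernel terms the reasoning is roughly this (a sketch, not the exact
    KVM code): access_ok() was already paid when the memslot was registered,
    so the per-access path can use the unchecked double-underscore variants.

    #include <linux/errno.h>
    #include <linux/uaccess.h>

    /* 'hva' was validated with access_ok() at memslot registration time,
     * so __copy_to_user()/__clear_user(), which skip that check, suffice. */
    static int write_guest_page_sketch(void __user *hva, const void *data,
                                       int len)
    {
        if (data)
            return __copy_to_user(hva, data, len) ? -EFAULT : 0;

        /* Clearing a page follows the same reasoning. */
        return __clear_user(hva, len) ? -EFAULT : 0;
    }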
     

06 Jun, 2011

1 commit

  • It doesn't make sense to ever see a half-initialized kvm structure on
    mmu notifier callbacks. Previously, 85722cda changed the ordering to
    ensure that the mmu_lock was initialized before mmu notifier
    registration, but there is still a race where the mmu notifier could
    come in and try accessing other portions of struct kvm before they are
    initialized.

    Solve this by moving the mmu notifier registration to occur after the
    structure is completely initialized.

    Google-Bug-Id: 452199
    Signed-off-by: Mike Waychison
    Signed-off-by: Avi Kivity

    Mike Waychison
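
    The shape of the fix, as a sketch rather than the literal kvm_create_vm():
    everything the notifier callbacks might touch is set up before the
    notifier is registered, because a callback can fire as soon as
    registration returns (kvm_init_mmu_notifier() is KVM's thin wrapper
    around mmu_notifier_register()).

    #include <linux/kvm_host.h>
    #include <linux/slab.h>

    static struct kvm *create_vm_sketch(void)
    {
        struct kvm *kvm = kzalloc(sizeof(*kvm), GFP_KERNEL);

        if (!kvm)
            return NULL;

        spin_lock_init(&kvm->mmu_lock);
        /* ... initialize memslots, buses, lists, counters ... */

        /* Last step: only now may notifier callbacks safely run. */
        if (kvm_init_mmu_notifier(kvm)) {
            /* ... tear down everything set up above ... */
            kfree(kvm);
            return NULL;
        }

        return kvm;
    }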
     

26 May, 2011

1 commit

  • fa3d315a "KVM: Validate userspace_addr of memslot when registered" introduced
    this new warning on s390:

    kvm_main.c: In function '__kvm_set_memory_region':
    kvm_main.c:654:7: warning: passing argument 1 of '__access_ok' makes pointer from integer without a cast
    arch/s390/include/asm/uaccess.h:53:19: note: expected 'const void *' but argument is of type '__u64'

    Add the missing cast to get rid of it again...

    Cc: Takuya Yoshikawa
    Signed-off-by: Heiko Carstens
    Signed-off-by: Avi Kivity

    Heiko Carstens
     

22 May, 2011

2 commits

    As the trace below shows, an mmu_notifier callback can be invoked immediately
    after registration, so KVM has to initialize kvm->mmu_lock before registering
    it.

    BUG: spinlock bad magic on CPU#0, kswapd0/342
    lock: ffff8800af8c4000, .magic: 00000000, .owner: /-1, .owner_cpu: 0
    Pid: 342, comm: kswapd0 Not tainted 2.6.39-rc5+ #1
    Call Trace:
    [] spin_bug+0x9c/0xa3
    [] do_raw_spin_lock+0x29/0x13c
    [] ? flush_tlb_others_ipi+0xaf/0xfd
    [] _raw_spin_lock+0x9/0xb
    [] kvm_mmu_notifier_clear_flush_young+0x2c/0x66 [kvm]
    [] __mmu_notifier_clear_flush_young+0x2b/0x57
    [] page_referenced_one+0x88/0xea
    [] page_referenced+0x1fc/0x256
    [] shrink_page_list+0x187/0x53a
    [] shrink_inactive_list+0x1e0/0x33d
    [] ? determine_dirtyable_memory+0x15/0x27
    [] ? call_function_single_interrupt+0xe/0x20
    [] shrink_zone+0x322/0x3de
    [] ? zone_watermark_ok_safe+0xe2/0xf1
    [] kswapd+0x516/0x818
    [] ? shrink_zone+0x3de/0x3de
    [] kthread+0x7d/0x85
    [] kernel_thread_helper+0x4/0x10
    [] ? __init_kthread_worker+0x37/0x37
    [] ? gs_change+0xb/0xb

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Avi Kivity

    OGAWA Hirofumi
     
  • This way, we can avoid checking the user space address many times when
    we read the guest memory.

    Although we could do the same for writes if we checked which slots are
    writable, we do not care about writes for now: reading guest memory happens
    more often than writing to it.

    [avi: change VERIFY_READ to VERIFY_WRITE]

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Avi Kivity

    Takuya Yoshikawa
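
    Roughly, the check added at registration time looks like this (a sketch;
    older kernels' access_ok() took a VERIFY_* argument, and the cast of the
    __u64 userspace_addr to a user pointer is what the s390 warning fix
    above is about):

    #include <linux/errno.h>
    #include <linux/kvm_host.h>
    #include <linux/uaccess.h>

    /* Validate the user address range once, when the slot is set, so that
     * later guest reads through this slot need no per-access access_ok(). */
    static int validate_memslot_addr(const struct kvm_userspace_memory_region *mem)
    {
        if (!access_ok(VERIFY_WRITE,
                       (void __user *)(unsigned long)mem->userspace_addr,
                       mem->memory_size))
            return -EINVAL;
        return 0;
    }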
     

11 May, 2011

1 commit


06 Apr, 2011

1 commit

  • If asynchronous hva_to_pfn() is requested call GUP with FOLL_NOWAIT to
    avoid sleeping on IO. The check for hwpoison is done at the same time;
    otherwise check_user_page_hwpoison() would call GUP again and put the
    vcpu to sleep.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Avi Kivity

    Gleb Natapov
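
    The gist, as a sketch: when the caller is asynchronous, add FOLL_NOWAIT
    so get_user_pages does not sleep on IO, and fold the hwpoison check into
    the same call via FOLL_HWPOISON. gup_one_page() below is a hypothetical
    stand-in for the __get_user_pages-style call actually used, whose exact
    signature differs between kernel versions.

    #include <linux/mm.h>
    #include <linux/sched.h>

    /* Hypothetical wrapper standing in for a __get_user_pages-style call. */
    extern int gup_one_page(struct task_struct *tsk, struct mm_struct *mm,
                            unsigned long start, unsigned int gup_flags,
                            struct page **page);

    static int pin_guest_page_sketch(unsigned long hva, bool async, bool write,
                                     struct page **page)
    {
        /* hwpoison is checked here instead of via a second GUP call. */
        unsigned int flags = FOLL_GET | FOLL_HWPOISON;

        if (write)
            flags |= FOLL_WRITE;
        if (async)
            flags |= FOLL_NOWAIT;   /* don't sleep on IO; the async path retries */

        return gup_one_page(current, current->mm, hva, flags, page);
    }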
     

26 Mar, 2011

1 commit


24 Mar, 2011

3 commits

    As a preparation for removing the ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    asm-generic/bitops/le.h is only intended to be included directly from
    asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h,
    which implement the generic ext2 or minix bit operations.

    This stops including asm-generic/bitops/le.h directly and uses ext2
    non-atomic bit operations instead.

    It seems odd to use ext2_set_bit() in kvm, but it will be replaced with
    __set_bit_le() after introducing little-endian bit operations for all
    architectures. This indirect step is necessary to maintain bisectability
    for some architectures which have their own little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
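
    For example, marking a page dirty in a per-slot bitmap uses the ext2
    helper for now and the generic little-endian helper once it exists
    (sketch; the function name here is illustrative):

    #include <linux/bitops.h>

    /* Interim step taken by this patch: the ext2 helper is already a
     * little-endian bit operation on every architecture. */
    static void set_dirty_bit_sketch(unsigned long *dirty_bitmap,
                                     unsigned long rel_gfn)
    {
        ext2_set_bit(rel_gfn, dirty_bitmap);
        /* After the follow-up series this becomes:
         *     __set_bit_le(rel_gfn, dirty_bitmap);
         */
    }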
     
  • KVM uses a sysdev class and a sysdev for executing kvm_suspend()
    after interrupts have been turned off on the boot CPU (during system
    suspend) and for executing kvm_resume() before turning on interrupts
    on the boot CPU (during system resume). However, since both of these
    functions ignore their arguments, the entire mechanism may be
    replaced with a struct syscore_ops object which is simpler.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Avi Kivity

    Rafael J. Wysocki
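
    A sketch of the replacement (struct syscore_ops and register_syscore_ops()
    are the real core API; the callback bodies and the helper name here are
    placeholders):

    #include <linux/syscore_ops.h>

    /* syscore callbacks run on the boot CPU with interrupts disabled,
     * which is exactly when kvm_suspend()/kvm_resume() need to run. */
    static int kvm_suspend_sketch(void)
    {
        /* disable virtualization hardware on this CPU */
        return 0;
    }

    static void kvm_resume_sketch(void)
    {
        /* re-enable virtualization hardware on this CPU */
    }

    static struct syscore_ops kvm_syscore_ops = {
        .suspend = kvm_suspend_sketch,
        .resume  = kvm_resume_sketch,
    };

    /* Called once at module init instead of registering a sysdev class. */
    static void kvm_register_pm_hooks(void)
    {
        register_syscore_ops(&kvm_syscore_ops);
    }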
     

18 Mar, 2011

8 commits

    Code under this lock requires non-preemptibility. Ensure this also holds
    on -rt by converting it to a raw spinlock.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
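
    The conversion itself is mechanical (sketch): the lock type and every
    operation switch to the raw_ variants, which remain true spinning locks
    even where PREEMPT_RT turns ordinary spinlocks into sleeping locks.

    #include <linux/spinlock.h>

    /* Before: static DEFINE_SPINLOCK(example_lock); spin_lock(&example_lock); */
    static DEFINE_RAW_SPINLOCK(example_lock);   /* stays a real spinlock on -rt */

    static void critical_section_sketch(void)
    {
        raw_spin_lock(&example_lock);
        /* ... code that must not be preempted ... */
        raw_spin_unlock(&example_lock);
    }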
     
  • Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
    slowdowns of certain workloads, we instead use yield_to to get
    another VCPU in the same KVM guest to run sooner.

    This seems to give a 10-15% speedup in certain workloads.

    Signed-off-by: Rik van Riel
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Rik van Riel
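
    In outline (a sketch, not the final kvm_vcpu_on_spin(); vcpu_task() is a
    stand-in for resolving the task currently running a vcpu, see the task
    tracking entry below), the spinning vcpu donates its timeslice to a
    sibling vcpu's task via yield_to() instead of sleeping:

    #include <linux/kvm_host.h>
    #include <linux/sched.h>

    /* Stand-in: returns the task running 'vcpu' with a reference held. */
    extern struct task_struct *vcpu_task(struct kvm_vcpu *vcpu);

    static void vcpu_on_spin_sketch(struct kvm_vcpu *me)
    {
        struct kvm *kvm = me->kvm;
        struct kvm_vcpu *vcpu;
        int i;

        kvm_for_each_vcpu(i, vcpu, kvm) {
            struct task_struct *task;
            int yielded;

            if (vcpu == me)
                continue;
            task = vcpu_task(vcpu);
            if (!task)
                continue;

            /* Hand the rest of our timeslice to this vcpu's task. */
            yielded = yield_to(task, 1);
            put_task_struct(task);
            if (yielded)
                break;
        }
    }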
     
  • Keep track of which task is running a KVM vcpu. This helps us
    figure out later what task to wake up if we want to boost a
    vcpu that got preempted.

    Unfortunately there are no guarantees that the same task
    always keeps the same vcpu, so we can only track the task
    across a single "run" of the vcpu.

    Signed-off-by: Rik van Riel
    Signed-off-by: Avi Kivity

    Rik van Riel
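
    A sketch of the bookkeeping, assuming the pid field this patch adds to
    struct kvm_vcpu: a struct pid reference is safer than a raw task pointer
    because the task may exit while the vcpu stays around.

    #include <linux/kvm_host.h>
    #include <linux/pid.h>
    #include <linux/sched.h>

    /* Refresh the record each time a (possibly different) task enters run. */
    static void vcpu_record_task_sketch(struct kvm_vcpu *vcpu)
    {
        struct pid *oldpid = vcpu->pid;

        if (oldpid != task_pid(current)) {
            vcpu->pid = get_task_pid(current, PIDTYPE_PID);
            put_pid(oldpid);
        }
    }

    /* Later (e.g. for directed yield) resolve the pid back to a task.
     * The caller must drop the returned reference with put_task_struct(). */
    static struct task_struct *vcpu_task_sketch(struct kvm_vcpu *vcpu)
    {
        return vcpu->pid ? get_pid_task(vcpu->pid, PIDTYPE_PID) : NULL;
    }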
     
    is_hwpoison_address only checks whether the page table entry is
    hwpoisoned, regardless of the memory page mapped, while __get_user_pages
    checks both.

    QEMU will clear the poisoned page table entry (via unmap/map) to make
    it possible to allocate a new memory page for the virtual address
    across guest reboots. But it is also possible that the underlying
    memory page is kept poisoned even after the corresponding page table
    entry is cleared, that is, a new memory page cannot be allocated.
    __get_user_pages can catch these situations.

    Signed-off-by: Huang Ying
    Signed-off-by: Marcelo Tosatti

    Huang Ying
     
    Now that we have 'vcpu->mode' to judge whether we need to send an IPI to
    other cpus, which is very precise, checking the request bit is needless,
    and we can drop the spinlock as a collateral benefit.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
  • Currently we keep track of only two states: guest mode and host
    mode. This patch adds an "exiting guest mode" state that tells
    us that an IPI will happen soon, so unless we need to wait for the
    IPI, we can avoid it completely.

    Also:
    1: There is no need to atomically read/write ->mode in the vcpu's own
       thread.

    2: Reorganize struct kvm_vcpu to explicitly put ->mode and ->requests
       in the same cache line.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
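
    The saving comes from a single atomic exchange on ->mode (sketch; the
    state names follow the description above): only if the vcpu is still in
    guest mode do we take responsibility for the kick, otherwise either an
    IPI is already on its way or the vcpu will notice the request before the
    next guest entry.

    #include <linux/kvm_host.h>

    static bool need_ipi_for_request_sketch(struct kvm_vcpu *vcpu)
    {
        /* Atomically flip IN_GUEST_MODE -> EXITING_GUEST_MODE; the winner
         * of this race is the one that must send the IPI. */
        return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE)
                == IN_GUEST_MODE;
    }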
     
  • Get rid of this warning:

    CC arch/s390/kvm/../../../virt/kvm/kvm_main.o
    arch/s390/kvm/../../../virt/kvm/kvm_main.c:596:12: warning: 'kvm_create_dirty_bitmap' defined but not used

    The only caller of the function is within a !CONFIG_S390 section, so add the
    same ifdef around kvm_create_dirty_bitmap() as well.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Marcelo Tosatti

    Heiko Carstens
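
    The fix is just a matching preprocessor guard (sketch):

    #include <linux/kvm_host.h>

    /* The only caller sits inside a !CONFIG_S390 block, so guard the
     * definition the same way to silence the unused-function warning. */
    #ifndef CONFIG_S390
    static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
    {
        /* ... allocate memslot->dirty_bitmap ... */
        return 0;
    }
    #endif /* !CONFIG_S390 */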
     
  • Instead, drop large mappings, which were the reason we dropped shadow.

    Signed-off-by: Avi Kivity
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
     

14 Jan, 2011

3 commits

    Clean up some code with the common compound_trans_head helper.

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Marcelo Tosatti
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    For GRU and EPT, we need gup-fast to set the referenced bit too (this is why
    it's correct to return 0 when shadow_access_mask is zero: it requires
    gup-fast to set the referenced bit). qemu-kvm access already sets the
    young bit in the pte if it isn't zero-copy; if it's zero-copy or a shadow
    paging EPT minor fault, we rely on gup-fast to signal that the page is in
    use...

    We also need to check the young bits on the secondary pagetables for NPT
    and not nested shadow mmu as the data may never get accessed again by the
    primary pte.

    Without this closer accuracy, we'd have to remove the heuristic that
    avoids collapsing hugepages in hugepage virtual regions that have not even
    a single subpage in use.

    ->test_young is fully backwards compatible with GRU and other usages that
    don't have young bits in pagetables set by the hardware and that should
    nuke the secondary mmu mappings when ->clear_flush_young runs, just like
    EPT does.

    Removing the heuristic that checks the young bit in
    khugepaged/collapse_huge_page completely probably isn't so bad either,
    but I thought it was worth keeping, and this makes it reliable.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
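
    On the KVM side this surfaces as a new mmu_notifier hook (sketch of the
    wiring, handler body elided): ->test_young reports the accessed/young
    state of the secondary (EPT/NPT) page tables without clearing or
    flushing anything, which is what the khugepaged heuristic needs.

    #include <linux/mmu_notifier.h>

    /* Report, without clearing, whether the secondary page tables have seen
     * an access for this address. */
    static int test_young_sketch(struct mmu_notifier *mn,
                                 struct mm_struct *mm,
                                 unsigned long address)
    {
        /* ... walk the shadow/EPT/NPT entries for 'address' and return
         *     non-zero if an accessed bit is set ... */
        return 0;
    }

    static const struct mmu_notifier_ops mmu_notifier_ops_sketch = {
        /* existing handlers (.clear_flush_young etc.) elided */
        .test_young = test_young_sketch,
    };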
     
  • This should work for both hugetlbfs and transparent hugepages.

    [akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
    Signed-off-by: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli