27 Dec, 2011

10 commits

  • Add a helper function that emulates the RDPMC instruction operation.

    Signed-off-by: Avi Kivity
    Signed-off-by: Gleb Natapov
    Signed-off-by: Avi Kivity

    Avi Kivity
     
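    For orientation: RDPMC takes a performance-counter index in ECX and returns
    that counter's value in EDX:EAX. A minimal standalone model of such a helper
    (the structure and names are illustrative, not the actual KVM code):

        #include <stdint.h>

        /* Hypothetical, simplified guest PMU state; not the real KVM structures. */
        struct guest_pmu {
            uint64_t counters[4];      /* general-purpose counter values */
            unsigned int nr_counters;
        };

        /* Emulate RDPMC: ECX selects the counter, the value is split into EDX:EAX. */
        static int emulate_rdpmc(const struct guest_pmu *pmu, uint32_t ecx,
                                 uint32_t *eax, uint32_t *edx)
        {
            if (ecx >= pmu->nr_counters)
                return -1;                                /* real hardware raises #GP */
            *eax = (uint32_t)pmu->counters[ecx];          /* low 32 bits  */
            *edx = (uint32_t)(pmu->counters[ecx] >> 32);  /* high 32 bits */
            return 0;
        }
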
  • Use perf_events to emulate an architectural PMU, version 2.

    Based on PMU version 1 emulation by Avi Kivity.

    [avi: adjust for cpuid.c]
    [jan: fix anonymous field initialization for older gcc]

    Signed-off-by: Gleb Natapov
    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Gleb Natapov
     
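    As background (architectural facts from the Intel SDM, not taken from this
    patch): a guest discovers the emulated PMU and its version through CPUID
    leaf 0xA. A small decoder for the fields involved:

        #include <stdint.h>

        /* Fields of CPUID leaf 0xA as defined by the Intel SDM. */
        struct arch_pmu_caps {
            unsigned int version;        /* EAX[7:0]:   architectural PMU version   */
            unsigned int num_counters;   /* EAX[15:8]:  general-purpose counters    */
            unsigned int counter_width;  /* EAX[23:16]: width of those counters     */
            unsigned int num_fixed;      /* EDX[4:0]:   fixed counters (version 2+) */
            unsigned int fixed_width;    /* EDX[12:5]:  width of the fixed counters */
        };

        static void decode_cpuid_0xa(uint32_t eax, uint32_t edx, struct arch_pmu_caps *c)
        {
            c->version       = eax & 0xff;
            c->num_counters  = (eax >> 8) & 0xff;
            c->counter_width = (eax >> 16) & 0xff;
            c->num_fixed     = edx & 0x1f;
            c->fixed_width   = (edx >> 5) & 0xff;
        }
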
  • Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create()

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
  • Introduce the KVM_MEM_SLOTS_NUM macro to replace
    KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
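    In other words, the new macro simply names the existing sum; roughly (the
    example values are illustrative, the real constants are per-architecture):

        #define KVM_MEMORY_SLOTS        32  /* slots visible to userspace (example) */
        #define KVM_PRIVATE_MEM_SLOTS    4  /* slots reserved for internal use (example) */
        #define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
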
  • Currently, write protecting a slot requires walking all of the shadow pages
    and checking the ones which have a pte mapping a page in that slot.

    The walk is overly heavy when there are not many dirty pages in that slot,
    and checking all of the shadow pages results in unwanted cache pollution.

    To mitigate this problem, we use rmap_write_protect() and check only the
    sptes which can be reached from gfns marked in the dirty bitmap, when the
    number of dirty pages is less than the number of shadow pages.

    This criterion is reasonable and worked well in our tests: write protection
    became several times faster than before when the ratio of dirty pages was
    low, and was not worse even when the ratio was near the criterion.

    Note that the locking for this write protection becomes fine grained.
    The reason why this is safe is described in the comments.

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Avi Kivity

    Takuya Yoshikawa
     
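    A simplified, standalone model of the heuristic described above (the helper
    names are hypothetical, not the actual KVM functions):

        #include <stdbool.h>

        /* Hypothetical stand-ins for the real rmap and shadow-page walks. */
        void rmap_write_protect_gfn(unsigned long gfn);
        void write_protect_all_shadow_pages(void);

        struct slot_model {
            unsigned long base_gfn;
            unsigned long npages;
            const unsigned char *dirty_bitmap;  /* one bit per page in the slot */
        };

        static bool test_dirty(const struct slot_model *s, unsigned long i)
        {
            return s->dirty_bitmap[i / 8] & (1u << (i % 8));
        }

        void write_protect_slot(const struct slot_model *slot,
                                unsigned long nr_dirty_pages,
                                unsigned long nr_shadow_pages)
        {
            if (nr_dirty_pages < nr_shadow_pages) {
                /* Few dirty pages: visit only sptes reachable from dirty gfns. */
                for (unsigned long i = 0; i < slot->npages; i++)
                    if (test_dirty(slot, i))
                        rmap_write_protect_gfn(slot->base_gfn + i);
            } else {
                /* Many dirty pages: the full shadow-page walk is cheaper. */
                write_protect_all_shadow_pages();
            }
        }
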
  • The host side pv mmu support has been marked for feature removal in
    January 2011. It's not in use, is slower than shadow or hardware-assisted
    paging, and is a maintenance burden. It's November 2011, time to
    remove it.

    Signed-off-by: Chris Wright
    Signed-off-by: Avi Kivity

    Chris Wright
     
  • Detecting write-flooding does not work well: when we handle a write to a
    guest page table, if the last speculative spte has not been accessed we
    treat the page as write-flooded. However, speculative sptes can be created
    on many paths (pte prefetch, page sync, and so on), which means the last
    speculative spte may not point to the written page, and the written page
    may still be accessed via other sptes, so depending on the Accessed bit of
    the last speculative spte is not enough.

    Instead of detecting whether the page has been accessed, detect whether the
    spte is accessed after it is written. If the spte is not accessed but is
    written frequently, we treat the page as not being a page table, or as not
    having been used for a long time.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
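    A sketch of the replacement heuristic (the counter name and the threshold
    are assumptions, not necessarily the exact KVM identifiers):

        /* Per-shadow-page state; illustrative only. */
        struct shadow_page_model {
            unsigned int write_flooding_count;  /* assumed name for the counter */
        };

        /* The guest wrote to the page this shadow page maps.  Returns nonzero when
         * the page keeps being written without being used as a page table in
         * between, i.e. the caller should zap it. */
        static int note_write(struct shadow_page_model *sp)
        {
            return ++sp->write_flooding_count >= 3;  /* threshold is an assumption */
        }

        /* The shadow page was actually used to translate a guest access. */
        static void note_access(struct shadow_page_model *sp)
        {
            sp->write_flooding_count = 0;
        }
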
  • Fast prefetch spte for the unsync shadow page on invlpg path

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
  • In the current code, the accessed bit is always set when a page fault
    occurs, so there is no need to set it on the pte write path.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
  • If the emulation is caused by #PF and the instruction is not a
    page-table-writing instruction, the VM exit was caused by shadow-page
    write protection; we can zap the shadow page and retry the instruction
    directly.

    The idea is from Avi

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
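    A sketch of that decision (the helper names are hypothetical):

        #include <stdbool.h>
        #include <stdint.h>

        bool insn_writes_page_table(const void *insn);  /* hypothetical decoder check */
        void zap_shadow_pages_for_gfn(uint64_t gfn);    /* drops the write protection */

        /* Returns true when emulation can be skipped and the guest simply retried. */
        static bool can_reexecute(bool emulation_caused_by_write_fault,
                                  const void *insn, uint64_t fault_gfn)
        {
            if (!emulation_caused_by_write_fault)
                return false;
            if (insn_writes_page_table(insn))
                return false;  /* the write targets a page table itself: must emulate */
            zap_shadow_pages_for_gfn(fault_gfn);
            return true;       /* protection removed; the retried write will succeed */
        }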

05 Oct, 2011

1 commit

  • This patch emulates the LAPIC TSC deadline timer for the guest:
    enumerate the TSC deadline timer capability via CPUID;
    enable TSC deadline timer mode via LAPIC MMIO;
    start the TSC deadline timer via WRMSR.

    [jan: use do_div()]
    [avi: fix for !irqchip_in_kernel()]
    [marcelo: another fix for !irqchip_in_kernel()]

    Signed-off-by: Liu, Jinsong
    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Liu, Jinsong
     
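    For reference, the three guest-visible pieces listed above map to these
    architectural constants (values from the Intel SDM; this is the guest's
    view, not the KVM emulation code):

        #include <stdbool.h>
        #include <stdint.h>

        #define MSR_IA32_TSC_DEADLINE   0x6e0       /* WRMSR here arms the timer  */
        #define APIC_LVT_TIMER_OFFSET   0x320       /* LVT Timer register (MMIO)  */
        #define LVT_TIMER_MODE_MASK     (3u << 17)  /* bits 18:17 select the mode */
        #define LVT_TIMER_TSC_DEADLINE  (2u << 17)  /* 10b = TSC-deadline mode    */

        /* CPUID.1:ECX bit 24 advertises the TSC-deadline timer capability. */
        static bool has_tsc_deadline_timer(uint32_t cpuid_1_ecx)
        {
            return (cpuid_1_ecx >> 24) & 1;
        }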

26 Sep, 2011

5 commits

  • If simultaneous NMIs happen, we're supposed to queue the second
    and next (collapsing them), but currently we sometimes collapse
    the second into the first.

    Fix by using a counter for pending NMIs instead of a bool; since
    the counter limit depends on whether the processor is currently
    in an NMI handler, which can only be checked in vcpu context
    (via the NMI mask), we add a new KVM_REQ_NMI to request recalculation
    of the counter.

    Signed-off-by: Avi Kivity
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
     
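    A simplified model of the collapsing rule (not the exact KVM code):

        #include <stdbool.h>

        /* One NMI can be executing and one more can be latched, so at most two
         * may be pending outside an NMI handler and only one more inside one;
         * anything beyond that collapses. */
        static unsigned int collapse_pending_nmis(unsigned int pending,
                                                  unsigned int newly_queued,
                                                  bool in_nmi_handler)
        {
            unsigned int limit = in_nmi_handler ? 1 : 2;
            unsigned int total = pending + newly_queued;
            return total > limit ? limit : total;
        }
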
  • KVM assumed in several places that reading the TSC MSR returns the value for
    L1. This is incorrect, because when L2 is running, the correct TSC read exit
    emulation is to return L2's value.

    We therefore add a new x86_ops function, read_l1_tsc, to use in places that
    specifically need to read the L1 TSC, NOT the TSC of whichever guest level
    is currently running.

    Note that one change, of one line in kvm_arch_vcpu_load(), is made redundant
    by a different patch sent by Zachary Amsden (and not yet applied):
    kvm_arch_vcpu_load() should not read the guest TSC, and if it did not, the
    call to kvm_get_msr() would not have needed to be changed to read_l1_tsc().

    [avi: moved callback to kvm_x86_ops tsc block]

    Signed-off-by: Nadav Har'El
    Acked-by: Zachary Amsden
    Signed-off-by: Avi Kivity

    Nadav Har'El
     
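    The distinction can be modelled like this (a sketch with hypothetical field
    names, not the kvm_x86_ops prototype):

        #include <stdint.h>

        struct tsc_model {
            uint64_t host_tsc;           /* current value of the host TSC          */
            uint64_t l1_tsc_offset;      /* offset programmed for L1               */
            uint64_t active_tsc_offset;  /* offset in the loaded VMCS/VMCB, which
                                            is L2's offset while L2 is running     */
        };

        /* What a plain guest TSC read returns right now (possibly L2's view). */
        static uint64_t read_current_tsc(const struct tsc_model *t)
        {
            return t->host_tsc + t->active_tsc_offset;
        }

        /* What the new read_l1_tsc hook must always return: L1's view. */
        static uint64_t read_l1_tsc(const struct tsc_model *t)
        {
            return t->host_tsc + t->l1_tsc_offset;
        }
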
  • Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded.
    On SVM, it is not possible to implement this, but on VMX this is possible
    and was indeed implemented until nested SVM changed this to unconditionally
    read PDPTEs dynamically. This has a noticeable impact when running PAE guests.

    Fix by changing the MMU to read PDPTRs from the cache, falling back to
    reading from memory for the nested MMU.

    Signed-off-by: Avi Kivity
    Tested-by: Joerg Roedel
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
     
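    A sketch of the shape of the fix (PAE has four PDPTEs; the names here are
    illustrative):

        #include <stdbool.h>
        #include <stdint.h>

        uint64_t read_guest_pdpte(unsigned int index);  /* hypothetical memory read */

        struct mmu_model {
            uint64_t pdptrs[4];  /* cache filled when the guest loads CR3 */
            bool is_nested;      /* the nested MMU has no architectural cache */
        };

        static uint64_t get_pdpte(const struct mmu_model *mmu, unsigned int index)
        {
            if (mmu->is_nested)
                return read_guest_pdpte(index);  /* fall back to reading memory */
            return mmu->pdptrs[index];           /* use the cached PDPTR */
        }
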
  • The vmexit tracepoints format the exit_reason to make it human-readable.
    Since the exit_reason depends on the instruction set (vmx or svm),
    formatting is handled with ftrace_print_symbols_seq() by referring to
    the appropriate exit reason table.

    However, the ftrace_print_symbols_seq() function is not meant to be used
    directly in tracepoints since it does not export the formatting table
    which userspace tools like trace-cmd and perf use to format traces.

    In practice perf dies when formatting vmexit-related events and
    trace-cmd falls back to printing the numeric value (with extra
    formatting code in the kvm plugin to paper over this limitation). Other
    userspace consumers of vmexit-related tracepoints would be in similar
    trouble.

    To avoid significant changes to the kvm_exit tracepoint, this patch
    moves the vmx and svm exit reason tables into arch/x86/kvm/trace.h and
    selects the right table with __print_symbolic() depending on the
    instruction set. Note that __print_symbolic() is designed for exporting
    the formatting table to userspace and allows trace-cmd and perf to work.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Avi Kivity

    Stefan Hajnoczi
     
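    For context, __print_symbolic() takes the value plus a value/name table and
    exports that table to userspace. A schematic fragment of how a TP_printk()
    uses it (not the actual kvm_exit definition; the table is abbreviated):

        #define VMX_EXIT_REASONS            \
            { 0,  "EXCEPTION_NMI" },        \
            { 1,  "EXTERNAL_INTERRUPT" },   \
            { 12, "HLT" },                  \
            { 30, "IO_INSTRUCTION" }

        TP_printk("reason %s",
                  __print_symbolic(__entry->exit_reason, VMX_EXIT_REASONS))
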
  • The patch raises the hard limit of VCPU count to 254.

    This will allow developers to easily work on scalability
    and will allow users to test high VCPU setups easily without
    patching the kernel.

    To prevent possible issues with current setups, KVM_CAP_NR_VCPUS
    now returns the recommended VCPU limit (which is still 64) - this
    should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS
    returns the hard limit which is now 254.

    Cc: Avi Kivity
    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Pekka Enberg
    Suggested-by: Pekka Enberg
    Signed-off-by: Sasha Levin
    Signed-off-by: Marcelo Tosatti

    Sasha Levin
     
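    From userspace, both values can be queried with KVM_CHECK_EXTENSION, for
    example:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        int main(void)
        {
            int kvm = open("/dev/kvm", O_RDWR);
            if (kvm < 0) {
                perror("open /dev/kvm");
                return 1;
            }

            int soft = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);   /* recommended */
            int hard = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);  /* hard limit  */
            if (hard <= 0)
                hard = soft;  /* older kernels only report the recommended value */

            printf("recommended vcpus: %d, maximum vcpus: %d\n", soft, hard);
            return 0;
        }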

14 Jul, 2011

1 commit

  • To implement steal time, we need the hypervisor to pass the guest
    information about how much time was spent running other processes
    outside the VM, while the vcpu had meaningful work to do - halt
    time does not count.

    This information is acquired through the run_delay field of
    delayacct/schedstats infrastructure, that counts time spent in a
    runqueue but not running.

    Steal time is per-cpu information, so the traditional MSR-based
    infrastructure is used. A new MSR, KVM_MSR_STEAL_TIME, holds the address of
    the memory area containing the steal time information.

    This patch contains the hypervisor part of the steal time infrastructure,
    and can be backported independently of the guest portion.

    [avi, yongjie: export delayacct_on, to avoid build failures in some configs]

    Signed-off-by: Glauber Costa
    Tested-by: Eric B Munson
    CC: Rik van Riel
    CC: Jeremy Fitzhardinge
    CC: Peter Zijlstra
    CC: Anthony Liguori
    Signed-off-by: Yongjie Ren
    Signed-off-by: Avi Kivity

    Glauber Costa
     
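    A simplified model of the bookkeeping described above (the field names and
    layout are illustrative; memory barriers and the guest-side reader are
    omitted):

        #include <stdint.h>

        /* Shared record the guest registers via the new MSR. */
        struct steal_time_record {
            uint64_t steal;    /* time spent runnable but not running, per the host */
            uint32_t version;  /* odd while the host is updating the record */
        };

        struct vcpu_steal_state {
            struct steal_time_record *shared;  /* guest-supplied memory area */
            uint64_t last_run_delay;           /* previous schedstats run_delay snapshot */
        };

        /* Host side, called before entering the guest. */
        static void record_steal_time(struct vcpu_steal_state *st, uint64_t run_delay_now)
        {
            st->shared->version++;                                    /* update begins  */
            st->shared->steal += run_delay_now - st->last_run_delay;  /* new steal time */
            st->last_run_delay = run_delay_now;
            st->shared->version++;                                    /* update visible */
        }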
