07 Apr, 2013

1 commit

  • This patch adds support to the kvm_gfn_to_hva_cache_init functions
    for reads and writes that cross a page boundary. If the range falls
    within the same memslot, this remains a fast operation. If the range
    is split between two memslots, the slower kvm_read_guest and
    kvm_write_guest paths are used.
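
    A minimal sketch of how a caller might use the cached API after this
    change (assuming kvm_gfn_to_hva_cache_init takes the length of the
    range, per the description above; the wrapper itself is illustrative,
    not from the patch):

    /* Initialize the cache once for a gpa range that may now span a
     * page boundary, then do fast cached writes through it. */
    static int guest_buf_write(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
                               gpa_t gpa, void *data, unsigned long len)
    {
            int r = kvm_gfn_to_hva_cache_init(kvm, ghc, gpa, len);
            if (r)
                    return r;
            return kvm_write_guest_cached(kvm, ghc, data, len);
    }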

    Tested: ran against the kvm_clock unit tests.

    Signed-off-by: Andrew Honig
    Signed-off-by: Gleb Natapov

    Andrew Honig
     

20 Mar, 2013

1 commit

  • If the guest specifies an IOAPIC_REG_SELECT with an invalid value and
    follows that with a read of the IOAPIC_REG_WINDOW, KVM does not properly
    validate the request. ioapic_read_indirect contains an
    ASSERT(redir_index < IOAPIC_NUM_PINS), but the ASSERT has no effect in
    non-debug builds. In recent kernels this allows a guest to cause a kernel
    oops by reading invalid memory. In older kernels (pre-3.3) it allows a
    guest to read from large ranges of host memory.
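
    The fix amounts to replacing the debug-only assertion with an explicit
    bounds check; a hedged sketch of its shape (surrounding code simplified):

    /* ioapic_read_indirect(), IOAPIC_REG_WINDOW path */
    u32 redir_index = (ioapic->ioregsel - 0x10) >> 1;
    u64 redir_content;

    if (redir_index < IOAPIC_NUM_PINS)
            redir_content = ioapic->redirtbl[redir_index].bits;
    else
            redir_content = ~0ULL;  /* out-of-range selector: all ones */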

    Tested: ran against the apic unit tests.

    Signed-off-by: Andrew Honig
    Signed-off-by: Marcelo Tosatti

    Andy Honig
     

28 Feb, 2013

1 commit

  • I'm not sure why, but while the list for-each-entry iterators were
    conceived as

    list_for_each_entry(pos, head, member)

    the hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
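
    For illustration, a call site goes from the old four-argument form to
    a three-argument form matching the list iterator (type and function
    names here are made up):

    struct obj *tpos;
    struct hlist_node *pos;   /* old: extra cursor the caller had to declare */

    /* before */
    hlist_for_each_entry(tpos, pos, head, member)
            do_something(tpos);

    /* after: the extra cursor is gone */
    hlist_for_each_entry(tpos, head, member)
            do_something(tpos);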

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue,
    hlist_for_each_entry_from, hlist_for_each_entry_rcu,
    hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh,
    for_each_busy_worker, ax25_uid_for_each, ax25_for_each,
    inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each,
    sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound,
    hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu,
    nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each,
    nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp,
    for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

11 Feb, 2013

1 commit

  • This field was needed to differentiate memory slots created by the new
    API, KVM_SET_USER_MEMORY_REGION, from those created by the old
    equivalent, KVM_SET_MEMORY_REGION, whose support was dropped long
    before:

    commit b74a07beed0e64bfba413dcb70dd6749c57f43dc
    KVM: Remove kernel-allocated memory regions

    Although we also have private memory slots, for which KVM allocates
    memory with vm_mmap() (!user_alloc slots, in other words), the slot id
    should be enough to differentiate them.

    Note: corresponding function parameters will be removed later.
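
    A hedged sketch of the idea that the slot id alone distinguishes the
    two kinds (the helper is made up; the split point is arch-defined):

    /* User-visible slots occupy ids below KVM_USER_MEM_SLOTS; private,
     * kernel-internal slots sit above that. */
    static bool slot_is_user_alloc(int slot_id)
    {
            return slot_id < KVM_USER_MEM_SLOTS;
    }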

    Reviewed-by: Marcelo Tosatti
    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Gleb Natapov

    Takuya Yoshikawa
     

05 Feb, 2013

2 commits

  • As Xiao pointed out, there are a few problems with it:
    - kvm_arch_commit_memory_region() write-protects the memory slot only
    for GET_DIRTY_LOG when modifying the flags.
    - FNAME(sync_page) uses the old spte value to set a new one without
    checking the KVM_MEM_READONLY flag.

    Since we flush all shadow pages when creating a new slot, the simplest
    fix is to disallow such problematic flag changes: this is safe because
    no one is doing such things.
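
    A hedged sketch of the guard this implies in __kvm_set_memory_region
    (condition simplified; variable names are illustrative):

    /* On an existing slot, refuse to toggle the read-only attribute. */
    if (old.npages && ((mem->flags ^ old.flags) & KVM_MEM_READONLY))
            return -EINVAL;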

    Reviewed-by: Gleb Natapov
    Signed-off-by: Takuya Yoshikawa
    Cc: Xiao Guangrong
    Cc: Alex Williamson
    Signed-off-by: Marcelo Tosatti

    Takuya Yoshikawa
     
  • KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify
    what kind of change is being requested by checking the arguments. The
    current code does this checking at various points in the code, and each
    condition used there is not easy to understand at first glance.

    This patch consolidates these checks and introduces an enum to name the
    possible changes to clean up the code.

    Although this is intended as a cleanup with no functional changes, it
    does optimize one path: if there is nothing to change, the new code
    returns 0 immediately.

    Note that the return value for this case cannot be changed since QEMU
    relies on it: we noticed this when we changed it to -EINVAL and got a
    section mismatch error at the final stage of live migration.
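
    The enum and the classification it names look roughly like this (a
    sketch; see kvm_host.h for the exact definition):

    enum kvm_mr_change {
            KVM_MR_CREATE,      /* npages goes 0 -> non-zero */
            KVM_MR_DELETE,      /* npages goes non-zero -> 0 */
            KVM_MR_MOVE,        /* base_gfn of an existing slot changes */
            KVM_MR_FLAGS_ONLY,  /* only the flags change */
    };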

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Marcelo Tosatti

    Takuya Yoshikawa
     

29 Jan, 2013

2 commits

  • yield_to returns -ESRCH when the run queue length of both the source
    and the target of yield_to is one. When we see three successive
    failures of yield_to, we assume we are in a potential undercommit case
    and abort from the PLE handler.
    The assumption is backed by the low probability of a wrong decision
    even in worst-case scenarios such as an average run queue length
    between 1 and 2.

    More detail on the rationale behind using three tries:
    if p is the probability of finding a run queue of length one on a
    particular cpu, and we do n tries, then the probability of exiting the
    PLE handler is:

    p^(n+1) [because we would have come across one source with rq length
    1 and n target cpu rqs with length 1]

    so, at 1.5x overcommit (where p is roughly 1/2):

    num tries   probability of aborting ple handler
        1           1/4
        2           1/8
        3           1/16

    We can increase this probability with more tries, but the problem is
    the overhead.
    Also, if we have tried three times, that means we have iterated over
    three good eligible vcpus along with many non-eligible candidates. In
    the worst case, if we iterate over all the vcpus, we lose the 1x
    performance and overcommit performance takes a hit.

    Note that we do not update the last boosted vcpu in failure cases.
    Thanks to Avi for raising the question of aborting after the first
    failed yield_to.
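
    A hedged sketch of the retry logic this adds to kvm_vcpu_on_spin()
    (structure simplified; the real loop also tracks the boosted vcpu):

    struct kvm_vcpu *vcpu;
    int i, yielded, try = 3;

    kvm_for_each_vcpu(i, vcpu, kvm) {
            if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
                    continue;
            yielded = kvm_vcpu_yield_to(vcpu);
            if (yielded > 0)
                    break;          /* boosted someone; stop */
            if (yielded < 0 && !--try)
                    break;          /* three -ESRCH: likely undercommitted */
    }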

    Reviewed-by: Srikar Dronamraju
    Signed-off-by: Raghavendra K T
    Tested-by: Chegu Vinod
    Signed-off-by: Gleb Natapov

    Raghavendra K T
     
  • Virtual interrupt delivery frees KVM from injecting vAPIC interrupts
    manually; that is fully taken care of by the hardware. This needs
    some special awareness in the existing interrupt injection path:

    - For a pending interrupt, instead of direct injection, we may need
    to update architecture-specific indicators before resuming the guest.

    - A pending interrupt that is masked by the ISR should also be
    considered in the above update action, since the hardware will decide
    when to inject it at the right time. The current has_interrupt and
    get_interrupt only return a valid vector from the injection point of
    view.

    Reviewed-by: Marcelo Tosatti
    Signed-off-by: Kevin Tian
    Signed-off-by: Yang Zhang
    Signed-off-by: Gleb Natapov

    Yang Zhang
     

27 Jan, 2013

2 commits

  • We've been ignoring read-only mappings and programming everything
    into the iommu as read-write. Fix this to only include the write
    access flag when read-only is not set.
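
    A hedged sketch of the fix's shape in the iommu mapping path (reduced
    to its essentials):

    /* Build iommu protection flags from the slot's read-only bit. */
    int flags = IOMMU_READ;

    if (!(slot->flags & KVM_MEM_READONLY))
            flags |= IOMMU_WRITE;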

    Signed-off-by: Alex Williamson
    Signed-off-by: Gleb Natapov

    Alex Williamson
     
  • Memory slot flags can be altered without changing other parameters of
    the slot. The read-only attribute is the only one the IOMMU cares
    about, so generate an unmap and re-map when it changes. This also
    avoids unnecessarily re-mapping the slot when no IOMMU-visible changes
    are made.

    Reviewed-by: Xiao Guangrong
    Signed-off-by: Alex Williamson
    Signed-off-by: Gleb Natapov

    Alex Williamson
     

17 Jan, 2013

3 commits


14 Jan, 2013

1 commit

  • Calling kvm_mmu_slot_remove_write_access() for a deleted slot does
    nothing but search for non-existent mmu pages which have mappings to
    that deleted memory; this is safe but a waste of time.

    Since we want to make the function rmap-based in a later patch, in a
    manner which makes it unsafe to call for a deleted slot, we make the
    caller check that the slot is non-empty and being dirty logged.

    Reviewed-by: Marcelo Tosatti
    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Gleb Natapov

    Takuya Yoshikawa
     

24 Dec, 2012

1 commit


23 Dec, 2012

2 commits

  • The previous patch, "kvm: Minor memory slot optimization"
    (b7f69c555ca43), overlooked the generation field of the memory slots.
    Re-using the original memory slots left us with two slightly different
    memory slots with the same generation. To fix this, make
    update_memslots() take a new parameter specifying the last generation.
    This also makes generation management more explicit, to avoid such
    problems in the future.
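
    A hedged sketch of the resulting signature and generation bump
    (details may differ from the actual patch):

    static void update_memslots(struct kvm_memslots *slots,
                                struct kvm_memory_slot *new,
                                u64 last_generation)
    {
            /* ... install the new/updated slot ... */
            slots->generation = last_generation + 1;
    }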

    Reported-by: Takuya Yoshikawa
    Signed-off-by: Alex Williamson
    Signed-off-by: Gleb Natapov

    Alex Williamson
     
  • This hack is wrong: the PIT is connected to pin 2, not pin 0, so the
    hack never takes effect. It is therefore safe to remove it.

    Signed-off-by: Yang Zhang
    Signed-off-by: Gleb Natapov

    Yang Zhang
     

14 Dec, 2012

7 commits

  • We're currently offering a whopping 32 memory slots to user space; an
    int is a bit excessive for storing this. We would like to increase
    the number of memslots, but even then SHRT_MAX should be more than
    enough.

    Reviewed-by: Gleb Natapov
    Signed-off-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti

    Alex Williamson
     
  • There's no need for this to be an int; it holds a boolean. Move it to
    the end of the struct for better alignment.

    Reviewed-by: Gleb Natapov
    Signed-off-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti

    Alex Williamson
     
  • It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM: one is
    the number of user-accessible slots and the other is user + private.
    Make this more obvious.
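
    A sketch of the relationship the rename makes explicit (the concrete
    values are per-arch assumptions, not from this commit):

    #define KVM_USER_MEM_SLOTS     32   /* user-visible via the ioctl */
    #define KVM_PRIVATE_MEM_SLOTS  4    /* kernel-internal only */
    #define KVM_MEM_SLOTS_NUM \
            (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)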

    Reviewed-by: Gleb Natapov
    Signed-off-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti

    Alex Williamson
     
  • If a slot is removed or moved in the guest physical address space, we
    first allocate and install a new slot array with the invalidated
    entry. The old array is then freed. We then proceed to allocate yet
    another slot array to install the permanent replacement. Re-use the
    original array when this occurs and avoid the extra kfree/kmalloc.

    Reviewed-by: Gleb Natapov
    Signed-off-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti

    Alex Williamson
     
  • The iommu integration into memory slots expects memory slots to be
    added or removed and doesn't handle the move case. We can unmap
    slots from the iommu after we mark them invalid and map them before
    installing the final memslot array. Also re-order the kmemdup vs map
    operations so we don't leave stale iommu mappings if we get ENOMEM.

    Reviewed-by: Gleb Natapov
    Signed-off-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti

    Alex Williamson
     
  • The API documents that only the flags and guest physical address
    space of an existing slot can be modified, but we don't enforce that
    the userspace address cannot be modified; instead we just ignore it.
    This means that a user may think they've successfully moved both the
    guest and user addresses, when in fact only the guest address changed.
    Check and return an error instead.
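
    A hedged sketch of the added validation (names simplified):

    /* Moving the backing userspace address of a live slot is not
     * supported; only flags and guest physical placement may change. */
    if (npages && old.npages &&
        mem->userspace_addr != old.userspace_addr)
            return -EINVAL;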

    Reviewed-by: Gleb Natapov
    Signed-off-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti

    Alex Williamson
     
  • The API documentation states:

    When changing an existing slot, it may be moved in the guest
    physical memory space, or its flags may be modified.

    An "existing slot" requires a non-zero npages (memory_size). The only
    transition we should therefore allow for a non-existing slot should be
    to create the slot, which includes setting a non-zero memory_size. We
    currently allow calls to modify non-existing slots, which is pointless,
    confusing, and possibly wrong.

    With this we know that the invalidation path of __kvm_set_memory_region
    is always for a delete or move and never for adding a zero size slot.

    Reviewed-by: Gleb Natapov
    Signed-off-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti

    Alex Williamson
     

11 Dec, 2012

1 commit


10 Dec, 2012

1 commit

  • * 'for-upstream' of https://github.com/agraf/linux-2.6: (28 commits)
    KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface
    KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation
    KVM: PPC: bookehv: Add guest computation mode for irq delivery
    KVM: PPC: Make EPCR a valid field for booke64 and bookehv
    KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit
    KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation
    KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation
    KVM: PPC: e500: Add emulation helper for getting instruction ea
    KVM: PPC: bookehv64: Add support for interrupt handling
    KVM: PPC: bookehv: Remove GET_VCPU macro from exception handler
    KVM: PPC: booke: Fix get_tb() compile error on 64-bit
    KVM: PPC: e500: Silence bogus GCC warning in tlb code
    KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
    KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
    MAINTAINERS: Add git tree link for PPC KVM
    KVM: PPC: Book3S PR: MSR_DE doesn't exist on Book 3S
    KVM: PPC: Book3S PR: Fix VSX handling
    KVM: PPC: Book3S PR: Emulate PURR, SPURR and DSCR registers
    KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages
    KVM: PPC: Book3S HV: Report correct HPT entry index when reading HPT
    ...

    Marcelo Tosatti
     

06 Dec, 2012

1 commit

  • The current eventfd code assumes that when we have eventfd, we also have
    irqfd for in-kernel interrupt delivery. This is not necessarily true: on
    PPC we don't have an in-kernel irqchip yet, but we can still easily
    support eventfd.

    Signed-off-by: Alexander Graf

    Alexander Graf
     

05 Dec, 2012

2 commits


30 Nov, 2012

1 commit

  • Prior to memory slot sorting, this loop compared all of the user memory
    slots for overlap with new entries. With memory slot sorting, we're
    just checking some number of entries in the array that may or may not
    be user slots. Instead, walk all the slots with kvm_for_each_memslot,
    which has the added benefit of terminating early when we hit the first
    empty slot, and skip the comparison for private slots.
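
    A hedged sketch of the resulting walk (close in shape to the fix, with
    names abbreviated):

    kvm_for_each_memslot(slot, kvm->memslots) {
            /* skip private slots and the slot being changed */
            if (slot->id >= KVM_MEMORY_SLOTS || slot == memslot)
                    continue;
            if (!((base_gfn + npages <= slot->base_gfn) ||
                  (base_gfn >= slot->base_gfn + slot->npages)))
                    goto out;       /* overlap */
    }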

    Cc: stable@vger.kernel.org
    Signed-off-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti

    Alex Williamson
     

28 Nov, 2012

2 commits

  • TSC initialization will soon make use of online_vcpus.

    Signed-off-by: Marcelo Tosatti

    Marcelo Tosatti
     
  • KVM added a global variable to guarantee monotonicity in the guest.
    One of the reasons for that is that the time between

    1. ktime_get_ts(&timespec);
    2. rdtscll(tsc);

    is variable. That is, given a host with a stable TSC, suppose that
    two VCPUs read the same time via ktime_get_ts() above.

    The time required to execute 2. is not the same on those two instances
    executing on different VCPUs (cache misses, interrupts...).

    If the TSC value that the host uses to interpolate when calculating
    the monotonic time is the same value used to calculate the
    tsc_timestamp value stored in the pvclock data structure, and a single
    <system_timestamp, tsc_timestamp> tuple is visible to all vcpus
    simultaneously, this problem disappears. See the comment on top of
    pvclock_update_vm_gtod_copy for details.

    Monotonicity is then guaranteed by the synchronicity of the host and
    guest TSCs.

    Set the TSC stable pvclock flag in that case, allowing the guest to
    read the clock from userspace.

    Signed-off-by: Marcelo Tosatti

    Marcelo Tosatti
     

14 Nov, 2012

2 commits


30 Oct, 2012

2 commits

  • This patch filters noslot pfns out of the error pfns, based on
    Marcelo's comment: a noslot pfn is not an error pfn.

    After this patch,
    - is_noslot_pfn indicates that the gfn is not in any slot
    - is_error_pfn indicates that the gfn is in a slot but an error
    occurred when translating the gfn to a pfn
    - is_error_noslot_pfn indicates that the pfn is either an error pfn
    or a noslot pfn
    And is_invalid_pfn can be removed, which makes the code cleaner.
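
    A hedged sketch of the resulting predicates (see kvm_host.h for the
    exact bit encodings; the mask names below are assumptions):

    static inline bool is_error_pfn(pfn_t pfn)
    {
            return !!(pfn & KVM_PFN_ERR_MASK);
    }

    static inline bool is_noslot_pfn(pfn_t pfn)
    {
            return pfn == KVM_PFN_NOSLOT;
    }

    static inline bool is_error_noslot_pfn(pfn_t pfn)
    {
            return !!(pfn & KVM_PFN_ERR_NOSLOT_MASK);
    }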

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • Merge reason: development work has dependency on kvm patches merged
    upstream.

    Conflicts:
    arch/powerpc/include/asm/Kbuild
    arch/powerpc/include/asm/kvm_para.h

    Signed-off-by: Marcelo Tosatti

    Marcelo Tosatti
     

24 Oct, 2012

1 commit


23 Oct, 2012

1 commit

  • We cannot directly call kvm_release_pfn_clean to release the pfn,
    since we may encounter a noslot pfn, which is used to cache mmio info
    in an spte.
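
    A hedged sketch of the guard (at the time of this fix, is_error_pfn()
    still covered noslot pfns; the split described in the 30 Oct entry
    above came later):

    /* Only real pages have a refcount to drop. */
    if (!is_error_pfn(pfn))
            kvm_release_pfn_clean(pfn);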

    Signed-off-by: Xiao Guangrong
    Cc: stable@vger.kernel.org
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     

11 Oct, 2012

1 commit


06 Oct, 2012

1 commit