13 Dec, 2013

1 commit

  • In multiple functions the vcpu_id is used as an offset into a bitfield. A
    malicious user could specify a vcpu_id greater than 255 in order to set or
    clear bits in kernel memory. This could be used to elevate privileges in the
    kernel. This patch verifies that the vcpu_id provided is less than 255.
    The api documentation already specifies that the vcpu_id must be less than
    max_vcpus, but this is currently not checked.

    Reported-by: Andrew Honig
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Honig
    Signed-off-by: Paolo Bonzini

    Andy Honig
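
The class of bug described in this commit can be sketched in userspace C: an id used as a bit index has to be range-checked before the bitmap is touched. The names below (`MAX_VCPUS`, `struct vcpu_bitmap`, `set_vcpu_bit`, `test_vcpu_bit`) are hypothetical and only model the idea; this is not the kernel code from the patch.

```c
#include <errno.h>
#include <limits.h>

#define MAX_VCPUS 255 /* assumed limit, taken from the "255" in the message */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS ((MAX_VCPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical per-VM bitmap with one bit per vcpu. */
struct vcpu_bitmap {
    unsigned long bits[BITMAP_WORDS];
};

/* Set the bit for vcpu_id, rejecting ids that would index past the bitmap. */
static int set_vcpu_bit(struct vcpu_bitmap *map, unsigned int vcpu_id)
{
    if (vcpu_id >= MAX_VCPUS)
        return -EINVAL; /* without this check the write lands outside bits[] */
    map->bits[vcpu_id / BITS_PER_WORD] |= 1UL << (vcpu_id % BITS_PER_WORD);
    return 0;
}

/* Return 1 if the bit for vcpu_id is set, 0 otherwise (including bad ids). */
static int test_vcpu_bit(const struct vcpu_bitmap *map, unsigned int vcpu_id)
{
    if (vcpu_id >= MAX_VCPUS)
        return 0;
    return (int)((map->bits[vcpu_id / BITS_PER_WORD] >>
                  (vcpu_id % BITS_PER_WORD)) & 1UL);
}
```

Doing the check at the single point where the id enters from userspace means every later bitmap access can rely on it.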
     

21 Nov, 2013

1 commit

  • Using the address of 'empty_zero_page' as source address in order to
    clear a page is wrong. On some architectures empty_zero_page is only the
    pointer to the struct page of the empty_zero_page. Therefore the clear
    page operation would copy the contents of a couple of struct pages instead
    of clearing a page. For kvm only arm/arm64 are affected by this bug.

    To fix this use the ZERO_PAGE macro instead which will return the struct
    page address of the empty_zero_page on all architectures.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Gleb Natapov

    Heiko Carstens
     

15 Nov, 2013

1 commit

  • Pull KVM changes from Paolo Bonzini:
    "Here are the 3.13 KVM changes. There was a lot of work on the PPC
    side: that the HV and emulation flavors can now coexist in a single
    kernel is probably the most interesting change from a user point of view.

    On the x86 side there are nested virtualization improvements and a few
    bugfixes.

    ARM got transparent huge page support, improved overcommit, and
    support for big endian guests.

    Finally, there is a new interface to connect KVM with VFIO. This
    helps with devices that use NoSnoop PCI transactions, letting the
    driver in the guest execute WBINVD instructions. This includes some
    nVidia cards on Windows, that fail to start without these patches and
    the corresponding userspace changes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
    kvm, vmx: Fix lazy FPU on nested guest
    arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
    arm/arm64: KVM: MMIO support for BE guest
    kvm, cpuid: Fix sparse warning
    kvm: Delete prototype for non-existent function kvm_check_iopl
    kvm: Delete prototype for non-existent function complete_pio
    hung_task: add method to reset detector
    pvclock: detect watchdog reset at pvclock read
    kvm: optimize out smp_mb after srcu_read_unlock
    srcu: API for barrier after srcu read unlock
    KVM: remove vm mmap method
    KVM: IOMMU: hva align mapping page size
    KVM: x86: trace cpuid emulation when called from emulator
    KVM: emulator: cleanup decode_register_operand() a bit
    KVM: emulator: check rex prefix inside decode_register()
    KVM: x86: fix emulation of "movzbl %bpl, %eax"
    kvm_host: typo fix
    KVM: x86: emulate SAHF instruction
    MAINTAINERS: add tree for kvm.git
    Documentation/kvm: add a 00-INDEX file
    ...

    Linus Torvalds
     

06 Nov, 2013

1 commit

  • It was used in conjunction with the KVM_SET_MEMORY_REGION ioctl, which was
    removed by b74a07beed0 in 2010. QEMU stopped using it in 2008, so
    it is finally time to remove the code.

    Signed-off-by: Gleb Natapov

    Gleb Natapov
     

05 Nov, 2013

1 commit

  • When determining the page size we could use to map with the IOMMU, the
    page size should also be aligned with the hva, not just the gfn. The
    gfn may not reflect the real alignment within the hugetlbfs file.

    Most of the time, this works fine. However, if the hugetlbfs file is
    backed by non-contiguous huge pages, a multi-huge page memslot starts at
    an unaligned offset within the hugetlbfs file, and the gfn is aligned
    with respect to the huge page size, kvm_host_page_size() will return the
    huge page size and we will use that to map with the IOMMU.

    When we later unpin that same memslot, the IOMMU returns the unmap size
    as the huge page size, and we happily unpin that many pfns in
    monotonically increasing order, not realizing we are spanning
    non-contiguous huge pages and partially unpin the wrong huge page.

    Ensure the IOMMU mapping page size is aligned with the hva corresponding
    to the gfn, which does reflect the alignment within the hugetlbfs file.

    Reviewed-by: Marcelo Tosatti
    Signed-off-by: Greg Edwards
    Cc: stable@vger.kernel.org
    Signed-off-by: Gleb Natapov

    Greg Edwards
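
The alignment rule described in this commit can be sketched as a small standalone function: start from the candidate page size and halve it until both the guest physical address and the host virtual address are aligned to it. `iommu_map_size` and the 4 KiB `PAGE_SHIFT` are assumptions made for this sketch, not the kernel's implementation.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumed 4 KiB base pages */

/*
 * Shrink page_size (a power of two, in bytes) until both the guest
 * physical address derived from gfn and the host virtual address hva
 * are aligned to it. hva is assumed to be base-page aligned.
 */
static uint64_t iommu_map_size(uint64_t gfn, uint64_t hva, uint64_t page_size)
{
    uint64_t gpa = gfn << PAGE_SHIFT;

    while (page_size > (1ULL << PAGE_SHIFT) &&
           ((gpa | hva) & (page_size - 1)))
        page_size >>= 1;
    return page_size;
}
```

With a 2 MiB candidate size, a gfn of 0x200 paired with an hva that is only 4 KiB aligned falls back to 4 KiB, which is the case the gfn-only check missed.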
     

04 Nov, 2013

1 commit


31 Oct, 2013

3 commits

  • We currently use some ad-hoc arch variables tied to legacy KVM device
    assignment to manage emulation of instructions that depend on whether
    non-coherent DMA is present. Create an interface for this, adapting
    legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
    For now we assume that non-coherent DMA is possible any time we have a
    VFIO group. Eventually an interface can be developed as part of the
    VFIO external user interface to query the coherency of a group.

    Signed-off-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Alex Williamson
     
  • Default to operating in coherent mode. This simplifies the logic when
    we switch to a model of registering and unregistering noncoherent I/O
    with KVM.

    Signed-off-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Alex Williamson
     
  • So far we've succeeded at making KVM and VFIO mostly unaware of each
    other, but areas are cropping up where a connection beyond eventfds
    and irqfds needs to be made. This patch introduces a KVM-VFIO device
    that is meant to be a gateway for such interaction. The user creates
    the device and can add and remove VFIO groups to it via file
    descriptors. When a group is added, KVM verifies the group is valid
    and gets a reference to it via the VFIO external user interface.

    Signed-off-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Alex Williamson
     

30 Oct, 2013

1 commit


28 Oct, 2013

1 commit

  • In kvm_iommu_map_pages(), we need to know the page size, which we get by
    calling kvm_host_page_size(). That function checks whether the target slot
    is valid before returning the right page size.
    Currently, we map the iommu pages when creating a new slot, but we call
    kvm_iommu_map_pages() while the new slot is still being prepared.
    At that time, the new slot is not yet visible to the domain.
    So we cannot get the right page size from kvm_host_page_size(), and
    this breaks the IOMMU super page logic.
    The solution is to map the iommu pages after we insert the new slot
    into the domain.

    Signed-off-by: Yang Zhang
    Tested-by: Patrick Lu
    Signed-off-by: Paolo Bonzini

    Yang Zhang
     

17 Oct, 2013

3 commits


15 Oct, 2013

1 commit

  • Page pinning is not mandatory in kvm async page fault processing: after
    the async page fault event is delivered to a guest, the guest accesses the
    page again and does its own GUP. Drop the FOLL_GET flag from the GUP in the
    async_pf code, and simplify the check/clear processing.

    Suggested-by: Gleb Natapov
    Signed-off-by: Gu zheng
    Signed-off-by: chai wen
    Signed-off-by: Gleb Natapov

    chai wen
     

03 Oct, 2013

2 commits

  • When KVM (de)assigns PCI(e) devices to VMs, a debug message is printed
    including the BDF notation of the respective device. Currently, the BDF
    notation does not have the commonly used leading zeros. This produces
    messages like "assign device 0:1:8.0", which look strange at first sight.

    The patch fixes this by replacing the printk(KERN_DEBUG ...) with dev_info()
    and also inserts "kvm" into the debug message, so that it is obvious where
    the message comes from. It also reduces LoC.

    Acked-by: Alex Williamson
    Signed-off-by: Andre Richter
    Signed-off-by: Gleb Natapov

    Andre Richter
     
  • gfn_to_memslot() can return NULL or invalid slot. We need to check slot
    validity before accessing it.

    Reviewed-by: Paolo Bonzini
    Signed-off-by: Gleb Natapov

    Gleb Natapov
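
For the BDF message fix in the first commit of this group, the conventional zero-padded notation (domain:bus:device.function) can be illustrated with plain `snprintf`. `format_bdf` is a hypothetical helper for this sketch; the patch itself switches to dev_info(), which is not reproduced here.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a PCI address as dddd:bb:dd.f with leading zeros (hex fields). */
static int format_bdf(char *buf, size_t len, unsigned int domain,
                      unsigned int bus, unsigned int dev, unsigned int fn)
{
    return snprintf(buf, len, "%04x:%02x:%02x.%x", domain, bus, dev, fn);
}
```

Called with domain 0, bus 1, device 8, function 0, this yields the padded form of the address that the old message printed as "0:1:8.0".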
     

30 Sep, 2013

3 commits

  • In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
    the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
    function tries to grab the (non-raw) mmu_lock within the scope of
    the raw locked kvm_lock being held. This leads to the following:

    BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
    in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
    Preemption disabled at:[] mmu_shrink+0x5c/0x1b0 [kvm]

    Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
    Call Trace:
    [] __might_sleep+0xfd/0x160
    [] rt_spin_lock+0x24/0x50
    [] mmu_shrink+0xec/0x1b0 [kvm]
    [] shrink_slab+0x17d/0x3a0
    [] ? mem_cgroup_iter+0x130/0x260
    [] balance_pgdat+0x54a/0x730
    [] ? set_pgdat_percpu_threshold+0xa7/0xd0
    [] kswapd+0x18f/0x490
    [] ? get_parent_ip+0x11/0x50
    [] ? __init_waitqueue_head+0x50/0x50
    [] ? balance_pgdat+0x730/0x730
    [] kthread+0xdb/0xe0
    [] ? finish_task_switch+0x52/0x100
    [] kernel_thread_helper+0x4/0x10
    [] ? __init_kthread_worker+0x

    After the previous patch, kvm_lock need not be a raw spinlock anymore,
    so change it back.

    Reported-by: Paul Gortmaker
    Cc: kvm@vger.kernel.org
    Cc: gleb@redhat.com
    Cc: jan.kiszka@siemens.com
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • The VM list need not be protected by a raw spinlock. Separate the
    two so that kvm_lock can be made non-raw.

    Cc: kvm@vger.kernel.org
    Cc: gleb@redhat.com
    Cc: jan.kiszka@siemens.com
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Remove the useless argument, and do not do anything if there are no
    VMs running at the time of the hotplug.

    Cc: kvm@vger.kernel.org
    Cc: gleb@redhat.com
    Cc: jan.kiszka@siemens.com
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

25 Sep, 2013

1 commit

  • '.done' is used to mark the completion of 'async_pf_execute()', but
    'cancel_work_sync()' returns true when the work was canceled, so we
    use it instead.

    Signed-off-by: Radim Krčmář
    Reviewed-by: Paolo Bonzini
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Radim Krčmář
     

17 Sep, 2013

2 commits

  • When we cancel 'async_pf_execute()', we should behave as if the work was
    never scheduled in 'kvm_setup_async_pf()'.
    This fixes a bug where the module cannot be unloaded because the vm wasn't destroyed.

    Signed-off-by: Radim Krčmář
    Reviewed-by: Paolo Bonzini
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Radim Krčmář
     
  • Page tables in a read-only memory slot will currently cause a triple
    fault because the page walker uses gfn_to_hva and it fails on such a slot.

    OVMF uses such a page table; however, real hardware seems to be fine with
    that as long as the accessed/dirty bits are set. Save whether the slot
    is readonly, and later check it when updating the accessed and dirty bits.

    Reviewed-by: Xiao Guangrong
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

05 Sep, 2013

1 commit

  • Pull vfs pile 1 from Al Viro:
    "Unfortunately, this merge window it'll have to be a lot of small piles -
    my fault, actually, for not keeping #for-next in anything that would
    resemble a sane shape ;-/

    This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
    cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
    components) + several long-standing patches from various folks.

    There definitely will be a lot more (starting with Miklos'
    check_submount_and_drop() series)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    direct-io: Handle O_(D)SYNC AIO
    direct-io: Implement generic deferred AIO completions
    add formats for dentry/file pathnames
    kvm eventfd: switch to fdget
    powerpc kvm: use fdget
    switch fchmod() to fdget
    switch epoll_ctl() to fdget
    switch copy_module_from_fd() to fdget
    git simplify nilfs check for busy subtree
    ibmasmfs: don't bother passing superblock when not needed
    don't pass superblock to hypfs_{mkdir,create*}
    don't pass superblock to hypfs_diag_create_files
    don't pass superblock to hypfs_vm_create_files()
    oprofile: get rid of pointless forward declarations of struct super_block
    oprofilefs_create_...() do not need superblock argument
    oprofilefs_mkdir() doesn't need superblock argument
    don't bother with passing superblock to oprofile_create_stats_files()
    oprofile: don't bother with passing superblock to ->create_files()
    don't bother passing sb to oprofile_create_files()
    coh901318: don't open-code simple_read_from_buffer()
    ...

    Linus Torvalds
     

04 Sep, 2013

1 commit


30 Aug, 2013

3 commits

  • For bytemaps each IRQ field is 1 byte wide, so we pack 4 irq fields in
    one word and since there are 32 private (per cpu) irqs, we have 8
    private u32 fields on the vgic_bytemap struct. We shift the offset from
    the base of the register group right by 2, giving us the word index
    instead of the field index. But then there are 8 private words, not 4,
    which is also why we subtract 8 words from the offset of the shared
    words.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Gleb Natapov

    Christoffer Dall
     
  • All the code in handle_mmio_cfg_reg() assumes the offset has
    been shifted right to accommodate for the 2:1 bit compression,
    but this is only done when getting the register address.

    Shift the offset early so the code works mostly unchanged.

    Reported-by: Zhaobo (Bob, ERC)
    Signed-off-by: Marc Zyngier
    Signed-off-by: Gleb Natapov

    Marc Zyngier
     
  • vgic_get_target_reg is quite complicated, for no good reason.
    Actually, it is fairly easy to write it in a much more efficient
    way by using the target CPU array instead of the bitmap.

    Signed-off-by: Marc Zyngier
    Signed-off-by: Gleb Natapov

    Marc Zyngier
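
The index arithmetic from the bytemap commit at the top of this group (one byte per IRQ, four IRQ fields per 32-bit word, 8 private words per CPU) can be sketched as follows. The struct layout, `MAX_CPUS` and `NR_SHARED_IRQS` are assumptions chosen for the example; only the shift by 2 and the 8-word private region come from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_PRIVATE_IRQS 32 /* per-CPU IRQs, as stated in the message */
#define NR_SHARED_IRQS  96 /* assumed for this sketch */
#define MAX_CPUS        4  /* assumed for this sketch */

struct bytemap {
    uint32_t percpu[MAX_CPUS][NR_PRIVATE_IRQS / 4]; /* 8 words per CPU */
    uint32_t shared[NR_SHARED_IRQS / 4];
};

/*
 * offset is the byte offset from the base of the register group,
 * i.e. the IRQ field index. Returns the word holding that field,
 * or NULL if the arguments are out of range.
 */
static uint32_t *bytemap_get_reg(struct bytemap *x, int cpuid, uint32_t offset)
{
    offset >>= 2; /* field index -> word index */
    if (cpuid < 0 || cpuid >= MAX_CPUS)
        return NULL;
    if (offset >= (NR_PRIVATE_IRQS + NR_SHARED_IRQS) / 4)
        return NULL;
    if (offset < NR_PRIVATE_IRQS / 4) /* first 8 words are private */
        return &x->percpu[cpuid][offset];
    return &x->shared[offset - NR_PRIVATE_IRQS / 4];
}
```

Comparing the word index against 4 instead of 8 would send private IRQs 16 to 31 into the shared array, which is the mismatch the message describes.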
     

28 Aug, 2013

1 commit


27 Aug, 2013

1 commit

  • The checks on PG_reserved in the page structure on head and tail pages
    aren't necessary because split_huge_page wouldn't transfer the
    PG_reserved bit from head to tail anyway.

    This was a forward-thinking check done in the case PageReserved was
    set by a driver-owned page mapped in userland with something like
    remap_pfn_range in a VM_PFNMAP region, but using hugepmds (not
    possible right now). It was meant to be very safe, but it's overkill
    as it's unlikely split_huge_page could ever run without the driver
    noticing and tearing down the hugepage itself.

    And if a driver in the future really wants to map a reserved
    hugepage in userland using a huge pmd it should simply take care of
    marking all subpages reserved too to keep KVM safe. This of course
    would require such a hypothetical driver to tear down the huge pmd
    itself and split the hugepage itself, instead of relying on
    split_huge_page, but that sounds very reasonable, especially
    considering split_huge_page wouldn't currently transfer the reserved
    bit anyway.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Gleb Natapov

    Andrea Arcangeli
     

26 Aug, 2013

1 commit

  • KVM uses anon_inode_getfd() to allocate file descriptors as part
    of some of its ioctls. But those ioctls are lacking a flag argument
    allowing userspace to choose options for the newly opened file descriptor.

    In such cases it's advised to use O_CLOEXEC by default so that
    userspace is allowed to choose, without race, if the file descriptor
    is going to be inherited across exec().

    This patch sets the O_CLOEXEC flag on all file descriptors created
    with anon_inode_getfd() so that they are not leaked across exec().

    Signed-off-by: Yann Droneaud
    Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.com
    Reviewed-by: Paolo Bonzini
    Signed-off-by: Gleb Natapov

    Yann Droneaud
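
From userspace, the effect of O_CLOEXEC can be observed with `fcntl(F_GETFD)`. The sketch below uses `/dev/null` as a stand-in descriptor (an assumption for the example; KVM file descriptors would come from its ioctls), and the helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if fd is marked close-on-exec, 0 if not, -1 on error. */
static int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Open /dev/null with the given extra flags, report FD_CLOEXEC, close. */
static int open_and_check(int extra_flags)
{
    int fd = open("/dev/null", O_RDONLY | extra_flags);
    int ret;

    if (fd < 0)
        return -1;
    ret = fd_is_cloexec(fd);
    close(fd);
    return ret;
}
```

Setting the flag at creation time avoids the window between creating the descriptor and a later `fcntl(F_SETFD)` in which another thread could fork and exec. A program that does want inheritance can still clear the flag explicitly.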
     

29 Jul, 2013

1 commit

  • kvm_io_bus_sort_cmp is used also directly, not just as a callback for
    sort and bsearch. In these cases, it is handy to have a type-safe
    variant. This patch introduces such a variant, __kvm_io_bus_sort_cmp,
    and uses it throughout kvm_main.c.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
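
The pattern in this commit, a typed comparison function plus a thin `void *` adapter for sort and bsearch, looks roughly like this in standalone C. `struct io_range`, `io_range_cmp` and `io_range_sort_cmp` are hypothetical stand-ins, and the comparison is simplified to the start address only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct io_range {
    uint64_t addr;
    uint32_t len;
};

/* Typed variant: passing the wrong pointer type is a compile-time diagnostic. */
static int io_range_cmp(const struct io_range *r1, const struct io_range *r2)
{
    if (r1->addr < r2->addr)
        return -1;
    if (r1->addr > r2->addr)
        return 1;
    return 0;
}

/* Untyped adapter with the signature qsort() and bsearch() expect. */
static int io_range_sort_cmp(const void *p1, const void *p2)
{
    return io_range_cmp(p1, p2);
}

/* Sort an array of ranges by start address. */
static void sort_ranges(struct io_range *r, size_t n)
{
    qsort(r, n, sizeof(*r), io_range_sort_cmp);
}
```

Direct callers use the typed function and keep type checking; only the library callbacks go through the adapter.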
     

18 Jul, 2013

2 commits

  • This is called right after the memslots are updated, i.e. when the result
    of update_memslots() gets installed in install_new_memslots(). Since
    the memslots need to be updated twice when we delete or move a memslot,
    kvm_arch_commit_memory_region() does not correspond to this exactly.

    In the following patch, x86 will use this new API to check if the mmio
    generation has reached its maximum value, in which case mmio sptes need
    to be flushed out.

    Signed-off-by: Takuya Yoshikawa
    Acked-by: Alexander Graf
    Reviewed-by: Xiao Guangrong
    Signed-off-by: Paolo Bonzini

    Takuya Yoshikawa
     
  • Add new functions kvm_io_bus_{read,write}_cookie() that allow users of
    the kvm io infrastructure to use a cookie value to speed up lookup of a
    device on an io bus.

    Signed-off-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Gleb Natapov

    Cornelia Huck
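
One way to picture the cookie idea from the last commit in this group: the caller remembers the index a previous lookup returned and offers it as a hint, and the lookup falls back to a search if the hint no longer matches. Everything below (`struct io_dev`, `struct io_bus`, `bus_lookup`, the linear fallback scan) is an assumption made for illustration, not the kvm_io_bus code.

```c
#include <stddef.h>
#include <stdint.h>

struct io_dev {
    uint64_t addr;
    uint32_t len;
};

struct io_bus {
    size_t dev_count;
    const struct io_dev *devs;
};

/* Return 1 if the access [addr, addr + len) falls inside device d. */
static int dev_covers(const struct io_dev *d, uint64_t addr, uint32_t len)
{
    return addr >= d->addr && addr + len <= d->addr + d->len;
}

/*
 * Return the index of the device covering the access, or -1.
 * cookie is a hint: the index from an earlier lookup, or negative for none.
 */
static long bus_lookup(const struct io_bus *bus, uint64_t addr, uint32_t len,
                       long cookie)
{
    size_t i;

    /* Fast path: the hint is in range and still matches. */
    if (cookie >= 0 && (size_t)cookie < bus->dev_count &&
        dev_covers(&bus->devs[cookie], addr, len))
        return cookie;

    /* Slow path: linear scan, kept simple for the sketch. */
    for (i = 0; i < bus->dev_count; i++)
        if (dev_covers(&bus->devs[i], addr, len))
            return (long)i;
    return -1;
}
```

Because a stale or garbage cookie only costs the fallback search, the caller never has to validate it.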
     

04 Jul, 2013

1 commit

  • Pull KVM fixes from Paolo Bonzini:
    "On the x86 side, there are some optimizations and documentation
    updates. The big ARM/KVM change for 3.11, support for AArch64, will
    come through Catalin Marinas's tree. s390 and PPC have misc cleanups
    and bugfixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
    KVM: PPC: Ignore PIR writes
    KVM: PPC: Book3S PR: Invalidate SLB entries properly
    KVM: PPC: Book3S PR: Allow guest to use 1TB segments
    KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
    KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
    KVM: PPC: Book3S PR: Fix proto-VSID calculations
    KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
    KVM: Fix RTC interrupt coalescing tracking
    kvm: Add a tracepoint write_tsc_offset
    KVM: MMU: Inform users of mmio generation wraparound
    KVM: MMU: document fast invalidate all mmio sptes
    KVM: MMU: document fast invalidate all pages
    KVM: MMU: document fast page fault
    KVM: MMU: document mmio page fault
    KVM: MMU: document write_flooding_count
    KVM: MMU: document clear_spte_count
    KVM: MMU: drop kvm_mmu_zap_mmio_sptes
    KVM: MMU: init kvm generation close to mmio wrap-around value
    KVM: MMU: add tracepoint for check_mmio_spte
    KVM: MMU: fast invalidate all mmio sptes
    ...

    Linus Torvalds
     

27 Jun, 2013

3 commits

  • KVM/ARM pull request for 3.11 merge window

    * tag 'kvm-arm-3.11' of git://git.linaro.org/people/cdall/linux-kvm-arm.git:
    ARM: kvm: don't include drivers/virtio/Kconfig
    Update MAINTAINERS: KVM/ARM work now funded by Linaro
    arm/kvm: Cleanup KVM_ARM_MAX_VCPUS logic
    ARM: KVM: clear exclusive monitor on all exception returns
    ARM: KVM: add missing dsb before invalidating Stage-2 TLBs
    ARM: KVM: perform save/restore of PAR
    ARM: KVM: get rid of S2_PGD_SIZE
    ARM: KVM: don't special case PC when doing an MMIO
    ARM: KVM: use phys_addr_t instead of unsigned long long for HYP PGDs
    ARM: KVM: remove dead prototype for __kvm_tlb_flush_vmid
    ARM: KVM: Don't handle PSCI calls via SMC
    ARM: KVM: Allow host virt timer irq to be different from guest timer virt irq

    Gleb Natapov
     
  • This reverts most of f1ed0450a5fac7067590317cbf027f566b6ccbca. After
    that commit, kvm_apic_set_irq() no longer returns accurate information
    about interrupt injection status if injection is done into a disabled
    APIC. RTC interrupt coalescing tracking relies on the information being
    accurate and cannot recover if it is not.

    Signed-off-by: Gleb Natapov

    Gleb Natapov
     
  • The arch_timer irq numbers (or PPI numbers) are implementation dependent,
    so the host virtual timer irq number can be different from guest virtual
    timer irq number.

    This patch ensures that host virtual timer irq number is read from DTB and
    guest virtual timer irq is determined based on vcpu target type.

    Signed-off-by: Anup Patel
    Signed-off-by: Pranavkumar Sawargaonkar
    Signed-off-by: Christoffer Dall

    Anup Patel
     

04 Jun, 2013

1 commit

  • We can easily reach the 1000 limit by starting a VM with a couple
    hundred I/O devices (multifunction=on). The hardcoded limit has
    already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).

    In userspace, we already have the maximum file descriptor count to
    limit the ioeventfd count. But kvm_io_bus devices are also used
    for pit, pic, ioapic, coalesced_mmio. They cannot be limited
    by the maximum file descriptor count.

    Currently only ioeventfds take up too many kvm_io_bus devices,
    so just exclude them from the kvm_io_range limit count.

    Also fixed one indent issue in kvm_host.h

    Signed-off-by: Amos Kong
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Gleb Natapov

    Amos Kong
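
The accounting change can be sketched as a registration check that subtracts the ioeventfd count before comparing against the limit. `struct io_bus_counts` and `bus_register` are hypothetical names for this example; only the 1000 limit and the idea of excluding ioeventfds come from the message above.

```c
#include <errno.h>

#define NR_IOBUS_DEVS 1000 /* the limit quoted in the commit message */

struct io_bus_counts {
    int dev_count;       /* all registered devices */
    int ioeventfd_count; /* the subset that are ioeventfds */
};

/* Register one device; ioeventfds do not count toward the limit. */
static int bus_register(struct io_bus_counts *bus, int is_ioeventfd)
{
    /* ioeventfds are already bounded by the process fd limit */
    if (bus->dev_count - bus->ioeventfd_count >= NR_IOBUS_DEVS)
        return -ENOSPC;

    bus->dev_count++;
    if (is_ioeventfd)
        bus->ioeventfd_count++;
    return 0;
}
```

In this sketch a bus holding thousands of ioeventfds still accepts other devices until the remaining devices alone reach the limit.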
     

19 May, 2013

1 commit

  • As KVM/arm64 is looming on the horizon, it makes sense to move some
    of the common code to a single location in order to reduce duplication.

    The code could live anywhere. Actually, most of KVM is already built
    with a bunch of ugly ../../.. hacks in the various Makefiles, so we're
    not exactly talking about style here. But maybe it is time to start
    moving into a less ugly direction.

    The include files must be in a "public" location, as they are accessed
    from non-KVM files (arch/arm/kernel/asm-offsets.c).

    For this purpose, introduce two new locations:
    - virt/kvm/arm/ : x86 and ia64 already share the ioapic code in
    virt/kvm, so this could be seen as a (very ugly) precedent.
    - include/kvm/ : there is already an include/xen, and while the
    intent is slightly different, this seems as good a location as
    any

    Eventually, we should probably have independent Makefiles at every
    level (just like everywhere else in the kernel), but this is just
    the first step.

    Signed-off-by: Marc Zyngier
    Signed-off-by: Gleb Natapov

    Marc Zyngier