09 Jan, 2012

1 commit


26 Dec, 2011

1 commit

  • Only allow KVM device assignment to attach to devices which:

    - Are not bridges
    - Have BAR resources (assume others are special devices)
    - The user has permissions to use

    Assigning a bridge is a configuration error; it's not supported, and
    typically doesn't result in the behavior the user expects anyway.
    Devices without BAR resources are typically chipset components that
    also don't have host drivers. We don't want users to hold such devices
    captive or cause system problems by fencing them off into an iommu
    domain. We determine "permission to use" by testing whether the user
    has access to the PCI sysfs resource files. By default a normal user
    will not have access to these files, so it provides a good indication
    that an administration agent has granted the user access to the device.
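A userspace sketch of the three checks (the structure and field names here are hypothetical stand-ins; the real code operates on struct pci_dev inside KVM's assigned-device support):

```c
#include <unistd.h>

/* Hypothetical stand-in for the kernel's PCI device state. */
struct fake_pci_dev {
    unsigned char hdr_type;      /* 0x00 = normal device, 0x01/0x02 = bridge */
    unsigned long bar_flags[6];  /* non-zero if that BAR is populated */
    const char *sysfs_resource;  /* e.g. a /sys/bus/pci/.../resource file */
};

/* Returns 1 if the device may be assigned, 0 otherwise. */
static int device_assignable(const struct fake_pci_dev *dev)
{
    int i, has_bar = 0;

    /* 1. Refuse bridges: assigning one is a configuration error. */
    if (dev->hdr_type != 0x00)
        return 0;

    /* 2. Require at least one BAR resource; devices without any are
     *    assumed to be special chipset components. */
    for (i = 0; i < 6; i++)
        if (dev->bar_flags[i])
            has_bar = 1;
    if (!has_bar)
        return 0;

    /* 3. "Permission to use": can this user read and write the
     *    sysfs resource file? */
    return access(dev->sysfs_resource, R_OK | W_OK) == 0;
}
```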

    [Yang Bai: add missing #include]
    [avi: fix comment style]

    Signed-off-by: Alex Williamson
    Signed-off-by: Yang Bai
    Signed-off-by: Marcelo Tosatti

    Alex Williamson
     

25 Dec, 2011

1 commit


10 Nov, 2011

1 commit

  • When mapping a memory region, split it into page sizes supported
    by the iommu hardware. Always prefer bigger pages, when possible,
    in order to reduce TLB pressure.

    The logic to do that is now added to the IOMMU core, so neither the iommu
    drivers themselves nor users of the IOMMU API have to duplicate it.

    This allows a more lenient granularity of mappings; traditionally the
    IOMMU API took 'order' (of a page) as a mapping size, and directly let
    the low level iommu drivers handle the mapping, but now that the IOMMU
    core can split arbitrary memory regions into pages, we can remove this
    limitation, so users don't have to split those regions by themselves.

    Currently the supported page sizes are advertised once and then
    remain static. That works well for OMAP and MSM, but it would probably
    not fly well with Intel's hardware, where the page size capabilities
    seem to have the potential to differ between several DMA remapping
    devices.

    register_iommu() currently sets a default pgsize behavior, so we can convert
    the IOMMU drivers in subsequent patches. After all the drivers
    are converted, the temporary default settings will be removed.

    Mainline users of the IOMMU API (kvm and omap-iovmm) are adapted
    to deal with bytes instead of page order.
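The splitting logic can be sketched as follows; this is a simplified stand-alone model rather than the IOMMU core's actual code, and the page-size bitmap used in the test is illustrative:

```c
#include <stddef.h>

/* Pick the largest supported page size that is aligned to iova and no
 * larger than the remaining size. Bit n set in the bitmap means a page
 * size of 2^n bytes is supported. */
static size_t pick_pgsize(unsigned long long bitmap,
                          unsigned long long iova, size_t size)
{
    int b;

    for (b = 63; b >= 0; b--) {
        size_t pg = (size_t)1 << b;

        if (!(bitmap & pg) || pg > size || (iova & (pg - 1)))
            continue;
        return pg;
    }
    return 0;
}

/* Count how many map calls a region needs after splitting. */
static int split_count(unsigned long long bitmap,
                       unsigned long long iova, size_t size)
{
    int n = 0;

    while (size) {
        size_t pg = pick_pgsize(bitmap, iova, size);

        if (!pg)
            return -1;  /* cannot be mapped with the supported sizes */
        iova += pg;
        size -= pg;
        n++;
    }
    return n;
}
```

With 4K and 2M pages supported, a 2M+4K region aligned at zero splits into one 2M page plus one 4K page; the same length starting at a 4K-aligned (but not 2M-aligned) address falls back to 4K pages throughout.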

    Many thanks to Joerg Roedel for significant review!

    Signed-off-by: Ohad Ben-Cohen
    Cc: David Brown
    Cc: David Woodhouse
    Cc: Joerg Roedel
    Cc: Stepan Moskovchenko
    Cc: KyongHo Cho
    Cc: Hiroshi DOYU
    Cc: Laurent Pinchart
    Cc: kvm@vger.kernel.org
    Signed-off-by: Joerg Roedel

    Ohad Ben-Cohen
     

01 Nov, 2011

2 commits

  • This file has things like module_param_named() and MODULE_PARM_DESC()
    so it needs the full module.h header present. Without it, you'll get:

    CC arch/x86/kvm/../../../virt/kvm/iommu.o
    virt/kvm/iommu.c:37: error: expected ‘)’ before ‘bool’
    virt/kvm/iommu.c:39: error: expected ‘)’ before string constant
    make[3]: *** [arch/x86/kvm/../../../virt/kvm/iommu.o] Error 1
    make[2]: *** [arch/x86/kvm] Error 2

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • This was coming in via an implicit module.h (and its sub-includes)
    before, but we'll be cleaning that up shortly. Call out the stat.h
    include requirement in advance.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

31 Oct, 2011

2 commits

  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
    iommu/core: Remove global iommu_ops and register_iommu
    iommu/msm: Use bus_set_iommu instead of register_iommu
    iommu/omap: Use bus_set_iommu instead of register_iommu
    iommu/vt-d: Use bus_set_iommu instead of register_iommu
    iommu/amd: Use bus_set_iommu instead of register_iommu
    iommu/core: Use bus->iommu_ops in the iommu-api
    iommu/core: Convert iommu_found to iommu_present
    iommu/core: Add bus_type parameter to iommu_domain_alloc
    Driver core: Add iommu_ops to bus_type
    iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API
    iommu/amd: Fix wrong shift direction
    iommu/omap: always provide iommu debug code
    iommu/core: let drivers know if an iommu fault handler isn't installed
    iommu/core: export iommu_set_fault_handler()
    iommu/omap: Fix build error with !IOMMU_SUPPORT
    iommu/omap: Migrate to the generic fault report mechanism
    iommu/core: Add fault reporting mechanism
    iommu/core: Use PAGE_SIZE instead of hard-coded value
    iommu/core: use the existing IS_ALIGNED macro
    iommu/msm: ->unmap() should return order of unmapped page
    ...

    Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to
    dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config
    options" just happened to touch lines next to each other.

    Linus Torvalds
     
  • * 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (75 commits)
    KVM: SVM: Keep intercepting task switching with NPT enabled
    KVM: s390: implement sigp external call
    KVM: s390: fix register setting
    KVM: s390: fix return value of kvm_arch_init_vm
    KVM: s390: check cpu_id prior to using it
    KVM: emulate lapic tsc deadline timer for guest
    x86: TSC deadline definitions
    KVM: Fix simultaneous NMIs
    KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode
    KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode
    KVM: x86 emulator: streamline decode of segment registers
    KVM: x86 emulator: simplify OpMem64 decode
    KVM: x86 emulator: switch src decode to decode_operand()
    KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack
    KVM: x86 emulator: switch OpImmUByte decode to decode_imm()
    KVM: x86 emulator: free up some flag bits near src, dst
    KVM: x86 emulator: switch src2 to generic decode_operand()
    KVM: x86 emulator: expand decode flags to 64 bits
    KVM: x86 emulator: split dst decode to a generic decode_operand()
    KVM: x86 emulator: move memop, memopp into emulation context
    ...

    Linus Torvalds
     

21 Oct, 2011

2 commits


26 Sep, 2011

6 commits

  • The threaded IRQ handler for MSI-X has almost nothing in common with the
    INTx/MSI handler. Move its code into a dedicated handler.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
     
  • We only perform work in kvm_assigned_dev_ack_irq if the guest IRQ is of
    INTx type. This completely avoids the callback invocation in non-INTx
    cases by registering the IRQ ack notifier only for INTx.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
     
  • Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
     
  • Currently the method of dealing with an IO operation on a bus (PIO/MMIO)
    is to call the read or write callback for each device registered
    on the bus until we find a device which handles it.

    Since the number of devices on a bus can be significant due to ioeventfds
    and coalesced MMIO zones, this leads to a lot of overhead on each IO
    operation.

    Instead of registering devices, we now register ranges which point to
    a device. Lookup is done using an efficient bsearch instead of a linear
    search.

    The performance test was conducted by comparing the exit count per
    second with 200 ioeventfds created on one byte while the guest
    continuously accesses a different byte (triggering usermode exits).
    Before the patch the guest achieved 259k exits per second; after the
    patch it does 274k exits per second.
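The sorted-range lookup can be sketched with the C library's bsearch (the names below are illustrative, not the kernel's):

```c
#include <stdlib.h>

struct io_range {
    unsigned long long base;
    unsigned long long len;
    int dev_idx;  /* which device handles this range */
};

/* Compare an address against a range, for bsearch over an array of
 * non-overlapping ranges sorted by base address. */
static int cmp_addr_range(const void *key, const void *elt)
{
    unsigned long long addr = *(const unsigned long long *)key;
    const struct io_range *r = elt;

    if (addr < r->base)
        return -1;
    if (addr >= r->base + r->len)
        return 1;
    return 0;
}

/* Return the index of the device handling addr, or -1 if none. */
static int bus_lookup(const struct io_range *ranges, size_t n,
                      unsigned long long addr)
{
    const struct io_range *r =
        bsearch(&addr, ranges, n, sizeof(*ranges), cmp_addr_range);

    return r ? r->dev_idx : -1;
}
```

Each lookup is O(log n) in the number of registered ranges, instead of O(n) callback invocations.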

    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Sasha Levin
    Signed-off-by: Avi Kivity

    Sasha Levin
     
  • This patch changes coalesced mmio to create one mmio device per
    zone instead of handling all zones in one device.

    Doing so enables us to take advantage of existing locking and prevents
    a race condition between coalesced mmio registration/unregistration
    and lookups.

    Suggested-by: Avi Kivity
    Signed-off-by: Sasha Levin
    Signed-off-by: Marcelo Tosatti

    Sasha Levin
     
  • Move the check for available entries to within the spinlock. This
    allows working with a larger number of VCPUs and reduces premature
    exits.
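A minimal model of the fix, assuming a simple counter guarded by a lock (all names are illustrative, and the kernel code uses a spinlock rather than a pthread mutex): checking for free entries outside the lock lets two VCPUs both pass the check and overshoot the capacity.

```c
#include <pthread.h>

struct ring {
    pthread_mutex_t lock;
    int used;
    int capacity;
};

/* The free-slot check happens under the same lock that protects the
 * counter, so the check and the increment are atomic together. */
static int ring_try_add(struct ring *r)
{
    int ok = 0;

    pthread_mutex_lock(&r->lock);
    if (r->used < r->capacity) {  /* checked inside the lock */
        r->used++;
        ok = 1;
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}
```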

    Cc: Avi Kivity
    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Pekka Enberg
    Signed-off-by: Sasha Levin
    Signed-off-by: Marcelo Tosatti

    Sasha Levin
     

24 Sep, 2011

1 commit

  • Device drivers that create and destroy SR-IOV virtual functions via
    calls to pci_enable_sriov() and pci_disable_sriov() can cause
    catastrophic failures if they attempt to destroy VFs while those VFs
    are assigned to guest virtual machines. By adding a flag, set by the
    KVM module to indicate that a device is assigned, a device driver can
    check that flag, avoid destroying VFs while they are assigned, and
    avoid system failures.
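A sketch of how a driver's teardown path might use such a flag (the flag name and structure here are stand-ins, not the actual pci_dev API):

```c
/* Illustrative flag bit; the kernel defines a dedicated pci_dev flag
 * for this purpose. */
#define DEV_FLAG_ASSIGNED (1u << 0)

struct fake_pf {
    unsigned int flags;
    int num_vfs;
};

/* Refuse to destroy VFs that a VM still holds, instead of causing a
 * catastrophic failure. */
static int try_disable_sriov(struct fake_pf *pf)
{
    if (pf->flags & DEV_FLAG_ASSIGNED)
        return -1;       /* VFs assigned to a guest: refuse */
    pf->num_vfs = 0;     /* stand-in for pci_disable_sriov() */
    return 0;
}
```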

    CC: Ian Campbell
    CC: Konrad Wilk
    Signed-off-by: Greg Rose
    Acked-by: Jesse Barnes
    Signed-off-by: Jeff Kirsher

    Greg Rose
     

24 Jul, 2011

3 commits

  • IOMMU interrupt remapping support provides a further layer of
    isolation for device assignment by preventing arbitrary interrupt
    block DMA writes by a malicious guest from reaching the host. By
    default, we should require that the platform provides interrupt
    remapping support, with an opt-in mechanism for existing behavior.

    Both AMD IOMMU and Intel VT-d2 hardware support interrupt
    remapping, however we currently only have software support on
    the Intel side. Users wishing to re-enable device assignment
    when interrupt remapping is not supported on the platform can
    use the "allow_unsafe_assigned_interrupts=1" module option.

    [avi: break long lines]

    Signed-off-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Alex Williamson
     
  • The idea is from Avi:

    | We could cache the result of a miss in an spte by using a reserved bit, and
    | checking the page fault error code (or seeing if we get an ept violation or
    | ept misconfiguration), so if we get repeated mmio on a page, we don't need to
    | search the slot list/tree.
    | (https://lkml.org/lkml/2011/2/22/221)

    When the page fault is caused by mmio, we cache the info in the shadow
    page table and also set the reserved bits in the shadow page table, so
    if the mmio access happens again, we can quickly identify it and
    emulate it directly.

    Searching for an mmio gfn in the memslots is heavy since we need to
    walk all memslots; this feature reduces that cost and also avoids
    walking the guest page table for soft mmu.
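The caching idea can be sketched as bit-packing into a not-present entry; the marker bit and field layout below are illustrative, not the real x86 spte encoding:

```c
/* Illustrative encoding: one reserved bit marks the entry as a cached
 * mmio fault, and the gfn plus access bits are stashed in the rest. */
#define MMIO_MARK   (1ULL << 51)
#define ACCESS_MASK 0x7ULL

static unsigned long long make_mmio_spte(unsigned long long gfn,
                                         unsigned access)
{
    /* A not-present entry that still remembers gfn + access bits. */
    return MMIO_MARK | (gfn << 12) | (access & ACCESS_MASK);
}

static int is_mmio_spte(unsigned long long spte)
{
    return (spte & MMIO_MARK) != 0;
}

static unsigned long long mmio_spte_gfn(unsigned long long spte)
{
    return (spte & ~(MMIO_MARK | ACCESS_MASK)) >> 12;
}
```

On a repeated fault, checking the marker bit replaces the full memslot walk: the cached gfn and access bits are enough to go straight to mmio emulation.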

    [jan: fix operator precedence issue]

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
  • If the page fault is caused by mmio, the gfn cannot be found in the
    memslots, and 'bad_pfn' is returned on the gfn_to_hva path, so we can
    use 'bad_pfn' to identify the mmio page fault.
    And, to clarify the meaning of an mmio pfn, we return the fault page
    instead of the bad page when the gfn is not allowed to be prefetched.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     

12 Jul, 2011

5 commits

  • Introduce a kvm_read_guest_cached() function in addition to the write
    one we already have.

    [ by glauber: export function signature in kvm header ]

    Signed-off-by: Gleb Natapov
    Signed-off-by: Glauber Costa
    Acked-by: Rik van Riel
    Tested-by: Eric Munson
    Signed-off-by: Avi Kivity

    Gleb Natapov
     
  • KVM_MAX_MSIX_PER_DEV implies that up to that many MSI-X entries can be
    requested, but so far the kernel rejected even the upper limit itself.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • KVM has an ioctl to define which signal mask should be used while running
    inside VCPU_RUN. At least for big endian systems, this mask is different
    on 32-bit and 64-bit systems (though the size is identical).

    Add a compat wrapper that converts the mask to whatever the kernel accepts,
    allowing 32-bit kvm user space to set signal masks.

    This patch fixes qemu with --enable-io-thread on ppc64 hosts when running
    32-bit user land.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct
    if it fails. Move this confusing responsibility back into the hands of
    kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected;
    all other archs cannot fail.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • Simply use __copy_to_user/__clear_user to write the guest page, since
    we have already verified the user address when the memslot is set.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     

08 Jun, 2011

1 commit


06 Jun, 2011

1 commit

  • It doesn't make sense to ever see a half-initialized kvm structure on
    mmu notifier callbacks. Previously, 85722cda changed the ordering to
    ensure that the mmu_lock was initialized before mmu notifier
    registration, but there is still a race where the mmu notifier could
    come in and try accessing other portions of struct kvm before they are
    initialized.

    Solve this by moving the mmu notifier registration to occur after the
    structure is completely initialized.

    Google-Bug-Id: 452199
    Signed-off-by: Mike Waychison
    Signed-off-by: Avi Kivity

    Mike Waychison
     

26 May, 2011

1 commit

  • fa3d315a "KVM: Validate userspace_addr of memslot when registered" introduced
    this new warning on s390:

    kvm_main.c: In function '__kvm_set_memory_region':
    kvm_main.c:654:7: warning: passing argument 1 of '__access_ok' makes pointer from integer without a cast
    arch/s390/include/asm/uaccess.h:53:19: note: expected 'const void *' but argument is of type '__u64'

    Add the missing cast to get rid of it again...

    Cc: Takuya Yoshikawa
    Signed-off-by: Heiko Carstens
    Signed-off-by: Avi Kivity

    Heiko Carstens
     

24 May, 2011

1 commit

  • * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (27 commits)
    PCI: Don't use dmi_name_in_vendors in quirk
    PCI: remove unused AER functions
    PCI/sysfs: move bus cpuaffinity to class dev_attrs
    PCI: add rescan to /sys/.../pci_bus/.../
    PCI: update bridge resources to get more big ranges when allocating space (again)
    KVM: Use pci_store/load_saved_state() around VM device usage
    PCI: Add interfaces to store and load the device saved state
    PCI: Track the size of each saved capability data area
    PCI/e1000e: Add and use pci_disable_link_state_locked()
    x86/PCI: derive pcibios_last_bus from ACPI MCFG
    PCI: add latency tolerance reporting enable/disable support
    PCI: add OBFF enable/disable support
    PCI: add ID-based ordering enable/disable support
    PCI hotplug: acpiphp: assume device is in state D0 after powering on a slot.
    PCI: Set PCIE maxpayload for card during hotplug insertion
    PCI/ACPI: Report _OSC control mask returned on failure to get control
    x86/PCI: irq and pci_ids patch for Intel Panther Point DeviceIDs
    PCI: handle positive error codes
    PCI: check pci_vpd_pci22_wait() return
    PCI: Use ICH6_GPIO_EN in ich6_lpc_acpi_gpio
    ...

    Fix up trivial conflicts in include/linux/pci_ids.h: commit a6e5e2be4461
    moved the Intel SMBus ID definitions to the i2c-i801.c driver.

    Linus Torvalds
     

22 May, 2011

4 commits

  • As the trace below shows, the mmu_notifier can be called immediately
    after registration. So kvm has to initialize kvm->mmu_lock before
    registering it.

    BUG: spinlock bad magic on CPU#0, kswapd0/342
    lock: ffff8800af8c4000, .magic: 00000000, .owner: /-1, .owner_cpu: 0
    Pid: 342, comm: kswapd0 Not tainted 2.6.39-rc5+ #1
    Call Trace:
    [] spin_bug+0x9c/0xa3
    [] do_raw_spin_lock+0x29/0x13c
    [] ? flush_tlb_others_ipi+0xaf/0xfd
    [] _raw_spin_lock+0x9/0xb
    [] kvm_mmu_notifier_clear_flush_young+0x2c/0x66 [kvm]
    [] __mmu_notifier_clear_flush_young+0x2b/0x57
    [] page_referenced_one+0x88/0xea
    [] page_referenced+0x1fc/0x256
    [] shrink_page_list+0x187/0x53a
    [] shrink_inactive_list+0x1e0/0x33d
    [] ? determine_dirtyable_memory+0x15/0x27
    [] ? call_function_single_interrupt+0xe/0x20
    [] shrink_zone+0x322/0x3de
    [] ? zone_watermark_ok_safe+0xe2/0xf1
    [] kswapd+0x516/0x818
    [] ? shrink_zone+0x3de/0x3de
    [] kthread+0x7d/0x85
    [] kernel_thread_helper+0x4/0x10
    [] ? __init_kthread_worker+0x37/0x37
    [] ? gs_change+0xb/0xb

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Avi Kivity

    OGAWA Hirofumi
     
  • This way, we can avoid checking the user space address many times when
    we read the guest memory.

    Although we could do the same for writes if we checked which slots are
    writable, we do not care about writes for now: reading guest memory
    happens more often than writing.

    [avi: change VERIFY_READ to VERIFY_WRITE]

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Avi Kivity

    Takuya Yoshikawa
     
  • Function ioapic_debug() in ioapic_deliver() misnames one field it
    references. This patch corrects it.

    Signed-off-by: Liu Yuan
    Signed-off-by: Avi Kivity

    Liu Yuan
     
  • Store the device saved state so that we can reload the device back
    to the original state when it's unassigned. This has the benefit
    that the state survives across pci_reset_function() calls via
    the PCI sysfs reset interface while the VM is using the device.

    Signed-off-by: Alex Williamson
    Acked-by: Avi Kivity
    Signed-off-by: Jesse Barnes

    Alex Williamson
     

11 May, 2011

1 commit


08 Apr, 2011

1 commit


06 Apr, 2011

2 commits

  • If asynchronous hva_to_pfn() is requested, call GUP with FOLL_NOWAIT to
    avoid sleeping on IO. The check for hwpoison is done at the same time;
    otherwise check_user_page_hwpoison() would call GUP again and put the
    vcpu to sleep.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Avi Kivity

    Gleb Natapov
     
  • irqfd in kvm used flush_work incorrectly: it assumed that work scheduled
    previously can't run after flush_work, but since kvm uses a non-reentrant
    workqueue (by means of schedule_work) we need flush_work_sync to get that
    guarantee.

    Signed-off-by: Michael S. Tsirkin
    Reported-by: Jean-Philippe Menil
    Tested-by: Jean-Philippe Menil
    Signed-off-by: Avi Kivity

    Michael S. Tsirkin
     

31 Mar, 2011

1 commit


26 Mar, 2011

1 commit


24 Mar, 2011

1 commit

  • As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita