26 Nov, 2014

2 commits

  • This reverts commit 85c8555ff0 ("KVM: check for !is_zero_pfn() in
    kvm_is_mmio_pfn()") and renames the function to kvm_is_reserved_pfn.

    The problem being addressed by the patch above was that some ARM code
    based the memory mapping attributes of a pfn on the return value of
    kvm_is_mmio_pfn(), whose name indeed suggests that such pfns should
    be mapped as device memory.

    However, kvm_is_mmio_pfn() doesn't do quite what it says on the tin,
    and the existing non-ARM users were already using it in a way which
    suggests that its name should probably have been 'kvm_is_reserved_pfn'
    from the beginning, e.g., to decide whether or not to call
    get_page/put_page on it. This means that returning false for the zero
    page is a mistake and the patch above should be reverted.

    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Paolo Bonzini

    Ard Biesheuvel
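
The distinction drawn in the commit above can be sketched in a small userspace model. This is only an illustration under stated assumptions: `toy_page`, `toy_pfn`, `toy_is_reserved_pfn()` and `toy_release_pfn()` are hypothetical names, not kernel code, and the zero page is assumed to be modeled as a reserved page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the fields this sketch needs. */
struct toy_page {
    bool reserved;  /* models the "reserved" page flag */
    int  refcount;  /* models the count changed by get_page()/put_page() */
};

/* Hypothetical pfn descriptor: page == NULL models a pfn without a struct page. */
struct toy_pfn {
    struct toy_page *page;
};

/* "Reserved" here means: do not take or drop page references on this pfn. */
static bool toy_is_reserved_pfn(const struct toy_pfn *pfn)
{
    assert(pfn != NULL);
    if (pfn->page != NULL)
        return pfn->page->reserved;
    return true; /* nothing to refcount behind this pfn */
}

/* Drop a reference only for ordinary, refcounted pages. */
static void toy_release_pfn(struct toy_pfn *pfn)
{
    if (!toy_is_reserved_pfn(pfn)) {
        assert(pfn->page->refcount > 0);
        pfn->page->refcount--;
    }
}
```

In this model the answer only controls reference counting; it says nothing about whether the pfn should be mapped with device memory attributes, which is the confusion the rename is meant to remove.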
     
  • If we detect another vCPU is running we just exit and return 0 as if we
    successfully created the VGIC, but the VGIC wouldn't actually be created.

    This shouldn't break in-kernel behavior because the kernel will not
    observe the failed attempt to create the VGIC, but userspace could
    be rightfully confused.

    Cc: Andre Przywara
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Paolo Bonzini

    Christoffer Dall
     

24 Oct, 2014

2 commits

  • After commit 80ce163 (KVM: VFIO: register kvm_device_ops dynamically),
    kvm_device_ops of vfio can be registered dynamically. Commit 3c3c29fd
    (kvm-vfio: do not use module_init) moved the dynamic registration into
    kvm_init in order to fix the broken unloading of the kvm module. However,
    kvm_device_ops of vfio is not unregistered when the kvm-intel module is
    removed with rmmod, which leads to a device type collision warning when
    the kvm-intel module is inserted again.

    WARNING: CPU: 1 PID: 10358 at /root/cathy/kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:3289 kvm_init+0x234/0x282 [kvm]()
    Modules linked in: kvm_intel(O+) kvm(O) nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs fscache lockd sunrpc pci_stub bridge stp llc autofs4 8021q cpufreq_ondemand ipv6 joydev microcode pcspkr igb i2c_algo_bit ehci_pci ehci_hcd e1000e i2c_i801 ixgbe ptp pps_core hwmon mdio tpm_tis tpm ipmi_si ipmi_msghandler acpi_cpufreq isci libsas scsi_transport_sas button dm_mirror dm_region_hash dm_log dm_mod [last unloaded: kvm_intel]
    CPU: 1 PID: 10358 Comm: insmod Tainted: G W O 3.17.0-rc1 #2
    Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
    0000000000000cd9 ffff880ff08cfd18 ffffffff814a61d9 0000000000000cd9
    0000000000000000 ffff880ff08cfd58 ffffffff810417b7 ffff880ff08cfd48
    ffffffffa045bcac ffffffffa049c420 0000000000000040 00000000000000ff
    Call Trace:
    [] dump_stack+0x49/0x60
    [] warn_slowpath_common+0x7c/0x96
    [] ? kvm_init+0x234/0x282 [kvm]
    [] warn_slowpath_null+0x15/0x17
    [] kvm_init+0x234/0x282 [kvm]
    [] vmx_init+0x1bf/0x42a [kvm_intel]
    [] ? vmx_check_processor_compat+0x64/0x64 [kvm_intel]
    [] do_one_initcall+0xe3/0x170
    [] ? __vunmap+0xad/0xb8
    [] do_init_module+0x2b/0x174
    [] load_module+0x43e/0x569
    [] ? do_init_module+0x174/0x174
    [] ? copy_module_from_user+0x39/0x82
    [] ? module_sect_show+0x20/0x20
    [] SyS_init_module+0x54/0x81
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace 0626f4a3ddea56f3 ]---

    The bug can be reproduced by:

    rmmod kvm_intel.ko
    insmod kvm_intel.ko

    without rmmod/insmod kvm.ko

    This patch fixes the bug by unregistering kvm_device_ops of vfio when the
    kvm-intel module is removed.

    Reported-by: Liu Rongrong
    Fixes: 3c3c29fd0d7cddc32862c350d0700ce69953e3bd
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     
  • The third parameter of kvm_unpin_pages() when called from
    kvm_iommu_map_pages() is wrong: it should be the number of pages to un-pin
    and not the page size.

    This error was facilitated by an inconsistent API: kvm_pin_pages() takes
    a size, but kvm_unpin_pages() takes a number of pages, so fix the problem
    by matching the two.

    This was introduced by commit 350b8bd ("kvm: iommu: fix the third parameter
    of kvm_iommu_put_pages (CVE-2014-3601)"), which fixes the lack of
    un-pinning for pages intended to be un-pinned (i.e. memory leak) but
    unfortunately potentially aggravated the number of pages we un-pin that
    should have stayed pinned. As far as I understand though, the same
    practical mitigations apply.

    This issue was found during review of Red Hat 6.6 patches to prepare
    Ksplice rebootless updates.

    Thanks to Vegard for his time on a late Friday evening to help me in
    understanding this code.

    Fixes: 350b8bd ("kvm: iommu: fix the third parameter of... (CVE-2014-3601)")
    Cc: stable@vger.kernel.org
    Signed-off-by: Quentin Casasnovas
    Signed-off-by: Vegard Nossum
    Signed-off-by: Jamie Iles
    Reviewed-by: Sasha Levin
    Signed-off-by: Paolo Bonzini

    Quentin Casasnovas
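
A minimal sketch of why matching the two units matters, assuming both helpers take a page count and pages are 4 KiB. All `toy_*` names and the global counter are hypothetical, and this is not the kernel's implementation.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

static long toy_pinned; /* pages currently pinned in this model */

/* Both helpers take a number of pages, so callers cannot mix units. */
static void toy_pin_pages(unsigned long npages)
{
    toy_pinned += (long)npages;
}

static void toy_unpin_pages(unsigned long npages)
{
    /* Passing a byte size here would trip this check in the model. */
    assert((long)npages <= toy_pinned);
    toy_pinned -= (long)npages;
}

/* Convert a mapping size in bytes to a page count exactly once. */
static unsigned long toy_size_to_pages(unsigned long size)
{
    return size >> TOY_PAGE_SHIFT;
}
```

With a 2 MiB mapping, converting the size once and passing the same page count to both helpers leaves the counter balanced; passing the raw size to the unpin side would ask for far more pages than were ever pinned.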
     

19 Oct, 2014

1 commit

  • Pull second batch of changes for KVM/{arm,arm64} from Marc Zyngier:
    "The most obvious thing is the sizeable MMU changes to support 48bit
    VAs on arm64.

    Summary:

    - support for 48bit IPA and VA (EL2)
    - a number of fixes for devices mapped into guests
    - yet another VGIC fix for BE
    - a fix for CPU hotplug
    - a few compile fixes (disabled VGIC, strict mm checks)"

    [ I'm pulling directly from Marc at the request of Paolo Bonzini, whose
    backpack was stolen at Düsseldorf airport and will do new keys and
    rebuild his web of trust. - Linus ]

    * tag 'kvm-arm-for-3.18-take-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm:
    arm/arm64: KVM: Fix BE accesses to GICv2 EISR and ELRSR regs
    arm: kvm: STRICT_MM_TYPECHECKS fix for user_mem_abort
    arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE
    arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
    arm/arm64: KVM: map MMIO regions at creation time
    arm64: kvm: define PAGE_S2_DEVICE as read-only by default
    ARM: kvm: define PAGE_S2_DEVICE as read-only by default
    arm/arm64: KVM: add 'writable' parameter to kvm_phys_addr_ioremap
    arm/arm64: KVM: fix potential NULL dereference in user_mem_abort()
    arm/arm64: KVM: use __GFP_ZERO not memset() to get zeroed pages
    ARM: KVM: fix vgic-disabled build
    arm: kvm: fix CPU hotplug

    Linus Torvalds
     

16 Oct, 2014

1 commit

  • The EISR and ELRSR registers are 32-bit registers on GICv2, and we
    store these as an array of two such registers on the vgic vcpu struct.
    However, we access them as a single 64-bit value or as a bitmap pointer
    in the generic vgic code, which breaks BE support.

    Instead, store them as u64 values on the vgic structure and do the
    word-swapping in the assembly code, which already handles the byte order
    for BE systems.

    Tested-by: Victor Kamensky
    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
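
The byte-order problem described above can be reproduced in userspace. The sketch below is an illustration only; every `toy_*` helper is hypothetical. It compares combining two 32-bit words by value with reinterpreting the two-word array as one 64-bit value, which gives the same result only on a little-endian host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Combine word 0 (bits 0..31) and word 1 (bits 32..63) by value. */
static uint64_t toy_combine(uint32_t reg0, uint32_t reg1)
{
    return ((uint64_t)reg1 << 32) | reg0;
}

/* What reading the two-word array as a single u64 yields on this host. */
static uint64_t toy_alias(const uint32_t regs[2])
{
    uint64_t v;

    assert(regs != NULL);
    memcpy(&v, regs, sizeof(v));
    return v;
}

/* Swap the 32-bit halves: the fix-up a big-endian host needs. */
static uint64_t toy_swap_words(uint64_t v)
{
    return (v << 32) | (v >> 32);
}

/* Detect the host byte order at run time. */
static int toy_host_is_big_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 0;
}
```

On a big-endian host the aliased value has its two words in the wrong halves, so a word swap is needed before the value can be used as a 64-bit bitmap.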
     

15 Oct, 2014

1 commit

  • Pull IOMMU updates from Joerg Roedel:
    "This pull-request includes:

    - change in the IOMMU-API to convert the former iommu_domain_capable
    function to just iommu_capable

    - various fixes in handling RMRR ranges for the VT-d driver (one fix
    requires a device driver core change which was acked by Greg KH)

    - the AMD IOMMU driver now assigns and deassigns complete alias
    groups to fix issues with devices using the wrong PCI request-id

    - MMU-401 support for the ARM SMMU driver

    - multi-master IOMMU group support for the ARM SMMU driver

    - various other small fixes all over the place"

    * tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (41 commits)
    iommu/vt-d: Work around broken RMRR firmware entries
    iommu/vt-d: Store bus information in RMRR PCI device path
    iommu/vt-d: Only remove domain when device is removed
    driver core: Add BUS_NOTIFY_REMOVED_DEVICE event
    iommu/amd: Fix devid mapping for ivrs_ioapic override
    iommu/irq_remapping: Fix the regression of hpet irq remapping
    iommu: Fix bus notifier breakage
    iommu/amd: Split init_iommu_group() from iommu_init_device()
    iommu: Rework iommu_group_get_for_pci_dev()
    iommu: Make of_device_id array const
    amd_iommu: do not dereference a NULL pointer address.
    iommu/omap: Remove omap_iommu unused owner field
    iommu: Remove iommu_domain_has_cap() API function
    IB/usnic: Convert to use new iommu_capable() API function
    vfio: Convert to use new iommu_capable() API function
    kvm: iommu: Convert to use new iommu_capable() API function
    iommu/tegra: Convert to iommu_capable() API function
    iommu/msm: Convert to iommu_capable() API function
    iommu/vt-d: Convert to iommu_capable() API function
    iommu/fsl: Convert to iommu_capable() API function
    ...

    Linus Torvalds
     

10 Oct, 2014

2 commits

  • Add support for read-only MMIO passthrough mappings by adding a
    'writable' parameter to kvm_phys_addr_ioremap. For the moment,
    mappings will be read-write even if 'writable' is false, but once
    the definition of PAGE_S2_DEVICE gets changed, those mappings will
    be created read-only.

    Acked-by: Marc Zyngier
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Christoffer Dall

    Ard Biesheuvel
     
  • Pull PCI updates from Bjorn Helgaas:
    "The interesting things here are:

    - Turn on Config Request Retry Status Software Visibility. This
    caused hangs last time, but we included a fix this time.
    - Rework PCI device configuration to use _HPP/_HPX more aggressively
    - Allow PCI devices to be put into D3cold during system suspend
    - Add arm64 PCI support
    - Add APM X-Gene host bridge driver
    - Add TI Keystone host bridge driver
    - Add Xilinx AXI host bridge driver

    More detailed summary:

    Enumeration
    - Check Vendor ID only for Config Request Retry Status (Rajat Jain)
    - Enable Config Request Retry Status when supported (Rajat Jain)
    - Add generic domain handling (Catalin Marinas)
    - Generate uppercase hex for modalias interface class (Ricardo Ribalda Delgado)

    Resource management
    - Add missing MEM_64 mask in pci_assign_unassigned_bridge_resources() (Yinghai Lu)
    - Increase IBM ipr SAS Crocodile BARs to at least system page size (Douglas Lehr)

    PCI device hotplug
    - Prevent NULL dereference during pciehp probe (Andreas Noever)
    - Move _HPP & _HPX handling into core (Bjorn Helgaas)
    - Apply _HPP to PCIe devices as well as PCI (Bjorn Helgaas)
    - Apply _HPP/_HPX to display devices (Bjorn Helgaas)
    - Preserve SERR & PARITY settings when applying _HPP/_HPX (Bjorn Helgaas)
    - Preserve MPS and MRRS settings when applying _HPP/_HPX (Bjorn Helgaas)
    - Apply _HPP/_HPX to all devices, not just hot-added ones (Bjorn Helgaas)
    - Fix wait time in pciehp timeout message (Yinghai Lu)
    - Add more pciehp Slot Control debug output (Yinghai Lu)
    - Stop disabling pciehp notifications during init (Yinghai Lu)

    MSI
    - Remove arch_msi_check_device() (Alexander Gordeev)
    - Rename pci_msi_check_device() to pci_msi_supported() (Alexander Gordeev)
    - Move D0 check into pci_msi_check_device() (Alexander Gordeev)
    - Remove unused kobject from struct msi_desc (Yijing Wang)
    - Remove "pos" from the struct msi_desc msi_attrib (Yijing Wang)
    - Add "msi_bus" sysfs MSI/MSI-X control for endpoints (Yijing Wang)
    - Use __get_cached_msi_msg() instead of get_cached_msi_msg() (Yijing Wang)
    - Use __read_msi_msg() instead of read_msi_msg() (Yijing Wang)
    - Use __write_msi_msg() instead of write_msi_msg() (Yijing Wang)

    Power management
    - Drop unused runtime PM support code for PCIe ports (Rafael J. Wysocki)
    - Allow PCI devices to be put into D3cold during system suspend (Rafael J. Wysocki)

    AER
    - Add additional AER error strings (Gong Chen)
    - Make standalone includable (Thierry Reding)

    Virtualization
    - Add ACS quirk for Solarflare SFC9120 & SFC9140 (Alex Williamson)
    - Add ACS quirk for Intel 10G NICs (Alex Williamson)
    - Add ACS quirk for AMD A88X southbridge (Marti Raudsepp)
    - Remove unused pci_find_upstream_pcie_bridge(), pci_get_dma_source() (Alex Williamson)
    - Add device flag helpers (Ethan Zhao)
    - Assume all Mellanox devices have broken INTx masking (Gavin Shan)

    Generic host bridge driver
    - Fix ioport_map() for !CONFIG_GENERIC_IOMAP (Liviu Dudau)
    - Add pci_register_io_range() and pci_pio_to_address() (Liviu Dudau)
    - Define PCI_IOBASE as the base of virtual PCI IO space (Liviu Dudau)
    - Fix the conversion of IO ranges into IO resources (Liviu Dudau)
    - Add pci_get_new_domain_nr() and of_get_pci_domain_nr() (Liviu Dudau)
    - Add support for parsing PCI host bridge resources from DT (Liviu Dudau)
    - Add pci_remap_iospace() to map bus I/O resources (Liviu Dudau)
    - Add arm64 architectural support for PCI (Liviu Dudau)

    APM X-Gene
    - Add APM X-Gene PCIe driver (Tanmay Inamdar)
    - Add arm64 DT APM X-Gene PCIe device tree nodes (Tanmay Inamdar)

    Freescale i.MX6
    - Probe in module_init(), not fs_initcall() (Lucas Stach)
    - Delay enabling reference clock for SS until it stabilizes (Tim Harvey)

    Marvell MVEBU
    - Fix uninitialized variable in mvebu_get_tgt_attr() (Thomas Petazzoni)

    NVIDIA Tegra
    - Make sure the PCIe PLL is really reset (Eric Yuen)
    - Add error path tegra_msi_teardown_irq() cleanup (Jisheng Zhang)
    - Fix extended configuration space mapping (Peter Daifuku)
    - Implement resource hierarchy (Thierry Reding)
    - Clear CLKREQ# enable on port disable (Thierry Reding)
    - Add Tegra124 support (Thierry Reding)

    ST Microelectronics SPEAr13xx
    - Pass config resource through reg property (Pratyush Anand)

    Synopsys DesignWare
    - Use NULL instead of false (Fabio Estevam)
    - Parse bus-range property from devicetree (Lucas Stach)
    - Use pci_create_root_bus() instead of pci_scan_root_bus() (Lucas Stach)
    - Remove pci_assign_unassigned_resources() (Lucas Stach)
    - Check private_data validity in single place (Lucas Stach)
    - Setup and clear exactly one MSI at a time (Lucas Stach)
    - Remove open-coded bitmap operations (Lucas Stach)
    - Fix configuration base address when using 'reg' (Minghuan Lian)
    - Fix IO resource end address calculation (Minghuan Lian)
    - Rename get_msi_data() to get_msi_addr() (Minghuan Lian)
    - Add get_msi_data() to pcie_host_ops (Minghuan Lian)
    - Add support for v3.65 hardware (Murali Karicheri)
    - Fold struct pcie_port_info into struct pcie_port (Pratyush Anand)

    TI Keystone
    - Add TI Keystone PCIe driver (Murali Karicheri)
    - Limit MRRS for all downstream devices (Murali Karicheri)
    - Assume controller is already in RC mode (Murali Karicheri)
    - Set device ID based on SoC to support multiple ports (Murali Karicheri)

    Xilinx AXI
    - Add Xilinx AXI PCIe driver (Srikanth Thokala)
    - Fix xilinx_pcie_assign_msi() return value test (Dan Carpenter)

    Miscellaneous
    - Clean up whitespace (Quentin Lambert)
    - Remove assignments from "if" conditions (Quentin Lambert)
    - Move PCI_VENDOR_ID_VMWARE to pci_ids.h (Francesco Ruggeri)
    - x86: Mark DMI tables as initialization data (Mathias Krause)
    - x86: Move __init annotation to the correct place (Mathias Krause)
    - x86: Mark constants of pci_mmcfg_nvidia_mcp55() as __initconst (Mathias Krause)
    - x86: Constify pci_mmcfg_probes[] array (Mathias Krause)
    - x86: Mark PCI BIOS initialization code as such (Mathias Krause)
    - Parenthesize PCI_DEVID and PCI_VPD_LRDT_ID parameters (Megan Kamiya)
    - Remove unnecessary variable in pci_add_dynid() (Tobias Klauser)"

    * tag 'pci-v3.18-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (109 commits)
    arm64: dts: Add APM X-Gene PCIe device tree nodes
    PCI: Add ACS quirk for AMD A88X southbridge devices
    PCI: xgene: Add APM X-Gene PCIe driver
    PCI: designware: Remove open-coded bitmap operations
    PCI/MSI: Remove unnecessary temporary variable
    PCI/MSI: Use __write_msi_msg() instead of write_msi_msg()
    MSI/powerpc: Use __read_msi_msg() instead of read_msi_msg()
    PCI/MSI: Use __get_cached_msi_msg() instead of get_cached_msi_msg()
    PCI/MSI: Add "msi_bus" sysfs MSI/MSI-X control for endpoints
    PCI/MSI: Remove "pos" from the struct msi_desc msi_attrib
    PCI/MSI: Remove unused kobject from struct msi_desc
    PCI/MSI: Rename pci_msi_check_device() to pci_msi_supported()
    PCI/MSI: Move D0 check into pci_msi_check_device()
    PCI/MSI: Remove arch_msi_check_device()
    irqchip: armada-370-xp: Remove arch_msi_check_device()
    PCI/MSI/PPC: Remove arch_msi_check_device()
    arm64: Add architectural support for PCI
    PCI: Add pci_remap_iospace() to map bus I/O resources
    of/pci: Add support for parsing PCI host bridge resources from DT
    of/pci: Add pci_get_new_domain_nr() and of_get_pci_domain_nr()
    ...

    Conflicts:
    arch/arm64/boot/dts/apm-storm.dtsi

    Linus Torvalds
     

08 Oct, 2014

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "Fixes and features for 3.18.

    Apart from the usual cleanups, here is the summary of new features:

    - s390 moves closer towards host large page support

    - PowerPC has improved support for debugging (both inside the guest
    and via gdbstub) and support for e6500 processors

    - ARM/ARM64 support read-only memory (which is necessary to put
    firmware in emulated NOR flash)

    - x86 has the usual emulator fixes and nested virtualization
    improvements (including improved Windows support on Intel and
    Jailhouse hypervisor support on AMD), adaptive PLE which helps
    overcommitting of huge guests. Also included are some patches that
    make KVM more friendly to memory hot-unplug, and fixes for rare
    caching bugs.

    Two patches have trivial mm/ parts that were acked by Rik and Andrew.

    Note: I will soon switch to a subkey for signing purposes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (157 commits)
    kvm: do not handle APIC access page if in-kernel irqchip is not in use
    KVM: s390: count vcpu wakeups in stat.halt_wakeup
    KVM: s390/facilities: allow TOD-CLOCK steering facility bit
    KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode
    arm/arm64: KVM: Report correct FSC for unsupported fault types
    arm/arm64: KVM: Fix VTTBR_BADDR_MASK and pgd alloc
    kvm: Fix kvm_get_page_retry_io __gup retval check
    arm/arm64: KVM: Fix set_clear_sgi_pend_reg offset
    kvm: x86: Unpin and remove kvm_arch->apic_access_page
    kvm: vmx: Implement set_apic_access_page_addr
    kvm: x86: Add request bit to reload APIC access page address
    kvm: Add arch specific mmu notifier for page invalidation
    kvm: Rename make_all_cpus_request() to kvm_make_all_cpus_request() and make it non-static
    kvm: Fix page ageing bugs
    kvm/x86/mmu: Pass gfn and level to rmapp callback.
    x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
    kvm: x86: use macros to compute bank MSRs
    KVM: x86: Remove debug assertion of non-PAE reserved bits
    kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
    kvm: Faults which trigger IO release the mmap_sem
    ...

    Linus Torvalds
     

02 Oct, 2014

1 commit


27 Sep, 2014

1 commit

  • …marm/kvmarm into kvm-next

    Changes for KVM for arm/arm64 for 3.18

    This includes a bunch of changes:
    - Support read-only memory slots on arm/arm64
    - Various changes to fix Sparse warnings
    - Correctly detect write vs. read Stage-2 faults
    - Various VGIC cleanups and fixes
    - Dynamic VGIC data structure sizing
    - Fix SGI set_clear_pend offset bug
    - Fix VTTBR_BADDR Mask
    - Correctly report the FSC on Stage-2 faults

    Conflicts:
    virt/kvm/eventfd.c
    [duplicate, different patch where the kvm-arm version broke x86.
    The kvm tree instead has the right one]

    Paolo Bonzini
     

26 Sep, 2014

2 commits

  • Confusion around -EBUSY and zero (inside a BUG_ON no less).

    Reported-by: Andrea Arcangeli
    Signed-off-by: Andres Lagar-Cavilla
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
     
  • The sgi values calculated in read_set_clear_sgi_pend_reg() and
    write_set_clear_sgi_pend_reg() were horribly incorrectly multiplied by 4
    with catastrophic results in that subfunctions ended up overwriting
    memory not allocated for the expected purpose.

    This showed up as bugs in kfree() and the kernel complaining a lot if
    you turn on memory debugging.

    This addresses: http://marc.info/?l=kvm&m=141164910007868&w=2

    Reported-by: Shannon Zhao
    Signed-off-by: Christoffer Dall

    Christoffer Dall
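
A rough sketch of the indexing error described above, assuming 16 SGIs with one byte of state per SGI, so that each 32-bit register covers four SGIs. All `toy_*` names are hypothetical and this is not the kernel code.

```c
#include <assert.h>

#define TOY_NR_SGIS 16 /* assumption: one byte of state per SGI */

/* First SGI covered by the 32-bit word at byte 'offset'. */
static int toy_min_sgi(unsigned int offset)
{
    return (int)(offset & ~0x3u);
}

/* The faulty variant: the extra multiplication leaves the valid range. */
static int toy_min_sgi_buggy(unsigned int offset)
{
    return (int)(offset & ~0x3u) * 4;
}

static int toy_sgi_in_range(int sgi)
{
    return sgi >= 0 && sgi < TOY_NR_SGIS;
}

/* Read the per-SGI byte, refusing any index outside the array. */
static unsigned char toy_read_sgi(const unsigned char state[TOY_NR_SGIS], int sgi)
{
    assert(toy_sgi_in_range(sgi));
    return state[sgi];
}
```

For the last register (byte offset 12) the correct base is SGI 12, while the multiplied value is 48, far past a 16-entry array.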
     

25 Sep, 2014

1 commit


24 Sep, 2014

7 commits

  • This will be used to let the guest run while the APIC access page is
    not pinned. Because subsequent patches will fill in the function
    for x86, place the (still empty) x86 implementation in the x86.c file
    instead of adding an inline function in kvm_host.h.

    Signed-off-by: Tang Chen
    Signed-off-by: Paolo Bonzini

    Tang Chen
     
  • Different architectures need different requests, and in fact we
    will use this function in architecture-specific code later. This
    will be outside kvm_main.c, so make it non-static and rename it to
    kvm_make_all_cpus_request().

    Reviewed-by: Paolo Bonzini
    Signed-off-by: Tang Chen
    Signed-off-by: Paolo Bonzini

    Tang Chen
     
  • 1. We were calling clear_flush_young_notify in unmap_one, but we are
    within an mmu notifier invalidate range scope. The spte exists no more
    (due to range_start) and the accessed bit info has already been
    propagated (due to kvm_pfn_set_accessed). Simply call
    clear_flush_young.

    2. We clear_flush_young on a primary MMU PMD, but this may be mapped
    as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
    This required expanding the interface of the clear_flush_young mmu
    notifier, so a lot of code has been trivially touched.

    3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
    the access bit by blowing the spte. This requires proper synchronizing
    with MMU notifier consumers, like every other removal of spte's does.

    Signed-off-by: Andres Lagar-Cavilla
    Acked-by: Rik van Riel
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
     
  • vcpu ioctls can hang the calling thread if issued while a vcpu is running.
    However, invalid ioctls can happen when userspace tries to probe the kind
    of file descriptors (e.g. isatty() calls ioctl(TCGETS)); in that case,
    we know the ioctl is going to be rejected as invalid anyway and we can
    fail before trying to take the vcpu mutex.

    This patch does not change functionality, it just makes invalid ioctls
    fail faster.

    Cc: stable@vger.kernel.org
    Signed-off-by: David Matlack
    Signed-off-by: Paolo Bonzini

    David Matlack
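
The early rejection can be sketched as a check on the type byte of the ioctl number before any lock is taken. This is a userspace illustration under stated assumptions: the `TOY_*` macros and `toy_*` functions are hypothetical, the type byte is assumed to sit in bits 8..15 of the command (the usual Linux encoding), and returning -EINVAL is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_IOC_TYPESHIFT 8
#define TOY_IOC_TYPEMASK  0xffu
#define TOY_KVMIO         0xAEu   /* KVM's ioctl type byte */
#define TOY_KVM_RUN       0xAE80u /* type 0xAE, number 0x80, no argument */
#define TOY_TCGETS        0x5401u /* terminal ioctl, type 'T' (x86/ARM value) */

/* Extract the type byte from an ioctl command number. */
static unsigned int toy_ioc_type(unsigned int cmd)
{
    return (cmd >> TOY_IOC_TYPESHIFT) & TOY_IOC_TYPEMASK;
}

/* Reject foreign ioctls before touching the (modeled) vcpu mutex. */
static int toy_vcpu_ioctl(unsigned int cmd, int *mutex_taken)
{
    assert(mutex_taken != NULL);
    if (toy_ioc_type(cmd) != TOY_KVMIO)
        return -EINVAL;
    *mutex_taken = 1; /* only KVM ioctls reach the locked section */
    return 0;
}
```

A probe such as isatty() would take the first branch and return without ever waiting on a running vcpu.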
     
  • When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
    has been swapped out or is behind a filemap, this will trigger async
    readahead and return immediately. The rationale is that KVM will kick
    back the guest with an "async page fault" and allow for some other
    guest process to take over.

    If async PFs are enabled the fault is retried asap from an async
    workqueue. If not, it's retried immediately in the same code path. In
    either case the retry will not relinquish the mmap semaphore and will
    block on the IO. This is a bad thing, as other mmap semaphore users
    now stall as a function of swap or filemap latency.

    This patch ensures both the regular and async PF path re-enter the
    fault allowing for the mmap semaphore to be relinquished in the case
    of IO wait.

    Reviewed-by: Radim Krčmář
    Signed-off-by: Andres Lagar-Cavilla
    Acked-by: Andrew Morton
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
     
  • /me got confused between the kernel and QEMU. In the kernel, you can
    only have one module_init function, and it will prevent unloading the
    module unless you also have the corresponding module_exit function.

    So, commit 80ce1639727e (KVM: VFIO: register kvm_device_ops dynamically,
    2014-09-02) broke unloading of the kvm module, by adding a module_init
    function and no module_exit.

    Repair it by making kvm_vfio_ops_init weak, and checking it in
    kvm_init.

    Cc: Will Deacon
    Cc: Gleb Natapov
    Cc: Alex Williamson
    Fixes: 80ce1639727e9d38729c34f162378508c307ca25
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Commit c77dcac (KVM: Move more code under CONFIG_HAVE_KVM_IRQFD) added
    functionality that depends on definitions in ioapic.h when
    __KVM_HAVE_IOAPIC is defined.

    At the same time, kvm-arm commit 0ba0951 (KVM: EVENTFD: remove inclusion
    of irq.h) removed the inclusion of irq.h, an architecture-specific header
    that is not present on ARM but which happened to include ioapic.h on x86.

    Include ioapic.h directly in eventfd.c if __KVM_HAVE_IOAPIC is defined.
    This fixes x86 and lets ARM use eventfd.c.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Paolo Bonzini

    Christoffer Dall
     

23 Sep, 2014

2 commits


22 Sep, 2014

1 commit


19 Sep, 2014

15 commits

  • In order to make the number of interrupts configurable, use the new
    fancy device management API to add KVM_DEV_ARM_VGIC_GRP_NR_IRQS as
    a VGIC configurable attribute.

    Userspace can now specify the exact size of the GIC (by increments
    of 32 interrupts).

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
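
A sketch of the kind of validation such an attribute implies, assuming only that the size must be a multiple of 32. The lower and upper bounds, the `TOY_*` constants and `toy_set_nr_irqs()` are hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_IRQ_GRANULE 32   /* from the commit text: increments of 32 */
#define TOY_MIN_IRQS    64   /* hypothetical lower bound */
#define TOY_MAX_IRQS    1024 /* hypothetical upper bound */

/* Accept a requested GIC size only if it is in bounds and 32-aligned. */
static int toy_set_nr_irqs(unsigned int *nr_irqs, unsigned int requested)
{
    assert(nr_irqs != NULL);
    if (requested < TOY_MIN_IRQS || requested > TOY_MAX_IRQS)
        return -EINVAL;
    if (requested % TOY_IRQ_GRANULE != 0)
        return -EINVAL;
    *nr_irqs = requested;
    return 0;
}
```

In this model a request for 96 interrupts is accepted and a request for 100 is rejected without changing the stored value.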
     
  • It is now quite easy to delay the allocation of the vgic tables
    until we actually require it to be up and running (when the first
    vcpu is kicking around, or someone tries to access the GIC registers).

    This allows us to allocate memory for the exact number of CPUs we
    have. As nobody configures the number of interrupts just yet,
    use a fallback to VGIC_NR_IRQS_LEGACY.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • Nuke VGIC_NR_IRQS entirely, now that the distributor instance
    contains the number of IRQs allocated to this GIC.

    Also add VGIC_NR_IRQS_LEGACY to preserve the current API.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • Now that we can (almost) dynamically size the number of interrupts,
    we're facing an interesting issue:

    We have to evaluate at runtime whether or not an access hits a valid
    register, based on the sizing of this particular instance of the
    distributor. Furthermore, the GIC spec says that accessing a reserved
    register is RAZ/WI.

    For this, add a new field to our range structure, indicating the number
    of bits a single interrupt uses. That allows us to find out whether or
    not the access is in range.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
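
The range check described above can be sketched as follows. The structure and function names are hypothetical, and treating a zero bits-per-interrupt value as "not an interrupt-indexed register" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical range descriptor: how many bits each interrupt occupies. */
struct toy_range {
    unsigned int bits_per_irq; /* 0: range is not indexed by interrupt */
};

/*
 * Return true if the byte 'offset' into the range addresses an interrupt
 * that exists on a distributor sized for 'nr_irqs' interrupts.
 */
static bool toy_access_in_range(unsigned int nr_irqs,
                                const struct toy_range *range,
                                unsigned long offset)
{
    unsigned long irq;

    assert(range != NULL);
    if (range->bits_per_irq == 0)
        return true;

    /* First interrupt described by the byte at 'offset'. */
    irq = offset * 8 / range->bits_per_irq;
    return irq < nr_irqs;
}
```

A caller would treat a false result as a reserved register, that is, reads return zero and writes are ignored (RAZ/WI).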
     
  • We now have the information about the number of CPU interfaces in
    the distributor itself. Let's get rid of VGIC_MAX_CPUS, and just
    rely on KVM_MAX_VCPUS where we don't have the choice. Yet.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • Having a dynamic number of supported interrupts means that we
    cannot rely on VGIC_NR_SHARED_IRQS being fixed anymore.

    Instead, make it take the distributor structure as a parameter,
    so it can return the right value.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • So far, all the VGIC data structures are statically defined by the
    *maximum* number of vcpus and interrupts it supports. It means that
    we always have to oversize it to cater for the worst case.

    Start by changing the data structures to be dynamically sizeable,
    and allocate them at runtime.

    The sizes are still very static though.

    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • As it stands, nothing prevents userspace from injecting an interrupt
    before the guest's GIC is actually initialized.

    This goes unnoticed so far (as everything is pretty much statically
    allocated), but ends up exploding in a spectacular way once we switch
    to a more dynamic allocation (the GIC data structure isn't there yet).

    The fix is to test for the "ready" flag in the VGIC distributor before
    trying to inject the interrupt. Note that in order to avoid breaking
    userspace, we have to ignore what is essentially an error.

    Signed-off-by: Marc Zyngier
    Acked-by: Christoffer Dall

    Marc Zyngier
     
  • The VGIC virtual distributor implementation documentation was written a
    very long time ago, before the true nature of the beast had been
    partially absorbed into my bloodstream. Clarify the docs.

    Plus, it fixes an actual bug. ICFRn, pfff.

    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Writes to GICD_ISPENDR0 and GICD_ICPENDR0 ignore all settings of the
    pending state for SGIs. Make sure the implementation handles this
    correctly.

    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Writes to GICD_ISPENDRn and GICD_ICPENDRn are currently not handled
    correctly for level-triggered interrupts. The spec states that for
    level-triggered interrupts, writes to the GICD_ISPENDRn activate the
    output of a flip-flop which is in turn or'ed with the actual input
    interrupt signal. Correspondingly, writes to GICD_ICPENDRn simply
    deactivate the output of that flip-flop, but do not (of course) affect
    the external input signal. Reads from GICC_IAR will also deactivate the
    flip-flop output.

    This requires us to track the state of the level-input separately from
    the state in the flip-flop. We therefore introduce two new variables on
    the distributor struct to track these two states. Astute readers may
    notice that this is introducing more state than required (because an OR
    of the two states gives you the pending state), but the remaining vgic
    code uses the pending bitmap for optimized operations to figure out, at
    the end of the day, if an interrupt is pending or not on the distributor
    side. Refactoring the code to consider the two state variables in all
    the places where we currently access the precomputed pending value did
    not look pretty.

    Signed-off-by: Christoffer Dall

    Christoffer Dall
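
The two separately tracked states and the precomputed pending value can be modeled for a single level-triggered interrupt. This is a sketch only: the struct, its field names and the `toy_*` functions are hypothetical and stand in for per-interrupt bitmaps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State of one level-triggered interrupt in this model. */
struct toy_level_irq {
    bool line_level; /* external input signal */
    bool soft_pend;  /* flip-flop set by GICD_ISPENDRn writes */
    bool pending;    /* precomputed: line_level OR soft_pend */
};

/* Recompute the cached pending value after any state change. */
static void toy_update_pending(struct toy_level_irq *irq)
{
    assert(irq != NULL);
    irq->pending = irq->line_level || irq->soft_pend;
}

static void toy_write_ispendr(struct toy_level_irq *irq)
{
    irq->soft_pend = true;
    toy_update_pending(irq);
}

/* Clears only the flip-flop; the input line is left untouched. */
static void toy_write_icpendr(struct toy_level_irq *irq)
{
    irq->soft_pend = false;
    toy_update_pending(irq);
}

/* Reading GICC_IAR also clears the flip-flop in this model. */
static void toy_read_iar(struct toy_level_irq *irq)
{
    irq->soft_pend = false;
    toy_update_pending(irq);
}

static void toy_set_line(struct toy_level_irq *irq, bool level)
{
    irq->line_level = level;
    toy_update_pending(irq);
}
```

Keeping the cached `pending` field costs one extra bit of state but lets the rest of the code test a single value, which mirrors the trade-off the commit message describes.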
     
  • If we unqueue a level-triggered interrupt completely, and the LR does
    not stick around in the active state (and will therefore no longer
    generate a maintenance interrupt), then we should clear the queued flag
    so that the vgic can actually queue this level-triggered interrupt at a
    later time and deal with its pending state then.

    Note: This should actually be properly fixed to handle the active state
    on the distributor.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • We have a special bitmap on the distributor struct to keep track of when
    level-triggered interrupts are queued on the list registers. This was
    named irq_active, which is confusing, because the active state of an
    interrupt as per the GIC spec is a different thing, not specifically
    related to edge-triggered/level-triggered configurations but rather
    indicates an interrupt which has been ack'ed but not yet eoi'ed.

    Rename the bitmap and the corresponding accessor functions to irq_queued
    to clarify what this is actually used for.

    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • The irq_state field on the distributor struct is ambiguous in its
    meaning; the comment says it's the level of the input, but that
    doesn't make much sense for edge-triggered interrupts. The code
    actually uses this state variable to check if the interrupt is in the
    pending state on the distributor so clarify the comment and rename the
    actual variable and accessor methods.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Conflicts:
    arch/arm64/include/asm/kvm_host.h
    virt/kvm/arm/vgic.c

    Christoffer Dall