17 Jan, 2019

1 commit

  • commit fb544d1ca65a89f7a3895f7531221ceeed74ada7 upstream.

    We recently addressed a VMID generation race by introducing a read/write
    lock around accesses and updates to the vmid generation values.

    However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but
    does so without taking the read lock.

    As far as I can tell, this can lead to the same kind of race:

    VM 0, VCPU 0                            VM 0, VCPU 1
    ------------                            ------------
    update_vttbr (vmid 254)
                                            update_vttbr (vmid 1) // roll over
                                            read_lock(kvm_vmid_lock);
                                            force_vm_exit()
    local_irq_disable
    need_new_vmid_gen == false //because vmid gen matches
    enter_guest (vmid 254)
                                            kvm_arch.vttbr = <PGD>:<VMID 1>
                                            read_unlock(kvm_vmid_lock);
                                            enter_guest (vmid 1)

    Which results in running two VCPUs in the same VM with different VMIDs
    and (even worse) other VCPUs from other VMs could now allocate clashing
    VMID 254 from the new generation as long as VCPU 0 is not exiting.

    Attempt to solve this by making sure vttbr is updated before another CPU
    can observe the updated VMID generation.
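
    In sketch form, the required ordering is a publish/consume pairing
    (hypothetical variable names, not the verbatim diff):

        /* Producer (rollover): make the new vttbr visible first. */
        kvm->arch.vttbr = pgd_phys | new_vmid;
        smp_wmb();
        kvm->arch.vmid_gen = atomic64_read(&kvm_vmid_gen);

        /* Consumer: a matching generation now implies a fresh vttbr. */
        if (kvm->arch.vmid_gen == atomic64_read(&kvm_vmid_gen)) {
                smp_rmb();      /* pairs with the smp_wmb() above */
                /* safe to enter the guest using kvm->arch.vttbr */
        }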

    Cc: stable@vger.kernel.org
    Fixes: f0cf47d939d0 ("KVM: arm/arm64: Close VMID generation race")
    Reviewed-by: Julien Thierry
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Christoffer Dall
     

10 Jan, 2019

1 commit

  • commit 107352a24900fb458152b92a4e72fbdc83fd5510 upstream.

    We currently only halt the guest when a vCPU messes with the active
    state of an SPI. This is perfectly fine for GICv2, but isn't enough
    for GICv3, where all vCPUs can access the state of any other vCPU.

    Let's broaden the condition to include any GICv3 interrupt that
    has an active state (i.e. all but LPIs).

    Cc: stable@vger.kernel.org
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

14 Nov, 2018

1 commit

  • commit da5a3ce66b8bb51b0ea8a89f42aac153903f90fb upstream.

    At boot time, KVM stashes the host MDCR_EL2 value, but only does this
    when the kernel is not running in hyp mode (i.e. is non-VHE). When the
    kernel is running in hyp mode (VHE), the stashed value of MDCR_EL2.HPMN
    is left at zero, which can lead to CONSTRAINED UNPREDICTABLE behaviour.

    Since we use this value to derive the MDCR_EL2 value when switching
    to/from a guest, after a guest has been run, the performance counters
    do not behave as expected. This has been observed to result in accesses
    via PMXEVTYPER_EL0 and PMXEVCNTR_EL0 not affecting the relevant
    counters, resulting in events not being counted. In these cases, only
    the fixed-purpose cycle counter appears to work as expected.

    Fix this by always stashing the host MDCR_EL2 value, regardless of VHE.
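
    The shape of the fix, roughly (kvm_arm_init_debug() stashes MDCR_EL2
    via a hyp call; treat this as an outline rather than the exact diff):

        static void cpu_hyp_reinit(void)
        {
                if (is_kernel_in_hyp_mode())
                        kvm_timer_init_vhe();
                else
                        cpu_init_hyp_mode(NULL);

                /* Stash the host MDCR_EL2 on both VHE and non-VHE. */
                kvm_arm_init_debug();
        }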

    Cc: Christoffer Dall
    Cc: James Morse
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 1e947bad0b63b351 ("arm64: KVM: Skip HYP setup when already running in HYP")
    Tested-by: Robin Murphy
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     

26 Sep, 2018

2 commits

  • [ Upstream commit 1d47191de7e15900f8fbfe7cccd7c6e1c2d7c31a ]

    The vgic_init function can race with kvm_arch_vcpu_create() which does
    not hold kvm_lock() and we therefore have no synchronization primitives
    to ensure we're doing the right thing.

    If the user is trying to initialize or run the VM while at the same time
    creating more VCPUs, we simply have to refuse to initialize the VGIC in
    this case rather than silently failing with a broken VCPU.
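
    The refusal amounts to a check of this shape early in vgic_init()
    (field names from the generic KVM code; a sketch, not the exact diff):

        /* Refuse to initialize the VGIC while a VCPU is still being
         * created; userspace can simply retry later. */
        if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus))
                return -EBUSY;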

    Reviewed-by: Eric Auger
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Christoffer Dall
     
  • [ Upstream commit 6b8b9a48545e08345b8ff77c9fd51b1aebdbefb3 ]

    It's possible for userspace to control n. Sanitize n when using it as an
    array index, to inhibit the potential spectre-v1 write gadget.

    Note that while it appears that n must be bound to the interval [0,3]
    due to the way it is extracted from addr, we cannot guarantee that
    compiler transformations (and/or future refactoring) will ensure this is
    the case, and given this is a slow path it's better to always perform
    the masking.

    Found by smatch.
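
    The masking pattern looks roughly like this (the bound of 4 and the
    destination array are illustrative, modelled on the vgic-v3 APR
    registers):

        #include <linux/nospec.h>

        /* n derives from a userspace-controlled address; clamp it so a
         * mispredicted bounds check cannot steer the write. */
        n = array_index_nospec(n, 4);
        vgicv3->vgic_ap1r[n] = val;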

    Signed-off-by: Mark Rutland
    Cc: Christoffer Dall
    Cc: Marc Zyngier
    Cc: kvmarm@lists.cs.columbia.edu
    Signed-off-by: Marc Zyngier
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     

05 Sep, 2018

2 commits

  • commit 976d34e2dab10ece5ea8fe7090b7692913f89084 upstream.

    When there is contention on faulting in a particular page table entry
    at stage 2, the break-before-make requirement of the architecture can
    lead to additional refaulting due to TLB invalidation.

    Avoid this by skipping a page table update if the new value of the PTE
    matches the previous value.
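
    In code, the skip sits just ahead of the break-before-make sequence;
    a sketch of the stage-2 PTE path:

        /* Another vcpu may have installed the same translation already;
         * skip the update to avoid a needless invalidate and refault. */
        if (pte_val(old_pte) == pte_val(*new_pte))
                return 0;

        /* Otherwise break-before-make: clear, invalidate, then set. */
        kvm_set_pte(pte, __pte(0));
        kvm_tlb_flush_vmid_ipa(kvm, addr);
        kvm_set_pte(pte, *new_pte);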

    Cc: stable@vger.kernel.org
    Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
    Reviewed-by: Suzuki Poulose
    Acked-by: Christoffer Dall
    Signed-off-by: Punit Agrawal
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Punit Agrawal
     
  • commit 86658b819cd0a9aa584cd84453ed268a6f013770 upstream.

    Contention on updating a PMD entry by a large number of vcpus can lead
    to duplicate work when handling stage 2 page faults. As the page table
    update follows the break-before-make requirement of the architecture,
    it can lead to repeated refaults due to clearing the entry and
    flushing the tlbs.

    This problem is more likely when -

    * there are a large number of vcpus
    * the mapping is a large block mapping

    such as when using PMD hugepages (512MB) with 64k pages.

    Fix this by skipping the page table update if there is no change in
    the entry being updated.

    Cc: stable@vger.kernel.org
    Fixes: ad361f093c1e ("KVM: ARM: Support hugetlbfs backed huge pages")
    Reviewed-by: Suzuki Poulose
    Acked-by: Christoffer Dall
    Signed-off-by: Punit Agrawal
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Punit Agrawal
     

24 Aug, 2018

2 commits

  • commit 9432a3175770e06cb83eada2d91fac90c977cb99 upstream.

    A comment warning against this bug is there, but the code is not doing what
    the comment says. Therefore it is possible that an EPOLLHUP races against
    irq_bypass_register_consumer. The EPOLLHUP handler schedules irqfd_shutdown,
    and if that runs soon enough, you get a use-after-free.

    Reported-by: syzbot
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Reviewed-by: David Hildenbrand
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • [ Upstream commit ba56bc3a0786992755e6804fbcbdc60ef6cfc24c ]

    When booting a 64 KB pages kernel on an ACPI GICv3 system that
    implements support for v2 emulation, the following warning is
    produced

    GICV size 0x2000 not a multiple of page size 0x10000

    and support for v2 emulation is disabled, preventing GICv2 VMs
    from being able to run on such hosts.

    The reason is that vgic_v3_probe() performs a sanity check on the
    size of the window (it should be a multiple of the page size),
    while the ACPI MADT parsing code hardcodes the size of the window
    to 8 KB. This makes sense, considering that ACPI does not bother
    to describe the size in the first place, under the assumption that
    platforms implementing ACPI will follow the architecture and not
    put anything else in the same 64 KB window.

    So let's just drop the sanity check altogether, and assume that
    the window is at least 64 KB in size.

    Fixes: 909777324588 ("KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init")
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Marc Zyngier
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     

25 Jul, 2018

1 commit

  • commit b5020a8e6b54d2ece80b1e7dedb33c79a40ebd47 upstream.

    Syzbot reports crashes in kvm_irqfd_assign(), caused by a use-after-free
    when kvm_irqfd_assign() and kvm_irqfd_deassign() run in parallel
    for one specific eventfd. When the assign path hasn't finished but the
    irqfd has been added to the kvm->irqfds.items list, another thread may
    deassign the eventfd and free the struct kvm_kernel_irqfd. The assign
    path then uses the struct kvm_kernel_irqfd that has been freed by the
    deassign path. To avoid this issue, keep the irqfd under kvm->irq_srcu
    protection after it has been added to the kvm->irqfds.items list, and
    call synchronize_srcu() in irqfd_shutdown() to make sure that the irqfd
    has been fully initialized in the assign path.
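
    The assign-side ordering, sketched (the deassign side pairs this with
    a synchronize_srcu(&kvm->irq_srcu) before freeing the irqfd):

        idx = srcu_read_lock(&kvm->irq_srcu);
        list_add(&irqfd->list, &kvm->irqfds.items);
        /* ... finish initializing the irqfd while still inside the
         * SRCU read-side section, so a racing deassign must wait ... */
        srcu_read_unlock(&kvm->irq_srcu, idx);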

    Reported-by: Dmitry Vyukov
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Dmitry Vyukov
    Signed-off-by: Tianyu Lan
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Lan Tianyu
     

22 Jul, 2018

4 commits

  • commit 5d81f7dc9bca4f4963092433e27b508cbe524a32 upstream.

    Now that all our infrastructure is in place, let's expose the
    availability of ARCH_WORKAROUND_2 to guests. We take this opportunity
    to tidy up a couple of SMCCC constants.

    Acked-by: Christoffer Dall
    Reviewed-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 55e3748e8902ff641e334226bdcb432f9a5d78d3 upstream.

    In order to offer ARCH_WORKAROUND_2 support to guests, we need
    a bit of infrastructure.

    Let's add a flag indicating whether or not the guest uses
    SSBD mitigation. Depending on the state of this flag, allow
    KVM to disable ARCH_WORKAROUND_2 before entering the guest,
    and enable it when exiting it.

    Reviewed-by: Christoffer Dall
    Reviewed-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • Commit 44a497abd621a71c645f06d3d545ae2f46448830 upstream.

    kvm_vgic_global_state is part of the read-only section, and is
    usually accessed using a PC-relative address generation (adrp + add).

    It is thus useless to use kern_hyp_va() on it, and actively problematic
    if kern_hyp_va() becomes non-idempotent. On the other hand, there is
    no way that the compiler is going to guarantee that such access is
    always PC relative.

    So let's bite the bullet and provide our own accessor.
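
    The accessor forces a PC-relative address computation with inline
    assembly; it has the shape of the arm64 hyp_symbol_addr() helper
    (shown as a sketch):

        #define hyp_symbol_addr(s)                                      \
                ({                                                      \
                        typeof(&s) addr;                                \
                        asm("adrp  %0, %1\n"                            \
                            "add   %0, %0, :lo12:%1\n"                  \
                            : "=r" (addr) : "S" (&s));                  \
                        addr;                                           \
                })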

    Acked-by: Catalin Marinas
    Reviewed-by: James Morse
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • Commit 36989e7fd386a9a5822c48691473863f8fbb404d upstream.

    kvm_host_cpu_state is a per-cpu allocation made from kvm_arch_init()
    used to store the host EL1 registers when KVM switches to a guest.

    Make it easier for ASM to generate pointers into this per-cpu memory
    by making it a static allocation.
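
    The change boils down to swapping a runtime allocation for a static
    definition, roughly:

        /* Before: a dynamic per-cpu allocation, awkward to address
         * from assembly. */
        kvm_host_cpu_state = alloc_percpu(kvm_cpu_context_t);

        /* After: a static symbol assembly can offset from directly. */
        static DEFINE_PER_CPU(kvm_cpu_context_t, kvm_host_cpu_state);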

    Signed-off-by: James Morse
    Acked-by: Christoffer Dall
    Signed-off-by: Catalin Marinas
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     

21 Jun, 2018

1 commit

  • [ Upstream commit 5e1ca5e23b167987d5b6d8b08f2d5b7dd2d13f49 ]

    It's possible for userspace to control n. Sanitize n when using it as an
    array index.

    Note that while it appears that n must be bound to the interval [0,3]
    due to the way it is extracted from addr, we cannot guarantee that
    compiler transformations (and/or future refactoring) will ensure this is
    the case, and given this is a slow path it's better to always perform
    the masking.

    Found by smatch.

    Signed-off-by: Mark Rutland
    Acked-by: Christoffer Dall
    Acked-by: Marc Zyngier
    Cc: kvmarm@lists.cs.columbia.edu
    Signed-off-by: Will Deacon
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     

30 May, 2018

1 commit

  • [ Upstream commit 62b06f8f429cd233e4e2e7bbd21081ad60c9018f ]

    Our irq_is_pending() helper function accesses multiple members of the
    vgic_irq struct, so we need to hold the lock when calling it.
    Add that requirement as a comment to the definition and take the lock
    around the call in vgic_mmio_read_pending(), where we were missing it
    before.
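
    Sketch of the fixed reader in vgic_mmio_read_pending():

        /* irq_is_pending() looks at several vgic_irq fields, so hold
         * irq_lock across the call. */
        spin_lock(&irq->irq_lock);
        if (irq_is_pending(irq))
                value |= (1U << i);
        spin_unlock(&irq->irq_lock);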

    Fixes: 96b298000db4 ("KVM: arm/arm64: vgic-new: Add PENDING registers handlers")
    Signed-off-by: Andre Przywara
    Signed-off-by: Marc Zyngier
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andre Przywara
     

23 May, 2018

2 commits

  • commit bf308242ab98b5d1648c3663e753556bef9bec01 upstream.

    kvm_read_guest() will eventually look up in kvm_memslots(), which requires
    either to hold the kvm->slots_lock or to be inside a kvm->srcu critical
    section.
    In contrast to x86 and s390 we don't take the SRCU lock on every guest
    exit, so we have to do it individually for each kvm_read_guest() call.

    Provide a wrapper which does that and use that everywhere.

    Note that ending the SRCU critical section before returning from the
    kvm_read_guest() wrapper is safe, because the data has been *copied*, so
    we don't need to rely on valid references to the memslot anymore.
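
    The wrapper is small enough to show in full; it looks essentially
    like this (modulo surrounding context):

        static inline int kvm_read_guest_lock(struct kvm *kvm, gpa_t gpa,
                                              void *data, unsigned long len)
        {
                int srcu_idx = srcu_read_lock(&kvm->srcu);
                int ret = kvm_read_guest(kvm, gpa, data, len);

                srcu_read_unlock(&kvm->srcu, srcu_idx);

                return ret;
        }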

    Cc: stable@vger.kernel.org # 4.8+
    Reported-by: Jan Glauber
    Signed-off-by: Andre Przywara
    Acked-by: Christoffer Dall
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Andre Przywara
     
  • commit 711702b57cc3c50b84bd648de0f1ca0a378805be upstream.

    kvm_read_guest() will eventually look up in kvm_memslots(), which requires
    either to hold the kvm->slots_lock or to be inside a kvm->srcu critical
    section.
    In contrast to x86 and s390 we don't take the SRCU lock on every guest
    exit, so we have to do it individually for each kvm_read_guest() call.
    Use the newly introduced wrapper for that.

    Cc: stable@vger.kernel.org # 4.12+
    Reported-by: Jan Glauber
    Signed-off-by: Andre Przywara
    Acked-by: Christoffer Dall
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Andre Przywara
     

02 May, 2018

2 commits

  • commit 85bd0ba1ff9875798fad94218b627ea9f768f3c3 upstream.

    Although we've implemented PSCI 0.1, 0.2 and 1.0, we expose either 0.1
    or 1.0 to a guest, defaulting to the latest version of the PSCI
    implementation that is compatible with the requested version. This is
    no different from doing a firmware upgrade on KVM.

    But in order to give a chance to hypothetical badly implemented guests
    that would have a fit by discovering something other than PSCI 0.2,
    let's provide a new API that allows userspace to pick one particular
    version of the API.

    This is implemented as a new class of "firmware" registers, where
    we expose the PSCI version. This allows the PSCI version to be
    save/restored as part of a guest migration, and also set to
    any supported version if the guest requires it.
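
    From userspace the register is driven through the usual ONE_REG
    interface; a sketch (vcpu_fd and the version value are illustrative):

        uint64_t ver = 2;                       /* PSCI 0.2 encoding */
        struct kvm_one_reg reg = {
                .id   = KVM_REG_ARM_PSCI_VERSION,
                .addr = (uint64_t)&ver,
        };

        ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);  /* save during migration */
        ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);  /* restore, or pin a version */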

    Cc: stable@vger.kernel.org #4.16
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit f0cf47d939d0b4b4f660c5aaa4276fa3488f3391 upstream.

    Before entering the guest, we check whether our VMID is still
    part of the current generation. In order to avoid taking a lock,
    we start with checking that the generation is still current, and
    only if not current do we take the lock, recheck, and update the
    generation and VMID.

    This leaves open a small race: A vcpu can bump up the global
    generation number as well as the VM's, but has not updated
    the VMID itself yet.

    At that point another vcpu from the same VM comes in, checks
    the generation (and finds it not needing anything), and jumps
    into the guest. At this point, we end up with two vcpus belonging
    to the same VM running with two different VMIDs. Eventually, the
    VMID used by the second vcpu will get reassigned, and things will
    really go wrong...

    A simple solution would be to drop this initial check, and always take
    the lock. This is likely to cause performance issues. A middle ground
    is to convert the spinlock to a rwlock, and only take the read lock
    on the fast path. If the check fails at that point, drop it and
    acquire the write lock, rechecking the condition.

    This ensures that the above scenario doesn't occur.
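
    The resulting fast/slow path, sketched against update_vttbr():

        read_lock(&kvm_vmid_lock);
        if (!need_new_vmid_gen(kvm)) {
                read_unlock(&kvm_vmid_lock);
                return;                         /* fast path */
        }
        read_unlock(&kvm_vmid_lock);

        write_lock(&kvm_vmid_lock);
        if (!need_new_vmid_gen(kvm)) {          /* recheck: another vcpu
                                                 * may have won the race */
                write_unlock(&kvm_vmid_lock);
                return;
        }
        /* ... bump the generation and allocate a fresh VMID ... */
        write_unlock(&kvm_vmid_lock);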

    Cc: stable@vger.kernel.org
    Reported-by: Mark Rutland
    Tested-by: Shannon Zhao
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

26 Apr, 2018

1 commit

  • [ Upstream commit a340b3e229b24a56f1c7f5826b15a3af0f4b13e5 ]

    For EPT-violations that are triggered by a read, the pages are also mapped with
    write permissions (if their memory region is also writable). That would avoid
    getting yet another fault on the same page when a write occurs.

    This optimization only happens when you have a "struct page" backing the memory
    region. So also enable it for memory regions that do not have a "struct page".
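
    The idea in sketch form, for the remapped (no "struct page") path;
    treat this as an illustration rather than the merged diff:

        /* No struct page to consult here; derive writability from the
         * VMA itself. */
        if (writable)
                *writable = vma->vm_flags & VM_WRITE;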

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: KarimAllah Ahmed
    Reviewed-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    KarimAllah Ahmed
     

24 Apr, 2018

1 commit

  • commit 7d8b44c54e0c7c8f688e3a07f17e6083f849f01f upstream.

    vgic_copy_lpi_list() parses the LPI list and picks LPIs targeting
    a given vcpu. We allocate the array containing the intids before taking
    the lpi_list_lock, which means we can have an array size that is not
    equal to the number of LPIs.

    This is particularly obvious when looking at the path coming from
    vgic_enable_lpis, which is not a command, and thus can run in parallel
    with commands:

    vcpu 0:                                   vcpu 1:
    vgic_enable_lpis
      its_sync_lpi_pending_table
        vgic_copy_lpi_list
          intids = kmalloc_array(irq_count)
                                              MAPI(lpi targeting vcpu 0)
          list_for_each_entry(lpi_list_head)
            intids[i++] = irq->intid;

    At that stage, we will happily overrun the intids array. Boo. An easy
    fix is to break once the array is full. The MAPI command will update
    the config anyway, and we won't miss a thing. We also make sure that
    lpi_list_count is read exactly once, so that further updates of that
    value will not affect the array bound check.
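
    Sketched against vgic_copy_lpi_list(), the fix samples the count once
    and bounds the walk:

        irq_count = READ_ONCE(dist->lpi_list_count);
        intids = kmalloc_array(irq_count, sizeof(intids[0]), GFP_KERNEL);
        if (!intids)
                return -ENOMEM;

        spin_lock(&dist->lpi_list_lock);
        list_for_each_entry(irq, &dist->lpi_list_head, lpi_list) {
                /* The list may have grown since irq_count was sampled. */
                if (i == irq_count)
                        break;
                /* ... filter by target vcpu, record irq->intid ... */
        }
        spin_unlock(&dist->lpi_list_lock);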

    Cc: stable@vger.kernel.org
    Fixes: ccb1d791ab9e ("KVM: arm64: vgic-its: Fix pending table sync")
    Reviewed-by: Andre Przywara
    Reviewed-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

21 Mar, 2018

3 commits

  • commit 16ca6a607d84bef0129698d8d808f501afd08d43 upstream.

    The vgic code is trying to be clever when injecting GICv2 SGIs,
    and will happily populate LRs with the same interrupt number if
    they come from multiple vcpus (after all, they are distinct
    interrupt sources).

    Unfortunately, this is against the letter of the architecture,
    and the GICv2 architecture spec says "Each valid interrupt stored
    in the List registers must have a unique VirtualID for that
    virtual CPU interface.". GICv3 has similar (although slightly
    ambiguous) restrictions.

    This results in guests locking up when using GICv2-on-GICv3, for
    example. The obvious fix is to stop trying so hard, and inject
    a single vcpu per SGI per guest entry. After all, pending SGIs
    with multiple source vcpus are pretty rare, and are mostly seen
    in scenarios where the physical CPUs are severely overcommitted.

    But as we now only inject a single instance of a multi-source SGI per
    vcpu entry, we may delay those interrupts for longer than strictly
    necessary, and run the risk of injecting lower priority interrupts
    in the meantime.

    In order to address this, we adopt a three-stage strategy:

    - If we encounter a multi-source SGI in the AP list while computing
      its depth, we force the list to be sorted.
    - When populating the LRs, we prevent the injection of any interrupt
      of lower priority than that of the first multi-source SGI we've
      injected.
    - Finally, the injection of a multi-source SGI triggers the request
      of a maintenance interrupt when there will be no pending interrupt
      in the LRs (HCR_NPIE).

    At the point where the last pending interrupt in the LRs switches
    from Pending to Active, the maintenance interrupt will be delivered,
    allowing us to add the remaining SGIs using the same process.

    Cc: stable@vger.kernel.org
    Fixes: 0919e84c0fc1 ("KVM: arm/arm64: vgic-new: Add IRQ sync/flush framework")
    Acked-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 27e91ad1e746e341ca2312f29bccb9736be7b476 upstream.

    On guest exit, and when using GICv2 on GICv3, we use a dsb(st) to
    force synchronization between the memory-mapped guest view and
    the system-register view that the hypervisor uses.

    This is incorrect, as the spec calls out the need for "a DSB whose
    required access type is both loads and stores with any Shareability
    attribute", while we're only synchronizing stores.

    We also lack an isb after the dsb to ensure that the latter has
    actually been executed before we start reading stuff from the sysregs.

    The fix is pretty easy: turn dsb(st) into dsb(sy), and slap an isb()
    just after.
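
    In the save path this is just (using the kernel's barrier macros):

        dsb(sy);        /* order prior loads and stores, any shareability */
        isb();          /* complete the dsb before the sysreg reads below */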

    Cc: stable@vger.kernel.org
    Fixes: f68d2b1b73cc ("arm64: KVM: Implement vgic-v3 save/restore")
    Acked-by: Christoffer Dall
    Reviewed-by: Andre Przywara
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 76600428c3677659e3c3633bb4f2ea302220a275 upstream.

    On my GICv3 system, the following is printed to the kernel log at boot:

    kvm [1]: 8-bit VMID
    kvm [1]: IDMAP page: d20e35000
    kvm [1]: HYP VA range: 800000000000:ffffffffffff
    kvm [1]: vgic-v2@2c020000
    kvm [1]: GIC system register CPU interface enabled
    kvm [1]: vgic interrupt IRQ1
    kvm [1]: virtual timer IRQ4
    kvm [1]: Hyp mode initialized successfully

    The KVM IDMAP is a mapping of a statically allocated kernel structure,
    and so printing its physical address leaks the physical placement of
    the kernel when physical KASLR is in effect. So change the kvm_info()
    to kvm_debug() to remove it from the log output.

    While at it, trim the output a bit more: IRQ numbers can be found in
    /proc/interrupts, and the HYP VA and vgic-v2 lines are not highly
    informational either.

    Acked-by: Will Deacon
    Acked-by: Christoffer Dall
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     

09 Mar, 2018

1 commit

  • commit b28676bb8ae4569cced423dc2a88f7cb319d5379 upstream.

    Reported by syzkaller:

    pte_list_remove: ffff9714eb1f8078 0->BUG
    ------------[ cut here ]------------
    kernel BUG at arch/x86/kvm/mmu.c:1157!
    invalid opcode: 0000 [#1] SMP
    RIP: 0010:pte_list_remove+0x11b/0x120 [kvm]
    Call Trace:
    drop_spte+0x83/0xb0 [kvm]
    mmu_page_zap_pte+0xcc/0xe0 [kvm]
    kvm_mmu_prepare_zap_page+0x81/0x4a0 [kvm]
    kvm_mmu_invalidate_zap_all_pages+0x159/0x220 [kvm]
    kvm_arch_flush_shadow_all+0xe/0x10 [kvm]
    kvm_mmu_notifier_release+0x6c/0xa0 [kvm]
    ? kvm_mmu_notifier_release+0x5/0xa0 [kvm]
    __mmu_notifier_release+0x79/0x110
    ? __mmu_notifier_release+0x5/0x110
    exit_mmap+0x15a/0x170
    ? do_exit+0x281/0xcb0
    mmput+0x66/0x160
    do_exit+0x2c9/0xcb0
    ? __context_tracking_exit.part.5+0x4a/0x150
    do_group_exit+0x50/0xd0
    SyS_exit_group+0x14/0x20
    do_syscall_64+0x73/0x1f0
    entry_SYSCALL64_slow_path+0x25/0x25

    The reason is that when a new memslot is created, there is no guarantee
    that it does not overlap with private memslots. This can be triggered by the
    following program:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    long r[16];

    int main()
    {
        void *p = valloc(0x4000);

        r[2] = open("/dev/kvm", 0);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);

        uint64_t addr = 0xf000;
        ioctl(r[3], KVM_SET_IDENTITY_MAP_ADDR, &addr);
        r[6] = ioctl(r[3], KVM_CREATE_VCPU, 0x0ul);
        ioctl(r[3], KVM_SET_TSS_ADDR, 0x0ul);
        ioctl(r[6], KVM_RUN, 0);
        ioctl(r[6], KVM_RUN, 0);

        struct kvm_userspace_memory_region mr = {
            .slot = 0,
            .flags = KVM_MEM_LOG_DIRTY_PAGES,
            .guest_phys_addr = 0xf000,
            .memory_size = 0x4000,
            .userspace_addr = (uintptr_t) p
        };
        ioctl(r[3], KVM_SET_USER_MEMORY_REGION, &mr);
        return 0;
    }

    This patch fixes the bug by refusing to create a new memslot when it
    overlaps with private memslots.

    Reported-by: Dmitry Vyukov
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: stable@vger.kernel.org
    Signed-off-by: Wanpeng Li

    Wanpeng Li
     

25 Feb, 2018

2 commits

  • [ Upstream commit 7465894e90e5a47e0e52aa5f1f708653fc40020f ]

    vgic_set_owner acquires the irq lock without disabling interrupts,
    resulting in a lockdep splat (an interrupt could fire and result
    in the same lock being taken if the same virtual irq is to be
    injected).

    In practice, it is almost impossible to trigger this bug, but
    better safe than sorry. Convert the lock acquisition to a
    spin_lock_irqsave() and keep lockdep happy.
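
    Sketch of the converted acquisition in kvm_vgic_set_owner():

        unsigned long flags;

        spin_lock_irqsave(&irq->irq_lock, flags);
        if (irq->owner && irq->owner != owner)
                ret = -EEXIST;
        else
                irq->owner = owner;
        spin_unlock_irqrestore(&irq->irq_lock, flags);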

    Reported-by: James Morse
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • [ Upstream commit 58d0d19a204604ca0da26058828a53558b265da3 ]

    Since it is perfectly legal to run the kernel at EL1, it is not
    actually an error if HYP mode is not available when attempting to
    initialize KVM, given that KVM support cannot be built as a module.
    So demote the kvm_err() to kvm_info(), which prevents the error from
    appearing on an otherwise 'quiet' console.

    Acked-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Christoffer Dall
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     

17 Feb, 2018

9 commits

  • commit 58d6b15e9da5042a99c9c30ad725792e4569150e upstream.

    cpu_pm_enter() calls the pm notifier chain with CPU_PM_ENTER, then if
    there is a failure: CPU_PM_ENTER_FAILED.

    When KVM receives CPU_PM_ENTER it calls cpu_hyp_reset() which will
    return us to the hyp-stub. If we subsequently get a CPU_PM_ENTER_FAILED,
    KVM does nothing, leaving the CPU running with the hyp-stub, at odds
    with kvm_arm_hardware_enabled.

    Add CPU_PM_ENTER_FAILED as a fallthrough for CPU_PM_EXIT, this reloads
    KVM based on kvm_arm_hardware_enabled. This is safe even if CPU_PM_ENTER
    never gets as far as KVM, as cpu_hyp_reinit() calls cpu_hyp_reset()
    to make sure the hyp-stub is loaded before reloading KVM.
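
    The notifier then treats a failed suspend exactly like a resume;
    roughly:

        switch (action) {
        case CPU_PM_ENTER:
                if (__this_cpu_read(kvm_arm_hardware_enabled))
                        cpu_hyp_reset();        /* back to the hyp-stub */
                return NOTIFY_OK;
        case CPU_PM_ENTER_FAILED:               /* fall through */
        case CPU_PM_EXIT:
                if (__this_cpu_read(kvm_arm_hardware_enabled))
                        cpu_hyp_reinit();
                return NOTIFY_OK;
        default:
                return NOTIFY_DONE;
        }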

    Fixes: 67f691976662 ("arm64: kvm: allows kvm cpu hotplug")
    CC: Lorenzo Pieralisi
    Reviewed-by: Christoffer Dall
    Signed-off-by: James Morse
    Signed-off-by: Christoffer Dall
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     
  • Commit 6167ec5c9145 upstream.

    A new feature of SMCCC 1.1 is that it offers firmware-based CPU
    workarounds. In particular, SMCCC_ARCH_WORKAROUND_1 provides
    BP hardening for CVE-2017-5715.

    If the host has some mitigation for this issue, report that
    we deal with it using SMCCC_ARCH_WORKAROUND_1, as we apply the
    host workaround on every guest exit.

    Tested-by: Ard Biesheuvel
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • Commit a4097b351118 upstream.

    We're about to need kvm_psci_version in HYP too. So let's turn it
    into a static inline, and pass the kvm structure as a second
    parameter (so that HYP can do a kern_hyp_va on it).
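
    A sketch of the resulting inline (the kvm parameter exists so the HYP
    caller can hand in a kern_hyp_va-translated pointer):

        static inline int kvm_psci_version(struct kvm_vcpu *vcpu,
                                           struct kvm *kvm)
        {
                if (test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features))
                        return KVM_ARM_PSCI_LATEST;

                return KVM_ARM_PSCI_0_1;
        }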

    Tested-by: Ard Biesheuvel
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • Commit 09e6be12effd upstream.

    The new SMC Calling Convention (v1.1) allows for a reduced overhead
    when calling into the firmware, and provides a new feature discovery
    mechanism.

    Make it visible to KVM guests.

    Tested-by: Ard Biesheuvel
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • Commit 58e0b2239a4d upstream.

    PSCI 1.0 can be trivially implemented by providing the FEATURES
    call on top of PSCI 0.2 and returning 1.0 as the PSCI version.

    We happily ignore everything else, as they are either optional or
    are clarifications that do not require any additional change.

    PSCI 1.0 is now the default until we decide to add a userspace
    selection API.

    Reviewed-by: Christoffer Dall
    Tested-by: Ard Biesheuvel
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • Commit 84684fecd7ea upstream.

    Instead of open coding the accesses to the various registers,
    let's add explicit SMCCC accessors.
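
    The accessors are thin wrappers around the vcpu GPR file; in sketch
    form:

        static u32 smccc_get_function(struct kvm_vcpu *vcpu)
        {
                return vcpu_get_reg(vcpu, 0);
        }

        static unsigned long smccc_get_arg1(struct kvm_vcpu *vcpu)
        {
                return vcpu_get_reg(vcpu, 1);
        }

        static void smccc_set_retval(struct kvm_vcpu *vcpu, unsigned long a0,
                                     unsigned long a1, unsigned long a2,
                                     unsigned long a3)
        {
                vcpu_set_reg(vcpu, 0, a0);
                vcpu_set_reg(vcpu, 1, a1);
                vcpu_set_reg(vcpu, 2, a2);
                vcpu_set_reg(vcpu, 3, a3);
        }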

    Reviewed-by: Christoffer Dall
    Tested-by: Ard Biesheuvel
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • Commit d0a144f12a7c upstream.

    As we're about to trigger a PSCI version explosion, it doesn't
    hurt to introduce a PSCI_VERSION helper that is going to be
    used everywhere.
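
    The helper just packs major/minor into the PSCI version format (a
    sketch matching the uapi encoding):

        #define PSCI_VERSION(major, minor)                              \
                ((((major) & 0x7fff) << 16) | ((minor) & 0xffff))

        /* e.g. PSCI_VERSION(1, 0) == 0x10000 */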

    Reviewed-by: Christoffer Dall
    Tested-by: Ard Biesheuvel
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • Commit 1a2fb94e6a77 upstream.

    As we're about to update the PSCI support, and because I'm lazy,
    let's move the PSCI include file to include/kvm so that both
    ARM architectures can find it.

    Acked-by: Christoffer Dall
    Tested-by: Ard Biesheuvel
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • Commit 6840bdd73d07 upstream.

    Now that we have per-CPU vectors, let's plug them into the KVM/arm64 code.

    Signed-off-by: Marc Zyngier
    Signed-off-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

04 Feb, 2018

1 commit

  • [ Upstream commit 20b7035c66bacc909ae3ffe92c1a1ea7db99fe4f ]

    KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
    "any unblocked signal received [...] will cause KVM_RUN to return with
    -EINTR" and that "the signal will only be delivered if not blocked by
    the original signal mask".

    This, however, is only true, when the calling task has a signal handler
    registered for a signal. If not, signal evaluation is short-circuited for
    SIG_IGN and SIG_DFL, and the signal is either ignored without KVM_RUN
    returning or the whole process is terminated.

    Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
    to that in do_sigtimedwait() to avoid short-circuiting of signals.

    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jan H. Schönherr
     

24 Jan, 2018

1 commit

  • commit c507babf10ead4d5c8cca704539b170752a8ac84 upstream.

    KVM only supports PMD hugepages at stage 2 but doesn't actually check
    that the provided hugepage memory pagesize is PMD_SIZE before populating
    stage 2 entries.

    In cases where the backing hugepage size is smaller than PMD_SIZE (such
    as when using contiguous hugepages), KVM can end up creating stage 2
    mappings that extend beyond the supplied memory.

    Fix this by checking for the pagesize of userspace vma before creating
    PMD hugepage at stage 2.
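
    Sketch of the check in user_mem_abort(): only take the hugepage path
    when the VMA is genuinely backed by PMD_SIZE pages.

        if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
                hugetlb = true;
                gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
        }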

    Fixes: 66b3923a1a0f77a ("arm64: hugetlb: add support for PTE contiguous bit")
    Signed-off-by: Punit Agrawal
    Cc: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Christoffer Dall
    Signed-off-by: Greg Kroah-Hartman

    Punit Agrawal
     

17 Jan, 2018

1 commit

  • commit e39d200fa5bf5b94a0948db0dae44c1b73b84a56 upstream.

    Reported by syzkaller:

    BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
    Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

    CPU: 6 PID: 32298 Comm: syz-executor Tainted: G OE 4.15.0-rc2+ #18
    Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
    Call Trace:
    dump_stack+0xab/0xe1
    print_address_description+0x6b/0x290
    kasan_report+0x28a/0x370
    write_mmio+0x11e/0x270 [kvm]
    emulator_read_write_onepage+0x311/0x600 [kvm]
    emulator_read_write+0xef/0x240 [kvm]
    emulator_fix_hypercall+0x105/0x150 [kvm]
    em_hypercall+0x2b/0x80 [kvm]
    x86_emulate_insn+0x2b1/0x1640 [kvm]
    x86_emulate_instruction+0x39a/0xb90 [kvm]
    handle_exception+0x1b4/0x4d0 [kvm_intel]
    vcpu_enter_guest+0x15a0/0x2640 [kvm]
    kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
    kvm_vcpu_ioctl+0x479/0x880 [kvm]
    do_vfs_ioctl+0x142/0x9a0
    SyS_ioctl+0x74/0x80
    entry_SYSCALL_64_fastpath+0x23/0x9a

    The patched-vmmcall path writes the 3-byte opcode 0F 01 C1 (vmcall)
    to guest memory; however, the write_mmio tracepoint always prints 8 bytes
    through *(u64 *)val, since kvm splits the mmio access into 8-byte chunks.
    This leaks 5 bytes from the kernel stack (CVE-2017-17741). This patch fixes
    it by accessing only the bytes which we operate on.

    Before patch:

    syz-executor-5567 [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

    After patch:

    syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f
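
    One way to realize the fix, sketched (the merged patch instead passes
    the buffer pointer down into the tracepoint, to the same effect):

        u64 data = 0;

        /* val holds only len valid bytes; never read past them. */
        memcpy(&data, val, min(len, 8));
        trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, len, gpa, data);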

    Reported-by: Dmitry Vyukov
    Reviewed-by: Darren Kenny
    Reviewed-by: Marc Zyngier
    Tested-by: Marc Zyngier
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Marc Zyngier
    Cc: Christoffer Dall
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini
    Cc: Mathieu Desnoyers
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li