17 Jun, 2020

3 commits

  • commit ef3e40a7ea8dbe2abd0a345032cd7d5023b9684f upstream.

    When using the PtrAuth feature in a guest, we need to save the host's
    keys before allowing the guest to program them. For that, we dump
them in a per-CPU data structure (the so-called host context).

    But both call sites that do this are in preemptible context,
    which may end up in disaster should the vcpu thread get preempted
    before reentering the guest.

    Instead, save the keys eagerly on each vcpu_load(). This has an
    increased overhead, but is at least safe.

    Cc: stable@vger.kernel.org
    Reviewed-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 0370964dd3ff7d3d406f292cb443a927952cbd05 upstream.

    On a VHE system, the EL1 state is left in the CPU most of the time,
and only synchronized back to memory when vcpu_put() is called (most
    of the time on preemption).

    Which means that when injecting an exception, we'd better have a way
    to either:
    (1) write directly to the EL1 sysregs
    (2) synchronize the state back to memory, and do the changes there

For an AArch64 guest, we already do (1), so we are safe. Unfortunately,
doing the same thing for AArch32 would be pretty invasive. Instead,
we can easily implement (2) by calling the put/load architectural
backends while keeping preemption disabled. We can then reload the
state back into EL1.

    Cc: stable@vger.kernel.org
    Reported-by: James Morse
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit e649b3f0188f8fd34dd0dde8d43fd3312b902fb2 upstream.

    Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
    to fix inappropriate APIC page invalidation by re-introducing arch
    specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
    kvm_mmu_notifier_invalidate_range_start. However, the patch left a
    possible race where the VMCS APIC address cache is updated *before*
    it is unmapped:

    (Invalidator) kvm_mmu_notifier_invalidate_range_start()
    (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
    (KVM VCPU) vcpu_enter_guest()
    (KVM VCPU) kvm_vcpu_reload_apic_access_page()
    (Invalidator) actually unmap page

    Because of the above race, there can be a mismatch between the
    host physical address stored in the APIC_ACCESS_PAGE VMCS field and
    the host physical address stored in the EPT entry for the APIC GPA
(0xfee00000). When this happens, the processor will not trap APIC
accesses, and will instead show the raw contents of the APIC-access page.
Because Windows OS periodically checks for unexpected modifications to
the LAPIC register, this will show up as a BSOD crash with BugCheck
CRITICAL_STRUCTURE_CORRUPTION (109), which we are currently seeing in
    https://bugzilla.redhat.com/show_bug.cgi?id=1751017.

    The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
    cannot guarantee that no additional references are taken to the pages in
    the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately,
    this case is supported by the MMU notifier API, as documented in
    include/linux/mmu_notifier.h:

    * If the subsystem
    * can't guarantee that no additional references are taken to
    * the pages in the range, it has to implement the
    * invalidate_range() notifier to remove any references taken
    * after invalidate_range_start().

    The fix therefore is to reload the APIC-access page field in the VMCS
    from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().

    Cc: stable@vger.kernel.org
    Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation")
    Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
    Signed-off-by: Eiichi Tsukata
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Eiichi Tsukata
     

20 May, 2020

1 commit

  • [ Upstream commit 9a50ebbffa9862db7604345f5fd763122b0f6fed ]

    When a guest tries to read the active state of its interrupts,
    we currently just return whatever state we have in memory. This
    means that if such an interrupt lives in a List Register on another
CPU, we fail to observe the latest active state for this interrupt.

    In order to remedy this, stop all the other vcpus so that they exit
    and we can observe the most recent value for the state. This is
    similar to what we are doing for the write side of the same
    registers, and results in new MMIO handlers for userspace (which
    do not need to stop the guest, as it is supposed to be stopped
    already).

    Reported-by: Julien Grall
    Reviewed-by: Andre Przywara
    Signed-off-by: Marc Zyngier
    Signed-off-by: Sasha Levin

    Marc Zyngier
     

14 May, 2020

2 commits

  • commit 0225fd5e0a6a32af7af0aefac45c8ebf19dc5183 upstream.

    In the unlikely event that a 32bit vcpu traps into the hypervisor
    on an instruction that is located right at the end of the 32bit
    range, the emulation of that instruction is going to increment
    PC past the 32bit range. This isn't great, as userspace can then
    observe this value and get a bit confused.

Conversely, userspace can do things like (in the context of a 64bit
guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0,
setting PC to a 64bit value, changing PSTATE to AArch32-USR, and observing
that PC hasn't been truncated. More confusion.

Fix both by:
- truncating PC increments for 32bit guests (illustrated below)
- sanitizing all 32bit regs every time a core reg is changed by
userspace while PSTATE indicates a 32bit mode.
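A standalone sketch of the first point (illustrative only, not the
kernel patch; skip_instr32() is a made-up name):

    #include <stdint.h>
    #include <stdio.h>

    /* Advancing the PC of a 32bit guest must not let it escape the
     * 32bit range. */
    static uint64_t skip_instr32(uint64_t pc, unsigned int instr_len)
    {
        return (uint32_t)(pc + instr_len);   /* truncate to 32 bits */
    }

    int main(void)
    {
        /* An instruction right at the end of the 32bit range. */
        uint64_t pc = skip_instr32(0xfffffffcULL, 4);

        printf("%#llx\n", (unsigned long long)pc);  /* wraps to 0, not 0x100000000 */
        return 0;
    }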

    Cc: stable@vger.kernel.org
    Acked-by: Will Deacon
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 1c32ca5dc6d00012f0c964e5fdd7042fcc71efb1 upstream.

    When deciding whether a guest has to be stopped we check whether this
    is a private interrupt or not. Unfortunately, there's an off-by-one bug
    here, and we fail to recognize a whole range of interrupts as being
    global (GICv2 SPIs 32-63).

Fix the condition from > to >=.
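As a standalone illustration (not the kernel patch), assuming the usual
VGIC_NR_PRIVATE_IRQS value of 32 and that 'intid' is the base INTID of
the 32-interrupt register being accessed (0, 32, 64, ...):

    #define VGIC_NR_PRIVATE_IRQS 32

    /* Buggy: the register starting at INTID 32 (GICv2 SPIs 32-63) is
     * wrongly treated as private, so the other vcpus are not stopped. */
    int is_global_buggy(unsigned int intid)
    {
        return intid > VGIC_NR_PRIVATE_IRQS;
    }

    /* Fixed: any register at or above INTID 32 covers global interrupts. */
    int is_global_fixed(unsigned int intid)
    {
        return intid >= VGIC_NR_PRIVATE_IRQS;
    }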

    Cc: stable@vger.kernel.org
    Fixes: abd7229626b93 ("KVM: arm/arm64: Simplify active_change_prepare and plug race")
    Reported-by: André Przywara
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

05 Mar, 2020

1 commit

  • commit fcfbc617547fc6d9552cb6c1c563b6a90ee98085 upstream.

    When reading/writing using the guest/host cache, check for a bad hva
    before checking for a NULL memslot, which triggers the slow path for
handling cross-page accesses. Because the memslot is nullified on error
    by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
    crossing into a new page, then the kvm_{read,write}_guest() slow path
    could potentially write/access the first chunk prior to detecting the
    bad hva.

    Arguably, performing a partial access is semantically correct from an
    architectural perspective, but that behavior is certainly not intended.
    In the original implementation, memslot was not explicitly nullified
    and therefore the partial access behavior varied based on whether the
    memslot itself was null, or if the hva was simply bad. The current
    behavior was introduced as a seemingly unintentional side effect in
    commit f1b9dd5eb86c ("kvm: Disallow wraparound in
    kvm_gfn_to_hva_cache_init"), which justified the change with "since some
callers don't check the return code from this function, it seems
    prudent to clear ghc->memslot in the event of an error".

    Regardless of intent, the partial access is dependent on _not_ checking
    the result of the cache initialization, which is arguably a bug in its
    own right, at best simply weird.

    Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.")
    Cc: Jim Mattson
    Cc: Andrew Honig
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

15 Feb, 2020

7 commits

  • commit 4a267aa707953a9a73d1f5dc7f894dd9024a92be upstream.

    According to the ARM ARM, registers CNT{P,V}_TVAL_EL0 have bits [63:32]
    RES0 [1]. When reading the register, the value is truncated to the least
    significant 32 bits [2], and on writes, TimerValue is treated as a signed
    32-bit integer [1, 2].

    When the guest behaves correctly and writes 32-bit values, treating TVAL
    as an unsigned 64 bit register works as expected. However, things start
    to break down when the guest writes larger values, because
(u64)0x1_ffff_ffff = 8589934591, but (s32)0x1_ffff_ffff = -1, and the
    former will cause the timer interrupt to be asserted in the future, but
    the latter will cause it to be asserted now. Let's treat TVAL as a
    signed 32-bit register on writes, to match the behaviour described in
    the architecture, and the behaviour experimentally exhibited by the
    virtual timer on a non-vhe host.

    [1] Arm DDI 0487E.a, section D13.8.18
    [2] Arm DDI 0487E.a, section D11.2.4
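A standalone illustration of the arithmetic above:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t tval = 0x1ffffffffULL;

        /* Treated as an unsigned 64bit value: an expiry far in the future. */
        printf("%llu\n", (unsigned long long)tval);   /* 8589934591 */

        /* Treated as a signed 32bit value, as the architecture requires:
         * the timer condition is met immediately. */
        printf("%d\n", (int32_t)tval);                /* -1 */
        return 0;
    }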

    Signed-off-by: Alexandru Elisei
    [maz: replaced the read-side mask with lower_32_bits]
    Signed-off-by: Marc Zyngier
    Fixes: 8fa761624871 ("KVM: arm/arm64: arch_timer: Fix CNTP_TVAL calculation")
    Link: https://lore.kernel.org/r/20200127103652.2326-1-alexandru.elisei@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Alexandru Elisei
     
  • commit aa76829171e98bd75a0cc00b6248eca269ac7f4f upstream.

At the moment a SW_INCR counter always overflows on a 32-bit
boundary, independently of whether the (n+1)th counter is
programmed as CHAIN.

    Check whether the SW_INCR counter is a 64b counter and if so,
    implement the 64b logic.

    Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-4-eric.auger@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
     
  • commit 3837407c1aa1101ed5e214c7d6041e7a23335c6e upstream.

    The specification says PMSWINC increments PMEVCNTR_EL1 by 1
    if PMEVCNTR_EL0 is enabled and configured to count SW_INCR.

For PMEVCNTR_EL0 to be enabled, we need both the corresponding
PMCNTENSET bit and the PMCR.E bit to be set.
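In other words (illustration only, not the kernel code), assuming the
usual PMCR_EL0.E bit position:

    #include <stdint.h>

    #define ARMV8_PMU_PMCR_E (1U << 0)   /* global counter enable */

    /* Event counter i only counts SW_INCR when both the global enable
     * and its PMCNTENSET_EL0 bit are set. */
    int swinc_counter_enabled(uint32_t pmcr, uint64_t pmcntenset, unsigned int i)
    {
        return (pmcr & ARMV8_PMU_PMCR_E) && (pmcntenset & (1ULL << i));
    }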

    Fixes: 7a0adc7064b8 ("arm64: KVM: Add access handler for PMSWINC register")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Reviewed-by: Andrew Murray
    Acked-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-2-eric.auger@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
     
  • commit 21aecdbd7f3ab02c9b82597dc733ee759fb8b274 upstream.

    KVM's inject_abt64() injects an external-abort into an aarch64 guest.
    The KVM_CAP_ARM_INJECT_EXT_DABT is intended to do exactly this, but
    for an aarch32 guest inject_abt32() injects an implementation-defined
    exception, 'Lockdown fault'.

    Change this to external abort. For non-LPAE we now get the documented:
    | Unhandled fault: external abort on non-linefetch (0x008) at 0x9c800f00
    and for LPAE:
    | Unhandled fault: synchronous external abort (0x210) at 0x9c800f00

    Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
    Reported-by: Beata Michalska
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121123356.203000-3-james.morse@arm.com
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     
  • commit 018f22f95e8a6c3e27188b7317ef2c70a34cb2cd upstream.

    Beata reports that KVM_SET_VCPU_EVENTS doesn't inject the expected
    exception to a non-LPAE aarch32 guest.

    The host intends to inject DFSR.FS=0x14 "IMPLEMENTATION DEFINED fault
    (Lockdown fault)", but the guest receives DFSR.FS=0x04 "Fault on
instruction cache maintenance". This fault has been hooked by
do_translation_fault() since ARMv6, which goes on to silently 'handle'
the exception and restart the faulting instruction.

It turns out that, when TTBCR.EAE is clear, DFSR is split and FS[4] has
to shuffle up to DFSR[10].

    As KVM only does this in one place, fix up the static values. We
    now get the expected:
    | Unhandled fault: lock abort (0x404) at 0x9c800f00
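A standalone sketch of the short-descriptor encoding: FS[3:0] lives in
DFSR[3:0] and FS[4] in DFSR[10], so FS=0x14 encodes as 0x404, the
'lock abort' reported above:

    #include <stdint.h>
    #include <stdio.h>

    /* Encode a 5-bit fault status into a short-descriptor DFSR value
     * (other DFSR fields omitted for the illustration). */
    static uint32_t dfsr_short(uint32_t fs)
    {
        return (fs & 0xf) | (((fs >> 4) & 1) << 10);
    }

    int main(void)
    {
        printf("%#x\n", dfsr_short(0x14));   /* 0x404 */
        return 0;
    }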

    Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
    Reported-by: Beata Michalska
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121123356.203000-2-james.morse@arm.com
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     
  • commit cf2d23e0bac9f6b5cd1cba8898f5f05ead40e530 upstream.

kvm_test_age_hva() is called upon mmu_notifier_test_young(), but the wrong
address range has been passed to handle_hva_to_gpa(). With the wrong
address range, no young bits will be checked in handle_hva_to_gpa(),
which means zero is always returned from mmu_notifier_test_young().

This fixes the issue by passing the correct address range to the underlying
function handle_hva_to_gpa(), so that the hardware young (access) bit
will be visited.
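A standalone illustration of the symptom (not the kernel code), assuming
the buggy call amounted to an effectively empty [start, end) range, which
matches the "no young bits will be checked" behaviour:

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    static int pages_visited(unsigned long start, unsigned long end)
    {
        int visited = 0;

        for (unsigned long addr = start; addr < end; addr += PAGE_SIZE)
            visited++;
        return visited;
    }

    int main(void)
    {
        unsigned long hva = 0x7f0000000000UL;

        printf("%d\n", pages_visited(hva, hva));             /* 0: nothing is checked */
        printf("%d\n", pages_visited(hva, hva + PAGE_SIZE)); /* 1: the page is visited */
        return 0;
    }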

    Fixes: 35307b9a5f7e ("arm/arm64: KVM: Implement Stage-2 page aging")
    Signed-off-by: Gavin Shan
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121055659.19560-1-gshan@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     
  • commit 8c58be34494b7f1b2adb446e2d8beeb90e5de65b upstream.

    Saving/restoring an unmapped collection is a valid scenario. For
example, this happens if a MAPTI command was sent featuring an
    unmapped collection. At the moment the CTE fails to be restored.
    Only compare against the number of online vcpus if the rdist
    base is set.

    Fixes: ea1ad53e1e31a ("KVM: arm64: vgic-its: Collection table save/restore")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Reviewed-by: Zenghui Yu
    Link: https://lore.kernel.org/r/20191213094237.19627-1-eric.auger@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
     

11 Feb, 2020

8 commits

  • [ Upstream commit 42cde48b2d39772dba47e680781a32a6c4b7dc33 ]

    Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
    on read-only memslots due to gfn_to_hva() assuming writes. Functionally,
    this allows x86 to create large mappings for read-only memslots that
    are backed by HugeTLB mappings.

    Note, the changelog for commit 05da45583de9 ("KVM: MMU: large page
    support") states "If the largepage contains write-protected pages, a
    large pte is not used.", but "write-protected" refers to pages that are
    temporarily read-only, e.g. read-only memslots didn't even exist at the
    time.

    Fixes: 4d8b81abc47b ("KVM: introduce readonly memslot")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    [Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit f9b84e19221efc5f493156ee0329df3142085f28 ]

    Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
    correct set of memslots is used when handling x86 page faults in SMM.

    Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit 736c291c9f36b07f8889c61764c28edce20e715d ]

    Convert a plethora of parameters and variables in the MMU and page fault
    flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.

    Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
    addresses. When TDP is enabled, the fault address is a guest physical
    address and thus can be a 64-bit value, even when both KVM and its guest
    are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
    64-bit field, not a natural width field.

    Using a gva_t for the fault address means KVM will incorrectly drop the
    upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
    translate L2 GPAs to L1 GPAs.

    Opportunistically rename variables and parameters to better reflect the
    dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
    "addr" instead of "vaddr" when the address may be either a GVA or an L2
    GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
    a confusing "gpa_t gva" declaration; this also sets the stage for a
future patch to combine nonpaging_page_fault() and tdp_page_fault() with
    minimal churn.

    Sprinkle in a few comments to document flows where an address is known
    to be a GVA and thus can be safely truncated to a 32-bit value. Add
    WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
    document such cases and detect bugs.
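A standalone illustration of the truncation on a 32-bit build, where
gva_t is only 32 bits wide but a TDP fault address can exceed 4GiB:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t gva_t_on_32bit;   /* stand-in for gva_t on a 32-bit kernel */
    typedef uint64_t gpa_t;

    int main(void)
    {
        gpa_t fault_addr = 0x100001000ULL;        /* guest physical address above 4GiB */
        gva_t_on_32bit truncated = fault_addr;    /* upper 32 bits silently dropped */

        printf("%#llx -> %#x\n",
               (unsigned long long)fault_addr, truncated);   /* 0x100001000 -> 0x1000 */
        return 0;
    }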

    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • commit 917248144db5d7320655dbb41d3af0b8a0f3d589 upstream.

    __kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
    * relatively expensive
    * in certain cases (such as when done from atomic context) cannot be called

    Stashing gfn-to-pfn mapping should help with both cases.

    This is part of CVE-2019-3016.

    Signed-off-by: Boris Ostrovsky
    Reviewed-by: Joao Martins
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Boris Ostrovsky
     
  • commit 1eff70a9abd46f175defafd29bc17ad456f398a7 upstream.

    kvm_vcpu_(un)map operates on gfns from any current address space.
    In certain cases we want to make sure we are not mapping SMRAM
    and for that we can use kvm_(un)map_gfn() that we are introducing
    in this patch.

    This is part of CVE-2019-3016.

    Signed-off-by: Boris Ostrovsky
    Reviewed-by: Joao Martins
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Boris Ostrovsky
     
  • commit b6ae256afd32f96bec0117175b329d0dd617655e upstream.

    On AArch64 you can do a sign-extended load to either a 32-bit or 64-bit
    register, and we should only sign extend the register up to the width of
    the register as specified in the operation (by using the 32-bit Wn or
    64-bit Xn register specifier).

    As it turns out, the architecture provides this decoding information in
    the SF ("Sixty-Four" -- how cute...) bit.

    Let's take advantage of this with the usual 32-bit/64-bit header file
    dance and do the right thing on AArch64 hosts.
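A standalone sketch of the resulting behaviour (not the kernel patch):
sign-extend the loaded value, then keep only the lower 32 bits when the
SF bit selects a Wn destination:

    #include <stdint.h>
    #include <stdio.h>

    /* Sign-extend a 'len'-byte loaded value into a destination register
     * whose width is given by 'sixty_four' (the SF bit). */
    static uint64_t extend_load(uint64_t data, unsigned int len, int sixty_four)
    {
        uint64_t sign = 1ULL << (len * 8 - 1);

        data = (data ^ sign) - sign;      /* sign-extend to 64 bits */
        if (!sixty_four)
            data &= 0xffffffffULL;        /* Wn target: upper 32 bits are zero */
        return data;
    }

    int main(void)
    {
        /* A sign-extended byte load of 0xff into Wn vs Xn. */
        printf("%#llx\n", (unsigned long long)extend_load(0xff, 1, 0)); /* 0xffffffff */
        printf("%#llx\n", (unsigned long long)extend_load(0xff, 1, 1)); /* 0xffffffffffffffff */
        return 0;
    }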

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20191212195055.5541-1-christoffer.dall@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Christoffer Dall
     
  • commit 1cfbb484de158e378e8971ac40f3082e53ecca55 upstream.

    Confusingly, there are three SPSR layouts that a kernel may need to deal
    with:

    (1) An AArch64 SPSR_ELx view of an AArch64 pstate
    (2) An AArch64 SPSR_ELx view of an AArch32 pstate
    (3) An AArch32 SPSR_* view of an AArch32 pstate

    When the KVM AArch32 support code deals with SPSR_{EL2,HYP}, it's either
    dealing with #2 or #3 consistently. On arm64 the PSR_AA32_* definitions
    match the AArch64 SPSR_ELx view, and on arm the PSR_AA32_* definitions
    match the AArch32 SPSR_* view.

    However, when we inject an exception into an AArch32 guest, we have to
    synthesize the AArch32 SPSR_* that the guest will see. Thus, an AArch64
    host needs to synthesize layout #3 from layout #2.

    This patch adds a new host_spsr_to_spsr32() helper for this, and makes
    use of it in the KVM AArch32 support code. For arm64 we need to shuffle
    the DIT bit around, and remove the SS bit, while for arm we can use the
    value as-is.

    I've open-coded the bit manipulation for now to avoid having to rework
    the existing PSR_* definitions into PSR64_AA32_* and PSR32_AA32_*
    definitions. I hope to perform a more thorough refactoring in future so
    that we can handle pstate view manipulation more consistently across the
    kernel tree.

    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Reviewed-by: Alexandru Elisei
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200108134324.46500-4-mark.rutland@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     
  • commit 3c2483f15499b877ccb53250d88addb8c91da147 upstream.

    When KVM injects an exception into a guest, it generates the CPSR value
    from scratch, configuring CPSR.{M,A,I,T,E}, and setting all other
    bits to zero.

    This isn't correct, as the architecture specifies that some CPSR bits
    are (conditionally) cleared or set upon an exception, and others are
    unchanged from the original context.

    This patch adds logic to match the architectural behaviour. To make this
    simple to follow/audit/extend, documentation references are provided,
    and bits are configured in order of their layout in SPSR_EL2. This
    layout can be seen in the diagram on ARM DDI 0487E.a page C5-426.

    Note that this code is used by both arm and arm64, and is intended to
function with the SPSR_EL2 and SPSR_HYP layouts.

    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Reviewed-by: Alexandru Elisei
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200108134324.46500-3-mark.rutland@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     

31 Dec, 2019

1 commit

  • commit 6d674e28f642e3ff676fbae2d8d1b872814d32b6 upstream.

    A device mapping is normally always mapped at Stage-2, since there
    is very little gain in having it faulted in.

Nonetheless, it is possible to end up in a situation where the device
mapping has been removed from Stage-2 (userspace munmap()ed the VFIO
region, and the MMU notifier did its job), but is still present in a
userspace mapping (userspace has mapped it back at the same address). In such
    a situation, the device mapping will be demand-paged as the guest
    performs memory accesses.

This requires us to be careful when dealing with mapping size and cache
management, and to handle potential execution of a device mapping.

    Reported-by: Alexandru Elisei
    Signed-off-by: Marc Zyngier
    Tested-by: Alexandru Elisei
    Reviewed-by: James Morse
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20191211165651.7889-2-maz@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

13 Dec, 2019

1 commit

  • commit ca185b260951d3b55108c0b95e188682d8a507b7 upstream.

It's possible that two LPIs reside in the same "byte_offset" but target
two different vcpus, so their pending status is indicated by two
different pending tables. In such a scenario, the last_byte_offset
optimization will lead KVM to rely on the wrong pending table entry.
Let us use last_ptr instead, which can be treated as a byte index into
a pending table and can also be vcpu specific.
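A small standalone model of the caching problem (not the kernel code):
keying the "have I already read this byte?" cache on the byte offset
alone returns stale data when two LPIs with the same offset live in
different pending tables; keying it on the full pointer does not:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Two vcpus, each with its own pending table; same byte_offset (3). */
        uint8_t table_a[8] = { [3] = 0x01 };
        uint8_t table_b[8] = { [3] = 0x80 };
        uint8_t *tables[2] = { table_a, table_b };
        long last_byte_offset = -1;    /* buggy cache key */
        uint8_t *last_ptr = NULL;      /* fixed cache key */
        uint8_t val = 0;

        for (int vcpu = 0; vcpu < 2; vcpu++) {
            long byte_offset = 3;

            if (byte_offset != last_byte_offset) {
                val = tables[vcpu][byte_offset];
                last_byte_offset = byte_offset;
            }
            /* Second iteration reuses table_a's byte: prints 0x1 twice. */
            printf("byte_offset key: %#x\n", val);
        }

        for (int vcpu = 0; vcpu < 2; vcpu++) {
            uint8_t *ptr = &tables[vcpu][3];

            if (ptr != last_ptr) {
                val = *ptr;
                last_ptr = ptr;
            }
            /* The pointer differs per table: prints 0x1 then 0x80. */
            printf("last_ptr key:    %#x\n", val);
        }
        return 0;
    }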

    Fixes: 280771252c1b ("KVM: arm64: vgic-v3: KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES")
    Cc: stable@vger.kernel.org
    Signed-off-by: Zenghui Yu
    Signed-off-by: Marc Zyngier
    Acked-by: Eric Auger
    Link: https://lore.kernel.org/r/20191029071919.177-4-yuzenghui@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Zenghui Yu
     

15 Nov, 2019

1 commit


14 Nov, 2019

1 commit

  • On a system without KVM_COMPAT, we prevent IOCTLs from being issued
    by a compat task. Although this prevents most silly things from
    happening, it can still confuse a 32bit userspace that is able
    to open the kvm device (the qemu test suite seems to be pretty
    mad with this behaviour).

Take a more radical approach and return -ENODEV to the compat task.

    Reported-by: Peter Maydell
    Signed-off-by: Marc Zyngier
    Signed-off-by: Paolo Bonzini

    Marc Zyngier
     

13 Nov, 2019

1 commit

  • Pull kvm fixes from Paolo Bonzini:
    "Fix unwinding of KVM_CREATE_VM failure, VT-d posted interrupts,
    DAX/ZONE_DEVICE, and module unload/reload"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
    KVM: VMX: Introduce pi_is_pir_empty() helper
    KVM: VMX: Do not change PID.NDST when loading a blocked vCPU
    KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts
    KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON
    KVM: X86: Fix initialization of MSR lists
    KVM: fix placement of refcount initialization
    KVM: Fix NULL-ptr deref after kvm_create_vm fails

    Linus Torvalds
     

12 Nov, 2019

1 commit

  • Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
    instead manually handle ZONE_DEVICE on a case-by-case basis. For things
    like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
    pages, e.g. put pages grabbed via gup(). But for flows such as setting
A/D bits or shifting refcounts for transparent huge pages, KVM needs
to avoid processing ZONE_DEVICE pages, as the flows in question lack the
    underlying machinery for proper handling of ZONE_DEVICE pages.

    This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
    when running a KVM guest backed with /dev/dax memory, as KVM straight up
    doesn't put any references to ZONE_DEVICE pages acquired by gup().

    Note, Dan Williams proposed an alternative solution of doing put_page()
    on ZONE_DEVICE pages immediately after gup() in order to simplify the
    auditing needed to ensure is_zone_device_page() is called if and only if
    the backing device is pinned (via gup()). But that approach would break
    kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
    unmap() when accessing guest memory, unlike KVM's secondary MMU, which
    coordinates with mmu_notifier invalidations to avoid creating stale
    page references, i.e. doesn't rely on pages being pinned.

    [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

    Reported-by: Adam Borowski
    Analyzed-by: David Hildenbrand
    Acked-by: Dan Williams
    Cc: stable@vger.kernel.org
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

11 Nov, 2019

2 commits

  • Reported by syzkaller:

    =============================
    WARNING: suspicious RCU usage
    -----------------------------
    ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    no locks held by repro_11/12688.

    stack backtrace:
    Call Trace:
    dump_stack+0x7d/0xc5
    lockdep_rcu_suspicious+0x123/0x170
    kvm_dev_ioctl+0x9a9/0x1260 [kvm]
    do_vfs_ioctl+0x1a1/0xfb0
    ksys_ioctl+0x6d/0x80
    __x64_sys_ioctl+0x73/0xb0
    do_syscall_64+0x108/0xaa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

Commit a97b0e773e49 ("kvm: call kvm_arch_destroy_vm if vm creation fails")
sets users_count to 1 before kvm_arch_init_vm(); however, if kvm_arch_init_vm()
fails, we need to decrease this count. By moving it earlier, we can push
    the decrease to out_err_no_arch_destroy_vm without introducing yet another
    error label.

    syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=15209b84e00000

    Reported-by: syzbot+75475908cd0910f141ee@syzkaller.appspotmail.com
    Fixes: a97b0e773e49 ("kvm: call kvm_arch_destroy_vm if vm creation fails")
    Cc: Jim Mattson
    Analyzed-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Reported by syzkaller:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 14727 Comm: syz-executor.3 Not tainted 5.4.0-rc4+ #0
    RIP: 0010:kvm_coalesced_mmio_init+0x5d/0x110 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:121
    Call Trace:
    kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3446 [inline]
    kvm_dev_ioctl+0x781/0x1490 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3494
    vfs_ioctl fs/ioctl.c:46 [inline]
    file_ioctl fs/ioctl.c:509 [inline]
    do_vfs_ioctl+0x196/0x1150 fs/ioctl.c:696
    ksys_ioctl+0x62/0x90 fs/ioctl.c:713
    __do_sys_ioctl fs/ioctl.c:720 [inline]
    __se_sys_ioctl fs/ioctl.c:718 [inline]
    __x64_sys_ioctl+0x6e/0xb0 fs/ioctl.c:718
    do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

Commit 9121923c457d ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
moves the memslot and bus allocations around; however, if kvm->srcu/irq_srcu fails
initialization, NULL will be returned instead of an error code. That NULL is not
intercepted in kvm_dev_ioctl_create_vm() and gets dereferenced by
kvm_coalesced_mmio_init(). This patch fixes it.
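A standalone sketch of the general pattern (with simplified stand-ins for
the kernel's ERR_PTR()/IS_ERR() helpers, not the actual KVM code): a
caller that only screens for IS_ERR() will happily use a NULL that was
returned on failure:

    #include <stdio.h>

    #define MAX_ERRNO    4095
    #define ERR_PTR(err) ((void *)(long)(err))
    #define IS_ERR(ptr)  ((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)

    struct kvm { int coalesced; };

    /* The bug shape: returning NULL on failure instead of ERR_PTR(-errno). */
    struct kvm *create_vm_buggy(void) { return NULL; }

    int main(void)
    {
        struct kvm *kvm = create_vm_buggy();

        if (IS_ERR(kvm)) {           /* catches ERR_PTR(-ENOMEM), but not NULL */
            printf("failure detected\n");
            return 1;
        }
        /* With the NULL return we get here, and the next dereference
         * (kvm_coalesced_mmio_init() in the real flow) blows up. */
        printf("proceeding with kvm=%p\n", (void *)kvm);
        return 0;
    }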

    Moving the initialization is required anyway to avoid an incorrect synchronize_srcu that
    was also reported by syzkaller:

    wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
    __synchronize_srcu+0x197/0x250 kernel/rcu/srcutree.c:921
    synchronize_srcu_expedited kernel/rcu/srcutree.c:946 [inline]
    synchronize_srcu+0x239/0x3e8 kernel/rcu/srcutree.c:997
    kvm_page_track_unregister_notifier+0xe7/0x130 arch/x86/kvm/page_track.c:212
    kvm_mmu_uninit_vm+0x1e/0x30 arch/x86/kvm/mmu.c:5828
    kvm_arch_destroy_vm+0x4a2/0x5f0 arch/x86/kvm/x86.c:9579
    kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:702 [inline]

    so do it.

    Reported-by: syzbot+89a8060879fa0bd2db4f@syzkaller.appspotmail.com
    Reported-by: syzbot+e27e7027eb2b80e44225@syzkaller.appspotmail.com
    Fixes: 9121923c457d ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
    Cc: Jim Mattson
    Cc: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

05 Nov, 2019

1 commit

  • The page table pages corresponding to broken down large pages are zapped in
    FIFO order, so that the large page can potentially be recovered, if it is
no longer being used for execution. This removes the performance penalty
    for walking deeper EPT page tables.

    By default, one large page will last about one hour once the guest
    reaches a steady state.

    Signed-off-by: Junaid Shahid
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Thomas Gleixner

    Junaid Shahid
     

04 Nov, 2019

1 commit


31 Oct, 2019

1 commit

  • In kvm_create_vm(), if we've successfully called kvm_arch_init_vm(), but
    then fail later in the function, we need to call kvm_arch_destroy_vm()
    so that it can do any necessary cleanup (like freeing memory).

    Fixes: 44a95dae1d229a ("KVM: x86: Detect and Initialize AVIC support")

    Signed-off-by: John Sperbeck
    Signed-off-by: Jim Mattson
    Reviewed-by: Junaid Shahid
    [Remove dependency on "kvm: Don't clear reference count on
    kvm_create_vm() error path" which was not committed. - Paolo]
    Signed-off-by: Paolo Bonzini

    Jim Mattson
     

25 Oct, 2019

1 commit


22 Oct, 2019

2 commits


20 Oct, 2019

3 commits

  • The PMU emulation code uses the perf event sample period to trigger
    the overflow detection. This works fine for the *first* overflow
    handling, but results in a huge number of interrupts on the host,
    unrelated to the number of interrupts handled in the guest (a x20
    factor is pretty common for the cycle counter). On a slow system
    (such as a SW model), this can result in the guest only making
    forward progress at a glacial pace.

    It turns out that the clue is in the name. The sample period is
exactly that: a period. And once an overflow has occurred,
the following period should be the full width of the associated
counter, instead of whatever the guest had initially programmed.

    Reset the sample period to the architected value in the overflow
    handler, which now results in a number of host interrupts that is
    much closer to the number of interrupts in the guest.

    Fixes: b02386eb7dac ("arm64: KVM: Add PMU overflow interrupt routing")
    Reviewed-by: Andrew Murray
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • The current convention for KVM to request a chained event from the
    host PMU is to set bit[0] in attr.config1 (PERF_ATTR_CFG1_KVM_PMU_CHAINED).

    But as it turns out, this bit gets set *after* we create the kernel
    event that backs our virtual counter, meaning that we never get
    a 64bit counter.

    Moving the setting to an earlier point solves the problem.

    Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
    Reviewed-by: Andrew Murray
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
When a counter is disabled, its value is sampled before the event
is disabled, and the value is written back into the shadow register.

    In that process, the value gets truncated to 32bit, which is adequate
    for any counter but the cycle counter (defined as a 64bit counter).

    This obviously results in a corrupted counter, and things like
    "perf record -e cycles" not working at all when run in a guest...
    A similar, but less critical bug exists in kvm_pmu_get_counter_value.

    Make the truncation conditional on the counter not being the cycle
    counter, which results in a minor code reorganisation.
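A standalone sketch of the resulting shape (not the kernel patch),
assuming the usual cycle counter index of 31:

    #include <stdint.h>

    #define ARMV8_PMU_CYCLE_IDX 31

    /* Value written back to the shadow register when counter 'idx' is
     * disabled: only event counters are truncated to 32 bits. */
    uint64_t shadow_value(uint64_t sampled, unsigned int idx)
    {
        if (idx != ARMV8_PMU_CYCLE_IDX)
            sampled &= 0xffffffffULL;    /* lower_32_bits() in kernel terms */
        return sampled;
    }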

    Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
    Reviewed-by: Andrew Murray
    Reported-by: Julien Thierry
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     

03 Oct, 2019

1 commit