23 Sep, 2022

1 commit

  • commit 683412ccf61294d727ead4a73d97397396e69a6b upstream.

    Flush the CPU caches when memory is reclaimed from an SEV guest (where
    reclaim also includes it being unmapped from KVM's memslots). Due to lack
    of coherency for SEV encrypted memory, failure to flush results in silent
    data corruption if userspace is malicious/broken and doesn't ensure SEV
    guest memory is properly pinned and unpinned.

    Cache coherency is not enforced across the VM boundary in SEV (AMD APM
    vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
    VM guests, have to be explicitly flushed on the host side. If a memory page
    containing dirty confidential cachelines was released by the VM and
    reallocated to another user, the cachelines may corrupt the new user at a
    later time.

    KVM takes a shortcut by assuming all confidential memory remains pinned
    until the end of the VM lifetime. Therefore, KVM does not flush caches at
    mmu_notifier invalidation events. Because of this incorrect assumption and
    the lack of cache flushing, malicious userspace can crash the host kernel
    by creating a malicious VM that continuously allocates and releases
    unpinned confidential memory pages while the VM is running.

    Add cache flush operations to the mmu_notifier operations to ensure that
    any physical memory leaving the guest VM gets flushed. In particular, hook
    the mmu_notifier_invalidate_range_start and mmu_notifier_release events and
    flush caches accordingly. The hooks run after releasing the mmu lock to
    avoid contention with other vCPUs.
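
    On the SVM side the new arch hook boils down to a full cache writeback
    for SEV guests. A simplified sketch (the exact code in this stable
    branch may differ slightly from upstream):

        void sev_guest_memory_reclaimed(struct kvm *kvm)
        {
                /* Non-SEV guests are cache coherent; nothing to do. */
                if (!sev_guest(kvm))
                        return;

                /* Write back and invalidate caches on all CPUs. */
                wbinvd_on_all_cpus();
        }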

    Cc: stable@vger.kernel.org
    Suggested-by: Sean Christopherson
    Reported-by: Mingwei Zhang
    Signed-off-by: Mingwei Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini
    [OP: adjusted KVM_X86_OP_OPTIONAL() -> KVM_X86_OP_NULL, applied
    kvm_arch_guest_memory_reclaimed() call in kvm_set_memslot()]
    Signed-off-by: Ovidiu Panait
    Signed-off-by: Greg Kroah-Hartman

    Mingwei Zhang
     

25 Aug, 2022

1 commit

  • commit 405294f29faee5de8c10cb9d4a90e229c2835279 upstream.

    Unconditionally get a reference to the /dev/kvm module when creating a VM
    instead of using try_get_module(), which will fail if the module is in
    the process of being forcefully unloaded. The error handling when
    try_get_module() fails doesn't properly unwind all that has been done,
    e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
    from the global list. Not removing VMs from the global list tends to be
    fatal, e.g. leads to use-after-free explosions.
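
    A diff-style sketch of the change in kvm_create_vm() (simplified; the
    error label is illustrative):

        -       if (!try_module_get(kvm_chardev_ops.owner)) {
        -               r = -ENODEV;
        -               goto out_err;
        -       }
        +       /* Cannot fail; the module is alive because /dev/kvm is open. */
        +       __module_get(kvm_chardev_ops.owner);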

    The obvious alternative would be to add proper unwinding, but the
    justification for using try_get_module(), "rmmod --wait", is completely
    bogus as support for "rmmod --wait", i.e. delete_module() without
    O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
    --wait option.") nearly a decade ago.

    It's still possible for try_get_module() to fail due to the module dying
    (more like being killed), as the module will be tagged MODULE_STATE_GOING
    by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
    with forced unloading is an exercise in futility and gives a false sense
    of security. Using try_get_module() only prevents acquiring _new_
    references, it doesn't magically put the references held by other VMs,
    and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
    guaranteed to cause spectacular fireworks; the window where KVM will fail
    try_get_module() is tiny compared to the window where KVM is building and
    running the VM with an elevated module refcount.

    Addressing KVM's inability to play nice with "rmmod --force" is firmly
    out-of-scope. Forcefully unloading any module taints the kernel (for obvious
    reasons) _and_ requires the kernel to be built with
    CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
    amusing disclaimer that it's "mainly for kernel developers and desperate
    users". In other words, KVM is free to scoff at bug reports due to using
    "rmmod --force" while VMs may be running.

    Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
    Cc: stable@vger.kernel.org
    Cc: David Matlack
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

17 Aug, 2022

1 commit

  • [ Upstream commit a1040b0d42acf69bb4f6dbdc54c2dcd78eea1de5 ]

    Don't set Accessed/Dirty bits for a struct page with PG_reserved set,
    i.e. don't set A/D bits for the ZERO_PAGE. The ZERO_PAGE (or pages
    depending on the architecture) should obviously never be written, and
    similarly there's no point in marking it accessed as the page will never
    be swapped out or reclaimed. The comment in page-flags.h is quite clear
    that PG_reserved pages should be managed only by their owner, and
    strictly following that mandate also simplifies KVM's logic.
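
    The heart of the change is a helper that gates the A/D updates; a
    simplified sketch:

        static bool kvm_is_ad_tracked_page(struct page *page)
        {
                /*
                 * Per page-flags.h, PG_reserved pages "should in general not
                 * be touched (e.g. set dirty) except by its owner".
                 */
                return !PageReserved(page);
        }

        static void kvm_set_page_dirty(struct page *page)
        {
                if (kvm_is_ad_tracked_page(page))
                        SetPageDirty(page);
        }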

    Fixes: 7df003c85218 ("KVM: fix overflow of zero page refcount with ksm running")
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     

29 Jul, 2022

1 commit

  • commit e8bc2427018826e02add7b0ed0fc625a60390ae5 upstream.

    A KVM device cleanup happens in either of two callbacks:
    1) destroy() which is called when the VM is being destroyed;
    2) release() which is called when a device fd is closed.

    Most KVM devices use 1) but Book3s's interrupt controller KVM devices
    (XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during
    the machine execution. The error handling in kvm_ioctl_create_device()
    assumes destroy() is always defined which leads to NULL dereference as
    discovered by Syzkaller.

    This adds a check for destroy != NULL and adds a missing release().

    This is not changing kvm_destroy_devices() as devices with a defined
    release() should have been removed from the KVM devices list by then.

    Suggested-by: Paolo Bonzini
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
     

12 Jul, 2022

3 commits

  • [ Upstream commit 5c697c367a66307a5d943c3449421aff2aa3ca4a ]

    Initialize debugfs_entry to its semi-magical -ENOENT value when the VM
    is created. KVM's teardown when VM creation fails is kludgy and calls
    kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never
    attempted kvm_create_vm_debugfs(). Because debugfs_entry is zero
    initialized, the IS_ERR() checks pass and KVM derefs a NULL pointer.

    BUG: kernel NULL pointer dereference, address: 0000000000000018
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 1068b1067 P4D 1068b1067 PUD 1068b0067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 871 Comm: repro Not tainted 5.18.0-rc1+ #825
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:__dentry_path+0x7b/0x130
    Call Trace:

    dentry_path_raw+0x42/0x70
    kvm_uevent_notify_change.part.0+0x10c/0x200 [kvm]
    kvm_put_kvm+0x63/0x2b0 [kvm]
    kvm_dev_ioctl+0x43a/0x920 [kvm]
    __x64_sys_ioctl+0x83/0xb0
    do_syscall_64+0x31/0x50
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Modules linked in: kvm_intel kvm irqbypass
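
    The fix is a one-liner early in kvm_create_vm(), before any failure path
    can run (sketch; surrounding code omitted):

        /* ERR_PTR(-ENOENT) makes the IS_ERR() teardown checks fail safely. */
        kvm->debugfs_dentry = ERR_PTR(-ENOENT);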

    Fixes: a44a4cc1c969 ("KVM: Don't create VM debugfs files outside of the VM directory")
    Cc: stable@vger.kernel.org
    Cc: Marc Zyngier
    Cc: Oliver Upton
    Reported-by: syzbot+df6fbbd2ee39f21289ef@syzkaller.appspotmail.com
    Signed-off-by: Sean Christopherson
    Reviewed-by: Oliver Upton
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit a44a4cc1c969afec97dbb2aedaf6f38eaa6253bb ]

    Unfortunately, there is no guarantee that KVM was able to instantiate a
    debugfs directory for a particular VM. To that end, KVM shouldn't even
    attempt to create new debugfs files in this case. If the specified
    parent dentry is NULL, debugfs_create_file() will instantiate files at
    the root of debugfs.

    For arm64, it is possible to create the vgic-state file outside of a
    VM directory; the file is then not cleaned up when the VM is destroyed.
    Nonetheless, the corresponding struct kvm is freed when the VM is
    destroyed.

    Nip the problem in the bud for all possible errant debugfs file
    creations by initializing kvm->debugfs_dentry to -ENOENT. In so doing,
    debugfs_create_file() will fail instead of creating the file in the root
    directory.

    Cc: stable@kernel.org
    Fixes: 929f45e32499 ("kvm: no need to check return value of debugfs_create functions")
    Signed-off-by: Oliver Upton
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20220406235615.1447180-2-oupton@google.com
    Signed-off-by: Sasha Levin

    Oliver Upton
     
  • [ Upstream commit 37b2a6510a48ca361ced679f92682b7b7d7d0330 ]

    Allocations whose size is related to the memslot size can be arbitrarily
    large. Do not use kvzalloc/kvcalloc, as those are limited to "not crazy"
    sizes that fit in 32 bits.
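
    For example, the dirty bitmap allocation becomes (diff-style sketch;
    call sites in this stable branch may vary):

        -       memslot->dirty_bitmap = kvzalloc(dirty_bytes, GFP_KERNEL_ACCOUNT);
        +       memslot->dirty_bitmap = __vcalloc(2, dirty_bytes / 2,
        +                                         GFP_KERNEL_ACCOUNT);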

    Cc: stable@vger.kernel.org
    Fixes: 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
    Reviewed-by: David Hildenbrand
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Paolo Bonzini
     

14 Apr, 2022

1 commit

  • commit 5593473a1e6c743764b08e3b6071cb43b5cfa6c4 upstream.

    kvm_vcpu_release() will call kvm_dirty_ring_free(), freeing
    ring->dirty_gfns and setting it to NULL. Afterwards, it calls
    kvm_arch_vcpu_destroy().

    However, if closing the file descriptor races with KVM_RUN in such a way
    that vcpu->arch.st.preempted == 0, the following call stack leads to a
    NULL pointer dereference in kvm_dirty_ring_push():

    mark_page_dirty_in_slot+0x192/0x270 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3171
    kvm_steal_time_set_preempted arch/x86/kvm/x86.c:4600 [inline]
    kvm_arch_vcpu_put+0x34e/0x5b0 arch/x86/kvm/x86.c:4618
    vcpu_put+0x1b/0x70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:211
    vmx_free_vcpu+0xcb/0x130 arch/x86/kvm/vmx/vmx.c:6985
    kvm_arch_vcpu_destroy+0x76/0x290 arch/x86/kvm/x86.c:11219
    kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]

    The fix is to release the dirty page ring after kvm_arch_vcpu_destroy
    has run.
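
    A simplified sketch of the reordered teardown in kvm_vcpu_destroy():

        static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
        {
                kvm_arch_vcpu_destroy(vcpu);

                /* Only free the ring once the arch code can no longer use it. */
                kvm_dirty_ring_free(&vcpu->dirty_ring);
                free_page((unsigned long)vcpu->run);
                kmem_cache_free(kvm_vcpu_cache, vcpu);
        }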

    Reported-by: Qiuhao Li
    Reported-by: Gaoning Pan
    Reported-by: Yongkang Jia
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     

08 Apr, 2022

1 commit

  • commit 5f6de5cbebee925a612856fce6f9182bb3eee0db upstream.

    Tie the lifetime of the KVM module to the lifetime of each VM via
    kvm.users_count. This way anything that grabs a reference to the VM via
    kvm_get_kvm() cannot accidentally outlive the KVM module.

    Prior to this commit, the lifetime of the KVM module was tied to the
    lifetime of /dev/kvm file descriptors, VM file descriptors, and vCPU
    file descriptors by their respective file_operations "owner" field.
    This approach is insufficient because references grabbed via
    kvm_get_kvm() do not prevent closing any of the aforementioned file
    descriptors.

    This fixes a long standing theoretical bug in KVM that at least affects
    async page faults. kvm_setup_async_pf() grabs a reference via
    kvm_get_kvm(), and drops it in an asynchronous work callback. Nothing
    prevents the VM file descriptor from being closed and the KVM module
    from being unloaded before this callback runs.
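
    A simplified sketch of the idea (error labels are illustrative):

        /* kvm_create_vm(): pin the module for as long as the VM exists. */
        if (!try_module_get(kvm_chardev_ops.owner)) {
                r = -ENODEV;
                goto out_err;
        }

        /* kvm_destroy_vm(): runs when kvm.users_count drops to zero. */
        module_put(kvm_chardev_ops.owner);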

    Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
    Fixes: 3d3aab1b973b ("KVM: set owner of cpu and vm file operations")
    Cc: stable@vger.kernel.org
    Suggested-by: Ben Gardon
    [ Based on a patch from Ben implemented for Google's kernel. ]
    Signed-off-by: David Matlack
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    David Matlack
     

16 Mar, 2022

1 commit

  • [ Upstream commit 4cb9a998b1ce25fad74a82f5a5c45a4ef40de337 ]

    I saw the below splat after the host suspended and resumed.

    WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
    CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G W IOE 5.17.0-rc3+ #4
    RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
    Call Trace:

    syscore_resume+0x90/0x340
    suspend_devices_and_enter+0xaee/0xe90
    pm_suspend.cold+0x36b/0x3c2
    state_store+0x82/0xf0
    kernfs_fop_write_iter+0x1b6/0x260
    new_sync_write+0x258/0x370
    vfs_write+0x33f/0x510
    ksys_write+0xc9/0x160
    do_syscall_64+0x3b/0xc0
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    lockdep_is_held() can return -1 when lockdep is disabled which triggers
    this warning. Let's use lockdep_assert_not_held() which can detect
    incorrect calls while holding a lock and it also avoids false negatives
    when lockdep is disabled.
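
    The change itself is tiny; a diff-style sketch of kvm_resume():

        -       WARN_ON(lockdep_is_held(&kvm_count_lock));
        +       lockdep_assert_not_held(&kvm_count_lock);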

    Signed-off-by: Wanpeng Li
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Wanpeng Li
     

09 Mar, 2022

1 commit

  • [ Upstream commit 6f390916c4fb359507d9ac4bf1b28a4f8abee5c0 ]

    Wrap s390's halt_poll_max_steal with READ_ONCE and snapshot the result of
    kvm_arch_no_poll() in kvm_vcpu_block() to avoid a mostly-theoretical,
    largely benign bug on s390 where the result of kvm_arch_no_poll() could
    change due to userspace modifying halt_poll_max_steal while the vCPU is
    blocking. The bug is largely benign as it will either cause KVM to skip
    updating halt-polling times (no_poll toggles false=>true) or to update
    halt-polling times with a slightly flawed block_ns.

    Note, READ_ONCE is unnecessary in the current code, add it in case the
    arch hook is ever inlined, and to provide a hint that userspace can
    change the param at will.
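
    A sketch of the s390 hook with the new READ_ONCE (simplified; the s390
    internals shown here follow the code at the time and may differ):

        bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
        {
                /* Don't poll above halt_poll_max_steal percent of steal time. */
                if (S390_lowcore.avg_steal_timer * 100 / (TICK_USEC << 12) >=
                    READ_ONCE(halt_poll_max_steal)) {
                        vcpu->stat.halt_no_poll_steal++;
                        return true;
                }
                return false;
        }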

    Fixes: 8b905d28ee17 ("KVM: s390: provide kvm_arch_no_poll function")
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     

16 Feb, 2022

1 commit

  • [ Upstream commit 6a0c61703e3a5d67845a4b275e1d9d7bc1b5aad7 ]

    Fix the following false positive warning:
    =============================
    WARNING: suspicious RCU usage
    5.16.0-rc4+ #57 Not tainted
    -----------------------------
    arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    3 locks held by fc_vcpu 0/330:
    #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm]
    #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm]
    #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm]

    stack backtrace:
    CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    Call Trace:

    dump_stack_lvl+0x44/0x57
    kvm_notify_acked_gsi+0x6b/0x70 [kvm]
    kvm_notify_acked_irq+0x8d/0x180 [kvm]
    kvm_ioapic_update_eoi+0x92/0x240 [kvm]
    kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm]
    handle_apic_eoi_induced+0x3d/0x60 [kvm_intel]
    vmx_handle_exit+0x19c/0x6a0 [kvm_intel]
    vcpu_enter_guest+0x66e/0x1860 [kvm]
    kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm]
    kvm_vcpu_ioctl+0x38a/0x6f0 [kvm]
    __x64_sys_ioctl+0x89/0xc0
    do_syscall_64+0x3a/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
    kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
    kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
    a false positive warning. So use hlist_for_each_entry_srcu() instead of
    hlist_for_each_entry_rcu().
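
    The fix, as a diff-style sketch of the loop in kvm_notify_acked_gsi():

        -       hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list, link)
        +       hlist_for_each_entry_srcu(kian, &kvm->irq_ack_notifier_list, link,
        +                                 srcu_read_lock_held(&kvm->irq_srcu))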

    Reviewed-by: Sean Christopherson
    Signed-off-by: Hou Wenlong
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Hou Wenlong
     

02 Feb, 2022

1 commit

  • commit 31c25585695abdf03d6160aa6d829e855b256329 upstream.

    Revert a completely broken check on an "invalid" RIP in SVM's workaround
    for the DecodeAssists SMAP errata. kvm_vcpu_gfn_to_memslot() obviously
    expects a gfn, i.e. operates in the guest physical address space, whereas
    RIP is a virtual (not even linear) address. The "fix" worked for the
    problematic KVM selftest because the test identity mapped RIP.

    Fully revert the hack instead of trying to translate RIP to a GPA, as the
    non-SEV case is now handled earlier, and KVM cannot access guest page
    tables to translate RIP.

    This reverts commit e72436bc3a5206f95bb384e741154166ddb3202e.

    Fixes: e72436bc3a52 ("KVM: SVM: avoid infinite loop on NPF from bad address")
    Reported-by: Liam Merwick
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Reviewed-by: Liam Merwick
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

22 Dec, 2021

1 commit

  • [ Upstream commit 5f25e71e311478f9bb0a8ef49e7d8b95316491d7 ]

    This is not an unrecoverable situation. Users of kvm_read_guest_offset_cached
    and kvm_write_guest_offset_cached must expect the read/write to fail, and
    therefore it is possible to just return early with an error value.
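
    Diff-style sketch of the change (the same pattern applies in both the
    read and write paths):

        -       BUG_ON(len + offset > ghc->len);
        +       if (WARN_ON_ONCE(len + offset > ghc->len))
        +               return -EINVAL;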

    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Paolo Bonzini
     

08 Dec, 2021

2 commits

  • commit bda44d844758c70c8dc1478e6fc9c25efa90c5a7 upstream.

    When modifying memslots, snapshot the "old" memslot and copy it to the
    "new" memslot's arch data after (re)acquiring slots_arch_lock. x86 can
    change a memslot's arch data while memslot updates are in-progress so
    long as it holds slots_arch_lock, thus snapshotting a memslot without
    holding the lock can result in the consumption of stale data.

    Fixes: b10a038e84d1 ("KVM: mmu: Add slots_arch_lock for memslot arch fields")
    Cc: stable@vger.kernel.org
    Cc: Ben Gardon
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 6b285a5587506bae084cf9a3ed5aa491d623b91b upstream.

    Reject userspace memslots whose size exceeds the storage capacity of an
    "unsigned long". KVM's uAPI takes the size as u64 to support large slots
    on 64-bit hosts, but does not account for the size being truncated on
    32-bit hosts in various flows. The access_ok() check on the userspace
    virtual address in particular casts the size to "unsigned long" and will
    check the wrong number of bytes.

    KVM doesn't actually support slots whose size doesn't fit in an "unsigned
    long", e.g. KVM's internal kvm_memory_slot.npages is an "unsigned long",
    not a "u64", and misc arch specific code follows that behavior.

    Fixes: fa3d315a4ce2 ("KVM: Validate userspace_addr of memslot when registered")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Reviewed-by: Maciej S. Szmigiero
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     


22 Sep, 2021

3 commits

  • Check for a NULL cpumask_var_t when kicking multiple vCPUs via
    cpumask_available(), which performs a !NULL check if and only if cpumasks
    are configured to be allocated off-stack. This is a meaningless
    optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more
    importantly helps document that the NULL check is necessary even though
    all callers pass in a local variable.

    No functional change intended.
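
    A sketch of the pattern, simplified from kvm_make_vcpus_request_mask():

        /* cpumask_available() compiles to "true" for on-stack cpumasks. */
        if (cpu != -1 && cpu != me && cpumask_available(tmp) &&
            kvm_request_needs_ipi(vcpu, req))
                __cpumask_set_cpu(cpu, tmp);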

    Cc: Lai Jiangshan
    Signed-off-by: Sean Christopherson
    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
    is read exactly once, and by ensuring the vCPU is booted from guest mode
    if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in
    kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
    kvm_request_needs_ipi() returns true.

    Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
    kvm_request_needs_ipi()) means the target vCPU could get migrated (change
    vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpu and
    reading vcpu->mode. If that happens, the kick/IPI will be sent to the
    old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.

    Although failing to kick the vCPU is not exactly ideal, practically
    speaking it cannot cause a functional issue unless there is also a bug in
    the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
    behavior.

    The purpose of sending an IPI is purely to get a vCPU into the host (or
    out of reading SPTEs) so that the vCPU can recognize a change in state,
    e.g. a KVM_REQ_* request. If vCPU's handling of the state change is
    required for correctness, KVM must ensure either the vCPU sees the change
    before entering the guest, or that the sender sees the vCPU as running in
    guest mode. All architectures handle this by (a) sending the request
    before calling kvm_vcpu_kick() and (b) checking for requests _after_
    setting vcpu->mode.

    x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
    ensure it kicks and waits for vCPUs that started reading SPTEs _before_
    MMU changes were finalized, but any vCPU that starts reading after MMU
    changes were finalized will see the new state and can continue on
    uninterrupted.

    For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
    x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
    upon for functional correctness, e.g. in the dirty log case, userspace
    cannot assume it has a 100% complete log if vCPUs are still running.

    All that said, eliminate the benign race since the cost of doing so is an
    "extra" atomic cmpxchg() in the case where the target vCPU is loaded by
    the current pCPU or is not loaded at all. I.e. the kick will be skipped
    due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
    opposed to the kick being skipped because of the cpu checks.

    Keep the "cpu != me" checks even though they appear useless/impossible at
    first glance. x86 processes guest IPI writes in a fast path that runs in
    IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And
    calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST_MODE
    or READING_SHADOW_PAGE_TABLES is perfectly reasonable.

    Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
    persists, e.g. the vCPU could exit guest mode and get offlined between
    the cpu_online() check and the sending of smp_send_reschedule(). But,
    the online check appears to exist only to avoid a WARN in x86's
    native_smp_send_reschedule() that fires if the target CPU is not online.
    The reschedule WARN exists because CPU offlining takes the CPU out of the
    scheduling pool, i.e. the WARN is intended to detect the case where the
    kernel attempts to schedule a task on an offline CPU. The actual sending
    of the IPI is a non-issue as at worst it will simply be dropped on the
    floor. In other words, KVM's usurping of the reschedule IPI could
    theoretically trigger a WARN if the stars align, but there will be no
    loss of functionality.

    [*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a
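
    A simplified sketch of the resulting kvm_vcpu_kick() flow:

        me = get_cpu();

        /* Ordering: check/transition vcpu->mode before sampling vcpu->cpu. */
        if (kvm_arch_vcpu_should_kick(vcpu)) {
                cpu = READ_ONCE(vcpu->cpu);
                if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
                        smp_send_reschedule(cpu);
        }
        put_cpu();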

    Cc: Venkatesh Srinivas
    Cc: Vitaly Kuznetsov
    Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • grow_halt_poll_ns() ignores values between 0 and halt_poll_ns_grow_start
    (10000 by default). However, when we shrink halt_poll_ns we may fall way
    below halt_poll_ns_grow_start and end up with halt_poll_ns values that
    don't make a lot of sense: like 1 or 9, or 19.

    VCPU1 trace (halt_poll_ns_shrink equals 2):

    VCPU1 grow 10000
    VCPU1 shrink 5000
    VCPU1 shrink 2500
    VCPU1 shrink 1250
    VCPU1 shrink 625
    VCPU1 shrink 312
    VCPU1 shrink 156
    VCPU1 shrink 78
    VCPU1 shrink 39
    VCPU1 shrink 19
    VCPU1 shrink 9
    VCPU1 shrink 4

    Mirror what grow_halt_poll_ns() does and set halt_poll_ns to 0 as soon
    as the shrunken halt_poll_ns value falls below halt_poll_ns_grow_start.
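
    A sketch of the resulting shrink_halt_poll_ns() (simplified):

        static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
                unsigned int old, val, shrink, grow_start;

                old = val = vcpu->halt_poll_ns;
                shrink = READ_ONCE(halt_poll_ns_shrink);
                grow_start = READ_ONCE(halt_poll_ns_grow_start);
                if (shrink == 0)
                        val = 0;
                else
                        val /= shrink;

                /* Mirror grow_halt_poll_ns(): snap to 0 below grow_start. */
                if (val < grow_start)
                        val = 0;

                vcpu->halt_poll_ns = val;
                trace_kvm_halt_poll_ns_shrink(vcpu->vcpu_id, val, old);
        }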

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Paolo Bonzini
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sergey Senozhatsky
     

06 Sep, 2021

4 commits

  • Drop the unused function, as reported by the kernel test robot.

    Reported-by: kernel test robot
    Signed-off-by: Peter Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Peter Xu
     
  • KVM/arm64 updates for 5.15

    - Page ownership tracking between host EL1 and EL2

    - Rely on userspace page tables to create large stage-2 mappings

    - Fix incompatibility between pKVM and kmemleak

    - Fix the PMU reset state, and improve the performance of the virtual PMU

    - Move over to the generic KVM entry code

    - Address PSCI reset issues w.r.t. save/restore

    - Preliminary rework for the upcoming pKVM fixed feature

    - A bunch of MM cleanups

    - A vGIC fix for timer spurious interrupts

    - Various cleanups

    Paolo Bonzini
     
  • Add a new stat that counts the number of times a remote TLB flush is
    requested, regardless of whether it kicks vCPUs out of guest mode. This
    allows us to look at how often flushes are initiated.

    Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based
    TLB flush implementation, so apply it there too.

    Original-by: David Matlack
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • Don't export KVM's MMU notifier count helpers, under no circumstance
    should any downstream module, including x86's vendor code, have a
    legitimate reason to piggyback KVM's MMU notifier logic. E.g. in the x86
    case, only KVM's MMU should be elevating the notifier count, and that
    code is always built into the core kvm.ko module.

    Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range")
    Cc: Maxim Levitsky
    Signed-off-by: Sean Christopherson
    Reviewed-by: Maxim Levitsky
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

21 Aug, 2021

3 commits

  • Add three log histogram stats to record the distribution of time spent
    on successful polling, failed polling and VCPU wait.
    halt_poll_success_hist: Distribution of spent time for a successful poll.
    halt_poll_fail_hist: Distribution of spent time for a failed poll.
    halt_wait_hist: Distribution of time a VCPU has spent on waiting.

    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • Add simple stats halt_wait_ns to record the time a VCPU has spent on
    waiting for all architectures (not just powerpc).

    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • This, together with the previous patch, ensures that kvm_zap_gfn_range()
    doesn't race with a page fault running on another vCPU, and will make the
    page fault code retry instead.

    This is based on a patch suggested by Sean Christopherson:
    https://lkml.org/lkml/2021/7/22/1025

    Suggested-by: Sean Christopherson
    Signed-off-by: Maxim Levitsky
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Maxim Levitsky
     

13 Aug, 2021

2 commits

  • Allow archs to create arch-specific nodes under the kvm->debugfs_dentry
    directory besides the stats fields. The new interface
    kvm_arch_create_vm_debugfs() is defined but not yet used. It's called
    after kvm->debugfs_dentry is created, so it can be referenced directly in
    kvm_arch_create_vm_debugfs(). Archs should define their own versions when
    they want to create extra debugfs nodes.
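
    The default implementation is a weak no-op that archs can override
    (sketch):

        int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
        {
                return 0;
        }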

    Signed-off-by: Peter Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Peter Xu
     
  • These stores are copied and pasted from the "if" statements above.
    They are dead and while they are not really a bug, they can be
    confusing to anyone reading the code as well. Remove them.

    Reported-by: kernel test robot
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     


06 Aug, 2021

2 commits

  • The memslot for a given gfn is looked up multiple times during page
    fault handling. Avoid binary searching for it multiple times by caching
    the most recently used slot. There is an existing VM-wide last_used_slot
    but that does not work well for cases where vCPUs are accessing memory
    in different slots (see performance data below).

    Another benefit of caching the most recently used slot (versus looking
    up the slot once and passing around a pointer) is speeding up memslot
    lookups *across* faults and during spte prefetching.

    To measure the performance of this change I ran dirty_log_perf_test with
    64 vCPUs and 64 memslots and measured "Populate memory time" and
    "Iteration 2 dirty memory time". Tests were ran with eptad=N to force
    dirty logging to use fast_page_fault so its performance could be
    measured.

    Config    | Metric                        | Before | After
    --------- | ----------------------------- | ------ | ------
    tdp_mmu=Y | Populate memory time          | 6.76s  | 5.47s
    tdp_mmu=Y | Iteration 2 dirty memory time | 2.83s  | 0.31s
    tdp_mmu=N | Populate memory time          | 20.4s  | 18.7s
    tdp_mmu=N | Iteration 2 dirty memory time | 2.65s  | 0.30s

    The "Iteration 2 dirty memory time" results are especially compelling
    because they are equivalent to running the same test with a single
    memslot. In other words, fast_page_fault performance no longer scales
    with the number of memslots.
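
    A simplified sketch of the lookup with the per-vCPU cache (helper names
    follow the upstream patch):

        struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu,
                                                        gfn_t gfn)
        {
                struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
                struct kvm_memory_slot *slot;
                int slot_index = vcpu->last_used_slot;

                /* Fast path: recheck the slot this vCPU used last time. */
                slot = try_get_memslot(slots, slot_index, gfn);
                if (slot)
                        return slot;

                /* Slow path: binary search, then refresh the cache. */
                slot = search_memslots(slots, gfn, &slot_index);
                if (slot) {
                        vcpu->last_used_slot = slot_index;
                        return slot;
                }
                return NULL;
        }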

    Signed-off-by: David Matlack
    Message-Id:
    Signed-off-by: Paolo Bonzini

    David Matlack
     
  • lru_slot is used to keep track of the index of the most-recently used
    memslot. The correct acronym would be "mru" but that is not a common
    acronym. So call it last_used_slot which is a bit more obvious.

    Suggested-by: Paolo Bonzini
    Signed-off-by: David Matlack
    Message-Id:
    Signed-off-by: Paolo Bonzini

    David Matlack
     

04 Aug, 2021

1 commit

  • KVM creates a debugfs directory for each VM in order to store statistics
    about the virtual machine. The directory name is built from the process
    pid and a VM fd. While generally unique, it is possible to keep a
    file descriptor alive in a way that causes duplicate directories, which
    manifests as these messages:

    [ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!

    Even though this should not happen in practice, it is more or less
    expected in the case of KVM for testcases that call KVM_CREATE_VM and
    close the resulting file descriptor repeatedly and in parallel.

    When this happens, debugfs_create_dir() returns an error but
    kvm_create_vm_debugfs() goes on to allocate stat data structs which are
    later leaked. The slow memory leak was spotted by syzkaller, where it
    caused OOM reports.

    Since the issue only affects debugfs, do a lookup before calling
    debugfs_create_dir, so that the message is downgraded and rate-limited.
    While at it, ensure kvm->debugfs_dentry is NULL rather than an error
    if it is not created. This fixes kvm_destroy_vm_debugfs, which was not
    checking IS_ERR_OR_NULL correctly.
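
    A sketch of the resulting logic in kvm_create_vm_debugfs():

        dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
        if (dent) {
                pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n",
                                    dir_name);
                dput(dent);
                return 0;
        }
        dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
        if (IS_ERR(dent))
                return 0;

        kvm->debugfs_dentry = dent;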

    Cc: stable@vger.kernel.org
    Fixes: 536a6f88c49d ("KVM: Create debugfs dir and stat files for each VM")
    Reported-by: Alexey Kardashevskiy
    Suggested-by: Greg Kroah-Hartman
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

03 Aug, 2021

2 commits

  • Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications
    that are unrelated to KVM. This is possible now that memslot updates are
    blocked from range_start() to range_end(); that ensures that lock elision
    happens in both or none, and therefore that mmu_notifier_count updates
    (which must occur while holding mmu_lock for write) are always paired
    across start->end.

    Based on patches originally written by Ben Gardon.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
    notifications that are unrelated to KVM. Because mmu_notifier_count
    must be modified while holding mmu_lock for write, and must always
    be paired across start->end to stay balanced, lock elision must
    happen in both or none. Therefore, in preparation for this change,
    this patch prevents memslot updates across range_start() and range_end().

    Note, technically flag-only memslot updates could be allowed in parallel,
    but stalling a memslot update for a relatively short amount of time is
    not a scalability issue, and this is all more than complex enough.

    A long note on the locking: a previous version of the patch used an rwsem
    to block the memslot update while the MMU notifier ran, but this resulted
    in the following deadlock involving the pseudo-lock tagged as
    "mmu_notifier_invalidate_range_start".

    ======================================================
    WARNING: possible circular locking dependency detected
    5.12.0-rc3+ #6 Tainted: G OE
    ------------------------------------------------------
    qemu-system-x86/3069 is trying to acquire lock:
    ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190

    but task is already holding lock:
    ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

    which lock already depends on the new lock.

    This corresponds to the following MMU notifier logic:

    invalidate_range_start
        take pseudo lock
        down_read()           (*)
        release pseudo lock
    invalidate_range_end
        take pseudo lock      (**)
        up_read()
        release pseudo lock

    At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
    at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.

    This could cause a deadlock (ignoring for a second that the pseudo lock
    is not a lock):

    - invalidate_range_start waits on down_read(), because the rwsem is
    held by install_new_memslots

    - install_new_memslots waits on down_write(), because the rwsem is
    held till (another) invalidate_range_end finishes

    - invalidate_range_end waits on the pseudo lock, held by
    invalidate_range_start.

    Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
    it would change the *shared* rwsem readers into *shared recursive*
    readers), so open-code the wait using a readers count and a
    spinlock. This also allows handling blockable and non-blockable
    critical section in the same way.

    Losing the rwsem fairness does theoretically allow MMU notifiers to
    block install_new_memslots forever. Note that mm/mmu_notifier.c's own
    retry scheme in mmu_interval_read_begin also uses wait/wake_up
    and is likewise not fair.
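
    A simplified sketch of the open-coded wait (field names follow the
    upstream patch):

        /* Notifier side, in invalidate_range_start(): */
        spin_lock(&kvm->mn_invalidate_lock);
        kvm->mn_active_invalidate_count++;
        spin_unlock(&kvm->mn_invalidate_lock);

        /* Memslot update side, in install_new_memslots(): */
        spin_lock(&kvm->mn_invalidate_lock);
        prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
        while (kvm->mn_active_invalidate_count) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                spin_unlock(&kvm->mn_invalidate_lock);
                schedule();
                spin_lock(&kvm->mn_invalidate_lock);
        }
        finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
        spin_unlock(&kvm->mn_invalidate_lock);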

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     


28 Jul, 2021

1 commit

  • The arguments to the KVM_CLEAR_DIRTY_LOG ioctl include a pointer,
    therefore it needs a compat ioctl implementation. Otherwise,
    32-bit userspace fails to invoke it on 64-bit kernels; for x86
    it might work fine by chance if the padding is zero, but not
    on big-endian architectures.
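
    The compat path needs its own layout because the bitmap pointer is only
    32 bits wide for compat tasks (sketch, per the upstream commit):

        struct compat_kvm_clear_dirty_log {
                __u32 slot;
                __u32 num_pages;
                __u64 first_page;
                union {
                        compat_uptr_t dirty_bitmap; /* one bit per page */
                        __u64 padding2;
                };
        };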

    Reported-by: Thomas Sattler
    Cc: stable@vger.kernel.org
    Fixes: 2a31b9db1535 ("kvm: introduce manual dirty log reprotect")
    Reviewed-by: Peter Xu
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini