09 Jan, 2021

1 commit

  • Pull kvm fixes from Paolo Bonzini:
    "x86:
    - Fixes for the new scalable MMU
    - Fixes for migration of nested hypervisors on AMD
    - Fix for clang integrated assembler
    - Fix for left shift by 64 (UBSAN)
    - Small cleanups
    - Straggler SEV-ES patch

    ARM:
    - VM init cleanups
    - PSCI relay cleanups
    - Kill CONFIG_KVM_ARM_PMU
    - Fixup __init annotations
    - Fixup reg_to_encoding()
    - Fix spurious PMCR_EL0 access

    Misc:
    - selftests cleanups"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (38 commits)
    KVM: x86: __kvm_vcpu_halt can be static
    KVM: SVM: Add support for booting APs in an SEV-ES guest
    KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit
    KVM: nSVM: mark vmcb as dirty when forcingly leaving the guest mode
    KVM: nSVM: correctly restore nested_run_pending on migration
    KVM: x86/mmu: Clarify TDP MMU page list invariants
    KVM: x86/mmu: Ensure TDP MMU roots are freed after yield
    kvm: check tlbs_dirty directly
    KVM: x86: change in pv_eoi_get_pending() to make code more readable
    MAINTAINERS: Really update email address for Sean Christopherson
    KVM: x86: fix shift out of bounds reported by UBSAN
    KVM: selftests: Implement perf_test_util more conventionally
    KVM: selftests: Use vm_create_with_vcpus in create_vm
    KVM: selftests: Factor out guest mode code
    KVM/SVM: Remove leftover __svm_vcpu_run prototype from svm.c
    KVM: SVM: Add register operand to vmsave call in sev_es_vcpu_load
    KVM: x86/mmu: Optimize not-present/MMIO SPTE check in get_mmio_spte()
    KVM: x86/mmu: Use raw level to index into MMIO walks' sptes array
    KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE
    KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte()
    ...

    Linus Torvalds
     

08 Jan, 2021

1 commit

  • In kvm_mmu_notifier_invalidate_range_start(), tlbs_dirty is used as:

        need_tlb_flush |= kvm->tlbs_dirty;

    with need_tlb_flush's type being int and tlbs_dirty's type being long.

    This means tlbs_dirty is always consumed as an int and the upper 32 bits
    are ignored. We need to check tlbs_dirty in a correct way, and this
    change checks it directly without propagating it to need_tlb_flush
    (a small demonstration follows this entry).

    Note: it's _extremely_ unlikely that neglecting the upper 32 bits can
    cause problems in practice. It would require encountering tlbs_dirty
    on a 4 billion count boundary, and KVM would need to be using shadow
    paging or be running a nested guest.

    Cc: stable@vger.kernel.org
    Fixes: a4ee1ca4a36e ("KVM: MMU: delay flush all tlbs on sync_page path")
    Signed-off-by: Lai Jiangshan
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Lai Jiangshan
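
    A minimal, self-contained demonstration of the truncation described above
    (plain userspace C, not the kernel code; it assumes an LP64 target where
    long is 64-bit):

        #include <stdio.h>

        int main(void)
        {
                long tlbs_dirty = 1L << 32;     /* non-zero only in the upper 32 bits */
                int need_tlb_flush = 0;

                need_tlb_flush |= tlbs_dirty;   /* old pattern: truncated to int -> 0 */
                printf("via int accumulator: %d\n", need_tlb_flush);
                printf("checked directly:    %d\n", tlbs_dirty != 0);   /* new pattern -> 1 */
                return 0;
        }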
     

21 Dec, 2020

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "Much x86 work was pushed out to 5.12, but ARM more than made up for it.

    ARM:
    - PSCI relay at EL2 when "protected KVM" is enabled
    - New exception injection code
    - Simplification of AArch32 system register handling
    - Fix PMU accesses when no PMU is enabled
    - Expose CSV3 on non-Meltdown hosts
    - Cache hierarchy discovery fixes
    - PV steal-time cleanups
    - Allow function pointers at EL2
    - Various host EL2 entry cleanups
    - Simplification of the EL2 vector allocation

    s390:
    - memcg accounting for s390-specific parts of kvm and gmap
    - selftest for diag318
    - new kvm_stat for when async_pf falls back to sync

    x86:
    - Tracepoints for the new pagetable code from 5.10
    - Catch VFIO and KVM irqfd events before userspace
    - Reporting dirty pages to userspace with a ring buffer
    - SEV-ES host support
    - Nested VMX support for wait-for-SIPI activity state
    - New feature flag (AVX512 FP16)
    - New system ioctl to report Hyper-V-compatible paravirtualization features

    Generic:
    - Selftest improvements"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
    KVM: SVM: fix 32-bit compilation
    KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
    KVM: SVM: Provide support to launch and run an SEV-ES guest
    KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
    KVM: SVM: Provide support for SEV-ES vCPU loading
    KVM: SVM: Provide support for SEV-ES vCPU creation/loading
    KVM: SVM: Update ASID allocation to support SEV-ES guests
    KVM: SVM: Set the encryption mask for the SVM host save area
    KVM: SVM: Add NMI support for an SEV-ES guest
    KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
    KVM: SVM: Do not report support for SMM for an SEV-ES guest
    KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
    KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
    KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
    KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
    KVM: SVM: Add support for EFER write traps for an SEV-ES guest
    KVM: SVM: Support string IO operations for an SEV-ES guest
    KVM: SVM: Support MMIO for an SEV-ES guest
    KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
    KVM: SVM: Create trace events for VMGEXIT processing
    ...

    Linus Torvalds
     

20 Dec, 2020

1 commit

  • A VCPU of a VM can allocate a couple of pages which can be mmap'ed by the
    user space application. At the moment this memory is not charged to the
    memcg of the VMM. On a large machine running a large number of VMs, or a
    small number of VMs each with many VCPUs, this unaccounted memory can be
    significant. So, charge this memory to the memcg of the VMM; a sketch of
    the allocation-flag mechanism follows this entry. Note that the lifetime
    of these allocations corresponds to the lifetime of the VMM.

    Link: https://lkml.kernel.org/r/20201106202923.2087414-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Paolo Bonzini
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
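
    A hedged kernel-style sketch of the mechanism (illustrative, not the
    actual diff): the memcg charge comes from allocating with
    GFP_KERNEL_ACCOUNT; alloc_vcpu_mmap_page() below is a hypothetical
    stand-in for the vCPU pages the entry mentions.

        #include <linux/gfp.h>
        #include <linux/mm_types.h>

        /* Hypothetical helper standing in for the vCPU pages mmap'ed by the VMM. */
        static struct page *alloc_vcpu_mmap_page(void)
        {
                /*
                 * GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT) charges the page
                 * to the memcg of the allocating task, i.e. the VMM, so the memory
                 * is no longer invisible to cgroup accounting.
                 */
                return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
        }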
     

15 Nov, 2020

7 commits

  • Because the kvm dirty ring and the kvm dirty log are used in an exclusive
    way, let's avoid creating the dirty_bitmap when the kvm dirty ring is
    enabled. Since the dirty_bitmap is now created conditionally, we can no
    longer use it as a sign of "whether this memory slot has dirty tracking
    enabled". Change such users to check the kvm memory slot flags instead
    (see the sketch after this entry).

    Note that a memory slot can still end up with a dirty_bitmap allocated:
    _if_ the slot is created before the dirty ring is enabled and with the
    dirty tracking capability set, it will still carry a dirty_bitmap.
    However, this should not hurt much (the bitmap is always freed if it is
    there), and real users normally won't trigger it, because the dirty
    tracking flag is in most cases applied to kvm slots only when migration
    starts, which is far later than KVM initialization (VM start).

    Signed-off-by: Peter Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Peter Xu
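
    A sketch of the rule described above (the helper name is illustrative;
    the real patch converts existing call sites): with the dirty ring
    enabled the bitmap may be absent, so "is dirty tracking on for this
    slot" has to be answered from the slot flags rather than from the
    dirty_bitmap pointer.

        #include <linux/kvm_host.h>

        /* Illustrative helper, not the actual diff. */
        static bool slot_dirty_tracking_enabled(const struct kvm_memory_slot *slot)
        {
                /* before: return slot->dirty_bitmap != NULL; */
                return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
        }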
     
  • There's no good reason to use both the dirty bitmap logging and the new
    dirty ring buffer to track dirty bits. Supporting both at the same time
    would be possible, but it would complicate things while helping little.
    Let's simply make it the rule, before we enable dirty ring on any arch,
    that these two interfaces cannot be used together.

    The switching point is KVM_CAP_DIRTY_LOG_RING capability enablement:
    that's where we move from the default dirty logging to the dirty ring
    (see the sketch after this entry). As long as kvm->dirty_ring_size is
    set up correctly, we switch once and for all to dirty ring buffer mode
    for the current virtual machine.

    Signed-off-by: Peter Xu
    Message-Id:
    [Change errno from EINVAL to ENXIO. - Paolo]
    Signed-off-by: Paolo Bonzini

    Peter Xu
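
    A sketch of the switch the entry describes (names and placement are
    illustrative, not the literal patch): once kvm->dirty_ring_size is
    non-zero, the bitmap-based dirty-log interface is refused, with -ENXIO
    as noted in the bracketed remark above.

        #include <linux/errno.h>
        #include <linux/kvm_host.h>

        /* Illustrative guard; the real check sits in the dirty-log ioctl paths. */
        static int dirty_log_mode_allowed(struct kvm *kvm)
        {
                if (kvm->dirty_ring_size)
                        return -ENXIO;  /* ring mode: harvest dirty GFNs from the ring */
                return 0;               /* legacy mode: KVM_GET_DIRTY_LOG bitmap */
        }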
     
  • This patch is heavily based on previous work from Lei Cao
    and Paolo Bonzini [1].

    KVM currently uses large bitmaps to track dirty memory. These bitmaps
    are copied to userspace when userspace queries KVM for its dirty page
    information. The use of bitmaps is mostly sufficient for live
    migration, as large parts of memory are dirtied from one log-dirty
    pass to another. However, in a checkpointing system, the number of
    dirty pages is small and in fact it is often bounded: the VM is
    paused when it has dirtied a pre-defined number of pages. Traversing a
    large, sparsely populated bitmap to find set bits is time-consuming,
    as is copying the bitmap to user-space.

    A similar issue arises for live migration when guest memory is huge
    while only a small portion of pages is dirtied: for each dirty sync we
    need to pull the whole dirty bitmap to userspace and analyse every bit,
    even if it is mostly zeros.

    The preferred data structure for above scenarios is a dense list of
    guest frame numbers (GFN). This patch series stores the dirty list in
    kernel memory that can be memory mapped into userspace to allow speedy
    harvesting.

    This patch enables dirty ring for x86 only; however, it should be easy
    to extend to other archs as well (a sketch of an entry's shape follows
    this entry).

    [1] https://patchwork.kernel.org/patch/10471409/

    Signed-off-by: Lei Cao
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Peter Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Peter Xu
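
    For orientation, a rough sketch of what one entry in the mmap'ed ring
    carries (simplified; the exact uapi layout and flag bits are defined in
    the KVM headers, not here): a slot identifier plus the page offset
    within that slot, i.e. a dense list of dirtied GFNs instead of a bitmap.

        #include <linux/types.h>

        /* Simplified illustration of a dirty-ring entry, not the real uapi struct. */
        struct dirty_ring_entry_example {
                __u32 flags;    /* producer/consumer state of this entry */
                __u32 slot;     /* identifies the memslot (the ABI also encodes the address space) */
                __u64 offset;   /* page offset of the dirtied GFN within that slot */
        };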
     
  • The context will be needed to implement the kvm dirty ring.

    Signed-off-by: Peter Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Peter Xu
     
  • kvm_clear_guest_page is not used anymore after "KVM: X86: Don't track dirty
    for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]", except from kvm_clear_guest.
    We can just inline it in its sole user.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Don't allow the events to accumulate in the eventfd counter, drain them
    as they are handled.

    Signed-off-by: David Woodhouse
    Message-Id:
    Signed-off-by: Paolo Bonzini

    David Woodhouse
     
  • As far as I can tell, when we use posted interrupts we silently cut off
    the events from userspace, if it's listening on the same eventfd that
    feeds the irqfd.

    I like that behaviour. Let's do it all the time, even without posted
    interrupts. It makes it much easier to handle IRQ remapping invalidation
    without having to constantly add/remove the fd from the userspace poll
    set. We can just leave userspace polling on it, and the bypass will...
    well... bypass it.

    Signed-off-by: David Woodhouse
    Message-Id:
    Signed-off-by: Paolo Bonzini

    David Woodhouse
     

23 Oct, 2020

1 commit

  • Dirty logging is a key feature of the KVM MMU and must be supported by
    the TDP MMU. Add support for both the write protection and PML dirty
    logging modes.

    Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
    machine. This series introduced no new failures.

    This series can be viewed in Gerrit at:
    https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

    Signed-off-by: Ben Gardon
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     

22 Oct, 2020

1 commit

  • Cache the address space ID just like the slot ID. It will be used in
    order to fill in the dirty ring entries (a small sketch follows this
    entry).

    Suggested-by: Paolo Bonzini
    Suggested-by: Sean Christopherson
    Reviewed-by: Sean Christopherson
    Signed-off-by: Peter Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Peter Xu
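
    A sketch of the idea (field names and placement are illustrative, not
    the real kvm_memory_slot layout): the memslot already caches its slot
    ID, and the change adds the address space ID next to it so dirty-ring
    entries can be filled without recomputing it.

        #include <linux/types.h>

        /* Illustrative excerpt; other memslot fields omitted. */
        struct memslot_example {
                unsigned long *dirty_bitmap;
                short id;       /* slot ID, already cached */
                u16 as_id;      /* new: cached address space ID */
        };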
     


12 Sep, 2020

2 commits

  • When kmalloc() fails in kvm_io_bus_unregister_dev(), before removing
    the bus we should iterate over all other devices linked to it and call
    kvm_iodevice_destructor() for them.

    Fixes: 90db10434b16 ("KVM: kvm_io_bus_unregister_dev() should never fail")
    Cc: stable@vger.kernel.org
    Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com
    Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707
    Signed-off-by: Rustam Kovhaev
    Reviewed-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Rustam Kovhaev
     
  • …kvmarm/kvmarm into HEAD

    KVM/arm64 fixes for Linux 5.9, take #1

    - Multiple stolen time fixes, with a new capability to match x86
    - Fix for hugetlbfs mappings when PUD and PMD are the same level
    - Fix for hugetlbfs mappings when PTE mappings are enforced
    (dirty logging, for example)
    - Fix tracing output of 64bit values

    Paolo Bonzini
     

22 Aug, 2020

1 commit

  • The 'flags' field of 'struct mmu_notifier_range' is used to indicate
    whether invalidate_range_{start,end}() are permitted to block. In the
    case of kvm_mmu_notifier_invalidate_range_start(), this field is not
    forwarded on to the architecture-specific implementation of
    kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
    whether or not to block.

    Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
    architectures know whether or not they are permitted to block (a sketch
    of the resulting shape follows this entry).

    Cc:
    Cc: Marc Zyngier
    Cc: Suzuki K Poulose
    Cc: James Morse
    Signed-off-by: Will Deacon
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Will Deacon
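
    A simplified sketch of the resulting shape: the notifier's range flags
    are forwarded so the arch backend can test MMU_NOTIFIER_RANGE_BLOCKABLE
    and pick a non-blocking path when blocking is not permitted. The backend
    function below is an illustrative placeholder, not an actual arch hook.

        #include <linux/kvm_host.h>
        #include <linux/mmu_notifier.h>

        /* New prototype: 'flags' carries the mmu_notifier_range flags. */
        int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start,
                                unsigned long end, unsigned flags);

        /* Illustrative use in an arch backend. */
        static void arch_backend_example(struct kvm *kvm, unsigned long start,
                                         unsigned long end, unsigned flags)
        {
                bool may_block = flags & MMU_NOTIFIER_RANGE_BLOCKABLE;

                if (!may_block) {
                        /* e.g. defer expensive teardown, avoid sleeping locks */
                }
        }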
     

13 Aug, 2020

1 commit

  • After the cleanup of page fault accounting, gup does not need to pass
    task_struct around any more. Remove that parameter in the whole gup
    stack.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     

11 Aug, 2020

1 commit

  • Pull locking updates from Thomas Gleixner:
    "A set of locking fixes and updates:

    - Untangle the header spaghetti which causes build failures in
    various situations caused by the lockdep additions to seqcount to
    validate that the write side critical sections are non-preemptible.

    - The seqcount associated lock debug addons which were blocked by the
    above fallout.

    seqcount writers, contrary to seqlock writers, must be externally
    serialized, which usually happens via locking - except for strict
    per-CPU seqcounts. As the lock is not part of the seqcount, lockdep
    cannot validate that the lock is held.

    This new debug mechanism adds the concept of associated locks.
    The sequence count now has lock type variants and corresponding
    initializers which take a pointer to the associated lock used for
    writer serialization. If lockdep is enabled, the pointer is stored
    and write_seqcount_begin() has a lockdep assertion to validate that
    the lock is held.

    Aside of the type and the initializer no other code changes are
    required at the seqcount usage sites. The rest of the seqcount API
    is unchanged and determines the type at compile time with the help
    of _Generic which is possible now that the minimal GCC version has
    been moved up.

    Adding this lockdep coverage unearthed a handful of seqcount bugs
    which have been addressed already independent of this.

    While generally useful, this comes with a Trojan Horse twist: On RT
    kernels the write side critical section can become preemptible if
    the writers are serialized by an associated lock, which leads to
    the well-known reader-preempts-writer livelock. RT prevents this by
    storing the associated lock pointer independent of lockdep in the
    seqcount and changing the reader side to block on the lock when a
    reader detects that a writer is in the write side critical section.

    - Conversion of seqcount usage sites to associated types and
    initializers"

    * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    locking/seqlock, headers: Untangle the spaghetti monster
    locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
    x86/headers: Remove APIC headers from
    seqcount: More consistent seqprop names
    seqcount: Compress SEQCNT_LOCKNAME_ZERO()
    seqlock: Fold seqcount_LOCKNAME_init() definition
    seqlock: Fold seqcount_LOCKNAME_t definition
    seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
    hrtimer: Use sequence counter with associated raw spinlock
    kvm/eventfd: Use sequence counter with associated spinlock
    userfaultfd: Use sequence counter with associated spinlock
    NFSv4: Use sequence counter with associated spinlock
    iocost: Use sequence counter with associated spinlock
    raid5: Use sequence counter with associated spinlock
    vfs: Use sequence counter with associated spinlock
    timekeeping: Use sequence counter with associated raw spinlock
    xfrm: policy: Use sequence counters with associated lock
    netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
    netfilter: conntrack: Use sequence counter with associated spinlock
    sched: tasks: Use sequence counter with associated spinlock
    ...

    Linus Torvalds
     

07 Aug, 2020

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "s390:
    - implement diag318

    x86:
    - Report last CPU for debugging
    - Emulate smaller MAXPHYADDR in the guest than in the host
    - .noinstr and tracing fixes from Thomas
    - nested SVM page table switching optimization and fixes

    Generic:
    - Unify shadow MMU cache data structures across architectures"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
    KVM: SVM: Fix sev_pin_memory() error handling
    KVM: LAPIC: Set the TDCR settable bits
    KVM: x86: Specify max TDP level via kvm_configure_mmu()
    KVM: x86/mmu: Rename max_page_level to max_huge_page_level
    KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR
    KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch
    KVM: x86: Pull the PGD's level from the MMU instead of recalculating it
    KVM: VMX: Make vmx_load_mmu_pgd() static
    KVM: x86/mmu: Add separate helper for shadow NPT root page role calc
    KVM: VMX: Drop a duplicate declaration of construct_eptp()
    KVM: nSVM: Correctly set the shadow NPT root level in its MMU role
    KVM: Using macros instead of magic values
    MIPS: KVM: Fix build error caused by 'kvm_run' cleanup
    KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF
    KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
    KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
    KVM: VMX: Add guest physical address check in EPT violation and misconfig
    KVM: VMX: introduce vmx_need_pf_intercept
    KVM: x86: update exception bitmap on CPUID changes
    KVM: x86: rename update_bp_intercept to update_exception_bitmap
    ...

    Linus Torvalds
     

29 Jul, 2020

1 commit

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_spinlock_t data type, which allows associating a
    spinlock with the sequence counter (a small sketch of the pattern
    follows this entry). This enables lockdep to verify that the spinlock
    used for writer serialization is held when the write side critical
    section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Paolo Bonzini
    Link: https://lkml.kernel.org/r/20200720155530.1173732-24-a.darwish@linutronix.de

    Ahmed S. Darwish
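
    A self-contained sketch of the pattern (structure and names are
    illustrative, not the kvm/eventfd code): the writer-serializing
    spinlock is registered with the sequence counter at init time so
    lockdep can assert it is held in the write-side critical section.

        #include <linux/seqlock.h>
        #include <linux/spinlock.h>

        struct guarded_state_example {
                spinlock_t lock;            /* serializes writers */
                seqcount_spinlock_t seq;    /* was a plain seqcount_t before the change */
                unsigned long value;
        };

        static void guarded_state_init(struct guarded_state_example *s)
        {
                spin_lock_init(&s->lock);
                /* associate the lock; with lockdep off this association compiles away */
                seqcount_spinlock_init(&s->seq, &s->lock);
        }

        static void guarded_state_update(struct guarded_state_example *s, unsigned long v)
        {
                spin_lock(&s->lock);
                write_seqcount_begin(&s->seq);  /* lockdep checks that s->lock is held */
                s->value = v;
                write_seqcount_end(&s->seq);
                spin_unlock(&s->lock);
        }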
     

24 Jul, 2020

1 commit

  • Entering a guest is similar to exiting to user space. Pending work like
    handling signals, rescheduling, task work etc. needs to be handled before
    that.

    Provide generic infrastructure to avoid duplication of the same handling
    code all over the place.

    The transfer to guest mode handling is different from the exit to usermode
    handling, e.g. vs. rseq and live patching, so a separate function is used.

    The initial list of work items handled is:

    TIF_SIGPENDING, TIF_NEED_RESCHED, TIF_NOTIFY_RESUME

    Architecture specific TIF flags can be added via defines in the
    architecture specific include files.

    The calling convention is also different from the syscall/interrupt entry
    functions as KVM invokes this from the outer vcpu_run() loop with
    interrupts and preemption enabled. To prevent missing a pending work item,
    it invokes a check for pending TIF work from interrupt-disabled code right
    before transitioning to guest mode (a rough sketch of this loop shape
    follows this entry). The lockdep, RCU and tracing state
    handling is also done directly around the switch to and from guest mode.

    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200722220519.833296398@linutronix.de

    Thomas Gleixner
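
    A rough sketch of the loop shape being generalized (all function names
    here are hypothetical placeholders, not the new generic API): pending
    TIF work is re-checked with interrupts disabled immediately before the
    switch to guest mode, so nothing can slip in between the check and the
    entry.

        #include <linux/errno.h>
        #include <linux/irqflags.h>
        #include <linux/sched.h>
        #include <linux/thread_info.h>

        static inline bool guest_entry_work_pending(void)
        {
                return test_thread_flag(TIF_SIGPENDING) ||
                       test_thread_flag(TIF_NEED_RESCHED) ||
                       test_thread_flag(TIF_NOTIFY_RESUME);
        }

        /* Hypothetical outline of a vcpu_run()-style iteration. */
        static int run_guest_once_example(void)
        {
                local_irq_disable();
                if (guest_entry_work_pending()) {
                        local_irq_enable();
                        return -EAGAIN;  /* caller handles signals/resched/task work */
                }
                /* ... switch to guest mode with interrupts disabled ... */
                local_irq_enable();
                return 0;
        }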
     


09 Jul, 2020

2 commits

  • An OVMF-booted guest running on shadow pages crashes with a TRIPLE FAULT
    after enabling paging from SMM. The crash is triggered from
    mmu_check_root() and is caused by kvm_is_visible_gfn() searching through
    memslots with as_id = 0 while the vCPU may be in a different context
    (address space).

    Introduce kvm_vcpu_is_visible_gfn() and use it from mmu_check_root()
    (a sketch follows this entry).

    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
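
    A simplified sketch of the new helper's intent (the visibility test is
    abbreviated here): resolve the memslot through the vCPU, so the lookup
    uses the vCPU's current address space rather than as_id 0.

        #include <linux/kvm_host.h>

        /* Simplified sketch, not the literal implementation. */
        bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
        {
                struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);

                /* visible = a user memslot exists and is not being deleted */
                return slot && slot->id < KVM_USER_MEM_SLOTS &&
                       !(slot->flags & KVM_MEMSLOT_INVALID);
        }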
     
  • Unlike normal 'int' functions returning '0' on success,
    kvm_setup_async_pf()/kvm_arch_setup_async_pf() return '1' when a job to
    handle a page fault asynchronously was scheduled and '0' otherwise. To
    avoid the confusion, change the return type to 'bool'.

    No functional change intended.

    Suggested-by: Sean Christopherson
    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     


13 Jun, 2020

1 commit

  • Pull more KVM updates from Paolo Bonzini:
    "The guest side of the asynchronous page fault work has been delayed to
    5.9 in order to sync with Thomas's interrupt entry rework, but here's
    the rest of the KVM updates for this merge window.

    MIPS:
    - Loongson port

    PPC:
    - Fixes

    ARM:
    - Fixes

    x86:
    - KVM_SET_USER_MEMORY_REGION optimizations
    - Fixes
    - Selftest fixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (62 commits)
    KVM: x86: do not pass poisoned hva to __kvm_set_memory_region
    KVM: selftests: fix sync_with_host() in smm_test
    KVM: async_pf: Inject 'page ready' event only if 'page not present' was previously injected
    KVM: async_pf: Cleanup kvm_setup_async_pf()
    kvm: i8254: remove redundant assignment to pointer s
    KVM: x86: respect singlestep when emulating instruction
    KVM: selftests: Don't probe KVM_CAP_HYPERV_ENLIGHTENED_VMCS when nested VMX is unsupported
    KVM: selftests: do not substitute SVM/VMX check with KVM_CAP_NESTED_STATE check
    KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
    KVM: arm64: Move hyp_symbol_addr() to kvm_asm.h
    KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
    KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
    KVM: arm64: Remove host_cpu_context member from vcpu structure
    KVM: arm64: Stop sparse from moaning at __hyp_this_cpu_ptr
    KVM: arm64: Handle PtrAuth traps early
    KVM: x86: Unexport x86_fpu_cache and make it static
    KVM: selftests: Ignore KVM 5-level paging support for VM_MODE_PXXV48_4K
    KVM: arm64: Save the host's PtrAuth keys in non-preemptible context
    KVM: arm64: Stop save/restoring ACTLR_EL1
    KVM: arm64: Add emulation for 32bit guests accessing ACTLR2
    ...

    Linus Torvalds
     

12 Jun, 2020

2 commits

  • A 'page not present' event may or may not get injected depending on
    the guest's state. If the event wasn't injected, there is no need to
    inject the corresponding 'page ready' event, as the guest may get
    confused. E.g. Linux thinks that the corresponding 'page not present'
    event wasn't delivered *yet* and allocates a 'dummy entry' for it.
    This entry is never freed.

    Note, 'wakeup all' events have no corresponding 'page not present'
    event and always get injected.

    s390 seems to always be able to inject 'page not present', so there
    the change is effectively a nop.

    Suggested-by: Vivek Goyal
    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=208081
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     
  • schedule_work() returns 'false' only when the work is already on the queue,
    and this can't happen as kvm_setup_async_pf() always allocates a new one.
    Also, to avoid a potential race, it makes sense to call schedule_work() at
    the very end, after we've added the work to the queue.

    While at it, do some minor cleanup. gfn_to_pfn_async(), mentioned in a
    comment, does not currently exist and, moreover, we can check
    kvm_is_error_hva() at the very beginning, before we try to allocate work,
    so the 'retry_sync' label can go away completely.

    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Reviewed-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     

10 Jun, 2020

2 commits

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead (a before/after sketch of a typical caller follows
    this entry).

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
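
    A minimal before/after sketch of what the coccinelle rule above produces
    (the caller is illustrative):

        #include <linux/mm_types.h>
        #include <linux/mmap_lock.h>

        static void walk_user_mappings_example(struct mm_struct *mm)
        {
                /* before: down_read(&mm->mmap_sem); */
                mmap_read_lock(mm);

                /* ... look up VMAs, translate user addresses ... */

                /* before: up_read(&mm->mmap_sem); */
                mmap_read_unlock(mm);
        }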
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
        sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

09 Jun, 2020

1 commit

  • API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
    align with pin_user_pages_fast_only().

    As part of this, we get rid of the 'write' parameter; instead, the caller
    will pass FOLL_WRITE to get_user_pages_fast_only() (a caller sketch
    follows this entry). This does not change any existing functionality of
    the API.

    All the callers are changed to pass FOLL_WRITE.

    Also introduce get_user_page_fast_only(), and use it in a few places
    that hard-code nr_pages to 1.

    Updated the documentation of the API.

    Signed-off-by: Souptick Joarder
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Paul Mackerras [arch/powerpc/kvm]
    Cc: Matthew Wilcox
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Paolo Bonzini
    Cc: Stephen Rothwell
    Cc: Mike Rapoport
    Cc: Aneesh Kumar K.V
    Cc: Michal Suchanek
    Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
    Signed-off-by: Linus Torvalds

    Souptick Joarder
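
    A sketch of the caller-side change (the caller name is illustrative):
    the old boolean 'write' argument becomes FOLL_WRITE in gup_flags.

        #include <linux/mm.h>

        static int grab_one_writable_page_example(unsigned long addr, struct page **page)
        {
                /* before: __get_user_pages_fast(addr, 1, 1, page);  -- third arg was 'write' */
                return get_user_pages_fast_only(addr, 1, FOLL_WRITE, page);
        }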
     

08 Jun, 2020

1 commit

  • Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
    to fix inappropriate APIC page invalidation by re-introducing arch
    specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
    kvm_mmu_notifier_invalidate_range_start. However, the patch left a
    possible race where the VMCS APIC address cache is updated *before*
    it is unmapped:

    (Invalidator) kvm_mmu_notifier_invalidate_range_start()
    (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
    (KVM VCPU) vcpu_enter_guest()
    (KVM VCPU) kvm_vcpu_reload_apic_access_page()
    (Invalidator) actually unmap page

    Because of the above race, there can be a mismatch between the
    host physical address stored in the APIC_ACCESS_PAGE VMCS field and
    the host physical address stored in the EPT entry for the APIC GPA
    (0xfee00000). When this happens, the processor will not trap APIC
    accesses, and will instead show the raw contents of the APIC-access page.
    Because Windows OS periodically checks for unexpected modifications to
    the LAPIC register, this will show up as a BSOD crash with BugCheck
    CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
    https://bugzilla.redhat.com/show_bug.cgi?id=1751017.

    The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
    cannot guarantee that no additional references are taken to the pages in
    the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately,
    this case is supported by the MMU notifier API, as documented in
    include/linux/mmu_notifier.h:

    * If the subsystem
    * can't guarantee that no additional references are taken to
    * the pages in the range, it has to implement the
    * invalidate_range() notifier to remove any references taken
    * after invalidate_range_start().

    The fix therefore is to reload the APIC-access page field in the VMCS
    from kvm_mmu_notifier_invalidate_range() instead of ..._range_start()
    (a simplified sketch follows this entry).

    Cc: stable@vger.kernel.org
    Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation")
    Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
    Signed-off-by: Eiichi Tsukata
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Eiichi Tsukata
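
    A simplified sketch of the fix's shape, condensed from the description
    above (not the literal diff): request the APIC-access-page reload from
    the invalidate_range() notifier, i.e. only once the range covering the
    APIC page is actually being unmapped.

        #include <linux/kvm_host.h>
        #include <asm/apicdef.h>

        void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
                                                    unsigned long start,
                                                    unsigned long end)
        {
                unsigned long apic_address;

                /* host virtual address backing the APIC GPA 0xfee00000 */
                apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
                if (start <= apic_address && apic_address < end)
                        kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
        }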
     


04 Jun, 2020

1 commit

  • After commit 63d0434 ("KVM: x86: move kvm_create_vcpu_debugfs after
    last failure point") we are creating the per-vCPU debugfs files
    after the creation of the vCPU file descriptor. This makes it
    possible for userspace to reach kvm_vcpu_release before
    kvm_create_vcpu_debugfs has finished. The vcpu->debugfs_dentry
    then does not have any associated inode anymore, and this causes
    a NULL-pointer dereference in debugfs_create_file.

    The solution is simply to avoid removing the files; they are
    cleaned up when the VM file descriptor is closed (and that must be
    after KVM_CREATE_VCPU returns). We can stop storing the dentry
    in struct kvm_vcpu too, because it is not needed anywhere after
    kvm_create_vcpu_debugfs returns.

    Reported-by: syzbot+705f4401d5a93a59b87d@syzkaller.appspotmail.com
    Fixes: 63d04348371b ("KVM: x86: move kvm_create_vcpu_debugfs after last failure point")
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

01 Jun, 2020

2 commits

  • KVM/arm64 updates for Linux 5.8:

    - Move the arch-specific code into arch/arm64/kvm
    - Start the post-32bit cleanup
    - Cherry-pick a few non-invasive pre-NV patches

    Paolo Bonzini
     
  • The userspace_addr alignment and range checks are not performed for private
    memory slots that are prepared by KVM itself. This is unnecessary and makes
    it questionable to use __*_user functions to access memory later on. We also
    rely on the userspace address being aligned since we have an entire family
    of functions to map gfn to pfn.

    Fortunately skipping the check is completely unnecessary. Only x86 uses
    private memslots and their userspace_addr is obtained from vm_mmap,
    therefore it must be below PAGE_OFFSET. In fact, any attempt to pass
    an address above PAGE_OFFSET would have failed because such an address
    would return true for kvm_is_error_hva (sketched after this entry).

    Reported-by: Linus Torvalds
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
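
    The kvm_is_error_hva() test relied on above amounts, in the common case,
    to a bound check against PAGE_OFFSET; a hedged sketch (the helper name is
    illustrative, and some architectures use a different error encoding):

        #include <linux/mm.h>

        /* Common-case sketch of the check the entry relies on. */
        static inline bool hva_is_error_example(unsigned long addr)
        {
                return addr >= PAGE_OFFSET;     /* kernel addresses are never valid HVAs */
        }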