Eric Lee / smarc-fsl-linux-kernel

26 Jan, 2021

1 commit

615099b01 Merge tag 'kvmarm-fixes-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git… ... Browse Code »

…/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.11, take #2

- Don't allow tagged pointers to point to memslots
- Filter out ARMv8.1+ PMU events on v8.0 hardware
- Hide PMU registers from userspace when no PMU is configured
- More PMU cleanups
- Don't try to handle broken PSCI firmware
- More sys_reg() to reg_to_encoding() conversions

Paolo Bonzini
2021-01-26 07:52:01 +0800

21 Jan, 2021

1 commit

139bc8a61 KVM: Forbid the use of tagged userspace addresses for memslots ... Browse Code »

The use of a tagged address could be pretty confusing for the
whole memslot infrastructure as well as the MMU notifiers.

Forbid it altogether, as it never quite worked the first place.

Cc: stable@vger.kernel.org
Reported-by: Rick Edgecombe
Reviewed-by: Catalin Marinas
Signed-off-by: Marc Zyngier

Marc Zyngier
2021-01-21 22:17:36 +0800

09 Jan, 2021

1 commit

2a190b22a Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm ... Browse Code »

Pull kvm fixes from Paolo Bonzini:
"x86:
- Fixes for the new scalable MMU
- Fixes for migration of nested hypervisors on AMD
- Fix for clang integrated assembler
- Fix for left shift by 64 (UBSAN)
- Small cleanups
- Straggler SEV-ES patch

ARM:
- VM init cleanups
- PSCI relay cleanups
- Kill CONFIG_KVM_ARM_PMU
- Fixup __init annotations
- Fixup reg_to_encoding()
- Fix spurious PMCR_EL0 access

Misc:
- selftests cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (38 commits)
KVM: x86: __kvm_vcpu_halt can be static
KVM: SVM: Add support for booting APs in an SEV-ES guest
KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit
KVM: nSVM: mark vmcb as dirty when forcingly leaving the guest mode
KVM: nSVM: correctly restore nested_run_pending on migration
KVM: x86/mmu: Clarify TDP MMU page list invariants
KVM: x86/mmu: Ensure TDP MMU roots are freed after yield
kvm: check tlbs_dirty directly
KVM: x86: change in pv_eoi_get_pending() to make code more readable
MAINTAINERS: Really update email address for Sean Christopherson
KVM: x86: fix shift out of bounds reported by UBSAN
KVM: selftests: Implement perf_test_util more conventionally
KVM: selftests: Use vm_create_with_vcpus in create_vm
KVM: selftests: Factor out guest mode code
KVM/SVM: Remove leftover __svm_vcpu_run prototype from svm.c
KVM: SVM: Add register operand to vmsave call in sev_es_vcpu_load
KVM: x86/mmu: Optimize not-present/MMIO SPTE check in get_mmio_spte()
KVM: x86/mmu: Use raw level to index into MMIO walks' sptes array
KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE
KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte()
...

Linus Torvalds
2021-01-09 07:06:02 +0800

08 Jan, 2021

1 commit

88bf56d04 kvm: check tlbs_dirty directly ... Browse Code »

In kvm_mmu_notifier_invalidate_range_start(), tlbs_dirty is used as:
need_tlb_flush |= kvm->tlbs_dirty;
with need_tlb_flush's type being int and tlbs_dirty's type being long.

It means that tlbs_dirty is always used as int and the higher 32 bits
is useless. We need to check tlbs_dirty in a correct way and this
change checks it directly without propagating it to need_tlb_flush.

Note: it's _extremely_ unlikely this neglecting of higher 32 bits can
cause problems in practice. It would require encountering tlbs_dirty
on a 4 billion count boundary, and KVM would need to be using shadow
paging or be running a nested guest.

Cc: stable@vger.kernel.org
Fixes: a4ee1ca4a36e ("KVM: MMU: delay flush all tlbs on sync_page path")
Signed-off-by: Lai Jiangshan
Message-Id:
Signed-off-by: Paolo Bonzini

Lai Jiangshan
2021-01-08 07:11:30 +0800

21 Dec, 2020

1 commit

6a447b0e3 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm ... Browse Code »

Pull KVM updates from Paolo Bonzini:
"Much x86 work was pushed out to 5.12, but ARM more than made up for it.

ARM:
- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation

s390:
- memcg accouting for s390 specific parts of kvm and gmap
- selftest for diag318
- new kvm_stat for when async_pf falls back to sync

x86:
- Tracepoints for the new pagetable code from 5.10
- Catch VFIO and KVM irqfd events before userspace
- Reporting dirty pages to userspace with a ring buffer
- SEV-ES host support
- Nested VMX support for wait-for-SIPI activity state
- New feature flag (AVX512 FP16)
- New system ioctl to report Hyper-V-compatible paravirtualization features

Generic:
- Selftest improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
KVM: SVM: fix 32-bit compilation
KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
KVM: SVM: Provide support to launch and run an SEV-ES guest
KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
KVM: SVM: Provide support for SEV-ES vCPU loading
KVM: SVM: Provide support for SEV-ES vCPU creation/loading
KVM: SVM: Update ASID allocation to support SEV-ES guests
KVM: SVM: Set the encryption mask for the SVM host save area
KVM: SVM: Add NMI support for an SEV-ES guest
KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
KVM: SVM: Do not report support for SMM for an SEV-ES guest
KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
KVM: SVM: Add support for EFER write traps for an SEV-ES guest
KVM: SVM: Support string IO operations for an SEV-ES guest
KVM: SVM: Support MMIO for an SEV-ES guest
KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
KVM: SVM: Create trace events for VMGEXIT processing
...

Linus Torvalds
2020-12-21 02:44:05 +0800

20 Dec, 2020

1 commit

93bb59ca5 mm, kvm: account kvm_vcpu_mmap to kmemcg ... Browse Code »

A VCPU of a VM can allocate couple of pages which can be mmap'ed by the
user space application. At the moment this memory is not charged to the
memcg of the VMM. On a large machine running large number of VMs or
small number of VMs having large number of VCPUs, this unaccounted
memory can be very significant. So, charge this memory to the memcg of
the VMM. Please note that lifetime of these allocations corresponds to
the lifetime of the VMM.

Link: https://lkml.kernel.org/r/20201106202923.2087414-1-shakeelb@google.com
Signed-off-by: Shakeel Butt
Acked-by: Roman Gushchin
Acked-by: Paolo Bonzini
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shakeel Butt
2020-12-20 03:18:37 +0800

15 Nov, 2020

7 commits

044c59c40 KVM: Don't allocate dirty bitmap if dirty ring is enabled ... Browse Code »

Because kvm dirty rings and kvm dirty log is used in an exclusive way,
Let's avoid creating the dirty_bitmap when kvm dirty ring is enabled.
At the meantime, since the dirty_bitmap will be conditionally created
now, we can't use it as a sign of "whether this memory slot enabled
dirty tracking". Change users like that to check against the kvm
memory slot flags.

Note that there still can be chances where the kvm memory slot got its
dirty_bitmap allocated, _if_ the memory slots are created before
enabling of the dirty rings and at the same time with the dirty
tracking capability enabled, they'll still with the dirty_bitmap.
However it should not hurt much (e.g., the bitmaps will always be
freed if they are there), and the real users normally won't trigger
this because dirty bit tracking flag should in most cases only be
applied to kvm slots only before migration starts, that should be far
latter than kvm initializes (VM starts).

Signed-off-by: Peter Xu
Message-Id:
Signed-off-by: Paolo Bonzini

Peter Xu
2020-11-15 22:49:16 +0800
b2cc64c4f KVM: Make dirty ring exclusive to dirty bitmap log ... Browse Code »

There's no good reason to use both the dirty bitmap logging and the
new dirty ring buffer to track dirty bits. We should be able to even
support both of them at the same time, but it could complicate things
which could actually help little. Let's simply make it the rule
before we enable dirty ring on any arch, that we don't allow these two
interfaces to be used together.

The big world switch would be KVM_CAP_DIRTY_LOG_RING capability
enablement. That's where we'll switch from the default dirty logging
way to the dirty ring way. As long as kvm->dirty_ring_size is setup
correctly, we'll once and for all switch to the dirty ring buffer mode
for the current virtual machine.

Signed-off-by: Peter Xu
Message-Id:
[Change errno from EINVAL to ENXIO. - Paolo]
Signed-off-by: Paolo Bonzini

Peter Xu
2020-11-15 22:49:16 +0800
fb04a1edd KVM: X86: Implement ring-based dirty memory tracking ... Browse Code »

This patch is heavily based on previous work from Lei Cao
and Paolo Bonzini . [1]

KVM currently uses large bitmaps to track dirty memory. These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information. The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are be dirtied from one log-dirty
pass to another. However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue will be there for live migration when the guest memory
is huge while the page dirty procedure is trivial. In that case for
each dirty sync we need to pull the whole dirty bitmap to userspace
and analyse every bit even if it's mostly zeros.

The preferred data structure for above scenarios is a dense list of
guest frame numbers (GFN). This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

This patch enables dirty ring for X86 only. However it should be
easily extended to other archs as well.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao
Signed-off-by: Paolo Bonzini
Signed-off-by: Peter Xu
Message-Id:
Signed-off-by: Paolo Bonzini

Peter Xu
2020-11-15 22:49:15 +0800
28bd726aa KVM: Pass in kvm pointer into mark_page_dirty_in_slot() ... Browse Code »

The context will be needed to implement the kvm dirty ring.

Signed-off-by: Peter Xu
Message-Id:
Signed-off-by: Paolo Bonzini

Peter Xu
2020-11-15 22:49:13 +0800
2f5414423 KVM: remove kvm_clear_guest_page ... Browse Code »

kvm_clear_guest_page is not used anymore after "KVM: X86: Don't track dirty
for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]", except from kvm_clear_guest.
We can just inline it in its sole user.

Signed-off-by: Paolo Bonzini

Paolo Bonzini
2020-11-15 22:49:13 +0800
b59e00dd8 kvm/eventfd: Drain events from eventfd in irqfd_wakeup() ... Browse Code »

Don't allow the events to accumulate in the eventfd counter, drain them
as they are handled.

Signed-off-by: David Woodhouse
Message-Id:
Signed-off-by: Paolo Bonzini

David Woodhouse
2020-11-15 22:49:11 +0800
e8dbf1950 kvm/eventfd: Use priority waitqueue to catch events before userspace ... Browse Code »

As far as I can tell, when we use posted interrupts we silently cut off
the events from userspace, if it's listening on the same eventfd that
feeds the irqfd.

I like that behaviour. Let's do it all the time, even without posted
interrupts. It makes it much easier to handle IRQ remapping invalidation
without having to constantly add/remove the fd from the userspace poll
set. We can just leave userspace polling on it, and the bypass will...
well... bypass it.

Signed-off-by: David Woodhouse
Message-Id:
Signed-off-by: Paolo Bonzini

David Woodhouse
2020-11-15 22:49:10 +0800

23 Oct, 2020

1 commit

a6a0b05da kvm: x86/mmu: Support dirty logging for the TDP MMU ... Browse Code »

Dirty logging is a key feature of the KVM MMU and must be supported by
the TDP MMU. Add support for both the write protection and PML dirty
logging modes.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon
Message-Id:
Signed-off-by: Paolo Bonzini

Ben Gardon
2020-10-23 15:42:13 +0800

22 Oct, 2020

1 commit

9e9eb226b KVM: Cache as_id in kvm_memory_slot ... Browse Code »

Cache the address space ID just like the slot ID. It will be used in
order to fill in the dirty ring entries.

Suggested-by: Paolo Bonzini
Suggested-by: Sean Christopherson
Reviewed-by: Sean Christopherson
Signed-off-by: Peter Xu
Message-Id:
Signed-off-by: Paolo Bonzini

Peter Xu
2020-10-22 06:17:01 +0800

28 Sep, 2020

2 commits

871c433ba KVM: use struct_size() and flex_array_size() helpers in kvm_io_bus_unregister_dev() ... Browse Code »

Make use of the struct_size() helper to avoid any potential type
mistakes and protect against potential integer overflows
Make use of the flex_array_size() helper to calculate the size of a
flexible array member within an enclosing structure

Suggested-by: Gustavo A. R. Silva
Signed-off-by: Rustam Kovhaev
Message-Id:
Reviewed-by: Gustavo A. R. Silva
Signed-off-by: Paolo Bonzini

Rustam Kovhaev
2020-09-28 19:57:17 +0800
2fc4f15da kvm/eventfd: move wildcard calculation outside loop ... Browse Code »

There is no need to calculate wildcard in each iteration
since wildcard is not changed.

Signed-off-by: Yi Li
Message-Id:
Signed-off-by: Paolo Bonzini

Yi Li
2020-09-28 19:57:08 +0800

12 Sep, 2020

2 commits

f65886606 KVM: fix memory leak in kvm_io_bus_unregister_dev() ... Browse Code »

when kmalloc() fails in kvm_io_bus_unregister_dev(), before removing
the bus, we should iterate over all other devices linked to it and call
kvm_iodevice_destructor() for them

Fixes: 90db10434b16 ("KVM: kvm_io_bus_unregister_dev() should never fail")
Cc: stable@vger.kernel.org
Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707
Signed-off-by: Rustam Kovhaev
Reviewed-by: Vitaly Kuznetsov
Message-Id:
Signed-off-by: Paolo Bonzini

Rustam Kovhaev
2020-09-12 01:15:11 +0800
1b67fd086 Merge tag 'kvmarm-fixes-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/… ... Browse Code »

…kvmarm/kvmarm into HEAD

KVM/arm64 fixes for Linux 5.9, take #1

- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced
(dirty logging, for example)
- Fix tracing output of 64bit values

Paolo Bonzini
2020-09-12 01:12:11 +0800

22 Aug, 2020

1 commit

fdfe7cbd5 KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() ... Browse Code »

The 'flags' field of 'struct mmu_notifier_range' is used to indicate
whether invalidate_range_{start,end}() are permitted to block. In the
case of kvm_mmu_notifier_invalidate_range_start(), this field is not
forwarded on to the architecture-specific implementation of
kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
whether or not to block.

Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
architectures are aware as to whether or not they are permitted to block.

Cc:
Cc: Marc Zyngier
Cc: Suzuki K Poulose
Cc: James Morse
Signed-off-by: Will Deacon
Message-Id:
Signed-off-by: Paolo Bonzini

Will Deacon
2020-08-22 06:03:47 +0800

13 Aug, 2020

1 commit

64019a2e4 mm/gup: remove task_struct pointer for all gup code ... Browse Code »

After the cleanup of page fault accounting, gup does not need to pass
task_struct around any more. Remove that parameter in the whole gup
stack.

Signed-off-by: Peter Xu
Signed-off-by: Andrew Morton
Reviewed-by: John Hubbard
Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
Signed-off-by: Linus Torvalds

Peter Xu
2020-08-13 01:58:04 +0800

11 Aug, 2020

1 commit

97d052ea3 Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:

- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.

- The seqcount associated lock debug addons which were blocked by the
above fallout.

seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.

This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.

Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.

Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.

While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.

- Conversion of seqcount usage sites to associated types and
initializers"

* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
x86/headers: Remove APIC headers from
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...

Linus Torvalds
2020-08-11 10:07:44 +0800

07 Aug, 2020

1 commit

921d2597a Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm ... Browse Code »

Pull KVM updates from Paolo Bonzini:
"s390:
- implement diag318

x86:
- Report last CPU for debugging
- Emulate smaller MAXPHYADDR in the guest than in the host
- .noinstr and tracing fixes from Thomas
- nested SVM page table switching optimization and fixes

Generic:
- Unify shadow MMU cache data structures across architectures"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
KVM: SVM: Fix sev_pin_memory() error handling
KVM: LAPIC: Set the TDCR settable bits
KVM: x86: Specify max TDP level via kvm_configure_mmu()
KVM: x86/mmu: Rename max_page_level to max_huge_page_level
KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR
KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch
KVM: x86: Pull the PGD's level from the MMU instead of recalculating it
KVM: VMX: Make vmx_load_mmu_pgd() static
KVM: x86/mmu: Add separate helper for shadow NPT root page role calc
KVM: VMX: Drop a duplicate declaration of construct_eptp()
KVM: nSVM: Correctly set the shadow NPT root level in its MMU role
KVM: Using macros instead of magic values
MIPS: KVM: Fix build error caused by 'kvm_run' cleanup
KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF
KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
KVM: VMX: Add guest physical address check in EPT violation and misconfig
KVM: VMX: introduce vmx_need_pf_intercept
KVM: x86: update exception bitmap on CPUID changes
KVM: x86: rename update_bp_intercept to update_exception_bitmap
...

Linus Torvalds
2020-08-07 03:59:31 +0800

29 Jul, 2020

1 commit

5c73b9a2b kvm/eventfd: Use sequence counter with associated spinlock ... Browse Code »

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Paolo Bonzini
Link: https://lkml.kernel.org/r/20200720155530.1173732-24-a.darwish@linutronix.de

Ahmed S. Darwish
2020-07-29 22:14:29 +0800

24 Jul, 2020

1 commit

935ace2fb entry: Provide infrastructure for work before transitioning to guest mode ... Browse Code »

Entering a guest is similar to exiting to user space. Pending work like
handling signals, rescheduling, task work etc. needs to be handled before
that.

Provide generic infrastructure to avoid duplication of the same handling
code all over the place.

The transfer to guest mode handling is different from the exit to usermode
handling, e.g. vs. rseq and live patching, so a separate function is used.

The initial list of work items handled is:

TIF_SIGPENDING, TIF_NEED_RESCHED, TIF_NOTIFY_RESUME

Architecture specific TIF flags can be added via defines in the
architecture specific include files.

The calling convention is also different from the syscall/interrupt entry
functions as KVM invokes this from the outer vcpu_run() loop with
interrupts and preemption enabled. To prevent missing a pending work item
it invokes a check for pending TIF work from interrupt disabled code right
before transitioning to guest mode. The lockdep, RCU and tracing state
handling is also done directly around the switch to and from guest mode.

Signed-off-by: Thomas Gleixner
Link: https://lkml.kernel.org/r/20200722220519.833296398@linutronix.de

Thomas Gleixner
2020-07-24 21:03:42 +0800

10 Jul, 2020

1 commit

6926f95ac KVM: Move x86's MMU memory cache helpers to common KVM code ... Browse Code »

Move x86's memory cache helpers to common KVM code so that they can be
reused by arm64 and MIPS in future patches.

Suggested-by: Christoffer Dall
Reviewed-by: Ben Gardon
Signed-off-by: Sean Christopherson
Message-Id:
Signed-off-by: Paolo Bonzini

Sean Christopherson
2020-07-10 01:29:42 +0800

09 Jul, 2020

2 commits

995decb6c KVM: x86: take as_id into account when checking PGD ... Browse Code »

OVMF booted guest running on shadow pages crashes on TRIPLE FAULT after
enabling paging from SMM. The crash is triggered from mmu_check_root() and
is caused by kvm_is_visible_gfn() searching through memslots with as_id = 0
while vCPU may be in a different context (address space).

Introduce kvm_vcpu_is_visible_gfn() and use it from mmu_check_root().

Signed-off-by: Vitaly Kuznetsov
Message-Id:
Signed-off-by: Paolo Bonzini

Vitaly Kuznetsov
2020-07-09 19:08:37 +0800
e8c22266e KVM: async_pf: change kvm_setup_async_pf()/kvm_arch_setup_async_pf() return type to bool ... Browse Code »

Unlike normal 'int' functions returning '0' on success, kvm_setup_async_pf()/
kvm_arch_setup_async_pf() return '1' when a job to handle page fault
asynchronously was scheduled and '0' otherwise. To avoid the confusion
change return type to 'bool'.

No functional change intended.

Suggested-by: Sean Christopherson
Signed-off-by: Vitaly Kuznetsov
Message-Id:
Signed-off-by: Paolo Bonzini

Vitaly Kuznetsov
2020-07-09 04:21:36 +0800

02 Jul, 2020

1 commit

1393b4aaf kvm: use more precise cast and do not drop __user ... Browse Code »

Sparse complains on a call to get_compat_sigset, fix it. The "if"
right above explains that sigmask_arg->sigset is basically a
compat_sigset_t.

Signed-off-by: Paolo Bonzini

Paolo Bonzini
2020-07-02 17:39:31 +0800

13 Jun, 2020

1 commit

52cd0d972 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm ... Browse Code »

Pull more KVM updates from Paolo Bonzini:
"The guest side of the asynchronous page fault work has been delayed to
5.9 in order to sync with Thomas's interrupt entry rework, but here's
the rest of the KVM updates for this merge window.

MIPS:
- Loongson port

PPC:
- Fixes

ARM:
- Fixes

x86:
- KVM_SET_USER_MEMORY_REGION optimizations
- Fixes
- Selftest fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (62 commits)
KVM: x86: do not pass poisoned hva to __kvm_set_memory_region
KVM: selftests: fix sync_with_host() in smm_test
KVM: async_pf: Inject 'page ready' event only if 'page not present' was previously injected
KVM: async_pf: Cleanup kvm_setup_async_pf()
kvm: i8254: remove redundant assignment to pointer s
KVM: x86: respect singlestep when emulating instruction
KVM: selftests: Don't probe KVM_CAP_HYPERV_ENLIGHTENED_VMCS when nested VMX is unsupported
KVM: selftests: do not substitute SVM/VMX check with KVM_CAP_NESTED_STATE check
KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
KVM: arm64: Move hyp_symbol_addr() to kvm_asm.h
KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
KVM: arm64: Remove host_cpu_context member from vcpu structure
KVM: arm64: Stop sparse from moaning at __hyp_this_cpu_ptr
KVM: arm64: Handle PtrAuth traps early
KVM: x86: Unexport x86_fpu_cache and make it static
KVM: selftests: Ignore KVM 5-level paging support for VM_MODE_PXXV48_4K
KVM: arm64: Save the host's PtrAuth keys in non-preemptible context
KVM: arm64: Stop save/restoring ACTLR_EL1
KVM: arm64: Add emulation for 32bit guests accessing ACTLR2
...

Linus Torvalds
2020-06-13 02:05:52 +0800

12 Jun, 2020

2 commits

2a18b7e7c KVM: async_pf: Inject 'page ready' event only if 'page not present' was previously injected ... Browse Code »

'Page not present' event may or may not get injected depending on
guest's state. If the event wasn't injected, there is no need to
inject the corresponding 'page ready' event as the guest may get
confused. E.g. Linux thinks that the corresponding 'page not present'
event wasn't delivered *yet* and allocates a 'dummy entry' for it.
This entry is never freed.

Note, 'wakeup all' events have no corresponding 'page not present'
event and always get injected.

s390 seems to always be able to inject 'page not present', the
change is effectively a nop.

Suggested-by: Vivek Goyal
Signed-off-by: Vitaly Kuznetsov
Message-Id:
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=208081
Signed-off-by: Paolo Bonzini

Vitaly Kuznetsov
2020-06-12 00:35:19 +0800
7863e346e KVM: async_pf: Cleanup kvm_setup_async_pf() ... Browse Code »

schedule_work() returns 'false' only when the work is already on the queue
and this can't happen as kvm_setup_async_pf() always allocates a new one.
Also, to avoid potential race, it makes sense to to schedule_work() at the
very end after we've added it to the queue.

While on it, do some minor cleanup. gfn_to_pfn_async() mentioned in a
comment does not currently exist and, moreover, we can check
kvm_is_error_hva() at the very beginning, before we try to allocate work so
'retry_sync' label can go away completely.

Signed-off-by: Vitaly Kuznetsov
Message-Id:
Reviewed-by: Sean Christopherson
Signed-off-by: Paolo Bonzini

Vitaly Kuznetsov
2020-06-12 00:35:19 +0800

10 Jun, 2020

2 commits

d8ed45c5d mmap locking API: use coccinelle to convert mmap_sem rwsem call sites ... Browse Code »

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse
Signed-off-by: Andrew Morton
Reviewed-by: Daniel Jordan
Reviewed-by: Laurent Dufour
Reviewed-by: Vlastimil Babka
Cc: Davidlohr Bueso
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Jason Gunthorpe
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Liam Howlett
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Ying Han
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds

Michel Lespinasse
2020-06-10 00:39:14 +0800
e31cf2f4c mm: don't include asm/pgtable.h if linux/mm.h is already included ... Browse Code »

Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include
in the files that include .

The include statements in such cases are remove with a simple loop:

for f in $(git grep -l "include ") ; do
sed -i -e '/include / d' $f
done

Signed-off-by: Mike Rapoport
Signed-off-by: Andrew Morton
Cc: Arnd Bergmann
Cc: Borislav Petkov
Cc: Brian Cain
Cc: Catalin Marinas
Cc: Chris Zankel
Cc: "David S. Miller"
Cc: Geert Uytterhoeven
Cc: Greentime Hu
Cc: Greg Ungerer
Cc: Guan Xuetao
Cc: Guo Ren
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Ingo Molnar
Cc: Ley Foon Tan
Cc: Mark Salter
Cc: Matthew Wilcox
Cc: Matt Turner
Cc: Max Filippov
Cc: Michael Ellerman
Cc: Michal Simek
Cc: Mike Rapoport
Cc: Nick Hu
Cc: Paul Walmsley
Cc: Richard Weinberger
Cc: Rich Felker
Cc: Russell King
Cc: Stafford Horne
Cc: Thomas Bogendoerfer
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Vincent Chen
Cc: Vineet Gupta
Cc: Will Deacon
Cc: Yoshinori Sato
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds

Mike Rapoport
2020-06-10 00:39:13 +0800

09 Jun, 2020

1 commit

dadbb612f mm/gup.c: convert to use get_user_{page|pages}_fast_only() ... Browse Code »

API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
align with pin_user_pages_fast_only().

As part of this we will get rid of write parameter. Instead caller will
pass FOLL_WRITE to get_user_pages_fast_only(). This will not change any
existing functionality of the API.

All the callers are changed to pass FOLL_WRITE.

Also introduce get_user_page_fast_only(), and use it in a few places
that hard-code nr_pages to 1.

Updated the documentation of the API.

Signed-off-by: Souptick Joarder
Signed-off-by: Andrew Morton
Reviewed-by: John Hubbard
Reviewed-by: Paul Mackerras [arch/powerpc/kvm]
Cc: Matthew Wilcox
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Paolo Bonzini
Cc: Stephen Rothwell
Cc: Mike Rapoport
Cc: Aneesh Kumar K.V
Cc: Michal Suchanek
Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Linus Torvalds

Souptick Joarder
2020-06-09 02:05:56 +0800

08 Jun, 2020

1 commit

e649b3f01 KVM: x86: Fix APIC page invalidation race ... Browse Code »

Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
to fix inappropriate APIC page invalidation by re-introducing arch
specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
kvm_mmu_notifier_invalidate_range_start. However, the patch left a
possible race where the VMCS APIC address cache is updated *before*
it is unmapped:

(Invalidator) kvm_mmu_notifier_invalidate_range_start()
(Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
(KVM VCPU) vcpu_enter_guest()
(KVM VCPU) kvm_vcpu_reload_apic_access_page()
(Invalidator) actually unmap page

Because of the above race, there can be a mismatch between the
host physical address stored in the APIC_ACCESS_PAGE VMCS field and
the host physical address stored in the EPT entry for the APIC GPA
(0xfee0000). When this happens, the processor will not trap APIC
accesses, and will instead show the raw contents of the APIC-access page.
Because Windows OS periodically checks for unexpected modifications to
the LAPIC register, this will show up as a BSOD crash with BugCheck
CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
https://bugzilla.redhat.com/show_bug.cgi?id=1751017.

The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
cannot guarantee that no additional references are taken to the pages in
the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately,
this case is supported by the MMU notifier API, as documented in
include/linux/mmu_notifier.h:

* If the subsystem
* can't guarantee that no additional references are taken to
* the pages in the range, it has to implement the
* invalidate_range() notifier to remove any references taken
* after invalidate_range_start().

The fix therefore is to reload the APIC-access page field in the VMCS
from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().

Cc: stable@vger.kernel.org
Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation")
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
Signed-off-by: Eiichi Tsukata
Message-Id:
Signed-off-by: Paolo Bonzini

Eiichi Tsukata
2020-06-08 21:05:38 +0800

05 Jun, 2020

1 commit

7ec28e264 KVM: Use vmemdup_user() ... Browse Code »

Replace opencoded alloc and copy with vmemdup_user().

Signed-off-by: Denis Efremov
Message-Id:
Signed-off-by: Paolo Bonzini

Denis Efremov
2020-06-05 02:41:05 +0800

04 Jun, 2020

1 commit

d56f5136b KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories ... Browse Code »

After commit 63d0434 ("KVM: x86: move kvm_create_vcpu_debugfs after
last failure point") we are creating the pre-vCPU debugfs files
after the creation of the vCPU file descriptor. This makes it
possible for userspace to reach kvm_vcpu_release before
kvm_create_vcpu_debugfs has finished. The vcpu->debugfs_dentry
then does not have any associated inode anymore, and this causes
a NULL-pointer dereference in debugfs_create_file.

The solution is simply to avoid removing the files; they are
cleaned up when the VM file descriptor is closed (and that must be
after KVM_CREATE_VCPU returns). We can stop storing the dentry
in struct kvm_vcpu too, because it is not needed anywhere after
kvm_create_vcpu_debugfs returns.

Reported-by: syzbot+705f4401d5a93a59b87d@syzkaller.appspotmail.com
Fixes: 63d04348371b ("KVM: x86: move kvm_create_vcpu_debugfs after last failure point")
Signed-off-by: Paolo Bonzini

Paolo Bonzini
2020-06-04 23:00:54 +0800

01 Jun, 2020

2 commits

380609445 Merge tag 'kvmarm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD ... Browse Code »

KVM/arm64 updates for Linux 5.8:

- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches

Paolo Bonzini
2020-06-01 16:26:27 +0800
09d952c97 KVM: check userspace_addr for all memslots ... Browse Code »

The userspace_addr alignment and range checks are not performed for private
memory slots that are prepared by KVM itself. This is unnecessary and makes
it questionable to use __*_user functions to access memory later on. We also
rely on the userspace address being aligned since we have an entire family
of functions to map gfn to pfn.

Fortunately skipping the check is completely unnecessary. Only x86 uses
private memslots and their userspace_addr is obtained from vm_mmap,
therefore it must be below PAGE_OFFSET. In fact, any attempt to pass
an address above PAGE_OFFSET would have failed because such an address
would return true for kvm_is_error_hva.

Reported-by: Linus Torvalds
Signed-off-by: Paolo Bonzini

Paolo Bonzini
2020-06-01 16:26:14 +0800