08 Mar, 2020
1 commit
-
Merge Linux stable release v5.4.24 into imx_5.4.y
* tag 'v5.4.24': (3306 commits)
Linux 5.4.24
blktrace: Protect q->blk_trace with RCU
kvm: nVMX: VMWRITE checks unsupported field before read-only field
...
Signed-off-by: Jason Liu
Conflicts:
arch/arm/boot/dts/imx6sll-evk.dts
arch/arm/boot/dts/imx7ulp.dtsi
arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
drivers/clk/imx/clk-composite-8m.c
drivers/gpio/gpio-mxc.c
drivers/irqchip/Kconfig
drivers/mmc/host/sdhci-of-esdhc.c
drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
drivers/net/can/flexcan.c
drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
drivers/net/ethernet/mscc/ocelot.c
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
drivers/net/phy/realtek.c
drivers/pci/controller/mobiveil/pcie-mobiveil-host.c
drivers/perf/fsl_imx8_ddr_perf.c
drivers/tee/optee/shm_pool.c
drivers/usb/cdns3/gadget.c
kernel/sched/cpufreq.c
net/core/xdp.c
sound/soc/fsl/fsl_esai.c
sound/soc/fsl/fsl_sai.c
sound/soc/sof/core.c
sound/soc/sof/imx/Kconfig
sound/soc/sof/loader.c
05 Mar, 2020
1 commit
-
commit fcfbc617547fc6d9552cb6c1c563b6a90ee98085 upstream.
When reading/writing using the guest/host cache, check for a bad hva
before checking for a NULL memslot, which triggers the slow path for
handling cross-page accesses. Because the memslot is nullified on error
by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
crossing into a new page, then the kvm_{read,write}_guest() slow path
could potentially write/access the first chunk prior to detecting the
bad hva.
Arguably, performing a partial access is semantically correct from an
architectural perspective, but that behavior is certainly not intended.
In the original implementation, memslot was not explicitly nullified
and therefore the partial access behavior varied based on whether the
memslot itself was null, or if the hva was simply bad. The current
behavior was introduced as a seemingly unintentional side effect in
commit f1b9dd5eb86c ("kvm: Disallow wraparound in
kvm_gfn_to_hva_cache_init"), which justified the change with "since some
callers don't check the return code from this function, it seems
prudent to clear ghc->memslot in the event of an error".
Regardless of intent, the partial access is dependent on _not_ checking
the result of the cache initialization, which is arguably a bug in its
own right, at best simply weird.
Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.")
Cc: Jim Mattson
Cc: Andrew Honig
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
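A minimal sketch of the reordered checks described in this entry, modeled on the 5.4-era kvm_write_guest_offset_cached(); the helper name and body here are illustrative, not the verbatim upstream diff:

    int kvm_write_guest_cached_sketch(struct kvm *kvm,
                                      struct gfn_to_hva_cache *ghc,
                                      void *data, unsigned long len)
    {
            gpa_t gpa = ghc->gpa;

            /* check for a bad hva first, so a bad hva always fails ... */
            if (kvm_is_error_hva(ghc->hva))
                    return -EFAULT;

            /* ... and only then take the cross-page slow path, which
             * __kvm_gfn_to_hva_cache_init() signals by nullifying memslot */
            if (unlikely(!ghc->memslot))
                    return kvm_write_guest(kvm, gpa, data, len);

            if (copy_to_user((void __user *)ghc->hva, data, len))
                    return -EFAULT;
            return 0;
    }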
15 Feb, 2020
7 commits
-
commit 4a267aa707953a9a73d1f5dc7f894dd9024a92be upstream.
According to the ARM ARM, registers CNT{P,V}_TVAL_EL0 have bits [63:32]
RES0 [1]. When reading the register, the value is truncated to the least
significant 32 bits [2], and on writes, TimerValue is treated as a signed
32-bit integer [1, 2].
When the guest behaves correctly and writes 32-bit values, treating TVAL
as an unsigned 64 bit register works as expected. However, things start
to break down when the guest writes larger values, because
(u64)0x1_ffff_ffff = 8589934591, but (s32)0x1_ffff_ffff = -1, and the
former will cause the timer interrupt to be asserted in the future, but
the latter will cause it to be asserted now. Let's treat TVAL as a
signed 32-bit register on writes, to match the behaviour described in
the architecture, and the behaviour experimentally exhibited by the
virtual timer on a non-vhe host.
[1] Arm DDI 0487E.a, section D13.8.18
[2] Arm DDI 0487E.a, section D11.2.4
Signed-off-by: Alexandru Elisei
[maz: replaced the read-side mask with lower_32_bits]
Signed-off-by: Marc Zyngier
Fixes: 8fa761624871 ("KVM: arm/arm64: arch_timer: Fix CNTP_TVAL calculation")
Link: https://lore.kernel.org/r/20200127103652.2326-1-alexandru.elisei@arm.com
Signed-off-by: Greg Kroah-Hartman
-
commit aa76829171e98bd75a0cc00b6248eca269ac7f4f upstream.
At the moment a SW_INCR counter always overflows on 32-bit
boundary, independently of whether the n+1th counter is
programmed as CHAIN.
Check whether the SW_INCR counter is a 64b counter and if so,
implement the 64b logic.
Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
Signed-off-by: Eric Auger
Signed-off-by: Marc Zyngier
Link: https://lore.kernel.org/r/20200124142535.29386-4-eric.auger@redhat.com
Signed-off-by: Greg Kroah-Hartman
-
commit 3837407c1aa1101ed5e214c7d6041e7a23335c6e upstream.
The specification says PMSWINC increments PMEVCNTR_EL1 by 1
if PMEVCNTR_EL0 is enabled and configured to count SW_INCR.
For PMEVCNTR_EL0 to be enabled, we need the PMCNTENSET bit to
be set for the corresponding event counter, and we also need
the PMCR.E bit to be set.
Fixes: 7a0adc7064b8 ("arm64: KVM: Add access handler for PMSWINC register")
Signed-off-by: Eric Auger
Signed-off-by: Marc Zyngier
Reviewed-by: Andrew Murray
Acked-by: Marc Zyngier
Link: https://lore.kernel.org/r/20200124142535.29386-2-eric.auger@redhat.com
Signed-off-by: Greg Kroah-Hartman
-
commit 21aecdbd7f3ab02c9b82597dc733ee759fb8b274 upstream.
KVM's inject_abt64() injects an external-abort into an aarch64 guest.
The KVM_CAP_ARM_INJECT_EXT_DABT is intended to do exactly this, but
for an aarch32 guest inject_abt32() injects an implementation-defined
exception, 'Lockdown fault'.
Change this to external abort. For non-LPAE we now get the documented:
| Unhandled fault: external abort on non-linefetch (0x008) at 0x9c800f00
and for LPAE:
| Unhandled fault: synchronous external abort (0x210) at 0x9c800f00
Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
Reported-by: Beata Michalska
Signed-off-by: James Morse
Signed-off-by: Marc Zyngier
Link: https://lore.kernel.org/r/20200121123356.203000-3-james.morse@arm.com
Signed-off-by: Greg Kroah-Hartman
-
commit 018f22f95e8a6c3e27188b7317ef2c70a34cb2cd upstream.
Beata reports that KVM_SET_VCPU_EVENTS doesn't inject the expected
exception to a non-LPAE aarch32 guest.
The host intends to inject DFSR.FS=0x14 "IMPLEMENTATION DEFINED fault
(Lockdown fault)", but the guest receives DFSR.FS=0x04 "Fault on
instruction cache maintenance". This fault is hooked by
do_translation_fault() since ARMv6, which goes on to silently 'handle'
the exception, and restart the faulting instruction.
It turns out, when TTBCR.EAE is clear DFSR is split, and FS[4] has
to shuffle up to DFSR[10].
As KVM only does this in one place, fix up the static values. We
now get the expected:
| Unhandled fault: lock abort (0x404) at 0x9c800f00
Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
Reported-by: Beata Michalska
Signed-off-by: James Morse
Signed-off-by: Marc Zyngier
Link: https://lore.kernel.org/r/20200121123356.203000-2-james.morse@arm.com
Signed-off-by: Greg Kroah-Hartman
-
commit cf2d23e0bac9f6b5cd1cba8898f5f05ead40e530 upstream.
kvm_test_age_hva() is called upon mmu_notifier_test_young(), but the wrong
address range has been passed to handle_hva_to_gpa(). With the wrong
address range, no young bits will be checked in handle_hva_to_gpa().
It means zero is always returned from mmu_notifier_test_young().
This fixes the issue by passing the correct address range to the underlying
function handle_hva_to_gpa(), so that the hardware young (access) bit
will be visited.
Fixes: 35307b9a5f7e ("arm/arm64: KVM: Implement Stage-2 page aging")
Signed-off-by: Gavin Shan
Signed-off-by: Marc Zyngier
Link: https://lore.kernel.org/r/20200121055659.19560-1-gshan@redhat.com
Signed-off-by: Greg Kroah-Hartman
-
commit 8c58be34494b7f1b2adb446e2d8beeb90e5de65b upstream.
Saving/restoring an unmapped collection is a valid scenario. For
example this happens if a MAPTI command was sent, featuring an
unmapped collection. At the moment the CTE fails to be restored.
Only compare against the number of online vcpus if the rdist
base is set.
Fixes: ea1ad53e1e31a ("KVM: arm64: vgic-its: Collection table save/restore")
Signed-off-by: Eric Auger
Signed-off-by: Marc Zyngier
Reviewed-by: Zenghui Yu
Link: https://lore.kernel.org/r/20191213094237.19627-1-eric.auger@redhat.com
Signed-off-by: Greg Kroah-Hartman
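A simplified excerpt of what the restore-side check becomes in vgic_its_restore_cte() (abridged from the upstream fix):

    /* An unmapped collection is a valid CTE: only range-check the
     * target vcpu when the collection is actually mapped. */
    if (target_addr != COLLECTION_NOT_MAPPED &&
        target_addr >= atomic_read(&kvm->online_vcpus))
            return -EINVAL;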
11 Feb, 2020
8 commits
-
[ Upstream commit 42cde48b2d39772dba47e680781a32a6c4b7dc33 ]
Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
on read-only memslots due to gfn_to_hva() assuming writes. Functionally,
this allows x86 to create large mappings for read-only memslots that
are backed by HugeTLB mappings.
Note, the changelog for commit 05da45583de9 ("KVM: MMU: large page
support") states "If the largepage contains write-protected pages, a
large pte is not used.", but "write-protected" refers to pages that are
temporarily read-only, e.g. read-only memslots didn't even exist at the
time.
Fixes: 4d8b81abc47b ("KVM: introduce readonly memslot")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
[Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
Signed-off-by: Paolo Bonzini
Signed-off-by: Sasha Levin
-
[ Upstream commit f9b84e19221efc5f493156ee0329df3142085f28 ]
Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
correct set of memslots is used when handling x86 page faults in SMM.
Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Sasha Levin
-
[ Upstream commit 736c291c9f36b07f8889c61764c28edce20e715d ]
Convert a plethora of parameters and variables in the MMU and page fault
flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
addresses. When TDP is enabled, the fault address is a guest physical
address and thus can be a 64-bit value, even when both KVM and its guest
are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
64-bit field, not a natural width field.
Using a gva_t for the fault address means KVM will incorrectly drop the
upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
translate L2 GPAs to L1 GPAs.
Opportunistically rename variables and parameters to better reflect the
dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
"addr" instead of "vaddr" when the address may be either a GVA or an L2
GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
a confusing "gpa_t gva" declaration; this also sets the stage for a
future patch to combine nonpaging_page_fault() and tdp_page_fault() with
minimal churn.
Sprinkle in a few comments to document flows where an address is known
to be a GVA and thus can be safely truncated to a 32-bit value. Add
WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
document such cases and detect bugs.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Sasha Levin
-
commit 917248144db5d7320655dbb41d3af0b8a0f3d589 upstream.
__kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
* relatively expensive
* in certain cases (such as when done from atomic context) cannot be called
Stashing gfn-to-pfn mapping should help with both cases.
This is part of CVE-2019-3016.
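The stash has the shape below (mirroring the struct this patch adds to include/linux/kvm_types.h); the lookup helper is an illustrative sketch, not the upstream code:

    struct gfn_to_pfn_cache {
            u64 generation;
            gfn_t gfn;
            kvm_pfn_t pfn;
            bool dirty;
    };

    /* illustrative: reuse the stashed pfn while the slot generation and
     * gfn still match, avoiding a fresh gfn_to_pfn_memslot() lookup */
    static kvm_pfn_t cached_gfn_to_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
                                       struct gfn_to_pfn_cache *cache, u64 gen)
    {
            if (cache->pfn && cache->gfn == gfn && cache->generation == gen)
                    return cache->pfn;

            cache->pfn = gfn_to_pfn_memslot(slot, gfn);
            cache->gfn = gfn;
            cache->generation = gen;
            return cache->pfn;
    }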
Signed-off-by: Boris Ostrovsky
Reviewed-by: Joao Martins
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
-
commit 1eff70a9abd46f175defafd29bc17ad456f398a7 upstream.
kvm_vcpu_(un)map operates on gfns from any current address space.
In certain cases we want to make sure we are not mapping SMRAM
and for that we can use kvm_(un)map_gfn() that we are introducing
in this patch.
This is part of CVE-2019-3016.
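A sketch of the resulting API split (simplified; the final upstream helpers also take the pfn cache and an atomic flag): kvm_vcpu_map() resolves the gfn in the vcpu's current address space, which may be SMRAM on x86, while kvm_map_gfn() deliberately uses the default memslots:

    int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
    {
            /* VM-wide memslots: never the SMM address space */
            return __kvm_map_gfn(kvm_memslots(vcpu->kvm), gfn, map);
    }

    int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
    {
            /* the vcpu's current address space, possibly SMRAM */
            return __kvm_map_gfn(kvm_vcpu_memslots(vcpu), gfn, map);
    }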
Signed-off-by: Boris Ostrovsky
Reviewed-by: Joao Martins
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
-
commit b6ae256afd32f96bec0117175b329d0dd617655e upstream.
On AArch64 you can do a sign-extended load to either a 32-bit or 64-bit
register, and we should only sign extend the register up to the width of
the register as specified in the operation (by using the 32-bit Wn or
64-bit Xn register specifier).
As it turns out, the architecture provides this decoding information in
the SF ("Sixty-Four" -- how cute...) bit.Let's take advantage of this with the usual 32-bit/64-bit header file
dance and do the right thing on AArch64 hosts.
Signed-off-by: Christoffer Dall
Signed-off-by: Marc Zyngier
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20191212195055.5541-1-christoffer.dall@arm.com
Signed-off-by: Greg Kroah-Hartman
-
commit 1cfbb484de158e378e8971ac40f3082e53ecca55 upstream.
Confusingly, there are three SPSR layouts that a kernel may need to deal
with:
(1) An AArch64 SPSR_ELx view of an AArch64 pstate
(2) An AArch64 SPSR_ELx view of an AArch32 pstate
(3) An AArch32 SPSR_* view of an AArch32 pstate
When the KVM AArch32 support code deals with SPSR_{EL2,HYP}, it's either
dealing with #2 or #3 consistently. On arm64 the PSR_AA32_* definitions
match the AArch64 SPSR_ELx view, and on arm the PSR_AA32_* definitions
match the AArch32 SPSR_* view.
However, when we inject an exception into an AArch32 guest, we have to
synthesize the AArch32 SPSR_* that the guest will see. Thus, an AArch64
host needs to synthesize layout #3 from layout #2.
This patch adds a new host_spsr_to_spsr32() helper for this, and makes
use of it in the KVM AArch32 support code. For arm64 we need to shuffle
the DIT bit around, and remove the SS bit, while for arm we can use the
value as-is.
I've open-coded the bit manipulation for now to avoid having to rework
the existing PSR_* definitions into PSR64_AA32_* and PSR32_AA32_*
definitions. I hope to perform a more thorough refactoring in future so
that we can handle pstate view manipulation more consistently across the
kernel tree.
Signed-off-by: Mark Rutland
Signed-off-by: Marc Zyngier
Reviewed-by: Alexandru Elisei
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200108134324.46500-4-mark.rutland@arm.com
Signed-off-by: Greg Kroah-Hartman
-
commit 3c2483f15499b877ccb53250d88addb8c91da147 upstream.
When KVM injects an exception into a guest, it generates the CPSR value
from scratch, configuring CPSR.{M,A,I,T,E}, and setting all other
bits to zero.
This isn't correct, as the architecture specifies that some CPSR bits
are (conditionally) cleared or set upon an exception, and others are
unchanged from the original context.
This patch adds logic to match the architectural behaviour. To make this
simple to follow/audit/extend, documentation references are provided,
and bits are configured in order of their layout in SPSR_EL2. This
layout can be seen in the diagram on ARM DDI 0487E.a page C5-426.
Note that this code is used by both arm and arm64, and is intended to
function with the SPSR_EL2 and SPSR_HYP layouts.
Signed-off-by: Mark Rutland
Signed-off-by: Marc Zyngier
Reviewed-by: Alexandru Elisei
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200108134324.46500-3-mark.rutland@arm.com
Signed-off-by: Greg Kroah-Hartman
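An abridged sketch in the spirit of the upstream get_except32_cpsr(); the real logic also consults SCTLR bits (e.g. TE, EE) and covers more fields, and the function name below is illustrative. The masks are the arm64 PSR_AA32_* definitions:

    static unsigned long except32_cpsr_sketch(unsigned long old_cpsr,
                                              unsigned long mode)
    {
            unsigned long cpsr = old_cpsr;

            cpsr &= ~(PSR_AA32_IT_MASK | PSR_AA32_T_BIT); /* cleared on entry */
            cpsr |= PSR_AA32_A_BIT | PSR_AA32_I_BIT;      /* masked on entry */
            cpsr = (cpsr & ~PSR_AA32_MODE_MASK) | mode;   /* target mode */

            /* every other bit keeps its value from the old context */
            return cpsr;
    }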
31 Dec, 2019
1 commit
-
commit 6d674e28f642e3ff676fbae2d8d1b872814d32b6 upstream.
A device mapping is normally always mapped at Stage-2, since there
is very little gain in having it faulted in.
Nonetheless, it is possible to end up in a situation where the device
mapping has been removed from Stage-2 (userspace munmaped the VFIO
region, and the MMU notifier did its job), but present in a userspace
mapping (userspace has mapped it back at the same address). In such
a situation, the device mapping will be demand-paged as the guest
performs memory accesses.
This requires us to be careful when dealing with mapping size, cache
management, and to handle potential execution of a device mapping.
Reported-by: Alexandru Elisei
Signed-off-by: Marc Zyngier
Tested-by: Alexandru Elisei
Reviewed-by: James Morse
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20191211165651.7889-2-maz@kernel.org
Signed-off-by: Greg Kroah-Hartman
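In 5.4's user_mem_abort() terms, the care this entry calls for boils down to classifying the pfn and mapping it with device attributes, without cache maintenance or an executable mapping; a simplified excerpt:

    if (kvm_is_device_pfn(pfn)) {
            /* demand-paged device mapping: Stage-2 device attributes,
             * no dcache/icache maintenance, never executable */
            mem_type = PAGE_S2_DEVICE;
            flags |= KVM_S2PTE_FLAG_IS_IOMAP;
    }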
16 Dec, 2019
1 commit
-
This is the 5.4.3 stable release
Conflicts:
drivers/cpufreq/imx-cpufreq-dt.c
drivers/spi/spi-fsl-qspi.c
The conflict is very minor and was fixed up during the merge. The imx-cpufreq-dt.c
change is just a one-line code-style change; the upstream version was used, with no
functional change. The spi-fsl-qspi.c has minor conflicts when merging the upstream fix
c69b17da53b2 ("spi: spi-fsl-qspi: Clear TDH bits in FLSHCR register").
After the merge, a basic boot sanity test and a basic QSPI test were done on i.MX.
Signed-off-by: Jason Liu
13 Dec, 2019
1 commit
-
commit ca185b260951d3b55108c0b95e188682d8a507b7 upstream.
It's possible for two LPIs to reside in the same "byte_offset" but target
two different vcpus, where their pending status are indicated by two
different pending tables. In such a scenario, using last_byte_offset
optimization will lead KVM to rely on the wrong pending table entry.
Let us use last_ptr instead, which can be treated as a byte index into
a pending table and also can be vcpu specific.
Fixes: 280771252c1b ("KVM: arm64: vgic-v3: KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES")
Cc: stable@vger.kernel.org
Signed-off-by: Zenghui Yu
Signed-off-by: Marc Zyngier
Acked-by: Eric Auger
Link: https://lore.kernel.org/r/20191029071919.177-4-yuzenghui@huawei.com
Signed-off-by: Greg Kroah-Hartman
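A simplified excerpt of the fixed scan in vgic_v3_save_pending_tables(): the cache key is the byte's full guest-physical address, which is naturally per-vcpu because each vcpu has its own pending table:

    gpa_t ptr = pendbase + byte_offset;

    if (ptr != last_ptr) {
            ret = kvm_read_guest_lock(kvm, ptr, &val, 1);
            if (ret)
                    return ret;
            last_ptr = ptr;
    }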
25 Nov, 2019
3 commits
-
FSL-MC bus devices use device IDs from 0x10000 to 0x20000.
So to support MSI interrupts for mc-bus devices we need a
vgic-ITS device-id table of size 2^17 (0x20000 = 131072 = 2^17
entries) to cover the device-id range from 0x10000 to 0x20000.
Signed-off-by: Bharat Bhushan
-
Instead of hardcoding checks for qman cacheable
mmio region physical addresses, extract mapping
information from the user-space mapping.
This involves several steps:
- get access to a pte part of the user-space mapping
by using get_locked_pte() / pte_unmap_unlock() apis
- extract memtype (normal / device), shareability from
the pte
- convert to S2 translation bits in newly added
function stage1_to_stage2_pgprot()
- finish making the s2 translation with the obtained bits
Another explored option was using vm_area_struct::vm_page_prot
which is set in vfio-mc mmap code to the correct page bits.
However, experiments show that these bits are later altered
in the generic mmap code (e.g. the shareability bit is always
set on arm64).
The only place where the original bits can still be found
is the user-space mapping, using the method described above.
Signed-off-by: Laurentiu Tudor
[Bharat - Fixed mem_type check issue]
[changed "ifdef ARM64" to CONFIG_ARM64]
Signed-off-by: Bharat Bhushan
[Ioana - added a sanity check for hugepages]
Signed-off-by: Ioana Ciornei
[Fixed format issues]
Signed-off-by: Diana Craciun
-
Add a parameter allowing callers to specify S2 page table
protection and type bits, and update the callers
accordingly.
The parameter will be used in a forthcoming patch.
Signed-off-by: Laurentiu Tudor
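A hypothetical illustration of such a signature change; kvm_phys_addr_ioremap() does exist in 5.4, but the added parameter shown here is only a guess at this patch's shape:

    /* before */
    int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
                              phys_addr_t pa, unsigned long size,
                              bool writable);

    /* after: callers may pass S2 protection/type bits, e.g. the ones
     * computed by stage1_to_stage2_pgprot() in the entry above */
    int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
                              phys_addr_t pa, unsigned long size,
                              bool writable, pgprot_t prot);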
15 Nov, 2019
1 commit
-
Add a comment explaining the rationale behind having both
no_compat open and ioctl callbacks to fend off compat tasks.
Signed-off-by: Marc Zyngier
Signed-off-by: Paolo Bonzini
14 Nov, 2019
1 commit
-
On a system without KVM_COMPAT, we prevent IOCTLs from being issued
by a compat task. Although this prevents most silly things from
happening, it can still confuse a 32bit userspace that is able
to open the kvm device (the qemu test suite seems to be pretty
mad with this behaviour).
Take a more radical approach and return a -ENODEV to the compat
task.
Reported-by: Peter Maydell
Signed-off-by: Marc Zyngier
Signed-off-by: Paolo Bonzini
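A sketch of the resulting guard, modeled on the upstream change: when KVM_COMPAT is not available, a compat task cannot even open /dev/kvm:

    static int kvm_no_compat_ioctl(struct file *file, unsigned int ioctl,
                                   unsigned long arg)
    {
            return -EINVAL;
    }

    static int kvm_no_compat_open(struct inode *inode, struct file *file)
    {
            /* refuse 32bit tasks up front instead of failing each ioctl */
            return is_compat_task() ? -ENODEV : 0;
    }

    #define KVM_COMPAT(c)   .compat_ioctl = kvm_no_compat_ioctl, \
                            .open         = kvm_no_compat_open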
13 Nov, 2019
1 commit
-
Pull kvm fixes from Paolo Bonzini:
"Fix unwinding of KVM_CREATE_VM failure, VT-d posted interrupts,
DAX/ZONE_DEVICE, and module unload/reload"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
KVM: VMX: Introduce pi_is_pir_empty() helper
KVM: VMX: Do not change PID.NDST when loading a blocked vCPU
KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts
KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON
KVM: X86: Fix initialization of MSR lists
KVM: fix placement of refcount initialization
KVM: Fix NULL-ptr deref after kvm_create_vm fails
12 Nov, 2019
1 commit
-
Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
instead manually handle ZONE_DEVICE on a case-by-case basis. For things
like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
pages, e.g. put pages grabbed via gup(). But for flows such as setting
A/D bits or shifting refcounts for transparent huge pages, KVM needs
to avoid processing ZONE_DEVICE pages as the flows in question lack the
underlying machinery for proper handling of ZONE_DEVICE pages.
This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
when running a KVM guest backed with /dev/dax memory, as KVM straight up
doesn't put any references to ZONE_DEVICE pages acquired by gup().
Note, Dan Williams proposed an alternative solution of doing put_page()
on ZONE_DEVICE pages immediately after gup() in order to simplify the
auditing needed to ensure is_zone_device_page() is called if and only if
the backing device is pinned (via gup()). But that approach would break
kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
unmap() when accessing guest memory, unlike KVM's secondary MMU, which
coordinates with mmu_notifier invalidations to avoid creating stale
page references, i.e. doesn't rely on pages being pinned.
[*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl
Reported-by: Adam Borowski
Analyzed-by: David Hildenbrand
Acked-by: Dan Williams
Cc: stable@vger.kernel.org
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
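The core of the approach, slightly abridged from the upstream patch: stop classifying ZONE_DEVICE pages as reserved, and give the flows that must special-case them an explicit predicate:

    static bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
    {
            /* only meaningful while the page is pinned, e.g. via gup() */
            return pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn));
    }

    bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
    {
            if (pfn_valid(pfn))
                    return PageReserved(pfn_to_page(pfn)) &&
                           !kvm_is_zone_device_pfn(pfn);
            return true;
    }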
11 Nov, 2019
2 commits
-
Reported by syzkaller:
=============================
WARNING: suspicious RCU usage
-----------------------------
./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
no locks held by repro_11/12688.
stack backtrace:
Call Trace:
dump_stack+0x7d/0xc5
lockdep_rcu_suspicious+0x123/0x170
kvm_dev_ioctl+0x9a9/0x1260 [kvm]
do_vfs_ioctl+0x1a1/0xfb0
ksys_ioctl+0x6d/0x80
__x64_sys_ioctl+0x73/0xb0
do_syscall_64+0x108/0xaa0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Commit a97b0e773e4 (kvm: call kvm_arch_destroy_vm if vm creation fails)
sets users_count to 1 before kvm_arch_init_vm(), however, if kvm_arch_init_vm()
fails, we need to decrease this count. By moving it earlier, we can push
the decrease to out_err_no_arch_destroy_vm without introducing yet another
error label.
syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=15209b84e00000
Reported-by: syzbot+75475908cd0910f141ee@syzkaller.appspotmail.com
Fixes: a97b0e773e49 ("kvm: call kvm_arch_destroy_vm if vm creation fails")
Cc: Jim Mattson
Analyzed-by: Wanpeng Li
Signed-off-by: Paolo Bonzini
-
Reported by syzkaller:
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 14727 Comm: syz-executor.3 Not tainted 5.4.0-rc4+ #0
RIP: 0010:kvm_coalesced_mmio_init+0x5d/0x110 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:121
Call Trace:
kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3446 [inline]
kvm_dev_ioctl+0x781/0x1490 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3494
vfs_ioctl fs/ioctl.c:46 [inline]
file_ioctl fs/ioctl.c:509 [inline]
do_vfs_ioctl+0x196/0x1150 fs/ioctl.c:696
ksys_ioctl+0x62/0x90 fs/ioctl.c:713
__do_sys_ioctl fs/ioctl.c:720 [inline]
__se_sys_ioctl fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x6e/0xb0 fs/ioctl.c:718
do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Commit 9121923c457d ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
moves the memslot and bus allocations around; however, if kvm->srcu/irq_srcu fails
initialization, NULL is returned instead of an error code. The NULL is not intercepted
in kvm_dev_ioctl_create_vm() and is dereferenced by kvm_coalesced_mmio_init(). This
patch fixes it.
Moving the initialization is required anyway to avoid an incorrect synchronize_srcu that
was also reported by syzkaller:
wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
__synchronize_srcu+0x197/0x250 kernel/rcu/srcutree.c:921
synchronize_srcu_expedited kernel/rcu/srcutree.c:946 [inline]
synchronize_srcu+0x239/0x3e8 kernel/rcu/srcutree.c:997
kvm_page_track_unregister_notifier+0xe7/0x130 arch/x86/kvm/page_track.c:212
kvm_mmu_uninit_vm+0x1e/0x30 arch/x86/kvm/mmu.c:5828
kvm_arch_destroy_vm+0x4a2/0x5f0 arch/x86/kvm/x86.c:9579
kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:702 [inline]
so do it.
Reported-by: syzbot+89a8060879fa0bd2db4f@syzkaller.appspotmail.com
Reported-by: syzbot+e27e7027eb2b80e44225@syzkaller.appspotmail.com
Fixes: 9121923c457d ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
Cc: Jim Mattson
Cc: Wanpeng Li
Signed-off-by: Paolo Bonzini
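The shape of the fix in kvm_create_vm(), abridged: srcu initialization failures must unwind to an ERR_PTR return rather than handing a NULL back to the caller:

    r = -ENOMEM;
    if (init_srcu_struct(&kvm->srcu))
            goto out_err_no_srcu;
    if (init_srcu_struct(&kvm->irq_srcu))
            goto out_err_no_irq_srcu;
    /* ... remaining setup ... */

    out_err_no_irq_srcu:
            cleanup_srcu_struct(&kvm->srcu);
    out_err_no_srcu:
            /* ... remaining teardown ... */
            return ERR_PTR(r);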
05 Nov, 2019
1 commit
-
The page table pages corresponding to broken down large pages are zapped in
FIFO order, so that the large page can potentially be recovered, if it is
no longer being used for execution. This removes the performance penalty
for walking deeper EPT page tables.
By default, one large page will last about one hour once the guest
reaches a steady state.
Signed-off-by: Junaid Shahid
Signed-off-by: Paolo Bonzini
Signed-off-by: Thomas Gleixner
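A sketch of the FIFO bookkeeping (names modeled on the upstream NX-huge-page series, abridged): shadow pages created by splitting a large page are queued when created, and the recovery worker zaps from the head, i.e. oldest first:

    /* when a large page is broken down for an executable mapping */
    list_add_tail(&sp->lpage_disallowed_link,
                  &kvm->arch.lpage_disallowed_mmu_pages);

    /* recovery worker: reclaim the oldest entry first */
    sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
                          struct kvm_mmu_page, lpage_disallowed_link);
    kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);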
04 Nov, 2019
1 commit
-
Add a function to create a kernel thread associated with a given VM. In
particular, it ensures that the worker thread inherits the priority and
cgroups of the calling thread.
Signed-off-by: Junaid Shahid
Signed-off-by: Paolo Bonzini
Signed-off-by: Thomas Gleixner
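A sketch of the worker's entry point (the upstream helper is kvm_vm_create_worker_thread(); the context struct and field names below are illustrative): before running the real work function, the thread migrates into the caller's cgroups and copies its nice level:

    static int kvm_vm_worker_thread(void *context)
    {
            struct kvm_vm_worker_ctx *init = context;  /* illustrative type */

            /* inherit the cgroups and priority of the creating task */
            cgroup_attach_task_all(init->parent, current);
            set_user_nice(current, task_nice(init->parent));

            return init->thread_fn(init->kvm, init->data);
    }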
31 Oct, 2019
1 commit
-
In kvm_create_vm(), if we've successfully called kvm_arch_init_vm(), but
then fail later in the function, we need to call kvm_arch_destroy_vm()
so that it can do any necessary cleanup (like freeing memory).
Fixes: 44a95dae1d229a ("KVM: x86: Detect and Initialize AVIC support")
Signed-off-by: John Sperbeck
Signed-off-by: Jim Mattson
Reviewed-by: Junaid Shahid
[Remove dependency on "kvm: Don't clear reference count on
kvm_create_vm() error path" which was not committed. - Paolo]
Signed-off-by: Paolo Bonzini
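The resulting error-path shape in kvm_create_vm(), abridged: any failure after kvm_arch_init_vm() succeeds now unwinds through kvm_arch_destroy_vm():

    r = kvm_arch_init_vm(kvm, type);
    if (r)
            goto out_err_no_arch_destroy_vm;  /* arch hook never ran */

    /* ... later failures jump to labels at or above out_err_no_disable ... */

    out_err_no_disable:
            kvm_arch_destroy_vm(kvm);         /* arch cleanup always runs */
    out_err_no_arch_destroy_vm:
            /* free buses, memslots and the VM structure itself */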
25 Oct, 2019
1 commit
-
This reorganization will allow us to call kvm_arch_destroy_vm in the
event that kvm_create_vm fails after calling kvm_arch_init_vm.
Suggested-by: Junaid Shahid
Signed-off-by: Jim Mattson
Reviewed-by: Junaid Shahid
Signed-off-by: Paolo Bonzini
22 Oct, 2019
2 commits
-
…kvmarm/kvmarm into HEAD
KVM/arm fixes for 5.4, take #2
Special PMU edition:
- Fix cycle counter truncation
- Fix cycle counter overflow limit on pure 64bit system
- Allow chained events to be actually functional
- Correct sample period after overflow
-
Don't waste cycles to shrink/grow vCPU halt_poll_ns if host
side polling is disabled.
Acked-by: Marcelo Tosatti
Cc: Marcelo Tosatti
Signed-off-by: Wanpeng Li
Signed-off-by: Paolo Bonzini
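The guard as it looks in kvm_vcpu_block() after the change (lightly abridged): the grow/shrink heuristics only run when the module-level halt_poll_ns is non-zero:

    if (halt_poll_ns) {
            if (block_ns <= vcpu->halt_poll_ns)
                    ;       /* window already covers the block time */
            else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
                    shrink_halt_poll_ns(vcpu);      /* polled too long */
            else if (vcpu->halt_poll_ns < halt_poll_ns &&
                     block_ns < halt_poll_ns)
                    grow_halt_poll_ns(vcpu);        /* start/keep polling */
    } else {
            vcpu->halt_poll_ns = 0;                 /* host polling is off */
    }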
20 Oct, 2019
3 commits
-
The PMU emulation code uses the perf event sample period to trigger
the overflow detection. This works fine for the *first* overflow
handling, but results in a huge number of interrupts on the host,
unrelated to the number of interrupts handled in the guest (a x20
factor is pretty common for the cycle counter). On a slow system
(such as a SW model), this can result in the guest only making
forward progress at a glacial pace.
It turns out that the clue is in the name. The sample period is
exactly that: a period. And once an overflow has occurred,
the following period should be the full width of the associated
counter, instead of whatever the guest had initially programmed.
Reset the sample period to the architected value in the overflow
handler, which now results in a number of host interrupts that is
much closer to the number of interrupts in the guest.
Fixes: b02386eb7dac ("arm64: KVM: Add PMU overflow interrupt routing")
Reviewed-by: Andrew Murray
Signed-off-by: Marc Zyngier
-
The current convention for KVM to request a chained event from the
host PMU is to set bit[0] in attr.config1 (PERF_ATTR_CFG1_KVM_PMU_CHAINED).
But as it turns out, this bit gets set *after* we create the kernel
event that backs our virtual counter, meaning that we never get
a 64bit counter.Moving the setting to an earlier point solves the problem.
Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
Reviewed-by: Andrew Murray
Signed-off-by: Marc Zyngier
-
When a counter is disabled, its value is sampled before the event
is disabled, and the value is written back to the shadow register.
In that process, the value gets truncated to 32bit, which is adequate
for any counter but the cycle counter (defined as a 64bit counter).
This obviously results in a corrupted counter, and things like
"perf record -e cycles" not working at all when run in a guest...
A similar, but less critical bug exists in kvm_pmu_get_counter_value.
Make the truncation conditional on the counter not being the cycle
counter, which results in a minor code reorganisation.Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
Reviewed-by: Andrew Murray
Reported-by: Julien Thierry
Signed-off-by: Marc Zyngier
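A simplified excerpt of the write-back in kvm_pmu_stop_counter() after the fix: truncation now only applies to the 32-bit event counters, never to the 64-bit cycle counter:

    val = kvm_pmu_get_pair_counter_value(vcpu, pmc);

    if (pmc->idx == ARMV8_PMU_CYCLE_IDX) {
            reg = PMCCNTR_EL0;              /* 64bit: keep the full value */
    } else {
            reg = PMEVCNTR0_EL0 + pmc->idx;
            val = lower_32_bits(val);       /* 32bit event counter */
    }
    __vcpu_sys_reg(vcpu, reg) = val;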
03 Oct, 2019
1 commit
-
…kvmarm/kvmarm into HEAD
KVM/arm fixes for 5.4, take #1
- Remove the now obsolete hyp_alternate_select construct
- Fix the TRACE_INCLUDE_PATH macro in the vgic code
01 Oct, 2019
1 commit
-
The largepages debugfs entry is incremented/decremented as shadow
pages are created or destroyed. Clearing it will result in an
underflow, which is harmless to KVM but ugly (and could be
misinterpreted by tools that use debugfs information), so make
this particular statistic read-only.
Cc: kvm-ppc@vger.kernel.org
Signed-off-by: Paolo Bonzini
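One way the read-only behaviour can be expressed with the debugfs API (a sketch assuming the stat is exposed via debugfs_create_file(); the parent and ops names are illustrative): drop the write bits from the mode so user space cannot clear the counter:

    /* 0444: world-readable but not writable, so the "largepages"
     * count cannot be cleared (and thus underflowed) from user space */
    debugfs_create_file("largepages", 0444, kvm_debugfs_dir,
                        stat_data, &stat_fops);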