03 Apr, 2019
1 commit
-
commit ddba91801aeb5c160b660caed1800eb3aef403f8 upstream.
KVM's API requires thats ioctls must be issued from the same process
that created the VM. In other words, userspace can play games with a
VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
creator can do anything useful. Explicitly reject device ioctls that
are issued by a process other than the VM's creator, and update KVM's
API documentation to extend its requirements to device ioctls.Fixes: 852b6d57dc7f ("kvm: add device control API")
Cc:
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
24 Mar, 2019
4 commits
-
commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.
kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound. x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots. Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc:
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit ab2d5eb03dbb7b37a1c6356686fb48626ab0c93e ]
We currently initialize the group of private IRQs during
kvm_vgic_vcpu_init, and the value of the group depends on the GIC model
we are emulating. However, CPUs created before creating (and
initializing) the VGIC might end up with the wrong group if the VGIC
is created as GICv3 later.Since we have no enforced ordering of creating the VGIC and creating
VCPUs, we can end up with part the VCPUs being properly intialized and
the remaining incorrectly initialized. That also means that we have no
single place to do the per-cpu data structure initialization which
depends on knowing the emulated GIC model (which is only the group
field).This patch removes the incorrect comment from kvm_vgic_vcpu_init and
initializes the group of all previously created VCPUs's private
interrupts in vgic_init in addition to the existing initialization in
kvm_vgic_vcpu_init.Signed-off-by: Christoffer Dall
Signed-off-by: Marc Zyngier
Signed-off-by: Sasha Levin -
[ Upstream commit 358b28f09f0ab074d781df72b8a671edb1547789 ]
The current kvm_psci_vcpu_on implementation will directly try to
manipulate the state of the VCPU to reset it. However, since this is
not done on the thread that runs the VCPU, we can end up in a strangely
corrupted state when the source and target VCPUs are running at the same
time.Fix this by factoring out all reset logic from the PSCI implementation
and forwarding the required information along with a request to the
target VCPU.Reviewed-by: Andrew Jones
Signed-off-by: Marc Zyngier
Signed-off-by: Christoffer Dall
Signed-off-by: Sasha Levin -
[ Upstream commit fc3bc475231e12e9c0142f60100cf84d077c79e1 ]
vgic_dist->lpi_list_lock must always be taken with interrupts disabled as
it is used in interrupt context.For configurations such as PREEMPT_RT_FULL, this means that it should
be a raw_spinlock since RT spinlocks are interruptible.Signed-off-by: Julien Thierry
Acked-by: Christoffer Dall
Acked-by: Marc Zyngier
Signed-off-by: Christoffer Dall
Signed-off-by: Sasha Levin
13 Feb, 2019
3 commits
-
commit cfa39381173d5f969daf43582c95ad679189cbc9 upstream.
kvm_ioctl_create_device() does the following:
1. creates a device that holds a reference to the VM object (with a borrowed
reference, the VM's refcount has not been bumped yet)
2. initializes the device
3. transfers the reference to the device to the caller's file descriptor table
4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
referenceThe ownership transfer in step 3 must not happen before the reference to the VM
becomes a proper, non-borrowed reference, which only happens in step 4.
After step 3, an attacker can close the file descriptor and drop the borrowed
reference, which can cause the refcount of the kvm object to drop to zero.This means that we need to grab a reference for the device before
anon_inode_getfd(), otherwise the VM can disappear from under us.Fixes: 852b6d57dc7f ("kvm: add device control API")
Cc: stable@kernel.org
Signed-off-by: Jann Horn
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 7a86dab8cf2f0fdf508f3555dddfc236623bff60 ]
Since the offset is added directly to the hva from the
gfn_to_hva_cache, a negative offset could result in an out of bounds
write. The existing BUG_ON only checks for addresses beyond the end of
the gfn_to_hva_cache, not for addresses before the start of the
gfn_to_hva_cache.Note that all current call sites have non-negative offsets.
Fixes: 4ec6e8636256 ("kvm: Introduce kvm_write_guest_offset_cached()")
Reported-by: Cfir Cohen
Signed-off-by: Jim Mattson
Reviewed-by: Cfir Cohen
Reviewed-by: Peter Shier
Reviewed-by: Krish Sadhukhan
Reviewed-by: Sean Christopherson
Signed-off-by: Radim Krčmář
Signed-off-by: Sasha Levin -
[ Upstream commit 0d640732dbebed0f10f18526de21652931f0b2f2 ]
When we emulate an MMIO instruction, we advance the CPU state within
decode_hsr(), before emulating the instruction effects.Having this logic in decode_hsr() is opaque, and advancing the state
before emulation is problematic. It gets in the way of applying
consistent single-step logic, and it prevents us from being able to fail
an MMIO instruction with a synchronous exception.Clean this up by only advancing the CPU state *after* the effects of the
instruction are emulated.Cc: Peter Maydell
Reviewed-by: Alex Bennée
Reviewed-by: Christoffer Dall
Signed-off-by: Mark Rutland
Signed-off-by: Marc Zyngier
Signed-off-by: Sasha Levin
17 Jan, 2019
1 commit
-
commit fb544d1ca65a89f7a3895f7531221ceeed74ada7 upstream.
We recently addressed a VMID generation race by introducing a read/write
lock around accesses and updates to the vmid generation values.However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but
does so without taking the read lock.As far as I can tell, this can lead to the same kind of race:
VM 0, VCPU 0 VM 0, VCPU 1
------------ ------------
update_vttbr (vmid 254)
update_vttbr (vmid 1) // roll over
read_lock(kvm_vmid_lock);
force_vm_exit()
local_irq_disable
need_new_vmid_gen == false //because vmid gen matchesenter_guest (vmid 254)
kvm_arch.vttbr = :
read_unlock(kvm_vmid_lock);enter_guest (vmid 1)
Which results in running two VCPUs in the same VM with different VMIDs
and (even worse) other VCPUs from other VMs could now allocate clashing
VMID 254 from the new generation as long as VCPU 0 is not exiting.Attempt to solve this by making sure vttbr is updated before another CPU
can observe the updated VMID generation.Cc: stable@vger.kernel.org
Fixes: f0cf47d939d0 "KVM: arm/arm64: Close VMID generation race"
Reviewed-by: Julien Thierry
Signed-off-by: Christoffer Dall
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman
10 Jan, 2019
5 commits
-
commit c23b2e6fc4ca346018618266bcabd335c0a8a49e upstream.
When using the nospec API, it should be taken into account that:
"...if the CPU speculates past the bounds check then
* array_index_nospec() will clamp the index within the range of [0,
* size)."The above is part of the header for macro array_index_nospec() in
linux/nospec.hNow, in this particular case, if intid evaluates to exactly VGIC_MAX_SPI
or to exaclty VGIC_MAX_PRIVATE, the array_index_nospec() macro ends up
returning VGIC_MAX_SPI - 1 or VGIC_MAX_PRIVATE - 1 respectively, instead
of VGIC_MAX_SPI or VGIC_MAX_PRIVATE, which, based on the original logic:/* SGIs and PPIs */
if (intid arch.vgic_cpu.private_irqs[intid];/* SPIs */
if (intid arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];are valid values for intid.
Fix this by calling array_index_nospec() macro with VGIC_MAX_PRIVATE + 1
and VGIC_MAX_SPI + 1 as arguments for its parameter size.Fixes: 41b87599c743 ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()")
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva
[dropped the SPI part which was fixed separately]
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman -
commit 60c3ab30d8c2ff3a52606df03f05af2aae07dc6b upstream.
When restoring the active state from userspace, we don't know which CPU
was the source for the active state, and this is not architecturally
exposed in any of the register state.Set the active_source to 0 in this case. In the future, we can expand
on this and exposse the information as additional information to
userspace for GICv2 if anyone cares.Cc: stable@vger.kernel.org
Signed-off-by: Christoffer Dall
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman -
commit bea2ef803ade3359026d5d357348842bca9edcf1 upstream.
SPIs should be checked against the VMs specific configuration, and
not the architectural maximum.Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman -
commit 2e2f6c3c0b08eed3fcf7de3c7684c940451bdeb1 upstream.
To change the active state of an MMIO, halt is requested for all vcpus of
the affected guest before modifying the IRQ state. This is done by calling
cond_resched_lock() in vgic_mmio_change_active(). However interrupts are
disabled at this point and we cannot reschedule a vcpu.We actually don't need any of this, as kvm_arm_halt_guest ensures that
all the other vcpus are out of the guest. Let's just drop that useless
code.Signed-off-by: Julien Thierry
Suggested-by: Christoffer Dall
Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman -
commit 107352a24900fb458152b92a4e72fbdc83fd5510 upstream.
We currently only halt the guest when a vCPU messes with the active
state of an SPI. This is perfectly fine for GICv2, but isn't enough
for GICv3, where all vCPUs can access the state of any other vCPU.Let's broaden the condition to include any GICv3 interrupt that
has an active state (i.e. all but LPIs).Cc: stable@vger.kernel.org
Reviewed-by: Christoffer Dall
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman
14 Nov, 2018
2 commits
-
commit da5a3ce66b8bb51b0ea8a89f42aac153903f90fb upstream.
At boot time, KVM stashes the host MDCR_EL2 value, but only does this
when the kernel is not running in hyp mode (i.e. is non-VHE). In these
cases, the stashed value of MDCR_EL2.HPMN happens to be zero, which can
lead to CONSTRAINED UNPREDICTABLE behaviour.Since we use this value to derive the MDCR_EL2 value when switching
to/from a guest, after a guest have been run, the performance counters
do not behave as expected. This has been observed to result in accesses
via PMXEVTYPER_EL0 and PMXEVCNTR_EL0 not affecting the relevant
counters, resulting in events not being counted. In these cases, only
the fixed-purpose cycle counter appears to work as expected.Fix this by always stashing the host MDCR_EL2 value, regardless of VHE.
Cc: Christopher Dall
Cc: James Morse
Cc: Will Deacon
Cc: stable@vger.kernel.org
Fixes: 1e947bad0b63b351 ("arm64: KVM: Skip HYP setup when already running in HYP")
Tested-by: Robin Murphy
Signed-off-by: Mark Rutland
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman -
commit fd2ef358282c849c193aa36dadbf4f07f7dcd29b upstream.
PageTransCompoundMap() returns true for hugetlbfs and THP
hugepages. This behaviour incorrectly leads to stage 2 faults for
unsupported hugepage sizes (e.g., 64K hugepage with 4K pages) to be
treated as THP faults.Tighten the check to filter out hugetlbfs pages. This also leads to
consistently mapping all unsupported hugepage sizes as PTE level
entries at stage 2.Signed-off-by: Punit Agrawal
Reviewed-by: Suzuki Poulose
Cc: Christoffer Dall
Cc: Marc Zyngier
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman
07 Sep, 2018
2 commits
-
kvm_unmap_hva is long gone, and we only have kvm_unmap_hva_range to
deal with. Drop the now obsolete code.Fixes: fb1522e099f0 ("KVM: update to new mmu_notifier semantic v2")
Cc: James Hogan
Reviewed-by: Paolo Bonzini
Signed-off-by: Marc Zyngier
Signed-off-by: Christoffer Dall -
When triggering a CoW, we unmap the RO page via an MMU notifier
(invalidate_range_start), and then populate the new PTE using another
one (change_pte). In the meantime, we'll have copied the old page
into the new one.The problem is that the data for the new page is sitting in the
cache, and should the guest have an uncached mapping to that page
(or its MMU off), following accesses will bypass the cache.In a way, this is similar to what happens on a translation fault:
We need to clean the page to the PoC before mapping it. So let's just
do that.This fixes a KVM unit test regression observed on a HiSilicon platform,
and subsequently reproduced on Seattle.Fixes: a9c0e12ebee5 ("KVM: arm/arm64: Only clean the dcache on translation fault")
Cc: stable@vger.kernel.org # v4.16+
Reported-by: Mike Galbraith
Signed-off-by: Marc Zyngier
Signed-off-by: Christoffer Dall
23 Aug, 2018
3 commits
-
Pull second set of KVM updates from Paolo Bonzini:
"ARM:
- Support for Group0 interrupts in guests
- Cache management optimizations for ARMv8.4 systems
- Userspace interface for RAS
- Fault path optimization
- Emulated physical timer fixes
- Random cleanupsx86:
- fixes for L1TF
- a new test case
- non-support for SGX (inject the right exception in the guest)
- fix lockdep false positive"* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits)
KVM: VMX: fixes for vmentry_l1d_flush module parameter
kvm: selftest: add dirty logging test
kvm: selftest: pass in extra memory when create vm
kvm: selftest: include the tools headers
kvm: selftest: unify the guest port macros
tools: introduce test_and_clear_bit
KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled
KVM: vmx: Inject #UD for SGX ENCLS instruction in guest
KVM: vmx: Add defines for SGX ENCLS exiting
x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush()
x86: kvm: avoid unused variable warning
KVM: Documentation: rename the capability of KVM_CAP_ARM_SET_SERROR_ESR
KVM: arm/arm64: Skip updating PTE entry if no change
KVM: arm/arm64: Skip updating PMD entry if no change
KVM: arm: Use true and false for boolean values
KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled
KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h
KVM: arm: vgic-v3: Add support for ICC_SGI0R and ICC_ASGI1R accesses
KVM: arm64: vgic-v3: Add support for ICC_SGI0R_EL1 and ICC_ASGI1R_EL1 accesses
KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs
... -
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton : (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
... -
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held. Moreover
majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range. Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false. This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that. The first
part can be done without a sleeping lock in most cases AFAICS.The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom. This can be done e.g. after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small. Then we are looking for a proper process tear down.[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Christian König # AMD notifiers
Acked-by: Leon Romanovsky # mlx and umem_odp
Reported-by: David Rientjes
Cc: "David (ChunMing) Zhou"
Cc: Paolo Bonzini
Cc: Alex Deucher
Cc: David Airlie
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Rodrigo Vivi
Cc: Doug Ledford
Cc: Jason Gunthorpe
Cc: Mike Marciniszyn
Cc: Dennis Dalessandro
Cc: Sudeep Dutt
Cc: Ashutosh Dixit
Cc: Dimitri Sivanich
Cc: Boris Ostrovsky
Cc: Juergen Gross
Cc: "Jérôme Glisse"
Cc: Andrea Arcangeli
Cc: Felix Kuehling
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Aug, 2018
2 commits
-
…marm/kvmarm into HEAD
KVM/arm updates for 4.19
- Support for Group0 interrupts in guests
- Cache management optimizations for ARMv8.4 systems
- Userspace interface for RAS, allowing error retrival and injection
- Fault path optimization
- Emulated physical timer fixes
- Random cleanups -
…iederm/user-namespace
Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
20 Aug, 2018
1 commit
-
Pull first set of KVM updates from Paolo Bonzini:
"PPC:
- minor code cleanupsx86:
- PCID emulation and CR3 caching for shadow page tables
- nested VMX live migration
- nested VMCS shadowing
- optimized IPI hypercall
- some optimizationsARM will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
KVM: X86: Implement PV IPIs in linux guest
KVM: X86: Add kvm hypervisor init time platform setup callback
KVM: X86: Implement "send IPI" hypercall
KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
KVM: x86: Skip pae_root shadow allocation if tdp enabled
KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
KVM: vmx: move struct host_state usage to struct loaded_vmcs
KVM: vmx: compute need to reload FS/GS/LDT on demand
KVM: nVMX: remove a misleading comment regarding vmcs02 fields
KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
KVM: vmx: add dedicated utility to access guest's kernel_gs_base
KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
KVM: vmx: refactor segmentation code in vmx_save_host_state()
kvm: nVMX: Fix fault priority for VMX operations
kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
...
15 Aug, 2018
1 commit
-
Pull arm64 updates from Will Deacon:
"A bunch of good stuff in here. Worth noting is that we've pulled in
the x86/mm branch from -tip so that we can make use of the core
ioremap changes which allow us to put down huge mappings in the
vmalloc area without screwing up the TLB. Much of the positive
diffstat is because of the rseq selftest for arm64.Summary:
- Wire up support for qspinlock, replacing our trusty ticket lock
code- Add an IPI to flush_icache_range() to ensure that stale
instructions fetched into the pipeline are discarded along with the
I-cache lines- Support for the GCC "stackleak" plugin
- Support for restartable sequences, plus an arm64 port for the
selftest- Kexec/kdump support on systems booting with ACPI
- Rewrite of our syscall entry code in C, which allows us to zero the
GPRs on entry from userspace- Support for chained PMU counters, allowing 64-bit event counters to
be constructed on current CPUs- Ensure scheduler topology information is kept up-to-date with CPU
hotplug events- Re-enable support for huge vmalloc/IO mappings now that the core
code has the correct hooks to use break-before-make sequences- Miscellaneous, non-critical fixes and cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
arm64: alternative: Use true and false for boolean values
arm64: kexec: Add comment to explain use of __flush_icache_range()
arm64: sdei: Mark sdei stack helper functions as static
arm64, kaslr: export offset in VMCOREINFO ELF notes
arm64: perf: Add cap_user_time aarch64
efi/libstub: Only disable stackleak plugin for arm64
arm64: drop unused kernel_neon_begin_partial() macro
arm64: kexec: machine_kexec should call __flush_icache_range
arm64: svc: Ensure hardirq tracing is updated before return
arm64: mm: Export __sync_icache_dcache() for xen-privcmd
drivers/perf: arm-ccn: Use devm_ioremap_resource() to map memory
arm64: Add support for STACKLEAK gcc plugin
arm64: Add stack information to on_accessible_stack
drivers/perf: hisi: update the sccl_id/ccl_id when MT is supported
arm64: fix ACPI dependencies
rseq/selftests: Add support for arm64
arm64: acpi: fix alignment fault in accessing ACPI
efi/arm: map UEFI memory map even w/o runtime services enabled
efi/arm: preserve early mapping of UEFI memory map longer for BGRT
drivers: acpi: add dependency of EFI for arm64
...
13 Aug, 2018
2 commits
-
When there is contention on faulting in a particular page table entry
at stage 2, the break-before-make requirement of the architecture can
lead to additional refaulting due to TLB invalidation.Avoid this by skipping a page table update if the new value of the PTE
matches the previous value.Cc: stable@vger.kernel.org
Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
Reviewed-by: Suzuki Poulose
Acked-by: Christoffer Dall
Signed-off-by: Punit Agrawal
Signed-off-by: Marc Zyngier -
Contention on updating a PMD entry by a large number of vcpus can lead
to duplicate work when handling stage 2 page faults. As the page table
update follows the break-before-make requirement of the architecture,
it can lead to repeated refaults due to clearing the entry and
flushing the tlbs.This problem is more likely when -
* there are large number of vcpus
* the mapping is large block mappingsuch as when using PMD hugepages (512MB) with 64k pages.
Fix this by skipping the page table update if there is no change in
the entry being updated.Cc: stable@vger.kernel.org
Fixes: ad361f093c1e ("KVM: ARM: Support hugetlbfs backed huge pages")
Reviewed-by: Suzuki Poulose
Acked-by: Christoffer Dall
Signed-off-by: Punit Agrawal
Signed-off-by: Marc Zyngier
12 Aug, 2018
3 commits
-
kvm_vgic_sync_hwstate is only called with IRQ being disabled.
There is thus no need to call spin_lock_irqsave/restore in
vgic_fold_lr_state and vgic_prune_ap_list.This patch replace them with the non irq-safe version.
Signed-off-by: Jia He
Acked-by: Christoffer Dall
[maz: commit message tidy-up]
Signed-off-by: Marc Zyngier -
DEBUG_SPINLOCK_BUG_ON can be used with both vgic-v2 and vgic-v3,
so let's move it to vgic.hSigned-off-by: Jia He
[maz: commit message tidy-up]
Signed-off-by: Marc Zyngier -
Although vgic-v3 now supports Group0 interrupts, it still doesn't
deal with Group0 SGIs. As usually with the GIC, nothing is simple:- ICC_SGI1R can signal SGIs of both groups, since GICD_CTLR.DS==1
with KVM (as per 8.1.10, Non-secure EL1 access)- ICC_SGI0R can only generate Group0 SGIs
- ICC_ASGI1R sees its scope refocussed to generate only Group0
SGIs (as per the note at the bottom of Table 8-14)We only support Group1 SGIs so far, so no material change.
Reviewed-by: Eric Auger
Reviewed-by: Christoffer Dall
Signed-off-by: Marc Zyngier
06 Aug, 2018
4 commits
-
We are currently cutting hva_to_pfn_fast short if we do not want an
immediate exit, which is represented by !async && !atomic. However,
this is unnecessary, and __get_user_pages_fast is *much* faster
because the regular get_user_pages takes pmd_lock/pte_lock.
In fact, when many CPUs take a nested vmexit at the same time
the contention on those locks is visible, and this patch removes
about 25% (compared to 4.18) from vmexit.flat on a 16 vCPU
nested guest.Suggested-by: Andrea Arcangeli
Signed-off-by: Paolo Bonzini -
This patch is to provide a way for platforms to register hv tlb remote
flush callback and this helps to optimize operation of tlb flush
among vcpus for nested virtualization case.Signed-off-by: Lan Tianyu
Signed-off-by: Paolo Bonzini -
Use the fast CR3 switch mechanism to locklessly change the MMU root
page when switching between L1 and L2. The switch from L2 to L1 should
always go through the fast path, while the switch from L1 to L2 should
go through the fast path if L1's CR3/EPTP for L2 hasn't changed
since the last time.Signed-off-by: Junaid Shahid
Signed-off-by: Paolo Bonzini -
Pull bug fixes into the KVM development tree to avoid nasty conflicts.
31 Jul, 2018
2 commits
-
When the VCPU is blocked (for example from WFI) we don't inject the
physical timer interrupt if it should fire while the CPU is blocked, but
instead we just wake up the VCPU and expect kvm_timer_vcpu_load to take
care of injecting the interrupt.Unfortunately, kvm_timer_vcpu_load() doesn't actually do that, it only
has support to schedule a soft timer if the emulated phys timer is
expected to fire in the future.Follow the same pattern as kvm_timer_update_state() and update the irq
state after potentially scheduling a soft timer.Reported-by: Andre Przywara
Cc: Stable # 4.15+
Fixes: bbdd52cfcba29 ("KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exit")
Signed-off-by: Christoffer Dall
Signed-off-by: Marc Zyngier -
kvm_timer_update_state() is called when changing the phys timer
configuration registers, either via vcpu reset, as a result of a trap
from the guest, or when userspace programs the registers.phys_timer_emulate() is in turn called by kvm_timer_update_state() to
either cancel an existing software timer, or program a new software
timer, to emulate the behavior of a real phys timer, based on the change
in configuration registers.Unfortunately, the interaction between these two functions left a small
race; if the conceptual emulated phys timer should actually fire, but
the soft timer hasn't executed its callback yet, we cancel the timer in
phys_timer_emulate without injecting an irq. This only happens if the
check in kvm_timer_update_state is called before the timer should fire,
which is relatively unlikely, but possible.The solution is to update the state of the phys timer after calling
phys_timer_emulate, which will pick up the pending timer state and
update the interrupt value.Note that this leaves the opportunity of raising the interrupt twice,
once in the just-programmed soft timer, and once in
kvm_timer_update_state. Since this always happens synchronously with
the VCPU execution, there is no harm in this, and the guest ever only
sees a single timer interrupt.Cc: Stable # 4.15+
Signed-off-by: Christoffer Dall
Signed-off-by: Marc Zyngier
25 Jul, 2018
1 commit
-
Signed-off-by: Ingo Molnar
24 Jul, 2018
1 commit
-
It's possible for userspace to control n. Sanitize n when using it as an
array index, to inhibit the potential spectre-v1 write gadget.Note that while it appears that n must be bound to the interval [0,3]
due to the way it is extracted from addr, we cannot guarantee that
compiler transformations (and/or future refactoring) will ensure this is
the case, and given this is a slow path it's better to always perform
the masking.Found by smatch.
Signed-off-by: Mark Rutland
Cc: Christoffer Dall
Cc: Marc Zyngier
Cc: kvmarm@lists.cs.columbia.edu
Signed-off-by: Marc Zyngier
21 Jul, 2018
2 commits
-
Signed-off-by: "Eric W. Biederman"
-
arm64's new use of KVMs get_events/set_events API calls isn't just
or RAS, it allows an SError that has been made pending by KVM as
part of its device emulation to be migrated.Wire this up for 32bit too.
We only need to read/write the HCR_VA bit, and check that no esr has
been provided, as we don't yet support VDFSR.Signed-off-by: James Morse
Reviewed-by: Dongjiu Geng
Signed-off-by: Marc Zyngier