25 Mar, 2020
1 commit
-
commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream.
Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
the vunmap() code-path. While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.To avoid the unnecessary work on x86-64 and to gain the performance
back, split up vmalloc_sync_all() into two functions:* vmalloc_sync_mappings(), and
* vmalloc_sync_unmappings()Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized. The only exception is the new call-site added in the
above mentioned commit.Shile Zhang directed us to a report of an 80% regression in reaim
throughput.Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Reported-by: kernel test robot
Reported-by: Shile Zhang
Signed-off-by: Joerg Roedel
Signed-off-by: Andrew Morton
Tested-by: Borislav Petkov
Acked-by: Rafael J. Wysocki [GHES]
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc:
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
18 Mar, 2020
5 commits
-
commit 59b5809655bdafb0767d3fd00a3e41711aab07e6 upstream.
There are two implemented bits in the PPIN_CTL MSR:
Bit 0: LockOut (R/WO)
Set 1 to prevent further writes to MSR_PPIN_CTL.Bit 1: Enable_PPIN (R/W)
If 1, enables MSR_PPIN to be accessible using RDMSR.
If 0, an attempt to read MSR_PPIN will cause #GP.So there are four defined values:
0: PPIN is disabled, PPIN_CTL may be updated
1: PPIN is disabled. PPIN_CTL is locked against updates
2: PPIN is enabled. PPIN_CTL may be updated
3: PPIN is enabled. PPIN_CTL is locked against updatesCode would only enable the X86_FEATURE_INTEL_PPIN feature for case "2".
When it should have done so for both case "2" and case "3".Fix the final test to just check for the enable bit. Also fix some of
the other comments in this function.Fixes: 3f5a7896a509 ("x86/mce: Include the PPIN in MCE records when available")
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Cc:
Link: https://lkml.kernel.org/r/20200226011737.9958-1-tony.luck@intel.com
Signed-off-by: Greg Kroah-Hartman -
commit f967140dfb7442e2db0868b03b961f9c59418a1b upstream.
Enable the sampling check in kernel/events/core.c::perf_event_open(),
which returns the more appropriate -EOPNOTSUPP.BEFORE:
$ sudo perf record -a -e instructions,l3_request_g1.caching_l3_cache_accesses true
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (l3_request_g1.caching_l3_cache_accesses).
/bin/dmesg | grep -i perf may provide additional information.With nothing relevant in dmesg.
AFTER:
$ sudo perf record -a -e instructions,l3_request_g1.caching_l3_cache_accesses true
Error:
l3_request_g1.caching_l3_cache_accesses: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'Fixes: c43ca5091a37 ("perf/x86/amd: Add support for AMD NB and L2I "uncore" counters")
Signed-off-by: Kim Phillips
Signed-off-by: Borislav Petkov
Acked-by: Peter Zijlstra
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200311191323.13124-1-kim.phillips@amd.com
Signed-off-by: Greg Kroah-Hartman -
commit 985e537a4082b4635754a57f4f95430790afee6a upstream.
The dmidecode program fails to properly decode the SMBIOS data supplied
by OVMF/UEFI when running in an SEV guest. The SMBIOS area, under SEV, is
encrypted and resides in reserved memory that is marked as EFI runtime
services data.As a result, when memremap() is attempted for the SMBIOS data, it
can't be mapped as regular RAM (through try_ram_remap()) and, since
the address isn't part of the iomem resources list, it isn't mapped
encrypted through the fallback ioremap().Add a new __ioremap_check_other() to deal with memory types like
EFI_RUNTIME_SERVICES_DATA which are not covered by the resource ranges.This allows any runtime services data which has been created encrypted,
to be mapped encrypted too.[ bp: Move functionality to a separate function. ]
Signed-off-by: Tom Lendacky
Signed-off-by: Borislav Petkov
Reviewed-by: Joerg Roedel
Tested-by: Joerg Roedel
Cc: # 5.3
Link: https://lkml.kernel.org/r/2d9e16eb5b53dc82665c95c6764b7407719df7a0.1582645327.git.thomas.lendacky@amd.com
Signed-off-by: Greg Kroah-Hartman -
commit 95fa10103dabc38be5de8efdfced5e67576ed896 upstream.
When an EVMCS enabled L1 guest on KVM will tries doing enlightened VMEnter
with EVMCS GPA = 0 the host crashes because theevmcs_gpa != vmx->nested.hv_evmcs_vmptr
condition in nested_vmx_handle_enlightened_vmptrld() will evaluate to
false (as nested.hv_evmcs_vmptr is zeroed after init). The crash will
happen on vmx->nested.hv_evmcs pointer dereference.Another problematic EVMCS ptr value is '-1' but it only causes host crash
after nested_release_evmcs() invocation. The problem is exactly the same as
with '0', we mistakenly think that the EVMCS pointer hasn't changed and
thus nested.hv_evmcs_vmptr is valid.Resolve the issue by adding an additional !vmx->nested.hv_evmcs
check to nested_vmx_handle_enlightened_vmptrld(), this way we will
always be trying kvm_vcpu_map() when nested.hv_evmcs is NULL
and this is supposed to catch all invalid EVMCS GPAs.Also, initialize hv_evmcs_vmptr to '0' in nested_release_evmcs()
to be consistent with initialization where we don't currently
set hv_evmcs_vmptr to '-1'.Cc: stable@vger.kernel.org
Signed-off-by: Vitaly Kuznetsov
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 342993f96ab24d5864ab1216f46c0b199c2baf8e upstream.
After commit 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest
mode") Hyper-V guests on KVM stopped booting with:kvm_nested_vmexit: rip fffff802987d6169 reason EPT_VIOLATION info1 181
info2 0 int_info 0 int_info_err 0
kvm_page_fault: address febd0000 error_code 181
kvm_emulate_insn: 0:fffff802987d6169: f3 a5
kvm_emulate_insn: 0:fffff802987d6169: f3 a5 FAIL
kvm_inj_exception: #UD (0x0)"f3 a5" is a "rep movsw" instruction, which should not be intercepted
at all. Commit c44b4c6ab80e ("KVM: emulate: clean up initializations in
init_decode_cache") reduced the number of fields cleared by
init_decode_cache() claiming that they are being cleared elsewhere,
'intercept', however, is left uncleared if the instruction does not have
any of the "slow path" flags (NotImpl, Stack, Op3264, Sse, Mmx, CheckPerm,
NearBranch, No16 and of course Intercept itself).Fixes: c44b4c6ab80e ("KVM: emulate: clean up initializations in init_decode_cache")
Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
Cc: stable@vger.kernel.org
Suggested-by: Paolo Bonzini
Signed-off-by: Vitaly Kuznetsov
Reviewed-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
12 Mar, 2020
5 commits
-
commit 8319e9d5ad98ffccd19f35664382c73cea216193 upstream.
The mixed mode runtime wrappers are fragile when it comes to how the
memory referred to by its pointer arguments are laid out in memory, due
to the fact that it translates these addresses to physical addresses that
the runtime services can dereference when running in 1:1 mode. Since
vmalloc'ed pages (including the vmap'ed stack) are not contiguous in the
physical address space, this scheme only works if the referenced memory
objects do not cross page boundaries.Currently, the mixed mode runtime service wrappers require that all by-ref
arguments that live in the vmalloc space have a size that is a power of 2,
and are aligned to that same value. While this is a sensible way to
construct an object that is guaranteed not to cross a page boundary, it is
overly strict when it comes to checking whether a given object violates
this requirement, as we can simply take the physical address of the first
and the last byte, and verify that they point into the same physical page.When this check fails, we emit a WARN(), but then simply proceed with the
call, which could cause data corruption if the next physical page belongs
to a mapping that is entirely unrelated.Given that with vmap'ed stacks, this condition is much more likely to
trigger, let's relax the condition a bit, but fail the runtime service
call if it does trigger.Fixes: f6697df36bdf0bf7 ("x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y")
Signed-off-by: Ard Biesheuvel
Signed-off-by: Ingo Molnar
Cc: linux-efi@vger.kernel.org
Cc: Ingo Molnar
Cc: Thomas Gleixner
Link: https://lore.kernel.org/r/20200221084849.26878-4-ardb@kernel.org
Signed-off-by: Greg Kroah-Hartman -
commit 63056e8b5ebf41d52170e9f5ba1fc83d1855278c upstream.
Hans reports that his mixed mode systems running v5.6-rc1 kernels hit
the WARN_ON() in virt_to_phys_or_null_size(), caused by the fact that
efi_guid_t objects on the vmap'ed stack happen to be misaligned with
respect to their sizes. As a quick (i.e., backportable) fix, copy GUID
pointer arguments to the local stack into a buffer that is naturally
aligned to its size, so that it is guaranteed to cover only one
physical page.Note that on x86, we cannot rely on the stack pointer being aligned
the way the compiler expects, so we need to allocate an 8-byte aligned
buffer of sufficient size, and copy the GUID into that buffer at an
offset that is aligned to 16 bytes.Fixes: f6697df36bdf0bf7 ("x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y")
Reported-by: Hans de Goede
Signed-off-by: Ard Biesheuvel
Signed-off-by: Ingo Molnar
Tested-by: Hans de Goede
Cc: linux-efi@vger.kernel.org
Cc: Ingo Molnar
Cc: Thomas Gleixner
Link: https://lore.kernel.org/r/20200221084849.26878-2-ardb@kernel.org
Signed-off-by: Greg Kroah-Hartman -
commit 735a6dd02222d8d070c7bb748f25895239ca8c92 upstream.
Explicitly set X86_FEATURE_OSPKE via set_cpu_cap() instead of calling
get_cpu_cap() to pull the feature bit from CPUID after enabling CR4.PKE.
Invoking get_cpu_cap() effectively wipes out any {set,clear}_cpu_cap()
changes that were made between this_cpu->c_init() and setup_pku(), as
all non-synthetic feature words are reinitialized from the CPU's CPUID
values.Blasting away capability updates manifests most visibility when running
on a VMX capable CPU, but with VMX disabled by BIOS. To indicate that
VMX is disabled, init_ia32_feat_ctl() clears X86_FEATURE_VMX, using
clear_cpu_cap() instead of setup_clear_cpu_cap() so that KVM can report
which CPU is misconfigured (KVM needs to probe every CPU anyways).
Restoring X86_FEATURE_VMX from CPUID causes KVM to think VMX is enabled,
ultimately leading to an unexpected #GP when KVM attempts to do VMXON.Arguably, init_ia32_feat_ctl() should use setup_clear_cpu_cap() and let
KVM figure out a different way to report the misconfigured CPU, but VMX
is not the only feature bit that is affected, i.e. there is precedent
that tweaking feature bits via {set,clear}_cpu_cap() after ->c_init()
is expected to work. Most notably, x86_init_rdrand()'s clearing of
X86_FEATURE_RDRAND when RDRAND malfunctions is also overwritten.Fixes: 0697694564c8 ("x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU")
Reported-by: Jacob Keller
Signed-off-by: Sean Christopherson
Signed-off-by: Borislav Petkov
Acked-by: Dave Hansen
Tested-by: Jacob Keller
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200226231615.13664-1-sean.j.christopherson@intel.com
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 9038ec99ceb94fb8d93ade5e236b2928f0792c7c ]
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.arch/x86/xen/enlighten_pv.c: In function ‘xen_write_msr_safe’:
arch/x86/xen/enlighten_pv.c:904:12: warning: statement will never be executed [-Wswitch-unreachable]
904 | unsigned which;
| ^~~~~[1] https://bugs.llvm.org/show_bug.cgi?id=44916
Signed-off-by: Kees Cook
Link: https://lore.kernel.org/r/20200220062318.69299-1-keescook@chromium.org
Reviewed-by: Juergen Gross
[boris: made @which an 'unsigned int']
Signed-off-by: Boris Ostrovsky
Signed-off-by: Sasha Levin -
[ Upstream commit df6d4f9db79c1a5d6f48b59db35ccd1e9ff9adfc ]
GCC 10 changed the default to -fno-common, which leads to
LD arch/x86/boot/compressed/vmlinux
ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; \
arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
make[2]: *** [arch/x86/boot/compressed/Makefile:119: arch/x86/boot/compressed/vmlinux] Error 1Since __force_order is already provided in pgtable_64.c, there is no
need to declare __force_order in kaslr_64.c.Signed-off-by: H.J. Lu
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/20200124181811.4780-1-hjl.tools@gmail.com
Signed-off-by: Sasha Levin
05 Mar, 2020
10 commits
-
commit 693e02cc24090c379217138719d9d84e50036b24 upstream.
According to the SDM, VMWRITE checks to see if the secondary source
operand corresponds to an unsupported VMCS field before it checks to
see if the secondary source operand corresponds to a VM-exit
information field and the processor does not support writing to
VM-exit information fields.Fixes: 49f705c5324aa ("KVM: nVMX: Implement VMREAD and VMWRITE")
Signed-off-by: Jim Mattson
Cc: Paolo Bonzini
Reviewed-by: Peter Shier
Reviewed-by: Oliver Upton
Reviewed-by: Jon Cargille
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit dd2d6042b7f4a5440705b4ffc6c4c2dba81a43b7 upstream.
According to the SDM, a VMWRITE in VMX non-root operation with an
invalid VMCS-link pointer results in VMfailInvalid before the validity
of the VMCS field in the secondary source operand is checked.For consistency, modify both handle_vmwrite and handle_vmread, even
though there was no problem with the latter.Fixes: 6d894f498f5d1 ("KVM: nVMX: vmread/vmwrite: Use shadow vmcs12 if running L2")
Signed-off-by: Jim Mattson
Cc: Liran Alon
Cc: Paolo Bonzini
Cc: Vitaly Kuznetsov
Reviewed-by: Peter Shier
Reviewed-by: Oliver Upton
Reviewed-by: Jon Cargille
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 208050dac5ef4de5cb83ffcafa78499c94d0b5ad upstream.
Remove a bogus clearing of apf.msr_val from kvm_arch_vcpu_destroy().
apf.msr_val is only set to a non-zero value by kvm_pv_enable_async_pf(),
which is only reachable by kvm_set_msr_common(), i.e. by writing
MSR_KVM_ASYNC_PF_EN. KVM does not autonomously write said MSR, i.e.
can only be written via KVM_SET_MSRS or KVM_RUN. Since KVM_SET_MSRS and
KVM_RUN are vcpu ioctls, they require a valid vcpu file descriptor.
kvm_arch_vcpu_destroy() is only called if KVM_CREATE_VCPU fails, and KVM
declares KVM_CREATE_VCPU successful once the vcpu fd is installed and
thus visible to userspace. Ergo, apf.msr_val cannot be non-zero when
kvm_arch_vcpu_destroy() is called.Fixes: 344d9588a9df0 ("KVM: Add PV MSR to enable asynchronous page faults delivery.")
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 9d979c7e6ff43ca3200ffcb74f57415fd633a2da upstream.
x86 does not load its MMU until KVM_RUN, which cannot be invoked until
after vCPU creation succeeds. Given that kvm_arch_vcpu_destroy() is
called if and only if vCPU creation fails, it is impossible for the MMU
to be loaded.Note, the bogus kvm_mmu_unload() call was added during an unrelated
refactoring of vCPU allocation, i.e. was presumably added as an
opportunstic "fix" for a perceived leak.Fixes: fb3f0f51d92d1 ("KVM: Dynamically allocate vcpus")
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 536a0d8e79fb928f2735db37dda95682b6754f9a upstream.
Currently, there are three static keys in the resctrl file system:
rdt_mon_enable_key and rdt_alloc_enable_key indicate if the monitoring
feature and the allocation feature are enabled, respectively. The
rdt_enable_key is enabled when either the monitoring feature or the
allocation feature is enabled.If no monitoring feature is present (either hardware doesn't support a
monitoring feature or the feature is disabled by the kernel command line
option "rdt="), rdt_enable_key is still enabled but rdt_mon_enable_key
is disabled.MBM is a monitoring feature. The MBM overflow handler intends to
check if the monitoring feature is not enabled for fast return.So check the rdt_mon_enable_key in it instead of the rdt_enable_key as
former is the more accurate check.[ bp: Massage commit message. ]
Fixes: e33026831bdb ("x86/intel_rdt/mbm: Handle counter overflow")
Signed-off-by: Xiaochen Shen
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/1576094705-13660-1-git-send-email-xiaochen.shen@intel.com
Signed-off-by: Greg Kroah-Hartman -
commit 52918ed5fcf05d97d257f4131e19479da18f5d16 upstream.
The KVM MMIO support uses bit 51 as the reserved bit to cause nested page
faults when a guest performs MMIO. The AMD memory encryption support uses
a CPUID function to define the encryption bit position. Given this, it is
possible that these bits can conflict.Use svm_hardware_setup() to override the MMIO mask if memory encryption
support is enabled. Various checks are performed to ensure that the mask
is properly defined and rsvd_bits() is used to generate the new mask (as
was done prior to the change that necessitated this patch).Fixes: 28a1f3ac1d0c ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
Suggested-by: Sean Christopherson
Reviewed-by: Sean Christopherson
Signed-off-by: Tom Lendacky
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 86f7e90ce840aa1db407d3ea6e9b3a52b2ce923c upstream.
KVM emulates UMIP on hardware that doesn't support it by setting the
'descriptor table exiting' VM-execution control and performing
instruction emulation. When running nested, this emulation is broken as
KVM refuses to emulate L2 instructions by default.Correct this regression by allowing the emulation of descriptor table
instructions if L1 hasn't requested 'descriptor table exiting'.Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
Reported-by: Jan Kiszka
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini
Cc: Jim Mattson
Signed-off-by: Oliver Upton
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 0aa0e0d6b34b89649e6b5882a7e025a0eb9bd832 ]
Tremont is Intel's successor to Goldmont Plus. SMI_COUNT MSR is also
supported.Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Reviewed-by: Andi Kleen
Link: https://lkml.kernel.org/r/1580236279-35492-3-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Sasha Levin -
[ Upstream commit ecf71fbccb9ac5cb964eb7de59bb9da3755b7885 ]
Tremont is Intel's successor to Goldmont Plus. From the perspective of
Intel cstate residency counters, there is nothing changed compared with
Goldmont Plus and Goldmont.Share glm_cstates with Goldmont Plus and Goldmont.
Update the comments for Tremont.Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Reviewed-by: Andi Kleen
Link: https://lkml.kernel.org/r/1580236279-35492-2-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Sasha Levin -
[ Upstream commit eda23b387f6c4bb2971ac7e874a09913f533b22c ]
Elkhart Lake also uses Tremont CPU. From the perspective of Intel PMU,
there is nothing changed compared with Jacobsville.
Share the perf code with Jacobsville.Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Reviewed-by: Andi Kleen
Link: https://lkml.kernel.org/r/1580236279-35492-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Sasha Levin
29 Feb, 2020
11 commits
-
commit 23520b2def95205f132e167cf5b25c609975e959 upstream.
When pv_eoi_get_user() fails, 'val' may remain uninitialized and the return
value of pv_eoi_get_pending() becomes random. Fix the issue by initializing
the variable.Reviewed-by: Vitaly Kuznetsov
Signed-off-by: Miaohe Lin
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 91a5f413af596ad01097e59bf487eb07cb3f1331 upstream.
Even when APICv is disabled for L1 it can (and, actually, is) still
available for L2, this means we need to always call
vmx_deliver_nested_posted_interrupt() when attempting an interrupt
delivery.Suggested-by: Paolo Bonzini
Signed-off-by: Vitaly Kuznetsov
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
… is globally disabled
commit a4443267800af240072280c44521caab61924e55 upstream.
When apicv is disabled on a vCPU (e.g. by enabling KVM_CAP_HYPERV_SYNIC*),
nothing happens to VMX MSRs on the already existing vCPUs, however, all new
ones are created with PIN_BASED_POSTED_INTR filtered out. This is very
confusing and results in the following picture inside the guest:$ rdmsr -ax 0x48d
ff00000016
7f00000016
7f00000016
7f00000016This is observed with QEMU and 4-vCPU guest: QEMU creates vCPU0, does
KVM_CAP_HYPERV_SYNIC2 and then creates the remaining three.L1 hypervisor may only check CPU0's controls to find out what features
are available and it will be very confused later. Switch to setting
PIN_BASED_POSTED_INTR control based on global 'enable_apicv' setting.Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -
commit 35a571346a94fb93b5b3b6a599675ef3384bc75c upstream.
Consult the 'unconditional IO exiting' and 'use IO bitmaps' VM-execution
controls when checking instruction interception. If the 'use IO bitmaps'
VM-execution control is 1, check the instruction access against the IO
bitmaps to determine if the instruction causes a VM-exit.Signed-off-by: Oliver Upton
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit e71237d3ff1abf9f3388337cfebf53b96df2020d upstream.
Checks against the IO bitmap are useful for both instruction emulation
and VM-exit reflection. Refactor the IO bitmap checks into a helper
function.Signed-off-by: Oliver Upton
Reviewed-by: Vitaly Kuznetsov
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 7455a8327674e1a7c9a1f5dd1b0743ab6713f6d1 upstream.
Commit 13db77347db1 ("KVM: x86: don't notify userspace IOAPIC on edge
EOI") said, edge-triggered interrupts don't set a bit in TMR, which means
that IOAPIC isn't notified on EOI. And var level indicates level-triggered
interrupt.
But commit 3159d36ad799 ("KVM: x86: use generic function for MSI parsing")
replace var level with irq.level by mistake. Fix it by changing irq.level
to irq.trig_mode.Cc: stable@vger.kernel.org
Fixes: 3159d36ad799 ("KVM: x86: use generic function for MSI parsing")
Signed-off-by: Miaohe Lin
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 07721feee46b4b248402133228235318199b05ec upstream.
vmx_check_intercept is not yet fully implemented. To avoid emulating
instructions disallowed by the L1 hypervisor, refuse to emulate
instructions by default.Cc: stable@vger.kernel.org
[Made commit, added commit msg - Oliver]
Signed-off-by: Oliver Upton
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 21b5ee59ef18e27d85810584caf1f7ddc705ea83 upstream.
Commit
aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired)
performance counter")added support for access to the free-running counter via 'perf -e
msr/irperf/', but when exercised, it always returns a 0 count:BEFORE:
$ perf stat -e instructions,msr/irperf/ true
Performance counter stats for 'true':
624,833 instructions
0 msr/irperf/Simply set its enable bit - HWCR bit 30 - to make it start counting.
Enablement is restricted to all machines advertising IRPERF capability,
except those susceptible to an erratum that makes the IRPERF return
bad values.That erratum occurs in Family 17h models 00-1fh [1], but not in F17h
models 20h and above [2].AFTER (on a family 17h model 31h machine):
$ perf stat -e instructions,msr/irperf/ true
Performance counter stats for 'true':
621,690 instructions
622,490 msr/irperf/[1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors
[2] Revision Guide for AMD Family 17h Models 30h-3Fh ProcessorsThe revision guides are available from the bugzilla Link below.
[ bp: Massage commit message. ]
Fixes: aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter")
Signed-off-by: Kim Phillips
Signed-off-by: Borislav Petkov
Cc: Peter Zijlstra
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: http://lkml.kernel.org/r/20200214201805.13830-1-kim.phillips@amd.com
Signed-off-by: Greg Kroah-Hartman -
commit 51dede9c05df2b78acd6dcf6a17d21f0877d2d7b upstream.
Accessing the MCA thresholding controls in sysfs concurrently with CPU
hotplug can lead to a couple of KASAN-reported issues:BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180
Read of size 8 at addr ffff888367578940 by task grep/4019and
BUG: KASAN: use-after-free in show_error_count+0x15c/0x180
Read of size 2 at addr ffff888368a05514 by task grep/4454for example. Both result from the fact that the threshold block
creation/teardown code frees the descriptor memory itself instead of
defining proper ->release function and leaving it to the driver core to
take care of that, after all sysfs accesses have completed.Do that and get rid of the custom freeing code, fixing the above UAFs in
the process.[ bp: write commit message. ]
Fixes: 95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors")
Signed-off-by: Thomas Gleixner
Signed-off-by: Borislav Petkov
Cc:
Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de
Signed-off-by: Greg Kroah-Hartman -
commit 6e5cf31fbe651bed7ba1df768f2e123531132417 upstream.
threshold_create_bank() creates a bank descriptor per MCA error
thresholding counter which can be controlled over sysfs. It publishes
the pointer to that bank in a per-CPU variable and then goes on to
create additional thresholding blocks if the bank has such.However, that creation of additional blocks in
allocate_threshold_blocks() can fail, leading to a use-after-free
through the per-CPU pointer.Therefore, publish that pointer only after all blocks have been setup
successfully.Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor")
Reported-by: Saar Amar
Reported-by: Dan Carpenter
Signed-off-by: Borislav Petkov
Cc:
Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain
Signed-off-by: Greg Kroah-Hartman -
commit ff5ac61ee83c13f516544d29847d28be093a40ee upstream.
The IMA arch code attempts to inspect the "SetupMode" EFI variable
by populating a variable called efi_SetupMode_name with the string
"SecureBoot" and passing that to the EFI GetVariable service, which
obviously does not yield the expected result.Given that the string is only referenced a single time, let's get
rid of the intermediate variable, and pass the correct string as
an immediate argument. While at it, do the same for "SecureBoot".Fixes: 399574c64eaf ("x86/ima: retry detecting secure boot mode")
Fixes: 980ef4d22a95 ("x86/ima: check EFI SetupMode too")
Cc: Matthew Garrett
Signed-off-by: Ard Biesheuvel
Cc: stable@vger.kernel.org # v5.3
Signed-off-by: Mimi Zohar
Signed-off-by: Greg Kroah-Hartman
24 Feb, 2020
8 commits
-
[ Upstream commit 8b7e20a7ba54836076ff35a28349dabea4cec48f ]
Add TEST opcode to Group3-2 reg=001b as same as Group3-1 does.
Commit
12a78d43de76 ("x86/decoder: Add new TEST instruction pattern")
added a TEST opcode assignment to f6 XX/001/XXX (Group 3-1), but did
not add f7 XX/001/XXX (Group 3-2).Actually, this TEST opcode variant (ModRM.reg /1) is not described in
the Intel SDM Vol2 but in AMD64 Architecture Programmer's Manual Vol.3,
Appendix A.2 Table A-6. ModRM.reg Extensions for the Primary Opcode Map.Without this fix, Randy found a warning by insn_decoder_test related
to this issue as below.HOSTCC arch/x86/tools/insn_decoder_test
HOSTCC arch/x86/tools/insn_sanity
TEST posttest
arch/x86/tools/insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this.
arch/x86/tools/insn_decoder_test: warning: ffffffff81000bf1: f7 0b 00 01 08 00 testl $0x80100,(%rbx)
arch/x86/tools/insn_decoder_test: warning: objdump says 6 bytes, but insn_get_length() says 2
arch/x86/tools/insn_decoder_test: warning: Decoded and checked 11913894 instructions with 1 failures
TEST posttest
arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0x871ce29c)To fix this error, add the TEST opcode according to AMD64 APM Vol.3.
[ bp: Massage commit message. ]
Reported-by: Randy Dunlap
Signed-off-by: Masami Hiramatsu
Signed-off-by: Borislav Petkov
Acked-by: Randy Dunlap
Tested-by: Randy Dunlap
Link: https://lkml.kernel.org/r/157966631413.9580.10311036595431878351.stgit@devnote2
Signed-off-by: Sasha Levin -
[ Upstream commit 75fbef0a8b6b4bb19b9a91b5214f846c2dc5139e ]
The following commit:
15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()")
modified kernel_map_pages_in_pgd() to manage writable permissions
of memory mappings in the EFI page table in a different way, but
in the process, it removed the ability to clear NX attributes from
read-only mappings, by clobbering the clear mask if _PAGE_RW is not
being requested.Failure to remove the NX attribute from read-only mappings is
unlikely to be a security issue, but it does prevent us from
tightening the permissions in the EFI page tables going forward,
so let's fix it now.Fixes: 15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()
Signed-off-by: Ard Biesheuvel
Signed-off-by: Ingo Molnar
Link: https://lore.kernel.org/r/20200113172245.27925-5-ardb@kernel.org
Signed-off-by: Sasha Levin -
[ Upstream commit 471af006a747f1c535c8a8c6c0973c320fe01b22 ]
AMD Family 17h processors and above gain support for Large Increment
per Cycle events. Unfortunately there is no CPUID or equivalent bit
that indicates whether the feature exists or not, so we continue to
determine eligibility based on a CPU family number comparison.For Large Increment per Cycle events, we add a f17h-and-compatibles
get_event_constraints_f17h() that returns an even counter bitmask:
Large Increment per Cycle events can only be placed on PMCs 0, 2,
and 4 out of the currently available 0-5. The only currently
public event that requires this feature to report valid counts
is PMCx003 "Retired SSE/AVX Operations".Note that the CPU family logic in amd_core_pmu_init() is changed
so as to be able to selectively add initialization for features
available in ranges of backward-compatible CPU families. This
Large Increment per Cycle feature is expected to be retained
in future families.A side-effect of assigning a new get_constraints function for f17h
disables calling the old (prior to f15h) amd_get_event_constraints
implementation left enabled by commit e40ed1542dd7 ("perf/x86: Add perf
support for AMD family-17h processors"), which is no longer
necessary since those North Bridge event codes are obsoleted.Also fix a spelling mistake whilst in the area (calulating ->
calculating).Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
Signed-off-by: Kim Phillips
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20191114183720.19887-2-kim.phillips@amd.com
Signed-off-by: Sasha Levin -
[ Upstream commit 248ed51048c40d36728e70914e38bffd7821da57 ]
First, printk() is NMI-context safe now since the safe printk() has been
implemented and it already has an irq_work to make NMI-context safe.Second, this NMI irq_work actually does not work if a NMI handler causes
panic by watchdog timeout. It has no chance to run in such case, while
the safe printk() will flush its per-cpu buffers before panicking.While at it, repurpose the irq_work callback into a function which
concentrates the NMI duration checking and makes the code easier to
follow.[ bp: Massage. ]
Signed-off-by: Changbin Du
Signed-off-by: Borislav Petkov
Acked-by: Thomas Gleixner
Link: https://lkml.kernel.org/r/20200111125427.15662-1-changbin.du@gmail.com
Signed-off-by: Sasha Levin -
[ Upstream commit e2d68a955e49d61fd0384f23e92058dc9b79be5e ]
The logic in __efi_enter_virtual_mode() does a number of steps in
sequence, all of which may fail in one way or the other. In most
cases, we simply print an error and disable EFI runtime services
support, but in some cases, we BUG() or panic() and bring down the
system when encountering conditions that we could easily handle in
the same way.While at it, replace a pointless page-to-virt-phys conversion with
one that goes straight from struct page to physical.Signed-off-by: Ard Biesheuvel
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Arvind Sankar
Cc: Matthew Garrett
Cc: linux-efi@vger.kernel.org
Link: https://lkml.kernel.org/r/20200103113953.9571-14-ardb@kernel.org
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin -
[ Upstream commit bff47c2302cc249bcd550b17067f8dddbd4b6f77 ]
When building with C=1, sparse issues a warning:
CHECK arch/x86/entry/vdso/vdso32-setup.c
arch/x86/entry/vdso/vdso32-setup.c:28:28: warning: symbol 'vdso32_enabled' was not declared. Should it be static?Provide the missing header file.
Signed-off-by: Valdis Kletnieks
Signed-off-by: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Andy Lutomirski
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: x86-ml
Link: https://lkml.kernel.org/r/36224.1575599767@turing-police
Signed-off-by: Sasha Levin -
[ Upstream commit dacc9092336be20b01642afe1a51720b31f60369 ]
When checking whether the reported lfb_size makes sense, the height
* stride result is page-aligned before seeing whether it exceeds the
reported size.This doesn't work if height * stride is not an exact number of pages.
For example, as reported in the kernel bugzilla below, an 800x600x32 EFI
framebuffer gets skipped because of this.Move the PAGE_ALIGN to after the check vs size.
Reported-by: Christopher Head
Tested-by: Christopher Head
Signed-off-by: Arvind Sankar
Signed-off-by: Borislav Petkov
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206051
Link: https://lkml.kernel.org/r/20200107230410.2291947-1-nivedita@alum.mit.edu
Signed-off-by: Sasha Levin -
[ Upstream commit ffc2760bcf2dba0dbef74013ed73eea8310cc52c ]
Fix a couple of issues with the way we map and copy the vendor string:
- we map only 2 bytes, which usually works since you get at least a
page, but if the vendor string happens to cross a page boundary,
a crash will result
- only call early_memunmap() if early_memremap() succeeded, or we will
call it with a NULL address which it doesn't like,
- while at it, switch to early_memremap_ro(), and array indexing rather
than pointer dereferencing to read the CHAR16 characters.Signed-off-by: Ard Biesheuvel
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Arvind Sankar
Cc: Matthew Garrett
Cc: linux-efi@vger.kernel.org
Fixes: 5b83683f32b1 ("x86: EFI runtime service support")
Link: https://lkml.kernel.org/r/20200103113953.9571-5-ardb@kernel.org
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin