17 Apr, 2019
11 commits
-
commit c73f4c998e1fd4249b9edfa39e23f4fda2b9b041 upstream.
Referring to the "VIRTUALIZING MSR-BASED APIC ACCESSES" chapter of the
SDM, when "virtualize x2APIC mode" is 1 and "APIC-register
virtualization" is 0, a RDMSR of 808H should return the VTPR from the
virtual APIC page.However, for nested, KVM currently fails to disable the read intercept
for this MSR. This means that a RDMSR exit takes precedence over
"virtualize x2APIC mode", and KVM passes through L1's TPR to L2,
instead of sourcing the value from L2's virtual APIC page.This patch fixes the issue by disabling the read intercept, in VMCS02,
for the VTPR when "APIC-register virtualization" is 0.The issue described above and fix prescribed here, were verified with
a related patch in kvm-unit-tests titled "Test VMX's virtualize x2APIC
mode w/ nested".Signed-off-by: Marc Orr
Reviewed-by: Jim Mattson
Fixes: c992384bde84f ("KVM: vmx: speed up MSR bitmap merge")
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit acff78477b9b4f26ecdf65733a4ed77fe837e9dc upstream.
The nested_vmx_prepare_msr_bitmap() function doesn't directly guard the
x2APIC MSR intercepts with the "virtualize x2APIC mode" MSR. As a
result, we discovered the potential for a buggy or malicious L1 to get
access to L0's x2APIC MSRs, via an L2, as follows.1. L1 executes WRMSR(IA32_SPEC_CTRL, 1). This causes the spec_ctrl
variable, in nested_vmx_prepare_msr_bitmap() to become true.
2. L1 disables "virtualize x2APIC mode" in VMCS12.
3. L1 enables "APIC-register virtualization" in VMCS12.Now, KVM will set VMCS02's x2APIC MSR intercepts from VMCS12, and then
set "virtualize x2APIC mode" to 0 in VMCS02. Oops.This patch closes the leak by explicitly guarding VMCS02's x2APIC MSR
intercepts with VMCS12's "virtualize x2APIC mode" control.The scenario outlined above and fix prescribed here, were verified with
a related patch in kvm-unit-tests titled "Add leak scenario to
virt_x2apic_mode_test".Note, it looks like this issue may have been introduced inadvertently
during a merge---see 15303ba5d1cd.Signed-off-by: Marc Orr
Reviewed-by: Jim Mattson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 3966c3feca3fd10b2935caa0b4a08c7dd59469e5 upstream.
Spurious interrupt support was added to perf in the following commit, almost
a decade ago:63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")
The two previous patches (resolving the race condition when disabling a
PMC and NMI latency mitigation) allow for the removal of this older
spurious interrupt support.Currently in x86_pmu_stop(), the bit for the PMC in the active_mask bitmap
is cleared before disabling the PMC, which sets up a race condition. This
race condition was mitigated by introducing the running bitmap. That race
condition can be eliminated by first disabling the PMC, waiting for PMC
reset on overflow and then clearing the bit for the PMC in the active_mask
bitmap. The NMI handler will not re-enable a disabled counter.If x86_pmu_stop() is called from the perf NMI handler, the NMI latency
mitigation support will guard against any unhandled NMI messages.Signed-off-by: Tom Lendacky
Signed-off-by: Peter Zijlstra (Intel)
Cc: # 4.14.x-
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Borislav Petkov
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: https://lkml.kernel.org/r/Message-ID:
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 6d3edaae16c6c7d238360f2841212c2b26774d5e upstream.
On AMD processors, the detection of an overflowed PMC counter in the NMI
handler relies on the current value of the PMC. So, for example, to check
for overflow on a 48-bit counter, bit 47 is checked to see if it is 1 (not
overflowed) or 0 (overflowed).When the perf NMI handler executes it does not know in advance which PMC
counters have overflowed. As such, the NMI handler will process all active
PMC counters that have overflowed. NMI latency in newer AMD processors can
result in multiple overflowed PMC counters being processed in one NMI and
then a subsequent NMI, that does not appear to be a back-to-back NMI, not
finding any PMC counters that have overflowed. This may appear to be an
unhandled NMI resulting in either a panic or a series of messages,
depending on how the kernel was configured.To mitigate this issue, add an AMD handle_irq callback function,
amd_pmu_handle_irq(), that will invoke the common x86_pmu_handle_irq()
function and upon return perform some additional processing that will
indicate if the NMI has been handled or would have been handled had an
earlier NMI not handled the overflowed PMC. Using a per-CPU variable, a
minimum value of the number of active PMCs or 2 will be set whenever a
PMC is active. This is used to indicate the possible number of NMIs that
can still occur. The value of 2 is used for when an NMI does not arrive
at the LAPIC in time to be collapsed into an already pending NMI. Each
time the function is called without having handled an overflowed counter,
the per-CPU value is checked. If the value is non-zero, it is decremented
and the NMI indicates that it handled the NMI. If the value is zero, then
the NMI indicates that it did not handle the NMI.Signed-off-by: Tom Lendacky
Signed-off-by: Peter Zijlstra (Intel)
Cc: # 4.14.x-
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Borislav Petkov
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: https://lkml.kernel.org/r/Message-ID:
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 914123fa39042e651d79eaf86bbf63a1b938dddf upstream.
On AMD processors, the detection of an overflowed counter in the NMI
handler relies on the current value of the counter. So, for example, to
check for overflow on a 48 bit counter, bit 47 is checked to see if it
is 1 (not overflowed) or 0 (overflowed).There is currently a race condition present when disabling and then
updating the PMC. Increased NMI latency in newer AMD processors makes this
race condition more pronounced. If the counter value has overflowed, it is
possible to update the PMC value before the NMI handler can run. The
updated PMC value is not an overflowed value, so when the perf NMI handler
does run, it will not find an overflowed counter. This may appear as an
unknown NMI resulting in either a panic or a series of messages, depending
on how the kernel is configured.To eliminate this race condition, the PMC value must be checked after
disabling the counter. Add an AMD function, amd_pmu_disable_all(), that
will wait for the NMI handler to reset any active and overflowed counter
after calling x86_pmu_disable_all().Signed-off-by: Tom Lendacky
Signed-off-by: Peter Zijlstra (Intel)
Cc: # 4.14.x-
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Borislav Petkov
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: https://lkml.kernel.org/r/Message-ID:
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 5b77e95dd7790ff6c8fbf1cd8d0104ebed818a03 upstream.
There's a number of problems with how arch/x86/include/asm/bitops.h
is currently using assembly constraints for the memory region
bitops are modifying:1) Use memory clobber in bitops that touch arbitrary memory
Certain bit operations that read/write bits take a base pointer and an
arbitrarily large offset to address the bit relative to that base.
Inline assembly constraints aren't expressive enough to tell the
compiler that the assembly directive is going to touch a specific memory
location of unknown size, therefore we have to use the "memory" clobber
to indicate that the assembly is going to access memory locations other
than those listed in the inputs/outputs.To indicate that BTR/BTS instructions don't necessarily touch the first
sizeof(long) bytes of the argument, we also move the address to assembly
inputs.This particular change leads to size increase of 124 kernel functions in
a defconfig build. For some of them the diff is in NOP operations, other
end up re-reading values from memory and may potentially slow down the
execution. But without these clobbers the compiler is free to cache
the contents of the bitmaps and use them as if they weren't changed by
the inline assembly.2) Use byte-sized arguments for operations touching single bytes.
Passing a long value to ANDB/ORB/XORB instructions makes the compiler
treat sizeof(long) bytes as being clobbered, which isn't the case. This
may theoretically lead to worse code in the case of heavy optimization.Practical impact:
I've built a defconfig kernel and looked through some of the functions
generated by GCC 7.3.0 with and without this clobber, and didn't spot
any miscompilations.However there is a (trivial) theoretical case where this code leads to
miscompilation:https://lkml.org/lkml/2019/3/28/393
using just GCC 8.3.0 with -O2. It isn't hard to imagine someone writes
such a function in the kernel someday.So the primary motivation is to fix an existing misuse of the asm
directive, which happens to work in certain configurations now, but
isn't guaranteed to work under different circumstances.[ --mingo: Added -stable tag because defconfig only builds a fraction
of the kernel and the trivial testcase looks normal enough to
be used in existing or in-development code. ]Signed-off-by: Alexander Potapenko
Cc:
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: Dmitry Vyukov
Cc: H. Peter Anvin
Cc: James Y Knight
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20190402112813.193378-1-glider@google.com
[ Edited the changelog, tidied up one of the defines. ]
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 88ca66d8540ca26119b1428cddb96b37925bdf01 upstream.
The minimum supported gcc version is >= 4.6, so these can be removed.
Signed-off-by: Rasmus Villemoes
Signed-off-by: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Dan Williams
Cc: Geert Uytterhoeven
Cc: Ingo Molnar
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: x86-ml
Link: https://lkml.kernel.org/r/20190111084931.24601-1-linux@rasmusvillemoes.dk
Signed-off-by: Greg Kroah-Hartman -
commit 42d8644bd77dd2d747e004e367cb0c895a606f39 upstream.
The "call" variable comes from the user in privcmd_ioctl_hypercall().
It's an offset into the hypercall_page[] which has (PAGE_SIZE / 32)
elements. We need to put an upper bound on it to prevent an out of
bounds access.Cc: stable@vger.kernel.org
Fixes: 1246ae0bb992 ("xen: add variable hypercall caller")
Signed-off-by: Dan Carpenter
Reviewed-by: Boris Ostrovsky
Signed-off-by: Juergen Gross
Signed-off-by: Greg Kroah-Hartman -
commit ede885ecb2cdf8a8dd5367702e3d964ec846a2d5 upstream.
get_num_contig_pages() could potentially overflow int so make its type
consistent with its usage.Reported-by: Cfir Cohen
Cc: stable@vger.kernel.org
Signed-off-by: David Rientjes
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit ac3e233d29f7f77f28243af0132057d378d3ea58 upstream.
GNU linker's -z common-page-size's default value is based on the target
architecture. arch/x86/entry/vdso/Makefile sets it to the architecture
default, which is implicit and redundant. Drop it.Fixes: 2aae950b21e4 ("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu")
Reported-by: Dmitry Golovin
Reported-by: Bill Wendling
Suggested-by: Dmitry Golovin
Suggested-by: Rui Ueyama
Signed-off-by: Nick Desaulniers
Signed-off-by: Borislav Petkov
Acked-by: Andy Lutomirski
Cc: Andi Kleen
Cc: Fangrui Song
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: x86-ml
Link: https://lkml.kernel.org/r/20181206191231.192355-1-ndesaulniers@google.com
Link: https://bugs.llvm.org/show_bug.cgi?id=38774
Link: https://github.com/ClangBuiltLinux/linux/issues/31
Signed-off-by: Nathan Chancellor
Signed-off-by: Sasha Levin -
[ Upstream commit 9ebdfe5230f2e50e3ba05c57723a06e90946815a ]
According to the SDM, "NMI-window exiting" VM-exits wake a logical
processor from the same inactive states as would an NMI and
"interrupt-window exiting" VM-exits wake a logical processor from the
same inactive states as would an external interrupt. Specifically, they
wake a logical processor from the shutdown state and from the states
entered using the HLT and MWAIT instructions.Fixes: 6dfacadd5858 ("KVM: nVMX: Add support for activity state HLT")
Signed-off-by: Jim Mattson
Reviewed-by: Peter Shier
Suggested-by: Sean Christopherson
[Squashed comments of two Jim's patches and used the simplified code
hunk provided by Sean. - Radim]
Signed-off-by: Radim Krčmář
Signed-off-by: Sasha Levin
06 Apr, 2019
5 commits
-
[ Upstream commit a50480cb6d61d5c5fc13308479407b628b6bc1c5 ]
These interrupt functions are already non-attachable by kprobes.
Blacklist them explicitly so that they can show up in
/sys/kernel/debug/kprobes/blacklist and tools like BCC can use this
additional information.Signed-off-by: Andrea Righi
Cc: Andy Lutomirski
Cc: Anil S Keshavamurthy
Cc: Borislav Petkov
Cc: David S. Miller
Cc: Linus Torvalds
Cc: Masami Hiramatsu
Cc: Naveen N. Rao
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Yonghong Song
Link: http://lkml.kernel.org/r/20181206095648.GA8249@Dell
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin -
[ Upstream commit d071ae09a4a1414c1433d5ae9908959a7325b0ad ]
Accessing per-CPU variables is done by finding the offset of the
variable in the per-CPU block and adding it to the address of the
respective CPU's block.Section 3.10.8 of ld.bfd's documentation states:
For expressions involving numbers, relative addresses and absolute
addresses, ld follows these rules to evaluate terms:Other binary operations, that is, between two relative addresses
not in the same section, or between a relative address and an
absolute address, first convert any non-absolute term to an
absolute address before applying the operator."Note that LLVM's linker does not adhere to the GNU ld's implementation
and as such requires implicitly-absolute terms to be explicitly marked
as absolute in the linker script. If not, it fails currently with:ld.lld: error: ./arch/x86/kernel/vmlinux.lds:153: at least one side of the expression must be absolute
ld.lld: error: ./arch/x86/kernel/vmlinux.lds:154: at least one side of the expression must be absolute
Makefile:1040: recipe for target 'vmlinux' failedThis is not a functional change for ld.bfd which converts the term to an
absolute symbol anyways as specified above.Based on a previous submission by Tri Vo .
Reported-by: Dmitry Golovin
Signed-off-by: Rafael Ávila de Espíndola
[ Update commit message per Boris' and Michael's suggestions. ]
Signed-off-by: Nick Desaulniers
[ Massage commit message more, fix typos. ]
Signed-off-by: Borislav Petkov
Tested-by: Dmitry Golovin
Cc: "H. Peter Anvin"
Cc: Andy Lutomirski
Cc: Brijesh Singh
Cc: Cao Jin
Cc: Ingo Molnar
Cc: Joerg Roedel
Cc: Masahiro Yamada
Cc: Masami Hiramatsu
Cc: Thomas Gleixner
Cc: Tri Vo
Cc: dima@golovin.in
Cc: morbo@google.com
Cc: x86-ml
Link: https://lkml.kernel.org/r/20181219190145.252035-1-ndesaulniers@google.com
Signed-off-by: Sasha Levin -
[ Upstream commit 927185c124d62a9a4d35878d7f6d432a166b74e3 ]
The kernel uses the OUTPUT_FORMAT linker script command in it's linker
scripts. Most of the time, the -m option is passed to the linker with
correct architecture, but sometimes (at least for x86_64) the -m option
contradicts the OUTPUT_FORMAT directive.Specifically, arch/x86/boot and arch/x86/realmode/rm produce i386 object
files, but are linked with the -m elf_x86_64 linker flag when building
for x86_64.The GNU linker manpage doesn't explicitly state any tie-breakers between
-m and OUTPUT_FORMAT. But with BFD and Gold linkers, OUTPUT_FORMAT
overrides the emulation value specified with the -m option.LLVM lld has a different behavior, however. When supplied with
contradicting -m and OUTPUT_FORMAT values it fails with the following
error message:ld.lld: error: arch/x86/realmode/rm/header.o is incompatible with elf_x86_64
Therefore, just add the correct -m after the incorrect one (it overrides
it), so the linker invocation looks like this:ld -m elf_x86_64 -z max-page-size=0x200000 -m elf_i386 --emit-relocs -T \
realmode.lds header.o trampoline_64.o stack.o reboot.o -o realmode.elfThis is not a functional change for GNU ld, because (although not
explicitly documented) OUTPUT_FORMAT overrides -m EMULATION.Tested by building x86_64 kernel with GNU gcc/ld toolchain and booting
it in QEMU.[ bp: massage and clarify text. ]
Suggested-by: Dmitry Golovin
Signed-off-by: George Rimar
Signed-off-by: Tri Vo
Signed-off-by: Borislav Petkov
Tested-by: Tri Vo
Tested-by: Nick Desaulniers
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Michael Matz
Cc: Thomas Gleixner
Cc: morbo@google.com
Cc: ndesaulniers@google.com
Cc: ruiu@google.com
Cc: x86-ml
Link: https://lkml.kernel.org/r/20190111201012.71210-1-trong@android.com
Signed-off-by: Sasha Levin -
[ Upstream commit 840018668ce2d96783356204ff282d6c9b0e5f66 ]
When pmu::setup_aux() is called the coresight PMU needs to know which
sink to use for the session by looking up the information in the
event's attr::config2 field.As such simply replace the cpu information by the complete perf_event
structure and change all affected customers.Signed-off-by: Mathieu Poirier
Reviewed-by: Suzuki Poulouse
Acked-by: Peter Zijlstra
Cc: Adrian Hunter
Cc: Alexander Shishkin
Cc: Alexei Starovoitov
Cc: Greg Kroah-Hartman
Cc: H. Peter Anvin
Cc: Heiko Carstens
Cc: Jiri Olsa
Cc: Mark Rutland
Cc: Martin Schwidefsky
Cc: Namhyung Kim
Cc: Thomas Gleixner
Cc: Will Deacon
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo
Signed-off-by: Sasha Levin -
[ Upstream commit 179fb36abb097976997f50733d5b122a29158cba ]
After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
kexec fails with a kernel panic:kexec_core: Starting new kernel
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
RIP: 0010:0xffffc9000001d000Call Trace:
? __send_ipi_mask+0x1c6/0x2d0
? hv_send_ipi_mask_allbutself+0x6d/0xb0
? mp_save_irq+0x70/0x70
? __ioapic_read_entry+0x32/0x50
? ioapic_read_entry+0x39/0x50
? clear_IO_APIC_pin+0xb8/0x110
? native_stop_other_cpus+0x6e/0x170
? native_machine_shutdown+0x22/0x40
? kernel_kexec+0x136/0x156That happens if hypercall based IPIs are used because the hypercall page is
reset very early upon kexec reboot, but kexec sends IPIs to stop CPUs,
which invokes the hypercall and dereferences the unusable page.To fix his, reset hv_hypercall_pg to NULL before the page is reset to avoid
any misuse, IPI sending will fall back to the non hypercall based
method. This only happens on kexec / kdump so just setting the pointer to
NULL is good enough.Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: Kairui Song
Signed-off-by: Thomas Gleixner
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Sasha Levin
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Vitaly Kuznetsov
Cc: Dave Young
Cc: devel@linuxdriverproject.org
Link: https://lkml.kernel.org/r/20190306111827.14131-1-kasong@redhat.com
Signed-off-by: Sasha Levin
03 Apr, 2019
3 commits
-
commit 0cf9135b773bf32fba9dd8e6699c1b331ee4b749 upstream.
The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host
userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
regardless of hardware support under the pretense that KVM fully
emulates MSR_IA32_ARCH_CAPABILITIES. Unfortunately, only VMX hosts
handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
that it's emulated on AMD hosts.Fixes: 1eaafe91a0df4 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
Cc: stable@vger.kernel.org
Reported-by: Xiaoyao Li
Cc: Jim Mattson
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 45def77ebf79e2e8942b89ed79294d97ce914fa0 upstream.
Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
OUT 92h or CF9h. Userspace may emulate said mechanism, i.e. reset a
vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.To avoid corruping %rip after such a reset, commit 0967b7bf1c22 ("KVM:
Skip pio instruction when it is emulated, not executed") changed the
behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
instruction prior to exiting to userspace. Full emulation doesn't need
such tricks becase re-emulating the instruction will naturally handle
%rip being changed to point at the reset vector.Updating %rip prior to executing to userspace has several drawbacks:
- Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
fails it will likely yell about the wrong address.
- Single step exits to userspace for are effectively dropped as
KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
- Behavior of PIO emulation is different depending on whether it
goes down the fast path or the slow path.Rather than skip the PIO instruction before exiting to userspace,
snapshot the linear %rip and cancel PIO completion if the current
value does not match the snapshot. For a 64-bit vCPU, i.e. the most
common scenario, the snapshot and comparison has negligible overhead
as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
VMREAD in this case.All other alternatives to snapshotting the linear %rip that don't
rely on an explicit reset announcenment suffer from one corner case
or another. For example, canceling PIO completion on any write to
%rip fails if userspace does a save/restore of %rip, and attempting to
avoid that issue by canceling PIO only if %rip changed then fails if PIO
collides with the reset %rip. Attempting to zero in on the exact reset
vector won't work for APs, which means adding more hooks such as the
vCPU's MP_STATE, and so on and so forth.Checking for a linear %rip match technically suffers from corner cases,
e.g. userspace could theoretically rewrite the underlying code page and
expect a different instruction to execute, or the guest hardcodes a PIO
reset at 0xfffffff0, but those are far, far outside of what can be
considered normal operation.Fixes: 432baf60eee3 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
Cc:
Reported-by: Jim Mattson
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit bebd024e4815b1a170fcd21ead9c2222b23ce9e6 upstream.
The SMT disable 'nosmt' command line argument is not working properly when
CONFIG_HOTPLUG_CPU is disabled. The teardown of the sibling CPUs which are
required to be brought up due to the MCE issues, cannot work. The CPUs are
then kept in a half dead state.As the 'nosmt' functionality has become popular due to the speculative
hardware vulnerabilities, the half torn down state is not a proper solution
to the problem.Enforce CONFIG_HOTPLUG_CPU=y when SMP is enabled so the full operation is
possible.Reported-by: Tianyu Lan
Signed-off-by: Thomas Gleixner
Acked-by: Greg Kroah-Hartman
Cc: Konrad Wilk
Cc: Josh Poimboeuf
Cc: Mukesh Ojha
Cc: Peter Zijlstra
Cc: Jiri Kosina
Cc: Rik van Riel
Cc: Andy Lutomirski
Cc: Micheal Kelley
Cc: "K. Y. Srinivasan"
Cc: Linus Torvalds
Cc: Borislav Petkov
Cc: K. Y. Srinivasan
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190326163811.598166056@linutronix.de
Signed-off-by: Greg Kroah-Hartman
27 Mar, 2019
2 commits
-
commit ac5ceccce5501e43d217c596e4ee859f2a3fef79 upstream.
When the ORC unwinder is invoked for an oops caused by IP==0,
it currently has no idea what to do because there is no debug information
for the stack frame of NULL.But if RIP is NULL, it is very likely that the last successfully executed
instruction was an indirect CALL/JMP, and it is possible to unwind out in
the same way as for the first instruction of a normal function. Hardcode
a corresponding ORC entry.With an artificially-added NULL call in prctl_set_seccomp(), before this
patch, the trace is:Call Trace:
? __x64_sys_prctl+0x402/0x680
? __ia32_sys_prctl+0x6e0/0x6e0
? __do_page_fault+0x457/0x620
? do_syscall_64+0x6d/0x160
? entry_SYSCALL_64_after_hwframe+0x44/0xa9After this patch, the trace looks like this:
Call Trace:
__x64_sys_prctl+0x402/0x680
? __ia32_sys_prctl+0x6e0/0x6e0
? __do_page_fault+0x457/0x620
do_syscall_64+0x6d/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9prctl_set_seccomp() still doesn't show up in the trace because for some
reason, tail call optimization is only disabled in builds that use the
frame pointer unwinder.Signed-off-by: Jann Horn
Signed-off-by: Thomas Gleixner
Acked-by: Josh Poimboeuf
Cc: Borislav Petkov
Cc: Andrew Morton
Cc: syzbot
Cc: "H. Peter Anvin"
Cc: Masahiro Yamada
Cc: Michal Marek
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20190301031201.7416-2-jannh@google.com
Signed-off-by: Greg Kroah-Hartman -
commit f4f34e1b82eb4219d8eaa1c7e2e17ca219a6a2b5 upstream.
When the frame unwinder is invoked for an oops caused by a call to NULL, it
currently skips the parent function because BP still points to the parent's
stack frame; the (nonexistent) current function only has the first half of
a stack frame, and BP doesn't point to it yet.Add a special case for IP==0 that calculates a fake BP from SP, then uses
the real BP for the next frame.Note that this handles first_frame specially: Return information about the
parent function as long as the saved IP is >=first_frame, even if the fake
BP points below it.With an artificially-added NULL call in prctl_set_seccomp(), before this
patch, the trace is:Call Trace:
? prctl_set_seccomp+0x3a/0x50
__x64_sys_prctl+0x457/0x6f0
? __ia32_sys_prctl+0x750/0x750
do_syscall_64+0x72/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9After this patch, the trace is:
Call Trace:
prctl_set_seccomp+0x3a/0x50
__x64_sys_prctl+0x457/0x6f0
? __ia32_sys_prctl+0x750/0x750
do_syscall_64+0x72/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9Signed-off-by: Jann Horn
Signed-off-by: Thomas Gleixner
Acked-by: Josh Poimboeuf
Cc: Borislav Petkov
Cc: Andrew Morton
Cc: syzbot
Cc: "H. Peter Anvin"
Cc: Masahiro Yamada
Cc: Michal Marek
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20190301031201.7416-1-jannh@google.com
Signed-off-by: Greg Kroah-Hartman
24 Mar, 2019
13 commits
-
commit 34333cc6c2cb021662fd32e24e618d1b86de95bf upstream.
Regarding segments with a limit==0xffffffff, the SDM officially states:
When the effective limit is FFFFFFFFH (4 GBytes), these accesses may
or may not cause the indicated exceptions. Behavior is
implementation-specific and may vary from one execution to another.In practice, all CPUs that support VMX ignore limit checks for "flat
segments", i.e. an expand-up data or code segment with base=0 and
limit=0xffffffff. This is subtly different than wrapping the effective
address calculation based on the address size, as the flat segment
behavior also applies to accesses that would wrap the 4g boundary, e.g.
a 4-byte access starting at 0xffffffff will access linear addresses
0xffffffff, 0x0, 0x1 and 0x2.Fixes: f9eb4af67c9d ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 8570f9e881e3fde98801bb3a47eef84dd934d405 upstream.
The address size of an instruction affects the effective address, not
the virtual/linear address. The final address may still be truncated,
e.g. to 32-bits outside of long mode, but that happens irrespective of
the address size, e.g. a 32-bit address size can yield a 64-bit virtual
address when using FS/GS with a non-zero base.Fixes: 064aea774768 ("KVM: nVMX: Decoding memory operands of VMX instructions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 946c522b603f281195af1df91837a1d4d1eb3bc9 upstream.
The VMCS.EXIT_QUALIFCATION field reports the displacements of memory
operands for various instructions, including VMX instructions, as a
naturally sized unsigned value, but masks the value by the addr size,
e.g. given a ModRM encoded as -0x28(%ebp), the -0x28 displacement is
reported as 0xffffffd8 for a 32-bit address size. Despite some weird
wording regarding sign extension, the SDM explicitly states that bits
beyond the instructions address size are undefined:In all cases, bits of this field beyond the instruction’s address
size are undefined.Failure to sign extend the displacement results in KVM incorrectly
treating a negative displacement as a large positive displacement when
the address size of the VMX instruction is smaller than KVM's native
size, e.g. a 32-bit address size on a 64-bit KVM.The very original decoding, added by commit 064aea774768 ("KVM: nVMX:
Decoding memory operands of VMX instructions"), sort of modeled sign
extension by truncating the final virtual/linear address for a 32-bit
address size. I.e. it messed up the effective address but made it work
by adjusting the final address.When segmentation checks were added, the truncation logic was kept
as-is and no sign extension logic was introduced. In other words, it
kept calculating the wrong effective address while mostly generating
the correct virtual/linear address. As the effective address is what's
used in the segment limit checks, this results in KVM incorreclty
injecting #GP/#SS faults due to non-existent segment violations when
a nested VMM uses negative displacements with an address size smaller
than KVM's native address size.Using the -0x28(%ebp) example, an EBP value of 0x1000 will result in
KVM using 0x100000fd8 as the effective address when checking for a
segment limit violation. This causes a 100% failure rate when running
a 32-bit KVM build as L1 on top of a 64-bit KVM L0.Fixes: f9eb4af67c9d ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit ddfd1730fd829743e41213e32ccc8b4aa6dc8325 upstream.
When installing new memslots, KVM sets bit 0 of the generation number to
indicate that an update is in-progress. Until the update is complete,
there are no guarantees as to whether a vCPU will see the old or the new
memslots. Explicity prevent caching MMIO accesses so as to avoid using
an access cached from the old memslots after the new memslots have been
installed.Note that it is unclear whether or not disabling caching during the
update window is strictly necessary as there is no definitive
documentation as to what ordering guarantees KVM provides with respect
to updating memslots. That being said, the MMIO spte code does not
allow reusing sptes created while an update is in-progress, and the
associated documentation explicitly states:We do not want to use an MMIO sptes created with an odd generation
number, ... If KVM is unlucky and creates an MMIO spte while the
low bit is 1, the next access to the spte will always be a cache miss.At the very least, disabling the per-vCPU MMIO cache during updates will
make its behavior consistent with the MMIO spte behavior and
documentation.Fixes: 56f17dd3fbc4 ("kvm: x86: fix stale mmio cache bug")
Cc:
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit e1359e2beb8b0a1188abc997273acbaedc8ee791 upstream.
The check to detect a wrap of the MMIO generation explicitly looks for a
generation number of zero. Now that unique memslots generation numbers
are assigned to each address space, only address space 0 will get a
generation number of exactly zero when wrapping. E.g. when address
space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will
wrap to 0x2. Adjust the MMIO generation to strip the address space
modifier prior to checking for a wrap.Fixes: 4bd518f1598d ("KVM: use separate generations for each address space")
Cc:
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.
kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound. x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots. Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc:
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 8041ffd36f42d8521d66dd1e236feb58cecd68bc upstream.
The client IMC bandwidth events currently return very large values:
$ perf stat -e uncore_imc/data_reads/ -e uncore_imc/data_writes/ -I 10000 -a
10.000117222 34,788.76 MiB uncore_imc/data_reads/
10.000117222 8.26 MiB uncore_imc/data_writes/
20.000374584 34,842.89 MiB uncore_imc/data_reads/
20.000374584 10.45 MiB uncore_imc/data_writes/
30.000633299 37,965.29 MiB uncore_imc/data_reads/
30.000633299 323.62 MiB uncore_imc/data_writes/
40.000891548 41,012.88 MiB uncore_imc/data_reads/
40.000891548 6.98 MiB uncore_imc/data_writes/
50.001142480 1,125,899,906,621,494.75 MiB uncore_imc/data_reads/
50.001142480 6.97 MiB uncore_imc/data_writes/The client IMC events are freerunning counters. They still use the
old event encoding format (0x1 for data_read and 0x2 for data write).
The counter bit width is calculated by common code, which assume that
the standard encoding format is used for the freerunning counters.
Error bit width information is calculated.The patch intends to convert the old client IMC event encoding to the
standard encoding format.Current common code uses event->attr.config which directly copy from
user space. We should not implicitly modify it for a converted event.
The event->hw.config is used to replace the event->attr.config in
common code.For client IMC events, the event->attr.config is used to calculate a
converted event with standard encoding format in the custom
event_init(). The converted event is stored in event->hw.config.
For other events of freerunning counters, they already use the standard
encoding format. The same value as event->attr.config is assigned to
event->hw.config in common event_init().Reported-by: Jin Yao
Tested-by: Jin Yao
Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Andy Lutomirski
Cc: Arnaldo Carvalho de Melo
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: H. Peter Anvin
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: stable@kernel.org # v4.18+
Fixes: 9aae1780e7e8 ("perf/x86/intel/uncore: Clean up client IMC uncore")
Link: https://lkml.kernel.org/r/20190227165729.1861-1-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 0192e6535ebe9af68614198ced4fd6d37b778ebf upstream.
Prohibit probing on optprobe template code, since it is not
a code but a template instruction sequence. If we modify
this template, copied template must be broken.Signed-off-by: Masami Hiramatsu
Cc: Alexander Shishkin
Cc: Andrea Righi
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Mathieu Desnoyers
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Cc: stable@vger.kernel.org
Fixes: 9326638cbee2 ("kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotation")
Link: http://lkml.kernel.org/r/154998787911.31052.15274376330136234452.stgit@devbox
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 01bd2ac2f55a1916d81dace12fa8d7ae1c79b5ea upstream.
Commit f7c90c2aa40048 ("x86/xen: don't write ptes directly in 32-bit
PV guests") introduced a regression for booting dom0 on huge systems
with lots of RAM (in the TB range).Reason is that on those hosts the p2m list needs to be moved early in
the boot process and this requires temporary page tables to be created.
Said commit modified xen_set_pte_init() to use a hypercall for writing
a PTE, but this requires the page table being in the direct mapped
area, which is not the case for the temporary page tables used in
xen_relocate_p2m().As the page tables are completely written before being linked to the
actual address space instead of set_pte() a plain write to memory can
be used in xen_relocate_p2m().Fixes: f7c90c2aa40048 ("x86/xen: don't write ptes directly in 32-bit PV guests")
Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross
Reviewed-by: Jan Beulich
Signed-off-by: Juergen Gross
Signed-off-by: Greg Kroah-Hartman -
commit 2060e284e9595fc3baed6e035903c05b93266555 upstream.
The x86 MORUS implementations all fail the improved AEAD tests because
they produce the wrong result with some data layouts. The issue is that
they assume that if the skcipher_walk API gives 'nbytes' not aligned to
the walksize (a.k.a. walk.stride), then it is the end of the data. In
fact, this can happen before the end.Also, when the CRYPTO_TFM_REQ_MAY_SLEEP flag is given, they can
incorrectly sleep in the skcipher_walk_*() functions while preemption
has been disabled by kernel_fpu_begin().Fix these bugs.
Fixes: 56e8e57fc3a7 ("crypto: morus - Add common SIMD glue code for MORUS")
Cc: # v4.18+
Cc: Ondrej Mosnacek
Signed-off-by: Eric Biggers
Reviewed-by: Ondrej Mosnacek
Signed-off-by: Herbert Xu
Signed-off-by: Greg Kroah-Hartman -
commit 3af349639597fea582a93604734d717e59a0e223 upstream.
gcmaes_crypt_by_sg() dereferences the NULL pointer returned by
scatterwalk_ffwd() when encrypting an empty plaintext and the source
scatterlist ends immediately after the associated data.Fix it by only fast-forwarding to the src/dst data scatterlists if the
data length is nonzero.This bug is reproduced by the "rfc4543(gcm(aes))" test vectors when run
with the new AEAD test manager.Fixes: e845520707f8 ("crypto: aesni - Update aesni-intel_glue to use scatter/gather")
Cc: # v4.17+
Cc: Dave Watson
Signed-off-by: Eric Biggers
Signed-off-by: Herbert Xu
Signed-off-by: Greg Kroah-Hartman -
commit ba6771c0a0bc2fac9d6a8759bab8493bd1cffe3b upstream.
The x86 AEGIS implementations all fail the improved AEAD tests because
they produce the wrong result with some data layouts. The issue is that
they assume that if the skcipher_walk API gives 'nbytes' not aligned to
the walksize (a.k.a. walk.stride), then it is the end of the data. In
fact, this can happen before the end.Also, when the CRYPTO_TFM_REQ_MAY_SLEEP flag is given, they can
incorrectly sleep in the skcipher_walk_*() functions while preemption
has been disabled by kernel_fpu_begin().Fix these bugs.
Fixes: 1d373d4e8e15 ("crypto: x86 - Add optimized AEGIS implementations")
Cc: # v4.18+
Cc: Ondrej Mosnacek
Signed-off-by: Eric Biggers
Reviewed-by: Ondrej Mosnacek
Signed-off-by: Herbert Xu
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 8cd8f0ce0d6aafe661cb3d6781c8b82bc696c04d ]
Add the CPUID model number of Icelake (ICL) mobile processors to the
Intel family list. Icelake U/Y series uses model number 0x7E.Signed-off-by: Rajneesh Bhardwaj
Signed-off-by: Borislav Petkov
Cc: Andy Shevchenko
Cc: Dave Hansen
Cc: "David E. Box"
Cc: dvhart@infradead.org
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Kan Liang
Cc: Peter Zijlstra
Cc: platform-driver-x86@vger.kernel.org
Cc: Qiuxu Zhuo
Cc: Srinivas Pandruvada
Cc: Thomas Gleixner
Cc: x86-ml
Link: https://lkml.kernel.org/r/20190214115712.19642-2-rajneesh.bhardwaj@linux.intel.com
Signed-off-by: Sasha Levin
19 Mar, 2019
3 commits
-
commit c634dc6bdedeb0b2c750fc611612618a85639ab2 upstream.
Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
Signed-off-by: kbuild test robot
Signed-off-by: Thomas Gleixner
Cc: "Peter Zijlstra (Intel)"
Cc: kbuild-all@01.org
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Kan Liang
Cc: Jiri Olsa
Cc: Andi Kleen
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190313184243.GA10820@lkp-sb-ep06
Signed-off-by: Greg Kroah-Hartman -
commit ede271b059463731cbd6dffe55ffd70d7dbe8392 upstream.
Through:
validate_event()
x86_pmu.get_event_constraints(.idx=-1)
tfa_get_event_constraints()
dyn_constraint()cpuc->constraint_list[-1] is used, which is an obvious out-of-bound access.
In this case, simply skip the TFA constraint code, there is no event
constraint with just PMC3, therefore the code will never result in the
empty set.Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
Reported-by: Tony Jones
Reported-by: "DSouza, Nelson"
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Tested-by: Tony Jones
Tested-by: "DSouza, Nelson"
Cc: eranian@google.com
Cc: jolsa@redhat.com
Cc: stable@kernel.org
Link: https://lkml.kernel.org/r/20190314130705.441549378@infradead.org
Signed-off-by: Greg Kroah-Hartman -
commit f764c58b7faa26f5714e6907f892abc2bc0de4f8 upstream.
Guenter reported a build warning for CONFIG_CPU_SUP_INTEL=n:
> With allmodconfig-CONFIG_CPU_SUP_INTEL, this patch results in:
>
> In file included from arch/x86/events/amd/core.c:8:0:
> arch/x86/events/amd/../perf_event.h:1036:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration
> static inline int intel_cpuc_prepare(struct cpu_hw_event *cpuc, int cpu)While harmless (an unsed pointer is an unused pointer, no matter the type)
it needs fixing.Reported-by: Guenter Roeck
Signed-off-by: Peter Zijlstra (Intel)
Cc: Greg Kroah-Hartman
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: stable@vger.kernel.org
Fixes: d01b1f96a82e ("perf/x86/intel: Make cpuc allocations consistent")
Link: http://lkml.kernel.org/r/20190315081410.GR5996@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
14 Mar, 2019
3 commits
-
commit 400816f60c543153656ac74eaf7f36f6b7202378 upstream
Skylake (and later) will receive a microcode update to address a TSX
errata. This microcode will, on execution of a TSX instruction
(speculative or not) use (clobber) PMC3. This update will also provide
a new MSR to change this behaviour along with a CPUID bit to enumerate
the presence of this new MSR.When the MSR gets set; the microcode will no longer use PMC3 but will
Force Abort every TSX transaction (upon executing COMMIT).When TSX Force Abort (TFA) is allowed (default); the MSR gets set when
PMC3 gets scheduled and cleared when, after scheduling, PMC3 is
unused.When TFA is not allowed; clear PMC3 from all constraints such that it
will not get used.Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman -
commit 52f64909409c17adf54fcf5f9751e0544ca3a6b4 upstream
Skylake systems will receive a microcode update to address a TSX
errata. This microcode will (by default) clobber PMC3 when TSX
instructions are (speculatively or not) executed.It also provides an MSR to cause all TSX transaction to abort and
preserve PMC3.Add the CPUID enumeration and MSR definition.
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman -
commit 11f8b2d65ca9029591c8df26bb6bd063c312b7fe upstream
Such that we can re-use it.
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman