05 Mar, 2020

10 commits

  • commit 693e02cc24090c379217138719d9d84e50036b24 upstream.

    According to the SDM, VMWRITE checks whether the secondary source
    operand corresponds to an unsupported VMCS field before it checks
    whether the operand corresponds to a VM-exit information field on a
    processor that does not support writing to VM-exit information fields.

    Fixes: 49f705c5324aa ("KVM: nVMX: Implement VMREAD and VMWRITE")
    Signed-off-by: Jim Mattson
    Cc: Paolo Bonzini
    Reviewed-by: Peter Shier
    Reviewed-by: Oliver Upton
    Reviewed-by: Jon Cargille
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Jim Mattson
     
  • commit dd2d6042b7f4a5440705b4ffc6c4c2dba81a43b7 upstream.

    According to the SDM, a VMWRITE in VMX non-root operation with an
    invalid VMCS-link pointer results in VMfailInvalid before the validity
    of the VMCS field in the secondary source operand is checked.

    For consistency, modify both handle_vmwrite and handle_vmread, even
    though there was no problem with the latter.
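
    A sketch of the resulting ordering in both handlers (condensed and
    illustrative):

    /* VMfailInvalid on a bad VMCS(-link) pointer comes first ... */
    if (vmx->nested.current_vmptr == -1ull ||
        (is_guest_mode(vcpu) &&
         get_vmcs12(vcpu)->vmcs_link_pointer == -1ull))
            return nested_vmx_failInvalid(vcpu);
    /* ... and only then is the VMCS field operand decoded and checked. */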

    Fixes: 6d894f498f5d1 ("KVM: nVMX: vmread/vmwrite: Use shadow vmcs12 if running L2")
    Signed-off-by: Jim Mattson
    Cc: Liran Alon
    Cc: Paolo Bonzini
    Cc: Vitaly Kuznetsov
    Reviewed-by: Peter Shier
    Reviewed-by: Oliver Upton
    Reviewed-by: Jon Cargille
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Jim Mattson
     
  • commit 208050dac5ef4de5cb83ffcafa78499c94d0b5ad upstream.

    Remove a bogus clearing of apf.msr_val from kvm_arch_vcpu_destroy().

    apf.msr_val is only set to a non-zero value by kvm_pv_enable_async_pf(),
    which is only reachable by kvm_set_msr_common(), i.e. by writing
    MSR_KVM_ASYNC_PF_EN. KVM does not autonomously write said MSR, i.e. it
    can only be written via KVM_SET_MSRS or KVM_RUN. Since KVM_SET_MSRS and
    KVM_RUN are vcpu ioctls, they require a valid vcpu file descriptor.
    kvm_arch_vcpu_destroy() is only called if KVM_CREATE_VCPU fails, and KVM
    declares KVM_CREATE_VCPU successful once the vcpu fd is installed and
    thus visible to userspace. Ergo, apf.msr_val cannot be non-zero when
    kvm_arch_vcpu_destroy() is called.

    Fixes: 344d9588a9df0 ("KVM: Add PV MSR to enable asynchronous page faults delivery.")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 9d979c7e6ff43ca3200ffcb74f57415fd633a2da upstream.

    x86 does not load its MMU until KVM_RUN, which cannot be invoked until
    after vCPU creation succeeds. Given that kvm_arch_vcpu_destroy() is
    called if and only if vCPU creation fails, it is impossible for the MMU
    to be loaded.

    Note, the bogus kvm_mmu_unload() call was added during an unrelated
    refactoring of vCPU allocation, i.e. was presumably added as an
    opportunistic "fix" for a perceived leak.

    Fixes: fb3f0f51d92d1 ("KVM: Dynamically allocate vcpus")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 536a0d8e79fb928f2735db37dda95682b6754f9a upstream.

    Currently, there are three static keys in the resctrl file system:
    rdt_mon_enable_key and rdt_alloc_enable_key indicate if the monitoring
    feature and the allocation feature are enabled, respectively. The
    rdt_enable_key is enabled when either the monitoring feature or the
    allocation feature is enabled.

    If no monitoring feature is present (either hardware doesn't support a
    monitoring feature or the feature is disabled by the kernel command line
    option "rdt="), rdt_enable_key is still enabled but rdt_mon_enable_key
    is disabled.

    MBM is a monitoring feature. The MBM overflow handler checks whether
    the monitoring feature is enabled so that it can return quickly when
    it is not.

    So check rdt_mon_enable_key there instead of rdt_enable_key, as the
    former is the more accurate check.
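
    In code terms, the early-return check in the overflow handler becomes
    (a sketch, not the verbatim diff):

    if (!static_branch_likely(&rdt_mon_enable_key))  /* was: rdt_enable_key */
            goto out_unlock;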

    [ bp: Massage commit message. ]

    Fixes: e33026831bdb ("x86/intel_rdt/mbm: Handle counter overflow")
    Signed-off-by: Xiaochen Shen
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/1576094705-13660-1-git-send-email-xiaochen.shen@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Xiaochen Shen
     
  • commit 52918ed5fcf05d97d257f4131e19479da18f5d16 upstream.

    The KVM MMIO support uses bit 51 as the reserved bit to cause nested page
    faults when a guest performs MMIO. The AMD memory encryption support uses
    a CPUID function to define the encryption bit position. Given this, it is
    possible that these bits can conflict.

    Use svm_hardware_setup() to override the MMIO mask if memory encryption
    support is enabled. Various checks are performed to ensure that the mask
    is properly defined and rsvd_bits() is used to generate the new mask (as
    was done prior to the change that necessitated this patch).
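
    A condensed sketch of the idea (see svm_adjust_mmio_mask() upstream;
    treat the details as illustrative):

    enc_bit  = cpuid_ebx(0x8000001f) & 0x3f;  /* encryption bit position */
    mask_bit = boot_cpu_data.x86_phys_bits;
    if (enc_bit == mask_bit)                  /* avoid the conflict */
            mask_bit++;
    mask = (mask_bit < 52) ?
            rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;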

    Fixes: 28a1f3ac1d0c ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
    Suggested-by: Sean Christopherson
    Reviewed-by: Sean Christopherson
    Signed-off-by: Tom Lendacky
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Tom Lendacky
     
  • commit 86f7e90ce840aa1db407d3ea6e9b3a52b2ce923c upstream.

    KVM emulates UMIP on hardware that doesn't support it by setting the
    'descriptor table exiting' VM-execution control and performing
    instruction emulation. When running nested, this emulation is broken as
    KVM refuses to emulate L2 instructions by default.

    Correct this regression by allowing the emulation of descriptor table
    instructions if L1 hasn't requested 'descriptor table exiting'.
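
    A sketch of the resulting check in vmx_check_intercept() (condensed,
    illustrative):

    case x86_intercept_lgdt: case x86_intercept_lidt:
    case x86_intercept_sgdt: case x86_intercept_sidt:
            /* Emulate for L2 unless L1 wants 'descriptor table exiting'. */
            if (!nested_cpu_has2(get_vmcs12(vcpu), SECONDARY_EXEC_DESC))
                    return X86EMUL_CONTINUE;
            break;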

    Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
    Reported-by: Jan Kiszka
    Cc: stable@vger.kernel.org
    Cc: Paolo Bonzini
    Cc: Jim Mattson
    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • [ Upstream commit 0aa0e0d6b34b89649e6b5882a7e025a0eb9bd832 ]

    Tremont is Intel's successor to Goldmont Plus. The SMI_COUNT MSR is
    also supported.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Andi Kleen
    Link: https://lkml.kernel.org/r/1580236279-35492-3-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Sasha Levin

    Kan Liang
     
  • [ Upstream commit ecf71fbccb9ac5cb964eb7de59bb9da3755b7885 ]

    Tremont is Intel's successor to Goldmont Plus. From the perspective of
    Intel cstate residency counters, nothing has changed compared with
    Goldmont Plus and Goldmont.

    Share glm_cstates with Goldmont Plus and Goldmont.
    Update the comments for Tremont.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Andi Kleen
    Link: https://lkml.kernel.org/r/1580236279-35492-2-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Sasha Levin

    Kan Liang
     
  • [ Upstream commit eda23b387f6c4bb2971ac7e874a09913f533b22c ]

    Elkhart Lake also uses the Tremont CPU. From the perspective of the
    Intel PMU, nothing has changed compared with Jacobsville.
    Share the perf code with Jacobsville.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Andi Kleen
    Link: https://lkml.kernel.org/r/1580236279-35492-1-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Sasha Levin

    Kan Liang
     

29 Feb, 2020

11 commits

  • commit 23520b2def95205f132e167cf5b25c609975e959 upstream.

    When pv_eoi_get_user() fails, 'val' may remain uninitialized and the return
    value of pv_eoi_get_pending() becomes random. Fix the issue by initializing
    the variable.
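
    A sketch of the fix (illustrative; mirrors the shape of the lapic code):

    static bool pv_eoi_get_pending(struct kvm_vcpu *vcpu)
    {
            u8 val = 0;     /* was uninitialized; 0 == no pending EOI */

            if (pv_eoi_get_user(vcpu, &val) < 0)
                    apic_debug("Can't read EOI MSR value: 0x%llx\n",
                               (unsigned long long)vcpu->arch.pv_eoi.msr_val);

            return val & 0x1;
    }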

    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Miaohe Lin
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 91a5f413af596ad01097e59bf487eb07cb3f1331 upstream.

    Even when APICv is disabled for L1, it can still be (and, in fact, is)
    available for L2. This means we need to always call
    vmx_deliver_nested_posted_interrupt() when attempting an interrupt
    delivery.

    Suggested-by: Paolo Bonzini
    Signed-off-by: Vitaly Kuznetsov
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • … is globally disabled

    commit a4443267800af240072280c44521caab61924e55 upstream.

    When apicv is disabled on a vCPU (e.g. by enabling KVM_CAP_HYPERV_SYNIC*),
    nothing happens to the VMX MSRs on already existing vCPUs; however, all new
    ones are created with PIN_BASED_POSTED_INTR filtered out. This is very
    confusing and results in the following picture inside the guest:

    $ rdmsr -ax 0x48d
    ff00000016
    7f00000016
    7f00000016
    7f00000016

    This is observed with QEMU and 4-vCPU guest: QEMU creates vCPU0, does
    KVM_CAP_HYPERV_SYNIC2 and then creates the remaining three.

    An L1 hypervisor may only check CPU0's controls to find out which features
    are available, and it will be very confused later. Switch to setting the
    PIN_BASED_POSTED_INTR control based on the global 'enable_apicv' setting.
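
    A sketch of the fix: derive the filtered pin-based controls from the
    module-wide knob (condensed, illustrative):

    msrs->pinbased_ctls_high &= PIN_BASED_EXT_INTR_MASK |
                                PIN_BASED_NMI_EXITING |
                                PIN_BASED_VIRTUAL_NMIS |
                                (enable_apicv ? PIN_BASED_POSTED_INTR : 0);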

    Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Vitaly Kuznetsov
     
  • commit 35a571346a94fb93b5b3b6a599675ef3384bc75c upstream.

    Consult the 'unconditional IO exiting' and 'use IO bitmaps' VM-execution
    controls when checking instruction interception. If the 'use IO bitmaps'
    VM-execution control is 1, check the instruction access against the IO
    bitmaps to determine if the instruction causes a VM-exit.
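
    A sketch of the resulting logic for IN/OUT interception (illustrative;
    the helper is introduced by the refactoring commit below):

    if (nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
            intercept = nested_vmx_check_io_bitmaps(vcpu, port, size);
    else
            intercept = nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING);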

    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • commit e71237d3ff1abf9f3388337cfebf53b96df2020d upstream.

    Checks against the IO bitmap are useful for both instruction emulation
    and VM-exit reflection. Refactor the IO bitmap checks into a helper
    function.

    Signed-off-by: Oliver Upton
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • commit 7455a8327674e1a7c9a1f5dd1b0743ab6713f6d1 upstream.

    Commit 13db77347db1 ("KVM: x86: don't notify userspace IOAPIC on edge
    EOI") explained that edge-triggered interrupts don't set a bit in the TMR,
    which means the IOAPIC isn't notified on EOI, and that the variable
    'level' indicates a level-triggered interrupt.
    But commit 3159d36ad799 ("KVM: x86: use generic function for MSI parsing")
    replaced 'level' with irq.level by mistake. Fix it by changing irq.level
    to irq.trig_mode.
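
    The shape of the fix, in kvm_scan_ioapic_routes() (treat as an
    illustrative sketch):

    if (irq.trig_mode &&    /* was: irq.level */
        kvm_apic_match_dest(vcpu, NULL, 0, irq.dest_id, irq.dest_mode))
            __set_bit(irq.vector, ioapic_handled_vectors);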

    Cc: stable@vger.kernel.org
    Fixes: 3159d36ad799 ("KVM: x86: use generic function for MSI parsing")
    Signed-off-by: Miaohe Lin
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 07721feee46b4b248402133228235318199b05ec upstream.

    vmx_check_intercept is not yet fully implemented. To avoid emulating
    instructions disallowed by the L1 hypervisor, refuse to emulate
    instructions by default.

    Cc: stable@vger.kernel.org
    [Made commit, added commit msg - Oliver]
    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • commit 21b5ee59ef18e27d85810584caf1f7ddc705ea83 upstream.

    Commit

    aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired)
    performance counter")

    added support for access to the free-running counter via 'perf -e
    msr/irperf/', but when exercised, it always returns a 0 count:

    BEFORE:

    $ perf stat -e instructions,msr/irperf/ true

    Performance counter stats for 'true':

    624,833 instructions
    0 msr/irperf/

    Simply set its enable bit - HWCR bit 30 - to make it start counting.

    Enablement is restricted to machines advertising IRPERF capability,
    except those susceptible to an erratum that makes IRPERF return
    bad values.

    That erratum occurs in Family 17h models 00-1fh [1], but not in F17h
    models 20h and above [2].
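
    A sketch of the enablement (illustrative; the erratum/helper names are
    assumptions about the cpu-setup code):

    if (cpu_has(c, X86_FEATURE_IRPERF) &&
        !cpu_has_amd_erratum(c, amd_erratum_1054))  /* F17h models 00-1fh */
            msr_set_bit(MSR_K7_HWCR, 30);           /* IRPerfEn */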

    AFTER (on a family 17h model 31h machine):

    $ perf stat -e instructions,msr/irperf/ true

    Performance counter stats for 'true':

    621,690 instructions
    622,490 msr/irperf/

    [1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors
    [2] Revision Guide for AMD Family 17h Models 30h-3Fh Processors

    The revision guides are available from the bugzilla Link below.

    [ bp: Massage commit message. ]

    Fixes: aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter")
    Signed-off-by: Kim Phillips
    Signed-off-by: Borislav Petkov
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
    Link: http://lkml.kernel.org/r/20200214201805.13830-1-kim.phillips@amd.com
    Signed-off-by: Greg Kroah-Hartman

    Kim Phillips
     
  • commit 51dede9c05df2b78acd6dcf6a17d21f0877d2d7b upstream.

    Accessing the MCA thresholding controls in sysfs concurrently with CPU
    hotplug can lead to a couple of KASAN-reported issues:

    BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180
    Read of size 8 at addr ffff888367578940 by task grep/4019

    and

    BUG: KASAN: use-after-free in show_error_count+0x15c/0x180
    Read of size 2 at addr ffff888368a05514 by task grep/4454

    for example. Both result from the fact that the threshold block
    creation/teardown code frees the descriptor memory itself instead of
    defining a proper ->release function and leaving it to the driver core
    to take care of that after all sysfs accesses have completed.

    Do that and get rid of the custom freeing code, fixing the above UAFs in
    the process.
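
    A sketch of the pattern: give the kobject a ->release callback and let
    the driver core free the memory once the last sysfs reference is gone
    (names illustrative):

    static void threshold_block_release(struct kobject *kobj)
    {
            kfree(container_of(kobj, struct threshold_block, kobj));
    }

    static struct kobj_type threshold_ktype = {
            .sysfs_ops = &threshold_ops,
            .release   = threshold_block_release,
    };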

    [ bp: write commit message. ]

    Fixes: 95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors")
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Borislav Petkov
    Cc:
    Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 6e5cf31fbe651bed7ba1df768f2e123531132417 upstream.

    threshold_create_bank() creates a bank descriptor per MCA error
    thresholding counter which can be controlled over sysfs. It publishes
    the pointer to that bank in a per-CPU variable and then goes on to
    create additional thresholding blocks if the bank has any.

    However, that creation of additional blocks in
    allocate_threshold_blocks() can fail, leading to a use-after-free
    through the per-CPU pointer.

    Therefore, publish that pointer only after all blocks have been set up
    successfully.
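
    A sketch of the reordering (illustrative; 'msr_addr' is a placeholder):

    err = allocate_threshold_blocks(cpu, b, bank, 0, msr_addr);
    if (err)
            goto out_free;                     /* pointer never published */
    per_cpu(threshold_banks, cpu)[bank] = b;   /* publish only on success */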

    Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor")
    Reported-by: Saar Amar
    Reported-by: Dan Carpenter
    Signed-off-by: Borislav Petkov
    Cc:
    Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain
    Signed-off-by: Greg Kroah-Hartman

    Borislav Petkov
     
  • commit ff5ac61ee83c13f516544d29847d28be093a40ee upstream.

    The IMA arch code attempts to inspect the "SetupMode" EFI variable
    by populating a variable called efi_SetupMode_name with the string
    "SecureBoot" and passing that to the EFI GetVariable service, which
    obviously does not yield the expected result.

    Given that the string is only referenced a single time, let's get
    rid of the intermediate variable, and pass the correct string as
    an immediate argument. While at it, do the same for "SecureBoot".

    Fixes: 399574c64eaf ("x86/ima: retry detecting secure boot mode")
    Fixes: 980ef4d22a95 ("x86/ima: check EFI SetupMode too")
    Cc: Matthew Garrett
    Signed-off-by: Ard Biesheuvel
    Cc: stable@vger.kernel.org # v5.3
    Signed-off-by: Mimi Zohar
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     

24 Feb, 2020

9 commits

  • [ Upstream commit 8b7e20a7ba54836076ff35a28349dabea4cec48f ]

    Add the TEST opcode to Group3-2 reg=001b, just as Group3-1 has it.

    Commit

    12a78d43de76 ("x86/decoder: Add new TEST instruction pattern")

    added a TEST opcode assignment to f6 XX/001/XXX (Group 3-1), but did
    not add f7 XX/001/XXX (Group 3-2).

    Actually, this TEST opcode variant (ModRM.reg /1) is not described in
    the Intel SDM Vol2 but in AMD64 Architecture Programmer's Manual Vol.3,
    Appendix A.2 Table A-6. ModRM.reg Extensions for the Primary Opcode Map.

    Without this fix, Randy Dunlap reported a warning from insn_decoder_test
    related to this issue, shown below.

    HOSTCC arch/x86/tools/insn_decoder_test
    HOSTCC arch/x86/tools/insn_sanity
    TEST posttest
    arch/x86/tools/insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this.
    arch/x86/tools/insn_decoder_test: warning: ffffffff81000bf1: f7 0b 00 01 08 00 testl $0x80100,(%rbx)
    arch/x86/tools/insn_decoder_test: warning: objdump says 6 bytes, but insn_get_length() says 2
    arch/x86/tools/insn_decoder_test: warning: Decoded and checked 11913894 instructions with 1 failures
    TEST posttest
    arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0x871ce29c)

    To fix this error, add the TEST opcode according to AMD64 APM Vol.3.

    [ bp: Massage commit message. ]

    Reported-by: Randy Dunlap
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Acked-by: Randy Dunlap
    Tested-by: Randy Dunlap
    Link: https://lkml.kernel.org/r/157966631413.9580.10311036595431878351.stgit@devnote2
    Signed-off-by: Sasha Levin

    Masami Hiramatsu
     
  • [ Upstream commit 75fbef0a8b6b4bb19b9a91b5214f846c2dc5139e ]

    The following commit:

    15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()")

    modified kernel_map_pages_in_pgd() to manage writable permissions
    of memory mappings in the EFI page table in a different way, but
    in the process, it removed the ability to clear NX attributes from
    read-only mappings, by clobbering the clear mask if _PAGE_RW is not
    being requested.

    Failure to remove the NX attribute from read-only mappings is
    unlikely to be a security issue, but it does prevent us from
    tightening the permissions in the EFI page tables going forward,
    so let's fix it now.
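
    A sketch of the fix: compute both clearable attributes in one expression
    instead of letting a second assignment clobber the first (illustrative):

    cpa.mask_clr = __pgprot(~page_flags & (_PAGE_NX | _PAGE_RW));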

    Fixes: 15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()")
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200113172245.27925-5-ardb@kernel.org
    Signed-off-by: Sasha Levin

    Ard Biesheuvel
     
  • [ Upstream commit 471af006a747f1c535c8a8c6c0973c320fe01b22 ]

    AMD Family 17h processors and above gain support for Large Increment
    per Cycle events. Unfortunately there is no CPUID or equivalent bit
    that indicates whether the feature exists or not, so we continue to
    determine eligibility based on a CPU family number comparison.

    For Large Increment per Cycle events, we add an f17h-and-compatibles
    get_event_constraints_f17h() that returns an even counter bitmask:
    Large Increment per Cycle events can only be placed on PMCs 0, 2,
    and 4 out of the currently available 0-5. The only currently
    public event that requires this feature to report valid counts
    is PMCx003 "Retired SSE/AVX Operations".
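
    A sketch of such a constraint (illustrative): a counter mask of
    0b010101 allows only PMCs 0, 2 and 4 out of the available 0-5:

    static struct event_constraint pair_constraint =
            EVENT_CONSTRAINT(0, 0x15ULL, 0);   /* PMCs 0, 2, 4 */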

    Note that the CPU family logic in amd_core_pmu_init() is changed
    so as to be able to selectively add initialization for features
    available in ranges of backward-compatible CPU families. This
    Large Increment per Cycle feature is expected to be retained
    in future families.

    A side effect of assigning a new get_constraints function for f17h is
    that it stops calling the old (pre-f15h) amd_get_event_constraints
    implementation left enabled by commit e40ed1542dd7 ("perf/x86: Add perf
    support for AMD family-17h processors"), which is no longer necessary
    since those North Bridge event codes are obsolete.

    Also fix a spelling mistake whilst in the area (calulating ->
    calculating).

    Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
    Signed-off-by: Kim Phillips
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191114183720.19887-2-kim.phillips@amd.com
    Signed-off-by: Sasha Levin

    Kim Phillips
     
  • [ Upstream commit 248ed51048c40d36728e70914e38bffd7821da57 ]

    First, printk() is NMI-context safe now that the safe printk() has been
    implemented, and it already uses an irq_work to achieve NMI-context safety.

    Second, this NMI irq_work actually does not work if an NMI handler causes
    a panic by watchdog timeout. It has no chance to run in such a case, while
    the safe printk() will flush its per-cpu buffers before panicking.

    While at it, repurpose the irq_work callback into a function which
    concentrates the NMI duration checking and makes the code easier to
    follow.

    [ bp: Massage. ]

    Signed-off-by: Changbin Du
    Signed-off-by: Borislav Petkov
    Acked-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200111125427.15662-1-changbin.du@gmail.com
    Signed-off-by: Sasha Levin

    Changbin Du
     
  • [ Upstream commit e2d68a955e49d61fd0384f23e92058dc9b79be5e ]

    The logic in __efi_enter_virtual_mode() does a number of steps in
    sequence, all of which may fail in one way or the other. In most
    cases, we simply print an error and disable EFI runtime services
    support, but in some cases, we BUG() or panic() and bring down the
    system when encountering conditions that we could easily handle in
    the same way.

    While at it, replace a pointless page-to-virt-phys conversion with
    one that goes straight from struct page to physical.

    Signed-off-by: Ard Biesheuvel
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arvind Sankar
    Cc: Matthew Garrett
    Cc: linux-efi@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200103113953.9571-14-ardb@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Ard Biesheuvel
     
  • [ Upstream commit bff47c2302cc249bcd550b17067f8dddbd4b6f77 ]

    When building with C=1, sparse issues a warning:

    CHECK arch/x86/entry/vdso/vdso32-setup.c
    arch/x86/entry/vdso/vdso32-setup.c:28:28: warning: symbol 'vdso32_enabled' was not declared. Should it be static?

    Provide the missing header file.

    Signed-off-by: Valdis Kletnieks
    Signed-off-by: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: x86-ml
    Link: https://lkml.kernel.org/r/36224.1575599767@turing-police
    Signed-off-by: Sasha Levin

    Valdis Klētnieks
     
  • [ Upstream commit dacc9092336be20b01642afe1a51720b31f60369 ]

    When checking whether the reported lfb_size makes sense, the height *
    stride result is page-aligned before checking whether it exceeds the
    reported size.

    This doesn't work if height * stride is not an exact number of pages.
    For example, as reported in the kernel bugzilla below, an 800x600x32 EFI
    framebuffer gets skipped because of this.

    Move the PAGE_ALIGN to after the check vs size.
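
    A sketch of the reordering (illustrative):

    length = mode->height * mode->stride;
    if (length > size)
            return false;                 /* reported lfb_size too small */
    length = PAGE_ALIGN(length);          /* was applied before the check */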

    Reported-by: Christopher Head
    Tested-by: Christopher Head
    Signed-off-by: Arvind Sankar
    Signed-off-by: Borislav Petkov
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206051
    Link: https://lkml.kernel.org/r/20200107230410.2291947-1-nivedita@alum.mit.edu
    Signed-off-by: Sasha Levin

    Arvind Sankar
     
  • [ Upstream commit ffc2760bcf2dba0dbef74013ed73eea8310cc52c ]

    Fix a couple of issues with the way we map and copy the vendor string:
    - we map only 2 bytes, which usually works since you get at least a
    page, but if the vendor string happens to cross a page boundary,
    a crash will result
    - only call early_memunmap() if early_memremap() succeeded, or we will
    call it with a NULL address which it doesn't like,
    - while at it, switch to early_memremap_ro(), and array indexing rather
    than pointer dereferencing to read the CHAR16 characters.
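
    A sketch of the fixed mapping and copy (illustrative):

    c16 = early_memremap_ro(fw_vendor, sizeof(vendor) * sizeof(efi_char16_t));
    if (c16) {                               /* unmap only what was mapped */
            for (i = 0; i < sizeof(vendor) - 1 && c16[i]; i++)
                    vendor[i] = c16[i];      /* index, don't dereference++ */
            vendor[i] = '\0';
            early_memunmap(c16, sizeof(vendor) * sizeof(efi_char16_t));
    }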

    Signed-off-by: Ard Biesheuvel
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arvind Sankar
    Cc: Matthew Garrett
    Cc: linux-efi@vger.kernel.org
    Fixes: 5b83683f32b1 ("x86: EFI runtime service support")
    Link: https://lkml.kernel.org/r/20200103113953.9571-5-ardb@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Ard Biesheuvel
     
  • [ Upstream commit bbc55341b9c67645d1a5471506370caf7dd4a203 ]

    In __fpu__restore_sig(), fpu_fpregs_owner_ctx needs to be reset if the
    FPU state was not fully restored. Otherwise the following may happen (on
    the same CPU):

    Task A                     Task B             fpu_fpregs_owner_ctx
                                                  *active*  A.fpu
    __fpu__restore_sig()
                               ctx switch         load B.fpu
                                                  *active*  B.fpu
    fpregs_lock()
    copy_user_to_fpregs_zeroing()
      copy_kernel_to_xregs() *modify*
      copy_user_to_xregs() *fails*
    fpregs_unlock()
                               ctx switch         skip loading B.fpu,
                                                  *active*  B.fpu

    In the success case, fpu_fpregs_owner_ctx is set to the current task.

    In the failure case, the FPU state might have been modified by loading
    the init state.

    In this case, fpu_fpregs_owner_ctx needs to be reset in order to ensure
    that the FPU state of the following task is loaded from saved state (and
    not skipped because it was the previous state).

    Reset fpu_fpregs_owner_ctx after a failure during restore, to ensure
    that the FPU state for the next task is always loaded.
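
    A sketch of the fix (illustrative): on failure, drop fpregs ownership
    so the next context switch reloads the state from memory:

    ret = copy_user_to_fpregs_zeroing(buf_fx, xfeatures, fx_only);
    if (ret)
            fpregs_deactivate(fpu);   /* resets fpu_fpregs_owner_ctx */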

    The problem was debugged by Yu-cheng Yu.

    [ bp: Massage commit message. ]

    Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace")
    Reported-by: Yu-cheng Yu
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jann Horn
    Cc: Peter Zijlstra
    Cc: "Ravi V. Shankar"
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: x86-ml
    Link: https://lkml.kernel.org/r/20191220195906.plk6kpmsrikvbcfn@linutronix.de
    Signed-off-by: Sasha Levin

    Sebastian Andrzej Siewior
     

20 Feb, 2020

5 commits

  • [ Upstream commit f6ab0107a4942dbf9a5cf0cca3f37e184870a360 ]

    Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow
    paging for 5-level guest page tables. PT_MAX_FULL_LEVELS is used to
    size the arrays that track guest page table information, i.e. using a
    "max levels" of 4 causes KVM to access garbage beyond the end of an
    array when querying state for level 5 entries. E.g. FNAME(gpte_changed)
    will read garbage and most likely return %true for a level 5 entry,
    soft-hanging the guest because FNAME(fetch) will restart the guest
    instead of creating SPTEs, since it thinks the guest PTE has changed.

    Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS
    gets to stay "4" for the PTTYPE_EPT case.
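
    The fix, in paging_tmpl.h for the 64-bit PTTYPE (sketch):

    #define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL   /* was: 4 */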

    Fixes: 855feb673640 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • commit 307f1cfa269657c63cfe2c932386fcc24684d9dd upstream.

    KVM defines the #DB payload as compatible with the 'pending debug
    exceptions' field under VMX, not DR6. Mask off bit 12 when applying the
    payload to DR6, as it is reserved on DR6 but not the 'pending debug
    exceptions' field.
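
    A sketch of the payload application (illustrative): bit 12 means
    'enabled breakpoint' in the VMX 'pending debug exceptions' field but
    is reserved in DR6:

    vcpu->arch.dr6 |= payload;
    vcpu->arch.dr6 &= ~BIT(12);   /* the fix: keep the reserved bit clear */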

    Fixes: f10c729ff965 ("kvm: vmx: Defer setting of DR6 until #DB delivery")
    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • commit f861854e1b435b27197417f6f90d87188003cb24 upstream.

    Perf doesn't take the left period into account when auto-reload is
    enabled with fixed period sampling mode in context switch.

    Here is the MSR trace of the perf command shown below.
    (The MSR trace is simplified from an ftrace log.)

    #perf record -e cycles:p -c 2000000 -- ./triad_loop

    //The MSR trace of task schedule out
    //perf disable all counters, disable PEBS, disable GP counter 0,
    //read GP counter 0, and re-enable all counters.
    //The counter 0 stops at 0xfffffff82840
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
    write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0
    write_msr: MSR_P6_EVNTSEL0(186), value 40003003c
    rdpmc: 0, value fffffff82840
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff

    //The MSR trace of the same task schedule in again
    //perf disable all counters, enable and set GP counter 0,
    //enable PEBS, and re-enable all counters.
    //0xffffffe17b80 (-2000000) is written to GP counter 0.
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
    write_msr: MSR_IA32_PMC0(4c1), value ffffffe17b80
    write_msr: MSR_P6_EVNTSEL0(186), value 40043003c
    write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff

    When the same task schedules in again, the counter should start from the
    previous left period. However, it starts from the fixed period -2000000
    again.

    A special variant of intel_pmu_save_and_restart() is used for
    auto-reload, which doesn't update the hwc->period_left.
    When the monitored task schedules in again, perf doesn't know the left
    period. The fixed period is used, which is inaccurate.

    With auto-reload, the counter always has a negative counter value. So
    the left period is -value. Update the period_left in
    intel_pmu_save_and_restart_reload().
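
    A sketch of the fix in intel_pmu_save_and_restart_reload(): with
    auto-reload the counter value is always negative, so the remaining
    period is simply its negation ('new' being the just-read raw count):

    local64_set(&hwc->period_left, -new);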

    With the patch:

    //The MSR trace of task schedule out
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
    write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0
    write_msr: MSR_P6_EVNTSEL0(186), value 40003003c
    rdpmc: 0, value ffffffe25cbc
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff

    //The MSR trace of the same task schedule in again
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
    write_msr: MSR_IA32_PMC0(4c1), value ffffffe25cbc
    write_msr: MSR_P6_EVNTSEL0(186), value 40043003c
    write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff

    Fixes: d31fc13fdcb2 ("perf/x86/intel: Fix event update for auto-reload")
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200121190125.3389-1-kan.liang@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Kan Liang
     
  • commit 25d387287cf0330abf2aad761ce6eee67326a355 upstream.

    Commit 3fe3331bb285 ("perf/x86/amd: Add event map for AMD Family 17h"),
    claimed L2 misses were unsupported, due to them not being found in its
    referenced documentation, whose link has now moved [1].

    That old documentation listed PMCx064 unit mask bit 3 as:

    "LsRdBlkC: LS Read Block C S L X Change to X Miss."

    and bit 0 as:

    "IcFillMiss: IC Fill Miss"

    We now have new public documentation [2] with improved descriptions, that
    clearly indicate what events those unit mask bits represent:

    Bit 3 now clearly states:

    "LsRdBlkC: Data Cache Req Miss in L2 (all types)"

    and bit 0 is:

    "IcFillMiss: Instruction Cache Req Miss in L2."

    So we can now add support for L2 misses in perf's genericised events as
    PMCx064 with both the above unit masks.

    [1] The commit's original documentation reference, "Processor Programming
    Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors",
    originally available here:

    https://www.amd.com/system/files/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf

    is now available here:

    https://developer.amd.com/wordpress/media/2017/11/54945_PPR_Family_17h_Models_00h-0Fh.pdf

    [2] "Processor Programming Reference (PPR) for Family 17h Model 31h,
    Revision B0 Processors", available here:

    https://developer.amd.com/wp-content/resources/55803_0.54-PUB.pdf

    Fixes: 3fe3331bb285 ("perf/x86/amd: Add event map for AMD Family 17h")
    Reported-by: Babu Moger
    Signed-off-by: Kim Phillips
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Tested-by: Babu Moger
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200121171232.28839-1-kim.phillips@amd.com
    Signed-off-by: Greg Kroah-Hartman

    Kim Phillips
     
  • commit 148d735eb55d32848c3379e460ce365f2c1cbe4b upstream.

    Hardcode the EPT page-walk level for L2 to be 4 levels, as KVM's MMU
    currently also hardcodes the page walk level for nested EPT to be 4
    levels. The L2 guest is all but guaranteed to soft hang on its first
    instruction when L1 is using EPT, as KVM will construct 4-level page
    tables and then tell hardware to use 5-level page tables.

    Fixes: 855feb673640 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

15 Feb, 2020

1 commit

  • [ Upstream commit 2b73ea3796242608b4ccf019ff217156c92e92fe ]

    Break out of the infinite loop that occurs during early parsing of the
    SRAT table when a subtable has zero length. This is known to affect the
    ASUS WS X299 SAGE motherboard with firmware version 1201, which has a
    large block of zeros in its SRAT table. The kernel could boot
    successfully on this board/firmware prior to the introduction of early
    parsing of this table, or after a BIOS update.
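
    A sketch of the guard (illustrative):

    sub_table = (struct acpi_subtable_header *)table;
    if (!sub_table->length) {
            debug_putstr("Invalid zero length SRAT subtable.\n");
            return 0;                 /* assume no immovable regions */
    }
    table += sub_table->length;       /* a zero here used to loop forever */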

    [ bp: Fixup whitespace damage and commit message. Make it return 0 to
    denote that there are no immovable regions because who knows what
    else is broken in this BIOS. ]

    Fixes: 02a3e3cdb7f1 ("x86/boot: Parse SRAT table and count immovable memory regions")
    Signed-off-by: Steven Clarkson
    Signed-off-by: Borislav Petkov
    Cc: linux-acpi@vger.kernel.org
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206343
    Link: https://lkml.kernel.org/r/CAHKq8taGzj0u1E_i=poHUam60Bko5BpiJ9jn0fAupFUYexvdUQ@mail.gmail.com
    Signed-off-by: Sasha Levin

    Steven Clarkson
     

11 Feb, 2020

4 commits

  • commit 6f1a4891a5928a5969c87fa5a584844c983ec823 upstream.

    Evan tracked down a subtle race between the update of the MSI message and
    the device raising an interrupt internally on PCI devices which do not
    support MSI masking. The update of the MSI message is non-atomic and
    consists of either 2 or 3 sequential 32bit wide writes to the PCI config
    space.

    - Write address low 32bits
    - Write address high 32bits (If supported by device)
    - Write data

    When an interrupt is migrated then both address and data might change, so
    the kernel attempts to mask the MSI interrupt first. But MSI masking is
    optional, so there exist devices which do not provide it. That means that
    if the device raises an interrupt internally between the writes, an MSI
    message is sent built from half-updated state.

    On x86 this can lead to spurious interrupts on the wrong interrupt
    vector when the affinity setting changes both address and data. As a
    consequence the device interrupt can be lost causing the device to
    become stuck or malfunctioning.

    Evan tried to handle that by disabling MSI across an MSI message
    update. That's not feasible because disabling MSI has issues of its own:

    If MSI is disabled the PCI device is routing an interrupt to the legacy
    INTx mechanism. The INTx delivery can be disabled, but the disablement is
    not working on all devices.

    Some devices lose interrupts when both MSI and INTx delivery are disabled.

    Another way to solve this would be to enforce the allocation of the same
    vector on all CPUs in the system for this kind of screwed device. That
    could be done, but it would bring back the vector space exhaustion problems
    which were solved a few years ago.

    Fortunately the high address (if supported by the device) is only relevant
    when X2APIC is enabled which implies interrupt remapping. In the interrupt
    remapping case the affinity setting is happening at the interrupt remapping
    unit and the PCI MSI message is programmed only once when the PCI device is
    initialized.

    That makes it possible to solve it with a two step update:

    1) Target the MSI msg to the new vector on the current target CPU

    2) Target the MSI msg to the new vector on the new target CPU

    In both cases writing the MSI message is only changing a single 32bit word
    which prevents the issue of inconsistency.

    After writing the final destination it is necessary to check whether the
    device issued an interrupt while the intermediate state #1 (new vector,
    current CPU) was in effect.

    This is possible because the affinity change is always happening on the
    current target CPU. The code runs with interrupts disabled, so the
    interrupt can be detected by checking the IRR of the local APIC. If the
    vector is pending in the IRR then the interrupt is retriggered on the new
    target CPU by sending an IPI for the associated vector on the target CPU.
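
    A condensed sketch of the two-step update plus the IRR check
    (illustrative helper names; the real logic lives in msi_set_affinity()):

    old_cfg.vector = cfg->vector;       /* step 1: new vector, old CPU */
    irq_msi_update_msg(irqd, &old_cfg);
    irq_msi_update_msg(irqd, cfg);      /* step 2: new vector, new CPU */

    /* If the device fired while step 1 was live, the interrupt sits in
     * the local APIC's IRR: retrigger it on the new target CPU. */
    if (irr_pending(cfg->vector))       /* hypothetical IRR helper */
            apic->send_IPI(new_cpu, cfg->vector);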

    This can cause spurious interrupts on both the local and the new target
    CPU.

    1) If the new vector is not in use on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then interrupt entry code will
    ignore that spurious interrupt. The vector is marked so that the
    'No irq handler for vector' warning is suppressed once.

    2) If the new vector is in use already on the local CPU then the IRR check
    might see a pending interrupt from the device which is using this
    vector. The IPI to the new target CPU will then invoke the handler of
    the device which got the affinity change, even if that device did not
    issue an interrupt.

    3) If the new vector is in use already on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then the handler of the device which
    uses that vector on the local CPU will be invoked.

    Case #1 is harmless; cases #2 and #3 might expose issues in device driver
    interrupt handlers which are not prepared to handle a spurious interrupt
    correctly. This is not a regression; it's just exposing something which was
    already broken, as spurious interrupts can happen for a lot of reasons and
    all driver handlers need to be able to deal with them.

    Reported-by: Evan Green
    Debugged-by: Evan Green
    Signed-off-by: Thomas Gleixner
    Tested-by: Evan Green
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • [ Upstream commit f9b84e19221efc5f493156ee0329df3142085f28 ]

    Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
    correct set of memslots is used when handling x86 page faults in SMM.

    Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit a4d956b9390418623ae5d07933e2679c68b6f83c ]

    In case writing to the vmread destination operand results in a #PF, vmread
    should not call nested_vmx_succeed() to set rflags to indicate success,
    similar to what is done in VMPTRST (see handle_vmptrst()).

    Reviewed-by: Liran Alon
    Signed-off-by: Miaohe Lin
    Cc: stable@vger.kernel.org
    Reviewed-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Miaohe Lin
     
  • [ Upstream commit 56871d444bc4d7ea66708775e62e2e0926384dbc ]

    The SPTE_MMIO_MASK overlaps with the bits used to track MMIO
    generation number. A high enough generation number would overwrite the
    SPTE_SPECIAL_MASK region and cause the MMIO SPTE to be misinterpreted.

    Likewise, setting bits 52 and 53 would also cause an incorrect generation
    number to be read from the PTE, though this was partially mitigated by the
    (useless if it weren't for the bug) removal of SPTE_SPECIAL_MASK from
    the spte in get_mmio_spte_generation. Drop that removal, and replace
    it with a compile-time assertion.

    Fixes: 6eeb4ef049e7 ("KVM: x86: assign two bits to track SPTE kinds")
    Reported-by: Ben Gardon
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Paolo Bonzini