05 Mar, 2020

10 commits

  • commit 693e02cc24090c379217138719d9d84e50036b24 upstream.

    According to the SDM, VMWRITE checks whether the secondary source
    operand corresponds to an unsupported VMCS field before it checks
    whether the operand corresponds to a VM-exit information field on a
    processor that does not support writing to VM-exit information fields.

    Fixes: 49f705c5324aa ("KVM: nVMX: Implement VMREAD and VMWRITE")
    Signed-off-by: Jim Mattson
    Cc: Paolo Bonzini
    Reviewed-by: Peter Shier
    Reviewed-by: Oliver Upton
    Reviewed-by: Jon Cargille
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Jim Mattson
     
  • commit dd2d6042b7f4a5440705b4ffc6c4c2dba81a43b7 upstream.

    According to the SDM, a VMWRITE in VMX non-root operation with an
    invalid VMCS-link pointer results in VMfailInvalid before the validity
    of the VMCS field in the secondary source operand is checked.

    For consistency, modify both handle_vmwrite and handle_vmread, even
    though there was no problem with the latter.
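
    A sketch of the resulting ordering in both handlers (condensed and
    illustrative):

    /* VMfailInvalid on a bad VMCS(-link) pointer comes first ... */
    if (vmx->nested.current_vmptr == -1ull ||
        (is_guest_mode(vcpu) &&
         get_vmcs12(vcpu)->vmcs_link_pointer == -1ull))
            return nested_vmx_failInvalid(vcpu);
    /* ... and only then is the VMCS field operand decoded and checked. */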

    Fixes: 6d894f498f5d1 ("KVM: nVMX: vmread/vmwrite: Use shadow vmcs12 if running L2")
    Signed-off-by: Jim Mattson
    Cc: Liran Alon
    Cc: Paolo Bonzini
    Cc: Vitaly Kuznetsov
    Reviewed-by: Peter Shier
    Reviewed-by: Oliver Upton
    Reviewed-by: Jon Cargille
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Jim Mattson
     
  • commit 208050dac5ef4de5cb83ffcafa78499c94d0b5ad upstream.

    Remove a bogus clearing of apf.msr_val from kvm_arch_vcpu_destroy().

    apf.msr_val is only set to a non-zero value by kvm_pv_enable_async_pf(),
    which is only reachable by kvm_set_msr_common(), i.e. by writing
    MSR_KVM_ASYNC_PF_EN. KVM does not autonomously write said MSR, i.e. it
    can only be written via KVM_SET_MSRS or KVM_RUN. Since KVM_SET_MSRS and
    KVM_RUN are vcpu ioctls, they require a valid vcpu file descriptor.
    kvm_arch_vcpu_destroy() is only called if KVM_CREATE_VCPU fails, and KVM
    declares KVM_CREATE_VCPU successful once the vcpu fd is installed and
    thus visible to userspace. Ergo, apf.msr_val cannot be non-zero when
    kvm_arch_vcpu_destroy() is called.

    Fixes: 344d9588a9df0 ("KVM: Add PV MSR to enable asynchronous page faults delivery.")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 9d979c7e6ff43ca3200ffcb74f57415fd633a2da upstream.

    x86 does not load its MMU until KVM_RUN, which cannot be invoked until
    after vCPU creation succeeds. Given that kvm_arch_vcpu_destroy() is
    called if and only if vCPU creation fails, it is impossible for the MMU
    to be loaded.

    Note, the bogus kvm_mmu_unload() call was added during an unrelated
    refactoring of vCPU allocation, i.e. was presumably added as an
    opportunistic "fix" for a perceived leak.

    Fixes: fb3f0f51d92d1 ("KVM: Dynamically allocate vcpus")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 536a0d8e79fb928f2735db37dda95682b6754f9a upstream.

    Currently, there are three static keys in the resctrl file system:
    rdt_mon_enable_key and rdt_alloc_enable_key indicate if the monitoring
    feature and the allocation feature are enabled, respectively. The
    rdt_enable_key is enabled when either the monitoring feature or the
    allocation feature is enabled.

    If no monitoring feature is present (either hardware doesn't support a
    monitoring feature or the feature is disabled by the kernel command line
    option "rdt="), rdt_enable_key is still enabled but rdt_mon_enable_key
    is disabled.

    MBM is a monitoring feature. The MBM overflow handler checks whether
    the monitoring feature is enabled so that it can return quickly when
    it is not.

    So check rdt_mon_enable_key there instead of rdt_enable_key, as the
    former is the more accurate check.
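
    In code terms, the early-return check in the overflow handler becomes
    (a sketch, not the verbatim diff):

    if (!static_branch_likely(&rdt_mon_enable_key))  /* was: rdt_enable_key */
            goto out_unlock;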

    [ bp: Massage commit message. ]

    Fixes: e33026831bdb ("x86/intel_rdt/mbm: Handle counter overflow")
    Signed-off-by: Xiaochen Shen
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/1576094705-13660-1-git-send-email-xiaochen.shen@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Xiaochen Shen
     
  • commit 52918ed5fcf05d97d257f4131e19479da18f5d16 upstream.

    The KVM MMIO support uses bit 51 as the reserved bit to cause nested page
    faults when a guest performs MMIO. The AMD memory encryption support uses
    a CPUID function to define the encryption bit position. Given this, it is
    possible that these bits can conflict.

    Use svm_hardware_setup() to override the MMIO mask if memory encryption
    support is enabled. Various checks are performed to ensure that the mask
    is properly defined and rsvd_bits() is used to generate the new mask (as
    was done prior to the change that necessitated this patch).
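
    A condensed sketch of the idea (see svm_adjust_mmio_mask() upstream;
    treat the details as illustrative):

    enc_bit  = cpuid_ebx(0x8000001f) & 0x3f;  /* encryption bit position */
    mask_bit = boot_cpu_data.x86_phys_bits;
    if (enc_bit == mask_bit)                  /* avoid the conflict */
            mask_bit++;
    mask = (mask_bit < 52) ?
            rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;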

    Fixes: 28a1f3ac1d0c ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
    Suggested-by: Sean Christopherson
    Reviewed-by: Sean Christopherson
    Signed-off-by: Tom Lendacky
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Tom Lendacky
     
  • commit 86f7e90ce840aa1db407d3ea6e9b3a52b2ce923c upstream.

    KVM emulates UMIP on hardware that doesn't support it by setting the
    'descriptor table exiting' VM-execution control and performing
    instruction emulation. When running nested, this emulation is broken as
    KVM refuses to emulate L2 instructions by default.

    Correct this regression by allowing the emulation of descriptor table
    instructions if L1 hasn't requested 'descriptor table exiting'.
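
    A sketch of the resulting check in vmx_check_intercept() (condensed,
    illustrative):

    case x86_intercept_lgdt: case x86_intercept_lidt:
    case x86_intercept_sgdt: case x86_intercept_sidt:
            /* Emulate for L2 unless L1 wants 'descriptor table exiting'. */
            if (!nested_cpu_has2(get_vmcs12(vcpu), SECONDARY_EXEC_DESC))
                    return X86EMUL_CONTINUE;
            break;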

    Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
    Reported-by: Jan Kiszka
    Cc: stable@vger.kernel.org
    Cc: Paolo Bonzini
    Cc: Jim Mattson
    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • [ Upstream commit 0aa0e0d6b34b89649e6b5882a7e025a0eb9bd832 ]

    Tremont is Intel's successor to Goldmont Plus. The SMI_COUNT MSR is
    also supported.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Andi Kleen
    Link: https://lkml.kernel.org/r/1580236279-35492-3-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Sasha Levin

    Kan Liang
     
  • [ Upstream commit ecf71fbccb9ac5cb964eb7de59bb9da3755b7885 ]

    Tremont is Intel's successor to Goldmont Plus. From the perspective of
    Intel cstate residency counters, nothing has changed compared with
    Goldmont Plus and Goldmont.

    Share glm_cstates with Goldmont Plus and Goldmont.
    Update the comments for Tremont.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Andi Kleen
    Link: https://lkml.kernel.org/r/1580236279-35492-2-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Sasha Levin

    Kan Liang
     
  • [ Upstream commit eda23b387f6c4bb2971ac7e874a09913f533b22c ]

    Elkhart Lake also uses the Tremont CPU. From the perspective of the
    Intel PMU, nothing has changed compared with Jacobsville.
    Share the perf code with Jacobsville.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Andi Kleen
    Link: https://lkml.kernel.org/r/1580236279-35492-1-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Sasha Levin

    Kan Liang
     

29 Feb, 2020

11 commits

  • commit 23520b2def95205f132e167cf5b25c609975e959 upstream.

    When pv_eoi_get_user() fails, 'val' may remain uninitialized and the return
    value of pv_eoi_get_pending() becomes random. Fix the issue by initializing
    the variable.
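
    A sketch of the fix (illustrative; mirrors the shape of the lapic code):

    static bool pv_eoi_get_pending(struct kvm_vcpu *vcpu)
    {
            u8 val = 0;     /* was uninitialized; 0 == no pending EOI */

            if (pv_eoi_get_user(vcpu, &val) < 0)
                    apic_debug("Can't read EOI MSR value: 0x%llx\n",
                               (unsigned long long)vcpu->arch.pv_eoi.msr_val);

            return val & 0x1;
    }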

    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Miaohe Lin
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 91a5f413af596ad01097e59bf487eb07cb3f1331 upstream.

    Even when APICv is disabled for L1, it can still be (and, in fact, is)
    available for L2. This means we need to always call
    vmx_deliver_nested_posted_interrupt() when attempting an interrupt
    delivery.

    Suggested-by: Paolo Bonzini
    Signed-off-by: Vitaly Kuznetsov
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • … is globally disabled

    commit a4443267800af240072280c44521caab61924e55 upstream.

    When apicv is disabled on a vCPU (e.g. by enabling KVM_CAP_HYPERV_SYNIC*),
    nothing happens to the VMX MSRs on already existing vCPUs; however, all new
    ones are created with PIN_BASED_POSTED_INTR filtered out. This is very
    confusing and results in the following picture inside the guest:

    $ rdmsr -ax 0x48d
    ff00000016
    7f00000016
    7f00000016
    7f00000016

    This is observed with QEMU and 4-vCPU guest: QEMU creates vCPU0, does
    KVM_CAP_HYPERV_SYNIC2 and then creates the remaining three.

    An L1 hypervisor may only check CPU0's controls to find out which features
    are available, and it will be very confused later. Switch to setting the
    PIN_BASED_POSTED_INTR control based on the global 'enable_apicv' setting.
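
    A sketch of the fix: derive the filtered pin-based controls from the
    module-wide knob (condensed, illustrative):

    msrs->pinbased_ctls_high &= PIN_BASED_EXT_INTR_MASK |
                                PIN_BASED_NMI_EXITING |
                                PIN_BASED_VIRTUAL_NMIS |
                                (enable_apicv ? PIN_BASED_POSTED_INTR : 0);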

    Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Vitaly Kuznetsov
     
  • commit 35a571346a94fb93b5b3b6a599675ef3384bc75c upstream.

    Consult the 'unconditional IO exiting' and 'use IO bitmaps' VM-execution
    controls when checking instruction interception. If the 'use IO bitmaps'
    VM-execution control is 1, check the instruction access against the IO
    bitmaps to determine if the instruction causes a VM-exit.
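
    A sketch of the resulting logic for IN/OUT interception (illustrative;
    the helper is introduced by the refactoring commit below):

    if (nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
            intercept = nested_vmx_check_io_bitmaps(vcpu, port, size);
    else
            intercept = nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING);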

    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • commit e71237d3ff1abf9f3388337cfebf53b96df2020d upstream.

    Checks against the IO bitmap are useful for both instruction emulation
    and VM-exit reflection. Refactor the IO bitmap checks into a helper
    function.

    Signed-off-by: Oliver Upton
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • commit 7455a8327674e1a7c9a1f5dd1b0743ab6713f6d1 upstream.

    Commit 13db77347db1 ("KVM: x86: don't notify userspace IOAPIC on edge
    EOI") explained that edge-triggered interrupts don't set a bit in the TMR,
    which means the IOAPIC isn't notified on EOI, and that the variable
    'level' indicates a level-triggered interrupt.
    But commit 3159d36ad799 ("KVM: x86: use generic function for MSI parsing")
    replaced 'level' with irq.level by mistake. Fix it by changing irq.level
    to irq.trig_mode.
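
    The shape of the fix, in kvm_scan_ioapic_routes() (treat as an
    illustrative sketch):

    if (irq.trig_mode &&    /* was: irq.level */
        kvm_apic_match_dest(vcpu, NULL, 0, irq.dest_id, irq.dest_mode))
            __set_bit(irq.vector, ioapic_handled_vectors);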

    Cc: stable@vger.kernel.org
    Fixes: 3159d36ad799 ("KVM: x86: use generic function for MSI parsing")
    Signed-off-by: Miaohe Lin
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 07721feee46b4b248402133228235318199b05ec upstream.

    vmx_check_intercept is not yet fully implemented. To avoid emulating
    instructions disallowed by the L1 hypervisor, refuse to emulate
    instructions by default.

    Cc: stable@vger.kernel.org
    [Made commit, added commit msg - Oliver]
    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • commit 21b5ee59ef18e27d85810584caf1f7ddc705ea83 upstream.

    Commit

    aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired)
    performance counter")

    added support for access to the free-running counter via 'perf -e
    msr/irperf/', but when exercised, it always returns a 0 count:

    BEFORE:

    $ perf stat -e instructions,msr/irperf/ true

    Performance counter stats for 'true':

    624,833 instructions
    0 msr/irperf/

    Simply set its enable bit - HWCR bit 30 - to make it start counting.

    Enablement is restricted to machines advertising IRPERF capability,
    except those susceptible to an erratum that makes IRPERF return
    bad values.

    That erratum occurs in Family 17h models 00-1fh [1], but not in F17h
    models 20h and above [2].
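
    A sketch of the enablement (illustrative; the erratum/helper names are
    assumptions about the cpu-setup code):

    if (cpu_has(c, X86_FEATURE_IRPERF) &&
        !cpu_has_amd_erratum(c, amd_erratum_1054))  /* F17h models 00-1fh */
            msr_set_bit(MSR_K7_HWCR, 30);           /* IRPerfEn */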

    AFTER (on a family 17h model 31h machine):

    $ perf stat -e instructions,msr/irperf/ true

    Performance counter stats for 'true':

    621,690 instructions
    622,490 msr/irperf/

    [1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors
    [2] Revision Guide for AMD Family 17h Models 30h-3Fh Processors

    The revision guides are available from the bugzilla Link below.

    [ bp: Massage commit message. ]

    Fixes: aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter")
    Signed-off-by: Kim Phillips
    Signed-off-by: Borislav Petkov
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
    Link: http://lkml.kernel.org/r/20200214201805.13830-1-kim.phillips@amd.com
    Signed-off-by: Greg Kroah-Hartman

    Kim Phillips
     
  • commit 51dede9c05df2b78acd6dcf6a17d21f0877d2d7b upstream.

    Accessing the MCA thresholding controls in sysfs concurrently with CPU
    hotplug can lead to a couple of KASAN-reported issues:

    BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180
    Read of size 8 at addr ffff888367578940 by task grep/4019

    and

    BUG: KASAN: use-after-free in show_error_count+0x15c/0x180
    Read of size 2 at addr ffff888368a05514 by task grep/4454

    for example. Both result from the fact that the threshold block
    creation/teardown code frees the descriptor memory itself instead of
    defining a proper ->release function and leaving it to the driver core
    to take care of that after all sysfs accesses have completed.

    Do that and get rid of the custom freeing code, fixing the above UAFs in
    the process.
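
    A sketch of the pattern: give the kobject a ->release callback and let
    the driver core free the memory once the last sysfs reference is gone
    (names illustrative):

    static void threshold_block_release(struct kobject *kobj)
    {
            kfree(container_of(kobj, struct threshold_block, kobj));
    }

    static struct kobj_type threshold_ktype = {
            .sysfs_ops = &threshold_ops,
            .release   = threshold_block_release,
    };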

    [ bp: write commit message. ]

    Fixes: 95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors")
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Borislav Petkov
    Cc:
    Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 6e5cf31fbe651bed7ba1df768f2e123531132417 upstream.

    threshold_create_bank() creates a bank descriptor per MCA error
    thresholding counter which can be controlled over sysfs. It publishes
    the pointer to that bank in a per-CPU variable and then goes on to
    create additional thresholding blocks if the bank has any.

    However, that creation of additional blocks in
    allocate_threshold_blocks() can fail, leading to a use-after-free
    through the per-CPU pointer.

    Therefore, publish that pointer only after all blocks have been set up
    successfully.
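
    A sketch of the reordering (illustrative; 'msr_addr' is a placeholder):

    err = allocate_threshold_blocks(cpu, b, bank, 0, msr_addr);
    if (err)
            goto out_free;                     /* pointer never published */
    per_cpu(threshold_banks, cpu)[bank] = b;   /* publish only on success */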

    Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor")
    Reported-by: Saar Amar
    Reported-by: Dan Carpenter
    Signed-off-by: Borislav Petkov
    Cc:
    Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain
    Signed-off-by: Greg Kroah-Hartman

    Borislav Petkov
     
  • commit ff5ac61ee83c13f516544d29847d28be093a40ee upstream.

    The IMA arch code attempts to inspect the "SetupMode" EFI variable
    by populating a variable called efi_SetupMode_name with the string
    "SecureBoot" and passing that to the EFI GetVariable service, which
    obviously does not yield the expected result.

    Given that the string is only referenced a single time, let's get
    rid of the intermediate variable, and pass the correct string as
    an immediate argument. While at it, do the same for "SecureBoot".

    Fixes: 399574c64eaf ("x86/ima: retry detecting secure boot mode")
    Fixes: 980ef4d22a95 ("x86/ima: check EFI SetupMode too")
    Cc: Matthew Garrett
    Signed-off-by: Ard Biesheuvel
    Cc: stable@vger.kernel.org # v5.3
    Signed-off-by: Mimi Zohar
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     

24 Feb, 2020

9 commits

  • [ Upstream commit 8b7e20a7ba54836076ff35a28349dabea4cec48f ]

    Add the TEST opcode to Group3-2 reg=001b, just as Group3-1 has it.

    Commit

    12a78d43de76 ("x86/decoder: Add new TEST instruction pattern")

    added a TEST opcode assignment to f6 XX/001/XXX (Group 3-1), but did
    not add f7 XX/001/XXX (Group 3-2).

    Actually, this TEST opcode variant (ModRM.reg /1) is not described in
    the Intel SDM Vol2 but in AMD64 Architecture Programmer's Manual Vol.3,
    Appendix A.2 Table A-6. ModRM.reg Extensions for the Primary Opcode Map.

    Without this fix, Randy Dunlap reported a warning from insn_decoder_test
    related to this issue, shown below.

    HOSTCC arch/x86/tools/insn_decoder_test
    HOSTCC arch/x86/tools/insn_sanity
    TEST posttest
    arch/x86/tools/insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this.
    arch/x86/tools/insn_decoder_test: warning: ffffffff81000bf1: f7 0b 00 01 08 00 testl $0x80100,(%rbx)
    arch/x86/tools/insn_decoder_test: warning: objdump says 6 bytes, but insn_get_length() says 2
    arch/x86/tools/insn_decoder_test: warning: Decoded and checked 11913894 instructions with 1 failures
    TEST posttest
    arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0x871ce29c)

    To fix this error, add the TEST opcode according to AMD64 APM Vol.3.

    [ bp: Massage commit message. ]

    Reported-by: Randy Dunlap
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Acked-by: Randy Dunlap
    Tested-by: Randy Dunlap
    Link: https://lkml.kernel.org/r/157966631413.9580.10311036595431878351.stgit@devnote2
    Signed-off-by: Sasha Levin

    Masami Hiramatsu
     
  • [ Upstream commit 75fbef0a8b6b4bb19b9a91b5214f846c2dc5139e ]

    The following commit:

    15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()")

    modified kernel_map_pages_in_pgd() to manage writable permissions
    of memory mappings in the EFI page table in a different way, but
    in the process, it removed the ability to clear NX attributes from
    read-only mappings, by clobbering the clear mask if _PAGE_RW is not
    being requested.

    Failure to remove the NX attribute from read-only mappings is
    unlikely to be a security issue, but it does prevent us from
    tightening the permissions in the EFI page tables going forward,
    so let's fix it now.
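
    A sketch of the fix: compute both clearable attributes in one expression
    instead of letting a second assignment clobber the first (illustrative):

    cpa.mask_clr = __pgprot(~page_flags & (_PAGE_NX | _PAGE_RW));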

    Fixes: 15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()")
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200113172245.27925-5-ardb@kernel.org
    Signed-off-by: Sasha Levin

    Ard Biesheuvel
     
  • [ Upstream commit 471af006a747f1c535c8a8c6c0973c320fe01b22 ]

    AMD Family 17h processors and above gain support for Large Increment
    per Cycle events. Unfortunately there is no CPUID or equivalent bit
    that indicates whether the feature exists or not, so we continue to
    determine eligibility based on a CPU family number comparison.

    For Large Increment per Cycle events, we add an f17h-and-compatibles
    get_event_constraints_f17h() that returns an even counter bitmask:
    Large Increment per Cycle events can only be placed on PMCs 0, 2,
    and 4 out of the currently available 0-5. The only currently
    public event that requires this feature to report valid counts
    is PMCx003 "Retired SSE/AVX Operations".
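
    A sketch of such a constraint (illustrative): a counter mask of
    0b010101 allows only PMCs 0, 2 and 4 out of the available 0-5:

    static struct event_constraint pair_constraint =
            EVENT_CONSTRAINT(0, 0x15ULL, 0);   /* PMCs 0, 2, 4 */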

    Note that the CPU family logic in amd_core_pmu_init() is changed
    so as to be able to selectively add initialization for features
    available in ranges of backward-compatible CPU families. This
    Large Increment per Cycle feature is expected to be retained
    in future families.

    A side effect of assigning a new get_constraints function for f17h is
    that it stops calling the old (pre-f15h) amd_get_event_constraints
    implementation left enabled by commit e40ed1542dd7 ("perf/x86: Add perf
    support for AMD family-17h processors"), which is no longer necessary
    since those North Bridge event codes are obsolete.

    Also fix a spelling mistake whilst in the area (calulating ->
    calculating).

    Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
    Signed-off-by: Kim Phillips
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191114183720.19887-2-kim.phillips@amd.com
    Signed-off-by: Sasha Levin

    Kim Phillips
     
  • [ Upstream commit 248ed51048c40d36728e70914e38bffd7821da57 ]

    First, printk() is NMI-context safe now that the safe printk() has been
    implemented, and it already uses an irq_work to achieve NMI-context safety.

    Second, this NMI irq_work actually does not work if an NMI handler causes
    a panic by watchdog timeout. It has no chance to run in such a case, while
    the safe printk() will flush its per-cpu buffers before panicking.

    While at it, repurpose the irq_work callback into a function which
    concentrates the NMI duration checking and makes the code easier to
    follow.

    [ bp: Massage. ]

    Signed-off-by: Changbin Du
    Signed-off-by: Borislav Petkov
    Acked-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200111125427.15662-1-changbin.du@gmail.com
    Signed-off-by: Sasha Levin

    Changbin Du
     
  • [ Upstream commit e2d68a955e49d61fd0384f23e92058dc9b79be5e ]

    The logic in __efi_enter_virtual_mode() does a number of steps in
    sequence, all of which may fail in one way or the other. In most
    cases, we simply print an error and disable EFI runtime services
    support, but in some cases, we BUG() or panic() and bring down the
    system when encountering conditions that we could easily handle in
    the same way.

    While at it, replace a pointless page-to-virt-phys conversion with
    one that goes straight from struct page to physical.

    Signed-off-by: Ard Biesheuvel
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arvind Sankar
    Cc: Matthew Garrett
    Cc: linux-efi@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200103113953.9571-14-ardb@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Ard Biesheuvel
     
  • [ Upstream commit bff47c2302cc249bcd550b17067f8dddbd4b6f77 ]

    When building with C=1, sparse issues a warning:

    CHECK arch/x86/entry/vdso/vdso32-setup.c
    arch/x86/entry/vdso/vdso32-setup.c:28:28: warning: symbol 'vdso32_enabled' was not declared. Should it be static?

    Provide the missing header file.

    Signed-off-by: Valdis Kletnieks
    Signed-off-by: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: x86-ml
    Link: https://lkml.kernel.org/r/36224.1575599767@turing-police
    Signed-off-by: Sasha Levin

    Valdis Klētnieks
     
  • [ Upstream commit dacc9092336be20b01642afe1a51720b31f60369 ]

    When checking whether the reported lfb_size makes sense, the height *
    stride result is page-aligned before checking whether it exceeds the
    reported size.

    This doesn't work if height * stride is not an exact number of pages.
    For example, as reported in the kernel bugzilla below, an 800x600x32 EFI
    framebuffer gets skipped because of this.

    Move the PAGE_ALIGN to after the check vs size.
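
    A sketch of the reordering (illustrative):

    length = mode->height * mode->stride;
    if (length > size)
            return false;                 /* reported lfb_size too small */
    length = PAGE_ALIGN(length);          /* was applied before the check */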

    Reported-by: Christopher Head
    Tested-by: Christopher Head
    Signed-off-by: Arvind Sankar
    Signed-off-by: Borislav Petkov
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206051
    Link: https://lkml.kernel.org/r/20200107230410.2291947-1-nivedita@alum.mit.edu
    Signed-off-by: Sasha Levin

    Arvind Sankar
     
  • [ Upstream commit ffc2760bcf2dba0dbef74013ed73eea8310cc52c ]

    Fix a couple of issues with the way we map and copy the vendor string:
    - we map only 2 bytes, which usually works since you get at least a
    page, but if the vendor string happens to cross a page boundary,
    a crash will result
    - only call early_memunmap() if early_memremap() succeeded, or we will
    call it with a NULL address which it doesn't like,
    - while at it, switch to early_memremap_ro(), and array indexing rather
    than pointer dereferencing to read the CHAR16 characters.
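
    A sketch of the fixed mapping and copy (illustrative):

    c16 = early_memremap_ro(fw_vendor, sizeof(vendor) * sizeof(efi_char16_t));
    if (c16) {                               /* unmap only what was mapped */
            for (i = 0; i < sizeof(vendor) - 1 && c16[i]; i++)
                    vendor[i] = c16[i];      /* index, don't dereference++ */
            vendor[i] = '\0';
            early_memunmap(c16, sizeof(vendor) * sizeof(efi_char16_t));
    }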

    Signed-off-by: Ard Biesheuvel
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arvind Sankar
    Cc: Matthew Garrett
    Cc: linux-efi@vger.kernel.org
    Fixes: 5b83683f32b1 ("x86: EFI runtime service support")
    Link: https://lkml.kernel.org/r/20200103113953.9571-5-ardb@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Ard Biesheuvel
     
  • [ Upstream commit bbc55341b9c67645d1a5471506370caf7dd4a203 ]

    In __fpu__restore_sig(), fpu_fpregs_owner_ctx needs to be reset if the
    FPU state was not fully restored. Otherwise the following may happen (on
    the same CPU):

    Task A                     Task B             fpu_fpregs_owner_ctx
                                                  *active*  A.fpu
    __fpu__restore_sig()
                               ctx switch         load B.fpu
                                                  *active*  B.fpu
    fpregs_lock()
    copy_user_to_fpregs_zeroing()
      copy_kernel_to_xregs() *modify*
      copy_user_to_xregs() *fails*
    fpregs_unlock()
                               ctx switch         skip loading B.fpu,
                                                  *active*  B.fpu

    In the success case, fpu_fpregs_owner_ctx is set to the current task.

    In the failure case, the FPU state might have been modified by loading
    the init state.

    In this case, fpu_fpregs_owner_ctx needs to be reset in order to ensure
    that the FPU state of the following task is loaded from saved state (and
    not skipped because it was the previous state).

    Reset fpu_fpregs_owner_ctx after a failure during restore, to ensure
    that the FPU state for the next task is always loaded.
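
    A sketch of the fix (illustrative): on failure, drop fpregs ownership
    so the next context switch reloads the state from memory:

    ret = copy_user_to_fpregs_zeroing(buf_fx, xfeatures, fx_only);
    if (ret)
            fpregs_deactivate(fpu);   /* resets fpu_fpregs_owner_ctx */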

    The problem was debugged by Yu-cheng Yu.

    [ bp: Massage commit message. ]

    Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace")
    Reported-by: Yu-cheng Yu
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jann Horn
    Cc: Peter Zijlstra
    Cc: "Ravi V. Shankar"
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: x86-ml
    Link: https://lkml.kernel.org/r/20191220195906.plk6kpmsrikvbcfn@linutronix.de
    Signed-off-by: Sasha Levin

    Sebastian Andrzej Siewior
     

20 Feb, 2020

5 commits

  • [ Upstream commit f6ab0107a4942dbf9a5cf0cca3f37e184870a360 ]

    Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow
    paging for 5-level guest page tables. PT_MAX_FULL_LEVELS is used to
    size the arrays that track guest page table information, i.e. using a
    "max levels" of 4 causes KVM to access garbage beyond the end of an
    array when querying state for level 5 entries. E.g. FNAME(gpte_changed)
    will read garbage and most likely return %true for a level 5 entry,
    soft-hanging the guest because FNAME(fetch) will restart the guest
    instead of creating SPTEs, since it thinks the guest PTE has changed.

    Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS
    gets to stay "4" for the PTTYPE_EPT case.
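
    The fix, in paging_tmpl.h for the 64-bit PTTYPE (sketch):

    #define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL   /* was: 4 */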

    Fixes: 855feb673640 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • commit 307f1cfa269657c63cfe2c932386fcc24684d9dd upstream.

    KVM defines the #DB payload as compatible with the 'pending debug
    exceptions' field under VMX, not DR6. Mask off bit 12 when applying the
    payload to DR6, as it is reserved on DR6 but not the 'pending debug
    exceptions' field.
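
    A sketch of the payload application (illustrative): bit 12 means
    'enabled breakpoint' in the VMX 'pending debug exceptions' field but
    is reserved in DR6:

    vcpu->arch.dr6 |= payload;
    vcpu->arch.dr6 &= ~BIT(12);   /* the fix: keep the reserved bit clear */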

    Fixes: f10c729ff965 ("kvm: vmx: Defer setting of DR6 until #DB delivery")
    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • commit f861854e1b435b27197417f6f90d87188003cb24 upstream.

    Perf doesn't take the left period into account when auto-reload is
    enabled with fixed period sampling mode in context switch.

    Here is the MSR trace of the perf command shown below.
    (The MSR trace is simplified from an ftrace log.)

    #perf record -e cycles:p -c 2000000 -- ./triad_loop

    //The MSR trace of task schedule out
    //perf disable all counters, disable PEBS, disable GP counter 0,
    //read GP counter 0, and re-enable all counters.
    //The counter 0 stops at 0xfffffff82840
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
    write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0
    write_msr: MSR_P6_EVNTSEL0(186), value 40003003c
    rdpmc: 0, value fffffff82840
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff

    //The MSR trace of the same task schedule in again
    //perf disable all counters, enable and set GP counter 0,
    //enable PEBS, and re-enable all counters.
    //0xffffffe17b80 (-2000000) is written to GP counter 0.
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
    write_msr: MSR_IA32_PMC0(4c1), value ffffffe17b80
    write_msr: MSR_P6_EVNTSEL0(186), value 40043003c
    write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff

    When the same task schedules in again, the counter should start from the
    previous left period. However, it starts from the fixed period -2000000
    again.

    A special variant of intel_pmu_save_and_restart() is used for
    auto-reload, which doesn't update the hwc->period_left.
    When the monitored task schedules in again, perf doesn't know the left
    period. The fixed period is used, which is inaccurate.

    With auto-reload, the counter always has a negative counter value. So
    the left period is -value. Update the period_left in
    intel_pmu_save_and_restart_reload().
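
    A sketch of the fix in intel_pmu_save_and_restart_reload(): with
    auto-reload the counter value is always negative, so the remaining
    period is simply its negation ('new' being the just-read raw count):

    local64_set(&hwc->period_left, -new);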

    With the patch:

    //The MSR trace of task schedule out
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
    write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0
    write_msr: MSR_P6_EVNTSEL0(186), value 40003003c
    rdpmc: 0, value ffffffe25cbc
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff

    //The MSR trace of the same task schedule in again
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0
    write_msr: MSR_IA32_PMC0(4c1), value ffffffe25cbc
    write_msr: MSR_P6_EVNTSEL0(186), value 40043003c
    write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1
    write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff

    Fixes: d31fc13fdcb2 ("perf/x86/intel: Fix event update for auto-reload")
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200121190125.3389-1-kan.liang@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Kan Liang
     
  • commit 25d387287cf0330abf2aad761ce6eee67326a355 upstream.

    Commit 3fe3331bb285 ("perf/x86/amd: Add event map for AMD Family 17h"),
    claimed L2 misses were unsupported, due to them not being found in its
    referenced documentation, whose link has now moved [1].

    That old documentation listed PMCx064 unit mask bit 3 as:

    "LsRdBlkC: LS Read Block C S L X Change to X Miss."

    and bit 0 as:

    "IcFillMiss: IC Fill Miss"

    We now have new public documentation [2] with improved descriptions, that
    clearly indicate what events those unit mask bits represent:

    Bit 3 now clearly states:

    "LsRdBlkC: Data Cache Req Miss in L2 (all types)"

    and bit 0 is:

    "IcFillMiss: Instruction Cache Req Miss in L2."

    So we can now add support for L2 misses in perf's genericised events as
    PMCx064 with both the above unit masks.

    [1] The commit's original documentation reference, "Processor Programming
    Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors",
    originally available here:

    https://www.amd.com/system/files/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf

    is now available here:

    https://developer.amd.com/wordpress/media/2017/11/54945_PPR_Family_17h_Models_00h-0Fh.pdf

    [2] "Processor Programming Reference (PPR) for Family 17h Model 31h,
    Revision B0 Processors", available here:

    https://developer.amd.com/wp-content/resources/55803_0.54-PUB.pdf

    Fixes: 3fe3331bb285 ("perf/x86/amd: Add event map for AMD Family 17h")
    Reported-by: Babu Moger
    Signed-off-by: Kim Phillips
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Tested-by: Babu Moger
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200121171232.28839-1-kim.phillips@amd.com
    Signed-off-by: Greg Kroah-Hartman

    Kim Phillips
     
  • commit 148d735eb55d32848c3379e460ce365f2c1cbe4b upstream.

    Hardcode the EPT page-walk level for L2 to be 4 levels, as KVM's MMU
    currently also hardcodes the page walk level for nested EPT to be 4
    levels. The L2 guest is all but guaranteed to soft hang on its first
    instruction when L1 is using EPT, as KVM will construct 4-level page
    tables and then tell hardware to use 5-level page tables.

    Fixes: 855feb673640 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

15 Feb, 2020

1 commit

  • [ Upstream commit 2b73ea3796242608b4ccf019ff217156c92e92fe ]

    Break out of the infinite loop that occurs during early parsing of the
    SRAT table when a subtable has zero length. This is known to affect the
    ASUS WS X299 SAGE motherboard with firmware version 1201, which has a
    large block of zeros in its SRAT table. The kernel could boot
    successfully on this board/firmware prior to the introduction of early
    parsing of this table, or after a BIOS update.
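
    A sketch of the guard (illustrative):

    sub_table = (struct acpi_subtable_header *)table;
    if (!sub_table->length) {
            debug_putstr("Invalid zero length SRAT subtable.\n");
            return 0;                 /* assume no immovable regions */
    }
    table += sub_table->length;       /* a zero here used to loop forever */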

    [ bp: Fixup whitespace damage and commit message. Make it return 0 to
    denote that there are no immovable regions because who knows what
    else is broken in this BIOS. ]

    Fixes: 02a3e3cdb7f1 ("x86/boot: Parse SRAT table and count immovable memory regions")
    Signed-off-by: Steven Clarkson
    Signed-off-by: Borislav Petkov
    Cc: linux-acpi@vger.kernel.org
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206343
    Link: https://lkml.kernel.org/r/CAHKq8taGzj0u1E_i=poHUam60Bko5BpiJ9jn0fAupFUYexvdUQ@mail.gmail.com
    Signed-off-by: Sasha Levin

    Steven Clarkson
     

11 Feb, 2020

4 commits

  • commit 6f1a4891a5928a5969c87fa5a584844c983ec823 upstream.

    Evan tracked down a subtle race between the update of the MSI message and
    the device raising an interrupt internally on PCI devices which do not
    support MSI masking. The update of the MSI message is non-atomic and
    consists of either 2 or 3 sequential 32bit wide writes to the PCI config
    space.

    - Write address low 32bits
    - Write address high 32bits (If supported by device)
    - Write data

    When an interrupt is migrated then both address and data might change, so
    the kernel attempts to mask the MSI interrupt first. But MSI masking is
    optional, so there exist devices which do not provide it. That means that
    if the device raises an interrupt internally between the writes, an MSI
    message is sent built from half-updated state.

    On x86 this can lead to spurious interrupts on the wrong interrupt
    vector when the affinity setting changes both address and data. As a
    consequence the device interrupt can be lost causing the device to
    become stuck or malfunctioning.

    Evan tried to handle that by disabling MSI across an MSI message
    update. That's not feasible because disabling MSI has issues of its own:

    If MSI is disabled the PCI device is routing an interrupt to the legacy
    INTx mechanism. The INTx delivery can be disabled, but the disablement is
    not working on all devices.

    Some devices lose interrupts when both MSI and INTx delivery are disabled.

    Another way to solve this would be to enforce the allocation of the same
    vector on all CPUs in the system for this kind of screwed device. That
    could be done, but it would bring back the vector space exhaustion problems
    which were solved a few years ago.

    Fortunately the high address (if supported by the device) is only relevant
    when X2APIC is enabled which implies interrupt remapping. In the interrupt
    remapping case the affinity setting is happening at the interrupt remapping
    unit and the PCI MSI message is programmed only once when the PCI device is
    initialized.

    That makes it possible to solve it with a two step update:

    1) Target the MSI msg to the new vector on the current target CPU

    2) Target the MSI msg to the new vector on the new target CPU

    In both cases writing the MSI message is only changing a single 32bit word
    which prevents the issue of inconsistency.

    After writing the final destination it is necessary to check whether the
    device issued an interrupt while the intermediate state #1 (new vector,
    current CPU) was in effect.

    This is possible because the affinity change is always happening on the
    current target CPU. The code runs with interrupts disabled, so the
    interrupt can be detected by checking the IRR of the local APIC. If the
    vector is pending in the IRR then the interrupt is retriggered on the new
    target CPU by sending an IPI for the associated vector on the target CPU.
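
    A condensed sketch of the two-step update plus the IRR check
    (illustrative helper names; the real logic lives in msi_set_affinity()):

    old_cfg.vector = cfg->vector;       /* step 1: new vector, old CPU */
    irq_msi_update_msg(irqd, &old_cfg);
    irq_msi_update_msg(irqd, cfg);      /* step 2: new vector, new CPU */

    /* If the device fired while step 1 was live, the interrupt sits in
     * the local APIC's IRR: retrigger it on the new target CPU. */
    if (irr_pending(cfg->vector))       /* hypothetical IRR helper */
            apic->send_IPI(new_cpu, cfg->vector);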

    This can cause spurious interrupts on both the local and the new target
    CPU.

    1) If the new vector is not in use on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then interrupt entry code will
    ignore that spurious interrupt. The vector is marked so that the
    'No irq handler for vector' warning is suppressed once.

    2) If the new vector is in use already on the local CPU then the IRR check
    might see a pending interrupt from the device which is using this
    vector. The IPI to the new target CPU will then invoke the handler of
    the device which got the affinity change, even if that device did not
    issue an interrupt.

    3) If the new vector is in use already on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then the handler of the device which
    uses that vector on the local CPU will be invoked.

    Case #1 is harmless; cases #2 and #3 might expose issues in device driver
    interrupt handlers which are not prepared to handle a spurious interrupt
    correctly. This is not a regression; it's just exposing something which was
    already broken, as spurious interrupts can happen for a lot of reasons and
    all driver handlers need to be able to deal with them.

    Reported-by: Evan Green
    Debugged-by: Evan Green
    Signed-off-by: Thomas Gleixner
    Tested-by: Evan Green
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • [ Upstream commit f9b84e19221efc5f493156ee0329df3142085f28 ]

    Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
    correct set of memslots is used when handling x86 page faults in SMM.

    Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit a4d956b9390418623ae5d07933e2679c68b6f83c ]

    In case writing to the vmread destination operand results in a #PF, vmread
    should not call nested_vmx_succeed() to set rflags to indicate success,
    similar to what is done in VMPTRST (see handle_vmptrst()).

    Reviewed-by: Liran Alon
    Signed-off-by: Miaohe Lin
    Cc: stable@vger.kernel.org
    Reviewed-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Miaohe Lin
     
  • [ Upstream commit 56871d444bc4d7ea66708775e62e2e0926384dbc ]

    The SPTE_MMIO_MASK overlaps with the bits used to track MMIO
    generation number. A high enough generation number would overwrite the
    SPTE_SPECIAL_MASK region and cause the MMIO SPTE to be misinterpreted.

    Likewise, setting bits 52 and 53 would also cause an incorrect generation
    number to be read from the PTE, though this was partially mitigated by the
    (useless if it weren't for the bug) removal of SPTE_SPECIAL_MASK from
    the spte in get_mmio_spte_generation. Drop that removal, and replace
    it with a compile-time assertion.

    Fixes: 6eeb4ef049e7 ("KVM: x86: assign two bits to track SPTE kinds")
    Reported-by: Ben Gardon
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Paolo Bonzini