25 Mar, 2020

1 commit

  • commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream.

    Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
    __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
    the vunmap() code-path. While this change was necessary to maintain
    correctness on x86-32-pae kernels, it also adds additional cycles for
    architectures that don't need it.

    Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
    severe performance regressions in micro-benchmarks because it now also
    calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
    the vmalloc_sync_all() implementation on x86-64 is only needed for newly
    created mappings.

    To avoid the unnecessary work on x86-64 and to gain the performance
    back, split up vmalloc_sync_all() into two functions:

    * vmalloc_sync_mappings(), and
    * vmalloc_sync_unmappings()

    Most call-sites to vmalloc_sync_all() only care about new mappings being
    synchronized. The only exception is the new call-site added in the
    above mentioned commit.
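
    A minimal sketch of how such a split could look, assuming the x86-64 side
    only needs to sync newly created mappings (function bodies simplified, not
    the verbatim patch):

    void vmalloc_sync_mappings(void)
    {
            /* Propagate newly created vmalloc mappings into all page
             * tables; this is what almost every call-site needs. */
            sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
    }

    void vmalloc_sync_unmappings(void)
    {
            /* Only needed where unmappings must be propagated as well
             * (x86-32 PAE); on x86-64 this can stay empty, so that
             * __purge_vmap_area_lazy() no longer pays the cost. */
    }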

    Shile Zhang directed us to a report of an 80% regression in reaim
    throughput.

    Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
    Reported-by: kernel test robot
    Reported-by: Shile Zhang
    Signed-off-by: Joerg Roedel
    Signed-off-by: Andrew Morton
    Tested-by: Borislav Petkov
    Acked-by: Rafael J. Wysocki [GHES]
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc:
    Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
    Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
    Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joerg Roedel
     

18 Mar, 2020

5 commits

  • commit 59b5809655bdafb0767d3fd00a3e41711aab07e6 upstream.

    There are two implemented bits in the PPIN_CTL MSR:

    Bit 0: LockOut (R/WO)
    Set 1 to prevent further writes to MSR_PPIN_CTL.

    Bit 1: Enable_PPIN (R/W)
    If 1, enables MSR_PPIN to be accessible using RDMSR.
    If 0, an attempt to read MSR_PPIN will cause #GP.

    So there are four defined values:
    0: PPIN is disabled, PPIN_CTL may be updated
    1: PPIN is disabled. PPIN_CTL is locked against updates
    2: PPIN is enabled. PPIN_CTL may be updated
    3: PPIN is enabled. PPIN_CTL is locked against updates

    The code would only enable the X86_FEATURE_INTEL_PPIN feature for case
    "2", when it should have done so for both case "2" and case "3".

    Fix the final test to just check for the enable bit. Also fix some of
    the other comments in this function.
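
    A sketch of the corrected final test (illustrative; the enable/lock probe
    logic around it is elided):

    u64 val;

    rdmsrl_safe(MSR_PPIN_CTL, &val);

    /* Bit 1 is Enable_PPIN: accept value 2 (enabled, unlocked) as well
     * as value 3 (enabled, locked) instead of only value 2. */
    if (val & 2UL)
            set_cpu_cap(c, X86_FEATURE_INTEL_PPIN);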

    Fixes: 3f5a7896a509 ("x86/mce: Include the PPIN in MCE records when available")
    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Cc:
    Link: https://lkml.kernel.org/r/20200226011737.9958-1-tony.luck@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
     
  • commit f967140dfb7442e2db0868b03b961f9c59418a1b upstream.

    Enable the sampling check in kernel/events/core.c::perf_event_open(),
    which returns the more appropriate -EOPNOTSUPP.

    BEFORE:

    $ sudo perf record -a -e instructions,l3_request_g1.caching_l3_cache_accesses true
    Error:
    The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (l3_request_g1.caching_l3_cache_accesses).
    /bin/dmesg | grep -i perf may provide additional information.

    With nothing relevant in dmesg.

    AFTER:

    $ sudo perf record -a -e instructions,l3_request_g1.caching_l3_cache_accesses true
    Error:
    l3_request_g1.caching_l3_cache_accesses: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'
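
    The change boils down to advertising that this PMU cannot generate
    overflow interrupts, so the core sampling check in perf_event_open()
    rejects the request itself; a sketch (struct layout abbreviated):

    static struct pmu amd_nb_pmu = {
            .task_ctx_nr    = perf_invalid_context,
            .event_init     = amd_uncore_event_init,
            .add            = amd_uncore_add,
            .del            = amd_uncore_del,
            .start          = amd_uncore_start,
            .stop           = amd_uncore_stop,
            .read           = amd_uncore_read,
            /* No sampling/overflow interrupts: perf_event_open() will
             * return -EOPNOTSUPP for sampling events. */
            .capabilities   = PERF_PMU_CAP_NO_INTERRUPT,
    };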

    Fixes: c43ca5091a37 ("perf/x86/amd: Add support for AMD NB and L2I "uncore" counters")
    Signed-off-by: Kim Phillips
    Signed-off-by: Borislav Petkov
    Acked-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200311191323.13124-1-kim.phillips@amd.com
    Signed-off-by: Greg Kroah-Hartman

    Kim Phillips
     
  • commit 985e537a4082b4635754a57f4f95430790afee6a upstream.

    The dmidecode program fails to properly decode the SMBIOS data supplied
    by OVMF/UEFI when running in an SEV guest. The SMBIOS area, under SEV, is
    encrypted and resides in reserved memory that is marked as EFI runtime
    services data.

    As a result, when memremap() is attempted for the SMBIOS data, it
    can't be mapped as regular RAM (through try_ram_remap()) and, since
    the address isn't part of the iomem resources list, it isn't mapped
    encrypted through the fallback ioremap().

    Add a new __ioremap_check_other() to deal with memory types like
    EFI_RUNTIME_SERVICES_DATA which are not covered by the resource ranges.

    This allows any runtime services data which has been created encrypted,
    to be mapped encrypted too.

    [ bp: Move functionality to a separate function. ]
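
    Roughly, the new helper flags such ranges so the fallback path maps them
    encrypted; a simplified sketch (the upstream version carries a few more
    guards):

    static void __ioremap_check_other(resource_size_t addr,
                                      struct ioremap_desc *desc)
    {
            if (!sev_active())
                    return;

            /* EFI runtime services data is not in the iomem resource
             * tree, so mark it here to get an encrypted mapping. */
            if (efi_mem_type(addr) == EFI_RUNTIME_SERVICES_DATA)
                    desc->flags |= IORES_MAP_ENCRYPTED;
    }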

    Signed-off-by: Tom Lendacky
    Signed-off-by: Borislav Petkov
    Reviewed-by: Joerg Roedel
    Tested-by: Joerg Roedel
    Cc: # 5.3
    Link: https://lkml.kernel.org/r/2d9e16eb5b53dc82665c95c6764b7407719df7a0.1582645327.git.thomas.lendacky@amd.com
    Signed-off-by: Greg Kroah-Hartman

    Tom Lendacky
     
  • commit 95fa10103dabc38be5de8efdfced5e67576ed896 upstream.

    When an EVMCS enabled L1 guest on KVM tries doing enlightened VMEnter
    with EVMCS GPA = 0, the host crashes because the

    evmcs_gpa != vmx->nested.hv_evmcs_vmptr

    condition in nested_vmx_handle_enlightened_vmptrld() will evaluate to
    false (as nested.hv_evmcs_vmptr is zeroed after init). The crash will
    happen on vmx->nested.hv_evmcs pointer dereference.

    Another problematic EVMCS ptr value is '-1' but it only causes host crash
    after nested_release_evmcs() invocation. The problem is exactly the same as
    with '0', we mistakenly think that the EVMCS pointer hasn't changed and
    thus nested.hv_evmcs_vmptr is valid.

    Resolve the issue by adding an additional !vmx->nested.hv_evmcs
    check to nested_vmx_handle_enlightened_vmptrld(). This way we will
    always try kvm_vcpu_map() when nested.hv_evmcs is NULL, which is
    supposed to catch all invalid EVMCS GPAs.

    Also, initialize hv_evmcs_vmptr to '0' in nested_release_evmcs()
    to be consistent with initialization where we don't currently
    set hv_evmcs_vmptr to '-1'.
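
    A simplified sketch of the changed condition in
    nested_vmx_handle_enlightened_vmptrld() (error handling trimmed):

    if (unlikely(!vmx->nested.hv_evmcs ||
                 evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
            nested_release_evmcs(vcpu);

            /* With hv_evmcs == NULL we always end up here, so a bogus
             * EVMCS GPA (such as 0 or -1) fails the map instead of
             * being dereferenced later. */
            if (kvm_vcpu_map(vcpu, gpa_to_gfn(evmcs_gpa),
                             &vmx->nested.hv_evmcs_map))
                    return 0;

            vmx->nested.hv_evmcs = vmx->nested.hv_evmcs_map.hva;
    }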

    Cc: stable@vger.kernel.org
    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • commit 342993f96ab24d5864ab1216f46c0b199c2baf8e upstream.

    After commit 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest
    mode") Hyper-V guests on KVM stopped booting with:

    kvm_nested_vmexit: rip fffff802987d6169 reason EPT_VIOLATION info1 181
    info2 0 int_info 0 int_info_err 0
    kvm_page_fault: address febd0000 error_code 181
    kvm_emulate_insn: 0:fffff802987d6169: f3 a5
    kvm_emulate_insn: 0:fffff802987d6169: f3 a5 FAIL
    kvm_inj_exception: #UD (0x0)

    "f3 a5" is a "rep movsw" instruction, which should not be intercepted
    at all. Commit c44b4c6ab80e ("KVM: emulate: clean up initializations in
    init_decode_cache") reduced the number of fields cleared by
    init_decode_cache() claiming that they are being cleared elsewhere,
    'intercept', however, is left uncleared if the instruction does not have
    any of the "slow path" flags (NotImpl, Stack, Op3264, Sse, Mmx, CheckPerm,
    NearBranch, No16 and of course Intercept itself).
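
    One way to picture the fix (a sketch; the exact spot in the decode path
    may differ upstream) is an unconditional reset of the field before each
    decode:

    /* Make sure a stale intercept hint from a previously decoded
     * instruction can never be re-used for the current one. */
    ctxt->intercept = x86_intercept_none;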

    Fixes: c44b4c6ab80e ("KVM: emulate: clean up initializations in init_decode_cache")
    Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
    Cc: stable@vger.kernel.org
    Suggested-by: Paolo Bonzini
    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     

12 Mar, 2020

5 commits

  • commit 8319e9d5ad98ffccd19f35664382c73cea216193 upstream.

    The mixed mode runtime wrappers are fragile when it comes to how the
    memory referred to by its pointer arguments are laid out in memory, due
    to the fact that it translates these addresses to physical addresses that
    the runtime services can dereference when running in 1:1 mode. Since
    vmalloc'ed pages (including the vmap'ed stack) are not contiguous in the
    physical address space, this scheme only works if the referenced memory
    objects do not cross page boundaries.

    Currently, the mixed mode runtime service wrappers require that all by-ref
    arguments that live in the vmalloc space have a size that is a power of 2,
    and are aligned to that same value. While this is a sensible way to
    construct an object that is guaranteed not to cross a page boundary, it is
    overly strict when it comes to checking whether a given object violates
    this requirement, as we can simply take the physical address of the first
    and the last byte, and verify that they point into the same physical page.

    When this check fails, we emit a WARN(), but then simply proceed with the
    call, which could cause data corruption if the next physical page belongs
    to a mapping that is entirely unrelated.

    Given that with vmap'ed stacks, this condition is much more likely to
    trigger, let's relax the condition a bit, but fail the runtime service
    call if it does trigger.
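
    Conceptually, the relaxed check only has to prove that the object stays
    within a single physical page; a sketch with a hypothetical helper name:

    static bool spans_single_page(void *va, unsigned long size)
    {
            /* Translate the first and the last byte through the page
             * tables and compare the physical pages they land in. */
            phys_addr_t first = slow_virt_to_phys(va);
            phys_addr_t last  = slow_virt_to_phys(va + size - 1);

            return (first & PAGE_MASK) == (last & PAGE_MASK);
    }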

    Fixes: f6697df36bdf0bf7 ("x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y")
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Ingo Molnar
    Cc: linux-efi@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Link: https://lore.kernel.org/r/20200221084849.26878-4-ardb@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     
  • commit 63056e8b5ebf41d52170e9f5ba1fc83d1855278c upstream.

    Hans reports that his mixed mode systems running v5.6-rc1 kernels hit
    the WARN_ON() in virt_to_phys_or_null_size(), caused by the fact that
    efi_guid_t objects on the vmap'ed stack happen to be misaligned with
    respect to their sizes. As a quick (i.e., backportable) fix, copy GUID
    pointer arguments to the local stack into a buffer that is naturally
    aligned to its size, so that it is guaranteed to cover only one
    physical page.

    Note that on x86, we cannot rely on the stack pointer being aligned
    the way the compiler expects, so we need to allocate an 8-byte aligned
    buffer of sufficient size, and copy the GUID into that buffer at an
    offset that is aligned to 16 bytes.
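
    The backportable fix is essentially a bounce buffer on the wrapper's own
    stack; a sketch:

    /* An 8-byte aligned buffer large enough to place the 16-byte
     * efi_guid_t at a 16-byte aligned offset, so the copy can never
     * straddle a page boundary. */
    u8 buf[24] __aligned(8);
    efi_guid_t *vnd = PTR_ALIGN((efi_guid_t *)buf, sizeof(*vnd));

    *vnd = *vendor;     /* pass 'vnd' to the runtime call, not 'vendor' */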

    Fixes: f6697df36bdf0bf7 ("x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y")
    Reported-by: Hans de Goede
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Ingo Molnar
    Tested-by: Hans de Goede
    Cc: linux-efi@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Link: https://lore.kernel.org/r/20200221084849.26878-2-ardb@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     
  • commit 735a6dd02222d8d070c7bb748f25895239ca8c92 upstream.

    Explicitly set X86_FEATURE_OSPKE via set_cpu_cap() instead of calling
    get_cpu_cap() to pull the feature bit from CPUID after enabling CR4.PKE.
    Invoking get_cpu_cap() effectively wipes out any {set,clear}_cpu_cap()
    changes that were made between this_cpu->c_init() and setup_pku(), as
    all non-synthetic feature words are reinitialized from the CPU's CPUID
    values.

    Blasting away capability updates manifests most visibly when running
    on a VMX capable CPU, but with VMX disabled by BIOS. To indicate that
    VMX is disabled, init_ia32_feat_ctl() clears X86_FEATURE_VMX, using
    clear_cpu_cap() instead of setup_clear_cpu_cap() so that KVM can report
    which CPU is misconfigured (KVM needs to probe every CPU anyways).
    Restoring X86_FEATURE_VMX from CPUID causes KVM to think VMX is enabled,
    ultimately leading to an unexpected #GP when KVM attempts to do VMXON.

    Arguably, init_ia32_feat_ctl() should use setup_clear_cpu_cap() and let
    KVM figure out a different way to report the misconfigured CPU, but VMX
    is not the only feature bit that is affected, i.e. there is precedent
    that tweaking feature bits via {set,clear}_cpu_cap() after ->c_init()
    is expected to work. Most notably, x86_init_rdrand()'s clearing of
    X86_FEATURE_RDRAND when RDRAND malfunctions is also overwritten.
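
    A sketch of the change in setup_pku() (surrounding checks elided):

    cr4_set_bits(X86_CR4_PKE);
    /*
     * Set the software-visible OSPKE bit directly instead of calling
     * get_cpu_cap(), which would re-read CPUID and wipe out earlier
     * set_cpu_cap()/clear_cpu_cap() adjustments.
     */
    set_cpu_cap(c, X86_FEATURE_OSPKE);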

    Fixes: 0697694564c8 ("x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU")
    Reported-by: Jacob Keller
    Signed-off-by: Sean Christopherson
    Signed-off-by: Borislav Petkov
    Acked-by: Dave Hansen
    Tested-by: Jacob Keller
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200226231615.13664-1-sean.j.christopherson@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • [ Upstream commit 9038ec99ceb94fb8d93ade5e236b2928f0792c7c ]

    Variables declared in a switch statement before any case statements
    cannot be automatically initialized with compiler instrumentation (as
    they are not part of any execution flow). With GCC's proposed automatic
    stack variable initialization feature, this triggers a warning (and they
    don't get initialized). Clang's automatic stack variable initialization
    (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
    doesn't initialize such variables[1]. Note that these warnings (or silent
    skipping) happen before the dead-store elimination optimization phase,
    so even when the automatic initializations are later elided in favor of
    direct initializations, the warnings remain.

    To avoid these problems, move such variables into the "case" where
    they're used or lift them up into the main function body.

    arch/x86/xen/enlighten_pv.c: In function ‘xen_write_msr_safe’:
    arch/x86/xen/enlighten_pv.c:904:12: warning: statement will never be executed [-Wswitch-unreachable]
    904 | unsigned which;
    | ^~~~~

    [1] https://bugs.llvm.org/show_bug.cgi?id=44916
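
    A generic illustration of the pattern being fixed (hypothetical function
    and values, not the Xen code itself):

    /* Problematic: the declaration sits before any case label, so it
     * is never part of an execution path and cannot be auto-initialized
     * by the instrumentation. */
    switch (cmd) {
            int tmp;
    case 1:
            tmp = compute();        /* hypothetical helper */
            return tmp;
    default:
            return 0;
    }

    /* Fixed: lift the declaration into the enclosing function body (or
     * move it into the case block that uses it). */
    int tmp;

    switch (cmd) {
    case 1:
            tmp = compute();
            return tmp;
    default:
            return 0;
    }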

    Signed-off-by: Kees Cook
    Link: https://lore.kernel.org/r/20200220062318.69299-1-keescook@chromium.org
    Reviewed-by: Juergen Gross
    [boris: made @which an 'unsigned int']
    Signed-off-by: Boris Ostrovsky
    Signed-off-by: Sasha Levin

    Kees Cook
     
  • [ Upstream commit df6d4f9db79c1a5d6f48b59db35ccd1e9ff9adfc ]

    GCC 10 changed the default to -fno-common, which leads to

    LD arch/x86/boot/compressed/vmlinux
    ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; \
    arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here
    make[2]: *** [arch/x86/boot/compressed/Makefile:119: arch/x86/boot/compressed/vmlinux] Error 1

    Since __force_order is already provided in pgtable_64.c, there is no
    need to declare __force_order in kaslr_64.c.

    Signed-off-by: H.J. Lu
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20200124181811.4780-1-hjl.tools@gmail.com
    Signed-off-by: Sasha Levin

    H.J. Lu
     

05 Mar, 2020

10 commits

  • commit 693e02cc24090c379217138719d9d84e50036b24 upstream.

    According to the SDM, VMWRITE checks to see if the secondary source
    operand corresponds to an unsupported VMCS field before it checks to
    see if the secondary source operand corresponds to a VM-exit
    information field and the processor does not support writing to
    VM-exit information fields.

    Fixes: 49f705c5324aa ("KVM: nVMX: Implement VMREAD and VMWRITE")
    Signed-off-by: Jim Mattson
    Cc: Paolo Bonzini
    Reviewed-by: Peter Shier
    Reviewed-by: Oliver Upton
    Reviewed-by: Jon Cargille
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Jim Mattson
     
  • commit dd2d6042b7f4a5440705b4ffc6c4c2dba81a43b7 upstream.

    According to the SDM, a VMWRITE in VMX non-root operation with an
    invalid VMCS-link pointer results in VMfailInvalid before the validity
    of the VMCS field in the secondary source operand is checked.

    For consistency, modify both handle_vmwrite and handle_vmread, even
    though there was no problem with the latter.

    Fixes: 6d894f498f5d1 ("KVM: nVMX: vmread/vmwrite: Use shadow vmcs12 if running L2")
    Signed-off-by: Jim Mattson
    Cc: Liran Alon
    Cc: Paolo Bonzini
    Cc: Vitaly Kuznetsov
    Reviewed-by: Peter Shier
    Reviewed-by: Oliver Upton
    Reviewed-by: Jon Cargille
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Jim Mattson
     
  • commit 208050dac5ef4de5cb83ffcafa78499c94d0b5ad upstream.

    Remove a bogus clearing of apf.msr_val from kvm_arch_vcpu_destroy().

    apf.msr_val is only set to a non-zero value by kvm_pv_enable_async_pf(),
    which is only reachable by kvm_set_msr_common(), i.e. by writing
    MSR_KVM_ASYNC_PF_EN. KVM does not autonomously write said MSR, i.e.
    can only be written via KVM_SET_MSRS or KVM_RUN. Since KVM_SET_MSRS and
    KVM_RUN are vcpu ioctls, they require a valid vcpu file descriptor.
    kvm_arch_vcpu_destroy() is only called if KVM_CREATE_VCPU fails, and KVM
    declares KVM_CREATE_VCPU successful once the vcpu fd is installed and
    thus visible to userspace. Ergo, apf.msr_val cannot be non-zero when
    kvm_arch_vcpu_destroy() is called.

    Fixes: 344d9588a9df0 ("KVM: Add PV MSR to enable asynchronous page faults delivery.")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 9d979c7e6ff43ca3200ffcb74f57415fd633a2da upstream.

    x86 does not load its MMU until KVM_RUN, which cannot be invoked until
    after vCPU creation succeeds. Given that kvm_arch_vcpu_destroy() is
    called if and only if vCPU creation fails, it is impossible for the MMU
    to be loaded.

    Note, the bogus kvm_mmu_unload() call was added during an unrelated
    refactoring of vCPU allocation, i.e. was presumably added as an
    opportunistic "fix" for a perceived leak.

    Fixes: fb3f0f51d92d1 ("KVM: Dynamically allocate vcpus")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 536a0d8e79fb928f2735db37dda95682b6754f9a upstream.

    Currently, there are three static keys in the resctrl file system:
    rdt_mon_enable_key and rdt_alloc_enable_key indicate if the monitoring
    feature and the allocation feature are enabled, respectively. The
    rdt_enable_key is enabled when either the monitoring feature or the
    allocation feature is enabled.

    If no monitoring feature is present (either hardware doesn't support a
    monitoring feature or the feature is disabled by the kernel command line
    option "rdt="), rdt_enable_key is still enabled but rdt_mon_enable_key
    is disabled.

    MBM is a monitoring feature. The MBM overflow handler intends to
    check if the monitoring feature is not enabled for fast return.

    So check the rdt_mon_enable_key in it instead of the rdt_enable_key, as
    the former is the more accurate check.
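
    A fragment of the fast-return path with the more precise key (sketch; the
    rest of the handler, including the out_unlock label, is elided):

    /* In mbm_handle_overflow(): */
    mutex_lock(&rdtgroup_mutex);

    /* Bail out early when no monitoring feature is enabled, even if
     * allocation (and therefore rdt_enable_key) is. */
    if (!static_branch_likely(&rdt_mon_enable_key))
            goto out_unlock;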

    [ bp: Massage commit message. ]

    Fixes: e33026831bdb ("x86/intel_rdt/mbm: Handle counter overflow")
    Signed-off-by: Xiaochen Shen
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/1576094705-13660-1-git-send-email-xiaochen.shen@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Xiaochen Shen
     
  • commit 52918ed5fcf05d97d257f4131e19479da18f5d16 upstream.

    The KVM MMIO support uses bit 51 as the reserved bit to cause nested page
    faults when a guest performs MMIO. The AMD memory encryption support uses
    a CPUID function to define the encryption bit position. Given this, it is
    possible that these bits can conflict.

    Use svm_hardware_setup() to override the MMIO mask if memory encryption
    support is enabled. Various checks are performed to ensure that the mask
    is properly defined and rsvd_bits() is used to generate the new mask (as
    was done prior to the change that necessitated this patch).
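
    A simplified sketch of the mask selection (variable names illustrative,
    several of the sanity checks abbreviated):

    unsigned int enc_bit, mask_bit;
    u64 mask;

    /* Encryption bit position from CPUID Fn8000_001F[EBX], and the
     * first bit above the physical address width. */
    enc_bit  = cpuid_ebx(0x8000001f) & 0x3f;
    mask_bit = boot_cpu_data.x86_phys_bits;

    /* Never let the MMIO "reserved" bit coincide with the C-bit. */
    if (enc_bit == mask_bit)
            mask_bit++;

    /* Bits above the physical limit are reserved, so rsvd_bits() can
     * build the mask; if nothing usable is left below bit 52, use no
     * mask at all. */
    mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) : 0;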

    Fixes: 28a1f3ac1d0c ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
    Suggested-by: Sean Christopherson
    Reviewed-by: Sean Christopherson
    Signed-off-by: Tom Lendacky
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Tom Lendacky
     
  • commit 86f7e90ce840aa1db407d3ea6e9b3a52b2ce923c upstream.

    KVM emulates UMIP on hardware that doesn't support it by setting the
    'descriptor table exiting' VM-execution control and performing
    instruction emulation. When running nested, this emulation is broken as
    KVM refuses to emulate L2 instructions by default.

    Correct this regression by allowing the emulation of descriptor table
    instructions if L1 hasn't requested 'descriptor table exiting'.
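
    In vmx_check_intercept() this amounts to letting the descriptor-table
    instructions through when L1 did not enable the corresponding exiting
    control; a sketch of the added cases:

    case x86_intercept_sgdt:
    case x86_intercept_sidt:
    case x86_intercept_lgdt:
    case x86_intercept_lidt:
            /* UMIP emulation of the L2 instruction is fine as long as
             * L1 did not ask for descriptor-table exiting. */
            if (!nested_cpu_has2(get_vmcs12(vcpu), SECONDARY_EXEC_DESC))
                    return X86EMUL_CONTINUE;
            break;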

    Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
    Reported-by: Jan Kiszka
    Cc: stable@vger.kernel.org
    Cc: Paolo Bonzini
    Cc: Jim Mattson
    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • [ Upstream commit 0aa0e0d6b34b89649e6b5882a7e025a0eb9bd832 ]

    Tremont is Intel's successor to Goldmont Plus. SMI_COUNT MSR is also
    supported.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Andi Kleen
    Link: https://lkml.kernel.org/r/1580236279-35492-3-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Sasha Levin

    Kan Liang
     
  • [ Upstream commit ecf71fbccb9ac5cb964eb7de59bb9da3755b7885 ]

    Tremont is Intel's successor to Goldmont Plus. From the perspective of
    Intel cstate residency counters, there is nothing changed compared with
    Goldmont Plus and Goldmont.

    Share glm_cstates with Goldmont Plus and Goldmont.
    Update the comments for Tremont.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Andi Kleen
    Link: https://lkml.kernel.org/r/1580236279-35492-2-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Sasha Levin

    Kan Liang
     
  • [ Upstream commit eda23b387f6c4bb2971ac7e874a09913f533b22c ]

    Elkhart Lake also uses Tremont CPU. From the perspective of Intel PMU,
    there is nothing changed compared with Jacobsville.
    Share the perf code with Jacobsville.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Andi Kleen
    Link: https://lkml.kernel.org/r/1580236279-35492-1-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Sasha Levin

    Kan Liang
     

29 Feb, 2020

11 commits

  • commit 23520b2def95205f132e167cf5b25c609975e959 upstream.

    When pv_eoi_get_user() fails, 'val' may remain uninitialized and the return
    value of pv_eoi_get_pending() becomes random. Fix the issue by initializing
    the variable.
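
    The fix is a plain initialization; a simplified sketch of
    pv_eoi_get_pending() (error reporting omitted):

    static bool pv_eoi_get_pending(struct kvm_vcpu *vcpu)
    {
            u8 val = 0;     /* no longer stack garbage if the read fails */

            pv_eoi_get_user(vcpu, &val);

            return val & 0x1;
    }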

    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Miaohe Lin
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 91a5f413af596ad01097e59bf487eb07cb3f1331 upstream.

    Even when APICv is disabled for L1, it can still be (and, actually, is)
    available for L2; this means we need to always call
    vmx_deliver_nested_posted_interrupt() when attempting an interrupt
    delivery.

    Suggested-by: Paolo Bonzini
    Signed-off-by: Vitaly Kuznetsov
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • … is globally disabled

    commit a4443267800af240072280c44521caab61924e55 upstream.

    When apicv is disabled on a vCPU (e.g. by enabling KVM_CAP_HYPERV_SYNIC*),
    nothing happens to VMX MSRs on the already existing vCPUs; however, all new
    ones are created with PIN_BASED_POSTED_INTR filtered out. This is very
    confusing and results in the following picture inside the guest:

    $ rdmsr -ax 0x48d
    ff00000016
    7f00000016
    7f00000016
    7f00000016

    This is observed with QEMU and 4-vCPU guest: QEMU creates vCPU0, does
    KVM_CAP_HYPERV_SYNIC2 and then creates the remaining three.

    L1 hypervisor may only check CPU0's controls to find out what features
    are available and it will be very confused later. Switch to setting
    PIN_BASED_POSTED_INTR control based on global 'enable_apicv' setting.

    Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Vitaly Kuznetsov
     
  • commit 35a571346a94fb93b5b3b6a599675ef3384bc75c upstream.

    Consult the 'unconditional IO exiting' and 'use IO bitmaps' VM-execution
    controls when checking instruction interception. If the 'use IO bitmaps'
    VM-execution control is 1, check the instruction access against the IO
    bitmaps to determine if the instruction causes a VM-exit.
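
    A sketch of the decision inside the IO intercept helper (the helper used
    for the bitmap lookup is the one factored out in the entry below):

    bool intercept;

    if (nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
            /* Bitmaps take precedence: exit only if the bit for this
             * port/size is set. */
            intercept = nested_vmx_check_io_bitmaps(vcpu, port, size);
    else
            /* No bitmaps: the unconditional control decides. */
            intercept = nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING);

    return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;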

    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • commit e71237d3ff1abf9f3388337cfebf53b96df2020d upstream.

    Checks against the IO bitmap are useful for both instruction emulation
    and VM-exit reflection. Refactor the IO bitmap checks into a helper
    function.

    Signed-off-by: Oliver Upton
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oliver Upton
     
  • commit 7455a8327674e1a7c9a1f5dd1b0743ab6713f6d1 upstream.

    Commit 13db77347db1 ("KVM: x86: don't notify userspace IOAPIC on edge
    EOI") explained that edge-triggered interrupts don't set a bit in the TMR,
    which means the IOAPIC isn't notified on EOI, and that the variable 'level'
    indicates a level-triggered interrupt.
    But commit 3159d36ad799 ("KVM: x86: use generic function for MSI parsing")
    replaced 'level' with irq.level by mistake. Fix it by changing irq.level
    to irq.trig_mode.

    Cc: stable@vger.kernel.org
    Fixes: 3159d36ad799 ("KVM: x86: use generic function for MSI parsing")
    Signed-off-by: Miaohe Lin
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 07721feee46b4b248402133228235318199b05ec upstream.

    vmx_check_intercept is not yet fully implemented. To avoid emulating
    instructions disallowed by the L1 hypervisor, refuse to emulate
    instructions by default.

    Cc: stable@vger.kernel.org
    [Made commit, added commit msg - Oliver]
    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • commit 21b5ee59ef18e27d85810584caf1f7ddc705ea83 upstream.

    Commit

    aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired)
    performance counter")

    added support for access to the free-running counter via 'perf -e
    msr/irperf/', but when exercised, it always returns a 0 count:

    BEFORE:

    $ perf stat -e instructions,msr/irperf/ true

    Performance counter stats for 'true':

    624,833 instructions
    0 msr/irperf/

    Simply set its enable bit - HWCR bit 30 - to make it start counting.

    Enablement is restricted to all machines advertising IRPERF capability,
    except those susceptible to an erratum that makes the IRPERF return
    bad values.

    That erratum occurs in Family 17h models 00-1fh [1], but not in F17h
    models 20h and above [2].

    AFTER (on a family 17h model 31h machine):

    $ perf stat -e instructions,msr/irperf/ true

    Performance counter stats for 'true':

    621,690 instructions
    622,490 msr/irperf/

    [1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors
    [2] Revision Guide for AMD Family 17h Models 30h-3Fh Processors

    The revision guides are available from the bugzilla Link below.
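
    The enablement itself is a one-liner in the AMD CPU init path; a sketch
    (the erratum check is shown in simplified form):

    /* Turn on the Instructions Retired free counter (HWCR bit 30) on
     * parts advertising IRPERF, except Family 17h models 00-1fh which
     * are hit by the erratum. */
    if (cpu_has(c, X86_FEATURE_IRPERF) &&
        !(c->x86 == 0x17 && c->x86_model <= 0x1f))
            msr_set_bit(MSR_K7_HWCR, 30);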

    [ bp: Massage commit message. ]

    Fixes: aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter")
    Signed-off-by: Kim Phillips
    Signed-off-by: Borislav Petkov
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
    Link: http://lkml.kernel.org/r/20200214201805.13830-1-kim.phillips@amd.com
    Signed-off-by: Greg Kroah-Hartman

    Kim Phillips
     
  • commit 51dede9c05df2b78acd6dcf6a17d21f0877d2d7b upstream.

    Accessing the MCA thresholding controls in sysfs concurrently with CPU
    hotplug can lead to a couple of KASAN-reported issues:

    BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180
    Read of size 8 at addr ffff888367578940 by task grep/4019

    and

    BUG: KASAN: use-after-free in show_error_count+0x15c/0x180
    Read of size 2 at addr ffff888368a05514 by task grep/4454

    for example. Both result from the fact that the threshold block
    creation/teardown code frees the descriptor memory itself instead of
    defining a proper ->release function and leaving it to the driver core to
    take care of that, after all sysfs accesses have completed.

    Do that and get rid of the custom freeing code, fixing the above UAFs in
    the process.
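
    The shape of the fix (sketch; the container_of() helper name is
    hypothetical):

    static void threshold_block_release(struct kobject *kobj)
    {
            /* Called by the driver core once the last sysfs reference
             * is gone; only now is it safe to free the descriptor. */
            kfree(to_threshold_block(kobj));
    }

    static struct kobj_type threshold_ktype = {
            .sysfs_ops      = &threshold_ops,
            .release        = threshold_block_release,
    };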

    [ bp: write commit message. ]

    Fixes: 95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors")
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Borislav Petkov
    Cc:
    Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 6e5cf31fbe651bed7ba1df768f2e123531132417 upstream.

    threshold_create_bank() creates a bank descriptor per MCA error
    thresholding counter which can be controlled over sysfs. It publishes
    the pointer to that bank in a per-CPU variable and then goes on to
    create additional thresholding blocks if the bank has such.

    However, that creation of additional blocks in
    allocate_threshold_blocks() can fail, leading to a use-after-free
    through the per-CPU pointer.

    Therefore, publish that pointer only after all blocks have been setup
    successfully.
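
    Schematically, in threshold_create_bank() (error handling abbreviated,
    names follow the upstream code):

    /* Create the bank descriptor and all of its blocks first ... */
    err = allocate_threshold_blocks(cpu, b, bank, 0, address);
    if (err)
            goto out_free;

    /* ... and publish it only once everything is in place, so nothing
     * can reach a half-initialized bank through the per-CPU pointer. */
    per_cpu(threshold_banks, cpu)[bank] = b;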

    Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor")
    Reported-by: Saar Amar
    Reported-by: Dan Carpenter
    Signed-off-by: Borislav Petkov
    Cc:
    Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain
    Signed-off-by: Greg Kroah-Hartman

    Borislav Petkov
     
  • commit ff5ac61ee83c13f516544d29847d28be093a40ee upstream.

    The IMA arch code attempts to inspect the "SetupMode" EFI variable
    by populating a variable called efi_SetupMode_name with the string
    "SecureBoot" and passing that to the EFI GetVariable service, which
    obviously does not yield the expected result.

    Given that the string is only referenced a single time, let's get
    rid of the intermediate variable, and pass the correct string as
    an immediate argument. While at it, do the same for "SecureBoot".
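
    After the fix, the lookup passes the literal variable name directly; a
    sketch:

    status = efi.get_variable(L"SetupMode", &efi_variable_guid,
                              NULL, &size, &setupmode);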

    Fixes: 399574c64eaf ("x86/ima: retry detecting secure boot mode")
    Fixes: 980ef4d22a95 ("x86/ima: check EFI SetupMode too")
    Cc: Matthew Garrett
    Signed-off-by: Ard Biesheuvel
    Cc: stable@vger.kernel.org # v5.3
    Signed-off-by: Mimi Zohar
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     

24 Feb, 2020

8 commits

  • [ Upstream commit 8b7e20a7ba54836076ff35a28349dabea4cec48f ]

    Add the TEST opcode to Group3-2 reg=001b, just as Group3-1 already has it.

    Commit

    12a78d43de76 ("x86/decoder: Add new TEST instruction pattern")

    added a TEST opcode assignment to f6 XX/001/XXX (Group 3-1), but did
    not add f7 XX/001/XXX (Group 3-2).

    Actually, this TEST opcode variant (ModRM.reg /1) is not described in
    the Intel SDM Vol2 but in AMD64 Architecture Programmer's Manual Vol.3,
    Appendix A.2 Table A-6. ModRM.reg Extensions for the Primary Opcode Map.

    Without this fix, Randy found a warning from insn_decoder_test related
    to this issue, shown below.

    HOSTCC arch/x86/tools/insn_decoder_test
    HOSTCC arch/x86/tools/insn_sanity
    TEST posttest
    arch/x86/tools/insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this.
    arch/x86/tools/insn_decoder_test: warning: ffffffff81000bf1: f7 0b 00 01 08 00 testl $0x80100,(%rbx)
    arch/x86/tools/insn_decoder_test: warning: objdump says 6 bytes, but insn_get_length() says 2
    arch/x86/tools/insn_decoder_test: warning: Decoded and checked 11913894 instructions with 1 failures
    TEST posttest
    arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0x871ce29c)

    To fix this error, add the TEST opcode according to AMD64 APM Vol.3.
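
    In arch/x86/lib/x86-opcode-map.txt this is a one-line addition to the
    Grp3_2 table, roughly (the reg=001b entry is the new line):

    GrpTable: Grp3_2
    0: TEST Ev,Iz
    1: TEST Ev,Iz
    2: NOT Ev
    3: NEG Ev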

    [ bp: Massage commit message. ]

    Reported-by: Randy Dunlap
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Acked-by: Randy Dunlap
    Tested-by: Randy Dunlap
    Link: https://lkml.kernel.org/r/157966631413.9580.10311036595431878351.stgit@devnote2
    Signed-off-by: Sasha Levin

    Masami Hiramatsu
     
  • [ Upstream commit 75fbef0a8b6b4bb19b9a91b5214f846c2dc5139e ]

    The following commit:

    15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()")

    modified kernel_map_pages_in_pgd() to manage writable permissions
    of memory mappings in the EFI page table in a different way, but
    in the process, it removed the ability to clear NX attributes from
    read-only mappings, by clobbering the clear mask if _PAGE_RW is not
    being requested.

    Failure to remove the NX attribute from read-only mappings is
    unlikely to be a security issue, but it does prevent us from
    tightening the permissions in the EFI page tables going forward,
    so let's fix it now.

    Fixes: 15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()")
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200113172245.27925-5-ardb@kernel.org
    Signed-off-by: Sasha Levin

    Ard Biesheuvel
     
  • [ Upstream commit 471af006a747f1c535c8a8c6c0973c320fe01b22 ]

    AMD Family 17h processors and above gain support for Large Increment
    per Cycle events. Unfortunately there is no CPUID or equivalent bit
    that indicates whether the feature exists or not, so we continue to
    determine eligibility based on a CPU family number comparison.

    For Large Increment per Cycle events, we add a f17h-and-compatibles
    get_event_constraints_f17h() that returns an even counter bitmask:
    Large Increment per Cycle events can only be placed on PMCs 0, 2,
    and 4 out of the currently available 0-5. The only currently
    public event that requires this feature to report valid counts
    is PMCx003 "Retired SSE/AVX Operations".
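
    A sketch of the new constraint callback (helper and constraint names may
    differ slightly from the upstream patch):

    static struct event_constraint *
    amd_get_event_constraints_f17h(struct cpu_hw_events *cpuc, int idx,
                                   struct perf_event *event)
    {
            /* Large Increment per Cycle ("paired") events consume two
             * adjacent counters and must start on an even one, so they
             * get a constraint whose mask covers PMCs 0, 2 and 4. */
            if (amd_is_pair_event_code(&event->hw))
                    return &pair_constraint;

            return &unconstrained;
    }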

    Note that the CPU family logic in amd_core_pmu_init() is changed
    so as to be able to selectively add initialization for features
    available in ranges of backward-compatible CPU families. This
    Large Increment per Cycle feature is expected to be retained
    in future families.

    A side-effect of assigning a new get_constraints function for f17h is
    that it disables calling the old (prior to f15h) amd_get_event_constraints
    implementation left enabled by commit e40ed1542dd7 ("perf/x86: Add perf
    support for AMD family-17h processors"), which is no longer
    necessary since those North Bridge event codes are obsoleted.

    Also fix a spelling mistake whilst in the area (calulating ->
    calculating).

    Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
    Signed-off-by: Kim Phillips
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191114183720.19887-2-kim.phillips@amd.com
    Signed-off-by: Sasha Levin

    Kim Phillips
     
  • [ Upstream commit 248ed51048c40d36728e70914e38bffd7821da57 ]

    First, printk() is NMI-context safe now since the safe printk() has been
    implemented and it already has an irq_work to make NMI-context safe.

    Second, this NMI irq_work actually does not work if an NMI handler causes
    panic by watchdog timeout. It has no chance to run in such case, while
    the safe printk() will flush its per-cpu buffers before panicking.

    While at it, repurpose the irq_work callback into a function which
    concentrates the NMI duration checking and makes the code easier to
    follow.

    [ bp: Massage. ]

    Signed-off-by: Changbin Du
    Signed-off-by: Borislav Petkov
    Acked-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200111125427.15662-1-changbin.du@gmail.com
    Signed-off-by: Sasha Levin

    Changbin Du
     
  • [ Upstream commit e2d68a955e49d61fd0384f23e92058dc9b79be5e ]

    The logic in __efi_enter_virtual_mode() does a number of steps in
    sequence, all of which may fail in one way or the other. In most
    cases, we simply print an error and disable EFI runtime services
    support, but in some cases, we BUG() or panic() and bring down the
    system when encountering conditions that we could easily handle in
    the same way.

    While at it, replace a pointless page-to-virt-phys conversion with
    one that goes straight from struct page to physical.

    Signed-off-by: Ard Biesheuvel
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arvind Sankar
    Cc: Matthew Garrett
    Cc: linux-efi@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200103113953.9571-14-ardb@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Ard Biesheuvel
     
  • [ Upstream commit bff47c2302cc249bcd550b17067f8dddbd4b6f77 ]

    When building with C=1, sparse issues a warning:

    CHECK arch/x86/entry/vdso/vdso32-setup.c
    arch/x86/entry/vdso/vdso32-setup.c:28:28: warning: symbol 'vdso32_enabled' was not declared. Should it be static?

    Provide the missing header file.

    Signed-off-by: Valdis Kletnieks
    Signed-off-by: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: x86-ml
    Link: https://lkml.kernel.org/r/36224.1575599767@turing-police
    Signed-off-by: Sasha Levin

    Valdis Klētnieks
     
  • [ Upstream commit dacc9092336be20b01642afe1a51720b31f60369 ]

    When checking whether the reported lfb_size makes sense, the
    height * stride result is page-aligned before checking whether it
    exceeds the reported size.

    This doesn't work if height * stride is not an exact number of pages.
    For example, as reported in the kernel bugzilla below, an 800x600x32 EFI
    framebuffer gets skipped because of this.

    Move the PAGE_ALIGN to after the check vs size.
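
    Schematically (names simplified), the sanity check now uses the exact
    product and only aligns it afterwards, for the mapping:

    length = mode->height * mode->stride;
    if (length > size) {
            /* Reject modes that genuinely don't fit into the reported
             * framebuffer; an 800x600x32 mode no longer trips this. */
            printk(KERN_WARNING "sysfb: VRAM smaller than advertised\n");
            return -EINVAL;
    }
    length = PAGE_ALIGN(length);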

    Reported-by: Christopher Head
    Tested-by: Christopher Head
    Signed-off-by: Arvind Sankar
    Signed-off-by: Borislav Petkov
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206051
    Link: https://lkml.kernel.org/r/20200107230410.2291947-1-nivedita@alum.mit.edu
    Signed-off-by: Sasha Levin

    Arvind Sankar
     
  • [ Upstream commit ffc2760bcf2dba0dbef74013ed73eea8310cc52c ]

    Fix a couple of issues with the way we map and copy the vendor string:
    - we map only 2 bytes, which usually works since you get at least a
    page, but if the vendor string happens to cross a page boundary,
    a crash will result
    - only call early_memunmap() if early_memremap() succeeded, or we will
    call it with a NULL address which it doesn't like,
    - while at it, switch to early_memremap_ro(), and array indexing rather
    than pointer dereferencing to read the CHAR16 characters.
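
    Put together, the fixed sequence looks roughly like this ('fw_vendor' here
    stands for the physical address of the firmware vendor string; buffer size
    illustrative):

    efi_char16_t *c16;
    char vendor[100] = "unknown";
    int i;

    /* Map the whole (bounded) UTF-16 string read-only, not just two
     * bytes of it, so a string crossing a page boundary is safe. */
    c16 = early_memremap_ro(fw_vendor, sizeof(vendor) * sizeof(efi_char16_t));
    if (c16) {
            for (i = 0; i < sizeof(vendor) - 1 && c16[i]; ++i)
                    vendor[i] = c16[i];
            vendor[i] = '\0';

            /* Only unmap what was successfully mapped. */
            early_memunmap(c16, sizeof(vendor) * sizeof(efi_char16_t));
    } else {
            pr_err("Could not map the firmware vendor!\n");
    }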

    Signed-off-by: Ard Biesheuvel
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arvind Sankar
    Cc: Matthew Garrett
    Cc: linux-efi@vger.kernel.org
    Fixes: 5b83683f32b1 ("x86: EFI runtime service support")
    Link: https://lkml.kernel.org/r/20200103113953.9571-5-ardb@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Ard Biesheuvel