31 Jan, 2019

9 commits

  • commit 867cefb4cb1012f42cada1c7d1f35ac8dd276071 upstream.

    Commit f94c8d11699759 ("sched/clock, x86/tsc: Rework the x86 'unstable'
    sched_clock() interface") broke Xen guest time handling across
    migration:

    [ 187.249951] Freezing user space processes ... (elapsed 0.001 seconds) done.
    [ 187.251137] OOM killer disabled.
    [ 187.251137] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
    [ 187.252299] suspending xenstore...
    [ 187.266987] xen:grant_table: Grant tables using version 1 layout
    [18446743811.706476] OOM killer enabled.
    [18446743811.706478] Restarting tasks ... done.
    [18446743811.720505] Setting capacity to 16777216

    Fix that by setting xen_sched_clock_offset at resume time to ensure a
    monotonic clock value.

    [boris: replaced pr_info() with pr_info_once() in xen_callback_vector()
    to avoid printing with incorrect timestamp during resume (as we
    haven't re-adjusted the clock yet)]

    Fixes: f94c8d11699759 ("sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface")
    Cc: # 4.11
    Reported-by: Hans van Kranenburg
    Signed-off-by: Juergen Gross
    Tested-by: Hans van Kranenburg
    Signed-off-by: Boris Ostrovsky
    Signed-off-by: Juergen Gross
    Signed-off-by: Greg Kroah-Hartman

    Juergen Gross
     
  • commit 38669ba205d178d2d38bfd194a196d65a44d5af2 upstream.

    sched_clock() is expected to count from 0 when the system boots.

    Add an offset, xen_sched_clock_offset (similar to what other hypervisors
    do, e.g. kvm_sched_clock_offset), so that sched_clock() counts from 0
    from the point where time is first initialized; a sketch follows this
    entry.

    Signed-off-by: Pavel Tatashin
    Signed-off-by: Thomas Gleixner
    Cc: steven.sistare@oracle.com
    Cc: daniel.m.jordan@oracle.com
    Cc: linux@armlinux.org.uk
    Cc: schwidefsky@de.ibm.com
    Cc: heiko.carstens@de.ibm.com
    Cc: john.stultz@linaro.org
    Cc: sboyd@codeaurora.org
    Cc: hpa@zytor.com
    Cc: douly.fnst@cn.fujitsu.com
    Cc: peterz@infradead.org
    Cc: prarit@redhat.com
    Cc: feng.tang@intel.com
    Cc: pmladek@suse.com
    Cc: gnomes@lxorguk.ukuu.org.uk
    Cc: linux-s390@vger.kernel.org
    Cc: boris.ostrovsky@oracle.com
    Cc: jgross@suse.com
    Cc: pbonzini@redhat.com
    Link: https://lkml.kernel.org/r/20180719205545.16512-14-pasha.tatashin@oracle.com
    Signed-off-by: Juergen Gross
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
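
    A minimal sketch of the idea; the helper name below is chosen purely for
    illustration (the real patch wires the offset into the existing Xen time
    init path):

    static u64 xen_sched_clock_offset __read_mostly;

    /* Record the clocksource value once, when Xen time is first initialized. */
    static void xen_init_sched_clock_offset(void)      /* illustrative name */
    {
            xen_sched_clock_offset = xen_clocksource_read();
    }

    /* sched_clock() then counts from 0 at that point. */
    static u64 xen_sched_clock(void)
    {
            return xen_clocksource_read() - xen_sched_clock_offset;
    }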
     
  • commit 2229f70b5bbb025e1394b61007938a68060afbfb upstream.

    In order to support the pvclock vdso on Xen we need to set up the time
    info page for vcpu 0 and register the page with Xen using the
    VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall will
    also forcefully update the pvti, which sets some of the flags needed
    for the vdso. Afterwards we check whether it supports the
    PVCLOCK_TSC_STABLE_BIT flag, which is mandatory for vdso/vsyscall
    support, and if so, set the cpu 0 pvti that will later be used when
    mapping the vdso image (a registration sketch follows this entry).

    The xen headers are also updated to include the new hypercall for
    registering the secondary vcpu_time_info struct.

    Signed-off-by: Joao Martins
    Reviewed-by: Juergen Gross
    Reviewed-by: Boris Ostrovsky
    Signed-off-by: Boris Ostrovsky
    Signed-off-by: Juergen Gross
    Signed-off-by: Greg Kroah-Hartman

    Joao Martins
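
    A rough sketch of the registration step, assuming the hypercall and
    struct names follow the Xen public headers (struct/field and helper
    names here are assumptions, error handling trimmed):

    static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;

    static void xen_setup_vsyscall_time_info(void)      /* illustrative */
    {
            struct vcpu_register_time_memory_area t;     /* assumed struct name */

            t.addr.v = &xen_clock->pvti;

            /* Register the secondary vcpu_time_info area for vcpu 0 with Xen. */
            if (HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t))
                    return;

            /* Only hand the pvti to the vdso if the stable-TSC bit is set. */
            if (xen_clock->pvti.flags & PVCLOCK_TSC_STABLE_BIT)
                    pvclock_set_pvti_cpu0_va(xen_clock);
    }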
     
  • commit b888808093113ae7d63d213272d01fea4b8329ed upstream.

    Specifically, check for PVCLOCK_TSC_STABLE_BIT and, if it is set, set it
    on the pvclock flags too. This allows the Xen clocksource to use it and
    thus speeds up xen_clocksource_read() callers (i.e. sched_clock()).

    Signed-off-by: Joao Martins
    Reviewed-by: Boris Ostrovsky
    Signed-off-by: Boris Ostrovsky
    Signed-off-by: Juergen Gross
    Signed-off-by: Greg Kroah-Hartman

    Joao Martins
     
  • commit 9f08890ab906abaf9d4c1bad8111755cbd302260 upstream.

    Right now there is only a pvclock_pvti_cpu0_va() which is defined
    on kvmclock since:

    commit dac16fba6fc5
    ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")

    The only user of this interface so far is kvm. This commit adds a
    setter function for the pvti page and moves pvclock_pvti_cpu0_va to
    pvclock, which is a more generic place for it and allows other PV
    clocksources, such as Xen, to use it.

    While moving pvclock_pvti_cpu0_va into pvclock, also rename it to
    pvclock_get_pvti_cpu0_va (including its call sites) to be symmetric
    with the setter (pvclock_set_pvti_cpu0_va); the resulting pair is
    sketched after this entry.

    Signed-off-by: Joao Martins
    Acked-by: Andy Lutomirski
    Acked-by: Paolo Bonzini
    Acked-by: Thomas Gleixner
    Signed-off-by: Boris Ostrovsky
    Signed-off-by: Juergen Gross
    Signed-off-by: Greg Kroah-Hartman

    Joao Martins
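
    The resulting interface is essentially a getter/setter pair around one
    static pointer; a simplified sketch (the real code may carry an extra
    sanity check):

    static struct pvclock_vsyscall_time_info *pvti_cpu0_va;

    void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
    {
            pvti_cpu0_va = pvti;
    }

    struct pvclock_vsyscall_time_info *pvclock_get_pvti_cpu0_va(void)
    {
            return pvti_cpu0_va;
    }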
     
  • Upstream commit:

    f775b13eedee ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")

    introduced a bug, which was later fixed by upstream commit:

    5663d8f9bbe4 ("kvm: x86: fix WARN due to uninitialized guest FPU state")

    For reasons unknown, both commits were initially passed over for
    inclusion in the 4.14 stable branch despite being tagged for stable.
    Eventually, someone noticed that the fixup, commit 5663d8f9bbe4, was
    missing from stable[1], and so it was queued up for 4.14 and included in
    release v4.14.79.

    Even later, the original buggy patch, commit f775b13eedee, was also
    applied to the 4.14 stable branch. Through an unlucky coincidence, the
    incorrect ordering did not generate a conflict between the two patches,
    and led to v4.14.94 and later releases containing a spurious call to
    kvm_load_guest_fpu() in kvm_arch_vcpu_ioctl_run(). As a result, KVM may
    reload stale guest FPU state, e.g. after accepting an INIT event. This
    can manifest as crashes during boot, segfaults, failed checksums, and
    so on.

    Remove the unwanted kvm_{load,put}_guest_fpu() calls, i.e. make
    kvm_arch_vcpu_ioctl_run() look like commit 5663d8f9bbe4 was backported
    after commit f775b13eedee.

    [1] https://www.spinics.net/lists/stable/msg263931.html

    Fixes: 4124a4cff344 ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
    Cc: stable@vger.kernel.org
    Cc: Sasha Levin
    Cc: Greg Kroah-Hartman
    Cc: Peter Xu
    Cc: Rik van Riel
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Reported-by: Roman Mamedov
    Reported-by: Thomas Lindroth
    Signed-off-by: Sean Christopherson
    Acked-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 7e6fc2f50a3197d0e82d1c0e86282976c9e6c8a4 upstream.

    The outb() function takes parameters value and port, in that order. Fix
    the parameters used in the kaslr i8254 fallback code (see the sketch
    after this entry).

    Fixes: 5bfce5ef55cb ("x86, kaslr: Provide randomness functions")
    Signed-off-by: Daniel Drake
    Signed-off-by: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: hpa@zytor.com
    Cc: linux@endlessm.com
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190107034024.15005-1-drake@endlessm.com
    Signed-off-by: Greg Kroah-Hartman

    Daniel Drake
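
    A hedged before/after sketch of the call in question (the register macro
    names are shown for illustration; the point is purely the argument order
    of outb(value, port)):

    /* before: arguments swapped, so the port number was written as the value */
    outb(I8254_PORT_CONTROL, I8254_CMD_READBACK | I8254_SELECT_COUNTER0);

    /* after: value first, port second */
    outb(I8254_CMD_READBACK | I8254_SELECT_COUNTER0, I8254_PORT_CONTROL);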
     
  • commit a31e184e4f69965c99c04cc5eb8a4920e0c63737 upstream.

    Memory protection key behavior should be the same in a child as it was
    in the parent before a fork. But, there is a bug that resets the
    state in the child at fork instead of preserving it.

    The creation of new mm's is a bit convoluted. At fork(), the code
    does:

    1. memcpy() the parent mm to initialize the child
    2. mm_init() to initialize some selected fields
    3. dup_mmap() to create true copies of what memcpy() did not do right

    For pkeys two bits of state need to be preserved across a fork:
    'execute_only_pkey' and 'pkey_allocation_map'.

    Those are preserved by the memcpy(), but mm_init() invokes
    init_new_context() which overwrites 'execute_only_pkey' and
    'pkey_allocation_map' with "new" values.

    The author of the code erroneously believed that init_new_context is *only*
    called at execve()-time. But, alas, init_new_context() is used at execve()
    and fork().

    The result is that, after a fork(), the child's pkey state ends up looking
    like it does after an execve(), which is totally wrong. pkeys that are
    already allocated can be allocated again, for instance.

    To fix this, add code called by dup_mmap() to copy the pkey state from
    parent to child explicitly (a sketch follows this entry). Also add a
    comment above init_new_context() to make it clearer to the next poor
    sod what this code is used for.

    Fixes: e8c24d3a23a ("x86/pkeys: Allocation/free syscalls")
    Signed-off-by: Dave Hansen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: hpa@zytor.com
    Cc: peterz@infradead.org
    Cc: mpe@ellerman.id.au
    Cc: will.deacon@arm.com
    Cc: luto@kernel.org
    Cc: jroedel@suse.de
    Cc: stable@vger.kernel.org
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Michael Ellerman
    Cc: Will Deacon
    Cc: Andy Lutomirski
    Cc: Joerg Roedel
    Link: https://lkml.kernel.org/r/20190102215655.7A69518C@viggo.jf.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Dave Hansen
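
    A sketch of the kind of copy the fix adds (the helper name is
    illustrative; the real patch hooks it into the mm duplication path used
    by dup_mmap()):

    static inline void arch_dup_pkeys(struct mm_struct *oldmm,
                                      struct mm_struct *mm)
    {
            /* Re-copy the two bits of pkey state that the initial memcpy()
             * provided but init_new_context() subsequently clobbered. */
            mm->context.execute_only_pkey   = oldmm->context.execute_only_pkey;
            mm->context.pkey_allocation_map = oldmm->context.pkey_allocation_map;
    }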
     
  • commit 5cc244a20b86090c087073c124284381cdf47234 upstream.

    The single-step debugging of KVM guests on x86 is broken: if we run the
    gdb 'stepi' command at a breakpoint while guest interrupts are enabled,
    RIP always jumps to native_apic_mem_write(). Other nasty effects then
    follow.

    Long investigation showed that on Jun 7, 2017 the
    commit c8401dda2f0a00cd25c0 ("KVM: x86: fix singlestepping over syscall")
    introduced the kvm_run.debug corruption: kvm_vcpu_do_singlestep() can
    be called without X86_EFLAGS_TF set.

    Let's fix it (a sketch of the guard follows this entry). Please consider
    this for -stable.

    Signed-off-by: Alexander Popov
    Cc: stable@vger.kernel.org
    Fixes: c8401dda2f0a00cd25c0 ("KVM: x86: fix singlestepping over syscall")
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Alexander Popov
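
    The guard amounts to only taking the single-step path when EFLAGS.TF was
    actually set; a simplified sketch of the skip-instruction helper:

    int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
    {
            unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
            int r = EMULATE_DONE;

            kvm_x86_ops->skip_emulated_instruction(vcpu);

            /* Only report a single-step debug exit when TF is set, so that
             * kvm_run.debug is not clobbered with stale data. */
            if (unlikely(rflags & X86_EFLAGS_TF))
                    kvm_vcpu_do_singlestep(vcpu, &r);

            return r == EMULATE_DONE;
    }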
     

26 Jan, 2019

1 commit

  • [ Upstream commit 68b5e4326e4b8ac9080835005d8254fed0fb3c56 ]

    Add the proper includes and make smca_get_name() static.

    Also fix an actual bug which the warning uncovered:

    arch/x86/kernel/cpu/mcheck/therm_throt.c:395:39: error: conflicting \
    types for ‘smp_thermal_interrupt’
    asmlinkage __visible void __irq_entry smp_thermal_interrupt(struct pt_regs *r)
    ^~~~~~~~~~~~~~~~~~~~~
    In file included from arch/x86/kernel/cpu/mcheck/therm_throt.c:29:
    ./arch/x86/include/asm/traps.h:107:17: note: previous declaration of \
    ‘smp_thermal_interrupt’ was here
    asmlinkage void smp_thermal_interrupt(void);

    Signed-off-by: Borislav Petkov
    Cc: Yi Wang
    Cc: Michael Matz
    Cc: x86@kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1811081633160.1549@nanos.tec.linutronix.de
    Signed-off-by: Sasha Levin

    Borislav Petkov
     

17 Jan, 2019

2 commits

  • commit e4f358916d528d479c3c12bd2fd03f2d5a576380 upstream.

    Commit

    4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")

    replaced the RETPOLINE define with CONFIG_RETPOLINE checks. Remove the
    remaining pieces.

    [ bp: Massage commit message. ]

    Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
    Signed-off-by: WANG Chao
    Signed-off-by: Borislav Petkov
    Reviewed-by: Zhenzhong Duan
    Reviewed-by: Masahiro Yamada
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Daniel Borkmann
    Cc: David Woodhouse
    Cc: Geert Uytterhoeven
    Cc: Jessica Yu
    Cc: Jiri Kosina
    Cc: Kees Cook
    Cc: Konrad Rzeszutek Wilk
    Cc: Luc Van Oostenryck
    Cc: Michal Marek
    Cc: Miguel Ojeda
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Vasily Gorbik
    Cc: linux-kbuild@vger.kernel.org
    Cc: srinivas.eeda@oracle.com
    Cc: stable
    Cc: x86-ml
    Link: https://lkml.kernel.org/r/20181210163725.95977-1-chao.wang@ucloud.cn
    Signed-off-by: Greg Kroah-Hartman

    WANG Chao
     
  • commit f775b13eedee2f7f3c6fdd4e90fb79090ce5d339 upstream.

    Currently, every time a VCPU is scheduled out, the host kernel will
    first save the guest FPU/xstate context, then load the qemu userspace
    FPU context, only to then immediately save the qemu userspace FPU
    context back to memory. When scheduling in a VCPU, the same extraneous
    FPU loads and saves are done.

    This could be avoided by moving from a model where the guest FPU is
    loaded and stored with preemption disabled, to a model where the
    qemu userspace FPU is swapped out for the guest FPU context for
    the duration of the KVM_RUN ioctl (see the sketch after this entry).

    This is done under the VCPU mutex, which is also taken when other
    tasks inspect the VCPU FPU context, so the code should already be
    safe for this change. That should come as no surprise, given that
    s390 already has this optimization.

    This can fix a bug where KVM calls get_user_pages while owning the
    FPU, and the file system ends up requesting the FPU again:

    [258270.527947] __warn+0xcb/0xf0
    [258270.527948] warn_slowpath_null+0x1d/0x20
    [258270.527951] kernel_fpu_disable+0x3f/0x50
    [258270.527953] __kernel_fpu_begin+0x49/0x100
    [258270.527955] kernel_fpu_begin+0xe/0x10
    [258270.527958] crc32c_pcl_intel_update+0x84/0xb0
    [258270.527961] crypto_shash_update+0x3f/0x110
    [258270.527968] crc32c+0x63/0x8a [libcrc32c]
    [258270.527975] dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
    [258270.527978] node_prepare_for_write+0x44/0x70 [dm_persistent_data]
    [258270.527985] dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
    [258270.527988] submit_io+0x170/0x1b0 [dm_bufio]
    [258270.527992] __write_dirty_buffer+0x89/0x90 [dm_bufio]
    [258270.527994] __make_buffer_clean+0x4f/0x80 [dm_bufio]
    [258270.527996] __try_evict_buffer+0x42/0x60 [dm_bufio]
    [258270.527998] dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
    [258270.528002] shrink_slab.part.40+0x1f5/0x420
    [258270.528004] shrink_node+0x22c/0x320
    [258270.528006] do_try_to_free_pages+0xf5/0x330
    [258270.528008] try_to_free_pages+0xe9/0x190
    [258270.528009] __alloc_pages_slowpath+0x40f/0xba0
    [258270.528011] __alloc_pages_nodemask+0x209/0x260
    [258270.528014] alloc_pages_vma+0x1f1/0x250
    [258270.528017] do_huge_pmd_anonymous_page+0x123/0x660
    [258270.528021] handle_mm_fault+0xfd3/0x1330
    [258270.528025] __get_user_pages+0x113/0x640
    [258270.528027] get_user_pages+0x4f/0x60
    [258270.528063] __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
    [258270.528108] try_async_pf+0x66/0x230 [kvm]
    [258270.528135] tdp_page_fault+0x130/0x280 [kvm]
    [258270.528149] kvm_mmu_page_fault+0x60/0x120 [kvm]
    [258270.528158] handle_ept_violation+0x91/0x170 [kvm_intel]
    [258270.528162] vmx_handle_exit+0x1ca/0x1400 [kvm_intel]

    No performance changes were detected in quick ping-pong tests on
    my 4 socket system, which is expected since an FPU+xstate load is
    on the order of 0.1us, while ping-ponging between CPUs is on the
    order of 20us, and somewhat noisy.

    Cc: stable@vger.kernel.org
    Signed-off-by: Rik van Riel
    Suggested-by: Christian Borntraeger
    Signed-off-by: Paolo Bonzini
    [Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
    which happened inside from KVM_CREATE_VCPU ioctl. - Radim]
    Signed-off-by: Radim Krčmář
    Signed-off-by: Greg Kroah-Hartman

    Rik van Riel
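
    Conceptually the guest FPU swap moves to the edges of the KVM_RUN ioctl;
    a much simplified sketch:

    static int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
    {
            int r;

            /* save qemu's user FPU state, load the guest FPU state */
            kvm_load_guest_fpu(vcpu);

            r = vcpu_run(vcpu);     /* may be preempted/rescheduled many times */

            /* save the guest FPU state, restore qemu's user FPU state */
            kvm_put_guest_fpu(vcpu);

            return r;
    }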
     

13 Jan, 2019

2 commits

  • [ Upstream commit 254eb5505ca0ca749d3a491fc6668b6c16647a99 ]

    The LDT remap placement has been changed. It's now placed before the direct
    mapping in the kernel virtual address space for both paging modes.

    Change address markers order accordingly.

    Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: hpa@zytor.com
    Cc: dave.hansen@linux.intel.com
    Cc: luto@kernel.org
    Cc: peterz@infradead.org
    Cc: boris.ostrovsky@oracle.com
    Cc: jgross@suse.com
    Cc: bhe@redhat.com
    Cc: hans.van.kranenburg@mendix.com
    Cc: linux-mm@kvack.org
    Cc: xen-devel@lists.xenproject.org
    Link: https://lkml.kernel.org/r/20181130202328.65359-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
     
  • [ Upstream commit 16877a5570e0c5f4270d5b17f9bab427bcae9514 ]

    There is a guard hole at the beginning of the kernel address space, also
    used by hypervisors. It occupies 16 PGD entries.

    This reserved range is not defined explicitly; it is calculated relative
    to other entities: the direct mapping and user space ranges.

    The calculation got broken by recent changes of the kernel memory layout:
    LDT remap range is now mapped before direct mapping and makes the
    calculation invalid.

    The breakage leads to crash on Xen dom0 boot[1].

    Define the reserved range explicitly (a sketch of the resulting
    constants follows this entry). It's part of the kernel ABI (hypervisors
    expect it to be stable) and must not depend on changes in the rest of
    the kernel memory layout.

    [1] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03313.html

    Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging")
    Reported-by: Hans van Kranenburg
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Thomas Gleixner
    Tested-by: Hans van Kranenburg
    Reviewed-by: Juergen Gross
    Cc: bp@alien8.de
    Cc: hpa@zytor.com
    Cc: dave.hansen@linux.intel.com
    Cc: luto@kernel.org
    Cc: peterz@infradead.org
    Cc: boris.ostrovsky@oracle.com
    Cc: bhe@redhat.com
    Cc: linux-mm@kvack.org
    Cc: xen-devel@lists.xenproject.org
    Link: https://lkml.kernel.org/r/20181130202328.65359-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
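
    The explicit definition boils down to a handful of constants; a sketch
    of the shape (16 PGD entries at the start of the kernel half of the
    address space; the macro names are my best recollection of the upstream
    header and may differ):

    /* Reserved guard hole, also relied upon by hypervisors (kernel ABI). */
    #define GUARD_HOLE_PGD_ENTRY    -256UL
    #define GUARD_HOLE_SIZE         (16UL << PGDIR_SHIFT)
    #define GUARD_HOLE_BASE_ADDR    (GUARD_HOLE_PGD_ENTRY << PGDIR_SHIFT)
    #define GUARD_HOLE_END_ADDR     (GUARD_HOLE_BASE_ADDR + GUARD_HOLE_SIZE)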
     

10 Jan, 2019

4 commits

  • commit 1b3ab5ad1b8ad99bae76ec583809c5f5a31c707c upstream.

    Fixes: 34a1cd60d17f ("kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit e81434995081fd7efb755fd75576b35dbb0850b1 upstream.

    ____kvm_handle_fault_on_reboot() provides a generic exception fixup
    handler that is used to cleanly handle faults on VMX/SVM instructions
    during reboot (or at least try to). If there isn't a reboot in
    progress, ____kvm_handle_fault_on_reboot() treats any exception as
    fatal to KVM and invokes kvm_spurious_fault(), which in turn generates
    a BUG() to get a stack trace and die.

    When it was originally added by commit 4ecac3fd6dc2 ("KVM: Handle
    virtualization instruction #UD faults during reboot"), the "call" to
    kvm_spurious_fault() was handcoded as PUSH+JMP, where the PUSH'd value
    is the RIP of the faulting instruction.

    The PUSH+JMP trickery is necessary because the exception fixup handler
    code lies outside of its associated function, e.g. right after the
    function. An actual CALL from the .fixup code would show a slightly
    bogus stack trace, e.g. an extra "random" function would be inserted
    into the trace, as the return RIP on the stack would point to no known
    function (and the unwinder will likely try to guess who owns the RIP).

    Unfortunately, the JMP was replaced with a CALL when the macro was
    reworked to not spin indefinitely during reboot (commit b7c4145ba2eb
    "KVM: Don't spin on virt instruction faults during reboot"). This
    causes the aforementioned behavior where a bogus function is inserted
    into the stack trace, e.g. my builds like to blame free_kvm_area().

    Revert the CALL back to a JMP. The changelog for commit b7c4145ba2eb
    ("KVM: Don't spin on virt instruction faults during reboot") contains
    nothing that indicates the switch to CALL was deliberate. This is
    backed up by the fact that the PUSH was left intact.

    Note that an alternative to the PUSH+JMP magic would be to JMP back
    to the "real" code and CALL from there, but that would require adding
    a JMP in the non-faulting path to avoid calling kvm_spurious_fault()
    and would add no value, i.e. the stack trace would be the same.

    Using CALL:

    ------------[ cut here ]------------
    kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
    invalid opcode: 0000 [#1] SMP
    CPU: 4 PID: 1057 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
    Code: 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
    RSP: 0018:ffffc900004bbcc8 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff888273fd8000 R08: 00000000000003e8 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000371fb0
    R13: 0000000000000000 R14: 000000026d763cf4 R15: ffff888273fd8000
    FS: 00007f3d69691700(0000) GS:ffff888277800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f89bc56fe0 CR3: 0000000271a5a001 CR4: 0000000000362ee0
    Call Trace:
    free_kvm_area+0x1044/0x43ea [kvm_intel]
    ? vmx_vcpu_run+0x156/0x630 [kvm_intel]
    ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
    ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
    ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
    ? __set_task_blocked+0x38/0x90
    ? __set_current_blocked+0x50/0x60
    ? __fpu__restore_sig+0x97/0x490
    ? do_vfs_ioctl+0xa1/0x620
    ? __x64_sys_futex+0x89/0x180
    ? ksys_ioctl+0x66/0x70
    ? __x64_sys_ioctl+0x16/0x20
    ? do_syscall_64+0x4f/0x100
    ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
    ---[ end trace 9775b14b123b1713 ]---

    Using JMP:

    ------------[ cut here ]------------
    kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
    invalid opcode: 0000 [#1] SMP
    CPU: 6 PID: 1067 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
    Code: 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
    RSP: 0018:ffffc90000497cd0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88827058bd40 R08: 00000000000003e8 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000369fb0
    R13: 0000000000000000 R14: 00000003c8fc6642 R15: ffff88827058bd40
    FS: 00007f3d7219e700(0000) GS:ffff888277900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f3d64001000 CR3: 0000000271c6b004 CR4: 0000000000362ee0
    Call Trace:
    vmx_vcpu_run+0x156/0x630 [kvm_intel]
    ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
    ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
    ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
    ? __set_task_blocked+0x38/0x90
    ? __set_current_blocked+0x50/0x60
    ? __fpu__restore_sig+0x97/0x490
    ? do_vfs_ioctl+0xa1/0x620
    ? __x64_sys_futex+0x89/0x180
    ? ksys_ioctl+0x66/0x70
    ? __x64_sys_ioctl+0x16/0x20
    ? do_syscall_64+0x4f/0x100
    ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
    ---[ end trace f9daedb85ab3ddba ]---

    Fixes: b7c4145ba2eb ("KVM: Don't spin on virt instruction faults during reboot")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit ba6f508d0ec4adb09f0a939af6d5e19cdfa8667d upstream.

    Commit:

    f77084d96355 "x86/mm/pat: Disable preemption around __flush_tlb_all()"

    addressed a case where __flush_tlb_all() is called without preemption
    being disabled. It also left a warning to catch other cases where
    preemption is not disabled.

    That warning triggers for the memory hotplug path which is also used for
    persistent memory enabling:

    WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
    RIP: 0010:__flush_tlb_all+0x1b/0x3a
    [..]
    Call Trace:
    phys_pud_init+0x29c/0x2bb
    kernel_physical_mapping_init+0xfc/0x219
    init_memory_mapping+0x1a5/0x3b0
    arch_add_memory+0x2c/0x50
    devm_memremap_pages+0x3aa/0x610
    pmem_attach_disk+0x585/0x700 [nd_pmem]

    Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
    and Dave confirmed the expectation for TLB flush is for modifying /
    invalidating existing PTE entries, but not initial population [2]. Drop
    the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
    expectation that this path is only ever populating empty entries for the
    linear map. Note, at linear map teardown time there is a call to the
    all-cpu flush_tlb_all() to invalidate the removed mappings.

    [1]: https://lkml.kernel.org/r/9DFD717D-857D-493D-A606-B635D72BAC21@amacapital.net
    [2]: https://lkml.kernel.org/r/749919a4-cdb1-48a3-adb4-adb81a5fa0b5@intel.com

    [ mingo: Minor readability edits. ]

    Suggested-by: Dave Hansen
    Reported-by: Andy Lutomirski
    Signed-off-by: Dan Williams
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Kirill A. Shutemov
    Cc:
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Cc: dave.hansen@intel.com
    Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
    Link: http://lkml.kernel.org/r/154395944713.32119.15611079023837132638.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 5b5e4d623ec8a34689df98e42d038a3b594d2ff9 upstream.

    Swap storage is restricted to max_swapfile_size (~16TB on x86_64) whenever
    the system is deemed affected by L1TF vulnerability. Even though the limit
    is quite high for most deployments it seems to be too restrictive for
    deployments which are willing to live with the mitigation disabled.

    We have a customer deploying 8x 6.4TB PCIe/NVMe SSD swap devices, which
    is clearly beyond the limit.

    Drop the swap restriction when l1tf=off is specified (a sketch follows
    this entry). It also doesn't make much sense to warn about too much
    memory for the l1tf mitigation when it has been forcefully disabled by
    the administrator.

    [ tglx: Folded the documentation delta change ]

    Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
    Signed-off-by: Michal Hocko
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Andi Kleen
    Acked-by: Jiri Kosina
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc:
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181113184910.26697-1-mhocko@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
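
    A simplified sketch of the resulting check in max_swapfile_size() (the
    real code also accounts for how swap offsets are encoded):

    unsigned long max_swapfile_size(void)
    {
            unsigned long pages = generic_max_swapfile_size();

            /* Clamp swap to ~MAX_PA/2 only while the L1TF mitigation is active;
             * l1tf=off now skips the restriction (and the related warning). */
            if (boot_cpu_has_bug(X86_BUG_L1TF) &&
                l1tf_mitigation != L1TF_MITIGATION_OFF)
                    pages = min_t(unsigned long long, l1tf_pfn_limit(), pages);

            return pages;
    }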
     

29 Dec, 2018

3 commits

  • commit 32043fa065b51e0b1433e48d118821c71b5cd65d upstream.

    Currently the copy_to_user() of data in the gentry struct copies
    uninitialized data in the _pad field from the stack to userspace.

    Fix this by explicitly memset'ing gentry to zero; this also zeroes any
    compiler-added padding fields that may be in the struct (currently
    there are none). A sketch follows this entry.

    Detected by CoverityScan, CID#200783 ("Uninitialized scalar variable")

    Fixes: b263b31e8ad6 ("x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls")
    Signed-off-by: Colin Ian King
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Tyler Hicks
    Cc: security@kernel.org
    Link: https://lkml.kernel.org/r/20181218172956.1440-1-colin.king@canonical.com
    Signed-off-by: Greg Kroah-Hartman

    Colin Ian King
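
    The fix is the usual zero-the-struct-before-filling-it pattern; sketched
    here for the affected ioctl path:

    struct mtrr_gentry gentry;

    memset(&gentry, 0, sizeof(gentry));   /* zeroes _pad and any compiler padding */

    /* ... fill in gentry.base/size/regnum/type as before ... */

    if (copy_to_user(arg, &gentry, sizeof(gentry)))
            return -EFAULT;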
     
  • commit c2dd5146e9fe1f22c77c1b011adf84eea0245806 upstream.

    nested_get_vmcs12_pages() processes the posted_intr address in vmcs12.
    It caches the kmap()ed page object and pointer; however, it doesn't
    handle errors correctly: it's possible to cache a valid pointer, then
    release the page and later dereference the dangling pointer.

    I was able to reproduce with the following steps:

    1. Call vmlaunch with valid posted_intr_desc_addr but an invalid
    MSR_EFER. This causes nested_get_vmcs12_pages() to cache the kmap()ed
    pi_desc_page and pi_desc. Later the invalid EFER value fails
    check_vmentry_postreqs() which fails the first vmlaunch.

    2. Call vmlaunch with a valid EFER but an invalid posted_intr_desc_addr
    (I set it to 2G - 0x80). The second time we call nested_get_vmcs12_pages
    pi_desc_page is unmapped and released and pi_desc_page is set to NULL
    (the "shouldn't happen" clause). Due to the invalid
    posted_intr_desc_addr, kvm_vcpu_gpa_to_page() fails and
    nested_get_vmcs12_pages() returns. It doesn't return an error value so
    vmlaunch proceeds. Note that at this time we have a dangling pointer in
    vmx->nested.pi_desc and POSTED_INTR_DESC_ADDR in L0's vmcs.

    3. Issue an IPI in L2 guest code. This triggers a call to
    vmx_complete_nested_posted_interrupt() and pi_test_and_clear_on() which
    dereferences the dangling pointer.

    Vulnerable code requires the nested and enable_apicv variables to be set
    to true. The host CPU must also support posted interrupts. (A sketch of
    the teardown fix follows this entry.)

    Fixes: 5e2f30b756a37 "KVM: nVMX: get rid of nested_get_page()"
    Cc: stable@vger.kernel.org
    Reviewed-by: Andy Honig
    Signed-off-by: Cfir Cohen
    Reviewed-by: Liran Alon
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Cfir Cohen
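
    The teardown must also clear the cached pointer so nothing can
    dereference the released page later; a simplified sketch of the unmap
    path:

    if (vmx->nested.pi_desc_page) {
            kunmap(vmx->nested.pi_desc_page);
            kvm_release_page_dirty(vmx->nested.pi_desc_page);
            vmx->nested.pi_desc_page = NULL;
            vmx->nested.pi_desc = NULL;     /* don't leave a dangling pointer */
    }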
     
  • commit 0e1b869fff60c81b510c2d00602d778f8f59dd9a upstream.

    Some guest OSes (including Windows 10) write to MSR 0xc001102c in some
    cases (possibly while trying to apply a CPU erratum). Make KVM ignore
    reads and writes to that MSR so the guest won't crash (see the sketch
    after this entry).

    The MSR is documented as "Execution Unit Configuration (EX_CFG)",
    at AMD's "BIOS and Kernel Developer's Guide (BKDG) for AMD Family
    15h Models 00h-0Fh Processors".

    Cc: stable@vger.kernel.org
    Signed-off-by: Eduardo Habkost
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Eduardo Habkost
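
    The handling amounts to accepting the MSR and doing nothing with it; a
    sketch of the common get/set MSR paths (the constant name used here for
    0xc001102c is my assumption of the upstream define):

    /* in kvm_set_msr_common(): silently accept writes */
    case MSR_F15H_EX_CFG:
            break;

    /* in kvm_get_msr_common(): reads return 0 */
    case MSR_F15H_EX_CFG:
            msr_info->data = 0;
            break;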
     

21 Dec, 2018

5 commits

  • [ Upstream commit 79c2206d369b87b19ac29cb47601059b6bf5c291 ]

    An affected screen resolution is 1366 x 768, whose width is not
    divisible by 8, the default font width. On such screens, when longer
    lines are earlyprintk'ed, the overflow-to-next-line check can never
    trigger, because the left-most x-coordinate of the next character is
    always less than the screen width. Earlyprintk then loops forever
    trying to print the rest of the string, unable to do so because the
    line is full.

    This patch makes the trigger consider the right-most x-coordinate,
    instead of the left-most, as the value to compare against the screen
    width threshold (see the sketch after this entry).

    Signed-off-by: YiFei Zhu
    Signed-off-by: Ard Biesheuvel
    Cc: Andy Lutomirski
    Cc: Arend van Spriel
    Cc: Bhupesh Sharma
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Eric Snowberg
    Cc: Hans de Goede
    Cc: Joe Perches
    Cc: Jon Hunter
    Cc: Julien Thierry
    Cc: Linus Torvalds
    Cc: Marc Zyngier
    Cc: Matt Fleming
    Cc: Nathan Chancellor
    Cc: Peter Zijlstra
    Cc: Sai Praneeth Prakhya
    Cc: Sedat Dilek
    Cc: Thomas Gleixner
    Cc: linux-efi@vger.kernel.org
    Link: http://lkml.kernel.org/r/20181129171230.18699-12-ard.biesheuvel@linaro.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    YiFei Zhu
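
    In effect the wrap check moves from the glyph's left edge to its right
    edge; a minimal sketch with illustrative variable and helper names:

    /* before: the left edge is always < lfb_width, so this never fires
     * on a 1366-wide screen with an 8-pixel font */
    if (cur_x >= si->lfb_width)
            wrap_to_next_line();

    /* after: compare the right edge of the glyph about to be drawn */
    if (cur_x + font->width > si->lfb_width)
            wrap_to_next_line();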
     
  • commit 7aa54be2976550f17c11a1c3e3630002dea39303 upstream.

    On x86 we cannot do fetch_or() with a single instruction and thus end up
    using a cmpxchg loop, which reduces determinism. Replace the fetch_or()
    with a composite operation: tas-pending + load.

    Using two instructions of course opens a window we previously did not
    have. Consider the scenario:

    CPU0                    CPU1                    CPU2

    1) lock
         trylock -> (0,0,1)

    2)                      lock
                              trylock /* fail */

    3) unlock -> (0,0,0)

    4)                                              lock
                                                      trylock -> (0,0,1)

    5)                      tas-pending -> (0,1,1)
                            load-val (0,0,1)

    FAIL: _2_ owners

    where 5) is our new composite operation. When we consider each part of
    the qspinlock state as a separate variable (as we can when
    _Q_PENDING_BITS == 8) then the above is entirely possible, because
    tas-pending will only RmW the pending byte, so the later load is able
    to observe prior tail and lock state (but not earlier than its own
    trylock, which operates on the whole word, due to coherence).

    To avoid this we need 2 things:

    - the load must come after the tas-pending (obviously, otherwise it
    can trivially observe prior state).

    - the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for
    example, such that we cannot observe other state prior to setting
    pending.

    On x86 we can realize this by using "LOCK BTS m32, r32" for
    tas-pending followed by a regular load.

    Note that observing later state is not a problem:

    - if we fail to observe a later unlock, we'll simply spin-wait for
    that store to become visible.

    - if we observe a later xchg_tail(), there is no difference from that
    xchg_tail() having taken place before the tas-pending.

    Suggested-by: Will Deacon
    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: andrea.parri@amarulasolutions.com
    Cc: longman@redhat.com
    Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
    Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
    Signed-off-by: Ingo Molnar
    [bigeasy: GEN_BINARY_RMWcc macro redo]
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • commit b247be3fe89b6aba928bf80f4453d1c4ba8d2063 upstream.

    On x86, atomic_cond_read_relaxed will busy-wait with a cpu_relax() loop,
    so it is desirable to increase the number of times we spin on the qspinlock
    lockword when it is found to be transitioning from pending to locked.

    According to Waiman Long:

    | Ideally, the spinning times should be at least a few times the typical
    | cacheline load time from memory which I think can be down to 100ns or
    | so for each cacheline load with the newest systems or up to several
    | hundreds ns for older systems.

    which in his benchmarking corresponded to 512 iterations (see the
    sketch after this entry).

    Suggested-by: Waiman Long
    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-5-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Will Deacon
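
    On x86 this ends up as a single tunable; a sketch of the define (512 per
    the figure above):

    /*
     * Spin up to 512 times while waiting for the pending->locked transition,
     * i.e. a few cacheline-load times even on older systems.
     */
    #define _Q_PENDING_LOOPS        (1 << 9)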
     
  • commit 625e88be1f41b53cec55827c984e4a89ea8ee9f9 upstream.

    'struct __qspinlock' provides a handy union of fields so that
    subcomponents of the lockword can be accessed by name, without having to
    manage shifts and masks explicitly and take endianness into account.

    This is useful in qspinlock.h and also potentially in arch headers, so
    move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra
    definition (the resulting layout is sketched after this entry).

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Acked-by: Boqun Feng
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Will Deacon
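
    The merged layout looks roughly like this (little-endian ordering shown;
    a big-endian build swaps the sub-fields):

    typedef struct qspinlock {
            union {
                    atomic_t val;

                    /* by-name access to the subcomponents of the lock word */
                    struct {
                            u8      locked;
                            u8      pending;
                    };
                    struct {
                            u16     locked_pending;
                            u16     tail;
                    };
            };
    } arch_spinlock_t;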
     
  • commit 25896d073d8a0403b07e6dec56f58e6c33678207 upstream.

    It is troublesome to add a diagnostic like this to the Makefile
    parse stage because the top-level Makefile could be parsed with
    a stale include/config/auto.conf.

    Once you are hit by the error about non-retpoline compiler, the
    compilation still breaks even after disabling CONFIG_RETPOLINE.

    The easiest fix is to move this check to the "archprepare" like
    this commit did:

    829fe4aa9ac1 ("x86: Allow generating user-space headers without a compiler")

    Reported-by: Meelis Roos
    Tested-by: Meelis Roos
    Signed-off-by: Masahiro Yamada
    Acked-by: Zhenzhong Duan
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Zhenzhong Duan
    Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
    Link: http://lkml.kernel.org/r/1543991239-18476-1-git-send-email-yamada.masahiro@socionext.com
    Link: https://lkml.org/lkml/2018/12/4/206
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Cc: Gi-Oh Kim
    Signed-off-by: Greg Kroah-Hartman

    Masahiro Yamada
     

17 Dec, 2018

3 commits

  • [ Upstream commit 123664101aa2156d05251704fc63f9bcbf77741a ]

    This reverts commit b3cf8528bb21febb650a7ecbf080d0647be40b9f.

    That commit unintentionally broke Xen balloon memory hotplug with
    "hotplug_unpopulated" set to 1. As long as the "System RAM" resource
    got assigned under a new "Unusable memory" resource in the I/O memory
    tree, any attempt to online this memory would fail due to the general
    kernel restriction that "System RAM" resources may only be at the 1st
    level.

    The original issue that commit tried to work around, fa564ad96366
    ("x86/PCI: Enable a 64bit BAR on AMD Family 15h (Models 00-1f, 30-3f,
    60-7f)"), was also amended by the follow-up commit 03a551734 ("x86/PCI:
    Move and shrink AMD 64-bit window to avoid conflict"), which made the
    original fix to Xen ballooning unnecessary.

    Signed-off-by: Igor Druzhinin
    Reviewed-by: Boris Ostrovsky
    Signed-off-by: Juergen Gross
    Signed-off-by: Sasha Levin

    Igor Druzhinin
     
  • [ Upstream commit 1e4329ee2c52692ea42cc677fb2133519718b34a ]

    An inline keyword that is not at the beginning of the function
    declaration may trigger the following build warnings, so let's fix that
    (see the sketch after this entry):

    arch/x86/kvm/vmx.c:1309:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
    arch/x86/kvm/vmx.c:5947:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
    arch/x86/kvm/vmx.c:5985:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
    arch/x86/kvm/vmx.c:6023:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]

    Signed-off-by: Yi Wang
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Yi Wang
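
    The warning is purely about keyword order; for illustration (the
    function name is a placeholder):

    /* triggers -Wold-style-declaration */
    static void inline vmx_example(void) { }

    /* clean: storage class, then inline, then the return type */
    static inline void vmx_example(void) { }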
     
  • [ Upstream commit 354cb410d87314e2eda344feea84809e4261570a ]

    We get the following warnings about empty statements when building
    with 'W=1':

    arch/x86/kvm/lapic.c:632:53: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
    arch/x86/kvm/lapic.c:1907:42: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
    arch/x86/kvm/lapic.c:1936:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
    arch/x86/kvm/lapic.c:1975:44: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]

    Rework the debug helper macro to get rid of these warnings (see the
    sketch after this entry).

    Signed-off-by: Yi Wang
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Yi Wang
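
    One common way to rework such a disabled debug helper so that
    'if (x) apic_debug(...);' no longer expands to an empty body (the exact
    upstream macro may differ):

    /* before: expands to nothing, leaving a bare ";" behind the if () */
    #define apic_debug(fmt, arg...)

    /* after: a do/while (0) statement keeps -Wempty-body quiet */
    #define apic_debug(fmt, arg...) do {} while (0)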
     

08 Dec, 2018

1 commit

  • commit 30510387a5e45bfcf8190e03ec7aa15b295828e2 upstream.

    There is a race condition when accessing kvm->arch.apic_access_page_done.
    Due to it, x86_set_memory_region() will fail when creating the second
    vcpu for an SVM guest.

    Add a mutex_lock to serialize the accesses to apic_access_page_done.
    This lock is also used by vmx for the same purpose (a sketch follows
    this entry).

    Signed-off-by: Wei Wang
    Signed-off-by: Amadeusz Juskowiak
    Signed-off-by: Julian Stecklina
    Signed-off-by: Suravee Suthikulpanit
    Reviewed-by: Joerg Roedel
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Wei Wang
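
    A sketch of the serialized path (the lock follows the vmx counterpart
    mentioned above; function and constant names are illustrative):

    static int avic_init_access_page(struct kvm_vcpu *vcpu)
    {
            struct kvm *kvm = vcpu->kvm;
            int ret = 0;

            mutex_lock(&kvm->slots_lock);
            if (kvm->arch.apic_access_page_done)
                    goto out;

            ret = x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
                                        APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);
            if (ret)
                    goto out;

            kvm->arch.apic_access_page_done = true;
    out:
            mutex_unlock(&kvm->slots_lock);
            return ret;
    }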
     

06 Dec, 2018

10 commits

  • commit 67266c1080ad56c31af72b9c18355fde8ccc124a upstream.

    Currently we check for branch tracing only by checking for the
    PERF_COUNT_HW_BRANCH_INSTRUCTIONS event of PERF_TYPE_HARDWARE type.
    But the same event can be defined with the PERF_TYPE_RAW type.

    Change the intel_pmu_has_bts() code to check the event's final hw
    config value, so that both HW types are covered (see the sketch after
    this entry).

    Add unlikely() to the intel_pmu_has_bts() condition calls, because it
    was used in the original code in intel_bts_constraints().

    Signed-off-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20181121101612.16272-2-jolsa@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Jiri Olsa
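
    A sketch of checking the event's final hw config rather than
    attr.type/attr.config (simplified relative to how the upstream helpers
    are split):

    static inline bool intel_pmu_has_bts(struct perf_event *event)
    {
            struct hw_perf_event *hwc = &event->hw;

            if (event->attr.freq)
                    return false;

            /* hwc->config holds the resolved hardware encoding, so this works
             * for PERF_TYPE_HARDWARE and PERF_TYPE_RAW alike. */
            return (hwc->config & INTEL_ARCH_EVENT_MASK) ==
                            x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS) &&
                   hwc->sample_period == 1;
    }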
     
  • commit ed6101bbf6266ee83e620b19faa7c6ad56bb41ab upstream.

    Move the branch tracing setup into a separate intel_pmu_bts_config()
    function in the Intel core object, because it is Intel specific.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20181121101612.16272-1-jolsa@kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Jiri Olsa
     
  • commit 68239654acafe6aad5a3c1dc7237e60accfebc03 upstream.

    The sequence

    fpu->initialized = 1; /* step A */
    preempt_disable(); /* step B */
    fpu__restore(fpu);
    preempt_enable();

    in __fpu__restore_sig() is racy in regard to a context switch.

    For 32bit frames, __fpu__restore_sig() prepares the FPU state within
    fpu->state. To ensure that a context switch (switch_fpu_prepare() in
    particular) does not modify fpu->state it uses fpu__drop() which sets
    fpu->initialized to 0.

    After fpu->initialized is cleared, the CPU's FPU state is not saved
    to fpu->state during a context switch. The new state is loaded via
    fpu__restore(). It gets loaded into fpu->state from userland and
    verified to be sane. fpu->initialized is then set to 1 in order to keep
    fpu__initialize(), which is part of fpu__restore(), from doing anything
    (i.e. from overwriting the new state).

    A context switch between step A and B above would save CPU's current FPU
    registers to fpu->state and overwrite the newly prepared state. This
    looks like a tiny race window but the Kernel Test Robot reported this
    back in 2016 while we had lazy FPU support. Borislav Petkov made the
    link between that report and another patch that has been posted. Since
    the removal of the lazy FPU support, this race goes unnoticed because
    the warning has been removed.

    Disable bottom halves around the restore sequence to avoid the race
    (see the sketch after this entry). BH needs to be disabled because
    softirqs are allowed to run even with preemption disabled and might
    invoke kernel_fpu_begin() while doing IPsec.

    [ bp: massage commit message a bit. ]

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Borislav Petkov
    Acked-by: Ingo Molnar
    Acked-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: "H. Peter Anvin"
    Cc: "Jason A. Donenfeld"
    Cc: kvm ML
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: stable@vger.kernel.org
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/20181120102635.ddv3fvavxajjlfqk@linutronix.de
    Link: https://lkml.kernel.org/r/20160226074940.GA28911@pd.tnic
    Signed-off-by: Greg Kroah-Hartman

    Sebastian Andrzej Siewior
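
    Roughly, the fix brackets the sequence with bottom-half disabling:

    /* sketch of the 32-bit frame path in __fpu__restore_sig() */
    local_bh_disable();        /* keep softirqs (e.g. IPsec) off the FPU */
    fpu->initialized = 1;      /* step A */
    preempt_disable();         /* step B */
    fpu__restore(fpu);
    preempt_enable();
    local_bh_enable();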
     
  • commit 60c8144afc287ef09ce8c1230c6aa972659ba1bb upstream.

    Currently, the code sets up the thresholding interrupt vector and only
    then goes about initializing the thresholding banks. Which is wrong,
    because an early thresholding interrupt would cause a NULL pointer
    dereference when accessing those banks and prevent the machine from
    booting.

    Therefore, set the thresholding interrupt vector only *after* having
    initialized the banks successfully.

    Fixes: 18807ddb7f88 ("x86/mce/AMD: Reset Threshold Limit after logging error")
    Reported-by: Rafał Miłecki
    Reported-by: John Clemens
    Signed-off-by: Borislav Petkov
    Tested-by: Rafał Miłecki
    Tested-by: John Clemens
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac@vger.kernel.org
    Cc: stable@vger.kernel.org
    Cc: Tony Luck
    Cc: x86@kernel.org
    Cc: Yazen Ghannam
    Link: https://lkml.kernel.org/r/20181127101700.2964-1-zajec5@gmail.com
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=201291
    Signed-off-by: Greg Kroah-Hartman

    Borislav Petkov
     
  • commit e97f852fd4561e77721bb9a4e0ea9d98305b1e93 upstream.

    Reported by syzkaller:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000001c8
    PGD 80000003ec4da067 P4D 80000003ec4da067 PUD 3f7bfa067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 7 PID: 5059 Comm: debug Tainted: G OE 4.19.0-rc5 #16
    RIP: 0010:__lock_acquire+0x1a6/0x1990
    Call Trace:
    lock_acquire+0xdb/0x210
    _raw_spin_lock+0x38/0x70
    kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
    vcpu_enter_guest+0x167e/0x1910 [kvm]
    kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
    kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
    do_vfs_ioctl+0xa5/0x690
    ksys_ioctl+0x6d/0x80
    __x64_sys_ioctl+0x1a/0x20
    do_syscall_64+0x83/0x6e0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The reason is that the testcase writes the hyperv synic HV_X64_MSR_SINT6
    msr and triggers the scan-ioapic logic to load synic vectors into the
    EOI exit bitmap. However, the irqchip is not initialized by this simple
    testcase, so the ioapic/apic objects should not be accessed.
    This can be triggered by the following program:

    #define _GNU_SOURCE

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};

    int main(void)
    {
            syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
            long res = 0;
            memcpy((void*)0x20000040, "/dev/kvm", 9);
            res = syscall(__NR_openat, 0xffffffffffffff9c, 0x20000040, 0, 0);
            if (res != -1)
                    r[0] = res;
            res = syscall(__NR_ioctl, r[0], 0xae01, 0);
            if (res != -1)
                    r[1] = res;
            res = syscall(__NR_ioctl, r[1], 0xae41, 0);
            if (res != -1)
                    r[2] = res;
            memcpy(
                (void*)0x20000080,
                "\x01\x00\x00\x00\x00\x5b\x61\xbb\x96\x00\x00\x40\x00\x00\x00\x00\x01\x00"
                "\x08\x00\x00\x00\x00\x00\x0b\x77\xd1\x78\x4d\xd8\x3a\xed\xb1\x5c\x2e\x43"
                "\xaa\x43\x39\xd6\xff\xf5\xf0\xa8\x98\xf2\x3e\x37\x29\x89\xde\x88\xc6\x33"
                "\xfc\x2a\xdb\xb7\xe1\x4c\xac\x28\x61\x7b\x9c\xa9\xbc\x0d\xa0\x63\xfe\xfe"
                "\xe8\x75\xde\xdd\x19\x38\xdc\x34\xf5\xec\x05\xfd\xeb\x5d\xed\x2e\xaf\x22"
                "\xfa\xab\xb7\xe4\x42\x67\xd0\xaf\x06\x1c\x6a\x35\x67\x10\x55\xcb",
                106);
            syscall(__NR_ioctl, r[2], 0x4008ae89, 0x20000080);
            syscall(__NR_ioctl, r[2], 0xae80, 0);
            return 0;
    }

    This patch fixes it by bailing out of the ioapic scan if the ioapic is
    not initialized in the kernel (see the sketch after this entry).

    Reported-by: Wei Wu
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Wei Wu
    Signed-off-by: Wanpeng Li
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li
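
    The shape of the fix is an early-out before touching the apic/ioapic
    objects; a sketch (the exact predicate used upstream may differ):

    static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
    {
            /* Nothing to scan if the in-kernel irqchip was never created. */
            if (!lapic_in_kernel(vcpu) || !ioapic_in_kernel(vcpu->kvm))
                    return;

            /* ... existing EOI exit bitmap / ioapic scan logic ... */
    }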
     
  • commit bcbfbd8ec21096027f1ee13ce6c185e8175166f6 upstream.

    kvm_pv_clock_pairing() allocates the local var
    "struct kvm_clock_pairing clock_pairing" on the stack and initializes
    all of its fields except the padding (clock_pairing.pad[]).

    Because the clock_pairing var is written in full (including padding)
    to guest memory, failing to init the struct padding results in a
    kernel info-leak.

    Fix the issue by making sure to also init the padding with zeroes (see
    the sketch after this entry).

    Fixes: 55dd00a73a51 ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
    Reported-by: syzbot+a8ef68d71211ba264f56@syzkaller.appspotmail.com
    Reviewed-by: Mark Kanda
    Signed-off-by: Liran Alon
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Liran Alon
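
    The fix is a standard memset before filling in the fields; sketched with
    the uapi field names (surrounding locals elided):

    struct kvm_clock_pairing clock_pairing;

    memset(&clock_pairing, 0, sizeof(clock_pairing));   /* covers .pad[] too */
    clock_pairing.sec   = ts.tv_sec;
    clock_pairing.nsec  = ts.tv_nsec;
    clock_pairing.tsc   = cycle;
    clock_pairing.flags = 0;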
     
  • commit fd65d3142f734bc4376053c8d75670041903134d upstream.

    Previously, we only called indirect_branch_prediction_barrier on the
    logical CPU that freed a vmcb. This function should be called on all
    logical CPUs that last loaded the vmcb in question.

    Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
    Reported-by: Neel Natu
    Signed-off-by: Jim Mattson
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Jim Mattson
     
  • commit 0e0fee5c539b61fdd098332e0e2cc375d9073706 upstream.

    When a guest page table is updated via an emulated write,
    kvm_mmu_pte_write() is called to update the shadow PTE using the just
    written guest PTE value. But if two emulated guest PTE writes happened
    concurrently, it is possible that the guest PTE and the shadow PTE end
    up being out of sync. Emulated writes do not mark the shadow page as
    unsync-ed, so this inconsistency will not be resolved even by a guest TLB
    flush (unless the page was marked as unsync-ed at some other point).

    This is fixed by re-reading the current value of the guest PTE after the
    MMU lock has been acquired instead of just using the value that was
    written prior to calling kvm_mmu_pte_write().

    Signed-off-by: Junaid Shahid
    Reviewed-by: Wanpeng Li
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Junaid Shahid
     
  • commit 55a974021ec952ee460dc31ca08722158639de72 upstream

    Provide the possibility to enable IBPB always in combination with 'prctl'
    and 'seccomp'.

    Add the extra command line options and rework the IBPB selection to
    evaluate the command instead of the mode selected by the STIBP switch case.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185006.144047038@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 6b3e64c237c072797a9ec918654a60e3a46488e2 upstream

    If 'prctl' mode of user space protection from spectre v2 is selected
    on the kernel command-line, STIBP and IBPB are applied on tasks which
    restrict their indirect branch speculation via prctl.

    SECCOMP enables the SSBD mitigation for sandboxed tasks already, so it
    makes sense to prevent spectre v2 user space to user space attacks as
    well.

    The Intel mitigation guide documents how STIBP works:

    Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor
    prevents the predicted targets of indirect branches on any logical
    processor of that core from being controlled by software that executes
    (or executed previously) on another logical processor of the same core.

    Ergo, setting STIBP protects the task itself from being attacked by a
    task running on a different hyper-thread, and protects the tasks running
    on different hyper-threads from being attacked.

    While the document suggests that the branch predictors are shielded between
    the logical processors, the observed performance regressions suggest that
    STIBP simply disables the branch predictor more or less completely. Of
    course the document wording is vague, but the fact that there is also no
    requirement for issuing IBPB when STIBP is used points clearly in that
    direction. The kernel still issues IBPB even when STIBP is used until Intel
    clarifies the whole mechanism.

    IBPB is issued when the task switches out, so malicious sandbox code cannot
    mistrain the branch predictor for the next user space task on the same
    logical processor.

    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185006.051663132@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner