31 Jan, 2019
9 commits
-
commit 867cefb4cb1012f42cada1c7d1f35ac8dd276071 upstream.
Commit f94c8d11699759 ("sched/clock, x86/tsc: Rework the x86 'unstable'
sched_clock() interface") broke Xen guest time handling across
migration:[ 187.249951] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 187.251137] OOM killer disabled.
[ 187.251137] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 187.252299] suspending xenstore...
[ 187.266987] xen:grant_table: Grant tables using version 1 layout
[18446743811.706476] OOM killer enabled.
[18446743811.706478] Restarting tasks ... done.
[18446743811.720505] Setting capacity to 16777216Fix that by setting xen_sched_clock_offset at resume time to ensure a
monotonic clock value.[boris: replaced pr_info() with pr_info_once() in xen_callback_vector()
to avoid printing with incorrect timestamp during resume (as we
haven't re-adjusted the clock yet)]Fixes: f94c8d11699759 ("sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface")
Cc: # 4.11
Reported-by: Hans van Kranenburg
Signed-off-by: Juergen Gross
Tested-by: Hans van Kranenburg
Signed-off-by: Boris Ostrovsky
Signed-off-by: Juergen Gross
Signed-off-by: Greg Kroah-Hartman -
commit 38669ba205d178d2d38bfd194a196d65a44d5af2 upstream.
It is expected for sched_clock() to output data from 0, when system boots.
Add an offset xen_sched_clock_offset (similarly how it is done in other
hypervisors i.e. kvm_sched_clock_offset) to count sched_clock() from 0,
when time is first initialized.Signed-off-by: Pavel Tatashin
Signed-off-by: Thomas Gleixner
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-14-pasha.tatashin@oracle.com
Signed-off-by: Juergen Gross
Signed-off-by: Greg Kroah-Hartman -
commit 2229f70b5bbb025e1394b61007938a68060afbfb upstream.
In order to support pvclock vdso on xen we need to setup the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag which is mandatory for having
vdso/vsyscall support. And if so, it will set the cpu 0 pvti that
will be later on used when mapping the vdso image.The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.Signed-off-by: Joao Martins
Reviewed-by: Juergen Gross
Reviewed-by: Boris Ostrovsky
Signed-off-by: Boris Ostrovsky
Signed-off-by: Juergen Gross
Signed-off-by: Greg Kroah-Hartman -
commit b888808093113ae7d63d213272d01fea4b8329ed upstream.
Specifically check for PVCLOCK_TSC_STABLE_BIT and if this bit is set,
then set it too on pvclock flags. This allows Xen clocksource to use it
and thus speeding up xen_clocksource_read() callers (i.e. sched_clock())Signed-off-by: Joao Martins
Reviewed-by: Boris Ostrovsky
Signed-off-by: Boris Ostrovsky
Signed-off-by: Juergen Gross
Signed-off-by: Greg Kroah-Hartman -
commit 9f08890ab906abaf9d4c1bad8111755cbd302260 upstream.
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.While moving pvclock_pvti_cpu0_va into pvclock, rename also this
function to pvclock_get_pvti_cpu0_va (including its call sites)
to be symmetric with the setter (pvclock_set_pvti_cpu0_va).Signed-off-by: Joao Martins
Acked-by: Andy Lutomirski
Acked-by: Paolo Bonzini
Acked-by: Thomas Gleixner
Signed-off-by: Boris Ostrovsky
Signed-off-by: Juergen Gross
Signed-off-by: Greg Kroah-Hartman -
Upstream commit:
f775b13eedee ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
introduced a bug, which was later fixed by upstream commit:
5663d8f9bbe4 ("kvm: x86: fix WARN due to uninitialized guest FPU state")
For reasons unknown, both commits were initially passed-over for
inclusion in the 4.14 stable branch despite being tagged for stable.
Eventually, someone noticed that the fixup, commit 5663d8f9bbe4, was
missing from stable[1], and so it was queued up for 4.14 and included in
release v4.14.79.Even later, the original buggy patch, commit f775b13eedee, was also
applied to the 4.14 stable branch. Through an unlucky coincidence, the
incorrect ordering did not generate a conflict between the two patches,
and led to v4.14.94 and later releases containing a spurious call to
kvm_load_guest_fpu() in kvm_arch_vcpu_ioctl_run(). As a result, KVM may
reload stale guest FPU state, e.g. after accepting in INIT event. This
can manifest as crashes during boot, segfaults, failed checksums and so
on and so forth.Remove the unwanted kvm_{load,put}_guest_fpu() calls, i.e. make
kvm_arch_vcpu_ioctl_run() look like commit 5663d8f9bbe4 was backported
after commit f775b13eedee.[1] https://www.spinics.net/lists/stable/msg263931.html
Fixes: 4124a4cff344 ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
Cc: stable@vger.kernel.org
Cc: Sasha Levin
Cc: Greg Kroah-Hartman
Cc: Peter Xu
Cc: Rik van Riel
Cc: Paolo Bonzini
Cc: Radim Krčmář
Reported-by: Roman Mamedov
Reported-by: Thomas Lindroth
Signed-off-by: Sean Christopherson
Acked-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 7e6fc2f50a3197d0e82d1c0e86282976c9e6c8a4 upstream.
The outb() function takes parameters value and port, in that order. Fix
the parameters used in the kalsr i8254 fallback code.Fixes: 5bfce5ef55cb ("x86, kaslr: Provide randomness functions")
Signed-off-by: Daniel Drake
Signed-off-by: Thomas Gleixner
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: linux@endlessm.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190107034024.15005-1-drake@endlessm.com
Signed-off-by: Greg Kroah-Hartman -
commit a31e184e4f69965c99c04cc5eb8a4920e0c63737 upstream.
Memory protection key behavior should be the same in a child as it was
in the parent before a fork. But, there is a bug that resets the
state in the child at fork instead of preserving it.The creation of new mm's is a bit convoluted. At fork(), the code
does:1. memcpy() the parent mm to initialize child
2. mm_init() to initalize some select stuff stuff
3. dup_mmap() to create true copies that memcpy() did not do rightFor pkeys two bits of state need to be preserved across a fork:
'execute_only_pkey' and 'pkey_allocation_map'.Those are preserved by the memcpy(), but mm_init() invokes
init_new_context() which overwrites 'execute_only_pkey' and
'pkey_allocation_map' with "new" values.The author of the code erroneously believed that init_new_context is *only*
called at execve()-time. But, alas, init_new_context() is used at execve()
and fork().The result is that, after a fork(), the child's pkey state ends up looking
like it does after an execve(), which is totally wrong. pkeys that are
already allocated can be allocated again, for instance.To fix this, add code called by dup_mmap() to copy the pkey state from
parent to child explicitly. Also add a comment above init_new_context() to
make it more clear to the next poor sod what this code is used for.Fixes: e8c24d3a23a ("x86/pkeys: Allocation/free syscalls")
Signed-off-by: Dave Hansen
Signed-off-by: Thomas Gleixner
Reviewed-by: Thomas Gleixner
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: peterz@infradead.org
Cc: mpe@ellerman.id.au
Cc: will.deacon@arm.com
Cc: luto@kernel.org
Cc: jroedel@suse.de
Cc: stable@vger.kernel.org
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Cc: Michael Ellerman
Cc: Will Deacon
Cc: Andy Lutomirski
Cc: Joerg Roedel
Link: https://lkml.kernel.org/r/20190102215655.7A69518C@viggo.jf.intel.com
Signed-off-by: Greg Kroah-Hartman -
commit 5cc244a20b86090c087073c124284381cdf47234 upstream.
The single-step debugging of KVM guests on x86 is broken: if we run
gdb 'stepi' command at the breakpoint when the guest interrupts are
enabled, RIP always jumps to native_apic_mem_write(). Then other
nasty effects follow.Long investigation showed that on Jun 7, 2017 the
commit c8401dda2f0a00cd25c0 ("KVM: x86: fix singlestepping over syscall")
introduced the kvm_run.debug corruption: kvm_vcpu_do_singlestep() can
be called without X86_EFLAGS_TF set.Let's fix it. Please consider that for -stable.
Signed-off-by: Alexander Popov
Cc: stable@vger.kernel.org
Fixes: c8401dda2f0a00cd25c0 ("KVM: x86: fix singlestepping over syscall")
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
26 Jan, 2019
1 commit
-
[ Upstream commit 68b5e4326e4b8ac9080835005d8254fed0fb3c56 ]
Add the proper includes and make smca_get_name() static.
Fix an actual bug too which the warning triggered:
arch/x86/kernel/cpu/mcheck/therm_throt.c:395:39: error: conflicting \
types for ‘smp_thermal_interrupt’
asmlinkage __visible void __irq_entry smp_thermal_interrupt(struct pt_regs *r)
^~~~~~~~~~~~~~~~~~~~~
In file included from arch/x86/kernel/cpu/mcheck/therm_throt.c:29:
./arch/x86/include/asm/traps.h:107:17: note: previous declaration of \
‘smp_thermal_interrupt’ was here
asmlinkage void smp_thermal_interrupt(void);Signed-off-by: Borislav Petkov
Cc: Yi Wang
Cc: Michael Matz
Cc: x86@kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1811081633160.1549@nanos.tec.linutronix.de
Signed-off-by: Sasha Levin
17 Jan, 2019
2 commits
-
commit e4f358916d528d479c3c12bd2fd03f2d5a576380 upstream.
Commit
4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
replaced the RETPOLINE define with CONFIG_RETPOLINE checks. Remove the
remaining pieces.[ bp: Massage commit message. ]
Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
Signed-off-by: WANG Chao
Signed-off-by: Borislav Petkov
Reviewed-by: Zhenzhong Duan
Reviewed-by: Masahiro Yamada
Cc: "H. Peter Anvin"
Cc: Andi Kleen
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Arnd Bergmann
Cc: Daniel Borkmann
Cc: David Woodhouse
Cc: Geert Uytterhoeven
Cc: Jessica Yu
Cc: Jiri Kosina
Cc: Kees Cook
Cc: Konrad Rzeszutek Wilk
Cc: Luc Van Oostenryck
Cc: Michal Marek
Cc: Miguel Ojeda
Cc: Peter Zijlstra
Cc: Tim Chen
Cc: Vasily Gorbik
Cc: linux-kbuild@vger.kernel.org
Cc: srinivas.eeda@oracle.com
Cc: stable
Cc: x86-ml
Link: https://lkml.kernel.org/r/20181210163725.95977-1-chao.wang@ucloud.cn
Signed-off-by: Greg Kroah-Hartman -
commit f775b13eedee2f7f3c6fdd4e90fb79090ce5d339 upstream.
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:[258270.527947] __warn+0xcb/0xf0
[258270.527948] warn_slowpath_null+0x1d/0x20
[258270.527951] kernel_fpu_disable+0x3f/0x50
[258270.527953] __kernel_fpu_begin+0x49/0x100
[258270.527955] kernel_fpu_begin+0xe/0x10
[258270.527958] crc32c_pcl_intel_update+0x84/0xb0
[258270.527961] crypto_shash_update+0x3f/0x110
[258270.527968] crc32c+0x63/0x8a [libcrc32c]
[258270.527975] dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
[258270.527978] node_prepare_for_write+0x44/0x70 [dm_persistent_data]
[258270.527985] dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
[258270.527988] submit_io+0x170/0x1b0 [dm_bufio]
[258270.527992] __write_dirty_buffer+0x89/0x90 [dm_bufio]
[258270.527994] __make_buffer_clean+0x4f/0x80 [dm_bufio]
[258270.527996] __try_evict_buffer+0x42/0x60 [dm_bufio]
[258270.527998] dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
[258270.528002] shrink_slab.part.40+0x1f5/0x420
[258270.528004] shrink_node+0x22c/0x320
[258270.528006] do_try_to_free_pages+0xf5/0x330
[258270.528008] try_to_free_pages+0xe9/0x190
[258270.528009] __alloc_pages_slowpath+0x40f/0xba0
[258270.528011] __alloc_pages_nodemask+0x209/0x260
[258270.528014] alloc_pages_vma+0x1f1/0x250
[258270.528017] do_huge_pmd_anonymous_page+0x123/0x660
[258270.528021] handle_mm_fault+0xfd3/0x1330
[258270.528025] __get_user_pages+0x113/0x640
[258270.528027] get_user_pages+0x4f/0x60
[258270.528063] __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
[258270.528108] try_async_pf+0x66/0x230 [kvm]
[258270.528135] tdp_page_fault+0x130/0x280 [kvm]
[258270.528149] kvm_mmu_page_fault+0x60/0x120 [kvm]
[258270.528158] handle_ept_violation+0x91/0x170 [kvm_intel]
[258270.528162] vmx_handle_exit+0x1ca/0x1400 [kvm_intel]No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.Cc: stable@vger.kernel.org
Signed-off-by: Rik van Riel
Suggested-by: Christian Borntraeger
Signed-off-by: Paolo Bonzini
[Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
which happened inside from KVM_CREATE_VCPU ioctl. - Radim]
Signed-off-by: Radim Krčmář
Signed-off-by: Greg Kroah-Hartman
13 Jan, 2019
2 commits
-
[ Upstream commit 254eb5505ca0ca749d3a491fc6668b6c16647a99 ]
The LDT remap placement has been changed. It's now placed before the direct
mapping in the kernel virtual address space for both paging modes.Change address markers order accordingly.
Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging")
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Thomas Gleixner
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: dave.hansen@linux.intel.com
Cc: luto@kernel.org
Cc: peterz@infradead.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: bhe@redhat.com
Cc: hans.van.kranenburg@mendix.com
Cc: linux-mm@kvack.org
Cc: xen-devel@lists.xenproject.org
Link: https://lkml.kernel.org/r/20181130202328.65359-3-kirill.shutemov@linux.intel.com
Signed-off-by: Sasha Levin -
[ Upstream commit 16877a5570e0c5f4270d5b17f9bab427bcae9514 ]
There is a guard hole at the beginning of the kernel address space, also
used by hypervisors. It occupies 16 PGD entries.This reserved range is not defined explicitely, it is calculated relative
to other entities: direct mapping and user space ranges.The calculation got broken by recent changes of the kernel memory layout:
LDT remap range is now mapped before direct mapping and makes the
calculation invalid.The breakage leads to crash on Xen dom0 boot[1].
Define the reserved range explicitely. It's part of kernel ABI (hypervisors
expect it to be stable) and must not depend on changes in the rest of
kernel memory layout.[1] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03313.html
Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging")
Reported-by: Hans van Kranenburg
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Thomas Gleixner
Tested-by: Hans van Kranenburg
Reviewed-by: Juergen Gross
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: dave.hansen@linux.intel.com
Cc: luto@kernel.org
Cc: peterz@infradead.org
Cc: boris.ostrovsky@oracle.com
Cc: bhe@redhat.com
Cc: linux-mm@kvack.org
Cc: xen-devel@lists.xenproject.org
Link: https://lkml.kernel.org/r/20181130202328.65359-2-kirill.shutemov@linux.intel.com
Signed-off-by: Sasha Levin
10 Jan, 2019
4 commits
-
commit 1b3ab5ad1b8ad99bae76ec583809c5f5a31c707c upstream.
Fixes: 34a1cd60d17f ("kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit e81434995081fd7efb755fd75576b35dbb0850b1 upstream.
____kvm_handle_fault_on_reboot() provides a generic exception fixup
handler that is used to cleanly handle faults on VMX/SVM instructions
during reboot (or at least try to). If there isn't a reboot in
progress, ____kvm_handle_fault_on_reboot() treats any exception as
fatal to KVM and invokes kvm_spurious_fault(), which in turn generates
a BUG() to get a stack trace and die.When it was originally added by commit 4ecac3fd6dc2 ("KVM: Handle
virtualization instruction #UD faults during reboot"), the "call" to
kvm_spurious_fault() was handcoded as PUSH+JMP, where the PUSH'd value
is the RIP of the faulting instructing.The PUSH+JMP trickery is necessary because the exception fixup handler
code lies outside of its associated function, e.g. right after the
function. An actual CALL from the .fixup code would show a slightly
bogus stack trace, e.g. an extra "random" function would be inserted
into the trace, as the return RIP on the stack would point to no known
function (and the unwinder will likely try to guess who owns the RIP).Unfortunately, the JMP was replaced with a CALL when the macro was
reworked to not spin indefinitely during reboot (commit b7c4145ba2eb
"KVM: Don't spin on virt instruction faults during reboot"). This
causes the aforementioned behavior where a bogus function is inserted
into the stack trace, e.g. my builds like to blame free_kvm_area().Revert the CALL back to a JMP. The changelog for commit b7c4145ba2eb
("KVM: Don't spin on virt instruction faults during reboot") contains
nothing that indicates the switch to CALL was deliberate. This is
backed up by the fact that the PUSH was left intact.Note that an alternative to the PUSH+JMP magic would be to JMP back
to the "real" code and CALL from there, but that would require adding
a JMP in the non-faulting path to avoid calling kvm_spurious_fault()
and would add no value, i.e. the stack trace would be the same.Using CALL:
------------[ cut here ]------------
kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
invalid opcode: 0000 [#1] SMP
CPU: 4 PID: 1057 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
Code: 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
RSP: 0018:ffffc900004bbcc8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888273fd8000 R08: 00000000000003e8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000371fb0
R13: 0000000000000000 R14: 000000026d763cf4 R15: ffff888273fd8000
FS: 00007f3d69691700(0000) GS:ffff888277800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f89bc56fe0 CR3: 0000000271a5a001 CR4: 0000000000362ee0
Call Trace:
free_kvm_area+0x1044/0x43ea [kvm_intel]
? vmx_vcpu_run+0x156/0x630 [kvm_intel]
? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
? __set_task_blocked+0x38/0x90
? __set_current_blocked+0x50/0x60
? __fpu__restore_sig+0x97/0x490
? do_vfs_ioctl+0xa1/0x620
? __x64_sys_futex+0x89/0x180
? ksys_ioctl+0x66/0x70
? __x64_sys_ioctl+0x16/0x20
? do_syscall_64+0x4f/0x100
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
---[ end trace 9775b14b123b1713 ]---Using JMP:
------------[ cut here ]------------
kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
invalid opcode: 0000 [#1] SMP
CPU: 6 PID: 1067 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
Code: 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
RSP: 0018:ffffc90000497cd0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88827058bd40 R08: 00000000000003e8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000369fb0
R13: 0000000000000000 R14: 00000003c8fc6642 R15: ffff88827058bd40
FS: 00007f3d7219e700(0000) GS:ffff888277900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3d64001000 CR3: 0000000271c6b004 CR4: 0000000000362ee0
Call Trace:
vmx_vcpu_run+0x156/0x630 [kvm_intel]
? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
? __set_task_blocked+0x38/0x90
? __set_current_blocked+0x50/0x60
? __fpu__restore_sig+0x97/0x490
? do_vfs_ioctl+0xa1/0x620
? __x64_sys_futex+0x89/0x180
? ksys_ioctl+0x66/0x70
? __x64_sys_ioctl+0x16/0x20
? do_syscall_64+0x4f/0x100
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
---[ end trace f9daedb85ab3ddba ]---Fixes: b7c4145ba2eb ("KVM: Don't spin on virt instruction faults during reboot")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit ba6f508d0ec4adb09f0a939af6d5e19cdfa8667d upstream.
Commit:
f77084d96355 "x86/mm/pat: Disable preemption around __flush_tlb_all()"
addressed a case where __flush_tlb_all() is called without preemption
being disabled. It also left a warning to catch other cases where
preemption is not disabled.That warning triggers for the memory hotplug path which is also used for
persistent memory enabling:WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
RIP: 0010:__flush_tlb_all+0x1b/0x3a
[..]
Call Trace:
phys_pud_init+0x29c/0x2bb
kernel_physical_mapping_init+0xfc/0x219
init_memory_mapping+0x1a5/0x3b0
arch_add_memory+0x2c/0x50
devm_memremap_pages+0x3aa/0x610
pmem_attach_disk+0x585/0x700 [nd_pmem]Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
and Dave confirmed the expectation for TLB flush is for modifying /
invalidating existing PTE entries, but not initial population [2]. Drop
the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
expectation that this path is only ever populating empty entries for the
linear map. Note, at linear map teardown time there is a call to the
all-cpu flush_tlb_all() to invalidate the removed mappings.[1]: https://lkml.kernel.org/r/9DFD717D-857D-493D-A606-B635D72BAC21@amacapital.net
[2]: https://lkml.kernel.org/r/749919a4-cdb1-48a3-adb4-adb81a5fa0b5@intel.com[ mingo: Minor readability edits. ]
Suggested-by: Dave Hansen
Reported-by: Andy Lutomirski
Signed-off-by: Dan Williams
Acked-by: Peter Zijlstra (Intel)
Acked-by: Kirill A. Shutemov
Cc:
Cc: Borislav Petkov
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Sebastian Andrzej Siewior
Cc: Thomas Gleixner
Cc: dave.hansen@intel.com
Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
Link: http://lkml.kernel.org/r/154395944713.32119.15611079023837132638.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 5b5e4d623ec8a34689df98e42d038a3b594d2ff9 upstream.
Swap storage is restricted to max_swapfile_size (~16TB on x86_64) whenever
the system is deemed affected by L1TF vulnerability. Even though the limit
is quite high for most deployments it seems to be too restrictive for
deployments which are willing to live with the mitigation disabled.We have a customer to deploy 8x 6,4TB PCIe/NVMe SSD swap devices which is
clearly out of the limit.Drop the swap restriction when l1tf=off is specified. It also doesn't make
much sense to warn about too much memory for the l1tf mitigation when it is
forcefully disabled by the administrator.[ tglx: Folded the documentation delta change ]
Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Signed-off-by: Michal Hocko
Signed-off-by: Thomas Gleixner
Reviewed-by: Pavel Tatashin
Reviewed-by: Andi Kleen
Acked-by: Jiri Kosina
Cc: Linus Torvalds
Cc: Dave Hansen
Cc: Andi Kleen
Cc: Borislav Petkov
Cc:
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181113184910.26697-1-mhocko@kernel.org
Signed-off-by: Greg Kroah-Hartman
29 Dec, 2018
3 commits
-
commit 32043fa065b51e0b1433e48d118821c71b5cd65d upstream.
Currently the copy_to_user of data in the gentry struct is copying
uninitiaized data in field _pad from the stack to userspace.Fix this by explicitly memset'ing gentry to zero, this also will zero any
compiler added padding fields that may be in struct (currently there are
none).Detected by CoverityScan, CID#200783 ("Uninitialized scalar variable")
Fixes: b263b31e8ad6 ("x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls")
Signed-off-by: Colin Ian King
Signed-off-by: Thomas Gleixner
Reviewed-by: Tyler Hicks
Cc: security@kernel.org
Link: https://lkml.kernel.org/r/20181218172956.1440-1-colin.king@canonical.com
Signed-off-by: Greg Kroah-Hartman -
commit c2dd5146e9fe1f22c77c1b011adf84eea0245806 upstream.
nested_get_vmcs12_pages() processes the posted_intr address in vmcs12. It
caches the kmap()ed page object and pointer, however, it doesn't handle
errors correctly: it's possible to cache a valid pointer, then release
the page and later dereference the dangling pointer.I was able to reproduce with the following steps:
1. Call vmlaunch with valid posted_intr_desc_addr but an invalid
MSR_EFER. This causes nested_get_vmcs12_pages() to cache the kmap()ed
pi_desc_page and pi_desc. Later the invalid EFER value fails
check_vmentry_postreqs() which fails the first vmlaunch.2. Call vmlanuch with a valid EFER but an invalid posted_intr_desc_addr
(I set it to 2G - 0x80). The second time we call nested_get_vmcs12_pages
pi_desc_page is unmapped and released and pi_desc_page is set to NULL
(the "shouldn't happen" clause). Due to the invalid
posted_intr_desc_addr, kvm_vcpu_gpa_to_page() fails and
nested_get_vmcs12_pages() returns. It doesn't return an error value so
vmlaunch proceeds. Note that at this time we have a dangling pointer in
vmx->nested.pi_desc and POSTED_INTR_DESC_ADDR in L0's vmcs.3. Issue an IPI in L2 guest code. This triggers a call to
vmx_complete_nested_posted_interrupt() and pi_test_and_clear_on() which
dereferences the dangling pointer.Vulnerable code requires nested and enable_apicv variables to be set to
true. The host CPU must also support posted interrupts.Fixes: 5e2f30b756a37 "KVM: nVMX: get rid of nested_get_page()"
Cc: stable@vger.kernel.org
Reviewed-by: Andy Honig
Signed-off-by: Cfir Cohen
Reviewed-by: Liran Alon
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 0e1b869fff60c81b510c2d00602d778f8f59dd9a upstream.
Some guests OSes (including Windows 10) write to MSR 0xc001102c
on some cases (possibly while trying to apply a CPU errata).
Make KVM ignore reads and writes to that MSR, so the guest won't
crash.The MSR is documented as "Execution Unit Configuration (EX_CFG)",
at AMD's "BIOS and Kernel Developer's Guide (BKDG) for AMD Family
15h Models 00h-0Fh Processors".Cc: stable@vger.kernel.org
Signed-off-by: Eduardo Habkost
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
21 Dec, 2018
5 commits
-
[ Upstream commit 79c2206d369b87b19ac29cb47601059b6bf5c291 ]
An affected screen resolution is 1366 x 768, which width is not
divisible by 8, the default font width. On such screens, when longer
lines are earlyprintk'ed, overflow-to-next-line can never trigger,
due to the left-most x-coordinate of the next character always less
than the screen width. Earlyprintk will infinite loop in trying to
print the rest of the string but unable to, due to the line being
full.This patch makes the trigger consider the right-most x-coordinate,
instead of left-most, as the value to compare against the screen
width threshold.Signed-off-by: YiFei Zhu
Signed-off-by: Ard Biesheuvel
Cc: Andy Lutomirski
Cc: Arend van Spriel
Cc: Bhupesh Sharma
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Eric Snowberg
Cc: Hans de Goede
Cc: Joe Perches
Cc: Jon Hunter
Cc: Julien Thierry
Cc: Linus Torvalds
Cc: Marc Zyngier
Cc: Matt Fleming
Cc: Nathan Chancellor
Cc: Peter Zijlstra
Cc: Sai Praneeth Prakhya
Cc: Sedat Dilek
Cc: Thomas Gleixner
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20181129171230.18699-12-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin -
commit 7aa54be2976550f17c11a1c3e3630002dea39303 upstream.
On x86 we cannot do fetch_or() with a single instruction and thus end up
using a cmpxchg loop, this reduces determinism. Replace the fetch_or()
with a composite operation: tas-pending + load.Using two instructions of course opens a window we previously did not
have. Consider the scenario:CPU0 CPU1 CPU2
1) lock
trylock -> (0,0,1)2) lock
trylock /* fail */3) unlock -> (0,0,0)
4) lock
trylock -> (0,0,1)5) tas-pending -> (0,1,1)
load-val (0,0,1)FAIL: _2_ owners
where 5) is our new composite operation. When we consider each part of
the qspinlock state as a separate variable (as we can when
_Q_PENDING_BITS == 8) then the above is entirely possible, because
tas-pending will only RmW the pending byte, so the later load is able
to observe prior tail and lock state (but not earlier than its own
trylock, which operates on the whole word, due to coherence).To avoid this we need 2 things:
- the load must come after the tas-pending (obviously, otherwise it
can trivially observe prior state).- the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for
example, such that we cannot observe other state prior to setting
pending.On x86 we can realize this by using "LOCK BTS m32, r32" for
tas-pending followed by a regular load.Note that observing later state is not a problem:
- if we fail to observe a later unlock, we'll simply spin-wait for
that store to become visible.- if we observe a later xchg_tail(), there is no difference from that
xchg_tail() having taken place before the tas-pending.Suggested-by: Will Deacon
Reported-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Will Deacon
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
Signed-off-by: Ingo Molnar
[bigeasy: GEN_BINARY_RMWcc macro redo]
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Sasha Levin -
commit b247be3fe89b6aba928bf80f4453d1c4ba8d2063 upstream.
On x86, atomic_cond_read_relaxed will busy-wait with a cpu_relax() loop,
so it is desirable to increase the number of times we spin on the qspinlock
lockword when it is found to be transitioning from pending to locked.According to Waiman Long:
| Ideally, the spinning times should be at least a few times the typical
| cacheline load time from memory which I think can be down to 100ns or
| so for each cacheline load with the newest systems or up to several
| hundreds ns for older systems.which in his benchmarking corresponded to 512 iterations.
Suggested-by: Waiman Long
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-5-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Sasha Levin -
commit 625e88be1f41b53cec55827c984e4a89ea8ee9f9 upstream.
'struct __qspinlock' provides a handy union of fields so that
subcomponents of the lockword can be accessed by name, without having to
manage shifts and masks explicitly and take endianness into account.This is useful in qspinlock.h and also potentially in arch headers, so
move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra
definition.Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Acked-by: Boqun Feng
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Sasha Levin -
commit 25896d073d8a0403b07e6dec56f58e6c33678207 upstream.
It is troublesome to add a diagnostic like this to the Makefile
parse stage because the top-level Makefile could be parsed with
a stale include/config/auto.conf.Once you are hit by the error about non-retpoline compiler, the
compilation still breaks even after disabling CONFIG_RETPOLINE.The easiest fix is to move this check to the "archprepare" like
this commit did:829fe4aa9ac1 ("x86: Allow generating user-space headers without a compiler")
Reported-by: Meelis Roos
Tested-by: Meelis Roos
Signed-off-by: Masahiro Yamada
Acked-by: Zhenzhong Duan
Cc: Borislav Petkov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Zhenzhong Duan
Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
Link: http://lkml.kernel.org/r/1543991239-18476-1-git-send-email-yamada.masahiro@socionext.com
Link: https://lkml.org/lkml/2018/12/4/206
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin
Cc: Gi-Oh Kim
Signed-off-by: Greg Kroah-Hartman
17 Dec, 2018
3 commits
-
[ Upstream commit 123664101aa2156d05251704fc63f9bcbf77741a ]
This reverts commit b3cf8528bb21febb650a7ecbf080d0647be40b9f.
That commit unintentionally broke Xen balloon memory hotplug with
"hotplug_unpopulated" set to 1. As long as "System RAM" resource
got assigned under a new "Unusable memory" resource in IO/Mem tree
any attempt to online this memory would fail due to general kernel
restrictions on having "System RAM" resources as 1st level only.The original issue that commit has tried to workaround fa564ad96366
("x86/PCI: Enable a 64bit BAR on AMD Family 15h (Models 00-1f, 30-3f,
60-7f)") also got amended by the following 03a551734 ("x86/PCI: Move
and shrink AMD 64-bit window to avoid conflict") which made the
original fix to Xen ballooning unnecessary.Signed-off-by: Igor Druzhinin
Reviewed-by: Boris Ostrovsky
Signed-off-by: Juergen Gross
Signed-off-by: Sasha Levin -
[ Upstream commit 1e4329ee2c52692ea42cc677fb2133519718b34a ]
The inline keyword which is not at the beginning of the function
declaration may trigger the following build warnings, so let's fix it:arch/x86/kvm/vmx.c:1309:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
arch/x86/kvm/vmx.c:5947:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
arch/x86/kvm/vmx.c:5985:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
arch/x86/kvm/vmx.c:6023:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]Signed-off-by: Yi Wang
Signed-off-by: Paolo Bonzini
Signed-off-by: Sasha Levin -
[ Upstream commit 354cb410d87314e2eda344feea84809e4261570a ]
We get the following warnings about empty statements when building
with 'W=1':arch/x86/kvm/lapic.c:632:53: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
arch/x86/kvm/lapic.c:1907:42: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
arch/x86/kvm/lapic.c:1936:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
arch/x86/kvm/lapic.c:1975:44: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]Rework the debug helper macro to get rid of these warnings.
Signed-off-by: Yi Wang
Signed-off-by: Paolo Bonzini
Signed-off-by: Sasha Levin
08 Dec, 2018
1 commit
-
commit 30510387a5e45bfcf8190e03ec7aa15b295828e2 upstream.
There is a race condition when accessing kvm->arch.apic_access_page_done.
Due to it, x86_set_memory_region will fail when creating the second vcpu
for a svm guest.Add a mutex_lock to serialize the accesses to apic_access_page_done.
This lock is also used by vmx for the same purpose.Signed-off-by: Wei Wang
Signed-off-by: Amadeusz Juskowiak
Signed-off-by: Julian Stecklina
Signed-off-by: Suravee Suthikulpanit
Reviewed-by: Joerg Roedel
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
06 Dec, 2018
10 commits
-
commit 67266c1080ad56c31af72b9c18355fde8ccc124a upstream.
Currently we check the branch tracing only by checking for the
PERF_COUNT_HW_BRANCH_INSTRUCTIONS event of PERF_TYPE_HARDWARE
type. But we can define the same event with the PERF_TYPE_RAW
type.Changing the intel_pmu_has_bts() code to check on event's final
hw config value, so both HW types are covered.Adding unlikely to intel_pmu_has_bts() condition calls, because
it was used in the original code in intel_bts_constraints.Signed-off-by: Jiri Olsa
Acked-by: Peter Zijlstra
Cc:
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/20181121101612.16272-2-jolsa@kernel.org
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit ed6101bbf6266ee83e620b19faa7c6ad56bb41ab upstream.
Moving branch tracing setup to Intel core object into separate
intel_pmu_bts_config function, because it's Intel specific.Suggested-by: Peter Zijlstra
Signed-off-by: Jiri Olsa
Acked-by: Peter Zijlstra
Cc:
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/20181121101612.16272-1-jolsa@kernel.org
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 68239654acafe6aad5a3c1dc7237e60accfebc03 upstream.
The sequence
fpu->initialized = 1; /* step A */
preempt_disable(); /* step B */
fpu__restore(fpu);
preempt_enable();in __fpu__restore_sig() is racy in regard to a context switch.
For 32bit frames, __fpu__restore_sig() prepares the FPU state within
fpu->state. To ensure that a context switch (switch_fpu_prepare() in
particular) does not modify fpu->state it uses fpu__drop() which sets
fpu->initialized to 0.After fpu->initialized is cleared, the CPU's FPU state is not saved
to fpu->state during a context switch. The new state is loaded via
fpu__restore(). It gets loaded into fpu->state from userland and
ensured it is sane. fpu->initialized is then set to 1 in order to avoid
fpu__initialize() doing anything (overwrite the new state) which is part
of fpu__restore().A context switch between step A and B above would save CPU's current FPU
registers to fpu->state and overwrite the newly prepared state. This
looks like a tiny race window but the Kernel Test Robot reported this
back in 2016 while we had lazy FPU support. Borislav Petkov made the
link between that report and another patch that has been posted. Since
the removal of the lazy FPU support, this race goes unnoticed because
the warning has been removed.Disable bottom halves around the restore sequence to avoid the race. BH
need to be disabled because BH is allowed to run (even with preemption
disabled) and might invoke kernel_fpu_begin() by doing IPsec.[ bp: massage commit message a bit. ]
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Borislav Petkov
Acked-by: Ingo Molnar
Acked-by: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: "H. Peter Anvin"
Cc: "Jason A. Donenfeld"
Cc: kvm ML
Cc: Paolo Bonzini
Cc: Radim Krčmář
Cc: Rik van Riel
Cc: stable@vger.kernel.org
Cc: x86-ml
Link: http://lkml.kernel.org/r/20181120102635.ddv3fvavxajjlfqk@linutronix.de
Link: https://lkml.kernel.org/r/20160226074940.GA28911@pd.tnic
Signed-off-by: Greg Kroah-Hartman -
commit 60c8144afc287ef09ce8c1230c6aa972659ba1bb upstream.
Currently, the code sets up the thresholding interrupt vector and only
then goes about initializing the thresholding banks. Which is wrong,
because an early thresholding interrupt would cause a NULL pointer
dereference when accessing those banks and prevent the machine from
booting.Therefore, set the thresholding interrupt vector only *after* having
initialized the banks successfully.Fixes: 18807ddb7f88 ("x86/mce/AMD: Reset Threshold Limit after logging error")
Reported-by: Rafał Miłecki
Reported-by: John Clemens
Signed-off-by: Borislav Petkov
Tested-by: Rafał Miłecki
Tested-by: John Clemens
Cc: Aravind Gopalakrishnan
Cc: linux-edac@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: Tony Luck
Cc: x86@kernel.org
Cc: Yazen Ghannam
Link: https://lkml.kernel.org/r/20181127101700.2964-1-zajec5@gmail.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201291
Signed-off-by: Greg Kroah-Hartman -
commit e97f852fd4561e77721bb9a4e0ea9d98305b1e93 upstream.
Reported by syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 00000000000001c8
PGD 80000003ec4da067 P4D 80000003ec4da067 PUD 3f7bfa067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 7 PID: 5059 Comm: debug Tainted: G OE 4.19.0-rc5 #16
RIP: 0010:__lock_acquire+0x1a6/0x1990
Call Trace:
lock_acquire+0xdb/0x210
_raw_spin_lock+0x38/0x70
kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
vcpu_enter_guest+0x167e/0x1910 [kvm]
kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
do_vfs_ioctl+0xa5/0x690
ksys_ioctl+0x6d/0x80
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x83/0x6e0
entry_SYSCALL_64_after_hwframe+0x49/0xbeThe reason is that the testcase writes hyperv synic HV_X64_MSR_SINT6 msr
and triggers scan ioapic logic to load synic vectors into EOI exit bitmap.
However, irqchip is not initialized by this simple testcase, ioapic/apic
objects should not be accessed.
This can be triggered by the following program:#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#includeuint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};
int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
long res = 0;
memcpy((void*)0x20000040, "/dev/kvm", 9);
res = syscall(__NR_openat, 0xffffffffffffff9c, 0x20000040, 0, 0);
if (res != -1)
r[0] = res;
res = syscall(__NR_ioctl, r[0], 0xae01, 0);
if (res != -1)
r[1] = res;
res = syscall(__NR_ioctl, r[1], 0xae41, 0);
if (res != -1)
r[2] = res;
memcpy(
(void*)0x20000080,
"\x01\x00\x00\x00\x00\x5b\x61\xbb\x96\x00\x00\x40\x00\x00\x00\x00\x01\x00"
"\x08\x00\x00\x00\x00\x00\x0b\x77\xd1\x78\x4d\xd8\x3a\xed\xb1\x5c\x2e\x43"
"\xaa\x43\x39\xd6\xff\xf5\xf0\xa8\x98\xf2\x3e\x37\x29\x89\xde\x88\xc6\x33"
"\xfc\x2a\xdb\xb7\xe1\x4c\xac\x28\x61\x7b\x9c\xa9\xbc\x0d\xa0\x63\xfe\xfe"
"\xe8\x75\xde\xdd\x19\x38\xdc\x34\xf5\xec\x05\xfd\xeb\x5d\xed\x2e\xaf\x22"
"\xfa\xab\xb7\xe4\x42\x67\xd0\xaf\x06\x1c\x6a\x35\x67\x10\x55\xcb",
106);
syscall(__NR_ioctl, r[2], 0x4008ae89, 0x20000080);
syscall(__NR_ioctl, r[2], 0xae80, 0);
return 0;
}This patch fixes it by bailing out scan ioapic if ioapic is not initialized in
kernel.Reported-by: Wei Wu
Cc: Paolo Bonzini
Cc: Radim Krčmář
Cc: Wei Wu
Signed-off-by: Wanpeng Li
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit bcbfbd8ec21096027f1ee13ce6c185e8175166f6 upstream.
kvm_pv_clock_pairing() allocates local var
"struct kvm_clock_pairing clock_pairing" on stack and initializes
all it's fields besides padding (clock_pairing.pad[]).Because clock_pairing var is written completely (including padding)
to guest memory, failure to init struct padding results in kernel
info-leak.Fix the issue by making sure to also init the padding with zeroes.
Fixes: 55dd00a73a51 ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
Reported-by: syzbot+a8ef68d71211ba264f56@syzkaller.appspotmail.com
Reviewed-by: Mark Kanda
Signed-off-by: Liran Alon
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit fd65d3142f734bc4376053c8d75670041903134d upstream.
Previously, we only called indirect_branch_prediction_barrier on the
logical CPU that freed a vmcb. This function should be called on all
logical CPUs that last loaded the vmcb in question.Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
Reported-by: Neel Natu
Signed-off-by: Jim Mattson
Reviewed-by: Konrad Rzeszutek Wilk
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 0e0fee5c539b61fdd098332e0e2cc375d9073706 upstream.
When a guest page table is updated via an emulated write,
kvm_mmu_pte_write() is called to update the shadow PTE using the just
written guest PTE value. But if two emulated guest PTE writes happened
concurrently, it is possible that the guest PTE and the shadow PTE end
up being out of sync. Emulated writes do not mark the shadow page as
unsync-ed, so this inconsistency will not be resolved even by a guest TLB
flush (unless the page was marked as unsync-ed at some other point).This is fixed by re-reading the current value of the guest PTE after the
MMU lock has been acquired instead of just using the value that was
written prior to calling kvm_mmu_pte_write().Signed-off-by: Junaid Shahid
Reviewed-by: Wanpeng Li
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman -
commit 55a974021ec952ee460dc31ca08722158639de72 upstream
Provide the possibility to enable IBPB always in combination with 'prctl'
and 'seccomp'.Add the extra command line options and rework the IBPB selection to
evaluate the command instead of the mode selected by the STIPB switch case.Signed-off-by: Thomas Gleixner
Reviewed-by: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Jiri Kosina
Cc: Tom Lendacky
Cc: Josh Poimboeuf
Cc: Andrea Arcangeli
Cc: David Woodhouse
Cc: Tim Chen
Cc: Andi Kleen
Cc: Dave Hansen
Cc: Casey Schaufler
Cc: Asit Mallick
Cc: Arjan van de Ven
Cc: Jon Masters
Cc: Waiman Long
Cc: Greg KH
Cc: Dave Stewart
Cc: Kees Cook
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185006.144047038@linutronix.de
Signed-off-by: Greg Kroah-Hartman -
commit 6b3e64c237c072797a9ec918654a60e3a46488e2 upstream
If 'prctl' mode of user space protection from spectre v2 is selected
on the kernel command-line, STIBP and IBPB are applied on tasks which
restrict their indirect branch speculation via prctl.SECCOMP enables the SSBD mitigation for sandboxed tasks already, so it
makes sense to prevent spectre v2 user space to user space attacks as
well.The Intel mitigation guide documents how STIPB works:
Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor
prevents the predicted targets of indirect branches on any logical
processor of that core from being controlled by software that executes
(or executed previously) on another logical processor of the same core.Ergo setting STIBP protects the task itself from being attacked from a task
running on a different hyper-thread and protects the tasks running on
different hyper-threads from being attacked.While the document suggests that the branch predictors are shielded between
the logical processors, the observed performance regressions suggest that
STIBP simply disables the branch predictor more or less completely. Of
course the document wording is vague, but the fact that there is also no
requirement for issuing IBPB when STIBP is used points clearly in that
direction. The kernel still issues IBPB even when STIBP is used until Intel
clarifies the whole mechanism.IBPB is issued when the task switches out, so malicious sandbox code cannot
mistrain the branch predictor for the next user space task on the same
logical processor.Signed-off-by: Jiri Kosina
Signed-off-by: Thomas Gleixner
Reviewed-by: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Tom Lendacky
Cc: Josh Poimboeuf
Cc: Andrea Arcangeli
Cc: David Woodhouse
Cc: Tim Chen
Cc: Andi Kleen
Cc: Dave Hansen
Cc: Casey Schaufler
Cc: Asit Mallick
Cc: Arjan van de Ven
Cc: Jon Masters
Cc: Waiman Long
Cc: Greg KH
Cc: Dave Stewart
Cc: Kees Cook
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185006.051663132@linutronix.de
Signed-off-by: Greg Kroah-Hartman