23 Feb, 2017
1 commit
-
Move the x86_64 idle notifiers originally by Andi Kleen and Venkatesh
Pallipadi to generic.
Change-Id: Idf29cda15be151f494ff245933c12462643388d5
Acked-by: Nicolas Pitre
Signed-off-by: Todd Poynor
18 Feb, 2017
1 commit
-
commit dffba9a31c7769be3231c420d4b364c92ba3f1ac upstream.
The compacted-format XSAVES area is determined at boot time and
never changed after. The field xsave.header.xcomp_bv indicates
which components are in the fixed XSAVES format.
In fpstate_init() we did not set xcomp_bv to reflect the XSAVES
format since at the time there is no valid data.
However, after we do copy_init_fpstate_to_fpregs() in fpu__clear(),
as in commit:
b22cbe404a9c x86/fpu: Fix invalid FPU ptrace state after execve()
and when __fpu_restore_sig() does fpu__restore() for a COMPAT-mode
app, a #GP occurs. This can be easily triggered by doing valgrind on
a COMPAT-mode "Hello World," as reported by Joakim Tjernlund and
others:
https://bugzilla.kernel.org/show_bug.cgi?id=190061
Fix it by setting xcomp_bv correctly.
This patch also moves the xcomp_bv initialization to the proper
place, which was in copyin_to_xsaves() as of:
4c833368f0bf x86/fpu: Set the xcomp_bv when we fake up a XSAVES area
which fixed the bug too, but it's more efficient and cleaner to
initialize things once per boot, not for every signal handling
operation.
Reported-by: Kevin Hao
Reported-by: Joakim Tjernlund
Signed-off-by: Yu-cheng Yu
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Fenghua Yu
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Ravi V. Shankar
Cc: Thomas Gleixner
Cc: haokexin@gmail.com
Link: http://lkml.kernel.org/r/1485212084-4418-1-git-send-email-yu-cheng.yu@intel.com
[ Combined it with 4c833368f0bf. ]
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
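As a rough illustration of the xcomp_bv fix in commit dffba9a31c77, the sketch below computes the field once from a boot-time feature mask. It is a userspace model, not kernel code: the struct name, the init_xcomp_bv() helper and its parameters are hypothetical; only the use of bit 63 to mark the compacted format follows the XSAVE architecture.

```c
#include <stdint.h>

/* Bit 63 of XCOMP_BV marks the XSAVE area as using the compacted format. */
#define XCOMP_BV_COMPACTED_FORMAT (1ULL << 63)

/* Simplified stand-in for the 64-byte XSAVE header (hypothetical name). */
struct xsave_header_model {
    uint64_t xfeatures;   /* XSTATE_BV: components that hold valid data */
    uint64_t xcomp_bv;    /* compaction bit plus components in the layout */
    uint64_t reserved[6];
};

/*
 * Set xcomp_bv from the feature mask chosen at boot. When XSAVES is not
 * in use the area is in standard format and the field stays zero.
 */
static void init_xcomp_bv(struct xsave_header_model *hdr,
                          uint64_t xfeatures_mask, int use_xsaves)
{
    if (use_xsaves)
        hdr->xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | xfeatures_mask;
    else
        hdr->xcomp_bv = 0;
}
```

Because the compacted layout is fixed at boot, the same value is valid for every task, which is why doing this once at state initialization is enough.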
15 Feb, 2017
4 commits
-
commit 08b259631b5a1d912af4832847b5642f377d9101 upstream.
After:
a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology")
our SMT scheduling topology for Fam17h systems is broken, because
the ThreadId is included in the ApicId when SMT is enabled.
So, without further decoding cpu_core_id is unique for each thread
rather than the same for threads on the same core. This didn't affect
systems with SMT disabled. Make cpu_core_id be what it is defined to be.
Signed-off-by: Yazen Ghannam
Signed-off-by: Borislav Petkov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20170205105022.8705-2-bp@alien8.de
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
commit 79a8b9aa388b0620cc1d525d7c0f0d9a8a85e08e upstream.
Commit:
a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology")
restored the initial approach we had with the Fam15h topology of
enumerating CU (Compute Unit) threads as cores. And this is still
correct - they're beefier than HT threads but still have some
shared functionality.
Our current approach has a problem with the Mad Max Steam game, for
example. Yves Dionne reported a certain "choppiness" while playing on
v4.9.5.
That problem stems most likely from the fact that the CU threads share
resources within one CU and when we schedule to a thread of a different
compute unit, this incurs latency due to migrating the working set to a
different CU through the caches.
When the thread siblings mask mirrors that aspect of the CUs and
threads, the scheduler pays attention to it and tries to schedule within
one CU first. Which takes care of the latency, of course.
Reported-by: Yves Dionne
Signed-off-by: Borislav Petkov
Cc: Brice Goglin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Yazen Ghannam
Link: http://lkml.kernel.org/r/20170205105022.8705-1-bp@alien8.de
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
commit 146fbb766934dc003fcbf755b519acef683576bf upstream.
CONFIG_KASAN=y needs a lot of virtual memory mapped for its shadow.
In that case ptdump_walk_pgd_level_core() takes a lot of time to
walk across all page tables and doing this without
a rescheduling causes soft lockups:
NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [swapper/0:1]
...
Call Trace:
ptdump_walk_pgd_level_core+0x40c/0x550
ptdump_walk_pgd_level_checkwx+0x17/0x20
mark_rodata_ro+0x13b/0x150
kernel_init+0x2f/0x120
ret_from_fork+0x2c/0x40
I guess that this issue might arise even without KASAN on huge machines
with several terabytes of RAM.
Stick cond_resched() in pgd loop to fix this.
Reported-by: Tobias Regnery
Signed-off-by: Andrey Ryabinin
Cc: kasan-dev@googlegroups.com
Cc: Alexander Potapenko
Cc: "Paul E . McKenney"
Cc: Dmitry Vyukov
Link: http://lkml.kernel.org/r/20170210095405.31802-1-aryabinin@virtuozzo.com
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
-
commit d966564fcdc19e13eb6ba1fbe6b8101070339c3d upstream.
This reverts commit 020eb3daaba2857b32c4cf4c82f503d6a00a67de.
Gabriel C reports that it causes his machine to not boot, and we haven't
tracked down the reason for it yet. Since the bug it fixes has been
around for a longish time, we're better off reverting the fix for now.
Gabriel says:
"It hangs early and freezes with a lot RCU warnings.
I bisected it down to :
> Ruslan Ruslichenko (1):
> x86/ioapic: Restore IO-APIC irq_chip retrigger callback
Reverting this one fixes the problem for me..
The box is a PRIMERGY TX200 S5 , 2 socket , 2 x E5520 CPU(s) installed"
and Ruslan and Thomas are currently stumped.
Reported-and-bisected-by: Gabriel C
Cc: Ruslan Ruslichenko
Cc: Thomas Gleixner
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
09 Feb, 2017
4 commits
-
commit aaaec6fc755447a1d056765b11b24d8ff2b81366 upstream.
The recent commit which prevents double activation of interrupts unearthed
interesting code in x86. The code (ab)uses irq_domain_activate_irq() to
reconfigure an already activated interrupt. That trips over the prevention
code now.
Fix it by deactivating the interrupt before activating the new configuration.
Fixes: 08d85f3ea99f1 "irqdomain: Avoid activating interrupts more than once"
Reported-and-tested-by: Mike Galbraith
Reported-and-tested-by: Borislav Petkov
Signed-off-by: Thomas Gleixner
Cc: Andrey Ryabinin
Cc: Marc Zyngier
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311901580.3457@nanos
Signed-off-by: Greg Kroah-Hartman
-
commit 00c87e9a70a17b355b81c36adedf05e84f54e10d upstream.
Saving unsupported state prevents migration when the new host does not
support a XSAVE feature of the original host, even if the feature is not
exposed to the guest.
We've masked host features with guest-visible features before, with
4344ee981e21 ("KVM: x86: only copy XSAVE state for the supported
features") and dropped it when implementing XSAVES. Do it again.
Fixes: df1daba7d1cb ("KVM: x86: support XSAVES usage in the host")
Reviewed-by: Paolo Bonzini
Signed-off-by: Radim Krčmář
Signed-off-by: Greg Kroah-Hartman
-
commit 1aa6cfd33df492939b0be15ebdbcff1f8ae5ddb6 upstream.
The recent conversion to the hotplug state machine kept two mechanisms from
the original code:
1) The first_init logic which adds the number of online CPUs in a package
to the refcount. That's wrong because the callbacks are executed for
all online CPUs.
Remove it so the refcounting is correct.
2) The on_each_cpu() call to undo box->init() in the error handling
path. That's bogus because when the prepare callback fails no box has
been initialized yet.
Remove it.
Signed-off-by: Thomas Gleixner
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Sebastian Siewior
Cc: Stephane Eranian
Cc: Vince Weaver
Cc: Yasuaki Ishimatsu
Fixes: 1a246b9f58c6 ("perf/x86/intel/uncore: Convert to hotplug state machine")
Link: http://lkml.kernel.org/r/20170131230141.298032324@linutronix.de
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
commit bf29bddf0417a4783da3b24e8c9e017ac649326f upstream.
Commit:
129766708 ("x86/efi: Only map RAM into EFI page tables if in mixed-mode")
stopped creating 1:1 mappings for all RAM, when running in native 64-bit mode.
It turns out though that there are 64-bit EFI implementations in the wild
(this particular problem has been reported on a Lenovo Yoga 710-11IKB),
which still make use of the first physical page for their own private use,
even though they explicitly mark it EFI_CONVENTIONAL_MEMORY in the memory
map.
In case there is no mapping for this particular frame in the EFI pagetables,
as soon as firmware tries to make use of it, a triple fault occurs and the
system reboots (in case of the Yoga 710-11IKB this is very early during bootup).
Fix that by always mapping the first page of physical memory into the EFI
pagetables. We're free to hand this page to the BIOS, as trim_bios_range()
will reserve the first page and isolate it away from memory allocators anyway.
Note that just reverting 129766708 alone is not enough on v4.9-rc1+ to fix the
regression on affected hardware, as this commit:
ab72a27da ("x86/efi: Consolidate region mapping logic")
later made the first physical frame not to be mapped anyway.
Reported-by: Hanka Pavlikova
Signed-off-by: Jiri Kosina
Signed-off-by: Matt Fleming
Cc: Ard Biesheuvel
Cc: Borislav Petkov
Cc: Laura Abbott
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Vojtech Pavlik
Cc: Waiman Long
Cc: linux-efi@vger.kernel.org
Fixes: 129766708 ("x86/efi: Only map RAM into EFI page tables if in mixed-mode")
Link: http://lkml.kernel.org/r/20170127222552.22336-1-matt@codeblueprint.co.uk
[ Tidied up the changelog and the comment. ]
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
01 Feb, 2017
1 commit
-
commit 63d762b88cb5510f2bfdb5112ced18cde867ae61 upstream.
There is an off-by-one error so we don't unregister priv->pdev_mux[0].
Also it's slightly simpler as a while loop instead of a for loop.
Fixes: 58cbbee2391c ("x86/platform/mellanox: Introduce support for Mellanox systems platform")
Signed-off-by: Dan Carpenter
Acked-by: Vadim Pasternak
Signed-off-by: Andy Shevchenko
Signed-off-by: Greg Kroah-Hartman
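The off-by-one described in commit 63d762b88cb5 can be reproduced in isolation. The sketch below is a userspace model under stated assumptions: the function names and the released[] array are hypothetical stand-ins for unregistering priv->pdev_mux[] entries, and the two loop shapes are illustrative rather than copied from the driver.

```c
/* Unwind loop with the off-by-one: the condition i > 0 never visits index 0. */
static int cleanup_for_loop(int released[], int n)
{
    int count = 0;

    for (int i = n - 1; i > 0; i--) {
        released[i] = 1;
        count++;
    }
    return count;
}

/* Unwind loop as a while loop: pre-decrement, then test >= 0, so index 0 is included. */
static int cleanup_while_loop(int released[], int n)
{
    int count = 0;
    int i = n;

    while (--i >= 0) {
        released[i] = 1;
        count++;
    }
    return count;
}
```

With four registered entries, the first loop releases three of them and leaves slot 0 untouched, while the second releases all four.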
26 Jan, 2017
3 commits
-
commit ae7871be189cb41184f1e05742b4a99e2c59774d upstream.
Convert the flag swiotlb_force from an int to an enum, to prepare for
the advent of more possible values.
Suggested-by: Konrad Rzeszutek Wilk
Signed-off-by: Geert Uytterhoeven
Signed-off-by: Konrad Rzeszutek Wilk
Signed-off-by: Greg Kroah-Hartman
-
commit 020eb3daaba2857b32c4cf4c82f503d6a00a67de upstream.
commit d32932d02e18 removed the irq_retrigger callback from the IO-APIC
chip and did not add it to the new IO-APIC-IR irq chip.
Unfortunately the software resend fallback is not enabled on X86, so edge
interrupts which are received during the lazy disabled state of the
interrupt line are not retriggered and therefore lost.
Restore the callbacks.
[ tglx: Massaged changelog ]
Fixes: d32932d02e18 ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces")
Signed-off-by: Ruslan Ruslichenko
Cc: xe-linux-external@cisco.com
Link: http://lkml.kernel.org/r/1484662432-13580-1-git-send-email-rruslich@cisco.com
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
-
commit 89e9f7bcd8744ea25fcf0ac671b8d72c10d7d790 upstream.
Martin reported that the Supermicro X8DTH-i/6/iF/6F advertises incorrect
host bridge windows via _CRS:
pci_root PNP0A08:00: host bridge window [io 0xf000-0xffff]
pci_root PNP0A08:01: host bridge window [io 0xf000-0xffff]
Both bridges advertise the 0xf000-0xffff window, which cannot be correct.
Work around this by ignoring _CRS on this system. The downside is that we
may not assign resources correctly to hot-added PCI devices (if they are
possible on this system).
Link: https://bugzilla.kernel.org/show_bug.cgi?id=42606
Reported-by: Martin Burnicki
Signed-off-by: Bjorn Helgaas
Signed-off-by: Greg Kroah-Hartman
20 Jan, 2017
13 commits
-
commit dd853fd216d1485ed3045ff772079cc8689a9a4a upstream.
A negative number can be specified in the cmdline which will be used as
setup_clear_cpu_cap() argument. With that we can clear/set some bit in
memory preceding boot_cpu_data/cpu_caps_cleared which may cause kernel
to misbehave. This patch adds lower bound check to setup_disablecpuid().
Boris Petkov reproduced a crash:
[ 1.234575] BUG: unable to handle kernel paging request at ffffffff858bd540
[ 1.236535] IP: memcpy_erms+0x6/0x10
Signed-off-by: Lukasz Odzioba
Acked-by: Borislav Petkov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: andi.kleen@intel.com
Cc: bp@alien8.de
Cc: dave.hansen@linux.intel.com
Cc: luto@kernel.org
Cc: slaoub@gmail.com
Fixes: ac72e7888a61 ("x86: add generic clearcpuid=... option")
Link: http://lkml.kernel.org/r/1482933340-11857-1-git-send-email-lukasz.odzioba@intel.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
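The lower-bound check added for clearcpuid= in commit dd853fd216d1 can be sketched as follows. This is a minimal userspace model, not the kernel function: NCAPINTS_MODEL, caps_cleared_model and disable_cpuid_bit() are hypothetical names, and the array size is an arbitrary assumption.

```c
#include <stdint.h>

#define NCAPINTS_MODEL 4                    /* assumed number of 32-bit capability words */
#define NBITS_MODEL (NCAPINTS_MODEL * 32)   /* total number of addressable feature bits */

static uint32_t caps_cleared_model[NCAPINTS_MODEL];

/*
 * Record a feature bit as cleared. Returns 1 on success and 0 when the
 * bit is out of range. Without the bit < 0 test, a negative value from
 * the command line would index memory before the array.
 */
static int disable_cpuid_bit(int bit)
{
    if (bit < 0 || bit >= NBITS_MODEL)
        return 0;
    caps_cleared_model[bit / 32] |= 1u << (bit % 32);
    return 1;
}
```

Checking only the upper bound is not enough for a signed argument, which is the gap the commit closes.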
commit a33d331761bc5dd330499ca5ceceb67f0640a8e6 upstream.
The following commit:
8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id")
... broke the initial strategy for Bulldozer-based cores' topology,
where we consider each thread of a compute unit a standalone core
and not a HT or SMT thread.
Revert to the firmware-supplied core_id numbering and do not make
them thread siblings as we don't consider them for such even if they
technically are, more or less.
Reported-and-tested-by: Brice Goglin
Tested-by: Yazen Ghannam
Signed-off-by: Borislav Petkov
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Josh Poimboeuf
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: 8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id")
Link: http://lkml.kernel.org/r/20170105092638.5247-1-bp@alien8.de
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
commit 3344ed30791af66dbbad5f375008f3d1863b6c99 upstream.
The workaround for the AMD Erratum E400 (Local APIC timer stops in C1E
state) is a two step process:
- Selection of the E400 aware idle routine
- Detection whether the platform is affected
The idle routine selection happens for possibly affected CPUs depending on
family/model/stepping information. This range of CPUs is not necessarily
affected as the decision whether to enable the C1E feature is made by the
firmware. Unfortunately there is no way to query this at early boot.
The current implementation polls a MSR in the E400 aware idle routine to
detect whether the CPU is affected. This is inefficient on non affected
CPUs because every idle entry has to do the MSR read.
There is a better way to detect this before going idle for the first time
which requires separating the bug flags:
X86_BUG_AMD_E400 - Selects the E400 aware idle routine and
enables the detection
X86_BUG_AMD_APIC_C1E - Set when the platform is affected by E400
Replace the current X86_BUG_AMD_APIC_C1E usage by the new X86_BUG_AMD_E400
bug bit to select the idle routine which currently does an unconditional
detection poll. X86_BUG_AMD_APIC_C1E is going to be used in later patches
to remove the MSR polling and simplify the handling of this misfeature.
Signed-off-by: Thomas Gleixner
Signed-off-by: Borislav Petkov
Cc: Jiri Olsa
Link: http://lkml.kernel.org/r/20161209182912.2726-3-bp@alien8.de
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
-
commit b6a50cddbcbda7105355898ead18f1a647c22520 upstream.
These changes do not affect current hw - just a cleanup:
Currently, we assume that a system has a single Last Level Cache (LLC)
per node, and that the cpu_llc_id is thus equal to the node_id. This no
longer applies since Fam17h can have multiple last level caches within a
node.
So group the cpu_llc_id assignment by topology feature and family in
order to make the computation of cpu_llc_id on the different families
more clear.
Here is how the LLC ID is being computed on the different families:
The NODEID_MSR feature only applies to Fam10h in which case the LLC is
at the node level.
The TOPOEXT feature is used on families 15h, 16h and 17h. So far we only
see multiple last level caches if L3 caches are available. Otherwise,
the cpu_llc_id will default to be the phys_proc_id.
We have L3 caches only on families 15h and 17h:
- on Fam15h, the LLC is at the node level.
- on Fam17h, the LLC is at the core complex level and can be found by
right shifting the APIC ID. Also, keep the family checks explicit so that
new families will fall back to the default, which will be node_id for
TOPOEXT systems.
Single node systems in families 10h and 15h will have a Node ID of 0
which will be the same as the phys_proc_id, so we don't need to check
for multiple nodes before using the node_id.
Tested-by: Borislav Petkov
Signed-off-by: Yazen Ghannam
[ Rewrote the commit message. ]
Signed-off-by: Borislav Petkov
Acked-by: Thomas Gleixner
Cc: Aravind Gopalakrishnan
Cc: Linus Torvalds
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/20161108153054.bs3sajbyevq6a6uu@pd.tnic
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
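For the Fam17h case in commit b6a50cddbcbd, "right shifting the APIC ID" can be pictured with the small sketch below. The shift width of 3 is an assumption made for illustration (the commit message does not state it), and llc_id_from_apicid() is a hypothetical helper name.

```c
#include <stdint.h>

/* Assumed for illustration: the low 3 APIC ID bits vary within one core complex. */
#define CCX_APICID_SHIFT 3

/* All logical CPUs whose APIC IDs differ only in the low bits share one LLC ID. */
static uint32_t llc_id_from_apicid(uint32_t apicid)
{
    return apicid >> CCX_APICID_SHIFT;
}
```

Under that assumption, APIC IDs 0 through 7 map to LLC ID 0 and APIC ID 8 starts LLC ID 1, so several last level caches can exist within one node.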
commit 20b1e22d01a4b0b11d3a1066e9feb04be38607ec upstream.
With the following commit:
4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
... efi_bgrt_init() calls into the memblock allocator through
efi_mem_reserve() => efi_arch_mem_reserve() *after* mm_init() has been called.
Indeed, KASAN reports a bad read access later on in efi_free_boot_services():
BUG: KASAN: use-after-free in efi_free_boot_services+0xae/0x24c
at addr ffff88022de12740
Read of size 4 by task swapper/0/0
page:ffffea0008b78480 count:0 mapcount:-127
mapping: (null) index:0x1 flags: 0x5fff8000000000()
[...]
Call Trace:
dump_stack+0x68/0x9f
kasan_report_error+0x4c8/0x500
kasan_report+0x58/0x60
__asan_load4+0x61/0x80
efi_free_boot_services+0xae/0x24c
start_kernel+0x527/0x562
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0x157/0x17a
start_cpu+0x5/0x14
The instruction at the given address is the first read from the memmap's
memory, i.e. the read of md->type in efi_free_boot_services().
Note that the writes earlier in efi_arch_mem_reserve() don't splat because
they're done through early_memremap()ed addresses.
So, after memblock is gone, allocations should be done through the "normal"
page allocator. Introduce a helper, efi_memmap_alloc() for this. Use
it from efi_arch_mem_reserve(), efi_free_boot_services() and, for the sake
of consistency, from efi_fake_memmap() as well.
Note that for the latter, the memmap allocations cease to be page aligned.
This isn't needed though.
Tested-by: Dan Williams
Signed-off-by: Nicolai Stange
Reviewed-by: Ard Biesheuvel
Cc: Dave Young
Cc: Linus Torvalds
Cc: Matt Fleming
Cc: Mika Penttilä
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-efi@vger.kernel.org
Fixes: 4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
Link: http://lkml.kernel.org/r/20170105125130.2815-1-nicstange@gmail.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
commit 0100a3e67a9cef64d72cd3a1da86f3ddbee50363 upstream.
Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
(2.28), include memory map entries with phys_addr=0x0 and num_pages=0.
These machines fail to boot after the following commit,
commit 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")
Fix this by removing such bogus entries from the memory map.
Furthermore, currently the log output for this case (with efi=debug)
looks like:
[ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)
This is clearly wrong, and also not as informative as it could be. This
patch changes it so that if we find obviously invalid memory map
entries, we print an error and skip those entries. It also detects the
display of the address range calculation overflow, so the new output is:
[ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
[ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0x0000000000000000] (invalid)
It also detects memory map sizes that would overflow the physical
address, for example phys_addr=0xfffffffffffff000 and
num_pages=0x0200000000000001, and prints:
[ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
[ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)
It then removes these entries from the memory map.
Signed-off-by: Peter Jones
Signed-off-by: Ard Biesheuvel
[ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
Signed-off-by: Matt Fleming
[Matt: Include bugzilla info in commit log]
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
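The two kinds of bogus entry named in commit 0100a3e67a9c (zero pages, and a size that overflows the physical address) can be checked with a few comparisons. The sketch below is a userspace model under stated assumptions: struct efi_md_model, md_is_valid() and the *_MODEL macros are hypothetical names, and the check is deliberately conservative rather than a copy of the kernel's logic.

```c
#include <stdint.h>

#define EFI_PAGE_SHIFT_MODEL 12                              /* EFI pages are 4 KiB */
#define EFI_PAGES_MAX_MODEL (UINT64_MAX >> EFI_PAGE_SHIFT_MODEL)

/* Only the two descriptor fields the check needs (hypothetical struct). */
struct efi_md_model {
    uint64_t phys_addr;
    uint64_t num_pages;
};

/*
 * Returns 1 for a usable entry, 0 for an entry that is empty or whose
 * page count cannot be placed above phys_addr within 64 bits.
 */
static int md_is_valid(const struct efi_md_model *md)
{
    if (md->num_pages == 0)
        return 0;
    if (md->num_pages > EFI_PAGES_MAX_MODEL)
        return 0;
    /* Compare in units of pages so that no intermediate value can wrap. */
    if (EFI_PAGES_MAX_MODEL - md->num_pages < (md->phys_addr >> EFI_PAGE_SHIFT_MODEL))
        return 0;
    return 1;
}
```

Both examples from the commit message fail this check: phys_addr=0x0 with num_pages=0, and phys_addr=0xfffffffffffff000 with num_pages=0x0200000000000001.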
commit 129a72a0d3c8e139a04512325384fe5ac119e74d upstream.
Introduces segmented_write_std.
Switches from emulated reads/writes to standard read/writes in fxsave,
fxrstor, sgdt, and sidt. This fixes CVE-2017-2584, a longstanding
kernel memory leak.
Since commit 283c95d0e389 ("KVM: x86: emulate FXSAVE and FXRSTOR",
2016-11-09), which is luckily not yet in any final release, this would
also be an exploitable kernel memory *write*!
Reported-by: Dmitry Vyukov
Fixes: 96051572c819194c37a8367624b285be10297eca
Fixes: 283c95d0e3891b64087706b344a4b545d04a6e62
Suggested-by: Paolo Bonzini
Signed-off-by: Steve Rutherford
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
-
commit 283c95d0e3891b64087706b344a4b545d04a6e62 upstream.
Internal errors were reported on 16 bit fxsave and fxrstor with ipxe.
Old Intels don't have unrestricted_guest, so we have to emulate them.
The patch takes advantage of the hardware implementation.
AMD and Intel differ in saving and restoring other fields in first 32
bytes. A test wrote 0xff to the fxsave area, 0 to upper bits of MXCSR
in the fxsave area, executed fxrstor, rewrote the fxsave area to 0xee,
and executed fxsave:
Intel (Nehalem):
7f 1f 7f 7f ff 00 ff 07 ff ff ff ff ff ff 00 00
ff ff ff ff ff ff 00 00 ff ff 00 00 ff ff 00 00
Intel (Haswell -- deprecated FPU CS and FPU DS):
7f 1f 7f 7f ff 00 ff 07 ff ff ff ff 00 00 00 00
ff ff ff ff 00 00 00 00 ff ff 00 00 ff ff 00 00
AMD (Opteron 2300-series):
7f 1f 7f 7f ff 00 ee ee ee ee ee ee ee ee ee ee
ee ee ee ee ee ee ee ee ff ff 00 00 ff ff 02 00
fxsave/fxrstor will only be emulated on early Intels, so KVM can't do
much to improve the situation.
Signed-off-by: Radim Krčmář
Signed-off-by: Greg Kroah-Hartman
-
commit aabba3c6abd50b05b1fc2c6ec44244aa6bcda576 upstream.
Move the existing exception handling for inline assembly into a macro
and switch its return values to X86EMUL type.
Signed-off-by: Radim Krčmář
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
-
commit d3fe959f81024072068e9ed86b39c2acfd7462a9 upstream.
Needed for FXSAVE and FXRSTOR.
Signed-off-by: Radim Krčmář
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
-
commit 546d87e5c903a7f3ee7b9f998949a94729fbc65b upstream.
Reported by syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
IP: _raw_spin_lock+0xc/0x30
PGD 3e28eb067
PUD 3f0ac6067
PMD 0
Oops: 0002 [#1] SMP
CPU: 0 PID: 2431 Comm: test Tainted: G OE 4.10.0-rc1+ #3
Call Trace:
? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
? pick_next_task_fair+0xe1/0x4e0
? kvm_arch_vcpu_load+0xea/0x260 [kvm]
kvm_vcpu_ioctl+0x33a/0x600 [kvm]
? hrtimer_try_to_cancel+0x29/0x130
? do_nanosleep+0x97/0xf0
do_vfs_ioctl+0xa1/0x5d0
? __hrtimer_init+0x90/0x90
? do_nanosleep+0x5b/0xf0
SyS_ioctl+0x79/0x90
do_syscall_64+0x6e/0x180
entry_SYSCALL64_slow_path+0x25/0x25
RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0
The syzkaller folks reported a NULL pointer dereference due to
ENABLE_CAP succeeding even without an irqchip. The Hyper-V
synthetic interrupt controller is activated, resulting in a
wrong request to rescan the ioapic and a NULL pointer dereference.
#include <fcntl.h>
#include <linux/kvm.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#ifndef KVM_CAP_HYPERV_SYNIC
#define KVM_CAP_HYPERV_SYNIC 123
#endif
/* Enables the Hyper-V SynIC capability on the vCPU fd passed in arg. */
void* thr(void* arg)
{
    struct kvm_enable_cap cap;
    cap.flags = 0;
    cap.cap = KVM_CAP_HYPERV_SYNIC;
    ioctl((long)arg, KVM_ENABLE_CAP, &cap);
    return 0;
}
int main()
{
    unsigned char *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
                                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    int kvmfd = open("/dev/kvm", 0);
    int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
    struct kvm_userspace_memory_region memreg;
    memreg.slot = 0;
    memreg.flags = 0;
    memreg.guest_phys_addr = 0;
    memreg.memory_size = 0x1000;
    memreg.userspace_addr = (unsigned long)host_mem;
    host_mem[0] = 0xf4; /* hlt */
    ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
    int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
    struct kvm_sregs sregs;
    ioctl(cpufd, KVM_GET_SREGS, &sregs);
    sregs.cr0 = 0;
    sregs.cr4 = 0;
    sregs.efer = 0;
    sregs.cs.selector = 0;
    sregs.cs.base = 0;
    ioctl(cpufd, KVM_SET_SREGS, &sregs);
    struct kvm_regs regs = { .rflags = 2 };
    ioctl(cpufd, KVM_SET_REGS, &regs);
    ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
    /* Run ENABLE_CAP in a second thread while this thread enters KVM_RUN. */
    pthread_t th;
    pthread_create(&th, 0, thr, (void*)(long)cpufd);
    usleep(rand() % 10000);
    ioctl(cpufd, KVM_RUN, 0);
    pthread_join(th, 0);
    return 0;
}
This patch fixes it by failing ENABLE_CAP if there is no irqchip.
Reported-by: Dmitry Vyukov
Fixes: 5c919412fe61 (kvm/x86: Hyper-V synthetic interrupt controller)
Cc: Paolo Bonzini
Cc: Radim Krčmář
Cc: Dmitry Vyukov
Signed-off-by: Wanpeng Li
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
-
commit cef84c302fe051744b983a92764d3fcca933415d upstream.
KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
These are implemented with delayed_work structs which can still be
pending when the KVM module is unloaded. We've seen this cause kernel
panics when the kvm_intel module is quickly reloaded.
Use the new static_key_deferred_flush() API to flush pending updates on
module unload.
Signed-off-by: David Matlack
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
-
commit 33ab91103b3415e12457e3104f0e4517ce12d0f3 upstream.
This is CVE-2017-2583. On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.
The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.
Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.
Reported-by: Xiaohan Zhang
Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
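The selector rule stated in the CVE-2017-2583 fix (commit 33ab91103b34) can be expressed as a pair of bit tests. This is a simplified sketch, not the emulator's code: the helper names are hypothetical, and it models only the "null selector with RPL < 3" condition from the commit message, not the full segment-load checks.

```c
#include <stdint.h>

/* The requested privilege level (RPL) is the low two bits of a selector. */
static unsigned selector_rpl(uint16_t sel)
{
    return sel & 0x3u;
}

/* A selector is null when its index and table-indicator bits are all zero. */
static int selector_is_null(uint16_t sel)
{
    return (sel & ~0x3u) == 0;
}

/* Loading a null selector into SS is accepted only when its RPL is below 3. */
static int null_ss_load_allowed(uint16_t sel)
{
    return selector_is_null(sel) && selector_rpl(sel) < 3;
}
```

Selectors 0, 1 and 2 pass the check; selector 3 is null but carries RPL 3 and is rejected.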
15 Jan, 2017
1 commit
-
[ Upstream commit 9d5ecb09d525469abd1a10c096cb5a17206523f2 ]
If after too many passes still no image could be emitted, then
swap back to the original program as we do in all other cases
and don't use the one with blinding.
Fixes: 959a75791603 ("bpf, x86: add support for constant blinding")
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
12 Jan, 2017
3 commits
-
commit 3df8d9208569ef0b2313e516566222d745f3b94b upstream.
A typo (or mis-merge?) resulted in leaf 6 only being probed if
cpuid_level >= 7.
Fixes: 2ccd71f1b278 ("x86/cpufeature: Move some of the scattered feature bits to x86_capability")
Signed-off-by: Andy Lutomirski
Acked-by: Borislav Petkov
Cc: Brian Gerst
Link: http://lkml.kernel.org/r/6ea30c0e9daec21e488b54761881a6dfcf3e04d0.1481825597.git.luto@kernel.org
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
-
commit a01aa6c9f40fe03c82032e7f8b3bcf1e6c93ac0e upstream.
Userspace knows nothing about the kernel config, so #ifdefs
around ABI prctl constants make them invisible to userspace.
Let it be clean'n'simple: remove #ifdefs.
If kernel has CONFIG_CHECKPOINT_RESTORE disabled, sys_prctl()
will return -EINVAL for those prctls.
Reported-by: Paul Bolle
Signed-off-by: Dmitry Safonov
Acked-by: Andy Lutomirski
Cc: 0x7f454c46@gmail.com
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Cyrill Gorcunov
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Josh Poimboeuf
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-mm@kvack.org
Cc: oleg@redhat.com
Fixes: 2eefd8789698 ("x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*")
Link: http://lkml.kernel.org/r/20161027141516.28447-2-dsafonov@virtuozzo.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
commit 6ef4e07ecd2db21025c446327ecf34414366498b upstream.
Otherwise, mismatch between the smm bit in hflags and the MMU role
can cause a NULL pointer dereference.
Signed-off-by: Xiao Guangrong
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
09 Jan, 2017
5 commits
-
commit 9d85eb9119f4eeeb48e87adfcd71f752655700e9 upstream.
The logical package management has several issues:
- The APIC ids provided by ACPI are not required to be the same as the
initial APIC id which can be retrieved by CPUID. The APIC ids provided
by ACPI are those which are written by the BIOS into the APIC. The
initial id is set by hardware and can not be changed. The hardware
provided ids contain the real hardware package information.
Especially AMD sets the effective APIC id different from the hardware id
as they need to reserve space for the IOAPIC ids starting at id 0.
As a consequence those machines trigger the currently active firmware
bug printouts in dmesg. These are obviously wrong.
- Virtual machines have their own interesting way of enumerating APICs and
packages which are not reliably covered by the current implementation.
The sizing of the mapping array has been tweaked to be generously large to
handle systems which provide a wrong core count when HT is disabled so the
whole magic which checks for space in the physical hotplug case is not
needed anymore.
Simplify the whole machinery and do the mapping when the CPU starts and the
CPUID derived physical package information is available. This solves the
observed problems on AMD machines and works for the virtualization issues
as well.
Remove the extra call from XEN cpu bringup code as it is no longer
required.
Fixes: d49597fd3bc7 ("x86/cpu: Deal with broken firmware (VMWare/XEN)")
Reported-and-tested-by: Borislav Petkov
Tested-by: Boris Ostrovsky
Signed-off-by: Thomas Gleixner
Cc: Juergen Gross
Cc: Peter Zijlstra
Cc: M. Vefa Bicakci
Cc: xen-devel
Cc: Charles (Chas) Williams
Cc: Borislav Petkov
Cc: Alok Kataria
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1612121102260.3429@nanos
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
-
commit 847fa1a6d3d00f3bdf68ef5fa4a786f644a0dd67 upstream.
With new binutils, gcc may get smart with its optimization and change a jmp
from a 5 byte jump to a 2 byte one even though it was jumping to a global
function. But that global function existed within a 2 byte radius, and gcc
was able to optimize it. Unfortunately, that jump was also being modified
when function graph tracing begins. Since ftrace expected that jump to be 5
bytes, but it was only two, it overwrote code after the jump, causing a
crash.
This was fixed for x86_64 with commit 8329e818f149, with the same subject as
this commit, but nothing was done for x86_32.
Fixes: d61f82d06672 ("ftrace: use dynamic patching for updating mcount calls")
Reported-by: Colin Ian King
Tested-by: Colin Ian King
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
-
commit ef85b67385436ddc1998f45f1d6a210f935b3388 upstream.
When L2 exits to L0 due to "exception or NMI", software exceptions
(#BP and #OF) for which L1 has requested an intercept should be
handled by L1 rather than L0. Previously, only hardware exceptions
were forwarded to L1.
Signed-off-by: Jim Mattson
Signed-off-by: Paolo Bonzini
Signed-off-by: Greg Kroah-Hartman
-
commit 834fcd298003c10ce450e66960c78893cb1cc4b5 upstream.
If the pmu registration fails the registered hotplug callbacks are not
removed. Wrong in any case, but fatal in case of a modular driver.
Replace the nonsensical state names with proper ones while at it.
Fixes: 77c34ef1c319 ("perf/x86/intel/cstate: Convert Intel CSTATE to hotplug state machine")
Signed-off-by: Thomas Gleixner
Cc: Sebastian Siewior
Cc: Peter Zijlstra
Signed-off-by: Greg Kroah-Hartman
-
commit b0c1ef52959582144bbea9a2b37db7f4c9e399f7 upstream.
An earlier patch allowed enabling PT and LBR at the same
time on Goldmont. However it also allowed enabling BTS and LBR
at the same time, which is still not supported. Fix this by
bypassing the check only for PT.
Signed-off-by: Andi Kleen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: alexander.shishkin@intel.com
Cc: kan.liang@intel.com
Fixes: ccbebba4c6bf ("perf/x86/intel/pt: Bypass PT vs. LBR exclusivity if the core supports it")
Link: http://lkml.kernel.org/r/20161209001417.4713-1-andi@firstfloor.org
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
06 Jan, 2017
1 commit
-
commit 334bb773876403eae3457d81be0b8ea70f8e4ccc upstream.
Commit 4efca4ed ("kbuild: modversions for EXPORT_SYMBOL() for asm") adds
modversion support for symbols exported from asm files. Architectures
must include C-style declarations for those symbols in asm/asm-prototypes.h
in order for them to be versioned.
Add these declarations for x86, and an architecture-independent file that
can be used for common symbols.
With f27c2f6 reverting 8ab2ae6 ("default exported asm symbols to zero") we
produce a scary warning on x86, this commit fixes that.
Signed-off-by: Adam Borowski
Tested-by: Kalle Valo
Acked-by: Nicholas Piggin
Tested-by: Peter Wu
Tested-by: Oliver Hartkopp
Signed-off-by: Michal Marek
Signed-off-by: Greg Kroah-Hartman
08 Dec, 2016
1 commit
-
Pull x86 fixes from Ingo Molnar:
"Misc fixes: a core dumping crash fix, a guess-unwinder regression fix,
plus three build warning fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind: Fix guess-unwinder regression
x86/build: Annotate die() with noreturn to fix build warning on clang
x86/platform/olpc: Fix resume handler build warning
x86/apic/uv: Silence a shift wrapping warning
x86/coredump: Always use user_regs_struct for compat_elf_gregset_t
06 Dec, 2016
2 commits
-
Lukasz reported that perf stat counters overflow handling is broken on KNL/SLM.
Both these parts have full_width_write set, and that does indeed have
a problem. In order to deal with counter wrap, we must sample the
counter at at least half the counter period (see also the sampling
theorem) such that we can unambiguously reconstruct the count.
However commit:
069e0c3c4058 ("perf/x86/intel: Support full width counting")
sets the sampling interval to the full period, not half.
Fixing that exposes another issue, in that we must not sign extend the
delta value when we shift it right; the counter cannot have
decremented after all.
With both these issues fixed, counter overflow functions correctly
again.
Reported-by: Lukasz Odzioba
Tested-by: Liang, Kan
Tested-by: Odzioba, Lukasz
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: stable@vger.kernel.org
Fixes: 069e0c3c4058 ("perf/x86/intel: Support full width counting")
Signed-off-by: Ingo Molnar
-
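The two points of the full-width counter fix (sample at half the period, and do not sign extend the delta) can be illustrated in userspace. This is a sketch under stated assumptions: the function names are hypothetical, the counter width is passed as a parameter between 1 and 64, and the code models the arithmetic only, not the perf driver.

```c
#include <stdint.h>

/* Largest sampling period that still lets a wrap be detected unambiguously. */
static uint64_t max_sample_period(unsigned width)
{
    uint64_t mask = (width == 64) ? UINT64_MAX : ((1ULL << width) - 1);

    return mask >> 1;
}

/* Delta between two raw reads of a counter that is `width` bits wide. */
static uint64_t counter_delta(uint64_t prev, uint64_t now, unsigned width)
{
    unsigned shift = 64 - width;
    /* Move the counter's top bit to bit 63 so the subtraction wraps correctly. */
    uint64_t delta = (now << shift) - (prev << shift);

    /* Unsigned right shift: the result is never sign extended. */
    return delta >> shift;
}

/* Same arithmetic with a signed intermediate, shown only for contrast. */
static int64_t counter_delta_signed(uint64_t prev, uint64_t now, unsigned width)
{
    unsigned shift = 64 - width;
    int64_t delta = (int64_t)((now << shift) - (prev << shift));

    /* Right-shifting a negative value is implementation-defined in C;
     * common compilers shift arithmetically and keep the sign. */
    return delta >> shift;
}
```

For a 48-bit counter that advanced by 2^47 between reads, counter_delta() returns 2^47, while the signed variant yields a negative value on common compilers, even though the hardware counter only counts upward.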
The Knights Mill is close enough to Knights Landing so the patch reuses
C-state residency support of the latter.
Signed-off-by: Piotr Luc
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/20161201000853.18260-1-piotr.luc@intel.com
Signed-off-by: Ingo Molnar