23 Feb, 2017

1 commit


18 Feb, 2017

1 commit

  • commit dffba9a31c7769be3231c420d4b364c92ba3f1ac upstream.

    The compacted-format XSAVES area is determined at boot time and
    never changed after. The field xsave.header.xcomp_bv indicates
    which components are in the fixed XSAVES format.

    In fpstate_init() we did not set xcomp_bv to reflect the XSAVES
    format since at the time there is no valid data.

    However, after we do copy_init_fpstate_to_fpregs() in fpu__clear(),
    as in commit:

    b22cbe404a9c x86/fpu: Fix invalid FPU ptrace state after execve()

    and when __fpu_restore_sig() does fpu__restore() for a COMPAT-mode
    app, a #GP occurs. This can be easily triggered by doing valgrind on
    a COMPAT-mode "Hello World," as reported by Joakim Tjernlund and
    others:

    https://bugzilla.kernel.org/show_bug.cgi?id=190061

    Fix it by setting xcomp_bv correctly.

    This patch also moves the xcomp_bv initialization to the proper
    place, which was in copyin_to_xsaves() as of:

    4c833368f0bf x86/fpu: Set the xcomp_bv when we fake up a XSAVES area

    which fixed the bug too, but it's more efficient and cleaner to
    initialize things once per boot, not for every signal handling
    operation.
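    The shape of the fix can be sketched in userspace (illustrative only —
    the macro value is real x86 ABI, but the feature mask and function are
    stand-ins, not the kernel's actual code):

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Bit 63 of xcomp_bv selects the compacted XSAVES format. */
    #define XCOMP_BV_COMPACTED_FORMAT (1ULL << 63)

    /* Hypothetical stand-in for the boot-time feature mask (x87|SSE|AVX). */
    static const uint64_t xfeatures_mask = 0x7;

    /* Mirror of the fix: compute xcomp_bv once at init time, so every
     * fpstate handed to XRSTORS carries a valid compacted-format header. */
    static uint64_t init_xcomp_bv(void)
    {
        return XCOMP_BV_COMPACTED_FORMAT | xfeatures_mask;
    }

    int main(void)
    {
        uint64_t bv = init_xcomp_bv();
        assert(bv >> 63);            /* compacted-format bit set */
        assert((bv & 0x7) == 0x7);   /* feature components present */
        printf("xcomp_bv = 0x%016llx\n", (unsigned long long)bv);
        return 0;
    }
    ```

    Doing this once per boot, rather than per signal delivery, is exactly the
    efficiency argument made above.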

    Reported-by: Kevin Hao
    Reported-by: Joakim Tjernlund
    Signed-off-by: Yu-cheng Yu
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Ravi V. Shankar
    Cc: Thomas Gleixner
    Cc: haokexin@gmail.com
    Link: http://lkml.kernel.org/r/1485212084-4418-1-git-send-email-yu-cheng.yu@intel.com
    [ Combined it with 4c833368f0bf. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Yu-cheng Yu
     

15 Feb, 2017

4 commits

  • commit 08b259631b5a1d912af4832847b5642f377d9101 upstream.

    After:

    a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology")

    our SMT scheduling topology for Fam17h systems is broken, because
    the ThreadId is included in the ApicId when SMT is enabled.

    So, without further decoding, cpu_core_id is unique for each thread
    rather than the same for threads on the same core. This didn't affect
    systems with SMT disabled. Make cpu_core_id be what it is defined to be.
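    A minimal sketch of the decoding idea (an assumption-laden illustration,
    not the kernel's code): with SMT enabled on Fam17h, the ThreadId occupies
    the low bit of the ApicId, so shifting it out recovers a per-core id.

    ```c
    #include <assert.h>

    /* Assumes 2 threads per core, i.e. ApicId = (CoreId << 1) | ThreadId. */
    static unsigned int core_id_from_apicid(unsigned int apicid,
                                            unsigned int smt_bits)
    {
        return apicid >> smt_bits;
    }

    int main(void)
    {
        /* Two SMT siblings, ApicIds 6 and 7, must map to the same core. */
        assert(core_id_from_apicid(6, 1) == core_id_from_apicid(7, 1));
        /* A thread on the next core gets a different core id. */
        assert(core_id_from_apicid(8, 1) != core_id_from_apicid(7, 1));
        return 0;
    }
    ```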

    Signed-off-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170205105022.8705-2-bp@alien8.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Yazen Ghannam
     
  • commit 79a8b9aa388b0620cc1d525d7c0f0d9a8a85e08e upstream.

    Commit:

    a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology")

    restored the initial approach we had with the Fam15h topology of
    enumerating CU (Compute Unit) threads as cores. And this is still
    correct - they're beefier than HT threads but still have some
    shared functionality.

    Our current approach has a problem with the Mad Max Steam game, for
    example. Yves Dionne reported a certain "choppiness" while playing on
    v4.9.5.

    That problem stems most likely from the fact that the CU threads share
    resources within one CU and when we schedule to a thread of a different
    compute unit, this incurs latency due to migrating the working set to a
    different CU through the caches.

    When the thread siblings mask mirrors that aspect of the CUs and
    threads, the scheduler pays attention to it and tries to schedule within
    one CU first. Which takes care of the latency, of course.

    Reported-by: Yves Dionne
    Signed-off-by: Borislav Petkov
    Cc: Brice Goglin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yazen Ghannam
    Link: http://lkml.kernel.org/r/20170205105022.8705-1-bp@alien8.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Borislav Petkov
     
  • commit 146fbb766934dc003fcbf755b519acef683576bf upstream.

    CONFIG_KASAN=y needs a lot of virtual memory mapped for its shadow.
    In that case ptdump_walk_pgd_level_core() takes a lot of time to
    walk across all page tables and doing this without
    a rescheduling causes soft lockups:

    NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [swapper/0:1]
    ...
    Call Trace:
    ptdump_walk_pgd_level_core+0x40c/0x550
    ptdump_walk_pgd_level_checkwx+0x17/0x20
    mark_rodata_ro+0x13b/0x150
    kernel_init+0x2f/0x120
    ret_from_fork+0x2c/0x40

    I guess that this issue might arise even without KASAN on huge machines
    with several terabytes of RAM.

    Stick cond_resched() in pgd loop to fix this.
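    The structure of the fix, sketched in userspace with a stub in place of
    the kernel's cond_resched() (constants and names illustrative):

    ```c
    #include <assert.h>

    static int resched_points;
    static void cond_resched(void) { resched_points++; }  /* stub */

    #define PTRS_PER_PGD 512

    /* Shape of the fix: offer a reschedule point once per pgd entry, so a
     * huge walk (e.g. under KASAN) cannot hog the CPU past the soft-lockup
     * watchdog threshold. */
    static void walk_pgd_level(void)
    {
        for (int i = 0; i < PTRS_PER_PGD; i++) {
            /* ... walk the lower page-table levels for entry i ... */
            cond_resched();
        }
    }

    int main(void)
    {
        walk_pgd_level();
        assert(resched_points == PTRS_PER_PGD);
        return 0;
    }
    ```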

    Reported-by: Tobias Regnery
    Signed-off-by: Andrey Ryabinin
    Cc: kasan-dev@googlegroups.com
    Cc: Alexander Potapenko
    Cc: "Paul E . McKenney"
    Cc: Dmitry Vyukov
    Link: http://lkml.kernel.org/r/20170210095405.31802-1-aryabinin@virtuozzo.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • commit d966564fcdc19e13eb6ba1fbe6b8101070339c3d upstream.

    This reverts commit 020eb3daaba2857b32c4cf4c82f503d6a00a67de.

    Gabriel C reports that it causes his machine to not boot, and we haven't
    tracked down the reason for it yet. Since the bug it fixes has been
    around for a longish time, we're better off reverting the fix for now.

    Gabriel says:
    "It hangs early and freezes with a lot RCU warnings.

    I bisected it down to :

    > Ruslan Ruslichenko (1):
    > x86/ioapic: Restore IO-APIC irq_chip retrigger callback

    Reverting this one fixes the problem for me..

    The box is a PRIMERGY TX200 S5, 2 socket, 2 x E5520 CPU(s) installed"

    and Ruslan and Thomas are currently stumped.

    Reported-and-bisected-by: Gabriel C
    Cc: Ruslan Ruslichenko
    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

09 Feb, 2017

4 commits

  • commit aaaec6fc755447a1d056765b11b24d8ff2b81366 upstream.

    The recent commit which prevents double activation of interrupts unearthed
    interesting code in x86. The code (ab)uses irq_domain_activate_irq() to
    reconfigure an already activated interrupt. That trips over the prevention
    code now.

    Fix it by deactivating the interrupt before activating the new configuration.

    Fixes: 08d85f3ea99f1 "irqdomain: Avoid activating interrupts more than once"
    Reported-and-tested-by: Mike Galbraith
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Andrey Ryabinin
    Cc: Marc Zyngier
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311901580.3457@nanos
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 00c87e9a70a17b355b81c36adedf05e84f54e10d upstream.

    Saving unsupported state prevents migration when the new host does not
    support a XSAVE feature of the original host, even if the feature is not
    exposed to the guest.

    We've masked host features with guest-visible features before, with
    4344ee981e21 ("KVM: x86: only copy XSAVE state for the supported
    features") and dropped it when implementing XSAVES. Do it again.

    Fixes: df1daba7d1cb ("KVM: x86: support XSAVES usage in the host")
    Reviewed-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář
    Signed-off-by: Greg Kroah-Hartman

    Radim Krčmář
     
  • commit 1aa6cfd33df492939b0be15ebdbcff1f8ae5ddb6 upstream.

    The recent conversion to the hotplug state machine kept two mechanisms from
    the original code:

    1) The first_init logic which adds the number of online CPUs in a package
    to the refcount. That's wrong because the callbacks are executed for
    all online CPUs.

    Remove it so the refcounting is correct.

    2) The on_each_cpu() call to undo box->init() in the error handling
    path. That's bogus because when the prepare callback fails no box has
    been initialized yet.

    Remove it.

    Signed-off-by: Thomas Gleixner
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Stephane Eranian
    Cc: Vince Weaver
    Cc: Yasuaki Ishimatsu
    Fixes: 1a246b9f58c6 ("perf/x86/intel/uncore: Convert to hotplug state machine")
    Link: http://lkml.kernel.org/r/20170131230141.298032324@linutronix.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit bf29bddf0417a4783da3b24e8c9e017ac649326f upstream.

    Commit:

    129766708 ("x86/efi: Only map RAM into EFI page tables if in mixed-mode")

    stopped creating 1:1 mappings for all RAM, when running in native 64-bit mode.

    It turns out though that there are 64-bit EFI implementations in the wild
    (this particular problem has been reported on a Lenovo Yoga 710-11IKB),
    which still make use of the first physical page for their own private use,
    even though they explicitly mark it EFI_CONVENTIONAL_MEMORY in the memory
    map.

    In case there is no mapping for this particular frame in the EFI pagetables,
    as soon as firmware tries to make use of it, a triple fault occurs and the
    system reboots (in case of the Yoga 710-11IKB this is very early during bootup).

    Fix that by always mapping the first page of physical memory into the EFI
    pagetables. We're free to hand this page to the BIOS, as trim_bios_range()
    will reserve the first page and isolate it away from memory allocators anyway.

    Note that just reverting 129766708 alone is not enough on v4.9-rc1+ to fix the
    regression on affected hardware, as this commit:

    ab72a27da ("x86/efi: Consolidate region mapping logic")

    later made the first physical frame not to be mapped anyway.

    Reported-by: Hanka Pavlikova
    Signed-off-by: Jiri Kosina
    Signed-off-by: Matt Fleming
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Laura Abbott
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vojtech Pavlik
    Cc: Waiman Long
    Cc: linux-efi@vger.kernel.org
    Fixes: 129766708 ("x86/efi: Only map RAM into EFI page tables if in mixed-mode")
    Link: http://lkml.kernel.org/r/20170127222552.22336-1-matt@codeblueprint.co.uk
    [ Tidied up the changelog and the comment. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     

01 Feb, 2017

1 commit

  • commit 63d762b88cb5510f2bfdb5112ced18cde867ae61 upstream.

    There is an off-by-one error so we don't unregister priv->pdev_mux[0].
    Also it's slightly simpler as a while loop instead of a for loop.
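    The cleanup pattern can be demonstrated in isolation (a sketch with
    hypothetical names, not the driver's code): `while (--i > 0)` skips
    index 0, while `while (i--)` visits every registered entry.

    ```c
    #include <assert.h>

    #define NMUX 4
    static int registered[NMUX];

    static void unregister(int i) { registered[i] = 0; }

    /* Fixed pattern as a while loop: visits indices NMUX-1 .. 0.
     * The buggy `while (--i > 0)` variant would leave entry 0 registered. */
    static void cleanup(int i)
    {
        while (i--)
            unregister(i);
    }

    int main(void)
    {
        for (int i = 0; i < NMUX; i++)
            registered[i] = 1;
        cleanup(NMUX);
        for (int i = 0; i < NMUX; i++)
            assert(registered[i] == 0);   /* entry 0 is unregistered too */
        return 0;
    }
    ```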

    Fixes: 58cbbee2391c ("x86/platform/mellanox: Introduce support for Mellanox systems platform")
    Signed-off-by: Dan Carpenter
    Acked-by: Vadim Pasternak
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     

26 Jan, 2017

3 commits

  • commit ae7871be189cb41184f1e05742b4a99e2c59774d upstream.

    Convert the flag swiotlb_force from an int to an enum, to prepare for
    the advent of more possible values.
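    The conversion looks roughly like this (enumerator names are
    illustrative; upstream later grew values such as SWIOTLB_FORCE and
    SWIOTLB_NO_FORCE):

    ```c
    #include <assert.h>

    /* Before: `int swiotlb_force` was effectively a boolean.
     * After: an enum leaves room for more policies without overloading
     * magic integer values. */
    enum swiotlb_force {
        SWIOTLB_NORMAL,   /* default behaviour */
        SWIOTLB_FORCE,    /* bounce-buffer all DMA */
    };

    static enum swiotlb_force swiotlb_force = SWIOTLB_NORMAL;

    int main(void)
    {
        assert(swiotlb_force == SWIOTLB_NORMAL);
        swiotlb_force = SWIOTLB_FORCE;
        assert(swiotlb_force == SWIOTLB_FORCE);
        return 0;
    }
    ```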

    Suggested-by: Konrad Rzeszutek Wilk
    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Greg Kroah-Hartman

    Geert Uytterhoeven
     
  • commit 020eb3daaba2857b32c4cf4c82f503d6a00a67de upstream.

    commit d32932d02e18 removed the irq_retrigger callback from the IO-APIC
    chip and did not add it to the new IO-APIC-IR irq chip.

    Unfortunately the software resend fallback is not enabled on X86, so edge
    interrupts which are received during the lazy disabled state of the
    interrupt line are not retriggered and therefore lost.

    Restore the callbacks.

    [ tglx: Massaged changelog ]

    Fixes: d32932d02e18 ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces")
    Signed-off-by: Ruslan Ruslichenko
    Cc: xe-linux-external@cisco.com
    Link: http://lkml.kernel.org/r/1484662432-13580-1-git-send-email-rruslich@cisco.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Ruslan Ruslichenko
     
  • commit 89e9f7bcd8744ea25fcf0ac671b8d72c10d7d790 upstream.

    Martin reported that the Supermicro X8DTH-i/6/iF/6F advertises incorrect
    host bridge windows via _CRS:

    pci_root PNP0A08:00: host bridge window [io 0xf000-0xffff]
    pci_root PNP0A08:01: host bridge window [io 0xf000-0xffff]

    Both bridges advertise the 0xf000-0xffff window, which cannot be correct.

    Work around this by ignoring _CRS on this system. The downside is that we
    may not assign resources correctly to hot-added PCI devices (if they are
    possible on this system).

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=42606
    Reported-by: Martin Burnicki
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Greg Kroah-Hartman

    Bjorn Helgaas
     

20 Jan, 2017

13 commits

  • commit dd853fd216d1485ed3045ff772079cc8689a9a4a upstream.

    A negative number can be specified on the cmdline, and it will be used
    as the setup_clear_cpu_cap() argument. With that we can clear/set some
    bit in memory preceding boot_cpu_data/cpu_caps_cleared, which may cause
    the kernel to misbehave. This patch adds a lower bound check to
    setup_disablecpuid().

    Boris Petkov reproduced a crash:

    [ 1.234575] BUG: unable to handle kernel paging request at ffffffff858bd540
    [ 1.236535] IP: memcpy_erms+0x6/0x10
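    The bound check itself is simple; a userspace sketch (the capacity
    constant is illustrative, not the kernel's actual NCAPINTS value):

    ```c
    #include <assert.h>

    #define NCAPINTS_BITS (19 * 32)   /* illustrative bitmap capacity */

    /* Shape of the fix: reject negative (and too-large) bit numbers before
     * they are used to index into the capability bitmaps. */
    static int setup_disablecpuid(int bit)
    {
        if (bit >= 0 && bit < NCAPINTS_BITS)
            return 0;   /* accepted */
        return -1;      /* rejected */
    }

    int main(void)
    {
        assert(setup_disablecpuid(10) == 0);
        assert(setup_disablecpuid(-1) == -1);    /* no longer underflows */
        assert(setup_disablecpuid(100000) == -1);
        return 0;
    }
    ```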

    Signed-off-by: Lukasz Odzioba
    Acked-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: andi.kleen@intel.com
    Cc: bp@alien8.de
    Cc: dave.hansen@linux.intel.com
    Cc: luto@kernel.org
    Cc: slaoub@gmail.com
    Fixes: ac72e7888a61 ("x86: add generic clearcpuid=... option")
    Link: http://lkml.kernel.org/r/1482933340-11857-1-git-send-email-lukasz.odzioba@intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Lukasz Odzioba
     
  • commit a33d331761bc5dd330499ca5ceceb67f0640a8e6 upstream.

    The following commit:

    8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id")

    ... broke the initial strategy for Bulldozer-based cores' topology,
    where we consider each thread of a compute unit a standalone core
    and not a HT or SMT thread.

    Revert to the firmware-supplied core_id numbering and do not make
    them thread siblings as we don't consider them for such even if they
    technically are, more or less.

    Reported-and-tested-by: Brice Goglin
    Tested-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id")
    Link: http://lkml.kernel.org/r/20170105092638.5247-1-bp@alien8.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Borislav Petkov
     
  • commit 3344ed30791af66dbbad5f375008f3d1863b6c99 upstream.

    The workaround for the AMD Erratum E400 (Local APIC timer stops in C1E
    state) is a two step process:

    - Selection of the E400 aware idle routine

    - Detection whether the platform is affected

    The idle routine selection happens for possibly affected CPUs depending on
    family/model/stepping information. This range of CPUs is not necessarily
    affected, as the decision whether to enable the C1E feature is made by the
    firmware. Unfortunately there is no way to query this at early boot.

    The current implementation polls a MSR in the E400 aware idle routine to
    detect whether the CPU is affected. This is inefficient on non affected
    CPUs because every idle entry has to do the MSR read.

    There is a better way to detect this before going idle for the first time,
    which requires separating the bug flags:

    X86_BUG_AMD_E400 - Selects the E400 aware idle routine and
    enables the detection

    X86_BUG_AMD_APIC_C1E - Set when the platform is affected by E400

    Replace the current X86_BUG_AMD_APIC_C1E usage by the new X86_BUG_AMD_E400
    bug bit to select the idle routine which currently does an unconditional
    detection poll. X86_BUG_AMD_APIC_C1E is going to be used in later patches
    to remove the MSR polling and simplify the handling of this misfeature.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Borislav Petkov
    Cc: Jiri Olsa
    Link: http://lkml.kernel.org/r/20161209182912.2726-3-bp@alien8.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit b6a50cddbcbda7105355898ead18f1a647c22520 upstream.

    These changes do not affect current hw - just a cleanup:

    Currently, we assume that a system has a single Last Level Cache (LLC)
    per node, and that the cpu_llc_id is thus equal to the node_id. This no
    longer applies since Fam17h can have multiple last level caches within a
    node.

    So group the cpu_llc_id assignment by topology feature and family in
    order to make the computation of cpu_llc_id on the different families
    more clear.

    Here is how the LLC ID is being computed on the different families:

    The NODEID_MSR feature only applies to Fam10h in which case the LLC is
    at the node level.

    The TOPOEXT feature is used on families 15h, 16h and 17h. So far we only
    see multiple last level caches if L3 caches are available. Otherwise,
    the cpu_llc_id will default to be the phys_proc_id.

    We have L3 caches only on families 15h and 17h:

    - on Fam15h, the LLC is at the node level.

    - on Fam17h, the LLC is at the core complex level and can be found by
    right shifting the APIC ID. Also, keep the family checks explicit so that
    new families will fall back to the default, which will be node_id for
    TOPOEXT systems.

    Single node systems in families 10h and 15h will have a Node ID of 0
    which will be the same as the phys_proc_id, so we don't need to check
    for multiple nodes before using the node_id.
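    The decision tree described above can be sketched as follows (a
    simplified illustration with assumed inputs, not the kernel function;
    the 8-APIC-ids-per-CCX shift is the Fam17h case mentioned above):

    ```c
    #include <assert.h>

    /* Illustrative cpu_llc_id computation for AMD families. */
    static unsigned int llc_id(unsigned int family, unsigned int apicid,
                               unsigned int node_id, int has_l3)
    {
        if (!has_l3)
            return apicid;        /* stand-in for the phys_proc_id default */
        if (family == 0x17)
            return apicid >> 3;   /* core complex: 8 APIC ids per CCX */
        return node_id;           /* Fam10h/Fam15h: LLC is at node level */
    }

    int main(void)
    {
        /* Fam17h: APIC ids 0-7 share one CCX/LLC, 8-15 the next. */
        assert(llc_id(0x17, 5, 0, 1) == 0);
        assert(llc_id(0x17, 9, 0, 1) == 1);
        /* Fam15h: LLC follows the node id. */
        assert(llc_id(0x15, 9, 0, 1) == 0);
        return 0;
    }
    ```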

    Tested-by: Borislav Petkov
    Signed-off-by: Yazen Ghannam
    [ Rewrote the commit message. ]
    Signed-off-by: Borislav Petkov
    Acked-by: Thomas Gleixner
    Cc: Aravind Gopalakrishnan
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20161108153054.bs3sajbyevq6a6uu@pd.tnic
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Yazen Ghannam
     
  • commit 20b1e22d01a4b0b11d3a1066e9feb04be38607ec upstream.

    With the following commit:

    4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")

    ... efi_bgrt_init() calls into the memblock allocator through
    efi_mem_reserve() => efi_arch_mem_reserve() *after* mm_init() has been called.

    Indeed, KASAN reports a bad read access later on in efi_free_boot_services():

    BUG: KASAN: use-after-free in efi_free_boot_services+0xae/0x24c
    at addr ffff88022de12740
    Read of size 4 by task swapper/0/0
    page:ffffea0008b78480 count:0 mapcount:-127
    mapping: (null) index:0x1 flags: 0x5fff8000000000()
    [...]
    Call Trace:
    dump_stack+0x68/0x9f
    kasan_report_error+0x4c8/0x500
    kasan_report+0x58/0x60
    __asan_load4+0x61/0x80
    efi_free_boot_services+0xae/0x24c
    start_kernel+0x527/0x562
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0x157/0x17a
    start_cpu+0x5/0x14

    The instruction at the given address is the first read from the memmap's
    memory, i.e. the read of md->type in efi_free_boot_services().

    Note that the writes earlier in efi_arch_mem_reserve() don't splat because
    they're done through early_memremap()ed addresses.

    So, after memblock is gone, allocations should be done through the "normal"
    page allocator. Introduce a helper, efi_memmap_alloc() for this. Use
    it from efi_arch_mem_reserve(), efi_free_boot_services() and, for the sake
    of consistency, from efi_fake_memmap() as well.

    Note that for the latter, the memmap allocations cease to be page aligned.
    That alignment isn't needed, though.

    Tested-by: Dan Williams
    Signed-off-by: Nicolai Stange
    Reviewed-by: Ard Biesheuvel
    Cc: Dave Young
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mika Penttilä
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-efi@vger.kernel.org
    Fixes: 4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
    Link: http://lkml.kernel.org/r/20170105125130.2815-1-nicstange@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Nicolai Stange
     
  • commit 0100a3e67a9cef64d72cd3a1da86f3ddbee50363 upstream.

    Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
    (2.28), include memory map entries with phys_addr=0x0 and num_pages=0.

    These machines fail to boot after the following commit:

    8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")

    Fix this by removing such bogus entries from the memory map.

    Furthermore, currently the log output for this case (with efi=debug)
    looks like:

    [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)

    This is clearly wrong, and also not as informative as it could be. This
    patch changes it so that if we find obviously invalid memory map
    entries, we print an error and skip those entries. It also detects
    overflow in the displayed address range calculation, so the new output is:

    [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
    [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0x0000000000000000] (invalid)

    It also detects memory map sizes that would overflow the physical
    address, for example phys_addr=0xfffffffffffff000 and
    num_pages=0x0200000000000001, and prints:

    [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
    [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)

    It then removes these entries from the memory map.
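    The validity check boils down to 64-bit overflow arithmetic; a minimal
    sketch (illustrative, not the upstream helper) using the bogus values
    quoted above:

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define EFI_PAGE_SHIFT 12

    /* An entry is bogus if it spans no pages, or if its size or end
     * address overflows 64 bits. */
    static int entry_is_valid(uint64_t phys_addr, uint64_t num_pages)
    {
        uint64_t size = num_pages << EFI_PAGE_SHIFT;

        if (num_pages == 0)
            return 0;
        if (num_pages >> (64 - EFI_PAGE_SHIFT))   /* size overflows */
            return 0;
        if (phys_addr + size < phys_addr)         /* end wraps past 2^64 */
            return 0;
        return 1;
    }

    int main(void)
    {
        assert(!entry_is_valid(0x0, 0));               /* zero pages */
        assert(!entry_is_valid(0xfffffffffffff000ULL,
                               0x0200000000000001ULL)); /* overflow */
        assert(entry_is_valid(0x100000, 16));           /* sane entry */
        return 0;
    }
    ```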

    Signed-off-by: Peter Jones
    Signed-off-by: Ard Biesheuvel
    [ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
    Signed-off-by: Matt Fleming
    [Matt: Include bugzilla info in commit log]
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Jones
     
  • commit 129a72a0d3c8e139a04512325384fe5ac119e74d upstream.

    Introduces segmented_write_std.

    Switches from emulated reads/writes to standard read/writes in fxsave,
    fxrstor, sgdt, and sidt. This fixes CVE-2017-2584, a longstanding
    kernel memory leak.

    Since commit 283c95d0e389 ("KVM: x86: emulate FXSAVE and FXRSTOR",
    2016-11-09), which is luckily not yet in any final release, this would
    also be an exploitable kernel memory *write*!

    Reported-by: Dmitry Vyukov
    Fixes: 96051572c819194c37a8367624b285be10297eca
    Fixes: 283c95d0e3891b64087706b344a4b545d04a6e62
    Suggested-by: Paolo Bonzini
    Signed-off-by: Steve Rutherford
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Steve Rutherford
     
  • commit 283c95d0e3891b64087706b344a4b545d04a6e62 upstream.

    Internal errors were reported on 16 bit fxsave and fxrstor with ipxe.
    Old Intels don't have unrestricted_guest, so we have to emulate them.

    The patch takes advantage of the hardware implementation.

    AMD and Intel differ in saving and restoring other fields in the first 32
    bytes. A test wrote 0xff to the fxsave area, 0 to the upper bits of MXCSR
    in the fxsave area, executed fxrstor, rewrote the fxsave area to 0xee,
    and executed fxsave:

    Intel (Nehalem):
    7f 1f 7f 7f ff 00 ff 07 ff ff ff ff ff ff 00 00
    ff ff ff ff ff ff 00 00 ff ff 00 00 ff ff 00 00
    Intel (Haswell -- deprecated FPU CS and FPU DS):
    7f 1f 7f 7f ff 00 ff 07 ff ff ff ff 00 00 00 00
    ff ff ff ff 00 00 00 00 ff ff 00 00 ff ff 00 00
    AMD (Opteron 2300-series):
    7f 1f 7f 7f ff 00 ee ee ee ee ee ee ee ee ee ee
    ee ee ee ee ee ee ee ee ff ff 00 00 ff ff 02 00

    fxsave/fxrstor will only be emulated on early Intels, so KVM can't do
    much to improve the situation.

    Signed-off-by: Radim Krčmář
    Signed-off-by: Greg Kroah-Hartman

    Radim Krčmář
     
  • commit aabba3c6abd50b05b1fc2c6ec44244aa6bcda576 upstream.

    Move the existing exception handling for inline assembly into a macro
    and switch its return values to X86EMUL type.

    Signed-off-by: Radim Krčmář
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Radim Krčmář
     
  • commit d3fe959f81024072068e9ed86b39c2acfd7462a9 upstream.

    Needed for FXSAVE and FXRSTOR.

    Signed-off-by: Radim Krčmář
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Radim Krčmář
     
  • commit 546d87e5c903a7f3ee7b9f998949a94729fbc65b upstream.

    Reported by syzkaller:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
    IP: _raw_spin_lock+0xc/0x30
    PGD 3e28eb067
    PUD 3f0ac6067
    PMD 0
    Oops: 0002 [#1] SMP
    CPU: 0 PID: 2431 Comm: test Tainted: G OE 4.10.0-rc1+ #3
    Call Trace:
    ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
    kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
    ? pick_next_task_fair+0xe1/0x4e0
    ? kvm_arch_vcpu_load+0xea/0x260 [kvm]
    kvm_vcpu_ioctl+0x33a/0x600 [kvm]
    ? hrtimer_try_to_cancel+0x29/0x130
    ? do_nanosleep+0x97/0xf0
    do_vfs_ioctl+0xa1/0x5d0
    ? __hrtimer_init+0x90/0x90
    ? do_nanosleep+0x5b/0xf0
    SyS_ioctl+0x79/0x90
    do_syscall_64+0x6e/0x180
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0

    The syzkaller folks reported a NULL pointer dereference due to
    ENABLE_CAP succeeding even without an irqchip. The Hyper-V
    synthetic interrupt controller is activated, resulting in a
    wrong request to rescan the ioapic and a NULL pointer dereference.

    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/kvm.h>

    #ifndef KVM_CAP_HYPERV_SYNIC
    #define KVM_CAP_HYPERV_SYNIC 123
    #endif

    void* thr(void* arg)
    {
    struct kvm_enable_cap cap;
    cap.flags = 0;
    cap.cap = KVM_CAP_HYPERV_SYNIC;
    ioctl((long)arg, KVM_ENABLE_CAP, &cap);
    return 0;
    }

    int main()
    {
    void *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    int kvmfd = open("/dev/kvm", 0);
    int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
    struct kvm_userspace_memory_region memreg;
    memreg.slot = 0;
    memreg.flags = 0;
    memreg.guest_phys_addr = 0;
    memreg.memory_size = 0x1000;
    memreg.userspace_addr = (unsigned long)host_mem;
    ((char *)host_mem)[0] = 0xf4;
    ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
    int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
    struct kvm_sregs sregs;
    ioctl(cpufd, KVM_GET_SREGS, &sregs);
    sregs.cr0 = 0;
    sregs.cr4 = 0;
    sregs.efer = 0;
    sregs.cs.selector = 0;
    sregs.cs.base = 0;
    ioctl(cpufd, KVM_SET_SREGS, &sregs);
    struct kvm_regs regs = { .rflags = 2 };
    ioctl(cpufd, KVM_SET_REGS, &regs);
    ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
    pthread_t th;
    pthread_create(&th, 0, thr, (void*)(long)cpufd);
    usleep(rand() % 10000);
    ioctl(cpufd, KVM_RUN, 0);
    pthread_join(th, 0);
    return 0;
    }

    This patch fixes it by failing ENABLE_CAP if there is no irqchip.
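    The guard itself is a one-line precondition; sketched here with a
    simplified stand-in struct (names illustrative, not KVM's internals):

    ```c
    #include <assert.h>
    #include <errno.h>

    struct kvm { int irqchip_in_kernel; };

    /* Shape of the fix: KVM_CAP_HYPERV_SYNIC now requires an in-kernel
     * irqchip, instead of activating SynIC and later dereferencing a
     * NULL ioapic on the rescan request. */
    static int enable_hyperv_synic(struct kvm *kvm)
    {
        if (!kvm->irqchip_in_kernel)
            return -EINVAL;
        return 0;   /* activate SynIC */
    }

    int main(void)
    {
        struct kvm no_chip = { 0 }, with_chip = { 1 };
        assert(enable_hyperv_synic(&no_chip) == -EINVAL);
        assert(enable_hyperv_synic(&with_chip) == 0);
        return 0;
    }
    ```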

    Reported-by: Dmitry Vyukov
    Fixes: 5c919412fe61 (kvm/x86: Hyper-V synthetic interrupt controller)
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Dmitry Vyukov
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li
     
  • commit cef84c302fe051744b983a92764d3fcca933415d upstream.

    KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
    These are implemented with delayed_work structs which can still be
    pending when the KVM module is unloaded. We've seen this cause kernel
    panics when the kvm_intel module is quickly reloaded.

    Use the new static_key_deferred_flush() API to flush pending updates on
    module unload.

    Signed-off-by: David Matlack
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    David Matlack
     
  • commit 33ab91103b3415e12457e3104f0e4517ce12d0f3 upstream.

    This is CVE-2017-2583. On Intel this causes a failed vmentry because
    SS's type is neither 3 nor 7 (even though the manual says this check is
    only done for usable SS, and the dmesg splat says that SS is unusable!).
    On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.

    The fix fabricates a data segment descriptor when SS is set to a null
    selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
    Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
    this in turn ensures CPL < 3 because RPL must be equal to CPL.
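    The RPL rule can be checked in isolation (a simplified sketch of the
    selector-format logic, not the emulator code): the low two bits of a
    selector are its RPL, and a selector is null when everything else is zero.

    ```c
    #include <assert.h>

    /* Only RPL < 3 may load a null SS selector; RPL must equal CPL, so
     * this also guarantees CPL < 3 as described above. */
    static int null_ss_allowed(unsigned short sel)
    {
        int rpl = sel & 3;

        if ((sel & ~3) != 0)
            return 1;        /* not a null selector; normal checks apply */
        return rpl < 3;
    }

    int main(void)
    {
        assert(null_ss_allowed(0x0000));    /* null, RPL 0: ok */
        assert(null_ss_allowed(0x0002));    /* null, RPL 2: ok */
        assert(!null_ss_allowed(0x0003));   /* null, RPL 3: refused */
        return 0;
    }
    ```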

    Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
    the bug and deciphering the manuals.

    Reported-by: Xiaohan Zhang
    Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     

15 Jan, 2017

1 commit

  • [ Upstream commit 9d5ecb09d525469abd1a10c096cb5a17206523f2 ]

    If after too many passes still no image could be emitted, then
    swap back to the original program as we do in all other cases
    and don't use the one with blinding.

    Fixes: 959a75791603 ("bpf, x86: add support for constant blinding")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

12 Jan, 2017

3 commits

  • commit 3df8d9208569ef0b2313e516566222d745f3b94b upstream.

    A typo (or mis-merge?) resulted in leaf 6 only being probed if
    cpuid_level >= 7.

    Fixes: 2ccd71f1b278 ("x86/cpufeature: Move some of the scattered feature bits to x86_capability")
    Signed-off-by: Andy Lutomirski
    Acked-by: Borislav Petkov
    Cc: Brian Gerst
    Link: http://lkml.kernel.org/r/6ea30c0e9daec21e488b54761881a6dfcf3e04d0.1481825597.git.luto@kernel.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit a01aa6c9f40fe03c82032e7f8b3bcf1e6c93ac0e upstream.

    Userspace knows nothing about the kernel config, so #ifdefs
    around ABI prctl constants make them invisible to userspace.
    Let it be clean'n'simple: remove #ifdefs.

    If kernel has CONFIG_CHECKPOINT_RESTORE disabled, sys_prctl()
    will return -EINVAL for those prctls.
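    The before/after difference is compile-time hiding versus a runtime
    rejection. A userspace model (the prctl value matches the x86 ABI, but
    the config flag and handler are illustrative stand-ins):

    ```c
    #include <assert.h>
    #include <errno.h>

    #define ARCH_MAP_VDSO_64 0x2003   /* always visible, never #ifdef'd */

    static int checkpoint_restore_enabled; /* models CONFIG_CHECKPOINT_RESTORE */

    static long do_arch_prctl(int code)
    {
        switch (code) {
        case ARCH_MAP_VDSO_64:
            if (!checkpoint_restore_enabled)
                return -EINVAL;   /* rejected at runtime, not hidden */
            return 0;
        default:
            return -EINVAL;
        }
    }

    int main(void)
    {
        checkpoint_restore_enabled = 0;
        assert(do_arch_prctl(ARCH_MAP_VDSO_64) == -EINVAL);
        checkpoint_restore_enabled = 1;
        assert(do_arch_prctl(ARCH_MAP_VDSO_64) == 0);
        return 0;
    }
    ```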

    Reported-by: Paul Bolle
    Signed-off-by: Dmitry Safonov
    Acked-by: Andy Lutomirski
    Cc: 0x7f454c46@gmail.com
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Cyrill Gorcunov
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: oleg@redhat.com
    Fixes: 2eefd8789698 ("x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*")
    Link: http://lkml.kernel.org/r/20161027141516.28447-2-dsafonov@virtuozzo.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Safonov
     
  • commit 6ef4e07ecd2db21025c446327ecf34414366498b upstream.

    Otherwise, mismatch between the smm bit in hflags and the MMU role
    can cause a NULL pointer dereference.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Xiao Guangrong
     

09 Jan, 2017

5 commits

  • commit 9d85eb9119f4eeeb48e87adfcd71f752655700e9 upstream.

    The logical package management has several issues:

    - The APIC ids provided by ACPI are not required to be the same as the
    initial APIC id which can be retrieved by CPUID. The APIC ids provided
    by ACPI are those which are written by the BIOS into the APIC. The
    initial id is set by hardware and can not be changed. The hardware
    provided ids contain the real hardware package information.

    Especially AMD sets the effective APIC id different from the hardware id
    as they need to reserve space for the IOAPIC ids starting at id 0.

    As a consequence those machines trigger the currently active firmware
    bug printouts in dmesg. These printouts are obviously wrong.

    - Virtual machines have their own interesting ways of enumerating APICs
    and packages, which are not reliably covered by the current implementation.

    The sizing of the mapping array has been tweaked to be generously large to
    handle systems which provide a wrong core count when HT is disabled, so the
    whole magic which checks for space in the physical hotplug case is not
    needed anymore.

    Simplify the whole machinery and do the mapping when the CPU starts and the
    CPUID derived physical package information is available. This solves the
    observed problems on AMD machines and works for the virtualization issues
    as well.

    Remove the extra call from XEN cpu bringup code as it is no longer
    required.

    Fixes: d49597fd3bc7 ("x86/cpu: Deal with broken firmware (VMWare/XEN)")
    Reported-and-tested-by: Borislav Petkov
    Tested-by: Boris Ostrovsky
    Signed-off-by: Thomas Gleixner
    Cc: Juergen Gross
    Cc: Peter Zijlstra
    Cc: M. Vefa Bicakci
    Cc: xen-devel
    Cc: Charles (Chas) Williams
    Cc: Borislav Petkov
    Cc: Alok Kataria
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1612121102260.3429@nanos
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 847fa1a6d3d00f3bdf68ef5fa4a786f644a0dd67 upstream.

    With new binutils, gcc may get smart with its optimization and change a jmp
    from a 5 byte jump to a 2 byte one even though it was jumping to a global
    function. But that global function existed within a 2 byte radius, and gcc
    was able to optimize it. Unfortunately, that jump was also being modified
    when function graph tracing begins. Ftrace expected that jump to be 5
    bytes, but it was only two, so it overwrote code after the jump, causing a
    crash.

    This was fixed for x86_64 with commit 8329e818f149, with the same subject as
    this commit, but nothing was done for x86_32.

    Fixes: d61f82d06672 ("ftrace: use dynamic patching for updating mcount calls")
    Reported-by: Colin Ian King
    Tested-by: Colin Ian King
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit ef85b67385436ddc1998f45f1d6a210f935b3388 upstream.

    When L2 exits to L0 due to "exception or NMI", software exceptions
    (#BP and #OF) for which L1 has requested an intercept should be
    handled by L1 rather than L0. Previously, only hardware exceptions
    were forwarded to L1.

    Signed-off-by: Jim Mattson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Jim Mattson
     
  • commit 834fcd298003c10ce450e66960c78893cb1cc4b5 upstream.

    If the pmu registration fails, the registered hotplug callbacks are not
    removed. Wrong in any case, but fatal in case of a modular driver.

    Replace the nonsensical state names with proper ones while at it.

    Fixes: 77c34ef1c319 ("perf/x86/intel/cstate: Convert Intel CSTATE to hotplug state machine")
    Signed-off-by: Thomas Gleixner
    Cc: Sebastian Siewior
    Cc: Peter Zijlstra
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit b0c1ef52959582144bbea9a2b37db7f4c9e399f7 upstream.

    An earlier patch allowed enabling PT and LBR at the same
    time on Goldmont. However it also allowed enabling BTS and LBR
    at the same time, which is still not supported. Fix this by
    bypassing the check only for PT.

    Signed-off-by: Andi Kleen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: alexander.shishkin@intel.com
    Cc: kan.liang@intel.com
    Fixes: ccbebba4c6bf ("perf/x86/intel/pt: Bypass PT vs. LBR exclusivity if the core supports it")
    Link: http://lkml.kernel.org/r/20161209001417.4713-1-andi@firstfloor.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

06 Jan, 2017

1 commit

  • commit 334bb773876403eae3457d81be0b8ea70f8e4ccc upstream.

    Commit 4efca4ed ("kbuild: modversions for EXPORT_SYMBOL() for asm") adds
    modversion support for symbols exported from asm files. Architectures
    must include C-style declarations for those symbols in asm/asm-prototypes.h
    in order for them to be versioned.

    Add these declarations for x86, and an architecture-independent file that
    can be used for common symbols.

    With f27c2f6 reverting 8ab2ae6 ("default exported asm symbols to zero") we
    produce a scary warning on x86; this commit fixes that.

    Signed-off-by: Adam Borowski
    Tested-by: Kalle Valo
    Acked-by: Nicholas Piggin
    Tested-by: Peter Wu
    Tested-by: Oliver Hartkopp
    Signed-off-by: Michal Marek
    Signed-off-by: Greg Kroah-Hartman

    Adam Borowski
     

08 Dec, 2016

1 commit

  • Pull x86 fixes from Ingo Molnar:
    "Misc fixes: a core dumping crash fix, a guess-unwinder regression fix,
    plus three build warning fixes"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/unwind: Fix guess-unwinder regression
    x86/build: Annotate die() with noreturn to fix build warning on clang
    x86/platform/olpc: Fix resume handler build warning
    x86/apic/uv: Silence a shift wrapping warning
    x86/coredump: Always use user_regs_struct for compat_elf_gregset_t

    Linus Torvalds
     

06 Dec, 2016

2 commits

  • Lukasz reported that perf stat counters overflow handling is broken on KNL/SLM.

    Both these parts have full_width_write set, and that does indeed have
    a problem. In order to deal with counter wrap, we must sample the
    counter at least once every half counter period (see also the sampling
    theorem) such that we can unambiguously reconstruct the count.

    However commit:

    069e0c3c4058 ("perf/x86/intel: Support full width counting")

    sets the sampling interval to the full period, not half.

    Fixing that exposes another issue, in that we must not sign extend the
    delta value when we shift it right; the counter cannot have
    decremented after all.

    With both these issues fixed, counter overflow functions correctly
    again.

    Reported-by: Lukasz Odzioba
    Tested-by: Liang, Kan
    Tested-by: Odzioba, Lukasz
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: stable@vger.kernel.org
    Fixes: 069e0c3c4058 ("perf/x86/intel: Support full width counting")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
     
  • Knights Mill is close enough to Knights Landing that this patch reuses the
    C-state residency support of the latter.

    Signed-off-by: Piotr Luc
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20161201000853.18260-1-piotr.luc@intel.com
    Signed-off-by: Ingo Molnar

    Piotr Luc