06 Mar, 2019

1 commit

  • [ Upstream commit bf7d28c53453ea904584960de55e33e03b9d93b1 ]

    Using sizeof(pointer) for determining the size of a memset() only works
    when the size of the pointer and the size of the type to which it points
    are the same. For pte_t this is only true for 64-bit and 32-bit non-PAE.
    On 32-bit PAE systems this is wrong as the pointer size is 4 bytes but the
    PTE entry is 8 bytes. It's actually not a real-world issue as this code
    depends on 64-bit, but it's wrong nevertheless.

    Use sizeof(*p) for correctness' sake.
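
    A minimal userspace illustration of the difference (a toy, not the kernel
    code; the pte_t here is a stand-in, and the mismatch only shows up in a
    32-bit build, e.g. gcc -m32):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { uint64_t pte; } pte_t;   /* 8 bytes, as on 32-bit PAE */

    int main(void)
    {
            pte_t entry = { .pte = ~0ULL };
            pte_t *p = &entry;

            memset(p, 0, sizeof(p));   /* clears only sizeof(pointer) bytes: 4 on 32-bit */
            printf("after sizeof(p):  %#llx\n", (unsigned long long)entry.pte);

            entry.pte = ~0ULL;
            memset(p, 0, sizeof(*p));  /* clears the full 8-byte entry */
            printf("after sizeof(*p): %#llx\n", (unsigned long long)entry.pte);
            return 0;
    }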

    Fixes: aad983913d77 ("x86/mm/encrypt: Simplify sme_populate_pgd() and sme_populate_pgd_large()")
    Signed-off-by: Peng Hao
    Signed-off-by: Thomas Gleixner
    Cc: Kirill A. Shutemov
    Cc: Tom Lendacky
    Cc: dave.hansen@linux.intel.com
    Cc: peterz@infradead.org
    Cc: luto@kernel.org
    Link: https://lkml.kernel.org/r/1546065252-97996-1-git-send-email-peng.hao2@zte.com.cn
    Signed-off-by: Sasha Levin

    Peng Hao
     

13 Jan, 2019

2 commits

  • [ Upstream commit 254eb5505ca0ca749d3a491fc6668b6c16647a99 ]

    The LDT remap placement has been changed. It's now placed before the direct
    mapping in the kernel virtual address space for both paging modes.

    Change address markers order accordingly.

    Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: hpa@zytor.com
    Cc: dave.hansen@linux.intel.com
    Cc: luto@kernel.org
    Cc: peterz@infradead.org
    Cc: boris.ostrovsky@oracle.com
    Cc: jgross@suse.com
    Cc: bhe@redhat.com
    Cc: hans.van.kranenburg@mendix.com
    Cc: linux-mm@kvack.org
    Cc: xen-devel@lists.xenproject.org
    Link: https://lkml.kernel.org/r/20181130202328.65359-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
     
  • [ Upstream commit 16877a5570e0c5f4270d5b17f9bab427bcae9514 ]

    There is a guard hole at the beginning of the kernel address space, also
    used by hypervisors. It occupies 16 PGD entries.

    This reserved range is not defined explicitly; it is calculated relative
    to other entities: the direct mapping and user space ranges.

    The calculation got broken by recent changes of the kernel memory layout:
    LDT remap range is now mapped before direct mapping and makes the
    calculation invalid.

    The breakage leads to a crash on Xen dom0 boot [1].

    Define the reserved range explicitly. It's part of the kernel ABI (hypervisors
    expect it to be stable) and must not depend on changes in the rest of
    kernel memory layout.

    [1] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03313.html
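
    A sketch of what an explicitly defined reserved range looks like, per the
    description above (16 PGD entries at the start of the kernel half of the
    address space; the constant names follow the upstream patch and are shown
    here only for illustration):

    #define GUARD_HOLE_PGD_ENTRY   -256UL
    #define GUARD_HOLE_SIZE        (16UL << PGDIR_SHIFT)
    #define GUARD_HOLE_BASE_ADDR   (GUARD_HOLE_PGD_ENTRY << PGDIR_SHIFT)
    #define GUARD_HOLE_END_ADDR    (GUARD_HOLE_BASE_ADDR + GUARD_HOLE_SIZE)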

    Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging")
    Reported-by: Hans van Kranenburg
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Thomas Gleixner
    Tested-by: Hans van Kranenburg
    Reviewed-by: Juergen Gross
    Cc: bp@alien8.de
    Cc: hpa@zytor.com
    Cc: dave.hansen@linux.intel.com
    Cc: luto@kernel.org
    Cc: peterz@infradead.org
    Cc: boris.ostrovsky@oracle.com
    Cc: bhe@redhat.com
    Cc: linux-mm@kvack.org
    Cc: xen-devel@lists.xenproject.org
    Link: https://lkml.kernel.org/r/20181130202328.65359-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
     

10 Jan, 2019

2 commits

  • commit ba6f508d0ec4adb09f0a939af6d5e19cdfa8667d upstream.

    Commit:

    f77084d96355 "x86/mm/pat: Disable preemption around __flush_tlb_all()"

    addressed a case where __flush_tlb_all() is called without preemption
    being disabled. It also left a warning to catch other cases where
    preemption is not disabled.

    That warning triggers for the memory hotplug path which is also used for
    persistent memory enabling:

    WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
    RIP: 0010:__flush_tlb_all+0x1b/0x3a
    [..]
    Call Trace:
    phys_pud_init+0x29c/0x2bb
    kernel_physical_mapping_init+0xfc/0x219
    init_memory_mapping+0x1a5/0x3b0
    arch_add_memory+0x2c/0x50
    devm_memremap_pages+0x3aa/0x610
    pmem_attach_disk+0x585/0x700 [nd_pmem]

    Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
    and Dave confirmed the expectation for TLB flush is for modifying /
    invalidating existing PTE entries, but not initial population [2]. Drop
    the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
    expectation that this path is only ever populating empty entries for the
    linear map. Note, at linear map teardown time there is a call to the
    all-cpu flush_tlb_all() to invalidate the removed mappings.

    [1]: https://lkml.kernel.org/r/9DFD717D-857D-493D-A606-B635D72BAC21@amacapital.net
    [2]: https://lkml.kernel.org/r/749919a4-cdb1-48a3-adb4-adb81a5fa0b5@intel.com

    [ mingo: Minor readability edits. ]

    Suggested-by: Dave Hansen
    Reported-by: Andy Lutomirski
    Signed-off-by: Dan Williams
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Kirill A. Shutemov
    Cc:
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Cc: dave.hansen@intel.com
    Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
    Link: http://lkml.kernel.org/r/154395944713.32119.15611079023837132638.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 5b5e4d623ec8a34689df98e42d038a3b594d2ff9 upstream.

    Swap storage is restricted to max_swapfile_size (~16TB on x86_64) whenever
    the system is deemed affected by L1TF vulnerability. Even though the limit
    is quite high for most deployments it seems to be too restrictive for
    deployments which are willing to live with the mitigation disabled.

    We have a customer deploying 8x 6.4TB PCIe/NVMe SSD swap devices, which is
    clearly over the limit.

    Drop the swap restriction when l1tf=off is specified. It also doesn't make
    much sense to warn about too much memory for the l1tf mitigation when it is
    forcefully disabled by the administrator.
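
    A simplified sketch of the resulting logic (the real code also converts
    the PFN limit into swap offsets; this only illustrates where the l1tf=off
    check goes):

    unsigned long max_swapfile_size(void)
    {
            unsigned long pages = generic_max_swapfile_size();

            /* Only clamp when the CPU is affected and the mitigation is not off. */
            if (boot_cpu_has_bug(X86_BUG_L1TF) &&
                l1tf_mitigation != L1TF_MITIGATION_OFF) {
                    unsigned long long l1tf_limit = l1tf_pfn_limit();

                    if (l1tf_limit < pages)
                            pages = l1tf_limit;
            }
            return pages;
    }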

    [ tglx: Folded the documentation delta change ]

    Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
    Signed-off-by: Michal Hocko
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Andi Kleen
    Acked-by: Jiri Kosina
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc:
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181113184910.26697-1-mhocko@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

29 Dec, 2018

1 commit

  • commit 51c3fbd89d7554caa3290837604309f8d8669d99 upstream.

    A decoy address is used by set_mce_nospec() to update the cache attributes
    for a page that may contain poison (multi-bit ECC error) while attempting
    to minimize the possibility of triggering a speculative access to that
    page.

    When reserve_memtype() is handling a decoy address it needs to convert it
    to its real physical alias. The conversion, AND'ing with __PHYSICAL_MASK,
    is broken for a 32-bit physical mask, and reserve_memtype() ends up being
    passed the last physical page. Gert reports triggering the:

    BUG_ON(start >= end);

    ...assertion when running a 32-bit non-PAE build on a platform that has
    a driver resource at the top of physical memory:

    BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved

    Given that the decoy address scheme is only targeted at 64-bit builds and
    assumes that the top of physical address space is free for use as a decoy
    address range, simply bypass address sanitization in the 32-bit case.

    Lastly, there was no need to crash the system when this failure occurred,
    and no need to crash future systems if the assumptions of decoy addresses
    are ever violated. Change the BUG_ON() to a WARN() with an error return.
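
    A sketch of the 32-bit bypass described above (simplified; it assumes the
    helper introduced by the earlier memtype patch is named sanitize_phys(),
    and the BUG_ON() to WARN() change is a separate hunk of the same patch):

    static u64 sanitize_phys(u64 address)
    {
            /* Decoy addresses only exist on 64-bit; don't mangle 32-bit addresses. */
            if (IS_ENABLED(CONFIG_X86_64))
                    return address & __PHYSICAL_MASK;
            return address;
    }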

    Fixes: 510ee090abc3 ("x86/mm/pat: Prepare {reserve, free}_memtype() for...")
    Reported-by: Gert Robben
    Signed-off-by: Dan Williams
    Signed-off-by: Thomas Gleixner
    Tested-by: Gert Robben
    Cc: stable@vger.kernel.org
    Cc: Andy Shevchenko
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: platform-driver-x86@vger.kernel.org
    Cc:
    Link: https://lkml.kernel.org/r/154454337985.789277.12133288391664677775.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

06 Dec, 2018

2 commits

  • commit 4c71a2b6fd7e42814aa68a6dec88abf3b42ea573 upstream

    The IBPB speculation barrier is issued from switch_mm() when the kernel
    switches to a user space task with a different mm than the user space task
    which ran last on the same CPU.

    An additional optimization is to avoid IBPB when the incoming task can be
    ptraced by the outgoing task. This optimization only works when switching
    directly between two user space tasks. When switching from a kernel task to
    a user space task the optimization fails because the previous task cannot
    be accessed anymore. So in quite a few scenarios the optimization just
    adds overhead.

    The upcoming conditional IBPB support will issue IBPB only for user space
    tasks which have the TIF_SPEC_IB bit set. This requires handling the
    following cases:

    1) Switch from a user space task (potential attacker) which has
    TIF_SPEC_IB set to a user space task (potential victim) which has
    TIF_SPEC_IB not set.

    2) Switch from a user space task (potential attacker) which has
    TIF_SPEC_IB not set to a user space task (potential victim) which has
    TIF_SPEC_IB set.

    This needs to be optimized for the case where the IBPB can be avoided when
    only kernel threads ran in between user space tasks which belong to the
    same process.

    The current check of whether two tasks belong to the same context uses the
    tasks' context id. While correct, it's simpler to use the mm pointer because
    it allows mangling the TIF_SPEC_IB bit into it. The context-id based
    mechanism requires extra storage, which creates worse code.

    When a task is scheduled out its TIF_SPEC_IB bit is mangled as bit 0 into
    the per CPU storage which is used to track the last user space mm which was
    running on a CPU. This bit can be used together with the TIF_SPEC_IB bit of
    the incoming task to make the decision whether IBPB needs to be issued or
    not to cover the two cases above.

    As conditional IBPB is going to be the default, remove the dubious ptrace
    check for the IBPB always case and simply issue IBPB always when the
    process changes.

    Move the storage to a different place in the struct as the original one
    created a hole.
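
    A userspace toy illustrating the bit-0 mangling and the resulting decision
    (not the kernel code; all names here are made up for the example):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LAST_USER_MM_IBPB 0x1UL    /* bit 0 of an aligned mm pointer is free */

    struct mm { long pgd; } __attribute__((aligned(8)));

    /* Fold the task's "IBPB wanted" flag into bit 0 of its mm pointer. */
    static uintptr_t mangle_mm(const struct mm *mm, bool spec_ib)
    {
            return (uintptr_t)mm | (spec_ib ? LAST_USER_MM_IBPB : 0);
    }

    /* IBPB is needed when the mm (or its flag) changed and either side set the flag. */
    static bool ibpb_needed(uintptr_t prev_mm, uintptr_t next_mm)
    {
            return next_mm != prev_mm && ((next_mm | prev_mm) & LAST_USER_MM_IBPB);
    }

    int main(void)
    {
            struct mm a, b;
            uintptr_t prev = mangle_mm(&a, true);   /* last user task had the flag set */

            printf("same mm, flag unchanged: %d\n", ibpb_needed(prev, mangle_mm(&a, true)));  /* 0 */
            printf("new mm, flag involved:   %d\n", ibpb_needed(prev, mangle_mm(&b, true)));  /* 1 */
            printf("new mm, nobody flagged:  %d\n", ibpb_needed(mangle_mm(&a, false),
                                                                mangle_mm(&b, false)));       /* 0 */
            return 0;
    }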

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185005.466447057@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit dbfe2953f63c640463c630746cd5d9de8b2f63ae upstream

    Currently, IBPB is only issued in cases when switching into a non-dumpable
    process, the rationale being to protect such 'important and security
    sensitive' processes (such as GPG) from data leaking into a different
    userspace process via spectre v2.

    This is however completely insufficient to provide proper userspace-to-userspace
    spectrev2 protection, as any process can poison branch buffers before being
    scheduled out, and the newly scheduled process immediately becomes a spectrev2
    victim.

    In order to minimize the performance impact (for use cases that do require
    spectrev2 protection), issue the barrier only in cases when switching between
    processes where the victim can't be ptraced by the potential attacker (as in
    such cases, the attacker doesn't have to bother with branch buffers at all).

    [ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
    PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
    fine-grained ]

    Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
    Originally-by: Tim Chen
    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: "WoodhouseDavid"
    Cc: Andi Kleen
    Cc: "SchauflerCasey"
    Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     

14 Nov, 2018

2 commits

  • commit c6ee7a548e2c291398b4f32c1f741c66b9f98e1c upstream.

    The numa_emulation() routine in the 'uniform' case walks through all the
    physical 'memblk' instances and divides them into N emulated nodes with
    split_nodes_size_interleave_uniform(). As each physical node is consumed it
    is removed from the physical memblk array in the numa_remove_memblk_from()
    helper.

    Since split_nodes_size_interleave_uniform() handles advancing the array as
    the 'memblk' is consumed, it is expected that the base of the array is
    always specified as the argument.

    Otherwise, on multi-socket (> 2) configurations the uniform-split
    capability can generate an invalid numa configuration leading to boot
    failures with signatures like the following:

    rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    Sending NMI from CPU 0 to CPUs 2:
    NMI backtrace for cpu 2
    CPU: 2 PID: 1332 Comm: pgdatinit0 Not tainted 4.19.0-rc8-next-20181019-baseline #59
    RIP: 0010:__init_single_page.isra.74+0x81/0x90
    [..]
    Call Trace:
    deferred_init_pages+0xaa/0xe3
    deferred_init_memmap+0x18f/0x318
    kthread+0xf8/0x130
    ? deferred_free_pages.isra.105+0xc9/0xc9
    ? kthread_stop+0x110/0x110
    ret_from_fork+0x35/0x40

    Fixes: 1f6a2c6d9f121 ("x86/numa_emulation: Introduce uniform split capability")
    Signed-off-by: Dave Jiang
    Signed-off-by: Dan Williams
    Signed-off-by: Thomas Gleixner
    Tested-by: Alexander Duyck
    Reviewed-by: Dave Hansen
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/154049911459.2685845.9210186007479774286.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Dave Jiang
     
  • commit f77084d96355f5fba8e2c1fb3a51a393b1570de7 upstream.

    The WARN_ON_ONCE(__read_cr3() != build_cr3()) in switch_mm_irqs_off()
    triggers every once in a while during a snapshotted system upgrade.

    The warning triggers since commit decab0888e6e ("x86/mm: Remove
    preempt_disable/enable() from __native_flush_tlb()"). The callchain is:

    get_page_from_freelist() -> post_alloc_hook() -> __kernel_map_pages()

    with CONFIG_DEBUG_PAGEALLOC enabled.

    Disable preemption during CR3 reset / __flush_tlb_all() and add a comment
    explaining why preemption has to be disabled so it won't be removed accidentally.

    Add another preemptible() check in __flush_tlb_all() to catch callers with
    preemption enabled when PGE is enabled, because in that case the flush does
    not go through __native_flush_tlb() and its warning. Suggested by Andy Lutomirski.
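
    A sketch of the shape of the fix (simplified from the description above;
    __flush_tlb_all() itself additionally gains a warning when it is entered
    with preemption enabled):

            /* CPA / __kernel_map_pages() path: stay on this CPU while CR3 is
             * reset and the TLB is flushed. */
            preempt_disable();
            __flush_tlb_all();
            preempt_enable();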

    Fixes: decab0888e6e ("x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()")
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181017103432.zgv46nlu3hc7k4rq@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Sebastian Andrzej Siewior
     

09 Oct, 2018

1 commit

  • Arnd Bergmann reported that turning on -Wvla found a new (unintended) VLA usage:

    arch/x86/mm/pgtable.c: In function 'pgd_alloc':
    include/linux/build_bug.h:29:45: error: ISO C90 forbids variable length array 'u_pmds' [-Werror=vla]
    arch/x86/mm/pgtable.c:190:34: note: in expansion of macro 'static_cpu_has'
    #define PREALLOCATED_USER_PMDS (static_cpu_has(X86_FEATURE_PTI) ? \
    ^~~~~~~~~~~~~~
    arch/x86/mm/pgtable.c:431:16: note: in expansion of macro 'PREALLOCATED_USER_PMDS'
    pmd_t *u_pmds[PREALLOCATED_USER_PMDS];
    ^~~~~~~~~~~~~~~~~~~~~~

    Use the actual size of the array that is used for X86_FEATURE_PTI,
    which is known at build time, instead of the variable size.
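
    A toy illustration of the change (made-up names; the point is that the
    array bound becomes a build-time constant covering the PTI case):

    #include <stdio.h>

    #define FEATURE_PTI 1                                   /* stand-in for static_cpu_has(X86_FEATURE_PTI) */
    #define PREALLOCATED_USER_PMDS (FEATURE_PTI ? 4 : 0)    /* runtime-looking bound -> VLA in the kernel */
    #define MAX_PREALLOCATED_USER_PMDS 4                    /* build-time upper bound */

    int main(void)
    {
            /* Before: pmd_t *u_pmds[PREALLOCATED_USER_PMDS];  (flagged by -Wvla) */
            void *u_pmds[MAX_PREALLOCATED_USER_PMDS];        /* fixed size, no VLA */

            printf("%zu slots preallocated\n", sizeof(u_pmds) / sizeof(u_pmds[0]));
            return 0;
    }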

    [ mingo: Squashed original fix with followup fix to avoid bisection breakage, wrote new changelog. ]

    Reported-by: Arnd Bergmann
    Original-written-by: Arnd Bergmann
    Reported-by: Borislav Petkov
    Signed-off-by: Kees Cook
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: Joerg Roedel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Fixes: 1be3f247c288 ("x86/mm: Avoid VLA in pgd_alloc()")
    Link: http://lkml.kernel.org/r/20181008235434.GA35035@beast
    Signed-off-by: Ingo Molnar

    Kees Cook
     

21 Sep, 2018

1 commit

    We met a kernel panic when enabling earlycon, which is due to the fixmap
    address of earlycon not being statically set up.

    Currently the static fixmap setup in head_64.S only covers 2M of virtual
    address space, while it actually could need 4M with different kernel
    configurations, e.g. when VSYSCALL emulation is disabled.

    So increase the static space to 4M for now by defining FIXMAP_PMD_NUM to 2,
    and add a build time check to ensure that the fixmap is covered by the
    initial static page tables.
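
    A toy version of such a build-time coverage check (all names and sizes
    other than FIXMAP_PMD_NUM are made up for illustration; the real check
    lives in the x86 fixmap code):

    #define PMD_MAPPED_SIZE   (2UL << 20)   /* each PMD entry maps 2M */
    #define FIXMAP_PMD_NUM    2             /* PMDs statically set up in head_64.S */
    #define FIXMAP_AREA_SIZE  (3UL << 20)   /* example: the fixmap needs ~3M */

    _Static_assert(FIXMAP_PMD_NUM * PMD_MAPPED_SIZE >= FIXMAP_AREA_SIZE,
                   "initial static page tables do not cover the fixmap");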

    Fixes: 1ad83c858c7d ("x86_64,vsyscall: Make vsyscall emulation configurable")
    Suggested-by: Thomas Gleixner
    Signed-off-by: Feng Tang
    Signed-off-by: Thomas Gleixner
    Tested-by: kernel test robot
    Reviewed-by: Juergen Gross (Xen parts)
    Cc: H Peter Anvin
    Cc: Peter Zijlstra
    Cc: Michal Hocko
    Cc: Yinghai Lu
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Andy Lutomirsky
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180920025828.23699-1-feng.tang@intel.com

    Feng Tang
     

16 Sep, 2018

1 commit

    kvmclock defines a few static variables which are shared with the
    hypervisor during the kvmclock initialization.

    When SEV is active, memory is encrypted with a guest-specific key, and
    if the guest OS wants to share the memory region with the hypervisor
    then it must clear the C-bit before sharing it.

    Currently, we use kernel_physical_mapping_init() to split large pages
    before clearing the C-bit on shared pages. But it fails when called from
    the kvmclock initialization (mainly because the memblock allocator is
    not ready that early during boot).

    Add a __bss_decrypted section attribute which can be used when defining
    such shared variables. The so-defined variables will be placed in the
    .bss..decrypted section. This section will be mapped with C=0 early
    during boot.

    The .bss..decrypted section occupies a big chunk of memory that is unused
    when memory encryption is not active, so free it in that case.
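
    A sketch of the attribute and its use (the define follows the upstream
    patch; the variable is a hypothetical example):

    /* Place the variable in .bss..decrypted, which is mapped with C=0 early. */
    #define __bss_decrypted __attribute__((__section__(".bss..decrypted")))

    /* Hypothetical shared variable, e.g. a clock structure read by the host. */
    static unsigned long shared_with_hypervisor __bss_decrypted;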

    Suggested-by: Thomas Gleixner
    Signed-off-by: Brijesh Singh
    Signed-off-by: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Paolo Bonzini
    Cc: Sean Christopherson
    Cc: Radim Krčmář
    Cc: kvm@vger.kernel.org
    Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com

    Brijesh Singh
     

08 Sep, 2018

1 commit

  • When page-table entries are set, the compiler might optimize their
    assignment by using multiple instructions to set the PTE. This might
    turn into a security hazard if the user somehow manages to use the
    interim PTE. L1TF does not make our lives easier, making even an interim
    non-present PTE a security hazard.

    Using WRITE_ONCE() to set PTEs and friends should prevent this potential
    security hazard.

    I skimmed the differences in the binary with and without this patch. The
    differences are (obviously) greater when CONFIG_PARAVIRT=n as more
    code optimizations are possible. For better or worse, the impact on the
    binary with this patch is pretty small. Skimming the code did not cause
    anything to jump out as a security hazard, but it seems that at least
    move_soft_dirty_pte() caused set_pte_at() to use multiple writes.
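
    The shape of the change, sketched for the simplest helper (simplified;
    the patch touches the whole family of set_pte()-style helpers):

    static inline void native_set_pte(pte_t *ptep, pte_t pte)
    {
            WRITE_ONCE(*ptep, pte);    /* single store, no interim PTE value visible */
    }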

    Signed-off-by: Nadav Amit
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Josh Poimboeuf
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Sean Christopherson
    Cc: Andy Lutomirski
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180902181451.80520-1-namit@vmware.com

    Nadav Amit
     

02 Sep, 2018

1 commit

  • Fix the section mismatch warning in arch/x86/mm/pti.c:

    WARNING: vmlinux.o(.text+0x6972a): Section mismatch in reference from the function pti_clone_pgtable() to the function .init.text:pti_user_pagetable_walk_pte()
    The function pti_clone_pgtable() references
    the function __init pti_user_pagetable_walk_pte().
    This is often because pti_clone_pgtable lacks a __init
    annotation or the annotation of pti_user_pagetable_walk_pte is wrong.
    FATAL: modpost: Section mismatches detected.

    Fixes: 85900ea51577 ("x86/pti: Map the vsyscall page if needed")
    Reported-by: kbuild test robot
    Signed-off-by: Randy Dunlap
    Signed-off-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Link: https://lkml.kernel.org/r/43a6d6a3-d69d-5eda-da09-0b1c88215a2a@infradead.org

    Randy Dunlap
     

01 Sep, 2018

1 commit

  • The trick with flipping bit 63 to avoid loading the address of the 1:1
    mapping of the poisoned page while the 1:1 map is updated used to work when
    unmapping the page. But it falls down horribly when attempting to directly
    set the page as uncacheable.

    The problem is that when the cache mode is changed to uncacheable, the page
    needs to be flushed from the cache first. But the decoy address is
    non-canonical due to bit 63 being flipped, and the CLFLUSH instruction
    throws a #GP fault.

    Add code to change_page_attr_set_clr() to fix the address before calling
    flush.

    Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
    Suggested-by: Linus Torvalds
    Signed-off-by: Tony Luck
    Signed-off-by: Thomas Gleixner
    Acked-by: Linus Torvalds
    Cc: Peter Anvin
    Cc: Borislav Petkov
    Cc: linux-edac
    Cc: Dan Williams
    Cc: Dave Jiang
    Link: https://lkml.kernel.org/r/20180831165506.GA9605@agluck-desk

    Tony Luck
     

31 Aug, 2018

2 commits

  • A NMI can hit in the middle of context switching or in the middle of
    switch_mm_irqs_off(). In either case, CR3 might not match current->mm,
    which could cause copy_from_user_nmi() and friends to read the wrong
    memory.

    Fix it by adding a new nmi_uaccess_okay() helper and checking it in
    copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.
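
    A simplified sketch of the helper's intent (the real one lives in the x86
    tlbflush code and carries extra debug checks):

    static inline bool nmi_uaccess_okay(void)
    {
            /* A user copy from NMI context is only safe when the CPU is really
             * running on current->mm, i.e. CR3 was not caught mid-switch. */
            return this_cpu_read(cpu_tlbstate.loaded_mm) == current->mm;
    }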

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Rik van Riel
    Cc: Nadav Amit
    Cc: Borislav Petkov
    Cc: Jann Horn
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/dd956eba16646fd0b15c3c0741269dfd84452dac.1535557289.git.luto@kernel.org

    Andy Lutomirski
     
  • show_opcodes() is used both for dumping kernel instructions and for dumping
    user instructions. If userspace causes #PF by jumping to a kernel address,
    show_opcodes() can be reached with regs->ip controlled by the user,
    pointing to kernel code. Make sure that userspace can't trick us into
    dumping kernel memory into dmesg.

    Fixes: 7cccf0725cf7 ("x86/dumpstack: Add a show_ip() function")
    Signed-off-by: Jann Horn
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kees Cook
    Reviewed-by: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: security@kernel.org
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180828154901.112726-1-jannh@google.com

    Jann Horn
     

27 Aug, 2018

2 commits

  • Pull perf updates from Thomas Gleixner:
    "Kernel:
    - Improve kallsyms coverage
    - Add x86 entry trampolines to kcore
    - Fix ARM SPE handling
    - Correct PPC event post processing

    Tools:
    - Make the build system more robust
    - Small fixes and enhancements all over the place
    - Update kernel ABI header copies
    - Preparatory work for converting libtraceevent to a shared library
    - License cleanups"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
    tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
    tools arch x86: Update tools's copy of cpufeatures.h
    perf python: Fix pyrf_evlist__read_on_cpu() interface
    perf mmap: Store real cpu number in 'struct perf_mmap'
    perf tools: Remove ext from struct kmod_path
    perf tools: Add gzip_is_compressed function
    perf tools: Add lzma_is_compressed function
    perf tools: Add is_compressed callback to compressions array
    perf tools: Move the temp file processing into decompress_kmodule
    perf tools: Use compression id in decompress_kmodule()
    perf tools: Store compression id into struct dso
    perf tools: Add compression id into 'struct kmod_path'
    perf tools: Make is_supported_compression() static
    perf tools: Make decompress_to_file() function static
    perf tools: Get rid of dso__needs_decompress() call in __open_dso()
    perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
    perf tools: Get rid of dso__needs_decompress() call in read_object_code()
    tools lib traceevent: Change to SPDX License format
    perf llvm: Allow passing options to llc in addition to clang
    perf parser: Improve error message for PMU address filters
    ...

    Linus Torvalds
     
  • Pull x86 fixes from Thomas Gleixner:

    - Correct the L1TF fallout on 32bit and the off by one in the 'too much
    RAM for protection' calculation.

    - Add a helpful kernel message for the 'too much RAM' case

    - Unbreak the VDSO in case the compiler decides to use indirect
    jumps/calls and emits retpolines, which cannot be resolved because the
    kernel uses its own thunks, which do not work for the VDSO. Make it
    use the builtin thunks.

    - Re-export start_thread() which was unexported when the 32/64bit
    implementation was unified. start_thread() is required by modular
    binfmt handlers.

    - Trivial cleanups

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/speculation/l1tf: Suggest what to do on systems with too much RAM
    x86/speculation/l1tf: Fix off-by-one error when warning that system has too much RAM
    x86/kvm/vmx: Remove duplicate l1d flush definitions
    x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
    x86/process: Re-export start_thread()
    x86/mce: Add notifier_block forward declaration
    x86/vdso: Fix vDSO build if a retpoline is emitted

    Linus Torvalds
     

26 Aug, 2018

1 commit

  • …/linux/kernel/git/nvdimm/nvdimm

    Pull libnvdimm memory-failure update from Dave Jiang:
    "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
    backed mappings. The recovery code has specific enabling for several
    possible page states and needs new enabling to handle poison in dax
    mappings.

    In order to support reliable reverse mapping of user space addresses:

    1/ Add new locking in the memory_failure() rmap path to prevent races
    that would typically be handled by the page lock.

    2/ Since dev_pagemap pages are hidden from the page allocator and the
    "compound page" accounting machinery, add a mechanism to determine
    the size of the mapping that encompasses a given poisoned pfn.

    3/ Given pmem errors can be repaired, change the speculatively
    accessed poison protection, mce_unmap_kpfn(), to be reversible and
    otherwise allow ongoing access from the kernel.

    A side effect of this enabling is that MADV_HWPOISON becomes usable
    for dax mappings, however the primary motivation is to allow the
    system to survive userspace consumption of hardware-poison via dax.
    Specifically the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

    ...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

    Given all the cross dependencies I propose taking this through
    nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
    folks"

    * tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm, pmem: Restore page attributes when clearing errors
    x86/memory_failure: Introduce {set, clear}_mce_nospec()
    x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
    mm, memory_failure: Teach memory_failure() about dev_pagemap pages
    filesystem-dax: Introduce dax_lock_mapping_entry()
    mm, memory_failure: Collect mapping size in collect_procs()
    mm, madvise_inject_error: Let memory_failure() optionally take a page reference
    mm, dev_pagemap: Do not clear ->mapping on final put
    mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
    filesystem-dax: Set page->index
    device-dax: Set page->index
    device-dax: Enable page_mapping()
    device-dax: Convert to vmf_insert_mixed and vm_fault_t

    Linus Torvalds
     

24 Aug, 2018

3 commits

  • Two users have reported [1] that they have an "extremely unlikely" system
    with more than MAX_PA/2 memory where the L1TF mitigation is not effective. In
    fact it's a CPU with a 36-bit physical limit (64GB) and 32GB memory, but due to
    holes in the e820 map, the main region is almost 500MB over the 32GB limit:

    [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000081effffff] usable

    Suggestions to use 'mem=32G' to enable the L1TF mitigation while losing the
    500MB revealed that there's an off-by-one error in the check in
    l1tf_select_mitigation().

    l1tf_pfn_limit() returns the last usable pfn (inclusive) and the range
    check in the mitigation path does not take this into account.

    Instead of amending the range check, make l1tf_pfn_limit() return the first
    PFN which is over the limit, which is less error prone. Adjust the other
    users accordingly.

    [1] https://bugzilla.suse.com/show_bug.cgi?id=1105536

    Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
    Reported-by: George Anchev
    Reported-by: Christopher Snowhill
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Thomas Gleixner
    Cc: "H . Peter Anvin"
    Cc: Linus Torvalds
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180823134418.17008-1-vbabka@suse.cz

    Vlastimil Babka
     
  • Merge fixes for missing TLB shootdowns.

    This fixes a couple of cases that involved us possibly freeing page
    table structures before the required TLB shootdown had been done.

    There are a few cleanup patches to make the code easier to follow, and
    to avoid some of the more problematic cases entirely when not necessary.

    To make this easier for backports, it undoes the recent lazy TLB
    patches, because the cleanups and fixes are more important, and Rik is
    ok with re-doing them later when things have calmed down.

    The missing TLB flush was only delayed, and the wrong ordering only
    happened under memory pressure (and in theory under a couple of other
    fairly theoretical situations), so this may have been all very unlikely
    to have hit people in practice.

    But getting the TLB shootdown wrong is _so_ hard to debug and see that I
    consider this a critical fix.

    Many thanks to Jann Horn for having debugged this.

    * tlb-fixes:
    x86/mm: Only use tlb_remove_table() for paravirt
    mm: mmu_notifier fix for tlb_end_vma
    mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
    mm/tlb: Remove tlb_remove_table() non-concurrent condition
    mm: move tlb_table_flush to tlb_flush_mmu_free
    x86/mm/tlb: Revert the recent lazy TLB patches

    Linus Torvalds
     
    If we don't use paravirt, don't play unnecessary and complicated games
    to free page-tables.

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

23 Aug, 2018

1 commit

  • Revert commits:

    95b0e6357d3e x86/mm/tlb: Always use lazy TLB mode
    64482aafe55f x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
    ac0315896970 x86/mm/tlb: Make lazy TLB mode lazier
    61d0beb5796a x86/mm/tlb: Restructure switch_mm_irqs_off()
    2ff6ddf19c0e x86/mm/tlb: Leave lazy TLB mode at page table free time

    This is done in order to simplify the TLB invalidate fixes for x86 and unify
    the parts that need backporting. We'll try again later.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

21 Aug, 2018

3 commits

  • Commit 7b25b9cb0dad83 ("x86/xen/time: Initialize pv xen time in
    init_hypervisor_platform()") moved the mapping of the shared info area
    before pagetable_init(). This breaks booting as a 32-bit PV guest, as the
    use of set_fixmap isn't possible at this time on 32-bit.

    This can be worked around by populating the needed PMD on 32-bit
    kernel earlier.

    In order not to reimplement populate_extra_pte() using extend_brk()
    for allocating new page tables, extend alloc_low_pages() to do that in
    case the early page table pool is not yet available.

    Fixes: 7b25b9cb0dad83 ("x86/xen/time: Initialize pv xen time in init_hypervisor_platform()")
    Signed-off-by: Juergen Gross
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Boris Ostrovsky

    Juergen Gross
     
    In preparation for using set_memory_uc() instead of set_memory_np() for
    isolating poison from speculation, teach the memtype code to sanitize
    physical addresses vs __PHYSICAL_MASK.

    The motivation for using set_memory_uc() for this case is to allow
    ongoing access to persistent memory pages via the pmem-driver +
    memcpy_mcsafe() until the poison is repaired.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Tony Luck
    Cc: Borislav Petkov
    Cc:
    Cc:
    Signed-off-by: Dan Williams
    Acked-by: Ingo Molnar
    Signed-off-by: Dave Jiang

    Dan Williams
     
    On 32bit PAE kernels on 64bit hardware with enough physical bits,
    l1tf_pfn_limit() will overflow an unsigned long. This in turn affects
    max_swapfile_size() and can lead to swapon returning -EINVAL. This has been
    observed in a 32bit guest with 42 bits of physical address size, where
    max_swapfile_size() overflows exactly to 1 << 32, thus zero, and produces
    the following warning to dmesg:

    [ 6.396845] Truncating oversized swap area, only using 0k out of 2047996k

    Fix this by using unsigned long long instead.
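
    A userspace toy showing the wrap-around with the numbers from the report
    (42 physical bits, 4k pages, 3 extra swap-offset bits; the wrap only
    shows in a 32-bit build, e.g. gcc -m32):

    #include <stdio.h>

    int main(void)
    {
            /* pfn limit is 1 << (42 - 1 - 12) = 1 << 29; shifting in the 3
             * swap-offset bits gives exactly 1 << 32. */
            unsigned long      narrow = (1UL  << 29) << 3;   /* wraps to 0 on 32-bit */
            unsigned long long wide   = (1ULL << 29) << 3;   /* 0x100000000 */

            printf("unsigned long: %#lx  unsigned long long: %#llx\n", narrow, wide);
            return 0;
    }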

    Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
    Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
    Reported-by: Dominique Leuenberger
    Reported-by: Adrian Schroeter
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Thomas Gleixner
    Acked-by: Andi Kleen
    Acked-by: Michal Hocko
    Cc: "H . Peter Anvin"
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180820095835.5298-1-vbabka@suse.cz

    Vlastimil Babka
     

18 Aug, 2018

2 commits

  • …nux/kernel/git/acme/linux into perf/urgent

    Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

    kernel:

    - kallsyms, x86: Export addresses of PTI entry trampolines (Alexander Shishkin)

    - kallsyms: Simplify update_iter_mod() (Adrian Hunter)

    - x86: Add entry trampolines to kcore (Adrian Hunter)

    Hardware tracing:

    - Fix auxtrace queue resize (Adrian Hunter)

    Arch specific:

    - Fix uninitialized ARM SPE record error variable (Kim Phillips)

    - Fix trace event post-processing in powerpc (Sandipan Das)

    Build:

    - Fix check-headers.sh AND list path of execution (Alexander Kapshuk)

    - Remove -mcet and -fcf-protection when building the python binding
    with older clang versions (Arnaldo Carvalho de Melo)

    - Make check-headers.sh check based on kernel dir (Jiri Olsa)

    - Move syscall_64.tbl check into check-headers.sh (Jiri Olsa)

    Infrastructure:

    - Check for null when copying nsinfo. (Benno Evers)

    Libraries:

    - Rename libtraceevent prefixes, prep work for making it a shared
    library generally available (Tzvetomir Stoyanov (VMware))

    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

    Ingo Molnar
     
  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    In this patch all the callers of handle_mm_fault() are changed to return
    the vm_fault_t type.

    Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Cc: Matthew Wilcox
    Cc: Richard Henderson
    Cc: Tony Luck
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Richard Kuo
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: James Hogan
    Cc: Ley Foon Tan
    Cc: Jonas Bonn
    Cc: James E.J. Bottomley
    Cc: Benjamin Herrenschmidt
    Cc: Palmer Dabbelt
    Cc: Yoshinori Sato
    Cc: David S. Miller
    Cc: Richard Weinberger
    Cc: Guan Xuetao
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Levin, Alexander (Sasha Levin)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

15 Aug, 2018

4 commits

  • Without program headers for PTI entry trampoline pages, the trampoline
    virtual addresses do not map to anything.

    Example before:

    sudo gdb --quiet vmlinux /proc/kcore
    Reading symbols from vmlinux...done.
    [New process 1]
    Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0 root=UUID=a6096b83-b763-4101-807e-f33daff63233'.
    #0 0x0000000000000000 in irq_stack_union ()
    (gdb) x /21ib 0xfffffe0000006000
    0xfffffe0000006000: Cannot access memory at address 0xfffffe0000006000
    (gdb) quit

    After:

    sudo gdb --quiet vmlinux /proc/kcore
    [sudo] password for ahunter:
    Reading symbols from vmlinux...done.
    [New process 1]
    Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0-fix-4-00005-gd6e65a8b4072 root=UUID=a6096b83-b7'.
    #0 0x0000000000000000 in irq_stack_union ()
    (gdb) x /21ib 0xfffffe0000006000
    0xfffffe0000006000: swapgs
    0xfffffe0000006003: mov %rsp,-0x3e12(%rip) # 0xfffffe00000021f8
    0xfffffe000000600a: xchg %ax,%ax
    0xfffffe000000600c: mov %cr3,%rsp
    0xfffffe000000600f: bts $0x3f,%rsp
    0xfffffe0000006014: and $0xffffffffffffe7ff,%rsp
    0xfffffe000000601b: mov %rsp,%cr3
    0xfffffe000000601e: mov -0x3019(%rip),%rsp # 0xfffffe000000300c
    0xfffffe0000006025: pushq $0x2b
    0xfffffe0000006027: pushq -0x3e35(%rip) # 0xfffffe00000021f8
    0xfffffe000000602d: push %r11
    0xfffffe000000602f: pushq $0x33
    0xfffffe0000006031: push %rcx
    0xfffffe0000006032: push %rdi
    0xfffffe0000006033: mov $0xffffffff91a00010,%rdi
    0xfffffe000000603a: callq 0xfffffe0000006046
    0xfffffe000000603f: pause
    0xfffffe0000006041: lfence
    0xfffffe0000006044: jmp 0xfffffe000000603f
    0xfffffe0000006046: mov %rdi,(%rsp)
    0xfffffe000000604a: retq
    (gdb) quit

    In addition, entry trampolines all map to the same page. Represent that
    by giving the corresponding program headers in kcore the same offset.

    This has the benefit that, when perf tools uses /proc/kcore as a source
    for kernel object code, samples from different CPU trampolines are
    aggregated together. Note, such aggregation is normal for profiling
    i.e. people want to profile the object code, not every different virtual
    address the object code might be mapped to (across different processes
    for example).

    Notes by PeterZ:

    This also adds the KCORE_REMAP functionality.

    Signed-off-by: Adrian Hunter
    Acked-by: Andi Kleen
    Acked-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Joerg Roedel
    Cc: Thomas Gleixner
    Cc: x86@kernel.org
    Link: http://lkml.kernel.org/r/1528289651-4113-4-git-send-email-adrian.hunter@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Adrian Hunter
     
  • Currently, the addresses of PTI entry trampolines are not exported to
    user space. Kernel profiling tools need these addresses to identify the
    kernel code, so add a symbol and address for each CPU's PTI entry
    trampoline.

    Signed-off-by: Alexander Shishkin
    Acked-by: Andi Kleen
    Acked-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Joerg Roedel
    Cc: Thomas Gleixner
    Cc: x86@kernel.org
    Link: http://lkml.kernel.org/r/1528289651-4113-3-git-send-email-adrian.hunter@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Alexander Shishkin
     
  • The introduction of generic_max_swapfile_size and arch-specific versions has
    broken linking on x86 with CONFIG_SWAP=n due to undefined reference to
    'generic_max_swapfile_size'. Fix it by compiling the x86-specific
    max_swapfile_size() only with CONFIG_SWAP=y.

    Reported-by: Tomas Pruzina
    Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
    Signed-off-by: Vlastimil Babka
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Merge L1 Terminal Fault fixes from Thomas Gleixner:
    "L1TF, aka L1 Terminal Fault, is yet another speculative hardware
    engineering trainwreck. It's a hardware vulnerability which allows
    unprivileged speculative access to data which is available in the
    Level 1 Data Cache when the page table entry controlling the virtual
    address, which is used for the access, has the Present bit cleared or
    other reserved bits set.

    If an instruction accesses a virtual address for which the relevant
    page table entry (PTE) has the Present bit cleared or other reserved
    bits set, then speculative execution ignores the invalid PTE and loads
    the referenced data if it is present in the Level 1 Data Cache, as if
    the page referenced by the address bits in the PTE was still present
    and accessible.

    While this is a purely speculative mechanism and the instruction will
    raise a page fault when it is retired eventually, the pure act of
    loading the data and making it available to other speculative
    instructions opens up the opportunity for side channel attacks to
    unprivileged malicious code, similar to the Meltdown attack.

    While Meltdown breaks the user space to kernel space protection, L1TF
    allows to attack any physical memory address in the system and the
    attack works across all protection domains. It allows an attack of SGX
    and also works from inside virtual machines because the speculation
    bypasses the extended page table (EPT) protection mechanism.

    The associated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646

    The mitigations provided by this pull request include:

    - Host side protection by inverting the upper address bits of a non
    present page table entry so the entry points to uncacheable memory.

    - Hypervisor protection by flushing L1 Data Cache on VMENTER.

    - SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
    by offlining the sibling CPU threads. The knobs are available on
    the kernel command line and at runtime via sysfs

    - Control knobs for the hypervisor mitigation, related to L1D flush
    and SMT control. The knobs are available on the kernel command line
    and at runtime via sysfs

    - Extensive documentation about L1TF including various degrees of
    mitigations.

    Thanks to all people who have contributed to this in various ways -
    patches, review, testing, backporting - and the fruitful, sometimes
    heated, but at the end constructive discussions.

    There is work in progress to provide other forms of mitigations, which
    might be less horrible performance wise for a particular kind of
    workloads, but this is not yet ready for consumption due to their
    complexity and limitations"

    * 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
    x86/microcode: Allow late microcode loading with SMT disabled
    tools headers: Synchronise x86 cpufeatures.h for L1TF additions
    x86/mm/kmmio: Make the tracer robust against L1TF
    x86/mm/pat: Make set_memory_np() L1TF safe
    x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
    x86/speculation/l1tf: Invert all not present mappings
    cpu/hotplug: Fix SMT supported evaluation
    KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
    x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
    x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
    Documentation/l1tf: Remove Yonah processors from not vulnerable list
    x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
    x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
    x86: Don't include linux/irq.h from asm/hardirq.h
    x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
    x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
    x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
    x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
    x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
    cpu/hotplug: detect SMT disabled by BIOS
    ...

    Linus Torvalds
     

14 Aug, 2018

2 commits

  • Pull x86 PTI updates from Thomas Gleixner:
    "The Speck brigade sadly provides yet another large set of patches
    destroying the performance which we carefully built and preserved

    - PTI support for 32bit PAE. The missing counter part to the 64bit
    PTI code implemented by Joerg.

    - A set of fixes for the Global Bit mechanics for non PCID CPUs which
    were setting the Global Bit too widely and therefore possibly
    exposing interesting memory needlessly.

    - Protection against userspace-userspace SpectreRSB

    - Support for the upcoming Enhanced IBRS mode, which is preferred
    over IBRS. Unfortunately we don't know the performance impact of
    this, but it's expected to be less horrible than the IBRS
    hammering.

    - Cleanups and simplifications"

    * 'x86/pti' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    x86/mm/pti: Move user W+X check into pti_finalize()
    x86/relocs: Add __end_rodata_aligned to S_REL
    x86/mm/pti: Clone kernel-image on PTE level for 32 bit
    x86/mm/pti: Don't clear permissions in pti_clone_pmd()
    x86/mm/pti: Fix 32 bit PCID check
    x86/mm/init: Remove freed kernel image areas from alias mapping
    x86/mm/init: Add helper for freeing kernel image pages
    x86/mm/init: Pass unconverted symbol addresses to free_init_pages()
    mm: Allow non-direct-map arguments to free_reserved_area()
    x86/mm/pti: Clear Global bit more aggressively
    x86/speculation: Support Enhanced IBRS on future CPUs
    x86/speculation: Protect against userspace-userspace spectreRSB
    x86/kexec: Allocate 8k PGDs for PTI
    Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables"
    x86/mm: Remove in_nmi() warning from vmalloc_fault()
    x86/entry/32: Check for VM86 mode in slow-path check
    perf/core: Make sure the ring-buffer is mapped in all page-tables
    x86/pti: Check the return value of pti_user_pagetable_walk_pmd()
    x86/pti: Check the return value of pti_user_pagetable_walk_p4d()
    x86/entry/32: Add debug code to check entry/exit CR3
    ...

    Linus Torvalds
     
  • Pull x86 mm updates from Thomas Gleixner:

    - Make lazy TLB mode even lazier to avoid pointless switch_mm()
    operations, which reduces CPU load by 1-2% for memcache workloads

    - Small cleanups and improvements all over the place

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mm: Remove redundant check for kmem_cache_create()
    arm/asm/tlb.h: Fix build error implicit func declaration
    x86/mm/tlb: Make clear_asid_other() static
    x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
    x86/mm/tlb: Always use lazy TLB mode
    x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
    x86/mm/tlb: Make lazy TLB mode lazier
    x86/mm/tlb: Restructure switch_mm_irqs_off()
    x86/mm/tlb: Leave lazy TLB mode at page table free time
    mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
    x86/mm: Add TLB purge to free pmd/pte page interfaces
    ioremap: Update pgtable free interfaces with addr
    x86/mm: Disable ioremap free page handling on x86-PAE

    Linus Torvalds
     

11 Aug, 2018

1 commit

  • The user page-table gets the updated kernel mappings in pti_finalize(),
    which runs after the RO+X permissions got applied to the kernel page-table
    in mark_readonly().

    But with CONFIG_DEBUG_WX enabled, the user page-table is already checked in
    mark_readonly() for insecure mappings. This causes false-positive
    warnings, because the user page-table did not get the updated mappings yet.

    Move the W+X check for the user page-table into pti_finalize() after it
    updated all required mappings.

    [ tglx: Folded !NX supported fix ]

    Signed-off-by: Joerg Roedel
    Signed-off-by: Thomas Gleixner
    Cc: "H . Peter Anvin"
    Cc: linux-mm@kvack.org
    Cc: Linus Torvalds
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Jiri Kosina
    Cc: Boris Ostrovsky
    Cc: Brian Gerst
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: daniel.gruss@iaik.tugraz.at
    Cc: hughd@google.com
    Cc: keescook@google.com
    Cc: Andrea Arcangeli
    Cc: Waiman Long
    Cc: Pavel Machek
    Cc: "David H . Gutteridge"
    Cc: joro@8bytes.org
    Link: https://lkml.kernel.org/r/1533727000-9172-1-git-send-email-joro@8bytes.org

    Joerg Roedel
     

09 Aug, 2018

1 commit

    The mmio tracer sets io mapping PTEs and PMDs to non-present when enabled
    without inverting the address bits, which makes the PTE entry vulnerable
    to L1TF.

    Make it use the right low level macros to actually invert the address bits
    to protect against L1TF.

    In principle this could be avoided because MMIO tracing is not likely to be
    enabled on production machines, but the fix is straightforward and for
    consistency's sake it's better to get rid of the open-coded PTE manipulation.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner

    Andi Kleen
     

08 Aug, 2018

2 commits

    set_memory_np() is used to mark kernel mappings not present, but it has
    its own open-coded mechanism which does not have the L1TF protection of
    inverting the address bits.

    Replace the open coded PTE manipulation with the L1TF protecting low level
    PTE routines.

    Passes the CPA self test.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner

    Andi Kleen
     
    On 32 bit the kernel sections are not huge-page aligned. When we clone
    them on PMD level we inevitably map into user-space some areas that are
    normal kernel memory and may contain secrets. To prevent that we need to
    clone the kernel-image on PTE level for 32 bit.

    Also make the page-table cloning code more general so that it can handle
    PMD and PTE level cloning. This can be generalized further in the future to
    also handle clones on the P4D-level.
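
    A sketch of the generalized interface described above (signature per the
    patch description; shown only for illustration):

    enum pti_clone_level {
            PTI_CLONE_PMD,    /* 64 bit: kernel sections are 2M aligned */
            PTI_CLONE_PTE,    /* 32 bit: clone with 4k granularity */
    };

    static void pti_clone_pgtable(unsigned long start, unsigned long end,
                                  enum pti_clone_level level);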

    Signed-off-by: Joerg Roedel
    Signed-off-by: Thomas Gleixner
    Cc: "H . Peter Anvin"
    Cc: linux-mm@kvack.org
    Cc: Linus Torvalds
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Jiri Kosina
    Cc: Boris Ostrovsky
    Cc: Brian Gerst
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: daniel.gruss@iaik.tugraz.at
    Cc: hughd@google.com
    Cc: keescook@google.com
    Cc: Andrea Arcangeli
    Cc: Waiman Long
    Cc: Pavel Machek
    Cc: "David H . Gutteridge"
    Cc: joro@8bytes.org
    Link: https://lkml.kernel.org/r/1533637471-30953-4-git-send-email-joro@8bytes.org

    Joerg Roedel