06 Dec, 2018

2 commits

  • commit dbe733642e01dd108f71436aaea7b328cb28fd87 upstream

    CONFIG_SCHED_SMT is enabled by all distros, so there is no real point in
    keeping it configurable. The runtime overhead in the core scheduler code is
    minimal because the actual SMT scheduling parts are conditional on a static
    key.

    This allows the scheduler's SMT state static key to be exposed to the
    speculation control code. Alternatively the scheduler's static key could be
    made always available when CONFIG_SMP is enabled, but that would just add an
    unused static key to every other architecture for nothing.
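
    As a rough illustration of the pattern involved (a minimal sketch, not
    the kernel's exact code; sched_smt_present is the scheduler key the
    text refers to):

    #include <linux/jump_label.h>

    /* Sketch: the key defaults to false and is flipped when SMT siblings
     * come online; the branch is patched in place, so the disabled case
     * costs only a NOP. */
    DEFINE_STATIC_KEY_FALSE(sched_smt_present);

    static void smt_aware_path(void)
    {
            if (static_branch_likely(&sched_smt_present)) {
                    /* SMT-specific scheduling work goes here. */
            }
    }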

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.337452245@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 4cd24de3a0980bf3100c9dcb08ef65ca7c31af48 upstream

    Since retpoline-capable compilers are widely available, make
    CONFIG_RETPOLINE hard-depend on the compiler capability.

    Break the build when CONFIG_RETPOLINE is enabled and the compiler does not
    support it. Emit an error message in that case:

    "arch/x86/Makefile:226: *** You are building kernel with non-retpoline
    compiler, please update your compiler.. Stop."

    [dwmw: Fail the build with non-retpoline compiler]
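
    The same dependency can also be asserted from C; a hedged sketch of the
    pattern (the x86 Makefile adds -DRETPOLINE to the compile flags once
    compiler support is confirmed, so a header can refuse to build without
    it):

    /* Sketch: fail the build if the Kconfig option is set but the
     * compiler did not get the retpoline flags. */
    #if defined(CONFIG_RETPOLINE) && !defined(RETPOLINE)
    #error "CONFIG_RETPOLINE=y, but the compiler lacks retpoline support"
    #endif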

    Suggested-by: Peter Zijlstra
    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Thomas Gleixner
    Cc: David Woodhouse
    Cc: Borislav Petkov
    Cc: Daniel Borkmann
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Andy Lutomirski
    Cc: Masahiro Yamada
    Cc: Michal Marek
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/cca0cb20-f9e2-4094-840b-fb0f8810cd34@default
    Signed-off-by: Greg Kroah-Hartman

    Zhenzhong Duan
     

05 Sep, 2018

1 commit

  • commit d86564a2f085b79ec046a5cba90188e612352806 upstream.

    Jann reported that x86 was missing required TLB invalidates when he
    hit the !*batch slow path in tlb_remove_table().

    This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
    invalidates by itself. The PowerPC hash MMU, where this code originated,
    and the Sparc hash MMU, which subsequently used it, did not need them.
    ARM, which used this later, put an explicit TLB invalidate in its
    __p*_free_tlb() functions, and PowerPC radix followed that example.

    But when we hooked up x86 we failed to consider this. Fix this by
    (optionally) hooking tlb_remove_table() into the TLB invalidate code.

    NOTE: s390 also needed something like this and might now
    be able to use the generic code again.

    [ Modified to be on top of Nick's cleanups, which simplified this patch
    now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]
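
    The shape of the fix, as a simplified sketch (conditional on a config
    symbol so architectures whose hardware walkers do not need it pay
    nothing; not the verbatim upstream diff):

    /* Called before a page table page is freed on the slow path. */
    static void tlb_table_invalidate(struct mmu_gather *tlb)
    {
    #ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
            /* Invalidate page-table caches used by hardware walkers
             * before the page goes back to the allocator. */
            tlb_flush_mmu_tlbonly(tlb);
    #endif
    }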

    Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
    Reported-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Will Deacon
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

16 Aug, 2018

1 commit

  • commit 05736e4ac13c08a4a9b1ef2de26dd31a32cbee57 upstream

    Provide a command line and a sysfs knob to control SMT.

    The command line options are:

    'nosmt': Enumerate secondary threads, but do not online them

    'nosmt=force': Ignore secondary threads completely during enumeration
    via MP table and ACPI/MADT.

    The sysfs control file has the following states (read/write):

    'on': SMT is enabled. Secondary threads can be freely onlined.
    'off': SMT is disabled. Secondary threads, even if enumerated,
    cannot be onlined.
    'forceoff': SMT is permanently disabled. Writes to the control
    file are rejected.
    'notsupported': SMT is not supported by the CPU.

    The command line option 'nosmt' sets the sysfs control to 'off'. This
    can be changed to 'on' to reenable SMT during runtime.

    The command line option 'nosmt=force' sets the sysfs control to
    'forceoff'. This cannot be changed during runtime.

    When SMT is 'on' and the control file is changed to 'off' then all online
    secondary threads are offlined and attempts to online a secondary thread
    later on are rejected.

    When SMT is 'off' and the control file is changed to 'on' then secondary
    threads can be onlined again. The 'off' -> 'on' transition does not
    automatically online the secondary threads.

    When the control file is set to 'forceoff', the behaviour is the same as
    setting it to 'off', but the operation is irreversible and later writes to
    the control file are rejected.

    When the control status is 'notsupported' then writes to the control file
    are rejected.
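
    A sketch of the state machine behind this (the enum mirrors the one in
    kernel/cpu.c; the sysfs write handler is omitted):

    /* 'nosmt' boots into CPU_SMT_DISABLED, 'nosmt=force' into
     * CPU_SMT_FORCE_DISABLED; writes to the control file move between
     * the first two states only. */
    enum cpuhp_smt_control {
            CPU_SMT_ENABLED,
            CPU_SMT_DISABLED,
            CPU_SMT_FORCE_DISABLED,
            CPU_SMT_NOT_SUPPORTED,
    };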

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Konrad Rzeszutek Wilk
    Acked-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 Mar, 2018

1 commit

  • commit d5028ba8ee5a18c9d0bb926d883c28b370f89009 upstream.

    Disable retpoline validation in objtool if the compiler lacks retpoline
    support, and otherwise select the validation support for
    CONFIG_RETPOLINE=y (most builds would already have it set due to ORC).

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Arjan van de Ven
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Woodhouse
    Cc: Greg Kroah-Hartman
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

22 Feb, 2018

1 commit

  • commit 4675ff05de2d76d167336b368bd07f3fef6ed5a6 upstream.

    Fix up makefiles, remove references, and git rm kmemcheck.

    Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Steven Rostedt
    Cc: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Alexander Potapenko
    Cc: Tim Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

17 Jan, 2018

2 commits

  • commit 76b043848fd22dbf7f8bf3a1452f8c70d557b860 upstream.

    Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
    the corresponding thunks. Provide assembler macros for invoking the thunks
    in the same way that GCC does, from native and inline assembler.

    This adds X86_FEATURE_RETPOLINE and sets it by default on all CPUs. In
    some circumstances, IBRS microcode features may be used instead, and the
    retpoline can be disabled.

    On AMD CPUs if lfence is serialising, the retpoline can be dramatically
    simplified to a simple "lfence; jmp *\reg". A future patch, after it has
    been verified that lfence really is serialising in all circumstances, can
    enable this by setting the X86_FEATURE_RETPOLINE_AMD feature bit in addition
    to X86_FEATURE_RETPOLINE.

    Do not align the retpoline in the altinstr section, because there is no
    guarantee that it stays aligned when it's copied over the oldinstr during
    alternative patching.

    [ Andi Kleen: Rename the macros, add CONFIG_RETPOLINE option, export thunks]
    [ tglx: Put actual function CALL/JMP in front of the macros, convert to
    symbolic labels ]
    [ dwmw2: Convert back to numeric labels, merge objtool fixes ]
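
    For illustration, a sketch of one such thunk (modeled on the description
    above; the symbol name follows the series, but the body is simplified,
    not the exact macro expansion):

    /* An indirect call via %rax becomes "call __x86_indirect_thunk_rax".
     * The thunk 'returns' to the real target while any speculation of the
     * ret is captured in the pause/lfence loop. */
    __asm__(
    ".globl __x86_indirect_thunk_rax\n"
    "__x86_indirect_thunk_rax:\n"
    "   call 1f\n"             /* pushes the address of label 2 */
    "2: pause\n"               /* speculation lands here... */
    "   lfence\n"
    "   jmp 2b\n"              /* ...and spins harmlessly */
    "1: mov %rax, (%rsp)\n"    /* overwrite return address with target */
    "   ret\n"                 /* 'return' to *%rax */
    );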

    Signed-off-by: David Woodhouse
    Signed-off-by: Thomas Gleixner
    Acked-by: Arjan van de Ven
    Acked-by: Ingo Molnar
    Cc: gnomes@lxorguk.ukuu.org.uk
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Josh Poimboeuf
    Cc: thomas.lendacky@amd.com
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Tim Chen
    Cc: Greg Kroah-Hartman
    Cc: Paul Turner
    Link: https://lkml.kernel.org/r/1515707194-20531-4-git-send-email-dwmw@amazon.co.uk
    Signed-off-by: Greg Kroah-Hartman

    David Woodhouse
     
  • commit 61dc0f555b5c761cdafb0ba5bd41ecf22d68a4c4 upstream.

    Implement the CPU vulnerability show functions for meltdown, spectre_v1 and
    spectre_v2.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Borislav Petkov
    Cc: David Woodhouse
    Link: https://lkml.kernel.org/r/20180107214913.177414879@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

30 Dec, 2017

1 commit

  • commit 7bbcbd3d1cdcbacd0f9f8dc4c98d550972f1ca30 upstream.

    The recent cpu_entry_area changes fail to compile on 32-bit when BIGSMP=y
    and NR_CPUS=512, because the fixmap area becomes too big.

    Limit the number of CPUs with BIGSMP to 64, which is already way too big
    for 32-bit, but it's at least a working limitation.

    We performed a quick survey of 32-bit-only machines that might be affected
    by this change negatively, but found none.

    Signed-off-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Dec, 2017

3 commits

  • commit 2aeb07365bcd489620f71390a7d2031cd4dfb83e upstream.

    [ Note, this is a Git cherry-pick of the following commit:

    d17a1d97dc20: ("x86/mm/kasan: don't use vmemmap_populate() to initialize shadow")

    ... for easier x86 PTI code testing and back-porting. ]

    The KASAN shadow is currently mapped using vmemmap_populate() since that
    provides a semi-convenient way to map pages into init_top_pgt. However,
    since that no longer zeroes the mapped pages, it is not suitable for
    KASAN, which requires zeroed shadow memory.

    Add kasan_populate_shadow() interface and use it instead of
    vmemmap_populate(). Besides, this allows us to take advantage of
    gigantic pages and use them to populate the shadow, which should save us
    some memory wasted on page tables and reduce TLB pressure.

    Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Pavel Tatashin
    Cc: Andy Lutomirski
    Cc: Steven Sistare
    Cc: Daniel Jordan
    Cc: Bob Picco
    Cc: Michal Hocko
    Cc: Alexander Potapenko
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: David S. Miller
    Cc: Dmitry Vyukov
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Sam Ravnborg
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • commit 12a8cc7fcf54a8575f094be1e99032ec38aa045c upstream.

    We are going to support boot-time switching between 4- and 5-level
    paging. For KASAN it means we cannot have different KASAN_SHADOW_OFFSET
    for different paging modes: the constant is passed to gcc to generate
    code and cannot be changed at runtime.

    This patch changes KASAN code to use 0xdffffc0000000000 as shadow offset
    for both 4- and 5-level paging.

    For 5-level paging it means that shadow memory region is not aligned to
    PGD boundary anymore and we have to handle unaligned parts of the region
    properly.

    In addition, we have to exclude paravirt code from KASAN instrumentation
    as we now use set_pgd() before KASAN is fully ready.
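
    For reference, the offset feeds into the generic KASAN address-to-shadow
    translation; a minimal sketch (the scale shift of 3 is KASAN's standard
    one-shadow-byte-per-8-bytes mapping):

    #define KASAN_SHADOW_OFFSET      0xdffffc0000000000UL
    #define KASAN_SHADOW_SCALE_SHIFT 3

    /* Every 8 bytes of memory are described by one shadow byte. */
    static inline void *kasan_mem_to_shadow(const void *addr)
    {
            return (void *)(((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                            + KASAN_SHADOW_OFFSET);
    }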

    [kirill.shutemov@linux.intel.com: cleanup, changelog message]
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Cyrill Gorcunov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170929140821.37654-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • commit 11af847446ed0d131cf24d16a7ef3d5ea7a49554 upstream.

    Rename the unwinder config options from:

    CONFIG_ORC_UNWINDER
    CONFIG_FRAME_POINTER_UNWINDER
    CONFIG_GUESS_UNWINDER

    to:

    CONFIG_UNWINDER_ORC
    CONFIG_UNWINDER_FRAME_POINTER
    CONFIG_UNWINDER_GUESS

    ... in order to give them a more logical config namespace.

    Suggested-by: Ingo Molnar
    Signed-off-by: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/73972fc7e2762e91912c6b9584582703d6f1b8cc.1507924831.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Josh Poimboeuf
     

10 Dec, 2017

1 commit

  • [ Upstream commit 39208aa7ecb7d9c4e86df782b5693270313cbab1 ]

    With the section inlining bug fixed for the x86 refcount protection,
    we can turn the config back on.

    Signed-off-by: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Elena
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch
    Link: http://lkml.kernel.org/r/1504382986-49301-3-git-send-email-keescook@chromium.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
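
    In practice the identifier is a single comment line at the top of each
    file, for example:

    // SPDX-License-Identifier: GPL-2.0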

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

12 Sep, 2017

1 commit

  • Pull libnvdimm from Dan Williams:
    "A rework of media error handling in the BTT driver and other updates.
    It has appeared in a few -next releases and collected some late-
    breaking build-error and warning fixups as a result.

    Summary:

    - Media error handling support in the Block Translation Table (BTT)
    driver is reworked to address sleeping-while-atomic locking and
    memory-allocation-context conflicts.

    - The dax_device lookup overhead for xfs and ext4 is moved out of the
    iomap hot-path to a mount-time lookup.

    - A new 'ecc_unit_size' sysfs attribute is added to advertise the
    read-modify-write boundary property of a persistent memory range.

    - Preparatory fix-ups for arm and powerpc pmem support are included
    along with other miscellaneous fixes"

    * tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
    libnvdimm, btt: fix format string warnings
    libnvdimm, btt: clean up warning and error messages
    ext4: fix null pointer dereference on sbi
    libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
    dax: fix FS_DAX=n BLOCK=y compilation
    libnvdimm: fix integer overflow static analysis warning
    libnvdimm, nd_blk: remove mmio_flush_range()
    libnvdimm, btt: rework error clearing
    libnvdimm: fix potential deadlock while clearing errors
    libnvdimm, btt: cache sector_size in arena_info
    libnvdimm, btt: ensure that flags were also unchanged during a map_read
    libnvdimm, btt: refactor map entry operations with macros
    libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
    libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
    ext4: perform dax_device lookup at mount
    ext2: perform dax_device lookup at mount
    xfs: perform dax_device lookup at mount
    dax: introduce a fs_dax_get_by_bdev() helper
    libnvdimm, btt: check memory allocation failure
    libnvdimm, label: fix index block size calculation
    ...

    Linus Torvalds
     

09 Sep, 2017

2 commits

    There are new users of memory hotplug emerging. Some of them require a
    different subset of arch_add_memory. There are some which only require
    allocation of struct pages without mapping those pages to the kernel
    address space. We currently have __add_pages for that purpose. But this
    is rather low-level and not very suitable for code outside of the
    memory hotplug. E.g. x86_64 wants to update max_pfn, which should be done
    by the caller. Introduce add_pages(), which takes care of those details
    where they are needed. Each architecture should define its own
    implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use the
    currently existing __add_pages.
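
    A hedged sketch of what an architecture implementation can look like
    (modeled on the x86_64 version; the signature is simplified, and
    update_end_of_memory_vars() stands in for the arch bookkeeping helper):

    int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages)
    {
            int ret;

            ret = __add_pages(nid, start_pfn, nr_pages);
            WARN_ON_ONCE(ret);

            /* Arch detail the generic caller should not know about:
             * update max_pfn, max_low_pfn and high_memory. */
            update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
                                      nr_pages << PAGE_SHIFT);
            return ret;
    }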

    Link: http://lkml.kernel.org/r/20170817000548.32038-7-jglisse@redhat.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Jérôme Glisse
    Acked-by: Balbir Singh
    Cc: Aneesh Kumar
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
    functionality to x86_64, which should be safer at the first step.

    Link: http://lkml.kernel.org/r/20170717193955.20207-5-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Reviewed-by: Anshuman Khandual
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 Sep, 2017

1 commit

  • Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-initialized PRNG in
    every child process are pretty obvious. However, due to libraries
    having all kinds of internal state, and programs getting compiled with
    many different versions of each library, it is unreasonable to expect
    calling programs to re-initialize everything manually after fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically.

    The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2
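
    A minimal usage sketch (MADV_WIPEONFORK is the constant this series
    introduces; the error handling is illustrative):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Mark a buffer (e.g. PRNG state) so that after fork() the child
     * reads zeroes instead of the parent's contents. */
    static int wipe_on_fork(void *state, size_t len)
    {
            /* Fails with EINVAL on kernels without this feature;
             * callers could fall back to pthread_atfork() handlers. */
            return madvise(state, len, MADV_WIPEONFORK);
    }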

    This patch (of 2):

    MPX only seems to be available on 64 bit CPUs, starting with Skylake and
    Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
    order to free up a VMA flag.

    Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.com
    Signed-off-by: Rik van Riel
    Acked-by: Dave Hansen
    Cc: Mike Kravetz
    Cc: Florian Weimer
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Will Drewry
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Colm MacCártaigh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

05 Sep, 2017

4 commits

  • Pull x86 cache quality monitoring update from Thomas Gleixner:
    "This update provides a complete rewrite of the Cache Quality
    Monitoring (CQM) facility.

    The existing CQM support was duct-taped into perf with a lot of issues
    and the attempts to fix those turned out to be incomplete and
    horrible.

    After lengthy discussions it was decided to integrate the CQM support
    into the Resource Director Technology (RDT) facility, which is the
    obvious choice, as in hardware CQM is part of RDT. This allowed adding
    Memory Bandwidth Monitoring support on top.

    As a result the mechanisms for allocating cache/memory bandwidth and
    the corresponding monitoring mechanisms are integrated into a single
    management facility with a consistent user interface"

    * 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/intel_rdt: Turn off most RDT features on Skylake
    x86/intel_rdt: Add command line options for resource director technology
    x86/intel_rdt: Move special case code for Haswell to a quirk function
    x86/intel_rdt: Remove redundant ternary operator on return
    x86/intel_rdt/cqm: Improve limbo list processing
    x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug
    x86/intel_rdt: Modify the intel_pqr_state for better performance
    x86/intel_rdt/cqm: Clear the default RMID during hotcpu
    x86/intel_rdt: Show bitmask of shareable resource with other executing units
    x86/intel_rdt/mbm: Handle counter overflow
    x86/intel_rdt/mbm: Add mbm counter initialization
    x86/intel_rdt/mbm: Basic counting of MBM events (total and local)
    x86/intel_rdt/cqm: Add CPU hotplug support
    x86/intel_rdt/cqm: Add sched_in support
    x86/intel_rdt: Introduce rdt_enable_key for scheduling
    x86/intel_rdt/cqm: Add mount,umount support
    x86/intel_rdt/cqm: Add rmdir support
    x86/intel_rdt: Separate the ctrl bits from rmdir
    x86/intel_rdt/cqm: Add mon_data
    x86/intel_rdt: Prepare for RDT monitor data support
    ...

    Linus Torvalds
     
  • Pull x86 mm changes from Ingo Molnar:
    "PCID support, 5-level paging support, Secure Memory Encryption support

    The main changes in this cycle are support for three new, complex
    hardware features of x86 CPUs:

    - Add 5-level paging support, which is a new hardware feature on
    upcoming Intel CPUs allowing up to 128 PB of virtual address space
    and 4 PB of physical RAM space - a 512-fold increase over the old
    limits. (Supercomputers of the future forecasting hurricanes on an
    ever warming planet can certainly make good use of more RAM.)

    Many of the necessary changes went upstream in previous cycles,
    v4.14 is the first kernel that can enable 5-level paging.

    This feature is activated via CONFIG_X86_5LEVEL=y - disabled by
    default.

    (By Kirill A. Shutemov)

    - Add 'encrypted memory' support, which is a new hardware feature on
    upcoming AMD CPUs ('Secure Memory Encryption', SME) allowing system
    RAM to be encrypted and decrypted (mostly) transparently by the
    CPU, with a little help from the kernel to transition to/from
    encrypted RAM. Such RAM should be more secure against various
    attacks like RAM access via the memory bus and should make the
    radio signature of memory bus traffic harder to intercept (and
    decrypt) as well.

    This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled
    by default.

    (By Tom Lendacky)

    - Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a
    hardware feature that attaches an address space tag to TLB entries
    and thus allows to skip TLB flushing in many cases, even if we
    switch mm's.

    (By Andy Lutomirski)

    All three of these features were in the works for a long time, and
    it's a coincidence of the three independent development paths that they
    are all enabled in v4.14 at once"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (65 commits)
    x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)
    x86/mm: Use pr_cont() in dump_pagetable()
    x86/mm: Fix SME encryption stack ptr handling
    kvm/x86: Avoid clearing the C-bit in rsvd_bits()
    x86/CPU: Align CR3 defines
    x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
    acpi, x86/mm: Remove encryption mask from ACPI page protection type
    x86/mm, kexec: Fix memory corruption with SME on successive kexecs
    x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt
    x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y
    x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
    x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y
    x86/mm: Allow userspace have mappings above 47-bit
    x86/mm: Prepare to expose larger address space to userspace
    x86/mpx: Do not allow MPX if we have mappings above 47-bit
    x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
    x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
    x86/mm/dump_pagetables: Fix printout of p4d level
    x86/mm/dump_pagetables: Generalize address normalization
    x86/boot: Fix memremap() related build failure
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:

    - Add 'cross-release' support to lockdep, which allows APIs like
    completions, where it's not the 'owner' who releases the lock, to be
    tracked. It's all activated automatically under
    CONFIG_PROVE_LOCKING=y.

    - Clean up (restructure) the x86 atomics op implementation to be more
    readable, in preparation of KASAN annotations. (Dmitry Vyukov)

    - Fix static keys (Paolo Bonzini)

    - Add killable versions of down_read() et al (Kirill Tkhai)

    - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

    - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

    - Remove smp_mb__before_spinlock() and convert its usages, introduce
    smp_mb__after_spinlock() (Peter Zijlstra)

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    locking/lockdep/selftests: Fix mixed read-write ABBA tests
    sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
    acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
    locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
    smp: Avoid using two cache lines for struct call_single_data
    locking/lockdep: Untangle xhlock history save/restore from task independence
    locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
    futex: Remove duplicated code and fix undefined behaviour
    Documentation/locking/atomic: Finish the document...
    locking/lockdep: Fix workqueue crossrelease annotation
    workqueue/lockdep: 'Fix' flush_work() annotation
    locking/lockdep/selftests: Add mixed read-write ABBA tests
    mm, locking/barriers: Clarify tlb_flush_pending() barriers
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
    locking/lockdep: Explicitly initialize wq_barrier::done::map
    locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
    locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
    locking/refcounts, x86/asm: Implement fast refcount overflow protection
    locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
    ...

    Linus Torvalds
     
  • Pull x86 asm updates from Ingo Molnar:

    - Introduce the ORC unwinder, which can be enabled via
    CONFIG_ORC_UNWINDER=y.

    The ORC unwinder is a lightweight, Linux kernel specific debuginfo
    implementation, which aims to be DWARF done right for unwinding.
    Objtool is used to generate the ORC unwinder tables during build, so
    the data format is flexible and kernel internal: there's no
    dependency on debuginfo created by an external toolchain.

    The ORC unwinder is almost two orders of magnitude faster than the
    (out of tree) DWARF unwinder - which is important for perf call graph
    profiling. It is also significantly simpler and is coded defensively:
    there has not been a single ORC related kernel crash so far, even
    with early versions. (knock on wood!)

    But the main advantage is that enabling the ORC unwinder allows
    CONFIG_FRAME_POINTERS to be turned off - which speeds up the kernel
    measurably:

    With frame pointers disabled, GCC does not have to add frame pointer
    instrumentation code to every function in the kernel. The kernel's
    .text size decreases by about 3.2%, resulting in better cache
    utilization and fewer instructions executed, resulting in a broad
    kernel-wide speedup. Average speedup of system calls should be
    roughly in the 1-3% range - measurements by Mel Gorman [1] have shown
    a speedup of 5-10% for some function execution intense workloads.

    The main cost of the unwinder is that the unwinder data has to be
    stored in RAM: the memory cost is 2-4MB of RAM, depending on kernel
    config - which is a modest cost on modern x86 systems.

    Given how young the ORC unwinder code is it's not enabled by default
    - but given the performance advantages the plan is to eventually make
    it the default unwinder on x86.

    See Documentation/x86/orc-unwinder.txt for more details.

    - Remove lguest support: its intended role was that of a temporary
    proof of concept for virtualization, plus its removal will enable the
    reduction (removal) of the paravirt API as well, so Rusty agreed to
    its removal. (Juergen Gross)

    - Clean up and fix FSGS related functionality (Andy Lutomirski)

    - Clean up IO access APIs (Andy Shevchenko)

    - Enhance the symbol namespace (Jiri Slaby)

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
    objtool: Handle GCC stack pointer adjustment bug
    x86/entry/64: Use ENTRY() instead of ALIGN+GLOBAL for stub32_clone()
    x86/fpu/math-emu: Add ENDPROC to functions
    x86/boot/64: Extract efi_pe_entry() from startup_64()
    x86/boot/32: Extract efi_pe_entry() from startup_32()
    x86/lguest: Remove lguest support
    x86/paravirt/xen: Remove xen_patch()
    objtool: Fix objtool fallthrough detection with function padding
    x86/xen/64: Fix the reported SS and CS in SYSCALL
    objtool: Track DRAP separately from callee-saved registers
    objtool: Fix validate_branch() return codes
    x86: Clarify/fix no-op barriers for text_poke_bp()
    x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs
    selftests/x86/fsgsbase: Test selectors 1, 2, and 3
    x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
    x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common
    x86/asm: Fix UNWIND_HINT_REGS macro for older binutils
    x86/asm/32: Fix regs_get_register() on segment registers
    x86/xen/64: Rearrange the SYSCALL entries
    x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads
    ...

    Linus Torvalds
     

01 Sep, 2017

2 commits

  • Dan Williams
     
  • mmio_flush_range() suffers from a lack of clearly-defined semantics,
    and is somewhat ambiguous to port to other architectures where the
    scope of the writeback implied by "flush" and ordering might matter,
    but MMIO would tend to imply non-cacheable anyway. Per the rationale
    in 67a3e8fe9015 ("nd_blk: change aperture mapping from WC to WB"), the
    only existing use is actually to invalidate clean cache lines for
    ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent
    cleanup of the pmem API, that also now happens to be the exact purpose
    of arch_invalidate_pmem(), which would be a far more well-defined tool
    for the job.

    Rather than risk potentially inconsistent implementations of
    mmio_flush_range() for the sake of one callsite, streamline things by
    removing it entirely and instead move the ARCH_MEMREMAP_PMEM related
    definitions up to the libnvdimm level, so they can be shared by NFIT
    as well. This allows NFIT to be enabled for arm64.

    Signed-off-by: Robin Murphy
    Signed-off-by: Dan Williams

    Robin Murphy
     

31 Aug, 2017

1 commit

  • There's a subtle bug in how some of the paravirt guest code handles
    page table freeing on x86:

    On x86 software page table walkers depend on the fact that remote TLB flush
    does an IPI: walk is performed lockless but with interrupts disabled and in
    case the page table is freed the freeing CPU will get blocked as remote TLB
    flush is required. On other architectures which don't require an IPI to do
    remote TLB flush we have an RCU-based mechanism (see
    include/asm-generic/tlb.h for more details).

    In virtualized environments we may want to override the ->flush_tlb_others
    callback in pv_mmu_ops and use a hypercall asking the hypervisor to do a
    remote TLB flush for us. This breaks the assumption about IPIs. Xen PV has
    been doing this for years and the upcoming remote TLB flush for Hyper-V will
    do it too.

    This is not safe, as software page table walkers may step on an already
    freed page.

    Fix the bug by enabling the RCU-based page table freeing mechanism,
    CONFIG_HAVE_RCU_TABLE_FREE=y.

    Testing was done with kernbench and mmap/munmap microbenchmarks, and
    neither showed any noticeable performance impact.
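
    For illustration, the override the text describes amounts to a single
    function-pointer assignment (a sketch; the wrapper function name is
    hypothetical, and Xen's real setup code does this among much else):

    static void __init xen_setup_remote_tlb_flush(void)
    {
            /* Remote TLB flush becomes a hypercall instead of an IPI -
             * exactly what breaks the lockless-walk assumption above. */
            pv_mmu_ops.flush_tlb_others = xen_flush_tlb_others;
    }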

    Suggested-by: Peter Zijlstra
    Signed-off-by: Vitaly Kuznetsov
    Acked-by: Peter Zijlstra
    Acked-by: Juergen Gross
    Acked-by: Kirill A. Shutemov
    Cc: Andrew Cooper
    Cc: Andy Lutomirski
    Cc: Boris Ostrovsky
    Cc: Jork Loeser
    Cc: KY Srinivasan
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/20170828082251.5562-1-vkuznets@redhat.com
    [ Rewrote/fixed/clarified the changelog. ]
    Signed-off-by: Ingo Molnar

    Vitaly Kuznetsov
     

29 Aug, 2017

1 commit

  • Mike Galbraith bisected a boot crash back to the following commit:

    7a46ec0e2f48 ("locking/refcounts, x86/asm: Implement fast refcount overflow protection")

    The crash/hang pattern is:

    > Symptom is a few splats as below, with box finally hanging. Network
    > comes up, but neither ssh nor console login is possible.
    >
    > ------------[ cut here ]------------
    > WARNING: CPU: 4 PID: 0 at net/netlink/af_netlink.c:374 netlink_sock_destruct+0x82/0xa0
    > ...
    > __sk_destruct()
    > rcu_process_callbacks()
    > __do_softirq()
    > irq_exit()
    > smp_apic_timer_interrupt()
    > apic_timer_interrupt()

    We are at -rc7 already, and the code has grown some dependencies, so
    instead of a plain revert disable the config temporarily, in the hope
    of getting real fixes.

    Reported-by: Mike Galbraith
    Tested-by: Mike Galbraith
    Cc: Arnd Bergmann
    Cc: Davidlohr Bueso
    Cc: Elena Reshetova
    Cc: Josh Poimboeuf
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/tip-7a46ec0e2f4850407de5e1d19a44edee6efa58ec@git.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

24 Aug, 2017

1 commit

    Lguest seems to be rather unused these days. It has only seen patches
    ensuring it still builds over the last two years, and its official state
    is "Odd Fixes".

    Remove it in order to be able to clean up the paravirt code.

    Signed-off-by: Juergen Gross
    Acked-by: Rusty Russell
    Acked-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: boris.ostrovsky@oracle.com
    Cc: lguest@lists.ozlabs.org
    Cc: rusty@rustcorp.com.au
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/20170816173157.8633-3-jgross@suse.com
    Signed-off-by: Ingo Molnar

    Juergen Gross
     

20 Aug, 2017

1 commit

  • Pull watchdog fix from Thomas Gleixner:
    "A fix for the hardlockup watchdog to prevent false positives with
    extreme Turbo-Modes which make the perf/NMI watchdog fire faster than
    the hrtimer which is used to verify.

    Slightly larger than the minimal fix, which just would increase the
    hrtimer frequency, but comes with extra overhead of more watchdog
    timer interrupts and thread wakeups for all users.

    With this change we restrict the overhead to the extreme Turbo-Mode
    systems"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    kernel/watchdog: Prevent false positives with turbo modes

    Linus Torvalds
     

19 Aug, 2017

1 commit

  • Commit 05a4a9527931 ("kernel/watchdog: split up config options") lost
    the perf-based hardlockup detector's dependency on PERF_EVENTS, which
    can result in broken builds with some powerpc configurations.

    Restore the dependency. Add it in for x86 too, despite x86 always
    selecting PERF_EVENTS it seems reasonable to make the dependency
    explicit.

    Link: http://lkml.kernel.org/r/20170810114452.6673-1-npiggin@gmail.com
    Fixes: 05a4a9527931 ("kernel/watchdog: split up config options")
    Signed-off-by: Nicholas Piggin
    Acked-by: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

18 Aug, 2017

1 commit

  • The hardlockup detector on x86 uses a performance counter based on unhalted
    CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
    performance counter period, so the hrtimer should fire 2-3 times before the
    performance counter NMI fires. The NMI code checks whether the hrtimer
    fired since the last invocation. If not, it assumes a hard lockup.

    The calculation of those periods is based on the nominal CPU
    frequency. Turbo modes increase the CPU clock frequency and therefore
    shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
    nominal frequency) the perf/NMI period is shorter than the hrtimer period
    which leads to false positives.

    A simple fix would be to shorten the hrtimer period, but that comes with
    the side effect of more frequent hrtimer and softlockup thread wakeups,
    which is not desired.

    Implement a low pass filter, which checks the perf/NMI period against
    kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
    elapsed then the event is ignored and postponed to the next perf/NMI.

    That solves the problem and avoids the overhead of shorter hrtimer periods
    and more frequent softlockup thread wakeups.
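
    A simplified sketch of the filter (modeled on the fix; the threshold
    corresponds to 4/5 of the watchdog period):

    static DEFINE_PER_CPU(ktime_t, last_timestamp);
    static ktime_t watchdog_hrtimer_sample_threshold;

    /* Returns false when the perf/NMI fired too early; the event is
     * then ignored and effectively postponed to the next period. */
    static bool watchdog_check_timestamp(void)
    {
            ktime_t delta, now = ktime_get_mono_fast_ns();

            delta = now - __this_cpu_read(last_timestamp);
            if (delta < watchdog_hrtimer_sample_threshold)
                    return false;
            __this_cpu_write(last_timestamp, now);
            return true;
    }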

    Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector")
    Reported-and-tested-by: Kan Liang
    Signed-off-by: Thomas Gleixner
    Cc: dzickus@redhat.com
    Cc: prarit@redhat.com
    Cc: ak@linux.intel.com
    Cc: babu.moger@oracle.com
    Cc: peterz@infradead.org
    Cc: eranian@google.com
    Cc: acme@redhat.com
    Cc: stable@vger.kernel.org
    Cc: atomlin@redhat.com
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos

    Thomas Gleixner
     

17 Aug, 2017

1 commit

  • This implements refcount_t overflow protection on x86 without a noticeable
    performance impact, though without the fuller checking of REFCOUNT_FULL.

    This is done by duplicating the existing atomic_t refcount implementation
    but with normally a single instruction added to detect if the refcount
    has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
    the handler saturates the refcount_t to INT_MIN / 2. With this overflow
    protection, the erroneous reference release that would follow a wrap back
    to zero is blocked from happening, avoiding the class of refcount-overflow
    use-after-free vulnerabilities entirely.

    Only the overflow case of refcounting can be perfectly protected, since
    it can be detected and stopped before the reference is freed and left to
    be abused by an attacker. There isn't a way to block early decrements,
    and while REFCOUNT_FULL stops increment-from-zero cases (which would
    be the state _after_ an early decrement and stops potential double-free
    conditions), this fast implementation does not, since it would require
    the more expensive cmpxchg loops. Since the overflow case is much more
    common (e.g. missing a "put" during an error path), this protection
    provides real-world protection. For example, the two public refcount
    overflow use-after-free exploits published in 2016 would have been
    rendered unexploitable:

    http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/

    http://cyseclabs.com/page?n=02012016

    This implementation does, however, notice an unchecked decrement to zero
    (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
    resulted in a zero). Decrements under zero are noticed (since they will
    have resulted in a negative value), though this only indicates that a
    use-after-free may have already happened. Such notifications are likely
    avoidable by an attacker that has already exploited a use-after-free
    vulnerability, but it's better to have them reported than allow such
    conditions to remain universally silent.

    On first overflow detection, the refcount value is reset to INT_MIN / 2
    (which serves as a saturation value) and a report and stack trace are
    produced. When operations detect only negative value results (such as
    changing an already saturated value), saturation still happens but no
    notification is performed (since the value was already saturated).

    On the matter of races, since the entire range beyond INT_MAX but before
    0 is negative, every operation at INT_MIN / 2 will trap, leaving no
    overflow-only race condition.

    As for performance, this implementation adds a single "js" instruction
    to the regular execution flow of a copy of the standard atomic_t refcount
    operations. (The non-"and_test" refcount_dec() function, which is uncommon
    in regular refcount design patterns, has an additional "jz" instruction
    to detect reaching exactly zero.) Since this is a forward jump, it is by
    default the non-predicted path, which will be reinforced by dynamic branch
    prediction. The result is this protection having virtually no measurable
    change in performance over standard atomic_t operations. The error path,
    located in .text.unlikely, saves the refcount location and then uses UD0
    to fire a refcount exception handler, which resets the refcount, handles
    reporting, and returns to regular execution. This keeps the changes to
    .text size minimal, avoiding return jumps and open-coded calls to the
    error reporting routine.

    Example assembly comparison:

    refcount_inc() before:

    .text:
    ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)

    refcount_inc() after:

    .text:
    ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)
    ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3
    ...
    .text.unlikely:
    ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx
    ffffffff816c36d7: 0f ff (bad)

    These are the cycle counts comparing a loop of refcount_inc() from 1
    to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
    unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
    (refcount_t-full), and this overflow-protected refcount (refcount_t-fast):

    2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:

                             cycles    protections
    atomic_t            82249267387    none
    refcount_t-fast     82211446892    overflow, untested dec-to-zero
    refcount_t-full    144814735193    overflow, untested dec-to-zero, inc-from-zero

    This code is a modified version of the x86 PAX_REFCOUNT atomic_t
    overflow defense from the last public patch of PaX/grsecurity, based
    on my understanding of the code. Changes or omissions from the original
    code are mine and don't reflect the original grsecurity/PaX code. Thanks
    to PaX Team for various suggestions for improvement for repurposing this
    code to be a refcount-only protection.

    Signed-off-by: Kees Cook
    Reviewed-by: Josh Poimboeuf
    Cc: Alexey Dobriyan
    Cc: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Davidlohr Bueso
    Cc: Elena Reshetova
    Cc: Eric Biggers
    Cc: Eric W. Biederman
    Cc: Greg KH
    Cc: Hans Liljestrand
    Cc: James Bottomley
    Cc: Jann Horn
    Cc: Linus Torvalds
    Cc: Manfred Spraul
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Serge E. Hallyn
    Cc: Thomas Gleixner
    Cc: arozansk@redhat.com
    Cc: axboe@kernel.dk
    Cc: kernel-hardening@lists.openwall.com
    Cc: linux-arch
    Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
    Signed-off-by: Ingo Molnar

    Kees Cook
     

02 Aug, 2017

1 commit

    We currently have CONFIG_RDT_A, which is for the RDT (Resource Director
    Technology) allocation-based resctrl filesystem interface. As a
    preparation to add support for RDT monitoring into the same resctrl
    filesystem, change the config option to CONFIG_RDT, which includes both
    the RDT allocation and monitoring code.

    No functional change.

    Signed-off-by: Vikas Shivappa
    Signed-off-by: Thomas Gleixner
    Cc: ravi.v.shankar@intel.com
    Cc: tony.luck@intel.com
    Cc: fenghua.yu@intel.com
    Cc: peterz@infradead.org
    Cc: eranian@google.com
    Cc: vikas.shivappa@intel.com
    Cc: ak@linux.intel.com
    Cc: davidcc@google.com
    Cc: reinette.chatre@intel.com
    Link: http://lkml.kernel.org/r/1501017287-28083-4-git-send-email-vikas.shivappa@linux.intel.com

    Vikas Shivappa
     

26 Jul, 2017

2 commits

  • There are three mutually exclusive unwinders. Make that more obvious by
    combining them into a multiple-choice selection:

    CONFIG_FRAME_POINTER_UNWINDER
    CONFIG_ORC_UNWINDER
    CONFIG_GUESS_UNWINDER (if CONFIG_EXPERT=y)

    Frame pointers are still the default (for now).

    The old CONFIG_FRAME_POINTER option is still used in some
    arch-independent places, so keep it around, but make it
    invisible to the user on x86 - it's now selected by
    CONFIG_FRAME_POINTER_UNWINDER=y.

    Suggested-by: Ingo Molnar
    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Slaby
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: live-patching@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170725135424.zukjmgpz3plf5pmt@treble
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
    It plugs into the existing x86 unwinder framework.

    It relies on objtool to generate the needed .orc_unwind and
    .orc_unwind_ip sections.

    For more details on why ORC is used instead of DWARF, see
    Documentation/x86/orc-unwinder.txt - but the short version is
    that it's a simplified, fundamentally more robust debuginfo
    data structure, which also allows up to two orders of magnitude
    faster lookups than the DWARF unwinder - which matters to
    profiling workloads like perf.

    Thanks to Andy Lutomirski for the performance improvement ideas:
    splitting the ORC unwind table into two parallel arrays and creating a
    fast lookup table to search a subset of the unwind table.
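
    The compactness comes from the table entry format; for reference, a
    sketch of the ORC entry as introduced by the series (commentary
    abbreviated):

    /* One entry per instruction-pointer range: where to find the
     * previous frame's stack pointer and base pointer. */
    struct orc_entry {
            s16        sp_offset;   /* offset from sp_reg to prev sp */
            s16        bp_offset;   /* offset to the saved bp, if any */
            unsigned   sp_reg:4;    /* register sp_offset is based on */
            unsigned   bp_reg:4;
            unsigned   type:2;      /* call frame, regs, ... */
    } __packed;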

    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Slaby
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: live-patching@vger.kernel.org
    Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
    [ Extended the changelog. ]
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

21 Jul, 2017

1 commit

    Most things are in place and we can enable support for 5-level paging.

    The patch makes XEN_PV and XEN_PVH dependent on !X86_5LEVEL. Neither is
    ready to work with 5-level paging.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Juergen Gross
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170716225954.74185-9-kirill.shutemov@linux.intel.com
    [ Minor readability edits. ]
    Signed-off-by: Ingo Molnar

    Kirill A. Shutemov
     

18 Jul, 2017

2 commits

  • Add early_memremap() support to be able to specify encrypted and
    decrypted mappings with and without write-protection. The use of
    write-protection is necessary when encrypting data "in place". The
    write-protect attribute is considered cacheable for loads, but not
    stores. This implies that the hardware will never give the core a
    dirty line with this memtype.
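
    A usage sketch of the helpers this adds (helper names from the series;
    the surrounding function is illustrative):

    static void __init map_for_in_place_encryption(resource_size_t paddr,
                                                   unsigned long size)
    {
            /* Encrypted + write-protected: stores bypass the cache, so
             * the core is never handed a dirty line for this range. */
            void *dst = early_memremap_encrypted_wp(paddr, size);

            if (!dst)
                    return;
            /* ... write the encrypted contents through dst ... */
            early_memunmap(dst, size);
    }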

    Signed-off-by: Tom Lendacky
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brijesh Singh
    Cc: Dave Young
    Cc: Dmitry Vyukov
    Cc: Jonathan Corbet
    Cc: Konrad Rzeszutek Wilk
    Cc: Larry Woodman
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Michael S. Tsirkin
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Toshimitsu Kani
    Cc: kasan-dev@googlegroups.com
    Cc: kvm@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-efi@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/479b5832c30fae3efa7932e48f81794e86397229.1500319216.git.thomas.lendacky@amd.com
    Signed-off-by: Ingo Molnar

    Tom Lendacky
     
  • Add support for Secure Memory Encryption (SME). This initial support
    provides a Kconfig entry to build the SME support into the kernel and
    defines the memory encryption mask that will be used in subsequent
    patches to mark pages as encrypted.
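
    A sketch of how the mask is used by the later patches (the helpers
    reduce to no-ops when SME is inactive, since the mask is then zero):

    extern unsigned long sme_me_mask;

    /* OR the mask into a physical address or pgprot value to mark the
     * mapping encrypted; clear it for a decrypted (shared) mapping. */
    #define __sme_set(x)    ((x) | sme_me_mask)
    #define __sme_clr(x)    ((x) & ~sme_me_mask)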

    Signed-off-by: Tom Lendacky
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brijesh Singh
    Cc: Dave Young
    Cc: Dmitry Vyukov
    Cc: Jonathan Corbet
    Cc: Konrad Rzeszutek Wilk
    Cc: Larry Woodman
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Michael S. Tsirkin
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Toshimitsu Kani
    Cc: kasan-dev@googlegroups.com
    Cc: kvm@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-efi@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/a6c34d16caaed3bc3e2d6f0987554275bd291554.1500319216.git.thomas.lendacky@amd.com
    Signed-off-by: Ingo Molnar

    Tom Lendacky
     

13 Jul, 2017

1 commit

  • This adds support for compiling with a rough equivalent to the glibc
    _FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer
    overflow checks for string.h functions when the compiler determines the
    size of the source or destination buffer at compile-time. Unlike glibc,
    it covers buffer reads in addition to writes.

    GNU C __builtin_*_chk intrinsics are avoided because they would force a
    much more complex implementation. They aren't designed to detect read
    overflows and offer no real benefit when using an implementation based
    on inline checks. Inline checks don't add up to much code size and
    allow full use of the regular string intrinsics while avoiding the need
    for a bunch of _chk functions and per-arch assembly to avoid wrapper
    overhead.

    This detects various overflows at compile-time in various drivers and
    some non-x86 core kernel code. There will likely be issues caught in
    regular use at runtime too.
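
    A sketch of the inline-check approach for one function (modeled on the
    fortified memcpy; __write_overflow() and __read_overflow2() are
    deliberately left undefined, so reaching them at compile time breaks
    the build):

    static __always_inline void *memcpy(void *p, const void *q, size_t size)
    {
            size_t p_size = __builtin_object_size(p, 0);
            size_t q_size = __builtin_object_size(q, 0);

            if (__builtin_constant_p(size)) {
                    if (p_size < size)
                            __write_overflow();   /* compile-time: write */
                    if (q_size < size)
                            __read_overflow2();   /* compile-time: read */
            }
            if (p_size < size || q_size < size)
                    fortify_panic(__func__);      /* runtime check */
            return __builtin_memcpy(p, q, size);
    }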

    Future improvements left out of initial implementation for simplicity,
    as it's all quite optional and can be done incrementally:

    * Some of the fortified string functions (strncpy, strcat) don't yet
    place a limit on reads from the source based on __builtin_object_size of
    the source buffer.

    * Extending coverage to more string functions like strlcat.

    * It should be possible to optionally use __builtin_object_size(x, 1) for
    some functions (C strings) to detect intra-object overflows (like
    glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative
    approach to avoid likely compatibility issues.

    * The compile-time checks should be made available via a separate config
    option which can be enabled by default (or always enabled) once enough
    time has passed to get the issues it catches fixed.

    Kees said:
    "This is great to have. While it was out-of-tree code, it would have
    blocked at least CVE-2016-3858 from being exploitable (improper size
    argument to strlcpy()). I've sent a number of fixes for
    out-of-bounds-reads that this detected upstream already"

    [arnd@arndb.de: x86: fix fortified memcpy]
    Link: http://lkml.kernel.org/r/20170627150047.660360-1-arnd@arndb.de
    [keescook@chromium.org: avoid panic() in favor of BUG()]
    Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast
    [keescook@chromium.org: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help]
    Link: http://lkml.kernel.org/r/20170526095404.20439-1-danielmicay@gmail.com
    Link: http://lkml.kernel.org/r/1497903987-21002-8-git-send-email-keescook@chromium.org
    Signed-off-by: Daniel Micay
    Signed-off-by: Kees Cook
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Cc: Mark Rutland
    Cc: Daniel Axtens
    Cc: Rasmus Villemoes
    Cc: Andy Shevchenko
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Micay