06 Dec, 2018

2 commits

  • commit dbe733642e01dd108f71436aaea7b328cb28fd87 upstream

    CONFIG_SCHED_SMT is enabled by all distros, so there is no real point in
    keeping it configurable. The runtime overhead in the core scheduler code is
    minimal because the actual SMT scheduling parts are conditional on a static
    key.

    This allows the scheduler's SMT state static key to be exposed to the
    speculation control code. Alternatively the scheduler's static key could be
    made always available when CONFIG_SMP is enabled, but that would just add an
    unused static key to every other architecture for nothing.
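
    As a rough illustration of the pattern involved (a minimal sketch, not
    the kernel's exact code; sched_smt_present is the scheduler key the
    text refers to):

    #include <linux/jump_label.h>

    /* Sketch: the key defaults to false and is flipped when SMT siblings
     * come online; the branch is patched in place, so the disabled case
     * costs only a NOP. */
    DEFINE_STATIC_KEY_FALSE(sched_smt_present);

    static void smt_aware_path(void)
    {
            if (static_branch_likely(&sched_smt_present)) {
                    /* SMT-specific scheduling work goes here. */
            }
    }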

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.337452245@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 4cd24de3a0980bf3100c9dcb08ef65ca7c31af48 upstream

    Since retpoline-capable compilers are widely available, make
    CONFIG_RETPOLINE hard-depend on the compiler capability.

    Break the build when CONFIG_RETPOLINE is enabled and the compiler does not
    support it. Emit an error message in that case:

    "arch/x86/Makefile:226: *** You are building kernel with non-retpoline
    compiler, please update your compiler.. Stop."

    [dwmw: Fail the build with non-retpoline compiler]
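
    The same dependency can also be asserted from C; a hedged sketch of the
    pattern (the x86 Makefile adds -DRETPOLINE to the compile flags once
    compiler support is confirmed, so a header can refuse to build without
    it):

    /* Sketch: fail the build if the Kconfig option is set but the
     * compiler did not get the retpoline flags. */
    #if defined(CONFIG_RETPOLINE) && !defined(RETPOLINE)
    #error "CONFIG_RETPOLINE=y, but the compiler lacks retpoline support"
    #endif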

    Suggested-by: Peter Zijlstra
    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Thomas Gleixner
    Cc: David Woodhouse
    Cc: Borislav Petkov
    Cc: Daniel Borkmann
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Andy Lutomirski
    Cc: Masahiro Yamada
    Cc: Michal Marek
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/cca0cb20-f9e2-4094-840b-fb0f8810cd34@default
    Signed-off-by: Greg Kroah-Hartman

    Zhenzhong Duan
     

05 Sep, 2018

1 commit

  • commit d86564a2f085b79ec046a5cba90188e612352806 upstream.

    Jann reported that x86 was missing required TLB invalidates when he
    hit the !*batch slow path in tlb_remove_table().

    This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
    invalidates by itself. The PowerPC hash MMU, where this code originated,
    and the Sparc hash MMU, which subsequently used it, did not need them.
    ARM, which used this later, put an explicit TLB invalidate in its
    __p*_free_tlb() functions, and PowerPC radix followed that example.

    But when we hooked up x86 we failed to consider this. Fix this by
    (optionally) hooking tlb_remove_table() into the TLB invalidate code.

    NOTE: s390 also needed something like this and might now
    be able to use the generic code again.

    [ Modified to be on top of Nick's cleanups, which simplified this patch
    now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]
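
    The shape of the fix, as a simplified sketch (conditional on a config
    symbol so architectures whose hardware walkers do not need it pay
    nothing; not the verbatim upstream diff):

    /* Called before a page table page is freed on the slow path. */
    static void tlb_table_invalidate(struct mmu_gather *tlb)
    {
    #ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
            /* Invalidate page-table caches used by hardware walkers
             * before the page goes back to the allocator. */
            tlb_flush_mmu_tlbonly(tlb);
    #endif
    }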

    Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
    Reported-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Will Deacon
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

16 Aug, 2018

1 commit

  • commit 05736e4ac13c08a4a9b1ef2de26dd31a32cbee57 upstream

    Provide a command line and a sysfs knob to control SMT.

    The command line options are:

    'nosmt': Enumerate secondary threads, but do not online them

    'nosmt=force': Ignore secondary threads completely during enumeration
    via MP table and ACPI/MADT.

    The sysfs control file has the following states (read/write):

    'on': SMT is enabled. Secondary threads can be freely onlined.
    'off': SMT is disabled. Secondary threads, even if enumerated,
    cannot be onlined.
    'forceoff': SMT is permanently disabled. Writes to the control
    file are rejected.
    'notsupported': SMT is not supported by the CPU.

    The command line option 'nosmt' sets the sysfs control to 'off'. This
    can be changed to 'on' to reenable SMT during runtime.

    The command line option 'nosmt=force' sets the sysfs control to
    'forceoff'. This cannot be changed during runtime.

    When SMT is 'on' and the control file is changed to 'off' then all online
    secondary threads are offlined and attempts to online a secondary thread
    later on are rejected.

    When SMT is 'off' and the control file is changed to 'on' then secondary
    threads can be onlined again. The 'off' -> 'on' transition does not
    automatically online the secondary threads.

    When the control file is set to 'forceoff', the behaviour is the same as
    setting it to 'off', but the operation is irreversible and later writes to
    the control file are rejected.

    When the control status is 'notsupported' then writes to the control file
    are rejected.
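
    A sketch of the state machine behind this (the enum mirrors the one in
    kernel/cpu.c; the sysfs write handler is omitted):

    /* 'nosmt' boots into CPU_SMT_DISABLED, 'nosmt=force' into
     * CPU_SMT_FORCE_DISABLED; writes to the control file move between
     * the first two states only. */
    enum cpuhp_smt_control {
            CPU_SMT_ENABLED,
            CPU_SMT_DISABLED,
            CPU_SMT_FORCE_DISABLED,
            CPU_SMT_NOT_SUPPORTED,
    };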

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Konrad Rzeszutek Wilk
    Acked-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 Mar, 2018

1 commit

  • commit d5028ba8ee5a18c9d0bb926d883c28b370f89009 upstream.

    Disable retpoline validation in objtool if the compiler lacks retpoline
    support, and otherwise select the validation support for
    CONFIG_RETPOLINE=y (most builds would already have it set due to ORC).

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Arjan van de Ven
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Woodhouse
    Cc: Greg Kroah-Hartman
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

22 Feb, 2018

1 commit

  • commit 4675ff05de2d76d167336b368bd07f3fef6ed5a6 upstream.

    Fix up makefiles, remove references, and git rm kmemcheck.

    Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Steven Rostedt
    Cc: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Alexander Potapenko
    Cc: Tim Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

17 Jan, 2018

2 commits

  • commit 76b043848fd22dbf7f8bf3a1452f8c70d557b860 upstream.

    Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
    the corresponding thunks. Provide assembler macros for invoking the thunks
    in the same way that GCC does, from native and inline assembler.

    This adds X86_FEATURE_RETPOLINE and sets it by default on all CPUs. In
    some circumstances, IBRS microcode features may be used instead, and the
    retpoline can be disabled.

    On AMD CPUs if lfence is serialising, the retpoline can be dramatically
    simplified to a simple "lfence; jmp *\reg". A future patch, after it has
    been verified that lfence really is serialising in all circumstances, can
    enable this by setting the X86_FEATURE_RETPOLINE_AMD feature bit in addition
    to X86_FEATURE_RETPOLINE.

    Do not align the retpoline in the altinstr section, because there is no
    guarantee that it stays aligned when it's copied over the oldinstr during
    alternative patching.

    [ Andi Kleen: Rename the macros, add CONFIG_RETPOLINE option, export thunks]
    [ tglx: Put actual function CALL/JMP in front of the macros, convert to
    symbolic labels ]
    [ dwmw2: Convert back to numeric labels, merge objtool fixes ]
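
    For illustration, a sketch of one such thunk (modeled on the description
    above; the symbol name follows the series, but the body is simplified,
    not the exact macro expansion):

    /* An indirect call via %rax becomes "call __x86_indirect_thunk_rax".
     * The thunk 'returns' to the real target while any speculation of the
     * ret is captured in the pause/lfence loop. */
    __asm__(
    ".globl __x86_indirect_thunk_rax\n"
    "__x86_indirect_thunk_rax:\n"
    "   call 1f\n"             /* pushes the address of label 2 */
    "2: pause\n"               /* speculation lands here... */
    "   lfence\n"
    "   jmp 2b\n"              /* ...and spins harmlessly */
    "1: mov %rax, (%rsp)\n"    /* overwrite return address with target */
    "   ret\n"                 /* 'return' to *%rax */
    );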

    Signed-off-by: David Woodhouse
    Signed-off-by: Thomas Gleixner
    Acked-by: Arjan van de Ven
    Acked-by: Ingo Molnar
    Cc: gnomes@lxorguk.ukuu.org.uk
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Josh Poimboeuf
    Cc: thomas.lendacky@amd.com
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Tim Chen
    Cc: Greg Kroah-Hartman
    Cc: Paul Turner
    Link: https://lkml.kernel.org/r/1515707194-20531-4-git-send-email-dwmw@amazon.co.uk
    Signed-off-by: Greg Kroah-Hartman

    David Woodhouse
     
  • commit 61dc0f555b5c761cdafb0ba5bd41ecf22d68a4c4 upstream.

    Implement the CPU vulnerability show functions for meltdown, spectre_v1 and
    spectre_v2.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Borislav Petkov
    Cc: David Woodhouse
    Link: https://lkml.kernel.org/r/20180107214913.177414879@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

30 Dec, 2017

1 commit

  • commit 7bbcbd3d1cdcbacd0f9f8dc4c98d550972f1ca30 upstream.

    The recent cpu_entry_area changes fail to compile on 32-bit when BIGSMP=y
    and NR_CPUS=512, because the fixmap area becomes too big.

    Limit the number of CPUs with BIGSMP to 64, which is already way too big
    for 32-bit, but it's at least a working limitation.

    We performed a quick survey of 32-bit-only machines that might be affected
    by this change negatively, but found none.

    Signed-off-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Dec, 2017

3 commits

  • commit 2aeb07365bcd489620f71390a7d2031cd4dfb83e upstream.

    [ Note, this is a Git cherry-pick of the following commit:

    d17a1d97dc20: ("x86/mm/kasan: don't use vmemmap_populate() to initialize shadow")

    ... for easier x86 PTI code testing and back-porting. ]

    The KASAN shadow is currently mapped using vmemmap_populate() since that
    provides a semi-convenient way to map pages into init_top_pgt. However,
    since that no longer zeroes the mapped pages, it is not suitable for
    KASAN, which requires zeroed shadow memory.

    Add kasan_populate_shadow() interface and use it instead of
    vmemmap_populate(). Besides, this allows us to take advantage of
    gigantic pages and use them to populate the shadow, which should save us
    some memory wasted on page tables and reduce TLB pressure.

    Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Pavel Tatashin
    Cc: Andy Lutomirski
    Cc: Steven Sistare
    Cc: Daniel Jordan
    Cc: Bob Picco
    Cc: Michal Hocko
    Cc: Alexander Potapenko
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: David S. Miller
    Cc: Dmitry Vyukov
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Sam Ravnborg
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • commit 12a8cc7fcf54a8575f094be1e99032ec38aa045c upstream.

    We are going to support boot-time switching between 4- and 5-level
    paging. For KASAN it means we cannot have different KASAN_SHADOW_OFFSET
    for different paging modes: the constant is passed to gcc to generate
    code and cannot be changed at runtime.

    This patch changes KASAN code to use 0xdffffc0000000000 as shadow offset
    for both 4- and 5-level paging.

    For 5-level paging it means that shadow memory region is not aligned to
    PGD boundary anymore and we have to handle unaligned parts of the region
    properly.

    In addition, we have to exclude paravirt code from KASAN instrumentation
    as we now use set_pgd() before KASAN is fully ready.
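
    For reference, the offset feeds into the generic KASAN address-to-shadow
    translation; a minimal sketch (the scale shift of 3 is KASAN's standard
    one-shadow-byte-per-8-bytes mapping):

    #define KASAN_SHADOW_OFFSET      0xdffffc0000000000UL
    #define KASAN_SHADOW_SCALE_SHIFT 3

    /* Every 8 bytes of memory are described by one shadow byte. */
    static inline void *kasan_mem_to_shadow(const void *addr)
    {
            return (void *)(((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                            + KASAN_SHADOW_OFFSET);
    }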

    [kirill.shutemov@linux.intel.com: cleanup, changelog message]
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Cyrill Gorcunov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170929140821.37654-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • commit 11af847446ed0d131cf24d16a7ef3d5ea7a49554 upstream.

    Rename the unwinder config options from:

    CONFIG_ORC_UNWINDER
    CONFIG_FRAME_POINTER_UNWINDER
    CONFIG_GUESS_UNWINDER

    to:

    CONFIG_UNWINDER_ORC
    CONFIG_UNWINDER_FRAME_POINTER
    CONFIG_UNWINDER_GUESS

    ... in order to give them a more logical config namespace.

    Suggested-by: Ingo Molnar
    Signed-off-by: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/73972fc7e2762e91912c6b9584582703d6f1b8cc.1507924831.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Josh Poimboeuf
     

10 Dec, 2017

1 commit

  • [ Upstream commit 39208aa7ecb7d9c4e86df782b5693270313cbab1 ]

    With the section inlining bug fixed for the x86 refcount protection,
    we can turn the config back on.

    Signed-off-by: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Elena
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch
    Link: http://lkml.kernel.org/r/1504382986-49301-3-git-send-email-keescook@chromium.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
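
    In practice the identifier is a single comment line at the top of each
    file, for example:

    // SPDX-License-Identifier: GPL-2.0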

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

12 Sep, 2017

1 commit

  • Pull libnvdimm from Dan Williams:
    "A rework of media error handling in the BTT driver and other updates.
    It has appeared in a few -next releases and collected some late-
    breaking build-error and warning fixups as a result.

    Summary:

    - Media error handling support in the Block Translation Table (BTT)
    driver is reworked to address sleeping-while-atomic locking and
    memory-allocation-context conflicts.

    - The dax_device lookup overhead for xfs and ext4 is moved out of the
    iomap hot-path to a mount-time lookup.

    - A new 'ecc_unit_size' sysfs attribute is added to advertise the
    read-modify-write boundary property of a persistent memory range.

    - Preparatory fix-ups for arm and powerpc pmem support are included
    along with other miscellaneous fixes"

    * tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
    libnvdimm, btt: fix format string warnings
    libnvdimm, btt: clean up warning and error messages
    ext4: fix null pointer dereference on sbi
    libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
    dax: fix FS_DAX=n BLOCK=y compilation
    libnvdimm: fix integer overflow static analysis warning
    libnvdimm, nd_blk: remove mmio_flush_range()
    libnvdimm, btt: rework error clearing
    libnvdimm: fix potential deadlock while clearing errors
    libnvdimm, btt: cache sector_size in arena_info
    libnvdimm, btt: ensure that flags were also unchanged during a map_read
    libnvdimm, btt: refactor map entry operations with macros
    libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
    libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
    ext4: perform dax_device lookup at mount
    ext2: perform dax_device lookup at mount
    xfs: perform dax_device lookup at mount
    dax: introduce a fs_dax_get_by_bdev() helper
    libnvdimm, btt: check memory allocation failure
    libnvdimm, label: fix index block size calculation
    ...

    Linus Torvalds
     

09 Sep, 2017

2 commits

    There are new users of memory hotplug emerging. Some of them require a
    different subset of arch_add_memory. There are some which only require
    allocation of struct pages without mapping those pages to the kernel
    address space. We currently have __add_pages for that purpose. But this
    is rather low-level and not very suitable for code outside of the
    memory hotplug. E.g. x86_64 wants to update max_pfn, which should be done
    by the caller. Introduce add_pages(), which takes care of those details
    where they are needed. Each architecture should define its own
    implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use the
    currently existing __add_pages.
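
    A hedged sketch of what an architecture implementation can look like
    (modeled on the x86_64 version; the signature is simplified, and
    update_end_of_memory_vars() stands in for the arch bookkeeping helper):

    int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages)
    {
            int ret;

            ret = __add_pages(nid, start_pfn, nr_pages);
            WARN_ON_ONCE(ret);

            /* Arch detail the generic caller should not know about:
             * update max_pfn, max_low_pfn and high_memory. */
            update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
                                      nr_pages << PAGE_SHIFT);
            return ret;
    }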

    Link: http://lkml.kernel.org/r/20170817000548.32038-7-jglisse@redhat.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Jérôme Glisse
    Acked-by: Balbir Singh
    Cc: Aneesh Kumar
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
    functionality to x86_64, which should be safer at the first step.

    Link: http://lkml.kernel.org/r/20170717193955.20207-5-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Reviewed-by: Anshuman Khandual
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 Sep, 2017

1 commit

  • Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-initialized PRNG in
    every child process are pretty obvious. However, due to libraries
    having all kinds of internal state, and programs getting compiled with
    many different versions of each library, it is unreasonable to expect
    calling programs to re-initialize everything manually after fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically.

    The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2
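
    A minimal usage sketch (MADV_WIPEONFORK is the constant this series
    introduces; the error handling is illustrative):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Mark a buffer (e.g. PRNG state) so that after fork() the child
     * reads zeroes instead of the parent's contents. */
    static int wipe_on_fork(void *state, size_t len)
    {
            /* Fails with EINVAL on kernels without this feature;
             * callers could fall back to pthread_atfork() handlers. */
            return madvise(state, len, MADV_WIPEONFORK);
    }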

    This patch (of 2):

    MPX only seems to be available on 64 bit CPUs, starting with Skylake and
    Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
    order to free up a VMA flag.

    Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.com
    Signed-off-by: Rik van Riel
    Acked-by: Dave Hansen
    Cc: Mike Kravetz
    Cc: Florian Weimer
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Will Drewry
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Colm MacCártaigh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

05 Sep, 2017

4 commits

  • Pull x86 cache quality monitoring update from Thomas Gleixner:
    "This update provides a complete rewrite of the Cache Quality
    Monitoring (CQM) facility.

    The existing CQM support was duct-taped into perf with a lot of issues
    and the attempts to fix those turned out to be incomplete and
    horrible.

    After lengthy discussions it was decided to integrate the CQM support
    into the Resource Director Technology (RDT) facility, which is the
    obvious choice, as in hardware CQM is part of RDT. This allowed adding
    Memory Bandwidth Monitoring support on top.

    As a result the mechanisms for allocating cache/memory bandwidth and
    the corresponding monitoring mechanisms are integrated into a single
    management facility with a consistent user interface"

    * 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/intel_rdt: Turn off most RDT features on Skylake
    x86/intel_rdt: Add command line options for resource director technology
    x86/intel_rdt: Move special case code for Haswell to a quirk function
    x86/intel_rdt: Remove redundant ternary operator on return
    x86/intel_rdt/cqm: Improve limbo list processing
    x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug
    x86/intel_rdt: Modify the intel_pqr_state for better performance
    x86/intel_rdt/cqm: Clear the default RMID during hotcpu
    x86/intel_rdt: Show bitmask of shareable resource with other executing units
    x86/intel_rdt/mbm: Handle counter overflow
    x86/intel_rdt/mbm: Add mbm counter initialization
    x86/intel_rdt/mbm: Basic counting of MBM events (total and local)
    x86/intel_rdt/cqm: Add CPU hotplug support
    x86/intel_rdt/cqm: Add sched_in support
    x86/intel_rdt: Introduce rdt_enable_key for scheduling
    x86/intel_rdt/cqm: Add mount,umount support
    x86/intel_rdt/cqm: Add rmdir support
    x86/intel_rdt: Separate the ctrl bits from rmdir
    x86/intel_rdt/cqm: Add mon_data
    x86/intel_rdt: Prepare for RDT monitor data support
    ...

    Linus Torvalds
     
  • Pull x86 mm changes from Ingo Molnar:
    "PCID support, 5-level paging support, Secure Memory Encryption support

    The main changes in this cycle are support for three new, complex
    hardware features of x86 CPUs:

    - Add 5-level paging support, which is a new hardware feature on
    upcoming Intel CPUs allowing up to 128 PB of virtual address space
    and 4 PB of physical RAM space - a 512-fold increase over the old
    limits. (Supercomputers of the future forecasting hurricanes on an
    ever warming planet can certainly make good use of more RAM.)

    Many of the necessary changes went upstream in previous cycles,
    v4.14 is the first kernel that can enable 5-level paging.

    This feature is activated via CONFIG_X86_5LEVEL=y - disabled by
    default.

    (By Kirill A. Shutemov)

    - Add 'encrypted memory' support, which is a new hardware feature on
    upcoming AMD CPUs ('Secure Memory Encryption', SME) allowing system
    RAM to be encrypted and decrypted (mostly) transparently by the
    CPU, with a little help from the kernel to transition to/from
    encrypted RAM. Such RAM should be more secure against various
    attacks like RAM access via the memory bus and should make the
    radio signature of memory bus traffic harder to intercept (and
    decrypt) as well.

    This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled
    by default.

    (By Tom Lendacky)

    - Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a
    hardware feature that attaches an address space tag to TLB entries
    and thus allows to skip TLB flushing in many cases, even if we
    switch mm's.

    (By Andy Lutomirski)

    All three of these features were in the works for a long time, and
    it's a coincidence of the three independent development paths that they
    are all enabled in v4.14 at once"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (65 commits)
    x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)
    x86/mm: Use pr_cont() in dump_pagetable()
    x86/mm: Fix SME encryption stack ptr handling
    kvm/x86: Avoid clearing the C-bit in rsvd_bits()
    x86/CPU: Align CR3 defines
    x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
    acpi, x86/mm: Remove encryption mask from ACPI page protection type
    x86/mm, kexec: Fix memory corruption with SME on successive kexecs
    x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt
    x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y
    x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
    x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y
    x86/mm: Allow userspace have mappings above 47-bit
    x86/mm: Prepare to expose larger address space to userspace
    x86/mpx: Do not allow MPX if we have mappings above 47-bit
    x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
    x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
    x86/mm/dump_pagetables: Fix printout of p4d level
    x86/mm/dump_pagetables: Generalize address normalization
    x86/boot: Fix memremap() related build failure
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:

    - Add 'cross-release' support to lockdep, which allows APIs like
    completions, where it's not the 'owner' who releases the lock, to be
    tracked. It's all activated automatically under
    CONFIG_PROVE_LOCKING=y.

    - Clean up (restructure) the x86 atomics op implementation to be more
    readable, in preparation of KASAN annotations. (Dmitry Vyukov)

    - Fix static keys (Paolo Bonzini)

    - Add killable versions of down_read() et al (Kirill Tkhai)

    - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

    - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

    - Remove smp_mb__before_spinlock() and convert its usages, introduce
    smp_mb__after_spinlock() (Peter Zijlstra)

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    locking/lockdep/selftests: Fix mixed read-write ABBA tests
    sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
    acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
    locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
    smp: Avoid using two cache lines for struct call_single_data
    locking/lockdep: Untangle xhlock history save/restore from task independence
    locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
    futex: Remove duplicated code and fix undefined behaviour
    Documentation/locking/atomic: Finish the document...
    locking/lockdep: Fix workqueue crossrelease annotation
    workqueue/lockdep: 'Fix' flush_work() annotation
    locking/lockdep/selftests: Add mixed read-write ABBA tests
    mm, locking/barriers: Clarify tlb_flush_pending() barriers
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
    locking/lockdep: Explicitly initialize wq_barrier::done::map
    locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
    locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
    locking/refcounts, x86/asm: Implement fast refcount overflow protection
    locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
    ...

    Linus Torvalds
     
  • Pull x86 asm updates from Ingo Molnar:

    - Introduce the ORC unwinder, which can be enabled via
    CONFIG_ORC_UNWINDER=y.

    The ORC unwinder is a lightweight, Linux kernel specific debuginfo
    implementation, which aims to be DWARF done right for unwinding.
    Objtool is used to generate the ORC unwinder tables during build, so
    the data format is flexible and kernel internal: there's no
    dependency on debuginfo created by an external toolchain.

    The ORC unwinder is almost two orders of magnitude faster than the
    (out of tree) DWARF unwinder - which is important for perf call graph
    profiling. It is also significantly simpler and is coded defensively:
    there has not been a single ORC related kernel crash so far, even
    with early versions. (knock on wood!)

    But the main advantage is that enabling the ORC unwinder allows
    CONFIG_FRAME_POINTERS to be turned off - which speeds up the kernel
    measurably:

    With frame pointers disabled, GCC does not have to add frame pointer
    instrumentation code to every function in the kernel. The kernel's
    .text size decreases by about 3.2%, resulting in better cache
    utilization and fewer instructions executed, resulting in a broad
    kernel-wide speedup. Average speedup of system calls should be
    roughly in the 1-3% range - measurements by Mel Gorman [1] have shown
    a speedup of 5-10% for some function execution intense workloads.

    The main cost of the unwinder is that the unwinder data has to be
    stored in RAM: the memory cost is 2-4MB of RAM, depending on kernel
    config - which is a modest cost on modern x86 systems.

    Given how young the ORC unwinder code is it's not enabled by default
    - but given the performance advantages the plan is to eventually make
    it the default unwinder on x86.

    See Documentation/x86/orc-unwinder.txt for more details.

    - Remove lguest support: its intended role was that of a temporary
    proof of concept for virtualization, plus its removal will enable the
    reduction (removal) of the paravirt API as well, so Rusty agreed to
    its removal. (Juergen Gross)

    - Clean up and fix FSGS related functionality (Andy Lutomirski)

    - Clean up IO access APIs (Andy Shevchenko)

    - Enhance the symbol namespace (Jiri Slaby)

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
    objtool: Handle GCC stack pointer adjustment bug
    x86/entry/64: Use ENTRY() instead of ALIGN+GLOBAL for stub32_clone()
    x86/fpu/math-emu: Add ENDPROC to functions
    x86/boot/64: Extract efi_pe_entry() from startup_64()
    x86/boot/32: Extract efi_pe_entry() from startup_32()
    x86/lguest: Remove lguest support
    x86/paravirt/xen: Remove xen_patch()
    objtool: Fix objtool fallthrough detection with function padding
    x86/xen/64: Fix the reported SS and CS in SYSCALL
    objtool: Track DRAP separately from callee-saved registers
    objtool: Fix validate_branch() return codes
    x86: Clarify/fix no-op barriers for text_poke_bp()
    x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs
    selftests/x86/fsgsbase: Test selectors 1, 2, and 3
    x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
    x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common
    x86/asm: Fix UNWIND_HINT_REGS macro for older binutils
    x86/asm/32: Fix regs_get_register() on segment registers
    x86/xen/64: Rearrange the SYSCALL entries
    x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads
    ...

    Linus Torvalds
     

01 Sep, 2017

2 commits

  • Dan Williams
     
  • mmio_flush_range() suffers from a lack of clearly-defined semantics,
    and is somewhat ambiguous to port to other architectures where the
    scope of the writeback implied by "flush" and ordering might matter,
    but MMIO would tend to imply non-cacheable anyway. Per the rationale
    in 67a3e8fe9015 ("nd_blk: change aperture mapping from WC to WB"), the
    only existing use is actually to invalidate clean cache lines for
    ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent
    cleanup of the pmem API, that also now happens to be the exact purpose
    of arch_invalidate_pmem(), which would be a far more well-defined tool
    for the job.

    Rather than risk potentially inconsistent implementations of
    mmio_flush_range() for the sake of one callsite, streamline things by
    removing it entirely and instead move the ARCH_MEMREMAP_PMEM related
    definitions up to the libnvdimm level, so they can be shared by NFIT
    as well. This allows NFIT to be enabled for arm64.

    Signed-off-by: Robin Murphy
    Signed-off-by: Dan Williams

    Robin Murphy
     

31 Aug, 2017

1 commit

  • There's a subtle bug in how some of the paravirt guest code handles
    page table freeing on x86:

    On x86 software page table walkers depend on the fact that remote TLB flush
    does an IPI: walk is performed lockless but with interrupts disabled and in
    case the page table is freed the freeing CPU will get blocked as remote TLB
    flush is required. On other architectures which don't require an IPI to do
    remote TLB flush we have an RCU-based mechanism (see
    include/asm-generic/tlb.h for more details).

    In virtualized environments we may want to override the ->flush_tlb_others
    callback in pv_mmu_ops and use a hypercall asking the hypervisor to do a
    remote TLB flush for us. This breaks the assumption about IPIs. Xen PV has
    been doing this for years and the upcoming remote TLB flush for Hyper-V will
    do it too.

    This is not safe, as software page table walkers may step on an already
    freed page.

    Fix the bug by enabling the RCU-based page table freeing mechanism,
    CONFIG_HAVE_RCU_TABLE_FREE=y.

    Testing was done with kernbench and mmap/munmap microbenchmarks, and
    neither showed any noticeable performance impact.
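
    For illustration, the override the text describes amounts to a single
    function-pointer assignment (a sketch; the wrapper function name is
    hypothetical, and Xen's real setup code does this among much else):

    static void __init xen_setup_remote_tlb_flush(void)
    {
            /* Remote TLB flush becomes a hypercall instead of an IPI -
             * exactly what breaks the lockless-walk assumption above. */
            pv_mmu_ops.flush_tlb_others = xen_flush_tlb_others;
    }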

    Suggested-by: Peter Zijlstra
    Signed-off-by: Vitaly Kuznetsov
    Acked-by: Peter Zijlstra
    Acked-by: Juergen Gross
    Acked-by: Kirill A. Shutemov
    Cc: Andrew Cooper
    Cc: Andy Lutomirski
    Cc: Boris Ostrovsky
    Cc: Jork Loeser
    Cc: KY Srinivasan
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/20170828082251.5562-1-vkuznets@redhat.com
    [ Rewrote/fixed/clarified the changelog. ]
    Signed-off-by: Ingo Molnar

    Vitaly Kuznetsov
     

29 Aug, 2017

1 commit

  • Mike Galbraith bisected a boot crash back to the following commit:

    7a46ec0e2f48 ("locking/refcounts, x86/asm: Implement fast refcount overflow protection")

    The crash/hang pattern is:

    > Symptom is a few splats as below, with box finally hanging. Network
    > comes up, but neither ssh nor console login is possible.
    >
    > ------------[ cut here ]------------
    > WARNING: CPU: 4 PID: 0 at net/netlink/af_netlink.c:374 netlink_sock_destruct+0x82/0xa0
    > ...
    > __sk_destruct()
    > rcu_process_callbacks()
    > __do_softirq()
    > irq_exit()
    > smp_apic_timer_interrupt()
    > apic_timer_interrupt()

    We are at -rc7 already, and the code has grown some dependencies, so
    instead of a plain revert disable the config temporarily, in the hope
    of getting real fixes.

    Reported-by: Mike Galbraith
    Tested-by: Mike Galbraith
    Cc: Arnd Bergmann
    Cc: Davidlohr Bueso
    Cc: Elena Reshetova
    Cc: Josh Poimboeuf
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/tip-7a46ec0e2f4850407de5e1d19a44edee6efa58ec@git.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

24 Aug, 2017

1 commit

    Lguest seems to be rather unused these days. It has only seen patches
    ensuring it still builds over the last two years, and its official state
    is "Odd Fixes".

    Remove it in order to be able to clean up the paravirt code.

    Signed-off-by: Juergen Gross
    Acked-by: Rusty Russell
    Acked-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: boris.ostrovsky@oracle.com
    Cc: lguest@lists.ozlabs.org
    Cc: rusty@rustcorp.com.au
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/20170816173157.8633-3-jgross@suse.com
    Signed-off-by: Ingo Molnar

    Juergen Gross
     

20 Aug, 2017

1 commit

  • Pull watchdog fix from Thomas Gleixner:
    "A fix for the hardlockup watchdog to prevent false positives with
    extreme Turbo-Modes which make the perf/NMI watchdog fire faster than
    the hrtimer which is used to verify.

    Slightly larger than the minimal fix, which just would increase the
    hrtimer frequency, but comes with extra overhead of more watchdog
    timer interrupts and thread wakeups for all users.

    With this change we restrict the overhead to the extreme Turbo-Mode
    systems"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    kernel/watchdog: Prevent false positives with turbo modes

    Linus Torvalds
     

19 Aug, 2017

1 commit

  • Commit 05a4a9527931 ("kernel/watchdog: split up config options") lost
    the perf-based hardlockup detector's dependency on PERF_EVENTS, which
    can result in broken builds with some powerpc configurations.

    Restore the dependency. Add it in for x86 too, despite x86 always
    selecting PERF_EVENTS it seems reasonable to make the dependency
    explicit.

    Link: http://lkml.kernel.org/r/20170810114452.6673-1-npiggin@gmail.com
    Fixes: 05a4a9527931 ("kernel/watchdog: split up config options")
    Signed-off-by: Nicholas Piggin
    Acked-by: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

18 Aug, 2017

1 commit

  • The hardlockup detector on x86 uses a performance counter based on unhalted
    CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
    performance counter period, so the hrtimer should fire 2-3 times before the
    performance counter NMI fires. The NMI code checks whether the hrtimer
    fired since the last invocation. If not, it assumes a hard lockup.

    The calculation of those periods is based on the nominal CPU
    frequency. Turbo modes increase the CPU clock frequency and therefore
    shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
    nominal frequency) the perf/NMI period is shorter than the hrtimer period
    which leads to false positives.

    A simple fix would be to shorten the hrtimer period, but that comes with
    the side effect of more frequent hrtimer and softlockup thread wakeups,
    which is not desired.

    Implement a low pass filter, which checks the perf/NMI period against
    kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
    elapsed then the event is ignored and postponed to the next perf/NMI.

    That solves the problem and avoids the overhead of shorter hrtimer periods
    and more frequent softlockup thread wakeups.
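
    A simplified sketch of the filter (modeled on the fix; the threshold
    corresponds to 4/5 of the watchdog period):

    static DEFINE_PER_CPU(ktime_t, last_timestamp);
    static ktime_t watchdog_hrtimer_sample_threshold;

    /* Returns false when the perf/NMI fired too early; the event is
     * then ignored and effectively postponed to the next period. */
    static bool watchdog_check_timestamp(void)
    {
            ktime_t delta, now = ktime_get_mono_fast_ns();

            delta = now - __this_cpu_read(last_timestamp);
            if (delta < watchdog_hrtimer_sample_threshold)
                    return false;
            __this_cpu_write(last_timestamp, now);
            return true;
    }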

    Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector")
    Reported-and-tested-by: Kan Liang
    Signed-off-by: Thomas Gleixner
    Cc: dzickus@redhat.com
    Cc: prarit@redhat.com
    Cc: ak@linux.intel.com
    Cc: babu.moger@oracle.com
    Cc: peterz@infradead.org
    Cc: eranian@google.com
    Cc: acme@redhat.com
    Cc: stable@vger.kernel.org
    Cc: atomlin@redhat.com
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos

    Thomas Gleixner
     

17 Aug, 2017

1 commit

  • This implements refcount_t overflow protection on x86 without a noticeable
    performance impact, though without the fuller checking of REFCOUNT_FULL.

    This is done by duplicating the existing atomic_t refcount implementation
    but with normally a single instruction added to detect if the refcount
    has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
    the handler saturates the refcount_t to INT_MIN / 2. With this overflow
    protection, the erroneous reference release that would follow a wrap back
    to zero is blocked from happening, avoiding the class of refcount-overflow
    use-after-free vulnerabilities entirely.

    Only the overflow case of refcounting can be perfectly protected, since
    it can be detected and stopped before the reference is freed and left to
    be abused by an attacker. There isn't a way to block early decrements,
    and while REFCOUNT_FULL stops increment-from-zero cases (which would
    be the state _after_ an early decrement and stops potential double-free
    conditions), this fast implementation does not, since it would require
    the more expensive cmpxchg loops. Since the overflow case is much more
    common (e.g. missing a "put" during an error path), this protection
    provides real-world protection. For example, the two public refcount
    overflow use-after-free exploits published in 2016 would have been
    rendered unexploitable:

    http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/

    http://cyseclabs.com/page?n=02012016

    This implementation does, however, notice an unchecked decrement to zero
    (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
    resulted in a zero). Decrements under zero are noticed (since they will
    have resulted in a negative value), though this only indicates that a
    use-after-free may have already happened. Such notifications are likely
    avoidable by an attacker that has already exploited a use-after-free
    vulnerability, but it's better to have them reported than allow such
    conditions to remain universally silent.

    On first overflow detection, the refcount value is reset to INT_MIN / 2
    (which serves as a saturation value) and a report and stack trace are
    produced. When operations detect only negative value results (such as
    changing an already saturated value), saturation still happens but no
    notification is performed (since the value was already saturated).

    On the matter of races, since the entire range beyond INT_MAX but before
    0 is negative, every operation at INT_MIN / 2 will trap, leaving no
    overflow-only race condition.

    As for performance, this implementation adds a single "js" instruction
    to the regular execution flow of a copy of the standard atomic_t refcount
    operations. (The non-"and_test" refcount_dec() function, which is uncommon
    in regular refcount design patterns, has an additional "jz" instruction
    to detect reaching exactly zero.) Since this is a forward jump, it is by
    default the non-predicted path, which will be reinforced by dynamic branch
    prediction. The result is this protection having virtually no measurable
    change in performance over standard atomic_t operations. The error path,
    located in .text.unlikely, saves the refcount location and then uses UD0
    to fire a refcount exception handler, which resets the refcount, handles
    reporting, and returns to regular execution. This keeps the changes to
    .text size minimal, avoiding return jumps and open-coded calls to the
    error reporting routine.

    Example assembly comparison:

    refcount_inc() before:

    .text:
    ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)

    refcount_inc() after:

    .text:
    ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)
    ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3
    ...
    .text.unlikely:
    ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx
    ffffffff816c36d7: 0f ff (bad)

    These are the cycle counts comparing a loop of refcount_inc() from 1
    to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
    unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
    (refcount_t-full), and this overflow-protected refcount (refcount_t-fast):

    2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:

                             cycles    protections
    atomic_t            82249267387    none
    refcount_t-fast     82211446892    overflow, untested dec-to-zero
    refcount_t-full    144814735193    overflow, untested dec-to-zero, inc-from-zero

    This code is a modified version of the x86 PAX_REFCOUNT atomic_t
    overflow defense from the last public patch of PaX/grsecurity, based
    on my understanding of the code. Changes or omissions from the original
    code are mine and don't reflect the original grsecurity/PaX code. Thanks
    to PaX Team for various suggestions for improvement for repurposing this
    code to be a refcount-only protection.

    Signed-off-by: Kees Cook
    Reviewed-by: Josh Poimboeuf
    Cc: Alexey Dobriyan
    Cc: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Davidlohr Bueso
    Cc: Elena Reshetova
    Cc: Eric Biggers
    Cc: Eric W. Biederman
    Cc: Greg KH
    Cc: Hans Liljestrand
    Cc: James Bottomley
    Cc: Jann Horn
    Cc: Linus Torvalds
    Cc: Manfred Spraul
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Serge E. Hallyn
    Cc: Thomas Gleixner
    Cc: arozansk@redhat.com
    Cc: axboe@kernel.dk
    Cc: kernel-hardening@lists.openwall.com
    Cc: linux-arch
    Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
    Signed-off-by: Ingo Molnar

    Kees Cook
     

02 Aug, 2017

1 commit

    We currently have CONFIG_RDT_A, which is for the RDT (Resource Director
    Technology) allocation-based resctrl filesystem interface. As a
    preparation to add support for RDT monitoring into the same resctrl
    filesystem, change the config option to CONFIG_RDT, which includes both
    the RDT allocation and monitoring code.

    No functional change.

    Signed-off-by: Vikas Shivappa
    Signed-off-by: Thomas Gleixner
    Cc: ravi.v.shankar@intel.com
    Cc: tony.luck@intel.com
    Cc: fenghua.yu@intel.com
    Cc: peterz@infradead.org
    Cc: eranian@google.com
    Cc: vikas.shivappa@intel.com
    Cc: ak@linux.intel.com
    Cc: davidcc@google.com
    Cc: reinette.chatre@intel.com
    Link: http://lkml.kernel.org/r/1501017287-28083-4-git-send-email-vikas.shivappa@linux.intel.com

    Vikas Shivappa
     

26 Jul, 2017

2 commits

  • There are three mutually exclusive unwinders. Make that more obvious by
    combining them into a multiple-choice selection:

    CONFIG_FRAME_POINTER_UNWINDER
    CONFIG_ORC_UNWINDER
    CONFIG_GUESS_UNWINDER (if CONFIG_EXPERT=y)

    Frame pointers are still the default (for now).

    The old CONFIG_FRAME_POINTER option is still used in some
    arch-independent places, so keep it around, but make it
    invisible to the user on x86 - it's now selected by
    CONFIG_FRAME_POINTER_UNWINDER=y.

    Suggested-by: Ingo Molnar
    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Slaby
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: live-patching@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170725135424.zukjmgpz3plf5pmt@treble
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
    It plugs into the existing x86 unwinder framework.

    It relies on objtool to generate the needed .orc_unwind and
    .orc_unwind_ip sections.

    For more details on why ORC is used instead of DWARF, see
    Documentation/x86/orc-unwinder.txt - but the short version is
    that it's a simplified, fundamentally more robust debuginfo
    data structure, which also allows up to two orders of magnitude
    faster lookups than the DWARF unwinder - which matters to
    profiling workloads like perf.

    Thanks to Andy Lutomirski for the performance improvement ideas:
    splitting the ORC unwind table into two parallel arrays and creating a
    fast lookup table to search a subset of the unwind table.
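
    The compactness comes from the table entry format; for reference, a
    sketch of the ORC entry as introduced by the series (commentary
    abbreviated):

    /* One entry per instruction-pointer range: where to find the
     * previous frame's stack pointer and base pointer. */
    struct orc_entry {
            s16        sp_offset;   /* offset from sp_reg to prev sp */
            s16        bp_offset;   /* offset to the saved bp, if any */
            unsigned   sp_reg:4;    /* register sp_offset is based on */
            unsigned   bp_reg:4;
            unsigned   type:2;      /* call frame, regs, ... */
    } __packed;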

    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Slaby
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: live-patching@vger.kernel.org
    Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
    [ Extended the changelog. ]
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

21 Jul, 2017

1 commit

    Most things are in place and we can enable support for 5-level paging.

    The patch makes XEN_PV and XEN_PVH dependent on !X86_5LEVEL. Neither is
    ready to work with 5-level paging.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Juergen Gross
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170716225954.74185-9-kirill.shutemov@linux.intel.com
    [ Minor readability edits. ]
    Signed-off-by: Ingo Molnar

    Kirill A. Shutemov
     

18 Jul, 2017

2 commits

  • Add early_memremap() support to be able to specify encrypted and
    decrypted mappings with and without write-protection. The use of
    write-protection is necessary when encrypting data "in place". The
    write-protect attribute is considered cacheable for loads, but not
    stores. This implies that the hardware will never give the core a
    dirty line with this memtype.
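
    A usage sketch of the helpers this adds (helper names from the series;
    the surrounding function is illustrative):

    static void __init map_for_in_place_encryption(resource_size_t paddr,
                                                   unsigned long size)
    {
            /* Encrypted + write-protected: stores bypass the cache, so
             * the core is never handed a dirty line for this range. */
            void *dst = early_memremap_encrypted_wp(paddr, size);

            if (!dst)
                    return;
            /* ... write the encrypted contents through dst ... */
            early_memunmap(dst, size);
    }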

    Signed-off-by: Tom Lendacky
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brijesh Singh
    Cc: Dave Young
    Cc: Dmitry Vyukov
    Cc: Jonathan Corbet
    Cc: Konrad Rzeszutek Wilk
    Cc: Larry Woodman
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Michael S. Tsirkin
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Toshimitsu Kani
    Cc: kasan-dev@googlegroups.com
    Cc: kvm@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-efi@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/479b5832c30fae3efa7932e48f81794e86397229.1500319216.git.thomas.lendacky@amd.com
    Signed-off-by: Ingo Molnar

    Tom Lendacky
     
  • Add support for Secure Memory Encryption (SME). This initial support
    provides a Kconfig entry to build the SME support into the kernel and
    defines the memory encryption mask that will be used in subsequent
    patches to mark pages as encrypted.
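
    A sketch of how the mask is used by the later patches (the helpers
    reduce to no-ops when SME is inactive, since the mask is then zero):

    extern unsigned long sme_me_mask;

    /* OR the mask into a physical address or pgprot value to mark the
     * mapping encrypted; clear it for a decrypted (shared) mapping. */
    #define __sme_set(x)    ((x) | sme_me_mask)
    #define __sme_clr(x)    ((x) & ~sme_me_mask)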

    Signed-off-by: Tom Lendacky
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brijesh Singh
    Cc: Dave Young
    Cc: Dmitry Vyukov
    Cc: Jonathan Corbet
    Cc: Konrad Rzeszutek Wilk
    Cc: Larry Woodman
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Michael S. Tsirkin
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Toshimitsu Kani
    Cc: kasan-dev@googlegroups.com
    Cc: kvm@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-efi@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/a6c34d16caaed3bc3e2d6f0987554275bd291554.1500319216.git.thomas.lendacky@amd.com
    Signed-off-by: Ingo Molnar

    Tom Lendacky
     

13 Jul, 2017

1 commit

  • This adds support for compiling with a rough equivalent to the glibc
    _FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer
    overflow checks for string.h functions when the compiler determines the
    size of the source or destination buffer at compile-time. Unlike glibc,
    it covers buffer reads in addition to writes.

    GNU C __builtin_*_chk intrinsics are avoided because they would force a
    much more complex implementation. They aren't designed to detect read
    overflows and offer no real benefit when using an implementation based
    on inline checks. Inline checks don't add up to much code size and
    allow full use of the regular string intrinsics while avoiding the need
    for a bunch of _chk functions and per-arch assembly to avoid wrapper
    overhead.

    This detects various overflows at compile-time in various drivers and
    some non-x86 core kernel code. There will likely be issues caught in
    regular use at runtime too.
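
    A sketch of the inline-check approach for one function (modeled on the
    fortified memcpy; __write_overflow() and __read_overflow2() are
    deliberately left undefined, so reaching them at compile time breaks
    the build):

    static __always_inline void *memcpy(void *p, const void *q, size_t size)
    {
            size_t p_size = __builtin_object_size(p, 0);
            size_t q_size = __builtin_object_size(q, 0);

            if (__builtin_constant_p(size)) {
                    if (p_size < size)
                            __write_overflow();   /* compile-time: write */
                    if (q_size < size)
                            __read_overflow2();   /* compile-time: read */
            }
            if (p_size < size || q_size < size)
                    fortify_panic(__func__);      /* runtime check */
            return __builtin_memcpy(p, q, size);
    }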

    Future improvements left out of initial implementation for simplicity,
    as it's all quite optional and can be done incrementally:

    * Some of the fortified string functions (strncpy, strcat) don't yet
    place a limit on reads from the source based on __builtin_object_size of
    the source buffer.

    * Extending coverage to more string functions like strlcat.

    * It should be possible to optionally use __builtin_object_size(x, 1) for
    some functions (C strings) to detect intra-object overflows (like
    glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative
    approach to avoid likely compatibility issues.

    * The compile-time checks should be made available via a separate config
    option which can be enabled by default (or always enabled) once enough
    time has passed to get the issues it catches fixed.

    Kees said:
    "This is great to have. While it was out-of-tree code, it would have
    blocked at least CVE-2016-3858 from being exploitable (improper size
    argument to strlcpy()). I've sent a number of fixes for
    out-of-bounds-reads that this detected upstream already"

    [arnd@arndb.de: x86: fix fortified memcpy]
    Link: http://lkml.kernel.org/r/20170627150047.660360-1-arnd@arndb.de
    [keescook@chromium.org: avoid panic() in favor of BUG()]
    Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast
    [keescook@chromium.org: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help]
    Link: http://lkml.kernel.org/r/20170526095404.20439-1-danielmicay@gmail.com
    Link: http://lkml.kernel.org/r/1497903987-21002-8-git-send-email-keescook@chromium.org
    Signed-off-by: Daniel Micay
    Signed-off-by: Kees Cook
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Cc: Mark Rutland
    Cc: Daniel Axtens
    Cc: Rasmus Villemoes
    Cc: Andy Shevchenko
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Micay