13 Jan, 2019

2 commits

  • [ Upstream commit 254eb5505ca0ca749d3a491fc6668b6c16647a99 ]

    The LDT remap placement has been changed. It's now placed before the direct
    mapping in the kernel virtual address space for both paging modes.

    Change address markers order accordingly.

    Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: hpa@zytor.com
    Cc: dave.hansen@linux.intel.com
    Cc: luto@kernel.org
    Cc: peterz@infradead.org
    Cc: boris.ostrovsky@oracle.com
    Cc: jgross@suse.com
    Cc: bhe@redhat.com
    Cc: hans.van.kranenburg@mendix.com
    Cc: linux-mm@kvack.org
    Cc: xen-devel@lists.xenproject.org
    Link: https://lkml.kernel.org/r/20181130202328.65359-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
     
  • [ Upstream commit 16877a5570e0c5f4270d5b17f9bab427bcae9514 ]

    There is a guard hole at the beginning of the kernel address space, also
    used by hypervisors. It occupies 16 PGD entries.

    This reserved range is not defined explicitly; it is calculated relative
    to other entities: the direct mapping and the user space ranges.

    The calculation was broken by recent changes to the kernel memory layout:
    the LDT remap range is now mapped before the direct mapping, which makes
    the calculation invalid.

    The breakage leads to a crash on Xen dom0 boot [1].

    Define the reserved range explicitly. It's part of the kernel ABI
    (hypervisors expect it to be stable) and must not depend on changes in the
    rest of the kernel memory layout.

    [1] https://lists.xenproject.org/archives/html/xen-devel/2018-11/msg03313.html
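
    A minimal standalone sketch of defining the range explicitly, assuming
    4-level paging (PGDIR_SHIFT = 39); the macro names are illustrative
    stand-ins rather than the exact definitions added by the patch:

    #include <stdio.h>

    #define PGDIR_SHIFT          39                     /* 4-level paging */
    #define GUARD_HOLE_PGD_ENTRY (-256UL)               /* first PGD slot of the kernel half */
    #define GUARD_HOLE_SIZE      (16UL << PGDIR_SHIFT)  /* 16 PGD entries */
    #define GUARD_HOLE_BASE_ADDR (GUARD_HOLE_PGD_ENTRY << PGDIR_SHIFT)
    #define GUARD_HOLE_END_ADDR  (GUARD_HOLE_BASE_ADDR + GUARD_HOLE_SIZE)

    int main(void)
    {
            /* prints 0xffff800000000000 - 0xffff880000000000 on x86-64 */
            printf("guard hole: 0x%016lx - 0x%016lx\n",
                   GUARD_HOLE_BASE_ADDR, GUARD_HOLE_END_ADDR);
            return 0;
    }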

    Fixes: d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging")
    Reported-by: Hans van Kranenburg
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Thomas Gleixner
    Tested-by: Hans van Kranenburg
    Reviewed-by: Juergen Gross
    Cc: bp@alien8.de
    Cc: hpa@zytor.com
    Cc: dave.hansen@linux.intel.com
    Cc: luto@kernel.org
    Cc: peterz@infradead.org
    Cc: boris.ostrovsky@oracle.com
    Cc: bhe@redhat.com
    Cc: linux-mm@kvack.org
    Cc: xen-devel@lists.xenproject.org
    Link: https://lkml.kernel.org/r/20181130202328.65359-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
     

10 Jan, 2019

2 commits

  • commit ba6f508d0ec4adb09f0a939af6d5e19cdfa8667d upstream.

    Commit:

    f77084d96355 "x86/mm/pat: Disable preemption around __flush_tlb_all()"

    addressed a case where __flush_tlb_all() is called without preemption
    being disabled. It also left a warning to catch other cases where
    preemption is not disabled.

    That warning triggers for the memory hotplug path which is also used for
    persistent memory enabling:

    WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
    RIP: 0010:__flush_tlb_all+0x1b/0x3a
    [..]
    Call Trace:
    phys_pud_init+0x29c/0x2bb
    kernel_physical_mapping_init+0xfc/0x219
    init_memory_mapping+0x1a5/0x3b0
    arch_add_memory+0x2c/0x50
    devm_memremap_pages+0x3aa/0x610
    pmem_attach_disk+0x585/0x700 [nd_pmem]

    Andy wondered why a path that can sleep was using __flush_tlb_all() [1],
    and Dave confirmed that a TLB flush is expected when modifying /
    invalidating existing PTE entries, but not for initial population [2].
    Drop the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
    expectation that this path only ever populates empty entries for the
    linear map. Note that at linear map teardown time there is a call to the
    all-CPU flush_tlb_all() to invalidate the removed mappings.

    [1]: https://lkml.kernel.org/r/9DFD717D-857D-493D-A606-B635D72BAC21@amacapital.net
    [2]: https://lkml.kernel.org/r/749919a4-cdb1-48a3-adb4-adb81a5fa0b5@intel.com

    [ mingo: Minor readability edits. ]

    Suggested-by: Dave Hansen
    Reported-by: Andy Lutomirski
    Signed-off-by: Dan Williams
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Kirill A. Shutemov
    Cc:
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Cc: dave.hansen@intel.com
    Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
    Link: http://lkml.kernel.org/r/154395944713.32119.15611079023837132638.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 5b5e4d623ec8a34689df98e42d038a3b594d2ff9 upstream.

    Swap storage is restricted to max_swapfile_size (~16TB on x86_64) whenever
    the system is deemed affected by the L1TF vulnerability. Even though the limit
    is quite high for most deployments it seems to be too restrictive for
    deployments which are willing to live with the mitigation disabled.

    We have a customer deploying 8x 6.4TB PCIe/NVMe SSD swap devices, which is
    clearly over the limit.

    Drop the swap restriction when l1tf=off is specified. It also doesn't make
    much sense to warn about too much memory for the l1tf mitigation when it is
    forcefully disabled by the administrator.
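
    A minimal standalone sketch of the resulting decision (types, names, and
    numbers are stand-ins, not the kernel code): the clamp is applied only when
    the CPU is affected by L1TF and the mitigation has not been explicitly
    disabled.

    #include <stdbool.h>
    #include <stdio.h>

    enum l1tf_mitigations { L1TF_MITIGATION_OFF, L1TF_MITIGATION_DEFAULT };

    static unsigned long long swap_limit_pages(bool cpu_has_l1tf_bug,
                                               enum l1tf_mitigations l1tf_mitigation,
                                               unsigned long long generic_limit,
                                               unsigned long long l1tf_limit)
    {
            /* l1tf=off: the administrator opted out, so do not clamp or warn */
            if (cpu_has_l1tf_bug && l1tf_mitigation != L1TF_MITIGATION_OFF &&
                l1tf_limit < generic_limit)
                    return l1tf_limit;
            return generic_limit;
    }

    int main(void)
    {
            unsigned long long generic = 64ULL << 28;   /* arbitrary generic limit (pages) */
            unsigned long long l1tf    = 16ULL << 28;   /* ~MAX_PA/2 derived limit (pages) */

            printf("default:  %llu pages\n",
                   swap_limit_pages(true, L1TF_MITIGATION_DEFAULT, generic, l1tf));
            printf("l1tf=off: %llu pages\n",
                   swap_limit_pages(true, L1TF_MITIGATION_OFF, generic, l1tf));
            return 0;
    }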

    [ tglx: Folded the documentation delta change ]

    Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
    Signed-off-by: Michal Hocko
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Andi Kleen
    Acked-by: Jiri Kosina
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc:
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181113184910.26697-1-mhocko@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

06 Dec, 2018

2 commits

  • commit 4c71a2b6fd7e42814aa68a6dec88abf3b42ea573 upstream

    The IBPB speculation barrier is issued from switch_mm() when the kernel
    switches to a user space task with a different mm than the user space task
    which ran last on the same CPU.

    An additional optimization is to avoid IBPB when the incoming task can be
    ptraced by the outgoing task. This optimization only works when switching
    directly between two user space tasks. When switching from a kernel task to
    a user space task the optimization fails because the previous task cannot
    be accessed anymore. So in quite a few scenarios the optimization just
    adds overhead.

    The upcoming conditional IBPB support will issue IBPB only for user space
    tasks which have the TIF_SPEC_IB bit set. This requires handling the
    following cases:

    1) Switch from a user space task (potential attacker) which has
    TIF_SPEC_IB set to a user space task (potential victim) which has
    TIF_SPEC_IB not set.

    2) Switch from a user space task (potential attacker) which has
    TIF_SPEC_IB not set to a user space task (potential victim) which has
    TIF_SPEC_IB set.

    This needs to be optimized for the case where the IBPB can be avoided when
    only kernel threads ran in between user space tasks which belong to the
    same process.

    The current check of whether two tasks belong to the same context uses the
    tasks' context id. While correct, it's simpler to use the mm pointer
    because it allows mangling the TIF_SPEC_IB bit into it. The context id
    based mechanism requires extra storage, which creates worse code.

    When a task is scheduled out its TIF_SPEC_IB bit is mangled as bit 0 into
    the per CPU storage which is used to track the last user space mm which was
    running on a CPU. This bit can be used together with the TIF_SPEC_IB bit of
    the incoming task to make the decision whether IBPB needs to be issued or
    not to cover the two cases above.

    As conditional IBPB is going to be the default, remove the dubious ptrace
    check for the IBPB always case and simply issue IBPB always when the
    process changes.

    Move the storage to a different place in the struct as the original one
    created a hole.
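
    A standalone sketch of the bit mangling and the resulting decision (the
    structure name, helper names, and the exact condition are illustrative;
    mm pointers are word aligned, so bit 0 is free):

    #include <stdint.h>
    #include <stdio.h>

    #define LAST_USER_MM_IBPB 0x1UL          /* TIF_SPEC_IB folded into bit 0 */

    struct mm_struct { long dummy; };

    /* store the outgoing task's TIF_SPEC_IB state in bit 0 of its mm pointer */
    static uintptr_t mangle_mm(struct mm_struct *mm, int tif_spec_ib)
    {
            return (uintptr_t)mm | (tif_spec_ib ? LAST_USER_MM_IBPB : 0);
    }

    /* barrier needed when the mangled value changes and either side set the bit */
    static int ibpb_needed(uintptr_t prev_mangled, uintptr_t next_mangled)
    {
            return prev_mangled != next_mangled &&
                   ((prev_mangled | next_mangled) & LAST_USER_MM_IBPB);
    }

    int main(void)
    {
            struct mm_struct attacker, victim;

            printf("%d\n", ibpb_needed(mangle_mm(&attacker, 1), mangle_mm(&victim, 0))); /* 1 */
            printf("%d\n", ibpb_needed(mangle_mm(&victim, 0), mangle_mm(&victim, 0)));   /* 0 */
            return 0;
    }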

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185005.466447057@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit dbfe2953f63c640463c630746cd5d9de8b2f63ae upstream

    Currently, IBPB is only issued in cases when switching into a non-dumpable
    process, the rationale being to protect such 'important and security
    sensitive' processes (such as GPG) from data leaking into a different
    userspace process via spectre v2.

    This is however completely insufficient to provide proper userspace-to-userspace
    spectrev2 protection, as any process can poison branch buffers before being
    scheduled out, and the newly scheduled process immediately becomes a spectrev2
    victim.

    In order to minimize the performance impact (for use cases that do require
    spectrev2 protection), issue the barrier only in cases when switching between
    processes where the victim can't be ptraced by the potential attacker (as in
    such cases, the attacker doesn't have to bother with branch buffers at all).

    [ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
    PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
    fine-grained ]

    Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
    Originally-by: Tim Chen
    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: "WoodhouseDavid"
    Cc: Andi Kleen
    Cc: "SchauflerCasey"
    Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     

14 Nov, 2018

1 commit

  • commit f77084d96355f5fba8e2c1fb3a51a393b1570de7 upstream.

    The WARN_ON_ONCE(__read_cr3() != build_cr3()) in switch_mm_irqs_off()
    triggers every once in a while during a snapshotted system upgrade.

    The warning triggers since commit decab0888e6e ("x86/mm: Remove
    preempt_disable/enable() from __native_flush_tlb()"). The callchain is:

    get_page_from_freelist() -> post_alloc_hook() -> __kernel_map_pages()

    with CONFIG_DEBUG_PAGEALLOC enabled.

    Disable preemption during CR3 reset / __flush_tlb_all() and add a comment
    explaining why preemption has to be disabled, so it won't be removed
    accidentally.

    Add another preemptible() check in __flush_tlb_all() to catch callers with
    preemption enabled when PGE is enabled, because the PGE-enabled path does
    not trigger the warning in __native_flush_tlb(). Suggested by Andy
    Lutomirski.

    Fixes: decab0888e6e ("x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()")
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181017103432.zgv46nlu3hc7k4rq@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Sebastian Andrzej Siewior
     

04 Oct, 2018

3 commits

  • [ Upstream commit ff924c5a1ec7548825cc2d07980b03be4224ffac ]

    Fix the section mismatch warning in arch/x86/mm/pti.c:

    WARNING: vmlinux.o(.text+0x6972a): Section mismatch in reference from the function pti_clone_pgtable() to the function .init.text:pti_user_pagetable_walk_pte()
    The function pti_clone_pgtable() references
    the function __init pti_user_pagetable_walk_pte().
    This is often because pti_clone_pgtable lacks a __init
    annotation or the annotation of pti_user_pagetable_walk_pte is wrong.
    FATAL: modpost: Section mismatches detected.

    Fixes: 85900ea51577 ("x86/pti: Map the vsyscall page if needed")
    Reported-by: kbuild test robot
    Signed-off-by: Randy Dunlap
    Signed-off-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Link: https://lkml.kernel.org/r/43a6d6a3-d69d-5eda-da09-0b1c88215a2a@infradead.org
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Randy Dunlap
     
  • commit 05ab1d8a4b36ee912b7087c6da127439ed0a903e upstream.

    We hit a kernel panic when enabling earlycon, because the fixmap address of
    earlycon is not statically set up.

    Currently the static fixmap setup in head_64.S only covers 2M of virtual
    address space, while the fixmap could actually fall within a 4M space with
    different kernel configurations, e.g. when VSYSCALL emulation is disabled.

    So increase the static space to 4M for now by defining FIXMAP_PMD_NUM to 2,
    and add a build time check to ensure that the fixmap is covered by the
    initial static page tables.
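
    A standalone sketch of the sizing and a build-time style check (the
    2M-per-PMD constant is the usual x86-64 value; the assertion stands in for
    the kernel's BUILD_BUG_ON-style check):

    #include <stdio.h>

    #define PMD_SIZE        (2UL << 20)   /* 2M mapped per PMD on x86-64 */
    #define FIXMAP_PMD_NUM  2             /* statically mapped PMDs for the fixmap */

    /* stand-in for a build-time assertion in the kernel */
    _Static_assert(FIXMAP_PMD_NUM * PMD_SIZE >= (4UL << 20),
                   "early fixmap not covered by the initial static page tables");

    int main(void)
    {
            printf("early fixmap coverage: %lu MiB\n",
                   (FIXMAP_PMD_NUM * PMD_SIZE) >> 20);
            return 0;
    }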

    Fixes: 1ad83c858c7d ("x86_64,vsyscall: Make vsyscall emulation configurable")
    Suggested-by: Thomas Gleixner
    Signed-off-by: Feng Tang
    Signed-off-by: Thomas Gleixner
    Tested-by: kernel test robot
    Reviewed-by: Juergen Gross (Xen parts)
    Cc: H Peter Anvin
    Cc: Peter Zijlstra
    Cc: Michal Hocko
    Cc: Yinghai Lu
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Andy Lutomirski
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180920025828.23699-1-feng.tang@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Feng Tang
     
  • [ Upstream commit 3b6c62f363a19ce82bf378187ab97c9dc01e3927 ]

    Without this change the distance table calculation for emulated nodes
    may use the wrong numa node and report an incorrect distance.

    Signed-off-by: Dan Williams
    Cc: David Rientjes
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Wei Yang
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/153089328103.27680.14778434392225818887.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

26 Sep, 2018

3 commits

  • [ Upstream commit 935232ce28dfabff1171e5a7113b2d865fa9ee63 ]

    The addr counter will overflow if the last PMD of the address space is
    cloned, resulting in an endless loop.

    Check for that and bail out of the loop when it happens.
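
    A standalone illustration of the wrap-around and the guard (constants and
    the loop shape are stand-ins; the real code walks page-table entries):

    #include <stdio.h>

    #define PMD_SIZE (2UL << 20)

    int main(void)
    {
            unsigned long addr = 0UL - PMD_SIZE;   /* start of the last PMD */
            unsigned long next = addr + PMD_SIZE;  /* wraps around to 0 */

            /* without this check, "addr < end" style loops never terminate */
            if (next < addr) {
                    printf("wrap-around at 0x%lx, bailing out\n", addr);
                    return 0;
            }
            printf("next PMD at 0x%lx\n", next);
            return 0;
    }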

    Signed-off-by: Joerg Roedel
    Signed-off-by: Thomas Gleixner
    Tested-by: Pavel Machek
    Cc: "H . Peter Anvin"
    Cc: linux-mm@kvack.org
    Cc: Linus Torvalds
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Jiri Kosina
    Cc: Boris Ostrovsky
    Cc: Brian Gerst
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: daniel.gruss@iaik.tugraz.at
    Cc: hughd@google.com
    Cc: keescook@google.com
    Cc: Andrea Arcangeli
    Cc: Waiman Long
    Cc: "David H . Gutteridge"
    Cc: joro@8bytes.org
    Link: https://lkml.kernel.org/r/1531906876-13451-25-git-send-email-joro@8bytes.org
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Joerg Roedel
     
  • [ Upstream commit 8c934e01a7ce685d98e970880f5941d79272c654 ]

    pti_user_pagetable_walk_pmd() can return NULL, so the return value should
    be checked to prevent a NULL pointer dereference.

    Add the check and a warning when the PMD allocation fails.

    Signed-off-by: Jiang Biao
    Signed-off-by: Thomas Gleixner
    Cc: dave.hansen@linux.intel.com
    Cc: luto@kernel.org
    Cc: hpa@zytor.com
    Cc: albcamus@gmail.com
    Cc: zhong.weidong@zte.com.cn
    Link: https://lkml.kernel.org/r/1532045192-49622-2-git-send-email-jiang.biao2@zte.com.cn
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jiang Biao
     
  • [ Upstream commit b2b7d986a89b6c94b1331a909de1217214fb08c1 ]

    pti_user_pagetable_walk_p4d() can return NULL, so the return value should
    be checked to prevent a NULL pointer dereference.

    Add the check and a warning when the P4D allocation fails.

    Signed-off-by: Jiang Biao
    Signed-off-by: Thomas Gleixner
    Cc: dave.hansen@linux.intel.com
    Cc: luto@kernel.org
    Cc: hpa@zytor.com
    Cc: albcamus@gmail.com
    Cc: zhong.weidong@zte.com.cn
    Link: https://lkml.kernel.org/r/1532045192-49622-1-git-send-email-jiang.biao2@zte.com.cn
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jiang Biao
     

20 Sep, 2018

1 commit

  • [ Upstream commit 6863ea0cda8725072522cd78bda332d9a0b73150 ]

    It is perfectly okay to take page-faults, especially on the
    vmalloc area while executing an NMI handler. Remove the
    warning.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Thomas Gleixner
    Tested-by: David H. Gutteridge
    Cc: "H . Peter Anvin"
    Cc: linux-mm@kvack.org
    Cc: Linus Torvalds
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Jiri Kosina
    Cc: Boris Ostrovsky
    Cc: Brian Gerst
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: daniel.gruss@iaik.tugraz.at
    Cc: hughd@google.com
    Cc: keescook@google.com
    Cc: Andrea Arcangeli
    Cc: Waiman Long
    Cc: Pavel Machek
    Cc: Arnaldo Carvalho de Melo
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: joro@8bytes.org
    Link: https://lkml.kernel.org/r/1532533683-5988-2-git-send-email-joro@8bytes.org
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Joerg Roedel
     

05 Sep, 2018

3 commits

  • commit 4012e77a903d114f915fc607d6d2ed54a3d6c9b1 upstream.

    A NMI can hit in the middle of context switching or in the middle of
    switch_mm_irqs_off(). In either case, CR3 might not match current->mm,
    which could cause copy_from_user_nmi() and friends to read the wrong
    memory.

    Fix it by adding a new nmi_uaccess_okay() helper and checking it in
    copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Rik van Riel
    Cc: Nadav Amit
    Cc: Borislav Petkov
    Cc: Jann Horn
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/dd956eba16646fd0b15c3c0741269dfd84452dac.1535557289.git.luto@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit b0a182f875689647b014bc01d36b340217792852 upstream.

    Two users have reported [1] that they have an "extremely unlikely" system
    with more than MAX_PA/2 memory and L1TF mitigation is not effective. In
    fact it's a CPU with 36bits phys limit (64GB) and 32GB memory, but due to
    holes in the e820 map, the main region is almost 500MB over the 32GB limit:

    [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000081effffff] usable

    Suggestions to use 'mem=32G' to enable the L1TF mitigation while losing the
    500MB revealed that there's an off-by-one error in the check in
    l1tf_select_mitigation().

    l1tf_pfn_limit() returns the last usable pfn (inclusive) and the range
    check in the mitigation path does not take this into account.

    Instead of amending the range check, make l1tf_pfn_limit() return the first
    PFN which is over the limit, which is less error prone. Adjust the other
    users accordingly.

    [1] https://bugzilla.suse.com/show_bug.cgi?id=1105536
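
    A standalone illustration of the two return-value semantics (the helper
    names and the 36-bit physical limit are assumptions for the example): the
    new variant equals the old value plus one, so range checks can use a plain
    >= without an implicit +1.

    #include <stdio.h>

    #define PAGE_SHIFT 12

    /* old semantics: last usable pfn, inclusive */
    static unsigned long long last_usable_pfn(int pa_bits)
    {
            return (1ULL << (pa_bits - 1 - PAGE_SHIFT)) - 1;
    }

    /* new semantics: first pfn over the limit, exclusive */
    static unsigned long long pfn_limit(int pa_bits)
    {
            return 1ULL << (pa_bits - 1 - PAGE_SHIFT);
    }

    int main(void)
    {
            int pa_bits = 36;                             /* 64GB physical limit */
            unsigned long long pfn = pfn_limit(pa_bits);  /* first pfn of the upper half */

            printf("old: pfn 0x%llx over the limit: %d (check must remember the +1)\n",
                   pfn, pfn > last_usable_pfn(pa_bits));
            printf("new: pfn 0x%llx over the limit: %d\n",
                   pfn, pfn >= pfn_limit(pa_bits));
            return 0;
    }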

    Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
    Reported-by: George Anchev
    Reported-by: Christopher Snowhill
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Thomas Gleixner
    Cc: "H . Peter Anvin"
    Cc: Linus Torvalds
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180823134418.17008-1-vbabka@suse.cz
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 9df9516940a61d29aedf4d91b483ca6597e7d480 upstream.

    On 32bit PAE kernels on 64bit hardware with enough physical bits,
    l1tf_pfn_limit() will overflow unsigned long. This in turn affects
    max_swapfile_size() and can lead to swapon returning -EINVAL. This has been
    observed in a 32bit guest with 42 bits physical address size, where
    max_swapfile_size() overflows exactly to 1 << 32, thus zero, and produces
    the following warning to dmesg:

    [ 6.396845] Truncating oversized swap area, only using 0k out of 2047996k

    Fix this by using unsigned long long instead.
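
    A standalone illustration of the truncation, assuming the 42-bit physical
    address space from the report; the 3-bit shift stands in for the
    swap-offset headroom (offsets start at bit 9, not 12), and the fixed-width
    types emulate a 32-bit vs. a 64-bit 'unsigned long':

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12

    int main(void)
    {
            int pa_bits = 42;
            int extra_shift = PAGE_SHIFT - 9;   /* swap offsets start at bit 9 */

            /* 32-bit "unsigned long": 2^29 << 3 wraps to exactly 1 << 32 == 0 */
            uint32_t limit32 = (uint32_t)1 << (pa_bits - 1 - PAGE_SHIFT);
            uint32_t pages32 = limit32 << extra_shift;

            /* unsigned long long keeps the value intact */
            uint64_t limit64 = (uint64_t)1 << (pa_bits - 1 - PAGE_SHIFT);
            uint64_t pages64 = limit64 << extra_shift;

            printf("32-bit arithmetic: %u swap pages allowed\n", (unsigned)pages32);
            printf("64-bit arithmetic: %llu swap pages allowed\n",
                   (unsigned long long)pages64);
            return 0;
    }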

    Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
    Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
    Reported-by: Dominique Leuenberger
    Reported-by: Adrian Schroeter
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Thomas Gleixner
    Acked-by: Andi Kleen
    Acked-by: Michal Hocko
    Cc: "H . Peter Anvin"
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180820095835.5298-1-vbabka@suse.cz
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

18 Aug, 2018

3 commits

  • commit 5e0fb5df2ee871b841f96f9cb6a7f2784e96aa4e upstream.

    ioremap() calls pud_free_pmd_page() / pmd_free_pte_page() when it creates
    a pud / pmd map. The following preconditions are met at their entry.
    - All pte entries for a target pud/pmd address range have been cleared.
    - System-wide TLB purges have been performed for a target pud/pmd address
    range.

    The preconditions assure that there is no stale TLB entry for the range.
    Speculation may not cache TLB entries since it requires all levels of page
    entries, including ptes, to have the P and A bits set for an associated
    address. However, speculation may cache pud/pmd entries (paging-structure
    caches) when they have the P bit set.

    Add a system-wide TLB purge (INVLPG) to a single page after clearing
    a pud/pmd entry's P-bit.

    SDM 4.10.4.1, Operations that Invalidate TLBs and Paging-Structure Caches,
    states that:
    INVLPG invalidates all paging-structure caches associated with the
    current PCID regardless of the linear addresses to which they correspond.

    Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
    Signed-off-by: Toshi Kani
    Signed-off-by: Thomas Gleixner
    Cc: mhocko@suse.com
    Cc: akpm@linux-foundation.org
    Cc: hpa@zytor.com
    Cc: cpandya@codeaurora.org
    Cc: linux-mm@kvack.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Joerg Roedel
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: "H. Peter Anvin"
    Cc:
    Link: https://lkml.kernel.org/r/20180627141348.21777-4-toshi.kani@hpe.com
    Signed-off-by: Greg Kroah-Hartman

    Toshi Kani
     
  • commit 785a19f9d1dd8a4ab2d0633be4656653bd3de1fc upstream.

    The following kernel panic was observed on ARM64 platform due to a stale
    TLB entry.

    1. ioremap with 4K size, a valid pte page table is set.
    2. iounmap it, its pte entry is set to 0.
    3. ioremap the same address with 2M size, update its pmd entry with
    a new value.
    4. CPU may hit an exception because the old pmd entry is still in TLB,
    which leads to a kernel panic.

    Commit b6bdb7517c3d ("mm/vmalloc: add interfaces to free unmapped page
    table") has addressed this panic by falling back to pte mappings in the
    above case on ARM64.

    To support pmd mappings in all cases, TLB purge needs to be performed
    in this case on ARM64.

    Add a new arg, 'addr', to pud_free_pmd_page() and pmd_free_pte_page()
    so that a TLB purge can be added later in separate patches.

    [toshi.kani@hpe.com: merge changes, rewrite patch description]
    Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
    Signed-off-by: Chintan Pandya
    Signed-off-by: Toshi Kani
    Signed-off-by: Thomas Gleixner
    Cc: mhocko@suse.com
    Cc: akpm@linux-foundation.org
    Cc: hpa@zytor.com
    Cc: linux-mm@kvack.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Will Deacon
    Cc: Joerg Roedel
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: "H. Peter Anvin"
    Cc:
    Link: https://lkml.kernel.org/r/20180627141348.21777-3-toshi.kani@hpe.com
    Signed-off-by: Greg Kroah-Hartman

    Chintan Pandya
     
  • commit f967db0b9ed44ec3057a28f3b28efc51df51b835 upstream.

    ioremap() supports pmd mappings on x86-PAE. However, kernel's pmd
    tables are not shared among processes on x86-PAE. Therefore, any
    update to sync'd pmd entries needs re-syncing. Freeing a pte page
    also leads to a vmalloc fault and hits the BUG_ON in vmalloc_sync_one().

    Disable free page handling on x86-PAE. pud_free_pmd_page() and
    pmd_free_pte_page() simply return 0 if a given pud/pmd entry is present.
    This assures that ioremap() does not update sync'd pmd entries at the
    cost of falling back to pte mappings.

    Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
    Reported-by: Joerg Roedel
    Signed-off-by: Toshi Kani
    Signed-off-by: Thomas Gleixner
    Cc: mhocko@suse.com
    Cc: akpm@linux-foundation.org
    Cc: hpa@zytor.com
    Cc: cpandya@codeaurora.org
    Cc: linux-mm@kvack.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: "H. Peter Anvin"
    Cc:
    Link: https://lkml.kernel.org/r/20180627141348.21777-2-toshi.kani@hpe.com
    Signed-off-by: Greg Kroah-Hartman

    Toshi Kani
     

16 Aug, 2018

8 commits

  • commit 792adb90fa724ce07c0171cbc96b9215af4b1045 upstream.

    The introduction of generic_max_swapfile_size and arch-specific versions has
    broken linking on x86 with CONFIG_SWAP=n due to undefined reference to
    'generic_max_swapfile_size'. Fix it by compiling the x86-specific
    max_swapfile_size() only with CONFIG_SWAP=y.
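
    A minimal standalone mock of the shape of the fix; CONFIG_SWAP is toggled
    by hand here and the function bodies are placeholders, not the kernel's:

    #include <stdio.h>

    #define CONFIG_SWAP 1   /* comment out to emulate CONFIG_SWAP=n */

    #ifdef CONFIG_SWAP
    static unsigned long generic_max_swapfile_size(void)
    {
            return 1UL << 20;   /* placeholder value */
    }

    /* the arch override only exists when its generic counterpart does */
    static unsigned long max_swapfile_size(void)
    {
            return generic_max_swapfile_size();   /* would clamp for L1TF here */
    }
    #endif

    int main(void)
    {
    #ifdef CONFIG_SWAP
            printf("swap limit: %lu pages\n", max_swapfile_size());
    #else
            puts("no swap support built in");
    #endif
            return 0;
    }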

    Reported-by: Tomas Pruzina
    Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
    Signed-off-by: Vlastimil Babka
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 1063711b57393c1999248cccb57bebfaf16739e7 upstream

    The mmio tracer sets io mapping PTEs and PMDs to non-present when enabled
    without inverting the address bits, which makes the PTE entry vulnerable
    to L1TF.

    Make it use the right low level macros to actually invert the address bits
    to protect against L1TF.

    In principle this could be avoided because MMIO tracing is not likely to be
    enabled on production machines, but the fix is straightforward and for
    consistency's sake it's better to get rid of the open coded PTE manipulation.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     
  • commit 958f79b9ee55dfaf00c8106ed1c22a2919e0028b upstream

    set_memory_np() is used to mark kernel mappings not present, but it has
    its own open coded mechanism which does not have the L1TF protection of
    inverting the address bits.

    Replace the open coded PTE manipulation with the L1TF protecting low level
    PTE routines.

    Passes the CPA self test.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     
  • commit 447ae316670230d7d29430e2cbf1f5db4f49d14c upstream

    The next patch in this series will have to make the definition of
    irq_cpustat_t available to entering_irq().

    Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
    dependencies like

    asm/smp.h
    asm/apic.h
    asm/hardirq.h
    linux/irq.h
    linux/topology.h
    linux/smp.h
    asm/smp.h

    or

    linux/gfp.h
    linux/mmzone.h
    asm/mmzone.h
    asm/mmzone_64.h
    asm/smp.h
    asm/apic.h
    asm/hardirq.h
    linux/irq.h
    linux/irqdesc.h
    linux/kobject.h
    linux/sysfs.h
    linux/kernfs.h
    linux/idr.h
    linux/gfp.h

    and others.

    This causes compilation errors because of the header guards becoming
    effective in the second inclusion: symbols/macros that had been defined
    before wouldn't be available to intermediate headers in the #include chain
    anymore.

    A possible workaround would be to move the definition of irq_cpustat_t
    into its own header and include that from both, asm/hardirq.h and
    asm/apic.h.

    However, this wouldn't solve the real problem, namely asm/hardirq.h
    unnecessarily pulling in all the linux/irq.h cruft: nothing in
    asm/hardirq.h itself requires it. Also, note that there are some other
    archs, like e.g. arm64, which don't have that #include in their
    asm/hardirq.h.

    Remove the linux/irq.h #include from x86' asm/hardirq.h.

    Fix resulting compilation errors by adding appropriate #includes to *.c
    files as needed.

    Note that some of these *.c files could be cleaned up a bit with respect to
    their set of #includes, but that is better done in separate patches, if
    at all.

    Signed-off-by: Nicolai Stange
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Nicolai Stange
     
  • commit 0d0f6249058834ffe1ceaad0bb31464af66f6e7a upstream

    The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
    offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
    offset. The lower word is zeroed, thus systems with less than 4GB memory are
    safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
    to L1TF; with even more memory, the swap offset also influences the address.
    This might be a problem with 32bit PAE guests running on large 64bit hosts.

    By continuing to keep the whole swap entry in either the high or the low
    32bit word of the PTE we would limit the swap size too much. Thus this
    patch uses the whole PAE PTE with the same layout as the 64bit version
    does. The macros just become a bit tricky since they assume the
    arch-dependent swp_entry_t to be 32bit.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Thomas Gleixner
    Acked-by: Michal Hocko
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 1a7ed1ba4bba6c075d5ad61bb75e3fbc870840d6 upstream

    The previous patch has limited swap file size so that large offsets cannot
    clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation.

    It assumed that offsets are encoded starting with bit 12, same as pfn. But
    on x86_64, offsets are encoded starting with bit 9.

    Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA
    and 256TB with 46bit MAX_PA.
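
    The numbers from the message, worked out as a standalone example (MAX_PA/2
    in bytes, then raised by the 3 spare offset bits):

    #include <stdio.h>

    int main(void)
    {
            int max_pa_bits[] = { 42, 46 };

            for (int i = 0; i < 2; i++) {
                    unsigned long long old_limit = 1ULL << (max_pa_bits[i] - 1); /* MAX_PA/2 */
                    unsigned long long new_limit = old_limit << 3;               /* 3 more bits */

                    printf("MAX_PA = %d bits: %4llu TB -> %4llu TB\n", max_pa_bits[i],
                           old_limit >> 40, new_limit >> 40);
            }
            return 0;
    }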

    Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb upstream

    For the L1TF workaround it's necessary to limit the swap file size to below
    MAX_PA/2, so that the higher bits of the inverted swap offset never point
    to valid memory.

    Add a mechanism for the architecture to override the swap file size check
    in swapfile.c and add an x86-specific max swapfile check function that
    enforces that limit.

    The check is only enabled if the CPU is vulnerable to L1TF.

    In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
    with 46bit PA it is 32TB. The limit is only per individual swap file, so
    it's always possible to exceed these limits with multiple swap files or
    partitions.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     
  • commit 42e4089c7890725fcd329999252dc489b72f2921 upstream

    For L1TF, PROT_NONE mappings are protected by inverting the PFN in the page
    table entry. This sets the high bits in the CPU's address space, thus
    making sure an unmapped entry does not point to valid cached memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such a high mapping was mapped to unprivileged users
    they could attack low memory by setting such a mapping to PROT_NONE. This
    could happen through a special device driver which is not access
    protected. Normal /dev/mem is of course access protected.

    To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyways.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact this is only enforced if the mapping actually refers to
    a high MMIO address (defined as the MAX_PA-1 bit being set); the check is
    also skipped for root.

    For mmaps this is straightforward and can be handled in vm_insert_pfn and
    in remap_pfn_range().

    For mprotect it's a bit trickier. At the point where the actual PTEs are
    accessed a lot of state has been changed and it would be difficult to undo
    on an error. Since this is an uncommon case, use a separate early page
    table walk pass for MMIO PROT_NONE mappings that checks for this condition
    early. For non-MMIO and non-PROT_NONE mappings there are no changes.
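
    A standalone sketch of the resulting policy (the helpers stand in for
    pfn_valid()/capable(); the cutoff expression and names are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12

    /* stand-ins for the real kernel helpers */
    static bool pfn_is_ram(unsigned long long pfn) { (void)pfn; return false; }
    static bool caller_is_root(void)               { return false; }

    static bool pfn_prot_none_allowed(unsigned long long pfn, int max_pa_bits)
    {
            unsigned long long high_cutoff = 1ULL << (max_pa_bits - 1 - PAGE_SHIFT);

            if (pfn_is_ram(pfn))        /* valid page mappings stay allowed */
                    return true;
            if (caller_is_root())       /* check skipped for root */
                    return true;
            return pfn < high_cutoff;   /* refuse PROT_NONE on high MMIO */
    }

    int main(void)
    {
            /* 46-bit MAX_PA: MMIO at 1TB is fine, MMIO at 32TB (top half) is refused */
            printf("%d\n", pfn_prot_none_allowed(1ULL << (40 - PAGE_SHIFT), 46));
            printf("%d\n", pfn_prot_none_allowed(1ULL << (45 - PAGE_SHIFT), 46));
            return 0;
    }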

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

03 Jul, 2018

1 commit

  • commit 2bdce74412c249ac01dfe36b6b0043ffd7a5361e upstream.

    Hussam reports:

    I was poking around and for no real reason, I did cat /dev/mem and
    strings /dev/mem. Then I saw the following warning in dmesg. I saved it
    and rebooted immediately.

    memremap attempted on mixed range 0x000000000009c000 size: 0x1000
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 11810 at kernel/memremap.c:98 memremap+0x104/0x170
    [..]
    Call Trace:
    xlate_dev_mem_ptr+0x25/0x40
    read_mem+0x89/0x1a0
    __vfs_read+0x36/0x170

    The memremap() implementation checks for attempts to remap System RAM
    with MEMREMAP_WB and instead redirects those mapping attempts to the
    linear map. However, that only works if the physical address range
    being remapped is page aligned. In low memory we have situations like
    the following:

    00000000-00000fff : Reserved
    00001000-0009fbff : System RAM
    0009fc00-0009ffff : Reserved

    ...where System RAM intersects Reserved ranges on a sub-page
    granularity.

    Given that devmem_is_allowed() special cases any attempt to map System
    RAM in the first 1MB of memory, replace page_is_ram() with the more
    precise region_intersects() to trap attempts to map disallowed ranges.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=199999
    Link: http://lkml.kernel.org/r/152856436164.18127.2847888121707136898.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 92281dee825f ("arch: introduce memremap()")
    Signed-off-by: Dan Williams
    Reported-by: Hussam Al-Tayeb
    Tested-by: Hussam Al-Tayeb
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

30 May, 2018

2 commits

  • [ Upstream commit 639d6aafe437a7464399d2a77d006049053df06f ]

    __ro_after_init data gets stuck in the .rodata section. That's normally
    fine because the kernel itself manages the R/W properties.

    But, if we run __change_page_attr() on an area which is __ro_after_init,
    the .rodata checks will trigger and force the area to be immediately
    read-only, even if it is early-ish in boot. This caused problems when
    trying to clear the _PAGE_GLOBAL bit for these areas in the PTI code:
    it cleared _PAGE_GLOBAL like I asked, but also took it upon itself
    to clear _PAGE_RW. The kernel then oopses the next time it writes to
    a __ro_after_init data structure.

    To fix this, add the kernel_set_to_readonly check, just like we have
    for kernel text, just a few lines below in this function.

    Signed-off-by: Dave Hansen
    Acked-by: Kees Cook
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arjan van de Ven
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: David Woodhouse
    Cc: Greg Kroah-Hartman
    Cc: Hugh Dickins
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Nadav Amit
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20180406205514.8D898241@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dave Hansen
     
  • [ Upstream commit e3e288121408c3abeed5af60b87b95c847143845 ]

    The pmd_set_huge() and pud_set_huge() functions are used from
    the generic ioremap() code to establish large mappings where this
    is possible.

    But the generic ioremap() code does not check whether the
    PMD/PUD entries are already populated with a non-leaf entry,
    so that any page-table pages these entries point to will be
    lost.

    Further, on x86-32 with SHARED_KERNEL_PMD=0, this causes a
    BUG_ON() in vmalloc_sync_one() when PMD entries are synced
    from swapper_pg_dir to the current page-table. This happens
    because the PMD entry from swapper_pg_dir was promoted to a
    huge-page entry while the current PGD still contains the
    non-leaf entry. Because both entries are present and point
    to a different page, the BUG_ON() triggers.

    This was actually triggered with pti-x32 enabled in a KVM
    virtual machine by the graphics driver.

    A real and better fix for that would be to improve the
    page-table handling in the generic ioremap() code. But that is
    out-of-scope for this patch-set and left for later work.

    Reported-by: David H. Gutteridge
    Signed-off-by: Joerg Roedel
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: Jiri Kosina
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Pavel Machek
    Cc: Peter Zijlstra
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: daniel.gruss@iaik.tugraz.at
    Cc: hughd@google.com
    Cc: keescook@google.com
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20180411152437.GC15462@8bytes.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Joerg Roedel
     

23 May, 2018

1 commit

  • commit 0a0b152083cfc44ec1bb599b57b7aab41327f998 upstream.

    I got a bug report that the following code (roughly) was
    causing a SIGSEGV:

    mprotect(ptr, size, PROT_EXEC);
    mprotect(ptr, size, PROT_NONE);
    mprotect(ptr, size, PROT_READ);
    *ptr = 100;

    The problem is hit when the mprotect(PROT_EXEC)
    call implicitly assigns a protection key to the VMA and makes
    that key ACCESS_DENY|WRITE_DENY. The PROT_NONE mprotect()
    fails to remove the protection key, and the PROT_NONE->
    PROT_READ transition leaves the PTE usable, but the pkey is
    still in place, leaving the memory inaccessible.

    To fix this, we ensure that we always "override" the pkey
    at mprotect() if the VMA does not have execute-only
    permissions, but the VMA has the execute-only pkey.

    We had a check for PROT_READ/WRITE, but it did not work
    for PROT_NONE. This entirely removes the PROT_* checks,
    which ensures that PROT_NONE now works.

    Reported-by: Shakeel Butt
    Signed-off-by: Dave Hansen
    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Michael Ellermen
    Cc: Peter Zijlstra
    Cc: Ram Pai
    Cc: Shuah Khan
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: stable@vger.kernel.org
    Fixes: 62b5f7d013f ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
    Link: http://lkml.kernel.org/r/20180509171351.084C5A71@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Dave Hansen
     

26 Apr, 2018

1 commit

  • [ Upstream commit 595dd46ebfc10be041a365d0a3fa99df50b6ba73 ]

    Commit:

    df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")

    ... introduced a bounce buffer to work around CONFIG_HARDENED_USERCOPY=y.
    However, accessing the vsyscall user page will cause an SMAP fault.

    Replacing memcpy() with copy_from_user() fixes this bug, but adding
    a common way to handle this sort of user page may be useful in the future.

    Currently, only the vsyscall page requires KCORE_USER.

    Signed-off-by: Jia Zhang
    Reviewed-by: Jiri Olsa
    Cc: Al Viro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/1518446694-21124-2-git-send-email-zhang.jia@linux.alibaba.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jia Zhang
     

29 Mar, 2018

2 commits

  • commit 28ee90fe6048fa7b7ceaeb8831c0e4e454a4cf89 upstream.

    Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
    clear a given pud/pmd entry and free up lower level page table(s).

    The address range associated with the pud/pmd entry must have been
    purged by INVLPG.

    Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com
    Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
    Signed-off-by: Toshi Kani
    Reported-by: Lei Li
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Borislav Petkov
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Toshi Kani
     
  • commit b6bdb7517c3d3f41f20e5c2948d6bc3f8897394e upstream.

    On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
    create pud/pmd mappings. A kernel panic was observed on arm64 systems
    with Cortex-A75 in the following steps as described by Hanjun Guo.

    1. ioremap a 4K size, valid page table will build,
    2. iounmap it, pte0 will set to 0;
    3. ioremap the same address with 2M size, pgd/pmd is unchanged,
    then set the a new value for pmd;
    4. pte0 is leaked;
    5. CPU may meet exception because the old pmd is still in TLB,
    which will lead to kernel panic.

    This panic is not reproducible on x86. INVLPG, called from iounmap,
    purges all levels of entries associated with the purged address on x86.
    x86 still has a memory leak.

    The patch changes the ioremap path to free unmapped page table(s) since
    doing so in the unmap path has the following issues:

    - The iounmap() path is shared with vunmap(). Since vmap() only
    supports pte mappings, making vunmap() to free a pte page is an
    overhead for regular vmap users as they do not need a pte page freed
    up.

    - Checking if all entries in a pte page are cleared in the unmap path
    is racy, and serializing this check is expensive.

    - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
    Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
    purge.

    Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
    clear a given pud/pmd entry and free up a page for the lower level
    entries.

    This patch implements their stub functions on x86 and arm64, which work
    as a workaround.

    [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
    Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
    Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
    Reported-by: Lei Li
    Signed-off-by: Toshi Kani
    Cc: Catalin Marinas
    Cc: Wang Xuefeng
    Cc: Will Deacon
    Cc: Hanjun Guo
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Borislav Petkov
    Cc: Matthew Wilcox
    Cc: Chintan Pandya
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Toshi Kani
     

21 Mar, 2018

1 commit

  • commit 18a955219bf7d9008ce480d4451b6b8bf4483a22 upstream.

    Gratian Crisan reported that vmalloc_fault() crashes when CONFIG_HUGETLBFS
    is not set since the function inadvertently uses pXn_huge(), which always
    returns 0 in this case. ioremap() does not depend on CONFIG_HUGETLBFS.

    Fix vmalloc_fault() to call pXd_large() instead.
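
    A standalone illustration of the difference, using stand-in helpers (the
    real pmd_huge()/pmd_large() take a pmd_t; the PSE bit position is the
    usual x86 one):

    #include <stdbool.h>
    #include <stdio.h>

    #define _PAGE_PRESENT (1UL << 0)
    #define _PAGE_PSE     (1UL << 7)   /* "large page" bit in a PMD/PUD entry */

    /* with CONFIG_HUGETLBFS=n, pmd_huge() compiles down to "always false" */
    static bool pmd_huge_stub(unsigned long pmdval)  { (void)pmdval; return false; }

    /* pmd_large() just looks at the PSE bit, independent of hugetlbfs */
    static bool pmd_large_stub(unsigned long pmdval) { return pmdval & _PAGE_PSE; }

    int main(void)
    {
            unsigned long large_ioremap_pmd = 0x200000UL | _PAGE_PRESENT | _PAGE_PSE;

            printf("pmd_huge():  %d  (misses the 2M mapping)\n",
                   pmd_huge_stub(large_ioremap_pmd));
            printf("pmd_large(): %d\n", pmd_large_stub(large_ioremap_pmd));
            return 0;
    }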

    Fixes: f4eafd8bcd52 ("x86/mm: Fix vmalloc_fault() to handle large pages properly")
    Reported-by: Gratian Crisan
    Signed-off-by: Toshi Kani
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Borislav Petkov
    Cc: Andy Lutomirski
    Link: https://lkml.kernel.org/r/20180313170347.3829-2-toshi.kani@hpe.com
    Signed-off-by: Greg Kroah-Hartman

    Toshi Kani
     

15 Mar, 2018

2 commits

  • commit 531bb52a869a9c6e08c8d17ba955fcbfc18037ad upstream.

    This is boot code and thus Spectre-safe: we run this _way_ before userspace
    comes along to have a chance to poison our branch predictor.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Acked-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Arjan van de Ven
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Woodhouse
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Tom Lendacky
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 3b3a9268bba62b35a29bafe0931715b1725fdf26 upstream.

    This comment referred to a conditional call to kmemcheck_hide() that was
    here until commit 4950276672fc ("kmemcheck: remove annotations").

    Now that kmemcheck has been removed, it doesn't make sense anymore.

    Signed-off-by: Jann Horn
    Acked-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20180219175039.253089-1-jannh@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

09 Mar, 2018

1 commit

  • commit 945fd17ab6bab8a4d05da6c3170519fbcfe62ddb upstream.

    The separation of the cpu_entry_area from the fixmap missed the fact that
    on 32bit non-PAE kernels the cpu_entry_area mapping might not be covered in
    initial_page_table by the previous synchronizations.

    This results in suspend/resume failures because 32bit utilizes initial page
    table for resume. The absence of the cpu_entry_area mapping results in a
    triple fault, aka. insta reboot.

    With PAE enabled this works by chance because the PGD entry which covers
    the fixmap and other parts incidentally provides the cpu_entry_area
    mapping as well.

    Synchronize the initial page table after setting up the cpu entry
    area. Instead of adding yet another copy of the same code, move it to a
    function and invoke it from the various places.

    It needs to be investigated if the existing calls in setup_arch() and
    setup_per_cpu_areas() can be replaced by the later invocation from
    setup_cpu_entry_areas(), but that's beyond the scope of this fix.

    Fixes: 92a0f81d8957 ("x86/cpu_entry_area: Move it out of the fixmap")
    Reported-by: Woody Suwalski
    Signed-off-by: Thomas Gleixner
    Tested-by: Woody Suwalski
    Cc: William Grant
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1802282137290.1392@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Feb, 2018

1 commit

  • [ Upstream commit 6d60ce384d1d5ca32b595244db4077a419acc687 ]

    If something calls ioremap() with an address not aligned to PAGE_SIZE, the
    returned address might not be aligned either. This led to a probe being
    registered on exactly the returned address, while the entire page was armed
    for mmiotracing.

    On calling iounmap() the address passed to unregister_kmmio_probe() was
    PAGE_SIZE aligned by the caller leading to a complete freeze of the
    machine.

    We should always page align addresses while (un)registering mappings,
    because the mmiotracer works on top of pages, not mappings. We still keep
    track of the probes based on their real addresses and lengths though,
    because the mmiotrace still needs to know which memory regions are mapped.

    Also move the call to mmiotrace_iounmap() to before page aligning the
    address, so that all probes are unregistered properly; otherwise the kernel
    ends up failing memory allocations randomly after disabling the mmiotracer.

    Tested-by: Lyude
    Signed-off-by: Karol Herbst
    Acked-by: Pekka Paalanen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: nouveau@lists.freedesktop.org
    Link: http://lkml.kernel.org/r/20171127075139.4928-1-kherbst@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Karol Herbst