08 Aug, 2020

3 commits

  • After the previous cleanup, extent is the minimal step for both source
    and destination. This means that when extent is HPAGE_PMD_SIZE or
    PMD_SIZE, old_addr and new_addr are properly aligned too.

    Since these two functions are only invoked in move_page_tables, it is safe
    to remove the check now.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Tested-by: Dmitry Osipenko
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Anshuman Khandual
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Sean Christopherson
    Cc: Thomas Hellstrom
    Cc: Thomas Hellstrom (VMware)
    Cc: Vlastimil Babka
    Cc: Yang Shi
    Link: http://lkml.kernel.org/r/20200708095028.41706-4-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Page tables are moved on the basis of a PMD, which requires that both
    the source and destination ranges meet the alignment requirement.

    The current code works because move_huge_pmd() and move_normal_pmd()
    check old_addr and new_addr again, and fall back to move_ptes() if either
    of them is not aligned.

    Instead of calculating the extent separately, it is better to calculate
    it in one place, so we know when it is not necessary to try a PMD move.
    This also makes the logic a little clearer.
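
    A minimal standalone sketch of that idea (illustrative PMD_SIZE value and
    a hypothetical get_extent()/next_step() pair, not the kernel code): the
    shared step can only reach PMD_SIZE when both addresses are PMD aligned.

    #define PMD_SIZE (2UL << 20)            /* illustrative x86-64 value */
    #define PMD_MASK (~(PMD_SIZE - 1))

    /* distance from addr to the next PMD boundary, clipped to end */
    static unsigned long next_step(unsigned long addr, unsigned long end)
    {
            unsigned long next = (addr & PMD_MASK) + PMD_SIZE;

            return (next < end ? next : end) - addr;
    }

    /* one shared extent for the old and the new range */
    static unsigned long get_extent(unsigned long old_addr, unsigned long old_end,
                                    unsigned long new_addr)
    {
            unsigned long len = old_end - old_addr;
            unsigned long old_ext = next_step(old_addr, old_end);
            unsigned long new_ext = next_step(new_addr, new_addr + len);

            return old_ext < new_ext ? old_ext : new_ext;
    }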

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Tested-by: Dmitry Osipenko
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Anshuman Khandual
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Sean Christopherson
    Cc: Thomas Hellstrom
    Cc: Thomas Hellstrom (VMware)
    Cc: Vlastimil Babka
    Cc: Yang Shi
    Link: http://lkml.kernel.org/r/20200708095028.41706-3-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Patch series "mm/mremap: cleanup move_page_tables() a little", v5.

    move_page_tables() tries to move page tables by PMD or by PTE.

    The root reason is that, to move a PMD, both the old and new ranges must
    be PMD aligned. But the current code calculates the old range and the new
    range separately, which leads to some redundant checks and calculations.

    This cleanup tries to consolidate the range check in one place to reduce
    some extra range handling.

    This patch (of 3):

    old_end is passed to these two functions to check whether there is enough
    space to do the move, but this check is already done before invoking these
    functions.

    These two functions are only invoked when extent meets the requirement,
    and there is one check before invoking them:

    if (extent > old_end - old_addr)
        extent = old_end - old_addr;

    This implies (old_end - old_addr) won't fail the check in these two
    functions.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Tested-by: Dmitry Osipenko
    Acked-by: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Yang Shi
    Cc: Thomas Hellstrom (VMware)
    Cc: Anshuman Khandual
    Cc: Sean Christopherson
    Cc: Wei Yang
    Cc: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Thomas Hellstrom
    Link: http://lkml.kernel.org/r/20200710092835.56368-1-richard.weiyang@linux.alibaba.com
    Link: http://lkml.kernel.org/r/20200710092835.56368-2-richard.weiyang@linux.alibaba.com
    Link: http://lkml.kernel.org/r/20200708095028.41706-1-richard.weiyang@linux.alibaba.com
    Link: http://lkml.kernel.org/r/20200708095028.41706-2-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

14 Jul, 2020

1 commit

  • Naresh Kamboju reported that the LTP tests can cause warnings on i386
    going back all the way to v5.0, and bisected it to commit 2c91bd4a4e2e
    ("mm: speed up mremap by 20x on large regions").

    The warning in move_normal_pmd() is actually mostly correct, but we have
    a very unusual special case at process creation time, when we may move
    the stack down with an overlapping mode (kind of like a "memmove()"
    except using the page tables).

    And when you have just the right condition of "move a large initial
    stack by the right alignment in the end, but with the early part of the
    move being only page-aligned", we'll be in a situation where we're
    trying to move a normal PMD entry on top of an already existing - but
    now empty - PMD entry.

    The warning is still worth having, in case it ever triggers other cases,
    and perhaps as a reminder that we could do the stack move case more
    efficiently (although it's clearly rare enough that it probably doesn't
    matter).

    But make it do WARN_ON_ONCE(), so that you can't flood the logs with it.

    And add a *big* comment above it to explain and remind us what's going
    on, because it took some figuring out to see how this could trigger.
    Kudos to Joel Fernandes for debugging this.
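
    A userspace model of the WARN_ON_ONCE() behaviour relied on here (a sketch
    of the "log it once, don't flood" semantics, not the kernel macro):

    #include <stdbool.h>
    #include <stdio.h>

    /* report a condition at most once per call site, like the kernel macro */
    #define WARN_ON_ONCE(cond) ({                                   \
            static bool warned;                                     \
            bool hit = (cond);                                      \
            if (hit && !warned) {                                   \
                    warned = true;                                  \
                    fprintf(stderr, "warning: %s\n", #cond);        \
            }                                                       \
            hit;                                                    \
    })

    int main(void)
    {
            for (int i = 0; i < 3; i++)
                    if (WARN_ON_ONCE(i >= 0))   /* true each time, logged once */
                            continue;
            return 0;
    }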

    Reported-by: Naresh Kamboju
    Debugged-and-acked-by: Joel Fernandes
    Cc: Arnd Bergmann
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
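
    The net effect at a call site looks roughly like this (an illustrative
    fragment, not a hunk from the patch):

    /* before: raw rwsem calls on mm->mmap_sem */
    down_write(&mm->mmap_sem);
    /* ... manipulate VMAs ... */
    up_write(&mm->mmap_sem);

    /* after: the mmap locking API wrappers generated by the rule above */
    mmap_write_lock(mm);
    /* ... manipulate VMAs ... */
    mmap_write_unlock(mm);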

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

05 Jun, 2020

3 commits

  • Merge yet more updates from Andrew Morton:

    - More MM work. 100ish more to go. Mike Rapoport's "mm: remove
    __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue

    - Various other little subsystems

    * emailed patches from Andrew Morton : (127 commits)
    lib/ubsan.c: fix gcc-10 warnings
    tools/testing/selftests/vm: remove duplicate headers
    selftests: vm: pkeys: fix multilib builds for x86
    selftests: vm: pkeys: use the correct page size on powerpc
    selftests/vm/pkeys: override access right definitions on powerpc
    selftests/vm/pkeys: test correct behaviour of pkey-0
    selftests/vm/pkeys: introduce a sub-page allocator
    selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
    selftests/vm/pkeys: associate key on a mapped page and detect write violation
    selftests/vm/pkeys: associate key on a mapped page and detect access violation
    selftests/vm/pkeys: improve checks to determine pkey support
    selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
    selftests/vm/pkeys: fix number of reserved powerpc pkeys
    selftests/vm/pkeys: introduce powerpc support
    selftests/vm/pkeys: introduce generic pkey abstractions
    selftests: vm: pkeys: use the correct huge page size
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
    selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
    selftests/vm/pkeys: fix pkey_disable_clear()
    selftests: vm: pkeys: add helpers for pkey bits
    ...

    Linus Torvalds
     
  • Fixes coccicheck warnings:

    mm/zbud.c:246:1-20: WARNING: Assignment of 0/1 to bool variable
    mm/mremap.c:777:2-8: WARNING: Assignment of 0/1 to bool variable
    mm/huge_memory.c:525:9-10: WARNING: return of 0/1 in function 'is_transparent_hugepage' with return type bool
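
    The fixes are mechanical; for instance (variable and value are
    hypothetical, not the actual hunks):

    -       bool locked = 0;        /* assignment of 0/1 to bool variable */
    +       bool locked = false;
    -       return 1;               /* return of 0/1 in a function returning bool */
    +       return true;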

    Reported-by: Hulk Robot
    Signed-off-by: Zou Wei
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1586835930-47076-1-git-send-email-zou_wei@huawei.com
    Signed-off-by: Linus Torvalds

    Zou Wei
     
  • The original code in mm/mremap.c checks huge pmd by:

    if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {

    However, a DAX mapped nvdimm is mapped as a huge page (by default) but it
    is not a transparent huge page (_PAGE_PSE | _PAGE_DEVMAP). This commit
    changes the condition to include that case.

    This addresses CVE-2020-10757.
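
    Presumably the resulting test in move_page_tables() grows a pmd_devmap()
    check next to the existing two, along these lines (a sketch, not the
    exact hunk):

    if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
        pmd_devmap(*old_pmd)) {
            /* take the huge-PMD move path */
    }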

    Fixes: 5c7fb56e5e3f ("mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd")
    Cc:
    Reported-by: Fan Yang
    Signed-off-by: Fan Yang
    Tested-by: Fan Yang
    Tested-by: Dan Williams
    Reviewed-by: Dan Williams
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Fan Yang
     

15 May, 2020

1 commit

  • A user is not required to set a new address when using MREMAP_DONTUNMAP
    as it can be used without MREMAP_FIXED. When doing so the remap event
    will use new_addr which may not have been set and we didn't propagate it
    back other than in the return value of remap_to.

    Because ret is always the new address it's probably more correct to use
    it rather than new_addr on the remap_event_complete call, and it
    resolves this bug.

    Fixes: e346b3813067d4b ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
    Reported-by: Randy Dunlap
    Signed-off-by: Brian Geffon
    Signed-off-by: Andrew Morton
    Cc: Lokesh Gidra
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: "Michael S . Tsirkin"
    Cc: Andrea Arcangeli
    Cc: Sonny Rao
    Cc: Joel Fernandes
    Link: http://lkml.kernel.org/r/20200506172158.218366-1-bgeffon@google.com
    Signed-off-by: Linus Torvalds

    Brian Geffon
     

20 Apr, 2020

1 commit

  • When remapping a mapping where a portion of a VMA is remapped
    into another portion of the VMA it can cause the VMA to become
    split. During the copy_vma operation the VMA can actually
    be remerged if it's an anonymous VMA whose pages have not yet
    been faulted. This isn't normally a problem because at the end
    of the remap the original portion is unmapped causing it to
    become split again.

    However, MREMAP_DONTUNMAP leaves that original portion in place which
    means that the VMA which was split and then remerged is not actually
    split at the end of the mremap. This patch fixes a bug where
    we don't detect that the VMAs got remerged and we end up
    putting back VM_ACCOUNT on the next mapping which is completely
    unrelated. When that next mapping is unmapped it results in
    incorrectly unaccounting for the memory which was never accounted,
    and eventually we will underflow on the memory commitment.

    There is also another, similar issue: we're currently accounting
    for the number of pages in the new_vma, but that's wrong.
    We need to account for the length of the remap operation, as that's
    all that is being added. If there was a mapping already at that
    location, its commitment would have been adjusted as part of
    the munmap at the start of the mremap.

    A really simple repro can be seen in:
    https://gist.github.com/bgaff/e101ce99da7d9a8c60acc641d07f312c

    Fixes: e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
    Reported-by: syzbot
    Signed-off-by: Brian Geffon
    Signed-off-by: Linus Torvalds

    Brian Geffon
     

03 Apr, 2020

2 commits

  • When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is set,
    the source mapping will not be removed. The remap operation will be
    performed as it would have been normally by moving over the page tables to
    the new mapping. The old vma will have any locked flags cleared, have no
    pagetables, and any userfaultfds that were watching that range will
    continue watching it.

    For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
    the mremap() call to fail. Because MREMAP_DONTUNMAP always results in
    moving a VMA, you MUST use the MREMAP_MAYMOVE flag. It's not possible to
    resize a VMA while also moving it with MREMAP_DONTUNMAP, so old_len must
    always be equal to new_len; otherwise the call will return -EINVAL.

    We hope to use this in Chrome OS where with userfaultfd we could write an
    anonymous mapping to disk without having to STOP the process or worry
    about VMA permission changes.

    This feature also has a use case in Android, Lokesh Gidra has said that
    "As part of using userfaultfd for GC, We'll have to move the physical
    pages of the java heap to a separate location. For this purpose mremap
    will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
    heap, its virtual mapping will be removed as well. Therefore, we'll
    require performing mmap immediately after. This is not only time
    consuming but also opens a time window where a native thread may call mmap
    and reserve the java heap's address range for its own usage. This flag
    solves the problem."
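
    A hedged usage sketch (requires Linux 5.7+; MREMAP_DONTUNMAP is defined
    locally in case the libc headers do not carry it yet):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #ifndef MREMAP_DONTUNMAP
    #define MREMAP_DONTUNMAP 4      /* from <linux/mman.h> */
    #endif

    int main(void)
    {
            size_t len = 2 * 1024 * 1024;
            void *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (old == MAP_FAILED)
                    return 1;

            /* old_len must equal new_len, and MREMAP_MAYMOVE is mandatory */
            void *new = mremap(old, len, len,
                               MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
            if (new == MAP_FAILED) {
                    perror("mremap");
                    return 1;
            }
            printf("pages moved %p -> %p; source VMA at %p stays mapped\n",
                   old, new, old);
            return 0;
    }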

    [bgeffon@google.com: v6]
    Link: http://lkml.kernel.org/r/20200218173221.237674-1-bgeffon@google.com
    [bgeffon@google.com: v7]
    Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.com
    Signed-off-by: Brian Geffon
    Signed-off-by: Andrew Morton
    Tested-by: Lokesh Gidra
    Reviewed-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: "Michael S . Tsirkin"
    Cc: Arnd Bergmann
    Cc: Andy Lutomirski
    Cc: Will Deacon
    Cc: Andrea Arcangeli
    Cc: Sonny Rao
    Cc: Minchan Kim
    Cc: Joel Fernandes
    Cc: Yu Zhao
    Cc: Jesse Barnes
    Cc: Nathan Chancellor
    Cc: Florian Weimer
    Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.com
    Signed-off-by: Linus Torvalds

    Brian Geffon
     
  • Currently the declaration and definition for is_vma_temporary_stack() are
    scattered. Let's make the is_vma_temporary_stack() helper available for
    general use and also drop the declaration from include/linux/huge_mm.h,
    which is no longer required. While at it, rename it to
    vma_is_temporary_stack() in line with existing helpers. This should not
    cause any functional change.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

26 Mar, 2020

1 commit

  • Commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
    brk()/mmap()/mremap()") changed mremap() so that only the 'old' address
    is untagged, leaving the 'new' address in the form it was passed from
    userspace. This prevents the unexpected creation of aliasing virtual
    mappings in userspace, but looks a bit odd when you read the code.

    Add a comment justifying the untagging behaviour in mremap().

    Reported-by: Linus Torvalds
    Acked-by: Linus Torvalds
    Reviewed-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Catalin Marinas

    Will Deacon
     

20 Feb, 2020

1 commit

  • Currently the arm64 kernel ignores the top address byte passed to brk(),
    mmap() and mremap(). When the user is not aware of the 56-bit address
    limit or relies on the kernel to return an error, untagging such
    pointers has the potential to create address aliases in user-space.
    Passing a tagged address to munmap(), madvise() is permitted since the
    tagged pointer is expected to be inside an existing mapping.

    The current behaviour breaks the existing glibc malloc() implementation
    which relies on brk() with an address beyond 56-bit to be rejected by
    the kernel.

    Remove untagging in the above functions by partially reverting commit
    ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk"). In
    addition, update the arm64 tagged-address-abi.rst document accordingly.

    Link: https://bugzilla.redhat.com/1797052
    Fixes: ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk")
    Cc: # 5.4.x-
    Cc: Florian Weimer
    Reviewed-by: Andrew Morton
    Reported-by: Victor Stinner
    Acked-by: Will Deacon
    Acked-by: Andrey Konovalov
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon

    Catalin Marinas
     

01 Dec, 2019

1 commit

  • get_unmapped_area() returns an address or -errno on failure. Historically
    we have checked for the failure by offset_in_page() which is correct but
    quite hard to read. Newer code started using IS_ERR_VALUE which is much
    easier to read. Convert remaining users of offset_in_page as well.
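
    A small userspace model of why both checks work: error returns are the
    top 4095 values (-MAX_ERRNO..-1), which are never page aligned, while a
    successful get_unmapped_area() result always is. IS_ERR_VALUE() states
    that directly:

    #include <assert.h>

    #define MAX_ERRNO          4095
    #define PAGE_SIZE          4096UL
    #define IS_ERR_VALUE(x)    ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)
    #define offset_in_page(p)  ((unsigned long)(p) & (PAGE_SIZE - 1))

    int main(void)
    {
            unsigned long ok  = 0x70000000UL;           /* page-aligned address */
            unsigned long err = (unsigned long)-12;     /* -ENOMEM */

            assert(!IS_ERR_VALUE(ok)  && !offset_in_page(ok));
            assert( IS_ERR_VALUE(err) &&  offset_in_page(err));
            return 0;
    }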

    [mhocko@suse.com: rewrite changelog]
    [mhocko@kernel.org: fix mremap.c and uprobes.c sites also]
    Link: http://lkml.kernel.org/r/20191012102512.28051-1-pugaowei@gmail.com
    Signed-off-by: Gaowei Pu
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Konstantin Khlebnikov
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Qian Cai
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gaowei Pu
     

26 Sep, 2019

2 commits

  • There isn't a good reason to differentiate between the user address space
    layout modification syscalls and the other memory permission/attributes
    ones (e.g. mprotect, madvise) w.r.t. the tagged address ABI. Untag the
    user addresses on entry to these functions.

    Link: http://lkml.kernel.org/r/20190821164730.47450-2-catalin.marinas@arm.com
    Signed-off-by: Catalin Marinas
    Acked-by: Will Deacon
    Acked-by: Andrey Konovalov
    Cc: Vincenzo Frascino
    Cc: Szabolcs Nagy
    Cc: Kevin Brodsky
    Cc: Dave P Martin
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • This patch is a part of a series that extends the kernel ABI to allow
    passing tagged user pointers (with the top byte set to something other
    than 0x00) as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.
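
    A userspace model of what "the top byte" means on a 56-bit-VA
    configuration (an illustration of untagging, not the kernel's
    untagged_addr() implementation):

    #include <stdint.h>
    #include <stdio.h>

    /* drop the tag in bits 63:56, keeping the 56-bit virtual address */
    static uint64_t untag(uint64_t addr)
    {
            return addr & ((UINT64_C(1) << 56) - 1);
    }

    int main(void)
    {
            uint64_t tagged = UINT64_C(0xab00007f12345000);     /* tag 0xab */

            printf("%#llx -> %#llx\n", (unsigned long long)tagged,
                   (unsigned long long)untag(tagged));
            return 0;
    }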

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

15 May, 2019

1 commit

  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page tables and
    take specific actions for them. However, the current API only provides
    the range of virtual addresses affected by the change, not why the change
    is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of the mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default as users of the mmu notifier
    should assume that every mapping in the range is going away when that
    event happens. A later patch converts the mm call paths to more
    appropriate events for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

06 Mar, 2019

1 commit

  • When the mremap() syscall is used with the MREMAP_FIXED flag, mremap()
    calls mremap_to(), which does the following:

    1) unmaps the destination region where we are going to move the map
    2) If the new region is going to be smaller, we unmap the last part
    of the old region

    Then, we will eventually call move_vma() to do the actual move.

    move_vma() checks whether we are at least 4 maps below max_map_count
    before going further, otherwise it bails out with -ENOMEM. The problem
    is that we might have already unmapped the vma's in steps 1) and 2), so
    it is not possible for userspace to figure out the state of the vmas
    after it gets -ENOMEM, and it gets tricky for userspace to clean up
    properly on error path.

    While it is true that we can return -ENOMEM for more reasons (e.g: see
    may_expand_vm() or move_page_tables()), I think that we can avoid this
    scenario if we check early in mremap_to() if the operation has high
    chances to succeed map-wise.

    Should that not be the case, we can bail out before we even try to unmap
    anything, so we make sure the vma's are left untouched in case we are
    likely to be short of maps.

    The rule of thumb now is to rely on the worst-case scenario we can have:
    both vma's (old region and new region) are going to be split in 3, so we
    get two more maps on top of the ones we already hold (one per each). If
    the current map count + 2 maps still leaves us 4 maps below the
    threshold, we are going to pass the check in move_vma().

    Of course, this is not free, as it might generate false positives when
    it is true that we are tight map-wise, but the unmap operation can
    release several vma's leading us to a good state.

    Another approach was also investigated [1], but it may be too much
    hassle for what it brings.

    [1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/

    Link: http://lkml.kernel.org/r/20190226091314.18446-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Joel Fernandes (Google)
    Cc: Yang Shi
    Cc: Mel Gorman
    Cc: Joel Fernandes
    Cc: Cyril Hrubis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

05 Jan, 2019

2 commits

  • Android needs to mremap large regions of memory during memory management
    related operations. The mremap system call can be really slow if THP is
    not enabled. The bottleneck is move_page_tables, which is copying each
    pte at a time, and can be really slow across a large map. Turning on
    THP may not be a viable option, and is not for us. This patch speeds up
    the performance for non-THP systems by copying at the PMD level when
    possible.

    The speedup is an order of magnitude on x86 (~20x). On a 1GB mremap,
    the mremap completion time drops from 3.4-3.6 milliseconds to 144-160
    microseconds.

    Before:
    Total mremap time for 1GB data: 3521942 nanoseconds.
    Total mremap time for 1GB data: 3449229 nanoseconds.
    Total mremap time for 1GB data: 3488230 nanoseconds.

    After:
    Total mremap time for 1GB data: 150279 nanoseconds.
    Total mremap time for 1GB data: 144665 nanoseconds.
    Total mremap time for 1GB data: 158708 nanoseconds.

    If THP is enabled the optimization is mostly skipped except in certain
    situations.
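
    A hedged kernel-style sketch of the technique described above (names and
    locking shape inferred from the text, not the function as merged):
    instead of walking 512 PTEs, the whole PTE page is unhooked from the old
    PMD slot and hooked into the new one under both PMD locks, followed by a
    TLB flush of the old range.

    static bool move_pmd_sketch(struct vm_area_struct *vma,
                                unsigned long old_addr,
                                pmd_t *old_pmd, pmd_t *new_pmd)
    {
            struct mm_struct *mm = vma->vm_mm;
            spinlock_t *old_ptl, *new_ptl;
            pmd_t pmd;

            if (!pmd_none(*new_pmd))        /* destination slot must be empty */
                    return false;

            old_ptl = pmd_lock(mm, old_pmd);
            new_ptl = pmd_lockptr(mm, new_pmd);
            if (new_ptl != old_ptl)
                    spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

            pmd = *old_pmd;                 /* steal the whole PTE page... */
            pmd_clear(old_pmd);
            pmd_populate(mm, new_pmd, pmd_pgtable(pmd));    /* ...and replant it */

            flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);

            if (new_ptl != old_ptl)
                    spin_unlock(new_ptl);
            spin_unlock(old_ptl);
            return true;
    }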

    [joel@joelfernandes.org: fix 'move_normal_pmd' unused function warning]
    Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com
    Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Acked-by: Kirill A. Shutemov
    Reviewed-by: William Kucharski
    Cc: Julia Lawall
    Cc: Michal Hocko
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     
  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is concern that the extra
    'address' argument that mremap passes to pte_alloc may do something
    subtle and architecture-related in the future that may make the scheme
    not work. Also, we find that there is no point in passing the 'address'
    to pte_alloc since it's unused. This patch therefore removes this
    argument tree-wide, resulting in a nice negative diff as well. We also
    ensure along the way that the enabled architectures do not do anything
    funky with the 'address' argument that goes unnoticed by the
    optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script.
    (thanks Julia for answering all Coccinelle questions!).
    The following fix-ups were done manually:
    * Removal of address argument from pte_fragment_alloc
    * Removal of pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )
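
    At a typical call site the rule reduces to dropping the trailing address
    argument, e.g. (identifiers are placeholders):

    -       if (pte_alloc(mm, pmd, addr))
    +       if (pte_alloc(mm, pmd))
                    goto fail;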

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

29 Dec, 2018

1 commit

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

27 Oct, 2018

1 commit

  • Other than munmap, mremap might be used to shrink a memory mapping too.
    So, it may hold the write mmap_sem for a long time when shrinking a large
    mapping, as the commit ("mm: mmap: zap pages with read mmap_sem in
    munmap") described.

    mremap() will not manipulate vmas anymore after the __do_munmap() call for
    the mapping shrink use case, so it is safe to downgrade to the read
    mmap_sem.

    So, the same optimization, which downgrades mmap_sem to read for zapping
    pages, is also feasible and reasonable to this case.

    The period of holding exclusive mmap_sem for shrinking large mapping
    would be reduced significantly with this optimization.

    MREMAP_FIXED and MREMAP_MAYMOVE are more complicated to adapt to this
    optimization since they need to manipulate vmas after do_munmap();
    downgrading mmap_sem there may create a race window.

    Simple mapping shrink is the low-hanging fruit, and together with munmap
    it may cover most unmap cases.
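
    The locking pattern being described, in outline (a sketch of the rwsem
    calls, not the mremap hunk):

    down_write(&mm->mmap_sem);
    /* shrink: detach the tail VMAs; no further VMA manipulation needed */
    downgrade_write(&mm->mmap_sem);     /* exclusive -> shared */
    /* zap the pages of the detached range under the read lock */
    up_read(&mm->mmap_sem);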

    [akpm@linux-foundation.org: tweak comment]
    [yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue]
    Link: http://lkml.kernel.org/r/1538687672-17795-2-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1538067582-60038-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Laurent Dufour
    Cc: Colin Ian King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

18 Oct, 2018

1 commit

  • Jann Horn points out that our TLB flushing was subtly wrong for the
    mremap() case. What makes mremap() special is that we don't follow the
    usual "add page to list of pages to be freed, then flush tlb, and then
    free pages". No, mremap() obviously just _moves_ the page from one page
    table location to another.

    That matters, because mremap() thus doesn't directly control the
    lifetime of the moved page with a freelist: instead, the lifetime of the
    page is controlled by the page table locking, that serializes access to
    the entry.

    As a result, we need to flush the TLB not just before releasing the lock
    for the source location (to avoid any concurrent accesses to the entry),
    but also before we release the destination page table lock (to avoid the
    TLB being flushed after somebody else has already done something to that
    page).

    This also makes the whole "need_flush" logic unnecessary, since we now
    always end up flushing the TLB for every valid entry.

    Reported-and-tested-by: Jann Horn
    Acked-by: Will Deacon
    Tested-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

15 Jun, 2018

1 commit

  • Commit 5d1904204c99 ("mremap: fix race between mremap() and page
    cleanning") fixed races between mremap and other operations for both
    file-backed and anonymous mappings. The file-backed was the most
    critical as it allowed the possibility that data could be changed on a
    physical page after page_mkclean returned which could trigger data loss
    or data integrity issues.

    A customer reported that the cost of the TLB flushes for anonymous
    mappings was excessive, resulting in an overall 30-50% drop in performance
    on a microbenchmark since this commit. Unfortunately I neither have
    access to the test-case nor can I describe what it does other than
    saying that mremap operations dominate heavily.

    This patch removes the LATENCY_LIMIT to handle TLB flushes on a PMD
    boundary instead of every 64 pages to reduce the number of TLB
    shootdowns by a factor of 8 in the ideal case. LATENCY_LIMIT was almost
    certainly used originally to limit the PTL hold times but the latency
    savings are likely offset by the cost of IPIs in many cases. This patch
    is not reported to completely restore performance but gets it within an
    acceptable percentage. The given metric here is simply described as
    "higher is better".

    Baseline that was known good
    002: Metric: 91.05
    004: Metric: 109.45
    008: Metric: 73.08
    016: Metric: 58.14
    032: Metric: 61.09
    064: Metric: 57.76
    128: Metric: 55.43

    Current
    001: Metric: 54.98
    002: Metric: 56.56
    004: Metric: 41.22
    008: Metric: 35.96
    016: Metric: 36.45
    032: Metric: 35.71
    064: Metric: 35.73
    128: Metric: 34.96

    With patch
    001: Metric: 61.43
    002: Metric: 81.64
    004: Metric: 67.92
    008: Metric: 51.67
    016: Metric: 50.47
    032: Metric: 52.29
    064: Metric: 50.01
    128: Metric: 49.04

    So for low threads, it's not restored but for larger number of threads,
    it's closer to the "known good" baseline.

    Using a different mremap-intensive workload that is not representative
    of the real workload, there is little difference observed outside of
    noise in the headline metrics. However, the TLB shootdowns are reduced by
    11% on average and at the peak, TLB shootdowns were reduced by 21%.
    Interrupts were sampled every second while the workload ran to get those
    figures. It's known that the figures will vary as the
    non-representative load is non-deterministic.

    An alternative patch was posted that should have significantly reduced
    the TLB flushes but unfortunately it does not perform as well as this
    version on the customer test case. If revisited, the two patches can
    stack on top of each other.

    Link: http://lkml.kernel.org/r/20180606183803.k7qaw2xnbvzshv34@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Andrew Morton
    Cc: Nadav Amit
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
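
    For a C source file falling under that default, the added line is simply:

    // SPDX-License-Identifier: GPL-2.0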

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

1 commit

  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

07 Sep, 2017

1 commit

  • mremap will attempt to create a 'duplicate' mapping if old_size == 0 is
    specified. In the case of private mappings, mremap will actually create
    a fresh separate private mapping unrelated to the original. This does
    not fit with the design semantics of mremap as the intention is to
    create a new mapping based on the original.

    Therefore, return EINVAL in the case where an attempt is made to
    duplicate a private mapping. Also, print a warning message (once) if
    such an attempt is made.
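
    A hedged userspace illustration of the old_size == 0 "duplicate" call in
    question; after this change the MAP_PRIVATE variant fails with EINVAL
    while the MAP_SHARED one still yields a second mapping of the same pages:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
            size_t len = 4096;
            void *shr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            void *prv = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (shr == MAP_FAILED || prv == MAP_FAILED)
                    return 1;

            void *dup_shr = mremap(shr, 0, len, MREMAP_MAYMOVE); /* allowed */
            void *dup_prv = mremap(prv, 0, len, MREMAP_MAYMOVE); /* now EINVAL */

            printf("shared dup: %p, private dup: %p\n", dup_shr, dup_prv);
            return 0;
    }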

    Link: http://lkml.kernel.org/r/cb9d9f6a-7095-582f-15a5-62643d65c736@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Aaron Lu
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

03 Aug, 2017

2 commits

  • When mremap is called with MREMAP_FIXED it unmaps memory at the
    destination address without notifying userfaultfd monitor.

    If the destination were registered with userfaultfd, the monitor has no
    way to distinguish between the old and new ranges and to properly relate
    the page faults that would occur in the destination region.

    Fixes: 897ab3e0c49e ("userfaultfd: non-cooperative: add event for memory unmaps")
    Link: http://lkml.kernel.org/r/1500276876-3350-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pavel Emelyanov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0                                CPU1
    ----                                ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoretically possible so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup, as a race with
    reclaim, look just at PTEs. Huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Mar, 2017

1 commit


25 Feb, 2017

1 commit

  • When a non-cooperative userfaultfd monitor copies pages in the
    background, it may encounter regions that were already unmapped.
    Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
    changes in the virtual memory layout.

    Since there might be different uffd contexts for the affected VMAs, we
    first should create a temporary representation for the unmap event for
    each uffd context and then notify them one by one to the appropriate
    userfault file descriptors.

    The event notification occurs after the mmap_sem has been released.

    [arnd@arndb.de: fix nommu build]
    Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
    [mhocko@suse.com: fix nommu build]
    Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

23 Feb, 2017

2 commits

  • Optimize the mremap_userfaultfd_complete() interface to pass only the
    vm_userfaultfd_ctx pointer through the stack as a microoptimization.

    Link: http://lkml.kernel.org/r/20161216144821.5183-13-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Hillf Danton
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The event denotes that an area [start:end] moves to a different location.
    Length change isn't reported as "new" addresses; if they appear on the
    uffd reader side they will not contain any data and the latter can just
    zeromap them.

    Waiting for the event ACK is also done outside of the mmap sem, as for the
    fork event.

    Link: http://lkml.kernel.org/r/20161216144821.5183-12-aarcange@redhat.com
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

30 Nov, 2016

1 commit

  • Linus found there still is a race in mremap after commit 5d1904204c99
    ("mremap: fix race between mremap() and page cleanning").

    As described by Linus:
    "the issue is that another thread might make the pte be dirty (in the
    hardware walker, so no locking of ours will make any difference)
    *after* we checked whether it was dirty, but *before* we removed it
    from the page tables"

    Fix it by moving the check after we removed it from the page table.

    Suggested-by: Linus Torvalds
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

18 Nov, 2016

1 commit

  • Prior to 3.15, there was a race between zap_pte_range() and
    page_mkclean() where writes to a page could be lost. Dave Hansen
    discovered by inspection that there is a similar race between
    move_ptes() and page_mkclean().

    We've been able to reproduce the issue by enlarging the race window with
    a msleep(), but have not been able to hit it without modifying the code.
    So, we think it's a real issue, but is difficult or impossible to hit in
    practice.

    The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
    'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And
    this patch is to fix the race between page_mkclean() and mremap().

    Here is one possible way to hit the race: suppose a process mmapped a
    file with READ | WRITE and SHARED, it has two threads and they are bound
    to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread
    1 did a write to addr X so that CPU1 now has a writable TLB for addr X
    on it. Thread 2 starts mremapping from addr X to Y while thread 1
    cleaned the page and then did another write to the old addr X again.
    The 2nd write from thread 1 could succeed but the value will get lost.

    thread 1 thread 2
    (bound to CPU1) (bound to CPU2)

    1: write 1 to addr X to get a
    writeable TLB on this CPU

    2: mremap starts

    3: move_ptes emptied PTE for addr X
    and setup new PTE for addr Y and
    then dropped PTL for X and Y

    4: page laundering for N by doing
    fadvise FADV_DONTNEED. When done,
    pageframe N is deemed clean.

    5: *write 2 to addr X

    6: tlb flush for addr X

    7: munmap (Y, pagesize) to make the
    page unmapped

    8: fadvise with FADV_DONTNEED again
    to kick the page off the pagecache

    9: pread the page from file to verify
    the value. If 1 is there, it means
    we have lost the written 2.

    *the write may or may not cause segmentation fault, it depends on
    if the TLB is still on the CPU.

    Please note that this is only one specific way of how the race could
    occur, it didn't mean that the race could only occur in exact the above
    config, e.g. more than 2 threads could be involved and fadvise() could
    be done in another thread, etc.

    For anonymous pages, they could race between mremap() and page reclaim:
    THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
    PMD gets unmapped/splitted/pagedout before the flush tlb happened for
    the old huge PMD in move_page_tables() and we could still write data to
    it. The normal anonymous page has similar situation.

    To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
    if any, did the flush before dropping the PTL. If we did the flush for
    every move_ptes()/move_huge_pmd() call then we do not need to do the
    flush in move_page_tables() for the whole range. But if we didn't, we
    still need to do the whole range flush.

    Alternatively, we can track which part of the range is flushed in
    move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
    range in move_page_tables(). But that would require multiple tlb
    flushes for the different sub-ranges and should be less efficient than
    the single whole range flush.

    KBuild test on my Sandybridge desktop doesn't show any noticeable change.
    v4.9-rc4:
    real 5m14.048s
    user 32m19.800s
    sys 4m50.320s

    With this commit:
    real 5m13.888s
    user 32m19.330s
    sys 4m51.200s

    Reported-by: Dave Hansen
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

27 Jul, 2016

1 commit

  • split_huge_pmd() doesn't guarantee that the pmd is a normal pmd pointing
    to pte entries, which can be checked with pmd_trans_unstable(). Some
    callers make this assertion and some do it differently and some not, so
    let's do it in a unified manner.

    Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 May, 2016

1 commit

  • This is a follow-up work for the oom_reaper [1]. As the async OOM killing
    depends on oom_sem for read, we would really appreciate it if a holder for
    write didn't stand in the way. This patchset changes many of the
    down_write calls to be killable to help those cases when the writer is
    blocked and waiting for readers to release the lock and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to the userspace as
    the task has fatal signal pending). Others seem to be easy as well as
    the callers are already handling fatal errors and bail and return to
    userspace which should be sufficient to handle the failure gracefully.
    I am not familiar with all those code paths so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which are taking the lock early after
    entering the syscall and they are not changing state before.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem which might be required to make a forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.
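
    The per-site change is small; for a syscall entry path it is roughly
    (illustrative):

    -       down_write(&mm->mmap_sem);
    +       if (down_write_killable(&mm->mmap_sem))
    +               return -EINTR;  /* fatal signal pending, never reaches userspace */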

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

1 commit

  • Whatever huge pagecache implementation we go with, file rmap locking
    must be added to anon rmap locking, when mremap's move_page_tables()
    finds a pmd_trans_huge pmd entry: a simple change, let's do it now.

    Factor out take_rmap_locks() and drop_rmap_locks() to handle the locking
    for move_ptes() and move_page_tables(), and delete the
    VM_BUG_ON_VMA which rejected vm_file and required anon_vma.
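
    A hedged sketch of the factored-out pair (close to what the description
    implies; the merged helpers may differ in detail):

    static void take_rmap_locks(struct vm_area_struct *vma)
    {
            if (vma->vm_file)
                    i_mmap_lock_write(vma->vm_file->f_mapping);
            if (vma->anon_vma)
                    anon_vma_lock_write(vma->anon_vma);
    }

    static void drop_rmap_locks(struct vm_area_struct *vma)
    {
            if (vma->anon_vma)
                    anon_vma_unlock_write(vma->anon_vma);
            if (vma->vm_file)
                    i_mmap_unlock_write(vma->vm_file->f_mapping);
    }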

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins