18 May, 2022

1 commit

  • commit 478d134e9506c7e9bfe2830ed03dd85e97966313 upstream.

    Kernel panic when injecting memory_failure for the global huge_zero_page,
    when CONFIG_DEBUG_VM is enabled, as follows.

    Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
    page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
    head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
    flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
    raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
    raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
    page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:2499!
    invalid opcode: 0000 [#1] PREEMPT SMP PTI
    CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
    Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
    RIP: 0010:split_huge_page_to_list+0x66a/0x880
    Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
    RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
    RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
    RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
    R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
    R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
    FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    try_to_split_thp_page+0x3a/0x130
    memory_failure+0x128/0x800
    madvise_inject_error.cold+0x8b/0xa1
    __x64_sys_madvise+0x54/0x60
    do_syscall_64+0x35/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7fc3754f8bf9
    Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
    RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
    RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
    RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
    R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000

    We think that raising a BUG is overkill for splitting the
    huge_zero_page: the huge_zero_page cannot be reached from normal paths
    other than memory failure, but memory failure is a valid caller. So
    replace the BUG with a WARN plus a -EBUSY return, and the panic above
    won't happen again.
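
    For illustration, a minimal sketch of the shape of the fix in
    split_huge_page_to_list() (hedged; the exact upstream diff may differ
    in detail):

        /* Refuse, rather than BUG, when asked to split the huge zero page. */
        if (is_huge_zero_page(head)) {
                pr_warn_ratelimited("Called split_huge_page() for huge zero page\n");
                return -EBUSY;
        }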

    Link: https://lkml.kernel.org/r/f35f8b97377d5d3ede1bc5ac3114da888c57cbce.1651052574.git.xuyu@linux.alibaba.com
    Fixes: d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in memory_failure()")
    Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
    Signed-off-by: Xu Yu
    Suggested-by: Yang Shi
    Reported-by: kernel test robot
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yang Shi
    Reviewed-by: Miaohe Lin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Xu Yu
     

29 Oct, 2021

1 commit

  • When handling a shmem page fault, a THP with a corrupted subpage could
    be PMD-mapped if certain conditions are satisfied. But the kernel is
    supposed to send SIGBUS when trying to map a hwpoisoned page.

    There are two paths which may do PMD map: fault around and regular
    fault.

    Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
    codepaths") the problem was even worse in the fault-around path: the
    THP could be PMD-mapped as long as the VMA fit, regardless of which
    subpage was accessed and corrupted. After that commit, the THP can
    still be PMD-mapped as long as the head page is not corrupted.

    In the regular fault path the THP can be PMD-mapped as long as the
    corrupted subpage is not the one being accessed and the VMA fits.

    This loophole could be fixed by iterating over every subpage to check
    whether any of them is hwpoisoned, but that is somewhat costly in the
    page fault path.

    So introduce a new page flag, HasHWPoisoned, on the first tail page.
    It indicates that the THP has hwpoisoned subpage(s). It is set if any
    subpage of the THP is found hwpoisoned by memory failure after the
    refcount has been bumped successfully, and cleared when the THP is
    freed or split.

    The soft offline path doesn't need this, since the soft offline handler
    just marks a subpage hwpoisoned when the subpage is migrated
    successfully, and shmem THPs don't get split and then migrated at all
    in that path.
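
    As an illustration, a hedged sketch of how the PMD-mapping fault path
    can consult the new flag before installing a huge mapping (placement
    and details follow the description above, not necessarily the exact
    upstream diff):

        /* Before mapping the whole THP with a PMD in the fault path: */
        if (unlikely(PageHasHWPoisoned(page)))
                return VM_FAULT_FALLBACK;  /* fall back to PTE mapping; the bad subpage gets SIGBUS */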

    Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Yang Shi
    Reviewed-by: Naoya Horiguchi
    Suggested-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

19 Oct, 2021

1 commit

  • Decrease the nr_thps counter in the file's mapping to ensure that the
    page cache won't be dropped excessively on file write access if the
    page has already been split.

    I've tried a test scenario: run a big binary, the kernel remaps it with
    THPs, then force a THP split with /sys/kernel/debug/split_huge_pages.
    During any further open of that binary with O_RDWR or O_WRONLY the
    kernel drops the page cache for it, because of the non-zero nr_thps
    counter.
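
    A hedged sketch of the shape of the fix in split_huge_page_to_list():
    after a successful split of a regular (non-shmem) file THP, keep the
    mapping's THP counter in sync.

        /* head is the THP being split, mapping is its address_space */
        if (mapping && !PageSwapBacked(head))
                filemap_nr_thps_dec(mapping);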

    Link: https://lkml.kernel.org/r/20211012120237.2600-1-m.szyprowski@samsung.com
    Signed-off-by: Marek Szyprowski
    Fixes: 09d91cda0e82 ("mm,thp: avoid writes to file with THP in pagecache")
    Fixes: 06d3eff62d9d ("mm/thp: fix node page state in split_huge_page_to_list()")
    Acked-by: Matthew Wilcox (Oracle)
    Reviewed-by: Yang Shi
    Cc:
    Cc: Song Liu
    Cc: Rik van Riel
    Cc: "Kirill A . Shutemov"
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     

04 Sep, 2021

2 commits

  • Before commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling"),
    TLB flushing was done in do_huge_pmd_numa_page() itself via
    flush_tlb_range().

    After that commit, the TLB flushing is done in migrate_pages(), as in
    the following code path:

      do_huge_pmd_numa_page
        migrate_misplaced_page
          migrate_pages

    So the TLB flushing code in do_huge_pmd_numa_page() has become
    unnecessary; delete it to simplify the code. This is only a code
    cleanup, there's no visible performance difference.

    The mmu_notifier_invalidate_range() in do_huge_pmd_numa_page() is
    deleted too, because migrate_pages() takes care of that as well when
    the CPU TLB is flushed.

    Link: https://lkml.kernel.org/r/20210720065529.716031-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Zi Yan
    Reviewed-by: Yang Shi
    Cc: Dan Carpenter
    Cc: Mel Gorman
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vasily Gorbik
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • A successful shmem_fallocate() guarantees that the extent has been
    reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
    But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
    split huge pages and free their excess beyond i_size; and by other uses of
    split_huge_page() near i_size.

    It's sad to add a shmem inode field just for this, but I did not find a
    better way to keep the guarantee. A flag to say KEEP_SIZE has been used
    would be cheaper, but I'm averse to unclearable flags. The fallocend
    field is not perfect either (many disjoint ranges might be fallocated),
    but good enough; and gains another use later on.
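
    An illustrative, hedged sketch of the approach described above (treat
    the exact field and helper names as assumptions):

        /* New field in struct shmem_inode_info (illustrative):
         *      pgoff_t fallocend;      highest page offset ever fallocated
         */
        static pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
        {
                /* Never shrink or split back past the furthest fallocated
                 * end, only past i_size. */
                return max(eof, SHMEM_I(inode)->fallocend);
        }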

    Link: https://lkml.kernel.org/r/ca9a146-3a59-6cd3-7f28-e9a044bb1052@google.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Jul, 2021

1 commit

  • Parallel developments in mm/rmap.c have left behind some out-of-date
    comments: try_to_migrate_one() also accepts TTU_SYNC (already commented
    in try_to_migrate() itself), and try_to_migrate() returns nothing at
    all.

    TTU_SPLIT_FREEZE has just been deleted, so reword the comment about it
    in mm/huge_memory.c; and TTU_IGNORE_ACCESS was removed in 5.11, so
    delete the "recently referenced" comment from try_to_unmap_one() (once
    upon a time the comment was near the removed codeblock, but they drifted
    apart).

    Signed-off-by: Hugh Dickins
    Reviewed-by: Shakeel Butt
    Reviewed-by: Alistair Popple
    Link: https://lore.kernel.org/lkml/563ce5b2-7a44-5b4d-1dfd-59a0e65932a9@google.com/
    Cc: Andrew Morton
    Cc: Jason Gunthorpe
    Cc: Ralph Campbell
    Cc: Christoph Hellwig
    Cc: Yang Shi
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Jul, 2021

3 commits

  • Migration is currently implemented as a mode of operation for
    try_to_unmap_one(), generally specified by passing the TTU_MIGRATION
    flag or, in the case of splitting a huge anonymous page,
    TTU_SPLIT_FREEZE.

    However it does not have much in common with the rest of the unmap
    functionality of try_to_unmap_one(), so splitting it out into a
    separate function reduces the complexity of try_to_unmap_one() and
    makes it more readable.

    Several simplifications can also be made in try_to_migrate_one() based on
    the following observations:

    - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
    - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
    - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

    TTU_SPLIT_FREEZE is a special case of migration used when splitting an
    anonymous page. This is most easily dealt with by calling the correct
    function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
    for PageAnon or try_to_unmap().
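
    Roughly, the resulting unmap_page() in mm/huge_memory.c looks like the
    hedged sketch below (simplified from the code as of this series):

        static void unmap_page(struct page *page)
        {
                enum ttu_flags ttu_flags = TTU_RMAP_LOCKED |
                                           TTU_SPLIT_HUGE_PMD | TTU_SYNC;

                /* Anon pages need migration entries preserved across the
                 * split; file pages can simply be unmapped and faulted
                 * back on demand. */
                if (PageAnon(page))
                        try_to_migrate(page, ttu_flags);
                else
                        try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
        }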

    Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Cc: Ben Skeggs
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Peter Xu
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Both migration and device private pages use special swap entries that
    are manipulated by a range of inline functions. The arguments to these
    are somewhat inconsistent, so rework them to remove the flag-type
    arguments and to make the arguments similar for both read and write
    entry creation.
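
    A hedged usage sketch of the reworked helpers: the read/write
    distinction moves into the function name, and the only argument is the
    pfn/offset.

        swp_entry_t entry;

        if (pte_write(pteval))
                entry = make_writable_migration_entry(page_to_pfn(page));
        else
                entry = make_readable_migration_entry(page_to_pfn(page));
        swp_pte = swp_entry_to_pte(entry);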

    Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Ralph Campbell
    Cc: Ben Skeggs
    Cc: Hugh Dickins
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Peter Xu
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Patch series "Add support for SVM atomics in Nouveau", v11.

    Introduction
    ============

    Some devices have features such as atomic PTE bits that can be used to
    implement atomic access to system memory. To support atomic operations to
    a shared virtual memory page such a device needs access to that page which
    is exclusive of the CPU. This series introduces a mechanism to
    temporarily unmap pages granting exclusive access to a device.

    These changes are required to support OpenCL atomic operations in Nouveau
    to shared virtual memory (SVM) regions allocated with the
    CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
    OpenCL SVM feature is available at
    https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory.

    Implementation
    ==============

    Exclusive device access is implemented by adding a new swap entry type
    (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
    difference is that on fault the original entry is immediately restored by
    the fault handler instead of waiting.

    Restoring the entry triggers calls to MMU notifiers, which allows a
    device driver to revoke the atomic access permission from the GPU
    before the CPU finalises the entry.

    Patches
    =======

    Patches 1 & 2 refactor existing migration and device private entry
    functions.

    Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
    functionality into separate functions - try_to_migrate_one() and
    try_to_munlock_one().

    Patch 5 renames some existing code but does not introduce functionality.

    Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

    Patch 7 contains the bulk of the implementation for device exclusive
    memory.

    Patch 8 contains some additions to the HMM selftests to ensure everything
    works as expected.

    Patch 9 is a cleanup for the Nouveau SVM implementation.

    Patch 10 contains the implementation of atomic access for the Nouveau
    driver.

    Testing
    =======

    This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
    which checks that GPU atomic accesses to system memory are atomic.
    Without this series the test fails as there is no way of write-protecting
    the page mapping which results in the device clobbering CPU writes. For
    reference the test is available at
    https://ozlabs.org/~apopple/opencl_svm_atomics/

    Further testing has been performed by adding support for testing exclusive
    access to the hmm-tests kselftests.

    This patch (of 10):

    Remove multiple similar inline functions for dealing with different types
    of special swap entries.

    Both migration and device private swap entries use the swap offset to
    store a pfn. Instead of multiple inline functions to obtain a struct
    page for each swap entry type, use a common function
    pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
    functions, as this results in shorter code that is easier to
    understand.
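
    A hedged usage sketch: a single helper now maps either kind of
    pfn-carrying swap entry back to its struct page.

        swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
        /* Works for both migration and device private entries. */
        struct page *page = pfn_swap_entry_to_page(entry);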

    Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
    Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Hugh Dickins
    Cc: Peter Xu
    Cc: Shakeel Butt
    Cc: Ben Skeggs
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     

01 Jul, 2021

11 commits

  • Using MAX_INPUT_BUF_SZ as the maximum length of the string makes fortify
    complain as it thinks the string might be longer than the buffer, and if
    it is, we will end up with a "string" that is missing a NUL terminator.
    It's trivial to show that 'tok' points to a NUL-terminated string which is
    less than MAX_INPUT_BUF_SZ in length, so we may as well just use strcpy()
    and avoid the warning.
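
    A hedged sketch of the copy in the debugfs write handler (variable
    names illustrative):

        /* tok was produced by strsep() on a buffer of at most
         * MAX_INPUT_BUF_SZ bytes, so it is NUL-terminated and fits. */
        strcpy(file_path, tok);   /* was: strncpy(file_path, tok, MAX_INPUT_BUF_SZ) */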

    Link: https://lkml.kernel.org/r/20210615200242.1716568-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon, and
    migration entries are only inserted when TTU_MIGRATION (unused here) or
    TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page() to
    search for migration entries to remove when !PageAnon.
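
    A hedged sketch of the resulting early return in remap_page():

        static void remap_page(struct page *page, unsigned int nr)
        {
                /* Only anon THPs had migration entries installed by unmap_page(). */
                if (!PageAnon(page))
                        return;
                if (PageTransHuge(page))
                        remove_migration_ptes(page, page, true);
                /* ... otherwise walk the nr subpages ... */
        }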

    Link: https://lkml.kernel.org/r/f987bc44-f28e-688d-2424-b4722153ed8@google.com
    Fixes: baa355fd3314 ("thp: file pages support for split_huge_page()")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390
    support both NUMA balancing and THP. But S390 doesn't support THP
    migration, so NUMA balancing actually can't migrate any misplaced
    pages there.

    Skip making the PMD PROT_NONE in that case; otherwise CPU cycles may
    be wasted on pointless NUMA hinting faults on S390.
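
    A hedged sketch of the check, in the prot_numa path of
    change_huge_pmd():

        /* NUMA hinting is pointless if the THP cannot be migrated anyway. */
        if (prot_numa && !thp_migration_supported())
                return 1;       /* leave the PMD untouched */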

    Link: https://lkml.kernel.org/r/20210518200801.7413-8-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Acked-by: Mel Gorman
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vasily Gorbik
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When THP NUMA fault support was added, THP migration was not supported
    yet, so ad hoc THP migration was implemented in the NUMA fault
    handling. Since v4.14 THP migration has been supported, so it doesn't
    make much sense to keep a separate THP migration implementation rather
    than using the generic migration code.

    This patch reworks the NUMA fault handling to use the generic
    migration implementation to migrate the misplaced page. There is no
    functional change.

    After the refactor the flow of NUMA fault handling looks just like its
    PTE counterpart:
      Acquire ptl
      Prepare for migration (elevate page refcount)
      Release ptl
      Isolate page from lru and elevate page refcount
      Migrate the misplaced THP

    If migration fails, just restore the old normal PMD.

    In the old code the anon_vma lock was needed to serialize THP
    migration against THP split, but the THP code has been reworked a lot
    since then and the anon_vma lock no longer seems to be required to
    avoid the race. The page refcount elevation while holding the ptl
    should prevent the THP from being split.

    Use migrate_misplaced_page() for both base page and THP NUMA hinting
    faults and remove all the dead and duplicate code.
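
    A heavily simplified, hedged sketch of the reworked
    do_huge_pmd_numa_page() flow (illustrative only; locking details and
    error handling omitted):

        vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
        page = vm_normal_page_pmd(vma, haddr, pmd);
        get_page(page);                 /* prepare for migration under ptl */
        spin_unlock(vmf->ptl);

        if (migrate_misplaced_page(page, vma, target_nid))
                return 0;               /* THP migrated by the generic code */

        /* Migration failed: re-establish the old PMD with normal protection. */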

    [dan.carpenter@oracle.com: fix a double unlock bug]
    Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda

    Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Signed-off-by: Dan Carpenter
    Acked-by: Mel Gorman
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vasily Gorbik
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.

    When THP NUMA fault support was added, THP migration was not supported
    yet, so ad hoc THP migration was implemented in the NUMA fault
    handling. Since v4.14 THP migration has been supported, so it doesn't
    make much sense to keep a separate THP migration implementation rather
    than using the generic migration code. It is definitely a maintenance
    burden to keep two THP migration implementations for different code
    paths, and it is more error prone. Using the generic THP migration
    implementation allows us to remove the duplicate code and some hacks
    needed by the old ad hoc implementation.

    A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390
    support both THP and NUMA balancing. Most of them support THP
    migration, except for S390. Zi Yan tried to add THP migration support
    for S390 before, but it was not accepted due to the design of the S390
    PMD. For the discussion, please see:
    https://lkml.org/lkml/2018/4/27/953.

    Per the discussion with Gerald Schaefer in v1, it is acceptable to
    skip huge PMDs for S390 for now.

    I saw there were some hacks around gup in the git history, but I
    didn't figure out whether they have been removed, since I still see
    FOLL_NUMA code in the current gup implementation and it seems useful.

    Patch #1 ~ #2 are preparation patches.
    Patch #3 is the real meat.
    Patch #4 ~ #6 keep counters and behaviors consistent with before.
    Patch #7 skips changing huge PMDs to prot_none if THP migration is not
    supported.

    Test
    ----
    Did some tests to measure the latency of do_huge_pmd_numa_page. The
    test VM has 80 vcpus and 64G memory. The test creates 2 processes that
    together consume 128G of memory, which incurs memory pressure and
    causes THP splits. It also creates 80 processes to hog the CPUs, and
    the memory consumer processes are bound to different nodes
    periodically in order to increase NUMA faults.

    The below test script is used:

    echo 3 > /proc/sys/vm/drop_caches

    # Run stress-ng for 24 hours
    ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
    PID=$!

    ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &

    # Wait for vm stressors forked
    sleep 5

    PID_1=`pgrep -P $PID | awk 'NR == 1'`
    PID_2=`pgrep -P $PID | awk 'NR == 2'`

    JOB1=`pgrep -P $PID_1`
    JOB2=`pgrep -P $PID_2`

    # Bind load jobs to different nodes periodically to force generate
    # cross node memory access
    while [ -d "/proc/$PID" ]
    do
        taskset -apc 8 $JOB1
        taskset -apc 8 $JOB2
        sleep 300
        taskset -apc 58 $JOB1
        taskset -apc 58 $JOB2
        sleep 300
    done

    With the above test, the histogram of the latency of
    do_huge_pmd_numa_page is shown below. Since the number of
    do_huge_pmd_numa_page calls varies drastically between runs (probably
    due to the scheduler), I converted the raw numbers to percentages.

    patched base
    @us[stress-ng]:
    [0] 3.57% 0.16%
    [1] 55.68% 18.36%
    [2, 4) 10.46% 40.44%
    [4, 8) 7.26% 17.82%
    [8, 16) 21.12% 13.41%
    [16, 32) 1.06% 4.27%
    [32, 64) 0.56% 4.07%
    [64, 128) 0.16% 0.35%
    [128, 256) < 0.1% < 0.1%
    [256, 512) < 0.1% < 0.1%
    [512, 1K) < 0.1% < 0.1%
    [1K, 2K) < 0.1% < 0.1%
    [2K, 4K) < 0.1% < 0.1%
    [4K, 8K) < 0.1% < 0.1%
    [8K, 16K) < 0.1% < 0.1%
    [16K, 32K) < 0.1% < 0.1%
    [32K, 64K) < 0.1% < 0.1%

    Per the result, the patched kernel is even slightly better than the
    base kernel. I think this is because the lock contention against THP
    split is lower than in the base kernel due to the refactor.

    To exclude the effect of THP splits, I also tested without memory
    pressure. No obvious regression was spotted. Below is the test result
    *without* memory pressure.

    patched base
    @us[stress-ng]:
    [0] 7.97% 18.4%
    [1] 69.63% 58.24%
    [2, 4) 4.18% 2.63%
    [4, 8) 0.22% 0.17%
    [8, 16) 1.03% 0.92%
    [16, 32) 0.14% < 0.1%
    [32, 64) < 0.1% < 0.1%
    [64, 128) < 0.1% < 0.1%
    [128, 256) < 0.1% < 0.1%
    [256, 512) 0.45% 1.19%
    [512, 1K) 15.45% 17.27%
    [1K, 2K) < 0.1% < 0.1%
    [2K, 4K) < 0.1% < 0.1%
    [4K, 8K) < 0.1% < 0.1%
    [8K, 16K) 0.86% 0.88%
    [16K, 32K) < 0.1% 0.15%
    [32K, 64K) < 0.1% < 0.1%
    [64K, 128K) < 0.1% < 0.1%
    [128K, 256K) < 0.1% < 0.1%

    The series also survived a series of tests that exercise NUMA balancing
    migrations by Mel.

    This patch (of 7):

    Add orig_pmd to struct vm_fault so that the "orig_pmd" parameter used
    by the huge page fault handlers can be removed, just like its PTE
    counterpart.

    Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
    Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Acked-by: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: Zi Yan
    Cc: Huang Ying
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • We tried to do something similar in b569a1760782 ("userfaultfd: wp:
    drop _PAGE_UFFD_WP properly when fork") previously, but it didn't get
    it all right. A few fixes around the code path:

    1. We were referencing VM_UFFD_WP in vm_flags on the _old_ vma rather
       than the new vma. That was overlooked in b569a1760782, so it won't
       work as expected. Thanks to the recent rework of the fork code
       (7a4830c380f3a8b3), we can easily get the new vma now, so switch
       the checks to that.

    2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
       huge pmd is a migration huge pmd. When that happens, instead of
       pmd_uffd_wp() we should use pmd_swp_uffd_wp(). The fix is simply
       to handle the two cases separately.

    3. We forgot to carry over the uffd-wp bit for a write migration huge
       pmd entry. This also happens in copy_huge_pmd(), where we convert
       a write huge migration entry into a read one.

    4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.

    5. In copy_present_page(), when COW is enforced at fork(), we also
       need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the
       new vma and the pte to be copied has the uffd-wp bit set.

    Remove the comment in copy_present_pte() about this. It won't help
    much to comment only there, and commenting everywhere would be
    overkill. Let's assume the commit messages will help.
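
    A hedged sketch of the migration-entry handling in copy_huge_pmd()
    after these fixes (simplified; helper names as in the current tree):

        if (is_writable_migration_entry(entry)) {
                /* Point 3: downgrade to a read migration entry, but keep uffd-wp. */
                entry = make_readable_migration_entry(swp_offset(entry));
                pmd = swp_entry_to_pmd(entry);
                if (pmd_swp_uffd_wp(*src_pmd))
                        pmd = pmd_swp_mkuffd_wp(pmd);
                set_pmd_at(src_mm, addr, src_pmd, pmd);
        }
        /* Points 1 and 2: check VM_UFFD_WP on the new (dst) vma, and use
         * the swap-pmd variant of the uffd-wp helpers for migration pmds. */
        if (!userfaultfd_wp(dst_vma))
                pmd = pmd_swp_clear_uffd_wp(pmd);
        set_pmd_at(dst_mm, addr, dst_pmd, pmd);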

    [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
    Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com

    Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
    Fixes: b569a1760782f ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
    Signed-off-by: Peter Xu
    Cc: Jerome Glisse
    Cc: Mike Rapoport
    Cc: Alexander Viro
    Cc: Andrea Arcangeli
    Cc: Axel Rasmussen
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Cc: Lokesh Gidra
    Cc: Mike Kravetz
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Patch series "mm/uffd: Misc fix for uffd-wp and one more test".

    This series tries to fix some corner-case bugs for uffd-wp on either
    THP or fork(). It then introduces a new test with pagemap/pageout.

    Patch layout:

    Patch 1: cleanup for THP, it'll slightly simplify the follow up patches
    Patch 2-4: misc fixes for uffd-wp here and there; please refer to each patch
    Patch 5: add pagemap support for uffd-wp
    Patch 6: add pagemap/pageout test for uffd-wp

    The last test introduced can also verify some of the fixes in the
    previous patches, as it will fail without them. However it's not easy
    to verify all the changes in patches 2-4, but hopefully they can still
    be properly reviewed.

    Note that considering the ongoing uffd-wp shmem & hugetlbfs work,
    patch 5 will be incomplete as it is missing e.g. the hugetlbfs part
    and the special swap pte detection. However that's not needed in this
    series, and since that other series is still under review, this series
    does not depend on it (the last test only runs with anonymous memory,
    not file-backed). So this series can be merged even before that one.

    This patch (of 6):

    The huge zero page is handled via a special path in copy_huge_pmd(),
    but it should share most of its code with a normal THP page. Try to
    share more code with it by removing the special path. The only
    leftover so far is the huge zero page refcounting
    (mm_get_huge_zero_page()), because that is done separately with a
    global counter.

    This prepares for a future patch which modifies the huge pmd to be
    installed, so that we don't need to duplicate it explicitly for the
    huge zero page case as well.

    Link: https://lkml.kernel.org/r/20210428225030.9708-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20210428225030.9708-2-peterx@redhat.com
    Signed-off-by: Peter Xu
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Axel Rasmussen
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Alexander Viro
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Joe Perches
    Cc: Lokesh Gidra
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • If other processes are mapping any other subpages of the hugepage,
    i.e. in the pte-mapped THP case, page_mapcount() will incorrectly
    return 1. We would then discard the page while other processes are
    still mapping it. Fix it by using total_mapcount(), which can tell
    whether other processes are still mapping it.
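
    A hedged sketch of the check in madvise_free_huge_pmd():

        /* Skip the discard unless we are the only mapper of every subpage. */
        if (total_mapcount(page) != 1)          /* was: page_mapcount(page) != 1 */
                goto out;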

    Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com
    Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
    Reviewed-by: Yang Shi
    Signed-off-by: Miaohe Lin
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: David Hildenbrand
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Commit aa88b68c3b1d ("thp: keep huge zero page pinned until tlb
    flush") introduced tlb_remove_page() for the huge zero page to keep it
    pinned until the flush is complete and to prevent the page from being
    split under us. But the huge zero page has been kept pinned until all
    relevant mm_users reach zero ever since commit 6fcb52a56ff6 ("thp:
    reduce usage of huge zero page's atomic counter"). So
    tlb_remove_page_size() for the huge zero pmd is unnecessary now.

    Link: https://lkml.kernel.org/r/20210511134857.1581273-5-linmiaohe@huawei.com
    Reviewed-by: Yang Shi
    Acked-by: David Hildenbrand
    Signed-off-by: Miaohe Lin
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for
    (non-shmem) FS"), read-only THP file mapping is supported, but that
    commit forgot to add a check for it in transparent_hugepage_enabled().
    To fix it, add a check for read-only THP file mappings and also
    introduce the helper transhuge_vma_enabled() to check whether THP is
    enabled for a given vma, to reduce duplicated code. Rename
    transparent_hugepage_enabled to transparent_hugepage_active to make
    the code easier to follow, as suggested by David Hildenbrand.
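
    The new helper is roughly the following hedged sketch (compare the
    upstream patch for the exact form):

        static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
                                                 unsigned long vm_flags)
        {
                /* Explicitly disabled through madvise or prctl. */
                if ((vm_flags & VM_NOHUGEPAGE) ||
                    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
                        return false;
                return true;
        }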

    [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
    Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com

    Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Miaohe Lin
    Reviewed-by: Yang Shi
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: David Hildenbrand
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Now that we can represent the location of ->deferred_list instead of
    ->mapping + ->index, make use of it to improve readability.

    Link: https://lkml.kernel.org/r/20210511134857.1581273-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Yang Shi
    Reviewed-by: David Hildenbrand
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

17 Jun, 2021

4 commits

  • When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
    fail, but the first VM_BUG_ON_PAGE() only checks page_mapcount(), so
    it may miss the failure when the head page is unmapped while another
    subpage is still mapped. The second, DEBUG_VM-only BUG() that checks
    the total mapcount would then catch it. This can cause some confusion.

    As this is not a fatal issue, consolidate the two DEBUG_VM checks into
    one VM_WARN_ON_ONCE_PAGE().

    [1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

    Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
    Signed-off-by: Yang Shi
    Reviewed-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Hugh Dickins
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
    (!unmap_success): with dump_page() showing mapcount:1, but then its raw
    struct page output showing _mapcount ffffffff i.e. mapcount 0.

    And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
    it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
    and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
    all indicative of some mapcount difficulty in development here perhaps.
    But the !CONFIG_DEBUG_VM path handles the failures correctly and
    silently.

    I believe the problem is that once a racing unmap has cleared pte or
    pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
    from try_to_unmap() before the racing task has reached decrementing
    mapcount.

    Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
    follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
    TTU_SYNC to the options, and passing that from unmap_page().

    When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
    for both: the slight overhead added should rarely matter, except perhaps
    if splitting sparsely-populated multiply-mapped shmem. Once confident
    that bugs are fixed, TTU_SYNC here can be removed, and the race
    tolerated.
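
    A hedged sketch of the two sides of the change:

        /* unmap_page() in mm/huge_memory.c: make the rmap walk synchronous */
        enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
                                   TTU_SYNC;

        /* try_to_unmap_one() in mm/rmap.c: honour TTU_SYNC via the pvmw walk */
        if (flags & TTU_SYNC)
                pvmw.flags = PVMW_SYNC;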

    Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
    Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
    Signed-off-by: Hugh Dickins
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: Kirill A. Shutemov
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Most callers of is_huge_zero_pmd() supply a pmd already verified
    present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
    migration entry, in which the pfn is encoded differently from a present
    pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
    since L1TF forced us to protect against that); or perhaps even crash in
    pmd_page() applied to a swap-like entry.

    Make it safe by adding pmd_present() check into is_huge_zero_pmd()
    itself; and make it quicker by saving huge_zero_pfn, so that
    is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.

    __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
    but is unnecessary now that is_huge_zero_pmd() checks present.
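
    The resulting check is roughly the hedged sketch below, with
    huge_zero_pfn cached when the huge zero page is allocated:

        static inline bool is_huge_zero_pmd(pmd_t pmd)
        {
                return pmd_present(pmd) &&
                       READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
        }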

    Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
    Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Yang Shi
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.

    Here is v2 batch of long-standing THP bug fixes that I had not got
    around to sending before, but prompted now by Wang Yugui's report
    https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

    Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
    they have done no harm, but have *not* fixed that issue: something more
    is needed and I have no idea of what.

    This patch (of 7):

    Stressing huge tmpfs page migration racing hole punch often crashed on
    the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
    kernel; or shortly afterwards, on a bad dereference in
    __split_huge_pmd_locked() when DEBUG_VM=n. They forgot to allow for pmd
    migration entries in the non-anonymous case.

    Full disclosure: those particular experiments were on a kernel with more
    relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
    vanilla kernel: it is conceivable that stricter locking happens to avoid
    those cases, or makes them less likely; but __split_huge_pmd_locked()
    already allowed for pmd migration entries when handling anonymous THPs,
    so this commit brings the shmem and file THP handling into line.

    And while there: use old_pmd rather than _pmd, as in the following
    blocks; and make it clearer to the eye that the !vma_is_anonymous()
    block is self-contained, making an early return after accounting for
    unmapping.

    Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
    Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
    Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
    Signed-off-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Cc: Wang Yugui
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Naoya Horiguchi
    Cc: Alistair Popple
    Cc: Ralph Campbell
    Cc: Zi Yan
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Jue Wang
    Cc: Peter Xu
    Cc: Jan Kara
    Cc: Shakeel Butt
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

07 May, 2021

1 commit

  • Fix ~94 single-word typos in locking code comments, plus a few
    very obvious grammar mistakes.

    Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
    Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
    Signed-off-by: Ingo Molnar
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

06 May, 2021

9 commits

  • The shrinker map management is not purely memcg specific; it sits at
    the intersection between memory cgroups and shrinkers. It is the
    allocation and assignment of a structure, and the only memcg-specific
    bit is that the map is stored in a memcg structure. So move the
    shrinker_maps handling code into vmscan.c for tighter integration with
    the shrinker code, and remove the "memcg_" prefix. There is no
    functional change.

    Link: https://lkml.kernel.org/r/20210311190845.9708-3-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Acked-by: Vlastimil Babka
    Acked-by: Kirill Tkhai
    Acked-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Further extend <debugfs>/split_huge_pages to accept
    "<path>,<off_start>,<off_end>" for file-backed THP split tests, since
    tmpfs may have files backed by THPs that are mapped nowhere.

    Update the selftest program to test file-backed THP split too.

    Link: https://lkml.kernel.org/r/20210331235309.332292-2-zi.yan@sent.com
    Signed-off-by: Zi Yan
    Suggested-by: Kirill A. Shutemov
    Reviewed-by: Yang Shi
    Cc: "Kirill A . Shutemov"
    Cc: Shuah Khan
    Cc: John Hubbard
    Cc: Sandipan Das
    Cc: David Hildenbrand
    Cc: Mika Penttila
    Cc: David Rientjes
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • We did not have a direct user interface for splitting the compound
    page backing a THP, and there is no need for one unless we want to
    expose THP implementation details to users. Make
    <debugfs>/split_huge_pages accept a new command to do that.

    By writing "<pid>,<vaddr_start>,<vaddr_end>" to
    <debugfs>/split_huge_pages, THPs within the given virtual address
    range of the process with the given pid are split. It is used to test
    the split_huge_page function. In addition, a selftest program is
    added to tools/testing/selftests/vm to exercise the interface by
    splitting PMD THPs and PTE-mapped THPs.

    This does not change the old behavior, i.e., writing 1 to the
    interface still splits all THPs in the system.

    Link: https://lkml.kernel.org/r/20210331235309.332292-1-zi.yan@sent.com
    Signed-off-by: Zi Yan
    Reviewed-by: Yang Shi
    Cc: David Hildenbrand
    Cc: David Rientjes
    Cc: John Hubbard
    Cc: "Kirill A . Shutemov"
    Cc: Matthew Wilcox
    Cc: Mika Penttila
    Cc: Sandipan Das
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • It is preferable to use the helper function migration_entry_to_page()
    to get the page via the migration entry. We also get the PageLocked()
    check there for free.

    Link: https://lkml.kernel.org/r/20210318122722.13135-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: William Kucharski
    Cc: Yang Shi
    Cc: yuleixzhang
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • The !PageCompound() check limits the page to being a head or tail
    page, while !PageHead() further limits it to a head page only. So the
    !PageHead() check is equivalent here.

    Link: https://lkml.kernel.org/r/20210318122722.13135-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: William Kucharski
    Cc: Yang Shi
    Cc: yuleixzhang
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • The current code that checks whether migrating a misplaced transhuge
    page is needed is pretty hard to follow. Rework it and add a comment
    to make its logic clearer and improve readability.

    Link: https://lkml.kernel.org/r/20210318122722.13135-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Zi Yan
    Reviewed-by: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: William Kucharski
    Cc: Yang Shi
    Cc: yuleixzhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • It's guaranteed that huge_zero_page will not be NULL if
    huge_zero_refcount is increased successfully.

    When READ_ONCE(huge_zero_page) is returned, there must be a
    huge_zero_page, so the return value can be replaced with 'true' when
    we do not care about the value of huge_zero_page itself.

    We can thus make the function return bool and save the READ_ONCE cpu
    cycles, as the return value is only used to check whether
    huge_zero_page exists.

    Link: https://lkml.kernel.org/r/20210318122722.13135-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Zi Yan
    Reviewed-by: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: William Kucharski
    Cc: Yang Shi
    Cc: yuleixzhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Patch series "Some cleanups for huge_memory", v3.

    This series contains cleanups that rework some function logic to make
    it more readable, use helper functions, and so on. More details can
    be found in the respective changelogs.

    This patch (of 6):

    The current implementation of vma_adjust_trans_huge() contains some
    duplicated code. Add a helper function to get rid of it and make the
    code more succinct.

    Link: https://lkml.kernel.org/r/20210318122722.13135-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210318122722.13135-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Peter Xu
    Cc: Zi Yan
    Cc: Matthew Wilcox
    Cc: William Kucharski
    Cc: Vlastimil Babka
    Cc: Peter Xu
    Cc: yuleixzhang
    Cc: Michel Lespinasse
    Cc: Aneesh Kumar K.V
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Yang Shi
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • There is no need to use a new local variable ret2 to get the return
    value of handle_userfault(). Use ret directly to make the code more
    succinct.

    Link: https://lkml.kernel.org/r/20210210072409.60587-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Andrew Morton
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

14 Mar, 2021

2 commits

  • Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
    pass in the number of pages as an argument.

    In this way the interface name is more generic and can be used by
    potential new users. In addition, the complete information (memcg and
    flags) needs to be set on the tail pages.

    Link: https://lkml.kernel.org/r/20210304074053.65527-2-zhouguanghui1@huawei.com
    Signed-off-by: Zhou Guanghui
    Acked-by: Johannes Weiner
    Reviewed-by: Zi Yan
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Nicholas Piggin
    Cc: Kefeng Wang
    Cc: Hanjun Guo
    Cc: Tianhong Ding
    Cc: Weilong Chen
    Cc: Rui Xiang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhou Guanghui
     
  • We've got quite a few places (pte, pmd, pud) that explicitly check
    whether we should break the COW right now during fork(). It's easier
    to provide a helper, especially before we do the same thing for
    hugetlbfs.

    Since we'll reference is_cow_mapping() in mm.h, move it there too.
    Actually it suits mm.h better, since internal.h is mm/-only while mm.h
    is exported to the whole kernel. With that, we should expect another
    patch to use is_cow_mapping() wherever we can across the kernel, since
    we use it quite a lot but always as raw checks against VM_* flags.

    Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Jason Gunthorpe
    Cc: Alexey Dobriyan
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: David Gibson
    Cc: Gal Pressman
    Cc: Jan Kara
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Roland Scheidegger
    Cc: VMware Graphics
    Cc: Wei Zhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

27 Feb, 2021

1 commit

  • Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.

    The allocation flags of anonymous transparent huge pages can be
    controlled through the files in
    /sys/kernel/mm/transparent_hugepage/defrag, which can help keep the
    system from getting bogged down in the page reclaim and compaction
    code when many THPs are being allocated simultaneously.

    However, the gfp_mask for shmem THP allocations was not limited by
    those configuration settings, and some workloads ended up with all
    CPUs stuck on the LRU lock in the page reclaim code, trying to
    allocate dozens of THPs simultaneously.

    This patch applies the same configured limitation to shmem hugepage
    allocations, to prevent that from happening.

    This way a THP defrag setting of "never" or "defer+madvise" will result in
    quick allocation failures without direct reclaim when no 2MB free pages
    are available.

    With this patch applied, THP allocations for tmpfs will be a little more
    aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
    less aggressive for files that are not mmapped or mapped without that
    flag.

    This patch (of 4):

    Controlling the gfp_mask of THP allocations through the knobs in sysfs
    allows users to determine the balance between how aggressively the system
    tries to allocate THPs at fault time, and how much the application may end
    up stalling attempting those allocations.

    Link: https://lkml.kernel.org/r/20201124194925.623931-1-riel@surriel.com
    Link: https://lkml.kernel.org/r/20201124194925.623931-2-riel@surriel.com
    Signed-off-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Xu Yu
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox (Oracle)
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 Feb, 2021

3 commits

  • Differentiate between hardware not supporting hugepages and the user
    disabling THP via
    'echo never > /sys/kernel/mm/transparent_hugepage/enabled'.

    For a devdax namespace, the kernel handles this via the
    supported_alignment attribute and fails to initialize the namespace if
    the namespace align value is not supported on the platform.

    For an fsdax namespace, the kernel will continue to initialize the
    namespace. This can result in the kernel creating a huge pte entry
    even though the hardware doesn't support it.

    We do want hugepage support with pmem even if the end user disabled
    THP via the sysfs file (/sys/kernel/mm/transparent_hugepage/enabled).
    Hence differentiate between hardware/firmware lacking support and a
    user-controlled disabling of THP, and prevent a huge fault only if the
    hardware lacks hugepage support.

    Link: https://lkml.kernel.org/r/20210205023956.417587-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Dan Williams
    Cc: "Kirill A . Shutemov"
    Cc: Jan Kara
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • The return value of set_huge_zero_page() is always ignored, so drop
    it.

    Link: https://lkml.kernel.org/r/20210203084816.46307-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • When set_pmd_at() is called in do_huge_pmd_anonymous_page(), a new TLB
    entry can be added by software on the MIPS platform.

    Add update_mmu_cache_pmd() where the pmd entry is set.
    update_mmu_cache_pmd() is defined as empty except on the arc/mips
    platforms, so this patch has no effect on platforms other than
    arc/mips.
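
    A hedged sketch of the change in the huge-pmd fault paths:

        set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
        /* No-op on most architectures; needed where the TLB is
         * software-managed (arc/mips). */
        update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);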

    Link: http://lkml.kernel.org/r/1592990792-1923-2-git-send-email-maobibo@loongson.cn
    Signed-off-by: Bibo Mao
    Cc: Anshuman Khandual
    Cc: Daniel Silsby
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Thomas Bogendoerfer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bibo Mao