16 May, 2022

1 commit

  • commit 19b482c29b6f3805f1d8e93015847b89e2f7f3b1 upstream.

    userfaultfd calls shmem_mfill_atomic_pte(), which does no cache
    flushing for the target page. The target page is then mapped into
    user space at a different (user) address, which may alias the kernel
    address used to copy the data from user space. Insert
    flush_dcache_page() in the non-zero-page case, and replace
    clear_highpage() with clear_user_highpage(), which already handles
    the cache maintenance.
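
    A minimal sketch of the fixed copy path (abbreviated from the patch;
    the short-copy retry and error handling are elided):

        if (!zeropage) {                /* COPY */
                page_kaddr = kmap_atomic(page);
                ret = copy_from_user(page_kaddr,
                                     (const void __user *)src_addr,
                                     PAGE_SIZE);
                kunmap_atomic(page_kaddr);
                /* flush the kernel alias before the user mapping is made */
                flush_dcache_page(page);
        } else {                        /* ZEROPAGE */
                /* clear_user_highpage() handles cache maintenance itself */
                clear_user_highpage(page, dst_addr);
        }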

    Link: https://lkml.kernel.org/r/20220210123058.79206-6-songmuchun@bytedance.com
    Fixes: 8d1039634206 ("userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support")
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Muchun Song
    Reviewed-by: Mike Kravetz
    Cc: Axel Rasmussen
    Cc: David Rientjes
    Cc: Fam Zheng
    Cc: Kirill A. Shutemov
    Cc: Lars Persson
    Cc: Peter Xu
    Cc: Xiongchun Duan
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     

27 Jan, 2022

1 commit

  • commit 62c9827cbb996c2c04f615ecd783ce28bcea894b upstream.

    Fix a data race in commit 779750d20b93 ("shmem: split huge pages beyond
    i_size under memory pressure").

    Here are the call traces causing the race:

    Call Trace 1:
    shmem_unused_huge_shrink+0x3ae/0x410
    ? __list_lru_walk_one.isra.5+0x33/0x160
    super_cache_scan+0x17c/0x190
    shrink_slab.part.55+0x1ef/0x3f0
    shrink_node+0x10e/0x330
    kswapd+0x380/0x740
    kthread+0xfc/0x130
    ? mem_cgroup_shrink_node+0x170/0x170
    ? kthread_create_on_node+0x70/0x70
    ret_from_fork+0x1f/0x30

    Call Trace 2:
    shmem_evict_inode+0xd8/0x190
    evict+0xbe/0x1c0
    do_unlinkat+0x137/0x330
    do_syscall_64+0x76/0x120
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    A simple explanation:

    Imagine there are 3 items in the local list (@list). In the first
    traversal, A is not deleted from @list.

    1)    A->B->C
          ^
          |
          pos (leave)

    In the second traversal, B is deleted from @list. Concurrently, A is
    deleted from @list through shmem_evict_inode(), because the last
    reference to the inode is dropped by another thread. The @list is
    then corrupted.

    2)    A->B->C
          ^  ^
          |  |
      evict pos (drop)

    We should make sure the inode is either on the global list or deleted from
    any local list before iput().

    Fix this by moving inodes back to the global list before we put them.
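
    Roughly, the move_back path in shmem_unused_huge_shrink() then looks
    like this (sketched from the shape of the upstream fix):

        move_back:
                /*
                 * Make sure the inode is either on the global list or
                 * deleted from any local list before iput(): once the
                 * inode is put, another thread may evict it and corrupt
                 * a local list it still sits on.
                 */
                spin_lock(&sbinfo->shrinklist_lock);
                list_move(&info->shrinklist, &sbinfo->shrinklist);
                sbinfo->shrinklist_len++;
                spin_unlock(&sbinfo->shrinklist_lock);
        put:
                iput(inode);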

    [akpm@linux-foundation.org: coding style fixes]

    Link: https://lkml.kernel.org/r/20211125064502.99983-1-ligang.bdlg@bytedance.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Gang Li
    Reviewed-by: Muchun Song
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gang Li
     

25 Sep, 2021

1 commit

  • In the case of SHMEM_HUGE_WITHIN_SIZE, the page index is not rounded
    up correctly. When the page index points to the first page of a huge
    page, round_up() does not bring it to the end of that huge page, but
    leaves it at the end of the previous one.

    An example:

    HPAGE_PMD_NR on my machine is 512 (2MB huge page size). After
    allocating a 3000KB buffer, I access it at offset 2050KB. In
    shmem_is_huge(), the corresponding index happens to be 512. After
    being rounded up by HPAGE_PMD_NR it is still 512, which is smaller
    than i_size, so shmem_is_huge() returns true. As a result, my buffer
    takes an additional huge page, and that shouldn't happen when
    shmem_enabled is set to within_size.
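
    The fix is a small change in the SHMEM_HUGE_WITHIN_SIZE case
    (sketched from the upstream diff):

        case SHMEM_HUGE_WITHIN_SIZE:
                /*
                 * round_up(512, 512) == 512, leaving the index unchanged;
                 * rounding up index + 1 reaches the end of this huge page.
                 */
                index = round_up(index + 1, HPAGE_PMD_NR);
                i_size = round_up(i_size_read(inode), PAGE_SIZE);
                if (i_size >> PAGE_SHIFT >= index)
                        return true;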

    Link: https://lkml.kernel.org/r/20210909032007.18353-1-liuyuntao10@huawei.com
    Fixes: f3f0e1d2150b2b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Liu Yuntao
    Acked-by: Kirill A. Shutemov
    Acked-by: Hugh Dickins
    Cc: wuxu.wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Yuntao
     

04 Sep, 2021

15 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton: (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
     
  • drivers/gpu/drm/i915/gem/i915_gem_shmem.c contains a shmem_writeback()
    which calls shmem_writepage() from a shrinker: that usually works well
    enough; but if /sys/kernel/mm/transparent_hugepage/shmem_enabled has been
    set to "always" (intended to be usable) or "force" (forces huge everywhere
    for easy testing), shmem_writepage() is surprised to be called with a huge
    page, and crashes on the VM_BUG_ON_PAGE(PageCompound) (I did not find out
    where the crash happens when CONFIG_DEBUG_VM is off).

    LRU page reclaim always splits the shmem huge page first: I'd prefer not
    to demand that of i915, so check and split compound in shmem_writepage().

    Patch history: when first sent last year
    http://lkml.kernel.org/r/alpine.LSU.2.11.2008301401390.5954@eggly.anvils
    https://lore.kernel.org/linux-mm/20200919042009.bomzxmrg7%25akpm@linux-foundation.org/
    Matthew Wilcox noticed that tail pages were wrongly left clean. This
    version brackets the split with Set and Clear PageDirty as he suggested:
    which works very well, even if it falls short of our aspirations. And
    recently I realized that the crash is not limited to the testing option
    "force", but affects "always" too: which is more important to fix.

    Link: https://lkml.kernel.org/r/bac6158c-8b3d-4dca-cffc-4982f58d9794@google.com
    Fixes: 2d6692e642e7 ("drm/i915: Start writeback from the shrinker")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Shakeel Butt
    Acked-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 4.18 commit 89fdcd262fd4 ("mm: shmem: make stat.st_blksize return huge
    page size if THP is on") added is_huge_enabled() to decide st_blksize: if
    hugeness is to be defined per file, that will need to be replaced by
    shmem_is_huge().

    This does give a different answer (No) for small files on a
    "huge=within_size" mount: but that can be considered a minor bugfix. And
    a different answer (No) for default files on a "huge=advise" mount: I'm
    reluctant to complicate it, just to reproduce the same debatable answer as
    before.

    Link: https://lkml.kernel.org/r/af7fb3f9-4415-9e8e-fdac-b1a5253ad21@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
    that a consistent set of checks can be applied, even when the inode is
    accessed through read/write syscalls (with NULL vma) instead of mmaps (the
    index argument is seldom of interest, but required by mount option
    "huge=within_size"). Clean up and rearrange the checks a little.

    This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
    were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes.

    Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the
    obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing
    else needs that symlink check, so leave it there in shmem_getpage_gfp().
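
    A condensed sketch of the consolidated checks (following the shape of
    the upstream code; this already incorporates the later within_size
    index fix noted in the 25 Sep entry above):

        bool shmem_is_huge(struct vm_area_struct *vma,
                           struct inode *inode, pgoff_t index)
        {
                loff_t i_size;

                if (shmem_huge == SHMEM_HUGE_DENY)
                        return false;
                if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
                    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
                        return false;
                if (shmem_huge == SHMEM_HUGE_FORCE)
                        return true;

                switch (SHMEM_SB(inode->i_sb)->huge) {
                case SHMEM_HUGE_ALWAYS:
                        return true;
                case SHMEM_HUGE_WITHIN_SIZE:
                        index = round_up(index + 1, HPAGE_PMD_NR);
                        i_size = round_up(i_size_read(inode), PAGE_SIZE);
                        if (i_size >> PAGE_SHIFT >= index)
                                return true;
                        fallthrough;
                case SHMEM_HUGE_ADVISE:
                        return vma && (vma->vm_flags & VM_HUGEPAGE);
                default:
                        return false;
                }
        }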

    Link: https://lkml.kernel.org/r/23a77889-2ddc-b030-75cd-44ca27fd4d1@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • khugepaged's collapse_file() currently uses SGP_NOHUGE to tell
    shmem_getpage() not to try allocating a huge page, in the very unlikely
    event that a racing hole-punch removes the swapped or fallocated page as
    soon as i_pages lock is dropped.

    We want to consolidate shmem's huge decisions, removing SGP_HUGE and
    SGP_NOHUGE; but cannot quite persuade ourselves that it's okay to regress
    the protection in this case - Yang Shi points out that the huge page would
    remain indefinitely, charged to root instead of the intended memcg.

    collapse_file() should not even allocate a small page in this case: why
    proceed if someone is punching a hole? SGP_READ is almost the right flag
    here, except that it optimizes away from a fallocated page, with NULL to
    tell caller to fill with zeroes (like a hole); whereas collapse_file()'s
    sequence relies on using a cache page. Add SGP_NOALLOC just for this.

    There are too many consecutive "if (page"s there in shmem_getpage_gfp():
    group it better; and fix the outdated "bring it back from swap" comment.
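
    The hole handling after the page-cache lookup then reads roughly:

        /*
         * SGP_READ: succeed on hole, with NULL page, letting caller zero.
         * SGP_NOALLOC: fail on hole, with NULL page, letting caller fail.
         */
        *pagep = NULL;
        if (sgp == SGP_READ)
                return 0;
        if (sgp == SGP_NOALLOC)
                return -ENOENT;
        /* SGP_WRITE, SGP_FALLOC or SGP_CACHE: carry on to allocate */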

    Link: https://lkml.kernel.org/r/1355343b-acf-4653-ef79-6aee40214ac5@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • shmem_huge_enabled() is about to be enhanced into shmem_is_huge(), so that
    it can be used more widely throughout: before making functional changes,
    shift it to its final position (to avoid forward declaration).

    Link: https://lkml.kernel.org/r/16fec7b7-5c84-415a-8586-69d8bf6a6685@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 5.14 commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
    checking in transparent_hugepage_enabled()") added
    transhuge_vma_enabled() as a wrapper for two very different checks:
    whether the app has marked this address range not to use THPs
    (madvise(MADV_NOHUGEPAGE)), and whether the process has disabled THPs
    entirely (prctl(PR_SET_THP_DISABLE)). shmem_huge_enabled() prefers to
    show those two checks explicitly, as before.
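
    For reference, a sketch of the wrapper and the two checks it bundles:

        static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
                                                 unsigned long vm_flags)
        {
                /* madvise(MADV_NOHUGEPAGE) on this range, or
                   prctl(PR_SET_THP_DISABLE) on this process */
                if ((vm_flags & VM_NOHUGEPAGE) ||
                    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
                        return false;
                return true;
        }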

    Link: https://lkml.kernel.org/r/45e5338-18d-c6f9-c17e-34f510bc1728@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's a block of code in shmem_setattr() to add the inode to
    shmem_unused_huge_shrink()'s shrinklist when lowering i_size: it dates
    from before 5.7 changed truncation to do split_huge_page() for itself, and
    should have been removed at that time.

    I am over-stating that: split_huge_page() can fail (notably if there's an
    extra reference to the page at that time), so there might be value in
    retrying. But there were already retries as truncation worked through the
    tails, and this addition risks repeating unsuccessful retries
    indefinitely: I'd rather remove it now, and work on reducing the chance of
    split_huge_page() failures separately, if we need to.

    Link: https://lkml.kernel.org/r/b73b3492-8822-18f9-83e2-938528cdde94@google.com
    Fixes: 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A successful shmem_fallocate() guarantees that the extent has been
    reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
    But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
    split huge pages and free their excess beyond i_size; and by other uses of
    split_huge_page() near i_size.

    It's sad to add a shmem inode field just for this, but I did not find a
    better way to keep the guarantee. A flag to say KEEP_SIZE has been used
    would be cheaper, but I'm averse to unclearable flags. The fallocend
    field is not perfect either (many disjoint ranges might be fallocated),
    but good enough; and gains another use later on.
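
    A sketch of the new field and its helper, per the upstream patch:

        /* in struct shmem_inode_info: */
        pgoff_t fallocend;      /* highest fallocated page index + 1 */

        /* helper: how far beyond EOF has been reserved by fallocate */
        static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
        {
                return max(eof, SHMEM_I(inode)->fallocend);
        }

        /* in shmem_fallocate(): remember the furthest extent reserved */
        if (info->fallocend < end)
                info->fallocend = end;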

    Link: https://lkml.kernel.org/r/ca9a146-3a59-6cd3-7f28-e9a044bb1052@google.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Patch series "huge tmpfs: shmem_is_huge() fixes and cleanups".

    A series of huge tmpfs fixes and cleanups.

    This patch (of 9):

    shmem_fallocate() goes to a lot of trouble to leave its newly allocated
    pages !Uptodate, partly to identify and undo them on failure, partly to
    leave the overhead of clearing them until later. But the huge page case
    did not skip to the end of the extent, walked through the tail pages one
    by one, and appeared to work just fine: but in doing so, cleared and
    Uptodated the huge page, so there was no way to undo it on failure.

    And by setting Uptodate too soon, it messed up both its nr_falloced and
    nr_unswapped counts, so that the intended "time to give up" heuristic did
    not work at all.

    Now advance immediately to the end of the huge extent, with a comment on
    why this is more than just an optimization. But although this speeds up
    huge tmpfs fallocation, it does leave the clearing until first use, and
    some users may have come to appreciate slow fallocate but fast first use:
    if they complain, then we can consider adding a pass to clear at the end.
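
    The advance over a huge extent then looks roughly like this in
    shmem_fallocate()'s loop (sketch based on the upstream fix):

        index++;
        /*
         * A second SGP_FALLOC on the same huge page would clear it,
         * making it PageUptodate and un-undoable if we fail later:
         * so this skip is more than just an optimization.
         */
        if (PageTransCompound(page)) {
                index = round_up(index, HPAGE_PMD_NR);
                /* Beware 32-bit wraparound of the last page: prevent loop */
                if (!index)
                        index--;
        }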

    Link: https://lkml.kernel.org/r/da632211-8e3e-6b1-aee-ab24734429a0@google.com
    Link: https://lkml.kernel.org/r/16201bd2-70e-37e2-e89b-5f929430da@google.com
    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: Shakeel Butt
    Cc: "Kirill A. Shutemov"
    Cc: Miaohe Lin
    Cc: Mike Kravetz
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It's bad to extern swap_info[] in a .c file. Include the
    corresponding header file instead.

    Link: https://lkml.kernel.org/r/20210812120350.49801-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • The forward declarations for shmem_should_replace_page() and
    shmem_replace_page() are unnecessary. Remove them.

    Link: https://lkml.kernel.org/r/20210812120350.49801-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Since commit bf6ebd97aba0 ("userfaultfd/shmem: modify
    shmem_mfill_atomic_pte to use install_pte()"),
    mfill_atomic_install_pte() installs the pte and updates the mmu
    cache. Remove the tlbflush.h include, as update_mmu_cache() is no
    longer called here.

    Link: https://lkml.kernel.org/r/20210812120350.49801-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Patch series "Cleanups for shmem".

    This series contains cleanups to remove unneeded variable, header file,
    function forward declaration and so on. More details can be found in the
    respective changelogs.

    This patch (of 4):

    The local variable ret is always equal to -ENOMEM and never touched. So
    remove it and return -ENOMEM directly to simplify the code.

    Link: https://lkml.kernel.org/r/20210812120350.49801-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210812120350.49801-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Each CPU has SHMEM_INO_BATCH inode numbers available in the per-CPU
    `->ino_batch'; access is serialized by disabling preemption. If the
    pool is empty, it is refilled from `->next_ino', where access is
    serialized by ->stat_lock, a spinlock_t which cannot be acquired with
    preemption disabled.

    One way around this would be to make the per-CPU ino_batch struct
    containing the inode number a local_lock_t.

    Another solution is to promote ->stat_lock to a raw_spinlock_t. The
    critical sections are short. The mpol_put() must be moved outside of the
    critical section to avoid invoking the destructor with disabled
    preemption.
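
    A sketch of the resulting pattern, with the short raw critical
    section and the mpol destructor moved outside it:

        raw_spin_lock(&sbinfo->stat_lock);      /* now a raw_spinlock_t */
        mpol = sbinfo->mpol;
        sbinfo->mpol = ctx->mpol;               /* transfers initial ref */
        raw_spin_unlock(&sbinfo->stat_lock);
        mpol_put(mpol);                         /* runs preemptible */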

    Link: https://lkml.kernel.org/r/20210806142916.jdwkb5bx62q5fwfo@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

31 Aug, 2021

1 commit

  • Pull fs hole punching vs cache filling race fixes from Jan Kara:
    "Fix races leading to possible data corruption or stale data exposure
    in multiple filesystems when hole punching races with operations such
    as readahead.

    This is the series I was sending for the last merge window but with
    your objection fixed - now filemap_fault() has been modified to take
    invalidate_lock only when we need to create new page in the page cache
    and / or bring it uptodate"

    * tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    filesystems/locking: fix Malformed table warning
    cifs: Fix race between hole punch and page fault
    ceph: Fix race between hole punch and page fault
    fuse: Convert to using invalidate_lock
    f2fs: Convert to using invalidate_lock
    zonefs: Convert to using invalidate_lock
    xfs: Convert double locking of MMAPLOCK to use VFS helpers
    xfs: Convert to use invalidate_lock
    xfs: Refactor xfs_isilocked()
    ext2: Convert to using invalidate_lock
    ext4: Convert to use mapping->invalidate_lock
    mm: Add functions to lock invalidate_lock for two mappings
    mm: Protect operations adding pages to page cache with invalidate_lock
    documentation: Sync file_operations members with reality
    mm: Fix comments mentioning i_mutex

    Linus Torvalds
     

21 Aug, 2021

1 commit

  • Due to the change in how the block layer detects congestion, the
    justification of commit 8fd2e0b505d1 ("mm: swap: check if swap backing
    device is congested or not") no longer stands, so that commit can
    simply be reverted. With it gone, the race fixed by commit
    2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
    disappears, so that fix can be reverted as well.

    And that fix is also kind of buggy as discussed by [1] and [2].

    [1] https://lore.kernel.org/linux-mm/24187e5e-069-9f3f-cefe-39ac70783753@google.com/
    [2] https://lore.kernel.org/linux-mm/e82380b9-3ad4-4a52-be50-6d45c7f2b5da@google.com/

    Link: https://lkml.kernel.org/r/20210810202936.2672-2-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: "Huang, Ying"
    Cc: Miaohe Lin
    Cc: David Hildenbrand
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox (Oracle)
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

13 Jul, 2021

1 commit

  • inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
    comments still mentioning i_mutex.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Acked-by: Hugh Dickins
    Signed-off-by: Jan Kara

    Jan Kara
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton: (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

01 Jul, 2021

4 commits

  • In a previous commit, we added the mfill_atomic_install_pte() helper.
    This helper does the job of setting up PTEs for an existing page, to map
    it into a given VMA. It deals with both the anon and shmem cases, as well
    as the shared and private cases.

    In other words, shmem_mfill_atomic_pte() duplicates a case it already
    handles. So, expose it, and let shmem_mfill_atomic_pte() use it directly,
    to reduce code duplication.

    This requires that we refactor shmem_mfill_atomic_pte() a bit:

    Instead of doing accounting (shmem_recalc_inode() et al) part-way through
    the PTE setup, do it afterward. This frees up mfill_atomic_install_pte()
    from having to care about this accounting, and means we don't need to e.g.
    shmem_uncharge() in the error path.

    A side effect is this switches shmem_mfill_atomic_pte() to use
    lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add().
    This wrapper does some extra accounting in an exceptional case, if
    appropriate, so it's actually the more correct thing to use.

    Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen
    Reviewed-by: Peter Xu
    Acked-by: Hugh Dickins
    Cc: Alexander Viro
    Cc: Andrea Arcangeli
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Cc: Lokesh Gidra
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • This patch allows shmem-backed VMAs to be registered for minor faults.
    Minor faults are appropriately relayed to userspace in the fault path, for
    VMAs with the relevant flag.

    This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
    minor faults, though, so userspace doesn't yet have a way to resolve such
    faults.

    Because of this, we also don't yet advertise this as a supported feature.
    That will be done in a separate commit when the feature is fully
    implemented.

    Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen
    Acked-by: Peter Xu
    Acked-by: Hugh Dickins
    Cc: Alexander Viro
    Cc: Andrea Arcangeli
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Cc: Lokesh Gidra
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • Patch series "userfaultfd: add minor fault handling for shmem", v6.

    Overview
    ========

    See the series which added minor faults for hugetlbfs [3] for a detailed
    overview of minor fault handling in general. This series adds the same
    support for shmem-backed areas.

    This series is structured as follows:

    - Commits 1 and 2 are cleanups.
    - Commits 3 and 4 implement the new feature (minor fault handling for shmem).
    - Commit 5 advertises that the feature is now available since at this point it's
    fully implemented.
    - Commit 6 is a final cleanup, modifying an existing code path to re-use a new
    helper we've introduced.
    - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature.

    Use Case
    ========

    In some cases it is useful to have VM memory backed by tmpfs instead of
    hugetlbfs. So, this feature will be used to support the same VM live
    migration use case described in my original series.

    Additionally, Android folks (Lokesh Gidra) hope to optimize the
    Android Runtime garbage collector using this feature:

    "The plan is to use userfaultfd for concurrently compacting the heap.
    With this feature, the heap can be shared-mapped at another location where
    the GC-thread(s) could continue the compaction operation without the need
    to invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java
    threads get faults on the heap, UFFDIO_CONTINUE can be used to resume
    execution. Furthermore, this feature enables updating references in the
    'non-moving' portion of the heap efficiently. Without this feature,
    unnecessary page copying (ioctl(UFFDIO_COPY)) would be required."

    [1] https://lore.kernel.org/patchwork/cover/1388144/
    [2] https://lore.kernel.org/patchwork/patch/1408161/
    [3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t

    This patch (of 9):

    Previously, we did a dance where we had one calling path in userfaultfd.c
    (mfill_atomic_pte), but then we split it into two in shmem_fs.h
    (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single
    shared function in shmem.c (shmem_mfill_atomic_pte).

    This is all a bit overly complex. Just call the single combined shmem
    function directly, allowing us to clean up various branches, boilerplate,
    etc.

    While we're touching this function, two other small cleanup changes:
    - offset is equivalent to pgoff, so we can get rid of offset entirely.
    - Split two VM_BUG_ON cases into two statements. This means the line
    number reported when the BUG is hit specifies exactly which condition
    was true.

    Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com
    Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen
    Reviewed-by: Peter Xu
    Acked-by: Hugh Dickins
    Cc: Alexander Viro
    Cc: Andrea Arcangeli
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Cc: Lokesh Gidra
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for
    (non-shmem) FS"), read-only THP file mapping is supported, but the
    corresponding check was missing from transparent_hugepage_enabled().
    To fix it, add a check for read-only THP file mappings, and introduce
    the helper transhuge_vma_enabled() to check whether THP is enabled
    for a given vma, reducing duplicated code. Also rename
    transparent_hugepage_enabled to transparent_hugepage_active to make
    the code easier to follow, as suggested by David Hildenbrand.

    [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
    Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com

    Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Miaohe Lin
    Reviewed-by: Yang Shi
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: David Hildenbrand
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

30 Jun, 2021

3 commits

  • Merge misc updates from Andrew Morton:
    "191 patches.

    Subsystems affected by this patch series: kthread, ia64, scripts,
    ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
    slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
    mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
    pagealloc, and memory-failure)"

    * emailed patches from Andrew Morton: (191 commits)
    mm,hwpoison: make get_hwpoison_page() call get_any_page()
    mm,hwpoison: send SIGBUS with error virutal address
    mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
    mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
    mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
    mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
    docs: remove description of DISCONTIGMEM
    arch, mm: remove stale mentions of DISCONIGMEM
    mm: remove CONFIG_DISCONTIGMEM
    m68k: remove support for DISCONTIGMEM
    arc: remove support for DISCONTIGMEM
    arc: update comment about HIGHMEM implementation
    alpha: remove DISCONTIGMEM and NUMA
    mm/page_alloc: move free_the_page
    mm/page_alloc: fix counting of managed_pages
    mm/page_alloc: improve memmap_pages dbg msg
    mm: drop SECTION_SHIFT in code comments
    mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
    mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
    mm/page_alloc: scale the number of pages that are batch freed
    ...

    Linus Torvalds
     
  • set_active_memcg() worked for kernel allocations but was silently ignored
    for user pages.

    This patch establishes a precedence order for who gets charged:

    1. If there is a memcg associated with the page already, that memcg is
    charged. This happens during swapin.

    2. If an explicit mm is passed, mm->memcg is charged. This happens
    during page faults, which can be triggered in remote VMs (eg gup).

    3. Otherwise consult the current process context. If there is an
    active_memcg, use that. Otherwise, current->mm->memcg.

    Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it would
    always charge the root cgroup. Now it looks up the active_memcg first
    (falling back to charging the root cgroup if not set).
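
    A pseudocode sketch of that precedence; the helper names here are
    illustrative, not the exact mainline functions:

        struct mem_cgroup *choose_charge_memcg(struct page *page,
                                               struct mm_struct *mm)
        {
                struct mem_cgroup *memcg;

                /* 1. swapin: the page already carries a memcg - use it */
                memcg = memcg_from_swap_entry(page);    /* hypothetical */
                if (memcg)
                        return memcg;

                /* 2. explicit mm (page faults, remote gup): mm->memcg */
                if (mm)
                        return get_mem_cgroup_from_mm(mm);

                /* 3. current context: active_memcg if set, else
                 *    current->mm->memcg (root only as a last resort) */
                return get_mem_cgroup_from_mm(current->mm);
        }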

    Link: https://lkml.kernel.org/r/20210610173944.1203706-3-schatzberg.dan@gmail.com
    Signed-off-by: Dan Schatzberg
    Acked-by: Johannes Weiner
    Acked-by: Tejun Heo
    Acked-by: Chris Down
    Acked-by: Jens Axboe
    Reviewed-by: Shakeel Butt
    Reviewed-by: Michal Koutný
    Cc: Michal Hocko
    Cc: Ming Lei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Schatzberg
     
  • When I was investigating the swap code, I found the below possible race
    window:

    CPU 1                                       CPU 2
    -----                                       -----
    shmem_swapin
      swap_cluster_readahead
        if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
                                                swapoff
                                                  ..
                                                  si->swap_file = NULL;
                                                  ..
        struct inode *inode = si->swap_file->f_mapping->host; [oops!]

    Close this race window by using get/put_swap_device() to guard against
    concurrent swapoff.
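
    A sketch of the guarded readahead in shmem_swapin():

        struct swap_info_struct *si;
        struct page *page;

        /* Prevent concurrent swapoff from freeing si->swap_file. */
        si = get_swap_device(swap);
        if (unlikely(!si))
                return NULL;

        shmem_pseudo_vma_init(&pvma, info, index);
        page = swap_cluster_readahead(swap, gfp, &vmf);
        shmem_pseudo_vma_destroy(&pvma);

        put_swap_device(si);
        return page;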

    Link: https://lkml.kernel.org/r/20210426123316.806267-5-linmiaohe@huawei.com
    Fixes: 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or not")
    Signed-off-by: Miaohe Lin
    Reviewed-by: "Huang, Ying"
    Cc: Dennis Zhou
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Alex Shi
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Wei Yang
    Cc: Yang Shi
    Cc: David Hildenbrand
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

29 Jun, 2021

1 commit

  • Pull user namespace rlimit handling update from Eric Biederman:
    "This is the work mainly by Alexey Gladkov to limit rlimits to the
    rlimits of the user that created a user namespace, and to allow users
    to have stricter limits on the resources created within a user
    namespace."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    cred: add missing return error code when set_cred_ucounts() failed
    ucounts: Silence warning in dec_rlimit_ucounts
    ucounts: Set ucount_max to the largest positive value the type can hold
    kselftests: Add test to check for rlimit changes in different user namespaces
    Reimplement RLIMIT_MEMLOCK on top of ucounts
    Reimplement RLIMIT_SIGPENDING on top of ucounts
    Reimplement RLIMIT_MSGQUEUE on top of ucounts
    Reimplement RLIMIT_NPROC on top of ucounts
    Use atomic_t for ucounts reference counting
    Add a reference to ucounts for each cred
    Increase size of ucounts to atomic_long_t

    Linus Torvalds
     

15 May, 2021

2 commits

  • Consider the following sequence of events:

    1. Userspace issues a UFFD ioctl, which ends up calling into
    shmem_mfill_atomic_pte(). We successfully account the blocks, we
    shmem_alloc_page(), but then the copy_from_user() fails. We return
    -ENOENT. We don't release the page we allocated.
    2. Our caller detects this error code, tries the copy_from_user() after
    dropping the mmap_lock, and retries, calling back into
    shmem_mfill_atomic_pte().
    3. Meanwhile, let's say another process filled up the tmpfs being used.
    4. So shmem_mfill_atomic_pte() fails to account blocks this time, and
    immediately returns - without releasing the page.

    This triggers a BUG_ON in our caller, which asserts that the page
    should always be consumed, unless -ENOENT is returned.

    To fix this, detect if we have such a "dangling" page when accounting
    fails, and if so, release it before returning.
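
    The fix, roughly as it reads in shmem_mfill_atomic_pte(): release the
    page carried over from the previous attempt when accounting fails.

        ret = -ENOMEM;
        if (!shmem_inode_acct_block(inode, 1)) {
                /*
                 * We may have got a page, returned -ENOENT triggering a
                 * retry, and now we find ourselves with -ENOMEM. Release
                 * the page to avoid a BUG_ON in our caller.
                 */
                if (unlikely(*pagep)) {
                        put_page(*pagep);
                        *pagep = NULL;
                }
                goto out;
        }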

    Link: https://lkml.kernel.org/r/20210428230858.348400-1-axelrasmussen@google.com
    Fixes: cb658a453b93 ("userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY")
    Signed-off-by: Axel Rasmussen
    Reported-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Reviewed-by: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2.

    Hugh reported an issue with F_SEAL_FUTURE_WRITE not being applied
    correctly to hugetlbfs, which I can easily verify using the
    memfd_test program; it seems the program is rarely run with hugetlbfs
    pages (shmem being the default).

    Meanwhile I found another, probably even more severe, issue: hugetlb
    fork won't write-protect the child's COW pages, so the child can
    potentially write to the parent's private pages. Patch 2 addresses
    that.

    After this series applied, "memfd_test hugetlbfs" should start to pass.

    This patch (of 2):

    F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
    There is a test program for that and it fails constantly.

    $ ./memfd_test hugetlbfs
    memfd-hugetlb: CREATE
    memfd-hugetlb: BASIC
    memfd-hugetlb: SEAL-WRITE
    memfd-hugetlb: SEAL-FUTURE-WRITE
    mmap() didn't fail as expected
    Aborted (core dumped)

    I think it's probably because no one is really running the hugetlbfs test.

    Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we
    do in shmem_mmap(). Generalize a helper for that.
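
    A sketch of the generalized helper, shared by shmem_mmap() and
    hugetlbfs_file_mmap():

        static inline int seal_check_future_write(int seals,
                                                  struct vm_area_struct *vma)
        {
                if (seals & F_SEAL_FUTURE_WRITE) {
                        /*
                         * New PROT_WRITE and MAP_SHARED mmaps are not
                         * allowed when the "future write" seal is active.
                         */
                        if ((vma->vm_flags & VM_SHARED) &&
                            (vma->vm_flags & VM_WRITE))
                                return -EPERM;
                        /*
                         * Don't let mprotect re-enable write on shared,
                         * read-only mappings; private mappings keep
                         * VM_MAYWRITE so they stay COW-writable.
                         */
                        if (vma->vm_flags & VM_SHARED)
                                vma->vm_flags &= ~(VM_MAYWRITE);
                }
                return 0;
        }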

    Link: https://lkml.kernel.org/r/20210503234356.9097-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20210503234356.9097-2-peterx@redhat.com
    Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
    Signed-off-by: Peter Xu
    Reported-by: Hugh Dickins
    Reviewed-by: Mike Kravetz
    Cc: Joel Fernandes (Google)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

06 May, 2021

1 commit

  • Various coding style tweaks to various files under mm/

    [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

    Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
    Signed-off-by: Zhiyuan Dai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhiyuan Dai
     

01 May, 2021

1 commit

  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.

    Changelog

    v11:
    * Fix issue found by lkp robot.

    v8:
    * Fix issues found by lkp-tests project.

    v7:
    * Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

    v6:
    * Fix bug in hugetlb_file_setup() detected by trinity.

    Reported-by: kernel test robot
    Reported-by: kernel test robot
    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

19 Apr, 2021

1 commit

  • Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID
    for identifying objects using file handle and fsid in events.

    fanotify_mark(2) fails with -ENODEV when trying to set a mark on
    filesystems that report a null f_fsid in statfs(2).

    Use a digest of the uuid as f_fsid for tmpfs, to uniquely identify
    tmpfs objects as best as possible, and allow setting an fanotify mark
    that reports events with file handles on tmpfs.

    Link: https://lore.kernel.org/r/20210322173944.449469-3-amir73il@gmail.com
    Acked-by: Hugh Dickins
    Reviewed-by: Christian Brauner
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

27 Feb, 2021

5 commits

  • Hugh pointed out that the gma500 driver uses shmem pages, but needs to
    limit them to the DMA32 zone. Ensure the allocations resulting from the
    gfp_mask returned by limit_gfp_mask use the zone flags that were
    originally passed to shmem_getpage_gfp.
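
    limit_gfp_mask() then combines the two masks roughly like this,
    keeping the caller's zone bits (sketch; flag details may differ):

        static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
        {
                gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
                gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
                gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
                gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);

                /* Allow allocations only from the originally given zones. */
                result |= zoneflags;

                /* Union of deny flags, intersection of allow flags. */
                result |= (limit_gfp & denyflags);
                result |= (huge_gfp & limit_gfp) & allowflags;

                return result;
        }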

    Link: https://lkml.kernel.org/r/20210224121016.1314ed6d@imladris.surriel.com
    Signed-off-by: Rik van Riel
    Suggested-by: Hugh Dickins
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Xu Yu
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Matthew Wilcox pointed out that the i915 driver opportunistically
    allocates tmpfs memory, but will happily reclaim some of its pool if no
    memory is available.

    Make sure the gfp mask used to opportunistically allocate a THP is always
    at least as restrictive as the original gfp mask.

    Link: https://lkml.kernel.org/r/20201124194925.623931-3-riel@surriel.com
    Signed-off-by: Rik van Riel
    Suggested-by: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Xu Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.

    This patch (of 4):

    The allocation flags of anonymous transparent huge pages can be
    controlled through the files in
    /sys/kernel/mm/transparent_hugepage/defrag, which can help keep the
    system from getting bogged down in the page reclaim and compaction
    code when many THPs are getting allocated simultaneously.

    However, the gfp_mask for shmem THP allocations was not limited by
    those configuration settings, and some workloads ended up with all
    CPUs stuck on the LRU lock in the page reclaim code, trying to
    allocate dozens of THPs simultaneously.

    This patch applies the same configured limitation of THPs to shmem
    hugepage allocations, to prevent that from happening.

    Controlling the gfp_mask of THP allocations through the knobs in sysfs
    allows users to determine the balance between how aggressively the system
    tries to allocate THPs at fault time, and how much the application may end
    up stalling attempting those allocations.

    This way a THP defrag setting of "never" or "defer+madvise" will result in
    quick allocation failures without direct reclaim when no 2MB free pages
    are available.

    With this patch applied, THP allocations for tmpfs will be a little more
    aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
    less aggressive for files that are not mmapped or mapped without that
    flag.

    Link: https://lkml.kernel.org/r/20201124194925.623931-1-riel@surriel.com
    Link: https://lkml.kernel.org/r/20201124194925.623931-2-riel@surriel.com
    Signed-off-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Xu Yu
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox (Oracle)
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • All callers of find_get_entries() use a pvec, so pass it directly instead
    of manipulating it in the caller.

    Link: https://lkml.kernel.org/r/20201112212641.27837-14-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Reviewed-by: William Kucharski
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This simplifies the callers and leads to a more efficient implementation
    since the XArray has this functionality already.

    Link: https://lkml.kernel.org/r/20201112212641.27837-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Reviewed-by: William Kucharski
    Reviewed-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)