23 Feb, 2024

2 commits

  • commit e656c7a9e59607d1672d85ffa9a89031876ffe67 upstream.

    For shared memory of type SHM_HUGETLB, hugetlb pages are reserved in the
    shmget() call. If the SHM_NORESERVE flag is specified, the hugetlb pages
    are not reserved. However, when the shared memory is attached with the
    shmat() call, the hugetlb pages are incorrectly reserved for SHM_HUGETLB
    shared memory created with SHM_NORESERVE, which is a bug.

    -------------------------------
    Following test shows the issue.

    $cat shmhtb.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Key and size values chosen for illustration; 10 MB corresponds to the
       5 reserved 2 MB hugepages shown in the output below. */
    #define SKEY  0x1234
    #define SHMSZ (10UL * 1024 * 1024)

    int main(void)
    {
        int shmflags = 0660 | IPC_CREAT | SHM_HUGETLB | SHM_NORESERVE;
        int shmid;

        shmid = shmget(SKEY, SHMSZ, shmflags);
        if (shmid < 0) {
            printf("shmat: shmget() failed, %d\n", errno);
            return 1;
        }
        printf("After shmget()\n");
        system("cat /proc/meminfo | grep -i hugepages_");

        shmat(shmid, NULL, 0);
        printf("\nAfter shmat()\n");
        system("cat /proc/meminfo | grep -i hugepages_");

        shmctl(shmid, IPC_RMID, NULL);
        return 0;
    }

    #sysctl -w vm.nr_hugepages=20
    #./shmhtb

    After shmget()
    HugePages_Total: 20
    HugePages_Free: 20
    HugePages_Rsvd: 0
    HugePages_Surp: 0

    After shmat()
    HugePages_Total: 20
    HugePages_Free: 20
    HugePages_Rsvd: 5
    Acked-by: Muchun Song
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Prakash Sangappa
     
  • commit 79d72c68c58784a3e1cd2378669d51bfd0cb7498 upstream.

    When configuring a hugetlb filesystem via the fsconfig() syscall, there is
    a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
    NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
    is not valid.

    E.g: Taking the following steps:

    fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
    fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
    fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
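    A self-contained version of those steps, as a hedged sketch: it assumes
    kernel/glibc headers new enough to provide SYS_fsopen, SYS_fsconfig and
    <linux/mount.h>; there are no glibc wrappers for the new mount API, so
    raw syscalls are used.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mount.h>

    int main(void)
    {
        int fd = syscall(SYS_fsopen, "hugetlbfs", FSOPEN_CLOEXEC);

        if (fd < 0) {
            perror("fsopen");
            return 1;
        }
        /* "1024" is not a supported hugetlb page size, so parsing fails
         * and (before the fix) ctx->hstate is overwritten with NULL. */
        syscall(SYS_fsconfig, fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
        /* Creating the superblock then dereferences the NULL hstate. */
        syscall(SYS_fsconfig, fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
        return 0;
    }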

    Given that the requested "pagesize" is invalid, ctx->hstate will be replaced
    with NULL, losing its previous value, and we will print an error:

    ...
    case Opt_pagesize:
            ps = memparse(param->string, &rest);
            ctx->hstate = size_to_hstate(ps);
            if (!ctx->hstate) {
                    pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
                    return -EINVAL;
            }
            return 0;
    ...

    This is a problem because later on, we will dereference ctx->hstate in
    hugetlbfs_fill_super()

    ...
    ...
    sb->s_blocksize = huge_page_size(ctx->hstate);
    ...
    ...

    This causes the Oops below.

    Fix this by replacing the ctx->hstate value only when the pagesize is known
    to be valid.
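
    A minimal sketch of that fixed logic (simplified; the actual patch may
    differ in details): look the hstate up into a local variable and only
    store it into ctx->hstate once it is known to be non-NULL.

    case Opt_pagesize:
            ps = memparse(param->string, &rest);
            h = size_to_hstate(ps);
            if (!h) {
                    pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
                    return -EINVAL;
            }
            ctx->hstate = h;        /* only overwrite on success */
            return 0;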

    kernel: hugetlbfs: Unsupported page size 0 MB
    kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
    kernel: #PF: supervisor read access in kernel mode
    kernel: #PF: error_code(0x0000) - not-present page
    kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
    kernel: Oops: 0000 [#1] PREEMPT SMP PTI
    kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
    kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
    kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
    kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
    kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
    kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
    kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
    kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
    kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
    kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
    kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
    kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
    kernel: Call Trace:
    kernel:
    kernel: ? __die_body+0x1a/0x60
    kernel: ? page_fault_oops+0x16f/0x4a0
    kernel: ? search_bpf_extables+0x65/0x70
    kernel: ? fixup_exception+0x22/0x310
    kernel: ? exc_page_fault+0x69/0x150
    kernel: ? asm_exc_page_fault+0x22/0x30
    kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
    kernel: ? hugetlbfs_fill_super+0xb4/0x1a0
    kernel: ? hugetlbfs_fill_super+0x28/0x1a0
    kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
    kernel: vfs_get_super+0x40/0xa0
    kernel: ? __pfx_bpf_lsm_capable+0x10/0x10
    kernel: vfs_get_tree+0x25/0xd0
    kernel: vfs_cmd_create+0x64/0xe0
    kernel: __x64_sys_fsconfig+0x395/0x410
    kernel: do_syscall_64+0x80/0x160
    kernel: ? syscall_exit_to_user_mode+0x82/0x240
    kernel: ? do_syscall_64+0x8d/0x160
    kernel: ? syscall_exit_to_user_mode+0x82/0x240
    kernel: ? do_syscall_64+0x8d/0x160
    kernel: ? exc_page_fault+0x69/0x150
    kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
    kernel: RIP: 0033:0x7ffbc0cb87c9
    kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
    kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
    kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
    kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
    kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
    kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
    kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
    kernel:
    kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
    kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
    kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
    kernel: CR2: 0000000000000028
    kernel: ---[ end trace 0000000000000000 ]---
    kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
    kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
    kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
    kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
    kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
    kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
    kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
    kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
    kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
    kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

    Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
    Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
    Signed-off-by: Michal Hocko
    Signed-off-by: Oscar Salvador
    Acked-by: Muchun Song
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Oscar Salvador
     

29 Nov, 2023

1 commit

  • commit 8db0ec791f7788cd21e7f91ee5ff42c1c458d0e7 upstream.

    When dealing with hugetlb pages, struct page is not guaranteed to be
    contiguous on SPARSEMEM without VMEMMAP. Use nth_page() to handle it
    properly.
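
    For illustration only (not part of the patch), assuming a head page and a
    subpage index idx:

    /* 'head + idx' assumes the struct page array is virtually contiguous,
     * which SPARSEMEM without VMEMMAP does not guarantee across memory
     * section boundaries; nth_page() resolves the subpage via its pfn. */
    struct page *subpage = nth_page(head, idx);    /* correct everywhere */
    /* struct page *subpage = head + idx; */       /* may be wrong here  */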

    Without the fix, the wrong subpage might be checked for HWPoison, causing
    the wrong number of bytes of a page to be copied to user space. No bug has
    been reported. The fix
    comes from code inspection.

    Link: https://lkml.kernel.org/r/20230913201248.452081-5-zi.yan@sent.com
    Fixes: 38c1ddbde6c6 ("hugetlbfs: improve read HWPOISON hugepage")
    Signed-off-by: Zi Yan
    Reviewed-by: Muchun Song
    Cc: David Hildenbrand
    Cc: Matthew Wilcox (Oracle)
    Cc: Mike Kravetz
    Cc: Mike Rapoport (IBM)
    Cc: Thomas Bogendoerfer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Zi Yan
     

30 Aug, 2023

1 commit

  • Pull MM updates from Andrew Morton:

    - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
    add_to_avail_list")

    - Peter Xu has a series ("mm/gup: Unify hugetlb, speed up thp") which
    reduces the special-case code for handling hugetlb pages in GUP. It
    also speeds up GUP handling of transparent hugepages.

    - Peng Zhang provides some maple tree speedups ("Optimize the fast path
    of mas_store()").

    - Sergey Senozhatsky has improved the performance of zsmalloc during
    compaction ("zsmalloc: small compaction improvements").

    - Domenico Cerasuolo has developed additional selftest code for zswap
    ("selftests: cgroup: add zswap test program").

    - xu xin has done some work on KSM's handling of zero pages. These
    changes are mainly to enable the user to better understand the
    effectiveness of KSM's treatment of zero pages ("ksm: support
    tracking KSM-placed zero-pages").

    - Jeff Xu has fixed the behaviour of memfd's
    MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
    MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

    - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
    Stop read optimisation when folio removed from pagecache").

    - Axel Rasmussen has given userfaultfd the ability to simulate memory
    poisoning ("add UFFDIO_POISON to simulate memory poisoning with
    UFFD").

    - Miaohe Lin has contributed some routine maintenance work on the
    memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
    check").

    - Peng Zhang has contributed some maintenance work on the maple tree
    code ("Improve the validation for maple tree and some cleanup").

    - Hugh Dickins has optimized the collapsing of shmem or file pages into
    THPs ("mm: free retracted page table by RCU").

    - Jiaqi Yan has a patch series which permits us to use the healthy
    subpages within a hardware poisoned huge page for general purposes
    ("Improve hugetlbfs read on HWPOISON hugepages").

    - Kemeng Shi has done some maintenance work on the pagetable-check code
    ("Remove unused parameters in page_table_check").

    - More folioification work from Matthew Wilcox ("More filesystem folio
    conversions for 6.6"), ("Followup folio conversions for zswap"). And
    from ZhangPeng ("Convert several functions in page_io.c to use a
    folio").

    - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

    - Baoquan He has converted some architectures to use the
    GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
    architectures to take GENERIC_IOREMAP way").

    - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
    batched/deferred tlb shootdown during page reclamation/migration").

    - Better maple tree lockdep checking from Liam Howlett ("More strict
    maple tree lockdep"). Liam also developed some efficiency
    improvements ("Reduce preallocations for maple tree").

    - Cleanup and optimization to the secondary IOMMU TLB invalidation,
    from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
    upgrade").

    - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
    for arm64").

    - Kemeng Shi provides some maintenance work on the compaction code
    ("Two minor cleanups for compaction").

    - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
    most file-backed faults under the VMA lock").

    - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
    on ppc64, under some circumstances ("Add support for DAX vmemmap
    optimization for ppc64").

    - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
    data in page_ext"), ("minor cleanups to page_ext header").

    - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
    cleanups").

    - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

    - VMA handling cleanups from Kefeng Wang ("mm: convert to
    vma_is_initial_heap/stack()").

    - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
    implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
    address ranges and DAMON monitoring targets").

    - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

    - Liam Howlett has improved the maple tree node replacement code
    ("maple_tree: Change replacement strategy").

    - ZhangPeng has a general code cleanup - use the K() macro more widely
    ("cleanup with helper macro K()").

    - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
    memmap on memory feature on ppc64").

    - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
    in page_alloc"), ("Two minor cleanups for get pageblock
    migratetype").

    - Vishal Moola introduces a memory descriptor for page table tracking,
    "struct ptdesc" ("Split ptdesc from struct page").

    - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
    for vm.memfd_noexec").

    - MM include file rationalization from Hugh Dickins ("arch: include
    asm/cacheflush.h in asm/hugetlb.h").

    - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
    output").

    - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
    object_cache instead of kmemleak_initialized").

    - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
    and _folio_order").

    - A VMA locking scalability improvement from Suren Baghdasaryan
    ("Per-VMA lock support for swap and userfaults").

    - pagetable handling cleanups from Matthew Wilcox ("New page table
    range API").

    - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
    using page->private on tail pages for THP_SWAP + cleanups").

    - Cleanups and speedups to the hugetlb fault handling from Matthew
    Wilcox ("Change calling convention for ->huge_fault").

    - Matthew Wilcox has also done some maintenance work on the MM
    subsystem documentation ("Improve mm documentation").

    * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
    maple_tree: shrink struct maple_tree
    maple_tree: clean up mas_wr_append()
    secretmem: convert page_is_secretmem() to folio_is_secretmem()
    nios2: fix flush_dcache_page() for usage from irq context
    hugetlb: add documentation for vma_kernel_pagesize()
    mm: add orphaned kernel-doc to the rst files.
    mm: fix clean_record_shared_mapping_range kernel-doc
    mm: fix get_mctgt_type() kernel-doc
    mm: fix kernel-doc warning from tlb_flush_rmaps()
    mm: remove enum page_entry_size
    mm: allow ->huge_fault() to be called without the mmap_lock held
    mm: move PMD_ORDER to pgtable.h
    mm: remove checks for pte_index
    memcg: remove duplication detection for mem_cgroup_uncharge_swap
    mm/huge_memory: work on folio->swap instead of page->private when splitting folio
    mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
    mm/swap: use dedicated entry for swap in folio
    mm/swap: stop using page->private on tail pages for THP_SWAP
    selftests/mm: fix WARNING comparing pointer to 0
    selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
    ...

    Linus Torvalds
     

19 Aug, 2023

1 commit

  • When a hugepage contains HWPOISON pages, read() fails to read any byte of
    the hugepage and returns -EIO, although many bytes in the HWPOISON
    hugepage are readable.

    Improve this by allowing hugetlbfs_read_iter to return as many bytes as
    possible. For a requested range [offset, offset + len) that contains a
    HWPOISON page, return [offset, first HWPOISON page addr); the next read
    attempt will fail and return -EIO.
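
    A hedged userspace-style sketch of that intended behaviour, where
    subpage_is_hwpoison() and copy_one_subpage() are hypothetical helpers
    standing in for the real kernel logic:

    ssize_t read_until_poison(char *dst, loff_t offset, size_t len)
    {
            size_t done = 0;

            while (done < len) {
                    if (subpage_is_hwpoison(offset + done))
                            /* report partial progress, or -EIO if none */
                            return done ? (ssize_t)done : -EIO;
                    done += copy_one_subpage(dst + done, offset + done);
            }
            return done;
    }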

    Link: https://lkml.kernel.org/r/20230713001833.3778937-4-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: James Houghton
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc: Yang Shi
    Signed-off-by: Andrew Morton

    Jiaqi Yan
     

24 Jul, 2023

1 commit

  • In later patches, we're going to change how the inode's ctime field is
    used. Switch to using accessor functions instead of raw accesses of
    inode->i_ctime.

    Signed-off-by: Jeff Layton
    Acked-by: Mike Kravetz
    Reviewed-by: Jan Kara
    Message-Id:
    Signed-off-by: Christian Brauner

    Jeff Layton
     

24 Jun, 2023

1 commit

  • Ackerley Tng reported an issue with hugetlbfs fallocate as noted in the
    Closes tag. The issue showed up after the conversion of hugetlb page
    cache lookup code to use page_cache_next_miss. User visible effects are:

    - hugetlbfs fallocate incorrectly returns -EEXIST if pages are present
    in the file.
    - hugetlb pages will not be included in core dumps if they need to be
    brought in via GUP.
    - userfaultfd UFFDIO_COPY will not notice pages already present in the
    cache. It may try to allocate a new page and potentially return
    ENOMEM as opposed to EEXIST.

    Revert the use of page_cache_next_miss() in hugetlb code.

    IMPORTANT NOTE FOR STABLE BACKPORTS:
    This patch will apply cleanly to v6.3. However, due to the change of
    filemap_get_folio() return values, it will not function correctly. This
    patch must be modified for stable backports.

    [dan.carpenter@linaro.org: fix hugetlbfs_pagecache_present()]
    Link: https://lkml.kernel.org/r/efa86091-6a2c-4064-8f55-9b44e1313015@moroto.mountain
    Link: https://lkml.kernel.org/r/20230621212403.174710-2-mike.kravetz@oracle.com
    Fixes: d0ce0e47b323 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()")
    Signed-off-by: Mike Kravetz
    Signed-off-by: Dan Carpenter
    Reported-by: Ackerley Tng
    Closes: https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com
    Reviewed-by: Sidhartha Kumar
    Cc: Erdem Aktas
    Cc: Greg Kroah-Hartman
    Cc: Matthew Wilcox
    Cc: Muchun Song
    Cc: Vishal Annapurve
    Signed-off-by: Andrew Morton

    Mike Kravetz
     

10 Jun, 2023

1 commit

  • Calling hugetlb_set_vma_policy() later avoids setting the vma policy
    and then dropping it on a page cache hit.

    Link: https://lkml.kernel.org/r/20230502235622.3652586-1-ackerleytng@google.com
    Signed-off-by: Ackerley Tng
    Reviewed-by: Mike Kravetz
    Cc: Erdem Aktas
    Cc: John Hubbard
    Cc: Matthew Wilcox (Oracle)
    Cc: Muchun Song
    Cc: Sidhartha Kumar
    Cc: Vishal Annapurve
    Signed-off-by: Andrew Morton

    Ackerley Tng
     

22 Apr, 2023

1 commit

  • Instead of having callers care about the mmap_min_addr logic for the
    lowest valid mapping address (and some of them getting it wrong), just
    move the logic into vm_unmapped_area() itself. One less thing for various
    architecture cases (and generic helpers) to worry about.

    We should really try to make much more of this be common code, but baby
    steps..

    Without this, vm_unmapped_area() could return an address below
    mmap_min_addr (because some caller forgot about that). That then causes
    the mmap machinery to think it has found a workable address, but then
    later security_mmap_addr(addr) is unhappy about it and the mmap() returns
    with a nonsensical error (EPERM).

    The proper action is to either return ENOMEM (if the virtual address space
    is exhausted), or try to find another address (ie do a bottom-up search
    for free addresses after the top-down one failed).

    See commit 2afc745f3e30 ("mm: ensure get_unmapped_area() returns higher
    address than mmap_min_addr"), which fixed this for one call site (the
    generic arch_get_unmapped_area_topdown() fallback) but left other cases
    alone.
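
    A hedged sketch of the idea, not the exact diff: clamp the lower limit
    inside vm_unmapped_area() itself so no caller can get back an address
    below mmap_min_addr.

    unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
    {
            /* Never hand out addresses below the minimum mappable address. */
            if (info->low_limit < mmap_min_addr)
                    info->low_limit = mmap_min_addr;

            if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
                    return unmapped_area_topdown(info);
            return unmapped_area(info);
    }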

    Link: https://lkml.kernel.org/r/20230418214009.1142926-1-Liam.Howlett@oracle.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Liam R. Howlett
    Cc: Russell King
    Cc: Liam Howlett
    Signed-off-by: Andrew Morton

    Linus Torvalds
     

06 Apr, 2023

1 commit

  • Instead of returning NULL for all errors, distinguish between:

    - no entry found and not asked to allocate (-ENOENT)
    - failed to allocate memory (-ENOMEM)
    - would block (-EAGAIN)

    so that callers don't have to guess the error based on the passed in
    flags.

    Also pass the error through the direct callers: filemap_get_folio,
    filemap_lock_folio, filemap_grab_folio and filemap_get_incore_folio.
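
    A hedged caller-side sketch of the new convention (the wrapper function
    is illustrative, not from the patch):

    static struct folio *grab_cached_folio(struct address_space *mapping,
                                           pgoff_t index)
    {
            struct folio *folio = filemap_lock_folio(mapping, index);

            if (IS_ERR(folio)) {
                    if (PTR_ERR(folio) == -ENOENT)
                            return NULL;    /* nothing cached at this index */
                    return folio;           /* propagate -ENOMEM / -EAGAIN */
            }
            return folio;                   /* success: locked folio */
    }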

    [hch@lst.de: fix null-pointer deref]
    Link: https://lkml.kernel.org/r/20230310070023.GA13563@lst.de
    Link: https://lkml.kernel.org/r/20230310043137.GA1624890@u2004
    Link: https://lkml.kernel.org/r/20230307143410.28031-8-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Acked-by: Ryusuke Konishi [nilfs2]
    Cc: Andreas Gruenbacher
    Cc: Hugh Dickins
    Cc: Matthew Wilcox (Oracle)
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton

    Christoph Hellwig
     

24 Feb, 2023

1 commit

  • Pull MM updates from Andrew Morton:

    - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
    F_SEAL_EXEC") which permits the setting of the memfd execute bit at
    memfd creation time, with the option of sealing the state of the X
    bit.

    - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
    thread-safe for pmd unshare") which addresses a rare race condition
    related to PMD unsharing.

    - Several folioification patch series from Matthew Wilcox, Vishal
    Moola, Sidhartha Kumar and Lorenzo Stoakes

    - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
    which does perform some memcg maintenance and cleanup work.

    - SeongJae Park has added DAMOS filtering to DAMON, with the series
    "mm/damon/core: implement damos filter".

    These filters provide users with finer-grained control over DAMOS's
    actions. SeongJae has also done some DAMON cleanup work.

    - Kairui Song adds a series ("Clean up and fixes for swap").

    - Vernon Yang contributed the series "Clean up and refinement for maple
    tree".

    - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
    adds to MGLRU an LRU of memcgs, to improve the scalability of global
    reclaim.

    - David Hildenbrand has added some userfaultfd cleanup work in the
    series "mm: uffd-wp + change_protection() cleanups".

    - Christoph Hellwig has removed the generic_writepages() library
    function in the series "remove generic_writepages".

    - Baolin Wang has performed some maintenance on the compaction code in
    his series "Some small improvements for compaction".

    - Sidhartha Kumar is doing some maintenance work on struct page in his
    series "Get rid of tail page fields".

    - David Hildenbrand contributed some cleanup, bugfixing and
    generalization of pte management and of pte debugging in his series
    "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
    swap PTEs".

    - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
    flag in the series "Discard __GFP_ATOMIC".

    - Sergey Senozhatsky has improved zsmalloc's memory utilization with
    his series "zsmalloc: make zspage chain size configurable".

    - Joey Gouly has added prctl() support for prohibiting the creation of
    writeable+executable mappings.

    The previous BPF-based approach had shortcomings. See "mm: In-kernel
    support for memory-deny-write-execute (MDWE)".

    - Waiman Long did some kmemleak cleanup and bugfixing in the series
    "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

    - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
    "mm: multi-gen LRU: improve".

    - Jiaqi Yan has provided some enhancements to our memory error
    statistics reporting, mainly by presenting the statistics on a
    per-node basis. See the series "Introduce per NUMA node memory error
    statistics".

    - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
    regression in compaction via his series "Fix excessive CPU usage
    during compaction".

    - Christoph Hellwig does some vmalloc maintenance work in the series
    "cleanup vfree and vunmap".

    - Christoph Hellwig has removed block_device_operations.rw_page() in
    this series "remove ->rw_page".

    - We get some maple_tree improvements and cleanups in Liam Howlett's
    series "VMA tree type safety and remove __vma_adjust()".

    - Suren Baghdasaryan has done some work on the maintainability of our
    vm_flags handling in the series "introduce vm_flags modifier
    functions".

    - Some pagemap cleanup and generalization work in Mike Rapoport's
    series "mm, arch: add generic implementation of pfn_valid() for
    FLATMEM" and "fixups for generic implementation of pfn_valid()"

    - Baoquan He has done some work to make /proc/vmallocinfo and
    /proc/kcore better represent the real state of things in his series
    "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

    - Jason Gunthorpe rationalized the GUP system's interface to the rest
    of the kernel in the series "Simplify the external interface for
    GUP".

    - SeongJae Park wishes to migrate people from DAMON's debugfs interface
    over to its sysfs interface. To support this, we'll temporarily be
    printing warnings when people use the debugfs interface. See the
    series "mm/damon: deprecate DAMON debugfs interface".

    - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
    and clean-ups" series.

    - Huang Ying has provided a dramatic reduction in migration's TLB flush
    IPI rates with the series "migrate_pages(): batch TLB flushing".

    - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

    * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
    include/linux/migrate.h: remove unneeded externs
    mm/memory_hotplug: cleanup return value handing in do_migrate_range()
    mm/uffd: fix comment in handling pte markers
    mm: change to return bool for isolate_movable_page()
    mm: hugetlb: change to return bool for isolate_hugetlb()
    mm: change to return bool for isolate_lru_page()
    mm: change to return bool for folio_isolate_lru()
    objtool: add UACCESS exceptions for __tsan_volatile_read/write
    kmsan: disable ftrace in kmsan core code
    kasan: mark addr_has_metadata __always_inline
    mm: memcontrol: rename memcg_kmem_enabled()
    sh: initialize max_mapnr
    m68k/nommu: add missing definition of ARCH_PFN_OFFSET
    mm: percpu: fix incorrect size in pcpu_obj_full_size()
    maple_tree: reduce stack usage with gcc-9 and earlier
    mm: page_alloc: call panic() when memoryless node allocation fails
    mm: multi-gen LRU: avoid futile retries
    migrate_pages: move THP/hugetlb migration support check to simplify code
    migrate_pages: batch flushing TLB
    migrate_pages: share more code between _unmap and _move
    ...

    Linus Torvalds
     

14 Feb, 2023

3 commits

  • Every caller of hugetlb_add_to_page_cache() is now passing in
    &folio->page, change the function to take in a folio directly and clean up
    the call sites.

    Link: https://lkml.kernel.org/r/20230125170537.96973-7-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar
    Cc: Gerald Schaefer
    Cc: John Hubbard
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Muchun Song
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     
  • Every caller of restore_reserve_on_error() is now passing in &folio->page,
    change the function to take in a folio directly and clean up the call
    sites.

    Link: https://lkml.kernel.org/r/20230125170537.96973-6-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar
    Cc: Gerald Schaefer
    Cc: John Hubbard
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Muchun Song
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     
  • Change alloc_huge_page() to alloc_hugetlb_folio() by changing all callers
    to handle the now folio return type of the function. In this conversion,
    alloc_huge_page_vma() is also changed to alloc_hugetlb_folio_vma() and
    hugepage_add_new_anon_rmap() is changed to take in a folio directly. Many
    additions of '&folio->page' are cleaned up in subsequent patches.

    hugetlbfs_fallocate() is also refactored to use the RCU +
    page_cache_next_miss() API.

    Link: https://lkml.kernel.org/r/20230125170537.96973-5-sidhartha.kumar@oracle.com
    Suggested-by: Mike Kravetz
    Reported-by: kernel test robot
    Signed-off-by: Sidhartha Kumar
    Cc: Gerald Schaefer
    Cc: John Hubbard
    Cc: Matthew Wilcox
    Cc: Muchun Song
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     

10 Feb, 2023

1 commit

  • Replace direct modifications to vma->vm_flags with calls to modifier
    functions to be able to track flag changes and to keep vma locking
    correctness.

    [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
    Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Mike Rapoport (IBM)
    Acked-by: Sebastian Reichel
    Reviewed-by: Liam R. Howlett
    Reviewed-by: Hyeonggon Yoo
    Cc: Andy Lutomirski
    Cc: Arjun Roy
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: David Howells
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Eric Dumazet
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Jann Horn
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: Kent Overstreet
    Cc: Laurent Dufour
    Cc: Lorenzo Stoakes
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Paul E. McKenney
    Cc: Peter Oskolkov
    Cc: Peter Xu
    Cc: Peter Zijlstra
    Cc: Punit Agrawal
    Cc: Sebastian Andrzej Siewior
    Cc: Shakeel Butt
    Cc: Soheil Hassas Yeganeh
    Cc: Song Liu
    Cc: Vlastimil Babka
    Cc: Will Deacon
    Signed-off-by: Andrew Morton

    Suren Baghdasaryan
     

19 Jan, 2023

9 commits

  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.
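
    As an illustrative sketch of the shape of such a conversion
    (hugetlbfs_setattr() is used here as the example; the per-filesystem
    details differ):

    /* before: a namespace argument that is easy to conflate */
    static int hugetlbfs_setattr(struct user_namespace *mnt_userns,
                                 struct dentry *dentry, struct iattr *attr);

    /* after: a dedicated mount idmap type */
    static int hugetlbfs_setattr(struct mnt_idmap *idmap,
                                 struct dentry *dentry, struct iattr *attr);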

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • huge_pte_offset() is the main walker function for hugetlb pgtables. The
    name is not really representing what it does, though.

    Instead of renaming it, introduce a wrapper function called hugetlb_walk()
    which will use huge_pte_offset() inside. Assert on the locks when walking
    the pgtable.
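
    A simplified sketch of the wrapper's shape (the real helper also accepts
    i_mmap_rwsem being held as an alternative to the vma lock, per the notes
    below):

    static inline pte_t *hugetlb_walk(struct vm_area_struct *vma,
                                      unsigned long addr, unsigned long sz)
    {
            /* walking is only safe while pmd unsharing is excluded */
            hugetlb_vma_assert_locked(vma);
            return huge_pte_offset(vma->vm_mm, addr, sz);
    }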

    Note, the vma lock assertion will be a no-op for private mappings.

    Document the last special case in the page_vma_mapped_walk() path where we
    don't need any more lock to call hugetlb_walk().

    Taking vma lock there is not needed because either: (1) potential callers
    of hugetlb pvmw holds i_mmap_rwsem already (from one rmap_walk()), or (2)
    the caller will not walk a hugetlb vma at all, so the hugetlb code path is
    not reachable (e.g. in ksm or uprobe paths).

    That lock requirement is slightly implicit for future page_vma_mapped_walk()
    callers. But anyway, if one day this rule is broken, lockdep will give a
    straightforward warning in hugetlb_walk(), and then there will be a way
    out.

    [akpm@linux-foundation.org: coding-style cleanups]
    Link: https://lkml.kernel.org/r/20221216155229.2043750-1-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Mike Kravetz
    Reviewed-by: John Hubbard
    Reviewed-by: David Hildenbrand
    Cc: Andrea Arcangeli
    Cc: James Houghton
    Cc: Jann Horn
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc: Nadav Amit
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton

    Peter Xu
     
  • Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
    unshare", v4.

    Problem
    =======

    huge_pte_offset() is a major helper used by hugetlb code paths to walk a
    hugetlb pgtable. It's used mostly everywhere since that's needed even
    before taking the pgtable lock.

    huge_pte_offset() is always called with mmap lock held with either read or
    write. It was assumed to be safe, but it's actually not. One race
    condition can easily be triggered by: (1) first trigger pmd sharing on a
    memory range, (2) do huge_pte_offset() on the range, then in the meantime,
    (3) another thread unshares the pmd range, and the pgtable page is prone
    to being lost if the other sharing process wants to free it completely (by
    either munmap or exit mm).

    The recent work from Mike on vma lock can resolve most of this already.
    It's achieved by forbidding pmd unsharing while the lock is taken, so there
    is no further risk of the pgtable page being freed. It means that if we can
    take the vma lock around all huge_pte_offset() callers, it'll be safe.

    There are already a bunch of them that we did as per the latest
    mm-unstable, but also quite a few others that we didn't, for various
    reasons, especially around huge_pte_offset() usage.

    One more thing to mention is that besides the vma lock, i_mmap_rwsem can
    also be used to protect the pgtable page (along with its pgtable lock) from
    being freed from under us. IOW, huge_pte_offset() callers need to either
    hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

    A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
    very hard to trigger, one needs to apply another kernel delay patch too,
    see below):

    ======8<========================
    #define _GNU_SOURCE      /* memfd_create(), MADV_POPULATE_WRITE */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <assert.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <linux/memfd.h>

    #define MSIZE (1UL << 30) /* 1GB */
    #define PSIZE (2UL << 20) /* 2MB */

    #define HOLD_SEC (1)

    int pipefd[2];
    void *buf;

    void *do_map(int fd)
    {
        unsigned char *tmpbuf, *p;
        int ret;

        ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
        if (ret) {
            perror("posix_memalign() failed");
            return NULL;
        }

        tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_FIXED, fd, 0);
        if (tmpbuf == MAP_FAILED) {
            perror("mmap() failed");
            return NULL;
        }
        printf("mmap() -> %p\n", tmpbuf);

        for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
            *p = 1;
        }

        return tmpbuf;
    }

    void do_unmap(void *buf)
    {
        munmap(buf, MSIZE);
    }

    void proc2(int fd)
    {
        unsigned char c;

        buf = do_map(fd);
        if (!buf)
            return;

        read(pipefd[0], &c, 1);
        /*
         * This frees the shared pgtable page, causing use-after-free in
         * proc1_thread1 when soft walking hugetlb pgtable.
         */
        do_unmap(buf);

        printf("Proc2 quitting\n");
    }

    void *proc1_thread1(void *data)
    {
        /*
         * Trigger follow-page on 1st 2m page. Kernel hack patch needed to
         * withhold this procedure for easier reproduce.
         */
        madvise(buf, PSIZE, MADV_POPULATE_WRITE);
        printf("Proc1-thread1 quitting\n");
        return NULL;
    }

    void *proc1_thread2(void *data)
    {
        unsigned char c;

        /* Wait a while until proc1_thread1() start to wait */
        sleep(0.5);
        /* Trigger pmd unshare */
        madvise(buf, PSIZE, MADV_DONTNEED);
        /* Kick off proc2 to release the pgtable */
        write(pipefd[1], &c, 1);

        printf("Proc1-thread2 quitting\n");
        return NULL;
    }

    void proc1(int fd)
    {
        pthread_t tid1, tid2;
        int ret;

        buf = do_map(fd);
        if (!buf)
            return;

        ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
        assert(ret == 0);
        ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
        assert(ret == 0);

        /* Kick the child to share the PUD entry */
        pthread_join(tid1, NULL);
        pthread_join(tid2, NULL);

        do_unmap(buf);
    }

    int main(void)
    {
        int fd, ret;

        fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
        if (fd < 0) {
            perror("open failed");
            return -1;
        }

        ret = ftruncate(fd, MSIZE);
        if (ret) {
            perror("ftruncate() failed");
            return -1;
        }

        ret = pipe(pipefd);
        if (ret) {
            perror("pipe() failed");
            return -1;
        }

        if (fork()) {
            proc1(fd);
        } else {
            proc2(fd);
        }

        close(pipefd[0]);
        close(pipefd[1]);
        close(fd);

        return 0;
    }
    ======8<========================

    Kernel delay patch (excerpt) used to widen the race window to roughly
    one second right before the walker takes the pte lock:

    : +        for (int c = 0; c < 100; c++) {
    : +                udelay(10000);
    : +        }
    : +        pr_info("%s: withhold 1 sec...done\n", __func__);
    : +
    :          if (pte)
    :                  ptl = huge_pte_lock(h, mm, pte);
    :          absent = !pte || huge_pte_none(huge_ptep_get(pte));
    : ======8<=======

    Even though vma_offset_start() is named like that, it's not returning
    "the start address of the range" but rather the offset we should use to
    offset the vma->vm_start address.

    Make it return the real value of the start vaddr, and it also helps for
    all the callers because whenever the retval is used, it'll be ultimately
    added into the vma->vm_start anyway, so it's better.

    Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Mike Kravetz
    Reviewed-by: David Hildenbrand
    Reviewed-by: John Hubbard
    Cc: Andrea Arcangeli
    Cc: James Houghton
    Cc: Jann Horn
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc: Nadav Amit
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton

    Peter Xu
     

01 Dec, 2022

2 commits


09 Nov, 2022

4 commits

  • Syzkaller reports a null-ptr-deref bug as follows:
    ======================================================
    KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
    RIP: 0010:hugetlbfs_parse_param+0x1dd/0x8e0 fs/hugetlbfs/inode.c:1380
    [...]
    Call Trace:

    vfs_parse_fs_param fs/fs_context.c:148 [inline]
    vfs_parse_fs_param+0x1f9/0x3c0 fs/fs_context.c:129
    vfs_parse_fs_string+0xdb/0x170 fs/fs_context.c:191
    generic_parse_monolithic+0x16f/0x1f0 fs/fs_context.c:231
    do_new_mount fs/namespace.c:3036 [inline]
    path_mount+0x12de/0x1e20 fs/namespace.c:3370
    do_mount fs/namespace.c:3383 [inline]
    __do_sys_mount fs/namespace.c:3591 [inline]
    __se_sys_mount fs/namespace.c:3568 [inline]
    __x64_sys_mount+0x27f/0x300 fs/namespace.c:3568
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd
    [...]

    ======================================================

    According to commit "vfs: parse: deal with zero length string value", the
    kernel will set param->string to a null pointer in vfs_parse_fs_string()
    if the fs string has zero length.

    Yet the problem is that hugetlbfs_parse_param() will dereference
    param->string without checking whether it is a null pointer. To be more
    specific, if hugetlbfs_parse_param() parses an illegal mount parameter,
    such as "size=,", the kernel will construct a struct fs_parameter with a
    null pointer in vfs_parse_fs_string(), then pass this struct fs_parameter
    to hugetlbfs_parse_param(), which triggers the above null-ptr-deref bug.

    This patch solves it by adding a sanity check on param->string
    in hugetlbfs_parse_param().
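
    A hedged sketch of the kind of check added, shown for one option (not the
    verbatim patch):

    case Opt_size:
            /* a zero-length "size=" value reaches here as a NULL string */
            if (!param->string)
                    goto bad_val;
            /* memparse() will accept a K/M/G without a digit */
            if (!isdigit(param->string[0]))
                    goto bad_val;
            /* ... then parse the size as before ... */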

    Link: https://lkml.kernel.org/r/20221020231609.4810-1-yin31149@gmail.com
    Reported-by: syzbot+a3e6acd85ded5c16a709@syzkaller.appspotmail.com
    Tested-by: syzbot+a3e6acd85ded5c16a709@syzkaller.appspotmail.com
    Link: https://lore.kernel.org/all/0000000000005ad00405eb7148c6@google.com/
    Signed-off-by: Hawkins Jiawei
    Reviewed-by: Mike Kravetz
    Cc: Hawkins Jiawei
    Cc: Muchun Song
    Cc: Ian Kent
    Signed-off-by: Andrew Morton

    Hawkins Jiawei
     
  • Remove the last caller of delete_from_page_cache() by converting the code
    to its folio equivalent.

    Link: https://lkml.kernel.org/r/20220922154207.1575343-5-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar
    Reviewed-by: Mike Kravetz
    Cc: Arnd Bergmann
    Cc: Colin Cross
    Cc: David Howells
    Cc: "Eric W . Biederman"
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Muchun Song
    Cc: Peter Xu
    Cc: Vlastimil Babka
    Cc: William Kucharski
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     
  • Allow hugetlbfs_migrate_folio to check and read subpool information by
    passing in a folio.

    Link: https://lkml.kernel.org/r/20220922154207.1575343-4-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar
    Reviewed-by: Mike Kravetz
    Cc: Arnd Bergmann
    Cc: Colin Cross
    Cc: David Howells
    Cc: "Eric W . Biederman"
    Cc: Hugh Dickins
    Cc: kernel test robot
    Cc: Matthew Wilcox
    Cc: Muchun Song
    Cc: Peter Xu
    Cc: Vlastimil Babka
    Cc: William Kucharski
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     
  • This change is very similar to the change that was made for shmem [1], and
    it solves the same problem but for HugeTLBFS instead.

    Currently, when poison is found in a HugeTLB page, the page is removed
    from the page cache. That means that attempting to map or read that
    hugepage in the future will result in a new hugepage being allocated
    instead of notifying the user that the page was poisoned. As [1] states,
    this is effectively memory corruption.

    The fix is to leave the page in the page cache. If the user attempts to
    use a poisoned HugeTLB page with a syscall, the syscall will fail with
    EIO, the same error code that shmem uses. For attempts to map the page,
    the thread will get a BUS_MCEERR_AR SIGBUS.

    [1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens")

    Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
    Signed-off-by: James Houghton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Reviewed-by: Yang Shi
    Cc: Axel Rasmussen
    Cc: James Houghton
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc:
    Signed-off-by: Andrew Morton

    James Houghton
     

11 Oct, 2022

1 commit

  • Pull vfs tmpfile updates from Al Viro:
    "Miklos' ->tmpfile() signature change; pass an unopened struct file to
    it, let it open the damn thing. Allows to add tmpfile support to FUSE"

    * tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fuse: implement ->tmpfile()
    vfs: open inside ->tmpfile()
    vfs: move open right after ->tmpfile()
    vfs: make vfs_tmpfile() static
    ovl: use vfs_tmpfile_open() helper
    cachefiles: use vfs_tmpfile_open() helper
    cachefiles: only pass inode to *mark_inode_inuse() helpers
    cachefiles: tmpfile error handling cleanup
    hugetlbfs: cleanup mknod and tmpfile
    vfs: add vfs_tmpfile_open() helper

    Linus Torvalds
     

04 Oct, 2022

7 commits

  • With the new hugetlb vma lock in place, it can also be used to handle page
    fault races with file truncation. The lock is taken at the beginning of
    the page fault path in read mode. During truncation, it is taken in write
    mode for each vma which has the file mapped. The file's size (i_size) is
    modified before taking the vma lock to unmap.

    How are races handled?

    The page fault code checks i_size early in processing after taking the vma
    lock. If the fault is beyond i_size, the fault is aborted. If the fault
    is not beyond i_size the fault will continue and a new page will be added
    to the file. It could be that truncation code modifies i_size after the
    check in fault code. That is OK, as truncation code will soon remove the
    page. The truncation code will wait until the fault is finished, as it
    must obtain the vma lock in write mode.
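
    A hedged fragment of the fault-side ordering described above (identifiers
    such as idx, mapping and h follow the hugetlb fault path, but this is not
    the verbatim kernel code):

    hugetlb_vma_lock_read(vma);             /* blocks truncation's unmap */
    size = i_size_read(mapping->host) >> huge_page_shift(h);
    if (idx >= size) {                      /* fault beyond i_size: abort */
            hugetlb_vma_unlock_read(vma);
            return VM_FAULT_SIGBUS;
    }
    /* Otherwise install the new page.  If truncation shrinks i_size after
     * the check, it will take the vma lock in write mode and remove it. */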

    This patch cleans up/removes late checks in the fault paths that try to
    back out pages racing with truncation. As noted above, we just let the
    truncation code remove the pages.

    [mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
    Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
    Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • The new hugetlb vma lock is used to address this race:

    Faulting thread                           Unsharing thread
    ...                                       ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                              i_mmap_lock_write
                                              lock page table
    ptep invalid
    [remainder of the race diagram truncated]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • Create the new routine hugetlb_unmap_file_folio that will unmap a single
    file folio. This is refactored code from hugetlb_vmdelete_list. It is
    modified to do locking within the routine itself and check whether the
    page is mapped within a specific vma before unmapping.

    This refactoring will be put to use and expanded upon in a subsequent
    patch adding vma specific locking.

    Link: https://lkml.kernel.org/r/20220914221810.95771-8-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • Create the new routine remove_inode_single_folio that will remove a single
    folio from a file. This is refactored code from remove_inode_hugepages.
    It checks for the uncommon case in which the folio is still mapped and
    unmaps.

    No functional change. This refactoring will be put to use and expanded
    upon in subsequent patches.

    Link: https://lkml.kernel.org/r/20220914221810.95771-5-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • remove_huge_page removes a hugetlb page from the page cache. Change to
    hugetlb_delete_from_page_cache as it is a more descriptive name.
    huge_add_to_page_cache is global in scope, but only deals with hugetlb
    pages. For consistency and clarity, rename to hugetlb_add_to_page_cache.

    Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") added code to take i_mmap_rwsem in read mode for the
    duration of fault processing. However, this has been shown to cause
    performance/scaling issues. Revert the code and go back to only taking
    the semaphore in huge_pmd_share during the fault path.

    Keep the code that takes i_mmap_rwsem in write mode before calling
    try_to_unmap as this is required if huge_pmd_unshare is called.

    NOTE: Reverting this code does expose the following race condition.

    Faulting thread                           Unsharing thread
    ...                                       ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                              i_mmap_lock_write
                                              lock page table
    ptep invalid
    [remainder of the race diagram truncated]
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • Patch series "hugetlb: Use new vma lock for huge pmd sharing
    synchronization", v2.

    hugetlb fault scalability regressions have recently been reported [1].
    This is not the first such report, as regressions were also noted when
    commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") was added [2] in v5.7. At that time, a proposal to
    address the regression was suggested [3] but went nowhere.

    The regression and benefit of this patch series is not evident when
    using the vm_scalability benchmark reported in [2] on a recent kernel.
    Results from running,
    "./usemem -n 48 --prealloc --prefault -O -U 3448054972"

                                     48 sample Avg
    next-20220913           next-20220913                   next-20220913
    unmodified              revert i_mmap_sema locking      vma sema locking, this series
    -------------------------------------------------------------------------------------
    498150 KB/s             501934 KB/s                     504793 KB/s

    The recent regression report [1] notes page fault and fork latency of
    shared hugetlb mappings. To measure this, I created two simple programs:
    1) map a shared hugetlb area, write fault all pages, unmap area
    Do this in a continuous loop to measure faults per second
    2) map a shared hugetlb area, write fault a few pages, fork and exit
    Do this in a continuous loop to measure forks per second
    These programs were run on a 48 CPU VM with 320GB memory. The shared
    mapping size was 250GB. For comparison, a single instance of the program
    was run. Then, multiple instances were run in parallel to introduce
    lock contention. Changing the locking scheme results in a significant
    performance benefit.

    test              instances   unmodified    revert      vma
    --------------------------------------------------------------------------
    faults per sec        1         393043      395680     389932
    faults per sec       24          71405       81191      79048
    forks per sec         1           2802        2747       2725
    forks per sec        24            439         536        500
    Combined faults      24           1621       68070      53662
    Combined forks       24            358          67        142

    Combined test is when running both faulting program and forking program
    simultaneously.

    Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5 which
    depends on c0d0381ade79. Acquisition of i_mmap_rwsem is still required in
    the fault path to establish pmd sharing, so this is moved back to
    huge_pmd_share. With c0d0381ade79 reverted, this race is exposed:

    Faulting thread                           Unsharing thread
    ...                                       ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                              i_mmap_lock_write
                                              lock page table
    ptep invalid
    [remainder of the race diagram truncated]
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     

24 Sep, 2022

2 commits

  • This is in preparation for adding tmpfile support to fuse, which requires
    that the tmpfile creation and opening are done as a single operation.

    Replace the 'struct dentry *' argument of i_op->tmpfile with
    'struct file *'.

    Call finish_open_simple() as the last thing in ->tmpfile() instances (may
    be omitted in the error case).

    Change d_tmpfile() argument to 'struct file *' as well to make callers more
    readable.
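
    A hedged sketch of the resulting shape, using hugetlbfs as the
    illustrative instance (simplified; the mount-idmapping argument and
    timestamp updates are omitted):

    static int hugetlbfs_tmpfile(struct inode *dir, struct file *file,
                                 umode_t mode)
    {
            struct inode *inode;

            inode = hugetlbfs_get_inode(dir->i_sb, dir, mode | S_IFREG, 0);
            if (!inode)
                    return -ENOSPC;
            d_tmpfile(file, inode);         /* now takes the struct file */
            return finish_open_simple(file, 0);
    }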

    Reviewed-by: Christian Brauner (Microsoft)
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Duplicate the few lines that are shared between hugetlbfs_mknod() and
    hugetlbfs_tmpfile().

    This is a prerequisite for sanely changing the signature of ->tmpfile().

    Signed-off-by: Al Viro
    Reviewed-by: Christian Brauner (Microsoft)
    Signed-off-by: Miklos Szeredi

    Al Viro