23 Feb, 2024

2 commits

  • commit e656c7a9e59607d1672d85ffa9a89031876ffe67 upstream.

    For shared memory of type SHM_HUGETLB, hugetlb pages are reserved in the
    shmget() call. If the SHM_NORESERVE flag is specified, the hugetlb pages
    are not reserved. However, when the shared memory is attached with the
    shmat() call, the hugetlb pages are incorrectly reserved for SHM_HUGETLB
    shared memory created with SHM_NORESERVE, which is a bug.

    -------------------------------
    Following test shows the issue.

    $cat shmhtb.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Key and size values chosen for illustration; 10 MB corresponds to the
       5 reserved 2 MB hugepages shown in the output below. */
    #define SKEY  0x1234
    #define SHMSZ (10UL * 1024 * 1024)

    int main(void)
    {
        int shmflags = 0660 | IPC_CREAT | SHM_HUGETLB | SHM_NORESERVE;
        int shmid;

        shmid = shmget(SKEY, SHMSZ, shmflags);
        if (shmid < 0) {
            printf("shmat: shmget() failed, %d\n", errno);
            return 1;
        }
        printf("After shmget()\n");
        system("cat /proc/meminfo | grep -i hugepages_");

        shmat(shmid, NULL, 0);
        printf("\nAfter shmat()\n");
        system("cat /proc/meminfo | grep -i hugepages_");

        shmctl(shmid, IPC_RMID, NULL);
        return 0;
    }

    #sysctl -w vm.nr_hugepages=20
    #./shmhtb

    After shmget()
    HugePages_Total: 20
    HugePages_Free: 20
    HugePages_Rsvd: 0
    HugePages_Surp: 0

    After shmat()
    HugePages_Total: 20
    HugePages_Free: 20
    HugePages_Rsvd: 5
    Acked-by: Muchun Song
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Prakash Sangappa
     
  • commit 79d72c68c58784a3e1cd2378669d51bfd0cb7498 upstream.

    When configuring a hugetlb filesystem via the fsconfig() syscall, there is
    a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
    NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
    is not valid.

    E.g: Taking the following steps:

    fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
    fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
    fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
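    A self-contained version of those steps, as a hedged sketch: it assumes
    kernel/glibc headers new enough to provide SYS_fsopen, SYS_fsconfig and
    <linux/mount.h>; there are no glibc wrappers for the new mount API, so
    raw syscalls are used.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mount.h>

    int main(void)
    {
        int fd = syscall(SYS_fsopen, "hugetlbfs", FSOPEN_CLOEXEC);

        if (fd < 0) {
            perror("fsopen");
            return 1;
        }
        /* "1024" is not a supported hugetlb page size, so parsing fails
         * and (before the fix) ctx->hstate is overwritten with NULL. */
        syscall(SYS_fsconfig, fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
        /* Creating the superblock then dereferences the NULL hstate. */
        syscall(SYS_fsconfig, fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
        return 0;
    }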

    Given that the requested "pagesize" is invalid, ctx->hstate will be replaced
    with NULL, losing its previous value, and we will print an error:

    ...
    case Opt_pagesize:
            ps = memparse(param->string, &rest);
            ctx->hstate = size_to_hstate(ps);
            if (!ctx->hstate) {
                    pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
                    return -EINVAL;
            }
            return 0;
    ...

    This is a problem because later on, we will dereference ctx->hstate in
    hugetlbfs_fill_super()

    ...
    ...
    sb->s_blocksize = huge_page_size(ctx->hstate);
    ...
    ...

    This causes the Oops below.

    Fix this by replacing the ctx->hstate value only when the pagesize is known
    to be valid.
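
    A minimal sketch of that fixed logic (simplified; the actual patch may
    differ in details): look the hstate up into a local variable and only
    store it into ctx->hstate once it is known to be non-NULL.

    case Opt_pagesize:
            ps = memparse(param->string, &rest);
            h = size_to_hstate(ps);
            if (!h) {
                    pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
                    return -EINVAL;
            }
            ctx->hstate = h;        /* only overwrite on success */
            return 0;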

    kernel: hugetlbfs: Unsupported page size 0 MB
    kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
    kernel: #PF: supervisor read access in kernel mode
    kernel: #PF: error_code(0x0000) - not-present page
    kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
    kernel: Oops: 0000 [#1] PREEMPT SMP PTI
    kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
    kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
    kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
    kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
    kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
    kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
    kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
    kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
    kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
    kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
    kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
    kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
    kernel: Call Trace:
    kernel:
    kernel: ? __die_body+0x1a/0x60
    kernel: ? page_fault_oops+0x16f/0x4a0
    kernel: ? search_bpf_extables+0x65/0x70
    kernel: ? fixup_exception+0x22/0x310
    kernel: ? exc_page_fault+0x69/0x150
    kernel: ? asm_exc_page_fault+0x22/0x30
    kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
    kernel: ? hugetlbfs_fill_super+0xb4/0x1a0
    kernel: ? hugetlbfs_fill_super+0x28/0x1a0
    kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10
    kernel: vfs_get_super+0x40/0xa0
    kernel: ? __pfx_bpf_lsm_capable+0x10/0x10
    kernel: vfs_get_tree+0x25/0xd0
    kernel: vfs_cmd_create+0x64/0xe0
    kernel: __x64_sys_fsconfig+0x395/0x410
    kernel: do_syscall_64+0x80/0x160
    kernel: ? syscall_exit_to_user_mode+0x82/0x240
    kernel: ? do_syscall_64+0x8d/0x160
    kernel: ? syscall_exit_to_user_mode+0x82/0x240
    kernel: ? do_syscall_64+0x8d/0x160
    kernel: ? exc_page_fault+0x69/0x150
    kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
    kernel: RIP: 0033:0x7ffbc0cb87c9
    kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
    kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
    kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
    kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
    kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
    kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
    kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
    kernel:
    kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
    kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
    kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
    kernel: CR2: 0000000000000028
    kernel: ---[ end trace 0000000000000000 ]---
    kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
    kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
    kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
    kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
    kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
    kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
    kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
    kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
    kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
    kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

    Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
    Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
    Signed-off-by: Michal Hocko
    Signed-off-by: Oscar Salvador
    Acked-by: Muchun Song
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Oscar Salvador
     

29 Nov, 2023

1 commit

  • commit 8db0ec791f7788cd21e7f91ee5ff42c1c458d0e7 upstream.

    When dealing with hugetlb pages, struct page is not guaranteed to be
    contiguous on SPARSEMEM without VMEMMAP. Use nth_page() to handle it
    properly.
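
    For illustration only (not part of the patch), assuming a head page and a
    subpage index idx:

    /* 'head + idx' assumes the struct page array is virtually contiguous,
     * which SPARSEMEM without VMEMMAP does not guarantee across memory
     * section boundaries; nth_page() resolves the subpage via its pfn. */
    struct page *subpage = nth_page(head, idx);    /* correct everywhere */
    /* struct page *subpage = head + idx; */       /* may be wrong here  */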

    Without the fix, the wrong subpage might be checked for HWPoison, causing
    the wrong number of bytes of a page to be copied to user space. No bug has
    been reported. The fix
    comes from code inspection.

    Link: https://lkml.kernel.org/r/20230913201248.452081-5-zi.yan@sent.com
    Fixes: 38c1ddbde6c6 ("hugetlbfs: improve read HWPOISON hugepage")
    Signed-off-by: Zi Yan
    Reviewed-by: Muchun Song
    Cc: David Hildenbrand
    Cc: Matthew Wilcox (Oracle)
    Cc: Mike Kravetz
    Cc: Mike Rapoport (IBM)
    Cc: Thomas Bogendoerfer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Zi Yan
     

30 Aug, 2023

1 commit

  • Pull MM updates from Andrew Morton:

    - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
    add_to_avail_list")

    - Peter Xu has a series ("mm/gup: Unify hugetlb, speed up thp") which
    reduces the special-case code for handling hugetlb pages in GUP. It
    also speeds up GUP handling of transparent hugepages.

    - Peng Zhang provides some maple tree speedups ("Optimize the fast path
    of mas_store()").

    - Sergey Senozhatsky has improved the performance of zsmalloc during
    compaction ("zsmalloc: small compaction improvements").

    - Domenico Cerasuolo has developed additional selftest code for zswap
    ("selftests: cgroup: add zswap test program").

    - xu xin has done some work on KSM's handling of zero pages. These
    changes are mainly to enable the user to better understand the
    effectiveness of KSM's treatment of zero pages ("ksm: support
    tracking KSM-placed zero-pages").

    - Jeff Xu has fixed the behaviour of memfd's
    MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
    MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

    - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
    Stop read optimisation when folio removed from pagecache").

    - Axel Rasmussen has given userfaultfd the ability to simulate memory
    poisoning ("add UFFDIO_POISON to simulate memory poisoning with
    UFFD").

    - Miaohe Lin has contributed some routine maintenance work on the
    memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
    check").

    - Peng Zhang has contributed some maintenance work on the maple tree
    code ("Improve the validation for maple tree and some cleanup").

    - Hugh Dickins has optimized the collapsing of shmem or file pages into
    THPs ("mm: free retracted page table by RCU").

    - Jiaqi Yan has a patch series which permits us to use the healthy
    subpages within a hardware poisoned huge page for general purposes
    ("Improve hugetlbfs read on HWPOISON hugepages").

    - Kemeng Shi has done some maintenance work on the pagetable-check code
    ("Remove unused parameters in page_table_check").

    - More folioification work from Matthew Wilcox ("More filesystem folio
    conversions for 6.6"), ("Followup folio conversions for zswap"). And
    from ZhangPeng ("Convert several functions in page_io.c to use a
    folio").

    - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

    - Baoquan He has converted some architectures to use the
    GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
    architectures to take GENERIC_IOREMAP way").

    - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
    batched/deferred tlb shootdown during page reclamation/migration").

    - Better maple tree lockdep checking from Liam Howlett ("More strict
    maple tree lockdep"). Liam also developed some efficiency
    improvements ("Reduce preallocations for maple tree").

    - Cleanup and optimization to the secondary IOMMU TLB invalidation,
    from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
    upgrade").

    - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
    for arm64").

    - Kemeng Shi provides some maintenance work on the compaction code
    ("Two minor cleanups for compaction").

    - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
    most file-backed faults under the VMA lock").

    - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
    on ppc64, under some circumstances ("Add support for DAX vmemmap
    optimization for ppc64").

    - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
    data in page_ext"), ("minor cleanups to page_ext header").

    - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
    cleanups").

    - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

    - VMA handling cleanups from Kefeng Wang ("mm: convert to
    vma_is_initial_heap/stack()").

    - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
    implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
    address ranges and DAMON monitoring targets").

    - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

    - Liam Howlett has improved the maple tree node replacement code
    ("maple_tree: Change replacement strategy").

    - ZhangPeng has a general code cleanup - use the K() macro more widely
    ("cleanup with helper macro K()").

    - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
    memmap on memory feature on ppc64").

    - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
    in page_alloc"), ("Two minor cleanups for get pageblock
    migratetype").

    - Vishal Moola introduces a memory descriptor for page table tracking,
    "struct ptdesc" ("Split ptdesc from struct page").

    - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
    for vm.memfd_noexec").

    - MM include file rationalization from Hugh Dickins ("arch: include
    asm/cacheflush.h in asm/hugetlb.h").

    - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
    output").

    - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
    object_cache instead of kmemleak_initialized").

    - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
    and _folio_order").

    - A VMA locking scalability improvement from Suren Baghdasaryan
    ("Per-VMA lock support for swap and userfaults").

    - pagetable handling cleanups from Matthew Wilcox ("New page table
    range API").

    - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
    using page->private on tail pages for THP_SWAP + cleanups").

    - Cleanups and speedups to the hugetlb fault handling from Matthew
    Wilcox ("Change calling convention for ->huge_fault").

    - Matthew Wilcox has also done some maintenance work on the MM
    subsystem documentation ("Improve mm documentation").

    * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
    maple_tree: shrink struct maple_tree
    maple_tree: clean up mas_wr_append()
    secretmem: convert page_is_secretmem() to folio_is_secretmem()
    nios2: fix flush_dcache_page() for usage from irq context
    hugetlb: add documentation for vma_kernel_pagesize()
    mm: add orphaned kernel-doc to the rst files.
    mm: fix clean_record_shared_mapping_range kernel-doc
    mm: fix get_mctgt_type() kernel-doc
    mm: fix kernel-doc warning from tlb_flush_rmaps()
    mm: remove enum page_entry_size
    mm: allow ->huge_fault() to be called without the mmap_lock held
    mm: move PMD_ORDER to pgtable.h
    mm: remove checks for pte_index
    memcg: remove duplication detection for mem_cgroup_uncharge_swap
    mm/huge_memory: work on folio->swap instead of page->private when splitting folio
    mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
    mm/swap: use dedicated entry for swap in folio
    mm/swap: stop using page->private on tail pages for THP_SWAP
    selftests/mm: fix WARNING comparing pointer to 0
    selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
    ...

    Linus Torvalds
     

19 Aug, 2023

1 commit

  • When a hugepage contains HWPOISON pages, read() fails to read any byte of
    the hugepage and returns -EIO, although many bytes in the HWPOISON
    hugepage are readable.

    Improve this by allowing hugetlbfs_read_iter to return as many bytes as
    possible. For a requested range [offset, offset + len) that contains a
    HWPOISON page, return [offset, first HWPOISON page addr); the next read
    attempt will fail and return -EIO.
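
    A hedged userspace-style sketch of that intended behaviour, where
    subpage_is_hwpoison() and copy_one_subpage() are hypothetical helpers
    standing in for the real kernel logic:

    ssize_t read_until_poison(char *dst, loff_t offset, size_t len)
    {
            size_t done = 0;

            while (done < len) {
                    if (subpage_is_hwpoison(offset + done))
                            /* report partial progress, or -EIO if none */
                            return done ? (ssize_t)done : -EIO;
                    done += copy_one_subpage(dst + done, offset + done);
            }
            return done;
    }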

    Link: https://lkml.kernel.org/r/20230713001833.3778937-4-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: James Houghton
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc: Yang Shi
    Signed-off-by: Andrew Morton

    Jiaqi Yan
     

24 Jul, 2023

1 commit

  • In later patches, we're going to change how the inode's ctime field is
    used. Switch to using accessor functions instead of raw accesses of
    inode->i_ctime.

    Signed-off-by: Jeff Layton
    Acked-by: Mike Kravetz
    Reviewed-by: Jan Kara
    Message-Id:
    Signed-off-by: Christian Brauner

    Jeff Layton
     

24 Jun, 2023

1 commit

  • Ackerley Tng reported an issue with hugetlbfs fallocate as noted in the
    Closes tag. The issue showed up after the conversion of hugetlb page
    cache lookup code to use page_cache_next_miss. User visible effects are:

    - hugetlbfs fallocate incorrectly returns -EEXIST if pages are present
    in the file.
    - hugetlb pages will not be included in core dumps if they need to be
    brought in via GUP.
    - userfaultfd UFFDIO_COPY will not notice pages already present in the
    cache. It may try to allocate a new page and potentially return
    ENOMEM as opposed to EEXIST.

    Revert the use of page_cache_next_miss() in hugetlb code.

    IMPORTANT NOTE FOR STABLE BACKPORTS:
    This patch will apply cleanly to v6.3. However, due to the change of
    filemap_get_folio() return values, it will not function correctly. This
    patch must be modified for stable backports.

    [dan.carpenter@linaro.org: fix hugetlbfs_pagecache_present()]
    Link: https://lkml.kernel.org/r/efa86091-6a2c-4064-8f55-9b44e1313015@moroto.mountain
    Link: https://lkml.kernel.org/r/20230621212403.174710-2-mike.kravetz@oracle.com
    Fixes: d0ce0e47b323 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()")
    Signed-off-by: Mike Kravetz
    Signed-off-by: Dan Carpenter
    Reported-by: Ackerley Tng
    Closes: https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com
    Reviewed-by: Sidhartha Kumar
    Cc: Erdem Aktas
    Cc: Greg Kroah-Hartman
    Cc: Matthew Wilcox
    Cc: Muchun Song
    Cc: Vishal Annapurve
    Signed-off-by: Andrew Morton

    Mike Kravetz
     

10 Jun, 2023

1 commit

  • Calling hugetlb_set_vma_policy() later avoids setting the vma policy
    and then dropping it on a page cache hit.

    Link: https://lkml.kernel.org/r/20230502235622.3652586-1-ackerleytng@google.com
    Signed-off-by: Ackerley Tng
    Reviewed-by: Mike Kravetz
    Cc: Erdem Aktas
    Cc: John Hubbard
    Cc: Matthew Wilcox (Oracle)
    Cc: Muchun Song
    Cc: Sidhartha Kumar
    Cc: Vishal Annapurve
    Signed-off-by: Andrew Morton

    Ackerley Tng
     

22 Apr, 2023

1 commit

  • Instead of having callers care about the mmap_min_addr logic for the
    lowest valid mapping address (and some of them getting it wrong), just
    move the logic into vm_unmapped_area() itself. One less thing for various
    architecture cases (and generic helpers) to worry about.

    We should really try to make much more of this be common code, but baby
    steps..

    Without this, vm_unmapped_area() could return an address below
    mmap_min_addr (because some caller forgot about that). That then causes
    the mmap machinery to think it has found a workable address, but then
    later security_mmap_addr(addr) is unhappy about it and the mmap() returns
    with a nonsensical error (EPERM).

    The proper action is to either return ENOMEM (if the virtual address space
    is exhausted), or try to find another address (ie do a bottom-up search
    for free addresses after the top-down one failed).

    See commit 2afc745f3e30 ("mm: ensure get_unmapped_area() returns higher
    address than mmap_min_addr"), which fixed this for one call site (the
    generic arch_get_unmapped_area_topdown() fallback) but left other cases
    alone.
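
    A hedged sketch of the idea, not the exact diff: clamp the lower limit
    inside vm_unmapped_area() itself so no caller can get back an address
    below mmap_min_addr.

    unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
    {
            /* Never hand out addresses below the minimum mappable address. */
            if (info->low_limit < mmap_min_addr)
                    info->low_limit = mmap_min_addr;

            if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
                    return unmapped_area_topdown(info);
            return unmapped_area(info);
    }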

    Link: https://lkml.kernel.org/r/20230418214009.1142926-1-Liam.Howlett@oracle.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Liam R. Howlett
    Cc: Russell King
    Cc: Liam Howlett
    Signed-off-by: Andrew Morton

    Linus Torvalds
     

06 Apr, 2023

1 commit

  • Instead of returning NULL for all errors, distinguish between:

    - no entry found and not asked to allocate (-ENOENT)
    - failed to allocate memory (-ENOMEM)
    - would block (-EAGAIN)

    so that callers don't have to guess the error based on the passed in
    flags.

    Also pass the error through the direct callers: filemap_get_folio,
    filemap_lock_folio, filemap_grab_folio and filemap_get_incore_folio.
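
    A hedged caller-side sketch of the new convention (the wrapper function
    is illustrative, not from the patch):

    static struct folio *grab_cached_folio(struct address_space *mapping,
                                           pgoff_t index)
    {
            struct folio *folio = filemap_lock_folio(mapping, index);

            if (IS_ERR(folio)) {
                    if (PTR_ERR(folio) == -ENOENT)
                            return NULL;    /* nothing cached at this index */
                    return folio;           /* propagate -ENOMEM / -EAGAIN */
            }
            return folio;                   /* success: locked folio */
    }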

    [hch@lst.de: fix null-pointer deref]
    Link: https://lkml.kernel.org/r/20230310070023.GA13563@lst.de
    Link: https://lkml.kernel.org/r/20230310043137.GA1624890@u2004
    Link: https://lkml.kernel.org/r/20230307143410.28031-8-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Acked-by: Ryusuke Konishi [nilfs2]
    Cc: Andreas Gruenbacher
    Cc: Hugh Dickins
    Cc: Matthew Wilcox (Oracle)
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton

    Christoph Hellwig
     

24 Feb, 2023

1 commit

  • Pull MM updates from Andrew Morton:

    - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
    F_SEAL_EXEC") which permits the setting of the memfd execute bit at
    memfd creation time, with the option of sealing the state of the X
    bit.

    - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
    thread-safe for pmd unshare") which addresses a rare race condition
    related to PMD unsharing.

    - Several folioification patch series from Matthew Wilcox, Vishal
    Moola, Sidhartha Kumar and Lorenzo Stoakes

    - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
    which does perform some memcg maintenance and cleanup work.

    - SeongJae Park has added DAMOS filtering to DAMON, with the series
    "mm/damon/core: implement damos filter".

    These filters provide users with finer-grained control over DAMOS's
    actions. SeongJae has also done some DAMON cleanup work.

    - Kairui Song adds a series ("Clean up and fixes for swap").

    - Vernon Yang contributed the series "Clean up and refinement for maple
    tree".

    - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
    adds to MGLRU an LRU of memcgs, to improve the scalability of global
    reclaim.

    - David Hildenbrand has added some userfaultfd cleanup work in the
    series "mm: uffd-wp + change_protection() cleanups".

    - Christoph Hellwig has removed the generic_writepages() library
    function in the series "remove generic_writepages".

    - Baolin Wang has performed some maintenance on the compaction code in
    his series "Some small improvements for compaction".

    - Sidhartha Kumar is doing some maintenance work on struct page in his
    series "Get rid of tail page fields".

    - David Hildenbrand contributed some cleanup, bugfixing and
    generalization of pte management and of pte debugging in his series
    "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
    swap PTEs".

    - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
    flag in the series "Discard __GFP_ATOMIC".

    - Sergey Senozhatsky has improved zsmalloc's memory utilization with
    his series "zsmalloc: make zspage chain size configurable".

    - Joey Gouly has added prctl() support for prohibiting the creation of
    writeable+executable mappings.

    The previous BPF-based approach had shortcomings. See "mm: In-kernel
    support for memory-deny-write-execute (MDWE)".

    - Waiman Long did some kmemleak cleanup and bugfixing in the series
    "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

    - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
    "mm: multi-gen LRU: improve".

    - Jiaqi Yan has provided some enhancements to our memory error
    statistics reporting, mainly by presenting the statistics on a
    per-node basis. See the series "Introduce per NUMA node memory error
    statistics".

    - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
    regression in compaction via his series "Fix excessive CPU usage
    during compaction".

    - Christoph Hellwig does some vmalloc maintenance work in the series
    "cleanup vfree and vunmap".

    - Christoph Hellwig has removed block_device_operations.rw_page() in
    this series "remove ->rw_page".

    - We get some maple_tree improvements and cleanups in Liam Howlett's
    series "VMA tree type safety and remove __vma_adjust()".

    - Suren Baghdasaryan has done some work on the maintainability of our
    vm_flags handling in the series "introduce vm_flags modifier
    functions".

    - Some pagemap cleanup and generalization work in Mike Rapoport's
    series "mm, arch: add generic implementation of pfn_valid() for
    FLATMEM" and "fixups for generic implementation of pfn_valid()"

    - Baoquan He has done some work to make /proc/vmallocinfo and
    /proc/kcore better represent the real state of things in his series
    "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

    - Jason Gunthorpe rationalized the GUP system's interface to the rest
    of the kernel in the series "Simplify the external interface for
    GUP".

    - SeongJae Park wishes to migrate people from DAMON's debugfs interface
    over to its sysfs interface. To support this, we'll temporarily be
    printing warnings when people use the debugfs interface. See the
    series "mm/damon: deprecate DAMON debugfs interface".

    - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
    and clean-ups" series.

    - Huang Ying has provided a dramatic reduction in migration's TLB flush
    IPI rates with the series "migrate_pages(): batch TLB flushing".

    - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

    * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
    include/linux/migrate.h: remove unneeded externs
    mm/memory_hotplug: cleanup return value handing in do_migrate_range()
    mm/uffd: fix comment in handling pte markers
    mm: change to return bool for isolate_movable_page()
    mm: hugetlb: change to return bool for isolate_hugetlb()
    mm: change to return bool for isolate_lru_page()
    mm: change to return bool for folio_isolate_lru()
    objtool: add UACCESS exceptions for __tsan_volatile_read/write
    kmsan: disable ftrace in kmsan core code
    kasan: mark addr_has_metadata __always_inline
    mm: memcontrol: rename memcg_kmem_enabled()
    sh: initialize max_mapnr
    m68k/nommu: add missing definition of ARCH_PFN_OFFSET
    mm: percpu: fix incorrect size in pcpu_obj_full_size()
    maple_tree: reduce stack usage with gcc-9 and earlier
    mm: page_alloc: call panic() when memoryless node allocation fails
    mm: multi-gen LRU: avoid futile retries
    migrate_pages: move THP/hugetlb migration support check to simplify code
    migrate_pages: batch flushing TLB
    migrate_pages: share more code between _unmap and _move
    ...

    Linus Torvalds
     

14 Feb, 2023

3 commits

  • Every caller of hugetlb_add_to_page_cache() is now passing in
    &folio->page, change the function to take in a folio directly and clean up
    the call sites.

    Link: https://lkml.kernel.org/r/20230125170537.96973-7-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar
    Cc: Gerald Schaefer
    Cc: John Hubbard
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Muchun Song
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     
  • Every caller of restore_reserve_on_error() is now passing in &folio->page,
    change the function to take in a folio directly and clean up the call
    sites.

    Link: https://lkml.kernel.org/r/20230125170537.96973-6-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar
    Cc: Gerald Schaefer
    Cc: John Hubbard
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Muchun Song
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     
  • Change alloc_huge_page() to alloc_hugetlb_folio() by changing all callers
    to handle the now folio return type of the function. In this conversion,
    alloc_huge_page_vma() is also changed to alloc_hugetlb_folio_vma() and
    hugepage_add_new_anon_rmap() is changed to take in a folio directly. Many
    additions of '&folio->page' are cleaned up in subsequent patches.

    hugetlbfs_fallocate() is also refactored to use the RCU +
    page_cache_next_miss() API.

    Link: https://lkml.kernel.org/r/20230125170537.96973-5-sidhartha.kumar@oracle.com
    Suggested-by: Mike Kravetz
    Reported-by: kernel test robot
    Signed-off-by: Sidhartha Kumar
    Cc: Gerald Schaefer
    Cc: John Hubbard
    Cc: Matthew Wilcox
    Cc: Muchun Song
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     

10 Feb, 2023

1 commit

  • Replace direct modifications to vma->vm_flags with calls to modifier
    functions to be able to track flag changes and to keep vma locking
    correctness.

    [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
    Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Mike Rapoport (IBM)
    Acked-by: Sebastian Reichel
    Reviewed-by: Liam R. Howlett
    Reviewed-by: Hyeonggon Yoo
    Cc: Andy Lutomirski
    Cc: Arjun Roy
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: David Howells
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Eric Dumazet
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Jann Horn
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: Kent Overstreet
    Cc: Laurent Dufour
    Cc: Lorenzo Stoakes
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Paul E. McKenney
    Cc: Peter Oskolkov
    Cc: Peter Xu
    Cc: Peter Zijlstra
    Cc: Punit Agrawal
    Cc: Sebastian Andrzej Siewior
    Cc: Shakeel Butt
    Cc: Soheil Hassas Yeganeh
    Cc: Song Liu
    Cc: Vlastimil Babka
    Cc: Will Deacon
    Signed-off-by: Andrew Morton

    Suren Baghdasaryan
     

19 Jan, 2023

9 commits

  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.
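
    As an illustrative sketch of the shape of such a conversion
    (hugetlbfs_setattr() is used here as the example; the per-filesystem
    details differ):

    /* before: a namespace argument that is easy to conflate */
    static int hugetlbfs_setattr(struct user_namespace *mnt_userns,
                                 struct dentry *dentry, struct iattr *attr);

    /* after: a dedicated mount idmap type */
    static int hugetlbfs_setattr(struct mnt_idmap *idmap,
                                 struct dentry *dentry, struct iattr *attr);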

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner (Microsoft)

    Christian Brauner
     
  • huge_pte_offset() is the main walker function for hugetlb pgtables. The
    name is not really representing what it does, though.

    Instead of renaming it, introduce a wrapper function called hugetlb_walk()
    which will use huge_pte_offset() inside. Assert on the locks when walking
    the pgtable.
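
    A simplified sketch of the wrapper's shape (the real helper also accepts
    i_mmap_rwsem being held as an alternative to the vma lock, per the notes
    below):

    static inline pte_t *hugetlb_walk(struct vm_area_struct *vma,
                                      unsigned long addr, unsigned long sz)
    {
            /* walking is only safe while pmd unsharing is excluded */
            hugetlb_vma_assert_locked(vma);
            return huge_pte_offset(vma->vm_mm, addr, sz);
    }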

    Note, the vma lock assertion will be a no-op for private mappings.

    Document the last special case in the page_vma_mapped_walk() path where we
    don't need any more lock to call hugetlb_walk().

    Taking vma lock there is not needed because either: (1) potential callers
    of hugetlb pvmw holds i_mmap_rwsem already (from one rmap_walk()), or (2)
    the caller will not walk a hugetlb vma at all, so the hugetlb code path is
    not reachable (e.g. in ksm or uprobe paths).

    That lock requirement is slightly implicit for future page_vma_mapped_walk()
    callers. But anyway, if one day this rule is broken, lockdep will give a
    straightforward warning in hugetlb_walk(), and then there will be a way
    out.

    [akpm@linux-foundation.org: coding-style cleanups]
    Link: https://lkml.kernel.org/r/20221216155229.2043750-1-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Mike Kravetz
    Reviewed-by: John Hubbard
    Reviewed-by: David Hildenbrand
    Cc: Andrea Arcangeli
    Cc: James Houghton
    Cc: Jann Horn
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc: Nadav Amit
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton

    Peter Xu
     
  • Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
    unshare", v4.

    Problem
    =======

    huge_pte_offset() is a major helper used by hugetlb code paths to walk a
    hugetlb pgtable. It's used mostly everywhere since that's needed even
    before taking the pgtable lock.

    huge_pte_offset() is always called with mmap lock held with either read or
    write. It was assumed to be safe, but it's actually not. One race
    condition can easily be triggered by: (1) first trigger pmd sharing on a
    memory range, (2) do huge_pte_offset() on the range, then in the meantime,
    (3) another thread unshares the pmd range, and the pgtable page is prone
    to being lost if the other sharing process wants to free it completely (by
    either munmap or exit mm).

    The recent work from Mike on vma lock can resolve most of this already.
    It's achieved by forbidding pmd unsharing while the lock is taken, so there
    is no further risk of the pgtable page being freed. It means that if we can
    take the vma lock around all huge_pte_offset() callers, it'll be safe.

    There are already a bunch of them that we did as per the latest
    mm-unstable, but also quite a few others that we didn't, for various
    reasons, especially around huge_pte_offset() usage.

    One more thing to mention is that besides the vma lock, i_mmap_rwsem can
    also be used to protect the pgtable page (along with its pgtable lock) from
    being freed from under us. IOW, huge_pte_offset() callers need to either
    hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

    A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
    very hard to trigger, one needs to apply another kernel delay patch too,
    see below):

    ======8<========================
    #define _GNU_SOURCE      /* memfd_create(), MADV_POPULATE_WRITE */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <assert.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <linux/memfd.h>

    #define MSIZE (1UL << 30) /* 1GB */
    #define PSIZE (2UL << 20) /* 2MB */

    #define HOLD_SEC (1)

    int pipefd[2];
    void *buf;

    void *do_map(int fd)
    {
        unsigned char *tmpbuf, *p;
        int ret;

        ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
        if (ret) {
            perror("posix_memalign() failed");
            return NULL;
        }

        tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_FIXED, fd, 0);
        if (tmpbuf == MAP_FAILED) {
            perror("mmap() failed");
            return NULL;
        }
        printf("mmap() -> %p\n", tmpbuf);

        for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
            *p = 1;
        }

        return tmpbuf;
    }

    void do_unmap(void *buf)
    {
        munmap(buf, MSIZE);
    }

    void proc2(int fd)
    {
        unsigned char c;

        buf = do_map(fd);
        if (!buf)
            return;

        read(pipefd[0], &c, 1);
        /*
         * This frees the shared pgtable page, causing use-after-free in
         * proc1_thread1 when soft walking hugetlb pgtable.
         */
        do_unmap(buf);

        printf("Proc2 quitting\n");
    }

    void *proc1_thread1(void *data)
    {
        /*
         * Trigger follow-page on 1st 2m page. Kernel hack patch needed to
         * withhold this procedure for easier reproduce.
         */
        madvise(buf, PSIZE, MADV_POPULATE_WRITE);
        printf("Proc1-thread1 quitting\n");
        return NULL;
    }

    void *proc1_thread2(void *data)
    {
        unsigned char c;

        /* Wait a while until proc1_thread1() start to wait */
        sleep(0.5);
        /* Trigger pmd unshare */
        madvise(buf, PSIZE, MADV_DONTNEED);
        /* Kick off proc2 to release the pgtable */
        write(pipefd[1], &c, 1);

        printf("Proc1-thread2 quitting\n");
        return NULL;
    }

    void proc1(int fd)
    {
        pthread_t tid1, tid2;
        int ret;

        buf = do_map(fd);
        if (!buf)
            return;

        ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
        assert(ret == 0);
        ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
        assert(ret == 0);

        /* Kick the child to share the PUD entry */
        pthread_join(tid1, NULL);
        pthread_join(tid2, NULL);

        do_unmap(buf);
    }

    int main(void)
    {
        int fd, ret;

        fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
        if (fd < 0) {
            perror("open failed");
            return -1;
        }

        ret = ftruncate(fd, MSIZE);
        if (ret) {
            perror("ftruncate() failed");
            return -1;
        }

        ret = pipe(pipefd);
        if (ret) {
            perror("pipe() failed");
            return -1;
        }

        if (fork()) {
            proc1(fd);
        } else {
            proc2(fd);
        }

        close(pipefd[0]);
        close(pipefd[1]);
        close(fd);

        return 0;
    }
    ======8<========================

    Kernel delay patch (excerpt) used to widen the race window to roughly
    one second right before the walker takes the pte lock:

    : +        for (int c = 0; c < 100; c++) {
    : +                udelay(10000);
    : +        }
    : +        pr_info("%s: withhold 1 sec...done\n", __func__);
    : +
    :          if (pte)
    :                  ptl = huge_pte_lock(h, mm, pte);
    :          absent = !pte || huge_pte_none(huge_ptep_get(pte));
    : ======8<=======

    Even though vma_offset_start() is named like that, it's not returning
    "the start address of the range" but rather the offset we should use to
    offset the vma->vm_start address.

    Make it return the real value of the start vaddr, and it also helps for
    all the callers because whenever the retval is used, it'll be ultimately
    added into the vma->vm_start anyway, so it's better.

    Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Mike Kravetz
    Reviewed-by: David Hildenbrand
    Reviewed-by: John Hubbard
    Cc: Andrea Arcangeli
    Cc: James Houghton
    Cc: Jann Horn
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc: Nadav Amit
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton

    Peter Xu
     

01 Dec, 2022

2 commits


09 Nov, 2022

4 commits

  • Syzkaller reports a null-ptr-deref bug as follows:
    ======================================================
    KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
    RIP: 0010:hugetlbfs_parse_param+0x1dd/0x8e0 fs/hugetlbfs/inode.c:1380
    [...]
    Call Trace:

    vfs_parse_fs_param fs/fs_context.c:148 [inline]
    vfs_parse_fs_param+0x1f9/0x3c0 fs/fs_context.c:129
    vfs_parse_fs_string+0xdb/0x170 fs/fs_context.c:191
    generic_parse_monolithic+0x16f/0x1f0 fs/fs_context.c:231
    do_new_mount fs/namespace.c:3036 [inline]
    path_mount+0x12de/0x1e20 fs/namespace.c:3370
    do_mount fs/namespace.c:3383 [inline]
    __do_sys_mount fs/namespace.c:3591 [inline]
    __se_sys_mount fs/namespace.c:3568 [inline]
    __x64_sys_mount+0x27f/0x300 fs/namespace.c:3568
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd
    [...]

    ======================================================

    According to commit "vfs: parse: deal with zero length string value", the
    kernel will set param->string to a null pointer in vfs_parse_fs_string()
    if the fs string has zero length.

    Yet the problem is that hugetlbfs_parse_param() will dereference
    param->string without checking whether it is a null pointer. To be more
    specific, if hugetlbfs_parse_param() parses an illegal mount parameter,
    such as "size=,", the kernel will construct a struct fs_parameter with a
    null pointer in vfs_parse_fs_string(), then pass this struct fs_parameter
    to hugetlbfs_parse_param(), which triggers the above null-ptr-deref bug.

    This patch solves it by adding a sanity check on param->string
    in hugetlbfs_parse_param().
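
    A hedged sketch of the kind of check added, shown for one option (not the
    verbatim patch):

    case Opt_size:
            /* a zero-length "size=" value reaches here as a NULL string */
            if (!param->string)
                    goto bad_val;
            /* memparse() will accept a K/M/G without a digit */
            if (!isdigit(param->string[0]))
                    goto bad_val;
            /* ... then parse the size as before ... */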

    Link: https://lkml.kernel.org/r/20221020231609.4810-1-yin31149@gmail.com
    Reported-by: syzbot+a3e6acd85ded5c16a709@syzkaller.appspotmail.com
    Tested-by: syzbot+a3e6acd85ded5c16a709@syzkaller.appspotmail.com
    Link: https://lore.kernel.org/all/0000000000005ad00405eb7148c6@google.com/
    Signed-off-by: Hawkins Jiawei
    Reviewed-by: Mike Kravetz
    Cc: Hawkins Jiawei
    Cc: Muchun Song
    Cc: Ian Kent
    Signed-off-by: Andrew Morton

    Hawkins Jiawei
     
  • Remove the last caller of delete_from_page_cache() by converting the code
    to its folio equivalent.

    Link: https://lkml.kernel.org/r/20220922154207.1575343-5-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar
    Reviewed-by: Mike Kravetz
    Cc: Arnd Bergmann
    Cc: Colin Cross
    Cc: David Howells
    Cc: "Eric W . Biederman"
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Muchun Song
    Cc: Peter Xu
    Cc: Vlastimil Babka
    Cc: William Kucharski
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     
  • Allow hugetlbfs_migrate_folio to check and read subpool information by
    passing in a folio.

    Link: https://lkml.kernel.org/r/20220922154207.1575343-4-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar
    Reviewed-by: Mike Kravetz
    Cc: Arnd Bergmann
    Cc: Colin Cross
    Cc: David Howells
    Cc: "Eric W . Biederman"
    Cc: Hugh Dickins
    Cc: kernel test robot
    Cc: Matthew Wilcox
    Cc: Muchun Song
    Cc: Peter Xu
    Cc: Vlastimil Babka
    Cc: William Kucharski
    Signed-off-by: Andrew Morton

    Sidhartha Kumar
     
  • This change is very similar to the change that was made for shmem [1], and
    it solves the same problem but for HugeTLBFS instead.

    Currently, when poison is found in a HugeTLB page, the page is removed
    from the page cache. That means that attempting to map or read that
    hugepage in the future will result in a new hugepage being allocated
    instead of notifying the user that the page was poisoned. As [1] states,
    this is effectively memory corruption.

    The fix is to leave the page in the page cache. If the user attempts to
    use a poisoned HugeTLB page with a syscall, the syscall will fail with
    EIO, the same error code that shmem uses. For attempts to map the page,
    the thread will get a BUS_MCEERR_AR SIGBUS.

    [1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens")

    Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
    Signed-off-by: James Houghton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Reviewed-by: Yang Shi
    Cc: Axel Rasmussen
    Cc: James Houghton
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc:
    Signed-off-by: Andrew Morton

    James Houghton
     

11 Oct, 2022

1 commit

  • Pull vfs tmpfile updates from Al Viro:
    "Miklos' ->tmpfile() signature change; pass an unopened struct file to
    it, let it open the damn thing. Allows to add tmpfile support to FUSE"

    * tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fuse: implement ->tmpfile()
    vfs: open inside ->tmpfile()
    vfs: move open right after ->tmpfile()
    vfs: make vfs_tmpfile() static
    ovl: use vfs_tmpfile_open() helper
    cachefiles: use vfs_tmpfile_open() helper
    cachefiles: only pass inode to *mark_inode_inuse() helpers
    cachefiles: tmpfile error handling cleanup
    hugetlbfs: cleanup mknod and tmpfile
    vfs: add vfs_tmpfile_open() helper

    Linus Torvalds
     

04 Oct, 2022

7 commits

  • With the new hugetlb vma lock in place, it can also be used to handle page
    fault races with file truncation. The lock is taken at the beginning of
    the page fault path in read mode. During truncation, it is taken in write
    mode for each vma which has the file mapped. The file's size (i_size) is
    modified before taking the vma lock to unmap.

    How are races handled?

    The page fault code checks i_size early in processing after taking the vma
    lock. If the fault is beyond i_size, the fault is aborted. If the fault
    is not beyond i_size the fault will continue and a new page will be added
    to the file. It could be that truncation code modifies i_size after the
    check in fault code. That is OK, as truncation code will soon remove the
    page. The truncation code will wait until the fault is finished, as it
    must obtain the vma lock in write mode.
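
    A hedged fragment of the fault-side ordering described above (identifiers
    such as idx, mapping and h follow the hugetlb fault path, but this is not
    the verbatim kernel code):

    hugetlb_vma_lock_read(vma);             /* blocks truncation's unmap */
    size = i_size_read(mapping->host) >> huge_page_shift(h);
    if (idx >= size) {                      /* fault beyond i_size: abort */
            hugetlb_vma_unlock_read(vma);
            return VM_FAULT_SIGBUS;
    }
    /* Otherwise install the new page.  If truncation shrinks i_size after
     * the check, it will take the vma lock in write mode and remove it. */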

    This patch cleans up/removes late checks in the fault paths that try to
    back out pages racing with truncation. As noted above, we just let the
    truncation code remove the pages.

    [mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
    Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
    Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • The new hugetlb vma lock is used to address this race:

    Faulting thread                           Unsharing thread
    ...                                       ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                              i_mmap_lock_write
                                              lock page table
    ptep invalid
    [remainder of the race diagram truncated]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • Create the new routine hugetlb_unmap_file_folio that will unmap a single
    file folio. This is refactored code from hugetlb_vmdelete_list. It is
    modified to do locking within the routine itself and check whether the
    page is mapped within a specific vma before unmapping.

    This refactoring will be put to use and expanded upon in a subsequent
    patch adding vma specific locking.

    Link: https://lkml.kernel.org/r/20220914221810.95771-8-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • Create the new routine remove_inode_single_folio that will remove a single
    folio from a file. This is refactored code from remove_inode_hugepages.
    It checks for the uncommon case in which the folio is still mapped and
    unmaps.

    No functional change. This refactoring will be put to use and expanded
    upon in subsequent patches.

    Link: https://lkml.kernel.org/r/20220914221810.95771-5-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • remove_huge_page removes a hugetlb page from the page cache. Change to
    hugetlb_delete_from_page_cache as it is a more descriptive name.
    huge_add_to_page_cache is global in scope, but only deals with hugetlb
    pages. For consistency and clarity, rename to hugetlb_add_to_page_cache.

    Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") added code to take i_mmap_rwsem in read mode for the
    duration of fault processing. However, this has been shown to cause
    performance/scaling issues. Revert the code and go back to only taking
    the semaphore in huge_pmd_share during the fault path.

    Keep the code that takes i_mmap_rwsem in write mode before calling
    try_to_unmap as this is required if huge_pmd_unshare is called.

    NOTE: Reverting this code does expose the following race condition.

    Faulting thread                           Unsharing thread
    ...                                       ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                              i_mmap_lock_write
                                              lock page table
    ptep invalid
    [remainder of the race diagram truncated]
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     
  • Patch series "hugetlb: Use new vma lock for huge pmd sharing
    synchronization", v2.

    hugetlb fault scalability regressions have recently been reported [1].
    This is not the first such report, as regressions were also noted when
    commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") was added [2] in v5.7. At that time, a proposal to
    address the regression was suggested [3] but went nowhere.

    The regression and benefit of this patch series is not evident when
    using the vm_scalability benchmark reported in [2] on a recent kernel.
    Results from running,
    "./usemem -n 48 --prealloc --prefault -O -U 3448054972"

                                     48 sample Avg
    next-20220913           next-20220913                   next-20220913
    unmodified              revert i_mmap_sema locking      vma sema locking, this series
    -------------------------------------------------------------------------------------
    498150 KB/s             501934 KB/s                     504793 KB/s

    The recent regression report [1] notes page fault and fork latency of
    shared hugetlb mappings. To measure this, I created two simple programs:
    1) map a shared hugetlb area, write fault all pages, unmap area
    Do this in a continuous loop to measure faults per second
    2) map a shared hugetlb area, write fault a few pages, fork and exit
    Do this in a continuous loop to measure forks per second
    These programs were run on a 48 CPU VM with 320GB memory. The shared
    mapping size was 250GB. For comparison, a single instance of the program
    was run. Then, multiple instances were run in parallel to introduce
    lock contention. Changing the locking scheme results in a significant
    performance benefit.

    test              instances   unmodified    revert      vma
    --------------------------------------------------------------------------
    faults per sec        1         393043      395680     389932
    faults per sec       24          71405       81191      79048
    forks per sec         1           2802        2747       2725
    forks per sec        24            439         536        500
    Combined faults      24           1621       68070      53662
    Combined forks       24            358          67        142

    Combined test is when running both faulting program and forking program
    simultaneously.

    Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5 which
    depends on c0d0381ade79. Acquisition of i_mmap_rwsem is still required in
    the fault path to establish pmd sharing, so this is moved back to
    huge_pmd_share. With c0d0381ade79 reverted, this race is exposed:

    Faulting thread                           Unsharing thread
    ...                                       ...
    ptep = huge_pte_offset()
          or
    ptep = huge_pte_alloc()
    ...
                                              i_mmap_lock_write
                                              lock page table
    ptep invalid
    [remainder of the race diagram truncated]
    Reviewed-by: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Axel Rasmussen
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: James Houghton
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Peter Xu
    Cc: Prakash Sangappa
    Cc: Sven Schnelle
    Signed-off-by: Andrew Morton

    Mike Kravetz
     

24 Sep, 2022

2 commits

  • This is in preparation for adding tmpfile support to fuse, which requires
    that the tmpfile creation and opening are done as a single operation.

    Replace the 'struct dentry *' argument of i_op->tmpfile with
    'struct file *'.

    Call finish_open_simple() as the last thing in ->tmpfile() instances (may
    be omitted in the error case).

    Change d_tmpfile() argument to 'struct file *' as well to make callers more
    readable.
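
    A hedged sketch of the resulting shape, using hugetlbfs as the
    illustrative instance (simplified; the mount-idmapping argument and
    timestamp updates are omitted):

    static int hugetlbfs_tmpfile(struct inode *dir, struct file *file,
                                 umode_t mode)
    {
            struct inode *inode;

            inode = hugetlbfs_get_inode(dir->i_sb, dir, mode | S_IFREG, 0);
            if (!inode)
                    return -ENOSPC;
            d_tmpfile(file, inode);         /* now takes the struct file */
            return finish_open_simple(file, 0);
    }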

    Reviewed-by: Christian Brauner (Microsoft)
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Duplicate the few lines that are shared between hugetlbfs_mknod() and
    hugetlbfs_tmpfile().

    This is a prerequisite for sanely changing the signature of ->tmpfile().

    Signed-off-by: Al Viro
    Reviewed-by: Christian Brauner (Microsoft)
    Signed-off-by: Miklos Szeredi

    Al Viro