16 May, 2022

1 commit

  • commit 19b482c29b6f3805f1d8e93015847b89e2f7f3b1 upstream.

    userfaultfd calls shmem_mfill_atomic_pte(), which does no cache
    flushing for the target page. The target page is then mapped into
    user space at a different (user) address, which may alias the kernel
    address used to copy the data from user space. Insert
    flush_dcache_page() in the non-zero-page case, and replace
    clear_highpage() with clear_user_highpage(), which already handles
    the cache maintenance.
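
    A minimal sketch of the fixed copy path (abbreviated from the patch;
    the short-copy retry and error handling are elided):

        if (!zeropage) {                /* COPY */
                page_kaddr = kmap_atomic(page);
                ret = copy_from_user(page_kaddr,
                                     (const void __user *)src_addr,
                                     PAGE_SIZE);
                kunmap_atomic(page_kaddr);
                /* flush the kernel alias before the user mapping is made */
                flush_dcache_page(page);
        } else {                        /* ZEROPAGE */
                /* clear_user_highpage() handles cache maintenance itself */
                clear_user_highpage(page, dst_addr);
        }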

    Link: https://lkml.kernel.org/r/20220210123058.79206-6-songmuchun@bytedance.com
    Fixes: 8d1039634206 ("userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support")
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Muchun Song
    Reviewed-by: Mike Kravetz
    Cc: Axel Rasmussen
    Cc: David Rientjes
    Cc: Fam Zheng
    Cc: Kirill A. Shutemov
    Cc: Lars Persson
    Cc: Peter Xu
    Cc: Xiongchun Duan
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     

27 Jan, 2022

1 commit

  • commit 62c9827cbb996c2c04f615ecd783ce28bcea894b upstream.

    Fix a data race in commit 779750d20b93 ("shmem: split huge pages beyond
    i_size under memory pressure").

    Here are the call traces causing the race:

    Call Trace 1:
    shmem_unused_huge_shrink+0x3ae/0x410
    ? __list_lru_walk_one.isra.5+0x33/0x160
    super_cache_scan+0x17c/0x190
    shrink_slab.part.55+0x1ef/0x3f0
    shrink_node+0x10e/0x330
    kswapd+0x380/0x740
    kthread+0xfc/0x130
    ? mem_cgroup_shrink_node+0x170/0x170
    ? kthread_create_on_node+0x70/0x70
    ret_from_fork+0x1f/0x30

    Call Trace 2:
    shmem_evict_inode+0xd8/0x190
    evict+0xbe/0x1c0
    do_unlinkat+0x137/0x330
    do_syscall_64+0x76/0x120
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    A simple explanation:

    Imagine there are 3 items in the local list (@list). In the first
    traversal, A is not deleted from @list.

    1)    A->B->C
          ^
          |
          pos (leave)

    In the second traversal, B is deleted from @list. Concurrently, A is
    deleted from @list through shmem_evict_inode(), because the last
    reference to the inode is dropped by another thread. The @list is
    then corrupted.

    2)    A->B->C
          ^  ^
          |  |
      evict pos (drop)

    We should make sure the inode is either on the global list or deleted from
    any local list before iput().

    Fix this by moving inodes back to the global list before we put them.
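
    Roughly, the move_back path in shmem_unused_huge_shrink() then looks
    like this (sketched from the shape of the upstream fix):

        move_back:
                /*
                 * Make sure the inode is either on the global list or
                 * deleted from any local list before iput(): once the
                 * inode is put, another thread may evict it and corrupt
                 * a local list it still sits on.
                 */
                spin_lock(&sbinfo->shrinklist_lock);
                list_move(&info->shrinklist, &sbinfo->shrinklist);
                sbinfo->shrinklist_len++;
                spin_unlock(&sbinfo->shrinklist_lock);
        put:
                iput(inode);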

    [akpm@linux-foundation.org: coding style fixes]

    Link: https://lkml.kernel.org/r/20211125064502.99983-1-ligang.bdlg@bytedance.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Gang Li
    Reviewed-by: Muchun Song
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gang Li
     

25 Sep, 2021

1 commit

  • In the case of SHMEM_HUGE_WITHIN_SIZE, the page index is not rounded
    up correctly. When the page index points to the first page of a huge
    page, round_up() does not bring it to the end of that huge page, but
    leaves it at the end of the previous one.

    An example:

    HPAGE_PMD_NR on my machine is 512 (2MB huge page size). After
    allocating a 3000KB buffer, I access it at offset 2050KB. In
    shmem_is_huge(), the corresponding index happens to be 512. After
    being rounded up by HPAGE_PMD_NR it is still 512, which is smaller
    than i_size, so shmem_is_huge() returns true. As a result, my buffer
    takes an additional huge page, and that shouldn't happen when
    shmem_enabled is set to within_size.
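
    The fix is a small change in the SHMEM_HUGE_WITHIN_SIZE case
    (sketched from the upstream diff):

        case SHMEM_HUGE_WITHIN_SIZE:
                /*
                 * round_up(512, 512) == 512, leaving the index unchanged;
                 * rounding up index + 1 reaches the end of this huge page.
                 */
                index = round_up(index + 1, HPAGE_PMD_NR);
                i_size = round_up(i_size_read(inode), PAGE_SIZE);
                if (i_size >> PAGE_SHIFT >= index)
                        return true;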

    Link: https://lkml.kernel.org/r/20210909032007.18353-1-liuyuntao10@huawei.com
    Fixes: f3f0e1d2150b2b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Liu Yuntao
    Acked-by: Kirill A. Shutemov
    Acked-by: Hugh Dickins
    Cc: wuxu.wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Yuntao
     

04 Sep, 2021

15 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton: (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
     
  • drivers/gpu/drm/i915/gem/i915_gem_shmem.c contains a shmem_writeback()
    which calls shmem_writepage() from a shrinker: that usually works well
    enough; but if /sys/kernel/mm/transparent_hugepage/shmem_enabled has been
    set to "always" (intended to be usable) or "force" (forces huge everywhere
    for easy testing), shmem_writepage() is surprised to be called with a huge
    page, and crashes on the VM_BUG_ON_PAGE(PageCompound) (I did not find out
    where the crash happens when CONFIG_DEBUG_VM is off).

    LRU page reclaim always splits the shmem huge page first: I'd prefer not
    to demand that of i915, so check and split compound in shmem_writepage().

    Patch history: when first sent last year
    http://lkml.kernel.org/r/alpine.LSU.2.11.2008301401390.5954@eggly.anvils
    https://lore.kernel.org/linux-mm/20200919042009.bomzxmrg7%25akpm@linux-foundation.org/
    Matthew Wilcox noticed that tail pages were wrongly left clean. This
    version brackets the split with Set and Clear PageDirty as he suggested:
    which works very well, even if it falls short of our aspirations. And
    recently I realized that the crash is not limited to the testing option
    "force", but affects "always" too: which is more important to fix.

    Link: https://lkml.kernel.org/r/bac6158c-8b3d-4dca-cffc-4982f58d9794@google.com
    Fixes: 2d6692e642e7 ("drm/i915: Start writeback from the shrinker")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Shakeel Butt
    Acked-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 4.18 commit 89fdcd262fd4 ("mm: shmem: make stat.st_blksize return huge
    page size if THP is on") added is_huge_enabled() to decide st_blksize: if
    hugeness is to be defined per file, that will need to be replaced by
    shmem_is_huge().

    This does give a different answer (No) for small files on a
    "huge=within_size" mount: but that can be considered a minor bugfix. And
    a different answer (No) for default files on a "huge=advise" mount: I'm
    reluctant to complicate it, just to reproduce the same debatable answer as
    before.

    Link: https://lkml.kernel.org/r/af7fb3f9-4415-9e8e-fdac-b1a5253ad21@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
    that a consistent set of checks can be applied, even when the inode is
    accessed through read/write syscalls (with NULL vma) instead of mmaps (the
    index argument is seldom of interest, but required by mount option
    "huge=within_size"). Clean up and rearrange the checks a little.

    This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
    were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes.

    Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the
    obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing
    else needs that symlink check, so leave it there in shmem_getpage_gfp().
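
    A condensed sketch of the consolidated checks (following the shape of
    the upstream code; this already incorporates the later within_size
    index fix noted in the 25 Sep entry above):

        bool shmem_is_huge(struct vm_area_struct *vma,
                           struct inode *inode, pgoff_t index)
        {
                loff_t i_size;

                if (shmem_huge == SHMEM_HUGE_DENY)
                        return false;
                if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
                    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
                        return false;
                if (shmem_huge == SHMEM_HUGE_FORCE)
                        return true;

                switch (SHMEM_SB(inode->i_sb)->huge) {
                case SHMEM_HUGE_ALWAYS:
                        return true;
                case SHMEM_HUGE_WITHIN_SIZE:
                        index = round_up(index + 1, HPAGE_PMD_NR);
                        i_size = round_up(i_size_read(inode), PAGE_SIZE);
                        if (i_size >> PAGE_SHIFT >= index)
                                return true;
                        fallthrough;
                case SHMEM_HUGE_ADVISE:
                        return vma && (vma->vm_flags & VM_HUGEPAGE);
                default:
                        return false;
                }
        }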

    Link: https://lkml.kernel.org/r/23a77889-2ddc-b030-75cd-44ca27fd4d1@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • khugepaged's collapse_file() currently uses SGP_NOHUGE to tell
    shmem_getpage() not to try allocating a huge page, in the very unlikely
    event that a racing hole-punch removes the swapped or fallocated page as
    soon as i_pages lock is dropped.

    We want to consolidate shmem's huge decisions, removing SGP_HUGE and
    SGP_NOHUGE; but cannot quite persuade ourselves that it's okay to regress
    the protection in this case - Yang Shi points out that the huge page would
    remain indefinitely, charged to root instead of the intended memcg.

    collapse_file() should not even allocate a small page in this case: why
    proceed if someone is punching a hole? SGP_READ is almost the right flag
    here, except that it optimizes away from a fallocated page, with NULL to
    tell caller to fill with zeroes (like a hole); whereas collapse_file()'s
    sequence relies on using a cache page. Add SGP_NOALLOC just for this.

    There are too many consecutive "if (page"s there in shmem_getpage_gfp():
    group it better; and fix the outdated "bring it back from swap" comment.
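
    The hole handling after the page-cache lookup then reads roughly:

        /*
         * SGP_READ: succeed on hole, with NULL page, letting caller zero.
         * SGP_NOALLOC: fail on hole, with NULL page, letting caller fail.
         */
        *pagep = NULL;
        if (sgp == SGP_READ)
                return 0;
        if (sgp == SGP_NOALLOC)
                return -ENOENT;
        /* SGP_WRITE, SGP_FALLOC or SGP_CACHE: carry on to allocate */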

    Link: https://lkml.kernel.org/r/1355343b-acf-4653-ef79-6aee40214ac5@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • shmem_huge_enabled() is about to be enhanced into shmem_is_huge(), so that
    it can be used more widely throughout: before making functional changes,
    shift it to its final position (to avoid forward declaration).

    Link: https://lkml.kernel.org/r/16fec7b7-5c84-415a-8586-69d8bf6a6685@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 5.14 commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
    checking in transparent_hugepage_enabled()") added
    transhuge_vma_enabled() as a wrapper for two very different checks:
    whether the app has marked this address range not to use THPs
    (madvise(MADV_NOHUGEPAGE)), and whether the process has disabled THPs
    entirely (prctl(PR_SET_THP_DISABLE)). shmem_huge_enabled() prefers to
    show those two checks explicitly, as before.
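
    For reference, a sketch of the wrapper and the two checks it bundles:

        static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
                                                 unsigned long vm_flags)
        {
                /* madvise(MADV_NOHUGEPAGE) on this range, or
                   prctl(PR_SET_THP_DISABLE) on this process */
                if ((vm_flags & VM_NOHUGEPAGE) ||
                    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
                        return false;
                return true;
        }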

    Link: https://lkml.kernel.org/r/45e5338-18d-c6f9-c17e-34f510bc1728@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's a block of code in shmem_setattr() to add the inode to
    shmem_unused_huge_shrink()'s shrinklist when lowering i_size: it dates
    from before 5.7 changed truncation to do split_huge_page() for itself, and
    should have been removed at that time.

    I am over-stating that: split_huge_page() can fail (notably if there's an
    extra reference to the page at that time), so there might be value in
    retrying. But there were already retries as truncation worked through the
    tails, and this addition risks repeating unsuccessful retries
    indefinitely: I'd rather remove it now, and work on reducing the chance of
    split_huge_page() failures separately, if we need to.

    Link: https://lkml.kernel.org/r/b73b3492-8822-18f9-83e2-938528cdde94@google.com
    Fixes: 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A successful shmem_fallocate() guarantees that the extent has been
    reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
    But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
    split huge pages and free their excess beyond i_size; and by other uses of
    split_huge_page() near i_size.

    It's sad to add a shmem inode field just for this, but I did not find a
    better way to keep the guarantee. A flag to say KEEP_SIZE has been used
    would be cheaper, but I'm averse to unclearable flags. The fallocend
    field is not perfect either (many disjoint ranges might be fallocated),
    but good enough; and gains another use later on.
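
    A sketch of the new field and its helper, per the upstream patch:

        /* in struct shmem_inode_info: */
        pgoff_t fallocend;      /* highest fallocated page index + 1 */

        /* helper: how far beyond EOF has been reserved by fallocate */
        static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
        {
                return max(eof, SHMEM_I(inode)->fallocend);
        }

        /* in shmem_fallocate(): remember the furthest extent reserved */
        if (info->fallocend < end)
                info->fallocend = end;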

    Link: https://lkml.kernel.org/r/ca9a146-3a59-6cd3-7f28-e9a044bb1052@google.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Patch series "huge tmpfs: shmem_is_huge() fixes and cleanups".

    A series of huge tmpfs fixes and cleanups.

    This patch (of 9):

    shmem_fallocate() goes to a lot of trouble to leave its newly allocated
    pages !Uptodate, partly to identify and undo them on failure, partly to
    leave the overhead of clearing them until later. But the huge page case
    did not skip to the end of the extent, walked through the tail pages one
    by one, and appeared to work just fine: but in doing so, cleared and
    Uptodated the huge page, so there was no way to undo it on failure.

    And by setting Uptodate too soon, it messed up both its nr_falloced and
    nr_unswapped counts, so that the intended "time to give up" heuristic did
    not work at all.

    Now advance immediately to the end of the huge extent, with a comment on
    why this is more than just an optimization. But although this speeds up
    huge tmpfs fallocation, it does leave the clearing until first use, and
    some users may have come to appreciate slow fallocate but fast first use:
    if they complain, then we can consider adding a pass to clear at the end.
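
    The advance over a huge extent then looks roughly like this in
    shmem_fallocate()'s loop (sketch based on the upstream fix):

        index++;
        /*
         * A second SGP_FALLOC on the same huge page would clear it,
         * making it PageUptodate and un-undoable if we fail later:
         * so this skip is more than just an optimization.
         */
        if (PageTransCompound(page)) {
                index = round_up(index, HPAGE_PMD_NR);
                /* Beware 32-bit wraparound of the last page: prevent loop */
                if (!index)
                        index--;
        }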

    Link: https://lkml.kernel.org/r/da632211-8e3e-6b1-aee-ab24734429a0@google.com
    Link: https://lkml.kernel.org/r/16201bd2-70e-37e2-e89b-5f929430da@google.com
    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: Shakeel Butt
    Cc: "Kirill A. Shutemov"
    Cc: Miaohe Lin
    Cc: Mike Kravetz
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It's bad to extern swap_info[] in a .c file. Include the
    corresponding header file instead.

    Link: https://lkml.kernel.org/r/20210812120350.49801-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • The forward declarations for shmem_should_replace_page() and
    shmem_replace_page() are unnecessary. Remove them.

    Link: https://lkml.kernel.org/r/20210812120350.49801-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Since commit bf6ebd97aba0 ("userfaultfd/shmem: modify
    shmem_mfill_atomic_pte to use install_pte()"),
    mfill_atomic_install_pte() installs the pte and updates the mmu
    cache. Remove the tlbflush.h include, as update_mmu_cache() is no
    longer called here.

    Link: https://lkml.kernel.org/r/20210812120350.49801-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Patch series "Cleanups for shmem".

    This series contains cleanups to remove unneeded variable, header file,
    function forward declaration and so on. More details can be found in the
    respective changelogs.

    This patch (of 4):

    The local variable ret is always equal to -ENOMEM and never touched. So
    remove it and return -ENOMEM directly to simplify the code.

    Link: https://lkml.kernel.org/r/20210812120350.49801-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210812120350.49801-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Each CPU has SHMEM_INO_BATCH inode numbers available in the per-CPU
    `->ino_batch'; access is serialized by disabling preemption. If the
    pool is empty, it is refilled from `->next_ino', where access is
    serialized by ->stat_lock, a spinlock_t which cannot be acquired with
    preemption disabled.

    One way around this would be to make the per-CPU ino_batch struct
    containing the inode number a local_lock_t.

    Another solution is to promote ->stat_lock to a raw_spinlock_t. The
    critical sections are short. The mpol_put() must be moved outside of the
    critical section to avoid invoking the destructor with disabled
    preemption.
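
    A sketch of the resulting pattern, with the short raw critical
    section and the mpol destructor moved outside it:

        raw_spin_lock(&sbinfo->stat_lock);      /* now a raw_spinlock_t */
        mpol = sbinfo->mpol;
        sbinfo->mpol = ctx->mpol;               /* transfers initial ref */
        raw_spin_unlock(&sbinfo->stat_lock);
        mpol_put(mpol);                         /* runs preemptible */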

    Link: https://lkml.kernel.org/r/20210806142916.jdwkb5bx62q5fwfo@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

31 Aug, 2021

1 commit

  • Pull fs hole punching vs cache filling race fixes from Jan Kara:
    "Fix races leading to possible data corruption or stale data exposure
    in multiple filesystems when hole punching races with operations such
    as readahead.

    This is the series I was sending for the last merge window but with
    your objection fixed - now filemap_fault() has been modified to take
    invalidate_lock only when we need to create new page in the page cache
    and / or bring it uptodate"

    * tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    filesystems/locking: fix Malformed table warning
    cifs: Fix race between hole punch and page fault
    ceph: Fix race between hole punch and page fault
    fuse: Convert to using invalidate_lock
    f2fs: Convert to using invalidate_lock
    zonefs: Convert to using invalidate_lock
    xfs: Convert double locking of MMAPLOCK to use VFS helpers
    xfs: Convert to use invalidate_lock
    xfs: Refactor xfs_isilocked()
    ext2: Convert to using invalidate_lock
    ext4: Convert to use mapping->invalidate_lock
    mm: Add functions to lock invalidate_lock for two mappings
    mm: Protect operations adding pages to page cache with invalidate_lock
    documentation: Sync file_operations members with reality
    mm: Fix comments mentioning i_mutex

    Linus Torvalds
     

21 Aug, 2021

1 commit

  • Due to the change in how the block layer detects congestion, the
    justification of commit 8fd2e0b505d1 ("mm: swap: check if swap backing
    device is congested or not") no longer stands, so that commit can
    simply be reverted. With it gone, the race fixed by commit
    2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
    disappears, so that fix can be reverted as well.

    And that fix is also kind of buggy as discussed by [1] and [2].

    [1] https://lore.kernel.org/linux-mm/24187e5e-069-9f3f-cefe-39ac70783753@google.com/
    [2] https://lore.kernel.org/linux-mm/e82380b9-3ad4-4a52-be50-6d45c7f2b5da@google.com/

    Link: https://lkml.kernel.org/r/20210810202936.2672-2-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: "Huang, Ying"
    Cc: Miaohe Lin
    Cc: David Hildenbrand
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox (Oracle)
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

13 Jul, 2021

1 commit

  • inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
    comments still mentioning i_mutex.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Acked-by: Hugh Dickins
    Signed-off-by: Jan Kara

    Jan Kara
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton: (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

01 Jul, 2021

4 commits

  • In a previous commit, we added the mfill_atomic_install_pte() helper.
    This helper does the job of setting up PTEs for an existing page, to map
    it into a given VMA. It deals with both the anon and shmem cases, as well
    as the shared and private cases.

    In other words, shmem_mfill_atomic_pte() duplicates a case it already
    handles. So, expose it, and let shmem_mfill_atomic_pte() use it directly,
    to reduce code duplication.

    This requires that we refactor shmem_mfill_atomic_pte() a bit:

    Instead of doing accounting (shmem_recalc_inode() et al) part-way through
    the PTE setup, do it afterward. This frees up mfill_atomic_install_pte()
    from having to care about this accounting, and means we don't need to e.g.
    shmem_uncharge() in the error path.

    A side effect is this switches shmem_mfill_atomic_pte() to use
    lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add().
    This wrapper does some extra accounting in an exceptional case, if
    appropriate, so it's actually the more correct thing to use.

    Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen
    Reviewed-by: Peter Xu
    Acked-by: Hugh Dickins
    Cc: Alexander Viro
    Cc: Andrea Arcangeli
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Cc: Lokesh Gidra
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • This patch allows shmem-backed VMAs to be registered for minor faults.
    Minor faults are appropriately relayed to userspace in the fault path, for
    VMAs with the relevant flag.

    This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
    minor faults, though, so userspace doesn't yet have a way to resolve such
    faults.

    Because of this, we also don't yet advertise this as a supported feature.
    That will be done in a separate commit when the feature is fully
    implemented.

    Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen
    Acked-by: Peter Xu
    Acked-by: Hugh Dickins
    Cc: Alexander Viro
    Cc: Andrea Arcangeli
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Cc: Lokesh Gidra
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • Patch series "userfaultfd: add minor fault handling for shmem", v6.

    Overview
    ========

    See the series which added minor faults for hugetlbfs [3] for a detailed
    overview of minor fault handling in general. This series adds the same
    support for shmem-backed areas.

    This series is structured as follows:

    - Commits 1 and 2 are cleanups.
    - Commits 3 and 4 implement the new feature (minor fault handling for shmem).
    - Commit 5 advertises that the feature is now available since at this point it's
    fully implemented.
    - Commit 6 is a final cleanup, modifying an existing code path to re-use a new
    helper we've introduced.
    - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature.

    Use Case
    ========

    In some cases it is useful to have VM memory backed by tmpfs instead of
    hugetlbfs. So, this feature will be used to support the same VM live
    migration use case described in my original series.

    Additionally, Android folks (Lokesh Gidra) hope to optimize the
    Android Runtime garbage collector using this feature:

    "The plan is to use userfaultfd for concurrently compacting the heap.
    With this feature, the heap can be shared-mapped at another location where
    the GC-thread(s) could continue the compaction operation without the need
    to invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java
    threads get faults on the heap, UFFDIO_CONTINUE can be used to resume
    execution. Furthermore, this feature enables updating references in the
    'non-moving' portion of the heap efficiently. Without this feature,
    unnecessary page copying (ioctl(UFFDIO_COPY)) would be required."

    [1] https://lore.kernel.org/patchwork/cover/1388144/
    [2] https://lore.kernel.org/patchwork/patch/1408161/
    [3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t

    This patch (of 9):

    Previously, we did a dance where we had one calling path in userfaultfd.c
    (mfill_atomic_pte), but then we split it into two in shmem_fs.h
    (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single
    shared function in shmem.c (shmem_mfill_atomic_pte).

    This is all a bit overly complex. Just call the single combined shmem
    function directly, allowing us to clean up various branches, boilerplate,
    etc.

    While we're touching this function, two other small cleanup changes:
    - offset is equivalent to pgoff, so we can get rid of offset entirely.
    - Split two VM_BUG_ON cases into two statements. This means the line
    number reported when the BUG is hit specifies exactly which condition
    was true.

    Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com
    Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen
    Reviewed-by: Peter Xu
    Acked-by: Hugh Dickins
    Cc: Alexander Viro
    Cc: Andrea Arcangeli
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Cc: Lokesh Gidra
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for
    (non-shmem) FS"), read-only THP file mapping is supported, but the
    corresponding check was missing from transparent_hugepage_enabled().
    To fix it, add a check for read-only THP file mappings, and introduce
    the helper transhuge_vma_enabled() to check whether THP is enabled
    for a given vma, reducing duplicated code. Also rename
    transparent_hugepage_enabled to transparent_hugepage_active to make
    the code easier to follow, as suggested by David Hildenbrand.

    [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
    Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com

    Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Miaohe Lin
    Reviewed-by: Yang Shi
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: David Hildenbrand
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

30 Jun, 2021

3 commits

  • Merge misc updates from Andrew Morton:
    "191 patches.

    Subsystems affected by this patch series: kthread, ia64, scripts,
    ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
    slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
    mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
    pagealloc, and memory-failure)"

    * emailed patches from Andrew Morton: (191 commits)
    mm,hwpoison: make get_hwpoison_page() call get_any_page()
    mm,hwpoison: send SIGBUS with error virutal address
    mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
    mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
    mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
    mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
    docs: remove description of DISCONTIGMEM
    arch, mm: remove stale mentions of DISCONIGMEM
    mm: remove CONFIG_DISCONTIGMEM
    m68k: remove support for DISCONTIGMEM
    arc: remove support for DISCONTIGMEM
    arc: update comment about HIGHMEM implementation
    alpha: remove DISCONTIGMEM and NUMA
    mm/page_alloc: move free_the_page
    mm/page_alloc: fix counting of managed_pages
    mm/page_alloc: improve memmap_pages dbg msg
    mm: drop SECTION_SHIFT in code comments
    mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
    mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
    mm/page_alloc: scale the number of pages that are batch freed
    ...

    Linus Torvalds
     
  • set_active_memcg() worked for kernel allocations but was silently ignored
    for user pages.

    This patch establishes a precedence order for who gets charged:

    1. If there is a memcg associated with the page already, that memcg is
    charged. This happens during swapin.

    2. If an explicit mm is passed, mm->memcg is charged. This happens
    during page faults, which can be triggered in remote VMs (eg gup).

    3. Otherwise consult the current process context. If there is an
    active_memcg, use that. Otherwise, current->mm->memcg.

    Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it would
    always charge the root cgroup. Now it looks up the active_memcg first
    (falling back to charging the root cgroup if not set).
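
    A pseudocode sketch of that precedence; the helper names here are
    illustrative, not the exact mainline functions:

        struct mem_cgroup *choose_charge_memcg(struct page *page,
                                               struct mm_struct *mm)
        {
                struct mem_cgroup *memcg;

                /* 1. swapin: the page already carries a memcg - use it */
                memcg = memcg_from_swap_entry(page);    /* hypothetical */
                if (memcg)
                        return memcg;

                /* 2. explicit mm (page faults, remote gup): mm->memcg */
                if (mm)
                        return get_mem_cgroup_from_mm(mm);

                /* 3. current context: active_memcg if set, else
                 *    current->mm->memcg (root only as a last resort) */
                return get_mem_cgroup_from_mm(current->mm);
        }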

    Link: https://lkml.kernel.org/r/20210610173944.1203706-3-schatzberg.dan@gmail.com
    Signed-off-by: Dan Schatzberg
    Acked-by: Johannes Weiner
    Acked-by: Tejun Heo
    Acked-by: Chris Down
    Acked-by: Jens Axboe
    Reviewed-by: Shakeel Butt
    Reviewed-by: Michal Koutný
    Cc: Michal Hocko
    Cc: Ming Lei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Schatzberg
     
  • When I was investigating the swap code, I found the below possible race
    window:

    CPU 1                                       CPU 2
    -----                                       -----
    shmem_swapin
      swap_cluster_readahead
        if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
                                                swapoff
                                                  ..
                                                  si->swap_file = NULL;
                                                  ..
        struct inode *inode = si->swap_file->f_mapping->host; [oops!]

    Close this race window by using get/put_swap_device() to guard against
    concurrent swapoff.
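
    A sketch of the guarded readahead in shmem_swapin():

        struct swap_info_struct *si;
        struct page *page;

        /* Prevent concurrent swapoff from freeing si->swap_file. */
        si = get_swap_device(swap);
        if (unlikely(!si))
                return NULL;

        shmem_pseudo_vma_init(&pvma, info, index);
        page = swap_cluster_readahead(swap, gfp, &vmf);
        shmem_pseudo_vma_destroy(&pvma);

        put_swap_device(si);
        return page;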

    Link: https://lkml.kernel.org/r/20210426123316.806267-5-linmiaohe@huawei.com
    Fixes: 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or not")
    Signed-off-by: Miaohe Lin
    Reviewed-by: "Huang, Ying"
    Cc: Dennis Zhou
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Alex Shi
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Wei Yang
    Cc: Yang Shi
    Cc: David Hildenbrand
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

29 Jun, 2021

1 commit

  • Pull user namespace rlimit handling update from Eric Biederman:
    "This is the work mainly by Alexey Gladkov to limit rlimits to the
    rlimits of the user that created a user namespace, and to allow users
    to have stricter limits on the resources created within a user
    namespace."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    cred: add missing return error code when set_cred_ucounts() failed
    ucounts: Silence warning in dec_rlimit_ucounts
    ucounts: Set ucount_max to the largest positive value the type can hold
    kselftests: Add test to check for rlimit changes in different user namespaces
    Reimplement RLIMIT_MEMLOCK on top of ucounts
    Reimplement RLIMIT_SIGPENDING on top of ucounts
    Reimplement RLIMIT_MSGQUEUE on top of ucounts
    Reimplement RLIMIT_NPROC on top of ucounts
    Use atomic_t for ucounts reference counting
    Add a reference to ucounts for each cred
    Increase size of ucounts to atomic_long_t

    Linus Torvalds
     

15 May, 2021

2 commits

  • Consider the following sequence of events:

    1. Userspace issues a UFFD ioctl, which ends up calling into
    shmem_mfill_atomic_pte(). We successfully account the blocks, we
    shmem_alloc_page(), but then the copy_from_user() fails. We return
    -ENOENT. We don't release the page we allocated.
    2. Our caller detects this error code, tries the copy_from_user() after
    dropping the mmap_lock, and retries, calling back into
    shmem_mfill_atomic_pte().
    3. Meanwhile, let's say another process filled up the tmpfs being used.
    4. So shmem_mfill_atomic_pte() fails to account blocks this time, and
    immediately returns - without releasing the page.

    This triggers a BUG_ON in our caller, which asserts that the page
    should always be consumed, unless -ENOENT is returned.

    To fix this, detect if we have such a "dangling" page when accounting
    fails, and if so, release it before returning.
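
    The fix, roughly as it reads in shmem_mfill_atomic_pte(): release the
    page carried over from the previous attempt when accounting fails.

        ret = -ENOMEM;
        if (!shmem_inode_acct_block(inode, 1)) {
                /*
                 * We may have got a page, returned -ENOENT triggering a
                 * retry, and now we find ourselves with -ENOMEM. Release
                 * the page to avoid a BUG_ON in our caller.
                 */
                if (unlikely(*pagep)) {
                        put_page(*pagep);
                        *pagep = NULL;
                }
                goto out;
        }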

    Link: https://lkml.kernel.org/r/20210428230858.348400-1-axelrasmussen@google.com
    Fixes: cb658a453b93 ("userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY")
    Signed-off-by: Axel Rasmussen
    Reported-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Reviewed-by: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2.

    Hugh reported an issue with F_SEAL_FUTURE_WRITE not being applied
    correctly to hugetlbfs, which I can easily verify using the
    memfd_test program; it seems the program is rarely run with hugetlbfs
    pages (shmem being the default).

    Meanwhile I found another, probably even more severe, issue: hugetlb
    fork won't write-protect the child's COW pages, so the child can
    potentially write to the parent's private pages. Patch 2 addresses
    that.

    After this series applied, "memfd_test hugetlbfs" should start to pass.

    This patch (of 2):

    F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
    There is a test program for that and it fails constantly.

    $ ./memfd_test hugetlbfs
    memfd-hugetlb: CREATE
    memfd-hugetlb: BASIC
    memfd-hugetlb: SEAL-WRITE
    memfd-hugetlb: SEAL-FUTURE-WRITE
    mmap() didn't fail as expected
    Aborted (core dumped)

    I think it's probably because no one is really running the hugetlbfs test.

    Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we
    do in shmem_mmap(). Generalize a helper for that.
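
    A sketch of the generalized helper, shared by shmem_mmap() and
    hugetlbfs_file_mmap():

        static inline int seal_check_future_write(int seals,
                                                  struct vm_area_struct *vma)
        {
                if (seals & F_SEAL_FUTURE_WRITE) {
                        /*
                         * New PROT_WRITE and MAP_SHARED mmaps are not
                         * allowed when the "future write" seal is active.
                         */
                        if ((vma->vm_flags & VM_SHARED) &&
                            (vma->vm_flags & VM_WRITE))
                                return -EPERM;
                        /*
                         * Don't let mprotect re-enable write on shared,
                         * read-only mappings; private mappings keep
                         * VM_MAYWRITE so they stay COW-writable.
                         */
                        if (vma->vm_flags & VM_SHARED)
                                vma->vm_flags &= ~(VM_MAYWRITE);
                }
                return 0;
        }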

    Link: https://lkml.kernel.org/r/20210503234356.9097-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20210503234356.9097-2-peterx@redhat.com
    Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
    Signed-off-by: Peter Xu
    Reported-by: Hugh Dickins
    Reviewed-by: Mike Kravetz
    Cc: Joel Fernandes (Google)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

06 May, 2021

1 commit

  • Various coding style tweaks to various files under mm/

    [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

    Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
    Signed-off-by: Zhiyuan Dai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhiyuan Dai
     

01 May, 2021

1 commit

  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.

    Changelog

    v11:
    * Fix issue found by lkp robot.

    v8:
    * Fix issues found by lkp-tests project.

    v7:
    * Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

    v6:
    * Fix bug in hugetlb_file_setup() detected by trinity.

    Reported-by: kernel test robot
    Reported-by: kernel test robot
    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

19 Apr, 2021

1 commit

  • Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID
    for identifying objects using file handle and fsid in events.

    fanotify_mark(2) fails with -ENODEV when trying to set a mark on
    filesystems that report a null f_fsid in statfs(2).

    Use a digest of the uuid as f_fsid for tmpfs, to uniquely identify
    tmpfs objects as best as possible, and allow setting an fanotify mark
    that reports events with file handles on tmpfs.

    Link: https://lore.kernel.org/r/20210322173944.449469-3-amir73il@gmail.com
    Acked-by: Hugh Dickins
    Reviewed-by: Christian Brauner
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

27 Feb, 2021

5 commits

  • Hugh pointed out that the gma500 driver uses shmem pages, but needs to
    limit them to the DMA32 zone. Ensure the allocations resulting from the
    gfp_mask returned by limit_gfp_mask use the zone flags that were
    originally passed to shmem_getpage_gfp.
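
    limit_gfp_mask() then combines the two masks roughly like this,
    keeping the caller's zone bits (sketch; flag details may differ):

        static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
        {
                gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
                gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
                gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
                gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);

                /* Allow allocations only from the originally given zones. */
                result |= zoneflags;

                /* Union of deny flags, intersection of allow flags. */
                result |= (limit_gfp & denyflags);
                result |= (huge_gfp & limit_gfp) & allowflags;

                return result;
        }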

    Link: https://lkml.kernel.org/r/20210224121016.1314ed6d@imladris.surriel.com
    Signed-off-by: Rik van Riel
    Suggested-by: Hugh Dickins
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Xu Yu
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Matthew Wilcox pointed out that the i915 driver opportunistically
    allocates tmpfs memory, but will happily reclaim some of its pool if no
    memory is available.

    Make sure the gfp mask used to opportunistically allocate a THP is always
    at least as restrictive as the original gfp mask.

    Link: https://lkml.kernel.org/r/20201124194925.623931-3-riel@surriel.com
    Signed-off-by: Rik van Riel
    Suggested-by: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Xu Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.

    This patch (of 4):

    The allocation flags of anonymous transparent huge pages can be
    controlled through the files in
    /sys/kernel/mm/transparent_hugepage/defrag, which can help keep the
    system from getting bogged down in the page reclaim and compaction
    code when many THPs are getting allocated simultaneously.

    However, the gfp_mask for shmem THP allocations was not limited by
    those configuration settings, and some workloads ended up with all
    CPUs stuck on the LRU lock in the page reclaim code, trying to
    allocate dozens of THPs simultaneously.

    This patch applies the same configured limitation of THPs to shmem
    hugepage allocations, to prevent that from happening.

    Controlling the gfp_mask of THP allocations through the knobs in sysfs
    allows users to determine the balance between how aggressively the system
    tries to allocate THPs at fault time, and how much the application may end
    up stalling attempting those allocations.

    This way a THP defrag setting of "never" or "defer+madvise" will result in
    quick allocation failures without direct reclaim when no 2MB free pages
    are available.

    With this patch applied, THP allocations for tmpfs will be a little more
    aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
    less aggressive for files that are not mmapped or mapped without that
    flag.

    Link: https://lkml.kernel.org/r/20201124194925.623931-1-riel@surriel.com
    Link: https://lkml.kernel.org/r/20201124194925.623931-2-riel@surriel.com
    Signed-off-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Xu Yu
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox (Oracle)
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • All callers of find_get_entries() use a pvec, so pass it directly instead
    of manipulating it in the caller.

    Link: https://lkml.kernel.org/r/20201112212641.27837-14-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Reviewed-by: William Kucharski
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This simplifies the callers and leads to a more efficient implementation
    since the XArray has this functionality already.

    Link: https://lkml.kernel.org/r/20201112212641.27837-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Jan Kara
    Reviewed-by: William Kucharski
    Reviewed-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)