13 Aug, 2020

1 commit

  • syzbot found issues with having hugetlbfs on a union/overlay as reported
    in [1]. Due to its limitations (no write support) and special
    functionality, hugetlbfs does not work well in filesystem stacking.
    There are no known use cases for hugetlbfs stacking. Rather than making
    modifications to get hugetlbfs working in such environments, simply
    prevent stacking.

    [1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/
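    A minimal sketch of the approach, assuming the change lands in the
    hugetlbfs fill_super path (the exact hunk and placement are assumptions
    here):

    /*
     * Due to the special and limited functionality of hugetlbfs, it does
     * not work well as a stacking filesystem.  Setting s_stack_depth to
     * the maximum makes overlay/union mounts on top of it exceed the
     * stacking limit and fail.
     */
    sb->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;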

    Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com
    Suggested-by: Amir Goldstein
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Acked-by: Miklos Szeredi
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Colin Walters
    Link: http://lkml.kernel.org/r/80f869aa-810d-ef6c-8888-b46cee135907@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

08 Aug, 2020

1 commit

  • The current split between do_mmap() and do_mmap_pgoff() was introduced in
    commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
    do_mmap_pgoff()") to support MPX.

    The wrapper function do_mmap_pgoff() always passed 0 as the value of the
    vm_flags argument to do_mmap(). However, MPX support has subsequently
    been removed from the kernel and there were no more direct callers of
    do_mmap(); all calls were going via do_mmap_pgoff().

    Simplify the code by removing do_mmap_pgoff() and changing all callers to
    directly call do_mmap(), which now no longer takes a vm_flags argument.
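    For reference, the removed wrapper was essentially the following sketch
    (the exact declaration in include/linux/mm.h may differ in formatting):

    static inline unsigned long
    do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len,
                  unsigned long prot, unsigned long flags, unsigned long pgoff,
                  unsigned long *populate, struct list_head *uf)
    {
            /* vm_flags was always 0 once MPX was gone */
            return do_mmap(file, addr, len, prot, flags, 0, pgoff,
                           populate, uf);
    }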

    Signed-off-by: Peter Collingbourne
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
    Signed-off-by: Linus Torvalds

    Peter Collingbourne
     

10 Jun, 2020

1 commit

  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jun, 2020

1 commit

  • In a 32-bit program running on the arm64 architecture, when the address
    space below the mmap base is completely exhausted, shmat() for huge
    pages will return ENOMEM, but shmat() for normal pages can still succeed
    in non-legacy mode. This is inconsistent.

    For normal pages, the calling trace of get_unmapped_area() is:

    => mm->get_unmapped_area()
         in legacy mode:
           => arch_get_unmapped_area()
              => vm_unmapped_area()
         in non-legacy mode:
           => arch_get_unmapped_area_topdown()
              => vm_unmapped_area()

    For huge pages, the calling trace of get_unmapped_area() is:

    => file->f_op->get_unmapped_area()
    => hugetlb_get_unmapped_area()
    => vm_unmapped_area()

    To solve this issue, we only need to make hugetlb_get_unmapped_area()
    behave the same way as mm->get_unmapped_area(). Add *bottomup() and
    *topdown() variants for hugetlbfs, and check the current
    mm->get_unmapped_area() to decide which one to use: if
    mm->get_unmapped_area equals arch_get_unmapped_area_topdown(),
    hugetlb_get_unmapped_area() calls the topdown routine; otherwise it
    calls the bottomup routine.
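    The dispatch amounts to something like the following sketch (names
    follow the description above; the in-tree code differs in detail):

    unsigned long
    hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
                              unsigned long len, unsigned long pgoff,
                              unsigned long flags)
    {
            struct mm_struct *mm = current->mm;

            /* ... length/alignment checks elided ... */

            /* mirror mm->get_unmapped_area()'s legacy/topdown choice */
            if (mm->get_unmapped_area == arch_get_unmapped_area_topdown)
                    return hugetlb_get_unmapped_area_topdown(file, addr,
                                                             len, pgoff,
                                                             flags);
            return hugetlb_get_unmapped_area_bottomup(file, addr, len,
                                                      pgoff, flags);
    }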

    Reported-by: kbuild test robot
    Signed-off-by: Shijie Hu
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Will Deacon
    Cc: Xiaoming Ni
    Cc: Kefeng Wang
    Cc: yangerkun
    Cc: ChenGang
    Cc: Chen Jie
    Link: http://lkml.kernel.org/r/20200518065338.113664-1-hushijie3@huawei.com
    Signed-off-by: Linus Torvalds

    Shijie Hu
     

03 Apr, 2020

2 commits

  • hugetlbfs page faults can race with truncate and hole punch operations.
    Current code in the page fault path attempts to handle this by 'backing
    out' operations if we encounter the race. One obvious omission in the
    current code is removing a page newly added to the page cache. This is
    straightforward to address, but there is a more subtle and
    difficult issue of backing out hugetlb reservations. To handle this
    correctly, the 'reservation state' before page allocation needs to be
    noted so that it can be properly backed out. There are four distinct
    possibilities for reservation state: shared/reserved, shared/no-resv,
    private/reserved and private/no-resv. Backing out a reservation may
    require a memory allocation, which could fail, so that needs to be taken
    into account as well.

    Instead of writing the required complicated code for this rare
    occurrence, just eliminate the race. i_mmap_rwsem is now held in read
    mode for the duration of page fault processing. Hold i_mmap_rwsem in
    write mode when modifying i_size. In this way, truncation cannot
    proceed while page faults are being processed. In addition, i_size
    will not change during fault processing, so a single check can be made
    to ensure faults are not beyond the (proposed) end of file. Faults can
    still race with hole punch, but that race is handled by existing code
    and the use of hugetlb_fault_mutex.

    With this modification, checks for races with truncation in the page
    fault path can be simplified and removed. remove_inode_hugepages no
    longer needs to take hugetlb_fault_mutex in the case of truncation.
    Comments are expanded to explain reasoning behind locking.
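    In outline, the truncate side now looks like this sketch (helper names
    are the usual in-tree ones; the exact code may differ):

    /* block page faults while i_size changes */
    i_mmap_lock_write(mapping);
    i_size_write(inode, offset);
    hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
    i_mmap_unlock_write(mapping);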

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K . V"
    Cc: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Prakash Sangappa
    Link: http://lkml.kernel.org/r/20200316205756.146666-3-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

    While discussing the issue with huge_pte_offset [1], I remembered that
    there were more outstanding hugetlb races. These issues are:

    1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can
       become invalid via a call to huge_pmd_unshare by another thread.
    2) hugetlbfs page faults can race with truncation, causing invalid
       global reserve counts and state.

    A previous attempt was made to use i_mmap_rwsem in this manner as
    described at [2]. However, those patches were reverted starting with [3]
    due to locking issues.

    To effectively use i_mmap_rwsem to address the above issues it needs to be
    held (in read mode) during page fault processing. However, during fault
    processing we need to lock the page we will be adding. Lock ordering
    requires we take page lock before i_mmap_rwsem. Waiting until after
    taking the page lock is too late in the fault process for the
    synchronization we want to do.

    To address this lock ordering issue, the following patches change the lock
    ordering for hugetlb pages. This is not too invasive as hugetlbfs
    processing is done separately from core mm in many places. However, I don't
    really like this idea. Much ugliness is contained in the new routine
    hugetlb_page_mapping_lock_write() of patch 1.

    The only other way I can think of to address these issues is by catching
    all the races. After catching a race, cleanup, backout, retry ... etc,
    as needed. This can get really ugly, especially for huge page
    reservations. At one time, I started writing some of the reservation
    backout code for page faults and it got so ugly and complicated I went
    down the path of adding synchronization to avoid the races. Any other
    suggestions would be welcome.

    [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
    [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
    [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
    [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
    [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

    This patch (of 2):

    While looking at BUGs associated with invalid huge page map counts, it
    was observed that a huge pte pointer could become 'invalid' and point to
    another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows:
    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
      huge_pmd_share is only called via huge_pte_alloc, so callers of
      huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
      of huge_pte_alloc continue to hold the semaphore until finished with
      the ptep.
    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

    One problem with this scheme is that it requires taking i_mmap_rwsem
    before taking the page lock during page faults. This is not the order
    specified in the rest of mm code. Handling of hugetlbfs pages is mostly
    isolated today. Therefore, we use this alternative locking order for
    PageHuge() pages.

    mapping->i_mmap_rwsem
    hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
    page->flags PG_locked (lock_page)

    To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
    introduced to write lock the i_mmap_rwsem associated with a page.

    In most cases it is easy to get address_space via vma->vm_file->f_mapping.
    However, in the case of migration or memory errors for anon pages we do
    not have an associated vma. A new routine _get_hugetlb_page_mapping()
    will use anon_vma to get address_space in these cases.
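    Expressed as a sketch of the fault path, the ordering above becomes
    (helper names as in this era of the kernel; details elided):

    mapping = vma->vm_file->f_mapping;
    i_mmap_lock_read(mapping);                    /* 1: i_mmap_rwsem, read */
    hash = hugetlb_fault_mutex_hash(mapping, idx);
    mutex_lock(&hugetlb_fault_mutex_table[hash]); /* 2: fault mutex */
    lock_page(page);                              /* 3: PG_locked */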

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

08 Feb, 2020

3 commits


04 Jan, 2020

1 commit

  • LTP memfd_create04 started failing for some huge page sizes
    after v5.4-10135-gc3bfc5dd73c6.

    The problem is the check introduced into the for_each_hstate() loop that
    should skip default_hstate_idx. Since it doesn't update the 'i' counter,
    all subsequent huge page sizes are skipped as well.
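    The fix is a one-liner; roughly (a sketch of the loop in
    init_hugetlbfs_fs(), with surrounding details elided):

    i = 0;
    for_each_hstate(h) {
            if (i == default_hstate_idx) {
                    i++;    /* previously missing, so every later hstate
                             * was compared against a stale 'i' */
                    continue;
            }
            mnt = mount_one_hugetlbfs(h);
            hugetlbfs_vfsmount[i] = IS_ERR(mnt) ? NULL : mnt;
            i++;
    }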

    Fixes: 8fc312b32b25 ("mm/hugetlbfs: fix error handling when setting up mounts")
    Signed-off-by: Jan Stancek
    Reviewed-by: Mike Kravetz
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

02 Dec, 2019

4 commits

  • The first parameter hstate in function hugetlb_fault_mutex_hash() is not
    used anymore.

    This patch removes it.
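    The change, in essence (a sketch of the declaration):

    /* before */
    u32 hugetlb_fault_mutex_hash(struct hstate *h,
                                 struct address_space *mapping, pgoff_t idx);
    /* after: the unused hstate argument is gone */
    u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);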

    [akpm@linux-foundation.org: various build fixes]
    [cai@lca.pw: fix a GCC compilation warning]
    Link: http://lkml.kernel.org/r/1570544108-32331-1-git-send-email-cai@lca.pw
    Link: http://lkml.kernel.org/r/20191005003302.785-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Signed-off-by: Qian Cai
    Suggested-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • With hugetlbfs, a common pattern for mapping anonymous huge pages is to
    create a temporary file first. Currently, libraries like libhugetlbfs
    and seastar create these with a standard mkstemp+unlink trick, but it
    would be more robust to be able to simply pass the O_TMPFILE flag to
    open(). O_TMPFILE is already supported by several file systems like
    ext4 and xfs. The implementation simply uses the existing d_tmpfile
    utility function to instantiate the dcache entry for the file.

    Tested manually by successfully creating a temporary file by opening it
    with (O_TMPFILE|O_RDWR) on a mounted hugetlbfs and successfully mapping
    2M huge pages with it. Without the patch, trying to open a file with
    O_TMPFILE results in -EOPNOTSUPP.
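    A minimal userspace demonstration might look like this (it assumes
    hugetlbfs is mounted at /dev/hugepages with a 2M default huge page
    size; both are assumptions about the local setup):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 2 * 1024 * 1024;   /* one 2M huge page */

            /* unnamed temporary file on hugetlbfs -- no unlink trick */
            int fd = open("/dev/hugepages", O_TMPFILE | O_RDWR, 0600);
            if (fd < 0) { perror("open(O_TMPFILE)"); return 1; }
            if (ftruncate(fd, len)) { perror("ftruncate"); return 1; }

            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            p[0] = 1;        /* touch the huge page */
            munmap(p, len);
            close(fd);       /* page and inode are released here */
            return 0;
    }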

    Link: http://lkml.kernel.org/r/bc9383eff6e1374d79f3a92257ae829ba1e6ae60.1573285189.git.p.sarna@tlen.pl
    Signed-off-by: Piotr Sarna
    Reviewed-by: Mike Kravetz
    Cc: Al Viro
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Piotr Sarna
     
  • It is assumed that the hugetlbfs_vfsmount[] array will contain either a
    valid vfsmount pointer or NULL for each hstate after initialization.
    Changes made while converting to use fs_context broke this assumption.

    While fixing the hugetlbfs_vfsmount issue, it was discovered that
    init_hugetlbfs_fs never correctly cleaned up when encountering a vfs
    mount error.

    The issue was found during code inspection. A small memory allocation
    failure would be the most likely cause of taking an error path with the
    bug. This is unlikely to happen, as this is early init code.

    Link: http://lkml.kernel.org/r/94b6244d-2c24-e269-b12c-e3ba694b242d@oracle.com
    Reported-by: Chengguang Xu
    Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
    Signed-off-by: Mike Kravetz
    Cc: David Howells
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • A new clang diagnostic (-Wsizeof-array-div) warns about the calculation
    to determine the number of u32's in an array of unsigned longs.
    Suppress warning by adding parentheses.

    While looking at the above issue, it was noticed that the 'address'
    parameter to hugetlb_fault_mutex_hash is no longer used. So, remove it
    from the definition and all callers.

    No functional change.
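    The warning and its suppression, roughly (a sketch of the hash
    computation in hugetlb_fault_mutex_hash()):

    unsigned long key[2];
    u32 hash;

    /*
     * clang's -Wsizeof-array-div fires on sizeof(key)/sizeof(u32) because
     * u32 is not the element type of key[]; the division (number of u32
     * words in key) is intentional, so parenthesize to silence it.
     */
    hash = jhash2((u32 *)&key, sizeof(key)/(sizeof(u32)), 0);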

    Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Nathan Chancellor
    Reviewed-by: Nathan Chancellor
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Nick Desaulniers
    Cc: Ilie Halip
    Cc: David Bolvansky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

05 Jul, 2019

1 commit


21 May, 2019

1 commit


15 May, 2019

2 commits

  • Continuing discussion about 58b6e5e8f1ad ("hugetlbfs: fix memory leak for
    resv_map") brought up the issue that inode->i_mapping may not point to the
    address space embedded within the inode at inode eviction time. The
    hugetlbfs truncate routine handles this by explicitly using inode->i_data.
    However, code cleaning up the resv_map will still use the address space
    pointed to by inode->i_mapping. Luckily, private_data is NULL for address
    spaces in all such cases today, but there is no guarantee this will
    continue.

    Change all hugetlbfs code getting a resv_map pointer to explicitly get it
    from the address space embedded within the inode. In addition, add more
    comments in the code to indicate why this is being done.
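    The resulting helper reads roughly as follows (a sketch):

    static inline struct resv_map *inode_resv_map(struct inode *inode)
    {
            /*
             * i_mapping may point elsewhere (e.g. a bdev's address space)
             * by eviction time; the address space embedded in the inode
             * (i_data) is the one hugetlbfs set up, so always use it.
             */
            return (struct resv_map *)(&inode->i_data)->private_data;
    }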

    Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Yufen Yu
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlb uses a fault mutex hash table to prevent concurrent page faults
    on the same pages. The key for shared and private mappings is
    different. Shared mappings key off the address_space and file index;
    private mappings key off the mm and virtual address. Consider a private
    mapping of a populated hugetlbfs file. A fault will map the page from
    the file and, if needed, do a COW to map a writable page.

    Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
    pages. It uses the address_space file index key. However, private
    mappings will use a different key and could race with this code to map
    the file page. This causes problems (BUG) for the page cache remove
    code as it expects the page to be unmapped. A sample stack is:

    page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
    kernel BUG at mm/filemap.c:169!
    ...
    RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
    ...
    Call Trace:
    __delete_from_page_cache+0x39/0x220
    delete_from_page_cache+0x45/0x70
    remove_inode_hugepages+0x13c/0x380
    ? __add_to_page_cache_locked+0x162/0x380
    hugetlbfs_fallocate+0x403/0x540
    ? _cond_resched+0x15/0x30
    ? __inode_security_revalidate+0x5d/0x70
    ? selinux_file_permission+0x100/0x130
    vfs_fallocate+0x13f/0x270
    ksys_fallocate+0x3c/0x80
    __x64_sys_fallocate+0x1a/0x20
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    There seems to be another potential COW issue/race with this approach
    of different private and shared keys as noted in commit 8382d914ebf7
    ("mm, hugetlb: improve page-fault scalability").

    Since every hugetlb mapping (even anon and private) is actually a file
    mapping, just use the address_space index key for all mappings. This
    results in potentially more hash collisions. However, this should not
    be the common case.
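    Before the fix, the key computation looked roughly like the sketch
    below; the change deletes the private-mapping branch so that every
    caller keys off (mapping, idx):

    static u32 fault_mutex_hash_sketch(struct hstate *h, struct mm_struct *mm,
                                       struct vm_area_struct *vma,
                                       struct address_space *mapping,
                                       pgoff_t idx, unsigned long address)
    {
            unsigned long key[2];

            if (vma->vm_flags & VM_SHARED) {
                    key[0] = (unsigned long)mapping;
                    key[1] = idx;
            } else {
                    /* removed: the private key races with the shared one */
                    key[0] = (unsigned long)mm;
                    key[1] = address >> huge_page_shift(h);
            }
            return jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0) &
                    (num_fault_mutexes - 1);
    }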

    Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
    Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: Joonsoo Kim
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

02 May, 2019

1 commit

  • Moving synchronous parts of ->destroy_inode() to ->evict_inode() is
    not possible here - they are balancing the stuff done in ->alloc_inode(),
    not the things acquired while using it or sanity checks.

    Signed-off-by: Al Viro

    Al Viro
     

06 Apr, 2019

1 commit

  • When mknod is used to create a block special file in hugetlbfs, it will
    allocate an inode and kmalloc a 'struct resv_map' via resv_map_alloc().
    inode->i_mapping->private_data will point to the newly allocated resv_map.
    However, when the device special file is opened bd_acquire() will set
    inode->i_mapping to bd_inode->i_mapping. Thus the pointer to the
    allocated resv_map is lost and the structure is leaked.

    Programs to reproduce:
    mount -t hugetlbfs nodev hugetlbfs
    mknod hugetlbfs/dev b 0 0
    exec 30<> hugetlbfs/dev
    umount hugetlbfs/

    resv_map structures are only needed for inodes which can have associated
    page allocations. To fix the leak, only allocate resv_map for those
    inodes which could possibly be associated with page allocations.
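    The fix, in outline (a sketch of hugetlbfs_get_inode()):

    struct resv_map *resv_map = NULL;

    /*
     * Reserve maps are only needed for inodes that can have associated
     * page allocations; device special files like the mknod'ed block
     * device above never do.
     */
    if (S_ISREG(mode) || S_ISLNK(mode)) {
            resv_map = resv_map_alloc();
            if (!resv_map)
                    return NULL;
    }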

    Link: http://lkml.kernel.org/r/20190401213101.16476-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Andrew Morton
    Reported-by: Yufen Yu
    Suggested-by: Yufen Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

13 Mar, 2019

1 commit

  • Pull vfs mount infrastructure updates from Al Viro:
    "The rest of core infrastructure; no new syscalls in that pile, but the
    old parts are switched to new infrastructure. At that point
    conversions of individual filesystems can happen independently; some
    are done here (afs, cgroup, procfs, etc.), there's also a large series
    outside of that pile dealing with NFS (quite a bit of option-parsing
    stuff is getting used there - it's one of the most convoluted
    filesystems in terms of mount-related logics), but NFS bits are the
    next cycle fodder.

    It got seriously simplified since the last cycle; documentation is
    probably the weakest bit at the moment - I considered dropping the
    commit introducing Documentation/filesystems/mount_api.txt (cutting
    the size increase by quarter ;-), but decided that it would be better
    to fix it up after -rc1 instead.

    That pile allows to do followup work in independent branches, which
    should make life much easier for the next cycle. fs/super.c size
    increase is unpleasant; there's a followup series that allows to
    shrink it considerably, but I decided to leave that until the next
    cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
    afs: Use fs_context to pass parameters over automount
    afs: Add fs_context support
    vfs: Add some logging to the core users of the fs_context log
    vfs: Implement logging through fs_context
    vfs: Provide documentation for new mount API
    vfs: Remove kern_mount_data()
    hugetlbfs: Convert to fs_context
    cpuset: Use fs_context
    kernfs, sysfs, cgroup, intel_rdt: Support fs_context
    cgroup: store a reference to cgroup_ns into cgroup_fs_context
    cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
    cgroup_do_mount(): massage calling conventions
    cgroup: stash cgroup_root reference into cgroup_fs_context
    cgroup2: switch to option-by-option parsing
    cgroup1: switch to option-by-option parsing
    cgroup: take options parsing into ->parse_monolithic()
    cgroup: fold cgroup1_mount() into cgroup1_get_tree()
    cgroup: start switching to fs_context
    ipc: Convert mqueue fs to fs_context
    proc: Add fs_context support to procfs
    ...

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Android uses ashmem for sharing memory regions. We are looking forward
    to migrating all usecases of ashmem to memfd so that we can possibly
    remove the ashmem driver in the future from staging while also
    benefiting from using memfd and contributing to it. Note staging
    drivers are also not ABI and generally can be removed at anytime.

    One of the main usecases Android has is the ability to create a region
    and mmap it as writeable, then add protection against making any
    "future" writes while keeping the existing already mmap'ed
    writeable-region active. This allows us to implement a usecase where
    receivers of the shared memory buffer can get a read-only view, while
    the sender continues to write to the buffer. See CursorWindow
    documentation in Android for more details:

    https://developer.android.com/reference/android/database/CursorWindow

    This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
    To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
    which prevents any future mmap and write syscalls from succeeding while
    keeping the existing mmap active.

    A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
    where we don't need to modify core VFS structures to get the same
    behavior of the seal. This solves several side-effects pointed out by
    Andy. Self-tests are provided in a later patch to verify the expected
    semantics.

    [1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/

    Thanks a lot to Andy for suggestions to improve code.
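    A small userspace sketch of the intended usage (error handling trimmed;
    requires a kernel and libc that expose memfd_create() and
    F_SEAL_FUTURE_WRITE):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = memfd_create("cursor", MFD_ALLOW_SEALING);
            ftruncate(fd, 4096);

            /* sender maps writable BEFORE sealing ... */
            char *w = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

            /* ... then forbids all FUTURE writes through the fd */
            fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE);

            strcpy(w, "sender still writes");  /* existing mapping works */

            /* a receiver's new writable mapping now fails */
            void *r = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            return r == MAP_FAILED ? 0 : 1;
    }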

    Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Acked-by: John Stultz
    Cc: Andy Lutomirski
    Cc: Minchan Kim
    Cc: Jann Horn
    Cc: Al Viro
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: J. Bruce Fields
    Cc: Jeff Layton
    Cc: Marc-André Lureau
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

02 Mar, 2019

1 commit

  • hugetlb pages should only be migrated if they are 'active'. The
    routines set/clear_page_huge_active() modify the active state of hugetlb
    pages.

    When a new hugetlb page is allocated at fault time, set_page_huge_active
    is called before the page is locked. Therefore, another thread could
    race and migrate the page while it is being added to the page table by
    the fault code. This race is somewhat hard to trigger, but can be seen
    by strategically adding udelay to simulate worst case scheduling
    behavior. Depending on 'how' the code races, various BUG()s could be
    triggered.

    To address this issue, simply delay the set_page_huge_active call until
    after the page is successfully added to the page table.

    Hugetlb pages can also be leaked at migration time if the pages are
    associated with a file in an explicitly mounted hugetlbfs filesystem.
    For example, consider a two node system with 4GB worth of huge pages
    available. A program mmaps a 2G file in a hugetlbfs filesystem. It
    then migrates the pages associated with the file from one node to
    another. When the program exits, huge page counts are as follows:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    0 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    That is as expected. 2G of huge pages are taken from the free_hugepages
    counts, and 2G is the size of the file in the explicitly mounted
    filesystem. If the file is then removed, the counts become:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    1024 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    Note that the filesystem still shows 2G of pages used, while there
    actually are no huge pages in use. The only way to 'fix' the filesystem
    accounting is to unmount the filesystem.

    If a hugetlb page is associated with an explicitly mounted filesystem,
    this information is contained in the page_private field. At migration
    time, this information is not preserved. To fix, simply transfer
    page_private from the old to the new page at migration time if necessary.
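    The transfer amounts to the following sketch, in the hugetlbfs
    migration callback:

    /*
     * For hugetlb pages, page_private holds the subpool pointer when the
     * page belongs to an explicitly mounted filesystem; carry it over so
     * the destination page stays accounted against that filesystem.
     */
    if (page_private(page)) {
            set_page_private(newpage, page_private(page));
            set_page_private(page, 0);
    }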

    There is a related race with removing a huge page from a file and
    migration. When a huge page is removed from the pagecache, the
    page_mapping() field is cleared, yet page_private remains set until the
    page is actually freed by free_huge_page(). A page could be migrated
    while in this state. However, since page_mapping() is not set, the
    hugetlbfs-specific routine to transfer page_private is not called and we
    leak the page count in the filesystem.

    To fix that, check for this condition before migrating a huge page. If
    the condition is detected, return EBUSY for the page.

    Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
    Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc:
    [mike.kravetz@oracle.com: v2]
    Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
    [mike.kravetz@oracle.com: update comment and changelog]
    Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

28 Feb, 2019

1 commit


09 Jan, 2019

1 commit

  • This reverts c86aa7bbfd5568ba8a82d3635d8f7b8a8e06fe54

    The reverted commit caused ABBA deadlocks when file migration raced with
    file eviction for specific hugetlbfs files. This was discovered with a
    modified version of the LTP move_pages12 test.

    The purpose of the reverted patch was to close a long-existing race
    between hugetlbfs file truncation and page faults. After more analysis
    of the patch and impacted code, it was determined that i_mmap_rwsem can
    not be used for all required synchronization. Therefore, revert this
    patch while working on another approach to the underlying issue.

    Link: http://lkml.kernel.org/r/20190103235452.29335-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

29 Dec, 2018

1 commit

  • hugetlbfs page faults can race with truncate and hole punch operations.
    Current code in the page fault path attempts to handle this by 'backing
    out' operations if we encounter the race. One obvious omission in the
    current code is removing a page newly added to the page cache. This is
    straightforward to address, but there is a more subtle and
    difficult issue of backing out hugetlb reservations. To handle this
    correctly, the 'reservation state' before page allocation needs to be
    noted so that it can be properly backed out. There are four distinct
    possibilities for reservation state: shared/reserved, shared/no-resv,
    private/reserved and private/no-resv. Backing out a reservation may
    require a memory allocation, which could fail, so that needs to be taken into
    account as well.

    Instead of writing the required complicated code for this rare occurrence,
    just eliminate the race. i_mmap_rwsem is now held in read mode for the
    duration of page fault processing. Hold i_mmap_rwsem longer in the
    truncation and hole punch code to cover the call to remove_inode_hugepages.

    With this modification, code in remove_inode_hugepages checking for races
    becomes 'dead' as the races can no longer happen. Remove the dead code
    and expand comments to explain the reasoning. Similarly, checks for
    races with truncation in the page fault path can be simplified and removed.

    [mike.kravetz@oracle.com: incorporate suggestions from Kirill]
    Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
    Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

23 Aug, 2018

1 commit

  • Zero out the vma in vma_init() rather than in vm_area_alloc(), to ensure
    that the various oddball stack-based vmas are in a good state. Some of
    the callers were zeroing them out, others were not.
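    After the change, vma_init() looks roughly like:

    static inline void vma_init(struct vm_area_struct *vma,
                                struct mm_struct *mm)
    {
            static const struct vm_operations_struct dummy_vm_ops = {};

            memset(vma, 0, sizeof(*vma));  /* moved from vm_area_alloc() */
            vma->vm_mm = mm;
            vma->vm_ops = &dummy_vm_ops;
            INIT_LIST_HEAD(&vma->anon_vma_chain);
    }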

    Acked-by: Kirill A. Shutemov
    Cc: Russell King
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

14 Aug, 2018

1 commit

  • Pull vfs open-related updates from Al Viro:

    - "do we need fput() or put_filp()" rules are gone - it's always fput()
    now. We keep track of that state where it belongs - in ->f_mode.

    - int *opened mess killed - in finish_open(), in ->atomic_open()
    instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

    - alloc_file() wrappers with saner calling conventions are introduced
    (alloc_file_clone() and alloc_file_pseudo()); callers converted, with
    much simplification.

    - while we are at it, saner calling conventions for path_init() and
    link_path_walk(), simplifying things inside fs/namei.c (both on
    open-related paths and elsewhere).

    * 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    few more cleanups of link_path_walk() callers
    allow link_path_walk() to take ERR_PTR()
    make path_init() unconditionally paired with terminate_walk()
    document alloc_file() changes
    make alloc_file() static
    do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
    new helper: alloc_file_clone()
    create_pipe_files(): switch the first allocation to alloc_file_pseudo()
    anon_inode_getfile(): switch to alloc_file_pseudo()
    hugetlb_file_setup(): switch to alloc_file_pseudo()
    ocxlflash_getfile(): switch to alloc_file_pseudo()
    cxl_getfile(): switch to alloc_file_pseudo()
    ... and switch shmem_file_setup() to alloc_file_pseudo()
    __shmem_file_setup(): reorder allocations
    new wrapper: alloc_file_pseudo()
    kill FILE_{CREATED,OPENED}
    switch atomic_open() and lookup_open() to returning 0 in all success cases
    document ->atomic_open() changes
    ->atomic_open(): return 0 in all success cases
    get rid of 'opened' in path_openat() and the helpers downstream
    ...

    Linus Torvalds
     

27 Jul, 2018

1 commit

  • Make sure to initialize all VMAs properly, not only those which come
    from vm_area_cachep.

    Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Jul, 2018

2 commits


06 Apr, 2018

1 commit

  • This is a fix for a regression in 32 bit kernels caused by an invalid
    check for pgoff overflow in hugetlbfs mmap setup. The check incorrectly
    specified that the size of a loff_t was the same as the size of a long.
    The regression prevents mapping hugetlbfs files at offsets greater than
    4GB on 32 bit kernels.

    On 32 bit kernels, conversion from a page-based unsigned long cannot
    overflow a loff_t byte offset. Therefore, skip this check if
    sizeof(unsigned long) != sizeof(loff_t).
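    The guard amounts to (a sketch of the check in hugetlbfs mmap setup):

    /*
     * A page-based vm_pgoff shifted into a 64-bit loff_t cannot overflow
     * when unsigned long is 32 bits, so the overflow check only applies
     * when the two types are the same size.
     */
    if (sizeof(unsigned long) == sizeof(loff_t)) {
            if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
                    return -EINVAL;
    }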

    Link: http://lkml.kernel.org/r/20180330145402.5053-1-mike.kravetz@oracle.com
    Fixes: 63489f8e8211 ("hugetlbfs: check for pgoff value overflow")
    Reported-by: Dan Rue
    Signed-off-by: Mike Kravetz
    Tested-by: Anders Roxell
    Cc: Michal Hocko
    Cc: Yisheng Xie
    Cc: "Kirill A . Shutemov"
    Cc: Nic Losby
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

23 Mar, 2018

1 commit

  • A vma with vm_pgoff large enough to overflow a loff_t type when
    converted to a byte offset can be passed via the remap_file_pages system
    call. The hugetlbfs mmap routine uses the byte offset to calculate
    reservations and file size.

    A sequence such as:

    mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
    remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

    will result in the following when the task exits and the file is closed:

    kernel BUG at mm/hugetlb.c:749!
    Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The overflowed pgoff value causes hugetlbfs to try to set up a mapping
    with a negative range (end < start) that leaves invalid state which
    causes the BUG.

    The previous overflow fix to this code was incomplete and did not take
    the remap_file_pages system call into account.
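    The added defense is roughly the following sketch, where
    PGOFF_LOFFT_MAX marks the vm_pgoff bits that would overflow a signed
    64-bit byte offset once shifted by PAGE_SHIFT:

    #define PGOFF_LOFFT_MAX \
            (((1UL << (PAGE_SHIFT + 1)) - 1) << \
             (BITS_PER_LONG - (PAGE_SHIFT + 1)))

    /* reject any pgoff whose byte offset cannot fit in loff_t */
    if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
            return -EINVAL;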

    [mike.kravetz@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
    [akpm@linux-foundation.org: include mmdebug.h]
    [akpm@linux-foundation.org: fix -ve left shift count on sh]
    Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
    Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
    Signed-off-by: Mike Kravetz
    Reported-by: Nic Losby
    Acked-by: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Yisheng Xie
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

01 Feb, 2018

2 commits

  • Implements memfd sealing for hugetlbfs, similar to shmem:
    - WRITE: deny fallocate(PUNCH_HOLE). mmap() write is denied in
      memfd_add_seals(). write() doesn't exist for hugetlbfs.
    - SHRINK: added a check similar to shmem_setattr()
    - GROW: added checks similar to shmem_setattr() & shmem_fallocate()

    Except for the write() operation, which doesn't exist for hugetlbfs,
    this should make sealing as close as it can be to the shmem support.
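    From userspace this enables combining MFD_HUGETLB with sealing, e.g.
    (a sketch; it assumes a 2M default huge page size and free huge pages
    being available):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = memfd_create("sealed-huge",
                                  MFD_HUGETLB | MFD_ALLOW_SEALING);
            if (fd < 0)
                    return 1;
            ftruncate(fd, 2 * 1024 * 1024);
            /* freeze the size: shrinking and growing are now denied */
            fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW);
            return 0;
    }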

    Link: http://lkml.kernel.org/r/20171107122800.25517-5-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     
  • hugetlbfs inode information will need to be accessed by code in
    mm/shmem.c for file sealing operations. Move inode information
    definition from .c file to header for needed access.

    Link: http://lkml.kernel.org/r/20171107122800.25517-4-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     

30 Nov, 2017

1 commit

  • hugetlbfs_fallocate() currently performs put_page() before unlock_page().
    This scenario opens a small time window, from the time the page is added
    to the page cache until it is unlocked, in which the page might be
    removed from the page cache by another core. If the page is removed
    during this time window, it might cause memory corruption, as the
    wrong page will be unlocked.

    It is arguable whether this scenario can happen in a real system, and
    there are several mitigating factors. The issue was found by code
    inspection (actually grep), and not by actually triggering the flow.
    Yet, since putting the page before unlocking it is incorrect, it should
    be fixed, if only to prevent future breakage or someone copy-pasting
    this code.
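    The fix is just reordering the two calls (a sketch):

    /* was: put_page(page); unlock_page(page); */
    unlock_page(page);
    put_page(page);     /* drop the reference only after unlocking */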

    Mike said:
    "I am of the opinion that this does not need to be sent to stable.
    Although the ordering is current code is incorrect, there is no way
    for this to be a problem with current locking. In addition, I verified
    that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
    for hugetlbfs and other filesystems is addressed in 3a77d214807c ("mm:
    fadvise: avoid fadvise for fs without backing device")"

    Link: http://lkml.kernel.org/r/20170826191124.51642-1-namit@vmware.com
    Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
    Signed-off-by: Nadav Amit
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Eric Biggers
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

16 Nov, 2017

2 commits

  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
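    Callers change mechanically (a sketch):

    struct pagevec pvec;

    /* before: pagevec_init(&pvec, 0);  -- the cold flag was meaningless */
    pagevec_init(&pvec);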

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There is no need to have a local return code set with -EINVAL when both
    the conditions following it return error codes appropriately. Just
    remove the redundant one.

    Link: http://lkml.kernel.org/r/20170929145444.17611-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

03 Nov, 2017

1 commit

  • Calling madvise(MADV_HWPOISON) on a hugetlbfs page will result in bad
    (negative) reserved huge page counts. This may not happen immediately,
    but may happen later when the underlying file is removed or filesystem
    unmounted. For example:

    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 0
    HugePages_Rsvd: 18446744073709551615
    HugePages_Surp: 0
    Hugepagesize: 2048 kB

    In routine hugetlbfs_error_remove_page(), hugetlb_fix_reserve_counts is
    called after remove_huge_page. However, hugetlb_fix_reserve_counts is
    designed to be called only if a failure is returned from
    hugetlb_unreserve_pages. Therefore, call hugetlb_unreserve_pages as
    required, and only call hugetlb_fix_reserve_counts in the unlikely event
    that hugetlb_unreserve_pages returns an error.
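    The error-remove path becomes, in outline (a sketch):

    remove_huge_page(page);

    /* adjust reserves; fix up counts only if the unreserve itself fails */
    if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
            hugetlb_fix_reserve_counts(inode);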

    Link: http://lkml.kernel.org/r/20171019230007.17043-2-mike.kravetz@oracle.com
    Fixes: 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: Anshuman Khandual
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz