05 Jan, 2020

2 commits

  • commit 15f0ec941f4f908fefa23a30ded8358977cc1cc0 upstream.

    LTP memfd_create04 started failing for some huge page sizes
    after v5.4-10135-gc3bfc5dd73c6.

    The problem is the check introduced in the for_each_hstate() loop that
    should skip default_hstate_idx. Since it doesn't update the 'i' counter,
    all subsequent huge page sizes are skipped as well.
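
    A minimal sketch of the fixed loop shape implied by the changelog
    (identifiers taken from the commit text; this is not the verbatim
    upstream diff):

        for_each_hstate(h) {
            if (i == default_hstate_idx) {
                i++;    /* the fix: advance 'i' even when skipping */
                continue;
            }
            mnt = mount_one_hugetlbfs(h);
            hugetlbfs_vfsmount[i++] = IS_ERR(mnt) ? NULL : mnt;
        }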

    Fixes: 8fc312b32b25 ("mm/hugetlbfs: fix error handling when setting up mounts")
    Signed-off-by: Jan Stancek
    Reviewed-by: Mike Kravetz
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Stancek
     
  • [ Upstream commit 8fc312b32b25c6b0a8b46fab4df8c68df5af1223 ]

    It is assumed that the hugetlbfs_vfsmount[] array will contain either a
    valid vfsmount pointer or NULL for each hstate after initialization.
    Changes made while converting to use fs_context broke this assumption.

    While fixing the hugetlbfs_vfsmount issue, it was discovered that
    init_hugetlbfs_fs never did correctly clean up when encountering a vfs
    mount error.

    It was found during code inspection. A small memory allocation failure
    would be the most likely cause of taking an error path with the bug.
    This is unlikely to happen, as this is early init code.

    Link: http://lkml.kernel.org/r/94b6244d-2c24-e269-b12c-e3ba694b242d@oracle.com
    Reported-by: Chengguang Xu
    Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context")
    Signed-off-by: Mike Kravetz
    Cc: David Howells
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Mike Kravetz
     

20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

15 May, 2019

2 commits

  • Continuing discussion about 58b6e5e8f1ad ("hugetlbfs: fix memory leak for
    resv_map") brought up the issue that inode->i_mapping may not point to the
    address space embedded within the inode at inode eviction time. The
    hugetlbfs truncate routine handles this by explicitly using inode->i_data.
    However, code cleaning up the resv_map will still use the address space
    pointed to by inode->i_mapping. Luckily, private_data is NULL for address
    spaces in all such cases today, but there is no guarantee this will
    continue.

    Change all hugetlbfs code getting a resv_map pointer to explicitly get it
    from the address space embedded within the inode. In addition, add more
    comments in the code to indicate why this is being done.

    Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Yufen Yu
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlb uses a fault mutex hash table to prevent concurrent page faults
    on the same pages. The key for shared and private mappings is
    different. Shared mappings key off the address_space and file index;
    private mappings key off the mm and virtual address. Consider a private
    mapping of a populated hugetlbfs file. A fault will map the page from
    the file and, if needed, do a COW to map a writable page.

    Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
    pages. It uses the address_space file index key. However, private
    mappings will use a different key and could race with this code to map
    the file page. This causes problems (BUG) for the page cache remove
    code as it expects the page to be unmapped. A sample stack is:

    page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
    kernel BUG at mm/filemap.c:169!
    ...
    RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
    ...
    Call Trace:
    __delete_from_page_cache+0x39/0x220
    delete_from_page_cache+0x45/0x70
    remove_inode_hugepages+0x13c/0x380
    ? __add_to_page_cache_locked+0x162/0x380
    hugetlbfs_fallocate+0x403/0x540
    ? _cond_resched+0x15/0x30
    ? __inode_security_revalidate+0x5d/0x70
    ? selinux_file_permission+0x100/0x130
    vfs_fallocate+0x13f/0x270
    ksys_fallocate+0x3c/0x80
    __x64_sys_fallocate+0x1a/0x20
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    There seems to be another potential COW issue/race with this approach
    of different private and shared keys as noted in commit 8382d914ebf7
    ("mm, hugetlb: improve page-fault scalability").

    Since every hugetlb mapping (even anon and private) is actually a file
    mapping, just use the address_space index key for all mappings. This
    results in potentially more hash collisions. However, this should not
    be the common case.
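
    A sketch of the unified key (modeled on the kernel's
    hugetlb_fault_mutex_hash(); exact parameters vary across releases):

        static u32 fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
        {
            /* same (mapping, index) pair for shared and private faults,
             * so hole punch and page faults serialize on one mutex */
            unsigned long key[2] = { (unsigned long)mapping, idx };

            return jhash2((u32 *)key, sizeof(key) / sizeof(u32), 0) &
                   (num_fault_mutexes - 1);
        }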

    Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
    Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: Joonsoo Kim
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

02 May, 2019

1 commit

  • moving synchronous parts of ->destroy_inode() to ->evict_inode() is
    not possible here - they are balancing the stuff done in ->alloc_inode(),
    not the things acquired while using it or sanity checks.

    Signed-off-by: Al Viro

    Al Viro
     

06 Apr, 2019

1 commit

  • When mknod is used to create a block special file in hugetlbfs, it will
    allocate an inode and kmalloc a 'struct resv_map' via resv_map_alloc().
    inode->i_mapping->private_data will point to the newly allocated resv_map.
    However, when the device special file is opened bd_acquire() will set
    inode->i_mapping to bd_inode->i_mapping. Thus the pointer to the
    allocated resv_map is lost and the structure is leaked.

    Steps to reproduce:
    mount -t hugetlbfs nodev hugetlbfs
    mknod hugetlbfs/dev b 0 0
    exec 30<> hugetlbfs/dev
    umount hugetlbfs/

    resv_map structures are only needed for inodes which can have associated
    page allocations. To fix the leak, only allocate resv_map for those
    inodes which could possibly be associated with page allocations.
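
    A sketch of the resulting check in hugetlbfs_get_inode() (simplified
    from the changelog; not the verbatim diff):

        /* only inodes that can carry page allocations need a resv_map;
         * device nodes created via mknod get none, so nothing is lost
         * when bd_acquire() later repoints i_mapping */
        if (S_ISREG(mode) || S_ISLNK(mode)) {
            resv_map = resv_map_alloc();
            if (!resv_map)
                return NULL;
        }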

    Link: http://lkml.kernel.org/r/20190401213101.16476-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Andrew Morton
    Reported-by: Yufen Yu
    Suggested-by: Yufen Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

13 Mar, 2019

1 commit

  • Pull vfs mount infrastructure updates from Al Viro:
    "The rest of core infrastructure; no new syscalls in that pile, but the
    old parts are switched to new infrastructure. At that point
    conversions of individual filesystems can happen independently; some
    are done here (afs, cgroup, procfs, etc.), there's also a large series
    outside of that pile dealing with NFS (quite a bit of option-parsing
    stuff is getting used there - it's one of the most convoluted
    filesystems in terms of mount-related logics), but NFS bits are the
    next cycle fodder.

    It got seriously simplified since the last cycle; documentation is
    probably the weakest bit at the moment - I considered dropping the
    commit introducing Documentation/filesystems/mount_api.txt (cutting
    the size increase by quarter ;-), but decided that it would be better
    to fix it up after -rc1 instead.

    That pile allows to do followup work in independent branches, which
    should make life much easier for the next cycle. fs/super.c size
    increase is unpleasant; there's a followup series that allows to
    shrink it considerably, but I decided to leave that until the next
    cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
    afs: Use fs_context to pass parameters over automount
    afs: Add fs_context support
    vfs: Add some logging to the core users of the fs_context log
    vfs: Implement logging through fs_context
    vfs: Provide documentation for new mount API
    vfs: Remove kern_mount_data()
    hugetlbfs: Convert to fs_context
    cpuset: Use fs_context
    kernfs, sysfs, cgroup, intel_rdt: Support fs_context
    cgroup: store a reference to cgroup_ns into cgroup_fs_context
    cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
    cgroup_do_mount(): massage calling conventions
    cgroup: stash cgroup_root reference into cgroup_fs_context
    cgroup2: switch to option-by-option parsing
    cgroup1: switch to option-by-option parsing
    cgroup: take options parsing into ->parse_monolithic()
    cgroup: fold cgroup1_mount() into cgroup1_get_tree()
    cgroup: start switching to fs_context
    ipc: Convert mqueue fs to fs_context
    proc: Add fs_context support to procfs
    ...

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Android uses ashmem for sharing memory regions. We are looking forward
    to migrating all usecases of ashmem to memfd so that we can possibly
    remove the ashmem driver in the future from staging while also
    benefiting from using memfd and contributing to it. Note staging
    drivers are also not ABI and generally can be removed at any time.

    One of the main usecases Android has is the ability to create a region
    and mmap it as writeable, then add protection against making any
    "future" writes while keeping the existing already mmap'ed
    writeable-region active. This allows us to implement a usecase where
    receivers of the shared memory buffer can get a read-only view, while
    the sender continues to write to the buffer. See CursorWindow
    documentation in Android for more details:

    https://developer.android.com/reference/android/database/CursorWindow

    This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
    To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
    which prevents any future mmap and write syscalls from succeeding while
    keeping the existing mmap active.
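
    A hedged userspace sketch of the intended usage (assumes a libc that
    exposes memfd_create(); F_SEAL_FUTURE_WRITE is defined manually in
    case the headers predate this patch):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef F_SEAL_FUTURE_WRITE
        #define F_SEAL_FUTURE_WRITE 0x0010
        #endif

        int main(void)
        {
            int fd = memfd_create("cursor-window", MFD_ALLOW_SEALING);
            char *w, *r;

            if (fd < 0 || ftruncate(fd, 4096) < 0)
                return 1;

            /* the sender maps the region writable first ... */
            w = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (w == MAP_FAILED)
                return 1;

            /* ... then seals future writes: the existing writable mapping
             * stays live, but new writable mmap()/write() attempts fail */
            if (fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE) < 0)
                return 1;

            strcpy(w, "still writable via the old mapping");

            /* receivers can only obtain a read-only view */
            r = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
            printf("%s\n", r);
            return 0;
        }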

    A better way to implement the F_SEAL_FUTURE_WRITE seal was discussed [1]
    last week, one that doesn't need to modify core VFS structures to get
    the same behavior. This solves several side effects pointed out by Andy.
    Self-tests are provided in a later patch to verify the expected
    semantics.

    [1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/

    Thanks a lot to Andy for suggestions to improve the code.

    Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Acked-by: John Stultz
    Cc: Andy Lutomirski
    Cc: Minchan Kim
    Cc: Jann Horn
    Cc: Al Viro
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: J. Bruce Fields
    Cc: Jeff Layton
    Cc: Marc-André Lureau
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

02 Mar, 2019

1 commit

  • hugetlb pages should only be migrated if they are 'active'. The
    routines set/clear_page_huge_active() modify the active state of hugetlb
    pages.

    When a new hugetlb page is allocated at fault time, set_page_huge_active
    is called before the page is locked. Therefore, another thread could
    race and migrate the page while it is being added to the page table by the
    fault code. This race is somewhat hard to trigger, but can be seen by
    strategically adding udelay to simulate worst case scheduling behavior.
    Depending on 'how' the code races, various BUG()s could be triggered.

    To address this issue, simply delay the set_page_huge_active call until
    after the page is successfully added to the page table.

    Hugetlb pages can also be leaked at migration time if the pages are
    associated with a file in an explicitly mounted hugetlbfs filesystem.
    For example, consider a two node system with 4GB worth of huge pages
    available. A program mmaps a 2G file in a hugetlbfs filesystem. It
    then migrates the pages associated with the file from one node to
    another. When the program exits, huge page counts are as follows:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    0 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    That is as expected. 2G of huge pages are taken from the free_hugepages
    counts, and 2G is the size of the file in the explicitly mounted
    filesystem. If the file is then removed, the counts become:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    1024 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    Note that the filesystem still shows 2G of pages used, while there
    actually are no huge pages in use. The only way to 'fix' the filesystem
    accounting is to unmount the filesystem.

    If a hugetlb page is associated with an explicitly mounted filesystem,
    this information is contained in the page_private field. At migration
    time, this information is not preserved. To fix this, simply transfer
    page_private from the old to the new page at migration time if necessary.

    There is a related race with removing a huge page from a file and
    migration. When a huge page is removed from the pagecache, the
    page_mapping() field is cleared, yet page_private remains set until the
    page is actually freed by free_huge_page(). A page could be migrated
    while in this state. However, since page_mapping() is not set, the
    hugetlbfs-specific routine to transfer page_private is not called and we
    leak the page count in the filesystem.

    To fix that, check for this condition before migrating a huge page. If
    the condition is detected, return EBUSY for the page.
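
    A sketch of the transfer in the hugetlbfs migration callback
    (simplified; the field use follows the description above):

        /* page_private carries the subpool pointer for hugetlb pages;
         * move it so filesystem accounting follows the new page */
        if (page_private(page)) {
            set_page_private(newpage, page_private(page));
            set_page_private(page, 0);
        }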

    Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
    Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc:
    [mike.kravetz@oracle.com: v2]
    Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
    [mike.kravetz@oracle.com: update comment and changelog]
    Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

09 Jan, 2019

1 commit

  • This reverts c86aa7bbfd5568ba8a82d3635d8f7b8a8e06fe54

    The reverted commit caused ABBA deadlocks when file migration raced with
    file eviction for specific hugetlbfs files. This was discovered with a
    modified version of the LTP move_pages12 test.

    The purpose of the reverted patch was to close a long-standing race
    between hugetlbfs file truncation and page faults. After more analysis
    of the patch and impacted code, it was determined that i_mmap_rwsem
    cannot be used for all required synchronization. Therefore, revert this
    patch while working on another approach to the underlying issue.

    Link: http://lkml.kernel.org/r/20190103235452.29335-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

29 Dec, 2018

1 commit

  • hugetlbfs page faults can race with truncate and hole punch operations.
    Current code in the page fault path attempts to handle this by 'backing
    out' operations if we encounter the race. One obvious omission in the
    current code is removing a page newly added to the page cache. This is
    pretty straightforward to address, but there is a more subtle and
    difficult issue of backing out hugetlb reservations. To handle this
    correctly, the 'reservation state' before page allocation needs to be
    noted so that it can be properly backed out. There are four distinct
    possibilities for reservation state: shared/reserved, shared/no-resv,
    private/reserved and private/no-resv. Backing out a reservation may
    require memory allocation which could fail so that needs to be taken into
    account as well.

    Instead of writing the required complicated code for this rare occurrence,
    just eliminate the race. i_mmap_rwsem is now held in read mode for the
    duration of page fault processing. Hold i_mmap_rwsem longer in truncation
    and hole punch code to cover the call to remove_inode_hugepages.

    With this modification, code in remove_inode_hugepages checking for
    races becomes 'dead', as the race can no longer happen. Remove the dead
    code and expand comments to explain the reasoning. Similarly, checks for
    races with truncation in the page fault path can be simplified and
    removed.
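
    A sketch of the resulting locking scheme (helper names from
    include/linux/mm.h; surrounding code elided):

        /* page fault path: held shared for the whole fault */
        i_mmap_lock_read(mapping);
        /* ... find or allocate the page, install the pte ... */
        i_mmap_unlock_read(mapping);

        /* truncate / hole punch: held exclusive across page removal */
        i_mmap_lock_write(mapping);
        /* ... unmap, then remove_inode_hugepages() ... */
        i_mmap_unlock_write(mapping);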

    [mike.kravetz@oracle.com: incorporate suggestions from Kirill]
    Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
    Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

23 Aug, 2018

1 commit

  • Rather than in vm_area_alloc(), to ensure that the various oddball
    stack-based vmas are in a good state. Some of the callers were zeroing
    them out, others were not.

    Acked-by: Kirill A. Shutemov
    Cc: Russell King
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

14 Aug, 2018

1 commit

  • Pull vfs open-related updates from Al Viro:

    - "do we need fput() or put_filp()" rules are gone - it's always fput()
    now. We keep track of that state where it belongs - in ->f_mode.

    - int *opened mess killed - in finish_open(), in ->atomic_open()
    instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

    - alloc_file() wrappers with saner calling conventions are introduced
    (alloc_file_clone() and alloc_file_pseudo()); callers converted, with
    much simplification.

    - while we are at it, saner calling conventions for path_init() and
    link_path_walk(), simplifying things inside fs/namei.c (both on
    open-related paths and elsewhere).

    * 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    few more cleanups of link_path_walk() callers
    allow link_path_walk() to take ERR_PTR()
    make path_init() unconditionally paired with terminate_walk()
    document alloc_file() changes
    make alloc_file() static
    do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
    new helper: alloc_file_clone()
    create_pipe_files(): switch the first allocation to alloc_file_pseudo()
    anon_inode_getfile(): switch to alloc_file_pseudo()
    hugetlb_file_setup(): switch to alloc_file_pseudo()
    ocxlflash_getfile(): switch to alloc_file_pseudo()
    cxl_getfile(): switch to alloc_file_pseudo()
    ... and switch shmem_file_setup() to alloc_file_pseudo()
    __shmem_file_setup(): reorder allocations
    new wrapper: alloc_file_pseudo()
    kill FILE_{CREATED,OPENED}
    switch atomic_open() and lookup_open() to returning 0 in all success cases
    document ->atomic_open() changes
    ->atomic_open(): return 0 in all success cases
    get rid of 'opened' in path_openat() and the helpers downstream
    ...

    Linus Torvalds
     

27 Jul, 2018

1 commit

  • Make sure to initialize all VMAs properly, not only those which come
    from vm_area_cachep.

    Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Apr, 2018

1 commit

  • This is a fix for a regression in 32-bit kernels caused by an invalid
    check for pgoff overflow in the hugetlbfs mmap setup. The check
    incorrectly assumed that the size of a loff_t was the same as the size
    of a long. The regression prevents mapping hugetlbfs files at offsets
    greater than 4GB on 32-bit kernels.

    On 32-bit kernels, conversion from a page-based unsigned long cannot
    overflow a loff_t byte offset. Therefore, skip this check if
    sizeof(unsigned long) != sizeof(loff_t).
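
    A sketch of the corrected guard (PGOFF_LOFFT_MAX comes from the
    earlier overflow fix; simplified):

        /* a page-based offset can only overflow a loff_t where the two
         * types are the same width, so 32-bit kernels skip the check */
        if (sizeof(unsigned long) == sizeof(loff_t)) {
            if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
                return -EINVAL;
        }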

    Link: http://lkml.kernel.org/r/20180330145402.5053-1-mike.kravetz@oracle.com
    Fixes: 63489f8e8211 ("hugetlbfs: check for pgoff value overflow")
    Reported-by: Dan Rue
    Signed-off-by: Mike Kravetz
    Tested-by: Anders Roxell
    Cc: Michal Hocko
    Cc: Yisheng Xie
    Cc: "Kirill A . Shutemov"
    Cc: Nic Losby
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

23 Mar, 2018

1 commit

  • A vma with vm_pgoff large enough to overflow a loff_t type when
    converted to a byte offset can be passed via the remap_file_pages system
    call. The hugetlbfs mmap routine uses the byte offset to calculate
    reservations and file size.

    A sequence such as:

    mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
    remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

    will result in the following when the task exits or the file is closed:

    kernel BUG at mm/hugetlb.c:749!
    Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The overflowed pgoff value causes hugetlbfs to try to set up a mapping
    with a negative range (end < start) that leaves invalid state which
    causes the BUG.

    The previous overflow fix to this code was incomplete and did not take
    the remap_file_pages system call into account.

    [mike.kravetz@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
    [akpm@linux-foundation.org: include mmdebug.h]
    [akpm@linux-foundation.org: fix -ve left shift count on sh]
    Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
    Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
    Signed-off-by: Mike Kravetz
    Reported-by: Nic Losby
    Acked-by: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Yisheng Xie
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

01 Feb, 2018

2 commits

  • Implements memfd sealing, similar to shmem:
    - WRITE: deny fallocate(PUNCH_HOLE). mmap() write is denied in
    memfd_add_seals(). write() doesn't exist for hugetlbfs.
    - SHRINK: added a check similar to shmem_setattr()
    - GROW: added a check similar to shmem_setattr() & shmem_fallocate()

    Except for the write() operation, which doesn't exist for hugetlbfs,
    this should make sealing as close as it can be to the shmem support.
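
    A sketch of the size checks mirroring shmem_setattr() ('info' is
    assumed to be the hugetlbfs inode info that now holds the seals):

        if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
            (newsize > oldsize && (info->seals & F_SEAL_GROW)))
            return -EPERM;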

    Link: http://lkml.kernel.org/r/20171107122800.25517-5-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     
  • hugetlbfs inode information will need to be accessed by code in
    mm/shmem.c for file sealing operations. Move the inode information
    definition from the .c file to the header to provide that access.

    Link: http://lkml.kernel.org/r/20171107122800.25517-4-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     

30 Nov, 2017

1 commit

  • hugetlbfs_fallocate() currently performs put_page() before unlock_page().
    This opens a small time window, from the time the page is added to the
    page cache until it is unlocked, in which the page might be removed
    from the page cache by another core. If the page is removed during this
    time window, it might cause memory corruption, as the wrong page will
    be unlocked.

    It is arguable whether this scenario can happen in a real system, and
    there are several mitigating factors. The issue was found by code
    inspection (actually grep), and not by actually triggering the flow.
    Yet, since putting the page before unlocking is incorrect, it should be
    fixed, if only to prevent future breakage or someone copy-pasting this
    code.
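
    For reference, the safe ordering (drop the lock while the reference
    still pins the page):

        unlock_page(page);    /* we still hold a reference here */
        put_page(page);       /* only now may the page be freed */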

    Mike said:
    "I am of the opinion that this does not need to be sent to stable.
    Although the ordering in the current code is incorrect, there is no way
    for this to be a problem with current locking. In addition, I verified
    that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
    for hugetlbfs and other filesystems is addressed in 3a77d214807c ("mm:
    fadvise: avoid fadvise for fs without backing device")"

    Link: http://lkml.kernel.org/r/20170826191124.51642-1-namit@vmware.com
    Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
    Signed-off-by: Nadav Amit
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Eric Biggers
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

16 Nov, 2017

2 commits

  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
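
    A sketch of the call-site change implied by the changelog:

        struct pagevec pvec;

        pagevec_init(&pvec);    /* was: pagevec_init(&pvec, cold); */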

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There is no need to have a local return code set with -EINVAL when both
    the conditions following it return error codes appropriately. Just
    remove the redundant one.

    Link: http://lkml.kernel.org/r/20170929145444.17611-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

03 Nov, 2017

1 commit

  • Calling madvise(MADV_HWPOISON) on a hugetlbfs page will result in bad
    (negative) reserved huge page counts. This may not happen immediately,
    but may happen later when the underlying file is removed or filesystem
    unmounted. For example:

    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 0
    HugePages_Rsvd: 18446744073709551615
    HugePages_Surp: 0
    Hugepagesize: 2048 kB

    In routine hugetlbfs_error_remove_page(), hugetlb_fix_reserve_counts is
    called after remove_huge_page. hugetlb_fix_reserve_counts is designed
    to be called/used only if a failure is returned from
    hugetlb_unreserve_pages. Therefore, call hugetlb_unreserve_pages as
    required and only call hugetlb_fix_reserve_counts in the unlikely event
    that hugetlb_unreserve_pages returns an error.
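
    A sketch of the fixed error path (simplified from the changelog; not
    the verbatim diff):

        remove_huge_page(page);
        if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
            hugetlb_fix_reserve_counts(inode);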

    Link: http://lkml.kernel.org/r/20171019230007.17043-2-mike.kravetz@oracle.com
    Fixes: 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: Anshuman Khandual
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

09 Sep, 2017

2 commits

  • Allow interval trees to quickly check for overlaps to avoid unnecessary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().
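
    A sketch of the cached-root usage (types from
    include/linux/interval_tree.h after this series):

        struct rb_root_cached root = RB_ROOT_CACHED;
        struct interval_tree_node node = { .start = 10, .last = 20 };

        interval_tree_insert(&node, &root);
        /* the leftmost node is now available in O(1) via
         * rb_first_cached(), so empty/overlap checks avoid a descent */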

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Introduce a new migration mode that allows offloading the copy to a
    device DMA engine. This changes the workflow of migration, and not all
    address_space migratepage callbacks can support it.

    This is intended to be used by migrate_vma(), which itself is used for
    things like HMM (see include/linux/hmm.h).

    No additional per-filesystem migratepage testing is needed.
    MIGRATE_SYNC_NO_COPY is disabled in all problematic migratepage()
    callbacks, with comments added to explain why (part of this patch). Any
    callback that wishes to support this new mode needs to be aware of the
    difference in the migration flow from the other modes.

    Some of these callbacks do extra locking while copying (aio, zsmalloc,
    balloon, ...), and for DMA to be effective you want to copy multiple
    pages in one DMA operation. But in the problematic cases you cannot
    easily hold the extra lock across multiple calls to this callback.

    The usual flow is:

    For each page {
        1 - lock page
        2 - call migratepage() callback
        3 - (extra locking in some migratepage() callbacks)
        4 - migrate page state (freeze refcount, update page cache,
            buffer head, ...)
        5 - copy page
        6 - (unlock any extra lock of the migratepage() callback)
        7 - return from migratepage() callback
        8 - unlock page
    }

    The new mode MIGRATE_SYNC_NO_COPY:

    1 - lock multiple pages
    For each page {
        2 - call migratepage() callback
        3 - abort in all problematic migratepage() callbacks
        4 - migrate page state (freeze refcount, update page cache,
            buffer head, ...)
    } // finished all calls to migratepage() callback
    5 - DMA copy multiple pages
    6 - unlock all the pages

    To support MIGRATE_SYNC_NO_COPY in the problematic cases we would need
    a new callback, migratepages() (for instance), that deals with multiple
    pages in one transaction.

    Because the problematic cases are not important for current usage, I did
    not want to complicate this patchset even more for no good reason.

    Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

3 commits

  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages.

    Just drop the argument.

    Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We want only pages from a given range in remove_inode_hugepages(). Use
    pagevec_lookup_range() instead of pagevec_lookup().
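
    A sketch of the resulting loop shape (using the interfaces as they
    look after this series and the later pagevec_init() cleanup):

        struct pagevec pvec;
        pgoff_t next = start;

        pagevec_init(&pvec);
        while (pagevec_lookup_range(&pvec, mapping, &next, end - 1)) {
            /* ... drop each returned page ... */
            pagevec_release(&pvec);
        }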

    Link: http://lkml.kernel.org/r/20170726114704.7626-8-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Mike Kravetz
    Cc: Nadia Yvette Chambers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and underlying find_get_pages()) update index to
    the next page where iteration should continue. Most callers want this
    and also pagevec_lookup_tag() already does this.

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data

    Linus Torvalds
     

11 Jul, 2017

1 commit

  • Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
    keep the error hugepage away from the system, which is OK but not good
    enough because the hugepage still has a refcount and unpoison doesn't
    work on the error hugepage (PageHWPoison flags are cleared but pages
    are still leaked). There is also the issue of wasted healthy subpages.
    This patch reworks me_huge_page() to solve these issues.

    For hugetlb files, we now have truncating code, so let's use it in the
    hugetlbfs-specific ->error_remove_page().

    For anonymous hugepages, it's helpful to dissolve the error page after
    freeing it into the free hugepage list. The migration entry and
    PageHWPoison in the head page prevent access to it.

    TODO: dissolve_free_huge_page() can fail, but we haven't considered
    that yet. It's not critical (and at least no worse than now) because in
    such a case the error hugepage just stays in the free hugepage list
    without being dissolved. By virtue of PageHWPoison in the head page,
    it's never allocated to processes.

    [akpm@linux-foundation.org: fix unused var warnings]
    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

06 Jul, 2017

1 commit

  • Implement the show_options superblock op for hugetlbfs as part of a bid to
    get rid of s_options and generic_show_options() to make it easier to
    implement a context-based mount where the mount options can be passed
    individually over a file descriptor.

    Note that the uid and gid should possibly be displayed relative to the
    viewer's user namespace.
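
    A sketch of the shape of such an op (field names assumed from the
    hugetlbfs superblock info):

        static int hugetlbfs_show_options(struct seq_file *m, struct dentry *root)
        {
            struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(root->d_sb);

            if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID))
                seq_printf(m, ",uid=%u",
                           from_kuid_munged(&init_user_ns, sbinfo->uid));
            /* ... gid, mode, pagesize, size, nr_inodes similarly ... */
            return 0;
        }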

    Signed-off-by: David Howells
    cc: Nadia Yvette Chambers
    Signed-off-by: Al Viro

    David Howells
     

19 Jun, 2017

1 commit

  • Stack guard page is a useful feature to reduce the risk of stack
    smashing into a different mapping. We have been using a single page gap
    which is sufficient to prevent having stack adjacent to a different
    mapping. But this seems to be insufficient in the light of the stack
    usage in userspace. E.g. glibc uses alloca() allocations as large as
    64kB in many commonly used functions. Others use constructs like
    gid_t buffer[NGROUPS_MAX], which is 256kB, or stack strings with
    MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries and the default
    no limit for the stack size limit because those applications can be
    tricked to consume a large portion of the stack and a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.
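
    A sketch of the vm_start_gap() idea (this follows the shape of the
    helper added to include/linux/mm.h):

        static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
        {
            unsigned long vm_start = vma->vm_start;

            if (vma->vm_flags & VM_GROWSDOWN) {
                vm_start -= stack_guard_gap;
                if (vm_start > vma->vm_start)    /* underflow */
                    vm_start = 0;
            }
            return vm_start;
        }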

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Apr, 2017

1 commit

  • If mmap() maps a file, it can be passed an offset into the file at which
    the mapping is to start. Offset could be a negative value when
    represented as a loff_t. The offset plus length will be used to update
    the file size (i_size) which is also a loff_t.

    Validate the value of offset and offset + length to make sure they do
    not overflow and appear as negative.
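
    A sketch of the validation (simplified from hugetlbfs_file_mmap()):

        vma_len = (loff_t)(vma->vm_end - vma->vm_start);
        len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
        if (len < vma_len)    /* wrapped negative: reject */
            return -EINVAL;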

    Found by syzkaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
    region_abort if region_chg fails") applied. Prior to this commit, the
    overflow would still occur but we would luckily return ENOMEM.

    To reproduce:

    mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);

    Resulted in,

    kernel BUG at mm/hugetlb.c:742!
    Call Trace:
    hugetlbfs_evict_inode+0x80/0xa0
    evict+0x24a/0x620
    iput+0x48f/0x8c0
    dentry_unlink_inode+0x31f/0x4d0
    __dentry_kill+0x292/0x5e0
    dput+0x730/0x830
    __fput+0x438/0x720
    ____fput+0x1a/0x20
    task_work_run+0xfe/0x180
    exit_to_usermode_loop+0x133/0x150
    syscall_return_slowpath+0x184/0x1c0
    entry_SYSCALL_64_fastpath+0xab/0xad

    Fixes: ff8c0c53c475 ("mm/hugetlb.c: don't call region_abort if region_chg fails")
    Link: http://lkml.kernel.org/r/1491951118-30678-1-git-send-email-mike.kravetz@oracle.com
    Reported-by: Vegard Nossum
    Signed-off-by: Mike Kravetz
    Acked-by: Hillf Danton
    Cc: Dmitry Vyukov
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Andrey Ryabinin
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

01 Apr, 2017

1 commit

  • Any time after inode allocation, destroy_inode can be called. The
    hugetlbfs inode contains a shared_policy structure, and
    mpol_free_shared_policy is unconditionally called as part of
    hugetlbfs_destroy_inode. Initialize the policy as part of inode
    allocation so that any quick (error path) calls to destroy_inode will be
    handed an initialized policy.
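
    A sketch of the fix (initialize in the allocation path so any early
    call to destroy_inode() sees a valid policy; simplified):

        static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
        {
            struct hugetlbfs_inode_info *p;

            p = kmem_cache_alloc(hugetlbfs_inode_cachep, GFP_KERNEL);
            if (!p)
                return NULL;
            mpol_shared_policy_init(&p->policy, NULL);
            return &p->vfs_inode;
        }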

    syzkaller fuzzer found this bug, that resulted in the following:

    BUG: KASAN: user-memory-access in atomic_inc
    include/asm-generic/atomic-instrumented.h:87 [inline] at addr
    000000131730bd7a
    BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
    kernel/locking/lockdep.c:3239 at addr 000000131730bd7a
    Write of size 4 by task syz-executor6/14086
    CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
    __lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
    lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
    __raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
    _raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
    mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
    hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
    alloc_inode+0x10d/0x180 fs/inode.c:216
    new_inode_pseudo+0x69/0x190 fs/inode.c:889
    new_inode+0x1c/0x40 fs/inode.c:918
    hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
    hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
    newseg+0x422/0xd30 ipc/shm.c:575
    ipcget_new ipc/util.c:285 [inline]
    ipcget+0x21e/0x580 ipc/util.c:639
    SYSC_shmget ipc/shm.c:673 [inline]
    SyS_shmget+0x158/0x230 ipc/shm.c:657
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    Analysis provided by Tetsuo Handa

    Link: http://lkml.kernel.org/r/1490477850-7944-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Cc: Tetsuo Handa
    Cc: Michal Hocko
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

02 Mar, 2017

1 commit