06 Mar, 2019

1 commit

  • commit cb6acd01e2e43fd8bad11155752b7699c3d0fb76 upstream.

    Hugetlb pages should only be migrated if they are 'active'. The
    routines set/clear_page_huge_active() modify the active state of
    hugetlb pages.

    When a new hugetlb page is allocated at fault time, set_page_huge_active
    is called before the page is locked. Therefore, another thread could
    race and migrate the page while it is being added to the page table
    by the fault code. This race is somewhat hard to trigger, but can be
    seen by strategically adding udelay to simulate worst-case scheduling
    behavior. Depending on 'how' the code races, various BUG()s could be
    triggered.

    To address this issue, simply delay the set_page_huge_active call until
    after the page is successfully added to the page table.
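
    A minimal sketch of the reordering, assuming the simplified shape of
    hugetlb_no_page() in mm/hugetlb.c (context abbreviated):

    ptl = huge_pte_lock(h, mm, ptep);
    set_huge_pte_at(mm, haddr, ptep, new_pte);
    spin_unlock(ptl);
    /* Only mark the page active once it is wired into the page table,
     * so a racing migration can no longer see it half-initialized. */
    set_page_huge_active(page);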

    Hugetlb pages can also be leaked at migration time if the pages are
    associated with a file in an explicitly mounted hugetlbfs filesystem.
    For example, consider a two node system with 4GB worth of huge pages
    available. A program mmaps a 2G file in a hugetlbfs filesystem. It
    then migrates the pages associated with the file from one node to
    another. When the program exits, huge page counts are as follows:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    0 free_hugepages
    1024 nr_hugepages

    Filesystem                Size  Used Avail Use% Mounted on
    nodev                     4.0G  2.0G  2.0G  50% /var/opt/hugepool

    That is as expected. 2G of huge pages are taken from the free_hugepages
    counts, and 2G is the size of the file in the explicitly mounted
    filesystem. If the file is then removed, the counts become:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    1024 free_hugepages
    1024 nr_hugepages

    Filesystem                Size  Used Avail Use% Mounted on
    nodev                     4.0G  2.0G  2.0G  50% /var/opt/hugepool

    Note that the filesystem still shows 2G of pages used, while there
    actually are no huge pages in use. The only way to 'fix' the
    filesystem accounting is to unmount the filesystem.

    If a hugetlb page is associated with an explicitly mounted filesystem,
    this information is contained in the page_private field. At migration
    time, this information is not preserved. To fix, simply transfer
    page_private from the old to the new page at migration time if
    necessary.

    There is a related race with removing a huge page from a file and
    migration. When a huge page is removed from the pagecache, the
    page_mapping() field is cleared, yet page_private remains set until the
    page is actually freed by free_huge_page(). A page could be migrated
    while in this state. However, since page_mapping() is not set, the
    hugetlbfs-specific routine to transfer page_private is not called and
    we leak the page count in the filesystem.

    To fix that, check for this condition before migrating a huge page.
    If the condition is detected, return -EBUSY for the page.
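
    A sketch of both migration-time fixes, assuming they land in the
    hugetlbfs migrate callback and in unmap_and_move_huge_page()
    respectively (simplified from the upstream patch):

    /* hugetlbfs migrate callback: carry page_private (the filesystem
     * association) across migration */
    if (page_private(page) && !page_private(newpage)) {
        set_page_private(newpage, page_private(page));
        set_page_private(page, 0);
    }

    /* unmap_and_move_huge_page(): the page left the page cache but is
     * not yet freed; migrating it now would leak the filesystem
     * accounting, so refuse it */
    if (!PageAnon(hpage) && page_private(hpage) && !page_mapping(hpage))
        return -EBUSY;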

    Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
    Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    [mike.kravetz@oracle.com: v2]
    Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
    [mike.kravetz@oracle.com: update comment and changelog]
    Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

23 Aug, 2018

1 commit

  • Zero out the vma in vma_init() rather than in vm_area_alloc(), to
    ensure that the various oddball stack-based vmas are in a good
    state. Some of the callers were zeroing them out, others were not.
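
    A sketch of the resulting helper, assuming the shape of vma_init()
    in include/linux/mm.h after this change:

    static inline void vma_init(struct vm_area_struct *vma,
                                struct mm_struct *mm)
    {
        static const struct vm_operations_struct dummy_vm_ops = {};

        memset(vma, 0, sizeof(*vma));   /* one well-defined start state */
        vma->vm_mm = mm;
        vma->vm_ops = &dummy_vm_ops;
        INIT_LIST_HEAD(&vma->anon_vma_chain);
    }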

    Acked-by: Kirill A. Shutemov
    Cc: Russell King
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

14 Aug, 2018

1 commit

  • Pull vfs open-related updates from Al Viro:

    - "do we need fput() or put_filp()" rules are gone - it's always fput()
    now. We keep track of that state where it belongs - in ->f_mode.

    - int *opened mess killed - in finish_open(), in ->atomic_open()
    instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

    - alloc_file() wrappers with saner calling conventions are introduced
    (alloc_file_clone() and alloc_file_pseudo()); callers converted, with
    much simplification (see the sketch below).

    - while we are at it, saner calling conventions for path_init() and
    link_path_walk(), simplifying things inside fs/namei.c (both on
    open-related paths and elsewhere).
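
    As an illustration, a hedged sketch of a conversion to the new
    helper; alloc_file_pseudo()'s signature is from this series, while
    the surrounding context is hypothetical:

    struct file *f;

    /* one call builds a file over an anonymous inode, replacing the
     * old multi-step path/file wiring */
    f = alloc_file_pseudo(inode, mnt, "hugetlb", O_RDWR,
                          &hugetlbfs_file_operations);
    if (IS_ERR(f))
        return f;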

    * 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    few more cleanups of link_path_walk() callers
    allow link_path_walk() to take ERR_PTR()
    make path_init() unconditionally paired with terminate_walk()
    document alloc_file() changes
    make alloc_file() static
    do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
    new helper: alloc_file_clone()
    create_pipe_files(): switch the first allocation to alloc_file_pseudo()
    anon_inode_getfile(): switch to alloc_file_pseudo()
    hugetlb_file_setup(): switch to alloc_file_pseudo()
    ocxlflash_getfile(): switch to alloc_file_pseudo()
    cxl_getfile(): switch to alloc_file_pseudo()
    ... and switch shmem_file_setup() to alloc_file_pseudo()
    __shmem_file_setup(): reorder allocations
    new wrapper: alloc_file_pseudo()
    kill FILE_{CREATED,OPENED}
    switch atomic_open() and lookup_open() to returning 0 in all success cases
    document ->atomic_open() changes
    ->atomic_open(): return 0 in all success cases
    get rid of 'opened' in path_openat() and the helpers downstream
    ...

    Linus Torvalds
     

27 Jul, 2018

1 commit

  • Make sure to initialize all VMAs properly, not only those which come
    from vm_area_cachep.
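
    A minimal usage sketch, assuming the vma_init() helper this change
    introduces (the stack-based vma and surrounding context are
    hypothetical):

    struct vm_area_struct vma;

    vma_init(&vma, mm);          /* replaces ad-hoc zeroing/field setup */
    vma.vm_flags = VM_READ;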

    Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Apr, 2018

1 commit

  • This is a fix for a regression in 32-bit kernels caused by an invalid
    check for pgoff overflow in hugetlbfs mmap setup. The check
    incorrectly specified that the size of a loff_t was the same as the
    size of a long. The regression prevents mapping hugetlbfs files at
    offsets greater than 4GB on 32-bit kernels.

    On 32-bit kernels, conversion from a page-based unsigned long cannot
    overflow a loff_t byte offset. Therefore, skip this check if
    sizeof(unsigned long) != sizeof(loff_t).
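
    A sketch of the guarded check, assuming the PGOFF_LOFFT_MAX mask
    from the original overflow fix (see the 23 Mar, 2018 entry below):

    /*
     * The pgoff-to-byte conversion can only overflow a loff_t when
     * unsigned long is as wide as loff_t, so skip the check on 32-bit
     * kernels.
     */
    if (sizeof(unsigned long) == sizeof(loff_t)) {
        if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
            return -EINVAL;   /* byte offset would overflow loff_t */
    }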

    Link: http://lkml.kernel.org/r/20180330145402.5053-1-mike.kravetz@oracle.com
    Fixes: 63489f8e8211 ("hugetlbfs: check for pgoff value overflow")
    Reported-by: Dan Rue
    Signed-off-by: Mike Kravetz
    Tested-by: Anders Roxell
    Cc: Michal Hocko
    Cc: Yisheng Xie
    Cc: "Kirill A . Shutemov"
    Cc: Nic Losby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

23 Mar, 2018

1 commit

  • A vma with vm_pgoff large enough to overflow a loff_t type when
    converted to a byte offset can be passed via the remap_file_pages system
    call. The hugetlbfs mmap routine uses the byte offset to calculate
    reservations and file size.

    A sequence such as:

    mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
    remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

    will result in the following when the task exits and the file is
    closed:

    kernel BUG at mm/hugetlb.c:749!
    Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The overflowed pgoff value causes hugetlbfs to try to set up a mapping
    with a negative range (end < start) that leaves invalid state which
    causes the BUG.

    The previous overflow fix to this code was incomplete and did not take
    the remap_file_pages system call into account.
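
    A hedged sketch of the check this patch adds to
    hugetlbfs_file_mmap(), with the mask shape taken from the upstream
    fix:

    /*
     * vm_pgoff is in PAGE_SIZE units; any bit that would shift into or
     * past the loff_t sign bit marks an offset that cannot be
     * represented, so reject the mapping up front.
     */
    #define PGOFF_LOFFT_MAX \
            (((1UL << (PAGE_SHIFT + 1)) - 1) << \
             (BITS_PER_LONG - (PAGE_SHIFT + 1)))

    if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
        return -EINVAL;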

    [mike.kravetz@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
    [akpm@linux-foundation.org: include mmdebug.h]
    [akpm@linux-foundation.org: fix -ve left shift count on sh]
    Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
    Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
    Signed-off-by: Mike Kravetz
    Reported-by: Nic Losby
    Acked-by: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Yisheng Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

01 Feb, 2018

2 commits

  • Implement memfd sealing for hugetlbfs, similar to shmem:
    - WRITE: deny fallocate(PUNCH_HOLE). mmap() write is denied in
    memfd_add_seals(). write() doesn't exist for hugetlbfs.
    - SHRINK: added similar check as shmem_setattr()
    - GROW: added similar check as shmem_setattr() & shmem_fallocate()

    Except for the write() operation, which doesn't exist with hugetlbfs,
    this should make sealing as close as it can be to the shmem support.
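
    A sketch of the SHRINK/GROW checks, modeled on shmem_setattr() as
    the message says (the hugetlbfs_setattr()/info->seals names are
    assumed from context):

    /* in hugetlbfs_setattr(), with the inode lock held */
    loff_t oldsize = inode->i_size;

    if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
        (newsize > oldsize && (info->seals & F_SEAL_GROW)))
        return -EPERM;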

    Link: http://lkml.kernel.org/r/20171107122800.25517-5-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     
  • hugetlbfs inode information will need to be accessed by code in
    mm/shmem.c for file sealing operations. Move the inode information
    definition from the .c file to a header to provide that access.

    Link: http://lkml.kernel.org/r/20171107122800.25517-4-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     

30 Nov, 2017

1 commit

  • hugetlbfs_fallocate() currently performs put_page() before
    unlock_page(). This opens a small time window, from the time the page
    is added to the page cache until it is unlocked, in which the page
    might be removed from the page cache by another core. If the page is
    removed during this time window, it might cause memory corruption, as
    the wrong page will be unlocked.

    It is arguable whether this scenario can happen in a real system, and
    there are several mitigating factors. The issue was found by code
    inspection (actually grep), and not by actually triggering the flow.
    Yet, since putting the page before unlocking is incorrect it should be
    fixed, if only to prevent future breakage or someone copy-pasting this
    code.
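
    The corrected ordering, sketched; a reference must still be held
    while unlocking, so the unlock has to come first:

    /* keep our reference across the unlock, then drop it */
    unlock_page(page);
    put_page(page);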

    Mike said:
    "I am of the opinion that this does not need to be sent to stable.
    Although the ordering in the current code is incorrect, there is no way
    for this to be a problem with current locking. In addition, I verified
    that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
    for hugetlbfs and other filesystems is addressed in 3a77d214807c ("mm:
    fadvise: avoid fadvise for fs without backing device")"

    Link: http://lkml.kernel.org/r/20170826191124.51642-1-namit@vmware.com
    Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
    Signed-off-by: Nadav Amit
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Eric Biggers
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

16 Nov, 2017

2 commits

  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
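
    The resulting API change, sketched as a before/after (the dropped
    second argument was the 'cold' flag):

    - pagevec_init(&pvec, 0);   /* 0 == "pages are hot" */
    + pagevec_init(&pvec);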

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There is no need to have a local return code set with -EINVAL when both
    the conditions following it return error codes appropriately. Just
    remove the redundant one.

    Link: http://lkml.kernel.org/r/20170929145444.17611-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

03 Nov, 2017

1 commit

  • Calling madvise(MADV_HWPOISON) on a hugetlbfs page will result in bad
    (negative) reserved huge page counts. This may not happen immediately,
    but may happen later when the underlying file is removed or filesystem
    unmounted. For example:

    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 0
    HugePages_Rsvd: 18446744073709551615
    HugePages_Surp: 0
    Hugepagesize: 2048 kB

    In routine hugetlbfs_error_remove_page(), hugetlb_fix_reserve_counts
    is called after remove_huge_page. hugetlb_fix_reserve_counts is
    designed to be called only if a failure is returned from
    hugetlb_unreserve_pages. Therefore, call hugetlb_unreserve_pages as
    required and only call hugetlb_fix_reserve_counts in the unlikely
    event that hugetlb_unreserve_pages returns an error.
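
    A sketch of the fixed error path, assuming the structure described
    above:

    remove_huge_page(page);
    /* release the reservation; repair global counts only on failure */
    if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
        hugetlb_fix_reserve_counts(inode);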

    Link: http://lkml.kernel.org/r/20171019230007.17043-2-mike.kravetz@oracle.com
    Fixes: 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

09 Sep, 2017

2 commits

  • Allow interval trees to quickly check for overlaps to avoid
    unnecessary tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().
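
    A sketch of the data-structure change (rb_root_cached and
    rb_first_cached() are the real interfaces; the usage is
    illustrative):

    struct rb_root_cached tree = RB_ROOT_CACHED;

    /* the leftmost node is cached, so an "any overlap at all?"
     * pre-check is O(1) instead of a full tree descent */
    struct rb_node *first = rb_first_cached(&tree);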

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Introduce a new migration mode that allows offloading the copy to a
    device DMA engine. This changes the workflow of migration, and not
    every address_space migratepage callback can support it.

    This is intended to be used by migrate_vma(), which itself is used
    for things like HMM (see include/linux/hmm.h).

    No additional per-filesystem migratepage testing is needed. I
    disabled MIGRATE_SYNC_NO_COPY in all problematic migratepage()
    callbacks and added comments in those to explain why (part of this
    patch). Any callback that wishes to support this new mode needs to
    be aware of the difference in the migration flow from the other
    modes.

    Some of these callbacks do extra locking while copying (aio, zsmalloc,
    balloon, ...), and for DMA to be effective you want to copy multiple
    pages in one DMA operation. But in the problematic cases you cannot
    easily hold the extra lock across multiple calls to this callback.

    Usual flow is:

    For each page {
    1 - lock page
    2 - call migratepage() callback
    3 - (extra locking in some migratepage() callback)
    4 - migrate page state (freeze refcount, update page cache, buffer
    head, ...)
    5 - copy page
    6 - (unlock any extra lock of migratepage() callback)
    7 - return from migratepage() callback
    8 - unlock page
    }

    The new mode MIGRATE_SYNC_NO_COPY:
    1 - lock multiple pages
    For each page {
    2 - call migratepage() callback
    3 - abort in all problematic migratepage() callback
    4 - migrate page state (freeze refcount, update page cache, buffer
    head, ...)
    } // finished all calls to migratepage() callback
    5 - DMA copy multiple pages
    6 - unlock all the pages

    To support MIGRATE_SYNC_NO_COPY in the problematic cases we would
    need a new callback, migratepages() (for instance), that deals with
    multiple pages in one transaction.

    Because the problematic cases are not important for current usage, I
    did not want to complicate this patchset even more for no good
    reason.
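
    A sketch of the abort described above, for a callback that takes
    extra locks across the copy and therefore cannot support the new
    flow (the callback itself is illustrative; migrate_page() is the
    existing generic fallback):

    static int example_migratepage(struct address_space *mapping,
                                   struct page *newpage, struct page *page,
                                   enum migrate_mode mode)
    {
        /* this callback locks extra state across the copy, which the
         * multi-page DMA flow cannot accommodate; abort that mode */
        if (mode == MIGRATE_SYNC_NO_COPY)
            return -EINVAL;

        return migrate_page(mapping, newpage, page, mode);
    }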

    Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

3 commits

  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages.

    Just drop the argument.
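
    The call-site change, sketched:

    - pagevec_lookup(&pvec, mapping, &index, PAGEVEC_SIZE);
    + pagevec_lookup(&pvec, mapping, &index);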

    Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We want only pages from given range in remove_inode_hugepages(). Use
    pagevec_lookup_range() instead of pagevec_lookup().
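
    Sketched usage in remove_inode_hugepages(), with the range-bounded
    lookup replacing open-coded index filtering (simplified):

    pgoff_t next = start;
    struct pagevec pvec;

    pagevec_init(&pvec, 0);
    /* only pages in [next, end] are returned; no manual range check */
    while (pagevec_lookup_range(&pvec, mapping, &next, end)) {
        /* remove each returned page from the page cache */
        pagevec_release(&pvec);
    }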

    Link: http://lkml.kernel.org/r/20170726114704.7626-8-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Mike Kravetz
    Cc: Nadia Yvette Chambers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and underlying find_get_pages()) update index to
    the next page where iteration should continue. Most callers want this
    and also pagevec_lookup_tag() already does this.
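
    The iteration pattern this enables, sketched:

    pgoff_t index = 0;

    while (pagevec_lookup(&pvec, mapping, &index)) {
        /* use pvec.pages[0 .. pvec.nr-1]; 'index' has already been
         * advanced past the last returned page, so the next call
         * resumes in the right place with no manual recomputation */
        pagevec_release(&pvec);
    }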

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data

    Linus Torvalds
     

11 Jul, 2017

1 commit

  • Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
    keep the error hugepage away from the system, which is OK but not
    good enough because the hugepage still has a refcount and unpoison
    doesn't work on the error hugepage (PageHWPoison flags are cleared
    but pages are still leaked). There is also a 'wasting healthy
    subpages' issue. This patch reworks me_huge_page() to solve these
    issues.

    For hugetlb files, we recently gained truncating code, so let's use
    it in a hugetlbfs-specific ->error_remove_page().

    For anonymous hugepages, it's helpful to dissolve the error page
    after freeing it into the free hugepage list. The migration entry and
    PageHWPoison in the head page prevent access to it.

    TODO: dissolve_free_huge_page() can fail, but we don't consider that
    yet. It's not critical (and at least no worse than now) because in
    such a case the error hugepage just stays in the free hugepage list
    without being dissolved. By virtue of PageHWPoison in the head page,
    it's never allocated to processes.

    [akpm@linux-foundation.org: fix unused var warnings]
    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

06 Jul, 2017

1 commit

  • Implement the show_options superblock op for hugetlbfs as part of a bid to
    get rid of s_options and generic_show_options() to make it easier to
    implement a context-based mount where the mount options can be passed
    individually over a file descriptor.

    Note that the uid and gid should possibly be displayed relative to the
    viewer's user namespace.
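
    A hedged sketch of the op's shape (the show_options signature is the
    real VFS interface; the body is abbreviated and the option printed is
    only one of those hugetlbfs supports):

    static int hugetlbfs_show_options(struct seq_file *m, struct dentry *root)
    {
        struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(root->d_sb);

        if (sbinfo->max_inodes != -1)
            seq_printf(m, ",nr_inodes=%lu", sbinfo->max_inodes);
        return 0;
    }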

    Signed-off-by: David Howells
    cc: Nadia Yvette Chambers
    Signed-off-by: Al Viro

    David Howells
     

19 Jun, 2017

1 commit

  • Stack guard page is a useful feature to reduce a risk of stack smashing
    into a different mapping. We have been using a single page gap which
    is sufficient to prevent having stack adjacent to a different mapping.
    But this seems to be insufficient in the light of the stack usage in
    userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
    used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
    which is 256kB, or stack strings with MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries and the default
    no limit for the stack size limit because those applications can be
    tricked to consume a large portion of the stack and a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping the gap inside the stack vma, maintain the stack
    guard gap as a gap between vmas: using vm_start_gap() in place of
    vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just
    those few places which need to respect the gap - mainly
    arch_get_unmapped_area(), and the vma tree's subtree_gap support for
    that.
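
    A sketch of the vm_start_gap() helper the message describes (shape
    per this change, clamping against address-space underflow):

    static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
    {
        unsigned long vm_start = vma->vm_start;

        if (vma->vm_flags & VM_GROWSDOWN) {
            vm_start -= stack_guard_gap;
            if (vm_start > vma->vm_start)   /* subtraction wrapped */
                vm_start = 0;
        }
        return vm_start;
    }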

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Apr, 2017

1 commit

  • If mmap() maps a file, it can be passed an offset into the file at which
    the mapping is to start. Offset could be a negative value when
    represented as a loff_t. The offset plus length will be used to update
    the file size (i_size) which is also a loff_t.

    Validate the value of offset and offset + length to make sure they do
    not overflow and appear as negative.

    Found by syzkaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
    region_abort if region_chg fails") applied. Prior to this commit, the
    overflow would still occur but we would luckily return ENOMEM.
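
    A sketch of the validation, assuming it sits in hugetlbfs_file_mmap()
    where the byte length is computed (simplified):

    loff_t vma_len = (loff_t)(vma->vm_end - vma->vm_start);
    loff_t len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

    /* offset + length must not wrap into a negative loff_t */
    if (len < vma_len)
        return -EINVAL;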

    To reproduce:

    mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);

    Resulted in,

    kernel BUG at mm/hugetlb.c:742!
    Call Trace:
    hugetlbfs_evict_inode+0x80/0xa0
    evict+0x24a/0x620
    iput+0x48f/0x8c0
    dentry_unlink_inode+0x31f/0x4d0
    __dentry_kill+0x292/0x5e0
    dput+0x730/0x830
    __fput+0x438/0x720
    ____fput+0x1a/0x20
    task_work_run+0xfe/0x180
    exit_to_usermode_loop+0x133/0x150
    syscall_return_slowpath+0x184/0x1c0
    entry_SYSCALL_64_fastpath+0xab/0xad

    Fixes: ff8c0c53c475 ("mm/hugetlb.c: don't call region_abort if region_chg fails")
    Link: http://lkml.kernel.org/r/1491951118-30678-1-git-send-email-mike.kravetz@oracle.com
    Reported-by: Vegard Nossum
    Signed-off-by: Mike Kravetz
    Acked-by: Hillf Danton
    Cc: Dmitry Vyukov
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Andrey Ryabinin
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

01 Apr, 2017

1 commit

  • Any time after inode allocation, destroy_inode can be called. The
    hugetlbfs inode contains a shared_policy structure, and
    mpol_free_shared_policy is unconditionally called as part of
    hugetlbfs_destroy_inode. Initialize the policy as part of inode
    allocation so that any quick (error path) calls to destroy_inode will be
    handed an initialized policy.
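
    A sketch of the fix: initialize in the allocation path, so even an
    early destroy_inode() sees a valid policy (function names from
    fs/hugetlbfs/inode.c; body abbreviated):

    static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
    {
        struct hugetlbfs_inode_info *p;

        p = kmem_cache_alloc(hugetlbfs_inode_cachep, GFP_KERNEL);
        if (!p)
            return NULL;
        /* initialize here rather than in hugetlbfs_get_inode(), so any
         * quick error-path call to destroy_inode() finds a valid policy */
        mpol_shared_policy_init(&p->policy, NULL);
        return &p->vfs_inode;
    }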

    The syzkaller fuzzer found this bug, which resulted in the following:

    BUG: KASAN: user-memory-access in atomic_inc
    include/asm-generic/atomic-instrumented.h:87 [inline] at addr
    000000131730bd7a
    BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
    kernel/locking/lockdep.c:3239 at addr 000000131730bd7a
    Write of size 4 by task syz-executor6/14086
    CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
    __lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
    lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
    __raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
    _raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
    mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
    hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
    alloc_inode+0x10d/0x180 fs/inode.c:216
    new_inode_pseudo+0x69/0x190 fs/inode.c:889
    new_inode+0x1c/0x40 fs/inode.c:918
    hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
    hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
    newseg+0x422/0xd30 ipc/shm.c:575
    ipcget_new ipc/util.c:285 [inline]
    ipcget+0x21e/0x580 ipc/util.c:639
    SYSC_shmget ipc/shm.c:673 [inline]
    SyS_shmget+0x158/0x230 ipc/shm.c:657
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    Analysis provided by Tetsuo Handa

    Link: http://lkml.kernel.org/r/1490477850-7944-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Cc: Tetsuo Handa
    Cc: Michal Hocko
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     


11 Oct, 2016

2 commits

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • When a huge page is added to the page cache (huge_add_to_page_cache),
    the page private flag will be cleared. Since this code
    (remove_inode_hugepages) will only be called for pages in the page
    cache, PagePrivate(page) will always be false.

    The patch removes the code without any functional change.

    Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

28 Sep, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.
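
    The substitution, sketched:

    - inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
    + inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);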

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
     

22 Sep, 2016

1 commit

  • inode_change_ok() will be responsible for clearing capabilities and
    IMA extended attributes and as such will need the dentry. Give it as
    an argument to inode_change_ok() instead of an inode. Also rename
    inode_change_ok() to setattr_prepare() to better reflect that it also
    does some modifications in addition to checks.
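
    The call-site conversion, sketched:

    - error = inode_change_ok(inode, attr);
    + error = setattr_prepare(dentry, attr);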

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a
    *long* time ago with the promise that one day it would be possible
    to implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to
    the PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach.
    I'll fix them manually in a separate patch. Comments and
    documentation will also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Jan, 2016

1 commit

  • Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become an rwsem, with ->lookup() done with it held
    only shared.
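
    The wrapper mapping, sketched:

    inode_lock(inode);      /* was: mutex_lock(&inode->i_mutex);    */
    inode_unlock(inode);    /* was: mutex_unlock(&inode->i_mutex);  */
    inode_trylock(inode);   /* was: mutex_trylock(&inode->i_mutex); */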

    Signed-off-by: Al Viro

    Al Viro
     

16 Jan, 2016

3 commits

  • Page faults can race with fallocate hole punch. If a page fault happens
    between the unmap and remove operations, the page is not removed and
    remains within the hole. This is not the desired behavior. The race is
    difficult to detect in user level code as even in the non-race case, a
    page within the hole could be faulted back in before fallocate returns.
    If userfaultfd is expanded to support hugetlbfs in the future, this race
    will be easier to observe.

    If this race is detected and a page is mapped, the remove operation
    (remove_inode_hugepages) will unmap the page before removing. The unmap
    within remove_inode_hugepages occurs with the hugetlb_fault_mutex held
    so that no other faults will be processed until the page is removed.

    The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
    remove_inode_hugepages to satisfy the new reference.
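
    A sketch of the serialized removal described above (simplified; the
    fault-mutex names follow fs/hugetlbfs/inode.c of that era):

    hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, next, 0);
    mutex_lock(&hugetlb_fault_mutex_table[hash]);

    /* a racing fault may have mapped the page back in; unmap it again
     * under the mutex, where no new fault can be processed */
    if (unlikely(page_mapped(page)))
        hugetlb_vmdelete_list(&mapping->i_mmap,
                              next * pages_per_huge_page(h),
                              (next + 1) * pages_per_huge_page(h));
    remove_huge_page(page);

    mutex_unlock(&hugetlb_fault_mutex_table[hash]);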

    [akpm@linux-foundation.org: move hugetlb_vmdelete_list()]
    Signed-off-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Davidlohr Bueso
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine. The
    argument end is of type pgoff_t. It was being converted to a vaddr
    offset and passed to unmap_hugepage_range. However, end was also being
    used as an argument to the vma_interval_tree_foreach controlling loop.
    In addition, the conversion of end to vaddr offset was incorrect.

    hugetlb_vmtruncate_list is called as part of a file truncate or
    fallocate hole punch operation.

    When truncating a hugetlbfs file, this bug could prevent some pages from
    being unmapped. This is possible if there are multiple vmas mapping the
    file, and there is a sufficiently sized hole between the mappings. The
    size of the hole between two vmas (A,B) must be such that the starting
    virtual address of B is greater than (ending virtual address of A <<
    PAGE_SHIFT). In this case, the pages in B would not be unmapped. If
    pages are not properly unmapped during truncate, the following BUG is
    hit:

    kernel BUG at fs/hugetlbfs/inode.c:428!

    In the fallocate hole punch case, this bug could prevent pages from
    being unmapped as in the truncate case. However, for hole punch the
    result is that unmapped pages will not be removed during the operation.
    For hole punch, it is also possible that more pages than desired will be
    unmapped. This unnecessary unmapping will cause page faults to
    reestablish the mappings on subsequent page access.
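
    A sketch of the corrected pgoff-to-vaddr conversion, clamped per vma
    (illustrative shape; both endpoints are derived from file page
    offsets instead of reusing 'end' directly):

    /* convert the file-offset range [start, end) into virtual
     * addresses, clamped to this vma's extent */
    v_start = vma->vm_start;
    if (start > vma->vm_pgoff)
        v_start += (start - vma->vm_pgoff) << PAGE_SHIFT;

    v_end = vma->vm_end;
    if (end < vma->vm_pgoff + vma_pages(vma))
        v_end = vma->vm_start + ((end - vma->vm_pgoff) << PAGE_SHIFT);

    unmap_hugepage_range(vma, v_start, v_end, NULL);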

    Fixes: 1bfad99ab ("hugetlbfs: hugetlb_vmtruncate_list() needs to take a range")
    Reported-by: Hillf Danton
    Signed-off-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: Dave Hansen
    Cc: [4.3]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Dmitry Vyukov has reported[1] possible deadlock (triggered by his
    syzkaller fuzzer):

    Possible unsafe locking scenario:

           CPU0                                    CPU1
           ----                                    ----
      lock(&hugetlbfs_i_mmap_rwsem_key);
                                              lock(&mapping->i_mmap_rwsem);
                                              lock(&hugetlbfs_i_mmap_rwsem_key);
      lock(&mapping->i_mmap_rwsem);

    Both traces point to mm_take_all_locks() as a source of the problem.
    It doesn't take care about ordering of hugetlbfs_i_mmap_rwsem_key
    (aka mapping->i_mmap_rwsem for hugetlb mappings) vs. i_mmap_rwsem.

    huge_pmd_share() does memory allocation under
    hugetlbfs_i_mmap_rwsem_key, and the allocator can take i_mmap_rwsem
    if it hits reclaim. So we need to take i_mmap_rwsem from all hugetlb
    VMAs before taking i_mmap_rwsem from the rest of the VMAs.

    The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.
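
    A sketch of the resulting ordering in mm_take_all_locks(): two
    passes over the vma list, hugetlb mappings first (simplified):

    /* pass 1: hugetlb file mappings (hugetlbfs_i_mmap_rwsem_key) */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
        if (vma->vm_file && vma->vm_file->f_mapping &&
            is_vm_hugetlb_page(vma))
            vm_lock_mapping(mm, vma->vm_file->f_mapping);

    /* pass 2: every other file mapping (i_mmap_rwsem) */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
        if (vma->vm_file && vma->vm_file->f_mapping &&
            !is_vm_hugetlb_page(vma))
            vm_lock_mapping(mm, vma->vm_file->f_mapping);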

    [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Reviewed-by: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

2 commits

  • The Kconfig currently controlling compilation of this code is:

    config HUGETLBFS
    bool "HugeTLB file system support"

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the modular code that is essentially orphaned, so that
    when reading the driver there is no doubt it is builtin-only.

    Since module_init translates to device_initcall in the non-modular case,
    the init ordering gets moved to earlier levels when we use the more
    appropriate initcalls here.

    Originally I had the fs part and the mm part as separate commits, just
    by happenstance of the nature of how I detected these non-modular use
    cases. But that can possibly introduce regressions if the patch merge
    ordering puts the fs part 1st -- as the 0-day testing reported a splat
    at mount time.

    Investigating with "initcall_debug" showed that the delta was
    init_hugetlbfs_fs being called _before_ hugetlb_init instead of after. So
    both the fs change and the mm change are here together.

    In addition, it worked before due to luck of link order, since they were
    both in the same initcall category. So we now have the fs part using
    fs_initcall, and the mm part using subsys_initcall, which puts it one
    bucket earlier. It now passes the basic sanity test that failed in
    earlier 0-day testing.

    We delete the MODULE_LICENSE tag and capture that information at the top
    of the file alongside author comments, etc.

    We don't replace module.h with init.h since the file already has that.
    Also note that MODULE_ALIAS is a no-op for non-modular code.
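
    The initcall changes, sketched (fs part and mm part, per the
    message):

    /* fs/hugetlbfs/inode.c */
    - module_init(init_hugetlbfs_fs)
    + fs_initcall(init_hugetlbfs_fs)

    /* mm/hugetlb.c */
    - module_init(hugetlb_init)
    + subsys_initcall(hugetlb_init)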

    Signed-off-by: Paul Gortmaker
    Reported-by: kernel test robot
    Cc: Nadia Yvette Chambers
    Cc: Alexander Viro
    Cc: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Hillf Danton
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • When running the SPECint_rate gcc on some very large boxes it was
    noticed that the system was spending lots of time in
    mpol_shared_policy_lookup(). The gamess benchmark can also show it and
    is what I mostly used to chase down the issue since the setup for that I
    found to be easier.

    To be clear the binaries were on tmpfs because of disk I/O requirements.
    We then used text replication to avoid icache misses and having all the
    copies banging on the memory where the instruction code resides. This
    results in us hitting a bottleneck in mpol_shared_policy_lookup() since
    lookup is serialised by the shared_policy lock.

    I have only reproduced this on very large (3k+ cores) boxes. The
    problem starts showing up at just a few hundred ranks getting worse
    until it threatens to livelock once it gets large enough. For example
    on the gamess benchmark at 128 ranks this area consumes only ~1% of
    time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
    90%.

    To alleviate the contention in this area I converted the spinlock to
    an rwlock. This allows a large number of lookups to happen
    simultaneously. The results were quite good, reducing this
    consumption at max ranks to around 2%.
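
    The locking change in mpol_shared_policy_lookup(), sketched (the
    sp->lock field name follows mm/mempolicy.c; the lookup call is
    abbreviated):

    /* lookups are readers and no longer serialize against each other */
    - spin_lock(&sp->lock);
    + read_lock(&sp->lock);
      /* ... range lookup under the (now shared) lock ... */
    - spin_unlock(&sp->lock);
    + read_unlock(&sp->lock);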

    [akpm@linux-foundation.org: tidy up code comments]
    Signed-off-by: Nathan Zimmer
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Nadia Yvette Chambers
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer