19 Apr, 2018

1 commit

  • commit 5df63c2a149ae65a9ec239e7c2af44efa6f79beb upstream.

    This is a fix for a regression in 32 bit kernels caused by an invalid
    check for pgoff overflow in hugetlbfs mmap setup. The check incorrectly
    specified that the size of a loff_t was the same as the size of a long.
    The regression prevents mapping hugetlbfs files at offsets greater than
    4GB on 32 bit kernels.

    On 32 bit kernels conversion from a page based unsigned long can not
    overflow a loff_t byte offset. Therefore, skip this check if
    sizeof(unsigned long) != sizeof(loff_t).
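
    The reasoning above can be checked with fixed-width arithmetic. The
    following is a minimal userspace sketch, not the kernel code: it assumes
    4 KiB pages (a page shift of 12), uses uint32_t as a stand-in for a
    32 bit unsigned long and int64_t as a stand-in for loff_t, and both
    helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB base pages */

/*
 * Largest byte offset reachable from a 32 bit page offset.
 * uint32_t stands in for a 32 bit unsigned long, int64_t for loff_t.
 */
static int64_t max_byte_offset_from_32bit_pgoff(void)
{
	return (int64_t)UINT32_MAX << SKETCH_PAGE_SHIFT;
}

/* The overflow check only makes sense when both types have the same width. */
static bool pgoff_check_needed(size_t ulong_size, size_t loff_size)
{
	return ulong_size == loff_size;
}
```

    The largest value the first helper can produce is far below INT64_MAX,
    which is why the check can be skipped when the two sizes differ.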

    Link: http://lkml.kernel.org/r/20180330145402.5053-1-mike.kravetz@oracle.com
    Fixes: 63489f8e8211 ("hugetlbfs: check for pgoff value overflow")
    Reported-by: Dan Rue
    Signed-off-by: Mike Kravetz
    Tested-by: Anders Roxell
    Cc: Michal Hocko
    Cc: Yisheng Xie
    Cc: "Kirill A . Shutemov"
    Cc: Nic Losby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

29 Mar, 2018

1 commit

  • commit 63489f8e821144000e0bdca7e65a8d1cc23a7ee7 upstream.

    A vma with vm_pgoff large enough to overflow a loff_t type when
    converted to a byte offset can be passed via the remap_file_pages system
    call. The hugetlbfs mmap routine uses the byte offset to calculate
    reservations and file size.

    A sequence such as:

    mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
    remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

    will result in the following when task exits/file closed,

    kernel BUG at mm/hugetlb.c:749!
    Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The overflowed pgoff value causes hugetlbfs to try to set up a mapping
    with a negative range (end < start) that leaves invalid state which
    causes the BUG.

    The previous overflow fix to this code was incomplete and did not take
    the remap_file_pages system call into account.
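
    One way to detect such a pgoff before it is shifted is to test its
    upper bits. The sketch below is illustrative only and assumes a 64 bit
    page offset, a 64 bit signed byte offset and 4 KiB pages; the macro and
    function names are hypothetical and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB base pages */

/*
 * Mask of the upper PAGE_SHIFT + 1 bits of a 64 bit page offset. If any of
 * them is set, (pgoff << PAGE_SHIFT) cannot be represented as a non-negative
 * 64 bit signed byte offset.
 */
#define SKETCH_PGOFF_OVERFLOW_MASK \
	(((UINT64_C(1) << (SKETCH_PAGE_SHIFT + 1)) - 1) \
	 << (64 - (SKETCH_PAGE_SHIFT + 1)))

/* Returns true when converting pgoff to a byte offset would overflow. */
static bool pgoff_overflows_loff(uint64_t pgoff)
{
	return (pgoff & SKETCH_PGOFF_OVERFLOW_MASK) != 0;
}
```

    With these assumptions the pgoff from the remap_file_pages call above,
    0x20000000000000, has bit 53 set and is rejected.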

    [mike.kravetz@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
    [akpm@linux-foundation.org: include mmdebug.h]
    [akpm@linux-foundation.org: fix -ve left shift count on sh]
    Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
    Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
    Signed-off-by: Mike Kravetz
    Reported-by: Nic Losby
    Acked-by: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Yisheng Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

03 Nov, 2017

1 commit

  • Calling madvise(MADV_HWPOISON) on a hugetlbfs page will result in bad
    (negative) reserved huge page counts. This may not happen immediately,
    but may happen later when the underlying file is removed or filesystem
    unmounted. For example:

    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 0
    HugePages_Rsvd: 18446744073709551615
    HugePages_Surp: 0
    Hugepagesize: 2048 kB

    In routine hugetlbfs_error_remove_page(), hugetlb_fix_reserve_counts is
    called after remove_huge_page. hugetlb_fix_reserve_counts is designed
    to be called only if a failure is returned from
    hugetlb_unreserve_pages. Therefore, call hugetlb_unreserve_pages as
    required and only call hugetlb_fix_reserve_counts in the unlikely event
    that hugetlb_unreserve_pages returns an error.
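
    The HugePages_Rsvd value shown above is consistent with an unsigned
    64 bit counter that was decremented once below zero. The sketch below
    illustrates that wrap-around and the intended call pattern; it is a
    userspace illustration with hypothetical helper names, not the
    hugetlbfs code.

```c
#include <stdbool.h>
#include <stdint.h>

/* An unsigned counter decremented below zero wraps to the maximum value. */
static uint64_t decrement_counter(uint64_t count)
{
	return count - 1;
}

/*
 * Call pattern from the commit message: the fix-up helper runs only when
 * the unreserve step reports failure (a non-zero return value here).
 */
static bool fixup_needed(int unreserve_ret)
{
	return unreserve_ret != 0;
}
```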

    Link: http://lkml.kernel.org/r/20171019230007.17043-2-mike.kravetz@oracle.com
    Fixes: 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

09 Sep, 2017

2 commits

  • Allow interval trees to quickly check for overlaps to avoid unnecessary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().
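
    The quick check can be pictured as two comparisons made before any
    descent into the tree. This is a simplified sketch under stated
    assumptions: the node and root types are hypothetical stand-ins rather
    than the kernel's rb_root_cached, and intervals are closed
    [start, last] ranges.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified node: closed interval [start, last]. */
struct sketch_node {
	unsigned long start;
	unsigned long last;
	unsigned long subtree_last; /* max 'last' in this node's subtree */
};

/* Hypothetical cached root: root node plus the leftmost (smallest start) node. */
struct sketch_root_cached {
	const struct sketch_node *root;
	const struct sketch_node *leftmost;
};

/*
 * Returns false when [start, last] cannot overlap anything in the tree, so a
 * full descent is unnecessary. Returns true when a real lookup is still needed.
 */
static bool sketch_may_overlap(const struct sketch_root_cached *tree,
			       unsigned long start, unsigned long last)
{
	if (tree->root == NULL)
		return false;
	/* Every interval in the tree ends before the query starts. */
	if (tree->root->subtree_last < start)
		return false;
	/* Every interval in the tree starts after the query ends. */
	if (tree->leftmost->start > last)
		return false;
	return true;
}
```

    Having the leftmost node cached is what makes the second comparison
    cheap; without it, finding the smallest start would itself need a walk
    down the tree.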

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Introduce a new migration mode that allows offloading the copy to a
    device DMA engine. This changes the migration workflow, and not all
    address_space migratepage callbacks can support it.

    This is intended to be used by migrate_vma(), which itself is used for
    things like HMM (see include/linux/hmm.h).

    No additional per-filesystem migratepage testing is needed. I disabled
    MIGRATE_SYNC_NO_COPY in all problematic migratepage() callbacks and
    added comments in those to explain why (part of this patch). The commit
    message was unclear: it should say that any callback that wishes to
    support this new mode needs to be aware of how its migration flow
    differs from the other modes.

    Some of these callbacks do extra locking while copying (aio, zsmalloc,
    balloon, ...) and for DMA to be effective you want to copy multiple
    pages in one DMA operation. But in the problematic cases you cannot
    easily hold the extra lock across multiple calls to this callback.

    Usual flow is:

    For each page {
    1 - lock page
    2 - call migratepage() callback
    3 - (extra locking in some migratepage() callback)
    4 - migrate page state (freeze refcount, update page cache, buffer
    head, ...)
    5 - copy page
    6 - (unlock any extra lock of migratepage() callback)
    7 - return from migratepage() callback
    8 - unlock page
    }

    The new mode MIGRATE_SYNC_NO_COPY:
    1 - lock multiple pages
    For each page {
    2 - call migratepage() callback
    3 - abort in all problematic migratepage() callbacks
    4 - migrate page state (freeze refcount, update page cache, buffer
    head, ...)
    } // finished all calls to migratepage() callback
    5 - DMA copy multiple pages
    6 - unlock all the pages

    To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
    new callback migratepages() (for instance) that deals with multiple
    pages in one transaction.

    Because the problematic cases are not important for current usage, I
    did not want to complicate this patchset even more for no good reason.
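
    The "abort" step in the new flow amounts to an early return in each
    problematic callback. A minimal sketch of that pattern follows; the
    enum is a hypothetical mirror of the migration modes, the callback is
    invented for illustration, and the -EINVAL return code is an
    assumption rather than something stated in this message.

```c
#include <errno.h>

/* Hypothetical mirror of the migration modes; values are illustrative. */
enum sketch_migrate_mode {
	SKETCH_MIGRATE_ASYNC,
	SKETCH_MIGRATE_SYNC_LIGHT,
	SKETCH_MIGRATE_SYNC,
	SKETCH_MIGRATE_SYNC_NO_COPY,
};

/*
 * A callback that takes an extra lock around the copy cannot keep that lock
 * held across several calls, so it declines the no-copy mode up front.
 */
static int sketch_locked_migratepage(enum sketch_migrate_mode mode)
{
	if (mode == SKETCH_MIGRATE_SYNC_NO_COPY)
		return -EINVAL; /* assumption: error code chosen for the sketch */
	/* ... take extra lock, migrate page state, copy page, drop lock ... */
	return 0;
}
```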

    Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

3 commits

  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages.

    Just drop the argument.

    Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We want only pages from given range in remove_inode_hugepages(). Use
    pagevec_lookup_range() instead of pagevec_lookup().

    Link: http://lkml.kernel.org/r/20170726114704.7626-8-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Mike Kravetz
    Cc: Nadia Yvette Chambers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and underlying find_get_pages()) update index to
    the next page where iteration should continue. Most callers want this
    and also pagevec_lookup_tag() already does this.
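
    The calling convention described here, where the lookup advances the
    caller's index, can be sketched over a plain array. This is an
    illustration only: the function below is hypothetical and searches a
    sorted array of populated page indices rather than a real page cache.

```c
#include <stddef.h>

/*
 * Hypothetical lookup over a sorted array of populated page indices.
 * Copies up to 'max' indices that are >= *index into 'out' and advances
 * *index to one past the last index returned, so the caller can loop
 * without adjusting it by hand.
 */
static size_t sketch_lookup(const unsigned long *present, size_t n_present,
			    unsigned long *index, unsigned long *out, size_t max)
{
	size_t found = 0;

	for (size_t i = 0; i < n_present && found < max; i++) {
		if (present[i] >= *index)
			out[found++] = present[i];
	}
	if (found > 0)
		*index = out[found - 1] + 1;
	return found;
}
```

    A caller can then loop until the function returns 0, passing the same
    index variable each time.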

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data

    Linus Torvalds
     

11 Jul, 2017

1 commit

  • Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
    keep the error hugepage away from the system, which is OK but not good
    enough because the hugepage still has a refcount and unpoison doesn't
    work on the error hugepage (PageHWPoison flags are cleared but pages are
    still leaked). There is also the "wasting healthy subpages" issue. This
    patch reworks me_huge_page() to solve these issues.

    For hugetlb files, we recently gained truncation code, so let's use it
    in the hugetlbfs-specific ->error_remove_page().

    For anonymous hugepages, it's helpful to dissolve the error page after
    freeing it into the free hugepage list. The migration entry and
    PageHWPoison in the head page prevent access to it.

    TODO: dissolve_free_huge_page() can fail but we don't consider that yet.
    It's not critical (and at least no worse than now) because in such a
    case the error hugepage just stays in the free hugepage list without
    being dissolved. By virtue of PageHWPoison in the head page, it's never
    allocated to processes.

    [akpm@linux-foundation.org: fix unused var warnings]
    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

06 Jul, 2017

1 commit

  • Implement the show_options superblock op for hugetlbfs as part of a bid to
    get rid of s_options and generic_show_options() to make it easier to
    implement a context-based mount where the mount options can be passed
    individually over a file descriptor.

    Note that the uid and gid should possibly be displayed relative to the
    viewer's user namespace.

    Signed-off-by: David Howells
    cc: Nadia Yvette Chambers
    Signed-off-by: Al Viro

    David Howells
     

19 Jun, 2017

1 commit

  • Stack guard page is a useful feature to reduce a risk of stack smashing
    into a different mapping. We have been using a single page gap which
    is sufficient to prevent having stack adjacent to a different mapping.
    But this seems to be insufficient in the light of the stack usage in
    userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
    used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
    which is 256kB or stack strings with MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries and the default
    no limit for the stack size limit because those applications can be
    tricked to consume a large portion of the stack and a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.
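
    The idea behind vm_start_gap() can be sketched as follows. This is a
    simplified userspace illustration, assuming 4 KiB pages so that the
    1MB default corresponds to 256 pages; the struct is a hypothetical
    stand-in for a vma and a boolean replaces the VM_GROWSDOWN flag.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB base pages */

/* 256 pages of 4 KiB is the 1MB default gap mentioned above. */
static const unsigned long sketch_stack_guard_gap = 256UL << SKETCH_PAGE_SHIFT;

/* Hypothetical, heavily simplified stand-in for a vma. */
struct sketch_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	bool grows_down; /* stands in for the VM_GROWSDOWN flag */
};

/*
 * Start address to use when looking for free space next to this vma: for a
 * downward-growing stack the guard gap counts as part of the occupied range.
 */
static unsigned long sketch_vm_start_gap(const struct sketch_vma *vma)
{
	unsigned long start = vma->vm_start;

	if (vma->grows_down) {
		start -= sketch_stack_guard_gap;
		if (start > vma->vm_start) /* wrapped below zero: clamp */
			start = 0;
	}
	return start;
}
```

    Because the gap lives outside the vma, it is never charged to the
    stack's own size, which avoids the RLIMIT_AS problem described above.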

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Apr, 2017

1 commit

  • If mmap() maps a file, it can be passed an offset into the file at which
    the mapping is to start. Offset could be a negative value when
    represented as a loff_t. The offset plus length will be used to update
    the file size (i_size) which is also a loff_t.

    Validate the value of offset and offset + length to make sure they do
    not overflow and appear as negative.
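
    A validation of this kind can be sketched in userspace as shown below.
    The function name is hypothetical and int64_t stands in for loff_t;
    the comparisons are done on unsigned values so that the check itself
    cannot trigger signed overflow. It is an illustration, not the code
    from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when both 'offset' and 'offset + length' are representable
 * as non-negative 64 bit signed values (int64_t stands in for loff_t).
 */
static bool sketch_offset_range_valid(uint64_t offset, uint64_t length)
{
	const uint64_t max = (uint64_t)INT64_MAX;

	if (offset > max)
		return false; /* offset alone would read as negative */
	if (length > max - offset)
		return false; /* offset + length would exceed the signed range */
	return true;
}
```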

    Found by syzkaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
    region_abort if region_chg fails") applied. Prior to this commit, the
    overflow would still occur but we would luckily return ENOMEM.

    To reproduce:

    mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);

    Resulted in,

    kernel BUG at mm/hugetlb.c:742!
    Call Trace:
    hugetlbfs_evict_inode+0x80/0xa0
    evict+0x24a/0x620
    iput+0x48f/0x8c0
    dentry_unlink_inode+0x31f/0x4d0
    __dentry_kill+0x292/0x5e0
    dput+0x730/0x830
    __fput+0x438/0x720
    ____fput+0x1a/0x20
    task_work_run+0xfe/0x180
    exit_to_usermode_loop+0x133/0x150
    syscall_return_slowpath+0x184/0x1c0
    entry_SYSCALL_64_fastpath+0xab/0xad

    Fixes: ff8c0c53c475 ("mm/hugetlb.c: don't call region_abort if region_chg fails")
    Link: http://lkml.kernel.org/r/1491951118-30678-1-git-send-email-mike.kravetz@oracle.com
    Reported-by: Vegard Nossum
    Signed-off-by: Mike Kravetz
    Acked-by: Hillf Danton
    Cc: Dmitry Vyukov
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Andrey Ryabinin
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

01 Apr, 2017

1 commit

  • Any time after inode allocation, destroy_inode can be called. The
    hugetlbfs inode contains a shared_policy structure, and
    mpol_free_shared_policy is unconditionally called as part of
    hugetlbfs_destroy_inode. Initialize the policy as part of inode
    allocation so that any quick (error path) calls to destroy_inode will be
    handed an initialized policy.

    syzkaller fuzzer found this bug, that resulted in the following:

    BUG: KASAN: user-memory-access in atomic_inc
    include/asm-generic/atomic-instrumented.h:87 [inline] at addr
    000000131730bd7a
    BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
    kernel/locking/lockdep.c:3239 at addr 000000131730bd7a
    Write of size 4 by task syz-executor6/14086
    CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
    __lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
    lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
    __raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
    _raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
    mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
    hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
    alloc_inode+0x10d/0x180 fs/inode.c:216
    new_inode_pseudo+0x69/0x190 fs/inode.c:889
    new_inode+0x1c/0x40 fs/inode.c:918
    hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
    hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
    newseg+0x422/0xd30 ipc/shm.c:575
    ipcget_new ipc/util.c:285 [inline]
    ipcget+0x21e/0x580 ipc/util.c:639
    SYSC_shmget ipc/shm.c:673 [inline]
    SyS_shmget+0x158/0x230 ipc/shm.c:657
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    Analysis provided by Tetsuo Handa

    Link: http://lkml.kernel.org/r/1490477850-7944-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Cc: Tetsuo Handa
    Cc: Michal Hocko
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

02 Mar, 2017

1 commit


25 Dec, 2016

1 commit


11 Oct, 2016

2 commits

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • When the huge page is added to the page cache (huge_add_to_page_cache),
    the page private flag will be cleared. Since this code
    (remove_inode_hugepages) will only be called for pages in the page
    cache, PagePrivate(page) will always be false.

    The patch removes the code without any functional change.

    Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

28 Sep, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.
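
    What "granularity" means for a timestamp can be shown with a small
    helper that rounds the nanosecond field down to a multiple of the
    filesystem's granularity. This is a sketch of the concept only, with a
    hypothetical struct and function; it is not current_time() itself.

```c
#define SKETCH_NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for a seconds + nanoseconds timestamp. */
struct sketch_time {
	long long sec;
	long nsec; /* 0 .. 999999999 */
};

/*
 * Round a timestamp down to a granularity given in nanoseconds.
 * Granularities outside the range [1, 1 second] leave the value untouched.
 */
static struct sketch_time sketch_trunc_to_gran(struct sketch_time t, long gran)
{
	if (gran == SKETCH_NSEC_PER_SEC)
		t.nsec = 0;
	else if (gran > 1 && gran < SKETCH_NSEC_PER_SEC)
		t.nsec -= t.nsec % gran;
	/* gran == 1 (nanosecond resolution) keeps the value unchanged */
	return t;
}
```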

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
     

22 Sep, 2016

1 commit

  • inode_change_ok() will be responsible for clearing capabilities and IMA
    extended attributes and as such will need dentry. Give it as an argument
    to inode_change_ok() instead of an inode. Also rename inode_change_ok()
    to setattr_prepare() to better reflect that it also does some
    modifications in addition to checks.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

16 Jan, 2016

3 commits

  • Page faults can race with fallocate hole punch. If a page fault happens
    between the unmap and remove operations, the page is not removed and
    remains within the hole. This is not the desired behavior. The race is
    difficult to detect in user level code as even in the non-race case, a
    page within the hole could be faulted back in before fallocate returns.
    If userfaultfd is expanded to support hugetlbfs in the future, this race
    will be easier to observe.

    If this race is detected and a page is mapped, the remove operation
    (remove_inode_hugepages) will unmap the page before removing. The unmap
    within remove_inode_hugepages occurs with the hugetlb_fault_mutex held
    so that no other faults will be processed until the page is removed.

    The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
    remove_inode_hugepages to satisfy the new reference.

    [akpm@linux-foundation.org: move hugetlb_vmdelete_list()]
    Signed-off-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Davidlohr Bueso
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine. The
    argument end is of type pgoff_t. It was being converted to a vaddr
    offset and passed to unmap_hugepage_range. However, end was also being
    used as an argument to the vma_interval_tree_foreach controlling loop.
    In addition, the conversion of end to vaddr offset was incorrect.

    hugetlb_vmtruncate_list is called as part of a file truncate or
    fallocate hole punch operation.

    When truncating a hugetlbfs file, this bug could prevent some pages from
    being unmapped. This is possible if there are multiple vmas mapping the
    file, and there is a sufficiently sized hole between the mappings. The
    size of the hole between two vmas (A,B) must be such that the starting
    virtual address of B is greater than (ending virtual address of A <<
    PAGE_SHIFT). In this case, the pages in B would not be unmapped. If
    pages are not properly unmapped during truncate, the following BUG is
    hit:

    kernel BUG at fs/hugetlbfs/inode.c:428!

    In the fallocate hole punch case, this bug could prevent pages from
    being unmapped as in the truncate case. However, for hole punch the
    result is that unmapped pages will not be removed during the operation.
    For hole punch, it is also possible that more pages than desired will be
    unmapped. This unnecessary unmapping will cause page faults to
    reestablish the mappings on subsequent page access.

    Fixes: 1bfad99ab ("hugetlbfs: hugetlb_vmtruncate_list() needs to take a range")
    Reported-by: Hillf Danton
    Signed-off-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: Dave Hansen
    Cc: [4.3]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Dmitry Vyukov has reported[1] possible deadlock (triggered by his
    syzkaller fuzzer):

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&hugetlbfs_i_mmap_rwsem_key);
                                   lock(&mapping->i_mmap_rwsem);
                                   lock(&hugetlbfs_i_mmap_rwsem_key);
      lock(&mapping->i_mmap_rwsem);

    Both traces point to mm_take_all_locks() as the source of the problem.
    It doesn't take care of the ordering of hugetlbfs_i_mmap_rwsem_key (aka
    mapping->i_mmap_rwsem for hugetlb mappings) vs. i_mmap_rwsem.

    huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
    and the allocator can take i_mmap_rwsem if it hits reclaim. So we need
    to take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem
    from the rest of the VMAs.

    The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.

    [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Reviewed-by: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

3 commits

  • The Kconfig currently controlling compilation of this code is:

    config HUGETLBFS
            bool "HugeTLB file system support"

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the modular code that is essentially orphaned, so that when
    reading the driver there is no doubt it is builtin-only.

    Since module_init translates to device_initcall in the non-modular case,
    the init ordering gets moved to earlier levels when we use the more
    appropriate initcalls here.

    Originally I had the fs part and the mm part as separate commits, just
    by happenstance of the nature of how I detected these non-modular use
    cases. But that can possibly introduce regressions if the patch merge
    ordering puts the fs part 1st -- as the 0-day testing reported a splat
    at mount time.

    Investigating with "initcall_debug" showed that the delta was
    init_hugetlbfs_fs being called _before_ hugetlb_init instead of after. So
    both the fs change and the mm change are here together.

    In addition, it worked before due to luck of link order, since they were
    both in the same initcall category. So we now have the fs part using
    fs_initcall, and the mm part using subsys_initcall, which puts it one
    bucket earlier. It now passes the basic sanity test that failed in
    earlier 0-day testing.
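
    The ordering argument above can be sketched as a small userspace model.
    This is only an illustration: the enum and the order_of() helper are
    hypothetical names, not kernel code. The level numbers follow the kernel's
    initcall macros (subsys_initcall is level 4, fs_initcall is level 5, and
    device_initcall, which module_init maps to for built-in code, is level 6).

```c
#include <assert.h>

/* Initcall levels in boot order; lower levels run first. */
enum initcall_level {
    LEVEL_CORE = 1,
    LEVEL_POSTCORE = 2,
    LEVEL_ARCH = 3,
    LEVEL_SUBSYS = 4,   /* subsys_initcall() */
    LEVEL_FS = 5,       /* fs_initcall() */
    LEVEL_DEVICE = 6,   /* device_initcall(), i.e. built-in module_init() */
    LEVEL_LATE = 7,
};

enum order {
    ORDER_GUARANTEED,      /* "first" always runs before "second" */
    ORDER_LINK_DEPENDENT,  /* same level: decided by link order */
    ORDER_REVERSED,        /* "second" runs before "first" */
};

/* Hypothetical helper: does an initcall at level "first" run before one
 * at level "second"? */
static enum order order_of(enum initcall_level first,
                           enum initcall_level second)
{
    assert(first >= LEVEL_CORE && first <= LEVEL_LATE);
    assert(second >= LEVEL_CORE && second <= LEVEL_LATE);

    if (first < second)
        return ORDER_GUARANTEED;
    if (first == second)
        return ORDER_LINK_DEPENDENT;
    return ORDER_REVERSED;
}
```

    With hugetlb_init as "first" and init_hugetlbfs_fs as "second", the model
    gives ORDER_LINK_DEPENDENT when both are device initcalls, ORDER_REVERSED
    when only the fs part has moved to fs_initcall, and ORDER_GUARANTEED for
    the subsys_initcall/fs_initcall pairing used here.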

    We delete the MODULE_LICENSE tag and capture that information at the top
    of the file alongside author comments, etc.

    We don't replace module.h with init.h since the file already has that.
    Also note that MODULE_ALIAS is a no-op for non-modular code.

    Signed-off-by: Paul Gortmaker
    Reported-by: kernel test robot
    Cc: Nadia Yvette Chambers
    Cc: Alexander Viro
    Cc: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Hillf Danton
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • When running the SPECint_rate gcc on some very large boxes it was
    noticed that the system was spending lots of time in
    mpol_shared_policy_lookup(). The gamess benchmark can also show it and
    is what I mostly used to chase down the issue, since I found its setup
    to be easier.

    To be clear the binaries were on tmpfs because of disk I/O requirements.
    We then used text replication to avoid icache misses and having all the
    copies banging on the memory where the instruction code resides. This
    results in us hitting a bottleneck in mpol_shared_policy_lookup() since
    lookup is serialised by the shared_policy lock.

    I have only reproduced this on very large (3k+ cores) boxes. The
    problem starts showing up at just a few hundred ranks getting worse
    until it threatens to livelock once it gets large enough. For example
    on the gamess benchmark at 128 ranks this area consumes only ~1% of
    time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
    90%.

    To alleviate the contention in this area I converted the spinlock to an
    rwlock. This allows a large number of lookups to happen simultaneously.
    The results were quite good, reducing this consumption at max ranks to
    around 2%.

    [akpm@linux-foundation.org: tidy up code comments]
    Signed-off-by: Nathan Zimmer
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Nadia Yvette Chambers
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 Dec, 2015

1 commit

  • kmap() in page_follow_link_light() needed to go: holding an
    arbitrary number of kmaps for a long time is a great way to deadlock
    the system.

    A new helper, inode_nohighmem(inode), needs to be used for pagecache
    symlink inodes; this is done for all in-tree cases.
    page_follow_link_light() is instrumented to yell about anything missed.

    Signed-off-by: Al Viro

    Al Viro
     

21 Nov, 2015

1 commit

  • Hugh Dickins pointed out problems with the new hugetlbfs fallocate hole
    punch code. These problems are in the routine remove_inode_hugepages and
    mostly occur in the case where there are holes in the range of pages to be
    removed. These holes could be the result of a previous hole punch or
    simply sparse allocation. The current code could access pages outside the
    specified range.

    remove_inode_hugepages handles both hole punch and truncate operations.
    Page index handling was fixed/cleaned up so that the loop index always
    matches the page being processed. The code now only makes a single pass
    through the range of pages as it was determined page faults could not race
    with truncate. A cond_resched() was added after removing up to
    PAGEVEC_SIZE pages.

    Some totally unnecessary code in hugetlbfs_fallocate() that remained from
    early development was also removed.

    Tested with fallocate tests submitted here:
    http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
    And, some ftruncate tests under development

    Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
    Signed-off-by: Mike Kravetz
    Acked-by: Hugh Dickins
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: "Hillf Danton"
    Cc: [4.3]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

09 Sep, 2015

3 commits

  • This is based on the shmem version, but it has diverged quite a bit. We
    have no swap to worry about, nor the new file sealing. Add
    synchronization via the fault mutex table to coordinate page faults,
    fallocate allocation and fallocate hole punch.

    What this allows us to do is move physical memory in and out of a
    hugetlbfs file without having it mapped. This also gives us the ability
    to support MADV_REMOVE since it is currently implemented using
    fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of
    a hugetlbfs file, which wasn't possible before.

    hugetlbfs fallocate only operates on whole huge pages.
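
    As an illustration of the whole-huge-page restriction, the sketch below
    shrinks a requested hole to the huge pages it fully covers. The inward
    rounding (start rounded up, end rounded down) is an assumption made for
    this example, and struct hole_range and whole_huge_pages() are
    hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Byte range [start, end); empty when start == end. */
struct hole_range {
    uint64_t start, end;
};

/* Shrink [offset, offset + len) to the huge pages it fully covers.
 * hpage_size must be a power of two. */
static struct hole_range whole_huge_pages(uint64_t offset, uint64_t len,
                                          uint64_t hpage_size)
{
    struct hole_range r;
    uint64_t mask;

    assert(hpage_size != 0 && (hpage_size & (hpage_size - 1)) == 0);
    mask = ~(hpage_size - 1);

    r.start = (offset + hpage_size - 1) & mask;  /* round start up */
    r.end = (offset + len) & mask;               /* round end down */
    if (r.end < r.start)
        r.end = r.start;   /* no whole huge page inside the request */
    return r;
}
```

    From userspace, a non-empty range could then be passed to fallocate()
    with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE on a file descriptor
    opened on a hugetlbfs mount.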

    Based on code by Dave Hansen.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Modify truncate_hugepages() to take a range of pages (start, end)
    instead of simply start. If an end value of LLONG_MAX is passed, the
    current "truncate" functionality is maintained. Existing callers are
    modified to pass LLONG_MAX as end of range. By keying off end ==
    LLONG_MAX, the routine behaves differently for truncate and hole punch.
    Page removal is now synchronized with page allocation via faults by
    using the fault mutex table. The hole punch case can experience the
    rare region_del error and must handle accordingly.
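
    The end == LLONG_MAX convention can be sketched as follows. This is a
    simplified model rather than the routine itself; the enum and
    remove_mode_for() are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <limits.h>

enum remove_mode {
    REMOVE_TRUNCATE,    /* drop everything from lstart to end of file */
    REMOVE_HOLE_PUNCH,  /* drop only the pages in [lstart, lend) */
};

/* Classify a removal request by its end value, as the commit describes:
 * an end of LLONG_MAX keeps the existing truncate behaviour. */
static enum remove_mode remove_mode_for(long long lstart, long long lend)
{
    assert(lstart >= 0 && lend >= lstart);
    return lend == LLONG_MAX ? REMOVE_TRUNCATE : REMOVE_HOLE_PUNCH;
}
```

    Existing truncate callers would pass LLONG_MAX and see no change in
    behaviour, while a hole punch passes a finite end.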

    Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
    the case where region_del returns an error.

    Since the routine handles more than just the truncate case, it is
    renamed to remove_inode_hugepages(). To be consistent, the routine
    truncate_huge_page() is renamed remove_huge_page().

    Downstream of remove_inode_hugepages(), the routine
    hugetlb_unreserve_pages() is also modified to take a range of pages.
    hugetlb_unreserve_pages is modified to detect an error from region_del and
    pass it back to the caller.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • fallocate hole punch will want to unmap a specific range of pages.
    Modify the existing hugetlb_vmtruncate_list() routine to take a
    start/end range. If end is 0, this indicates all pages after start
    should be unmapped. This is the same as the existing truncate
    functionality. Modify existing callers to add 0 as end of range.
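
    A minimal model of the "end of 0 means everything after start" convention
    is shown below. It works on page offsets and is not the kernel routine:
    struct pg_range and unmap_overlap() are hypothetical, and a single range
    stands in for each VMA on the list.

```c
#include <assert.h>

/* Page-offset range [start, end); empty when start == end. */
struct pg_range {
    unsigned long start, end;
};

/* Return the part of "vma" that falls inside the request [start, end).
 * An end of 0 means "all pages after start", as in the truncate case. */
static struct pg_range unmap_overlap(struct pg_range vma,
                                     unsigned long start, unsigned long end)
{
    struct pg_range r;

    assert(vma.start <= vma.end);
    assert(end == 0 || end >= start);

    r.start = vma.start > start ? vma.start : start;
    r.end = (end == 0 || end > vma.end) ? vma.end : end;
    if (r.end < r.start)
        r.end = r.start;   /* request does not touch this VMA */
    return r;
}
```

    A truncate caller passes 0 as end and unmaps the tail of every
    overlapping VMA, while a hole punch passes a real end and leaves the
    pages beyond it mapped.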

    Since the routine will be used in hole punch as well as truncate
    operations, it is more appropriately renamed to hugetlb_vmdelete_list().

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

07 Aug, 2015

1 commit

  • The shm implementation internally uses shmem or hugetlbfs inodes for shm
    segments. As these inodes are never directly exposed to userspace and
    only accessed through the shm operations which are already hooked by
    security modules, mark the inodes with the S_PRIVATE flag so that inode
    security initialization and permission checking is skipped.

    This was motivated by the following lockdep warning:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
    -------------------------------------------------------
    httpd/1597 is trying to acquire lock:
    (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc7/0x270
    __might_fault+0x7a/0xa0
    filldir+0x9e/0x130
    xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
    xfs_readdir+0x1b4/0x330 [xfs]
    xfs_file_readdir+0x2b/0x30 [xfs]
    iterate_dir+0x97/0x130
    SyS_getdents+0x91/0x120
    entry_SYSCALL_64_fastpath+0x12/0x76
    -> #2 (&xfs_dir_ilock_class){++++.+}:
    lock_acquire+0xc7/0x270
    down_read_nested+0x57/0xa0
    xfs_ilock+0x167/0x350 [xfs]
    xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
    xfs_attr_get+0xbd/0x190 [xfs]
    xfs_xattr_get+0x3d/0x70 [xfs]
    generic_getxattr+0x4f/0x70
    inode_doinit_with_dentry+0x162/0x670
    sb_finish_set_opts+0xd9/0x230
    selinux_set_mnt_opts+0x35c/0x660
    superblock_doinit+0x77/0xf0
    delayed_superblock_init+0x10/0x20
    iterate_supers+0xb3/0x110
    selinux_complete_init+0x2f/0x40
    security_load_policy+0x103/0x600
    sel_write_load+0xc1/0x750
    __vfs_write+0x37/0x100
    vfs_write+0xa9/0x1a0
    SyS_write+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    ...

    Signed-off-by: Stephen Smalley
    Reported-by: Morten Stevens
    Acked-by: Hugh Dickins
    Acked-by: Paul Moore
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Prarit Bhargava
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     

25 Jun, 2015

1 commit


27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

16 Apr, 2015

2 commits

  • Merge second patchbomb from Andrew Morton:

    - the rest of MM

    - various misc bits

    - add ability to run /sbin/reboot at reboot time

    - printk/vsprintf changes

    - fiddle with seq_printf() return value

    * akpm: (114 commits)
    parisc: remove use of seq_printf return value
    lru_cache: remove use of seq_printf return value
    tracing: remove use of seq_printf return value
    cgroup: remove use of seq_printf return value
    proc: remove use of seq_printf return value
    s390: remove use of seq_printf return value
    cris fasttimer: remove use of seq_printf return value
    cris: remove use of seq_printf return value
    openrisc: remove use of seq_printf return value
    ARM: plat-pxa: remove use of seq_printf return value
    nios2: cpuinfo: remove use of seq_printf return value
    microblaze: mb: remove use of seq_printf return value
    ipc: remove use of seq_printf return value
    rtc: remove use of seq_printf return value
    power: wakeup: remove use of seq_printf return value
    x86: mtrr: if: remove use of seq_printf return value
    linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
    MAINTAINERS: CREDITS: remove Stefano Brivio from B43
    .mailmap: add Ricardo Ribalda
    CREDITS: add Ricardo Ribalda Delgado
    ...

    Linus Torvalds
     
  • Make 'min_size=' be an option when mounting a hugetlbfs. This
    option takes the same value as the 'size' option. min_size can be
    specified without specifying size. If both are specified, min_size must
    be less than or equal to size, else the mount will fail. If min_size is
    specified, then at mount time an attempt is made to reserve min_size
    pages. If the reservation fails, the mount fails. At umount time, the
    reserved pages are released.
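
    The option rules above can be sketched as a small validity check. This is
    only an illustration of the stated rules: struct hugetlbfs_sizes, the
    NOT_SPECIFIED marker and sizes_valid() are hypothetical, and values are
    expressed in huge pages for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

#define NOT_SPECIFIED (-1L)

/* Hypothetical parsed mount options, in units of huge pages. */
struct hugetlbfs_sizes {
    long size;      /* 'size=' option, or NOT_SPECIFIED */
    long min_size;  /* 'min_size=' option, or NOT_SPECIFIED */
};

/* min_size may be given without size; if both are given, min_size must
 * not exceed size, otherwise the mount is expected to fail. */
static bool sizes_valid(struct hugetlbfs_sizes s)
{
    assert(s.size >= NOT_SPECIFIED && s.min_size >= NOT_SPECIFIED);

    if (s.size == NOT_SPECIFIED || s.min_size == NOT_SPECIFIED)
        return true;
    return s.min_size <= s.size;
}
```

    Passing this check is not sufficient for the mount to succeed, since the
    reservation of min_size pages can still fail at mount time.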

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz