08 Dec, 2018

2 commits

  • commit 5b51072e97d587186c2f5390c8c9c1fb7e179505 upstream.

    Userfaultfd did not create private memory when UFFDIO_COPY was invoked
    on a MAP_PRIVATE shmem mapping. Instead it wrote to the shmem file,
    even when that had not been opened for writing. Though, fortunately,
    that could only happen where there was a hole in the file.

    Fix the shmem-backed implementation of UFFDIO_COPY to create private
    memory for MAP_PRIVATE mappings. The hugetlbfs-backed implementation
    was already correct.

    This change is visible to userland, if userfaultfd has been used in
    unintended ways: so it introduces a small risk of incompatibility, but
    is necessary in order to respect file permissions.

    An app that uses UFFDIO_COPY for anything like postcopy live migration
    won't notice the difference, and in fact it'll run faster because there
    will be no copy-on-write and memory waste in the tmpfs pagecache
    anymore.

    Userfaults on MAP_PRIVATE shmem keep triggering only on file holes like
    before.

    The real zeropage can also be built on a MAP_PRIVATE shmem mapping
    through UFFDIO_ZEROPAGE, and that's safe because the zeropage pte is
    never dirty; in turn, even an mprotect upgrading the vma permission from
    PROT_READ to PROT_READ|PROT_WRITE won't make the zeropage pte writable.
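
    For concreteness, here is a minimal userspace sketch (not part of the
    commit) of UFFDIO_COPY resolving a missing page in a MAP_PRIVATE shmem
    mapping. It assumes a 4.11+ kernel with shmem userfaultfd support,
    headers that expose the userfaultfd/memfd_create syscall numbers, and
    that unprivileged userfaultfd is permitted; error handling is collapsed
    to exit(1):

        #include <fcntl.h>
        #include <linux/userfaultfd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
            long page = sysconf(_SC_PAGESIZE);

            /* one-page shmem file, mapped MAP_PRIVATE and left as a hole */
            int fd = syscall(SYS_memfd_create, "uffd-demo", 0);
            if (fd < 0 || ftruncate(fd, page))
                exit(1);
            char *dst = mmap(NULL, page, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);
            if (dst == MAP_FAILED)
                exit(1);

            int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
            if (uffd < 0)
                exit(1);

            struct uffdio_api api = { .api = UFFD_API, .features = 0 };
            if (ioctl(uffd, UFFDIO_API, &api))
                exit(1);

            struct uffdio_register reg = {
                .range = { .start = (unsigned long)dst, .len = page },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
            };
            if (ioctl(uffd, UFFDIO_REGISTER, &reg))
                exit(1);

            /* populate the hole up front; no page fault has to be pending */
            char *src = malloc(page);
            memset(src, 'x', page);
            struct uffdio_copy copy = {
                .dst = (unsigned long)dst,
                .src = (unsigned long)src,
                .len = page,
            };
            if (ioctl(uffd, UFFDIO_COPY, &copy))
                exit(1);

            /* with the fix, this data lives in private memory rather than in
             * the tmpfs pagecache of the underlying file */
            printf("dst[0] = %c\n", dst[0]);
            return 0;
        }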

    Link: http://lkml.kernel.org/r/20181126173452.26955-3-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reported-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Jann Horn
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit 9e368259ad988356c4c95150fafd1a06af095d98 upstream.

    Patch series "userfaultfd shmem updates".

    Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
    lack of the VM_MAYWRITE check and the lack of i_size checks.

    Then looking into the above we also fixed the MAP_PRIVATE case.

    Hugh by source review also found a data loss source if UFFDIO_COPY is
    used on shmem MAP_SHARED PROT_READ mappings (the production usages
    incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
    happen in those production usages like with QEMU).

    The whole patchset is marked for stable.

    We verified that QEMU postcopy live migration, with the guest running on
    shmem MAP_PRIVATE, runs as well after the shmem MAP_PRIVATE fix as it
    did before. Regardless of whether the backing is shmem or hugetlbfs,
    MAP_PRIVATE or MAP_SHARED, QEMU unconditionally invokes a punch hole if
    the guest mapping is filebacked, and a MADV_DONTNEED too (needed to get
    rid of the MAP_PRIVATE COWs and for the anon backend).

    This patch (of 5):

    We internally used EFAULT to communicate with the caller; switch to
    ENOENT so that EFAULT can be used as a non-internal retval.
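
    As a toy userspace illustration (not kernel code) of the convention this
    establishes: the inner helper keeps -ENOENT as its private "retry"
    signal, so a genuine -EFAULT can travel to the caller unmodified. All
    names here are hypothetical:

        #include <errno.h>
        #include <stdio.h>

        /* hypothetical inner step: -ENOENT means "retry", -EFAULT is a real fault */
        static int copy_step(int simulate)
        {
            if (simulate == 0)
                return -ENOENT;
            if (simulate == 1)
                return -EFAULT;
            return 0;
        }

        static int do_copy(int simulate)
        {
            int err = copy_step(simulate);

            if (err == -ENOENT)
                return 0;      /* internal signal: take the fallback/retry path */
            return err;        /* 0 or -EFAULT, now unambiguous for the caller */
        }

        int main(void)
        {
            printf("%d %d %d\n", do_copy(0), do_copy(1), do_copy(2));
            return 0;
        }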

    Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc: Mike Kravetz
    Cc: Jann Horn
    Cc: Peter Xu
    Cc: "Dr. David Alan Gilbert"
    Cc:
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

06 Dec, 2018

11 commits

  • commit c1cb20d43728aa9b5393bd8d489bc85c142949b2 upstream.

    We changed the key of swap cache tree from swp_entry_t.val to
    swp_offset. We need to do so in shmem_replace_page() as well.

    Hugh said:
    "shmem_replace_page() has been wrong since the day I wrote it: good
    enough to work on swap "type" 0, which is all most people ever use
    (especially those few who need shmem_replace_page() at all), but
    broken once there are any non-0 swp_type bits set in the higher order
    bits"

    Link: http://lkml.kernel.org/r/20181121215442.138545-1-yuzhao@google.com
    Fixes: f6ab1f7f6b2d ("mm, swap: use offset of swap entry as key of swap cache")
    Signed-off-by: Yu Zhao
    Reviewed-by: Matthew Wilcox
    Acked-by: Hugh Dickins
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yu Zhao
     
  • commit 06a5e1268a5fb9c2b346a3da6b97e85f2eba0f07 upstream.

    collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before
    it holds page lock of the first page, racing truncation then extension
    might conceivably have inserted a hugepage there already. Fail with the
    SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or
    otherwise mishandling the unexpected hugepage - though later we might
    code up a more constructive way of handling it, with SCAN_SUCCESS.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 87c460a0bded56195b5eb497d44709777ef7b415 upstream.

    khugepaged's collapse_shmem() does almost all of its work, to assemble
    the huge new_page from 512 scattered old pages, with the new_page's
    refcount frozen to 0 (and refcounts of all old pages so far also frozen
    to 0). Including shmem_getpage() to read in any which were out on swap,
    memory reclaim if necessary to allocate their intermediate pages, and
    copying over all the data from old to new.

    Imagine the frozen refcount as a spinlock held, but without any lock
    debugging to highlight the abuse: it's not good, and under serious load
    heads into lockups - speculative getters of the page are not expecting
    to spin while khugepaged is rescheduled.

    One can get a little further under load by hacking around elsewhere; but
    fortunately, freezing the new_page turns out to have been entirely
    unnecessary, with no hacks needed elsewhere.

    The huge new_page lock is already held throughout, and guards all its
    subpages as they are brought one by one into the page cache tree; and
    anything reading the data in that page, without the lock, before it has
    been marked PageUptodate, would already be in the wrong. So simply
    eliminate the freezing of the new_page.

    Each of the old pages remains frozen with refcount 0 after it has been
    replaced by a new_page subpage in the page cache tree, until they are
    all unfrozen on success or failure: just as before. They could be
    unfrozen sooner, but cause no problem once no longer visible to
    find_get_entry(), filemap_map_pages() and other speculative lookups.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 042a30824871fa3149b0127009074b75cc25863c upstream.

    Several cleanups in collapse_shmem(): most of which probably do not
    really matter, beyond doing things in a more familiar and reassuring
    order. Simplify the failure gotos in the main loop, and on success
    update stats while interrupts still disabled from the last iteration.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 2af8ff291848cc4b1cce24b6c943394eb2c761e8 upstream.

    Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp
    flags khugepaged uses to allocate a huge page - in all common cases it
    would just be a waste of effort - so collapse_shmem() must remember to
    clear out any holes that it instantiates.

    The obvious place to do so, where they are put into the page cache tree,
    is not a good choice: because interrupts are disabled there. Leave it
    until further down, once success is assured, where the other pages are
    copied (before setting PageUptodate).

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit aaa52e340073b7f4593b3c4ddafcafa70cf838b5 upstream.

    Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
    hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
    clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
    closed and unlinked.

    khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
    on the rollback path, after it had added but then needed to undo some
    holes.

    There is indeed an irritating asymmetry between shmem_charge(), whose
    callers want it to increment nrpages after successfully accounting
    blocks, and shmem_uncharge(), when __delete_from_page_cache() already
    decremented nrpages itself: oh well, just add a comment on that to them
    both.

    And shmem_recalc_inode() is supposed to be called when the accounting is
    expected to be in balance (so it can deduce from imbalance that reclaim
    discarded some pages): so change shmem_charge() to update nrpages
    earlier (though it's rare for the difference to matter at all).

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
    Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 701270fa193aadf00bdcf607738f64997275d4c7 upstream.

    Huge tmpfs testing showed that although collapse_shmem() recognizes a
    concurrently truncated or hole-punched page correctly, its handling of
    holes was liable to refill an emptied extent. Add check to stop that.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Matthew Wilcox
    Cc: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 006d3ff27e884f80bd7d306b041afc415f63598f upstream.

    Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that
    __split_huge_page() was using i_size_read() while holding the irq-safe
    lru_lock and page tree lock, but the 32-bit i_size_read() uses an
    irq-unsafe seqlock which should not be nested inside them.

    Instead, read the i_size earlier in split_huge_page_to_list(), and pass
    the end offset down to __split_huge_page(): all while holding head page
    lock, which is enough to prevent truncation of that extent before the
    page tree lock has been taken.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils
    Fixes: baa355fd33142 ("thp: file pages support for split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 173d9d9fd3ddae84c110fea8aedf1f26af6be9ec upstream.

    Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s
    VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page).

    Move the setting of mapping and index up before the page_ref_unfreeze()
    in __split_huge_page_tail() to fix this: so that a page cache lookup
    cannot get a reference while the tail's mapping and index are unstable.

    In fact, might as well move them up before the smp_wmb(): I don't see an
    actual need for that, but if I'm missing something, this way round is
    safer than the other, and no less efficient.

    You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page) is
    misplaced, and should be left until after the trylock_page(); but left as
    is has not crashed since, and gives more stringent assurance.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Requires: 605ca5ede764 ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Jerome Glisse
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 605ca5ede7643a01f4c4a15913f9714ac297f8a6 upstream.

    THP split makes non-atomic change of tail page flags. This is almost ok
    because tail pages are locked and isolated but this breaks recent
    changes in page locking: non-atomic operation could clear bit
    PG_waiters.

    As a result concurrent sequence get_page_unless_zero() -> lock_page()
    might block forever. Especially if this page was truncated later.

    Fix is trivial: clone flags before unfreezing page reference counter.

    This race has existed since commit 62906027091f ("mm: add PageWaiters
    indicating tasks are waiting for a page bit"), while the unsafe unfreeze
    itself was added in commit 8df651c7059e ("thp: cleanup
    split_huge_page()").

    clear_compound_head() also must be called before unfreezing page
    reference because after successful get_page_unless_zero() might follow
    put_page() which needs correct compound_head().

    And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
    is made especially for that and has semantic of smp_store_release().

    Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Konstantin Khlebnikov
     
  • commit 906f9cdfc2a0800f13683f9e4ebdfd08c12ee81b upstream.

    The term "freeze" is used in several ways in the kernel, and in mm it
    has the particular meaning of forcing page refcount temporarily to 0.
    freeze_page() is just too confusing a name for a function that unmaps a
    page: rename it unmap_page(), and rename unfreeze_page() remap_page().

    Went to change the mention of freeze_page() added later in mm/rmap.c,
    but found it to be incorrect: ordinary page reclaim reaches there too;
    but the substance of the comment still seems correct, so edit it down.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     

01 Dec, 2018

5 commits

  • [ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]

    Konstantin has noticed that kvmalloc might trigger the following
    warning:

    WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
    [...]
    Call Trace:
    fragmentation_index+0x76/0x90
    compaction_suitable+0x4f/0xf0
    shrink_node+0x295/0x310
    node_reclaim+0x205/0x250
    get_page_from_freelist+0x649/0xad0
    __alloc_pages_nodemask+0x12a/0x2a0
    kmalloc_large_node+0x47/0x90
    __kmalloc_node+0x22b/0x2e0
    kvmalloc_node+0x3e/0x70
    xt_alloc_table_info+0x3a/0x80 [x_tables]
    do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
    nf_setsockopt+0x44/0x60
    SyS_setsockopt+0x6f/0xc0
    do_syscall_64+0x67/0x120
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The problem is that we only check for an out-of-bounds order in the slow
    path, and the node reclaim might happen from the fast path already. This
    is fixable by making sure that kvmalloc doesn't ever use kmalloc for
    requests that are larger than KMALLOC_MAX_SIZE, but this also shows that
    the code is rather fragile. A recent UBSAN report just underlines that
    with the following report:

    UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
    shift exponent 51 is too large for 32-bit type 'int'
    CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0xd2/0x148 lib/dump_stack.c:113
    ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
    __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
    __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
    zone_watermark_fast mm/page_alloc.c:3216 [inline]
    get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
    __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
    alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
    alloc_pages include/linux/gfp.h:509 [inline]
    __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
    dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
    raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
    raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
    fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
    fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
    __blkdev_driver_ioctl block/ioctl.c:303 [inline]
    blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
    block_ioctl+0x105/0x150 fs/block_dev.c:1883
    vfs_ioctl fs/ioctl.c:46 [inline]
    do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
    ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
    __do_sys_ioctl fs/ioctl.c:709 [inline]
    __se_sys_ioctl fs/ioctl.c:707 [inline]
    __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
    do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Note that this is not a kvmalloc path. It is just that the fast path
    really depends on having a sanitized order as well. Therefore move the
    order check to the fast path.
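
    As a toy userspace illustration (not the kernel code, and with
    hypothetical names) of why an unsanitized order is dangerous on the fast
    path: once the order reaches the width of the type, using it as a shift
    count is undefined behaviour, so the bound has to be checked before the
    first shift, not only in the slow path:

        #include <limits.h>
        #include <stdio.h>

        /* hypothetical stand-in for a watermark-style computation on the fast path */
        static long watermark_demo(long free_pages, unsigned int order)
        {
            /* check the order up front; the kernel checks against MAX_ORDER,
             * here we only keep the shift count in range for a 32-bit int */
            if (order >= sizeof(int) * CHAR_BIT - 1)
                return -1;                        /* out-of-bounds order: reject */
            return free_pages - ((1 << order) - 1);
        }

        int main(void)
        {
            printf("%ld\n", watermark_demo(1L << 20, 9));    /* normal order */
            printf("%ld\n", watermark_demo(1L << 20, 51));   /* rejected, no UB */
            return 0;
        }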

    Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Konstantin Khlebnikov
    Reported-by: Kyungtae Kim
    Acked-by: Vlastimil Babka
    Cc: Balbir Singh
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Aaron Lu
    Cc: Joonsoo Kim
    Cc: Byoungyoung Lee
    Cc: "Dae R. Jeong"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     
  • [ Upstream commit 1a413646931cb14442065cfc17561e50f5b5bb44 ]

    Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
    lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.

    man 2 lseek says

    : EINVAL whence is not valid. Or: the resulting file offset would be
    : negative, or beyond the end of a seekable device.
    :
    : ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond
    : the end of the file.

    Make tmpfs return ENXIO under these circumstances as well. After this,
    tmpfs also passes xfstests's generic/448.
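
    As a quick userspace check (not part of the patch; it assumes /tmp is a
    tmpfs mount, adjust the path otherwise), lseek() with a negative offset
    and SEEK_DATA should now fail with ENXIO:

        #define _GNU_SOURCE
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("/tmp/seek-test", O_CREAT | O_RDWR, 0600);
            if (fd < 0)
                return 1;

            /* negative offset: expect -1 with errno == ENXIO after this change */
            off_t ret = lseek(fd, -1, SEEK_DATA);
            printf("lseek(fd, -1, SEEK_DATA) = %ld, errno = %s\n",
                   (long)ret, strerror(errno));

            close(fd);
            unlink("/tmp/seek-test");
            return 0;
        }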

    [akpm@linux-foundation.org: rewrite changelog]
    Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
    Signed-off-by: Yufen Yu
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Yufen Yu
     
  • [ Upstream commit ca0246bb97c23da9d267c2107c07fb77e38205c9 ]

    Reclaim and free can race on an object which is basically fine but in
    order for reclaim to be able to map "freed" object we need to encode
    object length in the handle. handle_to_chunks() is then introduced to
    extract object length from a handle and use it during mapping.

    Moreover, to avoid racing on a z3fold "headless" page release, we should
    not try to free that page in z3fold_free() if the reclaim bit is set.
    Also, in the unlikely case of trying to reclaim a page being freed, we
    should not proceed with that page.

    While at it, fix the page accounting in reclaim function.

    This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups".

    Link: http://lkml.kernel.org/r/20181105162225.74e8837d03583a9b707cf559@gmail.com
    Signed-off-by: Vitaly Wool
    Signed-off-by: Jongseok Kim
    Reported-by: Jongseok Kim
    Reviewed-by: Snild Dolkow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Vitaly Wool
     
  • commit ff09d7ec9786be4ad7589aa987d7dc66e2dd9160 upstream.

    We clear the pte temporarily during read/modify/write update of the pte.
    If we take a page fault while the pte is cleared, the application can get
    SIGBUS. One such case is with remap_pfn_range without a backing
    vm_ops->fault callback. do_fault will return SIGBUS in that case.

    cpu 0                                     cpu 1
    mprotect()
    ptep_modify_prot_start()/pte cleared
    .
    .                                         page fault
    .
    .
    ptep_modify_prot_commit()

    Fix this by taking page table lock and rechecking for pte_none.

    [aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run]
    Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com
    Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Willem de Bruijn
    Cc: Eric Dumazet
    Cc: Ido Schimmel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit 61448479a9f2c954cde0cfe778cb6bec5d0a748d upstream.

    Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE,
    instead it falls back to kmalloc_large().

    For slab KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls
    kmalloc_slab() for all allocations relying on NULL return value for
    over-sized allocations.

    This inconsistency leads to unwanted warnings from kmalloc_slab() for
    over-sized allocations for slab. Returning NULL for failed allocations is
    the expected behavior.

    Make slub and slab code consistent by checking size >
    KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().

    While we are here also fix the check in kmalloc_slab(). We should check
    against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. It all kinda
    worked because for slab the constants are the same, and slub always checks
    the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab(). But if we
    get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will
    happen. For example, in case of a newly introduced bug in slub code.

    Also move the check in kmalloc_slab() from function entry to the size >
    192 case. This partially compensates for the additional check in slab
    code and makes slub code a bit faster (at least theoretically).

    Also drop __GFP_NOWARN in the warning check. This warning means a bug in
    slab code itself, user-passed flags have nothing to do with it.

    Nothing of this affects slob.

    Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
    Signed-off-by: Dmitry Vyukov
    Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
    Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
    Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
    Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
    Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Vyukov
     

21 Nov, 2018

3 commits

  • commit 873d7bcfd066663e3e50113dc4a0de19289b6354 upstream.

    Commit a2468cc9bfdf ("swap: choose swap device according to numa node")
    changed 'avail_lists' field of 'struct swap_info_struct' to an array.
    In popular Linux distros it increased the size of swap_info_struct up to
    40 Kbytes, and now the swap_info_struct allocation requires an order-4
    page. Switching to kvzalloc() avoids unexpected allocation failures.

    Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
    Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node")
    Signed-off-by: Vasily Averin
    Acked-by: Aaron Lu
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Huang Ying
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     
  • commit 5e41540c8a0f0e98c337dda8b391e5dda0cde7cf upstream.

    This bug has been experienced several times by the Oracle DB team. The
    BUG is in remove_inode_hugepages() as follows:

    /*
     * If page is mapped, it was faulted in after being
     * unmapped in caller. Unmap (again) now after taking
     * the fault mutex. The mutex will prevent faults
     * until we finish removing the page.
     *
     * This race can only happen in the hole punch case.
     * Getting here in a truncate operation is a bug.
     */
    if (unlikely(page_mapped(page))) {
            BUG_ON(truncate_op);

    In this case, the elevated map count is not the result of a race.
    Rather it was incorrectly incremented as the result of a bug in the huge
    pmd sharing code. Consider the following:

    - Process A maps a hugetlbfs file of sufficient size and alignment
    (PUD_SIZE) that a pmd page could be shared.

    - Process B maps the same hugetlbfs file with the same size and
    alignment such that a pmd page is shared.

    - Process B then calls mprotect() to change protections for the mapping
    with the shared pmd. As a result, the pmd is 'unshared'.

    - Process B then calls mprotect() again to change protections for the
    mapping back to their original value. The pmd remains unshared.

    - Process B then forks and process C is created. During the fork
    process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
    tables. Copying page tables for hugetlb mappings is done in the
    routine copy_hugetlb_page_range.

    In copy_hugetlb_page_range(), the destination pte is obtained by:

    dst_pte = huge_pte_alloc(dst, addr, sz);

    If pmd sharing is possible, the returned pointer will be to a pte in an
    existing page table. In the situation above, process C could share with
    either process A or process B. Since process A is first in the list,
    the returned pte is a pointer to a pte in process A's page table.

    However, the check for pmd sharing in copy_hugetlb_page_range is:

    /* If the pagetables are shared don't copy or take references */
    if (dst_pte == src_pte)
    continue;

    Since process C is sharing with process A instead of process B, the
    above test fails. The code in copy_hugetlb_page_range which follows
    assumes dst_pte points to a huge_pte_none pte. It copies the pte entry
    from src_pte to dst_pte and increments the map count of the associated
    page. This is how we end up with an elevated map count.

    To solve, check the dst_pte entry for huge_pte_none. If !none, this
    implies PMD sharing so do not copy.

    Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
    Fixes: c5c99429fa57 ("fix hugepages leak due to pagetable page sharing")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     
  • commit ac5b2c18911ffe95c08d69273917f90212cf5659 upstream.

    THP allocation might be really disruptive when allocated on NUMA system
    with the local node full or hard to reclaim. Stefan has posted an
    allocation stall report on 4.12 based SLES kernel which suggests the
    same issue:

    kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
    kvm cpuset=/ mems_allowed=0-1
    CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph 0000001 SLE15 (unreleased)
    Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
    Call Trace:
    dump_stack+0x5c/0x84
    warn_alloc+0xe0/0x180
    __alloc_pages_slowpath+0x820/0xc90
    __alloc_pages_nodemask+0x1cc/0x210
    alloc_pages_vma+0x1e5/0x280
    do_huge_pmd_wp_page+0x83f/0xf00
    __handle_mm_fault+0x93d/0x1060
    handle_mm_fault+0xc6/0x1b0
    __do_page_fault+0x230/0x430
    do_page_fault+0x2a/0x70
    page_fault+0x7b/0x80
    [...]
    Mem-Info:
    active_anon:126315487 inactive_anon:1612476 isolated_anon:5
    active_file:60183 inactive_file:245285 isolated_file:0
    unevictable:15657 dirty:286 writeback:1 unstable:0
    slab_reclaimable:75543 slab_unreclaimable:2509111
    mapped:81814 shmem:31764 pagetables:370616 bounce:0
    free:32294031 free_pcp:6233 free_cma:0
    Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

    The defrag mode is "madvise" and from the above report it is clear that
    the THP has been allocated for a MADV_HUGEPAGE vma.
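
    For reference, this is the kind of mapping being discussed; a minimal
    sketch (not from the report), where the size is assumed to exceed what
    the local node can comfortably provide:

        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 1UL << 30;   /* 1 GiB; pick something larger than the local node's free memory */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                return 1;

            /* with defrag=madvise, faults in this vma try hard to get THP */
            madvise(p, len, MADV_HUGEPAGE);

            /* touching every page is where the pre-fix kernel could stall,
             * swapping the local node instead of falling back */
            memset(p, 1, len);
            return 0;
        }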

    Andrea has identified that the main source of the problem is
    __GFP_THISNODE usage:

    : The problem is that direct compaction combined with the NUMA
    : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
    : hard the local node, instead of failing the allocation if there's no
    : THP available in the local node.
    :
    : Such logic was ok until __GFP_THISNODE was added to the THP allocation
    : path even with MPOL_DEFAULT.
    :
    : The idea behind the __GFP_THISNODE addition, is that it is better to
    : provide local memory in PAGE_SIZE units than to use remote NUMA THP
    : backed memory. That largely depends on the remote latency though, on
    : threadrippers for example the overhead is relatively low in my
    : experience.
    :
    : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
    : extremely slow qemu startup with vfio, if the VM is larger than the
    : size of one host NUMA node. This is because it will try very hard to
    : unsuccessfully swapout get_user_pages pinned pages as result of the
    : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
    : allocations and instead of trying to allocate THP on other nodes (it
    : would be even worse without vfio type1 GUP pins of course, except it'd
    : be swapping heavily instead).

    Fix this by removing __GFP_THISNODE for THP requests which are
    requesting the direct reclaim. This effectively reverts 5265047ac301
    on the grounds that the zone/node reclaim was known to be disruptive due
    to premature reclaim when there was memory free. While it made sense at
    the time for HPC workloads without NUMA awareness on rare machines, it
    was ultimately harmful in the majority of cases. The existing behaviour
    is similar, if not as widespread, as it applies to a corner case, but
    crucially, it cannot be tuned around like zone_reclaim_mode can. The
    default behaviour should always be to cause the least harm for the
    common case.

    If there are specialised use cases out there that want zone_reclaim_mode
    in specific cases, then it can be built on top. Longterm we should
    consider a memory policy which allows for the node reclaim like behavior
    for the specific memory ranges which would allow a

    [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

    Mel said:

    : Both patches look correct to me but I'm responding to this one because
    : it's the fix. The change makes sense and moves further away from the
    : severe stalling behaviour we used to see with both THP and zone reclaim
    : mode.
    :
    : I put together a basic experiment with usemem configured to reference a
    : buffer multiple times that is 80% the size of main memory on a 2-socket
    : box with symmetric node sizes and defrag set to "always". The defrag
    : setting is not the default but it would be functionally similar to
    : accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to
    : reference the buffer multiple times and while it's not an interesting
    : workload, it would be expected to complete reasonably quickly as it fits
    : within memory. The results were;
    :
    : usemem
    :                                  vanilla        noreclaim-v1
    : Amean     Elapsd-1       42.78 (  0.00%)     26.87 ( 37.18%)
    : Amean     Elapsd-3       27.55 (  0.00%)      7.44 ( 73.00%)
    : Amean     Elapsd-4        5.72 (  0.00%)      5.69 (  0.45%)
    :
    : This shows the elapsed time in seconds for 1 thread, 3 threads and 4
    : threads referencing buffers 80% the size of memory. With the patches
    : applied, it's 37.18% faster for the single thread and 73% faster with two
    : threads. Note that 4 threads showing little difference does not indicate
    : the problem is related to thread counts. It's simply the case that 4
    : threads gets spread so their workload mostly fits in one node.
    :
    : The overall view from /proc/vmstats is more startling
    :
    :                            4.19.0-rc1        4.19.0-rc1
    :                               vanilla    noreclaim-v1r1
    : Minor Faults                 35593425            708164
    : Major Faults                   484088                36
    : Swap Ins                      3772837                 0
    : Swap Outs                     3932295                 0
    :
    : Massive amounts of swap in/out without the patch
    :
    : Direct pages scanned          6013214                 0
    : Kswapd pages scanned                0                 0
    : Kswapd pages reclaimed              0                 0
    : Direct pages reclaimed        4033009                 0
    :
    : Lots of reclaim activity without the patch
    :
    : Kswapd efficiency                100%              100%
    : Kswapd velocity                 0.000             0.000
    : Direct efficiency                 67%              100%
    : Direct velocity             11191.956             0.000
    :
    : Mostly from direct reclaim context as you'd expect without the patch.
    :
    : Page writes by reclaim    3932314.000             0.000
    : Page writes file                   19                 0
    : Page writes anon              3932295                 0
    : Page reclaim immediate          42336                 0
    :
    : Writes from reclaim context is never good but the patch eliminates it.
    :
    : We should never have default behaviour to thrash the system for such a
    : basic workload. If zone reclaim mode behaviour is ever desired but on a
    : single task instead of a global basis then the sensible option is to build
    : a mempolicy that enforces that behaviour.

    This was a severe regression compared to previous kernels that made
    important workloads unusable, and it started when __GFP_THISNODE was
    added to THP allocations under MADV_HUGEPAGE. It is not a significant
    risk to go back to the behavior before __GFP_THISNODE was added; it
    worked like that for years.

    This was simply an optimization to some lucky workloads that can fit in
    a single node, but it ended up breaking the VM for others that can't
    possibly fit in a single node, so going back is safe.

    [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
    Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
    Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Michal Hocko
    Reported-by: Stefan Priebe
    Debugged-by: Andrea Arcangeli
    Reported-by: Alex Williamson
    Reviewed-by: Mel Gorman
    Tested-by: Mel Gorman
    Cc: Zi Yan
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

14 Nov, 2018

2 commits

  • commit aab8d0520e6e7c2a61f71195e6ce7007a4843afb upstream.

    Private ZONE_DEVICE pages use a special pte entry and thus are not
    present. Properly handle this case in map_pte(); it is already handled in
    check_pte(), and the map_pte() part was most probably lost in some rebase.

    Without this patch the slow migration path cannot migrate private
    ZONE_DEVICE memory back to regular memory. This was found after stress
    testing migration back to system memory. This ultimately can lead to the
    CPU constantly page-fault looping on the special swap entry.

    Link: http://lkml.kernel.org/r/20181019160442.18723-3-jglisse@redhat.com
    Signed-off-by: Ralph Campbell
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Balbir Singh
    Cc: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ralph Campbell
     
  • commit 22146c3ce98962436e401f7b7016a6f664c9ffb5 upstream.

    Some test systems were experiencing negative huge page reserve counts and
    incorrect file block counts. This was traced to /proc/sys/vm/drop_caches
    removing clean pages from hugetlbfs file pagecaches. When non-hugetlbfs
    explicit code removes the pages, the appropriate accounting is not
    performed.

    This can be recreated as follows:
    fallocate -l 2M /dev/hugepages/foo
    echo 1 > /proc/sys/vm/drop_caches
    fallocate -l 2M /dev/hugepages/foo
    grep -i huge /proc/meminfo
    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 2048
    HugePages_Free: 2047
    HugePages_Rsvd: 18446744073709551615
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 4194304 kB
    ls -lsh /dev/hugepages/foo
    4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo

    To address this issue, dirty pages as they are added to pagecache. This
    can easily be reproduced with fallocate as shown above. Read faulted
    pages will eventually end up being marked dirty. But there is a window
    where they are clean and could be impacted by code such as drop_caches.
    So, just dirty them all as they are added to the pagecache.

    Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
    Fixes: 6bda666a03f0 ("hugepages: fold find_or_alloc_pages into huge_no_page()")
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Reviewed-by: Khalid Aziz
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

20 Oct, 2018

1 commit

  • commit eb66ae030829605d61fbef1909ce310e29f78821 upstream.

    Jann Horn points out that our TLB flushing was subtly wrong for the
    mremap() case. What makes mremap() special is that we don't follow the
    usual "add page to list of pages to be freed, then flush tlb, and then
    free pages". No, mremap() obviously just _moves_ the page from one page
    table location to another.

    That matters, because mremap() thus doesn't directly control the
    lifetime of the moved page with a freelist: instead, the lifetime of the
    page is controlled by the page table locking, that serializes access to
    the entry.

    As a result, we need to flush the TLB not just before releasing the lock
    for the source location (to avoid any concurrent accesses to the entry),
    but also before we release the destination page table lock (to avoid the
    TLB being flushed after somebody else has already done something to that
    page).

    This also makes the whole "need_flush" logic unnecessary, since we now
    always end up flushing the TLB for every valid entry.

    Reported-and-tested-by: Jann Horn
    Acked-by: Will Deacon
    Tested-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

18 Oct, 2018

7 commits

  • commit 7aaf7727235870f497eb928f728f7773d6df3b40 upstream.

    Don't show nr_indirectly_reclaimable in /proc/vmstat, because there is
    no need to export this vm counter to userspace, and some changes are
    expected in reclaimable object accounting, which can alter this counter.

    Link: http://lkml.kernel.org/r/20180425191422.9159-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit d79f7aa496fc94d763f67b833a1f36f4c171176f upstream.

    Indirectly reclaimable memory can consume a significant part of total
    memory and it's actually reclaimable (it will be released under actual
    memory pressure).

    So, the overcommit logic should treat it as free.

    Otherwise, it's possible to cause random system-wide memory allocation
    failures by consuming a significant amount of memory by indirectly
    reclaimable memory, e.g. dentry external names.

    If overcommit policy GUESS is used, it might be used for denial of
    service attack under some conditions.

    The following program illustrates the approach. It causes the kernel to
    allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
    that at some point the overcommit logic may start blocking large
    allocations system-wide.

    #include <stdio.h>
    #include <sys/stat.h>

    int main()
    {
        char buf[256];
        unsigned long i;
        struct stat statbuf;

        buf[0] = '/';
        for (i = 1; i < sizeof(buf); i++)
            buf[i] = '_';

        for (i = 0; 1; i++) {
            sprintf(&buf[248], "%8lu", i);
            stat(buf, &statbuf);
        }

        return 0;
    }

    This patch in combination with related indirectly reclaimable memory
    patches closes this issue.

    Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit 034ebf65c3c21d85b963d39f992258a64a85e3a9 upstream.

    Adjust /proc/meminfo MemAvailable calculation by adding the amount of
    indirectly reclaimable memory (rounded to the PAGE_SIZE).

    Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit eb59254608bc1d42c4c6afdcdce9c0d3ce02b318 upstream.

    Patch series "indirectly reclaimable memory", v2.

    This patchset introduces the concept of indirectly reclaimable memory
    and applies it to fix the issue where a big number of dentries with
    external names can significantly affect the MemAvailable value.

    This patch (of 3):

    Introduce a concept of indirectly reclaimable memory and add the
    corresponding memory counter and /proc/vmstat item.

    Indirectly reclaimable memory is any sort of memory used by the kernel
    (except for reclaimable slabs) which is actually reclaimable, i.e. will
    be released under memory pressure.

    The counter is in bytes, as it's not always possible to count such
    objects in pages. The name contains BYTES by analogy to
    NR_KERNEL_STACK_KB.

    Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit bfba8e5cf28f413aa05571af493871d74438979f upstream.

    Inside set_pmd_migration_entry() we are holding page table locks, and
    thus we cannot sleep, so we cannot call invalidate_range_start/end().

    So remove the call to mmu_notifier_invalidate_range_start/end() because
    they are called inside the function calling set_pmd_migration_entry()
    (see try_to_unmap_one()).

    Link: http://lkml.kernel.org/r/20181012181056.7864-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reported-by: Andrea Arcangeli
    Reviewed-by: Zi Yan
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jérôme Glisse
     
  • commit 6685b357363bfe295e3ae73665014db4aed62c58 upstream.

    The commit ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
    introduced bitmap metadata blocks. These metadata blocks are allocated
    whenever a new chunk is created, but they are never freed. Fix it.

    Fixes: ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
    Signed-off-by: Mike Rapoport
    Cc: stable@vger.kernel.org
    Signed-off-by: Dennis Zhou
    Signed-off-by: Greg Kroah-Hartman

    Mike Rapoport
     
  • commit 28e2c4bb99aa40f9d5f07ac130cbc4da0ea93079 upstream.

    7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") removed the
    VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
    entry in vmstat_text. This causes an out-of-bounds access in
    vmstat_show().

    Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
    is probably very rare.

    Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
    Fixes: 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

13 Oct, 2018

4 commits

  • commit c7cdff0e864713a089d7cb3a2b1136ba9a54881a upstream.

    fill_balloon doing memory allocations under balloon_lock
    can cause a deadlock when leak_balloon is called from
    virtballoon_oom_notify and tries to take the same lock.

    To fix, split page allocation and enqueue, and do the allocations outside
    the lock.

    Here's a detailed analysis of the deadlock by Tetsuo Handa:

    In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to
    serialize against fill_balloon(). But in fill_balloon(),
    alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is
    called with the vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE]
    implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, even though
    __GFP_NORETRY is specified, this allocation attempt might indirectly
    depend on somebody
    else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect
    __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via
    virtballoon_oom_notify() via blocking_notifier_call_chain() callback via
    out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock
    mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it
    will cause OOM lockup.

    Thread1                                   Thread2
    fill_balloon()
      takes a balloon_lock
      balloon_page_enqueue()
        alloc_page(GFP_HIGHUSER_MOVABLE)
          direct reclaim (__GFP_FS context)   takes a fs lock
            waits for that fs lock            alloc_page(GFP_NOFS)
                                                __alloc_pages_may_oom()
                                                  takes the oom_lock
                                                  out_of_memory()
                                                    blocking_notifier_call_chain()
                                                      leak_balloon()
                                                        tries to take that balloon_lock and deadlocks

    Reported-by: Tetsuo Handa
    Cc: Michal Hocko
    Cc: Wei Wang
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
     
  • commit 58bc4c34d249bf1bc50730a9a209139347cfacfe upstream.

    5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
    on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
    the kernel unconditional to reduce #ifdef soup, but (either to avoid
    showing dummy zero counters to userspace, or because that code was missed)
    didn't update the vmstat_array, meaning that all following counters would
    be shown with incorrect values.

    This only affects kernel builds with
    CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

    Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
    Fixes: 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit e125fe405abedc1dc8a5b2229b80ee91c1434015 upstream.

    A transparent huge page is represented by a single entry on an LRU list.
    Therefore, we can only make unevictable an entire compound page, not
    individual subpages.

    If a user tries to mlock() part of a huge page, we want the rest of the
    page to be reclaimable.

    We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
    PMD on border of VM_LOCKED VMA will be split into PTE table.

    Introduction of THP migration breaks[1] the rules around mlocking THP
    pages. If we had a single PMD mapping of the page in mlocked VMA, the
    page will get mlocked, regardless of PTE mappings of the page.

    For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
    remove_migration_pmd().

    Anon THP pages can only be shared between processes via fork(). Mlocked
    page can only be shared if parent mlocked it before forking, otherwise CoW
    will be triggered on mlock().

    For Anon-THP, we can fix the issue by munlocking the page on removing PTE
    migration entry for the page. PTEs for the page will always come after
    mlocked PMD: rmap walks VMAs from oldest to newest.

    Test-case:

    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <numaif.h>

    int main(void)
    {
        unsigned long nodemask = 4;
        void *addr;

        addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

        if (fork()) {
            wait(NULL);
            return 0;
        }

        mlock(addr, 4UL << 10);
        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
              &nodemask, 4, MPOL_MF_MOVE);

        return 0;
    }

    [1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com

    Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
    Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Vegard Nossum
    Reviewed-by: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream.

    The page migration code employs try_to_unmap() to try and unmap the source
    page. This is accomplished by using rmap_walk to find all vmas where the
    page is mapped. This search stops when page mapcount is zero. For shared
    PMD huge pages, the page map count is always 1 no matter the number of
    mappings. Shared mappings are tracked via the reference count of the PMD
    page. Therefore, try_to_unmap stops prematurely and does not completely
    unmap all mappings of the source page.

    This problem can result in data corruption as writes to the original
    source page can happen after contents of the page are copied to the target
    page. Hence, data is lost.

    This problem was originally seen as DB corruption of shared global areas
    after a huge page was soft offlined due to ECC memory errors. DB
    developers noticed they could reproduce the issue by (hotplug) offlining
    memory used to back huge pages. A simple testcase can reproduce the
    problem by creating a shared PMD mapping (note that this must be at least
    PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
    migrate_pages() to migrate process pages between nodes while continually
    writing to the huge pages being migrated.

    To fix, have the try_to_unmap_one routine check for huge PMD sharing by
    calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared
    mapping it will be 'unshared' which removes the page table entry and drops
    the reference on the PMD page. After this, flush caches and TLB.

    mmu notifiers are called before locking page tables, but we can not be
    sure of PMD sharing until page tables are locked. Therefore, check for
    the possibility of PMD sharing before locking so that notifiers can
    prepare for the worst possible case.

    Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
    [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
    Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Jerome Glisse
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

10 Oct, 2018

1 commit

  • commit d41aa5252394c065d1f04d1ceea885b70d00c9c6 upstream.

    Reproducer, assuming 2M of hugetlbfs available:

    Hugetlbfs mounted, size=2M and option user=testuser

    # mount | grep ^hugetlbfs
    hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
    # sysctl vm.nr_hugepages=1
    vm.nr_hugepages = 1
    # grep Huge /proc/meminfo
    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 1
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 2048 kB

    Code:

    #include <stddef.h>
    #include <sys/mman.h>

    #define SIZE (2 * 1024 * 1024)

    int main(void)
    {
        void *ptr;

        /* back the mapping with a 2MB anonymous huge page */
        ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
        /* exclude it from core dumps, then try to undo that */
        madvise(ptr, SIZE, MADV_DONTDUMP);
        madvise(ptr, SIZE, MADV_DODUMP);
        return 0;
    }

    Compile and strace:

    mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
    madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
    madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)

    hugetlbfs pages have VM_DONTEXPAND set in their VmFlags, as do driver
    pages, based on author testing with analysis from Florian Weimer[1].

    The inclusion of VM_DONTEXPAND in the VM_SPECIAL definition was a
    consequence of the widespread use of VM_DONTEXPAND in device drivers.

    A consequence of [2] is that VM_DONTEXPAND-marked mappings cannot be
    marked MADV_DODUMP.

    A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
    memory for a while and later request madvise(MADV_DODUMP) on the same
    memory. Correct this omission by allowing madvise(MADV_DODUMP) on
    hugetlbfs pages.

    [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
    [2] commit 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")

    Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
    Link: https://lists.launchpad.net/maria-discuss/msg05245.html
    Fixes: 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")
    Reported-by: Kenneth Penza
    Signed-off-by: Daniel Black
    Reviewed-by: Mike Kravetz
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Daniel Black
     

04 Oct, 2018

1 commit

  • commit e5d9998f3e09359b372a037a6ac55ba235d95d57 upstream.

    /*
    * cpu_partial determined the maximum number of objects
    * kept in the per cpu partial lists of a processor.
    */

    Can't be negative.
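
    The point is that this counter can never legitimately be negative, so
    (presumably, given the quoted comment) the field becomes an unsigned type.
    A standalone illustration of that kind of change, not the real struct
    kmem_cache from include/linux/slub_def.h, which has many more members:

    /* before: a signed counter, even though a negative value is meaningless */
    struct kmem_cache_before {
        int cpu_partial;
    };

    /* after: unsigned matches the "can't be negative" semantics, so readers
     * and the compiler never have to reason about negative values */
    struct kmem_cache_after {
        unsigned int cpu_partial;
    };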

    Link: http://lkml.kernel.org/r/20180305200730.15812-15-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: zhong jiang
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
     

29 Sep, 2018

1 commit

  • commit b45d71fb89ab8adfe727b9d0ee188ed58582a647 upstream.

    Directory and non-directory inodes don't necessarily need to be in the
    same lockdep class. For example, hugetlbfs splits them out too to prevent
    false positives in lockdep. Annotate correctly after new inode creation:
    if it is a directory inode, it will be put into a different class.

    This should fix a lockdep splat reported by syzbot:

    > ======================================================
    > WARNING: possible circular locking dependency detected
    > 4.18.0-rc8-next-20180810+ #36 Not tainted
    > ------------------------------------------------------
    > syz-executor900/4483 is trying to acquire lock:
    > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock
    > include/linux/fs.h:765 [inline]
    > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at:
    > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
    >
    > but task is already holding lock:
    > 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630
    > drivers/staging/android/ashmem.c:448
    >
    > which lock already depends on the new lock.
    >
    > -> #2 (ashmem_mutex){+.+.}:
    > __mutex_lock_common kernel/locking/mutex.c:925 [inline]
    > __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
    > mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
    > ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361
    > call_mmap include/linux/fs.h:1844 [inline]
    > mmap_region+0xf27/0x1c50 mm/mmap.c:1762
    > do_mmap+0xa10/0x1220 mm/mmap.c:1535
    > do_mmap_pgoff include/linux/mm.h:2298 [inline]
    > vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357
    > ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585
    > __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
    > __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
    > __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
    > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    > entry_SYSCALL_64_after_hwframe+0x49/0xbe
    >
    > -> #1 (&mm->mmap_sem){++++}:
    > __might_fault+0x155/0x1e0 mm/memory.c:4568
    > _copy_to_user+0x30/0x110 lib/usercopy.c:25
    > copy_to_user include/linux/uaccess.h:155 [inline]
    > filldir+0x1ea/0x3a0 fs/readdir.c:196
    > dir_emit_dot include/linux/fs.h:3464 [inline]
    > dir_emit_dots include/linux/fs.h:3475 [inline]
    > dcache_readdir+0x13a/0x620 fs/libfs.c:193
    > iterate_dir+0x48b/0x5d0 fs/readdir.c:51
    > __do_sys_getdents fs/readdir.c:231 [inline]
    > __se_sys_getdents fs/readdir.c:212 [inline]
    > __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212
    > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    > entry_SYSCALL_64_after_hwframe+0x49/0xbe
    >
    > -> #0 (&sb->s_type->i_mutex_key#9){++++}:
    > lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
    > down_write+0x8f/0x130 kernel/locking/rwsem.c:70
    > inode_lock include/linux/fs.h:765 [inline]
    > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
    > ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455
    > ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797
    > vfs_ioctl fs/ioctl.c:46 [inline]
    > file_ioctl fs/ioctl.c:501 [inline]
    > do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
    > ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
    > __do_sys_ioctl fs/ioctl.c:709 [inline]
    > __se_sys_ioctl fs/ioctl.c:707 [inline]
    > __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
    > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    > entry_SYSCALL_64_after_hwframe+0x49/0xbe
    >
    > other info that might help us debug this:
    >
    > Chain exists of:
    > &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex
    >
    > Possible unsafe locking scenario:
    >
    > CPU0                               CPU1
    > ----                               ----
    > lock(ashmem_mutex);
    >                                    lock(&mm->mmap_sem);
    >                                    lock(ashmem_mutex);
    > lock(&sb->s_type->i_mutex_key#9);
    >
    > *** DEADLOCK ***
    >
    > 1 lock held by syz-executor900/4483:
    > #0: 0000000025208078 (ashmem_mutex){+.+.}, at:
    > ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448

    Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Reported-by: syzbot
    Reviewed-by: NeilBrown
    Suggested-by: NeilBrown
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes (Google)
     

20 Sep, 2018

1 commit

  • commit 7a9cdebdcc17e426fb5287e4a82db1dfe86339b2 upstream.

    Jann Horn points out that the vmacache_flush_all() function is not only
    potentially expensive, it's buggy too. It also happens to be entirely
    unnecessary, because the sequence number overflow case can be avoided by
    simply making the sequence number be 64-bit. That doesn't even grow the
    data structures in question, because the other adjacent fields are
    already 64-bit.

    So simplify the whole thing by just making the sequence number overflow
    case go away entirely, which gets rid of all the complications and makes
    the code faster too. Win-win.

    [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistic
    also just goes away entirely with this ]
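
    The pattern generalizes: a lookup cache tagged with a 64-bit generation
    counter simply never needs a wrap-around flush path. A toy sketch of that
    pattern (illustrative names only, not the kernel's vmacache code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Entries are valid only while their seqnum matches the owner's.  With a
     * 64-bit counter, bumping it once per invalidation cannot realistically
     * wrap, so no "flush everything on overflow" path is needed. */
    struct owner { uint64_t seqnum; };
    struct cache { uint64_t seqnum; int value; };

    static void owner_invalidate(struct owner *o)
    {
        o->seqnum++;              /* cheap: implicitly invalidates every cache */
    }

    static bool cache_lookup(const struct cache *c, const struct owner *o, int *out)
    {
        if (c->seqnum != o->seqnum)
            return false;         /* stale generation: treat as a miss */
        *out = c->value;
        return true;
    }

    static void cache_store(struct cache *c, const struct owner *o, int value)
    {
        c->seqnum = o->seqnum;
        c->value = value;
    }

    int main(void)
    {
        struct owner o = { 0 };
        struct cache c = { 0 };
        int v;

        cache_store(&c, &o, 42);
        printf("hit: %d\n", cache_lookup(&c, &o, &v));   /* hit: 1 */
        owner_invalidate(&o);
        printf("hit: %d\n", cache_lookup(&c, &o, &v));   /* hit: 0 */
        return 0;
    }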

    Reported-by: Jann Horn
    Suggested-by: Will Deacon
    Acked-by: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

15 Sep, 2018

1 commit

  • [ Upstream commit a718e28f538441a3b6612da9ff226973376cdf0f ]

    Signed integer overflow is undefined according to the C standard. The
    overflow in ksys_fadvise64_64() is deliberate, but since it is signed
    overflow, UBSAN complains:

    UBSAN: Undefined behaviour in mm/fadvise.c:76:10
    signed integer overflow:
    4 + 9223372036854775805 cannot be represented in type 'long long int'

    Use unsigned types to do the math. Unsigned overflow is defined, so UBSAN
    will not complain about it. This patch doesn't change the generated code.
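
    As an illustration of that technique (plain userspace C, not the kernel
    source; the function name is made up and the values are taken from the
    UBSAN report above):

    #include <stdint.h>
    #include <stdio.h>

    /* Do the addition in unsigned arithmetic, where wrap-around is defined,
     * then convert back.  Converting an out-of-range value back to a signed
     * type is implementation-defined rather than undefined, and behaves as
     * expected on the two's-complement targets Linux runs on. */
    static long long add_offset_len(long long offset, long long len)
    {
        return (long long)((uint64_t)offset + (uint64_t)len);
    }

    int main(void)
    {
        long long offset = 4, len = 9223372036854775805LL;

        printf("endbyte = %lld\n", add_offset_len(offset, len));
        return 0;
    }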

    [akpm@linux-foundation.org: add comment explaining the casts]
    Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin