29 Oct, 2018

1 commit


20 Oct, 2018

1 commit

  • commit eb66ae030829605d61fbef1909ce310e29f78821 upstream.

    Jann Horn points out that our TLB flushing was subtly wrong for the
    mremap() case. What makes mremap() special is that we don't follow the
    usual "add page to list of pages to be freed, then flush tlb, and then
    free pages". No, mremap() obviously just _moves_ the page from one page
    table location to another.

    That matters, because mremap() thus doesn't directly control the
    lifetime of the moved page with a freelist: instead, the lifetime of the
    page is controlled by the page table locking, that serializes access to
    the entry.

    As a result, we need to flush the TLB not just before releasing the lock
    for the source location (to avoid any concurrent accesses to the entry),
    but also before we release the destination page table lock (to avoid the
    TLB being flushed after somebody else has already done something to that
    page).

    This also makes the whole "need_flush" logic unnecessary, since we now
    always end up flushing the TLB for every valid entry.
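
    For readers less familiar with mremap(2), here is a minimal user-space
    sketch (illustration only, not part of the fix) of the semantics
    described above: the contents move with the mapping rather than being
    freed and reallocated, which is why both the source and the destination
    page table locations need a TLB flush.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4096;
        char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (old == MAP_FAILED)
            return 1;
        strcpy(old, "moved, not freed");

        void *from = old;
        /* Grow the mapping; the kernel is allowed to move it. */
        char *new = mremap(old, len, 2 * len, MREMAP_MAYMOVE);
        if (new == MAP_FAILED)
            return 1;

        /* Same contents, possibly at a different virtual address. */
        printf("%p -> %p: %s\n", from, (void *)new, new);
        return 0;
    }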

    Reported-and-tested-by: Jann Horn
    Acked-by: Will Deacon
    Tested-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

18 Oct, 2018

7 commits

  • commit 7aaf7727235870f497eb928f728f7773d6df3b40 upstream.

    Don't show nr_indirectly_reclaimable in /proc/vmstat, because there is
    no need to export this vm counter to userspace, and some changes are
    expected in reclaimable object accounting, which can alter this counter.

    Link: http://lkml.kernel.org/r/20180425191422.9159-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit d79f7aa496fc94d763f67b833a1f36f4c171176f upstream.

    Indirectly reclaimable memory can consume a significant part of total
    memory and it's actually reclaimable (it will be released under actual
    memory pressure).

    So, the overcommit logic should treat it as free.

    Otherwise, it's possible to cause random system-wide memory allocation
    failures by consuming a significant amount of memory by indirectly
    reclaimable memory, e.g. dentry external names.

    If overcommit policy GUESS is used, it might be used for denial of
    service attack under some conditions.

    The following program illustrates the approach. It causes the kernel to
    allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
    that at some point the overcommit logic may start blocking large
    allocations system-wide.

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        char buf[257];    /* one extra byte for sprintf's terminating NUL */
        unsigned long i;
        struct stat statbuf;

        /* Build a path whose single component is long enough to force an
         * external dentry name (an unreclaimable kmalloc-256 chunk). */
        buf[0] = '/';
        for (i = 1; i < sizeof(buf); i++)
            buf[i] = '_';

        /* Vary the tail so every stat() looks up a brand new name. */
        for (i = 0; 1; i++) {
            sprintf(&buf[248], "%8lu", i);
            stat(buf, &statbuf);
        }

        return 0;
    }

    This patch in combination with related indirectly reclaimable memory
    patches closes this issue.

    Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit 034ebf65c3c21d85b963d39f992258a64a85e3a9 upstream.

    Adjust /proc/meminfo MemAvailable calculation by adding the amount of
    indirectly reclaimable memory (rounded to the PAGE_SIZE).

    Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit eb59254608bc1d42c4c6afdcdce9c0d3ce02b318 upstream.

    Patch series "indirectly reclaimable memory", v2.

    This patchset introduces the concept of indirectly reclaimable memory
    and applies it to fix the issue of when a big number of dentries with
    external names can significantly affect the MemAvailable value.

    This patch (of 3):

    Introduce a concept of indirectly reclaimable memory and add the
    corresponding memory counter and /proc/vmstat item.

    Indirectly reclaimable memory is any sort of memory used by the kernel
    (except reclaimable slabs) which is actually reclaimable, i.e. will be
    released under memory pressure.

    The counter is in bytes, as it's not always possible to count such
    objects in pages. The name contains BYTES by analogy to
    NR_KERNEL_STACK_KB.

    Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit bfba8e5cf28f413aa05571af493871d74438979f upstream.

    Inside set_pmd_migration_entry() we are holding page table locks and thus
    we cannot sleep, so we cannot call invalidate_range_start/end().

    So remove the calls to mmu_notifier_invalidate_range_start/end(); they
    are already made inside the function that calls set_pmd_migration_entry()
    (see try_to_unmap_one()).

    Link: http://lkml.kernel.org/r/20181012181056.7864-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reported-by: Andrea Arcangeli
    Reviewed-by: Zi Yan
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jérôme Glisse
     
  • commit 6685b357363bfe295e3ae73665014db4aed62c58 upstream.

    The commit ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
    introduced bitmap metadata blocks. These metadata blocks are allocated
    whenever a new chunk is created, but they are never freed. Fix it.

    Fixes: ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
    Signed-off-by: Mike Rapoport
    Cc: stable@vger.kernel.org
    Signed-off-by: Dennis Zhou
    Signed-off-by: Greg Kroah-Hartman

    Mike Rapoport
     
  • commit 28e2c4bb99aa40f9d5f07ac130cbc4da0ea93079 upstream.

    7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") removed the
    VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
    entry in vmstat_text. This causes an out-of-bounds access in
    vmstat_show().

    Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
    is probably very rare.

    Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
    Fixes: 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

13 Oct, 2018

4 commits

  • commit c7cdff0e864713a089d7cb3a2b1136ba9a54881a upstream.

    fill_balloon doing memory allocations under balloon_lock
    can cause a deadlock when leak_balloon is called from
    virtballoon_oom_notify and tries to take the same lock.

    To fix, split page allocation and enqueue and do allocations outside the lock.

    Here's a detailed analysis of the deadlock by Tetsuo Handa:

    In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to
    serialize against fill_balloon(). But in fill_balloon(),
    alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is
    called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE]
    implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY
    is specified, this allocation attempt might indirectly depend on somebody
    else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect
    __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via
    virtballoon_oom_notify() via blocking_notifier_call_chain() callback via
    out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock
    mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it
    will cause OOM lockup.

    Thread1                                     Thread2
      fill_balloon()
        takes a balloon_lock
        balloon_page_enqueue()
          alloc_page(GFP_HIGHUSER_MOVABLE)
            direct reclaim (__GFP_FS context)   takes a fs lock
              waits for that fs lock            alloc_page(GFP_NOFS)
                                                  __alloc_pages_may_oom()
                                                    takes the oom_lock
                                                    out_of_memory()
                                                      blocking_notifier_call_chain()
                                                        leak_balloon()
                                                          tries to take that balloon_lock and deadlocks
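
    The "split page allocation and enqueue" fix mentioned above follows a
    generic pattern; here is a schematic, user-space illustration with
    made-up names (not the driver's actual code): allocate first without
    the lock, then take the lock only for the short enqueue step, so a
    reclaim/OOM path that needs the same lock can never wait behind an
    allocation.

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t balloon_lock = PTHREAD_MUTEX_INITIALIZER;
    static void *queue[64];
    static int queued;

    static void fill_one(void)
    {
        /* Allocate without holding balloon_lock ... */
        void *page = malloc(4096);
        if (!page)
            return;

        /* ... and hold the lock only while enqueueing the result. */
        pthread_mutex_lock(&balloon_lock);
        if (queued < 64)
            queue[queued++] = page;
        else
            free(page);
        pthread_mutex_unlock(&balloon_lock);
    }

    int main(void)
    {
        fill_one();
        return 0;
    }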

    Reported-by: Tetsuo Handa
    Cc: Michal Hocko
    Cc: Wei Wang
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
     
  • commit 58bc4c34d249bf1bc50730a9a209139347cfacfe upstream.

    5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
    on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
    the kernel unconditional to reduce #ifdef soup, but (either to avoid
    showing dummy zero counters to userspace, or because that code was missed)
    didn't update the vmstat_array, meaning that all following counters would
    be shown with incorrect values.

    This only affects kernel builds with
    CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

    Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
    Fixes: 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit e125fe405abedc1dc8a5b2229b80ee91c1434015 upstream.

    A transparent huge page is represented by a single entry on an LRU list.
    Therefore, we can only make unevictable an entire compound page, not
    individual subpages.

    If a user tries to mlock() part of a huge page, we want the rest of the
    page to be reclaimable.

    We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
    PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

    Introduction of THP migration breaks[1] the rules around mlocking THP
    pages. If we have a single PMD mapping of the page in a mlocked VMA, the
    page will get mlocked, regardless of PTE mappings of the page.

    For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
    remove_migration_pmd().

    Anon THP pages can only be shared between processes via fork(). A mlocked
    page can only be shared if the parent mlocked it before forking; otherwise
    CoW will be triggered on mlock().

    For Anon-THP, we can fix the issue by munlocking the page on removing PTE
    migration entry for the page. PTEs for the page will always come after
    mlocked PMD: rmap walks VMAs from oldest to newest.

    Test-case:

    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <numaif.h>    /* mbind() and MPOL_* flags; link with -lnuma */

    #ifndef MPOL_F_RELATIVE_NODES
    #define MPOL_F_RELATIVE_NODES (1 << 14)    /* from linux/mempolicy.h */
    #endif

    int main(void)
    {
        unsigned long nodemask = 4;
        void *addr;

        addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

        if (fork()) {
            wait(NULL);
            return 0;
        }

        /* Child: mlock part of the 2M mapping, then ask to move it. */
        mlock(addr, 4UL << 10);
        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
              &nodemask, 4, MPOL_MF_MOVE);

        return 0;
    }

    [1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com

    Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
    Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Vegard Nossum
    Reviewed-by: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream.

    The page migration code employs try_to_unmap() to try and unmap the source
    page. This is accomplished by using rmap_walk to find all vmas where the
    page is mapped. This search stops when page mapcount is zero. For shared
    PMD huge pages, the page map count is always 1 no matter the number of
    mappings. Shared mappings are tracked via the reference count of the PMD
    page. Therefore, try_to_unmap stops prematurely and does not completely
    unmap all mappings of the source page.

    This problem can result in data corruption, as writes to the original
    source page can happen after the contents of the page have been copied to
    the target page. Hence, data is lost.

    This problem was originally seen as DB corruption of shared global areas
    after a huge page was soft offlined due to ECC memory errors. DB
    developers noticed they could reproduce the issue by (hotplug) offlining
    memory used to back huge pages. A simple testcase can reproduce the
    problem by creating a shared PMD mapping (note that this must be at least
    PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
    migrate_pages() to migrate process pages between nodes while continually
    writing to the huge pages being migrated.
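
    A hypothetical sketch of such a testcase (assumptions: a NUMA system with
    nodes 0 and 1, 1GB worth of hugetlb pages reserved, libnuma installed;
    this is not the reporters' program, and triggering actual PMD sharing
    additionally depends on the mapping being PUD-aligned):

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <numaif.h>    /* migrate_pages(); link with -lnuma */

    #define PUD_SIZE (1UL << 30)    /* 1GB on x86_64 */

    int main(void)
    {
        /* Shared hugetlb mapping, inherited by the child across fork(). */
        char *p = mmap(NULL, PUD_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        pid_t child = fork();
        if (child == 0) {
            /* Child: keep writing to the shared huge pages. */
            for (;;)
                memset(p, 0xaa, PUD_SIZE);
        }

        /* Parent: bounce its pages between node 0 and node 1. */
        unsigned long node0 = 1, node1 = 2;    /* single-bit node masks */
        for (int i = 0; i < 100; i++) {
            migrate_pages(getpid(), 8 * sizeof(unsigned long), &node0, &node1);
            migrate_pages(getpid(), 8 * sizeof(unsigned long), &node1, &node0);
        }

        kill(child, SIGKILL);
        wait(NULL);
        return 0;
    }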

    To fix, have the try_to_unmap_one routine check for huge PMD sharing by
    calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared
    mapping it will be 'unshared' which removes the page table entry and drops
    the reference on the PMD page. After this, flush caches and TLB.

    mmu notifiers are called before locking page tables, but we can not be
    sure of PMD sharing until page tables are locked. Therefore, check for
    the possibility of PMD sharing before locking so that notifiers can
    prepare for the worst possible case.

    Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
    [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
    Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Jerome Glisse
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

10 Oct, 2018

1 commit

  • commit d41aa5252394c065d1f04d1ceea885b70d00c9c6 upstream.

    Reproducer, assuming 2M of hugetlbfs available:

    Hugetlbfs mounted, size=2M and option user=testuser

    # mount | grep ^hugetlbfs
    hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
    # sysctl vm.nr_hugepages=1
    vm.nr_hugepages = 1
    # grep Huge /proc/meminfo
    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 1
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 2048 kB

    Code:

    #include <sys/types.h>
    #include <sys/mman.h>

    #define SIZE (2*1024*1024)

    int main(void)
    {
        void *ptr;

        /* Back the mapping with one 2M hugetlb page. */
        ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);

        /* DONTDUMP succeeds; the DODUMP below used to fail with EINVAL. */
        madvise(ptr, SIZE, MADV_DONTDUMP);
        madvise(ptr, SIZE, MADV_DODUMP);

        return 0;
    }

    Compile and strace:

    mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
    madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
    madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)

    hugetlbfs pages have VM_DONTEXPAND set in their VmFlags, as device driver
    pages do, based on the author's testing with analysis from Florian
    Weimer[1].

    The inclusion of VM_DONTEXPAND in the VM_SPECIAL definition was a
    consequence of the widespread use of VM_DONTEXPAND in device drivers.

    A consequence of [2] is that VM_DONTEXPAND-marked pages cannot be
    marked DODUMP.

    A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
    memory for a while and later request madvise(MADV_DODUMP) on the same
    memory. We correct this omission by allowing madvise(MADV_DODUMP) on
    hugetlbfs pages.

    [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
    [2] commit 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")

    Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
    Link: https://lists.launchpad.net/maria-discuss/msg05245.html
    Fixes: 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")
    Reported-by: Kenneth Penza
    Signed-off-by: Daniel Black
    Reviewed-by: Mike Kravetz
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Daniel Black
     

04 Oct, 2018

1 commit

  • commit e5d9998f3e09359b372a037a6ac55ba235d95d57 upstream.

    /*
    * cpu_partial determined the maximum number of objects
    * kept in the per cpu partial lists of a processor.
    */

    Can't be negative.

    Link: http://lkml.kernel.org/r/20180305200730.15812-15-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: zhong jiang
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
     

29 Sep, 2018

1 commit

  • commit b45d71fb89ab8adfe727b9d0ee188ed58582a647 upstream.

    Directories and inodes don't necessarily need to be in the same lockdep
    class. For example, hugetlbfs splits them out too, to prevent false
    positives in lockdep. Annotate correctly after new inode creation. If it's
    a directory inode, it will be put into a different class.

    This should fix a lockdep splat reported by syzbot:

    > ======================================================
    > WARNING: possible circular locking dependency detected
    > 4.18.0-rc8-next-20180810+ #36 Not tainted
    > ------------------------------------------------------
    > syz-executor900/4483 is trying to acquire lock:
    > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock
    > include/linux/fs.h:765 [inline]
    > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at:
    > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
    >
    > but task is already holding lock:
    > 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630
    > drivers/staging/android/ashmem.c:448
    >
    > which lock already depends on the new lock.
    >
    > -> #2 (ashmem_mutex){+.+.}:
    > __mutex_lock_common kernel/locking/mutex.c:925 [inline]
    > __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
    > mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
    > ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361
    > call_mmap include/linux/fs.h:1844 [inline]
    > mmap_region+0xf27/0x1c50 mm/mmap.c:1762
    > do_mmap+0xa10/0x1220 mm/mmap.c:1535
    > do_mmap_pgoff include/linux/mm.h:2298 [inline]
    > vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357
    > ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585
    > __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
    > __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
    > __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
    > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    > entry_SYSCALL_64_after_hwframe+0x49/0xbe
    >
    > -> #1 (&mm->mmap_sem){++++}:
    > __might_fault+0x155/0x1e0 mm/memory.c:4568
    > _copy_to_user+0x30/0x110 lib/usercopy.c:25
    > copy_to_user include/linux/uaccess.h:155 [inline]
    > filldir+0x1ea/0x3a0 fs/readdir.c:196
    > dir_emit_dot include/linux/fs.h:3464 [inline]
    > dir_emit_dots include/linux/fs.h:3475 [inline]
    > dcache_readdir+0x13a/0x620 fs/libfs.c:193
    > iterate_dir+0x48b/0x5d0 fs/readdir.c:51
    > __do_sys_getdents fs/readdir.c:231 [inline]
    > __se_sys_getdents fs/readdir.c:212 [inline]
    > __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212
    > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    > entry_SYSCALL_64_after_hwframe+0x49/0xbe
    >
    > -> #0 (&sb->s_type->i_mutex_key#9){++++}:
    > lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
    > down_write+0x8f/0x130 kernel/locking/rwsem.c:70
    > inode_lock include/linux/fs.h:765 [inline]
    > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
    > ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455
    > ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797
    > vfs_ioctl fs/ioctl.c:46 [inline]
    > file_ioctl fs/ioctl.c:501 [inline]
    > do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
    > ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
    > __do_sys_ioctl fs/ioctl.c:709 [inline]
    > __se_sys_ioctl fs/ioctl.c:707 [inline]
    > __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
    > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    > entry_SYSCALL_64_after_hwframe+0x49/0xbe
    >
    > other info that might help us debug this:
    >
    > Chain exists of:
    > &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex
    >
    > Possible unsafe locking scenario:
    >
    >        CPU0                    CPU1
    >        ----                    ----
    >   lock(ashmem_mutex);
    >                                lock(&mm->mmap_sem);
    >                                lock(ashmem_mutex);
    >   lock(&sb->s_type->i_mutex_key#9);
    >
    > *** DEADLOCK ***
    >
    > 1 lock held by syz-executor900/4483:
    > #0: 0000000025208078 (ashmem_mutex){+.+.}, at:
    > ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448

    Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Reported-by: syzbot
    Reviewed-by: NeilBrown
    Suggested-by: NeilBrown
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes (Google)
     

20 Sep, 2018

1 commit

  • commit 7a9cdebdcc17e426fb5287e4a82db1dfe86339b2 upstream.

    Jann Horn points out that the vmacache_flush_all() function is not only
    potentially expensive, it's buggy too. It also happens to be entirely
    unnecessary, because the sequence number overflow case can be avoided by
    simply making the sequence number be 64-bit. That doesn't even grow the
    data structures in question, because the other adjacent fields are
    already 64-bit.

    So simplify the whole thing by just making the sequence number overflow
    case go away entirely, which gets rid of all the complications and makes
    the code faster too. Win-win.
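
    (For a rough sense of scale, not from the original changelog: a 64-bit
    sequence number bumped a billion times per second would still take
    2^64 / 10^9 s ~= 1.8 * 10^10 s, i.e. around 580 years, to wrap, which is
    why the overflow case can simply be ignored.)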

    [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
    also just goes away entirely with this ]

    Reported-by: Jann Horn
    Suggested-by: Will Deacon
    Acked-by: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

15 Sep, 2018

1 commit

  • [ Upstream commit a718e28f538441a3b6612da9ff226973376cdf0f ]

    Signed integer overflow is undefined according to the C standard. The
    overflow in ksys_fadvise64_64() is deliberate, but since it is signed
    overflow, UBSAN complains:

    UBSAN: Undefined behaviour in mm/fadvise.c:76:10
    signed integer overflow:
    4 + 9223372036854775805 cannot be represented in type 'long long int'

    Use unsigned types to do math. Unsigned overflow is defined so UBSAN
    will not complain about it. This patch doesn't change generated code.
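
    A tiny stand-alone illustration of the difference (not from the patch
    itself): the same wrap-around is undefined behaviour when computed in a
    signed type, and perfectly defined when computed in an unsigned one.

    #include <stdio.h>

    int main(void)
    {
        long long offset = 9223372036854775805LL;    /* as in the UBSAN report */
        long long len = 4;

        /* "offset + len" would overflow 'long long' here: undefined
         * behaviour, the kind of expression UBSAN flags in mm/fadvise.c. */

        /* Doing the math in unsigned types instead wraps predictably. */
        unsigned long long endbyte = (unsigned long long)offset + (unsigned long long)len;
        printf("%llu\n", endbyte);
        return 0;
    }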

    [akpm@linux-foundation.org: add comment explaining the casts]
    Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     

10 Sep, 2018

2 commits

  • commit a6f572084fbee8b30f91465f4a085d7a90901c57 upstream.

    Will noted that only checking mm_users is incorrect; we should also
    check mm_count in order to cover CPUs that have a lazy reference to
    this mm (and could do speculative TLB operations).

    If removing this turns out to be a performance issue, we can
    re-instate a more complete check, but in tlb_table_flush() eliding the
    call_rcu_sched().

    Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
    Reported-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Acked-by: Will Deacon
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit dc30b96ab6d569060741572cf30517d3179429a8 upstream.

    ondemand_readahead() checks bdi->io_pages to cap the maximum pages
    that need to be processed. This works until the readit section. If
    we do an async-only readahead (async size = sync size) and the
    target is at the beginning of the window, we expand the pages by another
    get_next_ra_size() pages. Btrace for large reads shows that the kernel
    always issues a doubled-size read at the beginning of processing.
    Add an additional check for io_pages in the lower part of the function.
    The fix helps devices that hard limit bio pages and rely on proper
    handling of max_hw_read_sectors (e.g. older FusionIO cards). For
    that reason it could qualify for stable.

    Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
    Cc: stable@vger.kernel.org
    Signed-off-by: Markus Stockhausen stockhausen@collogia.de
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Markus Stockhausen
     

05 Sep, 2018

6 commits

  • commit d86564a2f085b79ec046a5cba90188e612352806 upstream.

    Jann reported that x86 was missing required TLB invalidates when he
    hit the !*batch slow path in tlb_remove_table().

    This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
    invalidates, the PowerPC-hash where this code originated and the
    Sparc-hash where this was subsequently used did not need that. ARM
    which later used this put an explicit TLB invalidate in their
    __p*_free_tlb() functions, and PowerPC-radix followed that example.

    But when we hooked up x86 we failed to consider this. Fix this by
    (optionally) hooking tlb_remove_table() into the TLB invalidate code.

    NOTE: s390 was also needing something like this and might now
    be able to use the generic code again.

    [ Modified to be on top of Nick's cleanups, which simplified this patch
    now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]

    Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
    Reported-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Will Deacon
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit db7ddef301128dad394f1c0f77027f86ee9a4edb upstream.

    There is no need to call this from tlb_flush_mmu_tlbonly, it logically
    belongs with tlb_flush_mmu_free. This makes future fixes simpler.

    [ This was originally done to allow code consolidation for the
    mmu_notifier fix, but it also ends up helping simplify the
    HAVE_RCU_TABLE_INVALIDATE fix. - Linus ]

    Signed-off-by: Nicholas Piggin
    Acked-by: Will Deacon
    Cc: Peter Zijlstra
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • [ Upstream commit 24eee1e4c47977bdfb71d6f15f6011e7b6188d04 ]

    ioremap_prot() can return NULL which could lead to an oops.

    Link: http://lkml.kernel.org/r/1533195441-58594-1-git-send-email-chenjie6@huawei.com
    Signed-off-by: chen jie
    Reviewed-by: Andrew Morton
    Cc: Li Zefan
    Cc: chenjie
    Cc: Yang Shi
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    chen jie
     
  • [ Upstream commit 7e97de0b033bcac4fa9a35cef72e0c06e6a22c67 ]

    In case of memcg_online_kmem() failure, mem_cgroup::id remains hashed
    in mem_cgroup_idr even after the memcg memory is freed. This leads to a
    leak of the ID in mem_cgroup_idr.

    This patch adds removal into mem_cgroup_css_alloc(), which fixes the
    problem. For better readability, it adds a generic helper which is used
    in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.

    Link: http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
    Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Signed-off-by: Kirill Tkhai
    Acked-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kirill Tkhai
     
  • [ Upstream commit 53406ed1bcfdabe4b5bc35e6d17946c6f9f563e2 ]

    Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
    that mmap_sem must be held when splitting an "anonymous" vma there.
    Whether that's still strictly true nowadays is not entirely clear,
    but the danger of sometimes crashing on the BUG is now fairly clear.

    Even with the new stricter rules for anonymous vma marking, the
    condition it checks for can possibly trigger. Commit 44960f2a7b63
    ("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
    pages") is good, and originally I thought it was safe from that
    VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
    disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
    insists on VM_SHARED.

    But after I read John's earlier mail, drawing attention to the
    vfs_fallocate() in there: I may be wrong, and I don't know if Android
    has THP in the config anyway, but it looks to me like an
    unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
    the VM_BUG_ON_VMA(), once it's vma_is_anonymous().

    Signed-off-by: Hugh Dickins
    Cc: John Stultz
    Cc: Kirill Shutemov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • [ Upstream commit 16e536ef47f567289a5699abee9ff7bb304bc12d ]

    /sys/../zswap/stored_pages keeps rising in a zswap test with
    "zswap.max_pool_percent=0" parameter. But it should not compress or
    store pages any more since there is no space in the compressed pool.

    Reproduce steps:
    1. Boot kernel with "zswap.enabled=1"
    2. Set the max_pool_percent to 0
    # echo 0 > /sys/module/zswap/parameters/max_pool_percent
    3. Do memory stress test to see if some pages have been compressed
    # stress --vm 1 --vm-bytes $mem_available"M" --timeout 60s
    4. Watching the 'stored_pages' number increasing or not

    The root cause is:

    When zswap_max_pool_percent is set to 0 via kernel parameter,
    zswap_is_full() will always return true due to zswap_shrink(). But if
    the shrinking is able to reclaim a page successfully, the code then
    proceeds to compressing/storing another page, so the value of
    stored_pages will keep changing.

    To solve the issue, this patch adds a zswap_is_full() check again after
    zswap_shrink() to make sure it's now under the max_pool_percent, and to
    not compress/store if we reached the limit.

    Link: http://lkml.kernel.org/r/20180530103936.17812-1-liwang@redhat.com
    Signed-off-by: Li Wang
    Acked-by: Dan Streetman
    Cc: Seth Jennings
    Cc: Huang Ying
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Li Wang
     

24 Aug, 2018

1 commit

  • [ Upstream commit 1e8e18f694a52d703665012ca486826f64bac29d ]

    There is a special case where the size is "(N << KASAN_SHADOW_SCALE_SHIFT)
    pages plus X", with X in [1, KASAN_SHADOW_SCALE_SIZE-1]. The
    operation "size >> KASAN_SHADOW_SCALE_SHIFT" will drop X, and the
    roundup operation cannot recover the missing page. For example: with
    size=0x28006, PAGE_SIZE=0x1000 and KASAN_SHADOW_SCALE_SHIFT=3, we get
    shadow_size=0x5000, but we actually need 6 pages.

    shadow_size = round_up(size >> KASAN_SHADOW_SCALE_SHIFT, PAGE_SIZE);
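
    A worked illustration of the arithmetic (illustration only; the second
    formula is just one way to round correctly, not necessarily the exact
    code of the fix):

    #include <stdio.h>

    #define PAGE_SIZE                0x1000UL
    #define KASAN_SHADOW_SCALE_SHIFT 3
    /* round x up to a multiple of y; y must be a power of two */
    #define round_up(x, y)           ((((x) - 1) | ((y) - 1)) + 1)

    int main(void)
    {
        unsigned long size = 0x28006;

        /* Shifting first drops the low bits of size, losing up to one page. */
        unsigned long buggy = round_up(size >> KASAN_SHADOW_SCALE_SHIFT, PAGE_SIZE);

        /* Rounding size up to the scale first keeps the remainder. */
        unsigned long fixed = round_up((size + (1UL << KASAN_SHADOW_SCALE_SHIFT) - 1)
                                       >> KASAN_SHADOW_SCALE_SHIFT, PAGE_SIZE);

        printf("buggy shadow_size = 0x%lx (%lu pages)\n", buggy, buggy / PAGE_SIZE);  /* 0x5000, 5 */
        printf("fixed shadow_size = 0x%lx (%lu pages)\n", fixed, fixed / PAGE_SIZE);  /* 0x6000, 6 */
        return 0;
    }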

    This can lead to a kernel crash when kasan is enabled and the value of
    mod->core_layout.size or mod->init_layout.size is like above. Because
    the shadow memory of X has not been allocated and mapped.

    move_module:
    ptr = module_alloc(mod->core_layout.size);
    ...
    memset(ptr, 0, mod->core_layout.size); //crashed

    Unable to handle kernel paging request at virtual address ffff0fffff97b000
    ......
    Call trace:
    __asan_storeN+0x174/0x1a8
    memset+0x24/0x48
    layout_and_allocate+0xcd8/0x1800
    load_module+0x190/0x23e8
    SyS_finit_module+0x148/0x180

    Link: http://lkml.kernel.org/r/1529659626-12660-1-git-send-email-thunder.leizhen@huawei.com
    Signed-off-by: Zhen Lei
    Reviewed-by: Dmitriy Vyukov
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Hanjun Guo
    Cc: Libin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Zhen Lei
     

16 Aug, 2018

2 commits

  • commit 377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb upstream

    For the L1TF workaround it's necessary to limit the swap file size to below
    MAX_PA/2, so that the higher bits of the swap offset inverted never point
    to valid memory.

    Add a mechanism for the architecture to override the swap file size check
    in swapfile.c and add a x86 specific max swapfile check function that
    enforces that limit.

    The check is only enabled if the CPU is vulnerable to L1TF.

    In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
    with 46bit PA it is 32TB. The limit is only per individual swap file, so
    it's always possible to exceed these limits with multiple swap files or
    partitions.
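
    (As a quick check of those numbers: MAX_PA/2 with a 42-bit physical
    address space is 2^42 / 2 = 2^41 bytes = 2 TB, and with 46 bits it is
    2^46 / 2 = 2^45 bytes = 32 TB.)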

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     
  • commit 42e4089c7890725fcd329999252dc489b72f2921 upstream

    For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
    table entry. This sets the high bits in the CPU's address space, thus
    making sure not to point an unmapped entry to valid cached memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such a high mapping was mapped to unprivileged users
    they could attack low memory by setting such a mapping to PROT_NONE. This
    could happen through a special device driver which is not access
    protected. Normal /dev/mem is of course access protected.

    To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyway.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact this is only enforced if the mapping actually refers to
    a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
    the check for root.

    For mmaps this is straightforward and can be handled in vm_insert_pfn and
    in remap_pfn_range().

    For mprotect it's a bit trickier. At the point where the actual PTEs are
    accessed a lot of state has been changed and it would be difficult to undo
    on an error. Since this is an uncommon case, use a separate early page
    table walk pass for MMIO PROT_NONE mappings that checks for this condition
    early. For non-MMIO and non-PROT_NONE there are no changes.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

03 Aug, 2018

2 commits

  • [ Upstream commit a38965bf941b7c2af50de09c96bc5f03e136caef ]

    __printf is useful to verify format and arguments. Remove the following
    warning (with W=1):

    mm/slub.c:721:2: warning: function might be possible candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format]
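
    For illustration (not kernel code; slab_msg is a made-up name), this is
    what the attribute buys: once a varargs function is tagged with
    __printf(), the compiler type-checks its arguments against the format
    string, and the -Wsuggest-attribute=format warning above goes away.

    #include <stdio.h>
    #include <stdarg.h>

    #define __printf(a, b) __attribute__((format(printf, a, b)))

    static __printf(1, 2) void slab_msg(const char *fmt, ...)
    {
        va_list args;

        va_start(args, fmt);
        vprintf(fmt, args);
        va_end(args);
    }

    int main(void)
    {
        slab_msg("objects=%u\n", 42U);    /* type-checked against "%u" */
        /* slab_msg("objects=%u\n", "42");  would now warn at compile time */
        return 0;
    }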

    Link: http://lkml.kernel.org/r/20180505200706.19986-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit f3c01d2f3ade6790db67f80fef60df84424f8964 ]

    Currently, __vunmap flow is,
    1) Release the VM area
    2) Free the debug objects corresponding to that vm area.

    This leaves a race window open.
    1) Release the VM area
    1.5) Some other client gets the same vm area
    1.6) This client allocates new debug objects on the same
    vm area
    2) Free the debug objects corresponding to this vm area.

    Here, we actually free the other client's debug objects.

    Fix this by freeing the debug objects first and then releasing the VM
    area.

    Link: http://lkml.kernel.org/r/1523961828-9485-2-git-send-email-cpandya@codeaurora.org
    Signed-off-by: Chintan Pandya
    Reviewed-by: Andrew Morton
    Cc: Ard Biesheuvel
    Cc: Byungchul Park
    Cc: Catalin Marinas
    Cc: Florian Fainelli
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Yisheng Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chintan Pandya
     

25 Jul, 2018

2 commits

  • commit e1f1b1572e8db87a56609fd05bef76f98f0e456a upstream.

    __split_huge_pmd_locked() must check if the cleared huge pmd was dirty,
    and propagate that to PageDirty: otherwise, data may be lost when a huge
    tmpfs page is modified then split then reclaimed.

    How has this taken so long to be noticed? Because there was no problem
    when the huge page is written by a write system call (shmem_write_end()
    calls set_page_dirty()), nor when the page is allocated for a write fault
    (fault_dirty_shared_page() calls set_page_dirty()); but when allocated for
    a read fault (which MAP_POPULATE simulates), no set_page_dirty().

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1807111741430.1106@eggly.anvils
    Fixes: d21b9e57c74c ("thp: handle file pages in split_huge_pmd()")
    Signed-off-by: Hugh Dickins
    Reported-by: Ashwin Chaugule
    Reviewed-by: Yang Shi
    Reviewed-by: Kirill A. Shutemov
    Cc: "Huang, Ying"
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 9f15bde671355c351cf20d9f879004b234353100 upstream.

    It was reported that a kernel crash happened in mem_cgroup_iter(), which
    can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.

    Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
    ......
    Call trace:
    mem_cgroup_iter+0x2e0/0x6d4
    shrink_zone+0x8c/0x324
    balance_pgdat+0x450/0x640
    kswapd+0x130/0x4b8
    kthread+0xe8/0xfc
    ret_from_fork+0x10/0x20

    mem_cgroup_iter():
        ...
        if (css_tryget(css))    <-- crash here: 'css' comes from iter->position,
                                    which has been freed before and filled
                                    with POISON_FREE (0x6b)

    And the root cause of the use-after-free issue is that
    invalidate_reclaim_iterators() fails to reset the value of iter->position
    to NULL when the css of the memcg is released in non-hierarchical mode.

    Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
    Fixes: 6df38689e0e9 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
    Signed-off-by: Jing Xia
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jing Xia
     

22 Jul, 2018

1 commit

  • commit 3ee7e8697d5860b173132606d80a9cd35e7113ee upstream.

    syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
    wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
    WB_shutting_down after wb->bdi->dev became NULL. This indicates that
    unregister_bdi() failed to call wb_shutdown() on one of wb objects.

    The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
    drops bdi's reference to wb structures before going through the list of
    wbs again and calling wb_shutdown() on each of them. This way the loop
    iterating through all wbs can easily miss a wb if that wb has already
    passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
    from cgwb_release_workfn() and as a result fully shutdown bdi although
    wb_workfn() for this wb structure is still running. In fact there are
    also other ways cgwb_bdi_unregister() can race with
    cgwb_release_workfn() leading e.g. to use-after-free issues:

    CPU1                                              CPU2
                                                      cgwb_bdi_unregister()
                                                        cgwb_kill(*slot);

    cgwb_release()
      queue_work(cgwb_release_wq, &wb->release_work);
    cgwb_release_workfn()
                                                        wb = list_first_entry(&bdi->wb_list, ...)
                                                        spin_unlock_irq(&cgwb_lock);
      wb_shutdown(wb);
      ...
      kfree_rcu(wb, rcu);
                                                        wb_shutdown(wb); -> oops use-after-free

    We solve these issues by synchronizing writeback structure shutdown from
    cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
    way we also no longer need synchronization using WB_shutting_down as the
    mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
    CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
    bdi_unregister().

    Reported-by: syzbot
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

17 Jul, 2018

2 commits

  • commit bb177a732c4369bb58a1fe1df8f552b6f0f7db5f upstream.

    syzbot has noticed that a specially crafted library can easily hit
    VM_BUG_ON in __mm_populate

    kernel BUG at mm/gup.c:1242!
    invalid opcode: 0000 [#1] SMP
    CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
    RIP: 0010:__mm_populate+0x1e2/0x1f0
    Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
    Call Trace:
    vm_brk_flags+0xc3/0x100
    vm_brk+0x1f/0x30
    load_elf_library+0x281/0x2e0
    __ia32_sys_uselib+0x170/0x1e0
    do_fast_syscall_32+0xca/0x420
    entry_SYSENTER_compat+0x70/0x7f

    The reason is that the length of the new brk is not page aligned when we
    try to populate it. There is no reason to bug on that though.
    do_brk_flags already aligns the length properly so the mapping is
    expanded as it should. All we need is to tell mm_populate about it.
    Besides that there is absolutely no reason to BUG_ON in the first
    place. The worst thing that could happen is that the last page wouldn't
    get populated and that is far from putting the system into an inconsistent
    state.

    Fix the issue by moving the length sanitization code from do_brk_flags
    up to vm_brk_flags. The only other caller of do_brk_flags is the brk
    syscall entry and it makes sure to provide the proper length so there
    is no need for sanitization and so we can use do_brk_flags without it.

    Also remove the bogus BUG_ONs.

    [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
    Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: syzbot
    Tested-by: Tetsuo Handa
    Reviewed-by: Oscar Salvador
    Cc: Zi Yan
    Cc: "Aneesh Kumar K.V"
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Michael S. Tsirkin
    Cc: Al Viro
    Cc: "Huang, Ying"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit bce73e4842390f7b7309c8e253e139db71288ac3 upstream.

    KVM guests on s390 can notify the host of unused pages. This can result
    in pte_unused callbacks to be true for KVM guest memory.

    If a page is unused (checked with pte_unused) we might drop this page
    instead of paging it. This can have side-effects on userfaultfd, when
    the page in question was already migrated:

    The next access of that page will trigger a fault and a user fault
    instead of faulting in a new and empty zero page. As QEMU does not
    expect a userfault on an already migrated page this migration will fail.

    The most straightforward solution is to ignore the pte_unused hint if a
    userfault context is active for this VMA.

    Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
    Signed-off-by: Christian Borntraeger
    Cc: Martin Schwidefsky
    Cc: Andrea Arcangeli
    Cc: Mike Rapoport
    Cc: Janosch Frank
    Cc: David Hildenbrand
    Cc: Cornelia Huck
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Christian Borntraeger
     

11 Jul, 2018

3 commits

  • commit 28557cc106e6d2aa8b8c5c7687ea9f8055ff3911 upstream.

    Revert commit c7f26ccfb2c3 ("mm/vmstat.c: fix vmstat_update() preemption
    BUG"). Steven saw a "using smp_processor_id() in preemptible" message
    and added a preempt_disable() section around it to keep it quiet. This
    is not the right thing to do it does not fix the real problem.

    vmstat_update() is invoked by a kworker on a specific CPU. This worker
    it bound to this CPU. The name of the worker was "kworker/1:1" so it
    should have been a worker which was bound to CPU1. A worker which can
    run on any CPU would have a `u' before the first digit.

    smp_processor_id() can be used in a preempt-enabled region as long as
    the task is bound to a single CPU which is the case here. If it could
    run on an arbitrary CPU then this is the problem we have an should seek
    to resolve.

    Not only this smp_processor_id() must not be migrated to another CPU but
    also refresh_cpu_vm_stats() which might access wrong per-CPU variables.
    Not to mention that other code relies on the fact that such a worker
    runs on one specific CPU only.

    Therefore revert that commit and we should look instead what broke the
    affinity mask of the kworker.

    Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Steven J. Hill
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sebastian Andrzej Siewior
     
  • commit 31286a8484a85e8b4e91ddb0f5415aee8a416827 upstream.

    Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

    You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
    a 1GB hugepage. This happens because get_user_pages_fast() is not aware
    of a migration entry on pud that was created in the 1st madvise() event.
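
    A hypothetical sketch of that reproducer (assumptions: root,
    CONFIG_MEMORY_FAILURE=y, and a reserved 1GB hugepage via
    "hugepagesz=1G hugepages=1"; not the reporter's actual program):

    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
    #endif
    #ifndef MADV_HWPOISON
    #define MADV_HWPOISON 100
    #endif

    #define SIZE (1UL << 30)

    int main(void)
    {
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Fault the gigantic page in so there is something to poison. */
        p[0] = 1;

        /* The second call used to hit the reported oops on unpatched kernels. */
        madvise(p, SIZE, MADV_HWPOISON);
        madvise(p, SIZE, MADV_HWPOISON);
        return 0;
    }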

    I think that conversion to pud-aligned migration entry is working, but
    other MM code walking over page table isn't prepared for it. We need
    some time and effort to make all this work properly, so this patch
    avoids the reported bug by just disabling error handling for 1GB
    hugepage.

    [n-horiguchi@ah.jp.nec.com: v2]
    Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Punit Agrawal
    Tested-by: Michael Ellerman
    Cc: Anshuman Khandual
    Cc: "Aneesh Kumar K.V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit 520495fe96d74e05db585fc748351e0504d8f40d upstream.

    When booting with very large numbers of gigantic (i.e. 1G) pages, the
    operations in the loop of gather_bootmem_prealloc, and specifically
    prep_compound_gigantic_page, take a very long time and can cause a
    softlockup if enough pages are requested at boot.

    For example booting with 3844 1G pages requires prepping
    (set_compound_head, init the count) over 1 billion 4K tail pages, which
    takes considerable time.
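
    (For scale: each 1G gigantic page consists of 1 GiB / 4 KiB = 262,144
    base pages, so 3844 of them mean 3844 * 262,144 ~= 1.0 * 10^9 tail-page
    initializations inside that loop.)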

    Add a cond_resched() to the outer loop in gather_bootmem_prealloc() to
    prevent this lockup.

    Tested: Booted with softlockup_panic=1 hugepagesz=1G hugepages=3844 and
    no softlockup is reported, and the hugepages are reported as
    successfully setup.

    Link: http://lkml.kernel.org/r/20180627214447.260804-1-cannonmatthews@google.com
    Signed-off-by: Cannon Matthews
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Cc: Greg Thelen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Cannon Matthews
     

03 Jul, 2018

1 commit

  • commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0 upstream.

    In kernel 4.17 I removed some code from dm-bufio that did slab cache
    merging (commit 21bb13276768: "dm bufio: remove code that merges slab
    caches") - both slab and slub support merging caches with identical
    attributes, so dm-bufio now just calls kmem_cache_create and relies on
    implicit merging.

    This uncovered a bug in the slub subsystem - if we delete a cache and
    immediately create another cache with the same attributes, it fails
    because of duplicate filename in /sys/kernel/slab/. The slub subsystem
    offloads freeing the cache to a workqueue - and if we create the new
    cache before the workqueue runs, it complains because of duplicate
    filename in sysfs.

    This patch fixes the bug by moving the call of kobject_del from
    sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
    while we hold slab_mutex - so that the sysfs entry is deleted before a
    cache with the same attributes could be created.

    Running device-mapper-test-suite with:

    dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

    triggered:

    Buffer I/O error on dev dm-0, logical block 1572848, async page read
    device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
    device-mapper: thin: 253:1: aborting current metadata transaction
    sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
    CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
    Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
    Workqueue: dm-thin do_worker [dm_thin_pool]
    Call Trace:
    dump_stack+0x5a/0x73
    sysfs_warn_dup+0x58/0x70
    sysfs_create_dir_ns+0x77/0x80
    kobject_add_internal+0xba/0x2e0
    kobject_init_and_add+0x70/0xb0
    sysfs_slab_add+0xb1/0x250
    __kmem_cache_create+0x116/0x150
    create_cache+0xd9/0x1f0
    kmem_cache_create_usercopy+0x1c1/0x250
    kmem_cache_create+0x18/0x20
    dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
    dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
    __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
    dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
    metadata_operation_failed+0x59/0x100 [dm_thin_pool]
    alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
    process_cell+0x2a3/0x550 [dm_thin_pool]
    do_worker+0x28d/0x8f0 [dm_thin_pool]
    process_one_work+0x171/0x370
    worker_thread+0x49/0x3f0
    kthread+0xf8/0x130
    ret_from_fork+0x35/0x40
    kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
    kmem_cache_create(dm_bufio_buffer-16) failed with error -17

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Reported-by: Mike Snitzer
    Tested-by: Mike Snitzer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka