20 Sep, 2018

1 commit

  • commit 7a9cdebdcc17e426fb5287e4a82db1dfe86339b2 upstream.

    Jann Horn points out that the vmacache_flush_all() function is not only
    potentially expensive, it's buggy too. It also happens to be entirely
    unnecessary, because the sequence number overflow case can be avoided by
    simply making the sequence number be 64-bit. That doesn't even grow the
    data structures in question, because the other adjacent fields are
    already 64-bit.

    So simplify the whole thing by just making the sequence number overflow
    case go away entirely, which gets rid of all the complications and makes
    the code faster too. Win-win.
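
    For a sense of why a 64-bit sequence number makes the overflow case
    irrelevant in practice, a small illustrative calculation (assuming an
    aggressive one billion sequence bumps per second; not kernel code):

    #include <stdio.h>

    int main(void)
    {
        double per_sec = 1e9;                       /* assumed bump rate */
        double secs32 = 4294967296.0 / per_sec;     /* 2^32 increments   */
        double secs64 = 18446744073709551616.0 / per_sec;   /* 2^64      */

        printf("32-bit counter wraps after ~%.1f seconds\n", secs32);
        printf("64-bit counter wraps after ~%.0f years\n",
               secs64 / (3600.0 * 24 * 365));
        return 0;
    }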

    [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
    also just goes away entirely with this ]

    Reported-by: Jann Horn
    Suggested-by: Will Deacon
    Acked-by: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

15 Sep, 2018

1 commit

  • [ Upstream commit a718e28f538441a3b6612da9ff226973376cdf0f ]

    Signed integer overflow is undefined according to the C standard. The
    overflow in ksys_fadvise64_64() is deliberate, but since it is signed
    overflow, UBSAN complains:

    UBSAN: Undefined behaviour in mm/fadvise.c:76:10
    signed integer overflow:
    4 + 9223372036854775805 cannot be represented in type 'long long int'

    Use unsigned types to do math. Unsigned overflow is defined so UBSAN
    will not complain about it. This patch doesn't change generated code.
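
    A minimal userspace sketch of the same idea (illustrative only, not the
    kernel's exact code): do the end-offset arithmetic in unsigned types,
    where wraparound is well defined, and handle the wrapped case explicitly.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int64_t offset = 4;
        int64_t len = 9223372036854775805LL;   /* the value from the report */

        /* offset + len as signed would overflow (undefined behaviour);
         * casting to unsigned first makes the wraparound well defined. */
        uint64_t endbyte = (uint64_t)offset + (uint64_t)len;

        if (endbyte < (uint64_t)offset)        /* wrapped past 2^64 */
            printf("endbyte wrapped, clamp to end of file\n");
        else
            printf("endbyte = %llu\n", (unsigned long long)endbyte);
        return 0;
    }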

    [akpm@linux-foundation.org: add comment explaining the casts]
    Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     

10 Sep, 2018

2 commits

  • commit a6f572084fbee8b30f91465f4a085d7a90901c57 upstream.

    Will noted that only checking mm_users is incorrect; we should also
    check mm_count in order to cover CPUs that have a lazy reference to
    this mm (and could do speculative TLB operations).

    If removing this turns out to be a performance issue, we can
    re-instate a more complete check, but in tlb_table_flush() eliding the
    call_rcu_sched().

    Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
    Reported-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Acked-by: Will Deacon
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit dc30b96ab6d569060741572cf30517d3179429a8 upstream.

    ondemand_readahead() checks bdi->io_pages to cap the maximum pages
    that need to be processed. This works until the readit section. If
    we would do an async only readahead (async size = sync size) and
    target is at beginning of window we expand the pages by another
    get_next_ra_size() pages. Btrace for large reads shows that the kernel
    always issues a doubled-size read at the beginning of processing.
    Add an additional check for io_pages in the lower part of the function.
    The fix helps devices that hard limit bio pages and rely on proper
    handling of max_hw_read_sectors (e.g. older FusionIO cards). For
    that reason it could qualify for stable.

    Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
    Cc: stable@vger.kernel.org
    Signed-off-by: Markus Stockhausen stockhausen@collogia.de
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Markus Stockhausen
     

05 Sep, 2018

6 commits

  • commit d86564a2f085b79ec046a5cba90188e612352806 upstream.

    Jann reported that x86 was missing required TLB invalidates when he
    hit the !*batch slow path in tlb_remove_table().

    This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
    invalidates, the PowerPC-hash where this code originated and the
    Sparc-hash where this was subsequently used did not need that. ARM
    which later used this put an explicit TLB invalidate in their
    __p*_free_tlb() functions, and PowerPC-radix followed that example.

    But when we hooked up x86 we failed to consider this. Fix this by
    (optionally) hooking tlb_remove_table() into the TLB invalidate code.

    NOTE: s390 was also needing something like this and might now
    be able to use the generic code again.

    [ Modified to be on top of Nick's cleanups, which simplified this patch
    now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]

    Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
    Reported-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Will Deacon
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit db7ddef301128dad394f1c0f77027f86ee9a4edb upstream.

    There is no need to call this from tlb_flush_mmu_tlbonly, it logically
    belongs with tlb_flush_mmu_free. This makes future fixes simpler.

    [ This was originally done to allow code consolidation for the
    mmu_notifier fix, but it also ends up helping simplify the
    HAVE_RCU_TABLE_INVALIDATE fix. - Linus ]

    Signed-off-by: Nicholas Piggin
    Acked-by: Will Deacon
    Cc: Peter Zijlstra
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • [ Upstream commit 24eee1e4c47977bdfb71d6f15f6011e7b6188d04 ]

    ioremap_prot() can return NULL which could lead to an oops.

    Link: http://lkml.kernel.org/r/1533195441-58594-1-git-send-email-chenjie6@huawei.com
    Signed-off-by: chen jie
    Reviewed-by: Andrew Morton
    Cc: Li Zefan
    Cc: chenjie
    Cc: Yang Shi
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    chen jie
     
  • [ Upstream commit 7e97de0b033bcac4fa9a35cef72e0c06e6a22c67 ]

    In case of memcg_online_kmem() failure, memcg_cgroup::id remains hashed
    in mem_cgroup_idr even after memcg memory is freed. This leads to leak
    of ID in mem_cgroup_idr.

    This patch adds removal into mem_cgroup_css_alloc(), which fixes the
    problem. For better readability, it adds a generic helper which is used
    in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.

    Link: http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
    Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Signed-off-by: Kirill Tkhai
    Acked-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kirill Tkhai
     
  • [ Upstream commit 53406ed1bcfdabe4b5bc35e6d17946c6f9f563e2 ]

    Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
    that mmap_sem must be held when splitting an "anonymous" vma there.
    Whether that's still strictly true nowadays is not entirely clear,
    but the danger of sometimes crashing on the BUG is now fairly clear.

    Even with the new stricter rules for anonymous vma marking, the
    condition it checks for can possibly trigger. Commit 44960f2a7b63
    ("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
    pages") is good, and originally I thought it was safe from that
    VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
    disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
    insists on VM_SHARED.

    But after I read John's earlier mail, drawing attention to the
    vfs_fallocate() in there: I may be wrong, and I don't know if Android
    has THP in the config anyway, but it looks to me like an
    unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
    the VM_BUG_ON_VMA(), once it's vma_is_anonymous().

    Signed-off-by: Hugh Dickins
    Cc: John Stultz
    Cc: Kirill Shutemov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • [ Upstream commit 16e536ef47f567289a5699abee9ff7bb304bc12d ]

    /sys/../zswap/stored_pages keeps rising in a zswap test with
    "zswap.max_pool_percent=0" parameter. But it should not compress or
    store pages any more since there is no space in the compressed pool.

    Reproduce steps:
    1. Boot kernel with "zswap.enabled=1"
    2. Set the max_pool_percent to 0
    # echo 0 > /sys/module/zswap/parameters/max_pool_percent
    3. Do memory stress test to see if some pages have been compressed
    # stress --vm 1 --vm-bytes $mem_available"M" --timeout 60s
    4. Watching the 'stored_pages' number increasing or not

    The root cause is:

    When zswap_max_pool_percent is set to 0 via kernel parameter,
    zswap_is_full() will always return true due to zswap_shrink(). But if
    the shrinking is able to reclaim a page successfully, the code then
    proceeds to compressing/storing another page, so the value of
    stored_pages will keep changing.

    To solve the issue, this patch adds a zswap_is_full() check again after
    zswap_shrink() to make sure it's now under the max_pool_percent, and to
    not compress/store if we reached the limit.

    Link: http://lkml.kernel.org/r/20180530103936.17812-1-liwang@redhat.com
    Signed-off-by: Li Wang
    Acked-by: Dan Streetman
    Cc: Seth Jennings
    Cc: Huang Ying
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Li Wang
     

24 Aug, 2018

1 commit

  • [ Upstream commit 1e8e18f694a52d703665012ca486826f64bac29d ]

    There is a special case that the size is "(N << KASAN_SHADOW_SCALE_SHIFT)
    Pages plus X", the value of X is [1, KASAN_SHADOW_SCALE_SIZE-1]. The
    operation "size >> KASAN_SHADOW_SCALE_SHIFT" will drop X, and the
    roundup operation cannot retrieve the missed page. For example:
    size=0x28006, PAGE_SIZE=0x1000, KASAN_SHADOW_SCALE_SHIFT=3, we will get
    shadow_size=0x5000, but actually we need 6 pages.

    shadow_size = round_up(size >> KASAN_SHADOW_SCALE_SHIFT, PAGE_SIZE);
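
    The arithmetic is easy to check in a few lines of userspace C (PAGE_SIZE
    and the scale shift mirror the values quoted above; the "fixed" line shows
    one way to account for the dropped bits, not necessarily the exact
    upstream change):

    #include <stdio.h>

    #define PAGE_SIZE                0x1000UL
    #define KASAN_SHADOW_SCALE_SHIFT 3

    #define DIV_ROUND_UP(x, y) (((x) + (y) - 1) / (y))
    #define round_up(x, y)     (DIV_ROUND_UP(x, y) * (y))

    int main(void)
    {
        unsigned long size = 0x28006;

        /* buggy: the low bits of 'size' are dropped before rounding up */
        unsigned long buggy = round_up(size >> KASAN_SHADOW_SCALE_SHIFT,
                                       PAGE_SIZE);
        /* fixed: round the shadow bytes up first, then round to pages  */
        unsigned long fixed = round_up(DIV_ROUND_UP(size,
                                       1UL << KASAN_SHADOW_SCALE_SHIFT),
                                       PAGE_SIZE);

        printf("buggy shadow_size = %#lx (%lu pages)\n", buggy, buggy / PAGE_SIZE);
        printf("fixed shadow_size = %#lx (%lu pages)\n", fixed, fixed / PAGE_SIZE);
        return 0;
    }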

    This can lead to a kernel crash when kasan is enabled and the value of
    mod->core_layout.size or mod->init_layout.size is like above, because
    the shadow memory of X has not been allocated and mapped.

    move_module:
    ptr = module_alloc(mod->core_layout.size);
    ...
    memset(ptr, 0, mod->core_layout.size); //crashed

    Unable to handle kernel paging request at virtual address ffff0fffff97b000
    ......
    Call trace:
    __asan_storeN+0x174/0x1a8
    memset+0x24/0x48
    layout_and_allocate+0xcd8/0x1800
    load_module+0x190/0x23e8
    SyS_finit_module+0x148/0x180

    Link: http://lkml.kernel.org/r/1529659626-12660-1-git-send-email-thunder.leizhen@huawei.com
    Signed-off-by: Zhen Lei
    Reviewed-by: Dmitriy Vyukov
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Hanjun Guo
    Cc: Libin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Zhen Lei
     

16 Aug, 2018

2 commits

  • commit 377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb upstream

    For the L1TF workaround it's necessary to limit the swap file size to below
    MAX_PA/2, so that the higher bits of the swap offset inverted never point
    to valid memory.

    Add a mechanism for the architecture to override the swap file size check
    in swapfile.c and add an x86-specific max swapfile check function that
    enforces that limit.

    The check is only enabled if the CPU is vulnerable to L1TF.

    In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
    with 46bit PA it is 32TB. The limit is only per individual swap file, so
    it's always possible to exceed these limits with multiple swap files or
    partitions.
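
    The quoted limits follow directly from MAX_PA/2 (illustrative arithmetic
    only):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long tb = 1ULL << 40;

        /* per-swapfile limit is MAX_PA / 2 */
        printf("42-bit MAX_PA: %llu TB\n", (1ULL << 42) / 2 / tb);  /*  2 TB */
        printf("46-bit MAX_PA: %llu TB\n", (1ULL << 46) / 2 / tb);  /* 32 TB */
        return 0;
    }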

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     
  • commit 42e4089c7890725fcd329999252dc489b72f2921 upstream

    For L1TF, PROT_NONE mappings are protected by inverting the PFN in the page
    table entry. This sets the high bits in the CPU's address space, thus
    making sure not to point an unmapped entry to valid cached memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such a high mapping was mapped to unprivileged users
    they could attack low memory by setting such a mapping to PROT_NONE. This
    could happen through a special device driver which is not access
    protected. Normal /dev/mem is of course access protected.

    To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyway.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact this is only enforced if the mapping actually refers to
    a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
    the check for root.

    For mmaps this is straightforward and can be handled in vm_insert_pfn and
    in remap_pfn_range().

    For mprotect it's a bit trickier. At the point where the actual PTEs are
    accessed a lot of state has been changed and it would be difficult to undo
    on an error. Since this is an uncommon case, use a separate early page
    table walk pass for MMIO PROT_NONE mappings that checks for this condition
    early. For non-MMIO and non-PROT_NONE there are no changes.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

03 Aug, 2018

2 commits

  • [ Upstream commit a38965bf941b7c2af50de09c96bc5f03e136caef ]

    __printf is useful to verify format and arguments. Remove the following
    warning (with W=1):

    mm/slub.c:721:2: warning: function might be possible candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format]
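
    A self-contained illustration of what the annotation buys (the gcc/clang
    format attribute; slab_msg is a made-up name, not the slub function):

    #include <stdio.h>
    #include <stdarg.h>

    /* userspace equivalent of the kernel's __printf(1, 2) */
    __attribute__((format(printf, 1, 2)))
    static void slab_msg(const char *fmt, ...)
    {
        va_list ap;

        va_start(ap, fmt);
        vprintf(fmt, ap);
        va_end(ap);
    }

    int main(void)
    {
        slab_msg("objects=%d\n", 32);       /* fine */
        /* slab_msg("objects=%d\n", "x");      now caught at compile time */
        return 0;
    }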

    Link: http://lkml.kernel.org/r/20180505200706.19986-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit f3c01d2f3ade6790db67f80fef60df84424f8964 ]

    Currently, the __vunmap flow is:
    1) Release the VM area
    2) Free the debug objects corresponding to that vm area.

    This leaves a race window open:
    1) Release the VM area
    1.5) Some other client gets the same vm area
    1.6) This client allocates new debug objects on the same
    vm area
    2) Free the debug objects corresponding to this vm area.

    Here, we actually free 'other' client's debug objects.

    Fix this by freeing the debug objects first and then releasing the VM
    area.

    Link: http://lkml.kernel.org/r/1523961828-9485-2-git-send-email-cpandya@codeaurora.org
    Signed-off-by: Chintan Pandya
    Reviewed-by: Andrew Morton
    Cc: Ard Biesheuvel
    Cc: Byungchul Park
    Cc: Catalin Marinas
    Cc: Florian Fainelli
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Yisheng Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chintan Pandya
     

25 Jul, 2018

2 commits

  • commit e1f1b1572e8db87a56609fd05bef76f98f0e456a upstream.

    __split_huge_pmd_locked() must check if the cleared huge pmd was dirty,
    and propagate that to PageDirty: otherwise, data may be lost when a huge
    tmpfs page is modified then split then reclaimed.

    How has this taken so long to be noticed? Because there was no problem
    when the huge page is written by a write system call (shmem_write_end()
    calls set_page_dirty()), nor when the page is allocated for a write fault
    (fault_dirty_shared_page() calls set_page_dirty()); but when allocated for
    a read fault (which MAP_POPULATE simulates), no set_page_dirty().

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1807111741430.1106@eggly.anvils
    Fixes: d21b9e57c74c ("thp: handle file pages in split_huge_pmd()")
    Signed-off-by: Hugh Dickins
    Reported-by: Ashwin Chaugule
    Reviewed-by: Yang Shi
    Reviewed-by: Kirill A. Shutemov
    Cc: "Huang, Ying"
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 9f15bde671355c351cf20d9f879004b234353100 upstream.

    It was reported that a kernel crash happened in mem_cgroup_iter(), which
    can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.

    Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
    ......
    Call trace:
    mem_cgroup_iter+0x2e0/0x6d4
    shrink_zone+0x8c/0x324
    balance_pgdat+0x450/0x640
    kswapd+0x130/0x4b8
    kthread+0xe8/0xfc
    ret_from_fork+0x10/0x20

    The crash happens at the css_tryget(css) check in mem_cgroup_iter():
    the css pointer comes from iter->position, which has been freed before
    and filled with POISON_FREE(0x6b).

    And the root cause of the use-after-free issue is that
    invalidate_reclaim_iterators() fails to reset the value of iter->position
    to NULL when the css of the memcg is released in non-hierarchical mode.

    Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
    Fixes: 6df38689e0e9 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
    Signed-off-by: Jing Xia
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jing Xia
     

22 Jul, 2018

1 commit

  • commit 3ee7e8697d5860b173132606d80a9cd35e7113ee upstream.

    syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
    wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
    WB_shutting_down after wb->bdi->dev became NULL. This indicates that
    unregister_bdi() failed to call wb_shutdown() on one of wb objects.

    The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
    drops bdi's reference to wb structures before going through the list of
    wbs again and calling wb_shutdown() on each of them. This way the loop
    iterating through all wbs can easily miss a wb if that wb has already
    passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
    from cgwb_release_workfn(), and as a result fully shut down the bdi although
    wb_workfn() for this wb structure is still running. In fact there are
    also other ways cgwb_bdi_unregister() can race with
    cgwb_release_workfn() leading e.g. to use-after-free issues:

    CPU1                                                CPU2
                                                        cgwb_bdi_unregister()
                                                          cgwb_kill(*slot);

    cgwb_release()
      queue_work(cgwb_release_wq, &wb->release_work);
    cgwb_release_workfn()
                                                          wb = list_first_entry(&bdi->wb_list, ...)
                                                          spin_unlock_irq(&cgwb_lock);
      wb_shutdown(wb);
      ...
      kfree_rcu(wb, rcu);
                                                          wb_shutdown(wb); -> oops use-after-free

    We solve these issues by synchronizing writeback structure shutdown from
    cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
    way we also no longer need synchronization using WB_shutting_down as the
    mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
    CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
    bdi_unregister().

    Reported-by: syzbot
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

17 Jul, 2018

2 commits

  • commit bb177a732c4369bb58a1fe1df8f552b6f0f7db5f upstream.

    syzbot has noticed that a specially crafted library can easily hit
    VM_BUG_ON in __mm_populate

    kernel BUG at mm/gup.c:1242!
    invalid opcode: 0000 [#1] SMP
    CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
    RIP: 0010:__mm_populate+0x1e2/0x1f0
    Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
    Call Trace:
    vm_brk_flags+0xc3/0x100
    vm_brk+0x1f/0x30
    load_elf_library+0x281/0x2e0
    __ia32_sys_uselib+0x170/0x1e0
    do_fast_syscall_32+0xca/0x420
    entry_SYSENTER_compat+0x70/0x7f

    The reason is that the length of the new brk is not page aligned when we
    try to populate it. There is no reason to bug on that though.
    do_brk_flags already aligns the length properly so the mapping is
    expanded as it should. All we need is to tell mm_populate about it.
    Besides that, there is absolutely no reason to BUG_ON in the first
    place. The worst thing that could happen is that the last page wouldn't
    get populated, and that is far from putting the system into an
    inconsistent state.

    Fix the issue by moving the length sanitization code from do_brk_flags
    up to vm_brk_flags. The only other caller of do_brk_flags is the brk
    syscall entry, and it makes sure to provide the proper length, so there
    is no need for sanitization and we can use do_brk_flags without it.

    Also remove the bogus BUG_ONs.
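
    The shape of the fix is aligning the requested length once, early, and
    using that aligned value for both the mapping and the populate step; a
    hedged sketch with made-up numbers (not the kernel code):

    #include <stdio.h>

    #define PAGE_SIZE     4096UL
    #define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

    int main(void)
    {
        unsigned long request = 0x2e1;              /* unaligned library size */
        unsigned long len = PAGE_ALIGN(request);    /* what actually gets mapped */

        /* populate with the same aligned length instead of the raw request */
        printf("request %#lx -> map/populate %#lx bytes\n", request, len);
        return 0;
    }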

    [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
    Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: syzbot
    Tested-by: Tetsuo Handa
    Reviewed-by: Oscar Salvador
    Cc: Zi Yan
    Cc: "Aneesh Kumar K.V"
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Michael S. Tsirkin
    Cc: Al Viro
    Cc: "Huang, Ying"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit bce73e4842390f7b7309c8e253e139db71288ac3 upstream.

    KVM guests on s390 can notify the host of unused pages. This can result
    in pte_unused callbacks returning true for KVM guest memory.

    If a page is unused (checked with pte_unused) we might drop this page
    instead of paging it. This can have side-effects on userfaultfd, when
    the page in question was already migrated:

    The next access of that page will trigger a fault and a user fault
    instead of faulting in a new and empty zero page. As QEMU does not
    expect a userfault on an already migrated page this migration will fail.

    The most straightforward solution is to ignore the pte_unused hint if a
    userfault context is active for this VMA.

    Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
    Signed-off-by: Christian Borntraeger
    Cc: Martin Schwidefsky
    Cc: Andrea Arcangeli
    Cc: Mike Rapoport
    Cc: Janosch Frank
    Cc: David Hildenbrand
    Cc: Cornelia Huck
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Christian Borntraeger
     

11 Jul, 2018

3 commits

  • commit 28557cc106e6d2aa8b8c5c7687ea9f8055ff3911 upstream.

    Revert commit c7f26ccfb2c3 ("mm/vmstat.c: fix vmstat_update() preemption
    BUG"). Steven saw a "using smp_processor_id() in preemptible" message
    and added a preempt_disable() section around it to keep it quiet. This
    is not the right thing to do; it does not fix the real problem.

    vmstat_update() is invoked by a kworker on a specific CPU. This worker
    is bound to this CPU. The name of the worker was "kworker/1:1" so it
    should have been a worker which was bound to CPU1. A worker which can
    run on any CPU would have a `u' before the first digit.

    smp_processor_id() can be used in a preempt-enabled region as long as
    the task is bound to a single CPU which is the case here. If it could
    run on an arbitrary CPU then this is the problem we have and should seek
    to resolve.

    Not only must this smp_processor_id() user not be migrated to another CPU,
    but refresh_cpu_vm_stats() might also access the wrong per-CPU variables.
    Not to mention that other code relies on the fact that such a worker
    runs on one specific CPU only.

    Therefore revert that commit and we should look instead what broke the
    affinity mask of the kworker.

    Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Steven J. Hill
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sebastian Andrzej Siewior
     
  • commit 31286a8484a85e8b4e91ddb0f5415aee8a416827 upstream.

    Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

    You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
    a 1GB hugepage. This happens because get_user_pages_fast() is not aware
    of a migration entry on pud that was created in the 1st madvise() event.
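
    A hedged userspace sketch of that reproducer (assumptions: root,
    CONFIG_MEMORY_FAILURE, a pre-reserved 1GB hugepage; on an unpatched
    kernel the second call could crash the machine, so only run it in a
    throwaway VM):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB  (30 << 26)       /* 30 << MAP_HUGE_SHIFT */
    #endif
    #ifndef MADV_HWPOISON
    #define MADV_HWPOISON 100
    #endif

    int main(void)
    {
        size_t len = 1UL << 30;            /* one 1GB hugepage */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);

        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(p, 0, len);                 /* fault the hugepage in */

        madvise(p, len, MADV_HWPOISON);    /* 1st: leaves the pud migration entry */
        madvise(p, len, MADV_HWPOISON);    /* 2nd: hit the oops reported above */
        return 0;
    }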

    I think that conversion to pud-aligned migration entry is working, but
    other MM code walking over page tables isn't prepared for it. We need
    some time and effort to make all this work properly, so this patch
    avoids the reported bug by just disabling error handling for 1GB
    hugepage.

    [n-horiguchi@ah.jp.nec.com: v2]
    Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Punit Agrawal
    Tested-by: Michael Ellerman
    Cc: Anshuman Khandual
    Cc: "Aneesh Kumar K.V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit 520495fe96d74e05db585fc748351e0504d8f40d upstream.

    When booting with very large numbers of gigantic (i.e. 1G) pages, the
    operations in the loop of gather_bootmem_prealloc, and specifically
    prep_compound_gigantic_page, takes a very long time, and can cause a
    softlockup if enough pages are requested at boot.

    For example booting with 3844 1G pages requires prepping
    (set_compound_head, init the count) over 1 billion 4K tail pages, which
    takes considerable time.
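
    The "over 1 billion" figure is simple arithmetic (illustrative only):

    #include <stdio.h>

    int main(void)
    {
        unsigned long pages_per_1g = (1UL << 30) / (1UL << 12); /* 262144 4K pages */
        unsigned long tail_per_1g  = pages_per_1g - 1;          /* all but the head */

        printf("3844 x %lu = %lu tail pages to prep\n",
               tail_per_1g, 3844UL * tail_per_1g);
        return 0;
    }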

    Add a cond_resched() to the outer loop in gather_bootmem_prealloc() to
    prevent this lockup.

    Tested: Booted with softlockup_panic=1 hugepagesz=1G hugepages=3844 and
    no softlockup is reported, and the hugepages are reported as
    successfully setup.

    Link: http://lkml.kernel.org/r/20180627214447.260804-1-cannonmatthews@google.com
    Signed-off-by: Cannon Matthews
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Cc: Greg Thelen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Cannon Matthews
     

03 Jul, 2018

3 commits

  • commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0 upstream.

    In kernel 4.17 I removed some code from dm-bufio that did slab cache
    merging (commit 21bb13276768: "dm bufio: remove code that merges slab
    caches") - both slab and slub support merging caches with identical
    attributes, so dm-bufio now just calls kmem_cache_create and relies on
    implicit merging.

    This uncovered a bug in the slub subsystem - if we delete a cache and
    immediately create another cache with the same attributes, it fails
    because of duplicate filename in /sys/kernel/slab/. The slub subsystem
    offloads freeing the cache to a workqueue - and if we create the new
    cache before the workqueue runs, it complains because of duplicate
    filename in sysfs.

    This patch fixes the bug by moving the call of kobject_del from
    sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
    while we hold slab_mutex - so that the sysfs entry is deleted before a
    cache with the same attributes could be created.

    Running device-mapper-test-suite with:

    dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

    triggered:

    Buffer I/O error on dev dm-0, logical block 1572848, async page read
    device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
    device-mapper: thin: 253:1: aborting current metadata transaction
    sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
    CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
    Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
    Workqueue: dm-thin do_worker [dm_thin_pool]
    Call Trace:
    dump_stack+0x5a/0x73
    sysfs_warn_dup+0x58/0x70
    sysfs_create_dir_ns+0x77/0x80
    kobject_add_internal+0xba/0x2e0
    kobject_init_and_add+0x70/0xb0
    sysfs_slab_add+0xb1/0x250
    __kmem_cache_create+0x116/0x150
    create_cache+0xd9/0x1f0
    kmem_cache_create_usercopy+0x1c1/0x250
    kmem_cache_create+0x18/0x20
    dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
    dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
    __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
    dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
    metadata_operation_failed+0x59/0x100 [dm_thin_pool]
    alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
    process_cell+0x2a3/0x550 [dm_thin_pool]
    do_worker+0x28d/0x8f0 [dm_thin_pool]
    process_one_work+0x171/0x370
    worker_thread+0x49/0x3f0
    kthread+0xf8/0x130
    ret_from_fork+0x35/0x40
    kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
    kmem_cache_create(dm_bufio_buffer-16) failed with error -17

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Reported-by: Mike Snitzer
    Tested-by: Mike Snitzer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 1105a2fc022f3c7482e32faf516e8bc44095f778 upstream.

    In our armv8a server (QDF2400), I noticed lots of WARN_ONs caused by
    rmap_item->address not being PAGE_SIZE aligned under memory pressure
    tests (start 20 guests and run memhog in the host).

    WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
    CPU: 4 PID: 4641 Comm: memhog Tainted: G W 4.17.0-rc3+ #8
    Call trace:
    kvm_age_hva_handler+0xc0/0xc8
    handle_hva_to_gpa+0xa8/0xe0
    kvm_age_hva+0x4c/0xe8
    kvm_mmu_notifier_clear_flush_young+0x54/0x98
    __mmu_notifier_clear_flush_young+0x6c/0xa0
    page_referenced_one+0x154/0x1d8
    rmap_walk_ksm+0x12c/0x1d0
    rmap_walk+0x94/0xa0
    page_referenced+0x194/0x1b0
    shrink_page_list+0x674/0xc28
    shrink_inactive_list+0x26c/0x5b8
    shrink_node_memcg+0x35c/0x620
    shrink_node+0x100/0x430
    do_try_to_free_pages+0xe0/0x3a8
    try_to_free_pages+0xe4/0x230
    __alloc_pages_nodemask+0x564/0xdc0
    alloc_pages_vma+0x90/0x228
    do_anonymous_page+0xc8/0x4d0
    __handle_mm_fault+0x4a0/0x508
    handle_mm_fault+0xf8/0x1b0
    do_page_fault+0x218/0x4b8
    do_translation_fault+0x90/0xa0
    do_mem_abort+0x68/0xf0
    el0_da+0x24/0x28

    In rmap_walk_ksm, the rmap_item->address might still have the
    STABLE_FLAG, then the start and end in handle_hva_to_gpa might not be
    PAGE_SIZE aligned. Thus it will cause exceptions in handle_hva_to_gpa
    on arm64.

    This patch fixes it by ignoring (not removing) the low bits of address
    when doing rmap_walk_ksm.
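
    The essence is masking the metadata bits out of the stored address before
    it is used as a host virtual address; a hedged sketch (the flag value and
    address are illustrative, not the ksm internals):

    #include <stdio.h>

    #define PAGE_SIZE   4096UL
    #define PAGE_MASK   (~(PAGE_SIZE - 1))
    #define STABLE_FLAG 0x200UL            /* illustrative low metadata bit */

    int main(void)
    {
        unsigned long rmap_address = 0x7f2a40200000UL | STABLE_FLAG;

        /* ignore (rather than strip from the rmap_item) the low bits
         * whenever the value is used as a page-aligned address */
        unsigned long hva = rmap_address & PAGE_MASK;

        printf("stored %#lx -> used %#lx\n", rmap_address, hva);
        return 0;
    }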

    IMO, it should be backported to the stable tree. The storm of WARN_ONs is
    very easy for me to reproduce. More than that, I watched a panic (not
    reproducible) as follows:

    page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
    flags: 0x1fffc00000000000()
    raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
    raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
    page dumped because: nonzero _refcount
    CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
    Hardware name:
    Call trace:
    dump_backtrace+0x0/0x22c
    show_stack+0x24/0x2c
    dump_stack+0x8c/0xb0
    bad_page+0xf4/0x154
    free_pages_check_bad+0x90/0x9c
    free_pcppages_bulk+0x464/0x518
    free_hot_cold_page+0x22c/0x300
    __put_page+0x54/0x60
    unmap_stage2_range+0x170/0x2b4
    kvm_unmap_hva_handler+0x30/0x40
    handle_hva_to_gpa+0xb0/0xec
    kvm_unmap_hva_range+0x5c/0xd0

    I even injected a fault on purpose in kvm_unmap_hva_range by setting
    size=size-0x200; the call trace is similar to the above. So I thought the
    panic shares the same root cause as the WARN_ON.

    Andrea said:

    : It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
    : zap those bits and hide the misalignment caused by the low metadata
    : bits being erroneously left set in the address, but the arm code
    : notices when that's the last page in the memslot and the hva_end is
    : getting aligned and the size is below one page.
    :
    : I think the problem triggers in the addr += PAGE_SIZE of
    : unmap_stage2_ptes that never matches end because end is aligned but
    : addr is not.
    :
    : } while (pte++, addr += PAGE_SIZE, addr != end);
    :
    : x86 again only works on hva_start/hva_end after converting it to
    : gfn_start/end and that being in pfn units the bits are zapped before
    : they risk to cause trouble.

    Jia He said:

    : I've tested it myself on an arm64 server (QDF2400, 46 cpus, 96G mem). Without
    : this patch, the WARN_ON is very easy to reproduce. After this patch, I
    : have run the same benchmark for a whole day without any WARN_ONs.

    Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He
    Reviewed-by: Andrea Arcangeli
    Tested-by: Jia He
    Cc: Suzuki K Poulose
    Cc: Minchan Kim
    Cc: Claudio Imbrenda
    Cc: Arvind Yadav
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jia He
     
  • commit a9b6de77b1a3ff729f7bfc54b2e17711776a416c upstream.

    get_user_pages_fast() for device pages is missing the typical validation
    that all page references have been taken while the mapping was valid.
    Without this validation truncate operations can not reliably coordinate
    against new page reference events like O_DIRECT.

    Cc:
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Reported-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

26 Jun, 2018

2 commits

  • commit 7810e6781e0fcbca78b91cf65053f895bf59e85f upstream.

    In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
    allocations that can ignore memory policies. The zonelist is obtained
    from current CPU's node. This is a problem for __GFP_THISNODE
    allocations that want to allocate on a different node, e.g. because the
    allocating thread has been migrated to a different CPU.

    This has been observed to break SLAB in our 4.4-based kernel, because
    there it relies on __GFP_THISNODE working as intended. If a slab page
    is put on wrong node's list, then further list manipulations may corrupt
    the list because page_to_nid() is used to determine which node's
    list_lock should be locked and thus we may take a wrong lock and race.

    Current SLAB implementation seems to be immune by luck thanks to commit
    511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
    arbitrary node") but there may be others assuming that __GFP_THISNODE
    works as promised.

    We can fix it by simply removing the zonelist reset completely. There
    is actually no reason to reset it, because memory policies and cpusets
    don't affect the zonelist choice in the first place. This was different
    when commit 183f6371aac2 ("mm: ignore mempolicies when using
    ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
    own restricted zonelists.

    We might consider this for 4.17 although I don't know if there's
    anything currently broken.

    SLAB is currently not affected, but in kernels older than 4.7 that don't
    yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
    allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
    ones I'll have to check.

    So stable backports should be more important, but will have to be
    reviewed carefully, as the code went through many changes. BTW I think
    that also the ac->preferred_zoneref reset is currently useless if we
    don't also reset ac->nodemask from a mempolicy to NULL first (which we
    probably should for the OOM victims etc?), but I would leave that for a
    separate patch.

    Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit f183464684190bacbfb14623bd3e4e51b7575b4c upstream.

    cgwb_release() punts the actual release to cgwb_release_workfn() on
    system_wq. Depending on the number of cgroups or block devices, there
    can be a lot of cgwb_release_workfn() in flight at the same time.

    We're periodically seeing close to 256 kworkers getting stuck with the
    following stack trace, and over time the entire system gets stuck.

    [] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
    [] synchronize_rcu_expedited+0x24/0x30
    [] bdi_unregister+0x53/0x290
    [] release_bdi+0x89/0xc0
    [] wb_exit+0x85/0xa0
    [] cgwb_release_workfn+0x54/0xb0
    [] process_one_work+0x150/0x410
    [] worker_thread+0x6d/0x520
    [] kthread+0x12c/0x160
    [] ret_from_fork+0x29/0x40
    [] 0xffffffffffffffff

    The events leading to the lockup are...

    1. A lot of cgwb_release_workfn() is queued at the same time and all
    system_wq kworkers are assigned to execute them.

    2. They all end up calling synchronize_rcu_expedited(). One of them
    wins and tries to perform the expedited synchronization.

    3. However, that involves queueing rcu_exp_work to system_wq and
    waiting for it. Because #1 is holding all available kworkers on
    system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
    is waiting for synchronize_rcu_expedited() which in turn is waiting
    for cgwb_release_workfn() to free up some of the kworkers.

    We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
    same time. There's nothing to be gained from that. This patch
    updates cgwb release path to use a dedicated percpu workqueue with
    @max_active of 1.

    While this resolves the problem at hand, it might be a good idea to
    isolate rcu_exp_work to its own workqueue too as it can be used from
    various paths and is prone to this sort of indirect A-A deadlocks.

    Signed-off-by: Tejun Heo
    Cc: "Paul E. McKenney"
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

21 Jun, 2018

1 commit

  • [ Upstream commit c892fd82cc0632d425ae011a4dd75eb59e9f84ee ]

    If there is heavy memory pressure, page allocation with __GFP_NOWAIT
    fails easily although it's order-0 request. I got below warning 9 times
    for normal boot.

    : page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
    .. snip ..
    Call trace:
    dump_backtrace+0x0/0x4
    dump_stack+0xa4/0xc0
    warn_alloc+0xd4/0x15c
    __alloc_pages_nodemask+0xf88/0x10fc
    alloc_slab_page+0x40/0x18c
    new_slab+0x2b8/0x2e0
    ___slab_alloc+0x25c/0x464
    __kmalloc+0x394/0x498
    memcg_kmem_get_cache+0x114/0x2b8
    kmem_cache_alloc+0x98/0x3e8
    mmap_region+0x3bc/0x8c0
    do_mmap+0x40c/0x43c
    vm_mmap_pgoff+0x15c/0x1e4
    sys_mmap+0xb0/0xc8
    el0_svc_naked+0x24/0x28
    Mem-Info:
    active_anon:17124 inactive_anon:193 isolated_anon:0
    active_file:7898 inactive_file:712955 isolated_file:55
    unevictable:0 dirty:27 writeback:18 unstable:0
    slab_reclaimable:12250 slab_unreclaimable:23334
    mapped:19310 shmem:212 pagetables:816 bounce:0
    free:36561 free_pcp:1205 free_cma:35615
    Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
    lowmem_reserve[]: 0 1842 1842
    Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
    lowmem_reserve[]: 0 0 0
    DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
    Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
    721350 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    945512 pages RAM
    0 pages HighMem/MovableOnly
    63408 pages reserved
    51200 pages cma reserved

    __memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
    and the worker allocation failure is not really critical because we will
    retry on the next kmem charge. We might miss some charges but that
    shouldn't be critical. The excessive allocation failure report is not
    very helpful.

    [mhocko@kernel.org: changelog update]
    Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

12 Jun, 2018

2 commits

  • commit 423913ad4ae5b3e8fb8983f70969fb522261ba26 upstream.

    Commit be83bbf80682 ("mmap: introduce sane default mmap limits") was
    introduced to catch problems in various ad-hoc character device drivers
    doing mmap and getting the size limits wrong. In the process, it used
    "known good" limits for the normal cases of mapping regular files and
    block device drivers.

    It turns out that the "s_maxbytes" limit was less "known good" than I
    thought. In particular, /proc doesn't set it, but exposes one regular
    file to mmap: /proc/vmcore. As a result, that file got limited to the
    default MAX_INT s_maxbytes value.

    This went unnoticed for a while, because apparently the only thing that
    needs it is the s390 kernel zfcpdump, but there might be other tools
    that use this too.

    Vasily suggested just changing s_maxbytes for all of /proc, which isn't
    wrong, but makes me nervous at this stage. So instead, just make the
    new mmap limit always be MAX_LFS_FILESIZE for regular files, which won't
    affect anything else. It wasn't the regular file case I was worried
    about.

    I'd really prefer for maxsize to have been per-inode, but that is not
    how things are today.

    Fixes: be83bbf80682 ("mmap: introduce sane default mmap limits")
    Reported-by: Vasily Gorbik
    Cc: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit be83bbf806822b1b89e0a0f23cd87cddc409e429 upstream.

    The internal VM "mmap()" interfaces are based on the mmap target doing
    everything using page indexes rather than byte offsets, because
    traditionally (ie 32-bit) we had the situation that the byte offset
    didn't fit in a register. So while the mmap virtual address was limited
    by the word size of the architecture, the backing store was not.

    So we're basically passing "pgoff" around as a page index, in order to
    be able to describe backing store locations that are much bigger than
    the word size (think files larger than 4GB etc).

    But while this all makes a ton of sense conceptually, we've been dogged
    by various drivers that don't really understand this, and internally
    work with byte offsets, and then try to work with the page index by
    turning it into a byte offset with "pgoff << PAGE_SHIFT".

    Which obviously can overflow.
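
    A small demonstration of that overflow with a 32-bit byte-offset type
    (illustrative; real drivers differ):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12

    int main(void)
    {
        uint64_t pgoff = 0x100001;              /* page index just past 4GB */

        /* a driver doing its math in 32-bit byte offsets silently wraps */
        uint32_t wrapped = (uint32_t)(pgoff << PAGE_SHIFT);
        uint64_t real    = pgoff << PAGE_SHIFT; /* what the index really means */

        printf("32-bit byte offset: %#x (wrapped)\n", wrapped);
        printf("64-bit byte offset: %#llx\n", (unsigned long long)real);
        return 0;
    }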

    Adding the size of the mapping to it to get the byte offset of the end
    of the backing store just exacerbates the problem, and if you then use
    this overflow-prone value to check various limits of your device driver
    mmap capability, you're just setting yourself up for problems.

    The correct thing for drivers to do is to do their limit math in page
    indices, the way the interface is designed. Because the generic mmap
    code _does_ test that the index doesn't overflow, since that's what the
    mmap code really cares about.

    HOWEVER.

    Finding and fixing various random drivers is a sisyphean task, so let's
    just see if we can just make the core mmap() code do the limiting for
    us. Realistically, the only "big" backing stores we need to care about
    are regular files and block devices, both of which are known to do this
    properly, and which have nice well-defined limits for how much data they
    can access.

    So let's special-case just those two known cases, and then limit other
    random mmap users to a backing store that still fits in "unsigned long".
    Realistically, that's not much of a limit at all on 64-bit, and on
    32-bit architectures the only worry might be the GPU drivers, which can
    have big physical address spaces.

    To make it possible for drivers like that to say that they are 64-bit
    clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
    file flags to allow drivers to mark their file descriptors as safe in
    the full 64-bit mmap address space.

    [ The timing for doing this is less than optimal, and this should really
    go in a merge window. But realistically, this needs wide testing more
    than it needs anything else, and being main-line is the only way to do
    that.

    So the earlier the better, even if it's outside the proper development
    cycle - Linus ]

    Cc: Kees Cook
    Cc: Dan Carpenter
    Cc: Al Viro
    Cc: Willy Tarreau
    Cc: Dave Airlie
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

05 Jun, 2018

2 commits

  • commit 2d077d4b59924acd1f5180c6fb73b57f4771fde6 upstream.

    Swapping load on huge=always tmpfs (with khugepaged tuned up to be very
    eager, but I'm not sure that is relevant) soon hung uninterruptibly,
    waiting for page lock in shmem_getpage_gfp()'s find_lock_entry(), most
    often when "cp -a" was trying to write to a smallish file. Debug showed
    that the page in question was not locked, and page->mapping NULL by now,
    but page->index consistent with having been in a huge page before.

    Reproduced in minutes on a 4.15 kernel, even with 4.17's 605ca5ede764
    ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()") added
    in; but took hours to reproduce on a 4.17 kernel (no idea why).

    The culprit proved to be the __ClearPageDirty() on tails beyond i_size in
    __split_huge_page(): the non-atomic bit operation may have been safe when
    4.8's baa355fd3314 ("thp: file pages support for split_huge_page()")
    introduced it, but liable to erase PageWaiters after 4.10's 62906027091f
    ("mm: add PageWaiters indicating tasks are waiting for a page bit").

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805291841070.3197@eggly.anvils
    Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Nicholas Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 145e1a71e090575c74969e3daa8136d1e5b99fc8 upstream.

    George Boole would have noticed a slight error in 4.16 commit
    69d763fc6d3a ("mm: pin address_space before dereferencing it while
    isolating an LRU page"). Fix it, to match both the comment above it,
    and the original behaviour.

    Although anonymous pages are not marked PageDirty at first, we have an
    old habit of calling SetPageDirty when a page is removed from swap
    cache: so there's a category of ex-swap pages that are easily
    migratable, but were inadvertently excluded from compaction's async
    migration in 4.16.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils
    Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page")
    Signed-off-by: Hugh Dickins
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Reported-by: Ivan Kalvachev
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

30 May, 2018

7 commits

  • [ Upstream commit f0849ac0b8e072073ec5fcc7fadd05a77434364e ]

    For PTE-mapped THP, the compound THP has not been split to normal 4K
    pages yet, so the whole THP is considered referenced if any one of its
    sub pages is referenced.

    When walking a PTE-mapped THP via pvmw, all relevant PTEs will be checked
    to retrieve the referenced bit. But the current code just returns the
    result of the last PTE. If the last PTE has not been referenced, the
    referenced flag will be cleared.

    Just set referenced when ptep{pmdp}_clear_young_notify() returns true.
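
    The shape of the fix, as a self-contained sketch (the young[] array
    stands in for the sub-page PTE walk; this is not the kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
        /* young bits of the sub-page PTEs of one PTE-mapped THP */
        bool young[4] = { true, false, true, false };
        bool last_only = false, any = false;

        for (int i = 0; i < 4; i++) {
            last_only = young[i];   /* buggy: keeps only the last PTE's result */
            if (young[i])
                any = true;         /* fixed: set once any sub page was young */
        }

        printf("buggy=%d fixed=%d\n", last_only, any);
        return 0;
    }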

    Link: http://lkml.kernel.org/r/1518212451-87134-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reported-by: Gang Deng
    Suggested-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     
  • [ Upstream commit e92bb4dd9673945179b1fc738c9817dd91bfb629 ]

    When page_mapping() is called and the mapping is dereferenced in
    page_evictable() through shrink_active_list(), it is possible for the
    inode to be truncated and the embedded address space to be freed at the
    same time. This may lead to the following race.

    CPU1                                                  CPU2

    truncate(inode)                                       shrink_active_list()
      ...                                                   page_evictable(page)
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                           mapping = page_mapping(page);
                page->mapping = NULL;
                ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space

                                                            mapping_unevictable(mapping)
                                                              test_bit(AS_UNEVICTABLE, &mapping->flags);
    - we've dereferenced mapping which is potentially already free.

    A similar race exists between swap cache freeing and page_evictable()
    too.

    The address_space in inode and swap cache will be freed after a RCU
    grace period. So the races are fixed via enclosing the page_mapping()
    and address_space usage in rcu_read_lock/unlock(). Some comments are
    added in code to make it clear what is protected by the RCU read lock.

    Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     
  • [ Upstream commit 77da2ba0648a4fd52e5ff97b8b2b8dd312aec4b0 ]

    This patch fixes a corner case for KSM. When two pages belong or
    belonged to the same transparent hugepage, and they should be merged,
    KSM fails to split the page, and therefore no merging happens.

    This bug can be reproduced by:
    * making sure ksm is running (disabling ksmtuned if needed)
    * enabling transparent hugepages
    * allocating a THP-aligned 1-THP-sized buffer,
      e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
      (see the sketch below)
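
    A minimal userspace sketch of such a reproducer (assuming 2MB THPs on
    amd64; the memset of identical values and MADV_MERGEABLE are the usual
    further steps needed for KSM to consider the pages, added here as
    assumptions rather than quoted from the original list):

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t thp = 1UL << 21;             /* 2MB THP on amd64 */
        void *p;

        if (posix_memalign(&p, thp, thp))   /* THP-aligned, THP-sized buffer */
            return 1;
        madvise(p, thp, MADV_HUGEPAGE);     /* ask for a transparent hugepage */
        memset(p, 42, thp);                 /* identical content on every page */
        madvise(p, thp, MADV_MERGEABLE);    /* hand the range to KSM */
        pause();                            /* give ksmd time to scan */
        return 0;
    }
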
    Co-authored-by: Gerald Schaefer
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Claudio Imbrenda
     
  • [ Upstream commit 1ec6995d1290bfb87cc3a51f0836c889e857cef9 ]

    In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
    released on the error path where pool->compact_wq, which holds the
    return value of create_singlethread_workqueue(), is NULL. This results
    in a memory leak.

    [akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
    Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
    Signed-off-by: Xidong Wang
    Reviewed-by: Andrew Morton
    Cc: Vitaly Wool
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Xidong Wang
     
  • [ Upstream commit a06ad633a37c64a0cd4c229fc605cee8725d376e ]

    Calling swapon() on a zero length swap file on SSD can lead to a
    divide-by-zero.

    Although creating such files isn't possible with mkswap and they would be
    considered invalid, it would be better for the swapon code to be more
    robust and handle this condition gracefully (return -EINVAL).
    Especially since the fix is small and straightforward.

    To help with wear leveling on SSD, the swapon syscall calculates a
    random position in the swap file using modulo p->highest_bit, which is
    set to maxpages - 1 in read_swap_header.

    If the swap file is zero length, read_swap_header sets maxpages=1 and
    last_page=0, resulting in p->highest_bit=0 and we divide-by-zero when we
    modulo p->highest_bit in swapon syscall.

    This can be prevented by having read_swap_header return zero if
    last_page is zero.
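
    The arithmetic in a few lines (variable names mirror the description
    above; not the kernel's exact code):

    #include <stdio.h>

    int main(void)
    {
        unsigned long last_page = 0;             /* zero-length swap file */

        if (!last_page) {                        /* the fix: bail out early */
            fprintf(stderr, "empty swap file -> -EINVAL\n");
            return 1;
        }

        unsigned long maxpages = last_page + 1;
        unsigned long highest_bit = maxpages - 1;   /* 0 without the check */
        printf("random start page: %lu\n", 123456UL % highest_bit);
        return 0;
    }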

    Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
    Signed-off-by: Thomas Abraham
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Tom Abraham
     
  • [ Upstream commit 914b6dfff790544d9b77dfd1723adb3745ec9700 ]

    A crash is observed when kmemleak_scan accesses the object->pointer,
    likely due to the following race.

    TASK A                     TASK B                       TASK C
    kmemleak_write
      (with "scan" and
       NOT "scan=on")
    kmemleak_scan()
                               create_object
                               kmem_cache_alloc fails
                               kmemleak_disable
                               kmemleak_do_cleanup
                               kmemleak_free_enabled = 0
                                                            kfree
                                                            kmemleak_free bails out
                                                              (kmemleak_free_enabled is 0)
                                                            slub frees object->pointer
    update_checksum
      crash - object->pointer
        freed (DEBUG_PAGEALLOC)

    kmemleak_do_cleanup waits for the scan thread to complete, but not for
    direct call to kmemleak_scan via kmemleak_write. So add a wait for
    kmemleak_scan completion before disabling kmemleak_free, and while at it
    fix the comment on stop_scan_thread.

    [vinmenon@codeaurora.org: fix stop_scan_thread comment]
    Link: http://lkml.kernel.org/r/1522219972-22809-1-git-send-email-vinmenon@codeaurora.org
    Link: http://lkml.kernel.org/r/1522063429-18992-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Reviewed-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Vinayak Menon
     
  • [ Upstream commit c7f26ccfb2c31eb1bf810ba13d044fcf583232db ]

    Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
    cause vmstat_update() to report a BUG due to preemption not being
    disabled around smp_processor_id().

    Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.

    BUG: using smp_processor_id() in preemptible [00000000] code:
    kworker/1:1/269
    caller is vmstat_update+0x50/0xa0
    CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
    4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
    Workqueue: mm_percpu_wq vmstat_update
    Call Trace:
    show_stack+0x94/0x128
    dump_stack+0xa4/0xe0
    check_preemption_disabled+0x118/0x120
    vmstat_update+0x50/0xa0
    process_one_work+0x144/0x348
    worker_thread+0x150/0x4b8
    kthread+0x110/0x140
    ret_from_kernel_thread+0x14/0x1c

    Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
    Signed-off-by: Steven J. Hill
    Reviewed-by: Andrew Morton
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Steven J. Hill