17 Jun, 2021

17 commits

  • I see a "virt_to_phys used for non-linear address" warning from
    check_usemap_section_nr() on arm64 platforms.

    In the current implementation of NODE_DATA, if CONFIG_NEED_MULTIPLE_NODES=y,
    pglist_data is dynamically allocated and assigned to node_data[].

    For example, in arch/arm64/include/asm/mmzone.h:

    extern struct pglist_data *node_data[];
    #define NODE_DATA(nid) (node_data[(nid)])

    If CONFIG_NEED_MULTIPLE_NODES=n, pglist_data is defined as a global
    variable named "contig_page_data".

    For example, in include/linux/mmzone.h:

    extern struct pglist_data contig_page_data;
    #define NODE_DATA(nid) (&contig_page_data)

    If CONFIG_DEBUG_VIRTUAL is not enabled, __pa() can handle both
    dynamically allocated linear addresses and symbol addresses. However,
    if (CONFIG_DEBUG_VIRTUAL=y && CONFIG_NEED_MULTIPLE_NODES=n) we can see
    the "virt_to_phys used for non-linear address" warning, because
    &contig_page_data is not a linear address on arm64.

    Warning message:

    virt_to_phys used for non-linear address: (contig_page_data+0x0/0x1c00)
    WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x58/0x68
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Tainted: G W 5.13.0-rc1-00074-g1140ab592e2e #3
    Hardware name: linux,dummy-virt (DT)
    pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
    Call trace:
    __virt_to_phys+0x58/0x68
    check_usemap_section_nr+0x50/0xfc
    sparse_init_nid+0x1ac/0x28c
    sparse_init+0x1c4/0x1e0
    bootmem_init+0x60/0x90
    setup_arch+0x184/0x1f0
    start_kernel+0x78/0x488

    To fix it, create a small function to handle both translations.
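
    A minimal sketch of such a helper (the name and placement are
    illustrative, not necessarily what the final patch uses): use
    __pa_symbol() for the statically defined contig_page_data and plain
    __pa() for dynamically allocated node data.

        /* Sketch only: translate a pglist_data pointer to a physical
         * address, covering both the static and the dynamic case.
         */
        static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat)
        {
        #ifndef CONFIG_NEED_MULTIPLE_NODES
                return __pa_symbol(&contig_page_data);  /* kernel symbol */
        #else
                return __pa(pgdat);                     /* linear address */
        #endif
        }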

    Link: https://lkml.kernel.org/r/1623058729-27264-1-git-send-email-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Cc: Mike Rapoport
    Cc: Baoquan He
    Cc: Kazu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
    When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
    fail, but the first VM_BUG_ON_PAGE() only checks page_mapcount(), so it
    may miss the failure when the head page is unmapped but another subpage
    is still mapped. The second DEBUG_VM BUG(), which checks the total
    mapcount, would then catch it. This may cause some confusion.

    As this is not a fatal issue, consolidate the two DEBUG_VM checks into
    one VM_WARN_ON_ONCE_PAGE().
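
    A sketch of what the consolidated check could look like in unmap_page()
    (exact placement in mm/huge_memory.c is an assumption here):

        /* Sketch: one warn-once replaces the two DEBUG_VM-only assertions;
         * it fires if the head or any subpage is still mapped afterwards.
         */
        try_to_unmap(page, ttu_flags);
        VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);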

    [1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

    Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
    Signed-off-by: Yang Shi
    Reviewed-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Hugh Dickins
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • There is a race between THP unmapping and truncation, when truncate sees
    pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
    it, but before its page_remove_rmap() gets to decrement
    compound_mapcount: generating false "BUG: Bad page cache" reports that
    the page is still mapped when deleted. This commit fixes that, but not
    in the way I hoped.

    The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
    instead of unmap_mapping_range() in truncate_cleanup_page(): it has
    often been an annoyance that we usually call unmap_mapping_range() with
    no pages locked, but there apply it to a single locked page.
    try_to_unmap() looks more suitable for a single locked page.

    However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
    it is used to insert THP migration entries, but not used to unmap THPs.
    Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
    needs are different, I'm too ignorant of the DAX cases, and couldn't
    decide how far to go for anon+swap. Set that aside.

    The second attempt took a different tack: make no change in truncate.c,
    but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
    clearing it initially, then pmd_clear() between page_remove_rmap() and
    unlocking at the end. Nice. But powerpc blows that approach out of the
    water, with its serialize_against_pte_lookup(), and interesting pgtable
    usage. It would need serious help to get working on powerpc (with a
    minor optimization issue on s390 too). Set that aside.

    Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
    delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
    that's likely to reduce or eliminate the number of incidents, it would
    give less assurance of whether we had identified the problem correctly.

    This successful iteration introduces "unmap_mapping_page(page)" instead
    of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
    with an addition to details. Then zap_pmd_range() watches for this
    case, and does spin_unlock(pmd_lock) if so - just like
    page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but
    safe.

    Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
    assert its interface; but currently that's only used to make sure that
    page->mapping is stable, and zap_pmd_range() doesn't care if the page is
    locked or not. Along these lines, in invalidate_inode_pages2_range()
    move the initial unmap_mapping_range() out from under page lock, before
    then calling unmap_mapping_page() under page lock if still mapped.
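
    A sketch of the shape unmap_mapping_page() might take (the single_page
    field is an assumed addition to zap_details for illustration; details
    may differ from the actual patch):

        void unmap_mapping_page(struct page *page)
        {
                struct address_space *mapping = page->mapping;
                struct zap_details details = { };

                VM_BUG_ON(!PageLocked(page));   /* keeps page->mapping stable */
                VM_BUG_ON(PageTail(page));

                details.check_mapping = mapping;
                details.first_index = page->index;
                details.last_index = page->index + thp_nr_pages(page) - 1;
                details.single_page = page;     /* assumed new field */

                i_mmap_lock_write(mapping);
                if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
                        unmap_mapping_range_tree(&mapping->i_mmap, &details);
                i_mmap_unlock_write(mapping);
        }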

    Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
    Fixes: fc127da085c2 ("truncate: handle file thp")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Yang Shi
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Anon THP tails were already supported, but memory-failure may need to
    use page_address_in_vma() on file THP tails, which its page->mapping
    check did not permit: fix it.

    hughd adds: no current usage is known to hit the issue, but this does
    fix a subtle trap in a general helper: best fixed in stable sooner than
    later.
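
    The shape of the fix is presumably to compare against the head page's
    mapping in page_address_in_vma(), roughly as sketched here:

        /* Sketch: a THP tail does not carry the file's mapping in its own
         * page->mapping, so the comparison must use the head page.
         */
        if (!vma->vm_file ||
            vma->vm_file->f_mapping != compound_head(page)->mapping)
                return -EFAULT;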

    Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Jue Wang
    Signed-off-by: Hugh Dickins
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jue Wang
     
  • Running certain tests with a DEBUG_VM kernel would crash within hours,
    on the total_mapcount BUG() in split_huge_page_to_list(), while trying
    to free up some memory by punching a hole in a shmem huge page: split's
    try_to_unmap() was unable to find all the mappings of the page (which,
    on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

    When that BUG() was changed to a WARN(), it would later crash on the
    VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
    mm/internal.h:vma_address(), used by rmap_walk_file() for
    try_to_unmap().

    vma_address() is usually correct, but there's a wraparound case when the
    vm_start address is unusually low, but vm_pgoff not so low:
    vma_address() chooses max(start, vma->vm_start), but that decides on the
    wrong address, because start has become almost ULONG_MAX.

    Rewrite vma_address() to be more careful about vm_pgoff; move the
    VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
    be safely used from page_mapped_in_vma() and page_address_in_vma() too.

    Add vma_address_end() to apply similar care to end address calculation,
    in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
    though it raises a question of whether callers would do better to supply
    pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.

    An irritation is that their apparent generality breaks down on KSM
    pages, which cannot be located by the page->index that page_to_pgoff()
    uses: as commit 4b0ece6fa016 ("mm: migrate: fix remove_migration_pte()
    for ksm pages") once discovered. I dithered over the best thing to do
    about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both
    vma_address() and vma_address_end(); though the only place in danger of
    using it on them was try_to_unmap_one().

    Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
    head page, instead of thp_size(): to make the right calculation on a
    hugetlbfs page, whether or not THPs are configured. try_to_unmap() is
    used on hugetlbfs pages, but perhaps the wrong calculation never
    mattered.
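
    A sketch of the more careful calculation (close to, but not necessarily
    identical with, the final code):

        /* Sketch: return the virtual address of this page in the vma, or
         * -EFAULT if the page (including its compound tails) does not
         * overlap the vma at all.
         */
        static inline unsigned long
        vma_address(struct page *page, struct vm_area_struct *vma)
        {
                pgoff_t pgoff;
                unsigned long address;

                VM_BUG_ON_PAGE(PageKsm(page), page); /* KSM page->index unusable */
                pgoff = page_to_pgoff(page);
                if (pgoff >= vma->vm_pgoff) {
                        address = vma->vm_start +
                                ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
                        /* Check for address beyond vma (or wrapped through 0?) */
                        if (address < vma->vm_start || address >= vma->vm_end)
                                address = -EFAULT;
                } else if (PageHead(page) &&
                           pgoff + compound_nr(page) - 1 >= vma->vm_pgoff) {
                        /* Test above avoids possibility of wrap to 0 on 32-bit */
                        address = vma->vm_start;
                } else {
                        address = -EFAULT;
                }
                return address;
        }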

    Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
    Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
    (!unmap_success): with dump_page() showing mapcount:1, but then its raw
    struct page output showing _mapcount ffffffff i.e. mapcount 0.

    And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
    it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
    and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
    all indicative of some mapcount difficulty in development here perhaps.
    But the !CONFIG_DEBUG_VM path handles the failures correctly and
    silently.

    I believe the problem is that once a racing unmap has cleared pte or
    pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
    from try_to_unmap() before the racing task has reached decrementing
    mapcount.

    Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
    follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
    TTU_SYNC to the options, and passing that from unmap_page().

    When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
    for both: the slight overhead added should rarely matter, except perhaps
    if splitting sparsely-populated multiply-mapped shmem. Once confident
    that bugs are fixed, TTU_SYNC here can be removed, and the race
    tolerated.
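
    A sketch of how the new flag might be threaded through (names follow
    the changelog; the exact surrounding code is illustrative):

        /* In unmap_page(): ask for a synchronous walk while splitting. */
        ttu_flags |= TTU_SYNC;

        /* In try_to_unmap_one(): turn that into a PVMW_SYNC page table
         * walk, so a racing zap is waited for instead of being skipped.
         */
        if (flags & TTU_SYNC)
                pvmw.flags = PVMW_SYNC;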

    Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
    Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
    Signed-off-by: Hugh Dickins
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: Kirill A. Shutemov
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Most callers of is_huge_zero_pmd() supply a pmd already verified
    present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
    migration entry, in which the pfn is encoded differently from a present
    pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
    since L1TF forced us to protect against that); or perhaps even crash in
    pmd_page() applied to a swap-like entry.

    Make it safe by adding pmd_present() check into is_huge_zero_pmd()
    itself; and make it quicker by saving huge_zero_pfn, so that
    is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.

    __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
    but is unnecessary now that is_huge_zero_pmd() checks present.
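
    A sketch of the hardened helper, assuming a cached huge_zero_pfn (the
    exact ordering of the checks is illustrative):

        /* Sketch: never call pmd_page() here; compare against a cached pfn
         * and insist on pmd_present(), so migration entries cannot match.
         */
        static inline bool is_huge_zero_pmd(pmd_t pmd)
        {
                return pmd_present(pmd) &&
                       READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
        }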

    Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
    Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Yang Shi
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.

    Here is v2 batch of long-standing THP bug fixes that I had not got
    around to sending before, but prompted now by Wang Yugui's report
    https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

    Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
    they have done no harm, but have *not* fixed that issue: something more
    is needed and I have no idea of what.

    This patch (of 7):

    Stressing huge tmpfs page migration racing hole punch often crashed on
    the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
    kernel; or shortly afterwards, on a bad dereference in
    __split_huge_pmd_locked() when DEBUG_VM=n. They forgot to allow for pmd
    migration entries in the non-anonymous case.

    Full disclosure: those particular experiments were on a kernel with more
    relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
    vanilla kernel: it is conceivable that stricter locking happens to avoid
    those cases, or makes them less likely; but __split_huge_pmd_locked()
    already allowed for pmd migration entries when handling anonymous THPs,
    so this commit brings the shmem and file THP handling into line.

    And while there: use old_pmd rather than _pmd, as in the following
    blocks; and make it clearer to the eye that the !vma_is_anonymous()
    block is self-contained, making an early return after accounting for
    unmapping.

    Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
    Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
    Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
    Signed-off-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Cc: Wang Yugui
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Naoya Horiguchi
    Cc: Alistair Popple
    Cc: Ralph Campbell
    Cc: Zi Yan
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Jue Wang
    Cc: Peter Xu
    Cc: Jan Kara
    Cc: Shakeel Butt
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    We noticed that a hung task can happen in a corner-case but practical
    scenario when CONFIG_PREEMPT_NONE is enabled, as follows.

    Process 0                       Process 1                      Process 2..Inf
    split_huge_page_to_list
        unmap_page
            split_huge_pmd_address
                                    __migration_entry_wait(head)
                                                                   __migration_entry_wait(tail)
        remap_page (roll back)
            remove_migration_ptes
                rmap_walk_anon
                    cond_resched

    Here __migration_entry_wait(tail) occurs in kernel space, e.g., in
    copy_to_user in fstat, which will immediately fault again without
    rescheduling, and thus fully occupy the CPU.

    When there are too many processes performing __migration_entry_wait on
    the tail page, remap_page will never be done after cond_resched.

    Make __migration_entry_wait operate on the compound head page, so that
    it waits for remap_page to complete, whether the THP is split
    successfully or rolled back.

    Note that put_and_wait_on_page_locked helps to drop the page reference
    acquired with get_page_unless_zero, as soon as the page is on the wait
    queue, before actually waiting. So splitting the THP is only prevented
    for a brief interval.
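
    The core of the change is presumably a one-liner in
    __migration_entry_wait(), sketched here:

        /* Sketch: operate on the compound head, whose page lock is held
         * across the split or rollback, so the waiter sleeps until
         * remap_page() has completed instead of spinning on the tail.
         */
        page = migration_entry_to_page(entry);
        page = compound_head(page);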

    Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com
    Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
    Suggested-by: Hugh Dickins
    Signed-off-by: Gang Deng
    Signed-off-by: Xu Yu
    Acked-by: Kirill A. Shutemov
    Acked-by: Hugh Dickins
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xu Yu
     
  • Fixes build with CONFIG_SLAB_FREELIST_HARDENED=y.

    Hopefully. But it's the right thing to do anyway.

    Fixes: 1ad53d9fa3f61 ("slub: improve bit diffusion for freelist ptr obfuscation")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=213417
    Reported-by:
    Acked-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    Our syzkaller triggered the "BUG_ON(!list_empty(&inode->i_wb_list))" in
    clear_inode():

    kernel BUG at fs/inode.c:519!
    Internal error: Oops - BUG: 0 [#1] SMP
    Modules linked in:
    Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7)
    CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95
    Hardware name: linux,dummy-virt (DT)
    pstate: 80000005 (Nzcv daif -PAN -UAO)
    pc : clear_inode+0x280/0x2a8
    lr : clear_inode+0x280/0x2a8
    Call trace:
    clear_inode+0x280/0x2a8
    ext4_clear_inode+0x38/0xe8
    ext4_free_inode+0x130/0xc68
    ext4_evict_inode+0xb20/0xcb8
    evict+0x1a8/0x3c0
    iput+0x344/0x460
    do_unlinkat+0x260/0x410
    __arm64_sys_unlinkat+0x6c/0xc0
    el0_svc_common+0xdc/0x3b0
    el0_svc_handler+0xf8/0x160
    el0_svc+0x10/0x218
    Kernel panic - not syncing: Fatal exception

    A crash dump of this problem shows that someone called __munlock_pagevec
    to clear the page's LRU flag without lock_page: do_mmap -> mmap_region
    -> do_munmap -> munlock_vma_pages_range -> __munlock_pagevec.

    As a result, memory_failure will call identify_page_state without
    wait_on_page_writeback. And after truncate_error_page clears the mapping
    of this page, end_page_writeback won't call sb_clear_inode_writeback to
    clear inode->i_wb_list. That will trigger the BUG_ON in clear_inode!

    Fix it by also checking PageWriteback to help determine whether we
    should skip wait_on_page_writeback.
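
    A sketch of where the extra check would go in memory_failure(); the
    surrounding shortcut is reconstructed from memory and should be treated
    as an assumption:

        /* Sketch: only take the shortcut past the LRU-based handling (and
         * its wait_on_page_writeback()) when the page is not under
         * writeback.
         */
        if (!PageTransTail(p) && !PageLRU(p) && !PageWriteback(p))
                goto identify_page_state;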

    Link: https://lkml.kernel.org/r/20210604084705.3729204-1-yangerkun@huawei.com
    Fixes: 0bc1f8b0682c ("hwpoison: fix the handling path of the victimized page frame that belong to non-LRU")
    Signed-off-by: yangerkun
    Acked-by: Naoya Horiguchi
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Oscar Salvador
    Cc: Yu Kuai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yangerkun
     
  • The routine restore_reserve_on_error is called to restore reservation
    information when an error occurs after page allocation. The routine
    alloc_huge_page modifies the mapping reserve map and potentially the
    reserve count during allocation. If code calling alloc_huge_page
    encounters an error after allocation and needs to free the page, the
    reservation information needs to be adjusted.

    Currently, restore_reserve_on_error only takes action on pages for which
    the reserve count was adjusted (HPageRestoreReserve flag). There is
    nothing wrong with these adjustments. However, alloc_huge_page ALWAYS
    modifies the reserve map during allocation even if the reserve count is
    not adjusted. This can cause issues as observed during development of
    this patch [1].

    One specific series of operations causing an issue is:

    - Create a shared hugetlb mapping
    Reservations for all pages created by default

    - Fault in a page in the mapping
    Reservation exists so reservation count is decremented

    - Punch a hole in the file/mapping at index previously faulted
    Reservation and any associated pages will be removed

    - Allocate a page to fill the hole
    No reservation entry, so reserve count unmodified
    Reservation entry added to map by alloc_huge_page

    - Error after allocation and before instantiating the page
    Reservation entry remains in map

    - Allocate a page to fill the hole
    Reservation entry exists, so decrement reservation count

    This will cause a reservation count underflow as the reservation count
    was decremented twice for the same index.

    A user would observe a very large number for HugePages_Rsvd in
    /proc/meminfo. This would also likely cause subsequent allocations of
    hugetlb pages to fail as it would 'appear' that all pages are reserved.

    This sequence of operations is unlikely to happen; however, it was
    easily reproduced and observed using hacked-up code as described in [1].

    Address the issue by having the routine restore_reserve_on_error take
    action on pages where HPageRestoreReserve is not set. In this case, we
    need to remove any reserve map entry created by alloc_huge_page. A new
    helper routine vma_del_reservation assists with this operation.

    There are three callers of alloc_huge_page which do not currently call
    restore_reserve_on_error before freeing a page on error paths. Add
    those missing calls.

    [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/

    Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
    Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Mina Almasry
    Cc: Axel Rasmussen
    Cc: Peter Xu
    Cc: Muchun Song
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • It turns out that SLUB redzoning ("slub_debug=Z") checks from
    s->object_size rather than from s->inuse (which is normally bumped to
    make room for the freelist pointer), so a cache created with an object
    size less than 24 would have the freelist pointer written beyond
    s->object_size, causing the redzone to be corrupted by the freelist
    pointer. This was very visible with "slub_debug=ZF":

    BUG test (Tainted: G B ): Right Redzone overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
    INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
    INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620

    Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
    Object (____ptrval____): 00 00 00 00 00 f6 f4 a5 ........
    Redzone (____ptrval____): 40 1d e8 1a aa @....
    Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........

    Adjust the offset to stay within s->object_size.

    (Note that no caches in this size range are known to exist in the
    kernel currently.)

    Link: https://lkml.kernel.org/r/20210608183955.280836-4-keescook@chromium.org
    Link: https://lore.kernel.org/linux-mm/20200807160627.GA1420741@elver.google.com/
    Link: https://lore.kernel.org/lkml/0f7dd7b2-7496-5e2d-9488-2ec9f8e90441@suse.cz/
    Fixes: 89b83f282d8b ("slub: avoid redzone when choosing freepointer location")
    Link: https://lore.kernel.org/lkml/CANpmjNOwZ5VpKQn+SYWovTkFB4VsT-RPwyENBmaK0dLcpqStkA@mail.gmail.com
    Signed-off-by: Kees Cook
    Reported-by: Marco Elver
    Reported-by: "Lin, Zhenpeng"
    Tested-by: Marco Elver
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Roman Gushchin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The redzone area for SLUB exists between s->object_size and s->inuse
    (which is at least the word-aligned object_size). If a cache were
    created with an object_size smaller than sizeof(void *), the in-object
    stored freelist pointer would overwrite the redzone (e.g. with boot
    param "slub_debug=ZF"):

    BUG test (Tainted: G B ): Right Redzone overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
    INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
    INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620

    Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
    Object (____ptrval____): f6 f4 a5 40 1d e8 ...@..
    Redzone (____ptrval____): 1a aa ..
    Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........

    Store the freelist pointer out of line when object_size is smaller than
    sizeof(void *) and redzoning is enabled.

    Additionally remove the "smaller than sizeof(void *)" check under
    CONFIG_DEBUG_VM in kmem_cache_sanity_check() as it is now redundant:
    SLAB and SLOB both handle small sizes.

    (Note that no caches within this size range are known to exist in the
    kernel currently.)
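
    A sketch of the adjusted condition in calculate_sizes(); the surrounding
    code is paraphrased and the exact shape of the final patch may differ:

        /* Sketch: also move the freelist pointer out of line when the
         * object is too small to hold it without clobbering the redzone.
         */
        if (((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
             ((flags & SLAB_RED_ZONE) && s->object_size < sizeof(void *)) ||
             s->ctor)) {
                s->offset = size;       /* freelist pointer after the object */
                size += sizeof(void *);
        }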

    Link: https://lkml.kernel.org/r/20210608183955.280836-3-keescook@chromium.org
    Fixes: 81819f0fc828 ("SLUB core")
    Signed-off-by: Kees Cook
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: "Lin, Zhenpeng"
    Cc: Marco Elver
    Cc: Pekka Enberg
    Cc: Roman Gushchin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "Actually fix freelist pointer vs redzoning", v4.

    This fixes redzoning vs the freelist pointer (both for middle-position
    and very small caches). Both are "theoretical" fixes, in that I see no
    evidence of such small-sized caches actually being used in the kernel, but
    that's no reason to let the bugs continue to exist, especially since
    people doing local development keep tripping over it. :)

    This patch (of 3):

    Instead of repeating "Redzone" and "Poison", clarify which sides of
    those zones got tripped. Additionally fix column alignment in the
    trailer.

    Before:

    BUG test (Tainted: G B ): Redzone overwritten
    ...
    Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
    Object (____ptrval____): f6 f4 a5 40 1d e8 ...@..
    Redzone (____ptrval____): 1a aa ..
    Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........

    After:

    BUG test (Tainted: G B ): Right Redzone overwritten
    ...
    Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
    Object (____ptrval____): f6 f4 a5 40 1d e8 ...@..
    Redzone (____ptrval____): 1a aa ..
    Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........

    The earlier commits that slowly resulted in the "Before" reporting were:

    d86bd1bece6f ("mm/slub: support left redzone")
    ffc79d288000 ("slub: use print_hex_dump")
    2492268472e7 ("SLUB: change error reporting format to follow lockdep loosely")

    Link: https://lkml.kernel.org/r/20210608183955.280836-1-keescook@chromium.org
    Link: https://lkml.kernel.org/r/20210608183955.280836-2-keescook@chromium.org
    Link: https://lore.kernel.org/lkml/cfdb11d7-fb8e-e578-c939-f7f5fb69a6bd@suse.cz/
    Signed-off-by: Kees Cook
    Acked-by: Vlastimil Babka
    Cc: Marco Elver
    Cc: "Lin, Zhenpeng"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Roman Gushchin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
    I found this by pure code review: pte_same_as_swp(), used by unuse_vma(),
    didn't take the uffd-wp bit into account when comparing ptes.
    pte_same_as_swp() returning a false negative could cause failure to
    swapoff swap ptes that were wr-protected by userfaultfd.
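
    A sketch of the fix, assuming the comparison strips both the soft-dirty
    and the uffd-wp swap pte bits before pte_same():

        /* Sketch: clear auxiliary swap-pte bits before comparing. */
        static inline pte_t pte_swp_clear_flags(pte_t pte)
        {
                if (pte_swp_soft_dirty(pte))
                        pte = pte_swp_clear_soft_dirty(pte);
                if (pte_swp_uffd_wp(pte))
                        pte = pte_swp_clear_uffd_wp(pte);
                return pte;
        }

        static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
        {
                return pte_same(pte_swp_clear_flags(pte), swp_pte);
        }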

    Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com
    Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
    Signed-off-by: Peter Xu
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: [5.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
    When a hugetlb page fault (under an overcommitting situation) races with
    memory_failure(), VM_BUG_ON_PAGE() is triggered as follows:

    CPU0:                             CPU1:

                                      gather_surplus_pages()
                                        page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                        zero = put_page_testzero(page)
                                        VM_BUG_ON_PAGE(!zero, page)
                                        enqueue_huge_page(h, page)
    put_page(page)

    __get_hwpoison_page() only checks the page refcount before taking an
    additional one for memory error handling, which is not enough because
    there's a time window where compound pages have non-zero refcount during
    hugetlb page initialization.

    So make __get_hwpoison_page() check the page status a bit more for
    hugetlb pages, with get_hwpoison_huge_page(). Checking hugetlb-specific
    flags under hugetlb_lock makes sure that the hugetlb page is not in a
    transient state. It's notable that another new function,
    HWPoisonHandlable(), is helpful to prevent a race against other
    transient page states (like a generic compound page just before PageHuge
    becomes true).

    Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
    Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Muchun Song
    Acked-by: Mike Kravetz
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Tony Luck
    Cc: [5.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2021

7 commits

  • The userfaultfd hugetlb tests cause a resv_huge_pages underflow. This
    happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
    an index for which we already have a page in the cache. When this
    happens, we allocate a second page, double consuming the reservation,
    and then fail to insert the page into the cache and return -EEXIST.

    To fix this, we first check if there is a page in the cache which
    already consumed the reservation, and return -EEXIST immediately if so.
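
    A sketch of that early check in hugetlb_mcopy_atomic_pte(); the helper
    and variable names are assumptions for illustration:

        /* Sketch: a page already in the cache for a shared mapping consumed
         * the reservation when it was inserted; allocating again would
         * consume it twice, so report -EEXIST instead.
         */
        if (vm_shared &&
            hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
                ret = -EEXIST;
                goto out;
        }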

    There is still a rare condition where we fail to copy the page contents
    AND race with a call to hugetlb_no_page() for this index, and again we
    will underflow resv_huge_pages. That is fixed in a more complicated
    patch not targeted for -stable.

    Test:

    Hacked the code locally such that resv_huge_pages underflows produce a
    warning, then:

    ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
    ./tools/testing/selftests/vm/userfaultfd hugetlb 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

    Both tests succeed and produce no warnings. After the test runs number
    of free/resv hugepages is correct.

    [mike.kravetz@oracle.com: changelog fixes]

    Link: https://lkml.kernel.org/r/20210528004649.85298-1-almasrymina@google.com
    Fixes: 8fb5debc5fcd ("userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Mina Almasry
    Reviewed-by: Mike Kravetz
    Cc: Axel Rasmussen
    Cc: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Fix gcc W=1 warning:

    mm/kasan/init.c:228: warning: Function parameter or member 'shadow_start' not described in 'kasan_populate_early_shadow'
    mm/kasan/init.c:228: warning: Function parameter or member 'shadow_end' not described in 'kasan_populate_early_shadow'

    Link: https://lkml.kernel.org/r/20210603140700.3045298-1-yukuai3@huawei.com
    Signed-off-by: Yu Kuai
    Acked-by: Andrey Ryabinin
    Cc: Zhang Yi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Kuai
     
  • When memory_failure() or soft_offline_page() is called on a tail page of
    some hugetlb page, "BUG: unable to handle page fault" error can be
    triggered.

    remove_hugetlb_page() dereferences page->lru, so it's assumed that the
    page points to a head page; but one of the callers,
    dissolve_free_huge_page(), provides remove_hugetlb_page() with a 'page'
    which could be a tail page. So pass 'head' to it instead.
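
    The fix is presumably a one-line change in dissolve_free_huge_page(),
    sketched here (the argument values are illustrative):

        /* Sketch: 'page' may be a tail; page->lru is only meaningful on the
         * head page, so hand the head to remove_hugetlb_page().
         */
        struct page *head = compound_head(page);

        remove_hugetlb_page(h, head, false);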

    Link: https://lkml.kernel.org/r/20210526235257.2769473-1-nao.horiguchi@gmail.com
    Fixes: 6eb4e88a6d27 ("hugetlb: create remove_hugetlb_page() to separate functionality")
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Reviewed-by: Muchun Song
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Miaohe Lin
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Recently we found that a lot of MemFree is left in /proc/meminfo after
    doing a lot of page soft offlining, which is not quite correct.

    Before Oscar's rework of soft offline for free pages [1], if we soft
    offline free pages, these pages are left in the buddy allocator with the
    HWPoison flag set, and NR_FREE_PAGES is not updated immediately. So the
    difference between NR_FREE_PAGES and the real number of available free
    pages can also be big at the beginning.

    However, with the workload running, when we subsequently catch a
    HWPoison page in any of the alloc functions, we remove it from buddy,
    update NR_FREE_PAGES and try again, so NR_FREE_PAGES gets closer and
    closer to the real number of available free pages. (regardless of
    unpoison_memory())

    Now, when offlining free pages, after a successful call to
    take_page_off_buddy(), the page no longer belongs to the buddy allocator
    and will not be used any more; but we missed accounting NR_FREE_PAGES in
    this situation, and there is no chance for it to be updated later.

    Do the update in take_page_off_buddy() like rmqueue() does, but avoid
    double counting if someone already called set_migratetype_isolate() on
    the page.
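
    A sketch of the accounting update in take_page_off_buddy(); the exact
    code is an assumption modelled on the description above:

        /* Sketch: after the page is taken off the free lists, subtract it
         * from the free page counters, unless the pageblock is isolated, in
         * which case the counters were already adjusted at isolation time.
         */
        if (!is_migrate_isolate(migratetype))
                __mod_zone_freepage_state(zone, -1, migratetype);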

    [1]: commit 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")

    Link: https://lkml.kernel.org/r/20210526075247.11130-1-dinghui@sangfor.com.cn
    Fixes: 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")
    Signed-off-by: Ding Hui
    Suggested-by: Naoya Horiguchi
    Reviewed-by: Oscar Salvador
    Acked-by: David Hildenbrand
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ding Hui
     
  • In pmd/pud_advanced_tests(), the vaddr is aligned up to the next pmd/pud
    entry, and so it does not match the given pmdp/pudp and (aligned down)
    pfn any more.

    For s390, this results in memory corruption, because the IDTE
    instruction used e.g. in xxx_get_and_clear() will take the vaddr for
    some calculations, in combination with the given pmdp. It will then end
    up with a wrong table origin, ending on ...ff8, and some of those
    wrongly set low-order bits will also select a wrong pagetable level for
    the index addition. IDTE could therefore invalidate (or 0x20) something
    outside of the page tables, depending on the wrongly picked index, which
    in turn depends on the random vaddr.

    As a result, we sometimes see "BUG task_struct (Not tainted): Padding
    overwritten" on s390, where one 0x5a padding value got overwritten with
    0x7a.

    Fix this by aligning down, similar to how the pmd/pud_aligned pfns are
    calculated.

    Link: https://lkml.kernel.org/r/20210525130043.186290-2-gerald.schaefer@linux.ibm.com
    Fixes: a5c3b9ffb0f40 ("mm/debug_vm_pgtable: add tests validating advanced arch page table helpers")
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Anshuman Khandual
    Cc: Vineet Gupta
    Cc: Palmer Dabbelt
    Cc: Paul Walmsley
    Cc: [5.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Since wait_event() uses TASK_UNINTERRUPTIBLE by default, waiting for an
    allocation counts towards load. However, for KFENCE, this does not make
    any sense, since there is no busy work we're awaiting.

    Instead, use TASK_IDLE via wait_event_idle() to not count towards load.

    BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1185565
    Link: https://lkml.kernel.org/r/20210521083209.3740269-1-elver@google.com
    Fixes: 407f1d8c1b5f ("kfence: await for allocation using wait_event")
    Signed-off-by: Marco Elver
    Cc: Mel Gorman
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: David Laight
    Cc: Hillf Danton
    Cc: [5.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     
  • This reverts commit f685a533a7fab35c5d069dcd663f59c8e4171a75.

    The MIPS cache flush logic needs to know whether the mapping was already
    established to decide how to flush caches. This is done by checking the
    valid bit in the PTE. The commit above breaks this logic by setting the
    valid in the PTE in new mappings, which causes kernel crashes.

    Link: https://lkml.kernel.org/r/20210526094335.92948-1-tsbogend@alpha.franken.de
    Fixes: f685a533a7f ("MIPS: make userspace mapping young by default")
    Reported-by: Zhou Yanjie
    Signed-off-by: Thomas Bogendoerfer
    Cc: Huang Pei
    Cc: Nicholas Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Bogendoerfer
     

23 May, 2021

4 commits

  • In commit d6995da31122 ("hugetlb: use page.private for hugetlb specific
    page flags") the use of PagePrivate to indicate a reservation count
    should be restored at free time was changed to the hugetlb specific flag
    HPageRestoreReserve. Changes to a userfaultfd error path as well as a
    VM_BUG_ON() in remove_inode_hugepages() were overlooked.

    Users could see incorrect hugetlb reserve counts if they experience an
    error with a UFFDIO_COPY operation. Specifically, this would be the
    result of an unlikely copy_huge_page_from_user error. There is not an
    increased chance of hitting the VM_BUG_ON.

    Link: https://lkml.kernel.org/r/20210521233952.236434-1-mike.kravetz@oracle.com
    Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Mina Almasry
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: David Hildenbrand
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Mina Almasry
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • With CONFIG_DEBUG_PAGEALLOC enabled, the kernel should also untag the
    object pointer, as done in get_freepointer().

    Failing to do so reportedly leads to SLUB freelist corruptions that
    manifest as boot-time crashes.

    Link: https://lkml.kernel.org/r/20210514072228.534418-1-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Marco Elver
    Cc: Vincenzo Frascino
    Cc: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Elliot Berman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • While reviewing [1] I came across commit d3378e86d182 ("mm/gup: check
    page posion status for coredump.") and noticed that this patch is broken
    in two ways. First it doesn't really prevent hwpoison pages from being
    dumped because hwpoison pages can be marked asynchronously at any time
    after the check. Secondly, and more importantly, the patch introduces a
    ref count leak because get_dump_page takes a reference on the page which
    is not released.

    It also seems that the patch was merged incorrectly, because there were
    follow-up changes not included, as well as discussions on how to address
    the underlying problem [2].

    Therefore revert the original patch.

    Link: http://lkml.kernel.org/r/20210429122519.15183-4-david@redhat.com [1]
    Link: http://lkml.kernel.org/r/57ac524c-b49a-99ec-c1e4-ef5027bfb61b@redhat.com [2]
    Link: https://lkml.kernel.org/r/20210505135407.31590-1-mhocko@kernel.org
    Fixes: d3378e86d182 ("mm/gup: check page posion status for coredump.")
    Signed-off-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Aili Yao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • clang sometimes decides not to inline shuffle_zone(), but it calls a
    __meminit function. Without the extra __meminit annotation we get this
    warning:

    WARNING: modpost: vmlinux.o(.text+0x2a86d4): Section mismatch in reference from the function shuffle_zone() to the function .meminit.text:__shuffle_zone()
    The function shuffle_zone() references
    the function __meminit __shuffle_zone().
    This is often because shuffle_zone lacks a __meminit
    annotation or the annotation of __shuffle_zone is wrong.

    shuffle_free_memory() did not show the same problem in my tests, but it
    could happen in theory as well, so mark both as __meminit.

    Link: https://lkml.kernel.org/r/20210514135952.2928094-1-arnd@kernel.org
    Signed-off-by: Arnd Bergmann
    Reviewed-by: David Hildenbrand
    Reviewed-by: Nathan Chancellor
    Cc: Nick Desaulniers
    Cc: Arnd Bergmann
    Cc: Wei Yang
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

15 May, 2021

6 commits

    iomap_max_page_shift is expected to contain a page shift, so it can't be
    a 'bool'; it has to be an 'unsigned int'.

    And fix the default value: P4D_SHIFT is the value for when huge iomap is
    allowed.

    However, on some architectures (eg: powerpc book3s/64), P4D_SHIFT is not
    a constant, so it can't be used to initialise a static variable. So,
    initialise iomap_max_page_shift with the maximum shift supported by the
    architecture; it is gated by P4D_SHIFT in vmap_try_huge_p4d() anyway.

    Link: https://lkml.kernel.org/r/ad2d366015794a9f21320dcbdd0a8eb98979e9df.1620898113.git.christophe.leroy@csgroup.eu
    Fixes: bbc180a5adb0 ("mm: HUGE_VMAP arch support cleanup")
    Signed-off-by: Christophe Leroy
    Reviewed-by: Nicholas Piggin
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christophe Leroy
     
  • This reverts commit 3e96b6a2e9ad929a3230a22f4d64a74671a0720b. General
    Protection Fault in rmap_walk_ksm() under memory pressure:
    remove_rmap_item_from_tree() needs to take page lock, of course.

    Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2105092253500.1127@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Miaohe Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Consider the following sequence of events:

    1. Userspace issues a UFFD ioctl, which ends up calling into
    shmem_mfill_atomic_pte(). We successfully account the blocks, we
    shmem_alloc_page(), but then the copy_from_user() fails. We return
    -ENOENT. We don't release the page we allocated.
    2. Our caller detects this error code, tries the copy_from_user() after
    dropping the mmap_lock, and retries, calling back into
    shmem_mfill_atomic_pte().
    3. Meanwhile, let's say another process filled up the tmpfs being used.
    4. So shmem_mfill_atomic_pte() fails to account blocks this time, and
    immediately returns - without releasing the page.

    This triggers a BUG_ON in our caller, which asserts that the page
    should always be consumed, unless -ENOENT is returned.

    To fix this, detect if we have such a "dangling" page when accounting
    fails, and if so, release it before returning.
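
    A sketch of the fix in shmem_mfill_atomic_pte(), assuming
    shmem_inode_acct_block() is the accounting step that fails on the retry:

        /* Sketch: on the retry path *pagep carries the previously allocated
         * page; if block accounting now fails, release it rather than
         * returning with it still held (which trips the caller's BUG_ON).
         */
        if (!shmem_inode_acct_block(inode, 1)) {
                if (unlikely(*pagep)) {
                        put_page(*pagep);
                        *pagep = NULL;
                }
                return -ENOMEM;
        }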

    Link: https://lkml.kernel.org/r/20210428230858.348400-1-axelrasmussen@google.com
    Fixes: cb658a453b93 ("userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY")
    Signed-off-by: Axel Rasmussen
    Reported-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Reviewed-by: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • Paul E. McKenney reported [1] that commit 1f0723a4c0df ("mm, slub: enable
    slub_debug static key when creating cache with explicit debug flags")
    results in the lockdep complaint:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.12.0+ #15 Not tainted
    ------------------------------------------------------
    rcu_torture_sta/109 is trying to acquire lock:
    ffffffff96063cd0 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0x9/0x20

    but task is already holding lock:
    ffffffff96173c28 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_create_usercopy+0x2d/0x250

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (slab_mutex){+.+.}-{3:3}:
    lock_acquire+0xb9/0x3a0
    __mutex_lock+0x8d/0x920
    slub_cpu_dead+0x15/0xf0
    cpuhp_invoke_callback+0x17a/0x7c0
    cpuhp_invoke_callback_range+0x3b/0x80
    _cpu_down+0xdf/0x2a0
    cpu_down+0x2c/0x50
    device_offline+0x82/0xb0
    remove_cpu+0x1a/0x30
    torture_offline+0x80/0x140
    torture_onoff+0x147/0x260
    kthread+0x10a/0x140
    ret_from_fork+0x22/0x30

    -> #0 (cpu_hotplug_lock){++++}-{0:0}:
    check_prev_add+0x8f/0xbf0
    __lock_acquire+0x13f0/0x1d80
    lock_acquire+0xb9/0x3a0
    cpus_read_lock+0x21/0xa0
    static_key_enable+0x9/0x20
    __kmem_cache_create+0x38d/0x430
    kmem_cache_create_usercopy+0x146/0x250
    kmem_cache_create+0xd/0x10
    rcu_torture_stats+0x79/0x280
    kthread+0x10a/0x140
    ret_from_fork+0x22/0x30

    other info that might help us debug this:

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(slab_mutex);
                                   lock(cpu_hotplug_lock);
                                   lock(slab_mutex);
      lock(cpu_hotplug_lock);

    *** DEADLOCK ***

    1 lock held by rcu_torture_sta/109:
    #0: ffffffff96173c28 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_create_usercopy+0x2d/0x250

    stack backtrace:
    CPU: 3 PID: 109 Comm: rcu_torture_sta Not tainted 5.12.0+ #15
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
    Call Trace:
    dump_stack+0x6d/0x89
    check_noncircular+0xfe/0x110
    ? lock_is_held_type+0x98/0x110
    check_prev_add+0x8f/0xbf0
    __lock_acquire+0x13f0/0x1d80
    lock_acquire+0xb9/0x3a0
    ? static_key_enable+0x9/0x20
    ? mark_held_locks+0x49/0x70
    cpus_read_lock+0x21/0xa0
    ? static_key_enable+0x9/0x20
    static_key_enable+0x9/0x20
    __kmem_cache_create+0x38d/0x430
    kmem_cache_create_usercopy+0x146/0x250
    ? rcu_torture_stats_print+0xd0/0xd0
    kmem_cache_create+0xd/0x10
    rcu_torture_stats+0x79/0x280
    ? rcu_torture_stats_print+0xd0/0xd0
    kthread+0x10a/0x140
    ? kthread_park+0x80/0x80
    ret_from_fork+0x22/0x30

    This is because there's one order of locking from the hotplug callbacks:

    lock(cpu_hotplug_lock); // from hotplug machinery itself
    lock(slab_mutex); // in e.g. slab_mem_going_offline_callback()

    And commit 1f0723a4c0df made the reverse sequence possible:
    lock(slab_mutex); // in kmem_cache_create_usercopy()
    lock(cpu_hotplug_lock); // kmem_cache_open() -> static_key_enable()

    The simplest fix is to move static_key_enable() to a place before slab_mutex is
    taken. That means kmem_cache_create_usercopy() in mm/slab_common.c which is not
    ideal for SLUB-specific code, but the #ifdef CONFIG_SLUB_DEBUG makes it
    at least self-contained and obvious.
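
    A sketch of the move, assuming the static key is flipped in
    kmem_cache_create_usercopy() just before slab_mutex is taken:

        #ifdef CONFIG_SLUB_DEBUG
                /* Sketch: if the caller passes any debug flag explicitly,
                 * enable the slub_debug static key here, before slab_mutex
                 * (and so before cpu_hotplug_lock can nest inside it).
                 */
                if (flags & SLAB_DEBUG_FLAGS)
                        static_branch_enable(&slub_debug_enabled);
        #endif

                mutex_lock(&slab_mutex);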

    [1] https://lore.kernel.org/lkml/20210502171827.GA3670492@paulmck-ThinkPad-P17-Gen-1/

    Link: https://lkml.kernel.org/r/20210504120019.26791-1-vbabka@suse.cz
    Fixes: 1f0723a4c0df ("mm, slub: enable slub_debug static key when creating cache with explicit debug flags")
    Signed-off-by: Vlastimil Babka
    Reported-by: Paul E. McKenney
    Tested-by: Paul E. McKenney
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    When reworking early cow of pinned hugetlb pages, we moved huge_ptep_get()
    upwards but overlooked a side effect: huge_ptep_get() now fetches the pte
    before wr-protection. After moving it upwards, we need an explicit
    wr-protect of the child pte, or we will keep the write bit set in the
    child process, which could cause data corruption where the child can
    write to the original page directly.

    This issue can also be exposed by "memfd_test hugetlbfs" kselftest.
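
    A sketch of the fix in copy_hugetlb_page_range(), on the assumption that
    the child's entry is built from the value fetched before the parent was
    wr-protected:

        if (cow) {
                /* Parent pte is wr-protected in place... */
                huge_ptep_set_wrprotect(src, addr, src_pte);
                /* ...and the already-fetched value used for the child must
                 * be wr-protected explicitly too (sketch of the one-liner).
                 */
                entry = huge_pte_wrprotect(entry);
        }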

    Link: https://lkml.kernel.org/r/20210503234356.9097-3-peterx@redhat.com
    Fixes: 4eae4efa2c299 ("hugetlb: do early cow when page pinned on src mm")
    Signed-off-by: Peter Xu
    Reviewed-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Joel Fernandes (Google)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2.

    Hugh reported an issue with F_SEAL_FUTURE_WRITE not being applied
    correctly to hugetlbfs, which I can easily verify using the memfd_test
    program; it seems that the program is hardly ever run with hugetlbfs
    pages (shmem being the default).

    Meanwhile I found another, probably even more severe, issue: hugetlb
    fork won't wr-protect child cow pages, so the child can potentially
    write to parent private pages. Patch 2 addresses that.

    After this series applied, "memfd_test hugetlbfs" should start to pass.

    This patch (of 2):

    F_SEAL_FUTURE_WRITE has been missing for hugetlb since the first day.
    There is a test program for that and it fails constantly.

    $ ./memfd_test hugetlbfs
    memfd-hugetlb: CREATE
    memfd-hugetlb: BASIC
    memfd-hugetlb: SEAL-WRITE
    memfd-hugetlb: SEAL-FUTURE-WRITE
    mmap() didn't fail as expected
    Aborted (core dumped)

    I think it's probably because no one is really running the hugetlbfs test.

    Fix it by also checking F_SEAL_FUTURE_WRITE in hugetlbfs_file_mmap(), as
    we do in shmem_mmap(). Generalize a helper for that.
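
    A sketch of such a generalized helper, modelled on what shmem_mmap()
    already does (name and placement are assumptions):

        /* Sketch: shared writable mmap()s are refused outright; shared
         * read-only mmap()s additionally drop VM_MAYWRITE so a later
         * mprotect(PROT_WRITE) cannot reintroduce write access.
         */
        static inline int seal_check_future_write(int seals,
                                                  struct vm_area_struct *vma)
        {
                if (seals & F_SEAL_FUTURE_WRITE) {
                        if ((vma->vm_flags & VM_SHARED) &&
                            (vma->vm_flags & VM_WRITE))
                                return -EPERM;
                        if (vma->vm_flags & VM_SHARED)
                                vma->vm_flags &= ~(VM_MAYWRITE);
                }
                return 0;
        }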

    Link: https://lkml.kernel.org/r/20210503234356.9097-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20210503234356.9097-2-peterx@redhat.com
    Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
    Signed-off-by: Peter Xu
    Reported-by: Hugh Dickins
    Reviewed-by: Mike Kravetz
    Cc: Joel Fernandes (Google)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

07 May, 2021

6 commits

  • succed -> succeed in mm/hugetlb.c
    wil -> will in mm/mempolicy.c
    wit -> with in mm/page_alloc.c
    Retruns -> Returns in mm/page_vma_mapped.c
    confict -> conflict in mm/secretmem.c
    No functionality changed.

    Link: https://lkml.kernel.org/r/20210408140027.60623-1-lujialin4@huawei.com
    Signed-off-by: Lu Jialin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lu Jialin
     
  • Fix ~94 single-word typos in locking code comments, plus a few
    very obvious grammar mistakes.

    Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
    Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
    Signed-off-by: Ingo Molnar
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • There is a spelling mistake in a comment. Fix it.

    Link: https://lkml.kernel.org/r/20210317094158.5762-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • The last user (/dev/kmem) is gone. Let's drop it.

    Link: https://lkml.kernel.org/r/20210324102351.6932-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Linus Torvalds
    Cc: Greg Kroah-Hartman
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Minchan Kim
    Cc: huang ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "drivers/char: remove /dev/kmem for good".

    Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and
    memory ballooning, I started questioning the existence of /dev/kmem.

    Comparing it with the /proc/kcore implementation, it does not seem to be
    able to deal with things like

    a) Pages unmapped from the direct mapping (e.g., to be used by secretmem)
    -> kern_addr_valid(). virt_addr_valid() is not sufficient.

    b) Special cases like gart aperture memory that is not to be touched
    -> mem_pfn_is_ram()

    Unless I am missing something, it's at least broken in some cases and might
    fault/crash the machine.

    Looks like its existence has been questioned before, in 2005 and 2010 [1];
    after ~11 additional years, it might make sense to revive the discussion.

    CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by
    mistake?). All distributions disable it: in Ubuntu it has been disabled
    for more than 10 years, in Debian since 2.6.31, in Fedora at least
    starting with FC3, in RHEL starting with RHEL4, in SUSE starting from
    15sp2, and OpenSUSE has it disabled as well.

    1) /dev/kmem was popular for rootkits [2] before it got disabled
    basically everywhere. Ubuntu documents [3] "There is no modern user of
    /dev/kmem any more beyond attackers using it to load kernel rootkits.".
    RHEL documents in a BZ [5] "it served no practical purpose other than to
    serve as a potential security problem or to enable binary module drivers
    to access structures/functions they shouldn't be touching"

    2) /proc/kcore is a decent interface to have a controlled way to read
    kernel memory for debugging purposes. (will need some extensions to
    deal with memory offlining/unplug, memory ballooning, and poisoned
    pages, though)

    3) It might be useful for corner-case debugging [1]. KDB/KGDB might be a
    better fit, especially for writing random memory; it is harder to shoot
    yourself in the foot.

    4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems
    to be incompatible with 64bit [1]. For educational purposes,
    /proc/kcore might be used to monitor value updates -- or older
    kernels can be used.

    5) It's broken on arm64, and therefore, completely disabled there.

    Looks like it's essentially unused and has been replaced by better
    suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's
    just remove it.

    [1] https://lwn.net/Articles/147901/
    [2] https://www.linuxjournal.com/article/10505
    [3] https://wiki.ubuntu.com/Security/Features#A.2Fdev.2Fkmem_disabled
    [4] https://sourceforge.net/projects/kme/
    [5] https://bugzilla.redhat.com/show_bug.cgi?id=154796

    Link: https://lkml.kernel.org/r/20210324102351.6932-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20210324102351.6932-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Kees Cook
    Cc: Linus Torvalds
    Cc: Greg Kroah-Hartman
    Cc: "Alexander A. Klimov"
    Cc: Alexander Viro
    Cc: Alexandre Belloni
    Cc: Andrew Lunn
    Cc: Andrey Zhizhikin
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Brian Cain
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Chris Zankel
    Cc: Corentin Labbe
    Cc: "David S. Miller"
    Cc: "Eric W. Biederman"
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Gregory Clement
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Hillf Danton
    Cc: huang ying
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: "James E.J. Bottomley"
    Cc: James Troup
    Cc: Jiaxun Yang
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Kairui Song
    Cc: Krzysztof Kozlowski
    Cc: Kuninori Morimoto
    Cc: Liviu Dudau
    Cc: Lorenzo Pieralisi
    Cc: Luc Van Oostenryck
    Cc: Luis Chamberlain
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Niklas Schnelle
    Cc: Oleksiy Avramchenko
    Cc: openrisc@lists.librecores.org
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: "Pavel Machek (CIP)"
    Cc: Pavel Machek
    Cc: "Peter Zijlstra (Intel)"
    Cc: Pierre Morel
    Cc: Randy Dunlap
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Robert Richter
    Cc: Rob Herring
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Sebastian Andrzej Siewior
    Cc: Sebastian Hesselbarth
    Cc: sparclinux@vger.kernel.org
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Steven Rostedt
    Cc: Sudeep Holla
    Cc: Theodore Dubois
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Viresh Kumar
    Cc: William Cohen
    Cc: Xiaoming Ni
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • fix some typos and code style problems in mm.

    gfp.h: s/MAXNODES/MAX_NUMNODES
    mmzone.h: s/then/than
    rmap.c: s/__vma_split()/__vma_adjust()
    swap.c: s/__mod_zone_page_stat/__mod_zone_page_state, s/is is/is
    swap_state.c: s/whoes/whose
    z3fold.c: code style problem fix in z3fold_unregister_migration
    zsmalloc.c: s/of/or, s/give/given

    Link: https://lkml.kernel.org/r/20210419083057.64820-1-luoshijie1@huawei.com
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shijie Luo