12 Dec, 2020

3 commits

  • Commit 1378a5ee451a ("mm: store compound_nr as well as compound_order")
    added a compound_nr counter to the first tail struct page, overlaying it
    with page->mapping. The overlay itself is fine, but while freeing gigantic
    hugepages via free_contig_range(), a "bad page" check will trigger for a
    non-NULL page->mapping on the first tail page:

    BUG: Bad page state in process bash pfn:380001
    page:00000000c35f0856 refcount:0 mapcount:0 mapping:00000000126b68aa index:0x0 pfn:0x380001
    aops:0x0
    flags: 0x3ffff00000000000()
    raw: 3ffff00000000000 0000000000000100 0000000000000122 0000000100000000
    raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
    page dumped because: non-NULL mapping
    Modules linked in:
    CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 #1
    Hardware name: IBM 3906 M03 703 (LPAR)
    Call Trace:
    show_stack+0x6e/0xe8
    dump_stack+0x90/0xc8
    bad_page+0xd6/0x130
    free_pcppages_bulk+0x26a/0x800
    free_unref_page+0x6e/0x90
    free_contig_range+0x94/0xe8
    update_and_free_page+0x1c4/0x2c8
    free_pool_huge_page+0x11e/0x138
    set_max_huge_pages+0x228/0x300
    nr_hugepages_store_common+0xb8/0x130
    kernfs_fop_write+0xd2/0x218
    vfs_write+0xb0/0x2b8
    ksys_write+0xac/0xe0
    system_call+0xe6/0x288
    Disabling lock debugging due to kernel taint

    This is because destroy_compound_gigantic_page() only clears the
    compound_order, while set_compound_order(page, 0) sets compound_nr to
    1U << 0 == 1, which is non-zero and therefore looks like a non-NULL
    page->mapping on the first tail page.

    Fix this by explicitly clearing compound_nr for first tail page after
    calling set_compound_order(page, 0).
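
    A minimal sketch of the described change in destroy_compound_gigantic_page()
    (illustrative, not necessarily the exact patch; compound_nr is assumed to
    live in the first tail page):

        set_compound_order(page, 0);
        page[1].compound_nr = 0;        /* overlays page->mapping on the first tail page */
        __ClearPageHead(page);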

    Link: https://lkml.kernel.org/r/20201208182813.66391-2-gerald.schaefer@linux.ibm.com
    Fixes: 1378a5ee451a ("mm: store compound_nr as well as compound_order")
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Heiko Carstens
    Cc: Mike Kravetz
    Cc: Christian Borntraeger
    Cc: [5.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • We hit this issue in our internal test. When generic KASAN is enabled, a
    kfree()'d object is put into the per-cpu quarantine first. If the cpu
    goes offline, the object still remains in that per-cpu quarantine. If we
    call kmem_cache_destroy() now, SLUB will report an "Objects remaining"
    error.

    =============================================================================
    BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Slab 0x(____ptrval____) objects=34 used=1 fp=0x(____ptrval____) flags=0x2ffff00000010200
    CPU: 3 PID: 176 Comm: cat Tainted: G B 5.10.0-rc1-00007-g4525c8781ec0-dirty #10
    Hardware name: linux,dummy-virt (DT)
    Call trace:
    dump_backtrace+0x0/0x2b0
    show_stack+0x18/0x68
    dump_stack+0xfc/0x168
    slab_err+0xac/0xd4
    __kmem_cache_shutdown+0x1e4/0x3c8
    kmem_cache_destroy+0x68/0x130
    test_version_show+0x84/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    do_el0_svc+0x38/0xa0
    el0_sync_handler+0x170/0x178
    el0_sync+0x174/0x180
    INFO: Object 0x(____ptrval____) @offset=15848
    INFO: Allocated in test_version_show+0x98/0xf0 age=8188 cpu=6 pid=172
    stack_trace_save+0x9c/0xd0
    set_track+0x64/0xf0
    alloc_debug_processing+0x104/0x1a0
    ___slab_alloc+0x628/0x648
    __slab_alloc.isra.0+0x2c/0x58
    kmem_cache_alloc+0x560/0x588
    test_version_show+0x98/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    kmem_cache_destroy test_module_slab: Slab cache still has objects

    Register a cpu hotplug function to remove all objects from the per-cpu
    quarantine when the cpu is going offline, and set a per-cpu variable to
    indicate that the cpu is offline.
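
    A rough sketch of such a hotplug hook (illustrative only; the per-cpu
    cpu_quarantine structure, its offline flag and qlist_free_all() are
    assumed to follow mm/kasan/quarantine.c and may differ in detail):

        static int kasan_cpu_online(unsigned int cpu)
        {
                this_cpu_ptr(&cpu_quarantine)->offline = false;
                return 0;
        }

        static int kasan_cpu_offline(unsigned int cpu)
        {
                struct qlist_head *q = this_cpu_ptr(&cpu_quarantine);

                /* Mark the cpu offline so nothing new is queued, then drain. */
                q->offline = true;
                qlist_free_all(q, NULL);
                return 0;
        }

        static int __init kasan_cpu_quarantine_init(void)
        {
                int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/kasan:online",
                                            kasan_cpu_online, kasan_cpu_offline);
                if (ret < 0)
                        pr_err("kasan cpu quarantine register failed [%d]\n", ret);
                return 0;
        }
        late_initcall(kasan_cpu_quarantine_init);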

    [qiang.zhang@windriver.com: fix slab double free when cpu-hotplug]
    Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com

    Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com
    Signed-off-by: Kuan-Ying Lee
    Signed-off-by: Zqiang
    Suggested-by: Dmitry Vyukov
    Reported-by: Guangye Yang
    Reviewed-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Matthias Brugger
    Cc: Nicholas Tang
    Cc: Miles Chen
    Cc: Qian Cai
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuan-Ying Lee
     
  • Revert commit 3351b16af494 ("mm/filemap: add static for function
    __add_to_page_cache_locked") due to incompatibility with
    ALLOW_ERROR_INJECTION, which results in build errors.

    Link: https://lkml.kernel.org/r/CAADnVQJ6tmzBXvtroBuEH6QA0H+q7yaSKxrVvVxhqr3KBZdEXg@mail.gmail.com
    Tested-by: Justin Forbes
    Tested-by: Greg Thelen
    Acked-by: Alexei Starovoitov
    Cc: Michal Kubecek
    Cc: Alex Shi
    Cc: Souptick Joarder
    Cc: Daniel Borkmann
    Cc: Josef Bacik
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Dec, 2020

1 commit

  • Jann spotted a security hole due to a race with the mm ownership check.

    If the task is sharing the mm_struct but goes through execve() before
    mm_access(), it could skip the process_madvise_behavior_valid check.
    That allows *any* advice hint to reach into the remote process.

    This patch removes the mm ownership check. With it, we lose the ability
    for a local process to give *any* advice hint via the vectored interface
    for its own reasons (e.g., performance). Since there is no concrete
    example in upstream yet, it is better to remove the ability at this
    moment and review it again when such a new advice comes up.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Reported-by: Jann Horn
    Suggested-by: Jann Horn
    Signed-off-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

07 Dec, 2020

7 commits

  • On success, mmap should return the start address of the newly mapped
    area, but the patch "mm: mmap: merge vma after call_mmap() if possible"
    used the vm_start of the newly merged vma as the return value addr.
    Users of mmap will get a wrong address if the vma is merged after
    call_mmap(). We fix this by moving the assignment to addr before the
    vma is merged.
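
    A rough sketch of the idea in mmap_region() (illustrative only, not the
    exact patch; "merge" and "prev" are the locals assumed by the original
    change):

        /* Capture the address that mmap() must return right after call_mmap(),
         * before any merge can replace "vma". */
        addr = vma->vm_start;

        if (unlikely(vm_flags != vma->vm_flags && prev)) {
                merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
                                  vma->vm_flags, NULL, vma->vm_file,
                                  vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
                if (merge) {
                        vma = merge;            /* addr keeps the original start */
                        vm_flags = vma->vm_flags;
                }
        }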

    We have a driver which changes vm_flags, and this bug is found by our
    testcases.

    Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
    Signed-off-by: Liu Zixian
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: David Hildenbrand
    Cc: Miaohe Lin
    Cc: Hongxiang Lou
    Cc: Hu Shiyuan
    Cc: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.com
    Signed-off-by: Linus Torvalds

    Liu Zixian
     
  • Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
    using hugetlbfs. In this environment the issue is reproduced by:

    - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached)

    - Start a DPDK app. It doesn't need to run successfully (as in transfer
    packets) nor interact with real hardware. It seems just initializing
    the EAL layer (which handles hugepage reservation and locking) is
    enough to trigger the issue

    - Delete the Pod (or let it "Complete").

    This would result in a kworker thread going into a tight loop (top output):

    1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy

    'perf top -g' reports:

    - 63.28% 0.01% [kernel] [k] worker_thread
    - 49.97% worker_thread
    - 52.64% process_one_work
    - 62.08% css_killed_work_fn
    - hugetlb_cgroup_css_offline
    41.52% _raw_spin_lock
    - 2.82% _cond_resched
    rcu_all_qs
    2.66% PageHuge
    - 0.57% schedule
    - 0.57% __schedule

    We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
    Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
    infinitely spinning. Little else can be done on the system as the
    cgroup_mutex can not be acquired.

    Do note that the issue can be reproduced by simply offlining a hugetlb
    cgroup containing pages with reservation counts.

    The loop in hugetlb_cgroup_css_offline is moving page counts from the
    cgroup being offlined to the parent cgroup. This is done for each
    hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
    The routine moving counts (hugetlb_cgroup_move_parent) is only moving
    'usage' counts. The routine hugetlb_cgroup_have_usage is checking for
    both 'usage' and 'reservation' counts. What to do with reservation
    counts when reparenting was discussed here:

    https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

    The decision was made to leave a zombie cgroup behind for cgroups with
    reservation counts. Unfortunately, the code checking reservation counts
    was incorrectly added to hugetlb_cgroup_have_usage.

    To fix the issue, simply remove the check for reservation counts. While
    fixing this issue, a related bug in hugetlb_cgroup_css_offline was
    noticed. The hstate index is not reinitialized each time through the
    do-while loop. Fix this as well.
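
    A sketch of the two changes (illustrative; helper and field names are
    assumed to follow mm/hugetlb_cgroup.c and may differ slightly):

        /* Only look at 'usage' counters; reservation counts may legitimately
         * outlive the cgroup as a zombie. */
        static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
        {
                int idx;

                for (idx = 0; idx < hugetlb_max_hstate; idx++) {
                        if (page_counter_read(
                                    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
                                return true;
                }
                return false;
        }

        static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
        {
                struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
                struct hstate *h;
                struct page *page;
                int idx;

                do {
                        idx = 0;        /* reinitialize on every pass */
                        for_each_hstate(h) {
                                spin_lock(&hugetlb_lock);
                                list_for_each_entry(page, &h->hugepage_activelist, lru)
                                        hugetlb_cgroup_move_parent(idx, h_cg, page);
                                spin_unlock(&hugetlb_lock);
                                idx++;
                        }
                        cond_resched();
                } while (hugetlb_cgroup_have_usage(h_cg));
        }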

    Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Adrian Moreno
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Adrian Moreno
    Reviewed-by: Shakeel Butt
    Cc: Mina Almasry
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shuah Khan
    Cc:
    Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]

    Signed-off-by: Alex Shi
    Signed-off-by: Andrew Morton
    Cc: Souptick Joarder
    Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • We can't call kvfree() with a spin lock held, so defer it. Fixes a
    might_sleep() runtime warning.
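
    The pattern, as a minimal sketch (hypothetical variable names, not the
    exact patch):

        spin_lock(&si->lock);
        cluster_info = si->cluster_info;        /* detach under the lock */
        si->cluster_info = NULL;
        spin_unlock(&si->lock);

        kvfree(cluster_info);   /* kvfree() may sleep, so only call it unlocked */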

    Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc:
    Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • While I was doing zram testing, I found that decompression sometimes
    failed because the compression buffer was corrupted. On investigation, I
    found that the commit below calls cond_resched() unconditionally, so it
    can cause a problem in atomic context if the task is rescheduled.

    BUG: sleeping function called from invalid context at mm/vmalloc.c:108
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
    3 locks held by memhog/946:
    #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
    #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
    #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
    CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    Call Trace:
    unmap_kernel_range_noflush+0x2eb/0x350
    unmap_kernel_range+0x14/0x30
    zs_unmap_object+0xd5/0xe0
    zram_bvec_rw.isra.0+0x38c/0x8e0
    zram_rw_page+0x90/0x101
    bdev_write_page+0x92/0xe0
    __swap_writepage+0x94/0x4a0
    pageout+0xe3/0x3a0
    shrink_page_list+0xb94/0xd60
    shrink_inactive_list+0x158/0x460

    We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
    contains the offending calling code) from zsmalloc.

    Even though this option showed some improvement (e.g., 30%) on some
    arm32 platforms, it has been a headache to maintain since it abused
    APIs [1] (e.g., unmap_kernel_range in atomic context).

    Since we are approaching the deprecation of 32-bit machines, the config
    option has only been available for builtin builds since v5.8, and it has
    never been the default in zsmalloc, it's time to drop the option for
    better maintainability.

    [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org

    Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Cc: Tony Lindgren
    Cc: Christoph Hellwig
    Cc: Harish Sriram
    Cc: Uladzislau Rezki
    Cc:
    Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When investigating a slab cache bloat problem, a significant amount of
    negative dentry cache was seen, but confusingly it was neither shrunk by
    the reclaimer (the host has very tight memory) nor by dropping caches.
    The vmcore shows there are over 14M negative dentry objects on the lru,
    but the tracing result shows they were not even scanned at all.

    Further investigation shows the memcg's vfs shrinker_map bit is not set,
    so the reclaimer and cache dropping just skip calling the vfs shrinker,
    and we have to reboot the hosts to get the memory back.

    I didn't manage to come up with a reproducer in a test environment, and
    the problem can't be reproduced after rebooting. But code inspection
    suggests there is a race between clearing the shrinker map bit and
    reparenting. The hypothesis is elaborated below.

    The memcg hierarchy on our production environment looks like:

         root
        /    \
    system   user

    The main workloads are running under user slice's children, and it
    creates and removes memcg frequently. So reparenting happens very often
    under user slice, but no task is under user slice directly.

    So with the frequent reparenting and tight memory pressure, the below
    hypothetical race condition may happen:

    CPU A CPU B
    reparent
    dst->nr_items == 0
    shrinker:
    total_objects == 0
    add src->nr_items to dst
    set_bit
    return SHRINK_EMPTY
    clear_bit
    child memcg offline
    replace child's kmemcg_id with
    parent's (in memcg_offline_kmem())
    list_lru_del() between shrinker runs
    see parent's kmemcg_id
    dec dst->nr_items
    reparent again
    dst->nr_items may go negative
    due to concurrent list_lru_del()

    The second run of shrinker:
    read nr_items without any
    synchronization, so it may
    see intermediate negative
    nr_items then total_objects
    may return 0 coincidently

    keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add scr->nr_item to dst

    After this point dst->nr_items may never go back to zero, so reparenting
    will not set the shrinker_map bit anymore. And since there is no task
    under the user slice directly, no new object will be added to its lru to
    set the shrinker map bit either. That bit is kept cleared forever.

    How does list_lru_del() race with reparenting? It is because reparenting
    replaces the children's kmemcg_id with the parent's without the
    protection of nlru->lock, so list_lru_del() may see the parent's
    kmemcg_id while actually deleting items from the child's lru, but
    dec'ing the parent's nr_items; the parent's nr_items may therefore go
    negative, as commit 2788cf0c401c ("memcg: reparent list_lrus and free
    kmemcg_id on css offline") says.

    Since it is impossible that dst->nr_items goes negative and
    src->nr_items goes to zero at the same time, it seems we could set the
    shrinker map bit iff src->nr_items != 0. We could synchronize
    list_lru_count_one() and reparenting with nlru->lock, but checking
    src->nr_items in reparenting is the simplest and avoids lock contention.
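
    A sketch of the resulting reparenting logic (illustrative; based on
    memcg_drain_list_lru_node() in mm/list_lru.c, not necessarily the exact
    patch):

        static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
                                              int src_idx, struct mem_cgroup *dst_memcg)
        {
                struct list_lru_node *nlru = &lru->node[nid];
                int dst_idx = dst_memcg->kmemcg_id;
                struct list_lru_one *src, *dst;

                /* Hold nlru->lock so a racing list_lru_add()/del() cannot skew the counts. */
                spin_lock_irq(&nlru->lock);

                src = list_lru_from_memcg_idx(nlru, src_idx);
                dst = list_lru_from_memcg_idx(nlru, dst_idx);

                list_splice_init(&src->list, &dst->list);

                if (src->nr_items) {
                        dst->nr_items += src->nr_items;
                        /* Set the bit whenever the child actually had items. */
                        memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
                        src->nr_items = 0;
                }

                spin_unlock_irq(&nlru->lock);
        }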

    Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
    Suggested-by: Roman Gushchin
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Kirill Tkhai
    Cc: Vladimir Davydov
    Cc: [4.19]
    Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
    for all allocations") introduced a regression into the handling of the
    obj_cgroup_charge() return value. If a non-zero value is returned
    (indicating that one of the memory.max limits has been exceeded), the
    allocation should fail instead of falling back to non-accounted mode.

    To make the code more readable, move memcg_slab_pre_alloc_hook() and
    memcg_slab_post_alloc_hook() calling conditions into bodies of these
    hooks.

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

25 Nov, 2020

1 commit

  • Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
    on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
    end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
    no longer an ext4 page at all.

    The problem is that PageWriteback is not accompanied by a page reference
    (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
    soon as TestClearPageWriteback has been done, that page could be removed
    from page cache, freed, and reused for something else by the time that
    wake_up_page() is reached.

    https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
    Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
    check; but I'm paranoid about even looking at an unreferenced struct page,
    lest its memory might itself have already been reused or hotremoved (and
    wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

    Then on crashing a second time, realized there's a stronger reason against
    that approach. If my testing just occasionally crashes on that check,
    when the page is reused for part of a compound page, wouldn't it be much
    more common for the page to get reused as an order-0 page before reaching
    wake_up_page()? And on rare occasions, might that reused page already be
    marked PageWriteback by its new user, and already be waited upon? What
    would that look like?

    It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
    in write_cache_pages() (though I have never seen that crash myself).

    Matthew Wilcox explaining this to himself:
    "page is allocated, added to page cache, dirtied, writeback starts,

    --- thread A ---
    filesystem calls end_page_writeback()
    test_clear_page_writeback()
    --- context switch to thread B ---
    truncate_inode_pages_range() finds the page, it doesn't have writeback set,
    we delete it from the page cache. Page gets reallocated, dirtied, writeback
    starts again. Then we call write_cache_pages(), see
    PageWriteback() set, call wait_on_page_writeback()
    --- context switch back to thread A ---
    wake_up_page(page, PG_writeback);
    ... thread B is woken, but because the wakeup was for the old use of
    the page, PageWriteback is still set.

    Devious"

    And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    this would have been much less likely: before that, wake_page_function()'s
    non-exclusive case would stop walking and not wake if it found Writeback
    already set again; whereas now the non-exclusive case proceeds to wake.

    I have not thought of a fix that does not add a little overhead: the
    simplest fix is for end_page_writeback() to get_page() before calling
    test_clear_page_writeback(), then put_page() after wake_up_page().
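
    In sketch form (illustrative of the described fix, not necessarily the
    exact patch):

        void end_page_writeback(struct page *page)
        {
                /* ... PageReclaim()/rotate_reclaimable_page() handling elided ... */

                /*
                 * Hold a reference so the page cannot be freed and reused
                 * between clearing PG_writeback and waking any waiters.
                 */
                get_page(page);
                if (!test_clear_page_writeback(page))
                        BUG();

                smp_mb__after_atomic();
                wake_up_page(page, PG_writeback);
                put_page(page);
        }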

    Was there a chance of missed wakeups before, since a page freed before
    reaching wake_up_page() would have PageWaiters cleared? I think not,
    because each waiter does hold a reference on the page. This bug comes
    when the old use of the page, the one we do TestClearPageWriteback on,
    had *no* waiters, so no additional page reference beyond the page cache
    (and whoever racily freed it). The reuse of the page has a waiter
    holding a reference, and its own PageWriteback set; but the belated
    wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

    Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
    Reported-by: Qian Cai
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Nov, 2020

5 commits

  • The calculation of the end page index was incorrect, leading to a
    regression of 70% when running stress-ng.

    With this fix, we instead see a performance improvement of 3%.

    Fixes: e6e88712e43b ("mm: optimise madvise WILLNEED")
    Reported-by: kernel test robot
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Tested-by: Xing Zhengjun
    Acked-by: Johannes Weiner
    Cc: William Kucharski
    Cc: Feng Tang
    Cc: "Chen, Rong A"
    Link: https://lkml.kernel.org/r/20201109134851.29692-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Alexander reported a syzkaller / KASAN finding on s390, see below for
    complete output.

    In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be
    freed in some cases. In the case of userfaultfd_missing(), this will
    happen after calling handle_userfault(), which might have released the
    mmap_lock. Therefore, the following pte_free(vma->vm_mm, pgtable) will
    access an unstable vma->vm_mm, which could have been freed or re-used
    already.

    For all architectures other than s390 this will go w/o any negative
    impact, because pte_free() simply frees the page and ignores the
    passed-in mm. The implementation for SPARC32 would also access
    mm->page_table_lock for pte_free(), but there is no THP support in
    SPARC32, so the buggy code path will not be used there.

    For s390, the mm->context.pgtable_list is being used to maintain the 2K
    pagetable fragments, and operating on an already freed or even re-used
    mm could result in various more or less subtle bugs due to list /
    pagetable corruption.

    Fix this by calling pte_free() before handle_userfault(), similar to how
    it is already done in __do_huge_pmd_anonymous_page() for the WRITE /
    non-huge_zero_page case.

    Commit 6b251fc96cf2c ("userfaultfd: call handle_userfault() for
    userfaultfd_missing() faults") actually introduced both the
    do_huge_pmd_anonymous_page() and the __do_huge_pmd_anonymous_page()
    changes w.r.t. calling handle_userfault(), but only in the latter case
    did it put the pte_free() before calling handle_userfault().
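
    A sketch of the fixed ordering in do_huge_pmd_anonymous_page()
    (illustrative, not necessarily the exact patch):

        if (userfaultfd_missing(vma)) {
                spin_unlock(vmf->ptl);
                /*
                 * Free the preallocated pagetable before handle_userfault(),
                 * which may drop mmap_lock and leave vma->vm_mm unstable.
                 */
                pte_free(vma->vm_mm, pgtable);
                ret = handle_userfault(vmf, VM_UFFD_MISSING);
                VM_BUG_ON(ret & VM_FAULT_FALLBACK);
        }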

    BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
    Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334

    CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0
    Hardware name: IBM 3906 M04 701 (KVM/Linux)
    Call Trace:
    do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
    create_huge_pmd mm/memory.c:4256 [inline]
    __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480
    handle_mm_fault+0x288/0x748 mm/memory.c:4607
    do_exception+0x394/0xae0 arch/s390/mm/fault.c:479
    do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567
    pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706
    copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline]
    raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174
    _copy_from_user+0x48/0xa8 lib/usercopy.c:16
    copy_from_user include/linux/uaccess.h:192 [inline]
    __do_sys_sigaltstack kernel/signal.c:4064 [inline]
    __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060
    system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

    Allocated by task 9334:
    slab_alloc_node mm/slub.c:2891 [inline]
    slab_alloc mm/slub.c:2899 [inline]
    kmem_cache_alloc+0x118/0x348 mm/slub.c:2904
    vm_area_dup+0x9c/0x2b8 kernel/fork.c:356
    __split_vma+0xba/0x560 mm/mmap.c:2742
    split_vma+0xca/0x108 mm/mmap.c:2800
    mlock_fixup+0x4ae/0x600 mm/mlock.c:550
    apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619
    do_mlock+0x1aa/0x718 mm/mlock.c:711
    __do_sys_mlock2 mm/mlock.c:738 [inline]
    __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728
    system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

    Freed by task 9333:
    slab_free mm/slub.c:3142 [inline]
    kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158
    __vma_adjust+0x7b2/0x2508 mm/mmap.c:960
    vma_merge+0x87e/0xce0 mm/mmap.c:1209
    userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868
    __fput+0x22c/0x7a8 fs/file_table.c:281
    task_work_run+0x200/0x320 kernel/task_work.c:151
    tracehook_notify_resume include/linux/tracehook.h:188 [inline]
    do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
    system_call+0xe6/0x28c arch/s390/kernel/entry.S:416

    The buggy address belongs to the object at 00000000962d6948 which belongs to the cache vm_area_struct of size 200
    The buggy address is located 64 bytes inside of 200-byte region [00000000962d6948, 00000000962d6a10)
    The buggy address belongs to the page: page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6 flags: 0x3ffff00000000200(slab)
    raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00
    raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501
    page dumped because: kasan: bad access detected
    page->mem_cgroup:0000000096959501

    Memory state around the buggy address:
    00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
    >00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
    00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================

    Fixes: 6b251fc96cf2c ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults")
    Reported-by: Alexander Egorenkov
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: [4.3+]
    Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • If we reparent the slab objects to the root memcg, when we free the slab
    object, we need to update the per-memcg vmstats to keep it correct for
    the root memcg. Now this at least affects the vmstat of
    NR_KERNEL_STACK_KB for !CONFIG_VMAP_STACK when the thread stack size is
    smaller than the PAGE_SIZE.

    David said:
    "I assume that without this fix that the root memcg's vmstat would
    always be inflated if we reparented"

    Fixes: ec9f02384f60 ("mm: workingset: fix vmstat counters for shadow nodes")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Christopher Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Roman Gushchin
    Cc: Vlastimil Babka
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: [5.3+]
    Link: https://lkml.kernel.org/r/20201110031015.15715-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • The core-mm has a default __weak implementation of phys_to_target_node()
    to mirror the weak definition of memory_add_physaddr_to_nid(). That
    symbol is exported for modules. However, while the export in
    mm/memory_hotplug.c exported the symbol in the configuration cases of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=y

    ...and:

    CONFIG_NUMA_KEEP_MEMINFO=n
    CONFIG_MEMORY_HOTPLUG=y

    ...it failed to export the symbol in the case of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=n

    Not only is that broken, but Christoph points out that the kernel should
    not be exporting any __weak symbol, which means that
    memory_add_physaddr_to_nid() example that phys_to_target_node() copied
    is broken too.

    Rework the definition of phys_to_target_node() and
    memory_add_physaddr_to_nid() to not require weak symbols. Move to the
    common arch override design-pattern of an asm header defining a symbol
    to replace the default implementation.
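
    That override pattern, in sketch form (illustrative; the header and
    architecture names here are placeholders):

        /* arch/<arch>/include/asm/sparsemem.h: architecture override */
        #define phys_to_target_node phys_to_target_node
        int phys_to_target_node(u64 start);

        /* common header: fallback used when no arch override is defined */
        #ifndef phys_to_target_node
        static inline int phys_to_target_node(u64 start)
        {
                return 0;
        }
        #endif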

    The only common header that all memory_add_physaddr_to_nid() producing
    architectures implement is asm/sparsemem.h. In fact, powerpc already
    defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
    Double-down on that observation and define phys_to_target_node() where
    necessary in asm/sparsemem.h. An alternate consideration that was
    discarded was to put this override in asm/numa.h, but that entangles
    with the definition of MAX_NUMNODES relative to the inclusion of
    linux/nodemask.h, and requires powerpc to grow a new header.

    The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
    now that the symbol is properly exported / stubbed in all combinations
    of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.

    [dan.j.williams@intel.com: v4]
    Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
    [dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
    Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com

    Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
    Reported-by: Randy Dunlap
    Reported-by: Thomas Gleixner
    Reported-by: kernel test robot
    Reported-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Tested-by: Randy Dunlap
    Tested-by: Thomas Gleixner
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Christoph Hellwig
    Cc: Joao Martins
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Vishal Verma
    Cc: Stephen Rothwell
    Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The early return in process_madvise() will produce a memory leak.

    Fix it.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201116155132.GA3805951@google.com
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

21 Nov, 2020

1 commit

  • Pull io_uring fixes from Jens Axboe:
    "Mostly regression or stable fodder:

    - Disallow async path resolution of /proc/self

    - Tighten constraints for segmented async buffered reads

    - Fix double completion for a retry error case

    - Fix for fixed file life times (Pavel)"

    * tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block:
    io_uring: order refnode recycling
    io_uring: get an active ref_node from files_data
    io_uring: don't double complete failed reissue request
    mm: never attempt async page lock if we've transferred data already
    io_uring: handle -EOPNOTSUPP on path resolution
    proc: don't allow async path resolution of /proc/self components

    Linus Torvalds
     

20 Nov, 2020

1 commit

  • Pull networking fixes from Jakub Kicinski:
    "Networking fixes for 5.10-rc5, including fixes from the WiFi
    (mac80211), can and bpf (including the strncpy_from_user fix).

    Current release - regressions:

    - mac80211: fix memory leak of filtered powersave frames

    - mac80211: free sta in sta_info_insert_finish() on errors to avoid
    sleeping in atomic context

    - netlabel: fix an uninitialized variable warning added in -rc4

    Previous release - regressions:

    - vsock: forward all packets to the host when no H2G is registered,
    un-breaking AWS Nitro Enclaves

    - net: Exempt multicast addresses from five-second neighbor lifetime
    requirement, decreasing the chances neighbor tables fill up

    - net/tls: fix corrupted data in recvmsg

    - qed: fix ILT configuration of SRC block

    - can: m_can: process interrupt only when not runtime suspended

    Previous release - always broken:

    - page_frag: Recover from memory pressure by not recycling pages
    allocating from the reserves

    - strncpy_from_user: Mask out bytes after NUL terminator

    - ip_tunnels: Set tunnel option flag only when tunnel metadata is
    present, always setting it confuses Open vSwitch

    - bpf, sockmap:
    - Fix partial copy_page_to_iter so progress can still be made
    - Fix socket memory accounting and obeying SO_RCVBUF

    - net: Have netpoll bring-up DSA management interface

    - net: bridge: add missing counters to ndo_get_stats64 callback

    - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt

    - enetc: Workaround MDIO register access HW bug

    - net/ncsi: move netlink family registration to a subsystem init,
    instead of tying it to driver probe

    - net: ftgmac100: unregister NC-SI when removing driver to avoid
    crash

    - lan743x:
    - prevent interrupt storm on open
    - fix freeing skbs in the wrong context

    - net/mlx5e: Fix socket refcount leak on kTLS RX resync

    - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097

    - fix 21 unset return codes and other mistakes on error paths, mostly
    detected by the Hulk Robot"

    * tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (115 commits)
    fail_function: Remove a redundant mutex unlock
    selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
    lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
    net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid()
    net/smc: fix matching of existing link groups
    ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module
    libbpf: Fix VERSIONED_SYM_COUNT number parsing
    net/mlx4_core: Fix init_hca fields offset
    atm: nicstar: Unmap DMA on send error
    page_frag: Recover from memory pressure
    net: dsa: mv88e6xxx: Wait for EEPROM done after HW reset
    mlxsw: core: Use variable timeout for EMAD retries
    mlxsw: Fix firmware flashing
    net: Have netpoll bring-up DSA management interface
    atl1e: fix error return code in atl1e_probe()
    atl1c: fix error return code in atl1c_probe()
    ah6: fix error return code in ah6_input()
    net: usb: qmi_wwan: Set DTR quirk for MR400
    can: m_can: process interrupt only when not runtime suspended
    can: flexcan: flexcan_chip_start(): fix erroneous flexcan_transceiver_enable() during bus-off recovery
    ...

    Linus Torvalds
     

19 Nov, 2020

1 commit

  • The ethernet driver may allocate an skb (and skb->data) via
    napi_alloc_skb(). This ends up in page_frag_alloc() allocating skb->data
    from page_frag_cache->va.

    Under memory pressure, page_frag_cache->va may be allocated as a
    pfmemalloc page. As a result, skb->pfmemalloc is always true because
    skb->data comes from page_frag_cache->va. The skb will be dropped if the
    sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
    under memory pressure.

    However, once the kernel is no longer under memory pressure (suppose a
    large amount of memory pages has just been reclaimed), page_frag_alloc()
    may still re-use the prior pfmemalloc page_frag_cache->va to allocate
    skb->data. As a result, skb->pfmemalloc stays true unless
    page_frag_cache->va is re-allocated, even though the kernel is not under
    memory pressure any longer.

    Here is how kernel runs into issue.

    1. The kernel is under memory pressure and allocation of
    PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead,
    the pfmemalloc page is allocated for page_frag_cache->va.

    2: All skb->data from page_frag_cache->va (pfmemalloc) will have
    skb->pfmemalloc=true. The skb will always be dropped by sock without
    SOCK_MEMALLOC. This is an expected behaviour.

    3. Suppose a large amount of pages is reclaimed and the kernel is no
    longer under memory pressure. We expect that skbs will no longer be
    dropped due to skb->pfmemalloc.

    4. Unfortunately, page_frag_alloc() does not proactively re-allocate
    page_frag_cache->va and will always re-use the prior pfmemalloc page. So
    skb->pfmemalloc remains true even though the kernel is no longer under
    memory pressure.

    Fix this by freeing and re-allocating the page instead of recycling it.
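
    A sketch of the idea in page_frag_alloc() (illustrative, not necessarily
    the exact patch):

        if (unlikely(offset < 0)) {
                struct page *page = virt_to_page(nc->va);

                if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
                        goto refill;

                if (unlikely(nc->pfmemalloc)) {
                        /*
                         * The cached page came from the reserves; free it and
                         * refill instead of recycling it, so later skbs do not
                         * inherit pfmemalloc needlessly.
                         */
                        free_the_page(page, compound_order(page));
                        goto refill;
                }

                /* otherwise reset pagecnt_bias and keep reusing the page */
        }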

    References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/
    References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/
    Suggested-by: Matthew Wilcox (Oracle)
    Cc: Aruna Ramakrishna
    Cc: Bert Barbe
    Cc: Rama Nichanamatlu
    Cc: Venkat Venkatsubra
    Cc: Manjunath Patil
    Cc: Joe Jin
    Cc: SRINIVAS
    Fixes: 79930f5892e1 ("net: do not deplete pfmemalloc reserve")
    Signed-off-by: Dongli Zhang
    Acked-by: Vlastimil Babka
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201115201029.11903-1-dongli.zhang@oracle.com
    Signed-off-by: Jakub Kicinski

    Dongli Zhang
     

17 Nov, 2020

1 commit

  • We catch the case where we enter generic_file_buffered_read() with data
    already transferred, but we also need to be careful not to allow an async
    page lock if we're looping transferring data. If not, we could be
    returning -EIOCBQUEUED instead of the transferred amount, and it could
    result in double waitqueue additions as well.

    Cc: stable@vger.kernel.org # v5.9
    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Nov, 2020

1 commit


15 Nov, 2020

6 commits

  • Qian Cai reported the following BUG in [1]

    LTP: starting move_pages12
    BUG: unable to handle page fault for address: ffffffffffffffe0
    ...
    RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
    Call Trace:
    rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
    try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
    migrate_pages+0x1005/0x1fb0
    move_pages_and_store_status.isra.47+0xd7/0x1a0
    __x64_sys_move_pages+0xa5c/0x1100
    do_syscall_64+0x5f/0x310
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Hugh Dickins diagnosed this as a migration bug caused by code introduced
    to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
    routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
    flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
    anon pages as the anon_vma_lock should be held in this case. Further
    analysis suggested that i_mmap_rwsem was not required to be held at all
    when calling try_to_unmap for anon pages, as an anon page could never be
    part of a shared pmd mapping.

    Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
    to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
    keep mapping valid while dropping page lock.

    This patch does the following:

    - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
    calling try_to_unmap.

    - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
    will now simply do a 'trylock' while still holding the page lock. If
    the trylock fails, it will return NULL. This could impact the
    callers:

    - migration calling code will receive -EAGAIN and retry up to the
    hard coded limit (10).

    - memory error code will treat the page as BUSY. This will force
    killing (SIGKILL) instead of SIGBUS any mapping tasks.

    Do note that this change in behavior only happens when there is a
    race. None of the standard kernel testing suites actually hit this
    race, but it is possible.
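
    A sketch of the resulting logic (illustrative; simplified from
    hugetlb_page_mapping_lock_write() and unmap_and_move_huge_page(), not the
    exact patch):

        struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
        {
                struct address_space *mapping = page_mapping(hpage);

                if (!mapping)
                        return mapping;

                /* Only trylock while still holding the page lock; no more
                 * dropping and re-taking the page lock around i_mmap_rwsem. */
                if (i_mmap_trylock_write(mapping))
                        return mapping;

                return NULL;
        }

        /* In unmap_and_move_huge_page(): take i_mmap_rwsem for file pages only. */
        if (!PageAnon(hpage)) {
                mapping = hugetlb_page_mapping_lock_write(hpage);
                if (unlikely(!mapping))
                        goto out;       /* trylock failed: treat the page as busy */

                try_to_unmap(hpage, TTU_MIGRATION | TTU_IGNORE_MLOCK |
                                    TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED);
                i_mmap_unlock_write(mapping);
        } else {
                try_to_unmap(hpage, TTU_MIGRATION | TTU_IGNORE_MLOCK |
                                    TTU_IGNORE_ACCESS);
        }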

    [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
    [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Reported-by: Qian Cai
    Suggested-by: Hugh Dickins
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc:
    Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • When FOLL_PIN is passed to __get_user_pages() the page list must be put
    back using unpin_user_pages() otherwise the page pin reference persists
    in a corrupted state.

    There are two places in the unwind of __gup_longterm_locked() that put
    the pages back without checking. Normally on error this function would
    return the partial page list making this the caller's responsibility,
    but in these two cases the caller is not allowed to see these pages at
    all.
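
    The required unwind, as a sketch (the helper name is hypothetical; only
    unpin_user_pages() and put_page() are the real APIs):

        static void put_back_pages(struct page **pages, unsigned long npages,
                                   unsigned int gup_flags)
        {
                unsigned long i;

                if (gup_flags & FOLL_PIN) {
                        /* FOLL_PIN references must be dropped via unpin_user_pages() */
                        unpin_user_pages(pages, npages);
                } else {
                        for (i = 0; i < npages; i++)
                                put_page(pages[i]);
                }
        }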

    Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
    Reported-by: Ira Weiny
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: John Hubbard
    Cc: Aneesh Kumar K.V
    Cc: Dan Williams
    Cc:
    Link: https://lkml.kernel.org/r/0-v2-3ae7d9d162e2+2a7-gup_cma_fix_jgg@nvidia.com
    Signed-off-by: Linus Torvalds

    Jason Gunthorpe
     
  • While doing memory hot-unplug operation on a PowerPC VM running 1024 CPUs
    with 11TB of ram, I hit the following panic:

    BUG: Kernel NULL pointer dereference on read at 0x00000007
    Faulting instruction address: 0xc000000000456048
    Oops: Kernel access of bad area, sig: 11 [#2]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp
    CPU: 160 PID: 1 Comm: systemd Tainted: G D 5.9.0 #1
    NIP: c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
    REGS: c00006028d1b77a0 TRAP: 0300 Tainted: G D (5.9.0)
    MSR: 8000000000009033 CR: 24004228 XER: 00000000
    CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
    GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
    GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
    GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
    GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
    GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
    GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
    GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
    GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
    NIP [c000000000456048] __kmalloc_node+0x108/0x790
    LR [c000000000455fd4] __kmalloc_node+0x94/0x790
    Call Trace:
    kvmalloc_node+0x58/0x110
    mem_cgroup_css_online+0x10c/0x270
    online_css+0x48/0xd0
    cgroup_apply_control_enable+0x2c4/0x470
    cgroup_mkdir+0x408/0x5f0
    kernfs_iop_mkdir+0x90/0x100
    vfs_mkdir+0x138/0x250
    do_mkdirat+0x154/0x1c0
    system_call_exception+0xf8/0x200
    system_call_common+0xf0/0x27c
    Instruction dump:
    e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
    2fbc0000 419e0018 41920230 e9270010 7f994800 419e0220 7ee6bb78

    This pointing to the following code:

    mm/slub.c:2851
    if (unlikely(!object || !node_match(page, node))) {
    c000000000456038: 00 00 bc 2f cmpdi cr7,r28,0
    c00000000045603c: 18 00 9e 41 beq cr7,c000000000456054
    node_match():
    mm/slub.c:2491
    if (node != NUMA_NO_NODE && page_to_nid(page) != node)
    c000000000456040: 30 02 92 41 beq cr4,c000000000456270
    page_to_nid():
    include/linux/mm.h:1294
    c000000000456044: 10 00 27 e9 ld r9,16(r7)
    c000000000456048: 07 00 29 89 lbz r9,7(r9) <<<< r9 = NULL
    node_match():
    mm/slub.c:2491
    c00000000045604c: 00 48 99 7f cmpw cr7,r25,r9
    c000000000456050: 20 02 9e 41 beq cr7,c000000000456270

    The panic occurred in slab_alloc_node() when checking for the page's node:

    object = c->freelist;
    page = c->page;
    if (unlikely(!object || !node_match(page, node))) {
    object = __slab_alloc(s, gfpflags, node, addr, c);
    stat(s, ALLOC_SLOWPATH);

    The issue is that object is not NULL while page is NULL which is odd but
    may happen if the cache flush happened after loading object but before
    loading page. Thus checking for the page pointer is required too.

    The cache flush is done through an inter processor interrupt when a
    piece of memory is off-lined. That interrupt is triggered when a memory
    hot-unplug operation is initiated and offline_pages() is calling the
    slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback()
    which is calling flush_cpu_slab(). If that interrupt is caught between
    the reading of c->freelist and the reading of c->page, this could lead
    to such a situation. That situation is expected and the later call to
    this_cpu_cmpxchg_double() will detect the change to c->freelist and redo
    the whole operation.

    In commit 6159d0f5c03e ("mm/slub.c: page is always non-NULL in
    node_match()") the check on the page pointer was removed, assuming that
    page is always valid when node_match() is called. It happens that this
    is not true in that particular case, so check page before calling
    node_match() here.
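
    The described check, in sketch form (slab_alloc_node() fast path;
    illustrative, not necessarily the exact patch):

        object = c->freelist;
        page = c->page;
        /*
         * A cpu-slab flush IPI can land between the two loads above, leaving
         * object non-NULL while page is NULL, so check page as well before
         * node_match() dereferences it.
         */
        if (unlikely(!object || !page || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
                stat(s, ALLOC_SLOWPATH);
        }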

    Fixes: 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()")
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Cc: Wei Yang
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • Previously the negated unsigned long would be cast back to signed long
    which would have the correct negative value. After commit 730ec8c01a2b
    ("mm/vmscan.c: change prototype for shrink_page_list"), the large
    unsigned int converts to a large positive signed long.
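
    A small illustration of the pitfall (standalone example, not kernel code):

        unsigned long nr_ul = 5;
        unsigned int  nr_ui = 5;

        long a = -nr_ul;        /* wraps to ULONG_MAX - 4, converts back to -5 */
        long b = -nr_ui;        /* computed as unsigned int (4294967291), then
                                   widens to a large *positive* long */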

    Symptoms include CMA allocations hanging forever holding the cma_mutex
    due to alloc_contig_range->...->isolate_migratepages_block waiting
    forever in "while (unlikely(too_many_isolated(pgdat)))".

    [akpm@linux-foundation.org: fix -stat.nr_lazyfree_fail as well, per Michal]

    Fixes: 730ec8c01a2b ("mm/vmscan.c: change prototype for shrink_page_list")
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vaneet Narang
    Cc: Maninder Singh
    Cc: Amit Sahrawat
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc:
    Link: https://lkml.kernel.org/r/20201029032320.1448441-1-npiggin@gmail.com
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • In isolate_migratepages_block, if we have too many isolated pages and
    nr_migratepages is not zero, we should try to migrate what we have
    without wasting time on isolating.

    In theory it's possible that multiple parallel compactions will cause
    too_many_isolated() to become true even if each has isolated less than
    COMPACT_CLUSTER_MAX, and loop forever in the while loop. Bailing
    immediately prevents that.
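
    In sketch form (illustrative; based on the too_many_isolated() loop at the
    top of isolate_migratepages_block(), not necessarily the exact patch):

        while (unlikely(too_many_isolated(pgdat))) {
                /* stop isolating if there are still pages not migrated */
                if (cc->nr_migratepages)
                        return 0;       /* go migrate what we already have */

                /* async migration should just abort */
                if (cc->mode == MIGRATE_ASYNC)
                        return 0;

                congestion_wait(BLK_RW_ASYNC, HZ/10);

                if (fatal_signal_pending(current))
                        return 0;
        }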

    [vbabka@suse.cz: changelog addition]

    Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
    Suggested-by: Vlastimil Babka
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Cc:
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Yang Shi
    Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • In isolate_migratepages_block, when cc->alloc_contig is true, we are
    able to isolate compound pages. But nr_migratepages and nr_isolated did
    not count compound pages correctly, causing us to isolate more pages
    than we thought.

    So count compound pages as the number of base pages they contain.
    Otherwise, we might be trapped in too_many_isolated while loop, since
    the actual isolated pages can go up to COMPACT_CLUSTER_MAX*512=16384,
    where COMPACT_CLUSTER_MAX is 32, since we stop isolation after
    cc->nr_migratepages reaches to COMPACT_CLUSTER_MAX.

    In addition, after we fix the issue above, cc->nr_migratepages could
    never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
    thus page isolation could not stop as we intended. Change the isolation
    stop condition to '>='.
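
    In sketch form (illustrative; based on isolate_migratepages_block(), not
    necessarily the exact patch):

        isolate_success:
                list_add(&page->lru, &cc->migratepages);
                /* count a compound page as the base pages it contains */
                cc->nr_migratepages += compound_nr(page);
                nr_isolated += compound_nr(page);

                /* '>=': compound pages can overshoot COMPACT_CLUSTER_MAX */
                if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
                    !cc->rescan && !cc->contended) {
                        ++low_pfn;
                        break;
                }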

    The issue can be triggered as follows:

    In a system with 16GB memory and an 8GB CMA region reserved by
    hugetlb_cma, if we first allocate 10GB THPs and mlock them (so some THPs
    are allocated in the CMA region and mlocked), reserving 6 1GB hugetlb
    pages via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will
    get stuck (looping in too_many_isolated function) until we kill either
    task. With the patch applied, oom will kill the application with 10GB
    THPs and let hugetlb page reservation finish.

    [ziy@nvidia.com: v3]

    Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
    Fixes: 1da2f328fa64 ("cmm,thp,compaction,cma: allow THP migration for CMA allocations")
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
     

03 Nov, 2020

6 commits

  • Fix the following sparse warning:

    mm/truncate.c:531:15: warning: symbol '__invalidate_mapping_pages' was not declared. Should it be static?

    Fixes: eb1d7a65f08a ("mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED")
    Signed-off-by: Jason Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Yafang Shao
    Link: https://lkml.kernel.org/r/20201015054808.2445904-1-yanaijie@huawei.com
    Signed-off-by: Linus Torvalds

    Jason Yan
     
  • When the flags in queue_pages_pte_range don't have the MPOL_MF_MOVE or
    MPOL_MF_MOVE_ALL bits set, the code breaks out of the loop, and passing
    the original pte - 1 to pte_unmap_unlock seems like not a good idea.

    queue_pages_pte_range can run in MPOL_MF_STRICT mode, which doesn't
    migrate misplaced pages but returns -EIO when encountering such a page.
    Since commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
    MPOL_MF_STRICT is specified"), an early break on the first pte in the
    range results in pte_unmap_unlock on an underflow pte. This can lead to
    lockups later on when somebody tries to lock the pte resp. the
    page_table_lock again.
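
    A sketch of the fix (illustrative; remember the originally mapped pte and
    unlock with it, rather than with the loop cursor):

        static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
                                         unsigned long end, struct mm_walk *walk)
        {
                pte_t *pte, *mapped_pte;
                spinlock_t *ptl;
                /* ... */

                mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
                for (; addr != end; pte++, addr += PAGE_SIZE) {
                        /* ... may 'break' early without visiting every pte ... */
                }
                pte_unmap_unlock(mapped_pte, ptl);      /* not 'pte - 1' */
                cond_resched();

                return 0;
        }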

    Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Miaohe Lin
    Cc: Feilong Lin
    Cc: Shijie Luo
    Cc:
    Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com
    Signed-off-by: Linus Torvalds

    Shijie Luo
     
  • Richard reported a warning which can be reproduced by running the LTP
    madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

    WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Modules linked in:
    CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
    Workqueue: events drain_local_stock
    RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Call Trace:
    __memcg_kmem_uncharge (mm/memcontrol.c:3022)
    drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
    drain_local_stock (mm/memcontrol.c:2255)
    process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
    worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:292)
    ret_from_fork (arch/x86/entry/entry_64.S:300)

    The problem occurs because in the non-hierarchical mode non-root page
    counters are not linked to root page counters, so the charge is not
    propagated to the root memory cgroup.

    After the removal of the original memory cgroup and reparenting of the
    object cgroup, the root cgroup might be uncharged by draining an objcg
    stock, for example. This leads to an eventual underflow of the charge
    and triggers a warning.

    Fix it by linking all page counters to corresponding root page counters
    in the non-hierarchical mode.
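
    A rough sketch of the idea in mem_cgroup_css_alloc() (illustrative, not
    necessarily the exact patch):

        if (parent && parent->use_hierarchy) {
                page_counter_init(&memcg->memory, &parent->memory);
                /* ... swap, kmem, tcpmem likewise ... */
        } else if (parent) {
                /*
                 * Non-hierarchical mode: link to the root counters so charges
                 * still propagate to root and a later uncharge (e.g. after
                 * objcg reparenting) cannot underflow the root counters.
                 */
                page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
                /* ... swap, kmem, tcpmem likewise ... */
        } else {
                /* the root memory cgroup itself has no parent counters */
                page_counter_init(&memcg->memory, NULL);
        }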

    Please note, that in the non-hierarchical mode all objcgs are always
    reparented to the root memory cgroup, even if the hierarchy has more
    than 1 level. This patch doesn't change it.

    The patch also doesn't affect how the hierarchical mode is working,
    which is the only sane and truly supported mode now.

    Thanks to Richard for reporting, debugging and providing an alternative
    version of the fix!

    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Reported-by:
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Michal Koutný
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201026231326.3212225-1-guro@fb.com
    Debugged-by: Richard Palethorpe
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • memcg_page_state will get the specified number for the hierarchical
    memcg. If the item is NR_ANON_THPS, the value counts huge pages and
    should be multiplied by HPAGE_PMD_NR rather than treated as single
    pages.
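
    In sketch form (illustrative; the v1 "total_*" output in memcg_stat_show()
    is assumed, which may not be exactly where the actual patch applies):

        nr = memcg_page_state(memcg, memcg1_stats[i]);
        if (memcg1_stats[i] == NR_ANON_THPS)
                nr *= HPAGE_PMD_NR;     /* the counter holds THPs, not base pages */
        seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
                   (u64)nr * PAGE_SIZE);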

    [akpm@linux-foundation.org: fix printk warning]
    [akpm@linux-foundation.org: use u64 cast, per Michal]

    Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
    Signed-off-by: zhongjiang-ali
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/1603722395-72443-1-git-send-email-zhongjiang-ali@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    zhongjiang-ali
     
  • Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon
    with hugetlbfs and hit the warning below. QEMU with free page hinting
    uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported
    as free by a VM. The reporting granularity is the pageblock size. So
    when the guest reports 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE)
    one huge page in QEMU.

    WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
    Modules linked in: ...
    CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
    Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
    RIP: 0010:page_counter_uncharge+0x4b/0x50
    ...
    Call Trace:
    hugetlb_cgroup_uncharge_file_region+0x4b/0x80
    region_del+0x1d3/0x300
    hugetlb_unreserve_pages+0x39/0xb0
    remove_inode_hugepages+0x1a8/0x3d0
    hugetlbfs_fallocate+0x3c4/0x5c0
    vfs_fallocate+0x146/0x290
    __x64_sys_fallocate+0x3e/0x70
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Investigation of the issue uncovered bugs in hugetlb cgroup reservation
    accounting; this patch addresses the issues that were found.
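
    For context, the QEMU-side operation that triggers this path is simply
    a hole punch on the hugetlbfs-backed guest memory. A minimal userspace
    illustration (the file path and the 2M size are hypothetical stand-ins):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                /* Hypothetical hugetlbfs-backed file standing in for guest RAM. */
                int fd = open("/dev/hugepages/guest-ram", O_RDWR);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                /* Discard one 2M huge page at offset 0; PUNCH_HOLE must be
                 * combined with KEEP_SIZE. */
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              0, 2 * 1024 * 1024) < 0)
                        perror("fallocate");

                close(fd);
                return 0;
        }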

    Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings")
    Reported-by: Michal Privoznik
    Co-developed-by: David Hildenbrand
    Signed-off-by: David Hildenbrand
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Michal Privoznik
    Reviewed-by: Mina Almasry
    Acked-by: Michael S. Tsirkin
    Cc:
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Muchun Song
    Cc: "Aneesh Kumar K . V"
    Cc: Tejun Heo
    Link: https://lkml.kernel.org/r/20201021204426.36069-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • commit 6f42193fd86e ("memremap: don't use a separate devm action for
    devmap_managed_enable_get") changed the static key updates such that we
    now call devmap_managed_enable_put() without doing the equivalent
    devmap_managed_enable_get().

    devmap_managed_enable_get() is only called for MEMORY_DEVICE_PRIVATE and
    MEMORY_DEVICE_FS_DAX, but memunmap_pages() gets called for other pgmap
    types too. This results in the warning below when switching between
    system-ram and devdax mode for a devdax namespace.

    jump label: negative count!
    WARNING: CPU: 52 PID: 1335 at kernel/jump_label.c:235 static_key_slow_try_dec+0x88/0xa0
    Modules linked in:
    ....

    NIP static_key_slow_try_dec+0x88/0xa0
    LR static_key_slow_try_dec+0x84/0xa0
    Call Trace:
    static_key_slow_try_dec+0x84/0xa0
    __static_key_slow_dec_cpuslocked+0x34/0xd0
    static_key_slow_dec+0x54/0xf0
    memunmap_pages+0x36c/0x500
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    bus_remove_device+0x124/0x210
    device_del+0x1d4/0x530
    unregister_dev_dax+0x48/0xe0
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    unbind_store+0x130/0x170
    drv_attr_store+0x40/0x60
    sysfs_kf_write+0x6c/0xb0
    kernfs_fop_write+0x118/0x280
    vfs_write+0xe8/0x2a0
    ksys_write+0x84/0x140
    system_call_exception+0x120/0x270
    system_call_common+0xf0/0x27c
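
    The fix (sketched here, not the exact upstream diff) is to make the put
    side conditional on the same pgmap types that take the static key
    reference:

        static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
        {
                /* Only drop the key for the types that actually took it. */
                if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
                    pgmap->type == MEMORY_DEVICE_FS_DAX)
                        static_branch_dec(&devmap_managed_key);
        }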

    Reported-by: Aneesh Kumar K.V
    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Ira Weiny
    Reviewed-by: Christoph Hellwig
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Link: https://lkml.kernel.org/r/20201023183222.13186-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

31 Oct, 2020

1 commit


28 Oct, 2020

2 commits

  • With e.g. m68k/defconfig:

    mm/process_vm_access.c: In function ‘process_vm_rw’:
    mm/process_vm_access.c:277:5: error: implicit declaration of function ‘in_compat_syscall’ [-Werror=implicit-function-declaration]
    277 | in_compat_syscall());
    | ^~~~~~~~~~~~~~~~~

    Fix this by adding #include <linux/compat.h>.

    Reported-by: noreply@ellerman.id.au
    Reported-by: damian
    Reported-by: Naresh Kamboju
    Fixes: 38dc5079da7081e8 ("Fix compat regression in process_vm_rw()")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • The removal of compat_process_vm_{readv,writev} didn't change
    process_vm_rw(), which always assumes it's not doing a compat syscall.

    Instead of passing in 'false' unconditionally for 'compat', make it
    conditional on in_compat_syscall().

    [ Both Al and Christoph point out that trying to access a 64-bit process
    from a 32-bit one cannot work anyway, and is likely better prohibited,
    but that's a separate issue - Linus ]
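
    The change itself is essentially a one-liner in process_vm_rw();
    assuming the iovec_from_user() call site for the remote iovec, it
    amounts to:

        /* before: the remote iovec was always imported as native */
        iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r,
                                false);

        /* after: honour the compat status of the calling task */
        iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r,
                                in_compat_syscall());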

    Fixes: c3973b401ef2 ("mm: remove compat_process_vm_{readv,writev}")
    Reported-and-tested-by: Kyle Huey
    Signed-off-by: Jens Axboe
    Acked-by: Al Viro
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

24 Oct, 2020

1 commit

  • Pull clone/dedupe/remap code refactoring from Darrick Wong:
    "Move the generic file range remap (aka reflink and dedupe) functions
    out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to
    reduce clutter in the first two files"

    * tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: move the generic write and copy checks out of mm
    vfs: move the remap range helpers to remap_range.c
    vfs: move generic_remap_checks out of mm

    Linus Torvalds
     

21 Oct, 2020

2 commits

  • Pull XArray updates from Matthew Wilcox:

    - Fix the test suite after introduction of the local_lock

    - Fix a bug in the IDA spotted by Coverity

    - Change the API that allows the workingset code to delete a node

    - Fix xas_reload() when dealing with entries that occupy multiple
    indices

    - Add a few more tests to the test suite

    - Fix an unsigned int being shifted into an unsigned long

    * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
    XArray: Fix xas_create_range for ranges above 4 billion
    radix-tree: fix the comment of radix_tree_next_slot()
    XArray: Fix xas_reload for multi-index entries
    XArray: Add private interface for workingset node deletion
    XArray: Fix xas_for_each_conflict documentation
    XArray: Test marked multiorder iterations
    XArray: Test two more things about xa_cmpxchg
    ida: Free allocated bitmap in error path
    radix tree test suite: Fix compilation

    Linus Torvalds
     
  • Pull io_uring updates from Jens Axboe:
    "A mix of fixes and a few stragglers. In detail:

    - Revert the bogus __read_mostly that we discussed for the initial
    pull request.

    - Fix a merge window regression with fixed file registration error
    path handling.

    - Fix io-wq numa node affinities.

    - Series abstracting out an io_identity struct, making it both easier
    to see what the personality items are, and also easier to adopt
    more. Use this to cover audit logging.

    - Fix for read-ahead disabled block condition in async buffered
    reads, and using single-page read-ahead to unify which
    generic_file_buffer_read() path is used.

    - Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel)

    - Poll fix (Pavel)"

    * tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits)
    io_uring: use blk_queue_nowait() to check if NOWAIT supported
    mm: use limited read-ahead to satisfy read
    mm: mark async iocb read as NOWAIT once some data has been copied
    io_uring: fix double poll mask init
    io-wq: inherit audit loginuid and sessionid
    io_uring: use percpu counters to track inflight requests
    io_uring: assign new io_identity for task if members have changed
    io_uring: store io_identity in io_uring_task
    io_uring: COW io_identity on mismatch
    io_uring: move io identity items into separate struct
    io_uring: rely solely on work flags to determine personality.
    io_uring: pass required context in as flags
    io-wq: assign NUMA node locality if appropriate
    io_uring: fix error path cleanup in io_sqe_files_register()
    Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly"
    io_uring: fix REQ_F_COMP_LOCKED by killing it
    io_uring: dig out COMP_LOCK from deep call chain
    io_uring: don't put a poll req under spinlock
    io_uring: don't unnecessarily clear F_LINK_TIMEOUT
    io_uring: don't set COMP_LOCKED if won't put
    ...

    Linus Torvalds