30 Sep, 2022

1 commit

  • This is the 5.15.71 stable release

    * tag 'v5.15.71': (144 commits)
    Linux 5.15.71
    ext4: use locality group preallocation for small closed files
    ext4: avoid unnecessary spreading of allocations among groups
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/net/phy/aquantia_main.c
    drivers/tty/serial/fsl_lpuart.c

    Jason Liu
     

28 Sep, 2022

3 commits

  • commit e45cc288724f0cfd497bb5920bcfa60caa335729 upstream.

    Commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations
    __free_slab() invocations out of IRQ context") moved all flush_cpu_slab()
    invocations to the global workqueue to avoid a problem related
    to deactivate_slab()/__free_slab() being called from an IRQ context
    on PREEMPT_RT kernels.

    When the flush_all_cpus_locked() function is called from a task context
    it may happen that a workqueue with the WQ_MEM_RECLAIM bit set ends up
    flushing the global workqueue; this causes a dependency issue.

    workqueue: WQ_MEM_RECLAIM nvme-delete-wq:nvme_delete_ctrl_work [nvme_core]
    is flushing !WQ_MEM_RECLAIM events:flush_cpu_slab
    WARNING: CPU: 37 PID: 410 at kernel/workqueue.c:2637
    check_flush_dependency+0x10a/0x120
    Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
    RIP: 0010:check_flush_dependency+0x10a/0x120
    Call Trace:
    __flush_work.isra.0+0xbf/0x220
    ? __queue_work+0x1dc/0x420
    flush_all_cpus_locked+0xfb/0x120
    __kmem_cache_shutdown+0x2b/0x320
    kmem_cache_destroy+0x49/0x100
    bioset_exit+0x143/0x190
    blk_release_queue+0xb9/0x100
    kobject_cleanup+0x37/0x130
    nvme_fc_ctrl_free+0xc6/0x150 [nvme_fc]
    nvme_free_ctrl+0x1ac/0x2b0 [nvme_core]

    Fix this bug by creating a workqueue for the flush operation with
    the WQ_MEM_RECLAIM bit set.
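
    A minimal sketch of that approach (names such as slub_flushwq are
    illustrative, not necessarily those used in the actual patch):

    #include <linux/workqueue.h>

    static struct workqueue_struct *flushwq;

    static int __init slub_flushwq_init(void)
    {
            /* a dedicated WQ_MEM_RECLAIM workqueue may be flushed by other
             * WQ_MEM_RECLAIM workqueues without tripping the dependency check */
            flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM, 0);
            return flushwq ? 0 : -ENOMEM;
    }

    /* the per-CPU flush work is then queued on flushwq instead of the
     * global system workqueue, e.g. queue_work_on(cpu, flushwq, work). */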

    Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
    Cc:
    Signed-off-by: Maurizio Lombardi
    Reviewed-by: Hyeonggon Yoo
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Maurizio Lombardi
     
  • commit 7e9c323c52b379d261a72dc7bd38120a761a93cd upstream.

    In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to
    out-of-memory; if it fails, return an error code rather than
    triggering a panic via BUG_ON();

    kernel BUG at mm/slub.c:5893!
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP

    Call trace:
    sysfs_slab_add+0x258/0x260 mm/slub.c:5973
    __kmem_cache_create+0x60/0x118 mm/slub.c:4899
    create_cache mm/slab_common.c:229 [inline]
    kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335
    kmem_cache_create+0x1c/0x28 mm/slab_common.c:390
    f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline]
    f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808
    f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149
    mount_bdev+0x1b8/0x210 fs/super.c:1400
    f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512
    legacy_get_tree+0x30/0x74 fs/fs_context.c:610
    vfs_get_tree+0x40/0x140 fs/super.c:1530
    do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040
    path_mount+0x358/0x914 fs/namespace.c:3370
    do_mount fs/namespace.c:3383 [inline]
    __do_sys_mount fs/namespace.c:3591 [inline]
    __se_sys_mount fs/namespace.c:3568 [inline]
    __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568
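
    A minimal sketch of the error-return shape described above (the buffer
    length `len` is a placeholder and the exact return convention in
    create_unique_id() may differ):

    char *name = kmalloc(len, GFP_KERNEL);

    if (!name)
            return ERR_PTR(-ENOMEM);        /* previously: BUG_ON(!name) */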

    Cc:
    Fixes: 81819f0fc8285 ("SLUB core")
    Reported-by: syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com
    Reviewed-by: Muchun Song
    Reviewed-by: Hyeonggon Yoo
    Signed-off-by: Chao Yu
    Acked-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • commit 5373b8a09d6e037ee0587cb5d9fe4cc09077deeb upstream.

    We were failing to call kasan_malloc() from __kmalloc_*track_caller()
    which was causing us to sometimes fail to produce KASAN error reports
    for allocations made using e.g. devm_kcalloc(), as the KASAN poison was
    not being initialized. Fix it.

    Signed-off-by: Peter Collingbourne
    Cc: # 5.15
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Peter Collingbourne
     

27 Sep, 2022

1 commit

  • This is the 5.15.70 stable release

    * tag 'v5.15.70': (2444 commits)
    Linux 5.15.70
    ALSA: hda/sigmatel: Fix unused variable warning for beep power change
    cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx6ul.dtsi
    arch/arm/mm/mmu.c
    arch/arm64/boot/dts/freescale/imx8mp-evk.dts
    drivers/gpu/drm/imx/dcss/dcss-kms.c
    drivers/media/platform/nxp/imx-jpeg/mxc-jpeg.c
    drivers/media/platform/nxp/imx-jpeg/mxc-jpeg.h
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
    drivers/soc/fsl/Kconfig
    drivers/soc/imx/gpcv2.c
    drivers/usb/dwc3/host.c
    net/dsa/slave.c
    sound/soc/fsl/imx-card.c

    Jason Liu
     

20 Sep, 2022

1 commit

  • This is a stable-specific patch.
    I botched the stable-specific rewrite of
    commit b67fbebd4cf98 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"):
    As Hugh pointed out, unmap_region() actually operates on a list of VMAs,
    and the variable "vma" merely points to the first VMA in that list.
    So if we want to check whether any of the VMAs we're operating on is
    PFNMAP or MIXEDMAP, we have to iterate through the list and check each VMA.
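
    A hedged sketch of that iteration (a simplification using the 5.15 VMA
    list, not the literal stable patch):

    struct vm_area_struct *cur;
    bool force_flush = false;

    /* "vma" is only the first VMA of the region being unmapped */
    for (cur = vma; cur; cur = cur->vm_next) {
            if (cur->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
                    force_flush = true;
                    break;
            }
    }
    /* if force_flush is set, flush the TLB before free_pgtables() runs */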

    Signed-off-by: Jann Horn
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

15 Sep, 2022

1 commit

  • This reverts commit 23c2d497de21f25898fbea70aeb292ab8acc8c94.

    Commit 23c2d497de21 ("mm: kmemleak: take a full lowmem check in
    kmemleak_*_phys()") brought false leak alarms on some archs like arm64
    that do not init the pfn boundary during early boot. The final solution
    landed in linux-6.0: commit 0c24e061196c ("mm: kmemleak: add rbtree and
    store physical address for objects allocated with PA").

    Revert this commit before linux-6.0. The original issue of an invalid PA
    can be mitigated by an additional check in devicetree.

    The false alarm report is as follows: Kmemleak output: (Qemu/arm64)
    unreferenced object 0xffff0000c0170a00 (size 128):
    comm "swapper/0", pid 1, jiffies 4294892404 (age 126.208s)
    hex dump (first 32 bytes):
    62 61 73 65 00 00 00 00 00 00 00 00 00 00 00 00 base............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] __kmalloc_track_caller+0x1b0/0x2e4
    [] kstrdup_const+0x8c/0xc4
    [] kvasprintf_const+0xbc/0xec
    [] kobject_set_name_vargs+0x58/0xe4
    [] kobject_add+0x84/0x100
    [] __of_attach_node_sysfs+0x78/0xec
    [] of_core_init+0x68/0x104
    [] driver_init+0x28/0x48
    [] do_basic_setup+0x14/0x28
    [] kernel_init_freeable+0x110/0x178
    [] kernel_init+0x20/0x1a0
    [] ret_from_fork+0x10/0x20

    This patch is also applicable to linux-5.17.y/linux-5.18.y/linux-5.19.y

    Cc:
    Signed-off-by: Yee Lee
    Signed-off-by: Greg Kroah-Hartman

    Yee Lee
     

08 Sep, 2022

1 commit

  • [ Upstream commit 8782fb61cc848364e1e1599d76d3c9dd58a1cc06 ]

    The mmap lock protects the page walker from changes to the page tables
    during the walk. However a read lock is insufficient to protect those
    areas which don't have a VMA as munmap() detaches the VMAs before
    downgrading to a read lock and actually tearing down PTEs/page tables.

    For users of walk_page_range() the solution is to simply call pte_hole()
    immediately without checking the actual page tables when a VMA is not
    present. We now never call __walk_page_range() without a valid vma.
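
    A hedged sketch of that behaviour in walk_page_range()'s loop
    (illustrative, not a literal diff):

    vma = find_vma(walk.mm, start);
    if (!vma) {
            /* unmapped gap: report it via pte_hole() without walking page
             * tables that a concurrent munmap() may already be freeing */
            if (ops->pte_hole)
                    err = ops->pte_hole(start, next, -1, &walk);
    } else {
            err = __walk_page_range(start, next, &walk);
    }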

    For walk_page_range_novma() the locking requirements are tightened to
    require the mmap write lock to be taken, and then walking the pgd
    directly with 'no_vma' set.

    This in turn means that all page walkers either have a valid vma, or
    it's that special 'novma' case for page table debugging. As a result,
    all the odd '(!walk->vma && !walk->no_vma)' tests can be removed.

    Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
    Reported-by: Jann Horn
    Signed-off-by: Steven Price
    Cc: Vlastimil Babka
    Cc: Thomas Hellström
    Cc: Konstantin Khlebnikov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Steven Price
     

05 Sep, 2022

3 commits

  • commit 2555283eb40df89945557273121e9393ef9b542b upstream.

    anon_vma->degree tracks the combined number of child anon_vmas and VMAs
    that use the anon_vma as their ->anon_vma.

    anon_vma_clone() then assumes that for any anon_vma attached to
    src->anon_vma_chain other than src->anon_vma, it is impossible for it to
    be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
    elevated by 1 because of a child anon_vma, meaning that if ->degree
    equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.

    This assumption is wrong because the ->degree optimization leads to leaf
    nodes being abandoned on anon_vma_clone() - an existing anon_vma is
    reused and no new parent-child relationship is created. So it is
    possible to reuse an anon_vma for one VMA while it is still tied to
    another VMA.

    This is an issue because is_mergeable_anon_vma() and its callers assume
    that if two VMAs have the same ->anon_vma, the list of anon_vmas
    attached to the VMAs is guaranteed to be the same. When this assumption
    is violated, vma_merge() can merge pages into a VMA that is not attached
    to the corresponding anon_vma, leading to dangling page->mapping
    pointers that will be dereferenced during rmap walks.

    Fix it by separately tracking the number of child anon_vmas and the
    number of VMAs using the anon_vma as their ->anon_vma.
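
    A hedged sketch of the split bookkeeping (field names are illustrative;
    the actual patch may name them differently):

    struct anon_vma {
            /* ... existing fields ... */
            unsigned long num_children;     /* number of child anon_vmas */
            unsigned long num_active_vmas;  /* VMAs whose ->anon_vma is this one */
    };

    /* anon_vma_clone() may then only reuse an existing anon_vma when it has
     * no children and no VMA still uses it as its ->anon_vma. */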

    Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy")
    Cc: stable@kernel.org
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit ab74ef708dc51df7cf2b8a890b9c6990fac5c0c6 upstream.

    In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
    cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
    for them mistakenly because they're not vm_shared. This will corrupt the
    page->mapping used by page cache code.

    Link: https://lkml.kernel.org/r/20220712130542.18836-1-linmiaohe@huawei.com
    Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
    Signed-off-by: Miaohe Lin
    Reviewed-by: Mike Kravetz
    Cc: Axel Rasmussen
    Cc: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Axel Rasmussen
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit b67fbebd4cf980aecbcc750e1462128bffe8ae15 upstream.

    Some drivers rely on having all VMAs through which a PFN might be
    accessible listed in the rmap for correctness.
    However, on X86, it was possible for a VMA with stale TLB entries
    to not be listed in the rmap.

    This was fixed in mainline with
    commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"),
    but that commit relies on preceding refactoring in
    commit 18ba064e42df3 ("mmu_gather: Let there be one tlb_{start,end}_vma()
    implementation") and commit 1e9fdf21a4339 ("mmu_gather: Remove per arch
    tlb_{start,end}_vma()").

    This patch provides equivalent protection without needing that
    refactoring, by forcing a TLB flush between removing PTEs in
    unmap_vmas() and the call to unlink_file_vma() in free_pgtables().

    [This is a stable-specific rewrite of the upstream commit!]
    Signed-off-by: Jann Horn
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

31 Aug, 2022

4 commits

  • commit f96f7a40874d7c746680c0b9f57cef2262ae551f upstream.

    Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2.

    I observed that hugetlb does not support/expect write-faults in shared
    mappings that would have to map the R/O-mapped page writable -- and I
    found two cases where we could currently get such faults and would
    erroneously map an anon page into a shared mapping.

    Reproducers are part of the patches.

    I propose to backport both fixes to stable trees. The first fix needs a
    small adjustment.

    This patch (of 2):

    Staring at hugetlb_wp(), one might wonder where all the logic for shared
    mappings is when stumbling over a write-protected page in a shared
    mapping. In fact, there is none, and so far we thought we could get away
    with that because e.g., mprotect() should always do the right thing and
    map all pages directly writable.

    Looks like we were wrong:

    --------------------------------------------------------------------------
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define HUGETLB_SIZE (2 * 1024 * 1024u)

    static void clear_softdirty(void)
    {
            int fd = open("/proc/self/clear_refs", O_WRONLY);
            const char *ctrl = "4";
            int ret;

            if (fd < 0) {
                    fprintf(stderr, "open(clear_refs) failed\n");
                    exit(1);
            }
            ret = write(fd, ctrl, strlen(ctrl));
            if (ret != strlen(ctrl)) {
                    fprintf(stderr, "write(clear_refs) failed\n");
                    exit(1);
            }
            close(fd);
    }

    int main(int argc, char **argv)
    {
            char *map;
            int fd;

            fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT, 0600);
            if (fd < 0) {
                    fprintf(stderr, "open() failed\n");
                    return -errno;
            }
            if (ftruncate(fd, HUGETLB_SIZE)) {
                    fprintf(stderr, "ftruncate() failed\n");
                    return -errno;
            }

            map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED) {
                    fprintf(stderr, "mmap() failed\n");
                    return -errno;
            }

            *map = 0;

            if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
                    fprintf(stderr, "mprotect() failed\n");
                    return -errno;
            }

            clear_softdirty();

            if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
                    fprintf(stderr, "mprotect() failed\n");
                    return -errno;
            }

            *map = 0;

            return 0;
    }
    --------------------------------------------------------------------------

    The above test fails with SIGBUS when there is only a single free hugetlb page.
    # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
    # ./test
    Bus error (core dumped)

    And worse, with sufficient free hugetlb pages it will map an anonymous page
    into a shared mapping, for example, messing up accounting during unmap
    and breaking MAP_SHARED semantics:
    # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
    # ./test
    # cat /proc/meminfo | grep HugePages_
    HugePages_Total: 2
    HugePages_Free: 1
    HugePages_Rsvd: 18446744073709551615
    HugePages_Surp: 0

    The reason in this particular case is that vma_wants_writenotify() will
    return "true", removing VM_SHARED in vma_set_page_prot() to map pages
    write-protected. Let's teach vma_wants_writenotify() that hugetlb does not
    support softdirty tracking.

    Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com
    Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
    Signed-off-by: David Hildenbrand
    Reviewed-by: Mike Kravetz
    Cc: Peter Feiner
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Cc: Muchun Song
    Cc: Peter Xu
    Cc: [3.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: David Hildenbrand
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit dd0ff4d12dd284c334f7e9b07f8f335af856ac78 upstream.

    The vmemmap pages are marked by kmemleak when allocated from memblock.
    Remove it from kmemleak when freeing the page. Otherwise, when we reuse
    the page, kmemleak may report such an error and then stop working.

    kmemleak: Cannot insert 0xffff98fb6eab3d40 into the object search tree (overlaps existing)
    kmemleak: Kernel memory leak detector disabled
    kmemleak: Object 0xffff98fb6be00000 (size 335544320):
    kmemleak: comm "swapper", pid 0, jiffies 4294892296
    kmemleak: min_count = 0
    kmemleak: count = 0
    kmemleak: flags = 0x1
    kmemleak: checksum = 0
    kmemleak: backtrace:

    Link: https://lkml.kernel.org/r/20220819094005.2928241-1-liushixin2@huawei.com
    Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
    Signed-off-by: Liu Shixin
    Reviewed-by: Muchun Song
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Liu Shixin
     
  • commit d26f60703606ab425eee9882b32a1781a8bed74d upstream.

    When a user tries to create a DAMON context via the DAMON debugfs interface
    with the name of an already existing context, the context directory creation
    fails, but a new context is still created and added to the internal data
    structure, due to the absence of a directory creation success check. As a
    result, memory could leak and DAMON cannot be turned on. An example test
    case is as below:

    # cd /sys/kernel/debug/damon/
    # echo "off" > monitor_on
    # echo paddr > target_ids
    # echo "abc" > mk_context
    # echo "abc" > mk_context
    # echo $$ > abc/target_ids
    # echo "on" > monitor_on <<< fails

    The return value of 'debugfs_create_dir()' is expected to be ignored in
    general, but this is an exceptional case, as the DAMON feature depends
    on the debugfs functionality and has the potential duplicate-name
    issue. This commit therefore fixes the issue by checking for directory
    creation failure and immediately returning the error in that case.
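
    A minimal sketch of the added check (variable names are illustrative):

    new_dir = debugfs_create_dir(name, root);
    /* the duplicate-name case makes this failure reachable in practice */
    if (IS_ERR(new_dir))
            return PTR_ERR(new_dir);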

    Link: https://lkml.kernel.org/r/20220821180853.2400-1-sj@kernel.org
    Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts")
    Signed-off-by: Badari Pulavarty
    Signed-off-by: SeongJae Park
    Cc: [ 5.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Badari Pulavarty
     
  • commit f87904c075515f3e1d8f4a7115869d3b914674fd upstream.

    When a disk is removed, bdi_unregister gets called to stop further
    writeback and wait for associated delayed work to complete. However,
    wb_inode_writeback_end() may schedule bandwidth estimation dwork after
    this has completed, which can result in the timer attempting to access the
    just freed bdi_writeback.

    Fix this by checking if the bdi_writeback is alive, similar to when
    scheduling writeback work.

    Since this requires wb->work_lock, and wb_inode_writeback_end() may get
    called from interrupt, switch wb->work_lock to an irqsafe lock.
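
    A hedged sketch of the scheduling shape described above (identifiers such
    as bw_dwork and BANDWIDTH_INTERVAL are recalled from the writeback code
    and may not match the patch exactly):

    unsigned long flags;

    spin_lock_irqsave(&wb->work_lock, flags);
    /* only schedule bandwidth estimation while the bdi_writeback is alive */
    if (test_bit(WB_registered, &wb->state))
            queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
    spin_unlock_irqrestore(&wb->work_lock, flags);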

    Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
    Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload")
    Signed-off-by: Khazhismel Kumykov
    Reviewed-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Khazhismel Kumykov
     

17 Aug, 2022

4 commits

  • [ Upstream commit 7f82f922319ede486540e8746769865b9508d2c2 ]

    Since the beginning, charged is set to 0 to avoid calling vm_unacct_memory
    twice because vm_unacct_memory will be called by the above unmap_region. But
    since commit 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from
    the unmap_vmas() interfaces"), unmap_region doesn't call vm_unacct_memory
    anymore. So charged shouldn't be set to 0 now, otherwise the paired call to
    vm_unacct_memory will be missed, leading to an imbalanced account.

    Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com
    Fixes: 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces")
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Miaohe Lin
     
  • [ Upstream commit 000eca5d044d1ee23b4ca311793cf3fc528da6c6 ]

    When the user specifies more nodes than supported, get_nodes will access the
    nmask array out of bounds.

    Link: https://lkml.kernel.org/r/20220601093211.2970565-1-tianyu.li@arm.com
    Fixes: e130242dc351 ("mm: simplify compat numa syscalls")
    Signed-off-by: Tianyu Li
    Cc: Arnd Bergmann
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Tianyu Li
     
  • [ Upstream commit 1e57ffb6e3fd9583268c6462c4e3853575b21701 ]

    Think about the below scene:

    CPU1                                  CPU2
    memunmap_pages
      percpu_ref_exit
        __percpu_ref_exit
          free_percpu(percpu_count);
          /* percpu_count is freed here! */
                                          get_dev_pagemap
                                            xa_load(&pgmap_array, PHYS_PFN(phys))
                                            /* pgmap still in the pgmap_array */
                                            percpu_ref_tryget_live(&pgmap->ref)
                                              if __ref_is_percpu
                                                /* __PERCPU_REF_ATOMIC_DEAD not set yet */
                                                this_cpu_inc(*percpu_count)
                                                /* access freed percpu_count here! */
      ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD;
      /* too late... */
    pageunmap_range

    To fix the issue, do percpu_ref_exit() after pgmap_array is emptied. That
    way we won't do percpu_ref_tryget_live() against a percpu_ref that is being freed.

    Link: https://lkml.kernel.org/r/20220609121305.2508-1-linmiaohe@huawei.com
    Fixes: b7b3c01b1915 ("mm/memremap_pages: support multiple ranges per invocation")
    Signed-off-by: Miaohe Lin
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Miaohe Lin
     
  • [ Upstream commit b80892ca022e9eb484771a66eb68e12364695a2a ]

    No driver is left using the external pgmap refcount, so remove the
    code to support it.

    Signed-off-by: Christoph Hellwig
    Acked-by: Bjorn Helgaas
    Link: https://lore.kernel.org/r/20211028151017.50234-1-hch@lst.de
    Signed-off-by: Dan Williams
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     

03 Aug, 2022

5 commits

  • commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13 upstream.

    There was a report that a task is waiting in
    throttle_direct_reclaim. The pgscan_direct_throttle counter in vmstat was
    increasing.

    This is a bug where zone_watermark_fast returns true even when free
    memory is very low. Commit f27ce0e14088 ("page_alloc: consider highatomic
    reserve in watermark fast") changed the watermark fast path to consider the
    highatomic reserve. But it did not handle the negative value case, which
    can happen when the reserved_highatomic pageblock is bigger than the
    actual free pages.

    If the watermark is considered ok for the negative value, allocating
    contexts for order-0 will consume all free pages without direct reclaim,
    and finally free pages may become depleted except for the highatomic reserve.

    Then allocating contexts may fall into throttle_direct_reclaim. This
    symptom can easily happen on a system where wmark min is low and other
    reclaimers like kswapd do not make free pages quickly.

    Handle the negative case by using MIN.
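
    A minimal sketch of the clamping described above (illustrative; the real
    check lives in zone_watermark_fast()):

    long usable_free = free_pages;
    long unusable_free = __zone_watermark_unusable_free(z, 0, alloc_flags);

    /* reserved_highatomic can exceed the actual free pages; never let the
     * "unusable" estimate drive usable_free negative */
    usable_free -= min(usable_free, unusable_free);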

    Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
    Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast")
    Signed-off-by: Jaewon Kim
    Reported-by: GyeongHwan Hong
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Baoquan He
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Yong-Taek Lee
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jaewon Kim
     
  • commit 8a295dbbaf7292c582a40ce469c326f472d51f66 upstream.

    If hmm_range_fault() is called with the HMM_PFN_REQ_FAULT flag and a
    device private PTE is found, the hmm_range::dev_private_owner page is used
    to determine if the device private page should not be faulted in.
    However, if the device private page is not owned by the caller,
    hmm_range_fault() returns an error instead of calling migrate_to_ram() to
    fault in the page.

    For example, if a page is migrated to GPU private memory and a RDMA fault
    capable NIC tries to read the migrated page, without this patch it will
    get an error. With this patch, the page will be migrated back to system
    memory and the NIC will be able to read the data.

    Link: https://lkml.kernel.org/r/20220727000837.4128709-2-rcampbell@nvidia.com
    Link: https://lkml.kernel.org/r/20220725183615.4118795-2-rcampbell@nvidia.com
    Fixes: 08ddddda667b ("mm/hmm: check the device private page owner in hmm_range_fault()")
    Signed-off-by: Ralph Campbell
    Reported-by: Felix Kuehling
    Reviewed-by: Alistair Popple
    Cc: Philip Yang
    Cc: Jason Gunthorpe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Ralph Campbell
     
  • commit da9a298f5fad0dc615079a340da42928bc5b138e upstream.

    When alloc_huge_page fails, *pagep is set to NULL without put_page first.
    So the hugepage indicated by *pagep is leaked.

    Link: https://lkml.kernel.org/r/20220709092629.54291-1-linmiaohe@huawei.com
    Fixes: 8cc5fcbb5be8 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
    Signed-off-by: Miaohe Lin
    Acked-by: Muchun Song
    Reviewed-by: Anshuman Khandual
    Reviewed-by: Baolin Wang
    Reviewed-by: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 3fe2895cfecd03ac74977f32102b966b6589f481 upstream.

    We have an application with a lot of threads that use a shared mmap backed
    by tmpfs mounted with -o huge=within_size. This application started
    leaking loads of huge pages when we upgraded to a recent kernel.

    Using the page ref tracepoints and a BPF program written by Tejun Heo we
    were able to determine that these pages would have multiple refcounts from
    the page fault path, but when it came to unmap time we wouldn't drop the
    number of refs we had added from the faults.

    I wrote a reproducer that mmap'ed a file backed by tmpfs with -o
    huge=always, and then spawned 20 threads all looping faulting random
    offsets in this map, while using madvise(MADV_DONTNEED) randomly for huge
    page aligned ranges. This very quickly reproduced the problem.

    The problem here is that we check for the case that we have multiple
    threads faulting in a range that was previously unmapped. One thread maps
    the PMD, the other thread loses the race and then returns 0. However at
    this point we already have the page, and we are no longer putting this
    page into the process's address space, and so we leak the page. We
    actually did the correct thing prior to f9ce0be71d1f; however, it looks
    like Kirill copied what we do in the anonymous page case. In the
    anonymous page case we don't yet have a page, so we don't have to drop a
    reference on anything. Previously we did the correct thing for file based
    faults by returning VM_FAULT_NOPAGE so we correctly drop the reference on
    the page we faulted in.

    Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable()
    case, this makes us drop the ref on the page properly, and now my
    reproducer no longer leaks the huge pages.
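
    A minimal sketch of the return-value change (illustrative; the check sits
    in the fault-completion path):

    if (pmd_devmap_trans_unstable(vmf->pmd))
            return VM_FAULT_NOPAGE; /* caller drops the ref on the faulted page */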

    [josef@toxicpanda.com: v2]
    Link: https://lkml.kernel.org/r/e90c8f0dbae836632b669c2afc434006a00d4a67.1657721478.git.josef@toxicpanda.com
    Link: https://lkml.kernel.org/r/2b798acfd95c9ab9395fe85e8d5a835e2e10a920.1657051137.git.josef@toxicpanda.com
    Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
    Signed-off-by: Josef Bacik
    Signed-off-by: Rik van Riel
    Signed-off-by: Chris Mason
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox (Oracle)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 84ac013046ccc438af04b7acecd4d3ab84fe4bde upstream.

    syzkaller reports the following issue:

    BUG: unable to handle page fault for address: ffff888021f7e005
    PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
    Oops: 0002 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
    Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
    RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
    RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
    RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
    R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
    R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
    FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    zero_user_segments include/linux/highmem.h:272 [inline]
    folio_zero_range include/linux/highmem.h:428 [inline]
    truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
    truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
    truncate_inode_pages mm/truncate.c:452 [inline]
    truncate_pagecache+0x63/0x90 mm/truncate.c:753
    simple_setattr+0xed/0x110 fs/libfs.c:535
    secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
    notify_change+0xb8c/0x12b0 fs/attr.c:424
    do_truncate+0x13c/0x200 fs/open.c:65
    do_sys_ftruncate+0x536/0x730 fs/open.c:193
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0
    RIP: 0033:0x7fb29d900899
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
    RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
    RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
    RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
    R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000

    Modules linked in:
    CR2: ffff888021f7e005
    ---[ end trace 0000000000000000 ]---

    Eric Biggers suggested that this happens when
    secretmem_setattr()->simple_setattr() races with secretmem_fault() so that
    a page that is faulted in by secretmem_fault() (and thus removed from the
    direct map) is zeroed by inode truncation right afterwards.

    Use mapping->invalidate_lock to make secretmem_fault() and
    secretmem_setattr() mutually exclusive.
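
    A hedged sketch of the exclusion, assuming the standard filemap
    invalidate-lock helpers:

    /* secretmem_setattr(): truncation runs with the lock held exclusively */
    filemap_invalidate_lock(mapping);
    /* ... simple_setattr()/truncation ... */
    filemap_invalidate_unlock(mapping);

    /* secretmem_fault(): the fault path takes the same lock (shared) */
    filemap_invalidate_lock_shared(mapping);
    /* ... look up or insert the page, remove it from the direct map ... */
    filemap_invalidate_unlock_shared(mapping);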

    [rppt@linux.ibm.com: v3]
    Link: https://lkml.kernel.org/r/20220714091337.412297-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20220707165650.248088-1-rppt@kernel.org
    Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
    Signed-off-by: Mike Rapoport
    Suggested-by: Eric Biggers
    Reviewed-by: Axel Rasmussen
    Reviewed-by: Jan Kara
    Cc: Eric Biggers
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Rapoport
     

29 Jul, 2022

1 commit

  • commit 018160ad314d75b1409129b2247b614a9f35894c upstream.

    mpol_set_nodemask() (mm/mempolicy.c) does not set up the nodemask when
    pol->mode is MPOL_LOCAL. Check pol->mode before accessing
    pol->w.cpuset_mems_allowed in mpol_rebind_policy() (mm/mempolicy.c).
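
    A minimal sketch of the added check at the top of mpol_rebind_policy()
    (illustrative):

    if (!pol || pol->mode == MPOL_LOCAL)
            return; /* MPOL_LOCAL never sets up w.cpuset_mems_allowed */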

    BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline]
    BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
    mpol_rebind_policy mm/mempolicy.c:352 [inline]
    mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
    cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline]
    cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278
    cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515
    cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline]
    cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804
    __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520
    cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539
    cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852
    kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296
    call_write_iter include/linux/fs.h:2162 [inline]
    new_sync_write fs/read_write.c:503 [inline]
    vfs_write+0x1318/0x2030 fs/read_write.c:590
    ksys_write+0x28b/0x510 fs/read_write.c:643
    __do_sys_write fs/read_write.c:655 [inline]
    __se_sys_write fs/read_write.c:652 [inline]
    __x64_sys_write+0xdb/0x120 fs/read_write.c:652
    do_syscall_x64 arch/x86/entry/common.c:51 [inline]
    do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Uninit was created at:
    slab_post_alloc_hook mm/slab.h:524 [inline]
    slab_alloc_node mm/slub.c:3251 [inline]
    slab_alloc mm/slub.c:3259 [inline]
    kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264
    mpol_new mm/mempolicy.c:293 [inline]
    do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853
    kernel_set_mempolicy mm/mempolicy.c:1504 [inline]
    __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline]
    __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507
    __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507
    do_syscall_x64 arch/x86/entry/common.c:51 [inline]
    do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    KMSAN: uninit-value in mpol_rebind_task (2)
    https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc

    This patch seems to fix the below bug too.
    KMSAN: uninit-value in mpol_rebind_mm (2)
    https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b

    The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy().
    When syzkaller reproducer runs to the beginning of mpol_new(),

    mpol_new() mm/mempolicy.c
    do_mbind() mm/mempolicy.c
    kernel_mbind() mm/mempolicy.c

    `mode` is 1(MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags`
    is 0. Then

    mode = MPOL_LOCAL;
    ...
    policy->mode = mode;
    policy->flags = flags;

    will be executed. So in mpol_set_nodemask(),

    mpol_set_nodemask() mm/mempolicy.c
    do_mbind()
    kernel_mbind()

    pol->mode is 4 (MPOL_LOCAL), so the `nodemask` in `pol` is not initialized,
    and it will later be accessed in mpol_rebind_policy().

    Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain
    Signed-off-by: Wang Cheng
    Reported-by:
    Tested-by:
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Wang Cheng
     

22 Jul, 2022

2 commits

  • commit 14c99d65941538aa33edd8dc7b1bbbb593c324a2 upstream.

    Currently the implementation will split the PUD when a fallback is taken
    inside the create_huge_pud function. This isn't where it should be done:
    the splitting should be done in wp_huge_pud, just like it's done for PMDs.
    Reason being that if a fallback is taken during create, there is no PUD
    yet so nothing to split, whereas if a fallback is taken when encountering
    a write protection fault there is something to split.

    It looks like this was the original intention with the commit where the
    splitting was introduced, but somehow it got moved to the wrong place
    between v1 and v2 of the patch series. Rebase mistake perhaps.

    Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com
    Fixes: 327e9fd48972 ("mm: Split huge pages on write-notify or COW")
    Signed-off-by: James Gowans
    Reviewed-by: Thomas Hellström
    Cc: Christian König
    Cc: Jan H. Schönherr
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Gowans, James
     
  • commit 73f37dbcfe1763ee2294c7717a1f571e27d17fd8 upstream.

    When fallocate() is used on a shmem file, the pages we allocate can end up
    with !PageUptodate.

    Since UFFDIO_CONTINUE tries to find the existing page the user wants to
    map with SGP_READ, we would fail to find such a page, since
    shmem_getpage_gfp returns with a "NULL" pagep for SGP_READ if it discovers
    !PageUptodate. As a result, UFFDIO_CONTINUE returns -EFAULT, as it would
    do if the page wasn't found in the page cache at all.

    This isn't the intended behavior. UFFDIO_CONTINUE is just trying to find
    if a page exists, and doesn't care whether it still needs to be cleared or
    not. So, instead of SGP_READ, pass in SGP_NOALLOC. This is the same,
    except for one critical difference: in the !PageUptodate case, SGP_NOALLOC
    will clear the page and then return it. With this change, UFFDIO_CONTINUE
    works properly (succeeds) on a shmem file which has been fallocated, but
    otherwise not modified.
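
    A minimal sketch of the lookup-mode change in the UFFDIO_CONTINUE path
    (illustrative):

    /* was: shmem_getpage(inode, pgoff, &page, SGP_READ); */
    ret = shmem_getpage(inode, pgoff, &page, SGP_NOALLOC);
    if (ret)
            goto out;
    /* SGP_NOALLOC clears a !PageUptodate page and returns it instead of NULL */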

    Link: https://lkml.kernel.org/r/20220610173812.1768919-1-axelrasmussen@google.com
    Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
    Signed-off-by: Axel Rasmussen
    Acked-by: Peter Xu
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Axel Rasmussen
     

12 Jul, 2022

8 commits

  • commit 2ba2b008a8bf5fd268a43d03ba79e0ad464d6836 upstream.

    Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing
    page compound again") because now we fetch the page refcount under
    hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no
    longer necessary.

    Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi
    Suggested-by: Miaohe Lin
    Reviewed-by: Miaohe Lin
    Reviewed-by: Mike Kravetz
    Cc: Miaohe Lin
    Cc: Yang Shi
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • [ Upstream commit 405ce051236cc65b30bbfe490b28ce60ae6aed85 ]

    There is a race condition between memory_failure_hugetlb() and hugetlb
    free/demotion, which causes the PageHWPoison flag to be set on the wrong page.
    One simple result is that the wrong processes can be killed, but another
    (more serious) one is that the actual error is left unhandled, so no one
    prevents later access to it, and that might lead to more serious results
    like consuming corrupted data.

    Think about the below race window:

    CPU 1                                 CPU 2
    memory_failure_hugetlb
    struct page *head = compound_head(p);
                                          hugetlb page might be freed to
                                          buddy, or even changed to another
                                          compound page.

    get_hwpoison_page -- page is not what we want now...

    The current code first does prechecks roughly and then reconfirms after
    taking refcount, but it's found that it makes code overly complicated,
    so move the prechecks in a single hugetlb_lock range.

    A newly introduced function, try_memory_failure_hugetlb(), always takes
    hugetlb_lock (even for non-hugetlb pages). That can be improved, but
    memory_failure() is rare in principle, so should not be a big problem.

    Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
    Fixes: 761ad8d7c7b5 ("mm: hwpoison: introduce memory_failure_hugetlb()")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Mike Kravetz
    Reviewed-by: Miaohe Lin
    Reviewed-by: Mike Kravetz
    Cc: Yang Shi
    Cc: Dan Carpenter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Naoya Horiguchi
     
  • [ Upstream commit 888af2701db79b9b27c7e37f9ede528a5ca53b76 ]

    Patch series "A few fixup patches for memory failure", v2.

    This series contains a few patches to fix the race with changing page
    compound page, make non-LRU movable pages unhandlable and so on. More
    details can be found in the respective changelogs.

    There is a race window where we got the compound_head, the hugetlb page
    could be freed to buddy, or even changed to another compound page just
    before we try to get hwpoison page. Think about the below race window:

    CPU 1                                 CPU 2
    memory_failure_hugetlb
    struct page *head = compound_head(p);
                                          hugetlb page might be freed to
                                          buddy, or even changed to another
                                          compound page.

    get_hwpoison_page -- page is not what we want now...

    If this race happens, just bail out. Also MF_MSG_DIFFERENT_PAGE_SIZE is
    introduced to record this event.

    [akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi]

    Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Acked-by: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Borislav Petkov
    Cc: Mike Kravetz
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Miaohe Lin
     
  • [ Upstream commit d1fe111fb62a1cf0446a2919f5effbb33ad0702c ]

    When the hwpoison page meets the filter conditions, it should not be
    regarded as successful memory_failure() processing by the mce handler, but
    should return a distinct value; otherwise the mce handler regards the error
    page as having been identified and isolated, which may lead to calling
    set_mce_nospec() to change the page attribute, etc.

    Here memory_failure() returns -EOPNOTSUPP to indicate that the error
    event is filtered; the mce handler should not take any action for this
    situation and the hwpoison injector should treat it as handled correctly.
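
    A hedged sketch of the distinct return value (placement and label name are
    illustrative; the real code has several filter call sites):

    if (hwpoison_filter(p)) {
            /* filtered: report that nothing was actually handled */
            res = -EOPNOTSUPP;
            goto unlock_mutex;
    }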

    Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
    Signed-off-by: luofei
    Acked-by: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Miaohe Lin
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    luofei
     
  • [ Upstream commit 91d005479e06392617bacc114509d611b705eaac ]

    Patch series "mm/hwpoison: fix unpoison_memory()", v4.

    The main purpose of this series is to sync unpoison code to recent
    changes around how hwpoison code takes page refcount. Unpoison should
    work or simply fail (without crash) if impossible.

    The recent works of keeping hwpoison pages in shmem pagecache introduce
    a new state of hwpoisoned pages, but unpoison for such pages is not
    supported yet with this series.

    It seems that soft-offline and unpoison can be used as a general purpose
    page offline/online mechanism (not in the context of memory errors). I
    think that we need some additional work to realize it, because currently
    soft-offline and unpoison are assumed not to happen so frequently (they print
    out too many messages for aggressive use cases). But anyway this could
    be another interesting next topic.

    v1: https://lore.kernel.org/linux-mm/20210614021212.223326-1-nao.horiguchi@gmail.com/
    v2: https://lore.kernel.org/linux-mm/20211025230503.2650970-1-naoya.horiguchi@linux.dev/
    v3: https://lore.kernel.org/linux-mm/20211105055058.3152564-1-naoya.horiguchi@linux.dev/

    This patch (of 3):

    Originally mf_mutex is introduced to serialize multiple MCE events, but
    it is not that useful to allow unpoison to run in parallel with
    memory_failure() and soft offline. So apply mf_mutex to soft offline
    and unpoison. The memory failure handler and soft offline handler get
    simpler with this.

    Link: https://lkml.kernel.org/r/20211115084006.3728254-1-naoya.horiguchi@linux.dev
    Link: https://lkml.kernel.org/r/20211115084006.3728254-2-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Yang Shi
    Cc: "Aneesh Kumar K.V"
    Cc: David Hildenbrand
    Cc: Ding Hui
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Naoya Horiguchi
     
  • [ Upstream commit a8749a35c39903120ec421ef2525acc8e0daa55c ]

    Linux has dozens of occurrences of vmalloc(array_size()) and
    vzalloc(array_size()). Allow the code to be simplified by providing
    vmalloc_array and vcalloc, as well as the underscored variants that let
    the caller specify the GFP flags.
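
    A sketch of the helpers' shape as described above (overflow-checked
    element-count allocation; simplified relative to the real implementation):

    void *__vmalloc_array(size_t n, size_t size, gfp_t flags)
    {
            size_t bytes;

            if (unlikely(check_mul_overflow(n, size, &bytes)))
                    return NULL;
            return __vmalloc(bytes, flags);
    }

    void *vmalloc_array(size_t n, size_t size)
    {
            return __vmalloc_array(n, size, GFP_KERNEL);
    }

    void *vcalloc(size_t n, size_t size)
    {
            return __vmalloc_array(n, size, GFP_KERNEL | __GFP_ZERO);
    }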

    Acked-by: Michal Hocko
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Paolo Bonzini
     
  • Release the refcount after xas_set to fix a use-after-free (UAF) which may cause a panic like this:

    page:ffffea000491fa40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1247e9
    head:ffffea000491fa00 order:3 compound_mapcount:0 compound_pincount:0
    memcg:ffff888104f91091
    flags: 0x2fffff80010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
    ...
    page dumped because: VM_BUG_ON_PAGE(PageTail(page))
    ------------[ cut here ]------------
    kernel BUG at include/linux/page-flags.h:632!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
    CPU: 1 PID: 7642 Comm: sh Not tainted 5.15.51-dirty #26
    ...
    Call Trace:

    __invalidate_mapping_pages+0xe7/0x540
    drop_pagecache_sb+0x159/0x320
    iterate_supers+0x120/0x240
    drop_caches_sysctl_handler+0xaa/0xe0
    proc_sys_call_handler+0x2b4/0x480
    new_sync_write+0x3d6/0x5c0
    vfs_write+0x446/0x7a0
    ksys_write+0x105/0x210
    do_syscall_64+0x35/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7f52b5733130
    ...

    This problem has been fixed in mainline by commit 6b24ca4a1a8d ("mm: Use
    multi-index entries in the page cache"), which deletes the related code.

    Fixes: 5c211ba29deb ("mm: add and use find_lock_entries")
    Signed-off-by: Liu Shixin
    Acked-by: Matthew Wilcox (Oracle)
    Signed-off-by: Greg Kroah-Hartman

    Liu Shixin
     
  • commit eeaa345e128515135ccb864c04482180c08e3259 upstream.

    The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
    the TID stays the same. However, two places in __slab_alloc() currently
    don't update the TID when deactivating the CPU slab.

    If multiple operations race the right way, this could lead to an object
    getting lost; or, in an even more unlikely situation, it could even lead to
    an object being freed onto the wrong slab's freelist, messing up the
    `inuse` counter and eventually causing a page to be freed to the page
    allocator while it still contains slab objects.

    (I haven't actually tested these cases though, this is just based on
    looking at the code. Writing testcases for this stuff seems like it'd be
    a pain...)

    The race leading to state inconsistency is (all operations on the same CPU
    and kmem_cache):

    - task A: begin do_slab_free():
    - read TID
    - read pcpu freelist (==NULL)
    - check `slab == c->slab` (true)
    - [PREEMPT A->B]
    - task B: begin slab_alloc_node():
    - fastpath fails (`c->freelist` is NULL)
    - enter __slab_alloc()
    - slub_get_cpu_ptr() (disables preemption)
    - enter ___slab_alloc()
    - take local_lock_irqsave()
    - read c->freelist as NULL
    - get_freelist() returns NULL
    - write `c->slab = NULL`
    - drop local_unlock_irqrestore()
    - goto new_slab
    - slub_percpu_partial() is NULL
    - get_partial() returns NULL
    - slub_put_cpu_ptr() (enables preemption)
    - [PREEMPT B->A]
    - task A: finish do_slab_free():
    - this_cpu_cmpxchg_double() succeeds()
    - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]

    From there, the object on c->freelist will get lost if task B is allowed to
    continue from here: It will proceed to the retry_load_slab label,
    set c->slab, then jump to load_freelist, which clobbers c->freelist.

    But if we instead continue as follows, we get worse corruption:

    - task A: run __slab_free() on object from other struct slab:
    - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
    - task A: run slab_alloc_node() with NUMA node constraint:
    - fastpath fails (c->slab is NULL)
    - call __slab_alloc()
    - slub_get_cpu_ptr() (disables preemption)
    - enter ___slab_alloc()
    - c->slab is NULL: goto new_slab
    - slub_percpu_partial() is non-NULL
    - set c->slab to slub_percpu_partial(c)
    - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
    from slab-2]
    - goto redo
    - node_match() fails
    - goto deactivate_slab
    - existing c->freelist is passed into deactivate_slab()
    - inuse count of slab-1 is decremented to account for object from
    slab-2

    At this point, the inuse count of slab-1 is 1 lower than it should be.
    This means that if we free all allocated objects in slab-1 except for one,
    SLUB will think that slab-1 is completely unused, and may free its page,
    leading to use-after-free.
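
    A hedged sketch of the kind of change implied above (illustrative; the
    actual patch bumps the TID in two such deactivation paths in
    ___slab_alloc()):

    /* deactivating the cpu slab: bump the TID as well, so a concurrently
     * prepared fastpath cmpxchg against the old TID can no longer succeed */
    c->slab = NULL;
    c->freelist = NULL;
    c->tid = next_tid(c->tid);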

    Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
    Fixes: 03e404af26dc2 ("slub: fast release on full slab")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jann Horn
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Reviewed-by: Muchun Song
    Tested-by: Hyeonggon Yoo
    Signed-off-by: Vlastimil Babka
    Link: https://lore.kernel.org/r/20220608182205.2945720-1-jannh@google.com
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

30 Jun, 2022

2 commits

  • This is the 5.15.50 stable release

    * tag 'v5.15.50': (1395 commits)
    Linux 5.15.50
    arm64: mm: Don't invalidate FROM_DEVICE buffers at start of DMA transfer
    serial: core: Initialize rs485 RTS polarity already on probe
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/bus/fsl-mc/fsl-mc-bus.c
    drivers/crypto/caam/ctrl.c
    drivers/pci/controller/dwc/pci-imx6.c
    drivers/spi/spi-fsl-qspi.c
    drivers/tty/serial/fsl_lpuart.c
    include/uapi/linux/dma-buf.h

    Jason Liu
     
  • This is the 5.15.41 stable release

    * tag 'v5.15.41': (1977 commits)
    Linux 5.15.41
    usb: gadget: uvc: allow for application to cleanly shutdown
    usb: gadget: uvc: rename function to be more consistent
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm64/boot/dts/freescale/fsl-ls1043a.dtsi
    arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi
    arch/arm64/configs/defconfig
    drivers/clk/imx/clk-imx8qxp-lpcg.c
    drivers/dma/imx-sdma.c
    drivers/gpu/drm/bridge/nwl-dsi.c
    drivers/mailbox/imx-mailbox.c
    drivers/net/phy/at803x.c
    drivers/tty/serial/fsl_lpuart.c
    security/keys/trusted-keys/trusted_core.c

    Jason Liu
     

22 Jun, 2022

1 commit

  • [ Upstream commit 4bca7e80b6455772b4bf3f536dcbc19aac424d6a ]

    noop_backing_dev_info is used by superblocks of various
    pseudofilesystems such as kdevtmpfs. After commit 10e14073107d
    ("writeback: Fix inode->i_io_list not be protected by inode->i_lock
    error") this broke because __mark_inode_dirty() started to access more
    fields from noop_backing_dev_info and this led to crashes inside
    locked_inode_to_wb_and_lock_list() called from __mark_inode_dirty().
    Fix the problem by initializing noop_backing_dev_info before the
    filesystems get mounted.

    Fixes: 10e14073107d ("writeback: Fix inode->i_io_list not be protected by inode->i_lock error")
    Reported-and-tested-by: Suzuki K Poulose
    Reported-and-tested-by: Alexandru Elisei
    Reported-and-tested-by: Guenter Roeck
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Sasha Levin

    Jan Kara
     

09 Jun, 2022

2 commits

  • commit a04e1928e2ead144dc2f369768bc0a0f3110af89 upstream.

    We forget to call untrack_pfn() to pair with track_pfn_remap() when the range
    is not allowed to be hotplugged. Fix it by jumping to the err_kasan label.

    Link: https://lkml.kernel.org/r/20220531122643.25249-1-linmiaohe@huawei.com
    Fixes: bca3feaa0764 ("mm/memory_hotplug: prevalidate the address range being added with platform")
    Signed-off-by: Miaohe Lin
    Reviewed-by: David Hildenbrand
    Acked-by: Muchun Song
    Cc: Anshuman Khandual
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 48381273f8734d28ef56a5bdf1966dd8530111bc upstream.

    The routine huge_pmd_unshare() is passed a pointer to an address
    associated with an area which may be unshared. If unshare is successful
    this address is updated to 'optimize' callers iterating over huge page
    addresses. For the optimization to work correctly, address should be
    updated to the last huge page in the unmapped/unshared area. However, in
    the common case where the passed address is PUD_SIZE aligned, the address
    is incorrectly updated to the address of the preceding huge page. That
    wastes CPU cycles as the unmapped/unshared range is scanned twice.

    Link: https://lkml.kernel.org/r/20220524205003.126184-1-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Muchun Song
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz