30 Sep, 2022

1 commit

  • This is the 5.15.71 stable release

    * tag 'v5.15.71': (144 commits)
    Linux 5.15.71
    ext4: use locality group preallocation for small closed files
    ext4: avoid unnecessary spreading of allocations among groups
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/net/phy/aquantia_main.c
    drivers/tty/serial/fsl_lpuart.c

    Jason Liu
     

28 Sep, 2022

3 commits

  • commit e45cc288724f0cfd497bb5920bcfa60caa335729 upstream.

    Commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations
    __free_slab() invocations out of IRQ context") moved all flush_cpu_slab()
    invocations to the global workqueue to avoid a problem related
    to deactivate_slab()/__free_slab() being called from an IRQ context
    on PREEMPT_RT kernels.

    When the flush_all_cpus_locked() function is called from a task context
    it may happen that a workqueue with the WQ_MEM_RECLAIM bit set ends up
    flushing the global workqueue; this causes a dependency issue.

    workqueue: WQ_MEM_RECLAIM nvme-delete-wq:nvme_delete_ctrl_work [nvme_core]
    is flushing !WQ_MEM_RECLAIM events:flush_cpu_slab
    WARNING: CPU: 37 PID: 410 at kernel/workqueue.c:2637
    check_flush_dependency+0x10a/0x120
    Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
    RIP: 0010:check_flush_dependency+0x10a/0x120
    Call Trace:
    __flush_work.isra.0+0xbf/0x220
    ? __queue_work+0x1dc/0x420
    flush_all_cpus_locked+0xfb/0x120
    __kmem_cache_shutdown+0x2b/0x320
    kmem_cache_destroy+0x49/0x100
    bioset_exit+0x143/0x190
    blk_release_queue+0xb9/0x100
    kobject_cleanup+0x37/0x130
    nvme_fc_ctrl_free+0xc6/0x150 [nvme_fc]
    nvme_free_ctrl+0x1ac/0x2b0 [nvme_core]

    Fix this bug by creating a workqueue for the flush operation with
    the WQ_MEM_RECLAIM bit set.
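
    A minimal sketch of that approach (names such as slub_flushwq are
    illustrative, not necessarily those used in the actual patch):

    #include <linux/workqueue.h>

    static struct workqueue_struct *flushwq;

    static int __init slub_flushwq_init(void)
    {
            /* a dedicated WQ_MEM_RECLAIM workqueue may be flushed by other
             * WQ_MEM_RECLAIM workqueues without tripping the dependency check */
            flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM, 0);
            return flushwq ? 0 : -ENOMEM;
    }

    /* the per-CPU flush work is then queued on flushwq instead of the
     * global system workqueue, e.g. queue_work_on(cpu, flushwq, work). */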

    Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
    Cc:
    Signed-off-by: Maurizio Lombardi
    Reviewed-by: Hyeonggon Yoo
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Maurizio Lombardi
     
  • commit 7e9c323c52b379d261a72dc7bd38120a761a93cd upstream.

    In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to
    out-of-memory; if it fails, return an error code rather than
    triggering a panic via BUG_ON();

    kernel BUG at mm/slub.c:5893!
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP

    Call trace:
    sysfs_slab_add+0x258/0x260 mm/slub.c:5973
    __kmem_cache_create+0x60/0x118 mm/slub.c:4899
    create_cache mm/slab_common.c:229 [inline]
    kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335
    kmem_cache_create+0x1c/0x28 mm/slab_common.c:390
    f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline]
    f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808
    f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149
    mount_bdev+0x1b8/0x210 fs/super.c:1400
    f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512
    legacy_get_tree+0x30/0x74 fs/fs_context.c:610
    vfs_get_tree+0x40/0x140 fs/super.c:1530
    do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040
    path_mount+0x358/0x914 fs/namespace.c:3370
    do_mount fs/namespace.c:3383 [inline]
    __do_sys_mount fs/namespace.c:3591 [inline]
    __se_sys_mount fs/namespace.c:3568 [inline]
    __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568
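
    A minimal sketch of the error-return shape described above (the buffer
    length `len` is a placeholder and the exact return convention in
    create_unique_id() may differ):

    char *name = kmalloc(len, GFP_KERNEL);

    if (!name)
            return ERR_PTR(-ENOMEM);        /* previously: BUG_ON(!name) */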

    Cc:
    Fixes: 81819f0fc8285 ("SLUB core")
    Reported-by: syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com
    Reviewed-by: Muchun Song
    Reviewed-by: Hyeonggon Yoo
    Signed-off-by: Chao Yu
    Acked-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • commit 5373b8a09d6e037ee0587cb5d9fe4cc09077deeb upstream.

    We were failing to call kasan_malloc() from __kmalloc_*track_caller()
    which was causing us to sometimes fail to produce KASAN error reports
    for allocations made using e.g. devm_kcalloc(), as the KASAN poison was
    not being initialized. Fix it.

    Signed-off-by: Peter Collingbourne
    Cc: # 5.15
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Peter Collingbourne
     

27 Sep, 2022

1 commit

  • This is the 5.15.70 stable release

    * tag 'v5.15.70': (2444 commits)
    Linux 5.15.70
    ALSA: hda/sigmatel: Fix unused variable warning for beep power change
    cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx6ul.dtsi
    arch/arm/mm/mmu.c
    arch/arm64/boot/dts/freescale/imx8mp-evk.dts
    drivers/gpu/drm/imx/dcss/dcss-kms.c
    drivers/media/platform/nxp/imx-jpeg/mxc-jpeg.c
    drivers/media/platform/nxp/imx-jpeg/mxc-jpeg.h
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
    drivers/soc/fsl/Kconfig
    drivers/soc/imx/gpcv2.c
    drivers/usb/dwc3/host.c
    net/dsa/slave.c
    sound/soc/fsl/imx-card.c

    Jason Liu
     

20 Sep, 2022

1 commit

  • This is a stable-specific patch.
    I botched the stable-specific rewrite of
    commit b67fbebd4cf98 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"):
    As Hugh pointed out, unmap_region() actually operates on a list of VMAs,
    and the variable "vma" merely points to the first VMA in that list.
    So if we want to check whether any of the VMAs we're operating on is
    PFNMAP or MIXEDMAP, we have to iterate through the list and check each VMA.
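
    A hedged sketch of that iteration (a simplification using the 5.15 VMA
    list, not the literal stable patch):

    struct vm_area_struct *cur;
    bool force_flush = false;

    /* "vma" is only the first VMA of the region being unmapped */
    for (cur = vma; cur; cur = cur->vm_next) {
            if (cur->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
                    force_flush = true;
                    break;
            }
    }
    /* if force_flush is set, flush the TLB before free_pgtables() runs */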

    Signed-off-by: Jann Horn
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

15 Sep, 2022

1 commit

  • This reverts commit 23c2d497de21f25898fbea70aeb292ab8acc8c94.

    Commit 23c2d497de21 ("mm: kmemleak: take a full lowmem check in
    kmemleak_*_phys()") brought false leak alarms on some archs like arm64
    that do not init the pfn boundary during early boot. The final solution
    landed in linux-6.0: commit 0c24e061196c ("mm: kmemleak: add rbtree and
    store physical address for objects allocated with PA").

    Revert this commit before linux-6.0. The original issue of an invalid PA
    can be mitigated by an additional check in devicetree.

    The false alarm report is as follows: Kmemleak output: (Qemu/arm64)
    unreferenced object 0xffff0000c0170a00 (size 128):
    comm "swapper/0", pid 1, jiffies 4294892404 (age 126.208s)
    hex dump (first 32 bytes):
    62 61 73 65 00 00 00 00 00 00 00 00 00 00 00 00 base............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] __kmalloc_track_caller+0x1b0/0x2e4
    [] kstrdup_const+0x8c/0xc4
    [] kvasprintf_const+0xbc/0xec
    [] kobject_set_name_vargs+0x58/0xe4
    [] kobject_add+0x84/0x100
    [] __of_attach_node_sysfs+0x78/0xec
    [] of_core_init+0x68/0x104
    [] driver_init+0x28/0x48
    [] do_basic_setup+0x14/0x28
    [] kernel_init_freeable+0x110/0x178
    [] kernel_init+0x20/0x1a0
    [] ret_from_fork+0x10/0x20

    This patch is also applicable to linux-5.17.y/linux-5.18.y/linux-5.19.y

    Cc:
    Signed-off-by: Yee Lee
    Signed-off-by: Greg Kroah-Hartman

    Yee Lee
     

08 Sep, 2022

1 commit

  • [ Upstream commit 8782fb61cc848364e1e1599d76d3c9dd58a1cc06 ]

    The mmap lock protects the page walker from changes to the page tables
    during the walk. However a read lock is insufficient to protect those
    areas which don't have a VMA as munmap() detaches the VMAs before
    downgrading to a read lock and actually tearing down PTEs/page tables.

    For users of walk_page_range() the solution is to simply call pte_hole()
    immediately without checking the actual page tables when a VMA is not
    present. We now never call __walk_page_range() without a valid vma.
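
    A hedged sketch of that behaviour in walk_page_range()'s loop
    (illustrative, not a literal diff):

    vma = find_vma(walk.mm, start);
    if (!vma) {
            /* unmapped gap: report it via pte_hole() without walking page
             * tables that a concurrent munmap() may already be freeing */
            if (ops->pte_hole)
                    err = ops->pte_hole(start, next, -1, &walk);
    } else {
            err = __walk_page_range(start, next, &walk);
    }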

    For walk_page_range_novma() the locking requirements are tightened to
    require the mmap write lock to be taken, and then walking the pgd
    directly with 'no_vma' set.

    This in turn means that all page walkers either have a valid vma, or
    it's that special 'novma' case for page table debugging. As a result,
    all the odd '(!walk->vma && !walk->no_vma)' tests can be removed.

    Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
    Reported-by: Jann Horn
    Signed-off-by: Steven Price
    Cc: Vlastimil Babka
    Cc: Thomas Hellström
    Cc: Konstantin Khlebnikov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Steven Price
     

05 Sep, 2022

3 commits

  • commit 2555283eb40df89945557273121e9393ef9b542b upstream.

    anon_vma->degree tracks the combined number of child anon_vmas and VMAs
    that use the anon_vma as their ->anon_vma.

    anon_vma_clone() then assumes that for any anon_vma attached to
    src->anon_vma_chain other than src->anon_vma, it is impossible for it to
    be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
    elevated by 1 because of a child anon_vma, meaning that if ->degree
    equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.

    This assumption is wrong because the ->degree optimization leads to leaf
    nodes being abandoned on anon_vma_clone() - an existing anon_vma is
    reused and no new parent-child relationship is created. So it is
    possible to reuse an anon_vma for one VMA while it is still tied to
    another VMA.

    This is an issue because is_mergeable_anon_vma() and its callers assume
    that if two VMAs have the same ->anon_vma, the list of anon_vmas
    attached to the VMAs is guaranteed to be the same. When this assumption
    is violated, vma_merge() can merge pages into a VMA that is not attached
    to the corresponding anon_vma, leading to dangling page->mapping
    pointers that will be dereferenced during rmap walks.

    Fix it by separately tracking the number of child anon_vmas and the
    number of VMAs using the anon_vma as their ->anon_vma.
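
    A hedged sketch of the split bookkeeping (field names are illustrative;
    the actual patch may name them differently):

    struct anon_vma {
            /* ... existing fields ... */
            unsigned long num_children;     /* number of child anon_vmas */
            unsigned long num_active_vmas;  /* VMAs whose ->anon_vma is this one */
    };

    /* anon_vma_clone() may then only reuse an existing anon_vma when it has
     * no children and no VMA still uses it as its ->anon_vma. */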

    Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy")
    Cc: stable@kernel.org
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit ab74ef708dc51df7cf2b8a890b9c6990fac5c0c6 upstream.

    In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
    cache are installed in the ptes. But hugepage_add_new_anon_rmap is called
    for them mistakenly because they're not vm_shared. This will corrupt the
    page->mapping used by page cache code.

    Link: https://lkml.kernel.org/r/20220712130542.18836-1-linmiaohe@huawei.com
    Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
    Signed-off-by: Miaohe Lin
    Reviewed-by: Mike Kravetz
    Cc: Axel Rasmussen
    Cc: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Axel Rasmussen
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit b67fbebd4cf980aecbcc750e1462128bffe8ae15 upstream.

    Some drivers rely on having all VMAs through which a PFN might be
    accessible listed in the rmap for correctness.
    However, on X86, it was possible for a VMA with stale TLB entries
    to not be listed in the rmap.

    This was fixed in mainline with
    commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"),
    but that commit relies on preceding refactoring in
    commit 18ba064e42df3 ("mmu_gather: Let there be one tlb_{start,end}_vma()
    implementation") and commit 1e9fdf21a4339 ("mmu_gather: Remove per arch
    tlb_{start,end}_vma()").

    This patch provides equivalent protection without needing that
    refactoring, by forcing a TLB flush between removing PTEs in
    unmap_vmas() and the call to unlink_file_vma() in free_pgtables().

    [This is a stable-specific rewrite of the upstream commit!]
    Signed-off-by: Jann Horn
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

31 Aug, 2022

4 commits

  • commit f96f7a40874d7c746680c0b9f57cef2262ae551f upstream.

    Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2.

    I observed that hugetlb does not support/expect write-faults in shared
    mappings that would have to map the R/O-mapped page writable -- and I
    found two cases where we could currently get such faults and would
    erroneously map an anon page into a shared mapping.

    Reproducers are part of the patches.

    I propose to backport both fixes to stable trees. The first fix needs a
    small adjustment.

    This patch (of 2):

    Staring at hugetlb_wp(), one might wonder where all the logic for shared
    mappings is when stumbling over a write-protected page in a shared
    mapping. In fact, there is none, and so far we thought we could get away
    with that because e.g., mprotect() should always do the right thing and
    map all pages directly writable.

    Looks like we were wrong:

    --------------------------------------------------------------------------
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define HUGETLB_SIZE (2 * 1024 * 1024u)

    static void clear_softdirty(void)
    {
            int fd = open("/proc/self/clear_refs", O_WRONLY);
            const char *ctrl = "4";
            int ret;

            if (fd < 0) {
                    fprintf(stderr, "open(clear_refs) failed\n");
                    exit(1);
            }
            ret = write(fd, ctrl, strlen(ctrl));
            if (ret != strlen(ctrl)) {
                    fprintf(stderr, "write(clear_refs) failed\n");
                    exit(1);
            }
            close(fd);
    }

    int main(int argc, char **argv)
    {
            char *map;
            int fd;

            fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT, 0600);
            if (fd < 0) {
                    fprintf(stderr, "open() failed\n");
                    return -errno;
            }
            if (ftruncate(fd, HUGETLB_SIZE)) {
                    fprintf(stderr, "ftruncate() failed\n");
                    return -errno;
            }

            map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED) {
                    fprintf(stderr, "mmap() failed\n");
                    return -errno;
            }

            *map = 0;

            if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
                    fprintf(stderr, "mprotect() failed\n");
                    return -errno;
            }

            clear_softdirty();

            if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
                    fprintf(stderr, "mprotect() failed\n");
                    return -errno;
            }

            *map = 0;

            return 0;
    }
    --------------------------------------------------------------------------

    The above test fails with SIGBUS when there is only a single free hugetlb page.
    # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
    # ./test
    Bus error (core dumped)

    And worse, with sufficient free hugetlb pages it will map an anonymous page
    into a shared mapping, for example, messing up accounting during unmap
    and breaking MAP_SHARED semantics:
    # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
    # ./test
    # cat /proc/meminfo | grep HugePages_
    HugePages_Total: 2
    HugePages_Free: 1
    HugePages_Rsvd: 18446744073709551615
    HugePages_Surp: 0

    The reason in this particular case is that vma_wants_writenotify() will
    return "true", removing VM_SHARED in vma_set_page_prot() to map pages
    write-protected. Let's teach vma_wants_writenotify() that hugetlb does not
    support softdirty tracking.

    Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com
    Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
    Signed-off-by: David Hildenbrand
    Reviewed-by: Mike Kravetz
    Cc: Peter Feiner
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Cc: Muchun Song
    Cc: Peter Xu
    Cc: [3.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: David Hildenbrand
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit dd0ff4d12dd284c334f7e9b07f8f335af856ac78 upstream.

    The vmemmap pages are marked by kmemleak when allocated from memblock.
    Remove it from kmemleak when freeing the page. Otherwise, when we reuse
    the page, kmemleak may report such an error and then stop working.

    kmemleak: Cannot insert 0xffff98fb6eab3d40 into the object search tree (overlaps existing)
    kmemleak: Kernel memory leak detector disabled
    kmemleak: Object 0xffff98fb6be00000 (size 335544320):
    kmemleak: comm "swapper", pid 0, jiffies 4294892296
    kmemleak: min_count = 0
    kmemleak: count = 0
    kmemleak: flags = 0x1
    kmemleak: checksum = 0
    kmemleak: backtrace:

    Link: https://lkml.kernel.org/r/20220819094005.2928241-1-liushixin2@huawei.com
    Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
    Signed-off-by: Liu Shixin
    Reviewed-by: Muchun Song
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Liu Shixin
     
  • commit d26f60703606ab425eee9882b32a1781a8bed74d upstream.

    When a user tries to create a DAMON context via the DAMON debugfs interface
    with the name of an already existing context, the context directory creation
    fails, but a new context is still created and added to the internal data
    structure, due to the absence of a directory creation success check. As a
    result, memory could leak and DAMON cannot be turned on. An example test
    case is as below:

    # cd /sys/kernel/debug/damon/
    # echo "off" > monitor_on
    # echo paddr > target_ids
    # echo "abc" > mk_context
    # echo "abc" > mk_context
    # echo $$ > abc/target_ids
    # echo "on" > monitor_on <<< fails

    The return value of 'debugfs_create_dir()' is expected to be ignored in
    general, but this is an exceptional case, as the DAMON feature depends
    on the debugfs functionality and has the potential duplicate-name
    issue. This commit therefore fixes the issue by checking for directory
    creation failure and immediately returning the error in that case.
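
    A minimal sketch of the added check (variable names are illustrative):

    new_dir = debugfs_create_dir(name, root);
    /* the duplicate-name case makes this failure reachable in practice */
    if (IS_ERR(new_dir))
            return PTR_ERR(new_dir);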

    Link: https://lkml.kernel.org/r/20220821180853.2400-1-sj@kernel.org
    Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts")
    Signed-off-by: Badari Pulavarty
    Signed-off-by: SeongJae Park
    Cc: [ 5.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Badari Pulavarty
     
  • commit f87904c075515f3e1d8f4a7115869d3b914674fd upstream.

    When a disk is removed, bdi_unregister gets called to stop further
    writeback and wait for associated delayed work to complete. However,
    wb_inode_writeback_end() may schedule bandwidth estimation dwork after
    this has completed, which can result in the timer attempting to access the
    just freed bdi_writeback.

    Fix this by checking if the bdi_writeback is alive, similar to when
    scheduling writeback work.

    Since this requires wb->work_lock, and wb_inode_writeback_end() may get
    called from interrupt, switch wb->work_lock to an irqsafe lock.
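
    A hedged sketch of the scheduling shape described above (identifiers such
    as bw_dwork and BANDWIDTH_INTERVAL are recalled from the writeback code
    and may not match the patch exactly):

    unsigned long flags;

    spin_lock_irqsave(&wb->work_lock, flags);
    /* only schedule bandwidth estimation while the bdi_writeback is alive */
    if (test_bit(WB_registered, &wb->state))
            queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
    spin_unlock_irqrestore(&wb->work_lock, flags);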

    Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
    Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload")
    Signed-off-by: Khazhismel Kumykov
    Reviewed-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Khazhismel Kumykov
     

17 Aug, 2022

4 commits

  • [ Upstream commit 7f82f922319ede486540e8746769865b9508d2c2 ]

    Since the beginning, charged is set to 0 to avoid calling vm_unacct_memory
    twice because vm_unacct_memory will be called by the above unmap_region. But
    since commit 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from
    the unmap_vmas() interfaces"), unmap_region doesn't call vm_unacct_memory
    anymore. So charged shouldn't be set to 0 now, otherwise the paired call to
    vm_unacct_memory will be missed, leading to an imbalanced account.

    Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com
    Fixes: 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces")
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Miaohe Lin
     
  • [ Upstream commit 000eca5d044d1ee23b4ca311793cf3fc528da6c6 ]

    When the user specifies more nodes than supported, get_nodes will access the
    nmask array out of bounds.

    Link: https://lkml.kernel.org/r/20220601093211.2970565-1-tianyu.li@arm.com
    Fixes: e130242dc351 ("mm: simplify compat numa syscalls")
    Signed-off-by: Tianyu Li
    Cc: Arnd Bergmann
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Tianyu Li
     
  • [ Upstream commit 1e57ffb6e3fd9583268c6462c4e3853575b21701 ]

    Think about the below scene:

    CPU1                                  CPU2
    memunmap_pages
      percpu_ref_exit
        __percpu_ref_exit
          free_percpu(percpu_count);
          /* percpu_count is freed here! */
                                          get_dev_pagemap
                                            xa_load(&pgmap_array, PHYS_PFN(phys))
                                            /* pgmap still in the pgmap_array */
                                            percpu_ref_tryget_live(&pgmap->ref)
                                              if __ref_is_percpu
                                                /* __PERCPU_REF_ATOMIC_DEAD not set yet */
                                                this_cpu_inc(*percpu_count)
                                                /* access freed percpu_count here! */
      ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD;
      /* too late... */
    pageunmap_range

    To fix the issue, do percpu_ref_exit() after pgmap_array is emptied. That
    way we won't do percpu_ref_tryget_live() against a percpu_ref that is being freed.

    Link: https://lkml.kernel.org/r/20220609121305.2508-1-linmiaohe@huawei.com
    Fixes: b7b3c01b1915 ("mm/memremap_pages: support multiple ranges per invocation")
    Signed-off-by: Miaohe Lin
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Miaohe Lin
     
  • [ Upstream commit b80892ca022e9eb484771a66eb68e12364695a2a ]

    No driver is left using the external pgmap refcount, so remove the
    code to support it.

    Signed-off-by: Christoph Hellwig
    Acked-by: Bjorn Helgaas
    Link: https://lore.kernel.org/r/20211028151017.50234-1-hch@lst.de
    Signed-off-by: Dan Williams
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     

03 Aug, 2022

5 commits

  • commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13 upstream.

    There was a report that a task is waiting in
    throttle_direct_reclaim. The pgscan_direct_throttle counter in vmstat was
    increasing.

    This is a bug where zone_watermark_fast returns true even when free
    memory is very low. Commit f27ce0e14088 ("page_alloc: consider highatomic
    reserve in watermark fast") changed the watermark fast path to consider the
    highatomic reserve. But it did not handle the negative value case, which
    can happen when the reserved_highatomic pageblock is bigger than the
    actual free pages.

    If the watermark is considered ok for the negative value, allocating
    contexts for order-0 will consume all free pages without direct reclaim,
    and finally free pages may become depleted except for the highatomic reserve.

    Then allocating contexts may fall into throttle_direct_reclaim. This
    symptom can easily happen on a system where wmark min is low and other
    reclaimers like kswapd do not make free pages quickly.

    Handle the negative case by using MIN.
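
    A minimal sketch of the clamping described above (illustrative; the real
    check lives in zone_watermark_fast()):

    long usable_free = free_pages;
    long unusable_free = __zone_watermark_unusable_free(z, 0, alloc_flags);

    /* reserved_highatomic can exceed the actual free pages; never let the
     * "unusable" estimate drive usable_free negative */
    usable_free -= min(usable_free, unusable_free);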

    Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
    Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast")
    Signed-off-by: Jaewon Kim
    Reported-by: GyeongHwan Hong
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Baoquan He
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Yong-Taek Lee
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jaewon Kim
     
  • commit 8a295dbbaf7292c582a40ce469c326f472d51f66 upstream.

    If hmm_range_fault() is called with the HMM_PFN_REQ_FAULT flag and a
    device private PTE is found, the hmm_range::dev_private_owner page is used
    to determine if the device private page should not be faulted in.
    However, if the device private page is not owned by the caller,
    hmm_range_fault() returns an error instead of calling migrate_to_ram() to
    fault in the page.

    For example, if a page is migrated to GPU private memory and a RDMA fault
    capable NIC tries to read the migrated page, without this patch it will
    get an error. With this patch, the page will be migrated back to system
    memory and the NIC will be able to read the data.

    Link: https://lkml.kernel.org/r/20220727000837.4128709-2-rcampbell@nvidia.com
    Link: https://lkml.kernel.org/r/20220725183615.4118795-2-rcampbell@nvidia.com
    Fixes: 08ddddda667b ("mm/hmm: check the device private page owner in hmm_range_fault()")
    Signed-off-by: Ralph Campbell
    Reported-by: Felix Kuehling
    Reviewed-by: Alistair Popple
    Cc: Philip Yang
    Cc: Jason Gunthorpe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Ralph Campbell
     
  • commit da9a298f5fad0dc615079a340da42928bc5b138e upstream.

    When alloc_huge_page fails, *pagep is set to NULL without put_page first.
    So the hugepage indicated by *pagep is leaked.

    Link: https://lkml.kernel.org/r/20220709092629.54291-1-linmiaohe@huawei.com
    Fixes: 8cc5fcbb5be8 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
    Signed-off-by: Miaohe Lin
    Acked-by: Muchun Song
    Reviewed-by: Anshuman Khandual
    Reviewed-by: Baolin Wang
    Reviewed-by: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 3fe2895cfecd03ac74977f32102b966b6589f481 upstream.

    We have an application with a lot of threads that use a shared mmap backed
    by tmpfs mounted with -o huge=within_size. This application started
    leaking loads of huge pages when we upgraded to a recent kernel.

    Using the page ref tracepoints and a BPF program written by Tejun Heo we
    were able to determine that these pages would have multiple refcounts from
    the page fault path, but when it came to unmap time we wouldn't drop the
    number of refs we had added from the faults.

    I wrote a reproducer that mmap'ed a file backed by tmpfs with -o
    huge=always, and then spawned 20 threads all looping faulting random
    offsets in this map, while using madvise(MADV_DONTNEED) randomly for huge
    page aligned ranges. This very quickly reproduced the problem.

    The problem here is that we check for the case that we have multiple
    threads faulting in a range that was previously unmapped. One thread maps
    the PMD, the other thread loses the race and then returns 0. However at
    this point we already have the page, and we are no longer putting this
    page into the process's address space, and so we leak the page. We
    actually did the correct thing prior to f9ce0be71d1f; however, it looks
    like Kirill copied what we do in the anonymous page case. In the
    anonymous page case we don't yet have a page, so we don't have to drop a
    reference on anything. Previously we did the correct thing for file based
    faults by returning VM_FAULT_NOPAGE so we correctly drop the reference on
    the page we faulted in.

    Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable()
    case, this makes us drop the ref on the page properly, and now my
    reproducer no longer leaks the huge pages.
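
    A minimal sketch of the return-value change (illustrative; the check sits
    in the fault-completion path):

    if (pmd_devmap_trans_unstable(vmf->pmd))
            return VM_FAULT_NOPAGE; /* caller drops the ref on the faulted page */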

    [josef@toxicpanda.com: v2]
    Link: https://lkml.kernel.org/r/e90c8f0dbae836632b669c2afc434006a00d4a67.1657721478.git.josef@toxicpanda.com
    Link: https://lkml.kernel.org/r/2b798acfd95c9ab9395fe85e8d5a835e2e10a920.1657051137.git.josef@toxicpanda.com
    Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
    Signed-off-by: Josef Bacik
    Signed-off-by: Rik van Riel
    Signed-off-by: Chris Mason
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox (Oracle)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 84ac013046ccc438af04b7acecd4d3ab84fe4bde upstream.

    syzkaller reports the following issue:

    BUG: unable to handle page fault for address: ffff888021f7e005
    PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
    Oops: 0002 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
    Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
    RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
    RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
    RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
    R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
    R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
    FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    zero_user_segments include/linux/highmem.h:272 [inline]
    folio_zero_range include/linux/highmem.h:428 [inline]
    truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
    truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
    truncate_inode_pages mm/truncate.c:452 [inline]
    truncate_pagecache+0x63/0x90 mm/truncate.c:753
    simple_setattr+0xed/0x110 fs/libfs.c:535
    secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
    notify_change+0xb8c/0x12b0 fs/attr.c:424
    do_truncate+0x13c/0x200 fs/open.c:65
    do_sys_ftruncate+0x536/0x730 fs/open.c:193
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0
    RIP: 0033:0x7fb29d900899
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
    RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
    RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
    RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
    R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000

    Modules linked in:
    CR2: ffff888021f7e005
    ---[ end trace 0000000000000000 ]---

    Eric Biggers suggested that this happens when
    secretmem_setattr()->simple_setattr() races with secretmem_fault() so that
    a page that is faulted in by secretmem_fault() (and thus removed from the
    direct map) is zeroed by inode truncation right afterwards.

    Use mapping->invalidate_lock to make secretmem_fault() and
    secretmem_setattr() mutually exclusive.
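
    A hedged sketch of the exclusion, assuming the standard filemap
    invalidate-lock helpers:

    /* secretmem_setattr(): truncation runs with the lock held exclusively */
    filemap_invalidate_lock(mapping);
    /* ... simple_setattr()/truncation ... */
    filemap_invalidate_unlock(mapping);

    /* secretmem_fault(): the fault path takes the same lock (shared) */
    filemap_invalidate_lock_shared(mapping);
    /* ... look up or insert the page, remove it from the direct map ... */
    filemap_invalidate_unlock_shared(mapping);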

    [rppt@linux.ibm.com: v3]
    Link: https://lkml.kernel.org/r/20220714091337.412297-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20220707165650.248088-1-rppt@kernel.org
    Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
    Signed-off-by: Mike Rapoport
    Suggested-by: Eric Biggers
    Reviewed-by: Axel Rasmussen
    Reviewed-by: Jan Kara
    Cc: Eric Biggers
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Rapoport
     

29 Jul, 2022

1 commit

  • commit 018160ad314d75b1409129b2247b614a9f35894c upstream.

    mpol_set_nodemask() (mm/mempolicy.c) does not set up the nodemask when
    pol->mode is MPOL_LOCAL. Check pol->mode before accessing
    pol->w.cpuset_mems_allowed in mpol_rebind_policy() (mm/mempolicy.c).
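
    A minimal sketch of the added check at the top of mpol_rebind_policy()
    (illustrative):

    if (!pol || pol->mode == MPOL_LOCAL)
            return; /* MPOL_LOCAL never sets up w.cpuset_mems_allowed */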

    BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline]
    BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
    mpol_rebind_policy mm/mempolicy.c:352 [inline]
    mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
    cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline]
    cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278
    cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515
    cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline]
    cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804
    __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520
    cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539
    cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852
    kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296
    call_write_iter include/linux/fs.h:2162 [inline]
    new_sync_write fs/read_write.c:503 [inline]
    vfs_write+0x1318/0x2030 fs/read_write.c:590
    ksys_write+0x28b/0x510 fs/read_write.c:643
    __do_sys_write fs/read_write.c:655 [inline]
    __se_sys_write fs/read_write.c:652 [inline]
    __x64_sys_write+0xdb/0x120 fs/read_write.c:652
    do_syscall_x64 arch/x86/entry/common.c:51 [inline]
    do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Uninit was created at:
    slab_post_alloc_hook mm/slab.h:524 [inline]
    slab_alloc_node mm/slub.c:3251 [inline]
    slab_alloc mm/slub.c:3259 [inline]
    kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264
    mpol_new mm/mempolicy.c:293 [inline]
    do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853
    kernel_set_mempolicy mm/mempolicy.c:1504 [inline]
    __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline]
    __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507
    __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507
    do_syscall_x64 arch/x86/entry/common.c:51 [inline]
    do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    KMSAN: uninit-value in mpol_rebind_task (2)
    https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc

    This patch seems to fix the below bug too.
    KMSAN: uninit-value in mpol_rebind_mm (2)
    https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b

    The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy().
    When syzkaller reproducer runs to the beginning of mpol_new(),

    mpol_new() mm/mempolicy.c
    do_mbind() mm/mempolicy.c
    kernel_mbind() mm/mempolicy.c

    `mode` is 1(MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags`
    is 0. Then

    mode = MPOL_LOCAL;
    ...
    policy->mode = mode;
    policy->flags = flags;

    will be executed. So in mpol_set_nodemask(),

    mpol_set_nodemask() mm/mempolicy.c
    do_mbind()
    kernel_mbind()

    pol->mode is 4 (MPOL_LOCAL), so the `nodemask` in `pol` is not initialized,
    and it will later be accessed in mpol_rebind_policy().

    Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain
    Signed-off-by: Wang Cheng
    Reported-by:
    Tested-by:
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Wang Cheng
     

22 Jul, 2022

2 commits

  • commit 14c99d65941538aa33edd8dc7b1bbbb593c324a2 upstream.

    Currently the implementation will split the PUD when a fallback is taken
    inside the create_huge_pud function. This isn't where it should be done:
    the splitting should be done in wp_huge_pud, just like it's done for PMDs.
    Reason being that if a fallback is taken during create, there is no PUD
    yet so nothing to split, whereas if a fallback is taken when encountering
    a write protection fault there is something to split.

    It looks like this was the original intention with the commit where the
    splitting was introduced, but somehow it got moved to the wrong place
    between v1 and v2 of the patch series. Rebase mistake perhaps.

    Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com
    Fixes: 327e9fd48972 ("mm: Split huge pages on write-notify or COW")
    Signed-off-by: James Gowans
    Reviewed-by: Thomas Hellström
    Cc: Christian König
    Cc: Jan H. Schönherr
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Gowans, James
     
  • commit 73f37dbcfe1763ee2294c7717a1f571e27d17fd8 upstream.

    When fallocate() is used on a shmem file, the pages we allocate can end up
    with !PageUptodate.

    Since UFFDIO_CONTINUE tries to find the existing page the user wants to
    map with SGP_READ, we would fail to find such a page, since
    shmem_getpage_gfp returns with a "NULL" pagep for SGP_READ if it discovers
    !PageUptodate. As a result, UFFDIO_CONTINUE returns -EFAULT, as it would
    do if the page wasn't found in the page cache at all.

    This isn't the intended behavior. UFFDIO_CONTINUE is just trying to find
    if a page exists, and doesn't care whether it still needs to be cleared or
    not. So, instead of SGP_READ, pass in SGP_NOALLOC. This is the same,
    except for one critical difference: in the !PageUptodate case, SGP_NOALLOC
    will clear the page and then return it. With this change, UFFDIO_CONTINUE
    works properly (succeeds) on a shmem file which has been fallocated, but
    otherwise not modified.
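
    A minimal sketch of the lookup-mode change in the UFFDIO_CONTINUE path
    (illustrative):

    /* was: shmem_getpage(inode, pgoff, &page, SGP_READ); */
    ret = shmem_getpage(inode, pgoff, &page, SGP_NOALLOC);
    if (ret)
            goto out;
    /* SGP_NOALLOC clears a !PageUptodate page and returns it instead of NULL */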

    Link: https://lkml.kernel.org/r/20220610173812.1768919-1-axelrasmussen@google.com
    Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
    Signed-off-by: Axel Rasmussen
    Acked-by: Peter Xu
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Axel Rasmussen
     

12 Jul, 2022

8 commits

  • commit 2ba2b008a8bf5fd268a43d03ba79e0ad464d6836 upstream.

    Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing
    page compound again") because now we fetch the page refcount under
    hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no
    longer necessary.

    Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi
    Suggested-by: Miaohe Lin
    Reviewed-by: Miaohe Lin
    Reviewed-by: Mike Kravetz
    Cc: Miaohe Lin
    Cc: Yang Shi
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • [ Upstream commit 405ce051236cc65b30bbfe490b28ce60ae6aed85 ]

    There is a race condition between memory_failure_hugetlb() and hugetlb
    free/demotion, which causes the PageHWPoison flag to be set on the wrong page.
    One simple result is that the wrong processes can be killed, but another
    (more serious) one is that the actual error is left unhandled, so no one
    prevents later access to it, and that might lead to more serious results
    like consuming corrupted data.

    Think about the below race window:

    CPU 1                                 CPU 2
    memory_failure_hugetlb
    struct page *head = compound_head(p);
                                          hugetlb page might be freed to
                                          buddy, or even changed to another
                                          compound page.

    get_hwpoison_page -- page is not what we want now...

    The current code first does prechecks roughly and then reconfirms after
    taking refcount, but it's found that it makes code overly complicated,
    so move the prechecks in a single hugetlb_lock range.

    A newly introduced function, try_memory_failure_hugetlb(), always takes
    hugetlb_lock (even for non-hugetlb pages). That can be improved, but
    memory_failure() is rare in principle, so should not be a big problem.

    Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
    Fixes: 761ad8d7c7b5 ("mm: hwpoison: introduce memory_failure_hugetlb()")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Mike Kravetz
    Reviewed-by: Miaohe Lin
    Reviewed-by: Mike Kravetz
    Cc: Yang Shi
    Cc: Dan Carpenter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Naoya Horiguchi
     
  • [ Upstream commit 888af2701db79b9b27c7e37f9ede528a5ca53b76 ]

    Patch series "A few fixup patches for memory failure", v2.

    This series contains a few patches to fix the race with changing page
    compound page, make non-LRU movable pages unhandlable and so on. More
    details can be found in the respective changelogs.

    There is a race window where we got the compound_head, the hugetlb page
    could be freed to buddy, or even changed to another compound page just
    before we try to get hwpoison page. Think about the below race window:

    CPU 1                                 CPU 2
    memory_failure_hugetlb
    struct page *head = compound_head(p);
                                          hugetlb page might be freed to
                                          buddy, or even changed to another
                                          compound page.

    get_hwpoison_page -- page is not what we want now...

    If this race happens, just bail out. Also MF_MSG_DIFFERENT_PAGE_SIZE is
    introduced to record this event.

    [akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi]

    Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Acked-by: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Borislav Petkov
    Cc: Mike Kravetz
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Miaohe Lin
     
  • [ Upstream commit d1fe111fb62a1cf0446a2919f5effbb33ad0702c ]

    When the hwpoison page meets the filter conditions, it should not be
    regarded as successful memory_failure() processing by the mce handler, but
    should return a distinct value; otherwise the mce handler regards the error
    page as having been identified and isolated, which may lead to calling
    set_mce_nospec() to change the page attribute, etc.

    Here memory_failure() returns -EOPNOTSUPP to indicate that the error
    event is filtered; the mce handler should not take any action for this
    situation and the hwpoison injector should treat it as handled correctly.
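
    A hedged sketch of the distinct return value (placement and label name are
    illustrative; the real code has several filter call sites):

    if (hwpoison_filter(p)) {
            /* filtered: report that nothing was actually handled */
            res = -EOPNOTSUPP;
            goto unlock_mutex;
    }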

    Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
    Signed-off-by: luofei
    Acked-by: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Miaohe Lin
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    luofei
     
  • [ Upstream commit 91d005479e06392617bacc114509d611b705eaac ]

    Patch series "mm/hwpoison: fix unpoison_memory()", v4.

    The main purpose of this series is to sync unpoison code to recent
    changes around how hwpoison code takes page refcount. Unpoison should
    work or simply fail (without crash) if impossible.

    The recent works of keeping hwpoison pages in shmem pagecache introduce
    a new state of hwpoisoned pages, but unpoison for such pages is not
    supported yet with this series.

    It seems that soft-offline and unpoison can be used as a general purpose
    page offline/online mechanism (not in the context of memory errors). I
    think that we need some additional work to realize it, because currently
    soft-offline and unpoison are assumed not to happen so frequently (they print
    out too many messages for aggressive use cases). But anyway this could
    be another interesting next topic.

    v1: https://lore.kernel.org/linux-mm/20210614021212.223326-1-nao.horiguchi@gmail.com/
    v2: https://lore.kernel.org/linux-mm/20211025230503.2650970-1-naoya.horiguchi@linux.dev/
    v3: https://lore.kernel.org/linux-mm/20211105055058.3152564-1-naoya.horiguchi@linux.dev/

    This patch (of 3):

    Originally mf_mutex is introduced to serialize multiple MCE events, but
    it is not that useful to allow unpoison to run in parallel with
    memory_failure() and soft offline. So apply mf_mutex to soft offline
    and unpoison. The memory failure handler and soft offline handler get
    simpler with this.

    Link: https://lkml.kernel.org/r/20211115084006.3728254-1-naoya.horiguchi@linux.dev
    Link: https://lkml.kernel.org/r/20211115084006.3728254-2-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Yang Shi
    Cc: "Aneesh Kumar K.V"
    Cc: David Hildenbrand
    Cc: Ding Hui
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Naoya Horiguchi
     
  • [ Upstream commit a8749a35c39903120ec421ef2525acc8e0daa55c ]

    Linux has dozens of occurrences of vmalloc(array_size()) and
    vzalloc(array_size()). Allow the code to be simplified by providing
    vmalloc_array and vcalloc, as well as the underscored variants that let
    the caller specify the GFP flags.
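
    A sketch of the helpers' shape as described above (overflow-checked
    element-count allocation; simplified relative to the real implementation):

    void *__vmalloc_array(size_t n, size_t size, gfp_t flags)
    {
            size_t bytes;

            if (unlikely(check_mul_overflow(n, size, &bytes)))
                    return NULL;
            return __vmalloc(bytes, flags);
    }

    void *vmalloc_array(size_t n, size_t size)
    {
            return __vmalloc_array(n, size, GFP_KERNEL);
    }

    void *vcalloc(size_t n, size_t size)
    {
            return __vmalloc_array(n, size, GFP_KERNEL | __GFP_ZERO);
    }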

    Acked-by: Michal Hocko
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Paolo Bonzini
     
  • Release the refcount after xas_set to fix a use-after-free (UAF) which may cause a panic like this:

    page:ffffea000491fa40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1247e9
    head:ffffea000491fa00 order:3 compound_mapcount:0 compound_pincount:0
    memcg:ffff888104f91091
    flags: 0x2fffff80010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
    ...
    page dumped because: VM_BUG_ON_PAGE(PageTail(page))
    ------------[ cut here ]------------
    kernel BUG at include/linux/page-flags.h:632!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
    CPU: 1 PID: 7642 Comm: sh Not tainted 5.15.51-dirty #26
    ...
    Call Trace:

    __invalidate_mapping_pages+0xe7/0x540
    drop_pagecache_sb+0x159/0x320
    iterate_supers+0x120/0x240
    drop_caches_sysctl_handler+0xaa/0xe0
    proc_sys_call_handler+0x2b4/0x480
    new_sync_write+0x3d6/0x5c0
    vfs_write+0x446/0x7a0
    ksys_write+0x105/0x210
    do_syscall_64+0x35/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7f52b5733130
    ...

    This problem has been fixed in mainline by commit 6b24ca4a1a8d ("mm: Use
    multi-index entries in the page cache"), which deletes the related code.

    Fixes: 5c211ba29deb ("mm: add and use find_lock_entries")
    Signed-off-by: Liu Shixin
    Acked-by: Matthew Wilcox (Oracle)
    Signed-off-by: Greg Kroah-Hartman

    Liu Shixin
     
  • commit eeaa345e128515135ccb864c04482180c08e3259 upstream.

    The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
    the TID stays the same. However, two places in __slab_alloc() currently
    don't update the TID when deactivating the CPU slab.

    If multiple operations race the right way, this could lead to an object
    getting lost; or, in an even more unlikely situation, it could even lead to
    an object being freed onto the wrong slab's freelist, messing up the
    `inuse` counter and eventually causing a page to be freed to the page
    allocator while it still contains slab objects.

    (I haven't actually tested these cases though, this is just based on
    looking at the code. Writing testcases for this stuff seems like it'd be
    a pain...)

    The race leading to state inconsistency is (all operations on the same CPU
    and kmem_cache):

    - task A: begin do_slab_free():
    - read TID
    - read pcpu freelist (==NULL)
    - check `slab == c->slab` (true)
    - [PREEMPT A->B]
    - task B: begin slab_alloc_node():
    - fastpath fails (`c->freelist` is NULL)
    - enter __slab_alloc()
    - slub_get_cpu_ptr() (disables preemption)
    - enter ___slab_alloc()
    - take local_lock_irqsave()
    - read c->freelist as NULL
    - get_freelist() returns NULL
    - write `c->slab = NULL`
    - drop local_unlock_irqrestore()
    - goto new_slab
    - slub_percpu_partial() is NULL
    - get_partial() returns NULL
    - slub_put_cpu_ptr() (enables preemption)
    - [PREEMPT B->A]
    - task A: finish do_slab_free():
    - this_cpu_cmpxchg_double() succeeds()
    - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]

    From there, the object on c->freelist will get lost if task B is allowed to
    continue from here: It will proceed to the retry_load_slab label,
    set c->slab, then jump to load_freelist, which clobbers c->freelist.

    But if we instead continue as follows, we get worse corruption:

    - task A: run __slab_free() on object from other struct slab:
    - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
    - task A: run slab_alloc_node() with NUMA node constraint:
    - fastpath fails (c->slab is NULL)
    - call __slab_alloc()
    - slub_get_cpu_ptr() (disables preemption)
    - enter ___slab_alloc()
    - c->slab is NULL: goto new_slab
    - slub_percpu_partial() is non-NULL
    - set c->slab to slub_percpu_partial(c)
    - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
    from slab-2]
    - goto redo
    - node_match() fails
    - goto deactivate_slab
    - existing c->freelist is passed into deactivate_slab()
    - inuse count of slab-1 is decremented to account for object from
    slab-2

    At this point, the inuse count of slab-1 is 1 lower than it should be.
    This means that if we free all allocated objects in slab-1 except for one,
    SLUB will think that slab-1 is completely unused, and may free its page,
    leading to use-after-free.
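
    A hedged sketch of the kind of change implied above (illustrative; the
    actual patch bumps the TID in two such deactivation paths in
    ___slab_alloc()):

    /* deactivating the cpu slab: bump the TID as well, so a concurrently
     * prepared fastpath cmpxchg against the old TID can no longer succeed */
    c->slab = NULL;
    c->freelist = NULL;
    c->tid = next_tid(c->tid);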

    Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
    Fixes: 03e404af26dc2 ("slub: fast release on full slab")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jann Horn
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Reviewed-by: Muchun Song
    Tested-by: Hyeonggon Yoo
    Signed-off-by: Vlastimil Babka
    Link: https://lore.kernel.org/r/20220608182205.2945720-1-jannh@google.com
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

30 Jun, 2022

2 commits

  • This is the 5.15.50 stable release

    * tag 'v5.15.50': (1395 commits)
    Linux 5.15.50
    arm64: mm: Don't invalidate FROM_DEVICE buffers at start of DMA transfer
    serial: core: Initialize rs485 RTS polarity already on probe
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/bus/fsl-mc/fsl-mc-bus.c
    drivers/crypto/caam/ctrl.c
    drivers/pci/controller/dwc/pci-imx6.c
    drivers/spi/spi-fsl-qspi.c
    drivers/tty/serial/fsl_lpuart.c
    include/uapi/linux/dma-buf.h

    Jason Liu
     
  • This is the 5.15.41 stable release

    * tag 'v5.15.41': (1977 commits)
    Linux 5.15.41
    usb: gadget: uvc: allow for application to cleanly shutdown
    usb: gadget: uvc: rename function to be more consistent
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm64/boot/dts/freescale/fsl-ls1043a.dtsi
    arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi
    arch/arm64/configs/defconfig
    drivers/clk/imx/clk-imx8qxp-lpcg.c
    drivers/dma/imx-sdma.c
    drivers/gpu/drm/bridge/nwl-dsi.c
    drivers/mailbox/imx-mailbox.c
    drivers/net/phy/at803x.c
    drivers/tty/serial/fsl_lpuart.c
    security/keys/trusted-keys/trusted_core.c

    Jason Liu
     

22 Jun, 2022

1 commit

  • [ Upstream commit 4bca7e80b6455772b4bf3f536dcbc19aac424d6a ]

    noop_backing_dev_info is used by superblocks of various
    pseudofilesystems such as kdevtmpfs. After commit 10e14073107d
    ("writeback: Fix inode->i_io_list not be protected by inode->i_lock
    error") this broke because __mark_inode_dirty() started to access more
    fields from noop_backing_dev_info and this led to crashes inside
    locked_inode_to_wb_and_lock_list() called from __mark_inode_dirty().
    Fix the problem by initializing noop_backing_dev_info before the
    filesystems get mounted.

    Fixes: 10e14073107d ("writeback: Fix inode->i_io_list not be protected by inode->i_lock error")
    Reported-and-tested-by: Suzuki K Poulose
    Reported-and-tested-by: Alexandru Elisei
    Reported-and-tested-by: Guenter Roeck
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Sasha Levin

    Jan Kara
     

09 Jun, 2022

2 commits

  • commit a04e1928e2ead144dc2f369768bc0a0f3110af89 upstream.

    We forget to call untrack_pfn() to pair with track_pfn_remap() when the range
    is not allowed to be hotplugged. Fix it by jumping to the err_kasan label.

    Link: https://lkml.kernel.org/r/20220531122643.25249-1-linmiaohe@huawei.com
    Fixes: bca3feaa0764 ("mm/memory_hotplug: prevalidate the address range being added with platform")
    Signed-off-by: Miaohe Lin
    Reviewed-by: David Hildenbrand
    Acked-by: Muchun Song
    Cc: Anshuman Khandual
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit 48381273f8734d28ef56a5bdf1966dd8530111bc upstream.

    The routine huge_pmd_unshare() is passed a pointer to an address
    associated with an area which may be unshared. If unshare is successful
    this address is updated to 'optimize' callers iterating over huge page
    addresses. For the optimization to work correctly, address should be
    updated to the last huge page in the unmapped/unshared area. However, in
    the common case where the passed address is PUD_SIZE aligned, the address
    is incorrectly updated to the address of the preceding huge page. That
    wastes CPU cycles as the unmapped/unshared range is scanned twice.

    Link: https://lkml.kernel.org/r/20220524205003.126184-1-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Muchun Song
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz