05 Mar, 2020

4 commits

  • commit f42f25526502d851d0e3ca1e46297da8aafce8a7 upstream.

    If thp defrag setting "defer" is used and a newline is *not* used when
    writing to the sysfs file, this is interpreted as the "defer+madvise"
    option.

    This is because we do prefix matching and if five characters are written
    without a newline, the current code ends up comparing to the first five
    bytes of the "defer+madvise" option and using that instead.

    Use the more appropriate sysfs_streq() that handles the trailing newline
    for us. Since this doubles as a nice cleanup, do it in enabled_store()
    as well.

    The current implementation relies on prefix matching: the number of
    bytes compared is either the number of bytes written or the length of
    the option being compared. With a newline, "defer\n" does not match
    "defer+"madvise"; without a newline, however, "defer" is considered to
    match "defer+madvise" (prefix matching is only comparing the first five
    bytes). End result is that writing "defer" is broken unless it has an
    additional trailing character.

    This means that writing "madv" in the past would match and set
    "madvise". With strict checking, that no longer is the case but it is
    unlikely anybody is currently doing this.
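
    As a rough illustration of the two matching strategies, here is a userspace
    sketch; prefix_match() mimics the old prefix comparison and streq_newline()
    only approximates the semantics of the kernel's sysfs_streq(), it is not
    the kernel implementation:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Approximation of the old prefix match: compare only the bytes written. */
    static bool prefix_match(const char *buf, size_t count, const char *opt)
    {
        return strncmp(buf, opt, count) == 0;
    }

    /* Rough model of sysfs_streq(): exact match, tolerating one trailing '\n'. */
    static bool streq_newline(const char *s1, const char *s2)
    {
        size_t n = strlen(s1);

        if (n && s1[n - 1] == '\n')
            n--;
        return strncmp(s1, s2, n) == 0 && s2[n] == '\0';
    }

    int main(void)
    {
        /* "defer" written without a newline: 5 bytes. */
        printf("prefix vs defer+madvise: %d\n",
               prefix_match("defer", 5, "defer+madvise")); /* 1 -> wrong option */
        printf("streq  vs defer+madvise: %d\n",
               streq_newline("defer", "defer+madvise"));   /* 0 */
        printf("streq  vs defer:         %d\n",
               streq_newline("defer\n", "defer"));         /* 1 */
        return 0;
    }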

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001171411020.56385@chino.kir.corp.google.com
    Fixes: 21440d7eb904 ("mm, thp: add new defer+madvise defrag option")
    Signed-off-by: David Rientjes
    Suggested-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • commit cb829624867b5ab10bc6a7036d183b1b82bfe9f8 upstream.

    The page could be a tail page, if this is the case, this BUG_ON will
    never be triggered.

    Link: http://lkml.kernel.org/r/20200110032610.26499-1-richardw.yang@linux.intel.com
    Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")

    Signed-off-by: Wei Yang
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     
  • commit f4000fdf435b8301a11cf85237c561047f8c4c72 upstream.

    Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
    only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
    This, combined with the fact that get_user_pages_fast() falls back to
    "slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
    if you need FOLL_FORCE, you cannot call get_user_pages_fast().

    There does not appear to be any reason for filtering out FOLL_FORCE.
    There is nothing in the _fast() implementation that requires that we
    avoid writing to the pages. So it appears to have been an oversight.

    Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
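
    A minimal sketch of how a caller might now pin a page through the fast
    path with FOLL_FORCE; pin_one_page_forced() is an illustrative name, not
    an existing kernel helper:

    #include <linux/errno.h>
    #include <linux/mm.h>

    /* Illustrative helper, not from the kernel tree. */
    static int pin_one_page_forced(unsigned long uaddr, struct page **page)
    {
        int ret;

        /* FOLL_FORCE is now accepted here as well as in the slow path. */
        ret = get_user_pages_fast(uaddr, 1, FOLL_WRITE | FOLL_FORCE, page);
        if (ret < 0)
            return ret;         /* hard error */
        if (ret == 0)
            return -EFAULT;     /* nothing was pinned */
        return 0;               /* caller must put_page(*page) when done */
    }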

    Link: http://lkml.kernel.org/r/20200107224558.2362728-9-jhubbard@nvidia.com
    Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
    Signed-off-by: John Hubbard
    Reviewed-by: Leon Romanovsky
    Reviewed-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Alex Williamson
    Cc: Aneesh Kumar K.V
    Cc: Björn Töpel
    Cc: Daniel Vetter
    Cc: Dan Williams
    Cc: Hans Verkuil
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Jason Gunthorpe
    Cc: Jens Axboe
    Cc: Jerome Glisse
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    John Hubbard
     
  • commit 5b57b8f22709f07c0ab5921c94fd66e8c59c3e11 upstream.

    Commit 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
    inadvertently removed printing of page flags for pages that are neither
    anon nor ksm nor have a mapping. Fix that.

    Using pr_cont() again would be a solution, but the commit explicitly
    removed its use. Avoiding the danger of mixing up split lines from
    multiple CPUs might be beneficial for near-panic dumps like this, so fix
    this without reintroducing pr_cont().

    Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz
    Fixes: 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
    Signed-off-by: Vlastimil Babka
    Reported-by: Anshuman Khandual
    Reported-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Qian Cai
    Cc: Oscar Salvador
    Cc: Mel Gorman
    Cc: Mike Rapoport
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Ralph Campbell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

29 Feb, 2020

4 commits

  • commit dcde237319e626d1ec3c9d8b7613032f0fd4663a upstream.

    Currently the arm64 kernel ignores the top address byte passed to brk(),
    mmap() and mremap(). When the user is not aware of the 56-bit address
    limit or relies on the kernel to return an error, untagging such
    pointers has the potential to create address aliases in user-space.
    Passing a tagged address to munmap() or madvise() is permitted since the
    tagged pointer is expected to be inside an existing mapping.

    The current behaviour breaks the existing glibc malloc() implementation
    which relies on brk() with an address beyond 56-bit to be rejected by
    the kernel.

    Remove untagging in the above functions by partially reverting commit
    ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk"). In
    addition, update the arm64 tagged-address-abi.rst document accordingly.
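
    A small userspace sketch of the expectation described above (arm64-specific;
    the tag value and the check are illustrative, and the raw brk syscall simply
    returns the unchanged break when the request is rejected):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        uintptr_t cur = (uintptr_t)syscall(SYS_brk, 0);   /* current break */
        uintptr_t tagged = cur | (0x5fUL << 56);          /* set the top byte */

        /* With the revert, the kernel no longer untags this address, so the
         * break must stay where it was; glibc malloc() treats that as ENOMEM. */
        uintptr_t ret = (uintptr_t)syscall(SYS_brk, tagged);
        printf("brk(tagged address) was %s\n",
               ret == cur ? "rejected, as expected" : "accepted");
        return 0;
    }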

    Link: https://bugzilla.redhat.com/1797052
    Fixes: ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk")
    Cc: # 5.4.x-
    Cc: Florian Weimer
    Reviewed-by: Andrew Morton
    Reported-by: Victor Stinner
    Acked-by: Will Deacon
    Acked-by: Andrey Konovalov
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Greg Kroah-Hartman

    Catalin Marinas
     
  • commit 18e19f195cd888f65643a77a0c6aee8f5be6439a upstream.

    When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
    doesn't work before sparse_init_one_section() is called.

    This leads to a crash when hotplugging memory:

    BUG: unable to handle page fault for address: 0000000006400000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:__memset+0x24/0x30
    Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
    RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
    RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
    RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
    R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
    R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
    FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    sparse_add_section+0x1c9/0x26a
    __add_pages+0xbf/0x150
    add_pages+0x12/0x60
    add_memory_resource+0xc8/0x210
    __add_memory+0x62/0xb0
    acpi_memory_device_add+0x13f/0x300
    acpi_bus_attach+0xf6/0x200
    acpi_bus_scan+0x43/0x90
    acpi_device_hotplug+0x275/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    We should use the memmap directly, as the code did before.

    On x86 the impact is limited to x86_32 builds, or x86_64 configurations
    that override the default setting for SPARSEMEM_VMEMMAP.

    Other memory hotplug archs (arm64, ia64, and ppc) also default to
    SPARSEMEM_VMEMMAP=y.

    [dan.j.williams@intel.com: changelog update]
    [rppt@linux.ibm.com: changelog update]
    Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Baoquan He
    Acked-by: David Hildenbrand
    Reviewed-by: Baoquan He
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     
  • commit 76073c646f5f4999d763f471df9e38a5a912d70d upstream.

    Commit 68600f623d69 ("mm: don't miss the last page because of round-off
    error") makes the scan size round up to @denominator regardless of the
    memory cgroup's state, online or offline. This affects the overall
    reclaiming behavior: the corresponding LRU list is eligible for
    reclaiming only when its size logically right shifted by @sc->priority
    is bigger than zero in the former formula.

    For example, the inactive anonymous LRU list should have at least 0x4000
    pages to be eligible for reclaiming when we have 60/12 for
    swappiness/priority and without taking scan/rotation ratio into account.

    After the roundup is applied, the inactive anonymous LRU list becomes
    eligible for reclaiming when its size is bigger than or equal to 0x1000
    in the same condition.

    (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
    ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1
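
    The arithmetic above can be checked directly; this is only the example
    calculation from the changelog, not the kernel's get_scan_count():

    #include <stdio.h>

    int main(void)
    {
        unsigned long prio = 12, swappiness = 60, denom = 60 + 140 + 1;

        /* without the round-up: 0x4000 pages are needed for a non-zero scan */
        printf("%lu\n", (0x4000UL >> prio) * swappiness / denom);               /* 1 */
        printf("%lu\n", (0x1000UL >> prio) * swappiness / denom);               /* 0 */
        /* with the unconditional round-up (+ denom - 1), 0x1000 is already enough */
        printf("%lu\n", ((0x1000UL >> prio) * swappiness + denom - 1) / denom); /* 1 */
        return 0;
    }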

    aarch64 has 512MB huge page size when the base page size is 64KB. The
    memory cgroup that has a huge page is always eligible for reclaiming in
    that case.

    The reclaiming is likely to stop after the huge page is reclaimed,
    meaning the further iteration on @sc->priority and the sibling and child
    memory cgroups will be skipped. The overall behaviour has been changed.
    This fixes the issue by applying the roundup to offlined memory cgroups
    only, to give more preference to reclaiming memory from offlined memory
    cgroups. It sounds reasonable as that memory is unlikely to be used by
    anyone.

    The issue was found by starting up 8 VMs on an Ampere Mustang machine,
    which has 8 CPUs and 16 GB of memory. Each VM is given 2 vCPUs and 2GB
    of memory. It took 264 seconds for all VMs to be completely up, and
    784MB of swap was consumed after that. With this patch applied, it took
    236 seconds and 60MB of swap to do the same thing. So there is a 10%
    performance improvement for my case. Note that KSM is disabled while THP
    is enabled in the testing.

              total     used     free   shared  buff/cache  available
    Mem:      16196    10065     2049       16        4081       3749
    Swap:      8175      784     7391

              total     used     free   shared  buff/cache  available
    Mem:      16196    11324     3656       24        1215       2936
    Swap:      8175       60     8115

    Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
    Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
    Signed-off-by: Gavin Shan
    Acked-by: Roman Gushchin
    Cc: [4.20+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     
  • commit 75866af62b439859d5146b7093ceb6b482852683 upstream.

    for_each_mem_cgroup() increases css reference counter for memory cgroup
    and requires to use mem_cgroup_iter_break() if the walk is cancelled.

    Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
    Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
    Signed-off-by: Vasily Averin
    Acked-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     

11 Feb, 2020

7 commits

  • commit 0ed1325967ab5f7a4549a2641c6ebe115f76e228 upstream.

    Architectures for which we have hardware walkers of the Linux page table
    should flush the TLB on mmu gather batch allocation failures and batch flush.
    Some architectures like POWER support multiple translation modes (hash
    and radix), and in the case of POWER only the radix translation mode needs
    the above TLBI. This is because for hash translation mode the kernel wants
    to avoid this extra flush since there are no hardware walkers of the Linux
    page table. With radix translation, the hardware also walks the Linux page
    table and, with that, the kernel needs to make sure to TLB-invalidate the
    page walk cache before page table pages are freed.

    More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
    TLB caches for RCU_TABLE_FREE")

    The changes to sparc are to make sure we keep the old behavior since we
    are now removing HAVE_RCU_TABLE_NO_INVALIDATE. The default value for
    tlb_needs_table_invalidate is to always force an invalidate and sparc can
    avoid the table invalidate. Hence we define tlb_needs_table_invalidate to
    false for sparc architecture.

    Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
    Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Michael Ellerman [powerpc]
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit e822969cab48b786b64246aad1a3ba2a774f5d23 upstream.

    Patch series "mm: fix max_pfn not falling on section boundary", v2.

    Playing with different memory sizes for an x86-64 guest, I discovered that
    some memmaps (highest section if max_mem does not fall on the section
    boundary) are marked as being valid and online, but contain garbage. We
    have to properly initialize these memmaps.

    Looking at /proc/kpageflags and friends, I found some more issues,
    partially related to this.

    This patch (of 3):

    If max_pfn is not aligned to a section boundary, we can easily run into
    BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a
    memory size that is not a multiple of 128MB (e.g., 4097MB, but also
    4160MB). I was told that on real HW, we can easily have this scenario
    (esp., one of the main reasons sub-section hotadd of devmem was added).

    The issue is, that we have a valid memmap (pfn_valid()) for the whole
    section, and the whole section will be marked "online".
    pfn_to_online_page() will succeed, but the memmap contains garbage.

    E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
    4160M" - (see tools/vm/page-types.c):

    [ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
    [ 200.477500] #PF: supervisor read access in kernel mode
    [ 200.478334] #PF: error_code(0x0000) - not-present page
    [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
    [ 200.479557] Oops: 0000 [#4] SMP NOPTI
    [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93
    [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
    [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
    [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
    [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
    [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
    [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
    [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
    [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
    [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
    [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
    [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
    [ 200.488897] Call Trace:
    [ 200.489115] kpageflags_read+0xe9/0x140
    [ 200.489447] proc_reg_read+0x3c/0x60
    [ 200.489755] vfs_read+0xc2/0x170
    [ 200.490037] ksys_pread64+0x65/0xa0
    [ 200.490352] do_syscall_64+0x5c/0xa0
    [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
    after cold/hot plugging a DIMM to such a system:

    [root@localhost ~]# cat /proc/kpageflags > /dev/null
    [ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
    [ 111.517907] #PF: supervisor read access in kernel mode
    [ 111.518333] #PF: error_code(0x0000) - not-present page
    [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0

    This patch fixes that by at least zero-ing out that memmap (so e.g.,
    page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining
    unavailable struct pages") tried to fix a similar issue, but forgot to
    consider this special case.

    After this patch, there are still problems to solve. E.g., not all of
    these pages falling into a memory hole will actually get initialized later
    and set PageReserved - they are only zeroed out - but at least the
    immediate crashes are gone. A follow-up patch will take care of this.

    Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
    Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
    Signed-off-by: David Hildenbrand
    Tested-by: Daniel Jordan
    Cc: Naoya Horiguchi
    Cc: Pavel Tatashin
    Cc: Andrew Morton
    Cc: Steven Sistare
    Cc: Michal Hocko
    Cc: Daniel Jordan
    Cc: Bob Picco
    Cc: Oscar Salvador
    Cc: Alexey Dobriyan
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Stephen Rothwell
    Cc: [4.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit 5984fabb6e82d9ab4e6305cb99694c85d46de8ae upstream.

    Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the
    semantic of move_pages() has changed to return the number of
    non-migrated pages if they were the result of non-fatal reasons (usually
    a busy page).

    This was an unintentional change that hasn't been noticed except for LTP
    tests which checked for the documented behavior.

    There are two ways to go around this change. We can even get back to
    the original behavior and return -EAGAIN whenever migrate_pages is not
    able to migrate pages due to non-fatal reasons. Another option would be
    to simply continue with the changed semantic and extend move_pages
    documentation to clarify that -errno is returned on an invalid input or
    when migration simply cannot succeed (e.g. -ENOMEM, -EBUSY) or the
    number of pages that couldn't have been migrated due to ephemeral
    reasons (e.g. page is pinned or locked for other reasons).

    This patch implements the second option because this behavior is in
    place for some time without anybody complaining and possibly new users
    depending on it. Also it allows slightly easier error handling, as the
    caller knows that it is worth retrying when err > 0.

    But since, under the new semantic, the operation is aborted immediately
    when migration fails due to ephemeral reasons, we need to include the
    number of non-attempted pages in the return value too.
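
    A hedged userspace sketch of how a caller can act on that semantic
    (migrate_with_retry() is an illustrative name; link with -lnuma, whose
    wrapper returns -1 with errno set on hard errors):

    #include <numaif.h>
    #include <stdio.h>

    /* ret > 0: that many pages were left behind for ephemeral reasons, so
     * inspect status[] and retry; ret == 0: everything migrated; ret < 0
     * (or -1 with errno from the libnuma wrapper): the call itself failed. */
    static long migrate_with_retry(unsigned long count, void **pages,
                                   const int *nodes, int *status, int retries)
    {
        long ret;

        do {
            ret = move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE);
            if (ret <= 0)
                return ret;
            fprintf(stderr, "%ld pages not migrated yet, retrying\n", ret);
        } while (retries--);

        return ret;     /* still > 0: caller can inspect status[] */
    }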

    Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
    Signed-off-by: Yang Shi
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Reviewed-by: Wei Yang
    Cc: [4.17+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     
  • commit fac0516b5534897bf4c4a88daa06a8cfa5611b23 upstream.

    If compound is true, this means it is a PMD-mapped THP, which implies
    the page is not linked to any defer list. So the first code chunk will
    not be executed.

    Also with this reason, it would not be proper to add this page to a
    defer list. So the second code chunk is not correct.

    Based on this, we should remove the defer list related code.

    [yang.shi@linux.alibaba.com: better patch title]
    Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com
    Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
    Signed-off-by: Wei Yang
    Suggested-by: Kirill A. Shutemov
    Acked-by: Yang Shi
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: [5.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     
  • commit f1037ec0cc8ac1a450974ad9754e991f72884f48 upstream.

    The daxctl unit test for the dax_kmem driver currently triggers the
    (false positive) lockdep splat below. It results from the fact that
    remove_memory_block_devices() is invoked under the mem_hotplug_lock()
    causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
    active state tracking). It is a false positive because the sysfs
    attribute path triggering the memory remove is not the same attribute
    path associated with memory-block device.

    sysfs_break_active_protection() is not applicable since there is no real
    deadlock conflict, instead move memory-block device removal outside the
    lock. The mem_hotplug_lock() is not needed to synchronize the
    memory-block device removal vs the page online state, that is already
    handled by lock_device_hotplug(). Specifically, lock_device_hotplug()
    is sufficient to allow try_remove_memory() to check the offline state of
    the memblocks and be assured that any in progress online attempts are
    flushed / blocked by kernfs_drain() / attribute removal.

    The add_memory() path safely creates memblock devices under the
    mem_hotplug_lock(). There is no kernfs active state synchronization in
    the memblock device_register() path, so nothing to fix there.

    This change is only possible thanks to the recent change that refactored
    memory block device removal out of arch_remove_memory() (commit
    4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before
    arch_remove_memory()"), and David's due diligence tracking down the
    guarantees afforded by kernfs_drain(). Not flagged for -stable since
    this only impacts ongoing development and lockdep validation, not a
    runtime issue.

    ======================================================
    WARNING: possible circular locking dependency detected
    5.5.0-rc3+ #230 Tainted: G OE
    ------------------------------------------------------
    lt-daxctl/6459 is trying to acquire lock:
    ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80

    but task is already holding lock:
    ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (mem_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    get_online_mems+0x3e/0xb0
    kmem_cache_create_usercopy+0x2e/0x260
    kmem_cache_create+0x12/0x20
    ptlock_cache_init+0x20/0x28
    start_kernel+0x243/0x547
    secondary_startup_64+0xb6/0xc0

    -> #1 (cpu_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    cpus_read_lock+0x3e/0xb0
    online_pages+0x37/0x300
    memory_subsys_online+0x17d/0x1c0
    device_online+0x60/0x80
    state_store+0x65/0xd0
    kernfs_fop_write+0xcf/0x1c0
    vfs_write+0xdb/0x1d0
    ksys_write+0x65/0xe0
    do_syscall_64+0x5c/0xa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#241){++++}:
    check_prev_add+0x98/0xa40
    validate_chain+0x576/0x860
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    __kernfs_remove+0x25f/0x2e0
    kernfs_remove_by_name_ns+0x41/0x80
    remove_files.isra.0+0x30/0x70
    sysfs_remove_group+0x3d/0x80
    sysfs_remove_groups+0x29/0x40
    device_remove_attrs+0x39/0x70
    device_del+0x16a/0x3f0
    device_unregister+0x16/0x60
    remove_memory_block_devices+0x82/0xb0
    try_remove_memory+0xb5/0x130
    remove_memory+0x26/0x40
    dev_dax_kmem_remove+0x44/0x6a [kmem]
    device_release_driver_internal+0xe4/0x1c0
    unbind_store+0xef/0x120
    kernfs_fop_write+0xcf/0x1c0
    vfs_write+0xdb/0x1d0
    ksys_write+0x65/0xe0
    do_syscall_64+0x5c/0xa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
    kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem

    Possible unsafe locking scenario:

    CPU0                                CPU1
    ----                                ----
    lock(mem_hotplug_lock.rw_sem);
                                        lock(cpu_hotplug_lock.rw_sem);
                                        lock(mem_hotplug_lock.rw_sem);
    lock(kn->count#241);

    *** DEADLOCK ***

    No fixes tag as this has been a long standing issue that predated the
    addition of kernfs lockdep annotations.

    Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Vishal Verma
    Cc: Pavel Tatashin
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 1f503443e7df8dc8366608b4d810ce2d6669827c upstream.

    After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
    when a mem section is fully deactivated, section_mem_map still records
    the section's start pfn, which is not used any more and will be
    reassigned during re-addition.

    In analogy with the alloc/free pattern, it is better to clear all fields
    of section_mem_map.

    Besides this, it breaks the user space tool "makedumpfile" [1], which
    makes the assumption that a hot-removed section has mem_map as NULL,
    instead of checking directly against the SECTION_MARKED_PRESENT bit.
    (It would be better for makedumpfile to change that assumption, which
    needs a patch.)

    The bug can be reproduced on IBM PowerVM by running "drmgr -c mem -r -q 5",
    triggering a crash, and saving the vmcore with makedumpfile.

    [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")

    Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Baoquan He
    Cc: Qian Cai
    Cc: Kazuhito Hagio
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pingfan Liu
     
  • commit 68f23b89067fdf187763e75a56087550624fdbee upstream.

    Without memcg, there is a one-to-one mapping between the bdi and
    bdi_writeback structures. In this world, things are fairly
    straightforward; the first thing bdi_unregister() does is to shutdown
    the bdi_writeback structure (or wb), and part of that writeback ensures
    that no other work queued against the wb, and that the wb is fully
    drained.

    With memcg, however, there is a one-to-many relationship between the bdi
    and bdi_writeback structures; that is, there are multiple wb objects
    which can all point to a single bdi. There is a refcount which prevents
    the bdi object from being released (and hence, unregistered). So in
    theory, the bdi_unregister() *should* only get called once its refcount
    goes to zero (bdi_put will drop the refcount, and when it is zero,
    release_bdi gets called, which calls bdi_unregister).

    Unfortunately, del_gendisk() in block/genhd.c never got the memo about
    the Brave New memcg World, and calls bdi_unregister directly. It does
    this without informing the file system, or the memcg code, or anything
    else. This causes the root wb associated with the bdi to be
    unregistered, but none of the memcg-specific wb's are shutdown. So when
    one of these wb's are woken up to do delayed work, they try to
    dereference their wb->bdi->dev to fetch the device name, but
    unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
    called by del_gendisk(). As a result, *boom*.

    Fortunately, it looks like the rest of the writeback path is perfectly
    happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
    create a bdi_dev_name() function which can handle bdi->dev being NULL.
    This also allows us to bulletproof the writeback tracepoints to prevent
    them from dereferencing a NULL pointer and crashing the kernel if one is
    tracing with memcg's enabled, and an iSCSI device dies or a USB storage
    stick is pulled.
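
    A hedged sketch of the NULL-tolerant pattern; the actual bdi_dev_name()
    added by the patch may differ in name and placement:

    #include <linux/backing-dev.h>
    #include <linux/device.h>

    /* Illustrative only: never dereference bdi->dev blindly when printing. */
    static const char *wb_bdi_name(struct backing_dev_info *bdi)
    {
        if (!bdi || !bdi->dev)
            return "(unknown)";
        return dev_name(bdi->dev);
    }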

    The most common way of triggering this will be hotremoval of a device
    while writeback with memcg enabled is going on. It was triggering
    several times a day in a heavily loaded production environment.

    Google Bug Id: 145475544

    Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
    Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
    Signed-off-by: Theodore Ts'o
    Cc: Chris Mason
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     

06 Feb, 2020

2 commits

  • [ Upstream commit dfe9aa23cab7880a794db9eb2d176c00ed064eb6 ]

    If we get here after successfully adding the page to the list, err would
    be 1 to indicate the page is queued in the list.

    The current code has two problems:

    * on success, 0 is not returned
    * on error, if add_page_for_migration() returns 1, and the following err1
      from do_move_pages_to_node() is set, err1 is not returned since err
      is 1

    And these behaviors break the user interface.

    Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
    Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node")
    Signed-off-by: Wei Yang
    Acked-by: Yang Shi
    Cc: John Hubbard
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Wei Yang
     
  • commit c7a91bc7c2e17e0a9c8b9745a2cb118891218fd1 upstream.

    What we are trying to do is change the '=' character to a NUL terminator
    and then at the end of the function we restore it back to an '='. The
    problem is there are two error paths where we jump to the end of the
    function before we have replaced the '=' with NUL.

    We end up putting the '=' in the wrong place (possibly one element
    before the start of the buffer).
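
    A generic sketch of this bug class and its fix, not the mempolicy parser
    itself: the temporary NUL must only be undone on paths that actually
    installed it.

    #include <string.h>

    static int parse_key_value(char *str)
    {
        char *eq = strchr(str, '=');
        int err = -1;

        if (!eq)
            goto out;       /* early exit before '=' was replaced */
        *eq = '\0';         /* split "key=value" in place */

        /* ... parse str (key) and eq + 1 (value), set err ... */
        err = 0;
    out:
        if (eq)             /* restore only if we actually replaced it */
            *eq = '=';
        return err;
    }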

    Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
    Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Signed-off-by: Dan Carpenter
    Acked-by: Vlastimil Babka
    Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Dan Carpenter
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     

23 Jan, 2020

7 commits

  • commit 6d9e8c651dd979aa666bee15f086745f3ea9c4b3 upstream.

    Patch series "use div64_ul() instead of div_u64() if the divisor is
    unsigned long".

    We were first inspired by commit b0ab99e7736a ("sched: Fix possible divide
    by zero in avg_atom () calculation"), then refer to the recently analyzed
    mm code, we found this suspicious place.

    201 if (min) {
    202 min *= this_bw;
    203 do_div(min, tot_bw);
    204 }

    And we also disassembled and confirmed it:

    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
    0xffffffff811c37da : xor %r10d,%r10d
    0xffffffff811c37dd : test %rax,%rax
    0xffffffff811c37e0 : je 0xffffffff811c3800
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
    0xffffffff811c37e2 : imul %r8,%rax
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
    0xffffffff811c37e6 : mov %r9d,%r10d ---> truncates it to 32 bits here
    0xffffffff811c37e9 : xor %edx,%edx
    0xffffffff811c37eb : div %r10
    0xffffffff811c37ee : imul %rbx,%rax
    0xffffffff811c37f2 : shr $0x2,%rax
    0xffffffff811c37f6 : mul %rcx
    0xffffffff811c37f9 : shr $0x2,%rdx
    0xffffffff811c37fd : mov %rdx,%r10

    This series uses div64_ul() instead of div_u64() if the divisor is
    unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

    This patch (of 3):

    The variables 'min' and 'max' are unsigned long and do_div truncates
    them to 32 bits, which means it can test non-zero and be truncated to
    zero for division. Fix this issue by using div64_ul() instead.
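
    A hedged kernel-style sketch of the truncation (the values are made up;
    div_u64() takes a u32 divisor, div64_ul() an unsigned long one):

    #include <linux/math64.h>

    static void divisor_truncation_demo(void)
    {
        u64 dividend = 100;
        unsigned long divisor = 0x100000001UL;    /* low 32 bits == 1 */

        u64 wrong = div_u64(dividend, divisor);   /* divisor truncated: divides by 1 */
        u64 right = div64_ul(dividend, divisor);  /* full-width divisor: result is 0 */

        (void)wrong;
        (void)right;
    }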

    Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
    Fixes: 693108a8a667 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wen Yang
     
  • commit 8068df3b60373c390198f660574ea14c8098de57 upstream.

    When we remove an early section, we don't free the usage map, as the
    usage maps of other sections are placed into the same page. Once the
    section is removed, it is no longer an early section (especially, the
    memmap is freed). When we re-add that section, the usage map is reused,
    however, it is no longer an early section. When removing that section
    again, we try to kfree() a usage map that was allocated during early
    boot - bad.

    Let's check against PageReserved() to see if we are dealing with a
    usage map that was allocated during boot. We could also check against
    !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
    cleaner.

    Can be triggered using memtrace under ppc64/powernv:

    $ mount -t debugfs none /sys/kernel/debug/
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    ------------[ cut here ]------------
    kernel BUG at mm/slub.c:3969!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
    NIP kfree+0x338/0x3b0
    LR section_deactivate+0x138/0x200
    Call Trace:
    section_deactivate+0x138/0x200
    __remove_pages+0x114/0x150
    arch_remove_memory+0x3c/0x160
    try_remove_memory+0x114/0x1a0
    __remove_memory+0x20/0x40
    memtrace_enable_set+0x254/0x850
    simple_attr_write+0x138/0x160
    full_proxy_write+0x8c/0x110
    __vfs_write+0x38/0x70
    vfs_write+0x11c/0x2a0
    ksys_write+0x84/0x140
    system_call+0x5c/0x68
    ---[ end trace 4b053cbd84e0db62 ]---

    The first invocation will offline+remove memory blocks. The second
    invocation will first add+online them again, in order to offline+remove
    them again (usually we are lucky and the exact same memory blocks will
    get "reallocated").

    Tested on powernv with boot memory: The usage map will not get freed.
    Tested on x86-64 with DIMMs: The usage map will get freed.

    Using Dynamic Memory under a Power DLPAR can trigger it easily.

    Triggering removal (I assume after previously removed+re-added) of
    memory from the HMC GUI can crash the kernel with the same call trace
    and is fixed by this patch.

    Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
    Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
    Signed-off-by: David Hildenbrand
    Tested-by: Pingfan Liu
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit 8e57f8acbbd121ecfb0c9dc13b8b030f86c6bd3b upstream.

    Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but the same is
    not true for e.g. ppc64 and s390, where the kernel would not boot with
    debug_pagealloc=on as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.
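
    A hedged sketch of the two-stage pattern described above; the names are
    illustrative, not the exact symbols added by the patch:

    #include <linux/cache.h>
    #include <linux/jump_label.h>
    #include <linux/types.h>

    static bool feature_enabled_early __read_mostly;
    static DEFINE_STATIC_KEY_FALSE(feature_key);

    /* Safe at any time, including before jump_label_init(). */
    static inline bool feature_enabled(void)
    {
        return feature_enabled_early;
    }

    /* Fast-path variant, valid only after the key has been set up. */
    static inline bool feature_enabled_static(void)
    {
        return static_branch_unlikely(&feature_key);
    }

    /* Called from a point where jump_label_init() is guaranteed to have run. */
    static void feature_init_late(void)
    {
        if (feature_enabled_early)
            static_branch_enable(&feature_key);
    }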

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 2fe20210fc5f5e62644678b8f927c49f2c6f42a7 upstream.

    When booting with amd_iommu=off, the following WARNING message
    appears:

    AMD-Vi: AMD IOMMU disabled on kernel command-line
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
    Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
    RIP: 0010:flush_workqueue+0x42e/0x450
    Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
    Call Trace:
    kmem_cache_destroy+0x69/0x260
    iommu_go_to_state+0x40c/0x5ab
    amd_iommu_prepare+0x16/0x2a
    irq_remapping_prepare+0x36/0x5f
    enable_IR_x2apic+0x21/0x172
    default_setup_apic_routing+0x12/0x6f
    apic_intr_mode_init+0x1a1/0x1f1
    x86_late_time_init+0x17/0x1c
    start_kernel+0x480/0x53f
    secondary_startup_64+0xb6/0xc0
    ---[ end trace 30894107c3749449 ]---
    x2apic: IRQ remapping doesn't support X2APIC mode
    x2apic disabled

    The warning is caused by the calling of 'kmem_cache_destroy()'
    in free_iommu_resources(). Here is the call path:

    free_iommu_resources
    kmem_cache_destroy
    flush_memcg_workqueue
    flush_workqueue

    The root cause is that the IOMMU subsystem runs before the workqueue
    subsystem, so the variable 'wq_online' is still 'false'. This causes the
    check 'if (WARN_ON(!wq_online))' in flush_workqueue() to trigger.

    Since the variable 'memcg_kmem_cache_wq' has not been allocated at that
    point, it is unnecessary to call flush_memcg_workqueue(). Skipping it
    prevents the WARNING message triggered by flush_workqueue().

    Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
    Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
    Signed-off-by: Adrian Huang
    Reported-by: Xiaochun Lee
    Reviewed-by: Shakeel Butt
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Adrian Huang
     
  • commit 4a87e2a25dc27131c3cce5e94421622193305638 upstream.

    Currently slab percpu vmstats are flushed twice: during the memcg
    offlining and just before freeing the memcg structure. Each time percpu
    counters are summed, added to the atomic counterparts and propagated up
    by the cgroup tree.

    The second flushing is required due to how recursive vmstats are
    implemented: counters are batched in percpu variables on a local level,
    and once a percpu value is crossing some predefined threshold, it spills
    over to atomic values on the local and each ascendant levels. It means
    that without flushing, some numbers cached in percpu variables will be
    dropped on the floor each time a cgroup is destroyed. And with uptime the
    error on upper levels might become noticeable.

    The first flushing aims to make counters on ancestor levels more
    precise. Dying cgroups may remain in the dying state for a long time.
    After kmem_cache reparenting which is performed during the offlining
    slab counters of the dying cgroup don't have any chances to be updated,
    because any slab operations will be performed on the parent level. It
    means that the inaccuracy caused by percpu batching will not decrease up
    to the final destruction of the cgroup. By the original idea flushing
    slab counters during the offlining should minimize the visible
    inaccuracy of slab counters on the parent level.

    The problem is that percpu counters are not zeroed after the first
    flushing. So every cached percpu value is summed twice. It creates a
    small error (up to 32 pages per cpu, but usually less) which accumulates
    on parent cgroup level. After creating and destroying of thousands of
    child cgroups, slab counter on parent level can be way off the real
    value.

    For now, let's just stop flushing slab counters on memcg offlining. It
    can't be done correctly without scheduling a work on each cpu: reading
    and zeroing it during css offlining can race with an asynchronous
    update, which doesn't expect values to be changed underneath.

    With this change, slab counters on parent level will become eventually
    consistent. Once all dying children are gone, values are correct. And
    if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.

    It's not perfect, as slabs are reparented, so any updates after the
    reparenting will happen on the parent level. It means that if a slab
    page was allocated and a counter on the child level was bumped, then the
    page was reparented and freed, the annihilation of positive and negative
    counter values will not happen until the child cgroup is released. This
    makes slab counters different from others, and it might push us to
    implement flushing in a correct form again. But it's also a question of
    performance: scheduling a work on each cpu isn't free, and it's an open
    question whether the benefit of having more accurate counters is worth it.

    We might also consider flushing all counters on offlining, not only slab
    counters.

    So let's fix the main problem now: make the slab counters eventually
    consistent, so at least the error won't grow with uptime (or more
    precisely the number of created and destroyed cgroups). And think about
    the accuracy of counters separately.

    Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
    Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit 97d3d0f9a1cf132c63c0b8b8bd497b8a56283dd9 upstream.

    Patch series "Fix two above-47bit hint address vs. THP bugs".

    The two get_unmapped_area() implementations have to be fixed to provide
    THP-friendly mappings if above-47bit hint address is specified.

    This patch (of 2):

    Filesystems use thp_get_unmapped_area() to provide THP-friendly
    mappings. For DAX in particular.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks thp_get_unmapped_area(): the function
    would not try to allocate a PMD-aligned area if *any* hint address is
    specified.

    Modify the routine to handle it correctly:

    - Try to allocate the space at the specified hint address with length
    padding required for PMD alignment.
    - If failed, retry without length padding (but with the same hint
    address);
    - If the returned address matches the hint address return it.
    - Otherwise, align the address as required for THP and return.

    The user specified hint address is passed down to get_unmapped_area() so
    above-47bit hint address will be taken into account without breaking
    alignment requirements.
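
    A hedged pseudo-kernel sketch of the four steps listed above (this is not
    the actual thp_get_unmapped_area(); the padding amount and the alignment
    math are illustrative):

    #include <linux/err.h>
    #include <linux/mm.h>
    #include <linux/sched.h>

    static unsigned long thp_area_sketch(struct file *filp, unsigned long addr,
                                         unsigned long len, unsigned long pgoff,
                                         unsigned long flags)
    {
        loff_t off = (loff_t)pgoff << PAGE_SHIFT;
        unsigned long pad = len + PMD_SIZE - PAGE_SIZE;  /* room to align */
        unsigned long ret;

        /* 1) try at the hint with padding for PMD alignment */
        ret = current->mm->get_unmapped_area(filp, addr, pad, pgoff, flags);
        if (IS_ERR_VALUE(ret))
            /* 2) retry without padding, keeping the same hint */
            return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);

        /* 3) the hint itself was satisfied: return it untouched */
        if (ret == addr)
            return ret;

        /* 4) otherwise shift the result up to the next PMD-aligned offset */
        ret += (off - ret) & (PMD_SIZE - 1);
        return ret;
    }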

    Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Thomas Willhalm
    Tested-by: Dan Williams
    Cc: "Aneesh Kumar K . V"
    Cc: "Bruggeman, Otto G"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 991589974d9c9ecb24ee3799ec8c415c730598a2 upstream.

    Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
    enabled. But it doesn't work well with above-47bit hint address.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks THP alignment in shmem/tmpfs:
    shmem_get_unmapped_area() would not try to allocate a PMD-aligned area if
    *any* hint address is specified.

    This can be fixed by requesting the aligned area if we failed to
    allocate at the user-specified hint address. The request with inflated
    length will also take the user-specified hint address. This way we will
    not lose an allocation request from the full address space.

    [kirill@shutemov.name: fold in a fixup]
    Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
    Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Cc: "Willhalm, Thomas"
    Cc: Dan Williams
    Cc: "Bruggeman, Otto G"
    Cc: "Aneesh Kumar K . V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

18 Jan, 2020

1 commit

  • commit 1d1585ca0f48fe7ed95c3571f3e4a82b2b5045dc upstream.

    Commit 3d7081822f7f ("uaccess: Add non-pagefault user-space read functions")
    missed adding a probe write function; therefore, factor out a probe_write_common()
    helper with most of the logic of probe_kernel_write() except setting KERNEL_DS,
    and add a new probe_user_write() helper so it can be used from the BPF side.

    Again, on some archs, the user address space and kernel address space can
    co-exist and be overlapping, so in such a case, setting KERNEL_DS would mean
    that the given address is treated as being in kernel address space.
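
    A hedged sketch of a caller; the signature assumed here is
    probe_user_write(dst, src, size) returning 0 on success, and
    poke_user_u32() is an illustrative name:

    #include <linux/uaccess.h>

    /* Poke a u32 into user memory without risking a page fault. */
    static long poke_user_u32(void __user *uaddr, u32 val)
    {
        return probe_user_write(uaddr, &val, sizeof(val));
    }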

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Cc: Masami Hiramatsu
    Link: https://lore.kernel.org/bpf/9df2542e68141bfa3addde631441ee45503856a8.1572649915.git.daniel@iogearbox.net
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

09 Jan, 2020

10 commits

  • [ Upstream commit c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d ]

    The following lockdep splat was observed when a certain hugetlbfs test
    was run:

    ================================
    WARNING: inconsistent lock state
    4.18.0-159.el8.x86_64+debug #1 Tainted: G W --------- - -
    --------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
    {SOFTIRQ-ON-W} state was registered at:
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    __nr_hugepages_store_common+0x11b/0xb30
    hugetlb_sysctl_handler_common+0x209/0x2d0
    proc_sys_call_handler+0x37f/0x450
    vfs_write+0x157/0x460
    ksys_write+0xb8/0x170
    do_syscall_64+0xa5/0x4d0
    entry_SYSCALL_64_after_hwframe+0x6a/0xdf
    irq event stamp: 691296
    hardirqs last enabled at (691296): [] _raw_spin_unlock_irqrestore+0x4b/0x60
    hardirqs last disabled at (691295): [] _raw_spin_lock_irqsave+0x22/0x81
    softirqs last enabled at (691284): [] irq_enter+0xc3/0xe0
    softirqs last disabled at (691285): [] irq_exit+0x23e/0x2b0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(hugetlb_lock);
    <Interrupt>
      lock(hugetlb_lock);

    *** DEADLOCK ***
    :
    Call Trace:

    __lock_acquire+0x146b/0x48c0
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    free_huge_page+0x36f/0xaa0
    bio_check_pages_dirty+0x2fc/0x5c0
    clone_endio+0x17f/0x670 [dm_mod]
    blk_update_request+0x276/0xe50
    scsi_end_request+0x7b/0x6a0
    scsi_io_completion+0x1c6/0x1570
    blk_done_softirq+0x22e/0x350
    __do_softirq+0x23d/0xad8
    irq_exit+0x23e/0x2b0
    do_IRQ+0x11a/0x200
    common_interrupt+0xf/0xf

    Both the hugetlb_lock and the subpool lock can be acquired in
    free_huge_page(). One way to solve the problem is to make both locks
    irq-safe. However, Mike Kravetz had learned that the hugetlb_lock is
    held for a linear scan of ALL hugetlb pages during a cgroup reparenting
    operation. So it is just too long to have irqs disabled unless we can
    break hugetlb_lock down into finer-grained locks with shorter lock hold
    times.

    Another alternative is to defer the freeing to a workqueue job. This
    patch implements the deferred freeing by adding a free_hpage_workfn()
    work function to do the actual freeing. The free_huge_page() call in a
    non-task context saves the page to be freed in the hpage_freelist linked
    list in a lockless manner using the llist APIs.

    The generic workqueue is used to process the work, but a dedicated
    workqueue can be used instead if it is desirable to have the huge page
    freed ASAP.
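
    A hedged sketch of the lockless llist + workqueue pattern described above;
    the names are illustrative, not the hugetlb symbols:

    #include <linux/llist.h>
    #include <linux/slab.h>
    #include <linux/workqueue.h>

    struct deferred_item {
        struct llist_node node;
        /* ... payload ... */
    };

    static LLIST_HEAD(deferred_list);

    static void deferred_free_workfn(struct work_struct *work)
    {
        struct llist_node *head = llist_del_all(&deferred_list);
        struct deferred_item *item, *next;

        llist_for_each_entry_safe(item, next, head, node)
            kfree(item);    /* the real freeing, now in task context */
    }
    static DECLARE_WORK(deferred_free_work, deferred_free_workfn);

    static void queue_deferred_free(struct deferred_item *item)
    {
        /* Lockless add; safe from softirq / non-task context. */
        if (llist_add(&item->node, &deferred_list))
            schedule_work(&deferred_free_work);
    }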

    Thanks to Kirill Tkhai for suggesting the use of
    llist APIs which simplify the code.

    Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Mike Kravetz
    Acked-by: Davidlohr Bueso
    Acked-by: Michal Hocko
    Reviewed-by: Kirill Tkhai
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
     
  • [ Upstream commit 030eab4f9ffb469344c10a46bc02c5149db0a2a9 ]

    Building the kernel on s390 with -Og produces the following warning:

    WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap()
    The function populate_section_memmap() references
    the function __meminit __populate_section_memmap().
    This is often because populate_section_memmap lacks a __meminit
    annotation or the annotation of __populate_section_memmap is wrong.

    While -Og is not supported, in theory this might still happen with
    another compiler or on another architecture. So fix this by using the
    correct section annotations.

    [iii@linux.ibm.com: v2]
    Link: http://lkml.kernel.org/r/20191030151639.41486-1-iii@linux.ibm.com
    Link: http://lkml.kernel.org/r/20191028165549.14478-1-iii@linux.ibm.com
    Signed-off-by: Ilya Leoshkevich
    Acked-by: David Hildenbrand
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Ilya Leoshkevich
     
  • commit 24cecc37746393432d994c0dbc251fb9ac7c5d72 upstream.

    The ARMv8 64-bit architecture supports execute-only user permissions by
    clearing the PTE_USER and PTE_UXN bits, practically making it a mostly
    privileged mapping but one from which user code running at EL0 can still
    execute.

    The downside, however, is that the kernel at EL1 inadvertently reading
    such mapping would not trip over the PAN (privileged access never)
    protection.

    Revert the relevant bits from commit cab15ce604e5 ("arm64: Introduce
    execute-only page access permissions") so that PROT_EXEC implies
    PROT_READ (and therefore PTE_USER) until the architecture gains proper
    support for execute-only user mappings.

    Fixes: cab15ce604e5 ("arm64: Introduce execute-only page access permissions")
    Cc: # 4.9.x-
    Acked-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Catalin Marinas
     
  • commit a7c46c0c0e3d62f2764cd08b90934cd2aaaf8545 upstream.

    In the implementation of __gup_benchmark_ioctl() the allocated pages
    should be released before returning in case of an invalid cmd. Release
    pages via kvfree().

    [akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1]
    Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com
    Fixes: 714a3a1ebafe ("mm/gup_benchmark.c: add additional pinning methods")
    Signed-off-by: Navid Emamdoost
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: John Hubbard
    Cc: Keith Busch
    Cc: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Navid Emamdoost
     
  • commit 941f762bcb276259a78e7931674668874ccbda59 upstream.

    pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
    As everything else is printed in kB, I chose to fix the value rather than
    the string.

    Before:

    [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    ...
    [ 1878] 1000 1878 217253 151144 1269760 0 0 python
    ...
    Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0

    After:

    [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    ...
    [ 1436] 1000 1436 217253 151890 1294336 0 0 python
    ...
    Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
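
    The fix is a plain bytes-to-kB conversion; a tiny stand-alone demo using
    the pgtables_bytes value from the report above (illustrative only):

    #include <stdio.h>

    int main(void)
    {
            unsigned long pgtables_bytes = 1294336;  /* from the oom report */

            /* before: the raw byte count was printed with a kB suffix */
            printf("pgtables:%lukB (wrong)\n", pgtables_bytes);

            /* after: scale bytes to kB before printing */
            printf("pgtables:%lukB (right)\n", pgtables_bytes / 1024);
            return 0;
    }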

    Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
    Fixes: 70cb6d267790 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Edward Chron
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     
  • commit e0153fc2c7606f101392b682e720a7a456d6c766 upstream.

    Felix Abecassis reports that move_pages() returns random status values
    if the pages are already on the target node, as demonstrated by the test
    program below:

    int main(void)
    {
            const long node_id = 1;
            const long page_size = sysconf(_SC_PAGESIZE);
            const int64_t num_pages = 8;

            unsigned long nodemask = 1 << node_id;
            long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
            if (ret < 0)
                    return (EXIT_FAILURE);

            void **pages = malloc(sizeof(void*) * num_pages);
            for (int i = 0; i < num_pages; ++i) {
                    pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
                                    MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
                                    -1, 0);
                    if (pages[i] == MAP_FAILED)
                            return (EXIT_FAILURE);
            }

            ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
            if (ret < 0)
                    return (EXIT_FAILURE);

            int *nodes = malloc(sizeof(int) * num_pages);
            int *status = malloc(sizeof(int) * num_pages);
            for (int i = 0; i < num_pages; ++i) {
                    nodes[i] = node_id;
                    status[i] = 0xd0;       /* simulate garbage values */
            }

            ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
            printf("move_pages: %ld\n", ret);
            for (int i = 0; i < num_pages; ++i)
                    printf("status[%d] = %d\n", i, status[i]);
    }

    Running the program then prints nonsense status values:

    $ ./move_pages_bug
    move_pages: 0
    status[0] = 208
    status[1] = 208
    status[2] = 208
    status[3] = 208
    status[4] = 208
    status[5] = 208
    status[6] = 208
    status[7] = 208

    This is because the status is not set if the page is already on the
    target node, but move_pages() should return a valid status as long as it
    succeeds. The valid status may be an errno or a node id.

    We can't simply initialize the status array to zero, since the pages may
    not be on node 0. Fix it by updating the status with the node id that the
    page is already on, as sketched below.
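
    A hedged sketch of the idea in do_pages_move() (helper names as in
    mm/migrate.c, exact flow approximated): when the page already sits on the
    requested node, report that node through the status array instead of
    leaving the entry untouched:

    err = add_page_for_migration(mm, addr, current_node, &pagelist,
                                 flags & MPOL_MF_MOVE_ALL);
    if (err > 0) {
            /* the page was queued for migration */
            continue;
    }

    /* already on the target node (err == 0): report the node id,
     * otherwise report the error */
    err = store_status(status, i, err ? err : current_node, 1);
    if (err)
            goto out_flush;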

    Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
    Signed-off-by: Yang Shi
    Reported-by: Felix Abecassis
    Tested-by: Felix Abecassis
    Suggested-by: Michal Hocko
    Reviewed-by: John Hubbard
    Acked-by: Christoph Lameter
    Acked-by: Michal Hocko
    Reviewed-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: [4.17+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     
  • commit ac8f05da5174c560de122c499ce5dfb5d0dfbee5 upstream.

    When a zspage is migrated to another zone, the zone page state should be
    updated as well; otherwise the NR_ZSPAGES count for each zone shows wrong
    values, visible for example in /proc/zoneinfo in practice.
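
    A hedged, kernel-style fragment of the described fix (not the literal
    diff): when migration moves the zspage contents to a page in another
    zone, the per-zone counter has to follow:

    /* old page leaves its zone, new page joins the destination zone */
    dec_zone_page_state(page, NR_ZSPAGES);
    inc_zone_page_state(newpage, NR_ZSPAGES);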

    Link: http://lkml.kernel.org/r/1575434841-48009-1-git-send-email-chanho.min@lge.com
    Fixes: 91537fee0013 ("mm: add NR_ZSMALLOC to vmstat")
    Signed-off-by: Chanho Min
    Signed-off-by: Jinsuk Choi
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chanho Min
     
  • commit feee6b2989165631b17ac6d4ccdbf6759254e85a upstream.

    We currently try to shrink a single zone when removing memory. We use
    the zone of the first page of the memory we are removing. If that
    memmap was never initialized (e.g., memory was never onlined), we will
    read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __remove_pages+0x4b/0x640
    arch_remove_memory+0x63/0x8d
    try_remove_memory+0xdb/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x70/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x227/0x3a0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x221/0x550
    worker_thread+0x50/0x3b0
    kthread+0x105/0x140
    ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

    Instead, shrink the zones when offlining memory or when onlining failed.
    Introduce and use remove_pfn_range_from_zone() for that. We now
    properly shrink the zones, even if we have DIMMs whereby

    - Some memory blocks fall into no zone (never onlined)

    - Some memory blocks fall into multiple zones (offlined+re-onlined)

    - Multiple memory blocks fall into different zones

    Drop the zone parameter (with a potential dubious value) from
    __remove_pages() and __remove_section().
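
    A hedged sketch of the resulting interface (prototypes approximated from
    the description above; treat them as illustrative): the zone shrinking is
    done explicitly via the new helper, and __remove_pages() no longer takes
    a zone:

    /* shrink zone spans when offlining or when onlining failed */
    void remove_pfn_range_from_zone(struct zone *zone,
                                    unsigned long start_pfn,
                                    unsigned long nr_pages);

    /* the (potentially bogus) zone argument is gone */
    void __remove_pages(unsigned long pfn, unsigned long nr_pages,
                        struct vmem_altmap *altmap);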

    Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • [ Upstream commit 89b15332af7c0312a41e50846819ca6613b58b4c ]

    One of our services is observing hanging ps/top/etc under heavy write
    IO, and the task states show this is an mmap_sem priority inversion:

    A write fault is holding the mmap_sem in read-mode and waiting for
    (heavily cgroup-limited) IO in balance_dirty_pages():

    balance_dirty_pages+0x724/0x905
    balance_dirty_pages_ratelimited+0x254/0x390
    fault_dirty_shared_page.isra.96+0x4a/0x90
    do_wp_page+0x33e/0x400
    __handle_mm_fault+0x6f0/0xfa0
    handle_mm_fault+0xe4/0x200
    __do_page_fault+0x22b/0x4a0
    page_fault+0x45/0x50

    Somebody tries to change the address space, contending for the mmap_sem in
    write-mode:

    call_rwsem_down_write_failed_killable+0x13/0x20
    do_mprotect_pkey+0xa8/0x330
    SyS_mprotect+0xf/0x20
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The waiting writer locks out all subsequent readers to avoid lock
    starvation, and several threads can be seen hanging like this:

    call_rwsem_down_read_failed+0x14/0x30
    proc_pid_cmdline_read+0xa0/0x480
    __vfs_read+0x23/0x140
    vfs_read+0x87/0x130
    SyS_read+0x42/0x90
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    To fix this, do what we do for cache read faults already: drop the
    mmap_sem before calling into anything IO bound, in this case the
    balance_dirty_pages() function, and return VM_FAULT_RETRY.
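
    A hedged, kernel-style sketch of the pattern (helper name as used by the
    cache read fault path; the actual patch reworks fault_dirty_shared_page()):
    release mmap_sem before the IO-bound throttling and tell the caller to
    retry the fault:

    struct file *fpin;

    /* drop mmap_sem (if allowed) before the IO-bound throttling;
     * mapping is the vma's file mapping */
    fpin = maybe_unlock_mmap_for_io(vmf, NULL);
    balance_dirty_pages_ratelimited(mapping);
    if (fpin) {
            fput(fpin);
            return VM_FAULT_RETRY;
    }
    return 0;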

    Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Kirill A. Shutemov
    Cc: Josef Bacik
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     
  • [ Upstream commit 8897c1b1a1795cab23d5ac13e4e23bf0b5f4e0c6 ]

    syzbot found the following crash:

    BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
    Read of size 8 at addr ffff8880a5cf2c50 by task syz-executor.0/26173

    CPU: 0 PID: 26173 Comm: syz-executor.0 Not tainted 5.3.0-rc6 #146
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
    trace_lock_acquire include/trace/events/lock.h:13 [inline]
    lock_acquire+0x2de/0x410 kernel/locking/lockdep.c:4411
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:338 [inline]
    shmem_fault+0x5ec/0x7b0 mm/shmem.c:2034
    __do_fault+0x111/0x540 mm/memory.c:3083
    do_shared_fault mm/memory.c:3535 [inline]
    do_fault mm/memory.c:3613 [inline]
    handle_pte_fault mm/memory.c:3840 [inline]
    __handle_mm_fault+0x2adf/0x3f20 mm/memory.c:3964
    handle_mm_fault+0x1b5/0x6b0 mm/memory.c:4001
    do_user_addr_fault arch/x86/mm/fault.c:1441 [inline]
    __do_page_fault+0x536/0xdd0 arch/x86/mm/fault.c:1506
    do_page_fault+0x38/0x590 arch/x86/mm/fault.c:1530
    page_fault+0x39/0x40 arch/x86/entry/entry_64.S:1202

    It happens if the VMA gets unmapped under us while we have dropped
    mmap_sem and the inode gets freed.

    Pinning the file if we drop mmap_sem fixes the issue.
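
    A hedged fragment of the idea (not the literal diff): take a reference on
    the file backing the VMA before dropping mmap_sem, so the inode cannot be
    freed while shmem_fault() sleeps:

    /* pin the file so the inode stays alive across the wait */
    struct file *fpin = get_file(vma->vm_file);

    up_read(&vma->vm_mm->mmap_sem);
    /* ... wait for the hole-punch to finish ... */
    fput(fpin);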

    Link: http://lkml.kernel.org/r/20190927083908.rhifa4mmaxefc24r@box
    Signed-off-by: Kirill A. Shutemov
    Reported-by: syzbot+03ee87124ee05af991bd@syzkaller.appspotmail.com
    Acked-by: Johannes Weiner
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Josef Bacik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
     

31 Dec, 2019

1 commit

  • commit 42a9a53bb394a1de2247ef78f0b802ae86798122 upstream.

    Since commit 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on
    memcg kmem"), the shrinkers' idr is protected by CONFIG_MEMCG instead of
    CONFIG_MEMCG_KMEM, so it makes no sense to guard the shrinker idr replace
    path with CONFIG_MEMCG_KMEM.

    And in the CONFIG_MEMCG && CONFIG_SLOB case, shrinker_idr contains only
    one shrinker, deferred_split_shrinker, but it is never actually called,
    since idr_replace() is never compiled in due to the wrong #ifdef.
    The deferred_split_shrinker stays in a half-registered state the whole
    time and is never called for subordinate memory cgroups.
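
    A hedged fragment showing the nature of the fix (not the literal diff):
    the idr_replace() call has to be guarded by the same config option that
    guards shrinker_idr itself:

    #ifdef CONFIG_MEMCG                     /* was: CONFIG_MEMCG_KMEM */
            idr_replace(&shrinker_idr, shrinker, shrinker->id);
    #endif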

    Link: http://lkml.kernel.org/r/1575486978-45249-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on memcg kmem")
    Signed-off-by: Yang Shi
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Roman Gushchin
    Cc: [5.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     

18 Dec, 2019

3 commits

  • commit aa71ecd8d86500da6081a72da6b0b524007e0627 upstream.

    On a 64-bit system, sb->s_maxbytes of a shmem filesystem is
    MAX_LFS_FILESIZE, which equals LLONG_MAX.

    If offset > LLONG_MAX - PAGE_SIZE but offset + len is still below
    LLONG_MAX, the call to shmem_fallocate() passes the check in
    vfs_fallocate():

    /* Check for wrap through zero too */
    if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
            return -EFBIG;

    However, loff_t unmap_start = round_up(offset, PAGE_SIZE) in
    shmem_fallocate() then causes an overflow.

    Syzkaller reports an overflow problem in mm/shmem:

    UBSAN: Undefined behaviour in mm/shmem.c:2014:10
    signed integer overflow: '9223372036854775807 + 1' cannot be represented in type 'long long int'
    CPU: 0 PID:17076 Comm: syz-executor0 Not tainted 4.1.46+ #1
    Hardware name: linux, dummy-virt (DT)
    Call trace:
    dump_backtrace+0x0/0x2c8 arch/arm64/kernel/traps.c:100
    show_stack+0x20/0x30 arch/arm64/kernel/traps.c:238
    __dump_stack lib/dump_stack.c:15 [inline]
    ubsan_epilogue+0x18/0x70 lib/ubsan.c:164
    handle_overflow+0x158/0x1b0 lib/ubsan.c:195
    shmem_fallocate+0x6d0/0x820 mm/shmem.c:2104
    vfs_fallocate+0x238/0x428 fs/open.c:312
    SYSC_fallocate fs/open.c:335 [inline]
    SyS_fallocate+0x54/0xc8 fs/open.c:239

    Because unmap_start has overflowed into a negative value, its sign bit
    is replicated when shmem_falloc.start is calculated:

    shmem_falloc.start = unmap_start >> PAGE_SHIFT;

    Fix it by casting unmap_start to u64 before the right shift.
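
    A tiny stand-alone demo of why the cast matters (values chosen to mimic
    the wrapped result; this is not the kernel code):

    #include <stdio.h>

    int main(void)
    {
            /* pretend round_up() already wrapped into a negative value */
            long long unmap_start = -4096;

            /* buggy: arithmetic shift keeps replicating the sign bit */
            printf("signed shift:   %lld\n", unmap_start >> 12);

            /* fixed: shift as an unsigned 64-bit value (u64), so the
             * resulting start index is well defined */
            printf("unsigned shift: %llu\n",
                   (unsigned long long)unmap_start >> 12);
            return 0;
    }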

    This bug was found in LTS Linux 4.1 and also seems to exist in mainline.

    Link: http://lkml.kernel.org/r/1573867464-5107-1-git-send-email-chenjun102@huawei.com
    Signed-off-by: Chen Jun
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Qian Cai
    Cc: Kefeng Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chen Jun
     
  • commit a264df74df38855096393447f1b8f386069a94b9 upstream.

    Christian reported a warning like the following, obtained while running
    some KVM-related tests on s390:

    WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
    Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
    CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
    Hardware name: IBM 2964 NC9 712 (LPAR)
    Workqueue: events sysfs_slab_remove_workfn
    Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
    0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
    0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
    0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
    Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0
    0000001529746848: 47000700 bc 0,1792
    #000000152974684c: a7f40001 brc 15,152974684e
    >0000001529746850: a7f4fff2 brc 15,1529746834
    0000001529746854: 0707 bcr 0,%r7
    0000001529746856: 0707 bcr 0,%r7
    0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15)
    000000152974685e: a738ffff lhi %r3,-1
    Call Trace:
    ([] 0x3e009263d00)
    [] slab_kmem_cache_release+0x3a/0x70
    [] kobject_put+0xaa/0xe8
    [] process_one_work+0x1e8/0x428
    [] worker_thread+0x48/0x460
    [] kthread+0x126/0x160
    [] ret_from_fork+0x28/0x30
    [] kernel_thread_starter+0x0/0x10
    Last Breaking-Event-Address:
    [] percpu_ref_exit+0x4c/0x58
    ---[ end trace b035e7da5788eb09 ]---

    The problem occurs because kmem_cache_destroy() is called immediately
    after the deletion of a memcg, so it races with the memcg kmem_cache
    deactivation.

    flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is
    supposed to guarantee that all deactivation processes are finished, but
    failed to do so. It waits for an rcu grace period, after which all
    children kmem_caches should be deactivated. During the deactivation
    percpu_ref_kill() is called for non root kmem_cache refcounters, but it
    requires yet another rcu grace period to finish the transition to the
    atomic (dead) state.

    So in a rare case when not all children kmem_caches are destroyed at the
    moment when the root kmem_cache is about to be gone, we need to wait
    another rcu grace period before destroying the root kmem_cache.

    This issue can be triggered only with dynamically created kmem_caches
    which are used with memcg accounting. In this case per-memcg child
    kmem_caches are created. They are deactivated from the cgroup removing
    path. If the destruction of the root kmem_cache is racing with the
    removal of the cgroup (both are quite complicated multi-stage
    processes), the described issue can occur. The only known way to
    trigger it in real life is to unload a kernel module which creates a
    dedicated kmem_cache used from different memory cgroups with the
    GFP_ACCOUNT flag. If the unloading happens immediately after calling
    rmdir on the corresponding cgroup, there is some chance of triggering
    the issue.

    Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com
    Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
    Signed-off-by: Roman Gushchin
    Reported-by: Christian Borntraeger
    Tested-by: Christian Borntraeger
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit 05d351102dbe4e103d6bdac18b1122cd3cd04925 upstream.

    F_SEAL_FUTURE_WRITE has unexpected behavior when used with MAP_PRIVATE:
    a private mapping created after the memfd file is sealed with
    F_SEAL_FUTURE_WRITE loses its copy-on-write-at-fork behavior, meaning
    children and parent share the same memory, even though the mapping is
    private.

    The reason is the code below:

    static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
    {
            struct shmem_inode_info *info = SHMEM_I(file_inode(file));

            if (info->seals & F_SEAL_FUTURE_WRITE) {
                    /*
                     * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
                     * "future write" seal active.
                     */
                    if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
                            return -EPERM;

                    /*
                     * Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED
                     * read-only mapping, take care to not allow mprotect to revert
                     * protections.
                     */
                    vma->vm_flags &= ~(VM_MAYWRITE);
            }
            ...
    }

    And for the mm to know if a mapping is copy-on-write:

    static inline bool is_cow_mapping(vm_flags_t flags)
    {
            return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    }

    The patch fixes the issue by making the mprotect revert protection
    happen only for shared mappings. For private mappings, using mprotect
    will have no effect on the seal behavior.
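
    A hedged, kernel-style sketch of the reworked check in shmem_mmap() (not
    the literal diff): the VM_MAYWRITE revert is applied only to shared
    mappings, so private mappings keep their copy-on-write semantics:

    if (info->seals & F_SEAL_FUTURE_WRITE) {
            /* deny new shared writable mappings outright */
            if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
                    return -EPERM;

            /* only shared read-only mappings need the mprotect() guard */
            if (vma->vm_flags & VM_SHARED)
                    vma->vm_flags &= ~(VM_MAYWRITE);
    }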

    The F_SEAL_FUTURE_WRITE feature was introduced in v5.1 so v5.3.x stable
    kernels would need a backport.

    [akpm@linux-foundation.org: reflow comment, per Christoph]
    Link: http://lkml.kernel.org/r/20191107195355.80608-1-joel@joelfernandes.org
    Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
    Signed-off-by: Nicolas Geoffray
    Signed-off-by: Joel Fernandes (Google)
    Cc: Hugh Dickins
    Cc: Shuah Khan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nicolas Geoffray
     

23 Nov, 2019

1 commit

  • It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in
    remove_stable_node() when it races with __mmput() and squeezes in
    between ksm_exit() and exit_mmap().

    WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150

    Call Trace:
    remove_all_stable_nodes+0x12b/0x330
    run_store+0x4ef/0x7b0
    kernfs_fop_write+0x200/0x420
    vfs_write+0x154/0x450
    ksys_write+0xf9/0x1d0
    do_syscall_64+0x99/0x510
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Remove the warning as there is nothing scary going on.

    Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com
    Fixes: cbf86cfe04a6 ("ksm: remove old stable nodes more thoroughly")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin