12 Dec, 2020

3 commits

  • Commit 1378a5ee451a ("mm: store compound_nr as well as compound_order")
    added a compound_nr counter to the first tail struct page, overlaying it
    with page->mapping. The overlay itself is fine, but while freeing gigantic
    hugepages via free_contig_range(), a "bad page" check will trigger for a
    non-NULL page->mapping on the first tail page:

    BUG: Bad page state in process bash pfn:380001
    page:00000000c35f0856 refcount:0 mapcount:0 mapping:00000000126b68aa index:0x0 pfn:0x380001
    aops:0x0
    flags: 0x3ffff00000000000()
    raw: 3ffff00000000000 0000000000000100 0000000000000122 0000000100000000
    raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
    page dumped because: non-NULL mapping
    Modules linked in:
    CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 #1
    Hardware name: IBM 3906 M03 703 (LPAR)
    Call Trace:
    show_stack+0x6e/0xe8
    dump_stack+0x90/0xc8
    bad_page+0xd6/0x130
    free_pcppages_bulk+0x26a/0x800
    free_unref_page+0x6e/0x90
    free_contig_range+0x94/0xe8
    update_and_free_page+0x1c4/0x2c8
    free_pool_huge_page+0x11e/0x138
    set_max_huge_pages+0x228/0x300
    nr_hugepages_store_common+0xb8/0x130
    kernfs_fop_write+0xd2/0x218
    vfs_write+0xb0/0x2b8
    ksys_write+0xac/0xe0
    system_call+0xe6/0x288
    Disabling lock debugging due to kernel taint

    This is because destroy_compound_gigantic_page() only clears the
    compound_order, while set_compound_order(page, 0) sets compound_nr to
    1U << 0 == 1, which is non-zero and therefore looks like a non-NULL
    page->mapping on the first tail page.

    Fix this by explicitly clearing compound_nr for first tail page after
    calling set_compound_order(page, 0).
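
    A minimal sketch of the described change in destroy_compound_gigantic_page()
    (illustrative, not necessarily the exact patch; compound_nr is assumed to
    live in the first tail page):

        set_compound_order(page, 0);
        page[1].compound_nr = 0;        /* overlays page->mapping on the first tail page */
        __ClearPageHead(page);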

    Link: https://lkml.kernel.org/r/20201208182813.66391-2-gerald.schaefer@linux.ibm.com
    Fixes: 1378a5ee451a ("mm: store compound_nr as well as compound_order")
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Heiko Carstens
    Cc: Mike Kravetz
    Cc: Christian Borntraeger
    Cc: [5.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • We hit this issue in our internal test. When generic KASAN is enabled, a
    kfree()'d object is put into the per-cpu quarantine first. If the cpu
    goes offline, the object still remains in that per-cpu quarantine. If we
    call kmem_cache_destroy() now, SLUB will report an "Objects remaining"
    error.

    =============================================================================
    BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Slab 0x(____ptrval____) objects=34 used=1 fp=0x(____ptrval____) flags=0x2ffff00000010200
    CPU: 3 PID: 176 Comm: cat Tainted: G B 5.10.0-rc1-00007-g4525c8781ec0-dirty #10
    Hardware name: linux,dummy-virt (DT)
    Call trace:
    dump_backtrace+0x0/0x2b0
    show_stack+0x18/0x68
    dump_stack+0xfc/0x168
    slab_err+0xac/0xd4
    __kmem_cache_shutdown+0x1e4/0x3c8
    kmem_cache_destroy+0x68/0x130
    test_version_show+0x84/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    do_el0_svc+0x38/0xa0
    el0_sync_handler+0x170/0x178
    el0_sync+0x174/0x180
    INFO: Object 0x(____ptrval____) @offset=15848
    INFO: Allocated in test_version_show+0x98/0xf0 age=8188 cpu=6 pid=172
    stack_trace_save+0x9c/0xd0
    set_track+0x64/0xf0
    alloc_debug_processing+0x104/0x1a0
    ___slab_alloc+0x628/0x648
    __slab_alloc.isra.0+0x2c/0x58
    kmem_cache_alloc+0x560/0x588
    test_version_show+0x98/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    kmem_cache_destroy test_module_slab: Slab cache still has objects

    Register a cpu hotplug function to remove all objects from the per-cpu
    quarantine when the cpu is going offline, and set a per-cpu variable to
    indicate that the cpu is offline.
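
    A rough sketch of such a hotplug hook (illustrative only; the per-cpu
    cpu_quarantine structure, its offline flag and qlist_free_all() are
    assumed to follow mm/kasan/quarantine.c and may differ in detail):

        static int kasan_cpu_online(unsigned int cpu)
        {
                this_cpu_ptr(&cpu_quarantine)->offline = false;
                return 0;
        }

        static int kasan_cpu_offline(unsigned int cpu)
        {
                struct qlist_head *q = this_cpu_ptr(&cpu_quarantine);

                /* Mark the cpu offline so nothing new is queued, then drain. */
                q->offline = true;
                qlist_free_all(q, NULL);
                return 0;
        }

        static int __init kasan_cpu_quarantine_init(void)
        {
                int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/kasan:online",
                                            kasan_cpu_online, kasan_cpu_offline);
                if (ret < 0)
                        pr_err("kasan cpu quarantine register failed [%d]\n", ret);
                return 0;
        }
        late_initcall(kasan_cpu_quarantine_init);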

    [qiang.zhang@windriver.com: fix slab double free when cpu-hotplug]
    Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com

    Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com
    Signed-off-by: Kuan-Ying Lee
    Signed-off-by: Zqiang
    Suggested-by: Dmitry Vyukov
    Reported-by: Guangye Yang
    Reviewed-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Matthias Brugger
    Cc: Nicholas Tang
    Cc: Miles Chen
    Cc: Qian Cai
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuan-Ying Lee
     
  • Revert commit 3351b16af494 ("mm/filemap: add static for function
    __add_to_page_cache_locked") due to incompatibility with
    ALLOW_ERROR_INJECTION, which results in build errors.

    Link: https://lkml.kernel.org/r/CAADnVQJ6tmzBXvtroBuEH6QA0H+q7yaSKxrVvVxhqr3KBZdEXg@mail.gmail.com
    Tested-by: Justin Forbes
    Tested-by: Greg Thelen
    Acked-by: Alexei Starovoitov
    Cc: Michal Kubecek
    Cc: Alex Shi
    Cc: Souptick Joarder
    Cc: Daniel Borkmann
    Cc: Josef Bacik
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Dec, 2020

1 commit

  • Jann spotted a security hole due to a race with the mm ownership check.

    If the task is sharing the mm_struct but goes through execve() before
    mm_access(), it could skip the process_madvise_behavior_valid check.
    That allows *any* advice hint to reach into the remote process.

    This patch removes the mm ownership check. With it, we lose the ability
    for a local process to give *any* advice hint via the vectored interface
    for its own reasons (e.g., performance). Since there is no concrete
    example in upstream yet, it is better to remove the ability at this
    moment and review it again when such a new advice comes up.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Reported-by: Jann Horn
    Suggested-by: Jann Horn
    Signed-off-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

07 Dec, 2020

7 commits

  • On success, mmap should return the start address of the newly mapped
    area, but the patch "mm: mmap: merge vma after call_mmap() if possible"
    used the vm_start of the newly merged vma as the return value addr.
    Users of mmap will get a wrong address if the vma is merged after
    call_mmap(). We fix this by moving the assignment to addr before the
    vma is merged.
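
    A rough sketch of the idea in mmap_region() (illustrative only, not the
    exact patch; "merge" and "prev" are the locals assumed by the original
    change):

        /* Capture the address that mmap() must return right after call_mmap(),
         * before any merge can replace "vma". */
        addr = vma->vm_start;

        if (unlikely(vm_flags != vma->vm_flags && prev)) {
                merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
                                  vma->vm_flags, NULL, vma->vm_file,
                                  vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
                if (merge) {
                        vma = merge;            /* addr keeps the original start */
                        vm_flags = vma->vm_flags;
                }
        }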

    We have a driver which changes vm_flags, and this bug is found by our
    testcases.

    Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
    Signed-off-by: Liu Zixian
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: David Hildenbrand
    Cc: Miaohe Lin
    Cc: Hongxiang Lou
    Cc: Hu Shiyuan
    Cc: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.com
    Signed-off-by: Linus Torvalds

    Liu Zixian
     
  • Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
    using hugetlbfs. In this environment the issue is reproduced by:

    - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached)

    - Start a DPDK app. It doesn't need to run successfully (as in transfer
    packets) nor interact with real hardware. It seems just initializing
    the EAL layer (which handles hugepage reservation and locking) is
    enough to trigger the issue

    - Delete the Pod (or let it "Complete").

    This would result in a kworker thread going into a tight loop (top output):

    1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy

    'perf top -g' reports:

    - 63.28% 0.01% [kernel] [k] worker_thread
    - 49.97% worker_thread
    - 52.64% process_one_work
    - 62.08% css_killed_work_fn
    - hugetlb_cgroup_css_offline
    41.52% _raw_spin_lock
    - 2.82% _cond_resched
    rcu_all_qs
    2.66% PageHuge
    - 0.57% schedule
    - 0.57% __schedule

    We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
    Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
    infinitely spinning. Little else can be done on the system as the
    cgroup_mutex can not be acquired.

    Do note that the issue can be reproduced by simply offlining a hugetlb
    cgroup containing pages with reservation counts.

    The loop in hugetlb_cgroup_css_offline is moving page counts from the
    cgroup being offlined to the parent cgroup. This is done for each
    hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
    The routine moving counts (hugetlb_cgroup_move_parent) is only moving
    'usage' counts. The routine hugetlb_cgroup_have_usage is checking for
    both 'usage' and 'reservation' counts. What to do with reservation
    counts when reparenting was discussed here:

    https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

    The decision was made to leave a zombie cgroup behind for cgroups with
    reservation counts. Unfortunately, the code checking reservation counts
    was incorrectly added to hugetlb_cgroup_have_usage.

    To fix the issue, simply remove the check for reservation counts. While
    fixing this issue, a related bug in hugetlb_cgroup_css_offline was
    noticed. The hstate index is not reinitialized each time through the
    do-while loop. Fix this as well.
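
    A sketch of the two changes (illustrative; helper and field names are
    assumed to follow mm/hugetlb_cgroup.c and may differ slightly):

        /* Only look at 'usage' counters; reservation counts may legitimately
         * outlive the cgroup as a zombie. */
        static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
        {
                int idx;

                for (idx = 0; idx < hugetlb_max_hstate; idx++) {
                        if (page_counter_read(
                                    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
                                return true;
                }
                return false;
        }

        static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
        {
                struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
                struct hstate *h;
                struct page *page;
                int idx;

                do {
                        idx = 0;        /* reinitialize on every pass */
                        for_each_hstate(h) {
                                spin_lock(&hugetlb_lock);
                                list_for_each_entry(page, &h->hugepage_activelist, lru)
                                        hugetlb_cgroup_move_parent(idx, h_cg, page);
                                spin_unlock(&hugetlb_lock);
                                idx++;
                        }
                        cond_resched();
                } while (hugetlb_cgroup_have_usage(h_cg));
        }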

    Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Adrian Moreno
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Adrian Moreno
    Reviewed-by: Shakeel Butt
    Cc: Mina Almasry
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shuah Khan
    Cc:
    Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]

    Signed-off-by: Alex Shi
    Signed-off-by: Andrew Morton
    Cc: Souptick Joarder
    Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • We can't call kvfree() with a spin lock held, so defer it. Fixes a
    might_sleep() runtime warning.
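
    The pattern, as a minimal sketch (hypothetical variable names, not the
    exact patch):

        spin_lock(&si->lock);
        cluster_info = si->cluster_info;        /* detach under the lock */
        si->cluster_info = NULL;
        spin_unlock(&si->lock);

        kvfree(cluster_info);   /* kvfree() may sleep, so only call it unlocked */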

    Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc:
    Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • While I was doing zram testing, I found that decompression sometimes
    failed because the compression buffer was corrupted. On investigation, I
    found that the commit below calls cond_resched() unconditionally, so it
    can cause a problem in atomic context if the task is rescheduled.

    BUG: sleeping function called from invalid context at mm/vmalloc.c:108
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
    3 locks held by memhog/946:
    #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
    #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
    #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
    CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    Call Trace:
    unmap_kernel_range_noflush+0x2eb/0x350
    unmap_kernel_range+0x14/0x30
    zs_unmap_object+0xd5/0xe0
    zram_bvec_rw.isra.0+0x38c/0x8e0
    zram_rw_page+0x90/0x101
    bdev_write_page+0x92/0xe0
    __swap_writepage+0x94/0x4a0
    pageout+0xe3/0x3a0
    shrink_page_list+0xb94/0xd60
    shrink_inactive_list+0x158/0x460

    We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
    contains the offending calling code) from zsmalloc.

    Even though this option showed some improvement (e.g., 30%) on some
    arm32 platforms, it has been a headache to maintain since it abused
    APIs [1] (e.g., unmap_kernel_range in atomic context).

    Since we are approaching the deprecation of 32-bit machines, the config
    option has only been available for builtin builds since v5.8, and it has
    never been the default in zsmalloc, it's time to drop the option for
    better maintainability.

    [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org

    Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Cc: Tony Lindgren
    Cc: Christoph Hellwig
    Cc: Harish Sriram
    Cc: Uladzislau Rezki
    Cc:
    Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When investigating a slab cache bloat problem, a significant amount of
    negative dentry cache was seen, but confusingly it was neither shrunk by
    the reclaimer (the host has very tight memory) nor by dropping caches.
    The vmcore shows there are over 14M negative dentry objects on the lru,
    but the tracing result shows they were not even scanned at all.

    Further investigation shows the memcg's vfs shrinker_map bit is not set,
    so the reclaimer and cache dropping just skip calling the vfs shrinker,
    and we have to reboot the hosts to get the memory back.

    I didn't manage to come up with a reproducer in a test environment, and
    the problem can't be reproduced after rebooting. But code inspection
    suggests there is a race between clearing the shrinker map bit and
    reparenting. The hypothesis is elaborated below.

    The memcg hierarchy on our production environment looks like:

         root
        /    \
    system   user

    The main workloads are running under user slice's children, and it
    creates and removes memcg frequently. So reparenting happens very often
    under user slice, but no task is under user slice directly.

    So with the frequent reparenting and tight memory pressure, the below
    hypothetical race condition may happen:

    CPU A CPU B
    reparent
    dst->nr_items == 0
    shrinker:
    total_objects == 0
    add src->nr_items to dst
    set_bit
    return SHRINK_EMPTY
    clear_bit
    child memcg offline
    replace child's kmemcg_id with
    parent's (in memcg_offline_kmem())
    list_lru_del() between shrinker runs
    see parent's kmemcg_id
    dec dst->nr_items
    reparent again
    dst->nr_items may go negative
    due to concurrent list_lru_del()

    The second run of shrinker:
    read nr_items without any
    synchronization, so it may
    see intermediate negative
    nr_items then total_objects
    may return 0 coincidently

    keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add scr->nr_item to dst

    After this point dst->nr_items may never go back to zero, so reparenting
    will not set the shrinker_map bit anymore. And since there is no task
    under the user slice directly, no new object will be added to its lru to
    set the shrinker map bit either. That bit is kept cleared forever.

    How does list_lru_del() race with reparenting? It is because reparenting
    replaces the children's kmemcg_id with the parent's without the
    protection of nlru->lock, so list_lru_del() may see the parent's
    kmemcg_id while actually deleting items from the child's lru, but
    dec'ing the parent's nr_items; the parent's nr_items may therefore go
    negative, as commit 2788cf0c401c ("memcg: reparent list_lrus and free
    kmemcg_id on css offline") says.

    Since it is impossible that dst->nr_items goes negative and
    src->nr_items goes to zero at the same time, it seems we could set the
    shrinker map bit iff src->nr_items != 0. We could synchronize
    list_lru_count_one() and reparenting with nlru->lock, but checking
    src->nr_items in reparenting is the simplest and avoids lock contention.
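
    A sketch of the resulting reparenting logic (illustrative; based on
    memcg_drain_list_lru_node() in mm/list_lru.c, not necessarily the exact
    patch):

        static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
                                              int src_idx, struct mem_cgroup *dst_memcg)
        {
                struct list_lru_node *nlru = &lru->node[nid];
                int dst_idx = dst_memcg->kmemcg_id;
                struct list_lru_one *src, *dst;

                /* Hold nlru->lock so a racing list_lru_add()/del() cannot skew the counts. */
                spin_lock_irq(&nlru->lock);

                src = list_lru_from_memcg_idx(nlru, src_idx);
                dst = list_lru_from_memcg_idx(nlru, dst_idx);

                list_splice_init(&src->list, &dst->list);

                if (src->nr_items) {
                        dst->nr_items += src->nr_items;
                        /* Set the bit whenever the child actually had items. */
                        memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
                        src->nr_items = 0;
                }

                spin_unlock_irq(&nlru->lock);
        }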

    Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
    Suggested-by: Roman Gushchin
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Kirill Tkhai
    Cc: Vladimir Davydov
    Cc: [4.19]
    Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
    for all allocations") introduced a regression into the handling of the
    obj_cgroup_charge() return value. If a non-zero value is returned
    (indicating that one of the memory.max limits has been exceeded), the
    allocation should fail instead of falling back to non-accounted mode.

    To make the code more readable, move memcg_slab_pre_alloc_hook() and
    memcg_slab_post_alloc_hook() calling conditions into bodies of these
    hooks.

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

25 Nov, 2020

1 commit

  • Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
    on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
    end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
    no longer an ext4 page at all.

    The problem is that PageWriteback is not accompanied by a page reference
    (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
    soon as TestClearPageWriteback has been done, that page could be removed
    from page cache, freed, and reused for something else by the time that
    wake_up_page() is reached.

    https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
    Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
    check; but I'm paranoid about even looking at an unreferenced struct page,
    lest its memory might itself have already been reused or hotremoved (and
    wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

    Then on crashing a second time, realized there's a stronger reason against
    that approach. If my testing just occasionally crashes on that check,
    when the page is reused for part of a compound page, wouldn't it be much
    more common for the page to get reused as an order-0 page before reaching
    wake_up_page()? And on rare occasions, might that reused page already be
    marked PageWriteback by its new user, and already be waited upon? What
    would that look like?

    It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
    in write_cache_pages() (though I have never seen that crash myself).

    Matthew Wilcox explaining this to himself:
    "page is allocated, added to page cache, dirtied, writeback starts,

    --- thread A ---
    filesystem calls end_page_writeback()
    test_clear_page_writeback()
    --- context switch to thread B ---
    truncate_inode_pages_range() finds the page, it doesn't have writeback set,
    we delete it from the page cache. Page gets reallocated, dirtied, writeback
    starts again. Then we call write_cache_pages(), see
    PageWriteback() set, call wait_on_page_writeback()
    --- context switch back to thread A ---
    wake_up_page(page, PG_writeback);
    ... thread B is woken, but because the wakeup was for the old use of
    the page, PageWriteback is still set.

    Devious"

    And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    this would have been much less likely: before that, wake_page_function()'s
    non-exclusive case would stop walking and not wake if it found Writeback
    already set again; whereas now the non-exclusive case proceeds to wake.

    I have not thought of a fix that does not add a little overhead: the
    simplest fix is for end_page_writeback() to get_page() before calling
    test_clear_page_writeback(), then put_page() after wake_up_page().
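
    In sketch form (illustrative of the described fix, not necessarily the
    exact patch):

        void end_page_writeback(struct page *page)
        {
                /* ... PageReclaim()/rotate_reclaimable_page() handling elided ... */

                /*
                 * Hold a reference so the page cannot be freed and reused
                 * between clearing PG_writeback and waking any waiters.
                 */
                get_page(page);
                if (!test_clear_page_writeback(page))
                        BUG();

                smp_mb__after_atomic();
                wake_up_page(page, PG_writeback);
                put_page(page);
        }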

    Was there a chance of missed wakeups before, since a page freed before
    reaching wake_up_page() would have PageWaiters cleared? I think not,
    because each waiter does hold a reference on the page. This bug comes
    when the old use of the page, the one we do TestClearPageWriteback on,
    had *no* waiters, so no additional page reference beyond the page cache
    (and whoever racily freed it). The reuse of the page has a waiter
    holding a reference, and its own PageWriteback set; but the belated
    wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

    Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
    Reported-by: Qian Cai
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Nov, 2020

5 commits

  • The calculation of the end page index was incorrect, leading to a
    regression of 70% when running stress-ng.

    With this fix, we instead see a performance improvement of 3%.

    Fixes: e6e88712e43b ("mm: optimise madvise WILLNEED")
    Reported-by: kernel test robot
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Tested-by: Xing Zhengjun
    Acked-by: Johannes Weiner
    Cc: William Kucharski
    Cc: Feng Tang
    Cc: "Chen, Rong A"
    Link: https://lkml.kernel.org/r/20201109134851.29692-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Alexander reported a syzkaller / KASAN finding on s390, see below for
    complete output.

    In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be
    freed in some cases. In the case of userfaultfd_missing(), this will
    happen after calling handle_userfault(), which might have released the
    mmap_lock. Therefore, the following pte_free(vma->vm_mm, pgtable) will
    access an unstable vma->vm_mm, which could have been freed or re-used
    already.

    For all architectures other than s390 this will go w/o any negative
    impact, because pte_free() simply frees the page and ignores the
    passed-in mm. The implementation for SPARC32 would also access
    mm->page_table_lock for pte_free(), but there is no THP support in
    SPARC32, so the buggy code path will not be used there.

    For s390, the mm->context.pgtable_list is being used to maintain the 2K
    pagetable fragments, and operating on an already freed or even re-used
    mm could result in various more or less subtle bugs due to list /
    pagetable corruption.

    Fix this by calling pte_free() before handle_userfault(), similar to how
    it is already done in __do_huge_pmd_anonymous_page() for the WRITE /
    non-huge_zero_page case.

    Commit 6b251fc96cf2c ("userfaultfd: call handle_userfault() for
    userfaultfd_missing() faults") actually introduced both the
    do_huge_pmd_anonymous_page() and the __do_huge_pmd_anonymous_page()
    changes w.r.t. calling handle_userfault(), but only in the latter case
    did it put the pte_free() before calling handle_userfault().
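
    A sketch of the fixed ordering in do_huge_pmd_anonymous_page()
    (illustrative, not necessarily the exact patch):

        if (userfaultfd_missing(vma)) {
                spin_unlock(vmf->ptl);
                /*
                 * Free the preallocated pagetable before handle_userfault(),
                 * which may drop mmap_lock and leave vma->vm_mm unstable.
                 */
                pte_free(vma->vm_mm, pgtable);
                ret = handle_userfault(vmf, VM_UFFD_MISSING);
                VM_BUG_ON(ret & VM_FAULT_FALLBACK);
        }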

    BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
    Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334

    CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0
    Hardware name: IBM 3906 M04 701 (KVM/Linux)
    Call Trace:
    do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
    create_huge_pmd mm/memory.c:4256 [inline]
    __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480
    handle_mm_fault+0x288/0x748 mm/memory.c:4607
    do_exception+0x394/0xae0 arch/s390/mm/fault.c:479
    do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567
    pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706
    copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline]
    raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174
    _copy_from_user+0x48/0xa8 lib/usercopy.c:16
    copy_from_user include/linux/uaccess.h:192 [inline]
    __do_sys_sigaltstack kernel/signal.c:4064 [inline]
    __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060
    system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

    Allocated by task 9334:
    slab_alloc_node mm/slub.c:2891 [inline]
    slab_alloc mm/slub.c:2899 [inline]
    kmem_cache_alloc+0x118/0x348 mm/slub.c:2904
    vm_area_dup+0x9c/0x2b8 kernel/fork.c:356
    __split_vma+0xba/0x560 mm/mmap.c:2742
    split_vma+0xca/0x108 mm/mmap.c:2800
    mlock_fixup+0x4ae/0x600 mm/mlock.c:550
    apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619
    do_mlock+0x1aa/0x718 mm/mlock.c:711
    __do_sys_mlock2 mm/mlock.c:738 [inline]
    __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728
    system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

    Freed by task 9333:
    slab_free mm/slub.c:3142 [inline]
    kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158
    __vma_adjust+0x7b2/0x2508 mm/mmap.c:960
    vma_merge+0x87e/0xce0 mm/mmap.c:1209
    userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868
    __fput+0x22c/0x7a8 fs/file_table.c:281
    task_work_run+0x200/0x320 kernel/task_work.c:151
    tracehook_notify_resume include/linux/tracehook.h:188 [inline]
    do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
    system_call+0xe6/0x28c arch/s390/kernel/entry.S:416

    The buggy address belongs to the object at 00000000962d6948 which belongs to the cache vm_area_struct of size 200
    The buggy address is located 64 bytes inside of 200-byte region [00000000962d6948, 00000000962d6a10)
    The buggy address belongs to the page: page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6 flags: 0x3ffff00000000200(slab)
    raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00
    raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501
    page dumped because: kasan: bad access detected
    page->mem_cgroup:0000000096959501

    Memory state around the buggy address:
    00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
    >00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
    00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================

    Fixes: 6b251fc96cf2c ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults")
    Reported-by: Alexander Egorenkov
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: [4.3+]
    Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • If we reparent the slab objects to the root memcg, when we free the slab
    object, we need to update the per-memcg vmstats to keep it correct for
    the root memcg. Now this at least affects the vmstat of
    NR_KERNEL_STACK_KB for !CONFIG_VMAP_STACK when the thread stack size is
    smaller than the PAGE_SIZE.

    David said:
    "I assume that without this fix that the root memcg's vmstat would
    always be inflated if we reparented"

    Fixes: ec9f02384f60 ("mm: workingset: fix vmstat counters for shadow nodes")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Christopher Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Roman Gushchin
    Cc: Vlastimil Babka
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: [5.3+]
    Link: https://lkml.kernel.org/r/20201110031015.15715-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • The core-mm has a default __weak implementation of phys_to_target_node()
    to mirror the weak definition of memory_add_physaddr_to_nid(). That
    symbol is exported for modules. However, while the export in
    mm/memory_hotplug.c exported the symbol in the configuration cases of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=y

    ...and:

    CONFIG_NUMA_KEEP_MEMINFO=n
    CONFIG_MEMORY_HOTPLUG=y

    ...it failed to export the symbol in the case of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=n

    Not only is that broken, but Christoph points out that the kernel should
    not be exporting any __weak symbol, which means that
    memory_add_physaddr_to_nid() example that phys_to_target_node() copied
    is broken too.

    Rework the definition of phys_to_target_node() and
    memory_add_physaddr_to_nid() to not require weak symbols. Move to the
    common arch override design-pattern of an asm header defining a symbol
    to replace the default implementation.
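
    That override pattern, in sketch form (illustrative; the header and
    architecture names here are placeholders):

        /* arch/<arch>/include/asm/sparsemem.h: architecture override */
        #define phys_to_target_node phys_to_target_node
        int phys_to_target_node(u64 start);

        /* common header: fallback used when no arch override is defined */
        #ifndef phys_to_target_node
        static inline int phys_to_target_node(u64 start)
        {
                return 0;
        }
        #endif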

    The only common header that all memory_add_physaddr_to_nid() producing
    architectures implement is asm/sparsemem.h. In fact, powerpc already
    defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
    Double-down on that observation and define phys_to_target_node() where
    necessary in asm/sparsemem.h. An alternate consideration that was
    discarded was to put this override in asm/numa.h, but that entangles
    with the definition of MAX_NUMNODES relative to the inclusion of
    linux/nodemask.h, and requires powerpc to grow a new header.

    The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
    now that the symbol is properly exported / stubbed in all combinations
    of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.

    [dan.j.williams@intel.com: v4]
    Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
    [dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
    Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com

    Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
    Reported-by: Randy Dunlap
    Reported-by: Thomas Gleixner
    Reported-by: kernel test robot
    Reported-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Tested-by: Randy Dunlap
    Tested-by: Thomas Gleixner
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Christoph Hellwig
    Cc: Joao Martins
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Vishal Verma
    Cc: Stephen Rothwell
    Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The early return in process_madvise() will produce a memory leak.

    Fix it.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201116155132.GA3805951@google.com
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

21 Nov, 2020

1 commit

  • Pull io_uring fixes from Jens Axboe:
    "Mostly regression or stable fodder:

    - Disallow async path resolution of /proc/self

    - Tighten constraints for segmented async buffered reads

    - Fix double completion for a retry error case

    - Fix for fixed file life times (Pavel)"

    * tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block:
    io_uring: order refnode recycling
    io_uring: get an active ref_node from files_data
    io_uring: don't double complete failed reissue request
    mm: never attempt async page lock if we've transferred data already
    io_uring: handle -EOPNOTSUPP on path resolution
    proc: don't allow async path resolution of /proc/self components

    Linus Torvalds
     

20 Nov, 2020

1 commit

  • Pull networking fixes from Jakub Kicinski:
    "Networking fixes for 5.10-rc5, including fixes from the WiFi
    (mac80211), can and bpf (including the strncpy_from_user fix).

    Current release - regressions:

    - mac80211: fix memory leak of filtered powersave frames

    - mac80211: free sta in sta_info_insert_finish() on errors to avoid
    sleeping in atomic context

    - netlabel: fix an uninitialized variable warning added in -rc4

    Previous release - regressions:

    - vsock: forward all packets to the host when no H2G is registered,
    un-breaking AWS Nitro Enclaves

    - net: Exempt multicast addresses from five-second neighbor lifetime
    requirement, decreasing the chances neighbor tables fill up

    - net/tls: fix corrupted data in recvmsg

    - qed: fix ILT configuration of SRC block

    - can: m_can: process interrupt only when not runtime suspended

    Previous release - always broken:

    - page_frag: Recover from memory pressure by not recycling pages
    allocating from the reserves

    - strncpy_from_user: Mask out bytes after NUL terminator

    - ip_tunnels: Set tunnel option flag only when tunnel metadata is
    present, always setting it confuses Open vSwitch

    - bpf, sockmap:
    - Fix partial copy_page_to_iter so progress can still be made
    - Fix socket memory accounting and obeying SO_RCVBUF

    - net: Have netpoll bring-up DSA management interface

    - net: bridge: add missing counters to ndo_get_stats64 callback

    - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt

    - enetc: Workaround MDIO register access HW bug

    - net/ncsi: move netlink family registration to a subsystem init,
    instead of tying it to driver probe

    - net: ftgmac100: unregister NC-SI when removing driver to avoid
    crash

    - lan743x:
    - prevent interrupt storm on open
    - fix freeing skbs in the wrong context

    - net/mlx5e: Fix socket refcount leak on kTLS RX resync

    - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097

    - fix 21 unset return codes and other mistakes on error paths, mostly
    detected by the Hulk Robot"

    * tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (115 commits)
    fail_function: Remove a redundant mutex unlock
    selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
    lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
    net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid()
    net/smc: fix matching of existing link groups
    ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module
    libbpf: Fix VERSIONED_SYM_COUNT number parsing
    net/mlx4_core: Fix init_hca fields offset
    atm: nicstar: Unmap DMA on send error
    page_frag: Recover from memory pressure
    net: dsa: mv88e6xxx: Wait for EEPROM done after HW reset
    mlxsw: core: Use variable timeout for EMAD retries
    mlxsw: Fix firmware flashing
    net: Have netpoll bring-up DSA management interface
    atl1e: fix error return code in atl1e_probe()
    atl1c: fix error return code in atl1c_probe()
    ah6: fix error return code in ah6_input()
    net: usb: qmi_wwan: Set DTR quirk for MR400
    can: m_can: process interrupt only when not runtime suspended
    can: flexcan: flexcan_chip_start(): fix erroneous flexcan_transceiver_enable() during bus-off recovery
    ...

    Linus Torvalds
     

19 Nov, 2020

1 commit

  • The ethernet driver may allocate an skb (and skb->data) via
    napi_alloc_skb(). This ends up in page_frag_alloc() allocating skb->data
    from page_frag_cache->va.

    Under memory pressure, page_frag_cache->va may be allocated as a
    pfmemalloc page. As a result, skb->pfmemalloc is always true because
    skb->data comes from page_frag_cache->va. The skb will be dropped if the
    sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
    under memory pressure.

    However, once the kernel is no longer under memory pressure (suppose a
    large amount of memory pages has just been reclaimed), page_frag_alloc()
    may still re-use the prior pfmemalloc page_frag_cache->va to allocate
    skb->data. As a result, skb->pfmemalloc stays true unless
    page_frag_cache->va is re-allocated, even though the kernel is not under
    memory pressure any longer.

    Here is how kernel runs into issue.

    1. The kernel is under memory pressure and allocation of
    PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead,
    the pfmemalloc page is allocated for page_frag_cache->va.

    2: All skb->data from page_frag_cache->va (pfmemalloc) will have
    skb->pfmemalloc=true. The skb will always be dropped by sock without
    SOCK_MEMALLOC. This is an expected behaviour.

    3. Suppose a large amount of pages is reclaimed and the kernel is no
    longer under memory pressure. We expect that skbs will no longer be
    dropped due to skb->pfmemalloc.

    4. Unfortunately, page_frag_alloc() does not proactively re-allocate
    page_frag_cache->va and will always re-use the prior pfmemalloc page. So
    skb->pfmemalloc remains true even though the kernel is no longer under
    memory pressure.

    Fix this by freeing and re-allocating the page instead of recycling it.
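
    A sketch of the idea in page_frag_alloc() (illustrative, not necessarily
    the exact patch):

        if (unlikely(offset < 0)) {
                struct page *page = virt_to_page(nc->va);

                if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
                        goto refill;

                if (unlikely(nc->pfmemalloc)) {
                        /*
                         * The cached page came from the reserves; free it and
                         * refill instead of recycling it, so later skbs do not
                         * inherit pfmemalloc needlessly.
                         */
                        free_the_page(page, compound_order(page));
                        goto refill;
                }

                /* otherwise reset pagecnt_bias and keep reusing the page */
        }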

    References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/
    References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/
    Suggested-by: Matthew Wilcox (Oracle)
    Cc: Aruna Ramakrishna
    Cc: Bert Barbe
    Cc: Rama Nichanamatlu
    Cc: Venkat Venkatsubra
    Cc: Manjunath Patil
    Cc: Joe Jin
    Cc: SRINIVAS
    Fixes: 79930f5892e1 ("net: do not deplete pfmemalloc reserve")
    Signed-off-by: Dongli Zhang
    Acked-by: Vlastimil Babka
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201115201029.11903-1-dongli.zhang@oracle.com
    Signed-off-by: Jakub Kicinski

    Dongli Zhang
     

17 Nov, 2020

1 commit

  • We catch the case where we enter generic_file_buffered_read() with data
    already transferred, but we also need to be careful not to allow an async
    page lock if we're looping transferring data. If not, we could be
    returning -EIOCBQUEUED instead of the transferred amount, and it could
    result in double waitqueue additions as well.

    Cc: stable@vger.kernel.org # v5.9
    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Nov, 2020

1 commit


15 Nov, 2020

6 commits

  • Qian Cai reported the following BUG in [1]

    LTP: starting move_pages12
    BUG: unable to handle page fault for address: ffffffffffffffe0
    ...
    RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
    Call Trace:
    rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
    try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
    migrate_pages+0x1005/0x1fb0
    move_pages_and_store_status.isra.47+0xd7/0x1a0
    __x64_sys_move_pages+0xa5c/0x1100
    do_syscall_64+0x5f/0x310
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Hugh Dickins diagnosed this as a migration bug caused by code introduced
    to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
    routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
    flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
    anon pages as the anon_vma_lock should be held in this case. Further
    analysis suggested that i_mmap_rwsem was not required to be held at all
    when calling try_to_unmap for anon pages, as an anon page could never be
    part of a shared pmd mapping.

    Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
    to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
    keep mapping valid while dropping page lock.

    This patch does the following:

    - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
    calling try_to_unmap.

    - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
    will now simply do a 'trylock' while still holding the page lock. If
    the trylock fails, it will return NULL. This could impact the
    callers:

    - migration calling code will receive -EAGAIN and retry up to the
    hard coded limit (10).

    - memory error code will treat the page as BUSY. This will force
    killing (SIGKILL) instead of SIGBUS any mapping tasks.

    Do note that this change in behavior only happens when there is a
    race. None of the standard kernel testing suites actually hit this
    race, but it is possible.
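
    A sketch of the resulting logic (illustrative; simplified from
    hugetlb_page_mapping_lock_write() and unmap_and_move_huge_page(), not the
    exact patch):

        struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
        {
                struct address_space *mapping = page_mapping(hpage);

                if (!mapping)
                        return mapping;

                /* Only trylock while still holding the page lock; no more
                 * dropping and re-taking the page lock around i_mmap_rwsem. */
                if (i_mmap_trylock_write(mapping))
                        return mapping;

                return NULL;
        }

        /* In unmap_and_move_huge_page(): take i_mmap_rwsem for file pages only. */
        if (!PageAnon(hpage)) {
                mapping = hugetlb_page_mapping_lock_write(hpage);
                if (unlikely(!mapping))
                        goto out;       /* trylock failed: treat the page as busy */

                try_to_unmap(hpage, TTU_MIGRATION | TTU_IGNORE_MLOCK |
                                    TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED);
                i_mmap_unlock_write(mapping);
        } else {
                try_to_unmap(hpage, TTU_MIGRATION | TTU_IGNORE_MLOCK |
                                    TTU_IGNORE_ACCESS);
        }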

    [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
    [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Reported-by: Qian Cai
    Suggested-by: Hugh Dickins
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc:
    Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • When FOLL_PIN is passed to __get_user_pages() the page list must be put
    back using unpin_user_pages() otherwise the page pin reference persists
    in a corrupted state.

    There are two places in the unwind of __gup_longterm_locked() that put
    the pages back without checking. Normally on error this function would
    return the partial page list making this the caller's responsibility,
    but in these two cases the caller is not allowed to see these pages at
    all.
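
    The required unwind, as a sketch (the helper name is hypothetical; only
    unpin_user_pages() and put_page() are the real APIs):

        static void put_back_pages(struct page **pages, unsigned long npages,
                                   unsigned int gup_flags)
        {
                unsigned long i;

                if (gup_flags & FOLL_PIN) {
                        /* FOLL_PIN references must be dropped via unpin_user_pages() */
                        unpin_user_pages(pages, npages);
                } else {
                        for (i = 0; i < npages; i++)
                                put_page(pages[i]);
                }
        }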

    Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
    Reported-by: Ira Weiny
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: John Hubbard
    Cc: Aneesh Kumar K.V
    Cc: Dan Williams
    Cc:
    Link: https://lkml.kernel.org/r/0-v2-3ae7d9d162e2+2a7-gup_cma_fix_jgg@nvidia.com
    Signed-off-by: Linus Torvalds

    Jason Gunthorpe
     
  • While doing memory hot-unplug operation on a PowerPC VM running 1024 CPUs
    with 11TB of ram, I hit the following panic:

    BUG: Kernel NULL pointer dereference on read at 0x00000007
    Faulting instruction address: 0xc000000000456048
    Oops: Kernel access of bad area, sig: 11 [#2]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp
    CPU: 160 PID: 1 Comm: systemd Tainted: G D 5.9.0 #1
    NIP: c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
    REGS: c00006028d1b77a0 TRAP: 0300 Tainted: G D (5.9.0)
    MSR: 8000000000009033 CR: 24004228 XER: 00000000
    CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
    GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
    GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
    GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
    GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
    GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
    GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
    GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
    GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
    NIP [c000000000456048] __kmalloc_node+0x108/0x790
    LR [c000000000455fd4] __kmalloc_node+0x94/0x790
    Call Trace:
    kvmalloc_node+0x58/0x110
    mem_cgroup_css_online+0x10c/0x270
    online_css+0x48/0xd0
    cgroup_apply_control_enable+0x2c4/0x470
    cgroup_mkdir+0x408/0x5f0
    kernfs_iop_mkdir+0x90/0x100
    vfs_mkdir+0x138/0x250
    do_mkdirat+0x154/0x1c0
    system_call_exception+0xf8/0x200
    system_call_common+0xf0/0x27c
    Instruction dump:
    e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
    2fbc0000 419e0018 41920230 e9270010 7f994800 419e0220 7ee6bb78

    This pointing to the following code:

    mm/slub.c:2851
    if (unlikely(!object || !node_match(page, node))) {
    c000000000456038: 00 00 bc 2f cmpdi cr7,r28,0
    c00000000045603c: 18 00 9e 41 beq cr7,c000000000456054
    node_match():
    mm/slub.c:2491
    if (node != NUMA_NO_NODE && page_to_nid(page) != node)
    c000000000456040: 30 02 92 41 beq cr4,c000000000456270
    page_to_nid():
    include/linux/mm.h:1294
    c000000000456044: 10 00 27 e9 ld r9,16(r7)
    c000000000456048: 07 00 29 89 lbz r9,7(r9) <<<< r9 = NULL
    node_match():
    mm/slub.c:2491
    c00000000045604c: 00 48 99 7f cmpw cr7,r25,r9
    c000000000456050: 20 02 9e 41 beq cr7,c000000000456270

    The panic occurred in slab_alloc_node() when checking for the page's node:

    object = c->freelist;
    page = c->page;
    if (unlikely(!object || !node_match(page, node))) {
    object = __slab_alloc(s, gfpflags, node, addr, c);
    stat(s, ALLOC_SLOWPATH);

    The issue is that object is not NULL while page is NULL which is odd but
    may happen if the cache flush happened after loading object but before
    loading page. Thus checking for the page pointer is required too.

    The cache flush is done through an inter processor interrupt when a
    piece of memory is off-lined. That interrupt is triggered when a memory
    hot-unplug operation is initiated and offline_pages() is calling the
    slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback()
    which is calling flush_cpu_slab(). If that interrupt is caught between
    the reading of c->freelist and the reading of c->page, this could lead
    to such a situation. That situation is expected and the later call to
    this_cpu_cmpxchg_double() will detect the change to c->freelist and redo
    the whole operation.

    In commit 6159d0f5c03e ("mm/slub.c: page is always non-NULL in
    node_match()") the check on the page pointer was removed, assuming that
    page is always valid when node_match() is called. It happens that this
    is not true in that particular case, so check page before calling
    node_match() here.
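
    The described check, in sketch form (slab_alloc_node() fast path;
    illustrative, not necessarily the exact patch):

        object = c->freelist;
        page = c->page;
        /*
         * A cpu-slab flush IPI can land between the two loads above, leaving
         * object non-NULL while page is NULL, so check page as well before
         * node_match() dereferences it.
         */
        if (unlikely(!object || !page || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
                stat(s, ALLOC_SLOWPATH);
        }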

    Fixes: 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()")
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Cc: Wei Yang
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • Previously the negated unsigned long would be cast back to signed long
    which would have the correct negative value. After commit 730ec8c01a2b
    ("mm/vmscan.c: change prototype for shrink_page_list"), the large
    unsigned int converts to a large positive signed long.
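
    A small illustration of the pitfall (standalone example, not kernel code):

        unsigned long nr_ul = 5;
        unsigned int  nr_ui = 5;

        long a = -nr_ul;        /* wraps to ULONG_MAX - 4, converts back to -5 */
        long b = -nr_ui;        /* computed as unsigned int (4294967291), then
                                   widens to a large *positive* long */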

    Symptoms include CMA allocations hanging forever holding the cma_mutex
    due to alloc_contig_range->...->isolate_migratepages_block waiting
    forever in "while (unlikely(too_many_isolated(pgdat)))".

    [akpm@linux-foundation.org: fix -stat.nr_lazyfree_fail as well, per Michal]

    Fixes: 730ec8c01a2b ("mm/vmscan.c: change prototype for shrink_page_list")
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vaneet Narang
    Cc: Maninder Singh
    Cc: Amit Sahrawat
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc:
    Link: https://lkml.kernel.org/r/20201029032320.1448441-1-npiggin@gmail.com
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • In isolate_migratepages_block, if we have too many isolated pages and
    nr_migratepages is not zero, we should try to migrate what we have
    without wasting time on isolating.

    In theory it's possible that multiple parallel compactions will cause
    too_many_isolated() to become true even if each has isolated less than
    COMPACT_CLUSTER_MAX, and loop forever in the while loop. Bailing
    immediately prevents that.
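
    In sketch form (illustrative; based on the too_many_isolated() loop at the
    top of isolate_migratepages_block(), not necessarily the exact patch):

        while (unlikely(too_many_isolated(pgdat))) {
                /* stop isolating if there are still pages not migrated */
                if (cc->nr_migratepages)
                        return 0;       /* go migrate what we already have */

                /* async migration should just abort */
                if (cc->mode == MIGRATE_ASYNC)
                        return 0;

                congestion_wait(BLK_RW_ASYNC, HZ/10);

                if (fatal_signal_pending(current))
                        return 0;
        }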

    [vbabka@suse.cz: changelog addition]

    Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
    Suggested-by: Vlastimil Babka
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Cc:
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Yang Shi
    Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • In isolate_migratepages_block, when cc->alloc_contig is true, we are
    able to isolate compound pages. But nr_migratepages and nr_isolated did
    not count compound pages correctly, causing us to isolate more pages
    than we thought.

    So count compound pages as the number of base pages they contain.
    Otherwise, we might be trapped in too_many_isolated while loop, since
    the actual isolated pages can go up to COMPACT_CLUSTER_MAX*512=16384,
    where COMPACT_CLUSTER_MAX is 32, since we stop isolation after
    cc->nr_migratepages reaches to COMPACT_CLUSTER_MAX.

    In addition, after we fix the issue above, cc->nr_migratepages could
    never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
    thus page isolation could not stop as we intended. Change the isolation
    stop condition to '>='.
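
    In sketch form (illustrative; based on isolate_migratepages_block(), not
    necessarily the exact patch):

        isolate_success:
                list_add(&page->lru, &cc->migratepages);
                /* count a compound page as the base pages it contains */
                cc->nr_migratepages += compound_nr(page);
                nr_isolated += compound_nr(page);

                /* '>=': compound pages can overshoot COMPACT_CLUSTER_MAX */
                if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
                    !cc->rescan && !cc->contended) {
                        ++low_pfn;
                        break;
                }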

    The issue can be triggered as follows:

    In a system with 16GB memory and an 8GB CMA region reserved by
    hugetlb_cma, if we first allocate 10GB THPs and mlock them (so some THPs
    are allocated in the CMA region and mlocked), reserving 6 1GB hugetlb
    pages via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will
    get stuck (looping in too_many_isolated function) until we kill either
    task. With the patch applied, oom will kill the application with 10GB
    THPs and let hugetlb page reservation finish.

    [ziy@nvidia.com: v3]

    Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
    Fixes: 1da2f328fa64 ("cmm,thp,compaction,cma: allow THP migration for CMA allocations")
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
     

03 Nov, 2020

6 commits

  • Fix the following sparse warning:

    mm/truncate.c:531:15: warning: symbol '__invalidate_mapping_pages' was not declared. Should it be static?

    Fixes: eb1d7a65f08a ("mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED")
    Signed-off-by: Jason Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Yafang Shao
    Link: https://lkml.kernel.org/r/20201015054808.2445904-1-yanaijie@huawei.com
    Signed-off-by: Linus Torvalds

    Jason Yan
     
  • When the flags in queue_pages_pte_range don't have the MPOL_MF_MOVE or
    MPOL_MF_MOVE_ALL bits set, the code breaks out of the loop, and passing
    the original pte - 1 to pte_unmap_unlock seems like not a good idea.

    queue_pages_pte_range can run in MPOL_MF_STRICT mode, which doesn't
    migrate misplaced pages but returns -EIO when encountering such a page.
    Since commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
    MPOL_MF_STRICT is specified"), an early break on the first pte in the
    range results in pte_unmap_unlock on an underflow pte. This can lead to
    lockups later on when somebody tries to lock the pte resp. the
    page_table_lock again.
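
    A sketch of the fix (illustrative; remember the originally mapped pte and
    unlock with it, rather than with the loop cursor):

        static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
                                         unsigned long end, struct mm_walk *walk)
        {
                pte_t *pte, *mapped_pte;
                spinlock_t *ptl;
                /* ... */

                mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
                for (; addr != end; pte++, addr += PAGE_SIZE) {
                        /* ... may 'break' early without visiting every pte ... */
                }
                pte_unmap_unlock(mapped_pte, ptl);      /* not 'pte - 1' */
                cond_resched();

                return 0;
        }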

    Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Miaohe Lin
    Cc: Feilong Lin
    Cc: Shijie Luo
    Cc:
    Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com
    Signed-off-by: Linus Torvalds

    Shijie Luo
     
  • Richard reported a warning which can be reproduced by running the LTP
    madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

    WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Modules linked in:
    CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
    Workqueue: events drain_local_stock
    RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Call Trace:
    __memcg_kmem_uncharge (mm/memcontrol.c:3022)
    drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
    drain_local_stock (mm/memcontrol.c:2255)
    process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
    worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:292)
    ret_from_fork (arch/x86/entry/entry_64.S:300)

    The problem occurs because in the non-hierarchical mode non-root page
    counters are not linked to root page counters, so the charge is not
    propagated to the root memory cgroup.

    After the removal of the original memory cgroup and reparenting of the
    object cgroup, the root cgroup might be uncharged by draining an objcg
    stock, for example. This leads to an eventual underflow of the charge
    and triggers a warning.

    Fix it by linking all page counters to corresponding root page counters
    in the non-hierarchical mode.
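
    A rough sketch of the idea in mem_cgroup_css_alloc() (illustrative, not
    necessarily the exact patch):

        if (parent && parent->use_hierarchy) {
                page_counter_init(&memcg->memory, &parent->memory);
                /* ... swap, kmem, tcpmem likewise ... */
        } else if (parent) {
                /*
                 * Non-hierarchical mode: link to the root counters so charges
                 * still propagate to root and a later uncharge (e.g. after
                 * objcg reparenting) cannot underflow the root counters.
                 */
                page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
                /* ... swap, kmem, tcpmem likewise ... */
        } else {
                /* the root memory cgroup itself has no parent counters */
                page_counter_init(&memcg->memory, NULL);
        }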

    Please note, that in the non-hierarchical mode all objcgs are always
    reparented to the root memory cgroup, even if the hierarchy has more
    than 1 level. This patch doesn't change it.

    The patch also doesn't affect how the hierarchical mode is working,
    which is the only sane and truly supported mode now.

    Thanks to Richard for reporting, debugging and providing an alternative
    version of the fix!

    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Reported-by:
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Michal Koutný
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201026231326.3212225-1-guro@fb.com
    Debugged-by: Richard Palethorpe
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • memcg_page_state will get the specified number for the hierarchical
    memcg. If the item is NR_ANON_THPS, the value counts huge pages and
    should be multiplied by HPAGE_PMD_NR rather than treated as single
    pages.
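
    In sketch form (illustrative; the v1 "total_*" output in memcg_stat_show()
    is assumed, which may not be exactly where the actual patch applies):

        nr = memcg_page_state(memcg, memcg1_stats[i]);
        if (memcg1_stats[i] == NR_ANON_THPS)
                nr *= HPAGE_PMD_NR;     /* the counter holds THPs, not base pages */
        seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
                   (u64)nr * PAGE_SIZE);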

    [akpm@linux-foundation.org: fix printk warning]
    [akpm@linux-foundation.org: use u64 cast, per Michal]

    Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
    Signed-off-by: zhongjiang-ali
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/1603722395-72443-1-git-send-email-zhongjiang-ali@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    zhongjiang-ali
     
  • Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon
    with hugetlbfs and hit the warning below. QEMU with free page hinting
    uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported
    as free by a VM. The reporting granularity is the pageblock size. So
    when the guest reports 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE)
    one huge page in QEMU.

    WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
    Modules linked in: ...
    CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
    Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
    RIP: 0010:page_counter_uncharge+0x4b/0x50
    ...
    Call Trace:
    hugetlb_cgroup_uncharge_file_region+0x4b/0x80
    region_del+0x1d3/0x300
    hugetlb_unreserve_pages+0x39/0xb0
    remove_inode_hugepages+0x1a8/0x3d0
    hugetlbfs_fallocate+0x3c4/0x5c0
    vfs_fallocate+0x146/0x290
    __x64_sys_fallocate+0x3e/0x70
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Investigation of the issue uncovered bugs in hugetlb cgroup reservation
    accounting; this patch addresses the issues that were found.
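
    For context, the QEMU-side operation that triggers this path is simply
    a hole punch on the hugetlbfs-backed guest memory. A minimal userspace
    illustration (the file path and the 2M size are hypothetical stand-ins):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                /* Hypothetical hugetlbfs-backed file standing in for guest RAM. */
                int fd = open("/dev/hugepages/guest-ram", O_RDWR);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                /* Discard one 2M huge page at offset 0; PUNCH_HOLE must be
                 * combined with KEEP_SIZE. */
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              0, 2 * 1024 * 1024) < 0)
                        perror("fallocate");

                close(fd);
                return 0;
        }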

    Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings")
    Reported-by: Michal Privoznik
    Co-developed-by: David Hildenbrand
    Signed-off-by: David Hildenbrand
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Michal Privoznik
    Reviewed-by: Mina Almasry
    Acked-by: Michael S. Tsirkin
    Cc:
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Muchun Song
    Cc: "Aneesh Kumar K . V"
    Cc: Tejun Heo
    Link: https://lkml.kernel.org/r/20201021204426.36069-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • commit 6f42193fd86e ("memremap: don't use a separate devm action for
    devmap_managed_enable_get") changed the static key updates such that we
    now call devmap_managed_enable_put() without doing the equivalent
    devmap_managed_enable_get().

    devmap_managed_enable_get() is only called for MEMORY_DEVICE_PRIVATE and
    MEMORY_DEVICE_FS_DAX, but memunmap_pages() gets called for other pgmap
    types too. This results in the warning below when switching between
    system-ram and devdax mode for a devdax namespace.

    jump label: negative count!
    WARNING: CPU: 52 PID: 1335 at kernel/jump_label.c:235 static_key_slow_try_dec+0x88/0xa0
    Modules linked in:
    ....

    NIP static_key_slow_try_dec+0x88/0xa0
    LR static_key_slow_try_dec+0x84/0xa0
    Call Trace:
    static_key_slow_try_dec+0x84/0xa0
    __static_key_slow_dec_cpuslocked+0x34/0xd0
    static_key_slow_dec+0x54/0xf0
    memunmap_pages+0x36c/0x500
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    bus_remove_device+0x124/0x210
    device_del+0x1d4/0x530
    unregister_dev_dax+0x48/0xe0
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    unbind_store+0x130/0x170
    drv_attr_store+0x40/0x60
    sysfs_kf_write+0x6c/0xb0
    kernfs_fop_write+0x118/0x280
    vfs_write+0xe8/0x2a0
    ksys_write+0x84/0x140
    system_call_exception+0x120/0x270
    system_call_common+0xf0/0x27c
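
    The fix (sketched here, not the exact upstream diff) is to make the put
    side conditional on the same pgmap types that take the static key
    reference:

        static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
        {
                /* Only drop the key for the types that actually took it. */
                if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
                    pgmap->type == MEMORY_DEVICE_FS_DAX)
                        static_branch_dec(&devmap_managed_key);
        }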

    Reported-by: Aneesh Kumar K.V
    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Ira Weiny
    Reviewed-by: Christoph Hellwig
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Link: https://lkml.kernel.org/r/20201023183222.13186-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

31 Oct, 2020

1 commit


28 Oct, 2020

2 commits

  • With e.g. m68k/defconfig:

    mm/process_vm_access.c: In function ‘process_vm_rw’:
    mm/process_vm_access.c:277:5: error: implicit declaration of function ‘in_compat_syscall’ [-Werror=implicit-function-declaration]
    277 | in_compat_syscall());
    | ^~~~~~~~~~~~~~~~~

    Fix this by adding #include <linux/compat.h>.

    Reported-by: noreply@ellerman.id.au
    Reported-by: damian
    Reported-by: Naresh Kamboju
    Fixes: 38dc5079da7081e8 ("Fix compat regression in process_vm_rw()")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • The removal of compat_process_vm_{readv,writev} didn't change
    process_vm_rw(), which always assumes it's not doing a compat syscall.

    Instead of passing in 'false' unconditionally for 'compat', make it
    conditional on in_compat_syscall().

    [ Both Al and Christoph point out that trying to access a 64-bit process
    from a 32-bit one cannot work anyway, and is likely better prohibited,
    but that's a separate issue - Linus ]
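
    The change itself is essentially a one-liner in process_vm_rw();
    assuming the iovec_from_user() call site for the remote iovec, it
    amounts to:

        /* before: the remote iovec was always imported as native */
        iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r,
                                false);

        /* after: honour the compat status of the calling task */
        iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r,
                                in_compat_syscall());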

    Fixes: c3973b401ef2 ("mm: remove compat_process_vm_{readv,writev}")
    Reported-and-tested-by: Kyle Huey
    Signed-off-by: Jens Axboe
    Acked-by: Al Viro
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

24 Oct, 2020

1 commit

  • Pull clone/dedupe/remap code refactoring from Darrick Wong:
    "Move the generic file range remap (aka reflink and dedupe) functions
    out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to
    reduce clutter in the first two files"

    * tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: move the generic write and copy checks out of mm
    vfs: move the remap range helpers to remap_range.c
    vfs: move generic_remap_checks out of mm

    Linus Torvalds
     

21 Oct, 2020

2 commits

  • Pull XArray updates from Matthew Wilcox:

    - Fix the test suite after introduction of the local_lock

    - Fix a bug in the IDA spotted by Coverity

    - Change the API that allows the workingset code to delete a node

    - Fix xas_reload() when dealing with entries that occupy multiple
    indices

    - Add a few more tests to the test suite

    - Fix an unsigned int being shifted into an unsigned long

    * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
    XArray: Fix xas_create_range for ranges above 4 billion
    radix-tree: fix the comment of radix_tree_next_slot()
    XArray: Fix xas_reload for multi-index entries
    XArray: Add private interface for workingset node deletion
    XArray: Fix xas_for_each_conflict documentation
    XArray: Test marked multiorder iterations
    XArray: Test two more things about xa_cmpxchg
    ida: Free allocated bitmap in error path
    radix tree test suite: Fix compilation

    Linus Torvalds
     
  • Pull io_uring updates from Jens Axboe:
    "A mix of fixes and a few stragglers. In detail:

    - Revert the bogus __read_mostly that we discussed for the initial
    pull request.

    - Fix a merge window regression with fixed file registration error
    path handling.

    - Fix io-wq numa node affinities.

    - Series abstracting out an io_identity struct, making it both easier
    to see what the personality items are, and also easier to adopt
    more. Use this to cover audit logging.

    - Fix for read-ahead disabled block condition in async buffered
    reads, and using single-page read-ahead to unify which
    generic_file_buffer_read() path is used.

    - Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel)

    - Poll fix (Pavel)"

    * tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits)
    io_uring: use blk_queue_nowait() to check if NOWAIT supported
    mm: use limited read-ahead to satisfy read
    mm: mark async iocb read as NOWAIT once some data has been copied
    io_uring: fix double poll mask init
    io-wq: inherit audit loginuid and sessionid
    io_uring: use percpu counters to track inflight requests
    io_uring: assign new io_identity for task if members have changed
    io_uring: store io_identity in io_uring_task
    io_uring: COW io_identity on mismatch
    io_uring: move io identity items into separate struct
    io_uring: rely solely on work flags to determine personality.
    io_uring: pass required context in as flags
    io-wq: assign NUMA node locality if appropriate
    io_uring: fix error path cleanup in io_sqe_files_register()
    Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly"
    io_uring: fix REQ_F_COMP_LOCKED by killing it
    io_uring: dig out COMP_LOCK from deep call chain
    io_uring: don't put a poll req under spinlock
    io_uring: don't unnecessarily clear F_LINK_TIMEOUT
    io_uring: don't set COMP_LOCKED if won't put
    ...

    Linus Torvalds