Eric Lee / smarc-fsl-linux-kernel

01 Oct, 2020

8 commits

0f2122045 io_uring: don't rely on weak ->files references ... Browse Code »

Grab actual references to the files_struct. To avoid circular references
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the tasks execs or exits its
assigned files, we cancel requests based on this tracking.

With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.

Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe

Jens Axboe
2020-10-01 10:32:32 +0800
e6c8aa9ac io_uring: enable task/files specific overflow flushing ... Browse Code »

This allows us to selectively flush out pending overflows, depending on
the task and/or files_struct being passed in.

No intended functional changes in this patch.

Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe

Jens Axboe
2020-10-01 10:32:32 +0800
76e1b6427 io_uring: return cancelation status from poll/timeout/files handlers ... Browse Code »

Return whether we found and canceled requests or not. This is in
preparation for using this information, no functional changes in this
patch.

Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe

Jens Axboe
2020-10-01 10:32:32 +0800
e3bc8e9da io_uring: unconditionally grab req->task ... Browse Code »

Sometimes we assign a weak reference to it, sometimes we grab a
reference to it. Clean this up and make it unconditional, and drop the
flag related to tracking this state.

Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe

Jens Axboe
2020-10-01 10:32:32 +0800
2aede0e41 io_uring: stash ctx task reference for SQPOLL ... Browse Code »

We can grab a reference to the task instead of stashing away the task
files_struct. This is doable without creating a circular reference
between the ring fd and the task itself.

Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe

Jens Axboe
2020-10-01 10:32:32 +0800
f573d3844 io_uring: move dropping of files into separate helper ... Browse Code »

No functional changes in this patch, prep patch for grabbing references
to the files_struct.

Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe

Jens Axboe
2020-10-01 10:32:32 +0800
f3606e3a9 io_uring: allow timeout/poll/files killing to take task into account ... Browse Code »

We currently cancel these when the ring exits, and we cancel all of
them. This is in preparation for killing only the ones associated
with a given task.

Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe

Jens Axboe
2020-10-01 10:32:32 +0800
0f0788969 Merge branch 'io_uring-5.9' into for-5.10/io_uring ... Browse Code »

* io_uring-5.9:
io_uring: fix async buffered reads when readahead is disabled
io_uring: fix potential ABBA deadlock in ->show_fdinfo()
io_uring: always delete double poll wait entry on match

Jens Axboe
2020-10-01 10:32:25 +0800

29 Sep, 2020

3 commits

c8d317aa1 io_uring: fix async buffered reads when readahead is disabled ... Browse Code »

The async buffered reads feature is not working when readahead is
turned off. There are two things to concern:

- when doing retry in io_read, not only the IOCB_WAITQ flag but also
the IOCB_NOWAIT flag is still set, which makes it goes to would_block
phase in generic_file_buffered_read() and then return -EAGAIN. After
that, the io-wq thread work is queued, and later doing the async
reads in the old way.

- even if we remove IOCB_NOWAIT when doing retry, the feature is still
not running properly, since in generic_file_buffered_read() it goes to
lock_page_killable() after calling mapping->a_ops->readpage() to do
IO, and thus causing process to sleep.

Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
Fixes: 3b2a4439e0ae ("io_uring: get rid of kiocb_wait_page_queue_init()")
Signed-off-by: Hao Xu
Signed-off-by: Jens Axboe

Hao Xu
2020-09-29 21:54:00 +0800
fb0155a09 Merge tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs ... Browse Code »

Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:

- NFSv4.2: copy_file_range needs to invalidate caches on success

- NFSv4.2: Fix security label length not being reset

- pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly
on read

- pNFS/flexfiles: Fix signed/unsigned type issues with mirror
indices"

* tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS/flexfiles: Be consistent about mirror index types
pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
NFSv4.2: fix client's attribute cache management for copy_file_range
nfs: Fix security label length not being reset

Linus Torvalds
2020-09-29 02:05:56 +0800
a4d63c373 mm: do not rely on mm == current->mm in __get_user_pages_locked ... Browse Code »

It seems likely this block was pasted from internal_get_user_pages_fast,
which is not passed an mm struct and therefore uses current's. But
__get_user_pages_locked is passed an explicit mm, and current->mm is not
always valid. This was hit when being called from i915, which uses:

pin_user_pages_remote->
__get_user_pages_remote->
__gup_longterm_locked->
__get_user_pages_locked

Before, this would lead to an OOPS:

BUG: kernel NULL pointer dereference, address: 0000000000000064
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S U O 5.9.0-rc7+ #140
Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
RIP: 0010:__get_user_pages_remote+0xd7/0x310
Call Trace:
__i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
process_one_work+0x1ca/0x390
worker_thread+0x48/0x3c0
kthread+0x114/0x130
ret_from_fork+0x1f/0x30
CR2: 0000000000000064

This commit fixes the problem by using the mm pointer passed to the
function rather than the bogus one in current.

Fixes: 008cfe4418b3 ("mm: Introduce mm_struct.has_pinned")
Tested-by: Chris Wilson
Reported-by: Harald Arnesen
Reviewed-by: Jason Gunthorpe
Reviewed-by: Peter Xu
Signed-off-by: Jason A. Donenfeld
Signed-off-by: Linus Torvalds

Jason A. Donenfeld
2020-09-29 00:21:50 +0800

28 Sep, 2020

10 commits

fad8e0de4 io_uring: fix potential ABBA deadlock in ->show_fdinfo() ... Browse Code »

syzbot reports a potential lock deadlock between the normal IO path and
->show_fdinfo():

======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc6-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.2/19710 is trying to acquire lock:
ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296

but task is already holding lock:
ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&ctx->uring_lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
__io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
seq_show+0x4a8/0x700 fs/proc/fd.c:65
seq_read+0x432/0x1070 fs/seq_file.c:208
do_loop_readv_writev fs/read_write.c:734 [inline]
do_loop_readv_writev fs/read_write.c:721 [inline]
do_iter_read+0x48e/0x6e0 fs/read_write.c:955
vfs_readv+0xe5/0x150 fs/read_write.c:1073
kernel_readv fs/splice.c:355 [inline]
default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
do_splice_to+0x137/0x170 fs/splice.c:871
splice_direct_to_actor+0x307/0x980 fs/splice.c:950
do_splice_direct+0x1b3/0x280 fs/splice.c:1059
do_sendfile+0x55f/0xd40 fs/read_write.c:1540
__do_sys_sendfile64 fs/read_write.c:1601 [inline]
__se_sys_sendfile64 fs/read_write.c:1587 [inline]
__x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&p->lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
seq_read+0x61/0x1070 fs/seq_file.c:155
pde_read fs/proc/inode.c:306 [inline]
proc_reg_read+0x221/0x300 fs/proc/inode.c:318
do_loop_readv_writev fs/read_write.c:734 [inline]
do_loop_readv_writev fs/read_write.c:721 [inline]
do_iter_read+0x48e/0x6e0 fs/read_write.c:955
vfs_readv+0xe5/0x150 fs/read_write.c:1073
kernel_readv fs/splice.c:355 [inline]
default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
do_splice_to+0x137/0x170 fs/splice.c:871
splice_direct_to_actor+0x307/0x980 fs/splice.c:950
do_splice_direct+0x1b3/0x280 fs/splice.c:1059
do_sendfile+0x55f/0xd40 fs/read_write.c:1540
__do_sys_sendfile64 fs/read_write.c:1601 [inline]
__se_sys_sendfile64 fs/read_write.c:1587 [inline]
__x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (sb_writers#4){.+.+}-{0:0}:
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x228/0x450 fs/super.c:1672
io_write+0x6b5/0xb30 fs/io_uring.c:3296
io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
__io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
io_submit_sqe fs/io_uring.c:6324 [inline]
io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
__do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9

other info that might help us debug this:

Chain exists of:
sb_writers#4 --> &p->lock --> &ctx->uring_lock

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&ctx->uring_lock);
lock(&p->lock);
lock(&ctx->uring_lock);
lock(sb_writers#4);

*** DEADLOCK ***

1 lock held by syz-executor.2/19710:
#0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

stack backtrace:
CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
check_prev_add kernel/locking/lockdep.c:2496 [inline]
check_prevs_add kernel/locking/lockdep.c:2601 [inline]
validate_chain kernel/locking/lockdep.c:3218 [inline]
__lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write+0x228/0x450 fs/super.c:1672
io_write+0x6b5/0xb30 fs/io_uring.c:3296
io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
__io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
io_submit_sqe fs/io_uring.c:6324 [inline]
io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
__do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e179
Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c

Fix this by just not diving into details if we fail to trylock the
io_uring mutex. We know the ctx isn't going away during this operation,
but we cannot safely iterate buffers/files/personalities if we don't
hold the io_uring mutex.

Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe

Jens Axboe
2020-09-28 23:06:08 +0800
8706e04ed io_uring: always delete double poll wait entry on match ... Browse Code »

syzbot reports a crash with tty polling, which is using the double poll
handling:

general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
CPU: 0 PID: 6874 Comm: syz-executor749 Not tainted 5.9.0-rc6-next-20200924-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_poll_get_single fs/io_uring.c:4778 [inline]
RIP: 0010:io_poll_double_wake+0x51/0x510 fs/io_uring.c:4845
Code: fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 9e 03 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 5d 08 48 8d 7b 48 48 89 fa 48 c1 ea 03 b6 04 02 84 c0 74 06 0f 8e 63 03 00 00 0f b6 6b 48 bf 06 00 00
RSP: 0018:ffffc90001c1fb70 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000009 RSI: ffffffff81d9b3ad RDI: 0000000000000048
RBP: dffffc0000000000 R08: ffff8880a3cac798 R09: ffffc90001c1fc60
R10: fffff52000383f73 R11: 0000000000000000 R12: 0000000000000004
R13: ffff8880a3cac798 R14: ffff8880a3cac7a0 R15: 0000000000000004
FS: 0000000001f98880(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f18886916c0 CR3: 0000000094c5a000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__wake_up_common+0x147/0x650 kernel/sched/wait.c:93
__wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123
tty_ldisc_hangup+0x1cf/0x680 drivers/tty/tty_ldisc.c:735
__tty_hangup.part.0+0x403/0x870 drivers/tty/tty_io.c:625
__tty_hangup drivers/tty/tty_io.c:575 [inline]
tty_vhangup+0x1d/0x30 drivers/tty/tty_io.c:698
pty_close+0x3f5/0x550 drivers/tty/pty.c:79
tty_release+0x455/0xf60 drivers/tty/tty_io.c:1679
__fput+0x285/0x920 fs/file_table.c:281
task_work_run+0xdd/0x190 kernel/task_work.c:141
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:165 [inline]
exit_to_user_mode_prepare+0x1e2/0x1f0 kernel/entry/common.c:192
syscall_exit_to_user_mode+0x7a/0x2c0 kernel/entry/common.c:267
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x401210

which is due to a failure in removing the double poll wait entry if we
hit a wakeup match. This can cause multiple invocations of the wakeup,
which isn't safe.

Cc: stable@vger.kernel.org # v5.8
Reported-by: syzbot+81b3883093f772addf6d@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe

Jens Axboe
2020-09-28 22:38:54 +0800
a1b8638ba Linux 5.9-rc7 Browse Code »

Linus Torvalds
2020-09-28 05:38:10 +0800
16bc1d543 Merge tag 'kbuild-fixes-v5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git… ... Browse Code »

…/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- ignore compiler stubs for PPC to fix builds

- fix the usage of --target mentioned in the LLVM document

* tag 'kbuild-fixes-v5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
Documentation/llvm: Fix clang target examples
scripts/kallsyms: skip ppc compiler stub *.long_branch.* / *.plt_branch.*

Linus Torvalds
2020-09-28 03:18:57 +0800
f8818559c Merge tag 'x86-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull x86 fixes from Thomas Gleixner:
"Two fixes for the x86 interrupt code:

- Unbreak the magic 'search the timer interrupt' logic in IO/APIC
code which got wreckaged when the core interrupt code made the
state tracking logic stricter.

That caused the interrupt line to stay masked after switching from
IO/APIC to PIC delivery mode, which obviously prevents interrupts
from being delivered.

- Make run_on_irqstack_code() typesafe. The function argument is a
void pointer which is then cast to 'void (*fun)(void *).

This breaks Control Flow Integrity checking in clang. Use proper
helper functions for the three variants reuqired"

* tag 'x86-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioapic: Unbreak check_timer()
x86/irq: Make run_on_irqstack_cond() typesafe

Linus Torvalds
2020-09-28 03:15:21 +0800
ba25f0570 Merge tag 'timers-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull timer updates from Thomas Gleixner:
"A set of clocksource/clockevents updates:

- Reset the TI/DM timer before enabling it instead of doing it the
other way round.

- Initialize the reload value for the GX6605s timer correctly so the
hardware counter starts at 0 again after overrun.

- Make error return value negative in the h8300 timer init function"

* tag 'timers-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/timer-gx6605s: Fixup counter reload
clocksource/drivers/timer-ti-dm: Do reset before enable
clocksource/drivers/h8300_timer8: Fix wrong return value in h8300_8timer_init()

Linus Torvalds
2020-09-28 03:11:35 +0800
d042035ea mm/thp: Split huge pmds/puds if they're pinned when fork() ... Browse Code »

Pinned pages shouldn't be write-protected when fork() happens, because
follow up copy-on-write on these pages could cause the pinned pages to
be replaced by random newly allocated pages.

For huge PMDs, we split the huge pmd if pinning is detected. So that
future handling will be done by the PTE level (with our latest changes,
each of the small pages will be copied). We can achieve this by let
copy_huge_pmd() return -EAGAIN for pinned pages, so that we'll
fallthrough in copy_pmd_range() and finally land the next
copy_pte_range() call.

Huge PUDs will be even more special - so far it does not support
anonymous pages. But it can actually be done the same as the huge PMDs
even if the split huge PUDs means to erase the PUD entries. It'll
guarantee the follow up fault ins will remap the same pages in either
parent/child later.

This might not be the most efficient way, but it should be easy and
clean enough. It should be fine, since we're tackling with a very rare
case just to make sure userspaces that pinned some thps will still work
even without MADV_DONTFORK and after they fork()ed.

Signed-off-by: Peter Xu
Signed-off-by: Linus Torvalds

Peter Xu
2020-09-28 02:21:35 +0800
70e806e4e mm: Do early cow for pinned pages during fork() for ptes ... Browse Code »

This allows copy_pte_range() to do early cow if the pages were pinned on
the source mm.

Currently we don't have an accurate way to know whether a page is pinned
or not. The only thing we have is page_maybe_dma_pinned(). However
that's good enough for now. Especially, with the newly added
mm->has_pinned flag to make sure we won't affect processes that never
pinned any pages.

It would be easier if we can do GFP_KERNEL allocation within
copy_one_pte(). Unluckily, we can't because we're with the page table
locks held for both the parent and child processes. So the page
allocation needs to be done outside copy_one_pte().

Some trick is there in copy_present_pte(), majorly the wrprotect trick
to block concurrent fast-gup. Comments in the function should explain
better in place.

Oleg Nesterov reported a (probably harmless) bug during review that we
didn't reset entry.val properly in copy_pte_range() so that potentially
there's chance to call add_swap_count_continuation() multiple times on
the same swp entry. However that should be harmless since even if it
happens, the same function (add_swap_count_continuation()) will return
directly noticing that there're enough space for the swp counter. So
instead of a standalone stable patch, it is touched up in this patch
directly.

Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
Suggested-by: Linus Torvalds
Signed-off-by: Peter Xu
Signed-off-by: Linus Torvalds

Peter Xu
2020-09-28 02:21:35 +0800
7a4830c38 mm/fork: Pass new vma pointer into copy_page_range() ... Browse Code »

This prepares for the future work to trigger early cow on pinned pages
during fork().

No functional change intended.

Signed-off-by: Peter Xu
Signed-off-by: Linus Torvalds

Peter Xu
2020-09-28 02:21:35 +0800
008cfe441 mm: Introduce mm_struct.has_pinned ... Browse Code »

(Commit message majorly collected from Jason Gunthorpe)

Reduce the chance of false positive from page_maybe_dma_pinned() by
keeping track if the mm_struct has ever been used with pin_user_pages().
This allows cases that might drive up the page ref_count to avoid any
penalty from handling dma_pinned pages.

Future work is planned, to provide a more sophisticated solution, likely
to turn it into a real counter. For now, make it atomic_t but use it as
a boolean for simplicity.

Suggested-by: Jason Gunthorpe
Signed-off-by: Peter Xu
Signed-off-by: Linus Torvalds

Peter Xu
2020-09-28 02:21:35 +0800

27 Sep, 2020

17 commits

a7b6c0fed Merge tag 'timers-v5.9-rc4' of https://git.linaro.org/people/daniel.lezcano/linux into timers/urgent ... Browse Code »

Pull clocksource/clockevent fixes from Daniel Lezcano:

- Fix wrong signed return value when checking of_iomap in the probe
function for the h8300 timer (Tianjia Zhang)

- Fix reset sequence when setting up the timer on the dm_timer (Tony
Lindgren)

- Fix counter reload when the interrupt fires on gx6605s (Guo Ren)

Thomas Gleixner
2020-09-27 17:24:34 +0800
a1bffa487 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi ... Browse Code »

Pull SCSI fixes from James Bottomley:
"Three fixes: one in drivers (lpfc) and two for zoned block devices.

The latter also impinges on the block layer but only to introduce a
new block API for setting the zone model rather than fiddling with the
queue directly in the zoned block driver"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: sd_zbc: Fix ZBC disk initialization
scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks
scsi: lpfc: Fix initial FLOGI failure due to BBSCN not supported

Linus Torvalds
2020-09-27 02:18:37 +0800
692495baa Merge tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block ... Browse Code »

Pull io_uring fixes from Jens Axboe:
"Two fixes for regressions in this cycle, and one that goes to 5.8
stable:

- fix leak of getname() retrieved filename

- remove plug->nowait assignment, fixing a regression with btrfs

- fix for async buffered retry"

* tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
io_uring: ensure async buffered read-retry is setup properly
io_uring: don't unconditionally set plug->nowait = true
io_uring: ensure open/openat2 name is cleaned on cancelation

Linus Torvalds
2020-09-27 02:13:51 +0800
9d2fbaefb Merge tag 'block-5.9-2020-09-25' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes from Jens Axboe:
"NVMe pull request from Christoph, and removal of a dead define.

- fix error during controller probe that cause double free irqs
(Keith Busch)

- FC connection establishment fix (James Smart)

- properly handle completions for invalid tags (Xianting Tian)

- pass the correct nsid to the command effects and supported log
(Chaitanya Kulkarni)"

* tag 'block-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
block: remove unused BLK_QC_T_EAGAIN flag
nvme-core: don't use NVME_NSID_ALL for command effects and supported log
nvme-fc: fail new connections to a deleted host or remote port
nvme-pci: fix NULL req in completion handler
nvme: return errors for hwmon init

Linus Torvalds
2020-09-27 02:07:36 +0800
eeddbe684 Merge tag 's390-5.9-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux ... Browse Code »

Pull s390 fix from Vasily Gorbik:
"Fix truncated ZCRYPT_PERDEV_REQCNT ioctl result. Copy entire reqcnt
list"

* tag 's390-5.9-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/zcrypt: Fix ZCRYPT_PERDEV_REQCNT ioctl

Linus Torvalds
2020-09-27 02:01:18 +0800
8fb1e9103 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge misc fixes from Andrew Morton:
"9 patches.

Subsystems affected by this patch series: mm (thp, memcg, gup,
migration, memory-hotplug), lib, and x86"

* emailed patches from Andrew Morton :
mm: don't rely on system state to detect hot-plug operations
mm: replace memmap_context by meminit_context
arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback
lib/memregion.c: include memregion.h
lib/string.c: implement stpcpy
mm/migrate: correct thp migration stats
mm/gup: fix gup_fast with dynamic page table folding
mm: memcontrol: fix missing suffix of workingset_restore
mm, THP, swap: fix allocating cluster for swapfile by mistake

Linus Torvalds
2020-09-27 01:53:35 +0800
ce2684254 mm: validate pmd after splitting ... Browse Code »

syzbot reported the following KASAN splat:

general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
Call Trace:
lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
walk_pmd_range mm/pagewalk.c:89 [inline]
walk_pud_range mm/pagewalk.c:160 [inline]
walk_p4d_range mm/pagewalk.c:193 [inline]
walk_pgd_range mm/pagewalk.c:229 [inline]
__walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
madvise_pageout_page_range mm/madvise.c:521 [inline]
madvise_pageout mm/madvise.c:557 [inline]
madvise_vma mm/madvise.c:946 [inline]
do_madvise+0x12d0/0x2090 mm/madvise.c:1145
__do_sys_madvise mm/madvise.c:1171 [inline]
__se_sys_madvise mm/madvise.c:1169 [inline]
__x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The backing vma was shmem.

In case of split page of file-backed THP, madvise zaps the pmd instead
of remapping of sub-pages. So we need to check pmd validity after
split.

Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com
Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
Signed-off-by: Minchan Kim
Acked-by: Kirill A. Shutemov
Signed-off-by: Linus Torvalds

Minchan Kim
2020-09-27 01:48:08 +0800
f85086f95 mm: don't rely on system state to detect hot-plug operations ... Browse Code »

In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during an hot-plug
operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state. In
addition, memory hot-plug operation can be triggered at this system
state by the ACPI [1]. So checking against the system state is not
enough.

The consequence is that on system with interleaved node's ranges like this:

Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]

This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done. At the next reboot the node's memory
ranges can be interleaved and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node's id is not checked, and the sections registered to
multiple nodes:

$ ls -l /sys/devices/system/memory/memory21/node*
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2

In that case, the system is able to boot but if later one of theses
memory blocks is hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and this is triggering a BUG_ON():

kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c

This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation. An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.

[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:

$QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
-m size=$MEM,slots=255,maxmem=4294967296k \
-numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
-object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
-object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
-object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
-object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
-object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
-object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
-object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: Laurent Dufour
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Greg Kroah-Hartman
Cc: "Rafael J. Wysocki"
Cc: Fenghua Yu
Cc: Nathan Lynch
Cc: Scott Cheloha
Cc: Tony Luck
Cc:
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds

Laurent Dufour
2020-09-27 01:33:57 +0800
c1d0da833 mm: replace memmap_context by meminit_context ... Browse Code »

Patch series "mm: fix memory to node bad links in sysfs", v3.

Sometimes, firmware may expose interleaved memory layout like this:

Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]

In that case, we can see memory blocks assigned to multiple nodes in
sysfs:

$ ls -l /sys/devices/system/memory/memory21
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
-rw-r--r-- 1 root root 65536 Aug 24 05:27 online
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
drwxr-xr-x 2 root root 0 Aug 24 05:27 power
-r--r--r-- 1 root root 65536 Aug 24 05:27 removable
-rw-r--r-- 1 root root 65536 Aug 24 05:27 state
lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
-rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
-r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

The same applies in the node's directory with a memory21 link in both
the node1 and node2's directory.

This is wrong but doesn't prevent the system to run. However when
later, one of these memory blocks is hot-unplugged and then hot-plugged,
the system is detecting an inconsistency in the sysfs layout and a
BUG_ON() is raised:

kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c

This has been seen on PowerPC LPAR.

The root cause of this issue is that when node's memory is registered,
the range used can overlap another node's range, thus the memory block
is registered to multiple nodes in sysfs.

There are two issues here:

(a) The sysfs memory and node's layouts are broken due to these
multiple links

(b) The link errors in link_mem_sections() should not lead to a system
panic.

To address (a) register_mem_sect_under_node should not rely on the
system state to detect whether the link operation is triggered by a hot
plug operation or not. This is addressed by the patches 1 and 2 of this
series.

Issue (b) will be addressed separately.

This patch (of 2):

The memmap_context enum is used to detect whether a memory operation is
due to a hot-add operation or happening at boot time.

Make it general to the hotplug operation and rename it as
meminit_context.

There is no functional change introduced by this patch

Suggested-by: David Hildenbrand
Signed-off-by: Laurent Dufour
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Greg Kroah-Hartman
Cc: "Rafael J . Wysocki"
Cc: Nathan Lynch
Cc: Scott Cheloha
Cc: Tony Luck
Cc: Fenghua Yu
Cc:
Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds

Laurent Dufour
2020-09-27 01:33:57 +0800
a1cd6c2ae arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback ... Browse Code »

If we copy less than 8 bytes and if the destination crosses a cache
line, __copy_user_flushcache would invalidate only the first cache line.

This patch makes it invalidate the second cache line as well.

Fixes: 0aed55af88345b ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations")
Signed-off-by: Mikulas Patocka
Signed-off-by: Andrew Morton
Reviewed-by: Dan Williams
Cc: Jan Kara
Cc: Jeff Moyer
Cc: Ingo Molnar
Cc: Christoph Hellwig
Cc: Toshi Kani
Cc: "H. Peter Anvin"
Cc: Al Viro
Cc: Thomas Gleixner
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Ingo Molnar
Cc:
Link: https://lkml.kernel.org/r/alpine.LRH.2.02.2009161451140.21915@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Linus Torvalds

Mikulas Patocka
2020-09-27 01:33:57 +0800
ffa550cd6 lib/memregion.c: include memregion.h ... Browse Code »

This addresses the following sparse warning:

lib/memregion.c:8:5: warning: symbol 'memregion_alloc' was not declared. Should it be static?
lib/memregion.c:14:6: warning: symbol 'memregion_free' was not declared. Should it be static?

Reported-by: Hulk Robot
Signed-off-by: Jason Yan
Signed-off-by: Andrew Morton
Link: https://lkml.kernel.org/r/20200921142852.875312-1-yanaijie@huawei.com
Signed-off-by: Linus Torvalds

Jason Yan
2020-09-27 01:33:57 +0800
1e1b6d63d lib/string.c: implement stpcpy ... Browse Code »

LLVM implemented a recent "libcall optimization" that lowers calls to
`sprintf(dest, "%s", str)` where the return value is used to
`stpcpy(dest, str) - dest`.

This generally avoids the machinery involved in parsing format strings.
`stpcpy` is just like `strcpy` except it returns the pointer to the new
tail of `dest`. This optimization was introduced into clang-12.

Implement this so that we don't observe linkage failures due to missing
symbol definitions for `stpcpy`.

Similar to last year's fire drill with: commit 5f074f3e192f
("lib/string.c: implement a basic bcmp")

The kernel is somewhere between a "freestanding" environment (no full
libc) and "hosted" environment (many symbols from libc exist with the
same type, function signature, and semantics).

As Peter Anvin notes, there's not really a great way to inform the
compiler that you're targeting a freestanding environment but would like
to opt-in to some libcall optimizations (see pr/47280 below), rather
than opt-out.

Arvind notes, -fno-builtin-* behaves slightly differently between GCC
and Clang, and Clang is missing many __builtin_* definitions, which I
consider a bug in Clang and am working on fixing.

Masahiro summarizes the subtle distinction between compilers justly:
To prevent transformation from foo() into bar(), there are two ways in
Clang to do that; -fno-builtin-foo, and -fno-builtin-bar. There is
only one in GCC; -fno-buitin-foo.

(Any difference in that behavior in Clang is likely a bug from a missing
__builtin_* definition.)

Masahiro also notes:
We want to disable optimization from foo() to bar(),
but we may still benefit from the optimization from
foo() into something else. If GCC implements the same transform, we
would run into a problem because it is not -fno-builtin-bar, but
-fno-builtin-foo that disables that optimization.

In this regard, -fno-builtin-foo would be more future-proof than
-fno-built-bar, but -fno-builtin-foo is still potentially overkill. We
may want to prevent calls from foo() being optimized into calls to
bar(), but we still may want other optimization on calls to foo().

It seems that compilers today don't quite provide the fine grain control
over which libcall optimizations pseudo-freestanding environments would
prefer.

Finally, Kees notes that this interface is unsafe, so we should not
encourage its use. As such, I've removed the declaration from any
header, but it still needs to be exported to avoid linkage errors in
modules.

Reported-by: Sami Tolvanen
Suggested-by: Andy Lavr
Suggested-by: Arvind Sankar
Suggested-by: Joe Perches
Suggested-by: Kees Cook
Suggested-by: Masahiro Yamada
Suggested-by: Rasmus Villemoes
Signed-off-by: Nick Desaulniers
Signed-off-by: Andrew Morton
Tested-by: Nathan Chancellor
Cc:
Link: https://lkml.kernel.org/r/20200914161643.938408-1-ndesaulniers@google.com
Link: https://bugs.llvm.org/show_bug.cgi?id=47162
Link: https://bugs.llvm.org/show_bug.cgi?id=47280
Link: https://github.com/ClangBuiltLinux/linux/issues/1126
Link: https://man7.org/linux/man-pages/man3/stpcpy.3.html
Link: https://pubs.opengroup.org/onlinepubs/9699919799/functions/stpcpy.html
Link: https://reviews.llvm.org/D85963
Signed-off-by: Linus Torvalds

Nick Desaulniers
2020-09-27 01:33:57 +0800
6c5c7b9f3 mm/migrate: correct thp migration stats ... Browse Code »

PageTransHuge returns true for both thp and hugetlb, so thp stats was
counting both thp and hugetlb migrations. Exclude hugetlb migration by
setting is_thp variable right.

Clean up thp handling code too when we are there.

Fixes: 1a5bae25e3cf ("mm/vmstat: add events for THP migration without split")
Signed-off-by: Zi Yan
Signed-off-by: Andrew Morton
Reviewed-by: Daniel Jordan
Cc: Anshuman Khandual
Link: https://lkml.kernel.org/r/20200917210413.1462975-1-zi.yan@sent.com
Signed-off-by: Linus Torvalds

Zi Yan
2020-09-27 01:33:57 +0800
d3f7b1bb2 mm/gup: fix gup_fast with dynamic page table folding ... Browse Code »

Currently to make sure that every page table entry is read just once
gup_fast walks perform READ_ONCE and pass pXd value down to the next
gup_pXd_range function by value e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
unsigned int flags, struct page **pages, int *nr)
...
pudp = pud_offset(&p4d, addr);

This function passes a reference on that local value copy to pXd_offset,
and might get the very same pointer in return. This happens when the
level is folded (on most arches), and that pointer should not be
iterated.

On s390 due to the fact that each task might have different 5,4 or
3-level address translation and hence different levels folded the logic
is more complex and non-iteratable pointer to a local copy leads to
severe problems.

Here is an example of what happens with gup_fast on s390, for a task
with 3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
unsigned int flags, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;

// pud_offset returns &p4d itself (a pointer to a value on stack)
pudp = pud_offset(&p4d, addr);
do {
// on second iteratation reading "random" stack value
pud_t pud = READ_ONCE(*pudp);

// next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
next = pud_addr_end(addr, end);
...
} while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

return 1;
}

This happens since s390 moved to common gup code with commit
d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code").

s390 tried to mimic static level folding by changing pXd_offset
primitives to always calculate top level page table offset in pgd_offset
and just return the value passed when pXd_offset has to act as folded.

What is crucial for gup_fast and what has been overlooked is that
PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
And the latter is not possible with dynamic folding.

To fix the issue in addition to pXd values pass original pXdp pointers
down to gup_pXd_range functions. And introduce pXd_offset_lockless
helpers, which take an additional pXd entry value parameter. This has
already been discussed in

https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Signed-off-by: Vasily Gorbik
Signed-off-by: Andrew Morton
Reviewed-by: Gerald Schaefer
Reviewed-by: Alexander Gordeev
Reviewed-by: Jason Gunthorpe
Reviewed-by: Mike Rapoport
Reviewed-by: John Hubbard
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Dave Hansen
Cc: Russell King
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Jeff Dike
Cc: Richard Weinberger
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: Arnd Bergmann
Cc: Andrey Ryabinin
Cc: Heiko Carstens
Cc: Christian Borntraeger
Cc: Claudio Imbrenda
Cc: [5.2+]
Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
Signed-off-by: Linus Torvalds

Vasily Gorbik
2020-09-27 01:33:57 +0800
8d3fe09d8 mm: memcontrol: fix missing suffix of workingset_restore ... Browse Code »

We forget to add the suffix to the workingset_restore string, so fix it.

And also update the documentation of cgroup-v2.rst.

Fixes: 170b04b7ae49 ("mm/workingset: prepare the workingset detection infrastructure for anon LRU")
Signed-off-by: Muchun Song
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Cc: Joonsoo Kim
Cc: Johannes Weiner
Cc: Vlastimil Babka
Cc: Tejun Heo
Cc: Zefan Li
Cc: Jonathan Corbet
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Roman Gushchin
Cc: Randy Dunlap
Link: https://lkml.kernel.org/r/20200916100030.71698-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds

Muchun Song
2020-09-27 01:33:57 +0800
416634305 mm, THP, swap: fix allocating cluster for swapfile by mistake ... Browse Code »

SWP_FS is used to make swap_{read,write}page() go through the
filesystem, and it's only used for swap files over NFS. So, !SWP_FS
means non NFS for now, it could be either file backed or device backed.
Something similar goes with legacy SWP_FILE.

So in order to achieve the goal of the original patch, SWP_BLKDEV should
be used instead.

FS corruption can be observed with SSD device + XFS + fragmented
swapfile due to CONFIG_THP_SWAP=y.

I reproduced the issue with the following details:

Environment:

QEMU + upstream kernel + buildroot + NVMe (2 GB)

Kernel config:

CONFIG_BLK_DEV_NVME=y
CONFIG_THP_SWAP=y

Some reproducible steps:

mkfs.xfs -f /dev/nvme0n1
mkdir /tmp/mnt
mount /dev/nvme0n1 /tmp/mnt
bs="32k"
sz="1024m" # doesn't matter too much, I also tried 16m
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw

mkswap /tmp/mnt/sw
swapon /tmp/mnt/sw

stress --vm 2 --vm-bytes 600M # doesn't matter too much as well

Symptoms:
- FS corruption (e.g. checksum failure)
- memory corruption at: 0xd2808010
- segfault

Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
Signed-off-by: Gao Xiang
Signed-off-by: Andrew Morton
Reviewed-by: "Huang, Ying"
Reviewed-by: Yang Shi
Acked-by: Rafael Aquini
Cc: Matthew Wilcox
Cc: Carlos Maiolino
Cc: Eric Sandeen
Cc: Dave Chinner
Cc:
Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
Signed-off-by: Linus Torvalds

Gao Xiang
2020-09-27 01:33:57 +0800
678ff6a7a mm: slab: fix potential double free in ___cache_free ... Browse Code »

With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
kmem_caches for all allocations"), it becomes possible to call kfree()
from the slabs_destroy().

The functions cache_flusharray() and do_drain() calls slabs_destroy() on
array_cache of the local CPU without updating the size of the
array_cache. This enables the kfree() call from the slabs_destroy() to
recursively call cache_flusharray() which can potentially call
free_block() on the same elements of the array_cache of the local CPU
and causing double free and memory corruption.

To fix the issue, simply update the local CPU array_cache cache before
calling slabs_destroy().

Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
Signed-off-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Tested-by: Ming Lei
Reported-by: kernel test robot
Cc: Andrew Morton
Cc: Ted Ts'o
Signed-off-by: Linus Torvalds

Shakeel Butt
2020-09-27 01:15:01 +0800

26 Sep, 2020

2 commits

e30d694c3 Documentation/llvm: Fix clang target examples ... Browse Code »

clang --target= is how we can specify a particular toolchain
triple to be use, fix the two occurences in the documentation.

Fixes: fcf1b6a35c16 ("Documentation/llvm: add documentation on building w/ Clang/LLVM")
Signed-off-by: Florian Fainelli
Reviewed-by: Nick Desaulniers
Reviewed-by: Nathan Chancellor
Signed-off-by: Masahiro Yamada

Florian Fainelli
2020-09-26 12:54:08 +0800
7c7ec3226 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm ... Browse Code »

Pull more kvm fixes from Paolo Bonzini:
"Five small fixes.

The nested migration bug will be fixed with a better API in 5.10 or
5.11, for now this is a fix that works with existing userspace but
keeps the current ugly API"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Add a dedicated INVD intercept routine
KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
KVM: x86: fix MSR_IA32_TSC read for nested migration
selftests: kvm: Fix assert failure in single-step test
KVM: x86: VMX: Make smaller physical guest address space support user-configurable

Linus Torvalds
2020-09-26 08:15:19 +0800