10 Mar, 2019

1 commit

  • commit d3d6a18d7d351cbcc9b33dbedf710e65f8ce1595 upstream.

    wake_up_locked() may, but does not have to, be called with interrupts
    disabled. Since the fuse filesystem calls wake_up_locked() without
    disabling interrupts, aio_poll_wake() may be called with interrupts
    enabled. Since the kioctx.ctx_lock may be acquired from IRQ context,
    all code that acquires that lock from thread context must disable
    interrupts. Hence change the spin_trylock() call in aio_poll_wake()
    into a spin_trylock_irqsave() call. This patch fixes the following
    lockdep complaint:

    =====================================================
    WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
    5.0.0-rc4-next-20190131 #23 Not tainted
    -----------------------------------------------------
    syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908

    and this task is already holding:
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
    which would create a new lock dependency:
    (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
    (&(&ctx->ctx_lock)->rlock){..-.}

    ... which became SOFTIRQ-irq-safe at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
    percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
    percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
    percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
    percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
    rcu_do_batch kernel/rcu/tree.c:2486 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
    rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
    __do_softirq+0x266/0x95a kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:654 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
    smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
    kthread+0x357/0x430 kernel/kthread.c:247
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352

    to a SOFTIRQ-irq-unsafe lock:
    (&fiq->waitq){+.+.}

    ... which became SOFTIRQ-irq-unsafe at:
    ...
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Possible interrupt unsafe locking scenario:

    CPU0                    CPU1
    ----                    ----
    lock(&fiq->waitq);
                            local_irq_disable();
                            lock(&(&ctx->ctx_lock)->rlock);
                            lock(&fiq->waitq);
    <Interrupt>
      lock(&(&ctx->ctx_lock)->rlock);

    *** DEADLOCK ***

    1 lock held by syz-executor2/13779:
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908

    the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
    -> (&(&ctx->ctx_lock)->rlock){..-.} {
    IN-SOFTIRQ-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
    percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
    percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
    percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
    percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
    rcu_do_batch kernel/rcu/tree.c:2486 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
    rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
    __do_softirq+0x266/0x95a kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:654 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
    smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
    kthread+0x357/0x430 kernel/kthread.c:247
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
    INITIAL USE at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    __do_sys_io_cancel fs/aio.c:2052 [inline]
    __se_sys_io_cancel fs/aio.c:2035 [inline]
    __x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    }
    ... key at: [] __key.52370+0x0/0x40
    ... acquired at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    the dependencies between the lock to be acquired
    and SOFTIRQ-irq-unsafe lock:
    -> (&fiq->waitq){+.+.} {
    HARDIRQ-ON-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    SOFTIRQ-ON-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    INITIAL USE at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    }
    ... key at: [] __key.43450+0x0/0x40
    ... acquired at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    stack backtrace:
    CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline]
    check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605
    check_irq_usage kernel/locking/lockdep.c:1650 [inline]
    check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline]
    check_prev_add kernel/locking/lockdep.c:1860 [inline]
    check_prevs_add kernel/locking/lockdep.c:1968 [inline]
    validate_chain kernel/locking/lockdep.c:2339 [inline]
    __lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
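
    In code, the fix boils down to switching the lock acquisition in
    aio_poll_wake() to the IRQ-saving variant. This is a paraphrased,
    non-compilable sketch, not the verbatim upstream diff; the surrounding
    logic is elided:

```c
static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode,
                         int sync, void *key)
{
        struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
        struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
        unsigned long flags;
        ...
        /*
         * ctx_lock is also taken from IRQ context, and this callback may
         * run with interrupts enabled (e.g. via fuse's wake_up_locked()),
         * so interrupts must be disabled while the lock is held:
         */
        if (spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
                list_del(&iocb->ki_list);
                ...
                spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags);
        }
        ...
}
```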

    Reported-by: syzbot
    Cc: Christoph Hellwig
    Cc: Avi Kivity
    Cc: Miklos Szeredi
    Cc:
    Fixes: e8693bcfa0b4 ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19
    Signed-off-by: Miklos Szeredi
    [ bvanassche: added a comment ]
    Reluctantly-Acked-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

20 Dec, 2018

1 commit

  • commit a538e3ff9dabcdf6c3f477a373c629213d1c3066 upstream.

    Matthew pointed out that the ioctx_table is susceptible to spectre v1,
    because the index can be controlled by an attacker. The below patch
    should mitigate the attack for all of the aio system calls.

    Cc: stable@vger.kernel.org
    Reported-by: Matthew Wilcox
    Reported-by: Dan Carpenter
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     

17 Dec, 2018

1 commit

  • [ Upstream commit 53fffe29a9e664a999dd3787e4428da8c30533e0 ]

    If the ioprio capability check fails, we return without putting
    the file pointer.

    Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
    Signed-off-by: Jens Axboe
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Jens Axboe
     

14 Aug, 2018

2 commits

  • Pull vfs aio updates from Al Viro:
    "Christoph's aio poll, saner this time around.

    This time it's pretty much local to fs/aio.c. Hopefully race-free..."

    * 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    aio: allow direct aio poll comletions for keyed wakeups
    aio: implement IOCB_CMD_POLL
    aio: add a iocb refcount
    timerfd: add support for keyed wakeups

    Linus Torvalds
     
  • Pull vfs open-related updates from Al Viro:

    - "do we need fput() or put_filp()" rules are gone - it's always fput()
    now. We keep track of that state where it belongs - in ->f_mode.

    - int *opened mess killed - in finish_open(), in ->atomic_open()
    instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

    - alloc_file() wrappers with saner calling conventions are introduced
    (alloc_file_clone() and alloc_file_pseudo()); callers converted, with
    much simplification.

    - while we are at it, saner calling conventions for path_init() and
    link_path_walk(), simplifying things inside fs/namei.c (both on
    open-related paths and elsewhere).

    * 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    few more cleanups of link_path_walk() callers
    allow link_path_walk() to take ERR_PTR()
    make path_init() unconditionally paired with terminate_walk()
    document alloc_file() changes
    make alloc_file() static
    do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
    new helper: alloc_file_clone()
    create_pipe_files(): switch the first allocation to alloc_file_pseudo()
    anon_inode_getfile(): switch to alloc_file_pseudo()
    hugetlb_file_setup(): switch to alloc_file_pseudo()
    ocxlflash_getfile(): switch to alloc_file_pseudo()
    cxl_getfile(): switch to alloc_file_pseudo()
    ... and switch shmem_file_setup() to alloc_file_pseudo()
    __shmem_file_setup(): reorder allocations
    new wrapper: alloc_file_pseudo()
    kill FILE_{CREATED,OPENED}
    switch atomic_open() and lookup_open() to returning 0 in all success cases
    document ->atomic_open() changes
    ->atomic_open(): return 0 in all success cases
    get rid of 'opened' in path_openat() and the helpers downstream
    ...

    Linus Torvalds
     

06 Aug, 2018

3 commits

  • If we get a keyed wakeup for an aio poll waitqueue and the wakeup can
    acquire the ctx_lock without spinning, we can just complete the iocb
    straight from the wakeup callback to avoid a context switch.

    Signed-off-by: Christoph Hellwig
    Tested-by: Avi Kivity

    Christoph Hellwig
     
  • Simple one-shot poll through the io_submit() interface. To poll for
    a file descriptor the application should submit an iocb of type
    IOCB_CMD_POLL. It will poll the fd for the events specified in
    the first 32 bits of the aio_buf field of the iocb.

    Unlike poll or epoll without EPOLLONESHOT, this interface always works
    in one-shot mode; that is, once the iocb is completed, it has to be
    resubmitted.

    Signed-off-by: Christoph Hellwig
    Tested-by: Avi Kivity

    Christoph Hellwig
     
  • This is needed to prevent races caused by the way the ->poll API works.
    To avoid introducing overhead for other users of the iocbs we initialize
    it to zero and only do refcount operations if it is non-zero in the
    completion path.

    Signed-off-by: Christoph Hellwig
    Tested-by: Avi Kivity

    Christoph Hellwig
     

23 Jul, 2018

1 commit

  • Pull vfs fixes from Al Viro:
    "Fix several places that screw up cleanups after failures halfway
    through opening a file (one open-coding filp_clone_open() and getting
    it wrong, two misusing alloc_file()). That part is -stable fodder from
    the 'work.open' branch.

    And Christoph's regression fix for uapi breakage in aio series;
    include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
    definition of sigset_t, the reason for doing so in the first place had
    been bogus - there's no need to expose struct __aio_sigset in
    aio_abi.h at all"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    aio: don't expose __aio_sigset in uapi
    ocxlflash_getfile(): fix double-iput() on alloc_file() failures
    cxl_getfile(): fix double-iput() on alloc_file() failures
    drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()

    Linus Torvalds
     

18 Jul, 2018

1 commit

  • glibc uses a different definition of sigset_t than the kernel does,
    and the current version would pull in both. To fix this, just do not
    expose the type at all - this somewhat mirrors pselect(), where we
    do not even have a type for the magic sigmask argument, but just
    use pointer arithmetic.

    Fixes: 7a074e96 ("aio: implement io_pgetevents")
    Signed-off-by: Christoph Hellwig
    Reported-by: Adrian Reber
    Signed-off-by: Al Viro

    Christoph Hellwig
     

12 Jul, 2018

2 commits


29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jun, 2018

1 commit


05 Jun, 2018

1 commit


31 May, 2018

2 commits

  • This is the per-I/O equivalent of the ioprio_set system call.

    When IOCB_FLAG_IOPRIO is set in the iocb aio_flags field, we set the
    newly added kiocb ki_ioprio field to the value of the iocb aio_reqprio field.

    This patch depends on block: add ioprio_check_cap function.

    Signed-off-by: Adam Manzanares
    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Adam Manzanares
     
  • In order to avoid kiocb bloat for per-command I/O priority support,
    rw_hint is converted from an enum to a u16, and a guard is added around
    the ki_hint assignment.

    Signed-off-by: Adam Manzanares
    Signed-off-by: Al Viro

    Adam Manzanares
     

30 May, 2018

6 commits


29 May, 2018

1 commit


26 May, 2018

5 commits

  • If we can acquire ctx_lock without spinning we can just remove our
    iocb from the active_reqs list, and thus complete the iocbs from the
    wakeup context.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Simple one-shot poll through the io_submit() interface. To poll for
    a file descriptor the application should submit an iocb of type
    IOCB_CMD_POLL. It will poll the fd for the events specified in
    the first 32 bits of the aio_buf field of the iocb.

    Unlike poll or epoll without EPOLLONESHOT, this interface always works
    in one-shot mode; that is, once the iocb is completed, it has to be
    resubmitted.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • With the current aio code there is no need for the magic KIOCB_CANCELLED
    value, as a cancelation just kicks the driver to queue the completion
    ASAP, with all actual completion handling done in another thread. Given
    that both the completion path and cancelation take the context lock there
    is no need for magic cmpxchg loops either. If we remove iocbs from the
    active list after calling ->ki_cancel (but with ctx_lock still held), we
    can also rely on the invariant that anything found on the list has a
    ->ki_cancel callback and can be cancelled, further simplifying the code.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • No need to pass the key field to lookup_iocb to compare it with KIOCB_KEY,
    as we can do that right after retrieving it from userspace. Also move the
    KIOCB_KEY definition to aio.c as it is an internal value not used by any
    other place in the kernel.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Christoph Hellwig
     

24 May, 2018

1 commit

  • If io_destroy() gets to cancelling everything that can be cancelled and
    gets to kiocb_cancel() calling the function the driver has left in
    ->ki_cancel, it becomes vulnerable to a race with IO completion. At that
    point req is already taken off the list and aio_complete() does *NOT*
    spin until we (in free_ioctx_users()) release ->ctx_lock. As a result,
    it proceeds to kiocb_free(), freeing req just as it gets passed to
    ->ki_cancel().

    The fix is simple - remove from the list after the call of kiocb_cancel().
    All instances of ->ki_cancel() already have to cope with being called with
    the iocb still on the list - that's what happens in io_cancel(2).

    Cc: stable@kernel.org
    Fixes: 0460fef2a921 "aio: use cancellation list lazily"
    Signed-off-by: Al Viro

    Al Viro
     

22 May, 2018

1 commit

  • kill_ioctx() used to have an explicit RCU delay between removing the
    reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
    At some point that delay had been removed, on the theory that
    percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was
    the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
    by lookup_ioctx(). As a result, we could get ctx freed right under
    lookup_ioctx(). Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit
    RCU grace period when freeing kioctx"); however, that fix is not enough.

    Suppose io_destroy() from one thread races with e.g. io_setup() from another;
    CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
    has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the
    refcount, getting it to 0 and triggering a call of free_ioctx_users(),
    which proceeds to drop the secondary refcount and once that reaches zero
    calls free_ioctx_reqs(). That does
    INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
    queue_rcu_work(system_wq, &ctx->free_rwork);
    and schedules freeing the whole thing after RCU delay.

    In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
    refcount from 0 to 1 and returned the reference to io_setup().

    Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
    freed until after percpu_ref_get(). Sure, we'd increment the counter before
    ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to
    stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it
    has grabbed the reference, ctx is *NOT* going away until it gets around to
    dropping that reference.

    The fix is obvious - use percpu_ref_tryget_live() and treat failure as a
    miss. It's not costlier than what we currently do in the normal case, it's
    safe to call since freeing *is* delayed, and it closes the race window - either
    lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
    won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
    fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
    the object in question at all.
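
    In lookup_ioctx() terms, the fix has roughly this shape (a paraphrased,
    non-compilable sketch, not the verbatim upstream diff):

```c
rcu_read_lock();
table = rcu_dereference(mm->ioctx_table);
...
ctx = rcu_dereference(table->table[id]);
if (ctx && ctx->user_id == ctx_id) {
        /*
         * tryget_live fails once percpu_ref_kill() has marked the ref
         * dead, so a context racing with io_destroy() is treated as a
         * miss instead of being resurrected from refcount zero.
         */
        if (percpu_ref_tryget_live(&ctx->users))
                ret = ctx;
}
rcu_read_unlock();
```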

    Cc: stable@kernel.org
    Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx"
    Signed-off-by: Al Viro

    Al Viro
     

03 May, 2018

7 commits

  • This is the io_getevents equivalent of ppoll/pselect and allows one to
    properly mix signals and aio completions (especially with IOCB_CMD_POLL);
    it atomically executes the following sequence:

    sigset_t origmask;

    pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
    ret = io_getevents(ctx, min_nr, nr, events, timeout);
    pthread_sigmask(SIG_SETMASK, &origmask, NULL);

    Note that unlike many other signal related calls we do not pass a sigmask
    size, as that would get us to 7 arguments, which aren't easily supported
    by the syscall infrastructure. It seems a lot less painful to just add a
    new syscall variant in the unlikely case we're going to increase the
    sigset size.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • Simple workqueue offload for now, but prepared for adding a real aio_fsync
    method if the need arises. Based on an earlier patch from Dave Chinner.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • Don't reference the kiocb structure from the common aio code, and move
    any use of it into helper specific to the read/write path. This is in
    preparation for aio_poll support that wants to use the space for different
    fields.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jeff Moyer
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • If we release the lockdep write protection token before calling into
    ->write_iter and thus never access the file pointer after an -EIOCBQUEUED
    return from ->write_iter or ->read_iter we don't need this extra
    reference.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Instead of handcoded non-null checks always initialize ki_list to an
    empty list and use list_empty / list_empty_careful on it. While we're
    at it also error out on a double call to kiocb_set_cancel_fn instead
    of ignoring it.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jeff Moyer
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • These days we don't treat sync iocbs specially in the aio completion code,
    as they never use it. Remove the old comment and BUG_ON, given that the
    current definition of is_sync_kiocb makes it impossible to hit.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • The page size is in no way related to the aio code, and printing it in
    the (debug) dmesg at every boot serves no purpose.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jeff Moyer
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     

20 Mar, 2018

1 commit


15 Mar, 2018

1 commit

  • While converting ioctx index from a list to a table, db446a08c23d
    ("aio: convert the ioctx list to table lookup v3") missed tagging
    kioctx_table->table[] as an array of RCU pointers and using the
    appropriate RCU accessors. This introduces a small window in the
    lookup path where init and access may race.

    Mark kioctx_table->table[] with __rcu and use the appropriate RCU
    accessors when using the field.
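
    The shape of the change, as a paraphrased sketch (not the verbatim
    upstream diff; locking around the updater is elided):

```c
struct kioctx_table {
        struct rcu_head         rcu;
        unsigned                nr;
        struct kioctx __rcu     *table[];       /* now tagged __rcu */
};

/* readers (e.g. lookup_ioctx()), under rcu_read_lock(): */
table = rcu_dereference(mm->ioctx_table);
ctx = rcu_dereference(table->table[id]);

/* updaters, under the relevant lock: */
rcu_assign_pointer(table->table[i], ctx);
```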

    Signed-off-by: Tejun Heo
    Reported-by: Jann Horn
    Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3")
    Cc: Benjamin LaHaise
    Cc: Linus Torvalds
    Cc: stable@vger.kernel.org # v3.12+

    Tejun Heo