10 Mar, 2019

1 commit

  • commit d3d6a18d7d351cbcc9b33dbedf710e65f8ce1595 upstream.

    wake_up_locked() may, but does not have to, be called with interrupts
    disabled. Since the fuse filesystem calls wake_up_locked() without
    disabling interrupts, aio_poll_wake() may be called with interrupts
    enabled. Since the kioctx.ctx_lock may be acquired from IRQ context,
    all code that acquires that lock from thread context must disable
    interrupts. Hence change the spin_trylock() call in aio_poll_wake()
    into a spin_trylock_irqsave() call. This patch fixes the following
    lockdep complaint:

    =====================================================
    WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
    5.0.0-rc4-next-20190131 #23 Not tainted
    -----------------------------------------------------
    syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline]
    0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908

    and this task is already holding:
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
    000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
    which would create a new lock dependency:
    (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
    (&(&ctx->ctx_lock)->rlock){..-.}

    ... which became SOFTIRQ-irq-safe at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
    percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
    percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
    percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
    percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
    rcu_do_batch kernel/rcu/tree.c:2486 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
    rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
    __do_softirq+0x266/0x95a kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:654 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
    smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
    kthread+0x357/0x430 kernel/kthread.c:247
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352

    to a SOFTIRQ-irq-unsafe lock:
    (&fiq->waitq){+.+.}

    ... which became SOFTIRQ-irq-unsafe at:
    ...
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Possible interrupt unsafe locking scenario:

    CPU0                    CPU1
    ----                    ----
    lock(&fiq->waitq);
                            local_irq_disable();
                            lock(&(&ctx->ctx_lock)->rlock);
                            lock(&fiq->waitq);
    <Interrupt>
      lock(&(&ctx->ctx_lock)->rlock);

    *** DEADLOCK ***

    1 lock held by syz-executor2/13779:
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
    #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908

    the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
    -> (&(&ctx->ctx_lock)->rlock){..-.} {
    IN-SOFTIRQ-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
    percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
    percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
    percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
    percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
    rcu_do_batch kernel/rcu/tree.c:2486 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
    rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
    __do_softirq+0x266/0x95a kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:654 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
    smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
    kthread+0x357/0x430 kernel/kthread.c:247
    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
    INITIAL USE at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
    spin_lock_irq include/linux/spinlock.h:354 [inline]
    __do_sys_io_cancel fs/aio.c:2052 [inline]
    __se_sys_io_cancel fs/aio.c:2035 [inline]
    __x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    }
    ... key at: [] __key.52370+0x0/0x40
    ... acquired at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    the dependencies between the lock to be acquired
    and SOFTIRQ-irq-unsafe lock:
    -> (&fiq->waitq){+.+.} {
    HARDIRQ-ON-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    SOFTIRQ-ON-W at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    INITIAL USE at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
    fuse_send_init fs/fuse/inode.c:989 [inline]
    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
    mount_nodev+0x68/0x110 fs/super.c:1392
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
    vfs_get_tree+0x123/0x450 fs/super.c:1481
    do_new_mount fs/namespace.c:2610 [inline]
    do_mount+0x1436/0x2c40 fs/namespace.c:2932
    ksys_mount+0xdb/0x150 fs/namespace.c:3148
    __do_sys_mount fs/namespace.c:3162 [inline]
    __se_sys_mount fs/namespace.c:3159 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    }
    ... key at: [] __key.43450+0x0/0x40
    ... acquired at:
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    stack backtrace:
    CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline]
    check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605
    check_irq_usage kernel/locking/lockdep.c:1650 [inline]
    check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline]
    check_prev_add kernel/locking/lockdep.c:1860 [inline]
    check_prevs_add kernel/locking/lockdep.c:1968 [inline]
    validate_chain kernel/locking/lockdep.c:2339 [inline]
    __lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320
    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
    spin_lock include/linux/spinlock.h:329 [inline]
    aio_poll fs/aio.c:1772 [inline]
    __io_submit_one fs/aio.c:1875 [inline]
    io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
    __do_sys_io_submit fs/aio.c:1953 [inline]
    __se_sys_io_submit fs/aio.c:1923 [inline]
    __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
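
    In code, the fix boils down to switching the lock acquisition in
    aio_poll_wake() to the IRQ-saving variant. This is a paraphrased,
    non-compilable sketch, not the verbatim upstream diff; the surrounding
    logic is elided:

```c
static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode,
                         int sync, void *key)
{
        struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
        struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
        unsigned long flags;
        ...
        /*
         * ctx_lock is also taken from IRQ context, and this callback may
         * run with interrupts enabled (e.g. via fuse's wake_up_locked()),
         * so interrupts must be disabled while the lock is held:
         */
        if (spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
                list_del(&iocb->ki_list);
                ...
                spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags);
        }
        ...
}
```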

    Reported-by: syzbot
    Cc: Christoph Hellwig
    Cc: Avi Kivity
    Cc: Miklos Szeredi
    Cc:
    Fixes: e8693bcfa0b4 ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19
    Signed-off-by: Miklos Szeredi
    [ bvanassche: added a comment ]
    Reluctantly-Acked-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

20 Dec, 2018

1 commit

  • commit a538e3ff9dabcdf6c3f477a373c629213d1c3066 upstream.

    Matthew pointed out that the ioctx_table is susceptible to spectre v1,
    because the index can be controlled by an attacker. The below patch
    should mitigate the attack for all of the aio system calls.

    Cc: stable@vger.kernel.org
    Reported-by: Matthew Wilcox
    Reported-by: Dan Carpenter
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     

17 Dec, 2018

1 commit

  • [ Upstream commit 53fffe29a9e664a999dd3787e4428da8c30533e0 ]

    If the ioprio capability check fails, we return without putting
    the file pointer.

    Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
    Signed-off-by: Jens Axboe
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Jens Axboe
     

14 Aug, 2018

2 commits

  • Pull vfs aio updates from Al Viro:
    "Christoph's aio poll, saner this time around.

    This time it's pretty much local to fs/aio.c. Hopefully race-free..."

    * 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    aio: allow direct aio poll comletions for keyed wakeups
    aio: implement IOCB_CMD_POLL
    aio: add a iocb refcount
    timerfd: add support for keyed wakeups

    Linus Torvalds
     
  • Pull vfs open-related updates from Al Viro:

    - "do we need fput() or put_filp()" rules are gone - it's always fput()
    now. We keep track of that state where it belongs - in ->f_mode.

    - int *opened mess killed - in finish_open(), in ->atomic_open()
    instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

    - alloc_file() wrappers with saner calling conventions are introduced
    (alloc_file_clone() and alloc_file_pseudo()); callers converted, with
    much simplification.

    - while we are at it, saner calling conventions for path_init() and
    link_path_walk(), simplifying things inside fs/namei.c (both on
    open-related paths and elsewhere).

    * 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    few more cleanups of link_path_walk() callers
    allow link_path_walk() to take ERR_PTR()
    make path_init() unconditionally paired with terminate_walk()
    document alloc_file() changes
    make alloc_file() static
    do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
    new helper: alloc_file_clone()
    create_pipe_files(): switch the first allocation to alloc_file_pseudo()
    anon_inode_getfile(): switch to alloc_file_pseudo()
    hugetlb_file_setup(): switch to alloc_file_pseudo()
    ocxlflash_getfile(): switch to alloc_file_pseudo()
    cxl_getfile(): switch to alloc_file_pseudo()
    ... and switch shmem_file_setup() to alloc_file_pseudo()
    __shmem_file_setup(): reorder allocations
    new wrapper: alloc_file_pseudo()
    kill FILE_{CREATED,OPENED}
    switch atomic_open() and lookup_open() to returning 0 in all success cases
    document ->atomic_open() changes
    ->atomic_open(): return 0 in all success cases
    get rid of 'opened' in path_openat() and the helpers downstream
    ...

    Linus Torvalds
     

06 Aug, 2018

3 commits

  • If we get a keyed wakeup for an aio poll waitqueue and the wakeup can
    acquire the ctx_lock without spinning, we can just complete the iocb
    straight from the wakeup callback to avoid a context switch.

    Signed-off-by: Christoph Hellwig
    Tested-by: Avi Kivity

    Christoph Hellwig
     
  • Simple one-shot poll through the io_submit() interface. To poll for
    a file descriptor the application should submit an iocb of type
    IOCB_CMD_POLL. It will poll the fd for the events specified in
    the first 32 bits of the aio_buf field of the iocb.

    Unlike poll or epoll without EPOLLONESHOT, this interface always works
    in one-shot mode; that is, once the iocb is completed, it has to be
    resubmitted.

    Signed-off-by: Christoph Hellwig
    Tested-by: Avi Kivity

    Christoph Hellwig
     
  • This is needed to prevent races caused by the way the ->poll API works.
    To avoid introducing overhead for other users of the iocbs we initialize
    it to zero and only do refcount operations if it is non-zero in the
    completion path.

    Signed-off-by: Christoph Hellwig
    Tested-by: Avi Kivity

    Christoph Hellwig
     

23 Jul, 2018

1 commit

  • Pull vfs fixes from Al Viro:
    "Fix several places that screw up cleanups after failures halfway
    through opening a file (one open-coding filp_clone_open() and getting
    it wrong, two misusing alloc_file()). That part is -stable fodder from
    the 'work.open' branch.

    And Christoph's regression fix for uapi breakage in aio series;
    include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
    definition of sigset_t, the reason for doing so in the first place had
    been bogus - there's no need to expose struct __aio_sigset in
    aio_abi.h at all"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    aio: don't expose __aio_sigset in uapi
    ocxlflash_getfile(): fix double-iput() on alloc_file() failures
    cxl_getfile(): fix double-iput() on alloc_file() failures
    drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()

    Linus Torvalds
     

18 Jul, 2018

1 commit

  • glibc uses a different definition of sigset_t than the kernel does,
    and the current version would pull in both. To fix this, just do not
    expose the type at all - this somewhat mirrors pselect(), where we
    do not even have a type for the magic sigmask argument, but just
    use pointer arithmetic.

    Fixes: 7a074e96 ("aio: implement io_pgetevents")
    Signed-off-by: Christoph Hellwig
    Reported-by: Adrian Reber
    Signed-off-by: Al Viro

    Christoph Hellwig
     

12 Jul, 2018

2 commits


29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jun, 2018

1 commit


05 Jun, 2018

1 commit


31 May, 2018

2 commits

  • This is the per-I/O equivalent of the ioprio_set system call.

    When IOCB_FLAG_IOPRIO is set in the iocb aio_flags field, we set the
    newly added kiocb ki_ioprio field to the value of the iocb aio_reqprio field.

    This patch depends on block: add ioprio_check_cap function.

    Signed-off-by: Adam Manzanares
    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Adam Manzanares
     
  • In order to avoid kiocb bloat for per-command I/O priority support,
    rw_hint is converted from an enum to a u16, and a guard is added around
    the ki_hint assignment.

    Signed-off-by: Adam Manzanares
    Signed-off-by: Al Viro

    Adam Manzanares
     

30 May, 2018

6 commits


29 May, 2018

1 commit


26 May, 2018

5 commits

  • If we can acquire ctx_lock without spinning we can just remove our
    iocb from the active_reqs list, and thus complete the iocbs from the
    wakeup context.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Simple one-shot poll through the io_submit() interface. To poll for
    a file descriptor the application should submit an iocb of type
    IOCB_CMD_POLL. It will poll the fd for the events specified in
    the first 32 bits of the aio_buf field of the iocb.

    Unlike poll or epoll without EPOLLONESHOT, this interface always works
    in one-shot mode; that is, once the iocb is completed, it has to be
    resubmitted.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • With the current aio code there is no need for the magic KIOCB_CANCELLED
    value, as a cancelation just kicks the driver to queue the completion
    ASAP, with all actual completion handling done in another thread. Given
    that both the completion path and cancelation take the context lock there
    is no need for magic cmpxchg loops either. If we remove iocbs from the
    active list after calling ->ki_cancel (but with ctx_lock still held), we
    can also rely on the invariant that anything found on the list has a
    ->ki_cancel callback and can be cancelled, further simplifying the code.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • No need to pass the key field to lookup_iocb to compare it with KIOCB_KEY,
    as we can do that right after retrieving it from userspace. Also move the
    KIOCB_KEY definition to aio.c as it is an internal value not used by any
    other place in the kernel.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Christoph Hellwig
     

24 May, 2018

1 commit

  • If io_destroy() gets to cancelling everything that can be cancelled and
    gets to kiocb_cancel() calling the function the driver has left in
    ->ki_cancel, it becomes vulnerable to a race with IO completion. At that
    point req is already taken off the list and aio_complete() does *NOT*
    spin until we (in free_ioctx_users()) release ->ctx_lock. As a result,
    it proceeds to kiocb_free(), freeing req just as it gets passed to
    ->ki_cancel().

    The fix is simple - remove from the list after the call of kiocb_cancel().
    All instances of ->ki_cancel() already have to cope with being called with
    the iocb still on the list - that's what happens in io_cancel(2).

    Cc: stable@kernel.org
    Fixes: 0460fef2a921 "aio: use cancellation list lazily"
    Signed-off-by: Al Viro

    Al Viro
     

22 May, 2018

1 commit

  • kill_ioctx() used to have an explicit RCU delay between removing the
    reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
    At some point that delay had been removed, on the theory that
    percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was
    the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
    by lookup_ioctx(). As a result, we could get ctx freed right under
    lookup_ioctx(). Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit
    RCU grace period when freeing kioctx"); however, that fix is not enough.

    Suppose io_destroy() from one thread races with e.g. io_setup() from another;
    CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
    has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the
    refcount, getting it to 0 and triggering a call of free_ioctx_users(),
    which proceeds to drop the secondary refcount and once that reaches zero
    calls free_ioctx_reqs(). That does
    INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
    queue_rcu_work(system_wq, &ctx->free_rwork);
    and schedules freeing the whole thing after RCU delay.

    In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
    refcount from 0 to 1 and returned the reference to io_setup().

    Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
    freed until after percpu_ref_get(). Sure, we'd increment the counter before
    ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to
    stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it
    has grabbed the reference, ctx is *NOT* going away until it gets around to
    dropping that reference.

    The fix is obvious - use percpu_ref_tryget_live() and treat failure as a
    miss. It's not costlier than what we currently do in the normal case, it's
    safe to call since freeing *is* delayed, and it closes the race window - either
    lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
    won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
    fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
    the object in question at all.
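
    In lookup_ioctx() terms, the fix has roughly this shape (a paraphrased,
    non-compilable sketch, not the verbatim upstream diff):

```c
rcu_read_lock();
table = rcu_dereference(mm->ioctx_table);
...
ctx = rcu_dereference(table->table[id]);
if (ctx && ctx->user_id == ctx_id) {
        /*
         * tryget_live fails once percpu_ref_kill() has marked the ref
         * dead, so a context racing with io_destroy() is treated as a
         * miss instead of being resurrected from refcount zero.
         */
        if (percpu_ref_tryget_live(&ctx->users))
                ret = ctx;
}
rcu_read_unlock();
```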

    Cc: stable@kernel.org
    Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx"
    Signed-off-by: Al Viro

    Al Viro
     

03 May, 2018

7 commits

  • This is the io_getevents equivalent of ppoll/pselect and allows one to
    properly mix signals and aio completions (especially with IOCB_CMD_POLL);
    it atomically executes the following sequence:

    sigset_t origmask;

    pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
    ret = io_getevents(ctx, min_nr, nr, events, timeout);
    pthread_sigmask(SIG_SETMASK, &origmask, NULL);

    Note that unlike many other signal related calls we do not pass a sigmask
    size, as that would get us to 7 arguments, which aren't easily supported
    by the syscall infrastructure. It seems a lot less painful to just add a
    new syscall variant in the unlikely case we're going to increase the
    sigset size.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • Simple workqueue offload for now, but prepared for adding a real aio_fsync
    method if the need arises. Based on an earlier patch from Dave Chinner.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • Don't reference the kiocb structure from the common aio code, and move
    any use of it into helper specific to the read/write path. This is in
    preparation for aio_poll support that wants to use the space for different
    fields.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jeff Moyer
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • If we release the lockdep write protection token before calling into
    ->write_iter and thus never access the file pointer after an -EIOCBQUEUED
    return from ->write_iter or ->read_iter we don't need this extra
    reference.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Instead of handcoded non-null checks always initialize ki_list to an
    empty list and use list_empty / list_empty_careful on it. While we're
    at it also error out on a double call to kiocb_set_cancel_fn instead
    of ignoring it.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jeff Moyer
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • These days we don't treat sync iocbs specially in the aio completion code,
    as they never use it. Remove the old comment and BUG_ON, given that the
    current definition of is_sync_kiocb makes it impossible to hit.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     
  • The page size is in no way related to the aio code, and printing it in
    the (debug) dmesg at every boot serves no purpose.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jeff Moyer
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     

20 Mar, 2018

1 commit


15 Mar, 2018

1 commit

  • While converting ioctx index from a list to a table, db446a08c23d
    ("aio: convert the ioctx list to table lookup v3") missed tagging
    kioctx_table->table[] as an array of RCU pointers and using the
    appropriate RCU accessors. This introduces a small window in the
    lookup path where init and access may race.

    Mark kioctx_table->table[] with __rcu and use the appropriate RCU
    accessors when using the field.
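
    The shape of the change, as a paraphrased sketch (not the verbatim
    upstream diff; locking around the updater is elided):

```c
struct kioctx_table {
        struct rcu_head         rcu;
        unsigned                nr;
        struct kioctx __rcu     *table[];       /* now tagged __rcu */
};

/* readers (e.g. lookup_ioctx()), under rcu_read_lock(): */
table = rcu_dereference(mm->ioctx_table);
ctx = rcu_dereference(table->table[id]);

/* updaters, under the relevant lock: */
rcu_assign_pointer(table->table[i], ctx);
```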

    Signed-off-by: Tejun Heo
    Reported-by: Jann Horn
    Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3")
    Cc: Benjamin LaHaise
    Cc: Linus Torvalds
    Cc: stable@vger.kernel.org # v3.12+

    Tejun Heo