02 Dec, 2019

1 commit

  • syzbot reports:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
    RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
    RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
    Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
    24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
    c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
    RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
    RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
    RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
    R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
    R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
    kthread+0x361/0x430 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
    Modules linked in:
    ---[ end trace f2e1a4307fbe2245 ]---

    which is caused by slab fault injection triggering a failure in
    prepare_creds(). We don't actually need to create a copy of the creds,
    as we're not modifying them; we just need a reference to the current
    task's creds. This avoids the failure case as well, and propagates
    const throughout the stack.
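
    As a minimal sketch, the shape of the change looks roughly like this
    (get_current_cred(), override_creds() and revert_creds() are the stock
    <linux/cred.h> helpers; the ctx field name is illustrative):

        #include <linux/cred.h>

        /* Before: make a private copy, which can fail under fault injection */
        ctx->creds = prepare_creds();           /* may return NULL */

        /* After: take a reference on the current task's (const) creds;
         * get_current_cred() cannot fail */
        ctx->creds = get_current_cred();

        /* worker/sqpoll thread side */
        const struct cred *old_cred = override_creds(ctx->creds);
        /* ... issue the work ... */
        revert_creds(old_cred);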

    Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
    Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe


27 Nov, 2019

3 commits

  • Currently we're using 40 bytes for the io_wq_work structure, and 16 of
    those are taken up by the doubly linked list node. We don't need doubly
    linked lists: we always add to the tail to keep things ordered, and any
    other use case is list traversal with deletion. For the deletion case,
    we can easily support deletion of any node by keeping track of the
    previous entry, as sketched below.

    This shrinks io_wq_work to 32 bytes, and subsequently shrinks io_kiocb
    in io_uring from 216 to 208 bytes.
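
    A minimal sketch of the replacement structure (helper names follow the
    io-wq style; locking and error handling omitted):

        struct io_wq_work_node {
                struct io_wq_work_node *next;
        };

        struct io_wq_work_list {
                struct io_wq_work_node *first;
                struct io_wq_work_node *last;
        };

        /* always append, so the list stays ordered */
        static void wq_list_add_tail(struct io_wq_work_node *node,
                                     struct io_wq_work_list *list)
        {
                node->next = NULL;
                if (!list->first)
                        list->first = node;
                else
                        list->last->next = node;
                list->last = node;
        }

        /* delete during traversal by remembering the previous entry */
        static void wq_node_del(struct io_wq_work_list *list,
                                struct io_wq_work_node *node,
                                struct io_wq_work_node *prev)
        {
                if (node == list->first)
                        list->first = node->next;
                if (node == list->last)
                        list->last = prev;
                if (prev)
                        prev->next = node->next;
                node->next = NULL;
        }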

    Signed-off-by: Jens Axboe

  • There are several things that can go wrong in the current code on NUMA
    systems, especially if not all nodes are online all the time:

    - If the identifiers of the online nodes do not form a single contiguous
      block starting at zero, wq->wqes will be too small, and OOB memory
      accesses will occur e.g. in the loop in io_wq_create().
    - If a node comes online between the call to num_online_nodes() and the
      for_each_node() loop in io_wq_create(), an OOB write will occur.
    - If a node comes online between io_wq_create() and io_wq_enqueue(), a
      lookup is performed for an element that doesn't exist, and an OOB read
      will probably occur.

    Fix it by:

    - using nr_node_ids instead of num_online_nodes() for the allocation
      size; nr_node_ids is calculated by setup_nr_node_ids() to be bigger
      than the highest node ID that could possibly come online at some
      point, even if those nodes' identifiers are not a contiguous block
    - creating workers for all possible nodes, not just all online ones

    This is basically what the normal workqueue code also does, as far as I can
    tell.
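
    A sketch of the resulting allocation and setup (nr_node_ids and
    for_each_node() are the standard NUMA helpers; error handling omitted):

        /* one slot for every possible node ID, not just the online count */
        wq->wqes = kcalloc(nr_node_ids, sizeof(struct io_wqe *), GFP_KERNEL);
        if (!wq->wqes)
                return ERR_PTR(-ENOMEM);

        for_each_node(node) {   /* every possible node, online or not */
                wq->wqes[node] = kzalloc_node(sizeof(struct io_wqe),
                                              GFP_KERNEL, node);
                /* ... per-node init ... */
        }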

    Signed-off-by: Jann Horn
    Signed-off-by: Jens Axboe

  • These allocations are single-element allocations, so don't use the array
    allocation wrapper for them.
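
    Illustratively, the kind of change involved (a sketch, not the exact
    hunks):

        -       wq = kcalloc(1, sizeof(*wq), GFP_KERNEL);
        +       wq = kzalloc(sizeof(*wq), GFP_KERNEL);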

    Signed-off-by: Jann Horn
    Signed-off-by: Jens Axboe


26 Nov, 2019

5 commits

  • If we don't inherit the original task creds, then we can confuse users
    like fuse that pass creds in the request header. See the link below for
    the identical aio issue.

    Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
    Signed-off-by: Jens Axboe

  • We currently pass in 4 arguments outside of the bounded size. In
    preparation for adding one more argument, let's bundle them up in
    a struct to make it more readable.
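
    A sketch of the bundling, hedged to the arguments io_wq_create() takes
    at this point (mm, user, and the get/put work handlers):

        struct io_wq_data {
                struct mm_struct *mm;
                struct user_struct *user;
                get_work_fn *get_work;
                put_work_fn *put_work;
        };

        struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data);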

    No functional changes in this patch.

    Signed-off-by: Jens Axboe

  • When we find new work to process within the work handler, we queue the
    linked timeout before we have issued the new work. This can be
    problematic for very short timeouts, as we have a window where the new
    work isn't visible.

    Allow the work handler to store a callback function for this in the work
    item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
    is set, then io-wq will call the callback when it has set up the new
    work item, as sketched below.
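
    Roughly, both halves of the mechanism (io_link_work_cb is illustrative
    of the caller-side handler):

        /* work handler: stash a setup callback on the new work item */
        work->func = io_link_work_cb;   /* queues the linked timeout, then
                                         * resets ->func to the real handler */
        work->flags |= IO_WQ_WORK_CB;

        /* io-wq: invoke it once the new work item is installed and visible */
        if (work->flags & IO_WQ_WORK_CB)
                work->func(&work);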

    Reported-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

  • These lines are indented an extra space character.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe

  • We currently have a race where, if setup is really slow, io_wq_destroy()
    can be called before we're done setting up. This will cause the caller
    to get stuck waiting for the manager to set things up, even though the
    manager has already exited.

    Fix this by doing a synchronous setup of the manager, as sketched below.
    This also fixes the case where we'd get stuck if worker creation failed.

    In practice this race window was really small, as we already wait for
    the manager to start. Hence someone would have to call io_wq_destroy()
    after the task has started, but before it started the first loop. The
    reported test case forked tons of these, which is why it became an
    issue.
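
    A sketch of the synchronous handshake, using a standard completion:

        init_completion(&wq->done);
        wq->manager = kthread_create(io_wq_manager, wq, "io_wq_manager");
        if (!IS_ERR(wq->manager)) {
                wake_up_process(wq->manager);
                /* manager calls complete(&wq->done) once workers exist,
                 * or flags an error and completes before exiting */
                wait_for_completion(&wq->done);
        }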

    Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com
    Fixes: 771b53d033e8 ("io-wq: small threadpool implementation for io_uring")
    Signed-off-by: Jens Axboe


14 Nov, 2019

4 commits

  • Since we don't iterate these lists anymore after commit:

    e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items")

    we don't need to retain the nulls value we use for them. That means it's
    pretty pointless to wrap the hlist_nulls_head in a structure, so get rid
    of it.

    Signed-off-by: Jens Axboe

  • We have two lists for workers in io-wq, a busy and a free list. For
    certain operations we want to browse all workers, and we currently do
    that by browsing the two separate lists. But since these lists are RCU
    protected, we can potentially miss workers if they move between the two
    lists while we're browsing them.

    Add a third list, all_list, that simply holds all workers. A worker is
    added to that list when it starts, and removed when it exits. This makes
    the worker iteration cleaner, too.
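
    A sketch of the third list, using the standard RCU list primitives:

        /* worker startup */
        spin_lock_irq(&wqe->lock);
        list_add_tail_rcu(&worker->all_list, &wqe->all_list);
        spin_unlock_irq(&wqe->lock);

        /* worker exit */
        spin_lock_irq(&wqe->lock);
        list_del_rcu(&worker->all_list);
        spin_unlock_irq(&wqe->lock);

        /* iteration sees every worker, regardless of busy/free moves */
        rcu_read_lock();
        list_for_each_entry_rcu(worker, &wqe->all_list, all_list)
                func(worker);   /* whatever the operation needs */
        rcu_read_unlock();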

    Reported-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Jens Axboe

  • worker->cur_work is currently protected by the lock of the wqe that the
    worker belongs to. When we send a signal to a worker, we need a stable
    view of ->cur_work, so we need to hold that lock. But this doesn't work
    so well, since we have the opposite order potentially on queueing work.
    If POLL_ADD is used with a signalfd, then io_poll_wake() is called with
    the signal lock, and that sometimes needs to insert work items.

    Add a specific worker lock that protects the current work item. Then we
    can guarantee that the task we're sending a signal to is currently
    processing the exact work we think it is, as sketched below.
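
    A sketch of the per-worker lock and how both sides use it:

        struct io_worker {
                /* ... */
                spinlock_t lock;                /* protects ->cur_work */
                struct io_wq_work *cur_work;
        };

        /* worker, when picking up or dropping work */
        spin_lock_irq(&worker->lock);
        worker->cur_work = work;
        spin_unlock_irq(&worker->lock);

        /* canceller: no wqe->lock needed, so no inversion with the
         * signalfd/sighand locking on the queueing side */
        spin_lock_irqsave(&worker->lock, flags);
        if (worker->cur_work == target)
                send_sig(SIGINT, worker->task, 1);
        spin_unlock_irqrestore(&worker->lock, flags);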

    Reported-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Jens Axboe

  • For cancellation, we need to ensure that the work item stays valid for
    as long as ->cur_work is valid. Right now we can't safely dereference
    the work item even under the wqe->lock, because while the ->cur_work
    pointer will remain valid, the work could be completing and be freed
    in parallel.

    Only invoke ->get/put_work() on work items that we know the caller
    queued themselves. Add IO_WQ_WORK_INTERNAL for io-wq's own use, which
    is needed when we're queueing a flush item, for instance.
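
    As a sketch of the distinction (the flag name is from this patch, the
    rest is illustrative):

        /* enqueue: only take a reference for work the caller queued itself */
        if (!(work->flags & IO_WQ_WORK_INTERNAL) && wq->get_work)
                wq->get_work(work);

        /* io-wq internal flush item: marked so it's never handed to
         * ->get/put_work() */
        struct io_wq_work internal_work = { .flags = IO_WQ_WORK_INTERNAL };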

    Signed-off-by: Jens Axboe


08 Nov, 2019

1 commit

  • io_uring supports request types that basically have two different
    lifetimes:

    1) Bounded completion time. These are requests like disk reads or writes,
       which we know will finish in a finite amount of time.
    2) Unbounded completion time. These are generally networked IO, where we
       have no idea how long they will take to complete. Another example is
       POLL commands.

    This patch provides support for io-wq to handle these differently, so we
    don't starve bounded requests by tying up workers for too long. By default
    all work is bounded, unless otherwise specified in the work item.
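
    As a sketch of how a request would opt out of the bounded default
    (IO_WQ_WORK_UNBOUND is the flag this patch adds; the helper is
    hypothetical):

        /* networked IO and POLL have no bounded completion time */
        if (req_takes_unbounded_time(req))      /* hypothetical helper */
                req->work.flags |= IO_WQ_WORK_UNBOUND;

        /* io-wq accounts bounded and unbounded workers separately, so
         * unbounded work can't tie up the bounded worker budget */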

    Signed-off-by: Jens Axboe


01 Nov, 2019

1 commit

  • This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
    cancel requests that have been punted to async context and are now
    in-flight. This works for regular read/write requests to files, as
    long as they haven't been started yet. For socket based IO (or things
    like accept4(2)), we can cancel work that is already running as well.

    To cancel a request, the sqe must have ->addr set to the user_data of
    the request it wishes to cancel. If the request is cancelled
    successfully, the original request is completed with -ECANCELED
    and the cancel request is completed with a result of 0. If the
    request was already running, the original may or may not complete
    in error. The cancel request will complete with -EALREADY for that
    case. And finally, if the request to cancel wasn't found, the cancel
    request is completed with -ENOENT.
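
    From userspace, a raw-SQE sketch of issuing a cancel (no helper
    library assumed):

        #include <string.h>
        #include <linux/io_uring.h>

        static void prep_async_cancel(struct io_uring_sqe *sqe,
                                      __u64 target_user_data)
        {
                memset(sqe, 0, sizeof(*sqe));
                sqe->opcode = IORING_OP_ASYNC_CANCEL;
                /* ->addr holds the user_data of the request to cancel */
                sqe->addr = target_user_data;
        }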

    Signed-off-by: Jens Axboe


30 Oct, 2019

2 commits

  • This is in preparation for adding opcodes that need to add new files
    to a process' file table, for system calls like open(2) or accept4(2).

    If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
    item. If work that needs to get punted to async context has this set,
    the async worker will assume the original task's file table before
    executing the work, as sketched below.

    Note that opcodes that need access to the current files of an
    application cannot be done through IORING_SETUP_SQPOLL.
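
    A sketch of the worker side (the task_lock()-protected files swap is
    the standard pattern; field names illustrative):

        struct files_struct *old_files = NULL;

        if (work->flags & IO_WQ_WORK_NEEDS_FILES) {
                task_lock(current);
                old_files = current->files;
                current->files = work->files;   /* submitter's file table */
                task_unlock(current);
        }

        /* ... execute the work, then restore old_files the same way ... */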

    Signed-off-by: Jens Axboe

  • This adds support for io-wq, a smaller and specialized thread pool
    implementation. This is meant to replace workqueues for io_uring. Among
    the reasons for this addition are:

    - We can assign memory context more smartly and persistently if we
      manage the lifetime of the threads.

    - We can drop various work-arounds we have in io_uring, like the
      async_list.

    - We can implement hashed work insertion, to manage concurrency of
      buffered writes without needing a) an extra workqueue, or b)
      needlessly making the concurrency of said workqueue very low, which
      hurts performance of multiple buffered file writers.

    - We can implement cancel through signals, for cancelling
      interruptible work like read/write (or send/recv) to/from sockets.

    - We need the above cancel for being able to assign and use file tables
      from a process.

    - We can implement a more thorough cancel operation in general.

    - We need it to move towards a syslet/threadlet model for even faster
      async execution. For that we need to take ownership of the used
      threads.

    This list is just off the top of my head. Performance should be the
    same, or better, at least that's what I've seen in my testing. io-wq
    supports basic NUMA functionality, setting up a pool per node.

    io-wq hooks into the scheduler's schedule in/out events, just like the
    workqueue code does, and uses that to drive the need for more or fewer
    workers.

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe
