24 Mar, 2020

1 commit

  • We always punt async buffered writes to an io-wq helper, as the core
    kernel does not have IOCB_NOWAIT support for that. Most buffered async
    writes complete very quickly, as it's just a copy operation. This means
    that doing multiple locking roundtrips on the shared wqe lock for each
    buffered write is wasteful. Additionally, buffered writes are hashed
    work items, which means that any buffered write to a given file is
    serialized.

    Keep identically hashed work items contiguous in @wqe->work_list, and
    track a tail for each hash bucket. On dequeue of a hashed item, splice
    all work items with the same hash in one go using the tracked tail.
    Until the batch is done, the caller doesn't have to synchronize with
    the wqe or worker locks again.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
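
    A rough sketch of the batching idea above, assuming a simplified
    singly linked work list where every item is hashed (illustrative only,
    not the kernel implementation):

      #define NR_HASH_BUCKETS 64

      struct work {
              struct work *next;
              unsigned int hash;      /* bucket index, < NR_HASH_BUCKETS */
      };

      struct work_list {
              struct work *head;
              struct work *last;                       /* overall tail */
              struct work *hash_tail[NR_HASH_BUCKETS]; /* tail per hash run */
      };

      /* Enqueue keeps identically hashed items contiguous by inserting the
       * new item right behind the current tail of its hash run. */
      static void enqueue_hashed(struct work_list *wl, struct work *w)
      {
              struct work *tail = wl->hash_tail[w->hash];

              wl->hash_tail[w->hash] = w;
              if (!tail) {
                      /* first item with this hash: append at the end */
                      w->next = NULL;
                      if (wl->last)
                              wl->last->next = w;
                      else
                              wl->head = w;
                      wl->last = w;
                      return;
              }
              w->next = tail->next;
              tail->next = w;
              if (wl->last == tail)
                      wl->last = w;
      }

      /* On dequeue, the tracked tail lets the worker splice out the whole
       * contiguous run of the head's hash under a single lock. */
      static struct work *dequeue_hashed_run(struct work_list *wl)
      {
              struct work *first = wl->head;
              struct work *last;

              if (!first)
                      return NULL;
              last = wl->hash_tail[first->hash];
              wl->head = last->next;
              if (wl->last == last)
                      wl->last = NULL;
              last->next = NULL;
              wl->hash_tail[first->hash] = NULL;
              return first;   /* caller walks first..last without re-locking */
      }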
     

23 Mar, 2020

1 commit

  • After io_assign_current_work() assigns a linked work, it may be
    decided to offload it to another thread via io_wqe_enqueue(). However,
    until the next io_assign_current_work() it can be cancelled, and that
    isn't handled.

    Don't assign it if it's not going to be executed.

    Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

15 Mar, 2020

3 commits

  • Enable io-wq hashing for dependent work simply by re-enqueueing such
    requests.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • This is a preparation patch removing io_wq_enqueue_hashed(); its job
    is now done by io_wq_hash_work() + io_wq_enqueue().

    Also, set the hash value for dependent work, and do it as late as
    possible, because req->file can be unavailable before that point. This
    hash will be ignored by io-wq.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
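
    A hedged usage sketch of the new interface; the helper below is
    illustrative (in the real code the hash is set while preparing the
    async work), but io_wq_hash_work() and io_wq_enqueue() are the actual
    io-wq entry points:

      /* Hash buffered writes by their target file so writes to the same
       * file stay serialised, then enqueue as usual; io-wq picks the hash
       * up from the work item itself. */
      static void queue_hashed_write(struct io_wq *wq, struct io_wq_work *work,
                                     struct file *file)
      {
              if (file)
                      io_wq_hash_work(work, file);
              io_wq_enqueue(wq, work);
      }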
     
  • This little tweak restores the behaviour that existed before the
    recent io_worker_handle_work() optimisation patches. It makes the
    function do cond_resched() and flush_signals() only if there is actual
    work to execute.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

05 Mar, 2020

4 commits

  • First, this changes the io-wq interfaces. It replaces {get,put}_work()
    with free_work(), which is guaranteed to be called exactly once. It
    also enforces the free_work() callback to be non-NULL.

    io_uring follows the changes, and instead of putting a submission
    reference in io_put_req_async_completion(), it is now done in
    io_free_work(). As io_get_work() with its corresponding refcount_inc()
    is removed, the reference balance is maintained.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
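
    A hedged sketch of the reshaped interface; the field and helper names
    are approximate rather than copied from the kernel source:

      typedef void (free_work_fn)(struct io_wq_work *);

      struct io_wq_data {
              struct user_struct *user;
              free_work_fn *free_work;        /* enforced to be non-NULL */
      };

      /* io_uring's callback: called exactly once per work item, so the
       * submission reference is dropped here rather than in
       * io_put_req_async_completion(). */
      static void io_free_work(struct io_wq_work *work)
      {
              struct io_kiocb *req = container_of(work, struct io_kiocb, work);

              io_put_req(req);
      }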
     
  • When executing non-linked hashed work, io_worker_handle_work()
    will lock-unlock wqe->lock to update the hash, and then immediately
    lock-unlock again to get the next work item. Optimise this case by
    doing the lock/unlock only once.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • There are two optimisations:
    - Currently, io_worker_handle_work() does io_assign_current_work()
    twice per request, and each call adds a lock/unlock(worker->lock)
    pair. The first resets worker->cur_work to NULL, and the second sets
    the real work shortly after. If there is dependent work, set it
    immediately, which effectively removes the extra NULL'ing.

    - There is no use in taking wqe->lock for linked work, as it is not
    hashed now. Optimise it out.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • This is a preparation patch: it adds some helpers and makes the next
    patches cleaner.

    - extract io_impersonate_work() and io_assign_current_work()
    - replace @next label with nested do-while
    - move put_work() right after NULL'ing cur_work.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

03 Mar, 2020

4 commits

  • @hash_map is unsigned long, but BIT_ULL() is used for manipulations.
    BIT() is a better match, as it returns exactly an unsigned long value.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
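
    For reference, the two macros differ only in the type they produce;
    a small sketch of the point being made:

      #include <linux/bits.h>

      /* BIT(nr) is (1UL << (nr)) and BIT_ULL(nr) is (1ULL << (nr)), so
       * BIT() matches an unsigned long bitmap exactly. */
      static void mark_hash_busy(unsigned long *hash_map, unsigned int bucket)
      {
              *hash_map |= BIT(bucket);       /* previously BIT_ULL(bucket) */
      }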
     
  • IO_WQ_WORK_CB is used only for linked timeouts, which are armed before
    the work setup (i.e. mm, creds override, etc). The setup shouldn't
    take long, so it's ok to arm the timeout a bit later and get rid of
    IO_WQ_WORK_CB.

    Make io-wq call work->func() only once; callbacks will handle the
    rest, i.e. the linked timeout handler will do the actual issue. As a
    bonus, this removes an extra indirect call.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • IO_WQ_WORK_HAS_MM is set but never used, remove it.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • io_wq_flush() is buggy: during cancellation of a flush, the associated
    work may be passed to the caller's (i.e. io_uring's) @match callback,
    which expects it to be embedded in struct io_kiocb. Cancellation of
    internal work probably doesn't make a lot of sense to begin with.

    As the flush helper is no longer used, just delete it and the
    associated work flag.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

25 Feb, 2020

1 commit

  • Andres reports that buffered IO seems to suck up more cycles than we
    would like, and he narrowed it down to the fact that the io-wq workers
    will briefly spin for more work on completion of a work item. This was
    a win on the networking side, but apparently some other cases take a
    hit because of it. Remove the optimization to avoid burning more CPU
    than we have to for disk IO.

    Reported-by: Andres Freund
    Signed-off-by: Jens Axboe

    Jens Axboe
     

13 Feb, 2020

1 commit

  • Glauber reports a crash on init on a box he has:

    RIP: 0010:__alloc_pages_nodemask+0x132/0x340
    Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
    RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
    RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
    R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
    R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
    FS: 00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
    alloc_slab_page+0x46/0x320
    new_slab+0x9d/0x4e0
    ___slab_alloc+0x507/0x6a0
    ? io_wq_create+0xb4/0x2a0
    __slab_alloc+0x1c/0x30
    kmem_cache_alloc_node_trace+0xa6/0x260
    io_wq_create+0xb4/0x2a0
    io_uring_setup+0x97f/0xaa0
    ? io_remove_personalities+0x30/0x30
    ? io_poll_trigger_evfd+0x30/0x30
    do_syscall_64+0x5b/0x1c0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f4d116cb1ed

    which is due to the 'wqe' and 'worker' allocation being node affine.
    But it isn't valid to call the node affine allocation if the node isn't
    online.

    Set up structures even for offline nodes, as usual, but skip them for
    thread setup so we don't waste resources. If the node isn't online,
    just allocate the memory with NUMA_NO_NODE.

    Reported-by: Glauber Costa
    Tested-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Jens Axboe
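
    A minimal sketch of the approach, using an illustrative per-node
    allocation helper rather than the exact kernel code:

      #include <linux/numa.h>
      #include <linux/nodemask.h>
      #include <linux/slab.h>

      static void *alloc_wqe_for_node(size_t size, int node)
      {
              /* Offline nodes still get a structure, just without node
               * affinity; thread setup is skipped for them separately. */
              int alloc_node = node_online(node) ? node : NUMA_NO_NODE;

              return kzalloc_node(size, GFP_KERNEL, alloc_node);
      }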
     

30 Jan, 2020

1 commit

  • We're not consistent in how the file table is grabbed and assigned if
    we have a linked command that requires the use of it.

    Add ->file_table to the io_op_defs[] array, and use that to determine
    when to grab the table instead of having the handlers set it if they
    need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
    flag. We always initialize work->files, so io-wq can just check for
    that.

    Signed-off-by: Jens Axboe

    Jens Axboe
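
    A hedged sketch of the table-driven approach; the field layout and the
    opcodes shown are illustrative, not an exact copy of io_op_defs[]:

      #include <uapi/linux/io_uring.h>

      struct io_op_def {
              unsigned        needs_file : 1;
              unsigned        file_table : 1; /* grab current->files up front */
      };

      static const struct io_op_def io_op_defs[] = {
              [IORING_OP_ACCEPT] = {
                      .needs_file     = 1,
                      .file_table     = 1,    /* may install a new fd */
              },
              [IORING_OP_OPENAT] = {
                      .file_table     = 1,
              },
      };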
     

29 Jan, 2020

2 commits

  • Export a helper to attach to an existing io-wq, rather than setting up
    a new one. This is doable now that we have reference counted io_wq's.

    Reported-by: Jens Axboe
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • We currently set up the io_wq with a static set of mm and creds. Even
    for a single-use io-wq per io_uring, this is suboptimal, as we may
    have multiple enters of the ring. For sharing the io-wq backend, it
    doesn't work at all.

    Switch to passing in the creds and mm when the work item is setup. This
    means that async work is no longer deferred to the io_uring mm and creds,
    it is done with the current mm and creds.

    Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know
    they can rely on the current personality (mm and creds) being the same
    for direct issue and async issue.

    Reviewed-by: Stefan Metzmacher
    Signed-off-by: Jens Axboe

    Jens Axboe
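
    From userspace, the behaviour can be detected through the features
    field filled in by io_uring_setup(); a small hedged example:

      #include <linux/io_uring.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      /* Returns 1 if async work runs with the submitter's current mm and
       * creds, 0 if not, -1 if ring setup failed. */
      static int ring_has_cur_personality(unsigned entries)
      {
              struct io_uring_params p;
              int fd, ret;

              memset(&p, 0, sizeof(p));
              fd = syscall(__NR_io_uring_setup, entries, &p);
              if (fd < 0)
                      return -1;
              ret = (p.features & IORING_FEAT_CUR_PERSONALITY) != 0;
              close(fd);
              return ret;
      }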
     

21 Jan, 2020

2 commits

  • io-wq assumes that work will complete fast (and not block), so it
    doesn't create a new worker when work is enqueued, if we already have
    at least one worker running. This is done on the assumption that if work
    is running, then it will complete fast.

    Add an option to force io-wq to fork a new worker for queued work.
    This is signaled by setting IO_WQ_WORK_CONCURRENT on the work item.
    In that case, io-wq will create a new worker, even though workers are
    already running.

    Signed-off-by: Jens Axboe

    Jens Axboe
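
    Usage amounts to setting the flag before the work is queued; a short
    hedged sketch (the wrapper is illustrative):

      static void queue_concurrent(struct io_wq *wq, struct io_wq_work *work)
      {
              /* force io-wq to fork a new worker even if one is running */
              work->flags |= IO_WQ_WORK_CONCURRENT;
              io_wq_enqueue(wq, work);
      }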
     
  • Not all work can be cancelled, some of it we may need to guarantee
    that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
    on work that must not be cancelled. Note that the caller work function
    must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
    IO_WQ_WORK_CANCEL.

    Signed-off-by: Jens Axboe

    Jens Axboe
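
    A hedged sketch of the contract a work handler has to follow; the
    handler signature is simplified and the completion helpers are
    hypothetical:

      static void example_work_handler(struct io_wq_work *work)
      {
              if ((work->flags & IO_WQ_WORK_CANCEL) &&
                  !(work->flags & IO_WQ_WORK_NO_CANCEL)) {
                      /* cancellable work: fail it without running it */
                      example_fail_work(work, -ECANCELED);
                      return;
              }
              /* NO_CANCEL work runs to completion even if CANCEL is set */
              example_run_work(work);
      }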
     

23 Dec, 2019

1 commit

  • Commit e61df66c69b1 ("io-wq: ensure free/busy list browsing see all
    items") added a list for io workers in addition to the free and busy
    lists. This made the worker walk cleaner, but left the busy list
    unused. Let's remove it.

    Signed-off-by: Hillf Danton
    Signed-off-by: Jens Axboe

    Hillf Danton
     

02 Dec, 2019

1 commit

  • syzbot reports:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
    RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
    RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
    Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
    24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84
    c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
    RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
    RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
    RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
    R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
    R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
    kthread+0x361/0x430 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
    Modules linked in:
    ---[ end trace f2e1a4307fbe2245 ]---
    RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
    RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
    RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
    Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
    24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84
    c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
    RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
    RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
    RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
    R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
    R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    which is caused by slab fault injection triggering a failure in
    prepare_creds(). We don't actually need to create a copy of the creds,
    as we're not modifying them; we just need a reference on the current
    task's creds. This avoids the failure case as well, and propagates the
    const throughout the stack.

    Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
    Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe

    Jens Axboe
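
    A minimal sketch of the approach: take a reference on the current
    task's creds, which cannot fail, instead of allocating a copy with
    prepare_creds():

      #include <linux/cred.h>

      static const struct cred *grab_submitter_creds(void)
      {
              /* get_current_cred() only bumps a refcount, so unlike
               * prepare_creds() it cannot fail; drop with put_cred(). */
              return get_current_cred();
      }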
     

27 Nov, 2019

3 commits

  • Currently we're using 40 bytes for the io_wq_work structure, and 16 of
    those are the doubly linked list node. We don't need doubly linked
    lists: we always add to the tail to keep things ordered, and any other
    use case is list traversal with deletion. For the deletion case, we
    can easily support deletion of any node by keeping track of the
    previous entry.

    This shrinks io_wq_work to 32 bytes, and subsequently io_kiocb in
    io_uring from 216 to 208 bytes.

    Signed-off-by: Jens Axboe

    Jens Axboe
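
    A minimal sketch of the pattern: with a singly linked list, arbitrary
    deletion still works if the traversal carries the previous node along
    (node and list types simplified):

      struct io_wq_work_node {
              struct io_wq_work_node *next;
      };

      struct io_wq_work_list {
              struct io_wq_work_node *first;
              struct io_wq_work_node *last;
      };

      static void wq_node_del(struct io_wq_work_list *list,
                              struct io_wq_work_node *node,
                              struct io_wq_work_node *prev)
      {
              if (node == list->first)
                      list->first = node->next;
              else
                      prev->next = node->next;
              if (node == list->last)
                      list->last = prev;
              node->next = NULL;
      }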
     
  • There are several things that can go wrong in the current code on NUMA
    systems, especially if not all nodes are online all the time:

    - If the identifiers of the online nodes do not form a single contiguous
    block starting at zero, wq->wqes will be too small, and OOB memory
    accesses will occur e.g. in the loop in io_wq_create().
    - If a node comes online between the call to num_online_nodes() and the
    for_each_node() loop in io_wq_create(), an OOB write will occur.
    - If a node comes online between io_wq_create() and io_wq_enqueue(), a
    lookup is performed for an element that doesn't exist, and an OOB read
    will probably occur.

    Fix it by:

    - using nr_node_ids instead of num_online_nodes() for the allocation size;
    nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the
    highest node ID that could possibly come online at some point, even if
    those nodes' identifiers are not a contiguous block
    - creating workers for all possible CPUs, not just all online ones

    This is basically what the normal workqueue code also does, as far as I can
    tell.

    Signed-off-by: Jann Horn
    Signed-off-by: Jens Axboe

    Jann Horn
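
    A hedged sketch of the sizing change (surrounding setup and error
    handling omitted; the helper is illustrative):

      #include <linux/nodemask.h>
      #include <linux/slab.h>

      struct io_wqe;

      static struct io_wqe **alloc_wqe_array(void)
      {
              /* nr_node_ids is an upper bound on node IDs, including nodes
               * that may come online later; num_online_nodes() is just a
               * count and can be smaller than the highest online node ID. */
              return kcalloc(nr_node_ids, sizeof(struct io_wqe *), GFP_KERNEL);
      }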
     
  • These allocations are single-element allocations, so don't use the array
    allocation wrapper for them.

    Signed-off-by: Jann Horn
    Signed-off-by: Jens Axboe

    Jann Horn
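
    The point in miniature (an illustrative before/after fragment, not the
    exact call sites):

      /* before: array wrapper for a single element */
      wq = kcalloc(1, sizeof(*wq), GFP_KERNEL);
      /* after: plain single-element allocation */
      wq = kzalloc(sizeof(*wq), GFP_KERNEL);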
     

26 Nov, 2019

4 commits

  • If we don't inherit the original task creds, then we can confuse users
    like fuse that pass creds in the request header. See link below on
    identical aio issue.

    Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We currently pass in 4 arguments outside of the bounded size. In
    preparation for adding one more argument, let's bundle them up in
    a struct to make it more readable.

    No functional changes in this patch.

    Signed-off-by: Jens Axboe

    Jens Axboe
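
    A hedged sketch of the bundling; the field names approximate the
    interface at the time and are not an exact copy:

      typedef void (get_work_fn)(struct io_wq_work *);
      typedef void (put_work_fn)(struct io_wq_work *);

      struct io_wq_data {
              struct mm_struct *mm;
              struct user_struct *user;
              get_work_fn *get_work;
              put_work_fn *put_work;
      };

      /* only the bounded worker count stays a separate argument */
      struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data);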
     
  • When we find new work to process within the work handler, we queue the
    linked timeout before we have issued the new work. This can be
    problematic for very short timeouts, as we have a window where the new
    work isn't visible.

    Allow the work handler to store a callback function for this in the
    work item, and flag it with IO_WQ_WORK_CB if the caller has done so.
    If that is set, then io-wq will call the callback when it has set up
    the new work item.

    Reported-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • These lines are indented an extra space character.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe

    Dan Carpenter