24 Mar, 2020
1 commit
-
We always punt async buffered writes to an io-wq helper, as the core
kernel does not have IOCB_NOWAIT support for that. Most buffered async
writes complete very quickly, as it's just a copy operation. This means
that doing multiple locking roundtrips on the shared wqe lock for each
buffered write is wasteful. Additionally, buffered writes are hashed
work items, which means that any buffered write to a given file is
serialized.
Keep identically hashed work items contiguously in @wqe->work_list, and
track a tail for each hash bucket. On dequeue of a hashed item, splice
all of the same hash in one go using the tracked tail. Until the batch
is done, the caller doesn't have to synchronize with the wqe or worker
locks again.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
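The batching described in this commit can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel code: struct work, struct work_queue, NR_BUCKETS, enqueue() and dequeue_batch() are hypothetical names, locking is omitted, and the check for a hash that is already being executed by another worker is left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BUCKETS 8	/* hypothetical bucket count for this sketch */

struct work {
	struct work *next;
	int hash;	/* bucket index, or -1 for unhashed work */
	int id;
};

struct work_queue {
	struct work *first, *last;
	struct work *hash_tail[NR_BUCKETS];	/* last queued item per hash */
};

/* Insert @w after @pos, or at the head of the list if @pos is NULL. */
static void list_add_after(struct work_queue *q, struct work *w,
			   struct work *pos)
{
	if (!pos) {
		w->next = q->first;
		q->first = w;
	} else {
		w->next = pos->next;
		pos->next = w;
	}
	if (!w->next)
		q->last = w;
}

/* Queue @w; hashed work goes right behind the tail of its own bucket. */
static void enqueue(struct work_queue *q, struct work *w)
{
	struct work *tail = NULL;

	if (w->hash >= 0) {
		tail = q->hash_tail[w->hash];
		q->hash_tail[w->hash] = w;
	}
	/* no earlier item with this hash (or unhashed): append at the end */
	list_add_after(q, w, tail ? tail : q->last);
}

/*
 * Detach the head item. If it is hashed, detach every item with the same
 * hash in one go, using the tracked tail. Returns a NULL-terminated batch.
 */
static struct work *dequeue_batch(struct work_queue *q)
{
	struct work *head = q->first, *tail = head;

	if (!head)
		return NULL;
	if (head->hash >= 0) {
		tail = q->hash_tail[head->hash];
		assert(tail);	/* a queued hashed item always has a tail */
		q->hash_tail[head->hash] = NULL;
	}
	q->first = tail->next;
	if (!q->first)
		q->last = NULL;
	tail->next = NULL;
	return head;
}
```

Because items with the same hash are always inserted behind their bucket tail, the run from the list head to that tail contains every queued item of that hash, so one pointer update detaches the whole batch.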
23 Mar, 2020
1 commit
-
After io_assign_current_work() of a linked work, it can be decided to
offload it to another thread by doing io_wqe_enqueue(). However, until
the next io_assign_current_work() it can be cancelled, and that isn't
handled.
Don't assign it if it's not going to be executed.
Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
15 Mar, 2020
3 commits
-
Enable io-wq hashing stuff for dependent works simply by re-enqueueing
such requests.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
It's a preparation patch removing io_wq_enqueue_hashed(), which
now should be done by io_wq_hash_work() + io_wq_enqueue().
Also, set the hash value for dependent works, and do it as late as possible,
because req->file can be unavailable before. This hash will be ignored
by io-wq.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
This little tweak restores the behaviour from before the recent
io_worker_handle_work() optimisation patches. It makes the function do
cond_resched() and flush_signals() only if there is actual work to
execute.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
12 Mar, 2020
1 commit
-
Deduplicate the cancellation parts, as many of them look the same, e.g.
- io_wqe_cancel_cb_work() and io_wqe_cancel_work()
- io_wq_worker_cancel() and io_work_cancel()
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
05 Mar, 2020
4 commits
-
First, it changes the io-wq interfaces. It replaces {get,put}_work() with
free_work(), which is guaranteed to be called exactly once. It also enforces
the free_work() callback to be non-NULL.
io_uring follows the changes and instead of putting a submission reference
in io_put_req_async_completion(), it will be done in io_free_work(). As
this removes io_get_work() with the corresponding refcount_inc(), the ref
balance is maintained.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
When executing non-linked hashed work, io_worker_handle_work()
will lock-unlock wqe->lock to update hash, and then immediately
lock-unlock to get next work. Optimise this case and do
lock/unlock only once.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
There are 2 optimisations:
- Now, io_worker_handle_work() does io_assign_current_work() twice per
request, and each one adds a lock/unlock(worker->lock) pair. The first is
to reset worker->cur_work to NULL, and the second to set a real work
shortly after. If there is a dependent work, set it immediately, which
effectively removes the extra NULL'ing.
- And there is no use in taking wqe->lock for linked works, as they are
not hashed now. Optimise it out.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
This is a preparation patch, it adds some helpers and makes
the next patches cleaner.
- extract io_impersonate_work() and io_assign_current_work()
- replace @next label with nested do-while
- move put_work() right after NULL'ing cur_work.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
03 Mar, 2020
4 commits
-
@hash_map is unsigned long, but BIT_ULL() is used for manipulations.
BIT() is a better match as it returns exactly an unsigned long value.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
before the work setup (i.e. mm, override creds, etc). The setup
shouldn't take long, so it's ok to arm it a bit later and get rid
of IO_WQ_WORK_CB.
Make io-wq call work->func() only once, callbacks will handle the rest,
i.e. the linked timeout handler will do the actual issue. And as a
bonus, it removes an extra indirect call.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
IO_WQ_WORK_HAS_MM is set but never used, remove it.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
io_wq_flush() is buggy: during cancelation of a flush, the associated
work may be passed to the caller's (i.e. io_uring) @match callback. That
callback is expecting it to be embedded in struct io_kiocb. Cancelation
of internal work probably doesn't make a lot of sense to begin with.
As the flush helper is no longer used, just delete it and the associated
work flag.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
02 Mar, 2020
1 commit
-
To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the
callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may
return next work, which will be ignored and lost.
Cancel the whole link.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
25 Feb, 2020
1 commit
-
Andres reports that buffered IO seems to suck up more cycles than we
would like, and he narrowed it down to the fact that the io-wq workers
will briefly spin for more work on completion of a work item. This was
a win on the networking side, but apparently some other cases take a
hit because of it. Remove the optimization to avoid burning more CPU
than we have to for disk IO.
Reported-by: Andres Freund
Signed-off-by: Jens Axboe
13 Feb, 2020
1 commit
-
Glauber reports a crash on init on a box he has:
RIP: 0010:__alloc_pages_nodemask+0x132/0x340
Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
FS: 00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
alloc_slab_page+0x46/0x320
new_slab+0x9d/0x4e0
___slab_alloc+0x507/0x6a0
? io_wq_create+0xb4/0x2a0
__slab_alloc+0x1c/0x30
kmem_cache_alloc_node_trace+0xa6/0x260
io_wq_create+0xb4/0x2a0
io_uring_setup+0x97f/0xaa0
? io_remove_personalities+0x30/0x30
? io_poll_trigger_evfd+0x30/0x30
do_syscall_64+0x5b/0x1c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4d116cb1ed
which is due to the 'wqe' and 'worker' allocation being node affine.
But it isn't valid to call the node affine allocation if the node isn't
online.
Setup structures for even offline nodes, as usual, but skip them in
terms of thread setup to not waste resources. If the node isn't online,
just alloc memory with NUMA_NO_NODE.
Reported-by: Glauber Costa
Tested-by: Glauber Costa
Signed-off-by: Jens Axboe
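The fallback described in this commit can be modelled in userspace as follows. This is only a sketch under assumptions: the online[] table stands in for the kernel's node_online(), alloc_on_node() stands in for a node-affine allocator, and struct wqe, NR_NODES and wqe_create() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NUMA_NO_NODE	(-1)	/* "no node preference", as in the kernel */
#define NR_NODES	4	/* hypothetical number of possible nodes */

/* Hypothetical stand-in for node_online(): nodes 1 and 3 are offline. */
static const bool online[NR_NODES] = { true, false, true, false };

struct wqe {
	int node;		/* node this structure serves */
	int alloc_node;		/* node id that was passed to the allocator */
	bool has_threads;	/* whether worker threads get set up */
};

/* Stand-in for a node-affine allocation; offline nodes are not valid here. */
static struct wqe *alloc_on_node(size_t size, int node)
{
	assert(node == NUMA_NO_NODE || online[node]);
	return calloc(1, size);
}

/* Set up the per-node structure even for offline nodes, without threads. */
static struct wqe *wqe_create(int node)
{
	int alloc_node = online[node] ? node : NUMA_NO_NODE;
	struct wqe *wqe = alloc_on_node(sizeof(*wqe), alloc_node);

	if (!wqe)
		return NULL;
	wqe->node = node;
	wqe->alloc_node = alloc_node;
	wqe->has_threads = online[node];
	return wqe;
}
```

Every possible node still gets its structure, so later lookups by node id stay valid, but only online nodes are handed to the node-affine allocator or given threads.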
10 Feb, 2020
2 commits
-
Add a helper that allows the caller to cancel work based on what mm
it belongs to. This allows io_uring to cancel work from a given
task or thread when it exits.
Signed-off-by: Jens Axboe
-
We want to use the cancel functionality for canceling based on not
just the work itself. Instead of matching on the work address
manually, allow a match handler to tell us if we found the right work
item or not.
No functional changes in this patch.
Signed-off-by: Jens Axboe
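The idea of cancelling through a match handler, rather than comparing work addresses directly, can be sketched like this. The names are assumptions for illustration only: struct work, its owner field, match_fn, match_addr(), match_owner() and cancel_pending() are hypothetical and do not mirror the io-wq signatures.

```c
#include <stdbool.h>
#include <stddef.h>

struct work {
	struct work *next;
	int owner;	/* hypothetical owner id, e.g. standing in for an mm */
	bool cancelled;
};

/* The caller decides what "the right work item" means. */
typedef bool (*match_fn)(const struct work *work, void *data);

/* Old behaviour expressed as a handler: match on the work address. */
static bool match_addr(const struct work *work, void *data)
{
	return work == data;
}

/* New possibility: match on some property of the work item. */
static bool match_owner(const struct work *work, void *data)
{
	return work->owner == *(int *)data;
}

/* Unlink and cancel the first pending item accepted by @match. */
static struct work *cancel_pending(struct work **head, match_fn match,
				   void *data)
{
	for (struct work **pp = head; *pp; pp = &(*pp)->next) {
		struct work *w = *pp;

		if (!match(w, data))
			continue;
		*pp = w->next;
		w->next = NULL;
		w->cancelled = true;
		return w;
	}
	return NULL;
}
```

With match_addr() the helper behaves like an address-based cancel, so existing callers see no functional change, while a handler such as match_owner() covers cancel-by-owner cases.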
09 Feb, 2020
1 commit
-
Some work items need this for relative path lookup, make it available
like the other inherited credentials/mm/etc.
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe
30 Jan, 2020
1 commit
-
We're not consistent in how the file table is grabbed and assigned if we
have a command linked that requires the use of it.
Add ->file_table to the io_op_defs[] array, and use that to determine
when to grab the table instead of having the handlers set it if they
need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
flag. We always initialize work->files, so io-wq can just check for
that.
Signed-off-by: Jens Axboe
29 Jan, 2020
2 commits
-
Export a helper to attach to an existing io-wq, rather than setting up
a new one. This is doable now that we have reference counted io_wq's.
Reported-by: Jens Axboe
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
We currently setup the io_wq with a static set of mm and creds. Even for
a single-use io-wq per io_uring, this is suboptimal as we may have
multiple enters of the ring. For sharing the io-wq backend, it doesn't
work at all.
Switch to passing in the creds and mm when the work item is setup. This
means that async work is no longer deferred to the io_uring mm and creds,
it is done with the current mm and creds.
Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know
they can rely on the current personality (mm and creds) being the same
for direct issue and async issue.
Reviewed-by: Stefan Metzmacher
Signed-off-by: Jens Axboe
28 Jan, 2020
1 commit
-
In preparation for sharing an io-wq across different users, add a
reference count that manages destruction of it.
Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe
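A reference count that manages destruction can be sketched with C11 atomics. This is an illustrative userspace model, not the kernel's refcount_t code: struct wq, wq_get() and wq_put() are hypothetical names, and the destroyed flag only stands in for the real teardown.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct wq {
	atomic_int refs;	/* number of users sharing this queue */
	bool destroyed;		/* stands in for the real teardown */
};

/* Take a reference, but never revive a queue whose count already hit 0. */
static bool wq_get(struct wq *wq)
{
	int old = atomic_load(&wq->refs);

	while (old != 0) {
		/* on failure, @old is reloaded with the current value */
		if (atomic_compare_exchange_weak(&wq->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last user tears the queue down. */
static bool wq_put(struct wq *wq)
{
	if (atomic_fetch_sub(&wq->refs, 1) == 1) {
		wq->destroyed = true;
		return true;
	}
	return false;
}
```

The get side refuses to take a reference once the count has reached zero, so a late attacher cannot race with the final put and end up using a queue that is being destroyed.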
21 Jan, 2020
2 commits
-
io-wq assumes that work will complete fast (and not block), so it
doesn't create a new worker when work is enqueued, if we already have
at least one worker running. This is done on the assumption that if work
is running, then it will complete fast.
Add an option to force io-wq to fork a new worker for work queued. This
is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that
case, io-wq will create a new worker, even though workers are already
running.
Signed-off-by: Jens Axboe
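The enqueue-time decision described here boils down to a small predicate. The sketch below is an assumption-laden model: WORK_CONCURRENT, struct pool and need_new_worker() are hypothetical names, the flag's bit value is made up, and any cap on the number of workers is ignored.

```c
#include <stdbool.h>

/* Hypothetical flag bit for this sketch, not the kernel's value. */
#define WORK_CONCURRENT	(1u << 0)

struct pool {
	int nr_running;	/* workers currently executing work */
};

/*
 * Default policy: assume running work completes fast, so only create a
 * worker when nobody is running. The flag overrides that assumption.
 */
static bool need_new_worker(const struct pool *pool, unsigned int flags)
{
	if (flags & WORK_CONCURRENT)
		return true;
	return pool->nr_running == 0;
}
```

Work that may block for a long time would set the flag, so that it does not hold up items queued behind it on the only running worker.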
-
Not all work can be cancelled, some of it we may need to guarantee
that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
on work that must not be cancelled. Note that the caller work function
must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
IO_WQ_WORK_CANCEL.
Signed-off-by: Jens Axboe
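The check that a caller's work function needs to make can be sketched as below. The flag values, struct work and work_func() are hypothetical stand-ins for illustration; only the rule itself (a cancel request is ignored when the no-cancel flag is set) comes from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch, not the kernel's values. */
#define WORK_CANCEL	(1u << 0)
#define WORK_NO_CANCEL	(1u << 1)

struct work {
	unsigned int flags;
	int result;
	bool ran;
};

/* Caller-provided work function: honour cancel unless it is forbidden. */
static void work_func(struct work *work)
{
	if ((work->flags & WORK_CANCEL) && !(work->flags & WORK_NO_CANCEL)) {
		work->result = -ECANCELED;
		return;
	}
	/* the actual operation would be issued here */
	work->ran = true;
	work->result = 0;
}
```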
15 Jan, 2020
1 commit
-
If we require mm and user context, mark the request for cancellation
if we fail to acquire the desired mm.
Signed-off-by: Jens Axboe
25 Dec, 2019
1 commit
-
Reschedule the current IO worker to cut the risk that it is becoming
a cpu hog.
Signed-off-by: Hillf Danton
Signed-off-by: Jens Axboe
23 Dec, 2019
1 commit
-
Commit e61df66c69b1 ("io-wq: ensure free/busy list browsing see all
items") added a list for io workers in addition to the free and busy
lists, not only making worker walk cleaner, but leaving the busy list
unused. Let's remove it.
Signed-off-by: Hillf Danton
Signed-off-by: Jens Axboe
16 Dec, 2019
1 commit
-
- Fix a few typos found while reading the code.
- Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter
was renamed to 'req', but the comment still holds.
Signed-off-by: Brian Gianforcaro
Signed-off-by: Jens Axboe
11 Dec, 2019
2 commits
-
To avoid going to sleep only to get woken shortly thereafter, spin
briefly for new work upon completion of work.
Signed-off-by: Jens Axboe
-
We only have one case of using the waitqueue to wake the worker, the
rest are using wake_up_process(). Since we can save some cycles not
fiddling with the waitqueue in io_wqe_worker(), switch the work activation
to task wakeup and get rid of the now unused wait_queue_head_t in
struct io_worker.
Signed-off-by: Jens Axboe
02 Dec, 2019
1 commit
-
syzbot reports:
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
kthread+0x361/0x430 kernel/kthread.c:255
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Modules linked in:
---[ end trace f2e1a4307fbe2245 ]---
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
which is caused by slab fault injection triggering a failure in
prepare_creds(). We don't actually need to create a copy of the creds
as we're not modifying it, we just need a reference on the current task
creds. This avoids the failure case as well, and propagates the const
throughout the stack.
Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe
27 Nov, 2019
3 commits
-
Currently we're using 40 bytes for the io_wq_work structure, and 16 of
those are the doubly linked list node. We don't need doubly linked lists,
we always add to tail to keep things ordered, and any other use case
is list traversal with deletion. For the deletion case, we can easily
support any node deletion by keeping track of the previous entry.
This shrinks io_wq_work to 32 bytes, and subsequently io_kiocb in
io_uring from 216 to 208 bytes.
Signed-off-by: Jens Axboe
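A singly linked list that still supports deleting any node during traversal, by remembering the previous entry, can be sketched as follows. The type and function names (struct node, struct list, list_add_tail(), list_del(), list_remove()) are assumptions made for this illustration and are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;	/* one pointer instead of next + prev */
};

struct list {
	struct node *first, *last;
};

/* Always add at the tail so that ordering is preserved. */
static void list_add_tail(struct list *l, struct node *n)
{
	n->next = NULL;
	if (!l->first) {
		l->first = l->last = n;
	} else {
		l->last->next = n;
		l->last = n;
	}
}

/* Delete @n, given the entry that precedes it (NULL if @n is first). */
static void list_del(struct list *l, struct node *n, struct node *prev)
{
	if (n == l->first)
		l->first = n->next;
	if (n == l->last)
		l->last = prev;
	if (prev)
		prev->next = n->next;
	n->next = NULL;
}

/* Traversal with deletion: track the previous entry while walking. */
static bool list_remove(struct list *l, struct node *target)
{
	struct node *prev = NULL;

	for (struct node *n = l->first; n; prev = n, n = n->next) {
		if (n != target)
			continue;
		list_del(l, n, prev);
		return true;
	}
	return false;
}
```

The cost is that deletion needs the predecessor, which is available for free when the deletion happens during a traversal, and the saving is one pointer per node.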
-
There are several things that can go wrong in the current code on NUMA
systems, especially if not all nodes are online all the time:
- If the identifiers of the online nodes do not form a single contiguous
block starting at zero, wq->wqes will be too small, and OOB memory
accesses will occur e.g. in the loop in io_wq_create().
- If a node comes online between the call to num_online_nodes() and the
for_each_node() loop in io_wq_create(), an OOB write will occur.
- If a node comes online between io_wq_create() and io_wq_enqueue(), a
lookup is performed for an element that doesn't exist, and an OOB read
will probably occur.
Fix it by:
- using nr_node_ids instead of num_online_nodes() for the allocation size;
nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the
highest node ID that could possibly come online at some point, even if
those nodes' identifiers are not a contiguous block
- creating workers for all possible CPUs, not just all online ones
This is basically what the normal workqueue code also does, as far as I can
tell.
Signed-off-by: Jann Horn
Signed-off-by: Jens Axboe
-
These allocations are single-element allocations, so don't use the array
allocation wrapper for them.
Signed-off-by: Jann Horn
Signed-off-by: Jens Axboe
26 Nov, 2019
4 commits
-
If we don't inherit the original task creds, then we can confuse users
like fuse that pass creds in the request header. See link below on
identical aio issue.
Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
Signed-off-by: Jens Axboe
-
We currently pass in 4 arguments outside of the bounded size. In
preparation for adding one more argument, let's bundle them up in
a struct to make it more readable.
No functional changes in this patch.
Signed-off-by: Jens Axboe
-
When we find new work to process within the work handler, we queue the
linked timeout before we have issued the new work. This can be
problematic for very short timeouts, as we have a window where the new
work isn't visible.
Allow the work handler to store a callback function for this in the work
item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
is set, then io-wq will call the callback when it has setup the new work
item.
Reported-by: Pavel Begunkov
Signed-off-by: Jens Axboe
-
These lines are indented an extra space character.
Signed-off-by: Dan Carpenter
Signed-off-by: Jens Axboe