05 Nov, 2020
1 commit
-
This can't currently happen, but will be possible shortly. Handle missing
files just like we do not being able to grab a needed mm, and mark the
request as needing cancelation.Signed-off-by: Jens Axboe
22 Oct, 2020
1 commit
-
We correctly set io-wq NUMA node affinities when the io-wq context is
setup, but if an entire node CPU set is offlined and then brought back
online, the per node affinities are broken. Ensure that we set them
again whenever a CPU comes online. This ensures that we always track
the right node affinity. The usual cpuhp notifiers are used to drive it.Reported-by: Zhang Qiang
Signed-off-by: Jens Axboe
21 Oct, 2020
1 commit
-
This one was missed in the earlier conversion, should be included like
any of the other IO identity flags. Make sure we restore to RLIM_INIFITY
when dropping the personality again.Fixes: 98447d65b4a7 ("io_uring: move io identity items into separate struct")
Signed-off-by: Jens Axboe
17 Oct, 2020
5 commits
-
Make sure the async io-wq workers inherit the loginuid and sessionid from
the original task, and restore them to unset once we're done with the
async work item.While at it, disable the ability for kernel threads to write to their own
loginuid.Signed-off-by: Jens Axboe
-
io-wq contains a pointer to the identity, which we just hold in io_kiocb
for now. This is in preparation for putting this outside io_kiocb. The
only exception is struct files_struct, which we'll need different rules
for to avoid a circular dependency.No functional changes in this patch.
Signed-off-by: Jens Axboe
-
We solely rely on work->work_flags now, so use that for proper checking
and clearing/dropping of various identity items.Signed-off-by: Jens Axboe
-
We have a number of bits that decide what context to inherit. Set up
io-wq flags for these instead. This is in preparation for always having
the various members set, but not always needing them for all requests.No intended functional changes in this patch.
Signed-off-by: Jens Axboe
-
There was an assumption that kthread_create_on_node() would properly set
NUMA affinities in terms of CPUs allowed, but it doesn't. Make sure we
do this when creating an io-wq context on NUMA.Cc: stable@vger.kernel.org
Stefan Metzmacher
Signed-off-by: Jens Axboe
01 Oct, 2020
5 commits
-
This flag is no longer used, remove it.
Signed-off-by: Jens Axboe
-
The smart syzbot has found a reproducer for the following issue:
==================================================================
BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
check_memory_region_inline mm/kasan/generic.c:186 [inline]
check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
instrument_atomic_write include/linux/instrumented.h:71 [inline]
atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
io_wqe_inc_running fs/io-wq.c:301 [inline]
io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294Allocated by task 7768:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
kmalloc_node include/linux/slab.h:572 [inline]
kzalloc_node include/linux/slab.h:677 [inline]
io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
io_init_wq_offload fs/io_uring.c:7432 [inline]
io_sq_offload_start fs/io_uring.c:7504 [inline]
io_uring_create fs/io_uring.c:8625 [inline]
io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9Freed by task 21:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
__cache_free mm/slab.c:3418 [inline]
kfree+0x10e/0x2b0 mm/slab.c:3756
__io_wq_destroy fs/io-wq.c:1138 [inline]
io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
io_finish_async fs/io_uring.c:6836 [inline]
io_ring_ctx_free fs/io_uring.c:7870 [inline]
io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294The buggy address belongs to the object at ffff8882183db000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 140 bytes inside of
1024-byte region [ffff8882183db000, ffff8882183db400)
The buggy address belongs to the page:
page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
flags: 0x57ffe0000000200(slab)
raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
page dumped because: kasan: bad access detectedMemory state around the buggy address:
ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================which is down to the comment below,
/* all workers gone, wq exit can proceed */
if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
complete(&wqe->wq->done);because there might be multiple cases of wqe in a wq and we would wait
for every worker in every wqe to go home before releasing wq's resources
on destroying.To that end, rework wq's refcount by making it independent of the tracking
of workers because after all they are two different things, and keeping
it balanced when workers come and go. Note the manager kthread, like
other workers, now holds a grab to wq during its lifetime.Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker
and do nothing for exiting wq.Cc: stable@vger.kernel.org # v5.5+
Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
Cc: Pavel Begunkov
Signed-off-by: Hillf Danton
Signed-off-by: Jens Axboe -
There are a few operations that are offloaded to the worker threads. In
this case, we lose process context and end up in kthread context. This
results in ios to be not accounted to the issuing cgroup and
consequently end up as issued by root. Just like others, adopt the
personality of the blkcg too when issuing via the workqueues.For the SQPOLL thread, it will live and attach in the inited cgroup's
context.Signed-off-by: Dennis Zhou
Signed-off-by: Jens Axboe -
During a context switch the scheduler invokes wq_worker_sleeping() with
disabled preemption. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of possible wake up (thus
preempt_enable_no_resched() afterwards).The io-wq has been added to the mix in the same section with disabled
preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also within the schedule() the spinlock_t must be
acquired after tsk_is_pi_blocked() otherwise it will block on the
sleeping lock again while scheduling out.While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.In order to keep the spinlock_t it would have to be moved after the
tsk_is_pi_blocked() check which would introduce a branch instruction
into the hot path.The lock is used to maintain the `work_list' and wakes one task up at
most.
Should io_wqe_cancel_pending_work() cause latency spikes, while
searching for a specific item, then it would need to drop the lock
during iterations.
revert_creds() is also invoked under the lock. According to debug
cred::non_rcu is 0. Otherwise it should be moved outside of the locked
section because put_cred_rcu()->free_uid() acquires a sleeping lock.Convert io_wqe::lock to a raw_spinlock_t.c
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Jens Axboe -
If we don't get and assign the namespace for the async work, then certain
paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
Anything that references the current namespace of the given task should
be assigned for async work on behalf of that task.Cc: stable@vger.kernel.org # v5.5+
Reported-by: Al Viro
Signed-off-by: Jens Axboe
24 Aug, 2020
1 commit
-
Don't forget to update wqe->hash_tail after cancelling a pending work
item, if it was hashed.Cc: stable@vger.kernel.org # 5.7+
Reported-by: Dmitry Shulyak
Fixes: 86f3cd1b589a1 ("io-wq: handle hashed writes in chains")
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
25 Jul, 2020
2 commits
-
Linked requests are hashed, remove a comment stating otherwise. Also
move hash bits to emphasise that we don't carry it through loop
iteration and set it every time.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
RLIMIT_SIZE in needed only for execution from an io-wq context, hence
move all preparations from hot path to io-wq work setup.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
27 Jun, 2020
1 commit
-
It's easier to return next work from ->do_work() than
having an in-out argument. Looks nicer and easier to compile.
Also, merge io_wq_assign_next() into its only user.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
15 Jun, 2020
3 commits
-
If a process is going away, io_uring_flush() will cancel only 1
request with a matching pid. Cancel all of themSigned-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
This adds support for cancelling all io-wq works matching a predicate.
It isn't used yet, so no change in observable behaviour.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Go all over all pending lists and cancel works there, and only then
try to match running requests. No functional changes here, just a
preparation for bulk cancellation.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
12 Jun, 2020
1 commit
-
Pull io_uring fixes from Jens Axboe:
"A few late stragglers in here. In particular:- Validate full range for provided buffers (Bijan)
- Fix bad use of kfree() in buffer registration failure (Denis)
- Don't allow close of ring itself, it's not fully safe. Making it
fully safe would require making the system call more expensive,
which isn't worth it.- Buffer selection fix
- Regression fix for O_NONBLOCK retry
- Make IORING_OP_ACCEPT honor O_NONBLOCK (Jiufei)
- Restrict opcode handling for SQ/IOPOLL (Pavel)
- io-wq work handling cleanups and improvements (Pavel, Xiaoguang)
- IOPOLL race fix (Xiaoguang)"
* tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block:
io_uring: fix io_kiocb.flags modification race in IOPOLL mode
io_uring: check file O_NONBLOCK state for accept
io_uring: avoid unnecessary io_wq_work copy for fast poll feature
io_uring: avoid whole io_wq_work copy for requests completed inline
io_uring: allow O_NONBLOCK async retry
io_wq: add per-wq work handler instead of per work
io_uring: don't arm a timeout through work.func
io_uring: remove custom ->func handlers
io_uring: don't derive close state from ->func
io_uring: use kvfree() in io_sqe_buffer_register()
io_uring: validate the full range of provided buffers for access
io_uring: re-set iov base/len for buffer select retry
io_uring: move send/recv IOPOLL check into prep
io_uring: deduplicate io_openat{,2}_prep()
io_uring: do build_open_how() only once
io_uring: fix {SQ,IO}POLL with unsupported opcodes
io_uring: disallow close of ring itself
11 Jun, 2020
3 commits
-
Some architectures like arm64 and s390 require USER_DS to be set for
kernel threads to access user address space, which is the whole purpose of
kthread_use_mm, but other like x86 don't. That has lead to a huge mess
where some callers are fixed up once they are tested on said
architectures, while others linger around and yet other like io_uring try
to do "clever" optimizations for what usually is just a trivial asignment
to a member in the thread_struct for most architectures.Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the
previous value instead.Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Tested-by: Jens Axboe
Reviewed-by: Jens Axboe
Acked-by: Michael S. Tsirkin
Cc: Alex Deucher
Cc: Al Viro
Cc: Felipe Balbi
Cc: Felix Kuehling
Cc: Jason Wang
Cc: Zhenyu Wang
Cc: Zhi Wang
Cc: Greg Kroah-Hartman
Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de
Signed-off-by: Linus Torvalds -
Switch the function documentation to kerneldoc comments, and add
WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
not have ->mm set (or has ->mm set in the case of unuse_mm).Also give the functions a kthread_ prefix to better document the use case.
[hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
[sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.auSigned-off-by: Christoph Hellwig
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Tested-by: Jens Axboe
Reviewed-by: Jens Axboe
Acked-by: Felix Kuehling
Acked-by: Greg Kroah-Hartman [usb]
Acked-by: Haren Myneni
Cc: Alex Deucher
Cc: Al Viro
Cc: Felipe Balbi
Cc: Jason Wang
Cc: "Michael S. Tsirkin"
Cc: Zhenyu Wang
Cc: Zhi Wang
Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
Signed-off-by: Linus Torvalds -
Patch series "improve use_mm / unuse_mm", v2.
This series improves the use_mm / unuse_mm interface by better documenting
the assumptions, and my taking the set_fs manipulations spread over the
callers into the core API.This patch (of 3):
Use the proper API instead.
Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
These helpers are only for use with kernel threads, and I will tie them
more into the kthread infrastructure going forward. Also move the
prototypes to kthread.h - mmu_context.h was a little weird to start with
as it otherwise contains very low-level MM bits.Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Tested-by: Jens Axboe
Reviewed-by: Jens Axboe
Acked-by: Felix Kuehling
Cc: Alex Deucher
Cc: Al Viro
Cc: Felipe Balbi
Cc: Jason Wang
Cc: "Michael S. Tsirkin"
Cc: Zhenyu Wang
Cc: Zhi Wang
Cc: Greg Kroah-Hartman
Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
Signed-off-by: Linus Torvalds
09 Jun, 2020
1 commit
-
io_uring is the only user of io-wq, and now it uses only io-wq callback
for all its requests, namely io_wq_submit_work(). Instead of storing
work->runner callback in each instance of io_wq_work, keep it in io-wq
itself.pros:
- reduces io_wq_work size
- more robust -- ->func won't be invalidated with mem{cpy,set}(req)
- helps other workSigned-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
04 Apr, 2020
1 commit
-
If the original task is (or has) exited, then the task work will not get
queued properly. Allow for using the io-wq manager task to queue this
work for execution, and ensure that the io-wq manager notices and runs
this work if woken up (or exiting).Reported-by: Dan Melnic
Signed-off-by: Jens Axboe
24 Mar, 2020
1 commit
-
We always punt async buffered writes to an io-wq helper, as the core
kernel does not have IOCB_NOWAIT support for that. Most buffered async
writes complete very quickly, as it's just a copy operation. This means
that doing multiple locking roundtrips on the shared wqe lock for each
buffered write is wasteful. Additionally, buffered writes are hashed
work items, which means that any buffered write to a given file is
serialized.Keep identicaly hashed work items contiguously in @wqe->work_list, and
track a tail for each hash bucket. On dequeue of a hashed item, splice
all of the same hash in one go using the tracked tail. Until the batch
is done, the caller doesn't have to synchronize with the wqe or worker
locks again.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
23 Mar, 2020
1 commit
-
After io_assign_current_work() of a linked work, it can be decided to
offloaded to another thread so doing io_wqe_enqueue(). However, until
next io_assign_current_work() it can be cancelled, that isn't handled.Don't assign it, if it's not going to be executed.
Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
15 Mar, 2020
3 commits
-
Enable io-wq hashing stuff for dependent works simply by re-enqueueing
such requests.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
It's a preparation patch removing io_wq_enqueue_hashed(), which
now should be done by io_wq_hash_work() + io_wq_enqueue().Also, set hash value for dependant works, and do it as late as possible,
because req->file can be unavailable before. This hash will be ignored
by io-wq.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
This little tweak restores the behaviour that was before the recent
io_worker_handle_work() optimisation patches. It makes the function do
cond_resched() and flush_signals() only if there is an actual work to
execute.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
12 Mar, 2020
1 commit
-
Deduplicate cancellation parts, as many of them looks the same, as do
e.g.
- io_wqe_cancel_cb_work() and io_wqe_cancel_work()
- io_wq_worker_cancel() and io_work_cancel()Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
05 Mar, 2020
4 commits
-
First it changes io-wq interfaces. It replaces {get,put}_work() with
free_work(), which guaranteed to be called exactly once. It also enforces
free_work() callback to be non-NULL.io_uring follows the changes and instead of putting a submission reference
in io_put_req_async_completion(), it will be done in io_free_work(). As
removes io_get_work() with corresponding refcount_inc(), the ref balance
is maintained.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
When executing non-linked hashed work, io_worker_handle_work()
will lock-unlock wqe->lock to update hash, and then immediately
lock-unlock to get next work. Optimise this case and do
lock/unlock only once.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
There are 2 optimisations:
- Now, io_worker_handler_work() do io_assign_current_work() twice per
request, and each one adds lock/unlock(worker->lock) pair. The first is
to reset worker->cur_work to NULL, and the second to set a real work
shortly after. If there is a dependant work, set it immediately, that
effectively removes the extra NULL'ing.- And there is no use in taking wqe->lock for linked works, as they are
not hashed now. Optimise it out.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
This is a preparation patch, it adds some helpers and makes
the next patches cleaner.- extract io_impersonate_work() and io_assign_current_work()
- replace @next label with nested do-while
- move put_work() right after NULL'ing cur_work.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
03 Mar, 2020
4 commits
-
@hash_map is unsigned long, but BIT_ULL() is used for manipulations.
BIT() is a better match as it returns exactly unsigned long value.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
before the work setup (i.e. mm, override creds, etc). The setup
shouldn't take long, so it's ok to arm it a bit later and get rid
of IO_WQ_WORK_CB.Make io-wq call work->func() only once, callbacks will handle the rest.
i.e. the linked timeout handler will do the actual issue. And as a
bonus, it removes an extra indirect call.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
IO_WQ_WORK_HAS_MM is set but never used, remove it.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
io_wq_flush() is buggy, during cancelation of a flush, the associated
work may be passed to the caller's (i.e. io_uring) @match callback. That
callback is expecting it to be embedded in struct io_kiocb. Cancelation
of internal work probably doesn't make a lot of sense to begin with.As the flush helper is no longer used, just delete it and the associated
work flag.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe