29 Aug, 2021

1 commit

  • io-wq divides work into two categories:

    1) Work that completes in a bounded time, like reading from a regular file
    or a block device. This type of work is limited based on the size of
    the SQ ring.

    2) Work that may never complete, which we call unbounded work. The number
    of workers here is limited only by RLIMIT_NPROC.

    For various use cases, it's handy to have the kernel limit the maximum
    number of pending workers for both categories. Provide a way to do this
    with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.

    IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
    the max worker count to what is being passed in for each category. The
    old values are returned into that same array. If 0 is being passed in for
    either category, it simply returns the current value.

    The value is capped at RLIMIT_NPROC. This actually isn't that important,
    as it's more of a hint; if we exceed the value, our attempt to fork a
    new worker will fail. This happens naturally already if more than one
    node is in the system, as these values are per-node internally for
    io-wq. A liburing-based usage sketch follows this entry.

    Reported-by: Johannes Lundberg
    Link: https://github.com/axboe/liburing/issues/420
    Signed-off-by: Jens Axboe

    Jens Axboe
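
    The sketch below is an illustrative addition, not part of the commit: it
    assumes a liburing recent enough to expose
    io_uring_register_iowq_max_workers(), which wraps the
    IORING_REGISTER_IOWQ_MAX_WORKERS operation described above.

    #include <stdio.h>
    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;
        unsigned int vals[2] = { 0, 0 };   /* 0,0 = just query current limits */
        int ret;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        /* Passing zeroes returns the current limits in vals[]:
         * index 0 = bounded workers, index 1 = unbounded workers. */
        ret = io_uring_register_iowq_max_workers(&ring, vals);
        if (!ret)
            printf("bounded=%u unbounded=%u\n", vals[0], vals[1]);

        /* Now cap both pools; the old values are written back into
         * the same array, as the commit message describes. */
        vals[0] = 4;
        vals[1] = 8;
        ret = io_uring_register_iowq_max_workers(&ring, vals);
        if (!ret)
            printf("previous: bounded=%u unbounded=%u\n", vals[0], vals[1]);

        io_uring_queue_exit(&ring);
        return ret ? 1 : 0;
    }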
     

24 Aug, 2021

1 commit

  • Prepare nodes that we're going to add before actually linking them; it's
    always safer and costs us nothing.

    Signed-off-by: Pavel Begunkov
    Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

18 Jun, 2021

2 commits

  • io-wq doesn't have anything to do with creds now, so move ->creds
    from struct io_wq_work into the request (aka struct io_kiocb).

    Signed-off-by: Pavel Begunkov
    Link: https://lore.kernel.org/r/8520c72ab8b8f4b96db12a228a2ab4c094ae64e1.1623949695.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • io-wq defaults to per-node masks for IO workers. This works fine by
    default, but isn't particularly handy for workloads that prefer more
    specific affinities, for either performance or isolation reasons.

    This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU
    mask that is then applied to IO thread workers, and an
    IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the
    default of per-node.

    Note that no care is given to existing IO threads; they will need to go
    through a reschedule before the affinity is correct if they are already
    running or sleeping. A liburing-based sketch of setting and clearing the
    mask follows this entry.

    Signed-off-by: Jens Axboe

    Jens Axboe
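
    The sketch below is an illustrative addition, not part of the commit: it
    assumes a liburing recent enough to provide io_uring_register_iowq_aff()
    and io_uring_unregister_iowq_aff(), which wrap the two operations above.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;
        cpu_set_t mask;
        int ret;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        /* Restrict io-wq workers to CPUs 0 and 1 */
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        CPU_SET(1, &mask);
        ret = io_uring_register_iowq_aff(&ring, sizeof(mask), &mask);

        /* ... issue async work here; new workers honor the mask, while
         * already running/sleeping workers pick it up after a reschedule ... */

        /* Reset back to the default per-node masks */
        if (!ret)
            ret = io_uring_unregister_iowq_aff(&ring);

        io_uring_queue_exit(&ring);
        return ret ? 1 : 0;
    }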
     

26 May, 2021

1 commit

  • There is an old problem with io-wq cancellation where requests that should
    be killed are in io-wq but are not discoverable, e.g. in the @next_hashed
    or @linked vars of io_worker_handle_work(). It adds some unreliability
    to individual request cancellation, but may also potentially get
    __io_uring_cancel() stuck. For instance:

    1) An __io_uring_cancel() cancellation round has not found any
    request, but there are some as described.
    2) __io_uring_cancel() goes to sleep.
    3) Then workers wake up and try to execute those hidden requests
    that happen to be unbound.

    As we already cancel all io-wq requests there, set IO_WQ_BIT_EXIT
    in advance, thereby preventing 3) from executing unbound requests. The
    workers will initially break out of their loops because they get a signal,
    as they are threads of the dying/exec()'ing user task.

    Cc: stable@vger.kernel.org
    Signed-off-by: Pavel Begunkov
    Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

12 Apr, 2021

1 commit

  • io-wq relies on a manager thread to create/fork new workers, as needed.
    But there's really no strong need for it anymore. We have the following
    cases that fork a new worker:

    1) Work queue. This is done from the task itself always, and it's trivial
    to create a worker off that path, if needed.

    2) All workers have gone to sleep, and we have more work. This is called
    off the sched-out path. For this case, use a task_work item to queue
    a fork-worker operation.

    3) Hashed work completion. Don't think we need to do anything off this
    case. If need be, it could just use approach 2 as well.

    Part of this change is incrementing the running worker count before the
    fork, to avoid cases where we observe that we need a worker and then
    queue creation of one, then new work comes in and we fork another one.
    That last queue operation should have waited for the previous worker to
    come up; it's quite possible we don't even need it. Hence account the
    worker as running before we fork it off, to more efficiently handle that
    case.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Mar, 2021

1 commit


07 Mar, 2021

1 commit


05 Mar, 2021

1 commit

  • This allows us to do task creation and setup without needing to use
    completions to try and synchronize with the starting thread. Get rid of
    the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup
    completion events - we can now do setup before the task is running.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Mar, 2021

2 commits

  • If we move it in there, then we no longer have to care about it in io-wq.
    This means we can drop the cred handling in io-wq, and we can drop the
    REQ_F_WORK_INITIALIZED flag and async init functions as that was the last
    user of it since we moved to the new workers. Then we can also drop
    io_wq_work->creds, and just hold the personality u16 in there instead.

    Suggested-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If we put the io-wq from io_uring, we really want it to exit. Provide
    a helper that does that for us. Couple that with not having the manager
    hold a reference to the 'wq', and the normal SQPOLL exit will tear down
    the io-wq context appropriately.

    On the io-wq side, our wq context is per task, so only the task itself
    is manipulating ->manager and hence it's safe to check and clear without
    any extra locking. We just need to ensure that the manager task stays
    around, in case it exits.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

26 Feb, 2021

2 commits

  • exec will cancel any threads, including the ones that io-wq is using. This
    isn't a problem, in fact we'd prefer it to be that way since it means we
    know that any async work cancels naturally without having to handle it
    proactively.

    But it does mean that we need to set up a new manager, as the manager and
    workers are gone. Handle this at queue time, and cancel work if we fail.
    Since the manager can go away without us noticing, ensure that the manager
    itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to
    io_wq_put() to reflect that.

    In the future we can now simplify exec cancelation handling, for now just
    leave it the same.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Before the io-wq thread change, we maintained a hash work map and lock
    per-node per-ring. That wasn't ideal, as we really wanted it to be per
    ring. But now that we have per-task workers, the hash map ends up being
    just per-task. That'll work just fine for the normal case of having
    one task use a ring, but if you share the ring between tasks, then it's
    considerably worse than it was before.

    Make the hash map per ctx instead, which provides full per-ctx buffered
    write serialization on hashed writes.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Feb, 2021

1 commit

  • We're now just using fork like we would from userspace, so there's no
    need to try and impose extra restrictions or accounting on the user
    side of things. That's already being done for us. That also means we
    don't have to pass in the user_struct anymore, that's correctly inherited
    through ->creds on fork.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Feb, 2021

6 commits


10 Feb, 2021

1 commit

  • task_work is a LIFO list, due to how it's implemented as a lockless
    list. For long chains of task_work, this can be problematic as the
    first entry added is the last one processed. Similarly, we'd waste
    a lot of CPU cycles reversing this list.

    Wrap the task_work so we have a single task_work entry per task per
    ctx, and use that to run it in the right order.
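
    (An illustrative userspace sketch of the LIFO push-and-reverse pattern
    follows this entry.)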

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
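
    The sketch below is an illustrative addition, not the kernel code: it
    shows, with C11 atomics in userspace, why a lockless singly linked push
    list is LIFO and why running entries in submission order requires
    reversing the chain at drain time. The struct and function names are
    made up.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdio.h>

    struct cb {
        struct cb *next;
        int id;
    };

    static _Atomic(struct cb *) head;

    static void push(struct cb *node)
    {
        /* Lockless LIFO push: new entries always go to the front */
        struct cb *old = atomic_load(&head);
        do {
            node->next = old;
        } while (!atomic_compare_exchange_weak(&head, &old, node));
    }

    static void drain_fifo(void)
    {
        /* Grab the whole chain at once ... */
        struct cb *node = atomic_exchange(&head, NULL);
        struct cb *rev = NULL;

        /* ... then reverse it so the first-added entry runs first */
        while (node) {
            struct cb *next = node->next;
            node->next = rev;
            rev = node;
            node = next;
        }
        for (; rev; rev = rev->next)
            printf("run %d\n", rev->id);
    }

    int main(void)
    {
        struct cb a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };

        push(&a); push(&b); push(&c);
        drain_fifo();          /* prints 1, 2, 3 rather than 3, 2, 1 */
        return 0;
    }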
     

04 Feb, 2021

1 commit

  • Saving one lock/unlock for io-wq is not super important, but it adds some
    ugliness to the code. More importantly, atomic decrements that don't bring
    the count to zero won't give the right ordering/barriers on some archs, so
    io_steal_work() may pretty easily get subtly and completely broken.

    Bring back the 2-step io-wq work exchange and clean it up.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

02 Feb, 2021

1 commit


21 Dec, 2020

1 commit


18 Dec, 2020

1 commit

  • The first time a req is punted to io-wq, we'll initialize io_wq_work's
    list to NULL, then insert the req into io_wqe->work_list. If this req is
    not inserted at the tail of io_wqe->work_list, this req's io_wq_work list
    will point to another req's io_wq_work. For the split bio case, this req
    may be inserted into io_wqe->work_list repeatedly; once we insert it at
    the tail of io_wqe->work_list for the second time, io_wq_work->list->next
    will be an invalid pointer, which then results in many strange errors:
    panic, kernel soft-lockup, RCU stall, etc.

    In my VM, the kernel does not have commit cc29e1bf0d63f7 ("block: disable
    iopoll for split bio"); the fio job below can reproduce this bug steadily:
    [global]
    name=iouring-sqpoll-iopoll-1
    ioengine=io_uring
    iodepth=128
    numjobs=1
    thread
    rw=randread
    direct=1
    registerfiles=1
    hipri=1
    bs=4m
    size=100M
    runtime=120
    time_based
    group_reporting
    randrepeat=0

    [device]
    directory=/home/feiman.wxg/mntpoint/ # an ext4 mount point

    With commit cc29e1bf0d63f7 ("block: disable iopoll for split bio"), there
    will be no split bio case for polled IO, but I think we still need to fix
    this list corruption; it should probably also go to stable branches.

    To fix this corruption, when a req is inserted at the tail of
    io_wqe->work_list, initialize req->io_wq_work->list->next to NULL. A
    minimal userspace illustration of the failure mode follows this entry.

    Cc: stable@vger.kernel.org
    Signed-off-by: Xiaoguang Wang
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Xiaoguang Wang
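
    The sketch below is an illustrative addition, not the io-wq list code: a
    made-up singly linked list with a tail pointer, showing how re-queueing a
    node whose ->next still points at an earlier neighbour corrupts the list,
    and how clearing ->next on tail insert (the fix above) avoids it.

    #include <stdio.h>
    #include <stddef.h>

    struct work {
        struct work *next;
        const char *name;
    };

    struct work_list {
        struct work *first, *last;
    };

    static void add_tail(struct work *node, struct work_list *list)
    {
        node->next = NULL;              /* the fix: clear any stale ->next */
        if (!list->first)
            list->first = node;
        else
            list->last->next = node;
        list->last = node;
    }

    static struct work *pop_front(struct work_list *list)
    {
        struct work *node = list->first;
        if (node) {
            list->first = node->next;
            if (!list->first)
                list->last = NULL;
        }
        return node;
    }

    int main(void)
    {
        struct work_list list = { NULL, NULL };
        struct work a = { .name = "a" }, b = { .name = "b" };

        add_tail(&a, &list);
        add_tail(&b, &list);            /* a.next now points at b */
        pop_front(&list);               /* a leaves the list, a.next still set */

        /* Re-queue 'a' at the tail (the "inserted repeatedly" case). Without
         * node->next = NULL in add_tail(), the list would become the cycle
         * b -> a -> b -> ... and a walker would loop forever. */
        add_tail(&a, &list);

        for (struct work *w = list.first; w; w = w->next)
            printf("%s\n", w->name);    /* prints: b, a */
        return 0;
    }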
     

10 Dec, 2020

1 commit

  • Instead of iterating over each request and cancelling it individually in
    io_uring_cancel_files(), try to cancel all matching requests and use
    ->inflight_list only to check if there is anything left.

    In many cases it should be faster, and we can reuse a lot of code from
    task cancellation.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

21 Oct, 2020

1 commit

  • This one was missed in the earlier conversion; it should be included like
    any of the other IO identity flags. Make sure we restore to RLIM_INFINITY
    when dropping the personality again.

    Fixes: 98447d65b4a7 ("io_uring: move io identity items into separate struct")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Oct, 2020

2 commits

  • io-wq contains a pointer to the identity, which we just hold in io_kiocb
    for now. This is in preparation for putting this outside io_kiocb. The
    only exception is struct files_struct, which we'll need different rules
    for to avoid a circular dependency.

    No functional changes in this patch.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We have a number of bits that decide what context to inherit. Set up
    io-wq flags for these instead. This is in preparation for always having
    the various members set, but not always needing them for all requests.

    No intended functional changes in this patch.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Oct, 2020

2 commits

  • There are a few operations that are offloaded to the worker threads. In
    this case, we lose process context and end up in kthread context. This
    results in IOs not being accounted to the issuing cgroup and
    consequently ending up as issued by root. Just like the others, adopt
    the personality of the blkcg too when issuing via the workqueues.

    For the SQPOLL thread, it will live and attach in the inited cgroup's
    context.

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • If we don't get and assign the namespace for the async work, then certain
    paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
    Anything that references the current namespace of the given task should
    be assigned for async work on behalf of that task.

    Cc: stable@vger.kernel.org # v5.5+
    Reported-by: Al Viro
    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Jul, 2020

1 commit


27 Jun, 2020

2 commits


15 Jun, 2020

3 commits


11 Jun, 2020

1 commit

  • If requests can be submitted and completed inline, we don't need to
    initialize the whole io_wq_work in io_init_req(), which is an expensive
    operation. Add a new 'REQ_F_WORK_INITIALIZED' flag to track whether
    io_wq_work has been initialized, and add a helper io_req_init_async();
    users must call io_req_init_async() before first touching any members
    of io_wq_work. A minimal sketch of this lazy-init pattern follows this
    entry.

    I use /dev/nullb0 to evaluate the performance improvement on my physical
    machine:
    modprobe null_blk nr_devices=1 completion_nsec=0
    sudo taskset -c 60 fio -name=fiotest -filename=/dev/nullb0 -iodepth=128
    -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1
    -time_based -runtime=120

    before this patch:
    Run status group 0 (all jobs):
    READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s),
    io=84.8GiB (91.1GB), run=120001-120001msec

    With this patch:
    Run status group 0 (all jobs):
    READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s),
    io=89.2GiB (95.8GB), run=120001-120001msec

    About 5% improvement.

    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Jens Axboe

    Xiaoguang Wang
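
    The sketch below is an illustrative addition, not the io_uring code: a
    simplified stand-in showing the lazy-init flag pattern the commit
    describes, where io_req_init_async() pays the initialization cost only
    for requests that actually go async. Field names are placeholders.

    #include <string.h>

    #define REQ_F_WORK_INITIALIZED  (1U << 0)

    struct io_wq_work {
        void *list_next;
        unsigned int flags;
        /* ... creds, fs, mm, etc. in the real structure ... */
    };

    struct io_kiocb {
        unsigned int flags;
        struct io_wq_work work;
    };

    /* Must be called before touching any member of req->work */
    static void io_req_init_async(struct io_kiocb *req)
    {
        if (req->flags & REQ_F_WORK_INITIALIZED)
            return;
        memset(&req->work, 0, sizeof(req->work));
        req->flags |= REQ_F_WORK_INITIALIZED;
    }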
     

09 Jun, 2020

1 commit

  • io_uring is the only user of io-wq, and now it uses only one io-wq callback
    for all its requests, namely io_wq_submit_work(). Instead of storing a
    work->runner callback in each instance of io_wq_work, keep it in io-wq
    itself.

    pros:
    - reduces io_wq_work size
    - more robust -- ->func won't be invalidated with mem{cpy,set}(req)
    - helps other work

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

04 Apr, 2020

1 commit