29 Aug, 2021

1 commit

  • io-wq divides work into two categories:

    1) Work that completes in a bounded time, like reading from a regular file
    or a block device. This type of work is limited based on the size of
    the SQ ring.

    2) Work that may never complete, which we call unbounded work. The number
    of workers here is limited only by RLIMIT_NPROC.

    For various use cases, it's handy to have the kernel limit the maximum
    number of pending workers for both categories. Provide a way to do this
    with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.

    IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
    the max worker count to what is being passed in for each category. The
    old values are returned into that same array. If 0 is being passed in for
    either category, it simply returns the current value.

    The value is capped at RLIMIT_NPROC. This actually isn't that important,
    as it's more of a hint; if we exceed the value, our attempt to fork a
    new worker will fail. This happens naturally already if more than one
    node is in the system, as these values are per-node internally for
    io-wq. A liburing-based usage sketch follows this entry.

    Reported-by: Johannes Lundberg
    Link: https://github.com/axboe/liburing/issues/420
    Signed-off-by: Jens Axboe

    Jens Axboe
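
    The sketch below is an illustrative addition, not part of the commit: it
    assumes a liburing recent enough to expose
    io_uring_register_iowq_max_workers(), which wraps the
    IORING_REGISTER_IOWQ_MAX_WORKERS operation described above.

    #include <stdio.h>
    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;
        unsigned int vals[2] = { 0, 0 };   /* 0,0 = just query current limits */
        int ret;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        /* Passing zeroes returns the current limits in vals[]:
         * index 0 = bounded workers, index 1 = unbounded workers. */
        ret = io_uring_register_iowq_max_workers(&ring, vals);
        if (!ret)
            printf("bounded=%u unbounded=%u\n", vals[0], vals[1]);

        /* Now cap both pools; the old values are written back into
         * the same array, as the commit message describes. */
        vals[0] = 4;
        vals[1] = 8;
        ret = io_uring_register_iowq_max_workers(&ring, vals);
        if (!ret)
            printf("previous: bounded=%u unbounded=%u\n", vals[0], vals[1]);

        io_uring_queue_exit(&ring);
        return ret ? 1 : 0;
    }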
     

24 Aug, 2021

1 commit

  • Prepare nodes that we're going to add before actually linking them; it's
    always safer and costs us nothing.

    Signed-off-by: Pavel Begunkov
    Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

18 Jun, 2021

2 commits

  • io-wq doesn't have anything to do with creds now, so move ->creds
    from struct io_wq_work into the request (aka struct io_kiocb).

    Signed-off-by: Pavel Begunkov
    Link: https://lore.kernel.org/r/8520c72ab8b8f4b96db12a228a2ab4c094ae64e1.1623949695.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • io-wq defaults to per-node masks for IO workers. This works fine by
    default, but isn't particularly handy for workloads that prefer more
    specific affinities, for either performance or isolation reasons.

    This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU
    mask that is then applied to IO thread workers, and an
    IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the
    default of per-node.

    Note that no care is given to existing IO threads; they will need to go
    through a reschedule before the affinity is correct if they are already
    running or sleeping. A liburing-based sketch of setting and clearing the
    mask follows this entry.

    Signed-off-by: Jens Axboe

    Jens Axboe
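
    The sketch below is an illustrative addition, not part of the commit: it
    assumes a liburing recent enough to provide io_uring_register_iowq_aff()
    and io_uring_unregister_iowq_aff(), which wrap the two operations above.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;
        cpu_set_t mask;
        int ret;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        /* Restrict io-wq workers to CPUs 0 and 1 */
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        CPU_SET(1, &mask);
        ret = io_uring_register_iowq_aff(&ring, sizeof(mask), &mask);

        /* ... issue async work here; new workers honor the mask, while
         * already running/sleeping workers pick it up after a reschedule ... */

        /* Reset back to the default per-node masks */
        if (!ret)
            ret = io_uring_unregister_iowq_aff(&ring);

        io_uring_queue_exit(&ring);
        return ret ? 1 : 0;
    }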
     

26 May, 2021

1 commit

  • There is an old problem with io-wq cancellation where requests that should
    be killed are in io-wq but are not discoverable, e.g. in the @next_hashed
    or @linked vars of io_worker_handle_work(). It adds some unreliability
    to individual request cancellation, but may also potentially get
    __io_uring_cancel() stuck. For instance:

    1) An __io_uring_cancel() cancellation round has not found any
    request, but there are some as described.
    2) __io_uring_cancel() goes to sleep.
    3) Then workers wake up and try to execute those hidden requests
    that happen to be unbound.

    As we already cancel all io-wq requests there, set IO_WQ_BIT_EXIT
    in advance, thereby preventing 3) from executing unbound requests. The
    workers will initially break out of their loops because they get a signal,
    as they are threads of the dying/exec()'ing user task.

    Cc: stable@vger.kernel.org
    Signed-off-by: Pavel Begunkov
    Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

12 Apr, 2021

1 commit

  • io-wq relies on a manager thread to create/fork new workers, as needed.
    But there's really no strong need for it anymore. We have the following
    cases that fork a new worker:

    1) Work queue. This is done from the task itself always, and it's trivial
    to create a worker off that path, if needed.

    2) All workers have gone to sleep, and we have more work. This is called
    off the sched-out path. For this case, use a task_work item to queue
    a fork-worker operation.

    3) Hashed work completion. Don't think we need to do anything off this
    case. If need be, it could just use approach 2 as well.

    Part of this change is incrementing the running worker count before the
    fork, to avoid cases where we observe that we need a worker and then
    queue creation of one, then new work comes in and we fork another one.
    That last queue operation should have waited for the previous worker to
    come up; it's quite possible we don't even need it. Hence account the
    worker as running before we fork it off, to more efficiently handle that
    case.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Mar, 2021

1 commit


07 Mar, 2021

1 commit


05 Mar, 2021

1 commit

  • This allows us to do task creation and setup without needing to use
    completions to try and synchronize with the starting thread. Get rid of
    the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup
    completion events - we can now do setup before the task is running.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Mar, 2021

2 commits

  • If we move it in there, then we no longer have to care about it in io-wq.
    This means we can drop the cred handling in io-wq, and we can drop the
    REQ_F_WORK_INITIALIZED flag and async init functions as that was the last
    user of it since we moved to the new workers. Then we can also drop
    io_wq_work->creds, and just hold the personality u16 in there instead.

    Suggested-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If we put the io-wq from io_uring, we really want it to exit. Provide
    a helper that does that for us. Couple that with not having the manager
    hold a reference to the 'wq', and the normal SQPOLL exit will tear down
    the io-wq context appropriately.

    On the io-wq side, our wq context is per task, so only the task itself
    is manipulating ->manager and hence it's safe to check and clear without
    any extra locking. We just need to ensure that the manager task stays
    around, in case it exits.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

26 Feb, 2021

2 commits

  • exec will cancel any threads, including the ones that io-wq is using. This
    isn't a problem, in fact we'd prefer it to be that way since it means we
    know that any async work cancels naturally without having to handle it
    proactively.

    But it does mean that we need to set up a new manager, as the manager and
    workers are gone. Handle this at queue time, and cancel work if we fail.
    Since the manager can go away without us noticing, ensure that the manager
    itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to
    io_wq_put() to reflect that.

    In the future we can now simplify exec cancelation handling, for now just
    leave it the same.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Before the io-wq thread change, we maintained a hash work map and lock
    per-node per-ring. That wasn't ideal, as we really wanted it to be per
    ring. But now that we have per-task workers, the hash map ends up being
    just per-task. That'll work just fine for the normal case of having
    one task use a ring, but if you share the ring between tasks, then it's
    considerably worse than it was before.

    Make the hash map per ctx instead, which provides full per-ctx buffered
    write serialization on hashed writes.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Feb, 2021

1 commit

  • We're now just using fork like we would from userspace, so there's no
    need to try and impose extra restrictions or accounting on the user
    side of things. That's already being done for us. That also means we
    don't have to pass in the user_struct anymore, that's correctly inherited
    through ->creds on fork.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Feb, 2021

6 commits


10 Feb, 2021

1 commit

  • task_work is a LIFO list, due to how it's implemented as a lockless
    list. For long chains of task_work, this can be problematic as the
    first entry added is the last one processed. Similarly, we'd waste
    a lot of CPU cycles reversing this list.

    Wrap the task_work so we have a single task_work entry per task per
    ctx, and use that to run it in the right order.
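
    (An illustrative userspace sketch of the LIFO push-and-reverse pattern
    follows this entry.)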

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
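
    The sketch below is an illustrative addition, not the kernel code: it
    shows, with C11 atomics in userspace, why a lockless singly linked push
    list is LIFO and why running entries in submission order requires
    reversing the chain at drain time. The struct and function names are
    made up.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdio.h>

    struct cb {
        struct cb *next;
        int id;
    };

    static _Atomic(struct cb *) head;

    static void push(struct cb *node)
    {
        /* Lockless LIFO push: new entries always go to the front */
        struct cb *old = atomic_load(&head);
        do {
            node->next = old;
        } while (!atomic_compare_exchange_weak(&head, &old, node));
    }

    static void drain_fifo(void)
    {
        /* Grab the whole chain at once ... */
        struct cb *node = atomic_exchange(&head, NULL);
        struct cb *rev = NULL;

        /* ... then reverse it so the first-added entry runs first */
        while (node) {
            struct cb *next = node->next;
            node->next = rev;
            rev = node;
            node = next;
        }
        for (; rev; rev = rev->next)
            printf("run %d\n", rev->id);
    }

    int main(void)
    {
        struct cb a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };

        push(&a); push(&b); push(&c);
        drain_fifo();          /* prints 1, 2, 3 rather than 3, 2, 1 */
        return 0;
    }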
     

04 Feb, 2021

1 commit

  • Saving one lock/unlock for io-wq is not super important, but it adds some
    ugliness to the code. More importantly, atomic decrements that don't bring
    the count to zero won't give the right ordering/barriers on some archs, so
    io_steal_work() may pretty easily get subtly and completely broken.

    Bring back the 2-step io-wq work exchange and clean it up.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

02 Feb, 2021

1 commit


21 Dec, 2020

1 commit


18 Dec, 2020

1 commit

  • The first time a req is punted to io-wq, we'll initialize io_wq_work's
    list to NULL, then insert the req into io_wqe->work_list. If this req is
    not inserted at the tail of io_wqe->work_list, this req's io_wq_work list
    will point to another req's io_wq_work. For the split bio case, this req
    may be inserted into io_wqe->work_list repeatedly; once we insert it at
    the tail of io_wqe->work_list for the second time, io_wq_work->list->next
    will be an invalid pointer, which then results in many strange errors:
    panic, kernel soft-lockup, RCU stall, etc.

    In my VM, the kernel does not have commit cc29e1bf0d63f7 ("block: disable
    iopoll for split bio"); the fio job below can reproduce this bug steadily:
    [global]
    name=iouring-sqpoll-iopoll-1
    ioengine=io_uring
    iodepth=128
    numjobs=1
    thread
    rw=randread
    direct=1
    registerfiles=1
    hipri=1
    bs=4m
    size=100M
    runtime=120
    time_based
    group_reporting
    randrepeat=0

    [device]
    directory=/home/feiman.wxg/mntpoint/ # an ext4 mount point

    With commit cc29e1bf0d63f7 ("block: disable iopoll for split bio"), there
    will be no split bio case for polled IO, but I think we still need to fix
    this list corruption; it should probably also go to stable branches.

    To fix this corruption, when a req is inserted at the tail of
    io_wqe->work_list, initialize req->io_wq_work->list->next to NULL. A
    minimal userspace illustration of the failure mode follows this entry.

    Cc: stable@vger.kernel.org
    Signed-off-by: Xiaoguang Wang
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Xiaoguang Wang
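
    The sketch below is an illustrative addition, not the io-wq list code: a
    made-up singly linked list with a tail pointer, showing how re-queueing a
    node whose ->next still points at an earlier neighbour corrupts the list,
    and how clearing ->next on tail insert (the fix above) avoids it.

    #include <stdio.h>
    #include <stddef.h>

    struct work {
        struct work *next;
        const char *name;
    };

    struct work_list {
        struct work *first, *last;
    };

    static void add_tail(struct work *node, struct work_list *list)
    {
        node->next = NULL;              /* the fix: clear any stale ->next */
        if (!list->first)
            list->first = node;
        else
            list->last->next = node;
        list->last = node;
    }

    static struct work *pop_front(struct work_list *list)
    {
        struct work *node = list->first;
        if (node) {
            list->first = node->next;
            if (!list->first)
                list->last = NULL;
        }
        return node;
    }

    int main(void)
    {
        struct work_list list = { NULL, NULL };
        struct work a = { .name = "a" }, b = { .name = "b" };

        add_tail(&a, &list);
        add_tail(&b, &list);            /* a.next now points at b */
        pop_front(&list);               /* a leaves the list, a.next still set */

        /* Re-queue 'a' at the tail (the "inserted repeatedly" case). Without
         * node->next = NULL in add_tail(), the list would become the cycle
         * b -> a -> b -> ... and a walker would loop forever. */
        add_tail(&a, &list);

        for (struct work *w = list.first; w; w = w->next)
            printf("%s\n", w->name);    /* prints: b, a */
        return 0;
    }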
     

10 Dec, 2020

1 commit

  • Instead of iterating over each request and cancelling it individually in
    io_uring_cancel_files(), try to cancel all matching requests and use
    ->inflight_list only to check if there is anything left.

    In many cases it should be faster, and we can reuse a lot of code from
    task cancellation.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

21 Oct, 2020

1 commit

  • This one was missed in the earlier conversion; it should be included like
    any of the other IO identity flags. Make sure we restore to RLIM_INFINITY
    when dropping the personality again.

    Fixes: 98447d65b4a7 ("io_uring: move io identity items into separate struct")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Oct, 2020

2 commits

  • io-wq contains a pointer to the identity, which we just hold in io_kiocb
    for now. This is in preparation for putting this outside io_kiocb. The
    only exception is struct files_struct, which we'll need different rules
    for to avoid a circular dependency.

    No functional changes in this patch.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We have a number of bits that decide what context to inherit. Set up
    io-wq flags for these instead. This is in preparation for always having
    the various members set, but not always needing them for all requests.

    No intended functional changes in this patch.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Oct, 2020

2 commits

  • There are a few operations that are offloaded to the worker threads. In
    this case, we lose process context and end up in kthread context. This
    results in IOs not being accounted to the issuing cgroup and
    consequently ending up as issued by root. Just like the others, adopt
    the personality of the blkcg too when issuing via the workqueues.

    For the SQPOLL thread, it will live and attach in the inited cgroup's
    context.

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • If we don't get and assign the namespace for the async work, then certain
    paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
    Anything that references the current namespace of the given task should
    be assigned for async work on behalf of that task.

    Cc: stable@vger.kernel.org # v5.5+
    Reported-by: Al Viro
    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Jul, 2020

1 commit


27 Jun, 2020

2 commits


15 Jun, 2020

3 commits


11 Jun, 2020

1 commit

  • If requests can be submitted and completed inline, we don't need to
    initialize the whole io_wq_work in io_init_req(), which is an expensive
    operation. Add a new 'REQ_F_WORK_INITIALIZED' flag to track whether
    io_wq_work has been initialized, and add a helper io_req_init_async();
    users must call io_req_init_async() before first touching any members
    of io_wq_work. A minimal sketch of this lazy-init pattern follows this
    entry.

    I use /dev/nullb0 to evaluate the performance improvement on my physical
    machine:
    modprobe null_blk nr_devices=1 completion_nsec=0
    sudo taskset -c 60 fio -name=fiotest -filename=/dev/nullb0 -iodepth=128
    -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1
    -time_based -runtime=120

    before this patch:
    Run status group 0 (all jobs):
    READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s),
    io=84.8GiB (91.1GB), run=120001-120001msec

    With this patch:
    Run status group 0 (all jobs):
    READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s),
    io=89.2GiB (95.8GB), run=120001-120001msec

    About 5% improvement.

    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Jens Axboe

    Xiaoguang Wang
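
    The sketch below is an illustrative addition, not the io_uring code: a
    simplified stand-in showing the lazy-init flag pattern the commit
    describes, where io_req_init_async() pays the initialization cost only
    for requests that actually go async. Field names are placeholders.

    #include <string.h>

    #define REQ_F_WORK_INITIALIZED  (1U << 0)

    struct io_wq_work {
        void *list_next;
        unsigned int flags;
        /* ... creds, fs, mm, etc. in the real structure ... */
    };

    struct io_kiocb {
        unsigned int flags;
        struct io_wq_work work;
    };

    /* Must be called before touching any member of req->work */
    static void io_req_init_async(struct io_kiocb *req)
    {
        if (req->flags & REQ_F_WORK_INITIALIZED)
            return;
        memset(&req->work, 0, sizeof(req->work));
        req->flags |= REQ_F_WORK_INITIALIZED;
    }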
     

09 Jun, 2020

1 commit

  • io_uring is the only user of io-wq, and now it uses only one io-wq callback
    for all its requests, namely io_wq_submit_work(). Instead of storing a
    work->runner callback in each instance of io_wq_work, keep it in io-wq
    itself.

    pros:
    - reduces io_wq_work size
    - more robust -- ->func won't be invalidated with mem{cpy,set}(req)
    - helps other work

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

04 Apr, 2020

1 commit