30 Dec, 2020

1 commit

  • commit 0020ef04e48571a88d4f482ad08f71052c5c5a08 upstream.

    The first time a req is punted to io-wq, we initialize io_wq_work's
    list to NULL and then insert the req into io_wqe->work_list. If this
    req is not inserted at the tail of io_wqe->work_list, its io_wq_work
    list will point to another req's io_wq_work. In the split bio case,
    this req may be inserted into io_wqe->work_list repeatedly; once we
    insert it at the tail of io_wqe->work_list for the second time,
    io_wq_work->list->next becomes an invalid pointer, which then results
    in all sorts of strange errors: panics, kernel soft-lockups, RCU
    stalls, etc.

    In my VM, which does not have commit cc29e1bf0d63f7 ("block: disable
    iopoll for split bio"), the fio job below reproduces this bug reliably:
    [global]
    name=iouring-sqpoll-iopoll-1
    ioengine=io_uring
    iodepth=128
    numjobs=1
    thread
    rw=randread
    direct=1
    registerfiles=1
    hipri=1
    bs=4m
    size=100M
    runtime=120
    time_based
    group_reporting
    randrepeat=0

    [device]
    directory=/home/feiman.wxg/mntpoint/ # an ext4 mount point

    If we have commit cc29e1bf0d63f7 ("block: disable iopoll for split bio"),
    there will be no split bio case for polled IO, but we still need to fix
    this list corruption, and it should probably go to the stable branches
    as well.

    To fix this corruption, initialize req->io_wq_work->list->next to NULL
    when a req is inserted at the tail of io_wqe->work_list.
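
    For illustration only, here is a minimal sketch of the idea behind the
    fix. The structure and helper names mirror io-wq's singly linked work
    list, but this is a sketch, not the exact upstream diff:

    /* Singly linked work list; tail insertion clears any stale ->next
     * left over from a previous stint on the list. */
    struct io_wq_work_node {
            struct io_wq_work_node *next;
    };

    struct io_wq_work_list {
            struct io_wq_work_node *first;
            struct io_wq_work_node *last;
    };

    static inline void wq_list_add_tail(struct io_wq_work_node *node,
                                        struct io_wq_work_list *list)
    {
            node->next = NULL;      /* the fix: never carry an old link */
            if (!list->first) {
                    list->first = node;
                    list->last = node;
            } else {
                    list->last->next = node;
                    list->last = node;
            }
    }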

    Cc: stable@vger.kernel.org
    Signed-off-by: Xiaoguang Wang
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Xiaoguang Wang
     

21 Oct, 2020

1 commit

  • This one was missed in the earlier conversion; it should be included like
    any of the other IO identity flags. Make sure we restore to RLIM_INFINITY
    when dropping the personality again.

    Fixes: 98447d65b4a7 ("io_uring: move io identity items into separate struct")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Oct, 2020

2 commits

  • io-wq contains a pointer to the identity, which we just hold in io_kiocb
    for now. This is in preparation for putting this outside io_kiocb. The
    only exception is struct files_struct, which we'll need different rules
    for to avoid a circular dependency.

    No functional changes in this patch.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We have a number of bits that decide what context to inherit. Set up
    io-wq flags for these instead. This is in preparation for always having
    the various members set, but not always needing them for all requests.

    No intended functional changes in this patch.
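
    For illustration, the inheritance bits become io-wq work flags along
    these lines (the flag names follow io-wq; the bit values here are
    purely illustrative):

    /* Sketch: which pieces of the submitting task's context the async
     * worker must take over before running this work item. */
    enum {
            IO_WQ_WORK_FILES = 1 << 4,      /* needs current->files */
            IO_WQ_WORK_FS    = 1 << 5,      /* needs current->fs */
            IO_WQ_WORK_MM    = 1 << 6,      /* needs the submitter's mm */
            IO_WQ_WORK_CREDS = 1 << 7,      /* needs the submitter's creds */
            /* values above are illustrative, not the real bit layout */
    };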

    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Oct, 2020

2 commits

  • There are a few operations that are offloaded to the worker threads. In
    this case, we lose process context and end up in kthread context. This
    results in IOs not being accounted to the issuing cgroup, and they
    consequently end up attributed to root. Just like the other context
    items, adopt the blkcg personality as well when issuing via the
    workqueues.

    The SQPOLL thread will live and attach in the context of the cgroup it
    was initialized in.
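
    A rough sketch of the worker-side mechanism; the io_wq_work layout
    shown here is illustrative (the real patch keeps the css in the
    request's context), but kthread_associate_blkcg() is the kernel helper
    used to attach a kthread to a blkcg:

    #ifdef CONFIG_BLK_CGROUP
    struct io_wq_work {
            struct cgroup_subsys_state *blkcg_css;  /* illustrative field */
            /* ... */
    };

    /* Before running a work item, charge IO to the issuer's blkcg rather
     * than to the worker kthread (i.e. root). */
    static void io_wq_adopt_blkcg(struct io_wq_work *work)
    {
            if (work->blkcg_css)
                    kthread_associate_blkcg(work->blkcg_css);
    }

    /* After the work completes, detach again. */
    static void io_wq_drop_blkcg(void)
    {
            kthread_associate_blkcg(NULL);
    }
    #endif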

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • If we don't get and assign the namespace for the async work, then certain
    paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
    Anything that references the current namespace of the given task should
    be assigned for async work on behalf of that task.
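
    Conceptually (the container below is illustrative; the real change
    stores the nsproxy alongside ->files in the request's async context):

    /* Illustrative holder for the captured namespace pointer. */
    struct io_async_ns {
            struct nsproxy *nsproxy;
    };

    /* Submission side: pin the submitting task's namespaces so async work
     * resolves paths like /proc/mounts and /dev/stdin against the right
     * context. The worker installs this (together with ->files) while
     * running the work and drops the reference when done. */
    static void io_grab_ns(struct io_async_ns *ctx)
    {
            get_nsproxy(current->nsproxy);
            ctx->nsproxy = current->nsproxy;
    }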

    Cc: stable@vger.kernel.org # v5.5+
    Reported-by: Al Viro
    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Jul, 2020

1 commit


27 Jun, 2020

2 commits


15 Jun, 2020

3 commits


11 Jun, 2020

1 commit

  • If requests can be submitted and completed inline, we don't need to
    initialize the whole io_wq_work in io_init_req(), which is an expensive
    operation. Add a new flag, 'REQ_F_WORK_INITIALIZED', to track whether
    io_wq_work has been initialized, and add a helper, io_req_init_async();
    callers must invoke io_req_init_async() before touching any members of
    io_wq_work for the first time.
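
    The helper is essentially a lazy-init guard; roughly (a sketch, not
    necessarily the exact upstream code):

    static inline void io_req_init_async(struct io_kiocb *req)
    {
            if (req->flags & REQ_F_WORK_INITIALIZED)
                    return;

            memset(&req->work, 0, sizeof(req->work));
            req->flags |= REQ_F_WORK_INITIALIZED;
    }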

    I used /dev/nullb0 to evaluate the performance improvement on my
    physical machine:
    modprobe null_blk nr_devices=1 completion_nsec=0
    sudo taskset -c 60 fio -name=fiotest -filename=/dev/nullb0 -iodepth=128
    -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1
    -time_based -runtime=120

    Before this patch:
    Run status group 0 (all jobs):
    READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s),
    io=84.8GiB (91.1GB), run=120001-120001msec

    With this patch:
    Run status group 0 (all jobs):
    READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s),
    io=89.2GiB (95.8GB), run=120001-120001msec

    About 5% improvement.

    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Jens Axboe

    Xiaoguang Wang
     

09 Jun, 2020

1 commit

  • io_uring is the only user of io-wq, and it now uses only one io-wq
    callback for all its requests, namely io_wq_submit_work(). Instead of
    storing a work->runner callback in each instance of io_wq_work, keep it
    in io-wq itself.

    pros:
    - reduces io_wq_work size
    - more robust -- ->func won't be invalidated with mem{cpy,set}(req)
    - helps other work
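
    Illustration only, with simplified types: the handler is passed once at
    io_wq_create() time and stored in the wq, instead of living in every
    io_wq_work:

    struct io_wq_work;

    typedef void (io_wq_work_fn)(struct io_wq_work *);
    typedef void (free_work_fn)(struct io_wq_work *);

    /* One handler for the whole wq instead of a ->func per work item. */
    struct io_wq_data {
            io_wq_work_fn *do_work;         /* e.g. io_wq_submit_work */
            free_work_fn *free_work;
    };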

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

04 Apr, 2020

1 commit


24 Mar, 2020

1 commit

  • We always punt async buffered writes to an io-wq helper, as the core
    kernel does not have IOCB_NOWAIT support for that. Most buffered async
    writes complete very quickly, as it's just a copy operation. This means
    that doing multiple locking roundtrips on the shared wqe lock for each
    buffered write is wasteful. Additionally, buffered writes are hashed
    work items, which means that any buffered write to a given file is
    serialized.

    Keep identically hashed work items contiguous in @wqe->work_list, and
    track a tail for each hash bucket. On dequeue of a hashed item, splice
    all of the same hash in one go using the tracked tail. Until the batch
    is done, the caller doesn't have to synchronize with the wqe or worker
    locks again.
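
    A simplified sketch of the enqueue side: track a tail per hash bucket
    so same-hash items stay contiguous. Helper and field names follow
    io-wq, but the details are illustrative:

    static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
    {
            unsigned int hash;
            struct io_wq_work *tail;

            if (!io_wq_is_hashed(work)) {
    append:
                    wq_list_add_tail(&work->list, &wqe->work_list);
                    return;
            }

            /* Chain this item directly behind the current tail of its hash
             * bucket and remember it as the new tail. */
            hash = io_get_work_hash(work);
            tail = wqe->hash_tail[hash];
            wqe->hash_tail[hash] = work;
            if (!tail)
                    goto append;

            wq_list_add_after(&work->list, &tail->list, &wqe->work_list);
    }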

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

23 Mar, 2020

1 commit

  • work->data and work->list share a union. io_wq_assign_next() sets
    ->data if a req has a linked_timeout, but io-wq may then want to use
    work->list, e.g. to re-enqueue a request, corrupting ->data.

    ->data isn't necessary; just remove it and extract the linked_timeout
    through @link_list.

    Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

15 Mar, 2020

1 commit

  • This is a preparation patch removing io_wq_enqueue_hashed(); its job
    should now be done by io_wq_hash_work() + io_wq_enqueue().

    Also, set the hash value for dependent works, and do it as late as
    possible, because req->file can be unavailable before that point. This
    hash will be ignored by io-wq.
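
    The split looks roughly like this: io_wq_hash_work() only tags the work
    with a bucket derived from a pointer (e.g. the inode), and the plain
    io_wq_enqueue() does the queueing (a sketch of the helper, hedged on
    the exact constants):

    /* Mark work as hashed on 'val' so io-wq serializes all work that
     * hashes to the same bucket. */
    static inline void io_wq_hash_work(struct io_wq_work *work, void *val)
    {
            unsigned int bit;

            bit = hash_ptr(val, IO_WQ_HASH_ORDER);
            work->flags |= (IO_WQ_WORK_HASHED | (bit << IO_WQ_HASH_SHIFT));
    }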

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

05 Mar, 2020

1 commit

  • First, it changes the io-wq interfaces: it replaces {get,put}_work()
    with free_work(), which is guaranteed to be called exactly once, and it
    enforces that the free_work() callback is non-NULL.

    io_uring follows the changes; instead of putting the submission
    reference in io_put_req_async_completion(), that is now done in
    io_free_work(). Since io_get_work() is removed along with the
    corresponding refcount_inc(), the ref balance is maintained.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

03 Mar, 2020

3 commits

  • IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
    before the work setup (i.e. mm, override creds, etc). The setup
    shouldn't take long, so it's ok to arm it a bit later and get rid
    of IO_WQ_WORK_CB.

    Make io-wq call work->func() only once; callbacks will handle the rest,
    i.e. the linked timeout handler will do the actual issue. And as a
    bonus, it removes an extra indirect call.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • IO_WQ_WORK_HAS_MM is set but never used, remove it.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • io_wq_flush() is buggy: during cancellation of a flush, the associated
    work may be passed to the caller's (i.e. io_uring's) @match callback.
    That callback expects it to be embedded in struct io_kiocb. Cancellation
    of internal work probably doesn't make a lot of sense to begin with.

    As the flush helper is no longer used, just delete it and the associated
    work flag.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

26 Feb, 2020

1 commit

  • We use ->task_pid for exit cancellation, but we need to ensure it's
    cleared to zero for io_req_work_grab_env() to do the right thing. Take
    a suggestion from Bart and clear the whole thing, just setting the
    function passed in. This makes it more future-proof as well.
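
    The shape of the fix, roughly: clear the whole struct with a compound
    literal instead of zeroing individual fields (a sketch of the init
    macro, not necessarily the exact upstream code):

    #define INIT_IO_WORK(work, _func)                                   \
            do {                                                        \
                    *(work) = (struct io_wq_work){ .func = _func };     \
            } while (0)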

    Fixes: 36282881a795 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Feb, 2020

1 commit


09 Feb, 2020

1 commit


30 Jan, 2020

1 commit

  • We're not consistent in how the file table is grabbed and assigned if
    we have a linked command that requires the use of it.

    Add ->file_table to the io_op_defs[] array, and use that to determine
    when to grab the table instead of having the handlers set it if they
    need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
    flag. We always initialize work->files, so io-wq can just check for
    that.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Jan, 2020

2 commits

  • Export a helper to attach to an existing io-wq, rather than setting up
    a new one. This is doable now that we have reference counted io_wq's.

    Reported-by: Jens Axboe
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • We currently set up the io_wq with a static set of mm and creds. Even
    for a single-use io-wq per io_uring, this is suboptimal, as we may have
    multiple enters of the ring. For sharing the io-wq backend, it doesn't
    work at all.

    Switch to passing in the creds and mm when the work item is setup. This
    means that async work is no longer deferred to the io_uring mm and creds,
    it is done with the current mm and creds.

    Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know
    they can rely on the current personality (mm and creds) being the same
    for direct issue and async issue.
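
    Conceptually, the work item now captures the submitter's context when
    it is set up, something like the sketch below (io_req_work_grab_env()
    is the helper referenced in the 26 Feb, 2020 entry above; the body here
    is an approximation):

    /* Grab references on the current task's mm and creds when a request
     * is prepared for async execution, instead of reusing the mm/creds
     * captured at io_uring setup time. */
    static void io_req_work_grab_env(struct io_kiocb *req)
    {
            if (!req->work.mm) {
                    mmgrab(current->mm);
                    req->work.mm = current->mm;
            }
            if (!req->work.creds)
                    req->work.creds = get_current_cred();
    }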

    Reviewed-by: Stefan Metzmacher
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jan, 2020

2 commits

  • io-wq assumes that work will complete fast (and not block), so it
    doesn't create a new worker when work is enqueued if we already have
    at least one worker running. This is done on the assumption that if work
    is running, then it will complete fast.

    Add an option to force io-wq to fork a new worker for work queued. This
    is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that
    case, io-wq will create a new worker, even though workers are already
    running.
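
    An illustration of the enqueue-side decision (names simplified):

    /* Normally a new worker is only created when none are currently
     * running; IO_WQ_WORK_CONCURRENT forces one even if workers exist. */
    static bool io_wqe_need_new_worker(struct io_wqe_acct *acct,
                                       struct io_wq_work *work)
    {
            if (work->flags & IO_WQ_WORK_CONCURRENT)
                    return true;
            return !atomic_read(&acct->nr_running);
    }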

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Not all work can be cancelled; for some of it we need to guarantee
    that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
    on work that must not be cancelled. Note that the caller's work function
    must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
    IO_WQ_WORK_CANCEL.
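
    The caller-side check can be sketched like this (io_work_should_issue()
    is a hypothetical helper, simplified from what io_uring's work handler
    has to do):

    /* Decide whether a work item flagged for cancellation should still be
     * issued: must-complete work ignores the cancel request. */
    static bool io_work_should_issue(struct io_wq_work *work)
    {
            if (!(work->flags & IO_WQ_WORK_CANCEL))
                    return true;
            return work->flags & IO_WQ_WORK_NO_CANCEL;
    }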

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Dec, 2019

1 commit


11 Dec, 2019

1 commit


05 Dec, 2019

1 commit

  • If someone removes a node from a list, and then later adds it back to
    a list, we can have invalid data in ->next. This can cause all sorts
    of issues. One such use case is the IORING_OP_POLL_ADD command, which
    will do just that if we race and get woken twice without any pending
    events. This is a pretty rare case, but can happen under extreme loads.
    Dan reports that he saw the following crash:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
    Oops: 0002 [#1] SMP
    CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
    Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
    RIP: 0010:io_wqe_enqueue+0x3e/0xd0
    Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
    RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
    RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
    RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
    RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
    R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
    FS: 00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:

    io_poll_wake+0x12f/0x2a0
    __wake_up_common+0x86/0x120
    __wake_up_common_lock+0x7a/0xc0
    sock_def_readable+0x3c/0x70
    tcp_rcv_established+0x557/0x630
    tcp_v6_do_rcv+0x118/0x3c0
    tcp_v6_rcv+0x97e/0x9d0
    ip6_protocol_deliver_rcu+0xe3/0x440
    ip6_input+0x3d/0xc0
    ? ip6_protocol_deliver_rcu+0x440/0x440
    ipv6_rcv+0x56/0xd0
    ? ip6_rcv_finish_core.isra.18+0x80/0x80
    __netif_receive_skb_one_core+0x50/0x70
    netif_receive_skb_internal+0x2f/0xa0
    napi_gro_receive+0x125/0x150
    mlx5e_handle_rx_cqe+0x1d9/0x5a0
    ? mlx5e_poll_tx_cq+0x305/0x560
    mlx5e_poll_rx_cq+0x49f/0x9c5
    mlx5e_napi_poll+0xee/0x640
    ? smp_reschedule_interrupt+0x16/0xd0
    ? reschedule_interrupt+0xf/0x20
    net_rx_action+0x286/0x3d0
    __do_softirq+0xca/0x297
    irq_exit+0x96/0xa0
    do_IRQ+0x54/0xe0
    common_interrupt+0xf/0xf

    RIP: 0033:0x7fdc627a2e3a
    Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10

    when running a networked workload with about 5000 sockets being polled
    for. Fix this by clearing node->next when the node is being removed from
    the list.
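
    Sketched against io-wq's singly linked work list (the same node/list
    shape as in the sketch under the 30 Dec, 2020 entry above; the caller
    supplies the previous node because there is no ->prev pointer):

    static inline void wq_node_del(struct io_wq_work_list *list,
                                   struct io_wq_work_node *node,
                                   struct io_wq_work_node *prev)
    {
            if (node == list->first)
                    list->first = node->next;
            if (node == list->last)
                    list->last = prev;
            if (prev)
                    prev->next = node->next;
            node->next = NULL;      /* the fix: no stale link on re-add */
    }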

    Fixes: 6206f0e180d4 ("io-wq: shrink io_wq_work a bit")
    Reported-by: Dan Melnic
    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Dec, 2019

1 commit


02 Dec, 2019

1 commit

  • syzbot reports:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
    RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
    RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
    Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
    24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84
    c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
    RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
    RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
    RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
    R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
    R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
    kthread+0x361/0x430 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
    Modules linked in:
    ---[ end trace f2e1a4307fbe2245 ]---
    RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
    RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
    RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
    Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
    24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84
    c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
    RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
    RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
    RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
    R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
    R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    which is caused by slab fault injection triggering a failure in
    prepare_creds(). We don't actually need to create a copy of the creds,
    as we're not modifying them; we just need a reference on the current
    task's creds. This avoids the failure case as well, and propagates
    const throughout the stack.
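
    The change boils down to pinning a reference instead of allocating a
    copy; a sketch using the kernel cred API (the wrapper function is
    purely illustrative):

    /* Pin the submitting task's creds (no allocation, cannot fail) and
     * temporarily assume them for the async work. */
    static void run_with_submitter_creds(void (*fn)(void))
    {
            const struct cred *creds = get_current_cred();
            const struct cred *old = override_creds(creds);

            fn();                   /* do the async work */

            revert_creds(old);
            put_cred(creds);
    }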

    Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
    Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Nov, 2019

1 commit

  • Currently we're using 40 bytes for the io_wq_work structure, and 16 of
    those are the doubly linked list node. We don't need doubly linked
    lists: we always add to the tail to keep things ordered, and any other
    use case is list traversal with deletion. For the deletion case, we can
    easily support removal of any node by keeping track of the previous
    entry.

    This shrinks io_wq_work to 32 bytes, and subsequently shrinks io_uring's
    io_kiocb from 216 to 208 bytes.
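
    As a size illustration only (64-bit pointers; these are not the kernel
    definitions):

    /* Before: an embedded doubly linked node needs two pointers, 16 bytes. */
    struct dlist_node {
            struct dlist_node *next, *prev;
    };

    /* After: a single forward pointer, 8 bytes. Arbitrary deletion still
     * works because list walkers remember the previous node as they go. */
    struct slist_node {
            struct slist_node *next;
    };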

    Signed-off-by: Jens Axboe

    Jens Axboe
     

26 Nov, 2019

3 commits

  • If we don't inherit the original task creds, then we can confuse users
    like fuse that pass creds in the request header. See the link below for
    an identical aio issue.

    Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We currently pass in 4 arguments outside of the bounded size. In
    preparation for adding one more argument, let's bundle them up in
    a struct to make it more readable.

    No functional changes in this patch.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When we find new work to process within the work handler, we queue the
    linked timeout before we have issued the new work. This can be
    problematic for very short timeouts, as we have a window where the new
    work isn't visible.

    Allow the work handler to store a callback function for this in the work
    item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
    is set, then io-wq will call the callback when it has set up the new work
    item.

    Reported-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Nov, 2019

1 commit

  • For cancellation, we need to ensure that the work item stays valid for
    as long as ->cur_work is valid. Right now we can't safely dereference
    the work item even under the wqe->lock, because while the ->cur_work
    pointer will remain valid, the work could be completing and be freed
    in parallel.

    Only invoke ->get/put_work() on items we know that the caller queued
    themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed
    when we're queueing a flush item, for instance.

    Signed-off-by: Jens Axboe

    Jens Axboe