22 Oct, 2020

1 commit

  • We correctly set io-wq NUMA node affinities when the io-wq context is
    set up, but if an entire node's CPU set is offlined and then brought
    back online, the per-node affinities are broken. Ensure that we set
    them again whenever a CPU comes online, so that we always track the
    right node affinity. The usual cpuhp notifiers are used to drive it.

    Reported-by: Zhang Qiang
    Signed-off-by: Jens Axboe

    Jens Axboe
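
    As an illustration of the mechanism, a minimal sketch of hooking the
    cpuhp multi-instance API for this purpose follows; the state name, the
    io_wq_cpu_online() helper and the wq fields it touches are assumptions
    for the sketch, not the actual patch.

    #include <linux/cpuhotplug.h>
    #include <linux/sched.h>
    #include <linux/topology.h>

    static enum cpuhp_state io_wq_online;

    /* Re-apply per-node worker affinity whenever a CPU comes online. */
    static int io_wq_cpu_online(unsigned int cpu, struct hlist_node *node)
    {
            /* io_wq, cpuhp_node, wqes[] and worker are placeholder fields */
            struct io_wq *wq = hlist_entry_safe(node, struct io_wq, cpuhp_node);
            int i;

            for (i = 0; i < wq->nr_wqes; i++)
                    set_cpus_allowed_ptr(wq->wqes[i]->worker,
                                         cpumask_of_node(wq->wqes[i]->node));
            return 0;
    }

    /*
     * Once at init:
     *     io_wq_online = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
     *                     "io-wq/online", io_wq_cpu_online, NULL);
     * Per io-wq instance:
     *     cpuhp_state_add_instance_nocalls(io_wq_online, &wq->cpuhp_node);
     */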
     

21 Oct, 2020

1 commit

  • This one was missed in the earlier conversion; it should be included
    like any of the other IO identity flags. Make sure we restore to
    RLIM_INFINITY when dropping the personality again.

    Fixes: 98447d65b4a7 ("io_uring: move io identity items into separate struct")
    Signed-off-by: Jens Axboe

    Jens Axboe
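
    For context, the other identity items follow a save/apply/restore
    pattern around work execution; a minimal sketch of doing the same for
    RLIMIT_FSIZE (the io_identity_stub struct and helper names are made up
    for the sketch):

    #include <linux/sched/signal.h>
    #include <linux/resource.h>

    struct io_identity_stub {
            unsigned long fsize;
    };

    /* Capture the submitter's file-size limit when grabbing the identity. */
    static void io_grab_fsize(struct io_identity_stub *id)
    {
            id->fsize = rlimit(RLIMIT_FSIZE);
    }

    /* Apply it in the worker before running the work item ... */
    static void io_apply_fsize(const struct io_identity_stub *id)
    {
            current->signal->rlim[RLIMIT_FSIZE].rlim_cur = id->fsize;
    }

    /* ... and restore the default when dropping the personality again. */
    static void io_drop_fsize(void)
    {
            current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
    }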
     

01 Oct, 2020

5 commits

  • This flag is no longer used, remove it.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The smart syzbot has found a reproducer for the following issue:

    ==================================================================
    BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
    BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
    BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
    BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
    Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771

    CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x198/0x1fd lib/dump_stack.c:118
    print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
    __kasan_report mm/kasan/report.c:513 [inline]
    kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
    check_memory_region_inline mm/kasan/generic.c:186 [inline]
    check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
    instrument_atomic_write include/linux/instrumented.h:71 [inline]
    atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
    io_wqe_inc_running fs/io-wq.c:301 [inline]
    io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
    schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
    io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
    kthread+0x3b5/0x4a0 kernel/kthread.c:292
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

    Allocated by task 7768:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
    kasan_set_track mm/kasan/common.c:56 [inline]
    __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
    kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
    kmalloc_node include/linux/slab.h:572 [inline]
    kzalloc_node include/linux/slab.h:677 [inline]
    io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
    io_init_wq_offload fs/io_uring.c:7432 [inline]
    io_sq_offload_start fs/io_uring.c:7504 [inline]
    io_uring_create fs/io_uring.c:8625 [inline]
    io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Freed by task 21:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
    kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
    kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
    __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
    __cache_free mm/slab.c:3418 [inline]
    kfree+0x10e/0x2b0 mm/slab.c:3756
    __io_wq_destroy fs/io-wq.c:1138 [inline]
    io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
    io_finish_async fs/io_uring.c:6836 [inline]
    io_ring_ctx_free fs/io_uring.c:7870 [inline]
    io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
    process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
    worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
    kthread+0x3b5/0x4a0 kernel/kthread.c:292
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

    The buggy address belongs to the object at ffff8882183db000
    which belongs to the cache kmalloc-1k of size 1024
    The buggy address is located 140 bytes inside of
    1024-byte region [ffff8882183db000, ffff8882183db400)
    The buggy address belongs to the page:
    page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
    flags: 0x57ffe0000000200(slab)
    raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
    raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    which is down to the code below,

        /* all workers gone, wq exit can proceed */
        if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
                complete(&wqe->wq->done);

    because there may be multiple wqe instances in a wq, and we would wait
    for every worker in every wqe to go home before releasing the wq's
    resources on destroy.

    To that end, rework the wq's refcount by making it independent of
    worker tracking, because after all they are two different things, and
    keep it balanced as workers come and go. Note that the manager kthread,
    like the other workers, now holds a reference to the wq for its
    lifetime.

    Finally, to help destroy the wq, check IO_WQ_BIT_EXIT when creating a
    worker and do nothing if the wq is exiting.

    Cc: stable@vger.kernel.org # v5.5+
    Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
    Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
    Cc: Pavel Begunkov
    Signed-off-by: Hillf Danton
    Signed-off-by: Jens Axboe

    Hillf Danton
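
    A rough sketch of the reworked lifetime scheme: workers (and the
    manager) each pin the wq with a reference, and the destroy path waits
    for the last reference to go away. The helper names are illustrative;
    the refs/done/state fields mirror what struct io_wq carries.

    #include <linux/refcount.h>
    #include <linux/completion.h>

    /* Every worker (and the manager kthread) pins the wq for its lifetime. */
    static bool io_worker_get_wq(struct io_wq *wq)
    {
            if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
                    return false;   /* wq is exiting, don't create a worker */
            refcount_inc(&wq->refs);
            return true;
    }

    static void io_worker_put_wq(struct io_wq *wq)
    {
            if (refcount_dec_and_test(&wq->refs))
                    complete(&wq->done);    /* last user gone */
    }

    /* Destroy path: flag exit, drop the initial reference, wait it out. */
    static void io_wq_destroy_wait(struct io_wq *wq)
    {
            set_bit(IO_WQ_BIT_EXIT, &wq->state);
            io_worker_put_wq(wq);
            wait_for_completion(&wq->done);
    }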
     
  • There are a few operations that are offloaded to the worker threads. In
    this case, we lose process context and end up in kthread context. This
    results in IOs not being accounted to the issuing cgroup and
    consequently ending up attributed to root. Just like the others, adopt
    the personality of the blkcg too when issuing via the workqueues.

    The SQPOLL thread will live in and attach to the cgroup context it was
    initialized in.

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
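
    The association itself boils down to kthread_associate_blkcg() around
    each work item; a minimal sketch, assuming the captured identity
    carries the submitter's blkcg css:

    #include <linux/blk-cgroup.h>

    /* Attach the worker kthread to the submitter's blkcg before issuing. */
    static void io_wq_adopt_blkcg(struct cgroup_subsys_state *blkcg_css)
    {
            if (blkcg_css)
                    kthread_associate_blkcg(blkcg_css);
    }

    /* Detach again once the work item is done. */
    static void io_wq_drop_blkcg(void)
    {
            kthread_associate_blkcg(NULL);
    }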
     
  • During a context switch the scheduler invokes wq_worker_sleeping() with
    disabled preemption. Disabling preemption is needed because it protects
    access to `worker->sleeping'. As an optimisation it avoids invoking
    schedule() within the schedule path as part of possible wake up (thus
    preempt_enable_no_resched() afterwards).

    The io-wq has been added to the mix in the same section with disabled
    preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
    acquires a spinlock_t. Also within the schedule() the spinlock_t must be
    acquired after tsk_is_pi_blocked() otherwise it will block on the
    sleeping lock again while scheduling out.

    While playing with `io_uring-bench' I didn't notice a significant
    latency spike after converting io_wqe::lock to a raw_spinlock_t. The
    latency was more or less the same.

    In order to keep the spinlock_t it would have to be moved after the
    tsk_is_pi_blocked() check which would introduce a branch instruction
    into the hot path.

    The lock is used to maintain the `work_list' and wakes up at most one
    task. Should io_wqe_cancel_pending_work() cause latency spikes while
    searching for a specific item, it would need to drop the lock during
    iteration.
    revert_creds() is also invoked under the lock. According to debugging,
    cred::non_rcu is 0. Otherwise it should be moved outside of the locked
    section because put_cred_rcu()->free_uid() acquires a sleeping lock.

    Convert io_wqe::lock to a raw_spinlock_t.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Jens Axboe

    Sebastian Andrzej Siewior
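
    The conversion itself is mechanical: the lock type changes and every
    lock site switches to the raw_* variants. A simplified sketch (the
    struct is abridged and uses a plain list for brevity):

    #include <linux/spinlock.h>
    #include <linux/list.h>

    struct io_wqe_stub {
            raw_spinlock_t lock;            /* was: spinlock_t lock; */
            struct list_head work_list;
    };

    static void io_wqe_insert_work(struct io_wqe_stub *wqe,
                                   struct list_head *work)
    {
            raw_spin_lock_irq(&wqe->lock);
            list_add_tail(work, &wqe->work_list);
            raw_spin_unlock_irq(&wqe->lock);
    }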
     
  • If we don't get and assign the namespace for the async work, then certain
    paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
    Anything that references the current namespace of the given task should
    be assigned for async work on behalf of that task.

    Cc: stable@vger.kernel.org # v5.5+
    Reported-by: Al Viro
    Signed-off-by: Jens Axboe

    Jens Axboe
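
    A sketch of what grabbing and adopting the namespaces looks like,
    assuming the captured identity holds an nsproxy pointer much like it
    does for ->files:

    #include <linux/nsproxy.h>
    #include <linux/sched/task.h>

    /* At submission time: pin the issuing task's namespaces. */
    static struct nsproxy *io_grab_nsproxy(void)
    {
            get_nsproxy(current->nsproxy);
            return current->nsproxy;
    }

    /* In the worker: adopt them (alongside ->files) under task_lock(). */
    static void io_adopt_nsproxy(struct nsproxy *nsproxy)
    {
            task_lock(current);
            current->nsproxy = nsproxy;
            task_unlock(current);
    }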
     

12 Jun, 2020

1 commit

  • Pull io_uring fixes from Jens Axboe:
    "A few late stragglers in here. In particular:

    - Validate full range for provided buffers (Bijan)

    - Fix bad use of kfree() in buffer registration failure (Denis)

    - Don't allow close of ring itself, it's not fully safe. Making it
    fully safe would require making the system call more expensive,
    which isn't worth it.

    - Buffer selection fix

    - Regression fix for O_NONBLOCK retry

    - Make IORING_OP_ACCEPT honor O_NONBLOCK (Jiufei)

    - Restrict opcode handling for SQ/IOPOLL (Pavel)

    - io-wq work handling cleanups and improvements (Pavel, Xiaoguang)

    - IOPOLL race fix (Xiaoguang)"

    * tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block:
    io_uring: fix io_kiocb.flags modification race in IOPOLL mode
    io_uring: check file O_NONBLOCK state for accept
    io_uring: avoid unnecessary io_wq_work copy for fast poll feature
    io_uring: avoid whole io_wq_work copy for requests completed inline
    io_uring: allow O_NONBLOCK async retry
    io_wq: add per-wq work handler instead of per work
    io_uring: don't arm a timeout through work.func
    io_uring: remove custom ->func handlers
    io_uring: don't derive close state from ->func
    io_uring: use kvfree() in io_sqe_buffer_register()
    io_uring: validate the full range of provided buffers for access
    io_uring: re-set iov base/len for buffer select retry
    io_uring: move send/recv IOPOLL check into prep
    io_uring: deduplicate io_openat{,2}_prep()
    io_uring: do build_open_how() only once
    io_uring: fix {SQ,IO}POLL with unsupported opcodes
    io_uring: disallow close of ring itself

    Linus Torvalds
     

11 Jun, 2020

3 commits

  • Some architectures like arm64 and s390 require USER_DS to be set for
    kernel threads to access user address space, which is the whole purpose
    of kthread_use_mm, but others like x86 don't. That has led to a huge
    mess where some callers are fixed up once they are tested on said
    architectures, while others linger around and yet others like io_uring
    try to do "clever" optimizations for what usually is just a trivial
    assignment to a member in the thread_struct for most architectures.

    Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the
    previous value instead.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Tested-by: Jens Axboe
    Reviewed-by: Jens Axboe
    Acked-by: Michael S. Tsirkin
    Cc: Alex Deucher
    Cc: Al Viro
    Cc: Felipe Balbi
    Cc: Felix Kuehling
    Cc: Jason Wang
    Cc: Zhenyu Wang
    Cc: Zhi Wang
    Cc: Greg Kroah-Hartman
    Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
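
    Conceptually the change folds the addressing-limit switch into the
    helpers themselves; a sketch of the idea using the pre-5.10
    set_fs()/USER_DS interface (simplified, the real code stashes the old
    value in struct kthread rather than a global):

    #include <linux/uaccess.h>

    static mm_segment_t sketch_oldfs;

    static void sketch_kthread_use_mm(struct mm_struct *mm)
    {
            /* ... switch current->mm/active_mm over to @mm as before ... */
            sketch_oldfs = get_fs();
            set_fs(USER_DS);        /* now done for every caller */
    }

    static void sketch_kthread_unuse_mm(struct mm_struct *mm)
    {
            set_fs(sketch_oldfs);   /* restore the previous limit */
            /* ... drop @mm from current as before ... */
    }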
     
  • Switch the function documentation to kerneldoc comments, and add
    WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
    not have ->mm set (or has ->mm set in the case of unuse_mm).

    Also give the functions a kthread_ prefix to better document the use case.

    [hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
    Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
    [sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
    Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Tested-by: Jens Axboe
    Reviewed-by: Jens Axboe
    Acked-by: Felix Kuehling
    Acked-by: Greg Kroah-Hartman [usb]
    Acked-by: Haren Myneni
    Cc: Alex Deucher
    Cc: Al Viro
    Cc: Felipe Balbi
    Cc: Jason Wang
    Cc: "Michael S. Tsirkin"
    Cc: Zhenyu Wang
    Cc: Zhi Wang
    Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
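
    The shape of the added documentation and asserts, sketched for
    kthread_use_mm() (abridged; the real function of course also performs
    the mm switch):

    /**
     * kthread_use_mm - make the calling kthread operate on an address space
     * @mm: address space to operate on
     */
    void kthread_use_mm(struct mm_struct *mm)
    {
            /* only kernel threads without an mm of their own may call this */
            WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
            WARN_ON_ONCE(current->mm);

            /* ... take over @mm ... */
    }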
     
  • Patch series "improve use_mm / unuse_mm", v2.

    This series improves the use_mm / unuse_mm interface by better
    documenting the assumptions, and by taking the set_fs manipulations
    spread over the callers into the core API.

    This patch (of 3):

    Use the proper API instead.

    Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de

    These helpers are only for use with kernel threads, and I will tie them
    more into the kthread infrastructure going forward. Also move the
    prototypes to kthread.h - mmu_context.h was a little weird to start with
    as it otherwise contains very low-level MM bits.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Tested-by: Jens Axboe
    Reviewed-by: Jens Axboe
    Acked-by: Felix Kuehling
    Cc: Alex Deucher
    Cc: Al Viro
    Cc: Felipe Balbi
    Cc: Jason Wang
    Cc: "Michael S. Tsirkin"
    Cc: Zhenyu Wang
    Cc: Zhi Wang
    Cc: Greg Kroah-Hartman
    Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
    Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
    Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

09 Jun, 2020

1 commit

  • io_uring is the only user of io-wq, and now it uses only one io-wq
    callback for all its requests, namely io_wq_submit_work(). Instead of
    storing the work->runner callback in each instance of io_wq_work, keep
    it in io-wq itself.

    pros:
    - reduces io_wq_work size
    - more robust -- ->func won't be invalidated with mem{cpy,set}(req)
    - helps other work

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
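
    With this change the handler travels in the wq setup data instead of in
    every work item; roughly as below (treat the typedef and field layout
    as a sketch of that era's fs/io-wq.h, not a verbatim copy):

    typedef void (io_wq_work_fn)(struct io_wq_work **);

    struct io_wq_data {
            struct user_struct *user;
            io_wq_work_fn *do_work;         /* one handler for all requests */
    };

    /*
     * io_wq_work then no longer needs a per-instance ->func pointer; the
     * worker simply calls wq->do_work(&work) for each item it picks up.
     */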
     

24 Mar, 2020

1 commit

  • We always punt async buffered writes to an io-wq helper, as the core
    kernel does not have IOCB_NOWAIT support for that. Most buffered async
    writes complete very quickly, as it's just a copy operation. This means
    that doing multiple locking roundtrips on the shared wqe lock for each
    buffered write is wasteful. Additionally, buffered writes are hashed
    work items, which means that any buffered write to a given file is
    serialized.

    Keep identically hashed work items contiguous in @wqe->work_list, and
    track a tail for each hash bucket. On dequeue of a hashed item, splice
    all items with the same hash in one go using the tracked tail. Until
    the batch is done, the caller doesn't have to synchronize with the wqe
    or worker locks again.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
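
    A simplified sketch of the per-bucket tail bookkeeping, using a plain
    singly-linked list in place of io_wq_work_list (the bucket count and
    helper are assumptions for the sketch):

    #define NR_HASH_BUCKETS         (1U << 6)

    struct work_node {
            struct work_node *next;
            unsigned int hash;              /* bucket this item hashes to */
    };

    struct wqe_stub {
            struct work_node *work_head;
            struct work_node *hash_tail[NR_HASH_BUCKETS];
    };

    /*
     * Keep same-hash items contiguous: insert right after the recorded
     * tail for that bucket and advance the tail.
     */
    static void enqueue_hashed(struct wqe_stub *wqe, struct work_node *work)
    {
            struct work_node *tail = wqe->hash_tail[work->hash];

            wqe->hash_tail[work->hash] = work;
            if (!tail) {
                    /* no batch yet: simplified, just push at the head */
                    work->next = wqe->work_head;
                    wqe->work_head = work;
                    return;
            }
            work->next = tail->next;        /* splice after the old tail */
            tail->next = work;
    }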
     

23 Mar, 2020

1 commit

  • After io_assign_current_work() of a linked work, it may be decided to
    offload it to another thread via io_wqe_enqueue(). However, until the
    next io_assign_current_work() it can be cancelled, and that isn't
    handled.

    Don't assign it if it's not going to be executed.

    Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

15 Mar, 2020

3 commits

  • Enable io-wq hashing stuff for dependent works simply by re-enqueueing
    such requests.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • It's a preparation patch removing io_wq_enqueue_hashed(), whose job
    should now be done by io_wq_hash_work() + io_wq_enqueue().

    Also, set the hash value for dependent works, and do it as late as
    possible, because req->file can be unavailable before that point. This
    hash will be ignored by io-wq.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
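
    The replacement pair is small: io_wq_hash_work() just folds a hash of a
    key (typically the file's inode) into the work flags, along the lines
    of the sketch below (the shift and order values are illustrative):

    #include <linux/hash.h>

    #define IO_WQ_HASH_ORDER        5
    #define IO_WQ_HASH_SHIFT        24      /* upper bits of work->flags */

    void io_wq_hash_work(struct io_wq_work *work, void *val)
    {
            unsigned int bit = hash_ptr(val, IO_WQ_HASH_ORDER);

            work->flags |= IO_WQ_WORK_HASHED | (bit << IO_WQ_HASH_SHIFT);
    }

    /*
     * Callers then do:
     *      io_wq_hash_work(&req->work, file_inode(req->file));
     *      io_wq_enqueue(wq, &req->work);
     */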
     
  • This little tweak restores the behaviour that existed before the recent
    io_worker_handle_work() optimisation patches. It makes the function do
    cond_resched() and flush_signals() only if there is actual work to
    execute.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

05 Mar, 2020

4 commits

  • First, it changes the io-wq interfaces: it replaces {get,put}_work()
    with free_work(), which is guaranteed to be called exactly once. It
    also enforces that the free_work() callback is non-NULL.

    io_uring follows the changes, and instead of putting a submission
    reference in io_put_req_async_completion(), it will be done in
    io_free_work(). As this removes io_get_work() along with the
    corresponding refcount_inc(), the ref balance is maintained.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
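
    The new contract in a nutshell: the wq invokes a single, mandatory
    free_work() hook exactly once per work item, so io_uring can drop the
    submission reference there. A sketch, with the io_uring side abridged:

    /* io-wq side: free_work must be non-NULL and is called exactly once. */
    typedef void (free_work_fn)(struct io_wq_work *);

    struct io_wq_data {
            free_work_fn *free_work;
    };

    /* io_uring side: put the submission reference in the hook. */
    static void io_free_work(struct io_wq_work *work)
    {
            struct io_kiocb *req = container_of(work, struct io_kiocb, work);

            io_put_req(req);        /* drop submission ref exactly once */
    }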
     
  • When executing non-linked hashed work, io_worker_handle_work()
    will lock-unlock wqe->lock to update hash, and then immediately
    lock-unlock to get next work. Optimise this case and do
    lock/unlock only once.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • There are 2 optimisations:
    - Now, io_worker_handle_work() does io_assign_current_work() twice per
    request, and each one adds a lock/unlock(worker->lock) pair. The first
    is to reset worker->cur_work to NULL, and the second is to set a real
    work shortly after. If there is a dependent work, set it immediately,
    which effectively removes the extra NULL'ing.

    - And there is no use in taking wqe->lock for linked works, as they are
    not hashed now. Optimise it out.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • This is a preparation patch, it adds some helpers and makes
    the next patches cleaner.

    - extract io_impersonate_work() and io_assign_current_work()
    - replace @next label with nested do-while
    - move put_work() right after NULL'ing cur_work.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

03 Mar, 2020

4 commits

  • @hash_map is unsigned long, but BIT_ULL() is used for manipulations.
    BIT() is a better match, as it returns exactly an unsigned long value.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
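
    For reference, the two macros differ only in the width of the constant,
    which is why BIT() matches an unsigned long bitmap exactly:

    #define BIT(nr)         (1UL << (nr))   /* unsigned long */
    #define BIT_ULL(nr)     (1ULL << (nr))  /* unsigned long long */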
     
  • IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
    before the work setup (i.e. mm, override creds, etc). The setup
    shouldn't take long, so it's ok to arm it a bit later and get rid
    of IO_WQ_WORK_CB.

    Make io-wq call work->func() only once; callbacks will handle the rest,
    i.e. the linked timeout handler will do the actual issue. And as a
    bonus, it removes an extra indirect call.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • IO_WQ_WORK_HAS_MM is set but never used, remove it.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • io_wq_flush() is buggy: during cancellation of a flush, the associated
    work may be passed to the caller's (i.e. io_uring) @match callback.
    That callback is expecting it to be embedded in struct io_kiocb.
    Cancellation of internal work probably doesn't make a lot of sense to
    begin with.

    As the flush helper is no longer used, just delete it and the associated
    work flag.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov