05 Mar, 2020

2 commits

  • There are 2 optimisations:
    - Now, io_worker_handle_work() does io_assign_current_work() twice per
    request, and each one adds a lock/unlock(worker->lock) pair. The first
    resets worker->cur_work to NULL, and the second sets a real work shortly
    after. If there is a dependent work, set it immediately, which
    effectively removes the extra NULL'ing.

    - And there is no use in taking wqe->lock for linked works, as they are
    not hashed now. Optimise it out.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • This is a preparation patch: it adds some helpers and makes
    the next patches cleaner.

    - extract io_impersonate_work() and io_assign_current_work()
    - replace @next label with nested do-while
    - move put_work() right after NULL'ing cur_work.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

04 Mar, 2020

3 commits

  • If, after dropping the submission reference, req->refs == 1, the request
    is done, because the remaining reference is for io_put_work() and will be
    dropped synchronously shortly after. In this case it's safe to steal the
    next work from the request.
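
    A hedged sketch of the idea (the helper names used here are assumptions,
    not necessarily the exact ones in fs/io_uring.c): with only the
    io_put_work() reference left, nothing else can complete the request
    concurrently, so its linked work can be handed to io-wq directly.

        /* illustrative kernel-style fragment, not the actual patch */
        if (refcount_read(&req->refs) == 1) {
                struct io_kiocb *nxt = NULL;

                /* the remaining ref belongs to io_put_work() and is dropped
                 * synchronously right after, so stealing the next work is safe */
                io_req_find_next(req, &nxt);
                if (nxt)
                        io_wq_assign_next(workptr, nxt);
        }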

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • There will be no use for @nxt in the handlers, and it doesn't work
    anyway, so purge it.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • The rule is simple: any async handler gets a submission ref and should
    put it at the end. Make them all follow it, which makes things more
    consistent.

    This is a preparation patch; as io_wq_assign_next() currently never takes
    effect, this doesn't bother to use io_put_req_find_next() instead of
    io_put_req().

    Signed-off-by: Pavel Begunkov

    refcount_inc_not_zero() -> refcount_inc() fix.

    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

03 Mar, 2020

23 commits

  • Don't abuse labels for plain and straightforward code.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Clang warns:

    fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized
    whenever 'if' condition is false [-Wsometimes-uninitialized]
    if (def->pollin)
    ^~~~~~~~~~~
    fs/io_uring.c:4182:2: note: uninitialized use occurs here
    mask |= POLLERR | POLLPRI;
    ^~~~
    fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always
    true
    if (def->pollin)
    ^~~~~~~~~~~~~~~~
    fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence
    this warning
    __poll_t mask, ret;
    ^
    = 0
    1 warning generated.

    io_op_defs has many definitions where pollin is not set, so mask indeed
    might be uninitialized. Initialize it to zero and change the next
    assignment to |=, so that if further mask bits are added in the future,
    the assignment does not need to change.
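
    A minimal sketch of the resulting pattern (illustrative only, not the
    exact fs/io_uring.c hunk):

        __poll_t mask = 0;      /* defined even when no poll direction is set */

        if (def->pollin)
                mask |= POLLIN | POLLRDNORM;
        /* |= keeps this correct if further mask bits are added above later */
        mask |= POLLERR | POLLPRI;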

    Fixes: d7718a9d25a6 ("io_uring: use poll driven retry for files that support it")
    Link: https://github.com/ClangBuiltLinux/linux/issues/916
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Jens Axboe

    Nathan Chancellor
     
  • io-wq cares about the IO_WQ_WORK_UNBOUND flag only while enqueueing, so
    it's useless to set it for the next req of a link. Thus, remove it
    from io_prep_linked_timeout(), and inline the function.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Once __io_queue_sqe() has ended up in io_queue_async_work(), it's already
    known that there is no @nxt req, so skip the check and return from the
    function.

    Also, @nxt initialisation can now be done just before
    io_put_req_find_next(), as there is no jumping until it's checked.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Currently io_uring tries any request in a non-blocking manner, if it can,
    and then retries from a worker thread if we get -EAGAIN. Now that we have
    a new and fancy poll based retry backend, use that to retry requests if
    the file supports it.

    This means that, for example, an IORING_OP_RECVMSG on a socket no longer
    requires an async thread to complete the IO. If we get -EAGAIN reading
    from the socket in a non-blocking manner, we arm a poll handler for
    notification on when the socket becomes readable. When it does, the
    pending read is executed directly by the task again, through the io_uring
    task work handlers. Not only is this faster and more efficient, it also
    means we're not generating potentially tons of async threads that just
    sit and block, waiting for the IO to complete.

    The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
    pollable IO is fast, and that poll<link>other_op chains are fast as well.
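
    A small userspace sketch for detecting the feature (assuming a liburing
    that provides io_uring_queue_init_params() and the IORING_FEAT_FAST_POLL
    bit):

        #include <liburing.h>
        #include <stdio.h>

        int main(void)
        {
                struct io_uring ring;
                struct io_uring_params p = { };

                if (io_uring_queue_init_params(8, &ring, &p) < 0)
                        return 1;
                if (p.features & IORING_FEAT_FAST_POLL)
                        printf("pollable -EAGAIN is retried via poll, not async threads\n");
                io_uring_queue_exit(&ring);
                return 0;
        }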

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Add a pollin/pollout field to the request table, and have commands that
    we can safely poll for properly marked.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • For poll requests, it's not uncommon to link a read (or write) after
    the poll to execute immediately after the file is marked as ready.
    Since the poll completion is called inside the waitqueue wake up handler,
    we have to punt that linked request to async context. This slows down
    the processing, and actually means it's faster to not use a link for this
    use case.

    We also run into problems if the completion_lock is contended, as we're
    doing a different lock ordering than the issue side is. Hence we have
    to do trylock for completion, and if that fails, go async. Poll removal
    needs to go async as well, for the same reason.

    eventfd notification needs special case as well, to avoid stack blowing
    recursion or deadlocks.

    These are all deficiencies that were inherited from the aio poll
    implementation, but I think we can do better. When a poll completes,
    simply queue it up in the task poll list. When the task completes the
    list, we can run dependent links inline as well. This means we never
    have to go async, and we can remove a bunch of code associated with
    that, and optimizations to try and make that run faster. The diffstat
    speaks for itself.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Store the io_kiocb in the private field instead of the poll entry, this
    is in preparation for allowing multiple waitqueues.

    No functional changes in this patch.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg()
    if task->task_works == NULL && !PF_EXITING.

    And in fact the only reason why task_work_run() needs ->pi_lock is
    the possible race with task_work_cancel(); we can optimize this code
    and make the locking more clear.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Jens Axboe

    Oleg Nesterov
     
  • @hash_map is unsigned long, but BIT_ULL() is used for manipulations.
    BIT() is a better match, as it returns exactly an unsigned long value.
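
    For reference, the two macros differ only in the width of their constant
    (paraphrased from include/linux/bits.h; see the header for the
    authoritative definitions):

        #define BIT(nr)         (1UL << (nr))   /* unsigned long: matches @hash_map */
        #define BIT_ULL(nr)     (1ULL << (nr))  /* unsigned long long: wider than needed here */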

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
    before the work setup (i.e. mm, override creds, etc). The setup
    shouldn't take long, so it's ok to arm it a bit later and get rid
    of IO_WQ_WORK_CB.

    Make io-wq call work->func() only once; callbacks will handle the rest,
    i.e. the linked timeout handler will do the actual issue. And as a
    bonus, it removes an extra indirect call.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • IO_WQ_WORK_HAS_MM is set but never used, remove it.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • io_recvmsg() and io_sendmsg() duplicate the nonblock -EAGAIN finalising
    part, so add a helper for that.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Deduplicate the call to io_cqring_fill_event(); plain and easy.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Add support for splice(2).

    - the output file is specified as sqe->fd, so it's handled by generic code
    - hash_reg_file is handled by generic code as well
    - len is 32-bit, but should be fine
    - fd_in is a registered file when SPLICE_F_FD_IN_FIXED is set, which
    is a splice flag (i.e. passed in sqe->splice_flags)
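
    A hedged userspace sketch of driving IORING_OP_SPLICE, assuming a liburing
    recent enough to provide io_uring_prep_splice():

        #include <errno.h>
        #include <liburing.h>

        /* splice up to @len bytes from a pipe into @out_fd via io_uring */
        static int splice_via_uring(struct io_uring *ring, int pipe_fd, int out_fd,
                                    unsigned int len)
        {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                struct io_uring_cqe *cqe;
                int ret;

                if (!sqe)
                        return -EAGAIN;
                /* off_in/off_out of -1 mean "use the file's current position" */
                io_uring_prep_splice(sqe, pipe_fd, -1, out_fd, -1, len, 0);
                io_uring_submit(ring);

                ret = io_uring_wait_cqe(ring, &cqe);
                if (ret < 0)
                        return ret;
                ret = cqe->res;                 /* bytes spliced, or -errno */
                io_uring_cqe_seen(ring, cqe);
                return ret;
        }

    With a registered fd_in, SPLICE_F_FD_IN_FIXED would be OR'ed into the flags
    argument and pipe_fd replaced by the fixed-file index.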

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Preparation without functional changes. Adds io_get_file(), which allows
    grabbing files not only into req->file.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Make do_splice() public, so other kernel parts can reuse it.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • req->in_async is not really needed; it only prevents propagation of
    @nxt for fast submissions that did not block. Remove it.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • io_prep_async_work(), called from io_wq_assign_next(), does many useless
    checks: io_req_work_grab_env() was already called during prep, and
    @do_hashed is never used. Add io_prep_next_work() -- a simplified version
    that can be called from io-wq.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Many operations define a custom work.func before getting into io-wq.
    There are several points against this:
    - it calls io_wq_assign_next() from outside io-wq, which may be confusing
    - a sync context would go unnecessarily through io_req_cancelled()
    - prototypes are quite different, so work != old_work looks strange
    - it makes async/sync responsibilities fuzzy
    - it adds extra overhead

    Don't call the generic path and io-wq handlers from each other; use
    helpers instead.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Don't drop an early reference, hang on to it and let the caller drop
    it. This makes it behave more like "regular" requests.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If the -EAGAIN happens because of a static condition, then a poll
    or later retry won't fix it; we must call it again from a blocking
    context. Play it safe and ensure that any -EAGAIN condition from read
    or write is retried from async context.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • io_wq_flush() is buggy: during cancelation of a flush, the associated
    work may be passed to the caller's (i.e. io_uring) @match callback. That
    callback expects it to be embedded in a struct io_kiocb. Cancelation
    of internal work probably doesn't make a lot of sense to begin with.

    As the flush helper is no longer used, just delete it and the associated
    work flag.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

02 Mar, 2020

5 commits

  • To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the
    callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may
    return the next work, which will be ignored and lost.

    Cancel the whole link.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Linus Torvalds
     
  • Pull ext4 fixes from Ted Ts'o:
    "Two more bug fixes (including a regression) for 5.6"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
    jbd2: fix data races at struct journal_head

    Linus Torvalds
     
  • Pull KVM fixes from Paolo Bonzini:
    "More bugfixes, including a few remaining "make W=1" issues such as too
    large frame sizes on some configurations.

    On the ARM side, the compiler was messing up shadow stacks between EL1
    and EL2 code, which is easily fixed with __always_inline"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: VMX: check descriptor table exits on instruction emulation
    kvm: x86: Limit the number of "kvm: disabled by bios" messages
    KVM: x86: avoid useless copy of cpufreq policy
    KVM: allow disabling -Werror
    KVM: x86: allow compiling as non-module with W=1
    KVM: Pre-allocate 1 cpumask variable per cpu for both pv tlb and pv ipis
    KVM: Introduce pv check helpers
    KVM: let declaration of kvm_get_running_vcpus match implementation
    KVM: SVM: allocate AVIC data structures based on kvm_amd module parameter
    arm64: Ask the compiler to __always_inline functions used by KVM at HYP
    KVM: arm64: Define our own swab32() to avoid a uapi static inline
    KVM: arm64: Ask the compiler to __always_inline functions used at HYP
    kvm: arm/arm64: Fold VHE entry/exit work into kvm_vcpu_run_vhe()
    KVM: arm/arm64: Fix up includes for trace.h

    Linus Torvalds
     
  • KVM emulates UMIP on hardware that doesn't support it by setting the
    'descriptor table exiting' VM-execution control and performing
    instruction emulation. When running nested, this emulation is broken as
    KVM refuses to emulate L2 instructions by default.

    Correct this regression by allowing the emulation of descriptor table
    instructions if L1 hasn't requested 'descriptor table exiting'.

    Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
    Reported-by: Jan Kiszka
    Cc: stable@vger.kernel.org
    Cc: Paolo Bonzini
    Cc: Jim Mattson
    Signed-off-by: Oliver Upton
    Signed-off-by: Paolo Bonzini

    Oliver Upton
     

01 Mar, 2020

4 commits

  • Pull i2c fixes from Wolfram Sang:
    "I2C has three driver bugfixes for you. We agreed on the Mac regression
    to go in via I2C"

    * 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
    macintosh: therm_windtunnel: fix regression when instantiating devices
    i2c: altera: Fix potential integer overflow
    i2c: jz4780: silence log flood on txabrt

    Linus Torvalds
     
  • If sbi->s_flex_groups_allocated is zero and the first allocation fails
    then this code will crash. The problem is that "i--" will set "i" to
    -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1
    is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX
    is more than zero, the condition is true so we call kvfree(new_groups[-1]).
    The loop will carry on freeing invalid memory until it crashes.
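
    A self-contained illustration of the promotion pitfall (standalone C, not
    the ext4 code itself):

        #include <stdio.h>

        int main(void)
        {
                unsigned int allocated = 0;  /* plays the role of s_flex_groups_allocated */
                int i = -1;                  /* what "i--" yields after the first allocation fails */

                /* -1 is converted to unsigned for the comparison and becomes
                 * UINT_MAX, so the cleanup loop's condition stays true and it
                 * walks off the front of the array */
                if (i >= allocated)
                        printf("condition is true: would free new_groups[-1] and beyond\n");
                return 0;
        }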

    Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access")
    Reviewed-by: Suraj Jitindar Singh
    Signed-off-by: Dan Carpenter
    Cc: stable@kernel.org
    Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain
    Signed-off-by: Theodore Ts'o

    Dan Carpenter
     
  • Removing attach_adapter from this driver caused a regression for at
    least some machines. Those machines had the sensors described in their
    DT, too, so they didn't need manual creation of the sensor devices. The
    old code worked, though, because manual creation came first. Creation of
    DT devices then failed later and caused error logs, but the sensors
    worked nonetheless because of the manually created devices.

    When removing attach_adapter, manual creation now comes later and loses
    the race. The sensor devices were already registered via DT, yet with
    another binding, so the driver could not be bound to them.

    This fix refactors the code to remove the race and only manually creates
    devices if there are no DT nodes present. Also, the DT binding is updated
    to match both the DT and manually created devices. Because we don't
    know which device creation will be used at runtime, the code to start
    the kthread is moved to do_probe(), which will be called by both methods.

    Fixes: 3e7bed52719d ("macintosh: therm_windtunnel: drop using attach_adapter")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=201723
    Reported-by: Erhard Furtner
    Tested-by: Erhard Furtner
    Acked-by: Michael Ellerman (powerpc)
    Signed-off-by: Wolfram Sang
    Cc: stable@kernel.org # v4.19+

    Wolfram Sang
     
  • journal_head::b_transaction and journal_head::b_next_transaction could
    be accessed concurrently as noticed by KCSAN,

    LTP: starting fsync04
    /dev/zero: Can't open blockdev
    EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
    EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
    ==================================================================
    BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]

    write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
    __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
    __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
    jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
    (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
    kjournald2+0x13b/0x450 [jbd2]
    kthread+0x1cd/0x1f0
    ret_from_fork+0x27/0x50

    read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
    jbd2_write_access_granted+0x1b2/0x250 [jbd2]
    jbd2_write_access_granted at fs/jbd2/transaction.c:1155
    jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
    __ext4_journal_get_write_access+0x50/0x90 [ext4]
    ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
    ext4_mb_new_blocks+0x54f/0xca0 [ext4]
    ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
    ext4_map_blocks+0x3b4/0x950 [ext4]
    _ext4_get_block+0xfc/0x270 [ext4]
    ext4_get_block+0x3b/0x50 [ext4]
    __block_write_begin_int+0x22e/0xae0
    __block_write_begin+0x39/0x50
    ext4_write_begin+0x388/0xb50 [ext4]
    generic_perform_write+0x15d/0x290
    ext4_buffered_write_iter+0x11f/0x210 [ext4]
    ext4_file_write_iter+0xce/0x9e0 [ext4]
    new_sync_write+0x29c/0x3b0
    __vfs_write+0x92/0xa0
    vfs_write+0x103/0x260
    ksys_write+0x9d/0x130
    __x64_sys_write+0x4c/0x60
    do_syscall_64+0x91/0xb05
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    5 locks held by fsync04/25724:
    #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
    #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
    #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
    #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
    #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
    irq event stamp: 1407125
    hardirqs last enabled at (1407125): [] __find_get_block+0x107/0x790
    hardirqs last disabled at (1407124): [] __find_get_block+0x49/0x790
    softirqs last enabled at (1405528): [] __do_softirq+0x34c/0x57c
    softirqs last disabled at (1405521): [] irq_exit+0xa2/0xc0

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    The plain reads are outside of the jh->b_state_lock critical section,
    which results in data races. Fix them by adding pairs of READ|WRITE_ONCE().
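
    The usual shape of such an annotation, sketched (field names from the
    report above; the exact hunks are in the linked patch):

        /* writer side, under jh->b_state_lock */
        WRITE_ONCE(jh->b_transaction, new_transaction);

        /* lockless reader, e.g. the RCU path in jbd2_write_access_granted() */
        transaction = READ_ONCE(jh->b_transaction);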

    Reviewed-by: Jan Kara
    Signed-off-by: Qian Cai
    Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
    Signed-off-by: Theodore Ts'o

    Qian Cai
     

29 Feb, 2020

3 commits

  • Pull SCSI fixes from James Bottomley:
    "Four small fixes.

    Three are in drivers for fairly obvious bugs. The fourth is a set of
    regressions introduced by the compat_ioctl changes because some of the
    compat updates wrongly replaced .ioctl instead of .compat_ioctl"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    scsi: compat_ioctl: cdrom: Replace .ioctl with .compat_ioctl in four appropriate places
    scsi: zfcp: fix wrong data and display format of SFP+ temperature
    scsi: sd_sbc: Fix sd_zbc_report_zones()
    scsi: libfc: free response frame from GPN_ID

    Linus Torvalds
     
  • Pull PCI fixes from Bjorn Helgaas:

    - Fix build issue on 32-bit ARM with old compilers (Marek Szyprowski)

    - Update MAINTAINERS for recent Cadence driver file move (Lukas
    Bulwahn)

    * tag 'pci-v5.6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
    MAINTAINERS: Correct Cadence PCI driver path
    PCI: brcmstb: Fix build on 32bit ARM platforms with older compilers

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:

    - Passthrough insertion fix (Ming)

    - Kill off some unused arguments (John)

    - blktrace RCU fix (Jan)

    - Dead fields removal for null_blk (Dongli)

    - NVMe polled IO fix (Bijan)

    * tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
    nvme-pci: Hold cq_poll_lock while completing CQEs
    blk-mq: Remove some unused function arguments
    null_blk: remove unused fields in 'nullb_cmd'
    blktrace: Protect q->blk_trace with RCU
    blk-mq: insert passthrough request into hctx->dispatch directly

    Linus Torvalds