05 Mar, 2020
2 commits
-
There are 2 optimisations:
- Now, io_worker_handler_work() do io_assign_current_work() twice per
request, and each one adds lock/unlock(worker->lock) pair. The first is
to reset worker->cur_work to NULL, and the second to set a real work
shortly after. If there is a dependant work, set it immediately, that
effectively removes the extra NULL'ing.- And there is no use in taking wqe->lock for linked works, as they are
not hashed now. Optimise it out.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
This is a preparation patch, it adds some helpers and makes
the next patches cleaner.- extract io_impersonate_work() and io_assign_current_work()
- replace @next label with nested do-while
- move put_work() right after NULL'ing cur_work.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
04 Mar, 2020
3 commits
-
If after dropping the submission reference req->refs == 1, the request
is done, because this one is for io_put_work() and will be dropped
synchronously shortly after. In this case it's safe to steal a next
work from the request.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
There will be no use for @nxt in the handlers, and it's doesn't work
anyway, so purge itSigned-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
The rule is simple, any async handler gets a submission ref and should
put it at the end. Make them all follow it, and so more consistent.This is a preparation patch, and as io_wq_assign_next() currently won't
ever work, this doesn't care to use io_put_req_find_next() instead of
io_put_req().Signed-off-by: Pavel Begunkov
refcount_inc_not_zero() -> refcount_inc() fix.
Signed-off-by: Jens Axboe
03 Mar, 2020
23 commits
-
Don't abuse labels for plain and straightworward code.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Clang warns:
fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized
whenever 'if' condition is false [-Wsometimes-uninitialized]
if (def->pollin)
^~~~~~~~~~~
fs/io_uring.c:4182:2: note: uninitialized use occurs here
mask |= POLLERR | POLLPRI;
^~~~
fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always
true
if (def->pollin)
^~~~~~~~~~~~~~~~
fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence
this warning
__poll_t mask, ret;
^
= 0
1 warning generated.io_op_defs has many definitions where pollin is not set so mask indeed
might be uninitialized. Initialize it to zero and change the next
assignment to |=, in case further masks are added in the future to avoid
missing changing the assignment then.Fixes: d7718a9d25a6 ("io_uring: use poll driven retry for files that support it")
Link: https://github.com/ClangBuiltLinux/linux/issues/916
Signed-off-by: Nathan Chancellor
Signed-off-by: Jens Axboe -
io-wq cares about IO_WQ_WORK_UNBOUND flag only while enqueueing, so
it's useless setting it for a next req of a link. Thus, removed it
from io_prep_linked_timeout(), and inline the function.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
After __io_queue_sqe() ended up in io_queue_async_work(), it's already
known that there is no @nxt req, so skip the check and return from the
function.Also, @nxt initialisation now can be done just before
io_put_req_find_next(), as there is no jumping until it's checked.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Currently io_uring tries any request in a non-blocking manner, if it can,
and then retries from a worker thread if we get -EAGAIN. Now that we have
a new and fancy poll based retry backend, use that to retry requests if
the file supports it.This means that, for example, an IORING_OP_RECVMSG on a socket no longer
requires an async thread to complete the IO. If we get -EAGAIN reading
from the socket in a non-blocking manner, we arm a poll handler for
notification on when the socket becomes readable. When it does, the
pending read is executed directly by the task again, through the io_uring
task work handlers. Not only is this faster and more efficient, it also
means we're not generating potentially tons of async threads that just
sit and block, waiting for the IO to complete.The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
pollable IO is fast, and that pollother_op is fast as well.Signed-off-by: Jens Axboe
-
Add a pollin/pollout field to the request table, and have commands that
we can safely poll for properly marked.Signed-off-by: Jens Axboe
-
For poll requests, it's not uncommon to link a read (or write) after
the poll to execute immediately after the file is marked as ready.
Since the poll completion is called inside the waitqueue wake up handler,
we have to punt that linked request to async context. This slows down
the processing, and actually means it's faster to not use a link for this
use case.We also run into problems if the completion_lock is contended, as we're
doing a different lock ordering than the issue side is. Hence we have
to do trylock for completion, and if that fails, go async. Poll removal
needs to go async as well, for the same reason.eventfd notification needs special case as well, to avoid stack blowing
recursion or deadlocks.These are all deficiencies that were inherited from the aio poll
implementation, but I think we can do better. When a poll completes,
simply queue it up in the task poll list. When the task completes the
list, we can run dependent links inline as well. This means we never
have to go async, and we can remove a bunch of code associated with
that, and optimizations to try and make that run faster. The diffstat
speaks for itself.Signed-off-by: Jens Axboe
-
Store the io_kiocb in the private field instead of the poll entry, this
is in preparation for allowing multiple waitqueues.No functional changes in this patch.
Signed-off-by: Jens Axboe
-
As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg()
if task->task_works == NULL && !PF_EXITING.And in fact the only reason why task_work_run() needs ->pi_lock is
the possible race with task_work_cancel(), we can optimize this code
and make the locking more clear.Signed-off-by: Oleg Nesterov
Signed-off-by: Jens Axboe -
@hash_map is unsigned long, but BIT_ULL() is used for manipulations.
BIT() is a better match as it returns exactly unsigned long value.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
before the work setup (i.e. mm, override creds, etc). The setup
shouldn't take long, so it's ok to arm it a bit later and get rid
of IO_WQ_WORK_CB.Make io-wq call work->func() only once, callbacks will handle the rest.
i.e. the linked timeout handler will do the actual issue. And as a
bonus, it removes an extra indirect call.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
IO_WQ_WORK_HAS_MM is set but never used, remove it.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
io_recvmsg() and io_sendmsg() duplicate nonblock -EAGAIN finilising
part, so add helper for that.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Deduplicate call to io_cqring_fill_event(), plain and easy
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Add support for splice(2).
- output file is specified as sqe->fd, so it's handled by generic code
- hash_reg_file handled by generic code as well
- len is 32bit, but should be fine
- the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which
is a splice flag (i.e. sqe->splice_flags).Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Preparation without functional changes. Adds io_get_file(), that allows
to grab files not only into req->file.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Make do_splice(), so other kernel parts can reuse it
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
req->in_async is not really needed, it only prevents propagation of
@nxt for fast not-blocked submissions. Remove it.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
io_prep_async_worker() called io_wq_assign_next() do many useless checks:
io_req_work_grab_env() was already called during prep, and @do_hashed
is not ever used. Add io_prep_next_work() -- simplified version, that
can be called io-wq.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Many operations define custom work.func before getting into an io-wq.
There are several points against:
- it calls io_wq_assign_next() from outside io-wq, that may be confusing
- sync context would go unnecessary through io_req_cancelled()
- prototypes are quite different, so work!=old_work looks strange
- makes async/sync responsibilities fuzzy
- adds extra overheadDon't call generic path and io-wq handlers from each other, but use
helpers insteadSigned-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Don't drop an early reference, hang on to it and let the caller drop
it. This makes it behave more like "regular" requests.Signed-off-by: Jens Axboe
-
If the -EAGAIN happens because of a static condition, then a poll
or later retry won't fix it. We must call it again from blocking
condition. Play it safe and ensure that any -EAGAIN condition from read
or write must retry from async context.Signed-off-by: Jens Axboe
-
io_wq_flush() is buggy, during cancelation of a flush, the associated
work may be passed to the caller's (i.e. io_uring) @match callback. That
callback is expecting it to be embedded in struct io_kiocb. Cancelation
of internal work probably doesn't make a lot of sense to begin with.As the flush helper is no longer used, just delete it and the associated
work flag.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
02 Mar, 2020
5 commits
-
To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the
callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may
return next work, which will be ignored and lost.Cancel the whole link.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
Pull ext4 fixes from Ted Ts'o:
"Two more bug fixes (including a regression) for 5.6"* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
jbd2: fix data races at struct journal_head -
Pull KVM fixes from Paolo Bonzini:
"More bugfixes, including a few remaining "make W=1" issues such as too
large frame sizes on some configurations.On the ARM side, the compiler was messing up shadow stacks between EL1
and EL2 code, which is easily fixed with __always_inline"* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: check descriptor table exits on instruction emulation
kvm: x86: Limit the number of "kvm: disabled by bios" messages
KVM: x86: avoid useless copy of cpufreq policy
KVM: allow disabling -Werror
KVM: x86: allow compiling as non-module with W=1
KVM: Pre-allocate 1 cpumask variable per cpu for both pv tlb and pv ipis
KVM: Introduce pv check helpers
KVM: let declaration of kvm_get_running_vcpus match implementation
KVM: SVM: allocate AVIC data structures based on kvm_amd module parameter
arm64: Ask the compiler to __always_inline functions used by KVM at HYP
KVM: arm64: Define our own swab32() to avoid a uapi static inline
KVM: arm64: Ask the compiler to __always_inline functions used at HYP
kvm: arm/arm64: Fold VHE entry/exit work into kvm_vcpu_run_vhe()
KVM: arm/arm64: Fix up includes for trace.h -
KVM emulates UMIP on hardware that doesn't support it by setting the
'descriptor table exiting' VM-execution control and performing
instruction emulation. When running nested, this emulation is broken as
KVM refuses to emulate L2 instructions by default.Correct this regression by allowing the emulation of descriptor table
instructions if L1 hasn't requested 'descriptor table exiting'.Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
Reported-by: Jan Kiszka
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini
Cc: Jim Mattson
Signed-off-by: Oliver Upton
Signed-off-by: Paolo Bonzini
01 Mar, 2020
4 commits
-
Pull i2c fixes from Wolfram Sang:
"I2C has three driver bugfixes for you. We agreed on the Mac regression
to go in via I2C"* 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
macintosh: therm_windtunnel: fix regression when instantiating devices
i2c: altera: Fix potential integer overflow
i2c: jz4780: silence log flood on txabrt -
If sbi->s_flex_groups_allocated is zero and the first allocation fails
then this code will crash. The problem is that "i--" will set "i" to
-1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1
is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX
is more than zero, the condition is true so we call kvfree(new_groups[-1]).
The loop will carry on freeing invalid memory until it crashes.Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access")
Reviewed-by: Suraj Jitindar Singh
Signed-off-by: Dan Carpenter
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain
Signed-off-by: Theodore Ts'o -
Removing attach_adapter from this driver caused a regression for at
least some machines. Those machines had the sensors described in their
DT, too, so they didn't need manual creation of the sensor devices. The
old code worked, though, because manual creation came first. Creation of
DT devices then failed later and caused error logs, but the sensors
worked nonetheless because of the manually created devices.When removing attach_adaper, manual creation now comes later and loses
the race. The sensor devices were already registered via DT, yet with
another binding, so the driver could not be bound to it.This fix refactors the code to remove the race and only manually creates
devices if there are no DT nodes present. Also, the DT binding is updated
to match both, the DT and manually created devices. Because we don't
know which device creation will be used at runtime, the code to start
the kthread is moved to do_probe() which will be called by both methods.Fixes: 3e7bed52719d ("macintosh: therm_windtunnel: drop using attach_adapter")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201723
Reported-by: Erhard Furtner
Tested-by: Erhard Furtner
Acked-by: Michael Ellerman (powerpc)
Signed-off-by: Wolfram Sang
Cc: stable@kernel.org # v4.19+ -
journal_head::b_transaction and journal_head::b_next_transaction could
be accessed concurrently as noticed by KCSAN,LTP: starting fsync04
/dev/zero: Can't open blockdev
EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
==================================================================
BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
__jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
__jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
(inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
kjournald2+0x13b/0x450 [jbd2]
kthread+0x1cd/0x1f0
ret_from_fork+0x27/0x50read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
jbd2_write_access_granted+0x1b2/0x250 [jbd2]
jbd2_write_access_granted at fs/jbd2/transaction.c:1155
jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
__ext4_journal_get_write_access+0x50/0x90 [ext4]
ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
ext4_mb_new_blocks+0x54f/0xca0 [ext4]
ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
ext4_map_blocks+0x3b4/0x950 [ext4]
_ext4_get_block+0xfc/0x270 [ext4]
ext4_get_block+0x3b/0x50 [ext4]
__block_write_begin_int+0x22e/0xae0
__block_write_begin+0x39/0x50
ext4_write_begin+0x388/0xb50 [ext4]
generic_perform_write+0x15d/0x290
ext4_buffered_write_iter+0x11f/0x210 [ext4]
ext4_file_write_iter+0xce/0x9e0 [ext4]
new_sync_write+0x29c/0x3b0
__vfs_write+0x92/0xa0
vfs_write+0x103/0x260
ksys_write+0x9d/0x130
__x64_sys_write+0x4c/0x60
do_syscall_64+0x91/0xb05
entry_SYSCALL_64_after_hwframe+0x49/0xbe5 locks held by fsync04/25724:
#0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
#1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
#2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
#3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
#4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
irq event stamp: 1407125
hardirqs last enabled at (1407125): [] __find_get_block+0x107/0x790
hardirqs last disabled at (1407124): [] __find_get_block+0x49/0x790
softirqs last enabled at (1405528): [] __do_softirq+0x34c/0x57c
softirqs last disabled at (1405521): [] irq_exit+0xa2/0xc0Reported by Kernel Concurrency Sanitizer on:
CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019The plain reads are outside of jh->b_state_lock critical section which result
in data races. Fix them by adding pairs of READ|WRITE_ONCE().Reviewed-by: Jan Kara
Signed-off-by: Qian Cai
Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
Signed-off-by: Theodore Ts'o
29 Feb, 2020
3 commits
-
Pull SCSI fixes from James Bottomley:
"Four small fixes.Three are in drivers for fairly obvious bugs. The fourth is a set of
regressions introduced by the compat_ioctl changes because some of the
compat updates wrongly replaced .ioctl instead of .compat_ioctl"* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: compat_ioctl: cdrom: Replace .ioctl with .compat_ioctl in four appropriate places
scsi: zfcp: fix wrong data and display format of SFP+ temperature
scsi: sd_sbc: Fix sd_zbc_report_zones()
scsi: libfc: free response frame from GPN_ID -
Pull PCI fixes from Bjorn Helgaas:
- Fix build issue on 32-bit ARM with old compilers (Marek Szyprowski)
- Update MAINTAINERS for recent Cadence driver file move (Lukas
Bulwahn)* tag 'pci-v5.6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
MAINTAINERS: Correct Cadence PCI driver path
PCI: brcmstb: Fix build on 32bit ARM platforms with older compilers -
Pull block fixes from Jens Axboe:
- Passthrough insertion fix (Ming)
- Kill off some unused arguments (John)
- blktrace RCU fix (Jan)
- Dead fields removal for null_blk (Dongli)
- NVMe polled IO fix (Bijan)
* tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
nvme-pci: Hold cq_poll_lock while completing CQEs
blk-mq: Remove some unused function arguments
null_blk: remove unused fields in 'nullb_cmd'
blktrace: Protect q->blk_trace with RCU
blk-mq: insert passthrough request into hctx->dispatch directly