13 Dec, 2019
4 commits
-
There's an issue with deferred requests through drain, where if we do
need to defer, we're not copying over the sqe_submit state correctly.
This can result in using uninitialized data when we then later go and
submit the deferred request, like this check in __io_submit_sqe():if (unlikely(s->index >= ctx->sq_entries))
return -EINVAL;with 's' being uninitialized, we can randomly fail this check. Fix this
by copying sqe_submit state when we defer a request.Because it was fixed as part of a cleanup series in mainline, before
anyone realized we had this issue. That removed the separate states
of ->index vs ->submit.sqe. That series is not something I was
comfortable putting into stable, hence the much simpler addition.
Here's the patch in the series that fixes the same issue:commit cf6fd4bd559ee61a4454b161863c8de6f30f8dca
Author: Pavel Begunkov
Date: Mon Nov 25 23:14:39 2019 +0300io_uring: inline struct sqe_submit
Reported-by: Andres Freund
Reported-by: Tomáš Chaloupka
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman -
commit aa4c3967756c6c576a38a23ac511be211462a6b7 upstream.
Christophe reports that current master fails building on powerpc with
this error:CC fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
iovec.iov_base = kmap(iter->bvec->bv_page)
^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
iovec.iov_base = kmap(iter->bvec->bv_page)
^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
kunmap(iter->bvec->bv_page);
^which is caused by a missing highmem.h include. Fix it by including
it.Fixes: 311ae9e159d8 ("io_uring: fix dead-hung for non-iter fixed rw")
Reported-by: Christophe Leroy
Tested-by: Christophe Leroy
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman -
commit 441cdbd5449b4923cd413d3ba748124f91388be9 upstream.
We should never return -ERESTARTSYS to userspace, transform it into
-EINTR.Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman -
commit 311ae9e159d81a1ec1cf645daf40b39ae5a0bd84 upstream.
Read/write requests to devices without implemented read/write_iter
using fixed buffers can cause general protection fault, which totally
hangs a machine.io_import_fixed() initialises iov_iter with bvec, but loop_rw_iter()
accesses it as iovec, dereferencing random address.kmap() page by page in this case
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
05 Dec, 2019
1 commit
-
[ Upstream commit 181e448d8709e517c9c7b523fcd209f24eb38ca7 ]
If we don't inherit the original task creds, then we can confuse users
like fuse that pass creds in the request header. See link below on
identical aio issue.Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
14 Nov, 2019
2 commits
-
A test case was reported where two linked reads with registered buffers
failed the second link always. This is because we set the expected value
of a request in req->result, and if we don't get this result, then we
fail the dependent links. For some reason the registered buffer import
returned -ERROR/0, while the normal import returns -ERROR/length. This
broke linked commands with registered buffers.Fix this by making io_import_fixed() correctly return the mapped length.
Cc: stable@vger.kernel.org # v5.3
Reported-by: 李通洲
Signed-off-by: Jens Axboe -
For timeout requests io_uring tries to grab a file with specified fd,
which is usually stdin/fd=0.
Update io_op_needs_file()Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
12 Nov, 2019
1 commit
-
Currently we make sequence == 0 be the same as sequence == 1, but that's
not super useful if the intent is really to have a timeout that's just
a pure timeout.If the user passes in sqe->off == 0, then don't apply any sequence logic
to the request, let it purely be driven by the timeout specified.Reported-by: 李通洲
Reviewed-by: 李通洲
Signed-off-by: Jens Axboe
31 Oct, 2019
1 commit
-
We use io_kiocb->result == -EAGAIN as a way to know if we need to
re-submit a polled request, as -EAGAIN reporting happens out-of-line
for IO submission failures. This field is cleared when we originally
allocate the request, but it isn't reset when we retry the submission
from async context. This can cause issues where we think something
needs a re-issue, but we're really just reading stale data.Reset ->result whenever we re-prep a request for polled submission.
Cc: stable@vger.kernel.org
Fixes: 9e645e1105ca ("io_uring: add support for sqe links")
Reported-by: Bijan Mottahedeh
Signed-off-by: Jens Axboe
28 Oct, 2019
2 commits
-
syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free of reading the ctx ->sq_array and ->rings
value right after having installed the ring fd in the process file
table.Defer ring fd installation until after we're done reading those
values.Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe -
io_queue_link_head() owns shadow_req after taking it as an argument.
By not freeing it in case of an error, it can leak the request along
with taken ctx->refs.Reviewed-by: Jackie Liu
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
26 Oct, 2019
2 commits
-
We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some
requests simply got errored and already completed, they won't be
available for polling.For the case of IO polling and SQTHREAD usage, look at the pending
poll list. If it ever hits empty then we know that we don't have
anymore pollable requests inflight. For that case, simply reset
the inflight count to zero.Reported-by: Pavel Begunkov
Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Created in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.Reported-by: Pavel Begunkov
Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe
25 Oct, 2019
3 commits
-
io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to the userspace
2. Then calls to io_queue_link_head(), accessing released head's sqeReorder them.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
io_sq_thread() processes sqes by 8 without considering links. As a
result, links will be randomely subdivided.The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes() as do io_ring_submit().Downsides:
1. This removes optimisation of not grabbing mm_struct for fixed files
2. It submitting all sqes in one go, without finer-grained sheduling
with cq processing.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe -
There is a bug, where failed linked requests are returned not with
specified @user_data, but with garbage from a kernel stack.The reason is that io_fail_links() uses req->user_data, which is
uninitialised when called from io_queue_sqe() on fail path.Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
24 Oct, 2019
3 commits
-
The sequence number of the timeout req (req->sequence) indicate the
expected completion request. Because of each timeout req consume a
sequence number, so the sequence of each timeout req on the timeout
list shouldn't be the same. But now, we may get the same number (also
incorrect) if we insert a new entry before the last one, such as submit
such two timeout reqs on a new ring instance below.req->sequence
req_1 (count = 2): 2
req_2 (count = 1): 2Then, if we submit a nop req, req_2 will still timeout even the nop req
finished. This patch fix this problem by adjust the sequence number of
each reordered reqs when inserting a new entry.Signed-off-by: zhangyi (F)
Signed-off-by: Jens Axboe -
The sequence number of reqs on the timeout_list before the timeout req
should be adjusted in io_timeout_fn(), because the current timeout req
will consumes a slot in the cq_ring and cq_tail pointer will be
increased, otherwise other timeout reqs may return in advance without
waiting for enough wait_nr.Signed-off-by: zhangyi (F)
Signed-off-by: Jens Axboe -
There are cases where it isn't always safe to block for submission,
even if the caller asked to wait for events as well. Revert the
previous optimization of doing that.This reverts two commits:
bf7ec93c644cb
c576666863b78Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API")
Signed-off-by: Jens Axboe
19 Oct, 2019
1 commit
-
Pull block fixes from Jens Axboe:
- NVMe pull request from Keith that address deadlocks, double resets,
memory leaks, and other regression.- Fixup elv_support_iosched() for bio based devices (Damien)
- Fixup for the ahci PCS quirk (Dan)
- Socket O_NONBLOCK handling fix for io_uring (me)
- Timeout sequence io_uring fixes (yangerkun)
- MD warning fix for parameter default_layout (Song)
- blkcg activation fixes (Tejun)
- blk-rq-qos node deletion fix (Tejun)
* tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
nvme-pci: Set the prp2 correctly when using more than 4k page
io_uring: fix logic error in io_timeout
io_uring: fix up O_NONBLOCK handling for sockets
md/raid0: fix warning message for parameter default_layout
libata/ahci: Fix PCS quirk application
blk-rq-qos: fix first node deletion of rq_qos_del()
blkcg: Fix multiple bugs in blkcg_activate_policy()
io_uring: consider the overflow of sequence for timeout req
nvme-tcp: fix possible leakage during error flow
nvmet-loop: fix possible leakage during error flow
block: Fix elv_support_iosched()
nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
nvme: Wait for reset state when required
nvme: Prevent resets during paused controller state
nvme: Restart request timers in resetting state
nvme: Remove ADMIN_ONLY state
nvme-pci: Free tagset if no IO queues
nvme: retain split access workaround for capability reads
nvme: fix possible deadlock when nvme_update_formats fails
18 Oct, 2019
2 commits
-
If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not
tmp_nxt.Fixes: 5da0fb1ab34c ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: yangerkun
Signed-off-by: Jens Axboe -
We've got two issues with the non-regular file handling for non-blocking
IO:1) We don't want to re-do a short read in full for a non-regular file,
as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts,
we need to punt to async context even if the file is opened as
non-blocking. Otherwise the caller always gets -EAGAIN.Add two new request flags to handle these cases. One is just a cache
of the inode S_ISREG() status, the other tells io_uring that we always
need to punt this request to async context, even if REQ_F_NOWAIT is set.Cc: stable@vger.kernel.org
Reported-by: Hrvoje Zeba
Tested-by: Hrvoje Zeba
Signed-off-by: Jens Axboe
15 Oct, 2019
1 commit
-
Now we recalculate the sequence of timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', judge the right place to insert
for timeout_list by compare the number of request we still expected for
completion. But we have not consider about the situation of overflow:1. ctx->cached_sq_head + count - 1 may overflow. And a bigger count for
the new timeout req can have a small req->sequence.2. cached_sq_head of now may overflow compare with before req. And it
will lead the timeout req with small req->sequence.This overflow will lead to the misorder of timeout_list, which can lead
to the wrong order of the completion of timeout_list. Fix it by reuse
req->submit.sequence to store the count, and change the logic of
inserting sort in io_timeout.Signed-off-by: yangerkun
Signed-off-by: Jens Axboe
13 Oct, 2019
1 commit
-
Pull io_uring fix from Jens Axboe:
"Single small fix for a regression in the sequence logic for linked
commands"* tag 'for-linus-20191012' of git://git.kernel.dk/linux-block:
io_uring: fix sequence logic for timeout requests
11 Oct, 2019
2 commits
-
Pull block fixes from Jens Axboe:
- Fix wbt performance regression introduced with the blk-rq-qos
refactoring (Harshad)- Fix io_uring fileset removal inadvertently killing the workqueue (me)
- Fix io_uring typo in linked command nonblock submission (Pavel)
- Remove spurious io_uring wakeups on request free (Pavel)
- Fix null_blk zoned command error return (Keith)
- Don't use freezable workqueues for backing_dev, also means we can
revert a previous libata hack (Mika)- Fix nbd sysfs mutex dropped too soon at removal time (Xiubo)
* tag 'for-linus-20191010' of git://git.kernel.dk/linux-block:
nbd: fix possible sysfs duplicate warning
null_blk: Fix zoned command return code
io_uring: only flush workqueues on fileset removal
io_uring: remove wait loop spurious wakeups
blk-wbt: fix performance regression in wbt scale_up/scale_down
Revert "libata, freezer: avoid block device removal while system is frozen"
bdi: Do not use freezable workqueue
io_uring: fix reversed nonblock flag for link submission -
We have two ways a request can be deferred:
1) It's a regular request that depends on another one
2) It's a timeout that tracks completionsWe have a shared helper to determine whether to defer, and that
attempts to make the right decision based on the request. But we
only have some of this information in the caller. Un-share the
two timeout/defer helpers so the caller can use the right one.Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: yangerkun
Reviewed-by: Jackie Liu
Signed-off-by: Jens Axboe
10 Oct, 2019
1 commit
-
We should not remove the workqueue, we just need to ensure that the
workqueues are synced. The workqueues are torn down on ctx removal.Cc: stable@vger.kernel.org
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Reported-by: Stefan Hajnoczi
Signed-off-by: Jens Axboe
08 Oct, 2019
1 commit
-
Any changes interesting to tasks waiting in io_cqring_wait() are
commited with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs()
also tries to do that but with no reason, that means spurious wakeups
every io_free_req() and io_uring_enter().Just use percpu_ref_put() instead.
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
05 Oct, 2019
1 commit
-
Pull block fixes from Jens Axboe:
- Mandate timespec64 for the io_uring timeout ABI (Arnd)
- Set of NVMe changes via Sagi:
- controller removal race fix from Balbir
- quirk additions from Gabriel and Jian-Hong
- nvme-pci power state save fix from Mario
- Add 64bit user commands (for 64bit registers) from Marta
- nvme-rdma/nvme-tcp fixes from Max, Mark and Me
- Minor cleanups and nits from James, Dan and John- Two s390 dasd fixes (Jan, Stefan)
- Have loop change block size in DIO mode (Martijn)
- paride pg header ifdef guard (Masahiro)
- Two blk-mq queue scheduler tweaks, fixing an ordering issue on zoned
devices and suboptimal performance on others (Ming)* tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block: (22 commits)
block: sed-opal: fix sparse warning: convert __be64 data
block: sed-opal: fix sparse warning: obsolete array init.
block: pg: add header include guard
Revert "s390/dasd: Add discard support for ESE volumes"
s390/dasd: Fix error handling during online processing
io_uring: use __kernel_timespec in timeout ABI
loop: change queue block size to match when using DIO
blk-mq: apply normal plugging for HDD
blk-mq: honor IO scheduler for multiqueue devices
nvme-rdma: fix possible use-after-free in connect timeout
nvme: Move ctrl sqsize to generic space
nvme: Add ctrl attributes for queue_count and sqsize
nvme: allow 64-bit results in passthru commands
nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
nvmet-tcp: remove superflous check on request sgl
Added QUIRKs for ADATA XPG SX8200 Pro 512GB
nvme-rdma: Fix max_hw_sectors calculation
nvme: fix an error code in nvme_init_subsystem()
nvme-pci: Save PCI state before putting drive into deepest state
nvme-tcp: fix wrong stop condition in io_work
...
04 Oct, 2019
1 commit
-
io_queue_link_head() accepts @force_nonblock flag, but io_ring_submit()
passes something opposite.Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API")
Reported-by: kbuild test robot
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
01 Oct, 2019
1 commit
-
All system calls use struct __kernel_timespec instead of the old struct
timespec, but this one was just added with the old-style ABI. Change it
now to enforce the use of __kernel_timespec, avoiding ABI confusion and
the need for compat handlers on 32-bit architectures.Any user space caller will have to use __kernel_timespec now, but this
is unambiguous and works for any C library regardless of the time_t
definition. A nicer way to specify the timeout would have been a less
ambiguous 64-bit nanosecond value, but I suppose it's too late now to
change that as this would impact both 32-bit and 64-bit users.Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: Arnd Bergmann
Signed-off-by: Jens Axboe
28 Sep, 2019
1 commit
-
Pull more io_uring updates from Jens Axboe:
"Just two things in here:- Improvement to the io_uring CQ ring wakeup for batched IO (me)
- Fix wrong comparison in poll handling (yangerkun)
I realize the first one is a little late in the game, but it felt
pointless to hold it off until the next release. Went through various
testing and reviews with Pavel and peterz"* tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block:
io_uring: make CQ ring wakeups be more efficient
io_uring: compare cached_cq_tail with cq.head in_io_uring_poll
26 Sep, 2019
1 commit
-
For batched IO, it's not uncommon for waiters to ask for more than 1
IO to complete before being woken up. This is a problem with
wait_event() since tasks will get woken for every IO that completes,
re-check condition, then go back to sleep. For batch counts on the
order of what you do for high IOPS, that can result in 10s of extra
wakeups for the waiting task.Add a private wake function that checks for the wake up count criteria
being met before calling autoremove_wake_function(). Pavel reports that
one test case he has runs 40% faster with proper batching of wakeups.Reported-by: Pavel Begunkov
Tested-by: Pavel Begunkov
Reviewed-by: Pavel Begunkov
Signed-off-by: Jens Axboe
25 Sep, 2019
2 commits
-
Pull more io_uring updates from Jens Axboe:
"A collection of later fixes and additions, that weren't quite ready
for pushing out with the initial pull request.This contains:
- Fix potential use-after-free of shadow requests (Jackie)
- Fix potential OOM crash in request allocation (Jackie)
- kmalloc+memcpy -> kmemdup cleanup (Jackie)
- Fix poll crash regression (me)
- Fix SQ thread not being nice and giving up CPU for !PREEMPT (me)
- Add support for timeouts, making it easier to do epoll_wait()
conversions, for instance (me)- Ensure io_uring works without f_ops->read_iter() and
f_ops->write_iter() (me)"* tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block:
io_uring: correctly handle non ->{read,write}_iter() file_operations
io_uring: IORING_OP_TIMEOUT support
io_uring: use cond_resched() in sqthread
io_uring: fix potential crash issue due to io_get_req failure
io_uring: ensure poll commands clear ->sqe
io_uring: fix use-after-free of shadow_req
io_uring: use kmemdup instead of kmalloc and memcpy -
Patch series "Make working with compound pages easier", v2.
These three patches add three helpers and convert the appropriate
places to use them.This patch (of 3):
It's unnecessarily hard to find out the size of a potentially huge page.
Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle)
Acked-by: Michal Hocko
Reviewed-by: Andrew Morton
Reviewed-by: Ira Weiny
Acked-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Sep, 2019
2 commits
-
After 75b28af("io_uring: allocate the two rings together"), we compare
sq.head with cached_cq_tail to determine does there any cq invalid.
Actually, we should use cq.head.Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
Signed-off-by: yangerkun
Signed-off-by: Jens Axboe -
Currently we just -EINVAL a read or write to an fd that isn't backed
by ->read_iter() or ->write_iter(). But we can handle them just fine,
as long as we punt fo async context first.Implement a simple loop function for doing ->read() or ->write()
instead, and ensure we call it appropriately.Reported-by: 李通洲
Signed-off-by: Jens Axboe
19 Sep, 2019
3 commits
-
There's been a few requests for functionality similar to io_getevents()
and epoll_wait(), where the user can specify a timeout for waiting on
events. I deliberately did not add support for this through the system
call initially to avoid overloading the args, but I can see that the use
cases for this are valid.This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
when waiting for events, simply submit one of these timeout commands
with your wait call (or before). This ensures that the application
sleeping on the CQ ring waiting for events will get woken. The timeout
command is passed in as a pointer to a struct timespec. Timeouts are
relative. The timeout command also includes a way to auto-cancel after
N events has passed.Signed-off-by: Jens Axboe
-
If preempt isn't enabled in the kernel, we can run into hang issues with
sqthread submissions. Use cond_resched() to play nice instead of
cpu_relax(), if we end up starting the loop and not having any events
pending for submissions.Signed-off-by: Jens Axboe
-
Sometimes io_get_req will return a NUL, then we need to do the
correct error handling, otherwise it will cause the kernel null
pointer exception.Fixes: 4fe2c963154c ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu
Signed-off-by: Jens Axboe