13 Dec, 2019

4 commits

  • There's an issue with deferred requests through drain, where if we do
    need to defer, we're not copying over the sqe_submit state correctly.
    This can result in using uninitialized data when we then later go and
    submit the deferred request, like this check in __io_submit_sqe():

    if (unlikely(s->index >= ctx->sq_entries))
            return -EINVAL;

    with 's' being uninitialized, we can randomly fail this check. Fix this
    by copying sqe_submit state when we defer a request.
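
    A minimal sketch of the fix (5.x-era field names; not the verbatim
    stable patch): when parking a request on the defer list, preserve
    the whole submit state alongside the copied sqe.

    /* in the defer path, with 's' the current sqe_submit */
    sqe_copy = kmemdup(s->sqe, sizeof(*sqe_copy), GFP_KERNEL);
    if (!sqe_copy)
            return -EAGAIN;

    memcpy(&req->submit, s, sizeof(*s));    /* keep ->index etc. valid */
    req->submit.sqe = sqe_copy;             /* point at the stable copy */
    list_add_tail(&req->list, &ctx->defer_list);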

    This was fixed in mainline as part of a cleanup series, before
    anyone realized we had this issue; that series removed the
    separate states of ->index vs ->submit.sqe. It's not something I
    was comfortable putting into stable, hence this much simpler fix.
    Here's the patch in the series that fixes the same issue:

    commit cf6fd4bd559ee61a4454b161863c8de6f30f8dca
    Author: Pavel Begunkov
    Date: Mon Nov 25 23:14:39 2019 +0300

    io_uring: inline struct sqe_submit

    Reported-by: Andres Freund
    Reported-by: Tomáš Chaloupka
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit aa4c3967756c6c576a38a23ac511be211462a6b7 upstream.

    Christophe reports that current master fails building on powerpc with
    this error:

    CC fs/io_uring.o
    fs/io_uring.c: In function ‘loop_rw_iter’:
    fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
    [-Werror=implicit-function-declaration]
    iovec.iov_base = kmap(iter->bvec->bv_page)
    ^
    fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
    without a cast [-Wint-conversion]
    iovec.iov_base = kmap(iter->bvec->bv_page)
    ^
    fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
    [-Werror=implicit-function-declaration]
    kunmap(iter->bvec->bv_page);
    ^

    which is caused by a missing highmem.h include. Fix it by including
    it.
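
    The fix is the one-liner the message describes:

    /* fs/io_uring.c */
    #include <linux/highmem.h>      /* kmap()/kunmap() */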

    Fixes: 311ae9e159d8 ("io_uring: fix dead-hung for non-iter fixed rw")
    Reported-by: Christophe Leroy
    Tested-by: Christophe Leroy
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit 441cdbd5449b4923cd413d3ba748124f91388be9 upstream.

    We should never return -ERESTARTSYS to userspace; transform it
    into -EINTR.
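
    The transform itself is a two-liner in the affected completion
    path, along these lines:

    if (ret == -ERESTARTSYS)
            ret = -EINTR;   /* never leak -ERESTARTSYS to userspace */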

    Cc: stable@vger.kernel.org # v5.3+
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit 311ae9e159d81a1ec1cf645daf40b39ae5a0bd84 upstream.

    Read/write requests to devices that don't implement read_iter and
    write_iter can, when used with fixed buffers, cause a general
    protection fault that totally hangs the machine.

    io_import_fixed() initialises the iov_iter with a bvec, but
    loop_rw_iter() accesses it as an iovec, dereferencing a random
    address.

    kmap() the pages one by one in this case.
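
    A sketch of the per-page mapping in loop_rw_iter(), close to but
    not necessarily the verbatim patch:

    struct iovec iovec;

    if (!iov_iter_is_bvec(iter)) {
            iovec = iov_iter_iovec(iter);
    } else {
            /* fixed buffers arrive as a bvec: map the page for a kva */
            iovec.iov_base = kmap(iter->bvec->bv_page)
                                    + iter->iov_offset;
            iovec.iov_len = min(iter->count,
                            iter->bvec->bv_len - iter->iov_offset);
    }

    /* ... call file->f_op->read()/->write() on iovec here ... */

    if (iov_iter_is_bvec(iter))
            kunmap(iter->bvec->bv_page);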

    Cc: stable@vger.kernel.org
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Pavel Begunkov
     

05 Dec, 2019

1 commit

  • [ Upstream commit 181e448d8709e517c9c7b523fcd209f24eb38ca7 ]

    If we don't inherit the original task creds, then we can confuse
    users like fuse that pass creds in the request header. See the
    link below for the identical aio issue.
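
    The shape of the fix, sketched with the standard cred APIs (grab
    the creds at ring setup, assume them in the async worker):

    ctx->creds = get_current_cred();        /* at io_uring setup */

    /* in the async submission context */
    old_cred = override_creds(ctx->creds);
    /* ... issue the request ... */
    revert_creds(old_cred);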

    Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

14 Nov, 2019

2 commits

  • A test case was reported where two linked reads with registered
    buffers always failed the second link. This is because we store
    the expected result of a request in req->result, and if we don't
    get this result, we fail the dependent links. For some reason the
    registered buffer import returned -ERROR/0, while the normal
    import returns -ERROR/length. This broke linked commands with
    registered buffers.

    Fix this by making io_import_fixed() correctly return the mapped length.
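
    The length matters because the link machinery of that era failed
    dependent requests on a result mismatch; roughly:

    /* io_complete_rw(): a short result fails the dependent links */
    if ((req->flags & REQ_F_LINK) && res != req->result)
            req->flags |= REQ_F_FAIL_LINK;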

    Cc: stable@vger.kernel.org # v5.3
    Reported-by: 李通洲
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • For timeout requests, io_uring tries to grab a file with the
    specified fd, which is usually stdin/fd=0.

    Update io_op_needs_file() so that timeout requests don't take a
    file.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

12 Nov, 2019

1 commit

  • Currently we make sequence == 0 be the same as sequence == 1, but that's
    not super useful if the intent is really to have a timeout that's just
    a pure timeout.

    If the user passes in sqe->off == 0, then don't apply any sequence logic
    to the request, let it purely be driven by the timeout specified.
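
    A sketch of the change, using the flag name from the upstream fix
    (REQ_F_TIMEOUT_NOSEQ):

    count = READ_ONCE(sqe->off);
    if (!count) {
            /* pure timer: exempt from all sequence accounting */
            req->flags |= REQ_F_TIMEOUT_NOSEQ;
    }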

    Reported-by: 李通洲
    Reviewed-by: 李通洲
    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 Oct, 2019

1 commit

  • We use io_kiocb->result == -EAGAIN as a way to know if we need to
    re-submit a polled request, as -EAGAIN reporting happens out-of-line
    for IO submission failures. This field is cleared when we originally
    allocate the request, but it isn't reset when we retry the submission
    from async context. This can cause issues where we think something
    needs a re-issue, but we're really just reading stale data.

    Reset ->result whenever we re-prep a request for polled submission.
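
    The reset itself is a one-liner in the polled re-prep path,
    roughly:

    req->result = 0;        /* don't mistake stale -EAGAIN for a retry */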

    Cc: stable@vger.kernel.org
    Fixes: 9e645e1105ca ("io_uring: add support for sqe links")
    Reported-by: Bijan Mottahedeh
    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Oct, 2019

2 commits

  • syzkaller reported an issue where it looks like a malicious app
    can trigger a use-after-free in which the ctx->sq_array and
    ->rings values are read right after the ring fd has been installed
    in the process file table.

    Defer ring fd installation until after we're done reading those
    values.
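
    The general pattern, sketched (not the verbatim patch): reserve
    the fd up front, but install it only once we're done touching the
    ctx.

    fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
    if (fd < 0)
            return fd;

    file = anon_inode_getfile("[io_uring]", &io_uring_fops, ctx,
                                    O_RDWR | O_CLOEXEC);
    if (IS_ERR(file)) {
            put_unused_fd(fd);
            return PTR_ERR(file);
    }

    /* ... read the ctx->sq_array/->rings derived values here ... */

    fd_install(fd, file);   /* only now can userspace reach the file */
    return fd;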

    Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
    Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • io_queue_link_head() owns shadow_req after taking it as an
    argument. By not freeing it in case of an error, it can leak the
    request along with the ctx->refs it has taken.

    Reviewed-by: Jackie Liu
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

26 Oct, 2019

2 commits

  • We currently assume that submissions from the sqthread are
    successful, and if IO polling is enabled, we use that value to
    know how many completions to look for. But if we overflowed the
    CQ ring, or if some requests simply errored and completed already,
    they won't be available for polling.

    For the case of IO polling with SQ thread usage, look at the
    pending poll list. If it ever goes empty, we know that we have no
    more pollable requests inflight. In that case, simply reset the
    inflight count to zero.
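
    Sketched from the description, with __io_iopoll_check() assumed as
    the poll helper's name:

    /* in the io_sq_thread() loop */
    if (!list_empty(&ctx->poll_list))
            __io_iopoll_check(ctx, &nr_events, 0);
    else
            inflight = 0;   /* no pollable requests left in flight */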

    Reported-by: Pavel Begunkov
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We currently use the ring values directly, but that can lead to
    issues if a malicious application changes those values behind our
    back. Create in-kernel cached versions of them, and just overwrite
    the user-visible side when we update them. This is similar to how
    we treat the sq/cq ring tail/head updates.
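
    Following the tail/head pattern the message references, a cached
    counter looks roughly like this (field names from the upstream
    fix):

    /* bump the kernel-private copy, then publish it to the ring */
    ctx->cached_sq_dropped++;
    WRITE_ONCE(ctx->rings->sq_dropped, ctx->cached_sq_dropped);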

    Reported-by: Pavel Begunkov
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Oct, 2019

3 commits

  • io_ring_submit() finalises with:
    1. io_commit_sqring(), which releases the sqes to userspace
    2. A call to io_queue_link_head(), which accesses the released
    head's sqe

    Reorder them.
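
    The reordered tail of io_ring_submit(), sketched:

    if (link)
            io_queue_link_head(ctx, link, &link->submit, shadow_req);
    io_commit_sqring(ctx);  /* release sqes to userspace only now */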

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • io_sq_thread() processes sqes eight at a time without considering
    links. As a result, links get randomly subdivided.

    The easiest way to fix this is to call io_get_sqring() inside
    io_submit_sqes(), as io_ring_submit() does.

    Downsides:
    1. This removes the optimisation of not grabbing mm_struct for
    fixed files.
    2. It submits all sqes in one go, without finer-grained scheduling
    interleaved with cq processing.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • There is a bug where failed linked requests are returned not with
    the specified @user_data, but with garbage from the kernel stack.

    The reason is that io_fail_links() uses req->user_data, which is
    uninitialised when it is called from the failure path of
    io_queue_sqe().

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

24 Oct, 2019

3 commits

  • The sequence number of a timeout req (req->sequence) indicates the
    expected completion request. Since each timeout req consumes a
    sequence number, the sequence numbers of the timeout reqs on the
    timeout list should never be the same. But currently we may get
    the same (and incorrect) number if we insert a new entry before
    the last one, such as when submitting these two timeout reqs on a
    fresh ring instance:

    req->sequence
    req_1 (count = 2): 2
    req_2 (count = 1): 2

    Then, if we submit a nop req, req_2 will still time out even after
    the nop req has finished. Fix this problem by adjusting the
    sequence numbers of the reordered reqs when inserting a new entry.
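
    The insertion walk with the adjustment applied, sketched with a
    hypothetical expires_before() predicate:

    unsigned span = 0;

    list_for_each_prev(entry, &ctx->timeout_list) {
            struct io_kiocb *nxt = list_entry(entry, struct io_kiocb,
                                                    list);

            if (expires_before(nxt, req))
                    break;
            /* each timeout req consumes a slot, so shift these by one */
            nxt->sequence++;
            span++;
    }
    req->sequence -= span;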

    Signed-off-by: zhangyi (F)
    Signed-off-by: Jens Axboe

    zhangyi (F)
     
  • The sequence numbers of reqs ahead of a timeout req on the
    timeout_list should be adjusted in io_timeout_fn(), because the
    current timeout req consumes a slot in the cq_ring and the cq_tail
    pointer will be increased; otherwise other timeout reqs may return
    early, without waiting for enough wait_nr.

    Signed-off-by: zhangyi (F)
    Signed-off-by: Jens Axboe

    zhangyi (F)
     
  • There are cases where it isn't safe to block for submission, even
    if the caller asked to wait for events as well. Revert the
    previous optimization of doing that.

    This reverts two commits:

    bf7ec93c644cb
    c576666863b78

    Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Oct, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - NVMe pull request from Keith that addresses deadlocks, double
    resets, memory leaks, and other regressions.

    - Fixup elv_support_iosched() for bio based devices (Damien)

    - Fixup for the ahci PCS quirk (Dan)

    - Socket O_NONBLOCK handling fix for io_uring (me)

    - Timeout sequence io_uring fixes (yangerkun)

    - MD warning fix for parameter default_layout (Song)

    - blkcg activation fixes (Tejun)

    - blk-rq-qos node deletion fix (Tejun)

    * tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
    nvme-pci: Set the prp2 correctly when using more than 4k page
    io_uring: fix logic error in io_timeout
    io_uring: fix up O_NONBLOCK handling for sockets
    md/raid0: fix warning message for parameter default_layout
    libata/ahci: Fix PCS quirk application
    blk-rq-qos: fix first node deletion of rq_qos_del()
    blkcg: Fix multiple bugs in blkcg_activate_policy()
    io_uring: consider the overflow of sequence for timeout req
    nvme-tcp: fix possible leakage during error flow
    nvmet-loop: fix possible leakage during error flow
    block: Fix elv_support_iosched()
    nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
    nvme: Wait for reset state when required
    nvme: Prevent resets during paused controller state
    nvme: Restart request timers in resetting state
    nvme: Remove ADMIN_ONLY state
    nvme-pci: Free tagset if no IO queues
    nvme: retain split access workaround for capability reads
    nvme: fix possible deadlock when nvme_update_formats fails

    Linus Torvalds
     

18 Oct, 2019

2 commits

  • If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not
    tmp_nxt.
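
    That is, the wrap-around adjustment belongs on the head-side
    value:

    if (ctx->cached_sq_head < nxt_sq_head)
            tmp += UINT_MAX;        /* was mistakenly applied to tmp_nxt */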

    Fixes: 5da0fb1ab34c ("io_uring: consider the overflow of sequence for timeout req")
    Signed-off-by: yangerkun
    Signed-off-by: Jens Axboe

    yangerkun
     
  • We've got two issues with the non-regular file handling for non-blocking
    IO:

    1) We don't want to re-do a short read in full for a non-regular file,
    as we can't just read the data again.
    2) For non-regular files that don't support non-blocking IO attempts,
    we need to punt to async context even if the file is opened as
    non-blocking. Otherwise the caller always gets -EAGAIN.

    Add two new request flags to handle these cases. One is just a cache
    of the inode S_ISREG() status, the other tells io_uring that we always
    need to punt this request to async context, even if REQ_F_NOWAIT is set.
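
    A sketch of the two flags at work, names taken from the upstream
    fix (REQ_F_ISREG, REQ_F_MUST_PUNT, io_file_supports_async()):

    if (S_ISREG(file_inode(req->file)->i_mode))
            req->flags |= REQ_F_ISREG;      /* cache S_ISREG() status */

    /* a file that can't do nonblock IO must go to async context */
    if (force_nonblock && !io_file_supports_async(req->file))
            req->flags |= REQ_F_MUST_PUNT;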

    Cc: stable@vger.kernel.org
    Reported-by: Hrvoje Zeba
    Tested-by: Hrvoje Zeba
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Oct, 2019

1 commit

  • We currently calculate the sequence of a timeout as 'req->sequence
    = ctx->cached_sq_head + count - 1', and judge the right place to
    insert it into the timeout_list by comparing the number of
    requests we still expect to complete. But this does not account
    for overflow:

    1. ctx->cached_sq_head + count - 1 may overflow, so a bigger count
    for the new timeout req can yield a smaller req->sequence.

    2. The current cached_sq_head may have overflowed compared with
    that of an earlier req, which also leaves the timeout req with a
    small req->sequence.

    This overflow misorders the timeout_list, which can make its
    entries complete in the wrong order. Fix it by reusing
    req->submit.sequence to store the count, and changing the
    insertion-sort logic in io_timeout().

    Signed-off-by: yangerkun
    Signed-off-by: Jens Axboe

    yangerkun
     

11 Oct, 2019

2 commits

  • Pull block fixes from Jens Axboe:

    - Fix wbt performance regression introduced with the blk-rq-qos
    refactoring (Harshad)

    - Fix io_uring fileset removal inadvertently killing the workqueue (me)

    - Fix io_uring typo in linked command nonblock submission (Pavel)

    - Remove spurious io_uring wakeups on request free (Pavel)

    - Fix null_blk zoned command error return (Keith)

    - Don't use freezable workqueues for backing_dev, also means we can
    revert a previous libata hack (Mika)

    - Fix nbd sysfs mutex dropped too soon at removal time (Xiubo)

    * tag 'for-linus-20191010' of git://git.kernel.dk/linux-block:
    nbd: fix possible sysfs duplicate warning
    null_blk: Fix zoned command return code
    io_uring: only flush workqueues on fileset removal
    io_uring: remove wait loop spurious wakeups
    blk-wbt: fix performance regression in wbt scale_up/scale_down
    Revert "libata, freezer: avoid block device removal while system is frozen"
    bdi: Do not use freezable workqueue
    io_uring: fix reversed nonblock flag for link submission

    Linus Torvalds
     
  • We have two ways a request can be deferred:

    1) It's a regular request that depends on another one
    2) It's a timeout that tracks completions

    We have a shared helper to determine whether to defer, and that
    attempts to make the right decision based on the request. But we
    only have some of this information in the caller. Un-share the
    two timeout/defer helpers so the caller can use the right one.

    Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support")
    Reported-by: yangerkun
    Reviewed-by: Jackie Liu
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Oct, 2019

1 commit

  • We should not remove the workqueue; we just need to ensure that
    the workqueues are synced. The workqueues are torn down on ctx
    removal.
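
    Sketched (workqueue field name assumed):

    flush_workqueue(ctx->sqo_wq);   /* sync, don't destroy */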

    Cc: stable@vger.kernel.org
    Fixes: 6b06314c47e1 ("io_uring: add file set registration")
    Reported-by: Stefan Hajnoczi
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Oct, 2019

1 commit

  • Any changes interesting to tasks waiting in io_cqring_wait() are
    committed with io_cqring_ev_posted(). However,
    io_ring_drop_ctx_refs() also tries to do that, for no reason; this
    means spurious wakeups on every io_free_req() and
    io_uring_enter().

    Just use percpu_ref_put() instead.
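
    After the change the helper does nothing but drop the refs; a
    sketch:

    static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx,
                                            unsigned refs)
    {
            percpu_ref_put_many(&ctx->refs, refs);
    }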

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

05 Oct, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - Mandate timespec64 for the io_uring timeout ABI (Arnd)

    - Set of NVMe changes via Sagi:
    - controller removal race fix from Balbir
    - quirk additions from Gabriel and Jian-Hong
    - nvme-pci power state save fix from Mario
    - Add 64bit user commands (for 64bit registers) from Marta
    - nvme-rdma/nvme-tcp fixes from Max, Mark and Me
    - Minor cleanups and nits from James, Dan and John

    - Two s390 dasd fixes (Jan, Stefan)

    - Have loop change block size in DIO mode (Martijn)

    - paride pg header ifdef guard (Masahiro)

    - Two blk-mq queue scheduler tweaks, fixing an ordering issue on zoned
    devices and suboptimal performance on others (Ming)

    * tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block: (22 commits)
    block: sed-opal: fix sparse warning: convert __be64 data
    block: sed-opal: fix sparse warning: obsolete array init.
    block: pg: add header include guard
    Revert "s390/dasd: Add discard support for ESE volumes"
    s390/dasd: Fix error handling during online processing
    io_uring: use __kernel_timespec in timeout ABI
    loop: change queue block size to match when using DIO
    blk-mq: apply normal plugging for HDD
    blk-mq: honor IO scheduler for multiqueue devices
    nvme-rdma: fix possible use-after-free in connect timeout
    nvme: Move ctrl sqsize to generic space
    nvme: Add ctrl attributes for queue_count and sqsize
    nvme: allow 64-bit results in passthru commands
    nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
    nvmet-tcp: remove superflous check on request sgl
    Added QUIRKs for ADATA XPG SX8200 Pro 512GB
    nvme-rdma: Fix max_hw_sectors calculation
    nvme: fix an error code in nvme_init_subsystem()
    nvme-pci: Save PCI state before putting drive into deepest state
    nvme-tcp: fix wrong stop condition in io_work
    ...

    Linus Torvalds
     

01 Oct, 2019

1 commit

  • All system calls use struct __kernel_timespec instead of the old struct
    timespec, but this one was just added with the old-style ABI. Change it
    now to enforce the use of __kernel_timespec, avoiding ABI confusion and
    the need for compat handlers on 32-bit architectures.

    Any user space caller will have to use __kernel_timespec now, but this
    is unambiguous and works for any C library regardless of the time_t
    definition. A nicer way to specify the timeout would have been a less
    ambiguous 64-bit nanosecond value, but I suppose it's too late now to
    change that as this would impact both 32-bit and 64-bit users.
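
    For reference, __kernel_timespec carries a 64-bit time_t on every
    architecture (include/uapi/linux/time_types.h):

    struct __kernel_timespec {
            __kernel_time64_t       tv_sec;         /* seconds */
            long long               tv_nsec;        /* nanoseconds */
    };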

    Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     

28 Sep, 2019

1 commit

  • Pull more io_uring updates from Jens Axboe:
    "Just two things in here:

    - Improvement to the io_uring CQ ring wakeup for batched IO (me)

    - Fix wrong comparison in poll handling (yangerkun)

    I realize the first one is a little late in the game, but it felt
    pointless to hold it off until the next release. Went through various
    testing and reviews with Pavel and peterz"

    * tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block:
    io_uring: make CQ ring wakeups be more efficient
    io_uring: compare cached_cq_tail with cq.head in_io_uring_poll

    Linus Torvalds
     

26 Sep, 2019

1 commit

  • For batched IO, it's not uncommon for waiters to ask for more than 1
    IO to complete before being woken up. This is a problem with
    wait_event() since tasks will get woken for every IO that completes,
    re-check condition, then go back to sleep. For batch counts on the
    order of what you do for high IOPS, that can result in 10s of extra
    wakeups for the waiting task.

    Add a private wake function that checks for the wake up count criteria
    being met before calling autoremove_wake_function(). Pavel reports that
    one test case he has runs 40% faster with proper batching of wakeups.
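
    The shape of the private wake function, sketched from the
    description (io_should_wake() encapsulates the completion-count
    check; the struct layout is assumed):

    struct io_wait_queue {
            struct wait_queue_entry wq;
            struct io_ring_ctx *ctx;
            unsigned to_wait;       /* completions the task wants */
            unsigned nr_timeouts;   /* timeouts seen at sleep time */
    };

    static int io_wake_function(struct wait_queue_entry *curr,
                                unsigned mode, int wake_flags, void *key)
    {
            struct io_wait_queue *iowq =
                    container_of(curr, struct io_wait_queue, wq);

            /* don't wake the task until enough CQEs have arrived */
            if (!io_should_wake(iowq))
                    return -1;

            return autoremove_wake_function(curr, mode, wake_flags, key);
    }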

    Reported-by: Pavel Begunkov
    Tested-by: Pavel Begunkov
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Sep, 2019

2 commits

  • Pull more io_uring updates from Jens Axboe:
    "A collection of later fixes and additions, that weren't quite ready
    for pushing out with the initial pull request.

    This contains:

    - Fix potential use-after-free of shadow requests (Jackie)

    - Fix potential OOM crash in request allocation (Jackie)

    - kmalloc+memcpy -> kmemdup cleanup (Jackie)

    - Fix poll crash regression (me)

    - Fix SQ thread not being nice and giving up CPU for !PREEMPT (me)

    - Add support for timeouts, making it easier to do epoll_wait()
    conversions, for instance (me)

    - Ensure io_uring works without f_ops->read_iter() and
    f_ops->write_iter() (me)"

    * tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block:
    io_uring: correctly handle non ->{read,write}_iter() file_operations
    io_uring: IORING_OP_TIMEOUT support
    io_uring: use cond_resched() in sqthread
    io_uring: fix potential crash issue due to io_get_req failure
    io_uring: ensure poll commands clear ->sqe
    io_uring: fix use-after-free of shadow_req
    io_uring: use kmemdup instead of kmalloc and memcpy

    Linus Torvalds
     
  • Patch series "Make working with compound pages easier", v2.

    These three patches add three helpers and convert the appropriate
    places to use them.

    This patch (of 3):

    It's unnecessarily hard to find out the size of a potentially huge page.
    Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
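
    The helper is exactly that replacement, wrapped up:

    /* include/linux/mm.h */
    static inline unsigned long page_size(struct page *page)
    {
            return PAGE_SIZE << compound_order(page);
    }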

    Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

19 Sep, 2019

3 commits

  • There's been a few requests for functionality similar to io_getevents()
    and epoll_wait(), where the user can specify a timeout for waiting on
    events. I deliberately did not add support for this through the system
    call initially to avoid overloading the args, but I can see that the use
    cases for this are valid.

    This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
    when waiting for events, simply submit one of these timeout commands
    with your wait call (or before). This ensures that the application
    sleeping on the CQ ring waiting for events will get woken. The timeout
    command is passed in as a pointer to a struct timespec. Timeouts are
    relative. The timeout command also includes a way to auto-cancel
    after N events have passed.
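
    A userspace sketch of arming one by hand (raw sqe fields, no
    liburing helpers; uses struct __kernel_timespec as mandated by the
    01 Oct ABI change above):

    struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };

    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_TIMEOUT;
    sqe->addr = (unsigned long) &ts;        /* pointer to the timespec */
    sqe->len = 1;                           /* exactly one timespec */
    sqe->off = 8;                           /* auto-fire after 8 CQEs */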

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If preempt isn't enabled in the kernel, we can run into hang issues with
    sqthread submissions. Use cond_resched() to play nice instead of
    cpu_relax(), if we end up starting the loop and not having any events
    pending for submissions.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Sometimes io_get_req() will return NULL, and we then need to do
    the correct error handling; otherwise it will cause a kernel NULL
    pointer dereference.
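
    The missing check, sketched against the shadow-request allocation
    in the link/drain path:

    shadow_req = io_get_req(ctx, NULL);
    if (unlikely(!shadow_req))
            goto out;       /* don't dereference a failed allocation */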

    Fixes: 4fe2c963154c ("io_uring: add support for link with drain")
    Signed-off-by: Jackie Liu
    Signed-off-by: Jens Axboe

    Jackie Liu