03 Sep, 2020

1 commit

  • the commit ("") caused a crash in io_sq_wq_submit_work().
    When io_ring-wq gets a req from async_list that has not yet been
    added to task_list, trying to delete the req from task_list causes
    a "NULL pointer dereference".

    Ensure the req is added to async_list and task_list at the same time.

    The crash log looks like this:
    [95995.973638] Unable to handle kernel NULL pointer dereference at virtual address 00000000
    [95995.979123] pgd = c20c8964
    [95995.981803] [00000000] *pgd=1c72d831, *pte=00000000, *ppte=00000000
    [95995.988043] Internal error: Oops: 817 [#1] SMP ARM
    [95995.992814] Modules linked in: bpfilter(-)
    [95995.996898] CPU: 1 PID: 15661 Comm: kworker/u8:5 Not tainted 5.4.56 #2
    [95996.003406] Hardware name: Amlogic Meson platform
    [95996.008108] Workqueue: io_ring-wq io_sq_wq_submit_work
    [95996.013224] PC is at io_sq_wq_submit_work+0x1f4/0x5c4
    [95996.018261] LR is at walk_stackframe+0x24/0x40
    [95996.022685] pc : [] lr : [] psr: 600f0093
    [95996.028936] sp : dc6f7e88 ip : dc6f7df0 fp : dc6f7ef4
    [95996.034148] r10: deff9800 r9 : dc1d1694 r8 : dda58b80
    [95996.039358] r7 : dc6f6000 r6 : dc6f7ebc r5 : dc1d1600 r4 : deff99c0
    [95996.045871] r3 : 0000cb5d r2 : 00000000 r1 : ef6b9b80 r0 : c059b88c
    [95996.052385] Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
    [95996.059593] Control: 10c5387d Table: 22be804a DAC: 00000055
    [95996.065325] Process kworker/u8:5 (pid: 15661, stack limit = 0x78013c69)
    [95996.071923] Stack: (0xdc6f7e88 to 0xdc6f8000)
    [95996.076268] 7e80: dc6f7ecc dc6f7e98 00000000 c1f06c08 de9dc800 deff9a04
    [95996.084431] 7ea0: 00000000 dc6f7f7c 00000000 c1f65808 0000080c dc677a00 c1ee9bd0 dc6f7ebc
    [95996.092594] 7ec0: dc6f7ebc d085c8f6 c0445a90 dc1d1e00 e008f300 c0288400 e4ef7100 00000000
    [95996.100757] 7ee0: c20d45b0 e4ef7115 dc6f7f34 dc6f7ef8 c03725f0 c059b6b0 c0288400 c0288400
    [95996.108921] 7f00: c0288400 00000001 c0288418 e008f300 c0288400 e008f314 00000088 c0288418
    [95996.117083] 7f20: c1f03d00 dc6f6038 dc6f7f7c dc6f7f38 c0372df8 c037246c dc6f7f5c 00000000
    [95996.125245] 7f40: c1f03d00 c1f03d00 c20d3cbe c0288400 dc6f7f7c e1c43880 e4fa7980 00000000
    [95996.133409] 7f60: e008f300 c0372d9c e48bbe74 e1c4389c dc6f7fac dc6f7f80 c0379244 c0372da8
    [95996.141570] 7f80: 600f0093 e4fa7980 c0379108 00000000 00000000 00000000 00000000 00000000
    [95996.149734] 7fa0: 00000000 dc6f7fb0 c03010ac c0379114 00000000 00000000 00000000 00000000
    [95996.157897] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    [95996.166060] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
    [95996.174217] Backtrace:
    [95996.176662] [] (io_sq_wq_submit_work) from [] (process_one_work+0x190/0x4c0)
    [95996.185425] r10:e4ef7115 r9:c20d45b0 r8:00000000 r7:e4ef7100 r6:c0288400 r5:e008f300
    [95996.193237] r4:dc1d1e00
    [95996.195760] [] (process_one_work) from [] (worker_thread+0x5c/0x5bc)
    [95996.203836] r10:dc6f6038 r9:c1f03d00 r8:c0288418 r7:00000088 r6:e008f314 r5:c0288400
    [95996.211647] r4:e008f300
    [95996.214173] [] (worker_thread) from [] (kthread+0x13c/0x168)
    [95996.221554] r10:e1c4389c r9:e48bbe74 r8:c0372d9c r7:e008f300 r6:00000000 r5:e4fa7980
    [95996.229363] r4:e1c43880
    [95996.231888] [] (kthread) from [] (ret_from_fork+0x14/0x28)
    [95996.239088] Exception stack(0xdc6f7fb0 to 0xdc6f7ff8)
    [95996.244127] 7fa0: 00000000 00000000 00000000 00000000
    [95996.252291] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    [95996.260453] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000
    [95996.267054] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0379108
    [95996.274866] r4:e4fa7980 r3:600f0093
    [95996.278430] Code: eb3a59e1 e5952098 e5951094 e5812004 (e5821000)

    Signed-off-by: Xin Yin
    Signed-off-by: Greg Kroah-Hartman

    Xin Yin
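
The failure mode described in this commit can be illustrated with a small userspace sketch. It is not the kernel code: `list_node`, `request`, `request_publish()` and `list_del_checked()` are hypothetical names, locking is omitted, and the list only loosely imitates the kernel's `list_head`. It shows why a request that is visible on one list but not yet linked on the other cannot be unlinked safely, and how publishing on both lists in one step avoids that.

```c
#include <stddef.h>

/* Minimal intrusive list, loosely modeled on the kernel's list_head. */
struct list_node {
    struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Returns 0 on success, -1 if the node was never linked. A plain
 * unlink has no such check and would write through NULL pointers. */
static int list_del_checked(struct list_node *n)
{
    if (!n->next || !n->prev)
        return -1;
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
    return 0;
}

/* Hypothetical request that lives on two lists at once. */
struct request {
    struct list_node async_link;
    struct list_node task_link;
};

/* Link the request on both lists together, so that a consumer which
 * finds it on async_list can always unlink it from task_list.
 * (Real code would hold a lock around both insertions.) */
static void request_publish(struct request *req,
                            struct list_node *async_list,
                            struct list_node *task_list)
{
    list_add_tail(&req->task_link, task_list);
    list_add_tail(&req->async_link, async_list);
}
```

A zero-initialized request that was never published fails `list_del_checked()`, while one that went through `request_publish()` can be removed from either list.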
     

19 Aug, 2020

4 commits

  • commit 2dd2111d0d383df104b144e0d1f6b5a00cb7cd88 upstream.

    loop_rw_iter() does not check whether the file has a read or
    write function. This can lead to a NULL pointer dereference
    when the user passes in a file descriptor that does not have
    a read or write function.

    The crash log looks like this:

    [ 99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
    [ 99.835364] #PF: supervisor instruction fetch in kernel mode
    [ 99.836522] #PF: error_code(0x0010) - not-present page
    [ 99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
    [ 99.839649] Oops: 0010 [#2] SMP PTI
    [ 99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G D 5.8.0 #2
    [ 99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
    [ 99.845140] RIP: 0010:0x0
    [ 99.845840] Code: Bad RIP value.
    [ 99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
    [ 99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
    [ 99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
    [ 99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
    [ 99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
    [ 99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
    [ 99.856914] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
    [ 99.858651] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
    [ 99.861979] Call Trace:
    [ 99.862617] loop_rw_iter.part.0+0xad/0x110
    [ 99.863838] io_write+0x2ae/0x380
    [ 99.864644] ? kvm_sched_clock_read+0x11/0x20
    [ 99.865595] ? sched_clock+0x9/0x10
    [ 99.866453] ? sched_clock_cpu+0x11/0xb0
    [ 99.867326] ? newidle_balance+0x1d4/0x3c0
    [ 99.868283] io_issue_sqe+0xd8f/0x1340
    [ 99.869216] ? __switch_to+0x7f/0x450
    [ 99.870280] ? __switch_to_asm+0x42/0x70
    [ 99.871254] ? __switch_to_asm+0x36/0x70
    [ 99.872133] ? lock_timer_base+0x72/0xa0
    [ 99.873155] ? switch_mm_irqs_off+0x1bf/0x420
    [ 99.874152] io_wq_submit_work+0x64/0x180
    [ 99.875192] ? kthread_use_mm+0x71/0x100
    [ 99.876132] io_worker_handle_work+0x267/0x440
    [ 99.877233] io_wqe_worker+0x297/0x350
    [ 99.878145] kthread+0x112/0x150
    [ 99.878849] ? __io_worker_unuse+0x100/0x100
    [ 99.879935] ? kthread_park+0x90/0x90
    [ 99.880874] ret_from_fork+0x22/0x30
    [ 99.881679] Modules linked in:
    [ 99.882493] CR2: 0000000000000000
    [ 99.883324] ---[ end trace 4453745f4673190b ]---
    [ 99.884289] RIP: 0010:0x0
    [ 99.884837] Code: Bad RIP value.
    [ 99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
    [ 99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
    [ 99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
    [ 99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
    [ 99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
    [ 99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
    [ 99.896079] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
    [ 99.898017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0

    Fixes: 32960613b7c3 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
    Cc: stable@vger.kernel.org
    Signed-off-by: Guoyu Huang
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Guoyu Huang
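
The missing check can be sketched in userspace C as follows. This is an illustration only: `file_ops`, `do_write()` and `count_write()` are hypothetical names, and the `-EINVAL` error code is an assumption made for the example, not taken from the patch. The point is simply that a NULL function pointer must be rejected before the call, otherwise execution jumps to address 0, which matches the `RIP: 0010:0x0` line in the log.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for struct file_operations. */
struct file_ops {
    ssize_t (*read)(void *buf, size_t len);
    ssize_t (*write)(const void *buf, size_t len);
};

/* Dispatch a write, refusing files that have no write handler. */
static ssize_t do_write(const struct file_ops *ops, const void *buf, size_t len)
{
    if (!ops->write)
        return -EINVAL; /* error code chosen for illustration */
    return ops->write(buf, len);
}

/* Trivial handler used for demonstration: pretends all bytes were written. */
static ssize_t count_write(const void *buf, size_t len)
{
    (void)buf;
    return (ssize_t)len;
}
```

Calling `do_write()` on a `file_ops` with no `write` member returns an error instead of crashing; with `count_write` installed it returns the byte count.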
     
  • commit bd74048108c179cea0ff52979506164c80f29da7 upstream.

    If we hit an earlier error path in io_uring_create(), then we will have
    accounted memory, but not set ctx->{sq,cq}_entries yet. Then when the
    ring is torn down in error, we use those values to unaccount the memory.

    Ensure we set the ctx entries before we're able to hit a potential error
    path.

    Cc: stable@vger.kernel.org
    Reported-by: Tomáš Chaloupka
    Tested-by: Tomáš Chaloupka
    Reviewed-by: Stefano Garzarella
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
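
A minimal sketch of the ordering problem, under stated assumptions: `account`, `ring_ctx_sketch`, `ring_pages()`, `ring_create()` and `ring_teardown()` are hypothetical names, and the page computation is a placeholder. Teardown derives the amount to unaccount from the entry counts stored in the context, so those counts have to be stored before any point at which setup can fail.

```c
/* Hypothetical per-user accounting of locked memory. */
struct account {
    long locked_pages;
};

/* Hypothetical ring context. */
struct ring_ctx_sketch {
    unsigned int sq_entries;
    unsigned int cq_entries;
    int accounted;
    struct account *acct;
};

/* Placeholder for the real size computation. */
static long ring_pages(unsigned int sq, unsigned int cq)
{
    return (long)sq + (long)cq;
}

/* Teardown uses the values stored in the context to unaccount. */
static void ring_teardown(struct ring_ctx_sketch *ctx)
{
    if (ctx->accounted)
        ctx->acct->locked_pages -= ring_pages(ctx->sq_entries, ctx->cq_entries);
    ctx->accounted = 0;
}

static int ring_create(struct ring_ctx_sketch *ctx, struct account *acct,
                       unsigned int sq, unsigned int cq, int fail_early)
{
    ctx->acct = acct;
    acct->locked_pages += ring_pages(sq, cq);
    ctx->accounted = 1;

    /* Store the entry counts before the first possible error path,
     * so that teardown subtracts exactly what was added above. */
    ctx->sq_entries = sq;
    ctx->cq_entries = cq;

    if (fail_early) {
        ring_teardown(ctx);
        return -1;
    }
    return 0;
}
```

If the two assignments were moved below the `fail_early` check, the early-failure teardown would subtract zero pages and the accounting would drift.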
     
  • [ Upstream commit b36200f543ff07a1cb346aa582349141df2c8068 ]

    rings_size() sets sq_offset to the total size of the rings (the returned
    value which is used for memory allocation). This is wrong: sq array should
    be located within the rings, not after them. Set sq_offset to where it
    should be.

    Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
    Signed-off-by: Dmitry Vyukov
    Acked-by: Hristo Venev
    Cc: io-uring@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Dmitry Vyukov
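
The offset calculation can be sketched like this. It is a simplified model rather than the kernel function: `ring_layout`, `align_up()`, `rings_size_sketch()` and the 64-byte alignment are assumptions for illustration, and the overflow checks a real implementation needs are left out. The key point is that the SQ array offset is recorded before the array size is added to the running total.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed alignment for this sketch */

/* Hypothetical layout: fixed header, then CQ entries, then the SQ index array. */
struct ring_layout {
    size_t header_size;
    size_t cqe_size;
};

static size_t align_up(size_t v, size_t a)
{
    return (v + a - 1) / a * a;
}

/* Returns the total number of bytes to allocate and stores where the
 * SQ array starts. The offset is taken BEFORE the array size is added,
 * so the array lies inside the allocation instead of just past its end. */
static size_t rings_size_sketch(const struct ring_layout *l,
                                size_t sq_entries, size_t cq_entries,
                                size_t *sq_offset)
{
    size_t off = l->header_size + l->cqe_size * cq_entries;

    off = align_up(off, CACHE_LINE);
    if (sq_offset)
        *sq_offset = off;

    off += sizeof(uint32_t) * sq_entries;
    return off;
}
```

With a 64-byte header, 16-byte CQ entries, 16 CQ entries and 8 SQ entries, the SQ array starts at byte 320 and the whole allocation is 352 bytes, so the array ends exactly at the end of the allocation.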
     
  • the commit ("opcode>")
    caused another vulnerability. After io_get_req(), the sqe_submit struct
    in req is not initialized, but the following code assumes that
    req->submit.opcode is valid.

    Signed-off-by: Liu Yong
    Signed-off-by: Sasha Levin

    Liu Yong
     

11 Aug, 2020

2 commits

  • when ctx->sqo_mm is zero, io_sq_wq_submit_work() frees 'req'
    without deleting it from 'task_list'. After that, 'req' is
    accessed in io_ring_ctx_wait_and_kill(), which leads to
    a use-after-free.

    Signed-off-by: Guoyu Huang
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Guoyu Huang
     
  • Liu reports that he can trigger a NULL pointer dereference with
    IORING_OP_SENDMSG, by changing the sqe->opcode after we've validated
    that the previous opcode didn't need a file and didn't assign one.

    Ensure we validate and read the opcode only once.

    Reported-by: Liu Yong
    Tested-by: Liu Yong
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
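
The "validate and read the opcode only once" idea is a standard defence against double fetches from memory that userspace can modify. A hedged sketch: `shared_sqe`, `req_snapshot`, `snapshot_sqe()` and `issue()` are hypothetical names, and the opcode values are illustrative. The shared field is read a single time into a private copy, and every later decision uses that copy.

```c
#include <stdint.h>

/* Hypothetical submission entry living in memory shared with userspace. */
struct shared_sqe {
    volatile uint8_t opcode;
    int32_t fd;
};

/* Private snapshot taken once per request. */
struct req_snapshot {
    uint8_t opcode;
    int needs_file;
};

enum { OP_NOP = 0, OP_SENDMSG = 9 }; /* illustrative values */

static int op_needs_file(uint8_t opcode)
{
    return opcode != OP_NOP;
}

/* Read the opcode exactly once; validation works on the private copy. */
static void snapshot_sqe(const struct shared_sqe *sqe, struct req_snapshot *snap)
{
    snap->opcode = sqe->opcode; /* single fetch from shared memory */
    snap->needs_file = op_needs_file(snap->opcode);
}

/* Dispatch from the snapshot, never from the shared entry.
 * Returns -1 if a file is required but none was assigned. */
static int issue(const struct req_snapshot *snap, int have_file)
{
    if (snap->needs_file && !have_file)
        return -1;
    return snap->opcode;
}
```

If the shared opcode is flipped after `snapshot_sqe()` has run, `issue()` still acts on the opcode that was validated, so the "needs a file" decision and the dispatched operation cannot disagree.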
     

09 Jul, 2020

1 commit

  • Track async work items that we queue, so we can safely cancel them
    if the ring is closed or the process exits. Newer kernels handle
    this automatically with io-wq, but the old workqueue based setup needs
    a bit of special help to get there.

    There's no upstream variant of this, as that would require backporting
    all the io-wq changes from 5.5 and on. Hence I made a one-off that
    ensures that we don't leak memory if we have async work items that
    need active cancelation (like socket IO).

    Reported-by: Agarwal, Anchal
    Tested-by: Agarwal, Anchal
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

17 Jun, 2020

1 commit

  • commit a8c73c1a614f6da6c0b04c393f87447e28cb6de4 upstream.

    Use kvfree() to free the pages and vmas, since they are allocated by
    kvmalloc_array() in a loop.

    Fixes: d4ef647510b1 ("io_uring: avoid page allocation warnings")
    Signed-off-by: Denis Efremov
    Signed-off-by: Jens Axboe
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200605093203.40087-1-efremov@linux.com
    Signed-off-by: Greg Kroah-Hartman

    Denis Efremov
     

07 Jun, 2020

1 commit

  • [ Upstream commit 583863ed918136412ddf14de2e12534f17cfdc6f ]

    Ensure that ctx->sqo_wait is initialized as soon as the ctx is allocated,
    instead of deferring it to the offload setup. This fixes a syzbot
    reported lockdep complaint, which is really due to trying to wake_up
    on an uninitialized wait queue:

    RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441319
    RDX: 0000000000000001 RSI: 0000000020000140 RDI: 000000000000047b
    RBP: 0000000000010475 R08: 0000000000000001 R09: 00000000004002c8
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402260
    R13: 00000000004022f0 R14: 0000000000000000 R15: 0000000000000000
    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 1 PID: 7090 Comm: syz-executor222 Not tainted 5.7.0-rc1-next-20200415-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x188/0x20d lib/dump_stack.c:118
    assign_lock_key kernel/locking/lockdep.c:913 [inline]
    register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225
    __lock_acquire+0x104/0x4c50 kernel/locking/lockdep.c:4234
    lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4934
    __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
    _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159
    __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:122
    io_cqring_ev_posted+0xa5/0x1e0 fs/io_uring.c:1160
    io_poll_remove_all fs/io_uring.c:4357 [inline]
    io_ring_ctx_wait_and_kill+0x2bc/0x5a0 fs/io_uring.c:7305
    io_uring_create fs/io_uring.c:7843 [inline]
    io_uring_setup+0x115e/0x22b0 fs/io_uring.c:7870
    do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
    entry_SYSCALL_64_after_hwframe+0x49/0xb3
    RIP: 0033:0x441319
    Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9

    Reported-by: syzbot+8c91f5d054e998721c57@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

17 Apr, 2020

2 commits

  • commit 4ed734b0d0913e566a9d871e15d24eb240f269f7 upstream.

    With the previous fixes for number of files open checking, I added some
    debug code to see if we had other spots where we're checking rlimit()
    against the async io-wq workers. The only one I found was file size
    checking, which we should also honor.

    During write and fallocate prep, store the max file size and override
    that for the current task if we're in io-wq worker context.

    Cc: stable@vger.kernel.org # 5.1+
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit c336e992cb1cb1db9ee608dfb30342ae781057ab upstream.

    We already checked this limit when the file was opened, and we keep it
    open in the file table. Hence when we added unix_inflight to the count
    we want to register, we're doubly accounting these files. This results
    in -EMFILE for file registration, if we're at half the limit.

    Cc: stable@vger.kernel.org # v5.1+
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

05 Mar, 2020

2 commits

  • commit d876836204897b6d7d911f942084f69a1e9d5c4d upstream.

    We must set MSG_CMSG_COMPAT if we're in compatibility mode, otherwise
    the iovec import for these commands will not do the right thing and fail
    the command with -EINVAL.

    Found by running the test suite compiled as 32-bit.

    Cc: stable@vger.kernel.org
    Fixes: aa1fa28fc73e ("io_uring: add support for recvmsg()")
    Fixes: 0fa03c624d8f ("io_uring: add support for sendmsg()")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • [ Upstream commits 9392a27d88b9 and ff002b30181d ]

    Ensure that the async work grabs ->fs from the queueing task if the
    punted command needs to do lookups.

    We don't have these two commits in 5.4-stable:

    ff002b30181d30cdfbca316dadd099c3ca0d739c
    9392a27d88b9707145d713654eb26f0c29789e50

    because they don't apply with the rework that was done in how io_uring
    handles offload. Since there's no io-wq in 5.4, it doesn't make sense to
    do two patches. I'm attaching my port of the two for 5.4-stable, it's
    been tested. Please queue it up for the next 5.4-stable, thanks!

    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

29 Feb, 2020

2 commits

  • commit 7143b5ac5750f404ff3a594b34fdf3fc2f99f828 upstream.

    This patch drops 'cur_mm' before calling cond_resched(), to prevent
    the sq_thread from spinning even when the user process is finished.

    Before this patch, if the user process ended without closing the
    io_uring fd, the sq_thread continues to spin until the
    'sq_thread_idle' timeout ends.

    In the worst case where the 'sq_thread_idle' parameter is bigger than
    INT_MAX, the sq_thread will spin forever.

    Fixes: 6c271ce2f1d5 ("io_uring: add submission polling")
    Signed-off-by: Stefano Garzarella
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Stefano Garzarella
     
  • commit c7849be9cc2dd2754c48ddbaca27c2de6d80a95d upstream.

    Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have
    CQEs pending"), if we already have events pending, we won't enter the
    poll loop. If SETUP_IOPOLL and SETUP_SQPOLL are both enabled, the app
    has been terminated without reaping pending events that are already in
    the cq ring, and there are still some reqs in poll_list, then
    io_sq_thread will enter __io_iopoll_check(), find the pending events,
    and return. This loop never gets a chance to exit.

    I have seen this issue in fio stress tests. To fix it, let
    io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
    and remove __io_iopoll_check().

    Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending")
    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Xiaoguang Wang
     

29 Jan, 2020

1 commit

  • commit 73e08e711d9c1d79fae01daed4b0e1fee5f8a275 upstream.

    This ends up being too restrictive for tasks that willingly fork and
    share the ring between forks. Andres reports that this breaks his
    postgresql work. Since we're close to 5.5 release, revert this change
    for now.

    Cc: stable@vger.kernel.org
    Fixes: 44d282796f81 ("io_uring: only allow submit from owning task")
    Reported-by: Andres Freund
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

23 Jan, 2020

1 commit

  • commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.

    If the credentials or the mm doesn't match, don't allow the task to
    submit anything on behalf of this ring. The task that owns the ring can
    pass the file descriptor to another task, but we don't want to allow
    that task to submit an SQE that then assumes the ring mm and creds if
    it needs to go async.

    Cc: stable@vger.kernel.org
    Suggested-by: Stefan Metzmacher
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

12 Jan, 2020

1 commit

  • [ Upstream commit 7c504e65206a4379ff38fe41d21b32b6c2c3e53e ]

    There is no reliable way to submit and wait in a single syscall, as
    io_submit_sqes() may under-consume sqes (in case of an early error).
    Then it will wait for not-yet-submitted requests, deadlocking the user
    in most cases.

    Don't wait/poll if we can't submit all sqes.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Pavel Begunkov
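
The rule in this commit message reduces to a small decision, sketched here with a hypothetical helper `completions_to_wait()` (the name and signature are assumptions, not the kernel's): if fewer sqes were consumed than the caller asked to submit, do not wait at all, because the completions being waited for may belong to requests that were never issued.

```c
/* Decide how many completions to wait for after a submit attempt.
 * Returns 0 (do not wait) when submission was only partial. */
static unsigned int completions_to_wait(unsigned int to_submit,
                                        unsigned int submitted,
                                        unsigned int min_complete)
{
    if (submitted != to_submit)
        return 0;
    return min_complete;
}
```

For example, asking to submit 4 sqes and wait for 2 completions only waits when all 4 were actually submitted.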
     

09 Jan, 2020

1 commit

  • commit 0b8c0ec7eedcd8f9f1a1f238d87f9b512b09e71a upstream.

    syzbot reports:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
    RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
    RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
    Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
    24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84
    c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
    RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
    RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
    RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
    R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
    R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
    kthread+0x361/0x430 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
    Modules linked in:
    ---[ end trace f2e1a4307fbe2245 ]---
    RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
    RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
    RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
    Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
    24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84
    c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
    RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
    RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
    RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
    R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
    R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    which is caused by slab fault injection triggering a failure in
    prepare_creds(). We don't actually need to create a copy of the creds
    as we're not modifying it, we just need a reference on the current task
    creds. This avoids the failure case as well, and propagates the const
    throughout the stack.

    Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
    Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe
    [ only use the io_uring.c portion of the patch - gregkh]
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

05 Jan, 2020

1 commit

  • [ Upstream commit eb065d301e8c83643367bdb0898becc364046bda ]

    We currently rely on the ring destroy to clean things up in case of
    failure, but io_allocate_scq_urings() can leave things half initialized
    if only part of it fails.

    Be nice and return with either everything set up on success, or return an
    error with things nicely cleaned up.

    Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

13 Dec, 2019

4 commits

  • There's an issue with deferred requests through drain, where if we do
    need to defer, we're not copying over the sqe_submit state correctly.
    This can result in using uninitialized data when we then later go and
    submit the deferred request, like this check in __io_submit_sqe():

        if (unlikely(s->index >= ctx->sq_entries))
                return -EINVAL;

    With 's' being uninitialized, we can randomly fail this check. Fix this
    by copying sqe_submit state when we defer a request.

    There is no direct upstream equivalent of this patch, because the issue
    was fixed as part of a cleanup series in mainline before anyone realized
    we had it. That series removed the separate states
    of ->index vs ->submit.sqe. That series is not something I was
    comfortable putting into stable, hence the much simpler addition.
    Here's the patch in the series that fixes the same issue:

    commit cf6fd4bd559ee61a4454b161863c8de6f30f8dca
    Author: Pavel Begunkov
    Date: Mon Nov 25 23:14:39 2019 +0300

    io_uring: inline struct sqe_submit

    Reported-by: Andres Freund
    Reported-by: Tomáš Chaloupka
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit aa4c3967756c6c576a38a23ac511be211462a6b7 upstream.

    Christophe reports that current master fails building on powerpc with
    this error:

    CC fs/io_uring.o
    fs/io_uring.c: In function ‘loop_rw_iter’:
    fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
    [-Werror=implicit-function-declaration]
    iovec.iov_base = kmap(iter->bvec->bv_page)
    ^
    fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
    without a cast [-Wint-conversion]
    iovec.iov_base = kmap(iter->bvec->bv_page)
    ^
    fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
    [-Werror=implicit-function-declaration]
    kunmap(iter->bvec->bv_page);
    ^

    which is caused by a missing highmem.h include. Fix it by including
    it.

    Fixes: 311ae9e159d8 ("io_uring: fix dead-hung for non-iter fixed rw")
    Reported-by: Christophe Leroy
    Tested-by: Christophe Leroy
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit 441cdbd5449b4923cd413d3ba748124f91388be9 upstream.

    We should never return -ERESTARTSYS to userspace, transform it into
    -EINTR.

    Cc: stable@vger.kernel.org # v5.3+
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
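
The transformation is a one-line mapping, sketched below. `ERESTARTSYS` is a kernel-internal code that userspace `<errno.h>` does not define, so the sketch defines its own constant with the kernel's value of 512; `ERESTARTSYS_SKETCH` and `sanitize_ret()` are hypothetical names used only for this example.

```c
#include <errno.h>

/* Kernel-internal restart code; not exposed by userspace <errno.h>. */
#define ERESTARTSYS_SKETCH 512

/* Map the internal restart code to -EINTR before returning to userspace;
 * every other value passes through unchanged. */
static int sanitize_ret(int ret)
{
    return ret == -ERESTARTSYS_SKETCH ? -EINTR : ret;
}
```

Only the restart code is rewritten; ordinary errors and successful return values are left alone.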
     
  • commit 311ae9e159d81a1ec1cf645daf40b39ae5a0bd84 upstream.

    Read/write requests using fixed buffers to devices that do not
    implement read/write_iter can cause a general protection fault, which
    totally hangs the machine.

    io_import_fixed() initialises iov_iter with bvec, but loop_rw_iter()
    accesses it as iovec, dereferencing a random address.

    kmap() page by page in this case.

    Cc: stable@vger.kernel.org
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Pavel Begunkov
     

05 Dec, 2019

1 commit

  • [ Upstream commit 181e448d8709e517c9c7b523fcd209f24eb38ca7 ]

    If we don't inherit the original task creds, then we can confuse users
    like fuse that pass creds in the request header. See link below on
    identical aio issue.

    Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

14 Nov, 2019

2 commits

  • A test case was reported where two linked reads with registered buffers
    always failed the second link. This is because we set the expected value
    of a request in req->result, and if we don't get this result, then we
    fail the dependent links. For some reason the registered buffer import
    returned -ERROR/0, while the normal import returns -ERROR/length. This
    broke linked commands with registered buffers.

    Fix this by making io_import_fixed() correctly return the mapped length.

    Cc: stable@vger.kernel.org # v5.3
    Reported-by: 李通洲
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • For timeout requests io_uring tries to grab a file with the specified fd,
    which is usually stdin/fd=0.
    Update io_op_needs_file() so that timeout requests do not need a file.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

12 Nov, 2019

1 commit

  • Currently we make sequence == 0 be the same as sequence == 1, but that's
    not super useful if the intent is really to have a timeout that's just
    a pure timeout.

    If the user passes in sqe->off == 0, then don't apply any sequence logic
    to the request, let it purely be driven by the timeout specified.

    Reported-by: 李通洲
    Reviewed-by: 李通洲
    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 Oct, 2019

1 commit

  • We use io_kiocb->result == -EAGAIN as a way to know if we need to
    re-submit a polled request, as -EAGAIN reporting happens out-of-line
    for IO submission failures. This field is cleared when we originally
    allocate the request, but it isn't reset when we retry the submission
    from async context. This can cause issues where we think something
    needs a re-issue, but we're really just reading stale data.

    Reset ->result whenever we re-prep a request for polled submission.

    Cc: stable@vger.kernel.org
    Fixes: 9e645e1105ca ("io_uring: add support for sqe links")
    Reported-by: Bijan Mottahedeh
    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Oct, 2019

2 commits

  • syzkaller reported an issue where it looks like a malicious app can
    trigger a use-after-free of reading the ctx ->sq_array and ->rings
    value right after having installed the ring fd in the process file
    table.

    Defer ring fd installation until after we're done reading those
    values.

    Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
    Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • io_queue_link_head() owns shadow_req after taking it as an argument.
    By not freeing it in case of an error, it can leak the request along
    with taken ctx->refs.

    Reviewed-by: Jackie Liu
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

26 Oct, 2019

2 commits

  • We currently assume that submissions from the sqthread are successful,
    and if IO polling is enabled, we use that value for knowing how many
    completions to look for. But if we overflowed the CQ ring or some
    requests simply got errored and already completed, they won't be
    available for polling.

    For the case of IO polling and SQTHREAD usage, look at the pending
    poll list. If it ever hits empty then we know that we don't have
    any more pollable requests inflight. For that case, simply reset
    the inflight count to zero.

    Reported-by: Pavel Begunkov
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We currently use the ring values directly, but that can lead to issues
    if the application is malicious and changes these values on our behalf.
    Create in-kernel cached versions of them, and just overwrite the user
    side when we update them. This is similar to how we treat the sq/cq
    ring tail/head updates.

    Reported-by: Pavel Begunkov
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
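
The caching approach can be sketched as follows, assuming hypothetical types `shared_ring` and `ring_ctx` and helpers `ring_setup()` and `ring_index()`. The kernel side keeps private copies of the ring geometry, writes them out to the shared area for userspace to read, and never reads them back.

```c
#include <stdint.h>

/* Hypothetical region mapped into userspace; the application can modify it. */
struct shared_ring {
    volatile uint32_t ring_mask;
    volatile uint32_t ring_entries;
};

/* Hypothetical kernel-private context holding the trusted copies. */
struct ring_ctx {
    uint32_t cached_mask;
    uint32_t cached_entries;
    struct shared_ring *shared;
};

/* 'entries' is assumed to be a power of two. */
static void ring_setup(struct ring_ctx *ctx, struct shared_ring *shared,
                       uint32_t entries)
{
    ctx->cached_entries = entries;
    ctx->cached_mask = entries - 1;
    ctx->shared = shared;

    /* Publish for userspace; these are write-only from this side. */
    shared->ring_entries = entries;
    shared->ring_mask = entries - 1;
}

/* Index computation uses only the private copy. */
static uint32_t ring_index(const struct ring_ctx *ctx, uint32_t pos)
{
    return pos & ctx->cached_mask;
}
```

Even if the application overwrites `ring_mask` in the shared area, `ring_index()` keeps returning indices inside the ring, because it never consults the shared value.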
     

25 Oct, 2019

3 commits

  • io_ring_submit() finalises with
    1. io_commit_sqring(), which releases sqes to userspace
    2. a call to io_queue_link_head(), which accesses the released head's sqe

    Reorder them.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • io_sq_thread() processes sqes in batches of 8 without considering links.
    As a result, links will be randomly subdivided.

    The easiest way to fix it is to call io_get_sqring() inside
    io_submit_sqes(), as io_ring_submit() does.

    Downsides:
    1. This removes the optimisation of not grabbing mm_struct for fixed files
    2. It submits all sqes in one go, without finer-grained scheduling
    with cq processing.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • There is a bug where failed linked requests are returned not with the
    specified @user_data, but with garbage from a kernel stack.

    The reason is that io_fail_links() uses req->user_data, which is
    uninitialised when called from io_queue_sqe() on the fail path.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

24 Oct, 2019

3 commits

  • The sequence number of a timeout req (req->sequence) indicates the
    expected completion request. Because each timeout req consumes a
    sequence number, the sequence numbers of the timeout reqs on the timeout
    list shouldn't be the same. But currently we may get the same (and
    therefore incorrect) number if we insert a new entry before the last one,
    for example when submitting the following two timeout reqs on a new ring
    instance:

                       req->sequence
    req_1 (count = 2): 2
    req_2 (count = 1): 2

    Then, if we submit a nop req, req_2 will still time out even though the
    nop req finished. This patch fixes the problem by adjusting the sequence
    number of each reordered req when inserting a new entry.

    Signed-off-by: zhangyi (F)
    Signed-off-by: Jens Axboe

    zhangyi (F)
     
  • The sequence numbers of reqs on the timeout_list before the timeout req
    should be adjusted in io_timeout_fn(), because the current timeout req
    consumes a slot in the cq_ring and the cq_tail pointer will be
    increased. Otherwise, other timeout reqs may return in advance without
    waiting for enough wait_nr.

    Signed-off-by: zhangyi (F)
    Signed-off-by: Jens Axboe

    zhangyi (F)
     
  • There are cases where it isn't always safe to block for submission,
    even if the caller asked to wait for events as well. Revert the
    previous optimization of doing that.

    This reverts two commits:

    bf7ec93c644cb
    c576666863b78

    Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API")
    Signed-off-by: Jens Axboe

    Jens Axboe