02 Dec, 2019

1 commit

  • syzbot reports:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
    RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
    RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
    Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
    24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
    c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
    RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
    RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
    RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
    R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
    R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
    kthread+0x361/0x430 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
    Modules linked in:
    ---[ end trace f2e1a4307fbe2245 ]---

    which is caused by slab fault injection triggering a failure in
    prepare_creds(). We don't actually need to create a copy of the creds,
    as we're not modifying them; we just need a reference to the current
    task's creds. This avoids the failure case as well, and propagates
    const throughout the stack.
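
    As a minimal sketch, the shape of the change looks roughly like this
    (get_current_cred(), override_creds() and revert_creds() are the stock
    <linux/cred.h> helpers; the ctx field name is illustrative):

        #include <linux/cred.h>

        /* Before: make a private copy, which can fail under fault injection */
        ctx->creds = prepare_creds();           /* may return NULL */

        /* After: take a reference on the current task's (const) creds;
         * get_current_cred() cannot fail */
        ctx->creds = get_current_cred();

        /* worker/sqpoll thread side */
        const struct cred *old_cred = override_creds(ctx->creds);
        /* ... issue the work ... */
        revert_creds(old_cred);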

    Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
    Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe


27 Nov, 2019

3 commits

  • Currently we're using 40 bytes for the io_wq_work structure, and 16 of
    those are taken up by the doubly linked list node. We don't need doubly
    linked lists: we always add to the tail to keep things ordered, and any
    other use case is list traversal with deletion. For the deletion case,
    we can easily support deletion of any node by keeping track of the
    previous entry, as sketched below.

    This shrinks io_wq_work to 32 bytes, and subsequently shrinks io_kiocb
    in io_uring from 216 to 208 bytes.
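
    A minimal sketch of the replacement structure (helper names follow the
    io-wq style; locking and error handling omitted):

        struct io_wq_work_node {
                struct io_wq_work_node *next;
        };

        struct io_wq_work_list {
                struct io_wq_work_node *first;
                struct io_wq_work_node *last;
        };

        /* always append, so the list stays ordered */
        static void wq_list_add_tail(struct io_wq_work_node *node,
                                     struct io_wq_work_list *list)
        {
                node->next = NULL;
                if (!list->first)
                        list->first = node;
                else
                        list->last->next = node;
                list->last = node;
        }

        /* delete during traversal by remembering the previous entry */
        static void wq_node_del(struct io_wq_work_list *list,
                                struct io_wq_work_node *node,
                                struct io_wq_work_node *prev)
        {
                if (node == list->first)
                        list->first = node->next;
                if (node == list->last)
                        list->last = prev;
                if (prev)
                        prev->next = node->next;
                node->next = NULL;
        }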

    Signed-off-by: Jens Axboe

  • There are several things that can go wrong in the current code on NUMA
    systems, especially if not all nodes are online all the time:

    - If the identifiers of the online nodes do not form a single contiguous
      block starting at zero, wq->wqes will be too small, and OOB memory
      accesses will occur e.g. in the loop in io_wq_create().
    - If a node comes online between the call to num_online_nodes() and the
      for_each_node() loop in io_wq_create(), an OOB write will occur.
    - If a node comes online between io_wq_create() and io_wq_enqueue(), a
      lookup is performed for an element that doesn't exist, and an OOB read
      will probably occur.

    Fix it by:

    - using nr_node_ids instead of num_online_nodes() for the allocation
      size; nr_node_ids is calculated by setup_nr_node_ids() to be bigger
      than the highest node ID that could possibly come online at some
      point, even if those nodes' identifiers are not a contiguous block
    - creating workers for all possible nodes, not just all online ones

    This is basically what the normal workqueue code also does, as far as I can
    tell.
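
    A sketch of the resulting allocation and setup (nr_node_ids and
    for_each_node() are the standard NUMA helpers; error handling omitted):

        /* one slot for every possible node ID, not just the online count */
        wq->wqes = kcalloc(nr_node_ids, sizeof(struct io_wqe *), GFP_KERNEL);
        if (!wq->wqes)
                return ERR_PTR(-ENOMEM);

        for_each_node(node) {   /* every possible node, online or not */
                wq->wqes[node] = kzalloc_node(sizeof(struct io_wqe),
                                              GFP_KERNEL, node);
                /* ... per-node init ... */
        }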

    Signed-off-by: Jann Horn
    Signed-off-by: Jens Axboe

  • These allocations are single-element allocations, so don't use the array
    allocation wrapper for them.
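
    Illustratively, the kind of change involved (a sketch, not the exact
    hunks):

        -       wq = kcalloc(1, sizeof(*wq), GFP_KERNEL);
        +       wq = kzalloc(sizeof(*wq), GFP_KERNEL);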

    Signed-off-by: Jann Horn
    Signed-off-by: Jens Axboe


26 Nov, 2019

5 commits

  • If we don't inherit the original task creds, then we can confuse users
    like fuse that pass creds in the request header. See the link below for
    the identical aio issue.

    Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
    Signed-off-by: Jens Axboe

  • We currently pass in 4 arguments outside of the bounded size. In
    preparation for adding one more argument, let's bundle them up in
    a struct to make it more readable.
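
    A sketch of the bundling, hedged to the arguments io_wq_create() takes
    at this point (mm, user, and the get/put work handlers):

        struct io_wq_data {
                struct mm_struct *mm;
                struct user_struct *user;
                get_work_fn *get_work;
                put_work_fn *put_work;
        };

        struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data);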

    No functional changes in this patch.

    Signed-off-by: Jens Axboe

  • When we find new work to process within the work handler, we queue the
    linked timeout before we have issued the new work. This can be
    problematic for very short timeouts, as we have a window where the new
    work isn't visible.

    Allow the work handler to store a callback function for this in the work
    item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
    is set, then io-wq will call the callback when it has set up the new
    work item, as sketched below.
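
    Roughly, both halves of the mechanism (io_link_work_cb is illustrative
    of the caller-side handler):

        /* work handler: stash a setup callback on the new work item */
        work->func = io_link_work_cb;   /* queues the linked timeout, then
                                         * resets ->func to the real handler */
        work->flags |= IO_WQ_WORK_CB;

        /* io-wq: invoke it once the new work item is installed and visible */
        if (work->flags & IO_WQ_WORK_CB)
                work->func(&work);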

    Reported-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

  • These lines are indented an extra space character.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe

  • We currently have a race where, if setup is really slow, io_wq_destroy()
    can be called before we're done setting up. This will cause the caller
    to get stuck waiting for the manager to set things up, even though the
    manager has already exited.

    Fix this by doing a synchronous setup of the manager, as sketched below.
    This also fixes the case where we'd get stuck if worker creation failed.

    In practice this race window was really small, as we already wait for
    the manager to start. Hence someone would have to call io_wq_destroy()
    after the task has started, but before it started the first loop. The
    reported test case forked tons of these, which is why it became an
    issue.
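
    A sketch of the synchronous handshake, using a standard completion:

        init_completion(&wq->done);
        wq->manager = kthread_create(io_wq_manager, wq, "io_wq_manager");
        if (!IS_ERR(wq->manager)) {
                wake_up_process(wq->manager);
                /* manager calls complete(&wq->done) once workers exist,
                 * or flags an error and completes before exiting */
                wait_for_completion(&wq->done);
        }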

    Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com
    Fixes: 771b53d033e8 ("io-wq: small threadpool implementation for io_uring")
    Signed-off-by: Jens Axboe


14 Nov, 2019

4 commits

  • Since we don't iterate these lists anymore after commit:

    e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items")

    we don't need to retain the nulls value we use for them. That means it's
    pretty pointless to wrap the hlist_nulls_head in a structure, so get rid
    of it.

    Signed-off-by: Jens Axboe

  • We have two lists for workers in io-wq, a busy and a free list. For
    certain operations we want to browse all workers, and we currently do
    that by browsing the two separate lists. But since these lists are RCU
    protected, we can potentially miss workers if they move between the two
    lists while we're browsing them.

    Add a third list, all_list, that simply holds all workers. A worker is
    added to that list when it starts, and removed when it exits. This makes
    the worker iteration cleaner, too.
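
    A sketch of the third list, using the standard RCU list primitives:

        /* worker startup */
        spin_lock_irq(&wqe->lock);
        list_add_tail_rcu(&worker->all_list, &wqe->all_list);
        spin_unlock_irq(&wqe->lock);

        /* worker exit */
        spin_lock_irq(&wqe->lock);
        list_del_rcu(&worker->all_list);
        spin_unlock_irq(&wqe->lock);

        /* iteration sees every worker, regardless of busy/free moves */
        rcu_read_lock();
        list_for_each_entry_rcu(worker, &wqe->all_list, all_list)
                func(worker);   /* whatever the operation needs */
        rcu_read_unlock();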

    Reported-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Jens Axboe

  • worker->cur_work is currently protected by the lock of the wqe that the
    worker belongs to. When we send a signal to a worker, we need a stable
    view of ->cur_work, so we need to hold that lock. But this doesn't work
    so well, since we have the opposite order potentially on queueing work.
    If POLL_ADD is used with a signalfd, then io_poll_wake() is called with
    the signal lock, and that sometimes needs to insert work items.

    Add a specific worker lock that protects the current work item. Then we
    can guarantee that the task we're sending a signal to is currently
    processing the exact work we think it is, as sketched below.
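
    A sketch of the per-worker lock and how both sides use it:

        struct io_worker {
                /* ... */
                spinlock_t lock;                /* protects ->cur_work */
                struct io_wq_work *cur_work;
        };

        /* worker, when picking up or dropping work */
        spin_lock_irq(&worker->lock);
        worker->cur_work = work;
        spin_unlock_irq(&worker->lock);

        /* canceller: no wqe->lock needed, so no inversion with the
         * signalfd/sighand locking on the queueing side */
        spin_lock_irqsave(&worker->lock, flags);
        if (worker->cur_work == target)
                send_sig(SIGINT, worker->task, 1);
        spin_unlock_irqrestore(&worker->lock, flags);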

    Reported-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Jens Axboe

  • For cancellation, we need to ensure that the work item stays valid for
    as long as ->cur_work is valid. Right now we can't safely dereference
    the work item even under the wqe->lock, because while the ->cur_work
    pointer will remain valid, the work could be completing and be freed
    in parallel.

    Only invoke ->get/put_work() on work items that we know the caller
    queued themselves. Add IO_WQ_WORK_INTERNAL for io-wq's own use, which
    is needed when we're queueing a flush item, for instance.
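
    As a sketch of the distinction (the flag name is from this patch, the
    rest is illustrative):

        /* enqueue: only take a reference for work the caller queued itself */
        if (!(work->flags & IO_WQ_WORK_INTERNAL) && wq->get_work)
                wq->get_work(work);

        /* io-wq internal flush item: marked so it's never handed to
         * ->get/put_work() */
        struct io_wq_work internal_work = { .flags = IO_WQ_WORK_INTERNAL };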

    Signed-off-by: Jens Axboe


08 Nov, 2019

1 commit

  • io_uring supports request types that basically have two different
    lifetimes:

    1) Bounded completion time. These are requests like disk reads or writes,
       which we know will finish in a finite amount of time.
    2) Unbounded completion time. These are generally networked IO, where we
       have no idea how long they will take to complete. Another example is
       POLL commands.

    This patch provides support for io-wq to handle these differently, so we
    don't starve bounded requests by tying up workers for too long. By default
    all work is bounded, unless otherwise specified in the work item.
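
    As a sketch of how a request would opt out of the bounded default
    (IO_WQ_WORK_UNBOUND is the flag this patch adds; the helper is
    hypothetical):

        /* networked IO and POLL have no bounded completion time */
        if (req_takes_unbounded_time(req))      /* hypothetical helper */
                req->work.flags |= IO_WQ_WORK_UNBOUND;

        /* io-wq accounts bounded and unbounded workers separately, so
         * unbounded work can't tie up the bounded worker budget */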

    Signed-off-by: Jens Axboe


01 Nov, 2019

1 commit

  • This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
    cancel requests that have been punted to async context and are now
    in-flight. This works for regular read/write requests to files, as
    long as they haven't been started yet. For socket based IO (or things
    like accept4(2)), we can cancel work that is already running as well.

    To cancel a request, the sqe must have ->addr set to the user_data of
    the request it wishes to cancel. If the request is cancelled
    successfully, the original request is completed with -ECANCELED
    and the cancel request is completed with a result of 0. If the
    request was already running, the original may or may not complete
    in error. The cancel request will complete with -EALREADY for that
    case. And finally, if the request to cancel wasn't found, the cancel
    request is completed with -ENOENT.
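
    From userspace, a raw-SQE sketch of issuing a cancel (no helper
    library assumed):

        #include <string.h>
        #include <linux/io_uring.h>

        static void prep_async_cancel(struct io_uring_sqe *sqe,
                                      __u64 target_user_data)
        {
                memset(sqe, 0, sizeof(*sqe));
                sqe->opcode = IORING_OP_ASYNC_CANCEL;
                /* ->addr holds the user_data of the request to cancel */
                sqe->addr = target_user_data;
        }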

    Signed-off-by: Jens Axboe


30 Oct, 2019

2 commits

  • This is in preparation for adding opcodes that need to add new files
    to a process' file table, for system calls like open(2) or accept4(2).

    If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
    item. If work that needs to get punted to async context has this set,
    the async worker will assume the original task's file table before
    executing the work, as sketched below.

    Note that opcodes that need access to the current files of an
    application cannot be done through IORING_SETUP_SQPOLL.
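
    A sketch of the worker side (the task_lock()-protected files swap is
    the standard pattern; field names illustrative):

        struct files_struct *old_files = NULL;

        if (work->flags & IO_WQ_WORK_NEEDS_FILES) {
                task_lock(current);
                old_files = current->files;
                current->files = work->files;   /* submitter's file table */
                task_unlock(current);
        }

        /* ... execute the work, then restore old_files the same way ... */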

    Signed-off-by: Jens Axboe

  • This adds support for io-wq, a smaller and specialized thread pool
    implementation. This is meant to replace workqueues for io_uring. Among
    the reasons for this addition are:

    - We can assign memory context more smartly and persistently if we
      manage the lifetime of the threads.

    - We can drop various work-arounds we have in io_uring, like the
      async_list.

    - We can implement hashed work insertion, to manage concurrency of
      buffered writes without needing a) an extra workqueue, or b)
      needlessly making the concurrency of said workqueue very low, which
      hurts performance of multiple buffered file writers.

    - We can implement cancel through signals, for cancelling
      interruptible work like read/write (or send/recv) to/from sockets.

    - We need the above cancel for being able to assign and use file tables
      from a process.

    - We can implement a more thorough cancel operation in general.

    - We need it to move towards a syslet/threadlet model for even faster
      async execution. For that we need to take ownership of the used
      threads.

    This list is just off the top of my head. Performance should be the
    same, or better, at least that's what I've seen in my testing. io-wq
    supports basic NUMA functionality, setting up a pool per node.

    io-wq hooks into the scheduler's schedule in/out events, just like the
    workqueue code does, and uses that to drive the need for more or fewer
    workers.

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe
