26 Nov, 2014

1 commit


07 Nov, 2014

1 commit

  • https://bugzilla.kernel.org/show_bug.cgi?id=86831

    Markus reported that shutting down mysqld (with AIO support, on an
    ext3-formatted hard drive) leads to a negative number of dirty pages
    (an underrun of the counter). The negative number drastically reduces
    write performance because the page cache is no longer used: the kernel
    thinks there are still about 2^32 dirty pages outstanding.

    Adding a warning trace in __dec_zone_state() catches this easily:

    static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
    {
            atomic_long_dec(&zone->vm_stat[item]);
    +       WARN_ON_ONCE(item == NR_FILE_DIRTY &&
                         atomic_long_read(&zone->vm_stat[item]) < 0);
            atomic_long_dec(&vm_stat[item]);
    }

    [ 21.341632] ------------[ cut here ]------------
    [ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224()
    [ 21.355296] Modules linked in: wutbox_cp sata_mv
    [ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
    [ 21.366793] Workqueue: events free_ioctx
    [ 21.370760] [] (unwind_backtrace) from [] (show_stack+0x20/0x24)
    [ 21.378562] [] (show_stack) from [] (dump_stack+0x24/0x28)
    [ 21.385840] [] (dump_stack) from [] (warn_slowpath_common+0x84/0x9c)
    [ 21.393976] [] (warn_slowpath_common) from [] (warn_slowpath_null+0x2c/0x34)
    [ 21.402800] [] (warn_slowpath_null) from [] (cancel_dirty_page+0x164/0x224)
    [ 21.411524] [] (cancel_dirty_page) from [] (truncate_inode_page+0x8c/0x158)
    [ 21.420272] [] (truncate_inode_page) from [] (truncate_inode_pages_range+0x11c/0x53c)
    [ 21.429890] [] (truncate_inode_pages_range) from [] (truncate_pagecache+0x88/0xac)
    [ 21.439252] [] (truncate_pagecache) from [] (truncate_setsize+0x5c/0x74)
    [ 21.447731] [] (truncate_setsize) from [] (put_aio_ring_file.isra.14+0x34/0x90)
    [ 21.456826] [] (put_aio_ring_file.isra.14) from [] (aio_free_ring+0x20/0xcc)
    [ 21.465660] [] (aio_free_ring) from [] (free_ioctx+0x24/0x44)
    [ 21.473190] [] (free_ioctx) from [] (process_one_work+0x134/0x47c)
    [ 21.481132] [] (process_one_work) from [] (worker_thread+0x130/0x414)
    [ 21.489350] [] (worker_thread) from [] (kthread+0xd4/0xec)
    [ 21.496621] [] (kthread) from [] (ret_from_fork+0x14/0x20)
    [ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

    The cause is that the aio ring file pages are marked *DIRTY* via SetPageDirty
    at init time (which bypasses the VFS dirty-page increment), while the aio fs
    uses *default_backing_dev_info* as its backing dev, which does not disable
    the dirty-page accounting capability. Truncating the aio ring file therefore
    decrements the VFS dirty-page counter that was never incremented, and the
    error occurs.

    The original goal of marking the pages dirty was to keep them in memory
    (not reclaimable or swappable) for their lifetime. But the pages are already
    pinned by an elevated refcount, which achieves that goal on its own, so the
    SetPageDirty seems unnecessary.

    To fix the issue, use __set_page_dirty_no_writeback instead of the nop
    .set_page_dirty, and drop the manual SetPageDirty (don't set the dirty flag
    by hand, don't disable set_page_dirty(); rely on the default behaviour).

    With the above change, dirty-page accounting works correctly. But since the
    aio fs is an anonymous filesystem that should never cause any real writeback,
    we can skip dirty-page and writeback accounting entirely by disabling those
    capabilities. So introduce an aio-private backing_dev_info (with the
    ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) to replace the default
    one.
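
    A minimal sketch of what such a fix can look like (illustrative only: the
    struct and capability names below follow mainline conventions of that era,
    but the exact definitions in the merged patch may differ):

    /* aio-private backing dev: no dirty or writeback accounting. */
    static struct backing_dev_info aio_fs_backing_dev_info = {
            .name         = "aiofs",
            .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
    };

    static const struct address_space_operations aio_ctx_aops = {
            /* keep the dirty flag semantics, but never account or write back */
            .set_page_dirty = __set_page_dirty_no_writeback,
            .migratepage    = aio_migratepage,
    };

    The ring file's mapping then points its backing_dev_info at
    aio_fs_backing_dev_info, and the manual SetPageDirty() calls in
    aio_setup_ring() are dropped.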

    Reported-by: Markus Königshaus
    Signed-off-by: Gu Zheng
    Cc: stable
    Acked-by: Andrew Morton
    Signed-off-by: Benjamin LaHaise

    Gu Zheng
     

25 Sep, 2014

2 commits

  • With the recent addition of percpu_ref_reinit(), percpu_ref now can be
    used as a persistent switch which can be turned on and off repeatedly
    where turning off maps to killing the ref and waiting for it to drain;
    however, there currently isn't a way to initialize a percpu_ref in its
    off (killed and drained) state, which can be inconvenient for certain
    persistent switch use cases.

    Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
    selection of operation mode; however, currently a newly initialized
    percpu_ref is always in percpu mode making it impossible to avoid the
    latency overhead of switching to atomic mode.

    This patch adds @flags to percpu_ref_init() and implements the
    following flags.

    * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
    * PERCPU_REF_INIT_DEAD : start ref killed and drained

    These flags should be able to serve the above two use cases.
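
    With these flags, an initialization call (post-series signature; shown as an
    illustrative usage sketch with a hypothetical "sw" switch object) looks
    roughly like:

    /* Start the ref killed and drained so the switch is initially off. */
    ret = percpu_ref_init(&sw->ref, sw_release, PERCPU_REF_INIT_DEAD,
                          GFP_KERNEL);
    if (ret)
            return ret;

    percpu_ref_reinit(&sw->ref);      /* turn the switch on */
    /* ... */
    percpu_ref_kill(&sw->ref);        /* turn it off again; may be repeated */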

    v2: target_core_tpg.c conversion was missing. Fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Johannes Weiner

    Tejun Heo
     
  • …linux-block into for-3.18

    This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
    kludge for SCSI blk-mq stall during probe"), which implements
    __percpu_ref_kill_expedited() to work around the SCSI blk-mq stall. That
    commit will be reverted and patches implementing a proper fix will be added.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>

    Tejun Heo
     

08 Sep, 2014

1 commit

  • Percpu allocator now supports allocation mask. Add @gfp to
    percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_refs too.

    This patch doesn't make any functional difference.
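
    At this point in the series (before the flags argument described above was
    added) a caller that cannot block can therefore do something like the
    following sketch, where "obj" and obj_ref_release are hypothetical:

    /* GFP_NOWAIT: set up the percpu counter without sleeping. */
    ret = percpu_ref_init(&obj->ref, obj_ref_release, GFP_NOWAIT);
    if (ret)
            return ret;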

    v2: blk-mq conversion was missing. Updated.

    Signed-off-by: Tejun Heo
    Cc: Kent Overstreet
    Cc: Benjamin LaHaise
    Cc: Li Zefan
    Cc: Nicholas A. Bellinger
    Cc: Jens Axboe

    Tejun Heo
     

05 Sep, 2014

1 commit


03 Sep, 2014

1 commit

  • We ran into a case on ppc64 running mariadb where io_getevents would
    return zeroed out I/O events. After adding instrumentation, it became
    clear that there was some missing synchronization between reading the
    tail pointer and the events themselves. This small patch fixes the
    problem in testing.
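
    The fix boils down to an explicit read barrier between loading the tail
    pointer and reading the events it covers; a sketch of the relevant part of
    aio_read_events_ring() (simplified from the description above):

    ring = kmap_atomic(ctx->ring_pages[0]);
    head = ring->head;
    tail = ring->tail;
    kunmap_atomic(ring);

    /*
     * Ensure that once we've read the current tail pointer, we also see
     * the event data that was stored up to that tail.
     */
    smp_rmb();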

    Thanks to Zach for helping to look into this, and suggesting the fix.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Benjamin LaHaise
    Cc: stable@vger.kernel.org

    Jeff Moyer
     

25 Aug, 2014

1 commit

  • As reported by Dan Aloni, commit f8567a3845ac ("aio: fix aio request
    leak when events are reaped by userspace") introduces a regression when
    user code attempts to perform io_submit() with more events than are
    available in the ring buffer. Reverting that commit would reintroduce a
    regression when user space event reaping is used.

    Fixing this bug is a bit more involved than the previous attempts to fix
    this regression. Since we do not have a single point at which we can
    count events as being reaped, whether by user space or by io_getevents(),
    we have to track event completion by looking at the number of events left
    in the event ring. So long as there are as many events in the ring buffer
    as there have been completion events generated, we cannot call
    put_reqs_available(). The code to check for this is now placed in
    refill_reqs_available().
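
    A sketch of the approach (simplified; field names such as completed_events
    follow the description above but are illustrative, not the verbatim patch):

    static void refill_reqs_available(struct kioctx *ctx, unsigned head,
                                      unsigned tail)
    {
            unsigned events_in_ring, completed;

            /* Clamp head: it comes from the shared ring that userspace maps. */
            head %= ctx->nr_events;
            if (head <= tail)
                    events_in_ring = tail - head;
            else
                    events_in_ring = ctx->nr_events - (head - tail);

            /* Only events no longer visible in the ring have truly been reaped. */
            completed = ctx->completed_events;
            if (events_in_ring < completed)
                    completed -= events_in_ring;
            else
                    completed = 0;

            if (!completed)
                    return;

            ctx->completed_events -= completed;
            put_reqs_available(ctx, completed);
    }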

    A test program from Dan and modified by me for verifying this bug is available
    at http://www.kvack.org/~bcrl/20140824-aio_bug.c .

    Reported-by: Dan Aloni
    Signed-off-by: Benjamin LaHaise
    Acked-by: Dan Aloni
    Cc: Kent Overstreet
    Cc: Mateusz Guzik
    Cc: Petr Matousek
    Cc: stable@vger.kernel.org # v3.16 and anything that f8567a3845ac was backported to
    Signed-off-by: Linus Torvalds

    Benjamin LaHaise
     

16 Aug, 2014

1 commit

  • Pull aio updates from Ben LaHaise.

    * git://git.kvack.org/~bcrl/aio-next:
    aio: use iovec array rather than the single one
    aio: fix some comments
    aio: use the macro rather than the inline magic number
    aio: remove the needless registration of ring file's private_data
    aio: remove no longer needed preempt_disable()
    aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()
    aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

    Linus Torvalds
     

05 Aug, 2014

1 commit

  • Pull percpu updates from Tejun Heo:

    - Major reorganization of percpu header files which I think makes
    things a lot more readable and logical than before.

    - percpu-refcount is updated so that it requires explicit destruction
    and can be reinitialized if necessary. This was pulled into the
    block tree to replace the custom percpu refcnting implemented in
    blk-mq.

    - In the process, percpu and percpu-refcount got cleaned up a bit

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits)
    percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
    percpu-refcount: require percpu_ref to be exited explicitly
    percpu-refcount: use unsigned long for pcpu_count pointer
    percpu-refcount: add helpers for ->percpu_count accesses
    percpu-refcount: one bit is enough for REF_STATUS
    percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
    workqueue: stronger test in process_one_work()
    workqueue: clear POOL_DISASSOCIATED in rebind_workers()
    percpu: Use ALIGN macro instead of hand coding alignment calculation
    percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
    percpu: preffity percpu header files
    percpu: use raw_cpu_*() to define __this_cpu_*()
    percpu: reorder macros in percpu header files
    percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
    percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
    percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
    percpu: reorganize include/linux/percpu-defs.h
    percpu: move accessors from include/linux/percpu.h to percpu-defs.h
    percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
    percpu: introduce arch_raw_cpu_ptr()
    ...

    Linus Torvalds
     

24 Jul, 2014

4 commits


22 Jul, 2014

1 commit


15 Jul, 2014

2 commits

  • Benjamin LaHaise
     
    As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible for
    put_reqs_available() to be called from irq context. While put_reqs_available()
    is per cpu, it did not protect itself from interrupts on the same CPU. This
    led to aio_complete() corrupting the available io requests count when run
    under heavy O_DIRECT workloads, as reported by Robert Elliott. Fix this by
    disabling irqs around the per-cpu batch updates of reqs_available.
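
    A sketch of the fix for the put side (the get side is analogous); this
    mirrors the description above and is not the verbatim patch:

    static void put_reqs_available(struct kioctx *ctx, unsigned nr)
    {
            struct kioctx_cpu *kcpu;
            unsigned long flags;

            /* aio_complete() may run in irq context, so protect the per-cpu
             * batch bookkeeping against interrupts on this CPU. */
            local_irq_save(flags);
            kcpu = this_cpu_ptr(ctx->cpu);

            kcpu->reqs_available += nr;
            while (kcpu->reqs_available >= ctx->req_batch * 2) {
                    kcpu->reqs_available -= ctx->req_batch;
                    atomic_add(ctx->req_batch, &ctx->reqs_available);
            }
            local_irq_restore(flags);
    }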

    Many thanks to Robert and folks for testing and tracking this down.

    Reported-by: Robert Elliott
    Tested-by: Robert Elliott
    Signed-off-by: Benjamin LaHaise
    Cc: Jens Axboe, Christoph Hellwig
    Cc: stable@vger.kernel.org

    Benjamin LaHaise
     

28 Jun, 2014

2 commits

  • Currently, a percpu_ref undoes percpu_ref_init() automatically by
    freeing the allocated percpu area when the percpu_ref is killed.
    While seemingly convenient, this has the following niggles.

    * It's impossible to re-init a released reference counter without
    going through re-allocation.

    * In a similar vein, it's impossible to initialize a percpu_ref
    count with static percpu variables.

    * We need and have an explicit destructor anyway for failure paths -
    percpu_ref_cancel_init().

    This patch removes the automatic percpu counter freeing in
    percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
    generic destructor now named percpu_ref_exit(). percpu_ref_destroy()
    is considered but it gets confusing with percpu_ref_kill() while
    "exit" clearly indicates that it's the counterpart of
    percpu_ref_init().

    All percpu_ref_cancel_init() users are updated to invoke
    percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
    added to the destruction path of all percpu_ref users.
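
    The resulting lifecycle, sketched with a hypothetical "foo" user (the init
    signature shown is the two-argument one in effect at this point in the
    series):

    int foo_create(struct foo *foo)
    {
            int ret = percpu_ref_init(&foo->ref, foo_ref_release);
            if (ret)
                    return ret;
            /* ... */
            return 0;
    }

    void foo_destroy(struct foo *foo)
    {
            percpu_ref_kill(&foo->ref);
            /* ... once the ref has drained and can no longer be used ... */
            percpu_ref_exit(&foo->ref);   /* now frees the percpu counter */
    }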

    Signed-off-by: Tejun Heo
    Acked-by: Benjamin LaHaise
    Cc: Kent Overstreet
    Cc: Christoph Lameter
    Cc: Benjamin LaHaise
    Cc: Nicholas A. Bellinger
    Cc: Li Zefan

    Tejun Heo
     
  • ioctx_alloc() reaches inside percpu_ref and directly frees
    ->pcpu_count in its failure path, which is quite gross. percpu_ref
    has been providing a proper interface to do this,
    percpu_ref_cancel_init(), for quite some time now. Let's use that
    instead.

    This patch doesn't introduce any behavior changes.

    Signed-off-by: Tejun Heo
    Acked-by: Benjamin LaHaise
    Cc: Kent Overstreet

    Tejun Heo
     

25 Jun, 2014

4 commits

    ioctx_add_table() is the writer; it does not need rcu_read_lock() to
    protect ->ioctx_table. It relies on mm->ioctx_lock, and the rcu locks
    just add confusion.

    It doesn't need rcu_dereference() for the same reason: it must see any
    updates previously done under the same ->ioctx_lock. We could use
    rcu_dereference_protected(), but the patch uses rcu_dereference_raw();
    the function is simple enough.

    The same applies to kill_ioctx(), although it does not update the pointer.
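
    In other words, the accesses end up looking roughly like this (sketch; not
    the verbatim patch):

    spin_lock(&mm->ioctx_lock);
    /* Writer side: ->ioctx_lock is held, so no rcu_read_lock() is needed
     * and rcu_dereference_raw() is sufficient. */
    table = rcu_dereference_raw(mm->ioctx_table);
    /* ... install or remove the ctx ... */
    spin_unlock(&mm->ioctx_lock);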

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Benjamin LaHaise

    Oleg Nesterov
     
  • On 04/30, Benjamin LaHaise wrote:
    >
    > > - ctx->mmap_size = 0;
    > > -
    > > - kill_ioctx(mm, ctx, NULL);
    > > + if (ctx) {
    > > + ctx->mmap_size = 0;
    > > + kill_ioctx(mm, ctx, NULL);
    > > + }
    >
    > Rather than indenting and moving the two lines changing mmap_size and the
    > kill_ioctx() call, why not just do "if (!ctx) ... continue;"? That reduces
    > the number of lines changed and avoid excessive indentation.

    OK. To me the code looks better/simpler with "if (ctx)", but this is subjective
    of course, I won't argue.

    The patch still removes the empty line between mmap_size = 0 and kill_ioctx();
    we reset mmap_size only for kill_ioctx(). But feel free to remove this change.

    -------------------------------------------------------------------------------
    Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

    1. We can read ->ioctx_table only once and we do not need rcu_read_lock()
    or even rcu_dereference().

    This mm has no users, nobody else can play with ->ioctx_table. Otherwise
    the code is buggy anyway, if we need rcu_read_lock() in a loop because
    ->ioctx_table can be updated then kfree(table) is obviously wrong.

    2. Update the comment. "exit_mmap(mm) is coming" is a good reason to avoid
    munmap(), but another reason is that we simply can't do vm_munmap() unless
    current->mm == mm, and this is not true in general; the caller is mmput().

    3. We do not really need to nullify mm->ioctx_table before return; probably
    the current code does this to catch potential problems. But in that
    case RCU_INIT_POINTER(NULL) looks better.
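
    Putting the three points together, the resulting exit_aio() looks roughly
    like the following sketch (simplified; comments and minor details trimmed):

    void exit_aio(struct mm_struct *mm)
    {
            struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
            int i;

            if (!table)
                    return;

            for (i = 0; i < table->nr; ++i) {
                    struct kioctx *ctx = table->table[i];

                    if (!ctx)
                            continue;
                    /* No munmap() here: exit_mmap(mm) is coming, and this is
                     * not necessarily current->mm anyway. */
                    ctx->mmap_size = 0;
                    kill_ioctx(mm, ctx, NULL);
            }

            RCU_INIT_POINTER(mm->ioctx_table, NULL);
            kfree(table);
    }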

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Benjamin LaHaise

    Oleg Nesterov
     
    A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
    by commit a31ad380bed817aa25f8830ad23e1a0480fef797. The changes made to
    aio_read_events_ring() failed to correctly limit the index into
    ctx->ring_pages[], allowing an attacker to cause the subsequent kmap() of
    an arbitrary page, whose contents are then copied into userspace via
    copy_to_user(). This vulnerability has been assigned CVE-2014-0206. Thanks
    to Mateusz and Petr for disclosing this issue.

    This patch applies to v3.12+. A separate backport is needed for 3.10/3.11.
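
    The essence of the fix is to clamp the head and tail indices read from the
    shared ring page before they are used to index ctx->ring_pages[] (sketch
    based on the description above, not the verbatim patch):

    ring = kmap_atomic(ctx->ring_pages[0]);
    head = ring->head;
    tail = ring->tail;
    kunmap_atomic(ring);

    /* head/tail live in a page userspace can write to; clamp them so the
     * ring_pages[] index computed from them cannot run off the array. */
    head %= ctx->nr_events;
    tail %= ctx->nr_events;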

    Signed-off-by: Benjamin LaHaise
    Cc: Mateusz Guzik
    Cc: Petr Matousek
    Cc: Kent Overstreet
    Cc: Jeff Moyer
    Cc: stable@vger.kernel.org

    Benjamin LaHaise
     
  • The aio cleanups and optimizations by kmo that were merged into the 3.10
    tree added a regression for userspace event reaping. Specifically, the
    reference counts are not decremented if the event is reaped in userspace,
    leading to the application being unable to submit further aio requests.
    This patch applies to 3.12+. A separate backport is required for 3.10/3.11.
    This issue was uncovered as part of CVE-2014-0206.

    Signed-off-by: Benjamin LaHaise
    Cc: stable@vger.kernel.org
    Cc: Kent Overstreet
    Cc: Mateusz Guzik
    Cc: Petr Matousek

    Benjamin LaHaise
     

15 Jun, 2014

1 commit

  • Pull aio fix and cleanups from Ben LaHaise:
    "This consists of a couple of code cleanups plus a minor bug fix"

    * git://git.kvack.org/~bcrl/aio-next:
    aio: cleanup: flatten kill_ioctx()
    aio: report error from io_destroy() when threads race in io_destroy()
    fs/aio.c: Remove ctx parameter in kiocb_cancel

    Linus Torvalds
     

07 May, 2014

1 commit

  • Beginning to introduce those. Just the callers for now, and it's
    clumsier than it'll eventually become; once we finish converting
    aio_read and aio_write instances, the things will get nicer.

    For now, these guys are in parallel to ->aio_read() and ->aio_write();
    they take iocb and iov_iter, with everything in iov_iter already
    validated. File offset is passed in iocb->ki_pos, iov/nr_segs -
    in iov_iter.
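
    The new methods sit alongside the old ones in struct file_operations;
    prototypes as described above (excerpt, other members omitted):

    struct file_operations {
            /* ... */
            ssize_t (*aio_read) (struct kiocb *, const struct iovec *,
                                 unsigned long, loff_t);
            ssize_t (*aio_write) (struct kiocb *, const struct iovec *,
                                  unsigned long, loff_t);
            ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
            ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
            /* ... */
    };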

    Main concerns in that series are stack footprint and ability to
    split the damn thing cleanly.

    [fix from Peter Ujfalusi folded]

    Signed-off-by: Al Viro

    Al Viro
     

01 May, 2014

1 commit

    The iovec should be reclaimed whenever the caller of rw_copy_check_uvector()
    returns, but this doesn't hold when a failure happens right after
    aio_setup_vectored_rw().

    Fix that in a way that avoids a hairy goto.

    Signed-off-by: Leon Yu
    Signed-off-by: Benjamin LaHaise
    Cc: stable@vger.kernel.org

    Leon Yu
     

30 Apr, 2014

2 commits


23 Apr, 2014

1 commit


17 Apr, 2014

1 commit

    io_destroy() deletes the aio context and all resources related to it. It
    makes sense that no IO operations connected to the context should be running
    after the context is destroyed. Once the io_context is removed we have no
    way to get the status of its requests or to call io_getevents().

    The man page for io_destroy() says that this function may block until
    all the context's requests are completed. Before kernel 3.11 io_destroy()
    did indeed block, but since the aio refactoring in 3.11 this is no longer true.

    Here is a pseudo-code that shows a testcase for a race condition discovered
    in 3.11:

    initialize io_context
    io_submit(read to buffer)
    io_destroy()

    // context is destroyed so we can free the resources
    free(buffers);

    // if the buffer is allocated by some other user he'll be surprised
    // to learn that the buffer still filled by an outstanding operation
    // from the destroyed io_context

    The fix is straightforward: add a completion struct and wait on it in
    io_destroy(); complete() is called when the number of in-flight requests
    reaches zero.

    If two or more io_destroy() calls race on the same context, only the first
    one waits for IO completion; the behaviour of the other calls is undefined.
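
    A sketch of the resulting io_destroy() path (simplified; the completion is
    handed to kill_ioctx() and completed once the in-flight request count drops
    to zero):

    SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
    {
            struct kioctx *ioctx = lookup_ioctx(ctx);

            if (likely(ioctx)) {
                    struct completion requests_done =
                            COMPLETION_INITIALIZER_ONSTACK(requests_done);
                    int ret;

                    /* Pass the completion down so the kioctx teardown can
                     * signal when all outstanding requests have finished. */
                    ret = kill_ioctx(current->mm, ioctx, &requests_done);
                    percpu_ref_put(&ioctx->users);

                    if (!ret)
                            wait_for_completion(&requests_done);
                    return ret;
            }
            return -EINVAL;
    }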

    Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and
    do not see the race condition anymore.

    Signed-off-by: Anatol Pomozov
    Signed-off-by: Benjamin LaHaise

    Anatol Pomozov
     

28 Mar, 2014

1 commit

    As reported by Tang Chen, Gu Zheng and Yasuaki Ishimatsu, the following issues
    exist in the aio ring page migration support.

    As a result, for example, we have the following problem:

    thread 1                                  |  thread 2
                                              |
    aio_migratepage()                         |
      |-> take ctx->completion_lock           |
      |-> migrate_page_copy(new, old)         |
          *NOW*, ctx->ring_pages[idx] == old  |
                                              |  *NOW*, ctx->ring_pages[idx] == old
                                              |  aio_read_events_ring()
                                              |    |-> ring = kmap_atomic(ctx->ring_pages[0])
                                              |    |-> ring->head = head;
                                              |        *HERE*, write to the old ring page
                                              |    |-> kunmap_atomic(ring);
                                              |
      |-> ctx->ring_pages[idx] = new          |
          *BUT NOW*, the content of           |
          ring_pages[idx] is old.             |
      |-> release ctx->completion_lock        |

    As above, the new ring page will not be updated.

    Fix this issue, as well as prevent races in aio_ring_setup() by holding
    the ring_lock mutex during kioctx setup and page migration. This avoids
    the overhead of taking another spinlock in aio_read_events_ring() as Tang's
    and Gu's original fix did, pushing the overhead into the migration code.

    Note that to handle the nesting of ring_lock inside of mmap_sem, the
    migratepage operation uses mutex_trylock(). Page migration is not a 100%
    critical operation in this case, so the occasional failure can be
    tolerated. This issue was reported by Sasha Levin.

    Based on feedback from Linus, avoid the extra taking of ctx->completion_lock.
    Instead, make page migration fully serialised by mapping->private_lock, and
    have aio_free_ring() simply disconnect the kioctx from the mapping by calling
    put_aio_ring_file() before touching ctx->ring_pages[]. This simplifies the
    error handling logic in aio_migratepage(), and should improve robustness.
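
    The trylock described above looks roughly like this inside
    aio_migratepage() (sketch, not the verbatim patch):

    /* ring_lock nests inside mmap_sem, so we may only try-lock it here;
     * failing the occasional page migration is tolerable. */
    if (!mutex_trylock(&ctx->ring_lock)) {
            rc = -EAGAIN;
            goto out;
    }
    /* ... copy the page and update ctx->ring_pages[idx] ... */
    mutex_unlock(&ctx->ring_lock);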

    v4: always do mutex_unlock() in cases when kioctx setup fails.

    Reported-by: Yasuaki Ishimatsu
    Reported-by: Sasha Levin
    Signed-off-by: Benjamin LaHaise
    Cc: Tang Chen
    Cc: Gu Zheng
    Cc: stable@vger.kernel.org

    Benjamin LaHaise
     

23 Dec, 2013

2 commits

  • Pull AIO leak fixes from Ben LaHaise:
    "I've put these two patches plus Linus's change through a round of
    tests, and it passes millions of iterations of the aio numa
    migratepage test, as well as a number of repetitions of a few simple
    read and write tests.

    The first patch fixes the memory leak Kent introduced, while the
    second patch makes aio_migratepage() much more paranoid and robust"

    * git://git.kvack.org/~bcrl/aio-next:
    aio/migratepages: make aio migrate pages sane
    aio: fix kioctx leak introduced by "aio: Fix a trinity splat"

    Linus Torvalds
     
  • Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages
    migration") the aio ring setup code has used a special per-ring backing
    inode for the page allocations, rather than just using random anonymous
    pages.

    However, rather than remembering the pages as it allocated them, it
    would allocate the pages, insert them into the file mapping (dirty, so
    that they couldn't be free'd), and then forget about them. And then to
    look them up again, it would mmap the mapping, and then use
    "get_user_pages()" to get back an array of the pages we just created.

    Now, not only is that incredibly inefficient, it also leaked all the
    pages if the mmap failed (which could happen due to excessive number of
    mappings, for example).

    So clean it all up, making it much more straightforward. Also remove
    some left-overs of the previous (broken) mm_populate() usage that was
    removed in commit d6c355c7dabc ("aio: fix race in ring buffer page
    lookup introduced by page migration support") but left the pointless and
    now misleading MAP_POPULATE flag around.
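
    The straightforward version simply remembers each page as it is created in
    the ring file's mapping, roughly like this sketch of the aio_setup_ring()
    allocation loop (simplified, error handling omitted):

    for (i = 0; i < nr_pages; i++) {
            struct page *page;

            page = find_or_create_page(file->f_inode->i_mapping,
                                       i, GFP_HIGHUSER | __GFP_ZERO);
            if (!page)
                    break;
            SetPageUptodate(page);
            unlock_page(page);

            /* remember the page directly instead of re-finding it via
             * mmap() + get_user_pages() later */
            ctx->ring_pages[i] = page;
    }
    ctx->nr_pages = i;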

    Tested-and-acked-by: Benjamin LaHaise
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Dec, 2013

2 commits

  • The arbitrary restriction on page counts offered by the core
    migrate_page_move_mapping() code results in rather suspicious looking
    fiddling with page reference counts in the aio_migratepage() operation.
    To fix this, make migrate_page_move_mapping() take an extra_count parameter
    that allows aio to tell the code about its own reference count on the page
    being migrated.

    While cleaning up aio_migratepage(), make it validate that the old page
    being passed in is actually what aio_migratepage() expects to prevent
    misbehaviour in the case of races.
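
    With the change, the prototype gains a final extra_count parameter, and aio
    passes its own reference (sketch; the parameter order follows the mainline
    signature of the time and the description above):

    int migrate_page_move_mapping(struct address_space *mapping,
                                  struct page *newpage, struct page *page,
                                  struct buffer_head *head,
                                  enum migrate_mode mode, int extra_count);

    /* aio_migratepage() holds one extra reference on the old ring page: */
    rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1);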

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     
    Commit e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference
    counting to correct a bug trinity found. Unfortunately, the change led
    to kioctxes being leaked because there was no final reference count to
    put. Add that reference count back in to fix things.

    Signed-off-by: Benjamin LaHaise
    Cc: stable@vger.kernel.org

    Benjamin LaHaise
     

07 Dec, 2013

1 commit


06 Dec, 2013

1 commit

    Clean up the aio ring file in the failure paths of aio_setup_ring()
    and ioctx_alloc(). This may also fix the GPF issue reported by
    Dave Jones:
    https://lkml.org/lkml/2013/11/25/898

    Signed-off-by: Gu Zheng
    Signed-off-by: Benjamin LaHaise

    Gu Zheng
     

23 Nov, 2013

1 commit


20 Nov, 2013

2 commits

    After freeing ring_pages we leave it as is, causing a dangling pointer. This
    has already caused an issue, so to help catch any issues in the future,
    NULL it out.
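
    The change itself is tiny; in aio_free_ring() it amounts to (sketch):

    if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
            kfree(ctx->ring_pages);
    ctx->ring_pages = NULL;   /* don't leave a dangling pointer behind */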

    Signed-off-by: Sasha Levin
    Signed-off-by: Benjamin LaHaise

    Sasha Levin
     
    ioctx_alloc() calls aio_setup_ring() to allocate a ring. If aio_setup_ring()
    fails, it calls aio_free_ring() before returning, but ioctx_alloc() would
    then call aio_free_ring() again, causing a double free of the ring.

    This is easily reproducible from userspace.

    Signed-off-by: Sasha Levin
    Signed-off-by: Benjamin LaHaise

    Sasha Levin