28 Mar, 2014

1 commit

  • As reported by Tang Chen, Gu Zheng and Yasuaki Ishimatsu, the following issues
    exist in the aio ring page migration support.

    As a result, for example, we have the following problem:

    thread 1                                  |   thread 2
                                              |
    aio_migratepage()                         |
     |-> take ctx->completion_lock            |
     |-> migrate_page_copy(new, old)          |
     |   *NOW*, ctx->ring_pages[idx] == old   |
                                              |
                                              |   *NOW*, ctx->ring_pages[idx] == old
                                              |   aio_read_events_ring()
                                              |    |-> ring = kmap_atomic(ctx->ring_pages[0])
                                              |    |-> ring->head = head;
                                              |        *HERE*, write to the old ring page
                                              |    |-> kunmap_atomic(ring);
                                              |
     |-> ctx->ring_pages[idx] = new           |
     |   *BUT NOW*, the content of            |
     |    ring_pages[idx] is old.             |
     |-> release ctx->completion_lock         |

    As above, the new ring page will not be updated.

    Fix this issue, as well as prevent races in aio_setup_ring(), by holding
    the ring_lock mutex during kioctx setup and page migration. This avoids
    the overhead of taking another spinlock in aio_read_events_ring(), as Tang's
    and Gu's original fix did, and pushes that overhead into the migration code
    instead.

    Note that to handle the nesting of ring_lock inside mmap_sem, the
    migratepage operation uses mutex_trylock(). Page migration is not a 100%
    critical operation in this case, so the occasional failure can be
    tolerated. This issue was reported by Sasha Levin.

    Based on feedback from Linus, avoid the extra taking of ctx->completion_lock.
    Instead, make page migration fully serialised by mapping->private_lock, and
    have aio_free_ring() simply disconnect the kioctx from the mapping by calling
    put_aio_ring_file() before touching ctx->ring_pages[]. This simplifies the
    error handling logic in aio_migratepage(), and should improve robustness.

    v4: always do mutex_unlock() in cases when kioctx setup fails.
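
    A rough sketch of the resulting serialisation in aio_migratepage(), assuming
    the fs/aio.c structure of this series (labels and details may differ slightly
    from the final commit):

        static int aio_migratepage(struct address_space *mapping,
                                   struct page *new, struct page *old,
                                   enum migrate_mode mode)
        {
            struct kioctx *ctx;
            int rc = -EINVAL;

            /* mapping->private_lock pins the kioctx against teardown
             * (aio_free_ring() clears private_data under the same lock). */
            spin_lock(&mapping->private_lock);
            ctx = mapping->private_data;
            if (!ctx)
                goto out;

            /* ring_lock nests inside mmap_sem, so only trylock here;
             * an occasional failed migration attempt is tolerable. */
            if (!mutex_trylock(&ctx->ring_lock)) {
                rc = -EAGAIN;
                goto out;
            }

            /* ... copy old to new and update ctx->ring_pages[idx]
             *     under ctx->completion_lock ... */
            rc = 0;

            mutex_unlock(&ctx->ring_lock);
        out:
            spin_unlock(&mapping->private_lock);
            return rc;
        }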

    Reported-by: Yasuaki Ishimatsu
    Reported-by: Sasha Levin
    Signed-off-by: Benjamin LaHaise
    Cc: Tang Chen
    Cc: Gu Zheng
    Cc: stable@vger.kernel.org

    Benjamin LaHaise
     

23 Dec, 2013

2 commits

  • Pull AIO leak fixes from Ben LaHaise:
    "I've put these two patches plus Linus's change through a round of
    tests, and it passes millions of iterations of the aio numa
    migratepage test, as well as a number of repetitions of a few simple
    read and write tests.

    The first patch fixes the memory leak Kent introduced, while the
    second patch makes aio_migratepage() much more paranoid and robust"

    * git://git.kvack.org/~bcrl/aio-next:
    aio/migratepages: make aio migrate pages sane
    aio: fix kioctx leak introduced by "aio: Fix a trinity splat"

    Linus Torvalds
     
  • Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages
    migration") the aio ring setup code has used a special per-ring backing
    inode for the page allocations, rather than just using random anonymous
    pages.

    However, rather than remembering the pages as it allocated them, it
    would allocate the pages, insert them into the file mapping (dirty, so
    that they couldn't be free'd), and then forget about them. And then to
    look them up again, it would mmap the mapping, and then use
    "get_user_pages()" to get back an array of the pages we just created.

    Now, not only is that incredibly inefficient, it also leaked all the
    pages if the mmap failed (which could happen due to excessive number of
    mappings, for example).

    So clean it all up, making it much more straightforward. Also remove
    some left-overs of the previous (broken) mm_populate() usage that was
    removed in commit d6c355c7dabc ("aio: fix race in ring buffer page
    lookup introduced by page migration support") but left the pointless and
    now misleading MAP_POPULATE flag around.
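
    The cleaned-up allocation path boils down to a sketch like the following:
    allocate the pages through the backing file's mapping and remember them
    right away (helper names per fs/aio.c of this era; details may differ):

        /* Allocate the ring pages through the backing file's mapping and
         * remember them directly, instead of re-finding them later via
         * mmap() + get_user_pages(). */
        for (i = 0; i < nr_pages; i++) {
            struct page *page;

            page = find_or_create_page(file->f_mapping, i,
                                       GFP_HIGHUSER | __GFP_ZERO);
            if (!page)
                break;

            SetPageUptodate(page);
            unlock_page(page);

            ctx->ring_pages[i] = page;   /* keep our own reference */
        }
        ctx->nr_pages = i;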

    Tested-and-acked-by: Benjamin LaHaise
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Dec, 2013

2 commits

  • The arbitrary restriction on page counts offered by the core
    migrate_page_move_mapping() code results in rather suspicious looking
    fiddling with page reference counts in the aio_migratepage() operation.
    To fix this, make migrate_page_move_mapping() take an extra_count parameter
    that allows aio to tell the code about its own reference count on the page
    being migrated.

    While cleaning up aio_migratepage(), make it validate that the old page
    being passed in is actually what aio_migratepage() expects to prevent
    misbehaviour in the case of races.
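
    With the new parameter, aio can simply declare its extra page reference
    rather than fiddling with the count around the call - roughly (assuming
    the migrate_page_move_mapping() prototype of this series):

        /* aio holds one extra reference on the old page via ctx->ring_pages[],
         * so tell the core migration code about it instead of adjusting the
         * page count ourselves. */
        rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1);
        if (rc != MIGRATEPAGE_SUCCESS)
            return rc;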

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     
  • Commit e34ecee2ae79 ("aio: Fix a trinity splat") reworked the percpu
    reference counting to correct a bug trinity found. Unfortunately, the change
    led to kioctxes being leaked because there was no final reference count to
    put. Add that reference count back in to fix things.

    Signed-off-by: Benjamin LaHaise
    Cc: stable@vger.kernel.org

    Benjamin LaHaise
     

07 Dec, 2013

1 commit


06 Dec, 2013

1 commit

  • Clean up the aio ring file in the failure paths of aio_setup_ring()
    and ioctx_alloc(). This may also fix the GPF issue reported by
    Dave Jones:
    https://lkml.org/lkml/2013/11/25/898

    Signed-off-by: Gu Zheng
    Signed-off-by: Benjamin LaHaise

    Gu Zheng
     

23 Nov, 2013

1 commit


20 Nov, 2013

2 commits

  • After freeing ring_pages we currently leave it as is, leaving a dangling
    pointer behind. This has already caused an issue, so NULL it out to help
    catch any similar issues in the future.
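
    The change itself is tiny - in aio_free_ring(), something along these lines
    (field names per fs/aio.c):

        if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) {
            kfree(ctx->ring_pages);
            ctx->ring_pages = NULL;   /* don't leave a dangling pointer */
        }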

    Signed-off-by: Sasha Levin
    Signed-off-by: Benjamin LaHaise

    Sasha Levin
     
  • ioctx_alloc() calls aio_setup_ring() to allocate a ring. If aio_setup_ring()
    fails to do so, it calls aio_free_ring() before returning, but
    ioctx_alloc() then calls aio_free_ring() again, causing a double free of
    the ring.

    This is easily reproducible from userspace.

    Signed-off-by: Sasha Levin
    Signed-off-by: Benjamin LaHaise

    Sasha Levin
     

13 Nov, 2013

1 commit


09 Nov, 2013

1 commit


11 Oct, 2013

1 commit

  • aio kiocb refcounting was broken - it was relying on keeping track of
    the number of available ring buffer entries, which it needs to do
    anyway; then at shutdown time it would wait for completions to be delivered
    until the number of available ring buffer entries equalled what it was
    initialized to.

    The problem with that is that the ring buffer is mapped writable into
    userspace, so userspace could futz with the head and tail pointers to
    cause the kernel to see extra completions, and cause free_ioctx() to
    return while there were still outstanding kiocbs - which would be bad.

    Fix is just to directly refcount the kiocbs - which is more
    straightforward, and with the new percpu refcounting code doesn't cost
    us any cacheline bouncing which was the whole point of the original
    scheme.

    Also clean up ioctx_alloc()'s error path and fix a bug where it wasn't
    subtracting from aio_nr if ioctx_add_table() failed.
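
    Conceptually, the fix pairs a get with every submitted kiocb and a put with
    every completion, using the new percpu refcount so the hot path stays cheap -
    a sketch, assuming a struct percpu_ref reqs member in the kioctx as in this
    series:

        /* submission: aio_get_req() */
        percpu_ref_get(&ctx->reqs);     /* one ref per outstanding kiocb */

        /* completion: aio_complete() */
        percpu_ref_put(&ctx->reqs);

        /* teardown: the kioctx is only freed once ctx->reqs hits zero,
         * i.e. after every kiocb has genuinely completed - regardless of
         * what userspace does to the mapped ring's head/tail pointers. */
        percpu_ref_kill(&ctx->reqs);    /* release callback does the free */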

    Signed-off-by: Kent Overstreet

    Kent Overstreet
     

27 Sep, 2013

1 commit

  • Dmitry Vyukov managed to trigger a case where aio_migratepage can cause a
    use-after-free during teardown of the aio ring buffer's mapping. This turns
    out to be caused by access to the ioctx's ring_pages via the migratepage
    operation which was not being protected by any locks during ioctx freeing.
    Use the address_space's private_lock to protect use and updates of the mapping's
    private_data, and make ioctx teardown unlink the ioctx from the address space.
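
    The teardown side of that serialisation, in sketch form (a
    put_aio_ring_file()-style helper; names are illustrative):

        static void put_aio_ring_file(struct kioctx *ctx)
        {
            struct file *aio_ring_file = ctx->aio_ring_file;

            if (aio_ring_file) {
                /* Unlink the kioctx from the address space so a concurrent
                 * migratepage operation finds private_data == NULL. */
                spin_lock(&aio_ring_file->f_inode->i_mapping->private_lock);
                aio_ring_file->f_inode->i_mapping->private_data = NULL;
                ctx->aio_ring_file = NULL;
                spin_unlock(&aio_ring_file->f_inode->i_mapping->private_lock);

                fput(aio_ring_file);
            }
        }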

    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

10 Sep, 2013

1 commit

  • Patch "aio: fix rcu sparse warnings introduced by ioctx table lookup patch"
    (77d30b14d24e557f89c41980011d72428514d729 in linux-next.git) introduced a
    couple of new rcu_dereference calls which are not protected by rcu_read_lock
    and result in the following warnings during syscall fuzzing (trinity):

    [ 471.646379] ===============================
    [ 471.649727] [ INFO: suspicious RCU usage. ]
    [ 471.653919] 3.11.0-next-20130906+ #496 Not tainted
    [ 471.657792] -------------------------------
    [ 471.661235] fs/aio.c:503 suspicious rcu_dereference_check() usage!
    [ 471.665968]
    [ 471.665968] other info that might help us debug this:
    [ 471.665968]
    [ 471.672141]
    [ 471.672141] rcu_scheduler_active = 1, debug_locks = 1
    [ 471.677549] 1 lock held by trinity-child0/3774:
    [ 471.681675] #0: (&(&mm->ioctx_lock)->rlock){+.+...}, at: [] SyS_io_setup+0x63a/0xc70
    [ 471.688721]
    [ 471.688721] stack backtrace:
    [ 471.692488] CPU: 1 PID: 3774 Comm: trinity-child0 Not tainted 3.11.0-next-20130906+ #496
    [ 471.698437] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 471.703151] 00000000 00000000 c58bbf30 c18a814b de2234c0 c58bbf58 c10a4ec6 c1b0d824
    [ 471.709544] c1b0f60e 00000001 00000001 c1af61b0 00000000 cb670ac0 c3aca000 c58bbfac
    [ 471.716251] c119bc7c 00000002 00000001 00000000 c119b8dd 00000000 c10cf684 c58bbfb4
    [ 471.722902] Call Trace:
    [ 471.724859] [] dump_stack+0x4b/0x66
    [ 471.728772] [] lockdep_rcu_suspicious+0xc6/0x100
    [ 471.733716] [] SyS_io_setup+0x89c/0xc70
    [ 471.737806] [] ? SyS_io_setup+0x4fd/0xc70
    [ 471.741689] [] ? __audit_syscall_entry+0x94/0xe0
    [ 471.746080] [] syscall_call+0x7/0xb
    [ 471.749723] [] ? task_fork_fair+0x240/0x260

    Signed-off-by: Artem Savkov
    Reviewed-by: Gu Zheng
    Signed-off-by: Benjamin LaHaise

    Artem Savkov
     

09 Sep, 2013

1 commit

  • Prior to the introduction of page migration support in "fs/aio: Add support
    to aio ring pages migration" / 36bc08cc01709b4a9bb563b35aa530241ddc63e3,
    mapping of the ring buffer pages was done via get_user_pages() while
    holding mmap_sem for write. This avoided possible races with userland
    calling munmap() or mremap(). The page migration patch, however, switched
    to using mm_populate() to prime the page mapping. mm_populate() cannot be
    called with mmap_sem held.

    Instead of dropping the mmap_sem, revert to the old behaviour and simply
    drop the use of mm_populate() since get_user_pages() will cause the pages to
    get mapped anyways. Thanks to Al Viro for spotting this issue.
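
    Concretely, ring setup goes back to roughly this shape (sketched with the
    do_mmap_pgoff()/get_user_pages() interfaces of that kernel generation;
    error handling trimmed):

        unsigned long populate = 0;
        int ret;

        down_write(&mm->mmap_sem);
        ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
                                       PROT_READ | PROT_WRITE,
                                       MAP_SHARED, 0, &populate);
        if (IS_ERR((void *)ctx->mmap_base)) {
            up_write(&mm->mmap_sem);
            return -EAGAIN;
        }

        /* get_user_pages() faults the ring pages in itself, so the
         * mm_populate() call (illegal with mmap_sem held) is unnecessary. */
        ret = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
                             1, 0, ctx->ring_pages, NULL);
        up_write(&mm->mmap_sem);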

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

30 Aug, 2013

2 commits


08 Aug, 2013

1 commit


06 Aug, 2013

1 commit

  • In the patch "aio: convert the ioctx list to table lookup v3", incorrect
    handling in the ioctx_alloc() error path was introduced that lead to an
    ioctx being added via ioctx_add_table() while freed when the ioctx_alloc()
    call returned -EAGAIN due to hitting the aio_max_nr limit. Fix this by
    only calling ioctx_add_table() as the last step in ioctx_alloc().

    Also, several unnecessary rcu_dereference() calls were added that lead to
    RCU warnings where the system was already protected by a spin lock for
    accessing mm->ioctx_table.

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

31 Jul, 2013

3 commits

  • In the event that an overflow/underflow occurs while calculating req_batch,
    clamp the minimum at 1 request instead of doing a BUG_ON().
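
    That is, something along these lines (field names as in fs/aio.c):

        ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
        if (ctx->req_batch < 1)
            ctx->req_batch = 1;     /* never BUG_ON(), just clamp */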

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     
  • On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
    > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
    > > When using a large number of threads performing AIO operations the
    > > IOCTX list may get a significant number of entries which will cause
    > > significant overhead. For example, when running this fio script:
    > >
    > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=512; thread; loops=100
    > >
    > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
    > > 30% CPU time spent by lookup_ioctx:
    > >
    > > 32.51% [guest.kernel] [g] lookup_ioctx
    > > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
    > > 4.40% [guest.kernel] [g] lock_release
    > > 4.19% [guest.kernel] [g] sched_clock_local
    > > 3.86% [guest.kernel] [g] local_clock
    > > 3.68% [guest.kernel] [g] native_sched_clock
    > > 3.08% [guest.kernel] [g] sched_clock_cpu
    > > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
    > > 2.60% [guest.kernel] [g] memcpy
    > > 2.33% [guest.kernel] [g] lock_acquired
    > > 2.25% [guest.kernel] [g] lock_acquire
    > > 1.84% [guest.kernel] [g] do_io_submit
    > >
    > > This patch converts the ioctx list to a radix tree. For a performance
    > > comparison the above FIO script was run on a 2 sockets 8 core
    > > machine. These are the results (average and %rsd of 10 runs) for the
    > > original list based implementation and for the radix tree based
    > > implementation:
    > >
    > > cores           1          2          4          8         16         32
    > > list    109376 ms   69119 ms   35682 ms   22671 ms   19724 ms   16408 ms
    > > %rsd        0.69%      1.15%      1.17%      1.21%      1.71%      1.43%
    > > radix    73651 ms   41748 ms   23028 ms   16766 ms   15232 ms   13787 ms
    > > %rsd        1.19%      0.98%      0.69%      1.13%      0.72%      0.75%
    > > % of radix
    > > relative   66.12%     65.59%     66.63%     72.31%     77.26%     83.66%
    > > to list
    > >
    > > To consider the impact of the patch on the typical case of having
    > > only one ctx per process the following FIO script was run:
    > >
    > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
    > > blocksize=1024; numjobs=1; thread; loops=100
    > >
    > > on the same system and the results are the following:
    > >
    > > list 58892 ms
    > > %rsd 0.91%
    > > radix 59404 ms
    > > %rsd 0.81%
    > > % of radix
    > > relative 100.87%
    > > to list
    >
    > So, I was just doing some benchmarking/profiling to get ready to send
    > out the aio patches I've got for 3.11 - and it looks like your patch is
    > causing a ~1.5% throughput regression in my testing :/
    ...

    I've got an alternate approach for fixing this wart in lookup_ioctx()...
    Instead of using an rbtree, just use the reserved id in the ring buffer
    header to index an array pointing to the ioctx. It's not finished yet, and
    it needs to be tidied up, but it is most of the way there.

    -ben
    --
    "Thought is the essence of where you are now."
    --
    kmo> And, a rework of Ben's code, but this was entirely his idea
    kmo> -Kent

    bcrl> And fix the code to use the right mm_struct in kill_ioctx(), and
    actually free memory.
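
    The resulting lookup is, in rough sketch form, the table-based lookup_ioctx()
    below (the reserved id field in the ring header indexes mm->ioctx_table;
    details are illustrative):

        static struct kioctx *lookup_ioctx(unsigned long ctx_id)
        {
            struct aio_ring __user *ring = (void __user *)ctx_id;
            struct kioctx_table *table;
            struct kioctx *ctx, *ret = NULL;
            unsigned id;

            /* The id stashed in the ring header picks the table slot. */
            if (get_user(id, &ring->id))
                return NULL;

            rcu_read_lock();
            table = rcu_dereference(current->mm->ioctx_table);
            if (!table || id >= table->nr)
                goto out;

            ctx = table->table[id];
            if (ctx && ctx->user_id == ctx_id) {
                percpu_ref_get(&ctx->users);
                ret = ctx;
            }
        out:
            rcu_read_unlock();
            return ret;
        }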

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     
  • With the changes to use percpu counters for aio event ring size calculation,
    existing increases to aio_max_nr are now insufficient to allow for the
    allocation of enough events. Double the value used for aio_max_nr to account
    for the doubling introduced by the percpu slack.

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

30 Jul, 2013

9 commits

  • sock_aio_dtor() is dead code - and stuff that does need to do cleanup
    can simply do it before calling aio_complete().

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Cc: Theodore Ts'o
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     
  • The kiocb refcount is only needed for cancellation - to ensure a kiocb
    isn't freed while a ki_cancel callback is running. But if we restrict
    ki_cancel callbacks to not block (which they currently don't), we can
    simply drop the refcount.

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Cc: Theodore Ts'o
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     
  • The old aio retry infrastructure needed to save the various arguments to
    aio operations. But with the retry infrastructure gone, we can trim
    struct kiocb quite a bit.

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Cc: Theodore Ts'o
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     
  • This code doesn't serve any purpose anymore, since the aio retry
    infrastructure has been removed.

    This change should be safe because aio_read/write are also used for
    synchronous IO, and called from do_sync_read()/do_sync_write() - and
    there's no looping done in the sync case (the read and write syscalls).

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     
  • aio_complete() (arguably) needs to keep its own trusted copy of the tail
    pointer, but io_getevents() doesn't have to use it - it's already using
    the head pointer from the ring buffer.

    So convert it to use the tail from the ring buffer so it touches fewer
    cachelines and doesn't contend with the cacheline aio_complete() needs.
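
    The reader's side after this change, in sketch form - both indices come
    straight from the mapped ring header, and only head is ever written back:

        /* Fetch both indices from the mapped ring header. */
        ring = kmap_atomic(ctx->ring_pages[0]);
        head = ring->head;
        tail = ring->tail;      /* written by aio_complete(), read-only here */
        kunmap_atomic(ring);

        /* ... copy the events in [head, tail) out to userspace,
         *     advancing head modulo ctx->nr_events ... */

        /* Publish the new consumer index; tail is never written here. */
        ring = kmap_atomic(ctx->ring_pages[0]);
        ring->head = head;
        kunmap_atomic(ring);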

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     
  • Originally, io_cancel() was documented to return the completed io_event if
    cancellation succeeded - the io_event wouldn't be delivered via the ring
    buffer like it normally would.

    But this isn't what the implementation was actually doing; the only
    driver implementing cancellation, the usb gadget code, never returned an
    io_event in its cancel function. And aio_complete() was recently changed
    to no longer suppress event delivery if the kiocb had been cancelled.

    This gets rid of the unused io_event argument to kiocb_cancel() and
    kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if
    kiocb->ki_cancel() returned success.

    Also tweak the refcounting in kiocb_cancel() to make more sense.
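
    Seen from userspace, the behaviour described above looks roughly like this
    (a hedged sketch using the libaio wrapper, which returns negative errno
    values):

        struct io_event evt;
        int ret = io_cancel(ctx, &iocb, &evt);

        if (ret == -EINPROGRESS) {
            /* The driver accepted the cancellation; no event is returned
             * here - the completion still arrives via io_getevents(). */
        } else if (ret == -EINVAL || ret == -EAGAIN) {
            /* Nothing to cancel, or the request could not be cancelled. */
        }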

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     
  • This just converts the ioctx refcount to the new generic dynamic percpu
    refcount code.

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     
  • See the previous patch ("aio: reqs_active -> reqs_available") for why we
    want to do this - this basically implements a per cpu allocator for
    reqs_available that doesn't actually allocate anything.

    Note that we need to increase the size of the ringbuffer we allocate,
    since a single thread won't necessarily be able to use all the
    reqs_available slots - some (up to about half) might be on other per cpu
    lists, unavailable for the current thread.

    We size the ringbuffer based on the nr_events userspace passed to
    io_setup(), so this is a slight behaviour change - but nr_events wasn't
    being used as a hard limit before; it was already being rounded up to the
    next page, so this doesn't change the actual semantics.
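
    The per-cpu "allocator" amounts to batching slots from the shared atomic
    counter into a small per-cpu cache - a trimmed sketch of the
    get_reqs_available() shape used here (names as in this series; ctx->cpu is
    the percpu struct kioctx_cpu pointer):

        struct kioctx_cpu {
            unsigned reqs_available;        /* slots cached on this cpu */
        };

        static bool get_reqs_available(struct kioctx *ctx)
        {
            struct kioctx_cpu *kcpu;
            bool ret = false;

            preempt_disable();
            kcpu = this_cpu_ptr(ctx->cpu);

            /* Refill the local cache in batches from the shared counter. */
            if (!kcpu->reqs_available) {
                int avail = atomic_read(&ctx->reqs_available);

                if (avail < ctx->req_batch)
                    goto out;
                if (atomic_cmpxchg(&ctx->reqs_available, avail,
                                   avail - ctx->req_batch) != avail)
                    goto out;       /* lost a race; caller may retry */

                kcpu->reqs_available += ctx->req_batch;
            }

            kcpu->reqs_available--;
            ret = true;
        out:
            preempt_enable();
            return ret;
        }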

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     
  • The number of outstanding kiocbs is one of the few shared things left that
    has to be touched for every kiocb - it'd be nice to make it percpu.

    We can make it per cpu by treating it like an allocation problem: we have
    a maximum number of kiocbs that can be outstanding (i.e. slots) - then we
    just allocate and free slots, and we know how to write per cpu allocators.

    So as prep work for that, we convert reqs_active to reqs_available.

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     

17 Jul, 2013

1 commit

  • When "fs/aio: Add support to aio ring pages migration" was applied, it
    broke the build when CONFIG_MIGRATION was disabled. Wrap the migration
    code with a test for CONFIG_MIGRATION to fix this and save a few bytes
    when migration is disabled.
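
    In sketch form, the migration hook is simply compiled out when it can never
    be used (names as in fs/aio.c):

        #if IS_ENABLED(CONFIG_MIGRATION)
        static int aio_migratepage(struct address_space *mapping,
                                   struct page *new, struct page *old,
                                   enum migrate_mode mode)
        {
            /* ... move the ring page and update ctx->ring_pages[] ... */
            return 0;   /* MIGRATEPAGE_SUCCESS */
        }
        #endif

        static const struct address_space_operations aio_ctx_aops = {
            .set_page_dirty = __set_page_dirty_no_writeback,
        #if IS_ENABLED(CONFIG_MIGRATION)
            .migratepage    = aio_migratepage,
        #endif
        };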

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

16 Jul, 2013

1 commit

  • Because an aio job pins its ring pages, migrating that memory fails. To fix
    this, use an anonymous inode to manage the aio ring pages and set up a
    migratepage callback in the anon inode's address space, so that during
    memory migration the aio ring pages can be moved safely to another memory
    node.
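
    In sketch form, the mechanism looks like this (helper and structure names
    are illustrative; the point is that the ring pages now belong to an
    address_space that supplies a migratepage callback and can find its kioctx
    again):

        /* Give the ring its own anon-inode backing file ... */
        file = anon_inode_getfile_private("[aio]", &aio_ring_fops, ctx, O_RDWR);
        if (IS_ERR(file))
            return PTR_ERR(file);

        /* ... whose mapping knows how to migrate the ring pages and can
         * locate the owning kioctx from private_data. */
        file->f_inode->i_mapping->a_ops = &aio_ctx_aops;  /* .migratepage set */
        file->f_inode->i_mapping->private_data = ctx;
        ctx->aio_ring_file = file;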

    Signed-off-by: Gu Zheng
    Signed-off-by: Benjamin LaHaise

    Gu Zheng
     

04 Jul, 2013

1 commit


29 Jun, 2013

1 commit


13 Jun, 2013

1 commit

  • There was a regression introduced by 36f5588905c1 ("aio: refcounting
    cleanup"), reported by Jens Axboe - the refcounting cleanup switched to
    using RCU in the shutdown path, but the synchronize_rcu() was done in
    the context of the io_destroy() syscall greatly increasing the time it
    could block.

    This patch switches it to call_rcu() and makes shutdown asynchronous
    (more asynchronous than it was originally; before the refcount changes
    io_destroy() would still wait on pending kiocbs).

    Note that there's a global quota on the max outstanding kiocbs, and that
    quota must be manipulated synchronously; otherwise io_setup() could
    return -EAGAIN when there isn't quota available, and userspace won't
    have any way of waiting until shutdown of the old kioctxs has finished
    (besides busy looping).

    So we release our quota before kioctx shutdown has finished, which
    should be fine since the quota never corresponded to anything real
    anyways.
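
    The shape of the change, sketched (the callback name is illustrative):

        static void free_ioctx_rcu(struct rcu_head *head)
        {
            struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);

            kmem_cache_free(kioctx_cachep, ctx);
        }

        /* Previously: synchronize_rcu(); kmem_cache_free(...); - blocking
         * io_destroy() for a full grace period.  Now the free is deferred: */
        call_rcu(&ctx->rcu_head, free_ioctx_rcu);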

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Reported-by: Jens Axboe
    Tested-by: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Signed-off-by: Benjamin LaHaise
    Tested-by: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

25 May, 2013

2 commits

  • The recent changes overhauling fs/aio.c introduced a bug that results in
    the kioctx not being freed when outstanding kiocbs are cancelled at
    exit_aio() time. Specifically, a kiocb that is cancelled has its
    completion events discarded by batch_complete_aio(), which then fails to
    wake up the process stuck in free_ioctx(). Fix this by modifying the
    wait_event() condition in free_ioctx() appropriately.

    This patch was tested with the cancel operation in the thread based code
    posted yesterday.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Benjamin LaHaise
    Signed-off-by: Kent Overstreet
    Cc: Kent Overstreet
    Cc: Josh Boyer
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin LaHaise
     
  • In reviewing man pages, I noticed that io_getevents is documented to
    update the timeout that gets passed into the library call. This doesn't
    happen in kernel space or in the library (even though it's documented to
    do so in both places). Unless there is objection, I'd like to fix the
    comments/docs to match the code (I will also update the man page upon
    consensus).
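
    A minimal userspace check of the behaviour being documented, assuming libaio
    (build with -laio) - the timespec comes back untouched from both the kernel
    and the library:

        #include <libaio.h>
        #include <stdio.h>
        #include <time.h>

        int main(void)
        {
            io_context_t ioctx = 0;
            struct io_event events[1];
            struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };

            if (io_setup(8, &ioctx) < 0)
                return 1;

            /* Nothing submitted, so this just waits out the timeout. */
            int n = io_getevents(ioctx, 1, 1, events, &ts);

            /* ts still reads 1.000000000s - it is not updated with the
             * time remaining, contrary to the old documentation. */
            printf("events=%d, ts=%ld.%09ld\n", n, (long)ts.tv_sec, ts.tv_nsec);

            io_destroy(ioctx);
            return 0;
        }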

    Signed-off-by: Jeff Moyer
    Signed-off-by: Benjamin LaHaise
    Acked-by: Cyril Hrubis
    Acked-by: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

08 May, 2013

1 commit

  • Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
    need this anymore - ki_retry was only called right after the kiocb was
    initialized.

    This also refactors and trims some duplicated code, as well as cleaning up
    the refcounting/error handling a bit.

    [akpm@linux-foundation.org: use fmode_t in aio_run_iocb()]
    [akpm@linux-foundation.org: fix file_start_write/file_end_write tests]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet