14 Jan, 2012

1 commit

  • Since commit 080d676de095 ("aio: allocate kiocbs in batches"), iocbs are
    allocated in a batch during processing of the first iocbs. All iocbs in a
    batch are automatically added to the ctx->active_reqs list and accounted
    in ctx->reqs_active.

    If one of the iocbs submitted by a user (not the last one) fails, further
    iocbs are not processed, but they are still present in ctx->active_reqs
    and accounted in ctx->reqs_active. This causes the process to get stuck
    in D state in wait_for_all_aios() on exit, since ctx->reqs_active will
    never go down to zero. Furthermore, since kiocb_batch_free() frees an
    iocb without removing it from the active_reqs list, the list becomes
    corrupted, which may cause an oops.

    Fix this by removing iocb from ctx->active_reqs and updating
    ctx->reqs_active in kiocb_batch_free().
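
    A sketch of the fix (illustrative only; the field and lock names are
    taken from the 3.2-era fs/aio.c and may not match the patch exactly):

        static void kiocb_batch_free(struct kioctx *ctx, struct kiocb_batch *batch)
        {
                struct kiocb *req, *n;
                unsigned long flags;

                spin_lock_irqsave(&ctx->ctx_lock, flags);
                list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
                        list_del(&req->ki_batch);
                        list_del(&req->ki_list);  /* unlink from ctx->active_reqs */
                        kmem_cache_free(kiocb_cachep, req);
                        ctx->reqs_active--;       /* let wait_for_all_aios() finish */
                }
                spin_unlock_irqrestore(&ctx->ctx_lock, flags);
        }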

    Signed-off-by: Gleb Natapov
    Reviewed-by: Jeff Moyer
    Cc: stable@kernel.org # 3.2
    Signed-off-by: Linus Torvalds

    Gleb Natapov
     

03 Nov, 2011

1 commit

  • In testing aio on a fast storage device, I found that the context lock
    takes up a fair amount of cpu time in the I/O submission path. The reason
    is that we take it for every I/O submitted (see __aio_get_req). Since we
    know how many I/Os are passed to io_submit, we can preallocate the kiocbs
    in batches, reducing the number of times we take and release the lock.

    In my testing, I was able to reduce the amount of time spent in
    _raw_spin_lock_irq by 0.56% (average of 3 runs). The command I used to
    test this was:

    aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384

    I also tested the patch with various numbers of events passed to
    io_submit, and I ran the xfstests aio group of tests to ensure I didn't
    break anything.
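
    For reference, a sketch of the batching structure involved (illustrative;
    the names follow the commit title rather than a verified source listing):

        struct kiocb_batch {
                struct list_head head;  /* preallocated, not-yet-used kiocbs */
                long count;             /* how many are still available */
        };

    The idea is that the submission path draws kiocbs from the batch and
    takes ctx->ctx_lock per refill rather than per request.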

    Signed-off-by: Jeff Moyer
    Cc: Daniel Ehrenberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

01 Nov, 2011

1 commit

  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.

    - Use of /proc/pid/mem has been considered, but there are issues with
      using it:
      - It does not allow for specifying iovecs for both src and dest;
        assuming preadv or pwritev was implemented, either the area read
        from or written to would need to be contiguous.
      - Currently mem_read allows only processes which are currently
        ptrace'ing the target, and are still able to ptrace the target, to
        read from the target. This check could possibly be moved to the
        open call, but it's not clear exactly what race this restriction is
        stopping (the reason appears to have been lost).
      - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes that all need to do this with each other.
      - It doesn't allow for some future uses of the interface we would
        like to consider adding (see below).
      - Interestingly, reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily.)

    As mentioned previously, use of vmsplice instead was considered, but it
    has problems. Since you need the reader and writer working
    co-operatively, if the pipe is not drained then you block, which
    requires some wrapping to do non-blocking I/O on the send side or
    polling on the receive side. In all-to-all communication it requires
    ordering, otherwise you can deadlock. And in the example of many MPI
    tasks writing to one MPI task, vmsplice serialises the copying.

    There are some cases of MPI collectives where even a single-copy
    interface does not get us all the performance gain we could have. For
    example, in an MPI_Reduce, rather than copy the data from the source we
    would like to instead use it directly in a math op (say the reduce is
    doing a sum), as this would save us doing a copy. We don't need to keep
    a copy of the data from the source. I haven't implemented this, but I
    think this interface could in the future do all of this through the use
    of the flags - e.g. one could specify the math operation and type, and
    the kernel, rather than just copying the data, would apply the
    specified operation between the source and destination and store the
    result in the destination.

    Although we don't have a "second user" of the interface yet (though
    I've had some nibbles from people who may be interested in using it for
    intra-process messaging which is not MPI), this interface is something
    which hardware vendors are already doing for their custom drivers to
    implement fast local communication. So, in addition to being useful for
    OpenMPI, it would mean the driver maintainers don't have to fix things
    up when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
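
    As a quick illustration, a minimal userspace sketch of the read side
    (it assumes a libc that exposes a process_vm_readv() wrapper; on older
    libcs the raw syscall(2) form works the same way, and the pid and
    remote address here are just placeholder inputs):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/uio.h>

        int main(int argc, char *argv[])
        {
                struct iovec local, remote;
                char buf[128];

                if (argc != 3) {
                        fprintf(stderr, "usage: %s <pid> <hex-addr>\n", argv[0]);
                        return 1;
                }

                local.iov_base = buf;
                local.iov_len = sizeof(buf);
                remote.iov_base = (void *)strtoul(argv[2], NULL, 16);
                remote.iov_len = sizeof(buf);

                /* single copy: straight from the other process's address
                 * space into ours, no shared-memory bounce buffer */
                if (process_vm_readv(atoi(argv[1]), &local, 1,
                                     &remote, 1, 0) < 0) {
                        perror("process_vm_readv");
                        return 1;
                }
                printf("read %zu bytes\n", sizeof(buf));
                return 0;
        }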

    This has been implemented for x86 and powerpc; other architectures
    should mainly (I think) just need to add syscall numbers for
    process_vm_readv and process_vm_writev. There are 32-bit compatibility
    versions for 64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

1 commit

  • The test program below will hang because io_getevents() uses
    add_wait_queue_exclusive(), which means the wake_up() in io_destroy() only
    wakes up one of the threads. Fix this by using wake_up_all() in the aio
    code paths where we want to make sure no one gets stuck.

    // t.c -- compile with gcc -lpthread -laio t.c

    #include <stdio.h>
    #include <unistd.h>
    #include <pthread.h>
    #include <libaio.h>

    static const int nthr = 2;

    void *getev(void *ctx)
    {
            struct io_event ev;
            io_getevents(ctx, 1, 1, &ev, NULL);
            printf("io_getevents returned\n");
            return NULL;
    }

    int main(int argc, char *argv[])
    {
            io_context_t ctx = 0;
            pthread_t thread[nthr];
            int i;

            io_setup(1024, &ctx);

            for (i = 0; i < nthr; ++i)
                    pthread_create(&thread[i], NULL, getev, ctx);

            sleep(1);

            io_destroy(ctx);

            for (i = 0; i < nthr; ++i)
                    pthread_join(thread[i], NULL);

            return 0;
    }

    Signed-off-by: Roland Dreier
    Reviewed-by: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland Dreier
     

16 Mar, 2011

1 commit

  • * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix build failure introduced by s/freezeable/freezable/
    workqueue: add system_freezeable_wq
    rds/ib: use system_wq instead of rds_ib_fmr_wq
    net/9p: replace p9_poll_task with a work
    net/9p: use system_wq instead of p9_mux_wq
    xfs: convert to alloc_workqueue()
    reiserfs: make commit_wq use the default concurrency level
    ocfs2: use system_wq instead of ocfs2_quota_wq
    ext4: convert to alloc_workqueue()
    scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
    scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
    misc/iwmc3200top: use system_wq instead of dedicated workqueues
    i2o: use alloc_workqueue() instead of create_workqueue()
    acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
    fs/aio: aio_wq isn't used in memory reclaim path
    input/tps6507x-ts: use system_wq instead of dedicated workqueue
    cpufreq: use system_wq instead of dedicated workqueues
    wireless/ipw2x00: use system_wq instead of dedicated workqueues
    arm/omap: use system_wq in mailbox
    workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER

    Linus Torvalds
     

10 Mar, 2011

4 commits


26 Feb, 2011

2 commits

  • A race can occur when io_submit() races with io_destroy():

    CPU1                                            CPU2
    io_submit()
      do_io_submit()
        ...
        ctx = lookup_ioctx(ctx_id);
                                                    io_destroy()
        Now do_io_submit() holds the last reference to ctx.
        ...
        queue new AIO
        put_ioctx(ctx) - frees ctx with active AIOs

    We solve this issue by checking whether ctx is being destroyed in the
    AIO submission path after adding the new AIO to ctx. Then we are
    guaranteed that either io_destroy() waits for the new AIO or we see
    that ctx is being destroyed and bail out.
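
    The shape of the check (a sketch under assumed 2.6.38-era names such as
    ctx->dead and ctx->ctx_lock, not the literal patch):

        /* in io_submit_one(), after the new request is queued */
        spin_lock_irq(&ctx->ctx_lock);
        if (ctx->dead) {
                /* io_destroy() won the race: back out rather than run
                 * an AIO against a context that is going away */
                spin_unlock_irq(&ctx->ctx_lock);
                ret = -EINVAL;
                goto out_put_req;
        }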

    Cc: Nick Piggin
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.

    lookup_ioctx doesn't implement the RCU lookup pattern properly.
    rcu_read_lock does not prevent the refcount from going to zero, so we
    might take a refcount on a zero-count ioctx.

    Fix the bug by atomically testing for zero refcount before incrementing.
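
    In code terms, the guard looks like this (sketch; the helper name and
    the atomic field are assumptions in the style of the era's fs/aio.c):

        static inline int try_get_ioctx(struct kioctx *kioctx)
        {
                /* fails on, rather than resurrects, a dying ioctx */
                return atomic_inc_not_zero(&kioctx->users);
        }

    lookup_ioctx() then takes its reference via such a helper inside the
    rcu_read_lock() section and skips any entry whose count already hit
    zero.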

    [jack@suse.cz: added comment into the code]
    Reviewed-by: Jeff Moyer
    Signed-off-by: Nick Piggin
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 Jan, 2011

1 commit

  • aio_wq isn't used during memory reclaim. Convert to alloc_workqueue()
    without WQ_MEM_RECLAIM. It would be possible to use system_wq, but
    given that the number of work items is determined from userland and the
    work items may block, enforcing a strict concurrency limit is a good
    idea.

    Also, move fput_work to system_wq so that aio_wq is used solely to
    throttle the max concurrency of aio work items and fput_work doesn't
    interact with other work items.
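
    The conversion amounts to something like this (sketch; the name string
    and the concurrency cap of 1 follow the reasoning above):

        aio_wq = alloc_workqueue("aio", 0, 1);  /* no WQ_MEM_RECLAIM */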

    Signed-off-by: Tejun Heo
    Acked-by: Jeff Moyer
    Cc: Benjamin LaHaise
    Cc: linux-aio@kvack.org

    Tejun Heo
     

17 Jan, 2011

1 commit


14 Jan, 2011

2 commits


26 Oct, 2010

2 commits

  • Clones an existing reference to inode; caller must already hold one.

    Signed-off-by: Al Viro

    Al Viro
     
  • The aio batching code is using igrab to get an extra reference on the
    inode so it can safely batch. igrab will go ahead and take the global
    inode spinlock, which can be a bottleneck on large machines doing lots
    of AIO.

    In this case, igrab isn't required because we already have a reference
    on the file handle. It is safe to just bump the i_count directly
    on the inode.

    Benchmarking shows this patch brings IOP/s on tons of flash up by about
    2.5X.
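
    The core of the change, sketched (illustrative; i_count was an atomic_t
    in this era, so the bump needs no global lock):

        /* we hold a reference on the file, so the inode can't go away;
         * no need for igrab()'s spinlock just to take another ref */
        atomic_inc(&inode->i_count);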

    Signed-off-by: Chris Mason

    Chris Mason
     

23 Sep, 2010

1 commit

  • OCFS2 can return ERESTARTSYS from its write function when the process
    is signalled while waiting for a cluster lock (and the filesystem is
    mounted with the intr mount option). Generally, it seems reasonable to
    allow filesystems to return this error code from their IO functions. As
    we must not leak ERESTARTSYS (and similar error codes) to userspace as
    the result of an AIO operation, we have to properly convert it to EINTR
    inside the AIO code (restarting the syscall isn't really an option
    because other AIOs could already have been submitted by the same
    io_submit syscall).
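
    The conversion itself is a small clamp in the completion path (sketch;
    the exact set of restart codes is an assumption based on "and similar
    error codes" above):

        if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
            ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
                ret = -EINTR;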

    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Zach Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

15 Sep, 2010

1 commit

  • Tavis Ormandy pointed out that do_io_submit does not do proper bounds
    checking on the passed-in iocb array:

           if (unlikely(nr < 0))
                   return -EINVAL;

           if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
                   return -EFAULT;                      ^^^^^^^^^^^^^^^^^^

    The attached patch checks for overflow and, if overflow is detected,
    scales down the number of iocbs submitted to a number that fits in a
    long. This is an OK thing to do, as sys_io_submit is documented as
    returning the number of iocbs submitted, so callers should handle a
    return value of less than the 'nr' argument passed in.
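
    Roughly, the added clamp looks like this (a sketch based on the
    description above, not the literal patch):

           if (unlikely(nr > LONG_MAX/sizeof(*iocbpp)))
                   nr = LONG_MAX/sizeof(*iocbpp);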

    Reported-by: Tavis Ormandy
    Signed-off-by: Jeff Moyer
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

06 Aug, 2010

1 commit

  • - sys_io_destroy(): actually return -EINVAL if the context pointed to
      is invalid.
    - sys_io_getevents(): the argument specifying the timeout is not
      `when', but `timeout'.
    - sys_io_getevents(): describe what is returned if this syscall
      succeeds.

    Signed-off-by: Satoru Takeuchi
    Signed-off-by: Randy Dunlap
    Reviewed-by: Jeff Moyer
    Signed-off-by: Linus Torvalds

    Satoru Takeuchi
     

28 May, 2010

2 commits

  • __aio_put_req() plays sick games with the file refcount. What it wants
    is fput() from atomic context; it's almost always done with f_count >
    1, so it only has to deal with delayed work in the rare cases when its
    reference happens to be the last one. The current code decrements
    f_count, and if it hasn't hit 0, everything is fine. Otherwise it keeps
    a pointer to the struct file (with zero f_count!) around and has
    delayed work do __fput() on it.

    A better way to do it: use atomic_long_add_unless( , -1, 1) instead of
    !atomic_long_dec_and_test(). IOW, decrement it only if it's not the
    last reference, and leave the refcount alone if it was. Then use normal
    fput() in the delayed work.

    I've made that atomic_long_add_unless call a new helper -
    fput_atomic(). It drops a reference to a file if that is safe to do in
    atomic context (i.e. if it's not the last one), and reports whether it
    was able to do so. aio.c is converted to it, and the __fput() use is
    gone. req->ki_file *always* contributes to the refcount now. And
    __fput() became static.
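
    The helper, as described (a sketch following the atomic_long_add_unless
    recipe above):

        int fput_atomic(struct file *file)
        {
                /* drop a ref only if it is not the last one; report
                 * success so callers can fall back to deferred fput() */
                return atomic_long_add_unless(&file->f_count, -1, 1);
        }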

    Signed-off-by: Al Viro

    Al Viro
     
  • The aio compat code was not converting the struct iovecs from 32-bit
    to 64-bit pointers, causing either EINVAL to be returned from
    io_getevents, or EFAULT as the result of the I/O. This patch passes a
    compat flag to io_submit to signal that pointer conversion is necessary
    for a given iocb array.

    A variant of this was tested by Michael Tokarev. I have also updated the
    libaio test harness to exercise this code path with good success.
    Further, I grabbed a copy of ltp and ran the
    testcases/kernel/syscall/readv and writev tests there (compiled with -m32
    on my 64bit system). All seems happy, but extra eyes on this would be
    welcome.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
    Signed-off-by: Jeff Moyer
    Reported-by: Michael Tokarev
    Cc: Zach Brown
    Cc: [2.6.35.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

16 Dec, 2009

1 commit

  • I don't know the reason, but it appears the ki_wait field of the iocb
    never gets used.

    Signed-off-by: Shaohua Li
    Cc: Jeff Moyer
    Cc: Benjamin LaHaise
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

29 Oct, 2009

1 commit


28 Oct, 2009

1 commit

  • Hi,

    Some workloads issue batches of small I/O, and the performance is poor
    due to the call to blk_run_address_space for every single iocb. Nathan
    Roberts pointed this out, and suggested that by deferring this call
    until all I/Os in the iocb array are submitted to the block layer, we
    can realize some impressive performance gains (up to 30% for sequential
    4k reads in batches of 16).
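
    Schematically, the submission loop becomes (a rough sketch; the batch
    bookkeeping names here are assumptions, not the patch's exact
    interface):

        /* queue every iocb first, remembering each affected mapping */
        for (i = 0; i < nr; i++)
                io_submit_one(ctx, iocbpp[i], &iocb, &batch);

        /* then kick the block layer once per unique address_space,
         * instead of once per iocb */
        aio_batch_run(&batch);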

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

23 Sep, 2009

1 commit


22 Sep, 2009

1 commit

  • Anyone who wants to copy to/from user space from a kernel thread needs
    use_mm (like what fs/aio has). Move that into mm/, to make reusing and
    exporting easier down the line, and make aio use it. The next intended
    user, besides aio, will be vhost-net.
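
    For context, the pattern a borrower follows (an illustrative sketch,
    not code from the patch):

        /* a kernel thread temporarily adopts a user mm so that
         * copy_{to,from}_user() resolve against that process's mappings */
        use_mm(mm);
        if (copy_from_user(kbuf, ubuf, len))
                ret = -EFAULT;
        unuse_mm(mm);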

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

01 Jul, 2009

1 commit

  • Change the eventfd interface to decouple the eventfd memory context
    from the file pointer instance.

    Without such a change, there is no clean, race-free way to handle the
    POLLHUP event sent when the last instance of the file* goes away. Also,
    the internal eventfd APIs now use the eventfd context instead of the
    file*.

    This patch is required by KVM's IRQfd code, which is still under
    development.
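
    The decoupled, ctx-based entry points look roughly like this (sketch;
    the exact signatures in that era's include/linux/eventfd.h are
    assumptions):

        struct eventfd_ctx *eventfd_ctx_fdget(int fd);
        struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx);
        void eventfd_ctx_put(struct eventfd_ctx *ctx);
        int eventfd_signal(struct eventfd_ctx *ctx, int n);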

    Signed-off-by: Davide Libenzi
    Cc: Gregory Haskins
    Cc: Rusty Russell
    Cc: Benjamin LaHaise
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

20 Mar, 2009

2 commits

  • The libaio test harness turned up a problem whereby lookup_ioctx on a
    bogus io context was returning the one valid io context from the list
    (harness/cases/3.p).

    Because of that, an extra put_ioctx was done, and when the process
    exited, it hit a BUG_ON in the put_ioctx macro called from exit_aio
    (since we expect a users count of 1 and instead get 0).

    The problem was introduced by "aio: make the lookup_ioctx() lockless"
    (commit abf137dd7712132ee56d5b3143c2ff61a72a5faa).

    Thanks to Zach for pointing out that hlist_for_each_entry_rcu will not
    return with a NULL tpos at the end of the loop, even if the entry was
    not found.
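
    The resulting shape of the lookup (sketch; the list and field names
    follow the 2.6.29-era fs/aio.c):

        struct kioctx *ctx, *ret = NULL;
        struct hlist_node *n;

        rcu_read_lock();
        hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list) {
                if (ctx->user_id == ctx_id) {
                        get_ioctx(ctx);
                        ret = ctx;
                        break;
                }
        }
        rcu_read_unlock();
        return ret;     /* not ctx: the cursor is stale when nothing matched */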

    Signed-off-by: Jeff Moyer
    Acked-by: Zach Brown
    Acked-by: Jens Axboe
    Cc: Benjamin LaHaise
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • Remove a source of fput() calls from inside IRQ context. Like Eric, I
    wasn't able to reproduce an fput() call from IRQ context myself, but
    Jeff said he was able to, with the attached test program. Independently
    from this, the bug is conceptually there, so we might be better off
    fixing it. This patch adds an optimization similar to the one we
    already do on ->ki_filp, on ->ki_eventfd. Playing with ->f_count
    directly is not pretty in general, but the alternative here would be to
    add a brand-new delayed-fput() infrastructure, and I'm not sure that is
    worth it.

    Signed-off-by: Davide Libenzi
    Cc: Benjamin LaHaise
    Cc: Trond Myklebust
    Cc: Eric Dumazet
    Signed-off-by: Jeff Moyer
    Cc: Zach Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

14 Jan, 2009

1 commit


29 Dec, 2008

1 commit

  • The mm->ioctx_list is currently protected by a reader-writer lock, so
    we always grab that lock on the read side for doing ioctx lookups. As
    the workload is extremely reader-biased, turn this into an RCU hlist so
    we can make lookup_ioctx() lockless. Get rid of the rwlock and use a
    spinlock to provide update-side exclusion.

    There's usually only one entry on this list, so it doesn't make sense
    to look into fancier data structures.
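
    The update side then reduces to a plain spinlock around RCU-aware list
    primitives (sketch; the lock and list names are as described above),
    while lookup_ioctx() walks the hlist under rcu_read_lock() alone, as in
    the lookup sketch under the 20 Mar, 2009 entry:

        spin_lock(&mm->ioctx_lock);
        hlist_add_head_rcu(&ctx->list, &mm->ioctx_list);
        spin_unlock(&mm->ioctx_lock);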

    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Jul, 2008

1 commit


26 Jul, 2008

1 commit

  • Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and
    do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.

    No functional changes yet. But this allows us to do further
    fixes/cleanups.

    oom_kill/ptrace/etc. often check "p->mm != NULL" to filter out the
    kthreads; this is wrong because of use_mm(). The problem with
    PF_BORROWED_MM is that we need task_lock() to avoid races. With this
    patch we can check PF_KTHREAD directly, or use a simple lockless
    helper:

        /* The result must not be dereferenced !!! */
        struct mm_struct *__get_task_mm(struct task_struct *tsk)
        {
                if (tsk->flags & PF_KTHREAD)
                        return NULL;
                return tsk->mm;
        }

    Note also ecard_task(). It runs with ->mm != NULL, but it's a kernel
    thread without PF_BORROWED_MM.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

07 Jun, 2008

1 commit

  • use_mm() was changed to use switch_mm() instead of activate_mm(), since
    then nobody calls (and nobody should call) activate_mm() with
    PF_BORROWED_MM bit set.

    As Jeff Dike pointed out, we can also remove the "old != new" check, it is
    always true.

    Signed-off-by: Oleg Nesterov
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

30 Apr, 2008

1 commit

  • Add calls to the generic object debugging infrastructure, and provide
    fixup functions which allow the system to be kept alive when
    recoverable problems have been detected by the object debugging core
    code.
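
    The generic entry points being wired up look roughly like this (sketch;
    obj and obj_descr are placeholders for the debugged object and its
    per-type descriptor):

        debug_object_init(obj, &obj_descr);
        debug_object_activate(obj, &obj_descr);
        debug_object_deactivate(obj, &obj_descr);
        debug_object_free(obj, &obj_descr);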

    Signed-off-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

29 Apr, 2008

3 commits

  • The FIXME comments are inaccurate.
    The locking comment over lookup_ioctx() is wrong.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Zach Brown
    Signed-off-by: Shen Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • Some drivers have duplicated unlikely() macros: IS_ERR() already has
    unlikely() in itself, so wrapping it in another one is redundant.

    This patch cleans up such pointless code.
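
    The shape of the cleanup (illustrative before/after):

        /* before */
        if (unlikely(IS_ERR(req)))
                return PTR_ERR(req);

        /* after: IS_ERR() already wraps its test in unlikely() */
        if (IS_ERR(req))
                return PTR_ERR(req);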

    Signed-off-by: Hirofumi Nakagawa
    Acked-by: David S. Miller
    Acked-by: Jeff Garzik
    Cc: Paul Clements
    Cc: Richard Purdie
    Cc: Alessandro Zummo
    Cc: David Brownell
    Cc: James Bottomley
    Cc: Michael Halcrow
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Carsten Otte
    Cc: Patrick McHardy
    Cc: Paul Mundt
    Cc: Jaroslav Kysela
    Cc: Takashi Iwai
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hirofumi Nakagawa
     
  • Make the following needlessly global functions static:

    - __put_ioctx()
    - lookup_ioctx()
    - io_submit_one()

    Signed-off-by: Adrian Bunk
    Cc: Zach Brown
    Cc: Benjamin LaHaise
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk