23 Sep, 2010

1 commit

  • OCFS2 can return ERESTARTSYS from its write function when the process is
    signalled while waiting for a cluster lock (and the filesystem is mounted
    with the intr mount option). Generally, it seems reasonable to allow
    filesystems to return this error code from their IO functions. As we must
    not leak ERESTARTSYS (and similar error codes) to userspace as the result
    of an AIO operation, we have to properly convert it to EINTR inside the
    AIO code (restarting the syscall isn't really an option because other
    AIOs may already have been submitted by the same io_submit syscall).
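
    The conversion is a small filter on the return value, somewhere in the
    AIO retry path (a sketch; the exact spot in fs/aio.c may differ):

        ret = file->f_op->aio_write(req, iov, nr_segs, pos);
        /* never leak in-kernel restart codes through AIO completion */
        if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
            ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
                ret = -EINTR;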

    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Zach Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

15 Sep, 2010

1 commit

  • Tavis Ormandy pointed out that do_io_submit does not do proper bounds
    checking on the passed-in iocb array:

           if (unlikely(nr < 0))
                   return -EINVAL;

           if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
                   return -EFAULT;                      ^^^^^^^^^^^^^^^^^^

    The attached patch checks for overflow, and if it is detected, the
    number of iocbs submitted is scaled down to a number that will fit in
    a long.  This is an ok thing to do, as sys_io_submit is documented as
    returning the number of iocbs submitted, so callers should handle a
    return value of less than the 'nr' argument passed in.
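
    The clamp, roughly (a sketch following the snippet above; note it also
    sizes the check by sizeof(*iocbpp) rather than sizeof(iocbpp)):

        if (unlikely(nr < 0))
                return -EINVAL;

        if (unlikely(nr > LONG_MAX/sizeof(*iocbpp)))
                nr = LONG_MAX/sizeof(*iocbpp);

        if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(*iocbpp)))))
                return -EFAULT;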

    Reported-by: Tavis Ormandy
    Signed-off-by: Jeff Moyer
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

06 Aug, 2010

1 commit

  • - sys_io_destroy(): actually return -EINVAL if the context pointed to
      is invalid.
    - sys_io_getevents(): the argument specifying the timeout is named
      `timeout', not `when'.
    - sys_io_getevents(): describe what is returned if this syscall
      succeeds.

    Signed-off-by: Satoru Takeuchi
    Signed-off-by: Randy Dunlap
    Reviewed-by: Jeff Moyer
    Signed-off-by: Linus Torvalds

    Satoru Takeuchi
     

28 May, 2010

2 commits

  • __aio_put_req() plays sick games with file refcount. What
    it wants is fput() from atomic context; it's almost always
    done with f_count > 1, so they only have to deal with delayed
    work in rare cases when their reference happens to be the
    last one. Current code decrements f_count and if it hasn't
    hit 0, everything is fine. Otherwise it keeps a pointer
    to struct file (with zero f_count!) around and has delayed
    work do __fput() on it.

    Better way to do it: use atomic_long_add_unless(&f_count, -1, 1)
    instead of !atomic_long_dec_and_test(). IOW, decrement it
    only if it's not the last reference, and leave the refcount
    alone if it was. And use a normal fput() in the delayed work.

    I've made that atomic_long_add_unless call a new helper -
    fput_atomic(). It drops a reference to a file if it's safe to
    do so in atomic context (i.e. if that's not the last one), and
    tells whether it was able to do that. aio.c is converted to it,
    and the __fput() use is gone. req->ki_filp *always* contributes
    to the refcount now. And __fput() became static.
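
    The helper is tiny; a sketch matching the description above:

        /*
         * Drop a reference unless it is the last one. Returns non-zero
         * if the reference was dropped, 0 if the caller must fall back
         * to a full fput() outside atomic context.
         */
        int fput_atomic(struct file *file)
        {
                return atomic_long_add_unless(&file->f_count, -1, 1);
        }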

    Signed-off-by: Al Viro

    Al Viro
     
  • The aio compat code was not converting the struct iovecs from 32bit to
    64bit pointers, causing either EINVAL to be returned from io_getevents, or
    EFAULT as the result of the I/O. This patch passes a compat flag to
    io_submit to signal that pointer conversion is necessary for a given iocb
    array.
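
    For reference, the 32-bit layout that needs widening (as declared for
    compat tasks; note the 32-bit pointer):

        struct compat_iovec {
                compat_uptr_t iov_base;   /* 32-bit user pointer */
                compat_size_t iov_len;
        };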

    A variant of this was tested by Michael Tokarev. I have also updated the
    libaio test harness to exercise this code path with good success.
    Further, I grabbed a copy of ltp and ran the
    testcases/kernel/syscall/readv and writev tests there (compiled with -m32
    on my 64bit system). All seems happy, but extra eyes on this would be
    welcome.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
    Signed-off-by: Jeff Moyer
    Reported-by: Michael Tokarev
    Cc: Zach Brown
    Cc: [2.6.35.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

16 Dec, 2009

1 commit

  • Don't know the reason, but it appears the ki_wait field of struct iocb never gets used.

    Signed-off-by: Shaohua Li
    Cc: Jeff Moyer
    Cc: Benjamin LaHaise
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

28 Oct, 2009

1 commit

  • Some workloads issue batches of small I/O, and the performance is poor
    due to the call to blk_run_address_space for every single iocb. Nathan
    Roberts pointed this out, and suggested that by deferring this call
    until all I/Os in the iocb array are submitted to the block layer, we
    can realize some impressive performance gains (up to 30% for sequential
    4k reads in batches of 16).
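
    Schematically, the change looks like this (a sketch with approximate
    helper names, not the literal patch):

        /* before: the device queue is kicked once per iocb */
        for (i = 0; i < nr; i++)
                io_submit_one(ctx, iocbpp[i]);   /* ends in a queue kick */

        /* after: queue everything first, kick each mapping once */
        for (i = 0; i < nr; i++)
                io_submit_one(ctx, iocbpp[i], &batch);
        aio_batch_free(&batch);   /* blk_run_address_space() per mapping */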

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

22 Sep, 2009

1 commit

  • Anyone who wants to do copy to/from user from a kernel thread needs
    use_mm() (like what fs/aio has). Move that into mm/, to make reusing and
    exporting easier down the line, and make aio use it. The next intended
    user, besides aio, will be vhost-net.
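
    Typical use from a kernel thread, as the API stood then (sketch):

        use_mm(mm);                               /* adopt the user mm */
        ret = copy_from_user(buf, user_ptr, len); /* now legal here */
        unuse_mm(mm);                             /* drop it again */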

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

01 Jul, 2009

1 commit

  • Change the eventfd interface to de-couple the eventfd memory context, from
    the file pointer instance.

    Without such a change, there is no clean, race-free way to handle the
    POLLHUP event sent when the last instance of the file* goes away. Also,
    the internal eventfd APIs now use the eventfd context instead of the
    file*.
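
    In-kernel users now hold a context rather than a file; a sketch of the
    reworked calls:

        struct eventfd_ctx *ctx = eventfd_ctx_fdget(fd);
        if (!IS_ERR(ctx)) {
                eventfd_signal(ctx, 1);   /* post one event */
                eventfd_ctx_put(ctx);     /* drop the context reference */
        }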

    This patch is required by KVM's IRQfd code, which is still under
    development.

    Signed-off-by: Davide Libenzi
    Cc: Gregory Haskins
    Cc: Rusty Russell
    Cc: Benjamin LaHaise
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

20 Mar, 2009

2 commits

  • The libaio test harness turned up a problem whereby lookup_ioctx on a
    bogus io context was returning the 1 valid io context from the list
    (harness/cases/3.p).

    Because of that, an extra put_ioctx was done, and when the process
    exited, it hit a BUG_ON in the put_ioctx macro called from exit_aio
    (since we expect a users count of 1 and instead get 0).

    The problem was introduced by "aio: make the lookup_ioctx() lockless"
    (commit abf137dd7712132ee56d5b3143c2ff61a72a5faa).

    Thanks to Zach for pointing out that hlist_for_each_entry_rcu will not
    return with a NULL tpos at the end of the loop, even if the entry was
    not found.
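
    The buggy pattern, in outline (sketch):

        hlist_for_each_entry_rcu(ctx, n, &mm->ioctx_list, list)
                if (ctx->user_id == ctx_id)
                        break;
        /*
         * ctx is non-NULL here even when nothing matched - it still
         * points at the last entry examined. The fix sets a separate
         * return variable only on a genuine hit.
         */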

    Signed-off-by: Jeff Moyer
    Acked-by: Zach Brown
    Acked-by: Jens Axboe
    Cc: Benjamin LaHaise
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • Remove a source of fput() calls from inside IRQ context. Like Eric, I
    wasn't able to reproduce an fput() call from IRQ context myself, but Jeff
    said he was able to, with the attached test program. Independently of
    this, the bug is conceptually there, so we might be better off fixing it.
    This patch adds an optimization similar to the one we already do on
    ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty
    in general, but the alternative here would be to add a brand new delayed
    fput() infrastructure, and I'm not sure that is worth it.

    Signed-off-by: Davide Libenzi
    Cc: Benjamin LaHaise
    Cc: Trond Myklebust
    Cc: Eric Dumazet
    Signed-off-by: Jeff Moyer
    Cc: Zach Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

29 Dec, 2008

1 commit

  • The mm->ioctx_list is currently protected by a reader-writer lock,
    so we always grab that lock on the read side for doing ioctx
    lookups. As the workload is extremely reader biased, turn this into
    an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
    the rwlock and use a spinlock for providing update side exclusion.
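
    Update-side exclusion then reduces to (a sketch, using the new spinlock):

        spin_lock(&mm->ioctx_lock);
        hlist_add_head_rcu(&ctx->list, &mm->ioctx_list);
        spin_unlock(&mm->ioctx_lock);

    while readers walk the list under rcu_read_lock() alone.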

    There's usually only 1 entry on this list, so it doesn't make sense
    to look into fancier data structures.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

26 Jul, 2008

1 commit

  • Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and
    do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.

    No functional changes yet. But this allows us to do further
    fixes/cleanups.

    oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
    kthreads; this is wrong because of use_mm(). The problem with
    PF_BORROWED_MM is that we need task_lock() to avoid races. With this
    patch we can check PF_KTHREAD directly, or use a simple lockless helper:

    /* The result must not be dereferenced !!! */
    struct mm_struct *__get_task_mm(struct task_struct *tsk)
    {
            if (tsk->flags & PF_KTHREAD)
                    return NULL;
            return tsk->mm;
    }

    Note also ecard_task(): it runs with ->mm != NULL, but it's a kernel
    thread without PF_BORROWED_MM.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

07 Jun, 2008

1 commit

  • use_mm() was changed to use switch_mm() instead of activate_mm(); since
    then, nobody calls (and nobody should call) activate_mm() with the
    PF_BORROWED_MM bit set.

    As Jeff Dike pointed out, we can also remove the "old != new" check, it is
    always true.

    Signed-off-by: Oleg Nesterov
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

30 Apr, 2008

1 commit

  • Add calls to the generic object debugging infrastructure and provide fixup
    functions which make it possible to keep the system alive when recoverable
    problems have been detected by the object debugging core code.

    Signed-off-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

29 Apr, 2008

3 commits

  • The FIXME comments are inaccurate.
    The locking comment over lookup_ioctx() is wrong.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Zach Brown
    Signed-off-by: Shen Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • Some drivers have redundant unlikely() annotations: IS_ERR() already
    contains unlikely() in itself.

    This patch cleans up such pointless code.
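
    For instance:

        /* before: the outer unlikely() is redundant */
        if (unlikely(IS_ERR(filp)))
                return PTR_ERR(filp);

        /* after: IS_ERR() already marks the branch unlikely */
        if (IS_ERR(filp))
                return PTR_ERR(filp);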

    Signed-off-by: Hirofumi Nakagawa
    Acked-by: David S. Miller
    Acked-by: Jeff Garzik
    Cc: Paul Clements
    Cc: Richard Purdie
    Cc: Alessandro Zummo
    Cc: David Brownell
    Cc: James Bottomley
    Cc: Michael Halcrow
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Carsten Otte
    Cc: Patrick McHardy
    Cc: Paul Mundt
    Cc: Jaroslav Kysela
    Cc: Takashi Iwai
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hirofumi Nakagawa
     
  • Make the following needlessly global functions static:

    - __put_ioctx()
    - lookup_ioctx()
    - io_submit_one()

    Signed-off-by: Adrian Bunk
    Cc: Zach Brown
    Cc: Benjamin LaHaise
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

28 Apr, 2008

1 commit

  • This patch wakes up a thread waiting in io_getevents if another thread
    destroys the context. This was tested using a small program that spawns a
    thread to wait in io_getevents while the parent thread destroys the io context
    and then waits for the getevents thread to exit. Without this patch, the
    program hangs indefinitely. With the patch, the program exits as expected.
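
    In outline, the fix looks like this (a sketch; the exact placement in
    io_destroy() and the read_events() wait loop may differ):

        /* io_destroy(): after marking the context dead */
        wake_up(&ioctx->wait);

        /* read_events(): recheck before going back to sleep */
        if (ctx->dead) {
                ret = -EINVAL;
                break;
        }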

    Signed-off-by: Jeff Moyer
    Cc: Zach Brown
    Cc: Christopher Smith
    Cc: Benjamin LaHaise
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

11 Apr, 2008

2 commits

  • Jeff Roberson discovered a race when using KAIO eventfd-based
    notifications. When it occurs, it can lead to missed wakeups and hung
    userspace.

    This patch fixes the race by moving the notification inside the
    spinlocked section of KAIO. The operation is safe since the eventfd
    spinlock and the KAIO one are unrelated.
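
    In aio_complete(), the signal moves under ctx->ctx_lock (sketch):

        spin_lock_irqsave(&ctx->ctx_lock, flags);
        /* ... add the completion event to the ring ... */
        if (iocb->ki_eventfd != NULL)
                eventfd_signal(iocb->ki_eventfd, 1);
        spin_unlock_irqrestore(&ctx->ctx_lock, flags);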

    Signed-off-by: Davide Libenzi
    Cc: Zach Brown
    Cc: Jeff Roberson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Use asmlinkage_protect in sys_io_getevents, because GCC for i386 with
    CONFIG_FRAME_POINTER=n can decide to clobber an argument word on the
    stack, i.e. the user struct pt_regs. Here the problem is not a tail
    call, but just the compiler's use of the stack when it inlines and
    optimizes the body of the called function. This seems to avoid it.
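
    The fix itself is a single line at the end of the syscall, assuming the
    usual sys_io_getevents argument names (sketch):

        asmlinkage_protect(5, ret, ctx_id, min_nr, nr, events, timeout);
        return ret;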

    Signed-off-by: Roland McGrath
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

20 Mar, 2008

1 commit

  • My group ran into an AIO process hang on a 2.6.24 kernel, with the
    process sleeping indefinitely in io_getevents(2) waiting for the last
    wakeup to come, and it never would.

    We ran the tests on x86_64 SMP. The hang only occurred on a Xeon box
    ("Clovertown") but not a Core2Duo ("Conroe"). On the Xeon, the L2 cache
    isn't shared between all eight processors, but the L2 is shared between
    the two processors on the Core2Duo we use.

    My analysis of the hang: going down to the second while-loop
    in read_events(), this is what happens on processor #1:
    1) add_wait_queue_exclusive() adds thread to ctx->wait
    2) aio_read_evt() to check tail
    3) if aio_read_evt() returned 0, call [io_]schedule() and sleep

    In aio_complete() with processor #2:
    A) info->tail = tail;
    B) waitqueue_active(&ctx->wait)
    C) if waitqueue_active() returned non-0, call wake_up()

    The way the code is written, step 1 must be seen by all other processors
    before processor 1 checks for pending events in step 2 (that were recorded by
    step A) and step A by processor 2 must be seen by all other processors
    (checked in step 2) before step B is done.

    The race I believed I was seeing is that steps 1 and 2 were
    effectively swapped due to the __list_add() being delayed by the L2
    cache not being shared with some of the other processors. Imagine:
    proc 2: just before step A
    proc 1, step 1: adds to ctx->wait, but is not visible by other processors yet
    proc 1, step 2: checks tail and sees no pending events
    proc 2, step A: updates tail
    proc 1, step 3: calls [io_]schedule() and sleeps
    proc 2, step B: checks ctx->wait, but sees no one waiting, skips wakeup
    so proc 1 sleeps indefinitely

    My patch adds a memory barrier between steps A and B. It ensures that the
    update in step 1 gets seen on processor 2 before continuing. If processor 1
    was just before step 1, the memory barrier makes sure that step A (update
    tail) gets seen by the time processor 1 makes it to step 2 (check tail).
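
    In code terms, the barrier sits between the tail update and the
    waitqueue check in aio_complete() (a sketch of the patched sequence):

        info->tail = tail;
        /* ... */
        smp_mb();
        if (waitqueue_active(&ctx->wait))
                wake_up(&ctx->wait);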

    Before the patch our AIO process would hang virtually 100% of the time. After
    the patch, we have yet to see the process ever hang.

    Signed-off-by: Quentin Barnes
    Reviewed-by: Zach Brown
    Cc: Benjamin LaHaise
    Cc:
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    [ We should probably disallow that "if (waitqueue_active()) wake_up()"
    coding pattern, because it's so often buggy wrt memory ordering ]
    Signed-off-by: Linus Torvalds

    Quentin Barnes
     

06 Dec, 2007

1 commit

  • On 2.6.24, top started showing 100% iowait on one CPU when a completely
    idle UML instance was running. The UML code sits in io_getevents waiting
    for an event to be submitted and completed.

    Fix this by checking ctx->reqs_active before scheduling to determine whether
    or not we are waiting for I/O.
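
    The check amounts to (sketch):

        if (ctx->reqs_active)   /* I/O truly outstanding: counts as iowait */
                io_schedule();
        else                    /* merely parked waiting for submissions */
                schedule();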

    Signed-off-by: Jeff Moyer
    Cc: Zach Brown
    Cc: Miklos Szeredi
    Cc: Jeff Dike
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

19 Oct, 2007

1 commit

  • Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b
    during 2.6.9 development. The commit introduced an io_wait field which was
    write-only then and still remains write-only.

    Also garbage-collect the macros which "use" io_wait.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

17 Oct, 2007

1 commit

  • Some months back I proposed changing the schedule() call in
    read_events to an io_schedule():
    http://osdir.com/ml/linux.kernel.aio.general/2006-10/msg00024.html
    This was rejected as there are AIO operations that do not initiate
    disk I/O. I've had another look at the problem, and the only AIO
    operation that will not initiate disk I/O is IOCB_CMD_NOOP. However,
    this command isn't even wired up!

    Given that it doesn't work, and hasn't for *years*, I'm going to
    suggest again that we do proper I/O accounting when using AIO.

    Signed-off-by: Jeff Moyer
    Acked-by: Zach Brown
    Cc: Benjamin LaHaise
    Cc: Suparna Bhattacharya
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

09 Oct, 2007

1 commit

  • When the IOCB_FLAG_RESFD flag is set and iocb->aio_resfd is incorrect,
    the statement 'goto out_put_req' is executed. At the label 'out_put_req',
    aio_put_req(..) is called, which requires 'req->ki_filp' to be set.
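
    The fix is to populate 'req->ki_filp' before any path that can take
    that goto (a sketch of the reordered io_submit_one()):

        req->ki_filp = file;    /* must precede the eventfd lookup */
        if (iocb->aio_flags & IOCB_FLAG_RESFD) {
                req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
                if (IS_ERR(req->ki_eventfd)) {
                        ret = PTR_ERR(req->ki_eventfd);
                        goto out_put_req;   /* aio_put_req() needs ki_filp */
                }
        }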

    Signed-off-by: Yan Zheng
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yan Zheng
     

11 May, 2007

1 commit

  • This is an example of how to add eventfd support to the current KAIO
    code, in order to enable KAIO to post readiness events to a pollable fd
    (hence compatible with POSIX select/poll). The KAIO code simply signals
    the eventfd fd when events are ready, and this triggers a POLLIN in the
    fd. This patch uses a reserved-for-future-use member of the struct iocb
    to pass an eventfd file descriptor, which KAIO will use to post events
    every time a request completes. At that point, an io_getevents() will
    return the completed result in a struct io_event. I made a quick test
    program to verify the patch, and it runs fine here:

    http://www.xmailserver.org/eventfd-aio-test.c

    The test program uses poll(2), but it'd, of course, work with select and epoll
    too.

    This makes it possible to schedule both block I/O and requests to other
    pollable devices, and wait for the results using select/poll/epoll. In a
    typical scenario, an application would submit KAIO requests using
    io_submit(), would also use epoll_ctl() on the whole other class of
    devices (a class that, with the addition of signals, timers and user
    events, is now pretty much complete), and then would:

    epoll_wait(...);
    for_each_event {
            if (curr_event_is_kaiofd) {
                    io_getevents();
                    dispatch_aio_events();
            } else {
                    dispatch_epoll_event();
            }
    }

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

10 May, 2007

2 commits

  • flush_work(wq, work) doesn't need the first parameter; we can use cwq->wq
    (this was possible from the very beginning, I missed this). So we can
    unify flush_work_keventd and flush_work.

    Also, rename flush_work() to cancel_work_sync() and fix all callers.
    Perhaps this is not the best name, but "flush_work" is really bad.

    (akpm: this is why the earlier patches bypassed maintainers)
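
    For callers, the rename amounts to (sketch):

        /* before */
        flush_work(wq, work);

        /* after: no workqueue argument needed */
        cancel_work_sync(work);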

    Signed-off-by: Oleg Nesterov
    Cc: Jeff Garzik
    Cc: "David S. Miller"
    Cc: Jens Axboe
    Cc: Tejun Heo
    Cc: Auke Kok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Migrate AIO over to use flush_work().

    Cc: "Maciej W. Rozycki"
    Cc: David Howells
    Cc: Zach Brown
    Cc: Benjamin LaHaise
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

08 May, 2007

1 commit

  • This patch provides a new macro

    KMEM_CACHE(<struct>, <flags>)

    to simplify slab creation. KMEM_CACHE creates a slab with the name of the
    struct, with the size of the struct and with the alignment of the struct.
    Additional slab flags may be specified if necessary.

    Example

    struct test_slab {
            int a, b, c;
            struct list_head list;
    } __cacheline_aligned_in_smp;

    test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC);

    will create a new slab named "test_slab" of size sizeof(struct
    test_slab), aligned to the alignment of struct test_slab. If creation
    fails, we panic.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

28 Mar, 2007

1 commit

  • The user can generate console output if they cause do_mmap() to fail
    during sys_io_setup(). This was seen in a regression test that does
    exactly that by spinning calling mmap() until it gets -ENOMEM before
    calling io_setup().

    We don't need this printk at all, just remove it.

    Signed-off-by: Zach Brown
    Signed-off-by: Linus Torvalds

    Zach Brown
     

12 Feb, 2007

1 commit

  • Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
    corresponding "kmem_cache_zalloc()" call.

    Signed-off-by: Robert P. J. Day
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: Roland McGrath
    Cc: James Bottomley
    Cc: Greg KH
    Acked-by: Joel Becker
    Cc: Steven Whitehouse
    Cc: Jan Kara
    Cc: Michael Halcrow
    Cc: "David S. Miller"
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day