21 Dec, 2018

1 commit

  • commit a538e3ff9dabcdf6c3f477a373c629213d1c3066 upstream.

    Matthew pointed out that the ioctx_table is susceptible to Spectre v1,
    because the index can be controlled by an attacker. The patch below
    should mitigate the attack for all of the aio system calls.
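
    For reference, a minimal sketch of the standard Spectre v1 hardening
    pattern in the lookup path (names follow fs/aio.c; the exact hunk is
    illustrative, not a quote of the patch):

        /* in lookup_ioctx(), under rcu_read_lock() */
        if (!table || id >= table->nr)
                goto out;

        /* Clamp the attacker-controllable index before the dependent
         * load, so a mispredicted bounds check cannot be used to read
         * out-of-bounds memory speculatively. */
        id = array_index_nospec(id, table->nr);
        ctx = rcu_dereference(table->table[id]);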

    Cc: stable@vger.kernel.org
    Reported-by: Matthew Wilcox
    Reported-by: Dan Carpenter
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     

05 Jun, 2018

1 commit

  • commit 4faa99965e027cc057c5145ce45fa772caa04e8d upstream.

    If io_destroy() gets to cancelling everything that can be cancelled and
    gets to kiocb_cancel() calling the function the driver has left in
    ->ki_cancel, it becomes vulnerable to a race with IO completion. At that
    point req is already taken off the list and aio_complete() does *NOT*
    spin until we (in free_ioctx_users()) release ->ctx_lock. As a result,
    it proceeds to kiocb_free(), freeing req just as it gets passed to
    ->ki_cancel().

    The fix is simple - remove the request from the list after the call of
    kiocb_cancel(). All instances of ->ki_cancel() already have to cope with
    being called with the iocb still on the list - that's what happens in
    io_cancel(2).
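
    A trimmed sketch of the resulting cancel loop in free_ioctx_users()
    (fs/aio.c names; the body is illustrative):

        spin_lock_irq(&ctx->ctx_lock);
        while (!list_empty(&ctx->active_reqs)) {
                req = list_first_entry(&ctx->active_reqs,
                                       struct aio_kiocb, ki_list);
                kiocb_cancel(req);             /* may call into the driver */
                list_del_init(&req->ki_list);  /* now done after the cancel */
        }
        spin_unlock_irq(&ctx->ctx_lock);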

    Cc: stable@kernel.org
    Fixes: 0460fef2a921 "aio: use cancellation list lazily"
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

30 May, 2018

1 commit

  • commit baf10564fbb66ea222cae66fbff11c444590ffd9 upstream.

    kill_ioctx() used to have an explicit RCU delay between removing the
    reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
    At some point that delay had been removed, on the theory that
    percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was
    the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
    by lookup_ioctx(). As a result, we could get ctx freed right under
    lookup_ioctx(). Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit
    RCU grace period when freeing kioctx"); however, that fix is not enough.

    Suppose io_destroy() from one thread races with e.g. io_setup() from another;
    CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
    has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the
    refcount, getting it to 0 and triggering a call of free_ioctx_users(),
    which proceeds to drop the secondary refcount and once that reaches zero
    calls free_ioctx_reqs(). That does

        INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
        queue_rcu_work(system_wq, &ctx->free_rwork);

    and schedules freeing the whole thing after an RCU delay.

    Meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
    refcount from 0 to 1, and returned the reference to io_setup().

    Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't
    get freed until after percpu_ref_get(); sure enough, we increment the
    counter before ctx can be freed. But once we are out of rcu_read_lock()
    there is nothing to stop the freeing of the whole thing. Unfortunately,
    CPU2 assumes that since it has grabbed the reference, ctx is *NOT* going
    away until it gets around to dropping that reference.

    The fix is obvious - use percpu_ref_tryget_live() and treat failure as a
    miss. It is no costlier than what we currently do in the normal case, it
    is safe to call since freeing *is* delayed, and it closes the race
    window - either lookup_ioctx() comes before percpu_ref_kill() (in which
    case ctx->users won't reach 0 until the caller of lookup_ioctx() drops
    it) or lookup_ioctx() fails, ctx->users is unaffected and the caller of
    lookup_ioctx() doesn't see the object in question at all.
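
    A minimal sketch of the fixed lookup (illustrative):

        /* in lookup_ioctx(), still under rcu_read_lock() */
        ctx = rcu_dereference(table->table[id]);
        if (ctx && ctx->user_id == ctx_id) {
                /* Fails once percpu_ref_kill() has run; treat that as a
                 * miss instead of resurrecting a ref headed for zero. */
                if (percpu_ref_tryget_live(&ctx->users))
                        ret = ctx;
        }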

    Cc: stable@kernel.org
    Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx"
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

21 Mar, 2018

2 commits

  • commit d0264c01e7587001a8c4608a5d1818dba9a4c11a upstream.

    While converting ioctx index from a list to a table, db446a08c23d
    ("aio: convert the ioctx list to table lookup v3") missed tagging
    kioctx_table->table[] as an array of RCU pointers and using the
    appropriate RCU accessors. This introduces a small window in the
    lookup path where init and access may race.

    Mark kioctx_table->table[] with __rcu and use the appropriate RCU
    accessors when using the field.
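
    A sketch of the annotation and the matching accessors (struct layout
    per fs/aio.c; the call sites are illustrative):

        struct kioctx_table {
                struct rcu_head         rcu;
                unsigned                nr;
                struct kioctx __rcu     *table[];  /* was: struct kioctx *table[]; */
        };

        /* writers publish the pointer ... */
        rcu_assign_pointer(table->table[i], ctx);
        /* ... readers load it under rcu_read_lock() */
        ctx = rcu_dereference(table->table[id]);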

    Signed-off-by: Tejun Heo
    Reported-by: Jann Horn
    Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3")
    Cc: Benjamin LaHaise
    Cc: Linus Torvalds
    Cc: stable@vger.kernel.org # v3.12+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit a6d7cff472eea87d96899a20fa718d2bab7109f3 upstream.

    While fixing refcounting, e34ecee2ae79 ("aio: Fix a trinity splat")
    incorrectly removed the explicit RCU grace period before freeing
    kioctx. The intention seems to have been to depend on the internal RCU
    grace periods of percpu_ref; however, percpu_ref uses a different RCU
    flavor, sched-RCU. This can lead to the kioctx being freed while RCU
    read protected dereferences are still in progress.

    Fix it by updating free_ioctx() to go through call_rcu() explicitly.

    v2: Comment added to explain double bouncing.
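
    A sketch of the double bounce (names per fs/aio.c; illustrative): the
    ctx must survive a regular RCU grace period for lookup_ioctx()'s
    readers, and freeing needs process context, so the work is bounced
    through call_rcu() first:

        static void free_ioctx_rcufn(struct rcu_head *head)
        {
                struct kioctx *ctx = container_of(head, struct kioctx, free_rcu);

                INIT_WORK(&ctx->free_work, free_ioctx);
                schedule_work(&ctx->free_work);
        }

        static void free_ioctx_reqs(struct percpu_ref *ref)
        {
                struct kioctx *ctx = container_of(ref, struct kioctx, reqs);
                ...
                /* synchronize against RCU-protected table->table[] reads;
                 * synchronize_rcu() is not usable from this context */
                call_rcu(&ctx->free_rcu, free_ioctx_rcufn);
        }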

    Signed-off-by: Tejun Heo
    Reported-by: Jann Horn
    Fixes: e34ecee2ae79 ("aio: Fix a trinity splat")
    Cc: Kent Overstreet
    Cc: Linus Torvalds
    Cc: stable@vger.kernel.org # v3.13+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

09 Sep, 2017

1 commit

  • Introduce a new migration mode that allows offloading the copy to a
    device DMA engine. This changes the workflow of migration, and not all
    address_space migratepage callbacks can support it.

    This is intended to be used by migrate_vma(), which is itself used for
    things like HMM (see include/linux/hmm.h).

    No additional per-filesystem migratepage testing is needed. I disabled
    MIGRATE_SYNC_NO_COPY in all problematic migratepage() callbacks and
    added comments there to explain why (part of this patch). To be clear:
    any callback that wishes to support this new mode needs to be aware of
    how the migration flow differs from the other modes.

    Some of these callbacks do extra locking while copying (aio, zsmalloc,
    balloon, ...), and for DMA to be effective you want to copy multiple
    pages in one DMA operation. But in the problematic cases you cannot
    easily hold the extra lock across multiple calls to this callback.

    The usual flow is:

    For each page {
        1 - lock page
        2 - call migratepage() callback
        3 - (extra locking in some migratepage() callback)
        4 - migrate page state (freeze refcount, update page cache,
            buffer head, ...)
        5 - copy page
        6 - (unlock any extra lock of migratepage() callback)
        7 - return from migratepage() callback
        8 - unlock page
    }

    The new mode MIGRATE_SYNC_NO_COPY:

    1 - lock multiple pages
    For each page {
        2 - call migratepage() callback
        3 - abort in all problematic migratepage() callbacks
        4 - migrate page state (freeze refcount, update page cache,
            buffer head, ...)
    } // finished all calls to migratepage() callback
    5 - DMA copy multiple pages
    6 - unlock all the pages

    To support MIGRATE_SYNC_NO_COPY in the problematic cases we would need
    a new callback, migratepages() for instance, that deals with multiple
    pages in one transaction.

    Because the problematic cases are not important for current usage, I
    did not want to complicate this patchset even more for no good reason.
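
    As an example of the "abort in all problematic callbacks" step, a
    sketch of the aio case (the -EINVAL early return is the pattern; the
    rest of the function is elided):

        static int aio_migratepage(struct address_space *mapping,
                                   struct page *new, struct page *old,
                                   enum migrate_mode mode)
        {
                /* aio copies under ctx->completion_lock, so the copy
                 * cannot be deferred to a later DMA pass */
                if (mode == MIGRATE_SYNC_NO_COPY)
                        return -EINVAL;
                /* ... normal (copying) migration path ... */
        }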

    Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

08 Sep, 2017

1 commit

  • Currently, aio-nr is incremented in steps of 'num_possible_cpus() * 8'
    for io_setup(nr_events, ..) with 'nr_events < num_possible_cpus() * 4':

    ioctx_alloc()
    ...
    nr_events = max(nr_events, num_possible_cpus() * 4);
    nr_events *= 2;
    ...
    ctx->max_reqs = nr_events;
    ...
    aio_nr += ctx->max_reqs;
    ....

    This limits the number of aio contexts actually available to much less
    than aio-max-nr, and is increasingly worse with a greater number of
    CPUs.

    For example, with 64 CPUs, only 256 aio contexts are actually available
    (with aio-max-nr = 65536) because the increment is 512 in that scenario.

    Note: 65536 [max aio contexts] / (64*4*2) [increment per aio context]
    is 128, but it is effectively 256 (double) because the check counts
    against 'aio-max-nr * 2':

    ioctx_alloc()
    ...
    if (aio_nr + nr_events > (aio_max_nr * 2UL) ||
    ...
    goto err_ctx;
    ...

    This patch uses the original value of nr_events (from userspace) to
    increment aio-nr and count against aio-max-nr, which resolves those
    problems.
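
    A sketch of the resulting ioctx_alloc() logic (illustrative; the
    max_reqs local remembers the userspace value):

        unsigned int max_reqs = nr_events;     /* as passed by userspace */

        nr_events = max(nr_events, num_possible_cpus() * 4);
        nr_events *= 2;                        /* ring sizing only */
        ...
        ctx->max_reqs = max_reqs;
        ...
        if (aio_nr + ctx->max_reqs > aio_max_nr ||
            aio_nr + ctx->max_reqs < aio_nr)   /* overflow */
                goto err_ctx;
        aio_nr += ctx->max_reqs;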

    Signed-off-by: Mauricio Faria de Oliveira
    Reported-by: Lekshmi C. Pillai
    Tested-by: Lekshmi C. Pillai
    Tested-by: Paul Nguyen
    Reviewed-by: Jeff Moyer
    Signed-off-by: Benjamin LaHaise

    Mauricio Faria de Oliveira
     

20 Jun, 2017

2 commits

  • RWF_NOWAIT informs the kernel to bail out if an AIO request will block
    for reasons such as file allocations, writeback being triggered, or
    request allocation while performing direct I/O.

    RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

    FMODE_AIO_NOWAIT is a flag which identifies that the opened file is
    capable of returning -EAGAIN if the AIO call will block. It must be set
    by supporting filesystems in the ->open() call.

    Support for the xfs, btrfs and ext4 filesystems is added in the
    following patches.
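
    A hypothetical userspace use (assumes the aio_rw_flags field from the
    companion patch and RWF_NOWAIT from this kernel's uapi headers; setup
    and error handling omitted):

        struct iocb cb = {
                .aio_lio_opcode = IOCB_CMD_PWRITE,
                .aio_fildes     = fd,
                .aio_buf        = (__u64)(uintptr_t)buf,
                .aio_nbytes     = len,
                .aio_rw_flags   = RWF_NOWAIT,  /* fail fast, don't block */
        };
        struct iocb *list[1] = { &cb };

        syscall(__NR_io_submit, ctx, 1, list);
        /* a completion with res == -EAGAIN means it would have blocked */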

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
     
  • aio_rw_flags is introduced in struct iocb (using aio_reserved1) and will
    carry the RWF_* flags. We cannot use aio_flags because it is not checked
    for validity, which may break existing applications.

    Note: the only place RWF_HIPRI takes effect is dio_await_one(). In all
    other locations the aio code returns -EIOCBQUEUED before the RWF_HIPRI
    checks.
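
    A sketch of the validity check this enables (the accepted set shown is
    illustrative of the RWF_* bits of that era): unknown bits in
    aio_rw_flags are rejected, unlike the never-checked aio_flags:

        if (unlikely(iocb->aio_rw_flags &
                     ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)))
                return -EINVAL;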

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
     

04 Mar, 2017

1 commit

  • Pull sched.h split-up from Ingo Molnar:
    "The point of these changes is to significantly reduce the
    header footprint, to speed up the kernel build and to
    have a cleaner header structure.

    After these changes the new <linux/sched.h>'s typical preprocessed size
    goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
    lines), which is around 40% faster to build on typical configs.

    Not much changed from the last version (-v2) posted three weeks ago: I
    eliminated quirks, backmerged fixes plus I rebased it to an upstream
    SHA1 from yesterday that includes most changes queued up in -next plus
    all sched.h changes that were pending from Andrew.

    I've re-tested the series both on x86 and on cross-arch defconfigs,
    and did a bisectability test at a number of random points.

    I tried to test as many build configurations as possible, but some
    build breakage is probably still left - but it should be mostly
    limited to architectures that have no cross-compiler binaries
    available on kernel.org, and non-default configurations"

    * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
    sched/headers: Clean up
    sched/headers: Remove #ifdefs from
    sched/headers: Remove the include from
    sched/headers, hrtimer: Remove the include from
    sched/headers, x86/apic: Remove the header inclusion from
    sched/headers, timers: Remove the include from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/core: Remove unused prefetch_stack()
    sched/headers: Remove from
    sched/headers: Remove the 'init_pid_ns' prototype from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the runqueue_is_locked() prototype
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the include from
    sched/headers: Remove from
    ...

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Pull vfs pile two from Al Viro:

    - orangefs fix

    - series of fs/namei.c cleanups from me

    - VFS stuff coming from overlayfs tree

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    orangefs: Use RCU for destroy_inode
    vfs: use helper for calling f_op->fsync()
    mm: use helper for calling f_op->mmap()
    vfs: use helpers for calling f_op->{read,write}_iter()
    vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
    vfs: extract common parts of {compat_,}do_readv_writev()
    vfs: wrap write f_ops with file_{start,end}_write()
    vfs: deny copy_file_range() for non regular files
    vfs: deny fallocate() on directory
    vfs: create vfs helper vfs_tmpfile()
    namei.c: split unlazy_walk()
    namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
    lookup_fast(): clean up the logics around the fallback to non-rcu mode
    namei: fold unlazy_link() into its sole caller

    Linus Torvalds
     

25 Feb, 2017

1 commit

  • When a non-cooperative userfaultfd monitor copies pages in the
    background, it may encounter regions that were already unmapped. The
    addition of UFFD_EVENT_UNMAP allows the uffd monitor to precisely track
    changes in the virtual memory layout.

    Since there might be different uffd contexts for the affected VMAs, we
    should first create a temporary representation of the unmap event for
    each uffd context and then deliver them one by one to the appropriate
    userfault file descriptors.

    The event notification occurs after the mmap_sem has been released.
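
    A sketch of the two-phase delivery around an unmap (helper names from
    the userfaultfd non-cooperative series; the call site is illustrative):

        LIST_HEAD(uf_unmap);

        down_write(&mm->mmap_sem);
        /* phase 1: record a pending UFFD_EVENT_UNMAP per affected ctx */
        userfaultfd_unmap_prep(vma, start, end, &uf_unmap);
        /* ... do the actual unmapping ... */
        up_write(&mm->mmap_sem);

        /* phase 2: notify the monitors only after mmap_sem is released */
        userfaultfd_unmap_complete(mm, &uf_unmap);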

    [arnd@arndb.de: fix nommu build]
    Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
    [mhocko@suse.com: fix nommu build]
    Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

15 Jan, 2017

1 commit

  • lockdep reports a warning. file_start_write()/file_end_write() only
    acquire/release the freeze lock for regular files, so check the file
    type on the aio side too (see the sketch after the trace).

    [ 453.532141] ------------[ cut here ]------------
    [ 453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
    [ 453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
    ...
    [ 453.533011] dump_stack+0x67/0x9c
    [ 453.533011] __warn+0x111/0x130
    [ 453.533011] warn_slowpath_fmt+0x97/0xb0
    [ 453.533011] ? __warn+0x130/0x130
    [ 453.533011] ? blk_finish_plug+0x29/0x60
    [ 453.533011] lock_release+0x434/0x670
    [ 453.533011] ? import_single_range+0xd4/0x110
    [ 453.533011] ? rw_verify_area+0x65/0x140
    [ 453.533011] ? aio_write+0x1f6/0x280
    [ 453.533011] aio_write+0x229/0x280
    [ 453.533011] ? aio_complete+0x640/0x640
    [ 453.533011] ? debug_check_no_locks_freed+0x1a0/0x1a0
    [ 453.533011] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
    [ 453.533011] ? debug_lockdep_rcu_enabled+0x35/0x40
    [ 453.533011] ? __might_fault+0x7e/0xf0
    [ 453.533011] do_io_submit+0x94c/0xb10
    [ 453.533011] ? do_io_submit+0x23e/0xb10
    [ 453.533011] ? SyS_io_destroy+0x270/0x270
    [ 453.533011] ? mark_held_locks+0x23/0xc0
    [ 453.533011] ? trace_hardirqs_on_thunk+0x1a/0x1c
    [ 453.533011] SyS_io_submit+0x10/0x20
    [ 453.533011] entry_SYSCALL_64_fastpath+0x18/0xad
    [ 453.533011] ? trace_hardirqs_off_caller+0xc0/0x110
    [ 453.533011] ---[ end trace b2fbe664d1cc0082 ]---
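
    A sketch of the fix in aio_write() (fs/aio.c of that era; illustrative):

        file_start_write(file);        /* no-op for non-regular files */
        ret = aio_ret(req, file->f_op->write_iter(&req->common, &iter));
        /*
         * Freeze protection is released in aio_complete(); fool lockdep
         * by marking the lock released now - but only where it was
         * actually taken.
         */
        if (S_ISREG(file_inode(file)->i_mode))
                __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE);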

    Cc: Dmitry Monakhov
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Cc: Al Viro
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Al Viro

    Shaohua Li
     

26 Dec, 2016

1 commit

  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64-bit machines and in an endianness-optimized
    timespec variant on 32-bit machines. The Y2038 cleanup removed the
    timespec variant and switched everything to scalar nanoseconds. The
    union remained, but became completely pointless.

    Get rid of the union and just keep ktime_t as a simple typedef of type
    s64.

    The conversion was done with coccinelle and some manual mopping up.
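
    The shape of the change (sketch):

        /* before: a union whose 32-bit timespec variant is long gone */
        union ktime {
                s64     tv64;
        };
        typedef union ktime ktime_t;

        /* after: plain scalar nanoseconds */
        typedef s64     ktime_t;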

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

23 Dec, 2016

1 commit

  • ... and fix the minor buglet in compat io_submit() - the native one
    kills the ioctx as cleanup when put_user() fails. Get rid of the bogus
    compat_... variants in the !CONFIG_AIO case while we are at it - they
    should simply fail with ENOSYS, same as their native counterparts.

    Signed-off-by: Al Viro

    Al Viro
     

16 Sep, 2016

1 commit

  • This ensures that do_mmap() won't implicitly make AIO memory mappings
    executable if the READ_IMPLIES_EXEC personality flag is set. Such
    behavior is problematic because the security_mmap_file LSM hook doesn't
    catch this case, potentially permitting an attacker to bypass a W^X
    policy enforced by SELinux.

    I have tested the patch on my machine.

    To test the behavior, compile and run this:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/personality.h>
    #include <linux/aio_abi.h>
    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        personality(READ_IMPLIES_EXEC);
        aio_context_t ctx = 0;
        if (syscall(__NR_io_setup, 1, &ctx))
            err(1, "io_setup");

        char cmd[1000];
        sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
                (int)getpid());
        system(cmd);
        return 0;
    }

    In the output, "rw-s" is good, "rwxs" is bad.

    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds

    Jann Horn
     

24 May, 2016

1 commit

  • aio_setup_ring() waits for mmap_sem in writable mode. If the waiting
    task gets killed by the OOM killer, it would block the oom_reaper from
    asynchronous address space reclaim and reduce the chances of timely OOM
    resolution. Wait for the lock in killable mode and return with -EINTR
    if the task got killed while waiting. This will also expedite the
    return to userspace and do_exit().
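
    A sketch of the change in aio_setup_ring() (illustrative):

        if (down_write_killable(&mm->mmap_sem)) {
                ctx->mmap_size = 0;
                aio_free_ring(ctx);
                return -EINTR;         /* killed while waiting for the lock */
        }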

    Signed-off-by: Michal Hocko
    Acked-by: Jeff Moyer
    Acked-by: Vlastimil Babka
    Cc: Benamin LaHaise
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Sep, 2015

1 commit

  • vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
    way ->mremap() can have more users. Say, vdso.

    While at it, s/aio_ring_remap/aio_ring_mremap/.

    Note: this is the minimal change before ->mremap() finds another user in
    file_operations; this method should have more arguments, and it can be
    used to kill arch_remap().
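
    A sketch of the resulting call site in move_vma() (illustrative):

        if (new_vma->vm_ops && new_vma->vm_ops->mremap)
                err = new_vma->vm_ops->mremap(new_vma);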

    Signed-off-by: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Acked-by: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

17 Apr, 2015

2 commits

  • Pull third hunk of vfs changes from Al Viro:
    "This contains the ->direct_IO() changes from Omar + saner
    generic_write_checks() + dealing with fcntl()/{read,write}() races
    (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
    repeatedly looking at ->f_flags, which can be changed by fcntl(2),
    check ->ki_flags - which cannot) + infrastructure bits for dhowells'
    d_inode annotations + Christophs switch of /dev/loop to
    vfs_iter_write()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
    block: loop: switch to VFS ITER_BVEC
    configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
    VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
    VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
    VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
    NFS: Don't use d_inode as a variable name
    VFS: Impose ordering on accesses of d_inode and d_flags
    VFS: Add owner-filesystem positive/negative dentry checks
    nfs: generic_write_checks() shouldn't be done on swapout...
    ocfs2: use __generic_file_write_iter()
    mirror O_APPEND and O_DIRECT into iocb->ki_flags
    switch generic_write_checks() to iocb and iter
    ocfs2: move generic_write_checks() before the alignment checks
    ocfs2_file_write_iter: stop messing with ppos
    udf_file_write_iter: reorder and simplify
    fuse: ->direct_IO() doesn't need generic_write_checks()
    ext4_file_write_iter: move generic_write_checks() up
    xfs_file_aio_write_checks: switch to iocb/iov_iter
    generic_write_checks(): drop isblk argument
    blkdev_write_iter: expand generic_file_checks() call in there
    ...

    Linus Torvalds
     
  • Pull block layer core bits from Jens Axboe:
    "This is the core pull request for 4.1. Not a lot of stuff in here for
    this round, mostly little fixes or optimizations. This pull request
    contains:

    - An optimization that speeds up queue runs on blk-mq, especially for
    the case where there's a large difference between nr_cpu_ids and
    the actual mapped software queues on a hardware queue. From Chong
    Yuan.

    - Honor node local allocations for requests on legacy devices. From
    David Rientjes.

    - Cleanup of blk_mq_rq_to_pdu() from me.

    - exit_aio() fixup from me, greatly speeding up exiting multiple IO
    contexts off exit_group(). For my particular test case, fio exit
    took ~6 seconds. A typical case of both exposing RCU grace periods
    to user space, and serializing exit of them.

    - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only
    wait if __GFP_WAIT is set. From Keith Busch.

    - blk-mq exports and two added helpers from Mike Snitzer, which will
    be used by the dm-mq code.

    - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang"

    * 'for-4.1/core' of git://git.kernel.dk/linux-block:
    blk-mq: reduce unnecessary software queue looping
    aio: fix serial draining in exit_aio()
    blk-mq: cleanup blk_mq_rq_to_pdu()
    blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
    block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
    block: allocate request memory local to request queue
    blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
    blk-mq: export blk_mq_run_hw_queues
    blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk

    Linus Torvalds
     

16 Apr, 2015

2 commits

  • Pull second vfs update from Al Viro:
    "Now that net-next went in... Here's the next big chunk - killing
    ->aio_read() and ->aio_write().

    There'll be one more pile today (direct_IO changes and
    generic_write_checks() cleanups/fixes), but I'd prefer to keep that
    one separate"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    ->aio_read and ->aio_write removed
    pcm: another weird API abuse
    infinibad: weird APIs switched to ->write_iter()
    kill do_sync_read/do_sync_write
    fuse: use iov_iter_get_pages() for non-splice path
    fuse: switch to ->read_iter/->write_iter
    switch drivers/char/mem.c to ->read_iter/->write_iter
    make new_sync_{read,write}() static
    coredump: accept any write method
    switch /dev/loop to vfs_iter_write()
    serial2002: switch to __vfs_read/__vfs_write
    ashmem: use __vfs_read()
    export __vfs_read()
    autofs: switch to __vfs_write()
    new helper: __vfs_write()
    switch hugetlbfs to ->read_iter()
    coda: switch to ->read_iter/->write_iter
    ncpfs: switch to ->read_iter/->write_iter
    net/9p: remove (now-)unused helpers
    p9_client_attach(): set fid->uid correctly
    ...

    Linus Torvalds
     
  • exit_aio() currently serializes killing io contexts. Each context
    killing ends up having to do percpu_ref_kill(), which in turn has to
    wait for an RCU grace period. This can take a long time, depending on
    the number of contexts. And there's no point in doing them serially
    when we could be waiting for all of them in one fell swoop.

    This patch makes my fio thread offload test case exit in 0.2s instead
    of almost 6s.
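
    A trimmed sketch of the batched teardown (fs/aio.c names; illustrative):
    start killing every ctx first, then wait once on a shared completion,
    so the RCU grace periods overlap instead of serializing:

        atomic_set(&wait.count, table->nr);
        init_completion(&wait.comp);

        for (i = 0; i < table->nr; ++i) {
                struct kioctx *ctx =
                        rcu_dereference_protected(table->table[i], true);

                if (!ctx) {
                        skipped++;
                        continue;
                }
                ctx->mmap_size = 0;
                kill_ioctx(mm, ctx, &wait);    /* does not block here */
        }

        if (!atomic_sub_and_test(skipped, &wait.count))
                wait_for_completion(&wait.comp);  /* one wait for all */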

    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Apr, 2015

1 commit

  • Pull vfs update from Al Viro:
    "Part one:

    - struct filename-related cleanups

    - saner iov_iter_init() replacements (and switching the syscalls to
    use of those)

    - ntfs switch to ->write_iter() (Anton)

    - aio cleanups and splitting iocb into common and async parts
    (Christoph)

    - assorted fixes (me, bfields, Andrew Elble)

    There's a lot more, including the completion of switchover to
    ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
    race fixes, etc, but that goes after #for-davem merge. David has
    pulled it, and once it's in I'll send the next vfs pull request"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
    sg_start_req(): use import_iovec()
    sg_start_req(): make sure that there's not too many elements in iovec
    blk_rq_map_user(): use import_single_range()
    sg_io(): use import_iovec()
    process_vm_access: switch to {compat_,}import_iovec()
    switch keyctl_instantiate_key_common() to iov_iter
    switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
    aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
    vmsplice_to_user(): switch to import_iovec()
    kill aio_setup_single_vector()
    aio: simplify arguments of aio_setup_..._rw()
    aio: lift iov_iter_init() into aio_setup_..._rw()
    lift iov_iter into {compat_,}do_readv_writev()
    NFS: fix BUG() crash in notify_change() with patch to chown_common()
    dcache: return -ESTALE not -EBUSY on distributed fs race
    NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
    VFS: Add iov_iter_fault_in_multipages_readable()
    drop bogus check in file_open_root()
    switch security_inode_getattr() to struct path *
    constify tomoyo_realpath_from_path()
    ...

    Linus Torvalds
     

12 Apr, 2015

3 commits