31 Oct, 2016

4 commits


28 Sep, 2016

1 commit


16 Sep, 2016

1 commit

  • This ensures that do_mmap() won't implicitly make AIO memory mappings
    executable if the READ_IMPLIES_EXEC personality flag is set. Such
    behavior is problematic because the security_mmap_file LSM hook doesn't
    catch this case, potentially permitting an attacker to bypass a W^X
    policy enforced by SELinux.

    I have tested the patch on my machine.

    To test the behavior, compile and run this:

    #define _GNU_SOURCE
    #include <err.h>
    #include <linux/aio_abi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/personality.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        personality(READ_IMPLIES_EXEC);
        aio_context_t ctx = 0;
        if (syscall(__NR_io_setup, 1, &ctx))
            err(1, "io_setup");

        char cmd[1000];
        sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
                (int)getpid());
        system(cmd);
        return 0;
    }

    In the output, "rw-s" is good, "rwxs" is bad.

    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds

    Jann Horn
     

24 May, 2016

1 commit

  • aio_setup_ring waits for mmap_sem in writable mode. If the waiting task
    gets killed by the OOM killer, it blocks the oom_reaper from
    asynchronous address space reclaim and reduces the chances of timely
    OOM resolution. Wait for the lock in killable mode and return with
    EINTR if the task got killed while waiting. This will also expedite
    the return to userspace and do_exit.
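
    As a hedged illustration of what the new failure mode means for
    callers (not part of the patch itself): io_setup(2) can now fail
    with EINTR, which userspace conventionally treats as retryable.
    A minimal sketch using the raw syscall, as elsewhere on this page:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <linux/aio_abi.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        aio_context_t ctx = 0;
        long ret;

        do {
            ret = syscall(__NR_io_setup, 128, &ctx);   /* raw syscall */
        } while (ret < 0 && errno == EINTR);           /* retry on EINTR */

        if (ret < 0) {
            perror("io_setup");
            return 1;
        }
        printf("aio ring mapped, ctx = %#lx\n", (unsigned long)ctx);
        syscall(__NR_io_destroy, ctx);
        return 0;
    }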

    Signed-off-by: Michal Hocko
    Acked-by: Jeff Moyer
    Acked-by: Vlastimil Babka
    Cc: Benjamin LaHaise
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Apr, 2016

1 commit


05 Sep, 2015

1 commit

  • vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
    way ->mremap() can have more users. Say, vdso.

    While at it, s/aio_ring_remap/aio_ring_mremap/.

    Note: this is the minimal change before ->mremap() finds another user in
    file_operations; this method should have more arguments, and it can be
    used to kill arch_remap().

    Signed-off-by: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Acked-by: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

17 Apr, 2015

2 commits

  • Pull third hunk of vfs changes from Al Viro:
    "This contains the ->direct_IO() changes from Omar + saner
    generic_write_checks() + dealing with fcntl()/{read,write}() races
    (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
    repeatedly looking at ->f_flags, which can be changed by fcntl(2),
    check ->ki_flags - which cannot) + infrastructure bits for dhowells'
    d_inode annotations + Christoph's switch of /dev/loop to
    vfs_iter_write()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
    block: loop: switch to VFS ITER_BVEC
    configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
    VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
    VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
    VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
    NFS: Don't use d_inode as a variable name
    VFS: Impose ordering on accesses of d_inode and d_flags
    VFS: Add owner-filesystem positive/negative dentry checks
    nfs: generic_write_checks() shouldn't be done on swapout...
    ocfs2: use __generic_file_write_iter()
    mirror O_APPEND and O_DIRECT into iocb->ki_flags
    switch generic_write_checks() to iocb and iter
    ocfs2: move generic_write_checks() before the alignment checks
    ocfs2_file_write_iter: stop messing with ppos
    udf_file_write_iter: reorder and simplify
    fuse: ->direct_IO() doesn't need generic_write_checks()
    ext4_file_write_iter: move generic_write_checks() up
    xfs_file_aio_write_checks: switch to iocb/iov_iter
    generic_write_checks(): drop isblk argument
    blkdev_write_iter: expand generic_file_checks() call in there
    ...

    Linus Torvalds
     
  • Pull block layer core bits from Jens Axboe:
    "This is the core pull request for 4.1. Not a lot of stuff in here for
    this round, mostly little fixes or optimizations. This pull request
    contains:

    - An optimization that speeds up queue runs on blk-mq, especially for
    the case where there's a large difference between nr_cpu_ids and
    the actual mapped software queues on a hardware queue. From Chong
    Yuan.

    - Honor node local allocations for requests on legacy devices. From
    David Rientjes.

    - Cleanup of blk_mq_rq_to_pdu() from me.

    - exit_aio() fixup from me, greatly speeding up exiting multiple IO
    contexts off exit_group(). For my particular test case, fio exit
    took ~6 seconds. A typical case of both exposing RCU grace periods
    to user space, and serializing exit of them.

    - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only
    wait if __GFP_WAIT is set. From Keith Busch.

    - blk-mq exports and two added helpers from Mike Snitzer, which will
    be used by the dm-mq code.

    - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang"

    * 'for-4.1/core' of git://git.kernel.dk/linux-block:
    blk-mq: reduce unnecessary software queue looping
    aio: fix serial draining in exit_aio()
    blk-mq: cleanup blk_mq_rq_to_pdu()
    blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
    block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
    block: allocate request memory local to request queue
    blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
    blk-mq: export blk_mq_run_hw_queues
    blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk

    Linus Torvalds
     

16 Apr, 2015

2 commits

  • Pull second vfs update from Al Viro:
    "Now that net-next went in... Here's the next big chunk - killing
    ->aio_read() and ->aio_write().

    There'll be one more pile today (direct_IO changes and
    generic_write_checks() cleanups/fixes), but I'd prefer to keep that
    one separate"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    ->aio_read and ->aio_write removed
    pcm: another weird API abuse
    infinibad: weird APIs switched to ->write_iter()
    kill do_sync_read/do_sync_write
    fuse: use iov_iter_get_pages() for non-splice path
    fuse: switch to ->read_iter/->write_iter
    switch drivers/char/mem.c to ->read_iter/->write_iter
    make new_sync_{read,write}() static
    coredump: accept any write method
    switch /dev/loop to vfs_iter_write()
    serial2002: switch to __vfs_read/__vfs_write
    ashmem: use __vfs_read()
    export __vfs_read()
    autofs: switch to __vfs_write()
    new helper: __vfs_write()
    switch hugetlbfs to ->read_iter()
    coda: switch to ->read_iter/->write_iter
    ncpfs: switch to ->read_iter/->write_iter
    net/9p: remove (now-)unused helpers
    p9_client_attach(): set fid->uid correctly
    ...

    Linus Torvalds
     
  • exit_aio() currently serializes killing io contexts. Each context
    killing ends up having to do percpu_ref_kill(), which in turns has
    to wait for an RCU grace period. This can take a long time, depending
    on the number of contexts. And there's no point in doing them serially,
    when we could be waiting for all of them in one fell swoop.

    This patch makes my fio thread offload test case exit in 0.2s instead
    of almost 6s.
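
    A hedged userspace analogy of the change (threads stand in for io
    contexts and a sleep for the RCU grace period; this is not the
    kernel code): start every teardown first, then wait once, instead
    of paying one full wait per context.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NCTX 8

    /* Stand-in for one context kill that includes a grace-period-like wait. */
    static void *kill_one(void *arg)
    {
        usleep(100 * 1000);        /* pretend grace period: 100ms */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCTX];
        int i;

        /* The serialized version would create and join inside a single
         * loop, paying ~NCTX * 100ms. Instead, kick everything off... */
        for (i = 0; i < NCTX; i++)
            pthread_create(&t[i], NULL, kill_one, NULL);

        /* ...then wait for all of them in one fell swoop: ~100ms total. */
        for (i = 0; i < NCTX; i++)
            pthread_join(t[i], NULL);

        puts("all contexts drained");
        return 0;       /* build with: cc -pthread demo.c */
    }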

    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Apr, 2015

1 commit

  • Pull vfs update from Al Viro:
    "Part one:

    - struct filename-related cleanups

    - saner iov_iter_init() replacements (and switching the syscalls to
    use of those)

    - ntfs switch to ->write_iter() (Anton)

    - aio cleanups and splitting iocb into common and async parts
    (Christoph)

    - assorted fixes (me, bfields, Andrew Elble)

    There's a lot more, including the completion of switchover to
    ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
    race fixes, etc, but that goes after #for-davem merge. David has
    pulled it, and once it's in I'll send the next vfs pull request"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
    sg_start_req(): use import_iovec()
    sg_start_req(): make sure that there's not too many elements in iovec
    blk_rq_map_user(): use import_single_range()
    sg_io(): use import_iovec()
    process_vm_access: switch to {compat_,}import_iovec()
    switch keyctl_instantiate_key_common() to iov_iter
    switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
    aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
    vmsplice_to_user(): switch to import_iovec()
    kill aio_setup_single_vector()
    aio: simplify arguments of aio_setup_..._rw()
    aio: lift iov_iter_init() into aio_setup_..._rw()
    lift iov_iter into {compat_,}do_readv_writev()
    NFS: fix BUG() crash in notify_change() with patch to chown_common()
    dcache: return -ESTALE not -EBUSY on distributed fs race
    NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
    VFS: Add iov_iter_fault_in_multipages_readable()
    drop bogus check in file_open_root()
    switch security_inode_getattr() to struct path *
    constify tomoyo_realpath_from_path()
    ...

    Linus Torvalds
     

12 Apr, 2015

10 commits


07 Apr, 2015

2 commits

  • If we fail past the aio_setup_ring(), we need to destroy the
    mapping. We don't need to care about anybody having found ctx,
    or added requests to it, since the last failure exit is exactly
    the failure to make ctx visible to lookups.

    Reproducer (based on one by Joe Mario):

    /* Build with: cc repro.c -laio  (io_setup/io_destroy here are the
     * libaio wrappers; the headers were added for a standalone build) */
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    void count(char *p)
    {
        char s[80];
        printf("%s: ", p);
        fflush(stdout);
        sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid());
        system(s);
    }

    int main()
    {
        io_context_t *ctx;
        int created, limit, i, destroyed;
        FILE *f;

        count("before");
        if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL)
            perror("opening aio-max-nr");
        else if (fscanf(f, "%d", &limit) != 1)
            fprintf(stderr, "can't parse aio-max-nr\n");
        else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL)
            perror("allocating aio_context_t array");
        else {
            for (i = 0, created = 0; i < limit; i++) {
                if (io_setup(1000, ctx + created) == 0)
                    created++;
            }
            for (i = 0, destroyed = 0; i < created; i++)
                if (io_destroy(ctx[i]) == 0)
                    destroyed++;
            printf("created %d, failed %d, destroyed %d\n",
                   created, limit - created, destroyed);
            count("after");
        }
        return 0;
    }

    Found-by: Joe Mario
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • teach ->mremap() method to return an error and have it fail for
    aio mappings in process of being killed

    Note that in case of ->mremap() failure we need to undo move_page_tables()
    we'd already done; we could call ->mremap() first, but then the failure of
    move_page_tables() would require undoing whatever _successful_ ->mremap()
    has done, which would be a lot more headache in general.

    Signed-off-by: Al Viro

    Al Viro
     

14 Mar, 2015

2 commits

  • Most callers in the kernel want to perform synchronous file I/O, but
    still have to bloat the stack with a full struct kiocb. Split out
    the parts needed in filesystem code from those in the aio code, and
    only allocate those needed to pass down arguments on the stack. The
    aio code embeds the generic iocb in the one it allocates and can
    easily get back to it by using container_of.

    Also add a ->ki_complete method to struct kiocb, this is used to call
    into the aio code and thus removes the dependency on aio for filesystems
    implementing asynchronous operations. It will also allow other callers
    to substitute their own completion callback.

    We also add a new ->ki_flags field to work around the nasty layering
    violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add
    eventfd notification about ffs events").

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • The AIO interface is fairly complex because it tries to allow
    filesystems to always work async and then wake up a synchronous
    caller through aio_complete. It turns out that basically no one
    was doing this to avoid the complexity and context switches,
    and we've already fixed up the remaining users and can now
    get rid of this case.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

13 Mar, 2015

1 commit


20 Feb, 2015

1 commit


13 Feb, 2015

1 commit

  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the life time rules. In this part, the
    most important change is to split the unrelated nommu mmap flags from
    it, but also removing a backing_dev_info pointer from the
    address_space (and inode), and a cleanup of other various minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

04 Feb, 2015

1 commit

  • Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw
    warnings like the following due to being called from wait_event
    context:

    WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90()
    do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_event+0x63/0x110
    Modules linked in:
    CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8
    ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8
    ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300
    Call Trace:
    [] dump_stack+0x4c/0x65
    [] warn_slowpath_common+0x8a/0xc0
    [] warn_slowpath_fmt+0x46/0x50
    [] ? prepare_to_wait_event+0x63/0x110
    [] ? prepare_to_wait_event+0x63/0x110
    [] __might_sleep+0x7f/0x90
    [] mutex_lock+0x24/0x45
    [] aio_read_events+0x4c/0x290
    [] read_events+0x1ec/0x220
    [] ? prepare_to_wait_event+0x110/0x110
    [] ? hrtimer_get_res+0x50/0x50
    [] SyS_io_getevents+0x4d/0xb0
    [] system_call_fastpath+0x12/0x17
    ---[ end trace bde69eaf655a4fea ]---

    There is not actually a bug here, so annotate the code to tell the
    debug logic that everything is just fine and not to fire a false
    positive.

    Signed-off-by: Dave Chinner
    Signed-off-by: Benjamin LaHaise

    Dave Chinner
     

21 Jan, 2015

2 commits

  • Now that we never use the backing_dev_info pointer in struct address_space
    we can simply remove it and save 4 to 8 bytes in every inode.

    Signed-off-by: Christoph Hellwig
    Acked-by: Ryusuke Konishi
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Since "BDI: Provide backing device capability information [try #3]" the
    backing_dev_info structure also provides flags for the kind of mmap
    operation available in a nommu environment, which is entirely unrelated
    to its original purpose.

    Introduce a new nommu-only file operation to provide this information to
    the nommu mmap code instead. Splitting this from the backing_dev_info
    structure allows removing many backing_dev_info instances that aren't
    otherwise needed, and entirely gets rid of the concept of providing a
    backing_dev_info for a character device. It also removes the need for
    the mtd_inodefs filesystem.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Acked-by: Brian Norris
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Dec, 2014

2 commits

  • In this case, it is basically polling. Let's not involve a timer at
    all, because that would hurt performance for application event loops.

    In an arbitrary test I've done, io_getevents syscall elapsed time
    reduces from 50000+ nanoseconds to a few hundred.
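
    A hedged sketch of the polling pattern in question (raw syscalls,
    nothing in flight, zero timeout so io_getevents returns at once):

    #define _GNU_SOURCE
    #include <err.h>
    #include <linux/aio_abi.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        aio_context_t ctx = 0;
        struct io_event events[8];
        struct timespec zero = { 0, 0 };    /* poll, don't sleep */
        long n;

        if (syscall(__NR_io_setup, 8, &ctx))
            err(1, "io_setup");

        /* Returns immediately; with this patch the kernel no longer
         * sets up a timer just to have it expire at once. */
        n = syscall(__NR_io_getevents, ctx, 0, 8, events, &zero);
        printf("io_getevents returned %ld events\n", n);

        syscall(__NR_io_destroy, ctx);
        return 0;
    }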

    Signed-off-by: Fam Zheng
    Signed-off-by: Benjamin LaHaise

    Fam Zheng
     
  • There are actually two issues this patch addresses. Let me start with
    the one I tried to solve in the beginning.

    So, in the checkpoint-restore project (criu) we try to dump tasks'
    state and restore one back exactly as it was. One of the tasks' state
    bits is rings set up with the io_setup() call. There are (almost) no
    problems dumping them; there is a problem restoring them -- if I dump a task
    with aio ring originally mapped at address A, I want to restore one
    back at exactly the same address A. Unfortunately, the io_setup() does
    not allow for that -- it mmaps the ring at whatever place mm finds
    appropriate (it calls do_mmap_pgoff() with zero address and without
    the MAP_FIXED flag).

    To make restore possible I'm going to mremap() the freshly created ring
    into the address A (under which it was seen before dump). The problem is
    that the ring's virtual address is passed back to the user-space as the
    context ID and this ID is then used as search key by all the other io_foo()
    calls. Reworking this ID to be just some integer doesn't seem to work, as
    this value is already used by libaio as a pointer using which this library
    accesses memory for aio meta-data.

    So, to make restore work we need to make sure that

    a) ring is mapped at desired virtual address
    b) kioctx->user_id matches this value

    Having said that, the patch makes mremap() on aio region update the
    kioctx's user_id and mmap_base values.

    Here appears the 2nd issue I mentioned in the beginning of this mail.
    If (regardless of the C/R dances I do) someone creates an io context
    with io_setup(), then mremap()-s the ring and then destroys the context,
    the kill_ioctx() routine will call munmap() on the wrong (old) address.
    This will result in a) aio ring remaining in memory and b) some other
    vma get unexpectedly unmapped.

    What do you think?
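
    A hedged sketch of the restore step described above (it assumes the
    ring fits in one page, which only holds for small rings on small
    machines; a real restorer would read the ring size from
    /proc/pid/maps). With this patch, kioctx->user_id follows the move,
    so io_destroy() on the new address works:

    #define _GNU_SOURCE
    #include <err.h>
    #include <linux/aio_abi.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        aio_context_t ctx = 0;
        long page = sysconf(_SC_PAGESIZE);

        if (syscall(__NR_io_setup, 1, &ctx))
            err(1, "io_setup");

        /* Reserve a destination, then move the ring onto it, the way a
         * restorer would reproduce the pre-dump address A. */
        void *dst = mmap(NULL, page, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (dst == MAP_FAILED)
            err(1, "mmap");

        void *ring = mremap((void *)(unsigned long)ctx, page, page,
                            MREMAP_MAYMOVE | MREMAP_FIXED, dst);
        if (ring == MAP_FAILED)
            err(1, "mremap");
        printf("ring moved from %#lx to %p\n", (unsigned long)ctx, ring);

        /* The new address now serves as the context ID; without the
         * patch, the kernel would munmap the stale address here. */
        if (syscall(__NR_io_destroy, (aio_context_t)(unsigned long)ring))
            err(1, "io_destroy");
        return 0;
    }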

    Signed-off-by: Pavel Emelyanov
    Acked-by: Dmitry Monakhov
    Signed-off-by: Benjamin LaHaise

    Pavel Emelyanov
     

26 Nov, 2014

1 commit


07 Nov, 2014

1 commit

  • https://bugzilla.kernel.org/show_bug.cgi?id=86831

    Markus reported that shutting down mysqld (with AIO support, on an
    ext3-formatted hard drive) leads to a negative number of dirty pages
    (an underrun of the counter). The negative number causes a drastic
    reduction in write performance because the page cache is not used: the
    kernel thinks there are still 2^32 dirty pages outstanding.

    Adding a warning in __dec_zone_state catches this easily:

    static inline void __dec_zone_state(struct zone *zone,
                                        enum zone_stat_item item)
    {
            atomic_long_dec(&zone->vm_stat[item]);
    +       WARN_ON_ONCE(item == NR_FILE_DIRTY &&
                    atomic_long_read(&zone->vm_stat[item]) < 0);
            atomic_long_dec(&vm_stat[item]);
    }

    [ 21.341632] ------------[ cut here ]------------
    [ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224()
    [ 21.355296] Modules linked in: wutbox_cp sata_mv
    [ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
    [ 21.366793] Workqueue: events free_ioctx
    [ 21.370760] [] (unwind_backtrace) from [] (show_stack+0x20/0x24)
    [ 21.378562] [] (show_stack) from [] (dump_stack+0x24/0x28)
    [ 21.385840] [] (dump_stack) from [] (warn_slowpath_common+0x84/0x9c)
    [ 21.393976] [] (warn_slowpath_common) from [] (warn_slowpath_null+0x2c/0x34)
    [ 21.402800] [] (warn_slowpath_null) from [] (cancel_dirty_page+0x164/0x224)
    [ 21.411524] [] (cancel_dirty_page) from [] (truncate_inode_page+0x8c/0x158)
    [ 21.420272] [] (truncate_inode_page) from [] (truncate_inode_pages_range+0x11c/0x53c)
    [ 21.429890] [] (truncate_inode_pages_range) from [] (truncate_pagecache+0x88/0xac)
    [ 21.439252] [] (truncate_pagecache) from [] (truncate_setsize+0x5c/0x74)
    [ 21.447731] [] (truncate_setsize) from [] (put_aio_ring_file.isra.14+0x34/0x90)
    [ 21.456826] [] (put_aio_ring_file.isra.14) from [] (aio_free_ring+0x20/0xcc)
    [ 21.465660] [] (aio_free_ring) from [] (free_ioctx+0x24/0x44)
    [ 21.473190] [] (free_ioctx) from [] (process_one_work+0x134/0x47c)
    [ 21.481132] [] (process_one_work) from [] (worker_thread+0x130/0x414)
    [ 21.489350] [] (worker_thread) from [] (kthread+0xd4/0xec)
    [ 21.496621] [] (kthread) from [] (ret_from_fork+0x14/0x20)
    [ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

    The cause is that we set the aio ring file pages *DIRTY* via SetPageDirty
    (which bypasses the VFS dirty-page increment) at init time, and the aio fs
    uses *default_backing_dev_info* as its backing dev, which does not disable
    the dirty-page accounting capability.
    So truncating the aio ring file contributes to dirty-page accounting (a
    VFS dirty-page decrement), and the error occurs.

    The original goal was to keep these pages in memory (not reclaimable
    or swappable) for their lifetime by marking them dirty. But thinking
    about it more, we have already pinned the pages by elevating their
    refcount, which achieves that goal, so the SetPageDirty is unnecessary.

    To fix the issue, use __set_page_dirty_no_writeback instead of the nop
    .set_page_dirty, and drop the SetPageDirty (don't manually set the
    dirty flag, don't disable set_page_dirty(); rely on the default
    behaviour).

    With the above change, dirty-page accounting works correctly. But as we
    know, the aio fs is an anonymous one that should never cause any real
    writeback, so we can skip dirty-page (writeback) accounting by disabling
    that capability. We therefore introduce an aio-private backing_dev_info
    (with the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) to replace
    the default one.

    Reported-by: Markus Königshaus
    Signed-off-by: Gu Zheng
    Cc: stable
    Acked-by: Andrew Morton
    Signed-off-by: Benjamin LaHaise

    Gu Zheng
     

25 Sep, 2014

2 commits

  • With the recent addition of percpu_ref_reinit(), percpu_ref now can be
    used as a persistent switch which can be turned on and off repeatedly
    where turning off maps to killing the ref and waiting for it to drain;
    however, there currently isn't a way to initialize a percpu_ref in its
    off (killed and drained) state, which can be inconvenient for certain
    persistent switch use cases.

    Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
    selection of operation mode; however, currently a newly initialized
    percpu_ref is always in percpu mode, making it impossible to avoid the
    latency overhead of switching to atomic mode.

    This patch adds @flags to percpu_ref_init() and implements the
    following flags.

    * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
    * PERCPU_REF_INIT_DEAD : start ref killed and drained

    These flags should be able to serve the above two use cases.

    v2: target_core_tpg.c conversion was missing. Fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Johannes Weiner

    Tejun Heo
     
  • …linux-block into for-3.18

    This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
    kludge for SCSI blk-mq stall during probe") which implements
    __percpu_ref_kill_expedited() to work around the SCSI blk-mq stall. That
    commit will be reverted and patches implementing a proper fix will be added.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>

    Tejun Heo