27 Jul, 2016

2 commits

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a followup to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

26 Jul, 2016

1 commit

  • Pull timer updates from Thomas Gleixner:
    "This update provides the following changes:

    - The rework of the timer wheel which addresses the shortcomings of
    the current wheel (cascading, slow search for next expiring timer,
    etc). That's the first major change of the wheel in almost 20
    years since Finn implemented it.

    - A large overhaul of the clocksource drivers init functions to
    consolidate the Device Tree initialization

    - Some more Y2038 updates

    - A capability fix for timerfd

    - Yet another clock chip driver

    - The usual pile of updates, comment improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
    tick/nohz: Optimize nohz idle enter
    clockevents: Make clockevents_subsys static
    clocksource/drivers/time-armada-370-xp: Fix return value check
    timers: Implement optimization for same expiry time in mod_timer()
    timers: Split out index calculation
    timers: Only wake softirq if necessary
    timers: Forward the wheel clock whenever possible
    timers/nohz: Remove pointless tick_nohz_kick_tick() function
    timers: Optimize collect_expired_timers() for NOHZ
    timers: Move __run_timers() function
    timers: Remove set_timer_slack() leftovers
    timers: Switch to a non-cascading wheel
    timers: Reduce the CPU index space to 256k
    timers: Give a few structs and members proper names
    hlist: Add hlist_is_singular_node() helper
    signals: Use hrtimer for sigtimedwait()
    timers: Remove the deprecated mod_timer_pinned() API
    timers, net/ipv4/inet: Initialize connection request timers as pinned
    timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
    timers, drivers/tty/metag_da: Initialize the poll timer as pinned
    ...

    Linus Torvalds
     

21 Jul, 2016

11 commits

  • For a front merge, the maximum number of sectors of the
    request must be checked against the front merge BIO sector,
    not the current sector of the request.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Damien Le Moal
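
The front merge check described above can be modelled in userspace. The following is a minimal sketch, not the kernel code: `struct limits`, `max_sectors_at()` and `front_merge_ok()` are hypothetical names, and the chunk-boundary rule is an assumption chosen to show why the offset matters on devices whose size limit depends on position (such as SMR drives).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical queue limits; chunk_sectors must be a power of two. */
struct limits {
    unsigned int max_sectors;
    unsigned int chunk_sectors;
};

/* Largest request allowed when the request starts at sector 'offset'. */
unsigned int max_sectors_at(const struct limits *lim, sector_t offset)
{
    unsigned int mask = lim->chunk_sectors - 1;
    unsigned int to_boundary;

    assert(lim->chunk_sectors != 0 && (lim->chunk_sectors & mask) == 0);
    to_boundary = lim->chunk_sectors - (unsigned int)(offset & mask);
    return to_boundary < lim->max_sectors ? to_boundary : lim->max_sectors;
}

/*
 * A front merge places the bio before the request, so the merged request
 * starts at the bio's sector. That is the offset the limit is evaluated at.
 */
bool front_merge_ok(const struct limits *lim, sector_t bio_sector,
                    unsigned int bio_sectors, unsigned int rq_sectors)
{
    return bio_sectors + rq_sectors <= max_sectors_at(lim, bio_sector);
}
```

With `max_sectors = 128` and `chunk_sectors = 256`, a 64-sector bio at sector 192 in front of a 64-sector request at sector 256 is rejected, because only 64 sectors remain before the chunk boundary. Evaluating the limit at the request's own sector (256) would have allowed the merge.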
     
  • Before merging a bio into an existing request, the io scheduler is called to
    get its approval first. However, the requests that come from a plug
    flush may get merged by the block layer without consulting the io
    scheduler.

    In case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low weight cgroup's request, the high weight cgroup
    now will depend on the low weight cgroup to get scheduled. If the high weight cgroup
    needs that io request to complete before submitting more requests, then it
    will also lose its timeslice.

    The following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
        mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
        if [ -d $i ]; then
            rmdir $i
        fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
        fio --readonly --name name1 --filename /dev/sdb \
            --rw read --size 64k --bs 64k --time_based \
            --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
        fio --readonly --name name1 --filename /dev/sdb \
            --rw read --size 64k --bs 64k --time_based \
            --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
        fio --readonly --name name1 --filename /dev/sdb \
            --rw read --size 64k --bs 64k --time_based \
            --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
        echo ---- $i ----
        cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
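
The idea of asking the scheduler before any merge can be illustrated with a small model. This is a sketch under stated assumptions: the structs, the `allow_merge_fn` hook and the "same cgroup only" policy are hypothetical simplifications for this example, not the kernel's elevator interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct request and struct bio. */
struct request { unsigned long long pos;    unsigned int sectors; int cgroup; };
struct bio     { unsigned long long sector; unsigned int sectors; int cgroup; };

/* Scheduler veto hook; NULL means no scheduler is consulted. */
typedef bool (*allow_merge_fn)(const struct request *rq, const struct bio *bio);

/* A CFQ-like policy assumed for this sketch: merge only within one cgroup. */
bool same_cgroup_only(const struct request *rq, const struct bio *bio)
{
    return rq->cgroup == bio->cgroup;
}

/* Back-merge the bio into rq if it is adjacent and the scheduler agrees. */
bool try_back_merge(struct request *rq, const struct bio *bio,
                    allow_merge_fn allow)
{
    assert(rq != NULL && bio != NULL);
    if (rq->pos + rq->sectors != bio->sector)
        return false;           /* not contiguous */
    if (allow && !allow(rq, bio))
        return false;           /* scheduler vetoed the merge */
    rq->sectors += bio->sectors;
    return true;
}
```

In the reproducer's terms, a 128-sector request from g1 at sector 0 and an adjacent 128-sector bio from g2 are contiguous, so a merge path that passes `NULL` for the hook merges them, while one that passes `same_cgroup_only` keeps them apart.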
     
  • Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Provides the ability to identify DAX enabled devices in userspace.

    Signed-off-by: Yigal Korman
    Signed-off-by: Toshi Kani
    Acked-by: Dan Williams
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Yigal Korman
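
From userspace, the flag can be read like any other queue attribute. The sketch below assumes the attribute is exposed as a `dax` file in the disk's `queue` sysfs directory and that it holds a single integer; `read_dax_flag()` is a hypothetical helper name, and the caller supplies the full path.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Returns 1 if the attribute reads as nonzero, 0 if it reads as zero,
 * and -1 if the file cannot be opened or does not contain an integer.
 */
int read_dax_flag(const char *path)
{
    FILE *f;
    int value;

    assert(path != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return value != 0;
}
```

A caller might pass a path such as `/sys/block/pmem0/queue/dax`, where `pmem0` is only an example device name.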
     
  • They are unused and potential new users really should use the
    blk_rq_map* versions.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • I wish the OSD code could simply use blk_rq_map_* helpers like
    everyone else, but the complex nature of deciding if we have
    DATA IN and/or DATA OUT buffers might make this impossible
    (at least for a mere human like me).

    But using blk_rq_append_bio at least allows sharing the setup code
    between requests with or without data buffers, and given that this
    is the last user of blk_make_request it allows getting rid of that
    somewhat awkward interface.

    Signed-off-by: Christoph Hellwig
    Acked-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The target SCSI passthrough backend is much better served with the low-level
    blk_rq_append_bio construct than the helpers built on top of it, so export it.

    Also use the opportunity to remove the pointless request_queue argument and
    make the code flow a little more readable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_get_request is used for BLOCK_PC and similar passthrough requests.
    Currently we always need to call blk_rq_set_block_pc or an open coded
    version of it to allow appending bios using the request mapping helpers
    later on, which is a somewhat awkward API. Instead move the
    initialization part of blk_rq_set_block_pc into blk_get_request, so that
    we always have a safe to use request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Instead of a flag and an index just make sure an index of 0 means
    no need to free the bvec array. Also move the constants related
    to the bvec pools together and use a consistent naming scheme for
    them.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • WRITE SAME is a data integrity operation and we can't simply ignore
    errors.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently blkdev_issue_zeroout cascades down from discards (if the driver
    guarantees that discards zero data), to WRITE SAME and then to a loop
    writing zeroes. Unfortunately we ignore run-time EOPNOTSUPP errors in the
    block layer blkdev_issue_discard helper to work around DM volumes that
    may have mixed discard support underneath.

    This patch introduces a new BLKDEV_DISCARD_ZERO flag to
    blkdev_issue_discard that indicates we are called for a zeroing operation.
    This allows us both to skip the EOPNOTSUPP hack and to consolidate
    the discard_zeroes_data check into the function.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
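
The zeroout cascade described above can be sketched as a chain of fallbacks. This is a userspace model under stated assumptions: `struct zero_ops`, `issue_zeroout()` and the stub operations are hypothetical, and each operation simply returns 0 or a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum zero_method {
    ZERO_FAILED,
    ZERO_BY_DISCARD,
    ZERO_BY_WRITE_SAME,
    ZERO_BY_WRITES
};

/* Hypothetical device model: each op returns 0 or a negative errno. */
struct zero_ops {
    bool discard_zeroes_data;
    int (*discard)(void);
    int (*write_same)(void);
    int (*write_zeroes)(void);
};

/* Stub operations used only to exercise the sketch. */
int op_ok(void)          { return 0; }
int op_unsupported(void) { return -EOPNOTSUPP; }

enum zero_method issue_zeroout(const struct zero_ops *ops)
{
    assert(ops->discard && ops->write_same && ops->write_zeroes);

    /*
     * Discard only counts if the device promises zeroed data, and a
     * failure (including -EOPNOTSUPP) is not treated as success here.
     */
    if (ops->discard_zeroes_data && ops->discard() == 0)
        return ZERO_BY_DISCARD;
    if (ops->write_same() == 0)
        return ZERO_BY_WRITE_SAME;
    if (ops->write_zeroes() == 0)
        return ZERO_BY_WRITES;
    return ZERO_FAILED;
}
```

The point of the model is the first branch: if a discard that reports `-EOPNOTSUPP` were counted as success, the caller would believe the range had been zeroed when nothing was written.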
     

14 Jul, 2016

1 commit

  • For 4K LBA or very large disks, atari_partition can easily get tricked
    into thinking it has found an Atari partition table. Depending on the
    data in the disk, it ends up creating partitions with awkward lengths.

    We saw logs like this while playing with fio.

    [5.625867] nvme2n1: AHDI p2
    [5.625872] nvme2n1: p2 size 2910030523 extends beyond EOD, truncated

    People have had issues with misinterpreted AHDI partition tables for a long
    time, see this BSD thread from 1995, for example.

    https://mail-index.netbsd.org/port-atari/1995/11/19/0001.html

    Since the atari partition, according to the spec, doesn't even support
    sector sizes of more than 512 bytes, a quick sanity check is reasonable to
    just bail out early, before even attempting to read sector 0.

    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
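
The early bail-out amounts to a single comparison before any I/O is issued. A minimal sketch, assuming the only supported logical block size is 512 bytes; `atari_should_probe()` and `AHDI_SECTOR_SIZE` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define AHDI_SECTOR_SIZE 512u

/* Returns true if it is worth reading sector 0 to look for an AHDI table. */
bool atari_should_probe(unsigned int logical_block_size)
{
    assert(logical_block_size != 0);
    return logical_block_size == AHDI_SECTOR_SIZE;
}
```

A 4K LBA device fails this check, so the parser never gets the chance to misread arbitrary data as a partition table.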
     

09 Jul, 2016

1 commit

  • …dimm/nvdimm into for-4.8/drivers

    Dan writes:

    "The removal of ->driverfs_dev in favor of just passing the parent
    device in as a parameter to add_disk(). See below, it has received a
    "Reviewed-by" from Christoph, Bart, and Johannes.

    It is also a pre-requisite for Fam Zheng's work to cleanup gendisk
    uevents vs attribute visibility [1]. We would extend device_add_disk()
    to take an attribute_group list.

    This is based off a branch of block.git/for-4.8/drivers and has
    received a positive build success notification from the kbuild robot
    across several configs.

    [1]: "gendisk: Generate uevent after attribute available"
    http://marc.info/?l=linux-virtualization&m=146725201522201&w=2"

    Jens Axboe
     

08 Jul, 2016

1 commit

  • The new nvme-rdma driver will need to reinitialize all the tags as part of
    the error recovery procedure (realloc the tag memory region). Add a helper
    in blk-mq for it that can iterate over all requests in a tagset to make
    this easier.

    Signed-off-by: Sagi Grimberg
    Tested-by: Ming Lin
    Reviewed-by: Stephen Bates
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Tested-by: Steve Wise
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     

07 Jul, 2016

1 commit

  • We now have implicit batching in the timer wheel. The slack API is no longer
    used, so remove it.

    Signed-off-by: Thomas Gleixner
    Cc: Alan Stern
    Cc: Andrew F. Davis
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: David S. Miller
    Cc: David Woodhouse
    Cc: Dmitry Eremin-Solenikov
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Greg Kroah-Hartman
    Cc: Jaehoon Chung
    Cc: Jens Axboe
    Cc: John Stultz
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Mathias Nyman
    Cc: Pali Rohár
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sebastian Reichel
    Cc: Ulf Hansson
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mmc@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: linux-usb@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

06 Jul, 2016

2 commits

  • The new NVMe over fabrics target will make use of this outside from a
    module.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     
  • For some protocols like NVMe over Fabrics we need to be able to send
    initialization commands to a specific queue.

    Based on an earlier patch from Christoph Hellwig.

    Signed-off-by: Ming Lin
    [hch: disallow sleeping allocation, req_op fixes]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Ming Lin
     

01 Jul, 2016

1 commit

  • get_task_ioprio() accesses the task->io_context without holding the task
    lock and thus can race with exit_io_context(), leading to a
    use-after-free. The reproducer below hits this within a few seconds on
    my 4-core QEMU VM:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>

    int main(int argc, char **argv)
    {
        pid_t pid, child;
        long nproc, i;

        /* ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); */
        syscall(SYS_ioprio_set, 1, 0, 0x6000);

        nproc = sysconf(_SC_NPROCESSORS_ONLN);

        for (i = 0; i < nproc; i++) {
            pid = fork();
            assert(pid != -1);
            if (pid == 0) {
                for (;;) {
                    pid = fork();
                    assert(pid != -1);
                    if (pid == 0) {
                        _exit(0);
                    } else {
                        child = wait(NULL);
                        assert(child == pid);
                    }
                }
            }

            pid = fork();
            assert(pid != -1);
            if (pid == 0) {
                for (;;) {
                    /* ioprio_get(IOPRIO_WHO_PGRP, 0); */
                    syscall(SYS_ioprio_get, 2, 0);
                }
            }
        }

        for (;;) {
            /* ioprio_get(IOPRIO_WHO_PGRP, 0); */
            syscall(SYS_ioprio_get, 2, 0);
        }

        return 0;
    }

    This gets us KASAN dumps like this:

    [ 35.526914] ==================================================================
    [ 35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c
    [ 35.530009] Read of size 2 by task ioprio-gpf/363
    [ 35.530009] =============================================================================
    [ 35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected
    [ 35.530009] -----------------------------------------------------------------------------

    [ 35.530009] Disabling lock debugging due to kernel taint
    [ 35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360
    [ 35.530009] ___slab_alloc+0x55d/0x5a0
    [ 35.530009] __slab_alloc.isra.20+0x2b/0x40
    [ 35.530009] kmem_cache_alloc_node+0x84/0x200
    [ 35.530009] create_task_io_context+0x2b/0x370
    [ 35.530009] get_task_io_context+0x92/0xb0
    [ 35.530009] copy_process.part.8+0x5029/0x5660
    [ 35.530009] _do_fork+0x155/0x7e0
    [ 35.530009] SyS_clone+0x19/0x20
    [ 35.530009] do_syscall_64+0x195/0x3a0
    [ 35.530009] return_from_SYSCALL_64+0x0/0x6a
    [ 35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060
    [ 35.530009] __slab_free+0x27b/0x3d0
    [ 35.530009] kmem_cache_free+0x1fb/0x220
    [ 35.530009] put_io_context+0xe7/0x120
    [ 35.530009] put_io_context_active+0x238/0x380
    [ 35.530009] exit_io_context+0x66/0x80
    [ 35.530009] do_exit+0x158e/0x2b90
    [ 35.530009] do_group_exit+0xe5/0x2b0
    [ 35.530009] SyS_exit_group+0x1d/0x20
    [ 35.530009] entry_SYSCALL_64_fastpath+0x1a/0xa4
    [ 35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080
    [ 35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001
    [ 35.530009] ==================================================================

    Fix it by grabbing the task lock while we poke at the io_context.

    Cc: stable@vger.kernel.org
    Reported-by: Dmitry Vyukov
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

28 Jun, 2016

5 commits

  • Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds)
    could result in charging just 1 ns to a cgroup submitting IO instead of the 1
    jiffie we always charged before. It is arguable what the right amount
    to charge is, but for now let's retain the old behavior of always charging at
    least one jiffie.

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
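
The "charge at least one jiffie" rule is a simple lower bound on the charged time. A minimal sketch in nanoseconds, assuming the tick rate is passed in explicitly; `jiffie_ns()` and `slice_charge_ns()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Length of one jiffie in nanoseconds for a given tick rate. */
uint64_t jiffie_ns(unsigned int hz)
{
    assert(hz > 0);
    return NSEC_PER_SEC / hz;
}

/* Charge the time actually used, but never less than one jiffie. */
uint64_t slice_charge_ns(uint64_t used_ns, unsigned int hz)
{
    uint64_t floor_ns = jiffie_ns(hz);

    return used_ns > floor_ns ? used_ns : floor_ns;
}
```

With a tick rate of 250 Hz, a slice that used 1 ns is charged 4,000,000 ns, while a slice that used 10 ms is charged its real duration.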
     
  • Commit 9a7f38c42c2 (cfq-iosched: Convert from jiffies to nanoseconds)
    broke the condition for detecting starved sync IO in
    cfq_completed_request() because rq->start_time remained in jiffies but
    we compared it with nanosecond values. This manifested as a regression
    in bonnie++ rewrite performance because we always ended up considering
    sync IO starved and thus never increased async IO queue depth.

    Since rq->start_time is used in a lot of places, converting it to ns
    values would be non-trivial. So just revert the condition in CFQ to use
    comparison with jiffies. This will lead to suboptimal results if
    cfq_fifo_expire[1] ever comes close to 1 jiffie, but so far we are
    relatively far from that with the storage used with CFQ (the default
    value is 128 ms).

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • slice_resid can be both positive and negative. Commit 9a7f38c42c2b
    (cfq-iosched: Convert from jiffies to nanoseconds) converted it from
    long to u64. Although this did not introduce any functional regression
    (the operations just overflow and the result was fine), it is certainly
    wrong and could cause issues in the future. So convert the type to the more
    appropriate s64.

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
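
The difference between the two types shows up in sign tests rather than in arithmetic. The sketch below uses hypothetical helper names and is only meant to illustrate why an unsigned type happened to work for additions yet is still the wrong choice for a value that can be negative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Addition is safe either way: unsigned wraparound yields the same bits. */
uint64_t add_resid_u64(uint64_t slice_end, uint64_t resid)
{
    return slice_end + resid;
}

/* With an unsigned type, a "negative" residual looks like a huge positive one. */
bool has_leftover_u64(uint64_t resid)
{
    return resid > 0;
}

/* With a signed type, the comparison matches the intended meaning. */
bool has_leftover_s64(int64_t resid)
{
    return resid > 0;
}
```

For a residual of -5, `add_resid_u64(1000, (uint64_t)-5)` still gives 995, but `has_leftover_u64()` reports true while `has_leftover_s64()` correctly reports false.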
     
  • Currently rq->fifo_time is unsigned long but CFQ stores a nanosecond
    timestamp in it, which would overflow on 32-bit archs. Convert it to u64
    to avoid the overflow. Since rq->fifo_time is unioned with struct
    call_single_data, this does not change the size of struct request in
    any way.

    We have to slightly fixup block/deadline-iosched.c so that comparison
    happens in the right types.

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
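
A 32-bit counter of nanoseconds wraps after roughly 4.29 seconds, which is what makes the narrower type unsafe here. The following sketch models a 32-bit `unsigned long` with `uint32_t`; the function names are hypothetical and the timestamps are arbitrary example values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What a 32-bit unsigned long keeps of a nanosecond timestamp. */
uint32_t store_ns_ulong32(uint64_t ns)
{
    return (uint32_t)ns;
}

/* Expiry check when the deadline was stored in a 32-bit field. */
bool expired_u32(uint64_t now_ns, uint64_t deadline_ns)
{
    return now_ns >= (uint64_t)store_ns_ulong32(deadline_ns);
}

/* Expiry check when the deadline is kept as a full 64-bit value. */
bool expired_u64(uint64_t now_ns, uint64_t deadline_ns)
{
    return now_ns >= deadline_ns;
}
```

With `now_ns = 4200000000` and a deadline 250 ms later, the truncated deadline wraps to a small value and the request looks expired immediately, while the 64-bit comparison gives the expected answer.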
     
  • Now that all drivers that specify a ->driverfs_dev have been converted
    to device_add_disk(), the pointer can be removed from struct gendisk.

    Cc: Jens Axboe
    Cc: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     

16 Jun, 2016

1 commit

  • In preparation for removing the ->driverfs_dev member of a gendisk, add
    an api that takes the parent device as a parameter to add_disk(). For
    now this maintains the status quo of WARN()ing on failure, but not
    returning an error code.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Bart Van Assche
    Signed-off-by: Dan Williams

    Dan Williams
     

14 Jun, 2016

3 commits


10 Jun, 2016

1 commit

  • If we're queuing REQ_PRIO IO and the task is running at an idle IO
    class, then temporarily boost the priority. This prevents livelocks
    due to priority inversion, when a low priority task is holding file
    system resources while attempting to do IO.

    An example of that is shown below. An ioniced idle task is holding
    the directory mutex, while a normal priority task is trying to do
    a directory lookup.

    [478381.198925] ------------[ cut here ]------------
    [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
    [478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080
    [478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
    [478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
    [478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
    [478381.206874] Call Trace:
    [478381.207253] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.208175] [] schedule+0x37/0x90
    [478381.208932] [] schedule_timeout+0x1dc/0x250
    [478381.209805] [] ? __blk_run_queue+0x37/0x50
    [478381.210706] [] ? ktime_get+0x45/0xb0
    [478381.211489] [] io_schedule_timeout+0xa7/0x110
    [478381.212402] [] ? prepare_to_wait+0x5b/0x90
    [478381.213280] [] bit_wait_io+0x36/0x50
    [478381.214063] [] __wait_on_bit+0x65/0x90
    [478381.214961] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.215872] [] out_of_line_wait_on_bit+0x7c/0x90
    [478381.216806] [] ? wake_atomic_t_function+0x40/0x40
    [478381.217773] [] __wait_on_buffer+0x2a/0x30
    [478381.218641] [] ext4_bread+0x57/0x70
    [478381.219425] [] __ext4_read_dirblock+0x3c/0x380
    [478381.220467] [] ext4_dx_find_entry+0x7d/0x170
    [478381.221357] [] ? find_get_entry+0x1e/0xa0
    [478381.222208] [] ext4_find_entry+0x484/0x510
    [478381.223090] [] ext4_lookup+0x52/0x160
    [478381.223882] [] lookup_real+0x1d/0x60
    [478381.224675] [] __lookup_hash+0x38/0x50
    [478381.225697] [] lookup_slow+0x45/0xab
    [478381.226941] [] link_path_walk+0x7ae/0x820
    [478381.227880] [] path_init+0xc2/0x430
    [478381.228677] [] ? security_file_alloc+0x16/0x20
    [478381.229776] [] path_openat+0x77/0x620
    [478381.230767] [] ? page_add_file_rmap+0x2e/0x70
    [478381.232019] [] do_filp_open+0x43/0xa0
    [478381.233016] [] ? creds_are_invalid+0x29/0x70
    [478381.234072] [] do_open_execat+0x70/0x170
    [478381.235039] [] do_execveat_common.isra.36+0x1b8/0x6e0
    [478381.236051] [] do_execve+0x2c/0x30
    [478381.236809] [] ? getname+0x12/0x20
    [478381.237564] [] SyS_execve+0x2e/0x40
    [478381.238338] [] stub_execve+0x6d/0xa0
    [478381.239126] ------------[ cut here ]------------
    [478381.239915] ------------[ cut here ]------------
    [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
    [478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080
    [478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
    [478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
    [478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
    [478381.248252] Call Trace:
    [478381.248630] [] schedule+0x37/0x90
    [478381.249382] [] schedule_preempt_disabled+0xe/0x10
    [478381.250465] [] __mutex_lock_slowpath+0x92/0x100
    [478381.251409] [] mutex_lock+0x1b/0x2f
    [478381.252199] [] lookup_slow+0x36/0xab
    [478381.253023] [] link_path_walk+0x7ae/0x820
    [478381.253877] [] ? try_charge+0xc1/0x700
    [478381.254690] [] path_init+0xc2/0x430
    [478381.255525] [] ? security_file_alloc+0x16/0x20
    [478381.256450] [] path_openat+0x77/0x620
    [478381.257256] [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
    [478381.258390] [] ? handle_mm_fault+0x13f3/0x1720
    [478381.259309] [] do_filp_open+0x43/0xa0
    [478381.260139] [] ? __alloc_fd+0x42/0x120
    [478381.260962] [] do_sys_open+0x13c/0x230
    [478381.261779] [] ? syscall_trace_enter_phase1+0x113/0x170
    [478381.262851] [] SyS_open+0x22/0x30
    [478381.263598] [] system_call_fastpath+0x12/0x17
    [478381.264551] ------------[ cut here ]------------
    [478381.265377] ------------[ cut here ]------------

    Signed-off-by: Jens Axboe
    Reviewed-by: Jeff Moyer

    Jens Axboe
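
The boost rule can be written as a small pure function. This is a sketch, not the CFQ code: `effective_class()` is a hypothetical name, the class values mirror the usual RT/BE/IDLE numbering, and the choice of best effort as the boost target is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>

enum io_class { IO_CLASS_RT = 1, IO_CLASS_BE = 2, IO_CLASS_IDLE = 3 };

/*
 * Class to queue a request under. Priority-flagged IO from an idle-class
 * task is lifted out of the idle class so that it cannot be starved while
 * the task holds resources other tasks are waiting for.
 */
enum io_class effective_class(enum io_class task_class, bool req_prio)
{
    assert(task_class >= IO_CLASS_RT && task_class <= IO_CLASS_IDLE);
    if (req_prio && task_class == IO_CLASS_IDLE)
        return IO_CLASS_BE;     /* assumed boost target for this sketch */
    return task_class;
}
```

In the hang shown above, the idle-class task's directory read would be queued under the boosted class, letting it finish and release the directory mutex the other task is waiting on.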
     

09 Jun, 2016

2 commits

  • If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip
    over the rest of the loop body. However, dptr is assigned later in the
    loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd
    want it for.

    NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other
    in-tree driver, but if the code's going to be there, it might as well
    work.

    Fixes: 74c450521dd8 ("blk-mq: add a 'list' parameter to ->queue_rq()")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Instead of overloading the discard support with the REQ_SECURE flag.
    Use the opportunity to rename the queue flag as well, and remove the
    dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
    claim support for secure erase.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Jun, 2016

7 commits