21 Dec, 2018

1 commit

  • [ Upstream commit 2527d99789e248576ac8081530cd4fd88730f8c7 ]

    If an IO scheduler is selected via elevator= and it doesn't match
    the driver in question wrt blk-mq support, then we fail to boot.

    The elevator= parameter is deprecated and only supported for
    non-mq devices. Augment the elevator lookup API so that we
    pass in if we're looking for an mq capable scheduler or not,
    so that we only ever return a valid type for the queue in
    question.

    Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
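The augmented lookup can be modeled in plain C. This is a userspace sketch only; the table contents, the `elv_find` name, and the `uses_mq` field are hypothetical stand-ins, not the kernel's actual API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the fix: each registered elevator records whether
 * it supports blk-mq, and the lookup only returns a match when that
 * capability agrees with the queue being set up. */
struct elv_type {
    const char *name;
    bool uses_mq;
};

static const struct elv_type elv_table[] = {
    { "mq-deadline", true  },
    { "cfq",         false },
};

static const struct elv_type *elv_find(const char *name, bool want_mq)
{
    for (size_t i = 0; i < sizeof(elv_table) / sizeof(elv_table[0]); i++) {
        if (strcmp(elv_table[i].name, name) == 0 &&
            elv_table[i].uses_mq == want_mq)
            return &elv_table[i];
    }
    return NULL; /* no valid type for this queue: caller falls back
                  * to the default instead of failing the boot */
}
```

A legacy name passed to an mq queue (or vice versa) now simply misses, so the caller can ignore the deprecated parameter rather than abort.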
     

29 Aug, 2017

1 commit

  • There is a race between changing I/O elevator and request_queue removal
    which can trigger the warning in kobject_add_internal. A program can
    use sysfs to request a change of elevator at the same time another task
    is unregistering the request_queue the elevator would be attached to.
    The elevator's kobject will then attempt to be connected to the
    request_queue in the object tree when the request_queue has just been
    removed from sysfs. This triggers the warning in kobject_add_internal
    as the request_queue no longer has a sysfs directory:

    kobject_add_internal failed for iosched (error: -2 parent: queue)
    ------------[ cut here ]------------
    WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 kobject_add_internal+0x103/0x2d0

    To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
    changing the elevator and use the request_queue's sysfs_lock to
    serialize between clearing the flag and the elevator testing the flag.

    Signed-off-by: David Jeffery
    Tested-by: Ming Lei
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    David Jeffery
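The serialization described above can be sketched in userspace C, with a pthread mutex standing in for the request_queue's sysfs_lock; `struct queue`, `switch_elevator`, and `unregister_queue` are hypothetical names for the sketch:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the fix: the elevator-change path re-checks a
 * "registered" flag under the same lock the unregister path holds while
 * clearing it, so a kobject is never added under a parent that has
 * already left sysfs. */
struct queue {
    pthread_mutex_t sysfs_lock;
    bool registered;
};

static int switch_elevator(struct queue *q)
{
    int ret = 0;

    pthread_mutex_lock(&q->sysfs_lock);
    if (!q->registered)
        ret = -ENOENT;  /* queue already torn down: bail out */
    /* ...otherwise it is safe to register the elevator kobject here... */
    pthread_mutex_unlock(&q->sysfs_lock);
    return ret;
}

static void unregister_queue(struct queue *q)
{
    pthread_mutex_lock(&q->sysfs_lock);
    q->registered = false;  /* serialized against switch_elevator() */
    pthread_mutex_unlock(&q->sysfs_lock);
}
```

Because both paths take the same lock, the flag test and the kobject registration form one atomic step with respect to teardown.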
     

22 Jun, 2017

1 commit

  • This patch suppresses gcc 7 warnings about falling through in switch
    statements when building with W=1. From the gcc documentation: The
    -Wimplicit-fallthrough=3 warning is enabled by -Wextra. See also
    https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Warning-Options.html.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

10 May, 2017

1 commit

  • We warn twice for switching to a scheduler, if that switch fails.
    As we also report the failure in the return value to the
    sysfs write, remove the dmesg induced failures.

    Keep the failure print for warning to switch to the kconfig
    selected IO scheduler, as we can't report errors for that in
    any other way.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 May, 2017

1 commit

  • After queue is frozen, no request in this queue can be in use at all, so
    there can't be any .queue_rq() running on this queue. It isn't
    necessary to call blk_mq_quiesce_queue() any more, so remove it in both
    elevator_switch_mq() and blk_mq_update_nr_requests().

    Cc: Bart Van Assche
    Signed-off-by: Ming Lei

    Fixed up the description a bit.

    Signed-off-by: Jens Axboe

    Ming Lei
     

02 May, 2017

2 commits

  • Since commit 84253394927c ("remove the mg_disk driver") removed the
    only caller of elevator_change(), also remove the elevator_change()
    function itself.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Markus Trippelsdorf
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Pull block layer updates from Jens Axboe:

    - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
    was initially a fork of CFQ, but subsequently changed to implement
    fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
    to be used on desktop type single drives, providing good fairness.
    From Paolo.

    - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
    using a scalable token based algorithm that throttles IO based on
    live completion IO stats, similarly to blk-wbt. From Omar.

    - A series from Jan, moving users to separately allocated backing
    devices. This continues the work of separating backing device
    lifetimes, solving various problems with hot removal.

    - A series of updates for lightnvm, mostly from Javier. Includes a
    'pblk' target that exposes an open channel SSD as a physical block
    device.

    - A series of fixes and improvements for nbd from Josef.

    - A series from Omar, removing queue sharing between devices on mostly
    legacy drivers. This helps us clean up other bits, if we know that a
    queue only has a single device backing. This has been overdue for
    more than a decade.

    - Fixes for the blk-stats, and improvements to unify the stats and user
    windows. This both improves blk-wbt, and enables other users to
    register a need to receive IO stats for a device. From Omar.

    - blk-throttle improvements from Shaohua. This provides a scalable
    framework for implementing scalable prioritization - particularly for
    blk-mq, but applicable to any type of block device. The interface is
    marked experimental for now.

    - Bucketized IO stats for IO polling from Stephen Bates. This improves
    efficiency of polled workloads in the presence of mixed block size
    IO.

    - A few fixes for opal, from Scott.

    - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
    From a variety of folks, mostly Sagi and James Smart.

    - A series from Bart, improving our exposed info and capabilities from
    the blk-mq debugfs support.

    - A series from Christoph, cleaning up how we handle WRITE_ZEROES.

    - A series from Christoph, cleaning up the block layer handling of how
    we track errors in a request. On top of being a nice cleanup, it also
    shrinks the size of struct request a bit.

    - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
    never used by platforms, and the latter has outlived its usefulness.

    - Various little bug fixes and cleanups from a wide variety of folks.

    * 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
    block: hide badblocks attribute by default
    blk-mq: unify hctx delay_work and run_work
    block: add kblock_mod_delayed_work_on()
    blk-mq: unify hctx delayed_run_work and run_work
    nbd: fix use after free on module unload
    MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
    blk-mq-sched: alloate reserved tags out of normal pool
    mtip32xx: use runtime tag to initialize command header
    scsi: Implement blk_mq_ops.show_rq()
    blk-mq: Add blk_mq_ops.show_rq()
    blk-mq: Show operation, cmd_flags and rq_flags names
    blk-mq: Make blk_flags_show() callers append a newline character
    blk-mq: Move the "state" debugfs attribute one level down
    blk-mq: Unregister debugfs attributes earlier
    blk-mq: Only unregister hctxs for which registration succeeded
    blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
    blk-mq: Let blk_mq_debugfs_register() look up the queue name
    blk-mq: Register /queue/mq after having registered /queue
    ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
    ide-pm: always pass 0 error to __blk_end_request_all
    ..

    Linus Torvalds
     

20 Apr, 2017

1 commit

  • If a driver declares that it doesn't support an io scheduler via
    BLK_MQ_F_NO_SCHED, we should not allow changing or showing the
    available io schedulers.

    This patch adds a check to enforce this behaviour.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
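A userspace sketch of the check, with a made-up flag value and hypothetical function names (the real sysfs show/store helpers differ):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define BLK_MQ_F_NO_SCHED (1u << 0)  /* hypothetical bit value for the sketch */

/* Both the sysfs "store" (switch scheduler) and "show" (list schedulers)
 * paths bail out early when the driver opted out of scheduling. */
static int elv_iosched_store(unsigned int flags, const char *name)
{
    (void)name;  /* the requested name is never inspected in this sketch */
    if (flags & BLK_MQ_F_NO_SCHED)
        return -EINVAL;  /* driver said no scheduler: reject the switch */
    /* ...a normal scheduler switch would happen here... */
    return 0;
}

static int elv_iosched_show(unsigned int flags, char *buf, size_t len)
{
    if (flags & BLK_MQ_F_NO_SCHED)
        return snprintf(buf, len, "none\n");  /* nothing selectable */
    return snprintf(buf, len, "[mq-deadline] kyber\n");
}
```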
     

19 Apr, 2017

1 commit

  • When CFQ is used as an elevator, it disables writeback throttling
    because they don't play well together. Later when a different elevator
    is chosen for the device, writeback throttling doesn't get enabled
    again as it should. Make sure CFQ enables writeback throttling (if it
    should be enabled by default) when we switch from it to another IO
    scheduler.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

07 Apr, 2017

2 commits

  • In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
    back to the original scheduler. However, at this point, we've already
    torn down the original scheduler's tags, so this causes a crash. Doing
    the fallback like the legacy elevator path is much harder for mq, so fix
    it by just falling back to none, instead.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
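The fallback logic can be sketched as follows; `struct sched` and the shape of `elevator_switch_mq` here are hypothetical simplifications of the real code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the mq fallback: if initializing the new
 * scheduler fails, the old scheduler's tags are already torn down, so
 * the only safe fallback is "none" rather than the previous scheduler. */
struct sched {
    const char *name;
    int (*init)(void);
};

static int failing_init(void) { return -1; }
static int ok_init(void)      { return 0; }

static const char *elevator_switch_mq(struct sched *new)
{
    if (new && new->init() != 0)
        new = NULL;  /* fall back to none, not the old scheduler */
    return new ? new->name : "none";
}
```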
     
  • Preparation cleanup for the next couple of fixes, push
    blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

03 Mar, 2017

1 commit

  • For legacy scheduling, we always call ioc_exit_icq() with both the
    ioc and queue lock held. This poses a problem for blk-mq with
    scheduling, since the queue lock isn't what we use in the scheduler.
    And since we don't need the queue lock held for ioc exit there,
    don't grab it and leave any extra locking up to the blk-mq scheduler.

    Reported-by: Paolo Valente
    Tested-by: Paolo Valente
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Feb, 2017

1 commit

  • Avoid that sparse reports the following complaints:

    block/elevator.c:541:29: warning: incorrect type in assignment (different base types)
    block/elevator.c:541:29: expected bool [unsigned] [usertype] next_sorted
    block/elevator.c:541:29: got restricted req_flags_t

    block/blk-mq-debugfs.c:92:54: warning: cast from restricted req_flags_t

    Signed-off-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

01 Feb, 2017

1 commit

  • This can be used to check for fs vs non-fs requests and basically
    removes all BLOCK_PC-specific knowledge from the block layer, as well
    as preparing for the removal of the cmd_type field in struct request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Jan, 2017

3 commits

  • Add Kconfig entries to manage what devices get assigned an MQ
    scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
    The latter is useful for admin type queues that still allocate a blk-mq
    queue and tag set, but aren't used for normal IO.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     
  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment that with the scheduler flagging support for
    the blk-mq interface, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     
  • Prep patch for adding MQ ops as well, since doing anon unions with
    named initializers doesn't work on older compilers.

    Signed-off-by: Jens Axboe
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

28 Oct, 2016

2 commits

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
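A minimal sketch of such an encoding, assuming an 8-bit op field in the low bits; the field widths and macro names here are illustrative, not the kernel's definitive layout:

```c
#include <assert.h>
#include <stdint.h>

/* Operation in the low bits, flag bits above it, packed into a single
 * 32-bit value so bio and request can share the encoding and reading
 * the op no longer requires shifting it back down. */
#define REQ_OP_BITS 8
#define REQ_OP_MASK ((1u << REQ_OP_BITS) - 1)

static inline uint32_t make_cmd_flags(uint32_t op, uint32_t flags)
{
    return op | (flags << REQ_OP_BITS);
}

static inline uint32_t req_op(uint32_t cmd_flags)
{
    return cmd_flags & REQ_OP_MASK;  /* a plain mask extracts the op */
}
```

Placing the op first leaves the upper bits free for new flags without any renumbering of existing operations.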
     
  • A lot of the REQ_* flags are only used on struct requests, and only of
    use to the block layer and a few drivers that dig into struct request
    internals.

    This patch adds a new req_flags_t rq_flags field to struct request for
    them, and thus dramatically shrinks the number of common request
    flags. It
    also removes the unfortunate situation where we have to fit the fields
    from the same enum into 32 bits for struct bio and 64 bits for
    struct request.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

16 Aug, 2016

1 commit

  • Commit 288dab8a35a0 ("block: add a separate operation type for secure
    erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
    all the places REQ_OP_DISCARD was being used to mean either. Fix those.

    Signed-off-by: Adrian Hunter
    Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
    Signed-off-by: Jens Axboe

    Adrian Hunter
     

21 Jul, 2016

1 commit

  • Before merging a bio into an existing request, io scheduler is called to
    get its approval first. However, the requests that come from a plug
    flush may get merged by block layer without consulting with io
    scheduler.

    In case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low weight cgroup's request, the high weight
    cgroup will now depend on the low weight cgroup to get scheduled. If the
    high weight cgroup needs that io request to complete before submitting
    more requests, then it will also lose its timeslice.

    Following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
      mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
      if [ -d $i ]; then
        rmdir $i
      fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
      echo ---- $i ----
      cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

08 Jun, 2016

1 commit

  • This patch converts the elevator code to use separate variables
    for the operation and flags, and to check req_op for the REQ_OP.

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     

22 Oct, 2015

1 commit

  • After bio splitting was introduced, a bio can be split and marked
    NOMERGE because it is too fat to be merged, so check bio_mergeable()
    earlier to avoid trying to merge it unnecessarily.

    Signed-off-by: Ming Lei
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Ming Lei
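The early-out can be sketched like so; the `BIO_NOMERGE` bit value and the helper names are hypothetical stand-ins for the kernel's flag machinery:

```c
#include <assert.h>
#include <stdbool.h>

#define BIO_NOMERGE (1u << 0)  /* hypothetical flag bit for the sketch */

/* A split bio carries the NOMERGE flag; checking bio_mergeable() up
 * front skips the whole merge attempt for it. */
static bool bio_mergeable(unsigned int bi_flags)
{
    return !(bi_flags & BIO_NOMERGE);
}

static bool elv_attempt_merge(unsigned int bi_flags)
{
    if (!bio_mergeable(bi_flags))
        return false;  /* bail before any expensive merge-candidate lookup */
    /* ...hash/rbtree merge candidates would be probed here... */
    return true;
}
```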
     

26 Jun, 2015

2 commits

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last weeks writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     
  • Pull core block IO update from Jens Axboe:
    "Nothing really major in here, mostly a collection of smaller
    optimizations and cleanups, mixed with various fixes. In more detail,
    this contains:

    - Addition of policy specific data to blkcg for block cgroups. From
    Arianna Avanzini.

    - Various cleanups around command types from Christoph.

    - Cleanup of the suspend block I/O path from Christoph.

    - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

    - Eliminating atomic inc/dec of both remaining IO count and reference
    count in a bio. From me.

    - Fixes for SG gap and chunk size support for data-less (discards)
    IO, so we can merge these better. From me.

    - Small restructuring of blk-mq shared tag support, freeing drivers
    from iterating hardware queues. From Keith Busch.

    - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
    IOPS mode the default for non-rotational storage"

    * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
    cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
    cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
    cfq-iosched: move group scheduling functions under ifdef
    cfq-iosched: fix the setting of IOPS mode on SSDs
    blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
    block, cgroup: implement policy-specific per-blkcg data
    block: Make CFQ default to IOPS mode on SSDs
    block: add blk_set_queue_dying() to blkdev.h
    blk-mq: Shared tag enhancements
    block: don't honor chunk sizes for data-less IO
    block: only honor SG gap prevention for merges that contain data
    block: fix returnvar.cocci warnings
    block, dm: don't copy bios for request clones
    block: remove management of bi_remaining when restoring original bi_end_io
    block: replace trylock with mutex_lock in blkdev_reread_part()
    block: export blkdev_reread_part() and __blkdev_reread_part()
    suspend: simplify block I/O handling
    block: collapse bio bit space
    block: remove unused BIO_RW_BLOCK and BIO_EOF flags
    block: remove BIO_EOPNOTSUPP
    ...

    Linus Torvalds
     

10 Jun, 2015

1 commit

  • A previous commit wanted to make CFQ default to IOPS mode on
    non-rotational storage, however it did so when the queue was
    initialized and the non-rotational flag is only set later on
    in the probe.

    Add an elevator hook that gets called off the add_disk() path,
    at that point we know that feature probing has finished, and
    we can reliably check for the various flags that drivers can
    set.

    Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
    Tested-by: Romain Francoise
    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Apr, 2015

1 commit

  • The issue is described by the call path below:
        ->elevator_init
          ->elevator_init_fn
            ->{cfq,deadline,noop}_init_queue
              ->elevator_alloc
                ->kzalloc_node
    If kzalloc_node fails, elevator_alloc puts the module; the failure then
    propagates out of elevator_init_fn, and elevator_init puts the module
    again.

    Remove the elevator_put call from the error path of elevator_alloc to
    avoid this double release.

    Signed-off-by: Chao Yu
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Chao Yu
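A refcount model of the bug and the fix, with hypothetical helpers in place of the real module get/put calls:

```c
#include <assert.h>

/* Hypothetical model: elevator_alloc() used to drop the module
 * reference on its own failure, and elevator_init() dropped it again
 * when the error propagated, underflowing the count. The fix keeps the
 * put in one place only. */
static int module_refs;

static void elevator_get_ref(void) { module_refs++; }
static void elevator_put_ref(void) { module_refs--; }

static int elevator_alloc(int kzalloc_fails)
{
    if (kzalloc_fails)
        return -1;  /* fixed: no elevator_put_ref() on this path anymore */
    return 0;
}

static int elevator_init(int kzalloc_fails)
{
    elevator_get_ref();
    if (elevator_alloc(kzalloc_fails) != 0) {
        elevator_put_ref();  /* the single release on the error path */
        return -1;
    }
    return 0;
}
```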
     

04 Dec, 2014

1 commit

  • After commit b2b49ccbdd54 (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
    selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks
    depending on CONFIG_PM_RUNTIME may now be changed to depend on
    CONFIG_PM.

    Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core.

    Reviewed-by: Aaron Lu
    Acked-by: Jens Axboe
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

24 Oct, 2014

1 commit

  • While compiling, the integer err was flagged as a set but unused
    variable. elevator_init_fn can be cfq_init_queue, deadline_init_queue,
    or noop_init_queue, and all three of these functions return -ENOMEM if
    they fail to allocate the queue, so we should return that error code
    rather than always returning 0.

    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Jens Axboe

    Sudip Mukherjee
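A sketch of the fix, with stand-in init functions; the real elevator_init_fn signatures differ:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical sketch: propagate the scheduler's init error instead of
 * discarding it and returning 0 unconditionally. */
static int init_queue_enomem(void) { return -ENOMEM; }
static int init_queue_ok(void)     { return 0; }

static int elevator_init(int (*init_fn)(void))
{
    int err = init_fn();  /* previously assigned but never returned */
    return err;           /* the fix: hand the error back to the caller */
}
```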
     
