09 Jul, 2018

3 commits

  • It won't be efficient to dequeue request one by one from sw queue,
    but we have to do that when queue is busy for better merge performance.

    This patch takes the Exponential Weighted Moving Average(EWMA) to figure
    out if queue is busy, then only dequeue request one by one from sw queue
    when queue is busy.

    Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")
    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Reported-by: Kashyap Desai
    Tested-by: Kashyap Desai
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Exclude zoned block device members from struct request_queue for
    CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
    the code that uses these struct request_queue members if
    CONFIG_BLK_DEV_ZONED != n.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Damien Le Moal
    Cc: Matias Bjorling
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Since the implementation of blk_queue_nr_zones() is trivial and since
    it only has a single caller, inline this function.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Damien Le Moal
    Cc: Matias Bjorling
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Jun, 2018

1 commit


29 May, 2018

1 commit

  • This patch simplifies the timeout handling by relying on the request
    reference counting to ensure the iterator is operating on an inflight
    and truly timed out request. Since the reference counting prevents the
    tag from being reallocated, the block layer no longer needs to prevent
    drivers from completing their requests while the timeout handler is
    operating on it: a driver completing a request is allowed to proceed to
    the next state without additional syncronization with the block layer.

    This also removes any need for generation sequence numbers since the
    request lifetime is prevented from being reallocated as a new sequence
    while timeout handling is operating on it.

    To enables this a refcount is added to struct request so that request
    users can be sure they're operating on the same request without it
    changing while they're processing it. The request's tag won't be
    released for reuse until both the timeout handler and the completion
    are done with it.

    Signed-off-by: Keith Busch
    [hch: slight cleanups, added back submission side hctx lock, use cmpxchg
    for completions]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

10 Apr, 2018

1 commit

  • No driver uses this interface any more, so remove it.

    Cc: Stefan Haberland
    Tested-by: Christian Borntraeger
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

18 Mar, 2018

1 commit

  • Since commit 634f9e4631a8 ("blk-mq: remove REQ_ATOM_COMPLETE usages
    from blk-mq") blk_rq_is_complete() only reports whether or not a
    request has completed for legacy queues. Hence modify the
    blk-mq-debugfs code such that it shows the blk-mq request state
    again.

    Fixes: 634f9e4631a8 ("blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq")
    Signed-off-by: Bart Van Assche
    Cc: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

01 Mar, 2018

2 commits

  • When debugging the ZBC code in the mq-deadline scheduler it is very
    important to know which zones are locked and which zones are not
    locked. Hence this patch that exports the zone locking information
    through debugfs.

    Cc: Omar Sandoval
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Reviewed-by: Damien Le Moal
    Tested-by: Damien Le Moal
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Make sure that the queue show and store methods are contiguous and
    also that these appear in alphabetical order.

    Signed-off-by: Bart Van Assche
    Cc: Omar Sandoval
    Cc: Damien Le Moal
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

25 Jan, 2018

1 commit

  • Attributes that only implement .seq_ops are read-only, any write to
    them should be rejected. But currently kernel would crash when
    writing to such debugfs entries, e.g.

    chmod +w /sys/kernel/debug/block//requeue_list
    echo 0 > /sys/kernel/debug/block//requeue_list
    chmod -w /sys/kernel/debug/block//requeue_list

    Fix it by returning -EPERM in blk_mq_debugfs_write() when writing to
    such attributes.

    Cc: Ming Lei
    Signed-off-by: Eryu Guan
    Signed-off-by: Jens Axboe

    Eryu Guan
     

13 Jan, 2018

1 commit

  • Looking at debug output, we see:

    ./000000009ddfa913/requeue_list:000000009646711c {.op=READ, .state=idle, gen=0x1
    18, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete
    =0, .tag=-1, .internal_tag=217}

    Note the '1' between SORTED and SOFTBARRIER - that's because no name
    as defined for RQF_STARTED. Fixed that.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jan, 2018

3 commits


10 Jan, 2018

1 commit

  • After the recent updates to use generation number and state based
    synchronization, we can easily replace REQ_ATOM_STARTED usages by
    adding an extra state to distinguish completed but not yet freed
    state.

    Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
    blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users
    left and is removed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the whole on blk/scsi-mq queue
    quescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tupple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

11 Nov, 2017

2 commits

  • This flag will be used in the next patch to let the block layer
    core know whether or not a SCSI request queue has been quiesced.
    A quiesced SCSI queue namely only processes RQF_PREEMPT requests.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: Martin Steigerwald
    Tested-by: Oleksandr Natalenko
    Cc: Ming Lei
    Cc: Christoph Hellwig
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch attempts to make the case of hctx re-running on driver tag
    failure more robust. Without this patch, it's pretty easy to trigger a
    stall condition with shared tags. An example is using null_blk like
    this:

    modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1

    which sets up 4 devices, sharing the same tag set with a depth of 1.
    Running a fio job ala:

    [global]
    bs=4k
    rw=randread
    norandommap
    direct=1
    ioengine=libaio
    iodepth=4

    [nullb0]
    filename=/dev/nullb0
    [nullb1]
    filename=/dev/nullb1
    [nullb2]
    filename=/dev/nullb2
    [nullb3]
    filename=/dev/nullb3

    will inevitably end with one or more threads being stuck waiting for a
    scheduler tag. That IO is then stuck forever, until someone else
    triggers a run of the queue.

    Ensure that we always re-run the hardware queue, if the driver tag we
    were waiting for got freed before we added our leftover request entries
    back on the dispatch list.

    Reviewed-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

06 Oct, 2017

1 commit


04 Oct, 2017

1 commit

  • In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched
    directories if a default scheduler was already configured by
    blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do
    the same for the device-wide sched directory. Fix it.

    Fixes: d332ce091813 ("blk-mq-debugfs: allow schedulers to register debugfs attributes")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device remova. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

25 Aug, 2017

1 commit

  • The symbolic constants QUEUE_FLAG_SCSI_PASSTHROUGH, QUEUE_FLAG_QUIESCED
    and REQ_NOWAIT are missing from blk-mq-debugfs.c. Add these to
    blk-mq-debugfs.c such that these appear as names in debugfs instead of
    as numbers.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Bart Van Assche
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

18 Aug, 2017

1 commit


10 Aug, 2017

1 commit


28 Jun, 2017

1 commit

  • Useful to verify that things are working the way they should.
    Reading the file will return number of kb written with each
    write hint. Writing the file will reset the statistics. No care
    is taken to ensure that we don't race on updates.

    Drivers will write to q->write_hints[] if they handle a given
    write hint.

    Reviewed-by: Andreas Dilger
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Jun, 2017

4 commits

  • Running a queue causes the block layer to examine the per-CPU and
    hw queues but not the requeue list. Hence add a 'kick' operation
    that also examines the requeue list.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Eduardo Valentin
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Requests that got stuck in a block driver are neither on
    blk_mq_ctx.rq_list nor on any hw dispatch queue. Make these
    visible in debugfs through the "busy" attribute.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Eduardo Valentin
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • When verifying whether or not a blk-mq driver forgot to kick the
    requeue list after having requeued a request it is important to
    be able to verify the contents of the requeue list. Hence export
    that list through debugfs.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Eduardo Valentin
    Cc: Christoph Hellwig
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • When analyzing e.g. queue lockups it is important to know whether
    or not a request has already been started. Hence also show the
    atomic request flags.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Eduardo Valentin
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

04 May, 2017

11 commits

  • Expose the fifo lists, cached next requests, batching state, and
    dispatch list. It'd also be possible to add the sorted lists, but there
    aren't already seq_file helpers for rbtrees.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Expose the domain token pools, asynchronous sbitmap depth, domain
    request lists, and batching state.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This provides the infrastructure for schedulers to expose their internal
    state through debugfs. We add a list of queue attributes and a list of
    hctx attributes to struct elevator_type and wire them up when switching
    schedulers.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke

    Add missing seq_file.h header in blk-mq-debugfs.h

    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Originally, I tied debugfs registration/unregistration together with
    sysfs. There's no reason to do this, and it's getting in the way of
    letting schedulers define their own debugfs attributes. Instead, tie the
    debugfs registration to the lifetime of the structures themselves.

    The saner lifetimes mean we can also get rid of the extra mq directory
    and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
    just nvme0n1/hctx0/tags.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Preparation for adding more declarations.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • In commit e869b5462f83 ("blk-mq: Unregister debugfs attributes
    earlier"), we shuffled the debugfs cleanup around so that the "state"
    attribute was removed before we freed the blk-mq data structures.
    However, later changes are going to undo that, so we need to explicitly
    disallow running a dead queue.

    [Omar: rebased and updated commit message]
    Signed-off-by: Omar Sandoval
    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • A large part of blk-mq-debugfs.c is file_operations and seq_file
    boilerplate. This sucks as is but will suck even more when schedulers
    can define their own debugfs entries. Factor it all out into a single
    blk_mq_debugfs_fops which multiplexes as needed. We store the
    request_queue, blk_mq_hw_ctx, or blk_mq_ctx in the parent directory
    dentry, which is kind of hacky, but it works.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • It's not clear what these numbered directories represent unless you
    consult the code. We're about to get rid of the intermediate "mq"
    directory, so these would be even more confusing without that context.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Slightly more readable, plus we also strip leading spaces.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • blk_queue_flags_store() currently truncates and returns a short write if
    the operation being written is too long. This can give us weird results,
    like here:

    $ echo "run bar"
    echo: write error: invalid argument
    $ dmesg
    [ 1103.075435] blk_queue_flags_store: unsupported operation bar. Use either 'run' or 'start'

    Instead, return an error if the user does this. While we're here, make
    the argument names consistent with everywhere else in this file.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Make sure the spelled out flag names match the definition. This also
    adds a missing hctx state, BLK_MQ_S_START_ON_RUN, and a missing
    cmd_flag, __REQ_NOUNMAP.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval