23 Sep, 2020

1 commit

  • [ Upstream commit e8a8a185051a460e3eb0617dca33f996f4e31516 ]

    Yang Yang reported the following crash caused by requeueing a flush
    request in Kyber:

    [ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
    ...
    [ 2.517468] pc : clear_bit+0x18/0x2c
    [ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
    [ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
    ...
    [ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
    [ 2.517602] Call trace:
    [ 2.517606] clear_bit+0x18/0x2c
    [ 2.517619] kyber_finish_request+0x74/0x80
    [ 2.517627] blk_mq_requeue_request+0x3c/0xc0
    [ 2.517637] __scsi_queue_insert+0x11c/0x148
    [ 2.517640] scsi_softirq_done+0x114/0x130
    [ 2.517643] blk_done_softirq+0x7c/0xb0
    [ 2.517651] __do_softirq+0x208/0x3bc
    [ 2.517657] run_ksoftirqd+0x34/0x60
    [ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
    [ 2.517667] kthread+0x110/0x120
    [ 2.517669] ret_from_fork+0x10/0x18

    This happens because Kyber doesn't track flush requests, so
    kyber_finish_request() reads a garbage domain token. Only call the
    scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for
    the finish_request() hook in blk_mq_free_request()). Now that we're
    handling it in blk-mq, also remove the check from BFQ.
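
    A minimal sketch of the resulting guard (shape of the upstream fix;
    treat exact placement and member names as illustrative):

        static inline void blk_mq_sched_requeue_request(struct request *rq)
        {
                struct request_queue *q = rq->q;
                struct elevator_queue *e = q->elevator;

                /* flush requests were never handed to the scheduler, so
                 * don't hand them back to it on requeue */
                if ((rq->rq_flags & RQF_ELVPRIV) && e &&
                    e->type->ops.requeue_request)
                        e->type->ops.requeue_request(rq);
        }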

    Reported-by: Yang Yang
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Omar Sandoval
     

21 Jun, 2019

1 commit

  • We only need the number of segments in the blk-mq submission path.
    Remove the bi_phys_segments field from struct bio, and return the
    segment count from a variant of blk_queue_split instead, so that it
    can be passed as an argument to the functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove the
    pointless struct request_queue argument from any of the functions
    that had one and grew a nr_segs argument.
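
    A sketch of the resulting calling convention (function and parameter
    names as used by this series; illustrative):

        unsigned int nr_segs;

        /* split the bio if needed and get the segment count in one pass */
        __blk_queue_split(q, &bio, &nr_segs);

        /* nr_segs is now handed down explicitly instead of living in a
         * struct bio field that would have to be recounted */
        blk_mq_sched_bio_merge(q, bio, nr_segs);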

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 Jun, 2019

1 commit

  • In theory, the IO scheduler belongs to the request queue, and the
    request pool of sched tags belongs to the request queue too.

    However, the current tags allocation interfaces are re-used for both
    driver tags and sched tags. Driver tags are definitely host wide and
    don't belong to any request queue, and the same holds for their
    request pool. So we need the tagset instance for freeing requests of
    sched tags.

    Meanwhile, blk_mq_free_tag_set() often follows blk_cleanup_queue() in
    the non-BLK_MQ_F_TAG_SHARED case, which requires the request pool of
    sched tags to be freed before blk_mq_free_tag_set() is called.

    Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
    moved blk_exit_queue into __blk_release_queue to simplify the fast
    path in generic_make_request(), which then causes an oops when the
    requests of sched tags are freed in __blk_release_queue().

    Fix the above issue by moving the freeing of the sched tags request
    pool into blk_cleanup_queue(). This is safe because the queue has
    been frozen and there are no in-queue requests at that time. Freeing
    the sched tags themselves has to stay in the queue's release handler,
    because un-completed dispatch activity might still refer to them.
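
    A sketch of the resulting split (helper name per this fix; treat as
    illustrative):

        /* in blk_cleanup_queue(), once the queue is frozen and drained: */
        if (queue_is_mq(q))
                blk_mq_sched_free_requests(q);  /* sched tags' request pool */

        /* the sched tags themselves remain freed from the queue's
         * release handler, since dispatch may still reference them */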

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
    Tested-by: Yi Zhang
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

18 Dec, 2018

1 commit

  • For a zoned block device using mq-deadline, if a write request for a
    zone is received while another write was already dispatched for the same
    zone, dd_dispatch_request() will return NULL and the newly inserted
    write request is kept in the scheduler queue waiting for the ongoing
    zone write to complete. With this behavior, when no other request has
    been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
    and blk_mq_sched_mark_restart_hctx() is not called. This in turn
    causes the blk_mq_sched_restart() call from __blk_mq_free_request()
    to not run the queue when the already dispatched write request
    completes. The newly inserted request stays stuck in the scheduler
    queue until eventually another request is submitted.

    This problem does not affect SCSI disks, as the SCSI stack handles
    queue restarts on request completion. However, it can be triggered
    with the null_blk driver with zoned mode enabled.

    Fix this by always requesting a queue restart in dd_dispatch_request()
    if no request was dispatched while WRITE requests are queued.
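
    A sketch of the fix in dd_dispatch_request() (shape inferred from
    this description; treat the details as illustrative):

        static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
        {
                struct deadline_data *dd = hctx->queue->elevator->elevator_data;
                struct request *rq;

                spin_lock(&dd->lock);
                rq = __dd_dispatch_request(dd);
                /*
                 * Nothing was dispatched but writes are queued (e.g. held
                 * back by zone locking): mark the hctx so a completion
                 * reruns the queue instead of leaving the write stranded.
                 */
                if (!rq && blk_queue_is_zoned(hctx->queue) &&
                    !list_empty(&dd->fifo_list[WRITE]))
                        blk_mq_sched_mark_restart_hctx(hctx);
                spin_unlock(&dd->lock);

                return rq;
        }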

    Fixes: 5700f69178e9 ("mq-deadline: Introduce zone locking support")
    Cc:
    Signed-off-by: Damien Le Moal

    Add missing export of blk_mq_sched_restart()

    Signed-off-by: Jens Axboe

    Damien Le Moal
     

20 Nov, 2018

1 commit

  • bio->bi_ioc is never set so always NULL. Remove references to it in
    bio_disassociate_task() and in rq_ioc() and delete this field from
    struct bio. With this change, rq_ioc() always returns
    current->io_context without the need for a bio argument. Further
    simplify the code and make it more readable by removing this helper
    as well, which in turn allows simplifying blk_mq_sched_assign_ioc()
    by removing its bio argument.
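
    For reference, a sketch of the helper being removed; with bi_ioc
    always NULL it reduces to current->io_context:

        static inline struct io_context *rq_ioc(struct bio *bio)
        {
                /* bio->bi_ioc is never set, so this branch is dead */
                if (bio && bio->bi_ioc)
                        return bio->bi_ioc;
                return current->io_context;
        }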

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Adam Manzanares
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

08 Nov, 2018

2 commits

  • It's somewhat strange to have a list insertion function that
    relies on the fact that the caller has mapped things correctly.
    Pass in the hardware queue directly for insertion, which makes
    for a much cleaner interface and implementation.
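
    A sketch of the interface change (signatures inferred from this
    description; illustrative):

        /* before: hctx re-derived from (q, ctx), trusting the caller */
        void blk_mq_sched_insert_requests(struct request_queue *q,
                                          struct blk_mq_ctx *ctx,
                                          struct list_head *list,
                                          bool run_queue_async);

        /* after: the hardware queue is passed in explicitly */
        void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
                                          struct blk_mq_ctx *ctx,
                                          struct list_head *list,
                                          bool run_queue_async);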

    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a remnant of when we had ops for both SQ and MQ
    schedulers. Now it's just MQ, so get rid of the union.
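
    A sketch of the change in struct elevator_type (layout per this
    description):

        /* before: a leftover from having both SQ and MQ scheduler ops */
        union {
                struct elevator_ops sq;
                struct elevator_mq_ops mq;
        } ops;

        /* after: MQ is the only case left */
        struct elevator_mq_ops ops;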

    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Aug, 2018

1 commit

  • Currently, when updating nr_hw_queues, the IO scheduler's init_hctx
    is invoked before the mapping between ctx and hctx has been adapted
    correctly by blk_mq_map_swqueue. An IO scheduler's init_hctx (kyber)
    may depend on this mapping, get a wrong result, and finally panic.
    A simple way to fix this is to switch the IO scheduler to 'none'
    before updating nr_hw_queues, and then switch it back afterwards,
    as sketched below. blk_mq_sched_init_/exit_hctx are removed since
    nobody uses them any more.
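
    A sketch of the resulting ordering (helper names per this patch;
    treat as illustrative):

        /* in __blk_mq_update_nr_hw_queues(): */
        list_for_each_entry(q, &set->tag_list, tag_set_list)
                blk_mq_elv_switch_none(&head, q);  /* detach schedulers */

        blk_mq_update_queue_map(set);
        list_for_each_entry(q, &set->tag_list, tag_set_list) {
                blk_mq_realloc_hw_ctxs(set, q);
                blk_mq_map_swqueue(q);     /* ctx <-> hctx mapping fixed up */
        }

        /* re-attach; the scheduler's init_hctx now sees a valid mapping */
        list_for_each_entry(q, &set->tag_list, tag_set_list)
                blk_mq_elv_switch_back(&head, q);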

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
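
    For a C source file, the entire license header then becomes:

        // SPDX-License-Identifier: GPL-2.0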

    This patch is based on work done by Thomas Gleixner, Kate Stewart,
    and Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to apply to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing
    SPDX tag:value files, created by Philippe Ombredanne. Philippe
    prepared the base worksheet and did an initial spot review of a few
    thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license
    identifier(s) should be applied to the file. She confirmed any
    determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
      lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

22 Jun, 2017

1 commit

  • If we have shared tags enabled, then every IO completion will trigger
    a full loop of every queue belonging to a tag set, and every hardware
    queue for each of those queues, even if nothing needs to be done.
    This causes a massive performance regression if you have a lot of
    shared devices.

    Instead of doing this huge full scan on every IO, add an atomic
    counter to the main queue that tracks how many hardware queues have
    been marked as needing a restart. With that, we can avoid looking for
    restartable queues, if we don't have to.
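
    A sketch of the counting scheme (field and helper names per this fix;
    illustrative):

        static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
        {
                if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
                        struct request_queue *q = hctx->queue;

                        if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART,
                                              &hctx->state))
                                atomic_inc(&q->shared_hctx_restart);
                }
        }

        /* completion side: skip the full scan when nothing is marked */
        if (!atomic_read(&q->shared_hctx_restart))
                return;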

    Max reports that this restores performance. Before this patch, 4K
    IOPS was limited to 22-23K IOPS. With the patch, we are running at
    950-970K IOPS.

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy
    Tested-by: Max Gurtovoy
    Reviewed-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Apr, 2017

1 commit

  • Currently, this callback is called right after put_request() and has no
    distinguishable purpose. Instead, let's call it before put_request() as
    soon as I/O has completed on the request, before we account it in
    blk-stat. With this, Kyber can enable stats when it sees a latency
    outlier and make sure the outlier gets accounted.
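
    A sketch of the relocated hook (helper shape from this era of the
    code; illustrative):

        /* called from the completion path, before blk-stat accounting
         * and before the request is put */
        static inline void blk_mq_sched_completed_request(struct request *rq)
        {
                struct elevator_queue *e = rq->q->elevator;

                if (e && e->type->ops.mq.completed_request)
                        e->type->ops.mq.completed_request(rq);
        }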

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

08 Apr, 2017

2 commits

  • Schedulers need to be informed when a hardware queue is added or removed
    at runtime so they can allocate/free per-hardware queue data. So,
    replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
    at init time, with .init_hctx() and .exit_hctx() hooks.
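
    A sketch of the new hooks (member names per this patch; the other
    members of the ops structure are omitted):

        /* in the MQ elevator ops, called as hardware queues come and go */
        int (*init_hctx)(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx);
        void (*exit_hctx)(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx);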

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • To improve scalability, if hardware queues are shared, restart
    a single hardware queue in round-robin fashion. Rename
    blk_mq_sched_restart_queues() to reflect the new semantics.
    Remove blk_mq_sched_mark_restart_queue() because this function
    has no callers. Remove flag QUEUE_FLAG_RESTART because this
    patch removes the code that uses this flag.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

24 Feb, 2017

1 commit

  • In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart()
    after we dispatch requests left over on our hardware queue dispatch
    list. This is so we'll go back and dispatch requests from the scheduler.
    In this case, it's only necessary to restart the hardware queue that we
    are running; there's no reason to run other hardware queues just because
    we are using shared tags.

    So, split out blk_mq_sched_mark_restart() into two operations, one for
    just the hardware queue and one for the whole request queue. The core
    code only needs the hctx variant, but I/O schedulers will want to use
    both.

    This also requires adjusting blk_mq_sched_restart_queues() to always
    check the queue restart flag, not just when using shared tags.
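
    A sketch of the two resulting helpers (QUEUE_FLAG_RESTART is named by
    the later commit above; treat the bodies as illustrative):

        static inline void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
        {
                if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                        set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
        }

        static inline void blk_mq_sched_mark_restart_queue(struct request_queue *q)
        {
                if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
                        set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
        }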

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

11 Feb, 2017

1 commit

  • bio is used in bfq-mq's get_rq_priv to get the request group. We
    could pass the group directly here, but I thought that passing the
    bio was more general, giving the possibility to get other pieces of
    information if needed.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

04 Feb, 2017

1 commit

  • If we end up doing a request-to-request merge when we have completed
    a bio-to-request merge, we free the request from deep down in that
    path. For blk-mq-sched, the merge path has to hold the appropriate
    lock, but we don't need it for freeing the request. And in fact
    holding the lock is problematic, since we are now calling the
    mq sched put_rq_private() hook with the lock held. Other call paths
    do not hold this lock.

    Fix this inconsistency by ensuring that the caller frees a merged
    request. Then we can do it outside of the lock, making it both more
    efficient and fixing the blk-mq-sched problem of invoking parts of
    the scheduler with an unknown lock state.
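
    A sketch of the resulting call-site pattern (shape per this
    description; the lock and helper names are illustrative):

        struct request *free = NULL;
        bool ret;

        spin_lock(&sched_lock);                 /* scheduler's merge lock */
        ret = blk_mq_sched_try_merge(q, bio, &free);
        spin_unlock(&sched_lock);

        /* the merged-away request is freed here, outside the lock */
        if (free)
                blk_mq_free_request(free);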

    Reported-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

27 Jan, 2017

1 commit

  • If we have both multiple hardware queues and shared tag map between
    devices, we need to ensure that we propagate the hardware queue
    restart bit higher up. This is because we can get into a situation
    where we don't have any IO pending on a hardware queue, yet we fail
    getting a tag to start new IO. If that happens, it's not enough to
    mark the hardware queue as needing a restart, we need to bubble
    that up to the higher level queue as well.
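
    A sketch of the propagation (shape per this description; the
    queue-level flag name follows the later commits above):

        static inline void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
        {
                if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
                        set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
                        if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
                                struct request_queue *q = hctx->queue;

                                /* bubble the restart up to the queue */
                                if (!test_bit(QUEUE_FLAG_RESTART,
                                              &q->queue_flags))
                                        set_bit(QUEUE_FLAG_RESTART,
                                                &q->queue_flags);
                        }
                }
        }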

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval
    Tested-by: Hannes Reinecke

    Jens Axboe
     

18 Jan, 2017

2 commits

  • Add Kconfig entries to manage what devices get assigned an MQ
    scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
    The latter is useful for admin-type queues that still allocate a
    blk-mq queue and tag set, but aren't used for normal IO.
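
    A sketch of a driver opting out (flag name per this patch; the tag
    set shown is illustrative):

        /* e.g. an NVMe-style admin queue: no IO scheduler wanted */
        dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;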

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     
  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment it with the scheduler flagging support for the
    blk-mq interface, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.
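
    A sketch of the tag split in action (helper name and signature are
    assumptions based on this era of the code):

        /* inside the dispatch loop: requests get a scheduler tag at
         * allocation time; the scarcer driver tag is only taken at
         * dispatch, decoupling scheduling depth from device depth */
        if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
                /* no driver tag: park the request and retry on restart */
                list_add(&rq->queuelist, &hctx->dispatch);
                break;
        }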

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe