21 Jun, 2019

3 commits

  • This function just has a few trivial assignments and has two callers,
    one of them in the fast path.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Return the segment count and let the callers assign it, which makes the
    code a little more obvious. Also pass the request instead of q plus bio
    chain, allowing for the use of rq_for_each_bvec.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
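
    As a rough illustration of the new calling convention, a recount over the
    whole request could be written with rq_for_each_bvec and the result stored
    by the caller. This is a simplified sketch that ignores per-segment size
    limits, not the actual blk_recalc_rq_segments body:

      /* Walk every bvec of the request, count them, and let the caller
       * store the result. */
      struct req_iterator iter;
      struct bio_vec bv;
      unsigned int nr_segs = 0;

      rq_for_each_bvec(bv, rq, iter)
              nr_segs++;
      rq->nr_phys_segments = nr_segs;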
     
  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead, so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove the
    pointless struct request_queue arguments from any of the functions
    that had it and grew a nr_segs argument.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
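
    A hedged sketch of what a submission-path caller looks like once the split
    helper reports the segment count through an out parameter; the function
    names below are recalled from that era and the surrounding code is elided:

      unsigned int nr_segs;

      __blk_queue_split(q, &bio, &nr_segs);     /* split and report segments */
      blk_mq_bio_to_request(rq, bio, nr_segs);  /* consumer takes nr_segs */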
     

07 Jun, 2019

1 commit

  • In theory, IO scheduler belongs to request queue, and the request pool
    of sched tags belongs to the request queue too.

    However, the current tag allocation interfaces are reused for both
    driver tags and sched tags. Driver tags are definitely host wide and
    don't belong to any request queue, and the same holds for their request
    pool. So we need the tagset instance for freeing the requests of sched tags.

    Meanwhile, blk_mq_free_tag_set() often follows blk_cleanup_queue() in the
    non-BLK_MQ_F_TAG_SHARED case, which requires that the request pool of sched
    tags be freed before blk_mq_free_tag_set() is called.

    Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
    moved blk_exit_queue into __blk_release_queue to simplify the fast
    path in generic_make_request(), which then causes an oops when freeing
    the requests of sched tags in __blk_release_queue().

    Fix the above issue by moving the freeing of the sched tags request pool
    into blk_cleanup_queue(); this is safe because the queue has been frozen and
    there are no in-queue requests at that time. Freeing the sched tags themselves
    has to stay in the queue's release handler because there might be uncompleted
    dispatch activity that still refers to them.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
    Tested-by: Yi Zhang
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
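
    A minimal driver-teardown sketch of the ordering this change relies on: the
    queue is cleaned up (freezing it and freeing the sched-tag request pool)
    before the host-wide tag set is released.

      blk_cleanup_queue(q);           /* queue frozen, sched-tag requests freed */
      blk_mq_free_tag_set(&tag_set);  /* host-wide driver tags released afterwards */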
     

29 May, 2019

1 commit

  • Commit 498f6650aec8 ("block: Fix a race between the cgroup code and
    request queue initialization") moved what blk_exit_queue does into
    blk_cleanup_queue() to fix an issue caused by changing back the
    queue lock.

    However, now that the legacy request IO path has been removed, the driver
    queue lock isn't used at all and there is no changing back of the
    queue lock any more, so the issue addressed by commit 498f6650aec8
    no longer exists.

    So move blk_exit_queue into __blk_release_queue.

    This patch basically reverts the following two commits:

    498f6650aec8 block: Fix a race between the cgroup code and request queue initialization
    24ecc3585348 block: Ensure that a request queue is dissociated from the cgroup controller

    Cc: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

02 Apr, 2019

1 commit

  • xen_biovec_phys_mergeable() only needs .bv_page of the 2nd bio bvec
    to check whether the two bvecs can be merged, so pass the page to
    xen_biovec_phys_mergeable() directly.

    No functional change.

    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: xen-devel@lists.xenproject.org
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Boris Ostrovsky
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Feb, 2019

1 commit

  • Currently, the queue mapping result is saved in a two-dimensional
    array. In the hot path, to get an hctx, we need to do the following:

    q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]

    This isn't very efficient. We could instead save the queue mapping result
    in the ctx directly, per hctx type, like:

    ctx->hctxs[type]

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
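
    A hedged sketch of the resulting hot-path lookup, assuming the per-ctx
    hctxs[] array described above:

      /* Before: two array dereferences through the tag set's CPU map. */
      hctx = q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]];

      /* After: the mapping is resolved once at init time and cached per ctx. */
      hctx = ctx->hctxs[type];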
     

27 Nov, 2018

1 commit

  • This isn't exactly the same as the previous count, as it includes
    requests for all devices. But that really doesn't matter, if we have
    more than the threshold (16) queued up, flush it. It's not worth it
    to have an expensive list loop for this.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
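
    A hedged sketch of the kind of check this implies, assuming a running
    counter in the plug and the existing BLK_MAX_REQUEST_COUNT threshold of 16
    (the field name is illustrative):

      /* Keep a simple running count instead of walking the plug list. */
      plug->rq_count++;
      if (plug->rq_count >= BLK_MAX_REQUEST_COUNT)
              blk_flush_plug_list(plug, false);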
     

20 Nov, 2018

1 commit

  • bio->bi_ioc is never set, so it is always NULL. Remove references to it in
    bio_disassociate_task() and in rq_ioc(), and delete this field from
    struct bio. With this change, rq_ioc() always returns
    current->io_context without the need for a bio argument. Further
    simplify the code and make it more readable by also removing this
    helper, which in turn allows simplifying blk_mq_sched_assign_ioc() by
    removing its bio argument.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Adam Manzanares
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
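
    The net effect on callers is small; a sketch of what the lookup reduces to
    once the bio argument is gone:

      /* With bio->bi_ioc removed there is only one possible source, so the
       * former rq_ioc(bio) helper collapses to the submitting task's context. */
      struct io_context *ioc = current->io_context;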
     

19 Nov, 2018

1 commit

  • Merge in -rc3 to resolve a few conflicts, but also to get a few
    important fixes that have gone into mainline since the block
    4.21 branch was forked off (most notably the SCSI queue issue,
    which is both a conflict AND a needed fix).

    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Nov, 2018

3 commits

  • The only remaining user unconditionally drops and reacquires the lock,
    which means we really don't need any additional (conditional) annotation.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • ->queue_flags is generally not set or cleared in the fast path, and also
    generally set or cleared one flag at a time. Make use of the normal
    atomic bitops for it so that we don't need to take the queue_lock,
    which is otherwise mostly unused in the core block layer now.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
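
    A minimal sketch of lockless flag helpers built on the generic bitops,
    mirroring (from memory) the blk_queue_flag_set()/blk_queue_flag_clear()
    style of that era:

      /* Set or clear a single queue flag atomically, no queue_lock needed. */
      void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
      {
              set_bit(flag, &q->queue_flags);
      }

      void blk_queue_flag_clear(unsigned int flag, struct request_queue *q)
      {
              clear_bit(flag, &q->queue_flags);
      }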
     
  • There are no users left since the removal of the legacy request interface,
    so we can remove all the magic bit stealing now and make it a normal field.

    But use WRITE_ONCE/READ_ONCE on the new deadline field, given that we
    don't seem to have any mechanism to guarantee a new value actually
    gets seen by other threads.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
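
    A hedged sketch of the accessor pattern, assuming a plain rq->deadline
    field; the helper names here are illustrative, not necessarily the ones in
    the patch:

      /* Publish and read the deadline with the ONCE accessors so a concurrent
       * timeout check sees either the old or the new value, never a torn one. */
      static inline void rq_set_deadline(struct request *rq, unsigned long t)
      {
              WRITE_ONCE(rq->deadline, t);
      }

      static inline unsigned long rq_deadline(struct request *rq)
      {
              return READ_ONCE(rq->deadline);
      }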
     

09 Nov, 2018

1 commit

  • Obviously the created discard bio has to be aligned with the logical block
    size.

    This patch introduces the helper of bio_allowed_max_sectors() for
    this purpose.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in blkdev_issue_discard()")
    Fixes: a22c4d7e34402cc ("block: re-add discard_granularity and alignment checks")
    Reported-by: Rui Salvaterra
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
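
    A sketch of what such a helper plausibly looks like: the largest count of
    512-byte sectors that is still a multiple of the logical block size (shown
    from memory, so treat it as illustrative):

      static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
      {
              /* Round UINT_MAX down to the logical block size, in sectors. */
              return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
      }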
     

08 Nov, 2018

7 commits

  • Prep patch for being able to place requests based not just on
    CPU location, but also on the type of request.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Reviewed-by: Hannes Reinecke
    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's now dead code; nobody uses it.

    Reviewed-by: Hannes Reinecke
    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The only user of legacy timing now is BSG, which is invoked
    from the mq timeout handler. Kill the legacy code, and rename
    the q->rq_timed_out_fn to q->bsg_job_timeout_fn.

    Reviewed-by: Hannes Reinecke
    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This removes a bunch of core and elevator related code. On the core
    front, we remove anything related to queue running, draining,
    initialization, plugging, and congestion. We also kill anything
    related to request allocation, merging, retrieval, and completion.

    Remove any checking for single queue IO schedulers, as they no
    longer exist. This means we can also delete a bunch of code related
    to request issue, adding, completion, etc - and all the SQ related
    ops and helpers.

    Also kill the load_default_modules(), as all that did was provide
    for a way to load the default single queue elevator.

    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Reviewed-by: Hannes Reinecke
    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • With drivers that are setting a virtual boundary constraint, we are
    seeing a lot of bio splitting and smaller I/Os being submitted to the
    driver.

    This happens because the bio gap detection code does not account for cases
    where PAGE_SIZE - 1 is bigger than queue_virt_boundary() and thus will
    split the bio unnecessarily.

    Cc: Jan Kara
    Cc: Bart Van Assche
    Cc: Ming Lei
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Johannes Thumshirn
    Acked-by: Keith Busch
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
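
    For reference, a virtual-boundary gap check of this kind boils down to a
    mask test; a simplified sketch, not the exact patched function:

      /* Two ranges may share a segment only if the end of the previous one and
       * the start of the next one do not straddle the virtual boundary. */
      static bool straddles_virt_boundary(struct request_queue *q,
                                          unsigned long prev_end,
                                          unsigned long next_start)
      {
              return (prev_end | next_start) & queue_virt_boundary(q);
      }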
     

26 Oct, 2018

2 commits

  • Drivers exposing zoned block devices have to initialize and maintain
    correctness (i.e. revalidate) of the device zone bitmaps attached to
    the device request queue (seq_zones_bitmap and seq_zones_wlock).

    To simplify coding this, introduce a generic helper function
    blk_revalidate_disk_zones() suitable for most (and likely all) cases.
    This new function always updates the seq_zones_bitmap and seq_zones_wlock
    bitmaps as well as the queue nr_zones field when called for a disk
    using a request based queue. For a disk using a BIO based queue, only
    the number of zones is updated since these queues do not have
    schedulers and so do not need the zone bitmaps.

    With this change, the zone bitmap initialization code in sd_zbc.c can be
    replaced with a call to this function in sd_zbc_read_zones(), which is
    called from the disk revalidate block operation method.

    A call to blk_revalidate_disk_zones() is also added to the null_blk
    driver for devices created with the zoned mode enabled.

    Finally, to ensure that zoned devices created with dm-linear or
    dm-flakey expose the correct number of zones through sysfs, a call to
    blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

    The zone bitmaps allocated and initialized with
    blk_revalidate_disk_zones() are freed automatically from
    __blk_release_queue() using the block internal function
    blk_queue_free_zone_bitmaps().

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
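
    A minimal usage sketch of the new helper from a zoned driver's revalidate
    path (the driver function name is hypothetical):

      /* Hypothetical driver hook: refresh nr_zones and the zone bitmaps. */
      static int myzoned_revalidate_disk(struct gendisk *disk)
      {
              return blk_revalidate_disk_zones(disk);
      }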
     
  • There is no need to synchronously execute all REQ_OP_ZONE_RESET BIOs
    necessary to reset a range of zones. Similarly to what is done for
    discard BIOs in blk-lib.c, all zone reset BIOs can be chained and
    executed asynchronously and a synchronous call done only for the last
    BIO of the chain.

    Modify blkdev_reset_zones() to operate similarly to
    blkdev_issue_discard() using the next_bio() helper for chaining BIOs. To
    avoid code duplication of that function in blk-zoned.c, rename
    next_bio() into blk_next_bio() and declare it as a block internal
    function in blk.h.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
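
    A hedged sketch of the chained-submission pattern described above, using
    blk_next_bio(); flags and loop bounds are simplified:

      /* Chain one REQ_OP_ZONE_RESET bio per zone; only the last is waited on,
       * and its completion covers the whole chain. */
      struct bio *bio = NULL;
      int ret;

      while (sector < end_sector) {
              bio = blk_next_bio(bio, 0, gfp_mask);
              bio->bi_iter.bi_sector = sector;
              bio_set_dev(bio, bdev);
              bio->bi_opf = REQ_OP_ZONE_RESET;
              sector += zone_sectors;
      }
      ret = submit_bio_wait(bio);
      bio_put(bio);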
     

21 Aug, 2018

1 commit

  • Currently, when updating nr_hw_queues, the IO scheduler's init_hctx will
    be invoked before the mapping between ctx and hctx has been adapted
    correctly by blk_mq_map_swqueue. The IO scheduler's init_hctx (kyber)
    may depend on this mapping, get a wrong result, and finally panic.
    A simple way to fix this is to switch the IO scheduler to 'none'
    before updating nr_hw_queues, and then switch it back after
    nr_hw_queues has been updated. blk_mq_sched_init_/exit_hctx are removed
    since nobody uses them any more.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

09 Jul, 2018

1 commit

  • Current IO controllers for the block layer are less than ideal for our
    use case. The io.max controller is great at hard limiting, but it is
    not work conserving. This patch introduces io.latency. You provide a
    latency target for your group and we monitor the io in short windows to
    make sure we are not exceeding those latency targets. This makes use of
    the rq-qos infrastructure and works much like the wbt stuff. There are
    a few differences from wbt:

    - It's bio based, so the latency covers the whole block layer in addition to
    the actual io.
    - We will throttle all IO types that come in here if we need to.
    - We use the mean latency over the 100ms window. This is because writes can
    be particularly fast, which could give us a false sense of the impact of
    other workloads on our protected workload.
    - By default there's no throttling; we set the queue_depth to INT_MAX so that
    we can have as many outstanding bios as we're allowed to. Only at
    throttle time do we pay attention to the actual queue depth.
    - We backcharge cgroups for root cg issued IO and induce artificial
    delays in order to deal with cases like metadata only or swap heavy
    workloads.

    In testing this has worked out relatively well. Protected workloads
    will throttle noisy workloads down to 1 IO at a time if they are doing
    normal IO on their own, or induce up to a 1 second delay per syscall if
    they are doing a lot of root-issued IO (metadata/swap IO).

    Our testing has revolved mostly around our production web servers where
    we have hhvm (the web server application) in a protected group and
    everything else in another group. We see slightly higher requests per
    second (RPS) on the test tier vs the control tier, and much more stable
    RPS across all machines in the test tier vs the control tier.

    Another test we run is a slow memory allocator in the unprotected group.
    Before, this would eventually push us into swap and cause the whole box
    to die and not recover at all. With these patches we see slight RPS
    drops (usually 10-15%) before the memory consumer is properly killed and
    things recover within seconds.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
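
    A hedged usage sketch of the resulting cgroup-v2 knob: the group gets a
    latency target (in microseconds, as I understand the interface), and the
    cgroup path and device numbers below are placeholders:

      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
              /* Ask for roughly a 10 ms completion-latency target. */
              int fd = open("/sys/fs/cgroup/protected/io.latency", O_WRONLY);

              if (fd < 0)
                      return 1;
              dprintf(fd, "259:0 target=10000\n");
              close(fd);
              return 0;
      }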
     

09 May, 2018

1 commit

  • Currently, struct request has four timestamp fields:

    - A start time, set at get_request time, in jiffies, used for iostats
    - An I/O start time, set at start_request time, in ktime nanoseconds,
    used for blk-stats (i.e., wbt, kyber, hybrid polling)
    - Another start time and another I/O start time, used for cfq and bfq

    These can all be consolidated into one start time and one I/O start
    time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
    request depending on the kernel config.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
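
    A hedged sketch of the consolidated bookkeeping, assuming the two
    nanosecond-resolution fields later kernels call start_time_ns and
    io_start_time_ns:

      /* At allocation: one ns start time serves iostats and blk-stats. */
      rq->start_time_ns = ktime_get_ns();

      /* At dispatch: a single ns I/O start time for wbt, kyber, polling. */
      rq->io_start_time_ns = ktime_get_ns();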
     
