04 Sep, 2020

1 commit

  • Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
    with the goal of later being able to use a common shared tag bitmap across
    all HW contexts in a set.
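
    A minimal standalone sketch of the indirection this enables
    (illustrative names, not necessarily the real blk_mq_tags layout):
    lookups go through pointers, which can reference either the embedded
    per-set storage or one bitmap shared by every HW context in the set.

    /* Hypothetical stand-ins for the real sbitmap-based tag maps. */
    struct tag_bitmap {
            unsigned long *map;
            unsigned int depth;
    };

    struct tags_sketch {
            /* Embedded storage, used when the tags are not shared. */
            struct tag_bitmap __bitmap_tags;
            struct tag_bitmap __breserved_tags;
            /* Users go through these pointers, so they can later be
             * redirected to a single bitmap shared across all HW
             * contexts in the set. */
            struct tag_bitmap *bitmap_tags;
            struct tag_bitmap *breserved_tags;
    };

    static void tags_init_private(struct tags_sketch *t)
    {
            t->bitmap_tags = &t->__bitmap_tags;
            t->breserved_tags = &t->__breserved_tags;
    }

    static void tags_init_shared(struct tags_sketch *t,
                                 struct tag_bitmap *shared,
                                 struct tag_bitmap *shared_resv)
    {
            t->bitmap_tags = shared;
            t->breserved_tags = shared_resv;
    }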

    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    John Garry
     

02 Sep, 2020

1 commit


30 May, 2020

1 commit


03 Jul, 2019

1 commit

  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Jun, 2019

1 commit

  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead, so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove the
    pointless struct request_queue argument from any of the functions
    that had one and grew a nr_segs argument.
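
    As a rough illustration of that calling convention (plain C with
    hypothetical names, not the real blk_queue_split signature), the
    split helper reports the segment count to its caller instead of
    caching it in the bio:

    #include <stddef.h>

    /* Hypothetical bio: note there is no cached segment count field. */
    struct bio_sketch {
            size_t len;
    };

    /*
     * Split the bio if needed and report the number of segments of the
     * resulting bio through *nr_segs, so the submission path can hand
     * the value to the functions that need it instead of re-counting it
     * or reading it back out of the bio.
     */
    static void queue_split_sketch(struct bio_sketch **bio,
                                   unsigned int *nr_segs,
                                   size_t max_seg_len)
    {
            *nr_segs = (unsigned int)(((*bio)->len + max_seg_len - 1) /
                                      max_seg_len);
            /* A real implementation would also split *bio here when it
             * spans too many segments; omitted in this sketch. */
    }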

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 May, 2019

1 commit


21 Dec, 2018

1 commit

  • sbq_wake_ptr() checks sbq->ws_active to know if it needs to loop
    the wait indexes or not. This requires the use of the sbitmap
    waitqueue wrappers, but kyber doesn't use those for its domain
    token waitqueue handling.

    Convert kyber to use the helpers. This fixes a hang with waiting
    for domain tokens.

    Fixes: 5d2ee7122c73 ("sbitmap: optimize wakeup check")
    Tested-by: Ming Lei
    Reported-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Nov, 2018

3 commits

  • The mapping used to be dependent on just the CPU location, but
    now it's a tuple of (type, cpu) instead. This is a prep patch
    for allowing a single software queue to map to multiple hardware
    queues. No functional changes in this patch.

    This changes the software queue count to an unsigned short
    to save a bit of space. We can still support 64K-1 CPUs,
    which should be enough. Add a check to catch a wrap.
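
    A minimal userspace illustration of the narrowing and the wrap check
    (hypothetical names, not the blk-mq structures):

    #include <limits.h>
    #include <stdio.h>

    /* The per-hctx software queue count now fits in 16 bits. */
    struct hctx_sketch {
            unsigned short nr_ctx;
    };

    static int set_nr_ctx(struct hctx_sketch *h, unsigned int nr)
    {
            /* Catch a wrap: 64K-1 is the most this representation holds. */
            if (nr > USHRT_MAX) {
                    fprintf(stderr, "ctx count %u overflows unsigned short\n",
                            nr);
                    return -1;
            }
            h->nr_ctx = (unsigned short)nr;
            return 0;
    }

    int main(void)
    {
            struct hctx_sketch h;

            return set_nr_ctx(&h, 128);
    }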

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a remnant of when we had ops for both SQ and MQ
    schedulers. Now it's just MQ, so get rid of the union.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This removes a bunch of core and elevator related code. On the core
    front, we remove anything related to queue running, draining,
    initialization, plugging, and congestion. We also kill anything
    related to request allocation, merging, retrieval, and completion.

    Remove any checking for single queue IO schedulers, as they no
    longer exist. This means we can also delete a bunch of code related
    to request issue, adding, completion, etc - and all the SQ related
    ops and helpers.

    Also kill the load_default_modules(), as all that did was provide
    for a way to load the default single queue elevator.

    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Sep, 2018

1 commit

  • NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long.
    However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a
    32-bit long. Make sure all of the targets are calculated as 64-bit
    values.
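
    A small standalone example of the overflow in question, assuming a
    target where long is 32 bits:

    #include <stdint.h>
    #include <stdio.h>

    #define NSEC_PER_SEC 1000000000L

    int main(void)
    {
            /* Where long is 32 bits, 5 * NSEC_PER_SEC (5,000,000,000)
             * does not fit and the multiplication wraps. */
            long naive = 5 * NSEC_PER_SEC;

            /* Widening to 64 bits before the multiply yields the
             * intended target value. */
            int64_t fixed = 5 * (int64_t)NSEC_PER_SEC;

            printf("naive=%ld fixed=%lld\n", naive, (long long)fixed);
            return 0;
    }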

    Fixes: 6e25cb01ea20 ("kyber: implement improved heuristics")
    Reported-by: Stephen Rothwell
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

28 Sep, 2018

4 commits

  • When debugging Kyber, it's really useful to know what latencies we've
    been having, how the domain depths have been adjusted, and if we've
    actually been throttling. Add three tracepoints, kyber_latency,
    kyber_adjust, and kyber_throttled, to record that.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Kyber's current heuristics have a few flaws:

    - It's based on the mean latency, but p99 latency tends to be more
    meaningful to anyone who cares about latency. The mean can also be
    skewed by rare outliers that the scheduler can't do anything about.
    - The statistics calculations are purely time-based with a short window.
    This works for steady, high load, but is more sensitive to outliers
    with bursty workloads.
    - It only considers the latency once an I/O has been submitted to the
    device, but the user cares about the time spent in the kernel, as
    well.

    These are shortcomings of the generic blk-stat code which doesn't quite
    fit the ideal use case for Kyber. So, this replaces the statistics with
    a histogram used to calculate percentiles of total latency and I/O
    latency, which we then use to adjust depths in a slightly more
    intelligent manner:

    - Sync and async writes are now the same domain.
    - Discards are a separate domain.
    - Domain queue depths are scaled by the ratio of the p99 total latency
    to the target latency (e.g., if the p99 latency is double the target
    latency, we will double the queue depth; if the p99 latency is half of
    the target latency, we can halve the queue depth).
    - We use the I/O latency to determine whether we should scale queue
    depths down: we will only scale down if any domain's I/O latency
    exceeds the target latency, which is an indicator of congestion in the
    device.

    These new heuristics are just as scalable as the heuristics they
    replace.
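
    A hedged sketch of that scaling rule in plain C (illustrative only;
    the real code may step and clamp the depths differently):

    #include <stdint.h>

    static unsigned int adjust_depth(unsigned int depth,
                                     uint64_t p99_total_ns,
                                     uint64_t p99_io_ns,
                                     uint64_t target_ns,
                                     unsigned int max_depth)
    {
            /* Scale by the ratio of p99 total latency to the target:
             * double the target -> double the depth, half the target ->
             * halve the depth. */
            uint64_t new_depth = (uint64_t)depth * p99_total_ns / target_ns;

            /* Only allow scaling down when the device-side p99 exceeds
             * the target, i.e. the device itself looks congested. */
            if (new_depth < depth && p99_io_ns <= target_ns)
                    new_depth = depth;

            if (new_depth < 1)
                    new_depth = 1;
            if (new_depth > max_depth)
                    new_depth = max_depth;
            return (unsigned int)new_depth;
    }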

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • The domain token sbitmaps are currently initialized to the device queue
    depth or 256, whichever is larger, and immediately resized to the
    maximum depth for that domain (256, 128, or 64 for read, write, and
    other, respectively). The sbitmap is never resized larger than that, so
    it's unnecessary to allocate a bitmap larger than the maximum depth.
    Let's just allocate it to the maximum depth to begin with. This will use
    marginally less memory, and more importantly, give us a more appropriate
    number of bits per sbitmap word.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Commit 4bc6339a583c ("block: move blk_stat_add() to
    __blk_mq_end_request()") consolidated some calls using ktime_get() so
    we'd only need to call it once. Kyber's ->completed_request() hook also
    calls ktime_get(), so let's move it to the same place, too.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

31 May, 2018

1 commit

  • Currently, kyber is very unfriendly to merging. kyber depends on
    the ctx rq_list to do merging; however, most of the time it will
    not leave any requests in the ctx rq_list. This is because even if
    the tokens of one domain are used up, kyber will try to dispatch
    requests from other domains and flush the rq_list there.

    To improve this, we set up a kyber_ctx_queue (kcq), which is
    similar to a ctx but has one rq_list per domain, and build the same
    mapping between kcq and khd as between ctx and hctx. Then we can
    merge, insert and dispatch for the different domains separately,
    and only flush a kcq's rq_list once a domain token has been
    obtained successfully. So if one domain's tokens are used up, its
    requests can be left on that domain's rq_list and possibly merged
    with following IO.
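
    A minimal structural sketch of that per-domain staging (illustrative
    names only, not the kyber data structures):

    enum domain { DOM_READ, DOM_SYNC_WRITE, DOM_OTHER, DOM_MAX };

    struct req;                     /* opaque request for the sketch */

    struct req_list {
            struct req *head, *tail;
    };

    /* One kcq-like structure per software queue: requests are staged
     * per domain, so a domain that is out of tokens keeps its list (and
     * its merge opportunities) instead of having it flushed. */
    struct kcq_sketch {
            struct req_list rq_list[DOM_MAX];
    };

    static int try_get_domain_token(int d) { (void)d; return 1; }  /* stub */
    static void dispatch_list(struct req_list *l) { (void)l; }     /* stub */

    static void flush_kcq(struct kcq_sketch *kcq)
    {
            for (int d = 0; d < DOM_MAX; d++) {
                    /* Only flush a domain's list when a token for that
                     * domain was obtained; otherwise leave the requests
                     * in place so following IO can still merge with
                     * them. */
                    if (try_get_domain_token(d))
                            dispatch_list(&kcq->rq_list[d]);
            }
    }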

    Following is my test result on a machine with 8 cores and an NVMe
    card (INTEL SSDPEKKR128G7):

    fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
    seq/random
    +------+----------+-----------+------------+-----------------+---------+
    |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
    +------+----------+-----------+------------+-----------------+---------+
    | w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 |   0/0   |
    +------+----------+-----------+------------+-----------------+---------+
    | w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
    +------+----------+-----------+------------+-----------------+---------+
    With numjobs set to 16, the bw and iops reach 1662MB/s and 425k
    on my platform.

    Signed-off-by: Jianchao Wang
    Tested-by: Holger Hoffstätte
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

11 May, 2018

1 commit


09 May, 2018

1 commit

  • struct blk_issue_stat squashes three things into one u64:

    - The time the driver started working on a request
    - The original size of the request (for the io.low controller)
    - Flags for writeback throttling

    It turns out that on x86_64, we have a 4 byte hole in struct request
    which we can fill with the non-timestamp fields from blk_issue_stat,
    simplifying things quite a bit.
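
    An illustrative before/after sketch (made-up layouts, not the actual
    kernel structures) of the difference between squashing the extra
    data into the timestamp word and simply placing it in the padding
    hole:

    #include <stdint.h>

    /* Before: steal high bits of one u64 for the extra data, leaving
     * fewer bits for the timestamp. */
    #define STAT_TIME_BITS  48
    #define STAT_TIME_MASK  ((UINT64_C(1) << STAT_TIME_BITS) - 1)

    static uint64_t pack_stat(uint64_t time_ns, uint16_t extra)
    {
            return (time_ns & STAT_TIME_MASK) |
                   ((uint64_t)extra << STAT_TIME_BITS);
    }

    static uint64_t stat_time(uint64_t packed)
    {
            return packed & STAT_TIME_MASK;
    }

    static uint16_t stat_extra(uint64_t packed)
    {
            return (uint16_t)(packed >> STAT_TIME_BITS);
    }

    /* After: the 4-byte alignment hole next to an existing int field
     * holds the non-timestamp data as ordinary fields, and the
     * timestamp keeps its full 64 bits. */
    struct request_sketch {
            int      existing_field;   /* 4 bytes ... */
            uint32_t size_and_flags;   /* ... the hole holds size/flags */
            uint64_t issue_time_ns;    /* full-width timestamp */
    };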

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

25 Feb, 2018

1 commit

  • When requeuing a request, the domain token should be freed before
    re-inserting the request into the I/O scheduler. Otherwise, the
    assigned domain token is leaked, and an IO hang can result.

    Cc: Paolo Valente
    Cc: Omar Sandoval
    Cc: stable@vger.kernel.org
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Dec, 2017

1 commit

  • Commit 8cf466602028 ("kyber: fix hang on domain token wait queue") fixed
    a hang caused by leaving wait entries on the domain token wait queue
    after the __sbitmap_queue_get() retry succeeded, making that wait entry
    a "dud" which won't in turn wake more entries up. However, we can also
    get a dud entry if kyber_get_domain_token() fails once but is then
    called again and succeeds. This can happen if the hardware queue is
    rerun for some other reason, or, more likely, kyber_dispatch_request()
    tries the same domain twice.

    The fix is to remove our entry from the wait queue whenever we
    successfully get a token. The only complication is that we might be on
    one of many wait queues in the struct sbitmap_queue, but that's easily
    fixed by remembering which wait queue we were put on.

    While we're here, only initialize the wait queue entry once instead of
    on every wait, and use spin_lock_irq() instead of spin_lock_irqsave(),
    since this is always called from process context with irqs enabled.
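
    A hedged userspace sketch of the fix (pthreads instead of the
    kernel's wait-queue API, hypothetical names): remember which wait
    queue the entry was put on, and drop it from that queue as soon as a
    token is acquired so no dud entry is left behind.

    #include <pthread.h>
    #include <stddef.h>

    struct wq_entry;

    struct wq_head {
            pthread_mutex_t lock;
            struct wq_entry *first;
    };

    struct wq_entry {
            struct wq_entry *next;
            struct wq_head *on;     /* queue we were added to, if any */
    };

    /* Unlink e from q; q->lock must be held. */
    static void wq_del(struct wq_head *q, struct wq_entry *e)
    {
            for (struct wq_entry **p = &q->first; *p; p = &(*p)->next) {
                    if (*p == e) {
                            *p = e->next;
                            break;
                    }
            }
    }

    /* Call whenever a domain token was successfully obtained. */
    static void remove_from_wait_queue(struct wq_entry *e)
    {
            struct wq_head *q = e->on;

            if (!q)
                    return;
            /* Always reached from process context with irqs enabled in
             * the scenario above, so a plain lock is enough here. */
            pthread_mutex_lock(&q->lock);
            wq_del(q, e);
            e->on = NULL;
            pthread_mutex_unlock(&q->lock);
    }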

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

01 Nov, 2017

1 commit


18 Oct, 2017

1 commit

  • When we're getting a domain token, if we fail to get a token on our
    first attempt, we put the current hardware queue on a wait queue and
    then try again just in case a token was freed after our initial attempt
    but before we got on the wait queue. If this second attempt succeeds, we
    currently leave the hardware queue on the wait queue. Usually this is
    okay; we'll just run the hardware queue one extra time when another
    token is freed. However, if the hardware queue doesn't have any other
    requests waiting, then when it gets the extra wakeup, it won't have
    anything to free and therefore won't wake up any other hardware queues.
    If tokens are limited, then we won't make forward progress and the
    device will hang.

    Reported-by: Bin Zha
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

04 Jul, 2017

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 years of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... it was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

19 Jun, 2017

2 commits

  • This patch makes sure we always allocate requests in the core blk-mq
    code and use a common prepare_request method to initialize them for
    both mq I/O schedulers. For Kyber, an additional limit_depth method
    is added that is called before allocating the request.

    Also, because none of the initializations can really fail, the new
    method does not return an error - instead the bfq finish method is
    hardened to deal with the no-IOC case.

    Last but not least, this removes the abuse of RQF_QUEUED by the
    blk-mq scheduling code, as RQF_ELVPRIV is all that is needed now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • No need to have two different callouts of bfq vs kyber.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 May, 2017

1 commit


21 Apr, 2017

1 commit

  • In order to allow for filtering of IO based on some property of the
    request other than direction, we allow the bucket function to
    return an int.

    If the bucket callback returns a negative value, do not count it in
    the stats accumulation.
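
    A small standalone sketch of that convention (hypothetical names):
    the bucket callback returns an int, and a negative return excludes
    the sample from the accumulation:

    #include <stdio.h>

    #define NR_BUCKETS 2

    struct sample {
            int dir;                    /* 0 = read, 1 = write */
            unsigned long long nsec;
    };

    /* The bucket function now returns an int; a negative value means
     * "do not account this sample at all". */
    static int bucket_fn(const struct sample *s)
    {
            if (s->nsec == 0)
                    return -1;          /* filtered out */
            return s->dir;
    }

    static void account(const struct sample *s,
                        unsigned long long counts[NR_BUCKETS])
    {
            int bucket = bucket_fn(s);

            if (bucket < 0)             /* negative: skip accumulation */
                    return;
            counts[bucket]++;
    }

    int main(void)
    {
            unsigned long long counts[NR_BUCKETS] = { 0, 0 };
            struct sample a = { .dir = 0, .nsec = 1000 };
            struct sample b = { .dir = 1, .nsec = 0 };

            account(&a, counts);
            account(&b, counts);        /* skipped: bucket_fn returns -1 */
            printf("%llu %llu\n", counts[0], counts[1]);
            return 0;
    }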

    Signed-off-by: Stephen Bates

    Fixed up Kyber scheduler stat callback.

    Signed-off-by: Jens Axboe

    Stephen Bates
     

15 Apr, 2017

1 commit

  • The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
    scale to multiple queues. Users configure only two knobs, the target
    read and synchronous write latencies, and the scheduler tunes itself to
    achieve that latency goal.

    The implementation is based on "tokens", built on top of the scalable
    bitmap library. Tokens serve as a mechanism for limiting requests. There
    are two tiers of tokens: queueing tokens and dispatch tokens.

    A queueing token is required to allocate a request. In fact, these
    tokens are actually the blk-mq internal scheduler tags, but the
    scheduler manages the allocation directly in order to implement its
    policy.

    Dispatch tokens are device-wide and split up into two scheduling
    domains: reads vs. writes. Each hardware queue dispatches batches
    round-robin between the scheduling domains as long as tokens are
    available for that domain.

    These tokens can be used as the mechanism to enable various policies.
    The policy Kyber uses is inspired by active queue management techniques
    for network routing, similar to blk-wbt. The scheduler monitors
    latencies and scales the number of dispatch tokens accordingly. Queueing
    tokens are used to prevent starvation of synchronous requests by
    asynchronous requests.
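
    A compact sketch of that dispatch policy (illustrative only, not the
    kyber implementation): each pass walks the scheduling domains
    round-robin and issues requests only while dispatch tokens remain
    for the domain.

    enum { DOM_READ, DOM_WRITE, NR_DOMAINS };

    struct domain_sketch {
            int tokens;     /* device-wide dispatch tokens remaining */
            int queued;     /* requests waiting in this domain */
    };

    /* One round-robin dispatch pass over the scheduling domains. */
    static void dispatch_round(struct domain_sketch doms[NR_DOMAINS],
                               int *cursor)
    {
            for (int i = 0; i < NR_DOMAINS; i++) {
                    struct domain_sketch *d =
                            &doms[(*cursor + i) % NR_DOMAINS];

                    /* Dispatch a batch only while a token is available
                     * for this domain. */
                    while (d->queued > 0 && d->tokens > 0) {
                            d->tokens--;
                            d->queued--;
                    }
            }
            *cursor = (*cursor + 1) % NR_DOMAINS;
    }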

    Various extensions are possible, including better heuristics and ionice
    support. The new scheduler isn't set as the default yet.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval