31 May, 2018

1 commit

  • Currently, kyber is very unfriendly to merging. kyber depends
    on the ctx rq_list to do merging; however, most of the time it
    will not leave any requests in the ctx rq_list. This is because
    even if the tokens of one domain are used up, kyber will try to
    dispatch requests from the other domains and flush the rq_list
    there.

    To improve this, we set up a kyber_ctx_queue (kcq), which is
    similar to ctx but has one rq_list per scheduling domain, and we
    build the same mapping between kcq and khd as between ctx and
    hctx. We can then merge, insert, and dispatch for each domain
    separately. At the same time, a kcq's rq_list is only flushed
    when a domain token is obtained successfully. So if one domain's
    tokens are used up, its requests are left on that domain's
    rq_list and may be merged with following I/O.
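
    A minimal sketch of the structure this introduces (simplified
    from the actual kyber code, where the lock also guards a bitmap
    of non-empty kcqs):

    struct kyber_ctx_queue {
            spinlock_t lock;        /* protects rq_list */
            /* one pending-request list per scheduling domain */
            struct list_head rq_list[KYBER_NUM_DOMAINS];
    };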

    The following are my test results on a machine with 8 cores and
    an INTEL SSDPEKKR128G7 NVMe card:

    fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
    (each cell shows sequential/random)
    +--------+----------+-----------+------------+-----------------+---------+
    | patch? | bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
    +--------+----------+-----------+------------+-----------------+---------+
    | w/o    | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 | 0/0     |
    | w/     | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
    +--------+----------+-----------+------------+-----------------+---------+
    With numjobs set to 16, bandwidth and IOPS reach 1662MB/s and
    425k on my platform.

    Signed-off-by: Jianchao Wang
    Tested-by: Holger Hoffstätte
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jianchao Wang

11 May, 2018

1 commit


09 May, 2018

1 commit

  • struct blk_issue_stat squashes three things into one u64:

    - The time the driver started working on a request
    - The original size of the request (for the io.low controller)
    - Flags for writeback throttling

    It turns out that on x86_64, we have a 4 byte hole in struct request
    which we can fill with the non-timestamp fields from blk_issue_stat,
    simplifying things quite a bit.
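
    A rough before/after sketch of the layout change (field names are
    illustrative, not the exact kernel definitions):

    /* Before: three values squashed into one u64 via shifts and masks. */
    struct blk_issue_stat {
            u64 stat;       /* issue time | original size | wbt flags */
    };

    /* After: the two small fields fill the 4-byte hole in struct
     * request, and the timestamp becomes a plain u64 field. */
    struct request {
            /* ... */
            unsigned short wbt_flags;     /* writeback throttling flags */
            unsigned short stats_size;    /* original size, for io.low */
            u64 io_start_time_ns;         /* when the driver started it */
            /* ... */
    };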

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval

25 Feb, 2018

1 commit

  • When requeueing a request, the domain token should be freed
    before re-inserting the request into the I/O scheduler. Otherwise,
    the assigned domain token is leaked, and an I/O hang can result.
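
    In outline, the fix enforces this ordering on the requeue path
    (function and helper names approximate):

    static void kyber_requeue_hook(struct request *rq)
    {
            struct kyber_queue_data *kqd = rq->q->elevator->elevator_data;

            rq_clear_domain_token(kqd, rq);   /* otherwise the token leaks */
            /* ...the request is then re-inserted into the scheduler... */
    }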

    Cc: Paolo Valente
    Cc: Omar Sandoval
    Cc: stable@vger.kernel.org
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei

07 Dec, 2017

1 commit

  • Commit 8cf466602028 ("kyber: fix hang on domain token wait queue") fixed
    a hang caused by leaving wait entries on the domain token wait queue
    after the __sbitmap_queue_get() retry succeeded, making that wait entry
    a "dud" which won't in turn wake more entries up. However, we can also
    get a dud entry if kyber_get_domain_token() fails once but is then
    called again and succeeds. This can happen if the hardware queue is
    rerun for some other reason, or, more likely, kyber_dispatch_request()
    tries the same domain twice.

    The fix is to remove our entry from the wait queue whenever we
    successfully get a token. The only complication is that we might be on
    one of many wait queues in the struct sbitmap_queue, but that's easily
    fixed by remembering which wait queue we were put on.

    While we're here, only initialize the wait queue entry once instead of
    on every wait, and use spin_lock_irq() instead of spin_lock_irqsave(),
    since this is always called from process context with irqs enabled.
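
    A simplified sketch of the resulting logic (names approximate):

    nr = __sbitmap_queue_get(domain_tokens);
    if (nr >= 0) {
            /* An earlier failed attempt may have left us on a wait
             * queue; remove ourselves from the one we remembered so
             * no "dud" entry is left behind. */
            ws = khd->domain_ws[sched_domain];
            spin_lock_irq(&ws->wait.lock);
            list_del_init(&wait->entry);
            spin_unlock_irq(&ws->wait.lock);
    }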

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval

01 Nov, 2017

1 commit


18 Oct, 2017

1 commit

  • When we're getting a domain token, if we fail to get a token on our
    first attempt, we put the current hardware queue on a wait queue and
    then try again just in case a token was freed after our initial attempt
    but before we got on the wait queue. If this second attempt succeeds, we
    currently leave the hardware queue on the wait queue. Usually this is
    okay; we'll just run the hardware queue one extra time when another
    token is freed. However, if the hardware queue doesn't have any other
    requests waiting, then when it gets the extra wakeup, it won't have
    anything to free and therefore won't wake up any other hardware queues.
    If tokens are limited, then we won't make forward progress and the
    device will hang.
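
    The pattern in question looks roughly like this (simplified):

    nr = __sbitmap_queue_get(domain_tokens);
    if (nr < 0) {
            add_wait_queue(&ws->wait, wait);
            /* retry, in case a token was freed between the first
             * attempt and getting onto the wait queue */
            nr = __sbitmap_queue_get(domain_tokens);
            /* bug: on success, the entry stayed on the wait queue as
             * a "dud" that never wakes up other hardware queues */
    }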

    Reported-by: Bin Zha
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval

04 Jul, 2017

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 years of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... it was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar

19 Jun, 2017

2 commits

  • This patch makes sure we always allocate requests in the core blk-mq
    code and use a common prepare_request method to initialize them for
    both mq I/O schedulers. For Kyber, an additional limit_depth method
    is added that is called before allocating the request.

    Also, because none of the initializations can really fail, the new
    method does not return an error - instead, the bfq finish method is
    hardened to deal with the no-IOC case.

    Last but not least, this removes the abuse of RQF_QUEUED by the
    blk-mq scheduling code, as RQF_ELVPRIV is all that is needed now.
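
    The per-scheduler hooks involved then look roughly like this
    (signatures approximate for the mq elevator ops of that era):

    struct elevator_mq_ops {
            /* called before request allocation, e.g. so Kyber can
             * shrink the allocation depth for async requests */
            void (*limit_depth)(unsigned int op, struct blk_mq_alloc_data *data);
            /* common per-scheduler initialization of a new request */
            void (*prepare_request)(struct request *rq);
            void (*finish_request)(struct request *rq);
            /* ... */
    };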

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • No need to have two different callouts of bfq vs kyber.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig

04 May, 2017

1 commit


21 Apr, 2017

1 commit

  • In order to allow filtering of I/O based on properties of the
    request other than direction, we allow the bucket function to
    return an int.

    If the bucket callback returns a negative value, do not count the
    request in the stats accumulation.
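
    An illustrative bucket callback under the new contract (the
    function itself is hypothetical):

    static int example_bucket_fn(const struct request *rq)
    {
            if (blk_rq_is_passthrough(rq))
                    return -1;      /* negative: not counted */
            return op_is_write(req_op(rq)) ? 1 : 0;  /* bucket index */
    }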

    Signed-off-by: Stephen Bates

    Fixed up Kyber scheduler stat callback.

    Signed-off-by: Jens Axboe

    Stephen Bates

15 Apr, 2017

1 commit

  • The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
    scale to multiple queues. Users configure only two knobs, the target
    read and synchronous write latencies, and the scheduler tunes itself to
    achieve that latency goal.

    The implementation is based on "tokens", built on top of the scalable
    bitmap library. Tokens serve as a mechanism for limiting requests. There
    are two tiers of tokens: queueing tokens and dispatch tokens.

    A queueing token is required to allocate a request. In fact, these
    tokens are actually the blk-mq internal scheduler tags, but the
    scheduler manages the allocation directly in order to implement its
    policy.

    Dispatch tokens are device-wide and split up into two scheduling
    domains: reads vs. writes. Each hardware queue dispatches batches
    round-robin between the scheduling domains as long as tokens are
    available for that domain.

    These tokens can be used as the mechanism to enable various policies.
    The policy Kyber uses is inspired by active queue management techniques
    for network routing, similar to blk-wbt. The scheduler monitors
    latencies and scales the number of dispatch tokens accordingly. Queueing
    tokens are used to prevent starvation of synchronous requests by
    asynchronous requests.

    Various extensions are possible, including better heuristics and ionice
    support. The new scheduler isn't set as the default yet.
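
    In outline, each hardware queue's dispatch loop behaves like this
    (heavily simplified, names approximate):

    /* cycle round-robin through the scheduling domains, dispatching
     * from a domain only while a dispatch token can be grabbed */
    for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
            domain = (khd->cur_domain + i) % KYBER_NUM_DOMAINS;
            while (!list_empty(&khd->rqs[domain]) &&
                   kyber_get_domain_token(kqd, khd, domain) >= 0)
                    dispatch_next_request(khd, domain);
    }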

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval