04 Sep, 2020

1 commit

  • Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
    with the goal of later being able to use a common shared tag bitmap across
    all HW contexts in a set.
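
    A minimal standalone sketch of the indirection this enables
    (illustrative names, not necessarily the real blk_mq_tags layout):
    lookups go through pointers, which can reference either the embedded
    per-set storage or one bitmap shared by every HW context in the set.

    /* Hypothetical stand-ins for the real sbitmap-based tag maps. */
    struct tag_bitmap {
            unsigned long *map;
            unsigned int depth;
    };

    struct tags_sketch {
            /* Embedded storage, used when the tags are not shared. */
            struct tag_bitmap __bitmap_tags;
            struct tag_bitmap __breserved_tags;
            /* Users go through these pointers, so they can later be
             * redirected to a single bitmap shared across all HW
             * contexts in the set. */
            struct tag_bitmap *bitmap_tags;
            struct tag_bitmap *breserved_tags;
    };

    static void tags_init_private(struct tags_sketch *t)
    {
            t->bitmap_tags = &t->__bitmap_tags;
            t->breserved_tags = &t->__breserved_tags;
    }

    static void tags_init_shared(struct tags_sketch *t,
                                 struct tag_bitmap *shared,
                                 struct tag_bitmap *shared_resv)
    {
            t->bitmap_tags = shared;
            t->breserved_tags = shared_resv;
    }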

    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    John Garry
     

02 Sep, 2020

1 commit


30 May, 2020

1 commit


03 Jul, 2019

1 commit

  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Jun, 2019

1 commit

  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead, so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove the
    pointless struct request_queue argument from any of the functions
    that had one and grew a nr_segs argument.
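
    As a rough illustration of that calling convention (plain C with
    hypothetical names, not the real blk_queue_split signature), the
    split helper reports the segment count to its caller instead of
    caching it in the bio:

    #include <stddef.h>

    /* Hypothetical bio: note there is no cached segment count field. */
    struct bio_sketch {
            size_t len;
    };

    /*
     * Split the bio if needed and report the number of segments of the
     * resulting bio through *nr_segs, so the submission path can hand
     * the value to the functions that need it instead of re-counting it
     * or reading it back out of the bio.
     */
    static void queue_split_sketch(struct bio_sketch **bio,
                                   unsigned int *nr_segs,
                                   size_t max_seg_len)
    {
            *nr_segs = (unsigned int)(((*bio)->len + max_seg_len - 1) /
                                      max_seg_len);
            /* A real implementation would also split *bio here when it
             * spans too many segments; omitted in this sketch. */
    }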

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 May, 2019

1 commit


21 Dec, 2018

1 commit

  • sbq_wake_ptr() checks sbq->ws_active to know if it needs to loop
    the wait indexes or not. This requires the use of the sbitmap
    waitqueue wrappers, but kyber doesn't use those for its domain
    token waitqueue handling.

    Convert kyber to use the helpers. This fixes a hang with waiting
    for domain tokens.

    Fixes: 5d2ee7122c73 ("sbitmap: optimize wakeup check")
    Tested-by: Ming Lei
    Reported-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Nov, 2018

3 commits

  • The mapping used to be dependent on just the CPU location, but
    now it's a tuple of (type, cpu) instead. This is a prep patch
    for allowing a single software queue to map to multiple hardware
    queues. No functional changes in this patch.

    This changes the software queue count to an unsigned short
    to save a bit of space. We can still support 64K-1 CPUs,
    which should be enough. Add a check to catch a wrap.
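
    A minimal userspace illustration of the narrowing and the wrap check
    (hypothetical names, not the blk-mq structures):

    #include <limits.h>
    #include <stdio.h>

    /* The per-hctx software queue count now fits in 16 bits. */
    struct hctx_sketch {
            unsigned short nr_ctx;
    };

    static int set_nr_ctx(struct hctx_sketch *h, unsigned int nr)
    {
            /* Catch a wrap: 64K-1 is the most this representation holds. */
            if (nr > USHRT_MAX) {
                    fprintf(stderr, "ctx count %u overflows unsigned short\n",
                            nr);
                    return -1;
            }
            h->nr_ctx = (unsigned short)nr;
            return 0;
    }

    int main(void)
    {
            struct hctx_sketch h;

            return set_nr_ctx(&h, 128);
    }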

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a remnant of when we had ops for both SQ and MQ
    schedulers. Now it's just MQ, so get rid of the union.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This removes a bunch of core and elevator related code. On the core
    front, we remove anything related to queue running, draining,
    initialization, plugging, and congestion. We also kill anything
    related to request allocation, merging, retrieval, and completion.

    Remove any checking for single queue IO schedulers, as they no
    longer exist. This means we can also delete a bunch of code related
    to request issue, adding, completion, etc - and all the SQ related
    ops and helpers.

    Also kill the load_default_modules(), as all that did was provide
    for a way to load the default single queue elevator.

    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Sep, 2018

1 commit

  • NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long.
    However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a
    32-bit long. Make sure all of the targets are calculated as 64-bit
    values.
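
    A small standalone example of the overflow in question, assuming a
    target where long is 32 bits:

    #include <stdint.h>
    #include <stdio.h>

    #define NSEC_PER_SEC 1000000000L

    int main(void)
    {
            /* Where long is 32 bits, 5 * NSEC_PER_SEC (5,000,000,000)
             * does not fit and the multiplication wraps. */
            long naive = 5 * NSEC_PER_SEC;

            /* Widening to 64 bits before the multiply yields the
             * intended target value. */
            int64_t fixed = 5 * (int64_t)NSEC_PER_SEC;

            printf("naive=%ld fixed=%lld\n", naive, (long long)fixed);
            return 0;
    }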

    Fixes: 6e25cb01ea20 ("kyber: implement improved heuristics")
    Reported-by: Stephen Rothwell
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

28 Sep, 2018

4 commits

  • When debugging Kyber, it's really useful to know what latencies we've
    been having, how the domain depths have been adjusted, and if we've
    actually been throttling. Add three tracepoints, kyber_latency,
    kyber_adjust, and kyber_throttled, to record that.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Kyber's current heuristics have a few flaws:

    - It's based on the mean latency, but p99 latency tends to be more
    meaningful to anyone who cares about latency. The mean can also be
    skewed by rare outliers that the scheduler can't do anything about.
    - The statistics calculations are purely time-based with a short window.
    This works for steady, high load, but is more sensitive to outliers
    with bursty workloads.
    - It only considers the latency once an I/O has been submitted to the
    device, but the user cares about the time spent in the kernel, as
    well.

    These are shortcomings of the generic blk-stat code which doesn't quite
    fit the ideal use case for Kyber. So, this replaces the statistics with
    a histogram used to calculate percentiles of total latency and I/O
    latency, which we then use to adjust depths in a slightly more
    intelligent manner:

    - Sync and async writes are now the same domain.
    - Discards are a separate domain.
    - Domain queue depths are scaled by the ratio of the p99 total latency
    to the target latency (e.g., if the p99 latency is double the target
    latency, we will double the queue depth; if the p99 latency is half of
    the target latency, we can halve the queue depth).
    - We use the I/O latency to determine whether we should scale queue
    depths down: we will only scale down if any domain's I/O latency
    exceeds the target latency, which is an indicator of congestion in the
    device.

    These new heuristics are just as scalable as the heuristics they
    replace.
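
    A hedged sketch of that scaling rule in plain C (illustrative only;
    the real code may step and clamp the depths differently):

    #include <stdint.h>

    static unsigned int adjust_depth(unsigned int depth,
                                     uint64_t p99_total_ns,
                                     uint64_t p99_io_ns,
                                     uint64_t target_ns,
                                     unsigned int max_depth)
    {
            /* Scale by the ratio of p99 total latency to the target:
             * double the target -> double the depth, half the target ->
             * halve the depth. */
            uint64_t new_depth = (uint64_t)depth * p99_total_ns / target_ns;

            /* Only allow scaling down when the device-side p99 exceeds
             * the target, i.e. the device itself looks congested. */
            if (new_depth < depth && p99_io_ns <= target_ns)
                    new_depth = depth;

            if (new_depth < 1)
                    new_depth = 1;
            if (new_depth > max_depth)
                    new_depth = max_depth;
            return (unsigned int)new_depth;
    }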

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • The domain token sbitmaps are currently initialized to the device queue
    depth or 256, whichever is larger, and immediately resized to the
    maximum depth for that domain (256, 128, or 64 for read, write, and
    other, respectively). The sbitmap is never resized larger than that, so
    it's unnecessary to allocate a bitmap larger than the maximum depth.
    Let's just allocate it to the maximum depth to begin with. This will use
    marginally less memory, and more importantly, give us a more appropriate
    number of bits per sbitmap word.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Commit 4bc6339a583c ("block: move blk_stat_add() to
    __blk_mq_end_request()") consolidated some calls using ktime_get() so
    we'd only need to call it once. Kyber's ->completed_request() hook also
    calls ktime_get(), so let's move it to the same place, too.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

31 May, 2018

1 commit

  • Currently, kyber is very unfriendly to merging. kyber depends on
    the ctx rq_list to do merging; however, most of the time it will
    not leave any requests in the ctx rq_list. This is because even if
    the tokens of one domain are used up, kyber will try to dispatch
    requests from other domains and flush the rq_list there.

    To improve this, we set up a kyber_ctx_queue (kcq), which is
    similar to a ctx but has one rq_list per domain, and build the same
    mapping between kcq and khd as between ctx and hctx. Then we can
    merge, insert and dispatch for the different domains separately,
    and only flush a kcq's rq_list once a domain token has been
    obtained successfully. So if one domain's tokens are used up, its
    requests can be left on that domain's rq_list and possibly merged
    with following IO.
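
    A minimal structural sketch of that per-domain staging (illustrative
    names only, not the kyber data structures):

    enum domain { DOM_READ, DOM_SYNC_WRITE, DOM_OTHER, DOM_MAX };

    struct req;                     /* opaque request for the sketch */

    struct req_list {
            struct req *head, *tail;
    };

    /* One kcq-like structure per software queue: requests are staged
     * per domain, so a domain that is out of tokens keeps its list (and
     * its merge opportunities) instead of having it flushed. */
    struct kcq_sketch {
            struct req_list rq_list[DOM_MAX];
    };

    static int try_get_domain_token(int d) { (void)d; return 1; }  /* stub */
    static void dispatch_list(struct req_list *l) { (void)l; }     /* stub */

    static void flush_kcq(struct kcq_sketch *kcq)
    {
            for (int d = 0; d < DOM_MAX; d++) {
                    /* Only flush a domain's list when a token for that
                     * domain was obtained; otherwise leave the requests
                     * in place so following IO can still merge with
                     * them. */
                    if (try_get_domain_token(d))
                            dispatch_list(&kcq->rq_list[d]);
            }
    }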

    Following is my test result on a machine with 8 cores and an NVMe
    card (INTEL SSDPEKKR128G7):

    fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
    seq/random
    +------+----------+-----------+------------+-----------------+---------+
    |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
    +------+----------+-----------+------------+-----------------+---------+
    | w/o  | 606/612  | 151k/153k | 6.89/7.03  | 3349.21/3305.40 |   0/0   |
    +------+----------+-----------+------------+-----------------+---------+
    | w/   | 1083/616 | 277k/154k | 4.93/6.95  | 1830.62/3279.95 | 223k/3k |
    +------+----------+-----------+------------+-----------------+---------+
    With numjobs set to 16, the bw and iops reach 1662MB/s and 425k
    on my platform.

    Signed-off-by: Jianchao Wang
    Tested-by: Holger Hoffstätte
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

11 May, 2018

1 commit


09 May, 2018

1 commit

  • struct blk_issue_stat squashes three things into one u64:

    - The time the driver started working on a request
    - The original size of the request (for the io.low controller)
    - Flags for writeback throttling

    It turns out that on x86_64, we have a 4 byte hole in struct request
    which we can fill with the non-timestamp fields from blk_issue_stat,
    simplifying things quite a bit.
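
    An illustrative before/after sketch (made-up layouts, not the actual
    kernel structures) of the difference between squashing the extra
    data into the timestamp word and simply placing it in the padding
    hole:

    #include <stdint.h>

    /* Before: steal high bits of one u64 for the extra data, leaving
     * fewer bits for the timestamp. */
    #define STAT_TIME_BITS  48
    #define STAT_TIME_MASK  ((UINT64_C(1) << STAT_TIME_BITS) - 1)

    static uint64_t pack_stat(uint64_t time_ns, uint16_t extra)
    {
            return (time_ns & STAT_TIME_MASK) |
                   ((uint64_t)extra << STAT_TIME_BITS);
    }

    static uint64_t stat_time(uint64_t packed)
    {
            return packed & STAT_TIME_MASK;
    }

    static uint16_t stat_extra(uint64_t packed)
    {
            return (uint16_t)(packed >> STAT_TIME_BITS);
    }

    /* After: the 4-byte alignment hole next to an existing int field
     * holds the non-timestamp data as ordinary fields, and the
     * timestamp keeps its full 64 bits. */
    struct request_sketch {
            int      existing_field;   /* 4 bytes ... */
            uint32_t size_and_flags;   /* ... the hole holds size/flags */
            uint64_t issue_time_ns;    /* full-width timestamp */
    };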

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

25 Feb, 2018

1 commit

  • When requeuing a request, the domain token should be freed before
    re-inserting the request into the I/O scheduler. Otherwise, the
    assigned domain token is leaked, and an IO hang can result.

    Cc: Paolo Valente
    Cc: Omar Sandoval
    Cc: stable@vger.kernel.org
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Dec, 2017

1 commit

  • Commit 8cf466602028 ("kyber: fix hang on domain token wait queue") fixed
    a hang caused by leaving wait entries on the domain token wait queue
    after the __sbitmap_queue_get() retry succeeded, making that wait entry
    a "dud" which won't in turn wake more entries up. However, we can also
    get a dud entry if kyber_get_domain_token() fails once but is then
    called again and succeeds. This can happen if the hardware queue is
    rerun for some other reason, or, more likely, kyber_dispatch_request()
    tries the same domain twice.

    The fix is to remove our entry from the wait queue whenever we
    successfully get a token. The only complication is that we might be on
    one of many wait queues in the struct sbitmap_queue, but that's easily
    fixed by remembering which wait queue we were put on.

    While we're here, only initialize the wait queue entry once instead of
    on every wait, and use spin_lock_irq() instead of spin_lock_irqsave(),
    since this is always called from process context with irqs enabled.
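
    A hedged userspace sketch of the fix (pthreads instead of the
    kernel's wait-queue API, hypothetical names): remember which wait
    queue the entry was put on, and drop it from that queue as soon as a
    token is acquired so no dud entry is left behind.

    #include <pthread.h>
    #include <stddef.h>

    struct wq_entry;

    struct wq_head {
            pthread_mutex_t lock;
            struct wq_entry *first;
    };

    struct wq_entry {
            struct wq_entry *next;
            struct wq_head *on;     /* queue we were added to, if any */
    };

    /* Unlink e from q; q->lock must be held. */
    static void wq_del(struct wq_head *q, struct wq_entry *e)
    {
            for (struct wq_entry **p = &q->first; *p; p = &(*p)->next) {
                    if (*p == e) {
                            *p = e->next;
                            break;
                    }
            }
    }

    /* Call whenever a domain token was successfully obtained. */
    static void remove_from_wait_queue(struct wq_entry *e)
    {
            struct wq_head *q = e->on;

            if (!q)
                    return;
            /* Always reached from process context with irqs enabled in
             * the scenario above, so a plain lock is enough here. */
            pthread_mutex_lock(&q->lock);
            wq_del(q, e);
            e->on = NULL;
            pthread_mutex_unlock(&q->lock);
    }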

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

01 Nov, 2017

1 commit


18 Oct, 2017

1 commit

  • When we're getting a domain token, if we fail to get a token on our
    first attempt, we put the current hardware queue on a wait queue and
    then try again just in case a token was freed after our initial attempt
    but before we got on the wait queue. If this second attempt succeeds, we
    currently leave the hardware queue on the wait queue. Usually this is
    okay; we'll just run the hardware queue one extra time when another
    token is freed. However, if the hardware queue doesn't have any other
    requests waiting, then when it gets the extra wakeup, it won't have
    anything to free and therefore won't wake up any other hardware queues.
    If tokens are limited, then we won't make forward progress and the
    device will hang.

    Reported-by: Bin Zha
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

04 Jul, 2017

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 years of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... it was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

19 Jun, 2017

2 commits

  • This patch makes sure we always allocate requests in the core blk-mq
    code and use a common prepare_request method to initialize them for
    both mq I/O schedulers. For Kyber, an additional limit_depth method
    is added that is called before allocating the request.

    Also, because none of the initializations can really fail, the new
    method does not return an error - instead the bfq finish method is
    hardened to deal with the no-IOC case.

    Last but not least, this removes the abuse of RQF_QUEUED by the
    blk-mq scheduling code, as RQF_ELVPRIV is all that is needed now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • No need to have two different callouts of bfq vs kyber.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 May, 2017

1 commit


21 Apr, 2017

1 commit

  • In order to allow for filtering of IO based on some property of the
    request other than direction, we allow the bucket function to
    return an int.

    If the bucket callback returns a negative value, do not count it in
    the stats accumulation.
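
    A small standalone sketch of that convention (hypothetical names):
    the bucket callback returns an int, and a negative return excludes
    the sample from the accumulation:

    #include <stdio.h>

    #define NR_BUCKETS 2

    struct sample {
            int dir;                    /* 0 = read, 1 = write */
            unsigned long long nsec;
    };

    /* The bucket function now returns an int; a negative value means
     * "do not account this sample at all". */
    static int bucket_fn(const struct sample *s)
    {
            if (s->nsec == 0)
                    return -1;          /* filtered out */
            return s->dir;
    }

    static void account(const struct sample *s,
                        unsigned long long counts[NR_BUCKETS])
    {
            int bucket = bucket_fn(s);

            if (bucket < 0)             /* negative: skip accumulation */
                    return;
            counts[bucket]++;
    }

    int main(void)
    {
            unsigned long long counts[NR_BUCKETS] = { 0, 0 };
            struct sample a = { .dir = 0, .nsec = 1000 };
            struct sample b = { .dir = 1, .nsec = 0 };

            account(&a, counts);
            account(&b, counts);        /* skipped: bucket_fn returns -1 */
            printf("%llu %llu\n", counts[0], counts[1]);
            return 0;
    }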

    Signed-off-by: Stephen Bates

    Fixed up Kyber scheduler stat callback.

    Signed-off-by: Jens Axboe

    Stephen Bates
     

15 Apr, 2017

1 commit

  • The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
    scale to multiple queues. Users configure only two knobs, the target
    read and synchronous write latencies, and the scheduler tunes itself to
    achieve that latency goal.

    The implementation is based on "tokens", built on top of the scalable
    bitmap library. Tokens serve as a mechanism for limiting requests. There
    are two tiers of tokens: queueing tokens and dispatch tokens.

    A queueing token is required to allocate a request. In fact, these
    tokens are actually the blk-mq internal scheduler tags, but the
    scheduler manages the allocation directly in order to implement its
    policy.

    Dispatch tokens are device-wide and split up into two scheduling
    domains: reads vs. writes. Each hardware queue dispatches batches
    round-robin between the scheduling domains as long as tokens are
    available for that domain.

    These tokens can be used as the mechanism to enable various policies.
    The policy Kyber uses is inspired by active queue management techniques
    for network routing, similar to blk-wbt. The scheduler monitors
    latencies and scales the number of dispatch tokens accordingly. Queueing
    tokens are used to prevent starvation of synchronous requests by
    asynchronous requests.
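
    A compact sketch of that dispatch policy (illustrative only, not the
    kyber implementation): each pass walks the scheduling domains
    round-robin and issues requests only while dispatch tokens remain
    for the domain.

    enum { DOM_READ, DOM_WRITE, NR_DOMAINS };

    struct domain_sketch {
            int tokens;     /* device-wide dispatch tokens remaining */
            int queued;     /* requests waiting in this domain */
    };

    /* One round-robin dispatch pass over the scheduling domains. */
    static void dispatch_round(struct domain_sketch doms[NR_DOMAINS],
                               int *cursor)
    {
            for (int i = 0; i < NR_DOMAINS; i++) {
                    struct domain_sketch *d =
                            &doms[(*cursor + i) % NR_DOMAINS];

                    /* Dispatch a batch only while a token is available
                     * for this domain. */
                    while (d->queued > 0 && d->tokens > 0) {
                            d->tokens--;
                            d->queued--;
                    }
            }
            *cursor = (*cursor + 1) % NR_DOMAINS;
    }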

    Various extensions are possible, including better heuristics and ionice
    support. The new scheduler isn't set as the default yet.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval