18 Jul, 2018

1 commit

  • Add tracking of REQ_OP_DISCARD ios to the partition statistics and
    append them to the various stat files in /sys as well as
    /proc/diskstats. These are tracked with the same four stats as reads
    and writes:

    Number of discard ios completed
    Number of discard ios merged
    Number of discard sectors completed
    Milliseconds spent on discard requests

    This is done by adding a new STAT_DISCARD define to genhd.h and then
    using it to index the stat fields for discard requests.
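
    As a rough sketch of the shape of this change (simplified from the
    genhd.h interface, not the exact upstream code), the stat-group enum
    gains a discard slot and each request op is mapped onto a group:

      enum stat_group {
              STAT_READ,
              STAT_WRITE,
              STAT_DISCARD,   /* new: discards indexed like reads/writes */
              NR_STAT_GROUPS
      };

      static inline int op_stat_group(unsigned int op)
      {
              if (op_is_discard(op))
                      return STAT_DISCARD;
              return op_is_write(op);   /* 0 = read, 1 = write */
      }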

    tj: Refreshed on top of v4.17 and other previous updates.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: Andy Newell
    Signed-off-by: Jens Axboe

    Michael Callahan
     

09 Jul, 2018

1 commit

  • Adds support for exposing a null_blk device through the zone device
    interface.

    The interface is managed with the parameters zoned and zone_size.
    If zoned is set, the null_blk instance registers as a zoned block
    device. The zone_size parameter defines how big each zone will be.
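
    A minimal sketch of how such module parameters are declared (the
    variable names here are illustrative):

      /* loaded with, e.g.: modprobe null_blk zoned=1 zone_size=256 */
      static bool g_zoned;
      module_param_named(zoned, g_zoned, bool, S_IRUGO);
      MODULE_PARM_DESC(zoned, "Register as a zoned block device. Default: false");

      static unsigned long g_zone_size = 256;
      module_param_named(zone_size, g_zone_size, ulong, S_IRUGO);
      MODULE_PARM_DESC(zone_size, "Zone size in MB. Default: 256");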

    Signed-off-by: Matias Bjørling
    Signed-off-by: Bart Van Assche
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Matias Bjørling
     

05 Jun, 2018

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "There's been a fair amount of work in the docs tree this time around,
    including:

    - Extensive RST conversions and organizational work in the
    memory-management docs thanks to Mike Rapoport.

    - An update of Documentation/features from Andrea Parri and a script
    to keep it updated.

    - Various LICENSES updates from Thomas, along with a script to check
    SPDX tags.

    - Work to fix dangling references to documentation files; this
    involved a fair number of one-liner comment changes outside of
    Documentation/

    ... and the usual list of documentation improvements, typo fixes, etc"

    * tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
    Documentation: document hung_task_panic kernel parameter
    docs/admin-guide/mm: add high level concepts overview
    docs/vm: move ksm and transhuge from "user" to "internals" section.
    docs: Use the kerneldoc comments for memalloc_no*()
    doc: document scope NOFS, NOIO APIs
    docs: update kernel versions and dates in tables
    docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
    docs/vm: transhuge: minor updates
    docs/vm: transhuge: change sections order
    Documentation: arm: clean up Marvell Berlin family info
    Documentation: gpio: driver: Fix a typo and some odd grammar
    docs: ramoops.rst: fix location of ramoops.txt
    scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
    docs: uio-howto.rst: use a code block to solve a warning
    mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
    w1: w1_io.c: fix a kernel-doc warning
    Documentation/process/posting: wrap text at 80 cols
    docs: admin-guide: add cgroup-v2 documentation
    Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
    Documentation: refcount-vs-atomic: Update reference to LKMM doc.
    ...

    Linus Torvalds
     

15 Nov, 2017

3 commits

  • BFQ currently creates, and updates, its own instance of the whole
    set of blkio statistics that cfq creates. Yet, from the comments
    of Tejun Heo in [1], it turned out that most of these statistics
    are meant/useful only for debugging. This commit makes BFQ create
    the latter, debugging statistics only if the option
    CONFIG_DEBUG_BLK_CGROUP is set.

    By doing so, this commit also enables BFQ to enjoy a high performance
    boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
    BFQ has to update far fewer statistics, and, in particular, not the
    heaviest to update. To give an idea of the benefits, if
    CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
    with 8 threads doing random I/O in parallel on null_blk (configured
    with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
    (+30%). We have measured similar or even much higher boosts with other
    CPUs: e.g., +45% with an ARM Cortex-A53 Octa-core. Our results have
    been obtained and can be reproduced very easily with the script in [1].

    [1] https://www.spinics.net/lists/linux-block/msg18943.html
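
    The mechanism, roughly: when CONFIG_DEBUG_BLK_CGROUP is off, the
    debugging stat helpers compile down to empty stubs (a simplified
    sketch, not the exact upstream code):

      #ifdef CONFIG_DEBUG_BLK_CGROUP
      void bfqg_stats_update_io_add(struct bfq_group *bfqg,
                                    struct bfq_queue *bfqq, unsigned int op);
      #else
      static inline void bfqg_stats_update_io_add(struct bfq_group *bfqg,
                                                  struct bfq_queue *bfqq,
                                                  unsigned int op) { }
      #endif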

    Suggested-by: Tejun Heo
    Suggested-by: Ulf Hansson
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Luca Miccio
     
  • bfq invokes various blkg_*stats_* functions to update the statistics
    contained in the special files blkio.bfq.* in the blkio controller
    groups, i.e., the I/O accounting related to the proportional-share
    policy provided by bfq. The execution of these functions takes a
    considerable percentage, about 40%, of the total per-request execution
    time of bfq (i.e., of the sum of the execution time of all the bfq
    functions that have to be executed to process an I/O request from its
    creation to its destruction). This reduces the request-processing
    rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
    the bfq functions that invoke blkg_*stats_* functions cannot be
    executed in parallel with the rest of the code of bfq, because both
    are executed under the same per-device scheduler lock.

    To reduce this slowdown, this commit moves, wherever possible, the
    invocation of these functions (more precisely, of the bfq functions
    that invoke blkg_*stats_* functions) outside the critical sections
    protected by the scheduler lock.

    With this change, and with all blkio.bfq.* statistics enabled, the
    throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
    i7-4850HQ, in case of 8 threads doing random I/O in parallel on
    null_blk, with the latter configured with 0 latency. We obtained the
    same or higher throughput boosts, up to +30%, with other processors
    (some figures are reported in the documentation). For our tests, we
    used the script [1], with which our results can be easily reproduced.

    NOTE. This commit still protects the invocation of blkg_*stats_*
    functions with the request_queue lock, because the group these
    functions are invoked on may otherwise disappear before or while these
    functions are executed. Fortunately, tests without even this lock
    show, by difference, that the serialization caused by this lock has
    little impact (at most a ~5% throughput reduction).

    [1] https://github.com/Algodev-github/IOSpeed
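
    Schematically, the dispatch path now updates the statistics only
    after the scheduler lock is dropped (a conceptual sketch; the stats
    helper name is illustrative):

      static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
      {
              struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
              struct request *rq;

              spin_lock_irq(&bfqd->lock);
              rq = __bfq_dispatch_request(hctx);  /* scheduling decision */
              spin_unlock_irq(&bfqd->lock);

              /* blkg_*stats_* updates happen here, outside the scheduler
               * lock (but still under the request_queue lock, see NOTE) */
              bfq_update_dispatch_stats(hctx, rq);
              return rq;
      }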

    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Luca Miccio
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • We have investigated more deeply the performance of BFQ, in terms of
    number of IOPS that can be processed by the CPU when BFQ is used as
    I/O scheduler. In more detail, using the script [1], we have measured
    the number of IOPS reached on top of a null block device configured
    with zero latency, as a function of the workload (sequential read,
    sequential write, random read, random write) and of the system (we
    considered desktops, laptops and embedded systems).

    Based on the resulting figures, with this commit we update the
    current, conservative IOPS range reported in BFQ documentation. In
    particular, the documentation now reports, for each of three different
    systems, the lowest number of IOPS obtained for that system with the
    above test (namely, the value obtained with the workload leading to
    the lowest IOPS).

    [1] https://github.com/Algodev-github/IOSpeed

    Reviewed-by: Lee Tibbert
    Signed-off-by: Paolo Valente
    Signed-off-by: Luca Miccio
    Signed-off-by: Jens Axboe

    Paolo Valente
     

01 Sep, 2017

2 commits

  • Many users have reported the lack of a HOWTO for properly configuring
    bfq as a function of the goal one wants to achieve (max
    responsiveness, max throughput, ...). In fact, all needed details are
    already provided in the documentation file bfq-iosched.txt. Yet the
    document lacks guidance on which parameter descriptions to look
    at. This commit adds some simple direction.

    Signed-off-by: Paolo Valente
    Reviewed-by: Jeremy Hickman
    Reviewed-by: Laurentiu Nicola
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In addition to containing some typos and stale sentences, the file
    bfq-iosched.txt still mentioned a set of sysfs parameters that have
    been removed from this version of bfq. This commit fixes all these
    issues.

    Signed-off-by: Paolo Valente
    Reviewed-by: Jeremy Hickman
    Reviewed-by: Laurentiu Nicola
    Signed-off-by: Jens Axboe

    Paolo Valente
     

04 Jul, 2017

1 commit

  • Currently all integrity prep hooks are open-coded, and if prepare fails
    we ignore its error code and fail the bio with EIO. Let's return the
    real error to the upper layer, so the caller may react accordingly.

    In fact no one wants to use bio_integrity_prep() w/o bio_integrity_enabled,
    so it is reasonable to fold it into one function.
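
    After the fold, a caller's check reduces to something like this (a
    sketch; per the note below, the function now returns a bool and
    completes the bio itself on failure):

      if (!bio_integrity_prep(bio))
              return;   /* bio already failed with the real error code */
      /* ... go on to submit the bio ... */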

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Martin K. Petersen
    [hch: merged with the latest block tree,
    return bool from bio_integrity_prep]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     

19 Jun, 2017

1 commit

  • bio_clone() is no longer used.
    Only bio_clone_bioset() or bio_clone_fast().
    This is for the best, as bio_clone() used fs_bio_set,
    and filesystems are unlikely to want to use bio_clone().

    So remove bio_clone() and all references.
    This includes a fix to some incorrect documentation.
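
    Callers now choose an explicit variant instead (a sketch; the bio_set
    name is illustrative):

      /* fast clone sharing the biovec, from a caller-owned bio_set */
      struct bio *clone = bio_clone_fast(bio, GFP_NOIO, my_bioset);

      /* or a full clone when the biovec must be copied */
      struct bio *copy = bio_clone_bioset(bio, GFP_NOIO, my_bioset);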

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     

10 May, 2017

1 commit

  • The introduction of the BFQ and Kyber I/O schedulers has triggered a
    new wave of I/O benchmarks. Unfortunately, comments and discussions on
    these benchmarks confirm that there is still little awareness that it
    is very hard to achieve, at the same time, a low latency and a high
    throughput. In particular, virtually all benchmarks measure
    throughput, or throughput-related figures of merit, but, for BFQ, they
    use the scheduler in its default configuration. This configuration is
    geared, instead, toward a low latency. This is evidently a sign that
    BFQ documentation is still too unclear on this important aspect. This
    commit addresses this issue by stressing how BFQ configuration must be
    (easily) changed if the only goal is maximum throughput.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

19 Apr, 2017

3 commits

  • This patch introduces a simple heuristic to load applications quickly,
    and to perform the I/O requested by interactive applications just as
    quickly. To this purpose, both a newly-created queue and a queue
    associated with an interactive application (we explain in a moment how
    BFQ decides whether the associated application is interactive),
    receive the following two special treatments:

    1) The weight of the queue is raised.

    2) The queue unconditionally enjoys device idling when it empties; in
    fact, if the requests of a queue are sync, then performing device
    idling for the queue is a necessary condition to guarantee that the
    queue receives a fraction of the throughput proportional to its weight
    (see [1] for details).

    For brevity, we call just weight-raising the combination of these
    two preferential treatments. For a newly-created queue,
    weight-raising starts immediately and lasts for a time interval that:
    1) depends on the device speed and type (rotational or
    non-rotational), and 2) is equal to the time needed to load (start up)
    a large-size application on that device, with cold caches and with no
    additional workload.

    Finally, as for guaranteeing a fast execution to interactive,
    I/O-related tasks (such as opening a file), consider that any
    interactive application blocks and waits for user input both after
    starting up and after executing some task. After a while, the user may
    trigger new operations, after which the application stops again, and
    so on. Accordingly, the low-latency heuristic weight-raises again a
    queue in case it becomes backlogged after being idle for a
    sufficiently long (configurable) time. The weight-raising then lasts
    for the same time as for a just-created queue.

    According to our experiments, the combination of this low-latency
    heuristic and of the improvements described in the previous patch
    allows BFQ to guarantee a high application responsiveness.

    [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.
    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
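
    In rough pseudocode terms (field and helper names follow bfq's style
    but are simplified here):

      /* on (re)activation of a queue */
      if (bfq_bfqq_just_created(bfqq) ||
          idle_time >= bfqd->bfq_wr_min_idle_time) {
              bfqq->wr_coeff = bfqd->bfq_wr_coeff;           /* raise weight */
              bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); /* device-dependent */
      }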

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Add complete support for full hierarchical scheduling, with a cgroups
    interface. Full hierarchical scheduling is implemented through the
    'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
    associated with processes, and groups are represented in general by
    entities. Given the bfq_queues associated with the processes belonging
    to a given group, the entities representing these queues are sons of
    the entity representing the group. At higher levels, if a group, say
    G, contains other groups, then the entity representing G is the parent
    entity of the entities representing the groups in G.

    Hierarchical scheduling is performed as follows: if the timestamps of
    a leaf entity (i.e., of a bfq_queue) change, and such a change lets
    the entity become the next-to-serve entity for its parent entity, then
    the timestamps of the parent entity are recomputed as a function of
    the budget of its new next-to-serve leaf entity. If the parent entity
    belongs, in its turn, to a group, and its new timestamps let it become
    the next-to-serve for its parent entity, then the timestamps of the
    latter parent entity are recomputed as well, and so on. When a new
    bfq_queue must be set in service, the reverse path is followed: the
    next-to-serve highest-level entity is chosen, then its next-to-serve
    child entity, and so on, until the next-to-serve leaf entity is
    reached, and the bfq_queue that this entity represents is set in
    service.

    Writeback is accounted for on a per-group basis, i.e., for each group,
    the async I/O requests of the processes of the group are enqueued in a
    distinct bfq_queue, and the entity associated with this queue is a
    child of the entity associated with the group.

    Weights can be assigned explicitly to groups and processes through the
    cgroups interface, differently from what happens, for single
    processes, if the cgroups interface is not used (as explained in the
    description of the previous patch). In particular, since each node has
    a full scheduler, each group can be assigned its own weight.
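
    The entity abstraction boils down to a tree of nodes along these
    lines (a simplified sketch of the real struct):

      struct bfq_entity {
              u64 start, finish;          /* B-WF2Q+ virtual timestamps */
              int budget;                 /* service allotted, in sectors */
              int weight;
              struct bfq_entity *parent;  /* entity of the containing
                                           * group; NULL at the root */
      };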

    Signed-off-by: Fabio Checconi
    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Arianna Avanzini
     
  • We tag as v0 the version of BFQ containing only BFQ's engine plus
    hierarchical support. BFQ's engine is introduced by this commit, while
    hierarchical support is added by next commit. We use the v0 tag to
    distinguish this minimal version of BFQ from the versions containing
    also the features and the improvements added by next commits. BFQ-v0
    coincides with the version of BFQ submitted a few years ago [1], apart
    from the introduction of preemption, described below.

    BFQ is a proportional-share I/O scheduler, whose general structure,
    plus a lot of code, are borrowed from CFQ.

    - Each process doing I/O on a device is associated with a weight and a
    (bfq_)queue.

    - BFQ grants exclusive access to the device, for a while, to one queue
    (process) at a time, and implements this service model by
    associating every queue with a budget, measured in number of
    sectors.

    - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

    - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
    holding the device for too long and dramatically reducing
    throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
    sync requests may not be expired immediately when it empties.
    Instead, BFQ may idle the device for a short time interval,
    giving the process the chance to go on being served if it issues
    a new request in time. Device idling typically boosts the
    throughput on rotational devices, if processes do synchronous
    and sequential I/O. In addition, under BFQ, device idling is
    also instrumental in guaranteeing the desired throughput
    fraction to processes issuing sync requests (see [2] for
    details).

    - With respect to idling for service guarantees, if several
    processes are competing for the device at the same time, but
    all processes (and groups, after the following commit) have
    the same weight, then BFQ guarantees the expected throughput
    distribution without ever idling the device. Throughput is
    thus as high as possible in this common scenario.

    - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in the next commit, which focuses
    exactly on this feature.

    - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

    - The last, budget-independence, property (although probably
    counterintuitive in the first place) is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
    deviation with respect to an ideal service is proportional to
    the maximum budget (slice) assigned to queues. As a consequence,
    BFQ can keep this deviation tight not only because of the
    accurate service of B-WF2Q+, but also because BFQ *does not*
    need to assign a larger budget to a queue to let the queue
    receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
    budget that best fits the needs of the process, or best
    leverages the I/O pattern of the process. In particular, BFQ
    updates queue budgets with a simple feedback-loop algorithm that
    allows a high throughput to be achieved, while still providing
    tight latency guarantees to time-sensitive applications. When
    the in-service queue expires, this algorithm computes the next
    budget of the queue so as to:

    - Let large budgets be eventually assigned to the queues
    associated with I/O-bound applications performing sequential
    I/O: in fact, the longer these applications are served once
    granted access to the device, the higher the throughput is.

    - Let small budgets be eventually assigned to the queues
    associated with time-sensitive applications (which typically
    perform sporadic and short I/O), because, the smaller the
    budget assigned to a queue waiting for service is, the sooner
    B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

    - Weights can be assigned to processes only indirectly, through I/O
    priorities, and according to the relation:
    weight = 10 * (IOPRIO_BE_NR - ioprio)
    (a worked sketch of this mapping follows the references below).
    The next patch provides, instead, a cgroups interface through which
    weights can be assigned explicitly.

    - If several processes are competing for the device at the same time,
    but all processes and groups have the same weight, then BFQ
    guarantees the expected throughput distribution without ever idling
    the device. It uses preemption instead. Throughput is then much
    higher in this common scenario.

    - ioprio classes are served in strict priority order, i.e.,
    lower-priority queues are not served as long as there are
    higher-priority queues. Among queues in the same class, the
    bandwidth is distributed in proportion to the weight of each
    queue. A very thin extra bandwidth is however guaranteed to the Idle
    class, to prevent it from starving.

    - If the strict_guarantees parameter is set (default: unset), then BFQ
    - always performs idling when the in-service queue becomes empty;
    - forces the device to serve one I/O request at a time, by
    dispatching a new request only if there is no outstanding
    request.
    In the presence of differentiated weights or I/O-request sizes,
    both the above conditions are needed to guarantee that every
    queue receives its allotted share of the bandwidth (see
    Documentation/block/bfq-iosched.txt for more details). Setting
    strict_guarantees may evidently affect throughput.

    [1] https://lkml.org/lkml/2008/4/1/234
    https://lkml.org/lkml/2008/11/11/148

    [2] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf
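
    A worked sketch of the ioprio-to-weight mapping mentioned above
    (IOPRIO_BE_NR is 8, so best-effort priorities 0..7 map to weights
    80 down to 10):

      #define IOPRIO_BE_NR  8   /* number of best-effort ioprio levels */

      static inline int bfq_ioprio_to_weight(int ioprio)
      {
              /* ioprio 0 (highest) -> 80, ioprio 7 (lowest) -> 10 */
              return 10 * (IOPRIO_BE_NR - ioprio);
      }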

    Signed-off-by: Fabio Checconi
    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     

15 Apr, 2017

1 commit

  • The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
    scale to multiple queues. Users configure only two knobs, the target
    read and synchronous write latencies, and the scheduler tunes itself to
    achieve that latency goal.

    The implementation is based on "tokens", built on top of the scalable
    bitmap library. Tokens serve as a mechanism for limiting requests. There
    are two tiers of tokens: queueing tokens and dispatch tokens.

    A queueing token is required to allocate a request. In fact, these
    tokens are actually the blk-mq internal scheduler tags, but the
    scheduler manages the allocation directly in order to implement its
    policy.

    Dispatch tokens are device-wide and split up into two scheduling
    domains: reads vs. writes. Each hardware queue dispatches batches
    round-robin between the scheduling domains as long as tokens are
    available for that domain.

    These tokens can be used as the mechanism to enable various policies.
    The policy Kyber uses is inspired by active queue management techniques
    for network routing, similar to blk-wbt. The scheduler monitors
    latencies and scales the number of dispatch tokens accordingly. Queueing
    tokens are used to prevent starvation of synchronous requests by
    asynchronous requests.

    Various extensions are possible, including better heuristics and ionice
    support. The new scheduler isn't set as the default yet.
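
    Structurally, the token pools can be pictured like this (a sketch
    modeled on the scalable bitmap library the commit builds on):

      enum {
              KYBER_READ,
              KYBER_SYNC_WRITE,
              KYBER_OTHER,
              KYBER_NUM_DOMAINS,
      };

      struct kyber_queue_data {
              /* one pool of dispatch tokens per scheduling domain; the
               * depth of each pool is scaled as latencies are observed */
              struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
      };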

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

28 Mar, 2017

1 commit

  • throtl_slice is important for blk-throttling. It's called a slice
    internally, but it really is the time window over which blk-throttling
    samples data. blk-throttling then makes decisions based on these
    samples. An example is bandwidth measurement: a cgroup's bandwidth is
    measured over the time interval of throtl_slice.

    A small throtl_slice means cgroups have smoother throughput but burn
    more CPU. It has a 100ms default value, which is not appropriate for
    all disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch
    makes it tunable.

    Since throtl_slice isn't a time slice, the sysfs name
    'throttle_sample_time' reflects its character better.
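
    Conceptually, the change replaces a compile-time constant with
    per-queue state (a simplified sketch):

      /* before: one fixed, global sampling window */
      static int throtl_slice = HZ / 10;   /* 100 ms */

      /* after: per-queue, settable via queue/throttle_sample_time */
      struct throtl_data {
              unsigned int throtl_slice;
              /* ... */
      };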

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

23 Feb, 2017

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "A slightly quieter cycle for documentation this time around.

    Three more DocBook template files have been converted to RST; only 21
    to go. There are various build improvements and the usual array of
    documentation improvements and fixes"

    * tag 'docs-4.11' of git://git.lwn.net/linux: (44 commits)
    docs / driver-api: Fix structure references in device_link.rst
    PM / docs: Fix structure references in device.rst
    Add a target to check broken external links in the Documentation
    Documentation: Fix linux-api list typo
    Documentation: DocBook/Makefile comment typo
    Improve sparse documentation
    Documentation: make Makefile.sphinx no-ops quieter
    Documentation: DMA-ISA-LPC.txt
    Documentation: input: fix path to input code definitions
    docs: Remove the copyright year from conf.py
    docs: Fix a warning in the Korean HOWTO.rst translation
    PM / sleep / docs: Convert PM notifiers document to reST
    PM / core / docs: Convert sleep states API document to reST
    PM / core: Update kerneldoc comments in pm.h
    doc-rst: Fix recursive make invocation from macros
    doc-rst: Delete output of failed dot-SVG conversion
    doc-rst: Break shell command sequences on failure
    Documentation/sphinx: make targets independent of Sphinx work for HAVE_SPHINX=0
    doc-rst: fixed cleandoc target when used with O=dir
    Documentation/sphinx: prevent generation of .pyc files in the source tree
    ...

    Linus Torvalds
     

16 Nov, 2016

1 commit

  • If CONFIG_NVM is disabled, loading the null_blk module with
    use_lightnvm=1 fails, but no message or documentation explains the
    failure.

    Add the appropriate error message.
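
    The fix amounts to a check of roughly this shape (a sketch):

      #ifndef CONFIG_NVM
              if (use_lightnvm) {
                      pr_warn("null_blk: CONFIG_NVM needs to be enabled for LightNVM\n");
                      return -EINVAL;
              }
      #endif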

    Signed-off-by: Yasuaki Ishimatsu

    Massaged the text a bit.

    Signed-off-by: Jens Axboe

    Yasuaki Ishimatsu
     

11 Nov, 2016

1 commit

  • Enable throttling of buffered writeback to make it a lot smoother,
    with far less impact on other system activity. Background writeback
    should be, by definition, background activity. The fact that we flush
    huge bundles of it at a time means that it potentially has a heavy
    impact on foreground workloads, which isn't ideal. We can't easily
    limit the sizes of the writes that we do, since that would impact
    file system layout in the presence of delayed allocation. So just
    throttle back buffered writeback, unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration in the
    CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply increment the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.
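
    The latency test itself can be pictured like this (conceptual
    pseudocode only; all names are illustrative, not the driver's
    actual symbols):

      /* at the end of each monitoring window: did even the best-case
       * request miss the target? The minimum filters out queueing noise. */
      if (min_latency_in_window > latency_target) {
              scale_step++;                    /* shrink the queue depth */
              window_len = shrink(window_len); /* and the next window */
      }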

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Nov, 2016

1 commit

  • Noidle should be the default for writes as seen by all the compound
    definitions in fs.h using it. In fact only direct I/O really should
    be using NOIDLE, so turn the whole flag around to get the defaults
    right, which will make our life much easier, especially once the
    WRITE_* defines go away.

    This assumes all the existing "raw" users of REQ_SYNC for writes
    want noidle behavior, which seems to be spot on from a quick audit.
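
    After the flip, the two cases look like this (illustrative; the
    WRITE_* composite defines aside):

      /* ordinary sync write: no idling, the new default */
      bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;

      /* direct I/O opts back in to idling explicitly */
      bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;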

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

28 Oct, 2016

2 commits

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.
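
    The resulting layout can be sketched as follows (matching the general
    shape of the upstream definitions):

      #define REQ_OP_BITS   8
      #define REQ_OP_MASK   ((1 << REQ_OP_BITS) - 1)

      /* the op lives in the low bits, the common flags above it */
      #define bio_op(bio)   ((bio)->bi_opf & REQ_OP_MASK)
      #define req_op(req)   ((req)->cmd_flags & REQ_OP_MASK)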

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • A lot of the REQ_* flags are only used on struct requests, and only of
    use to the block layer and a few drivers that dig into struct request
    internals.

    This patch adds a new req_flags_t rq_flags field to struct request for
    them, and thus dramatically shrinks the number of common request flags. It
    also removes the unfortunate situation where we have to fit the fields
    from the same enum into 32 bits for struct bio and 64 bits for
    struct request.
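
    Sketched, the split looks like this (simplified from the upstream
    definitions):

      typedef __u32 __bitwise req_flags_t;

      #define RQF_SORTED    ((__force req_flags_t)(1 << 0))
      #define RQF_STARTED   ((__force req_flags_t)(1 << 1))

      struct request {
              unsigned int cmd_flags;   /* op and common REQ_* flags */
              req_flags_t rq_flags;     /* request-only RQF_* flags */
              /* ... */
      };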

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Sep, 2016

1 commit

  • commit e1defc4ff0cf57aca6c5e3ff99fa503f5943c1f1
    "block: Do away with the notion of hardsect_size"
    removed the notion of "hardware sector size" from
    the kernel in favor of logical block size, but
    references remain in comments and documentation.

    Update the remaining sites mentioning hardsect.

    Signed-off-by: Linus Walleij
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Linus Walleij
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.
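
    The member becomes bi_opf, so stale code now fails to build
    (illustrative):

      bio->bi_rw  |= REQ_SYNC;   /* no longer compiles: no member 'bi_rw' */
      bio->bi_opf |= REQ_SYNC;   /* the renamed field */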

    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Jul, 2016

1 commit

  • Pull vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    Probably the most interesting part long-term is ->d_init() - that will
    have a bunch of followups in (at least) ceph and lustre, but we'll
    need to sort the barrier-related rules before it can get used for
    really non-trivial stuff.

    Another fun thing is the merge of ->d_iput() callers (dentry_iput()
    and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
    except the one in __d_lookup_lru())"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    fs/dcache.c: avoid soft-lockup in dput()
    vfs: new d_init method
    vfs: Update lookup_dcache() comment
    bdev: get rid of ->bd_inodes
    Remove last traces of ->sync_page
    new helper: d_same_name()
    dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
    vfs: clean up documentation
    vfs: document ->d_real()
    vfs: merge .d_select_inode() into .d_real()
    unify dentry_iput() and dentry_unlink_inode()
    binfmt_misc: ->s_root is not going anywhere
    drop redundant ->owner initializations
    ufs: get rid of redundant checks
    orangefs: constify inode_operations
    missed comment updates from ->direct_IO() prototype change
    file_inode(f)->i_mapping is f->f_mapping
    trim fsnotify hooks a bit
    9p: new helper - v9fs_parent_fid()
    debugfs: ->d_parent is never NULL or negative
    ...

    Linus Torvalds