04 Jul, 2017

4 commits

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside from consolidating code, this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and re-enable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: this also contains the BLOCK-MQ and NVME changes which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devres (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 years of incremental scheduler
    code history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     
  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particular around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"
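
    As a reference for the first item above, a sketch of the new status
    code type (the full set of codes lives in include/linux/blk_types.h):

        /* sketch: one self-documenting status type for the whole block layer */
        typedef u8 __bitwise blk_status_t;

        #define BLK_STS_OK      0
        #define BLK_STS_NOTSUPP ((__force blk_status_t)1)
        #define BLK_STS_TIMEOUT ((__force blk_status_t)2)
        #define BLK_STS_NOSPC   ((__force blk_status_t)3)
        /* ... plus BLK_STS_IOERR, BLK_STS_RESOURCE, etc. */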

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
     
  • Pull uuid subsystem from Christoph Hellwig:
    "This is the new uuid subsystem, in which Amir, Andy and I have started
    consolidating our uuid/guid helpers and improving the types used for
    them. Note that various other subsystems have pulled in this tree, so
    I'd like it to go in early.

    UUID/GUID summary:

    - introduce the new uuid_t/guid_t types that are going to replace the
    somewhat confusing uuid_be/uuid_le types and make the terminology
    fit the various specs, as well as the userspace libuuid library.
    (me, based on a previous version from Amir)

    - consolidated generic uuid/guid helper functions lifted from XFS and
    libnvdimm (Amir and me)

    - conversions to the new types and helpers (Amir, Andy and me)"
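
    For reference, a sketch of the new types and a few of the consolidated
    helpers (see include/linux/uuid.h for the full API):

        typedef struct {
                __u8 b[16];
        } guid_t;       /* little-endian; what used to be uuid_le */

        typedef struct {
                __u8 b[16];
        } uuid_t;       /* big-endian; RFC 4122 / libuuid style */

        /* helpers (mostly inline in the header) */
        bool uuid_equal(const uuid_t *u1, const uuid_t *u2);
        bool uuid_is_null(const uuid_t *uuid);
        void uuid_copy(uuid_t *dst, const uuid_t *src);
        int uuid_parse(const char *str, uuid_t *u);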

    * tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
    ACPI: hns_dsaf_acpi_dsm_guid can be static
    mmc: sdhci-pci: make guid intel_dsm_guid static
    uuid: Take const on input of uuid_is_null() and guid_is_null()
    thermal: int340x_thermal: fix compile after the UUID API switch
    thermal: int340x_thermal: Switch to use new generic UUID API
    acpi: always include uuid.h
    ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
    ACPI / extlog: Switch to use new generic UUID API
    ACPI / bus: Switch to use new generic UUID API
    ACPI / APEI: Switch to use new generic UUID API
    acpi, nfit: Switch to use new generic UUID API
    MAINTAINERS: add uuid entry
    tmpfs: generate random sb->s_uuid
    scsi_debug: switch to uuid_t
    nvme: switch to uuid_t
    sysctl: switch to use uuid_t
    partitions/ldm: switch to use uuid_t
    overlayfs: use uuid_t instead of uuid_be
    fs: switch ->s_uuid to uuid_t
    ima/policy: switch to use uuid_t
    ...

    Linus Torvalds
     

29 Jun, 2017

4 commits

    This patch performs sequential mapping between CPUs and queues. In
    case the system has more CPUs than HWQs, there are still CPUs left to
    map to HWQs. On a hyperthreaded system, map those unmapped CPUs and
    their siblings to the same HWQ.
    This actually fixes a bug where unmapped HWQs were found on a system
    with 2 sockets, 18 cores per socket and 2 threads per core (72 CPUs in
    total) running NVMEoF (which opens up to a maximum of 64 HWQs).
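
    A minimal sketch of the resulting mapping logic, simplified from
    block/blk-mq-cpumap.c (helper names as in that file):

        for_each_possible_cpu(cpu) {
                if (cpu < nr_queues) {
                        /* first, sequential mapping between CPUs and queues */
                        map[cpu] = cpu_to_queue_index(nr_queues, cpu);
                } else {
                        /*
                         * Out of queues: siblings share the HWQ of their
                         * first sibling; a CPU that is its own first
                         * sibling falls back to the wrap-around index.
                         */
                        first_sibling = get_first_sibling(cpu);
                        if (first_sibling == cpu)
                                map[cpu] = cpu_to_queue_index(nr_queues, cpu);
                        else
                                map[cpu] = map[first_sibling];
                }
        }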

    Performance results from running fio (72 jobs, iodepth 128)
    against null_blk (with/without the patch):

    bs     IOPS(read, sq=72)   IOPS(write, sq=72)   IOPS(read, sq=24)   IOPS(write, sq=24)
    -----  ------------------  -------------------  ------------------  -------------------
    512    4890.4K/4723.5K     4524.7K/4324.2K      4280.2K/4264.3K     3902.4K/3909.5K
    1k     4910.1K/4715.2K     4535.8K/4309.6K      4296.7K/4269.1K     3906.8K/3914.9K
    2k     4906.3K/4739.7K     4526.7K/4330.6K      4301.1K/4262.4K     3890.8K/3900.1K
    4k     4918.6K/4730.7K     4556.1K/4343.6K      4297.6K/4264.5K     3886.9K/3893.9K
    8k     4906.4K/4748.9K     4550.9K/4346.7K      4283.2K/4268.8K     3863.4K/3858.2K
    16k    4903.8K/4782.6K     4501.5K/4233.9K      4292.3K/4282.3K     3773.1K/3773.5K
    32k    4885.8K/4782.4K     4365.9K/4184.2K      4307.5K/4289.4K     3780.3K/3687.3K
    64k    4822.5K/4762.7K     2752.8K/2675.1K      4308.8K/4312.3K     2651.5K/2655.7K
    128k   2388.5K/2313.8K     1391.9K/1375.7K      2142.8K/2152.2K     1395.5K/1374.2K

    (sq = submit_queues)

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • Wen reports significant memory leaks with DIF and O_DIRECT:

    "With nvme devive + T10 enabled, On a system it has 256GB and started
    logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
    it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
    leaking.

    /proc/meminfo | grep SUnreclaim...

    SUnreclaim: 6752128 kB
    SUnreclaim: 6874880 kB
    SUnreclaim: 7238080 kB
    ....
    SUnreclaim: 22307264 kB
    SUnreclaim: 22485888 kB
    SUnreclaim: 22720256 kB

    When testcases with T10 enabled call into __blkdev_direct_IO_simple,
    the code doesn't free the memory allocated by bio_integrity_alloc. The
    patch fixes the issue. HTX has run for 60+ hours without failure."

    Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
    doesn't go through the regular bio free. This means that any ancillary
    data allocated with the bio through the stack is not freed. Hence, we
    can leak the integrity data associated with the bio, if the device is
    using DIF/DIX.

    Fix this by providing a bio_uninit() and export it, so that we can use
    it to free this data. Note that this is a minimal fix for this issue.
    Any current user of bios that are allocated outside of
    bio_alloc_bioset() suffers from this issue, most notably some drivers.
    We will fix those in a more comprehensive patch for 4.13. This also
    means that the commit marked as being fixed by this isn't the real
    culprit, it's just the most obvious one out there.
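
    A simplified sketch of the resulting pattern in
    __blkdev_direct_IO_simple():

        struct bio bio;         /* on-stack bio: bio_put() is never called */

        bio_init(&bio, vecs, nr_pages);
        /* ... fill in the bio, submit it and wait for completion ... */

        /* tear down ancillary data (e.g. the integrity payload) by hand */
        bio_uninit(&bio);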

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Reported-by: Wen Xiong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently we only create hctx for online CPUs, which can lead to a lot
    of churn due to frequent soft offline / online operations. Instead
    allocate one for each present CPU to avoid this and dramatically simplify
    the code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     
  • This way we get a nice distribution independent of the current cpu
    online / offline state.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-2-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     

28 Jun, 2017

10 commits

    This commit fixes a bug triggered by a non-trivial sequence of
    events. These events are briefly described in the next two
    paragraphs. The impatient, or those who are familiar with queue
    merging and splitting, can jump directly to the last paragraph.

    On each I/O-request arrival for a shared bfq_queue, i.e., for a
    bfq_queue that is the result of the merge of two or more bfq_queues,
    BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
    many random I/O requests have arrived for the bfq_queue; if the device
    is non-rotational, then random requests must also be small for the
    bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
    detected as seeky, then a split occurs: the bfq I/O context of the
    process that has issued the request is redirected from the shared
    bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
    shared bfq_queue actually happens to be shared only by one process
    (because of previous splits), then no new bfq_queue is created: the
    state of the shared bfq_queue is just changed from shared to
    non-shared.

    Regardless of whether a brand new non-shared bfq_queue is created, or
    the pre-existing shared bfq_queue is just turned into a non-shared
    bfq_queue, several parameters of the non-shared bfq_queue are set
    (restored) to the original values they had when the bfq_queue
    associated with the bfq I/O context of the process (that has just
    issued an I/O request) was merged with the shared bfq_queue. One of
    these parameters is the weight-raising state.

    If, on the split of a shared bfq_queue,
    1) a pre-existing shared bfq_queue is turned into a non-shared
    bfq_queue;
    2) the previously shared bfq_queue happens to be busy;
    3) the weight-raising state of the previously shared bfq_queue happens
    to change;
    the number of weight-raised busy queues changes. The field
    wr_busy_queues must then be updated accordingly, but such an update
    was missing. This commit adds the missing update.

    Reported-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
    Instead, move it to the callers. Callers that either don't use
    bio_data() or page_address(), or that are specific to architectures
    without highmem support, are skipped.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    And just move it into scsi_transport_sas, which needs it because
    low-level drivers directly dereference bio_data, and into
    blk_init_queue_node, which will need a further push into the callers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • For historical reasons we default to bouncing highmem pages for all block
    queues. But the blk-mq drivers are easy to audit to ensure that we don't
    need this - scsi and mtip32xx set explicit limits and everyone else doesn't
    have any particular ones.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We only call blk_queue_bounce for request-based drivers, so stop messing
    with it for make_request based drivers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    Only used inside the bounce code, and open-coding it makes it more
    obvious what is going on.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    This moves the knowledge about bouncing out of the callers and into
    the block core (just like we do for the normal I/O path), and allows
    us to unexport blk_queue_bounce.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    Useful to verify that things are working the way they should.
    Reading the file will return the number of kB written with each
    write hint. Writing to the file will reset the statistics. No care
    is taken to ensure that we don't race on updates.

    Drivers will write to q->write_hints[] if they handle a given
    write hint.
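
    A hedged sketch of what that accounting might look like (the array
    name comes from the text above; the size constant and the exact
    accounting site are assumptions):

        /* in struct request_queue: per-hint counters, in kB */
        u64 write_hints[BLK_MAX_WRITE_HINTS];

        /* a driver that handles a given hint accounts what it wrote */
        q->write_hints[rq->write_hint] += blk_rq_bytes(rq) >> 10;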

    Reviewed-by: Andreas Dilger
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
    No functional changes in this patch; we just use up some holes
    in the bio and request structures to define a write hint that
    we pass down the stack.

    Ensure that we don't merge requests that have different life time
    hints assigned to them, and that we inherit the write hint when
    cloning a bio.
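
    A sketch of the two rules, roughly as one would expect them in the
    merge and clone paths:

        /* never merge requests with different life time hints */
        if (rq->write_hint != bio->bi_write_hint)
                return false;

        /* a cloned bio inherits the write hint of its source */
        bio->bi_write_hint = bio_src->bi_write_hint;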

    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Jun, 2017

6 commits

    The hardware context's queue_num has already been set prior to the
    call to blk_mq_init_hctx(), so there is no need to set it again.

    Signed-off-by: weiping
    Signed-off-by: Jens Axboe

    weiping
     
    Since blk_mq_quiesce_queue_nowait() can be called from interrupt
    context, make it safe to do so. Since this function is not in the
    hot path, uninline it.

    Fixes: commit f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
    Signed-off-by: Bart Van Assche
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Cc: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This was detected by the smatch static analyzer.

    Fixes: commit 2a842acab109 ("block: introduce new block status code type")
    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
    Avoid that building with W=1 causes the compiler to complain that
    declarations for bounce_bio_set and bounce_bio_split are missing.

    References: commit a8821f3f32be ("block: Improvements to bounce-buffer handling")
    Signed-off-by: Bart Van Assche
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch suppresses gcc 7 warnings about falling through in switch
    statements when building with W=1. From the gcc documentation: The
    -Wimplicit-fallthrough=3 warning is enabled by -Wextra. See also
    https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Warning-Options.html.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • If we have shared tags enabled, then every IO completion will trigger
    a full loop of every queue belonging to a tag set, and every hardware
    queue for each of those queues, even if nothing needs to be done.
    This causes a massive performance regression if you have a lot of
    shared devices.

    Instead of doing this huge full scan on every IO, add an atomic
    counter to the main queue that tracks how many hardware queues have
    been marked as needing a restart. With that, we can avoid looking for
    restartable queues, if we don't have to.
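
    A sketch of the counter (field name as in the patch; surrounding
    code simplified):

        /* in struct request_queue: hctxs currently marked for restart */
        atomic_t shared_hctx_restart;

        /* marking a hardware queue bumps the counter exactly once */
        if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                atomic_inc(&hctx->queue->shared_hctx_restart);

        /* on completion, skip the full scan when nothing is marked */
        if (!atomic_read(&queue->shared_hctx_restart))
                return;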

    Max reports that this restores performance. Before this patch, 4k
    IOPS were limited to 22-23K IOPS. With the patch, we are running at
    950-970K IOPS.

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy
    Tested-by: Max Gurtovoy
    Reviewed-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jun, 2017

11 commits

  • A queue must be frozen while the mapped state of a hardware queue
    is changed. Additionally, any change of the mapped state is
    followed by a call to blk_mq_map_swqueue() (see also
    blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues()).
    Since blk_mq_map_swqueue() does not map any unmapped hardware
    queue onto any software queue, no attempt will be made to run
    an unmapped hardware queue. Hence issue a warning upon attempts
    to run an unmapped hardware queue.
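
    The check boils down to something like this at the top of the
    queue-run path (illustrative placement):

        /* true iff at least one software queue maps to this hctx */
        WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx));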

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • The variable 'disk_type' is never modified so constify it.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Document the locking assumptions in functions that modify
    blk_mq_ctx.rq_list to make it easier for humans to verify
    this code.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Some functions in block/blk-core.c must only be used on blk-sq queues
    while others are safe to use against any queue type. Document which
    functions are intended for blk-sq queues and issue a warning if the
    blk-sq API is misused. This not only helps block driver authors, but
    will also make it easier to remove the blk-sq code once that code is
    declared obsolete.
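
    The warnings follow a simple pattern; a sketch for one of the
    legacy-only entry points (the exact set of functions covered varies):

        void blk_run_queue(struct request_queue *q)
        {
                WARN_ON_ONCE(q->mq_ops);        /* blk-sq only */
                /* ... */
        }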

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of documenting the locking assumptions of most block layer
    functions as a comment, use lockdep_assert_held() to verify locking
    assumptions at runtime.
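
    So instead of a "must be called with the queue lock held" comment,
    the pattern becomes (sketch):

        void __blk_run_queue(struct request_queue *q)
        {
                lockdep_assert_held(q->queue_lock);
                /* ... */
        }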

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Initialization of blk-mq requests is a bit weird: blk_mq_rq_ctx_init()
    is called after a value has been assigned to .rq_flags and .rq_flags
    is initialized in __blk_mq_finish_request(). Initialize .rq_flags in
    blk_mq_rq_ctx_init() instead of relying on __blk_mq_finish_request().
    Moving the initialization of .rq_flags is fine because all changes
    and tests of .rq_flags occur between blk_get_request() and finishing
    a request.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Since scsi_req_init() works on a struct scsi_request, change the
    argument type into struct scsi_request *.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of explicitly calling scsi_req_init() after blk_get_request(),
    call that function from inside blk_get_request(). Add an
    .initialize_rq_fn() callback function to the block drivers that need
    it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn()
    because it is too small to keep it as a separate function. Keep the
    scsi_req_init() call in ide_prep_sense() because it follows a
    blk_rq_init() call.

    References: commit 82ed4db499b8 ("block: split scsi_request out of struct request")
    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Nicholas Bellinger
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Several block drivers need to initialize the driver-private request
    data after having called blk_get_request() and before .prep_rq_fn()
    is called, e.g. when submitting a REQ_OP_SCSI_* request. Avoid having
    to repeat that initialization code after every blk_get_request() call
    by adding new callback functions to struct request_queue and to
    struct blk_mq_ops.
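
    A simplified sketch of how the callbacks are wired into
    blk_get_request():

        struct request *blk_get_request(struct request_queue *q,
                                        unsigned int op, gfp_t gfp_mask)
        {
                struct request *req;

                if (q->mq_ops) {
                        req = blk_mq_alloc_request(q, op, 0);
                        if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
                                q->mq_ops->initialize_rq_fn(req);
                } else {
                        req = blk_old_get_request(q, op, gfp_mask);
                        if (!IS_ERR(req) && q->initialize_rq_fn)
                                q->initialize_rq_fn(req);
                }

                return req;
        }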

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of declaring the second argument of blk_*_get_request()
    as int and passing it to functions that expect an unsigned int,
    declare that second argument as unsigned int. Also, for consistency,
    rename that second argument from 'rw' to 'op'.
    This patch does not change any functionality.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Since the srcu structure is rather large (184 bytes on an x86-64
    system with kernel debugging disabled), only allocate it if needed.
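
    A hedged sketch of the approach: the srcu structure is appended to
    the hardware context allocation only for BLK_MQ_F_BLOCKING drivers
    (function name approximate):

        static size_t blk_mq_hw_ctx_size(struct blk_mq_tag_set *set)
        {
                size_t size = sizeof(struct blk_mq_hw_ctx);

                /* only BLK_MQ_F_BLOCKING drivers need the srcu structure */
                if (set->flags & BLK_MQ_F_BLOCKING)
                        size += sizeof(struct srcu_struct);

                return size;
        }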

    Reported-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

20 Jun, 2017

3 commits

    A new bio operation flag, REQ_NOWAIT, is introduced to identify bios
    originating from an iocb with IOCB_NOWAIT set. This flag indicates
    that we should return immediately if a request cannot be made,
    instead of retrying.

    Stacked devices such as md (the ones with make_request_fn hooks) are
    currently not supported because they may block for housekeeping. For
    example, part of an md device can be suspended. For this reason, only
    request-based devices are supported. In the future, this feature will
    be extended to stacked devices by teaching them how to handle the
    REQ_NOWAIT flag.
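
    A hedged sketch of the intended flow (helper names as in the series,
    slightly simplified):

        /* submitter: propagate IOCB_NOWAIT to the bio */
        if (iocb->ki_flags & IOCB_NOWAIT)
                bio->bi_opf |= REQ_NOWAIT;

        /* block layer: fail fast instead of sleeping for a request */
        if (bio->bi_opf & REQ_NOWAIT) {
                bio_wouldblock_error(bio);      /* completes with -EAGAIN */
                return;
        }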

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
     
  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... it was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {
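
    After the rename, the two structures look like this (sketch of
    include/linux/wait.h):

        struct wait_queue_entry {
                unsigned int            flags;
                void                    *private;
                wait_queue_func_t       func;
                struct list_head        entry;          /* was: task_list */
        };

        struct wait_queue_head {
                spinlock_t              lock;
                struct list_head        head;           /* was: task_list */
        };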

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.
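
    The old name is kept as a typedef, so existing users keep building:

        typedef struct wait_queue_entry wait_queue_entry_t;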

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar