04 May, 2019

4 commits

  • In the normal queue cleanup path, the hctx is released after the
    request queue is freed; see blk_mq_release().

    However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed
    because the hw queues are shrinking. This can easily cause a
    use-after-free: one implicit rule is that it is safe to call almost
    all block layer APIs while the request queue is alive, so an hctx
    may be retrieved by one API and then freed by
    blk_mq_update_nr_hw_queues(), triggering the use-after-free.

    Fix this issue by always freeing the hctx after the request queue
    is released. If some hctxs are removed in
    blk_mq_update_nr_hw_queues(), introduce a per-queue list to hold
    them, then try to reuse these hctxs if the NUMA node matches.
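
    A minimal sketch of the reuse path, assuming a per-queue
    unused_hctx_list protected by a spinlock (the field names here are
    illustrative):

        /* on hw queue shrink: park the hctx instead of freeing it */
        spin_lock(&q->unused_hctx_lock);
        list_add(&hctx->hctx_list, &q->unused_hctx_list);
        spin_unlock(&q->unused_hctx_lock);

        /* on allocation: prefer a parked hctx on the same NUMA node */
        struct blk_mq_hw_ctx *hctx = NULL, *tmp;

        spin_lock(&q->unused_hctx_lock);
        list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
            if (tmp->numa_node == node) {
                hctx = tmp;
                break;
            }
        }
        if (hctx)
            list_del_init(&hctx->hctx_list);
        spin_unlock(&q->unused_hctx_lock);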

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Split blk_mq_alloc_and_init_hctx into two parts:
    blk_mq_alloc_hctx(), which allocates all hctx resources, and
    blk_mq_init_hctx(), which initializes the hctx and serves as the
    counterpart of blk_mq_exit_hctx().
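
    Roughly, the resulting split looks like this (the signatures are
    paraphrased from the description, not verbatim):

        /* allocate all hctx resources; no queue state touched yet */
        static struct blk_mq_hw_ctx *blk_mq_alloc_hctx(struct request_queue *q,
                struct blk_mq_tag_set *set, int node);

        /* initialize the hctx; undone by blk_mq_exit_hctx() */
        static int blk_mq_init_hctx(struct request_queue *q,
                struct blk_mq_tag_set *set,
                struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx);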

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Once blk_cleanup_queue() returns, the tags shouldn't be used any
    more, because blk_mq_free_tag_set() may be called. Commit
    45a9c9d909b2 ("blk-mq: Fix a use-after-free") fixes exactly that
    issue.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    it was allowed to run the queue during queue cleanup as long as the
    queue's kobject refcount was held. After that commit, the queue
    can't be run during cleanup; otherwise an oops is easily triggered,
    because some fields of the hctx are freed by blk_mq_free_queue() in
    blk_cleanup_queue().

    Ways of addressing this kind of issue have been invented before,
    such as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But they still can't cover all cases; recently James reported
    another issue of this kind:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue is quite hard to address in the previous way, given that
    scsi_run_queue() may run requeues for other LUNs.

    Fix the above issue by freeing the hctx's resources in its release
    handler; this is safe because the tags aren't needed for freeing
    those hctx resources.

    This approach follows the typical design pattern for a kobject's
    release handler.
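
    A sketch of that pattern, assuming the hctx sysfs kobject's release
    handler is the last-reference point (the exact resources freed are
    illustrative):

        static void blk_mq_hw_sysfs_release(struct kobject *kobj)
        {
            struct blk_mq_hw_ctx *hctx = container_of(kobj,
                    struct blk_mq_hw_ctx, kobj);

            /* safe here: no tags are needed to free these */
            free_cpumask_var(hctx->cpumask);
            kfree(hctx->ctxs);
            kfree(hctx);
        }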

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • While the queue's kobject refcount is held, it is safe for a driver
    to schedule a requeue. However, blk_mq_kick_requeue_list() may be
    called after blk_sync_queue() is done, because of concurrent
    requeue activity, so the requeue work may not be completed when the
    queue is freed, and a kernel oops is triggered.

    So move the cancellation of requeue_work into blk_mq_release() to
    avoid the race between requeueing and freeing the queue.
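
    A sketch of the move, assuming blk_mq_release() is the final
    teardown point for mq resources:

        void blk_mq_release(struct request_queue *q)
        {
            /* requeue_work must be idle before the hctxs go away */
            cancel_delayed_work_sync(&q->requeue_work);

            /* ... free hctxs and other mq resources ... */
        }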

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     


14 Apr, 2019

1 commit

  • A previous commit moved the shallow depth and BFQ depth map calculations
    to be done at init time, moving it outside of the hotter IO path. This
    potentially causes hangs if the user changes the depth of the scheduler
    map, by writing to the 'nr_requests' sysfs file for that device.

    Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
    the depth changes, so that the scheduler can update its internal state.
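
    A sketch of such a hook and its call site (names are paraphrased
    from the description):

        /* in the elevator ops: let the scheduler react to depth changes */
        void (*depth_updated)(struct blk_mq_hw_ctx *hctx);

        /* in blk_mq_update_nr_requests(), after resizing the tags */
        if (q->elevator && q->elevator->type->ops.depth_updated)
            q->elevator->type->ops.depth_updated(hctx);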

    Tested-by: Kai Krakow
    Reported-by: Paolo Valente
    Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2019

1 commit

  • NVMe's error handler follows the typical steps of tearing down the
    hardware to recover the controller:

    1) stop blk_mq hw queues
    2) stop the real hw queues
    3) cancel in-flight requests via
    blk_mq_tagset_busy_iter(tags, cancel_request, ...)
    cancel_request():
    mark the request as abort
    blk_mq_complete_request(req);
    4) destroy real hw queues

    However, there may be a race between #3 and #4, because
    blk_mq_complete_request() may run q->mq_ops->complete(rq) remotely
    and asynchronously, and ->complete(rq) may run after #4.

    This patch introduces blk_mq_complete_request_sync() for fixing the
    above race.
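
    A minimal sketch of the synchronous variant (simplified; the real
    completion path carries more state handling):

        void blk_mq_complete_request_sync(struct request *rq)
        {
            /* run ->complete() in the caller's context so that it is
             * guaranteed to finish before the hw queues are destroyed */
            rq->q->mq_ops->complete(rq);
        }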

    Cc: Sagi Grimberg
    Cc: Bart Van Assche
    Cc: James Smart
    Cc: linux-nvme@lists.infradead.org
    Reviewed-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Apr, 2019

1 commit

  • blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
    have been queued. If that happens when blk_mq_try_issue_directly() is called
    by the dm-mpath driver then dm-mpath will try to resubmit a request that is
    already queued and a kernel crash follows. Since it is nontrivial to fix
    blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
    changes that went into kernel v5.0.

    This patch reverts the following commits:
    * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
    * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
    * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: James Smart
    Cc: Dongli Zhang
    Cc: Laurence Oberman
    Cc:
    Reported-by: Laurence Oberman
    Tested-by: Laurence Oberman
    Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     


26 Mar, 2019

1 commit

  • We now wrap sbitmap waitqueues in an active counter, so we can avoid
    iterating wakeups unless we have waiters there. This works as long as
    everyone that's manipulating the waitqueues uses the proper helpers. For
    the tag wait case for shared tags, however, we add ourselves to the
    waitqueue without incrementing/decrementing the ->ws_active count. This
    means that wakeups can take a long time to happen.

    Fix this by manually doing the inc/dec as needed for the wait queue
    handling.
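
    A sketch of the manual accounting around the shared-tag wait (field
    names follow the sbitmap queue API):

        struct sbitmap_queue *sbq = &hctx->tags->bitmap_tags;

        atomic_inc(&sbq->ws_active);    /* before adding to the waitqueue */
        /* ... wait for a tag to become free ... */
        atomic_dec(&sbq->ws_active);    /* after leaving the waitqueue */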

    Reported-by: Michael Leun
    Tested-by: Michael Leun
    Cc: stable@vger.kernel.org
    Reviewed-by: Omar Sandoval
    Fixes: 5d2ee7122c73 ("sbitmap: optimize wakeup check")
    Signed-off-by: Jens Axboe

    Jens Axboe
     


10 Mar, 2019

1 commit

  • Pull SCSI updates from James Bottomley:
    "This is mostly update of the usual drivers: arcmsr, qla2xxx, lpfc,
    hisi_sas, target/iscsi and target/core.

    Additionally Christoph refactored gdth as part of the dma changes. The
    major mid-layer change this time is the removal of bidi commands and
    with them the whole of the osd/exofs driver and filesystem. This is a
    major simplification for block and mq in particular"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (240 commits)
    scsi: cxgb4i: validate tcp sequence number only if chip version pf
    scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
    scsi: mpt3sas: Add missing breaks in switch statements
    scsi: aacraid: Fix missing break in switch statement
    scsi: kill command serial number
    scsi: csiostor: drop serial_number usage
    scsi: mvumi: use request tag instead of serial_number
    scsi: dpt_i2o: remove serial number usage
    scsi: st: osst: Remove negative constant left-shifts
    scsi: ufs-bsg: Allow reading descriptors
    scsi: ufs: Allow reading descriptor via raw upiu
    scsi: ufs-bsg: Change the calling convention for write descriptor
    scsi: ufs: Remove unused device quirks
    Revert "scsi: ufs: disable vccq if it's not needed by UFS device"
    scsi: megaraid_sas: Remove a bunch of set but not used variables
    scsi: clean obsolete return values of eh_timed_out
    scsi: sd: Optimal I/O size should be a multiple of physical block size
    scsi: MAINTAINERS: SCSI initiator and target tweaks
    scsi: fcoe: make use of fip_mode enum complete
    ...

    Linus Torvalds
     

09 Mar, 2019

1 commit

  • Pull block layer updates from Jens Axboe:
    "Not a huge amount of changes in this round, the biggest one is that we
    finally have Ming's multi-page bvec support merged. Apart from that,
    this pull request contains:

    - Small series that avoids quiescing the queue for sysfs changes that
    match what we currently have (Aleksei)

    - Series of bcache fixes (via Coly)

    - Series of lightnvm fixes (via Mathias)

    - NVMe pull request from Christoph. Nothing major, just SPDX/license
    cleanups, RR mp policy (Hannes), and little fixes (Bart,
    Chaitanya).

    - BFQ series (Paolo)

    - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
    for the fast path (Jianchao)

    - fops->iopoll() added for async IO polling, this is a feature that
    the upcoming io_uring interface will use (Christoph, me)

    - Partition scan loop fixes (Dongli)

    - mtip32xx conversion from managed resource API (Christoph)

    - cdrom registration race fix (Guenter)

    - MD pull from Song, two minor fixes.

    - Various documentation fixes (Marcos)

    - Multi-page bvec feature. This brings a lot of nice improvements
    with it, like more efficient splitting, larger IOs can be supported
    without growing the bvec table size, and so on. (Ming)

    - Various little fixes to core and drivers"

    * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
    block: fix updating bio's front segment size
    block: Replace function name in string with __func__
    nbd: propagate genlmsg_reply return code
    floppy: remove set but not used variable 'q'
    null_blk: fix checking for REQ_FUA
    block: fix NULL pointer dereference in register_disk
    fs: fix guard_bio_eod to check for real EOD errors
    blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
    block: optimize bvec iteration in bvec_iter_advance
    block: introduce mp_bvec_for_each_page() for iterating over page
    block: optimize blk_bio_segment_split for single-page bvec
    block: optimize __blk_segment_map_sg() for single-page bvec
    block: introduce bvec_nth_page()
    iomap: wire up the iopoll method
    block: add bio_set_polled() helper
    block: wire up block device iopoll method
    fs: add an iopoll method to struct file_operations
    loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
    loop: do not print warn message if partition scan is successful
    block: bounce: make sure that bvec table is updated
    ...

    Linus Torvalds
     


15 Feb, 2019

1 commit

  • Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
    the physical segment number is mainly figured out in blk_queue_split()
    for the fast path, and the BIO_SEG_VALID flag is set there too.

    Now only blk_recount_segments() and blk_recalc_rq_segments() use this
    flag.

    Basically blk_recount_segments() is bypassed in the fast path, given
    that BIO_SEG_VALID is set in blk_queue_split().

    As for the other user, blk_recalc_rq_segments():

    - it runs in the partial completion branch of blk_update_request(),
    which is an unusual case

    - it runs in blk_cloned_rq_check_limits(), still not a big problem
    if the flag is killed, since dm-rq is the only user

    Multi-page bvec is enabled now; not doing S/G merging is rather
    pointless with the current setup of the I/O path, as it isn't going
    to save a significant number of cycles.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

12 Feb, 2019

1 commit

  • When requeuing, if RQF_DONTPREP is set, the rq already contains
    driver-specific data, so insert it into the hctx dispatch list to
    avoid any merge. Take SCSI as an example; here is the trace event
    log (no io scheduler, because RQF_STARTED would prevent merging):
    kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
    scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
    scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
    kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
    scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
    scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
    kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
    kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
    scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]

    (32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
    Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
    the sdb only contained the (32768 + 8) part, so only that part was
    completed. The lucky thing was that scsi_io_completion detected this
    and requeued the remaining part, so we didn't get corrupted data.
    However, the requeue of (32776 + 8) is not expected.
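
    A sketch of the fix in the requeue work, bypassing the scheduler
    for already-prepared requests:

        /* in blk_mq_requeue_work(), for each requeued rq */
        if (rq->rq_flags & RQF_DONTPREP)
            /* driver data attached: no merging allowed */
            blk_mq_request_bypass_insert(rq, false);
        else
            blk_mq_sched_insert_request(rq, true, false, false);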

    Suggested-by: Jens Axboe
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     


01 Feb, 2019

2 commits

  • Currently, we check whether the hctx type is supported every time
    in the hot path. Actually, this is not necessary: we can save the
    default hctx into ctx->hctxs when mapping the sw queues if the type
    is not supported, and then use it directly via ctx->hctxs[type].

    We also needn't check whether poll is enabled or not, because the
    caller will have cleared REQ_HIPRI in that case.
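
    A sketch of the fallback at map time, assuming HCTX_TYPE_DEFAULT is
    the catch-all type:

        /* in blk_mq_map_swqueue(), for each ctx and each map type */
        if (!set->map[type].nr_queues) {
            ctx->hctxs[type] = blk_mq_map_queue_type(q,
                    HCTX_TYPE_DEFAULT, cpu);
            continue;
        }
        ctx->hctxs[type] = blk_mq_map_queue_type(q, type, cpu);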

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • Currently, the queue mapping result is saved in a two-dimensional
    array. In the hot path, to get an hctx, we need to do the following:

    q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]

    This isn't very efficient. We can instead save the queue mapping
    result directly in the ctx, per hctx type:

    ctx->hctxs[type]
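
    With that, the hot-path lookup becomes a single indexed load,
    roughly:

        static inline struct blk_mq_hw_ctx *blk_mq_map_queue(
                struct request_queue *q, unsigned int flags,
                struct blk_mq_ctx *ctx)
        {
            enum hctx_type type = HCTX_TYPE_DEFAULT;

            if (flags & REQ_HIPRI)
                type = HCTX_TYPE_POLL;
            else if ((flags & REQ_OP_MASK) == REQ_OP_READ)
                type = HCTX_TYPE_READ;

            return ctx->hctxs[type];
        }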

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

16 Jan, 2019

1 commit

  • We need to pass bio->bi_opf after bio integrity preparation;
    otherwise the REQ_INTEGRITY flag may not be set on the allocated
    request, which breaks block integrity.
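
    A sketch of the corrected ordering in the make_request path:

        /* bio_integrity_prep() may set REQ_INTEGRITY in bio->bi_opf,
         * so the flags must only be sampled afterwards */
        if (!bio_integrity_prep(bio))
            return BLK_QC_T_NONE;

        data.cmd_flags = bio->bi_opf;    /* now includes REQ_INTEGRITY */
        rq = blk_mq_get_request(q, bio, &data);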

    Fixes: f9afca4d367b ("blk-mq: pass in request/bio flags to queue mapping")
    Cc: Hannes Reinecke
    Cc: Keith Busch
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

19 Dec, 2018

1 commit

  • block consumers will need it for polling requests that
    are sent with blk_execute_rq_nowait. Also, get rid of
    blk_tag_to_qc_t and open-code it instead.

    Reviewed-by: Jens Axboe
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Sagi Grimberg
     

18 Dec, 2018

4 commits

  • The queue mapping of type poll only exists when
    set->map[HCTX_TYPE_POLL].nr_queues is bigger than zero, so tighten
    the constraint by checking .nr_queues of the poll type before
    enabling IO poll.

    Otherwise an IO race & timeout can be observed when running block/007.
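
    A sketch of the tightened check:

        if (set->nr_maps > HCTX_TYPE_POLL &&
            set->map[HCTX_TYPE_POLL].nr_queues)
            blk_queue_flag_set(QUEUE_FLAG_POLL, q);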

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • There's a single user of this function, dm, and dm just wants
    to check if IO is inflight, not that it's just allocated.

    This fixes a hang with srp/002 in blktests with dm, where it tries
    to suspend but waits for inflight IO to finish first. As it checks
    for just allocated requests, this fails.
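
    A sketch of the stricter inflight test, counting only started
    requests (the helper name is illustrative):

        static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx,
                struct request *rq, void *priv, bool reserved)
        {
            bool *busy = priv;

            /* allocated-but-not-started requests don't count */
            if (rq->q == hctx->queue &&
                blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
                *busy = true;
            return true;
        }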

    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since 7e849dd9cf37 ("nvme-pci: don't share queue maps"), the mapping
    table isn't actually initialized if map->nr_queues is zero, so we
    can't use blk_mq_map_queue_type() to retrieve the hctx any more.

    This can still cause a broken mapping; fix it by skipping zero-queue
    maps in blk_mq_map_swqueue().
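
    A sketch of the skip in blk_mq_map_swqueue():

        for (j = 0; j < set->nr_maps; j++) {
            /* the mapping table was never built for empty maps */
            if (!set->map[j].nr_queues)
                continue;

            hctx = blk_mq_map_queue_type(q, j, i);
            /* ... wire up ctx <-> hctx for this type ... */
        }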

    Cc: Jeff Moyer
    Cc: Mike Snitzer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • When a request is added to the rq list of a sw queue (ctx), the rq
    may be from a different type of hctx, especially after multi-queue
    mapping is introduced.

    So when dispatching requests from the sw queue via
    blk_mq_flush_busy_ctxs() or blk_mq_dequeue_from_ctx(), a request
    belonging to a different hctx queue type can be dispatched to the
    current hctx when the read queue or poll queue is enabled.

    This patch fixes the issue by introducing a per-queue-type list.
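
    A sketch of the per-type lists in the sw queue, kept in the same
    cacheline as the lock per the note below:

        struct blk_mq_ctx {
            struct {
                spinlock_t          lock;
                struct list_head    rq_lists[HCTX_MAX_TYPES];
            } ____cacheline_aligned_in_smp;
            /* ... */
        };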

    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei

    Changed by me to not use separately cacheline aligned lists, just
    place them all in the same cacheline where we had just the one list
    and lock before.

    Signed-off-by: Jens Axboe

    Ming Lei
     


16 Dec, 2018

3 commits

  • Replace blk_mq_request_issue_directly with blk_mq_try_issue_directly
    in blk_insert_cloned_request and kill it as nobody uses it any more.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • It is not necessary to issue requests directly with bypass 'true'
    in blk_mq_sched_insert_requests and handle the non-issued requests
    ourselves. Just set bypass to 'false' and let
    blk_mq_try_issue_directly handle them completely. Remove the
    blk_rq_can_direct_dispatch check, because blk_mq_try_issue_directly
    handles it well. If a request fails direct issue, insert the rest.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • Merge blk_mq_try_issue_directly and __blk_mq_try_issue_directly
    into one interface to unify the paths that issue requests directly.
    The merged interface takes over the request completely: it can
    insert, end, or do nothing based on the return value of .queue_rq
    and the 'bypass' parameter, so the caller needs no further handling
    and the code can be cleaned up.

    Also, commit c616cbee ("blk-mq: punt failed direct issue to
    dispatch list") always inserts requests into the hctx dispatch list
    whenever it gets a BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE; this
    is overkill and harms merging. We only need to do that for requests
    that have been through .queue_rq. This patch also fixes that.
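
    A rough sketch of the unified entry point (the parameter set is
    paraphrased from the description):

        /* issue rq directly; based on .queue_rq's return value and
         * 'bypass', either finish it, insert it, or hand the status
         * back to the caller */
        static blk_status_t blk_mq_try_issue_directly(
                struct blk_mq_hw_ctx *hctx, struct request *rq,
                blk_qc_t *cookie, bool bypass, bool last);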

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

10 Dec, 2018

2 commits

  • The previous patches deleted all the code that needed the second value
    returned from part_in_flight - now the kernel only uses the first value.

    Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
    it only returns one value.

    This patch just refactors the code, there's no functional change.
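
    The resulting helper then returns a single value, roughly:

        /* before: filled an inflight[2] array; now one count suffices */
        unsigned int part_in_flight(struct request_queue *q,
                                    struct hd_struct *part);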

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
    two corruption fixes. We're going to be overhauling the direct dispatch
    path, and we need to do that on top of the changes we made for that
    in mainline.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Dec, 2018

1 commit

  • Now almost all .map_queues() implementations based on managed irq
    affinity don't update the queue mapping; they just retrieve the
    previously built mapping, so if nr_hw_queues is changed, the
    mapping table contains stale entries. Only blk_mq_map_queues() may
    rebuild the mapping table.

    One such case is that we limit .nr_hw_queues to 1 for a kdump
    kernel. However, drivers often build the queue mapping before
    allocating the tagset via pci_alloc_irq_vectors_affinity(), while
    set->nr_hw_queues may be set to 1 for the kdump kernel, so a wrong
    queue mapping is used, and a kernel panic [1] is observed during
    boot.

    This patch fixes the kernel panic triggered on nvme by rebuilding
    the mapping table via blk_mq_map_queues(); a sketch of the fix
    follows the panic log below.

    [1] kernel panic log
    [ 4.438371] nvme nvme0: 16/0/0 default/read/poll queues
    [ 4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
    [ 4.444681] PGD 0 P4D 0
    [ 4.445367] Oops: 0000 [#1] SMP NOPTI
    [ 4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
    [ 4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
    [ 4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
    [ 4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
    [ 4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
    [ 4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286
    [ 4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001
    [ 4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001
    [ 4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002
    [ 4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538
    [ 4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001
    [ 4.469220] FS: 0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000
    [ 4.471554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0
    [ 4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 4.477061] PKRU: 55555554
    [ 4.477464] Call Trace:
    [ 4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad
    [ 4.479595] blk_mq_init_queue+0x32/0x4e
    [ 4.480178] nvme_validate_ns+0x98/0x623 [nvme_core]
    [ 4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
    [ 4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
    [ 4.482601] nvme_scan_work+0x23a/0x29b [nvme_core]
    [ 4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38
    [ 4.483930] ? try_to_wake_up+0x38d/0x3b3
    [ 4.484478] ? process_one_work+0x179/0x2fc
    [ 4.485118] process_one_work+0x1d3/0x2fc
    [ 4.485655] ? rescuer_thread+0x2ae/0x2ae
    [ 4.486196] worker_thread+0x1e9/0x2be
    [ 4.486841] kthread+0x115/0x11d
    [ 4.487294] ? kthread_park+0x76/0x76
    [ 4.487784] ret_from_fork+0x3a/0x50
    [ 4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
    [ 4.489428] Dumping ftrace buffer:
    [ 4.489939] (ftrace buffer empty)
    [ 4.490492] CR2: 0000000000000098
    [ 4.491052] ---[ end trace 03cd268ad5a86ff7 ]---
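
    A sketch of the fix, forcing the generic rebuild for kdump kernels:

        static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
        {
            /* a managed-affinity ->map_queues() would return a stale
             * table when nr_hw_queues was clamped for kdump */
            if (set->ops->map_queues && !is_kdump_kernel())
                return set->ops->map_queues(set);

            return blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
        }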

    Cc: Christoph Hellwig
    Cc: linux-nvme@lists.infradead.org
    Cc: David Milburn
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei