Eric Lee / smarc-fsl-linux-kernel

07 Jun, 2019

2 commits

19e9da9e8 block, bfq: add weight symlink to the bfq.weight cgroup parameter ... Browse Code »

Many userspace tools and services use the proportional-share policy of
the blkio/io cgroups controller. The CFQ I/O scheduler implemented
this policy for the legacy block layer. To modify the weight of a
group in case CFQ was in charge, the 'weight' parameter of the group
must be modified. On the other hand, the BFQ I/O scheduler implements
the same policy in blk-mq, but, with BFQ, the parameter to modify has
a different name: bfq.weight (forced choice until legacy block was
present, because two different policies cannot share a common parameter
in cgroups).

Due to CFQ legacy, most if not all userspace configurations still use
the parameter 'weight', and for the moment do not seem likely to be
changed. But, when CFQ went away with legacy block, such a parameter
ceased to exist.

So, a simple workaround has been proposed [1] to make all
configurations work: add a symlink, named weight, to bfq.weight. This
commit adds such a symlink.

[1] https://lkml.org/lkml/2019/4/8/555

Suggested-by: Johannes Thumshirn
Signed-off-by: Angelo Ruocco
Signed-off-by: Paolo Valente
Signed-off-by: Jens Axboe

Angelo Ruocco
2019-06-07 15:29:40 +0800
c3e221921 block: free sched's request pool in blk_cleanup_queue ... Browse Code »

In theory, IO scheduler belongs to request queue, and the request pool
of sched tags belongs to the request queue too.

However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags is definitely host wide,
and doesn't belong to any request queue, same with its request pool.
So we need tagset instance for freeing request of sched tags.

Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched
tags to be freed before calling blk_mq_free_tag_set().

Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
moves blk_exit_queue into __blk_release_queue for simplying the fast
path in generic_make_request(), then causes oops during freeing requests
of sched tags in __blk_release_queue().

Fix the above issue by move freeing request pool of sched tags into
blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any
in-queue requests at that time. Freeing sched tags has to be kept in queue's
release handler becasue there might be un-completed dispatch activity
which might refer to sched tags.

Cc: Bart Van Assche
Cc: Christoph Hellwig
Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang
Reported-by: kernel test robot
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-06-07 12:39:39 +0800

05 Jun, 2019

1 commit

98d669b49 block: Drop unlikely before IS_ERR(_OR_NULL) ... Browse Code »

IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag,
so no need to do that again from its callers. Drop it.

Cc: Jens Axboe
Cc: linux-block@vger.kernel.org
Signed-off-by: Kefeng Wang
Signed-off-by: Jens Axboe

Kefeng Wang
2019-06-05 22:22:24 +0800

01 Jun, 2019

9 commits

61939b12d block: print offending values when cloned rq limits are exceeded ... Browse Code »

While troubleshooting issues where cloned request limits have been
exceeded, it is often beneficial to know the actual values that
have been breached. Print these values, assisting in ease of
identification of root cause of the breach.

Reviewed-by: Chaitanya Kulkarni
Reviewed-by: Ming Lei
Signed-off-by: John Pittman
Signed-off-by: Jens Axboe

John Pittman
2019-06-01 05:12:34 +0800
cd669f88b blk-mq: Document the blk_mq_hw_queue_to_node() arguments ... Browse Code »

Document the meaning of the blk_mq_hw_queue_to_node() arguments.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-06-01 05:12:34 +0800
ef025d7ec blk-mq: Fix spelling in a source code comment ... Browse Code »

Change one occurrence of 'performace' into 'performance'.

Cc: Max Gurtovoy
Fixes: fe631457ff3e ("blk-mq: map all HWQ also in hyperthreaded system") # v4.13.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-06-01 05:12:34 +0800
a0b77e36e block: Fix bsg_setup_queue() kernel-doc header ... Browse Code »

Document all bsg_setup_queue() arguments as required.

Fixes: aae3b069d5ce ("bsg: pass in desired timeout handler") # v5.0.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-06-01 05:12:34 +0800
83826a506 block: Fix rq_qos_wait() kernel-doc header ... Browse Code »

Add documentation for the @rqw argument and change " - " into ": ".

Fixes: 84f603246db9 ("block: add rq_qos_wait to rq_qos") # v5.0-rc1~52^2~140.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-06-01 05:12:34 +0800
0542cd57d block: Fix blk_mq_*_map_queues() kernel-doc headers ... Browse Code »

This patch avoids that the kernel-doc script complains about these
function headers when building with W=1.

Cc: Hannes Reinecke
Cc: Keith Busch
Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0.
Fixes: e42b3867de4b ("blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues") # v5.0.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-06-01 05:12:34 +0800
216382dcc block: Fix throtl_pending_timer_fn() kernel-doc header ... Browse Code »

Commit e99e88a9d2b0 renamed a function argument without updating the
corresponding kernel-doc header. Update the kernel-doc header.

Reviewed-by: Chaitanya Kulkarni
Reviewed-by: Kees Cook
Fixes: e99e88a9d2b0 ("treewide: setup_timer() -> timer_setup()") # v4.15.
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-06-01 05:12:34 +0800
33c826ef1 block: Convert blk_invalidate_devt() header into a non-kernel-doc header ... Browse Code »

This patch avoids that the kernel-doc tool warns about this function
header when building with W=1.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-06-01 05:12:34 +0800
210eaaaea block/partitions/ldm: Convert a kernel-doc header into a non-kernel-doc header ... Browse Code »

This patch avoids that the kernel-doc tool warns about this function
header when building with W=1.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-06-01 05:12:34 +0800

30 May, 2019

1 commit

41de54c64 blk-mq: Fix memory leak in error handling ... Browse Code »

If blk_mq_init_allocated_queue() fails, make sure to free the poll
stat callback struct allocated.

Signed-off-by: Jes Sorensen
Signed-off-by: Jens Axboe

Jes Sorensen
2019-05-30 04:33:04 +0800

29 May, 2019

2 commits

fe2008640 block: don't protect generic_make_request_checks with blk_queue_enter ... Browse Code »

Now a063057d7c73 ("block: Fix a race between request queue removal and
the block cgroup controller") has been reverted, and blkcg_exit_queue()
won't be called in blk_cleanup_queue() any more.

So don't need to protect generic_make_request_checks() with
blk_queue_enter(), then the total mess can be cleaned.

37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device
removal triggers a crash") is reverted.

Cc: Bart Van Assche
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-29 20:09:11 +0800
47cdee29e block: move blk_exit_queue into __blk_release_queue ... Browse Code »

Commit 498f6650aec8 ("block: Fix a race between the cgroup code and
request queue initialization") moves what blk_exit_queue does into
blk_cleanup_queue() for fixing issue caused by changing back
queue lock.

However, after legacy request IO path is killed, driver queue lock
won't be used at all, and there isn't story for changing back
queue lock. Then the issue addressed by Commit 498f6650aec8 doesn't
exist any more.

So move move blk_exit_queue into __blk_release_queue.

This patch basically reverts the following two commits:

498f6650aec8 block: Fix a race between the cgroup code and request queue initialization
24ecc3585348 block: Ensure that a request queue is dissociated from the cgroup controller

Cc: Bart Van Assche
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-29 20:09:09 +0800

24 May, 2019

5 commits

7996a8b55 blk-mq: fix hang caused by freeze/unfreeze sequence ... Browse Code »

The following is a description of a hang in blk_mq_freeze_queue_wait().
The hang happens on attempt to freeze a queue while another task does
queue unfreeze.

The root cause is an incorrect sequence of percpu_ref_resurrect() and
percpu_ref_kill() and as a result those two can be swapped:

CPU#0 CPU#1
---------------- -----------------
q1 = blk_mq_init_queue(shared_tags)

q2 = blk_mq_init_queue(shared_tags):
blk_mq_add_queue_tag_set(shared_tags):
blk_mq_update_tag_set_depth(shared_tags):
list_for_each_entry()
blk_mq_freeze_queue(q1)
> percpu_ref_kill()
> blk_mq_freeze_queue_wait()

blk_cleanup_queue(q1)
blk_mq_freeze_queue(q1)
> percpu_ref_kill()
^^^^^^ freeze_depth can't guarantee the order

blk_mq_unfreeze_queue()
> percpu_ref_resurrect()

> blk_mq_freeze_queue_wait()
^^^^^^ Hang here!!!!

This wrong sequence raises kernel warning:
percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

But the most unpleasant effect is a hang of a blk_mq_freeze_queue_wait(),
which waits for a zero of a q_usage_counter, which never happens
because percpu-ref was reinited (instead of being killed) and stays in
PERCPU state forever.

How to reproduce:
- "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
- cpu0: python Script.py 0; taskset the corresponding process running on cpu0
- cpu1: python Script.py 1; taskset the corresponding process running on cpu1

Script.py:
------
#!/usr/bin/python3

import os
import sys

while True:
on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
os.system(on)
os.system(off)
------

This bug was first reported and fixed by Roman, previous discussion:
[1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
[2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
[3] https://patchwork.kernel.org/patch/9268199/

Reviewed-by: Hannes Reinecke
Reviewed-by: Ming Lei
Reviewed-by: Bart Van Assche
Reviewed-by: Christoph Hellwig
Signed-off-by: Roman Pen
Signed-off-by: Bob Liu
Signed-off-by: Jens Axboe

Bob Liu
2019-05-24 00:25:26 +0800
6869875fb block: remove the bi_seg_{front,back}_size fields in struct bio ... Browse Code »

At this point these fields aren't used for anything, so we can remove
them.

Reviewed-by: Ming Lei
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-05-24 00:25:26 +0800
200a9aff7 block: remove the segment size check in bio_will_gap ... Browse Code »

We fundamentally do not have a maximum segement size for devices with a
virt boundary. So don't bother checking it, especially given that the
existing checks didn't properly work to start with as we never fully
update the front/back segment size and miss the bi_seg_front_size that
wuld have been required for some cases.

Signed-off-by: Christoph Hellwig
Reviewed-by: Ming Lei
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-05-24 00:25:26 +0800
09324d32d block: force an unlimited segment size on queues with a virt boundary ... Browse Code »

We currently fail to update the front/back segment size in the bio when
deciding to allow an otherwise gappy segement to a device with a
virt boundary. The reason why this did not cause problems is that
devices with a virt boundary fundamentally don't use segments as we
know it and thus don't care. Make that assumption formal by forcing
an unlimited segement size in this case.

Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable")
Signed-off-by: Christoph Hellwig
Reviewed-by: Ming Lei
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-05-24 00:25:26 +0800
eded341c0 block: don't decrement nr_phys_segments for physically contigous segments ... Browse Code »

Currently ll_merge_requests_fn, unlike all other merge functions,
reduces nr_phys_segments by one if the last segment of the previous,
and the first segment of the next segement are contigous. While this
seems like a nice solution to avoid building smaller than possible
requests it causes a mismatch between the segments actually present
in the request and those iterated over by the bvec iterators, including
__rq_for_each_bio. This can for example mistrigger the single segment
optimization in the nvme-pci driver, and might lead to mismatching
nr_phys_segments number when recalculating the number of request
when inserting a cloned request.

We could possibly work around this by making the bvec iterators take
the front and back segment size into account, but that would require
moving them from the bio to the bio_iter and spreading this mess
over all users of bvecs. Or we could simply remove this optimization
under the assumption that most users already build good enough bvecs,
and that the bio merge patch never cared about this optimization
either. The latter is what this patch does.

dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests").
Reviewed-by: Ming Lei
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-05-24 00:25:26 +0800

17 May, 2019

1 commit

1718de78e Merge tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block ... Browse Code »

Pull more block updates from Jens Axboe:
"This is mainly some late lightnvm changes that came in just before the
merge window, as well as fixes that have been queued up since the
initial pull request was frozen.

This contains:

- lightnvm changes, fixing race conditions, improving memory
utilization, and improving pblk compatability (Chansol, Igor,
Marcin)

- NVMe pull request with minor fixes all over the map (via Christoph)

- remove redundant error print in sata_rcar (Geert)

- struct_size() cleanup (Jackie)

- dasd CONFIG_LBADF warning fix (Ming)

- brd cond_resched() improvement (Mikulas)"

* tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block: (41 commits)
block/bio-integrity: use struct_size() in kmalloc()
nvme: validate cntlid during controller initialisation
nvme: change locking for the per-subsystem controller list
nvme: trace all async notice events
nvme: fix typos in nvme status code values
nvme-fabrics: remove unused argument
nvme-multipath: avoid crash on invalid subsystem cntlid enumeration
nvme-fc: use separate work queue to avoid warning
nvme-rdma: remove redundant reference between ib_device and tagset
nvme-pci: mark expected switch fall-through
nvme-pci: add known admin effects to augument admin effects log page
nvme-pci: init shadow doorbell after each reset
brd: add cond_resched to brd_free_pages
sata_rcar: Remove ata_host_alloc() error printing
s390/dasd: fix build warning in dasd_eckd_build_cp_raw
lightnvm: pblk: use nvm_rq_to_ppa_list()
lightnvm: pblk: simplify partial read path
lightnvm: do not remove instance under global lock
lightnvm: track inflight target creations
lightnvm: pblk: recover only written metadata
...

Linus Torvalds
2019-05-17 10:08:15 +0800

16 May, 2019

1 commit

7a102d904 block/bio-integrity: use struct_size() in kmalloc() ... Browse Code »

Use the new struct_size() helper to keep code simple.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Jackie Liu
Signed-off-by: Jens Axboe

Jackie Liu
2019-05-16 22:48:48 +0800

08 May, 2019

2 commits

67a242223 Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block updates from Jens Axboe:
"Nothing major in this series, just fixes and improvements all over the
map. This contains:

- Series of fixes for sed-opal (David, Jonas)

- Fixes and performance tweaks for BFQ (via Paolo)

- Set of fixes for bcache (via Coly)

- Set of fixes for md (via Song)

- Enabling multi-page for passthrough requests (Ming)

- Queue release fix series (Ming)

- Device notification improvements (Martin)

- Propagate underlying device rotational status in loop (Holger)

- Removal of mtip32xx trim support, which has been disabled for years
(Christoph)

- Improvement and cleanup of nvme command handling (Christoph)

- Add block SPDX tags (Christoph)

- Cleanup/hardening of bio/bvec iteration (Christoph)

- A few NVMe pull requests (Christoph)

- Removal of CONFIG_LBDAF (Christoph)

- Various little fixes here and there"

* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
block: fix mismerge in bvec_advance
block: don't drain in-progress dispatch in blk_cleanup_queue()
blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
blk-mq: always free hctx after request queue is freed
blk-mq: split blk_mq_alloc_and_init_hctx into two parts
blk-mq: free hw queue's resource in hctx's release handler
blk-mq: move cancel of requeue_work into blk_mq_release
blk-mq: grab .q_usage_counter when queuing request from plug code path
block: fix function name in comment
nvmet: protect discovery change log event list iteration
nvme: mark nvme_core_init and nvme_core_exit static
nvme: move command size checks to the core
nvme-fabrics: check more command sizes
nvme-pci: check more command sizes
nvme-pci: remove an unneeded variable initialization
nvme-pci: unquiesce admin queue on shutdown
nvme-pci: shutdown on timeout during deletion
nvme-pci: fix psdt field for single segment sgls
nvme-multipath: don't print ANA group state by default
nvme-multipath: split bios with the ns_head bio_set before submitting
...

Linus Torvalds
2019-05-08 09:14:36 +0800
cf482a49a Merge tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core ... Browse Code »

Pull driver core/kobject updates from Greg KH:
"Here is the "big" set of driver core patches for 5.2-rc1

There are a number of ACPI patches in here as well, as Rafael said
they should go through this tree due to the driver core changes they
required. They have all been acked by the ACPI developers.

There are also a number of small subsystem-specific changes in here,
due to some changes to the kobject core code. Those too have all been
acked by the various subsystem maintainers.

As for content, it's pretty boring outside of the ACPI changes:
- spdx cleanups
- kobject documentation updates
- default attribute groups for kobjects
- other minor kobject/driver core fixes

All have been in linux-next for a while with no reported issues"

* tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
kobject: clean up the kobject add documentation a bit more
kobject: Fix kernel-doc comment first line
kobject: Remove docstring reference to kset
firmware_loader: Fix a typo ("syfs" -> "sysfs")
kobject: fix dereference before null check on kobj
Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
init/config: Do not select BUILD_BIN2C for IKCONFIG
Provide in-kernel headers to make extending kernel easier
kobject: Improve doc clarity kobject_init_and_add()
kobject: Improve docs for kobject_add/del
driver core: platform: Fix the usage of platform device name(pdev->name)
livepatch: Replace klp_ktype_patch's default_attrs with groups
cpufreq: schedutil: Replace default_attrs field with groups
padata: Replace padata_attr_type default_attrs field with groups
irqdesc: Replace irq_kobj_type's default_attrs field with groups
net-sysfs: Replace ktype default_attrs field with groups
block: Replace all ktype default_attrs with groups
samples/kobject: Replace foo_ktype's default_attrs field with groups
kobject: Add support for default attribute groups to kobj_type
driver core: Postpone DMA tear-down until after devres release for probe failure
...

Linus Torvalds
2019-05-08 04:01:40 +0800

04 May, 2019

7 commits

662156641 block: don't drain in-progress dispatch in blk_cleanup_queue() ... Browse Code »

Now freeing hw queue resource is moved to hctx's release handler,
we don't need to worry about the race between blk_cleanup_queue and
run queue any more.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically revert of c2856ae2f315 ("blk-mq: quiesce queue before
freeing queue").

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reviewed-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-04 21:24:11 +0800
1b97871b5 blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release ... Browse Code »

hctx is always released after requeue is freed.

With holding queue's kobject refcount, it is safe for driver to run queue,
so one run queue might be scheduled after blk_sync_queue() is done.

So moving the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
for avoiding run released queue.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reviewed-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-04 21:24:09 +0800
2f8f1336a blk-mq: always free hctx after request queue is freed ... Browse Code »

In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().

However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.

Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-04 21:24:08 +0800
7c6c5b7c9 blk-mq: split blk_mq_alloc_and_init_hctx into two parts ... Browse Code »

Split blk_mq_alloc_and_init_hctx into two parts, and one is
blk_mq_alloc_hctx() for allocating all hctx resources, another
is blk_mq_init_hctx() for initializing hctx, which serves as
counter-part of blk_mq_exit_hctx().

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen
Cc: Christoph Hellwig
Cc: James E . J . Bottomley
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-04 21:24:06 +0800
c7e2d94b3 blk-mq: free hw queue's resource in hctx's release handler ... Browse Code »

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways for addressing this kind of issue before, such as:

8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But still can't cover all cases, recently James reports another such
kind of issue:

https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.

Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.

This approach follows typical design pattern wrt. kobject's release handler.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reported-by: James Smart
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-04 21:24:05 +0800
fbc2a15e3 blk-mq: move cancel of requeue_work into blk_mq_release ... Browse Code »

With holding queue's kobject refcount, it is safe for driver
to schedule requeue. However, blk_mq_kick_requeue_list() may
be called after blk_sync_queue() is done because of concurrent
requeue activities, then requeue work may not be completed when
freeing queue, and kernel oops is triggered.

So moving the cancel of requeue_work into blk_mq_release() for
avoiding race between requeue and freeing queue.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reviewed-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-04 21:24:04 +0800
e87eb301b blk-mq: grab .q_usage_counter when queuing request from plug code path ... Browse Code »

Just like aio/io_uring, we need to grab 2 refcount for queuing one
request, one is for submission, another is for completion.

If the request isn't queued from plug code path, the refcount grabbed
in generic_make_request() serves for submission. In theroy, this
refcount should have been released after the sumission(async run queue)
is done. blk_freeze_queue() works with blk_sync_queue() together
for avoiding race between cleanup queue and IO submission, given async
run queue activities are canceled because hctx->run_work is scheduled with
the refcount held, so it is fine to not hold the refcount when
running the run queue work function for dispatch IO.

However, if request is staggered into plug list, and finally queued
from plug code path, the refcount in submission side is actually missed.
And we may start to run queue after queue is removed because the queue's
kobject refcount isn't guaranteed to be grabbed in flushing plug list
context, then kernel oops is triggered, see the following race:

blk_mq_flush_plug_list():
blk_mq_sched_insert_requests()
insert requests to sw queue or scheduler queue
blk_mq_run_hw_queue

Because of concurrent run queue, all requests inserted above may be
completed before calling the above blk_mq_run_hw_queue. Then queue can
be freed during the above blk_mq_run_hw_queue().

Fixes the issue by grab .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
safe because the queue is absolutely alive before inserting request.

Cc: Dongli Zhang
Cc: James Smart
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reviewed-by: Bart Van Assche
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-04 21:24:02 +0800

03 May, 2019

1 commit

273938bf7 block: fix function name in comment ... Browse Code »

The comment was out of date.

Reviewed-by: Bart Van Assche
Signed-off-by: Raul E Rangel
Signed-off-by: Jens Axboe

Raul E Rangel
2019-05-03 05:51:52 +0800

01 May, 2019

4 commits

12adb7a01 block: remove the unused blk_queue_dma_pad function ... Browse Code »

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-05-01 06:12:36 +0800
3dcf60bcb block: add SPDX tags to block layer files missing licensing information ... Browse Code »

Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-05-01 06:12:03 +0800
a497ee34a block: switch all files cleared marked as GPLv2 or later to SPDX tags ... Browse Code »

All these files have some form of the usual GPLv2 or later boilerplate.
Switch them to use SPDX tags instead.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-05-01 06:11:59 +0800
8c16567d8 block: switch all files cleared marked as GPLv2 to SPDX tags ... Browse Code »

All these files have some form of the usual GPLv2 boilerplate. Switch
them to use SPDX tags instead.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-05-01 06:11:57 +0800

30 Apr, 2019

4 commits

dcdca753c block: clean up __bio_add_pc_page a bit ... Browse Code »

Share the bi_size update by moving the done label up, and duplicate
the bv_len update in the two callers to get rid of the bvec_merge
label.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-04-30 23:26:44 +0800
6601e44ef block: remove bogus comments in __bio_add_pc_page ... Browse Code »

We are never called with file system pages by defintions for the
passthrough interface, and we also never undo any addition later
these days.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-04-30 23:26:42 +0800
4713839df block: remove the __bio_add_pc_page export ... Browse Code »

The same page optimization is a rather odd corner case, which is not
used outside bio.c and which really should not be used outside of bio.c
either - we have better highlevel helpers like the rq/bio mapping
helpers.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-04-30 23:26:41 +0800
2b070cfe5 block: remove the i argument to bio_for_each_segment_all ... Browse Code »

We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.

Suggested-by: Matthew Wilcox
Reviewed-by: Johannes Thumshirn
Acked-by: David Sterba
Reviewed-by: Hannes Reinecke
Acked-by: Coly Li
Reviewed-by: Matthew Wilcox
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-04-30 23:26:13 +0800