09 Jul, 2019
1 commit
-
Pull cgroup updates from Tejun Heo:
"Documentation updates and the addition of cgroup_parse_float() which
will be used by new controllers including blk-iocost"
* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup-v1: convert docs to ReST and rename to *.rst
cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
cgroup: add cgroup_parse_float()
07 Jul, 2019
1 commit
-
When the blk-mq debugfs file creation logic was "cleaned up" it was
cleaned up too much, causing the queue file to not be created in the
correct location. Turns out the check for the directory being present
is needed as if that has not happened yet, the files should not be
created, and the function will be called later on in the initialization
code so that the files can be created in the correct location.
Fixes: 6cfc0081b046 ("blk-mq: no need to check return value of debugfs_create functions")
Reported-by: Stephen Rothwell
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Jens Axboe
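The directory check described in the commit above can be sketched as a small Python model. This is only an illustration of the control flow, not the kernel code: the `Queue` class, `register_hctx_files()` and the path layout are hypothetical names chosen for the example.

```python
class Queue:
    """Toy stand-in for a request queue; all names here are illustrative."""

    def __init__(self):
        self.debugfs_dir = None   # set once the queue's directory exists
        self.debugfs_files = []


def register_hctx_files(queue, hctx_name, attrs):
    """Create per-hctx entries only if the parent directory is present."""
    if queue.debugfs_dir is None:
        # Too early: skip quietly; a later init step calls this again.
        return False
    for attr in attrs:
        queue.debugfs_files.append(f"{queue.debugfs_dir}/{hctx_name}/{attr}")
    return True
```

Calling the function before the directory is set creates nothing; calling it again afterwards places the files under the right parent, which is the behaviour the fix restores.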
26 Jun, 2019
1 commit
-
By mistake, there is a '&' instead of a '==' in the definition of the
macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
the correct one.
Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt")
Signed-off-by: Paolo Valente
Signed-off-by: Jens Axboe
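To see why '&' and '==' behave so differently here, consider a minimal Python sketch. It assumes, purely for illustration, that the macro compares a 32-bit seek-history bitmap against an all-ones value; the commit message above does not spell out the operands, and the function names are hypothetical.

```python
ALL_SEEKY = 0xFFFFFFFF  # assumption: 32-bit history, one bit per recent request


def totally_seeky_buggy(seek_history: int) -> bool:
    # With '&', any non-zero history yields a truthy result.
    return bool(seek_history & ALL_SEEKY)


def totally_seeky_fixed(seek_history: int) -> bool:
    # With '==', only a history with every bit set qualifies.
    return seek_history == ALL_SEEKY
```

Under this assumption, a queue with a single seeky request in its history is wrongly classified as totally seeky by the '&' variant, while the '==' variant requires the whole history to be seeky.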
17 Jun, 2019
2 commits
-
When multiple iovecs reference the same page, each get_user_page call
will add a reference to the page. But once we've created the bio that
information gets lost and only a single reference will be dropped after
I/O completion. Use the same_page information returned from
__bio_try_merge_page to drop additional references to pages that were
already present in the bio.

Based on a patch from Ming Lei.
Link: https://lkml.org/lkml/2019/4/23/64
Fixes: 576ed913 ("block: use bio_add_page in bio_iov_iter_get_pages")
Reported-by: David Gibson
Signed-off-by: Christoph Hellwig
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
-
We currently have an input same_page parameter to __bio_try_merge_page
to prohibit merging in the same page. The rationale for that is that
some callers need to account for every page added to a bio. Instead of
letting these callers call twice into the merge code to account for the
new vs existing page cases, just turn the parameter into an output one that
returns if a merge in the same page occurred and let them act accordingly.
Signed-off-by: Christoph Hellwig
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
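The two commits above can be read together as a reference-accounting rule. The following Python model is a sketch under simplifying assumptions: pages are plain labels, a bio is a list of (page, offset, length) tuples, and merging happens only within the same page. The names `try_merge_page`, `add_user_pages` and `complete_io` are illustrative and are not the kernel functions.

```python
from collections import Counter


def try_merge_page(bio, page, offset, length):
    """Return (merged, same_page), mimicking an output-style same_page flag."""
    if bio:
        last_page, last_off, last_len = bio[-1]
        if last_page == page and last_off + last_len == offset:
            bio[-1] = (last_page, last_off, last_len + length)
            return True, True
    return False, False


def add_user_pages(bio, refs, iovecs):
    for page, offset, length in iovecs:
        refs[page] += 1  # one reference per pinned iovec, as described above
        merged, same_page = try_merge_page(bio, page, offset, length)
        if merged and same_page:
            refs[page] -= 1  # page already in the bio: drop the extra reference
        elif not merged:
            bio.append((page, offset, length))


def complete_io(bio, refs):
    for page, _, _ in bio:
        refs[page] -= 1  # completion drops a single reference per bio entry
```

With the extra drop in place, two iovecs that land in the same page leave exactly one reference for completion to release; without it, the count would stay above zero after I/O.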
15 Jun, 2019
1 commit
-
Convert the cgroup-v1 files to ReST format, in order to
allow a later addition to the admin-guide.

The conversion is actually:
- add blank lines and indentation in order to identify paragraphs;
- fix tables markups;
- add some lists markups;
- mark literal blocks;
- adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab
Acked-by: Tejun Heo
Signed-off-by: Tejun Heo
13 Jun, 2019
3 commits
-
blk_mq_sched_free_requests() may be called in failure path in which
q->elevator may not be setup yet, so remove WARN_ON(!q->elevator) from
blk_mq_sched_free_requests for avoiding the false positive.

This function is actually safe to call in case of !q->elevator because
hctx->sched_tags is checked.
Cc: Bart Van Assche
Cc: Christoph Hellwig
Cc: Yi Zhang
Fixes: c3e2219216c9 ("block: free sched's request pool in blk_cleanup_queue")
Reported-by: syzbot+b9d0d56867048c7bcfde@syzkaller.appspotmail.com
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.

When all of these checks are cleaned up, lots of the functions used in
the blk-mq-debugfs code can now return void, as no need to check the
return value of them either.

Overall, this ends up cleaning up the code and making it smaller, always
a nice win.
Cc: Jens Axboe
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Jens Axboe
-
In most use cases of zoned block devices (aka SMR disks), the
mq-deadline scheduler is mandatory as it implements sequential write
command processing guarantees with zone write locking. So make sure that
this scheduler is always enabled if CONFIG_BLK_DEV_ZONED is selected.
Tested-by: Chaitanya Kulkarni
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Damien Le Moal
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
10 Jun, 2019
1 commit
-
There's some discussion on how to do this the best, and Tejun prefers
that BFQ just create the file itself instead of having cgroups support
a symlink feature.

Hence revert commit 54b7b868e826 and 19e9da9e86c4 for 5.2, and this
can be done properly for 5.3.
Signed-off-by: Jens Axboe
07 Jun, 2019
2 commits
-
Many userspace tools and services use the proportional-share policy of
the blkio/io cgroups controller. The CFQ I/O scheduler implemented
this policy for the legacy block layer. To modify the weight of a
group in case CFQ was in charge, the 'weight' parameter of the group
must be modified. On the other hand, the BFQ I/O scheduler implements
the same policy in blk-mq, but, with BFQ, the parameter to modify has
a different name: bfq.weight (forced choice until legacy block was
present, because two different policies cannot share a common parameter
in cgroups).

Due to CFQ legacy, most if not all userspace configurations still use
the parameter 'weight', and for the moment do not seem likely to be
changed. But, when CFQ went away with legacy block, such a parameter
ceased to exist.

So, a simple workaround has been proposed [1] to make all
configurations work: add a symlink, named weight, to bfq.weight. This
commit adds such a symlink.

[1] https://lkml.org/lkml/2019/4/8/555
Suggested-by: Johannes Thumshirn
Signed-off-by: Angelo Ruocco
Signed-off-by: Paolo Valente
Signed-off-by: Jens Axboe
-
In theory, IO scheduler belongs to request queue, and the request pool
of sched tags belongs to the request queue too.

However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags are definitely host wide
and don't belong to any request queue, same with their request pool.
So we need the tagset instance for freeing requests of sched tags.

Meanwhile, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
of non-BLK_MQ_F_TAG_SHARED, which requires that the request pool of sched
tags be freed before calling blk_mq_free_tag_set().

Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
moves blk_exit_queue into __blk_release_queue for simplifying the fast
path in generic_make_request(), which then causes an oops while freeing requests
of sched tags in __blk_release_queue().

Fix the above issue by moving the freeing of the request pool of sched tags into
blk_cleanup_queue(). This is safe because the queue has been frozen and there are
no in-queue requests at that time. Freeing sched tags has to be kept in the queue's
release handler because there might be un-completed dispatch activity
which might refer to sched tags.
Cc: Bart Van Assche
Cc: Christoph Hellwig
Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang
Reported-by: kernel test robot
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
05 Jun, 2019
1 commit
-
IS_ERR(_OR_NULL) already contains an 'unlikely' compiler flag,
so there is no need to do that again from its callers. Drop it.
Cc: Jens Axboe
Cc: linux-block@vger.kernel.org
Signed-off-by: Kefeng Wang
Signed-off-by: Jens Axboe
01 Jun, 2019
9 commits
-
While troubleshooting issues where cloned request limits have been
exceeded, it is often beneficial to know the actual values that
have been breached. Print these values, assisting in ease of
identification of root cause of the breach.
Reviewed-by: Chaitanya Kulkarni
Reviewed-by: Ming Lei
Signed-off-by: John Pittman
Signed-off-by: Jens Axboe
-
Document the meaning of the blk_mq_hw_queue_to_node() arguments.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
-
Change one occurrence of 'performace' into 'performance'.
Cc: Max Gurtovoy
Fixes: fe631457ff3e ("blk-mq: map all HWQ also in hyperthreaded system") # v4.13.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
-
Document all bsg_setup_queue() arguments as required.
Fixes: aae3b069d5ce ("bsg: pass in desired timeout handler") # v5.0.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
-
Add documentation for the @rqw argument and change " - " into ": ".
Fixes: 84f603246db9 ("block: add rq_qos_wait to rq_qos") # v5.0-rc1~52^2~140.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
-
This patch avoids that the kernel-doc script complains about these
function headers when building with W=1.
Cc: Hannes Reinecke
Cc: Keith Busch
Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0.
Fixes: e42b3867de4b ("blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues") # v5.0.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
-
Commit e99e88a9d2b0 renamed a function argument without updating the
corresponding kernel-doc header. Update the kernel-doc header.
Reviewed-by: Chaitanya Kulkarni
Reviewed-by: Kees Cook
Fixes: e99e88a9d2b0 ("treewide: setup_timer() -> timer_setup()") # v4.15.
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
-
This patch avoids that the kernel-doc tool warns about this function
header when building with W=1.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
-
This patch avoids that the kernel-doc tool warns about this function
header when building with W=1.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
30 May, 2019
1 commit
-
If blk_mq_init_allocated_queue() fails, make sure to free the poll
stat callback struct allocated.
Signed-off-by: Jes Sorensen
Signed-off-by: Jens Axboe
29 May, 2019
2 commits
-
Now a063057d7c73 ("block: Fix a race between request queue removal and
the block cgroup controller") has been reverted, and blkcg_exit_queue()
won't be called in blk_cleanup_queue() any more.

So there is no need to protect generic_make_request_checks() with
blk_queue_enter(), and the total mess can be cleaned up.

37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device
removal triggers a crash") is reverted.
Cc: Bart Van Assche
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
Commit 498f6650aec8 ("block: Fix a race between the cgroup code and
request queue initialization") moves what blk_exit_queue does into
blk_cleanup_queue() for fixing issue caused by changing back
queue lock.

However, after legacy request IO path is killed, driver queue lock
won't be used at all, and there is no longer any case for changing back
the queue lock. So the issue addressed by commit 498f6650aec8 doesn't
exist any more.

So move blk_exit_queue into __blk_release_queue.

This patch basically reverts the following two commits:
498f6650aec8 block: Fix a race between the cgroup code and request queue initialization
24ecc3585348 block: Ensure that a request queue is dissociated from the cgroup controller
Cc: Bart Van Assche
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
24 May, 2019
5 commits
-
The following is a description of a hang in blk_mq_freeze_queue_wait().
The hang happens on attempt to freeze a queue while another task does
queue unfreeze.

The root cause is an incorrect sequence of percpu_ref_resurrect() and
percpu_ref_kill() and as a result those two can be swapped:

 CPU#0                                CPU#1
 ----------------                     -----------------
 q1 = blk_mq_init_queue(shared_tags)

                                      q2 = blk_mq_init_queue(shared_tags):
                                        blk_mq_add_queue_tag_set(shared_tags):
                                          blk_mq_update_tag_set_depth(shared_tags):
                                            list_for_each_entry()
                                              blk_mq_freeze_queue(q1)
                                               > percpu_ref_kill()
                                               > blk_mq_freeze_queue_wait()

 blk_cleanup_queue(q1)
  blk_mq_freeze_queue(q1)
   > percpu_ref_kill()
     ^^^^^^ freeze_depth can't guarantee the order

                                      blk_mq_unfreeze_queue()
                                        > percpu_ref_resurrect()

   > blk_mq_freeze_queue_wait()
     ^^^^^^ Hang here!!!!

This wrong sequence raises kernel warning:
percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

But the most unpleasant effect is a hang of a blk_mq_freeze_queue_wait(),
which waits for a zero of a q_usage_counter, which never happens
because percpu-ref was reinited (instead of being killed) and stays in
PERCPU state forever.

How to reproduce:
- "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
- cpu0: python Script.py 0; taskset the corresponding process running on cpu0
- cpu1: python Script.py 1; taskset the corresponding process running on cpu1

Script.py:
------
#!/usr/bin/python3

import os
import sys

while True:
    on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
    off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
    os.system(on)
    os.system(off)
------

This bug was first reported and fixed by Roman, previous discussion:
[1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
[2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
[3] https://patchwork.kernel.org/patch/9268199/
Reviewed-by: Hannes Reinecke
Reviewed-by: Ming Lei
Reviewed-by: Bart Van Assche
Reviewed-by: Christoph Hellwig
Signed-off-by: Roman Pen
Signed-off-by: Bob Liu
Signed-off-by: Jens Axboe -
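The interleaving described in the freeze/unfreeze commit above can be replayed deterministically in a few lines of Python. This is a toy model, not the kernel implementation: `PercpuRef` tracks only a dead/alive flag, the freeze depth is a plain integer, and the `serialized` branch is an assumption about one way to rule the race out (treating the depth update and the ref operation as a single critical section); the message above does not describe the actual fix.

```python
class PercpuRef:
    """Toy reference: tracks only dead/alive and repeated kills."""

    def __init__(self):
        self.dead = False
        self.double_kills = 0

    def kill(self):
        if self.dead:
            self.double_kills += 1  # models the "called more than once" warning
        self.dead = True

    def resurrect(self):
        self.dead = False


def run(serialized):
    """Replay CPU#1 freeze/unfreeze against a CPU#0 freeze.

    Returns (double_kills, cpu0_wait_can_finish).
    """
    ref, depth = PercpuRef(), 0

    # CPU#1: freeze q1 (depth 0 -> 1, so it kills the ref).
    depth += 1
    ref.kill()

    # CPU#1: unfreeze, first half: drop the depth and decide to resurrect.
    depth -= 1
    must_resurrect = depth == 0

    if serialized:
        # Depth update and ref operation form one critical section.
        if must_resurrect:
            ref.resurrect()
        depth += 1          # CPU#0: freeze (depth 0 -> 1)
        ref.kill()
    else:
        depth += 1          # CPU#0 sneaks in: depth 0 -> 1, kills again
        ref.kill()
        if must_resurrect:  # CPU#1: second half of unfreeze, now stale
            ref.resurrect()

    # CPU#0 waits for the ref to drain, which requires a killed ref.
    return ref.double_kills, ref.dead
```

In the unserialized replay the second kill is reported and the ref ends up alive, so the waiter on CPU#0 would never see it reach zero. In the serialized replay the ref is killed exactly once after the resurrect, and the wait can complete.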
At this point these fields aren't used for anything, so we can remove
them.
Reviewed-by: Ming Lei
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
We fundamentally do not have a maximum segment size for devices with a
virt boundary. So don't bother checking it, especially given that the
existing checks didn't properly work to start with as we never fully
update the front/back segment size and miss the bi_seg_front_size that
would have been required for some cases.
Signed-off-by: Christoph Hellwig
Reviewed-by: Ming Lei
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
-
We currently fail to update the front/back segment size in the bio when
deciding to allow an otherwise gappy segment to a device with a
virt boundary. The reason why this did not cause problems is that
devices with a virt boundary fundamentally don't use segments as we
know it and thus don't care. Make that assumption formal by forcing
an unlimited segment size in this case.
Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable")
Signed-off-by: Christoph Hellwig
Reviewed-by: Ming Lei
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
-
Currently ll_merge_requests_fn, unlike all other merge functions,
reduces nr_phys_segments by one if the last segment of the previous,
and the first segment of the next request are contiguous. While this
seems like a nice solution to avoid building smaller than possible
requests it causes a mismatch between the segments actually present
in the request and those iterated over by the bvec iterators, including
__rq_for_each_bio. This can for example mistrigger the single segment
optimization in the nvme-pci driver, and might lead to mismatching
nr_phys_segments number when recalculating the number of request
when inserting a cloned request.

We could possibly work around this by making the bvec iterators take
the front and back segment size into account, but that would require
moving them from the bio to the bio_iter and spreading this mess
over all users of bvecs. Or we could simply remove this optimization
under the assumption that most users already build good enough bvecs,
and that the bio merge path never cared about this optimization
either. The latter is what this patch does.

dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests").
Reviewed-by: Ming Lei
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
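The mismatch described in the ll_merge_requests_fn commit above can be illustrated with a small Python model. It is a sketch under simplifying assumptions: each request is a list of (start, length) physical ranges with one range per bvec, and the helper names are hypothetical rather than kernel functions.

```python
def contiguous(a, b):
    """True if range b starts exactly where range a ends."""
    return a[0] + a[1] == b[0]


def merged_count_old(req, nxt):
    """Segment count with the removed optimization: fold a contiguous border."""
    total = len(req) + len(nxt)
    if contiguous(req[-1], nxt[0]):
        total -= 1
    return total


def merged_count_new(req, nxt):
    """Segment count after the change: simply the sum of both requests."""
    return len(req) + len(nxt)


def iterated_segments(req, nxt):
    """What a walk over the individual bvecs of the merged request sees."""
    return len(req + nxt)
```

For two single-segment requests that happen to be physically adjacent, the old accounting reports one segment while an iterator still visits two entries. A consumer that trusts the count of one, such as a single-segment fast path, would then act on wrong information; the new accounting keeps both numbers equal.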
17 May, 2019
1 commit
-
Pull more block updates from Jens Axboe:
"This is mainly some late lightnvm changes that came in just before the
merge window, as well as fixes that have been queued up since the
initial pull request was frozen.

This contains:
- lightnvm changes, fixing race conditions, improving memory
utilization, and improving pblk compatibility (Chansol, Igor,
Marcin)
- NVMe pull request with minor fixes all over the map (via Christoph)
- remove redundant error print in sata_rcar (Geert)
- struct_size() cleanup (Jackie)
- dasd CONFIG_LBADF warning fix (Ming)
- brd cond_resched() improvement (Mikulas)"
* tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block: (41 commits)
block/bio-integrity: use struct_size() in kmalloc()
nvme: validate cntlid during controller initialisation
nvme: change locking for the per-subsystem controller list
nvme: trace all async notice events
nvme: fix typos in nvme status code values
nvme-fabrics: remove unused argument
nvme-multipath: avoid crash on invalid subsystem cntlid enumeration
nvme-fc: use separate work queue to avoid warning
nvme-rdma: remove redundant reference between ib_device and tagset
nvme-pci: mark expected switch fall-through
nvme-pci: add known admin effects to augument admin effects log page
nvme-pci: init shadow doorbell after each reset
brd: add cond_resched to brd_free_pages
sata_rcar: Remove ata_host_alloc() error printing
s390/dasd: fix build warning in dasd_eckd_build_cp_raw
lightnvm: pblk: use nvm_rq_to_ppa_list()
lightnvm: pblk: simplify partial read path
lightnvm: do not remove instance under global lock
lightnvm: track inflight target creations
lightnvm: pblk: recover only written metadata
...
16 May, 2019
1 commit
-
Use the new struct_size() helper to keep code simple.
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Jackie Liu
Signed-off-by: Jens Axboe
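As background on why struct_size() is preferred over an open-coded size computation: the helper computes the size of a structure with a trailing flexible array and saturates on overflow instead of wrapping. The Python model below is a sketch that assumes a 64-bit size_t; the parameter names are illustrative and the real helper is a C macro taking a pointer, a member name and a count.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def naive_size(header_size, elem_size, count):
    """Open-coded computation: wraps around like unsigned C arithmetic."""
    return (header_size + elem_size * count) & SIZE_MAX


def struct_size(header_size, elem_size, count):
    """Saturating computation: an overflow yields SIZE_MAX instead of wrapping."""
    total = header_size + elem_size * count
    return total if total <= SIZE_MAX else SIZE_MAX
```

With a huge element count the naive version wraps to a small number, which an allocator would happily satisfy with a too-small buffer. The saturating version returns SIZE_MAX, a size that an allocation is expected to reject.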
08 May, 2019
2 commits
-
Pull block updates from Jens Axboe:
"Nothing major in this series, just fixes and improvements all over the
map. This contains:
- Series of fixes for sed-opal (David, Jonas)
- Fixes and performance tweaks for BFQ (via Paolo)
- Set of fixes for bcache (via Coly)
- Set of fixes for md (via Song)
- Enabling multi-page for passthrough requests (Ming)
- Queue release fix series (Ming)
- Device notification improvements (Martin)
- Propagate underlying device rotational status in loop (Holger)
- Removal of mtip32xx trim support, which has been disabled for years
(Christoph)
- Improvement and cleanup of nvme command handling (Christoph)
- Add block SPDX tags (Christoph)
- Cleanup/hardening of bio/bvec iteration (Christoph)
- A few NVMe pull requests (Christoph)
- Removal of CONFIG_LBDAF (Christoph)
- Various little fixes here and there"
* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
block: fix mismerge in bvec_advance
block: don't drain in-progress dispatch in blk_cleanup_queue()
blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
blk-mq: always free hctx after request queue is freed
blk-mq: split blk_mq_alloc_and_init_hctx into two parts
blk-mq: free hw queue's resource in hctx's release handler
blk-mq: move cancel of requeue_work into blk_mq_release
blk-mq: grab .q_usage_counter when queuing request from plug code path
block: fix function name in comment
nvmet: protect discovery change log event list iteration
nvme: mark nvme_core_init and nvme_core_exit static
nvme: move command size checks to the core
nvme-fabrics: check more command sizes
nvme-pci: check more command sizes
nvme-pci: remove an unneeded variable initialization
nvme-pci: unquiesce admin queue on shutdown
nvme-pci: shutdown on timeout during deletion
nvme-pci: fix psdt field for single segment sgls
nvme-multipath: don't print ANA group state by default
nvme-multipath: split bios with the ns_head bio_set before submitting
...
-
Pull driver core/kobject updates from Greg KH:
"Here is the "big" set of driver core patches for 5.2-rc1

There are a number of ACPI patches in here as well, as Rafael said
they should go through this tree due to the driver core changes they
required. They have all been acked by the ACPI developers.

There are also a number of small subsystem-specific changes in here,
due to some changes to the kobject core code. Those too have all been
acked by the various subsystem maintainers.

As for content, it's pretty boring outside of the ACPI changes:
- spdx cleanups
- kobject documentation updates
- default attribute groups for kobjects
- other minor kobject/driver core fixes

All have been in linux-next for a while with no reported issues"
* tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
kobject: clean up the kobject add documentation a bit more
kobject: Fix kernel-doc comment first line
kobject: Remove docstring reference to kset
firmware_loader: Fix a typo ("syfs" -> "sysfs")
kobject: fix dereference before null check on kobj
Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
init/config: Do not select BUILD_BIN2C for IKCONFIG
Provide in-kernel headers to make extending kernel easier
kobject: Improve doc clarity kobject_init_and_add()
kobject: Improve docs for kobject_add/del
driver core: platform: Fix the usage of platform device name(pdev->name)
livepatch: Replace klp_ktype_patch's default_attrs with groups
cpufreq: schedutil: Replace default_attrs field with groups
padata: Replace padata_attr_type default_attrs field with groups
irqdesc: Replace irq_kobj_type's default_attrs field with groups
net-sysfs: Replace ktype default_attrs field with groups
block: Replace all ktype default_attrs with groups
samples/kobject: Replace foo_ktype's default_attrs field with groups
kobject: Add support for default attribute groups to kobj_type
driver core: Postpone DMA tear-down until after devres release for probe failure
...
04 May, 2019
6 commits
-
Now freeing hw queue resource is moved to hctx's release handler,
we don't need to worry about the race between blk_cleanup_queue and
run queue any more.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically revert of c2856ae2f315 ("blk-mq: quiesce queue before
freeing queue").
Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen
Cc: Christoph Hellwig
Cc: James E . J . Bottomley
Reviewed-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
hctx is always released after requeue is freed.
With holding queue's kobject refcount, it is safe for driver to run queue,
so one run queue might be scheduled after blk_sync_queue() is done.

So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
to avoid running a released queue.
Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen
Cc: Christoph Hellwig
Cc: James E . J . Bottomley
Reviewed-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().

However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This can easily cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.

Fix this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to reuse these hctxs if numa
node is matched.
Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen
Cc: Christoph Hellwig
Cc: James E . J . Bottomley
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe -
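The per-queue list of retired hctxs described in the commit above can be sketched as follows. This Python model only illustrates the lifetime rule (park on shrink, reuse on a NUMA match, free only at queue release); the `Hctx` and `Queue` classes and their method names are hypothetical and do not mirror the kernel data structures.

```python
class Hctx:
    def __init__(self, numa_node):
        self.numa_node = numa_node


class Queue:
    def __init__(self):
        self.hctxs = []    # hardware contexts currently in use
        self.unused = []   # retired contexts, kept until the queue is released
        self.freed = 0

    def shrink(self, new_count):
        """Retire surplus hctxs instead of freeing them right away."""
        while len(self.hctxs) > new_count:
            self.unused.append(self.hctxs.pop())

    def grow(self, numa_node):
        """Prefer a retired hctx on the same NUMA node over a fresh one."""
        for i, hctx in enumerate(self.unused):
            if hctx.numa_node == numa_node:
                self.hctxs.append(self.unused.pop(i))
                return hctx
        hctx = Hctx(numa_node)
        self.hctxs.append(hctx)
        return hctx

    def release(self):
        """Only here are all hctxs, active or retired, actually freed."""
        self.freed = len(self.hctxs) + len(self.unused)
        self.hctxs.clear()
        self.unused.clear()
```

Because a retired hctx stays allocated until `release()`, a caller that looked it up before the shrink still holds a valid object, which is the property the commit relies on to avoid use-after-free.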
Split blk_mq_alloc_and_init_hctx into two parts, and one is
blk_mq_alloc_hctx() for allocating all hctx resources, another
is blk_mq_init_hctx() for initializing hctx, which serves as
counter-part of blk_mq_exit_hctx().
Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen
Cc: Christoph Hellwig
Cc: James E . J . Bottomley
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways for addressing this kind of issue before, such as:
8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But still can't cover all cases, recently James reports another such
kind of issue:
https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing hctx's resources in its release handler, and this
way is safe because tags isn't needed for freeing such hctx resource.

This approach follows typical design pattern wrt. kobject's release handler.
Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen
Cc: Christoph Hellwig
Cc: James E . J . Bottomley
Reported-by: James Smart
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
With holding queue's kobject refcount, it is safe for driver
to schedule requeue. However, blk_mq_kick_requeue_list() may
be called after blk_sync_queue() is done because of concurrent
requeue activities, then requeue work may not be completed when
freeing queue, and kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to
avoid a race between requeue and freeing queue.
Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen
Cc: Christoph Hellwig
Cc: James E . J . Bottomley
Reviewed-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe