Eric Lee / smarc-fsl-linux-kernel

10 Oct, 2020

1 commit

47ce030b7 blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue ... Browse Code »

blk_exit_queue will free elevator_data, while blk_mq_run_work_fn
will access it. Move cancel of hctx->run_work to the front of
blk_exit_queue to avoid use-after-free.

Fixes: 1b97871b501f ("blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release")
Signed-off-by: Yang Yang
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe

Yang Yang
2020-10-10 02:46:28 +0800

04 Nov, 2019

1 commit

d2c9be89f blk-mq: make sure that line break can be printed ... Browse Code »

8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
avoids sysfs buffer overflow, and reserves one character for line break.
However, the last snprintf() doesn't get correct 'size' parameter passed
in, so fixed it.

Fixes: 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-11-04 22:14:10 +0800

02 Nov, 2019

1 commit

8962842ca blk-mq: avoid sysfs buffer overflow with too many CPU cores ... Browse Code »

It is reported that sysfs buffer overflow can be triggered if the system
has too many CPU cores(>841 on 4K PAGE_SIZE) when showing CPUs of
hctx via /sys/block/$DEV/mq/$N/cpu_list.

Use snprintf to avoid the potential buffer overflow.

This version doesn't change the attribute format, and simply stops
showing CPU numbers if the buffer is going to overflow.

Cc: stable@vger.kernel.org
Fixes: 676141e48af7("blk-mq: don't dump CPU -> hw queue map on driver load")
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-11-02 22:02:21 +0800

07 Oct, 2019

1 commit

bae85c156 block: Remove "dying" checks from sysfs callbacks ... Browse Code »

Block drivers must call del_gendisk() before blk_cleanup_queue().
del_gendisk() calls kobject_del() and kobject_del() waits until any
ongoing sysfs callback functions have finished. In other words, the
sysfs callback functions won't be called for a queue in the dying
state. Hence remove the "dying" checks from the sysfs callback
functions.

Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-10-07 22:31:59 +0800

28 Aug, 2019

2 commits

cecf5d87f block: split .sysfs_lock into two locks ... Browse Code »

The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
required.

However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This way causes AB-BA lock because the kernfs built-in lock of
'kn-count' is required inside kobject_del() too, see the lockdep warning[1].

On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing REGISTERED flag prevents storing to 'queue/scheduler'
from being happened. Also sysfs write(store) is exclusive, so no
necessary to hold the lock for elv_unregister_queue() when it is
called in switching elevator path.

So split .sysfs_lock into two: one is still named as .sysfs_lock for
covering sync .store, the other one is named as .sysfs_dir_lock
for covering kobjects and related status change.

sysfs itself can handle the race between add/remove kobjects and
showing/storing attributes under kobjects. For switching scheduler
via storing to 'queue/scheduler', we use the queue flag of
QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, then
we can avoid to hold .sysfs_lock during removing/adding kobjects.

[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72

but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&q->sysfs_lock);
lock(kn->count#202);
lock(&q->sysfs_lock);
lock(kn->count#202);

*** DEADLOCK ***

2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0

Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Greg KH
Cc: Mike Snitzer
Reviewed-by: Bart Van Assche
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-08-28 00:40:20 +0800
9685b2270 block: Remove blk_mq_register_dev() ... Browse Code »

This function has no callers. Hence remove it.

Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Hannes Reinecke
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2019-08-28 00:40:19 +0800

08 May, 2019

1 commit

67a242223 Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block updates from Jens Axboe:
"Nothing major in this series, just fixes and improvements all over the
map. This contains:

- Series of fixes for sed-opal (David, Jonas)

- Fixes and performance tweaks for BFQ (via Paolo)

- Set of fixes for bcache (via Coly)

- Set of fixes for md (via Song)

- Enabling multi-page for passthrough requests (Ming)

- Queue release fix series (Ming)

- Device notification improvements (Martin)

- Propagate underlying device rotational status in loop (Holger)

- Removal of mtip32xx trim support, which has been disabled for years
(Christoph)

- Improvement and cleanup of nvme command handling (Christoph)

- Add block SPDX tags (Christoph)

- Cleanup/hardening of bio/bvec iteration (Christoph)

- A few NVMe pull requests (Christoph)

- Removal of CONFIG_LBDAF (Christoph)

- Various little fixes here and there"

* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
block: fix mismerge in bvec_advance
block: don't drain in-progress dispatch in blk_cleanup_queue()
blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
blk-mq: always free hctx after request queue is freed
blk-mq: split blk_mq_alloc_and_init_hctx into two parts
blk-mq: free hw queue's resource in hctx's release handler
blk-mq: move cancel of requeue_work into blk_mq_release
blk-mq: grab .q_usage_counter when queuing request from plug code path
block: fix function name in comment
nvmet: protect discovery change log event list iteration
nvme: mark nvme_core_init and nvme_core_exit static
nvme: move command size checks to the core
nvme-fabrics: check more command sizes
nvme-pci: check more command sizes
nvme-pci: remove an unneeded variable initialization
nvme-pci: unquiesce admin queue on shutdown
nvme-pci: shutdown on timeout during deletion
nvme-pci: fix psdt field for single segment sgls
nvme-multipath: don't print ANA group state by default
nvme-multipath: split bios with the ns_head bio_set before submitting
...

Linus Torvalds
2019-05-08 09:14:36 +0800

04 May, 2019

2 commits

1b97871b5 blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release ... Browse Code »

hctx is always released after requeue is freed.

With holding queue's kobject refcount, it is safe for driver to run queue,
so one run queue might be scheduled after blk_sync_queue() is done.

So moving the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
for avoiding run released queue.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reviewed-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-04 21:24:09 +0800
c7e2d94b3 blk-mq: free hw queue's resource in hctx's release handler ... Browse Code »

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways for addressing this kind of issue before, such as:

8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But still can't cover all cases, recently James reports another such
kind of issue:

https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.

Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.

This approach follows typical design pattern wrt. kobject's release handler.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reported-by: James Smart
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-04 21:24:05 +0800

01 May, 2019

1 commit

3dcf60bcb block: add SPDX tags to block layer files missing licensing information ... Browse Code »

Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-05-01 06:12:03 +0800

26 Apr, 2019

1 commit

800f5aa1e block: Replace all ktype default_attrs with groups ... Browse Code »

The kobj_type default_attrs field is being replaced by the
default_groups field. Replace all of the ktype default_attrs fields in
the block subsystem with default_groups and use the ATTRIBUTE_GROUPS
macro to create the default groups.

Remove default_ctx_attrs[] because it doesn't contain any attributes.

This patch was tested by verifying that the sysfs files for the
attributes in the default groups were created.

Signed-off-by: Kimberly Brown
Reviewed-by: Bart Van Assche
Signed-off-by: Greg Kroah-Hartman

Kimberly Brown
2019-04-26 04:06:11 +0800

17 Dec, 2018

1 commit

346fc1089 blk-mq: export hctx->type in debugfs instead of sysfs ... Browse Code »

Now we only export hctx->type via sysfs, and there isn't such info
in hctx entry under debugfs. We often use debugfs only to diagnose
queue mapping issue, so add the support in debugfs.

Queue mapping becomes a bit more complicated after multiple queue
mapping is supported, we may write blktest to verify if queue mapping
is valid based on blk-mq-debugfs.

Given not necessary to export hctx->type twice, so remove the export
from sysfs.

Cc: Jeff Moyer
Cc: Mike Snitzer
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2018-12-17 20:44:45 +0800

05 Dec, 2018

1 commit

e20ba6e1d block: move queues types to the block layer ... Browse Code »

Having another indirect all in the fast path doesn't really help
in our post-spectre world. Also having too many queue type is just
going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the
first index now is the default queue for everything not explicitly
marked, the optional ones are read and poll queues.

Reviewed-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2018-12-05 02:38:17 +0800

21 Nov, 2018

1 commit

1db4909e7 blk-mq: not embed .mq_kobj and ctx->kobj into queue instance ... Browse Code »

Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime
from block layer's view, actually they don't because userspace may
grab one kobject anytime via sysfs.

This patch fixes the issue by the following approach:

1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
all ctxs

2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
handler of .mq_kobj

3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
.mq_kobj is always released after all ctxs are freed.

This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE
is enabled.

Reported-by: Guenter Roeck
Cc: "jianchao.wang"
Tested-by: Guenter Roeck
Reviewed-by: Greg Kroah-Hartman
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2018-11-21 20:57:56 +0800

16 Nov, 2018

1 commit

b6676f653 block: remove a few unused exports ... Browse Code »

Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2018-11-16 03:13:25 +0800

08 Nov, 2018

1 commit

a783b8182 blk-mq: add 'type' attribute to the sysfs hctx directory ... Browse Code »

It can be useful for a user to verify what type a given hardware
queue is, expose this information in sysfs.

Reviewed-by: Hannes Reinecke
Reviewed-by: Bart Van Assche
Reviewed-by: Keith Busch
Reviewed-by: Sagi Grimberg
Signed-off-by: Jens Axboe

Jens Axboe
2018-11-08 04:44:59 +0800

25 May, 2018

1 commit

5657a819a block drivers/block: Use octal not symbolic permissions ... Browse Code »

Convert the S_ symbolic permissions to their octal equivalents as
using octal and not symbolic permissions is preferred by many as more
readable.

see: https://lkml.org/lkml/2016/8/2/1945

Done with automated conversion via:
$ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace

Miscellanea:

o Wrapped modified multi-line calls to a single line where appropriate
o Realign modified multi-line calls to open parenthesis

Signed-off-by: Joe Perches
Signed-off-by: Jens Axboe

Joe Perches
2018-05-25 03:38:59 +0800

15 Jan, 2018

1 commit

667257e8b block: properly protect the 'queue' kobj in blk_unregister_queue ... Browse Code »

The original commit e9a823fb34a8b (block: fix warning when I/O elevator
is changed as request_queue is being removed) is pretty conflated.
"conflated" because the resource being protected by q->sysfs_lock isn't
the queue_flags (it is the 'queue' kobj).

q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
from racing with blk_unregister_queue():
1) By holding q->sysfs_lock first, __elevator_change() can complete
before a racing blk_unregister_queue().
2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
in case elv_iosched_store() loses the race with blk_unregister_queue(),
it needs a way to know the 'queue' kobj isn't there.

Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
held until after the 'queue' kobj is removed.

To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So
rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

Also, blk_unregister_queue() should use q->queue_lock to protect against
any concurrent writes to q->queue_flags -- even though chances are the
queue is being cleaned up so no concurrent writes are likely.

Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
Signed-off-by: Mike Snitzer
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe

Mike Snitzer
2018-01-15 23:41:38 +0800

04 May, 2017

2 commits

9c1051aac blk-mq: untangle debugfs and sysfs ... Browse Code »

Originally, I tied debugfs registration/unregistration together with
sysfs. There's no reason to do this, and it's getting in the way of
letting schedulers define their own debugfs attributes. Instead, tie the
debugfs registration to the lifetime of the structures themselves.

The saner lifetimes mean we can also get rid of the extra mq directory
and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
just nvme0n1/hctx0/tags.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:24:13 +0800
d173a2516 blk-mq: move debugfs declarations to a separate header file ... Browse Code »

Preparation for adding more declarations.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:23:44 +0800

27 Apr, 2017

4 commits

f05d1ba78 blk-mq: Only unregister hctxs for which registration succeeded ... Browse Code »

Hctx unregistration involves calling kobject_del(). kobject_del()
must not be called if kobject_add() has not been called. Hence in
the error path only unregister hctxs for which registration succeeded.

Signed-off-by: Bart Van Assche
Cc: Omar Sandoval
Cc: Hannes Reinecke
Signed-off-by: Jens Axboe

Bart Van Assche
2017-04-27 05:09:04 +0800
62d6c9496 blk-mq-debugfs: Rename functions for registering and unregistering the mq directory ... Browse Code »

Since the blk_mq_debugfs_*register_hctxs() functions register and
unregister all attributes under the "mq" directory, rename these
into blk_mq_debugfs_*register_mq().

Signed-off-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe

Bart Van Assche
2017-04-27 05:09:04 +0800
4c9e4019f blk-mq: Let blk_mq_debugfs_register() look up the queue name ... Browse Code »

A later patch will move the call of blk_mq_debugfs_register() to
a function to which the queue name is not passed as an argument.
To avoid having to add a 'name' argument to multiple callers, let
blk_mq_debugfs_register() look up the queue name.

Signed-off-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Bart Van Assche
2017-04-27 05:09:04 +0800
2d0364c8c blk-mq: Register <dev>/queue/mq after having registered <dev>/queue ... Browse Code »

A later patch in this series will modify blk_mq_debugfs_register()
such that it uses q->kobj.parent to determine the name of a
request queue. Hence make sure that that pointer is initialized
before blk_mq_debugfs_register() is called. To avoid lock inversion,
protect sysfs / debugfs registration with the queue sysfs_lock
instead of the global mutex all_q_mutex.

Signed-off-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe

Bart Van Assche
2017-04-27 05:09:04 +0800

09 Mar, 2017

4 commits

01388df37 blk-mq: free hctx->cpumask in release handler of hctx's kobject ... Browse Code »

It is obviously that hctx->cpumask is per hctx, and both
share same lifetime, so this patch moves freeing of hctx->cpumask
into release handler of hctx's kobject.

Signed-off-by: Ming Lei
Tested-by: Peter Zijlstra (Intel)
Signed-off-by: Jens Axboe

Ming Lei
2017-03-09 00:56:12 +0800
6c8b232ef blk-mq: make lifetime consistent between hctx and its kobject ... Browse Code »

This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(),
and trys to keep lifetime consistent between hctx and hctx's kobject.

Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
totally symmetrical, and kobject's refcounter drops to zero just
when the hctx is freed.

Signed-off-by: Ming Lei
Tested-by: Peter Zijlstra (Intel)
Signed-off-by: Jens Axboe

Ming Lei
2017-03-09 00:56:12 +0800
7ea5fe31c blk-mq: make lifetime consitent between q/ctx and its kobject ... Browse Code »

Currently from kobject view, both q->mq_kobj and ctx->kobj can
be released during one cycle of blk_mq_register_dev() and
blk_mq_unregister_dev(). Actually, sw queue's lifetime is
same with its request queue's, which is covered by request_queue->kobj.

So we don't need to call kobject_put() for the two kinds of
kobject in __blk_mq_unregister_dev(), instead we do that
in release handler of request queue.

Signed-off-by: Ming Lei
Tested-by: Peter Zijlstra (Intel)
Signed-off-by: Jens Axboe

Ming Lei
2017-03-09 00:56:12 +0800
737f98cfe blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue() ... Browse Code »

Both q->mq_kobj and sw queues' kobjects should have been initialized
once, instead of doing that each add_disk context.

Also this patch removes clearing of ctx in blk_mq_init_cpu_queues()
because percpu allocator fills zero to allocated variable.

This patch fixes one issue[1] reported from Omar.

[1] kernel wearning when doing unbind/bind on one scsi-mq device

[ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
[ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
[ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
[ 19.350920] Workqueue: events_unbound async_run_entry_fn
[ 19.350920] Call Trace:
[ 19.350920] dump_stack+0x63/0x83
[ 19.350920] kobject_init+0x77/0x90
[ 19.350920] blk_mq_register_dev+0x40/0x130
[ 19.350920] blk_register_queue+0xb6/0x190
[ 19.350920] device_add_disk+0x1ec/0x4b0
[ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod]
[ 19.350920] async_run_entry_fn+0x48/0x150
[ 19.350920] process_one_work+0x1d0/0x480
[ 19.350920] worker_thread+0x48/0x4e0
[ 19.350920] kthread+0x101/0x140
[ 19.350920] ? process_one_work+0x480/0x480
[ 19.350920] ? kthread_create_on_node+0x60/0x60
[ 19.350920] ret_from_fork+0x2c/0x40

Cc: Omar Sandoval
Signed-off-by: Ming Lei
Tested-by: Peter Zijlstra (Intel)
Signed-off-by: Jens Axboe

Ming Lei
2017-03-09 00:56:12 +0800

03 Feb, 2017

1 commit

62ebce16c blk-mq: move debugfs_remove() of disk dir to blk_release_queue() ... Browse Code »

This needs to happen after we tear down blktrace.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2017-02-03 01:20:16 +0800

27 Jan, 2017

5 commits

4a46f05eb blk-mq: move hctx and ctx counters from sysfs to debugfs ... Browse Code »

These counters aren't as out-of-place in sysfs as the other stuff, but
debugfs is a slightly better home for them.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2017-01-27 23:17:44 +0800
be2154731 blk-mq: move hctx io_poll, stats, and dispatched from sysfs to debugfs ... Browse Code »

These statistics _might_ be useful to userspace, but it's better not to
commit to an ABI for these yet. Also, the dispatched file in sysfs
couldn't be cleared, so make it clearable like the others in debugfs.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2017-01-27 23:17:44 +0800
d96b37c0a blk-mq: move tags and sched_tags info from sysfs to debugfs ... Browse Code »

These are very tied to the blk-mq tag implementation, so exposing them
to sysfs isn't a great idea. Move the debugging information to debugfs
and add basic entries for the number of tags and the number of reserved
tags to sysfs.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2017-01-27 23:17:44 +0800
950cd7e9f blk-mq: move hctx->dispatch and ctx->rq_list from sysfs to debugfs ... Browse Code »

These lists are only useful for debugging; they definitely don't belong
in sysfs. Putting them in debugfs also removes the limitation of a
single page of output.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2017-01-27 23:17:44 +0800
07e4fead4 blk-mq: create debugfs directory tree ... Browse Code »

In preparation for putting blk-mq debugging information in debugfs,
create a directory tree mirroring the one in sysfs:

# tree -d /sys/kernel/debug/block
/sys/kernel/debug/block
|-- nvme0n1
| `-- mq
| |-- 0
| | `-- cpu0
| |-- 1
| | `-- cpu1
| |-- 2
| | `-- cpu2
| `-- 3
| `-- cpu3
`-- vda
`-- mq
`-- 0
|-- cpu0
|-- cpu1
|-- cpu2
`-- cpu3

Also add the scaffolding for the actual files that will go in here,
either under the hardware queue or software queue directories.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2017-01-27 23:17:44 +0800

18 Jan, 2017

1 commit

bd166ef18 blk-mq-sched: add framework for MQ capable IO schedulers ... Browse Code »

This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.

We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.

Signed-off-by: Jens Axboe
Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval

Jens Axboe
2017-01-18 01:04:20 +0800

11 Nov, 2016

1 commit

cf43e6be8 block: add scalable completion tracking of requests ... Browse Code »

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

The feature is off by default, to avoid any extra overhead. In-kernel
users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files, that is something we could add as well.

Signed-off-by: Jens Axboe

Jens Axboe
2016-11-11 04:53:26 +0800

21 Sep, 2016

1 commit

b21d5b301 blk-mq: register device instead of disk ... Browse Code »

Enable devices without a gendisk instance to register itself with blk-mq
and expose the associated multi-queue sysfs entries.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe

Matias Bjørling
2016-09-21 21:56:16 +0800

17 Sep, 2016

1 commit

703fd1c0f blk-mq: account higher order dispatch ... Browse Code »

We currently account a '0' dispatch, and anything above that still falls
below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more,
we don't account it.

Change the last bucket to be inclusive of anything above the range we
track, and have the sysfs file reflect that by including a '+' in the
output:

$ cat /sys/block/nvme0n1/mq/0/dispatched
0 1006
1 20229
2 1
4 0
8 0
16 0
32+ 0

Signed-off-by: Jens Axboe
Reviewed-by: Omar Sandoval

Jens Axboe
2016-09-17 04:03:04 +0800

14 Sep, 2016

2 commits

d21ea4bc0 block: enable zeroing of io_poll statistics ... Browse Code »

Allow the io_poll statistics to be zeroed to make for easier logging
of polling event.

Signed-off-by: Stephen Bates
Acked-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Stephen Bates
2016-09-14 22:41:23 +0800
6e219353a block: add poll_considered statistic ... Browse Code »

In order to help determine the effectiveness of polling in a running
system it is usful to determine the ratio of how often the poll
function is called vs how often the completion is checked. For this
reason we add a poll_considered variable and add it to the sysfs entry
for io_poll.

Signed-off-by: Stephen Bates
Acked-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Stephen Bates
2016-09-14 22:41:21 +0800