10 Oct, 2020

1 commit


04 Nov, 2019

1 commit

  • 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
    avoids the sysfs buffer overflow and reserves one character for the line break.
    However, the last snprintf() is not passed the correct 'size' parameter,
    so fix it.

    Fixes: 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

02 Nov, 2019

1 commit

  • It is reported that a sysfs buffer overflow can be triggered if the system
    has too many CPU cores (>841 with a 4K PAGE_SIZE) when showing the CPUs of
    an hctx via /sys/block/$DEV/mq/$N/cpu_list.

    Use snprintf to avoid the potential buffer overflow.

    This version doesn't change the attribute format, and simply stops
    showing CPU numbers if the buffer is going to overflow.

    Cc: stable@vger.kernel.org
    Fixes: 676141e48af7("blk-mq: don't dump CPU -> hw queue map on driver load")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
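
    The bounded formatting described in this commit can be sketched in user
    space as follows. This is a minimal illustration, not the kernel code:
    format_cpu_list() and its parameters are hypothetical names, and the C
    library's snprintf() stands in for the kernel's. It keeps the "0, 1, 2"
    format, stops adding CPU numbers once the next one would not fit, and
    reserves one byte so the trailing newline always fits.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical user-space stand-in for a sysfs "show" callback: format
 * CPU numbers as "0, 1, 2\n" into a buffer of `size` bytes, stopping
 * early instead of overflowing. Returns the number of bytes written,
 * not counting the terminating NUL.
 */
static size_t format_cpu_list(char *buf, size_t size,
                              const unsigned int *cpus, size_t ncpus)
{
    /* Need room for at least the newline and the terminating NUL. */
    assert(size >= 2);

    /* Hold back one byte so the final '\n' is guaranteed to fit. */
    const size_t limit = size - 1;
    size_t pos = 0;

    for (size_t i = 0; i < ncpus; i++) {
        int ret = snprintf(buf + pos, limit - pos, "%s%u",
                           i ? ", " : "", cpus[i]);

        /* snprintf() returns the length it wanted to write. */
        if (ret < 0 || (size_t)ret >= limit - pos) {
            buf[pos] = '\0';    /* discard the truncated entry */
            break;
        }
        pos += (size_t)ret;
    }

    /* Pass the full remaining size here, including the reserved byte. */
    pos += (size_t)snprintf(buf + pos, size - pos, "\n");
    return pos;
}
```

    With an 8-byte buffer and CPUs 0 to 3, only "0, 1" fits before the
    newline; the remaining CPUs are silently dropped rather than overflowing.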
     

07 Oct, 2019

1 commit

  • Block drivers must call del_gendisk() before blk_cleanup_queue().
    del_gendisk() calls kobject_del() and kobject_del() waits until any
    ongoing sysfs callback functions have finished. In other words, the
    sysfs callback functions won't be called for a queue in the dying
    state. Hence remove the "dying" checks from the sysfs callback
    functions.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

28 Aug, 2019

2 commits

  • The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
    path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
    required.

    However, when the mq & iosched kobjects are removed via
    blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
    too. This causes an AB-BA deadlock because the kernfs built-in lock of
    'kn->count' is also required inside kobject_del(); see the lockdep warning[1].

    On the other hand, it isn't necessary to acquire q->sysfs_lock for
    either blk_mq_unregister_dev() or elv_unregister_queue(), because
    clearing the REGISTERED flag prevents stores to 'queue/scheduler'
    from happening. Also, a sysfs write (store) is exclusive, so there is no
    need to hold the lock for elv_unregister_queue() when it is
    called in the elevator switching path.

    So split .sysfs_lock into two: one is still named as .sysfs_lock for
    covering sync .store, the other one is named as .sysfs_dir_lock
    for covering kobjects and related status change.

    sysfs itself can handle the race between adding/removing kobjects and
    showing/storing attributes under those kobjects. For switching the
    scheduler via a store to 'queue/scheduler', we use the queue flag
    QUEUE_FLAG_REGISTERED together with .sysfs_lock to avoid the race, so
    we can avoid holding .sysfs_lock while removing/adding kobjects.

    [1] lockdep warning
    ======================================================
    WARNING: possible circular locking dependency detected
    5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
    ------------------------------------------------------
    rmmod/777 is trying to acquire lock:
    00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72

    but task is already holding lock:
    00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->sysfs_lock){+.+.}:
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    __mutex_lock+0x14a/0xa9b
    blk_mq_hw_sysfs_show+0x63/0xb6
    sysfs_kf_seq_show+0x11f/0x196
    seq_read+0x2cd/0x5f2
    vfs_read+0xc7/0x18c
    ksys_read+0xc4/0x13e
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#202){++++}:
    check_prev_add+0x5d2/0xc45
    validate_chain+0xed3/0xf94
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    __kernfs_remove+0x237/0x40b
    kernfs_remove_by_name_ns+0x59/0x72
    remove_files+0x61/0x96
    sysfs_remove_group+0x81/0xa4
    sysfs_remove_groups+0x3b/0x44
    kobject_del+0x44/0x94
    blk_mq_unregister_dev+0x83/0xdd
    blk_unregister_queue+0xa0/0x10b
    del_gendisk+0x259/0x3fa
    null_del_dev+0x8b/0x1c3 [null_blk]
    null_exit+0x5c/0x95 [null_blk]
    __se_sys_delete_module+0x204/0x337
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&q->sysfs_lock);
                                   lock(kn->count#202);
                                   lock(&q->sysfs_lock);
      lock(kn->count#202);

    *** DEADLOCK ***

    2 locks held by rmmod/777:
    #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
    #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    stack backtrace:
    CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
    Call Trace:
    dump_stack+0x9a/0xe6
    check_noncircular+0x207/0x251
    ? print_circular_bug+0x32a/0x32a
    ? find_usage_backwards+0x84/0xb0
    check_prev_add+0x5d2/0xc45
    validate_chain+0xed3/0xf94
    ? check_prev_add+0xc45/0xc45
    ? mark_lock+0x11b/0x804
    ? check_usage_forwards+0x1ca/0x1ca
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    ? kernfs_remove_by_name_ns+0x59/0x72
    __kernfs_remove+0x237/0x40b
    ? kernfs_remove_by_name_ns+0x59/0x72
    ? kernfs_next_descendant_post+0x7d/0x7d
    ? strlen+0x10/0x23
    ? strcmp+0x22/0x44
    kernfs_remove_by_name_ns+0x59/0x72
    remove_files+0x61/0x96
    sysfs_remove_group+0x81/0xa4
    sysfs_remove_groups+0x3b/0x44
    kobject_del+0x44/0x94
    blk_mq_unregister_dev+0x83/0xdd
    blk_unregister_queue+0xa0/0x10b
    del_gendisk+0x259/0x3fa
    ? disk_events_poll_msecs_store+0x12b/0x12b
    ? check_flags+0x1ea/0x204
    ? mark_held_locks+0x1f/0x7a
    null_del_dev+0x8b/0x1c3 [null_blk]
    null_exit+0x5c/0x95 [null_blk]
    __se_sys_delete_module+0x204/0x337
    ? free_module+0x39f/0x39f
    ? blkcg_maybe_throttle_current+0x8a/0x718
    ? rwlock_bug+0x62/0x62
    ? __blkcg_punt_bio_submit+0xd0/0xd0
    ? trace_hardirqs_on_thunk+0x1a/0x20
    ? mark_held_locks+0x1f/0x7a
    ? do_syscall_64+0x4c/0x295
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fb696cdbe6b
    Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
    RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
    RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
    R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
    R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This function has no callers. Hence remove it.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
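
    The lock split described in the first commit of this group (.sysfs_lock
    versus .sysfs_dir_lock) can be illustrated with a small user-space model.
    Everything here is an assumption made for illustration: struct model_queue,
    its fields and both functions are hypothetical, and pthread mutexes stand
    in for kernel mutexes. The point it shows is that the store path only
    needs .sysfs_lock plus the registered flag, so the removal path can drop
    .sysfs_lock before it removes kobjects.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a request queue; not kernel code. */
struct model_queue {
    pthread_mutex_t sysfs_lock;      /* serializes attribute stores */
    pthread_mutex_t sysfs_dir_lock;  /* covers adding/removing kobjects */
    bool registered;                 /* stands in for QUEUE_FLAG_REGISTERED */
    bool kobjects_present;
    int scheduler;
};

/* Store path: takes only sysfs_lock and bails out once unregistered. */
static int model_scheduler_store(struct model_queue *q, int new_sched)
{
    int ret = -1;

    assert(q != NULL);
    pthread_mutex_lock(&q->sysfs_lock);
    if (q->registered) {
        q->scheduler = new_sched;
        ret = 0;
    }
    pthread_mutex_unlock(&q->sysfs_lock);
    return ret;
}

/*
 * Unregister path: clear the flag under sysfs_lock, release it, and only
 * then remove the kobjects under sysfs_dir_lock. Because sysfs_lock is
 * not held during removal, waiting for in-flight callbacks there cannot
 * form an AB-BA cycle with a callback that takes sysfs_lock.
 */
static void model_unregister(struct model_queue *q)
{
    assert(q != NULL);
    pthread_mutex_lock(&q->sysfs_lock);
    q->registered = false;
    pthread_mutex_unlock(&q->sysfs_lock);

    pthread_mutex_lock(&q->sysfs_dir_lock);
    q->kobjects_present = false;   /* where kobject_del() would run */
    pthread_mutex_unlock(&q->sysfs_dir_lock);
}
```

    A store that loses the race with model_unregister() sees the cleared flag
    and fails instead of touching state that is being torn down.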
     

08 May, 2019

1 commit

  • Pull block updates from Jens Axboe:
    "Nothing major in this series, just fixes and improvements all over the
    map. This contains:

    - Series of fixes for sed-opal (David, Jonas)

    - Fixes and performance tweaks for BFQ (via Paolo)

    - Set of fixes for bcache (via Coly)

    - Set of fixes for md (via Song)

    - Enabling multi-page for passthrough requests (Ming)

    - Queue release fix series (Ming)

    - Device notification improvements (Martin)

    - Propagate underlying device rotational status in loop (Holger)

    - Removal of mtip32xx trim support, which has been disabled for years
    (Christoph)

    - Improvement and cleanup of nvme command handling (Christoph)

    - Add block SPDX tags (Christoph)

    - Cleanup/hardening of bio/bvec iteration (Christoph)

    - A few NVMe pull requests (Christoph)

    - Removal of CONFIG_LBDAF (Christoph)

    - Various little fixes here and there"

    * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
    block: fix mismerge in bvec_advance
    block: don't drain in-progress dispatch in blk_cleanup_queue()
    blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
    blk-mq: always free hctx after request queue is freed
    blk-mq: split blk_mq_alloc_and_init_hctx into two parts
    blk-mq: free hw queue's resource in hctx's release handler
    blk-mq: move cancel of requeue_work into blk_mq_release
    blk-mq: grab .q_usage_counter when queuing request from plug code path
    block: fix function name in comment
    nvmet: protect discovery change log event list iteration
    nvme: mark nvme_core_init and nvme_core_exit static
    nvme: move command size checks to the core
    nvme-fabrics: check more command sizes
    nvme-pci: check more command sizes
    nvme-pci: remove an unneeded variable initialization
    nvme-pci: unquiesce admin queue on shutdown
    nvme-pci: shutdown on timeout during deletion
    nvme-pci: fix psdt field for single segment sgls
    nvme-multipath: don't print ANA group state by default
    nvme-multipath: split bios with the ns_head bio_set before submitting
    ...

    Linus Torvalds
     

04 May, 2019

2 commits

  • hctx is always released after the request queue is freed.

    While the queue's kobject refcount is held, it is safe for a driver to run
    the queue, so a queue run might be scheduled after blk_sync_queue() is done.

    So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
    to avoid running a released queue.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K . Petersen
    Cc: Christoph Hellwig
    Cc: James E . J . Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Once blk_cleanup_queue() returns, tags shouldn't be used any more,
    because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
    ("blk-mq: Fix a use-after-free") fixes this issue exactly.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    we are allowed to run queue during cleaning up queue if the queue's
    kobj refcount is held. After that commit, queue can't be run during
    queue cleaning up, otherwise oops can be triggered easily because
    some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

    We have invented ways for addressing this kind of issue before, such as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But these still can't cover all cases; recently James reported another
    issue of this kind:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue can be quite hard to address in the previous way, given that
    scsi_run_queue() may run requeues for other LUNs.

    Fix the above issue by freeing hctx's resources in its release handler. This
    is safe because tags aren't needed for freeing such hctx resources.

    This approach follows the typical design pattern for a kobject's release handler.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K . Petersen
    Cc: Christoph Hellwig
    Cc: James E . J . Bottomley
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
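
    The release-handler pattern these two commits rely on can be sketched in
    user space. The types and functions below (model_kobj, model_hctx,
    model_kobj_get/put) are hypothetical miniatures and not the kernel's
    kobject API. The sketch only shows the idea: resources owned by the
    object are freed when the last reference is dropped, not when the
    cleanup path runs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature of a kobject-style refcounted object. */
struct model_kobj {
    int refcount;
    void (*release)(struct model_kobj *kobj);
};

struct model_hctx {
    struct model_kobj kobj;  /* must stay first: release() casts back */
    int *ctxs;               /* a resource owned by the hctx */
    int *freed_flag;         /* lets a caller observe the release */
};

/* Release handler: the only place where hctx resources are freed. */
static void model_hctx_release(struct model_kobj *kobj)
{
    struct model_hctx *hctx = (struct model_hctx *)kobj;

    if (hctx->freed_flag)
        *hctx->freed_flag = 1;
    free(hctx->ctxs);
    free(hctx);
}

static struct model_hctx *model_hctx_alloc(int *freed_flag)
{
    struct model_hctx *hctx = calloc(1, sizeof(*hctx));

    if (!hctx)
        return NULL;
    hctx->ctxs = calloc(4, sizeof(*hctx->ctxs));
    if (!hctx->ctxs) {
        free(hctx);
        return NULL;
    }
    hctx->kobj.refcount = 1;              /* the creator's reference */
    hctx->kobj.release = model_hctx_release;
    hctx->freed_flag = freed_flag;
    return hctx;
}

static void model_kobj_get(struct model_kobj *kobj)
{
    kobj->refcount++;
}

static void model_kobj_put(struct model_kobj *kobj)
{
    assert(kobj->refcount > 0);
    if (--kobj->refcount == 0)
        kobj->release(kobj);
}
```

    If some other path still holds a reference when cleanup drops its own,
    the hctx and its resources stay valid until that last reference is put.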
     

01 May, 2019

1 commit


26 Apr, 2019

1 commit

  • The kobj_type default_attrs field is being replaced by the
    default_groups field. Replace all of the ktype default_attrs fields in
    the block subsystem with default_groups and use the ATTRIBUTE_GROUPS
    macro to create the default groups.

    Remove default_ctx_attrs[] because it doesn't contain any attributes.

    This patch was tested by verifying that the sysfs files for the
    attributes in the default groups were created.

    Signed-off-by: Kimberly Brown
    Reviewed-by: Bart Van Assche
    Signed-off-by: Greg Kroah-Hartman

    Kimberly Brown
     

17 Dec, 2018

1 commit

  • Now we only export hctx->type via sysfs, and there isn't such info
    in hctx entry under debugfs. We often use debugfs only to diagnose
    queue mapping issue, so add the support in debugfs.

    Queue mapping became a bit more complicated once multiple queue
    mappings were supported; we may write a blktest to verify that queue
    mapping is valid based on blk-mq-debugfs.

    Since it isn't necessary to export hctx->type twice, remove the export
    from sysfs.

    Cc: Jeff Moyer
    Cc: Mike Snitzer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Dec, 2018

1 commit

  • Having another indirect call in the fast path doesn't really help
    in our post-spectre world. Also having too many queue types is just
    going to create confusion, so I'd rather manage them centrally.

    Note that the queue type naming and ordering changes a bit - the
    first index now is the default queue for everything not explicitly
    marked, the optional ones are read and poll queues.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

21 Nov, 2018

1 commit

  • Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime
    from block layer's view, actually they don't because userspace may
    grab one kobject anytime via sysfs.

    This patch fixes the issue by the following approach:

    1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
    all ctxs

    2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
    handler of .mq_kobj

    3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
    .mq_kobj is always released after all ctxs are freed.

    This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE
    is enabled.

    Reported-by: Guenter Roeck
    Cc: "jianchao.wang"
    Tested-by: Guenter Roeck
    Reviewed-by: Greg Kroah-Hartman
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
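
    Steps 1) to 3) above can be modelled in user space as a container that is
    pinned by each of its children. This is a sketch under stated assumptions:
    model_ctxs and model_ctx are hypothetical simplifications of
    'struct blk_mq_ctxs' and the per-ctx kobjects, with plain integer
    refcounts instead of kobjects.

```c
#include <assert.h>
#include <stdlib.h>

struct model_ctxs;

/* Hypothetical stand-in for one software queue's kobject. */
struct model_ctx {
    int refcount;
    struct model_ctxs *parent;
};

/* Hypothetical stand-in for the container that holds .mq_kobj. */
struct model_ctxs {
    int refcount;
    int nr_ctx;
    struct model_ctx *ctx;   /* all ctxs are owned by the container */
    int *released;           /* lets a caller observe the final release */
};

/* Container release: frees every ctx and the container itself. */
static void model_ctxs_put(struct model_ctxs *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0) {
        *p->released = 1;
        free(p->ctx);
        free(p);
    }
}

/* Ctx release: only drops the reference it took on the container. */
static void model_ctx_put(struct model_ctx *c)
{
    assert(c->refcount > 0);
    if (--c->refcount == 0)
        model_ctxs_put(c->parent);
}

static struct model_ctxs *model_ctxs_alloc(int nr_ctx, int *released)
{
    assert(nr_ctx > 0 && released != NULL);

    struct model_ctxs *p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;
    p->ctx = calloc((size_t)nr_ctx, sizeof(*p->ctx));
    if (!p->ctx) {
        free(p);
        return NULL;
    }
    p->refcount = 1;             /* the queue's own reference */
    p->nr_ctx = nr_ctx;
    p->released = released;
    for (int i = 0; i < nr_ctx; i++) {
        p->ctx[i].refcount = 1;
        p->ctx[i].parent = p;
        p->refcount++;           /* each ctx pins the container */
    }
    return p;
}
```

    Even if the queue drops its reference first, the container and its ctxs
    stay allocated until the last ctx reference (for example one held through
    an open sysfs file) is put.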
     

16 Nov, 2018

1 commit


08 Nov, 2018

1 commit


25 May, 2018

1 commit

  • Convert the S_ symbolic permissions to their octal equivalents as
    using octal and not symbolic permissions is preferred by many as more
    readable.

    see: https://lkml.org/lkml/2016/8/2/1945

    Done with automated conversion via:
    $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace

    Miscellanea:

    o Wrapped modified multi-line calls to a single line where appropriate
    o Realign modified multi-line calls to open parenthesis

    Signed-off-by: Joe Perches
    Signed-off-by: Jens Axboe

    Joe Perches
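
    As a quick check of the equivalence this conversion relies on, the sketch
    below spells the symbolic permissions out with the user-space <sys/stat.h>
    macros and compares them with the octal values. The helper names are
    hypothetical; the kernel-only shorthand S_IRUGO (S_IRUSR | S_IRGRP |
    S_IROTH) is not available in user space, so it is written out in full.

```c
#include <assert.h>
#include <sys/stat.h>

/* What the kernel abbreviates as S_IRUGO: readable by user, group, other. */
static unsigned int perm_read_all(void)
{
    return S_IRUSR | S_IRGRP | S_IROTH;
}

/* S_IWUSR | S_IRUGO: additionally writable by the owner. */
static unsigned int perm_read_all_write_owner(void)
{
    return S_IWUSR | S_IRUSR | S_IRGRP | S_IROTH;
}

/* Aborts if the symbolic and octal spellings ever disagree. */
static void check_octal_equivalents(void)
{
    assert(perm_read_all() == 0444);
    assert(perm_read_all_write_owner() == 0644);
}
```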
     

15 Jan, 2018

1 commit

  • The original commit e9a823fb34a8b (block: fix warning when I/O elevator
    is changed as request_queue is being removed) is pretty conflated.
    "conflated" because the resource being protected by q->sysfs_lock isn't
    the queue_flags (it is the 'queue' kobj).

    q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
    from racing with blk_unregister_queue():
    1) By holding q->sysfs_lock first, __elevator_change() can complete
    before a racing blk_unregister_queue().
    2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
    in case elv_iosched_store() loses the race with blk_unregister_queue(),
    it needs a way to know the 'queue' kobj isn't there.

    Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
    held until after the 'queue' kobj is removed.

    To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So
    rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

    Also, blk_unregister_queue() should use q->queue_lock to protect against
    any concurrent writes to q->queue_flags -- even though chances are the
    queue is being cleaned up so no concurrent writes are likely.

    Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

04 May, 2017

2 commits

  • Originally, I tied debugfs registration/unregistration together with
    sysfs. There's no reason to do this, and it's getting in the way of
    letting schedulers define their own debugfs attributes. Instead, tie the
    debugfs registration to the lifetime of the structures themselves.

    The saner lifetimes mean we can also get rid of the extra mq directory
    and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
    just nvme0n1/hctx0/tags.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Preparation for adding more declarations.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

27 Apr, 2017

4 commits


09 Mar, 2017

4 commits

  • It is obvious that hctx->cpumask is per hctx and that both
    share the same lifetime, so this patch moves the freeing of hctx->cpumask
    into the release handler of hctx's kobject.

    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(),
    and tries to keep the lifetime consistent between hctx and hctx's kobject.

    Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
    totally symmetrical, and kobject's refcounter drops to zero just
    when the hctx is freed.

    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Currently from kobject view, both q->mq_kobj and ctx->kobj can
    be released during one cycle of blk_mq_register_dev() and
    blk_mq_unregister_dev(). Actually, sw queue's lifetime is
    same with its request queue's, which is covered by request_queue->kobj.

    So we don't need to call kobject_put() for the two kinds of
    kobject in __blk_mq_unregister_dev(), instead we do that
    in release handler of request queue.

    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Both q->mq_kobj and the sw queues' kobjects should be initialized only
    once, instead of in each add_disk context.

    Also this patch removes the clearing of ctx in blk_mq_init_cpu_queues()
    because the percpu allocator zero-fills allocated variables.

    This patch fixes one issue[1] reported by Omar.

    [1] kernel warning when doing unbind/bind on one scsi-mq device

    [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
    [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
    [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
    [ 19.350920] Workqueue: events_unbound async_run_entry_fn
    [ 19.350920] Call Trace:
    [ 19.350920] dump_stack+0x63/0x83
    [ 19.350920] kobject_init+0x77/0x90
    [ 19.350920] blk_mq_register_dev+0x40/0x130
    [ 19.350920] blk_register_queue+0xb6/0x190
    [ 19.350920] device_add_disk+0x1ec/0x4b0
    [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod]
    [ 19.350920] async_run_entry_fn+0x48/0x150
    [ 19.350920] process_one_work+0x1d0/0x480
    [ 19.350920] worker_thread+0x48/0x4e0
    [ 19.350920] kthread+0x101/0x140
    [ 19.350920] ? process_one_work+0x480/0x480
    [ 19.350920] ? kthread_create_on_node+0x60/0x60
    [ 19.350920] ret_from_fork+0x2c/0x40

    Cc: Omar Sandoval
    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
     

03 Feb, 2017

1 commit


27 Jan, 2017

5 commits

  • These counters aren't as out-of-place in sysfs as the other stuff, but
    debugfs is a slightly better home for them.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • These statistics _might_ be useful to userspace, but it's better not to
    commit to an ABI for these yet. Also, the dispatched file in sysfs
    couldn't be cleared, so make it clearable like the others in debugfs.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • These are very tied to the blk-mq tag implementation, so exposing them
    to sysfs isn't a great idea. Move the debugging information to debugfs
    and add basic entries for the number of tags and the number of reserved
    tags to sysfs.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • These lists are only useful for debugging; they definitely don't belong
    in sysfs. Putting them in debugfs also removes the limitation of a
    single page of output.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • In preparation for putting blk-mq debugging information in debugfs,
    create a directory tree mirroring the one in sysfs:

    # tree -d /sys/kernel/debug/block
    /sys/kernel/debug/block
    |-- nvme0n1
    |   `-- mq
    |       |-- 0
    |       |   `-- cpu0
    |       |-- 1
    |       |   `-- cpu1
    |       |-- 2
    |       |   `-- cpu2
    |       `-- 3
    |           `-- cpu3
    `-- vda
        `-- mq
            `-- 0
                |-- cpu0
                |-- cpu1
                |-- cpu2
                `-- cpu3

    Also add the scaffolding for the actual files that will go in here,
    either under the hardware queue or software queue directories.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

18 Jan, 2017

1 commit

  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment that with the scheduler flagging support for
    the blk-mq interface, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

11 Nov, 2016

1 commit

  • For legacy block, we simply track them in the request queue. For
    blk-mq, we track them on a per-sw queue basis, which we can then
    sum up through the hardware queues and finally to a per device
    state.

    The stats are tracked in, roughly, 0.1s interval windows.

    Add sysfs files to display the stats.

    The feature is off by default, to avoid any extra overhead. In-kernel
    users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
    flags. We currently don't turn it on if someone just reads any of
    the stats files; that is something we could add as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Sep, 2016

1 commit


17 Sep, 2016

1 commit

  • We currently account a '0' dispatch, and anything above that still falls
    below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more,
    we don't account it.

    Change the last bucket to be inclusive of anything above the range we
    track, and have the sysfs file reflect that by including a '+' in the
    output:

    $ cat /sys/block/nvme0n1/mq/0/dispatched
    0 1006
    1 20229
    2 1
    4 0
    8 0
    16 0
    32+ 0

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
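
    The bucketing described in this commit can be sketched as below. This is
    a user-space illustration under assumptions: MODEL_MAX_DISPATCH_ORDER is
    set to 7 only because the sample output shows seven buckets, and
    dispatch_bucket() and format_dispatched() are hypothetical helpers, not
    the kernel's functions. A dispatch of 0 goes to bucket 0, powers of two
    get their own buckets, and anything at or above the last label lands in
    the final "+" bucket.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_MAX_DISPATCH_ORDER 7   /* buckets: 0, 1, 2, 4, 8, 16, 32+ */

/* Map a dispatch count to its bucket; the last bucket is open-ended. */
static unsigned int dispatch_bucket(unsigned int queued)
{
    unsigned int idx = 0;

    /* For queued > 0 this yields floor(log2(queued)) + 1. */
    while (queued) {
        idx++;
        queued >>= 1;
    }
    if (idx > MODEL_MAX_DISPATCH_ORDER - 1)
        idx = MODEL_MAX_DISPATCH_ORDER - 1;
    return idx;
}

/* Print one "label count" line per bucket, marking the last with '+'. */
static size_t format_dispatched(char *buf, size_t size,
                                const unsigned long *counts)
{
    size_t pos = 0;

    assert(size > 0);
    buf[0] = '\0';
    for (unsigned int i = 0; i < MODEL_MAX_DISPATCH_ORDER; i++) {
        unsigned int label = i ? 1U << (i - 1) : 0U;
        const char *plus = (i == MODEL_MAX_DISPATCH_ORDER - 1) ? "+" : "";
        int ret = snprintf(buf + pos, size - pos, "%u%s %lu\n",
                           label, plus, counts[i]);

        if (ret < 0 || (size_t)ret >= size - pos) {
            buf[pos] = '\0';    /* drop the line that did not fit */
            break;
        }
        pos += (size_t)ret;
    }
    return pos;
}
```

    Given the counts from the sample above, format_dispatched() produces the
    same seven lines, ending in "32+ 0".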
     

14 Sep, 2016

2 commits

  • Allow the io_poll statistics to be zeroed to make logging of
    polling events easier.

    Signed-off-by: Stephen Bates
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stephen Bates
     
  • In order to help determine the effectiveness of polling in a running
    system, it is useful to determine the ratio of how often the poll
    function is called vs how often the completion is checked. For this
    reason we add a poll_considered variable and add it to the sysfs entry
    for io_poll.

    Signed-off-by: Stephen Bates
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stephen Bates