29 Jun, 2017

1 commit

  • Wen reports significant memory leaks with DIF and O_DIRECT:

    "With nvme devive + T10 enabled, On a system it has 256GB and started
    logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
    it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
    leaking.

    /proc/meminfo | grep SUnreclaim...

    SUnreclaim: 6752128 kB
    SUnreclaim: 6874880 kB
    SUnreclaim: 7238080 kB
    ....
    SUnreclaim: 22307264 kB
    SUnreclaim: 22485888 kB
    SUnreclaim: 22720256 kB

    When testcases with T10 enabled call into __blkdev_direct_IO_simple,
    the code doesn't free the memory allocated by bio_integrity_alloc().
    The patch fixes the issue. HTX has run for 60+ hours without
    failure."

    Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
    doesn't go through the regular bio free path. This means that any
    ancillary data allocated with the bio through the stack is not
    freed. Hence, we can leak the integrity data associated with the bio
    if the device is using DIF/DIX.

    Fix this by providing a bio_uninit() and exporting it, so that we
    can use it to free this data. Note that this is a minimal fix for
    this issue. Any current user of bios allocated outside of
    bio_alloc_bioset() suffers from this issue, most notably some
    drivers.
    We will fix those in a more comprehensive patch for 4.13. This also
    means that the commit marked as being fixed by this isn't the real
    culprit, it's just the most obvious one out there.
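
    As a rough sketch of the shape of the fix (assuming the helper is
    essentially the old __bio_free() renamed and exported):

        void bio_uninit(struct bio *bio)
        {
                bio_disassociate_task(bio);

                if (bio_integrity(bio))
                        bio_integrity_free(bio);
        }
        EXPORT_SYMBOL(bio_uninit);

        /* __blkdev_direct_IO_simple() then calls bio_uninit(&bio) before
         * returning, since its on-stack bio never reaches bio_free(). */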

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Reported-by: Wen Xiong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Jun, 2017

1 commit

  • If we have shared tags enabled, then every IO completion will trigger
    a full loop of every queue belonging to a tag set, and every hardware
    queue for each of those queues, even if nothing needs to be done.
    This causes a massive performance regression if you have a lot of
    shared devices.

    Instead of doing this huge full scan on every IO, add an atomic
    counter to the main queue that tracks how many hardware queues have
    been marked as needing a restart. With that, we can avoid looking
    for restartable queues when we don't have to.
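
    A minimal sketch of the scheme, assuming the counter and flag names
    from the patch:

        static inline void
        blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
        {
                if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                        return;

                if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
                        struct request_queue *q = hctx->queue;

                        /* count hw queues that actually need a restart */
                        if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART,
                                              &hctx->state))
                                atomic_inc(&q->shared_hctx_restart);
                } else
                        set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
        }

    On completion, the expensive walk over the whole tag set is then
    skipped whenever shared_hctx_restart reads zero.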

    Max reports that this restores performance. Before this patch, 4K
    IOPS was limited to 22-23K; with the patch, we are running at
    950-970K IOPS.

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy
    Tested-by: Max Gurtovoy
    Reviewed-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Jun, 2017

1 commit

  • Avoid that the following complaint is reported:

    BUG: sleeping function called from invalid context at kernel/workqueue.c:2790
    in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3
    1 lock held by rcuop/3/41:
    #0: (rcu_callback){......}, at: [] rcu_nocb_kthread+0x282/0x500
    Call Trace:
    dump_stack+0x86/0xcf
    ___might_sleep+0x174/0x260
    __might_sleep+0x4a/0x80
    flush_work+0x7e/0x2e0
    __cancel_work_timer+0x143/0x1c0
    cancel_work_sync+0x10/0x20
    blk_throtl_exit+0x25/0x60
    blkcg_exit_queue+0x35/0x40
    blk_release_queue+0x42/0x130
    kobject_put+0xa9/0x190

    This happens because we invoke callbacks that need to block from
    the queue release handler. Fix this by pushing the final release
    out to a workqueue.
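
    A minimal sketch of the deferral, assuming a release_work member is
    added to struct request_queue:

        static void blk_release_queue(struct kobject *kobj)
        {
                struct request_queue *q =
                        container_of(kobj, struct request_queue, kobj);

                /* the final kobject_put() may run from atomic (RCU)
                 * context, so push the blocking teardown, including
                 * blkcg_exit_queue() and its cancel_work_sync() calls,
                 * to process context */
                INIT_WORK(&q->release_work, __blk_release_queue);
                schedule_work(&q->release_work);
        }

    where __blk_release_queue() holds the old release body.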

    Reported-by: Ross Zwisler
    Fixes: commit b425e5049258 ("block: Avoid that blk_exit_rl() triggers a use-after-free")
    Signed-off-by: Bart Van Assche
    Tested-by: Ross Zwisler

    Updated changelog
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

08 Jun, 2017

1 commit

  • In blk-cgroup, operations on blkg objects are protected by the
    request_queue lock. This is no longer the lock that protects
    I/O-scheduler operations in blk-mq; the latter are now protected
    by a finer-grained per-scheduler-instance lock. As a consequence,
    although blkg lookups are also rcu-protected, blk-mq I/O
    schedulers may see inconsistent data when they access blkg and
    blkg-related objects. BFQ does access these objects, and does
    incur this problem, in the following case.

    The blkg_lookup performed in bfq_get_queue, being protected (only)
    through rcu, may happen to return the address of a copy of the
    original blkg. If this is the case, then the blkg_get performed in
    bfq_get_queue, to pin down the blkg, is useless: it does not prevent
    blk-cgroup code from destroying both the original blkg and all
    objects directly or indirectly referred to by the copy of the blkg.
    BFQ accesses these objects, which typically causes a crash from a
    NULL-pointer dereference or a memory-protection violation.

    Some additional protection mechanism should be added to blk-cgroup to
    address this issue. In the meantime, this commit provides a quick
    temporary fix for BFQ: cache (when safe) blkg data that might
    disappear right after a blkg_lookup.

    In particular, this commit exploits the following facts to achieve its
    goal without introducing further locks. Destroy operations on a blkg
    invoke, as a first step, hooks of the scheduler associated with the
    blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
    consequence, for any blkg associated with the request queue an
    instance of BFQ is attached to, we are guaranteed that such a blkg is
    not destroyed, and that all the pointers it contains are consistent,
    while that instance is holding its bfqd->lock. A blkg_lookup performed
    with bfqd->lock held then returns a fully consistent blkg, which
    remains consistent as long as this lock is held. In more detail,
    this holds even if the returned blkg is a copy of the original one.

    Finally, the object describing a group inside BFQ also needs to be
    protected from destruction on the blkg_free of the original blkg
    (which invokes bfq_pd_free). This commit adds private refcounting
    for this object, to let it disappear only after no bfq_queue refers
    to it any longer.
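
    A minimal sketch of the private refcounting (field name assumed; all
    updates happen under the scheduler lock):

        static void bfqg_get(struct bfq_group *bfqg)
        {
                bfqg->ref++;
        }

        static void bfqg_put(struct bfq_group *bfqg)
        {
                bfqg->ref--;
                if (!bfqg->ref)
                        kfree(bfqg);
        }

    bfq_pd_free() then drops a reference instead of freeing the group
    directly, so the bfq_group stays alive as long as a bfq_queue still
    points to it.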

    This commit also removes or updates some stale comments on locking
    issues related to blk-cgroup operations.

    Reported-by: Tomas Konir
    Reported-by: Lee Tibbert
    Reported-by: Marco Piazza
    Signed-off-by: Paolo Valente
    Tested-by: Tomas Konir
    Tested-by: Lee Tibbert
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
     

07 Jun, 2017

4 commits

  • Hard disk IO latency varies a lot depending on spindle movement; it
    can range from several microseconds to several milliseconds, which
    makes it pretty hard to pick the baseline latency used by io.low.

    We use a different strategy here. The idea is to use only IO that
    involves spindle movement to determine if a cgroup's IO is in a good
    state. For HD, if the IO latency is small (< 1ms), we ignore it:
    such IO is likely sequential and is useless for determining whether
    a cgroup's IO is impacted by other cgroups. With this, we only
    account IO with big latency, and we can then choose a hardcoded
    baseline latency for HD (4ms, a typical IO latency with seek). With
    these settings, the io.low latency works for both HD and SSD.
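
    A sketch of the filtering, with assumed constant and field names:

        /* 4 ms: assumed typical HD IO latency with seek */
        #define DFL_HD_BASELINE_LATENCY (4000L)

        /* in the bio completion accounting path */
        if (latency < td->filtered_latency)
                return; /* likely sequential IO; not a useful signal */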

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • I have encountered a NULL pointer dereference in
    throtl_schedule_pending_timer:
    [ 413.735396] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
    [ 413.735535] IP: [] throtl_schedule_pending_timer+0x3f/0x210
    [ 413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0
    [ 413.735713] Oops: 0000 [#1] SMP
    ......

    This is caused by the following case:

    blk_throtl_bio
      throtl_schedule_next_dispatch    <= sq is the top-level service
                                          queue, with no parent
        throtl_schedule_pending_timer
          sq_to_tg(sq)->td->throtl_slice  <= sq_to_tg(sq) returns NULL

    Fix it by using sq_to_td() instead of sq_to_tg(sq)->td, which will
    always return a valid td.
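
    The fix is essentially a one-liner in
    throtl_schedule_pending_timer(), roughly:

        /* sq_to_tg(sq) is NULL for the top-level service queue, while
         * sq_to_td(sq) always resolves to a valid throtl_data */
        unsigned long max_expire = jiffies + 8 * sq_to_td(sq)->throtl_slice;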

    Fixes: 297e3d854784 ("blk-throttle: make throtl_slice tunable")
    Signed-off-by: Joseph Qi
    Reviewed-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Joseph Qi
     
  • If a queue is stopped, we shouldn't dispatch requests into the
    driver and hardware; unfortunately the check was removed in
    bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO
    schedulers").

    This patch fixes the issue by moving the check back into
    __blk_mq_try_issue_directly().
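
    The restored check is roughly:

        /* in __blk_mq_try_issue_directly() */
        if (blk_mq_hctx_stopped(hctx))
                goto insert; /* queue for later dispatch, don't issue now */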

    This patch fixes a request use-after-free [1][2] during canceling
    of NVMe requests in nvme_dev_disable(), which can be triggered
    easily during NVMe reset & remove testing.

    [1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on
    [ 103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
    [ 103.412980] IP: bio_integrity_advance+0x48/0xf0
    [ 103.412981] PGD 275a88067
    [ 103.412981] P4D 275a88067
    [ 103.412982] PUD 276c43067
    [ 103.412983] PMD 0
    [ 103.412984]
    [ 103.412986] Oops: 0000 [#1] SMP
    [ 103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
    [ 103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1
    [ 103.413036] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
    [ 103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
    [ 103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000
    [ 103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0
    [ 103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202
    [ 103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240
    [ 103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00
    [ 103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5
    [ 103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18
    [ 103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb
    [ 103.413053] FS: 0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000
    [ 103.413054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0
    [ 103.413056] Call Trace:
    [ 103.413063] bio_advance+0x2a/0xe0
    [ 103.413067] blk_update_request+0x76/0x330
    [ 103.413072] blk_mq_end_request+0x1a/0x70
    [ 103.413074] blk_mq_dispatch_rq_list+0x370/0x410
    [ 103.413076] ? blk_mq_flush_busy_ctxs+0x94/0xe0
    [ 103.413080] blk_mq_sched_dispatch_requests+0x173/0x1a0
    [ 103.413083] __blk_mq_run_hw_queue+0x8e/0xa0
    [ 103.413085] __blk_mq_delay_run_hw_queue+0x9d/0xa0
    [ 103.413088] blk_mq_start_hw_queue+0x17/0x20
    [ 103.413090] blk_mq_start_hw_queues+0x32/0x50
    [ 103.413095] nvme_kill_queues+0x54/0x80 [nvme_core]
    [ 103.413097] nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme]
    [ 103.413103] process_one_work+0x149/0x360
    [ 103.413105] worker_thread+0x4d/0x3c0
    [ 103.413109] kthread+0x109/0x140
    [ 103.413111] ? rescuer_thread+0x380/0x380
    [ 103.413113] ? kthread_park+0x60/0x60
    [ 103.413120] ret_from_fork+0x2c/0x40
    [ 103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8
    [ 103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10
    [ 103.413146] CR2: 000000000000000a
    [ 103.413157] ---[ end trace cd6875d16eb5a11e ]---
    [ 103.455368] Kernel panic - not syncing: Fatal exception
    [ 103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [ 103.850916] ---[ end Kernel panic - not syncing: Fatal exception
    [ 103.857637] sched: Unexpected reschedule of offline CPU#1!
    [ 103.863762] ------------[ cut here ]------------

    [2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off
    [ 247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds.
    [ 247.137311] Not tainted 4.12.0-rc2.upstream+ #4
    [ 247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 247.151704] Call Trace:
    [ 247.154445] __schedule+0x28a/0x880
    [ 247.158341] schedule+0x36/0x80
    [ 247.161850] blk_mq_freeze_queue_wait+0x4b/0xb0
    [ 247.166913] ? remove_wait_queue+0x60/0x60
    [ 247.171485] blk_freeze_queue+0x1a/0x20
    [ 247.175770] blk_cleanup_queue+0x7f/0x140
    [ 247.180252] nvme_ns_remove+0xa3/0xb0 [nvme_core]
    [ 247.185503] nvme_remove_namespaces+0x32/0x50 [nvme_core]
    [ 247.191532] nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
    [ 247.196977] nvme_remove+0x70/0x110 [nvme]
    [ 247.201545] pci_device_remove+0x39/0xc0
    [ 247.205927] device_release_driver_internal+0x141/0x200
    [ 247.211761] device_release_driver+0x12/0x20
    [ 247.216531] pci_stop_bus_device+0x8c/0xa0
    [ 247.221104] pci_stop_and_remove_bus_device_locked+0x1a/0x30
    [ 247.227420] remove_store+0x7c/0x90
    [ 247.231320] dev_attr_store+0x18/0x30
    [ 247.235409] sysfs_kf_write+0x3a/0x50
    [ 247.239497] kernfs_fop_write+0xff/0x180
    [ 247.243867] __vfs_write+0x37/0x160
    [ 247.247757] ? selinux_file_permission+0xe5/0x120
    [ 247.253011] ? security_file_permission+0x3b/0xc0
    [ 247.258260] vfs_write+0xb2/0x1b0
    [ 247.261964] ? syscall_trace_enter+0x1d0/0x2b0
    [ 247.266924] SyS_write+0x55/0xc0
    [ 247.270540] do_syscall_64+0x67/0x150
    [ 247.274636] entry_SYSCALL64_slow_path+0x25/0x25
    [ 247.279794] RIP: 0033:0x7f5c96740840
    [ 247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840
    [ 247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001
    [ 247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740
    [ 247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400
    [ 247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
    [ 370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds.

    Fixes: 12d70958a2e8 ("blk-mq: don't fail allocating driver tag for stopped hw queue")
    Cc: stable@vger.kernel.org
    Signed-off-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • When direct issue is done on a request picked up from the plug
    list, the hctx needs to be updated to the actual hw queue;
    otherwise the wrong hctx is used, which may hurt performance,
    especially when the wrong SRCU read lock is acquired/released.
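
    A sketch of the fix in the plug path of blk_mq_make_request(),
    assuming the 4.12-era helpers:

        if (same_queue_rq) {
                /* re-map the hctx from the request's own sw context
                 * instead of reusing the one cached for the new bio */
                data.hctx = blk_mq_map_queue(q,
                                same_queue_rq->mq_ctx->cpu);
                blk_mq_try_issue_directly(data.hctx, same_queue_rq,
                                &cookie);
        }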

    Reported-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

03 Jun, 2017

1 commit

  • If a bio has no data, such as the ones from blkdev_issue_flush(),
    then we have nothing to protect.
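
    A sketch of the guard, assuming it lives in bio_integrity_enabled():

        /* empty bios, e.g. from blkdev_issue_flush(), carry no data
         * blocks, so there is nothing to generate or verify PI for */
        if (!bio_sectors(bio))
                return false;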

    This patch prevents a BUG like the following:

    kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
    kernel BUG at mm/slab.c:2773!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: bcache
    CPU: 0 PID: 4428 Comm: xfs_io Tainted: G W 4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
    Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
    task: ffff880137786440 task.stack: ffffc90000ba8000
    RIP: 0010:kfree_debugcheck+0x25/0x2a
    RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
    RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
    RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
    R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
    FS: 00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
    Call Trace:
    kfree+0xc8/0x1b3
    bio_integrity_free+0xc3/0x16b
    bio_free+0x25/0x66
    bio_put+0x14/0x26
    blkdev_issue_flush+0x7a/0x85
    blkdev_fsync+0x35/0x42
    vfs_fsync_range+0x8e/0x9f
    vfs_fsync+0x1c/0x1e
    do_fsync+0x31/0x4a
    SyS_fsync+0x10/0x14
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     

02 Jun, 2017

1 commit

  • Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
    essential that the memory allocated for struct request_queue
    stays around until all blk_exit_rl() calls have finished. Hence
    make blk_init_rl() take a reference on struct request_queue.
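
    A minimal sketch, assuming blk_get_queue()/blk_put_queue() carry the
    reference (the root_rl is embedded in the queue itself, so it must
    not pin it):

        /* in blk_init_rl(): keep the queue alive for blk_exit_rl() */
        if (rl != &q->root_rl)
                blk_get_queue(q);

        /* in blk_exit_rl(): drop it once the mempool is destroyed */
        if (rl != &q->root_rl)
                blk_put_queue(q);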

    This patch fixes the following crash:

    general protection fault: 0000 [#2] SMP
    CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G D 4.12.0-rc2-dbg+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    task: ffff88013a108040 task.stack: ffffc9000071c000
    RIP: 0010:free_request_size+0x1a/0x30
    RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
    RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
    RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
    R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
    R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
    FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
    Call Trace:
    mempool_destroy.part.10+0x21/0x40
    mempool_destroy+0xe/0x10
    blk_exit_rl+0x12/0x20
    blkg_free+0x4d/0xa0
    __blkg_release_rcu+0x59/0x170
    rcu_process_callbacks+0x260/0x4e0
    __do_softirq+0x116/0x250
    smpboot_thread_fn+0x123/0x1e0
    kthread+0x109/0x140
    ret_from_fork+0x31/0x40

    Fixes: commit e9c787e65c0c ("scsi: allocate scsi_cmnd structures as part of struct request")
    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Cc: Jan Kara
    Cc: # v4.11+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

31 May, 2017

2 commits

  • When adding a cfq_group into the cfq service tree, we use
    CFQ_IDLE_DELAY as the delay of the cfq_group's vdisktime if there
    are already other cfq_groups in the tree.

    When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
    from jiffies to nanoseconds") could result in a large iops delay and
    lead to an abnormal io schedule delay for the added cfq_group. To fix
    it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
    when iops mode is enabled.

    Despite having the same value, the delay of a cfq_queue in the idle
    class and the delay of a cfq_group are different things, so I
    define two new macros for the delay of a cfq_group under time-slice
    mode and iops mode.
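
    A sketch of the two delays and their selection (names assumed from
    the patch):

        /* group's vdisktime delay in time-slice mode, in ns */
        #define CFQ_SLICE_MODE_GROUP_DELAY (NSEC_PER_SEC / 5)
        /* group's vdisktime delay in iops mode, in number of ios */
        #define CFQ_IOPS_MODE_GROUP_DELAY (HZ / 5)

        static u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
        {
                if (!iops_mode(cfqd))
                        return CFQ_SLICE_MODE_GROUP_DELAY;
                else
                        return CFQ_IOPS_MODE_GROUP_DELAY;
        }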

    Fixes: 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds")
    Cc: # 4.8+
    Signed-off-by: Hou Tao
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Hou Tao
     
  • The tagset lock needs to be held when iterating the tag_list, so a
    lockdep assert was added when updating the number of hardware
    queues. The drivers calling this API, however, were unaware of the
    new requirement, and so they fail the assertion.

    This patch takes the lock within the blk-mq function so the drivers do
    not have to be modified in order to be safe.
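
    The fix is essentially to take the lock inside blk-mq itself,
    roughly:

        void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
                                        int nr_hw_queues)
        {
                mutex_lock(&set->tag_list_lock);
                __blk_mq_update_nr_hw_queues(set, nr_hw_queues);
                mutex_unlock(&set->tag_list_lock);
        }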

    Fixes: 705cda97e ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
    Reported-by: Gabriel Krisman Bertazi
    Reviewed-by: Bart Van Assche
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

26 May, 2017

2 commits

  • Christoph writes:

    "A couple of fixes for the next rc on the nvme front. Various FC fixes
    from James, controller removal fixes from Ming (including a block layer
    patch), an APST related device quirk from Andy, an RDMA fix for small
    queue depth device from Marta, as well as fixes for the lack of
    metadata support in non-PCIe drivers and the printk logging format from
    me."

    Jens Axboe
     
  • The code in blk-mq-debugfs.c assumes that it is working on a blk-mq
    queue and is not intended to work on a blk-sq queue. Hence only
    register blk-mq debugfs attributes for blk-mq queues.
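
    A sketch of the guard in blk_register_queue(), roughly:

        if (q->mq_ops) {
                blk_mq_register_dev(dev, q);
                /* debugfs attrs only make sense for blk-mq queues */
                blk_mq_debugfs_register(q);
        }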

    Fixes: commit 9c1051aacde8 ("blk-mq: untangle debugfs and sysfs")
    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Reviewed-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

23 May, 2017

7 commits

  • The code in block/partitions/msdos.c recognizes FreeBSD, OpenBSD
    and NetBSD partitions and does a reasonable job picking out OpenBSD
    and NetBSD UFS subpartitions.

    But for FreeBSD the subpartitions are always "bad".

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Richard
     
  • We don't set an error code on this path. It means that we return NULL
    instead of an error pointer and the caller does a NULL dereference.

    Fixes: 6d1d8050b4bc ("block, partition: add partition_meta_info to hd_struct")
    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe

    Dan Carpenter
     
  • The default value of the io.low limit is 0. If the user doesn't
    configure the limit, the last patch makes the cgroup be throttled to
    a very tiny bps/iops, which could stall the system. A cgroup with
    the default io.low settings really means nothing, so we force the
    user to configure all settings; otherwise the io.low limit doesn't
    take effect. With this strategy, the default latency/idle settings
    aren't important, so just set them to very conservative and safe
    values.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • If a cgroup has a low limit of 0 for both bps and iops, the cgroup's
    low limit is ignored and we throttle the cgroup with its max limit.
    In this way, other cgroups with a low limit will not get protected.
    To fix this, we don't make the exception any more: a cgroup will be
    throttled to a limit of 0 if it uses the default setting. To avoid a
    complete stall, we give such cgroups tiny IO resources.
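
    The tiny resources are hard-coded floors; a sketch with the assumed
    constants:

        /* floors applied to cgroups that keep the default io.low of 0 */
        #define MIN_THROTL_BPS (320 * 1024)
        #define MIN_THROTL_IOPS (10)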

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • This information is important for understanding what's happening
    and helps with debugging.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • For idle time, a child's setting should not be bigger than its
    parent's. For the latency target, a child's setting should not be
    smaller than its parent's. The leaf nodes adjust their settings
    according to the hierarchy, compare their IO against those settings,
    and do the upgrade/downgrade. Parent nodes don't need to track their
    IO latency/idle time.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • No one uses it any more, so remove it.

    Reviewed-by: Keith Busch
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Ming Lei
    Signed-off-by: Christoph Hellwig

    Ming Lei
     

13 May, 2017

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "Incremental fixes and a small feature addition on top of the main
    libnvdimm 4.12 pull request:

    - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
    The size regression is fixed by moving all dax helpers into the
    dax-core and only specifying "select DAX" for FS_DAX and
    dax-capable drivers. He also asked for clarification of the
    NR_DEV_DAX config option which, on closer look, does not need to be
    a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
    for good measure.

    - Ben's attention to detail on -stable patch submissions caught a
    case where the recent fixes to arch_copy_from_iter_pmem() missed a
    condition where we strand dirty data in the cache. This is tagged
    for -stable and will also be included in the rework of the pmem api
    to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

    - Vishal adds a feature that missed the initial pull due to pending
    review feedback. It allows the kernel to clear media errors when
    initializing a BTT (atomic sector update driver) instance on a pmem
    namespace.

    - Ross noticed that the dax_device + dax_operations conversion broke
    __dax_zero_page_range(). The nvdimm unit tests fail to check this
    path, but xfstests immediately trips over it. No excuse for missing
    this before submitting the 4.12 pull request.

    These all pass the nvdimm unit tests and an xfstests spot check. The
    set has received a build success notification from the kbuild robot"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    filesystem-dax: fix broken __dax_zero_page_range() conversion
    libnvdimm, btt: ensure that initializing metadata clears poison
    libnvdimm: add an atomic vs process context flag to rw_bytes
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    device-dax: kill NR_DEV_DAX
    block, dax: move "select DAX" from BLOCK to FS_DAX
    device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

    Linus Torvalds
     

11 May, 2017

1 commit


10 May, 2017

5 commits

  • When formatting NVMe to 512B/4K + T10 DIF/DIX, dd with a split op
    returns "Input/output error". It looks like the block layer splits
    the bio after calling bio_integrity_prep(bio). This patch fixes the
    issue.
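
    A sketch of the reordering in blk_mq_make_request(), assuming the
    4.12-era signatures:

        blk_queue_split(q, &bio, q->bio_split); /* split first ... */

        /* ... then attach PI to the bio that will actually be issued */
        if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
                bio_io_error(bio);
                return BLK_QC_T_NONE;
        }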

    Below is how we debug this issue:
    (1) format nvme to a 4K block size with type 2 DIF
    (2) dd with a block size bigger than 1024k, oflag=direct
    dd: error writing '/dev/nvme0n1': Input/output error

    We added some debug code in nvme device driver. It showed us the first
    op and the second op have the same bi and pi address. This is not
    correct.

    1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
    dsmgmt=0x0, AT=0x0 & RT=0x505
    Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828

    2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
    AT=0x0 & RT=0x605 ==> This op fails, as do the subsequent 5 retries.
    Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828

    With the fix, it showed us that both the first op and the second op
    have the correct bi and pi addresses.

    1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
    dsmgmt=0x0, AT=0x0 & RT=0x505
    Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
    0x00002828
    2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
    AT=0x0 & RT=0x605
    Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
    0x00003028

    Signed-off-by: Wen Xiong
    Signed-off-by: Jens Axboe

    Wen Xiong
     
  • If PREEMPT_RCU is enabled, rcu_read_lock() isn't strong enough
    for us to use this_cpu_ptr() in that section. Use the safer
    get/put_cpu_ptr() variants instead.
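
    The change in the stat collection path, roughly (bucket and value
    come from the surrounding code):

        struct blk_rq_stat *stat;

        /* this_cpu_ptr() needs preemption disabled, which
         * rcu_read_lock() does not guarantee under PREEMPT_RCU;
         * pin the cpu explicitly instead */
        stat = &get_cpu_ptr(cb->cpu_stat)[bucket];
        __blk_stat_add(stat, value);
        put_cpu_ptr(cb->cpu_stat);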

    Reported-by: Mike Galbraith
    Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We warn twice when switching to a scheduler fails. As we also
    report the failure in the return value of the sysfs write, remove
    the dmesg-induced warnings.

    Keep the failure print for a failed switch to the kconfig-selected
    IO scheduler, as we can't report errors for that in any other way.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The introduction of the BFQ and Kyber I/O schedulers has triggered a
    new wave of I/O benchmarks. Unfortunately, comments and discussions on
    these benchmarks confirm that there is still little awareness that it
    is very hard to achieve, at the same time, a low latency and a high
    throughput. In particular, virtually all benchmarks measure
    throughput, or throughput-related figures of merit, but, for BFQ, they
    use the scheduler in its default configuration. This configuration is
    geared, instead, toward a low latency. This is evidently a sign that
    BFQ documentation is still too unclear on this important aspect. This
    commit addresses this issue by stressing how BFQ configuration must be
    (easily) changed if the only goal is maximum throughput.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In the function __bfq_deactivate_entity, the pointer
    entity->sched_data could happen to be used before being properly
    initialized. This led to a NULL pointer dereference. This commit fixes
    this bug by just using this pointer only where it is safe to do so.

    Reported-by: Tom Harrison
    Tested-by: Tom Harrison
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

09 May, 2017

1 commit

  • For configurations that do not enable DAX filesystems or drivers, do not
    require the DAX core to be built.

    Given that the 'direct_access' method has been removed from
    'block_device_operations', we can also go ahead and remove the
    block-related dax helper functions from fs/block_dev.c to
    drivers/dax/super.c. This keeps dax details out of the block layer and
    lets the DAX core be built as a module in the FS_DAX=n case.

    Filesystems need to include dax.h to call bdev_dax_supported().

    Cc: linux-xfs@vger.kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: "Darrick J. Wong"
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Dan Williams

    Dan Williams
     

08 May, 2017

2 commits

  • Making __blk_mq_stop_hw_queues static fixes a sparse warning:

    block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
    declared. Should it be static?

    Fixes: 2719aa217e0d0 ("blk-mq: don't use sync workqueue flushing from drivers")
    Signed-off-by: Colin Ian King
    Signed-off-by: Jens Axboe

    Colin Ian King
     
  • This can be triggered by hot-unplug one cpu.

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.11.0+ #17 Not tainted
    -------------------------------------------------------
    step_after_susp/2640 is trying to acquire lock:
    (all_q_mutex){+.+...}, at: [] blk_mq_queue_reinit_work+0x18/0x110

    but task is already holding lock:
    (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x7f/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (cpu_hotplug.lock){+.+.+.}:
    lock_acquire+0x11c/0x230
    __mutex_lock+0x92/0x990
    mutex_lock_nested+0x1b/0x20
    get_online_cpus+0x64/0x80
    blk_mq_init_allocated_queue+0x3a0/0x4e0
    blk_mq_init_queue+0x3a/0x60
    loop_add+0xe5/0x280
    loop_init+0x124/0x177
    do_one_initcall+0x53/0x1c0
    kernel_init_freeable+0x1e3/0x27f
    kernel_init+0xe/0x100
    ret_from_fork+0x31/0x40

    -> #0 (all_q_mutex){+.+...}:
    __lock_acquire+0x189a/0x18a0
    lock_acquire+0x11c/0x230
    __mutex_lock+0x92/0x990
    mutex_lock_nested+0x1b/0x20
    blk_mq_queue_reinit_work+0x18/0x110
    blk_mq_queue_reinit_dead+0x1c/0x20
    cpuhp_invoke_callback+0x1f2/0x810
    cpuhp_down_callbacks+0x42/0x80
    _cpu_down+0xb2/0xe0
    freeze_secondary_cpus+0xb6/0x390
    suspend_devices_and_enter+0x3b3/0xa40
    pm_suspend+0x129/0x490
    state_store+0x82/0xf0
    kobj_attr_store+0xf/0x20
    sysfs_kf_write+0x45/0x60
    kernfs_fop_write+0x135/0x1c0
    __vfs_write+0x37/0x160
    vfs_write+0xcd/0x1d0
    SyS_write+0x58/0xc0
    do_syscall_64+0x8f/0x710
    return_from_SYSCALL_64+0x0/0x7a

    other info that might help us debug this:

    Possible unsafe locking scenario:

     CPU0                          CPU1
     ----                          ----
     lock(cpu_hotplug.lock);
                                   lock(all_q_mutex);
                                   lock(cpu_hotplug.lock);
     lock(all_q_mutex);

    *** DEADLOCK ***

    8 locks held by step_after_susp/2640:
    #0: (sb_writers#6){.+.+.+}, at: [] vfs_write+0x1ad/0x1d0
    #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0x101/0x1c0
    #2: (s_active#166){.+.+.+}, at: [] kernfs_fop_write+0x109/0x1c0
    #3: (pm_mutex){+.+...}, at: [] pm_suspend+0x21d/0x490
    #4: (acpi_scan_lock){+.+.+.}, at: [] acpi_scan_lock_acquire+0x17/0x20
    #5: (cpu_add_remove_lock){+.+.+.}, at: [] freeze_secondary_cpus+0x27/0x390
    #6: (cpu_hotplug.dep_map){++++++}, at: [] cpu_hotplug_begin+0x5/0xe0
    #7: (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x7f/0xe0

    stack backtrace:
    CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
    Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
    Call Trace:
    dump_stack+0x99/0xce
    print_circular_bug+0x1fa/0x270
    __lock_acquire+0x189a/0x18a0
    lock_acquire+0x11c/0x230
    ? lock_acquire+0x11c/0x230
    ? blk_mq_queue_reinit_work+0x18/0x110
    ? blk_mq_queue_reinit_work+0x18/0x110
    __mutex_lock+0x92/0x990
    ? blk_mq_queue_reinit_work+0x18/0x110
    ? kmem_cache_free+0x2cb/0x330
    ? anon_transport_class_unregister+0x20/0x20
    ? blk_mq_queue_reinit_work+0x110/0x110
    mutex_lock_nested+0x1b/0x20
    ? mutex_lock_nested+0x1b/0x20
    blk_mq_queue_reinit_work+0x18/0x110
    blk_mq_queue_reinit_dead+0x1c/0x20
    cpuhp_invoke_callback+0x1f2/0x810
    ? __flow_cache_shrink+0x160/0x160
    cpuhp_down_callbacks+0x42/0x80
    _cpu_down+0xb2/0xe0
    freeze_secondary_cpus+0xb6/0x390
    suspend_devices_and_enter+0x3b3/0xa40
    ? rcu_read_lock_sched_held+0x79/0x80
    pm_suspend+0x129/0x490
    state_store+0x82/0xf0
    kobj_attr_store+0xf/0x20
    sysfs_kf_write+0x45/0x60
    kernfs_fop_write+0x135/0x1c0
    __vfs_write+0x37/0x160
    ? rcu_read_lock_sched_held+0x79/0x80
    ? rcu_sync_lockdep_assert+0x2f/0x60
    ? __sb_start_write+0xd9/0x1c0
    ? vfs_write+0x1ad/0x1d0
    vfs_write+0xcd/0x1d0
    SyS_write+0x58/0xc0
    ? rcu_read_lock_sched_held+0x79/0x80
    do_syscall_64+0x8f/0x710
    ? trace_hardirqs_on_thunk+0x1a/0x1c
    entry_SYSCALL64_slow_path+0x25/0x25

    The cpu hotplug path holds cpu_hotplug.lock and then reinits all
    existing blk-mq queues with all_q_mutex held; however,
    blk_mq_init_allocated_queue() acquires these two locks in the
    inverse order. This is due to commit eabe06595d62 ("blk/mq: Cure
    cpu hotplug lock inversion"), which fixed a cpu hotplug lock
    inversion caused by the hotplug rework. That rework is still work
    in progress and lives in a -tip branch, so mainline cannot yet
    trigger the splat it cures; meanwhile the commit breaks Linus's
    tree in the merge window. This patch therefore reverts the lock
    order and avoids the splat in Linus's tree.

    Cc: Jens Axboe
    Cc: Peter Zijlstra (Intel)
    Cc: Thomas Gleixner
    Signed-off-by: Wanpeng Li
    Signed-off-by: Jens Axboe

    Wanpeng Li
     

07 May, 2017

1 commit

  • Pull block fixes and updates from Jens Axboe:
    "Some fixes and followup features/changes that should go in, in this
    merge window. This contains:

    - Two fixes for lightnvm from Javier, fixing problems in the new
    code merged previously in this merge window.

    - A fix from Jan for the backing device changes, fixing an issue in
    NFS that causes a failure to mount on certain setups.

    - A change from Christoph, cleaning up the blk-mq init and exit
    request paths.

    - Remove elevator_change(), which is now unused. From Bart.

    - A fix for queue operation invocation on a dead queue, from Bart.

    - A series fixing up mtip32xx for blk-mq scheduling, removing a
    bandaid we previously had in place for this. From me.

    - A regression fix for this series, fixing a case where we wait on
    workqueue flushing from an invalid (non-blocking) context. From me.

    - A fix/optimization from Ming, ensuring that we don't both quiesce
    and freeze a queue at the same time.

    - A fix from Peter on lock ordering for CPU hotplug. Not a real
    problem right now, but will be once the CPU hotplug rework goes in.

    - A series from Omar, cleaning up our blk-mq debugfs support, and
    adding support for exporting info from schedulers in debugfs as
    well. This is really useful in debugging stalls or livelocks. From
    Omar"

    * 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
    mq-deadline: add debugfs attributes
    kyber: add debugfs attributes
    blk-mq-debugfs: allow schedulers to register debugfs attributes
    blk-mq: untangle debugfs and sysfs
    blk-mq: move debugfs declarations to a separate header file
    blk-mq: Do not invoke queue operations on a dead queue
    blk-mq-debugfs: get rid of a bunch of boilerplate
    blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
    blk-mq-debugfs: don't open code strstrip()
    blk-mq-debugfs: error on long write to queue "state" file
    blk-mq-debugfs: clean up flag definitions
    blk-mq-debugfs: separate flags with |
    nfs: Fix bdi handling for cloned superblocks
    block/mq: Cure cpu hotplug lock inversion
    lightnvm: fix bad back free on error path
    lightnvm: create cmd before allocating request
    blk-mq: don't use sync workqueue flushing from drivers
    mtip32xx: convert internal commands to regular block infrastructure
    mtip32xx: cleanup internal tag assumptions
    block: don't call blk_mq_quiesce_queue() after queue is frozen
    ...

    Linus Torvalds
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

04 May, 2017

7 commits

  • Expose the fifo lists, cached next requests, batching state, and
    dispatch list. It'd also be possible to add the sorted lists, but
    there are no existing seq_file helpers for rbtrees.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Expose the domain token pools, asynchronous sbitmap depth, domain
    request lists, and batching state.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This provides the infrastructure for schedulers to expose their internal
    state through debugfs. We add a list of queue attributes and a list of
    hctx attributes to struct elevator_type and wire them up when switching
    schedulers.
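
    A sketch of the hooks in struct elevator_type (field names assumed
    from the patch):

        struct elevator_type {
                /* ... existing fields ... */
        #ifdef CONFIG_BLK_DEBUG_FS
                const struct blk_mq_debugfs_attr *queue_debugfs_attrs;
                const struct blk_mq_debugfs_attr *hctx_debugfs_attrs;
        #endif
        };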

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke

    Add missing seq_file.h header in blk-mq-debugfs.h

    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Originally, I tied debugfs registration/unregistration together with
    sysfs. There's no reason to do this, and it's getting in the way of
    letting schedulers define their own debugfs attributes. Instead, tie the
    debugfs registration to the lifetime of the structures themselves.

    The saner lifetimes mean we can also get rid of the extra mq directory
    and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
    just nvme0n1/hctx0/tags.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Preparation for adding more declarations.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • In commit e869b5462f83 ("blk-mq: Unregister debugfs attributes
    earlier"), we shuffled the debugfs cleanup around so that the "state"
    attribute was removed before we freed the blk-mq data structures.
    However, later changes are going to undo that, so we need to explicitly
    disallow running a dead queue.

    [Omar: rebased and updated commit message]
    Signed-off-by: Omar Sandoval
    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • A large part of blk-mq-debugfs.c is file_operations and seq_file
    boilerplate. This sucks as is but will suck even more when schedulers
    can define their own debugfs entries. Factor it all out into a single
    blk_mq_debugfs_fops which multiplexes as needed. We store the
    request_queue, blk_mq_hw_ctx, or blk_mq_ctx in the parent directory
    dentry, which is kind of hacky, but it works.
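
    The key trick, sketched (assuming i_private is used for both
    pieces):

        /* one fops for every attribute; recover context at open time */
        const struct blk_mq_debugfs_attr *attr = inode->i_private;
        void *data = d_inode(file->f_path.dentry->d_parent)->i_private;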

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval