10 Oct, 2020

3 commits

  • blk_exit_queue() will free elevator_data while blk_mq_run_work_fn()
    may still access it. Move the cancel of hctx->run_work to the front
    of blk_exit_queue() to avoid the use-after-free (see the sketch
    below).

    Fixes: 1b97871b501f ("blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release")
    Signed-off-by: Yang Yang
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Yang Yang
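
    A minimal sketch of the release ordering this fix establishes (not
    the verbatim patch), assuming the queue_for_each_hw_ctx() iterator
    from blk-mq.h:

        /* cancel pending run_work on every hardware context before
         * blk_exit_queue() frees the elevator_data that the work
         * function would dereference */
        struct blk_mq_hw_ctx *hctx;
        int i;

        queue_for_each_hw_ctx(q, hctx, i)
                cancel_delayed_work_sync(&hctx->run_work);

        blk_exit_queue(q);      /* elevator_data freed here, safely */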
     
  • Since the whole elevator registration is protected by sysfs_lock,
    we don't need the extra 'has_elevator' flag. Just use q->elevator
    directly.

    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     
  • We register debugfs for the scheduler regardless of whether it
    defines an .exit_sched callback, so blk_mq_exit_sched() is always
    needed to unregister debugfs. Also, q->elevator should be set to
    NULL after exiting the scheduler (see the sketch below).

    For now, since all registered schedulers define .exit_sched, this
    does not cause any actual problem, but the change is more
    reasonable.

    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
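
    A simplified sketch of the teardown order described above (the real
    function does more, e.g. per-hctx teardown), assuming the
    blk_mq_debugfs_unregister_sched() helper:

        void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
        {
                /* always unregister debugfs, even without .exit_sched */
                blk_mq_debugfs_unregister_sched(q);
                if (e->type->ops.exit_sched)
                        e->type->ops.exit_sched(e);
                /* don't leave a dangling elevator pointer behind */
                q->elevator = NULL;
        }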
     

25 Sep, 2020

2 commits

  • The BDI_CAP_STABLE_WRITES flag is one of the few bits of
    information in the backing_dev_info shared between the block
    drivers and the writeback code. To help untangle the dependency,
    replace it with a queue flag and a superblock flag derived from it.
    This also helps with the case of e.g. a file system requiring
    stable writes due to its own checksumming, but not forcing it on
    other users of the block device like the swap code (see the sketch
    below).

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute
    which is also writable for easier testing.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
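
    A hedged sketch of the two sides of this split; the flag names
    QUEUE_FLAG_STABLE_WRITES and SB_I_STABLE_WRITES follow the commit
    description:

        /* driver side: mark the queue as requiring stable page writes */
        blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);

        /* writeback side: consult the superblock flag derived from it,
         * so a checksumming fs can opt in without forcing swap etc. */
        if (sb->s_iflags & SB_I_STABLE_WRITES)
                wait_for_stable_page(page);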
     
  • Drivers shouldn't really mess with the readahead size, as that is a
    VM concept. Instead set it based on the optimal I/O size by lifting
    the algorithm from the md driver when registering the disk. Also
    set bdi->io_pages there by applying the same scheme based on
    max_sectors. To ensure the limits work well for stacking drivers, a
    new helper is added to update the readahead limits from the block
    limits, which is also called from disk_stack_limits (see the sketch
    below).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Jan Kara
    Reviewed-by: Mike Snitzer
    Reviewed-by: Martin K. Petersen
    Acked-by: Coly Li
    Signed-off-by: Jens Axboe

    Christoph Hellwig
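
    A minimal sketch of what such a helper could look like; the name
    blk_queue_update_readahead() is assumed, the accessors
    queue_io_opt() and queue_max_sectors() and the VM_READAHEAD_PAGES
    default are existing kernel symbols:

        void blk_queue_update_readahead(struct request_queue *q)
        {
                /* readahead: twice the optimal I/O size, but never
                 * below the VM default */
                q->backing_dev_info->ra_pages =
                        max(queue_io_opt(q) * 2 / PAGE_SIZE,
                            VM_READAHEAD_PAGES);
                /* io_pages: derived from max_sectors (512-byte units) */
                q->backing_dev_info->io_pages =
                        queue_max_sectors(q) >> (PAGE_SHIFT - 9);
        }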
     

08 Sep, 2020

2 commits


16 Jul, 2020

2 commits

  • Add a new max_active zones definition in the sysfs documentation.
    This definition will be common for all devices utilizing the zoned block
    device support in the kernel.

    Export max_active_zones according to this new definition for NVMe Zoned
    Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
    the kernel), and ZBC SCSI devices.

    Add the new max_active_zones member to struct request_queue, rather
    than as a queue limit, since this property cannot be split across stacking
    drivers.

    For SCSI devices, even though max active zones is not part of the ZBC/ZAC
    spec, export max_active_zones as 0, signifying "no limit".

    Signed-off-by: Niklas Cassel
    Reviewed-by: Javier González
    Reviewed-by: Damien Le Moal
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Niklas Cassel
     
  • Add a new max_open_zones definition in the sysfs documentation.
    This definition will be common for all devices utilizing the zoned block
    device support in the kernel.

    Export max open zones according to this new definition for NVMe Zoned
    Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
    the kernel), and ZBC SCSI devices.

    Add the new max_open_zones member to struct request_queue, rather
    than as a queue limit, since this property cannot be split across
    stacking drivers (a sketch of the accessor pattern follows this
    entry).

    Signed-off-by: Niklas Cassel
    Reviewed-by: Javier González
    Reviewed-by: Damien Le Moal
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Niklas Cassel
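
    A sketch of how such a member can be exposed uniformly, assuming
    the accessor is named queue_max_open_zones(); the max_active_zones
    accessor in the previous entry follows the same pattern:

        #ifdef CONFIG_BLK_DEV_ZONED
        static inline unsigned int
        queue_max_open_zones(const struct request_queue *q)
        {
                return q->max_open_zones;       /* 0 means "no limit" */
        }
        #else
        static inline unsigned int
        queue_max_open_zones(const struct request_queue *q)
        {
                return 0;       /* non-zoned: no limit to report */
        }
        #endif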
     

24 Jun, 2020

2 commits

  • We were creating the request_queue debugfs_dir only for
    make_request block drivers (multiqueue), but never for
    request-based block drivers. We did this because we were only
    creating non-blktrace additional debugfs files in that directory
    for make_request drivers. However, since blktrace *always* creates
    that directory anyway, we special-case the use of that directory
    for blktrace. Besides being an eyesore, this exposes request-based
    block drivers to the same fragile debugfs race that used to exist
    with make_request block drivers: if we start adding files to that
    directory, we can later run into a double removal of dentries on
    the directory if we don't handle this carefully in blktrace.

    Instead, just simplify things by always creating the request_queue
    debugfs_dir on request_queue registration. Also rename the mutex to
    reflect the fact that it is used outside of the blktrace context.

    Signed-off-by: Luis Chamberlain
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Luis Chamberlain
     
  • Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
    v4.12 moved the work behind blk_release_queue() into a workqueue after a
    splat floated around which indicated some work on blk_release_queue()
    could sleep in blk_exit_rl(). This splat would be possible when a driver
    called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
    as its final call) from an atomic context.

    blk_put_queue() decrements the refcount for the request_queue kobject, and
    upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
    now removed through commit db6d99523560 ("block: remove request_list code")
    on v5.0, we reserve the right to be able to sleep within
    blk_release_queue() context.

    The last reference to the request_queue must not be dropped from
    atomic context. *When* the last reference to the request_queue
    reaches 0 varies, so let's take the opportunity to document when
    that is expected to happen, and also document the context of the
    related calls as best as possible, so we can avoid future issues,
    with the hope that the synchronous request_queue removal sticks.

    We revert back to synchronous request_queue removal because asynchronous
    removal creates a regression with expected userspace interaction with
    several drivers. An example is removing the loopback driver: one
    uses ioctls from userspace to do so, and upon successful return one
    expects the device to be removed. Likewise, if one races to add
    another device, the new one may not be added as the old one is
    still being removed. This was expected behavior before, and it now
    fails as the device is still present and busy. Moving to
    asynchronous request_queue removal could have broken many scripts
    which relied on the removal having completed if there was no error.
    Document this expectation as well so that this doesn't regress
    userspace again.

    Using asynchronous request_queue removal however has helped us find
    other bugs. In the future we can test what could break with this
    arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

    While at it, update the docs with the context expectations for the
    request_queue / gendisk refcount decrement, and make these
    expectations explicit by using might_sleep() (see the sketch
    below).

    Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
    Suggested-by: Nicolai Stange
    Signed-off-by: Luis Chamberlain
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Cc: Bart Van Assche
    Cc: Omar Sandoval
    Cc: Hannes Reinecke
    Cc: Nicolai Stange
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: yu kuai
    Signed-off-by: Jens Axboe

    Luis Chamberlain
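
    A hedged sketch of one plausible placement for the might_sleep()
    annotation; treat it as illustrative, not the exact patch:

        /**
         * blk_cleanup_queue - shutdown a request queue
         * @q: request queue to shutdown
         *
         * Context: can sleep. Must not be called from atomic context,
         * since dropping the last reference ends in blk_release_queue().
         */
        void blk_cleanup_queue(struct request_queue *q)
        {
                might_sleep();          /* warn early when called atomically */

                /* ... tear the queue down ... */

                blk_put_queue(q);       /* may drop the last reference */
        }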
     

13 May, 2020

1 commit

  • Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
    block device. This is a no-merge write operation.

    A zone append write BIO must:
    * Target a zoned block device
    * Have a sector position indicating the start sector of the target zone
    * Target a sequential write zone
    * Not cross a zone boundary
    * Not be split, to ensure that a single range of LBAs is written
    with a single command

    Implement these checks in generic_make_request_checks() using the
    helper function blk_check_zone_append() (sketched after this
    entry). To avoid zone append BIO splitting, introduce the new
    max_zone_append_sectors queue limit attribute and ensure that a BIO
    size is always lower than this limit. Export this new limit through
    sysfs and check these limits in bio_full().

    Also, when an LLDD can't dispatch a request to a specific zone, it
    will return BLK_STS_ZONE_RESOURCE indicating that this request
    needs to be delayed, e.g. because the zone it will be dispatched to
    is still write-locked. If this happens, set the request aside in a
    local list and continue trying to dispatch requests such as READ
    requests or WRITE/ZONE_APPEND requests targeting other zones. This
    way we can still keep a high queue depth without starving other
    requests, even if one request can't be served due to zone
    write-locking.

    Finally, make sure that the bio sector position indicates the actual
    write position as indicated by the device on completion.

    Signed-off-by: Keith Busch
    [ jth: added zone-append specific add_page and merge_page helpers ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Keith Busch
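
    A hedged sketch of the checks blk_check_zone_append() could
    perform, assuming power-of-two zone sizes and the
    max_zone_append_sectors limit introduced by this change:

        static inline blk_status_t
        blk_check_zone_append(struct request_queue *q, struct bio *bio)
        {
                sector_t pos = bio->bi_iter.bi_sector;

                /* must target a zoned block device */
                if (!blk_queue_is_zoned(q))
                        return BLK_STS_NOTSUPP;

                /* must start at the start sector of a zone */
                if (pos & (blk_queue_zone_sectors(q) - 1))
                        return BLK_STS_IOERR;

                /* must fit in one command, i.e. never be split */
                if (bio_sectors(bio) > q->limits.max_zone_append_sectors)
                        return BLK_STS_IOERR;

                return BLK_STS_OK;
        }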
     

07 Oct, 2019

1 commit

  • Block drivers must call del_gendisk() before blk_cleanup_queue().
    del_gendisk() calls kobject_del() and kobject_del() waits until any
    ongoing sysfs callback functions have finished. In other words, the
    sysfs callback functions won't be called for a queue in the dying
    state. Hence remove the "dying" checks from the sysfs callback
    functions.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

27 Sep, 2019

1 commit


26 Sep, 2019

1 commit

  • cecf5d87ff20 ("block: split .sysfs_lock into two locks") started to
    release & acquire sysfs_lock before registering/un-registering the
    elevator queue while switching elevators, to avoid a potential
    deadlock between showing & storing the 'queue/iosched' attributes
    and removing the elevator's kobject.

    It turns out there is no such deadlock, because 'q->sysfs_lock'
    isn't required in the .show & .store of queue/iosched's attributes,
    and just the elevator's sysfs lock is acquired in
    elv_iosched_store() and elv_iosched_show(). So it is safe to hold
    the queue's sysfs lock when registering/un-registering the elevator
    queue.

    The biggest issue is that commit cecf5d87ff20 assumes that
    concurrent writes on 'queue/scheduler' can't happen. However, this
    assumption isn't true, because kernfs_fop_write() only guarantees
    that concurrent writes aren't called on the same open file; the
    writes could come from different opens of the file. So we can't
    release & re-acquire the queue's sysfs lock during elevator
    switching, otherwise a use-after-free on the elevator could be
    triggered.

    Fix the issue by not releasing the queue's sysfs lock during
    elevator switching.

    Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

18 Sep, 2019

1 commit

  • Pull block updates from Jens Axboe:

    - Two NVMe pull requests:
        - ana log parse fix from Anton
        - nvme quirks support for Apple devices from Ben
        - fix missing bio completion tracing for multipath stack
          devices from Hannes and Mikhail
        - IP TOS settings for nvme rdma and tcp transports from Israel
        - rq_dma_dir cleanups from Israel
        - tracing for Get LBA Status command from Minwoo
        - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
        - Some consolidation between the fabrics transports for
          handling the CAP register
        - reset race with ns scanning fix for fabrics (move fabrics
          commands to a dedicated request queue with a different
          lifetime from the admin request queue)
        - controller reset and namespace scan races fixes
        - nvme discovery log change uevent support
        - naming improvements from Keith
        - multiple discovery controllers reject fix from James
        - some regular cleanups from various people

    - Series fixing (and re-fixing) null_blk debug printing and nr_devices
    checks (André)

    - A few pull requests from Song, with fixes from Andy, Guoqing,
    Guilherme, Neil, Nigel, and Yufen.

    - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

    - Bio merge handling unification (Christoph)

    - Pick default elevator correctly for devices with special needs
    (Damien)

    - Block stats fixes (Hou)

    - Timeout and support devices nbd fixes (Mike)

    - Series fixing races around elevator switching and device add/remove
    (Ming)

    - sed-opal cleanups (Revanth)

    - Per device weight support for BFQ (Fam)

    - Support for blk-iocost, a new model that can properly account cost of
    IO workloads. (Tejun)

    - blk-cgroup writeback fixes (Tejun)

    - paride queue init fixes (zhengbin)

    - blk_set_runtime_active() cleanup (Stanley)

    - Block segment mapping optimizations (Bart)

    - lightnvm fixes (Hans/Minwoo/YueHaibing)

    - Various little fixes and cleanups

    * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
    null_blk: format pr_* logs with pr_fmt
    null_blk: match the type of parameter nr_devices
    null_blk: do not fail the module load with zero devices
    block: also check RQF_STATS in blk_mq_need_time_stamp()
    block: make rq sector size accessible for block stats
    bfq: Fix bfq linkage error
    raid5: use bio_end_sector in r5_next_bio
    raid5: remove STRIPE_OPS_REQ_PENDING
    md: add feature flag MD_FEATURE_RAID0_LAYOUT
    md/raid0: avoid RAID0 data corruption due to layout confusion.
    raid5: don't set STRIPE_HANDLE to stripe which is in batch list
    raid5: don't increment read_errors on EILSEQ return
    nvmet: fix a wrong error status returned in error log page
    nvme: send discovery log page change events to userspace
    nvme: add uevent variables for controller devices
    nvme: enable aen regardless of the presence of I/O queues
    nvme-fabrics: allow discovery subsystems accept a kato
    nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
    nvme: Remove redundant assignment of cq vector
    nvme: Assign subsys instance from first ctrl
    ...

    Linus Torvalds
     

12 Sep, 2019

1 commit

  • cecf5d87ff20 ("block: split .sysfs_lock into two locks") started to
    release & acquire sysfs_lock again during elevator switching. So it
    isn't enough to prevent the switch from happening by simply
    clearing QUEUE_FLAG_REGISTERED while holding sysfs_lock, because an
    in-progress switch can still move on after re-acquiring the lock,
    and meanwhile the QUEUE_FLAG_REGISTERED flag won't get checked.

    Fix this issue by checking 'q->elevator' directly & locklessly
    after q->kobj is removed in blk_unregister_queue(); this is safe
    because q->elevator can't be changed at that time.

    Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Cc: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Aug, 2019

2 commits

  • The kernfs built-in lock of 'kn->count' is held in the sysfs
    .show/.store path. Meantime, inside block's .show/.store callbacks,
    q->sysfs_lock is required.

    However, when mq & iosched kobjects are removed via
    blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is
    held too. This causes an AB-BA deadlock, because the kernfs
    built-in lock of 'kn->count' is required inside kobject_del() too;
    see the lockdep warning [1].

    On the other hand, it isn't necessary to acquire q->sysfs_lock for
    both blk_mq_unregister_dev() & elv_unregister_queue(), because
    clearing the REGISTERED flag prevents stores to 'queue/scheduler'
    from happening. Also, sysfs write (store) is exclusive, so there is
    no need to hold the lock for elv_unregister_queue() when it is
    called on the elevator switching path.

    So split .sysfs_lock into two: one is still named .sysfs_lock and
    covers synchronizing .store, the other is named .sysfs_dir_lock and
    covers kobjects and related status changes.

    sysfs itself can handle the race between adding/removing kobjects
    and showing/storing attributes under kobjects. For switching the
    scheduler via a store to 'queue/scheduler', we use the
    QUEUE_FLAG_REGISTERED queue flag with .sysfs_lock to avoid the
    race; then we can avoid holding .sysfs_lock while removing/adding
    kobjects.

    [1] lockdep warning
    ======================================================
    WARNING: possible circular locking dependency detected
    5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
    ------------------------------------------------------
    rmmod/777 is trying to acquire lock:
    00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72

    but task is already holding lock:
    00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->sysfs_lock){+.+.}:
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    __mutex_lock+0x14a/0xa9b
    blk_mq_hw_sysfs_show+0x63/0xb6
    sysfs_kf_seq_show+0x11f/0x196
    seq_read+0x2cd/0x5f2
    vfs_read+0xc7/0x18c
    ksys_read+0xc4/0x13e
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#202){++++}:
    check_prev_add+0x5d2/0xc45
    validate_chain+0xed3/0xf94
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    __kernfs_remove+0x237/0x40b
    kernfs_remove_by_name_ns+0x59/0x72
    remove_files+0x61/0x96
    sysfs_remove_group+0x81/0xa4
    sysfs_remove_groups+0x3b/0x44
    kobject_del+0x44/0x94
    blk_mq_unregister_dev+0x83/0xdd
    blk_unregister_queue+0xa0/0x10b
    del_gendisk+0x259/0x3fa
    null_del_dev+0x8b/0x1c3 [null_blk]
    null_exit+0x5c/0x95 [null_blk]
    __se_sys_delete_module+0x204/0x337
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0                            CPU1
    ----                            ----
    lock(&q->sysfs_lock);
                                    lock(kn->count#202);
                                    lock(&q->sysfs_lock);
    lock(kn->count#202);

    *** DEADLOCK ***

    2 locks held by rmmod/777:
    #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
    #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    stack backtrace:
    CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
    Call Trace:
    dump_stack+0x9a/0xe6
    check_noncircular+0x207/0x251
    ? print_circular_bug+0x32a/0x32a
    ? find_usage_backwards+0x84/0xb0
    check_prev_add+0x5d2/0xc45
    validate_chain+0xed3/0xf94
    ? check_prev_add+0xc45/0xc45
    ? mark_lock+0x11b/0x804
    ? check_usage_forwards+0x1ca/0x1ca
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    ? kernfs_remove_by_name_ns+0x59/0x72
    __kernfs_remove+0x237/0x40b
    ? kernfs_remove_by_name_ns+0x59/0x72
    ? kernfs_next_descendant_post+0x7d/0x7d
    ? strlen+0x10/0x23
    ? strcmp+0x22/0x44
    kernfs_remove_by_name_ns+0x59/0x72
    remove_files+0x61/0x96
    sysfs_remove_group+0x81/0xa4
    sysfs_remove_groups+0x3b/0x44
    kobject_del+0x44/0x94
    blk_mq_unregister_dev+0x83/0xdd
    blk_unregister_queue+0xa0/0x10b
    del_gendisk+0x259/0x3fa
    ? disk_events_poll_msecs_store+0x12b/0x12b
    ? check_flags+0x1ea/0x204
    ? mark_held_locks+0x1f/0x7a
    null_del_dev+0x8b/0x1c3 [null_blk]
    null_exit+0x5c/0x95 [null_blk]
    __se_sys_delete_module+0x204/0x337
    ? free_module+0x39f/0x39f
    ? blkcg_maybe_throttle_current+0x8a/0x718
    ? rwlock_bug+0x62/0x62
    ? __blkcg_punt_bio_submit+0xd0/0xd0
    ? trace_hardirqs_on_thunk+0x1a/0x20
    ? mark_held_locks+0x1f/0x7a
    ? do_syscall_64+0x4c/0x295
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fb696cdbe6b
    Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
    RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
    RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
    R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
    R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • There are 4 users which check whether a queue is registered, so add
    a helper to do that (see the sketch below).

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Cc: Bart Van Assche
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
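
    A one-line sketch of such a helper, assuming the existing
    QUEUE_FLAG_REGISTERED flag:

        static inline bool blk_queue_registered(struct request_queue *q)
        {
                /* true once blk_register_queue() has completed */
                return test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags);
        }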
     

12 Aug, 2019

1 commit

  • blk_exit_queue will free elevator_data, while blk_mq_requeue_work
    will access it. Move cancel of requeue_work to the front of
    blk_exit_queue to avoid use-after-free.

    blk_exit_queue                blk_mq_requeue_work
      __elevator_exit               blk_mq_run_hw_queues
        blk_mq_exit_sched             blk_mq_run_hw_queue
          dd_exit_queue                 blk_mq_hctx_has_pending
            kfree(elevator_data)          blk_mq_sched_has_work
                                            dd_has_work

    Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release")
    Cc: stable@vger.kernel.org
    Reviewed-by: Ming Lei
    Signed-off-by: zhengbin
    Signed-off-by: Jens Axboe

    zhengbin
     

07 Jun, 2019

1 commit

  • In theory, the IO scheduler belongs to the request queue, and the
    request pool of sched tags belongs to the request queue too.

    However, the current tags allocation interfaces are re-used for
    both driver tags and sched tags, and driver tags are definitely
    host wide and don't belong to any request queue, same for their
    request pool. So we need the tagset instance for freeing requests
    of sched tags.

    Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue()
    in the non-BLK_MQ_F_TAG_SHARED case, which requires that the
    request pool of sched tags be freed before calling
    blk_mq_free_tag_set().

    Commit 47cdee29ef9d94e ("block: move blk_exit_queue into
    __blk_release_queue") moved blk_exit_queue into __blk_release_queue
    to simplify the fast path in generic_make_request(), which then
    caused an oops while freeing requests of sched tags in
    __blk_release_queue().

    Fix the above issue by moving the freeing of the sched tags request
    pool into blk_cleanup_queue(); this is safe because the queue has
    been frozen and there are no in-queue requests at that time.
    Freeing the sched tags themselves has to be kept in the queue's
    release handler, because there might be un-completed dispatch
    activity which might refer to the sched tags.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
    Tested-by: Yi Zhang
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 May, 2019

1 commit

  • Commit 498f6650aec8 ("block: Fix a race between the cgroup code and
    request queue initialization") moved what blk_exit_queue does into
    blk_cleanup_queue() to fix an issue caused by switching back the
    queue lock.

    However, now that the legacy request IO path has been killed, the
    driver queue lock isn't used at all, and there is no case of
    switching back the queue lock anymore. So the issue addressed by
    commit 498f6650aec8 no longer exists.

    So move blk_exit_queue into __blk_release_queue.

    This patch basically reverts the following two commits:

    498f6650aec8 block: Fix a race between the cgroup code and request queue initialization
    24ecc3585348 block: Ensure that a request queue is dissociated from the cgroup controller

    Cc: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

22 Apr, 2019

1 commit


21 Mar, 2019

1 commit

  • q->poll_nsec == -1 means doing classic poll, not hybrid poll. We
    introduce a new flag BLK_MQ_POLL_CLASSIC to replace -1, which makes
    the code easier to read.

    Additionally, since val is an int obtained with kstrtoint(), val
    can be a negative value other than -1, so return -EINVAL for that
    case (see the sketch below).

    Thanks to Damien Le Moal for some good suggestions.

    Reviewed-by: Damien Le Moal
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
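
    A hedged sketch of the resulting sysfs store path, assuming
    BLK_MQ_POLL_CLASSIC is defined as -1 (validation of q->mq_ops is
    omitted):

        static ssize_t queue_poll_delay_store(struct request_queue *q,
                                              const char *page, size_t count)
        {
                int err, val;

                err = kstrtoint(page, 10, &val);
                if (err < 0)
                        return err;

                if (val == BLK_MQ_POLL_CLASSIC)
                        q->poll_nsec = BLK_MQ_POLL_CLASSIC;
                else if (val >= 0)
                        q->poll_nsec = val * 1000;      /* usec -> nsec */
                else
                        return -EINVAL; /* reject other negative values */

                return count;
        }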
     

11 Feb, 2019

2 commits


29 Dec, 2018

1 commit

  • Pull SCSI updates from James Bottomley:
    "This is mostly update of the usual drivers: smarpqi, lpfc, qedi,
    megaraid_sas, libsas, zfcp, mpt3sas, hisi_sas.

    Additionally, we have a pile of annotation, unused variable and minor
    updates.

    The big API change is the updates for Christoph's DMA rework which
    include removing the DISABLE_CLUSTERING flag.

    And finally there are a couple of target tree updates"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (259 commits)
    scsi: isci: request: mark expected switch fall-through
    scsi: isci: remote_node_context: mark expected switch fall-throughs
    scsi: isci: remote_device: Mark expected switch fall-throughs
    scsi: isci: phy: Mark expected switch fall-through
    scsi: iscsi: Capture iscsi debug messages using tracepoints
    scsi: myrb: Mark expected switch fall-throughs
    scsi: megaraid: fix out-of-bound array accesses
    scsi: mpt3sas: mpt3sas_scsih: Mark expected switch fall-through
    scsi: fcoe: remove set but not used variable 'port'
    scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown()
    scsi: smartpqi: fix build warnings
    scsi: smartpqi: update driver version
    scsi: smartpqi: add ofa support
    scsi: smartpqi: increase fw status register read timeout
    scsi: smartpqi: bump driver version
    scsi: smartpqi: add smp_utils support
    scsi: smartpqi: correct lun reset issues
    scsi: smartpqi: correct volume status
    scsi: smartpqi: do not offline disks for transient did no connect conditions
    scsi: smartpqi: allow for larger raid maps
    ...

    Linus Torvalds
     

19 Dec, 2018

1 commit

  • Now that the SCSI layer replaced the use of the cluster flag with
    segment size limits and the DMA boundary, we can remove the cluster
    flag from the block layer.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Signed-off-by: Martin K. Petersen

    Christoph Hellwig
     

18 Dec, 2018

1 commit

  • The queue mapping of type poll only exists when
    set->map[HCTX_TYPE_POLL].nr_queues is bigger than zero, so
    strengthen the constraint by checking .nr_queues of the poll type
    before enabling IO poll (see the sketch below).

    Otherwise IO races & timeouts can be observed when running
    block/007.

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
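
    A minimal sketch of the strengthened check in the 'io_poll' sysfs
    store path; the exact placement is an assumption:

        /* only allow enabling poll if a poll queue map really exists */
        if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
            !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
                return -EINVAL;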
     

05 Dec, 2018

1 commit


29 Nov, 2018

1 commit


16 Nov, 2018

3 commits

  • Various spots check for q->mq_ops being non-NULL; provide a helper
    to do this instead (see the sketch below).

    Where the ->mq_ops != NULL check is redundant, remove it.

    Since mq == rq-based now that legacy is gone, get rid of
    queue_is_rq_based() and just use queue_is_mq() everywhere.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
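
    The helper is a one-liner; a sketch:

        static inline bool queue_is_mq(struct request_queue *q)
        {
                /* a queue is blk-mq (and thus request based) iff it
                 * has mq_ops */
                return q->mq_ops;
        }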
     
  • With the legacy request path gone there is no good reason to keep
    queue_lock as a pointer, we can always use the embedded lock now.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    Fixed floppy and blk-cgroup missing conversions and half done edits.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • ->queue_flags is generally not set or cleared in the fast path, and
    also generally set or cleared one flag at a time. Make use of the
    normal atomic bitops for it so that we don't need to take the
    queue_lock, which is otherwise mostly unused in the core block
    layer now (see the sketch below).

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
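
    A sketch of the lock-free flag helpers this change enables,
    assuming queue_flags becomes an unsigned long bitmap:

        void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
        {
                set_bit(flag, &q->queue_flags); /* atomic, no queue_lock */
        }

        void blk_queue_flag_clear(unsigned int flag, struct request_queue *q)
        {
                clear_bit(flag, &q->queue_flags);
        }

        bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q)
        {
                return test_and_set_bit(flag, &q->queue_flags);
        }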
     

08 Nov, 2018

2 commits

  • This removes a bunch of core and elevator related code. On the core
    front, we remove anything related to queue running, draining,
    initialization, plugging, and congestion. We also kill anything
    related to request allocation, merging, retrieval, and completion.

    Remove any checking for single queue IO schedulers, as they no
    longer exist. This means we can also delete a bunch of code related
    to request issue, adding, completion, etc - and all the SQ related
    ops and helpers.

    Also kill the load_default_modules(), as all that did was provide
    for a way to load the default single queue elevator.

    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's now unused, kill it.

    Reviewed-by: Hannes Reinecke
    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 Oct, 2018

1 commit

  • rq_qos_exit() removes the current q->rq_qos. This has to be done
    after the queue is frozen, otherwise the IO queue path may never be
    woken up, causing an IO hang.

    So fix this issue by moving rq_qos_exit() to after the queue is
    frozen (see the sketch below).

    Cc: Josef Bacik
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
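
    The resulting ordering, sketched with the existing
    blk_freeze_queue() helper:

        blk_freeze_queue(q);    /* drain all in-flight requests first */
        rq_qos_exit(q);         /* safe now: nothing can enter the
                                   rq_qos path and block forever */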
     

26 Oct, 2018

2 commits

  • Drivers exposing zoned block devices have to initialize and maintain
    correctness (i.e. revalidate) of the device zone bitmaps attached to
    the device request queue (seq_zones_bitmap and seq_zones_wlock).

    To simplify coding this, introduce a generic helper function
    blk_revalidate_disk_zones() suitable for most (and likely all)
    cases. This new function always updates the seq_zones_bitmap and
    seq_zones_wlock bitmaps as well as the queue nr_zones field when
    called for a disk using a request-based queue. For a disk using a
    BIO-based queue, only the number of zones is updated, since these
    queues do not have schedulers and so do not need the zone bitmaps.

    With this change, the zone bitmap initialization code in sd_zbc.c can be
    replaced with a call to this function in sd_zbc_read_zones(), which is
    called from the disk revalidate block operation method.

    A call to blk_revalidate_disk_zones() is also added to the null_blk
    driver for devices created with the zoned mode enabled.

    Finally, to ensure that zoned devices created with dm-linear or
    dm-flakey expose the correct number of zones through sysfs, a call to
    blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

    The zone bitmaps allocated and initialized with
    blk_revalidate_disk_zones() are freed automatically from
    __blk_release_queue() using the block internal function
    blk_queue_free_zone_bitmaps().

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Expose through sysfs the nr_zones field of struct request_queue.
    Exposing this value helps in debugging disk issues as well as
    facilitating scripts based use of the disk (e.g. blktests).

    For zoned block devices, the nr_zones field indicates the total number
    of zones of the device calculated using the known disk capacity and
    zone size. This number of zones is always 0 for regular block devices.

    Since nr_zones is defined conditionally with CONFIG_BLK_DEV_ZONED,
    introduce the blk_queue_nr_zones() function to return the correct
    value for any device, regardless of whether CONFIG_BLK_DEV_ZONED is
    set (see the sketch below).

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
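
    A sketch of the helper; the #else stub keeps callers free of
    CONFIG_BLK_DEV_ZONED ifdefs:

        #ifdef CONFIG_BLK_DEV_ZONED
        static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
        {
                return q->nr_zones;
        }
        #else
        static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
        {
                return 0;       /* regular block devices have no zones */
        }
        #endif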
     

23 Aug, 2018

1 commit

  • A previous commit removed the ability to have per-rq flags. We used
    those flags to maintain inflight counts. Since we don't have those
    anymore, we have to always maintain inflight counts, even if wbt is
    disabled. This is clearly suboptimal.

    Add a queue quiesce around changing the wbt latency settings from sysfs
    to work around this. With that, we can reliably put the enabled check in
    our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
    consistent for the lifetime of the request.

    Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Jens Axboe