28 Sep, 2019

2 commits

  • Some HDD drives may expose multiple hardware queues, such as those
    behind a MegaRaid controller. Let's apply normal plugging for such
    devices because sequential IO may benefit a lot from plug merging.
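
    As an illustration (not taken from the patch itself), this is the usual
    in-kernel plugging pattern that such devices now benefit from; 'bios[]'
    and 'nr_bios' stand in for the caller's batch of contiguous bios:

    struct blk_plug plug;
    int i;

    blk_start_plug(&plug);
    for (i = 0; i < nr_bios; i++)
            submit_bio(bios[i]);    /* adjacent sectors merge in the plug list */
    blk_finish_plug(&plug);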

    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Dave Chinner
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • If a device is using multiple queues, the IO scheduler may be bypassed.
    This may hurt performance for some slow MQ devices, and it also breaks
    zoned devices which depend on mq-deadline to respect the write order
    within a zone.

    Don't bypass the IO scheduler if one is set up.

    With this patch, sequential write performance roughly doubles on MQ
    scsi_debug when mq-deadline is applied.
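
    A rough sketch of the decision this implies in the submission path
    (illustrative only, not the actual diff; the surrounding context and
    exact helper signatures are assumptions based on the blk-mq code of
    that era):

    if (q->elevator) {
            /* hand the request to the attached scheduler so it can
             * merge and order it (e.g. mq-deadline on zoned devices) */
            blk_mq_sched_insert_request(rq, false, true, true);
    } else {
            /* no scheduler attached: direct issue is still fine */
            blk_mq_try_issue_directly(hctx, rq, &cookie);
    }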

    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Dave Chinner
    Reviewed-by: Javier González
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

27 Sep, 2019

1 commit

  • We hit a NULL pointer dereference BUG in blk_mq_rq_timed_out(),
    as shown below:

    [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
    [ 108.827059] PGD 0 P4D 0
    [ 108.827313] Oops: 0000 [#1] SMP PTI
    [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
    [ 108.829503] Workqueue: kblockd blk_mq_timeout_work
    [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
    [ 108.838191] Call Trace:
    [ 108.838406] bt_iter+0x74/0x80
    [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
    [ 108.839074] ? __switch_to_asm+0x34/0x70
    [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
    [ 108.840732] blk_mq_timeout_work+0x74/0x200
    [ 108.841151] process_one_work+0x297/0x680
    [ 108.841550] worker_thread+0x29c/0x6f0
    [ 108.841926] ? rescuer_thread+0x580/0x580
    [ 108.842344] kthread+0x16a/0x1a0
    [ 108.842666] ? kthread_flush_work+0x170/0x170
    [ 108.843100] ret_from_fork+0x35/0x40

    The bug is caused by a race between the timeout handler and the
    completion of a flush request.

    When the timeout handler blk_mq_rq_timed_out() tries to read
    'req->q->mq_ops', 'req' has already been completed and reinitialized
    by the next flush request, which calls blk_rq_init() to zero 'req'.

    After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
    the lifetime of normal requests is protected by a refcount: a request
    can only really be freed once 'rq->ref' drops to zero, so it cannot
    be reused before the timeout handler has finished.

    However, the flush request defines .end_io, and rq->end_io() is still
    called even if 'rq->ref' hasn't dropped to zero. After that, 'flush_rq'
    can be reused by the next flush request, resulting in the NULL pointer
    dereference BUG.

    Fix this problem by covering the flush request with 'rq->ref' as well:
    if the refcount is not zero, flush_end_io() returns and waits for the
    last holder to call it again. To record the request status, add a new
    field 'rq_status', which is used in flush_end_io().
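
    A condensed sketch of that guard (illustrative, not the literal diff;
    locking and field placement follow the v2-v5 notes below):

    static void flush_end_io(struct request *flush_rq, blk_status_t error)
    {
            struct blk_flush_queue *fq =
                    blk_get_flush_queue(flush_rq->q, flush_rq->mq_ctx);
            unsigned long flags;

            spin_lock_irqsave(&fq->mq_flush_lock, flags);
            if (!refcount_dec_and_test(&flush_rq->ref)) {
                    /* timeout handler still holds a reference: record the
                     * status and let the last holder finish the flush */
                    fq->rq_status = error;
                    spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
                    return;
            }
            /* normal flush completion continues here */
            spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
    }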

    Cc: Christoph Hellwig
    Cc: Keith Busch
    Cc: Bart Van Assche
    Cc: stable@vger.kernel.org # v4.18+
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Signed-off-by: Yufen Yu

    -------
    v2:
    - move rq_status from struct request to struct blk_flush_queue
    v3:
    - remove unnecessary '{}' pair.
    v4:
    - use a spinlock to protect 'fq->rq_status'
    v5:
    - move rq_status after flush_running_idx member of struct blk_flush_queue
    Signed-off-by: Jens Axboe

    Yufen Yu
     

18 Sep, 2019

3 commits

  • Currently the t10_pi_prepare/t10_pi_complete functions are called during
    NVMe and SCSI command preparation/completion, but their actual place
    should be the block layer since T10-PI is a general data integrity
    feature that is used by block storage protocols. Introduce .prepare_fn
    and .complete_fn callbacks within the integrity profile that each type
    can implement according to its needs.
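
    Roughly, the resulting profile has the shape sketched below (field and
    typedef names are illustrative, following the description rather than
    the exact header):

    struct blk_integrity_profile {
            integrity_processing_fn *generate_fn;
            integrity_processing_fn *verify_fn;
            integrity_prepare_fn    *prepare_fn;    /* remap PI before dispatch */
            integrity_complete_fn   *complete_fn;   /* restore PI on completion */
            const char              *name;
    };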

    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Suggested-by: Martin K. Petersen
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Max Gurtovoy

    Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY
    isn't defined in the config.

    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • Pull block updates from Jens Axboe:

    - Two NVMe pull requests:
        - ana log parse fix from Anton
        - nvme quirks support for Apple devices from Ben
        - fix missing bio completion tracing for multipath stack devices
          from Hannes and Mikhail
        - IP TOS settings for nvme rdma and tcp transports from Israel
        - rq_dma_dir cleanups from Israel
        - tracing for Get LBA Status command from Minwoo
        - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
        - Some consolidation between the fabrics transports for handling
          the CAP register
        - reset race with ns scanning fix for fabrics (move fabrics
          commands to a dedicated request queue with a different lifetime
          from the admin request queue)
        - controller reset and namespace scan races fixes
        - nvme discovery log change uevent support
        - naming improvements from Keith
        - multiple discovery controllers reject fix from James
        - some regular cleanups from various people

    - Series fixing (and re-fixing) null_blk debug printing and nr_devices
    checks (André)

    - A few pull requests from Song, with fixes from Andy, Guoqing,
    Guilherme, Neil, Nigel, and Yufen.

    - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

    - Bio merge handling unification (Christoph)

    - Pick default elevator correctly for devices with special needs
    (Damien)

    - Block stats fixes (Hou)

    - Timeout and support devices nbd fixes (Mike)

    - Series fixing races around elevator switching and device add/remove
    (Ming)

    - sed-opal cleanups (Revanth)

    - Per device weight support for BFQ (Fam)

    - Support for blk-iocost, a new model that can properly account cost of
    IO workloads. (Tejun)

    - blk-cgroup writeback fixes (Tejun)

    - paride queue init fixes (zhengbin)

    - blk_set_runtime_active() cleanup (Stanley)

    - Block segment mapping optimizations (Bart)

    - lightnvm fixes (Hans/Minwoo/YueHaibing)

    - Various little fixes and cleanups

    * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
    null_blk: format pr_* logs with pr_fmt
    null_blk: match the type of parameter nr_devices
    null_blk: do not fail the module load with zero devices
    block: also check RQF_STATS in blk_mq_need_time_stamp()
    block: make rq sector size accessible for block stats
    bfq: Fix bfq linkage error
    raid5: use bio_end_sector in r5_next_bio
    raid5: remove STRIPE_OPS_REQ_PENDING
    md: add feature flag MD_FEATURE_RAID0_LAYOUT
    md/raid0: avoid RAID0 data corruption due to layout confusion.
    raid5: don't set STRIPE_HANDLE to stripe which is in batch list
    raid5: don't increment read_errors on EILSEQ return
    nvmet: fix a wrong error status returned in error log page
    nvme: send discovery log page change events to userspace
    nvme: add uevent variables for controller devices
    nvme: enable aen regardless of the presence of I/O queues
    nvme-fabrics: allow discovery subsystems accept a kato
    nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
    nvme: Remove redundant assignment of cq vector
    nvme: Assign subsys instance from first ctrl
    ...

    Linus Torvalds
     
  • Pull core timer updates from Thomas Gleixner:
    "Timers and timekeeping updates:

    - A large overhaul of the posix CPU timer code which is a preparation
    for moving the CPU timer expiry out into task work so it can be
    properly accounted on the task/process.

    An update to the bogus permission checks will come later during the
    merge window as feedback was not complete before heading off for
    travel.

    - Switch the timerqueue code to use cached rbtrees and get rid of the
    homebrewn caching of the leftmost node.

    - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
    single function

    - Implement the separation of hrtimers to be forced to expire in hard
    interrupt context even when PREEMPT_RT is enabled and mark the
    affected timers accordingly.

    - Implement a mechanism for hrtimers and the timer wheel to protect
    RT against priority inversion and live lock issues when a (hr)timer
    which should be canceled is currently executing the callback.
    Instead of infinitely spinning, the task which tries to cancel the
    timer blocks on a per cpu base expiry lock which is held and
    released by the (hr)timer expiry code.

    - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
    resulting in faster access to timekeeping functions.

    - Updates to various clocksource/clockevent drivers and their device
    tree bindings.

    - The usual small improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
    posix-cpu-timers: Fix permission check regression
    posix-cpu-timers: Always clear head pointer on dequeue
    hrtimer: Add a missing bracket and hide `migration_base' on !SMP
    posix-cpu-timers: Make expiry_active check actually work correctly
    posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
    tick: Mark sched_timer to expire in hard interrupt context
    hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
    x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
    posix-cpu-timers: Utilize timerqueue for storage
    posix-cpu-timers: Move state tracking to struct posix_cputimers
    posix-cpu-timers: Deduplicate rlimit handling
    posix-cpu-timers: Remove pointless comparisons
    posix-cpu-timers: Get rid of 64bit divisions
    posix-cpu-timers: Consolidate timer expiry further
    posix-cpu-timers: Get rid of zero checks
    rlimit: Rewrite non-sensical RLIMIT_CPU comment
    posix-cpu-timers: Respect INFINITY for hard RTTIME limit
    posix-cpu-timers: Switch thread group sampling to array
    posix-cpu-timers: Restructure expiry array
    posix-cpu-timers: Remove cputime_expires
    ...

    Linus Torvalds
     

16 Sep, 2019

2 commits

  • In __blk_mq_end_request(), if block stats need updating, we should
    ensure 'now' is valid instead of 0 even when iostat is disabled.
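
    A sketch of the check this implies (illustrative; the exact flags test
    is an assumption based on the description and the shortlog entry
    "block: also check RQF_STATS in blk_mq_need_time_stamp()"):

    static inline bool blk_mq_need_time_stamp(struct request *rq)
    {
            /* blk-stats (RQF_STATS), iostat (RQF_IO_STAT) or an attached
             * scheduler all need a real timestamp at completion */
            return (rq->rq_flags & (RQF_STATS | RQF_IO_STAT)) || rq->q->elevator;
    }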

    Signed-off-by: Hou Tao
    Signed-off-by: Jens Axboe

    Hou Tao
     
  • Currently rq->data_len is decreased by partial completion or
    zeroed by completion, so when blk_stat_add() is invoked, data_len
    will be zero and there will never be samples in poll_cb because
    blk_mq_poll_stats_bkt() returns -1 if data_len is zero.

    We could move blk_stat_add() back to __blk_mq_complete_request(),
    but that would defeat the effort to call ktime_get_ns() only once.
    Instead we can reuse the throtl_size field, use it for both block
    stats and block throttle, and adjust the logic in
    blk_mq_poll_stats_bkt() accordingly.
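
    Sketched effect on the bucket calculation (illustrative;
    blk_rq_stats_sectors() is assumed to return the size preserved at
    issue time, per the companion patch):

    static int blk_mq_poll_stats_bkt(const struct request *rq)
    {
            int ddir = rq_data_dir(rq);
            /* use the size recorded at issue time, not the zeroed data_len */
            int sectors = blk_rq_stats_sectors(rq);
            int bucket = ddir + 2 * ilog2(sectors);

            if (bucket < 0)
                    return -1;
            if (bucket >= BLK_MQ_POLL_STATS_BKTS)
                    return ddir + BLK_MQ_POLL_STATS_BKTS - 2;
            return bucket;
    }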

    Fixes: 4bc6339a583c ("block: move blk_stat_add() to __blk_mq_end_request()")
    Tested-by: Pavel Begunkov
    Signed-off-by: Hou Tao
    Signed-off-by: Jens Axboe

    Hou Tao
     

06 Sep, 2019

3 commits

  • When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
    the only information known about the device is the number of hardware
    queues, as the block device scan by the device driver is not completed
    yet for most drivers. The device type and elevator required features
    are not yet set, preventing the correct selection of the default
    elevator most suitable for the device.

    This currently affects all multi-queue zoned block devices, which
    default to the "none" elevator instead of the required "mq-deadline"
    elevator. These devices currently include host-managed SMR disks
    connected to a smartpqi HBA and null_blk block devices with zoned
    mode enabled. Upcoming NVMe Zoned Namespace devices will also be
    affected.

    Fix this by adding the boolean elevator_init argument to
    blk_mq_init_allocated_queue() to control the execution of
    elevator_init_mq(). Two cases exist:
    1) elevator_init = false is used for calls to
       blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
       case, a call to elevator_init_mq() is added to __device_add_disk(),
       resulting in the delayed initialization of the queue elevator
       after the device driver has finished probing the device
       information. This effectively allows elevator_init_mq() access to
       more information about the device.
    2) elevator_init = true preserves the current behavior of initializing
       the elevator directly from blk_mq_init_allocated_queue(). This case
       is used for the special request-based DM devices where the device
       gendisk is created before the queue initialization and the device
       information (e.g. queue limits) is already known when the queue
       initialization is executed.

    Additionally, to make sure that the elevator initialization is never
    done while requests are in-flight (there should be none when the device
    driver calls device_add_disk()), freeze and quiesce the device request
    queue before calling blk_mq_init_sched() in elevator_init_mq().
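
    In sketch form, the serialization added in elevator_init_mq() looks
    roughly like this (illustrative ordering, not the literal diff; 'e' is
    the elevator type selected for the queue):

    blk_mq_freeze_queue(q);
    blk_mq_quiesce_queue(q);

    err = blk_mq_init_sched(q, e);   /* attach the selected elevator */

    blk_mq_unquiesce_queue(q);
    blk_mq_unfreeze_queue(q);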

    Reviewed-by: Ming Lei
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • If the default elevator chosen is mq-deadline, elevator_init_mq() may
    return an error if mq-deadline initialization fails, leading to
    blk_mq_init_allocated_queue() returning an error, which in turn will
    cause the block device initialization to fail and the device not being
    exposed.

    Instead of taking such an extreme measure, handle mq-deadline
    initialization failures in the same manner as when mq-deadline is not
    available (no module to load), that is, default to the "none"
    scheduler. With this change, the return type of elevator_init_mq()
    can be changed to void.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before
    calling elevator_init_mq() to make sure that the queue supports IO
    scheduling, use the elevator.c function elv_support_iosched() in
    elevator_init_mq(). This does not introduce any functional change but
    ensures that elevator_init_mq() does the right thing based on the
    queue settings.

    Reviewed-by: Ming Lei
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

29 Aug, 2019

1 commit

  • There are currently two start time timestamps - start_time_ns and
    io_start_time_ns. The former marks the request allocation and the
    latter the issue-to-device time. The planned io.weight controller
    needs to measure the total time bios take to execute after they leave
    rq_qos, including the time spent waiting for a request to become
    available, which can easily dominate on saturated devices.

    This patch adds request->alloc_time_ns which records when the request
    allocation attempt started. As it isn't used for the usual stats,
    make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and
    QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are
    no users and it's active only on queues which need it even when
    compiled in.
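
    Schematically, the new field sits alongside the existing timestamps
    and is compiled out when unused (placement and comments are
    illustrative):

    struct request {
            /* ... */
    #ifdef CONFIG_BLK_RQ_ALLOC_TIME
            u64 alloc_time_ns;      /* when the allocation attempt started */
    #endif
            u64 start_time_ns;      /* request allocated */
            u64 io_start_time_ns;   /* request issued to the device */
            /* ... */
    };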

    v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME
    gating as suggested by Jens.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Aug, 2019

1 commit

  • blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue()
    and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
    isn't exposed to userspace yet. For the latter caller, hctx sysfs entries
    and debugfs are un-registered before updating nr_hw_queues.

    On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
    request queue is freed") moved freeing hctx into the queue's release
    handler, so there is no race with the queue release path either.

    So don't hold q->sysfs_lock in blk_mq_map_swqueue().

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Cc: Bart Van Assche
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

16 Aug, 2019

1 commit

  • We had a few issues with this code, and there's still a problem around
    how we deal with error handling for chained/split bios. For now, just
    revert the code and we'll try again with a thorough solution. This
    reverts commits:

    e15c2ffa1091 ("block: fix O_DIRECT error handling for bio fragments")
    0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
    6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
    893a1c97205a ("blk-mq: allow REQ_NOWAIT to return an error inline")

    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Aug, 2019

2 commits

  • If blk_mq_init_allocated_queue->elevator_init_mq fails, we need to
    release the previously requested resources.

    Fixes: d34849913819 ("blk-mq-sched: allow setting of default IO scheduler")
    Signed-off-by: zhengbin
    Signed-off-by: Jens Axboe

    zhengbin
     
  • blk_exit_queue will free elevator_data, while blk_mq_requeue_work
    will access it. Move the cancel of requeue_work to the front of
    blk_exit_queue to avoid a use-after-free:

    blk_exit_queue                blk_mq_requeue_work
      __elevator_exit               blk_mq_run_hw_queues
        blk_mq_exit_sched             blk_mq_run_hw_queue
          dd_exit_queue                 blk_mq_hctx_has_pending
            kfree(elevator_data)          blk_mq_sched_has_work
                                            dd_has_work
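
    In code terms, the ordering this enforces is roughly (illustrative,
    not the literal diff):

    cancel_delayed_work_sync(&q->requeue_work);  /* stop requeue work first */
    blk_exit_queue(q);   /* now __elevator_exit() may free elevator_data safely */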

    Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release")
    Cc: stable@vger.kernel.org
    Reviewed-by: Ming Lei
    Signed-off-by: zhengbin
    Signed-off-by: Jens Axboe

    zhengbin
     

05 Aug, 2019

2 commits

  • blk_mq_tagset_wait_completed_request() is now used to wait for the
    completion fn of cancelled requests, so it is no longer necessary to
    use blk_mq_complete_request_sync().
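
    Roughly, the teardown pattern becomes (illustrative; 'cancel_request'
    stands in for the driver's abort callback):

    blk_mq_tagset_busy_iter(tagset, cancel_request, NULL);
    blk_mq_tagset_wait_completed_request(tagset);  /* all ->complete() done */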

    Cc: Max Gurtovoy
    Cc: Sagi Grimberg
    Cc: Keith Busch
    Cc: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • NVMe needs this function to decide whether a request to be aborted has
    already been completed in the normal IO path.

    So introduce it.
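
    A sketch of the intended use (illustrative; 'abort_request' is a
    stand-in for the driver's abort helper):

    if (blk_mq_request_completed(rq))
            return;            /* normal completion won the race, nothing to abort */
    abort_request(rq);         /* illustrative driver-side abort */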

    Cc: Max Gurtovoy
    Cc: Sagi Grimberg
    Cc: Keith Busch
    Cc: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Aug, 2019

2 commits

  • hrtimer_sleepers will gain a scheduling class dependent treatment on
    PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make
    that possible.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • hrtimer_init_sleeper() calls require prior initialisation of the hrtimer
    object which is embedded into the hrtimer_sleeper.

    Combine the initialization and spare a function call. Fixup all call sites.

    This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper
    specific initializations of the embedded hrtimer without modifying any of
    the call sites.

    No functional change.

    [ anna-maria: Minor cleanups ]
    [ tglx: Adopted to the removal of the task argument of
    hrtimer_init_sleeper() and trivial polishing.
    Folded a fix from Stephen Rothwell for the vsoc code ]

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de

    Sebastian Andrzej Siewior
     


22 Jul, 2019

1 commit

  • By default, if a caller sets REQ_NOWAIT and we need to block, we'll
    return -EAGAIN through the bio->bi_end_io() callback. For some use
    cases, this makes it hard to use.

    Allow a caller to ask for inline return of errors related to
    blocking by also setting REQ_NOWAIT_INLINE.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jul, 2019

1 commit

  • Simultaneously writing to a sequential zone of a zoned block device
    from multiple contexts requires mutual exclusion for BIO issuing to
    ensure that writes happen sequentially. However, even for a well
    behaved user correctly implementing such synchronization, BIO plugging
    may interfere and result in BIOs from the different contexts being
    reordered if plugging is done outside of the mutual exclusion section,
    e.g. the plug was started by a function higher in the call chain than
    the function issuing BIOs.

    Context A                       Context B

    | blk_start_plug()
    | ...
    | seq_write_zone()
    |   mutex_lock(zone)
    |   bio-0->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-0)
    |   submit_bio(bio-0)
    |   bio-1->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-1)
    |   submit_bio(bio-1)
    |   mutex_unlock(zone)
    |   return
    | ----------------------------> | seq_write_zone()
                                     |   mutex_lock(zone)
                                     |   bio-2->bi_iter.bi_sector = zone->wp
                                     |   zone->wp += bio_sectors(bio-2)
                                     |   submit_bio(bio-2)
                                     |   mutex_unlock(zone)
                                     |
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

03 Jul, 2019

2 commits

  • Move the blk_mq_bio_to_request() call in front of the if-statement.

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Minwoo Im
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Jun, 2019

2 commits

  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove the
    pointless struct request_queue arguments from any of the functions
    that had it and grew a nr_segs argument.
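
    Sketch of the resulting interface shape (names follow the description;
    the exact signatures in the tree may differ):

    void __blk_queue_split(struct request_queue *q, struct bio **bio,
                           unsigned int *nr_segs);

    /* callers then pass nr_segs down instead of reading it from the bio */
    __blk_queue_split(q, &bio, &nr_segs);
    blk_mq_sched_bio_merge(q, bio, nr_segs);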

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • lightnvm should have never used this function, as it is sending
    passthrough requests, so switch it to blk_rq_append_bio like all the
    other passthrough request users. Inline blk_init_request_from_bio into
    the only remaining caller.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Minwoo Im
    Reviewed-by: Javier González
    Reviewed-by: Matias Bjørling
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     


24 May, 2019

1 commit

  • The following is a description of a hang in blk_mq_freeze_queue_wait().
    The hang happens on attempt to freeze a queue while another task does
    queue unfreeze.

    The root cause is an incorrect sequence of percpu_ref_resurrect() and
    percpu_ref_kill(); as a result those two can be swapped:

    CPU#0                           CPU#1
    ----------------                -----------------
    q1 = blk_mq_init_queue(shared_tags)

                                    q2 = blk_mq_init_queue(shared_tags):
                                      blk_mq_add_queue_tag_set(shared_tags):
                                        blk_mq_update_tag_set_depth(shared_tags):
                                          list_for_each_entry()
                                            blk_mq_freeze_queue(q1)
                                             > percpu_ref_kill()
                                             > blk_mq_freeze_queue_wait()

    blk_cleanup_queue(q1)
      blk_mq_freeze_queue(q1)
       > percpu_ref_kill()
         ^^^^^^ freeze_depth can't guarantee the order

                                            blk_mq_unfreeze_queue()
                                             > percpu_ref_resurrect()

       > blk_mq_freeze_queue_wait()
         ^^^^^^ Hang here!!!!

    This wrong sequence raises a kernel warning:
    percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
    WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

    But the most unpleasant effect is a hang in blk_mq_freeze_queue_wait(),
    which waits for q_usage_counter to reach zero; that never happens
    because the percpu-ref was reinitialized (instead of being killed) and
    stays in PERCPU state forever.

    How to reproduce:
    - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
    - cpu0: python Script.py 0; taskset the corresponding process running on cpu0
    - cpu1: python Script.py 1; taskset the corresponding process running on cpu1

    Script.py:
    ------
    #!/usr/bin/python3

    import os
    import sys

    while True:
        on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        os.system(on)
        os.system(off)
    ------

    This bug was first reported and fixed by Roman, previous discussion:
    [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
    [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
    [3] https://patchwork.kernel.org/patch/9268199/

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Roman Pen
    Signed-off-by: Bob Liu
    Signed-off-by: Jens Axboe

    Bob Liu
     

04 May, 2019

4 commits

  • In the normal queue cleanup path, hctx is released after the request
    queue is freed, see blk_mq_release().

    However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
    of hw queue shrinking. This can easily cause a use-after-free, because
    one implicit rule is that it is safe to call almost all block layer
    APIs if the request queue is alive; so one hctx may be retrieved by
    one API and then freed by blk_mq_update_nr_hw_queues(), and finally
    the use-after-free is triggered.

    Fix this issue by always freeing hctx after releasing the request
    queue. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
    introduce a per-queue list to hold them, then try to reuse these
    hctxs if the NUMA node matches.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Split blk_mq_alloc_and_init_hctx() into two parts: blk_mq_alloc_hctx()
    allocates all hctx resources, and blk_mq_init_hctx() initializes the
    hctx and serves as the counterpart of blk_mq_exit_hctx().

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Once blk_cleanup_queue() returns, tags shouldn't be used any more,
    because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
    ("blk-mq: Fix a use-after-free") fixes this issue exactly.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    we were allowed to run the queue during queue cleanup as long as the
    queue's kobj refcount was held. After that commit, the queue can't be
    run during queue cleanup, otherwise an oops can be triggered easily
    because some fields of hctx are freed by blk_mq_free_queue() in
    blk_cleanup_queue().

    We have invented ways to address this kind of issue before, such as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But these still can't cover all cases; recently James reported another
    issue of this kind:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue is quite hard to address with the previous approach, given
    that scsi_run_queue() may run requeues for other LUNs.

    Fix the above issue by freeing hctx's resources in its release handler;
    this is safe because tags aren't needed for freeing such hctx
    resources.

    This approach follows the typical design pattern for a kobject's
    release handler.
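
    A sketch of that pattern (illustrative; the helper that frees the hctx
    resources is a stand-in):

    static void blk_mq_hw_sysfs_release(struct kobject *kobj)
    {
            struct blk_mq_hw_ctx *hctx = container_of(kobj,
                                            struct blk_mq_hw_ctx, kobj);

            free_hctx_resources(hctx);   /* illustrative: cpumask, ctxs, srcu */
            kfree(hctx);
    }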

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • While holding the queue's kobject refcount, it is safe for a driver
    to schedule a requeue. However, blk_mq_kick_requeue_list() may
    be called after blk_sync_queue() is done because of concurrent
    requeue activity, so the requeue work may not be completed when
    the queue is freed, and a kernel oops is triggered.

    So move the cancel of requeue_work into blk_mq_release() to
    avoid the race between requeue and freeing the queue.
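
    Roughly (illustrative, not the literal diff):

    void blk_mq_release(struct request_queue *q)
    {
            cancel_delayed_work_sync(&q->requeue_work);
            /* ... free hctxs and the remaining blk-mq resources ... */
    }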

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     


14 Apr, 2019

1 commit

  • A previous commit moved the shallow depth and BFQ depth map calculations
    to init time, moving them outside of the hotter IO path. This
    potentially causes hangs if the user changes the depth of the scheduler
    map by writing to the 'nr_requests' sysfs file for that device.

    Add a blk-mq-sched hook that allows blk-mq to inform the scheduler when
    the depth changes, so that the scheduler can update its internal state.
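
    The hook is roughly of this shape (illustrative; the callback name
    follows the description above):

    struct elevator_mq_ops {
            /* ... existing scheduler hooks ... */
            void (*depth_updated)(struct blk_mq_hw_ctx *hctx);
    };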

    Tested-by: Kai Krakow
    Reported-by: Paolo Valente
    Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2019

1 commit

  • NVMe's error handler follows the typical steps of tearing down the
    hardware to recover the controller:

    1) stop blk_mq hw queues
    2) stop the real hw queues
    3) cancel in-flight requests via
       blk_mq_tagset_busy_iter(tags, cancel_request, ...)
       cancel_request():
          mark the request as aborted
          blk_mq_complete_request(req);
    4) destroy real hw queues

    However, there may be a race between #3 and #4, because
    blk_mq_complete_request() may run q->mq_ops->complete(rq) remotely and
    asynchronously, and ->complete(rq) may run after #4.

    This patch introduces blk_mq_complete_request_sync() to fix the
    above race.
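
    A sketch of how the synchronous variant is used in step 3
    (illustrative; the abort marking is a stand-in for the driver's own
    handling):

    static bool cancel_request(struct request *rq, void *data, bool reserved)
    {
            mark_request_aborted(rq);           /* illustrative driver helper */
            blk_mq_complete_request_sync(rq);   /* ->complete(rq) runs before return */
            return true;
    }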

    Cc: Sagi Grimberg
    Cc: Bart Van Assche
    Cc: James Smart
    Cc: linux-nvme@lists.infradead.org
    Reviewed-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Apr, 2019

1 commit

  • blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
    have been queued. If that happens when blk_mq_try_issue_directly() is called
    by the dm-mpath driver then dm-mpath will try to resubmit a request that is
    already queued and a kernel crash follows. Since it is nontrivial to fix
    blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
    changes that went into kernel v5.0.

    This patch reverts the following commits:
    * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
    * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
    * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: James Smart
    Cc: Dongli Zhang
    Cc: Laurence Oberman
    Reported-by: Laurence Oberman
    Tested-by: Laurence Oberman
    Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

04 Apr, 2019

1 commit