07 Feb, 2021

1 commit

  • Pull block fixes from Jens Axboe:
    "A few small regression fixes:

    - NVMe pull request from Christoph:
    - more quirks for buggy devices (Thorsten Leemhuis, Claus Stovgaard)
    - update the email address for Keith (Keith Busch)
    - fix an out of bounds access in nvmet-tcp (Sagi Grimberg)

    - Regression fix for BFQ shallow depth calculations introduced in
    this merge window (Lin)"

    * tag 'block-5.11-2021-02-05' of git://git.kernel.dk/linux-block:
    nvmet-tcp: fix out-of-bounds access when receiving multiple h2cdata PDUs
    bfq-iosched: Revert "bfq: Fix computation of shallow depth"
    update the email address for Keith Busch
    nvme-pci: ignore the subsystem NQN on Phison E16
    nvme-pci: avoid the deepest sleep state on Kingston A2000 SSDs

    Linus Torvalds
     

03 Feb, 2021

1 commit

  • This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.

    bfq.limit_depth passes word_depths[] as shallow_depth down to the sbitmap
    core's sbitmap_get_shallow(), which uses just that number to limit the
    scan depth of each bitmap word, per the formula:

    scan_percentage_for_each_word = shallow_depth / (1 << sbitmap->shift) * 100%

    That means the percentages in bfq's comments (50%, 75%, 18%, 37%) are
    correct. But after the patch 'bfq: Fix computation of shallow depth', we
    use sbitmap.depth instead. As an example, take the following case:

    sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
    The computed bfqd->word_depths[] are then {128, 192, 48, 96}, and three of
    those numbers exceed the core driver's 'sbitmap_word.depth = 64' limit;
    nothing good will happen.
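
    For illustration only, here is a small userspace sketch of the arithmetic
    above (the 1/2, 3/4, 3/16 and 6/16 fractions are assumed from bfq's
    percentages; this is not the kernel code):

    #include <stdio.h>

    int main(void)
    {
        unsigned int shift = 6;                /* bits per bitmap word: 1 << 6 = 64 */
        unsigned int word_depth = 1U << shift; /* per-word scan limit */
        unsigned int depth = 256;              /* sbitmap.depth, map_nr = 4 words */

        /* Correct basis (per word): 50%, 75%, 18%, 37% of 64. */
        unsigned int ok[4]  = { word_depth / 2, word_depth * 3 / 4,
                                word_depth * 3 / 16, word_depth * 6 / 16 };
        /* Reverted patch's basis (whole bitmap): yields {128, 192, 48, 96},
         * three of which exceed the per-word limit of 64. */
        unsigned int bad[4] = { depth / 2, depth * 3 / 4,
                                depth * 3 / 16, depth * 6 / 16 };

        for (int i = 0; i < 4; i++)
            printf("ok=%u bad=%u (limit %u)\n", ok[i], bad[i], word_depth);
        return 0;
    }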

    Signed-off-by: Lin Feng
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Lin Feng
     

30 Jan, 2021

1 commit

  • Pull block fixes from Jens Axboe:
    "All over the place fixes for this release:

    - blk-cgroup iteration teardown resched fix (Baolin)

    - NVMe pull request from Christoph:
    - add another Write Zeroes quirk (Chaitanya Kulkarni)
    - handle a no path available corner case (Daniel Wagner)
    - use the proper RCU aware list_add helper (Chao Leng)

    - bcache regression fix (Coly)

    - bdev->bd_size_lock IRQ fix. This will be fixed in drivers for 5.12,
    but for now, we'll make it IRQ safe (Damien)

    - null_blk zoned init fix (Damien)

    - add_partition() error handling fix (Dinghao)

    - s390 dasd kobject fix (Jan)

    - nbd fix for freezing queue while adding connections (Josef)

    - tag queueing regression fix (Ming)

    - revert of a patch that inadvertently meant that we regressed write
    performance on raid (Maxim)"

    * tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block:
    null_blk: cleanup zoned mode initialization
    nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head
    nvme-multipath: Early exit if no path is available
    nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device
    bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES
    block: fix bd_size_lock use
    blk-cgroup: Use cond_resched() when destroy blkgs
    Revert "block: simplify set_init_blocksize" to regain lost performance
    nbd: freeze the queue while we're adding connections
    s390/dasd: Fix inconsistent kobject removal
    block: Fix an error handling in add_partition
    blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue

    Linus Torvalds
     

28 Jan, 2021

2 commits

  • Some block device drivers, e.g. the skd driver, call set_capacity() with
    IRQs disabled. This results in lockdep complaining about inconsistent
    lock states ("inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage")
    because set_capacity() takes the block device's bd_size_lock using
    plain spin_lock() and spin_unlock(). Ensure a consistent locking
    state by replacing these calls with spin_lock_irqsave() and
    spin_unlock_irqrestore(). The same applies to bdev_set_nr_sectors().
    With this fix, all lockdep complaints are resolved.
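
    A sketch of the resulting pattern (simplified; field names as in the
    5.11-era block layer, not the exact patch):

    void set_capacity(struct gendisk *disk, sector_t sectors)
    {
        struct block_device *bdev = disk->part0;
        unsigned long flags;

        /* IRQ-safe: callers such as skd may have IRQs disabled. */
        spin_lock_irqsave(&bdev->bd_size_lock, flags);
        i_size_write(bdev->bd_inode, (loff_t)sectors << SECTOR_SHIFT);
        spin_unlock_irqrestore(&bdev->bd_size_lock, flags);
    }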

    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • On a !PREEMPT kernel, we can get the softlockup below when doing stress
    testing that creates and destroys block cgroups repeatedly. The
    reason is that it may take a long time to acquire the queue's lock in
    the loop of blkcg_destroy_blkgs(), or the system can accumulate a
    huge number of blkgs in pathological cases. We can add a need_resched()
    check on each loop iteration and, if true, release the locks and call
    cond_resched() to avoid this issue, since blkcg_destroy_blkgs() is not
    called from atomic contexts.

    [ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
    [ 4757.010698] Call trace:
    [ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
    [ 4757.010701]  cgwb_release_workfn+0x104/0x158
    [ 4757.010702]  process_one_work+0x1bc/0x3f0
    [ 4757.010704]  worker_thread+0x164/0x468
    [ 4757.010705]  kthread+0x108/0x138
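
    A sketch of the loop described above (simplified from the description;
    the trylock detail is an assumption about how the queue lock is taken):

    void blkcg_destroy_blkgs(struct blkcg *blkcg)
    {
        might_sleep();

        spin_lock_irq(&blkcg->lock);
        while (!hlist_empty(&blkcg->blkg_list)) {
            struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
                                                struct blkcg_gq, blkcg_node);
            struct request_queue *q = blkg->q;

            if (need_resched() || !spin_trylock(&q->queue_lock)) {
                /* Drop the locks, give the CPU away, then retry. */
                spin_unlock_irq(&blkcg->lock);
                cond_resched();
                spin_lock_irq(&blkcg->lock);
                continue;
            }

            blkg_destroy(blkg);
            spin_unlock(&q->queue_lock);
        }
        spin_unlock_irq(&blkcg->lock);
    }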

    Suggested-by: Tejun Heo
    Signed-off-by: Baolin Wang
    Signed-off-by: Jens Axboe

    Baolin Wang
     

11 Jan, 2021

1 commit

  • Pull block fixes from Jens Axboe:

    - Missing CRC32 selections (Arnd)

    - Fix for a merge window regression with bdev inode init (Christoph)

    - bcache fixes

    - rnbd fixes

    - NVMe pull request from Christoph:
    - fix a race in the nvme-tcp send code (Sagi Grimberg)
    - fix a list corruption in an nvme-rdma error path (Israel Rukshin)
    - avoid a possible double fetch in nvme-pci (Lalithambika Krishnakumar)
    - add the subsystem NQN quirk for a Samsung drive (Gopal Tiwari)
    - fix two compiler warnings in nvme-fcloop (James Smart)
    - don't call sleeping functions from irq context in nvme-fc (James Smart)
    - remove an unused argument (Max Gurtovoy)
    - remove unused exports (Minwoo Im)

    - Use-after-free fix for partition iteration (Ming)

    - Missing blk-mq debugfs flag annotation (John)

    - Bdev freeze regression fix (Satya)

    - blk-iocost NULL pointer deref fix (Tejun)

    * tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-block: (26 commits)
    bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET
    bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket
    bcache: check unsupported feature sets for bcache register
    bcache: fix typo from SUUP to SUPP in features.h
    bcache: set pdev_set_uuid before second loop iteration
    blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
    block/rnbd-clt: avoid module unload race with close confirmation
    block/rnbd: Adding name to the Contributors List
    block/rnbd-clt: Fix sg table use after free
    block/rnbd-srv: Fix use after free in rnbd_srv_sess_dev_force_close
    block/rnbd: Select SG_POOL for RNBD_CLIENT
    block: pre-initialize struct block_device in bdev_alloc_inode
    fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb
    nvme: remove the unused status argument from nvme_trace_bio_complete
    nvmet-rdma: Fix list_del corruption on queue establishment failure
    nvme: unexport functions with no external caller
    nvme: avoid possible double fetch in handling CQE
    nvme-tcp: Fix possible race of io_work and direct send
    nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
    nvme-fcloop: Fix sscanf type and list_first_entry_or_null warnings
    ...

    Linus Torvalds
     

08 Jan, 2021

1 commit

  • Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives
    something like:

    root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
    alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3

    Add the decoding for that flag.
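
    A sketch of the addition, assuming the usual HCTX_FLAG_NAME table in
    blk-mq-debugfs.c (neighboring entries abridged):

    static const char *const hctx_flag_name[] = {
        HCTX_FLAG_NAME(SHOULD_MERGE),
        HCTX_FLAG_NAME(TAG_QUEUE_SHARED),
        ...
        HCTX_FLAG_NAME(TAG_HCTX_SHARED),  /* the missing decode */
    };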

    Fixes: 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per tagset")
    Signed-off-by: John Garry
    Signed-off-by: Jens Axboe

    John Garry
     

06 Jan, 2021

3 commits

  • Make sure that bdgrab() is done on the 'block_device' instance before
    referring to it, to avoid a use-after-free.
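
    A minimal sketch of the rule being enforced (hypothetical call site, not
    the actual patch; use_block_device() is a stand-in for whatever touches
    the instance):

    /* Take the reference before any dereference of the block_device ... */
    bdgrab(part);
    /* ... so the instance cannot be freed while it is in use. */
    use_block_device(part);
    bdput(part);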

    Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • BFQ computes the number of tags it allows to be allocated for each request
    type based on the tag bitmap. However, it uses 1 << bitmap.shift as the
    number of available tags, which is wrong: 'shift' is just an internal
    bitmap value containing the logarithm of how many bits the bitmap uses in
    each bitmap word. Thus the number of tags allowed for some request types
    could be far too low. Use the proper bitmap.depth, which holds the number
    of tags, instead.
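
    A diff-style sketch of the change as described (this is the patch that
    the 07 Feb entry above reverts):

    -	depth = 1U << bt->sb.shift;	/* bits per bitmap word, not tag count */
    +	depth = bt->sb.depth;		/* the actual number of tags */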

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • When initializing iocost for a queue, its rqos should be registered
    before the blkcg policy is activated to allow policy data initialization
    to look up the associated ioc. This unfortunately means that the rqos
    methods can be called on bios before iocgs are attached to all existing
    blkgs.

    While the race is theoretically possible in ioc_rqos_throttle(), it
    mostly happened in ioc_rqos_merge() due to the difference in how the two
    look up the ioc. The former determines it from the passed-in @rqos and
    then bails before dereferencing the iocg if the looked-up ioc is
    disabled, which most likely is the case while initialization is still in
    progress. The latter looked up the ioc by dereferencing the possibly
    NULL iocg, making it a lot more prone to actually triggering the bug;
    a sketch follows the list below.

    * Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look
    up ioc for consistency.

    * Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before
    dereferencing it.

    * Explain the danger of NULL iocgs in blk_iocost_init().
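
    A sketch of the guard from the first two points (names as in blk-iocost;
    simplified):

    static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
                               struct bio *bio)
    {
        struct ioc *ioc = rqos_to_ioc(rqos);   /* same lookup as throttle */
        struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg);

        /* During init, iocgs may not yet be attached to all blkgs. */
        if (!ioc->enabled || !iocg)
            return;
        /* ... */
    }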

    Signed-off-by: Tejun Heo
    Reported-by: Jonathan Lemon
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Jan, 2021

1 commit

  • Pull SCSI fixes from James Bottomley:
    "This is a load of driver fixes (12 ufs, 1 mpt3sas, 1 cxgbi).

    The big core two fixes are for power management ("block: Do not accept
    any requests while suspended" and "block: Fix a race in the runtime
    power management code") which finally sorts out the resume problems
    we've occasionally been having.

    To make the resume fix, there are seven necessary precursors which
    effectively renames REQ_PREEMPT to REQ_PM, so every "special" request
    in block is automatically a power management exempt one.

    All of the non-PM preempt cases are removed except for the one in the
    SCSI Parallel Interface (spi) domain validation which is a genuine
    case where we have to run requests at high priority to validate the
    bus so this becomes an autopm get/put protected request"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (22 commits)
    scsi: cxgb4i: Fix TLS dependency
    scsi: ufs: Un-inline ufshcd_vops_device_reset function
    scsi: ufs: Re-enable WriteBooster after device reset
    scsi: ufs-mediatek: Use correct path to fix compile error
    scsi: mpt3sas: Signedness bug in _base_get_diag_triggers()
    scsi: block: Do not accept any requests while suspended
    scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
    scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE
    scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
    scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT
    scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
    scsi: block: Introduce BLK_MQ_REQ_PM
    scsi: block: Fix a race in the runtime power management code
    scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers
    scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers
    scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
    scsi: ufs-pci: Fix restore from S4 for Intel controllers
    scsi: ufs-mediatek: Keep VCC always-on for specific devices
    scsi: ufs: Allow regulators being always-on
    scsi: ufs: Clear UAC for RPMB after ufshcd resets
    ...

    Linus Torvalds
     

30 Dec, 2020

1 commit

  • This was missed in commit 021a24460dc2. It leads to the numeric value of
    QUEUE_FLAG_NOWAIT (i.e. 29) showing up in
    /sys/kernel/debug/block/*/state.
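
    A sketch of the fix, assuming the QUEUE_FLAG_NAME table in
    blk-mq-debugfs.c (neighboring entries abridged):

    static const char *const blk_queue_flag_name[] = {
        ...
        QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
        QUEUE_FLAG_NAME(HCTX_ACTIVE),
        QUEUE_FLAG_NAME(NOWAIT),  /* previously missing; bit 29 printed raw */
    };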

    Fixes: 021a24460dc28e7412aecfae89f60e1847e685c0
    Cc: Konstantin Khlebnikov
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Andres Freund
    Signed-off-by: Jens Axboe

    Andres Freund
     

18 Dec, 2020

1 commit

  • With force threaded interrupts enabled, raising softirq from an SMP
    function call will always result in waking the ksoftirqd thread. This is
    not optimal given that the thread runs at SCHED_OTHER priority.

    Completing the request in hard IRQ-context on PREEMPT_RT (which enforces
    the force threaded mode) is bad because the completion handler may
    acquire sleeping locks which violate the locking context.

    Disable request completing on a remote CPU in force threaded mode.
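
    A sketch of the check (its placement in blk-mq's IPI-decision helper is
    an assumption based on the description; cache-domain refinements elided):

    static inline bool blk_mq_complete_need_ipi(struct request *rq)
    {
        if (!IS_ENABLED(CONFIG_SMP) ||
            !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
            return false;

        /*
         * With force threaded interrupts enabled, raising softirq from an
         * SMP function call always wakes ksoftirqd; complete locally instead.
         */
        if (force_irqthreads)
            return false;

        /* Otherwise only complete remotely if the submitting CPU differs. */
        return rq->mq_ctx->cpu != raw_smp_processor_id();
    }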

    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Daniel Wagner
    Signed-off-by: Jens Axboe

    Sebastian Andrzej Siewior
     

17 Dec, 2020

5 commits

  • It will be helpful to trace the iocg's whole state, including active and
    idle. We can easily expand the original iocost_iocg_activate trace
    event into a state trace class that covers both active and idle state
    tracing.

    Acked-by: Tejun Heo
    Signed-off-by: Baolin Wang
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • It's guaranteed that no request is in flight when an hctx is going
    offline. This warning is only triggered when the wq's CPU is hot
    plugged and blk-mq has not synced up yet.

    As this state is temporary and the request is still processed
    correctly, it is better to remove the warning, as this is the fast path.

    Suggested-by: Ming Lei
    Signed-off-by: Daniel Wagner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Daniel Wagner
     
  • Pull SCSI updates from James Bottomley:
    "This consists of the usual driver updates (ufs, qla2xxx, smartpqi,
    target, zfcp, fnic, mpt3sas, ibmvfc) plus a load of cleanups, a major
    power management rework and a load of assorted minor updates.

    There are a few core updates (formatting fixes being the big one) but
    nothing major this cycle"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (279 commits)
    scsi: mpt3sas: Update driver version to 36.100.00.00
    scsi: mpt3sas: Handle trigger page after firmware update
    scsi: mpt3sas: Add persistent MPI trigger page
    scsi: mpt3sas: Add persistent SCSI sense trigger page
    scsi: mpt3sas: Add persistent Event trigger page
    scsi: mpt3sas: Add persistent Master trigger page
    scsi: mpt3sas: Add persistent trigger pages support
    scsi: mpt3sas: Sync time periodically between driver and firmware
    scsi: qla2xxx: Update version to 10.02.00.104-k
    scsi: qla2xxx: Fix device loss on 4G and older HBAs
    scsi: qla2xxx: If fcport is undergoing deletion complete I/O with retry
    scsi: qla2xxx: Fix the call trace for flush workqueue
    scsi: qla2xxx: Fix flash update in 28XX adapters on big endian machines
    scsi: qla2xxx: Handle aborts correctly for port undergoing deletion
    scsi: qla2xxx: Fix N2N and NVMe connect retry failure
    scsi: qla2xxx: Fix FW initialization error on big endian machines
    scsi: qla2xxx: Fix crash during driver load on big endian machines
    scsi: qla2xxx: Fix compilation issue in PPC systems
    scsi: qla2xxx: Don't check for fw_started while posting NVMe command
    scsi: qla2xxx: Tear down session if FW say it is down
    ...

    Linus Torvalds
     
  • Pull block driver updates from Jens Axboe:
    "Nothing major in here:

    - NVMe pull request from Christoph:
    - nvmet passthrough improvements (Chaitanya Kulkarni)
    - fcloop error injection support (James Smart)
    - read-only support for zoned namespaces without Zone Append
    (Javier González)
    - improve some error messages (Minwoo Im)
    - reject I/O to offline fabrics namespaces (Victor Gladkov)
    - PCI queue allocation cleanups (Niklas Schnelle)
    - remove an unused allocation in nvmet (Amit Engel)
    - a Kconfig spelling fix (Colin Ian King)
    - nvme_req_qid simplification (Baolin Wang)

    - MD pull request from Song:
    - Fix race condition in md_ioctl() (Dae R. Jeong)
    - Initialize read_slot properly for raid10 (Kevin Vigor)
    - Code cleanup (Pankaj Gupta)
    - md-cluster resync/reshape fix (Zhao Heming)

    - Move null_blk into its own directory (Damien Le Moal)

    - null_blk zone and discard improvements (Damien Le Moal)

    - bcache race fix (Dongsheng Yang)

    - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
    Lutz Pogrell, Md Haris Iqbal)

    - lightnvm NULL pointer deref fix (tangzhenhao)

    - sr in_interrupt() removal (Sebastian Andrzej Siewior)

    - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
    Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
    as it made it easier for them to funnel the feature through the
    block driver tree.

    - Follow up fixes (Colin Ian King)"

    * tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
    block: drop dead assignments in loop_init()
    sr: Remove in_interrupt() usage in sr_init_command().
    sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
    cdrom: Reset sector_size back if it is not 2048.
    drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
    null_blk: Move driver into its own directory
    null_blk: Allow controlling max_hw_sectors limit
    null_blk: discard zones on reset
    null_blk: cleanup discard handling
    null_blk: Improve implicit zone close
    null_blk: improve zone locking
    block: Align max_hw_sectors to logical blocksize
    null_blk: Fail zone append to conventional zones
    null_blk: Fix zone size initialization
    bcache: fix race between setting bdev state to none and new write request direct to backing
    block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
    block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
    block/rnbd: call kobject_put in the failure path
    Documentation/ABI/rnbd-srv: add document for force_close
    block/rnbd-srv: close a mapped device from server side.
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "Another series of killing more code than what is being added, again
    thanks to Christoph's relentless cleanups and tech debt tackling.

    This contains:

    - blk-iocost improvements (Baolin Wang)

    - part0 iostat fix (Jeffle Xu)

    - Disable iopoll for split bios (Jeffle Xu)

    - block tracepoint cleanups (Christoph Hellwig)

    - Merging of struct block_device and hd_struct (Christoph Hellwig)

    - Rework/cleanup of how block device sizes are updated (Christoph
    Hellwig)

    - Simplification of gendisk lookup and removal of block device
    aliasing (Christoph Hellwig)

    - Block device ioctl cleanups (Christoph Hellwig)

    - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)

    - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)

    - sbitmap improvements (Pavel Begunkov)

    - Hybrid polling fix (Pavel Begunkov)

    - bvec iteration improvements (Pavel Begunkov)

    - Zone revalidation fixes (Damien Le Moal)

    - blk-throttle limit fix (Yu Kuai)

    - Various little fixes"

    * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
    blk-mq: fix msec comment from micro to milli seconds
    blk-mq: update arg in comment of blk_mq_map_queue
    blk-mq: add helper allocating tagset->tags
    Revert "block: Fix a lockdep complaint triggered by request queue flushing"
    nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
    blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
    block: disable iopoll for split bio
    block: Improve blk_revalidate_disk_zones() checks
    sbitmap: simplify wrap check
    sbitmap: replace CAS with atomic and
    sbitmap: remove swap_lock
    sbitmap: optimise sbitmap_deferred_clear()
    blk-mq: skip hybrid polling if iopoll doesn't spin
    blk-iocost: Factor out the base vrate change into a separate function
    blk-iocost: Factor out the active iocgs' state check into a separate function
    blk-iocost: Move the usage ratio calculation to the correct place
    blk-iocost: Remove unnecessary advance declaration
    blk-iocost: Fix some typos in comments
    blktrace: fix up a kerneldoc comment
    block: remove the request_queue to argument request based tracepoints
    ...

    Linus Torvalds
     

15 Dec, 2020

1 commit

  • Pull scheduler updates from Thomas Gleixner:

    - migrate_disable/enable() support which originates from the RT tree
    and is now a prerequisite for the new preemptible kmap_local() API
    which aims to replace kmap_atomic().

    - A fair amount of topology and NUMA related improvements

    - Improvements for the frequency invariant calculations

    - Enhanced robustness for the global CPU priority tracking and decision
    making

    - The usual small fixes and enhancements all over the place

    * tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
    sched/fair: Trivial correction of the newidle_balance() comment
    sched/fair: Clear SMT siblings after determining the core is not idle
    sched: Fix kernel-doc markup
    x86: Print ratio freq_max/freq_base used in frequency invariance calculations
    x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
    x86, sched: Calculate frequency invariance for AMD systems
    irq_work: Optimize irq_work_single()
    smp: Cleanup smp_call_function*()
    irq_work: Cleanup
    sched: Limit the amount of NUMA imbalance that can exist at fork time
    sched/numa: Allow a floating imbalance between NUMA nodes
    sched: Avoid unnecessary calculation of load imbalance at clone time
    sched/numa: Rename nr_running and break out the magic number
    sched: Make migrate_disable/enable() independent of RT
    sched/topology: Condition EAS enablement on FIE support
    arm64: Rebuild sched domains on invariance status changes
    sched/topology,schedutil: Wrap sched domains rebuild
    sched/uclamp: Allow to reset a task uclamp constraint value
    sched/core: Fix typos in comments
    Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
    ...

    Linus Torvalds
     

10 Dec, 2020

4 commits

  • blk_queue_enter() accepts BLK_MQ_REQ_PM requests independent of the runtime
    power management state. Now that SCSI domain validation no longer depends
    on this behavior, modify the behavior of blk_queue_enter() as follows:

    - Do not accept any requests while suspended.

    - Only process power management requests while suspending or resuming.

    Submitting BLK_MQ_REQ_PM requests to a device that is runtime suspended
    causes runtime-suspended devices not to resume as they should. The request
    which should cause a runtime resume instead gets issued directly, without
    resuming the device first. Of course the device can't handle it properly,
    the I/O fails, and the device remains suspended.

    The problem is fixed by checking that the queue's runtime-PM status isn't
    RPM_SUSPENDED before allowing a request to be issued, and queuing a
    runtime-resume request if it is. In particular, the inline
    blk_pm_request_resume() routine is renamed blk_pm_resume_queue() and the
    code is unified by merging the surrounding checks into the routine. If the
    queue isn't set up for runtime PM, or there currently is no restriction on
    allowed requests, the request is allowed. Likewise if the BLK_MQ_REQ_PM
    flag is set and the status isn't RPM_SUSPENDED. Otherwise a runtime resume
    is queued and the request is blocked until conditions are more suitable.
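
    The renamed helper, sketched from the description above (bool returns
    used for clarity):

    static bool blk_pm_resume_queue(const bool pm, struct request_queue *q)
    {
        if (!q->dev || !blk_queue_pm_only(q))
            return true;   /* no runtime PM, or no restriction: allow */
        if (pm && q->rpm_status != RPM_SUSPENDED)
            return true;   /* BLK_MQ_REQ_PM and not fully suspended: allow */

        /* Otherwise kick off a runtime resume and block the request. */
        pm_request_resume(q->dev);
        return false;
    }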

    [ bvanassche: modified commit message and removed Cc: stable because
    without the previous patches from this series this patch would break
    parallel SCSI domain validation + introduced queue_rpm_status() ]

    Link: https://lore.kernel.org/r/20201209052951.16136-9-bvanassche@acm.org
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Can Guo
    Cc: Stanley Chu
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Reported-and-tested-by: Martin Kepplinger
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Can Guo
    Signed-off-by: Alan Stern
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Alan Stern
     
  • Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer
    used by any kernel code.

    Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
    Cc: Can Guo
    Cc: Stanley Chu
    Cc: Alan Stern
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Martin Kepplinger
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Bart Van Assche
     
  • Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
    functions set RQF_PM. This is the first step towards removing
    BLK_MQ_REQ_PREEMPT.
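
    A sketch of the translation at request-allocation time (the exact site,
    assumed here to be blk_mq_rq_ctx_init(), follows from the description):

    /* BLK_MQ_REQ_PM marks the allocation; RQF_PM marks the request. */
    if (data->flags & BLK_MQ_REQ_PM)
        rq->rq_flags |= RQF_PM;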

    Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
    Cc: Alan Stern
    Cc: Stanley Chu
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Can Guo
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Bart Van Assche
     
  • With the current implementation the following race can happen:

    * blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
    blk_mq_unfreeze_queue().

    * blk_queue_enter() calls blk_queue_pm_only() and that function returns
    true.

    * blk_queue_enter() calls blk_pm_request_resume() and that function does
    not call pm_request_resume() because the queue runtime status is
    RPM_ACTIVE.

    * blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.

    Fix this race by changing the queue runtime status into RPM_SUSPENDING
    before switching q_usage_counter to atomic mode.
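
    A sketch of the reordering in blk_pre_runtime_suspend() (simplified;
    surrounding code elided):

    int blk_pre_runtime_suspend(struct request_queue *q)
    {
        /* ... */

        /* Publish RPM_SUSPENDING first, so a blk_queue_enter() caller that
         * observes pm-only mode also sees a non-RPM_ACTIVE status. */
        spin_lock_irq(&q->queue_lock);
        q->rpm_status = RPM_SUSPENDING;
        spin_unlock_irq(&q->queue_lock);

        blk_set_pm_only(q);
        /* Only now switch q_usage_counter to atomic (frozen) mode. */
        blk_freeze_queue_start(q);

        /* ... */
    }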

    Link: https://lore.kernel.org/r/20201209052951.16136-2-bvanassche@acm.org
    Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: stable
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Acked-by: Alan Stern
    Acked-by: Stanley Chu
    Co-developed-by: Can Guo
    Signed-off-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Bart Van Assche
     

08 Dec, 2020

11 commits

  • This reverts commit b3c6a59975415bde29cfd76ff1ab008edbf614a9.

    Now that the nvme-loop 'possible recursive locking' lockdep warning is
    avoided by nvme-loop's own lock class, there is no need to apply a
    dynamically allocated lock class key, so revert commit b3c6a5997541
    ("block: Fix a lockdep complaint triggered by request queue flushing").

    This fixes a horrible SCSI probe delay issue on megaraid_sas; it was
    reported that the whole probe may take more than half an hour.
    Tested-by: Kashyap Desai
    Reported-by: Qian Cai
    Reviewed-by: Christoph Hellwig
    Cc: Sumit Saxena
    Cc: John Garry
    Cc: Kashyap Desai
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Signed-off-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • flush_end_io() may be called recursively from some drivers, such as
    nvme-loop, so lockdep may complain about 'possible recursive locking'.
    Commit b3c6a5997541 ("block: Fix a lockdep complaint triggered by
    request queue flushing") tried to address this issue by assigning a
    dynamically allocated per-flush-queue lock class. That solution adds a
    synchronize_rcu() to each hctx's release handler, and causes a horrible
    SCSI MQ probe delay (more than half an hour on megaraid_sas).

    Add a new API, blk_mq_hctx_set_fq_lock_class(), for these drivers, so
    we just need to use a driver-specific lock class to avoid the lockdep
    warning of 'possible recursive locking'.
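
    A sketch of the intended driver-side usage, modeled on the nvme-loop
    follow-up in this series (the key name is an assumption):

    static struct lock_class_key loop_hctx_fq_lock_key;

    static int nvme_loop_init_hctx(struct blk_mq_tag_set *set,
                                   struct blk_mq_hw_ctx *hctx,
                                   unsigned int hctx_idx)
    {
        /* ... */
        /* Give loop's flush queues their own lock class, so recursive
         * flush_end_io() calls don't look like self-deadlock to lockdep. */
        blk_mq_hctx_set_fq_lock_class(hctx, &loop_hctx_fq_lock_key);
        return 0;
    }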

    Tested-by: Kashyap Desai
    Reported-by: Qian Cai
    Cc: Sumit Saxena
    Cc: John Garry
    Cc: Kashyap Desai
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Signed-off-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • iopoll is initially for small size, latency sensitive IO. It doesn't
    work well for big IO, especially when it needs to be split to multiple
    bios. In this case, the returned cookie of __submit_bio_noacct_mq() is
    indeed the cookie of the last split bio. The completion of *this* last
    split bio done by iopoll doesn't mean the whole original bio has
    completed. Callers of iopoll still need to wait for completion of other
    split bios.

    Besides, bio splitting may cause more trouble for iopoll, which isn't
    supposed to be used for big IO in the first place.

    iopoll for split bio may cause potential race if CPU migration happens
    during bio submission. Since the returned cookie is that of the last
    split bio, polling on the corresponding hardware queue doesn't help
    complete other split bios, if these split bios are enqueued into
    different hardware queues. Since interrupts are disabled for polling
    queues, the completion of these other split bios depends on timeout
    mechanism, thus causing a potential hang.

    iopoll for split bio may also cause a hang for sync polling. Currently
    both the blkdev and iomap-based fs (ext4/xfs, etc) support sync polling
    in the direct IO routine. These routines will submit a bio without the
    REQ_NOWAIT flag set, and then start sync polling in the current process
    context. The process may hang in blk_mq_get_tag() if the submitted bio
    has to be split into multiple bios and can rapidly exhaust the queue
    depth. The process is waiting for the completion of the previously
    allocated requests, which should be reaped by the following polling,
    thus causing a deadlock.

    To avoid the subtle troubles described above, just disable iopoll for
    split bios and return BLK_QC_T_NONE in this case. The side effect is
    that non-HIPRI IO also returns BLK_QC_T_NONE now. That should be
    acceptable since the returned cookie is never used for non-HIPRI IO.
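
    A sketch of the approach, assuming the split path clears the polling
    hint so that submission returns BLK_QC_T_NONE (names from the 5.11-era
    block layer):

    /* in __blk_queue_split(), once a split has been decided: */
    if (split) {
        /* there is no chance to merge the split bio */
        split->bi_opf |= REQ_NOMERGE;

        /*
         * iopoll could only reap the cookie of the last split bio; drop
         * REQ_HIPRI so all fragments complete via IRQ and the caller gets
         * BLK_QC_T_NONE instead of a misleading cookie.
         */
        bio->bi_opf &= ~REQ_HIPRI;
        /* ... */
    }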

    Suggested-by: Ming Lei
    Signed-off-by: Jeffle Xu
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jeffle Xu
     
  • Block device drivers do not have to call blk_queue_max_hw_sectors() to
    set a limit on request size if the default limit BLK_SAFE_MAX_SECTORS
    is acceptable. However, this limit (255 sectors) may not be aligned
    to the device's logical block size, in which case it cannot be used
    as-is as a maximum request size. This is the case for the null_blk
    device driver.

    Modify blk_queue_max_hw_sectors() to make sure that the request size
    limits specified by the max_hw_sectors and max_sectors queue limits
    are always aligned to the device logical block size. Additionally, to
    avoid introducing a dependence on the execution order of this function
    with blk_queue_logical_block_size(), also modify
    blk_queue_logical_block_size() to perform the same alignment when the
    logical block size is set after max_hw_sectors.
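
    A sketch of the alignment step (simplified; helpers as in the block
    layer of that era, not the exact patch):

    void blk_queue_max_hw_sectors(struct request_queue *q,
                                  unsigned int max_hw_sectors)
    {
        struct queue_limits *limits = &q->limits;
        unsigned int max_sectors;

        /* Round both limits down to a multiple of the logical block size,
         * expressed in 512-byte sectors. */
        max_hw_sectors = round_down(max_hw_sectors,
                                    limits->logical_block_size >> SECTOR_SHIFT);
        limits->max_hw_sectors = max_hw_sectors;

        max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
        max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
        max_sectors = round_down(max_sectors,
                                 limits->logical_block_size >> SECTOR_SHIFT);
        limits->max_sectors = max_sectors;
    }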

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Improve the checks on the zones of a zoned block device done in
    blk_revalidate_disk_zones() by making sure that the device's report_zones
    method reported at least one zone, and that the reported zones exactly
    cover the entire disk capacity, that is, that there are no missing zones
    at the end of the disk sector range.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • If blk_poll() is not going to spin (i.e. @spin=false), it also must not
    sleep in hybrid polling, otherwise it might be pretty surprising for
    users trying to do a quick check and expecting no-wait behaviour.
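
    A sketch of the guard in blk_poll() (condition placement assumed from
    the description):

    /* Hybrid polling may sleep; only allow it when the caller will spin. */
    if (spin && q->poll_nsec != BLK_MQ_POLL_CLASSIC &&
        blk_mq_poll_hybrid(q, hctx, cookie))
        return 1;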

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Factor out the base vrate change code into a separate function
    to simplify ioc_timer_fn().

    No functional change.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • Factor out the iocgs' state check into a separate function to
    simplify the ioc_timer_fn().

    No functional change.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • We only use the hweight based usage ratio to calculate the new
    hweight_inuse of the iocg to decide if this iocg can donate some
    surplus vtime.

    Thus move the usage ratio calculation to the correct place to
    avoid unnecessary calculation for iocgs with a vtime shortage.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • Remove unnecessary forward declaration of struct ioc_gq.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • Fix some typos in comments.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang