07 Feb, 2021

1 commit

  • Pull block fixes from Jens Axboe:
    "A few small regression fixes:

    - NVMe pull request from Christoph:
    - more quirks for buggy devices (Thorsten Leemhuis, Claus Stovgaard)
    - update the email address for Keith (Keith Busch)
    - fix an out of bounds access in nvmet-tcp (Sagi Grimberg)

    - Regression fix for BFQ shallow depth calculations introduced in
    this merge window (Lin)"

    * tag 'block-5.11-2021-02-05' of git://git.kernel.dk/linux-block:
    nvmet-tcp: fix out-of-bounds access when receiving multiple h2cdata PDUs
    bfq-iosched: Revert "bfq: Fix computation of shallow depth"
    update the email address for Keith Busch
    nvme-pci: ignore the subsystem NQN on Phison E16
    nvme-pci: avoid the deepest sleep state on Kingston A2000 SSDs

    Linus Torvalds
     

03 Feb, 2021

1 commit

  • This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.

    bfq.limit_depth passes word_depths[] as shallow_depth down to the sbitmap
    core's sbitmap_get_shallow(), which uses just that number to limit the
    scan depth of each bitmap word, per the formula:

    scan_percentage_for_each_word = shallow_depth / (1 << sbitmap->shift) * 100%

    That means the percentages in bfq's comments (50%, 75%, 18%, 37%) are
    correct. But after the patch 'bfq: Fix computation of shallow depth', we
    use sbitmap.depth instead. As an example, take the following case:

    sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
    The computed bfqd->word_depths[] are then {128, 192, 48, 96}, and three of
    those numbers exceed the core driver's 'sbitmap_word.depth = 64' limit;
    nothing good will happen.
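
    For illustration only, here is a small userspace sketch of the arithmetic
    above (the 1/2, 3/4, 3/16 and 6/16 fractions are assumed from bfq's
    percentages; this is not the kernel code):

    #include <stdio.h>

    int main(void)
    {
        unsigned int shift = 6;                /* bits per bitmap word: 1 << 6 = 64 */
        unsigned int word_depth = 1U << shift; /* per-word scan limit */
        unsigned int depth = 256;              /* sbitmap.depth, map_nr = 4 words */

        /* Correct basis (per word): 50%, 75%, 18%, 37% of 64. */
        unsigned int ok[4]  = { word_depth / 2, word_depth * 3 / 4,
                                word_depth * 3 / 16, word_depth * 6 / 16 };
        /* Reverted patch's basis (whole bitmap): yields {128, 192, 48, 96},
         * three of which exceed the per-word limit of 64. */
        unsigned int bad[4] = { depth / 2, depth * 3 / 4,
                                depth * 3 / 16, depth * 6 / 16 };

        for (int i = 0; i < 4; i++)
            printf("ok=%u bad=%u (limit %u)\n", ok[i], bad[i], word_depth);
        return 0;
    }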

    Signed-off-by: Lin Feng
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Lin Feng
     

30 Jan, 2021

1 commit

  • Pull block fixes from Jens Axboe:
    "All over the place fixes for this release:

    - blk-cgroup iteration teardown resched fix (Baolin)

    - NVMe pull request from Christoph:
    - add another Write Zeroes quirk (Chaitanya Kulkarni)
    - handle a no path available corner case (Daniel Wagner)
    - use the proper RCU aware list_add helper (Chao Leng)

    - bcache regression fix (Coly)

    - bdev->bd_size_lock IRQ fix. This will be fixed in drivers for 5.12,
    but for now, we'll make it IRQ safe (Damien)

    - null_blk zoned init fix (Damien)

    - add_partition() error handling fix (Dinghao)

    - s390 dasd kobject fix (Jan)

    - nbd fix for freezing queue while adding connections (Josef)

    - tag queueing regression fix (Ming)

    - revert of a patch that inadvertently meant that we regressed write
    performance on raid (Maxim)"

    * tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block:
    null_blk: cleanup zoned mode initialization
    nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head
    nvme-multipath: Early exit if no path is available
    nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device
    bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES
    block: fix bd_size_lock use
    blk-cgroup: Use cond_resched() when destroy blkgs
    Revert "block: simplify set_init_blocksize" to regain lost performance
    nbd: freeze the queue while we're adding connections
    s390/dasd: Fix inconsistent kobject removal
    block: Fix an error handling in add_partition
    blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue

    Linus Torvalds
     

28 Jan, 2021

2 commits

  • Some block device drivers, e.g. the skd driver, call set_capacity() with
    IRQs disabled. This results in lockdep complaining about inconsistent
    lock states ("inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage")
    because set_capacity() takes the block device's bd_size_lock using
    plain spin_lock() and spin_unlock(). Ensure a consistent locking
    state by replacing these calls with spin_lock_irqsave() and
    spin_unlock_irqrestore(). The same applies to bdev_set_nr_sectors().
    With this fix, all lockdep complaints are resolved.
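
    A sketch of the resulting pattern (simplified; field names as in the
    5.11-era block layer, not the exact patch):

    void set_capacity(struct gendisk *disk, sector_t sectors)
    {
        struct block_device *bdev = disk->part0;
        unsigned long flags;

        /* IRQ-safe: callers such as skd may have IRQs disabled. */
        spin_lock_irqsave(&bdev->bd_size_lock, flags);
        i_size_write(bdev->bd_inode, (loff_t)sectors << SECTOR_SHIFT);
        spin_unlock_irqrestore(&bdev->bd_size_lock, flags);
    }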

    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • On a !PREEMPT kernel, we can get the softlockup below when doing stress
    testing that creates and destroys block cgroups repeatedly. The
    reason is that it may take a long time to acquire the queue's lock in
    the loop of blkcg_destroy_blkgs(), or the system can accumulate a
    huge number of blkgs in pathological cases. We can add a need_resched()
    check on each loop iteration and, if true, release the locks and call
    cond_resched() to avoid this issue, since blkcg_destroy_blkgs() is not
    called from atomic contexts.

    [ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
    [ 4757.010698] Call trace:
    [ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
    [ 4757.010701]  cgwb_release_workfn+0x104/0x158
    [ 4757.010702]  process_one_work+0x1bc/0x3f0
    [ 4757.010704]  worker_thread+0x164/0x468
    [ 4757.010705]  kthread+0x108/0x138
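
    A sketch of the loop described above (simplified from the description;
    the trylock detail is an assumption about how the queue lock is taken):

    void blkcg_destroy_blkgs(struct blkcg *blkcg)
    {
        might_sleep();

        spin_lock_irq(&blkcg->lock);
        while (!hlist_empty(&blkcg->blkg_list)) {
            struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
                                                struct blkcg_gq, blkcg_node);
            struct request_queue *q = blkg->q;

            if (need_resched() || !spin_trylock(&q->queue_lock)) {
                /* Drop the locks, give the CPU away, then retry. */
                spin_unlock_irq(&blkcg->lock);
                cond_resched();
                spin_lock_irq(&blkcg->lock);
                continue;
            }

            blkg_destroy(blkg);
            spin_unlock(&q->queue_lock);
        }
        spin_unlock_irq(&blkcg->lock);
    }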

    Suggested-by: Tejun Heo
    Signed-off-by: Baolin Wang
    Signed-off-by: Jens Axboe

    Baolin Wang
     

11 Jan, 2021

1 commit

  • Pull block fixes from Jens Axboe:

    - Missing CRC32 selections (Arnd)

    - Fix for a merge window regression with bdev inode init (Christoph)

    - bcache fixes

    - rnbd fixes

    - NVMe pull request from Christoph:
    - fix a race in the nvme-tcp send code (Sagi Grimberg)
    - fix a list corruption in an nvme-rdma error path (Israel Rukshin)
    - avoid a possible double fetch in nvme-pci (Lalithambika Krishnakumar)
    - add the subsystem NQN quirk for a Samsung drive (Gopal Tiwari)
    - fix two compiler warnings in nvme-fcloop (James Smart)
    - don't call sleeping functions from irq context in nvme-fc (James Smart)
    - remove an unused argument (Max Gurtovoy)
    - remove unused exports (Minwoo Im)

    - Use-after-free fix for partition iteration (Ming)

    - Missing blk-mq debugfs flag annotation (John)

    - Bdev freeze regression fix (Satya)

    - blk-iocost NULL pointer deref fix (Tejun)

    * tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-block: (26 commits)
    bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET
    bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket
    bcache: check unsupported feature sets for bcache register
    bcache: fix typo from SUUP to SUPP in features.h
    bcache: set pdev_set_uuid before second loop iteration
    blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
    block/rnbd-clt: avoid module unload race with close confirmation
    block/rnbd: Adding name to the Contributors List
    block/rnbd-clt: Fix sg table use after free
    block/rnbd-srv: Fix use after free in rnbd_srv_sess_dev_force_close
    block/rnbd: Select SG_POOL for RNBD_CLIENT
    block: pre-initialize struct block_device in bdev_alloc_inode
    fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb
    nvme: remove the unused status argument from nvme_trace_bio_complete
    nvmet-rdma: Fix list_del corruption on queue establishment failure
    nvme: unexport functions with no external caller
    nvme: avoid possible double fetch in handling CQE
    nvme-tcp: Fix possible race of io_work and direct send
    nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
    nvme-fcloop: Fix sscanf type and list_first_entry_or_null warnings
    ...

    Linus Torvalds
     

08 Jan, 2021

1 commit

  • Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives
    something like:

    root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
    alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3

    Add the decoding for that flag.
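
    A sketch of the addition, assuming the usual HCTX_FLAG_NAME table in
    blk-mq-debugfs.c (neighboring entries abridged):

    static const char *const hctx_flag_name[] = {
        HCTX_FLAG_NAME(SHOULD_MERGE),
        HCTX_FLAG_NAME(TAG_QUEUE_SHARED),
        ...
        HCTX_FLAG_NAME(TAG_HCTX_SHARED),  /* the missing decode */
    };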

    Fixes: 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per tagset")
    Signed-off-by: John Garry
    Signed-off-by: Jens Axboe

    John Garry
     

06 Jan, 2021

3 commits

  • Make sure that bdgrab() is done on the 'block_device' instance before
    referring to it, to avoid a use-after-free.
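
    A minimal sketch of the rule being enforced (hypothetical call site, not
    the actual patch; use_block_device() is a stand-in for whatever touches
    the instance):

    /* Take the reference before any dereference of the block_device ... */
    bdgrab(part);
    /* ... so the instance cannot be freed while it is in use. */
    use_block_device(part);
    bdput(part);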

    Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • BFQ computes the number of tags it allows to be allocated for each request
    type based on the tag bitmap. However, it uses 1 << bitmap.shift as the
    number of available tags, which is wrong: 'shift' is just an internal
    bitmap value containing the logarithm of how many bits the bitmap uses in
    each bitmap word. Thus the number of tags allowed for some request types
    could be far too low. Use the proper bitmap.depth, which holds the number
    of tags, instead.
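
    A diff-style sketch of the change as described (this is the patch that
    the 07 Feb entry above reverts):

    -	depth = 1U << bt->sb.shift;	/* bits per bitmap word, not tag count */
    +	depth = bt->sb.depth;		/* the actual number of tags */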

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • When initializing iocost for a queue, its rqos should be registered
    before the blkcg policy is activated to allow policy data initialization
    to look up the associated ioc. This unfortunately means that the rqos
    methods can be called on bios before iocgs are attached to all existing
    blkgs.

    While the race is theoretically possible in ioc_rqos_throttle(), it
    mostly happened in ioc_rqos_merge() due to the difference in how the two
    look up the ioc. The former determines it from the passed-in @rqos and
    then bails before dereferencing the iocg if the looked-up ioc is
    disabled, which most likely is the case while initialization is still in
    progress. The latter looked up the ioc by dereferencing the possibly
    NULL iocg, making it a lot more prone to actually triggering the bug;
    a sketch follows the list below.

    * Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look
    up ioc for consistency.

    * Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before
    dereferencing it.

    * Explain the danger of NULL iocgs in blk_iocost_init().
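
    A sketch of the guard from the first two points (names as in blk-iocost;
    simplified):

    static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
                               struct bio *bio)
    {
        struct ioc *ioc = rqos_to_ioc(rqos);   /* same lookup as throttle */
        struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg);

        /* During init, iocgs may not yet be attached to all blkgs. */
        if (!ioc->enabled || !iocg)
            return;
        /* ... */
    }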

    Signed-off-by: Tejun Heo
    Reported-by: Jonathan Lemon
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Jan, 2021

1 commit

  • Pull SCSI fixes from James Bottomley:
    "This is a load of driver fixes (12 ufs, 1 mpt3sas, 1 cxgbi).

    The big core two fixes are for power management ("block: Do not accept
    any requests while suspended" and "block: Fix a race in the runtime
    power management code") which finally sorts out the resume problems
    we've occasionally been having.

    To make the resume fix, there are seven necessary precursors which
    effectively renames REQ_PREEMPT to REQ_PM, so every "special" request
    in block is automatically a power management exempt one.

    All of the non-PM preempt cases are removed except for the one in the
    SCSI Parallel Interface (spi) domain validation which is a genuine
    case where we have to run requests at high priority to validate the
    bus so this becomes an autopm get/put protected request"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (22 commits)
    scsi: cxgb4i: Fix TLS dependency
    scsi: ufs: Un-inline ufshcd_vops_device_reset function
    scsi: ufs: Re-enable WriteBooster after device reset
    scsi: ufs-mediatek: Use correct path to fix compile error
    scsi: mpt3sas: Signedness bug in _base_get_diag_triggers()
    scsi: block: Do not accept any requests while suspended
    scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
    scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE
    scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
    scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT
    scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
    scsi: block: Introduce BLK_MQ_REQ_PM
    scsi: block: Fix a race in the runtime power management code
    scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers
    scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers
    scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
    scsi: ufs-pci: Fix restore from S4 for Intel controllers
    scsi: ufs-mediatek: Keep VCC always-on for specific devices
    scsi: ufs: Allow regulators being always-on
    scsi: ufs: Clear UAC for RPMB after ufshcd resets
    ...

    Linus Torvalds
     

30 Dec, 2020

1 commit

  • This was missed in commit 021a24460dc2. It leads to the numeric value of
    QUEUE_FLAG_NOWAIT (i.e. 29) showing up in
    /sys/kernel/debug/block/*/state.
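
    A sketch of the fix, assuming the QUEUE_FLAG_NAME table in
    blk-mq-debugfs.c (neighboring entries abridged):

    static const char *const blk_queue_flag_name[] = {
        ...
        QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
        QUEUE_FLAG_NAME(HCTX_ACTIVE),
        QUEUE_FLAG_NAME(NOWAIT),  /* previously missing; bit 29 printed raw */
    };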

    Fixes: 021a24460dc28e7412aecfae89f60e1847e685c0
    Cc: Konstantin Khlebnikov
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Andres Freund
    Signed-off-by: Jens Axboe

    Andres Freund
     

18 Dec, 2020

1 commit

  • With force threaded interrupts enabled, raising softirq from an SMP
    function call will always result in waking the ksoftirqd thread. This is
    not optimal given that the thread runs at SCHED_OTHER priority.

    Completing the request in hard IRQ-context on PREEMPT_RT (which enforces
    the force threaded mode) is bad because the completion handler may
    acquire sleeping locks which violate the locking context.

    Disable request completing on a remote CPU in force threaded mode.
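
    A sketch of the check (its placement in blk-mq's IPI-decision helper is
    an assumption based on the description; cache-domain refinements elided):

    static inline bool blk_mq_complete_need_ipi(struct request *rq)
    {
        if (!IS_ENABLED(CONFIG_SMP) ||
            !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
            return false;

        /*
         * With force threaded interrupts enabled, raising softirq from an
         * SMP function call always wakes ksoftirqd; complete locally instead.
         */
        if (force_irqthreads)
            return false;

        /* Otherwise only complete remotely if the submitting CPU differs. */
        return rq->mq_ctx->cpu != raw_smp_processor_id();
    }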

    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Daniel Wagner
    Signed-off-by: Jens Axboe

    Sebastian Andrzej Siewior
     

17 Dec, 2020

5 commits

  • It will be helpful to trace the iocg's whole state, including active and
    idle. We can easily expand the original iocost_iocg_activate trace
    event into a state trace class that covers both active and idle state
    tracing.

    Acked-by: Tejun Heo
    Signed-off-by: Baolin Wang
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • It's guaranteed that no request is in flight when an hctx is going
    offline. This warning is only triggered when the wq's CPU is hot
    plugged and blk-mq has not synced up yet.

    As this state is temporary and the request is still processed
    correctly, it is better to remove the warning, as this is the fast path.

    Suggested-by: Ming Lei
    Signed-off-by: Daniel Wagner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Daniel Wagner
     
  • Pull SCSI updates from James Bottomley:
    "This consists of the usual driver updates (ufs, qla2xxx, smartpqi,
    target, zfcp, fnic, mpt3sas, ibmvfc) plus a load of cleanups, a major
    power management rework and a load of assorted minor updates.

    There are a few core updates (formatting fixes being the big one) but
    nothing major this cycle"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (279 commits)
    scsi: mpt3sas: Update driver version to 36.100.00.00
    scsi: mpt3sas: Handle trigger page after firmware update
    scsi: mpt3sas: Add persistent MPI trigger page
    scsi: mpt3sas: Add persistent SCSI sense trigger page
    scsi: mpt3sas: Add persistent Event trigger page
    scsi: mpt3sas: Add persistent Master trigger page
    scsi: mpt3sas: Add persistent trigger pages support
    scsi: mpt3sas: Sync time periodically between driver and firmware
    scsi: qla2xxx: Update version to 10.02.00.104-k
    scsi: qla2xxx: Fix device loss on 4G and older HBAs
    scsi: qla2xxx: If fcport is undergoing deletion complete I/O with retry
    scsi: qla2xxx: Fix the call trace for flush workqueue
    scsi: qla2xxx: Fix flash update in 28XX adapters on big endian machines
    scsi: qla2xxx: Handle aborts correctly for port undergoing deletion
    scsi: qla2xxx: Fix N2N and NVMe connect retry failure
    scsi: qla2xxx: Fix FW initialization error on big endian machines
    scsi: qla2xxx: Fix crash during driver load on big endian machines
    scsi: qla2xxx: Fix compilation issue in PPC systems
    scsi: qla2xxx: Don't check for fw_started while posting NVMe command
    scsi: qla2xxx: Tear down session if FW say it is down
    ...

    Linus Torvalds
     
  • Pull block driver updates from Jens Axboe:
    "Nothing major in here:

    - NVMe pull request from Christoph:
    - nvmet passthrough improvements (Chaitanya Kulkarni)
    - fcloop error injection support (James Smart)
    - read-only support for zoned namespaces without Zone Append
    (Javier González)
    - improve some error messages (Minwoo Im)
    - reject I/O to offline fabrics namespaces (Victor Gladkov)
    - PCI queue allocation cleanups (Niklas Schnelle)
    - remove an unused allocation in nvmet (Amit Engel)
    - a Kconfig spelling fix (Colin Ian King)
    - nvme_req_qid simplification (Baolin Wang)

    - MD pull request from Song:
    - Fix race condition in md_ioctl() (Dae R. Jeong)
    - Initialize read_slot properly for raid10 (Kevin Vigor)
    - Code cleanup (Pankaj Gupta)
    - md-cluster resync/reshape fix (Zhao Heming)

    - Move null_blk into its own directory (Damien Le Moal)

    - null_blk zone and discard improvements (Damien Le Moal)

    - bcache race fix (Dongsheng Yang)

    - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
    Lutz Pogrell, Md Haris Iqbal)

    - lightnvm NULL pointer deref fix (tangzhenhao)

    - sr in_interrupt() removal (Sebastian Andrzej Siewior)

    - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
    Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
    as it made it easier for them to funnel the feature through the
    block driver tree.

    - Follow up fixes (Colin Ian King)"

    * tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
    block: drop dead assignments in loop_init()
    sr: Remove in_interrupt() usage in sr_init_command().
    sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
    cdrom: Reset sector_size back if it is not 2048.
    drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
    null_blk: Move driver into its own directory
    null_blk: Allow controlling max_hw_sectors limit
    null_blk: discard zones on reset
    null_blk: cleanup discard handling
    null_blk: Improve implicit zone close
    null_blk: improve zone locking
    block: Align max_hw_sectors to logical blocksize
    null_blk: Fail zone append to conventional zones
    null_blk: Fix zone size initialization
    bcache: fix race between setting bdev state to none and new write request direct to backing
    block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
    block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
    block/rnbd: call kobject_put in the failure path
    Documentation/ABI/rnbd-srv: add document for force_close
    block/rnbd-srv: close a mapped device from server side.
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "Another series of killing more code than what is being added, again
    thanks to Christoph's relentless cleanups and tech debt tackling.

    This contains:

    - blk-iocost improvements (Baolin Wang)

    - part0 iostat fix (Jeffle Xu)

    - Disable iopoll for split bios (Jeffle Xu)

    - block tracepoint cleanups (Christoph Hellwig)

    - Merging of struct block_device and hd_struct (Christoph Hellwig)

    - Rework/cleanup of how block device sizes are updated (Christoph
    Hellwig)

    - Simplification of gendisk lookup and removal of block device
    aliasing (Christoph Hellwig)

    - Block device ioctl cleanups (Christoph Hellwig)

    - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)

    - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)

    - sbitmap improvements (Pavel Begunkov)

    - Hybrid polling fix (Pavel Begunkov)

    - bvec iteration improvements (Pavel Begunkov)

    - Zone revalidation fixes (Damien Le Moal)

    - blk-throttle limit fix (Yu Kuai)

    - Various little fixes"

    * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
    blk-mq: fix msec comment from micro to milli seconds
    blk-mq: update arg in comment of blk_mq_map_queue
    blk-mq: add helper allocating tagset->tags
    Revert "block: Fix a lockdep complaint triggered by request queue flushing"
    nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
    blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
    block: disable iopoll for split bio
    block: Improve blk_revalidate_disk_zones() checks
    sbitmap: simplify wrap check
    sbitmap: replace CAS with atomic and
    sbitmap: remove swap_lock
    sbitmap: optimise sbitmap_deferred_clear()
    blk-mq: skip hybrid polling if iopoll doesn't spin
    blk-iocost: Factor out the base vrate change into a separate function
    blk-iocost: Factor out the active iocgs' state check into a separate function
    blk-iocost: Move the usage ratio calculation to the correct place
    blk-iocost: Remove unnecessary advance declaration
    blk-iocost: Fix some typos in comments
    blktrace: fix up a kerneldoc comment
    block: remove the request_queue to argument request based tracepoints
    ...

    Linus Torvalds
     

15 Dec, 2020

1 commit

  • Pull scheduler updates from Thomas Gleixner:

    - migrate_disable/enable() support which originates from the RT tree
    and is now a prerequisite for the new preemptible kmap_local() API
    which aims to replace kmap_atomic().

    - A fair amount of topology and NUMA related improvements

    - Improvements for the frequency invariant calculations

    - Enhanced robustness for the global CPU priority tracking and decision
    making

    - The usual small fixes and enhancements all over the place

    * tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
    sched/fair: Trivial correction of the newidle_balance() comment
    sched/fair: Clear SMT siblings after determining the core is not idle
    sched: Fix kernel-doc markup
    x86: Print ratio freq_max/freq_base used in frequency invariance calculations
    x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
    x86, sched: Calculate frequency invariance for AMD systems
    irq_work: Optimize irq_work_single()
    smp: Cleanup smp_call_function*()
    irq_work: Cleanup
    sched: Limit the amount of NUMA imbalance that can exist at fork time
    sched/numa: Allow a floating imbalance between NUMA nodes
    sched: Avoid unnecessary calculation of load imbalance at clone time
    sched/numa: Rename nr_running and break out the magic number
    sched: Make migrate_disable/enable() independent of RT
    sched/topology: Condition EAS enablement on FIE support
    arm64: Rebuild sched domains on invariance status changes
    sched/topology,schedutil: Wrap sched domains rebuild
    sched/uclamp: Allow to reset a task uclamp constraint value
    sched/core: Fix typos in comments
    Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
    ...

    Linus Torvalds
     

10 Dec, 2020

4 commits

  • blk_queue_enter() accepts BLK_MQ_REQ_PM requests independent of the runtime
    power management state. Now that SCSI domain validation no longer depends
    on this behavior, modify the behavior of blk_queue_enter() as follows:

    - Do not accept any requests while suspended.

    - Only process power management requests while suspending or resuming.

    Submitting BLK_MQ_REQ_PM requests to a device that is runtime suspended
    causes runtime-suspended devices not to resume as they should. The request
    which should cause a runtime resume instead gets issued directly, without
    resuming the device first. Of course the device can't handle it properly,
    the I/O fails, and the device remains suspended.

    The problem is fixed by checking that the queue's runtime-PM status isn't
    RPM_SUSPENDED before allowing a request to be issued, and queuing a
    runtime-resume request if it is. In particular, the inline
    blk_pm_request_resume() routine is renamed blk_pm_resume_queue() and the
    code is unified by merging the surrounding checks into the routine. If the
    queue isn't set up for runtime PM, or there currently is no restriction on
    allowed requests, the request is allowed. Likewise if the BLK_MQ_REQ_PM
    flag is set and the status isn't RPM_SUSPENDED. Otherwise a runtime resume
    is queued and the request is blocked until conditions are more suitable.
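
    The renamed helper, sketched from the description above (bool returns
    used for clarity):

    static bool blk_pm_resume_queue(const bool pm, struct request_queue *q)
    {
        if (!q->dev || !blk_queue_pm_only(q))
            return true;   /* no runtime PM, or no restriction: allow */
        if (pm && q->rpm_status != RPM_SUSPENDED)
            return true;   /* BLK_MQ_REQ_PM and not fully suspended: allow */

        /* Otherwise kick off a runtime resume and block the request. */
        pm_request_resume(q->dev);
        return false;
    }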

    [ bvanassche: modified commit message and removed Cc: stable because
    without the previous patches from this series this patch would break
    parallel SCSI domain validation + introduced queue_rpm_status() ]

    Link: https://lore.kernel.org/r/20201209052951.16136-9-bvanassche@acm.org
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Can Guo
    Cc: Stanley Chu
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Reported-and-tested-by: Martin Kepplinger
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Can Guo
    Signed-off-by: Alan Stern
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Alan Stern
     
  • Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer
    used by any kernel code.

    Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
    Cc: Can Guo
    Cc: Stanley Chu
    Cc: Alan Stern
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Martin Kepplinger
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Bart Van Assche
     
  • Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
    functions set RQF_PM. This is the first step towards removing
    BLK_MQ_REQ_PREEMPT.
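
    A sketch of the translation at request-allocation time (the exact site,
    assumed here to be blk_mq_rq_ctx_init(), follows from the description):

    /* BLK_MQ_REQ_PM marks the allocation; RQF_PM marks the request. */
    if (data->flags & BLK_MQ_REQ_PM)
        rq->rq_flags |= RQF_PM;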

    Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
    Cc: Alan Stern
    Cc: Stanley Chu
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Can Guo
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Bart Van Assche
     
  • With the current implementation the following race can happen:

    * blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
    blk_mq_unfreeze_queue().

    * blk_queue_enter() calls blk_queue_pm_only() and that function returns
    true.

    * blk_queue_enter() calls blk_pm_request_resume() and that function does
    not call pm_request_resume() because the queue runtime status is
    RPM_ACTIVE.

    * blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.

    Fix this race by changing the queue runtime status into RPM_SUSPENDING
    before switching q_usage_counter to atomic mode.
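
    A sketch of the reordering in blk_pre_runtime_suspend() (simplified;
    surrounding code elided):

    int blk_pre_runtime_suspend(struct request_queue *q)
    {
        /* ... */

        /* Publish RPM_SUSPENDING first, so a blk_queue_enter() caller that
         * observes pm-only mode also sees a non-RPM_ACTIVE status. */
        spin_lock_irq(&q->queue_lock);
        q->rpm_status = RPM_SUSPENDING;
        spin_unlock_irq(&q->queue_lock);

        blk_set_pm_only(q);
        /* Only now switch q_usage_counter to atomic (frozen) mode. */
        blk_freeze_queue_start(q);

        /* ... */
    }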

    Link: https://lore.kernel.org/r/20201209052951.16136-2-bvanassche@acm.org
    Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: stable
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Acked-by: Alan Stern
    Acked-by: Stanley Chu
    Co-developed-by: Can Guo
    Signed-off-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Bart Van Assche
     

08 Dec, 2020

11 commits

  • This reverts commit b3c6a59975415bde29cfd76ff1ab008edbf614a9.

    Now that the nvme-loop 'possible recursive locking' lockdep warning is
    avoided by nvme-loop's own lock class, there is no need to apply a
    dynamically allocated lock class key, so revert commit b3c6a5997541
    ("block: Fix a lockdep complaint triggered by request queue flushing").

    This fixes a horrible SCSI probe delay issue on megaraid_sas; it was
    reported that the whole probe may take more than half an hour.
    Tested-by: Kashyap Desai
    Reported-by: Qian Cai
    Reviewed-by: Christoph Hellwig
    Cc: Sumit Saxena
    Cc: John Garry
    Cc: Kashyap Desai
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Signed-off-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • flush_end_io() may be called recursively from some drivers, such as
    nvme-loop, so lockdep may complain about 'possible recursive locking'.
    Commit b3c6a5997541 ("block: Fix a lockdep complaint triggered by
    request queue flushing") tried to address this issue by assigning a
    dynamically allocated per-flush-queue lock class. That solution adds a
    synchronize_rcu() to each hctx's release handler, and causes a horrible
    SCSI MQ probe delay (more than half an hour on megaraid_sas).

    Add a new API, blk_mq_hctx_set_fq_lock_class(), for these drivers, so
    we just need to use a driver-specific lock class to avoid the lockdep
    warning of 'possible recursive locking'.
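
    A sketch of the intended driver-side usage, modeled on the nvme-loop
    follow-up in this series (the key name is an assumption):

    static struct lock_class_key loop_hctx_fq_lock_key;

    static int nvme_loop_init_hctx(struct blk_mq_tag_set *set,
                                   struct blk_mq_hw_ctx *hctx,
                                   unsigned int hctx_idx)
    {
        /* ... */
        /* Give loop's flush queues their own lock class, so recursive
         * flush_end_io() calls don't look like self-deadlock to lockdep. */
        blk_mq_hctx_set_fq_lock_class(hctx, &loop_hctx_fq_lock_key);
        return 0;
    }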

    Tested-by: Kashyap Desai
    Reported-by: Qian Cai
    Cc: Sumit Saxena
    Cc: John Garry
    Cc: Kashyap Desai
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Signed-off-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • iopoll is initially for small size, latency sensitive IO. It doesn't
    work well for big IO, especially when it needs to be split to multiple
    bios. In this case, the returned cookie of __submit_bio_noacct_mq() is
    indeed the cookie of the last split bio. The completion of *this* last
    split bio done by iopoll doesn't mean the whole original bio has
    completed. Callers of iopoll still need to wait for completion of other
    split bios.

    Besides, bio splitting may cause more trouble for iopoll, which isn't
    supposed to be used for big IO in the first place.

    iopoll for split bio may cause potential race if CPU migration happens
    during bio submission. Since the returned cookie is that of the last
    split bio, polling on the corresponding hardware queue doesn't help
    complete other split bios, if these split bios are enqueued into
    different hardware queues. Since interrupts are disabled for polling
    queues, the completion of these other split bios depends on timeout
    mechanism, thus causing a potential hang.

    iopoll for split bio may also cause a hang for sync polling. Currently
    both the blkdev and iomap-based fs (ext4/xfs, etc) support sync polling
    in the direct IO routine. These routines will submit a bio without the
    REQ_NOWAIT flag set, and then start sync polling in the current process
    context. The process may hang in blk_mq_get_tag() if the submitted bio
    has to be split into multiple bios and can rapidly exhaust the queue
    depth. The process is waiting for the completion of the previously
    allocated requests, which should be reaped by the following polling,
    thus causing a deadlock.

    To avoid the subtle troubles described above, just disable iopoll for
    split bios and return BLK_QC_T_NONE in this case. The side effect is
    that non-HIPRI IO also returns BLK_QC_T_NONE now. That should be
    acceptable since the returned cookie is never used for non-HIPRI IO.
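
    A sketch of the approach, assuming the split path clears the polling
    hint so that submission returns BLK_QC_T_NONE (names from the 5.11-era
    block layer):

    /* in __blk_queue_split(), once a split has been decided: */
    if (split) {
        /* there is no chance to merge the split bio */
        split->bi_opf |= REQ_NOMERGE;

        /*
         * iopoll could only reap the cookie of the last split bio; drop
         * REQ_HIPRI so all fragments complete via IRQ and the caller gets
         * BLK_QC_T_NONE instead of a misleading cookie.
         */
        bio->bi_opf &= ~REQ_HIPRI;
        /* ... */
    }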

    Suggested-by: Ming Lei
    Signed-off-by: Jeffle Xu
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jeffle Xu
     
  • Block device drivers do not have to call blk_queue_max_hw_sectors() to
    set a limit on request size if the default limit BLK_SAFE_MAX_SECTORS
    is acceptable. However, this limit (255 sectors) may not be aligned
    to the device's logical block size, in which case it cannot be used
    as-is as a maximum request size. This is the case for the null_blk
    device driver.

    Modify blk_queue_max_hw_sectors() to make sure that the request size
    limits specified by the max_hw_sectors and max_sectors queue limits
    are always aligned to the device logical block size. Additionally, to
    avoid introducing a dependence on the execution order of this function
    with blk_queue_logical_block_size(), also modify
    blk_queue_logical_block_size() to perform the same alignment when the
    logical block size is set after max_hw_sectors.
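
    A sketch of the alignment step (simplified; helpers as in the block
    layer of that era, not the exact patch):

    void blk_queue_max_hw_sectors(struct request_queue *q,
                                  unsigned int max_hw_sectors)
    {
        struct queue_limits *limits = &q->limits;
        unsigned int max_sectors;

        /* Round both limits down to a multiple of the logical block size,
         * expressed in 512-byte sectors. */
        max_hw_sectors = round_down(max_hw_sectors,
                                    limits->logical_block_size >> SECTOR_SHIFT);
        limits->max_hw_sectors = max_hw_sectors;

        max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
        max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
        max_sectors = round_down(max_sectors,
                                 limits->logical_block_size >> SECTOR_SHIFT);
        limits->max_sectors = max_sectors;
    }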

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Improve the checks on the zones of a zoned block device done in
    blk_revalidate_disk_zones() by making sure that the device's report_zones
    method reported at least one zone, and that the reported zones exactly
    cover the entire disk capacity, that is, that there are no missing zones
    at the end of the disk sector range.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • If blk_poll() is not going to spin (i.e. @spin=false), it also must not
    sleep in hybrid polling, otherwise it might be pretty surprising for
    users trying to do a quick check and expecting no-wait behaviour.
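
    A sketch of the guard in blk_poll() (condition placement assumed from
    the description):

    /* Hybrid polling may sleep; only allow it when the caller will spin. */
    if (spin && q->poll_nsec != BLK_MQ_POLL_CLASSIC &&
        blk_mq_poll_hybrid(q, hctx, cookie))
        return 1;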

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • Factor out the base vrate change code into a separate function
    to simplify ioc_timer_fn().

    No functional change.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • Factor out the iocgs' state check into a separate function to
    simplify the ioc_timer_fn().

    No functional change.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • We only use the hweight based usage ratio to calculate the new
    hweight_inuse of the iocg to decide if this iocg can donate some
    surplus vtime.

    Thus move the usage ratio calculation to the correct place to
    avoid unnecessary calculation for iocgs with a vtime shortage.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • Remove unnecessary forward declaration of struct ioc_gq.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • Fix some typos in comments.

    Signed-off-by: Baolin Wang
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Baolin Wang