28 Sep, 2019

2 commits

  • Some HDD drives may expose multiple hardware queues, such as those
    behind a MegaRaid controller. Let's apply normal plugging for such
    devices because sequential IO may benefit a lot from plug merging.
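
    As an illustration (not taken from the patch itself), this is the usual
    in-kernel plugging pattern that such devices now benefit from; 'bios[]'
    and 'nr_bios' stand in for the caller's batch of contiguous bios:

    struct blk_plug plug;
    int i;

    blk_start_plug(&plug);
    for (i = 0; i < nr_bios; i++)
            submit_bio(bios[i]);    /* adjacent sectors merge in the plug list */
    blk_finish_plug(&plug);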

    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Dave Chinner
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • If a device is using multiple queues, the IO scheduler may be bypassed.
    This may hurt performance for some slow MQ devices, and it also breaks
    zoned devices which depend on mq-deadline to respect the write order
    within a zone.

    Don't bypass the IO scheduler if one is set up.

    With this patch, sequential write performance roughly doubles on MQ
    scsi_debug when mq-deadline is applied.
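
    A rough sketch of the decision this implies in the submission path
    (illustrative only, not the actual diff; the surrounding context and
    exact helper signatures are assumptions based on the blk-mq code of
    that era):

    if (q->elevator) {
            /* hand the request to the attached scheduler so it can
             * merge and order it (e.g. mq-deadline on zoned devices) */
            blk_mq_sched_insert_request(rq, false, true, true);
    } else {
            /* no scheduler attached: direct issue is still fine */
            blk_mq_try_issue_directly(hctx, rq, &cookie);
    }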

    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Dave Chinner
    Reviewed-by: Javier González
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

27 Sep, 2019

1 commit

  • We hit a NULL pointer dereference BUG in blk_mq_rq_timed_out(),
    as shown below:

    [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
    [ 108.827059] PGD 0 P4D 0
    [ 108.827313] Oops: 0000 [#1] SMP PTI
    [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
    [ 108.829503] Workqueue: kblockd blk_mq_timeout_work
    [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
    [ 108.838191] Call Trace:
    [ 108.838406] bt_iter+0x74/0x80
    [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
    [ 108.839074] ? __switch_to_asm+0x34/0x70
    [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
    [ 108.840732] blk_mq_timeout_work+0x74/0x200
    [ 108.841151] process_one_work+0x297/0x680
    [ 108.841550] worker_thread+0x29c/0x6f0
    [ 108.841926] ? rescuer_thread+0x580/0x580
    [ 108.842344] kthread+0x16a/0x1a0
    [ 108.842666] ? kthread_flush_work+0x170/0x170
    [ 108.843100] ret_from_fork+0x35/0x40

    The bug is caused by a race between the timeout handler and the
    completion of a flush request.

    When the timeout handler blk_mq_rq_timed_out() tries to read
    'req->q->mq_ops', 'req' has already been completed and reinitialized
    by the next flush request, which calls blk_rq_init() to zero 'req'.

    After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
    the lifetime of normal requests is protected by a refcount: a request
    can only really be freed once 'rq->ref' drops to zero, so it cannot
    be reused before the timeout handler has finished.

    However, the flush request defines .end_io, and rq->end_io() is still
    called even if 'rq->ref' hasn't dropped to zero. After that, 'flush_rq'
    can be reused by the next flush request, resulting in the NULL pointer
    dereference BUG.

    Fix this problem by covering the flush request with 'rq->ref' as well:
    if the refcount is not zero, flush_end_io() returns and waits for the
    last holder to call it again. To record the request status, add a new
    field 'rq_status', which is used in flush_end_io().
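
    A condensed sketch of that guard (illustrative, not the literal diff;
    locking and field placement follow the v2-v5 notes below):

    static void flush_end_io(struct request *flush_rq, blk_status_t error)
    {
            struct blk_flush_queue *fq =
                    blk_get_flush_queue(flush_rq->q, flush_rq->mq_ctx);
            unsigned long flags;

            spin_lock_irqsave(&fq->mq_flush_lock, flags);
            if (!refcount_dec_and_test(&flush_rq->ref)) {
                    /* timeout handler still holds a reference: record the
                     * status and let the last holder finish the flush */
                    fq->rq_status = error;
                    spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
                    return;
            }
            /* normal flush completion continues here */
            spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
    }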

    Cc: Christoph Hellwig
    Cc: Keith Busch
    Cc: Bart Van Assche
    Cc: stable@vger.kernel.org # v4.18+
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Signed-off-by: Yufen Yu

    -------
    v2:
    - move rq_status from struct request to struct blk_flush_queue
    v3:
    - remove unnecessary '{}' pair.
    v4:
    - use a spinlock to protect 'fq->rq_status'
    v5:
    - move rq_status after flush_running_idx member of struct blk_flush_queue
    Signed-off-by: Jens Axboe

    Yufen Yu
     

18 Sep, 2019

3 commits

  • Currently the t10_pi_prepare/t10_pi_complete functions are called during
    NVMe and SCSI command preparation/completion, but their actual place
    should be the block layer since T10-PI is a general data integrity
    feature that is used by block storage protocols. Introduce .prepare_fn
    and .complete_fn callbacks within the integrity profile that each type
    can implement according to its needs.
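
    Roughly, the resulting profile has the shape sketched below (field and
    typedef names are illustrative, following the description rather than
    the exact header):

    struct blk_integrity_profile {
            integrity_processing_fn *generate_fn;
            integrity_processing_fn *verify_fn;
            integrity_prepare_fn    *prepare_fn;    /* remap PI before dispatch */
            integrity_complete_fn   *complete_fn;   /* restore PI on completion */
            const char              *name;
    };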

    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Suggested-by: Martin K. Petersen
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Max Gurtovoy

    Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY
    isn't defined in the config.

    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • Pull block updates from Jens Axboe:

    - Two NVMe pull requests:
        - ana log parse fix from Anton
        - nvme quirks support for Apple devices from Ben
        - fix missing bio completion tracing for multipath stack devices
          from Hannes and Mikhail
        - IP TOS settings for nvme rdma and tcp transports from Israel
        - rq_dma_dir cleanups from Israel
        - tracing for Get LBA Status command from Minwoo
        - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
        - Some consolidation between the fabrics transports for handling
          the CAP register
        - reset race with ns scanning fix for fabrics (move fabrics
          commands to a dedicated request queue with a different lifetime
          from the admin request queue)
        - controller reset and namespace scan races fixes
        - nvme discovery log change uevent support
        - naming improvements from Keith
        - multiple discovery controllers reject fix from James
        - some regular cleanups from various people

    - Series fixing (and re-fixing) null_blk debug printing and nr_devices
    checks (André)

    - A few pull requests from Song, with fixes from Andy, Guoqing,
    Guilherme, Neil, Nigel, and Yufen.

    - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

    - Bio merge handling unification (Christoph)

    - Pick default elevator correctly for devices with special needs
    (Damien)

    - Block stats fixes (Hou)

    - Timeout and support devices nbd fixes (Mike)

    - Series fixing races around elevator switching and device add/remove
    (Ming)

    - sed-opal cleanups (Revanth)

    - Per device weight support for BFQ (Fam)

    - Support for blk-iocost, a new model that can properly account cost of
    IO workloads. (Tejun)

    - blk-cgroup writeback fixes (Tejun)

    - paride queue init fixes (zhengbin)

    - blk_set_runtime_active() cleanup (Stanley)

    - Block segment mapping optimizations (Bart)

    - lightnvm fixes (Hans/Minwoo/YueHaibing)

    - Various little fixes and cleanups

    * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
    null_blk: format pr_* logs with pr_fmt
    null_blk: match the type of parameter nr_devices
    null_blk: do not fail the module load with zero devices
    block: also check RQF_STATS in blk_mq_need_time_stamp()
    block: make rq sector size accessible for block stats
    bfq: Fix bfq linkage error
    raid5: use bio_end_sector in r5_next_bio
    raid5: remove STRIPE_OPS_REQ_PENDING
    md: add feature flag MD_FEATURE_RAID0_LAYOUT
    md/raid0: avoid RAID0 data corruption due to layout confusion.
    raid5: don't set STRIPE_HANDLE to stripe which is in batch list
    raid5: don't increment read_errors on EILSEQ return
    nvmet: fix a wrong error status returned in error log page
    nvme: send discovery log page change events to userspace
    nvme: add uevent variables for controller devices
    nvme: enable aen regardless of the presence of I/O queues
    nvme-fabrics: allow discovery subsystems accept a kato
    nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
    nvme: Remove redundant assignment of cq vector
    nvme: Assign subsys instance from first ctrl
    ...

    Linus Torvalds
     
  • Pull core timer updates from Thomas Gleixner:
    "Timers and timekeeping updates:

    - A large overhaul of the posix CPU timer code which is a preparation
    for moving the CPU timer expiry out into task work so it can be
    properly accounted on the task/process.

    An update to the bogus permission checks will come later during the
    merge window as feedback was not complete before heading off for
    travel.

    - Switch the timerqueue code to use cached rbtrees and get rid of the
    homebrewn caching of the leftmost node.

    - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
    single function

    - Implement the separation of hrtimers to be forced to expire in hard
    interrupt context even when PREEMPT_RT is enabled and mark the
    affected timers accordingly.

    - Implement a mechanism for hrtimers and the timer wheel to protect
    RT against priority inversion and live lock issues when a (hr)timer
    which should be canceled is currently executing the callback.
    Instead of infinitely spinning, the task which tries to cancel the
    timer blocks on a per cpu base expiry lock which is held and
    released by the (hr)timer expiry code.

    - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
    resulting in faster access to timekeeping functions.

    - Updates to various clocksource/clockevent drivers and their device
    tree bindings.

    - The usual small improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
    posix-cpu-timers: Fix permission check regression
    posix-cpu-timers: Always clear head pointer on dequeue
    hrtimer: Add a missing bracket and hide `migration_base' on !SMP
    posix-cpu-timers: Make expiry_active check actually work correctly
    posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
    tick: Mark sched_timer to expire in hard interrupt context
    hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
    x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
    posix-cpu-timers: Utilize timerqueue for storage
    posix-cpu-timers: Move state tracking to struct posix_cputimers
    posix-cpu-timers: Deduplicate rlimit handling
    posix-cpu-timers: Remove pointless comparisons
    posix-cpu-timers: Get rid of 64bit divisions
    posix-cpu-timers: Consolidate timer expiry further
    posix-cpu-timers: Get rid of zero checks
    rlimit: Rewrite non-sensical RLIMIT_CPU comment
    posix-cpu-timers: Respect INFINITY for hard RTTIME limit
    posix-cpu-timers: Switch thread group sampling to array
    posix-cpu-timers: Restructure expiry array
    posix-cpu-timers: Remove cputime_expires
    ...

    Linus Torvalds
     

16 Sep, 2019

2 commits

  • In __blk_mq_end_request(), if block stats need updating, we should
    ensure 'now' is valid instead of 0 even when iostat is disabled.
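
    A sketch of the check this implies (illustrative; the exact flags test
    is an assumption based on the description and the shortlog entry
    "block: also check RQF_STATS in blk_mq_need_time_stamp()"):

    static inline bool blk_mq_need_time_stamp(struct request *rq)
    {
            /* blk-stats (RQF_STATS), iostat (RQF_IO_STAT) or an attached
             * scheduler all need a real timestamp at completion */
            return (rq->rq_flags & (RQF_STATS | RQF_IO_STAT)) || rq->q->elevator;
    }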

    Signed-off-by: Hou Tao
    Signed-off-by: Jens Axboe

    Hou Tao
     
  • Currently rq->data_len is decreased by partial completion or
    zeroed by completion, so when blk_stat_add() is invoked, data_len
    will be zero and there will never be samples in poll_cb because
    blk_mq_poll_stats_bkt() returns -1 if data_len is zero.

    We could move blk_stat_add() back to __blk_mq_complete_request(),
    but that would defeat the effort to call ktime_get_ns() only once.
    Instead we can reuse the throtl_size field, use it for both block
    stats and block throttle, and adjust the logic in
    blk_mq_poll_stats_bkt() accordingly.
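
    Sketched effect on the bucket calculation (illustrative;
    blk_rq_stats_sectors() is assumed to return the size preserved at
    issue time, per the companion patch):

    static int blk_mq_poll_stats_bkt(const struct request *rq)
    {
            int ddir = rq_data_dir(rq);
            /* use the size recorded at issue time, not the zeroed data_len */
            int sectors = blk_rq_stats_sectors(rq);
            int bucket = ddir + 2 * ilog2(sectors);

            if (bucket < 0)
                    return -1;
            if (bucket >= BLK_MQ_POLL_STATS_BKTS)
                    return ddir + BLK_MQ_POLL_STATS_BKTS - 2;
            return bucket;
    }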

    Fixes: 4bc6339a583c ("block: move blk_stat_add() to __blk_mq_end_request()")
    Tested-by: Pavel Begunkov
    Signed-off-by: Hou Tao
    Signed-off-by: Jens Axboe

    Hou Tao
     

06 Sep, 2019

3 commits

  • When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
    the only information known about the device is the number of hardware
    queues, as the block device scan by the device driver is not completed
    yet for most drivers. The device type and elevator required features
    are not yet set, preventing the correct selection of the default
    elevator most suitable for the device.

    This currently affects all multi-queue zoned block devices, which
    default to the "none" elevator instead of the required "mq-deadline"
    elevator. These devices currently include host-managed SMR disks
    connected to a smartpqi HBA and null_blk block devices with zoned
    mode enabled. Upcoming NVMe Zoned Namespace devices will also be
    affected.

    Fix this by adding the boolean elevator_init argument to
    blk_mq_init_allocated_queue() to control the execution of
    elevator_init_mq(). Two cases exist:
    1) elevator_init = false is used for calls to
       blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
       case, a call to elevator_init_mq() is added to __device_add_disk(),
       resulting in the delayed initialization of the queue elevator
       after the device driver has finished probing the device
       information. This effectively allows elevator_init_mq() access to
       more information about the device.
    2) elevator_init = true preserves the current behavior of initializing
       the elevator directly from blk_mq_init_allocated_queue(). This case
       is used for the special request-based DM devices where the device
       gendisk is created before the queue initialization and the device
       information (e.g. queue limits) is already known when the queue
       initialization is executed.

    Additionally, to make sure that the elevator initialization is never
    done while requests are in-flight (there should be none when the device
    driver calls device_add_disk()), freeze and quiesce the device request
    queue before calling blk_mq_init_sched() in elevator_init_mq().
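
    In sketch form, the serialization added in elevator_init_mq() looks
    roughly like this (illustrative ordering, not the literal diff; 'e' is
    the elevator type selected for the queue):

    blk_mq_freeze_queue(q);
    blk_mq_quiesce_queue(q);

    err = blk_mq_init_sched(q, e);   /* attach the selected elevator */

    blk_mq_unquiesce_queue(q);
    blk_mq_unfreeze_queue(q);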

    Reviewed-by: Ming Lei
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • If the default elevator chosen is mq-deadline, elevator_init_mq() may
    return an error if mq-deadline initialization fails, leading to
    blk_mq_init_allocated_queue() returning an error, which in turn will
    cause the block device initialization to fail and the device not being
    exposed.

    Instead of taking such an extreme measure, handle mq-deadline
    initialization failures in the same manner as when mq-deadline is not
    available (no module to load), that is, default to the "none"
    scheduler. With this change, the return type of elevator_init_mq()
    can be changed to void.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before
    calling elevator_init_mq() to make sure that the queue supports IO
    scheduling, use the elevator.c function elv_support_iosched() in
    elevator_init_mq(). This does not introduce any functional change but
    ensures that elevator_init_mq() does the right thing based on the
    queue settings.

    Reviewed-by: Ming Lei
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

29 Aug, 2019

1 commit

  • There are currently two start time timestamps - start_time_ns and
    io_start_time_ns. The former marks the request allocation and the
    latter the issue-to-device time. The planned io.weight controller
    needs to measure the total time bios take to execute after they leave
    rq_qos, including the time spent waiting for a request to become
    available, which can easily dominate on saturated devices.

    This patch adds request->alloc_time_ns which records when the request
    allocation attempt started. As it isn't used for the usual stats,
    make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and
    QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are
    no users and it's active only on queues which need it even when
    compiled in.
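
    Schematically, the new field sits alongside the existing timestamps
    and is compiled out when unused (placement and comments are
    illustrative):

    struct request {
            /* ... */
    #ifdef CONFIG_BLK_RQ_ALLOC_TIME
            u64 alloc_time_ns;      /* when the allocation attempt started */
    #endif
            u64 start_time_ns;      /* request allocated */
            u64 io_start_time_ns;   /* request issued to the device */
            /* ... */
    };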

    v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME
    gating as suggested by Jens.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Aug, 2019

1 commit

  • blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue()
    and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
    isn't exposed to userspace yet. For the latter caller, hctx sysfs entries
    and debugfs are un-registered before updating nr_hw_queues.

    On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
    request queue is freed") moved freeing hctx into the queue's release
    handler, so there is no race with the queue release path either.

    So don't hold q->sysfs_lock in blk_mq_map_swqueue().

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Cc: Bart Van Assche
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

16 Aug, 2019

1 commit

  • We had a few issues with this code, and there's still a problem around
    how we deal with error handling for chained/split bios. For now, just
    revert the code and we'll try again with a thorough solution. This
    reverts commits:

    e15c2ffa1091 ("block: fix O_DIRECT error handling for bio fragments")
    0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
    6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
    893a1c97205a ("blk-mq: allow REQ_NOWAIT to return an error inline")

    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Aug, 2019

2 commits

  • If blk_mq_init_allocated_queue->elevator_init_mq fails, we need to
    release the previously requested resources.

    Fixes: d34849913819 ("blk-mq-sched: allow setting of default IO scheduler")
    Signed-off-by: zhengbin
    Signed-off-by: Jens Axboe

    zhengbin
     
  • blk_exit_queue will free elevator_data, while blk_mq_requeue_work
    will access it. Move the cancel of requeue_work to the front of
    blk_exit_queue to avoid a use-after-free:

    blk_exit_queue                blk_mq_requeue_work
      __elevator_exit               blk_mq_run_hw_queues
        blk_mq_exit_sched             blk_mq_run_hw_queue
          dd_exit_queue                 blk_mq_hctx_has_pending
            kfree(elevator_data)          blk_mq_sched_has_work
                                            dd_has_work
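
    In code terms, the ordering this enforces is roughly (illustrative,
    not the literal diff):

    cancel_delayed_work_sync(&q->requeue_work);  /* stop requeue work first */
    blk_exit_queue(q);   /* now __elevator_exit() may free elevator_data safely */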

    Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release")
    Cc: stable@vger.kernel.org
    Reviewed-by: Ming Lei
    Signed-off-by: zhengbin
    Signed-off-by: Jens Axboe

    zhengbin
     

05 Aug, 2019

2 commits

  • blk_mq_tagset_wait_completed_request() is now used to wait for the
    completion fn of cancelled requests, so it is no longer necessary to
    use blk_mq_complete_request_sync().
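
    Roughly, the teardown pattern becomes (illustrative; 'cancel_request'
    stands in for the driver's abort callback):

    blk_mq_tagset_busy_iter(tagset, cancel_request, NULL);
    blk_mq_tagset_wait_completed_request(tagset);  /* all ->complete() done */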

    Cc: Max Gurtovoy
    Cc: Sagi Grimberg
    Cc: Keith Busch
    Cc: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • NVMe needs this function to decide whether a request to be aborted has
    already been completed in the normal IO path.

    So introduce it.
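
    A sketch of the intended use (illustrative; 'abort_request' is a
    stand-in for the driver's abort helper):

    if (blk_mq_request_completed(rq))
            return;            /* normal completion won the race, nothing to abort */
    abort_request(rq);         /* illustrative driver-side abort */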

    Cc: Max Gurtovoy
    Cc: Sagi Grimberg
    Cc: Keith Busch
    Cc: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Aug, 2019

2 commits

  • hrtimer_sleepers will gain a scheduling class dependent treatment on
    PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make
    that possible.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • hrtimer_init_sleeper() calls require prior initialisation of the hrtimer
    object which is embedded into the hrtimer_sleeper.

    Combine the initialization and spare a function call. Fixup all call sites.

    This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper
    specific initializations of the embedded hrtimer without modifying any of
    the call sites.

    No functional change.

    [ anna-maria: Minor cleanups ]
    [ tglx: Adopted to the removal of the task argument of
    hrtimer_init_sleeper() and trivial polishing.
    Folded a fix from Stephen Rothwell for the vsoc code ]

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de

    Sebastian Andrzej Siewior
     


22 Jul, 2019

1 commit

  • By default, if a caller sets REQ_NOWAIT and we need to block, we'll
    return -EAGAIN through the bio->bi_end_io() callback. For some use
    cases, this makes it hard to use.

    Allow a caller to ask for inline return of errors related to
    blocking by also setting REQ_NOWAIT_INLINE.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jul, 2019

1 commit

  • Simultaneously writing to a sequential zone of a zoned block device
    from multiple contexts requires mutual exclusion for BIO issuing to
    ensure that writes happen sequentially. However, even for a well
    behaved user correctly implementing such synchronization, BIO plugging
    may interfere and result in BIOs from the different contexts being
    reordered if plugging is done outside of the mutual exclusion section,
    e.g. the plug was started by a function higher in the call chain than
    the function issuing BIOs.

    Context A                       Context B

    | blk_start_plug()
    | ...
    | seq_write_zone()
    |   mutex_lock(zone)
    |   bio-0->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-0)
    |   submit_bio(bio-0)
    |   bio-1->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-1)
    |   submit_bio(bio-1)
    |   mutex_unlock(zone)
    |   return
    | ----------------------------> | seq_write_zone()
                                     |   mutex_lock(zone)
                                     |   bio-2->bi_iter.bi_sector = zone->wp
                                     |   zone->wp += bio_sectors(bio-2)
                                     |   submit_bio(bio-2)
                                     |   mutex_unlock(zone)
                                     |
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

03 Jul, 2019

2 commits

  • Move the blk_mq_bio_to_request() call in front of the if-statement.

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Minwoo Im
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Jun, 2019

2 commits

  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove the
    pointless struct request_queue arguments from any of the functions
    that had it and grew a nr_segs argument.
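
    Sketch of the resulting interface shape (names follow the description;
    the exact signatures in the tree may differ):

    void __blk_queue_split(struct request_queue *q, struct bio **bio,
                           unsigned int *nr_segs);

    /* callers then pass nr_segs down instead of reading it from the bio */
    __blk_queue_split(q, &bio, &nr_segs);
    blk_mq_sched_bio_merge(q, bio, nr_segs);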

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • lightnvm should have never used this function, as it is sending
    passthrough requests, so switch it to blk_rq_append_bio like all the
    other passthrough request users. Inline blk_init_request_from_bio into
    the only remaining caller.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Minwoo Im
    Reviewed-by: Javier González
    Reviewed-by: Matias Bjørling
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     


24 May, 2019

1 commit

  • The following is a description of a hang in blk_mq_freeze_queue_wait().
    The hang happens on attempt to freeze a queue while another task does
    queue unfreeze.

    The root cause is an incorrect sequence of percpu_ref_resurrect() and
    percpu_ref_kill(); as a result those two can be swapped:

    CPU#0                           CPU#1
    ----------------                -----------------
    q1 = blk_mq_init_queue(shared_tags)

                                    q2 = blk_mq_init_queue(shared_tags):
                                      blk_mq_add_queue_tag_set(shared_tags):
                                        blk_mq_update_tag_set_depth(shared_tags):
                                          list_for_each_entry()
                                            blk_mq_freeze_queue(q1)
                                             > percpu_ref_kill()
                                             > blk_mq_freeze_queue_wait()

    blk_cleanup_queue(q1)
      blk_mq_freeze_queue(q1)
       > percpu_ref_kill()
         ^^^^^^ freeze_depth can't guarantee the order

                                            blk_mq_unfreeze_queue()
                                             > percpu_ref_resurrect()

       > blk_mq_freeze_queue_wait()
         ^^^^^^ Hang here!!!!

    This wrong sequence raises a kernel warning:
    percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
    WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

    But the most unpleasant effect is a hang in blk_mq_freeze_queue_wait(),
    which waits for q_usage_counter to reach zero; that never happens
    because the percpu-ref was reinitialized (instead of being killed) and
    stays in PERCPU state forever.

    How to reproduce:
    - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
    - cpu0: python Script.py 0; taskset the corresponding process running on cpu0
    - cpu1: python Script.py 1; taskset the corresponding process running on cpu1

    Script.py:
    ------
    #!/usr/bin/python3

    import os
    import sys

    while True:
        on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        os.system(on)
        os.system(off)
    ------

    This bug was first reported and fixed by Roman, previous discussion:
    [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
    [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
    [3] https://patchwork.kernel.org/patch/9268199/

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Roman Pen
    Signed-off-by: Bob Liu
    Signed-off-by: Jens Axboe

    Bob Liu
     

04 May, 2019

4 commits

  • In the normal queue cleanup path, hctx is released after the request
    queue is freed, see blk_mq_release().

    However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
    of hw queue shrinking. This can easily cause a use-after-free, because
    one implicit rule is that it is safe to call almost all block layer
    APIs if the request queue is alive; so one hctx may be retrieved by
    one API and then freed by blk_mq_update_nr_hw_queues(), and finally
    the use-after-free is triggered.

    Fix this issue by always freeing hctx after releasing the request
    queue. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
    introduce a per-queue list to hold them, then try to reuse these
    hctxs if the NUMA node matches.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Split blk_mq_alloc_and_init_hctx() into two parts: blk_mq_alloc_hctx()
    allocates all hctx resources, and blk_mq_init_hctx() initializes the
    hctx and serves as the counterpart of blk_mq_exit_hctx().

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Once blk_cleanup_queue() returns, tags shouldn't be used any more,
    because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
    ("blk-mq: Fix a use-after-free") fixes this issue exactly.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    we were allowed to run the queue during queue cleanup as long as the
    queue's kobj refcount was held. After that commit, the queue can't be
    run during queue cleanup, otherwise an oops can be triggered easily
    because some fields of hctx are freed by blk_mq_free_queue() in
    blk_cleanup_queue().

    We have invented ways to address this kind of issue before, such as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But these still can't cover all cases; recently James reported another
    issue of this kind:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue is quite hard to address with the previous approach, given
    that scsi_run_queue() may run requeues for other LUNs.

    Fix the above issue by freeing hctx's resources in its release handler;
    this is safe because tags aren't needed for freeing such hctx
    resources.

    This approach follows the typical design pattern for a kobject's
    release handler.
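
    A sketch of that pattern (illustrative; the helper that frees the hctx
    resources is a stand-in):

    static void blk_mq_hw_sysfs_release(struct kobject *kobj)
    {
            struct blk_mq_hw_ctx *hctx = container_of(kobj,
                                            struct blk_mq_hw_ctx, kobj);

            free_hctx_resources(hctx);   /* illustrative: cpumask, ctxs, srcu */
            kfree(hctx);
    }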

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • While holding the queue's kobject refcount, it is safe for a driver
    to schedule a requeue. However, blk_mq_kick_requeue_list() may
    be called after blk_sync_queue() is done because of concurrent
    requeue activity, so the requeue work may not be completed when
    the queue is freed, and a kernel oops is triggered.

    So move the cancel of requeue_work into blk_mq_release() to
    avoid the race between requeue and freeing the queue.
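
    Roughly (illustrative, not the literal diff):

    void blk_mq_release(struct request_queue *q)
    {
            cancel_delayed_work_sync(&q->requeue_work);
            /* ... free hctxs and the remaining blk-mq resources ... */
    }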

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     


14 Apr, 2019

1 commit

  • A previous commit moved the shallow depth and BFQ depth map calculations
    to init time, moving them outside of the hotter IO path. This
    potentially causes hangs if the user changes the depth of the scheduler
    map by writing to the 'nr_requests' sysfs file for that device.

    Add a blk-mq-sched hook that allows blk-mq to inform the scheduler when
    the depth changes, so that the scheduler can update its internal state.
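
    The hook is roughly of this shape (illustrative; the callback name
    follows the description above):

    struct elevator_mq_ops {
            /* ... existing scheduler hooks ... */
            void (*depth_updated)(struct blk_mq_hw_ctx *hctx);
    };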

    Tested-by: Kai Krakow
    Reported-by: Paolo Valente
    Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2019

1 commit

  • NVMe's error handler follows the typical steps of tearing down the
    hardware to recover the controller:

    1) stop blk_mq hw queues
    2) stop the real hw queues
    3) cancel in-flight requests via
       blk_mq_tagset_busy_iter(tags, cancel_request, ...)
       cancel_request():
          mark the request as aborted
          blk_mq_complete_request(req);
    4) destroy real hw queues

    However, there may be a race between #3 and #4, because
    blk_mq_complete_request() may run q->mq_ops->complete(rq) remotely and
    asynchronously, and ->complete(rq) may run after #4.

    This patch introduces blk_mq_complete_request_sync() to fix the
    above race.
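
    A sketch of how the synchronous variant is used in step 3
    (illustrative; the abort marking is a stand-in for the driver's own
    handling):

    static bool cancel_request(struct request *rq, void *data, bool reserved)
    {
            mark_request_aborted(rq);           /* illustrative driver helper */
            blk_mq_complete_request_sync(rq);   /* ->complete(rq) runs before return */
            return true;
    }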

    Cc: Sagi Grimberg
    Cc: Bart Van Assche
    Cc: James Smart
    Cc: linux-nvme@lists.infradead.org
    Reviewed-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Apr, 2019

1 commit

  • blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
    have been queued. If that happens when blk_mq_try_issue_directly() is called
    by the dm-mpath driver then dm-mpath will try to resubmit a request that is
    already queued and a kernel crash follows. Since it is nontrivial to fix
    blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
    changes that went into kernel v5.0.

    This patch reverts the following commits:
    * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
    * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
    * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: James Smart
    Cc: Dongli Zhang
    Cc: Laurence Oberman
    Reported-by: Laurence Oberman
    Tested-by: Laurence Oberman
    Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

04 Apr, 2019

1 commit