Eric Lee / smarc-fsl-linux-kernel

13 May, 2017

1 commit

0fcc3ab23 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm ... Browse Code »

Pull libnvdimm fixes from Dan Williams:
"Incremental fixes and a small feature addition on top of the main
libnvdimm 4.12 pull request:

- Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
The size regression is fixed by moving all dax helpers into the
dax-core and only specifying "select DAX" for FS_DAX and
dax-capable drivers. He also asked for clarification of the
NR_DEV_DAX config option which, on closer look, does not need to be
a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
for good measure.

- Ben's attention to detail on -stable patch submissions caught a
case where the recent fixes to arch_copy_from_iter_pmem() missed a
condition where we strand dirty data in the cache. This is tagged
for -stable and will also be included in the rework of the pmem api
to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

- Vishal adds a feature that missed the initial pull due to pending
review feedback. It allows the kernel to clear media errors when
initializing a BTT (atomic sector update driver) instance on a pmem
namespace.

- Ross noticed that the dax_device + dax_operations conversion broke
__dax_zero_page_range(). The nvdimm unit tests fail to check this
path, but xfstests immediately trips over it. No excuse for missing
this before submitting the 4.12 pull request.

These all pass the nvdimm unit tests and an xfstests spot check. The
set has received a build success notification from the kbuild robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
filesystem-dax: fix broken __dax_zero_page_range() conversion
libnvdimm, btt: ensure that initializing metadata clears poison
libnvdimm: add an atomic vs process context flag to rw_bytes
x86, pmem: Fix cache flushing for iovec write < 8 bytes
device-dax: kill NR_DEV_DAX
block, dax: move "select DAX" from BLOCK to FS_DAX
device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

Linus Torvalds
2017-05-13 06:43:10 +0800

11 May, 2017

1 commit

ed6565e73 block: handle partial completions for special payload requests ... Browse Code »

SCSI devices can return short writes on Write Same just like for normal
writes, so we need to handle this case for our special payload requests
as well.

Signed-off-by: Christoph Hellwig
Reported-by: Abdul Haleem
Tested-by: Abdul Haleem
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-05-11 22:08:53 +0800

10 May, 2017

5 commits

f36ea50ca blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op ... Browse Code »

When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns
"Input/output error". Looks block layer split the bio after calling
bio_integrity_prep(bio). This patch fixes the issue.

Below is how we debug this issue:
(1)format nvme to 4K block # size with type 2 DIF
(2)dd with block size bigger than 1024k.
oflag=direct
dd: error writing '/dev/nvme0n1': Input/output error

We added some debug code in nvme device driver. It showed us the first
op and the second op have the same bi and pi address. This is not
correct.

1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828

2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605 ==> This op fails and subsequent 5 retires..
Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828

With the fix, It showed us both of the first op and the second op have
correct bi and pi address.

1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
0x00002828
2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605
Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
0x00003028

Signed-off-by: Wen Xiong
Signed-off-by: Jens Axboe

Wen Xiong
2017-05-10 22:02:58 +0800
d37381239 blk-stat: don't use this_cpu_ptr() in a preemptable section ... Browse Code »

If PREEMPT_RCU is enabled, rcu_read_lock() isn't strong enough
for us to use this_cpu_ptr() in that section. Use the safer
get/put_cpu_ptr() variants instead.

Reported-by: Mike Galbraith
Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
Signed-off-by: Jens Axboe

Jens Axboe
2017-05-10 21:40:18 +0800
340ff3216 elevator: remove redundant warnings on IO scheduler switch ... Browse Code »

We warn twice for switching to a scheduler, if that switch fails.
As we also report the failure in the return value to the
sysfs write, remove the dmesg induced failures.

Keep the failure print for warning to switch to the kconfig
selected IO scheduler, as we can't report errors for that in
any other way.

Signed-off-by: Jens Axboe

Jens Axboe
2017-05-10 21:40:04 +0800
43c1b3d6e block, bfq: stress that low_latency must be off to get max throughput ... Browse Code »

The introduction of the BFQ and Kyber I/O schedulers has triggered a
new wave of I/O benchmarks. Unfortunately, comments and discussions on
these benchmarks confirm that there is still little awareness that it
is very hard to achieve, at the same time, a low latency and a high
throughput. In particular, virtually all benchmarks measure
throughput, or throughput-related figures of merit, but, for BFQ, they
use the scheduler in its default configuration. This configuration is
geared, instead, toward a low latency. This is evidently a sign that
BFQ documentation is still too unclear on this important aspect. This
commit addresses this issue by stressing how BFQ configuration must be
(easily) changed if the only goal is maximum throughput.

Signed-off-by: Paolo Valente
Signed-off-by: Jens Axboe

Paolo Valente
2017-05-10 21:39:43 +0800
a66c38a17 block, bfq: use pointer entity->sched_data only if set ... Browse Code »

In the function __bfq_deactivate_entity, the pointer
entity->sched_data could happen to be used before being properly
initialized. This led to a NULL pointer dereference. This commit fixes
this bug by just using this pointer only where it is safe to do so.

Reported-by: Tom Harrison
Tested-by: Tom Harrison
Signed-off-by: Paolo Valente
Signed-off-by: Jens Axboe

Paolo Valente
2017-05-10 21:39:43 +0800

09 May, 2017

1 commit

ef5104247 block, dax: move "select DAX" from BLOCK to FS_DAX ... Browse Code »

For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.

Given that the 'direct_access' method has been removed from
'block_device_operations', we can also go ahead and remove the
block-related dax helper functions from fs/block_dev.c to
drivers/dax/super.c. This keeps dax details out of the block layer and
lets the DAX core be built as a module in the FS_DAX=n case.

Filesystems need to include dax.h to call bdev_dax_supported().

Cc: linux-xfs@vger.kernel.org
Cc: Jens Axboe
Cc: "Theodore Ts'o"
Cc: Matthew Wilcox
Cc: Alexander Viro
Cc: "Darrick J. Wong"
Cc: Ross Zwisler
Reviewed-by: Jan Kara
Reported-by: Geert Uytterhoeven
Signed-off-by: Dan Williams

Dan Williams
2017-05-09 01:55:27 +0800

08 May, 2017

2 commits

ebd768579 blk-mq: make __blk_mq_stop_hw_queues static ... Browse Code »

Making __blk_mq_stop_hw_queues static fixes sparse warning:

block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
declared. Should it be static?

Fixes: 2719aa217e0d0 ("blk-mq: don't use sync workqueue flushing from drivers")
Signed-off-by: Colin Ian King
Signed-off-by: Jens Axboe

Colin Ian King
2017-05-08 22:09:39 +0800
51d638b1f block/mq: fix potential deadlock during cpu hotplug ... Browse Code »

This can be triggered by hot-unplug one cpu.

======================================================
[ INFO: possible circular locking dependency detected ]
4.11.0+ #17 Not tainted
-------------------------------------------------------
step_after_susp/2640 is trying to acquire lock:
(all_q_mutex){+.+...}, at: [] blk_mq_queue_reinit_work+0x18/0x110

but task is already holding lock:
(cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x7f/0xe0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (cpu_hotplug.lock){+.+.+.}:
lock_acquire+0x11c/0x230
__mutex_lock+0x92/0x990
mutex_lock_nested+0x1b/0x20
get_online_cpus+0x64/0x80
blk_mq_init_allocated_queue+0x3a0/0x4e0
blk_mq_init_queue+0x3a/0x60
loop_add+0xe5/0x280
loop_init+0x124/0x177
do_one_initcall+0x53/0x1c0
kernel_init_freeable+0x1e3/0x27f
kernel_init+0xe/0x100
ret_from_fork+0x31/0x40

-> #0 (all_q_mutex){+.+...}:
__lock_acquire+0x189a/0x18a0
lock_acquire+0x11c/0x230
__mutex_lock+0x92/0x990
mutex_lock_nested+0x1b/0x20
blk_mq_queue_reinit_work+0x18/0x110
blk_mq_queue_reinit_dead+0x1c/0x20
cpuhp_invoke_callback+0x1f2/0x810
cpuhp_down_callbacks+0x42/0x80
_cpu_down+0xb2/0xe0
freeze_secondary_cpus+0xb6/0x390
suspend_devices_and_enter+0x3b3/0xa40
pm_suspend+0x129/0x490
state_store+0x82/0xf0
kobj_attr_store+0xf/0x20
sysfs_kf_write+0x45/0x60
kernfs_fop_write+0x135/0x1c0
__vfs_write+0x37/0x160
vfs_write+0xcd/0x1d0
SyS_write+0x58/0xc0
do_syscall_64+0x8f/0x710
return_from_SYSCALL_64+0x0/0x7a

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(cpu_hotplug.lock);
lock(all_q_mutex);
lock(cpu_hotplug.lock);
lock(all_q_mutex);

*** DEADLOCK ***

8 locks held by step_after_susp/2640:
#0: (sb_writers#6){.+.+.+}, at: [] vfs_write+0x1ad/0x1d0
#1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0x101/0x1c0
#2: (s_active#166){.+.+.+}, at: [] kernfs_fop_write+0x109/0x1c0
#3: (pm_mutex){+.+...}, at: [] pm_suspend+0x21d/0x490
#4: (acpi_scan_lock){+.+.+.}, at: [] acpi_scan_lock_acquire+0x17/0x20
#5: (cpu_add_remove_lock){+.+.+.}, at: [] freeze_secondary_cpus+0x27/0x390
#6: (cpu_hotplug.dep_map){++++++}, at: [] cpu_hotplug_begin+0x5/0xe0
#7: (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x7f/0xe0

stack backtrace:
CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
Call Trace:
dump_stack+0x99/0xce
print_circular_bug+0x1fa/0x270
__lock_acquire+0x189a/0x18a0
lock_acquire+0x11c/0x230
? lock_acquire+0x11c/0x230
? blk_mq_queue_reinit_work+0x18/0x110
? blk_mq_queue_reinit_work+0x18/0x110
__mutex_lock+0x92/0x990
? blk_mq_queue_reinit_work+0x18/0x110
? kmem_cache_free+0x2cb/0x330
? anon_transport_class_unregister+0x20/0x20
? blk_mq_queue_reinit_work+0x110/0x110
mutex_lock_nested+0x1b/0x20
? mutex_lock_nested+0x1b/0x20
blk_mq_queue_reinit_work+0x18/0x110
blk_mq_queue_reinit_dead+0x1c/0x20
cpuhp_invoke_callback+0x1f2/0x810
? __flow_cache_shrink+0x160/0x160
cpuhp_down_callbacks+0x42/0x80
_cpu_down+0xb2/0xe0
freeze_secondary_cpus+0xb6/0x390
suspend_devices_and_enter+0x3b3/0xa40
? rcu_read_lock_sched_held+0x79/0x80
pm_suspend+0x129/0x490
state_store+0x82/0xf0
kobj_attr_store+0xf/0x20
sysfs_kf_write+0x45/0x60
kernfs_fop_write+0x135/0x1c0
__vfs_write+0x37/0x160
? rcu_read_lock_sched_held+0x79/0x80
? rcu_sync_lockdep_assert+0x2f/0x60
? __sb_start_write+0xd9/0x1c0
? vfs_write+0x1ad/0x1d0
vfs_write+0xcd/0x1d0
SyS_write+0x58/0xc0
? rcu_read_lock_sched_held+0x79/0x80
do_syscall_64+0x8f/0x710
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL64_slow_path+0x25/0x25

The cpu hotplug path will hold cpu_hotplug.lock and then reinit all exiting
queues for blk mq w/ all_q_mutex, however, blk_mq_init_allocated_queue() will
contend these two locks in the inversion order. This is due to commit eabe06595d62
(blk/mq: Cure cpu hotplug lock inversion), it fixes a cpu hotplug lock inversion
issue because of hotplug rework, however the hotplug rework is still work-in-progress
and lives in a -tip branch and mainline cannot yet trigger that splat. The commit
breaks the linus's tree in the merge window, so this patch reverts the lock order
and avoids to splat linus's tree.

Cc: Jens Axboe
Cc: Peter Zijlstra (Intel)
Cc: Thomas Gleixner
Signed-off-by: Wanpeng Li
Signed-off-by: Jens Axboe

Wanpeng Li
2017-05-08 09:50:11 +0800

07 May, 2017

1 commit

044f1daaa Merge branch 'for-linus' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes and updates from Jens Axboe:
"Some fixes and followup features/changes that should go in, in this
merge window. This contains:

- Two fixes for lightnvm from Javier, fixing problems in the new code
merge previously in this merge window.

- A fix from Jan for the backing device changes, fixing an issue in
NFS that causes a failure to mount on certain setups.

- A change from Christoph, cleaning up the blk-mq init and exit
request paths.

- Remove elevator_change(), which is now unused. From Bart.

- A fix for queue operation invocation on a dead queue, from Bart.

- A series fixing up mtip32xx for blk-mq scheduling, removing a
bandaid we previously had in place for this. From me.

- A regression fix for this series, fixing a case where we wait on
workqueue flushing from an invalid (non-blocking) context. From me.

- A fix/optimization from Ming, ensuring that we don't both quiesce
and freeze a queue at the same time.

- A fix from Peter on lock ordering for CPU hotplug. Not a real
problem right now, but will be once the CPU hotplug rework goes in.

- A series from Omar, cleaning up out blk-mq debugfs support, and
adding support for exporting info from schedulers in debugfs as
well. This is really useful in debugging stalls or livelocks. From
Omar"

* 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
mq-deadline: add debugfs attributes
kyber: add debugfs attributes
blk-mq-debugfs: allow schedulers to register debugfs attributes
blk-mq: untangle debugfs and sysfs
blk-mq: move debugfs declarations to a separate header file
blk-mq: Do not invoke queue operations on a dead queue
blk-mq-debugfs: get rid of a bunch of boilerplate
blk-mq-debugfs: rename hw queue directories from to hctx
blk-mq-debugfs: don't open code strstrip()
blk-mq-debugfs: error on long write to queue "state" file
blk-mq-debugfs: clean up flag definitions
blk-mq-debugfs: separate flags with |
nfs: Fix bdi handling for cloned superblocks
block/mq: Cure cpu hotplug lock inversion
lightnvm: fix bad back free on error path
lightnvm: create cmd before allocating request
blk-mq: don't use sync workqueue flushing from drivers
mtip32xx: convert internal commands to regular block infrastructure
mtip32xx: cleanup internal tag assumptions
block: don't call blk_mq_quiesce_queue() after queue is frozen
...

Linus Torvalds
2017-05-07 02:25:08 +0800

06 May, 2017

1 commit

53ef7d0e2 Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm ... Browse Code »

Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in multiple -next releases. There were a few
late breaking fixes and small features that got added in the last
couple days, but the whole set has received a build success
notification from the kbuild robot.

Change summary:

- Region media error reporting: A libnvdimm region device is the
parent to one or more namespaces. To date, media errors have been
reported via the "badblocks" attribute attached to pmem block
devices for namespaces in "raw" or "memory" mode. Given that
namespaces can be in "device-dax" or "btt-sector" mode this new
interface reports media errors generically, i.e. independent of
namespace modes or state.

This subsequently allows userspace tooling to craft "ACPI 6.1
Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
requests and submit them via the ioctl path for NVDIMM root bus
devices.

- Introduce 'struct dax_device' and 'struct dax_operations': Prompted
by a request from Linus and feedback from Christoph this allows for
dax capable drivers to publish their own custom dax operations.
This fixes the broken assumption that all dax operations are
related to a persistent memory device, and makes it easier for
other architectures and platforms to add customized persistent
memory support.

- 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger
memory controllers to drain write-pending buffers that would
otherwise be flushed automatically by the platform ADR
(asynchronous-DRAM-refresh) mechanism at a power loss event.
Support for "locked" DIMMs is included to prevent namespaces from
surfacing when the namespace label data area is locked. Finally,
fixes for various reported deadlocks and crashes, also tagged for
-stable.

- ACPI / nfit driver updates: General updates of the nfit driver to
add DSM command overrides, ACPI 6.1 health state flags support, DSM
payload debug available by default, and various fixes.

Acknowledgements that came after the branch was pushed:

- commmit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
Tested-by: Yi Zhang

- commit 23f498448362 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani "

* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
libnvdimm, pfn: fix 'npfns' vs section alignment
libnvdimm: handle locked label storage areas
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
brd: fix uninitialized use of brd->dax_dev
block, dax: use correct format string in bdev_dax_supported
device-dax: fix sysfs attribute deadlock
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
libnvdimm: rework region badblocks clearing
acpi, nfit: kill ACPI_NFIT_DEBUG
libnvdimm: fix clear length of nvdimm_forget_poison()
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
libnvdimm, region: sysfs trigger for nvdimm_flush()
libnvdimm: fix phys_addr for nvdimm_clear_poison
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block: remove block_device_operations ->direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
filesystem-dax: convert to dax_direct_access()
Revert "block: use DAX for partition table reads"
ext2, ext4, xfs: retrieve dax_device for iomap operations
...

Linus Torvalds
2017-05-06 09:49:20 +0800

04 May, 2017

15 commits

daaadb3e9 mq-deadline: add debugfs attributes ... Browse Code »

Expose the fifo lists, cached next requests, batching state, and
dispatch list. It'd also be possible to add the sorted lists, but there
aren't already seq_file helpers for rbtrees.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:25:17 +0800
16b738f65 kyber: add debugfs attributes ... Browse Code »

Expose the domain token pools, asynchronous sbitmap depth, domain
request lists, and batching state.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:25:17 +0800
d332ce091 blk-mq-debugfs: allow schedulers to register debugfs attributes ... Browse Code »

This provides the infrastructure for schedulers to expose their internal
state through debugfs. We add a list of queue attributes and a list of
hctx attributes to struct elevator_type and wire them up when switching
schedulers.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke

Add missing seq_file.h header in blk-mq-debugfs.h

Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:24:40 +0800
9c1051aac blk-mq: untangle debugfs and sysfs ... Browse Code »

Originally, I tied debugfs registration/unregistration together with
sysfs. There's no reason to do this, and it's getting in the way of
letting schedulers define their own debugfs attributes. Instead, tie the
debugfs registration to the lifetime of the structures themselves.

The saner lifetimes mean we can also get rid of the extra mq directory
and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
just nvme0n1/hctx0/tags.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:24:13 +0800
d173a2516 blk-mq: move debugfs declarations to a separate header file ... Browse Code »

Preparation for adding more declarations.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:23:44 +0800
18d4d7d05 blk-mq: Do not invoke queue operations on a dead queue ... Browse Code »

In commit e869b5462f83 ("blk-mq: Unregister debugfs attributes
earlier"), we shuffled the debugfs cleanup around so that the "state"
attribute was removed before we freed the blk-mq data structures.
However, later changes are going to undo that, so we need to explicitly
disallow running a dead queue.

[Omar: rebased and updated commit message]
Signed-off-by: Omar Sandoval
Signed-off-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Bart Van Assche
2017-05-04 22:23:39 +0800
f57de23ac blk-mq-debugfs: get rid of a bunch of boilerplate ... Browse Code »

A large part of blk-mq-debugfs.c is file_operations and seq_file
boilerplate. This sucks as is but will suck even more when schedulers
can define their own debugfs entries. Factor it all out into a single
blk_mq_debugfs_fops which multiplexes as needed. We store the
request_queue, blk_mq_hw_ctx, or blk_mq_ctx in the parent directory
dentry, which is kind of hacky, but it works.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:23:35 +0800
88aabbd7e blk-mq-debugfs: rename hw queue directories from <n> to hctx<n> ... Browse Code »

It's not clear what these numbered directories represent unless you
consult the code. We're about to get rid of the intermediate "mq"
directory, so these would be even more confusing without that context.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:23:30 +0800
71b90511c blk-mq-debugfs: don't open code strstrip() ... Browse Code »

Slightly more readable, plus we also strip leading spaces.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:23:20 +0800
c7e4145ae blk-mq-debugfs: error on long write to queue "state" file ... Browse Code »

blk_queue_flags_store() currently truncates and returns a short write if
the operation being written is too long. This can give us weird results,
like here:

$ echo "run bar"
echo: write error: invalid argument
$ dmesg
[ 1103.075435] blk_queue_flags_store: unsupported operation bar. Use either 'run' or 'start'

Instead, return an error if the user does this. While we're here, make
the argument names consistent with everywhere else in this file.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:23:16 +0800
1a435111f blk-mq-debugfs: clean up flag definitions ... Browse Code »

Make sure the spelled out flag names match the definition. This also
adds a missing hctx state, BLK_MQ_S_START_ON_RUN, and a missing
cmd_flag, __REQ_NOUNMAP.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:23:11 +0800
bec03d6b9 blk-mq-debugfs: separate flags with | ... Browse Code »

This reads more naturally than spaces.

Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Omar Sandoval
2017-05-04 22:22:28 +0800
eabe06595 block/mq: Cure cpu hotplug lock inversion ... Browse Code »

By poking at /debug/sched_features I triggered the following splat:

[] ======================================================
[] WARNING: possible circular locking dependency detected
[] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
[] ------------------------------------------------------
[] bash/2109 is trying to acquire lock:
[] (cpu_hotplug_lock.rw_sem){++++++}, at: [] static_key_slow_dec+0x1b/0x50
[]
[] but task is already holding lock:
[] (&sb->s_type->i_mutex_key#4){+++++.}, at: [] sched_feat_write+0x86/0x170
[]
[] which lock already depends on the new lock.
[]
[]
[] the existing dependency chain (in reverse order) is:
[]
[] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
[] lock_acquire+0x100/0x210
[] down_write+0x28/0x60
[] start_creating+0x5e/0xf0
[] debugfs_create_dir+0x13/0x110
[] blk_mq_debugfs_register+0x21/0x70
[] blk_mq_register_dev+0x64/0xd0
[] blk_register_queue+0x6a/0x170
[] device_add_disk+0x22d/0x440
[] loop_add+0x1f3/0x280
[] loop_init+0x104/0x142
[] do_one_initcall+0x43/0x180
[] kernel_init_freeable+0x1de/0x266
[] kernel_init+0xe/0x100
[] ret_from_fork+0x31/0x40
[]
[] -> #1 (all_q_mutex){+.+.+.}:
[] lock_acquire+0x100/0x210
[] __mutex_lock+0x6c/0x960
[] mutex_lock_nested+0x1b/0x20
[] blk_mq_init_allocated_queue+0x37c/0x4e0
[] blk_mq_init_queue+0x3a/0x60
[] loop_add+0xe5/0x280
[] loop_init+0x104/0x142
[] do_one_initcall+0x43/0x180
[] kernel_init_freeable+0x1de/0x266
[] kernel_init+0xe/0x100
[] ret_from_fork+0x31/0x40

[] *** DEADLOCK ***
[]
[] 3 locks held by bash/2109:
[] #0: (sb_writers#11){.+.+.+}, at: [] vfs_write+0x17d/0x1a0
[] #1: (debugfs_srcu){......}, at: [] full_proxy_write+0x5d/0xd0
[] #2: (&sb->s_type->i_mutex_key#4){+++++.}, at: [] sched_feat_write+0x86/0x170
[]
[] stack backtrace:
[] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
[] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[] Call Trace:

[] lock_acquire+0x100/0x210
[] get_online_cpus+0x2a/0x90
[] static_key_slow_dec+0x1b/0x50
[] static_key_disable+0x20/0x30
[] sched_feat_write+0x131/0x170
[] full_proxy_write+0x97/0xd0
[] __vfs_write+0x28/0x120
[] vfs_write+0xb5/0x1a0
[] SyS_write+0x49/0xa0
[] entry_SYSCALL_64_fastpath+0x23/0xc2

This is because of the cpu hotplug lock rework. Break the chain at #1
by reversing the lock acquisition order. This way i_mutex_key#4 no
longer depends on cpu_hotplug_lock and things are good.

Cc: Jens Axboe
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Jens Axboe

Peter Zijlstra
2017-05-04 21:57:13 +0800
2719aa217 blk-mq: don't use sync workqueue flushing from drivers ... Browse Code »

A previous commit introduced the sync flush, which we need from
internal callers like blk_mq_quiesce_queue(). However, we also
call the stop helpers from drivers, particularly from ->queue_rq()
when we have to stop processing for a bit. We can't block from
those locations, and we don't have to guarantee that we're
fully flushed.

Fixes: 9f993737906b ("blk-mq: unify hctx delayed_run_work and run_work")
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe

Jens Axboe
2017-05-04 01:44:43 +0800
e5021876c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md ... Browse Code »

Pull MD updates from Shaohua Li:

- Add Partial Parity Log (ppl) feature found in Intel IMSM raid array
by Artur Paszkiewicz. This feature is another way to close RAID5
writehole. The Linux implementation is also available for normal
RAID5 array if specific superblock bit is set.

- A number of md-cluser fixes and enabling md-cluster array resize from
Guoqing Jiang

- A bunch of patches from Ming Lei and Neil Brown to rewrite MD bio
handling related code. Now MD doesn't directly access bio bvec,
bi_phys_segments and uses modern bio API for bio split.

- Improve RAID5 IO pattern to improve performance for hard disk based
RAID5/6 from me.

- Several patches from Song Liu to speed up raid5-cache recovery and
allow raid5 cache feature disabling in runtime.

- Fix a performance regression in raid1 resync from Xiao Ni.

- Other cleanup and fixes from various people.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (84 commits)
md/raid10: skip spare disk as 'first' disk
md/raid1: Use a new variable to count flighting sync requests
md: clear WantReplacement once disk is removed
md/raid1/10: remove unused queue
md: handle read-only member devices better.
md/raid10: wait up frozen array in handle_write_completed
uapi: fix linux/raid/md_p.h userspace compilation error
md-cluster: Fix a memleak in an error handling path
md: support disabling of create-on-open semantics.
md: allow creation of mdNNN arrays via md_mod/parameters/new_array
raid5-ppl: use a single mempool for ppl_io_unit and header_page
md/raid0: fix up bio splitting.
md/linear: improve bio splitting.
md/raid5: make chunk_aligned_read() split bios more cleanly.
md/raid10: simplify handle_read_error()
md/raid10: simplify the splitting of requests.
md/raid1: factor out flush_bio_list()
md/raid1: simplify handle_read_error().
Revert "block: introduce bio_copy_data_partial"
md/raid1: simplify alloc_behind_master_bio()
...

Linus Torvalds
2017-05-04 01:05:38 +0800

03 May, 2017

2 commits

7a148c2fc block: don't call blk_mq_quiesce_queue() after queue is frozen ... Browse Code »

After queue is frozen, no request in this queue can be in use at all, so
there can't be any .queue_rq() running on this queue. It isn't
necessary to call blk_mq_quiesce_queue() any more, so remove it in both
elevator_switch_mq() and blk_mq_update_nr_requests().

Cc: Bart Van Assche
Signed-off-by: Ming Lei

Fixed up the description a bit.

Signed-off-by: Jens Axboe

Ming Lei
2017-05-03 01:33:08 +0800
c58d4055c Merge tag 'docs-4.12' of git://git.lwn.net/linux ... Browse Code »

Pull documentation update from Jonathan Corbet:
"A reasonably busy cycle for documentation this time around. There is a
new guide for user-space API documents, rather sparsely populated at
the moment, but it's a start. Markus improved the infrastructure for
converting diagrams. Mauro has converted much of the USB documentation
over to RST. Plus the usual set of fixes, improvements, and tweaks.

There's a bit more than the usual amount of reaching out of
Documentation/ to fix comments elsewhere in the tree; I have acks for
those where I could get them"

* tag 'docs-4.12' of git://git.lwn.net/linux: (74 commits)
docs: Fix a couple typos
docs: Fix a spelling error in vfio-mediated-device.txt
docs: Fix a spelling error in ioctl-number.txt
MAINTAINERS: update file entry for HSI subsystem
Documentation: allow installing man pages to a user defined directory
Doc/PM: Sync with intel_powerclamp code behavior
zr364xx.rst: usb/devices is now at /sys/kernel/debug/
usb.rst: move documentation from proc_usb_info.txt to USB ReST book
convert philips.txt to ReST and add to media docs
docs-rst: usb: update old usbfs-related documentation
arm: Documentation: update a path name
docs: process/4.Coding.rst: Fix a couple of document refs
docs-rst: fix usb cross-references
usb: gadget.h: be consistent at kernel doc macros
usb: composite.h: fix two warnings when building docs
usb: get rid of some ReST doc build errors
usb.rst: get rid of some Sphinx errors
usb/URB.txt: convert to ReST and update it
usb/persist.txt: convert to ReST and add to driver-api book
usb/hotplug.txt: convert to ReST and add to driver-api book
...

Linus Torvalds
2017-05-03 01:21:17 +0800

02 May, 2017

6 commits

d6296d39e blk-mq: update ->init_request and ->exit_request prototypes ... Browse Code »

Remove the request_idx parameter, which can't be used safely now that we
support I/O schedulers with blk-mq. Except for a superflous check in
mtip32xx it was unused anyway.

Also pass the tag_set instead of just the driver data - this allows drivers
to avoid some code duplication in a follow on cleanup.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-05-02 21:52:08 +0800
9f2779bff blk-mq-sched: remove hack that bypasses scheduler for reserved requests ... Browse Code »

We have update the troublesome driver (mtip32xx) to deal with this
appropriately. So kill the hack that bypassed scheduler allocation
and insertion for reserved requests.

Reviewed-by: Ming Lei
Reviewed-by: Christoph Hellwig
Tested-by: Ming Lei
Signed-off-by: Jens Axboe

Jens Axboe
2017-05-02 21:52:08 +0800
c03326949 block: Remove elevator_change() ... Browse Code »

Since commit 84253394927c ("remove the mg_disk driver") removed the
only caller of elevator_change(), also remove the elevator_change()
function itself.

Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Markus Trippelsdorf
Signed-off-by: Jens Axboe

Bart Van Assche
2017-05-02 21:52:08 +0800
5db6db0d4 Merge branch 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull uaccess unification updates from Al Viro:
"This is the uaccess unification pile. It's _not_ the end of uaccess
work, but the next batch of that will go into the next cycle. This one
mostly takes copy_from_user() and friends out of arch/* and gets the
zero-padding behaviour in sync for all architectures.

Dealing with the nocache/writethrough mess is for the next cycle;
fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
sold on access_ok() in there, BTW; just not in this pile), same for
reducing __copy_... callsites, strn*... stuff, etc. - there will be a
pile about as large as this one in the next merge window.

This one sat in -next for weeks. -3KLoC"

* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
HAVE_ARCH_HARDENED_USERCOPY is unconditional now
CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
m32r: switch to RAW_COPY_USER
hexagon: switch to RAW_COPY_USER
microblaze: switch to RAW_COPY_USER
get rid of padding, switch to RAW_COPY_USER
ia64: get rid of copy_in_user()
ia64: sanitize __access_ok()
ia64: get rid of 'segment' argument of __do_{get,put}_user()
ia64: get rid of 'segment' argument of __{get,put}_user_check()
ia64: add extable.h
powerpc: get rid of zeroing, switch to RAW_COPY_USER
esas2r: don't open-code memdup_user()
alpha: fix stack smashing in old_adjtimex(2)
don't open-code kernel_setsockopt()
mips: switch to RAW_COPY_USER
mips: get rid of tail-zeroing in primitives
mips: make copy_from_user() zero tail explicitly
mips: clean and reorder the forest of macros...
mips: consolidate __invoke_... wrappers
...

Linus Torvalds
2017-05-02 05:41:04 +0800
e265eb3a3 Merge branch 'md-next' into md-linus Browse Code »

Shaohua Li
2017-05-02 05:09:21 +0800
694752922 Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block layer updates from Jens Axboe:

- Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
was initially a fork of CFQ, but subsequently changed to implement
fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
to be used on desktop type single drives, providing good fairness.
From Paolo.

- Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
using a scalable token based algorithm that throttles IO based on
live completion IO stats, similary to blk-wbt. From Omar.

- A series from Jan, moving users to separately allocated backing
devices. This continues the work of separating backing device life
times, solving various problems with hot removal.

- A series of updates for lightnvm, mostly from Javier. Includes a
'pblk' target that exposes an open channel SSD as a physical block
device.

- A series of fixes and improvements for nbd from Josef.

- A series from Omar, removing queue sharing between devices on mostly
legacy drivers. This helps us clean up other bits, if we know that a
queue only has a single device backing. This has been overdue for
more than a decade.

- Fixes for the blk-stats, and improvements to unify the stats and user
windows. This both improves blk-wbt, and enables other users to
register a need to receive IO stats for a device. From Omar.

- blk-throttle improvements from Shaohua. This provides a scalable
framework for implementing scalable priotization - particularly for
blk-mq, but applicable to any type of block device. The interface is
marked experimental for now.

- Bucketized IO stats for IO polling from Stephen Bates. This improves
efficiency of polled workloads in the presence of mixed block size
IO.

- A few fixes for opal, from Scott.

- A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
From a variety of folks, mostly Sagi and James Smart.

- A series from Bart, improving our exposed info and capabilities from
the blk-mq debugfs support.

- A series from Christoph, cleaning up how handle WRITE_ZEROES.

- A series from Christoph, cleaning up the block layer handling of how
we track errors in a request. On top of being a nice cleanup, it also
shrinks the size of struct request a bit.

- Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
never used by platforms, and the latter has outlived it's usefulness.

- Various little bug fixes and cleanups from a wide variety of folks.

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
block: hide badblocks attribute by default
blk-mq: unify hctx delay_work and run_work
block: add kblock_mod_delayed_work_on()
blk-mq: unify hctx delayed_run_work and run_work
nbd: fix use after free on module unload
MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
blk-mq-sched: alloate reserved tags out of normal pool
mtip32xx: use runtime tag to initialize command header
scsi: Implement blk_mq_ops.show_rq()
blk-mq: Add blk_mq_ops.show_rq()
blk-mq: Show operation, cmd_flags and rq_flags names
blk-mq: Make blk_flags_show() callers append a newline character
blk-mq: Move the "state" debugfs attribute one level down
blk-mq: Unregister debugfs attributes earlier
blk-mq: Only unregister hctxs for which registration succeeded
blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
blk-mq: Let blk_mq_debugfs_register() look up the queue name
blk-mq: Register /queue/mq after having registered /queue
ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
ide-pm: always pass 0 error to __blk_end_request_all
..

Linus Torvalds
2017-05-02 01:39:57 +0800

28 Apr, 2017

4 commits

9438b3e08 block: hide badblocks attribute by default ... Browse Code »

Commit 99e6608c9e74 "block: Add badblock management for gendisks"
allowed for drivers like pmem and software-raid to advertise a list of
bad media areas. However, it inadvertently added a 'badblocks' to all
block devices. Lets clean this up by having the 'badblocks' attribute
not be visible when the driver has not populated a 'struct badblocks'
instance in the gendisk.

Cc: Jens Axboe
Cc: Christoph Hellwig
Cc: Martin K. Petersen
Reported-by: Vishal Verma
Signed-off-by: Dan Williams
Tested-by: Vishal Verma
Signed-off-by: Jens Axboe

Dan Williams
2017-04-28 22:26:42 +0800
21c6e939a blk-mq: unify hctx delay_work and run_work ... Browse Code »

The only difference between ->run_work and ->delay_work, is that
the latter is used to defer running a queue. This is done by
marking the queue stopped, and scheduling ->delay_work to run
sometime in the future. While the queue is stopped, direct runs
or runs through ->run_work will not run the queue.

If we combine the handlers, then we need to handle two things:

1) If a delayed/stopped run is scheduled, then we should not run
the queue before that has been completed.
2) If a queue is delayed/stopped, the handler needs to restart
the queue. Normally a run of a queue with the stopped bit set
would be a no-op.

Case 1 is handled by modifying a currently pending queue run
to the deadline set by the caller of blk_mq_delay_queue().
Subsequent attempts to queue a queue run will find the work
item already pending, and direct runs will see a stopped queue
as before.

Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
that tells the work handler that it should clear a stopped
queue and run the handler.

Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe

Jens Axboe
2017-04-28 22:11:43 +0800
818cd1cba block: add kblock_mod_delayed_work_on() ... Browse Code »

This modifies (or adds, if not currently pending) an existing
delayed work item.

Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe

Jens Axboe
2017-04-28 22:10:15 +0800
9f9937379 blk-mq: unify hctx delayed_run_work and run_work ... Browse Code »

They serve the exact same purpose. Get rid of the non-delayed
work variant, and just run it without delay for the normal case.

Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe

Jens Axboe
2017-04-28 22:10:15 +0800

27 Apr, 2017

1 commit

339318080 blk-mq-sched: alloate reserved tags out of normal pool ... Browse Code »

At least one driver, mtip32xx, has a hard coded dependency on
the value of the reserved tag used for internal commands. While
that should really be fixed up, for now let's ensure that we just
bypass the scheduler tags an allocation marked as reserved. They
are used for house keeping or error handling, so we can safely
ignore them in the scheduler.

Tested-by: Ming Lei
Signed-off-by: Jens Axboe

Jens Axboe
2017-04-27 21:45:46 +0800