17 Jun, 2017

1 commit

  • commit 223220356d5ebc05ead9a8d697abb0c0a906fc81 upstream.

    The code in block/partitions/msdos.c recognizes FreeBSD, OpenBSD
    and NetBSD partitions and does a reasonable job picking out OpenBSD
    and NetBSD UFS subpartitions.

    But for FreeBSD the subpartitions are always "bad".

    Kernel:
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Richard
     

14 Jun, 2017

1 commit

  • commit 5be6b75610cefd1e21b98a218211922c2feb6e08 upstream.

    When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
    as the delay of cfq_group's vdisktime if there have been other cfq_groups
    already.

    When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
    from jiffies to nanoseconds") could result in a large iops delay and
    lead to an abnormal io schedule delay for the added cfq_group. To fix
    it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
    when iops mode is enabled.

    Despite having the same value, the delay of a cfq_queue in idle class
    and the delay of cfq_group are different things, so I define two new
    macros for the delay of a cfq_group under time-slice mode and iops mode.
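
    In code terms the idea is roughly the following (a sketch reconstructed
    from the description above; the macro and helper names are illustrative,
    not necessarily the ones used in the patch):

    /* delay applied to a newly added cfq_group's vdisktime (sketch) */
    #define CFQ_SLICE_MODE_GROUP_DELAY   (NSEC_PER_SEC / 5)  /* time-slice mode, ns */
    #define CFQ_IOPS_MODE_GROUP_DELAY    (HZ / 5)            /* iops mode, old value */

    static u64 cfq_group_vdisktime_delay(struct cfq_data *cfqd)
    {
            /* iops_mode() already exists in cfq-iosched.c */
            return iops_mode(cfqd) ? CFQ_IOPS_MODE_GROUP_DELAY
                                   : CFQ_SLICE_MODE_GROUP_DELAY;
    }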

    Fixes: 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds")
    Signed-off-by: Hou Tao
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     

20 May, 2017

1 commit

  • commit 2859323e35ab5fc42f351fbda23ab544eaa85945 upstream.

    When registering an integrity profile: if the template's interval_exp is
    not 0 use it, otherwise use the ilog2() of logical block size of the
    provided gendisk.

    This fixes a long-standing DM linear target bug where it cannot pass
    integrity data to the underlying device if its logical block size
    conflicts with the underlying device's logical block size.
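
    Concretely, the registration logic described above boils down to something
    like this (a sketch, not the verbatim patch):

    /* bi is the integrity profile being filled in for this gendisk */
    bi->interval_exp = template->interval_exp ?
                       template->interval_exp :
                       ilog2(queue_logical_block_size(disk->queue));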

    Reported-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

14 May, 2017

1 commit

  • commit 19b7ccf8651df09d274671b53039c672a52ad84d upstream.

    Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    introduced blk_integrity_revalidate(), which seems to assume ownership
    of the stable pages flag and unilaterally clears it if no blk_integrity
    profile is registered:

    if (bi->profile)
            disk->queue->backing_dev_info->capabilities |=
                    BDI_CAP_STABLE_WRITES;
    else
            disk->queue->backing_dev_info->capabilities &=
                    ~BDI_CAP_STABLE_WRITES;

    It's called from revalidate_disk() and rescan_partitions(), making it
    impossible to enable stable pages for drivers that support partitions
    and don't use blk_integrity: while the call in revalidate_disk() can be
    trivially worked around (see zram, which doesn't support partitions and
    hence gets away with zram_revalidate_disk()), rescan_partitions() can
    be triggered from userspace at any time. This breaks rbd, where the
    ceph messenger is responsible for generating/verifying CRCs.

    Since blk_integrity_{un,}register() "must" be used for (un)registering
    the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
    setting there. This way drivers that call blk_integrity_register() and
    use integrity infrastructure won't interfere with drivers that don't
    but still want stable pages.
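
    A sketch of the resulting behaviour (the helper name is hypothetical; the
    member access follows the upstream snippet quoted above, while the
    backport note below uses the embedded bdi):

    /* Only the integrity (un)registration paths touch the flag now. */
    static void blk_integrity_set_stable_pages(struct gendisk *disk, bool on)
    {
            if (on)         /* called from blk_integrity_register() */
                    disk->queue->backing_dev_info->capabilities |=
                            BDI_CAP_STABLE_WRITES;
            else            /* called from blk_integrity_unregister() */
                    disk->queue->backing_dev_info->capabilities &=
                            ~BDI_CAP_STABLE_WRITES;
    }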

    Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Cc: Mike Snitzer
    Tested-by: Dan Williams
    Signed-off-by: Ilya Dryomov
    [idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue]
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

18 Apr, 2017

1 commit

  • commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 upstream.

    While stressing memory and IO at the same time we changed SMT settings,
    we were able to consistently trigger deadlocks in the mm system, which
    froze the entire machine.

    I think that under memory stress conditions, the large allocations
    performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
    waiting on the block layer remapping completion, thus deadlocking the
    system. The trace below was collected after the machine stalled,
    waiting for the hotplug event completion.

    The simplest fix for this is to make allocations in this path
    non-reclaimable, with GFP_NOIO. With this patch, we couldn't hit the
    issue anymore.
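
    As a sketch, the allocations in blk_mq_init_rq_map() end up looking
    something like this (the exact flag combination shown is illustrative):

    /* must not recurse into IO-issuing reclaim from the remap path */
    tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
                             GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
                             set->numa_node);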

    This should apply on top of Jens's for-next branch cleanly.

    Changes since v1:
    - Use GFP_NOIO instead of GFP_NOWAIT.

    Call Trace:
    [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
    [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
    [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
    [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
    [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
    [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
    [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
    [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
    [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
    [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
    [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
    [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
    [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
    [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
    [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
    [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
    [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
    [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
    [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
    [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
    [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
    [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
    [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
    [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
    [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
    [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
    [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
    [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
    [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
    [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
    [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
    [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
    [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
    [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
    [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
    [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
    [c000000f0160be30] [c000000000009204] system_call+0x38/0xec

    Signed-off-by: Gabriel Krisman Bertazi
    Cc: Brian King
    Cc: Douglas Miller
    Cc: linux-block@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sumit Semwal
    Signed-off-by: Greg Kroah-Hartman

    Gabriel Krisman Bertazi
     

08 Apr, 2017

2 commits

  • commit f5fe1b51905df7cfe4fdfd85c5fb7bc5b71a094f upstream.

    Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
    changed current->bio_list so that it did not contain *all* of the
    queued bios, but only those submitted by the currently running
    make_request_fn.

    There are two places which walk the list and requeue selected bios,
    and others that check if the list is empty. These are no longer
    correct.

    So redefine current->bio_list to point to an array of two lists, which
    contain all queued bios, and adjust various code to test or walk both
    lists.
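
    A sketch of the new arrangement in generic_make_request() (reconstructed
    from the description, not the verbatim patch):

    /* [0]: bios queued by the currently running make_request_fn
     * [1]: bios that were already queued when it was entered     */
    struct bio_list bio_list_on_stack[2];

    bio_list_init(&bio_list_on_stack[0]);
    bio_list_init(&bio_list_on_stack[1]);
    current->bio_list = bio_list_on_stack;

    /* an "is anything queued?" test must now look at both lists */
    bool queued = !bio_list_empty(&current->bio_list[0]) ||
                  !bio_list_empty(&current->bio_list[1]);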

    Signed-off-by: NeilBrown
    Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
    Signed-off-by: Jens Axboe
    Cc: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 79bd99596b7305ab08109a8bf44a6a4511dbf1cd upstream.

    To avoid recursion on the kernel stack when stacked block devices
    are in use, generic_make_request() will, when called recursively,
    queue new requests for later handling. They will be handled when the
    make_request_fn for the current bio completes.

    If any bios are submitted by a make_request_fn, these will ultimately
    be handled sequentially. If the handling of one of those generates
    further requests, they will be added to the end of the queue.

    This strict first-in-first-out behaviour can lead to deadlocks in
    various ways, normally because a request might need to wait for a
    previous request to the same device to complete. This can happen when
    they share a mempool, and can happen due to interdependencies
    particular to the device. Both md and dm have examples where this happens.

    These deadlocks can be eradicated by more selective ordering of bios.
    Specifically by handling them in depth-first order. That is: when the
    handling of one bio generates one or more further bios, they are
    handled immediately after the parent, before any siblings of the
    parent. That way, when generic_make_request() calls make_request_fn
    for some particular device, we can be certain that all previously
    submitted requests for that device have been completely handled and are
    not waiting for anything in the queue of requests maintained in
    generic_make_request().

    An easy way to achieve this would be to use a last-in-first-out stack
    instead of a queue. However this will change the order of consecutive
    bios submitted by a make_request_fn, which could have unexpected consequences.
    Instead we take a slightly more complex approach.
    A fresh queue is created for each call to a make_request_fn. After it completes,
    any bios for a different device are placed on the front of the main queue, followed
    by any bios for the same device, followed by all bios that were already on
    the queue before the make_request_fn was called.
    This provides the depth-first approach without reordering bios on the same level.

    This, by itself, is not enough to remove all deadlocks. It just makes
    it possible for drivers to take the extra step required themselves.

    To avoid deadlocks, drivers must never risk waiting for a request
    after submitting one to generic_make_request. This includes never
    allocating from a mempool twice in the same call to a make_request_fn.

    A common pattern in drivers is to call bio_split() in a loop, handling
    the first part and then looping around to possibly split the next part.
    Instead, a driver that finds it needs to split a bio should queue
    (with generic_make_request) the second part, handle the first part,
    and then return. The new code in generic_make_request will ensure the
    requests to underlying bios are processed first, then the second bio
    that was split off. If it splits again, the same process happens. In
    each case one bio will be completely handled before the next one is attempted.
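
    The recommended pattern looks roughly like this (a sketch in the style
    used by md; "sectors" stands for whatever limit the driver computed, and
    fs_bio_set is used here only for illustration):

    if (sectors < bio_sectors(bio)) {
            struct bio *split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);

            bio_chain(split, bio);
            generic_make_request(bio);      /* queue the remainder, don't recurse */
            bio = split;                    /* carry on with the first part only */
    }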

    With this in place, it should be possible to disable the
    punt_bios_to_recover() recovery thread for many block devices, and
    eventually it may be possible to remove it completely.

    Ref: http://www.spinics.net/lists/raid/msg54680.html
    Tested-by: Jinpu Wang
    Inspired-by: Lars Ellenberg
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe
    Cc: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     

30 Mar, 2017

1 commit

  • commit 95a49603707d982b25d17c5b70e220a05556a2f9 upstream.

    When iterating busy requests in the timeout handler,
    if the STARTED flag of a request isn't set, that means
    the request is being processed in the block layer or driver and
    hasn't been submitted to hardware yet.

    In the current implementation of blk_mq_check_expired(),
    if the request queue becomes dying, un-started requests are
    handled as being completed/freed immediately. This is wrong,
    and can cause rq corruption or double allocation[1][2]
    when doing I/O and removing & resetting an NVMe device at the same time.

    This patch fixes several issues reported by Yi Zhang.
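
    A rough sketch of the fixed check (reconstructed from the description;
    field and helper names follow the blk-mq internals of that era):

    static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
                                     struct request *rq, void *priv, bool reserved)
    {
            struct blk_mq_timeout_data *data = priv;

            /* not started: still owned by the block layer/driver, so skip it
             * even if the queue is dying */
            if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
                    return;

            if (time_after_eq(jiffies, rq->deadline))
                    blk_mq_rq_timed_out(rq, reserved);
            else if (!data->next_set || time_after(data->next, rq->deadline)) {
                    data->next = rq->deadline;
                    data->next_set = 1;
            }
    }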

    [1]. oops log 1
    [ 581.789754] ------------[ cut here ]------------
    [ 581.789758] kernel BUG at block/blk-mq.c:374!
    [ 581.789760] invalid opcode: 0000 [#1] SMP
    [ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
    edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
    irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
    intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
    intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
    pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
    sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
    syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
    crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
    dm_region_hash dm_log dm_mod
    [ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
    [ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
    [ 581.789804] Workqueue: kblockd blk_mq_timeout_work
    [ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
    [ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
    [ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
    [ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
    [ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
    [ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
    [ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
    [ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
    [ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
    [ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
    [ 581.789820] Call Trace:
    [ 581.789825] blk_mq_check_expired+0x76/0x80
    [ 581.789828] bt_iter+0x45/0x50
    [ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0
    [ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70
    [ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70
    [ 581.789840] ? __switch_to+0x140/0x450
    [ 581.789841] blk_mq_timeout_work+0x88/0x170
    [ 581.789845] process_one_work+0x165/0x410
    [ 581.789847] worker_thread+0x137/0x4c0
    [ 581.789851] kthread+0x101/0x140
    [ 581.789853] ? rescuer_thread+0x3b0/0x3b0
    [ 581.789855] ? kthread_park+0x90/0x90
    [ 581.789860] ret_from_fork+0x2c/0x40
    [ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
    8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3
    0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
    [ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
    [ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

    [2]. oops log2
    [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
    [ 6984.857373] PGD 0
    [ 6984.857374]
    [ 6984.857376] Oops: 0000 [#1] SMP
    [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
    edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
    irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
    iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
    mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
    acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
    ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
    sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
    libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
    dm_region_hash dm_log dm_mod
    [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
    4.10.0-2.el7.bz1420297.x86_64 #1
    [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
    [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
    [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
    [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
    [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
    [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
    [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
    [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
    [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
    [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
    [ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
    [ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
    [ 6984.857443] Call Trace:
    [ 6984.857451] ? mempool_free+0x2b/0x80
    [ 6984.857455] ? bio_free+0x4e/0x60
    [ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230
    [ 6984.857462] blk_mq_process_rq_list+0x133/0x170
    [ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0
    [ 6984.857467] blk_mq_run_work_fn+0x12/0x20
    [ 6984.857473] process_one_work+0x165/0x410
    [ 6984.857475] worker_thread+0x137/0x4c0
    [ 6984.857478] kthread+0x101/0x140
    [ 6984.857480] ? rescuer_thread+0x3b0/0x3b0
    [ 6984.857481] ? kthread_park+0x90/0x90
    [ 6984.857489] ret_from_fork+0x2c/0x40
    [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
    89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff
    8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
    [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
    [ 6984.857512] CR2: 0000000000000010
    [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

22 Mar, 2017

1 commit

  • [ Upstream commit 25cdb64510644f3e854d502d69c73f21c6df88a9 ]

    The WRITE_SAME commands are not present in the blk_default_cmd_filter
    write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
    is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
    [ sg_io() -> blk_fill_sghdr_rq() -> blk_verify_command() -> -EPERM ]

    The problem can be reproduced with the sg_write_same command

    # sg_write_same --num 1 --xferlen 512 /dev/sda
    #

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    Write same: pass through os error: Operation not permitted
    #

    For comparison, the WRITE_VERIFY command does not observe this problem,
    since it is in that list:

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
    #

    So, this patch adds the WRITE_SAME commands to the list, in order
    for the SG_IO ioctl to finish successfully:

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    #
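
    The change itself is small; roughly (a sketch, and the exact set of
    opcodes added upstream may differ), the default filter gains:

    /* in blk_set_cmd_filter_defaults(): allow WRITE SAME on write-opened fds */
    __set_bit(WRITE_SAME, filter->write_ok);
    __set_bit(WRITE_SAME_16, filter->write_ok);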

    That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
    (qemu "-device scsi-block" [1], libvirt "" [2]),
    which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).

    In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
    which are translated to write-same calls in the guest kernel, and then into
    SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:

    [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
    [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
    [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
    [...] blk_update_request: I/O error, dev sda, sector 17096824

    Links:
    [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
    [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')

    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Brahadambal Srinivasan
    Reported-by: Manjunatha H R
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
     

20 Jan, 2017

2 commits

  • commit c02ebfdddbafa9a6a0f52fbd715e6bfa229af9d3 upstream.

    Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the
    wrong CPU") attempts to avoid triggering the WARN_ON in
    __blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the
    last batch execution before round robin, blk_mq_hctx_next_cpu can
    schedule a dead CPU and also update next_cpu to the next alive CPU in
    the mask, which will trigger the WARN_ON despite the previous
    workaround.

    The following patch fixes this scenario by always scheduling the value
    in hctx->next_cpu. This changes the moment when we round-robin the CPU
    running the hctx, but it really doesn't matter, since it still executes
    BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.
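
    After the fix the helper reads roughly as follows (a sketch reconstructed
    from the description above):

    static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
    {
            if (hctx->queue->nr_hw_queues == 1)
                    return WORK_CPU_UNBOUND;

            if (--hctx->next_cpu_batch <= 0) {
                    int next_cpu;

                    next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
                    if (next_cpu >= nr_cpu_ids)
                            next_cpu = cpumask_first(hctx->cpumask);

                    hctx->next_cpu = next_cpu;
                    hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
            }

            /* always schedule on next_cpu, never on the pre-advance value */
            return hctx->next_cpu;
    }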

    Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU")
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Gabriel Krisman Bertazi
     
  • commit ebc4ff661fbe76781c6b16dfb7b754a5d5073f8e upstream.

    cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
    incorrectly hard coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.
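
    The fix is essentially a one-liner (sketch):

    static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
    {
            struct cfq_group_data *cgd;

            cgd = kzalloc(sizeof(*cgd), gfp);       /* was: GFP_KERNEL */
            if (!cgd)
                    return NULL;
            return &cgd->cpd;
    }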

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

09 Jan, 2017

1 commit

  • commit 128394eff343fc6d2f32172f03e24829539c5835 upstream.

    Both damn things interpret userland pointers embedded into the payload;
    worse, they are actually traversing those. Leaving aside the bad
    API design, this is very much _not_ safe to call with KERNEL_DS.
    Bail out early if that happens.
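
    The bail-out is a simple check at the top of the two write paths in
    question (a sketch; the addressing-mode test shown is the pre-4.11 idiom):

    /* refuse to parse/traverse user pointers while running under KERNEL_DS */
    if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
            return -EINVAL;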

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

06 Jan, 2017

1 commit

  • commit bc27c01b5c46d3bfec42c96537c7a3fae0bb2cc4 upstream.

    The meaning of the BLK_MQ_S_STOPPED flag is "do not call
    .queue_rq()". Hence modify blk_mq_make_request() such that requests
    are queued instead of issued if a queue has been stopped.
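
    In sketch form (the surrounding call names are illustrative of that era,
    not the verbatim patch):

    if (test_bit(BLK_MQ_S_STOPPED, &hctx->state)) {
            /* do not call .queue_rq(); park the request instead */
            blk_mq_insert_request(rq, false, false, false);
            return cookie;
    }
    /* otherwise the normal direct-issue fast path may run */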

    Reported-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

27 Oct, 2016

1 commit

  • If we end up sleeping due to running out of requests, we should
    update the hardware and software queues in the map ctx structure.
    Otherwise we could end up having rq->mq_ctx point to the pre-sleep
    context, and risk corrupting ctx->rq_list since we'll be
    grabbing the wrong lock when inserting the request.
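
    A sketch of what that update amounts to (reconstructed; names follow the
    blk-mq internals of that era):

    /* after sleeping for a tag we may have moved CPUs: re-derive the
     * software/hardware queue pair before storing it in the request */
    data->ctx  = blk_mq_get_ctx(q);
    data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
    rq->mq_ctx = data->ctx;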

    Reported-by: Dave Jones
    Reported-by: Chris Mason
    Tested-by: Chris Mason
    Fixes: 63581af3f31e ("blk-mq: remove non-blocking pass in blk_mq_map_request")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Oct, 2016

2 commits

  • When badblocks_set acknowledges a range or badblocks_clear clears a range,
    it's possible all badblocks are acknowledged. We should update
    unacked_exist if this occurs.
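
    A sketch of the kind of helper this implies (the name is hypothetical):

    /* drop unacked_exist once no stored entry is unacknowledged */
    static void badblocks_update_acked(struct badblocks *bb)
    {
            u64 *p = bb->page;
            int i;

            if (!bb->unacked_exist)
                    return;

            for (i = 0; i < bb->count; i++)
                    if (!BB_ACK(p[i]))
                            return;         /* still something unacked */

            bb->unacked_exist = 0;
    }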

    Signed-off-by: Shaohua Li
    Reviewed-by: Tomasz Majchrzak
    Tested-by: Tomasz Majchrzak
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Pull block fixes from Jens Axboe:
    "A set of fixes that missed the merge window, mostly due to me being
    away around that time.

    Nothing major here, a mix of nvme cleanups and fixes, and one fix for
    the badblocks handling"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvmet: use symbolic constants for CNS values
    nvme: use symbolic constants for CNS values
    nvme.h: add an enum for cns values
    nvme.h: don't use uuid_be
    nvme.h: resync with nvme-cli
    nvme: Add tertiary number to NVME_VS
    nvme : Add sysfs entry for NVMe CMBs when appropriate
    nvme: don't schedule multiple resets
    nvme: Delete created IO queues on reset
    nvme: Stop probing a removed device
    badblocks: fix overlapping check for clearing

    Linus Torvalds
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much possible uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

15 Oct, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - tracepoints for basic cgroup management operations added

    - kernfs and cgroup path formatting functions updated to behave in the
    style of strlcpy()

    - non-critical bug fixes

    * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
    cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
    cpuset: fix error handling regression in proc_cpuset_show()
    cgroup: add tracepoints for basic operations
    cgroup: make cgroup_path() and friends behave in the style of strlcpy()
    kernfs: remove kernfs_path_len()
    kernfs: make kernfs_path*() behave in the style of strlcpy()
    kernfs: add dummy implementation of kernfs_path_from_node()

    Linus Torvalds
     

12 Oct, 2016

3 commits

  • The current bad block clear implementation assumes the range to clear
    overlaps with at least one bad block already stored. If the given range to
    clear precedes the first bad block in the list, the first entry is incorrectly
    updated.

    Check not only if stored block end is past clear block end but also if
    stored block start is before clear block end.
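
    In other words, the clear path should only touch the stored entry when it
    genuinely overlaps the range being cleared (a sketch, using the existing
    BB_OFFSET/BB_END helpers):

    /* s .. target is the range being cleared; p[lo] is the stored entry */
    if (BB_OFFSET(p[lo]) < target && BB_END(p[lo]) > target) {
            /* entry straddles the clear end: keep only the tail past 'target' */
    }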

    Signed-off-by: Tomasz Majchrzak
    Acked-by: NeilBrown
    Signed-off-by: Jens Axboe

    Tomasz Majchrzak
     
  • Make sure that the offset and length arguments that we're using to
    construct WRITE SAME and DISCARD requests are actually aligned to the
    logical block size. Failure to do this causes other errors in other parts
    of the block layer or the SCSI layer because disks don't support partial
    logical block writes.
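
    The check is a simple mask test against the logical block size (sketch):

    /* start sector and length are in 512-byte units */
    sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;

    if ((sector | nr_sects) & bs_mask)
            return -EINVAL;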

    Link: http://lkml.kernel.org/r/147518379026.22791.4437508871355153928.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer # tweaked header
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • Patch series "fallocate for block devices", v11.

    This is a patchset to fix page cache coherency with BLKZEROOUT and
    implement fallocate for block devices.

    The first patch is a fix to the existing BLKZEROOUT ioctl to invalidate
    the page cache if the zeroing command to the underlying device succeeds.
    Without this patch we still have the pagecache coherence bug that's been
    in the kernel forever.

    The second patch changes the internal block device functions to reject
    attempts to discard or zeroout that are not aligned to the logical block
    size. Previously, we only checked that the start/len parameters were
    512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA
    devices.

    The third patch creates an fallocate handler for block devices, wires up
    the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects
    FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent
    fallocate interface between files and block devices. It also allows the
    combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard.

    Test cases for the new block device fallocate are now in xfstests as
    generic/349-351.

    This patch (of 3):

    Invalidate the page cache (as a regular O_DIRECT write would do) to avoid
    returning stale cache contents at a later time.
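
    Sketched against the BLKZEROOUT ioctl path (not the verbatim patch):

    /* invalidate cached pages over the zeroed range, dirty pages included */
    struct address_space *mapping = bdev->bd_inode->i_mapping;

    truncate_inode_pages_range(mapping, start, end);
    return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL, false);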

    Link: http://lkml.kernel.org/r/147518378313.22791.16649519283678515021.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

11 Oct, 2016

1 commit

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provide some level of
    latent entropy.
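
    A hypothetical usage example (not taken from the patch) of the attribute
    on a function and on a variable:

    /* the plugin instruments this init function for control-flow entropy */
    static int __init __latent_entropy example_init(void)
    {
            return 0;
    }

    /* an integer array marked this way is seeded with random contents */
    static u64 __latent_entropy example_pool[4];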

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     

10 Oct, 2016

2 commits

  • Pull blk-mq CPU hotplug update from Jens Axboe:
    "This is the conversion of blk-mq to the new hotplug state machine"

    * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
    blk-mq: fixup "Convert to new hotplug state machine"
    blk-mq: Convert to new hotplug state machine
    blk-mq/cpu-notif: Convert to new hotplug state machine

    Linus Torvalds
     
  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block layer changes in 4.9.

    As mentioned at the last merge window, I've changed things up and now
    do just one branch for core block layer changes, and driver changes.
    This avoids dependencies between the two branches. Outside of this
    main pull request, there are two topical branches coming as well.

    This pull request contains:

    - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

    - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
    Followup dependency fix from Geert.

    - General fixes from Bart, Baoyou, Guoqing, and Linus W.

    - CFQ async write starvation fix from Glauber.

    - Add support for delayed kick of the requeue list, from Mike.

    - Pull out the scalable bitmap code from blk-mq-tag.c and make it
    generally available under the name of sbitmap. Only blk-mq-tag uses
    it for now, but the blk-mq scheduling bits will use it as well.
    From Omar.

    - bdev thaw error propagation from Pierre.

    - Improve the blk polling statistics, and allow the user to clear
    them. From Stephen.

    - Set of minor cleanups from Christoph in block/blk-mq.

    - Set of cleanups and optimizations from me for block/blk-mq.

    - Various nvme/nvmet/nvmeof fixes from the various folks"

    * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
    fs/block_dev.c: return the right error in thaw_bdev()
    nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
    nvme/scsi: Remove power management support
    nvmet: Make dsm number of ranges zero based
    nvmet: Use direct IO for writes
    admin-cmd: Added smart-log command support.
    nvme-fabrics: Add host_traddr options field to host infrastructure
    nvme-fabrics: revise host transport option descriptions
    nvme-fabrics: rework nvmf_get_address() for variable options
    nbd: use BLK_MQ_F_BLOCKING
    blkcg: Annotate blkg_hint correctly
    cfq: fix starvation of asynchronous writes
    blk-mq: add flag for drivers wanting blocking ->queue_rq()
    blk-mq: remove non-blocking pass in blk_mq_map_request
    blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
    block: export bio_free_pages to other modules
    lightnvm: propagate device_add() error code
    lightnvm: expose device geometry through sysfs
    lightnvm: control life of nvm_dev in driver
    blk-mq: register device instead of disk
    ...

    Linus Torvalds
     

04 Oct, 2016

1 commit

  • Pull CPU hotplug updates from Thomas Gleixner:
    "Yet another batch of cpu hotplug core updates and conversions:

    - Provide core infrastructure for multi instance drivers so the
    drivers do not have to keep custom lists.

    - Convert custom lists to the new infrastructure. The block-mq custom
    list conversion comes through the block tree and makes the diffstat
    tip over to more lines removed than added.

    - Handle unbalanced hotplug enable/disable calls more gracefully.

    - Remove the obsolete CPU_STARTING/DYING notifier support.

    - Convert another batch of notifier users.

    The relayfs changes which conflicted with the conversion have been
    shipped to me by Andrew.

    The remaining lot is targeted for 4.10 so that we finally can remove
    the rest of the notifiers"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    cpufreq: Fix up conversion to hotplug state machine
    blk/mq: Reserve hotplug states for block multiqueue
    x86/apic/uv: Convert to hotplug state machine
    s390/mm/pfault: Convert to hotplug state machine
    mips/loongson/smp: Convert to hotplug state machine
    mips/octeon/smp: Convert to hotplug state machine
    fault-injection/cpu: Convert to hotplug state machine
    padata: Convert to hotplug state machine
    cpufreq: Convert to hotplug state machine
    ACPI/processor: Convert to hotplug state machine
    virtio scsi: Convert to hotplug state machine
    oprofile/timer: Convert to hotplug state machine
    block/softirq: Convert to hotplug state machine
    lib/irq_poll: Convert to hotplug state machine
    x86/microcode: Convert to hotplug state machine
    sh/SH-X3 SMP: Convert to hotplug state machine
    ia64/mca: Convert to hotplug state machine
    ARM/OMAP/wakeupgen: Convert to hotplug state machine
    ARM/shmobile: Convert to hotplug state machine
    arm64/FP/SIMD: Convert to hotplug state machine
    ...

    Linus Torvalds
     

30 Sep, 2016

1 commit

  • Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
    such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch
    prevents smatch from reporting the following error:

    block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock 'mutex:&blkcg_pol_mutex'
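
    The fixed error path, in sketch form:

    cpd = pol->cpd_alloc_fn(GFP_KERNEL);
    if (!cpd)
            goto err_free_cpds;     /* no mutex_unlock() here anymore;
                                       the error label unlocks exactly once */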

    Fixes: 06b285bd1125 ("blkcg: fix blkcg_policy_data allocation bug")
    Signed-off-by: Bart Van Assche
    Cc: Tejun Heo
    Cc: # v4.2+
    Signed-off-by: Tejun Heo

    Bart Van Assche
     

24 Sep, 2016

2 commits

  • This provides the caller feedback that a given hctx is not mapped and thus
    no command can be sent on it.

    Signed-off-by: Christoph Hellwig
    Tested-by: Steve Wise
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • While debugging timeouts happening in my application workload (ScyllaDB), I have
    observed calls to open() taking a long time, ranging everywhere from 2 seconds -
    the first ones that are enough to time out my application - to more than 30
    seconds.

    The problem seems to happen because XFS may block on pending metadata updates
    under certain circumstances, and that's confirmed with the following backtrace
    taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

    Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
    high "Dispatch wait" times, in the dozens-of-seconds range, consistent with
    the open() times I saw in that run.

    Still from the blktrace output, we can, after searching a bit, identify the
    request that wasn't dispatched:

    8,0 11 152 81.092472813 804 A WM 141698288 + 8
    8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]

    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):

    8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
    8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
    8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
    8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
    8,0 1 0 81.117914755 0 m N cfq767A / resid=40
    8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
    8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.

    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.

    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.

    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.

    When the queue itself tests whether it may dispatch via cfq_may_dispatch(), it can
    bail if it sees SYNC requests in flight. An incoming request from another queue can
    also preempt it in such a situation before we have the chance to execute anything
    (as seen in the trace above).
    trace above).

    This patch sets the must_dispatch flag if we notice that we have requests
    that are already fifo_expired. This flag is always cleared after
    cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
    the queue for subsequent requests (unless they are themselves expired).

    Care is taken during preempt to still allow rt requests to preempt us
    regardless.
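
    A sketch of the two halves of the change (reconstructed from the
    description; cfq_check_fifo() and the cfqq flag helpers already exist in
    cfq-iosched.c):

    /* when picking a request: pin the queue if its fifo already expired */
    rq = cfq_check_fifo(cfqq);              /* fifo-expired request or NULL */
    if (rq)
            cfq_mark_cfqq_must_dispatch(cfqq);

    /* in the preemption decision: only RT queues may still preempt it;
     * the flag is cleared again after the dispatch round */
    if (cfq_cfqq_must_dispatch(cfqq) && !cfq_class_rt(new_cfqq))
            return false;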

    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:

    Latency histogram of xfs_buf_lock acquisition (microseconds):
    value |-------------------------------------------------- count
    0 | 11
    1 |@@@@ 161
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
    4 |@ 54
    8 | 36
    16 | 7
    32 | 0
    64 | 0
    ~
    1024 | 0
    2048 | 0
    4096 | 1
    8192 | 1
    16384 | 2
    32768 | 0
    65536 | 0
    131072 | 1
    262144 | 0
    524288 | 0

    Signed-off-by: Glauber Costa
    CC: Jens Axboe
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org

    Signed-off-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Glauber Costa
     

22 Sep, 2016

4 commits

  • Two cases:

    1) blk_mq_alloc_request() needlessly re-runs the queue, after
    calling into the tag allocation without NOWAIT set. We don't
    need to do that.

    2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with
    the async flag set to false.

    Signed-off-by: Jens Axboe
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     
  • Install the callbacks via the state machine so we can phase out the cpu
    hotplug notifiers mess.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: linux-block@vger.kernel.org
    Cc: rt@linutronix.de
    Cc: Christoph Hellwig
    Link: http://lkml.kernel.org/r/20160919212601.180033814@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Sebastian Andrzej Siewior
     
  • Replace the block-mq notifier list management with the multi instance
    facility in the cpu hotplug state machine.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: linux-block@vger.kernel.org
    Cc: rt@linutronix.de
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Thomas Gleixner
     
  • bio_free_pages was introduced in commit 1dfa0f68c040
    ("block: add a helper to free bio bounce buffer pages");
    other modules can reuse the function once it is
    exported.

    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Mike Snitzer
    Cc: Shaohua Li
    Signed-off-by: Guoqing Jiang
    Acked-by: Kent Overstreet
    Signed-off-by: Jens Axboe

    Guoqing Jiang
     

20 Sep, 2016

1 commit

  • Right now, if the slice has expired, we start a new slice. If a bio is
    queued, we keep extending the slice by the throtl_slice interval (100ms).

    This worked well as long as the pending timer function got executed within
    a few milliseconds of the scheduled time. But it looks like, with recent changes
    in the timer subsystem, the slack can be much longer depending on the expiry time
    of the scheduled timer.

    commit 500462a9de65 ("timers: Switch to a non-cascading wheel")

    This means that, by the time the timer function gets executed, it is possible the
    delay from the scheduled time is more than 100ms. That means the current code
    will conclude that the existing slice has expired and a new one needs to
    be started. The new slice will be 100ms by default, and that will not be
    sufficient to meet the rate requirement of the group given the bio size, so the
    bio will not be dispatched and we will start a new timer function to
    wait. And when that timer expires, the same process will repeat and we
    will wait again; this can easily become an infinite loop.

    Solve this issue by starting a new slice only if the throttle group is
    empty. If it is not empty, that means there should be an active slice
    going on. Ideally it should not have expired, but given the slack, it is
    possible that it has expired.
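
    In sketch form, the dispatch-time check becomes something like:

    /* start a fresh slice only when nothing is queued in this direction;
     * otherwise extend the existing (possibly overdue) slice */
    if (throtl_slice_used(tg, rw) && !tg->service_queue.nr_queued[rw])
            throtl_start_new_slice(tg, rw);
    else if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
            throtl_extend_slice(tg, rw, jiffies + throtl_slice);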

    Reported-by: Hou Tao
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal