26 Dec, 2016

1 commit

  • ktime_set(S,N) was required for the timespec storage type and is still
    useful for situations where the Seconds and Nanoseconds parts of a time
    value need to be converted. For anything where the Seconds argument is 0,
    this is pointless and can be replaced with a simple assignment.
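
    A minimal illustration of the pattern (a sketch; the variable name "kt"
    is made up for the example):

    /* Before: the Seconds argument is always 0 */
    kt = ktime_set(0, NSEC_PER_SEC / HZ);

    /* After: ktime_t is a plain s64 nanosecond count, assign directly */
    kt = NSEC_PER_SEC / HZ;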

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

24 Dec, 2016

1 commit

  • Pull final vfs updates from Al Viro:
    "Assorted cleanups and fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    sg_write()/bsg_write() is not fit to be called under KERNEL_DS
    ufs: fix function declaration for ufs_truncate_blocks
    fs: exec: apply CLOEXEC before changing dumpable task flags
    seq_file: reset iterator to first record for zero offset
    vfs: fix isize/pos/len checks for reflink & dedupe
    [iov_iter] fix iterate_all_kinds() on empty iterators
    move aio compat to fs/aio.c
    reorganize do_make_slave()
    clone_private_mount() doesn't need to touch namespace_sem
    remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()

    Linus Torvalds
     

20 Dec, 2016

1 commit

  • Partitions that are not aligned to the block size of a device may cause
    invalid I/O requests because the block layer cares only about alignment
    within the partition when building requests on partitions.

    device
    |--------4096--------|--------4096--------|--------4096--------|
    partition offset 512byte
    |-512-|--------4096--------|--------4096--------|--------4096--------|

    When reading/writing one 4k block of the partition, this maps to
    reading/writing at an offset of 512 bytes into the device, leading to
    unaligned requests for the device, which in turn may cause unexpected
    behavior of the device driver.

    For DASD devices we have to translate the block number into a cylinder,
    head, record format. The unaligned requests lead to wrong calculation
    and therefore to misdirected I/O. In a "good" case this leads to I/O
    errors because the underlying hardware detects the wrong addressing.
    In a worst case scenario this might destroy data on the device.

    To prevent partitions that are not aligned to the physical block size
    of a device, check for the alignment in the blkpg_ioctl.
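
    A minimal sketch of such a check in the BLKPG_ADD_PARTITION path of
    blkpg_ioctl() (the block-size helper used here is an assumption):

    case BLKPG_ADD_PARTITION:
            /* reject partitions whose byte offset is not block aligned */
            if (p.start & (bdev_logical_block_size(bdev) - 1))
                    return -EINVAL;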

    Signed-off-by: Stefan Haberland
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stefan Haberland
     

19 Dec, 2016

1 commit

  • The WRITE_SAME commands are not present in the blk_default_cmd_filter
    write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
    is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
    [ sg_io() -> blk_fill_sghdr_rq() -> blk_verify_command() -> -EPERM ]

    The problem can be reproduced with the sg_write_same command

    # sg_write_same --num 1 --xferlen 512 /dev/sda
    #

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    Write same: pass through os error: Operation not permitted
    #

    For comparison, the WRITE_VERIFY command does not observe this problem,
    since it is in that list:

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
    #

    So, this patch adds the WRITE_SAME commands to the list, in order
    for the SG_IO ioctl to finish successfully:

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    #
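
    The addition amounts to marking the WRITE SAME opcodes as write_ok in
    the default command filter, roughly (a sketch; placement in
    blk_set_cmd_filter_defaults() is an assumption):

    __set_bit(WRITE_SAME, filter->write_ok);
    __set_bit(WRITE_SAME_16, filter->write_ok);
    __set_bit(WRITE_SAME_32, filter->write_ok);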

    That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
    (qemu "-device scsi-block" [1], libvirt "" [2]),
    which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).

    In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
    which are translated to write-same calls in the guest kernel, and then into
    SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:

    [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
    [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
    [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
    [...] blk_update_request: I/O error, dev sda, sector 17096824

    Links:
    [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
    [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')

    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Brahadambal Srinivasan
    Reported-by: Manjunatha H R
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Mauricio Faria de Oliveira
     

15 Dec, 2016

3 commits

  • Pull block IO fixes from Jens Axboe:
    "A few fixes that I collected as post-merge.

    I was going to wait a bit with sending this out, but the O_DIRECT fix
    should really go in sooner rather than later"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    blk-mq: Fix failed allocation path when mapping queues
    blk-mq: Avoid memory reclaim when remapping queues
    block_dev: don't update file access position for sync direct IO
    nvme/pci: Log PCI_STATUS when the controller dies
    block_dev: don't test bdev->bd_contains when it is not stable

    Linus Torvalds
     
  • In blk_mq_map_swqueue, there is a memory optimization that frees the
    tags of a queue that has gone unmapped. Later, if that hctx is remapped
    after another topology change, the tags need to be reallocated.

    If this allocation fails, a simple WARN_ON triggers, but the block layer
    ends up with an active hctx without any corresponding set of tags.
    Then, any incoming IO to that hctx can trigger an Oops.

    I can reproduce it consistently by running IO, flipping CPUs on and off
    and eventually injecting a memory allocation failure in that path.

    In the fix below, if the system experiences a failed allocation of any
    hctx's tags, we remap all the ctxs of that queue to hctx_0, which
    should always keep its tags. There is a minor performance hit, since
    our mapping just got worse after the error path, but this is
    the simplest solution to handle this error path. The performance hit
    will disappear after another successful remap.

    I considered dropping the memory optimization altogether, but it
    seemed a bad trade-off to handle this very specific error case.
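
    Roughly, the error handling sketched here looks like the following in
    blk_mq_map_swqueue() (field and helper names are approximations, not
    the exact hunk):

    hctx = blk_mq_map_queue(q, i);
    if (!hctx->tags)
            hctx->tags = blk_mq_init_rq_map(set, hctx->queue_num);
    if (!hctx->tags) {
            /* allocation failed: route this ctx to hctx 0, which always
             * keeps its tags, instead of leaving a tag-less hctx active */
            q->mq_map[i] = 0;
            hctx = q->queue_hw_ctx[0];
    }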

    This should apply cleanly on top of Jens' for-next branch.

    The Oops is the one below:

    SP (3fff935ce4d0) is in userspace
    1:mon> e
    cpu 0x1: Vector: 300 (Data Access) at [c000000fe99eb110]
    pc: c0000000005e868c: __sbitmap_queue_get+0x2c/0x180
    lr: c000000000575328: __bt_get+0x48/0xd0
    sp: c000000fe99eb390
    msr: 900000010280b033
    dar: 28
    dsisr: 40000000
    current = 0xc000000fe9966800
    paca = 0xc000000007e80300 softe: 0 irq_happened: 0x01
    pid = 11035, comm = aio-stress
    Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
    (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
    1:mon> s
    [c000000fe99eb3d0] c000000000575328 __bt_get+0x48/0xd0
    [c000000fe99eb400] c000000000575838 bt_get.isra.1+0x78/0x2d0
    [c000000fe99eb480] c000000000575cb4 blk_mq_get_tag+0x44/0x100
    [c000000fe99eb4b0] c00000000056f6f4 __blk_mq_alloc_request+0x44/0x220
    [c000000fe99eb500] c000000000570050 blk_mq_map_request+0x100/0x1f0
    [c000000fe99eb580] c000000000574650 blk_mq_make_request+0xf0/0x540
    [c000000fe99eb640] c000000000561c44 generic_make_request+0x144/0x230
    [c000000fe99eb690] c000000000561e00 submit_bio+0xd0/0x200
    [c000000fe99eb740] c0000000003ef740 ext4_io_submit+0x90/0xb0
    [c000000fe99eb770] c0000000003e95d8 ext4_writepages+0x588/0xdd0
    [c000000fe99eb910] c00000000025a9f0 do_writepages+0x60/0xc0
    [c000000fe99eb940] c000000000246c88 __filemap_fdatawrite_range+0xf8/0x180
    [c000000fe99eb9e0] c000000000246f90 filemap_write_and_wait_range+0x70/0xf0
    [c000000fe99eba20] c0000000003dd844 ext4_sync_file+0x214/0x540
    [c000000fe99eba80] c000000000364718 vfs_fsync_range+0x78/0x130
    [c000000fe99ebad0] c0000000003dd46c ext4_file_write_iter+0x35c/0x430
    [c000000fe99ebb90] c00000000038c280 aio_run_iocb+0x3b0/0x450
    [c000000fe99ebce0] c00000000038dc28 do_io_submit+0x368/0x730
    [c000000fe99ebe30] c000000000009404 system_call+0x38/0xec

    Signed-off-by: Gabriel Krisman Bertazi
    Cc: Brian King
    Cc: Douglas Miller
    Cc: linux-block@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Reviewed-by: Douglas Miller
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
     
  • Pull SCSI updates from James Bottomley:
    "This update includes the usual round of major driver updates (ncr5380,
    lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas).

    There's also an assortment of minor fixes, mostly in error legs or
    other not very user visible stuff. The major change is the
    pci_alloc_irq_vectors replacement for the old pci_msix_.. calls; this
    effectively makes IRQ mapping generic for the drivers and allows
    blk_mq to use the information"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (256 commits)
    scsi: qla4xxx: switch to pci_alloc_irq_vectors
    scsi: hisi_sas: support deferred probe for v2 hw
    scsi: megaraid_sas: switch to pci_alloc_irq_vectors
    scsi: scsi_devinfo: remove synchronous ALUA for NETAPP devices
    scsi: be2iscsi: set errno on error path
    scsi: be2iscsi: set errno on error path
    scsi: hpsa: fallback to use legacy REPORT PHYS command
    scsi: scsi_dh_alua: Fix RCU annotations
    scsi: hpsa: use %phN for short hex dumps
    scsi: hisi_sas: fix free'ing in probe and remove
    scsi: isci: switch to pci_alloc_irq_vectors
    scsi: ipr: Fix runaway IRQs when falling back from MSI to LSI
    scsi: dpt_i2o: double free on error path
    scsi: cxlflash: Migrate scsi command pointer to AFU command
    scsi: cxlflash: Migrate IOARRIN specific routines to function pointers
    scsi: cxlflash: Cleanup queuecommand()
    scsi: cxlflash: Cleanup send_tmf()
    scsi: cxlflash: Remove AFU command lock
    scsi: cxlflash: Wait for active AFU commands to timeout upon tear down
    scsi: cxlflash: Remove private command pool
    ...

    Linus Torvalds
     

14 Dec, 2016

3 commits

  • While stressing memory and IO and changing SMT settings at the same time,
    we were able to consistently trigger deadlocks in the mm system, which
    froze the entire machine.

    I think that under memory stress conditions, the large allocations
    performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
    waiting on the block layer remapping completion, thus deadlocking the
    system. The trace below was collected after the machine stalled,
    waiting for the hotplug event completion.

    The simplest fix for this is to make allocations in this path
    non-reclaimable, with GFP_NOIO. With this patch, we couldn't hit the
    issue anymore.
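
    Illustratively, the allocations in blk_mq_init_rq_map() change along
    these lines (a sketch, not the exact hunks):

    tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
                             GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
                             set->numa_node);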

    This should apply on top of Jens's for-next branch cleanly.

    Changes since v1:
    - Use GFP_NOIO instead of GFP_NOWAIT.

    Call Trace:
    [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
    [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
    [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
    [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
    [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
    [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
    [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
    [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
    [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
    [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
    [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
    [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
    [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
    [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
    [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
    [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
    [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
    [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
    [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
    [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
    [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
    [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
    [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
    [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
    [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
    [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
    [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
    [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
    [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
    [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
    [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
    [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
    [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
    [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
    [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
    [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
    [c000000f0160be30] [c000000000009204] system_call+0x38/0xec

    Signed-off-by: Gabriel Krisman Bertazi
    Cc: Brian King
    Cc: Douglas Miller
    Cc: linux-block@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
     
  • Pull libata updates from Tejun Heo:

    - Adam added opt-in ATA command priority support.

    - There are machines which hide multiple nvme devices behind an ahci
    BAR. Dan Williams proposed a solution to force-switch the mode, but it
    was deemed too hackish. People are going to discuss the proper way to
    handle the situation in nvme standard meetings. For now, detect and
    warn about the situation.

    - Low level driver specific changes.

    Christoph Hellwig pipes in about the hidden nvme warning:
    "I wish that was the case. We've pretty much agreed that we'll want to
    implement it as a virtual PCIe root bridge, similar to Intels other
    'innovation' VMD that we work around that way.

    But Intel management has apparently decided that they don't want to
    spend more cycles on this now that Lenovo has an optional BIOS that
    doesn't force this broken mode anymore, and no one outside of Intel
    has enough information to implement something like this.

    So for now I guess this warning is it, until Intel reconsiders and
    spends resources on fixing up the damage their Chipset people caused"

    * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
    ahci: warn about remapped NVMe devices
    ahci-remap.h: add ahci remapping definitions
    nvme: move NVMe class code to pci_ids.h
    pata: imx: support controller modes up to PIO4
    pata: imx: add support of setting timings for PIO modes
    pata: imx: set controller PIO mode with .set_piomode callback
    pata: imx: sort headers out
    ata: set ncq_prio_enabled iff device has support
    ata: ATA Command Priority Disabled By Default
    ata: Enabling ATA Command Priorities
    block: Add iocontext priority to request
    ahci: qoriq: added ls1046a platform support

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the main block pull request this series. Contrary to previous
    release, I've kept the core and driver changes in the same branch. We
    always ended up having dependencies between the two for obvious
    reasons, so makes more sense to keep them together. That said, I'll
    probably try and keep more topical branches going forward, especially
    for cycles that end up being as busy as this one.

    The major parts of this pull request is:

    - Improved support for O_DIRECT on block devices, with a small
    private implementation instead of using the pig that is
    fs/direct-io.c. From Christoph.

    - Request completion tracking in a scalable fashion. This is utilized
    by two components in this pull, the new hybrid polling and the
    writeback queue throttling code.

    - Improved support for polling with O_DIRECT, adding a hybrid mode
    that combines pure polling with an initial sleep. From me.

    - Support for automatic throttling of writeback queues on the block
    side. This uses feedback from the device completion latencies to
    scale the queue on the block side up or down. From me.

    - Support for SMR drives in the block layer and for SD. From Hannes
    and Shaun.

    - Multi-connection support for nbd. From Josef.

    - Cleanup of request and bio flags, so we have a clear split between
    which are bio (or rq) private, and which ones are shared. From
    Christoph.

    - A set of patches from Bart, that improve how we handle queue
    stopping and starting in blk-mq.

    - Support for WRITE_ZEROES from Chaitanya.

    - Lightnvm updates from Javier/Matias.

    - Support for FC for the nvme-over-fabrics code. From James Smart.

    - A bunch of fixes from a whole slew of people, too many to name
    here"

    * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
    blk-stat: fix a few cases of missing batch flushing
    blk-flush: run the queue when inserting blk-mq flush
    elevator: make the rqhash helpers exported
    blk-mq: abstract out blk_mq_dispatch_rq_list() helper
    blk-mq: add blk_mq_start_stopped_hw_queue()
    block: improve handling of the magic discard payload
    blk-wbt: don't throttle discard or write zeroes
    nbd: use dev_err_ratelimited in io path
    nbd: reset the setup task for NBD_CLEAR_SOCK
    nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
    nvme-fabrics: Add target support for FC transport
    nvme-fabrics: Add host support for FC transport
    nvme-fabrics: Add FC transport LLDD api definitions
    nvme-fabrics: Add FC transport FC-NVME definitions
    nvme-fabrics: Add FC transport error codes to nvme.h
    Add type 0x28 NVME type code to scsi fc headers
    nvme-fabrics: patch target code in prep for FC transport support
    nvme-fabrics: set sqe.command_id in core not transports
    parser: add u64 number parser
    nvme-rdma: align to generic ib_event logging helper
    ...

    Linus Torvalds
     

13 Dec, 2016

1 commit

  • We ran into a funky issue, where someone doing 256K buffered reads saw
    128K requests at the device level. Turns out it is read-ahead capping
    the request size, since we use 128K as the default setting. This
    doesn't make a lot of sense - if someone is issuing 256K reads, they
    should see 256K reads, regardless of the read-ahead setting, if the
    underlying device can support a 256K read in a single command.

    This patch introduces a bdi hint, io_pages. This is the soft max IO
    size for the lower level; I've hooked it up to the bdev settings here.
    Read-ahead is modified to issue the maximum of the user request size,
    and the read-ahead max size, but capped to the max request size on the
    device side. The latter is done to avoid reading ahead too much, if the
    application asks for a huge read. With this patch, the kernel behaves
    like the application expects.
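
    The capping logic described above amounts to something like this in
    the read-ahead code (a sketch based on the description; variable names
    are approximations):

    unsigned long max_pages = ra->ra_pages;

    /* let a large user read through, but never past the device limit */
    if (req_size > max_pages && bdi->io_pages > max_pages)
            max_pages = min(req_size, bdi->io_pages);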

    Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
    Signed-off-by: Jens Axboe
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

09 Dec, 2016

2 commits

  • Instead of allocating a single unused biovec for discard requests, send
    them down without any payload. The driver can then add a
    "special" payload using a biovec embedded into struct request (unioned
    over other fields never used while in the driver), overloading
    the number of segments for this case.

    This has a couple of advantages:

    - we don't have to allocate the bio_vec
    - the amount of special casing for discard requests in the block
    layer is significantly reduced
    - using this same scheme for other request types is trivial,
    which will be important for implementing the new WRITE_ZEROES
    op on devices where it actually requires a payload (e.g. SCSI)
    - we can get rid of playing games with the request length, as
    we'll never touch it and completions will work just fine
    - it will allow us to support ranged discard operations in the
    future by merging non-contiguous discard bios into a single
    request
    - last but not least it removes a lot of code
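
    For illustration, a driver that still needs a payload for a discard
    would attach it via the embedded biovec roughly as follows (a sketch;
    it assumes the special_vec field and RQF_SPECIAL_PAYLOAD flag
    introduced by this change):

    req->special_vec.bv_page = virt_to_page(buf);
    req->special_vec.bv_offset = offset_in_page(buf);
    req->special_vec.bv_len = len;
    req->rq_flags |= RQF_SPECIAL_PAYLOAD;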

    This patch is the common base for my WIP series for ranged discards and to
    remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
    so it would be good to get it in quickly.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Both of these are metadata-only commands that are not issued by the
    writeback code and are not directly relevant to the writeback bandwidth.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

05 Dec, 2016

1 commit

  • Since commit e73c23ff736e ("block: add async variant of
    blkdev_issue_zeroout") messages like the following show up:

    EXT4-fs (dm-1): Delayed block allocation failed for inode 2368848 at
    logical offset 0 with max blocks 1 with error 95
    EXT4-fs (dm-1): This should not happen!! Data will be lost

    Due to the following fallthrough introduced with
    commit 2d253440b5af ("block: Define zoned block device operations"),
    generic_make_request_checks() would accept a REQ_OP_WRITE_SAME bio only
    if the block device supports "write same" *and* is a zoned one:

    switch (bio_op(bio)) {
    [...]
    case REQ_OP_WRITE_SAME:
            if (!bdev_write_same(bio->bi_bdev))
                    goto not_supported;
    case REQ_OP_ZONE_REPORT:
    case REQ_OP_ZONE_RESET:
            if (!bdev_is_zoned(bio->bi_bdev))
                    goto not_supported;
            break;
    [...]
    }

    Thus, although the bio setup as done by __blkdev_issue_write_same() from
    commit e73c23ff736e ("block: add async variant of blkdev_issue_zeroout")
    would succeed, its actual submission would not, resulting in the
    EOPNOTSUPP == 95.

    Fix this by removing the fallthrough which, due to the lack of an explicit
    comment, seems to be unintended anyway.
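
    With the fallthrough removed, the WRITE SAME case terminates on its
    own (a sketch of the resulting switch):

    case REQ_OP_WRITE_SAME:
            if (!bdev_write_same(bio->bi_bdev))
                    goto not_supported;
            break;
    case REQ_OP_ZONE_REPORT:
    case REQ_OP_ZONE_RESET:
            if (!bdev_is_zoned(bio->bi_bdev))
                    goto not_supported;
            break;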

    Fixes: e73c23ff736e ("block: add async variant of blkdev_issue_zeroout")
    Fixes: 2d253440b5af ("block: Define zoned block device operations")
    Signed-off-by: Nicolai Stange
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Nicolai Stange
     

01 Dec, 2016

4 commits

  • Factor out the common code for setting the REQ_NOMERGE flag, which is
    duplicated in several places, and make it a helper instead, req_set_nomerge().
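
    The helper presumably ends up looking something like this (a sketch
    following the duplicated call sites, not the exact merged code):

    static void req_set_nomerge(struct request_queue *q, struct request *req)
    {
            req->cmd_flags |= REQ_NOMERGE;
            if (req == q->last_merge)
                    q->last_merge = NULL;
    }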

    Signed-off-by: Ritesh Harjani

    Get rid of the inline.

    Signed-off-by: Jens Axboe

    Ritesh Harjani
     
  • This adds a new block layer operation to zero out a range of
    LBAs. This allows implementing zeroing for devices that don't use
    either discard with a predictable zero pattern or WRITE SAME of zeroes.
    The prominent example of that is NVMe with the Write Zeroes command,
    but in the future, this should also help with improving the way
    zeroing discards work. For this operation, a suitable entry is exported in
    sysfs which indicates the maximum number of bytes allowed in one
    write zeroes operation by the device.
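
    A driver advertising support would set the corresponding queue limit
    roughly like this (a sketch; the helper name and the max_zeroes_sectors
    variable are assumptions):

    blk_queue_max_write_zeroes_sectors(q, max_zeroes_sectors);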

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • Similar to __blkdev_issue_discard, this variant allows submitting
    the final bio asynchronously and chaining multiple ranges
    into a single completion.
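
    Usage would mirror the __blkdev_issue_discard pattern, roughly (a
    sketch with an assumed signature):

    struct bio *bio = NULL;
    int ret;

    ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects,
                                      GFP_KERNEL, &bio);
    if (!ret && bio) {
            ret = submit_bio_wait(bio);     /* or chain further ranges first */
            bio_put(bio);
    }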

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • Both blkdev_report_zones and blkdev_reset_zones can operate on a partition of
    a zoned block device. However, the first and last zones reported for a
    partition make sense only if the partition start sector and size are aligned
    on the device zone size. The same applies for zone reset. Resetting the first
    or the last zone of a partition straddling zones may impact neighboring
    partitions. Finally, if a partition start sector is not at the beginning of a
    sequential zone, it will be impossible to write to the first sectors of the
    partition on a host-managed device.
    Avoid all these problems and inconsistencies by ignoring partitions that are
    not zone aligned.

    Note: Even with CONFIG_BLK_DEV_ZONED disabled, bdev_is_zoned() will report the
    correct disk zoning type (host-aware, host-managed or none) but
    bdev_zone_size() will always return 0 for zoned block devices (i.e. the zone
    size is unknown). So test this as a way to ensure that a zoned block device is
    being handled as such. As a result, for a host-aware devices, unaligned zone
    partitions will be accepted with CONFIG_BLK_DEV_ZONED disabled. That is, the
    disk will be treated as a regular block device (as it should). If zoned block
    device support is enabled, only aligned partitions will be accepted.
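
    The check amounts to something like the following inside the partition
    scan loop (a sketch; "start" and "size" are the partition start sector
    and length in sectors):

    if (bdev_is_zoned(bdev) && bdev_zone_size(bdev) &&
        (!IS_ALIGNED(start, bdev_zone_size(bdev)) ||
         !IS_ALIGNED(size, bdev_zone_size(bdev))))
            continue;       /* ignore this partition */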

    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

22 Nov, 2016

3 commits

  • blkcg allocates some per-cgroup data structures with GFP_NOWAIT and,
    when that fails, falls back to operations which aren't specific to the
    cgroup. Occasional failures are expected under pressure and falling
    back to non-cgroup operation is the right thing to do.

    Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
    these expected failures end up creating a lot of noise. Add
    __GFP_NOWARN.
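
    Illustratively, one of the affected call sites becomes (a sketch):

    /* opportunistic per-cgroup allocation: failure is fine, don't warn */
    blkg = blkg_alloc(blkcg, q, GFP_NOWAIT | __GFP_NOWARN);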

    Signed-off-by: Tejun Heo
    Reported-by: Marc MERLIN
    Reported-by: Vlastimil Babka
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Some drivers often use an external bvec table, so introduce
    this helper for this case. It is always safe to access
    bio->bi_io_vec in this way for this case.

    After converting to this usage, it becomes a bit easier
    to evaluate the remaining direct accesses to bio->bi_io_vec,
    so it can help to prepare for the following multipage bvec
    support.
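
    With the helper, a driver that owns its bvec storage initializes the
    bio along these lines (a sketch; assumes the bio_init() variant that
    takes the table and its size):

    struct bio bio;
    struct bio_vec bvec;

    bio_init(&bio, &bvec, 1);       /* attach the external table up front */
    bio.bi_bdev = bdev;
    bio.bi_iter.bi_sector = sector;
    bio_add_page(&bio, page, PAGE_SIZE, 0);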

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig

    Fixed up the new O_DIRECT cases.

    Signed-off-by: Jens Axboe

    Ming Lei
     
  • If a ZBC device is partitioned and operations are performed on the partition,
    the zone information is rebased to the partition; however, the zone reset
    is not mapped from the partition to the device as other operations are.

    This causes the API (report zones / reset zone) to be unbalanced in this
    regard. Checking for the zone reset op code explicitly will balance the
    API.
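
    The fix is effectively an extra condition in the partition remapping
    path, roughly (a sketch of blk_partition_remap(); the exact code may
    differ):

    if (bdev != bdev->bd_contains &&
        (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET)) {
            struct hd_struct *p = bdev->bd_part;

            bio->bi_iter.bi_sector += p->start_sect;
            bio->bi_bdev = bdev->bd_contains;
    }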

    Signed-off-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Shaun Tancheff
     
